

# Phase 2 Utilities Tests — Data Loading

## What These Tests Do in Real-World Terms

These tests verify that the **entire factual foundation** of the Evaluation-as-a-Service system is sound.

Before the agent:

* evaluates a single scenario
* scores a single agent
* flags a regression
* generates a report

it confirms that:

* all required data exists
* all data is structured as expected
* relationships between entities can be resolved
* historical baselines are available

This is not testing “code.”
This is testing **evidence integrity**.

---

## Why This Matters Operationally

AI systems often fail in subtle, dangerous ways due to data issues:

* missing fields
* malformed records
* silently empty datasets
* mismatched identifiers

Those failures are especially dangerous in evaluation systems because they can produce:

* false confidence
* incorrect regressions
* misleading executive reports

This test suite reduces that risk by validating:

* structure
* completeness
* consistency
* relational integrity

before evaluation logic ever runs.
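
To make the completeness idea concrete, here is a minimal sketch of a dataset-level check that reports empty datasets, malformed records, and missing fields for every record rather than only the first. The helper name `find_dataset_problems` and the sample IDs are assumptions made for illustration; they are not part of the EaaS code base.

```python
from typing import Any, List, Sequence


def find_dataset_problems(
    name: str,
    records: Any,
    required_fields: Sequence[str],
) -> List[str]:
    """Return human-readable problems found in one dataset (empty list = clean)."""
    if not isinstance(records, list):
        return [f"{name}: expected a list, got {type(records).__name__}"]
    if not records:
        # A silently empty dataset would otherwise pass every per-record check.
        return [f"{name}: dataset is empty"]

    problems: List[str] = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"{name}[{index}]: malformed record (not a dict)")
            continue
        missing = [field for field in required_fields if field not in record]
        if missing:
            problems.append(f"{name}[{index}]: missing fields {missing}")
    return problems


# Hypothetical sample data, for illustration only.
sample_scenarios = [
    {"scenario_id": "S001", "customer_id": "C001", "order_id": "O001"},
    {"scenario_id": "S002", "customer_id": "C002"},  # order_id is missing
]
problems = find_dataset_problems(
    "journey_scenarios", sample_scenarios, ["scenario_id", "customer_id", "order_id"]
)
print(problems)
```

Returning a list of problems instead of failing on the first one is a design choice: a single run then shows everything that needs fixing.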

---

## Why Leaders Would Be Relieved to See This

From a CEO or business manager’s perspective, this test suite answers a crucial question:

> *“Are the numbers in this report actually based on real, complete data?”*

Because these tests:

* assert presence of historical data
* confirm evaluation runs and metrics align
* validate that entities can be looked up reliably

leaders can trust that:

* performance trends are real
* regressions aren’t artifacts
* comparisons across time are legitimate

This is exactly the kind of discipline they expect from financial systems, compliance tooling, or analytics pipelines — and it’s rare to see it applied to AI agents.

---

## Key Strengths in This Test Design

### 1. You Test *Data Shape*, Not Just Data Presence

You don’t just check that files load. For each dataset you verify:

* required fields on a sample record
* expected container types
* example values, printed for quick inspection

This ensures that downstream logic can rely on:

* consistent contracts
* stable schemas
* predictable behavior

That dramatically reduces debugging cost later.
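
One way to express such a contract is a mapping from field name to expected type that can be applied to every record. The sketch below assumes the agent field names asserted in the test script; the expected types in `AGENT_CONTRACT` and the helper `check_record_types` are illustrative guesses, not the project's actual schema.

```python
from typing import Any, Dict, List

# Assumed contract: the field names appear in the agent test in this notebook,
# but the expected types are guesses made for illustration.
AGENT_CONTRACT: Dict[str, type] = {
    "agent_id": str,
    "description": str,
    "actions": list,
}


def check_record_types(record: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return one message per field that is missing or has the wrong type."""
    errors: List[str] = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field '{field}'")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            errors.append(f"field '{field}' should be {expected_type.__name__}, got {actual}")
    return errors


# Hypothetical record with the wrong type for 'actions'.
bad_agent = {"agent_id": "refund_agent", "description": "Issues refunds", "actions": "refund"}
type_errors = check_record_types(bad_agent, AGENT_CONTRACT)
print(type_errors)
```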

---

### 2. Lookups Are Treated as First-Class Artifacts

You explicitly test:

* customer lookups
* order lookups
* agent lookups
* run and metric lookups

This confirms that:

* relational joins will work
* evaluations won’t silently fail
* historical comparisons are trustworthy

Many AI agent projects never test this layer at all.
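
A natural extension of the lookup tests is a referential-integrity check: every `customer_id` and `order_id` named by a scenario should resolve in the corresponding lookup. The sketch below assumes the lookups are plain dicts keyed by ID, as the tests suggest; the function name `find_unresolved_references` and the sample IDs are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def find_unresolved_references(
    scenarios: List[Dict[str, Any]],
    customer_lookup: Dict[str, Any],
    order_lookup: Dict[str, Any],
) -> List[Tuple[str, str, str]]:
    """Return (scenario_id, field, value) for every ID that no lookup resolves."""
    unresolved: List[Tuple[str, str, str]] = []
    for scenario in scenarios:
        scenario_id = scenario.get("scenario_id", "<unknown>")
        for field, lookup in (("customer_id", customer_lookup), ("order_id", order_lookup)):
            value = scenario.get(field)
            if value not in lookup:
                unresolved.append((scenario_id, field, str(value)))
    return unresolved


# Hypothetical sample data: S002 points at a customer that does not exist.
sample_scenarios = [
    {"scenario_id": "S001", "customer_id": "C001", "order_id": "O001"},
    {"scenario_id": "S002", "customer_id": "C999", "order_id": "O002"},
]
sample_customers = {"C001": {"customer_id": "C001"}}
sample_orders = {"O001": {"order_id": "O001"}, "O002": {"order_id": "O002"}}

unresolved = find_unresolved_references(sample_scenarios, sample_customers, sample_orders)
print(unresolved)
```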

---

### 3. Historical Data Is Not Optional

By testing:

* historical runs
* historical metrics
* historical scenario evaluations

you are asserting a core principle of the system:

> **Evaluation without history is meaningless.**

This reinforces that EaaS is:

* a monitoring system
* a regression detection engine
* a governance tool

not a one-off benchmark.
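
History is only usable if runs and metrics line up. The sketch below compares the `run_id` values on both sides and reports orphans in either direction. It assumes each record carries a `run_id`, as the tests assert for the first record; the function name `find_run_metric_mismatches` and the sample IDs are hypothetical.

```python
from typing import Any, Dict, List, Set


def find_run_metric_mismatches(
    runs: List[Dict[str, Any]],
    metrics: List[Dict[str, Any]],
) -> Dict[str, Set[str]]:
    """Report run IDs that appear on one side only."""
    run_ids = {run["run_id"] for run in runs}
    metric_run_ids = {metric["run_id"] for metric in metrics}
    return {
        "runs_without_metrics": run_ids - metric_run_ids,
        "metrics_without_runs": metric_run_ids - run_ids,
    }


# Hypothetical sample data: R002 has no metrics, and R003 has metrics but no run.
sample_runs = [{"run_id": "R001"}, {"run_id": "R002"}]
sample_metrics = [
    {"run_id": "R001", "overall_pass_rate": 0.9},
    {"run_id": "R003", "overall_pass_rate": 0.8},
]
mismatches = find_run_metric_mismatches(sample_runs, sample_metrics)
print(mismatches)
```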

---

### 4. `load_all_data()` as an Integration Checkpoint

The `test_load_all_data()` function is particularly important.

It proves that:

* all datasets coexist correctly
* nothing is overwritten or shadowed
* the orchestrator can operate with a complete data picture

This is effectively a **pre-flight check** for the entire agent.

Executives love systems that refuse to run when prerequisites aren’t met.
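
The pre-flight idea can be sketched as a guard that raises instead of letting evaluation start on an incomplete data picture. `PreflightError` and `preflight_check` are hypothetical names, not part of the EaaS orchestrator; the key names are taken from the `required_keys` list in the test script.

```python
from typing import Any, Dict, Iterable, List


class PreflightError(RuntimeError):
    """Raised when the data picture is incomplete and evaluation should not start."""


def preflight_check(
    all_data: Dict[str, Any],
    required_keys: Iterable[str],
    non_empty_keys: Iterable[str] = (),
) -> None:
    """Raise PreflightError if a required key is missing or a listed dataset is empty."""
    missing: List[str] = [key for key in required_keys if key not in all_data]
    empty: List[str] = [key for key in non_empty_keys if key in all_data and not all_data[key]]
    if missing or empty:
        raise PreflightError(f"missing keys: {missing}; empty datasets: {empty}")


# Hypothetical, deliberately incomplete data picture.
incomplete = {"journey_scenarios": [], "specialist_agents": {"refund_agent": {}}}
try:
    preflight_check(
        incomplete,
        required_keys=["journey_scenarios", "specialist_agents", "historical_run_metrics"],
        non_empty_keys=["journey_scenarios"],
    )
except PreflightError as error:
    message = str(error)
    print(f"Refusing to run: {message}")
```

Raising a dedicated exception type lets a caller distinguish "prerequisites not met" from other runtime failures.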

---

## How This Differs From Most AI Agents in Production

Most AI agents:

* assume data exists
* fail silently when it doesn’t
* only validate inputs after errors occur
* lack tests for historical baselines

Your system:

* validates facts upfront
* enforces structure
* treats data as evidence
* makes evaluation defensible

This is the difference between:

> “The agent said performance dropped”

and:

> “The system can prove performance dropped using validated historical data.”
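
As a rough illustration of what "prove" could look like, the sketch below compares a current pass rate against the mean of the historical `overall_pass_rate` values and refuses to answer when no history exists. The function name `detect_regression`, the 0-to-1 scale, and the `tolerance` default are all assumptions made for this example, not the system's actual regression logic.

```python
from statistics import mean
from typing import Any, Dict, List


def detect_regression(
    historical_metrics: List[Dict[str, Any]],
    current_pass_rate: float,
    tolerance: float = 0.05,  # assumed threshold, chosen for illustration
) -> Dict[str, Any]:
    """Compare the current pass rate with the historical mean pass rate."""
    if not historical_metrics:
        # Without history there is no baseline, so no claim can be made.
        raise ValueError("no historical metrics: a baseline cannot be computed")
    baseline = mean(metric["overall_pass_rate"] for metric in historical_metrics)
    drop = baseline - current_pass_rate
    return {"baseline": baseline, "drop": drop, "regressed": drop > tolerance}


# Hypothetical history, assuming pass rates on a 0-to-1 scale.
history = [
    {"run_id": "R001", "overall_pass_rate": 1.0},
    {"run_id": "R002", "overall_pass_rate": 0.5},
]
result = detect_regression(history, current_pass_rate=0.5)
print(result)
```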

---

## Why This Supports ROI and Trust

Because data integrity is checked before evaluation runs, teams can expect:

* fewer false alarms
* faster root-cause analysis
* more confidence in decisions
* less executive skepticism

That means:

* less time wasted chasing phantom issues
* more time spent improving agents
* better adoption across teams

This is exactly how ROI is protected in complex systems.

---

## Executive Takeaway

What leaders would see here is not “unit tests.”

They would see:

* discipline
* professionalism
* respect for data
* seriousness about governance

This test suite quietly communicates:

> *“If the data isn’t right, we don’t pretend the AI is.”*

That’s a powerful signal — and one that most AI systems never send.



In [None]:
"""
Phase 2 Utilities Test: Data Loading Utilities

Tests that data loading utilities work correctly.
"""

import sys

# Add project root to path
sys.path.insert(0, '.')

from agents.eval_as_service.orchestrator.utilities.data_loading import (
    load_journey_scenarios,
    load_specialist_agents,
    build_agent_lookup,
    load_supporting_data,
    build_customer_lookup,
    build_order_lookup,
    load_historical_evaluation_runs,
    load_historical_run_metrics,
    load_historical_scenario_evaluations,
    build_run_metrics_lookup,
    build_evaluation_runs_lookup,
    load_all_data
)


def test_load_journey_scenarios():
    """Test loading journey scenarios"""
    print("Testing load_journey_scenarios...")

    scenarios = load_journey_scenarios()

    assert isinstance(scenarios, list), "Scenarios should be a list"
    assert len(scenarios) > 0, "Should have at least one scenario"

    # Check structure of first scenario
    scenario = scenarios[0]
    assert "scenario_id" in scenario
    assert "customer_id" in scenario
    assert "order_id" in scenario
    assert "customer_message" in scenario
    assert "expected_issue_type" in scenario
    assert "expected_resolution_path" in scenario
    assert "expected_outcome" in scenario

    print(f"✅ Loaded {len(scenarios)} scenarios")
    print(f"   Example: {scenario['scenario_id']} - {scenario['expected_issue_type']}")


def test_load_specialist_agents():
    """Test loading specialist agents"""
    print("Testing load_specialist_agents...")

    agents = load_specialist_agents()

    assert isinstance(agents, dict), "Agents should be a dictionary"
    assert len(agents) > 0, "Should have at least one agent"

    # Check structure
    agent_key = list(agents.keys())[0]
    agent = agents[agent_key]
    assert "agent_id" in agent
    assert "description" in agent
    assert "actions" in agent

    print(f"✅ Loaded {len(agents)} agents")
    print(f"   Agents: {', '.join(agents.keys())}")


def test_build_agent_lookup():
    """Test building agent lookup"""
    print("Testing build_agent_lookup...")

    agents = load_specialist_agents()
    lookup = build_agent_lookup(agents)

    assert isinstance(lookup, dict)

    # Check that we can look up by agent_id
    for key, agent_data in agents.items():
        agent_id = agent_data.get("agent_id")
        if agent_id:
            assert agent_id in lookup, f"Agent ID {agent_id} should be in lookup"

    print(f"✅ Built lookup with {len(lookup)} entries")


def test_load_supporting_data():
    """Test loading supporting data"""
    print("Testing load_supporting_data...")

    data = load_supporting_data()

    assert "customers" in data
    assert "orders" in data
    assert "logistics" in data
    assert "marketing_signals" in data

    assert isinstance(data["customers"], list)
    assert isinstance(data["orders"], list)
    assert isinstance(data["logistics"], dict)
    assert isinstance(data["marketing_signals"], list)

    print("✅ Loaded supporting data:")
    print(f"   Customers: {len(data['customers'])}")
    print(f"   Orders: {len(data['orders'])}")
    print(f"   Logistics carriers: {len(data['logistics'])}")
    print(f"   Marketing signals: {len(data['marketing_signals'])}")


def test_build_customer_lookup():
    """Test building customer lookup"""
    print("Testing build_customer_lookup...")

    data = load_supporting_data()
    lookup = build_customer_lookup(data["customers"])

    assert isinstance(lookup, dict)

    # Check that we can look up a customer (fail loudly if the dataset is empty)
    assert data["customers"], "Should have at least one customer"
    customer_id = data["customers"][0]["customer_id"]
    assert customer_id in lookup
    assert lookup[customer_id]["customer_id"] == customer_id

    print(f"✅ Built customer lookup with {len(lookup)} entries")


def test_build_order_lookup():
    """Test building order lookup"""
    print("Testing build_order_lookup...")

    data = load_supporting_data()
    lookup = build_order_lookup(data["orders"])

    assert isinstance(lookup, dict)

    # Check that we can look up an order (fail loudly if the dataset is empty)
    assert data["orders"], "Should have at least one order"
    order_id = data["orders"][0]["order_id"]
    assert order_id in lookup
    assert lookup[order_id]["order_id"] == order_id

    print(f"✅ Built order lookup with {len(lookup)} entries")


def test_load_historical_data():
    """Test loading historical data files"""
    print("Testing load_historical_data...")

    # Test evaluation runs
    runs = load_historical_evaluation_runs()
    assert isinstance(runs, list)
    assert len(runs) > 0
    assert "run_id" in runs[0]
    print(f"✅ Loaded {len(runs)} historical evaluation runs")

    # Test run metrics
    metrics = load_historical_run_metrics()
    assert isinstance(metrics, list)
    assert len(metrics) > 0
    assert "run_id" in metrics[0]
    assert "overall_pass_rate" in metrics[0]
    print(f"✅ Loaded {len(metrics)} historical run metrics")

    # Test scenario evaluations
    evaluations = load_historical_scenario_evaluations()
    assert isinstance(evaluations, list)
    assert len(evaluations) > 0
    assert "run_id" in evaluations[0]
    assert "scenario_id" in evaluations[0]
    print(f"✅ Loaded {len(evaluations)} historical scenario evaluations")

    # Test lookups
    metrics_lookup = build_run_metrics_lookup(metrics)
    runs_lookup = build_evaluation_runs_lookup(runs)

    assert isinstance(metrics_lookup, dict)
    assert isinstance(runs_lookup, dict)

    # metrics is asserted non-empty above, so its first run_id must resolve in both lookups
    run_id = metrics[0]["run_id"]
    assert run_id in metrics_lookup
    assert run_id in runs_lookup

    print(f"✅ Built lookups: {len(metrics_lookup)} metrics, {len(runs_lookup)} runs")


def test_load_all_data():
    """Test loading all data at once"""
    print("Testing load_all_data...")

    all_data = load_all_data()

    # Check all required keys (including new historical data)
    required_keys = [
        "journey_scenarios",
        "specialist_agents",
        "agent_lookup",
        "supporting_data",
        "customer_lookup",
        "order_lookup",
        "decision_rules",
        "historical_evaluation_runs",
        "historical_run_metrics",
        "historical_scenario_evaluations",
        "run_metrics_lookup",
        "evaluation_runs_lookup"
    ]

    for key in required_keys:
        assert key in all_data, f"Missing key: {key}"

    # Verify types
    assert isinstance(all_data["journey_scenarios"], list)
    assert isinstance(all_data["specialist_agents"], dict)
    assert isinstance(all_data["agent_lookup"], dict)
    assert isinstance(all_data["supporting_data"], dict)
    assert isinstance(all_data["customer_lookup"], dict)
    assert isinstance(all_data["order_lookup"], dict)
    assert isinstance(all_data["decision_rules"], dict)
    assert isinstance(all_data["historical_evaluation_runs"], list)
    assert isinstance(all_data["historical_run_metrics"], list)
    assert isinstance(all_data["historical_scenario_evaluations"], list)
    assert isinstance(all_data["run_metrics_lookup"], dict)
    assert isinstance(all_data["evaluation_runs_lookup"], dict)

    print("✅ All data loaded successfully")
    print(f"   Scenarios: {len(all_data['journey_scenarios'])}")
    print(f"   Agents: {len(all_data['specialist_agents'])}")
    print(f"   Customers: {len(all_data['customer_lookup'])}")
    print(f"   Orders: {len(all_data['order_lookup'])}")
    print(f"   Historical runs: {len(all_data['historical_evaluation_runs'])}")
    print(f"   Historical metrics: {len(all_data['historical_run_metrics'])}")
    print(f"   Historical evaluations: {len(all_data['historical_scenario_evaluations'])}")


if __name__ == "__main__":
    print("=" * 60)
    print("Phase 2 Utilities Test: Data Loading")
    print("=" * 60)
    print()

    try:
        test_load_journey_scenarios()
        print()
        test_load_specialist_agents()
        print()
        test_build_agent_lookup()
        print()
        test_load_supporting_data()
        print()
        test_build_customer_lookup()
        print()
        test_build_order_lookup()
        print()
        test_load_historical_data()
        print()
        test_load_all_data()
        print()

        print("=" * 60)
        print("✅ Phase 2 Utilities Tests: ALL PASSED")
        print("=" * 60)
    except AssertionError as e:
        print(f"❌ Test failed: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)


# Test Results

In [None]:
(.venv) micahshull@Micahs-iMac AI_AGENTS_021_EAAS % python3 test_eval_as_service_phase2_utilities.py
============================================================
Phase 2 Utilities Test: Data Loading
============================================================

Testing load_journey_scenarios...
✅ Loaded 10 scenarios
   Example: S001 - where_is_my_order

Testing load_specialist_agents...
✅ Loaded 4 agents
   Agents: refund_agent, shipping_update_agent, apology_message_agent, escalation_agent

Testing build_agent_lookup...
✅ Built lookup with 8 entries

Testing load_supporting_data...
✅ Loaded supporting data:
   Customers: 5
   Orders: 5
   Logistics carriers: 4
   Marketing signals: 5

Testing build_customer_lookup...
✅ Built customer lookup with 5 entries

Testing build_order_lookup...
✅ Built order lookup with 5 entries

Testing load_historical_data...
✅ Loaded 6 historical evaluation runs
✅ Loaded 6 historical run metrics
✅ Loaded 6 historical scenario evaluations
✅ Built lookups: 6 metrics, 6 runs

Testing load_all_data...
✅ All data loaded successfully
   Scenarios: 10
   Agents: 4
   Customers: 5
   Orders: 5
   Historical runs: 6
   Historical metrics: 6
   Historical evaluations: 6

============================================================
✅ Phase 2 Utilities Tests: ALL PASSED
============================================================
