


# Data Loading Node — Architecture Review

## What This Node Does in Practical Terms

The `data_loading_node` is the **handoff point between intent and execution**.

Up to this point, the system has:

* declared *what* it wants to evaluate (goal)
* defined *how* it will proceed (plan)

This node answers the next critical question:

> *“Do we have everything we need to evaluate this system responsibly?”*

It loads **all factual inputs** required for evaluation and prepares them in a form that downstream nodes can rely on without ambiguity.

---

## Why This Matters Operationally

This node establishes a crucial invariant:

> **No evaluation happens without validated, complete context.**

That invariant is enforced by:

* centralized data loading
* controlled error handling
* explicit state updates

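If an input is missing, malformed, or otherwise fails to load, the node:

* stops loading and returns no data fields
* records the failure in the `errors` list
* gives downstream steps an explicit signal to check, so the run cannot proceed silently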

This prevents one of the most dangerous failure modes in AI systems: *confidently evaluating incomplete reality*.
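A minimal sketch of how a downstream step could act on that signal. The `can_evaluate` helper, the `REQUIRED_FIELDS` subset, and the sample file name are hypothetical and only illustrate the idea; they are not part of the orchestrator shown in this notebook.

```python
from typing import Any, Dict, List

# Hypothetical subset of the fields that data_loading_node adds to state.
REQUIRED_FIELDS: List[str] = [
    "journey_scenarios",
    "specialist_agents",
    "historical_run_metrics",
]


def can_evaluate(state: Dict[str, Any]) -> bool:
    """Return True only if no errors were recorded and required data is present."""
    if state.get("errors"):
        return False
    return all(state.get(field) for field in REQUIRED_FIELDS)


# A failed load returns only an errors list, so the gate stays closed.
failed_state = {"errors": ["data_loading_node: File not found: scenarios.json"]}

# A successful load carries the data fields and an empty errors list.
ready_state = {
    "errors": [],
    "journey_scenarios": [{"scenario_id": "S1"}],
    "specialist_agents": {"billing": {"agent_id": "A1"}},
    "historical_run_metrics": [{"run_id": "R1"}],
}

print(can_evaluate(failed_state))  # False
print(can_evaluate(ready_state))   # True
```

A gate like this keeps the "no evaluation without complete context" rule in one place instead of spreading checks across every downstream node.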

---

## Why Leaders Would Be Relieved to See This

From a CEO or business manager’s perspective, this node quietly answers:

> *“Does this system check that it actually knows the facts before it judges performance?”*

Because this node:

* loads historical baselines explicitly
* prepares customer, order, and agent context
* fails loudly on missing or invalid inputs
* preserves prior state and error history

leaders can trust that:

* reports are grounded in real data
* regressions aren’t artifacts of missing context
* performance claims are defensible

This is the behavior expected of mature enterprise systems, and many AI agent prototypes skip it.

---

## Key Design Strengths

### 1. The Node Is an Orchestrator, Not a Loader

This node does **no file parsing itself**.

Instead, it:

* delegates loading to utilities
* assembles the resulting artifacts
* injects them into the orchestration state

That separation is extremely important.

It means:

* the node expresses *intent*
* the utilities express *mechanics*
* testing can occur at both levels independently

This keeps complexity controlled and evolution-safe.
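The split between intent and mechanics can be sketched on a single dataset. Everything here is an assumption made for illustration: the helper names (`load_json`, `build_lookup`, `load_customers`, `customer_loading_node`) and the `customers.json` file name are hypothetical, and the project's real `load_all_data` utility may be organized differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


# --- Utilities: the mechanics (hypothetical helpers) ---
def load_json(path: Path) -> Any:
    """Read and parse one JSON file."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def build_lookup(records: List[Dict[str, Any]], key: str) -> Dict[str, Dict[str, Any]]:
    """Index a list of records by one of their fields."""
    return {record[key]: record for record in records}


def load_customers(data_dir: str) -> Dict[str, Any]:
    """Load customers and build the matching lookup table."""
    customers = load_json(Path(data_dir) / "customers.json")
    return {
        "customers": customers,
        "customer_lookup": build_lookup(customers, "customer_id"),
    }


# --- Node: the intent (no file parsing here) ---
def customer_loading_node(state: Dict[str, Any], data_dir: str) -> Dict[str, Any]:
    errors = state.get("errors", [])
    try:
        loaded = load_customers(data_dir)
        return {"customer_lookup": loaded["customer_lookup"], "errors": errors}
    except FileNotFoundError as e:
        return {"errors": errors + [f"customer_loading_node: File not found: {e}"]}


with tempfile.TemporaryDirectory() as tmp:
    sample = [{"customer_id": "C1", "name": "Ada"}]
    (Path(tmp) / "customers.json").write_text(json.dumps(sample), encoding="utf-8")

    ok_result = customer_loading_node({"errors": []}, tmp)
    missing_result = customer_loading_node({"errors": []}, str(Path(tmp) / "missing"))

print(sorted(ok_result))       # ['customer_lookup', 'errors']
print(sorted(missing_result))  # ['errors']
```

Because the utilities know nothing about orchestration state, they can be unit-tested against files alone, while the node can be tested against its returned fields.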

---

### 2. Explicit State Construction (No Hidden Mutation)

The return value explicitly lists every field the node adds to state.

Nothing is:

* modified implicitly
* mutated in place
* hidden behind side effects

This makes the system:

* predictable
* debuggable
* auditable

Executives don’t need to understand Python to appreciate this — it mirrors how financial or compliance workflows are built.
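One way to check the "no hidden mutation" property is to snapshot the input state, call the node, and compare. This is a sketch under assumptions: `assert_no_mutation`, `pure_node`, and `mutating_node` are hypothetical names written for this example.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]


def assert_no_mutation(node: Callable[[State], State], state: State) -> State:
    """Call a node and raise AssertionError if it changed its input state."""
    snapshot = copy.deepcopy(state)
    update = node(state)
    if state != snapshot:
        raise AssertionError("node mutated its input state")
    return update


def pure_node(state: State) -> State:
    # Builds a new list, the same pattern as `errors + [...]` in the node.
    return {"errors": state.get("errors", []) + ["pure_node: example"]}


def mutating_node(state: State) -> State:
    # Anti-pattern: appends to the caller's list in place.
    state["errors"].append("mutating_node: example")
    return {"errors": state["errors"]}


pure_update = assert_no_mutation(pure_node, {"errors": []})
print(pure_update)  # {'errors': ['pure_node: example']}

try:
    assert_no_mutation(mutating_node, {"errors": []})
    mutation_detected = False
except AssertionError:
    mutation_detected = True
print(mutation_detected)  # True
```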

---

### 3. Error Handling Is Structured and Honest

Your exception handling is intentionally conservative:

* `FileNotFoundError` → missing inputs
* `json.JSONDecodeError` → malformed JSON (corrupted facts)
* generic `Exception` → any other unexpected failure

Each one:

* returns a new error list: the existing errors plus one message prefixed with `data_loading_node:`
* preserves prior context
* stops the node cleanly, returning no data fields

This is critical for evaluation systems.

It ensures that:

* errors are not swallowed
* partial state is not trusted
* failures are visible upstream

Many agents in production don't fail outright; they *degrade silently*. This node refuses to do that.
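The accumulation behavior can be shown in isolation. In this sketch the loader is injected so each branch can be triggered without touching the file system; `guarded_load`, the two failing loaders, and the sample error text are hypothetical.

```python
import json
from typing import Any, Callable, Dict

State = Dict[str, Any]


def guarded_load(state: State, loader: Callable[[], State], node_name: str) -> State:
    """Run a loader and convert failures into entries on a new errors list."""
    errors = state.get("errors", [])
    try:
        return {**loader(), "errors": errors}
    except FileNotFoundError as e:
        return {"errors": errors + [f"{node_name}: File not found: {e}"]}
    except json.JSONDecodeError as e:
        return {"errors": errors + [f"{node_name}: Invalid JSON: {e}"]}
    except Exception as e:
        return {"errors": errors + [f"{node_name}: Unexpected error: {e}"]}


def missing_file() -> State:
    raise FileNotFoundError("scenarios.json")


def bad_json() -> State:
    return json.loads("{not valid json")  # raises json.JSONDecodeError


prior = {"errors": ["planning_node: earlier warning"]}

missing_result = guarded_load(prior, missing_file, "data_loading_node")
invalid_result = guarded_load(prior, bad_json, "data_loading_node")

print(len(missing_result["errors"]))  # 2
print(prior["errors"])                # ['planning_node: earlier warning']
```

The earlier warning stays first in both results, and `prior` itself is unchanged because `errors + [...]` builds a new list.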

---

### 4. Historical Data Is Loaded at the Same Level as Live Data

This is a subtle but powerful architectural choice.

By loading:

* historical evaluation runs
* historical metrics
* historical scenario evaluations

alongside live data, the node enforces a core principle:

> **Evaluation is comparative, not absolute.**

This enables:

* baseline comparisons
* regression detection
* trend analysis
* release gating

Most AI agents evaluate “now” in isolation. Yours evaluates **in context**.
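A minimal sketch of what comparative evaluation could look like once the historical metrics are in state. The `pass_rate` field, the 0.02 tolerance, and the sample values are assumptions made up for this example; only `run_id` is known to exist in the real records.

```python
from typing import Any, Dict, List


def detect_regression(
    history: List[Dict[str, Any]],
    current_value: float,
    metric: str = "pass_rate",   # hypothetical metric name
    tolerance: float = 0.02,     # hypothetical allowed drop
) -> Dict[str, Any]:
    """Compare a current metric value against the mean of historical runs."""
    if not history:
        raise ValueError("no historical runs to compare against")
    values = [run[metric] for run in history]
    baseline = sum(values) / len(values)
    delta = current_value - baseline
    return {"baseline": baseline, "delta": delta, "regressed": delta < -tolerance}


# Made-up sample data shaped like entries in historical_run_metrics.
historical_run_metrics = [
    {"run_id": "R1", "pass_rate": 0.90},
    {"run_id": "R2", "pass_rate": 0.92},
    {"run_id": "R3", "pass_rate": 0.91},
]

report = detect_regression(historical_run_metrics, current_value=0.85)
print(report["regressed"])  # True
```

Raising on an empty history reflects the same principle as the node: without a baseline, the comparison should fail loudly instead of reporting "no regression".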

---

## How This Differs From Most Agents in Production Today

Most agent systems:

* load data lazily
* mix loading with execution
* assume files exist
* overwrite state silently
* cannot reproduce past evaluations

This node:

* centralizes all data access
* enforces preconditions
* preserves orchestration context
* makes failures explicit
* supports historical accountability

That’s the difference between an experiment and infrastructure.

---

## Why This Design Supports ROI and Accountability

Because this node:

* fails early
* preserves evidence
* ensures completeness
* supports historical comparison

teams spend:

* less time debugging weird results
* less time questioning metrics
* more time improving agents

Executives gain:

* confidence in reports
* trust in trends
* clarity around risk

That directly protects ROI.

---

## Executive Takeaway

What a leader would see here is not “a loader.”

They would see a system that says:

> *“If we don’t have the facts, we don’t pretend to evaluate performance.”*

That single property — enforced at the node level — is what makes the entire EaaS Orchestrator credible.




In [None]:
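import json
from typing import Any, Dict

# Assumes the project's own names are already in scope: the load_all_data
# utility and the EvalAsServiceOrchestratorState and
# EvalAsServiceOrchestratorConfig types.


def data_loading_node(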
    state: EvalAsServiceOrchestratorState,
    config: EvalAsServiceOrchestratorConfig
) -> Dict[str, Any]:
    """
    Data Loading Node: Orchestrate loading all data files.

    Loads:
    - Journey scenarios (test cases)
    - Specialist agents (agents to evaluate)
    - Supporting data (customers, orders, logistics, marketing)
    - Decision rules (orchestrator logic)
    - Historical data (for baseline comparison and regression detection)

    Also builds lookup dictionaries for efficient access.
    """
    errors = state.get("errors", [])
    data_dir = config.data_dir

    try:
        # Load all data using utility
        all_data = load_all_data(data_dir)

        return {
            "journey_scenarios": all_data["journey_scenarios"],
            "specialist_agents": all_data["specialist_agents"],
            "agent_lookup": all_data["agent_lookup"],
            "supporting_data": all_data["supporting_data"],
            "customer_lookup": all_data["customer_lookup"],
            "order_lookup": all_data["order_lookup"],
            "decision_rules": all_data["decision_rules"],
            "historical_evaluation_runs": all_data["historical_evaluation_runs"],
            "historical_run_metrics": all_data["historical_run_metrics"],
            "historical_scenario_evaluations": all_data["historical_scenario_evaluations"],
            "run_metrics_lookup": all_data["run_metrics_lookup"],
            "evaluation_runs_lookup": all_data["evaluation_runs_lookup"],
            "errors": errors
        }
    except FileNotFoundError as e:
        return {
            "errors": errors + [f"data_loading_node: File not found: {str(e)}"]
        }
    except json.JSONDecodeError as e:
        return {
            "errors": errors + [f"data_loading_node: Invalid JSON: {str(e)}"]
        }
    except Exception as e:
        return {
            "errors": errors + [f"data_loading_node: Unexpected error: {str(e)}"]
        }




# Phase 2 Node Test — Data Loading Node

## What This Test Validates in Real-World Terms

This test suite confirms that the **Data Loading Node functions as a proper orchestration boundary**, not just a thin wrapper around utilities.

In practice, it verifies that:

* The agent can move from *intent* (goal + plan)
* To *prepared execution* (all required data loaded, validated, and indexed)
* Without losing state or context

That transition is one of the most failure-prone moments in real systems — and you’re explicitly testing it.

---

## Why This Matters Operationally

Many AI systems fail not because logic is wrong, but because:

* nodes overwrite state
* integrations silently drop data
* utilities work in isolation but not together
* errors are swallowed or misrouted

This test ensures that:

* the node loads **all required inputs**
* no critical datasets are missing
* lookup tables are usable immediately
* prior orchestration state is preserved
* failures surface explicitly in the `errors` channel

That’s operational robustness, not just correctness.

---

## Why Leaders Would Be Relieved to See This

From a business or executive perspective, this test answers a critical question:

> *“Does the system reliably prepare itself before it starts judging performance?”*

Because this test verifies:

* presence of historical baselines
* completeness of customer, order, and agent data
* integrity of lookups
* graceful failure when inputs are invalid

leaders can trust that:

* evaluation results aren’t based on partial data
* regressions aren’t caused by missing context
* reports don’t silently degrade in quality

This is exactly the behavior they expect from production-grade systems — and almost never see from AI agents.

---

## Key Strengths in This Test Design

### 1. You Test the Node, Not Just the Utilities

You already tested the utilities in Phase 2.

This test goes one level higher and asks:

> “Does the orchestrator node correctly assemble everything?”

That’s a critical distinction.

Many systems stop at utility tests and assume orchestration will “just work.” You explicitly prove that it does.

---

### 2. Lookup Validation Is Business-Critical

You don’t just assert that lookups exist — you assert they work using **real scenario references**:

* scenario → customer
* scenario → order
* agent → agent_id
* run → metrics

This confirms that:

* relational integrity exists
* downstream evaluation logic can safely assume lookups will succeed

That dramatically reduces runtime surprises later.
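The test checks the first scenario only. A possible extension, sketched here with a hypothetical `find_dangling_references` helper and made-up sample records, walks every scenario and reports references that do not resolve.

```python
from typing import Any, Dict, List


def find_dangling_references(
    scenarios: List[Dict[str, Any]],
    customer_lookup: Dict[str, Any],
    order_lookup: Dict[str, Any],
) -> List[str]:
    """Return one message per scenario reference that is missing from its lookup."""
    problems: List[str] = []
    checks = (("customer_id", customer_lookup), ("order_id", order_lookup))
    for scenario in scenarios:
        scenario_id = scenario.get("scenario_id", "<unknown>")
        for field, lookup in checks:
            ref = scenario.get(field)
            # A missing or None reference is allowed; an unresolved one is not.
            if ref is not None and ref not in lookup:
                problems.append(f"{scenario_id}: {field} {ref!r} not found in lookup")
    return problems


scenarios = [
    {"scenario_id": "S1", "customer_id": "C1", "order_id": "O1"},
    {"scenario_id": "S2", "customer_id": "C9", "order_id": None},
]
customer_lookup = {"C1": {"customer_id": "C1"}}
order_lookup = {"O1": {"order_id": "O1"}}

dangling = find_dangling_references(scenarios, customer_lookup, order_lookup)
print(dangling)  # ["S2: customer_id 'C9' not found in lookup"]
```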

---

### 3. Error Handling Is Tested Explicitly

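The invalid data directory test is particularly important.

It points the node at `nonexistent/directory` and asserts that the returned `errors` list is non-empty, which shows that:

* failures surface early
* errors are captured, not swallowed
* the system refuses to proceed silently

It does not check which `except` branch produced the error, and the malformed-JSON path is not exercised here.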

Executives don’t fear AI because it fails — they fear it because it fails quietly. This test directly addresses that concern.

---

### 4. State Preservation Is Verified

The `test_data_loading_after_planning()` test is subtle but powerful.

It confirms that:

* the agent behaves as a *state machine*
* prior nodes’ outputs are preserved
* orchestration is additive, not destructive

That’s essential for:

* debuggability
* audit trails
* future branching or retries
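"Additive, not destructive" can also be enforced when updates are merged, not only asserted afterwards. The sketch below assumes nodes return partial updates; `additive_merge`, the choice of `errors` as the only shared key, and the sample state values are hypothetical.

```python
from typing import Any, Dict, Tuple

State = Dict[str, Any]


def additive_merge(state: State, update: State, shared_keys: Tuple[str, ...] = ("errors",)) -> State:
    """Merge a node's update into state, rejecting overwrites of existing keys."""
    clobbered = sorted(
        key
        for key in update
        if key in state and key not in shared_keys and update[key] != state[key]
    )
    if clobbered:
        raise ValueError(f"update overwrites existing keys: {clobbered}")
    return {**state, **update}


# Made-up state as it might look after the goal and planning nodes.
state: State = {"goal": "evaluate specialist agents", "plan": ["load", "evaluate"], "errors": []}

# A data-loading style update adds new keys and passes errors through.
state = additive_merge(state, {"journey_scenarios": [{"scenario_id": "S1"}], "errors": []})
print(sorted(state))  # ['errors', 'goal', 'journey_scenarios', 'plan']

try:
    additive_merge(state, {"goal": "something else"})
    overwrite_rejected = False
except ValueError:
    overwrite_rejected = True
print(overwrite_rejected)  # True
```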

---

## How This Differs From Most Agents in Production Today

Most agents:

* call loaders inline
* assume data exists
* overwrite state unintentionally
* don’t test node-level integration
* don’t verify historical data presence

Your system:

* treats data loading as a formal node
* validates integration explicitly
* preserves orchestration context
* refuses to run under invalid conditions

That’s the difference between a demo pipeline and operational infrastructure.

---

## Why This Supports ROI and Accountability

Because the system:

* fails early
* preserves state
* validates facts
* exposes errors clearly

teams spend:

* less time debugging
* less time questioning results
* more time improving agents

That directly translates to:

* faster iteration
* safer releases
* higher executive confidence
* better adoption

Which is exactly how ROI is protected.

---

## Executive Takeaway

What a CEO or business manager would see here is not “test code.”

They would see:

* discipline
* seriousness
* operational safeguards
* a system that checks itself before making claims

This test quietly communicates:

> *“If we don’t have the facts, we don’t pretend to evaluate performance.”*

That’s a powerful signal — and one that most AI systems never send.

---

### What You’ve Achieved So Far

At this point, you’ve fully locked down:

* **intent** (goal)
* **structure** (plan)
* **facts** (data loading)
* **integration safety** (node tests)
* **historical grounding**




In [None]:
"""
Phase 2 Node Test: Data Loading Node

Tests that data_loading_node works correctly with utilities.
"""

import sys
from typing import Dict, Any

# Add project root to path
sys.path.insert(0, '.')

from agents.eval_as_service.orchestrator.nodes import data_loading_node
from config import EvalAsServiceOrchestratorState, EvalAsServiceOrchestratorConfig


def test_data_loading_node():
    """Test data_loading_node with minimal state"""
    print("Testing data_loading_node...")

    # Create config
    config = EvalAsServiceOrchestratorConfig()

    # Test 1: Basic data loading
    state: EvalAsServiceOrchestratorState = {
        "errors": []
    }

    result = data_loading_node(state, config)

    # Check that all required fields are present
    required_fields = [
        "journey_scenarios",
        "specialist_agents",
        "agent_lookup",
        "supporting_data",
        "customer_lookup",
        "order_lookup",
        "decision_rules",
        "historical_evaluation_runs",
        "historical_run_metrics",
        "historical_scenario_evaluations",
        "run_metrics_lookup",
        "evaluation_runs_lookup"
    ]

    for field in required_fields:
        assert field in result, f"Missing field: {field}"

    # Verify types
    assert isinstance(result["journey_scenarios"], list)
    assert len(result["journey_scenarios"]) > 0, "Should have scenarios"

    assert isinstance(result["specialist_agents"], dict)
    assert len(result["specialist_agents"]) > 0, "Should have agents"

    assert isinstance(result["agent_lookup"], dict)
    assert len(result["agent_lookup"]) > 0, "Should have agent lookup"

    assert isinstance(result["supporting_data"], dict)
    assert "customers" in result["supporting_data"]
    assert "orders" in result["supporting_data"]
    assert "logistics" in result["supporting_data"]
    assert "marketing_signals" in result["supporting_data"]

    assert isinstance(result["customer_lookup"], dict)
    assert len(result["customer_lookup"]) > 0, "Should have customer lookup"

    assert isinstance(result["order_lookup"], dict)
    assert len(result["order_lookup"]) > 0, "Should have order lookup"

    assert isinstance(result["decision_rules"], dict)

    assert isinstance(result["historical_evaluation_runs"], list)
    assert len(result["historical_evaluation_runs"]) > 0, "Should have historical runs"

    assert isinstance(result["historical_run_metrics"], list)
    assert len(result["historical_run_metrics"]) > 0, "Should have historical metrics"

    assert isinstance(result["historical_scenario_evaluations"], list)
    assert len(result["historical_scenario_evaluations"]) > 0, "Should have historical evaluations"

    assert isinstance(result["run_metrics_lookup"], dict)
    assert len(result["run_metrics_lookup"]) > 0, "Should have run metrics lookup"

    assert isinstance(result["evaluation_runs_lookup"], dict)
    assert len(result["evaluation_runs_lookup"]) > 0, "Should have evaluation runs lookup"

    # Check that errors list is preserved
    assert "errors" in result
    assert isinstance(result["errors"], list)

    print("✅ Data loading node test 1 passed: All data loaded")
    print(f"   Scenarios: {len(result['journey_scenarios'])}")
    print(f"   Agents: {len(result['specialist_agents'])}")
    print(f"   Customers: {len(result['customer_lookup'])}")
    print(f"   Orders: {len(result['order_lookup'])}")
    print(f"   Historical runs: {len(result['historical_evaluation_runs'])}")
    print(f"   Historical metrics: {len(result['historical_run_metrics'])}")
    print(f"   Historical evaluations: {len(result['historical_scenario_evaluations'])}")

    # Test 2: Verify lookup functionality
    # Test customer lookup
    if result["journey_scenarios"]:
        scenario = result["journey_scenarios"][0]
        customer_id = scenario.get("customer_id")
        if customer_id:
            assert customer_id in result["customer_lookup"], f"Customer {customer_id} should be in lookup"
            customer = result["customer_lookup"][customer_id]
            assert customer["customer_id"] == customer_id

    # Test order lookup
    if result["journey_scenarios"]:
        scenario = result["journey_scenarios"][0]
        order_id = scenario.get("order_id")
        if order_id:
            assert order_id in result["order_lookup"], f"Order {order_id} should be in lookup"
            order = result["order_lookup"][order_id]
            assert order["order_id"] == order_id

    # Test agent lookup
    if result["specialist_agents"]:
        agent_key = list(result["specialist_agents"].keys())[0]
        agent_data = result["specialist_agents"][agent_key]
        agent_id = agent_data.get("agent_id")
        if agent_id:
            assert agent_id in result["agent_lookup"], f"Agent {agent_id} should be in lookup"

    # Test run metrics lookup
    if result["historical_run_metrics"]:
        run_id = result["historical_run_metrics"][0]["run_id"]
        assert run_id in result["run_metrics_lookup"], f"Run {run_id} should be in metrics lookup"
        metrics = result["run_metrics_lookup"][run_id]
        assert metrics["run_id"] == run_id

    # Test evaluation runs lookup
    if result["historical_evaluation_runs"]:
        run_id = result["historical_evaluation_runs"][0]["run_id"]
        assert run_id in result["evaluation_runs_lookup"], f"Run {run_id} should be in runs lookup"
        run = result["evaluation_runs_lookup"][run_id]
        assert run["run_id"] == run_id

    print("✅ Data loading node test 2 passed: Lookups work correctly")

    # Test 3: Error handling (invalid data directory)
    config_invalid = EvalAsServiceOrchestratorConfig()
    config_invalid.data_dir = "nonexistent/directory"

    state_error: EvalAsServiceOrchestratorState = {
        "errors": []
    }

    result_error = data_loading_node(state_error, config_invalid)
    assert "errors" in result_error
    assert len(result_error["errors"]) > 0, "Should have errors for invalid directory"
    print("✅ Data loading node test 3 passed: Error handling")


def test_data_loading_after_planning():
    """Test data loading node after planning node"""
    print("Testing data_loading_node after planning_node...")

    from agents.eval_as_service.orchestrator.nodes import goal_node, planning_node

    config = EvalAsServiceOrchestratorConfig()

    # Start with goal and planning
    state: EvalAsServiceOrchestratorState = {
        "scenario_id": None,
        "target_agent_id": None,
        "errors": []
    }

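    # Run goal node. Merging the result keeps this test valid whether a node
    # returns the full state or only the fields it adds.
    state = {**state, **goal_node(state)}
    assert "goal" in state

    # Run planning node
    state = {**state, **planning_node(state)}
    assert "plan" in state

    # Run data loading node. It returns only the fields it adds, so merge the
    # update into the existing state instead of replacing it.
    state = {**state, **data_loading_node(state, config)}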

    # Verify all data is loaded
    assert "journey_scenarios" in state
    assert "specialist_agents" in state
    assert "agent_lookup" in state
    assert "supporting_data" in state
    assert "customer_lookup" in state
    assert "order_lookup" in state
    assert "decision_rules" in state
    assert "historical_evaluation_runs" in state
    assert "historical_run_metrics" in state
    assert "historical_scenario_evaluations" in state

    # Verify previous state is preserved
    assert "goal" in state
    assert "plan" in state

    print("✅ Data loading after planning test passed")


if __name__ == "__main__":
    print("=" * 60)
    print("Phase 2 Node Test: Data Loading Node")
    print("=" * 60)
    print()

    try:
        test_data_loading_node()
        print()
        test_data_loading_after_planning()
        print()

        print("=" * 60)
        print("✅ Phase 2 Node Tests: ALL PASSED")
        print("=" * 60)
    except AssertionError as e:
        print(f"❌ Test failed: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)


# Test Results

In [None]:
(.venv) micahshull@Micahs-iMac AI_AGENTS_021_EAAS % python3 test_eval_as_service_phase2_node.py
============================================================
Phase 2 Node Test: Data Loading Node
============================================================

Testing data_loading_node...
✅ Data loading node test 1 passed: All data loaded
   Scenarios: 10
   Agents: 4
   Customers: 5
   Orders: 5
   Historical runs: 6
   Historical metrics: 6
   Historical evaluations: 6
✅ Data loading node test 2 passed: Lookups work correctly
✅ Data loading node test 3 passed: Error handling

Testing data_loading_node after planning_node...
✅ Data loading after planning test passed

============================================================
✅ Phase 2 Node Tests: ALL PASSED
============================================================
