
# Orchestrator for Evaluation-as-a-Service Agent

In [None]:
"""Orchestrator for Evaluation-as-a-Service Agent

Creates and compiles the LangGraph workflow for agent evaluation.
"""

from typing import Optional

from langgraph.graph import StateGraph, END
from langgraph.graph.state import CompiledStateGraph
from config import EvalAsServiceOrchestratorState, EvalAsServiceOrchestratorConfig
from agents.eval_as_service.nodes import (
    goal_node,
    planning_node,
    data_loading_node,
    evaluation_execution_node,
    scoring_node,
    performance_analysis_node,
    report_generation_node,
)


def create_orchestrator(
    config: Optional[EvalAsServiceOrchestratorConfig] = None,
) -> CompiledStateGraph:
    """Create, wire, and compile the EaaS orchestrator workflow.

    Returns the compiled graph (the object that exposes ``invoke``),
    not the uncompiled ``StateGraph`` builder.
    """
    if config is None:
        config = EvalAsServiceOrchestratorConfig()

    workflow = StateGraph(EvalAsServiceOrchestratorState)

    # Add nodes
    workflow.add_node("goal", goal_node)
    workflow.add_node("planning", planning_node)
    workflow.add_node("data_loading", lambda s: data_loading_node(s, config))
    workflow.add_node("evaluation_execution", lambda s: evaluation_execution_node(s, config))
    workflow.add_node("scoring", lambda s: scoring_node(s, config))
    workflow.add_node("performance_analysis", lambda s: performance_analysis_node(s, config))
    workflow.add_node("report_generation", lambda s: report_generation_node(s, config))

    # Set entry point
    workflow.set_entry_point("goal")

    # Linear flow
    workflow.add_edge("goal", "planning")
    workflow.add_edge("planning", "data_loading")
    workflow.add_edge("data_loading", "evaluation_execution")
    workflow.add_edge("evaluation_execution", "scoring")
    workflow.add_edge("scoring", "performance_analysis")
    workflow.add_edge("performance_analysis", "report_generation")
    workflow.add_edge("report_generation", END)

    return workflow.compile()




# The Orchestrator: Turning Components into a Governed System

The orchestrator is responsible for assembling all nodes into a **single, coherent workflow**. It defines *how* evaluation happens, *in what order*, and *under what rules*.

This is where individual utilities and nodes become a **system**.

---

## Why an Explicit Orchestrator Matters

In many agent systems, execution order is implied by code structure. Logic is scattered across files, and the true workflow is difficult to reason about.

Here, the workflow is:

* explicit
* linear
* inspectable
* deterministic

Anyone reading this file can immediately understand how an evaluation run proceeds from start to finish.

---

## StateGraph as the Backbone

The orchestrator uses a `StateGraph` to model evaluation as a sequence of state transitions.

Each node:

* receives the shared state
* adds new information
* passes the updated state forward

The graph's edges guarantee that **no step runs before the previous step has finished**, so each node can rely on the state its predecessors wrote. This preserves the integrity of the evaluation pipeline.
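
As a rough illustration of this contract, the sketch below mimics it in plain Python without LangGraph. The step functions, the state keys (`goal`, `plan`), and the `run_linear` helper are hypothetical stand-ins, not the project's actual nodes, and the merge step only approximates how a graph runtime applies each node's returned update to the shared state.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def goal_step(state: State) -> State:
    # Hypothetical node: reads the shared state, returns only the keys it adds.
    scope = state.get("scenario_id") or "all scenarios"
    return {"goal": f"Evaluate {scope}"}


def planning_step(state: State) -> State:
    # Depends on the key written by the previous step.
    return {"plan": [f"plan for: {state['goal']}"]}


def run_linear(nodes: List[Node], state: State) -> State:
    # Apply each node in order and merge its partial update into the state.
    for node in nodes:
        state = {**state, **node(state)}
    return state


final_state = run_linear([goal_step, planning_step], {"scenario_id": None, "errors": []})
print(final_state["plan"])
```

Reordering the list so that `planning_step` runs first raises a `KeyError`, which is the plain-Python analogue of a step running without its predecessor's output.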

---

## Clear, Single-Responsibility Nodes

Each node in the workflow has a focused responsibility:

1. **Goal** – define intent and scope
2. **Planning** – make the workflow explicit
3. **Data Loading** – prepare and validate inputs
4. **Evaluation Execution** – run controlled experiments
5. **Scoring** – apply deterministic judgment
6. **Performance Analysis** – aggregate and classify results
7. **Report Generation** – communicate outcomes

Nothing overlaps. Nothing is ambiguous.

This clarity is what keeps the system maintainable as it grows.

---

## Configuration Is Injected, Not Embedded

The orchestrator passes configuration into nodes explicitly rather than letting nodes read from global state.

This allows:

* different environments to use different standards
* easy testing and experimentation
* predictable behavior across runs

Configuration defines **policy**, while the orchestrator enforces **process**.
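
A minimal sketch of the injection pattern, assuming a hypothetical `DemoConfig` with a `pass_threshold` field (the real `EvalAsServiceOrchestratorConfig` fields are not shown in this file). It uses `functools.partial` where the orchestrator uses `lambda s: node(s, config)`; both bind the configuration when the graph is built.

```python
from dataclasses import dataclass
from functools import partial
from typing import Any, Dict


@dataclass(frozen=True)
class DemoConfig:
    # Hypothetical policy setting, for illustration only.
    pass_threshold: float = 0.8


def scoring_step(state: Dict[str, Any], config: DemoConfig) -> Dict[str, Any]:
    # The node reads policy from the injected config, never from a global.
    return {"passed": state["score"] >= config.pass_threshold}


# Bind two different policies to the same node function.
lenient = partial(scoring_step, config=DemoConfig(pass_threshold=0.8))
strict = partial(scoring_step, config=DemoConfig(pass_threshold=0.9))

state = {"score": 0.85}
print(lenient(state), strict(state))
```

The same node function yields different verdicts under different policies, which is what makes it straightforward to test a node against several configurations.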

---

## Linear Flow Is a Feature, Not a Limitation

The workflow is intentionally linear:

```
goal → planning → data_loading → evaluation_execution → scoring → performance_analysis → report_generation
```

This mirrors how real evaluations are conducted and makes the system:

* easier to reason about
* easier to debug
* easier to audit

More complex branching can be added later, but linearity is ideal for establishing trust and correctness first.
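
Because the flow is a straight line, the edge list can be derived from a single ordered list of node names. The sketch below only computes the `(source, target)` pairs in plain Python; the `"END"` string is a placeholder for LangGraph's `END` sentinel, and the `linear_edges` helper is hypothetical rather than part of the project.

```python
from typing import List, Tuple

# Node names as registered in create_orchestrator, in execution order.
NODE_ORDER: List[str] = [
    "goal",
    "planning",
    "data_loading",
    "evaluation_execution",
    "scoring",
    "performance_analysis",
    "report_generation",
]


def linear_edges(order: List[str], end: str = "END") -> List[Tuple[str, str]]:
    # Pair each node with its successor, then connect the last node to the end marker.
    pairs = list(zip(order, order[1:]))
    pairs.append((order[-1], end))
    return pairs


edges = linear_edges(NODE_ORDER)
for source, target in edges:
    print(f"{source} -> {target}")
```

Keeping the order in one list means a reviewer can audit the whole pipeline by reading a single data structure.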

---

## Determinism Preserved End to End

Because:

* utilities are deterministic
* nodes only aggregate
* state is explicit
* execution order is fixed

the entire orchestrator is reproducible.

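Running the same workflow with the same inputs should produce the same scores and classifications; only run-specific details, such as the report timestamp and the output file name, change between runs. That property is rare in agent systems, and incredibly valuable.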
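
One way to check this property is to run the workflow twice and compare the results after dropping fields that legitimately vary between runs. The sketch below uses a stand-in `fake_run` function instead of the real orchestrator, and the list of volatile keys is an assumption (`generated_at` is a made-up key; `report_file_path` is included because the saved report's file name carries a timestamp).

```python
import time
from typing import Any, Dict, Iterable

# Assumed run-specific keys; adjust to the real state schema.
VOLATILE_KEYS = ("report_file_path", "generated_at")


def stable_view(result: Dict[str, Any], volatile: Iterable[str] = VOLATILE_KEYS) -> Dict[str, Any]:
    # Keep only the fields that should be identical across runs.
    skip = set(volatile)
    return {k: v for k, v in result.items() if k not in skip}


def fake_run(initial_state: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for orchestrator.invoke: fixed scores plus a timestamp.
    return {
        **initial_state,
        "evaluation_summary": {"total_evaluations": 19, "overall_pass_rate": 1.0},
        "generated_at": time.time_ns(),
    }


first = fake_run({"scenario_id": None, "errors": []})
second = fake_run({"scenario_id": None, "errors": []})
print(stable_view(first) == stable_view(second))
```

In a real test, `fake_run` would be replaced by two calls to `orchestrator.invoke` with identical initial states.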

---

## Why This Is the Right Capstone

This orchestrator completes the architectural story:

* **Utilities** define rules
* **Nodes** coordinate logic
* **State** carries truth
* **Orchestrator** enforces order

Nothing happens by accident. Nothing is implicit.

That’s exactly what’s required when AI systems move from experimentation into environments where:

* trust matters
* decisions matter
* accountability matters

---

## A Template Worth Reusing

This orchestrator is more than a one-off implementation. It’s a **reusable blueprint** for building agent systems that are:

* transparent
* auditable
* scalable
* executive-friendly

Swapping out nodes or utilities creates new agents, while the underlying governance structure remains intact.

That’s a powerful pattern to carry forward.

---

## The Big Takeaway

This file encodes the philosophy behind the whole project:

> **Deterministic systems earn trust.
> Probabilistic systems add insight.**

The orchestrator ensures that trust is built into the workflow itself rather than bolted on later, which makes it a solid foundation to reuse and build on.


# Test file for Evaluation-as-a-Service Orchestrator Agent

In [None]:
"""Test file for Evaluation-as-a-Service Orchestrator Agent

MVP smoke tests to validate the orchestrator works end-to-end.
"""

import pytest
from config import EvalAsServiceOrchestratorState, EvalAsServiceOrchestratorConfig
from agents.eval_as_service.orchestrator import create_orchestrator


def test_orchestrator_creation():
    """Test that orchestrator can be created."""
    config = EvalAsServiceOrchestratorConfig()
    orchestrator = create_orchestrator(config)

    assert orchestrator is not None
    assert hasattr(orchestrator, 'invoke')


def test_complete_workflow():
    """Test complete evaluation workflow."""
    config = EvalAsServiceOrchestratorConfig()
    orchestrator = create_orchestrator(config)

    # Initial state - evaluate all scenarios
    initial_state: EvalAsServiceOrchestratorState = {
        "scenario_id": None,  # Evaluate all scenarios
        "target_agent_id": None,  # Evaluate all agents
        "errors": []
    }

    # Run orchestrator
    result = orchestrator.invoke(initial_state)

    # Validate results
    assert "evaluation_report" in result
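    # report_file_path is optional (the report may not be saved to disk);
    # when it is set, it must not be empty
    report_path = result.get("report_file_path")
    assert report_path is None or str(report_path) != ""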
    assert "evaluation_summary" in result
    assert "agent_performance_summary" in result
    assert len(result.get("errors", [])) == 0

    # Check summary metrics
    summary = result.get("evaluation_summary", {})
    assert summary.get("total_scenarios", 0) > 0
    assert summary.get("total_evaluations", 0) > 0

    # Check agent summaries
    agent_summaries = result.get("agent_performance_summary", [])
    assert len(agent_summaries) > 0

    print("\n✅ Complete workflow test passed!")
    print(f"   - Total scenarios: {summary.get('total_scenarios')}")
    print(f"   - Total evaluations: {summary.get('total_evaluations')}")
    print(f"   - Overall pass rate: {summary.get('overall_pass_rate', 0.0):.1%}")
    print(f"   - Agents evaluated: {len(agent_summaries)}")


def test_single_scenario_evaluation():
    """Test evaluation of a single scenario."""
    config = EvalAsServiceOrchestratorConfig()
    orchestrator = create_orchestrator(config)

    # Evaluate single scenario
    initial_state: EvalAsServiceOrchestratorState = {
        "scenario_id": "S001",  # Single scenario
        "target_agent_id": None,  # All agents
        "errors": []
    }

    result = orchestrator.invoke(initial_state)

    assert "evaluation_report" in result
    assert len(result.get("errors", [])) == 0

    summary = result.get("evaluation_summary", {})
    assert summary.get("total_scenarios", 0) == 1

    print("\n✅ Single scenario test passed!")


def test_single_agent_evaluation():
    """Test evaluation of a single agent."""
    config = EvalAsServiceOrchestratorConfig()
    orchestrator = create_orchestrator(config)

    # Evaluate single agent
    initial_state: EvalAsServiceOrchestratorState = {
        "scenario_id": None,  # All scenarios
        "target_agent_id": "shipping_update_agent",  # Single agent
        "errors": []
    }

    result = orchestrator.invoke(initial_state)

    assert "evaluation_report" in result
    assert len(result.get("errors", [])) == 0

    agent_summaries = result.get("agent_performance_summary", [])
    assert len(agent_summaries) == 1
    assert agent_summaries[0].get("agent_id") == "shipping_update_agent"

    print("\n✅ Single agent test passed!")


if __name__ == "__main__":
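    # Run a quick manual subset (creation + complete workflow);
    # use pytest to run the full suite, including the single-scenario and single-agent tests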
    print("Running EaaS Orchestrator Tests...\n")

    try:
        test_orchestrator_creation()
        print("✅ Orchestrator creation test passed\n")

        test_complete_workflow()
        print("\n✅ All tests passed!")

    except Exception as e:
        print(f"\n❌ Test failed: {e}")
        import traceback
        traceback.print_exc()
        # Exit non-zero so shells and CI see the failure
        raise SystemExit(1)



In [None]:
(.venv) micahshull@Micahs-iMac AI_AGENTS_006_EvalAsAService % python test_eval_as_service.py
Running EaaS Orchestrator Tests...

✅ Orchestrator creation test passed

  [Goal Node] Starting...
  [Goal Node] Goal: Evaluate all agents across all scenarios
  [Data Loading Node] Starting...
    Loading scenarios...
    Loaded 10 scenarios
    Loading agents...
    Loaded 4 agents
    Loading supporting data...
    Supporting data loaded
    Loading decision rules...
    Decision rules loaded
  [Data Loading Node] Loaded 10 scenarios, 4 agents
  [Evaluation Execution Node] Starting...
  [Evaluation Execution Node] Processing 10 scenarios...
    Scenario 1/10: S001 -> 1 agents
    Scenario 2/10: S002 -> 2 agents
    Scenario 3/10: S003 -> 2 agents
    Scenario 4/10: S004 -> 2 agents
    Scenario 5/10: S005 -> 2 agents
    Scenario 6/10: S006 -> 3 agents
    Scenario 7/10: S007 -> 1 agents
    Scenario 8/10: S008 -> 3 agents
    Scenario 9/10: S009 -> 2 agents
    Scenario 10/10: S010 -> 1 agents
  [Evaluation Execution Node] Completed 19/19 evaluations
  [Scoring Node] Starting...
  [Scoring Node] Scored 19 evaluations
  [Performance Analysis Node] Starting...
  [Report Generation Node] Starting...
  [Report Generation Node] Report generated (2006 chars)
  [Report Generation Node] Saved to: output/eval_as_service_reports/eval_report_evaluation_report_20251222_151121.md

✅ Complete workflow test passed!
   - Total scenarios: 10
   - Total evaluations: 19
   - Overall pass rate: 100.0%
   - Agents evaluated: 4

✅ All tests passed!


# Evaluation-as-a-Service Report

**Generated:** 2025-12-22 15:11:21

## Executive Summary

- **Total Scenarios Evaluated:** 10
- **Total Evaluations:** 19
- **Overall Pass Rate:** 100.0%
- **Average Score:** 0.99
- **Healthy Agents:** 4
- **Degraded Agents:** 0
- **Critical Agents:** 0

## Agent Performance Details

### refund_agent

- **Status:** healthy
- **Total Evaluations:** 2
- **Passed:** 2
- **Failed:** 0
- **Average Score:** 1.00
- **Average Response Time:** 0.00s

### shipping_update_agent

- **Status:** healthy
- **Total Evaluations:** 7
- **Passed:** 7
- **Failed:** 0
- **Average Score:** 1.00
- **Average Response Time:** 0.00s

### apology_message_agent

- **Status:** healthy
- **Total Evaluations:** 6
- **Passed:** 6
- **Failed:** 0
- **Average Score:** 0.97
- **Average Response Time:** 0.00s

### escalation_agent

- **Status:** healthy
- **Total Evaluations:** 4
- **Passed:** 4
- **Failed:** 0
- **Average Score:** 1.00
- **Average Response Time:** 0.00s

## Evaluation Results

| Scenario | Agent | Score | Passed | Issues |
|----------|-------|-------|--------|--------|
| S001 | shipping_update_agent | 1.00 | ✓ |  |
| S002 | shipping_update_agent | 1.00 | ✓ |  |
| S002 | apology_message_agent | 1.00 | ✓ |  |
| S003 | refund_agent | 1.00 | ✓ |  |
| S003 | apology_message_agent | 0.85 | ✓ | Output status doesn't match expected outcome type |
| S004 | shipping_update_agent | 1.00 | ✓ |  |
| S004 | apology_message_agent | 1.00 | ✓ |  |
| S005 | escalation_agent | 1.00 | ✓ |  |
| S005 | apology_message_agent | 1.00 | ✓ |  |
| S006 | shipping_update_agent | 1.00 | ✓ |  |
| S006 | apology_message_agent | 1.00 | ✓ |  |
| S006 | escalation_agent | 1.00 | ✓ |  |
| S007 | shipping_update_agent | 1.00 | ✓ |  |
| S008 | shipping_update_agent | 1.00 | ✓ |  |
| S008 | apology_message_agent | 1.00 | ✓ |  |
| S008 | escalation_agent | 1.00 | ✓ |  |
| S009 | escalation_agent | 1.00 | ✓ |  |
| S009 | refund_agent | 1.00 | ✓ |  |
| S010 | shipping_update_agent | 1.00 | ✓ |  |
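
The headline figures in the executive summary can be recomputed from the results table. The short check below copies the 19 scores from the table (18 at 1.00 and one at 0.85) and derives the pass rate and average score. It assumes a simple unweighted mean, which matches the reported 0.99 but is not confirmed by the report generator's code.

```python
# Scores copied from the Evaluation Results table: 18 at 1.00, one at 0.85.
scores = [1.00] * 18 + [0.85]
passed = [True] * 19  # every row in the table is marked as passed

pass_rate = sum(passed) / len(passed)
average_score = sum(scores) / len(scores)

print(f"Overall pass rate: {pass_rate:.1%}")
print(f"Average score: {average_score:.2f}")
```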
