
# Evaluation-as-a-Service (EaaS) Orchestrator Agent

This agent evaluates the performance of other AI agents by:
1. Loading test scenarios with expected outcomes
2. Executing scenarios through target agents
3. Comparing actual vs expected outputs
4. Scoring performance (correctness, response time, output quality)
5. Generating comprehensive evaluation reports

## Why This Is Different From Most AI Agents in Production

Most AI agents today:

* hide logic in prompts
* lack explicit performance standards
* cannot explain why something passed or failed
* blur the line between reasoning and governance

This system:

* exposes its standards
* encodes business expectations
* tracks performance over time
* supports policy-driven decision making

In other words, it behaves less like a demo — and more like infrastructure.

---

## The Executive Takeaway

What leaders see in this configuration is not just flexibility — it’s **control**.

They can:

* define success
* monitor risk
* detect regressions
* approve deployment with confidence

And they can do it without needing to understand model internals.

That’s why this design isn’t just technically sound — it’s *organizationally adoptable*.




# Evaluation-as-a-Service Orchestrator State — Architecture Review

## What This State Represents in Practical Terms

This state defines the **entire operating context** for the EaaS Orchestrator.

In real-world terms, this is the **single source of truth** that allows the agent to:

* Load realistic business scenarios
* Simulate how agents behave
* Evaluate decisions objectively
* Track performance over time
* Detect regressions and improvements
* Produce executive-ready reports

Nothing here is abstract — every field corresponds to a concrete step in how a real evaluation service would run in production.

---

## Why This Design Builds Trust and Control

A critical strength of this state is that it **separates facts from interpretation**.

You clearly distinguish:

* **Inputs** (scenarios, agents, data)
* **Execution results** (what actually happened)
* **Scoring** (objective measurements)
* **Analysis** (aggregations and trends)
* **Narrative output** (reports and summaries)

That separation is what makes the system:

* auditable
* testable
* defensible to leadership

An executive can point to *exactly* where a decision came from.

---

## Key Architectural Highlights

### 1. Scenario-First Evaluation Model

Your `journey_scenarios` field anchors the entire evaluation process in **realistic customer journeys**, not synthetic benchmarks.

This means the agent isn’t asking:

> “Did the model do well?”

It’s asking:

> “Did the system behave correctly in situations that matter to the business?”

That framing ties every evaluation result to a business outcome, which is what makes the results credible to stakeholders.

---

### 2. Explicit Agent Simulation (Not Black-Box Evaluation)

By loading `specialist_agents` as structured definitions instead of opaque functions, you make agent behavior:

* inspectable
* configurable
* reproducible

This reinforces a key principle:

> The evaluator measures *system behavior*, not model magic.

That’s exactly what enterprises want.

---

### 3. Historical Data as a First-Class Citizen

Your inclusion of:

* `historical_evaluation_runs`
* `historical_run_metrics`
* `historical_scenario_evaluations`

is not cosmetic — it fundamentally changes the nature of the agent.

This turns EaaS into:

* a monitoring system
* a regression detector
* a release gatekeeper

Executives don’t just see **today’s performance**, they see **directionality**.
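As a rough illustration of how the historical fields could feed regression detection, the sketch below compares a current pass rate against one baseline run's `overall_pass_rate`. The function name `compare_to_baseline` and the `tolerance` parameter are assumptions made for this example, not part of the original agent; the output keys mirror the `baseline_comparison` structure documented in the state.

```python
from typing import Any, Dict


def compare_to_baseline(
    current_pass_rate: float,
    baseline_metrics: Dict[str, Any],
    tolerance: float = 0.01,  # hypothetical: drops smaller than this are ignored
) -> Dict[str, Any]:
    """Compare the current pass rate with one historical run's metrics."""
    baseline_pass_rate = baseline_metrics["overall_pass_rate"]
    improvement = current_pass_rate - baseline_pass_rate
    # Relative change, guarded against a zero baseline.
    improvement_pct = (
        improvement / baseline_pass_rate * 100 if baseline_pass_rate else 0.0
    )
    return {
        "baseline_run_id": baseline_metrics["run_id"],
        "current_pass_rate": current_pass_rate,
        "baseline_pass_rate": baseline_pass_rate,
        "improvement": round(improvement, 4),
        "improvement_percentage": round(improvement_pct, 2),
        "regression_detected": improvement < -tolerance,
    }


baseline = {"run_id": "RUN_2026_01_10", "overall_pass_rate": 0.88}
comparison = compare_to_baseline(0.92, baseline)
print(comparison["improvement_percentage"], comparison["regression_detected"])
```

A small tolerance keeps run-to-run noise from being flagged as a regression; the right value would depend on how many scenarios each run covers.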

---

### 4. Clean Separation Between Execution and Scoring

The split between:

* `executed_evaluations`
* `evaluation_scores`
* `agent_performance_summary`

is exactly right.

This allows you to:

* change scoring rules without rerunning executions
* audit failures independently
* explain *why* an agent failed, not just that it did

That’s accountability by design.
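To make the "change scoring rules without rerunning executions" point concrete, here is a minimal sketch in which the same stored execution records are scored under two different rules. The rule functions and the sample records (including scenario `S002` and the `issue_refund` outcome) are made up for illustration; only the record field names come from the `executed_evaluations` structure.

```python
from typing import Any, Callable, Dict, List

# Stored execution records: facts about what happened, with no scores attached.
executed_evaluations: List[Dict[str, Any]] = [
    {
        "scenario_id": "S001",
        "target_agent_id": "shipping_update_agent",
        "actual_output": {"outcome": "provide_delivery_update"},
        "expected_output": {"outcome": "provide_delivery_update"},
        "execution_time_seconds": 0.5,
        "status": "completed",
    },
    {
        "scenario_id": "S002",  # illustrative record, not from the real data set
        "target_agent_id": "refund_agent",
        "actual_output": {"outcome": "issue_refund"},
        "expected_output": {"outcome": "issue_refund"},
        "execution_time_seconds": 2.6,
        "status": "completed",
    },
]

ScoringRule = Callable[[Dict[str, Any]], bool]


def outcome_only(record: Dict[str, Any]) -> bool:
    """Pass when the actual output equals the expected output."""
    return record["actual_output"] == record["expected_output"]


def outcome_and_latency(record: Dict[str, Any]) -> bool:
    """Stricter rule: output must match and the run must finish within 2.0 s."""
    return outcome_only(record) and record["execution_time_seconds"] <= 2.0


def score(records: List[Dict[str, Any]], rule: ScoringRule) -> List[Dict[str, Any]]:
    """Apply a scoring rule to stored records; no agent is re-executed."""
    return [{"scenario_id": r["scenario_id"], "passed": rule(r)} for r in records]


lenient = score(executed_evaluations, outcome_only)
strict = score(executed_evaluations, outcome_and_latency)
print(lenient, strict)
```

Because the records are never mutated, both score lists can be kept side by side and audited against the same evidence.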

---

### 5. Health States That Map to Business Language

Your `health_thresholds` configuration is a great example of translating technical signals into **business-readable status**:

* healthy
* degraded
* critical

This is the language decision-makers actually use — and you’ve encoded it directly into the system.

---

## Why This State Supports ROI and Executive Confidence

From a leadership perspective, this state enables answers to questions like:

* Which agents are safe to deploy?
* Which ones need attention?
* Did last week’s change improve or degrade performance?
* Where is human review still required?
* Are we trending toward stability or risk?

Crucially, those answers are backed by:

* explicit data
* historical comparisons
* repeatable logic

Not opinion. Not intuition.

---

## Config Design: Pragmatic and Scalable

Your config class reinforces the same philosophy:

* Rules and thresholds are **visible and adjustable**
* LLM usage is clearly marked as **optional enhancement**
* Tooling integrations are feature-flagged, not assumed
* Scoring logic is parameterized, not hard-coded

That makes this agent:

* safe to evolve
* easy to explain
* resistant to accidental complexity

---

## Overall Assessment

This state definition is:

* Architecturally sound
* Enterprise-credible
* MVP-disciplined
* Portfolio-ready

Most importantly, it clearly communicates this idea:

> **The system does not “judge” agents — it measures them, compares them, and reports the results transparently.**

That’s exactly what an Evaluation-as-a-Service agent should do.



In [None]:
# ============================================================================
# Evaluation-as-a-Service (EaaS) Orchestrator Agent
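# ============================================================================

# Imports required by the state and config definitions below
import operator
import os
from dataclasses import dataclass, field
from typing import Annotated, Any, Dict, List, Optional, TypedDict
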
class EvalAsServiceOrchestratorState(TypedDict, total=False):
    """State for Evaluation-as-a-Service Orchestrator Agent"""

    # Input fields
    scenario_id: Optional[str]              # Specific scenario to evaluate (None = evaluate all)
    target_agent_id: Optional[str]          # Specific agent to evaluate (None = evaluate all)

    # Goal & Planning fields (MVP: Fixed goal, template-based plan)
    goal: Dict[str, Any]                    # Goal definition (from goal_node)
    plan: List[Dict[str, Any]]              # Execution plan (from planning_node)

    # Data Ingestion
    journey_scenarios: List[Dict[str, Any]]  # Loaded test scenarios
    # Structure per scenario:
    # {
    #   "scenario_id": "S001",
    #   "customer_id": "C001",
    #   "order_id": "O1001",
    #   "customer_message": "Hi, my order hasn't arrived yet...",
    #   "expected_issue_type": "where_is_my_order",
    #   "expected_resolution_path": ["shipping_update_agent"],
    #   "expected_outcome": "provide_delivery_update"
    # }

    specialist_agents: List[Dict[str, Any]]  # Loaded specialist agents to evaluate
    # Structure per agent:
    # {
    #   "agent_id": "refund_agent",
    #   "description": "Issues refunds for lost or incorrect orders.",
    #   "actions": {...}
    # }

    supporting_data: Dict[str, Any]         # Supporting data (customers, orders, logistics, marketing)
    # Structure:
    # {
    #   "customers": [...],
    #   "orders": [...],
    #   "logistics": {...},
    #   "marketing_signals": [...]
    # }

    decision_rules: Dict[str, Any]          # Orchestrator decision rules for validation

    # Historical Data (for baseline comparison and regression detection)
    historical_evaluation_runs: List[Dict[str, Any]]  # Historical evaluation runs metadata
    # Structure per run:
    # {
    #   "run_id": "RUN_2025_12_20",
    #   "run_timestamp": "2025-12-20T09:00:00Z",
    #   "target_orchestrator": "customer_support_orchestrator",
    #   "target_version": "v0.9",
    #   "scenario_count": 10,
    #   "evaluation_type": "pre_release",
    #   "notes": "..."
    # }

    historical_run_metrics: List[Dict[str, Any]]  # Summary metrics per historical run
    # Structure per run:
    # {
    #   "run_id": "RUN_2025_12_20",
    #   "overall_pass_rate": 0.75,
    #   "issue_classification_accuracy": 0.78,
    #   "resolution_path_accuracy": 0.72,
    #   "outcome_accuracy": 0.75,
    #   "high_risk_failures": 3,
    #   "human_review_rate": 0.40,
    #   "avg_latency_ms": 910,
    #   "p95_latency_ms": 1300,
    #   "regression_flags": [],
    #   "improvement_flags": []
    # }

    historical_scenario_evaluations: List[Dict[str, Any]]  # Detailed scenario evaluations from past runs
    # Structure per evaluation:
    # {
    #   "run_id": "RUN_2025_12_20",
    #   "scenario_id": "S006",
    #   "actual_issue_type": "delivery_delay",
    #   "expected_issue_type": "delivery_delay_with_churn_risk",
    #   "issue_type_match": false,
    #   "actual_resolution_path": [...],
    #   "expected_resolution_path": [...],
    #   "resolution_path_match": false,
    #   "actual_outcome": "...",
    #   "expected_outcome": "...",
    #   "outcome_match": false,
    #   "latency_ms": 910,
    #   "confidence_score": 0.61,
    #   "requires_human_review": true,
    #   "failure_reason": "..."
    # }

    # Evaluation Execution
    executed_evaluations: List[Dict[str, Any]]  # Completed evaluations
    # Structure per evaluation:
    # {
    #   "scenario_id": "S001",
    #   "target_agent_id": "shipping_update_agent",
    #   "input": {...},
    #   "actual_output": {...},
    #   "expected_output": {...},
    #   "execution_time_seconds": 0.5,
    #   "status": "completed" | "failed" | "timeout",
    #   "error": Optional[str]
    # }

    # Scoring & Analysis
    evaluation_scores: List[Dict[str, Any]]  # Scores per evaluation
    # Structure per score:
    # {
    #   "scenario_id": "S001",
    #   "target_agent_id": "shipping_update_agent",
    #   "correctness_score": 0.95,  # 0-1, matches expected outcome
    #   "response_time_score": 0.90,  # 0-1, based on thresholds
    #   "output_quality_score": 0.85,  # 0-1, based on structure/format
    #   "overall_score": 0.91,  # Weighted average (0.50*0.95 + 0.20*0.90 + 0.30*0.85)
    #   "passed": True,  # Overall score >= threshold
    #   "issues": ["slight_format_deviation"]
    # }

    agent_performance_summary: Dict[str, Any]  # Performance summary per agent
    # Structure per agent:
    # {
    #   "agent_id": "shipping_update_agent",
    #   "total_evaluations": 10,
    #   "passed_count": 9,
    #   "failed_count": 1,
    #   "average_score": 0.88,
    #   "average_response_time": 0.45,
    #   "health_status": "healthy" | "degraded" | "critical"
    # }

    # Quality Control Metrics (using toolshed)
    performance_metrics: Dict[str, Any]      # Performance tracking metrics
    workflow_analysis: List[Dict[str, Any]]  # Workflow health analysis
    validation_results: List[Dict[str, Any]]  # Data validation results

    # Progress Tracking (using toolshed)
    progress_percentage: float              # 0-100
    evaluations_completed: int              # Count of completed evaluations
    evaluations_total: int                  # Total evaluations to run
    elapsed_time_seconds: float             # Time since evaluation start
    estimated_remaining_seconds: float      # Estimated time to completion
    evaluation_start_time: Optional[str]     # ISO timestamp when evaluation started

    # Summary Metrics
    evaluation_summary: Dict[str, Any]
    # Structure:
    # {
    #   "total_scenarios": 10,
    #   "total_evaluations": 30,
    #   "total_passed": 27,
    #   "total_failed": 3,
    #   "overall_pass_rate": 0.90,
    #   "average_score": 0.87,
    #   "agents_evaluated": 4,
    #   "healthy_agents": 3,
    #   "degraded_agents": 1,
    #   "critical_agents": 0
    # }

    # Historical Comparison & Regression Detection
    baseline_comparison: Optional[Dict[str, Any]]  # Comparison to baseline run
    # Structure:
    # {
    #   "baseline_run_id": "RUN_2026_01_10",
    #   "current_pass_rate": 0.92,
    #   "baseline_pass_rate": 0.88,
    #   "improvement": 0.04,
    #   "improvement_percentage": 4.55,
    #   "regression_detected": false,
    #   "regression_details": []
    # }

    regression_analysis: Optional[Dict[str, Any]]  # Regression analysis vs previous runs
    # Structure:
    # {
    #   "regressions_detected": [],
    #   "improvements_detected": [],
    #   "trend_analysis": {...}
    # }

    trend_analysis: Optional[Dict[str, Any]]  # Trend analysis across historical runs
    # Structure:
    # {
    #   "pass_rate_trend": "improving" | "declining" | "stable",
    #   "latency_trend": "improving" | "declining" | "stable",
    #   "accuracy_trend": "improving" | "declining" | "stable"
    # }

    # Output
    evaluation_report: str                  # Final markdown report
    report_file_path: Optional[str]        # Path to saved report file

    # Metadata
    errors: Annotated[List[str], operator.add]  # Any errors encountered
    processing_time: Optional[float]      # Time taken to process


@dataclass
class EvalAsServiceOrchestratorConfig:
    """Configuration for Evaluation-as-a-Service Orchestrator Agent"""
    llm_model: str = os.getenv("LLM_MODEL", "gpt-4o-mini")
    temperature: float = 0.3
    reports_dir: str = "output/eval_as_service_reports"  # Where to save reports

    # Data file paths
    data_dir: str = "agents/data"
    journey_scenarios_file: str = "journey_scenarios.json"
    specialist_agents_file: str = "specialist_agents.json"
    customers_file: str = "customers.json"
    orders_file: str = "orders.json"
    logistics_file: str = "logistics_api.json"
    marketing_signals_file: str = "marketing_signals.json"
    decision_rules_file: str = "orchestrator_decision_rules.json"
    evaluation_runs_file: str = "evaluation_runs.json"
    run_summary_metrics_file: str = "run_summary_metrics.json"
    scenario_evaluations_file: str = "scenario_evaluations.json"

    # Evaluation Settings
    pass_threshold: float = 0.80  # Minimum score to pass (0-1)
    response_time_threshold_seconds: float = 2.0  # Max acceptable response time
    enable_parallel_evaluation: bool = False  # MVP: Sequential only

    # Scoring Weights
    scoring_weights: Dict[str, float] = field(default_factory=lambda: {
        "correctness": 0.50,      # Matches expected outcome
        "response_time": 0.20,     # Response time performance
        "output_quality": 0.30     # Output structure/format quality
    })

    # Health Status Thresholds
    health_thresholds: Dict[str, float] = field(default_factory=lambda: {
        "healthy": 0.85,      # >= 85% average score
        "degraded": 0.70,     # 70-85% average score
        "critical": 0.0       # < 70% average score
    })

    # Toolshed Integration
    enable_progress_tracking: bool = True   # Use toolshed.progress
    enable_performance_tracking: bool = True  # Use toolshed.performance
    enable_workflow_analysis: bool = True    # Use toolshed.workflows
    enable_validation: bool = True          # Use toolshed.validation
    enable_kpi_tracking: bool = True        # Use toolshed.kpi
    enable_reporting: bool = True          # Use toolshed.reporting

    # LLM Enhancement (Optional - Phase 8)
    enable_llm_summaries: bool = True      # Enable LLM-enhanced executive summaries
    llm_summary_max_tokens: int = 500      # Max tokens for LLM summaries




# Why These Configurable Controls Matter (And Why Leaders Care)

This configuration block is where the Evaluation-as-a-Service agent fundamentally separates itself from most AI systems in production today.

Instead of embedding judgment inside opaque logic or prompts, this system **exposes its standards explicitly** — allowing leaders to see, adjust, and govern how AI performance is measured.

That difference is subtle technically, but enormous organizationally.

---

## Evaluation Settings: Turning Expectations Into Policy

### `pass_threshold`

This setting defines the **minimum acceptable performance level** for an agent to be considered production-ready.

In practical terms, this answers a question executives constantly ask but rarely get a clear answer to:

> *“How good is good enough?”*

By making this threshold explicit and configurable:

* Performance standards are no longer subjective
* Teams stop arguing over anecdotes
* Deployment decisions become policy-driven, not opinion-driven

Many agents in production today **do not have a formal pass/fail definition**. They rely on informal testing or gut feel. This system replaces that ambiguity with an explicit, configurable threshold.

---

### `response_time_threshold_seconds`

This setting encodes **user experience expectations** directly into evaluation.

It ensures that an agent is not only correct, but *fast enough to be usable*.

From a business perspective, this is critical because:

* Slow responses silently kill adoption
* Latency failures often go unnoticed until customers complain
* Engineering teams optimize accuracy while ignoring experience

Here, response time is treated as a **first-class performance constraint**, not an afterthought.

That is how real products are judged, so it is how agents are evaluated here.
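The state only says that `response_time_score` is "0-1, based on thresholds", so the exact formula is not specified. One possible reading, sketched below purely as an assumption, gives full marks within the threshold and decays linearly to zero at twice the threshold. The function name and the decay shape are hypothetical.

```python
def response_time_score(elapsed_seconds: float, threshold_seconds: float = 2.0) -> float:
    """Map a response time to a 0-1 score (assumed linear decay past the threshold)."""
    if elapsed_seconds <= threshold_seconds:
        return 1.0
    # Fraction of the threshold by which the response overshot.
    overshoot = (elapsed_seconds - threshold_seconds) / threshold_seconds
    return max(0.0, 1.0 - overshoot)


for seconds in (0.5, 3.0, 4.0):
    print(seconds, response_time_score(seconds))
```

A hard cutoff (1.0 or 0.0) would be simpler, but a gradual decay distinguishes a slightly slow agent from one that is unusable.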

---

### `enable_parallel_evaluation`

This flag is a quiet but important signal of maturity.

By explicitly choosing **sequential execution for MVP**, the system prioritizes:

* determinism
* debuggability
* auditability

Leaders are often nervous about AI systems because they move too fast to understand. This design choice shows restraint and intentionality — speed is added *after* trust is established.

Most AI systems start with scale and struggle to retrofit control. You’ve done the opposite.

---

## Scoring Weights: Aligning AI Behavior With Business Priorities

### `scoring_weights`

This configuration answers a deceptively simple question:

> *“What do we value most?”*

By breaking performance into:

* correctness
* response time
* output quality

and weighting them explicitly, the system allows leadership to **encode business priorities into evaluation logic**.

This is powerful because:

* Different teams can prioritize differently
* The same agent can be evaluated under different standards
* Tradeoffs become visible and intentional

In contrast, most AI agents are optimized implicitly through prompts or training — priorities are hidden, not declared.

This system makes value judgments transparent.
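A minimal sketch of how the weights could combine the three component scores is shown below. The default weights and `pass_threshold` value are taken from the config; the function name `overall_score` and the alternative "speed first" weighting are assumptions for illustration.

```python
from typing import Dict

DEFAULT_WEIGHTS = {"correctness": 0.50, "response_time": 0.20, "output_quality": 0.30}
PASS_THRESHOLD = 0.80  # matches the config default for pass_threshold


def overall_score(scores: Dict[str, float], weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of component scores; weights are normalized defensively."""
    total_weight = sum(weights.values())
    weighted = sum(scores[name] * weight for name, weight in weights.items())
    return weighted / total_weight


scores = {"correctness": 0.95, "response_time": 0.90, "output_quality": 0.85}

default_result = overall_score(scores)
# Hypothetical alternative: a team that values latency above everything else.
speed_first = overall_score(
    scores, {"correctness": 0.30, "response_time": 0.50, "output_quality": 0.20}
)
passed = default_result >= PASS_THRESHOLD
print(round(default_result, 3), round(speed_first, 3), passed)
```

The same component scores produce different overall results under different weightings, which is what lets two teams evaluate one agent against their own priorities.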

---

## Health Status Thresholds: Translating Metrics Into Decisions

### `health_thresholds`

These thresholds convert raw scores into **clear operational states**:

* healthy
* degraded
* critical

This is not cosmetic — it’s decision-enabling.

Executives don’t want charts. They want answers like:

* “Is this safe to run?”
* “Do we need intervention?”
* “Can we scale this?”

By mapping performance into health states, the system:

* shortens decision cycles
* enables automated gating
* supports escalation workflows

Most AI systems stop at metrics. This system goes one step further and **interprets them operationally**.
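The mapping from an average score to a health state can be sketched in a few lines. The threshold values come from the config; the function name `classify_health` is an assumption, and the original implementation may handle boundaries differently.

```python
from typing import Dict

HEALTH_THRESHOLDS = {"healthy": 0.85, "degraded": 0.70, "critical": 0.0}


def classify_health(average_score: float, thresholds: Dict[str, float] = HEALTH_THRESHOLDS) -> str:
    """Return the best health state whose cutoff the average score meets."""
    # Check the highest cutoff first so a score lands in the best state it qualifies for.
    ordered = sorted(thresholds.items(), key=lambda item: item[1], reverse=True)
    for status, cutoff in ordered:
        if average_score >= cutoff:
            return status
    return "critical"  # fallback for scores below every cutoff


for value in (0.88, 0.75, 0.50):
    print(value, classify_health(value))
```

Sorting by cutoff rather than relying on dictionary order means the thresholds can be edited or extended without changing the function.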

---

## Toolshed Integrations: Instrumentation by Default

The toolshed flags signal another major difference from typical agents:

This system assumes that **observability is mandatory**, not optional.

Each flag controls whether the agent:

* tracks progress
* logs performance
* analyzes workflows
* validates inputs
* monitors KPIs
* generates structured reports

This reflects how mature software systems operate, yet many AI agents in production today ship with little or no instrumentation.

Here, visibility is built in from day one.

For leaders, this is deeply reassuring:

* Problems don’t stay hidden
* Performance can be reviewed after the fact
* Accountability doesn’t rely on heroics

---

## LLM Enhancements: Intelligence Without Loss of Control

### `enable_llm_summaries`

This final section is subtle — and extremely important.

The LLM is positioned as a **communication layer**, not a decision-maker.

It summarizes results that are:

* already computed
* already scored
* already validated

This preserves:

* control
* explainability
* auditability

Most agents invert this relationship, letting LLMs *decide* and then attempting to explain later. This system does the opposite.

The LLM explains what the system already knows to be true.

That distinction is what allows executives to trust it.
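One way to keep the LLM in a communication-only role is sketched below. The facts are assembled from already-computed metrics, and the LLM (represented here by a hypothetical `llm_rewrite` callable, since the real client is not shown in this document) may only rephrase them. If the flag is off or the call fails, the deterministic text is returned.

```python
from typing import Any, Callable, Dict, Optional


def build_executive_summary(
    evaluation_summary: Dict[str, Any],
    enable_llm_summaries: bool = True,
    llm_rewrite: Optional[Callable[[str], str]] = None,  # hypothetical LLM hook
) -> str:
    """Build a summary whose facts never originate from the LLM."""
    facts = (
        f"{evaluation_summary['total_passed']} of "
        f"{evaluation_summary['total_evaluations']} evaluations passed "
        f"(pass rate {evaluation_summary['overall_pass_rate']:.0%})."
    )
    if not enable_llm_summaries or llm_rewrite is None:
        return facts
    try:
        return llm_rewrite(facts)
    except Exception:
        # If the LLM call fails, fall back to the deterministic text.
        return facts


summary = {"total_evaluations": 30, "total_passed": 27, "overall_pass_rate": 0.90}
print(build_executive_summary(summary))
```

In this sketch the report is still complete with the LLM disabled, which is consistent with treating it as an optional enhancement.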





# EaaS Orchestrator - Phase 0 Planning Document

## Data Structure Analysis

### Journey Scenarios (`journey_scenarios.json`)
- **Structure**: List of 10 test scenarios
- **Key Fields**:
  - `scenario_id`: Unique identifier (S001-S010)
  - `customer_id`: Links to customer data (C001-C005)
  - `order_id`: Links to order data (O1001-O1005)
  - `customer_message`: Input message for the orchestrator
  - `expected_issue_type`: Expected classification (e.g., "where_is_my_order", "delivery_delay")
  - `expected_resolution_path`: List of agent IDs in order (e.g., ["shipping_update_agent"])
  - `expected_outcome`: Expected final outcome string
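
The key fields above lend themselves to a simple structural check at load time. The following is a sketch only: the helper name `validate_scenario` and the choice of type checks are assumptions, and the sample record reuses the example values from the state definition.

```python
from typing import Any, Dict, List

REQUIRED_SCENARIO_FIELDS = {
    "scenario_id": str,
    "customer_id": str,
    "order_id": str,
    "customer_message": str,
    "expected_issue_type": str,
    "expected_resolution_path": list,
    "expected_outcome": str,
}


def validate_scenario(scenario: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for name, expected_type in REQUIRED_SCENARIO_FIELDS.items():
        if name not in scenario:
            problems.append(f"missing field: {name}")
        elif not isinstance(scenario[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


sample = {
    "scenario_id": "S001",
    "customer_id": "C001",
    "order_id": "O1001",
    "customer_message": "Hi, my order hasn't arrived yet...",
    "expected_issue_type": "where_is_my_order",
    "expected_resolution_path": ["shipping_update_agent"],
    "expected_outcome": "provide_delivery_update",
}
print(validate_scenario(sample))
```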

### Specialist Agents (`specialist_agents.json`)
- **Structure**: Dictionary with 4 agents
- **Agents**: refund_agent, shipping_update_agent, apology_message_agent, escalation_agent
- **Key Fields**:
  - `agent_id`: Unique identifier
  - `description`: What the agent does
  - `actions`: Dictionary of actions with response templates
  - Action-specific configs (e.g., `default_refund_amounts`, `priority_rules`)

### Supporting Data
- **customers.json**: List of 5 customers with loyalty_tier, churn_risk
- **orders.json**: List of 5 orders with status, carrier, warehouse_issue_flag
- **logistics_api.json**: Nested dict by carrier → order_id → status/details
- **marketing_signals.json**: Customer engagement metrics
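
Since scenarios reference customers and orders by ID, the loader will likely want ID-keyed lookups. The sketch below assumes customer records carry a `customer_id` key and order records an `order_id` key (implied by the scenario links but not confirmed here); the sample rows and carrier name are invented for illustration.

```python
from typing import Any, Dict, List, Optional


def build_lookup(records: List[Dict[str, Any]], key: str) -> Dict[str, Dict[str, Any]]:
    """Index a list of records by one of their fields."""
    return {record[key]: record for record in records}


def find_logistics(
    logistics: Dict[str, Dict[str, Any]], carrier: str, order_id: str
) -> Optional[Dict[str, Any]]:
    """Walk the nested carrier -> order_id structure; None if either level is missing."""
    return logistics.get(carrier, {}).get(order_id)


# Made-up sample rows that follow the shapes described above.
customers = [{"customer_id": "C001", "loyalty_tier": "gold", "churn_risk": 0.2}]
orders = [
    {"order_id": "O1001", "status": "in_transit", "carrier": "carrier_a",
     "warehouse_issue_flag": False}
]
logistics = {"carrier_a": {"O1001": {"status": "delayed"}}}

customers_by_id = build_lookup(customers, "customer_id")
orders_by_id = build_lookup(orders, "order_id")

order = orders_by_id["O1001"]
tracking = find_logistics(logistics, order["carrier"], order["order_id"])
print(tracking)
```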

### Decision Rules (`orchestrator_decision_rules.json`)
- **Structure**: Contains both JSON rules and Python functions
- **Functions Available**:
  - `classify_issue(order, ticket, customer, logistics)` → issue_type
  - `determine_resolution_path(issue_type)` → list of agent IDs
  - `determine_expected_outcome(issue_type)` → outcome string
- **Note**: This file contains Python code, not pure JSON. We'll need to extract the functions or implement them.

## Architecture Plan

### Node Structure (Linear Flow)
1. **goal_node** - Define evaluation goal
2. **planning_node** - Create execution plan
3. **data_loading_node** - Load all data files
4. **evaluation_execution_node** - Run scenarios through agents
5. **scoring_analysis_node** - Compare and score results
6. **report_generation_node** - Generate final report
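
The linear flow can be sketched in plain Python as a list of node functions, each returning a partial state update. This is not the orchestration framework the agent actually uses; `run_linear` and the stub node bodies are assumptions for illustration. Error lists are concatenated across nodes, which mirrors the `operator.add` annotation on the `errors` field.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def goal_node(state: State) -> State:
    """Stub: the MVP uses a fixed goal."""
    return {"goal": {"objective": "evaluate_specialist_agents"}}


def planning_node(state: State) -> State:
    """Stub: a template-based plan listing the remaining steps."""
    steps = ["data_loading", "evaluation_execution", "scoring_analysis", "report_generation"]
    return {"plan": [{"step": name} for name in steps]}


def run_linear(nodes: List[Node], state: State) -> State:
    """Run nodes in order, merging each partial update into the state."""
    for node in nodes:
        update = dict(node(state))
        # Errors accumulate across nodes instead of being overwritten.
        errors = state.get("errors", []) + update.pop("errors", [])
        state = {**state, **update, "errors": errors}
    return state


final_state = run_linear([goal_node, planning_node], {})
print([step["step"] for step in final_state["plan"]])
```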

### Utility Structure
```
utilities/
├── data_loading.py      # Load JSON files, build lookups
├── evaluation_execution.py  # Execute scenarios, simulate agent calls
├── scoring.py          # Score correctness, response time, quality
├── analysis.py         # Aggregate scores, calculate metrics
└── report_generation.py  # Generate markdown reports
```
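
As an example of what `analysis.py` might contain, the sketch below aggregates per-evaluation scores into a per-agent summary. The function name `summarize_by_agent` and the sample scores are assumptions; the output keys follow the `agent_performance_summary` structure in the state.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_by_agent(evaluation_scores: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group scores by target agent and compute counts and the average score."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for score in evaluation_scores:
        grouped[score["target_agent_id"]].append(score)

    summary = {}
    for agent_id, items in grouped.items():
        passed = sum(1 for item in items if item["passed"])
        average = sum(item["overall_score"] for item in items) / len(items)
        summary[agent_id] = {
            "agent_id": agent_id,
            "total_evaluations": len(items),
            "passed_count": passed,
            "failed_count": len(items) - passed,
            "average_score": round(average, 2),
        }
    return summary


# Illustrative scores only.
sample_scores = [
    {"target_agent_id": "shipping_update_agent", "overall_score": 0.90, "passed": True},
    {"target_agent_id": "shipping_update_agent", "overall_score": 0.70, "passed": False},
    {"target_agent_id": "refund_agent", "overall_score": 0.85, "passed": True},
]
agent_summary = summarize_by_agent(sample_scores)
print(agent_summary["shipping_update_agent"])
```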

## Key Assumptions

1. **Agent Simulation**: Since we're evaluating agents, we'll simulate their behavior based on:
   - Specialist agent definitions (response templates)
   - Decision rules (to determine which agents get called)
   - Supporting data (to populate responses)

2. **Evaluation Logic**: For each scenario:
   - Extract customer_message, customer_id, order_id
   - Load supporting data (customer, order, logistics, marketing)
   - Apply the decision rules to classify the issue_type and determine the resolution_path
   - Simulate orchestrator calling agents in resolution_path
   - Compare actual vs expected outputs

3. **Scoring Dimensions**:
   - **Correctness** (50% weight): Does actual match expected?
   - **Response Time** (20% weight): Is it within threshold?
   - **Output Quality** (30% weight): Is structure/format correct?

4. **Data Loading**: All data files are in `agents/data/` directory

5. **Decision Rules**: The orchestrator_decision_rules.json file contains Python code. We'll need to either:
   - Extract and import the functions
   - Re-implement the logic based on the JSON structure
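
The comparison step in the evaluation logic can be sketched as follows. The `compare_scenario` helper and the shape of the `actual` dictionary are assumptions; the match flags reuse the names from the `historical_scenario_evaluations` structure, and the simulated output is invented to show a partial mismatch.

```python
from typing import Any, Dict


def compare_scenario(scenario: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Any]:
    """Compare simulated orchestrator output against a scenario's expectations."""
    return {
        "scenario_id": scenario["scenario_id"],
        "issue_type_match": actual["issue_type"] == scenario["expected_issue_type"],
        # Order matters: the path is an ordered list of agent IDs.
        "resolution_path_match": (
            actual["resolution_path"] == scenario["expected_resolution_path"]
        ),
        "outcome_match": actual["outcome"] == scenario["expected_outcome"],
    }


scenario = {
    "scenario_id": "S001",
    "expected_issue_type": "where_is_my_order",
    "expected_resolution_path": ["shipping_update_agent"],
    "expected_outcome": "provide_delivery_update",
}
actual = {  # hypothetical simulated output with one extra agent in the path
    "issue_type": "where_is_my_order",
    "resolution_path": ["shipping_update_agent", "apology_message_agent"],
    "outcome": "provide_delivery_update",
}
result = compare_scenario(scenario, actual)
print(result)
```

Keeping the three match flags separate, rather than collapsing them into one boolean, is what allows a report to say which part of the journey went wrong.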

## Testing Strategy

- **Phase 1 Test**: Goal and planning nodes work independently
- **Phase 2 Test**: Data loading works, all files load correctly
- **Phase 3 Test**: Evaluation execution runs scenarios, produces outputs
- **Phase 4 Test**: Scoring produces correct scores, analysis aggregates correctly
- **Phase 5 Test**: Report generation creates valid markdown

## Toolshed Integration

- **Progress Tracking**: Track evaluation progress (X of Y scenarios)
- **KPI Tracking**: Track pass rates, average scores
- **Reporting**: Use toolshed.reporting for report structure
- **Validation**: Validate data structures on load
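
For progress tracking, the state fields can be derived from three numbers. The sketch below does not use the toolshed API (which is not shown in this document); `progress_snapshot` is a hypothetical helper, and the remaining-time estimate naively assumes each remaining evaluation takes the average time observed so far.

```python
from typing import Dict


def progress_snapshot(completed: int, total: int, elapsed_seconds: float) -> Dict[str, float]:
    """Compute progress fields matching the names used in the orchestrator state."""
    percentage = completed * 100 / total if total else 0.0
    # Naive estimate: average time per completed evaluation times remaining count.
    remaining = (elapsed_seconds / completed) * (total - completed) if completed else 0.0
    return {
        "progress_percentage": percentage,
        "evaluations_completed": completed,
        "evaluations_total": total,
        "elapsed_time_seconds": elapsed_seconds,
        "estimated_remaining_seconds": remaining,
    }


snapshot = progress_snapshot(completed=3, total=10, elapsed_seconds=6.0)
print(snapshot["progress_percentage"], snapshot["estimated_remaining_seconds"])
```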
