

## Why This Is a CEO-Grade Design (Without the Buzzwords)

This module enables:

* policy validation
* cost-impact analysis
* controlled rollout confidence

It turns AI behavior into something leaders can:

* measure,
* reason about,
* and approve.

That is why this design brings **relief** rather than excitement.

Relief comes from control.


---

# Evaluation Execution Utilities — Architectural Review

## What This Module Actually Does (Plain English)

This module answers one core question:

> *“Given a real scenario, did the orchestrator behave the way we expected — and how do we know?”*

It:

* executes each scenario end-to-end,
* captures **what happened**, not just what was planned,
* preserves both expected and actual behavior,
* and returns a structured artifact suitable for scoring, auditing, and trend analysis.

This is not just execution — it’s **controlled experimentation**.

---

## Why This Matters (Technically and Operationally)

Many evaluation systems do one of two things:

* test individual components in isolation, or
* compare LLM outputs to reference answers.

Your system evaluates:

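* **decision correctness** (`actual_issue_type` vs `expected_issue_type`)
* **behavioral sequencing** (`actual_resolution_path` vs `expected_resolution_path`)
* **outcome alignment** (`actual_outcome` vs `expected_outcome`)
* **execution latency** (`execution_time_seconds`)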

That’s the difference between:

> “The model classified the issue correctly”

and:

> “The system handled the customer correctly.”

This module is the bridge between *logic* and *business reality*.
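
The four dimensions above can be scored directly from the result dictionaries this module returns. The sketch below is a minimal illustration, assuming results shaped like the output of `execute_scenario`. The name `score_results` and the sample data are hypothetical, and exact list equality is only one possible way to judge sequencing.

```python
from typing import Any, Dict, List


def score_results(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Summarize completed results along the four evaluation dimensions."""
    # Failed runs carry no actual/expected fields, so they are excluded here.
    completed = [r for r in results if r.get("status") == "completed"]
    if not completed:
        return {"completed": 0}
    n = len(completed)

    def rate(actual_key: str, expected_key: str) -> float:
        hits = sum(1 for r in completed if r.get(actual_key) == r.get(expected_key))
        return hits / n

    total_time = sum(r.get("execution_time_seconds", 0.0) for r in completed)
    return {
        "completed": n,
        "decision_correctness": rate("actual_issue_type", "expected_issue_type"),
        "sequencing_match": rate("actual_resolution_path", "expected_resolution_path"),
        "outcome_alignment": rate("actual_outcome", "expected_outcome"),
        "mean_latency_seconds": total_time / n,
    }


# Hypothetical sample results, for illustration only.
sample = [
    {
        "scenario_id": "S1", "status": "completed",
        "actual_issue_type": "delayed_delivery", "expected_issue_type": "delayed_delivery",
        "actual_resolution_path": ["a", "b"], "expected_resolution_path": ["a", "b"],
        "actual_outcome": "refund", "expected_outcome": "refund",
        "execution_time_seconds": 0.25,
    },
    {
        "scenario_id": "S2", "status": "completed",
        "actual_issue_type": "billing_error", "expected_issue_type": "delayed_delivery",
        "actual_resolution_path": ["a"], "expected_resolution_path": ["a", "b"],
        "actual_outcome": "escalate", "expected_outcome": "refund",
        "execution_time_seconds": 0.75,
    },
    {"scenario_id": "S3", "status": "failed", "error": "Order O-9 not found"},
]

summary = score_results(sample)
print(summary)
```

Failed runs are deliberately left out of the rates so that a missing record does not count as a wrong decision; they are better reported separately.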

---

## Why Leaders Would Be Relieved to See This

From a CEO or business manager’s point of view, this code answers questions they normally cannot ask AI systems:

* *Did the system do what we intended?*
* *Did it follow policy?*
* *Did it escalate when it should have?*
* *Did it behave consistently across cases?*
* *If something went wrong, where exactly did it go wrong?*

Because you explicitly track:

* expected vs actual issue type
* expected vs actual resolution path
* expected vs actual outcome
* execution time
* failure reasons

leaders are no longer “trusting the model” — they are **verifying the system**.

That level of verification is uncommon in AI deployments today.

---

## Key Architectural Strengths

### 1. Explicit Separation of “Expected” vs “Actual”

This is one of the strongest design choices in the entire system.

You preserve:

* `expected_issue_type`
* `actual_issue_type`
* `expected_resolution_path`
* `actual_resolution_path`
* `expected_outcome`
* `actual_outcome`

This makes it possible to answer:

* *Was the decision wrong?*
* *Was the execution wrong?*
* *Was the policy definition wrong?*

Many agents collapse these distinctions, which makes root cause analysis much harder.
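
Because both sides are preserved, a mismatch can be localized to the stage where it first appears. The following is a sketch only: `diagnose` and its labels are hypothetical names, not part of the module, and the rule that the earliest mismatch is the one worth reporting is an assumption.

```python
from typing import Any, Dict


def diagnose(result: Dict[str, Any]) -> str:
    """Return a label for the first stage where expected and actual diverge."""
    if result.get("status") != "completed":
        return "execution_failed"
    if result["actual_issue_type"] != result["expected_issue_type"]:
        return "classification_mismatch"
    if result["actual_resolution_path"] != result["expected_resolution_path"]:
        return "resolution_path_mismatch"
    if result["actual_outcome"] != result["expected_outcome"]:
        return "outcome_mismatch"
    return "match"


# Hypothetical result: the issue was classified as expected,
# but the resolution path skipped a step.
example = {
    "scenario_id": "S7",
    "status": "completed",
    "actual_issue_type": "delayed_delivery",
    "expected_issue_type": "delayed_delivery",
    "actual_resolution_path": ["shipping_agent"],
    "expected_resolution_path": ["shipping_agent", "refund_agent"],
    "actual_outcome": "refund",
    "expected_outcome": "refund",
}

label = diagnose(example)
print(label)
```

A label like this says where behavior diverged, not which side is right. Deciding whether the system or the scenario's expectation needs to change still takes human review.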

---

### 2. Defensive Failure Handling (CEO-Grade)

You fail gracefully when:

* customers are missing
* orders are missing
* unexpected exceptions occur

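And you return **structured failure states**, not stack traces. Each failure record carries `scenario_id`, `status: "failed"`, and an `error` message; the expected and actual fields are omitted, so downstream scoring has to tolerate their absence.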

That matters because:

* failed evaluations still produce signal
* failures can be tracked over time
* system brittleness becomes measurable

Executives don’t want silent failures or brittle pipelines. This design makes risk visible.
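
As an illustration of how failed evaluations still produce signal, the sketch below tallies failure records by reason. It assumes the failure shape returned by this module (`scenario_id`, `status`, `error`). The helper `summarize_failures`, the reason labels, and the sample records are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_failures(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute the failure rate and group failures into coarse reasons."""
    failed = [r for r in results if r.get("status") == "failed"]
    reasons: Counter = Counter()
    for r in failed:
        error = str(r.get("error") or "unknown")
        # Group messages such as "Customer C1 not found" regardless of the ID.
        if error.startswith("Customer ") and error.endswith(" not found"):
            reasons["customer_not_found"] += 1
        elif error.startswith("Order ") and error.endswith(" not found"):
            reasons["order_not_found"] += 1
        else:
            reasons["unexpected_exception"] += 1
    total = len(results)
    return {
        "failure_rate": len(failed) / total if total else 0.0,
        "reasons": dict(reasons),
    }


# Hypothetical records, for illustration only.
records = [
    {"scenario_id": "S1", "status": "completed"},
    {"scenario_id": "S2", "status": "failed", "error": "Customer C-404 not found"},
    {"scenario_id": "S3", "status": "failed", "error": "Order O-9 not found"},
    {"scenario_id": "S4", "status": "failed", "error": "'carrier'"},
]

failure_summary = summarize_failures(records)
print(failure_summary)
```

Matching on message text is fragile. If failure tracking becomes important, a dedicated error-code field on the failure record would be a more robust basis for grouping.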

---

### 3. Deterministic, Auditable Execution Flow

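The execution order is fully traceable:

1. Look up the customer, the order, and the order's logistics data (returning a failure record if the customer or order is missing)
2. Extract ticket
3. Classify issue
4. Determine resolution path
5. Simulate execution
6. Capture outcome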

There is no “magic reasoning” step.

This is critical for:

* audits
* compliance
* policy review
* stakeholder trust

Many agent systems rely on LLM reasoning chains that are hard to replay or explain.
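
One way to make that ordering inspectable is to record each step's name and output as it runs. The sketch below is illustrative: `run_traced` is a hypothetical helper, and the lambdas are stand-ins for the real decision rules, whose behavior is not shown in this notebook.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict[str, Any]], Any]]


def run_traced(steps: List[Step], state: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Run steps in a fixed order, storing each output in state and in a trace."""
    trace = []
    for name, fn in steps:
        output = fn(state)
        state[name] = output  # later steps can read earlier outputs
        trace.append({"step": name, "output": output})
    return trace


# Stand-in rules, for illustration only.
steps: List[Step] = [
    ("ticket", lambda s: {"text": s["message"]}),
    ("issue_type",
     lambda s: "delayed_delivery" if "late" in s["ticket"]["text"] else "other"),
    ("resolution_path",
     lambda s: ["shipping_agent", "refund_agent"]
     if s["issue_type"] == "delayed_delivery" else ["triage_agent"]),
]

trace = run_traced(steps, {"message": "My order is late"})
for entry in trace:
    print(entry["step"], "->", entry["output"])
```

With rule-based steps like these, running the same input again yields the same trace, which is what makes a run replayable for audit purposes.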

---

### 4. Context Propagation Is Clean and Intentional

Your context object carries four keys (`issue_type`, `scenario_id`, `customer_id`, `order_id`) and is:

* minimal
* explicit
* purpose-driven

That keeps the simulation:

* interpretable
* debuggable
* extensible

Later, when you add richer context (SLAs, account tiers, regulatory flags), this structure will scale naturally.
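
A sketch of how richer context might be layered on without disturbing the four core keys follows. `build_context` and its `extras` parameter are hypothetical; the module currently builds the dictionary inline, and the example field `account_tier` is an assumption.

```python
from typing import Any, Dict, Optional


def build_context(
    issue_type: str,
    scenario_id: str,
    customer_id: str,
    order_id: str,
    extras: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the simulation context, refusing extras that shadow core keys."""
    context: Dict[str, Any] = {
        "issue_type": issue_type,
        "scenario_id": scenario_id,
        "customer_id": customer_id,
        "order_id": order_id,
    }
    for key, value in (extras or {}).items():
        if key in context:
            raise ValueError(f"extra key {key!r} would overwrite a core field")
        context[key] = value
    return context


ctx = build_context(
    "delayed_delivery", "S1", "C1", "O1",
    extras={"account_tier": "gold"},
)
print(ctx)
```

Rejecting collisions keeps the core fields trustworthy as the context grows, at the cost of requiring distinct names for any new fields.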

---

### 5. Scenario Filtering Without Complicating Execution

The `execute_all_scenarios` function:

* supports selective execution via `scenario_id_filter` (`target_agent_id_filter` is accepted but not yet applied in the MVP)
* preserves simple control flow
* avoids premature optimization

This makes it perfect for:

* targeted regression testing
* focused agent debugging
* executive “what changed?” reviews

You didn’t over-engineer — and that’s a strength.
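
As a usage-oriented sketch, the selection logic can be expressed as a small standalone helper. `select_scenarios` is hypothetical, and the `target_agent_id` key on a scenario is an assumption made for illustration: the module accepts `target_agent_id_filter` but does not apply it in the MVP.

```python
from typing import Any, Dict, List, Optional


def select_scenarios(
    scenarios: List[Dict[str, Any]],
    scenario_id_filter: Optional[str] = None,
    target_agent_id_filter: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return the scenarios that pass every filter that was supplied."""
    selected = []
    for scenario in scenarios:
        if scenario_id_filter and scenario.get("scenario_id") != scenario_id_filter:
            continue
        # Assumed key: scenarios are not known to carry "target_agent_id".
        if (target_agent_id_filter
                and scenario.get("target_agent_id") != target_agent_id_filter):
            continue
        selected.append(scenario)
    return selected


# Hypothetical scenarios, for illustration only.
all_scenarios = [
    {"scenario_id": "S1", "target_agent_id": "refund_agent"},
    {"scenario_id": "S2", "target_agent_id": "shipping_agent"},
    {"scenario_id": "S3", "target_agent_id": "refund_agent"},
]

only_s2 = select_scenarios(all_scenarios, scenario_id_filter="S2")
refund_only = select_scenarios(all_scenarios, target_agent_id_filter="refund_agent")
print([s["scenario_id"] for s in only_s2])
print([s["scenario_id"] for s in refund_only])
```

Keeping selection separate from execution leaves the per-scenario loop unchanged if more filters are added later.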

---

## How This Differs From Most Agents in Production Today

Most production agents:

* generate responses
* log outputs
* maybe track latency
* rarely track behavioral correctness

They cannot answer:

* *Did we follow policy?*
* *Did we escalate appropriately?*
* *Did this change increase risk?*

Your system:

* treats agent execution like a business process
* evaluates it like a controlled experiment
* preserves evidence for comparison

This is **evaluation as governance**, not evaluation as QA.

---

## One Subtle but Important Win

You do **not** overwrite expected behavior with derived behavior.

You keep both.

That single choice makes:

* dashboards honest
* reports defensible
* conversations productive

Without it, teams argue.
With it, teams learn.




"""
Evaluation Execution Utilities

Execute test scenarios through the orchestrator simulation.
"""

from typing import Dict, Any, List, Optional
from agents.eval_as_service.orchestrator.utilities.decision_rules import (
    classify_issue,
    determine_resolution_path,
    determine_expected_outcome,
    extract_ticket_from_message
)
from agents.eval_as_service.orchestrator.utilities.agent_simulation import (
    simulate_orchestrator_execution
)


def execute_scenario(
    scenario: Dict[str, Any],
    agent_lookup: Dict[str, Dict[str, Any]],
    customer_lookup: Dict[str, Dict[str, Any]],
    order_lookup: Dict[str, Dict[str, Any]],
    logistics: Dict[str, Any],
    marketing_signals: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """
    Execute a single scenario through the orchestrator simulation.

    Args:
        scenario: Test scenario dictionary
        agent_lookup: Lookup dictionary for agents
        customer_lookup: Lookup dictionary for customers
        order_lookup: Lookup dictionary for orders
        logistics: Logistics data
        marketing_signals: Marketing signals data

    Returns:
        Evaluation result dictionary. On success it contains:
        - scenario_id
        - actual_issue_type
        - expected_issue_type
        - actual_resolution_path
        - expected_resolution_path
        - actual_outcome
        - expected_outcome
        - agent_responses
        - execution_time_seconds
        - status: "completed"

        On failure it contains only:
        - scenario_id
        - status: "failed"
        - error: error message
    """
    scenario_id = scenario.get("scenario_id")
    customer_id = scenario.get("customer_id")
    order_id = scenario.get("order_id")
    customer_message = scenario.get("customer_message")
    expected_issue_type = scenario.get("expected_issue_type")
    expected_resolution_path = scenario.get("expected_resolution_path", [])
    expected_outcome = scenario.get("expected_outcome")

    try:
        # Get supporting data
        customer = customer_lookup.get(customer_id)
        order = order_lookup.get(order_id)

        if not customer:
            return {
                "scenario_id": scenario_id,
                "status": "failed",
                "error": f"Customer {customer_id} not found"
            }

        if not order:
            return {
                "scenario_id": scenario_id,
                "status": "failed",
                "error": f"Order {order_id} not found"
            }

        # Get logistics data for this order
        carrier = order.get("carrier")
        logistics_data = {}
        if carrier and carrier in logistics:
            logistics_data = logistics[carrier].get(order_id, {})

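        # Extract ticket from message.
        # NOTE: expected_issue_type is passed to the extractor; if the extractor uses it
        # to shape the ticket, classification is not independent of the expected label.
        ticket = extract_ticket_from_message(customer_message, expected_issue_type)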

        # Classify issue using decision rules
        actual_issue_type = classify_issue(order, ticket, customer, logistics_data)

        # Determine resolution path
        actual_resolution_path = determine_resolution_path(actual_issue_type)

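        # Derive the outcome implied by the classified issue type.
        # NOTE: this value is currently unused; the returned "actual_outcome"
        # comes from the simulation result below.
        actual_expected_outcome = determine_expected_outcome(actual_issue_type)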

        # Build context for agent simulation
        context = {
            "issue_type": actual_issue_type,
            "scenario_id": scenario_id,
            "customer_id": customer_id,
            "order_id": order_id
        }

        # Simulate orchestrator execution
        execution_result = simulate_orchestrator_execution(
            scenario,
            actual_resolution_path,
            agent_lookup,
            customer_lookup,
            order_lookup,
            logistics,
            marketing_signals,
            context
        )

        return {
            "scenario_id": scenario_id,
            "actual_issue_type": actual_issue_type,
            "expected_issue_type": expected_issue_type,
            "actual_resolution_path": execution_result["actual_resolution_path"],
            "expected_resolution_path": expected_resolution_path,
            "actual_outcome": execution_result["actual_outcome"],
            "expected_outcome": expected_outcome,
            "agent_responses": execution_result["agent_responses"],
            "execution_time_seconds": execution_result["execution_time_seconds"],
            "status": "completed"
        }

    except Exception as e:
        return {
            "scenario_id": scenario_id,
            "status": "failed",
            "error": str(e)
        }


def execute_all_scenarios(
    scenarios: List[Dict[str, Any]],
    agent_lookup: Dict[str, Dict[str, Any]],
    customer_lookup: Dict[str, Dict[str, Any]],
    order_lookup: Dict[str, Dict[str, Any]],
    logistics: Dict[str, Any],
    marketing_signals: List[Dict[str, Any]],
    scenario_id_filter: Optional[str] = None,
    target_agent_id_filter: Optional[str] = None
) -> List[Dict[str, Any]]:
    """
    Execute all scenarios (or filtered subset).

    Args:
        scenarios: List of test scenarios
        agent_lookup: Lookup dictionary for agents
        customer_lookup: Lookup dictionary for customers
        order_lookup: Lookup dictionary for orders
        logistics: Logistics data
        marketing_signals: Marketing signals data
        scenario_id_filter: Optional filter for specific scenario
        target_agent_id_filter: Optional filter for specific agent (not used in MVP)

    Returns:
        List of evaluation results
    """
    results = []

    for scenario in scenarios:
        # Apply filters
        if scenario_id_filter and scenario.get("scenario_id") != scenario_id_filter:
            continue

        # Execute scenario
        result = execute_scenario(
            scenario,
            agent_lookup,
            customer_lookup,
            order_lookup,
            logistics,
            marketing_signals
        )

        results.append(result)

    return results
