<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/318_EaaS_Utilities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Utilities for EaaS Orchestrator Agent

In [None]:
"""Utilities for EaaS Orchestrator Agent

Reusable business logic for data loading, evaluation execution, and scoring.
"""

import json
from pathlib import Path
from typing import Dict, List, Any, Optional
from datetime import datetime


def load_journey_scenarios(data_dir: str, filename: str) -> List[Dict[str, Any]]:
    """Load journey scenarios from JSON file."""
    file_path = Path(data_dir) / filename
    with open(file_path, 'r') as f:
        return json.load(f)


def load_specialist_agents(data_dir: str, filename: str) -> Dict[str, Any]:
    """Load specialist agents configuration from JSON file."""
    file_path = Path(data_dir) / filename
    with open(file_path, 'r') as f:
        return json.load(f)


def load_supporting_data(
    data_dir: str,
    customers_file: str,
    orders_file: str,
    logistics_file: str,
    marketing_signals_file: str
) -> Dict[str, Any]:
    """Load all supporting data files."""
    data = {}

    # Load customers
    customers_path = Path(data_dir) / customers_file
    with open(customers_path, 'r') as f:
        data['customers'] = json.load(f)

    # Load orders
    orders_path = Path(data_dir) / orders_file
    with open(orders_path, 'r') as f:
        data['orders'] = json.load(f)

    # Load logistics
    logistics_path = Path(data_dir) / logistics_file
    with open(logistics_path, 'r') as f:
        data['logistics'] = json.load(f)

    # Load marketing signals
    marketing_path = Path(data_dir) / marketing_signals_file
    with open(marketing_path, 'r') as f:
        data['marketing_signals'] = json.load(f)

    return data


def load_decision_rules(data_dir: str, filename: str) -> Dict[str, Any]:
    """Load orchestrator decision rules from a file.

    The file may be plain JSON, or Python source containing a
    ``decision_rules_json = {...}`` assignment.
    """
    import ast  # local import: only needed for the dict-literal path

    file_path = Path(data_dir) / filename
    with open(file_path, 'r') as f:
        content = f.read()

    # The file may contain both JSON and Python code; extract just the dict
    marker = 'decision_rules_json = '
    start = content.find(marker + '{')
    if start != -1:
        # Walk forward from the opening brace to its matching closing brace.
        # Note: braces inside string values are not handled.
        begin = start + len(marker)
        brace_count = 0
        for i in range(begin, len(content)):
            if content[i] == '{':
                brace_count += 1
            elif content[i] == '}':
                brace_count -= 1
                if brace_count == 0:
                    # literal_eval accepts only Python literals, unlike eval()
                    return ast.literal_eval(content[begin:i + 1])

    # Fallback: try to parse the whole file as JSON
    return json.loads(content)

def build_agent_lookup(agents: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Create fast lookup dictionary for agents."""
    return {agent_id: agent_data for agent_id, agent_data in agents.items()}


def build_scenario_lookup(scenarios: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Create fast lookup dictionary for scenarios."""
    return {s['scenario_id']: s for s in scenarios}


def get_customer_data(customer_id: str, supporting_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Get customer data by ID."""
    customers = supporting_data.get('customers', [])
    return next((c for c in customers if c.get('customer_id') == customer_id), None)


def get_order_data(order_id: str, supporting_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Get order data by ID."""
    orders = supporting_data.get('orders', [])
    return next((o for o in orders if o.get('order_id') == order_id), None)


def get_logistics_data(order_id: str, supporting_data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Get logistics data for an order."""
    logistics = supporting_data.get('logistics', {})
    for carrier, orders in logistics.items():
        if order_id in orders:
            return orders[order_id]
    return None



# EaaS Orchestrator Utilities — Data Loading and System Inputs

This section contains a set of **utility functions** used by the Evaluation-as-a-Service (EaaS) Orchestrator to load data required for evaluation runs.

These functions may appear simple at first glance, but they play a critical role in making the system **reliable, auditable, and scalable**. They define how evaluation inputs enter the system and ensure that every evaluation run starts from a clean, well-defined foundation.

---

## Why Data Loading Deserves Its Own Layer

In production systems, many failures do not come from logic errors — they come from **bad or inconsistent inputs**.

By separating data loading into dedicated utilities, the orchestrator gains:

* predictable inputs
* reusable data pipelines
* clearer debugging paths
* stronger guarantees around reproducibility

This design choice treats data as a first-class concern rather than an afterthought.

---

## Journey Scenarios: The Test Cases

```python
load_journey_scenarios(...)
```

Journey scenarios define the **situations being tested**. Each scenario represents a realistic business interaction, such as a customer asking about a delayed order or requesting a refund.

Loading these scenarios from a structured JSON file ensures that:

* evaluations are repeatable
* scenarios can be versioned and reviewed
* business teams can modify test cases without changing code

This keeps evaluation aligned with real customer experiences instead of abstract benchmarks.
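
As an illustration, a scenario file could look like the one below. The keys `scenario_id`, `customer_id`, `order_id`, and `expected_issue_type` are the ones the utilities read, and `expected_outcome` mirrors the argument passed to `score_evaluation`. The values, the file name, and the temporary directory are invented for this sketch.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_journey_scenarios(data_dir: str, filename: str) -> List[Dict[str, Any]]:
    """Same loader as above, repeated so this sketch runs on its own."""
    with open(Path(data_dir) / filename, "r") as f:
        return json.load(f)


# Hypothetical scenario records; the values are illustrative only.
example_scenarios = [
    {
        "scenario_id": "S001",
        "customer_id": "C001",
        "order_id": "O1001",
        "expected_issue_type": "delayed_delivery",
        "expected_outcome": "provide_delivery_update",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the example file, then read it back through the loader.
    (Path(tmp_dir) / "journey_scenarios.json").write_text(json.dumps(example_scenarios))
    loaded_scenarios = load_journey_scenarios(tmp_dir, "journey_scenarios.json")

print(loaded_scenarios[0]["scenario_id"])
```

Because the scenarios live in a plain JSON file, a change to a test case shows up as a reviewable diff rather than a code change.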

---

## Specialist Agents: The Subjects of Evaluation

```python
load_specialist_agents(...)
```

Specialist agents represent the AI systems being evaluated. Loading them from configuration files allows the orchestrator to:

* evaluate different agents using the same framework
* swap agents in and out without refactoring logic
* compare performance across agent roles

This makes the evaluation system adaptable as agent ecosystems evolve.

---

## Supporting Data: Context Matters

```python
load_supporting_data(...)
```

Supporting data provides the **context** agents need to behave realistically. This includes:

* customer records
* order information
* logistics data
* marketing signals

Rather than hard-coding this data, it is loaded dynamically so that:

* evaluations reflect real operational conditions
* datasets can be refreshed or expanded over time
* simulations remain grounded in business reality

This is especially important for evaluating decision-making agents, not just text generation.

---

## Decision Rules: Making Expectations Explicit

```python
load_decision_rules(...)
```

Decision rules define how outputs should be interpreted and validated. These rules encode business logic such as:

* acceptable resolution paths
* policy constraints
* escalation criteria

By loading decision rules from an external file, the system separates:

* **what the business expects**
  from
* **how the system evaluates those expectations**

This allows governance standards to evolve independently from execution logic.
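
The loader above pulls a dict literal out of a file that mixes Python code and data. The following is a minimal sketch of that extraction step, using `ast.literal_eval` (which accepts only literals) for parsing. The file content and the rule names are invented for illustration, and the brace scan assumes no braces appear inside string values.

```python
import ast

# Hypothetical file content: Python source with an embedded dict literal.
file_content = '''
decision_rules_json = {
    "escalation": {"priority_issue_types": ["lost_package"]},
    "auto_refund": True,
}
print("other code in the same file")
'''

marker = "decision_rules_json = "
begin = file_content.find(marker) + len(marker)

# Scan forward to the brace that closes the first opening brace.
depth = 0
end = None
for position in range(begin, len(file_content)):
    if file_content[position] == "{":
        depth += 1
    elif file_content[position] == "}":
        depth -= 1
        if depth == 0:
            end = position + 1
            break

# literal_eval parses literals only; it does not execute arbitrary code.
decision_rules = ast.literal_eval(file_content[begin:end])
print(decision_rules["auto_refund"])
```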

---

## Why This Matters for Scale and Governance

These loading utilities may look simple, but together they establish an important principle:

> **Evaluations should be deterministic, explainable, and repeatable.**

From a business perspective, this enables:

* consistent evaluation results across environments
* easier audits and compliance reviews
* faster iteration without introducing hidden variables
* clearer ownership over evaluation inputs

In short, this layer ensures that when the orchestrator evaluates an agent, everyone can agree on **what data was used and why**.

---

## Design Philosophy in Practice

This utilities layer reflects a broader design philosophy used throughout the EaaS system:

* configuration over hard-coding
* separation of concerns
* transparency over convenience

These choices reduce long-term risk and make the system easier to trust as it grows more complex.





# Data as a First-Class Citizen

This set of utility functions reinforces an important design principle used throughout the EaaS Orchestrator:

**Data is treated as a first-class citizen, not a side effect.**

That idea shows up clearly in how agent information, scenario definitions, and supporting business data are handled.

---

## What “First-Class” Means in Practice

In many AI systems, data is:

* embedded directly in prompts
* passed loosely between components
* accessed inconsistently
* difficult to inspect or validate

In this orchestrator, data is handled differently. It is:

* loaded explicitly
* structured intentionally
* accessed through clear interfaces
* separated from execution logic

This makes data **visible, reusable, and trustworthy**.

---

## Fast Lookups Without Hidden Logic

```python
build_agent_lookup(...)
build_scenario_lookup(...)
```

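`build_scenario_lookup` converts the raw scenario list into a dictionary keyed by `scenario_id`. `build_agent_lookup` returns a shallow copy of the agents mapping, which is already keyed by agent ID.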

At a glance, this looks like a simple performance optimization — but the design intent goes deeper.

By normalizing how agents and scenarios are accessed:

* every part of the system refers to the same source of truth
* identifiers become stable references
* behavior stays consistent across evaluation runs

This avoids situations where different parts of the system interpret the same data differently.
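
A short usage sketch of the scenario lookup, with invented records. It also shows one behavior worth knowing: because the lookup is built with a dict comprehension, a later record with a repeated `scenario_id` silently replaces the earlier one.

```python
from typing import Any, Dict, List


def build_scenario_lookup(scenarios: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Same helper as above, repeated so this sketch runs on its own."""
    return {s["scenario_id"]: s for s in scenarios}


# Hypothetical scenarios; note the repeated ID.
scenarios = [
    {"scenario_id": "S001", "order_id": "O1001"},
    {"scenario_id": "S002", "order_id": "O1002"},
    {"scenario_id": "S001", "order_id": "O1003"},
]

scenario_lookup = build_scenario_lookup(scenarios)

# Later entries replace earlier ones that share an ID.
print(len(scenario_lookup))                  # 2
print(scenario_lookup["S001"]["order_id"])   # O1003
```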

---

## Context Is Data, Not Guesswork

```python
get_customer_data(...)
get_order_data(...)
get_logistics_data(...)
```

These functions retrieve supporting data that provides **context** for evaluations.

Rather than baking assumptions into agent logic, the orchestrator:

* looks up customer records
* retrieves order details
* pulls logistics status dynamically

This ensures that agents are evaluated using the same information they would have access to in real operations.

Context becomes something the system **knows**, not something it invents.
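
The accessor functions imply specific data shapes: customers and orders are lists of records, while logistics is a mapping from carrier name to a mapping from order ID to tracking record. The sketch below uses invented values (carrier name, IDs, and fields are assumptions) and repeats two of the helpers so it runs on its own.

```python
from typing import Any, Dict, Optional

# Hypothetical supporting data in the shapes the helpers expect.
supporting_data: Dict[str, Any] = {
    "customers": [{"customer_id": "C001", "name": "Alex"}],
    "orders": [{"order_id": "O1001", "customer_id": "C001", "items": ["mug", "poster"]}],
    "logistics": {
        "FastShip": {"O1001": {"status": "in_transit", "estimated_delivery": "2025-01-15"}},
    },
}


def get_order_data(order_id: str, data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Same logic as above: linear scan of the orders list."""
    return next((o for o in data.get("orders", []) if o.get("order_id") == order_id), None)


def get_logistics_data(order_id: str, data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Same logic as above: search each carrier's mapping for the order."""
    for carrier_orders in data.get("logistics", {}).values():
        if order_id in carrier_orders:
            return carrier_orders[order_id]
    return None


order = get_order_data("O1001", supporting_data)
tracking = get_logistics_data("O1001", supporting_data)
missing = get_order_data("O9999", supporting_data)  # unknown IDs yield None
```

Unknown IDs return `None` rather than raising, so callers decide how to treat missing context.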

---

## Why This Matters for Evaluation

When evaluating agent behavior, it is important to distinguish between:

* agent reasoning failures
* incomplete or incorrect data
* mismatched expectations

By treating data as a first-class component:

* evaluation failures can be traced back to their source
* incorrect behavior can be explained, not guessed at
* scenarios remain grounded in business reality

This dramatically improves the quality of evaluation results.

---

## First-Class Data Enables Better Accountability

Because data is:

* structured
* addressable by ID
* loaded from known sources

the system can answer questions like:

* What customer information was available at the time?
* What did the logistics system report for this order?
* Was the agent missing context, or did it misuse it?

That level of clarity is essential for audits, reviews, and continuous improvement.

---

## Supporting Scale and Change

Treating data as first-class also makes the system easier to evolve:

* new data sources can be added without changing core logic
* existing datasets can be updated independently
* scenarios can grow more complex without breaking evaluations

As agent systems scale, this separation becomes the difference between manageable growth and fragile complexity.

---

## A Subtle but Important Design Choice

These utilities may appear small, but they enforce a powerful pattern:

> **Agents reason.
> Data informs.
> Evaluation measures.**

Each role is clear, and none are mixed together.

That clarity is what allows the orchestrator to remain explainable, testable, and trustworthy over time.





## Structure Is What Creates Data Integrity

These utilities do more than make data easier to access. They establish **structure and consistency**, which is what ultimately creates data integrity.

Every time data enters the system:

* it is loaded the same way
* accessed through the same interfaces
* referenced using the same identifiers
* shaped into predictable structures

This removes ambiguity from the very beginning of the evaluation process.

---

## Consistency Upstream Enables Reliability Downstream

When data is prepared consistently at the start, every downstream component benefits:

* agents receive the same type of context every time
* evaluators compare like-for-like outputs
* scoring logic operates on stable inputs
* reports reflect real differences, not data noise

Downstream systems do not need to guess how to interpret inputs. They can rely on them.

---

## Standardization Reduces Hidden Failure Modes

Many AI failures are subtle. They do not come from obvious bugs, but from:

* missing fields
* inconsistent identifiers
* partial context
* silent data mismatches

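By standardizing how data is loaded and accessed, these utilities make such problems easier to detect before evaluation begins. The loaders themselves do not validate schemas, and the `get_*` helpers return `None` for unknown IDs, so callers still need to handle missing data explicitly.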

That makes results more trustworthy and easier to explain.
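
The failure modes listed above, such as missing fields and inconsistent identifiers, can be checked mechanically right after loading. Below is a minimal sketch of such a check. `find_scenario_problems` and the required-field list are hypothetical and not part of the utilities shown in this notebook.

```python
from typing import Any, Dict, List

# Assumed required fields, based on the keys the utilities read.
REQUIRED_SCENARIO_FIELDS = ("scenario_id", "customer_id", "order_id")


def find_scenario_problems(scenarios: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical helper: report missing fields and duplicate IDs."""
    problems: List[str] = []
    seen_ids = set()
    for index, scenario in enumerate(scenarios):
        for field in REQUIRED_SCENARIO_FIELDS:
            if field not in scenario:
                problems.append(f"scenario[{index}] is missing '{field}'")
        scenario_id = scenario.get("scenario_id")
        if scenario_id in seen_ids:
            problems.append(f"duplicate scenario_id '{scenario_id}'")
        if scenario_id is not None:
            seen_ids.add(scenario_id)
    return problems


problems = find_scenario_problems([
    {"scenario_id": "S001", "customer_id": "C001", "order_id": "O1001"},
    {"scenario_id": "S001", "customer_id": "C002"},
])
print(problems)
```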

---

## Integrity Enables Meaningful Evaluation

Evaluation only works when everyone agrees on what the system saw.

Because these utilities enforce:

* consistent data access patterns
* clear boundaries between data and logic
* stable representations of customers, orders, and scenarios

evaluation results reflect **agent behavior**, not accidental differences in inputs.

This is essential when comparing:

* agents to each other
* versions over time
* performance across environments

---

## Effortless Handling Is the Goal

Downstream components are simpler because of this early attention to detail.

Execution logic can focus on:

* running evaluations
* measuring outcomes
* detecting trends

Scoring and reporting layers can assume:

* inputs are complete
* structures are predictable
* identifiers are reliable

That simplicity is not accidental — it is earned through careful data design upfront.

---

## Why This Matters in Business Terms

From a business perspective, this approach:

* reduces operational risk
* lowers debugging and maintenance costs
* improves confidence in metrics
* supports scaling without fragility

When leaders look at evaluation results, they can trust that the numbers reflect reality — not data inconsistencies.

---

## The Bigger Pattern at Work

This is a recurring pattern throughout the EaaS Orchestrator:

> **Strong foundations make higher-level insights possible.**

By treating data as a first-class citizen, the system earns the right to produce reliable evaluations, meaningful trends, and defensible reports over time.



In [None]:
def simulate_agent_execution(
    agent_id: str,
    scenario: Dict[str, Any],
    supporting_data: Dict[str, Any],
    agents: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Simulate agent execution (MVP: Rule-based simulation).

    In a real implementation, this would call the actual agent.
    For MVP, we simulate based on expected behavior.
    """
    agent = agents.get(agent_id)
    if not agent:
        return {
            "status": "failed",
            "error": f"Agent {agent_id} not found",
            "output": None
        }

    # MVP: Simple rule-based simulation
    # In production, this would be an actual API call to the agent

    # Get relevant data
    customer_id = scenario.get('customer_id')
    order_id = scenario.get('order_id')

    customer = get_customer_data(customer_id, supporting_data)
    order = get_order_data(order_id, supporting_data)
    logistics = get_logistics_data(order_id, supporting_data)

    # Simulate agent response based on agent type
    if agent_id == 'shipping_update_agent':
        output = {
            "status": "shipping_update",
            "carrier": logistics.get('carrier') if logistics else "Unknown",
            "current_status": logistics.get('status') if logistics else "unknown",
            "estimated_delivery": logistics.get('estimated_delivery') if logistics else "Unknown",
            "details": logistics.get('details') if logistics else "No tracking information available"
        }
    elif agent_id == 'refund_agent':
        # Simulate refund calculation (reuses the order fetched above)
        items = order.get('items', []) if order else []
        # Simple refund calculation (in real system, would use actual pricing)
        refund_amount = len(items) * 25.0  # Placeholder
        output = {
            "status": "refund_issued",
            "refund_amount": refund_amount,
            "refunded_at": datetime.now().isoformat(),
            "notes": "Refund processed successfully."
        }
    elif agent_id == 'apology_message_agent':
        output = {
            "status": "apology_message",
            "message": "We're very sorry for the inconvenience with your order. We're taking steps to resolve this as quickly as possible."
        }
    elif agent_id == 'escalation_agent':
        issue_type = scenario.get('expected_issue_type', 'unknown')
        priority = "high" if issue_type in ['lost_package', 'item_not_received'] else "medium"
        output = {
            "status": "escalated",
            "priority": priority,
            "assigned_to": "tier_2_support",
            "notes": "Issue escalated for manual review."
        }
    else:
        output = {
            "status": "unknown",
            "message": f"Agent {agent_id} executed"
        }

    return {
        "status": "completed",
        "output": output,
        "execution_time_seconds": 0.5  # Simulated
    }



# Simulated Execution as a Controlled Testing Environment

The `simulate_agent_execution` function represents the **execution layer** of the Evaluation-as-a-Service Orchestrator. Its role is to run an agent against a scenario and return a structured result that can be evaluated, scored, and audited.

In this implementation, execution is **intentionally simulated** rather than delegated to live agent APIs.

This is a deliberate design choice.

---

## Why Simulation Is a Strength, Not a Shortcut

In early or controlled evaluation environments, calling live agents introduces noise:

* network latency
* external service failures
* changing models or prompts
* inconsistent runtime conditions

By simulating execution in a rule-based way, the orchestrator creates a **stable testing environment** where:

* inputs are controlled
* behavior is predictable
* evaluation logic can be validated independently

This allows the evaluation framework itself to be tested and trusted before introducing real-world variability.

---

## Execution Still Respects Real-World Structure

Even though execution is simulated, it follows the same structure a production system would use.

The function:

* identifies the target agent
* retrieves relevant customer, order, and logistics data
* produces structured outputs aligned with each agent’s role
* returns execution metadata such as status and response time

This ensures the evaluation pipeline behaves the same way whether execution is simulated or real.

---

## Context Is Pulled, Not Assumed

Before generating outputs, the function retrieves:

* customer data
* order data
* logistics status

This reinforces an important system principle:

**Agents operate on data, not assumptions.**

By explicitly pulling context, the system ensures that execution — even simulated execution — mirrors how real agents would reason in production.

---

## Agent-Specific Behavior Is Explicit

Each agent type produces outputs appropriate to its role:

* shipping updates include carrier and delivery details
* refunds include amounts and timestamps
* apologies return customer-facing messages
* escalations include priority and ownership

This keeps behavior:

* understandable
* inspectable
* easy to extend

It also ensures that evaluation focuses on **role-appropriate outcomes**, not generic responses.

---

## Structured Outputs Enable Fair Evaluation

Every execution returns a consistent structure:

* execution status
* output payload
* execution time

This consistency is critical. It allows downstream components to:

* score performance without special cases
* compare agents fairly
* aggregate metrics across runs

Evaluation logic does not need to guess what an agent meant — it evaluates what the agent produced.

---

## Clean Failure Modes Are Part of the Design

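If the requested `agent_id` is not present in the agents configuration, the function returns a `failed` status with an error message and no output. An agent that is configured but has no simulation branch still returns `completed`, with an output status of `unknown`.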

This prevents silent errors and ensures that:

* failures are explicit
* evaluation results remain trustworthy
* missing agents do not contaminate metrics

Clear failure signaling is essential for reliable reporting and governance.

---

## Designed to Transition to Production Execution

The simulation layer is intentionally isolated.

In a production system, this function can be replaced with:

* live API calls
* async execution
* parallel processing
* retries and timeouts

Because the interface and outputs are already defined, the rest of the evaluation system does not need to change.

This keeps the transition to live execution confined to a single function.
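
One way to picture that swap is to treat execution as a callable with a fixed input and output shape. The sketch below is an assumption about how this could be organized, not the notebook's implementation: `Executor`, `run_with_executor`, and `broken_executor` are hypothetical names, and the failure dictionary mirrors the shape returned when an agent is not found.

```python
from typing import Any, Callable, Dict

# An executor takes (agent_id, scenario) and returns the result dict.
Executor = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def simulated_executor(agent_id: str, scenario: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the rule-based simulation."""
    return {
        "status": "completed",
        "output": {"status": "unknown", "message": f"Agent {agent_id} executed"},
        "execution_time_seconds": 0.5,
    }


def broken_executor(agent_id: str, scenario: Dict[str, Any]) -> Dict[str, Any]:
    """Simulates a live call that fails."""
    raise TimeoutError("agent did not respond")


def run_with_executor(executor: Executor, agent_id: str, scenario: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical wrapper: turn executor exceptions into the failure shape."""
    try:
        return executor(agent_id, scenario)
    except Exception as exc:  # a live executor could raise on timeouts, etc.
        return {"status": "failed", "error": str(exc), "output": None}


ok_result = run_with_executor(simulated_executor, "refund_agent", {"scenario_id": "S001"})
failed_result = run_with_executor(broken_executor, "refund_agent", {"scenario_id": "S001"})
```

Downstream scoring only sees the returned dictionary, so it cannot tell which executor produced it.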

---

## Why This Matters for the Bigger System

This execution layer reinforces several key principles:

* controlled inputs enable trustworthy evaluation
* structured outputs enable fair scoring
* isolation enables safe iteration
* realism without randomness enables confidence

Together, these principles allow the orchestrator to answer an important question reliably:

**“How did the agent behave under known conditions?”**

That clarity is what makes everything downstream — scoring, health metrics, trend analysis, and reporting — meaningful.



In [None]:
def score_evaluation(
    evaluation: Dict[str, Any],
    expected_outcome: str,
    expected_resolution_path: List[str],
    scoring_weights: Dict[str, float],
    pass_threshold: float
) -> Dict[str, Any]:
    """
    Score an evaluation by comparing actual output to expected.

    MVP: Rule-based scoring. Future: LLM-as-a-judge scoring.
    """
    actual_output = evaluation.get('actual_output', {})
    execution_status = evaluation.get('status', 'failed')

    if execution_status != 'completed':
        return {
            "correctness_score": 0.0,
            "response_time_score": 0.0,
            "output_quality_score": 0.0,
            "overall_score": 0.0,
            "passed": False,
            "issues": [f"Execution failed: {evaluation.get('error', 'Unknown error')}"]
        }

    # Score correctness (matches expected outcome)
    correctness_score = 1.0
    issues = []

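    # Check if output status matches expected outcome type.
    # Guard against non-dict outputs so the quality check below can report them.
    output_status = actual_output.get('status', '') if isinstance(actual_output, dict) else ''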
    if expected_outcome == 'provide_delivery_update' and output_status != 'shipping_update':
        correctness_score -= 0.3
        issues.append("Output status doesn't match expected outcome type")
    elif expected_outcome == 'issue_refund_and_notify_customer' and output_status != 'refund_issued':
        correctness_score -= 0.3
        issues.append("Output status doesn't match expected outcome type")
    elif expected_outcome == 'acknowledge_delay_and_update_eta' and output_status not in ['shipping_update', 'apology_message']:
        correctness_score -= 0.2
        issues.append("Output doesn't match expected outcome type")

    correctness_score = max(0.0, correctness_score)

    # Score response time
    execution_time = evaluation.get('execution_time_seconds', 10.0)
    response_time_threshold = 2.0  # Hard-coded in the MVP; intended to come from config
    if execution_time <= response_time_threshold:
        response_time_score = 1.0
    elif execution_time <= response_time_threshold * 2:
        response_time_score = 0.7
    elif execution_time <= response_time_threshold * 3:
        response_time_score = 0.4
    else:
        response_time_score = 0.1
        issues.append(f"Response time too slow: {execution_time}s")

    # Score output quality (structure/format)
    output_quality_score = 1.0
    if not isinstance(actual_output, dict):
        output_quality_score = 0.0
        issues.append("Output is not a dictionary")
    elif 'status' not in actual_output:
        output_quality_score = 0.5
        issues.append("Output missing 'status' field")
    elif len(actual_output) < 2:
        output_quality_score = 0.7
        issues.append("Output has minimal fields")

    # Calculate overall score
    overall_score = (
        correctness_score * scoring_weights.get('correctness', 0.5) +
        response_time_score * scoring_weights.get('response_time', 0.2) +
        output_quality_score * scoring_weights.get('output_quality', 0.3)
    )

    passed = overall_score >= pass_threshold

    return {
        "correctness_score": correctness_score,
        "response_time_score": response_time_score,
        "output_quality_score": output_quality_score,
        "overall_score": overall_score,
        "passed": passed,
        "issues": issues
    }




# Scoring: Turning AI Behavior into Measurable Signals

The `score_evaluation` function is where raw agent behavior is translated into **clear, decision-ready metrics**.

Instead of treating agent outputs as subjective or “good enough,” this function enforces a structured comparison between:

* what the agent actually produced
* what was expected
* how quickly it responded
* how usable the output was

This is the step that turns AI from an opaque system into something that can be **measured, tracked, and improved**.

---

## A Simple, Transparent Scoring Model

Rather than using a complex or opaque scoring system, this implementation starts with **clear, rule-based logic**.

That choice is intentional.

Rule-based scoring:

* is easy to understand
* is easy to explain
* is easy to audit
* establishes trust early

More advanced techniques, such as LLM-as-a-judge, can be layered on later without changing the structure of the evaluation pipeline.

---

## Clean Failure Handling Comes First

The function begins by checking whether execution completed successfully.

If execution failed:

* all scores are set to zero
* the evaluation is marked as failed
* the reason is recorded explicitly

This ensures that failures are never silently ignored and that metrics remain honest.

---

## Correctness: Did the Agent Do the Right Thing?

Correctness scoring checks whether the agent's output aligns with the **expected outcome** for the scenario. The `expected_resolution_path` argument is accepted but not yet used by the MVP rules.

Rather than scoring vague similarity, the system:

* looks at the agent's declared output `status`
* compares it against the status expected for the scenario's `expected_outcome`
* deducts 0.2 or 0.3 from a starting score of 1.0 when the two do not match

This makes correctness:

* explicit
* explainable
* scenario-aware

Issues are recorded alongside the score, making it easy to understand *why* points were lost.
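
The same rules can be written as a lookup table, which makes the mapping from expected outcome to acceptable statuses easier to scan. This is a sketch of an alternative layout, not the code used above. `OUTCOME_RULES` and `score_correctness` are hypothetical names, while the outcomes, statuses, and penalties are transcribed from `score_evaluation`.

```python
from typing import Dict, List, Tuple

# Expected outcome -> (acceptable output statuses, penalty on mismatch).
OUTCOME_RULES: Dict[str, Tuple[List[str], float]] = {
    "provide_delivery_update": (["shipping_update"], 0.3),
    "issue_refund_and_notify_customer": (["refund_issued"], 0.3),
    "acknowledge_delay_and_update_eta": (["shipping_update", "apology_message"], 0.2),
}


def score_correctness(expected_outcome: str, output_status: str) -> float:
    """Start at 1.0 and subtract the penalty when the status is not acceptable."""
    rule = OUTCOME_RULES.get(expected_outcome)
    if rule is None:
        return 1.0  # outcomes without a rule are not penalized
    accepted_statuses, penalty = rule
    if output_status in accepted_statuses:
        return 1.0
    return max(0.0, 1.0 - penalty)


matching = score_correctness("provide_delivery_update", "shipping_update")
mismatched = score_correctness("provide_delivery_update", "refund_issued")
```

As in the original logic, an expected outcome with no rule keeps the full score, which is worth remembering when adding new scenario types.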

---

## Response Time: Did It Perform Fast Enough?

Response time is scored against defined thresholds:

* responses at or under the 2.0-second threshold receive full credit (1.0)
* responses within two and three times the threshold receive 0.7 and 0.4
* anything slower receives 0.1 and is flagged as an issue

This reflects real operational expectations, where speed often matters as much as accuracy — especially in customer-facing workflows.
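
The tiers can be isolated into a small function for clarity. The sketch below mirrors the thresholds in `score_evaluation`. The function name `score_response_time` and the `threshold` parameter are introduced here for illustration.

```python
def score_response_time(execution_time: float, threshold: float = 2.0) -> float:
    """Tiered score: full credit within the threshold, less as latency grows."""
    if execution_time <= threshold:
        return 1.0
    if execution_time <= threshold * 2:
        return 0.7
    if execution_time <= threshold * 3:
        return 0.4
    return 0.1


for seconds in (0.5, 3.0, 5.0, 10.0):
    print(seconds, score_response_time(seconds))
```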

---

## Output Quality: Is the Result Usable?

Output quality evaluates whether the agent's response is:

* a dictionary (otherwise the score is 0.0)
* carrying a `status` field (otherwise 0.5)
* made up of at least two fields (otherwise 0.7)

Rather than judging content subjectively, this score focuses on **structural quality**, which is essential for downstream automation and integrations.

A response that cannot be reliably parsed or consumed is treated as lower quality, even if it is technically correct.
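
A standalone sketch of these structural checks follows. The name `score_output_quality` is hypothetical, and the checks, scores, and messages follow `score_evaluation`.

```python
from typing import Any, List, Tuple


def score_output_quality(actual_output: Any) -> Tuple[float, List[str]]:
    """Structural checks applied in order; the first failing check decides the score."""
    if not isinstance(actual_output, dict):
        return 0.0, ["Output is not a dictionary"]
    if "status" not in actual_output:
        return 0.5, ["Output missing 'status' field"]
    if len(actual_output) < 2:
        return 0.7, ["Output has minimal fields"]
    return 1.0, []


print(score_output_quality("plain text"))
print(score_output_quality({"status": "refund_issued", "refund_amount": 25.0}))
```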

---

## Weighted Scoring Reflects Business Priorities

Each dimension contributes to the final score using configurable weights.

This allows organizations to tune evaluation criteria based on what matters most:

* accuracy-heavy workflows
* latency-sensitive use cases
* integration-driven systems that demand clean structure

The scoring logic remains the same — only priorities change.
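
A worked example of the weighted sum, using the default weights from `score_evaluation`. The component scores, the alternative weight set, and the pass threshold of 0.8 are invented for illustration.

```python
# Default weights used by score_evaluation when a key is absent.
scoring_weights = {"correctness": 0.5, "response_time": 0.2, "output_quality": 0.3}

# Hypothetical component scores: wrong outcome type, fast, well structured.
component_scores = {"correctness": 0.7, "response_time": 1.0, "output_quality": 1.0}

overall_score = sum(component_scores[name] * weight for name, weight in scoring_weights.items())

pass_threshold = 0.8  # assumed value; the real threshold comes from configuration
passed = overall_score >= pass_threshold
print(round(overall_score, 2), passed)

# Same component scores under an accuracy-heavy weighting (also hypothetical).
accuracy_heavy = {"correctness": 0.8, "response_time": 0.1, "output_quality": 0.1}
accuracy_heavy_score = sum(component_scores[n] * w for n, w in accuracy_heavy.items())
print(round(accuracy_heavy_score, 2), accuracy_heavy_score >= pass_threshold)
```

With the default weights the evaluation scores about 0.85 and passes. With the accuracy-heavy weights the same behavior scores about 0.76 and fails, which is the sense in which only the priorities change.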

---

## Pass/Fail as a Clear Decision Boundary

The final score is compared against a configurable pass threshold.

This creates a simple but powerful outcome:

* the evaluation either meets the standard or it doesn’t

That clarity enables:

* automated gating
* trend analysis
* escalation logic
* executive reporting

---

## Why This Layer Is So Important

This function does more than score outputs — it establishes **accountability**.

Because scoring is:

* consistent
* transparent
* repeatable

changes in performance over time can be trusted. When scores drop, it reflects real behavioral changes, not shifting evaluation criteria.

---

## A Foundation for Growth

This scoring layer is designed to evolve:

* rules can be refined
* weights can be adjusted
* LLM-based judging can be introduced

But even as sophistication increases, the core structure remains the same.

That stability is what allows the orchestrator to support:

* longitudinal performance tracking
* drift detection
* SLA enforcement
* governance workflows

---

## The Bigger Picture

At this point in the pipeline, AI behavior has been converted into numbers, thresholds, and decisions.

That conversion is what allows the system to move beyond experimentation and into **real operational use** — where performance can be tracked, compared, and trusted over time.




In [None]:
def calculate_agent_performance_summary(
    agent_id: str,
    evaluations: List[Dict[str, Any]],
    scores: List[Dict[str, Any]],
    health_thresholds: Dict[str, float]
) -> Dict[str, Any]:
    """Calculate performance summary for an agent."""
    agent_evaluations = [e for e in evaluations if e.get('target_agent_id') == agent_id]
    agent_scores = [s for s in scores if s.get('target_agent_id') == agent_id]

    if not agent_scores:
        return {
            "agent_id": agent_id,
            "total_evaluations": 0,
            "passed_count": 0,
            "failed_count": 0,
            "average_score": 0.0,
            "average_response_time": 0.0,
            "health_status": "unknown"
        }

    total = len(agent_scores)
    passed = sum(1 for s in agent_scores if s.get('passed', False))
    failed = total - passed
    avg_score = sum(s.get('overall_score', 0.0) for s in agent_scores) / total

    avg_response_time = sum(
        e.get('execution_time_seconds', 0.0) for e in agent_evaluations
    ) / len(agent_evaluations) if agent_evaluations else 0.0

    # Determine health status
    if avg_score >= health_thresholds.get('healthy', 0.85):
        health_status = "healthy"
    elif avg_score >= health_thresholds.get('degraded', 0.70):
        health_status = "degraded"
    else:
        health_status = "critical"

    return {
        "agent_id": agent_id,
        "total_evaluations": total,
        "passed_count": passed,
        "failed_count": failed,
        "average_score": avg_score,
        "average_response_time": avg_response_time,
        "health_status": health_status
    }


# From Individual Scores to Agent Health

The `calculate_agent_performance_summary` function is responsible for one of the most important transformations in the system:

**It turns many individual evaluations into a clear, high-level health signal for a single agent.**

This is where detailed metrics become **manageable insight**.

---

## Why Aggregation Matters

Raw evaluation data is useful for engineers, but it quickly becomes overwhelming at scale. Business leaders don’t want to inspect dozens of scenario scores — they want to know:

* Is this agent reliable?
* Is it getting better or worse?
* Should we trust it in production?
* Does it need attention?

This function answers those questions in a structured, repeatable way.

---

## A Fair and Focused View Per Agent

The function begins by isolating:

* evaluations associated with a specific agent
* scores generated from those evaluations

This ensures that each agent is judged **only on its own behavior**, under consistent conditions.

No cross-contamination.
No hidden averaging across unrelated agents.

---

## Core Performance Metrics

For each agent, the summary computes:

* **Total evaluations**
  How much evidence exists for this agent’s performance.

* **Passed vs failed evaluations**
  A simple reliability signal that’s easy to reason about.

* **Average score**
  A normalized performance measure that can be tracked over time.

* **Average response time**
  An operational metric that highlights latency trends.

These metrics create a balanced view of both **quality and efficiency**.
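
The aggregation itself is a filter followed by a few sums. The sketch below uses invented score records. Note that the records are assumed to already carry a `target_agent_id` key, which has to be attached upstream because `score_evaluation` does not add it.

```python
from typing import Any, Dict, List

# Hypothetical score records; target_agent_id is assumed to be attached upstream.
scores: List[Dict[str, Any]] = [
    {"target_agent_id": "refund_agent", "overall_score": 1.0, "passed": True},
    {"target_agent_id": "refund_agent", "overall_score": 0.6, "passed": False},
    {"target_agent_id": "escalation_agent", "overall_score": 0.9, "passed": True},
]

# Keep only the records for one agent, then aggregate.
agent_scores = [s for s in scores if s.get("target_agent_id") == "refund_agent"]
total = len(agent_scores)
passed_count = sum(1 for s in agent_scores if s.get("passed", False))
failed_count = total - passed_count
average_score = sum(s.get("overall_score", 0.0) for s in agent_scores) / total

print(total, passed_count, failed_count, round(average_score, 2))
```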

---

## Health Status: Translating Metrics into Meaning

The most important output of this function is the **health status**:

```python
healthy | degraded | critical | unknown
```

Rather than exposing raw numbers alone, the system classifies agent performance by comparing the average score against explicit thresholds passed in from configuration (falling back to 0.85 for healthy and 0.70 for degraded).

This creates a shared language:

* “Healthy” agents are safe to rely on
* “Degraded” agents require monitoring or tuning
* “Critical” agents need intervention
* “Unknown” means no scores were recorded for the agent

Health status is not subjective — it is computed deterministically from agreed-upon standards.
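
The classification step can be sketched in isolation. `classify_health` is a hypothetical name. The comparison order and default thresholds follow `calculate_agent_performance_summary`, and the sample scores are invented.

```python
from typing import Dict


def classify_health(average_score: float, health_thresholds: Dict[str, float]) -> str:
    """Map an average score to a health label using the configured cut-offs."""
    if average_score >= health_thresholds.get("healthy", 0.85):
        return "healthy"
    if average_score >= health_thresholds.get("degraded", 0.70):
        return "degraded"
    return "critical"


thresholds = {"healthy": 0.85, "degraded": 0.70}
labels = [classify_health(score, thresholds) for score in (0.92, 0.85, 0.75, 0.40)]
print(labels)
```

Both comparisons are inclusive, so a score exactly on a threshold falls into the better category.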

---

## Determinism Builds Confidence

Because health classification is:

* rule-based
* threshold-driven
* reproducible

the same agent, evaluated under the same conditions, will always receive the same health status.

That consistency is essential for:

* trend analysis
* escalation workflows
* performance reviews
* executive reporting

When health changes, it reflects real behavioral change — not scoring variability.

---

## Designed for Long-Term Tracking

This summary structure is intentionally stable and compact, making it ideal for:

* storing over time
* plotting trends
* feeding dashboards
* triggering alerts

As evaluations run repeatedly, each agent accumulates a performance history that can answer questions like:

* Is this agent improving?
* Is performance drifting?
* Did something change after a deployment?

---

## Why This Is a Natural Capstone Utility

This function brings together several earlier design choices:

* standardized data
* deterministic scoring
* explicit thresholds
* reproducible metrics

The result is a **single, trustworthy representation of agent health** that can be consumed by both technical and non-technical stakeholders.

---

## The Bigger Pattern

This utility reflects a recurring pattern in the EaaS Orchestrator:

> **Detailed signals feed deterministic aggregation,
> which produces simple, actionable insight.**

That pattern is what allows the system to scale from individual tests to organizational oversight.

---

## Why Leaders Care About This Layer

From a business perspective, this function enables:

* clear ownership of agent performance
* faster decision-making
* reduced ambiguity during incidents
* confidence in automation at scale

It transforms AI agents from experimental components into **managed, measurable systems**.


