
# Nodes for EaaS Orchestrator Agent

In [None]:
"""Nodes for EaaS Orchestrator Agent

Orchestration logic for the evaluation workflow.
"""

from typing import Dict, Any
from datetime import datetime
from toolshed.progress import calculate_progress, calculate_elapsed_time, estimate_remaining_time
from toolshed.performance import create_metrics_config, track_execution_time
from toolshed.workflows import analyze_workflow_health
from toolshed.validation import validate_data_structure
from agents.eval_as_service.utilities import (
    load_journey_scenarios,
    load_specialist_agents,
    load_supporting_data,
    load_decision_rules,
    build_agent_lookup,
    build_scenario_lookup,
    simulate_agent_execution,
    score_evaluation,
    calculate_agent_performance_summary
)
from config import EvalAsServiceOrchestratorState, EvalAsServiceOrchestratorConfig


def goal_node(state: EvalAsServiceOrchestratorState) -> Dict[str, Any]:
    """Define the goal for evaluation."""
    scenario_id = state.get('scenario_id')
    target_agent_id = state.get('target_agent_id')

    if scenario_id:
        goal_description = f"Evaluate agent performance for scenario {scenario_id}"
    elif target_agent_id:
        goal_description = f"Evaluate agent {target_agent_id} across all scenarios"
    else:
        goal_description = "Evaluate all agents across all scenarios"

    return {
        "goal": {
            "description": goal_description,
            "type": "evaluation",
            "scope": {
                "scenario_id": scenario_id,
                "target_agent_id": target_agent_id
            }
        }
    }


def planning_node(state: EvalAsServiceOrchestratorState) -> Dict[str, Any]:
    """Create execution plan for evaluation."""
    plan = [
        {
            "step": 1,
            "task": "Load evaluation data",
            "description": "Load scenarios, agents, and supporting data"
        },
        {
            "step": 2,
            "task": "Execute evaluations",
            "description": "Run scenarios through target agents"
        },
        {
            "step": 3,
            "task": "Score evaluations",
            "description": "Compare actual outputs to expected outcomes"
        },
        {
            "step": 4,
            "task": "Analyze performance",
            "description": "Calculate agent performance summaries"
        },
        {
            "step": 5,
            "task": "Generate report",
            "description": "Create comprehensive evaluation report"
        }
    ]

    return {"plan": plan}


def data_loading_node(
    state: EvalAsServiceOrchestratorState,
    config: EvalAsServiceOrchestratorConfig
) -> Dict[str, Any]:
    """Load all required data for evaluation."""
    # Copy so that appending below does not mutate the list held in state
    errors = list(state.get('errors', []))

    try:
        # Load scenarios
        scenarios = load_journey_scenarios(config.data_dir, config.journey_scenarios_file)

        # Filter by scenario_id if specified
        scenario_id = state.get('scenario_id')
        if scenario_id:
            scenarios = [s for s in scenarios if s.get('scenario_id') == scenario_id]

        # Load agents
        agents_dict = load_specialist_agents(config.data_dir, config.specialist_agents_file)

        # Filter by target_agent_id if specified
        target_agent_id = state.get('target_agent_id')
        if target_agent_id:
            agents_dict = {k: v for k, v in agents_dict.items() if k == target_agent_id}

        # Load supporting data
        supporting_data = load_supporting_data(
            config.data_dir,
            config.customers_file,
            config.orders_file,
            config.logistics_file,
            config.marketing_signals_file
        )

        # Load decision rules
        decision_rules = load_decision_rules(config.data_dir, config.decision_rules_file)

        # Build lookups
        agent_lookup = build_agent_lookup(agents_dict)
        scenario_lookup = build_scenario_lookup(scenarios)

        # Validate data structure if enabled
        if config.enable_validation:
            try:
                validate_data_structure(scenarios, required_fields=['scenario_id', 'customer_id', 'order_id'])
            except Exception as e:
                errors.append(f"Validation warning: {str(e)}")

        return {
            "journey_scenarios": scenarios,
            "specialist_agents": agents_dict,
            "supporting_data": supporting_data,
            "decision_rules": decision_rules,
            "errors": errors
        }
    except Exception as e:
        errors.append(f"Data loading error: {str(e)}")
        return {"errors": errors}




# Orchestrator Nodes: Turning Evaluation into a Workflow

This section introduces the first **orchestration nodes** of the Evaluation-as-a-Service (EaaS) agent. These nodes define how an evaluation run is structured, sequenced, and executed from start to finish.

Rather than hard-coding a single evaluation flow, the orchestrator models evaluation as a **series of explicit steps**, each with a clear responsibility.

This makes the system easier to understand, extend, and govern.

---

## Goal Definition: Making Intent Explicit

```python
goal_node(...)
```

The goal node defines *why* the evaluation is being run.

Depending on the inputs, the goal may be:

* evaluating a single scenario
* evaluating a specific agent across all scenarios
* evaluating all agents across all scenarios

By explicitly recording the goal in state, the system ensures that:

* evaluation intent is visible
* scope is clearly defined
* reports can explain *what was evaluated and why*

This prevents evaluation runs from becoming ambiguous or open-ended.
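
The three scopes can be seen with a small stand-in for the node. This is a minimal sketch: `describe_goal` is a hypothetical helper that mirrors the branching in `goal_node` on plain values, and the IDs `S001` and `refund_agent` are made up for illustration.

```python
from typing import Any, Dict, Optional


def describe_goal(scenario_id: Optional[str], target_agent_id: Optional[str]) -> Dict[str, Any]:
    """Hypothetical stand-in that mirrors goal_node's branching."""
    if scenario_id:
        description = f"Evaluate agent performance for scenario {scenario_id}"
    elif target_agent_id:
        description = f"Evaluate agent {target_agent_id} across all scenarios"
    else:
        description = "Evaluate all agents across all scenarios"
    return {
        "description": description,
        "type": "evaluation",
        "scope": {"scenario_id": scenario_id, "target_agent_id": target_agent_id},
    }


# When both filters are supplied, the scenario branch wins for the description,
# while the scope still records both values.
both = describe_goal("S001", "refund_agent")
everything = describe_goal(None, None)
print(both["description"])
print(everything["description"])
```

One consequence worth noticing: with both filters set, the description mentions only the scenario, so a report that needs the full scope should read `goal["scope"]` rather than the description string.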

---

## Planning: Making the Workflow Visible

```python
planning_node(...)
```

The planning node creates a simple, transparent execution plan that outlines the evaluation workflow step by step.

Even though the plan is static in this MVP, its presence is important:

* it documents the evaluation lifecycle
* it provides a mental model for how the system operates
* it creates a foundation for future dynamic planning

From a governance perspective, this plan acts as a **contract** for how evaluations are performed.

---

## Data Loading as a Dedicated Phase

```python
data_loading_node(...)
```

The data loading node is where all required inputs for evaluation are gathered and prepared.

This includes:

* journey scenarios
* specialist agent definitions
* supporting business data
* decision rules
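* lookup structures for agents and scenarios (built here but not written to state in this version; the scoring node rebuilds its own scenario lookup)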

By isolating data loading into its own node, the system reinforces a key design principle:

> **Evaluation should not begin until inputs are complete and validated.**

---

## Scope Control Without Rewrites

This node supports targeted evaluation through simple filters:

* scenario-level filtering
* agent-level filtering

This allows teams to:

* re-evaluate a single failure
* isolate a problematic agent
* run focused audits

All without changing orchestration logic.
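
The two filters can be sketched in isolation. `apply_scope` is a hypothetical helper that repeats the list and dict comprehensions from `data_loading_node`; the sample scenarios and agent names are invented.

```python
from typing import Any, Dict, List, Optional, Tuple

Scenario = Dict[str, Any]
Agents = Dict[str, Dict[str, Any]]


def apply_scope(
    scenarios: List[Scenario],
    agents: Agents,
    scenario_id: Optional[str] = None,
    target_agent_id: Optional[str] = None,
) -> Tuple[List[Scenario], Agents]:
    """Hypothetical helper: narrow scenarios and agents the way the node does."""
    if scenario_id:
        scenarios = [s for s in scenarios if s.get("scenario_id") == scenario_id]
    if target_agent_id:
        agents = {k: v for k, v in agents.items() if k == target_agent_id}
    return scenarios, agents


SCENARIOS = [
    {"scenario_id": "S001", "expected_resolution_path": ["billing_agent"]},
    {"scenario_id": "S002", "expected_resolution_path": ["billing_agent", "shipping_agent"]},
]
AGENTS = {"billing_agent": {}, "shipping_agent": {}}

one_scenario, all_agents = apply_scope(SCENARIOS, AGENTS, scenario_id="S002")
all_scenarios, one_agent = apply_scope(SCENARIOS, AGENTS, target_agent_id="billing_agent")
```

Because the execution node walks each scenario's `expected_resolution_path`, an agent removed by the agent-level filter will be reported there as "not found" whenever it appears in a path. Those entries in `errors` are a side effect of scoping, not a data problem.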

---

## Validation as a Guardrail

When validation is enabled, the node checks that critical data fields are present before evaluation proceeds.

Validation errors are captured as warnings rather than hard failures, which:

* preserves system resilience
* surfaces data issues early
* prevents silent corruption of metrics

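This reinforces the idea that **data quality is checked at the boundary**, not assumed downstream. Because a validation problem is recorded as a warning rather than raised, later nodes still run, so anything that consumes the results should inspect `errors` first.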
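
A minimal sketch of a required-field check that reports problems as warnings. The real `validate_data_structure` lives in `toolshed` and its behavior is not shown in this notebook, so `find_missing_fields` below is a hypothetical substitute, not that function.

```python
from typing import Any, Dict, List


def find_missing_fields(records: List[Dict[str, Any]], required_fields: List[str]) -> List[str]:
    """Return one message per record that lacks (or has an empty) required field."""
    problems = []
    for index, record in enumerate(records):
        missing = [name for name in required_fields if record.get(name) in (None, "")]
        if missing:
            problems.append(f"record {index}: missing {', '.join(missing)}")
    return problems


scenarios = [
    {"scenario_id": "S001", "customer_id": "C1", "order_id": "O1"},
    {"scenario_id": "S002", "customer_id": "C2"},  # order_id is absent
]

errors: List[str] = []
# Record problems as warnings instead of raising, as the node does.
errors.extend(
    f"Validation warning: {message}"
    for message in find_missing_fields(scenarios, ["scenario_id", "customer_id", "order_id"])
)
print(errors)
```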

---

## Errors Are Collected, Not Hidden

Errors encountered during data loading are accumulated into state rather than causing uncontrolled failures.

This ensures that:

* failures are visible
* partial results can still be reported
* evaluation runs remain explainable

Transparent error handling is essential for trust, especially when evaluations are automated.
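
The accumulate-into-state pattern can be sketched on its own. `load_with_error_capture`, `failing_loader`, and the file name are hypothetical. Copying the list before appending is a defensive choice in this sketch, made so the caller's state is never mutated as a side effect.

```python
from typing import Any, Callable, Dict, List


def load_with_error_capture(
    state: Dict[str, Any],
    loader: Callable[[], Dict[str, Any]],
) -> Dict[str, Any]:
    """Run a loader and return a partial state update; never raise."""
    errors: List[str] = list(state.get("errors", []))  # copy, do not mutate state
    try:
        update = loader()
    except Exception as exc:
        errors.append(f"Data loading error: {exc}")
        return {"errors": errors}
    return {**update, "errors": errors}


def failing_loader() -> Dict[str, Any]:
    # Hypothetical failure used only to exercise the except branch.
    raise FileNotFoundError("journey_scenarios.json")


state = {"errors": ["earlier warning"]}
result = load_with_error_capture(state, failing_loader)
print(result["errors"])
```

On failure the update contains only `errors`, so downstream nodes see empty defaults for scenarios and agents and produce an empty (but explainable) run.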

---

## Why These Nodes Matter Together

Taken as a group, these first nodes establish a clear pattern:

* **Goal** defines intent
* **Plan** defines structure
* **Data loading** defines inputs

Only after these steps does the system move on to execution and scoring.

This mirrors how mature operational systems are designed and makes evaluation runs:

* predictable
* repeatable
* defensible

---

## From Ad-Hoc Scripts to Orchestrated Evaluation

Many agent systems mix data loading, execution, and scoring into a single block of logic. This orchestrator deliberately avoids that pattern.

By breaking evaluation into nodes:

* each step is easier to reason about
* failures are easier to localize
* future enhancements are easier to add

This structure is what allows the system to grow from a simple evaluator into a full **AI governance workflow**.



In [None]:
def evaluation_execution_node(
    state: EvalAsServiceOrchestratorState,
    config: EvalAsServiceOrchestratorConfig
) -> Dict[str, Any]:
    """Execute evaluations by running scenarios through agents."""
    scenarios = state.get('journey_scenarios', [])
    agents = state.get('specialist_agents', {})
    supporting_data = state.get('supporting_data', {})

    executed_evaluations = []
    # Copy so that appending below does not mutate the list held in state
    errors = list(state.get('errors', []))

    # Track start time for progress
    start_time = state.get('evaluation_start_time')
    if not start_time:
        start_time = datetime.now().isoformat()

    # Determine which agents to evaluate for each scenario
    for scenario in scenarios:
        expected_resolution_path = scenario.get('expected_resolution_path', [])

        # Evaluate each agent in the expected resolution path
        for agent_id in expected_resolution_path:
            if agent_id not in agents:
                errors.append(f"Agent {agent_id} not found for scenario {scenario.get('scenario_id')}")
                continue

            try:
                # Track execution time
                execution_start = datetime.now()

                # Simulate agent execution
                result = simulate_agent_execution(
                    agent_id,
                    scenario,
                    supporting_data,
                    agents
                )

                execution_end = datetime.now()
                execution_time = (execution_end - execution_start).total_seconds()

                evaluation = {
                    "scenario_id": scenario.get('scenario_id'),
                    "target_agent_id": agent_id,
                    "input": {
                        "customer_message": scenario.get('customer_message'),
                        "customer_id": scenario.get('customer_id'),
                        "order_id": scenario.get('order_id')
                    },
                    "actual_output": result.get('output'),
                    "expected_output": {
                        "expected_issue_type": scenario.get('expected_issue_type'),
                        "expected_resolution_path": expected_resolution_path,
                        "expected_outcome": scenario.get('expected_outcome')
                    },
                    "execution_time_seconds": execution_time,
                    "status": result.get('status', 'failed'),
                    "error": result.get('error')
                }

                executed_evaluations.append(evaluation)

            except Exception as e:
                errors.append(f"Evaluation error for scenario {scenario.get('scenario_id')}, agent {agent_id}: {str(e)}")
                executed_evaluations.append({
                    "scenario_id": scenario.get('scenario_id'),
                    "target_agent_id": agent_id,
                    "status": "failed",
                    "error": str(e)
                })

    # Update progress
    total = len(executed_evaluations)
    completed = len([e for e in executed_evaluations if e.get('status') == 'completed'])

    progress = calculate_progress(completed=completed, total=total) if total > 0 else 0.0

    elapsed = calculate_elapsed_time(start_time)
    remaining = estimate_remaining_time(
        completed=completed,
        total=total,
        elapsed_minutes=elapsed / 60.0
    ) if total > 0 else 0.0

    return {
        "executed_evaluations": executed_evaluations,
        "evaluations_completed": completed,
        "evaluations_total": total,
        "progress_percentage": progress,
        "elapsed_time_seconds": elapsed,
        "estimated_remaining_seconds": remaining * 60.0,
        "evaluation_start_time": start_time,
        "errors": errors
    }





# Evaluation Execution: Running Controlled Experiments

The `evaluation_execution_node` is responsible for executing evaluations in a **structured, repeatable way**. This is where scenarios are run through agents and concrete evidence of agent behavior is produced.

Rather than treating execution as an opaque black box, this node treats it as a **controlled experiment** with clearly defined inputs, outputs, and measurements.

---

## Execution Follows Explicit Expectations

Each scenario defines an `expected_resolution_path` that lists the agents expected to take part in resolving it.

The orchestrator uses this path to determine:

* which agents are evaluated
* in what context
* for which scenario

This ensures agents are evaluated **only when they are expected to act**, preventing irrelevant or misleading measurements.

---

## One Scenario, One Agent, One Record

For every scenario–agent pairing, the node creates a complete evaluation record that includes:

* the scenario identifier
* the target agent
* the exact inputs provided
* the agent’s actual output
* the expected outcome
* execution time
* execution status and errors

Each evaluation becomes a **self-contained unit of evidence** that can be scored, audited, or revisited later.
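
The record shape can be written down as a type. This is a sketch: the keys are taken from the dict built in `evaluation_execution_node`, but the `EvaluationRecord` type and the `is_scorable` helper are hypothetical and do not appear in the project.

```python
from typing import Any, Dict, Optional, TypedDict


class EvaluationRecord(TypedDict, total=False):
    """Keys mirror the evaluation dict; total=False because the
    exception branch in the node writes only a subset of them."""
    scenario_id: str
    target_agent_id: str
    input: Dict[str, Any]
    actual_output: Any
    expected_output: Dict[str, Any]
    execution_time_seconds: float
    status: str
    error: Optional[str]


# Keys that both the success branch and the exception branch set.
ALWAYS_PRESENT = ("scenario_id", "target_agent_id", "status", "error")


def is_scorable(record: EvaluationRecord) -> bool:
    """Hypothetical check: scoring needs an output and a timing."""
    return "actual_output" in record and "execution_time_seconds" in record


failed: EvaluationRecord = {
    "scenario_id": "S001",
    "target_agent_id": "refund_agent",
    "status": "failed",
    "error": "timeout",
}
print(is_scorable(failed))
```

Records written by the exception branch carry no `execution_time_seconds`, so any later average that falls back to `0.0` for a missing key will count them as instantaneous.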

---

## Execution Is Measured, Not Assumed

Execution time is measured explicitly rather than assumed or estimated.

This allows:

* latency to be scored objectively
* performance regressions to be detected
* operational behavior to be tracked over time

Speed becomes a first-class metric, not an afterthought.
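
A minimal sketch of timing a single call. The node subtracts two `datetime.now()` values; this sketch swaps in `time.perf_counter()`, a clock intended for measuring short durations, which is a substitution and not what the node does. `timed_call` and `fake_agent` are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_agent(message: str) -> Dict[str, Any]:
    # Stand-in for simulate_agent_execution; the real signature differs.
    return {"status": "completed", "output": message.upper()}


result, seconds = timed_call(fake_agent, "where is my order?")
print(result["status"], f"{seconds:.6f}s")
```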

---

## Errors Are Captured Without Breaking the System

If an agent fails or an exception occurs:

* the failure is recorded
* the error message is preserved
* evaluation continues for other scenarios and agents

This design avoids cascading failures and ensures that partial results are still meaningful and explainable.

---

## Progress Tracking Is Built In

The node calculates:

* how many evaluations completed successfully
* how many evaluations were attempted in total
* overall progress percentage
* elapsed time
* estimated time remaining

This transforms evaluation from a “fire and forget” process into a **predictable operation** that can be monitored and scheduled.

From a business perspective, this predictability matters just as much as correctness.
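
The arithmetic behind progress and remaining-time estimates can be sketched as follows. The `toolshed.progress` helpers are not shown in this notebook, so these two functions are hypothetical. They assume a percentage scale and a simple linear extrapolation from the average time per completed item.

```python
def progress_percentage(completed: int, total: int) -> float:
    """Share of work done, on a 0-100 scale (assumed scale)."""
    if total <= 0:
        return 0.0
    return 100.0 * completed / total


def estimate_remaining_seconds(completed: int, total: int, elapsed_seconds: float) -> float:
    """Linear extrapolation: average time per completed item times items left."""
    if completed <= 0 or completed >= total:
        return 0.0
    return elapsed_seconds / completed * (total - completed)


# 3 of 12 evaluations done after 6 seconds: 2 s each, 9 left.
print(progress_percentage(3, 12))              # 25.0
print(estimate_remaining_seconds(3, 12, 6.0))  # 18.0
```

In the node as written, these figures are computed once, after the loop has finished, so `total` is the number of records attempted and the gap between `total` and `completed` is the number that did not reach `completed` status.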

---

## Determinism at the Execution Layer

Execution follows a fixed pattern:

* same scenarios
* same agents
* same inputs
* same structure

When the execution procedure is fixed, downstream scoring and trend analysis become trustworthy. If performance changes, it reflects real behavioral changes, not variability in how evaluations were run. The one measurement that will naturally vary between runs is wall-clock execution time.

---

## Why This Node Is So Important

This node creates the **raw material** for everything that follows:

* scoring
* health classification
* trend analysis
* reporting

If execution were inconsistent or opaque, none of those layers would be reliable.

By designing execution as a controlled, measurable step, the orchestrator preserves integrity across the entire evaluation pipeline.

---

## The Bigger Pattern at Work

This node reinforces a core architectural pattern used throughout the system:

> **Prepare inputs carefully → execute predictably → measure explicitly → analyze confidently**

That pattern is what allows evaluation results to scale from individual tests into organizational insight.




## State as the Thread That Connects Everything

The orchestrator is built around a shared **state object** that flows through every node in the workflow.

Each node:

* reads what it needs from state
* performs a focused piece of work
* adds new information to state
* passes the updated state forward

Nothing is hidden, and nothing is overwritten without intent.
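
The read, work, write-back cycle can be sketched in plain Python. This notebook does not show how the nodes are wired together, so `run_nodes` is a hypothetical runner and the three toy nodes are invented. The only thing taken from the source is the convention that a node returns a partial dict of updates.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_nodes(nodes: List[Node], initial_state: State) -> State:
    """Call each node in order and merge its partial update into state."""
    state = dict(initial_state)  # shallow copy; the caller's dict is left alone
    for node in nodes:
        update = node(state)
        state.update(update)
    return state


def goal(state: State) -> State:
    return {"goal": "evaluate everything"}


def plan(state: State) -> State:
    return {"plan": ["load", "execute", "score"]}


def execute(state: State) -> State:
    # Reads what an earlier node wrote instead of recomputing it.
    return {"steps_run": len(state["plan"])}


final_state = run_nodes([goal, plan, execute], {"errors": []})
print(sorted(final_state))
```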

---

## Nodes Read from State, They Don’t Recreate Context

When the execution node begins, it retrieves what it needs directly from state:

```python
scenarios = state.get('journey_scenarios', [])
agents = state.get('specialist_agents', {})
supporting_data = state.get('supporting_data', {})
```

This means:

* scenarios were already loaded and filtered
* agents were already loaded and filtered
* supporting data was already prepared

The execution node does not re-fetch or reinterpret data. It trusts the state.

---

## Utilities Do the Work, Nodes Coordinate It

The nodes themselves are intentionally lightweight.

They do not:

* load files
* manipulate raw data structures
* implement scoring logic
* perform domain-specific calculations

Instead, they **delegate work to utilities** that encapsulate reusable business logic.

This creates a clean division of responsibility:

* utilities implement *how* things are done
* nodes decide *when* and *in what order* they are done

---

## Why This Pattern Matters

This separation delivers several important benefits:

### 1. Clarity

Each node has a single purpose. Readers can understand the workflow without digging into implementation details.

### 2. Reusability

Utilities can be reused across nodes, workflows, or even other agents without duplication.

### 3. Testability

Utilities can be tested independently from orchestration logic, improving reliability.

### 4. Traceability

Because all outputs are written to state, the full evaluation story can be reconstructed at any point.

---

## State Accumulates Knowledge Over Time

As the workflow progresses, state gradually becomes richer:

* early nodes add goals and plans
* data loading nodes add scenarios and context
* execution nodes add evaluation records
* scoring nodes add metrics and classifications
* reporting nodes add summaries and artifacts

State is not just storage; it is the **memory of the evaluation run**.

---

## Why This Design Scales

This pattern scales cleanly because:

* new nodes can be added without breaking existing ones
* new utilities can be introduced without changing orchestration
* additional metrics can be attached to state without refactoring

As the system grows, complexity is distributed rather than tangled.

---

## A Simple Mental Model

A useful way to think about the orchestrator is:

> **Utilities do the work.
> Nodes coordinate the work.
> State remembers the work.**

That mental model makes the entire system easier to reason about and easier to explain to others.

---

## Why Leaders Care About This

From a governance perspective, this architecture ensures:

* transparency into how results were produced
* clear boundaries between data, logic, and decisions
* confidence that evaluations are repeatable and explainable

This is exactly what’s required when AI systems move beyond experimentation and into operational use.


In [None]:
def scoring_node(
    state: EvalAsServiceOrchestratorState,
    config: EvalAsServiceOrchestratorConfig
) -> Dict[str, Any]:
    """Score evaluations by comparing actual outputs to expected outcomes."""
    evaluations = state.get('executed_evaluations', [])
    scenarios = state.get('journey_scenarios', [])
    scenario_lookup = build_scenario_lookup(scenarios)

    evaluation_scores = []
    # Copy so that appending below does not mutate the list held in state
    errors = list(state.get('errors', []))

    for evaluation in evaluations:
        scenario_id = evaluation.get('scenario_id')
        scenario = scenario_lookup.get(scenario_id)

        if not scenario:
            errors.append(f"Scenario {scenario_id} not found for scoring")
            continue

        expected_outcome = scenario.get('expected_outcome')
        expected_resolution_path = scenario.get('expected_resolution_path', [])

        try:
            score = score_evaluation(
                evaluation,
                expected_outcome,
                expected_resolution_path,
                config.scoring_weights,
                config.pass_threshold
            )

            score['scenario_id'] = scenario_id
            score['target_agent_id'] = evaluation.get('target_agent_id')

            evaluation_scores.append(score)

        except Exception as e:
            errors.append(f"Scoring error for scenario {scenario_id}: {str(e)}")

    return {
        "evaluation_scores": evaluation_scores,
        "errors": errors
    }


# Scoring Node: Where Evidence Becomes Judgment

The `scoring_node` is responsible for converting executed evaluations into **explicit performance assessments**.

At this point in the workflow:

* execution has already happened
* inputs and outputs are known
* timing has been measured

What remains is to answer a critical question:

**Did the agent meet expectations?**

---

## Scoring Is Context-Aware, Not Isolated

Rather than scoring evaluations in isolation, the scoring node reconnects each evaluation to its original scenario.

By building a `scenario_lookup`, the system ensures that:

* expected outcomes are retrieved reliably
* scoring logic is grounded in scenario intent
* evaluations are judged against the correct expectations

This avoids generic or one-size-fits-all scoring.

---

## One Evaluation, One Scorecard

For each evaluation record, the node:

* retrieves the relevant scenario
* extracts expected outcomes and resolution paths
* applies the scoring function using configured weights and thresholds

Each evaluation produces a **complete scorecard** that includes:

* correctness
* response time
* output quality
* overall score
* pass/fail status
* recorded issues

This makes every judgment explainable and traceable.

---

## Configuration Drives Standards

Scoring behavior is controlled entirely through configuration:

* scoring weights reflect business priorities
* pass thresholds define acceptable performance

The scoring node does not decide *what good means*; it enforces the standards defined in configuration.

This separation ensures that:

* evaluation rules are consistent
* changes are intentional
* results remain comparable over time
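
One way weights and a pass threshold could combine is a normalized weighted average. The internals of `score_evaluation` are not shown in this notebook, so the function below, the weight values, and the `0.8` threshold are all assumptions for illustration. Only the component names echo the scorecard fields listed above.

```python
from typing import Dict


def weighted_score(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of component scores, normalized by total weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(components.get(name, 0.0) * weight for name, weight in weights.items())
    return weighted_sum / total_weight


# Assumed configuration values, not the project's real ones.
weights = {"correctness": 0.5, "response_time": 0.25, "output_quality": 0.25}
pass_threshold = 0.8

components = {"correctness": 1.0, "response_time": 0.5, "output_quality": 1.0}
score = weighted_score(components, weights)
passed = score >= pass_threshold
print(score, passed)
```

Changing only `weights` or `pass_threshold` changes the verdict without touching the scoring function, which is the property the configuration-driven design relies on.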

---

## Deterministic by Design

The scoring node applies deterministic logic to deterministic inputs.

Given the same:

* evaluation record
* scenario expectations
* scoring configuration

the outcome will always be the same.

This is what allows scores to be:

* trusted
* tracked over time
* used for trend and drift analysis
* relied upon in reports and dashboards

---

## Failures Are Explicit, Not Silent

If a scenario cannot be found or scoring fails:

* the issue is recorded
* scoring continues for other evaluations
* partial results remain usable

This ensures robustness without hiding problems.

Transparent error handling is essential for governance and operational confidence.

---

## What This Node Produces

The output of this node is a list of **evaluation scores**, each tied to:

* a specific scenario
* a specific agent
* a specific execution

These scores become the foundation for:

* agent performance summaries
* health classifications
* system-wide metrics
* executive reporting

---

## Why This Layer Matters So Much

This node represents a key architectural boundary:

> **Before this point, the system observes behavior.
> After this point, the system evaluates it.**

By keeping that boundary explicit and deterministic, the orchestrator avoids the ambiguity that undermines trust in many AI systems.

---

## The Bigger Pattern Continues

This node follows the same pattern used throughout the system:

* state provides context
* utilities implement logic
* nodes coordinate workflow
* outputs are written back to state

That consistency makes the system easier to reason about and easier to extend.




In [None]:
def performance_analysis_node(
    state: EvalAsServiceOrchestratorState,
    config: EvalAsServiceOrchestratorConfig
) -> Dict[str, Any]:
    """Analyze agent performance and generate summaries."""
    agents = state.get('specialist_agents', {})
    evaluations = state.get('executed_evaluations', [])
    scores = state.get('evaluation_scores', [])

    agent_performance_summaries = []

    # Calculate performance for each agent
    for agent_id in agents.keys():
        summary = calculate_agent_performance_summary(
            agent_id,
            evaluations,
            scores,
            config.health_thresholds
        )
        agent_performance_summaries.append(summary)

    # Calculate overall evaluation summary
    total_scenarios = len(state.get('journey_scenarios', []))
    total_evaluations = len(evaluations)
    total_passed = sum(1 for s in scores if s.get('passed', False))
    total_failed = len(scores) - total_passed
    overall_pass_rate = total_passed / len(scores) if scores else 0.0
    average_score = sum(s.get('overall_score', 0.0) for s in scores) / len(scores) if scores else 0.0

    healthy_agents = sum(1 for s in agent_performance_summaries if s.get('health_status') == 'healthy')
    degraded_agents = sum(1 for s in agent_performance_summaries if s.get('health_status') == 'degraded')
    critical_agents = sum(1 for s in agent_performance_summaries if s.get('health_status') == 'critical')

    evaluation_summary = {
        "total_scenarios": total_scenarios,
        "total_evaluations": total_evaluations,
        "total_passed": total_passed,
        "total_failed": total_failed,
        "overall_pass_rate": overall_pass_rate,
        "average_score": average_score,
        "agents_evaluated": len(agents),
        "healthy_agents": healthy_agents,
        "degraded_agents": degraded_agents,
        "critical_agents": critical_agents
    }

    # Workflow analysis (using toolshed)
    workflow_analysis = []
    if config.enable_workflow_analysis:
        for summary in agent_performance_summaries:
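            # Use failure rate as metric for workflow health; guard against agents
            # whose summary reports zero evaluations (the key exists but is 0)
            total_for_agent = summary.get('total_evaluations', 0)
            failure_rate = (summary.get('failed_count', 0) / total_for_agent) * 100 if total_for_agent else 0.0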

            workflow = {
                "workflow_id": f"eval_{summary.get('agent_id')}",
                "agent_id": summary.get('agent_id'),
                "failure_rate_7d": failure_rate
            }

            # Use workflow health analysis
            thresholds = {
                "healthy": 10.0,    # <= 10% failure rate
                "degraded": 30.0,   # 10-30% failure rate
                "critical": 30.0    # > 30% failure rate
            }

            analysis = analyze_workflow_health(workflow, thresholds)
            workflow_analysis.append(analysis)

    # Performance metrics (using toolshed)
    performance_metrics = {}
    if config.enable_performance_tracking:
        metrics_config = create_metrics_config(
            metrics={
                "evaluation_time": {"threshold": 2.0, "unit": "seconds"},
                "pass_rate": {"threshold": 0.80, "unit": "ratio"}
            }
        )

        avg_eval_time = sum(e.get('execution_time_seconds', 0.0) for e in evaluations) / len(evaluations) if evaluations else 0.0

        performance_metrics = {
            "average_evaluation_time": avg_eval_time,
            "overall_pass_rate": overall_pass_rate,
            "metrics_config": metrics_config
        }

    return {
        "agent_performance_summary": agent_performance_summaries,
        "evaluation_summary": evaluation_summary,
        "workflow_analysis": workflow_analysis,
        "performance_metrics": performance_metrics
    }



# Performance Analysis: From Scores to System Insight

The `performance_analysis_node` is responsible for turning individual evaluation scores into **meaningful summaries** at both the agent level and the system level.

This is the point where raw metrics become insight.

---

## Agent-Level Performance: Clear Accountability

The node begins by calculating a performance summary for each agent using previously scored evaluations.

Each summary answers practical questions:

* How many times was this agent evaluated?
* How often did it pass or fail?
* What is its average score?
* How fast does it typically respond?
* What is its overall health status?

These summaries provide a **clean, consistent performance profile** for every agent in the system.

Because health classification is driven by explicit thresholds, agent health is:

* deterministic
* comparable across agents
* stable over time

This makes it easy to identify which agents are reliable and which require attention.

---

## System-Level Evaluation Summary

Beyond individual agents, the node computes a **system-wide evaluation summary** that captures overall performance at a glance.

This includes:

* total scenarios evaluated
* total evaluation runs
* pass and fail counts
* overall pass rate
* average score across all evaluations
* distribution of agent health states

This summary is designed to answer executive-level questions quickly, without digging into details.

---

## Translating Failures into Workflow Health

When workflow analysis is enabled, the node goes one step further.

Instead of treating failures as isolated events, it:

* computes failure rates per agent
* interprets those rates using health thresholds
* classifies workflows as healthy, degraded, or critical

This reframes performance issues as **workflow health signals**, which are easier to reason about and easier to act on.
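
The classification step can be sketched with the thresholds written in the node's comments (up to 10% healthy, up to 30% degraded, above that critical). How `analyze_workflow_health` actually interprets its `thresholds` argument is not shown here, so `classify_failure_rate` is a hypothetical reading of those comments and not that function.

```python
def failure_rate_pct(failed: int, total: int) -> float:
    """Failure rate on a 0-100 scale; zero evaluations yields 0.0."""
    if total <= 0:
        return 0.0
    return 100.0 * failed / total


def classify_failure_rate(rate_pct: float, healthy_max: float = 10.0, degraded_max: float = 30.0) -> str:
    """Map a failure rate to a health label using inclusive upper bounds."""
    if rate_pct <= healthy_max:
        return "healthy"
    if rate_pct <= degraded_max:
        return "degraded"
    return "critical"


for failed, total in [(1, 20), (3, 12), (4, 10)]:
    rate = failure_rate_pct(failed, total)
    print(f"{failed}/{total} -> {rate:.1f}% -> {classify_failure_rate(rate)}")
```

Note that an agent with no evaluations gets a 0% failure rate and is therefore labeled healthy under this reading. A production version might prefer a separate "not evaluated" label.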

---

## Performance Metrics as Operational Signals

The node also computes performance metrics that describe how the evaluation system itself is behaving.

These include:

* average evaluation execution time
* overall pass rate
* defined performance thresholds

These metrics help ensure that the evaluation process remains efficient and predictable, not just accurate.

---

## Determinism Preserved at Every Level

Importantly, this node does not introduce new judgment logic.

It:

* aggregates deterministic scores
* applies deterministic thresholds
* produces deterministic summaries

As a result, performance summaries remain reproducible and trustworthy. Changes in metrics reflect real changes in behavior, not analytical drift.

---

## Why This Node Matters

This node is where the orchestrator becomes truly useful beyond engineering teams.

It enables:

* rapid assessment of agent reliability
* early detection of performance degradation
* clear communication of risk
* confidence in scaling AI systems

It transforms evaluation from a technical exercise into an **operational capability**.

---

## The Architectural Pattern Holds

This node continues the system’s core pattern:

* state provides context
* utilities perform calculations
* configuration defines standards
* nodes coordinate flow
* outputs are written back to state

That consistency is what allows insights to scale without adding complexity.




In [None]:
def report_generation_node(
    state: EvalAsServiceOrchestratorState,
    config: EvalAsServiceOrchestratorConfig
) -> Dict[str, Any]:
    """Generate comprehensive evaluation report."""
    from toolshed.reporting import generate_mission_report, save_report

    summary = state.get('evaluation_summary', {})
    agent_summaries = state.get('agent_performance_summary', [])
    scores = state.get('evaluation_scores', [])
    evaluations = state.get('executed_evaluations', [])

    # Build report sections
    report_sections = []

    # Executive Summary
    report_sections.append("## Executive Summary\n\n")
    report_sections.append(f"- **Total Scenarios Evaluated:** {summary.get('total_scenarios', 0)}\n")
    report_sections.append(f"- **Total Evaluations:** {summary.get('total_evaluations', 0)}\n")
    report_sections.append(f"- **Overall Pass Rate:** {summary.get('overall_pass_rate', 0.0):.1%}\n")
    report_sections.append(f"- **Average Score:** {summary.get('average_score', 0.0):.2f}\n")
    report_sections.append(f"- **Healthy Agents:** {summary.get('healthy_agents', 0)}\n")
    report_sections.append(f"- **Degraded Agents:** {summary.get('degraded_agents', 0)}\n")
    report_sections.append(f"- **Critical Agents:** {summary.get('critical_agents', 0)}\n\n")

    # Agent Performance Details
    report_sections.append("## Agent Performance Details\n\n")
    for agent_summary in agent_summaries:
        report_sections.append(f"### {agent_summary.get('agent_id')}\n\n")
        report_sections.append(f"- **Status:** {agent_summary.get('health_status', 'unknown')}\n")
        report_sections.append(f"- **Total Evaluations:** {agent_summary.get('total_evaluations', 0)}\n")
        report_sections.append(f"- **Passed:** {agent_summary.get('passed_count', 0)}\n")
        report_sections.append(f"- **Failed:** {agent_summary.get('failed_count', 0)}\n")
        report_sections.append(f"- **Average Score:** {agent_summary.get('average_score', 0.0):.2f}\n")
        report_sections.append(f"- **Average Response Time:** {agent_summary.get('average_response_time', 0.0):.2f}s\n\n")

    # Evaluation Results
    report_sections.append("## Evaluation Results\n\n")
    report_sections.append("| Scenario | Agent | Score | Passed | Issues |\n")
    report_sections.append("|----------|-------|-------|--------|--------|\n")

    for score in scores[:20]:  # Limit to first 20 for readability
        scenario_id = score.get('scenario_id', 'N/A')
        agent_id = score.get('target_agent_id', 'N/A')
        overall_score = score.get('overall_score', 0.0)
        passed = "✓" if score.get('passed', False) else "✗"
        issues = ", ".join(score.get('issues', []))[:50]  # Truncate long issues
        report_sections.append(f"| {scenario_id} | {agent_id} | {overall_score:.2f} | {passed} | {issues} |\n")

    if len(scores) > 20:
        report_sections.append(f"\n*Showing first 20 of {len(scores)} evaluations*\n\n")

    # Combine report
    report = "# Evaluation-as-a-Service Report\n\n"
    report += f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n"
    report += "".join(report_sections)

    # Save report if enabled
    report_file_path = None
    # Copy existing errors so a save failure is recorded even when
    # 'errors' was never set in state (appending to a default list is lost)
    errors = list(state.get('errors', []))
    if config.enable_reporting:
        try:
            report_file_path = save_report(
                report_content=report,
                reports_dir=config.reports_dir,
                report_name="evaluation_report"
            )
        except Exception as e:
            errors.append(f"Report saving error: {e}")

    return {
        "evaluation_report": report,
        "report_file_path": report_file_path,
        "errors": errors
    }


# Report Generation: Turning Metrics into Meaning

The `report_generation_node` is responsible for transforming evaluation data into a **clear, readable report** that can be shared with business leaders, managers, or clients.

This is the moment where the system stops thinking like a machine and starts **communicating like an analyst**.

---

## Why Reporting Is a Separate Node

Report generation is intentionally isolated from:

* execution
* scoring
* analysis

By the time this node runs:

* all decisions have already been made
* all metrics are finalized
* all classifications are deterministic

The report does not influence outcomes — it **explains them**.

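This separation is critical for trust: because the node only reads finalized state, a change to report formatting can never alter a score or a classification.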

---

## Executive Summary Comes First (On Purpose)

The report opens with a concise **Executive Summary** that answers the most important questions immediately:

* How much was evaluated?
* How well did the system perform?
* How many agents are healthy, degraded, or critical?

This mirrors how leaders actually consume information:

* high-level first
* details only if needed

Nothing in this section is interpretive: every figure is read directly from `evaluation_summary` in state, with missing keys falling back to zero.

---

## Agent-Level Transparency

The next section breaks performance down **agent by agent**.

For each agent, the report shows:

* health status
* evaluation volume
* pass/fail counts
* average score
* average response time

This creates clear ownership and accountability.
Each agent has a visible performance profile, grounded in deterministic evaluation.

---

## Evidence Is Still Available

For readers who want to go deeper, the report includes a table of individual evaluation results:

* scenario
* agent
* score
* pass/fail
* issues encountered

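This preserves traceability without overwhelming the report. The table shows only the first 20 results and truncates each issue list to 50 characters, and it adds a note whenever results are omitted. The complete set remains available in `evaluation_scores` in state.

The report summarizes, but it never hides the evidence.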
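
Free-text issue descriptions can contain characters that break a Markdown table. The sketch below shows one way a row could be rendered safely; `format_result_row` is a hypothetical helper, not part of the node, and the sample IDs are made up for illustration.

```python
from typing import Any, Dict


def format_result_row(score: Dict[str, Any], max_issue_chars: int = 50) -> str:
    """Render one evaluation result as a Markdown table row."""
    issues = ", ".join(str(issue) for issue in score.get("issues", []))
    issues = issues.replace("\n", " ")
    # Truncate first, then escape, so an escape sequence is never cut in half.
    if len(issues) > max_issue_chars:
        issues = issues[: max_issue_chars - 3] + "..."
    issues = issues.replace("|", "\\|")
    passed = "✓" if score.get("passed", False) else "✗"
    return (
        f"| {score.get('scenario_id', 'N/A')} "
        f"| {score.get('target_agent_id', 'N/A')} "
        f"| {score.get('overall_score', 0.0):.2f} "
        f"| {passed} | {issues} |"
    )


row = format_result_row(
    {
        "scenario_id": "S001",          # illustrative ID
        "target_agent_id": "agent_a",   # illustrative ID
        "overall_score": 0.5,
        "passed": False,
        "issues": ["latency | timeout"],
    }
)
print(row)
```

Marking truncated text with an ellipsis also tells the reader that more detail exists in the underlying scores.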

---

## Deterministic Data, Human-Friendly Presentation

One of the most important aspects of this node is **what it does not do**.

It does not:

* reinterpret scores
* adjust thresholds
* “smooth” results
* make new judgments

It simply presents what the system already knows in a format humans can understand.

This aligns perfectly with your guiding principle:

> **Deterministic systems earn trust.
> Probabilistic systems add insight.**

Here, the report is informational — not authoritative.

---

## Where an LLM Fits (Optional, Not Required)

This node is an ideal place to *optionally* introduce an LLM **without compromising consistency**.

For example, an LLM could:

* summarize trends across runs
* explain why certain agents are degraded
* highlight unusual patterns
* suggest investigation areas

Crucially:

* the LLM would not change scores
* the LLM would not decide pass/fail
* the LLM would not alter health status

It would act as a **commentator**, not a judge.
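
A minimal sketch of that commentator role follows, assuming the LLM is wrapped in a plain callable that takes the report text and returns a string. `add_commentary`, `stub_commentator`, and the `report_commentary` key are hypothetical names; no real model API is called.

```python
from typing import Any, Callable, Dict


def add_commentary(
    report_state: Dict[str, Any],
    commentator: Callable[[str], str],
) -> Dict[str, Any]:
    """Attach narrative commentary without touching any computed result."""
    report = report_state.get("evaluation_report", "")
    commentary = commentator(report)
    # Only a new key is returned; scores, pass/fail flags, and health
    # status are never written by this step.
    return {"report_commentary": commentary}


def stub_commentator(report: str) -> str:
    """Stand-in for an LLM call; a real implementation would query a model here."""
    return f"Commentary on a report of {len(report)} characters."


state = {
    "evaluation_report": "# Evaluation-as-a-Service Report\n",
    "evaluation_scores": [{"overall_score": 0.9, "passed": True}],
}
update = add_commentary(state, stub_commentator)
```

Keeping the commentary under its own key means it can be dropped or regenerated at any time without affecting the deterministic record.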

---

## Reports as Durable Artifacts

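When `config.enable_reporting` is set, the report is saved to `config.reports_dir` as a persistent artifact:

* timestamped (the report body records when it was generated)
* reproducible (apart from that timestamp, the same state yields the same report)
* auditable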

This allows reports to be:

* shared with stakeholders
* stored for compliance
* compared across time
* used as inputs to dashboards

Reports become part of the system’s memory, not just transient output.
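
The internals of `save_report` are not shown in this notebook, so the following is only a sketch of what a timestamped save could look like using the standard library. `save_timestamped_report` and its file-naming scheme are assumptions, not the `toolshed.reporting` implementation.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def save_timestamped_report(content: str, reports_dir: str, name: str, now: datetime) -> str:
    """Write the report to <reports_dir>/<name>_<timestamp>.md and return the path."""
    directory = Path(reports_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}_{now.strftime('%Y%m%d_%H%M%S')}.md"
    path.write_text(content, encoding="utf-8")
    return str(path)


# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_timestamped_report(
        "# Evaluation-as-a-Service Report\n",
        tmp,
        "evaluation_report",
        datetime(2024, 1, 2, 3, 4, 5),  # fixed time keeps the example repeatable
    )
    saved_name = Path(saved).name
    saved_text = Path(saved).read_text(encoding="utf-8")
```

Passing the timestamp in as an argument, rather than reading the clock inside the function, keeps the helper deterministic and easy to test.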

---

## Why This Node Completes the System

This node completes a full, end-to-end loop:

1. Data is loaded consistently
2. Behavior is executed predictably
3. Performance is scored deterministically
4. Health is classified explicitly
5. Results are communicated clearly

Nothing is hidden.
Nothing is subjective.
Nothing is left unexplained.

---

## The Bigger Takeaway

This report generation step is where your architectural philosophy fully pays off.

Because everything upstream is:

* structured
* deterministic
* standardized

the reporting layer can stay:

* simple
* readable
* trustworthy

That is exactly what business leaders want from AI systems — not magic, but **clarity**.

---

At this point, your orchestrator is complete:

* utilities define rules
* nodes orchestrate flow
* state carries truth
* reports communicate outcomes

What you’ve built is not just an agent — it’s a **governance-ready evaluation system**.

