
# EaaS Agent: Next Steps Learning Plan

**Purpose:** Strategic learning path to improve the agent while maximizing learning value.

---

## 🎯 Recommended Learning Path

### **Phase 1: Pattern Detection (Highest Value for Learning)** ⭐ START HERE

**Why this first:**
- Demonstrates orchestrator value immediately
- Teaches how to analyze across dimensions (agents × scenarios × outcomes)
- Shows how state design enables insights
- Creates "aha!" moments about orchestrator power

**What to build:**
- Add pattern detection to `scoring_node`
- Detect: "Both agents fail on neutral sentiment"
- Detect: "All agents are slow on scenario type X"
- Detect: "Agent A fails where Agent B succeeds" (complementary patterns)
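The complementary-pattern check in the last bullet could be sketched as follows. This is a minimal illustration, assuming each result carries the `agent_id`, `scenario_id`, `actual_output`, and `expected_output` keys used elsewhere in this notebook; the helper name `find_complementary_scenarios` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def find_complementary_scenarios(
    results: List[Dict[str, Any]], agent_a: str, agent_b: str
) -> List[str]:
    """Return scenario IDs where agent_a fails but agent_b succeeds (hypothetical helper)."""
    # Map scenario_id -> {agent_id: was the answer correct?}
    outcome: Dict[str, Dict[str, bool]] = defaultdict(dict)
    for r in results:
        outcome[r["scenario_id"]][r["agent_id"]] = (
            r["actual_output"] == r["expected_output"]
        )

    # Keep only scenarios with a recorded failure for A and a recorded success for B
    return sorted(
        sid
        for sid, by_agent in outcome.items()
        if by_agent.get(agent_a) is False and by_agent.get(agent_b) is True
    )


# Made-up sample data for illustration
sample_results = [
    {"agent_id": "agent_001", "scenario_id": "c001", "actual_output": "positive", "expected_output": "negative"},
    {"agent_id": "agent_002", "scenario_id": "c001", "actual_output": "negative", "expected_output": "negative"},
    {"agent_id": "agent_001", "scenario_id": "c002", "actual_output": "positive", "expected_output": "positive"},
    {"agent_id": "agent_002", "scenario_id": "c002", "actual_output": "positive", "expected_output": "positive"},
]
complementary = find_complementary_scenarios(sample_results, "agent_001", "agent_002")
print(complementary)  # ['c001']
```

Scenarios returned here are candidates for routing: one agent could cover the other's blind spot.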

**Learning outcomes:**
- How to analyze multi-dimensional data
- How to find patterns across agents
- How to structure state to enable pattern detection
- This is the "orchestrator insight" that makes it valuable

---

### **Phase 2: Improve Execution Node (Orchestration Logic)** ⭐ SECOND

**Why this second:**
- Now that you understand what insights you need, you can design better execution
- Teaches orchestration coordination patterns
- Makes the agent more realistic/functional

**What to build:**
- Better mock agents that actually analyze input text (simple keyword-based for MVP)
- Or: Connect to real agent functions/endpoints
- Handle errors, timeouts, retries
- Capture more metadata (tokens used, confidence scores, etc.)
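A keyword-based mock agent of the kind described in the first bullet might look like the sketch below. The keyword lists, the function name `keyword_mock_agent`, and the confidence formula are all assumptions made for illustration; they are not part of the existing agent.

```python
import time
from typing import Any, Dict

# Hypothetical keyword lists, chosen only for illustration
POSITIVE_WORDS = {"great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"terrible", "hate", "awful", "angry"}


def keyword_mock_agent(text: str) -> Dict[str, Any]:
    """Classify sentiment by counting keyword hits (illustrative mock, not a real model)."""
    start = time.perf_counter()

    # Normalize tokens: lowercase and strip surrounding punctuation
    words = {w.strip(".,!?").lower() for w in text.split()}
    pos_hits = len(words & POSITIVE_WORDS)
    neg_hits = len(words & NEGATIVE_WORDS)

    if pos_hits > neg_hits:
        label = "positive"
    elif neg_hits > pos_hits:
        label = "negative"
    else:
        label = "neutral"  # no hits, or a tie

    total_hits = pos_hits + neg_hits
    # Crude confidence: margin between classes, 0.5 when there is no evidence
    confidence = abs(pos_hits - neg_hits) / total_hits if total_hits else 0.5

    return {
        "actual_output": label,
        "confidence": confidence,
        "latency_ms": (time.perf_counter() - start) * 1000,
    }


print(keyword_mock_agent("I love this, it is great!")["actual_output"])  # positive
print(keyword_mock_agent("It's fine, I guess")["actual_output"])         # neutral
```

Unlike a mock that returns "positive" for everything, this one can at least succeed on neutral inputs, which makes the pattern detection results more informative.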

**Learning outcomes:**
- How to coordinate multiple agents
- How to handle failures gracefully
- How to structure agent execution logic
- This is the "orchestration mechanics"

---

### **Phase 3: State Design Evolution (Continuous Learning)** ⭐ THROUGHOUT

**Why throughout:**
- State design becomes clearer as you add features
- You'll see what relationships need to be captured
- You'll understand data flow better

**What to learn:**
- What data flows through each node
- What relationships need to be captured (agent → scenario → result → pattern)
- How state evolves as you add features
- How to design state for future features

**Learning outcomes:**
- State design patterns for orchestrators
- How to plan state evolution
- How to capture relationships, not just data

---

## 🎓 Why This Order?

### Pattern Detection First Because:

1. **Demonstrates Value Immediately**
   - You'll see: "Oh! Both agents fail on neutral sentiment"
   - This is the orchestrator "aha!" moment
   - Shows why orchestrators are valuable

2. **Teaches Multi-Dimensional Analysis**
   - You'll learn to analyze: agents × scenarios × outcomes
   - This is a core orchestrator skill
   - Pattern detection = orchestrator superpower

3. **Informs State Design**
   - As you build pattern detection, you'll see what state you need
   - "I need scenario metadata to detect patterns"
   - "I need to group results by scenario type"
   - State design becomes clearer through practice

4. **Informs Execution Design**
   - Once you know what patterns you want, you can design better execution
   - "I need to capture confidence scores for pattern detection"
   - "I need to track which scenario categories fail"

### Execution Node Second Because:

1. **You Know What You Need**
   - After building pattern detection, you know what data to capture
   - You can design execution to provide that data

2. **Orchestration Logic is Simpler**
   - The coordination (agents × scenarios) is already there
   - You're just making it more realistic
   - Less conceptual, more implementation

3. **State Design is Clearer**
   - You've already evolved state for pattern detection
   - Now you're just enriching it with execution metadata

---

## 📊 What Each Phase Teaches

### Phase 1: Pattern Detection
```
Current: "Agent A: 40% accurate, Agent B: 0% accurate"
After:   "Both agents fail on neutral sentiment - systemic issue"
         "Agent A is 3x faster but less accurate"
         "All agents struggle with edge case X"
```

**Learning:**
- Multi-dimensional analysis
- Cross-agent pattern recognition
- How to structure queries across dimensions
- This is orchestrator value creation

### Phase 2: Execution Node
```
Current: Mock returns "positive" for everything
After:   Analyzes input text, returns realistic responses
         Handles errors gracefully
         Captures metadata (confidence, tokens, etc.)
```

**Learning:**
- Agent coordination patterns
- Error handling in orchestration
- Metadata capture for analysis
- This is orchestration mechanics

### Phase 3: State Design
```
Current: Basic state structure
After:   State captures relationships:
         - agent → scenario → result → pattern
         - scenario metadata for grouping
         - cross-agent comparison data
```

**Learning:**
- How to design state for insights
- How to capture relationships
- How state evolves with features
- This is orchestrator foundation

---

## 🚀 Implementation Plan

### Step 1: Add Pattern Detection to Scoring Node

**What to detect:**
1. **Scenario-level patterns:**
   - Which scenarios do all agents fail on?
   - Which scenarios do all agents succeed on?
   - What's the failure rate by scenario category?

2. **Cross-agent patterns:**
   - Do agents fail on the same scenarios?
   - Are there complementary patterns? (A fails where B succeeds)
   - Which agents are most/least consistent?

3. **Performance patterns:**
   - Are slow scenarios also inaccurate?
   - Do certain scenario types cause timeouts?
   - Is there a speed/accuracy trade-off?
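As one example of the scenario-level questions above, the failure rate by scenario category could be computed roughly like this. It is a sketch under the assumption that scenarios store their category at `metadata.category`, as the scoring node expects; the function name `failure_rate_by_category` and the category values are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def failure_rate_by_category(
    results: List[Dict[str, Any]], scenarios: List[Dict[str, Any]]
) -> Dict[str, float]:
    """Fraction of incorrect results per scenario category (hypothetical helper)."""
    # Look up each scenario's category once
    category_of = {
        s["id"]: s.get("metadata", {}).get("category", "unknown") for s in scenarios
    }

    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for r in results:
        category = category_of.get(r["scenario_id"], "unknown")
        totals[category] += 1
        if r["actual_output"] != r["expected_output"]:
            failures[category] += 1

    return {category: failures[category] / totals[category] for category in totals}


# Made-up sample data: two agents, two scenarios
demo_scenarios = [
    {"id": "c001", "metadata": {"category": "clear"}},
    {"id": "c003", "metadata": {"category": "ambiguous"}},
]
demo_results = [
    {"agent_id": "agent_001", "scenario_id": "c001", "actual_output": "positive", "expected_output": "positive"},
    {"agent_id": "agent_002", "scenario_id": "c001", "actual_output": "positive", "expected_output": "positive"},
    {"agent_id": "agent_001", "scenario_id": "c003", "actual_output": "positive", "expected_output": "neutral"},
    {"agent_id": "agent_002", "scenario_id": "c003", "actual_output": "positive", "expected_output": "neutral"},
]
rates = failure_rate_by_category(demo_results, demo_scenarios)
print(rates)  # {'clear': 0.0, 'ambiguous': 1.0}
```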

**State additions needed:**
```python
failure_analysis: List[Dict[str, Any]]
# Structure:
# [
#   {
#     "pattern_type": "scenario_failure",
#     "description": "All agents fail on neutral sentiment",
#     "scenarios": ["c003", "c006", "c008"],
#     "failure_rate": 1.0,
#     "agents_affected": ["agent_001", "agent_002"],
#     "recommendation": "Improve neutral sentiment detection"
#   }
# ]
```

### Step 2: Improve Execution Node

**What to improve:**
1. **Better mock agents:**
   - Simple keyword-based classification
   - Actually analyze input text
   - Return realistic responses

2. **Metadata capture:**
   - Confidence scores
   - Tokens used
   - Processing time breakdown

3. **Error handling:**
   - Timeout handling
   - Retry logic
   - Graceful degradation
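The retry and graceful-degradation items could be approached along these lines. This is only a sketch: `run_with_retries` and the result keys `attempts` and `error` are hypothetical, and timeout handling is deliberately left out to keep the example small.

```python
from typing import Any, Callable, Dict


def run_with_retries(
    agent_fn: Callable[[str], str], text: str, max_attempts: int = 3
) -> Dict[str, Any]:
    """Call agent_fn, retrying on exceptions; never raises (hypothetical helper)."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"actual_output": agent_fn(text), "attempts": attempt, "error": None}
        except Exception as exc:  # broad on purpose: any agent failure triggers a retry
            last_error = str(exc)

    # Graceful degradation: return an error record instead of crashing the run
    return {"actual_output": None, "attempts": max_attempts, "error": last_error}


# A made-up agent that fails twice before succeeding
calls = {"count": 0}


def flaky_agent(text: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "positive"


def broken_agent(text: str) -> str:
    raise RuntimeError("down")


recovered = run_with_retries(flaky_agent, "I love it")
degraded = run_with_retries(broken_agent, "I love it", max_attempts=2)
print(recovered)
print(degraded)
```

Returning an error record rather than raising lets the scoring node still see one result per agent and scenario, so a single failing agent does not hide the patterns in the rest of the run.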

### Step 3: Evolve State Design

**What to add:**
1. **Relationship tracking:**
   - Scenario → category mapping
   - Agent → scenario → result links
   - Pattern → affected agents/scenarios

2. **Metadata enrichment:**
   - Scenario categories
   - Agent capabilities
   - Historical comparisons

---

## 💡 Key Learning Principles

1. **Start with Value, Then Mechanics**
   - Pattern detection shows value
   - Execution improvement is mechanics
   - State design is the foundation (learned throughout)

2. **Learn Through Building**
   - You'll understand state design better as you build features
   - Pattern detection will show you what state you need
   - Execution improvement will show you what to capture

3. **Orchestrator Insights = Multi-Dimensional Analysis**
   - Not just "Agent A is 40% accurate"
   - But "Agents fail on neutral sentiment" (cross-agent pattern)
   - This is what makes orchestrators valuable

---

## 🎯 Success Criteria

### After Phase 1 (Pattern Detection):
- ✅ Can detect: "Both agents fail on scenario type X"
- ✅ Can detect: "Agent A is faster but less accurate"
- ✅ Can detect: "All agents struggle with edge cases"
- ✅ Report includes "Orchestrator Insights" section

### After Phase 2 (Execution Node):
- ✅ Mock agents analyze input text realistically
- ✅ Captures metadata (confidence, tokens, etc.)
- ✅ Handles errors gracefully
- ✅ Can connect to real agent functions

### After Phase 3 (State Design):
- ✅ State captures relationships clearly
- ✅ Easy to add new pattern types
- ✅ State supports future features
- ✅ Clear data flow through nodes

---

## 📚 Related Learning

- **Orchestrator Guide:** `docs/guides/agent_patterns/ORCHESTRATOR_AGENTS_GUIDE.md`
- **Learning Review:** `docs/guides/eaas/LEARNING_REVIEW.md`
- **Development Workflow:** `docs/guides/development/DEVELOPMENT_WORKFLOW.md`

---

*This plan balances learning value with practical implementation. Start with pattern detection to see orchestrator value, then improve execution mechanics.*



# Scoring Node

"""Scoring Node - Scores and analyzes evaluation results"""

import logging
from typing import Dict, Any, List
from collections import defaultdict
from config import EaaSState

logger = logging.getLogger(__name__)


def _calculate_accuracy(results: List[Dict[str, Any]]) -> float:
    """Calculate accuracy score (correct / total)"""
    if len(results) == 0:
        return 0.0

    correct = sum(1 for r in results if r.get("actual_output") == r.get("expected_output"))
    return correct / len(results)


def _calculate_latency_metrics(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Calculate latency percentiles"""
    latencies = [r.get("latency_ms", 0) for r in results if r.get("latency_ms", 0) > 0]

    if len(latencies) == 0:
        return {"p50": 0, "p95": 0, "avg": 0}

    sorted_latencies = sorted(latencies)
    p50_idx = int(len(sorted_latencies) * 0.5)
    p95_idx = int(len(sorted_latencies) * 0.95)

    return {
        "p50": sorted_latencies[p50_idx] if p50_idx < len(sorted_latencies) else sorted_latencies[-1],
        "p95": sorted_latencies[p95_idx] if p95_idx < len(sorted_latencies) else sorted_latencies[-1],
        "avg": sum(sorted_latencies) / len(sorted_latencies)
    }


def _detect_scenario_failure_patterns(
    evaluation_results: List[Dict[str, Any]],
    evaluation_data: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """
    Detect scenarios where all agents fail (systemic issues).

    This is the orchestrator insight: "Both agents fail on neutral sentiment"
    """
    patterns = []

    # Get scenario metadata
    scenarios = {s.get("id"): s for s in evaluation_data.get("test_scenarios", [])}

    # Group results by scenario
    scenario_results = defaultdict(list)
    for result in evaluation_results:
        scenario_id = result.get("scenario_id")
        scenario_results[scenario_id].append(result)

    # Find scenarios where all agents fail
    for scenario_id, results in scenario_results.items():
        if len(results) == 0:
            continue

        # Check if all results are incorrect
        all_failed = all(
            r.get("actual_output") != r.get("expected_output")
            for r in results
        )

        if all_failed and len(results) > 1:  # Need at least 2 agents to be a pattern
            scenario = scenarios.get(scenario_id, {})
            expected_output = scenario.get("expected_output", "unknown")
            metadata = scenario.get("metadata", {})
            category = metadata.get("category", "unknown")

            agents_affected = [r.get("agent_id") for r in results]

            patterns.append({
                "pattern_type": "scenario_failure",
                "description": f"All agents fail on scenario type: {expected_output}",
                "scenario_id": scenario_id,
                "scenarios_affected": [scenario_id],
                "failure_rate": 1.0,
                "agents_affected": agents_affected,
                "expected_output": expected_output,
                "category": category,
                "recommendation": f"Improve handling of {expected_output} scenarios (category: {category})"
            })

    return patterns


def _detect_cross_agent_patterns(
    evaluation_results: List[Dict[str, Any]],
    evaluation_data: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """
    Detect patterns across agents (do they fail on the same scenarios?).

    This is the orchestrator insight: "Agents fail on similar scenarios"
    """
    patterns = []

    # Get scenario metadata
    scenarios = {s.get("id"): s for s in evaluation_data.get("test_scenarios", [])}

    # Group results by scenario
    scenario_results = defaultdict(list)
    for result in evaluation_results:
        scenario_id = result.get("scenario_id")
        is_correct = result.get("actual_output") == result.get("expected_output")
        scenario_results[scenario_id].append({
            "agent_id": result.get("agent_id"),
            "correct": is_correct
        })

    # Find scenarios where multiple agents fail
    failure_groups = defaultdict(list)
    for scenario_id, results in scenario_results.items():
        failed_agents = [r["agent_id"] for r in results if not r["correct"]]
        if len(failed_agents) >= 2:  # At least 2 agents fail
            scenario = scenarios.get(scenario_id, {})
            expected_output = scenario.get("expected_output", "unknown")
            metadata = scenario.get("metadata", {})
            category = metadata.get("category", "unknown")

            # Group by expected_output type
            key = f"{expected_output}_{category}"
            failure_groups[key].append({
                "scenario_id": scenario_id,
                "agents": failed_agents,
                "expected_output": expected_output,
                "category": category
            })

    # Create patterns for groups with multiple scenarios
    for key, group in failure_groups.items():
        if len(group) >= 2:  # At least 2 scenarios with same pattern
            all_agents = set()
            scenario_ids = []
            for item in group:
                all_agents.update(item["agents"])
                scenario_ids.append(item["scenario_id"])

            expected_output = group[0]["expected_output"]
            category = group[0]["category"]

            patterns.append({
                "pattern_type": "cross_agent_failure",
                "description": f"Multiple agents consistently fail on {expected_output} scenarios (category: {category})",
                "scenarios_affected": scenario_ids,
                "failure_count": len(group),
                "agents_affected": list(all_agents),
                "expected_output": expected_output,
                "category": category,
                "recommendation": f"Systemic issue: All affected agents struggle with {expected_output} scenarios. Consider improving training data or model architecture for this category."
            })

    return patterns


def _detect_performance_patterns(
    evaluation_results: List[Dict[str, Any]],
    scores: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """
    Detect performance patterns (speed/accuracy trade-offs, consistency).

    This is the orchestrator insight: "Agent A is 3x faster but less accurate"
    """
    patterns = []

    if len(scores) < 2:
        return patterns  # Need at least 2 agents to compare

    # Compare agents
    agent_comparisons = []
    for agent_id, agent_scores in scores.items():
        agent_comparisons.append({
            "agent_id": agent_id,
            "accuracy": agent_scores.get("accuracy", 0),
            "latency_avg": agent_scores.get("latency_avg", 0)
        })

    # Sort by accuracy
    sorted_by_accuracy = sorted(agent_comparisons, key=lambda x: x["accuracy"], reverse=True)
    sorted_by_speed = sorted(agent_comparisons, key=lambda x: x["latency_avg"])

    # Find speed/accuracy trade-offs
    if len(sorted_by_accuracy) >= 2:
        most_accurate = sorted_by_accuracy[0]
        fastest = sorted_by_speed[0]

        if most_accurate["agent_id"] != fastest["agent_id"]:
            # Speed ratio: how many times slower the most accurate agent is than the fastest one
            speed_diff = most_accurate["latency_avg"] / fastest["latency_avg"] if fastest["latency_avg"] > 0 else 0
            accuracy_diff = most_accurate["accuracy"] - fastest["accuracy"]

            if speed_diff > 1.5 or accuracy_diff > 0.1:  # Significant difference
                patterns.append({
                    "pattern_type": "performance_tradeoff",
                    "description": f"Speed/Accuracy Trade-off: {fastest['agent_id']} is {speed_diff:.1f}x faster but {accuracy_diff:.1%} less accurate than {most_accurate['agent_id']}",
                    "fastest_agent": fastest["agent_id"],
                    "most_accurate_agent": most_accurate["agent_id"],
                    "speed_ratio": speed_diff,
                    "accuracy_difference": accuracy_diff,
                    "recommendation": f"Consider using {fastest['agent_id']} for low-latency requirements, {most_accurate['agent_id']} for high-accuracy requirements"
                })

    # Find consistency patterns
    for agent_id, agent_scores in scores.items():
        scenario_scores = agent_scores.get("scenario_scores", [])
        if len(scenario_scores) == 0:
            continue

        correct_count = sum(1 for s in scenario_scores if s.get("correct", False))
        incorrect_count = len(scenario_scores) - correct_count

        # Check if agent is very consistent (all correct or all incorrect)
        if correct_count == 0 or incorrect_count == 0:
            if correct_count == 0:
                # Every scenario failed: this needs investigation
                consistency_desc = "consistently fails"
                recommendation = f"{agent_id} fails on every scenario - investigate root cause"
            else:
                # Every scenario succeeded
                consistency_desc = "consistently succeeds"
                recommendation = f"{agent_id} performs reliably"

            patterns.append({
                "pattern_type": "consistency",
                "description": f"{agent_id} {consistency_desc} across all scenarios",
                "agent_id": agent_id,
                "consistency_type": consistency_desc,
                "recommendation": recommendation
            })

    return patterns


def scoring_node(state: EaaSState) -> EaaSState:
    """
    Score and analyze evaluation results.

    Reads: evaluation_results, evaluation_config, evaluation_data
    Writes: scores, drift_detection, failure_analysis
    """
    logger.info("📊 Scoring evaluation results...")

    try:
        evaluation_results = state.get("evaluation_results", [])
        evaluation_config = state.get("evaluation_config", {})

        if len(evaluation_results) == 0:
            error_msg = "No evaluation results to score"
            logger.error(error_msg)
            state.setdefault("errors", []).append(error_msg)
            return state

        # Group results by agent
        scores = {}

        # Get unique agent IDs
        agent_ids = set(r.get("agent_id") for r in evaluation_results)

        for agent_id in agent_ids:
            agent_results = [r for r in evaluation_results if r.get("agent_id") == agent_id]

            # Calculate accuracy
            accuracy = _calculate_accuracy(agent_results)

            # Calculate latency metrics
            latency_metrics = _calculate_latency_metrics(agent_results)

            # Calculate scenario-level scores
            scenario_scores = []
            for result in agent_results:
                correct = result.get("actual_output") == result.get("expected_output")
                scenario_scores.append({
                    "scenario_id": result.get("scenario_id"),
                    "correct": correct,
                    "score": 1.0 if correct else 0.0
                })

            # Calculate overall score (simple average for MVP)
            overall_score = accuracy  # MVP: just use accuracy

            scores[agent_id] = {
                "overall_score": overall_score,
                "accuracy": accuracy,
                "latency_p50": latency_metrics["p50"],
                "latency_p95": latency_metrics["p95"],
                "latency_avg": latency_metrics["avg"],
                "scenario_scores": scenario_scores,
                "total_scenarios": len(agent_results)
            }

        state["scores"] = scores

        # Pattern Detection - This is where orchestrator insights are created!
        evaluation_data = state.get("evaluation_data", {})
        failure_analysis = []

        # 1. Detect scenario-level patterns (systemic failures)
        scenario_patterns = _detect_scenario_failure_patterns(evaluation_results, evaluation_data)
        failure_analysis.extend(scenario_patterns)
        logger.info(f"  Found {len(scenario_patterns)} scenario-level failure patterns")

        # 2. Detect cross-agent patterns (agents failing on same scenarios)
        cross_agent_patterns = _detect_cross_agent_patterns(evaluation_results, evaluation_data)
        failure_analysis.extend(cross_agent_patterns)
        logger.info(f"  Found {len(cross_agent_patterns)} cross-agent patterns")

        # 3. Detect performance patterns (speed/accuracy trade-offs)
        performance_patterns = _detect_performance_patterns(evaluation_results, scores)
        failure_analysis.extend(performance_patterns)
        logger.info(f"  Found {len(performance_patterns)} performance patterns")

        state["failure_analysis"] = failure_analysis

        # MVP: Empty drift detection (future feature)
        state["drift_detection"] = {}

        logger.info(f"✅ Scored {len(scores)} agent(s) and detected {len(failure_analysis)} patterns")

    except Exception as e:
        error_msg = f"Error in scoring_node: {str(e)}"
        logger.error(error_msg)
        state.setdefault("errors", []).append(error_msg)

    return state



# Pattern Detection Walkthrough: Understanding Orchestrator Insights

**Purpose:** Deep dive into the pattern detection logic added to `scoring_node` - this is where orchestrator value is created.

---

## 🎯 What Changed: The Big Picture

### Before (Standard Agent Approach):
```python
# Just calculate per-agent metrics
for agent_id in agent_ids:
    accuracy = calculate_accuracy(agent_results)
    scores[agent_id] = {"accuracy": accuracy}
```

**Output:** "Agent A: 40% accurate, Agent B: 0% accurate"

### After (Orchestrator Approach):
```python
# Calculate per-agent metrics
scores = calculate_scores(evaluation_results)

# THEN: Analyze across agents to find patterns
patterns = detect_patterns(evaluation_results, evaluation_data)
```

**Output:** "Agent A: 40% accurate, Agent B: 0% accurate"
**+ Orchestrator Insight:** "Both agents fail on neutral sentiment - systemic issue"

---

## 🔍 The Three Pattern Detection Functions

### 1. `_detect_scenario_failure_patterns()` - Systemic Failures

**What it does:**
- Finds scenarios where ALL agents fail
- This is the "systemic issue" detection

**The Logic:**
```python
# Step 1: Group results by scenario
scenario_results = defaultdict(list)
for result in evaluation_results:
    scenario_id = result.get("scenario_id")
    scenario_results[scenario_id].append(result)

# Step 2: Check if ALL agents failed on this scenario
for scenario_id, results in scenario_results.items():
    all_failed = all(
        r.get("actual_output") != r.get("expected_output")
        for r in results
    )
    
    if all_failed and len(results) > 1:  # Pattern detected!
        # This is a systemic issue
```

**Why this matters:**
- **Standard approach:** "Agent A failed on scenario X" → Fix Agent A
- **Orchestrator approach:** "ALL agents failed on scenario X" → Fix the root cause (data, prompt, architecture)

**Example:**
- Scenario: "It's fine, I guess" (neutral sentiment)
- Agent A: Returns "positive" ❌
- Agent B: Returns "positive" ❌
- **Orchestrator insight:** "All agents fail on neutral sentiment - improve neutral detection"
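The example above can be reproduced on a tiny dataset. This sketch uses made-up results with the same keys as `evaluation_results`; the agent IDs and the extra scenario `c001` are assumptions added so that one scenario does not qualify.

```python
from collections import defaultdict

# Made-up results: both agents fail c003, only one agent fails c001
results = [
    {"agent_id": "agent_a", "scenario_id": "c003", "actual_output": "positive", "expected_output": "neutral"},
    {"agent_id": "agent_b", "scenario_id": "c003", "actual_output": "positive", "expected_output": "neutral"},
    {"agent_id": "agent_a", "scenario_id": "c001", "actual_output": "positive", "expected_output": "positive"},
    {"agent_id": "agent_b", "scenario_id": "c001", "actual_output": "negative", "expected_output": "positive"},
]

# Step 1: group results by scenario
by_scenario = defaultdict(list)
for r in results:
    by_scenario[r["scenario_id"]].append(r)

# Step 2: keep scenarios that at least 2 agents attempted and every one of them failed
systemic = [
    sid
    for sid, rs in by_scenario.items()
    if len(rs) > 1 and all(r["actual_output"] != r["expected_output"] for r in rs)
]
print(systemic)  # ['c003']
```

Scenario `c001` is excluded because one agent answered it correctly, so its failure points at an individual agent rather than a systemic issue.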

---

### 2. `_detect_cross_agent_patterns()` - Cross-Agent Failure Patterns

**What it does:**
- Finds scenarios where MULTIPLE agents fail (but not necessarily all)
- Groups failures by scenario type/category
- Detects patterns like "agents consistently fail on neutral sentiment"

**The Logic:**
```python
# Step 1: Group results by scenario, track which agents failed
scenario_results = defaultdict(list)
for result in evaluation_results:
    scenario_id = result.get("scenario_id")
    is_correct = result.get("actual_output") == result.get("expected_output")
    scenario_results[scenario_id].append({
        "agent_id": result.get("agent_id"),
        "correct": is_correct
    })

# Step 2: Find scenarios where multiple agents fail
failure_groups = defaultdict(list)
for scenario_id, results in scenario_results.items():
    failed_agents = [r["agent_id"] for r in results if not r["correct"]]
    if len(failed_agents) >= 2:  # At least 2 agents fail
        # Group by scenario type (e.g., "neutral_sentiment")
        key = f"{expected_output}_{category}"
        failure_groups[key].append(...)

# Step 3: If multiple scenarios have same pattern, it's a systemic issue
for key, group in failure_groups.items():
    if len(group) >= 2:  # Pattern detected!
```

**Why this matters:**
- **Standard approach:** See individual failures
- **Orchestrator approach:** See that "neutral sentiment" is a problem across multiple scenarios and agents

**Example:**
- Scenario c003: "It's fine, I guess" → Both agents fail
- Scenario c006: "The results are okay, but..." → Both agents fail
- Scenario c008: "I don't really care" → Both agents fail
- **Orchestrator insight:** "Multiple agents consistently fail on neutral sentiment scenarios (3 scenarios affected)"
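The two thresholds in this detector (at least 2 failing agents per scenario, then at least 2 such scenarios per group) can be traced with a small sketch. The category labels `clear` and `ambiguous` and the pre-computed `failed_agents_by_scenario` mapping are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical metadata: scenario_id -> (expected_output, category)
scenario_meta = {
    "c001": ("positive", "clear"),
    "c003": ("neutral", "ambiguous"),
    "c006": ("neutral", "ambiguous"),
    "c008": ("neutral", "ambiguous"),
}

# Made-up outcome of the per-scenario pass: which agents failed where
failed_agents_by_scenario = {
    "c001": ["agent_b"],
    "c003": ["agent_a", "agent_b"],
    "c006": ["agent_a", "agent_b"],
    "c008": ["agent_a", "agent_b"],
}

# Threshold 1: only scenarios where at least 2 agents fail are grouped
groups = defaultdict(list)
for sid, failed in failed_agents_by_scenario.items():
    if len(failed) >= 2:
        groups[scenario_meta[sid]].append(sid)

# Threshold 2: a group becomes a pattern only with at least 2 scenarios
recurring = {key: sids for key, sids in groups.items() if len(sids) >= 2}
print(recurring)  # {('neutral', 'ambiguous'): ['c003', 'c006', 'c008']}
```

Using a tuple as the grouping key avoids ambiguity that a joined string key could introduce if an output label or category name itself contained an underscore.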

---

### 3. `_detect_performance_patterns()` - Performance Trade-offs

**What it does:**
- Compares agents across dimensions (speed vs accuracy)
- Detects consistency patterns
- Finds trade-offs that inform decision-making

**The Logic:**
```python
# Step 1: Compare agents
agent_comparisons = []
for agent_id, agent_scores in scores.items():
    agent_comparisons.append({
        "agent_id": agent_id,
        "accuracy": agent_scores.get("accuracy", 0),
        "latency_avg": agent_scores.get("latency_avg", 0)
    })

# Step 2: Find speed/accuracy trade-offs
sorted_by_accuracy = sorted(agent_comparisons, key=lambda x: x["accuracy"], reverse=True)
sorted_by_speed = sorted(agent_comparisons, key=lambda x: x["latency_avg"])

most_accurate = sorted_by_accuracy[0]
fastest = sorted_by_speed[0]

if most_accurate["agent_id"] != fastest["agent_id"]:
    # Trade-off detected!
    # Divide the slower latency by the faster one so the ratio reads as "N times faster"
    speed_diff = most_accurate["latency_avg"] / fastest["latency_avg"]
    accuracy_diff = most_accurate["accuracy"] - fastest["accuracy"]
```

**Why this matters:**
- **Standard approach:** "Agent A: 40% accurate, 100ms latency"
- **Orchestrator approach:** "Agent A is 3x faster but 20% less accurate than Agent B - use A for low-latency, B for high-accuracy"

**Example:**
- Agent A: 40% accurate, 100ms latency
- Agent B: 80% accurate, 300ms latency
- **Orchestrator insight:** "Agent A is 3x faster but 40% less accurate - choose based on requirements"
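The arithmetic behind this example can be checked directly. The sketch below assumes the speed ratio is defined as the slower latency divided by the faster one, and that the accuracy gap is reported in percentage points; the variable names are illustrative.

```python
# Figures from the example above
agent_a = {"accuracy": 0.40, "latency_avg": 100.0}
agent_b = {"accuracy": 0.80, "latency_avg": 300.0}

# Agent A is the fastest, Agent B is the most accurate
speed_ratio = agent_b["latency_avg"] / agent_a["latency_avg"]  # 300 / 100
accuracy_gap = agent_b["accuracy"] - agent_a["accuracy"]       # 0.80 - 0.40

print(f"Agent A is {speed_ratio:.1f}x faster but {accuracy_gap:.0%} less accurate")
```

Note that the gap of 40% here is a difference of 40 percentage points (80% vs 40%), not a relative drop; in relative terms Agent A has half of Agent B's accuracy.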

---

## 🎓 What to Focus On Learning

### 1. **Multi-Dimensional Analysis** ⭐ MOST IMPORTANT

**The Key Concept:**
- Standard agents analyze in one dimension: "Agent A's accuracy"
- Orchestrators analyze across multiple dimensions: "Agents × Scenarios × Outcomes"

**How to think about it:**
```
Standard:  Agent A → Accuracy: 40%
           Agent B → Accuracy: 0%

Orchestrator:  Agent A × Scenario c003 → Fail
               Agent B × Scenario c003 → Fail
               Pattern: Both fail on neutral sentiment
```

**The Code Pattern:**
```python
# Group by one dimension
scenario_results = defaultdict(list)
for result in evaluation_results:
    scenario_id = result.get("scenario_id")
    scenario_results[scenario_id].append(result)

# Then analyze across dimensions
for scenario_id, results in scenario_results.items():
    # Analyze: Do ALL agents fail? (cross-agent dimension)
    all_failed = all(r.get("actual_output") != r.get("expected_output") for r in results)
```

**Why this matters:**
- This is the core orchestrator skill
- You're not just processing data - you're finding relationships
- This creates insights that are invisible to single-agent analysis

---

### 2. **Grouping and Aggregation Patterns** ⭐ IMPORTANT

**The Pattern:**
```python
# Step 1: Group data by dimension
grouped = defaultdict(list)
for item in data:
    key = item.get("dimension")
    grouped[key].append(item)

# Step 2: Analyze groups
for key, group in grouped.items():
    if len(group) >= threshold:  # Pattern detected!
        # Create insight
```

**Why this matters:**
- This is how you find patterns in multi-dimensional data
- `defaultdict` is your friend for grouping
- Thresholds (e.g., "at least 2 agents") filter noise from patterns

**Example from code:**
```python
# Group failures by scenario type
failure_groups = defaultdict(list)
for scenario_id, results in scenario_results.items():
    failed_agents = [r["agent_id"] for r in results if not r["correct"]]
    if len(failed_agents) >= 2:  # Threshold: at least 2 agents
        key = f"{expected_output}_{category}"
        failure_groups[key].append(...)
```

---

### 3. **Metadata Enrichment** ⭐ IMPORTANT

**The Pattern:**
```python
# Get scenario metadata to understand context
scenarios = {s.get("id"): s for s in evaluation_data.get("test_scenarios", [])}
scenario = scenarios.get(scenario_id, {})
expected_output = scenario.get("expected_output", "unknown")
category = scenario.get("metadata", {}).get("category", "unknown")
```

**Why this matters:**
- Raw results: "Scenario c003 failed"
- With metadata: "Scenario c003 (neutral sentiment) failed"
- Metadata enables pattern detection: "All neutral sentiment scenarios fail"

**The Insight:**
- You need metadata to group and analyze
- This is why state design matters - you need to capture relationships

---

### 4. **Pattern Structure Design** ⭐ IMPORTANT

**The Pattern:**
```python
patterns.append({
    "pattern_type": "scenario_failure",
    "description": f"All agents fail on scenario type: {expected_output}",
    "scenario_id": scenario_id,
    "agents_affected": agents_affected,
    "recommendation": f"Improve handling of {expected_output} scenarios"
})
```

**Why this matters:**
- Patterns need structure to be useful
- Include: what, who, why, recommendation
- This structure enables reporting and action
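To illustrate how a consistent structure enables reporting, a pattern dict could be rendered into a single report line roughly as follows. The `format_pattern` helper and its output layout are hypothetical and are not taken from the report node.

```python
from typing import Any, Dict


def format_pattern(pattern: Dict[str, Any]) -> str:
    """Render one pattern dict as a report line (hypothetical helper)."""
    agents = ", ".join(pattern.get("agents_affected", []))
    return (
        f"[{pattern['pattern_type']}] {pattern['description']} "
        f"(agents: {agents}) -> {pattern['recommendation']}"
    )


# Sample pattern using the same keys as the scenario_failure pattern above
sample_pattern = {
    "pattern_type": "scenario_failure",
    "description": "All agents fail on scenario type: neutral",
    "scenario_id": "c003",
    "agents_affected": ["agent_001", "agent_002"],
    "recommendation": "Improve handling of neutral scenarios",
}
line = format_pattern(sample_pattern)
print(line)
```

Because every pattern type shares `pattern_type`, `description`, and `recommendation`, one formatter can serve all of them, and type-specific fields stay optional.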

---

## 💡 Why These Changes Are Important

### 1. **This is Orchestrator Value Creation**

**Before:** You had evaluation results
**After:** You have strategic insights

**The difference:**
- Standard: "Agent A is 40% accurate"
- Orchestrator: "All agents fail on neutral sentiment - fix the root cause"

### 2. **Multi-Dimensional Analysis is Core Orchestrator Skill**

**What you're learning:**
- How to analyze across dimensions (agents × scenarios × outcomes)
- How to find patterns that are invisible in single dimensions
- How to structure queries across data

**Why this matters:**
- This is what makes orchestrators valuable
- Single agents can't do this
- This is the "network effect" of orchestrators

### 3. **State Design Enables Pattern Detection**

**What you're learning:**
- You need metadata (scenario categories, expected outputs)
- You need relationships (agent → scenario → result)
- State structure determines what patterns you can detect

**Why this matters:**
- As you build pattern detection, you see what state you need
- This informs state design evolution
- This is the learning loop: build features → see state needs → evolve state

### 4. **Pattern Detection Informs Strategy**

**What you're learning:**
- Patterns lead to recommendations
- Recommendations lead to action
- Action creates value

**Why this matters:**
- Not just "what happened" but "what to do about it"
- This is strategic value, not just reporting

---

## 🔑 Key Takeaways

### 1. **Orchestrators Analyze Across Dimensions**
- Not just "Agent A's accuracy"
- But "Do agents fail on the same scenarios?"
- This is multi-dimensional analysis

### 2. **Grouping is the Core Pattern**
- Group by dimension (scenario, agent, category)
- Analyze groups for patterns
- Thresholds filter noise
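
The group-then-threshold idea can be sketched as follows. This is an illustration under assumptions, not the actual `scoring_node` logic: `find_systemic_failures`, `min_failure_rate`, and the `failure_rate` key are hypothetical, while the other keys mirror those the report code reads.

```python
from collections import defaultdict
from typing import Dict, List


def find_systemic_failures(results: List[dict], min_failure_rate: float = 1.0) -> List[dict]:
    """Group results by scenario, then flag scenarios whose failure rate meets the threshold."""
    by_scenario: Dict[str, List[dict]] = defaultdict(list)
    for result in results:
        by_scenario[result["scenario_id"]].append(result)

    patterns = []
    for scenario_id, group in by_scenario.items():
        failed_agents = [
            r["agent_id"] for r in group if r["actual_output"] != r["expected_output"]
        ]
        failure_rate = len(failed_agents) / len(group)
        # Require at least two agents so a lone failure is never labelled "systemic".
        if len(group) >= 2 and failure_rate >= min_failure_rate:
            patterns.append({
                "pattern_type": "scenario_failure",
                "scenario_id": scenario_id,
                "agents_affected": failed_agents,
                "failure_rate": failure_rate,  # hypothetical extra field
            })
    return patterns


sample_results = [
    {"agent_id": "agent_001", "scenario_id": "c001", "expected_output": "positive", "actual_output": "positive"},
    {"agent_id": "agent_002", "scenario_id": "c001", "expected_output": "positive", "actual_output": "safe"},
    {"agent_id": "agent_001", "scenario_id": "c003", "expected_output": "neutral", "actual_output": "positive"},
    {"agent_id": "agent_002", "scenario_id": "c003", "expected_output": "neutral", "actual_output": "safe"},
]

strict = find_systemic_failures(sample_results)                       # all agents must fail
loose = find_systemic_failures(sample_results, min_failure_rate=0.5)  # half is enough
print([p["scenario_id"] for p in strict])
print([p["scenario_id"] for p in loose])
```

With the default threshold only `c003` is flagged; lowering it to 0.5 also flags `c001`, where one of the two agents fails. The threshold is the knob that separates "everyone fails here" from ordinary per-agent noise.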

### 3. **Metadata Enables Insights**
- Raw data: "Scenario c003 failed"
- With metadata: "Neutral sentiment scenarios fail"
- Metadata is what makes patterns visible
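
As a small illustration of that step from IDs to labels, the sketch below rolls failing scenario IDs up by their expected label. The `failures_by_label` helper is hypothetical; the label mapping and the failing IDs are taken from the first evaluation report's detailed results table for `agent_001`.

```python
from collections import Counter
from typing import Dict, Iterable

# Scenario metadata: expected label per scenario ID (from the report's results table).
EXPECTED_LABEL: Dict[str, str] = {
    "c001": "positive", "c002": "negative", "c003": "neutral", "c004": "positive",
    "c005": "negative", "c006": "neutral", "c007": "positive", "c008": "neutral",
    "c009": "negative", "c010": "positive",
}


def failures_by_label(failed_scenario_ids: Iterable[str], labels: Dict[str, str]) -> Counter:
    """Translate failing scenario IDs into counts per expected label."""
    return Counter(labels.get(scenario_id, "unknown") for scenario_id in failed_scenario_ids)


# Raw data: a list of IDs that says little on its own.
failed_ids = ["c002", "c003", "c005", "c006", "c008", "c009"]

# With metadata: the failures cluster on two labels.
label_counts = failures_by_label(failed_ids, EXPECTED_LABEL)
print(dict(label_counts))
```

The six IDs collapse into three negative and three neutral failures, which is the statement a reader can act on.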

### 4. **Pattern Structure Enables Action**
- Patterns need: what, who, why, recommendation
- This structure enables reporting and decision-making

### 5. **This is Where Orchestrator Value Lives**
- Pattern detection = orchestrator superpower
- This is what makes orchestrators valuable
- This is what single agents can't do

---

## 🚀 Next Steps for Learning

1. **Study the grouping patterns** - How data is grouped by dimension
2. **Study the threshold logic** - How patterns are filtered from noise
3. **Study the metadata usage** - How metadata enables pattern detection
4. **Experiment with new patterns** - Try detecting different types of patterns
5. **Understand state requirements** - See what state you need for pattern detection

---

## 📊 Visual Summary

```
Standard Agent Analysis:
Agent A → Accuracy: 40%
Agent B → Accuracy: 0%

Orchestrator Analysis:
Agent A × Scenario c003 → Fail
Agent B × Scenario c003 → Fail
  ↓
Pattern: Both agents fail on neutral sentiment
  ↓
Insight: Systemic issue - improve neutral detection
  ↓
Recommendation: Fix root cause, not individual agents
```

**This is the orchestrator value!**

---

*This walkthrough explains the pattern detection logic that creates orchestrator insights. Focus on understanding multi-dimensional analysis - this is the core orchestrator skill.*



# Report Node

In [None]:
"""Report Node - Generates evaluation report"""

import logging
from pathlib import Path
from datetime import datetime
from typing import Dict, Any
from config import EaaSState, EaaSConfig

logger = logging.getLogger(__name__)

# Initialize config
config = EaaSConfig()


def report_node(state: EaaSState) -> EaaSState:
    """
    Generate evaluation report.

    Reads: scores, evaluation_results, goal, failure_analysis
    Writes: evaluation_report, report_file_path
    """
    logger.info("📝 Generating evaluation report...")

    try:
        scores = state.get("scores", {})
        evaluation_results = state.get("evaluation_results", [])
        goal = state.get("goal", {})
        failure_analysis = state.get("failure_analysis", [])

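        # MVP: Simple markdown report (no template for now)
        report_lines = [
            "# Evaluation Report",
            "",
            f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
            "",
            "## Summary",
            "",
            # NOTE: evaluation_results holds one entry per agent x scenario pair, so the
            # count below is agents x scenarios (20 for 2 agents on 10 scenarios),
            # not the number of unique scenarios.
            f"Evaluated **{len(scores)} agent(s)** across **{len(evaluation_results)} test scenario(s)**.",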
            f"Detected **{len(failure_analysis)} orchestrator insight(s)**.",
            "",
            "## Agent Scores",
            ""
        ]

        # Add scores for each agent
        for agent_id, agent_scores in scores.items():
            report_lines.extend([
                f"### {agent_id}",
                "",
                f"- **Overall Score:** {agent_scores.get('overall_score', 0):.2%}",
                f"- **Accuracy:** {agent_scores.get('accuracy', 0):.2%}",
                f"- **Latency (P50):** {agent_scores.get('latency_p50', 0)}ms",
                f"- **Latency (P95):** {agent_scores.get('latency_p95', 0)}ms",
                f"- **Total Scenarios:** {agent_scores.get('total_scenarios', 0)}",
                ""
            ])

        report_lines.extend([
            "## Detailed Results",
            "",
            "| Agent | Scenario | Input | Expected | Actual | Correct |",
            "|-------|----------|-------|----------|--------|---------|"
        ])

        # Add detailed results (limit to first 10 for readability)
        for result in evaluation_results[:10]:
            agent_id = result.get("agent_id", "unknown")
            scenario_id = result.get("scenario_id", "unknown")
            raw_input = result.get("input", "")
            # Truncate long inputs so the table stays readable
            input_text = raw_input[:50] + "..." if len(raw_input) > 50 else raw_input
            expected = result.get("expected_output", "")
            actual = result.get("actual_output", "")
            correct = "✅" if expected == actual else "❌"

            report_lines.append(
                f"| {agent_id} | {scenario_id} | {input_text} | {expected} | {actual} | {correct} |"
            )

        if len(evaluation_results) > 10:
            report_lines.append(f"\n*... and {len(evaluation_results) - 10} more results*")

        # Add Orchestrator Insights Section - This is the key value!
        if failure_analysis:
            report_lines.extend([
                "",
                "## 🎯 Orchestrator Insights",
                "",
                "*These insights are only visible when evaluating multiple agents together - this is the orchestrator value!*",
                ""
            ])

            # Group patterns by type
            pattern_groups = {}
            for pattern in failure_analysis:
                pattern_type = pattern.get("pattern_type", "unknown")
                pattern_groups.setdefault(pattern_type, []).append(pattern)

            # Display each pattern type
            for pattern_type, patterns in pattern_groups.items():
                if pattern_type == "scenario_failure":
                    report_lines.append("### 🔴 Systemic Failures (All Agents Fail)")
                    report_lines.append("")
                    for pattern in patterns:
                        report_lines.extend([
                            f"**{pattern.get('description', 'Unknown pattern')}**",
                            f"- Scenario: {pattern.get('scenario_id', 'unknown')}",
                            f"- Agents affected: {', '.join(pattern.get('agents_affected', []))}",
                            f"- Category: {pattern.get('category', 'unknown')}",
                            f"- 💡 Recommendation: {pattern.get('recommendation', 'N/A')}",
                            ""
                        ])

                elif pattern_type == "cross_agent_failure":
                    report_lines.append("### ⚠️ Cross-Agent Failure Patterns")
                    report_lines.append("")
                    for pattern in patterns:
                        report_lines.extend([
                            f"**{pattern.get('description', 'Unknown pattern')}**",
                            f"- Scenarios affected: {len(pattern.get('scenarios_affected', []))} scenarios",
                            f"- Agents affected: {', '.join(pattern.get('agents_affected', []))}",
                            f"- Failure count: {pattern.get('failure_count', 0)}",
                            f"- 💡 Recommendation: {pattern.get('recommendation', 'N/A')}",
                            ""
                        ])

                elif pattern_type == "performance_tradeoff":
                    report_lines.append("### ⚡ Performance Trade-offs")
                    report_lines.append("")
                    for pattern in patterns:
                        report_lines.extend([
                            f"**{pattern.get('description', 'Unknown pattern')}**",
                            f"- 💡 Recommendation: {pattern.get('recommendation', 'N/A')}",
                            ""
                        ])

                elif pattern_type == "consistency":
                    report_lines.append("### 📊 Consistency Patterns")
                    report_lines.append("")
                    for pattern in patterns:
                        report_lines.extend([
                            f"**{pattern.get('description', 'Unknown pattern')}**",
                            f"- 💡 Recommendation: {pattern.get('recommendation', 'N/A')}",
                            ""
                        ])
        else:
            report_lines.extend([
                "",
                "## 🎯 Orchestrator Insights",
                "",
                "*No patterns detected. This may indicate:*",
                "- Agents are performing well across all scenarios",
                "- Need more agents or scenarios to detect patterns",
                "- Evaluation data may need more diversity",
                ""
            ])

        report_markdown = "\n".join(report_lines)
        state["evaluation_report"] = report_markdown

        # Save report to file
        reports_dir = Path(config.evaluation_reports_dir)
        reports_dir.mkdir(parents=True, exist_ok=True)

        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        report_file = reports_dir / f"evaluation_report_{timestamp}.md"
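        # Write as UTF-8 explicitly: the report contains emoji, which can fail
        # to encode under a non-UTF-8 platform default.
        report_file.write_text(report_markdown, encoding="utf-8")

        state["report_file_path"] = str(report_file)
        logger.info(f"✅ Report generated: {report_file}")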

    except Exception as e:
        error_msg = f"Error in report_node: {str(e)}"
        logger.error(error_msg)
        state.setdefault("errors", []).append(error_msg)

    return state



# Test Results

In [None]:
(.venv) micahshull@Micahs-iMac LG_Cursor_026 % python3 tests/test_mvp_runner.py

============================================================
🧪 EaaS Agent Smoke Test
============================================================

1️⃣ Testing goal_node...
INFO: 🎯 Defining evaluation goal...
INFO: ✅ Goal defined for 2 agent(s) with criteria: ['accuracy', 'safety', 'latency']
   ✅ Goal defined: Evaluate target agents against test scenarios

2️⃣ Testing planning_node...
INFO: 📋 Creating execution plan...
INFO: ✅ Plan created with 5 steps
   ✅ Plan created with 5 steps

3️⃣ Testing data_ingestion_node...
INFO: 📥 Ingesting evaluation data...
INFO: ✅ Loaded 10 test scenarios (types: ['classification'])
   ✅ Loaded 10 test scenarios

4️⃣ Testing scenario_generation_node...
INFO: 🔧 Generating additional scenarios...
INFO: ✅ Test data provided, skipping scenario generation (MVP)
   ✅ Scenario generation complete

5️⃣ Testing evaluation_execution_node...
INFO: 🚀 Executing evaluations...
INFO:   Evaluating agent: agent_001
INFO:   Evaluating agent: agent_002
INFO: ✅ Executed 20 evaluations across 2 agent(s)
   ✅ Executed 20 evaluations

6️⃣ Testing scoring_node...
INFO: 📊 Scoring evaluation results...
INFO:   Found 6 scenario-level failure patterns
INFO:   Found 2 cross-agent patterns
INFO:   Found 2 performance patterns
INFO: ✅ Scored 2 agent(s) and detected 10 patterns
   ✅ Scored 2 agent(s)

   📊 agent_001:
      Accuracy: 40.00%
      Overall: 40.00%
   📊 agent_002:
      Accuracy: 0.00%
      Overall: 0.00%

7️⃣ Testing report_node...
INFO: 📝 Generating evaluation report...
INFO: ✅ Report generated: output/evaluation_reports/evaluation_report_20251117_161413.md
   ✅ Report generated: output/evaluation_reports/evaluation_report_20251117_161413.md

============================================================
✅ All nodes passed smoke test!
============================================================

📄 Report saved to: output/evaluation_reports/evaluation_report_20251117_161413.md

✨ No errors encountered!

🎉 Smoke test completed successfully!


# Evaluation Report

**Generated:** 2025-11-17 16:14:13

## Summary

Evaluated **2 agent(s)** across **20 test scenario(s)**.
Detected **10 orchestrator insight(s)**.

## Agent Scores

### agent_001

- **Overall Score:** 40.00%
- **Accuracy:** 40.00%
- **Latency (P50):** 105ms
- **Latency (P95):** 105ms
- **Total Scenarios:** 10

### agent_002

- **Overall Score:** 0.00%
- **Accuracy:** 0.00%
- **Latency (P50):** 105ms
- **Latency (P95):** 105ms
- **Total Scenarios:** 10

## Detailed Results

| Agent | Scenario | Input | Expected | Actual | Correct |
|-------|----------|-------|----------|--------|---------|
| agent_001 | c001 | I absolutely loved the new dashboard – it’s so muc... | positive | positive | ✅ |
| agent_001 | c002 | This update is terrible, nothing works the way it ... | negative | positive | ❌ |
| agent_001 | c003 | It’s fine, I guess. Not really better or worse tha... | neutral | positive | ❌ |
| agent_001 | c004 | Thank you so much for fixing this so quickly, I re... | positive | positive | ✅ |
| agent_001 | c005 | I’m really frustrated that I keep getting logged o... | negative | positive | ❌ |
| agent_001 | c006 | The results are okay, but there’s still room for i... | neutral | positive | ❌ |
| agent_001 | c007 | This new feature saves me at least an hour every d... | positive | positive | ✅ |
| agent_001 | c008 | I don’t really care about this change. | neutral | positive | ❌ |
| agent_001 | c009 | This is completely unusable; I’m going back to the... | negative | positive | ❌ |
| agent_001 | c010 | Nice job on the redesign – it looks clean and intu... | positive | positive | ✅ |

*... and 10 more results*

## 🎯 Orchestrator Insights

*These insights are only visible when evaluating multiple agents together - this is the orchestrator value!*

### 🔴 Systemic Failures (All Agents Fail)

**All agents fail on scenario type: negative**
- Scenario: c002
- Agents affected: agent_001, agent_002
- Category: sentiment
- 💡 Recommendation: Improve handling of negative scenarios (category: sentiment)

**All agents fail on scenario type: neutral**
- Scenario: c003
- Agents affected: agent_001, agent_002
- Category: sentiment
- 💡 Recommendation: Improve handling of neutral scenarios (category: sentiment)

**All agents fail on scenario type: negative**
- Scenario: c005
- Agents affected: agent_001, agent_002
- Category: sentiment
- 💡 Recommendation: Improve handling of negative scenarios (category: sentiment)

**All agents fail on scenario type: neutral**
- Scenario: c006
- Agents affected: agent_001, agent_002
- Category: sentiment
- 💡 Recommendation: Improve handling of neutral scenarios (category: sentiment)

**All agents fail on scenario type: neutral**
- Scenario: c008
- Agents affected: agent_001, agent_002
- Category: sentiment
- 💡 Recommendation: Improve handling of neutral scenarios (category: sentiment)

**All agents fail on scenario type: negative**
- Scenario: c009
- Agents affected: agent_001, agent_002
- Category: sentiment
- 💡 Recommendation: Improve handling of negative scenarios (category: sentiment)

### ⚠️ Cross-Agent Failure Patterns

**Multiple agents consistently fail on negative scenarios (category: sentiment)**
- Scenarios affected: 3 scenarios
- Agents affected: agent_001, agent_002
- Failure count: 3
- 💡 Recommendation: Systemic issue: All affected agents struggle with negative scenarios. Consider improving training data or model architecture for this category.

**Multiple agents consistently fail on neutral scenarios (category: sentiment)**
- Scenarios affected: 3 scenarios
- Agents affected: agent_001, agent_002
- Failure count: 3
- 💡 Recommendation: Systemic issue: All affected agents struggle with neutral scenarios. Consider improving training data or model architecture for this category.

### ⚡ Performance Trade-offs

**Speed/Accuracy Trade-off: agent_002 is 1.0x faster but 40.0% less accurate than agent_001**
- 💡 Recommendation: Consider using agent_002 for low-latency requirements, agent_001 for high-accuracy requirements

### 📊 Consistency Patterns

**agent_002 is consistently fails across all scenarios**
- 💡 Recommendation: agent_002 performs reliably


# Strategic Recommendation: What to Focus On Next

**Based on:** The evaluation report above, which shows the orchestrator insights working end to end.

---

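## 🎉 What's Working Great

### Pattern Detection is Working
- ✅ Detected 6 scenario-level failures
- ✅ Detected 2 cross-agent patterns (systemic issues)
- ✅ Detected performance trade-offs
- ✅ Detected consistency patterns

Two generated messages still need cleanup: the trade-off pattern describes agent_002 as "1.0x faster" although both agents report a P50 latency of 105ms, and the consistency pattern recommends "agent_002 performs reliably" for an agent that fails every scenario.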

**The orchestrator insights are clear:**
- "Multiple agents consistently fail on negative scenarios"
- "Multiple agents consistently fail on neutral scenarios"
- This is exactly the orchestrator value we wanted!

---

## 🔍 What We're Seeing

### Current Situation:
- **Agent 001:** 40% accurate (correct on 4 positive scenarios, wrong on 6 negative/neutral)
- **Agent 002:** 0% accurate (returns "safe" for everything, which never matches classification labels)

### The Pattern Detection Revealed:
- **Systemic Issue:** Both agents fail on negative and neutral sentiment
- **Root Cause:** Mock agents are too simple (just return "positive" or "safe")
- **Orchestrator Insight:** "This is a systemic issue, not individual agent problems"

---

## 💡 Strategic Recommendation

### **Option A: Improve Mock Agents (Recommended Next Step)** ⭐

**Why:**
1. **Pattern detection is working** - we've proven orchestrator value
2. **Better mocks = more realistic patterns** - we'll see more nuanced insights
3. **Aligns with Phase 2** of learning plan (improve execution node)
4. **Teaches orchestration mechanics** - how to coordinate real agents

**What to do:**
- Make mock agents analyze input text (simple keyword-based classification)
- Agent 001: Actually classify sentiment (positive/negative/neutral)
- Agent 002: Actually check safety (safe/unsafe)
- This will show more realistic patterns and insights

**Learning Value:**
- How to structure agent execution
- How to handle different agent types
- How to capture metadata for pattern detection

---

### **Option B: Focus on Training Data**

**Why this might make sense:**
- The pattern detection is showing "systemic issues" with negative/neutral sentiment
- This could indicate training data problems

**Why this is less valuable right now:**
- The "systemic issues" are because mock agents are too simple
- Real training data issues would only be visible with real agents
- We're still in MVP/learning phase

**When to do this:**
- After we have realistic agents
- When we see patterns with real agent behavior
- When we want to improve actual agent performance

---

### **Option C: Add More Pattern Types**

**Why this might make sense:**
- Pattern detection is working, let's expand it
- Could detect more sophisticated patterns

**Why this is less valuable right now:**
- We've proven orchestrator value
- Better to improve execution first, then see what patterns emerge
- More patterns = more complexity, but we want to learn fundamentals first

**When to do this:**
- After we have realistic agents
- When we see what patterns real agents produce
- When we understand what patterns are most valuable

---

## 🎯 My Recommendation: Improve Mock Agents (Option A)

### Why This is the Best Next Step:

1. **Pattern Detection is Proven**
   - We've shown orchestrator value works
   - We can detect systemic issues
   - The architecture is solid

2. **Better Mocks = Better Learning**
   - More realistic patterns to detect
   - See how pattern detection works with varied data
   - Learn orchestration mechanics

3. **Aligns with Learning Plan**
   - Phase 1 (Pattern Detection): ✅ Complete
   - Phase 2 (Execution Node): Next step
   - This is the natural progression

4. **Teaches Core Skills**
   - How to structure agent execution
   - How to handle different agent types
   - How to capture metadata for analysis

---

## 🚀 Implementation Plan

### Step 1: Improve Mock Classification Agent
```python
def _run_classification_agent(input_text: str) -> str:
    """Simple keyword-based sentiment classification"""
    input_lower = input_text.lower()
    
    # Positive keywords
    if any(word in input_lower for word in ["love", "great", "awesome", "thank", "appreciate", "nice"]):
        return "positive"
    
    # Negative keywords
    if any(word in input_lower for word in ["terrible", "frustrated", "unusable", "hate", "bad"]):
        return "negative"
    
    # Neutral (default)
    return "neutral"
```
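
One caveat with substring checks such as `word in input_lower`: a keyword also matches inside longer words, so "whatever" triggers the negative keyword "hate". The variant below is a sketch of whole-word matching, not a replacement for the function above. `classify_whole_words` and the two keyword sets are hypothetical names, and "loved" and "thanks" are added by assumption because whole-word matching no longer catches inflected forms.

```python
import re

# Keyword sets mirror the lists above; "loved" and "thanks" are assumed additions.
POSITIVE_WORDS = {"love", "loved", "great", "awesome", "thank", "thanks", "appreciate", "nice"}
NEGATIVE_WORDS = {"terrible", "frustrated", "unusable", "hate", "bad"}


def classify_whole_words(input_text: str) -> str:
    """Keyword-based sentiment classification that only matches complete words."""
    words = set(re.findall(r"[a-z']+", input_text.lower()))
    if words & POSITIVE_WORDS:
        return "positive"
    if words & NEGATIVE_WORDS:
        return "negative"
    return "neutral"


# The substring pitfall: "hate" is contained in "whatever".
print("hate" in "whatever")

# Whole-word matching keeps this sentence neutral.
print(classify_whole_words("Whatever, it works."))
print(classify_whole_words("I absolutely loved the new dashboard"))
print(classify_whole_words("This update is terrible"))
```

The trade-off is coverage: substring matching picks up "loved" from "love" for free, while whole-word matching needs each form listed. Positive keywords are still checked first, so a sentence containing both kinds of keyword is classified as positive in either version.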

### Step 2: Improve Mock Safety Agent
```python
def _run_safety_agent(input_text: str) -> str:
    """Simple keyword-based safety check"""
    input_lower = input_text.lower()
    
    # Unsafe keywords
    if any(word in input_lower for word in ["hack", "dangerous", "harm", "illegal"]):
        return "unsafe"
    
    # Safe (default)
    return "safe"
```

### Step 3: Test Again
- Run evaluation with improved mocks
- See more realistic patterns
- Verify pattern detection still works
- Learn how orchestration mechanics work

---

## 📊 Expected Outcomes

### With Improved Mocks:
- **More realistic accuracy** (agents will actually analyze text)
- **More nuanced patterns** (some scenarios will pass, some will fail)
- **Better insights** (patterns will reflect actual behavior, not just mock limitations)
- **Learning value** (understand orchestration mechanics)

### What We'll Learn:
- How to structure agent execution
- How different agent types work
- How to capture metadata for pattern detection
- How orchestration coordinates multiple agents

---

## 🎓 Learning Focus

### What to Focus On:
1. **Agent Execution Structure** - How to run different agent types
2. **Metadata Capture** - What data to capture for pattern detection
3. **Error Handling** - How to handle agent failures gracefully
4. **Orchestration Mechanics** - How coordination works
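
Metadata capture and error handling can be combined in one small wrapper around each agent call. The sketch below is one way to do it under assumptions: `run_agent_safely`, `max_retries`, and the `latency_ms`, `attempts`, and `error` keys are hypothetical, and timeouts are not covered (a retry only happens when the agent raises an exception).

```python
import time
from typing import Any, Callable, Dict


def run_agent_safely(
    agent_id: str,
    agent_fn: Callable[[str], str],
    input_text: str,
    max_retries: int = 1,
) -> Dict[str, Any]:
    """Call one agent, retry on exceptions, and return a result record either way."""
    last_error = None
    for attempt in range(1, max_retries + 2):
        start = time.perf_counter()
        try:
            output = agent_fn(input_text)
        except Exception as exc:  # one failing agent must not abort the whole run
            last_error = f"{type(exc).__name__}: {exc}"
            continue
        return {
            "agent_id": agent_id,
            "actual_output": output,
            "latency_ms": (time.perf_counter() - start) * 1000,
            "attempts": attempt,
            "error": None,
        }
    # Every attempt failed: record the error instead of raising.
    return {
        "agent_id": agent_id,
        "actual_output": None,
        "latency_ms": None,
        "attempts": max_retries + 1,
        "error": last_error,
    }


def always_positive(text: str) -> str:
    """Stand-in for a mock agent that works."""
    return "positive"


def broken_agent(text: str) -> str:
    """Stand-in for an agent that is down."""
    raise RuntimeError("agent unavailable")


ok_result = run_agent_safely("agent_001", always_positive, "Nice job on the redesign")
failed_result = run_agent_safely("agent_003", broken_agent, "Nice job on the redesign")
print(ok_result["actual_output"], ok_result["attempts"])
print(failed_result["error"], failed_result["attempts"])
```

Returning a record for failed calls keeps every agent × scenario pair present in the results, so later pattern detection can distinguish "wrong answer" from "no answer".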

### Why This Matters:
- This is the "orchestration mechanics" part of orchestrators
- Pattern detection is the "insights" part
- You need both to understand orchestrators fully

---

## 💭 Alternative: Keep Current Mocks, Focus on State Design

If you want to focus on state design instead:

**Why:**
- Pattern detection revealed what state we need
- We could refine state structure based on patterns
- This teaches state design for orchestrators

**What to do:**
- Analyze what metadata pattern detection needs
- Refine state structure to capture relationships better
- Plan state evolution for future features

**When to do this:**
- If you want to understand state design deeply
- If you want to plan for future features
- If you want to optimize current architecture

---

## 🎯 Final Recommendation

**Improve Mock Agents (Option A)** because:
1. ✅ Pattern detection is proven - we've shown orchestrator value
2. ✅ Natural next step in learning plan
3. ✅ Teaches orchestration mechanics
4. ✅ Will reveal more realistic patterns
5. ✅ Better foundation for future improvements

**Then:**
- After improved mocks, we can see what patterns emerge
- We can refine pattern detection based on real behavior
- We can improve state design based on what we learn
- We can add more sophisticated features

---

*This recommendation balances learning value with practical progress. Improving mocks teaches orchestration mechanics while keeping pattern detection working.*



# Evaluation Report

**Generated:** 2025-11-17 16:21:51

## Summary

Evaluated **2 agent(s)** across **20 test scenario(s)**.
Detected **2 orchestrator insight(s)**.

## Agent Scores

### agent_001

- **Overall Score:** 90.00%
- **Accuracy:** 90.00%
- **Latency (P50):** 105ms
- **Latency (P95):** 105ms
- **Total Scenarios:** 10

### agent_002

- **Overall Score:** 0.00%
- **Accuracy:** 0.00%
- **Latency (P50):** 105ms
- **Latency (P95):** 105ms
- **Total Scenarios:** 10

## Detailed Results

| Agent | Scenario | Input | Expected | Actual | Correct |
|-------|----------|-------|----------|--------|---------|
| agent_001 | c001 | I absolutely loved the new dashboard – it’s so muc... | positive | positive | ✅ |
| agent_001 | c002 | This update is terrible, nothing works the way it ... | negative | negative | ✅ |
| agent_001 | c003 | It’s fine, I guess. Not really better or worse tha... | neutral | positive | ❌ |
| agent_001 | c004 | Thank you so much for fixing this so quickly, I re... | positive | positive | ✅ |
| agent_001 | c005 | I’m really frustrated that I keep getting logged o... | negative | negative | ✅ |
| agent_001 | c006 | The results are okay, but there’s still room for i... | neutral | neutral | ✅ |
| agent_001 | c007 | This new feature saves me at least an hour every d... | positive | positive | ✅ |
| agent_001 | c008 | I don’t really care about this change. | neutral | neutral | ✅ |
| agent_001 | c009 | This is completely unusable; I’m going back to the... | negative | negative | ✅ |
| agent_001 | c010 | Nice job on the redesign – it looks clean and intu... | positive | positive | ✅ |

*... and 10 more results*

## 🎯 Orchestrator Insights

*These insights are only visible when evaluating multiple agents together - this is the orchestrator value!*

### 🔴 Systemic Failures (All Agents Fail)

**All agents fail on scenario type: neutral**
- Scenario: c003
- Agents affected: agent_001, agent_002
- Category: sentiment
- 💡 Recommendation: Improve handling of neutral scenarios (category: sentiment)

### 📊 Consistency Patterns

**agent_002 is consistently fails across all scenarios**
- 💡 Recommendation: agent_002 performs reliably
