# Context Distraction: Standard Agent vs Deep Agent Comparison

This notebook compares a standard ReAct agent against a Deep Agent (supervisor/worker delegation pattern) on context distraction tasks.

## What We're Testing

Both agents answer 8 questions (Q1-8) requiring research across 5 technology domains. The Deep Agent uses a supervisor that delegates to workers with isolated context, while the Standard Agent processes everything in a single context.

**Hypothesis**: The Deep Agent's context isolation should improve recall accuracy on multi-step tasks.

## Setup

In [None]:
# Imports and Setup
import asyncio
from typing import Dict, Any, List
import pandas as pd
import numpy as np
from IPython.display import display, Markdown
from dotenv import load_dotenv

load_dotenv(override=True)

# Import test infrastructure
from context_distraction.resources.test_tasks import TEST_TASKS, build_partial_task
from context_distraction.tests.evaluators import recall_accuracy_evaluator
from context_distraction.tests.setup_datasets import build_reference_outputs
from context_distraction.resources.validation_utils import extract_tool_calls_from_message

# Import agents
from context_distraction.agent import agent as standard_agent
from context_distraction.deep import run_deep_agent

print("Setup complete")

## Test Configuration

Each agent runs Q1-8 on all 3 tasks, with `NUM_TRIALS` trials per task to measure consistency.

In [None]:
# Configuration
QUESTIONS = [1, 2, 3, 4, 5, 6, 7, 8]  # Q1-8
NUM_TRIALS = 3  # Run each test multiple times

print("Test Configuration:")
print(f"  Questions: {QUESTIONS}")
print(f"  Trials per task: {NUM_TRIALS}")
print("\nTasks:")
for i, task in enumerate(TEST_TASKS, 1):
    print(f"  Task {i}: {task['name']} (focus: {task['primary_domain']})")

## Agent Runner Functions

In [None]:
async def run_standard(query: str) -> dict:
    """Run standard agent and extract outputs."""
    try:
        trajectory = []
        final_response = ""
        all_messages = []
        
        async for chunk in standard_agent.astream(
            {"messages": [("user", query)]},
            stream_mode="updates",
        ):
            # With stream_mode="updates", each chunk maps a node name to that node's state update
            if isinstance(chunk, dict):
                for key in ['tools', 'model']:
                    if key in chunk:
                        msgs = chunk[key].get('messages', [])
                        all_messages.extend(msgs)
                        for msg in msgs:
                            tool_calls = extract_tool_calls_from_message(msg)
                            trajectory.extend(tool_calls)
        
        # Treat the most recent message with non-empty content as the final answer
        for msg in reversed(all_messages):
            if hasattr(msg, 'content') and msg.content:
                final_response = msg.content
                break
        
        return {"final_response": final_response, "trajectory": trajectory, "error": None}
    except Exception as e:
        return {"final_response": "", "trajectory": [], "error": str(e)}

async def run_test(task_idx: int, questions: list, agent_type: str) -> dict:
    """Run a single test and return results."""
    task = TEST_TASKS[task_idx]
    partial_task = build_partial_task(task, questions)
    reference = build_reference_outputs(partial_task)
    inputs = {"query": partial_task["query"]}
    
    if agent_type == "standard":
        outputs = await run_standard(inputs["query"])
    else:
        outputs = await run_deep_agent(inputs["query"])
    
    if outputs.get("error"):
        return {"score": 0.0, "error": outputs["error"], "correct": 0, "total": len(questions)}
    
    result = recall_accuracy_evaluator(inputs, outputs, reference)
    return {
        "score": result["score"],
        "error": None,
        # round() rather than int(): int() truncates, so a product such as 6.999999 would undercount
        "correct": round(result["score"] * len(questions)),
        "total": len(questions)
    }

print("Runner functions defined")

## Run All Tests

Running `NUM_TRIALS` trials for each task/agent combination.

In [None]:
# Run all tests
all_results = []

for task_idx in range(len(TEST_TASKS)):
    task = TEST_TASKS[task_idx]
    print(f"\n{'='*60}")
    print(f"Task {task_idx + 1}: {task['name']}")
    print(f"{'='*60}")
    
    for trial in range(NUM_TRIALS):
        print(f"\n  Trial {trial + 1}/{NUM_TRIALS}")
        
        # Run standard agent
        print("    Standard agent...", end=" ", flush=True)
        std_result = await run_test(task_idx, QUESTIONS, "standard")
        std_status = f"{std_result['correct']}/{std_result['total']}" if not std_result['error'] else "ERROR"
        print(std_status)
        
        # Run deep agent  
        print("    Deep agent...", end=" ", flush=True)
        deep_result = await run_test(task_idx, QUESTIONS, "deep")
        deep_status = f"{deep_result['correct']}/{deep_result['total']}" if not deep_result['error'] else "ERROR"
        print(deep_status)
        
        all_results.append({
            'task': task_idx + 1,
            'task_name': task['name'],
            'trial': trial + 1,
            'standard_score': std_result['score'],
            'standard_correct': std_result['correct'],
            'deep_score': deep_result['score'],
            'deep_correct': deep_result['correct'],
            'total': std_result['total']
        })

print(f"\n{'='*60}")
print("All tests completed!")
print(f"{'='*60}")

## Results Summary

In [None]:
# Create results dataframe
df = pd.DataFrame(all_results)

# Summary by task
print("Results by Task (averaged across trials):")
print("-" * 50)
task_summary = df.groupby('task').agg({
    'standard_score': ['mean', 'std'],
    'deep_score': ['mean', 'std'],
    'standard_correct': 'mean',
    'deep_correct': 'mean',
    'total': 'first'
}).round(3)

for task_idx in range(1, len(TEST_TASKS) + 1):
    task_data = df[df['task'] == task_idx]
    std_mean = task_data['standard_score'].mean()
    std_std = task_data['standard_score'].std()
    deep_mean = task_data['deep_score'].mean()
    deep_std = task_data['deep_score'].std()
    
    print(f"\nTask {task_idx}: {TEST_TASKS[task_idx-1]['name']}")
    print(f"  Standard: {std_mean:.1%} +/- {std_std:.1%}")
    print(f"  Deep:     {deep_mean:.1%} +/- {deep_std:.1%}")
    print(f"  Winner:   {'Deep' if deep_mean > std_mean else 'Standard' if std_mean > deep_mean else 'Tie'}")

# Overall summary
print("\n" + "=" * 50)
print("OVERALL RESULTS")
print("=" * 50)
std_overall = df['standard_score'].mean()
deep_overall = df['deep_score'].mean()
print(f"  Standard Agent: {std_overall:.1%}")
print(f"  Deep Agent:     {deep_overall:.1%}")
print(f"  Improvement:    {(deep_overall - std_overall)*100:+.1f} percentage points")
print(f"\n  Deep agent wins: {(df['deep_score'] > df['standard_score']).sum()}/{len(df)} trials")
print(f"  Standard wins:   {(df['standard_score'] > df['deep_score']).sum()}/{len(df)} trials")
print(f"  Ties:            {(df['standard_score'] == df['deep_score']).sum()}/{len(df)} trials")

## Detailed Results Table

In [None]:
# Display full results table
display(Markdown("### All Trial Results"))
display(df[['task', 'trial', 'standard_correct', 'deep_correct', 'total', 'standard_score', 'deep_score']]
        .rename(columns={
            'task': 'Task',
            'trial': 'Trial', 
            'standard_correct': 'Standard Correct',
            'deep_correct': 'Deep Correct',
            'total': 'Total Questions',
            'standard_score': 'Standard Score',
            'deep_score': 'Deep Score'
        }))

## Comparing the Two Approaches

### Standard ReAct Agent

**How it works:**
- Single agent context processes all questions
- All tool calls and results accumulate in one message history
- Agent must track multiple data points across domains simultaneously

**Pros:**
- Simpler implementation
- Direct access to all context

**Cons:**
- Context grows with each tool call
- Important values can get "lost" in long conversation history
- Prone to confusion when handling multiple domains
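
The accumulation described above can be illustrated with a toy simulation. This is a minimal sketch and not the notebook's agent: `History`, `simulate_single_context`, and the sizes used are hypothetical, and sizes are counted in characters rather than tokens.

```python
from dataclasses import dataclass, field


@dataclass
class History:
    """Toy message history; sizes are in characters, not tokens."""
    messages: list = field(default_factory=list)

    def add(self, text: str) -> None:
        self.messages.append(text)

    def size(self) -> int:
        return sum(len(m) for m in self.messages)


def simulate_single_context(num_questions: int, calls_per_question: int,
                            result_chars: int) -> list:
    """Return the history size the agent carries when it answers each question."""
    history = History()
    sizes = []
    for _ in range(num_questions):
        for _ in range(calls_per_question):
            # Stand-in for one tool result appended to the shared history
            history.add("x" * result_chars)
        sizes.append(history.size())
    return sizes


# Hypothetical numbers: 8 questions, 3 tool calls each, 500 characters per result
sizes = simulate_single_context(num_questions=8, calls_per_question=3, result_chars=500)
print(sizes)  # grows by 1500 characters per question
```

By the last question the single context holds every earlier tool result, which is the condition under which earlier values are easiest to lose.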

### Deep Agent (Supervisor/Worker Pattern)

**How it works:**
1. **Supervisor** analyzes the task and delegates sub-tasks to workers
2. **Workers** execute in isolated contexts, focusing on one domain at a time
3. Workers report results back to supervisor
4. Supervisor synthesizes all worker results into final answer

**Pros:**
- Context isolation prevents cross-contamination
- Workers focus on single domain/question
- Supervisor can coordinate complex multi-step reasoning

**Cons:**
- More complex architecture
- Additional overhead from delegation
- Relies on supervisor making correct delegations
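
The delegation flow above can be sketched in plain Python without any LLM calls. This is an illustration under stated assumptions, not the implementation behind `run_deep_agent`: `make_worker`, `supervisor`, and the domain names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical worker signature: takes a sub-task, returns a short result string
Worker = Callable[[str], str]


def make_worker(domain: str) -> Worker:
    def worker(subtask: str) -> str:
        # Fresh local history: nothing from other workers is visible here
        local_history = [f"system: research {domain}", f"user: {subtask}"]
        local_history.append(f"tool: raw notes about {domain} " + "." * 200)
        # Only a compact summary leaves the worker, not the raw tool output
        return f"{domain}: answered {subtask!r} after {len(local_history)} local messages"
    return worker


def supervisor(subtasks: Dict[str, str]) -> Tuple[str, List[str]]:
    """Delegate each sub-task to its own worker and collect the summaries."""
    supervisor_history: List[str] = []
    for domain, subtask in subtasks.items():
        result = make_worker(domain)(subtask)
        supervisor_history.append(result)  # one short entry per worker
    final_answer = "\n".join(supervisor_history)
    return final_answer, supervisor_history


# Placeholder domains and questions, chosen only for illustration
final_answer, supervisor_history = supervisor({
    "domain_a": "question 1",
    "domain_b": "question 2",
})
print(final_answer)
```

The supervisor's history holds one summary per worker, so its size scales with the number of delegations and not with the volume of raw tool output.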

## Key Takeaways

### Context Isolation Helps

The Deep Agent's context isolation approach shows promise for reducing confusion in multi-domain tasks:
- Workers process one sub-task at a time without accumulated context noise
- Explicit delegation forces task decomposition
- Results are explicitly passed back rather than implicitly remembered

### When to Use Each Approach

| Aspect | Standard Agent | Deep Agent |
|--------|----------------|------------|
| **Best for** | Simple, single-domain tasks | Complex, multi-domain tasks |
| **Context size** | Grows with every tool call | Isolated per worker |
| **Complexity** | Low | Higher |
| **Latency** | Lower | Higher (multiple agent calls) |

### Next Steps

Based on these results:
1. **Tune delegation prompts** - Help supervisor make better task breakdowns
2. **Add verification** - Have workers verify calculated values before reporting
3. **Expand evaluation** - Test on more diverse task types
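
As part of expanding the evaluation, it may help to check whether the per-trial gap between the two agents is larger than trial-to-trial noise. The sketch below uses a paired percentile bootstrap over per-trial score differences. `paired_bootstrap_ci` is a hypothetical helper, and the scores are placeholder values, not results from this notebook.

```python
import random
from statistics import mean


def paired_bootstrap_ci(diffs, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Placeholder scores; with the real data, pass
# (df['deep_score'] - df['standard_score']).tolist()
standard_scores = [0.5, 0.625, 0.75, 0.5]
deep_scores = [0.75, 0.625, 0.875, 0.75]
diffs = [d - s for d, s in zip(deep_scores, standard_scores)]

mean_diff, lo, hi = paired_bootstrap_ci(diffs)
print(f"mean difference: {mean_diff:+.3f} (95% bootstrap interval {lo:+.3f} to {hi:+.3f})")
```

With only a handful of trials the interval will be wide, so an interval that includes zero is a reason to collect more trials before declaring a winner.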