# Context Distraction: Comparing Custom Graph vs DeepAgents Framework

This notebook compares two approaches to context isolation for complex, multi-step research tasks.

## The Problem

As an LLM agent works through a research task, every tool call and its result is appended to the conversation context. A task that needs dozens of tool calls therefore produces a very long context, and **LLMs tend to lose recall accuracy over very long contexts**. This effect is called **context distraction**.
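
To make the growth concrete, here is a toy simulation, not a measurement of either agent. The helper names, the 2,000-character tool result, and the "about 4 characters per token" rule of thumb are all assumptions chosen for illustration.

```python
# Toy illustration: every tool call appends a request and a result to the
# same message list, so the prompt grows with each step.

def rough_token_count(messages: list[str]) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return sum(len(m) for m in messages) // 4


def simulate_run(num_tool_calls: int, result_chars: int = 2000) -> list[int]:
    """Return the estimated context size after each tool call."""
    messages = ["Research five technology sectors and answer 9 questions."]
    sizes = []
    for step in range(num_tool_calls):
        messages.append(f"tool_call: search(step={step})")
        messages.append("x" * result_chars)  # stand-in for a tool result
        sizes.append(rough_token_count(messages))
    return sizes


sizes = simulate_run(num_tool_calls=40)
print(f"after 1 call:   ~{sizes[0]} tokens")
print(f"after 40 calls: ~{sizes[-1]} tokens")
```

Under these assumptions the estimate grows roughly linearly with the number of tool calls, which is the accumulation that both agents in this notebook try to avoid.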

## What We'll Compare

We'll evaluate two context-isolation approaches on multi-domain investment research tasks:

1. **Custom Graph Agent** - Our hand-crafted LangGraph implementation
   - Uses supervisor/researcher pattern with explicit subgraphs
   - Custom state management and deliverable tracking
   - Manually designed workflow nodes

2. **DeepAgents Framework** - LangChain's open-source agent harness
   - Built-in `task` tool for spawning subagents
   - Built-in filesystem for intermediate results
   - Built-in planning with todos
   - Automatic summarization

**Goal**: Determine whether the simpler DeepAgents approach can match or exceed the custom graph agent's recall accuracy.

## Setup

In [None]:
# Imports
import asyncio
from typing import Dict, Any, List
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd
import numpy as np
from IPython.display import display
from dotenv import load_dotenv

load_dotenv()

# Import our test infrastructure
from context_distraction.resources.test_tasks import TEST_TASKS
from context_distraction.tests.evaluators import (
    recall_accuracy_evaluator,
    tool_call_completeness_evaluator,
    tool_call_efficiency_evaluator,
    extract_answers_json_from_text,
)
from context_distraction.tests.setup_datasets import build_reference_outputs
from context_distraction.resources.validation_utils import extract_tool_calls_from_message

print("✓ Setup complete")

## The Research Tasks

We'll evaluate the agents on 3 complex investment analysis tasks. Each task covers 5 technology sectors and has one primary focus:
- **Task 1**: Focus on Renewable Energy
- **Task 2**: Focus on Electric Vehicles
- **Task 3**: Focus on Biotechnology

Each task requires gathering statistics, expert opinions, case studies, and performing financial calculations including compound growth projections, cost-benefit analysis with NPV, correlation analyses, and investment portfolio optimization.

For each task, the agent must answer 9 specific questions that require precise recall of facts gathered throughout the research process.
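
As a rough picture of what a recall score measures, the sketch below computes the fraction of reference answers that an agent reproduced. It is not the `recall_accuracy_evaluator` used later in this notebook: the function names, the 1% numeric tolerance, and the sample values are hypothetical.

```python
import math


def is_match(predicted, expected, rel_tol: float = 0.01) -> bool:
    """Compare numerically when both values parse as numbers, else as text."""
    try:
        return math.isclose(float(predicted), float(expected), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return str(predicted).strip().lower() == str(expected).strip().lower()


def recall_score(answers: dict, reference: dict) -> float:
    """Fraction of reference questions answered correctly."""
    if not reference:
        return 0.0
    correct = sum(
        1
        for key, expected in reference.items()
        if key in answers and is_match(answers[key], expected)
    )
    return correct / len(reference)


# Made-up values purely for illustration.
reference = {"q1": 12.5, "q2": "solar", "q3": 1_000_000}
answers = {"q1": "12.5", "q2": "Solar", "q3": 900_000}
print(f"{recall_score(answers, reference):.1%}")  # prints 66.7%
```

In this example two of the three answers match (the third is off by 10%, outside the assumed tolerance), so the score is 2/3.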

In [None]:
# Show all test cases
print("Test Cases:")
for i, task in enumerate(TEST_TASKS, 1):
    print(f"\n{i}. {task['name']}")
    print(f"   Primary domain: {task['primary_domain']}")
    print(f"   All domains: {', '.join(task['topics'])}")
    print(f"   Questions to answer: {len(task['recall_questions'])}")

## Agent Runner Functions

These functions run each agent type and extract structured outputs for evaluation.

In [None]:
# Import agents
from context_distraction.graph import graph as graph_agent
from context_distraction.deepagent import deep_agent, run_deep_agent
from langchain_core.messages import HumanMessage

async def run_graph_agent(query: str) -> dict:
    """Run graph agent with recursion_limit=200 and extract outputs."""
    try:
        trajectory = []
        final_response = ""
        all_messages = []
        
        # Set recursion limit to 200 for complex tasks
        config = {"recursion_limit": 200}
        
        async for chunk in graph_agent.astream(
            {"supervisor_messages": [HumanMessage(content=query)]},
            config=config,
            subgraphs=True,
            stream_mode="updates",
        ):
            if isinstance(chunk, tuple) and len(chunk) >= 2:
                namespace, data = chunk
            elif isinstance(chunk, dict):
                data = chunk
            else:
                continue
            
            if isinstance(data, dict):
                for node_key, node_data in data.items():
                    if isinstance(node_data, dict):
                        for msg_key in ['supervisor_messages', 'researcher_messages', 'messages']:
                            if msg_key in node_data and isinstance(node_data[msg_key], list):
                                msgs = node_data[msg_key]
                                all_messages.extend(msgs)
                                
                                for msg in msgs:
                                    tool_calls = extract_tool_calls_from_message(msg)
                                    for tc in tool_calls:
                                        trajectory.append(tc)
        
        # Extract final response
        for msg in reversed(all_messages):
            if isinstance(msg, dict) and msg.get("content"):
                final_response = msg["content"]
                break
            elif hasattr(msg, 'content') and msg.content:
                final_response = msg.content
                break
        
        return {"final_response": final_response, "trajectory": trajectory, "error": None}
    except Exception as e:
        return {"final_response": "", "trajectory": [], "error": str(e)}

print("✓ Defined agent runners")
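
The runner above relies on `extract_tool_calls_from_message`, whose implementation is not shown in this notebook. The sketch below is a hypothetical stand-in (`tool_calls_from` is an invented name) that shows the general idea. It assumes messages are either plain dicts or objects with a `tool_calls` attribute holding dicts with `name` and `args` keys, which is how recent `langchain-core` versions represent tool calls on `AIMessage`.

```python
from types import SimpleNamespace


def tool_calls_from(msg) -> list[dict]:
    """Return [{'name': ..., 'args': ...}, ...] for a dict or object message."""
    if isinstance(msg, dict):
        raw = msg.get("tool_calls") or []
    else:
        raw = getattr(msg, "tool_calls", None) or []
    calls = []
    for tc in raw:
        if isinstance(tc, dict):  # skip anything that is not a tool-call dict
            calls.append({"name": tc.get("name"), "args": tc.get("args", {})})
    return calls


# SimpleNamespace stands in for an AI message object in this example.
ai_like = SimpleNamespace(
    content="",
    tool_calls=[{"name": "search", "args": {"q": "solar"}, "id": "call_1"}],
)
plain = {"content": "done"}

print(tool_calls_from(ai_like))
print(tool_calls_from(plain))
```

Handling both dicts and objects mirrors the final-response extraction in `run_graph_agent`, which also accepts either shape.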

## Run All Test Cases on Both Agents

Let's run all 3 test cases on both agents and evaluate their performance.

**Note:** Expect this to take roughly 30-45 minutes. The test cases run one after another, and within each case the Graph agent runs before the DeepAgent.

In [None]:
# Run all test cases sequentially
all_results = []

for i, task in enumerate(TEST_TASKS, 1):
    print(f"\n{'='*80}")
    print(f"Running Test Case {i}: {task['name']}")
    print(f"{'='*80}\n")
    
    reference_outputs = build_reference_outputs(task)
    inputs = {"query": task["query"]}
    
    # Run graph agent (baseline)
    print("Running Graph agent...", flush=True)
    graph_outputs = await run_graph_agent(task["query"])
    if graph_outputs.get("error"):
        print(f"  ✗ FAILED: {graph_outputs['error']}", flush=True)
        graph_recall = {"score": 0.0}
        graph_completeness = {"score": 0.0}
        graph_efficiency = {"score": 0.0}
    else:
        print(f"  ✓ Completed {len(graph_outputs['trajectory'])} tool calls", flush=True)
        graph_recall = recall_accuracy_evaluator(inputs, graph_outputs, reference_outputs)
        graph_completeness = tool_call_completeness_evaluator(inputs, graph_outputs, reference_outputs)
        graph_efficiency = tool_call_efficiency_evaluator(inputs, graph_outputs, reference_outputs)
    
    # Run deep agent
    print("Running DeepAgent...", flush=True)
    deep_outputs = await run_deep_agent(task["query"])
    if deep_outputs.get("error"):
        print(f"  ✗ FAILED: {deep_outputs['error']}", flush=True)
        deep_recall = {"score": 0.0}
        deep_completeness = {"score": 0.0}
        deep_efficiency = {"score": 0.0}
    else:
        print(f"  ✓ Completed {len(deep_outputs['trajectory'])} tool calls", flush=True)
        deep_recall = recall_accuracy_evaluator(inputs, deep_outputs, reference_outputs)
        deep_completeness = tool_call_completeness_evaluator(inputs, deep_outputs, reference_outputs)
        deep_efficiency = tool_call_efficiency_evaluator(inputs, deep_outputs, reference_outputs)
    
    # Store results
    all_results.append({
        'case': i,
        'name': task['name'],
        'primary_domain': task['primary_domain'],
        'graph': {
            'recall': graph_recall['score'],
            'completeness': graph_completeness['score'],
            'efficiency': graph_efficiency['score'],
            'tool_calls': len(graph_outputs['trajectory']),
            'failed': bool(graph_outputs.get("error"))
        },
        'deep': {
            'recall': deep_recall['score'],
            'completeness': deep_completeness['score'],
            'efficiency': deep_efficiency['score'],
            'tool_calls': len(deep_outputs.get('trajectory', [])),  # tolerate a missing key on failure
            'failed': bool(deep_outputs.get("error"))
        }
    })
    
    print(f"\n  Graph:     {graph_recall['score']:.1%} recall, {graph_completeness['score']:.1%} completeness{'  [FAILED]' if graph_outputs.get('error') else ''}")
    print(f"  DeepAgent: {deep_recall['score']:.1%} recall, {deep_completeness['score']:.1%} completeness{'  [FAILED]' if deep_outputs.get('error') else ''}")

print(f"\n{'='*80}")
print("✓ All test cases completed")
print(f"{'='*80}")

## Results Summary: Individual Test Cases

In [None]:
# Create detailed results table
results_data = []
for result in all_results:
    graph_recall = f"{result['graph']['recall']:.1%}"
    deep_recall = f"{result['deep']['recall']:.1%}"
    
    if result['graph'].get('failed'):
        graph_recall += " (FAILED)"
    if result['deep'].get('failed'):
        deep_recall += " (FAILED)"
    
    results_data.append({
        'Test Case': f"Case {result['case']}",
        'Domain': result['primary_domain'].replace('_', ' ').title(),
        'Graph Recall': graph_recall,
        'DeepAgent Recall': deep_recall,
        'Difference': f"{(result['deep']['recall'] - result['graph']['recall']) * 100:+.1f}pp"
    })

results_df = pd.DataFrame(results_data)
display(results_df)

print("\n📊 Individual Case Analysis:")
for result in all_results:
    diff = (result['deep']['recall'] - result['graph']['recall']) * 100
    graph_status = " (FAILED)" if result['graph'].get('failed') else ""
    deep_status = " (FAILED)" if result['deep'].get('failed') else ""
    print(f"   Case {result['case']}: DeepAgent {result['deep']['recall']:.1%}{deep_status} vs Graph {result['graph']['recall']:.1%}{graph_status} ({diff:+.1f}pp)")

## Average Performance Comparison

Let's calculate and visualize the average performance across all 3 test cases.

In [None]:
# Calculate averages
avg_graph_recall = np.mean([r['graph']['recall'] for r in all_results])
avg_deep_recall = np.mean([r['deep']['recall'] for r in all_results])
avg_graph_completeness = np.mean([r['graph']['completeness'] for r in all_results])
avg_deep_completeness = np.mean([r['deep']['completeness'] for r in all_results])
avg_graph_efficiency = np.mean([r['graph']['efficiency'] for r in all_results])
avg_deep_efficiency = np.mean([r['deep']['efficiency'] for r in all_results])

# Create average comparison chart
fig = go.Figure()

agents = ["Graph Agent<br>(Custom LangGraph)", "DeepAgent<br>(Framework)"]  # Plotly breaks lines on <br>, not \n

fig.add_trace(go.Bar(
    name='Recall Accuracy',
    x=agents,
    y=[avg_graph_recall, avg_deep_recall],
    marker_color='#1f77b4',
    text=[f"{avg_graph_recall:.1%}", f"{avg_deep_recall:.1%}"],
    textposition='outside'
))

fig.add_trace(go.Bar(
    name='Tool Call Completeness',
    x=agents,
    y=[avg_graph_completeness, avg_deep_completeness],
    marker_color='#2ca02c',
    text=[f"{avg_graph_completeness:.1%}", f"{avg_deep_completeness:.1%}"],
    textposition='outside'
))

fig.add_trace(go.Bar(
    name='Tool Call Efficiency',
    x=agents,
    y=[avg_graph_efficiency, avg_deep_efficiency],
    marker_color='#ff7f0e',
    text=[f"{avg_graph_efficiency:.2f}", f"{avg_deep_efficiency:.2f}"],
    textposition='outside'
))

fig.update_layout(
    title="Average Performance: Graph Agent vs DeepAgent",
    yaxis_title="Score",
    barmode='group',
    height=500,
    yaxis=dict(range=[0, 1.1]),
    showlegend=True
)

fig.show()

# Print summary statistics
print("\n📊 AVERAGE RESULTS (across all 3 test cases):")
print(f"\n  Graph Agent (Custom LangGraph):")
print(f"    - Recall Accuracy: {avg_graph_recall:.1%}")
print(f"    - Tool Completeness: {avg_graph_completeness:.1%}")
print(f"    - Tool Efficiency: {avg_graph_efficiency:.2f}")
print(f"\n  DeepAgent (Framework):")
print(f"    - Recall Accuracy: {avg_deep_recall:.1%}")
print(f"    - Tool Completeness: {avg_deep_completeness:.1%}")
print(f"    - Tool Efficiency: {avg_deep_efficiency:.2f}")
print(f"\n  📈 Difference (DeepAgent - Graph):")
recall_diff = (avg_deep_recall - avg_graph_recall) * 100
print(f"    - Recall Accuracy: {recall_diff:+.1f} percentage points")

## Detailed Case-by-Case Comparison

In [None]:
# Create case-by-case comparison chart
cases = [f"Case {r['case']}" for r in all_results]
graph_scores = [r['graph']['recall'] for r in all_results]
deep_scores = [r['deep']['recall'] for r in all_results]

fig = go.Figure()

fig.add_trace(go.Bar(
    name='Graph Agent',
    x=cases,
    y=graph_scores,
    marker_color='#1f77b4',
    text=[f"{s:.1%}" for s in graph_scores],
    textposition='outside'
))

fig.add_trace(go.Bar(
    name='DeepAgent',
    x=cases,
    y=deep_scores,
    marker_color='#2ca02c',
    text=[f"{s:.1%}" for s in deep_scores],
    textposition='outside'
))

# Add average lines
fig.add_trace(go.Scatter(
    x=cases,
    y=[avg_graph_recall] * len(cases),
    mode='lines',
    name='Graph Avg',
    line=dict(color='#1f77b4', width=2, dash='dash'),
    showlegend=True
))

fig.add_trace(go.Scatter(
    x=cases,
    y=[avg_deep_recall] * len(cases),
    mode='lines',
    name='DeepAgent Avg',
    line=dict(color='#2ca02c', width=2, dash='dash'),
    showlegend=True
))

fig.update_layout(
    title="Recall Accuracy by Test Case",
    xaxis_title="Test Case",
    yaxis_title="Recall Accuracy",
    barmode='group',
    height=500,
    yaxis=dict(range=[0, 1.1]),  # headroom so 'outside' labels on tall bars are not clipped
    showlegend=True
)

fig.show()

## Comparing the Two Approaches

### Custom Graph Agent (LangGraph)

**How it works:**
1. **Planner node** extracts deliverables as structured output
2. **Supervisor node** coordinates research via `deep_research` tool
3. **Researcher subgraph** executes in isolated context, stores results via `store_deliverable`
4. **Final report node** synthesizes all deliverables
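
The four steps above can be pictured with a plain-Python stand-in. This is not the notebook's LangGraph implementation: `ResearchState`, the function bodies, and the comma-split "planning" are all hypothetical. The sketch only shows where the isolation happens: the researcher keeps its scratch work local and hands back a single finding.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    """Shared state: only the plan and stored deliverables live here."""
    query: str
    deliverables: list[str] = field(default_factory=list)
    results: dict[str, str] = field(default_factory=dict)


def planner(state: ResearchState) -> None:
    # Stand-in for structured-output planning: one deliverable per topic.
    topics = [t.strip() for t in state.query.split(",")]
    state.deliverables = [t for t in topics if t]


def researcher(deliverable: str) -> str:
    # Works in its own scratch list; only the finding is returned.
    scratch = [f"search({deliverable!r})", f"raw notes on {deliverable} ..."]
    return f"finding for {deliverable} ({len(scratch)} steps)"


def supervisor(state: ResearchState) -> None:
    for item in state.deliverables:
        # Storing the result plays the role of `store_deliverable`.
        state.results[item] = researcher(item)


def final_report(state: ResearchState) -> str:
    return "\n".join(f"- {k}: {v}" for k, v in state.results.items())


state = ResearchState(query="solar, wind, batteries")
planner(state)
supervisor(state)
print(final_report(state))
```

The supervisor never sees the researcher's `scratch` list, so the shared state grows by one short entry per deliverable rather than by every intermediate step.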

**Pros:**
- Explicit workflow control
- Guaranteed execution order (plan → research → report)
- Custom state management for deliverables

**Cons:**
- Complex implementation (~250 lines of graph code)
- Manual state passing between nodes
- Requires understanding LangGraph internals

### DeepAgent Framework

**How it works:**
1. Built-in `task` tool spawns subagents with isolated context
2. Built-in filesystem for storing/sharing intermediate results
3. Built-in todo tracking for planning
4. Automatic summarization when context grows
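
The summarization step (item 4) can be sketched as follows. This is an illustration of the general idea only, not DeepAgents' actual logic: `compact`, `summarize`, the character budget, and the `keep_last` parameter are hypothetical, and a real agent would call an LLM where the placeholder summarizer is.

```python
def summarize(messages: list[str]) -> str:
    """Placeholder summarizer; a real agent would call an LLM here."""
    return f"[summary of {len(messages)} earlier messages]"


def compact(messages: list[str], max_chars: int, keep_last: int = 2) -> list[str]:
    """Collapse older messages into one summary once the history is too long."""
    total = sum(len(m) for m in messages)
    split = len(messages) - keep_last
    if total <= max_chars or split <= 0:
        return messages  # under budget, or nothing old enough to summarize
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent


history = [f"tool result {i}: " + "x" * 500 for i in range(10)]
compacted = compact(history, max_chars=2000)
print(len(history), "->", len(compacted), "messages")  # prints 10 -> 3 messages
```

Keeping the most recent messages verbatim is one possible design choice. It preserves the detail the agent is currently working with while older material is reduced to a summary.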

**Pros:**
- Simpler implementation (~150 lines including prompts)
- Framework handles state propagation
- Built-in context management features
- Easier to extend and customize

**Cons:**
- Less explicit workflow control (relies on prompting)
- May require prompt tuning for specific use cases
- Filesystem operations add some overhead

## Key Takeaways

### Context Isolation Works

Both approaches rely on **context isolation** to handle complex, multi-step tasks:
- Spawning subagents/researchers prevents context accumulation
- Explicit storage of findings (state or filesystem) preserves critical information
- Planning before execution helps organize complex workflows

### Framework vs Custom Trade-offs

| Aspect | Graph Agent | DeepAgent |
|--------|-------------|-----------|
| **Implementation complexity** | High | Low |
| **Workflow control** | Explicit nodes | Prompt-driven |
| **State management** | Custom | Built-in |
| **Context handling** | Manual subgraph | Auto summarization |
| **Extensibility** | Requires graph changes | Add tools/prompts |

### Recommendations

**Use DeepAgent framework when:**
- You want simpler, faster implementation
- The workflow can be guided by prompts
- You need built-in features (filesystem, todos, summarization)
- You're building general-purpose agents

**Use custom LangGraph when:**
- You need strict workflow control
- You have complex state requirements
- You need fine-grained observability
- You're optimizing for a specific use case

### Next Steps

Based on these results, consider:
1. **Prompt tuning** - Adjust supervisor/researcher prompts to improve recall
2. **Hybrid approach** - Use DeepAgent with custom middleware for specific behaviors
3. **Evaluation expansion** - Test on additional task types and domains