# Custom Strategies Benchmark

**Focus:** Compare adaptive multi-agent strategies against single-model baseline

**Date:** [Fill in]  
**Author:** Leif Haven Martinson  

## Purpose

This notebook focuses on benchmarking custom mental model strategies:
- **Single Model** - Baseline performance
- **Design Critique** - 5 designer archetypes provide iterative feedback
- **Interdisciplinary Team** - PM, Engineer, Designer collaborate  
- **Adaptive Team** - ⭐ **NEW!** Dynamically generates custom experts tailored to each specific problem

Compare these approaches on established benchmarks with **latest 2025 SOTA baselines**.

## Adaptive Team Strategy

The **adaptive_team** strategy is a meta-strategy that:
1. Analyzes each problem to understand what expertise is needed
2. Dynamically generates custom expert personas specifically for that problem
3. Runs collaborative analysis with the problem-specific experts

**Example:** 
- Math problem → Mathematician, Math Teacher, Applied Scientist
- Business question → Market Analyst, CFO, Operations Manager
- Medical question → Doctor, Pharmacist, Research Scientist

This lets the expert team be tailored to each individual problem instead of being fixed in advance.
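
The three steps above can be sketched as follows. This is a minimal illustration, not the harness's actual `adaptive_team` implementation: the `Expert` class, the `generate_experts` and `adaptive_team_sketch` functions, the prompts, and the `llm` callable (any function mapping a prompt string to a reply string) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Expert:
    """Hypothetical persona record: a role name plus its system prompt."""
    name: str
    system_prompt: str


# Hypothetical stand-in for a model call: prompt in, text out.
LLM = Callable[[str], str]


def generate_experts(problem: str, n_experts: int, llm: LLM) -> List[Expert]:
    """Steps 1-2: ask the model which expert roles this problem needs."""
    reply = llm(f"List {n_experts} expert roles, one per line, for: {problem}")
    names = [line.strip() for line in reply.splitlines() if line.strip()]
    return [
        Expert(name, f"You are a {name}. Analyze the problem from your specialty.")
        for name in names[:n_experts]
    ]


def adaptive_team_sketch(problem: str, llm: LLM, n_experts: int = 3) -> str:
    """Step 3: collect one analysis per generated expert, then synthesize."""
    experts = generate_experts(problem, n_experts, llm)
    analyses = [llm(f"{e.system_prompt}\n\nProblem: {problem}") for e in experts]
    notes = "\n".join(f"{e.name}: {a}" for e, a in zip(experts, analyses))
    return llm(f"Synthesize a final answer.\n{notes}\n\nProblem: {problem}")
```

Passing a stub function as `llm` makes the flow testable without any API calls, which is handy when iterating on the prompts.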

In [None]:
# Setup
import sys
sys.path.append('../code')

from harness import (
    load_benchmark,
    get_baseline_scores,
    BENCHMARKS,
    run_strategy,
    ExperimentConfig,
    ExperimentResult,
    get_tracker
)
from harness.defaults import DEFAULT_MODEL, DEFAULT_PROVIDER

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

%matplotlib inline
sns.set_style('whitegrid')

print("✅ Setup complete")
print(f"\nAvailable benchmarks: {list(BENCHMARKS.keys())}")

## Agent Persona Configuration

**CUSTOMIZE HERE:** Edit the personas for rapid experimentation
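
When editing personas by hand, it is easy to leave a field blank or misspell a key. The check below is a small sketch that flags incomplete entries before a run. The key names mirror the persona dictionaries used in this notebook; the `missing_persona_keys` helper and the sample data are illustrative assumptions, not part of the harness.

```python
# Keys used by the two persona lists in this notebook.
CRITIC_KEYS = ("name", "focus", "criteria")
EXPERT_KEYS = ("name", "role", "perspective", "system_prompt")


def missing_persona_keys(personas, required):
    """Return {persona label: [keys that are absent or blank]}."""
    problems = {}
    for i, persona in enumerate(personas):
        missing = [k for k in required if not str(persona.get(k, "")).strip()]
        if missing:
            label = persona.get("name") or f"persona #{i}"
            problems[label] = missing
    return problems


# Illustrative data only: the second entry has an empty "focus".
sample_panel = [
    {"name": "Systems Designer", "focus": "System architecture", "criteria": "Evaluate coherence."},
    {"name": "Draft Critic", "focus": "", "criteria": "Evaluate clarity."},
]
print(missing_persona_keys(sample_panel, CRITIC_KEYS))
```

An empty result means every persona has all required fields filled in.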

In [None]:
# ========================================
# DESIGN CRITIQUE PANEL - EDIT THESE
# ========================================

CRITIQUE_PANEL = [
    {
        "name": "Systems Designer",
        "focus": "System architecture and holistic design",
        "criteria": """You are a Systems Designer who thinks about the big picture and interconnections.
Evaluate:
- Does the solution consider the whole system and its parts?
- Are relationships between components clear?
- Is there coherence between different elements?
- Does the design scale and adapt to different contexts?
Focus on holistic thinking, interconnections, and systemic coherence."""
    },
    {
        "name": "Visual Craft Specialist",
        "focus": "Visual clarity, aesthetics, and presentation",
        "criteria": """You are a Visual Craft Specialist focused on how information is presented and perceived.
Evaluate:
- Is information presented clearly and visually comprehensible?
- Is there good hierarchy and structure in the presentation?
- Are concepts illustrated or explained in ways that are easy to visualize?
- Does the format enhance understanding?
Focus on clarity of presentation, visual thinking, and aesthetic quality."""
    },
    {
        "name": "AI Specialist",
        "focus": "AI/ML capabilities, limitations, and best practices",
        "criteria": """You are an AI Specialist with deep knowledge of AI systems, capabilities, and limitations.
Evaluate:
- Are claims about AI accurate and grounded in current capabilities?
- Are limitations and potential issues with AI acknowledged?
- Are AI-related recommendations practical and informed?
- Is the approach aligned with AI best practices?
Focus on technical accuracy regarding AI/ML, practical feasibility, and responsible AI considerations."""
    },
    {
        "name": "Human-Computer Interaction Expert",
        "focus": "User experience, usability, and human factors",
        "criteria": """You are an HCI Expert focused on how humans interact with systems and information.
Evaluate:
- Is the solution usable and accessible to the target audience?
- Are user needs and cognitive limitations considered?
- Is interaction intuitive and aligned with mental models?
- Are there potential usability issues or barriers?
Focus on user-centered design, accessibility, cognitive ergonomics, and interaction patterns."""
    },
    {
        "name": "IDEO Design Thinking Facilitator",
        "focus": "Human-centered innovation and creative problem-solving",
        "criteria": """You are an IDEO-trained Design Thinking practitioner emphasizing empathy, ideation, and iteration.
Evaluate:
- Does the solution demonstrate empathy for user needs and pain points?
- Is there creative thinking and exploration of possibilities?
- Are assumptions tested or validated?
- Is the approach iterative and open to refinement?
Focus on empathy, creative exploration, prototyping mindset, and bias toward action."""
    },
]

# ========================================
# INTERDISCIPLINARY TEAM - EDIT THESE
# ========================================

EXPERT_TEAM = [
    {
        "name": "Product Manager",
        "role": "Product Management",
        "perspective": "User needs, business value, roadmap, and strategic priorities",
        "system_prompt": """You are a Product Manager responsible for defining what to build and why.

Your mission: Ensure the solution creates real user value while achieving business objectives.

Focus on:
- User needs, pain points, and jobs-to-be-done
- Business impact, ROI, and strategic alignment
- Feature prioritization and tradeoffs
- Market fit and competitive positioning
- Success metrics and measurable outcomes
- Feasibility vs. value vs. risk assessment

Analyze problems by asking:
- What user problem does this solve?
- What is the business value?
- How do we measure success?
- What are the must-haves vs. nice-to-haves?
- What are the risks and mitigations?

Bring a balanced perspective that bridges user needs, business goals, and technical reality."""
    },
    {
        "name": "Software Engineer",
        "role": "Engineering",
        "perspective": "Technical implementation, architecture, scalability, and feasibility",
        "system_prompt": """You are a Software Engineer responsible for building and shipping reliable systems.

Your mission: Deliver technically sound solutions that are maintainable, scalable, and feasible within constraints.

Focus on:
- Technical feasibility and implementation complexity
- System architecture and design patterns
- Scalability, performance, and reliability
- Security, data integrity, and edge cases
- Technical debt and long-term maintainability
- Development velocity and engineering resources
- Integration with existing systems

Analyze problems by asking:
- Is this technically feasible?
- What is the implementation complexity?
- What are the technical risks?
- How does this scale?
- What are the dependencies and blockers?
- What technical debt are we taking on?

Bring a pragmatic engineering perspective focused on what we can actually build and ship."""
    },
    {
        "name": "Product Designer",
        "role": "Design",
        "perspective": "User experience, interaction design, usability, and design quality",
        "system_prompt": """You are a Product Designer responsible for crafting intuitive, delightful user experiences.

Your mission: Ensure the solution is usable, accessible, and provides a great user experience.

Focus on:
- User experience and interaction design
- Usability, learnability, and accessibility
- User flows and mental models
- Information architecture and navigation
- Visual design and brand consistency
- Edge cases and error states
- User research insights and validation

Analyze problems by asking:
- Is this intuitive for users?
- What is the user flow?
- Are there usability issues or friction points?
- Is it accessible to all users?
- How do we handle edge cases and errors?
- Does this match user mental models?

Bring a user-centered design perspective that ensures solutions are not just functional but delightful to use."""
    },
]

print(f"✅ Agent personas configured:")
print(f"   Design Critique Panel: {len(CRITIQUE_PANEL)} critics")
print(f"   Interdisciplinary Team: {len(EXPERT_TEAM)} experts")
print(f"\n💡 Edit these personas above to experiment with different perspectives!")

## Benchmark Configuration

**CUSTOMIZE HERE:** Choose benchmark and strategies
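
One pitfall when customizing the strategy list: results in this notebook are grouped by strategy name, so listing the same strategy twice with different settings would merge both runs into one row. The helper below is a hypothetical sanity check for that case, assuming the list holds `(name, kwargs)` tuples as it does in this notebook.

```python
from collections import Counter


def duplicate_strategy_names(strategies):
    """Return strategy names that appear more than once in the list."""
    counts = Counter(name for name, _kwargs in strategies)
    return sorted(name for name, n in counts.items() if n > 1)


# Illustrative list: design_critique appears twice with different settings.
sample_strategies = [
    ("single", {}),
    ("design_critique", {"n_iterations": 1}),
    ("design_critique", {"n_iterations": 3}),
]
print(duplicate_strategy_names(sample_strategies))
```

If the check reports duplicates, one option is to add a separate label column to the results so each variant stays distinct.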

In [None]:
# ========================================
# BENCHMARK CONFIGURATION
# ========================================

# Which benchmark to run
BENCHMARK_NAME = "gsm8k"  # 🔧 CHANGE: "gsm8k", "mmlu", "truthfulqa", "arc", "humaneval", "gpqa"

# How many tasks to evaluate
NUM_TASKS = 10  # 🔧 CHANGE: Start with 10-20, increase to 100+ for full eval

# Random seed for reproducibility
SEED = 42

# Model configuration
PROVIDER = DEFAULT_PROVIDER
MODEL = DEFAULT_MODEL

# ========================================
# STRATEGIES TO COMPARE
# ========================================

STRATEGIES_TO_TEST = [
    # Baseline
    ("single", {
        "provider": PROVIDER,
        "model": MODEL,
        "verbose": False
    }),

    # Design Critique - 5 designer archetypes (fixed panel)
    ("design_critique", {
        "n_iterations": 2,  # 🔧 Number of critique/revision cycles
        "critique_panel": CRITIQUE_PANEL,
        "provider": PROVIDER,
        "model": MODEL,
        "verbose": False  # 🔧 Set to True to see streaming
    }),

    # Interdisciplinary Team - PM, Engineer, Designer (fixed team)
    ("interdisciplinary_team", {
        "refinement_rounds": 1,  # 🔧 Number of refinement iterations
        "expert_team": EXPERT_TEAM,
        "provider": PROVIDER,
        "model": MODEL,
        "verbose": False  # 🔧 Set to True to see streaming
    }),

    # ========================================
    # ⭐ ADAPTIVE TEAM - DYNAMICALLY GENERATED EXPERTS
    # ========================================
    # This strategy analyzes EACH problem and generates custom experts!
    # Different problems get different expert teams tailored to that specific question

    ("adaptive_team", {
        "n_experts": 3,  # 🔧 How many experts to generate per problem
        "refinement_rounds": 1,  # 🔧 Number of refinement iterations
        "provider": PROVIDER,
        "model": MODEL,
        "verbose": True  # 🔧 Recommended: True to see which experts are generated!
    }),
]

print(f"✅ Configuration:")
print(f"   Benchmark: {BENCHMARK_NAME}")
print(f"   Tasks: {NUM_TASKS}")
print(f"   Strategies: {len(STRATEGIES_TO_TEST)}")
print(f"   Model: {MODEL} ({PROVIDER})")
print(f"\n💡 Adaptive team will generate custom experts for each task!")

## Load Benchmark

Load benchmark tasks and show current SOTA baselines

In [None]:
# Load benchmark
benchmark = load_benchmark(BENCHMARK_NAME)
benchmark.load()  # Make sure benchmark is loaded
tasks = benchmark.get_tasks(n=NUM_TASKS, seed=SEED)

print(f"📊 Loaded {len(tasks)} tasks from {BENCHMARK_NAME}")
print(f"   (Requested: {NUM_TASKS}, Actual: {len(tasks)})")

if len(tasks) == 0:
    print("\n⚠️  WARNING: No tasks loaded! Check benchmark.load() method.")
else:
    print(f"\nSample task:")
    print(f"  Input: {tasks[0].input[:100]}...")
    print(f"  Expected: {tasks[0].expected}")
    print(f"  Category: {tasks[0].category}")

# Show baseline scores from literature
baselines = get_baseline_scores(BENCHMARK_NAME)
if baselines:
    print(f"\n📈 Published SOTA baselines on {BENCHMARK_NAME} (Jan 2025):")
    print("─" * 60)
    for model, score in sorted(baselines.items(), key=lambda x: x[1], reverse=True)[:15]:
        print(f"   {model:30s}: {score:.1%}")
    print("─" * 60)

## Run Evaluation

Test each strategy on the benchmark tasks

**Note:** This may take a while depending on NUM_TASKS and whether verbose=True
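
Because a full run can be long, a single failed model call late in the loop can waste a lot of work. The wrapper below is a hedged sketch of one way to guard each call: it retries a few times and returns an error string instead of raising. `run_with_retries` is a hypothetical helper, not part of the harness, and it makes no assumptions about how `run_strategy` behaves on failure.

```python
import time
from typing import Any, Callable, Optional, Tuple


def run_with_retries(
    fn: Callable[..., Any],
    *args: Any,
    retries: int = 2,
    delay_s: float = 0.0,
    **kwargs: Any,
) -> Tuple[Optional[Any], Optional[str]]:
    """Call fn; on exception retry up to `retries` times.

    Returns (result, None) on success or (None, error message) on failure.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs), None
        except Exception as exc:  # broad on purpose: any failure skips the task
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < retries:
                time.sleep(delay_s)
    return None, last_error
```

In the evaluation loop, this could wrap the strategy call, for example `result, error = run_with_retries(run_strategy, strategy_name, task.input, **strategy_kwargs)`, with failed tasks logged and skipped instead of aborting the whole experiment.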

In [None]:
# Run evaluation for each strategy
results = []

for strategy_name, strategy_kwargs in STRATEGIES_TO_TEST:
    print(f"\n{'='*80}")
    print(f"🔄 Running: {strategy_name}")
    print(f"   Config: {strategy_kwargs}")
    print(f"{'='*80}\n")
    
    # Create experiment tracker
    config = ExperimentConfig(
        experiment_name=f"custom_benchmark_{BENCHMARK_NAME}_{strategy_name}",
        task_type=BENCHMARK_NAME,
        strategy=strategy_name,
        provider=PROVIDER,
        model=MODEL,
        notes=f"Custom strategies benchmark on {BENCHMARK_NAME}"
    )
    
    tracker = get_tracker()
    tracker.start_experiment(config)
    
    # Run on each task
    for task in tqdm(tasks, desc=f"{strategy_name}"):
        # Run strategy
        result = run_strategy(strategy_name, task.input, **strategy_kwargs)
        
        # Evaluate with benchmark-specific eval
        eval_scores = benchmark.evaluate(task, result.output)
        
        # Log result
        exp_result = ExperimentResult(
            config=config,
            task_input=task.input,
            output=result.output,
            latency_s=result.latency_s,
            tokens_in=result.tokens_in,
            tokens_out=result.tokens_out,
            cost_usd=result.cost_usd,
            eval_scores=eval_scores,
            eval_metadata={
                'task_id': task.id,
                'expected': task.expected,
                'benchmark': BENCHMARK_NAME
            }
        )
        
        tracker.log_result(exp_result)
        results.append({
            'strategy': strategy_name,
            'task_id': task.id,
            'accuracy': eval_scores.get('accuracy', 0),
            'latency_s': result.latency_s,
            'cost_usd': result.cost_usd or 0
        })
    
    # Finish experiment
    summary = tracker.finish_experiment()
    
    print(f"\n✅ {strategy_name} complete:")
    print(f"   Accuracy: {summary['eval_scores'].get('accuracy', {}).get('mean', 0):.1%}")
    print(f"   Avg latency: {summary['avg_latency_s']:.2f}s")
    print(f"   Total cost: ${summary['total_cost_usd']:.4f}")

print(f"\n{'='*80}")
print("✅ All strategies evaluated!")
print(f"{'='*80}")

## Results Analysis

In [None]:
# Convert to DataFrame for analysis
df = pd.DataFrame(results)

# Aggregate by strategy
strategy_summary = df.groupby('strategy').agg({
    'accuracy': ['mean', 'std'],
    'latency_s': 'mean',
    'cost_usd': 'sum'
}).round(4)

print("\n📊 Strategy Comparison:")
print("="*80)
print(strategy_summary)
print("="*80)

# Accuracy ranking
accuracy_ranking = df.groupby('strategy')['accuracy'].mean().sort_values(ascending=False)
print("\n🏆 Accuracy Ranking:")
for i, (strategy, acc) in enumerate(accuracy_ranking.items(), 1):
    print(f"   {i}. {strategy:25s}: {acc:.1%}")

# Cost analysis
cost_ranking = df.groupby('strategy')['cost_usd'].sum().sort_values()
print("\n💰 Cost Ranking (lowest to highest):")
for i, (strategy, cost) in enumerate(cost_ranking.items(), 1):
    print(f"   {i}. {strategy:25s}: ${cost:.4f}")

# Latency analysis
latency_ranking = df.groupby('strategy')['latency_s'].mean().sort_values()
print("\n⏱️  Latency Ranking (fastest to slowest):")
for i, (strategy, latency) in enumerate(latency_ranking.items(), 1):
    print(f"   {i}. {strategy:25s}: {latency:.2f}s")

## Task-by-Task Breakdown

**Detailed view:** See which strategies succeeded/failed on each individual question
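
For a more compact view of the same information, the per-task results can be pivoted into a task-by-strategy grid. The sketch below uses a small made-up DataFrame with the same `strategy`, `task_id`, and `accuracy` columns as the results collected in this notebook; the sample rows are illustrative, not real results.

```python
import pandas as pd

# Illustrative rows only, shaped like the notebook's results list.
sample_df = pd.DataFrame([
    {"strategy": "single", "task_id": "t1", "accuracy": 1.0},
    {"strategy": "single", "task_id": "t2", "accuracy": 0.0},
    {"strategy": "adaptive_team", "task_id": "t1", "accuracy": 1.0},
    {"strategy": "adaptive_team", "task_id": "t2", "accuracy": 1.0},
])

# One row per task, one column per strategy.
grid = sample_df.pivot(index="task_id", columns="strategy", values="accuracy")

# Count how many strategies solved each task.
grid["n_correct"] = grid.sum(axis=1)
print(grid)
```

`pivot` requires each (task, strategy) pair to be unique, so this assumes every strategy ran each task exactly once.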

In [None]:
# Create detailed task-by-task breakdown
print("\n" + "="*80)
print("📋 DETAILED TASK-BY-TASK RESULTS")
print("="*80)

# Get list of strategies tested
strategies = df['strategy'].unique()

# Build a lookup for task results
task_results = {}
for _, row in df.iterrows():
    task_id = row['task_id']
    if task_id not in task_results:
        task_results[task_id] = {}
    task_results[task_id][row['strategy']] = {
        'accuracy': row['accuracy'],
        'latency': row['latency_s'],
        'cost': row['cost_usd']
    }

# Display each task with strategy results
for i, task in enumerate(tasks, 1):
    print(f"\n{'─'*80}")
    print(f"📝 Task {i}/{len(tasks)}: {task.input[:150]}{'...' if len(task.input) > 150 else ''}")
    print(f"\n✓ Expected Answer: {task.expected}")
    
    if task.category:
        print(f"🏷️  Category: {task.category}")
    
    print(f"\n Strategy Results:")
    print(f" {'─'*78}")
    
    # Sort by accuracy (descending) to show best first
    results_for_task = task_results.get(task.id, {})
    sorted_strategies = sorted(strategies, 
                              key=lambda s: results_for_task.get(s, {}).get('accuracy', 0), 
                              reverse=True)
    
    for strategy in sorted_strategies:
        if strategy in results_for_task:
            result = results_for_task[strategy]
            correct = result['accuracy'] == 1.0
            symbol = "✅" if correct else "❌"
            status = "CORRECT" if correct else "INCORRECT"
            
            # Print one aligned status line per strategy
            print(f"   {symbol} {strategy:25s}: {status:10s} | " + 
                  f"Latency: {result['latency']:6.2f}s | Cost: ${result['cost']:.4f}")

# Summary statistics
print(f"\n\n{'='*80}")
print("📊 TASK DIFFICULTY ANALYSIS")
print("="*80)

# Group tasks by how many strategies got them right
task_difficulty = df.groupby('task_id')['accuracy'].sum()

print(f"\n Tasks by difficulty (# of strategies that got it right):")
print(f" {'─'*78}")

for num_correct in sorted(task_difficulty.unique(), reverse=True):
    count = (task_difficulty == num_correct).sum()
    difficulty = "🟢 Easy" if num_correct == len(strategies) else \
                 "🟡 Medium" if num_correct >= len(strategies) / 2 else \
                 "🔴 Hard"
    print(f"   {difficulty} - {int(num_correct)}/{len(strategies)} strategies correct: {count} tasks")

# Show which tasks only some strategies got right (multi-agent value!)
print(f"\n\n{'='*80}")
print("💡 TASKS WHERE MULTI-AGENT STRATEGIES HELPED")
print("="*80)

single_results = df[df['strategy'] == 'single'].set_index('task_id')['accuracy']
other_results = df[df['strategy'] != 'single'].groupby('task_id')['accuracy'].max()

helped_tasks = []
for task_id in single_results.index:
    if single_results[task_id] == 0 and other_results.get(task_id, 0) == 1:
        helped_tasks.append(task_id)

if helped_tasks:
    print(f"\n✅ Found {len(helped_tasks)} tasks where multi-agent strategies succeeded but single model failed:")
    for task_id in helped_tasks:
        task = [t for t in tasks if t.id == task_id][0]
        # Find which strategies got it right
        right_strategies = df[(df['task_id'] == task_id) & (df['accuracy'] == 1.0)]['strategy'].tolist()
        print(f"\n  📝 {task.input[:100]}{'...' if len(task.input) > 100 else ''}")
        print(f"     ✓ Solved by: {', '.join(right_strategies)}")
else:
    print("\n⚠️  No tasks found where multi-agent strategies outperformed single model")

# Show tasks that stumped everyone
all_failed = df.groupby('task_id')['accuracy'].max()
stumped_tasks = all_failed[all_failed == 0].index.tolist()

if stumped_tasks:
    print(f"\n\n{'='*80}")
    print(f"🔴 TASKS THAT STUMPED ALL STRATEGIES ({len(stumped_tasks)} tasks)")
    print("="*80)
    for task_id in stumped_tasks:
        task = [t for t in tasks if t.id == task_id][0]
        print(f"\n  ❌ {task.input[:150]}{'...' if len(task.input) > 150 else ''}")
        print(f"     Expected: {task.expected}")
else:
    print("\n\n✅ At least one strategy solved every task!")

print("\n" + "="*80)

## Visualizations

In [None]:
# Create comparison plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Color mapping for strategies
color_map = {
    'single': '#1f77b4',  # Blue
    'design_critique': '#ff7f0e',  # Orange
    'interdisciplinary_team': '#2ca02c',  # Green
    'adaptive_team': '#d62728'  # Red (NEW!)
}

# 1. Accuracy by strategy
ax = axes[0, 0]
strategy_acc = df.groupby('strategy')['accuracy'].mean().sort_values(ascending=True)
colors = [color_map.get(s, 'gray') for s in strategy_acc.index]  # 'gray' fallback for unmapped strategies
strategy_acc.plot(kind='barh', ax=ax, color=colors)
ax.set_title(f'Accuracy on {BENCHMARK_NAME}', fontsize=14, weight='bold')
ax.set_xlabel('Accuracy')
ax.set_xlim(0, 1.0)
ax.grid(True, alpha=0.3)

# 2. Latency by strategy
ax = axes[0, 1]
strategy_latency = df.groupby('strategy')['latency_s'].mean().sort_values()
colors = [color_map.get(s, 'gray') for s in strategy_latency.index]  # 'gray' fallback for unmapped strategies
strategy_latency.plot(kind='barh', ax=ax, color=colors)
ax.set_title('Average Latency', fontsize=14, weight='bold')
ax.set_xlabel('Seconds')
ax.grid(True, alpha=0.3)

# 3. Cost by strategy
ax = axes[1, 0]
strategy_cost = df.groupby('strategy')['cost_usd'].sum().sort_values()
colors = [color_map.get(s, 'gray') for s in strategy_cost.index]  # 'gray' fallback for unmapped strategies
strategy_cost.plot(kind='barh', ax=ax, color=colors)
ax.set_title('Total Cost', fontsize=14, weight='bold')
ax.set_xlabel('USD')
ax.grid(True, alpha=0.3)

# 4. Accuracy vs Latency scatter
ax = axes[1, 1]
strategy_stats = df.groupby('strategy').agg({
    'accuracy': 'mean',
    'latency_s': 'mean'
})
for strategy, row in strategy_stats.iterrows():
    ax.scatter(row['latency_s'], row['accuracy'], s=300, alpha=0.6, 
              color=color_map.get(strategy, 'gray'))
    ax.annotate(strategy, (row['latency_s'], row['accuracy']), 
                xytext=(5, 5), textcoords='offset points', fontsize=9)
ax.set_xlabel('Avg Latency (s)', fontsize=11)
ax.set_ylabel('Accuracy', fontsize=11)
ax.set_title('Accuracy vs Latency Tradeoff', fontsize=14, weight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(f'../experiments/custom_strategies_{BENCHMARK_NAME}_comparison.png', 
           dpi=150, bbox_inches='tight')
plt.show()

print(f"\n💾 Saved plot to: ../experiments/custom_strategies_{BENCHMARK_NAME}_comparison.png")

## Compare to Published Baselines

In [None]:
# Compare your results to published baselines
baselines = get_baseline_scores(BENCHMARK_NAME)

if baselines:
    # Get your best strategy
    your_best = df.groupby('strategy')['accuracy'].mean().max()
    your_best_strategy = df.groupby('strategy')['accuracy'].mean().idxmax()
    
    print(f"\n📊 Comparison to Published Baselines on {BENCHMARK_NAME}:")
    print(f"\nYour best: {your_best_strategy} = {your_best:.1%}\n")
    
    # Create comparison DataFrame
    comparison_data = []
    
    # Add top 10 baselines
    for model, score in sorted(baselines.items(), key=lambda x: x[1], reverse=True)[:10]:
        comparison_data.append({
            'Model/Strategy': model,
            'Accuracy': score,
            'Type': 'Published Baseline'
        })
    
    # Add your results
    for strategy, score in df.groupby('strategy')['accuracy'].mean().items():
        comparison_data.append({
            'Model/Strategy': f"{MODEL} ({strategy})",
            'Accuracy': score,
            'Type': 'Your Results'
        })
    
    comparison_df = pd.DataFrame(comparison_data).sort_values('Accuracy', ascending=False)
    
    # Plot comparison
    plt.figure(figsize=(12, 8))
    colors = ['steelblue' if t == 'Published Baseline' else 'coral' 
              for t in comparison_df['Type']]
    
    plt.barh(range(len(comparison_df)), comparison_df['Accuracy'], color=colors)
    plt.yticks(range(len(comparison_df)), comparison_df['Model/Strategy'])
    plt.xlabel('Accuracy', fontsize=12)
    plt.title(f'{BENCHMARK_NAME} - Your Results vs Published Baselines (Jan 2025)', 
              fontsize=14, weight='bold')
    plt.xlim(0, 1.0)
    
    # Add legend
    from matplotlib.patches import Patch
    legend_elements = [
        Patch(facecolor='steelblue', label='Published Baseline (Jan 2025)'),
        Patch(facecolor='coral', label='Your Results')
    ]
    plt.legend(handles=legend_elements, loc='lower right')
    
    plt.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Key Insights:")
    print(f"   - Your best strategy: {your_best_strategy} ({your_best:.1%})")
    
    # Find closest baseline
    baseline_scores = list(baselines.values())
    closest_baseline = min(baseline_scores, key=lambda x: abs(x - your_best))
    closest_model = [k for k, v in baselines.items() if v == closest_baseline][0]
    
    print(f"   - Closest published baseline: {closest_model} ({closest_baseline:.1%})")
    
    if your_best > closest_baseline:
        print(f"   - ✅ You're {(your_best - closest_baseline):.1%} better!")
    else:
        print(f"   - Room for improvement: {(closest_baseline - your_best):.1%} gap")
    
    # Compare to top baseline
    top_baseline = max(baseline_scores)
    top_model = [k for k, v in baselines.items() if v == top_baseline][0]
    print(f"   - Top baseline: {top_model} ({top_baseline:.1%})")
    print(f"   - Gap to SOTA: {(top_baseline - your_best):.1%}")
else:
    print("No published baselines available for this benchmark.")

## Key Findings

### Summary

[Fill in after running]

- **Best strategy:** _____ (___% accuracy)
- **Single-model baseline:** ___% accuracy
- **Design Critique:** ___% accuracy (___% vs baseline)
- **Interdisciplinary Team:** ___% accuracy (___% vs baseline)
- **Adaptive Team:** ___% accuracy (___% vs baseline)
- **Latency overhead:** Design Critique ___x, Interdisciplinary Team ___x, Adaptive Team ___x
- **Cost overhead:** Design Critique $_____, Interdisciplinary Team $_____, Adaptive Team $_____

### Insights

- **When did custom strategies help?**
- **Was the cost/latency tradeoff worth it?**
- **How do we compare to published SOTA?**
- **Which tasks benefited most from multi-agent approaches?**
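
When answering these questions, keep in mind that accuracy measured on 10 to 20 tasks is noisy, so small gaps between strategies or against published baselines may not be meaningful. One common way to express that uncertainty is a Wilson score interval for a proportion. The helper below is a sketch using only the standard library; `wilson_interval` is a name chosen for this example, and `z=1.96` assumes a 95% confidence level.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a success proportion (default ~95%)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    z2 = z * z
    denom = 1 + z2 / n_total
    center = (p + z2 / (2 * n_total)) / denom
    half = z * math.sqrt(p * (1 - p) / n_total + z2 / (4 * n_total ** 2)) / denom
    return center - half, center + half


low, high = wilson_interval(7, 10)
print(f"7/10 correct: {low:.1%} to {high:.1%}")
```

With 7 of 10 tasks correct, the interval spans roughly 40% to 89%, which is why increasing `NUM_TASKS` matters before drawing conclusions.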

### Next Steps

1. Test on different benchmarks
2. Tune agent personas for specific domains
3. Adjust n_iterations / refinement_rounds
4. Try with larger models
5. Analyze failure cases

## Error Analysis

Examine specific failures to understand where strategies struggle
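
Comparing the single model against the best of several multi-agent strategies can only show gains, never regressions. A paired, per-strategy comparison gives a fairer picture: for each task, did the candidate fix a baseline failure, or break a baseline success? The sketch below assumes per-task accuracies are available as `{task_id: accuracy}` dictionaries; `paired_outcomes` and the sample values are hypothetical.

```python
def paired_outcomes(baseline, candidate):
    """Compare two {task_id: accuracy} mappings on their shared tasks.

    Returns (gained, lost): tasks the candidate solved that the baseline
    missed, and tasks the baseline solved that the candidate missed.
    """
    shared = baseline.keys() & candidate.keys()
    gained = sorted(t for t in shared if baseline[t] < 1 and candidate[t] >= 1)
    lost = sorted(t for t in shared if baseline[t] >= 1 and candidate[t] < 1)
    return gained, lost


# Illustrative accuracies only.
single_acc = {"t1": 1.0, "t2": 0.0, "t3": 1.0, "t4": 0.0}
adaptive_acc = {"t1": 1.0, "t2": 1.0, "t3": 0.0, "t4": 0.0}

gained, lost = paired_outcomes(single_acc, adaptive_acc)
print(f"Gained: {gained}, Lost: {lost}")
```

If a strategy gains about as many tasks as it loses, its apparent advantage on individual questions may just be run-to-run variation.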

In [None]:
# Find tasks where all strategies failed
task_accuracy = df.groupby('task_id')['accuracy'].mean()
hard_tasks = task_accuracy[task_accuracy == 0].index.tolist()

if hard_tasks:
    print(f"\n❌ Tasks where all strategies failed: {len(hard_tasks)}")
    print(f"\nSample hard task:")
    hard_task = [t for t in tasks if t.id == hard_tasks[0]][0]
    print(f"  {hard_task.input[:200]}...")
    print(f"  Expected: {hard_task.expected}")
else:
    print("\n✅ At least one strategy got each task correct!")

# Find tasks where custom strategies helped most
single_results = df[df['strategy'] == 'single'].set_index('task_id')['accuracy']
multi_results = df[df['strategy'] != 'single'].groupby('task_id')['accuracy'].max()

improvement = multi_results - single_results
best_improvements = improvement.nlargest(3)

if len(best_improvements) > 0 and best_improvements.max() > 0:
    print(f"\n✅ Tasks where custom strategies helped most:")
    for task_id, improvement_val in best_improvements.items():
        if improvement_val > 0:
            task = [t for t in tasks if t.id == task_id][0]
            print(f"\n  Task: {task.input[:100]}...")
            print(f"  Single: {single_results[task_id]:.0%} → Custom: {multi_results[task_id]:.0%}")
            print(f"  Improvement: +{improvement_val:.0%}")
else:
    print("\n⚠️  Custom strategies didn't outperform single model on any tasks")