# Multi-Agent Experiments: Debate Strategy

**Experiment:** Compare debate-based multi-agent vs single-model baselines  
**Date:** [Fill in]  
**Author:** Leif Haven Martinson  

## Goals
- Implement 2-agent debate with judge
- Compare debate vs single-model baseline
- Measure quality, latency, and cost tradeoffs
- Identify when debate helps vs hurts

## Hypotheses
- Debate improves accuracy on tasks with multiple valid perspectives
- Debate adds 2-3x latency but may justify cost with quality gains
- Judge quality matters more than debater quality
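
The 2-3x latency hypothesis follows from counting model calls. The sketch below is a back-of-envelope estimate, not a measurement. It assumes every LLM call takes roughly the same time, and `debate_latency_multiplier` is a hypothetical helper that is not part of the harness.

```python
def debate_latency_multiplier(n_debaters: int, parallel_debaters: bool) -> float:
    """Estimate debate latency relative to one single-model call.

    ASSUMPTION: each LLM call takes the same time. In practice the judge
    reads a longer prompt than the debaters, so this is only a rough bound.
    """
    # Debaters running concurrently cost one "round"; sequential ones cost n rounds.
    debater_rounds = 1 if parallel_debaters else n_debaters
    judge_calls = 1
    return float(debater_rounds + judge_calls)


sequential = debate_latency_multiplier(2, parallel_debaters=False)
parallel = debate_latency_multiplier(2, parallel_debaters=True)
print(f"Sequential debaters: {sequential:.0f}x, parallel debaters: {parallel:.0f}x")
```

With two debaters, the estimate is 3x when debaters run one after another and 2x when they run concurrently, which brackets the hypothesized range.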


In [None]:
# 1. Add repository root to path to import harness
import sys
sys.path.append('../../../')  # Go up to repo root from notebooks/

# 2. Import harness functions for running strategies and tracking
from harness import (
    llm_call,              # Single LLM call wrapper
    run_strategy,          # Run any strategy (debate, single, etc.)
    debate_strategy,       # Debate-specific strategy function
    ExperimentConfig,      # Experiment configuration structure
    ExperimentResult,      # Result logging structure
    get_tracker,           # Get experiment tracker instance
    evaluate_task          # Evaluate task results
)
from harness.defaults import DEFAULT_MODEL, DEFAULT_PROVIDER

# 3. Import data analysis and visualization libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 4. Enable inline plotting in notebook
%matplotlib inline

# 5. Set visual style for plots
sns.set_style('whitegrid')

# 6. Show current configuration
print("="*70)
print("🔧 DEBATE NOTEBOOK CONFIGURATION")
print("="*70)
print(f"📍 Provider: {DEFAULT_PROVIDER}")
print(f"🤖 Model: {DEFAULT_MODEL or '(default for provider)'}")
print("="*70)
print("\n💡 TO CHANGE: Edit cell 7 (Configure Debate Models section)")
print("   Set PROVIDER and MODEL variables there")
print("="*70)
print("\n✅ Setup complete")

## Debate Strategy Implementation

Two agents debate, then a judge decides the best answer.
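
The flow described above can be sketched in a few lines. This is an illustration of the general pattern, not the harness's actual `debate_strategy` implementation. `sketch_debate`, `fake_llm`, and the prompt wording are all hypothetical, and the stub stands in for a real model call so the example runs offline.

```python
from typing import Callable, List


def sketch_debate(question: str, ask: Callable[[str], str], n_debaters: int = 2) -> dict:
    """Collect one argument per debater, then ask a judge for the final answer."""
    arguments: List[str] = []
    for i in range(n_debaters):
        prompt = f"You are debater {i + 1}. Answer and justify: {question}"
        arguments.append(ask(prompt))

    # The judge sees the question plus every debater's argument.
    numbered = "\n".join(f"Debater {i + 1}: {a}" for i, a in enumerate(arguments))
    judge_prompt = (
        f"Question: {question}\n{numbered}\n"
        "As the judge, give the best final answer."
    )
    return {"output": ask(judge_prompt), "arguments": arguments}


# Hypothetical stand-in for a real LLM call.
def fake_llm(prompt: str) -> str:
    return "4" if "judge" in prompt else "2+2 equals 4"


demo = sketch_debate("What is 2+2?", fake_llm)
print(demo["output"], len(demo["arguments"]))
```

Passing the model call in as `ask` keeps the control flow separate from the provider, so the same sketch works with any callable that maps a prompt to a string.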

In [None]:
# 1. Test the debate strategy with a simple question
# The harness already has a debate strategy built in
test_result = run_strategy(
    "debate",                   # Use debate strategy
    "What is 2+2?",             # Simple test question
    n_debaters=2,               # Use 2 debating agents
    provider=DEFAULT_PROVIDER,  # Provider from harness defaults
    model=DEFAULT_MODEL         # Model from harness defaults
)

# 2. Print the final answer from the judge
print(f"Debate result: {test_result.output}")

# 3. Print execution time
print(f"Latency: {test_result.latency_s:.2f}s")

# 4. Print metadata about the debate
print(f"\nNumber of debaters: {test_result.metadata['n_debaters']}")

# 5. Confirm system is ready
print("\nDebate system ready!")

## Load Tasks from Baseline

Use the same tasks for fair comparison.

In [None]:
# 1. Define the same reasoning tasks used in baseline experiments for fair comparison
reasoning_tasks = [
    {
        "id": "logic_01",
        "category": "logical_reasoning",
        "input": "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
        "expected": "No, this doesn't follow logically."  # Tests understanding of logical syllogisms
    },
    {
        "id": "math_01",
        "category": "arithmetic",
        "input": "A train travels 120 km in 2 hours, then 180 km in 3 hours. What is its average speed for the entire journey?",
        "expected": "60 km/h"  # Tests arithmetic reasoning: total distance / total time
    },
    {
        "id": "reasoning_01",
        "category": "causal_reasoning",
        "input": "Studies show that people who drink coffee tend to live longer. Does this mean coffee causes longevity?",
        "expected": "No, correlation doesn't imply causation."  # Tests causal reasoning understanding
    },
    {
        "id": "planning_01",
        "category": "planning",
        "input": "You need to be at a meeting 30 km away at 2 PM. Traffic is heavy (20 km/h). It's now 1:15 PM. Can you make it on time?",
        "expected": "No. Travel time = 1.5 hours."  # Tests planning and time calculation
    },
    {
        "id": "pattern_01",
        "category": "pattern_recognition",
        "input": "What comes next in this sequence: 2, 6, 12, 20, 30, ?",
        "expected": "42"  # Tests pattern recognition: differences of 4, 6, 8, 10, 12...
    }
]

# 2. Confirm tasks are loaded
print(f"Loaded {len(reasoning_tasks)} tasks")
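
The expected answers for the three numeric tasks can be checked with plain arithmetic. The variable names below are illustrative only and are not used elsewhere in the notebook.

```python
# math_01: average speed = total distance / total time
avg_speed_kmh = (120 + 180) / (2 + 3)

# planning_01: travel time at 20 km/h vs. time available (1:15 PM to 2:00 PM)
travel_time_h = 30 / 20
available_h = 45 / 60
can_make_it = travel_time_h <= available_h

# pattern_01: the terms follow n * (n + 1), so the 6th term is 6 * 7
sequence = [n * (n + 1) for n in range(1, 7)]

print(avg_speed_kmh, travel_time_h, can_make_it, sequence)
```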

## Configure Debate Models

Start with same model for debaters and judge.

In [None]:
# ==================== CONFIGURE MODEL HERE ====================
# 1. Import default model/provider settings from harness
from harness.defaults import DEFAULT_MODEL, DEFAULT_PROVIDER

# 2. SET YOUR PROVIDER AND MODEL HERE:
# Uncomment and edit these lines to override:
# PROVIDER = "mlx"  # Options: "mlx", "ollama", "anthropic", "openai"
# MODEL = "mlx-community/Llama-3.2-3B-Instruct-4bit"  # Your model name

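# Otherwise fall back to the harness defaults (each variable independently,
# so overriding only PROVIDER does not leave MODEL undefined):
if "PROVIDER" not in globals():
    PROVIDER = DEFAULT_PROVIDER
if "MODEL" not in globals():
    MODEL = DEFAULT_MODEL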

# 3. Configure debate parameters
NUM_DEBATERS = 2        # Number of agents that will debate (2+ required)

# 4. Print configuration for verification
print("="*70)
print("🎯 DEBATE CONFIGURATION")
print("="*70)
print(f"Strategy: {NUM_DEBATERS} debaters + 1 judge")
print(f"Provider: {PROVIDER}")
print(f"Model: {MODEL or '(default for provider)'}")
print("="*70)

## Run Debate Experiments

In [None]:
# 1. Configure experiment tracking for debate runs
config = ExperimentConfig(
    experiment_name=f"debate_{NUM_DEBATERS}agents",  # Descriptive experiment name
    task_type="reasoning",                            # Type of tasks being run
    strategy="debate",                                # Strategy being tested
    provider=PROVIDER,                                # Provider used
    model=MODEL,                                      # Model used
    n_agents=NUM_DEBATERS,                           # Number of debate agents
    notes=f"{NUM_DEBATERS} debaters with judge"     # Additional context
)

# 2. Initialize tracker and start experiment
tracker = get_tracker()
tracker.start_experiment(config)

# 3. Run debate on each task and track results
for task in reasoning_tasks:
    print(f"\n{'='*60}")
    print(f"Running debate on: {task['id']}")
    print(f"{'='*60}")
    
    # 3a. Run debate strategy on this task
    result = run_strategy(
        "debate",          # Strategy name
        task['input'],     # Task input text
        n_debaters=NUM_DEBATERS,  # Number of debaters
        provider=PROVIDER,  # Provider to use
        model=MODEL        # Model to use
    )
    
    # 3b. Display arguments from each debater
    for j, arg in enumerate(result.metadata['arguments']):
        print(f"\nDebater {j+1}: {arg[:100]}...")  # First 100 chars
    
    # 3c. Display judge's final verdict
    print(f"\nJudge verdict: {result.output[:100]}...")
    print(f"Total latency: {result.latency_s:.2f}s")
    
    # 3d. Create experiment result record
    exp_result = ExperimentResult(
        config=config,
        task_input=task['input'],
        output=result.output,
        latency_s=result.latency_s,
        tokens_in=result.tokens_in,
        tokens_out=result.tokens_out,
        cost_usd=result.cost_usd,
        eval_metadata={
            'task_id': task['id'],
            'category': task['category'],
            'expected': task['expected'],
            'arguments': result.metadata['arguments']
        }
    )
    
    # 3e. Evaluate the result against expected answer
    exp_result.eval_scores = evaluate_task(task, result.output)
    
    # 3f. Log result to tracker
    tracker.log_result(exp_result)
    print("✓ Logged")

# 4. Finish experiment and get summary
summary = tracker.finish_experiment()
print("\n" + "="*60)
print("Debate experiment complete!")
print(f"Saved to: {tracker.current_run_dir}")

## Compare: Single vs Debate

Load baseline and debate results for comparison.

In [None]:
# Load and compare experiments
# Note: You'll need to run the baseline experiment first (01_baseline_experiments.ipynb)
# Then update the paths below with your actual experiment directories

from pathlib import Path

# List available experiments
exp_dir = Path("../experiments")
if exp_dir.exists():
    experiments = sorted([d.name for d in exp_dir.iterdir() if d.is_dir()])
    print("Available experiments:")
    for exp in experiments:
        print(f"  - {exp}")
else:
    print("No experiments found yet. Run baseline experiments first!")
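
The comparison cells that follow expect two dicts, `baseline_data` and `debate_data`, each with `summary` and `runs` keys. A minimal loading sketch is shown here. The file names `summary.json` and `runs.json` are assumptions about what the tracker writes, and `load_experiment` is a hypothetical helper, so adjust both to the actual layout of your experiment directories. The dummy files exist only so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_experiment(run_dir: Path) -> dict:
    """Load one run directory into {'summary': ..., 'runs': [...]}.

    ASSUMPTION: the directory holds `summary.json` and `runs.json`.
    These names are hypothetical, not confirmed by the harness.
    """
    run_dir = Path(run_dir)
    with open(run_dir / "summary.json") as f:
        summary = json.load(f)
    with open(run_dir / "runs.json") as f:
        runs = json.load(f)
    return {"summary": summary, "runs": runs}


# Demo with dummy data in a temporary directory (not real results).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    dummy_summary = {"num_runs": 1, "stats": {"avg_latency_ms": 1200.0, "total_cost": 0.0}}
    (demo_dir / "summary.json").write_text(json.dumps(dummy_summary))
    (demo_dir / "runs.json").write_text(json.dumps([{"output": "4"}]))
    demo_data = load_experiment(demo_dir)

print(demo_data["summary"]["num_runs"], len(demo_data["runs"]))
```

In the notebook, the same helper would be called once for the baseline run directory and once for the debate run directory, using the names printed by the listing cell.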

In [None]:
# Comparison metrics
# Requires baseline_data and debate_data: dicts loaded from the saved
# experiment runs, each with 'summary' and 'runs' keys.
baseline_stats = baseline_data['summary']['stats']
debate_stats = debate_data['summary']['stats']

comparison_df = pd.DataFrame([
    {
        'Strategy': 'Single Model',
        'Avg Latency (ms)': baseline_stats['avg_latency_ms'],
        'Total Cost ($)': baseline_stats['total_cost'],
        'Runs': baseline_data['summary']['num_runs']
    },
    {
        'Strategy': f'Debate ({NUM_DEBATERS} agents)',
        'Avg Latency (ms)': debate_stats['avg_latency_ms'],
        'Total Cost ($)': debate_stats['total_cost'],
        'Runs': debate_data['summary']['num_runs']
    }
])

# Calculate overhead
latency_overhead = debate_stats['avg_latency_ms'] / baseline_stats['avg_latency_ms']

print("\nPerformance Comparison:")
print(comparison_df.to_string(index=False))
print("\nDebate Overhead:")
print(f"  Latency: {latency_overhead:.1f}x slower")

# Local providers can report zero cost, so guard against division by zero
if baseline_stats['total_cost'] > 0:
    cost_overhead = debate_stats['total_cost'] / baseline_stats['total_cost']
    print(f"  Cost: {cost_overhead:.1f}x more expensive")
else:
    print("  Cost: n/a (baseline cost is zero)")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Latency comparison
ax = axes[0]
comparison_df.plot.bar(x='Strategy', y='Avg Latency (ms)', ax=ax, legend=False, color=['steelblue', 'coral'])
ax.set_title('Latency: Single vs Debate', fontsize=14, weight='bold')
ax.set_ylabel('Milliseconds')
ax.set_xlabel('')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

# Cost comparison
ax = axes[1]
comparison_df.plot.bar(x='Strategy', y='Total Cost ($)', ax=ax, legend=False, color=['steelblue', 'coral'])
ax.set_title('Cost: Single vs Debate', fontsize=14, weight='bold')
ax.set_ylabel('USD')
ax.set_xlabel('')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

## Qualitative Analysis

Compare actual outputs to assess quality differences.

In [None]:
def compare_outputs(task_id: str):
    """Show baseline vs debate outputs side by side."""
    print(f"\n{'='*80}")
    print(f"Task: {task_id}")
    print(f"{'='*80}\n")
    
    # Get task
    task = next(t for t in reasoning_tasks if t['id'] == task_id)
    print(f"Input: {task['input']}\n")
    print(f"Expected: {task['expected']}\n")
    print("-" * 80)
    
    # Get baseline
    baseline_run = next(r for r in baseline_data['runs'] if r['metadata']['task_id'] == task_id)
    print("\nBASELINE (Single Model):")
    print(f"Output: {baseline_run['output']}")
    print(f"Latency: {baseline_run['latency_ms']:.0f}ms")
    
    # Get debate
    debate_run = next(r for r in debate_data['runs'] if r['metadata']['task_id'] == task_id)
    print(f"\nDEBATE ({NUM_DEBATERS} agents):")
    for i, arg in enumerate(debate_run['metadata']['arguments']):
        print(f"  Debater {i+1}: {arg[:80]}...")
    print(f"\nJudge Verdict: {debate_run['output']}")
    print(f"Latency: {debate_run['latency_ms']:.0f}ms")
    print("-" * 80)

# Compare first task
compare_outputs(reasoning_tasks[0]['id'])

In [None]:
# Browse all tasks
for task in reasoning_tasks:
    compare_outputs(task['id'])

## Key Findings

### Quantitative
- [Fill in after running]
- Debate adds Xx latency overhead
- Cost increased by X%

### Qualitative
- [Observations on quality differences]
- When did debate help?
- When did debate hurt?

### Hypotheses
- [ ] Debate improves accuracy on multi-perspective tasks
- [ ] Latency overhead is 2-3x
- [ ] Judge quality matters more than debaters

## Next Experiments
1. Test with larger judge model (13B or 30B)
2. Add 3rd debater
3. Try specialized debater roles
4. Test on creative/open-ended tasks
