# Multi-Agent Experiments: Debate Strategy

**Experiment:** Compare a debate-based multi-agent strategy against a single-model baseline  
**Date:** [Fill in]  
**Author:** Leif Haven Martinson  

## Goals
- Implement 2-agent debate with judge
- Compare debate vs single-model baseline
- Measure quality, latency, and cost tradeoffs
- Identify when debate helps vs hurts

## Hypotheses
- Debate improves accuracy on tasks with multiple valid perspectives
- Debate runs at roughly 2-3x the baseline latency, which may be justified if quality improves
- Judge quality matters more than debater quality
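
The 2-3x latency guess can be sanity-checked with a back-of-envelope estimate. The sketch below assumes every debater call takes about as long as one baseline call and that the judge call costs `judge_ratio` baseline calls; both are simplifying assumptions (the judge reads a longer prompt, so its real share is probably above 1). `expected_latency_ratio` and its parameters are hypothetical names introduced only for this estimate.

```python
def expected_latency_ratio(num_debaters: int,
                           judge_ratio: float = 1.0,
                           parallel_debaters: bool = False) -> float:
    """Estimate debate latency as a multiple of one baseline call.

    Assumes each debater call costs one baseline call and the judge call
    costs `judge_ratio` baseline calls.
    """
    if num_debaters < 1:
        raise ValueError("num_debaters must be at least 1")
    # Sequential debaters add up; parallel debaters cost as much as one call.
    debater_share = 1.0 if parallel_debaters else float(num_debaters)
    return debater_share + judge_ratio


print(expected_latency_ratio(2))                          # debaters run one after another
print(expected_latency_ratio(2, parallel_debaters=True))  # debaters run concurrently
```

Under these assumptions, two sequential debaters plus a judge come to 3x the baseline, and running the debaters concurrently would bring that down to 2x.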


In [None]:
# Setup
import sys
sys.path.append('../code')

from llm_providers import create_llm
from experiment_logger import experiment, RunResult

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

print("✅ Setup complete")

## Debate Strategy Implementation

Each debater answers the question independently (there is no rebuttal round), then a judge reads all the arguments and produces the final answer.

In [None]:
class DebateSystem:
    """Multi-agent debate with judge."""
    
    def __init__(self, debater_model: str, judge_model: str, num_debaters: int = 2):
        self.num_debaters = num_debaters
        
        # Create debaters (all using same model for now)
        self.debaters = [
            create_llm("mlx", debater_model, temperature=0.8)
            for _ in range(num_debaters)
        ]
        
        # Create judge (potentially different/larger model)
        self.judge = create_llm("mlx", judge_model, temperature=0.3)
    
    def run_debate(self, task_input: str) -> dict:
        """
        Run a debate on the task.
        
        Returns:
            {
                'arguments': list of debater responses,
                'verdict': judge's final answer,
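                'total_latency_ms': summed latency of all debater and judge calls,
                'cost_estimate': summed cost of all calls (0 when none is reported),
                'num_debaters': number of debaters used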
            }
        """
        # Phase 1: Debaters generate arguments
        arguments = []
        total_latency = 0
        total_cost = 0
        
        for i, debater in enumerate(self.debaters):
            prompt = f"""
You are Debater {i+1} in a debate. Provide your best answer to this question.

Question: {task_input}

Your argument:"""
            
            result = debater.generate(prompt)
            arguments.append(result['text'])
            total_latency += result['latency_ms']
            total_cost += result.get('cost_estimate', 0)
        
        # Phase 2: Judge evaluates arguments
        judge_prompt = f"""
You are a judge evaluating different arguments to a question.

Question: {task_input}

Arguments:
"""
        for i, arg in enumerate(arguments):
            judge_prompt += f"\n\nDebater {i+1}:\n{arg}"
        
        judge_prompt += """\n\n
Based on these arguments, provide the best final answer to the question.
Consider accuracy, reasoning quality, and completeness.

Final answer:"""
        
        verdict_result = self.judge.generate(judge_prompt)
        total_latency += verdict_result['latency_ms']
        total_cost += verdict_result.get('cost_estimate', 0)
        
        return {
            'arguments': arguments,
            'verdict': verdict_result['text'],
            'total_latency_ms': total_latency,
            'cost_estimate': total_cost,
            'num_debaters': self.num_debaters
        }

print("Debate system ready")

## Load Tasks from Baseline

Use the same tasks as the baseline experiment for a fair comparison.

In [None]:
# Same tasks as baseline experiment
reasoning_tasks = [
    {
        "id": "logic_01",
        "category": "logical_reasoning",
        "input": "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
        "expected": "No, this doesn't follow logically."
    },
    {
        "id": "math_01",
        "category": "arithmetic",
        "input": "A train travels 120 km in 2 hours, then 180 km in 3 hours. What is its average speed for the entire journey?",
        "expected": "60 km/h"
    },
    {
        "id": "reasoning_01",
        "category": "causal_reasoning",
        "input": "Studies show that people who drink coffee tend to live longer. Does this mean coffee causes longevity?",
        "expected": "No, correlation doesn't imply causation."
    },
    {
        "id": "planning_01",
        "category": "planning",
        "input": "You need to be at a meeting 30 km away at 2 PM. Traffic is heavy (20 km/h). It's now 1:15 PM. Can you make it on time?",
        "expected": "No. Travel time = 1.5 hours."
    },
    {
        "id": "pattern_01",
        "category": "pattern_recognition",
        "input": "What comes next in this sequence: 2, 6, 12, 20, 30, ?",
        "expected": "42"
    }
]

print(f"Loaded {len(reasoning_tasks)} tasks")

## Configure Debate Models

Start with the same model for the debaters and the judge.

In [None]:
# Configuration
# Llama 3.2 instruct text models are published in 1B and 3B sizes
DEBATER_MODEL = "mlx-community/Llama-3.2-3B-Instruct-4bit"
JUDGE_MODEL = "mlx-community/Llama-3.2-3B-Instruct-4bit"  # Try larger model later
NUM_DEBATERS = 2

# Create debate system
debate_system = DebateSystem(
    debater_model=DEBATER_MODEL,
    judge_model=JUDGE_MODEL,
    num_debaters=NUM_DEBATERS
)

print(f"Debate system: {NUM_DEBATERS} debaters + 1 judge")

## Run Debate Experiments

In [None]:
# Run debate experiment
model_config = {
    "debater": DEBATER_MODEL,
    "judge": JUDGE_MODEL,
    "num_debaters": NUM_DEBATERS
}

with experiment(
    name=f"debate_{NUM_DEBATERS}agents",
    task="reasoning_suite",
    strategy="debate",
    model_config=model_config,
    notes=f"{NUM_DEBATERS} debaters with judge"
) as logger:
    
    for i, task in enumerate(reasoning_tasks):
        print(f"\n{'='*60}")
        print(f"Running debate on: {task['id']}")
        print(f"{'='*60}")
        
        # Run debate
        result = debate_system.run_debate(task['input'])
        
        # Display arguments
        for j, arg in enumerate(result['arguments']):
            print(f"\nDebater {j+1}: {arg[:100]}...")
        
        print(f"\nJudge verdict: {result['verdict'][:100]}...")
        print(f"Total latency: {result['total_latency_ms']:.0f}ms")
        
        # Log result
        run = RunResult(
            run_id=i,
            input=task['input'],
            output=result['verdict'],
            latency_ms=result['total_latency_ms'],
            tokens=None,
            cost_estimate=result['cost_estimate'],
            metadata={
                'task_id': task['id'],
                'category': task['category'],
                'expected': task['expected'],
                'arguments': result['arguments'],
                'num_debaters': result['num_debaters']
            }
        )
        logger.log_run(run)
        print("✓ Logged")

print("\n" + "="*60)
print("Debate experiment complete!")

## Compare: Single vs Debate

Load baseline and debate results for comparison.

In [None]:
from experiment_logger import ExperimentComparison

comp = ExperimentComparison()

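# Pick the first matching experiment for each strategy.
# This is the most recent one only if list_experiments() returns newest first.
all_exps = comp.list_experiments()
baseline_exp = next((e for e in all_exps if e['strategy'] == 'single'), None)
debate_exp = next((e for e in all_exps if e['strategy'] == 'debate'), None)
if baseline_exp is None or debate_exp is None:
    raise RuntimeError("Run the baseline and debate experiments before comparing them")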

# Load data
baseline_data = comp.load_experiment(baseline_exp['id'])
debate_data = comp.load_experiment(debate_exp['id'])

print("Loaded experiments:")
print(f"  Baseline: {baseline_exp['name']}")
print(f"  Debate: {debate_exp['name']}")

In [None]:
# Comparison metrics
comparison_df = pd.DataFrame([
    {
        'Strategy': 'Single Model',
        'Avg Latency (ms)': baseline_data['summary']['stats']['avg_latency_ms'],
        'Total Cost ($)': baseline_data['summary']['stats']['total_cost'],
        'Runs': baseline_data['summary']['num_runs']
    },
    {
        'Strategy': f'Debate ({NUM_DEBATERS} agents)',
        'Avg Latency (ms)': debate_data['summary']['stats']['avg_latency_ms'],
        'Total Cost ($)': debate_data['summary']['stats']['total_cost'],
        'Runs': debate_data['summary']['num_runs']
    }
])

# Calculate overhead as debate / baseline ratios
latency_overhead = (debate_data['summary']['stats']['avg_latency_ms'] /
                   baseline_data['summary']['stats']['avg_latency_ms'])
# Local models may report zero cost, so guard the division
baseline_cost = baseline_data['summary']['stats']['total_cost']
cost_overhead = (debate_data['summary']['stats']['total_cost'] / baseline_cost
                 if baseline_cost else float('nan'))

print("\nPerformance Comparison:")
print(comparison_df.to_string(index=False))
print("\nDebate Overhead:")
print(f"  Latency: {latency_overhead:.1f}x baseline")
print(f"  Cost: {cost_overhead:.1f}x baseline")
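
The averages above can hide tasks where debate is unusually slow. A minimal sketch of a per-task breakdown follows, assuming each run is a dict with `latency_ms` and `metadata['task_id']` (the fields this notebook reads elsewhere). `per_task_latency` is a hypothetical helper and the sample numbers are made up for illustration.

```python
def per_task_latency(baseline_runs: list, debate_runs: list) -> list:
    """Join runs on metadata['task_id'] and compute debate/baseline latency ratios."""
    baseline_by_id = {r["metadata"]["task_id"]: r["latency_ms"] for r in baseline_runs}
    rows = []
    for run in debate_runs:
        task_id = run["metadata"]["task_id"]
        base_ms = baseline_by_id.get(task_id)
        if not base_ms:
            continue  # no baseline run (or zero latency) for this task
        rows.append({
            "task_id": task_id,
            "baseline_ms": base_ms,
            "debate_ms": run["latency_ms"],
            "ratio": run["latency_ms"] / base_ms,
        })
    return rows


# Made-up numbers, for illustration only
sample_baseline = [{"metadata": {"task_id": "logic_01"}, "latency_ms": 1000.0}]
sample_debate = [{"metadata": {"task_id": "logic_01"}, "latency_ms": 3200.0}]
for row in per_task_latency(sample_baseline, sample_debate):
    print(f"{row['task_id']}: {row['ratio']:.1f}x")
```

In this notebook it would be called as `per_task_latency(baseline_data['runs'], debate_data['runs'])`, and the returned rows can be passed straight to `pd.DataFrame`.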

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Latency comparison
ax = axes[0]
comparison_df.plot.bar(x='Strategy', y='Avg Latency (ms)', ax=ax, legend=False, color=['steelblue', 'coral'])
ax.set_title('Latency: Single vs Debate', fontsize=14, weight='bold')
ax.set_ylabel('Milliseconds')
ax.set_xlabel('')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

# Cost comparison
ax = axes[1]
comparison_df.plot.bar(x='Strategy', y='Total Cost ($)', ax=ax, legend=False, color=['steelblue', 'coral'])
ax.set_title('Cost: Single vs Debate', fontsize=14, weight='bold')
ax.set_ylabel('USD')
ax.set_xlabel('')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

## Qualitative Analysis

Compare actual outputs to assess quality differences.

In [None]:
def compare_outputs(task_id: str):
    """Show baseline vs debate outputs side by side."""
    print(f"\n{'='*80}")
    print(f"Task: {task_id}")
    print(f"{'='*80}\n")
    
    # Get task
    task = next(t for t in reasoning_tasks if t['id'] == task_id)
    print(f"Input: {task['input']}\n")
    print(f"Expected: {task['expected']}\n")
    print("-" * 80)
    
    # Get baseline
    baseline_run = next(r for r in baseline_data['runs'] if r['metadata']['task_id'] == task_id)
    print("\nBASELINE (Single Model):")
    print(f"Output: {baseline_run['output']}")
    print(f"Latency: {baseline_run['latency_ms']:.0f}ms")
    
    # Get debate
    debate_run = next(r for r in debate_data['runs'] if r['metadata']['task_id'] == task_id)
    print(f"\nDEBATE ({NUM_DEBATERS} agents):")
    for i, arg in enumerate(debate_run['metadata']['arguments']):
        print(f"  Debater {i+1}: {arg[:80]}...")
    print(f"\nJudge Verdict: {debate_run['output']}")
    print(f"Latency: {debate_run['latency_ms']:.0f}ms")
    print("-" * 80)

# Compare first task
compare_outputs(reasoning_tasks[0]['id'])

In [None]:
# Browse all tasks
for task in reasoning_tasks:
    compare_outputs(task['id'])
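
Reading outputs side by side does not scale past a handful of tasks, and the accuracy hypothesis needs some kind of score. Below is a rough heuristic, not a real grader: for yes/no tasks it checks that the output opens with the same word as `expected`, and for numeric tasks it checks that the expected number appears as a whole number in the output. `crude_match` is a hypothetical helper; it will miss correct answers phrased differently (for example "Not necessarily"), so its verdicts should be spot-checked by hand.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def crude_match(output: str, expected: str) -> bool:
    """Heuristic correctness check against the task's `expected` string."""
    out = output.strip().lower()
    exp = expected.strip().lower()
    verdict = re.match(r"(yes|no)\b", exp)
    if verdict:
        # Yes/no tasks: the output must open with the same word.
        return re.match(rf"\W*{verdict.group(1)}\b", out) is not None
    number = NUMBER.search(exp)
    if number:
        # Numeric tasks: the expected number must appear as a whole number token.
        return number.group(0) in NUMBER.findall(out)
    # Fallback: plain substring match.
    return exp in out


print(crude_match("The average speed is 60 km/h.", "60 km/h"))
print(crude_match("Yes, some roses fade quickly.", "No, this doesn't follow logically."))
```

Applied to both experiments, this would mean calling `crude_match(run['output'], run['metadata']['expected'])` per debate run and counting the matches for each strategy.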

## Key Findings

### Quantitative
- [Fill in after running]
- Debate adds Xx latency overhead
- Cost increased by X%

### Qualitative
- [Observations on quality differences]
- When did debate help?
- When did debate hurt?

### Hypotheses
- [ ] Debate improves accuracy on multi-perspective tasks
- [ ] Latency overhead is 2-3x
- [ ] Judge quality matters more than debater quality

## Next Experiments
1. Test with larger judge model (13B or 30B)
2. Add 3rd debater
3. Try specialized debater roles
4. Test on creative/open-ended tasks
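
For the specialized-roles idea, one possible starting point is to vary only the debater prompt and leave the rest of the pipeline alone. The sketch below is illustrative: the role names, their instructions, and `build_role_prompt` are all hypothetical and not part of the current `DebateSystem`.

```python
# Hypothetical roles; the notebook does not define any yet.
ROLE_INSTRUCTIONS = {
    "skeptic": "Look for flaws, hidden assumptions, and counterexamples.",
    "solver": "Work through the problem step by step and commit to an answer.",
}


def build_role_prompt(role: str, question: str) -> str:
    """Build a debater prompt that mirrors the existing one but adds a role."""
    instruction = ROLE_INSTRUCTIONS[role]  # raises KeyError for unknown roles
    return (
        f"You are the {role} in a debate. {instruction}\n\n"
        f"Question: {question}\n\n"
        "Your argument:"
    )


print(build_role_prompt("skeptic", "Does coffee cause longevity?"))
```

Keeping the `Question:` and `Your argument:` scaffolding identical to the current prompt would make it easier to attribute any difference to the role instruction itself.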
