# Multi-Agent Experiments: Debate Strategy

**Experiment:** Compare debate-based multi-agent vs single-model baselines  
**Date:** [Fill in]  
**Author:** Leif Haven Martinson  

## Goals
- Implement 2-agent debate with judge
- Compare debate vs single-model baseline
- Measure quality, latency, and cost tradeoffs
- Identify when debate helps vs hurts

## Hypotheses
- Debate improves accuracy on tasks with multiple valid perspectives
- Debate adds 2-3x latency but may justify cost with quality gains
- Judge quality matters more than debater quality
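
The 2-3x latency figure can be sanity-checked by counting model calls. The sketch below is a rough estimate, not a measurement. It assumes every round calls each debater once, the judge is called once at the end, and all calls take about the same time. The function name `estimate_debate_overhead` and the `parallel` flag are illustrative and not part of the harness.

```python
def estimate_debate_overhead(n_debaters: int, n_rounds: int, parallel: bool = False) -> dict:
    """Estimate call count and latency multiplier relative to one single-model call.

    Assumption: every call (debater or judge) takes roughly the same time.
    """
    if n_debaters < 1 or n_rounds < 1:
        raise ValueError("n_debaters and n_rounds must be >= 1")
    # Each round calls every debater once; the judge adds one final call.
    total_calls = n_debaters * n_rounds + 1
    # If debaters run concurrently, a round costs one step instead of n_debaters steps.
    steps_per_round = 1 if parallel else n_debaters
    latency_multiplier = steps_per_round * n_rounds + 1
    return {"total_calls": total_calls, "latency_multiplier": latency_multiplier}


sequential = estimate_debate_overhead(2, 2)
concurrent = estimate_debate_overhead(2, 2, parallel=True)
single_round = estimate_debate_overhead(2, 1)
print(sequential, concurrent, single_round)
```

Under these assumptions, 2 debaters over 2 rounds make 5 calls: about 5x latency when run one after another and about 3x when debaters run concurrently. A single round run sequentially also lands at 3x, so the 2-3x hypothesis implicitly assumes either one round or concurrent debaters. Cost tracks the total call count more closely than latency does, and later rounds are likely to cost more per call because prompts include earlier arguments.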


In [11]:
# Setup
import sys
sys.path.append('../code')

from harness import (
    llm_call,
    run_strategy,
    debate_strategy,
    ExperimentConfig,
    ExperimentResult,
    get_tracker,
    evaluate_task
)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

print("✅ Setup complete")

✅ Setup complete


## Debate Strategy Implementation

Two agents debate, then a judge decides the best answer.
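
The control flow described above can be illustrated with a toy loop. This is a sketch of the general debate pattern, not the harness's `debate_strategy` implementation, whose internals are not shown here. `toy_debate` and the stubbed `fake_llm` are hypothetical names, and the stub returns canned text so the example runs without a model.

```python
from typing import Callable, List


def toy_debate(question: str, ask: Callable[[str], str],
               n_debaters: int = 2, n_rounds: int = 2) -> dict:
    """Run a minimal debate: each round every debater answers, then a judge decides."""
    arguments: List[str] = []
    for _ in range(n_rounds):
        new_arguments = []
        for d in range(n_debaters):
            prompt = f"Question: {question}\n"
            # From round 2 onward, show each debater the others' previous arguments.
            others = [a for i, a in enumerate(arguments) if i != d]
            if others:
                prompt += "Other debaters argued:\n" + "\n".join(others)
                prompt += "\nRefine or rebut."
            new_arguments.append(ask(prompt))
        arguments = new_arguments
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(arguments))
    verdict = ask(f"Question: {question}\nAnswers:\n{numbered}\nPick the best answer.")
    return {"output": verdict, "arguments": arguments, "n_rounds": n_rounds}


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; records the prompt and returns canned text."""
    calls.append(prompt)
    return f"reply {len(calls)}"


result = toy_debate("What is 2+2?", fake_llm)
print(len(calls), result["output"], result["arguments"])
```

With 2 debaters and 2 rounds the stub is called five times: four debater turns plus one judge call. Only the final-round arguments are kept for the judge in this sketch; a real implementation might pass the full transcript instead.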

In [None]:
# Quick test to verify debate is working
# This uses default settings - see the configuration cell below to customize

test_result = run_strategy(
    "debate",
    "What is 2+2?",
    n_debaters=2,
    provider="ollama",
    verbose=False  # Set to True to see the full debate
)

print(f"Debate result: {test_result.output[:100]}...")
print(f"Latency: {test_result.latency_s:.2f}s")
print(f"\nNumber of debaters: {test_result.metadata['n_debaters']}")
print(f"Number of rounds: {test_result.metadata['n_rounds']}")
print("\n✅ Debate system ready! Scroll down to customize debate parameters.")

## Load Tasks from Baseline

Use the same tasks for fair comparison.

In [8]:
# Same tasks as baseline experiment
reasoning_tasks = [
    {
        "id": "logic_01",
        "category": "logical_reasoning",
        "input": "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
        "expected": "No, this doesn't follow logically."
    },
    {
        "id": "math_01",
        "category": "arithmetic",
        "input": "A train travels 120 km in 2 hours, then 180 km in 3 hours. What is its average speed for the entire journey?",
        "expected": "60 km/h"
    },
    {
        "id": "reasoning_01",
        "category": "causal_reasoning",
        "input": "Studies show that people who drink coffee tend to live longer. Does this mean coffee causes longevity?",
        "expected": "No, correlation doesn't imply causation."
    },
    {
        "id": "planning_01",
        "category": "planning",
        "input": "You need to be at a meeting 30 km away at 2 PM. Traffic is heavy (20 km/h). It's now 1:15 PM. Can you make it on time?",
        "expected": "No. Travel time = 1.5 hours."
    },
    {
        "id": "pattern_01",
        "category": "pattern_recognition",
        "input": "What comes next in this sequence: 2, 6, 12, 20, 30, ?",
        "expected": "42"
    }
]

print(f"Loaded {len(reasoning_tasks)} tasks")

Loaded 5 tasks
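
Because the `expected` values act as ground truth for both strategies, it is worth confirming the numeric ones by hand. The short check below recomputes the three arithmetic answers from the task text. It is an independent sanity check, not part of the harness, and the variable names are illustrative.

```python
# math_01: average speed is total distance over total time, not the mean of the two speeds
total_km = 120 + 180
total_hours = 2 + 3
avg_speed_kmh = total_km / total_hours

# planning_01: 30 km at 20 km/h versus the 45 minutes between 1:15 PM and 2:00 PM
travel_hours = 30 / 20
available_hours = 45 / 60
can_make_it = travel_hours <= available_hours

# pattern_01: the terms are n * (n + 1) for n = 1, 2, 3, ...
sequence = [n * (n + 1) for n in range(1, 7)]

print(avg_speed_kmh, travel_hours, can_make_it, sequence)
```

The results match the task list: 60 km/h, a 1.5 hour trip that does not fit in 0.75 hours, and 42 as the sixth term.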


## Configure Debate Parameters

**CUSTOMIZE HERE:** Adjust these settings to control the debate behavior

In [None]:
# ========================================
# DEBATE CONFIGURATION - EDIT THESE VALUES
# ========================================

from harness.defaults import DEFAULT_MODEL, DEFAULT_PROVIDER

# Number of debating agents
NUM_DEBATERS = 2

# Number of debate rounds (agents see each other's arguments and refine)
# 1 = initial arguments only, 2+ = rebuttals and refinements
NUM_ROUNDS = 2  # 🔧 CHANGE THIS to adjust debate depth

# Model configuration
PROVIDER = DEFAULT_PROVIDER
MODEL = DEFAULT_MODEL

# ========================================
# DEBATER PERSPECTIVES - EDIT THESE PROMPTS
# ========================================
# Give each debater a different perspective or role
# Set to None to use default prompts

DEBATER_PROMPTS = [
    # Debater 1: Skeptical/Critical perspective
    """You are a critical thinker who questions assumptions and looks for flaws in reasoning.
Approach the question with healthy skepticism and consider what could go wrong.""",

    # Debater 2: Optimistic/Constructive perspective
    """You are a constructive thinker who builds on ideas and explores positive possibilities.
Look for opportunities and creative solutions to the question.""",

    # Add more debaters if NUM_DEBATERS > 2:
    # """You are a practical thinker focused on real-world implementation...""",
]

# ========================================
# JUDGE PROMPT - EDIT THIS
# ========================================
# Custom prompt for the judge (use {task_input} as placeholder for the question)
# Set to None to use default judge prompt

JUDGE_PROMPT = """You are an expert judge evaluating different perspectives on a question.
Consider the strength of reasoning, evidence, and practical implications of each answer.

Question: {task_input}

Evaluate the answers below and provide your final verdict on the best answer, with clear reasoning."""

# Set to None to use default:
# JUDGE_PROMPT = None

print("✅ Debate configuration:")
print(f"   - {NUM_DEBATERS} debaters")
print(f"   - {NUM_ROUNDS} round(s) of debate")
print(f"   - Provider: {PROVIDER}")
print(f"   - Model: {MODEL}")
print(f"   - Custom debater prompts: {len(DEBATER_PROMPTS) if DEBATER_PROMPTS else 0}")
print(f"   - Custom judge prompt: {'Yes' if JUDGE_PROMPT else 'No'}")
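
Since the debater count, the prompt list, and the judge template are edited separately, they can drift out of sync. The check below is a minimal sketch of a pre-flight validation. `validate_debate_config` is a hypothetical helper, and how the harness itself handles a mismatch is not known from this notebook.

```python
from typing import List, Optional


def validate_debate_config(num_debaters: int, num_rounds: int,
                           debater_prompts: Optional[List[str]],
                           judge_prompt: Optional[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config looks consistent."""
    problems = []
    if num_debaters < 2:
        problems.append("A debate needs at least 2 debaters.")
    if num_rounds < 1:
        problems.append("num_rounds must be at least 1.")
    # None means 'use default prompts', so only check the length of an explicit list.
    if debater_prompts is not None and len(debater_prompts) != num_debaters:
        problems.append(
            f"{len(debater_prompts)} debater prompts given for {num_debaters} debaters."
        )
    # The judge template is documented as using {task_input} for the question.
    if judge_prompt is not None and "{task_input}" not in judge_prompt:
        problems.append("Judge prompt is missing the {task_input} placeholder.")
    return problems


ok = validate_debate_config(2, 2, ["skeptic", "optimist"], "Question: {task_input}")
bad = validate_debate_config(3, 2, ["skeptic", "optimist"], "Pick the best answer.")
print(ok)
print(bad)
```

In this notebook the call would be `validate_debate_config(NUM_DEBATERS, NUM_ROUNDS, DEBATER_PROMPTS, JUDGE_PROMPT)`, run before starting the experiment loop.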

## Run Debate Experiments

In [None]:
# Run debate experiment with tracking
config = ExperimentConfig(
    experiment_name=f"debate_{NUM_DEBATERS}agents_{NUM_ROUNDS}rounds",
    task_type="reasoning",
    strategy="debate",
    provider=PROVIDER,
    model=MODEL,
    n_agents=NUM_DEBATERS,
    notes=f"{NUM_DEBATERS} debaters, {NUM_ROUNDS} rounds with custom prompts"
)

tracker = get_tracker()
tracker.start_experiment(config)

for i, task in enumerate(reasoning_tasks):
    print(f"\n{'='*60}")
    print(f"Running debate on: {task['id']}")
    print(f"{'='*60}")
    
    # Run debate strategy with custom configuration
    result = run_strategy(
        "debate",
        task['input'],
        n_debaters=NUM_DEBATERS,
        n_rounds=NUM_ROUNDS,  # 🔧 Using configured rounds
        provider=PROVIDER,
        model=MODEL,
        debater_prompts=DEBATER_PROMPTS,  # 🔧 Using custom debater perspectives
        judge_prompt=JUDGE_PROMPT,  # 🔧 Using custom judge prompt
        verbose=True  # Set to False to hide streaming output
    )
    
    # Display debate results
    print("\n📊 Debate completed:")
    print(f"   - Rounds: {NUM_ROUNDS}")
    print(f"   - Total latency: {result.latency_s:.2f}s")
    
    if NUM_ROUNDS > 1:
        print(f"\n   Final arguments after {NUM_ROUNDS} rounds:")
    else:
        print(f"\n   Arguments:")
    
    for j, arg in enumerate(result.metadata['arguments']):
        print(f"\n   Debater {j+1}: {arg[:100]}...")
    
    print(f"\n   Judge verdict: {result.output[:100]}...")
    
    # Log result
    exp_result = ExperimentResult(
        config=config,
        task_input=task['input'],
        output=result.output,
        latency_s=result.latency_s,
        tokens_in=result.tokens_in,
        tokens_out=result.tokens_out,
        cost_usd=result.cost_usd,
        eval_metadata={
            'task_id': task['id'],
            'category': task['category'],
            'expected': task['expected'],
            'arguments': result.metadata['arguments'],
            'n_rounds': NUM_ROUNDS
        }
    )
    
    # Evaluate the result
    exp_result.eval_scores = evaluate_task(task, result.output)
    
    tracker.log_result(exp_result)
    print("✓ Logged")

summary = tracker.finish_experiment()
print("\n" + "="*60)
print("Debate experiment complete!")
print(f"Saved to: {tracker.current_run_dir}")

## Compare: Single vs Debate

Load baseline and debate results for comparison.

In [None]:
# Load and compare experiments
# Note: Run the baseline experiment first (01_baseline_experiments.ipynb).
# The comparison cells below expect two dicts, baseline_data and debate_data,
# each with 'summary' and 'runs' keys loaded from your experiment directories.

from pathlib import Path

# List available experiments
exp_dir = Path("../experiments")
if exp_dir.exists():
    experiments = sorted([d.name for d in exp_dir.iterdir() if d.is_dir()])
    print("Available experiments:")
    for exp in experiments:
        print(f"  - {exp}")
else:
    print("No experiments found yet. Run baseline experiments first!")
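
The comparison cells read `baseline_data` and `debate_data`, each shaped like `{'summary': ..., 'runs': ...}`. The sketch below shows one way to build that shape from a run directory. The file names `summary.json` and `runs.json` are assumptions, as is the helper `load_experiment`; check what the tracker actually writes and adjust accordingly. The demo uses a temporary directory with made-up values so it runs without any real experiment output.

```python
import json
import tempfile
from pathlib import Path


def load_experiment(run_dir, summary_file="summary.json", runs_file="runs.json"):
    """Load one experiment directory into {'summary': ..., 'runs': ...}.

    The two file names are assumptions; change them to match the tracker's output.
    """
    run_dir = Path(run_dir)
    with open(run_dir / summary_file, encoding="utf-8") as f:
        summary = json.load(f)
    with open(run_dir / runs_file, encoding="utf-8") as f:
        runs = json.load(f)
    return {"summary": summary, "runs": runs}


# Demo on a throwaway directory with placeholder numbers (not real results)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "summary.json").write_text(
        json.dumps({"num_runs": 1, "stats": {"avg_latency_ms": 100.0, "total_cost": 0.0}}),
        encoding="utf-8",
    )
    (demo_dir / "runs.json").write_text(
        json.dumps([{"output": "demo", "latency_ms": 100.0,
                     "metadata": {"task_id": "logic_01"}}]),
        encoding="utf-8",
    )
    demo_data = load_experiment(demo_dir)

print(sorted(demo_data.keys()))
```

With real runs, each dataset would be loaded by pointing `load_experiment` at the matching directory under `../experiments`, once for the baseline run and once for the debate run.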

In [None]:
# Comparison metrics
comparison_df = pd.DataFrame([
    {
        'Strategy': 'Single Model',
        'Avg Latency (ms)': baseline_data['summary']['stats']['avg_latency_ms'],
        'Total Cost ($)': baseline_data['summary']['stats']['total_cost'],
        'Runs': baseline_data['summary']['num_runs']
    },
    {
        'Strategy': f'Debate ({NUM_DEBATERS} agents)',
        'Avg Latency (ms)': debate_data['summary']['stats']['avg_latency_ms'],
        'Total Cost ($)': debate_data['summary']['stats']['total_cost'],
        'Runs': debate_data['summary']['num_runs']
    }
])

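# Calculate overhead as debate / baseline ratios
baseline_stats = baseline_data['summary']['stats']
debate_stats = debate_data['summary']['stats']

latency_overhead = debate_stats['avg_latency_ms'] / baseline_stats['avg_latency_ms']
# Local providers such as ollama may report zero cost, so guard the division
cost_overhead = (debate_stats['total_cost'] / baseline_stats['total_cost']
                 if baseline_stats['total_cost'] else None)

print("\nPerformance Comparison:")
print(comparison_df.to_string(index=False))
print("\nDebate Overhead:")
print(f"  Latency: {latency_overhead:.1f}x baseline")
if cost_overhead is None:
    print("  Cost: n/a (baseline cost is zero)")
else:
    print(f"  Cost: {cost_overhead:.1f}x baseline")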

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Latency comparison
ax = axes[0]
comparison_df.plot.bar(x='Strategy', y='Avg Latency (ms)', ax=ax, legend=False, color=['steelblue', 'coral'])
ax.set_title('Latency: Single vs Debate', fontsize=14, weight='bold')
ax.set_ylabel('Milliseconds')
ax.set_xlabel('')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

# Cost comparison
ax = axes[1]
comparison_df.plot.bar(x='Strategy', y='Total Cost ($)', ax=ax, legend=False, color=['steelblue', 'coral'])
ax.set_title('Cost: Single vs Debate', fontsize=14, weight='bold')
ax.set_ylabel('USD')
ax.set_xlabel('')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

## Qualitative Analysis

Compare actual outputs to assess quality differences.

In [None]:
def compare_outputs(task_id: str):
    """Show baseline vs debate outputs side by side."""
    print(f"\n{'='*80}")
    print(f"Task: {task_id}")
    print(f"{'='*80}\n")
    
    # Get task
    task = next(t for t in reasoning_tasks if t['id'] == task_id)
    print(f"Input: {task['input']}\n")
    print(f"Expected: {task['expected']}\n")
    print("-" * 80)
    
    # Get baseline
    baseline_run = next(r for r in baseline_data['runs'] if r['metadata']['task_id'] == task_id)
    print(f"\nBASELINE (Single Model):")
    print(f"Output: {baseline_run['output']}")
    print(f"Latency: {baseline_run['latency_ms']:.0f}ms")
    
    # Get debate
    debate_run = next(r for r in debate_data['runs'] if r['metadata']['task_id'] == task_id)
    print(f"\nDEBATE ({NUM_DEBATERS} agents):")
    for i, arg in enumerate(debate_run['metadata']['arguments']):
        print(f"  Debater {i+1}: {arg[:80]}...")
    print(f"\nJudge Verdict: {debate_run['output']}")
    print(f"Latency: {debate_run['latency_ms']:.0f}ms")
    print("-" * 80)

# Compare first task
compare_outputs(reasoning_tasks[0]['id'])

In [None]:
# Browse all tasks
for task in reasoning_tasks:
    compare_outputs(task['id'])

## Key Findings

### Quantitative
- [Fill in after running]
- Debate adds Xx latency overhead
- Cost increased by X%

### Qualitative
- [Observations on quality differences]
- When did debate help?
- When did debate hurt?

### Hypotheses
- [ ] Debate improves accuracy on multi-perspective tasks
- [ ] Latency overhead is 2-3x
- [ ] Judge quality matters more than debaters

## Next Experiments
1. Test with larger judge model (13B or 30B)
2. Add 3rd debater
3. Try specialized debater roles
4. Test on creative/open-ended tasks
