# Consensus Strategy: Multi-Agent Without Judge

**Experiment:** Agents debate and build consensus amongst themselves (no separate judge)

**Date:** [Fill in]  
**Author:** Leif Haven Martinson  

## Goals
- Compare consensus-building vs judge-based debate
- Measure convergence across rounds
- Understand when agents naturally align vs diverge

## Hypotheses
- Consensus works better when the task has an objectively correct answer
- Judge-based debate works better for subjective or creative tasks
- More rounds improve convergence, but with diminishing returns

In [None]:
# Setup
import sys
sys.path.append('../code')

from harness import (
    llm_call,
    run_strategy,
    consensus_strategy,
    ExperimentConfig,
    ExperimentResult,
    get_tracker,
    evaluate_task
)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

print("✅ Setup complete")

## Quick Test

In [None]:
# Quick test to verify consensus is working
# This uses default settings - see the configuration cell below to customize

test_result = run_strategy(
    "consensus",
    "What is 2+2?",
    n_agents=3,
    n_rounds=2,
    provider="ollama",
    verbose=False  # Set to True to see the full consensus process
)

print(f"Consensus result: {test_result.output[:100]}...")
print(f"Latency: {test_result.latency_s:.2f}s")
print(f"\nNumber of agents: {test_result.metadata['n_agents']}")
print(f"Number of rounds: {test_result.metadata['n_rounds']}")
print("\n✅ Consensus system ready! Scroll down to customize parameters.")

## Load Tasks from Baseline

Use the same tasks as the baseline and debate experiments so the comparison is fair.

In [None]:
# Same tasks as baseline and debate experiments
reasoning_tasks = [
    {
        "id": "logic_01",
        "category": "logical_reasoning",
        "input": "If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?",
        "expected": "No, this doesn't follow logically."
    },
    {
        "id": "math_01",
        "category": "arithmetic",
        "input": "A train travels 120 km in 2 hours, then 180 km in 3 hours. What is its average speed for the entire journey?",
        "expected": "60 km/h"
    },
    {
        "id": "reasoning_01",
        "category": "causal_reasoning",
        "input": "Studies show that people who drink coffee tend to live longer. Does this mean coffee causes longevity?",
        "expected": "No, correlation doesn't imply causation."
    },
    {
        "id": "planning_01",
        "category": "planning",
        "input": "You need to be at a meeting 30 km away at 2 PM. Traffic is heavy (20 km/h). It's now 1:15 PM. Can you make it on time?",
        "expected": "No. Travel time = 1.5 hours."
    },
    {
        "id": "pattern_01",
        "category": "pattern_recognition",
        "input": "What comes next in this sequence: 2, 6, 12, 20, 30, ?",
        "expected": "42"
    }
]

print(f"Loaded {len(reasoning_tasks)} tasks")
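
The expected answers for the three numeric tasks (`math_01`, `planning_01`, `pattern_01`) can be verified with a few lines of arithmetic. This is a minimal sketch that is independent of the harness; the variable names are illustrative only.

```python
# math_01: average speed = total distance / total time
total_distance_km = 120 + 180
total_time_h = 2 + 3
average_speed_kmh = total_distance_km / total_time_h

# planning_01: 30 km at 20 km/h versus the 45 minutes between 1:15 PM and 2:00 PM
travel_time_h = 30 / 20
available_time_h = 45 / 60
can_make_it = travel_time_h <= available_time_h

# pattern_01: the terms follow n * (n + 1), so the sixth term is 6 * 7
sequence = [n * (n + 1) for n in range(1, 7)]

print(average_speed_kmh, travel_time_h, can_make_it, sequence[-1])
```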

## Configure Consensus Parameters

**CUSTOMIZE HERE:** Adjust these settings to control the consensus behavior

In [None]:
# ========================================
# CONSENSUS CONFIGURATION - EDIT THESE VALUES
# ========================================

from harness.defaults import DEFAULT_MODEL, DEFAULT_PROVIDER

# Number of agents working toward consensus
NUM_AGENTS = 3

# Number of consensus rounds
# Each round, agents see others' positions and refine their own
# More rounds may improve convergence, but they also add cost and latency
NUM_ROUNDS = 3  # 🔧 CHANGE THIS to adjust consensus depth

# Model configuration
PROVIDER = DEFAULT_PROVIDER
MODEL = DEFAULT_MODEL

# ========================================
# AGENT PERSPECTIVES - EDIT THESE PROMPTS
# ========================================
# Give each agent a different perspective or role
# Set to None to use default prompts

AGENT_PROMPTS = [
    # Agent 1: Analytical perspective
    """You are an analytical thinker who breaks down problems logically and systematically.
Focus on facts, data, and structured reasoning.""",

    # Agent 2: Creative perspective
    """You are a creative thinker who explores unconventional solutions and lateral thinking.
Consider multiple perspectives and novel approaches.""",

    # Agent 3: Practical perspective
    """You are a practical thinker focused on real-world applicability and common sense.
Consider what actually works in practice.""",

    # Add more agents if NUM_AGENTS > 3:
    # """You are a risk-aware thinker who considers potential downsides...""",
]

# Set to None to use default neutral prompts:
# AGENT_PROMPTS = None

print("✅ Consensus configuration:")
print(f"   - {NUM_AGENTS} agents")
print(f"   - {NUM_ROUNDS} round(s) of consensus building")
print(f"   - Provider: {PROVIDER}")
print(f"   - Model: {MODEL}")
print(f"   - Custom agent prompts: {len(AGENT_PROMPTS) if AGENT_PROMPTS else 0}")
print("\n💡 Note: Consensus has NO judge - agents synthesize their own final answer")
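
`NUM_AGENTS` and `AGENT_PROMPTS` are set independently, so they can drift apart when one of them is edited. How the harness handles a mismatch is not shown in this notebook. The sketch below is a hypothetical pre-flight check (`check_agent_prompts` is not part of `harness`) that fails early rather than relying on that behavior.

```python
from typing import Optional, Sequence


def check_agent_prompts(n_agents: int, prompts: Optional[Sequence[str]]) -> None:
    """Raise ValueError if custom prompts are given but their count differs from n_agents."""
    if prompts is None:
        return  # None means "use the default prompts", so there is nothing to check
    if len(prompts) != n_agents:
        raise ValueError(f"Expected {n_agents} agent prompts, got {len(prompts)}")


# Example with illustrative values: three agents, three prompts, so no error is raised
check_agent_prompts(3, ["analytical", "creative", "practical"])
```

In this notebook the call would be `check_agent_prompts(NUM_AGENTS, AGENT_PROMPTS)`, placed before the experiment loop.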

## Run Consensus Experiments

In [None]:
# Run consensus experiment with tracking
config = ExperimentConfig(
    experiment_name=f"consensus_{NUM_AGENTS}agents_{NUM_ROUNDS}rounds",
    task_type="reasoning",
    strategy="consensus",
    provider=PROVIDER,
    model=MODEL,
    n_agents=NUM_AGENTS,
    notes=f"{NUM_AGENTS} agents, {NUM_ROUNDS} rounds, no judge"
)

tracker = get_tracker()
tracker.start_experiment(config)

for task in reasoning_tasks:
    print(f"\n{'='*60}")
    print(f"Building consensus on: {task['id']}")
    print(f"{'='*60}")
    
    # Run consensus strategy with custom configuration
    result = run_strategy(
        "consensus",
        task['input'],
        n_agents=NUM_AGENTS,
        n_rounds=NUM_ROUNDS,  # 🔧 Using configured rounds
        provider=PROVIDER,
        model=MODEL,
        agent_prompts=AGENT_PROMPTS,  # 🔧 Using custom agent perspectives
        verbose=True  # Set to False to hide streaming output
    )
    
    # Display consensus results
    print("\n📊 Consensus completed:")
    print(f"   - Rounds: {NUM_ROUNDS}")
    print(f"   - Total latency: {result.latency_s:.2f}s")
    
    print(f"\n   Final positions after {NUM_ROUNDS} rounds:")
    for j, pos in enumerate(result.metadata['all_positions']):
        print(f"\n   Agent {j+1}: {pos[:100]}...")
    
    print(f"\n   Synthesized consensus: {result.output[:100]}...")
    
    # Log result
    exp_result = ExperimentResult(
        config=config,
        task_input=task['input'],
        output=result.output,
        latency_s=result.latency_s,
        tokens_in=result.tokens_in,
        tokens_out=result.tokens_out,
        cost_usd=result.cost_usd,
        eval_metadata={
            'task_id': task['id'],
            'category': task['category'],
            'expected': task['expected'],
            'final_positions': result.metadata['all_positions'],
            'n_rounds': NUM_ROUNDS
        }
    )
    
    # Evaluate the result
    exp_result.eval_scores = evaluate_task(task, result.output)
    
    tracker.log_result(exp_result)
    print("✓ Logged")

summary = tracker.finish_experiment()
print("\n" + "="*60)
print("Consensus experiment complete!")
print(f"Saved to: {tracker.current_run_dir}")

## Compare: Debate (with Judge) vs Consensus (no Judge)

Load results from both strategies for comparison.

In [None]:
# Load and compare experiments
import os
from pathlib import Path
import json

# List available experiments
exp_dir = Path("../experiments")
if exp_dir.exists():
    experiments = sorted([d.name for d in exp_dir.iterdir() if d.is_dir()])
    print("Available experiments:")
    
    debate_exps = [e for e in experiments if 'debate' in e]
    consensus_exps = [e for e in experiments if 'consensus' in e]
    
    print("\nDebate experiments:")
    for exp in debate_exps:
        print(f"  - {exp}")
    
    print("\nConsensus experiments:")
    for exp in consensus_exps:
        print(f"  - {exp}")
else:
    print(f"No experiments directory found at {exp_dir}.")

In [None]:
# TODO: Add comparison code once you have both debate and consensus results
# Compare:
# - Latency (consensus likely slower - more rounds, no single judge)
# - Quality (when does each work better?)
# - Convergence (do agents actually agree?)
# - Cost (token usage comparison)
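
One way to start on this comparison is to group per-task records by strategy and average the metrics. The sketch below assumes each record is a dict with `strategy`, `latency_s`, `tokens_in`, `tokens_out`, and a boolean `correct` field. Those names echo values logged above, but the on-disk format written by the tracker is not shown in this notebook, so loading is left out, and the sample records are made-up placeholders that only illustrate the expected shape.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_strategy(records):
    """Return {strategy: {metric: value}} with mean latency, mean tokens, and accuracy."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["strategy"]].append(record)

    summary_by_strategy = {}
    for strategy, rows in grouped.items():
        summary_by_strategy[strategy] = {
            "n_tasks": len(rows),
            "mean_latency_s": mean(r["latency_s"] for r in rows),
            "mean_tokens": mean(r["tokens_in"] + r["tokens_out"] for r in rows),
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
        }
    return summary_by_strategy


# Placeholder records (NOT real results), only to show the expected input shape
sample_records = [
    {"strategy": "debate", "latency_s": 10.0, "tokens_in": 400, "tokens_out": 200, "correct": True},
    {"strategy": "debate", "latency_s": 14.0, "tokens_in": 500, "tokens_out": 300, "correct": False},
    {"strategy": "consensus", "latency_s": 20.0, "tokens_in": 900, "tokens_out": 500, "correct": True},
    {"strategy": "consensus", "latency_s": 24.0, "tokens_in": 1100, "tokens_out": 700, "correct": True},
]

comparison = summarize_by_strategy(sample_records)
print(comparison)
```

The returned dict can be passed to `pd.DataFrame.from_dict(comparison, orient="index")` for display or plotting.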

## Analyze Convergence Over Rounds

How much do agent positions change across rounds?

In [None]:
# TODO: Analyze position similarity across rounds
# - Extract final positions from metadata
# - Measure similarity (e.g., word overlap, embedding similarity)
# - Plot convergence over rounds
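
A simple starting point for the similarity measure is word overlap: the Jaccard index between the word sets of two positions, averaged over every pair of agents. The sketch below assumes positions are available per round as a list of lists of strings. The run above stores only the final positions (`all_positions`), so per-round positions would need to be captured separately. The function names and the example strings are illustrative, not real model output.

```python
from itertools import combinations


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Word-overlap similarity: |A & B| / |A | B| over lowercase word sets."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_similarity(positions: list[str]) -> float:
    """Average Jaccard similarity over all unordered pairs of agent positions."""
    pairs = list(combinations(positions, 2))
    if not pairs:
        return 1.0  # a single agent trivially agrees with itself
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


# Illustrative positions for two rounds with three agents each
positions_by_round = [
    ["the answer is four", "it equals five", "probably four"],
    ["the answer is four", "the answer is four", "the answer is four"],
]
convergence = [mean_pairwise_similarity(p) for p in positions_by_round]
print(convergence)
```

Plotting `convergence` against the round number gives the convergence curve. Word overlap misses paraphrases, so embedding similarity may be a better measure once the basic pipeline works.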

## Key Findings

### Quantitative
- [Fill in after running]
- Consensus adds Xx latency vs debate
- Convergence rate across rounds

### Qualitative
- [Observations on consensus quality]
- When did agents converge well?
- When did they diverge or fail to reach agreement?
- Comparison to judge-based approach

### Hypotheses
- [ ] Consensus works better for objective questions
- [ ] Debate works better for subjective tasks
- [ ] More rounds improve convergence (with diminishing returns)

## Next Experiments
1. Test with different agent perspectives (domain experts)
2. Vary the number of rounds (1, 2, 3, 5) to find the point of diminishing returns
3. Mix consensus + judge (agents build consensus, then external judge validates)
4. Test on creative/open-ended tasks where consensus may struggle
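
For the round sweep in item 2, a small loop over round counts is enough. The sketch below takes the strategy runner as a parameter (`run_fn`) so it can be exercised without a model. In this notebook, a thin wrapper around `run_strategy` with the keyword arguments used above could be passed instead. `sweep_rounds` and `stub_runner` are hypothetical names, not part of `harness`.

```python
import time
from typing import Callable, Iterable


def sweep_rounds(
    run_fn: Callable[[str, int], str],
    task_input: str,
    round_counts: Iterable[int] = (1, 2, 3, 5),
) -> list[dict]:
    """Run one task at several round counts and record output and wall-clock latency."""
    rows = []
    for n_rounds in round_counts:
        start = time.perf_counter()
        output = run_fn(task_input, n_rounds)
        rows.append({
            "n_rounds": n_rounds,
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
    return rows


# Stub runner standing in for a real model call
def stub_runner(task_input: str, n_rounds: int) -> str:
    return f"{task_input} after {n_rounds} round(s)"


sweep_results = sweep_rounds(stub_runner, "What is 2+2?")
print([row["n_rounds"] for row in sweep_results])
```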