# Lab 2: Multi-Model AI Competition

## What You'll Build
Today you'll create an intelligent competition system that:
- **Tests multiple AI models** with challenging questions
- **Automatically evaluates responses** using AI judges  
- **Ranks performance** to find the best model
- **Handles multiple APIs** with consistent interfaces

## Two Competition Approaches

### Path A: Mixed-Model Competition  
- Test different providers: OpenAI, Anthropic, Google, etc.
- Great for comparing diverse AI approaches
- Requires multiple API keys

### Path B: Qwen-Only Competition (Recommended)
- Use one Qwen model with different "personality" prompts competing against each other
- Single API key needed (DASHSCOPE_API_KEY)
- Consistent, reliable access globally

**Choose the path that fits your API access!**

# Section 1: Multi-Model Competition Setup

In this lab, we're creating a competition system where different AI models compete to answer questions, with another AI acting as the judge. This demonstrates how multiple agents can interact and evaluate each other's performance.

In [1]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [None]:
# Always remember to do this!
load_dotenv(override=True)

In [None]:
# Check which API keys are available
print("API Key Status:")
api_keys = {
    'OPENAI_API_KEY': 'OpenAI',
    'ANTHROPIC_API_KEY': 'Anthropic (optional)',
    'GOOGLE_API_KEY': 'Google (optional)', 
    'DASHSCOPE_API_KEY': 'Qwen/Alibaba (recommended)'
}

for key, name in api_keys.items():
    value = os.getenv(key)
    if value:
        print(f"[OK] {name}: Available ({value[:6]}...)")
    else:
        print(f"[--] {name}: Not set")

print("\nTip: For Qwen-only path, you only need DASHSCOPE_API_KEY!")

# Section 2: Path A - Mixed-Model Competition (OpenAI, Anthropic, Google)

This section creates a competition between different model providers. It's more complex but shows how different models perform on the same tasks.

In [None]:
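# Select which models to use based on available API keys,
# and create a client for each configured provider
models = []

if os.getenv('OPENAI_API_KEY'):
    print("[OpenAI] Available - Adding GPT models")
    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
    models.extend([
        {"name": "GPT-4o-mini", "client": "openai", "model": "gpt-4o-mini"},
        {"name": "GPT-3.5", "client": "openai", "model": "gpt-3.5-turbo"}
    ])
else:
    print("[OpenAI] Not configured - Skipping GPT models")

if os.getenv('ANTHROPIC_API_KEY'):
    print("[Anthropic] Available - Adding Claude")
    anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    models.append({"name": "Claude", "client": "anthropic", "model": "claude-3-haiku-20240307"})
else:
    print("[Anthropic] Not configured - Skipping Claude")

if os.getenv('GOOGLE_API_KEY'):
    print("[Google] Available - Adding Gemini")
    # Assumes the google-generativeai package is installed
    import google.generativeai as genai
    genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))
    google_model = genai.GenerativeModel("gemini-1.5-flash")
    models.append({"name": "Gemini", "client": "google", "model": "gemini-1.5-flash"})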
else:
    print("[Google] Not configured - Skipping Gemini")

print(f"\nTotal models available for competition: {len(models)}")
if len(models) < 2:
    print("WARNING: You need at least 2 models for a competition!")
    print("Consider using the Qwen-only option in Section 3 instead.")

## Competition Function (Multi-Model)

In [None]:
def run_competition(question, models):
    """Run a competition with multiple models"""
    print(f"COMPETITION: {question}")
    print("=" * 60)
    
    responses = {}
    
    # Get responses from each model
    for model in models:
        try:
            print(f"[{model['name']}] Thinking...")
            
            if model['client'] == 'openai':
                response = openai_client.chat.completions.create(
                    model=model['model'],
                    messages=[{"role": "user", "content": question}],
                    max_tokens=150
                )
                answer = response.choices[0].message.content
            
            elif model['client'] == 'anthropic':
                response = anthropic_client.messages.create(
                    model=model['model'],
                    max_tokens=150,
                    messages=[{"role": "user", "content": question}]
                )
                answer = response.content[0].text
                
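            elif model['client'] == 'google':
                response = google_model.generate_content(question)
                answer = response.text
            
            else:
                # Fail here rather than silently reusing the previous model's answer
                raise ValueError(f"Unknown client type: {model['client']}")
            
            responses[model['name']] = answer.strip()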
            print(f"[{model['name']}] Response recorded")
            
        except Exception as e:
            print(f"[{model['name']}] Error: {str(e)}")
            responses[model['name']] = "Error occurred"
    
    return responses
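
The `if`/`elif` chain above is one way to give several providers a consistent interface. Another option, sketched below, is a dispatch table that maps each `client` label to a function with the same signature. The names `fake_openai`, `fake_anthropic`, `PROVIDERS`, and `ask` are hypothetical and not part of the lab code; the provider functions are offline stand-ins so the sketch runs without API keys.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for real API calls, so the sketch runs offline.
def fake_openai(model_id: str, question: str) -> str:
    return f"openai/{model_id} answer to: {question}"

def fake_anthropic(model_id: str, question: str) -> str:
    return f"anthropic/{model_id} answer to: {question}"

# Map each provider label to a function with the same (model_id, question) signature.
PROVIDERS: Dict[str, Callable[[str, str], str]] = {
    "openai": fake_openai,
    "anthropic": fake_anthropic,
}

def ask(model: dict, question: str) -> str:
    """Look up the provider function and call it; fail loudly on unknown providers."""
    try:
        caller = PROVIDERS[model["client"]]
    except KeyError:
        raise ValueError(f"Unknown client type: {model['client']!r}") from None
    return caller(model["model"], question)

demo_models = [
    {"name": "A", "client": "openai", "model": "gpt-4o-mini"},
    {"name": "B", "client": "anthropic", "model": "claude-3-haiku-20240307"},
]
demo_answers = {m["name"]: ask(m, "What is 2 + 2?") for m in demo_models}
print(demo_answers)
```

With this layout, adding a provider means registering one more function in the table rather than extending the loop body.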

In [None]:
def judge_competition(question, responses):
    """Have an AI judge the competition responses"""
    
    # Format all responses for judging
    response_text = f"Question: {question}\n\n"
    for model, answer in responses.items():
        response_text += f"{model}: {answer}\n\n"
    
    judge_prompt = f"""You are judging a competition between AI models. Here are their responses:

{response_text}

Please evaluate each response and pick the winner. Consider:
- Accuracy and correctness
- Clarity and helpfulness  
- Completeness of the answer

Respond with your judgment in this JSON format:
{{
    "winner": "model_name",
    "reasoning": "brief explanation of why this model won",
    "scores": {{"model1": score, "model2": score}}
}}"""

    try:
        print("[JUDGE] Evaluating responses...")
        
        # Use OpenAI as the judge; without an OpenAI key, no judge is available
        judge_client = openai_client if os.getenv('OPENAI_API_KEY') else None
        
        if judge_client:
            response = judge_client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": judge_prompt}],
                max_tokens=200
            )
            judgment = response.choices[0].message.content
            print("[JUDGE] Judgment complete")
            return judgment
        else:
            return "No judge available (OpenAI API key needed for judging)"
            
    except Exception as e:
        print(f"[JUDGE] Error: {str(e)}")
        return "Judging failed"
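
The judge is asked for JSON, but models sometimes wrap the JSON in a code fence or add a sentence around it. Below is a minimal sketch of a more forgiving parser, assuming the reply contains a single JSON object. `parse_judgment` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json
import re

def parse_judgment(text):
    """Return the judge's verdict as a dict, or None if no JSON object is found."""
    # Take everything from the first '{' to the last '}' so that code fences
    # or commentary around the JSON are ignored.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None

# A made-up judge reply that wraps the JSON in a code fence.
fence = "`" * 3
sample = (
    "Here is my verdict:\n"
    + fence + "json\n"
    + '{"winner": "Claude", "reasoning": "clearest", "scores": {"Claude": 9, "Gemini": 7}}\n'
    + fence
)
parsed = parse_judgment(sample)
print(parsed["winner"], parsed["scores"])
```

Returning `None` instead of raising lets the caller decide how to report an unparseable judgment.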

In [None]:
def display_results(question, responses, judgment):
    """Display the competition results nicely"""
    
    print("\n" + "="*60)
    print("COMPETITION RESULTS")
    print("="*60)
    print(f"Question: {question}")
    print("\nRESPONSES:")
    print("-" * 40)
    
    for model, response in responses.items():
        print(f"\n[{model}]")
        print(response)
    
    print("\n" + "-" * 40)
    print("JUDGE'S DECISION:")
    print(judgment)
    print("="*60)

In [None]:
# Run a test competition
if len(models) >= 2:
    print("Starting multi-model competition...")
    
    test_question = "What is the capital of France and what is it famous for?"
    
    # Run the competition
    responses = run_competition(test_question, models)
    judgment = judge_competition(test_question, responses)
    display_results(test_question, responses, judgment)
    
else:
    print("Not enough models available for competition.")
    print("You need at least 2 API keys configured.")
    print("Try the Qwen-only option in Section 3 instead!")

# Section 3: Path B - Qwen-Only Competition (Simplified)

This section uses only Qwen models for the competition, making it simpler but still demonstrating the agent interaction concepts. Perfect if you only have the DASHSCOPE_API_KEY!

In [None]:
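# Setup for Qwen-only competition
if os.getenv('DASHSCOPE_API_KEY'):
    print("[Qwen] Setting up competition models...")
    
    # Qwen is reached through DashScope's OpenAI-compatible endpoint.
    # Assumption: the international endpoint is used; adjust base_url for your region.
    qwen_client = OpenAI(
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    )
    
    # Create different "personalities" using the same model with different prompts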
    qwen_models = [
        {
            "name": "Qwen-Academic", 
            "personality": "You are an academic expert. Give detailed, scholarly responses with precise facts."
        },
        {
            "name": "Qwen-Creative", 
            "personality": "You are a creative writer. Give engaging, imaginative responses with vivid descriptions."
        },
        {
            "name": "Qwen-Practical", 
            "personality": "You are a practical advisor. Give concise, actionable responses focused on real-world applications."
        }
    ]
    
    print(f"[Qwen] Ready with {len(qwen_models)} different personalities")
    
else:
    print("[ERROR] DASHSCOPE_API_KEY not found!")
    print("Please set your Qwen API key to use this section.")

In [None]:
def run_qwen_competition(question, qwen_models):
    """Run a competition using different Qwen personalities"""
    print(f"QWEN COMPETITION: {question}")
    print("=" * 60)
    
    responses = {}
    
    for model in qwen_models:
        try:
            print(f"[{model['name']}] Generating response...")
            
            # Create a prompt that includes the personality
            full_prompt = f"{model['personality']}\n\nQuestion: {question}"
            
            response = qwen_client.chat.completions.create(
                model="qwen2.5-72b-instruct",
                messages=[{"role": "user", "content": full_prompt}],
                max_tokens=150
            )
            
            answer = response.choices[0].message.content.strip()
            responses[model['name']] = answer
            print(f"[{model['name']}] Response recorded")
            
        except Exception as e:
            print(f"[{model['name']}] Error: {str(e)}")
            responses[model['name']] = "Error occurred"
    
    return responses
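
The function above puts the personality and the question in a single user message. A common alternative in the chat format is to put the persona in a `system` message and keep the question in the `user` message. The sketch below only builds the message list; `build_messages` is a hypothetical helper, and whether this changes the answers for a given Qwen model is something to test rather than assume.

```python
def build_messages(personality, question):
    """Put the persona in a system message and the question in a user message."""
    return [
        {"role": "system", "content": personality},
        {"role": "user", "content": question},
    ]

messages = build_messages(
    "You are a practical advisor. Give concise, actionable responses.",
    "What makes a great leader in today's world?",
)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The resulting list could be passed as the `messages` argument in place of the single user message.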

In [None]:
def judge_qwen_competition(question, responses):
    """Have a Qwen model judge the competition"""
    
    # Format responses for judging
    response_text = f"Question: {question}\n\n"
    for model, answer in responses.items():
        response_text += f"{model}: {answer}\n\n"
    
    judge_prompt = f"""You are an impartial judge evaluating different responses. Here are the responses to evaluate:

{response_text}

Please evaluate each response and pick the winner. Consider:
- Accuracy and helpfulness
- Clarity and engagement
- Completeness and relevance

Respond with your judgment in JSON format:
{{
    "winner": "model_name",
    "reasoning": "brief explanation",
    "scores": {{"model1": score, "model2": score, "model3": score}}
}}"""

    try:
        print("[JUDGE] Evaluating all responses...")
        
        response = qwen_client.chat.completions.create(
            model="qwen2.5-72b-instruct",
            messages=[{"role": "user", "content": judge_prompt}],
            max_tokens=200
        )
        
        judgment = response.choices[0].message.content
        print("[JUDGE] Judgment complete")
        return judgment
        
    except Exception as e:
        print(f"[JUDGE] Error: {str(e)}")
        return "Judging failed"
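
Because the judge sees each contestant's name, and always in the same order, its verdict may be swayed by labels or position rather than content. One possible mitigation, sketched here, is to shuffle the responses and replace the names with neutral labels before building the judge prompt. `anonymize` is a hypothetical helper and the sample responses are placeholders.

```python
import random

def anonymize(responses, seed=None):
    """Shuffle responses and relabel them so the judge never sees model names."""
    rng = random.Random(seed)
    names = list(responses)
    rng.shuffle(names)
    labels = [f"Response {chr(ord('A') + i)}" for i in range(len(names))]
    blinded = {label: responses[name] for label, name in zip(labels, names)}
    key = dict(zip(labels, names))  # label -> real name, kept away from the judge
    return blinded, key

demo_responses = {
    "Qwen-Academic": "A scholarly answer.",
    "Qwen-Creative": "An imaginative answer.",
    "Qwen-Practical": "An actionable answer.",
}
blinded, key = anonymize(demo_responses, seed=42)
print(list(blinded))   # neutral labels only
print(key)             # used afterwards to map the winning label back to a model
```

After judging, the winning label is mapped back to the real model with `key[winner_label]`.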

In [None]:
# Run Qwen competition test
if os.getenv('DASHSCOPE_API_KEY') and 'qwen_models' in locals():
    print("Starting Qwen personality competition...")
    
    test_question = "What makes a great leader in today's world?"
    
    # Run the competition
    qwen_responses = run_qwen_competition(test_question, qwen_models)
    qwen_judgment = judge_qwen_competition(test_question, qwen_responses)
    display_results(test_question, qwen_responses, qwen_judgment)
    
else:
    print("Qwen competition not available.")
    print("Make sure DASHSCOPE_API_KEY is set and run the setup cell above.")

In [None]:
# Try your own question!
my_question = "What's the most important invention in human history?"

print(f"Testing custom question: {my_question}")
print("\n" + "="*60)

# Choose which competition to run based on what's available
if len(models) >= 2:
    print("Running multi-model competition...")
    my_responses = run_competition(my_question, models)
    my_judgment = judge_competition(my_question, my_responses)
    display_results(my_question, my_responses, my_judgment)
    
elif os.getenv('DASHSCOPE_API_KEY') and 'qwen_models' in locals():
    print("Running Qwen competition...")
    my_responses = run_qwen_competition(my_question, qwen_models)
    my_judgment = judge_qwen_competition(my_question, my_responses)
    display_results(my_question, my_responses, my_judgment)
    
else:
    print("No competition available - please configure API keys first!")

# Section 4: Advanced Competition Features

In [None]:
def run_batch_competition(questions, competition_models):
    """Run multiple questions in a tournament"""
    
    print("BATCH COMPETITION STARTING")
    print("=" * 60)
    
    results = []
    model_wins = {model['name']: 0 for model in competition_models}
    
    for i, question in enumerate(questions, 1):
        print(f"\n[Round {i}/{len(questions)}] {question}")
        print("-" * 40)
        
        # Determine which competition function to use
        if isinstance(competition_models[0], dict) and 'personality' in competition_models[0]:
            responses = run_qwen_competition(question, competition_models)
            judgment = judge_qwen_competition(question, responses)
        else:
            responses = run_competition(question, competition_models)
            judgment = judge_competition(question, responses)
        
        # Try to extract the winner from the judgment
        try:
            if judgment.strip().startswith('{'):
                judgment_data = json.loads(judgment)
                winner = judgment_data.get('winner', 'Unknown')
                if winner in model_wins:
                    model_wins[winner] += 1
                    print(f"[WINNER] {winner}")
                else:
                    print(f"[WINNER] Unrecognized winner name: {winner}")
            else:
                print("[WINNER] Could not parse judgment")
        except (json.JSONDecodeError, AttributeError):
            print("[WINNER] Could not determine winner")
        
        results.append({
            'question': question,
            'responses': responses,
            'judgment': judgment
        })
    
    # Show final tournament results
    print("\n" + "="*60)
    print("TOURNAMENT RESULTS")
    print("="*60)
    for model, wins in model_wins.items():
        print(f"{model}: {wins} wins")
    
    return results
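
The tournament above counts wins only. If the judgments are parsed into dictionaries, the `scores` field can also feed a ranking. The sketch below ranks by wins and breaks ties by average score. `build_leaderboard` is a hypothetical helper, and the sample judgments are invented data, not real model output.

```python
from collections import defaultdict

def build_leaderboard(judgments):
    """Rank models by win count, breaking ties by average score."""
    wins = defaultdict(int)
    score_lists = defaultdict(list)
    for judgment in judgments:
        winner = judgment.get("winner")
        if winner:
            wins[winner] += 1
        for name, score in judgment.get("scores", {}).items():
            score_lists[name].append(score)

    def average(name):
        scores = score_lists.get(name, [])
        return sum(scores) / len(scores) if scores else 0.0

    names = set(wins) | set(score_lists)
    ranked = sorted(names, key=lambda n: (-wins[n], -average(n), n))
    return [(name, wins[name], average(name)) for name in ranked]

# Invented judgments, already parsed from JSON into dicts.
sample_judgments = [
    {"winner": "A", "scores": {"A": 9, "B": 7}},
    {"winner": "B", "scores": {"A": 6, "B": 8}},
    {"winner": "A", "scores": {"A": 8, "B": 8}},
]
leaderboard = build_leaderboard(sample_judgments)
for name, n_wins, avg in leaderboard:
    print(f"{name}: {n_wins} wins, average score {avg:.2f}")
```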

In [None]:
# Run a mini tournament
tournament_questions = [
    "What is the most important skill for the future?",
    "How will AI change education in the next decade?",
    "What's the best way to learn a new language?"
]

print("Starting mini tournament...")

# Choose models based on availability
if len(models) >= 2:
    print("Using multi-model setup")
    tournament_results = run_batch_competition(tournament_questions, models)
elif os.getenv('DASHSCOPE_API_KEY') and 'qwen_models' in locals():
    print("Using Qwen personality setup")
    tournament_results = run_batch_competition(tournament_questions, qwen_models)
else:
    print("No models available for tournament")

In [None]:
def analyze_competition_patterns(results):
    """Analyze patterns from competition results"""
    
    print("COMPETITION ANALYSIS")
    print("=" * 60)
    
    print(f"Total competitions run: {len(results)}")
    
    # Analyze response lengths
    response_lengths = []
    for result in results:
        for model, response in result['responses'].items():
            response_lengths.append(len(response))
    
    if response_lengths:
        avg_length = sum(response_lengths) / len(response_lengths)
        print(f"Average response length: {avg_length:.0f} characters")
    
    # General observations (fixed text, not computed from the results)
    print("\nGeneral observations to check against your results:")
    print("- Different models tend to have different response styles")
    print("- Competition may produce more detailed responses") 
    print("- Judging adds an extra layer of evaluation")
    
    return {
        'total_competitions': len(results),
        'average_response_length': avg_length if response_lengths else 0
    }
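
The analysis above reports a single average across all models, which hides differences between contestants and also counts the `"Error occurred"` placeholder as a response. A per-model breakdown could look like the sketch below. `average_length_by_model` is a hypothetical helper and `sample_results` is invented data in the same shape as the tournament results.

```python
from collections import defaultdict

def average_length_by_model(results, error_marker="Error occurred"):
    """Average response length per model, skipping failed responses."""
    lengths = defaultdict(list)
    for result in results:
        for model, response in result["responses"].items():
            if response != error_marker:
                lengths[model].append(len(response))
    return {model: sum(vals) / len(vals) for model, vals in lengths.items()}

# Invented results with the same structure run_batch_competition returns.
sample_results = [
    {"question": "q1", "responses": {"A": "abcd", "B": "ab"}, "judgment": ""},
    {"question": "q2", "responses": {"A": "ab", "B": "Error occurred"}, "judgment": ""},
]
per_model = average_length_by_model(sample_results)
for model, avg in per_model.items():
    print(f"{model}: {avg:.1f} characters on average")
```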

In [None]:
# Analyze the tournament results
if 'tournament_results' in locals():
    print("Analyzing tournament patterns...")
    analysis = analyze_competition_patterns(tournament_results)
    print(f"\nAnalysis complete: {analysis}")
else:
    print("No tournament results to analyze - run a tournament first!")

# Section 5: Experiments and Extensions

In [None]:
# Create a specialized competition for creative tasks
def creative_competition(topic):
    """Run a competition focused on creative responses"""
    
    creative_prompt = f"Write a creative short story or poem about: {topic}"
    
    print(f"CREATIVE COMPETITION: {topic}")
    print("=" * 60)
    
    if len(models) >= 2:
        responses = run_competition(creative_prompt, models)
        judgment = judge_competition(creative_prompt, responses)
    elif os.getenv('DASHSCOPE_API_KEY') and 'qwen_models' in globals():  # locals() would miss notebook-level names here
        responses = run_qwen_competition(creative_prompt, qwen_models)
        judgment = judge_qwen_competition(creative_prompt, responses)
    else:
        print("No models available for creative competition")
        return
    
    display_results(creative_prompt, responses, judgment)

# Try a creative competition
print("Running creative writing competition...")
creative_competition("a robot learning to paint")

In [None]:
# Create a technical debate between models
def technical_debate(topic, position1, position2):
    """Have models argue different sides of a technical issue"""
    
    print(f"TECHNICAL DEBATE: {topic}")
    print("=" * 60)
    
    # Create specific prompts for each side; each side defends its own position
    prompts = {
        "Pro": f"Argue IN FAVOR of: {position1}. Topic: {topic}",
        "Con": f"Argue IN FAVOR of: {position2}. Topic: {topic}"
    }
    
    responses = {}
    
    # Get arguments from available models
    if len(models) >= 2:
        for i, (side, prompt) in enumerate(prompts.items()):
            model = models[i]
            print(f"[{side}] {model['name']} preparing argument...")
            
            if model['client'] == 'openai':
                response = openai_client.chat.completions.create(
                    model=model['model'],
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=200
                )
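                responses[f"{side} ({model['name']})"] = response.choices[0].message.content
            else:
                # This demo only calls OpenAI models; other providers are skipped
                print(f"[{side}] Skipped: {model['client']} is not handled in this debate demo")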
                
    elif os.getenv('DASHSCOPE_API_KEY'):
        for side, prompt in prompts.items():
            response = qwen_client.chat.completions.create(
                model="qwen2.5-72b-instruct",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200
            )
            responses[f"{side} (Qwen)"] = response.choices[0].message.content
    
    return responses

# Run a technical debate
debate_topic = "AI in Education"
debate_responses = technical_debate(
    debate_topic,
    "AI will revolutionize education for the better",
    "AI poses risks to traditional learning"
)

if debate_responses:
    print("\nDEBATE RESULTS:")
    print("=" * 60)
    for side, argument in debate_responses.items():
        print(f"\n[{side}]")
        print(argument)
        print("-" * 40)

In [None]:
# Interactive competition - you pick the winner!
def user_judged_competition(question):
    """Let the user judge the competition"""
    
    print(f"USER-JUDGED COMPETITION: {question}")
    print("=" * 60)
    
    # Get responses
    if len(models) >= 2:
        responses = run_competition(question, models)
    elif os.getenv('DASHSCOPE_API_KEY') and 'qwen_models' in globals():  # locals() would miss notebook-level names here
        responses = run_qwen_competition(question, qwen_models)
    else:
        print("No models available")
        return
    
    # Display responses for user judgment
    print("\nRESPONSES FOR YOUR JUDGMENT:")
    print("=" * 40)
    
    model_list = list(responses.keys())
    for i, (model, response) in enumerate(responses.items(), 1):
        print(f"\n[Option {i}] {model}")
        print(response)
        print("-" * 30)
    
    print("\nWhich response do you think is best?")
    print("In a real interactive version, you would input your choice here!")
    print(f"Options: {', '.join([f'{i}: {model}' for i, model in enumerate(model_list, 1)])}")
    
    return responses

# Example user-judged competition
user_question = "What's the best programming language for beginners?"
user_responses = user_judged_competition(user_question)
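
To make the user-judged version interactive, the typed choice has to be validated and mapped back to a model name. Here is a minimal sketch of that step, kept separate from `input()` so it can run non-interactively. `pick_winner` is a hypothetical helper and the option list is a placeholder.

```python
def pick_winner(choice, model_list):
    """Map a 1-based option string to a model name, or return None if invalid."""
    try:
        index = int(choice.strip())
    except ValueError:
        return None
    if 1 <= index <= len(model_list):
        return model_list[index - 1]
    return None

demo_options = ["Qwen-Academic", "Qwen-Creative", "Qwen-Practical"]
for typed in ["2", " 3 ", "0", "best one"]:
    print(repr(typed), "->", pick_winner(typed, demo_options))
```

In a notebook, the string would come from something like `input("Your choice: ")`, repeated until `pick_winner` returns a name.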

In [None]:
# Ideas for further experimentation
print("EXPERIMENTATION IDEAS")
print("=" * 60)

experiments = [
    "Multi-round competitions where models can respond to each other",
    "Domain-specific competitions (science, arts, business)",
    "Collaborative mode where models work together instead of competing",
    "Different judging criteria (speed, creativity, accuracy)",
    "Real-time competitions with live user voting",
    "Model personality tournaments with different character types"
]

for i, experiment in enumerate(experiments, 1):
    print(f"{i}. {experiment}")

print("\nFeel free to implement any of these ideas!")
print("The competition framework is flexible and can be extended in many ways.")
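
As a starting point for the "different judging criteria" idea, the judge could be asked to score each criterion separately, with the scores then combined using weights you choose. The sketch below shows only the combining step. `weighted_score` is a hypothetical helper, and the criteria, scores, and weights are made-up examples.

```python
def weighted_score(criterion_scores, weights):
    """Combine per-criterion scores into one number using a weighted average."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("At least one weight must be non-zero")
    weighted_sum = sum(criterion_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight

# Made-up scores for one response and weights that favor accuracy.
demo_scores = {"accuracy": 8, "creativity": 6, "clarity": 10}
demo_weights = {"accuracy": 0.5, "creativity": 0.2, "clarity": 0.3}
overall = weighted_score(demo_scores, demo_weights)
print(f"Overall score: {overall:.1f}")
```

Changing the weights, for example raising `creativity` for a story-writing round, changes which response ranks highest without re-running the models.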

# Conclusion

Great work! You've successfully created an AI competition system that demonstrates key concepts in multi-agent interactions:

**Key Concepts Learned:**
- Multiple AI agents working on the same task
- Automated evaluation and judging
- Comparative analysis of different models
- Flexible system design that works with different APIs

**Real-World Applications:**
- Content generation with multiple perspectives
- Automated quality assessment systems
- A/B testing for AI responses
- Educational tools for comparing different approaches

**Next Steps:**
- Try different types of questions and competitions
- Experiment with different judging criteria
- Build more complex multi-agent systems
- Explore collaborative vs competitive agent interactions

The competition framework you've built is a foundation for understanding how multiple AI agents can interact, compete, and evaluate each other's performance!