# 🏆 Lab 2: Multi-Model AI Competition

## 🎯 What You'll Build
Today you'll create an intelligent competition system that:
- **Tests multiple AI models** with challenging questions
- **Automatically evaluates responses** using AI judges  
- **Ranks performance** to find the best model
- **Handles multiple APIs** with consistent interfaces

## 🛤️ Two Competition Approaches

### Path A: Mixed-Model Competition  
- Test different providers: OpenAI, Anthropic, Google, etc.
- Great for comparing diverse AI approaches
- Requires multiple API keys

### Path B: Qwen-Only Competition (Recommended)
- Use different Qwen model variants competing against each other
- Single API key needed (DASHSCOPE_API_KEY)
- Consistent, reliable access globally

**Choose the path that fits your API access!**
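If you are not sure which path your keys support, a quick check along these lines can help. This is only a sketch: `available_paths` is a hypothetical helper, and the rule it applies (Path A needs `OPENAI_API_KEY` plus at least one other provider key, Path B needs only `DASHSCOPE_API_KEY`) is an assumption drawn from the descriptions above.

```python
import os


def available_paths(env=None):
    """Return the list of lab paths ("A", "B") that the given keys allow.

    Hypothetical helper; the selection rule is an assumption, not part of the lab.
    """
    env = os.environ if env is None else env

    def has(name):
        # Treat missing and empty values the same way.
        return bool(env.get(name))

    paths = []
    other_providers = ["ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "DASHSCOPE_API_KEY"]
    # Assumed rule: Path A needs OpenAI plus at least one more provider.
    if has("OPENAI_API_KEY") and any(has(key) for key in other_providers):
        paths.append("A")
    # Path B needs only the DashScope key.
    if has("DASHSCOPE_API_KEY"):
        paths.append("B")
    return paths


# Example with a placeholder key (not a real credential).
print(available_paths({"DASHSCOPE_API_KEY": "sk-placeholder"}))
```

Calling `available_paths()` with no argument reads the real environment, so run it after `load_dotenv`.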

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submit a PR for changes in the community_contributions folder - instructions in the resources. Also, if you have a Github account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [1]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [None]:
# Always remember to do this!
load_dotenv(override=True)

In [None]:
# Check which API keys are available
print("🔑 API Key Status:")
api_keys = {
    'OPENAI_API_KEY': 'OpenAI',
    'ANTHROPIC_API_KEY': 'Anthropic (optional)',
    'GOOGLE_API_KEY': 'Google (optional)', 
    'DASHSCOPE_API_KEY': 'Qwen/Alibaba (recommended)'
}

for key, name in api_keys.items():
    value = os.getenv(key)
    if value:
        print(f"✅ {name}: Available ({value[:6]}...)")
    else:
        print(f"❌ {name}: Not set")

print("\n💡 Tip: For Qwen-only path, you only need DASHSCOPE_API_KEY!")

# 📚 Section 1: Generate Challenge Question

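First, we'll create a challenging question to test the AI models. This cell calls OpenAI, so it needs `OPENAI_API_KEY`. If you are following Path B with only a DashScope key, skip it: Section 3 generates its own question with Qwen-Max.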

In [None]:
# Create the challenge question using OpenAI
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
question = response.choices[0].message.content
print("🎯 Challenge Question Generated:")
print(question)

# 📚 Section 2: Mixed-Model Competition (Path A)

**Only run this section if you have multiple API keys and want to test different providers!**

If you prefer a simpler approach with just Qwen, skip to Section 3 below.
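The OpenAI, Gemini, and Qwen cells in this section all go through the same `chat.completions.create` call and read the reply from `response.choices[0].message.content`. A small wrapper makes that shared shape explicit. This is a sketch, not part of the lab: `ask` and `FakeClient` are hypothetical names, and `FakeClient` only imitates the response shape so the example runs without any API key.

```python
from types import SimpleNamespace


def ask(client, model, messages):
    """Send messages to an OpenAI-compatible client and return the reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


class FakeClient:
    """Offline stand-in (hypothetical) that mimics client.chat.completions.create."""

    def __init__(self):
        completions = SimpleNamespace(create=self._create)
        self.chat = SimpleNamespace(completions=completions)

    def _create(self, model, messages):
        # Echo the last user message so the result is easy to inspect.
        text = f"[{model}] echo: {messages[-1]['content']}"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


print(ask(FakeClient(), "demo-model", [{"role": "user", "content": "Hello"}]))
```

With real clients you would pass `openai`, `gemini`, or `qwen` in place of `FakeClient()`. The Anthropic client is the exception: it uses `messages.create` with `max_tokens` and returns the text in `response.content[0].text`, so it needs its own wrapper.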

In [None]:
# Initialize competition tracking
competitors = []
answers = []
test_messages = [{"role": "user", "content": question}]

# Competitor 1: OpenAI GPT-4o-mini
model_name = "gpt-4o-mini"
response = openai.chat.completions.create(model=model_name, messages=test_messages)
answer = response.choices[0].message.content

display(Markdown(f"**{model_name}:**"))
display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
# Competitor 2: Anthropic Claude
model_name = "claude-3-5-sonnet-20241022"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=test_messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(f"**{model_name}:**"))
display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
# Competitor 3: Google Gemini  
gemini = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"), 
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
model_name = "gemini-2.0-flash"

response = gemini.chat.completions.create(model=model_name, messages=test_messages)
answer = response.choices[0].message.content

display(Markdown(f"**{model_name}:**"))
display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
# Competitor 4: Qwen (Alibaba Cloud)
qwen = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"), 
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)
model_name = "qwen-plus"

response = qwen.chat.completions.create(model=model_name, messages=test_messages)
answer = response.choices[0].message.content

display(Markdown(f"**{model_name}:**"))
display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

# 🌟 Section 3: Complete Qwen-Only Competition (Path B - Recommended)

**Choose this path for a simpler, more reliable approach!**

This section creates a competition between different Qwen model variants:
- **Qwen-Max** (Most powerful)
- **Qwen-Plus** (Balanced performance/cost) 
- **Qwen-Turbo** (Fast and efficient)
- **Qwen-Long** (Specialized for long contexts)

**Benefits:**
- Single API key needed (DASHSCOPE_API_KEY)
- Consistent performance and access
- Served from Alibaba Cloud's international endpoint (`dashscope-intl`)
- Cost-effective testing

**Setup:** Get an API key from https://modelstudio.console.alibabacloud.com/ and add it to your `.env` file as `DASHSCOPE_API_KEY`.
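The competitor cells in this section each repeat the same three steps: call the model, keep the answer, record the model name. As a variation to try, the steps can be folded into a loop. The sketch below is illustrative only: `run_round` and `fake_ask` are hypothetical names, and the fake function stands in for a real API call (with a simulated failure) so the example runs offline.

```python
def run_round(ask_fn, models, question):
    """Ask every model the same question and collect names and answers.

    ask_fn(model, messages) must return the reply text. A model whose call
    raises is skipped, so one failure does not end the whole round.
    """
    competitors, answers = [], []
    messages = [{"role": "user", "content": question}]
    for model in models:
        try:
            answer = ask_fn(model, messages)
        except Exception as exc:
            print(f"Skipping {model}: {exc}")
            continue
        competitors.append(model)
        answers.append(answer)
    return competitors, answers


def fake_ask(model, messages):
    """Offline stand-in for an API call; fails for one model on purpose."""
    if model == "qwen-long":
        raise RuntimeError("simulated failure")
    return f"{model} answers: {messages[0]['content']}"


roster = ["qwen-max", "qwen-plus", "qwen-turbo", "qwen-long"]
names, replies = run_round(fake_ask, roster, "Why?")
print(names)
```

Because names and answers are appended together, the two lists stay aligned even when a model is skipped, which matters later when the judge's ranking is mapped back to model names.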

In [None]:
# Set up the Qwen client and generate the question
qwen_client = OpenAI(
    api_key=os.getenv('DASHSCOPE_API_KEY'), 
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

# Create question using Qwen-Max (most powerful model)
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."

response = qwen_client.chat.completions.create(
    model="qwen-max",
    messages=[{"role": "user", "content": request}]
)
qwen_question = response.choices[0].message.content

print("🎯 Question generated by Qwen-Max:")
print(qwen_question)

# Initialize Qwen competition
qwen_competitors = []
qwen_answers = []
qwen_messages = [{"role": "user", "content": qwen_question}]

In [None]:
# Qwen Competitor 1: Qwen-Max (Most powerful)
model_name = "qwen-max"
response = qwen_client.chat.completions.create(model=model_name, messages=qwen_messages)
answer = response.choices[0].message.content

display(Markdown(f"**{model_name}:**"))
display(Markdown(answer))
qwen_competitors.append(model_name)
qwen_answers.append(answer)

In [None]:
# Qwen Competitor 2: Qwen-Plus (Balanced)
model_name = "qwen-plus" 
response = qwen_client.chat.completions.create(model=model_name, messages=qwen_messages)
answer = response.choices[0].message.content

display(Markdown(f"**{model_name}:**"))
display(Markdown(answer))
qwen_competitors.append(model_name)
qwen_answers.append(answer)

In [None]:
# Qwen Competitor 3: Qwen-Turbo (Fast)
model_name = "qwen-turbo"
response = qwen_client.chat.completions.create(model=model_name, messages=qwen_messages)
answer = response.choices[0].message.content

display(Markdown(f"**{model_name}:**"))
display(Markdown(answer))
qwen_competitors.append(model_name)
qwen_answers.append(answer)

In [None]:
# Qwen Competitor 4: Qwen-Long (Long context)
model_name = "qwen-long"
response = qwen_client.chat.completions.create(model=model_name, messages=qwen_messages)
answer = response.choices[0].message.content

display(Markdown(f"**{model_name}:**"))
display(Markdown(answer))
qwen_competitors.append(model_name)
qwen_answers.append(answer)

# 🏛️ Section 4: Qwen Competition Judging

In [None]:
# Compile all Qwen answers for judging
print(f"📊 Qwen Competition Status: {len(qwen_competitors)} competitors, {len(qwen_answers)} answers")

qwen_together = ""
for index, answer in enumerate(qwen_answers):
    qwen_together += f"# Response from Qwen competitor {index+1}\n\n"
    qwen_together += answer + "\n\n"

print("✅ All responses compiled for judging!")

In [None]:
# Create judge prompt for Qwen competition
qwen_judge_prompt = f"""You are judging a competition between {len(qwen_competitors)} different Qwen model variants.

Question asked: {qwen_question}

Your job is to evaluate each response for clarity and strength of argument, and rank them from best to worst.
Respond with JSON only: {{"results": ["best competitor number", "second best", "third best", ...]}}

Responses from each Qwen competitor:
{qwen_together}

Respond with ONLY the JSON ranking, no markdown or explanations."""

print("⚖️ Judge prompt created for Qwen competition")

In [None]:
# Execute Qwen competition judging
judge_messages = [{"role": "user", "content": qwen_judge_prompt}]
response = qwen_client.chat.completions.create(
    model="qwen-max",  # Use most powerful model as judge
    messages=judge_messages
)

qwen_results = response.choices[0].message.content
print("⚖️ Qwen-Max judgment:")
print(qwen_results)

In [None]:
# Display Qwen competition results
qwen_results_dict = json.loads(qwen_results)
qwen_ranks = qwen_results_dict["results"]

print("🏆 QWEN COMPETITION FINAL RESULTS:")
print("=" * 40)
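medals = ["🥇", "🥈", "🥉"]  # gold, silver, bronze; later ranks get a generic medal
for index, result in enumerate(qwen_ranks):
    competitor = qwen_competitors[int(result)-1]
    medal = medals[index] if index < len(medals) else "🏅"
    print(f"{medal} Rank {index+1}: {competitor}")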

print(f"\n🎉 Qwen competition complete! {len(qwen_competitors)} models evaluated by Qwen-Max")

# 🏛️ Section 5: Mixed-Model Competition Judging (Path A)

**Only run this section if you completed Section 2 (Mixed-Model Competition)!**
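Both judging sections hand the judge's reply directly to `json.loads`. That call fails if the model wraps the JSON in a Markdown fence, and it does not check that every competitor appears exactly once. A more defensive parser might look like the sketch below. `parse_ranking` is a hypothetical helper, and the fence handling covers only the simple case of a fence around the whole reply.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def parse_ranking(raw, n_competitors):
    """Turn the judge's reply into a list of 0-based competitor indices.

    Raises ValueError if the ranking is not a permutation of 1..n_competitors.
    """
    text = raw.strip()
    # Some models wrap JSON in a Markdown fence despite the instructions.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    ranks = [int(rank) for rank in json.loads(text)["results"]]
    expected = list(range(1, n_competitors + 1))
    if sorted(ranks) != expected:
        raise ValueError(f"Expected each of {expected} exactly once, got {ranks}")
    return [rank - 1 for rank in ranks]


reply = FENCE + 'json\n{"results": ["2", "1", "3"]}\n' + FENCE
print(parse_ranking(reply, 3))
```

The returned indices can be used directly, for example `competitors[i]` for each `i` in the result.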

In [None]:
# Check mixed-model competition status and compile answers
print("📊 Mixed-Model Competition Status:")
print(f"Competitors: {competitors}")
print(f"Answers collected: {len(answers)}")

# Compile all answers for judging
together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"

print("✅ All mixed-model responses compiled!")

In [None]:
# Create judge prompt for mixed-model competition  
judge_prompt = f"""You are judging a competition between {len(competitors)} AI models.

Question asked: {question}

Your job is to evaluate each response for clarity and strength of argument, and rank them from best to worst.
Respond with JSON only: {{"results": ["best competitor number", "second best", "third best", ...]}}

Responses from each competitor:
{together}

Respond with ONLY the JSON ranking, no markdown or explanations."""

print("⚖️ Judge prompt created for mixed-model competition")

In [None]:
# Execute mixed-model competition judging
judge_messages = [{"role": "user", "content": judge_prompt}]
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=judge_messages
)

results = response.choices[0].message.content
print("⚖️ GPT-4o-mini judgment:")
print(results)

In [None]:
# Display mixed-model competition results
results_dict = json.loads(results)
ranks = results_dict["results"]

print("🏆 MIXED-MODEL COMPETITION FINAL RESULTS:")
print("=" * 45)
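medals = ["🥇", "🥈", "🥉"]  # gold, silver, bronze; later ranks get a generic medal
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    medal = medals[index] if index < len(medals) else "🏅"
    print(f"{medal} Rank {index+1}: {competitor}")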

print(f"\n🎉 Mixed-model competition complete! {len(competitors)} models evaluated by GPT-4o-mini")

# 🎯 Lab 2 Complete!

## 🏆 What You've Built

You've successfully created a sophisticated **AI model competition system**:

### 🚀 Core Features:
- **Multi-model testing** - Compare different AI providers or model variants
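- **Automated evaluation** - An AI judge ranks the responses automatically (in this lab the judge model is also one of the competitors, so treat the ranking as indicative rather than objective)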
- **Flexible competition formats** - Mixed models or single-provider variants
- **JSON-structured results** - Clean, parseable output format

### 🛤️ Two Paths Available:
- **Path A (Mixed-Model):** Test OpenAI, Anthropic, Google, Qwen together
- **Path B (Qwen-Only):** Compare Qwen model variants (simpler, more reliable)

### 🔧 Key Technologies Mastered:
- **Multiple API integrations** - OpenAI, Gemini, and Qwen through OpenAI-compatible endpoints, plus Anthropic through its own SDK
- **Automated judging systems** using AI as evaluators
- **JSON parsing and results processing**
- **Competition workflow orchestration**

## 🎯 Agentic Patterns Used:
- **Multi-agent evaluation** - Different models compete and judge
- **Structured output processing** - JSON responses for reliable parsing
- **Workflow orchestration** - Systematic testing and evaluation pipeline
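To see how these patterns fit together, here is a compact sketch of the judging step as two functions. It is not the lab's code: `build_judge_prompt`, `judge`, and `fake_judge` are hypothetical names, the prompt wording is a shortened paraphrase of the one used above, and the judge is a stub so the example runs without an API key.

```python
import json


def build_judge_prompt(question, answers):
    """Assemble one judging prompt from the question and the numbered answers."""
    together = "".join(
        f"# Response from competitor {index + 1}\n\n{answer}\n\n"
        for index, answer in enumerate(answers)
    )
    return (
        f"You are judging a competition between {len(answers)} competitors.\n\n"
        f"Question asked: {question}\n\n"
        "Rank the responses from best to worst for clarity and strength of argument.\n"
        'Respond with JSON only: {"results": ["best competitor number", "second best", ...]}\n\n'
        f"Responses from each competitor:\n{together}"
    )


def judge(judge_fn, question, competitors, answers):
    """Ask the judge for a ranking and map it back to competitor names."""
    reply = judge_fn(build_judge_prompt(question, answers))
    ranks = json.loads(reply)["results"]
    return [competitors[int(rank) - 1] for rank in ranks]


def fake_judge(prompt):
    """Offline stand-in for a judge model; always prefers competitor 2."""
    return '{"results": ["2", "1"]}'


print(judge(fake_judge, "Why?", ["model-a", "model-b"], ["first", "second"]))
```

Passing the judge in as a function keeps the workflow the same whichever model does the judging, so swapping in a different judge is a one-line change.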

## 💼 Commercial Applications:
These patterns are valuable for:
- **Model selection** for production systems
- **Performance benchmarking** across different AI providers  
- **Quality assurance** for AI-powered applications
- **Cost-performance optimization** by comparing model effectiveness

**Great work! You now have a robust system for evaluating and comparing AI models.** 🌟