# 🤖 AI Agent Performance Comparison Tool

## Using Mozilla's Any-Agent Framework

This is a command-line tool for developers exploring multiple agent frameworks. If you're evaluating trade-offs in speed, cost, or style and want a quick, reproducible comparison, it gives you a ready-to-run example built on Mozilla's [any-agent](https://github.com/mozilla-ai/any-agent) abstraction.


In [None]:

! pip install 'any-agent[openai,google,mistral]'
! pip install mistralai 

### API Keys Configuration
Enter your API keys to compare frameworks. This example uses Mistral, OpenAI, and Gemini, but you can add more providers, as explained later.


## 🤖 Supported Models & Frameworks

The tool currently supports these any-agent framework combinations:

| Model | Framework | Provider | Strengths |
|-------|-----------|----------|-----------|
| **GPT-4.1 Nano** | `openai` | OpenAI | Ultra-fast, minimal resource usage |
| **GPT-4.1 Mini** | `openai` | OpenAI | Balanced speed and accuracy |
| **Mistral Small** | `tinyagent` | Mistral AI | Cost-effective, multilingual capabilities |
| **Gemini 2.5 Flash** | `google` | Google | Lightning-fast multimodal processing |

In [None]:
import asyncio
import os
import time
from dataclasses import dataclass
from datetime import datetime
from getpass import getpass
from typing import List, Optional

@dataclass
class ModelConfig:
    """Configuration for each model/framework combination"""
    name: str
    framework: str
    model_id: str
    provider: str
    description: str

@dataclass
class TestResult:
    """Results from a single agent test"""
    model_config: str
    framework: str
    model_id: str
    prompt: str
    response: str
    latency_ms: int
    tokens_used: int
    estimated_cost: float
    success: bool
    error: Optional[str]
    timestamp: str
    output: str

# Registry of model/framework combinations, keyed by a short identifier
model_configs = {
    "gpt-4.1-nano": ModelConfig(
        name="GPT 4.1 Nano",
        framework="openai",
        model_id="openai/gpt-4.1-nano",
        provider="OpenAI",
        description="High-quality model with excellent reasoning capabilities",
    ),
    "gpt-4.1-mini": ModelConfig(
        name="GPT 4.1 Mini",
        framework="openai",
        model_id="openai/gpt-4.1-mini",
        provider="OpenAI",
        description="Fast and cost-effective for most tasks",
    ),
    "mistral-small": ModelConfig(
        name="Mistral Small",
        framework="tinyagent",  # Using tinyagent as fallback framework
        model_id="mistral/mistral-small-latest",
        provider="Mistral AI",
        description="Open-source friendly with good performance",
    ),
    "Gemini": ModelConfig(
        name="Gemini 2.5 Flash",
        framework="google",
        model_id="gemini/gemini-2.5-flash",
        provider="Google",
        description="Balanced model with strong reasoning (Google Gemini 2.5 Flash)",
    ),
}

for key in ("MISTRAL_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY"):
    if key not in os.environ:
        print(f"{key} not found in environment!")
        api_key = getpass(f"Please enter your {key}: ")
        os.environ[key] = api_key
        print(f"{key} set for this session!")
    else:
        print(f"{key} found in environment.")

def get_user_input(prompt: str) -> str:
    """Prompt the user and return the stripped input (empty string if cancelled)."""
    try:
        return input(f"\n{prompt}").strip()
    except KeyboardInterrupt:
        print("\nOperation cancelled")
        return ""


### Easy Extension

```python
# Add a new model by registering another entry in model_configs
model_configs["new-model"] = ModelConfig(
    name="New Model",
    framework="new_framework",
    model_id="new-model-id",
    provider="New Provider",
    description="Description of capabilities",
)
```
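
A minimal sketch of how registration could guard against accidentally overwriting an existing entry. `register_model` is a hypothetical helper, not part of this notebook or of any-agent, and the `ModelConfig` dataclass is repeated here only so the block runs on its own. The model values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Same fields as the ModelConfig dataclass defined earlier."""
    name: str
    framework: str
    model_id: str
    provider: str
    description: str


def register_model(registry: dict, key: str, config: ModelConfig) -> None:
    """Hypothetical helper: add a config, refusing to overwrite an existing key."""
    if key in registry:
        raise KeyError(f"Model key already registered: {key}")
    registry[key] = config


registry: dict = {}
register_model(
    registry,
    "new-model",
    ModelConfig(
        name="New Model",
        framework="new_framework",  # placeholder, not a real any-agent framework
        model_id="new-model-id",  # placeholder
        provider="New Provider",
        description="Description of capabilities",
    ),
)
print(sorted(registry))
```

Raising on a duplicate key is one possible design choice; silently replacing the entry would also work if overriding defaults is the intended use.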

In [None]:
def list_models() -> List[str]:
    """Print the configured models and return their keys."""
    print("We are going to run your prompt against the following models:")
    print()

    model_keys = list(model_configs.keys())
    for i, key in enumerate(model_keys, start=1):
        config = model_configs[key]
        print(f"{i}. {config.name} ({config.provider})")
        print(f"   Framework: {config.framework}")

    return model_keys


## 🔧 Technical Architecture

### Key Design Decisions

#### 1. Any-Agent Integration

- Uses Mozilla's framework abstraction for consistency
- Easy to add new frameworks as any-agent supports them

#### 2. Async Performance Testing

```python
# Concurrent execution for fair comparison
tasks = [test_agent_performance(model, prompt) for model in selected_models]
results = await asyncio.gather(*tasks, return_exceptions=True)
```
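
To show what `return_exceptions=True` changes, here is a self-contained sketch that uses simulated agents in place of real API calls. `fake_agent`, its delays, and the model names are illustrative stand-ins, not any-agent calls.

```python
import asyncio


async def fake_agent(name: str, delay: float, fail: bool = False) -> str:
    """Stand-in for a model call: sleeps, then returns or raises."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} ok"


async def run_all() -> list:
    tasks = [
        fake_agent("model-a", 0.02),
        fake_agent("model-b", 0.01, fail=True),
        fake_agent("model-c", 0.03),
    ]
    # Exceptions are returned in place, in task order, instead of being raised.
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(run_all())
for result in results:
    status = "error" if isinstance(result, Exception) else "ok"
    print(f"{status}: {result}")
```

Inside a notebook, where an event loop is already running, `await run_all()` takes the place of `asyncio.run(run_all())`.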

#### 3. Comprehensive Error Handling

- Graceful degradation when models fail
- Detailed error reporting for debugging
- Continues testing even if some models error out

In [None]:
async def test_agent_performance(model_key: str, prompt: str) -> TestResult:
            """Test a single agent configuration and measure performance"""
            model_config = model_configs[model_key]
            start_time = time.time()
            print(f"TESTING {model_config.name} ({model_config.framework})")
            try:
                # Import any-agent components
                from any_agent import AgentConfig, AnyAgent
                
                # Create agent configuration
                agent_config = AgentConfig(
                    model_id=model_config.model_id,
                    instructions="You are a helpful AI assistant. Provide clear, concise, and accurate responses.",
                    tools=[]  # No tools for basic performance testing
                )
                
                # Create agent with specified framework
                agent = await AnyAgent.create_async(model_config.framework, agent_config)
                
                # Show progress indicator
                print(f"  → {model_config.name}: Processing", end="", flush=True)
                
                # Run the agent
                agent_trace = await agent.run_async(prompt)
        
                # Convert the trace duration to whole milliseconds (TestResult.latency_ms is an int)
                latency_ms = int(agent_trace.duration.total_seconds() * 1000)
                
                # Extract response from trace
                response = str(agent_trace.final_response) if hasattr(agent_trace, 'final_response') else str(agent_trace)
                output = agent_trace.final_output if hasattr(agent_trace, 'final_output') else ""

                estimated_cost = agent_trace.cost.total_cost
                
                print(f"\r  ✓ {model_config.name}: {latency_ms}ms")
                
                return TestResult(
                    model_config=model_key,
                    framework=model_config.framework,
                    model_id=model_config.model_id,
                    prompt=prompt,
                    response=response,
                    latency_ms=latency_ms,
                    tokens_used=agent_trace.tokens.total_tokens,
                    estimated_cost=estimated_cost,
                    success=True,
                    error=None,
                    timestamp=datetime.now().isoformat(),
                    output=output
                )
                
            except Exception as e:
                end_time = time.time()
                latency_ms = int((end_time - start_time) * 1000)
                
                print(f"\r  ✗ {model_config.name}: Failed ({str(e)[:500]}...)")
                
                return TestResult(
                    model_config=model_key,
                    framework=model_config.framework,
                    model_id=model_config.model_id,
                    prompt=prompt,
                    response="",
                    latency_ms=latency_ms,
                    tokens_used=0,
                    estimated_cost=0,
                    success=False,
                    error=str(e),
                    timestamp=datetime.now().isoformat(),
                    output=""
                )


## 📊 Sample Output

Example run for prompt: "Write a haiku on Uranus"

```text
═══════════════════════════════════════
  PERFORMANCE COMPARISON RESULTS
═══════════════════════════════════════

1. Mistral Small
   Framework: tinyagent
   Latency:   2743.4970000000003ms
   Cost:      $0.0000
   Tokens:    127
   Output:  Icy blue jewel,
            Uranus spins on its side,
            Mysteries untold. 

2. GPT-4.1 Nano
   Framework: openai
   Latency:   2458.0440000000003ms
   Cost:      $0.0000
   Tokens:    53
   Output:  Blue planet afar,  
            Majestic and cold in the night,  
            Uranus whispers. 

3. GPT-4.1 Mini
   Framework: openai
   Latency:   2749.108ms
   Cost:      $0.0000
   Tokens:    53
   Output:  Icy giant spins,  
            Skyward blue-green mystery,  
            Rings in tilted grace. 

4. Gemini 2.5 Flash
   Framework: google
   Latency:   1955.3020000000001ms
   Cost:      $0.0008
   Tokens:    350
   Output:  Blue-green ice giant,
            Tilted world, so cold and deep,
            Whispers in the dark.   
```
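
A small sketch of how the raw latencies above could be ranked and rounded for display. The figures are copied from the sample run; the tuple layout and variable names are illustrative and are not the notebook's own data structures.

```python
from statistics import mean

# (model name, latency in ms, total tokens), copied from the sample run above
sample_run = [
    ("Mistral Small", 2743.4970000000003, 127),
    ("GPT-4.1 Nano", 2458.0440000000003, 53),
    ("GPT-4.1 Mini", 2749.108, 53),
    ("Gemini 2.5 Flash", 1955.3020000000001, 350),
]

# Rank fastest first and round to whole milliseconds for readability.
ranked = sorted(sample_run, key=lambda row: row[1])
for rank, (name, latency_ms, tokens) in enumerate(ranked, start=1):
    print(f"{rank}. {name:<18} {latency_ms:>6.0f} ms  {tokens:>4} tokens")

mean_latency_ms = mean(latency for _, latency, _ in sample_run)
print(f"Mean latency: {mean_latency_ms:.0f} ms")
```

A single run per model is a small sample, so differences of a few milliseconds between models are probably not meaningful on their own.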

In [None]:

async def display_results(results: List[TestResult]) -> None:
            """Display formatted results with analysis"""
            print("PERFORMANCE COMPARISON RESULTS")
            
            # Filter successful results for analysis
            successful_results = [r for r in results if r.success]
            
            if not successful_results:
                print("❌ No successful results to display")
                return
            
            # Sort by latency for ranking
            successful_results.sort(key=lambda x: x.latency_ms)
            
            print()
            for i, result in enumerate(successful_results):
                model_config = model_configs[result.model_config]
                rank = i + 1
               
                
                print(f"{rank}. {model_config.name}")
                print(f"   Framework: {result.framework}")
                print(f"   Latency:   {result.latency_ms:.0f}ms")
                print(f"   Cost:      ${result.estimated_cost:.4f}")
                print(f"   Tokens:    {result.tokens_used}")
                print(f"   Output:  {result.output} ")
                print()
            
            # display_analysis(successful_results)
            
            # Show failed results if any
            failed_results = [r for r in results if not r.success]
            if failed_results:
                print(f"\n⚠️  {len(failed_results)} model(s) failed:")
                for result in failed_results:
                    model_config = model_configs[result.model_config]
                    print(f"   ✗ {model_config.name}: {result.error}")

async def run_performance_comparison() -> None:
            """Main performance comparison workflow"""
            models = list_models()

            prompt = get_user_input("Enter prompt: ")
            # await select_prompt(modelconfigs)
            
            print("RUNNING PERFORMANCE COMPARISON")
            print(f"Testing {len(models)} models with prompt:")
            print(f"'{prompt[:80]}...'")
            print()
            
            start_time = time.time()
            
            # Run tests concurrently for realistic comparison
            tasks = [test_agent_performance(model, prompt) for model in models]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            
            # Handle any exceptions
            valid_results = []
            for i, result in enumerate(results):
                if isinstance(result, Exception):
                    model_config = model_configs[models[i]]
                    print(f"  ✗ {model_config.name}: Exception occurred: {result}")
                else:
                    valid_results.append(result)
            
            total_time = time.time() - start_time
            print(f"\nTotal test time: {total_time:.2f}s")
            
            if valid_results:
                await display_results(valid_results)
                

await run_performance_comparison()


## 🚀 Roadmap

1. Add Anthropic support, plus Llamafile support for local models
2. Integrate official any-agent.evaluation module for trace-based scoring
3. Export results to CSV/JSON for dashboards
4. Statistical analysis across multiple prompts
5. CLI flags for batch testing and exporting
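
Roadmap item 3 could start from something like the following sketch, which serializes result records with `dataclasses.asdict`. The trimmed `ResultRow` dataclass, the helper names, and the sample values are all hypothetical; the notebook's `TestResult` has more fields.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class ResultRow:
    """Hypothetical, trimmed-down stand-in for TestResult."""
    model_config: str
    latency_ms: int
    tokens_used: int
    success: bool


def to_json(rows: list) -> str:
    """Serialize result rows as a JSON array of objects."""
    return json.dumps([asdict(row) for row in rows], indent=2)


def to_csv(rows: list) -> str:
    """Serialize result rows as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ResultRow)])
    writer.writeheader()
    writer.writerows(asdict(row) for row in rows)
    return buffer.getvalue()


# Made-up values for illustration only
rows = [
    ResultRow("model-a", 1200, 50, True),
    ResultRow("model-b", 900, 0, False),
]
json_text = to_json(rows)
csv_text = to_csv(rows)
print(csv_text)
```

Returning strings keeps the sketch free of file-system side effects; writing them to disk is a one-line addition wherever the export is wired in.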

## 🤝 Contributing

We welcome contributions! Here are a few ways to help:

- Add models/frameworks by extending the `model_configs` dictionary with new `ModelConfig` entries.
- Improve evaluation with new scoring rubrics or by integrating any-agent.evaluation.
- Suggest prompts for testing creative, technical, or multilingual cases.
- Open issues/PRs for bugs, docs, or feature ideas.