# MCP Benchmark SDK - Harness Example

This notebook demonstrates how to use the MCP Benchmark SDK to run a test harness and save results.

## Overview

The SDK provides a simple API for:
- Loading test scenarios from JSON harness files
- Running benchmarks across multiple models
- Collecting and saving detailed results
- Analyzing success rates and conversation history


In [1]:
from pathlib import Path

from dotenv import load_dotenv
from mcp_benchmark_sdk import TestHarness, TestHarnessConfig, MCPConfig, create_agent

# Load environment variables (e.g., API keys) from a local .env file
load_dotenv()

# Configure MCP and harness
config = TestHarnessConfig(
    mcps=[MCPConfig(name="jira", url="http://localhost:8015/mcp", transport="streamable_http")],
    sql_runner_url="http://localhost:8015/api/sql-runner"
)

# Load harness and run
harness = TestHarness(harness_path=Path("9_tasks/task1.json"), config=config)
results = await harness.run(models=["claude-sonnet-4-5"], agent_factory=create_agent)

# Check results
for result in results:
    print(f"{'✅' if result.success else '❌'} {result.scenario_name} - {result.model}")


✅ pm review task - claude-sonnet-4-5


## Setup

First, import the necessary modules from the SDK:


In [None]:
import asyncio
import json
from pathlib import Path
from datetime import datetime

from mcp_benchmark_sdk import (
    TestHarness,
    TestHarnessConfig,
    MCPConfig,
    create_agent,
    RunObserver,
)

## Configuration

Configure the MCP server(s) and test harness settings:


In [3]:
# Configure MCP Server (adjust URL to match your setup)
mcp_config = MCPConfig(
    name="jira",
    url="http://localhost:8015/mcp",
    transport="streamable_http",
)

# Configure test harness
config = TestHarnessConfig(
    mcps=[mcp_config],
    sql_runner_url="http://localhost:8015/api/sql-runner",  # For database verifiers
    max_steps=1000,
    tool_call_limit=1000,
    temperature=0.1,
    runs_per_scenario=1,  # How many times to run each scenario
    max_concurrent_runs=5,  # Limit concurrent executions
)

# Models to test (adjust based on your API keys)
models = [
    "claude-sonnet-4-5",
    # "gpt-4o",
    # "gemini-2.5-pro",
    # "grok-4",
]

# Harness file path
harness_path = Path("9_tasks/task1.json")

print("Configuration complete!")
print(f"MCP Server: {mcp_config.url}")
print(f"Models: {', '.join(models)}")
print(f"Harness: {harness_path}")


Configuration complete!
MCP Server: http://localhost:8015/mcp
Models: claude-sonnet-4-5
Harness: 9_tasks/task1.json


## Simple Harness Example

Let's start with the simplest possible example - creating and running a minimal harness:


In [4]:
# Create a simple harness file
simple_harness = {
    "scenarios": [
        {
            "scenario_id": "simple_test",
            "name": "Simple Issue Creation Test",
            "description": "Test creating a single issue",
            "prompts": [
                {
                    "prompt_text": "Create a bug issue in project DEMO with title 'Test Bug' and description 'This is a test'",
                    "expected_tools": ["create_issue"],
                    "verifier": [
                        {
                            "verifier_type": "database_state",
                            "name": "issue_created",
                            "validation_config": {
                                "query": "SELECT COUNT(*) FROM issue WHERE summary = 'Test Bug'",
                                "expected_value": 1,
                                "comparison_type": "equals"
                            }
                        }
                    ]
                }
            ]
        }
    ]
}

# Save to file
simple_harness_path = Path("simple_harness.json")
with open(simple_harness_path, "w") as f:
    json.dump(simple_harness, f, indent=2)

print(f"✅ Created simple harness: {simple_harness_path}")
print("\nHarness structure:")
print(f"  - Scenarios: {len(simple_harness['scenarios'])}")
print(f"  - Scenario ID: {simple_harness['scenarios'][0]['scenario_id']}")
print(f"  - Prompts: {len(simple_harness['scenarios'][0]['prompts'])}")
print(f"  - Verifiers: {len(simple_harness['scenarios'][0]['prompts'][0]['verifier'])}")


✅ Created simple harness: simple_harness.json

Harness structure:
  - Scenarios: 1
  - Scenario ID: simple_test
  - Prompts: 1
  - Verifiers: 1


### Run the Simple Harness

Now let's run this simple harness with minimal code:


In [5]:
async def run_simple_example():
    """Minimal example - just 3 lines of code!"""
    
    # 1. Create harness
    harness = TestHarness(harness_path=simple_harness_path, config=config)
    
    # 2. Run it
    results = await harness.run(models=["claude-sonnet-4-5"], agent_factory=create_agent)
    
    # 3. Check results
    for result in results:
        print(f"\n{'='*60}")
        print(f"Model: {result.model}")
        print(f"Scenario: {result.scenario_name}")
        print(f"Success: {'✅ PASSED' if result.success else '❌ FAILED'}")
        print(f"{'='*60}")
        
        # Show verifier results
        for vr in result.verifier_results:
            status = "✅" if vr.success else "❌"
            msg = vr.error if vr.error else f"Expected: {vr.expected_value}, Got: {vr.actual_value}"
            print(f"{status} {vr.name}: {msg}")
    
    return results

# Uncomment to run the simple example:
# simple_results = await run_simple_example()
print("Simple example ready (uncomment to run)")


Simple example ready (uncomment to run)


### Understanding the Simple Harness Format

The minimal harness file contains:

**Required Fields:**
- `scenarios`: Array of test scenarios
  - `scenario_id`: Unique identifier
  - `name`: Human-readable name
  - `description`: What this scenario tests
  - `prompts`: Array of prompts to send to the agent
    - `prompt_text`: The instruction for the agent
    - `verifier`: Array of checks to validate success
      - `verifier_type`: Type of verification (e.g., "database_state")
      - `validation_config`: Configuration for the verifier
        - `query`: SQL query to check state
        - `expected_value`: What value we expect
        - `comparison_type`: How to compare (equals, greater_than, etc.)

**Optional Fields:**
- `expected_tools`: List of tools the agent should use
- `restricted_tools`: Tools the agent cannot use
- `system_prompt`: Custom system prompt (overrides default)
- `metadata`: Additional metadata about the scenario


### More Simple Harness Examples

Here are a few more minimal harness examples for different scenarios:


In [6]:
# Example 1: Simple issue assignment
assignment_harness = {
    "scenarios": [{
        "scenario_id": "assign_issue",
        "name": "Assign Issue",
        "description": "Test assigning an issue to a user",
        "prompts": [{
            "prompt_text": "Assign issue WEB-1 to user Alice",
            "expected_tools": ["get_issue", "update_issue"],
            "verifier": [{
                "verifier_type": "database_state",
                "name": "issue_assigned",
                "validation_config": {
                    "query": "SELECT assignee_id FROM issue WHERE key = 'WEB-1'",
                    "expected_value": 1,  # Assuming Alice has ID 1
                    "comparison_type": "equals"
                }
            }]
        }]
    }]
}

# Example 2: Simple comment addition
comment_harness = {
    "scenarios": [{
        "scenario_id": "add_comment",
        "name": "Add Comment",
        "description": "Test adding a comment to an issue",
        "prompts": [{
            "prompt_text": "Add a comment to WEB-1 saying 'Working on this now'",
            "expected_tools": ["add_issue_comment"],
            "verifier": [{
                "verifier_type": "database_state",
                "name": "comment_added",
                "validation_config": {
                    "query": "SELECT COUNT(*) FROM comment c JOIN issue i ON c.issue_id = i.id WHERE i.key = 'WEB-1' AND c.body LIKE '%Working on this now%'",
                    "expected_value": 1,
                    "comparison_type": "equals"
                }
            }]
        }]
    }]
}

# Example 3: Multiple verifiers (checking multiple things)
multi_check_harness = {
    "scenarios": [{
        "scenario_id": "create_with_label",
        "name": "Create Issue with Label",
        "description": "Create issue and verify both creation and label",
        "prompts": [{
            "prompt_text": "Create a bug issue 'Login Error' in DEMO project with label 'urgent'",
            "expected_tools": ["create_issue", "update_issue"],
            "verifier": [
                {
                    "verifier_type": "database_state",
                    "name": "issue_created",
                    "validation_config": {
                        "query": "SELECT COUNT(*) FROM issue WHERE summary = 'Login Error'",
                        "expected_value": 1,
                        "comparison_type": "equals"
                    }
                },
                {
                    "verifier_type": "database_state",
                    "name": "label_added",
                    "validation_config": {
                        "query": "SELECT COUNT(*) FROM issue_label_association ila JOIN issue_label il ON ila.label_id = il.id JOIN issue i ON ila.issue_id = i.id WHERE i.summary = 'Login Error' AND il.name = 'urgent'",
                        "expected_value": 1,
                        "comparison_type": "equals"
                    }
                }
            ]
        }]
    }]
}

print("✅ Created 3 example harness structures:")
print("  1. assignment_harness - Assign an issue")
print("  2. comment_harness - Add a comment")
print("  3. multi_check_harness - Multiple verifiers")
print("\nEach can be saved to a file and run the same way!")


✅ Created 3 example harness structures:
  1. assignment_harness - Assign an issue
  2. comment_harness - Add a comment
  3. multi_check_harness - Multiple verifiers

Each can be saved to a file and run the same way!
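
Saving one of these dictionaries follows the same `json.dump` pattern used for `simple_harness.json`. The helper below is a small sketch; `save_harness` and the stand-in dictionary are hypothetical names introduced here so the snippet runs on its own, and the temporary directory is only for the demonstration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_harness(harness: dict[str, Any], path: Path) -> Path:
    """Write a harness dictionary to a JSON file and return the path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(harness, f, indent=2)
    return path


# Stand-in for one of the dictionaries defined above (e.g. comment_harness).
demo_harness = {"scenarios": [{"scenario_id": "add_comment", "prompts": []}]}

# A temporary directory keeps the demo from cluttering the working directory.
out_dir = Path(tempfile.mkdtemp())
saved_path = save_harness(demo_harness, out_dir / "comment_harness.json")
print(f"Saved harness to: {saved_path}")
```

The returned path can then be passed as `harness_path` to `TestHarness`, exactly as with `simple_harness_path`.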


### Quick Reference: Running Any Harness

The pattern for running ANY harness is always the same:


In [7]:
# Quick reference template - copy and modify this!
async def run_any_harness(harness_file_path, model_name="claude-sonnet-4-5"):
    """
    Universal harness runner - works with any harness file!
    
    Args:
        harness_file_path: Path to your harness JSON file
        model_name: Model to test (default: claude-sonnet-4-5)
    """
    # Create harness
    harness = TestHarness(
        harness_path=Path(harness_file_path),
        config=config
    )
    
    # Run and return results
    results = await harness.run(
        models=[model_name],
        agent_factory=create_agent
    )
    
    # Print summary
    for result in results:
        status = "✅ PASSED" if result.success else "❌ FAILED"
        print(f"{status} | {result.scenario_name} | {result.model}")
    
    return results

# Usage examples:
# results = await run_any_harness("simple_harness.json")
# results = await run_any_harness("9_tasks/task1.json", "gpt-4o")
# results = await run_any_harness("my_custom_test.json", "gemini-2.5-pro")

print("Quick reference function defined!")


Quick reference function defined!


## Optional: Create a Progress Observer

You can create a custom observer to track progress during execution:


In [8]:
class SimpleProgressObserver(RunObserver):
    """Simple observer that prints progress updates."""
    
    def __init__(self):
        self.tool_count = 0
        self.message_count = 0
    
    async def on_message(self, role, content, metadata=None):
        """Called when a message is sent/received."""
        self.message_count += 1
        # Truncate long messages for cleaner output
        content_preview = content[:80] + "..." if len(content) > 80 else content
        print(f"  [{role}] {content_preview}")
    
    async def on_tool_call(self, tool_name, arguments, result, is_error=False):
        """Called when a tool is executed."""
        self.tool_count += 1
        status = "❌" if is_error else "✅"
        print(f"  {status} Tool: {tool_name}")
    
    async def on_status(self, message, level="info"):
        """Called for status updates."""
        print(f"  [{level.upper()}] {message}")
    
    async def on_complete(self, success, metadata=None):
        """Called when execution completes."""
        print(f"\n  Completed: {success}")
        print(f"  Total messages: {self.message_count}")
        print(f"  Total tool calls: {self.tool_count}")

print("Observer class defined!")


Observer class defined!
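
Observers can also collect data instead of printing it. The sketch below tallies calls and errors per tool name, reusing the four callback signatures from `SimpleProgressObserver`. `ToolUsageObserver` is a hypothetical class, and the `ImportError` fallback exists only so the snippet runs where the SDK is not installed.

```python
import asyncio
from collections import Counter

try:
    from mcp_benchmark_sdk import RunObserver
except ImportError:  # Fallback so the sketch runs without the SDK installed
    RunObserver = object


class ToolUsageObserver(RunObserver):
    """Hypothetical observer that counts calls and errors per tool name."""

    def __init__(self):
        self.calls = Counter()
        self.errors = Counter()

    async def on_message(self, role, content, metadata=None):
        pass  # Messages are ignored in this sketch

    async def on_tool_call(self, tool_name, arguments, result, is_error=False):
        self.calls[tool_name] += 1
        if is_error:
            self.errors[tool_name] += 1

    async def on_status(self, message, level="info"):
        pass

    async def on_complete(self, success, metadata=None):
        pass


async def demo() -> ToolUsageObserver:
    """Feed the observer a few fake tool calls."""
    observer = ToolUsageObserver()
    await observer.on_tool_call("get_transitions", {}, result=None)
    await observer.on_tool_call("do_transition", {}, result=None, is_error=True)
    await observer.on_tool_call("do_transition", {}, result=None)
    return observer


# In a notebook cell, use `observer = await demo()` instead of asyncio.run().
observer = asyncio.run(demo())
print("Calls:", dict(observer.calls))
print("Errors:", dict(observer.errors))
```

Because each run receives its own observer from the factory, keeping a reference to the instances (for example by appending them to a list inside the factory lambda) would be needed to read the counters afterwards.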


## Create and Run Test Harness

Now we'll create the test harness and run the benchmarks:


In [9]:
async def run_harness():
    """Run the test harness and return results."""
    
    print("="*80)
    print("INITIALIZING TEST HARNESS")
    print("="*80)
    
    # Create test harness
    harness = TestHarness(
        harness_path=harness_path,
        config=config,
    )
    
    print(f"\nLoaded {len(harness.scenarios)} scenario(s) from harness")
    for i, scenario in enumerate(harness.scenarios, 1):
        print(f"  {i}. {scenario.name} ({scenario.scenario_id})")
    
    # Optional: Add observer for progress tracking
    harness.add_observer_factory(lambda: SimpleProgressObserver())
    
    print("\n" + "="*80)
    print("RUNNING BENCHMARKS")
    print("="*80)
    
    # Run the benchmarks
    results = await harness.run(
        models=models,
        agent_factory=create_agent,
    )
    
    print("\n" + "="*80)
    print("BENCHMARK COMPLETE")
    print("="*80)
    print(f"Total runs completed: {len(results)}\n")
    
    return results

# Run the harness
results = await run_harness()


INITIALIZING TEST HARNESS

Loaded 1 scenario(s) from harness
  1. pm review task (pm_review_task_1)

RUNNING BENCHMARKS
  [INFO] Initialized with 49 tools from 1 MCP(s)
  [system] You are an autonomous project-management agent operating inside an MCP server.
Y...
  [user] I'm Emily Davis. I just reviewed the testing results on WEB-3 and it's working c...
  [INFO] Step 1/1000
  [assistant] I'll help you complete these tasks for WEB-3 and WEB-9. Let me start by gatherin...
  ✅ Tool: get_transitions
  ✅ Tool: find_users
  [INFO] Step 2/1000
  [assistant] Perfect! Now I'll proceed with the tasks:
  ❌ Tool: do_transition
  ✅ Tool: update_issue
  ✅ Tool: add_issue_comment
  ✅ Tool: assign_issue
  ✅ Tool: add_issue_comment
  [INFO] Step 3/1000
  [assistant] The transition requires a comment. Let me retry with the comment included:
  ✅ Tool: do_transition
  [INFO] Step 4/1000
  [assistant] Excellent! All tasks have been completed successfully:

## ✅ WEB-3 Updates Compl...
  [INFO] Agent comple

## Analyze Results

Let's examine the results and calculate success rates:


In [10]:
print("="*80)
print("RESULTS SUMMARY")
print("="*80)

# Group results by model
results_by_model = {}
for result in results:
    if result.model not in results_by_model:
        results_by_model[result.model] = []
    results_by_model[result.model].append(result)

# Calculate statistics per model
for model, model_results in results_by_model.items():
    total = len(model_results)
    successful = sum(1 for r in model_results if r.success)
    failed = total - successful
    success_rate = (successful / total * 100) if total > 0 else 0
    
    print(f"\n{model}:")
    print(f"  Total runs: {total}")
    print(f"  Successful: {successful} ✅")
    print(f"  Failed: {failed} ❌")
    print(f"  Success rate: {success_rate:.1f}%")
    
    # Show details for each run
    for result in model_results:
        status = "✅" if result.success else "❌"
        print(f"    {status} {result.scenario_name} (run #{result.run_number})")
        
        # Show verifier results
        for vr in result.verifier_results:
            vr_status = "✅" if vr.success else "❌"
            vr_name = vr.name or "unnamed"
            msg = vr.error if vr.error else f"Expected: {vr.expected_value}, Got: {vr.actual_value}"
            print(f"        {vr_status} {vr_name}: {msg}")


RESULTS SUMMARY

claude-sonnet-4-5:
  Total runs: 1
  Successful: 1 ✅
  Failed: 0 ❌
  Success rate: 100.0%
    ✅ pm review task (run #1)
        ✅ DatabaseVerifier: Expected: 3, Got: 3
        ✅ DatabaseVerifier: Expected: 1, Got: 1
        ✅ DatabaseVerifier: Expected: 1, Got: 1
        ✅ DatabaseVerifier: Expected: 1, Got: 1
        ✅ DatabaseVerifier: Expected: 2, Got: 2
        ✅ DatabaseVerifier: Expected: 1, Got: 1


## Save Results to JSON

Save the detailed results to JSON files for later analysis:


In [11]:
# Create results directory
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_dir = Path(f"sdk_results_{timestamp}")
results_dir.mkdir(exist_ok=True)

print(f"Saving results to: {results_dir}\n")

# Save individual result files
for result in results:
    # Create filename: {scenario_id}_{model}_{run_number}.json
    filename = f"{result.scenario_id}_{result.model.replace('/', '-')}_{result.run_number}.json"
    filepath = results_dir / filename
    
    # Convert result to dictionary (includes conversation history)
    result_dict = result.to_dict()
    
    # Save to JSON
    with open(filepath, "w") as f:
        json.dump(result_dict, f, indent=2)
    
    status = "✅" if result.success else "❌"
    print(f"{status} Saved: {filename}")

# Create summary file
summary = {
    "timestamp": timestamp,
    "harness_file": str(harness_path),
    "models": models,
    "total_runs": len(results),
    "results_by_model": {},
}

for model, model_results in results_by_model.items():
    total = len(model_results)
    successful = sum(1 for r in model_results if r.success)
    summary["results_by_model"][model] = {
        "total": total,
        "successful": successful,
        "failed": total - successful,
        "success_rate": (successful / total * 100) if total > 0 else 0,
        "scenarios": [
            {
                "scenario_id": r.scenario_id,
                "scenario_name": r.scenario_name,
                "run_number": r.run_number,
                "success": r.success,
            }
            for r in model_results
        ],
    }

summary_file = results_dir / "summary.json"
with open(summary_file, "w") as f:
    json.dump(summary, f, indent=2)

print(f"\n✅ Saved summary: {summary_file.name}")
print(f"\n{'='*80}")
print(f"All results saved to: {results_dir}")
print(f"{'='*80}")


Saving results to: sdk_results_20251106_202239

✅ Saved: pm_review_task_1_claude-sonnet-4-5_1.json

✅ Saved summary: summary.json

All results saved to: sdk_results_20251106_202239
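
To pick the results up again in a later session, the summary file can be read back with the standard library alone. This is a minimal sketch that assumes the `sdk_results_<timestamp>/summary.json` layout written by the cell above; `latest_results_dir`, `load_summary`, and `print_success_rates` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Optional


def latest_results_dir(base: Path) -> Optional[Path]:
    """Return the newest sdk_results_* directory that contains a summary.json.

    The timestamp format YYYYMMDD_HHMMSS sorts chronologically as text.
    """
    candidates = sorted(
        d for d in base.glob("sdk_results_*") if (d / "summary.json").is_file()
    )
    return candidates[-1] if candidates else None


def load_summary(results_dir: Path) -> dict:
    """Read summary.json from a results directory."""
    with open(results_dir / "summary.json", encoding="utf-8") as f:
        return json.load(f)


def print_success_rates(summary: dict) -> None:
    """Print one line per model using the keys written to summary.json."""
    for model, stats in summary["results_by_model"].items():
        print(f"{model}: {stats['successful']}/{stats['total']} ({stats['success_rate']:.1f}%)")


latest = latest_results_dir(Path("."))
if latest is None:
    print("No sdk_results_* directory with a summary.json found")
else:
    print_success_rates(load_summary(latest))
```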


## Examine Conversation History

Let's look at the conversation history for one of the runs:


In [12]:
# Pick the first result to examine
if results:
    example_result = results[0]
    
    print("="*80)
    print(f"CONVERSATION HISTORY: {example_result.scenario_name}")
    print(f"Model: {example_result.model}")
    print(f"Success: {example_result.success}")
    print("="*80)
    
    # Get conversation history
    conversation = example_result.get_conversation_history()
    
    print(f"\nTotal conversation entries: {len(conversation)}\n")
    
    # Display first few entries
    for i, entry in enumerate(conversation[:10], 1):
        print(f"\n[{i}] Type: {entry.get('type')}")
        
        if entry["type"] == "message":
            role = entry.get("role", "unknown")
            content = entry.get("content", "")
            # Truncate long content
            content_preview = content[:200] + "..." if len(content) > 200 else content
            print(f"    Role: {role}")
            print(f"    Content: {content_preview}")
        
        elif entry["type"] == "tool_call":
            tool = entry.get("tool", "unknown")
            args = entry.get("args", {})
            print(f"    Tool: {tool}")
            print(f"    Args: {args}")
        
        elif entry["type"] == "tool_result":
            tool = entry.get("tool", "unknown")
            output = entry.get("output", {})
            # Truncate large outputs
            output_str = str(output)[:200] + "..." if len(str(output)) > 200 else str(output)
            print(f"    Tool: {tool}")
            print(f"    Output: {output_str}")
    
    if len(conversation) > 10:
        print(f"\n... ({len(conversation) - 10} more entries)")
else:
    print("No results to display.")


CONVERSATION HISTORY: pm review task
Model: claude-sonnet-4-5
Success: True

Total conversation entries: 30


[1] Type: message
    Role: system
    Content: You are an autonomous project-management agent operating inside an MCP server.
You receive: (a) a user request and (b) a set of tool definitions (schemas, params, return types).
Your goal is to comple...

[2] Type: message
    Role: user
    Content: I'm Emily Davis. I just reviewed the testing results on WEB-3 and it's working correctly now. I need to mark it as resolved, add 'ready-for-deployment' label to it and leave a comment 'Registration fe...

[3] Type: message
    Role: assistant
    Content: I'll help you complete these tasks for WEB-3 and WEB-9. Let me start by gathering the necessary information.

[4] Type: tool_call
    Tool: get_transitions
    Args: {'issueIdOrKey': 'WEB-3'}

[5] Type: tool_call
    Tool: find_users
    Args: {'query': 'Sarah Johnson'}

[6] Type: tool_call
    Tool: get_transitions
    Args: {'issue

## Additional Analysis

You can perform additional analysis on the results:


In [13]:
import pandas as pd

# Create a DataFrame for easy analysis
data = []
for result in results:
    # Count conversation elements
    conversation = result.get_conversation_history()
    tool_calls = sum(1 for entry in conversation if entry["type"] == "tool_call")
    messages = sum(1 for entry in conversation if entry["type"] == "message")
    
    # Count verifier results
    verifiers_passed = sum(1 for vr in result.verifier_results if vr.success)
    verifiers_total = len(result.verifier_results)
    
    data.append({
        "model": result.model,
        "scenario": result.scenario_name,
        "run": result.run_number,
        "success": result.success,
        "tool_calls": tool_calls,
        "messages": messages,
        "verifiers_passed": verifiers_passed,
        "verifiers_total": verifiers_total,
    })

df = pd.DataFrame(data)

print("="*80)
print("RESULTS DATAFRAME")
print("="*80)
print(df)

print("\n" + "="*80)
print("STATISTICS BY MODEL")
print("="*80)
print(df.groupby("model").agg({
    "success": ["sum", "count", "mean"],
    "tool_calls": ["mean", "min", "max"],
    "messages": ["mean", "min", "max"],
}))

# Save DataFrame to CSV
csv_file = results_dir / "results_analysis.csv"
df.to_csv(csv_file, index=False)
print(f"\n✅ Saved analysis to: {csv_file}")


RESULTS DATAFRAME
               model        scenario  run  success  tool_calls  messages  \
0  claude-sonnet-4-5  pm review task    1     True          16         6   

   verifiers_passed  verifiers_total  
0                 6                6  

STATISTICS BY MODEL
                  success            tool_calls         messages        
                      sum count mean       mean min max     mean min max
model                                                                   
claude-sonnet-4-5       1     1  1.0       16.0  16  16      6.0   6   6

✅ Saved analysis to: sdk_results_20251106_202239/results_analysis.csv


## Running Multiple Harness Files

You can also run every harness file in a directory by passing the directory as `harness_path`:


In [None]:
async def run_directory_harness():
    """Run all harness files in a directory."""
    
    harness_dir = Path("9_tasks")  # Directory with multiple .json files
    
    if not harness_dir.exists():
        print(f"Directory not found: {harness_dir}")
        return []
    
    print(f"Running harness directory: {harness_dir}")
    
    harness = TestHarness(
        harness_path=harness_dir,
        config=config,
    )
    
    print(f"Loaded {len(harness.scenarios)} total scenarios from directory")
    print(f"From {len(harness.file_map)} harness files:")
    for filename, scenarios in harness.file_map.items():
        print(f"  - {filename}: {len(scenarios)} scenario(s)")
    
    # Run all scenarios
    results = await harness.run(
        models=models,
        agent_factory=create_agent,
    )
    
    print(f"\nCompleted {len(results)} total runs")
    return results

# Uncomment to run:
# directory_results = await run_directory_harness()
print("Directory harness example ready (uncomment to run)")

Directory harness example ready (uncomment to run)


## Creating Custom Agents

The SDK allows you to create custom agents for any LLM provider. Here's an example based on the Qwen implementation from the CLI:



In [None]:
import os
from typing import Any, Optional
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import AIMessage, BaseMessage
from langchain_openai import ChatOpenAI
from mcp_benchmark_sdk import Agent, AgentResponse
from mcp_benchmark_sdk.parsers import OpenAIResponseParser, ResponseParser
from mcp_benchmark_sdk.utils import retry_with_backoff


class QwenAgent(Agent):
    """Agent implementation for Alibaba Cloud Qwen models.

    Uses OpenAI-compatible API from DashScope.
    Requires DASHSCOPE_API_KEY environment variable.

    Supported models:
    - qwen-plus (default)
    - qwen3-14b
    - qwen-turbo
    - qwen-max
    - qwen2.5-72b-instruct
    """

    def __init__(
        self,
        model: str = "qwen-plus",
        temperature: float = 0.1,
        max_output_tokens: Optional[int] = None,
        tool_call_limit: int = 1000,
        system_prompt: Optional[str] = None,
        base_url: Optional[str] = None,
        **kwargs,
    ):
        """Initialize Qwen agent.

        Args:
            model: Qwen model name
            temperature: Sampling temperature
            max_output_tokens: Maximum output tokens
            tool_call_limit: Maximum tool calls per run
            system_prompt: Optional system prompt for the agent
            base_url: Optional custom base URL (default: Singapore endpoint)
            **kwargs: Additional arguments for ChatOpenAI
        """
        # Pass system_prompt and tool_call_limit to parent Agent class
        super().__init__(system_prompt=system_prompt, tool_call_limit=tool_call_limit)
        
        self.model = model
        self.temperature = temperature
        self.max_output_tokens = max_output_tokens
        
        # Map common model names to actual DashScope model IDs
        model_map = {
            "qwen-14b": "qwen3-14b",
            "qwen3-14b": "qwen3-14b",
            "qwen-plus": "qwen-plus",
            "qwen-turbo": "qwen-turbo",
            "qwen-max": "qwen-max",
            "qwen2.5-72b-instruct": "qwen2.5-72b-instruct",
        }
        
        self.actual_model = model_map.get(model.lower(), model)
        
        # Get base URL from environment or use default (Singapore)
        self.base_url = base_url or os.environ.get(
            "DASHSCOPE_BASE_URL",
            "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
        )
        
        # Get API key
        self.api_key = os.environ.get("DASHSCOPE_API_KEY")
        if not self.api_key:
            raise EnvironmentError(
                "DASHSCOPE_API_KEY is not set. Export the API key before running Qwen models.\n"
                "Get your API key from: https://modelstudio.console.alibabacloud.com/?tab=model#/api-key\n"
                "Set it with: export DASHSCOPE_API_KEY='your-key-here'"
            )
        
        self.extra_kwargs = kwargs

    def _build_llm(self) -> BaseChatModel:
        """Build Qwen model using OpenAI-compatible interface."""
        config: dict[str, Any] = {
            "model": self.actual_model,
            "temperature": self.temperature,
            "timeout": None,
            "max_retries": 3,
            "base_url": self.base_url,
            "api_key": self.api_key,
        }

        if self.max_output_tokens is not None:
            config["max_completion_tokens"] = self.max_output_tokens

        # Pass through any additional kwargs
        config.update(self.extra_kwargs)
        
        # Disable thinking in non-streaming mode.
        # Copy extra_body first so a dict passed in via kwargs is not mutated.
        extra_body = dict(config.get("extra_body") or {})
        extra_body["enable_thinking"] = False
        config["extra_body"] = extra_body

        llm = ChatOpenAI(**config)
        return llm.bind_tools(self._tools) if self._tools else llm

    async def get_response(self, messages: list[BaseMessage]) -> tuple[AgentResponse, AIMessage]:
        """Get Qwen model response with retry logic."""
        if not self._llm:
            raise RuntimeError("LLM not initialized. Call initialize() first.")

        async def _invoke():
            return await self._llm.ainvoke(messages)

        ai_message = await retry_with_backoff(
            _invoke,
            max_retries=2,
            timeout_seconds=600.0,
            on_retry=lambda attempt, exc, delay: None,
        )

        # Parse response (use OpenAI parser since it's compatible)
        parser = self.get_response_parser()
        parsed = parser.parse(ai_message)

        agent_response = AgentResponse(
            content=parsed.content,
            tool_calls=parsed.tool_calls,
            reasoning="\n".join(parsed.reasoning) if parsed.reasoning else None,
            done=not bool(parsed.tool_calls),
            info={"raw_reasoning": parsed.raw_reasoning},
        )

        return agent_response, ai_message

    def get_response_parser(self) -> ResponseParser:
        """Get OpenAI-compatible response parser."""
        return OpenAIResponseParser()



### Using Custom Agents with TestHarness

Now let's create a custom agent factory that uses our `QwenAgent`:



In [None]:
def custom_agent_factory(
    model: str,
    temperature: float = 0.1,
    max_output_tokens: int | None = None,
    tool_call_limit: int = 1000,
    system_prompt: str | None = None,
    **kwargs,
) -> Agent:
    """Custom agent factory that supports Qwen models.
    
    This factory checks the model name and creates the appropriate agent:
    - Qwen models -> QwenAgent
    - Other models -> Use SDK's default create_agent
    
    Args:
        model: Model name (e.g., "qwen-plus", "claude-sonnet-4-5", "gpt-4o")
        temperature: Sampling temperature
        max_output_tokens: Maximum output tokens
        tool_call_limit: Maximum tool calls per run
        system_prompt: Optional system prompt
        **kwargs: Additional model-specific arguments
    
    Returns:
        Agent instance for the specified model
    """
    model_lower = model.lower()
    
    # Check if it's a Qwen model
    if model_lower.startswith("qwen"):
        print(f"Creating QwenAgent for model: {model}")
        return QwenAgent(
            model=model,
            temperature=temperature,
            max_output_tokens=max_output_tokens,
            tool_call_limit=tool_call_limit,
            system_prompt=system_prompt,
            **kwargs,
        )
    
    # For other models, use the SDK's default factory
    print(f"Creating default agent for model: {model}")
    return create_agent(
        model=model,
        temperature=temperature,
        max_output_tokens=max_output_tokens,
        tool_call_limit=tool_call_limit,
        system_prompt=system_prompt,
        **kwargs,
    )


### Running with Custom Agent

Now you can use your custom agent factory with the test harness:


In [None]:
async def run_with_custom_agent():
    """Example: Run harness with custom agent factory."""
    
    # Create harness
    harness = TestHarness(
        harness_path=simple_harness_path,
        config=config,
    )
    
    print("="*80)
    print("RUNNING WITH CUSTOM AGENT FACTORY")
    print("="*80)
    
    # Run with multiple models including Qwen
    # NOTE: You need DASHSCOPE_API_KEY set to run Qwen models
    models_to_test = [
        "claude-sonnet-4-5",  # Uses default ClaudeAgent
        # "qwen-plus",           # Uses custom QwenAgent (uncomment if you have API key)
        # "gpt-4o",              # Uses default OpenAIAgent
    ]
    
    results = await harness.run(
        models=models_to_test,
        agent_factory=custom_agent_factory,  # Use custom factory instead of create_agent
    )
    
    print("\n" + "="*80)
    print("RESULTS")
    print("="*80)
    
    for result in results:
        status = "✅ PASSED" if result.success else "❌ FAILED"
        print(f"{status} | {result.model} | {result.scenario_name}")
    
    return results


# Example: Run with custom agent
# Uncomment to run:
# custom_results = await run_with_custom_agent()



### Key Concepts for Custom Agents

When creating a custom agent, you need to implement:

1. **`__init__()`**: Initialize your agent
   - Call `super().__init__(system_prompt=system_prompt, tool_call_limit=tool_call_limit)`
   - Store model configuration (API keys, base URLs, etc.)
   - Validate environment variables

2. **`_build_llm()`**: Build the LangChain model
   - Create and configure your LLM (e.g., `ChatOpenAI`, `ChatAnthropic`)
   - Bind tools to the model: `llm.bind_tools(self._tools)`
   - Return the configured model

3. **`get_response()`**: Get model responses
   - Call `await self._llm.ainvoke(messages)`
   - Add retry logic using `retry_with_backoff()`
   - Parse the response and return `AgentResponse`

4. **`get_response_parser()`**: Parse model responses
   - Return appropriate parser (e.g., `OpenAIResponseParser()`)
   - Or create custom parser for your model's response format
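
For readers unfamiliar with the retry step in item 3, the general idea of retrying an async call with exponential backoff can be sketched as follows. This is a generic illustration, not the SDK's `retry_with_backoff`: the function name `retry_async`, the `base_delay` parameter, and the jitter formula are assumptions made for the example.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    max_retries: int = 2,
    base_delay: float = 1.0,
    timeout_seconds: Optional[float] = None,
    on_retry: Optional[Callable[[int, Exception, float], None]] = None,
) -> T:
    """Call func, retrying with exponential backoff on failure.

    Hypothetical stand-in, not the SDK's retry_with_backoff.
    """
    for attempt in range(max_retries + 1):
        try:
            # wait_for with timeout=None waits without a time limit
            return await asyncio.wait_for(func(), timeout=timeout_seconds)
        except Exception as exc:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error
            # Delay doubles on each attempt; a little jitter spreads out retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            if on_retry is not None:
                on_retry(attempt + 1, exc, delay)
            await asyncio.sleep(delay)
    raise AssertionError("unreachable when max_retries >= 0")


async def demo() -> str:
    """Simulate a call that fails twice and then succeeds."""
    attempts = 0

    async def flaky() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("transient failure")
        return f"succeeded on attempt {attempts}"

    return await retry_async(flaky, max_retries=2, base_delay=0.01)


# In a notebook cell, use `await demo()` instead of asyncio.run().
print(asyncio.run(demo()))  # succeeded on attempt 3
```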

**Agent Factory Pattern:**

The agent factory is a function that creates agents based on the model name:

```python
def my_factory(model: str, **kwargs) -> Agent:
    if "custom-model" in model:
        return CustomAgent(model=model, **kwargs)
    return create_agent(model=model, **kwargs)  # Fallback to SDK default
```

This allows you to:
- Support multiple LLM providers
- Use different configurations per model
- Mix custom and built-in agents
- Keep agent creation logic centralized
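
When several providers are involved, the `if` chain in the factory can be replaced with a lookup table keyed by model-name prefix. The sketch below illustrates that variant with stand-in classes so it runs without the SDK; `make_prefix_factory` and the `Stub*` classes are hypothetical names, not SDK APIs.

```python
from typing import Any, Callable


class StubAgent:
    """Stand-in for the SDK's Agent so the sketch runs on its own."""

    def __init__(self, model: str, **kwargs: Any):
        self.model = model
        self.kwargs = kwargs


class StubQwenAgent(StubAgent):
    """Stand-in for a custom agent such as QwenAgent."""


class StubDefaultAgent(StubAgent):
    """Stand-in for whatever the default factory would return."""


AgentBuilder = Callable[..., StubAgent]


def make_prefix_factory(builders: dict[str, AgentBuilder], default: AgentBuilder) -> AgentBuilder:
    """Return a factory that picks a builder by case-insensitive model-name prefix."""

    def factory(model: str, **kwargs: Any) -> StubAgent:
        model_lower = model.lower()
        for prefix, builder in builders.items():
            if model_lower.startswith(prefix):
                return builder(model=model, **kwargs)
        # No prefix matched: fall back to the default builder
        return default(model=model, **kwargs)

    return factory


factory = make_prefix_factory({"qwen": StubQwenAgent}, default=StubDefaultAgent)
print(type(factory("qwen-plus", temperature=0.1)).__name__)  # StubQwenAgent
print(type(factory("claude-sonnet-4-5")).__name__)           # StubDefaultAgent
```

With the real SDK, the same shape would presumably take `{"qwen": QwenAgent}` as the table and `create_agent` as the default, and the returned function would be passed as `agent_factory`.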



## Summary

This notebook demonstrated:

1. ✅ **Configuration**: Setting up MCP servers and harness configuration
2. ✅ **Running**: Executing benchmarks across models
3. ✅ **Observers**: Using custom observers for progress tracking
4. ✅ **Results**: Analyzing success rates and verifier results
5. ✅ **Saving**: Persisting results to JSON files
6. ✅ **Conversation History**: Examining agent interactions
7. ✅ **Analysis**: Using pandas for additional insights
8. ✅ **Custom Agents**: Creating custom agents for new LLM providers (Qwen example)

### Next Steps

- Create your own harness files with custom scenarios
- Add custom verifiers for domain-specific validation
- Implement custom observers for rich UI feedback
- Compare results across different models and configurations
- Create custom agents for additional LLM providers
- Use the SDK in your own benchmarking pipeline

For more information, see:
- SDK Documentation: `packages/mcp_benchmark_sdk/README.md`
- Harness Design: `packages/mcp_benchmark_sdk/src/mcp_benchmark_sdk/harness/README.md`
