# MCP Benchmark SDK - Usage Showcase

This notebook demonstrates all the major usage patterns of the MCP Benchmark SDK:

1. **Using Agents Standalone** - Run agents without the harness, with observers for real-time monitoring
2. **Custom Agents** - Create your own agent (Qwen example)
3. **Custom Verifiers** - Build custom verification logic
4. **Test Harness** - Run benchmarks across multiple models
5. **Saving and Analyzing Results** - Export results to JSON
6. **Advanced Patterns** - Custom system prompts and LangSmith tracing

## Prerequisites

Make sure you have:
- An MCP server running (e.g., Jira MCP on `http://localhost:8015/mcp`)
- API keys set as environment variables
- The SDK installed: `pip install -e .`
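
A quick check before running the cells below can save a failed run. The sketch below reports which variables are unset. The helper `missing_env_vars` is hypothetical (not part of the SDK), and the names in `REQUIRED_KEYS` are assumptions, since only `DASHSCOPE_API_KEY` is named in this notebook. Adjust them to the providers you use.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_env_vars(
    names: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumed variable names; replace with the keys your providers require.
REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]

missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required API keys are set")
```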


In [1]:
# Setup and Imports

import asyncio
import os
from pathlib import Path

# Core SDK imports
from mcp_benchmark_sdk import (
    # Agents
    ClaudeAgent,
    GPTAgent,
    GeminiAgent,
    GrokAgent,
    # Task/Result
    Task,
    Result,
    # MCP
    MCPConfig,
    # Verifiers
    Verifier,
    DatabaseVerifier,
    VerifierResult,
    # Runtime
    RunContext,
    RunObserver,
    # Harness
    TestHarness,
    TestHarnessConfig,
    create_agent,
)

# Check if we're in Jupyter
def is_jupyter():
    try:
        from IPython.core.getipython import get_ipython
        return get_ipython() is not None
    except ImportError:
        return False

print(f"Running in {'Jupyter' if is_jupyter() else 'Python script'}")
print("SDK imported successfully ✓")


Running in Jupyter
SDK imported successfully ✓


## 1. Using Agents Standalone (Without Harness)

Agents can be used directly without the test harness for single-task execution.


In [3]:
### Example 1.1: Basic Agent Usage

async def example_basic_agent():
    """Run a simple agent task."""
    
    # Configure MCP server
    mcp_config = MCPConfig(
        name="jira",
        url="http://localhost:8015/mcp",
        transport="streamable_http"
    )
    
    # Create agent
    agent = ClaudeAgent(
        model="claude-sonnet-4-5",
        temperature=0.1,
        tool_call_limit=1000
    )
    
    # Define task
    task = Task(
        prompt="List all projects available in the system",
        mcp=mcp_config,
        max_steps=100
    )
    
    # Run agent
    print("Running agent...")
    result = await agent.run(task)
    
    # Display results
    print(f"\n{'='*60}")
    print(f"Success: {result.success}")
    print(f"Steps taken: {len(result.messages)}")
    print(f"Database ID: {result.database_id}")
    
    if result.error:
        print(f"Error: {result.error}")
    
    # Show last few messages
    print(f"\nLast 3 messages:")
    for msg in result.messages[-3:]:
        print(f"  {msg.type}: {str(msg.content)[:100]}...")
    
    return result

# Run the example (top-level await works in Jupyter; use asyncio.run() in a script)
result = await example_basic_agent()



Running agent...

Success: True
Steps taken: 4
Database ID: 9bd53603-b91d-4ad8-9d96-2eca67abac39

Last 3 messages:
  ai: [{'signature': 'EpQLCkYICRgCKkDXze33UI5sqjyWOapdWKFrPf4fY4argQmBAyq1OYKgfW8EvmJy2Hd8BVscTVFQTDfW52en...
  tool: {"isLast": true, "maxResults": 100, "nextPage": null, "self": "http://localhost:8015/rest/api/3/proj...
  ai: Great! Here are all **6 projects** currently available in the system:

## 📊 Projects Summary

| # | ...


In [5]:
### Example 1.2: Using Different Agent Providers

async def example_multiple_providers():
    """Compare different LLM providers."""
    
    mcp_config = MCPConfig(
        name="jira",
        url="http://localhost:8015/mcp",
        transport="streamable_http"
    )
    
    task = Task(
        prompt="Create a test issue in the DEMO project",
        mcp=mcp_config
    )
    
    # Try different providers
    agents = {
        "GPT-4o": GPTAgent(model="gpt-4o", temperature=0.1),
        # Emits a warning: claude-sonnet-4-5 needs temperature=1.
        "Claude Sonnet": ClaudeAgent(model="claude-sonnet-4-5", temperature=0.1),
        "Gemini 2.5 Pro": GeminiAgent(model="gemini-2.5-pro", temperature=0.1),
    }
    
    results = {}
    for name, agent in agents.items():
        print(f"\nTrying {name}...")
        try:
            result = await agent.run(task)
            results[name] = {
                "steps": len(result.messages),
                "error": result.error
            }
            print(f"Steps: {len(result.messages)}")
        except Exception as e:
            print(f"  ✗ Error: {e}")
            # Keep the same keys as the success branch so results are comparable
            results[name] = {"steps": None, "error": str(e)}
    
    return results

results = await example_multiple_providers()
for provider, result in results.items():
    print(f"{provider}: {result}")




Trying GPT-4o...
Steps: 4

Trying Claude Sonnet...


Key 'additionalProperties' is not supported in schema, ignoring

Steps: 6

Trying Gemini 2.5 Pro...
Steps: 2
GPT-4o: {'steps': 4, 'error': None}
Claude Sonnet: {'steps': 6, 'error': None}
Gemini 2.5 Pro: {'steps': 2, 'error': None}


In [11]:
### Example 1.3: Using Observers for Real-time Monitoring
import json
class SimpleConsoleObserver(RunObserver):
    """Simple console observer to track execution."""
    
    def __init__(self, label="Agent"):
        self.label = label
        self.tool_count = 0
        self.error_count = 0
    
    async def on_message(self, role: str, content: str, metadata=None):
        if role == "assistant":
            print(f"[{self.label}] Agent: {content}...")
    
    async def on_tool_call(self, tool_name, arguments, result, is_error=False):
        self.tool_count += 1
        if is_error:
            self.error_count += 1
            print(f"[{self.label}] Tool failed: {tool_name}")
            print(f"[{self.label}] Tool error: {result}")
        else:
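            # default=str keeps printing from failing on non-JSON-serializable values
            print(f"[{self.label}] Tool called: {tool_name}: {json.dumps(arguments, indent=2, default=str)}")
            print(f"[{self.label}] Tool result: {json.dumps(result, indent=2, default=str)}")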
    
    async def on_status(self, message: str, level: str = "info"):
        print(f"[{self.label}] {level.upper()}: {message}")


async def example_with_observer():
    """Use an observer to monitor agent execution."""
    
    # Create run context with observer
    async with RunContext() as ctx:
        observer = SimpleConsoleObserver(label="DEMO")
        ctx.add_observer(observer)
        
        # Create agent and task
        agent = ClaudeAgent(model="claude-sonnet-4-5")
        task = Task(
            prompt="List the first 3 projects in the system",
            mcp=MCPConfig(
                name="jira",
                url="http://localhost:8015/mcp",
                transport="streamable_http"
            )
        )
        
        # Run with observer
        result = await agent.run(task, run_context=ctx)
        
        print(f"\nExecution complete!")
        print(f"  Success: {result.success}")
        print(f"  Tool calls: {observer.tool_count}")
        print(f"  Errors: {observer.error_count}")
        
        return result

# Run the example:
result = await example_with_observer()


[DEMO] INFO: Initialized with 49 tools from MCP server
[DEMO] INFO: Step 1/1000
[DEMO] Agent: I'll retrieve the first 3 projects for you....
[DEMO] Tool called: get_paginated_projects: {
  "maxResults": 3,
  "startAt": 0
}
[DEMO] Tool result: {
  "isLast": false,
  "maxResults": 3,
  "nextPage": "http://localhost:8015/rest/api/3/project/search?startAt=3&maxResults=3",
  "self": "http://localhost:8015/rest/api/3/project/search?startAt=0&maxResults=3",
  "startAt": 0,
  "total": 6,
  "values": [
    {
      "id": "4",
      "key": "AS3",
      "name": "apple-s3",
      "simplified": false,
      "style": "classic",
      "self": "http://localhost:8015/rest/api/3/project/AS3",
      "avatarUrls": {
        "16x16": "http://localhost:8015/secure/projectavatar?size=xsmall&pid=4",
        "24x24": "http://localhost:8015/secure/projectavatar?size=small&pid=4",
        "32x32": "http://localhost:8015/secure/projectavatar?size=medium&pid=4",
        "48x48": "http://localhost:8015/secure/projec

## 2. Creating Custom Agents

You can create custom agents by subclassing `Agent`. The Qwen agent below serves as an example:


In [12]:
### Example 2.1: QwenAgent Implementation

from typing import Optional
from langchain_core.messages import BaseMessage, AIMessage
from langchain_openai import ChatOpenAI
from mcp_benchmark_sdk import Agent, AgentResponse
from mcp_benchmark_sdk.parsers import OpenAIResponseParser, ResponseParser
from mcp_benchmark_sdk.utils import retry_with_backoff


class QwenAgent(Agent):
    """Custom agent for Alibaba Cloud Qwen models.
    
    Uses OpenAI-compatible API from DashScope.
    Requires DASHSCOPE_API_KEY environment variable.
    """
    
    def __init__(
        self,
        model: str = "qwen-plus",
        temperature: float = 0.1,
        max_output_tokens: Optional[int] = None,
        tool_call_limit: int = 1000,
        system_prompt: Optional[str] = None,
        **kwargs,
    ):
        super().__init__(system_prompt=system_prompt, tool_call_limit=tool_call_limit)
        
        self.model = model
        self.temperature = temperature
        self.max_output_tokens = max_output_tokens
        
        # Get API key
        self.api_key = os.environ.get("DASHSCOPE_API_KEY")
        if not self.api_key:
            raise EnvironmentError(
                "DASHSCOPE_API_KEY not set. "
                "Get your key from: https://modelstudio.console.alibabacloud.com/"
            )
        
        self.base_url = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    
    def _build_llm(self):
        """Build the LLM instance."""
        config = {
            "model": self.model,
            "temperature": self.temperature,
            "timeout": None,
            "max_retries": 3,
            "base_url": self.base_url,
            "api_key": self.api_key,
        }
        
        if self.max_output_tokens:
            config["max_completion_tokens"] = self.max_output_tokens
        
        llm = ChatOpenAI(**config)
        return llm.bind_tools(self._tools) if self._tools else llm
    
    async def get_response(self, messages: list[BaseMessage]) -> tuple[AgentResponse, AIMessage]:
        """Get model response with retry logic."""
        if not self._llm:
            raise RuntimeError("LLM not initialized")
        
        async def _invoke():
            return await self._llm.ainvoke(messages)
        
        ai_message = await retry_with_backoff(
            _invoke, 
            max_retries=2, 
            timeout_seconds=600.0
        )
        
        # Parse response
        parser = self.get_response_parser()
        parsed = parser.parse(ai_message)
        
        agent_response = AgentResponse(
            content=parsed.content,
            tool_calls=parsed.tool_calls,
            reasoning="\n".join(parsed.reasoning) if parsed.reasoning else None,
            done=not bool(parsed.tool_calls),
        )
        
        return agent_response, ai_message
    
    def get_response_parser(self) -> ResponseParser:
        """Get response parser."""
        return OpenAIResponseParser()


print("QwenAgent class defined successfully!")


QwenAgent class defined successfully!
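
`get_response` above delegates retries to the SDK's `retry_with_backoff`, whose implementation is not shown in this notebook. As a rough illustration of the general technique only, here is a minimal sketch of exponential backoff with jitter. The name `retry_with_backoff_sketch` and the `base_delay` parameter are hypothetical, and the SDK helper may behave differently.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff_sketch(
    func: Callable[[], Awaitable[T]],
    max_retries: int = 2,
    timeout_seconds: float = 600.0,
    base_delay: float = 1.0,
) -> T:
    """Call func, retrying failed attempts with exponential backoff and jitter."""
    attempt = 0
    while True:
        try:
            # The timeout bounds each individual attempt, not the whole loop.
            return await asyncio.wait_for(func(), timeout=timeout_seconds)
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            # The delay doubles per attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)
            attempt += 1
```

With `max_retries=2`, the wrapped call is attempted at most three times. In a notebook cell, call it with `await retry_with_backoff_sketch(some_coroutine_function)`.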


In [13]:
### Example 2.2: Using the Custom QwenAgent

async def example_qwen_agent():
    """Use the custom Qwen agent."""
    
    # Note: You need DASHSCOPE_API_KEY set in environment
    # os.environ["DASHSCOPE_API_KEY"] = "your-key-here"
    
    try:
        agent = QwenAgent(
            model="qwen-plus",
            temperature=0.1
        )
        
        task = Task(
            prompt="List all available projects",
            mcp=MCPConfig(
                name="jira",
                url="http://localhost:8015/mcp",
                transport="streamable_http"
            )
        )
        
        result = await agent.run(task)
        
        print(f"Qwen Agent Result:")
        print(f"  Success: {result.success}")
        print(f"  Steps: {len(result.messages)}")
        
        return result
    
    except EnvironmentError as e:
        print(f"Cannot run Qwen agent: {e}")
        print("Set DASHSCOPE_API_KEY to use Qwen models")

# Run the example (requires DASHSCOPE_API_KEY):
result = await example_qwen_agent()


Qwen Agent Result:
  Success: True
  Steps: 4


## 3. Creating Custom Verifiers

Verifiers check that your agent's execution produced the expected results.


In [14]:
### Example 3.1: Custom API Verifier

import httpx

class CustomAPIVerifier(Verifier):
    """Verify state via custom API calls."""
    
    def __init__(
        self,
        api_url: str,
        expected_status: str,
        name: str = "CustomAPIVerifier",
    ):
        super().__init__(name)
        self.api_url = api_url
        self.expected_status = expected_status
    
    async def verify(self) -> VerifierResult:
        """Execute verification."""
        try:
            async with httpx.AsyncClient() as client:
                response = await client.get(self.api_url)
                response.raise_for_status()
                
                data = response.json()
                actual_status = data.get("status")
                
                success = actual_status == self.expected_status
                
                return VerifierResult(
                    name=self.name,
                    success=success,
                    expected_value=self.expected_status,
                    actual_value=actual_status,
                    comparison_type="equals",
                    error=None if success else f"Status mismatch: expected {self.expected_status!r}, got {actual_status!r}",
                    metadata={"response": data},
                )
        
        except Exception as e:
            return VerifierResult(
                name=self.name,
                success=False,
                expected_value=self.expected_status,
                actual_value=None,
                comparison_type="equals",
                error=str(e),
            )


async def example_custom_verifier():
    """Demonstrate custom verifier."""
    
    verifier = CustomAPIVerifier(
        api_url="https://api.example.com/status",
        expected_status="completed",
    )
    
    result = await verifier.verify()
    
    print(f"Verifier: {result.name}")
    print(f"  Success: {result.success}")
    print(f"  Expected: {result.expected_value}")
    print(f"  Actual: {result.actual_value}")
    if result.error:
        print(f"  Error: {result.error}")
    
    return result

print("Custom verifier defined!")
# Run the example:
result = await example_custom_verifier()


Custom verifier defined!
Verifier: CustomAPIVerifier
  Success: False
  Expected: completed
  Actual: None
  Error: [Errno 8] nodename nor servname provided, or not known


In [18]:
### Example 3.2: Using Built-in Database Verifier

async def example_database_verifier():
    """Use the built-in database verifier."""
    
    # Create a database verifier
    verifier = DatabaseVerifier(
        query="SELECT COUNT(*) FROM issues",
        expected_value=0,
        mcp_url="http://localhost:8015/mcp",
        database_id="test-db-123",
        comparison="equals",
        name="Bug Count Check"
    )
    
    result = await verifier.verify()
    
    print(f"Database Verifier: {result.name}")
    if result.metadata:
        print(f"  Query: {result.metadata.get('query')}")
    print(f"  Expected: {result.expected_value}")
    print(f"  Actual: {result.actual_value}")
    print(f"  Comparison: {result.comparison_type}")
    print(f"  Success: {result.success}")
    
    if result.error:
        print(f"  Error: {result.error}")
    
    return result

# Run the example:
result = await example_database_verifier()

Database Verifier: Bug Count Check
  Query: SELECT COUNT(*) FROM issues
  Expected: 0
  Actual: 3
  Comparison: equals
  Success: False
  Error: Comparison failed
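
The `comparison="equals"` argument selects how the actual value is checked against the expected one. One common way to implement such a setting is a lookup table of operators, sketched below. This is an illustration under assumptions, not the SDK's code: only `"equals"` appears in this notebook, and the other comparison names and the `compare` helper are hypothetical.

```python
import operator
from typing import Any, Callable

# Only "equals" is confirmed by this notebook; the other names are illustrative.
COMPARISONS: dict[str, Callable[[Any, Any], bool]] = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "greater_than": operator.gt,
    "less_than": operator.lt,
}


def compare(actual: Any, expected: Any, comparison: str = "equals") -> bool:
    """Apply the named comparison to (actual, expected)."""
    try:
        op = COMPARISONS[comparison]
    except KeyError:
        raise ValueError(f"Unknown comparison type: {comparison!r}") from None
    return op(actual, expected)


# Mirrors the run above: 3 issues found, 0 expected.
print(compare(3, 0, "equals"))  # False
```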


## 4. Using the Test Harness for Benchmarks

The `TestHarness` orchestrates running multiple scenarios across multiple models with built-in parallelization.


In [19]:
### Example 4.1: Running Benchmarks with Test Harness

async def example_test_harness():
    """Run benchmarks using the test harness."""
    
    # Configure MCP server
    mcp_config = MCPConfig(
        name="jira",
        url="http://localhost:8015/mcp",
        transport="streamable_http"
    )
    
    # Create test harness
    # Note: You'll need a harness JSON file (see next cell for format)
    harness_path = Path("../../../9_tasks/task1.json")
    
    if not harness_path.exists():
        print(f"Harness file not found: {harness_path}")
        print("Create a harness JSON file or adjust the path")
        return
    
    harness = TestHarness(
        harness_path=harness_path,
        config=TestHarnessConfig(
            mcp=mcp_config,
            max_steps=1000,
            tool_call_limit=1000,
            temperature=0.1,
            runs_per_scenario=1,  # Run each scenario once
            max_concurrent_runs=3,  # Run 3 in parallel
        )
    )
    
    print(f"Loaded {len(harness.scenarios)} scenario(s)")
    print("Running benchmarks...")
    
    # Run benchmarks across models
    results = await harness.run(
        models=["gpt-4o", "claude-sonnet-4-5"],
        agent_factory=create_agent,
    )
    
    # Analyze results
    print(f"\n{'='*60}")
    print("RESULTS")
    print(f"{'='*60}")
    
    for result in results:
        status = "✓ PASS" if result.success else "✗ FAIL"
        print(f"{result.model} - {result.scenario_id}: {status}")
        
        if result.error:
            print(f"  Error: {result.error}")
        
        conversation = result.get_conversation_history()
        print(f"  Conversation: {len(conversation)} messages")
        
        if result.verifier_results:
            print(f"  Verifiers:")
            for v in result.verifier_results:
                v_status = "✓" if v.success else "✗"
                print(f"    {v_status} {v.name}")
    
    successful = sum(1 for r in results if r.success)
    print(f"\nTotal: {len(results)} | Passed: {successful} | Failed: {len(results) - successful}")
    
    return results

# Run the example:
results = await example_test_harness()

Loaded 1 scenario(s)
Running benchmarks...




RESULTS
gpt-4o - pm_review_task_1: ✗ FAIL
  Error: Verifiers failed: DatabaseVerifier, DatabaseVerifier, DatabaseVerifier
  Conversation: 12 messages
  Verifiers:
    ✗ DatabaseVerifier
    ✗ DatabaseVerifier
    ✗ DatabaseVerifier
    ✓ DatabaseVerifier
    ✓ DatabaseVerifier
    ✓ DatabaseVerifier
claude-sonnet-4-5 - pm_review_task_1: ✗ FAIL
  Error: Verifiers failed: DatabaseVerifier
  Conversation: 40 messages
  Verifiers:
    ✓ DatabaseVerifier
    ✓ DatabaseVerifier
    ✓ DatabaseVerifier
    ✗ DatabaseVerifier
    ✓ DatabaseVerifier
    ✓ DatabaseVerifier

Total: 2 | Passed: 0 | Failed: 2
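
The `max_concurrent_runs=3` setting caps how many runs execute at once. How the harness enforces this is not shown here. A common asyncio approach is a semaphore, sketched below with a hypothetical `run_bounded` helper that is not part of the SDK.

```python
import asyncio
from typing import Awaitable, Callable


async def run_bounded(
    jobs: list[Callable[[], Awaitable[str]]],
    max_concurrent_runs: int = 3,
) -> list[str]:
    """Run all jobs, allowing at most max_concurrent_runs at a time."""
    semaphore = asyncio.Semaphore(max_concurrent_runs)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:  # waits here while the limit is reached
            return await job()

    # gather returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Each job is a zero-argument coroutine function, so a (model, scenario) pair can be bound with `functools.partial` before it is passed in.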


### Harness JSON File Format

A harness JSON file defines scenarios, prompts, and verifiers:

```json
{
  "scenarios": [
    {
      "id": "create_bug_issue",
      "description": "Create a bug issue in DEMO project",
      "prompts": [
        {
          "role": "user",
          "content": "Create a high-priority bug about login failures"
        }
      ],
      "verifiers": [
        {
          "verifier_type": "database_state",
          "name": "Issue Created",
          "validation_config": {
            "query": "SELECT COUNT(*) FROM issues WHERE type = 'Bug'",
            "expected_value": 1,
            "comparison_type": "equals"
          }
        }
      ]
    }
  ]
}
```
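
Before handing a file to the harness, it can help to check its shape. The sketch below validates only the keys visible in the example above. Whether the SDK treats all of them as required is an assumption, and `validate_harness` is a hypothetical helper, not an SDK function.

```python
import json

# Keys taken from the example above; treating them as required is an assumption.
REQUIRED_SCENARIO_KEYS = ("id", "prompts", "verifiers")


def validate_harness(text: str) -> list[str]:
    """Parse harness JSON and return the scenario ids, raising ValueError on problems."""
    data = json.loads(text)
    scenarios = data.get("scenarios") if isinstance(data, dict) else None
    if not isinstance(scenarios, list) or not scenarios:
        raise ValueError("harness must contain a non-empty 'scenarios' list")

    ids = []
    for index, scenario in enumerate(scenarios):
        missing = [key for key in REQUIRED_SCENARIO_KEYS if key not in scenario]
        if missing:
            raise ValueError(f"scenario {index} is missing keys: {missing}")
        ids.append(scenario["id"])
    return ids
```

For a file on disk, `validate_harness(Path("harness.json").read_text())` returns the list of scenario ids, where `harness.json` is a placeholder path.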


## 5. Saving and Analyzing Results

Results can be exported to JSON for analysis and storage.


In [21]:
### Example 5.1: Saving Results to JSON

import json

async def example_save_results():
    """Save benchmark results to JSON files."""
    
    # Run a simple benchmark first
    agent = ClaudeAgent(model="claude-sonnet-4-5")
    task = Task(
        prompt="List all projects",
        mcp=MCPConfig(
            name="jira",
            url="http://localhost:8015/mcp",
            transport="streamable_http"
        )
    )
    
    result = await agent.run(task)
    
    # Create artifacts directory
    artifacts_dir = Path("artifacts")
    artifacts_dir.mkdir(exist_ok=True)
    
    # Export to a dict (keeps every message, each truncated to 500 characters)
    result_dict = {
        "success": result.success,
        "error": result.error,
        "database_id": result.database_id,
        "conversation": [
            {
                "type": msg.type,
                "content": str(msg.content)[:500]  # Truncate for display
            }
            for msg in result.messages
        ],
        "metadata": result.metadata,
    }
    
    # Save to file
    db_id_short = result.database_id[:8] if result.database_id else "unknown"
    filename = f"result_{db_id_short}.json"
    filepath = artifacts_dir / filename
    
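    # default=str avoids a TypeError if metadata holds non-JSON-serializable values
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(result_dict, f, indent=2, default=str)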
    
    print(f"Saved result to: {filepath}")
    print(f"  Success: {result_dict['success']}")
    print(f"  Conversation: {len(result_dict['conversation'])} messages")
    print(f"  File size: {filepath.stat().st_size} bytes")
    
    # Access conversation programmatically
    print(f"\nFirst 2 conversation messages:")
    for i, msg in enumerate(result_dict['conversation'][:2]):
        print(f"  {i+1}. {msg['type']}: {msg['content'][:100]}...")
    
    return result_dict

# Run the example:
result_dict = await example_save_results()


Saved result to: artifacts/result_28fad97d.json
  Success: True
  Conversation: 4 messages
  File size: 1951 bytes

First 2 conversation messages:
  1. human: List all projects...
  2. ai: [{'signature': 'EusDCkYICRgCKkD5McyQEN1ve5R1X0oy7fYGOLBVUpZRbcvLBDLRUATplA18hqw6lNRmedy4EOOKSwvKyTpm...
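
Once several runs have been saved, the files can be read back for analysis. The sketch below counts passed and failed runs. It assumes the `result_*.json` naming and the `success` field written by the cell above, and `summarize_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_results(artifacts_dir: Path) -> dict[str, int]:
    """Count passed and failed runs across saved result_*.json files."""
    summary = {"passed": 0, "failed": 0}
    for path in sorted(artifacts_dir.glob("result_*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        # A missing or falsy "success" field counts as a failure.
        summary["passed" if record.get("success") else "failed"] += 1
    return summary
```

Calling `summarize_results(Path("artifacts"))` after the cell above returns a dict with `passed` and `failed` counts.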


## 6. Advanced Patterns

Additional SDK features for power users.


In [23]:
### Example 6.1: Custom System Prompts

async def example_custom_system_prompt():
    """Use a custom system prompt."""
    
    CUSTOM_PROMPT = """You are an expert project manager assistant.
You are precise, thorough, and always verify your work.
When creating issues, always include:
1. Clear title
2. Detailed description
3. Priority level
4. Assignee if known"""
    
    agent = ClaudeAgent(
        model="claude-sonnet-4-5",
        system_prompt=CUSTOM_PROMPT,
        temperature=0.1,
        tool_call_limit=500,
    )
    
    task = Task(
        prompt="Create a bug issue about the login page crashing",
        mcp=MCPConfig(
            name="jira",
            url="http://localhost:8015/mcp",
            transport="streamable_http"
        )
    )
    
    result = await agent.run(task)
    
    print("Agent with custom system prompt:")
    print(f"  Success: {result.success}")
    print(f"  Steps: {len(result.messages)}")
    
    return result

# Run the example:
result = await example_custom_system_prompt()



Agent with custom system prompt:
  Success: True
  Steps: 3


In [None]:
### Example 6.2: LangSmith Tracing (Optional)

# Set environment variables for LangSmith:
# os.environ["LANGSMITH_API_KEY"] = "your-key"
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_PROJECT"] = "mcp-benchmarks"

from mcp_benchmark_sdk import configure_langsmith, with_tracing, is_tracing_enabled

async def example_langsmith_tracing():
    """Enable LangSmith tracing for observability."""
    
    # LangSmith is configured through environment variables:
    # LANGSMITH_API_KEY, LANGSMITH_TRACING, LANGSMITH_PROJECT
    
    if not is_tracing_enabled():
        print("LangSmith tracing not enabled (API key not set)")
        print("Set LANGSMITH_API_KEY and LANGSMITH_TRACING=true to enable")
        return
    
    # Wrap agent with tracing
    agent = ClaudeAgent(model="claude-sonnet-4-5")
    traced_agent = with_tracing(agent)
    
    task = Task(
        prompt="List projects",
        mcp=MCPConfig(
            name="jira",
            url="http://localhost:8015/mcp",
            transport="streamable_http"
        )
    )
    
    result = await traced_agent.run(task)
    
    print("Traced agent execution:")
    print(f"  Success: {result.success}")
    print(f"  Tracing enabled: {is_tracing_enabled()}")
    print("  Check LangSmith dashboard for traces!")
    
    return result

# Uncomment to execute (requires LANGSMITH_API_KEY):
# result = await example_langsmith_tracing()


## Summary

This notebook demonstrated the key usage patterns of the MCP Benchmark SDK:

### ✅ What We Covered

1. **Standalone Agents**: Using built-in agents (GPT, Claude, Gemini, Grok) without the harness
2. **Custom Agents**: Creating custom agents by subclassing `Agent` (Qwen example)
3. **Custom Verifiers**: Building verification logic for agent execution results
4. **Test Harness**: Running benchmarks across multiple models and scenarios
5. **Observers**: Real-time monitoring of agent execution
6. **Advanced Features**: Custom system prompts, LangSmith tracing, result persistence

### 📖 Next Steps

- Explore the `examples/` directory for more complete examples
- Read the SDK source code - it's well-documented!
- Check out existing agents in `src/mcp_benchmark_sdk/agents/`
- Review the README for detailed documentation

### 🚀 Quick Reference

```python
# Simple agent usage
agent = ClaudeAgent(model="claude-sonnet-4-5")
result = await agent.run(task)

# With observer
async with RunContext() as ctx:
    ctx.add_observer(MyObserver())
    result = await agent.run(task, run_context=ctx)

# Test harness
harness = TestHarness(harness_path=path, config=config)
results = await harness.run(models=["gpt-4o"], agent_factory=create_agent)
```

### 🔗 Resources

- GitHub: [Your repository]
- Documentation: See package README
- MCP Protocol: https://modelcontextprotocol.io/

**Happy benchmarking! 🎉**
