# Evaluation with Grounded Agent

This notebook shows how to run OSWorld evaluations with `GroundedOpenAIChatAgent`.

The grounded agent separates visual grounding from reasoning:
- **Planning model** (GPT-4o-mini): High-level reasoning and task planning
- **Grounding model** (Qwen2.5-VL): Visual element detection and coordinate resolution
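
As a rough illustration of that split, the sketch below wires a planner stub to a grounder stub. `Action`, `plan_next_action`, `ground`, and `step` are hypothetical names invented for this example and are not part of `hud`; the real agent calls the two models where this sketch uses simple lookups.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Planner-level action that names a UI element by description (hypothetical type)."""

    kind: str    # e.g. "click"
    target: str  # natural-language description of the element


def plan_next_action(instruction: str) -> Action:
    """Hypothetical planner stub: decide what to do, without any pixel coordinates."""
    if "save" in instruction.lower():
        return Action(kind="click", target="Save button")
    return Action(kind="click", target="OK button")


def ground(target: str, screen_index: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Hypothetical grounder stub: resolve an element description to (x, y)."""
    return screen_index[target]


def step(instruction: str, screen_index: dict[str, tuple[int, int]]) -> dict:
    """One planning step followed by one grounding step."""
    action = plan_next_action(instruction)
    x, y = ground(action.target, screen_index)
    return {"type": action.kind, "x": x, "y": y}


# Fake "screenshot": element descriptions mapped to coordinates
screen = {"Save button": (120, 48), "OK button": (300, 200)}
print(step("Save the file", screen))
```

The planner never sees coordinates and the grounder never sees the task, which is the separation the two bullets above describe.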

## Prerequisites

- Set `HUD_API_KEY` in your environment
- Set `OPENAI_API_KEY` for the planning model
- Set `OPENROUTER_API_KEY` for the grounding model (or use local grounding)

In [None]:
%pip install 'hud-python[dev]'

In [None]:
import os
import logging

import hud
from datasets import load_dataset
from openai import AsyncOpenAI

from hud.datasets import Task
from hud.agents.grounded_openai import GroundedOpenAIChatAgent
from hud.tools.grounding.config import GrounderConfig
from hud.settings import settings

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(message)s", datefmt="%H:%M:%S"
)
logging.getLogger("hud.agents").setLevel(logging.INFO)

# Disable httpx logging to reduce noise
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("httpcore").setLevel(logging.WARNING)

logger = logging.getLogger(__name__)

## Configuration

Set up API keys and model configurations:

In [None]:
# API Keys - make sure these are set in your environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or settings.openai_api_key
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY") or settings.openrouter_api_key
HUD_API_KEY = os.getenv("HUD_API_KEY")

if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found. Please set it in your environment.")
if not OPENROUTER_API_KEY:
    raise ValueError("OPENROUTER_API_KEY not found. Please set it in your environment.")
if not HUD_API_KEY:
    raise ValueError("HUD_API_KEY not found. Please set it in your environment.")

print("✅ API keys configured")
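
The checks above stop at the first missing key. A possible alternative, sketched here under the assumption that only environment variables are consulted (the `settings` fallback used above is left out), is to collect every missing name and report them together. `missing_keys` is a hypothetical helper, not part of `hud`.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


required = ["HUD_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY"]
missing = missing_keys(required)
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All keys present")
```

Passing a plain dict as `env` makes the helper easy to try without touching the real environment.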

## Create Grounded Agent Configuration

Configure the grounding model and the planning client. The resulting `agent_class` and `agent_config` are used by the single-task test that follows:

In [None]:
# Grounding model configuration
grounder_config = GrounderConfig(
    api_key=OPENROUTER_API_KEY,
    api_base="https://openrouter.ai/api/v1",
    model="qwen/qwen-2.5-vl-7b-instruct",
)

# OpenAI client for planning model
openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)

# Agent configuration for dataset runner
agent_class = GroundedOpenAIChatAgent
agent_config = {
    "grounder_config": grounder_config,
    "openai_client": openai_client,
}

print("✅ Agent configuration ready")

## Single Task Test

First, let's test the grounded agent on a single OSWorld task:

In [None]:
async def run_single_task(
    dataset_name: str,
    task_index: int = 1,
    max_steps: int = 10,
):
    """Load one task from dataset_name, execute it, and return the agent's result."""

    print("📊 Loading dataset…")
    dataset = load_dataset(dataset_name, split="train")

    # Get a task from dataset
    sample_task = dataset[task_index]
    task_prompt = sample_task.get("prompt", f"Task {sample_task.get('id', 0)}")

    with hud.trace(name=task_prompt):
        task = Task(**sample_task)

        # Create agent with configuration
        agent = agent_class(**agent_config)

        print(f"\n🎯 Task: {task.prompt}")
        result = await agent.run(task, max_steps=max_steps)
        print("✅ Reward:", result.reward)
        return result

In [None]:
# Test single task
result = await run_single_task("hud-evals/OSWorld-Gold", task_index=1, max_steps=15)

## Full Dataset Evaluation

Now let's run evaluations on complete datasets using the factory functions from `hud.utils.agent_factories`:

In [None]:
from hud.datasets import run_dataset
from hud.utils.agent_factories import create_grounded_agent

# Configuration for the factory functions
grounded_agent_config = {
    "api_key": OPENAI_API_KEY,
    "grounder_api_key": OPENROUTER_API_KEY,
    "grounder_api_base": "https://openrouter.ai/api/v1",
    "grounder_model": "qwen/qwen-2.5-vl-7b-instruct",
    "model_name": "gpt-4o-mini",
}

print("✅ Factory configurations ready")

### Run Small Dataset Evaluation

For smaller datasets (< 100 tasks), use the standard `run_dataset` with async concurrency:

In [None]:
# Load OSWorld dataset and take a subset for evaluation
dataset = load_dataset("hud-evals/OSWorld-Gold", split="train")

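# Take the first 30 tasks for evaluation
subset_size = 30
# Slicing a Dataset returns a dict of columns (column name -> list of values)
task_subset = dataset[:subset_size]

# Convert the column-oriented slice into one dict per task
task_list = []
for i in range(len(task_subset["prompt"])):
    task_dict = {key: task_subset[key][i] for key in task_subset.keys()}
    task_list.append(task_dict)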

print(f"📊 Loaded {len(task_list)} tasks from OSWorld-Gold")
print(f"First task: {task_list[0].get('prompt', 'No prompt')[:100]}...")
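
The loop in the cell above turns a column-oriented slice into one dict per task. The same transformation can be sketched on plain Python data; `columns_to_rows` is a hypothetical helper name, and the sample values are made up for illustration.

```python
def columns_to_rows(columns: dict[str, list]) -> list[dict]:
    """Turn {"col": [v0, v1, ...]} into [{"col": v0}, {"col": v1}, ...]."""
    keys = list(columns)
    # zip(*values) walks all columns in lockstep, yielding one tuple per row
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


# Made-up slice shaped like the dict that slicing a dataset returns
sliced = {"id": ["t1", "t2"], "prompt": ["Open the file", "Rename the folder"]}
for row in columns_to_rows(sliced):
    print(row)
```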

In [None]:
# Run evaluation on the OSWorld subset with grounded agent
import time

start_time = time.time()

results = await run_dataset(
    name="OSWorld-30 Grounded Eval",
    dataset=task_list,  # Pass the list of task dicts
    agent_class=create_grounded_agent,  # Use factory function
    agent_config=grounded_agent_config,
    max_concurrent=10,  # Moderate concurrency for 30 tasks
    max_steps=15,
    auto_respond=True,  # Auto-continue agent
    metadata={"model": "gpt-4o-mini", "grounding": "qwen-2.5-vl", "dataset": "OSWorld-30"},
)

elapsed = time.time() - start_time

# Calculate statistics (a reward of None is treated as 0)
successful = sum(1 for r in results if (getattr(r, "reward", 0) or 0) > 0)
failed = sum(1 for r in results if getattr(r, "isError", False))
total = len(results)

print("\n" + "=" * 50)
print("📊 OSWorld-30 Evaluation Complete!")
print("=" * 50)
print(f"Total tasks: {total}")
print(f"✅ Successful: {successful} ({100 * successful / total:.1f}%)")
print(f"❌ Failed: {failed} ({100 * failed / total:.1f}%)")
print(f"⏱️ Time elapsed: {elapsed:.2f} seconds")
print(f"📈 Throughput: {total / elapsed:.2f} tasks/second")
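
The statistics above can be wrapped in a small function that also tolerates an empty result list. This is a minimal sketch: `summarize` is a hypothetical helper, and it assumes each result exposes `reward` and `isError` attributes, as the `getattr` calls above do.

```python
from types import SimpleNamespace


def summarize(results):
    """Count successes and errors, guarding against an empty result list."""
    total = len(results)
    successful = sum(1 for r in results if (getattr(r, "reward", None) or 0) > 0)
    errored = sum(1 for r in results if getattr(r, "isError", False))
    return {
        "total": total,
        "successful": successful,
        "errored": errored,
        "success_rate": successful / total if total else 0.0,
    }


# Stand-in results for illustration only, not real evaluation output
demo = [
    SimpleNamespace(reward=1.0, isError=False),
    SimpleNamespace(reward=None, isError=True),
    SimpleNamespace(reward=0.0, isError=False),
]
print(summarize(demo))
```

The two counts are not complements: a task that finishes without an error but earns zero reward is neither successful nor errored, so the percentages need not sum to 100.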

### Run Large Dataset Evaluation (Parallel)

For larger datasets (100+ tasks), use `run_dataset_parallel_manual` for process-based parallelization with explicit worker settings:

In [None]:
# Run full OSWorld evaluation with parallel execution and configured workers
from hud.datasets import run_dataset_parallel_manual

# The block below is wrapped in a string literal so it does not run; remove the triple quotes to execute it
"""
import time
start_time = time.time()

results = await run_dataset_parallel_manual(
    name="OSWorld Parallel Eval",
    dataset="hud-evals/OSWorld-Gold",  # 300+ tasks
    agent_class=create_grounded_agent,
    agent_config=grounded_agent_config,
    max_workers=8,                      # Number of worker processes
    max_concurrent_per_worker=10,       # Concurrent tasks per worker (8*10 = 80 total)
    max_steps=15,
    auto_respond=True,
    metadata={"model": "gpt-4o-mini", "grounding": "qwen-2.5-vl", "parallel": True}
)

elapsed = time.time() - start_time

# Print statistics
print("\n" + "=" * 50)
print("📊 Evaluation Complete!")
print("=" * 50)
print(f"Total tasks: {len(results)}")
print(f"Time elapsed: {elapsed:.2f} seconds")
print(f"Throughput: {len(results) / elapsed:.2f} tasks/second")
print(f"Execution mode: PARALLEL (workers: 8, concurrent per worker: 10)")

successful = sum(1 for r in results if getattr(r, "reward", 0) > 0)
print(f"Successful tasks: {successful}/{len(results)} ({100 * successful / len(results):.1f}%)")
"""

print("Ready to run large parallel evaluation with configured workers (remove the triple quotes above)")
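
To get a feel for the worker settings, the arithmetic behind the `8*10 = 80` comment can be sketched as below. Both helpers are hypothetical, and the even split across workers is an assumption for illustration; `hud` may distribute tasks differently.

```python
def peak_concurrency(max_workers: int, max_concurrent_per_worker: int) -> int:
    """Upper bound on tasks in flight at once across all workers."""
    return max_workers * max_concurrent_per_worker


def batch_sizes(n_tasks: int, n_workers: int) -> list[int]:
    """Split n_tasks as evenly as possible over n_workers (assumed strategy)."""
    base, extra = divmod(n_tasks, n_workers)
    # The first `extra` workers each take one additional task
    return [base + 1 if i < extra else base for i in range(n_workers)]


print(peak_concurrency(8, 10))
print(batch_sizes(300, 8))
```

Since each in-flight task calls both the planning and the grounding model, the peak figure is also a useful number to check against provider rate limits before raising either setting.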