# Evaluation with Grounded Agent

This notebook shows how to run evaluations on OSWorld with `GroundedOpenAIChatAgent`.

The grounded agent separates visual grounding from reasoning:
- **Planning model** (GPT-4o-mini): High-level reasoning and task planning
- **Grounding model** (Qwen2.5-VL): Visual element detection and coordinate resolution
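
The split described above can be illustrated with a toy sketch. Every name in it (`PlannedAction`, `plan_next_action`, `ground`, `resolve_step`) is hypothetical and stands in for a model call; none of it is the `hud` API or the agent's actual implementation. The planner speaks only in element descriptions, and the grounder turns a description into screen coordinates.

```python
from dataclasses import dataclass


@dataclass
class PlannedAction:
    """Hypothetical planner output: an action plus a textual target."""

    action: str
    element_description: str


def plan_next_action(task: str) -> PlannedAction:
    # Stand-in for the planning model: it reasons about the task in natural
    # language and never produces pixel coordinates.
    return PlannedAction(action="click", element_description="the Settings gear icon")


def ground(description: str, screen: dict[str, tuple[int, int]]) -> tuple[int, int]:
    # Stand-in for the grounding model: it maps a description to (x, y).
    # A real grounder would look at a screenshot; this one uses a lookup table.
    return screen[description]


def resolve_step(task: str, screen: dict[str, tuple[int, int]]) -> dict:
    # Combine both stages into one executable action.
    planned = plan_next_action(task)
    x, y = ground(planned.element_description, screen)
    return {"action": planned.action, "x": x, "y": y}


# Fake "screen" with one known element, for illustration only.
screen = {"the Settings gear icon": (1180, 24)}
step = resolve_step("Set VLC as the default video player", screen)
print(step)
```

Keeping coordinates out of the planner's output is what lets the two models be swapped independently.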

## Prerequisites

- Set `HUD_API_KEY` in your environment
- Set `OPENAI_API_KEY` for the planning model
- Set `OPENROUTER_API_KEY` for the grounding model (or use local grounding)

In [None]:
# !pip install hud-python

In [None]:
import os
import logging

import hud
from datasets import load_dataset
from openai import AsyncOpenAI

from hud.agents.grounded_openai import GroundedOpenAIChatAgent
from hud.datasets import Task  # used by run_single_task below
from hud.tools.grounding.config import GrounderConfig
from hud.settings import settings

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(message)s", datefmt="%H:%M:%S"
)
logging.getLogger("hud.agents").setLevel(logging.INFO)
logger = logging.getLogger(__name__)

## Configuration

Set up API keys and model configurations:

In [16]:
# API Keys - make sure these are set in your environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or settings.openai_api_key
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY") or settings.openrouter_api_key
HUD_API_KEY = os.getenv("HUD_API_KEY")

if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found. Please set it in your environment.")
if not OPENROUTER_API_KEY:
    raise ValueError("OPENROUTER_API_KEY not found. Please set it in your environment.")
if not HUD_API_KEY:
    raise ValueError("HUD_API_KEY not found. Please set it in your environment.")

print("✅ API keys configured")

✅ API keys configured


## Create Grounded Agent Configuration

In [17]:
# Grounding model configuration
grounder_config = GrounderConfig(
    api_key=OPENROUTER_API_KEY,
    api_base="https://openrouter.ai/api/v1",
    model="qwen/qwen-2.5-vl-7b-instruct",
)

# OpenAI client for planning model
openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)

# Agent configuration for dataset runner
agent_class = GroundedOpenAIChatAgent
agent_config = {
    "grounder_config": grounder_config,
    "openai_client": openai_client,
}

print("✅ Agent configuration ready")

✅ Agent configuration ready
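
The prerequisites mention local grounding as an alternative to OpenRouter. A minimal sketch of how the keyword arguments might differ is shown here, assuming a local OpenAI-compatible server (for example vLLM) is listening on `http://localhost:8000/v1` and serving a Qwen2.5-VL checkpoint. The helper `build_grounder_kwargs`, the local URL, the `"EMPTY"` key, and the local model name are all assumptions for illustration; only the OpenRouter values come from this notebook.

```python
def build_grounder_kwargs(use_local: bool, openrouter_api_key: str = "") -> dict[str, str]:
    """Return keyword arguments shaped like the GrounderConfig call in this notebook."""
    if use_local:
        # Assumption: a local OpenAI-compatible endpoint on port 8000.
        return {
            "api_key": "EMPTY",  # placeholder; local servers often ignore the key
            "api_base": "http://localhost:8000/v1",
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        }
    # Values copied from the OpenRouter configuration used in this notebook.
    return {
        "api_key": openrouter_api_key,
        "api_base": "https://openrouter.ai/api/v1",
        "model": "qwen/qwen-2.5-vl-7b-instruct",
    }


local_kwargs = build_grounder_kwargs(use_local=True)
remote_kwargs = build_grounder_kwargs(use_local=False, openrouter_api_key="sk-example")
print(local_kwargs["api_base"])
print(remote_kwargs["api_base"])
```

In the notebook, the resulting dictionary could then be passed on as `GrounderConfig(**local_kwargs)`, provided the local server really exposes the chat completions route the grounder expects.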


## Single Task Test

First, let's test the grounded agent on a single OSWorld task:

In [18]:
async def run_single_task(
    dataset_name: str,
    task_index: int = 1,
    max_steps: int = 10,
):
    """Load one task from dataset_name, run the agent on it, and return the result."""

    print("📊 Loading dataset…")
    dataset = load_dataset(dataset_name, split="train")

    # Get a task from dataset
    sample_task = dataset[task_index]
    task_prompt = sample_task.get("prompt", f"Task {sample_task.get('id', 0)}")

    with hud.trace(name=task_prompt):
        task = Task(**sample_task)

        # Create agent with configuration
        agent = agent_class(**agent_config)
        agent.metadata = {}

        print(f"\n🎯 Task: {task.prompt}")
        result = await agent.run(task, max_steps=max_steps)
        print("✅ Reward:", result.reward)
        return result

In [19]:
# Test single task
result = await run_single_task("hud-evals/OSWorld-Verified", task_index=1, max_steps=15)

📊 Loading dataset…

╔═════════════════════════════════════════════════════════════════╗
║                    🚀 See your agent live at:                   ║
╟─────────────────────────────────────────────────────────────────╢
║  https://app.hud.so/trace/a35b27a1-46ff-463f-99d6-06eb5352e81e  ║
╚═════════════════════════════════════════════════════════════════╝


🎯 Task: I am currently using a ubuntu system. Could you help me set the default video player as VLC?


13:20:56 - httpx - HTTP Request: POST https://telemetry.hud.so/v3/api/trace/a35b27a1-46ff-463f-99d6-06eb5352e81e/status "HTTP/1.1 200 OK"
13:22:22 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
13:22:22 - mcp.client.streamable_http - Received session ID: dddadd0f-0675-4f1b-8068-478b86480df1
13:22:22 - mcp.client.streamable_http - Negotiated protocol version: 2025-06-18
13:22:23 - httpx - HTTP Request: GET https://mcp.hud.so/v3/mcp "HTTP/1.1 405 Method Not Allowed"
13:22:23 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 202 Accepted"
13:22:25 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
13:22:26 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
13:22:27 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
13:22:27 - hud.clients.mcp_use - Created 1 MCP sessions
13:22:28 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
13:22:28 - hud.clients.mcp_use - 

13:22:51 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
13:22:54 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
13:23:02 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:23:05 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
13:23:10 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:23:20 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:23:24 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:23:32 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:23:35 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
13:23:40 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:23:55 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


13:24:05 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


13:24:19 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:24:23 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
13:24:30 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:24:41 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:24:43 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
13:24:51 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:25:06 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:25:10 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
13:25:15 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:25:51 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:25:53 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
13:25:58 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:26:23 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:26:25 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
13:26:29 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:26:57 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
13:26:58 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
13:27:04 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


13:27:34 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


13:27:37 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
13:27:37 - client - Tool evaluate has an output schema but did not return structured content. Continuing without structured content validation.
13:27:39 - httpx - HTTP Request: DELETE https://mcp.hud.so/v3/mcp "HTTP/1.1 204 No Content"
13:27:39 - hud.clients.base - Client disconnected


✅ Reward: 1.0


13:27:39 - httpx - HTTP Request: POST https://telemetry.hud.so/v3/api/trace/a35b27a1-46ff-463f-99d6-06eb5352e81e/status "HTTP/1.1 200 OK"



✓ Trace complete! View at: https://app.hud.so/trace/a35b27a1-46ff-463f-99d6-06eb5352e81e

