# OSWorld Evaluation with Grounded Agent

This notebook evaluates the GroundedOpenAIChatAgent on OSWorld test-subset tasks.

The grounded agent separates visual grounding from reasoning:
- **Planning model** (GPT-4o-mini): High-level reasoning and task planning
- **Grounding model** (Qwen2.5-VL): Visual element detection and coordinate resolution
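
As a rough illustration of this split, the sketch below wires a stand-in planner to a stand-in grounder. Every name in it (`PlannedAction`, `GroundedAction`, `fake_planner`, `fake_grounder`, `step`) is hypothetical and is not part of the `hud` API; the real agent's internals may differ. It only shows the idea that the planner describes *what* to interact with and the grounder decides *where* on screen that is.

```python
from dataclasses import dataclass


@dataclass
class PlannedAction:
    """Hypothetical planner output: an action plus a text description of its target."""
    action: str
    element_description: str


@dataclass
class GroundedAction:
    """Hypothetical executable action: the same action with pixel coordinates."""
    action: str
    x: int
    y: int


def fake_planner(instruction: str) -> PlannedAction:
    # Stand-in for the planning model: it never outputs coordinates,
    # only a natural-language description of the element to act on.
    return PlannedAction(action="click", element_description="the Settings icon")


def fake_grounder(description: str, screenshot_size: tuple[int, int]) -> tuple[int, int]:
    # Stand-in for the grounding model: a real grounder would inspect the
    # screenshot; this stub simply returns the centre of the screen.
    width, height = screenshot_size
    return width // 2, height // 2


def step(instruction: str, screenshot_size: tuple[int, int]) -> GroundedAction:
    # One agent step: plan first, then resolve the described element to coordinates.
    planned = fake_planner(instruction)
    x, y = fake_grounder(planned.element_description, screenshot_size)
    return GroundedAction(action=planned.action, x=x, y=y)


grounded = step("Open the system settings", (1920, 1080))
print(grounded)
```

Keeping the two roles separate means either model could be swapped independently, for example a different planner with the same grounder.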

## Prerequisites

- Set `HUD_API_KEY` in your environment
- Set `OPENAI_API_KEY` for the planning model
- Set `OPENROUTER_API_KEY` for the grounding model (or use local grounding)


In [None]:
# !pip install hud-python

In [1]:
import os
import logging
import time
from typing import Any

import hud
from datasets import load_dataset
from openai import AsyncOpenAI

from hud.agents.grounded_openai import GroundedOpenAIChatAgent
from hud.tools.grounding.config import GrounderConfig
from hud.datasets import Task, run_dataset
from hud.settings import settings

# Configure logging
logging.basicConfig(
    level=logging.INFO, 
    format="%(asctime)s - %(name)s - %(message)s", 
    datefmt="%H:%M:%S"
)
logging.getLogger("hud.agents").setLevel(logging.INFO)



## Configuration

Set up API keys and model configurations:

In [2]:
# API Keys - make sure these are set in your environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or settings.openai_api_key
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY") or settings.openrouter_api_key
HUD_API_KEY = os.getenv("HUD_API_KEY")

if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not found. Please set it in your environment.")
if not OPENROUTER_API_KEY:
    raise ValueError("OPENROUTER_API_KEY not found. Please set it in your environment.")
if not HUD_API_KEY:
    raise ValueError("HUD_API_KEY not found. Please set it in your environment.")

print("✅ API keys configured")

✅ API keys configured


## Create Grounded Agent

The grounded agent uses the same defaults as the examples, so only minimal configuration is needed:

In [None]:
# Grounding model configuration (uses good defaults)
grounder_config = GrounderConfig(
    api_key=OPENROUTER_API_KEY,
    api_base="https://openrouter.ai/api/v1",  # Default
    model="qwen/qwen-2.5-vl-7b-instruct",     # Default
    # system_prompt=... good default for grounding # Default
)

# OpenAI client for planning model
openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)

# Create grounded agent with defaults
def create_grounded_agent() -> GroundedOpenAIChatAgent:
    return GroundedOpenAIChatAgent(
        grounder_config=grounder_config,
        openai_client=openai_client,
        # model_name="gpt-4o-mini",           # Default
        # allowed_tools=["computer"],         # Default 
        # append_setup_output=False,          # Default
        # system_prompt=... good default...  # Default
    )

print("✅ Agent configuration ready")

✅ Agent configuration ready


## Single Task Test

First, let's test the grounded agent on a single OSWorld task:

In [None]:
async def run_single_osworld_task(task_index: int = 1, max_steps: int = 15):
    """Run a single OSWorld task to test the grounded agent."""
    
    print("📊 Loading OSWorld sample...")
    dataset = load_dataset("hud-evals/OSWorld-Verified", split="train")
    
    # Get a task from the dataset
    sample_task = dataset[task_index]
    task_prompt = sample_task.get("prompt", f"Task {sample_task.get('id', 0)}")
    
    print(f"\n🎯 Task: {task_prompt}")
    print(f"📝 Max steps: {max_steps}")
    
    # Create task and agent
    with hud.trace(name=task_prompt):
        task = Task(**sample_task)
        agent = create_grounded_agent()
        agent.metadata = {}
        
        # Run the task
        start_time = time.time()
        result = await agent.run(task, max_steps=max_steps)
        elapsed = time.time() - start_time
        
        print(f"\n✅ Task completed in {elapsed:.2f}s")
        print(f"🏆 Reward: {result.reward}")
        print(f"📊 Steps taken: {result.steps}")
        
        return result

# Run a single task
result = await run_single_osworld_task(task_index=1, max_steps=15)

📊 Loading OSWorld sample...

🎯 Task: I am currently using a ubuntu system. Could you help me set the default video player as VLC?
📝 Max steps: 15

╔═════════════════════════════════════════════════════════════════╗
║                    🚀 See your agent live at:                   ║
╟─────────────────────────────────────────────────────────────────╢
║  https://app.hud.so/trace/58341e55-21c2-4cc4-aec2-2e8cfe7bb914  ║
╚═════════════════════════════════════════════════════════════════╝



11:55:50 - httpx - HTTP Request: POST https://telemetry.hud.so/v3/api/trace/58341e55-21c2-4cc4-aec2-2e8cfe7bb914/status "HTTP/1.1 200 OK"
11:57:26 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
11:57:26 - mcp.client.streamable_http - Received session ID: ce073c4e-6a0c-410d-8af7-2ff57775c74f
11:57:26 - mcp.client.streamable_http - Negotiated protocol version: 2025-06-18
11:57:26 - httpx - HTTP Request: GET https://mcp.hud.so/v3/mcp "HTTP/1.1 405 Method Not Allowed"
11:57:27 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 202 Accepted"
11:57:28 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
11:57:29 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
11:57:30 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
11:57:30 - hud.clients.mcp_use - Created 1 MCP sessions
11:57:32 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
11:57:32 - hud.clients.mcp_use - 

11:57:56 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
11:57:59 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
11:58:09 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:58:13 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
11:58:18 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


11:58:26 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:58:29 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


11:58:52 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:58:53 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
11:58:58 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


11:59:14 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:59:16 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
11:59:21 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


11:59:46 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
11:59:48 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
11:59:53 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


12:00:17 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
12:00:20 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
12:00:25 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"


12:00:43 - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
12:00:46 - httpx - HTTP Request: POST https://openrouter.ai/api/v1/chat/completions "HTTP/1.1 200 OK"
12:00:51 - httpx - HTTP Request: POST https://mcp.hud.so/v3/mcp "HTTP/1.1 200 OK"
