# ARPO OSWorld Evaluation - Chrome Tasks (Original vs Noisy)

This notebook evaluates **10 Chrome tasks** (original + noisy versions) to test model robustness.

## Prerequisites

1. **GPU Server Running**: Start `notebooks/GPU_Server_for_OSWorld.ipynb` on Colab first
2. **VMware VM Ready**: Your Mac OSWorld VM should be set up and ready
3. **Server URL**: Copy the ngrok URL from the GPU server notebook

---

## Setup Summary

- **Model**: ARPO UITARS 7B (running on Colab GPU)
- **Tasks**: 10 Chrome tasks (original) + 10 Chrome tasks (noisy)
- **Total**: 20 tasks
- **Expected Time**: ~1.5 hours (20 tasks × ~4.5 min per task)
- **Results**: Saved to `results/gpu_eval_chrome_10/` and `results/gpu_eval_chrome_noisy_10/`
- **Dataset**: 128 total tasks (18 Chrome, 19 VS Code, 19 GIMP, etc.)

## 1. Check Environment

In [12]:
import os
import sys
from pathlib import Path

# Get project root
ARPO_ROOT = Path("/Users/hanszhu/Desktop/ARPO_replicate")
os.chdir(ARPO_ROOT)

# Add OSWorld to path
sys.path.insert(0, str(ARPO_ROOT / "OSWorld"))

print(f"✅ Working directory: {os.getcwd()}")
print(f"✅ OSWorld path: {ARPO_ROOT / 'OSWorld'}")
print(f"✅ Test data: {ARPO_ROOT / 'test_data' / 'osworld_examples'}")
print(f"✅ Results will be saved to: {ARPO_ROOT / 'results' / 'gpu_eval'}")

✅ Working directory: /Users/hanszhu/Desktop/ARPO_replicate
✅ OSWorld path: /Users/hanszhu/Desktop/ARPO_replicate/OSWorld
✅ Test data: /Users/hanszhu/Desktop/ARPO_replicate/test_data/osworld_examples
✅ Results will be saved to: /Users/hanszhu/Desktop/ARPO_replicate/results/gpu_eval


## 2. Configure GPU Server URL

**⚠️ Important**: Update this with your actual ngrok URL from the GPU server notebook!

In [13]:
# UPDATE THIS WITH YOUR NGROK URL FROM COLAB!
GPU_SERVER_URL = "https://miller-unshapeable-melany.ngrok-free.dev"  # Example: https://1234-56-78-90-12.ngrok.io

# Test connection
import requests

if GPU_SERVER_URL == "https://YOUR-NGROK-URL.ngrok.io":
    print("⚠️  WARNING: You need to update GPU_SERVER_URL with your actual ngrok URL!")
    print("   Get it from the GPU server notebook (Cell 4 output)")
else:
    try:
        response = requests.get(f"{GPU_SERVER_URL}/health", timeout=5)
        if response.status_code == 200:
            print(f"✅ GPU Server is reachable: {GPU_SERVER_URL}")
            print(f"✅ Server status: {response.json()}")
        else:
            print(f"❌ Server returned status {response.status_code}")
    except Exception as e:
        print(f"❌ Cannot reach server: {e}")
        print("   Make sure GPU server notebook is running on Colab!")

✅ GPU Server is reachable: https://miller-unshapeable-melany.ngrok-free.dev
✅ Server status: {'model': 'arpo-uitars-7b', 'status': 'healthy'}


## 3. Update OSWorld Agent Configuration

Update the agent to use the Colab GPU server instead of localhost.

In [14]:
import shutil

# Back up the original agent file once
agent_file = ARPO_ROOT / "OSWorld" / "mm_agents" / "uitars_agent.py"
backup_file = agent_file.with_suffix('.py.backup')

if not backup_file.exists():
    shutil.copy(agent_file, backup_file)
    print(f"✅ Created backup: {backup_file}")

# Point base_url at the GPU server. The check and the replacement use the
# same literal, so a match always results in an actual rewrite.
LOCAL_BASE_URL = 'base_url="http://localhost:9000/v1"'
agent_content = agent_file.read_text()
if LOCAL_BASE_URL in agent_content:
    updated_content = agent_content.replace(
        LOCAL_BASE_URL,
        f'base_url="{GPU_SERVER_URL}/v1"'
    )
    agent_file.write_text(updated_content)
    print(f"✅ Updated agent to use: {GPU_SERVER_URL}/v1")
else:
    print("⚠️  Agent already configured (not using localhost)")
    print("   Current config will be used")

⚠️  Agent already configured (not using localhost)
   Current config will be used
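
The cell above depends on one exact string being present in the agent file. A minimal sketch of a more tolerant substitution follows, assuming the agent source sets `base_url` to `http://localhost:<port>/v1` in some form (the actual line in `uitars_agent.py` is not shown in this notebook). `point_agent_at` is a hypothetical helper name.

```python
import re

# Matches base_url="http://localhost:<port>/v1" with either quote style.
# Assumption: the agent source uses this shape; adjust the pattern if not.
_LOCAL_BASE_URL = re.compile(
    r"""base_url\s*=\s*(["'])http://localhost:\d+/v1\1"""
)


def point_agent_at(source: str, server_url: str) -> tuple[str, int]:
    """Return (updated_source, number_of_replacements)."""
    target = f'base_url="{server_url.rstrip("/")}/v1"'
    # A function replacement avoids backslash-escape handling in server_url.
    return _LOCAL_BASE_URL.subn(lambda _match: target, source)
```

Returning the replacement count lets the caller distinguish "already configured" from "pattern not found" instead of printing the same message for both.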


## 4. Run Evaluation on Original Chrome Tasks (10 tasks)

In [16]:
import subprocess
import time
import shutil

# Create/clear results directory
results_dir_chrome = ARPO_ROOT / "results" / "gpu_eval_chrome_10"

# Clear previous results if they exist
if results_dir_chrome.exists():
    print(f"🧹 Clearing previous results from: {results_dir_chrome}")
    shutil.rmtree(results_dir_chrome)

results_dir_chrome.mkdir(parents=True, exist_ok=True)

print("🚀 Starting evaluation on 10 CHROME tasks (original)...")
print(f"📁 Results will be saved to: {results_dir_chrome}")
print("⏱️  Expected time: ~45 minutes (10 tasks × ~4.5 min)")
print("="*70)

start_time = time.time()

# Run OSWorld evaluation
cmd = [
    "python", "run_uitars.py",
    "--headless",
    "--observation_type", "screenshot",
    "--max_steps", "15",
    "--model", "arpo-uitars-7b",
    "--temperature", "0.6",
    "--max_tokens", "256",
    "--test_config_base_dir", "../test_data/osworld_examples",
    "--test_all_meta_path", "../test_data/osworld_examples/test_chrome_10.json",
    "--result_dir", str(results_dir_chrome),
]

try:
    # Run from OSWorld directory
    result = subprocess.run(
        cmd,
        cwd=ARPO_ROOT / "OSWorld",
        capture_output=True,
        text=True,
        timeout=3600  # 60 min timeout (10 tasks √ó ~4.5 min + buffer)
    )
    
    elapsed = time.time() - start_time
    
    print("\n" + "="*70)
    print(f"‚úÖ Evaluation complete in {elapsed/60:.1f} minutes!")
    print("="*70)
    
    # Show last 50 lines of output
    print("\nüìä Last 50 lines of output:")
    print("\n".join(result.stdout.split("\n")[-50:]))
    
    if result.returncode != 0:
        print(f"\n‚ö†Ô∏è  Warning: Process returned code {result.returncode}")
        print("Last 20 lines of stderr:")
        print("\n".join(result.stderr.split("\n")[-20:]))
        
except subprocess.TimeoutExpired:
    print("‚ùå Evaluation timed out after 60 minutes")
    print("‚ö†Ô∏è  Some tasks may have completed - check results folder")
except Exception as e:
    print(f"‚ùå Error: {e}")

üßπ Clearing previous results from: /Users/hanszhu/Desktop/ARPO_replicate/results/gpu_eval_chrome_10
üöÄ Starting evaluation on 10 CHROME tasks (original)...
üìÅ Results will be saved to: /Users/hanszhu/Desktop/ARPO_replicate/results/gpu_eval_chrome_10
‚è±Ô∏è  Expected time: ~45 minutes (10 tasks √ó ~4.5 min)

‚úÖ Evaluation complete in 27.6 minutes!

üìä Last 50 lines of output:
'''

pyautogui.click(680.745, 375.824, button='left')
[1;33m[2026-01-19 18:07:54,514 [31mINFO [32mpython/125-MainProcess[1;33m] [0mCommand executed successfully: {
  "error": "",
  "output": "",
  "returncode": 0,
  "status": "success"
}

[1;33m[2026-01-19 18:07:54,683 [31mINFO [32mpython/32-MainProcess[1;33m] [0mGot screenshot successfully
[1;33m[2026-01-19 18:07:54,683 [31mINFO [32mlib_run_single/53-MainProcess[1;33m] [0mReward: 0.00
[1;33m[2026-01-19 18:07:54,683 [31mINFO [32mlib_run_single/54-MainProcess[1;33m] [0mDone: False
Action: finished(content='Â∑≤‰∏∫‰Ω†ÂºÄÂêØChromeÁöÑËøô‰∏™Â
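
Because the run is captured with `capture_output=True`, the logger's terminal colour codes can end up in `result.stdout` as raw escape sequences. A small sketch for cleaning the captured text and pulling out the logged rewards, assuming the colours are standard ANSI SGR sequences and that reward lines look like `Reward: 0.00` as in the output above. `strip_ansi` and `extract_rewards` are hypothetical helper names.

```python
import re

# ANSI SGR (colour/style) sequences such as ESC[1;33m or ESC[0m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Reward lines as they appear in the log excerpt, e.g. "Reward: 0.00".
REWARD_LINE = re.compile(r"Reward:\s*(-?\d+(?:\.\d+)?)")


def strip_ansi(text: str) -> str:
    """Remove colour codes so the log is readable in notebook output."""
    return ANSI_ESCAPE.sub("", text)


def extract_rewards(stdout: str) -> list[float]:
    """Return every logged reward value, in the order it was printed."""
    return [float(m.group(1)) for m in REWARD_LINE.finditer(strip_ansi(stdout))]
```

In the evaluation cell this could be applied as `strip_ansi(result.stdout)` before slicing off the last 50 lines.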

## 5. View Results for Original Tasks

In [17]:
import json
from pathlib import Path

def analyze_results(result_dir):
    """Analyze OSWorld evaluation results"""
    results = []
    
    # Find all result.txt files
    for result_file in Path(result_dir).rglob("result.txt"):
        task_id = result_file.parent.name
        domain = result_file.parent.parent.name
        
        try:
            score = float(result_file.read_text().strip())
            results.append({
                "task_id": task_id,
                "domain": domain,
                "score": score
            })
        except (ValueError, OSError):
            # Skip result files that are unreadable or do not contain a number
            continue

    if not results:
        print("⚠️  No results found yet")
        return None

    # Print summary
    print("="*70)
    print(f"📊 Results Summary ({len(results)} tasks)")
    print("="*70)

    for r in results:
        status = "✅ PASS" if r["score"] >= 0.9 else "❌ FAIL"
        print(f"{status} | {r['task_id'][:20]:20s} | Score: {r['score']:.2f}")
    
    avg_score = sum(r["score"] for r in results) / len(results)
    success_rate = sum(1 for r in results if r["score"] >= 0.9) / len(results) * 100
    
    print("="*70)
    print(f"Average Score: {avg_score:.3f}")
    print(f"Success Rate:  {success_rate:.1f}% ({sum(1 for r in results if r['score'] >= 0.9)}/{len(results)})")
    print("="*70)
    
    return results

# Analyze Chrome tasks
print("\n🔍 Analyzing CHROME task results...\n")
chrome_results = analyze_results(results_dir_chrome)


🔍 Analyzing CHROME task results...

📊 Results Summary (9 tasks)
✅ PASS | 368d9ba4-203c-40c1-9 | Score: 1.00
❌ FAIL | 2ae9ba84-3a0d-4d4c-8 | Score: 0.00
✅ PASS | 9656a811-9b5b-4ddf-9 | Score: 1.00
✅ PASS | 2ad9387a-65d8-4e33-a | Score: 1.00
✅ PASS | 3720f614-37fd-4d04-8 | Score: 1.00
❌ FAIL | 06fe7178-4491-4589-8 | Score: 0.00
❌ FAIL | 0d8b7de3-e8de-4d86-b | Score: 0.00
✅ PASS | 3299584d-8f11-4457-b | Score: 1.00
❌ FAIL | 7b6c7e24-c58a-49fc-a | Score: 0.00
Average Score: 0.556
Success Rate:  55.6% (5/9)
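
The summary reports 9 results although the run was started on 10 tasks. A minimal sketch for listing which task IDs never produced a `result.txt`, assuming the meta file (for example `test_chrome_10.json`) maps each domain name to a list of task IDs; that layout is an assumption about this project's test data. `find_missing_tasks` is a hypothetical helper, not part of OSWorld.

```python
import json
from pathlib import Path


def find_missing_tasks(meta_path, result_dir):
    """Return {domain: [task_id, ...]} for tasks listed in the meta file
    that have no result.txt anywhere under result_dir."""
    # Assumption: the meta file looks like {"chrome": ["<task-id>", ...]}.
    meta = json.loads(Path(meta_path).read_text())
    # analyze_results() uses the parent folder name as the task ID, so do the same.
    finished = {p.parent.name for p in Path(result_dir).rglob("result.txt")}
    missing = {}
    for domain, task_ids in meta.items():
        absent = [task_id for task_id in task_ids if task_id not in finished]
        if absent:
            missing[domain] = absent
    return missing
```

A task that appears here either crashed before scoring or was still running when the process ended, so its trajectory folder is the next place to look.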


## 6. Run Evaluation on Noisy Chrome Tasks (10 tasks)

In [18]:
import subprocess
import time
import shutil

# Create/clear results directory
results_dir_chrome_noisy = ARPO_ROOT / "results" / "gpu_eval_chrome_noisy_10"

# Clear previous results if they exist
if results_dir_chrome_noisy.exists():
    print(f"🧹 Clearing previous results from: {results_dir_chrome_noisy}")
    shutil.rmtree(results_dir_chrome_noisy)

results_dir_chrome_noisy.mkdir(parents=True, exist_ok=True)

print("🚀 Starting evaluation on 10 CHROME tasks (noisy)...")
print(f"📁 Results will be saved to: {results_dir_chrome_noisy}")
print("⏱️  Expected time: ~45 minutes (10 tasks × ~4.5 min)")
print("="*70)

start_time = time.time()

# Run OSWorld evaluation on noisy Chrome tasks
cmd = [
    "python", "run_uitars.py",
    "--headless",
    "--observation_type", "screenshot",
    "--max_steps", "15",
    "--model", "arpo-uitars-7b",
    "--temperature", "0.6",
    "--max_tokens", "256",
    "--test_config_base_dir", "../test_data/osworld_examples",
    "--test_all_meta_path", "../test_data/osworld_examples/test_chrome_noisy_10.json",
    "--result_dir", str(results_dir_chrome_noisy),
]

try:
    result = subprocess.run(
        cmd,
        cwd=ARPO_ROOT / "OSWorld",
        capture_output=True,
        text=True,
        timeout=3600  # 60 min timeout (10 tasks × ~4.5 min + buffer)
    )

    elapsed = time.time() - start_time

    print("\n" + "="*70)
    print(f"✅ Evaluation complete in {elapsed/60:.1f} minutes!")
    print("="*70)

    # Show last 50 lines
    print("\n📊 Last 50 lines of output:")
    print("\n".join(result.stdout.split("\n")[-50:]))

    if result.returncode != 0:
        print(f"\n⚠️  Warning: Process returned code {result.returncode}")
        print("Last 20 lines of stderr:")
        print("\n".join(result.stderr.split("\n")[-20:]))

except subprocess.TimeoutExpired:
    print("❌ Evaluation timed out after 60 minutes")
    print("⚠️  Some tasks may have completed - check results folder")
except Exception as e:
    print(f"❌ Error: {e}")

🚀 Starting evaluation on 10 CHROME tasks (noisy)...
📁 Results will be saved to: /Users/hanszhu/Desktop/ARPO_replicate/results/gpu_eval_chrome_noisy_10
⏱️  Expected time: ~45 minutes (10 tasks × ~4.5 min)

✅ Evaluation complete in 35.2 minutes!

📊 Last 50 lines of output:
'''

pyautogui.click(673.789, 380.769, button='left')
[2026-01-19 19:09:27,527 INFO python/125-MainProcess] Command executed successfully: {
  "error": "",
  "output": "",
  "returncode": 0,
  "status": "success"
}

[2026-01-19 19:09:27,761 INFO python/32-MainProcess] Got screenshot successfully
[2026-01-19 19:09:27,761 INFO lib_run_single/53-MainProcess] Reward: 0.00
[2026-01-19 19:09:27,761 INFO lib_run_single/54-MainProcess] Done: False
Action: finished(content='Chrome will automatically alert you when you visit a potentially harmful website. I have already enabled this safety feature by ac

## 7. View Results for Noisy Tasks

print("\n🔍 Analyzing NOISY CHROME task results...\n")
chrome_noisy_results = analyze_results(results_dir_chrome_noisy)

## 8. Compare Original vs Noisy Results

In [None]:
# Check that both result sets exist (run the evaluation and analysis cells for both task sets first)
try:
    chrome_results
    chrome_noisy_results
    has_both = True
except NameError:
    has_both = False

if has_both and chrome_results and chrome_noisy_results:
    print("\n" + "="*70)
    print("📊 CHROME: Original vs Noisy Comparison")
    print("="*70)
    
    orig_avg = sum(r["score"] for r in chrome_results) / len(chrome_results)
    noisy_avg = sum(r["score"] for r in chrome_noisy_results) / len(chrome_noisy_results)
    
    orig_success = sum(1 for r in chrome_results if r["score"] >= 0.9) / len(chrome_results) * 100
    noisy_success = sum(1 for r in chrome_noisy_results if r["score"] >= 0.9) / len(chrome_noisy_results) * 100
    
    print(f"\nOriginal Chrome Tasks ({len(chrome_results)}):")
    print(f"  Average Score:  {orig_avg:.3f}")
    print(f"  Success Rate:   {orig_success:.1f}%")
    print(f"  Passed:         {sum(1 for r in chrome_results if r['score'] >= 0.9)}/{len(chrome_results)}")
    
    print(f"\nNoisy Chrome Tasks ({len(chrome_noisy_results)}):")
    print(f"  Average Score:  {noisy_avg:.3f}")
    print(f"  Success Rate:   {noisy_success:.1f}%")
    print(f"  Passed:         {sum(1 for r in chrome_noisy_results if r['score'] >= 0.9)}/{len(chrome_noisy_results)}")
    
    print(f"\nRobustness (Noisy/Original):")
    print(f"  Score Ratio:    {noisy_avg/orig_avg:.2%}" if orig_avg > 0 else "  Score Ratio:  N/A")
    print(f"  Success Ratio:  {noisy_success/orig_success:.2%}" if orig_success > 0 else "  Success Ratio:  N/A")
    
    print("="*70)
    
    # Expected performance (from ARPO paper)
    print("\n📚 Expected Performance (from ARPO paper):")
    print("  ARPO UITARS1.5 7B on OSWorld Chrome: ~22.6% success rate")
    print("  (10 tasks is a good sample for initial testing)")
    print("="*70)
else:
    print("⚠️  Please run the earlier cells in order:")
    print("   1. Run the original Chrome tasks")
    print("   2. Analyze the original results")
    print("   3. Run the noisy Chrome tasks")
    print("   4. Analyze the noisy results")
    print("   5. Then run this cell to compare")

NameError: name 'chrome_noisy_results' is not defined
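
With only about 10 tasks per condition, a success rate such as 5/9 carries wide uncertainty, so a difference between the original and noisy runs may not be meaningful on its own. As a hedged illustration, the sketch below computes a Wilson score interval for a binomial proportion; this is a standard statistical formula, not something the notebook or the ARPO paper uses. `wilson_interval` is a hypothetical helper, and `z = 1.96` corresponds to roughly 95% confidence.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (low, high) for a success proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


low, high = wilson_interval(5, 9)
print(f"5/9 passed: {low:.1%} to {high:.1%}")
```

For the 5/9 result above the interval spans roughly 27% to 81%, which is why a larger task sample is needed before drawing conclusions about robustness.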

## 9. Inspect Individual Task Trajectories

In [None]:
def view_trajectory(result_dir, task_id):
    """View detailed trajectory for a specific task"""
    traj_file = None
    for f in Path(result_dir).rglob(f"{task_id}/traj.jsonl"):
        traj_file = f
        break
    
    if not traj_file:
        print(f"⚠️  Trajectory not found for {task_id}")
        return

    print(f"\n📝 Trajectory: {task_id}")
    print("="*70)
    
    with open(traj_file) as f:
        steps = [json.loads(line) for line in f]
    
    for i, step in enumerate(steps, 1):
        print(f"\nStep {i}:")
        if "prediction" in step:
            pred = step["prediction"]
            if isinstance(pred, str):
                # Show first 200 chars
                print(f"  Prediction: {pred[:200]}...")
            else:
                print(f"  Prediction: {pred}")
        if "action" in step:
            print(f"  Action: {step['action']}")
        if "reward" in step:
            print(f"  Reward: {step['reward']}")
        if "done" in step:
            print(f"  Done: {step['done']}")
    
    print("="*70)

# Example: View first task from Chrome results
if chrome_results:
    first_task = chrome_results[0]["task_id"]
    print(f"\n🔍 Viewing trajectory for first Chrome task: {first_task}")
    view_trajectory(results_dir_chrome, first_task)
else:
    print("⚠️  No results available yet")

## 10. Restore Original Agent Configuration

After evaluation, restore the agent to use localhost.

In [None]:
# Restore backup
if backup_file.exists():
    shutil.copy(backup_file, agent_file)
    print("✅ Restored original agent configuration")
    print("   Agent is now using localhost:9000 again")
else:
    print("⚠️  No backup found")
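
Restoring by hand is easy to forget if an evaluation cell fails partway through. One possible alternative is a context manager that patches the file and always restores it on exit. This is a sketch under the assumption that a plain string replacement is enough for the agent file; `temporarily_patched` is a hypothetical helper name.

```python
from contextlib import contextmanager
from pathlib import Path
import shutil


@contextmanager
def temporarily_patched(path, old: str, new: str):
    """Replace `old` with `new` in the file at `path`, restoring it on exit."""
    path = Path(path)
    backup = path.with_name(path.name + ".backup")
    shutil.copy2(path, backup)  # take a fresh backup of the current contents
    try:
        path.write_text(path.read_text().replace(old, new))
        yield path
    finally:
        # Runs even if the body raised, so the original file always comes back.
        backup.replace(path)
```

Usage would wrap the evaluation call, for example `with temporarily_patched(agent_file, "http://localhost:9000", GPU_SERVER_URL): ...`, so the localhost configuration returns as soon as the block ends.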

## Summary

### What This Notebook Does

1. ✅ Connects to Colab GPU server (via ngrok)
2. ✅ Runs 10 original Chrome tasks
3. ✅ Runs 10 noisy Chrome tasks
4. ✅ Analyzes results and computes success rates
5. ✅ Compares robustness (original vs noisy)
6. ✅ Restores original configuration

### Results Location

- **Original Chrome**: `results/gpu_eval_chrome_10/`
- **Noisy Chrome**: `results/gpu_eval_chrome_noisy_10/`

Each task folder contains:
- `traj.jsonl` - Step-by-step log
- `result.txt` - Final score (0.0 or 1.0)
- `step_*.png` - Screenshots
- `recording.mp4` - Video

### Dataset Info

- **Total tasks**: 128 across 10 domains
- **Chrome tasks**: Testing first 10 (out of 18 available)
- **Other domains available**: gimp (19), vs_code (19), libreoffice_calc (12), etc.

### Next Steps

- Analyze failed Chrome tasks to understand errors
- Compare with ARPO paper results (~22.6% on full OSWorld)
- Test other domains if needed
- Use insights for further training/fine-tuning