# ARPO OSWorld Evaluation - GPU Testing

This notebook runs the **complete OSWorld evaluation** on 10 tasks (5 original + 5 noisy) using the GPU inference server.

## Prerequisites

1. **GPU Server Running**: Start `notebooks/GPU_Server_for_OSWorld.ipynb` on Colab first
2. **VMware VM Ready**: Your Mac OSWorld VM should be set up and ready
3. **Server URL**: Copy the ngrok URL from the GPU server notebook

---

## Setup Summary

- **Model**: ARPO UITARS 7B (running on Colab GPU)
- **Tasks**: 10 OSWorld tasks (Chrome domain)
- **Expected Time**: ~10-15 minutes (vs 10 hours on CPU!)
- **Results**: Saved to `results/gpu_eval_original/` and `results/gpu_eval_noisy/`

## 1. Check Environment

In [None]:
import os
import sys
from pathlib import Path

# Get project root
ARPO_ROOT = Path("/Users/hanszhu/Desktop/ARPO_replicate")
os.chdir(ARPO_ROOT)

# Add OSWorld to path
sys.path.insert(0, str(ARPO_ROOT / "OSWorld"))

print(f"✅ Working directory: {os.getcwd()}")
print(f"✅ OSWorld path: {ARPO_ROOT / 'OSWorld'}")
print(f"✅ Test data: {ARPO_ROOT / 'test_data' / 'osworld_examples'}")
print(f"✅ Results will be saved under: {ARPO_ROOT / 'results'}")

## 2. Configure GPU Server URL

**⚠️ Important**: Update this with your actual ngrok URL from the GPU server notebook!

In [None]:
# UPDATE THIS WITH YOUR NGROK URL FROM COLAB!
GPU_SERVER_URL = "https://YOUR-NGROK-URL.ngrok.io"  # Example: https://1234-56-78-90-12.ngrok.io

# Test connection
import requests

if GPU_SERVER_URL == "https://YOUR-NGROK-URL.ngrok.io":
    print("⚠️  WARNING: You need to update GPU_SERVER_URL with your actual ngrok URL!")
    print("   Get it from the GPU server notebook (Cell 4 output)")
else:
    try:
        response = requests.get(f"{GPU_SERVER_URL}/health", timeout=5)
        if response.status_code == 200:
            print(f"✅ GPU Server is reachable: {GPU_SERVER_URL}")
            print(f"✅ Server status: {response.json()}")
        else:
            print(f"❌ Server returned status {response.status_code}")
    except (requests.RequestException, ValueError) as e:
        # RequestException covers connection errors and timeouts;
        # ValueError covers a health response that is not valid JSON
        print(f"❌ Cannot reach server: {e}")
        print("   Make sure GPU server notebook is running on Colab!")
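A pasted ngrok URL often carries stray whitespace or a trailing slash, which would produce a double slash in `{GPU_SERVER_URL}/health`. The sketch below shows one way to clean the value before use. The helper name `normalize_server_url` is hypothetical and not part of this notebook, and the example URL is the placeholder shape shown above, not a real endpoint.

```python
from urllib.parse import urlparse


def normalize_server_url(url):
    """Return the URL without surrounding whitespace or trailing slashes.

    Raises ValueError if the scheme is not http(s) or the host is missing.
    """
    cleaned = url.strip().rstrip("/")
    parsed = urlparse(cleaned)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a valid http(s) URL: {url!r}")
    return cleaned


# Hypothetical example value in the same shape as the placeholder above
print(normalize_server_url(" https://1234-56-78-90-12.ngrok.io/ "))
```

Assigning the result back to `GPU_SERVER_URL` before the health check would keep later cells from building malformed URLs.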

## 3. Update OSWorld Agent Configuration

Update the agent to use the Colab GPU server instead of localhost.

In [None]:
import shutil

# Back up the original agent file once, before the first modification
agent_file = ARPO_ROOT / "OSWorld" / "mm_agents" / "uitars_agent.py"
backup_file = agent_file.with_suffix('.py.backup')

if not backup_file.exists():
    shutil.copy(agent_file, backup_file)
    print(f"✅ Created backup: {backup_file}")

# Point base_url at the GPU server instead of localhost
old_base_url = 'base_url="http://localhost:9000/v1"'
new_base_url = f'base_url="{GPU_SERVER_URL.rstrip("/")}/v1"'

agent_content = agent_file.read_text()
if old_base_url in agent_content:
    # Only report success when the exact string was found and replaced
    agent_file.write_text(agent_content.replace(old_base_url, new_base_url))
    print(f"✅ Updated agent to use: {GPU_SERVER_URL.rstrip('/')}/v1")
else:
    print("⚠️  Expected localhost base_url not found; agent file left unchanged")
    print("   Current config will be used")
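The exact-string replacement above only works while the agent still points at `localhost:9000`. A pattern-based substitution can also re-point an agent that was already patched with an older ngrok URL. This is a minimal sketch, assuming the agent file sets `base_url` as a plain quoted string literal; `set_base_url` and `BASE_URL_PATTERN` are hypothetical names.

```python
import re

# Matches base_url="..." or base_url='...'. Assumes the URL is a plain
# string literal, which is an assumption about uitars_agent.py.
BASE_URL_PATTERN = re.compile(r"""base_url\s*=\s*(["'])[^"']*\1""")


def set_base_url(source, new_url):
    """Return (updated_source, number_of_replacements)."""
    replacement = f'base_url="{new_url}"'
    # A callable replacement avoids backslash-escape surprises in new_url
    return BASE_URL_PATTERN.subn(lambda _match: replacement, source)


# Hypothetical one-line stand-in for the agent source
sample = 'client = make_client(base_url="http://localhost:9000/v1")'
updated, count = set_base_url(sample, "https://example.ngrok.io/v1")
print(count, updated)
```

Checking that the returned count is at least 1 gives the same safety as the exact-match test, without depending on the previous URL.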

## 4. Run Evaluation on Original Tasks (5 tasks)

In [None]:
## 5. View Results for Original Tasks

In [None]:
import json
from pathlib import Path

def analyze_results(result_dir):
    """Analyze OSWorld evaluation results"""
    results = []
    
    # Find all result.txt files
    for result_file in Path(result_dir).rglob("result.txt"):
        task_id = result_file.parent.name
        domain = result_file.parent.parent.name
        
        try:
            score = float(result_file.read_text().strip())
            results.append({
                "task_id": task_id,
                "domain": domain,
                "score": score
            })
        except ValueError:
            # Skip result files that do not contain a parseable number
            pass
    
    if not results:
        print("⚠️  No results found yet")
        return None
    
    # Print summary
    print("="*70)
    print(f"📊 Results Summary ({len(results)} tasks)")
    print("="*70)
    
    for r in results:
        status = "✅ PASS" if r["score"] >= 0.9 else "❌ FAIL"
        print(f"{status} | {r['task_id'][:20]:20s} | Score: {r['score']:.2f}")
    
    avg_score = sum(r["score"] for r in results) / len(results)
    num_passed = sum(1 for r in results if r["score"] >= 0.9)
    success_rate = num_passed / len(results) * 100
    
    print("="*70)
    print(f"Average Score: {avg_score:.3f}")
    print(f"Success Rate:  {success_rate:.1f}% ({num_passed}/{len(results)})")
    print("="*70)
    
    return results

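# Analyze original tasks
# Directory taken from the "Results Location" section of this notebook
results_dir_original = ARPO_ROOT / "results" / "gpu_eval_original"
print("\n🔍 Analyzing ORIGINAL task results...\n")
original_results = analyze_results(results_dir_original)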
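`analyze_results` records a `domain` for every task but only reports overall numbers. The following sketch groups the same result dictionaries by domain. `summarize_by_domain` is a hypothetical helper, and the sample data is invented purely to illustrate the shape of the input.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.9  # same cutoff that analyze_results uses


def summarize_by_domain(results):
    """Group result dicts by 'domain' and compute per-domain statistics."""
    by_domain = defaultdict(list)
    for r in results:
        by_domain[r["domain"]].append(r["score"])

    summary = {}
    for domain, scores in by_domain.items():
        passed = sum(1 for s in scores if s >= PASS_THRESHOLD)
        summary[domain] = {
            "tasks": len(scores),
            "avg_score": sum(scores) / len(scores),
            "success_rate": passed / len(scores) * 100,
        }
    return summary


# Invented sample in the format returned by analyze_results
sample_results = [
    {"task_id": "task-a", "domain": "chrome", "score": 1.0},
    {"task_id": "task-b", "domain": "chrome", "score": 0.0},
]
print(summarize_by_domain(sample_results))
```

With all ten tasks in the Chrome domain the breakdown has a single entry, but the same helper applies unchanged if tasks from other domains are added.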

## 6. Run Evaluation on Noisy Tasks (5 tasks)

In [None]:
import subprocess
import time

# Create results directory
results_dir_noisy = ARPO_ROOT / "results" / "gpu_eval_noisy"
results_dir_noisy.mkdir(parents=True, exist_ok=True)

print("🚀 Starting evaluation on 5 NOISY tasks...")
print(f"📁 Results will be saved to: {results_dir_noisy}")
print("⏱️  Expected time: ~5-8 minutes")
print("="*70)

start_time = time.time()

# Run OSWorld evaluation on noisy tasks
cmd = [
    "python", "run_uitars.py",
    "--headless",
    "--observation_type", "screenshot",
    "--max_steps", "15",
    "--model", "arpo-uitars-7b",
    "--temperature", "0.6",
    "--max_tokens", "256",
    "--test_all_meta_path", "../test_data/osworld_examples/test_10tasks_noisy.json",
    "--result_dir", str(results_dir_noisy),
]

try:
    result = subprocess.run(
        cmd,
        cwd=ARPO_ROOT / "OSWorld",
        capture_output=True,
        text=True,
        timeout=1800  # 30 min timeout
    )
    
    elapsed = time.time() - start_time
    
    print("\n" + "="*70)
    if result.returncode == 0:
        print(f"✅ Evaluation complete in {elapsed/60:.1f} minutes!")
    else:
        # Do not report success when the evaluation script failed
        print(f"⚠️  Evaluation exited with code {result.returncode} after {elapsed/60:.1f} minutes")
    print("="*70)
    
    # Show last 50 lines
    print("\n📊 Last 50 lines of output:")
    print("\n".join(result.stdout.splitlines()[-50:]))
    
    if result.returncode != 0:
        print("\nLast 20 lines of stderr:")
        print("\n".join(result.stderr.splitlines()[-20:]))
        
except subprocess.TimeoutExpired:
    print("❌ Evaluation timed out after 30 minutes")
except Exception as e:
    print(f"❌ Error: {e}")
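The original and noisy runs differ only in the task list and the output directory, so the command can be built by one function. This is a sketch that reuses the flags from the cell above. `build_eval_cmd` is a hypothetical helper, and it swaps the bare `"python"` for `sys.executable` so the subprocess uses the same interpreter as the notebook kernel. The meta file for the original tasks is not shown in this notebook, so it stays a parameter.

```python
import sys
from pathlib import Path


def build_eval_cmd(meta_path, result_dir, max_steps=15, temperature=0.6, max_tokens=256):
    """Build the run_uitars.py command line as a list for subprocess.run."""
    return [
        sys.executable, "run_uitars.py",
        "--headless",
        "--observation_type", "screenshot",
        "--max_steps", str(max_steps),
        "--model", "arpo-uitars-7b",
        "--temperature", str(temperature),
        "--max_tokens", str(max_tokens),
        "--test_all_meta_path", str(meta_path),
        "--result_dir", str(result_dir),
    ]


# Same arguments as the noisy run above
noisy_cmd = build_eval_cmd(
    "../test_data/osworld_examples/test_10tasks_noisy.json",
    Path("results") / "gpu_eval_noisy",
)
print(" ".join(noisy_cmd))
```

Passing the command as a list, rather than a shell string, avoids quoting problems when a path contains spaces.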

## 7. View Results for Noisy Tasks

In [None]:
print("\n🔍 Analyzing NOISY task results...\n")
noisy_results = analyze_results(results_dir_noisy)

## 8. Compare Original vs Noisy Results

In [None]:
if original_results and noisy_results:
    print("\n" + "="*70)
    print("📊 ORIGINAL vs NOISY Comparison")
    print("="*70)
    
    orig_avg = sum(r["score"] for r in original_results) / len(original_results)
    noisy_avg = sum(r["score"] for r in noisy_results) / len(noisy_results)
    
    orig_success = sum(1 for r in original_results if r["score"] >= 0.9) / len(original_results) * 100
    noisy_success = sum(1 for r in noisy_results if r["score"] >= 0.9) / len(noisy_results) * 100
    
    print(f"\nOriginal Tasks ({len(original_results)}):")
    print(f"  Average Score:  {orig_avg:.3f}")
    print(f"  Success Rate:   {orig_success:.1f}%")
    
    print(f"\nNoisy Tasks ({len(noisy_results)}):")
    print(f"  Average Score:  {noisy_avg:.3f}")
    print(f"  Success Rate:   {noisy_success:.1f}%")
    
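    print("\nRobustness (Noisy/Original):")
    # Guard both ratios: the original average or success rate can be zero
    print(f"  Score Ratio:    {noisy_avg/orig_avg:.2%}" if orig_avg > 0 else "  Score Ratio:    N/A")
    print(f"  Success Ratio:  {noisy_success/orig_success:.2%}" if orig_success > 0 else "  Success Ratio:  N/A")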
    
    print("="*70)
    
    # Expected performance (from ARPO paper)
    print("\n📚 Expected Performance (from ARPO paper):")
    print("  ARPO UITARS1.5 7B on OSWorld: ~22.6% success rate")
    print("  (This is a subset of 10 tasks, so results may vary)")
    print("="*70)
else:
    print("⚠️  Run cells 4 and 6 first to get results")
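The robustness figures are plain ratios of the noisy metric to the original metric, so they are undefined when the original metric is zero. The sketch below isolates that rule in two small helpers. `safe_ratio` and `format_ratio` are hypothetical names, and the numbers are invented for illustration, not measured results.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def format_ratio(ratio):
    """Format a ratio as a percentage, or 'N/A' when it is undefined."""
    return "N/A" if ratio is None else f"{ratio:.2%}"


# Invented numbers: original average 0.4, noisy average 0.2
print(format_ratio(safe_ratio(0.2, 0.4)))  # prints 50.00%
print(format_ratio(safe_ratio(0.2, 0.0)))  # prints N/A
```

A ratio of 50% here would mean the agent keeps half of its original score under noise. With only five tasks per set, a single task changes the success rate by 20 percentage points, so the ratio is a rough indicator at best.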

## 9. Inspect Individual Task Trajectories

In [None]:
def view_trajectory(result_dir, task_id):
    """View detailed trajectory for a specific task"""
    traj_file = None
    for f in Path(result_dir).rglob(f"{task_id}/traj.jsonl"):
        traj_file = f
        break
    
    if not traj_file:
        print(f"⚠️  Trajectory not found for {task_id}")
        return
    
    print(f"\n📝 Trajectory: {task_id}")
    print("="*70)
    
    with open(traj_file) as f:
        # Skip blank lines so a trailing newline does not break json.loads
        steps = [json.loads(line) for line in f if line.strip()]
    
    for i, step in enumerate(steps, 1):
        print(f"\nStep {i}:")
        if "prediction" in step:
            pred = step["prediction"]
            if isinstance(pred, str):
                # Show first 200 chars, marking truncation only when it happens
                print(f"  Prediction: {pred[:200]}{'...' if len(pred) > 200 else ''}")
            else:
                print(f"  Prediction: {pred}")
        if "action" in step:
            print(f"  Action: {step['action']}")
        if "reward" in step:
            print(f"  Reward: {step['reward']}")
        if "done" in step:
            print(f"  Done: {step['done']}")
    
    print("="*70)

# Example: View first task from original results
if original_results:
    first_task = original_results[0]["task_id"]
    print(f"\n🔍 Viewing trajectory for first task: {first_task}")
    view_trajectory(results_dir_original, first_task)
else:
    print("⚠️  No results available yet")
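For comparing many tasks at once, a compact summary per trajectory can be easier to scan than the full step listing. This is a sketch that assumes each line of `traj.jsonl` is a JSON object which may carry the `reward` and `done` keys printed by `view_trajectory`. `load_trajectory` and `summarize_trajectory` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_trajectory(traj_path):
    """Read a traj.jsonl file into a list of dicts, skipping blank lines."""
    steps = []
    with open(traj_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                steps.append(json.loads(line))
    return steps


def summarize_trajectory(steps):
    """Return step count, whether any step set 'done', and the summed reward."""
    return {
        "num_steps": len(steps),
        "finished": any(step.get("done") for step in steps),
        # Treat a missing or null reward as 0
        "total_reward": sum(step.get("reward", 0) or 0 for step in steps),
    }
```

A possible use is `summarize_trajectory(load_trajectory(path))` for every `traj.jsonl` found under a results directory. Tasks that used all 15 steps without `finished` being true are natural candidates for closer inspection.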

## 10. Restore Original Agent Configuration

After evaluation, restore the agent to use localhost.

In [None]:
# Restore backup
if backup_file.exists():
    shutil.copy(backup_file, agent_file)
    print("✅ Restored original agent configuration")
    print("   Agent is now using localhost:9000 again")
else:
    print("⚠️  No backup found")
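Restoring in a separate cell is easy to forget when an earlier cell fails. One alternative is a context manager that patches the file and always restores it on exit. This is a minimal sketch, not the notebook's method. `temporarily_patched` is a hypothetical helper, and it writes its backup with a `.tmp_backup` suffix so it does not overwrite the `.py.backup` file created in step 3.

```python
import os
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporarily_patched(path, old, new):
    """Replace `old` with `new` in a file, restoring the original on exit."""
    path = Path(path)
    backup = path.with_name(path.name + ".tmp_backup")
    shutil.copy(path, backup)
    try:
        path.write_text(path.read_text().replace(old, new))
        yield path
    finally:
        # os.replace overwrites the patched file and removes the backup
        os.replace(backup, path)
```

Running the evaluation inside `with temporarily_patched(agent_file, old_url, new_url):` would restore the agent even if the evaluation raises. It does not protect against the kernel being killed mid-run, so the manual restore cell remains useful as a fallback.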

## Summary

### What This Notebook Does

1. ✅ Connects to Colab GPU server (via ngrok)
2. ✅ Runs 5 original OSWorld tasks
3. ✅ Runs 5 noisy OSWorld tasks
4. ✅ Analyzes results and computes success rates
5. ✅ Compares robustness (original vs noisy)
6. ✅ Restores original configuration

### Results Location

- **Original**: `results/gpu_eval_original/`
- **Noisy**: `results/gpu_eval_noisy/`

Each task folder contains:
- `traj.jsonl` - Step-by-step log
- `result.txt` - Final score (0.0 or 1.0)
- `step_*.png` - Screenshots
- `recording.mp4` - Video
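A quick completeness check can flag tasks whose run was interrupted before all artifacts were written. The sketch below tests a task folder against the four artifact types listed above. `missing_artifacts` is a hypothetical helper, and it assumes the files sit directly inside the task folder.

```python
from pathlib import Path

EXPECTED_FILES = ("traj.jsonl", "result.txt", "recording.mp4")


def missing_artifacts(task_dir):
    """Return the names of expected artifacts that are absent from task_dir."""
    task_dir = Path(task_dir)
    missing = [name for name in EXPECTED_FILES if not (task_dir / name).is_file()]
    # Screenshots are numbered, so check for at least one match
    if not any(task_dir.glob("step_*.png")):
        missing.append("step_*.png")
    return missing
```

An empty list means the folder is complete. A folder that lacks `result.txt` is skipped by `analyze_results`, so it would otherwise be silently excluded from the success rate.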

### Next Steps

- Analyze failed tasks to understand errors
- Compare with ARPO paper results (~22.6% on full OSWorld)
- Use insights for further training/fine-tuning