# ARPO Training - UI-TARS-2B (Colab GPU + Mac OSWorld)

Train UI-TARS-2B on 128 OSWorld tasks. OSWorld runs on the Mac in a VMware Ubuntu VM, and a Colab GPU serves model inference.

## Prerequisites

- ✅ Colab GPU server running
- ✅ VMware + Ubuntu VM ready
- ✅ wandb configured

See `PRE_TRAINING_CHECKLIST.md`

## 1. Environment Check

In [1]:
import os
import sys
import json
from pathlib import Path

ARPO_ROOT = Path("/Users/hanszhu/Desktop/ARPO_replicate")
os.chdir(ARPO_ROOT)
sys.path.insert(0, str(ARPO_ROOT))

print(f"✅ Working directory: {os.getcwd()}")
print(f"✅ Python: {sys.executable}")

# Check dependencies
try:
    import torch, transformers, wandb
    print(f"✅ PyTorch {torch.__version__}")
    print(f"✅ Transformers {transformers.__version__}")
    print(f"✅ wandb {wandb.__version__}")
except ImportError as e:
    print(f"❌ Missing: {e}")

✅ Working directory: /Users/hanszhu/Desktop/ARPO_replicate
✅ Python: /opt/anaconda3/envs/arpo/bin/python




✅ PyTorch 2.5.1
✅ Transformers 4.57.6
✅ wandb 0.24.0


## 2. Training Configuration

In [2]:
config = {
    # Model
    "model": "ByteDance-Seed/UI-TARS-2B-SFT",
    "inference_server": "https://miller-unshapeable-melany.ngrok-free.dev",  # ⬅️ UPDATE!
    
    # Training
    "tasks": 128,
    "num_envs": 4,
    "rollouts_per_task": 4,
    "epochs": 1,
    "max_steps": 16,
    "batch_size": 8,
    
    # Paths
    "train_data": str(ARPO_ROOT / "test_data" / "osworld_examples" / "train_all_128.json"),
    "result_dir": str(ARPO_ROOT / "results_training_128"),
    "checkpoint_dir": str(ARPO_ROOT / "checkpoints_training_128"),
    
    # wandb
    "wandb_entity": "hanszhu05",
    "wandb_project": "arpo-uitars-training",
}

print("Training Configuration:")
print(json.dumps(config, indent=2))
print()
print(f"Expected time: ~34-68 hours for {config['epochs']} epoch")

Training Configuration:
{
  "model": "ByteDance-Seed/UI-TARS-2B-SFT",
  "inference_server": "https://miller-unshapeable-melany.ngrok-free.dev",
  "tasks": 128,
  "num_envs": 4,
  "rollouts_per_task": 4,
  "epochs": 1,
  "max_steps": 16,
  "batch_size": 8,
  "train_data": "/Users/hanszhu/Desktop/ARPO_replicate/test_data/osworld_examples/train_all_128.json",
  "result_dir": "/Users/hanszhu/Desktop/ARPO_replicate/results_training_128",
  "checkpoint_dir": "/Users/hanszhu/Desktop/ARPO_replicate/checkpoints_training_128",
  "wandb_entity": "hanszhu05",
  "wandb_project": "arpo-uitars-training"
}

Expected time: ~34-68 hours for 1 epoch
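
The 34 to 68 hour figure is consistent with a simple back-of-the-envelope calculation: 128 tasks × 4 rollouts per task = 512 rollouts, run one after another at roughly 4 to 8 minutes each. The per-rollout range and the sequential execution are inferred from the numbers, not stated in the configuration, and `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(tasks: int, rollouts_per_task: int, minutes_per_rollout: float,
                   parallel_envs: int = 1) -> float:
    """Rough wall-clock estimate in hours for one epoch of rollouts."""
    total_rollouts = tasks * rollouts_per_task
    return total_rollouts * minutes_per_rollout / parallel_envs / 60


# Assumed 4-8 minutes per rollout, executed sequentially (parallel_envs=1).
low = estimate_hours(128, 4, minutes_per_rollout=4)
high = estimate_hours(128, 4, minutes_per_rollout=8)
print(f"Estimated: {low:.0f}-{high:.0f} hours")
```

If the four environments really ran in parallel with no contention, passing `parallel_envs=4` would bring the same assumptions down to roughly 9 to 17 hours.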


## 3. Verify Colab Server

In [3]:
import requests

server_url = config["inference_server"].rstrip("/").removesuffix("/v1")  # strip only a trailing /v1

if "YOUR-NGROK-URL" in server_url:
    print("❌ Update config['inference_server'] with Colab ngrok URL!")
else:
    try:
        response = requests.get(f"{server_url}/health", timeout=5)
        if response.status_code == 200:
            print(f"✅ Server reachable: {server_url}")
            print(f"Server: {response.json()}")
        else:
            print(f"❌ Server returned {response.status_code}")
    except Exception as e:
        print(f"❌ Cannot reach server: {e}")

✅ Server reachable: https://miller-unshapeable-melany.ngrok-free.dev
Server: {'model': 'arpo-uitars-7b', 'status': 'healthy'}
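
The health response above reports the model as `arpo-uitars-7b`, while the configuration names a 2B checkpoint. A minimal sketch of a sanity check that compares the parameter-size tag in both names follows. `size_tag` and `same_model_size` are hypothetical helpers, and the assumption that the size appears as a token such as `2b` or `7b` rests only on the two names shown in this notebook.

```python
import re
from typing import Optional


def size_tag(name: str) -> Optional[str]:
    """Return a parameter-size token such as '2b' or '7b', or None if absent."""
    match = re.search(r"(?<![a-z0-9])(\d+(?:\.\d+)?b)(?![a-z0-9])", name.lower())
    return match.group(1) if match else None


def same_model_size(configured: str, reported: str) -> bool:
    """True when both names carry the same size tag, or when either has none."""
    a, b = size_tag(configured), size_tag(reported)
    return a is None or b is None or a == b


configured = "ByteDance-Seed/UI-TARS-2B-SFT"
reported = {"model": "arpo-uitars-7b", "status": "healthy"}["model"]
if not same_model_size(configured, reported):
    print(f"⚠️  Config expects {size_tag(configured)}, server reports {size_tag(reported)}")
```

A mismatch may only mean the server's label is stale, but it is worth confirming before a multi-day run.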


## 4. Update OSWorld Agent

In [4]:
import shutil

agent_file = ARPO_ROOT / "OSWorld" / "mm_agents" / "uitars_agent.py"
backup_file = agent_file.with_suffix('.py.backup_training')

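if not backup_file.exists():
    shutil.copy(agent_file, backup_file)
    print(f"✅ Created backup: {backup_file.name}")

# Point the agent at the Colab inference server instead of localhost
old_base_url = 'base_url="http://localhost:9000/v1"'
new_base_url = f'base_url="{config["inference_server"]}"'
content = agent_file.read_text()
if old_base_url in content:
    agent_file.write_text(content.replace(old_base_url, new_base_url))
    print(f"✅ Updated agent to: {config['inference_server']}")
else:
    # str.replace() is a silent no-op when the target string is missing
    print("⚠️  localhost base_url not found; agent file left unchanged")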

✅ Updated agent to: https://miller-unshapeable-melany.ngrok-free.dev


## 5. Initialize wandb

In [5]:
import wandb

# Initialize wandb (uses the account you are logged in to)
run = wandb.init(
    entity=config["wandb_entity"],
    project=config["wandb_project"],
    name="uitars-2b-128tasks-epoch1",
    config=config,
    tags=["ui-tars-2b", "128-tasks", "colab-gpu", "1-epoch"],
)

print(f"✅ wandb run: {wandb.run.url}")
print(f"Project: {wandb.run.project}")
print(f"Entity: {wandb.run.entity}")

[34m[1mwandb[0m: [wandb.login()] Loaded credentials for https://api.wandb.ai from /Users/hanszhu/.netrc.
[34m[1mwandb[0m: Currently logged in as: [33mhanszhu05[0m ([33mhanszhu05-university-of-pennsylvania[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


CommError: Error uploading run: returned error 403: {"data":{"upsertBucket":null},"errors":[{"message":"permission denied","path":["upsertBucket"],"extensions":{"code":"PERMISSION_ERROR"}}]}
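
The 403 above is a permission error returned by the wandb backend when it tries to create the run. One way to keep going is to pass the entity and project explicitly and, if the upload is still rejected, log offline and sync later. The sketch below only builds the keyword arguments. `build_wandb_kwargs` is a hypothetical helper, and whether `hanszhu05` is the entity with write access is an assumption taken from the config, not something the error confirms.

```python
from typing import Any, Dict


def build_wandb_kwargs(config: Dict[str, Any], run_name: str,
                       offline: bool = False) -> Dict[str, Any]:
    """Assemble keyword arguments for wandb.init() from the training config."""
    return {
        "entity": config["wandb_entity"],
        "project": config["wandb_project"],
        "name": run_name,
        "config": config,
        # "offline" writes the run to local disk instead of uploading it.
        "mode": "offline" if offline else "online",
    }


example_config = {"wandb_entity": "hanszhu05", "wandb_project": "arpo-uitars-training"}
kwargs = build_wandb_kwargs(example_config, "uitars-2b-128tasks-epoch1", offline=True)
print(kwargs["mode"])
# In the notebook: run = wandb.init(**kwargs)
```

Runs recorded offline can be uploaded afterwards with `wandb sync` once the permission issue is resolved.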

## 6. Run Training

⚠️ This will take ~34-68 hours! Ensure:
- Colab server stays running
- Stable internet
- Mac stays awake
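
For the last item, macOS ships a `caffeinate` utility that prevents idle sleep while a child command runs. A minimal sketch of wrapping a command with it follows. `keep_awake` is a hypothetical helper, and the command shown is a placeholder, not the full argument list used in the cell below.

```python
import shutil
from typing import List


def keep_awake(cmd: List[str]) -> List[str]:
    """Prefix cmd with `caffeinate -i` when available, else return it unchanged."""
    if shutil.which("caffeinate") is None:
        return list(cmd)
    # -i blocks idle sleep for as long as the wrapped command is running.
    return ["caffeinate", "-i", *cmd]


wrapped = keep_awake(["python", "run_uitars.py", "--headless"])
print(wrapped)
```

This does nothing for the Colab session, and closing the laptop lid may still put the machine to sleep.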

In [None]:
import subprocess
import time

os.makedirs(config["result_dir"], exist_ok=True)
os.makedirs(config["checkpoint_dir"], exist_ok=True)

print("🚀 Starting ARPO Training...")
print(f"📁 Results: {config['result_dir']}")
print("="*70)

start_time = time.time()

cmd = [
    sys.executable, "run_uitars.py",  # same interpreter as this notebook
    "--headless",
    "--observation_type", "screenshot",
    "--max_steps", str(config["max_steps"]),
    "--model", "ui-tars-2b",
    "--temperature", "0.7",
    "--max_tokens", "256",
    "--test_config_base_dir", "../test_data/osworld_examples",
    "--test_all_meta_path", config["train_data"],
    "--result_dir", config["result_dir"],
]

print(f"Training {config['tasks']} tasks with {config['num_envs']} VMs...")
print("⚠️  For full ARPO with VERL, use: bash scripts/train_uitars_2b_arpo.sh")
print()

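try:
    result = subprocess.run(
        cmd,
        cwd=ARPO_ROOT / "OSWorld",
        text=True,
    )

    elapsed = time.time() - start_time
    if result.returncode == 0:
        print(f"\n✅ Complete in {elapsed/3600:.1f} hours")
    else:
        # subprocess.run() does not raise on a non-zero exit unless check=True
        print(f"\n❌ Exited with code {result.returncode} after {elapsed/3600:.1f} hours")

except KeyboardInterrupt:
    print("\n🛑 Training interrupted")
except Exception as e:
    print(f"\n❌ Error: {e}")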

## 7. View Results

In [None]:
results = []
for result_file in Path(config["result_dir"]).rglob("result.txt"):
    try:
        score = float(result_file.read_text().strip())
        results.append(score)
    except (OSError, ValueError):
        # Skip unreadable or non-numeric result files
        continue

if results:
    avg_score = sum(results)/len(results)
    success_rate = sum(1 for r in results if r >= 0.9)/len(results)
    
    print("="*70)
    print(f"📊 Training Results ({len(results)} tasks)")
    print("="*70)
    print(f"Average Score: {avg_score:.3f}")
    print(f"Success Rate: {success_rate*100:.1f}%")
    print(f"Passed: {sum(1 for r in results if r >= 0.9)}/{len(results)}")
    print("="*70)
    
    # Log to wandb
    if wandb.run:
        wandb.log({
            "final_average_score": avg_score,
            "final_success_rate": success_rate,
            "tasks_completed": len(results),
        })
else:
    print("⚠️  No results found yet")

## 8. Cleanup

In [None]:
# Finish wandb
if wandb.run:
    wandb.finish()
    print("✅ wandb run finished")

# Restore agent config
backup_file = ARPO_ROOT / "OSWorld" / "mm_agents" / "uitars_agent.py.backup_training"
if backup_file.exists():
    agent_file = ARPO_ROOT / "OSWorld" / "mm_agents" / "uitars_agent.py"
    shutil.copy(backup_file, agent_file)
    print("✅ Restored original agent config")
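
Because the restore step only runs if this cell is executed, an interrupted session can leave the agent file patched. A sketch of an alternative that restores the file automatically, using a context manager, is shown here. `patched_file` is a hypothetical helper and not part of OSWorld, and the demonstration uses a throwaway file and a placeholder URL.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def patched_file(path: Path, old: str, new: str) -> Iterator[Path]:
    """Temporarily replace `old` with `new` in a text file, restoring it on exit."""
    original = path.read_text()
    if old not in original:
        raise ValueError(f"{old!r} not found in {path}")
    path.write_text(original.replace(old, new))
    try:
        yield path
    finally:
        # Runs on normal exit, exceptions, and KeyboardInterrupt alike.
        path.write_text(original)


# Demonstration on a throwaway file rather than the real agent.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "agent.py"
    demo.write_text('base_url="http://localhost:9000/v1"\n')
    with patched_file(demo, "http://localhost:9000/v1", "https://example.invalid"):
        during = demo.read_text()
    after = demo.read_text()

print(during.strip())
print(after.strip())
```

The restore in `finally` does not survive a kernel crash or power loss, so the on-disk backup created in section 4 is still worth keeping.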

---

## Summary

**For full ARPO training with VERL**:
- Use `scripts/train_uitars_2b_arpo.sh`
- See `TRAINING_WITH_COLAB.md`

**wandb Dashboard**: https://wandb.ai/hanszhu05-university-of-pennsylvania-org/arpo-uitars-training