# ARPO Training - UI-TARS-2B (Colab GPU + Mac OSWorld)

Train UI-TARS-2B on 128 OSWorld tasks using Colab GPU for inference.

## Prerequisites

- ‚úÖ Colab GPU server running
- ‚úÖ VMware + Ubuntu VM ready
- ‚úÖ wandb configured

See `PRE_TRAINING_CHECKLIST.md`

## 1. Environment Check

In [1]:
import os
import sys
import json
from pathlib import Path

ARPO_ROOT = Path("/Users/hanszhu/Desktop/ARPO_replicate")
os.chdir(ARPO_ROOT)
sys.path.insert(0, str(ARPO_ROOT))

print(f"‚úÖ Working directory: {os.getcwd()}")
print(f"‚úÖ Python: {sys.executable}")

# Check dependencies
try:
    import torch, transformers, wandb
    print(f"‚úÖ PyTorch {torch.__version__}")
    print(f"‚úÖ Transformers {transformers.__version__}")
    print(f"‚úÖ wandb {wandb.__version__}")
except ImportError as e:
    print(f"‚ùå Missing: {e}")

‚úÖ Working directory: /Users/hanszhu/Desktop/ARPO_replicate
‚úÖ Python: /opt/anaconda3/envs/arpo/bin/python


  import pynvml  # type: ignore[import]
  from .autonotebook import tqdm as notebook_tqdm


‚úÖ PyTorch 2.5.1
‚úÖ Transformers 4.57.6
‚úÖ wandb 0.24.0


## 2. Training Configuration

In [2]:
config = {
    # Model
    "model": "ByteDance-Seed/UI-TARS-2B-SFT",
    "inference_server": "https://miller-unshapeable-melany.ngrok-free.dev",  # ‚¨ÖÔ∏è UPDATE!
    
    # Training
    "tasks": 128,
    "num_envs": 4,
    "rollouts_per_task": 4,
    "epochs": 1,
    "max_steps": 16,
    "batch_size": 8,
    
    # Paths
    "train_data": str(ARPO_ROOT / "test_data" / "osworld_examples" / "train_all_128.json"),
    "result_dir": str(ARPO_ROOT / "results_training_128"),
    "checkpoint_dir": str(ARPO_ROOT / "checkpoints_training_128"),
    
    # wandb
    "wandb_entity": "hanszhu05",
    "wandb_project": "arpo-uitars-training",
}

print("Training Configuration:")
print(json.dumps(config, indent=2))
print()
print(f"Expected time: ~34-68 hours for {config['epochs']} epoch")

Training Configuration:
{
  "model": "ByteDance-Seed/UI-TARS-2B-SFT",
  "inference_server": "https://miller-unshapeable-melany.ngrok-free.dev",
  "tasks": 128,
  "num_envs": 4,
  "rollouts_per_task": 4,
  "epochs": 1,
  "max_steps": 16,
  "batch_size": 8,
  "train_data": "/Users/hanszhu/Desktop/ARPO_replicate/test_data/osworld_examples/train_all_128.json",
  "result_dir": "/Users/hanszhu/Desktop/ARPO_replicate/results_training_128",
  "checkpoint_dir": "/Users/hanszhu/Desktop/ARPO_replicate/checkpoints_training_128",
  "wandb_entity": "hanszhu05",
  "wandb_project": "arpo-uitars-training"
}

Expected time: ~34-68 hours for 1 epoch


## 3. Verify Colab Server

In [3]:
import requests

server_url = config["inference_server"].replace("/v1", "")

if "YOUR-NGROK-URL" in server_url:
    print("‚ùå Update config['inference_server'] with Colab ngrok URL!")
else:
    try:
        response = requests.get(f"{server_url}/health", timeout=5)
        if response.status_code == 200:
            print(f"‚úÖ Server reachable: {server_url}")
            print(f"Server: {response.json()}")
        else:
            print(f"‚ùå Server returned {response.status_code}")
    except Exception as e:
        print(f"‚ùå Cannot reach server: {e}")

‚úÖ Server reachable: https://miller-unshapeable-melany.ngrok-free.dev
Server: {'model': 'arpo-uitars-7b', 'status': 'healthy'}


## 4. Update OSWorld Agent

In [4]:
import shutil

agent_file = ARPO_ROOT / "OSWorld" / "mm_agents" / "uitars_agent.py"
backup_file = agent_file.with_suffix('.py.backup_training')

if not backup_file.exists():
    shutil.copy(agent_file, backup_file)
    print(f"‚úÖ Created backup")

# Update base_url
content = agent_file.read_text()
new_content = content.replace(
    'base_url="http://localhost:9000/v1"',
    f'base_url="{config["inference_server"]}"'
)
agent_file.write_text(new_content)
print(f"‚úÖ Updated agent to: {config['inference_server']}")

‚úÖ Updated agent to: https://miller-unshapeable-melany.ngrok-free.dev


## 5. Initialize wandb

In [5]:
import wandb

# Initialize wandb (will use your logged-in account)
run = wandb.init(
    project="arpo-uitars-training",
    name="uitars-2b-128tasks-epoch1",
    config=config,
    tags=["ui-tars-2b", "128-tasks", "colab-gpu", "1-epoch"],
)

print(f"‚úÖ wandb run: {wandb.run.url}")
print(f"Project: {wandb.run.project}")
print(f"Entity: {wandb.run.entity}")

[34m[1mwandb[0m: [wandb.login()] Loaded credentials for https://api.wandb.ai from /Users/hanszhu/.netrc.
[34m[1mwandb[0m: Currently logged in as: [33mhanszhu05[0m ([33mhanszhu05-university-of-pennsylvania[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


‚úÖ wandb run: https://wandb.ai/hanszhu05-university-of-pennsylvania/arpo-uitars-training/runs/p548k6pb
Project: arpo-uitars-training
Entity: hanszhu05-university-of-pennsylvania


## 6. Run Training

‚ö†Ô∏è This will take ~34-68 hours! Ensure:
- Colab server stays running
- Stable internet
- Mac stays awake

In [6]:
import subprocess
import time

os.makedirs(config["result_dir"], exist_ok=True)
os.makedirs(config["checkpoint_dir"], exist_ok=True)

print("üöÄ Starting ARPO Training...")
print(f"üìÅ Results: {config['result_dir']}")
print("="*70)

start_time = time.time()

cmd = [
    "python", "run_uitars.py",
    "--headless",
    "--observation_type", "screenshot",
    "--max_steps", str(config["max_steps"]),
    "--model", "ui-tars-2b",
    "--temperature", "0.7",
    "--max_tokens", "256",
    "--test_config_base_dir", "../test_data/osworld_examples",
    "--test_all_meta_path", config["train_data"],
    "--result_dir", config["result_dir"],
]

print(f"Training {config['tasks']} tasks with {config['num_envs']} VMs...")
print("‚ö†Ô∏è  For full ARPO with VERL, use: bash scripts/train_uitars_2b_arpo.sh")
print()

try:
    result = subprocess.run(
        cmd,
        cwd=ARPO_ROOT / "OSWorld",
        text=True,
    )
    
    elapsed = time.time() - start_time
    print(f"\n‚úÖ Complete in {elapsed/3600:.1f} hours")
    
except KeyboardInterrupt:
    print("\nüõë Training interrupted")
except Exception as e:
    print(f"\n‚ùå Error: {e}")

üöÄ Starting ARPO Training...
üìÅ Results: /Users/hanszhu/Desktop/ARPO_replicate/results_training_128
Training 128 tasks with 4 VMs...
‚ö†Ô∏è  For full ARPO with VERL, use: bash scripts/train_uitars_2b_arpo.sh



  import pynvml  # type: ignore[import]


[1;33m[2026-01-25 15:05:41,601 [31mINFO [32mrun_uitars/323-MainProcess[1;33m] [0mLeft tasks:
libreoffice_calc: 12
vlc: 6
multi_apps: 11
chrome: 18
vs_code: 19
os: 12
libreoffice_impress: 13
libreoffice_writer: 11
thunderbird: 7
gimp: 19

New experiment, no result yet.
[1;33m[2026-01-25 15:05:41,601 [31mINFO [32mrun_uitars/119-MainProcess[1;33m] [0mArgs: Namespace(path_to_vm=None, headless=True, action_space='pyautogui', observation_type='screenshot', screen_width=1920, screen_height=1080, sleep_after_execution=0.0, max_steps=16, max_trajectory_length=15, test_config_base_dir='../test_data/osworld_examples', model='ui-tars-2b', temperature=0.7, top_p=0.9, max_tokens=256, stop_token=None, domain='all', test_all_meta_path='/Users/hanszhu/Desktop/ARPO_replicate/test_data/osworld_examples/train_all_128.json', result_dir='/Users/hanszhu/Desktop/ARPO_replicate/results_training_128')
[1;33m[2026-01-25 15:05:41,670 [31mINFO [32mdesktop_env/83-MainProcess[1;33m] [0mInitializing...

Domain:   0%|          | 0/10 [00:00<?, ?it/s]
Example:   0%|          | 0/12 [00:00<?, ?it/s][A
Domain:   0%|          | 0/10 [00:00<?, ?it/s] [A
Traceback (most recent call last):
  File "/Users/hanszhu/Desktop/ARPO_replicate/OSWorld/run_uitars.py", line 332, in <module>
    test(args, test_file_list)
  File "/Users/hanszhu/Desktop/ARPO_replicate/OSWorld/run_uitars.py", line 175, in test
    with open(config_file, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '../test_data/osworld_examples/examples/libreoffice_calc/1273e544-688f-496b-8d89-3e0f40aa0606.json'



‚úÖ Complete in 0.0 hours


## 7. View Results

In [None]:
results = []
for result_file in Path(config["result_dir"]).rglob("result.txt"):
    try:
        score = float(result_file.read_text().strip())
        results.append(score)
    except:
        pass

if results:
    avg_score = sum(results)/len(results)
    success_rate = sum(1 for r in results if r >= 0.9)/len(results)
    
    print("="*70)
    print(f"üìä Training Results ({len(results)} tasks)")
    print("="*70)
    print(f"Average Score: {avg_score:.3f}")
    print(f"Success Rate: {success_rate*100:.1f}%")
    print(f"Passed: {sum(1 for r in results if r >= 0.9)}/{len(results)}")
    print("="*70)
    
    # Log to wandb
    if wandb.run:
        wandb.log({
            "final_average_score": avg_score,
            "final_success_rate": success_rate,
            "tasks_completed": len(results),
        })
else:
    print("‚ö†Ô∏è  No results found yet")

## 8. Cleanup

In [None]:
# Finish wandb
if wandb.run:
    wandb.finish()
    print("‚úÖ wandb run finished")

# Restore agent config
backup_file = ARPO_ROOT / "OSWorld" / "mm_agents" / "uitars_agent.py.backup_training"
if backup_file.exists():
    agent_file = ARPO_ROOT / "OSWorld" / "mm_agents" / "uitars_agent.py"
    shutil.copy(backup_file, agent_file)
    print("‚úÖ Restored original agent config")

---

## Summary

**For full ARPO training with VERL**:
- Use `scripts/train_uitars_2b_arpo.sh`
- See `TRAINING_WITH_COLAB.md`

**wandb Dashboard**: https://wandb.ai/hanszhu05-university-of-pennsylvania-org/arpo-uitars-training