# ARPO Training with VERL on Colab A100

Complete ARPO training: 128 tasks, 8 Docker environments, ~10-12 hours.

## Configuration
- Model: UI-TARS-2B (`ByteDance-Seed/UI-TARS-2B-SFT`)
- Tasks: 128 (all 10 domains)
- Environments: 8 Docker containers (parallel)
- Rollouts: 4 per task
- Max steps: 12
- Epochs: 1

**Expected time: ~10-12 hours on A100**
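
The workload implied by the configuration above can be worked out with simple arithmetic. The sketch below assumes the whole wall-clock budget is split evenly across rollouts, ignoring model updates and container startup, so the per-rollout figure is an upper bound rather than a measurement. The function name `rollout_budget` is illustrative only.

```python
# Values copied from the Configuration list above.
NUM_TASKS = 128
ROLLOUTS_PER_TASK = 4
NUM_ENVS = 8
MAX_STEPS = 12


def rollout_budget(hours: float) -> dict:
    """Derive per-environment workload for a wall-clock budget given in hours."""
    total_rollouts = NUM_TASKS * ROLLOUTS_PER_TASK   # 128 * 4 = 512
    rollouts_per_env = total_rollouts // NUM_ENVS    # 512 / 8 = 64
    max_env_steps = total_rollouts * MAX_STEPS       # 512 * 12 = 6144
    # Assumption: all time goes to rollouts, shared evenly across environments.
    minutes_per_rollout = hours * 60 / rollouts_per_env
    return {
        "total_rollouts": total_rollouts,
        "rollouts_per_env": rollouts_per_env,
        "max_env_steps": max_env_steps,
        "minutes_per_rollout": minutes_per_rollout,
    }


for budget_hours in (10, 12):
    print(budget_hours, rollout_budget(budget_hours))
```

Under these assumptions, each environment handles 64 rollouts in sequence, which leaves roughly 9 to 11 minutes per rollout of up to 12 steps.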

## 1. Check GPU

In [None]:
import torch
!nvidia-smi

if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    raise RuntimeError("❌ No GPU! Change runtime to A100")
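
The check above accepts any GPU, although the time estimate assumes an A100. A minimal sketch of a stricter check follows; the helper `looks_like_a100` is hypothetical, and matching on the device name string is an assumption about how the driver reports the card.

```python
def looks_like_a100(device_name: str) -> bool:
    """Return True when the reported device name mentions an A100."""
    return "a100" in device_name.lower()


# Example names in the style typically reported by the CUDA driver.
print(looks_like_a100("NVIDIA A100-SXM4-40GB"))
print(looks_like_a100("Tesla T4"))
```

In the notebook it could be called as `looks_like_a100(torch.cuda.get_device_name(0))` right after the availability check.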

## 2. Clone Repository (Private)

In [None]:
from getpass import getpass

print('Get token from: https://github.com/settings/tokens/new')
print('Scope needed: repo')
github_token = getpass('GitHub token: ')

!git clone https://{github_token}@github.com/gowathena/arpo_replica.git
%cd arpo_replica
!git checkout arpo-cpu-replicate
!git submodule update --init --recursive

print('✅ Repository cloned!')
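
Cloning with the token embedded in the URL leaves that URL, token included, as the `origin` remote in `.git/config`. One way to avoid keeping it there is to reset the remote to a credential-free URL after cloning. The helper below is a sketch using only the standard library; `strip_credentials` is a name introduced here for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url: str) -> str:
    """Return the URL with any user or token part removed from the host section."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


clean_url = strip_credentials("https://TOKEN@github.com/gowathena/arpo_replica.git")
print(clean_url)
```

The result could then be passed to `git remote set-url origin <clean_url>` from inside the cloned repository.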

## 3. Install Dependencies

In [None]:
%pip install -q torch==2.5.1 transformers accelerate
%pip install -q ray omegaconf wandb tqdm psutil
%pip install -q qwen-vl-utils pillow
%pip install -q tensordict datasets

# Install OSWorld
%cd OSWorld
%pip install -q -r requirements.txt
%pip install -q -e .
%cd ..

print('✅ Dependencies installed!')

## 4. Setup Docker for OSWorld

In [None]:
!sudo service docker start
!docker pull happysixd/osworld-docker:latest
print('✅ Docker ready')
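
The cell above prints its success message even if the Docker daemon failed to start. A small check along these lines could confirm that the daemon is reachable before continuing. It is a sketch that assumes `docker info` exits with a non-zero status when the daemon is unavailable; `docker_is_ready` is a hypothetical helper.

```python
import shutil
import subprocess


def docker_is_ready(timeout: float = 20.0) -> bool:
    """Return True when the docker CLI exists and `docker info` succeeds."""
    if shutil.which("docker") is None:
        return False  # CLI not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"], capture_output=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("Docker reachable:", docker_is_ready())
```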

## 5. Start Ray Cluster

In [None]:
import ray

ray.init(num_cpus=8, num_gpus=1, ignore_reinit_error=True)
print('✅ Ray started')
print(ray.cluster_resources())
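
Ray treats `num_cpus` as a logical scheduling count, so `ray.cluster_resources()` reports 8 CPUs whether or not the runtime has that many cores. A quick comparison against the machine's logical core count, sketched below with a hypothetical helper, shows whether the 8 slots oversubscribe the hardware.

```python
import os


def cpu_oversubscription(requested_cpus: int) -> float:
    """Ratio of requested Ray CPU slots to logical cores on this machine."""
    logical_cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return requested_cpus / logical_cores


ratio = cpu_oversubscription(8)  # 8 matches num_cpus in the ray.init call above
if ratio > 1:
    print(f"Requested CPU slots exceed logical cores by a factor of {ratio:.2f}")
else:
    print("Requested CPU slots fit within the available logical cores")
```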

## 6. Configure wandb

In [None]:
import wandb
import os
from getpass import getpass

api_key = getpass('wandb API key: ')
os.environ['WANDB_API_KEY'] = api_key
wandb.login(key=api_key)
print('✅ wandb authenticated!')

## 7. Update OSWorld for Docker

In [None]:
# Switch the OSWorld provider from VMware to Docker in both runner scripts
files = ['OSWorld/run_uitars.py', 'OSWorld/run_multienv_uitars.py']
for file_path in files:
    with open(file_path) as f:
        content = f.read()
    content = content.replace('provider_name="vmware"', 'provider_name="docker"')
    with open(file_path, 'w') as f:
        f.write(content)
print('✅ Updated to Docker')
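
`str.replace` succeeds silently when the search string is absent, so the cell above reports success even if nothing changed. The sketch below counts the substitutions so that a no-op can be detected. The helper `switch_provider` is illustrative and works on a string rather than on the files themselves.

```python
OLD = 'provider_name="vmware"'
NEW = 'provider_name="docker"'


def switch_provider(content: str) -> tuple[str, int]:
    """Return the updated text and the number of occurrences that were replaced."""
    count = content.count(OLD)
    return content.replace(OLD, NEW), count


updated, replaced = switch_provider('make_env(provider_name="vmware")')
print(replaced, updated)
```

A count of zero for either script would suggest the file already uses Docker or spells the argument differently.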

## 8. Create Training Config

In [None]:
import yaml

config = {
    'data': {
        'train_files': 'test_data/osworld_examples/train_all_128.json',
        'val_files': 'test_data/osworld_examples/test_chrome_10.json',
        'prompt_key': 'instruction',
        'max_prompt_length': 32768,
        'max_response_length': 4096,
        'max_pixels': 2116800,
        'min_pixels': 2800,
    },
    'algorithm': {
        'adv_estimator': 'grpo',
        'disable_kl': True,
        'kl_coef': 0,
        'enable_replay': True,
    },
    'worker': {
        'actor': {
            'model': {
                'model_path': 'ByteDance-Seed/UI-TARS-2B-SFT',
                'trust_remote_code': True,
            },
            'optim': {'lr': 1e-6, 'strategy': 'adamw'},
            'clip_ratio_low': 0.2,
            'clip_ratio_high': 0.3,
        },
        'rollout': {'temperature': 0.7, 'n': 4},
    },
    'env': {
        'num_envs': 8,
        'max_steps': 12,
        'provider': 'docker',
    },
    'trainer': {
        'total_episodes': 1,
        'logger': ['console', 'wandb'],
        'project_name': 'arpo-uitars-training',
        'experiment_name': 'uitars-2b-128tasks-8envs',
        'n_gpus_per_node': 1,
    },
}

with open('config_colab.yaml', 'w') as f:
    yaml.dump(config, f)

print('✅ Config created: 128 tasks, 8 envs, 12 steps')
print('⏱️  Expected: ~10-12 hours')
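
Because the same numbers appear in the Configuration list, the config dictionary, and the launch command, a small cross-check can catch drift between them. The sketch below compares a config dictionary with the planned values. Mapping "Epochs: 1" to `trainer.total_episodes` is an assumption, and `find_mismatches` is a hypothetical helper.

```python
# Planned values copied from the Configuration list at the top of the notebook.
PLAN = {
    ("env", "num_envs"): 8,
    ("env", "max_steps"): 12,
    ("worker", "rollout", "n"): 4,
    ("trainer", "total_episodes"): 1,  # assumed to correspond to "Epochs: 1"
}


def find_mismatches(config: dict, plan: dict = PLAN) -> list:
    """List every planned value that is missing or different in the config."""
    problems = []
    for path, expected in plan.items():
        node = config
        for key in path:
            # Walk down the nested dictionaries; a missing key yields None.
            node = node.get(key) if isinstance(node, dict) else None
        if node != expected:
            problems.append(f"{'.'.join(path)}: expected {expected!r}, found {node!r}")
    return problems


example = {
    "env": {"num_envs": 8, "max_steps": 12},
    "worker": {"rollout": {"n": 4}},
    "trainer": {"total_episodes": 1},
}
print(find_mismatches(example))
```

In the notebook, `find_mismatches(config)` could be called on the dictionary defined in the cell above.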

## 9. Run ARPO Training

⚠️ **This will run for ~10-12 hours!**

Keep the Colab tab open and connected for the whole run; a disconnected runtime stops the job.

In [None]:
!python -m verl.trainer.main \
    config=config_colab.yaml \
    worker.actor.model.model_path=ByteDance-Seed/UI-TARS-2B-SFT \
    algorithm.enable_replay=True \
    env.provider=docker \
    env.num_envs=8 \
    env.max_steps=12 \
    trainer.total_episodes=1
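
The dotted `key=value` arguments in the command address the same nested keys as `config_colab.yaml`. The sketch below shows how such an override maps onto a nested dictionary. It is an illustration only: the trainer's actual parsing may differ, and `apply_override` is a hypothetical helper.

```python
import ast


def apply_override(config: dict, override: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})  # create missing levels on the way down
    try:
        value = ast.literal_eval(raw_value)  # "8" -> 8, "True" -> True
    except (ValueError, SyntaxError):
        value = raw_value  # anything else stays a plain string
    node[leaf] = value


cfg = {"env": {"num_envs": 4}}
for arg in ("env.num_envs=8", "env.provider=docker", "algorithm.enable_replay=True"):
    apply_override(cfg, arg)
print(cfg)
```

Every override in the command repeats a value already written to the YAML file, so the command line and the file should stay in agreement when either is edited.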