# ARPO Training with VERL on Colab A100

Run complete ARPO training with experience replay, using the VERL framework, on a Colab A100.

## What This Does

- ✅ Full VERL training framework
- ✅ Experience replay buffer
- ✅ GRPO policy optimization
- ✅ Model weight updates
- ✅ OSWorld with Docker on Colab
- ✅ 128 tasks, 1 epoch

**Everything runs on Colab A100** - no Mac needed!
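
To make the GRPO and replay bullets concrete, here is a minimal sketch of group-relative advantages. It assumes the replay buffer substitutes a stored success when every rollout in a group fails, which may differ from what this repository does. `group_advantages` and `inject_replay` are hypothetical helpers, not VERL functions, and the buffer holds bare rewards rather than full trajectories to keep the example short.

```python
import random
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def inject_replay(rewards, replay_buffer, task_id, rng=random):
    """If every rollout failed, swap one for a stored success (hypothetical helper)."""
    successes = replay_buffer.get(task_id, [])
    if max(rewards) > 0 or not successes:
        return list(rewards)  # the group already has signal, or nothing to replay
    patched = list(rewards)
    patched[rng.randrange(len(patched))] = rng.choice(successes)
    return patched

# Four rollouts for one task, all failed; the buffer holds one past success.
buffer = {"task-0": [1.0]}
rewards = inject_replay([0.0, 0.0, 0.0, 0.0], buffer, "task-0")
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Without the replayed success, all four advantages would be zero and the group would contribute no gradient. In this sketch, that is the problem replay addresses.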

## 1. Setup Colab Environment

In [None]:
# Check GPU
!nvidia-smi

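# Confirm a GPU is attached, then report its name and memory
import torch
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU: {gpu_name}")
    print(f"💾 Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    # The original check only tested for any GPU; flag non-A100 runtimes explicitly
    if "A100" not in gpu_name:
        print("⚠️ Not an A100; this notebook was written for an A100 runtime")
else:
    print("❌ No GPU! Change runtime to A100")
    raise RuntimeError("GPU required")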

## 2. Clone Repository

In [None]:
# Option 1: Use Personal Access Token (recommended)
from getpass import getpass

print('Generate a GitHub Personal Access Token at:')
print('https://github.com/settings/tokens/new')
print('Scopes needed: repo (full access)')
print()

github_token = getpass('Enter GitHub token: ')

# Clone with token
!git clone https://{github_token}@github.com/gowathena/arpo_replica.git
%cd arpo_replica
!git checkout arpo-cpu-replicate
!git submodule update --init --recursive

print('✅ Repository cloned!')

# Option 2: Upload files manually (alternative)
# If token doesn't work:
# 1. Download repo as ZIP from GitHub
# 2. Upload to Colab: Files icon → Upload
# 3. !unzip arpo_replica.zip

## 3. Install Dependencies

In [None]:
%pip install -q torch transformers accelerate
%pip install -q ray omegaconf wandb tqdm psutil
%pip install -q qwen-vl-utils pillow

# Install OSWorld
%cd OSWorld
%pip install -q -r requirements.txt
%pip install -q -e .
%cd ..

print("✅ Dependencies installed!")

## 4. Setup Docker for OSWorld

OSWorld on Colab uses Docker (not VMware).

In [None]:
# Start Docker service
!sudo service docker start

# Pull OSWorld Docker image
!docker pull happysixd/osworld-docker:latest

print("✅ Docker ready for OSWorld")
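
The success message above prints whether or not the daemon actually started. A minimal sketch of an explicit check follows; it assumes `docker info` exits non-zero when the daemon is unreachable. `docker_is_ready` is a hypothetical helper, not part of OSWorld.

```python
import shutil
import subprocess

def docker_is_ready(timeout=30):
    """Return True only if the docker CLI exists and the daemon answers `docker info`."""
    if shutil.which("docker") is None:
        return False  # CLI not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0

ready = docker_is_ready()
print("✅ Docker daemon reachable" if ready else "❌ Docker daemon not reachable")
```

If this reports that the daemon is not reachable, the later OSWorld steps cannot start containers, so it is worth resolving before launching training.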

## 5. Start Ray Cluster

In [None]:
import ray

# Start Ray
ray.init(
    num_cpus=4,
    num_gpus=1,
    ignore_reinit_error=True,
)

print("✅ Ray cluster started")
print(ray.cluster_resources())

## 6. Configure wandb

In [None]:
import wandb
import os
from getpass import getpass

# Set wandb API key
print("Enter your wandb API key:")
api_key = getpass("API Key: ")
os.environ["WANDB_API_KEY"] = api_key

# Login
wandb.login(key=api_key)
print("✅ wandb authenticated!")

## 7. Update OSWorld for Docker

Switch the OSWorld provider from VMware to Docker.

In [None]:
# Point both OSWorld runner scripts at the Docker provider
files_to_update = [
    'OSWorld/run_uitars.py',
    'OSWorld/run_multienv_uitars.py',
]

old, new = 'provider_name="vmware"', 'provider_name="docker"'

for file_path in files_to_update:
    with open(file_path, 'r') as f:
        content = f.read()

    # Report files that do not contain the VMware setting instead of claiming success
    if old not in content:
        print(f"⚠️ No {old} found in {file_path}; left unchanged")
        continue

    with open(file_path, 'w') as f:
        f.write(content.replace(old, new))

    print(f"✅ Updated {file_path} to use Docker")

## 8. Create Training Configuration

In [None]:
import yaml

config = {
    'data': {
        'train_files': 'test_data/osworld_examples/train_all_128.json',
        'val_files': 'test_data/osworld_examples/test_chrome_10.json',
        'prompt_key': 'instruction',
        'max_prompt_length': 32768,
        'max_response_length': 4096,
        'max_pixels': 2116800,
        'min_pixels': 2800,
    },
    'algorithm': {
        'adv_estimator': 'grpo',
        'disable_kl': True,
        'kl_coef': 0,
        'enable_replay': True,  # Key ARPO feature!
    },
    'worker': {
        'actor': {
            'model': {
                'model_path': 'ByteDance-Seed/UI-TARS-2B-SFT',
                'trust_remote_code': True,
            },
            'optim': {
                'lr': 1e-6,
                'strategy': 'adamw',
            },
            'clip_ratio_low': 0.2,
            'clip_ratio_high': 0.3,
        },
        'rollout': {
            'temperature': 0.7,
            'n': 4,  # 4 rollouts per task
        },
    },
    'env': {
        'num_envs': 2,  # 2 Docker containers on Colab
        'max_steps': 16,
        'provider': 'docker',
    },
    'trainer': {
        'total_episodes': 1,  # 1 epoch
        'logger': ['console', 'wandb'],
        'project_name': 'arpo-uitars-training',
        'experiment_name': 'uitars-2b-128tasks-colab',
        'n_gpus_per_node': 1,
        'nnodes': 1,
    },
}

# Save config
with open('config_colab_training.yaml', 'w') as f:
    yaml.dump(config, f)

print("✅ Training configuration created")
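# Read the summary from the config so it cannot drift from the values above
print(
    f"Tasks: 128, Envs: {config['env']['num_envs']}, "
    f"Rollouts: {config['worker']['rollout']['n']}, "
    f"Epochs: {config['trainer']['total_episodes']}"
)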
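
The values above imply a rollout budget, worked out below as a rough sketch. It assumes every task is rolled out `n` times per epoch, that `total_episodes` counts epochs as the config comment says, and that trajectories are split evenly across environments. The variable names are illustrative only.

```python
tasks = 128            # tasks in the training file, per the notebook summary
rollouts_per_task = 4  # worker.rollout.n
epochs = 1             # trainer.total_episodes
num_envs = 2           # env.num_envs
max_steps = 16         # env.max_steps

trajectories = tasks * rollouts_per_task * epochs
max_env_steps = trajectories * max_steps           # upper bound; episodes may end early
trajectories_per_env = trajectories // num_envs    # assumes an even split

print(f"Trajectories: {trajectories}")                   # 512
print(f"Max environment steps: {max_env_steps}")         # 8192
print(f"Trajectories per env: {trajectories_per_env}")   # 256
```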

## 9. Run VERL Training

⚠️ This will take ~20-40 hours on A100!

In [None]:
# Run VERL training
!python -m verl.trainer.main \
    config=config_colab_training.yaml \
    worker.actor.model.model_path=ByteDance-Seed/UI-TARS-2B-SFT \
    algorithm.enable_replay=True \
    env.provider=docker \
    env.num_envs=2 \
    env.max_steps=16 \
    trainer.total_episodes=1 \
    trainer.n_gpus_per_node=1
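
The dotted `key=value` overrides in the command above mirror the nesting of the YAML config. A minimal sketch of generating them from a nested dict follows. `to_overrides` is a hypothetical helper, not part of VERL, and it handles only scalar values, so list values such as `logger` would need their own formatting.

```python
def to_overrides(config, prefix=""):
    """Flatten a nested dict into dotted key=value strings."""
    overrides = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))  # recurse into sub-sections
        else:
            overrides.append(f"{path}={value}")
    return overrides

overrides = to_overrides({
    "algorithm": {"enable_replay": True},
    "env": {"provider": "docker", "num_envs": 2, "max_steps": 16},
    "trainer": {"total_episodes": 1, "n_gpus_per_node": 1},
})
print(" ".join(overrides))
```

Generating the overrides from one dict keeps the command line and the saved YAML from disagreeing when a value changes.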