# ARPO Smoke Test - VERL Pipeline Verification

**Stage 1: Wiring / Smoke Run**

A minimal test to verify that the ARPO training pipeline works end to end:
- 4 tasks only
- 2 Docker environments
- ~100-200 optimizer steps

## Success Criteria
- ✅ Loss is finite and changes
- ✅ Checkpoints save successfully
- ✅ Can resume from checkpoint
- ✅ Evaluation loads checkpoint

**Expected time: ~1-1.5 hours on A100**

## 1. Check GPU (A100)

In [None]:
import torch
!nvidia-smi

if not torch.cuda.is_available():
    raise RuntimeError("GPU required!")
print(f"✅ {torch.cuda.get_device_name(0)}")

## 2. Clone Repo

In [None]:
from getpass import getpass
token = getpass('GitHub token: ')
!git clone https://{token}@github.com/gowathena/arpo_replica.git
%cd arpo_replica
!git checkout arpo-cpu-replicate
!git submodule update --init --recursive

## 3. Install

In [None]:
# Core ML (quote the version specifier so the shell does not treat '>' as a redirect)
%pip install -q torch==2.5.1 "transformers>=4.45.0" accelerate

# VERL framework dependencies (from requirements.txt)
%pip install -q ray omegaconf wandb tqdm psutil
%pip install -q tensordict datasets
%pip install -q codetiming filelock
%pip install -q mathruler  # Required by VERL reward scoring

# Model
%pip install -q qwen-vl-utils pillow

# Install OSWorld
%cd OSWorld
%pip install -q -r requirements.txt
%pip install -q -e .
%cd ..

print('✅ All dependencies installed!')

## 4. Setup

In [None]:
# Docker
!sudo service docker start
!docker pull happysixd/osworld-docker:latest

# Ray
import ray
ray.init(num_cpus=4, num_gpus=1, ignore_reinit_error=True)

# wandb
import wandb, os
from getpass import getpass
os.environ['WANDB_API_KEY'] = getpass('wandb key: ')
wandb.login()

# Update for Docker
!sed -i 's/vmware/docker/g' OSWorld/run_uitars.py
!sed -i 's/vmware/docker/g' OSWorld/run_multienv_uitars.py

print('✅ Setup complete!')

## 5. Smoke Test Config

In [None]:
import yaml

config = {
    'data': {
        'train_files': 'test_data/osworld_examples/train_smoke_4.json',
        'val_files': 'test_data/osworld_examples/train_smoke_4.json',
        'prompt_key': 'instruction',
        'max_prompt_length': 16384,
        'max_response_length': 2048,
    },
    'algorithm': {
        'adv_estimator': 'grpo',  # ARPO uses GRPO
        'disable_kl': True,
        'kl_coef': 0,
        'enable_replay': True,  # Experience replay (key ARPO feature!)
    },
    'worker': {
        'actor': {
            'global_batch_size': 2,
            'micro_batch_size_per_device_for_update': 1,
            'ppo_epochs': 4,  # Note: Called 'ppo_epochs' in VERL but implements GRPO
            'model': {
                'model_path': 'ByteDance-Seed/UI-TARS-2B-SFT',
                'trust_remote_code': True,
            },
            'optim': {'lr': 1e-6, 'strategy': 'adamw'},
            'clip_ratio_low': 0.2,
            'clip_ratio_high': 0.3,
        },
        'rollout': {
            'temperature': 0.7,
            'n': 8,  # 8 rollouts per task
        },
    },
    'env': {
        'num_envs': 2,
        'max_steps': 16,  # 16 steps for proper exploration
        'provider': 'docker',
    },
    'trainer': {
        'total_episodes': 2,  # 2 epochs
        'logger': ['console', 'wandb'],
        'project_name': 'arpo-smoke-test',
        'experiment_name': 'smoke-4tasks-grpo',
        'n_gpus_per_node': 1,
        'save_freq': 1,
    },
}

with open('smoke_test.yaml', 'w') as f:
    yaml.dump(config, f)

print('✅ ARPO Smoke Test Config (GRPO + Experience Replay):')
print('  Tasks: 4')
print('  Envs: 2')
print('  Rollouts: 8 per task')
print('  Max steps: 16 (proper exploration)')
print('  Epochs: 2')
print('  GRPO update passes: 4')
print('  Batch size: 2')
print()
print('Optimization steps:')
print('  4 tasks × 8 rollouts × 2 epochs = 64 total rollouts')
print('  64 rollouts ÷ 2 batch = 32 batches')
print('  32 batches × 4 passes = 128 optimization steps ✅')
print()
print('Expected: ~1-1.5 hours (16 steps × 8 rollouts × 4 tasks)')
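
The step count printed above can be recomputed from the config values instead of being hard-coded. The sketch below only mirrors the notebook's own arithmetic. It assumes `global_batch_size` counts rollouts and that each update pass over a batch is one optimizer step; VERL's internal accounting may differ. The function name `expected_optimizer_steps` is hypothetical.

```python
def expected_optimizer_steps(num_tasks, rollouts_per_task, epochs,
                             global_batch_size, update_passes):
    """Mirror the notebook's arithmetic (assumed, not taken from VERL)."""
    # Every task is rolled out `rollouts_per_task` times in each epoch.
    total_rollouts = num_tasks * rollouts_per_task * epochs
    # Assumption: one batch consumes `global_batch_size` rollouts.
    batches = total_rollouts // global_batch_size
    # Assumption: each update pass over a batch is one optimizer step.
    steps = batches * update_passes
    return total_rollouts, batches, steps


# Values from the smoke test config: 4 tasks, n=8, total_episodes=2,
# global_batch_size=2, ppo_epochs=4.
total, batches, steps = expected_optimizer_steps(4, 8, 2, 2, 4)
print(f"{total} rollouts, {batches} batches, {steps} optimizer steps")
```

With the smoke test values this gives 64 rollouts, 32 batches, and 128 optimizer steps, inside the ~100-200 step target.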

## 6. Run Smoke Test

This verifies:
1. VERL pipeline works
2. Experience replay functions
3. Checkpoints save
4. Loss is finite and changes

**~1-1.5 hours**

In [None]:
!python -m verl.trainer.main config=smoke_test.yaml

## 7. Verify Results

In [None]:
# Check checkpoints saved
!ls -lh checkpoints*/

# Check results
!ls -lh results*/

print('\n✅ If you see checkpoints and results, smoke test passed!')
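
The `ls` check above relies on reading the output by eye. A minimal Python sketch of the same check follows. It assumes checkpoints land in directories whose names start with `checkpoints`, matching the `checkpoints*/` glob used above; the layout inside those directories is not assumed. The helper name `find_checkpoint_dirs` is hypothetical.

```python
from pathlib import Path


def find_checkpoint_dirs(root="."):
    """Return non-empty directories under `root` named 'checkpoints*'."""
    return sorted(
        p for p in Path(root).glob("checkpoints*")
        if p.is_dir() and any(p.iterdir())
    )


dirs = find_checkpoint_dirs()
if dirs:
    print("Checkpoint directories:", [str(p) for p in dirs])
else:
    print("No non-empty checkpoints*/ directory found")
```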

---

## Success Checklist

After the smoke test completes:

- [ ] Training completed without crashes
- [ ] Loss values are finite (not NaN)
- [ ] Loss changes over training (not stuck)
- [ ] Checkpoints saved in checkpoints/ folder
- [ ] Can load checkpoint for evaluation
- [ ] wandb shows training curves
- [ ] Experience replay buffer populated

**If all pass → Ready to scale to 32 or 128 tasks!**
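
The two loss items in the checklist (finite, and not stuck) can be checked mechanically once the loss values are in a Python list. The sketch below assumes the values were copied from the console log or the wandb run; how they are collected is left open. The helper name `loss_is_healthy` and the `min_change` threshold are hypothetical.

```python
import math


def loss_is_healthy(losses, min_change=1e-8):
    """True if every loss is finite and the values are not all identical."""
    if not losses:
        return False
    # Reject NaN and +/-inf anywhere in the run.
    if not all(math.isfinite(x) for x in losses):
        return False
    # "Changes over training": the range must exceed a small threshold.
    return max(losses) - min(losses) > min_change


print(loss_is_healthy([0.9, 0.7, 0.8]))        # finite and changing
print(loss_is_healthy([0.5, 0.5, 0.5]))        # stuck
print(loss_is_healthy([0.5, float("nan")]))    # contains NaN
```

This checks only that the loss moves, not that it decreases, which matches the smoke test's success criteria.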

**wandb**: https://wandb.ai/hanszhu05/arpo-smoke-test