# ARPO Smoke Test - VERL Pipeline Verification

**Stage 1: Wiring / Smoke Run**

Minimal test to verify that the ARPO training pipeline works end to end:
- 4 tasks only
- 2 Docker environments
- ~100-200 optimizer steps

## Success Criteria
- ✅ Loss is finite and changes
- ✅ Checkpoints save successfully
- ✅ Can resume from checkpoint
- ✅ Evaluation loads checkpoint
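
The first criterion can be checked mechanically once loss values have been collected (for example, copied from the console log or the wandb run). The helper below is a minimal sketch: `loss_is_healthy` and its `min_change` threshold are hypothetical names introduced here, not part of VERL or ARPO.

```python
import math

def loss_is_healthy(losses, min_change=1e-8):
    """Return True if every loss is finite and the values are not all identical.

    `losses` is a plain list of floats gathered from the training log.
    `min_change` is an illustrative threshold, not a value from the pipeline.
    """
    if not losses:
        return False  # nothing was logged, so nothing can be verified
    if not all(math.isfinite(x) for x in losses):
        return False  # NaN or inf means the run diverged
    # "Changes" here means the spread between the largest and smallest value is non-trivial.
    return max(losses) - min(losses) > min_change
```

At least two logged values are needed for the "changes" part of the check to pass; a single value is always reported as unhealthy.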

**Expected time: ~30-60 minutes on A100**

## 1. Check GPU (A100)

In [None]:
import torch
!nvidia-smi

if not torch.cuda.is_available():
    raise RuntimeError("GPU required!")
print(f"✅ {torch.cuda.get_device_name(0)}")

## 2. Clone Repo

In [None]:
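from getpass import getpass
token = getpass('GitHub token: ')
# Note: the token is embedded in the clone URL, so git stores it in .git/config.
!git clone https://{token}@github.com/gowathena/arpo_replica.git
%cd arpo_replica
!git checkout arpo-cpu-replicate
!git submodule update --init --recursive
# Drop the token from the stored remote URL once cloning is done.
!git remote set-url origin https://github.com/gowathena/arpo_replica.git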

## 3. Install

In [None]:
%pip install -q torch==2.5.1 transformers accelerate
%pip install -q ray omegaconf wandb tensordict datasets
%pip install -q qwen-vl-utils pillow tqdm psutil
%cd OSWorld
%pip install -q -r requirements.txt
%pip install -q -e .
%cd ..

## 4. Setup

In [None]:
# Docker
!sudo service docker start
!docker pull happysixd/osworld-docker:latest

# Ray
import ray
ray.init(num_cpus=4, num_gpus=1, ignore_reinit_error=True)

# wandb
import wandb, os
from getpass import getpass
os.environ['WANDB_API_KEY'] = getpass('wandb key: ')
wandb.login()

# Replace every 'vmware' with 'docker' in the OSWorld runner scripts
!sed -i 's/vmware/docker/g' OSWorld/run_uitars.py
!sed -i 's/vmware/docker/g' OSWorld/run_multienv_uitars.py

print('✅ Setup complete!')
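
Because the `sed` commands above rewrite the runner scripts in place and print nothing, it can be worth confirming that no `vmware` occurrences remain. The following is a small sketch for that check; `count_leftover` is a hypothetical helper, not part of OSWorld.

```python
from pathlib import Path

def count_leftover(paths, needle='vmware'):
    """Map each existing file to the number of times `needle` still appears in it.

    The match is case-sensitive, like the sed pattern, so 'VMware' is not counted.
    Paths that do not exist are skipped rather than raising.
    """
    counts = {}
    for p in map(Path, paths):
        if p.is_file():
            text = p.read_text(encoding='utf-8', errors='replace')
            counts[str(p)] = text.count(needle)
    return counts
```

In the notebook this would be called as `count_leftover(['OSWorld/run_uitars.py', 'OSWorld/run_multienv_uitars.py'])`; every count should be zero, and a missing key means the file was not found.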

## 5. Smoke Test Config

In [None]:
import yaml

config = {
    'data': {
        'train_files': 'test_data/osworld_examples/train_smoke_4.json',
        'val_files': 'test_data/osworld_examples/train_smoke_4.json',
        'prompt_key': 'instruction',
        'max_prompt_length': 16384,
        'max_response_length': 2048,
    },
    'algorithm': {
        'adv_estimator': 'grpo',
        'disable_kl': True,
        'kl_coef': 0,
        'enable_replay': True,
    },
    'worker': {
        'actor': {
            'global_batch_size': 4,
            'model': {
                'model_path': 'ByteDance-Seed/UI-TARS-2B-SFT',
                'trust_remote_code': True,
            },
            'optim': {'lr': 1e-6},
            'clip_ratio_low': 0.2,
            'clip_ratio_high': 0.3,
        },
        'rollout': {'temperature': 0.7, 'n': 2},
    },
    'env': {'num_envs': 2, 'max_steps': 10, 'provider': 'docker'},
    'trainer': {
        'total_episodes': 1,
        'logger': ['console', 'wandb'],
        'project_name': 'arpo-smoke-test',
        'experiment_name': 'smoke-4tasks-2envs',
        'n_gpus_per_node': 1,
        'save_freq': 1,
    },
}

with open('smoke_test.yaml', 'w') as f:
    yaml.dump(config, f)

print('✅ Smoke test config:')
print('  Tasks: 4')
print('  Envs: 2')
print('  Rollouts: 2')
print('  Max env steps per episode: 10')
print('  Expected: ~30-60 minutes')
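
The summary printed above is hard-coded, so it can drift from the actual config. A minimal sketch of deriving the same numbers from the dictionary is shown below. `summarize_smoke_config` is a hypothetical helper, and it assumes that `rollout.n` is the number of rollouts sampled per task and that the training file holds 4 tasks, as its name suggests.

```python
def summarize_smoke_config(config, num_tasks=4):
    """Derive the headline numbers of the smoke test from the config dict.

    Assumption: `rollout.n` is the number of rollouts sampled per task,
    so one pass over the data yields num_tasks * n trajectories.
    """
    algo = config['algorithm']
    env = config['env']
    return {
        'trajectories_per_pass': num_tasks * config['worker']['rollout']['n'],
        'num_envs': env['num_envs'],
        'max_env_steps_per_episode': env['max_steps'],
        'global_batch_size': config['worker']['actor']['global_batch_size'],
        # KL is fully off only if the flag is set and the coefficient is zero.
        'kl_disabled': bool(algo['disable_kl'] and algo['kl_coef'] == 0),
    }

# Trimmed copy of the relevant keys from the smoke test config, for illustration.
example = {
    'algorithm': {'disable_kl': True, 'kl_coef': 0},
    'worker': {'actor': {'global_batch_size': 4}, 'rollout': {'n': 2}},
    'env': {'num_envs': 2, 'max_steps': 10},
}
summary = summarize_smoke_config(example)
```

In the notebook, passing the full `config` dictionary instead of `example` gives the same keys.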

## 6. Run Smoke Test

This verifies:
1. VERL pipeline works
2. Experience replay functions
3. Checkpoints save
4. Loss is finite and changes (convergence is not expected at this scale)

**~30-60 minutes**

In [None]:
!python -m verl.trainer.main config=smoke_test.yaml

## 7. Verify Results

In [None]:
# Check checkpoints saved
!ls -lh checkpoints*/

# Check results
!ls -lh results*/

print('\n✅ If checkpoints and results are listed, saving works. Complete the checklist below to confirm the smoke test passed.')
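
Listing directories by eye is easy to misread, since an empty `checkpoints/` folder still shows up. As a rough programmatic alternative, the sketch below reports only matching directories that contain at least one file. `find_nonempty_dirs` is a hypothetical helper, and the `checkpoints*` pattern simply mirrors the `ls` command above; the real checkpoint layout may differ.

```python
from pathlib import Path

def find_nonempty_dirs(root='.', pattern='checkpoints*'):
    """Return directories under `root` matching `pattern` that hold at least one file.

    Files are searched recursively, so nested step folders count as well.
    """
    found = []
    for d in sorted(Path(root).glob(pattern)):
        if d.is_dir() and any(f.is_file() for f in d.rglob('*')):
            found.append(d)
    return found
```

An empty list from `find_nonempty_dirs()` means no checkpoint files were written, even if the folder itself exists. The same call with `pattern='results*'` covers the results folder.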

---

## Success Checklist

After smoke test completes:

- [ ] Training completed without crashes
- [ ] Loss values are finite (not NaN)
- [ ] Loss changes over training (not stuck)
- [ ] Checkpoints saved in checkpoints/ folder
- [ ] Can load checkpoint for evaluation
- [ ] wandb shows training curves
- [ ] Experience replay buffer populated

**If all pass → Ready to scale to 32 or 128 tasks!**

**wandb**: https://wandb.ai/hanszhu05/arpo-smoke-test