A lightweight verl-style RL training framework implemented from scratch.
- Readability: `nanoverl` has about 6K lines of code, compared with 90K+ lines in `verl`.
- Distributed training: uses `FSDP` + `vLLM` as the training and inference backends, with `Ray` for distributed management. Supports rollout load balancing, dynamic batch, remove padding, and more.
- Asynchronous support: supports `one-step-off-policy` asynchronous training, enabled by setting `trainer.mode=one_step_off`.

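
The one-step-off-policy idea mentioned above can be illustrated with a small, self-contained sketch. This is not nano-verl's implementation: `rollout`, `train`, and `one_step_off_loop` are hypothetical stand-ins, and a thread pool replaces the real vLLM and FSDP workers. The sketch only shows the scheduling pattern, where the rollout for the next step starts before the current update finishes, so each batch is generated by weights that are at most one update old.

```python
from concurrent.futures import ThreadPoolExecutor


def rollout(policy_version, step):
    # Hypothetical stand-in for generation: record which weights made the batch.
    return {"step": step, "generated_by": policy_version}


def train(policy_version, batch):
    # Hypothetical stand-in for one optimizer update: returns the new version.
    return policy_version + 1


def one_step_off_loop(num_steps):
    """Return (step, version that generated the batch, version being updated)."""
    policy_version = 0
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Prime the pipeline with a batch from the initial policy.
        pending = pool.submit(rollout, policy_version, 0)
        for step in range(num_steps):
            batch = pending.result()
            if step + 1 < num_steps:
                # Launch the next rollout with the current, pre-update weights
                # so generation can overlap with the update below.
                pending = pool.submit(rollout, policy_version, step + 1)
            log.append((step, batch["generated_by"], policy_version))
            policy_version = train(policy_version, batch)
    return log


if __name__ == "__main__":
    for step, generated_by, updated in one_step_off_loop(4):
        print(f"step={step} generated_by=v{generated_by} updating=v{updated}")
```

After the first step, every batch lags the weights being updated by exactly one version, which is the staleness the mode name refers to.
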
- Clone the code:

      git clone https://github.com/kidding-404/nano-verl.git
      cd nano-verl

- Install dependencies with `uv`:

      uv sync

- Find the compatible `flash-attn` wheel and install it separately:

      uv run pip install <flash_attn_wheel_url>

Train qwen3-0.6B on the gsm8k dataset:

    uv run python main.py --config configs/gsm8k-qwen3-0.6b-single-gpu.yaml

You can also train qwen3-1.7B asynchronously on two GPUs:

    uv run python main.py --config configs/gsm8k-qwen3-1.7b-1p1-async.yaml

Test configuration:

- Model: Qwen3-4B
- Trainset: DAPO-17K
- Reward: 1/-1 accuracy reward
- Steps: 150
- Global batch size: 64
- Rollout n: 8
- Prompt length: 1024
- Response length: 8192
- Hardware: 1 node, 8 x NVIDIA H100 80GB HBM3
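
To make the scale of this configuration concrete, here is a rough sketch of a 1/-1 accuracy reward and the per-step arithmetic. It rests on assumptions not stated above: that the global batch size counts prompts, that each prompt is sampled `Rollout n` times, and that the prompt and response lengths are maximum token counts. `accuracy_reward` is a hypothetical exact-match scorer, not the project's actual reward function.

```python
def accuracy_reward(predicted, reference):
    """Hypothetical scorer: 1.0 for an exact final-answer match, else -1.0."""
    return 1.0 if predicted.strip() == reference.strip() else -1.0


# Values copied from the test configuration.
GLOBAL_BATCH_SIZE = 64   # assumed to count prompts per step
ROLLOUT_N = 8            # assumed samples generated per prompt
PROMPT_LENGTH = 1024     # assumed maximum prompt tokens
RESPONSE_LENGTH = 8192   # assumed maximum response tokens
STEPS = 150

responses_per_step = GLOBAL_BATCH_SIZE * ROLLOUT_N
max_sequence_tokens = PROMPT_LENGTH + RESPONSE_LENGTH
total_responses = responses_per_step * STEPS

if __name__ == "__main__":
    print("responses per step:", responses_per_step)
    print("max tokens per sequence:", max_sequence_tokens)
    print("responses over the run:", total_responses)
```

Under these assumptions each step scores 512 responses of up to 9216 tokens each, for 76,800 responses over the 150-step run.
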

Reward curve: [figure]

Performance comparison:

| Setting | AIME24 avg16 | AIME24 pass@16 | AIME25 avg16 | AIME25 pass@16 |
|---|---|---|---|---|
| Qwen3-4B Base | 0.4333 | 0.7000 | 0.3563 | 0.5333 |
| Qwen3-4B + verl | 0.5313 | 0.8333 | 0.4417 | 0.6667 |
| Qwen3-4B + nano-verl | 0.535 | 0.8333 | 0.429 | 0.6667 |
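
The column names are not defined here. A common reading, assumed in the sketch below, is that avg16 is the mean accuracy over 16 samples per problem, averaged across problems, and that pass@16 is the fraction of problems solved by at least one of the 16 samples. The helpers `avg_at_k` and `pass_at_k` are hypothetical names, and the sample data is made up purely to show the computation.

```python
def avg_at_k(results):
    """Mean per-problem accuracy; results[i] holds k booleans for problem i."""
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)


def pass_at_k(results):
    """Fraction of problems with at least one correct sample among the k."""
    return sum(any(samples) for samples in results) / len(results)


if __name__ == "__main__":
    # Toy data: 2 problems, 4 samples each (illustrative only, not real results).
    toy = [
        [True, False, False, False],
        [False, False, False, False],
    ]
    print("avg@4:", avg_at_k(toy))
    print("pass@4:", pass_at_k(toy))
```

Under this reading, pass@16 can only move in steps of one problem, while avg16 reflects how often individual samples are correct.
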

