🐛 Describe the bug
My training job sometimes hangs after a number of steps. I used Claude Code to debug and found the following; please help:
Summary
Training stops and hangs after step 12 when max_policy_age is set to 0 (via off_by_n: 0). The training process stalls silently with no error messages, while the Julia Docker container continues running and processing requests.
Root Cause
A race condition between weight updates and buffer sampling when max_policy_age is 0:
Configuration:
```yaml
# apps/openenv/llama3_8b_julia.yaml:11,116
off_by_n: 0
max_policy_age: ${off_by_n}  # This is 0!
```
The Deadly Sequence:
- Training completes step 12 → increments training_step to 13
- Buffer sampling calls _evict(curr_policy_version=13)
- Eviction logic (replay_buffer.py:36) removes ALL episodes where policy_version < 13
- Generator is still on version 12 (weight updates take 17-30 seconds)
- New rollouts create episodes with generator_version=12
- These episodes are evicted immediately on the next sample call
Result: Infinite wait for version 13 episodes that never arrive → buffer becomes empty → sample() returns None → training deadlock
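Here is a minimal, self-contained sketch of the failure mode. The ToyReplayBuffer class, its fields, and the batch_size value are illustrative assumptions, not the real replay_buffer.py code; only the _evict/sample names and the version check come from the description above:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    policy_version: int  # version of the generator that produced this rollout

@dataclass
class ToyReplayBuffer:
    """Minimal stand-in for the real buffer; only the eviction rule is modeled."""
    max_policy_age: int
    batch_size: int
    episodes: list = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def _evict(self, curr_policy_version: int) -> None:
        # With max_policy_age = 0, only episodes whose policy_version equals
        # the trainer's current version survive.
        min_version = curr_policy_version - self.max_policy_age
        self.episodes = [e for e in self.episodes if e.policy_version >= min_version]

    def sample(self, curr_policy_version: int):
        self._evict(curr_policy_version)
        if len(self.episodes) < self.batch_size:
            return None  # trainer keeps waiting -> hang if nothing ever qualifies
        return self.episodes[: self.batch_size]

# Trainer just finished step 12 and bumped its version to 13, but the
# generator is still on version 12 while the weight push is in flight.
buf = ToyReplayBuffer(max_policy_age=0, batch_size=8)
for _ in range(8):
    buf.add(Episode(policy_version=12))  # fresh rollouts, still tagged v12

print(buf.sample(curr_policy_version=13))  # -> None: every v12 episode was evicted
```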
Evidence from Logs
Timeline:
21:19:45 - RLTrainer pushes weights for version 12
21:19:57 - Push completes (12s)
21:20:15 - Generator FINISHES updating to v12 (30s total!)
Metrics:
buffer/sample/avg_data_utilization: 1.0 (only 8 episodes in buffer - the minimum needed)
buffer/evict/sum_episodes_evicted: 8.0 (evicting 8 episodes per step)
Last logged activity: 21:20:15, then silence
Training completed only 12 out of 1500 steps
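For comparison, the same version check with an age tolerance of 1 (i.e. off_by_n: 1) would let the version-12 rollouts survive the step-13 eviction while the generator catches up. The helper below is hypothetical, written only to show the arithmetic:

```python
def survives_eviction(episode_version: int, curr_policy_version: int, max_policy_age: int) -> bool:
    """Mirror of the eviction check: keep episodes at most max_policy_age versions old."""
    return episode_version >= curr_policy_version - max_policy_age

# Trainer on version 13, generator still producing version-12 rollouts:
print(survives_eviction(12, 13, max_policy_age=0))  # False -> buffer drains, training hangs
print(survives_eviction(12, 13, max_policy_age=1))  # True  -> v12 rollouts remain usable
```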
Versions
Using my branch.