Open
Description
Keep centralized tracking of observability gaps
Monitoring
- [RFC] Logging and Observability Requirements for Distributed RL Training #203
- Lack of easy access to system metrics, especially across multiple nodes. This should ideally be pushed down to Monarch: Move distributed, see S576170 Follow-ups.
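To make the system-metrics gap concrete, here is a minimal sketch of what per-node collection could look like until it is pushed down to Monarch. It uses only the Python standard library. The helper names (`collect_node_metrics`, `namespace_by_host`) and the `system/<host>/<metric>` key layout are assumptions for illustration, not an existing API in this repo or in Monarch.

```python
import os
import shutil
import socket
import time


def collect_node_metrics(path="/"):
    """Return a flat dict of basic system metrics for the current node (hypothetical helper)."""
    disk = shutil.disk_usage(path)
    metrics = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count() or 0,
        "disk_used_frac": disk.used / disk.total if disk.total else 0.0,
    }
    try:
        # 1-minute load average; not available on every platform.
        metrics["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return metrics


def namespace_by_host(metrics, hostname=None):
    """Prefix metric names with the host so multi-node values do not collide (assumed layout)."""
    host = hostname or socket.gethostname()
    return {f"system/{host}/{name}": value for name, value in metrics.items()}
```

Each node would call `namespace_by_host(collect_node_metrics())` and hand the result to whatever logging backend is in use. GPU metrics are left out here because they need a vendor library.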
Profiling
- External profiling processes are not wired into the logging UI. For example, the PyTorch profiler does kinda work, but the result is written to the node's local file system instead of being uploaded to WandB. This, again, may be pushed down to Monarch, similar to how profilers are supported in Ray: https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html#nsight-system-profiler
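As a stopgap, the local trace files could be swept up and handed to an uploader after profiling finishes. The sketch below assumes traces land as `*.json` files in one directory per node. `upload_local_traces` and its `upload_fn` callback are hypothetical names; in practice `upload_fn` might wrap a WandB artifact upload, but that wiring is not shown and is not something this repo provides today.

```python
from pathlib import Path


def upload_local_traces(trace_dir, upload_fn, pattern="*.json"):
    """Call upload_fn(path) for each trace file under trace_dir.

    Returns the file names that were handed to upload_fn, in sorted order.
    upload_fn is a caller-supplied callback (hypothetical), e.g. one that
    pushes the file to the experiment tracker.
    """
    uploaded = []
    for path in sorted(Path(trace_dir).glob(pattern)):
        upload_fn(path)
        uploaded.append(path.name)
    return uploaded
```

Keeping the uploader as a callback means the same sweep works regardless of which logging backend ends up owning the traces.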
Rollout Tracing
- For a simple GRPO app, printing the prompt and response is mostly sufficient for debugging. With multi-turn agentic RL training, we need a rollout tracer similar to https://verl.readthedocs.io/en/latest/advance/rollout_trace.html
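A rough sketch of the kind of per-rollout record a tracer would need for multi-turn runs is below. This is not verl's rollout trace API. `RolloutTrace`, `record`, and the field names are assumptions chosen only to show the shape of the data (ordered turns plus free-form metadata such as tool calls or rewards).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RolloutTrace:
    """Ordered record of one multi-turn rollout (hypothetical structure)."""

    rollout_id: str
    steps: list = field(default_factory=list)

    def record(self, role, content, **meta):
        # Each step keeps its turn index so interleaved logs can be re-ordered later.
        self.steps.append(
            {"turn": len(self.steps), "role": role, "content": content, **meta}
        )

    def to_json(self):
        # One JSON object per rollout, suitable for writing as a JSON Lines record.
        return json.dumps(asdict(self))
```

Usage would look like `trace.record("assistant", response, reward=1.0)` at each turn, then writing `trace.to_json()` once the episode ends.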
felipemello1