[Main] Observability Gap Tracking #569

> Keep centralized tracking of observability gaps 

## Monitoring  
* https://github.com/meta-pytorch/torchforge/issues/203
* Lack of easy access to system metrics, especially across multiple nodes. This should ideally be pushed down to Monarch: Move distributed, see [S576170 Follow-ups](https://docs.google.com/document/d/1SZ4Fk6RBlWnFuPEa4ZnDVq38_rE53gqdkqhR6MFLGm0/edit?tab=t.0#bookmark=id.a37chz31ygb7).
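
To make the multi-node gap concrete, here is a minimal sketch of a per-node metrics snapshot tagged by hostname, using only the Python standard library. The names `node_metrics_snapshot` and `merge_snapshots` are hypothetical and are not part of torchforge or Monarch; how each node's snapshot reaches a central logger is exactly the part that would ideally live in Monarch and is not shown.

```python
import os
import shutil
import socket
import time


def node_metrics_snapshot(path="/"):
    """Return basic system metrics for this node, tagged with its hostname.

    Hypothetical helper for illustration; not a torchforge or Monarch API.
    """
    disk = shutil.disk_usage(path)
    snapshot = {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_used_fraction": disk.used / disk.total if disk.total else 0.0,
    }
    # os.getloadavg exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot


def merge_snapshots(snapshots):
    """Key snapshots by host so a logger can report one series per node."""
    return {snap["host"]: snap for snap in snapshots}
```

In a multi-node run, each node would produce its own snapshot and a single process would call `merge_snapshots` on the gathered list before logging.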

## Profiling 
* Lack of wiring from external profiling processes to the logging UI. For example, the PyTorch profiler mostly works, but its result is written to the node's local file system instead of being uploaded to WandB. Again, this may be pushed down to Monarch, similar to how profilers are supported in Ray: https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html#nsight-system-profiler
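
As a rough illustration of the missing wiring, the sketch below scans a node-local trace directory and hands each trace file to an uploader callback. The function name `upload_profiler_traces`, the `upload_file` callback, and the `*.json` pattern are assumptions made for this example; it assumes the profiler has already written JSON traces (for instance via `prof.export_chrome_trace(path)`) to `trace_dir`.

```python
from pathlib import Path


def upload_profiler_traces(trace_dir, upload_file, pattern="*.json"):
    """Pass every matching trace file in trace_dir to upload_file.

    Hypothetical helper: upload_file is any callable taking a local path.
    Returns the list of paths that were handed to the uploader.
    """
    uploaded = []
    for trace_path in sorted(Path(trace_dir).glob(pattern)):
        upload_file(str(trace_path))
        uploaded.append(str(trace_path))
    return uploaded
```

With WandB, `upload_file` could be a small wrapper that creates a `wandb.Artifact`, calls `add_file` on the path, and logs it with `log_artifact`. This would still have to run on every node that produced a trace, which is why pushing it down to Monarch is attractive.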

## Rollout Tracing
* For a simple GRPO app, printing the prompt and response is mostly sufficient for debugging. With multi-turn agentic RL training, we need a rollout tracer similar to https://verl.readthedocs.io/en/latest/advance/rollout_trace.html
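
To show what a rollout tracer could capture beyond a printed prompt and response, here is a minimal sketch that records one event per turn and serializes them as JSON Lines. `RolloutTrace` and its fields are hypothetical; this is not the verl tracer API, only an assumption about the smallest useful shape for multi-turn debugging.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class RolloutTrace:
    """Hypothetical per-rollout trace: an ordered list of turn events."""

    rollout_id: str
    events: list = field(default_factory=list)

    def record(self, turn, role, content, **extra):
        """Append one event, e.g. a policy response or a tool result."""
        event = {
            "turn": turn,
            "role": role,
            "content": content,
            "time": time.time(),
        }
        event.update(extra)
        self.events.append(event)

    def to_jsonl(self):
        """Serialize events as JSON Lines, one event per line."""
        return "\n".join(
            json.dumps({"rollout_id": self.rollout_id, **event})
            for event in self.events
        )
```

A multi-turn episode would call `record` once per prompt, model response, and tool call, then write `to_jsonl()` to whatever sink the logging UI reads.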


