Description
This issue outlines observability requirements for Forge. The goal is to establish shared understanding of what we need to build, not how to build it. Implementation proposals welcome in follow-up issues/PRs.
Context
RL presents a unique ML systems challenge. At minimum, an RL training system comprises heterogeneous components (trainers, generators, replay buffers) that can be modeled as individual SPMD jobs with fundamentally different computational profiles. This heterogeneity creates a combined systems and numerics challenge: intra-component performance remains critical (trainer MFU, generator throughput), but optimizing components in isolation isn't sufficient. Inter-component interactions must also stay healthy; otherwise, systemic coordination failures manifest as opaque symptoms like "rewards are tanking and we don't know why."
A system like Forge requires next-level observability that can trace causality across heterogeneous components - something we argue isn't sufficiently solved by the current ecosystem. Whether this architecture becomes a permanent part of our stack is out of scope; these are simply the observability requirements we need to meet today.
Problem Statement: Per-Experiment vs System-Level Observability
Starting from first principles, we can distinguish two categories of metrics:
Systems-Level Metrics
Infrastructure and resource-utilization metrics that exist independently of any particular experiment. These include GPU memory usage, CPU utilization, network bandwidth, and disk I/O - metrics that persist across experiment boundaries and are primarily useful for infrastructure health monitoring. Most cluster providers already offer robust, mature solutions for this.
Per-Experiment Metrics
Training dynamics, model behavior, and coordination patterns that are scoped to a specific experiment lifecycle. These include training loss, gradient statistics, actor queue depths, policy staleness, generation throughput - metrics that start and stop with the experiment and are primarily useful for debugging training behavior.
The Gap
Forge users (ML researchers and engineers) need unified access to per-experiment metrics to debug their workloads. With the current observability ecosystem, debugging an RL experiment requires going out-of-band: checking training loss in WandB, then switching to Grafana for queue depths, then manually correlating timestamps. If researchers must leave their experiment dashboard to debug their experiment, our observability system has failed.
An ML researcher should be able to open their experiment in WandB and have sufficient visibility to debug training issues - trainer loss trends, generator throughput, cross-actor coordination metrics, data freshness - all unified in the experiment view where they're already working.
The core requirement is therefore experiment-scoped aggregation of metrics from distributed heterogeneous actors into a single unified experiment dashboard.
Current Ecosystem Limitations
Moving from a Single-Process World
WandB was designed around the assumption of single-process training jobs. Each wandb.init() call creates a distinct run, and the platform excels at tracking metrics from a single training process over time. This works perfectly for traditional ML where one GPU trains one model.
Meanwhile, RL training involves multiple heterogeneous actors (trainers, generators, replay buffers) that need to contribute metrics to the same logical experiment. When each actor calls wandb.init() with the same experiment name, WandB creates separate runs instead of aggregating them into a unified experiment view:
# What happens today:
Trainer: wandb.init(project="grpo", name="experiment-123") # → run-abc
Generator-1: wandb.init(project="grpo", name="experiment-123") # → run-def
Generator-2: wandb.init(project="grpo", name="experiment-123") # → run-ghi
# Result: 3 disconnected runs, manual correlation required
Therefore, to debug a training issue, researchers currently must:
- Check training loss in WandB run-abc
- Switch to WandB run-def to check generator throughput
- Open Grafana to check system queue depths
- Manually correlate timestamps across different dashboards
- Piece together the narrative from fragmented data sources
Alternative tools don't solve this!
- Prometheus/Grafana: Designed for system metrics, lacks experiment scoping and ML workflow integration
- TensorBoard: Single-process limitations similar to WandB's
- MLflow: Primarily a model registry, with limited real-time experiment monitoring
- Custom dashboards: Require researchers to leave their ML workflow tools
No tool in the current ecosystem provides experiment-scoped aggregation of metrics from distributed heterogeneous processes. The gap isn't in the individual tools' capabilities, but in the fundamental mismatch between distributed RL training patterns and the single-process assumptions baked into ML observability tools.
Required Per-Experiment Metrics
To enable effective debugging of distributed RL training, the following metrics must be aggregated and correlated within each experiment scope. These represent the minimum set needed to trace causality from low-level component performance to high-level training outcomes.
Training Components
- Training dynamics: Loss curves, gradient statistics (abs-max, mean-squared norm; see the sketch after this list), learning rate schedule
- Policy quality: Advantage distribution (min/max/mean), KL divergence (policy || reference), policy entropy
- Data processing: Total reward per batch, episode length, batch creation time
- Performance: MFU, policy update frequency, parameter update magnitude, idle time
- Coordination: Weight synchronization overhead, time to push weights to generators
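To make the gradient-statistics bullet above concrete, here is a short PyTorch sketch (assuming a torch.nn.Module; the grad_stats helper is hypothetical, not part of Forge) showing one way to compute the abs-max and mean-squared gradient norm after a backward pass. The resulting dict could then be passed to mlogger.log() alongside the loss.
import torch

def grad_stats(model: torch.nn.Module) -> dict[str, float]:
    # Hypothetical helper: call after loss.backward(), before optimizer.step().
    abs_max, sq_sum, n_elems = 0.0, 0.0, 0
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        abs_max = max(abs_max, g.abs().max().item())
        sq_sum += g.pow(2).sum().item()
        n_elems += g.numel()
    return {
        "grad/abs_max": abs_max,
        "grad/mean_sq_norm": sq_sum / max(n_elems, 1),  # mean of squared gradient entries
    }
# e.g. mlogger.log(grad_stats(model)) each training step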
Generator Component(s)
- Resource utilization: KV cache size and utilization, KV cache miss rate
- Throughput: Token generation rate, requests per replica, sequence length distribution, idle time
- Quality: Tokenization time, model loading overhead, sampling parameter distributions
- Coordination: Policy staleness (age of policy when generating), request queue depth, time to pull weights from trainer
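For the policy-staleness bullet above, a minimal sketch of one way it could be computed, assuming each weight push from the trainer carries a monotonically increasing version number. The class and field names below are illustrative, not an existing Forge API.
class PolicyVersionTracker:
    # Hypothetical bookkeeping inside a generator replica.
    def __init__(self) -> None:
        self.current_version = 0  # version of the weights this replica is generating with

    def on_weights_pulled(self, trainer_version: int) -> None:
        self.current_version = trainer_version

    def staleness(self, latest_trainer_version: int) -> int:
        # Age of the generating policy, measured in trainer weight-push versions.
        return latest_trainer_version - self.current_version
# e.g. mlogger.log({"policy_staleness": tracker.staleness(latest_trainer_version)})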
Service-Level Visibility
- Load balancing: Per-replica queue depth
- Fault tolerance: Per-replica failures, time to recovery
Replay Buffer
- Capacity: Buffer size, queue depth, eviction rate
- Performance: Write/read throughput, data retrieval latency
These metrics must be temporally aligned and experiment-scoped to enable root cause analysis.
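To make "temporally aligned and experiment-scoped" concrete, below is one possible shape for the records every actor could emit. This is a sketch only; MetricRecord and its fields are assumptions, not a settled schema.
import time
from dataclasses import dataclass, field

@dataclass
class MetricRecord:
    experiment_id: str             # scopes the record to one experiment lifecycle
    actor: str                     # e.g. "trainer", "generator-1", "replay_buffer"
    step: int                      # trainer step (or local step) used for cross-actor alignment
    metrics: dict[str, float]      # e.g. {"loss": 0.42} or {"policy_staleness": 3.0}
    timestamp: float = field(default_factory=time.time)  # wall clock, for temporal alignment

record = MetricRecord(experiment_id="experiment-123", actor="generator-1",
                      step=1840, metrics={"tokens_per_s": 1800.0, "policy_staleness": 2.0})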
Success Criteria
A few proposed criteria for success:
Primary Goal - Single Dashboard Debugging
- ML researchers can diagnose training issues using only their experiment dashboard (WandB)
- No context-switching to external tools (Grafana, logs, etc.) required for standard debugging workflows
- Metrics from all actors (trainer, generators, replay buffer) visible in unified experiment view
API Simplicity
- Preserves familiar logging patterns: Each actor can use simple mlogger.log(metrics) calls
- No distributed coordination in user code: Actors don't need to know about other actors or handle message passing
- Local API, global aggregation: Simple local logging interface with transparent backend aggregation
Operational Requirements
- Real-time visibility: Metrics update within N seconds of generation
- Historical analysis: Full experiment timeline preserved for post-mortem analysis
- Cross-component correlation: Can trace causality from symptoms (reward drop) to root causes (generator slowdown)
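As an illustration of the cross-component correlation bullet above: once records are timestamped and experiment-scoped, aligning two actors' time series becomes a simple as-of join. Below is a sketch with toy data using pandas; the column names follow the hypothetical record shape sketched earlier.
import pandas as pd

trainer = pd.DataFrame({
    "timestamp": pd.to_datetime(["12:00:00", "12:00:10", "12:00:20"]),
    "reward_mean": [0.61, 0.58, 0.31],
})
generator = pd.DataFrame({
    "timestamp": pd.to_datetime(["12:00:01", "12:00:11", "12:00:19"]),
    "policy_staleness": [1, 2, 7],
})
# Align each reward sample with the nearest generator sample within 5 seconds.
aligned = pd.merge_asof(trainer, generator, on="timestamp",
                        direction="nearest", tolerance=pd.Timedelta("5s"))
print(aligned)  # the reward drop at 12:00:20 lines up with staleness jumping to 7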
Integration Requirements
- Preserves existing ML workflow: Researchers continue using WandB as primary experiment interface
- Backward compatible: Existing single-actor experiments continue working unchanged; switching our logging backend should be transparent to users
- Minimal instrumentation overhead: Ideally, <5% performance impact on training throughput
Anti-Patterns
# This kind of complexity is a failure:
trainer_metrics = await trainer.get_metrics.call()   # explicit per-actor metric collection
policy_metrics = await policy.get_metrics.call()     # complex message passing
combined_metrics = merge_metrics(trainer_metrics, policy_metrics)  # manual aggregation in user code
global_wandb_logger.log(combined_metrics)            # user-managed global logger
A proposed front-end:
class Trainer(Actor):
    @endpoint
    def push_weights(self):
        start = time.perf_counter()
        ...  # push updated weights to the generators
        push_weights_s = time.perf_counter() - start
        mlogger.log({"push_weights_s": push_weights_s})
The solution must keep the logging API dead simple while handling the distributed aggregation transparently under the hood.
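As one illustration of what "transparently under the hood" could mean, here is a minimal sketch in which each actor's logger tags records locally and ships them to a single aggregator process that owns the experiment's only wandb.init() call. The queue transport, class names, and per-actor metric prefixes are all assumptions for illustration, not a proposed Forge design.
import time
from multiprocessing import Queue

class LocalMetricLogger:
    # What mlogger.log() could do inside each actor: tag and forward, no cross-actor coordination.
    def __init__(self, experiment_id: str, actor: str, queue) -> None:
        self.experiment_id = experiment_id
        self.actor = actor
        self.queue = queue

    def log(self, metrics: dict, step: int | None = None) -> None:
        self.queue.put({"experiment_id": self.experiment_id, "actor": self.actor,
                        "step": step, "timestamp": time.time(), "metrics": metrics})

def run_aggregator(queue, experiment_id: str) -> None:
    # Single process that owns the one WandB run for the whole experiment.
    import wandb
    run = wandb.init(project="grpo", name=experiment_id)
    while True:
        record = queue.get()
        if record is None:  # sentinel to shut down
            break
        # Namespace metrics per actor so everything lands in one unified dashboard.
        run.log({f"{record['actor']}/{k}": v for k, v in record["metrics"].items()})
    run.finish()
From an actor's point of view, usage stays identical to the proposed front-end above: construct the logger once and call log() wherever a metric is produced.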
Next Steps
This document establishes our core requirement: experiment-scoped aggregation of metrics from distributed heterogeneous actors, accessible through simple local logging APIs, unified in a single dashboard view.
The challenge is building a system that handles distributed coordination transparently while preserving the simplicity of mlogger.log() calls that researchers expect. Whether this involves a lightweight aggregation layer, modified WandB integration, or hybrid file-based approach is an implementation question.
Immediate Priority: Unblock ML researchers from out-of-band debugging workflows. Given our timeline constraints, I welcome any solution that enables unified experiment visibility within the next week and preserves API simplicity, while we explore more sustainable longer-term solutions.
Ecosystem Evolution: This capability gap likely reflects the relative immaturity of distributed RL tooling compared to traditional ML infrastructure. As the ecosystem evolves—whether through WandB adding native distributed experiment support, improved MLOps platforms, or standardized observability protocols—we should evaluate replacing any custom solution with proven industry tooling.
Implementation proposals welcome in follow-up issues.