v0.4.0
Monarch v0.4 Release Notes
New Features
Networking & RDMA
- EFA support for RDMA — RDMA now runs over AWS's Elastic Fabric Adapter (EFA) via libefa.
- TCP fallback for RDMA — when RDMA is unavailable, the data plane automatically falls back to TCP, broadening hardware compatibility (#2999).
- ROCm / HIP support for the RDMA stack, enabling AMD GPU deployments (#2891).
- The channel transport layer was rewritten around a typed session lifecycle and unified NetLink dispatch, improving reconnect reliability and adding duplex-mode channels.
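The TCP fallback described above follows a common pattern: try the preferred transport first and move to the next one only when it reports itself unavailable. The sketch below illustrates that pattern in plain Python. It does not use Monarch's API; `TransportUnavailable`, `connect_with_fallback`, `connect_rdma`, and `connect_tcp` are hypothetical names chosen for this example.

```python
from typing import Callable, List, Tuple


class TransportUnavailable(Exception):
    """Raised by a (hypothetical) connector when its transport cannot be used."""


def connect_with_fallback(
    connectors: List[Tuple[str, Callable[[], object]]],
) -> Tuple[str, object]:
    """Try each (name, connect) pair in order and return the first success."""
    errors = []
    for name, connect in connectors:
        try:
            return name, connect()
        except TransportUnavailable as exc:
            # Remember why this transport was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no transport available: " + "; ".join(errors))


def connect_rdma() -> object:
    # Stand-in for a host without RDMA-capable hardware.
    raise TransportUnavailable("no RDMA device found")


def connect_tcp() -> object:
    # Stand-in for a TCP connection that always succeeds.
    return "tcp-connection"


name, conn = connect_with_fallback([("rdma", connect_rdma), ("tcp", connect_tcp)])
```

Ordering the connectors from most to least preferred keeps the selection logic in one place, and the collected errors make it clear why a faster transport was skipped.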
Distributed Telemetry & Dashboard
Monarch now ships a built-in observability dashboard. The new distributed telemetry system collects actor, mesh, host, proc, and message-level data in real time and exposes it through both a web UI and a schema-first REST API (OpenAPI 3.1). An OTLP-compatible metrics, logs, and trace exporter makes it straightforward to integrate with Grafana, Jaeger, or any OpenTelemetry collector in Kubernetes deployments.
Admin TUI & Live Diagnostics
A new terminal UI (admin_tui) provides live introspection of running meshes, procs, and actors via an HTTP admin server. It includes a built-in py-spy integration that can capture Python stack traces from any running actor directly in the TUI, making it much easier to diagnose stalls and performance issues in production.
Kubernetes
KubernetesJob gained Python-native provisioning, removing the dependency on an external Go controller for mesh creation. A new optional labels parameter on add_mesh() enables integration with Kueue and other label-based Kubernetes controllers (#2693).
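Label-based controllers such as Kueue select workloads by the labels on their Kubernetes objects; Kueue reads the queue from the `kueue.x-k8s.io/queue-name` label. The sketch below shows how user-supplied labels might be merged into pod metadata. It is an illustration only: `build_pod_metadata`, its default labels, and the queue name `team-a-queue` are hypothetical, and the full signature of `add_mesh()` is not shown in these notes.

```python
from typing import Dict, Optional


def build_pod_metadata(
    mesh_name: str, labels: Optional[Dict[str, str]] = None
) -> Dict[str, object]:
    """Hypothetical helper: merge optional user labels into pod metadata."""
    # Hypothetical default labels; not taken from Monarch.
    base = {"app": "monarch-worker", "mesh": mesh_name}
    # User labels are applied last, so they win on key conflicts.
    merged = {**base, **(labels or {})}
    return {"name": f"{mesh_name}-worker", "labels": merged}


metadata = build_pod_metadata(
    "trainers",
    labels={"kueue.x-k8s.io/queue-name": "team-a-queue"},
)
```

Keeping the parameter optional means existing callers that pass no labels get the same metadata as before.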
Python API Changes
- allocate_nonblocking, from_alloc, and host_mesh are renamed to private methods; use attach_to_workers and the KubernetesJob / ProcessJob APIs instead (#2971).
- NUMA bindings are now exposed for proc mesh spawning (#2996).
Bug Fixes & Performance Improvements
Supervision & Fault Tolerance
- ControllerController supervision — a single child torchstore controller failure no longer poisons the parent and all siblings. Each child is now isolated, fixing a critical bug where one failed session could block all subsequent get_or_spawn_controller() calls (#2835).
- Orphaned mesh cleanup — child actors now detect when their parent is unreachable and self-terminate, preventing leaked GPU resources (#2198).
- Clean Python shutdown — proc exit now calls Py_FinalizeEx, giving Python objects a chance to run destructors and eliminating the pybind11::dec_ref GIL crashes seen during shutdown (#2524).
- Reliable proc_mesh.stop() — stop now flushes pending messages and acks before exiting, fixing races that caused spurious errors in CI and user code (#2658).
Performance
- Lazy ValueMesh unpickling — values returned from accumulate are now deserialized on access rather than eagerly, reducing latency for large results (#2983).
- RLE-compressed OnceBuffer accumulation — repeated identical values are run-length encoded during accumulation, cutting memory and network cost for common broadcast patterns (#2989).
- Telemetry overhead was significantly reduced by demoting internal spans and gating channel-level tracing behind DEBUG.
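The two ideas behind the first two items above, deserializing on access and run-length encoding repeated values, can be sketched in a few lines of plain Python. This is not Monarch's implementation: the internals of ValueMesh and OnceBuffer are not described in these notes, and `LazyValue`, `rle_append`, and `rle_expand` are hypothetical names used only for illustration.

```python
import pickle
from typing import Any, List, Tuple


class LazyValue:
    """Holds pickled bytes and deserializes them only on first access."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload
        self._loaded = False
        self._value: Any = None

    def get(self) -> Any:
        if not self._loaded:
            # Deserialization cost is paid here, not when the value arrives.
            self._value = pickle.loads(self._payload)
            self._loaded = True
        return self._value


def rle_append(runs: List[Tuple[bytes, int]], payload: bytes) -> None:
    """Append a payload, extending the last run when the payload repeats."""
    if runs and runs[-1][0] == payload:
        runs[-1] = (payload, runs[-1][1] + 1)
    else:
        runs.append((payload, 1))


def rle_expand(runs: List[Tuple[bytes, int]]) -> List[LazyValue]:
    """Expand runs back into one LazyValue per original position."""
    out: List[LazyValue] = []
    for payload, count in runs:
        out.extend(LazyValue(payload) for _ in range(count))
    return out


# Accumulate five results; the first three are identical and share one run.
runs: List[Tuple[bytes, int]] = []
for value in ["ok", "ok", "ok", {"loss": 0.5}, "ok"]:
    rle_append(runs, pickle.dumps(value))

values = rle_expand(runs)
```

Here five accumulated results are stored as three runs, and nothing is unpickled until a caller invokes `get()` on a specific entry. The savings grow with the length of each run, which is why broadcast-style results that repeat the same value benefit most.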
Build & Packaging
- Official aarch64 (ARM64) release binaries are now published alongside x86_64 on PyPI.