v0.4.0

dulinriley released this 26 Mar 20:52

Monarch v0.4 Release Notes

New Features

Networking & RDMA

  • EFA support for RDMA: RDMA now works over AWS's Elastic Fabric Adapter (EFA) through libefa.
  • TCP fallback for RDMA: when RDMA is unavailable, the data plane automatically falls back to TCP, broadening hardware compatibility (#2999).
  • ROCm / HIP support for the RDMA stack, enabling AMD GPU deployments (#2891).
  • The channel transport layer was rewritten around a typed session lifecycle and unified NetLink dispatch, improving reconnect reliability and adding duplex-mode channels.
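
The fallback behaviour described above follows a common pattern: try the preferred transport first and move on to the next one when it is unavailable. The sketch below illustrates only that pattern. It is not Monarch's implementation, and every name in it (Transport, TransportUnavailable, open_rdma, open_tcp, open_data_plane) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


class TransportUnavailable(Exception):
    """Raised by a (hypothetical) transport factory when its backend is missing."""


@dataclass
class Transport:
    name: str


def open_rdma() -> Transport:
    # Stand-in for a real RDMA probe. It always fails here, as it would
    # on a host without RDMA-capable hardware.
    raise TransportUnavailable("no RDMA device found")


def open_tcp() -> Transport:
    # TCP is assumed to be available everywhere.
    return Transport(name="tcp")


def open_data_plane(factories: Sequence[Callable[[], Transport]]) -> Transport:
    """Try each factory in priority order and return the first that succeeds."""
    errors: List[str] = []
    for factory in factories:
        try:
            return factory()
        except TransportUnavailable as exc:
            errors.append(f"{factory.__name__}: {exc}")
    raise RuntimeError("no usable transport: " + "; ".join(errors))


transport = open_data_plane([open_rdma, open_tcp])
print(transport.name)  # prints "tcp"
```

Collecting the per-transport errors keeps the final failure message useful when no transport at all can be opened.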

Distributed Telemetry & Dashboard

Monarch now ships a built-in observability dashboard. The new distributed telemetry system collects actor, mesh, host, proc, and message-level data in real time and exposes it through both a web UI and a schema-first REST API (OpenAPI 3.1). An OTLP-compatible metrics, logs, and trace exporter makes it straightforward to integrate with Grafana, Jaeger, or any OpenTelemetry collector in Kubernetes deployments.

Admin TUI & Live Diagnostics

A new terminal UI (admin_tui) provides live introspection of running meshes, procs, and actors via an HTTP admin server. It includes a built-in py-spy integration that can capture Python stack traces from any running actor directly in the TUI, making it much easier to diagnose stalls and performance issues in production.

Kubernetes

KubernetesJob gained Python-native provisioning, removing the dependency on an external Go controller for mesh creation. A new optional labels parameter on add_mesh() enables integration with Kueue and other label-based Kubernetes controllers (#2693).
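
As an illustration of the labels parameter, the sketch below builds a label dictionary for Kueue and checks it against Kubernetes label syntax before it would be handed to add_mesh(). The helper validate_labels and the queue name team-a-queue are hypothetical, and the remaining add_mesh() arguments are not described in these notes, so the call itself appears only as a comment. The label key kueue.x-k8s.io/queue-name is the one Kueue uses to select a LocalQueue.

```python
import re
from typing import Dict

# A Kubernetes label name segment or label value: at most 63 characters,
# alphanumeric at both ends, with '-', '_' and '.' allowed in between.
_SEGMENT_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9_.-]{0,61}[A-Za-z0-9])?")


def validate_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Return labels unchanged, or raise ValueError on invalid syntax."""
    for key, value in labels.items():
        # Keys may carry an optional DNS-subdomain prefix: "prefix/name".
        prefix, _, name = key.rpartition("/")
        if len(prefix) > 253:
            raise ValueError(f"label prefix too long: {key!r}")
        if not _SEGMENT_RE.fullmatch(name):
            raise ValueError(f"invalid label name: {key!r}")
        # Empty values are allowed; non-empty ones follow the segment rule.
        if value and not _SEGMENT_RE.fullmatch(value):
            raise ValueError(f"invalid label value for {key!r}: {value!r}")
    return labels


labels = validate_labels({"kueue.x-k8s.io/queue-name": "team-a-queue"})

# Assumed call shape; only the `labels` keyword is taken from the notes above:
# job.add_mesh(..., labels=labels)
```

Validating on the client side is optional, but it surfaces a malformed label before the request reaches the cluster.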

Python API Changes

  • allocate_nonblocking, from_alloc, and host_mesh are renamed to private methods; use attach_to_workers and the KubernetesJob / ProcessJob APIs instead (#2971).
  • NUMA bindings are now exposed for proc mesh spawning (#2996).

Bug Fixes & Performance Improvements

Supervision & Fault Tolerance

  • ControllerController supervision — a single child torchstore controller failure no longer poisons the parent and all siblings. Each child is now isolated, fixing a critical bug where one failed session could block all subsequent get_or_spawn_controller() calls (#2835).
  • Orphaned mesh cleanup — child actors now detect when their parent is unreachable and self-terminate, preventing leaked GPU resources (#2198).
  • Clean Python shutdown — proc exit now calls Py_FinalizeEx, giving Python objects a chance to run destructors and eliminating the pybind11::dec_ref GIL crashes seen during shutdown (#2524).
  • Reliable proc_mesh.stop() — stop now flushes pending messages and acks before exiting, fixing races that caused spurious errors in CI and user code (#2658).
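
The isolation property in the first item above can be pictured with a toy registry in which a failed child affects only its own entry. This is a sketch of the idea under simplifying assumptions (single-threaded, no real actors), not the code from #2835; ControllerRegistry and its methods are hypothetical names.

```python
from typing import Callable, Dict


class ControllerRegistry:
    """Toy stand-in for a parent that lazily spawns named child controllers."""

    def __init__(self) -> None:
        self._children: Dict[str, object] = {}

    def get_or_spawn(self, name: str, spawn: Callable[[], object]) -> object:
        if name not in self._children:
            # A spawn failure propagates to this caller only. Nothing is
            # cached, so the parent and the other children are untouched.
            self._children[name] = spawn()
        return self._children[name]

    def report_failure(self, name: str) -> None:
        # Drop only the failed child so that a later call can respawn it.
        self._children.pop(name, None)


def failing_spawn() -> object:
    raise RuntimeError("child crashed during startup")


registry = ControllerRegistry()
first = registry.get_or_spawn("session-a", lambda: "controller-a")

try:
    registry.get_or_spawn("session-b", failing_spawn)
except RuntimeError:
    pass  # only the caller asking for session-b sees the error

# The sibling is still served, and session-b can be spawned again later.
print(registry.get_or_spawn("session-a", failing_spawn))  # prints "controller-a"
print(registry.get_or_spawn("session-b", lambda: "controller-b"))  # prints "controller-b"
```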

Performance

  • Lazy ValueMesh unpickling — values returned from accumulate are now deserialized on access rather than eagerly, reducing latency for large results (#2983).
  • RLE-compressed OnceBuffer accumulation — repeated identical values are run-length encoded during accumulation, cutting memory and network cost for common broadcast patterns (#2989).
  • Telemetry overhead was significantly reduced by demoting internal spans and gating channel-level tracing behind DEBUG.
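
The first two items combine naturally: identical serialized replies collapse into runs, and each payload is unpickled only when a caller asks for it. The sketch below shows both ideas with the standard library. It is not how ValueMesh or OnceBuffer are implemented, and rle_encode, rle_decode, and LazyValue are hypothetical names.

```python
import pickle
from itertools import groupby
from typing import Any, Iterable, List, Tuple


def rle_encode(values: Iterable[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs: Iterable[Tuple[Any, int]]) -> List[Any]:
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


class LazyValue:
    """Holds a pickled payload and deserializes it on first access only."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload
        self._value: Any = None
        self._loaded = False

    def get(self) -> Any:
        if not self._loaded:
            self._value = pickle.loads(self._payload)
            self._loaded = True
        return self._value


# 1000 ranks return the same serialized reply, then one rank differs.
same = pickle.dumps({"loss": 0.5})
other = pickle.dumps("done")
raw = [same] * 1000 + [other]

runs = rle_encode(raw)
print(len(runs))  # prints 2

lazy = LazyValue(runs[0][0])  # nothing has been unpickled yet
print(lazy.get())  # prints {'loss': 0.5}
```

Run-length encoding only helps when equal values are adjacent, which is the case for the broadcast-style replies mentioned above.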

Build & Packaging

  • Official aarch64 (ARM64) release binaries are now published alongside x86_64 on PyPI.