0.4.1

@dulinriley dulinriley released this 08 Apr 01:13
· 557 commits to main since this release

Full Changelog: v0.4.0...v0.4.1

v0.4.1 is a small patch release that includes some powerful new features and important bug fixes.

New Features & API changes

v0.4.1 adds a substantial new CLI workflow around long-lived jobs:
monarch apply and monarch exec can now be used to launch subclasses
of JobTrait.

This release also introduces JobTrait.remote_mount, which mounts a local filesystem
so that it stays in sync with the workers in the Monarch job. It creates a FUSE mount on
each worker and syncs changes to the filesystem to all workers, sending the data over
RDMA or TCP depending on availability.

JobTrait.gather_mount works in reverse: it is a read-only FUSE mount that
pulls per-worker directories back into a unified local view. This can be used to gather
logs or other outputs from all workers so they can be examined locally.

The Monarch Dashboard is a local web UI for inspecting a running Monarch job
in real time. It is included in torchmonarch and starts alongside telemetry.
For jobs, enable both admin and telemetry:

job.enable_admin()
job.enable_telemetry(TelemetryConfig(include_dashboard=True, dashboard_port=8265))

The dashboard has three views:

  • Summary for overall health, actor counts, failures, and message traffic.
  • Hierarchy for drilling from host mesh down to individual actor details.
  • DAG for an interactive topology view of hosts, procs, and actors.

It’s still early, so the UI and APIs may evolve, but it’s already useful for
understanding topology, debugging failures, and inspecting message flow.

On the mesh-admin side, the HTTP surface expands with POST /v1/query and
POST /v1/pyspy_dump, while the internals were refactored to use typed IDs,
references, and timestamps behind a curl-friendly JSON/DTO boundary. That
should make the admin API easier to evolve without breaking existing
consumers.
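
As a rough illustration of what calling the new endpoints might look like from Python, the sketch below builds a JSON POST request for /v1/query with only the standard library. The base URL, the request body fields, and the assumption that responses are JSON are all placeholders: these notes do not document the admin server address or the request schema, so check the mesh-admin documentation before relying on any of them.

```python
import json
import urllib.request

# Assumption: where the mesh admin server listens. These notes do not
# specify an address, so this value is a placeholder.
ADMIN_BASE_URL = "http://localhost:8080"


def build_admin_post(path, payload, base_url=ADMIN_BASE_URL):
    """Build (but do not send) a JSON POST request for an admin endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send a prepared request and decode the reply, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical body: the real schema for /v1/query is not described here.
query_request = build_admin_post("/v1/query", {"example_field": "placeholder"})
```

Building the request separately from sending it keeps the sketch easy to inspect without a running job; passing "/v1/pyspy_dump" as the path would target the other new endpoint in the same way.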

Bug fixes

  • The RDMA function is_rdma_available, which was deleted in v0.4.0, is back with a deprecation warning. It is now just a wrapper around is_ibverbs_available. To check which implementation is in use, get_rdma_backend is recommended.
  • RDMA bug fix for mlx5dv: #3293
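
The restored is_rdma_available follows a common deprecation pattern: keep the old name, emit a warning, and forward to the replacement. The sketch below shows that general pattern only. It is not Monarch's source, and the is_ibverbs_available defined here is a stand-in that always returns False so the example runs without RDMA hardware.

```python
import warnings


def is_ibverbs_available() -> bool:
    """Stand-in for the real check; always False in this sketch."""
    return False


def is_rdma_available() -> bool:
    """Deprecated alias that warns, then forwards to is_ibverbs_available."""
    warnings.warn(
        "is_rdma_available is deprecated; use is_ibverbs_available instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return is_ibverbs_available()
```

Callers that still use the old name keep working, and running with `python -W error::DeprecationWarning` is one way to find them before the alias is removed again.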

Runtime correctness also improved in several error paths:
stop_actor_by_name now waits for actual actor termination, mesh scans no
longer crash or spin forever when a ProcMesh spawn fails, and mesh-controller
OncePort replies now return accumulated responses correctly.

Performance

A zero-copy regression in the pickle send path was fixed: #3234
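
For readers unfamiliar with what "zero-copy" means for pickling, the sketch below uses Python's standard pickle protocol 5 with out-of-band buffers. It illustrates the general technique only and makes no claim about how Monarch's send path is implemented; the payload size is arbitrary.

```python
import pickle

# A large binary payload, standing in for tensor or message data.
payload = bytearray(1_000_000)

# In-band: the payload bytes are copied into the pickle stream itself.
in_band = pickle.dumps(pickle.PickleBuffer(payload), protocol=5)

# Out-of-band: the stream holds only a marker, and the buffer is handed
# to buffer_callback so a transport can send it without copying.
buffers = []
out_of_band = pickle.dumps(
    pickle.PickleBuffer(payload), protocol=5, buffer_callback=buffers.append
)

# Mutating the original after pickling is visible through the restored
# object, which shows the collected buffer still refers to the same memory.
payload[0] = 7
restored = pickle.loads(out_of_band, buffers=buffers)
first_byte = memoryview(restored)[0]
```

A regression of this kind typically means large buffers end up copied in-band again, which shows up as a pickle stream roughly as large as the payload instead of a few bytes.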