0.3.0

@colin2328 colin2328 released this 30 Jan 22:27

Monarch 0.3.0 Release Notes

New Features

Kubernetes Job Support

Monarch now supports running distributed training workloads on Kubernetes clusters. The new KubernetesJob API connects to pre-provisioned GPU pods managed by the https://github.com/meta-pytorch/monarch-kubernetes/ repository, enabling multi-node DDP training on Kubernetes.

Key Capabilities:

  • Connect to Kubernetes pods using KubernetesJob
  • Provision GPU workers via the MonarchMesh Custom Resource Definition
  • Run multi-node DDP training using SPMDActor

Example:

  from monarch.job.kubernetes import KubernetesJob
  from monarch.spmd import SPMDActor

  k8s_job = KubernetesJob(namespace="monarch-tests")
  k8s_job.add_mesh("ddpmesh", num_replicas=2)

  job_state = k8s_job.state()
  proc_mesh = job_state.ddpmesh.spawn_procs({"gpus": 4})
  spmd_actors = proc_mesh.spawn("_SPMDActor", SPMDActor)

See the full tutorial: https://meta-pytorch.org/monarch/generated/examples/ddp/kubernetes_ddp.html

We also publish Docker images; see https://github.com/meta-pytorch/monarch/pkgs/container/monarch


monarch.spmd and monarch.job.spmd SPMDJob

The new monarch.job.spmd module provides serve() and run_spmd() for an interactive SPMD development workflow:

  • Reserve once, iterate many times: Allocate hosts once, then call run_spmd() repeatedly without reprovisioning
  • Remote debugging: Add breakpoint() in your training script and attach with monarch debug
  • Job caching: Reload cached job state and re-run on the same reserved hosts

Example:

  from monarch.job.spmd import serve

  job = serve(
      ["torchrun", "--nproc-per-node=4", "--standalone", "train.py"],
      scheduler="local_cwd",
  )
  job.run_spmd()

  # Later, reload and re-run without reprovisioning
  # (assumes job_load is importable from monarch.job):
  from monarch.job import job_load

  job = job_load(".monarch/job_state.pkl")
  job.run_spmd()

This supports single-node training with command lists and multi-node training with TorchX AppDef on schedulers like Slurm.

See the example: https://meta-pytorch.org/monarch/generated/examples/ddp/spmd_job.html


Experimental Queue Dispatch Mode (Performance)

This release adds an experimental actor dispatch mode in which Rust enqueues messages on a channel for Python to process, instead of Rust acquiring the GIL directly for each message. This can improve throughput for message-heavy workloads. Enable it through configuration:

  from monarch.config import configure

  configure(actor_queue_dispatch=True)
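
The general idea behind queue-based dispatch can be sketched in plain Python. This is only a toy model and not Monarch's implementation: the producer thread stands in for the Rust side, the consumer thread for the Python side, and run_queue_dispatch and its handler (upper-casing each message) are hypothetical names chosen for illustration.

```python
import queue
import threading


def run_queue_dispatch(messages):
    """Toy model: a producer enqueues messages; a single consumer handles them."""
    inbox = queue.Queue()
    handled = []
    stop = object()  # sentinel marking the end of the message stream

    def producer():
        # Stands in for the Rust side: it only enqueues and never runs handler code.
        for msg in messages:
            inbox.put(msg)
        inbox.put(stop)

    def consumer():
        # Stands in for the Python side: drains the queue and runs handlers in order.
        while True:
            msg = inbox.get()
            if msg is stop:
                break
            handled.append(msg.upper())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


result = run_queue_dispatch(["ping", "pong"])
```

In this model the producer never executes handler code, so it does not need to contend for the interpreter on every message; the consumer processes messages in the order they were enqueued.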

Real this_proc() for Local Spawning

The this_proc() function returns a handle to the current singleton process, enabling actors to spawn other actors locally. Remote actors can use this_proc() to spawn actors on their own host. This enables patterns such as handing out references to a local proc and having remote actors spawn resources on it.

  from monarch.actor import Actor, endpoint, this_proc

  # HelperActor is assumed to be an Actor subclass defined elsewhere.
  class ManagerActor(Actor):
      @endpoint
      def spawn_helper(self) -> HelperActor:
          # Spawns HelperActor in the same process as ManagerActor
          return this_proc().spawn("helper", HelperActor)

Zero-Copy Messaging Path from Python

A new Buffer class enables zero-copy message serialization from Python. Large writes (≥256 bytes) are stored as references to Python bytes objects rather than being copied, integrating with multipart serialization for efficient vectored I/O.

  from monarch._rust_bindings.monarch_hyperactor.buffers import Buffer
  from monarch.config import configure

  buffer = Buffer()
  buffer.write(b"small")       # copied into pending buffer
  buffer.write(b"x" * 1000)    # stored as zero-copy reference

  # Configure the threshold via:
  configure(small_write_threshold=256)  # default
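
The copy-versus-reference decision described above can be sketched in pure Python. This is a toy illustration under stated assumptions, not the Rust-backed Buffer: SketchBuffer, its parts() method, and its internals are hypothetical, and only the 256-byte default threshold comes from these notes.

```python
class SketchBuffer:
    """Toy buffer: copies small writes, keeps references to large ones."""

    def __init__(self, small_write_threshold=256):
        self.small_write_threshold = small_write_threshold
        self._pending = bytearray()  # small writes are coalesced here
        self._parts = []             # ordered fragments for vectored I/O

    def write(self, data: bytes) -> int:
        if len(data) < self.small_write_threshold:
            self._pending += data     # copy: cheap for small payloads
        else:
            self._flush_pending()     # preserve write order
            self._parts.append(data)  # no copy: hold a reference to the bytes object
        return len(data)

    def _flush_pending(self):
        if self._pending:
            self._parts.append(bytes(self._pending))
            self._pending = bytearray()

    def parts(self):
        """Return all fragments in write order."""
        self._flush_pending()
        return list(self._parts)


buf = SketchBuffer()
big = b"x" * 1000
buf.write(b"small")  # copied into the pending buffer
buf.write(b"tiny")   # coalesced with the previous small write
buf.write(big)       # stored by reference
fragments = buf.parts()
```

Here the two small writes end up in one coalesced fragment, while the large payload appears in the fragment list as the very same bytes object that was passed in. A list of fragments like this is the shape that vectored writes such as socket.sendmsg accept.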

Principles of Ownership in Supervision

This release improves the supervision model for error handling in meshes, built on four core principles:

  1. Owned meshes: Creating new meshes always results in an owned mesh
  2. Single ownership: Every mesh is owned by at most one actor (no transfer or suspension)
  3. Lifecycle binding: A mesh cannot outlive its owner—when the owner dies, so does the mesh
  4. Graceful cleanup: Stopped meshes drain pending messages before cleanup; owned meshes clean up before their owner

Actors can now implement __supervise__ to handle failures from owned meshes.

Example:

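  import logging

  from monarch.actor import Actor, MeshFailure  # MeshFailure import path is an assumption

  class ManagerActor(Actor):
      def __supervise__(self, failure: MeshFailure) -> bool:
          logging.error(f"failure encountered: {failure}")
          # Return truthy to handle the failure, falsey to propagate it
          return False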

See the documentation: https://meta-pytorch.org/monarch/actors.html#error-handling-in-meshes
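
As a rough illustration of the four principles, the toy model below tracks ownership and cleanup order in plain Python. It does not use Monarch: ToyMesh, ToyOwner, and report_failure are hypothetical names, and propagating an unhandled failure to the owner's parent is an assumption about what "propagate" means here.

```python
class ToyMesh:
    def __init__(self, name, owner):
        self.name = name
        self.owner = owner  # principle 2: exactly one owner, never reassigned
        self.pending = []
        self.stopped = False

    def stop(self, log):
        # Principle 4: drain pending messages before cleanup.
        while self.pending:
            log.append(f"{self.name} drained {self.pending.pop(0)}")
        self.stopped = True
        log.append(f"{self.name} stopped")


class ToyOwner:
    def __init__(self, name, parent=None, handles_failures=False):
        self.name = name
        self.parent = parent
        self.handles_failures = handles_failures
        self.meshes = []

    def create_mesh(self, name):
        # Principle 1: creating a mesh always yields a mesh owned by the creator.
        mesh = ToyMesh(name, owner=self)
        self.meshes.append(mesh)
        return mesh

    def supervise(self, failure) -> bool:
        # Truthy means handled; falsey means propagate.
        return self.handles_failures

    def report_failure(self, failure, log):
        # Assumption: unhandled failures travel up the ownership chain.
        owner = self
        while owner is not None:
            if owner.supervise(failure):
                log.append(f"{owner.name} handled {failure}")
                return owner
            owner = owner.parent
        log.append(f"unhandled {failure}")
        return None

    def stop(self, log):
        # Principles 3 and 4: owned meshes are cleaned up before their owner.
        for mesh in self.meshes:
            mesh.stop(log)
        log.append(f"{self.name} stopped")


log = []
root = ToyOwner("root", handles_failures=True)
manager = ToyOwner("manager", parent=root)
workers = manager.create_mesh("workers")
workers.pending.append("msg-1")
handler = manager.report_failure("worker crash", log)
manager.stop(log)
```

In this run the manager declines the failure, so the root owner handles it; stopping the manager then drains and stops its mesh before the manager itself is marked stopped.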


SkyPilot Integration (Community Contribution)

SkyPilotJob enables running Monarch on Kubernetes and cloud VMs across 20+ cloud providers (AWS, GCP, Azure, CoreWeave, Nebius, etc.) via https://skypilot.readthedocs.io/.

  import sky
  from monarch_skypilot import SkyPilotJob

  job = SkyPilotJob(
      meshes={"trainers": 2},
      resources=sky.Resources(accelerators="A100:1"),
      cluster_name="my-monarch-cluster",
  )
  state = job.state()
  trainers = state.trainers  # HostMesh with 2 nodes

Features:

  • Automatic cluster provisioning and teardown
  • Autostop for idle clusters
  • Workdir sync and custom file mounts
  • Default PyPI install or custom Docker images

Install with:

pip install torchmonarch-nightly "skypilot[kubernetes]"


Getting Started

Install Monarch 0.3.0:

pip install torchmonarch==0.3.0