New features & API changes
Python: actor identifiers renamed to ActorAddr. ActorId is now ActorAddr across the Python bindings (#3618, #3622). The old pid: int constructor argument is gone — ActorAddr carries a string uid (with pid retained as a compatibility alias) and new label / proc_label properties. ActorAddr.from_string now expects the actor.proc@location wire format. Mailbox.post, PythonActorHandle.bind, ActorSupervisionEvent.actor_id, UndeliverableMessageEnvelope.sender, Instance.actor_id, and the ClientActor / Error / Failure stubs are all updated. ActorMeshProtocol no longer exposes region or get(rank).
Kubernetes operator integration. KubernetesJob.add_mesh now takes pod_template: V1PodTemplateSpec instead of pod_spec: V1PodSpec, and accepts a new annotations= kwarg (#3872, #3949). Following meta-pytorch/monarch-kubernetes#49, KubernetesJob on monarch v0.5.0+ requires v0.2.0+ of the monarch operator.
Per-rank bootstrap. HostMesh.spawn_procs(bootstrap_command=…) accepts either a uniform BootstrapCommand or a Callable[[Point], BootstrapCommand] for per-rank customization (e.g. per-GPU CUDA_VISIBLE_DEVICES) (#3463). New helpers default_bootstrap_cmd() and BootstrapCommand.with_env(env).
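The uniform-or-callable shape described above can be sketched as follows. This is a minimal illustration, not monarch's implementation: BootstrapCommand and Point here are local stand-ins so the block runs without monarch installed, the program string is a placeholder, the "gpus" dimension name is hypothetical, and it is an assumption that with_env returns a copy with the extra variables merged in.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, Mapping, Union

# Hypothetical stand-in for a mesh coordinate, e.g. {"hosts": 0, "gpus": 3}.
Point = Mapping[str, int]


@dataclass(frozen=True)
class BootstrapCommand:
    """Stand-in for the real class; only the fields this sketch needs."""

    program: str
    env: Dict[str, str] = field(default_factory=dict)

    def with_env(self, env: Mapping[str, str]) -> "BootstrapCommand":
        # Assumption: returns a new command with the variables merged in,
        # leaving the original untouched.
        return replace(self, env={**self.env, **env})


BootstrapArg = Union[BootstrapCommand, Callable[[Point], BootstrapCommand]]

BASE = BootstrapCommand(program="python -m worker")  # placeholder command


def per_gpu_bootstrap(point: Point) -> BootstrapCommand:
    # Pin each rank to the GPU whose index matches its "gpus" coordinate.
    return BASE.with_env({"CUDA_VISIBLE_DEVICES": str(point["gpus"])})


def resolve(arg: BootstrapArg, point: Point) -> BootstrapCommand:
    # A uniform command is used as-is; a callable is evaluated once per rank.
    return arg(point) if callable(arg) else arg
```

With the real API, the callable (here per_gpu_bootstrap) would be what gets passed as bootstrap_command= to HostMesh.spawn_procs; resolve only illustrates how the two accepted forms might be told apart.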
SPMD entry point. New host_mesh_from_store(...) stands up a HostMesh from a torchrun/torchx-style entry point without going through the Job API (#3559).
Telemetry helpers. monarch.actor.span(name) and @traced decorator replace ad-hoc OTEL TRACER.start_as_current_span(...) blocks; spans auto-bind to the current actor (#3665, #3774). PySpan is now a context manager.
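To illustrate the context-manager-plus-decorator pattern these helpers follow, here is a self-contained sketch. The span and traced defined below are local stand-ins that record events in a list; they are not imports from monarch.actor, they do not bind to an actor, and the bare @traced form (no arguments) is an assumption.

```python
import functools
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple

# Stand-in sink: records (event, name) pairs instead of exporting to OTEL.
EVENTS: List[Tuple[str, str]] = []


@contextmanager
def span(name: str) -> Iterator[None]:
    EVENTS.append(("start", name))
    try:
        yield
    finally:
        # Close the span even if the body raises.
        EVENTS.append(("end", name))


def traced(fn: Callable) -> Callable:
    # Wrap the whole call in a span named after the function.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with span(fn.__name__):
            return fn(*args, **kwargs)

    return wrapper


@traced
def load_batch(n: int) -> List[int]:
    # A nested span inside a traced function.
    with span("decode"):
        return list(range(n))
```

Calling load_batch(3) records the outer span around the inner one, which is the nesting an explicit start_as_current_span block would otherwise have to spell out by hand.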
Tensor engine & multiprocessing. Tensor engine builds on CPU and macOS via a split tensor_engine_gpu Cargo feature; the env var MONARCH_RDMA_GPU_PLATFORM was renamed to MONARCH_GPU_PLATFORM (#3530). RDMA Python bindings now degrade gracefully when native libs are absent. Linux default multiprocessing start method flipped from spawn to forkserver (#3529). async def __supervise__ is now supported (#3526).
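If your own code depends on spawn semantics, one option is to request that start method explicitly rather than rely on a default. The sketch below uses only the standard library; whether the default change above affects multiprocessing calls in user code is an assumption, and pinned_context is a hypothetical helper name.

```python
import multiprocessing as mp


def pinned_context(method: str = "spawn"):
    # get_context returns an object exposing the multiprocessing API bound to
    # one start method; it does not change the process-wide default.
    return mp.get_context(method)


# Create Process, Pool, Queue, etc. from this context instead of the module.
ctx = pinned_context("spawn")
```

Objects created through ctx (for example ctx.Process or ctx.Pool) use spawn regardless of what the surrounding process has configured.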
config.configure keys. Added rdma_disable_ibverbs, rdma_allow_tcp_fallback, rdma_max_chunk_size_mb. Removed remote_allocator_heartbeat_interval. New parametrize_config_pointwise test helper.
Removals & deprecations.
- The legacy allocator stack is gone: monarch._src.actor.allocator, LocalAllocator, ProcessAllocator, HostMesh.allocate_nonblocking / _allocate_nonblocking, and the process_allocator binary (#3567–#3586). Use HostMesh + attach_to_workers or a JobTrait class.
- monarch._src.actor.namespace and the namespace API removed (#3116).
- Future.get() called from inside an active asyncio or tokio thread now emits a DeprecationWarning and becomes a RuntimeError in v0.6 (#3827).
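The Future.get() deprecation above concerns blocking calls made while an event loop is running on the same thread. The sketch below shows one way such a check can work for the asyncio case only. BlockingFuture is a hypothetical stand-in, not monarch's Future, and the detection via asyncio.get_running_loop() is an assumption about the mechanism rather than a description of it.

```python
import asyncio
import warnings
from typing import Any, Callable


class BlockingFuture:
    """Stand-in future whose get() warns when called under a running loop."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def get(self) -> Any:
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            pass  # no loop running on this thread: blocking is harmless
        else:
            warnings.warn(
                "Future.get() called from a running event loop",
                DeprecationWarning,
                stacklevel=2,
            )
        return self._value


async def handler() -> Any:
    # Blocking call made from inside a coroutine: this is the flagged case.
    return BlockingFuture(42).get()


def count_get_warnings(fn: Callable[[], Any]) -> int:
    # Run fn and count the DeprecationWarnings raised by get().
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        fn()
    return sum(
        1
        for w in caught
        if issubclass(w.category, DeprecationWarning)
        and str(w.message).startswith("Future.get()")
    )
```

Running a test suite with DeprecationWarning promoted to an error (for example via warnings.simplefilter("error", DeprecationWarning)) is a common way to find such call sites before the warning becomes a hard failure.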
Examples & docs. New Kubernetes GRPO tutorial (Qwen3.5-0.8B on GSM8K) (#3597), Oracle OKE example (#3671), GRPO via cooperative multitasking (#3525).
Rust internals (not Python-visible). Endpoint sends are now infallible and renamed send → post, with failures flowing through a new Undeliverable<M> enum (#3890–#3894, #3912). A new Gateway layer owns per-proc reachability and serving (#3818–#3823); Proc::local → Proc::isolated. Identity constructors collapsed into anonymous() / instance(label) / singleton(name) (#3935, #3940). hyperactor::reference deleted and hyperactor::host moved to hyperactor_mesh::host (#3641, #3724). New hyperactor_remote crate adds keepalive links, supervisors, and rendezvous tokens (#3762–#3768).
Bug Fixes
- Ctrl-C no longer hangs the runtime (#3801); flaky PyShared.__await__ borrow race (#3862); two RwLock/DashMap deadlocks in actor teardown (#3754); re-entrant TraceEventDispatcher SIGSEGV in real training runs (#3690); Mailbox::post_unchecked shard deadlock (#3684); host shutdown race (#3663).
- Bootstrap falls back when XDG_RUNTIME_DIR doesn't exist (#3418); long-path SUN_LEN unix-socket panic (#3697); HostMesh label sanitization (#3691); controller GetState no longer triggers an undeliverable bounce (#3450); RDMA find_cuda_segment boundary (#3769).
Performance & Reliability
- Native V1 casting and the destination-actor reorder buffer are now on by default (#3812), with a point-to-point optimization for small casts (#3646).
- RDMA completion polling is now adaptive: the default flipped from a fixed 1 ms sleep to yield-only, gated by MONARCH_RDMA_CQ_BUSY_POLL_WINDOW (#3771). resolve_ibv made synchronous, removing a per-read round-trip (#3773). TLS code-transfer replaced with RDMABuffer leader fan-out (#3390). Arc-refcounted PDs/MRs close a latent PD double-free (#3883); KeepaliveLocalMemory is the sole local-memory handle, with explicit unsafe accessors (#3922).
- Channel correctness: host flushes acks before exit (#3637); duplex sessions made structurally concurrent (#3675); experimental multi-stream sender exp_dial_unordered (#3557, #3558).
- ProcMeshController reaps procs orphaned by a dead client via MESH_ORPHAN_TIMEOUT (#3811); periodic RSS recording for managed processes (#3733).
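For readers unfamiliar with the term, a reorder buffer holds messages that arrive out of sequence and releases them in order. The sketch below shows the generic technique only. It makes no claim about monarch's data structures or sequencing scheme, and the class name, the zero-based sequence numbers, and the drop-duplicates policy are all assumptions.

```python
from typing import Dict, Generic, List, TypeVar

M = TypeVar("M")


class ReorderBuffer(Generic[M]):
    """Deliver messages in sequence order, buffering early arrivals."""

    def __init__(self) -> None:
        self._next_seq = 0  # next sequence number to deliver
        self._pending: Dict[int, M] = {}  # early arrivals keyed by sequence number

    def push(self, seq: int, msg: M) -> List[M]:
        """Accept one message; return every message now deliverable, in order."""
        if seq < self._next_seq or seq in self._pending:
            return []  # already delivered or already buffered: drop the duplicate
        self._pending[seq] = msg
        ready: List[M] = []
        # Release the longest consecutive run starting at the expected number.
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

A message that arrives early is held until the gap before it is filled, at which point the whole consecutive run is delivered at once.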
Build & Release
- macOS wheels ship with the stable PyPI release (#3854) and the nightly matrix (#3451); the initial publish pipeline landed in #3412, with follow-up fixes for missing fields (#3344), no-torch (#3371), the crash-recovery plugin (#3831), and general build breakage (#3786).
- ROCm GPU CI via a matrix-based workflow (#3190); ROCm excluded from PR runs (#3861).
- PyTorch bumped 2.11.0 → 2.12.0 for stable; nightly tracks 2.13.0 (#3863).
- publish_release Docker base aligned to CUDA 12.6 for torch 2.12.0 (#3921); nightly Docker images repaired after upstream cuda12.8 removal (#3880).
- PyPI wheels now carry classifiers and project URLs (#3379); docs deploy targets stable (#3415). New GHA workflow marks stale PRs and deletes branches of closed/non-merged PRs (#3778); test-result XML uploaded as artifacts (#3670); global 5-minute cargo-nextest timeout (#3855).