0.5.0

@dulinriley dulinriley released this 19 May 18:11

New features & API changes

Python: actor identifiers renamed to ActorAddr. ActorId is now ActorAddr across the Python bindings (#3618, #3622). The old pid: int constructor argument is gone — ActorAddr carries a string uid (with pid retained as a compatibility alias) and new label / proc_label properties. ActorAddr.from_string now expects the actor.proc@location wire format. Mailbox.post, PythonActorHandle.bind, ActorSupervisionEvent.actor_id, UndeliverableMessageEnvelope.sender, Instance.actor_id, and the ClientActor / Error / Failure stubs are all updated. ActorMeshProtocol no longer exposes region or get(rank).
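To make the actor.proc@location shape concrete, here is a minimal sketch that splits such a string into three parts. It is not monarch's parser: ParsedAddr and split_addr are hypothetical names, and the splitting rules (first "@" ends the left-hand side, first "." separates actor from proc) are assumptions drawn only from the format named above. The real ActorAddr.from_string may validate uids, labels, and locations differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedAddr:
    """Hypothetical stand-in for illustration; not monarch's ActorAddr."""

    actor: str
    proc: str
    location: str


def split_addr(text: str) -> ParsedAddr:
    """Split an 'actor.proc@location' string into its parts.

    Assumed rules: everything after the first '@' is the location, and the
    first '.' before it separates the actor part from the proc part.
    """
    left, at, location = text.partition("@")
    if not at or not location:
        raise ValueError(f"expected '@location' in {text!r}")
    actor, dot, proc = left.partition(".")
    if not dot or not actor or not proc:
        raise ValueError(f"expected 'actor.proc' before '@' in {text!r}")
    return ParsedAddr(actor=actor, proc=proc, location=location)
```

Code that previously built identifiers from an integer pid and then called from_string on a hand-assembled string is the most likely place to need a change.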

Kubernetes operator integration. KubernetesJob.add_mesh now takes pod_template: V1PodTemplateSpec instead of pod_spec: V1PodSpec, and accepts a new annotations= kwarg (#3872, #3949). Following meta-pytorch/monarch-kubernetes#49, KubernetesJob in monarch v0.5.0+ requires v0.2.0 or later of the monarch operator.
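For callers migrating from pod_spec to pod_template, the structural change is that a pod template wraps the pod spec together with pod metadata. The sketch below shows that wrapping with plain dicts so it runs without the kubernetes client installed. to_pod_template is a hypothetical helper, and the container name and image are placeholders. With the real client, the equivalent object is V1PodTemplateSpec(metadata=..., spec=...).

```python
from typing import Any, Dict, Optional


def to_pod_template(
    pod_spec: Dict[str, Any],
    labels: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Wrap a pod spec in the metadata + spec shape of a pod template.

    Plain dicts stand in for the kubernetes client models here; this helper
    is illustrative and is not part of monarch.
    """
    metadata: Dict[str, Any] = {"labels": dict(labels)} if labels else {}
    return {"metadata": metadata, "spec": pod_spec}


# Placeholder spec: the container name and image are made up for the example.
template = to_pod_template(
    {"containers": [{"name": "worker", "image": "example/image:tag"}]},
    labels={"app": "demo"},
)
```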

Per-rank bootstrap. HostMesh.spawn_procs(bootstrap_command=…) accepts either a uniform BootstrapCommand or a Callable[[Point], BootstrapCommand] for per-rank customization (e.g. per-GPU CUDA_VISIBLE_DEVICES) (#3463). New helpers default_bootstrap_cmd() and BootstrapCommand.with_env(env).
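The per-rank form can be pictured as a function from a point in the mesh to a command. The sketch below uses stand-in types so it runs without monarch: this BootstrapCommand dataclass, the Point alias, the "gpus" dimension name, and the merge behavior of with_env are all assumptions for illustration, not the library's definitions.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, Mapping

# Hypothetical stand-in: a point is modeled as dimension name -> index.
Point = Mapping[str, int]


@dataclass(frozen=True)
class BootstrapCommand:
    """Hypothetical stand-in; the real class has its own fields."""

    program: str
    env: Dict[str, str] = field(default_factory=dict)

    def with_env(self, env: Mapping[str, str]) -> "BootstrapCommand":
        # Assumed semantics: return a copy with the extra variables merged in.
        return replace(self, env={**self.env, **env})


def per_gpu_bootstrap(base: BootstrapCommand) -> Callable[[Point], BootstrapCommand]:
    """Build a per-rank factory that pins each rank to one GPU index."""

    def make(point: Point) -> BootstrapCommand:
        # "gpus" is an assumed dimension name for this example.
        return base.with_env({"CUDA_VISIBLE_DEVICES": str(point["gpus"])})

    return make
```

A callable shaped like the one returned by per_gpu_bootstrap is what the bootstrap_command= argument would receive in the per-rank case. A single BootstrapCommand covers the uniform case.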

SPMD entry point. New host_mesh_from_store(...) stands up a HostMesh from a torchrun/torchx-style entry point without going through the Job API (#3559).

Telemetry helpers. monarch.actor.span(name) and @traced decorator replace ad-hoc OTEL TRACER.start_as_current_span(...) blocks; spans auto-bind to the current actor (#3665, #3774). PySpan is now a context manager.
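The usage shape of the two helpers is roughly "wrap a block" versus "wrap a function". The sketch below uses local stand-ins named span and traced so it runs without monarch or OpenTelemetry. It only records enter and exit events in a list, and it assumes @traced is applied bare and names the span after the function. Neither detail is confirmed by these notes.

```python
import functools
from contextlib import contextmanager
from typing import Any, Callable, Iterator, List

EVENTS: List[str] = []  # records span enter/exit so the example is observable


@contextmanager
def span(name: str) -> Iterator[None]:
    """Stand-in for a span context manager; not monarch.actor.span."""
    EVENTS.append(f"enter {name}")
    try:
        yield
    finally:
        EVENTS.append(f"exit {name}")


def traced(func: Callable[..., Any]) -> Callable[..., Any]:
    """Stand-in decorator: run the whole call inside a span (assumed naming)."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        with span(func.__name__):
            return func(*args, **kwargs)

    return wrapper


@traced
def load_batch(n: int) -> List[int]:
    # A nested span inside a traced function.
    with span("decode"):
        return list(range(n))
```

In application code, the stand-ins would be replaced by the library's helpers, and the nesting of spans would follow the same call structure.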

Tensor engine & multiprocessing. Tensor engine builds on CPU and macOS via a split tensor_engine_gpu Cargo feature; the env var MONARCH_RDMA_GPU_PLATFORM was renamed to MONARCH_GPU_PLATFORM (#3530). RDMA Python bindings now degrade gracefully when native libs are absent. Linux default multiprocessing start method flipped from spawn to forkserver (#3529). async def __supervise__ is now supported (#3526).
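Launch scripts that still export the old variable name may want a small compatibility shim during the transition. The following is a sketch using only the standard library. migrate_gpu_platform_var is a hypothetical helper, not something monarch provides, and these notes do not say whether the old name is still read.

```python
import os
from typing import MutableMapping

OLD_NAME = "MONARCH_RDMA_GPU_PLATFORM"
NEW_NAME = "MONARCH_GPU_PLATFORM"


def migrate_gpu_platform_var(env: MutableMapping[str, str]) -> bool:
    """Copy the old variable to the new name when only the old one is set.

    Returns True if a value was copied. An explicitly set new name always wins.
    """
    if OLD_NAME in env and NEW_NAME not in env:
        env[NEW_NAME] = env[OLD_NAME]
        return True
    return False


# Typical use, early in a launcher and before the library reads its settings:
# migrate_gpu_platform_var(os.environ)
```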

config.configure keys. Added rdma_disable_ibverbs, rdma_allow_tcp_fallback, rdma_max_chunk_size_mb. Removed remote_allocator_heartbeat_interval. New parametrize_config_pointwise test helper.

Removals & deprecations.

  • The legacy allocator stack is gone: monarch._src.actor.allocator, LocalAllocator, ProcessAllocator, HostMesh.allocate_nonblocking / _allocate_nonblocking, the process_allocator binary (#3567–#3586). Use HostMesh + attach_to_workers or a JobTrait class.
  • monarch._src.actor.namespace and the namespace API removed (#3116).
  • Future.get() called from inside an active asyncio or tokio thread now emits a DeprecationWarning and will become a RuntimeError in v0.6 (#3827).
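The Future.get() deprecation is about blocking inside a thread that is already driving an event loop. The sketch below reproduces the asyncio half of that check with a stand-in class. BlockingFuture is hypothetical, it resolves immediately, and it does not model the tokio case or monarch's actual implementation.

```python
import asyncio
import warnings
from typing import Any


class BlockingFuture:
    """Hypothetical stand-in for a future offering both get() and await."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def get(self) -> Any:
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            pass  # no event loop is running in this thread; blocking is fine
        else:
            warnings.warn(
                "get() called inside a running event loop; use 'await' instead",
                DeprecationWarning,
                stacklevel=2,
            )
        return self._value

    def __await__(self):
        async def _resolve() -> Any:
            return self._value

        return _resolve().__await__()


async def handler() -> Any:
    # Preferred form inside async code: await rather than block.
    return await BlockingFuture("result")
```

Running the test suite with warnings promoted to errors (for example, python -W error::DeprecationWarning) is one way to find such call sites before they start raising.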

Examples & docs. New Kubernetes GRPO tutorial (Qwen3.5-0.8B on GSM8K) (#3597), Oracle OKE example (#3671), GRPO via cooperative multitasking (#3525).

Rust internals (not Python-visible). Endpoint sends are now infallible and renamed send → post, with failures flowing through a new Undeliverable<M> enum (#3890–#3894, #3912). A new Gateway layer owns per-proc reachability and serving (#3818–#3823); Proc::local → Proc::isolated. Identity constructors collapsed into anonymous() / instance(label) / singleton(name) (#3935, #3940). hyperactor::reference deleted and hyperactor::host moved to hyperactor_mesh::host (#3641, #3724). New hyperactor_remote crate adds keepalive links, supervisors, and rendezvous tokens (#3762–#3768).

Bug Fixes

  • Ctrl-C no longer hangs the runtime (#3801); flaky PyShared.__await__ borrow race (#3862); two RwLock/DashMap deadlocks in actor teardown (#3754); re-entrant TraceEventDispatcher SIGSEGV in real training runs (#3690); Mailbox::post_unchecked shard deadlock (#3684); host shutdown race (#3663).
  • Bootstrap falls back when XDG_RUNTIME_DIR doesn't exist (#3418); long-path SUN_LEN unix-socket panic (#3697); HostMesh label sanitization (#3691); controller GetState no longer triggers an undeliverable bounce (#3450); RDMA find_cuda_segment boundary (#3769).

Performance & Reliability

  • Native V1 casting and the destination-actor reorder buffer are now on by default (#3812), with a point-to-point optimization for small casts (#3646).
  • RDMA completion polling is now adaptive — default flipped from a fixed 1 ms sleep to yield-only, gated by MONARCH_RDMA_CQ_BUSY_POLL_WINDOW (#3771). resolve_ibv made synchronous, removing a per-read round-trip (#3773). TLS code-transfer replaced with RDMABuffer leader fan-out (#3390). Arc-refcounted PDs/MRs close a latent PD double-free (#3883); KeepaliveLocalMemory is the sole local-memory handle, with explicit unsafe accessors (#3922).
  • Channel correctness: host flushes acks before exit (#3637); duplex sessions made structurally concurrent (#3675); experimental multi-stream sender exp_dial_unordered (#3557, #3558).
  • ProcMeshController reaps procs orphaned by a dead client via MESH_ORPHAN_TIMEOUT (#3811); periodic RSS recording for managed processes (#3733).

Build & Release

  • macOS wheels ship with the stable PyPI release (#3854) and the nightly matrix (#3451); the initial publish pipeline landed in #3412, with follow-up fixes for missing fields (#3344), no-torch (#3371), the crash-recovery plugin (#3831), and general build breakage (#3786).
  • ROCm GPU CI via a matrix-based workflow (#3190); ROCm excluded from PR runs (#3861).
  • PyTorch bumped 2.11.0 → 2.12.0 for stable; nightly tracks 2.13.0 (#3863). publish_release Docker base aligned to CUDA 12.6 for torch 2.12.0 (#3921); nightly Docker images repaired after upstream cuda12.8 removal (#3880).
  • PyPI wheels now carry classifiers and project URLs (#3379); docs deploy targets stable (#3415). New GHA workflow marks stale PRs and deletes branches of closed/non-merged PRs (#3778); test-result XML uploaded as artifacts (#3670); global 5-minute cargo-nextest timeout (#3855).