
HostMesh integration #144

@allenwang28

Description


Context

HostMesh in Monarch represents a "Host" resource, providing a utility to spawn proc_meshes and actors on those hosts. This gives users fine-grained control over where actors are placed, e.g. for colocation.

Forge is currently proc_mesh-centric - services are created without any host mechanism applied:

s = await spawn_service(ServiceConfig(replicas=2, procs_per_replica=2, with_gpus=True), MyActor, ...)

This issue tracks the work needed to support HostMesh in Forge services:

s = await spawn_service(
    ServiceConfig(
        replicas=2,
        hosts_per_replica=2,
        procs_per_host=8,
        with_gpus=True,
    ),
    ...,
)

In this example, each replica spans 2 hosts with 8 procs per host, i.e. 16 procs per replica and 32 procs across 4 hosts in total. See EX545081 for a high-level diagram of how services work.

Technical Requirements

Resource Management/Scheduling and Service Decoupling

HostMeshes are allocated via torchx, which acts as the abstraction layer between application code (i.e. Forge) and the underlying scheduler (e.g. Kubernetes, SLURM, etc.).
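
For context, torchx expresses such an allocation as an AppDef containing one or more Roles. The sketch below is illustrative only: the image, entrypoint, and resource sizes are placeholders, and the exact field names should be checked against the torchx version in use.

from torchx import specs

# Placeholder values throughout - this only illustrates the shape of the
# request that torchx hands to the underlying scheduler (SLURM, Kubernetes, ...).
host_app = specs.AppDef(
    name="forge_hosts",
    roles=[
        specs.Role(
            name="host",
            image="forge:latest",            # placeholder image
            entrypoint="python",
            args=["-m", "forge.host_main"],  # placeholder module
            num_replicas=4,                  # number of hosts requested
            resource=specs.Resource(cpu=96, gpu=8, memMB=1_000_000),
        )
    ],
)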

Forge is currently configured so that the underlying proc_mesh talks directly to the scheduler, but this might not make sense as there likely needs to be some sort of global coordination of resources.

We should be mindful of complexity - this doesn't need to be as comprehensive or general-purpose as Ray (e.g. I hope we don't end up writing our own bin-packing algorithm), and we should feel free to change our service definition to provide the right level of control.

Replica Colocation

In my mind, replica colocation covers two pieces of functionality:

  • GPU-aware scheduling: assuming I have 2 hosts in my cluster and I want to spawn 4 inference servers with 4 GPUs each, I expect my service to use those 2 hosts fully.
  • ...but with backdoors: one common pattern in RL systems like veRL is "colocating" trainer and inference servers. The workload can end up bounded by the inference servers, so the trainers are put to "sleep", freeing their GPU memory so inference actors can take over. See this blog post for more details. We should be able to re-use torch_memory_saver for this; in fact, this should just work with Monarch already. A rough sketch of the sleep pattern follows this list.
    • This "trainer/generator colocation" is likely its own project, but the point is that two services should be able to share resources in some way.

Support for Autoscaling

Services currently do not support autoscaling, but can and should be extended to do so. An API could look something like the following:

ServiceConfig(min_replicas=1, max_replicas=4, num_replicas=2, **other_args)
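
As an illustration only (none of these names exist in Forge today), the autoscaling decision itself can stay very simple, e.g. scaling on the request backlog and clamping to the configured bounds:

from dataclasses import dataclass


@dataclass
class AutoscalePolicy:
    """Hypothetical policy: scale on pending requests per replica."""
    min_replicas: int
    max_replicas: int
    target_queue_per_replica: int = 4


def desired_replicas(policy: AutoscalePolicy, queued_requests: int) -> int:
    # Replica count needed to keep the backlog-per-replica at the target,
    # clamped to the configured [min_replicas, max_replicas] range.
    want = -(-queued_requests // policy.target_queue_per_replica)  # ceil division
    return max(policy.min_replicas, min(policy.max_replicas, want))

For example, desired_replicas(AutoscalePolicy(min_replicas=1, max_replicas=4), queued_requests=10) returns 3.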

Initial Proposal

Some thoughts that can be disregarded:

I would create a global resource manager (RM) that is:

  1. a Monarch controller, and
  2. initialized at the beginning of the workload

This GPU manager shows a simple starting point with this pattern and should be folded into the RM.
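
Independent of Monarch specifics, the core of such a manager is bookkeeping of free GPUs per host. A hypothetical sketch (all names invented here for illustration):

from dataclasses import dataclass, field


@dataclass
class GpuManager:
    """Hypothetical bookkeeping core of the RM: free GPU ids per host."""
    free_gpus: dict[str, set[int]] = field(default_factory=dict)

    def register_host(self, host: str, num_gpus: int) -> None:
        self.free_gpus[host] = set(range(num_gpus))

    def acquire(self, host: str, num_gpus: int) -> list[int]:
        # Hand out GPU ids on a host, or fail if not enough are free.
        available = self.free_gpus[host]
        if len(available) < num_gpus:
            raise RuntimeError(f"{host}: requested {num_gpus}, only {len(available)} free")
        return [available.pop() for _ in range(num_gpus)]

    def release(self, host: str, gpu_ids: list[int]) -> None:
        self.free_gpus[host].update(gpu_ids)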

Services and the RM communicate with each other through resource requests, meaning we need some type of common language:

from dataclasses import dataclass


@dataclass
class ResourceRequest:
    resource_group: str  # e.g. trainer, inference, trainer_inference
    procs: int
    hosts: int
    with_gpus: bool  # RM assigns

The resource manager should be initialized with 1. scheduler information and 2. service information.

@dataclass
class ResourceManager:
    scheduler: str  # e.g. "slurm"
    num_hosts: int  # e.g. 16 total in the cluster
    # min_hosts/max_hosts?
    services: dict[str, ServiceConfig]  # total expected min/max replicas, how resources are shared, etc.

ServiceConfig would need to incorporate some heuristics as well:

@dataclass
class ServiceConfig:
    hosts_per_replica: int
    procs_per_replica: int
    num_replicas: int
    min_replicas: int
    max_replicas: int
    resource_group: str  # e.g. trainer, inference, trainer_inference
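
To make the request/grant flow concrete, here is a hedged sketch of how a service might turn its config into a ResourceRequest and what a grant from the RM could contain. ResourceGrant and request_for_replica are invented names for illustration:

@dataclass
class ResourceGrant:
    """Hypothetical RM response: the hosts (and GPU ids) a replica may use."""
    hosts: list[str]
    gpu_ids_per_host: dict[str, list[int]]


def request_for_replica(cfg: ServiceConfig) -> ResourceRequest:
    # One request per replica, derived directly from the service's config.
    # Whether GPUs are needed would presumably also come from the config;
    # it is hard-coded here only to keep the sketch short.
    return ResourceRequest(
        resource_group=cfg.resource_group,
        procs=cfg.procs_per_replica,
        hosts=cfg.hosts_per_replica,
        with_gpus=True,
    )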

Status and Current Workarounds

(as of 9/11/2025)

This integration work is blocked by:

  1. HostMesh is not fully implemented in Monarch yet (TODO - tag issue/ETA)
  2. Actors created on proc_meshes owned by other Actors cannot communicate with each other (known bug)

Our current workaround for multi-host is to use Monarch to create "a giant proc mesh" and slice at the actor level:

# One proc mesh spanning every host, 8 procs per host.
procs = mast_host_mesh.spawn_procs(per_host={"procs": 8})

# Spawn trainers on the full mesh, then slice them down to the procs they should own.
actors = procs.spawn("test", Trainer)
actors = actors.slice(**actor_slice())

# Generators share the same proc mesh but are sliced onto a different subset.
other_actors = procs.spawn("test", Generators)
other_actors = other_actors.slice(**other_actors_slice())

We will have this in Forge (TODO - tag PR). This puts some of the scaffolding in place, but does not fully address the other requirements listed above. I've tagged pieces in the codebase with this issue so we can grep for where things can and should be changed as we make progress here.

Labels

enhancement (New feature or request)
