
HostMesh integration #144

@allenwang28

Description


Context

HostMesh in Monarch represents a "Host" resource, providing a utility to spawn proc_meshes and actors on those hosts. This gives users fine-grained control over where actors are placed, e.g. for colocation.

Forge is currently proc_mesh-centric - services are created without any host mechanism applied:

s = await spawn_service(ServiceConfig(replicas=2, procs_per_replica=2, with_gpus=True), MyActor, ...)

This issue tracks the work needed to support HostMesh in Forge services:

s = await spawn_service(
    ServiceConfig(
        replicas=2,
        hosts_per_replica=2,
        procs_per_host=8,
        with_gpus=True,
    ),
    ...,
)

In this example, each replica spans 2 hosts with 8 procs per host, i.e. 16 procs per replica and 32 procs across 4 hosts in total. See EX545081 for a high-level diagram of how services work.

Technical Requirements

Resource Management/Scheduling and Service Decoupling

HostMeshes are allocated via torchx, which acts as the abstraction layer between application code (i.e. Forge) and the underlying scheduler (e.g. Kubernetes, SLURM, etc.).
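
For context, torchx expresses such an allocation as an AppDef containing one or more Roles. The sketch below is illustrative only: the image, entrypoint, and resource sizes are placeholders, and the exact field names should be checked against the torchx version in use.

from torchx import specs

# Placeholder values throughout - this only illustrates the shape of the
# request that torchx hands to the underlying scheduler (SLURM, Kubernetes, ...).
host_app = specs.AppDef(
    name="forge_hosts",
    roles=[
        specs.Role(
            name="host",
            image="forge:latest",            # placeholder image
            entrypoint="python",
            args=["-m", "forge.host_main"],  # placeholder module
            num_replicas=4,                  # number of hosts requested
            resource=specs.Resource(cpu=96, gpu=8, memMB=1_000_000),
        )
    ],
)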

Forge is currently configured so that the underlying proc_mesh talks directly to the scheduler, but this might not make sense as there likely needs to be some sort of global coordination of resources.

We should be mindful of complexity - this doesn't need to be as comprehensive or general-purpose as Ray (e.g. I hope we don't end up writing our own bin-packing algorithm), and we should feel free to change our service definition to provide the right level of control.

Replica Colocation

In my mind, replica colocation covers two pieces of functionality:

  • GPU-aware scheduling: assuming I have 2 hosts in my cluster and I want to spawn 4 inference servers with 4 GPUs each, I expect my service to use those 2 hosts fully.
  • ...but with backdoors: one common pattern in RL systems like veRL is "colocating" trainer and inference servers. The workload can end up bounded by the inference servers, so the trainers are put to "sleep", freeing their GPU memory so inference actors can take over. See this blog post for more details. We should be able to re-use torch_memory_saver for this; in fact, this should just work with Monarch already. A rough sketch of the sleep pattern follows this list.
    • This "trainer/generator colocation" is likely its own project, but the point is that two services should be able to share resources in some way.

Support for Autoscaling

Services currently do not support autoscaling, but can and should be extended to do so. An API could look something like the following:

ServiceConfig(min_replicas=1, max_replicas=4, num_replicas=2, **other_args)
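
As an illustration only (none of these names exist in Forge today), the autoscaling decision itself can stay very simple, e.g. scaling on the request backlog and clamping to the configured bounds:

from dataclasses import dataclass


@dataclass
class AutoscalePolicy:
    """Hypothetical policy: scale on pending requests per replica."""
    min_replicas: int
    max_replicas: int
    target_queue_per_replica: int = 4


def desired_replicas(policy: AutoscalePolicy, queued_requests: int) -> int:
    # Replica count needed to keep the backlog-per-replica at the target,
    # clamped to the configured [min_replicas, max_replicas] range.
    want = -(-queued_requests // policy.target_queue_per_replica)  # ceil division
    return max(policy.min_replicas, min(policy.max_replicas, want))

For example, desired_replicas(AutoscalePolicy(min_replicas=1, max_replicas=4), queued_requests=10) returns 3.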

Initial Proposal

Some thoughts that can be disregarded:

I would create a global resource manager (RM) that is:

  1. a Monarch controller, and
  2. initialized at the beginning of the workload

This GPU manager shows a simple starting point with this pattern and should be folded into the RM.
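
Independent of Monarch specifics, the core of such a manager is bookkeeping of free GPUs per host. A hypothetical sketch (all names invented here for illustration):

from dataclasses import dataclass, field


@dataclass
class GpuManager:
    """Hypothetical bookkeeping core of the RM: free GPU ids per host."""
    free_gpus: dict[str, set[int]] = field(default_factory=dict)

    def register_host(self, host: str, num_gpus: int) -> None:
        self.free_gpus[host] = set(range(num_gpus))

    def acquire(self, host: str, num_gpus: int) -> list[int]:
        # Hand out GPU ids on a host, or fail if not enough are free.
        available = self.free_gpus[host]
        if len(available) < num_gpus:
            raise RuntimeError(f"{host}: requested {num_gpus}, only {len(available)} free")
        return [available.pop() for _ in range(num_gpus)]

    def release(self, host: str, gpu_ids: list[int]) -> None:
        self.free_gpus[host].update(gpu_ids)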

Services and the RM communicate with each other through resource requests, meaning we need some type of common language:

from dataclasses import dataclass


@dataclass
class ResourceRequest:
    resource_group: str  # e.g. trainer, inference, trainer_inference
    procs: int
    hosts: int
    with_gpus: bool  # RM assigns

The resource manager should be initialized with 1. scheduler information and 2. service information.

@dataclass
class ResourceManager:
    scheduler: str  # e.g. "slurm"
    num_hosts: int  # e.g. 16 total in the cluster
    # min_hosts/max_hosts?
    services: dict[str, ServiceConfig]  # total expected min/max replicas, how resources are shared, etc.

ServiceConfig would need to incorporate some heuristics as well:

@dataclass
class ServiceConfig:
    hosts_per_replica: int
    procs_per_replica: int
    num_replicas: int
    min_replicas: int
    max_replicas: int
    resource_group: str  # e.g. trainer, inference, trainer_inference
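
To make the request/grant flow concrete, here is a hedged sketch of how a service might turn its config into a ResourceRequest and what a grant from the RM could contain. ResourceGrant and request_for_replica are invented names for illustration:

@dataclass
class ResourceGrant:
    """Hypothetical RM response: the hosts (and GPU ids) a replica may use."""
    hosts: list[str]
    gpu_ids_per_host: dict[str, list[int]]


def request_for_replica(cfg: ServiceConfig) -> ResourceRequest:
    # One request per replica, derived directly from the service's config.
    # Whether GPUs are needed would presumably also come from the config;
    # it is hard-coded here only to keep the sketch short.
    return ResourceRequest(
        resource_group=cfg.resource_group,
        procs=cfg.procs_per_replica,
        hosts=cfg.hosts_per_replica,
        with_gpus=True,
    )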

Status and Current Workarounds

(as of 9/11/2025)

This integration work is blocked by:

  1. HostMesh is not fully implemented in Monarch yet (TODO - tag issue/ETA)
  2. Actors created on proc_meshes owned by other Actors cannot communicate with each other (known bug)

Our current workaround for multi-host is to use Monarch to create "a giant proc mesh" and slice at the actor level:

# One proc mesh spanning every host, 8 procs per host.
procs = mast_host_mesh.spawn_procs(per_host={"procs": 8})

# Spawn trainers on the full mesh, then slice them down to the procs they should own.
actors = procs.spawn("test", Trainer)
actors = actors.slice(**actor_slice())

# Generators share the same proc mesh but are sliced onto a different subset.
other_actors = procs.spawn("test", Generators)
other_actors = other_actors.slice(**other_actors_slice())

We will have this in Forge (TODO - tag PR). This puts some of the scaffolding in place, but does not fully address the other requirements listed above. I've tagged pieces in the codebase with this issue so we can grep for where things can and should be changed as we make progress here.

Labels

enhancement (New feature or request)
