Description
Context
HostMesh in Monarch represents a "Host" resource, providing a utility to spawn proc_mesh and actors on those hosts. This gives users fine-grained control over where actors should be placed, e.g. for colocation.
Forge is currently proc_mesh-centric - services are created without any host mechanism applied:
s = await spawn_service(ServiceConfig(replicas=2, procs_per_replica=2, with_gpus=True), MyActor, ...)
This issue tracks the work needed to support HostMesh in Forge services:
s = await spawn_service(ServiceConfig(
    replicas=2,
    hosts_per_replica=2,
    procs_per_host=8,
    with_gpus=True),
    ...)
See EX545081 for a high-level diagram of how services work.
Technical Requirements
Resource Management/Scheduling and Service Decoupling
HostMeshes are allocated via torchx, which acts as the abstraction layer between application code (i.e. Forge) and the underlying scheduler (e.g. Kubernetes, SLURM).
Forge is currently configured so that the underlying proc_mesh talks directly to the scheduler, but this might not make sense, since resources likely need some form of global coordination.
We should be mindful of complexity - this doesn't need to be as comprehensive or general purpose as Ray (e.g. I hope we don't end up writing our own bin-packing algorithm), and we should feel free to change our service definition to provide the right level of control.
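As a rough illustration of that decoupling, services could request hosts from a thin allocation layer that owns all scheduler interaction instead of talking to the scheduler themselves. Everything below (HostAllocator, acquire_hosts, release_hosts) is hypothetical and only meant to show the shape of the boundary, not an existing torchx or Monarch API:
from dataclasses import dataclass

# Hypothetical sketch: a single object owns the scheduler budget, so individual
# services never talk to SLURM/Kubernetes (or torchx) directly.
@dataclass
class HostAllocator:
    scheduler: str      # e.g. "slurm", passed through to torchx
    total_hosts: int    # cluster-wide budget this allocator manages
    _allocated: int = 0

    def acquire_hosts(self, num_hosts: int) -> int:
        # In a real implementation this would create and return a HostMesh via torchx.
        if self._allocated + num_hosts > self.total_hosts:
            raise RuntimeError("cluster over-subscribed; this is where global coordination kicks in")
        self._allocated += num_hosts
        return num_hosts

    def release_hosts(self, num_hosts: int) -> None:
        self._allocated -= num_hosts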
Replica Colocation
In my mind, replica colocation includes two pieces of functionality:
- GPU-aware scheduling: assuming I have 2 hosts in my cluster, and I want to spawn 4 inference servers with 4 GPUs each, I expect that my service will use those 2 hosts fully
- but with backdoors: one common pattern in RL systems like veRL is "colocating" trainer and inference servers. The workload can end up being bound by the inference servers, so the trainers are put to "sleep", freeing up their GPU memory so inference actors can take their place. See this blog post for more details. We should be able to re-use torch_memory_saver for this; in fact, this should already just work with Monarch (a rough sketch of the sleep/wake flow follows this list).
- This "trainer/generator colocation" is likely its own project, but the point stands: two services should be able to share resources in some way
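Below is a minimal sketch of that sleep/wake pattern, assuming torch_memory_saver's region()/pause()/resume() interface; the Trainer class and hook names are illustrative only, not an existing Forge or Monarch API:
import torch
from torch_memory_saver import torch_memory_saver  # assumed interface: region(), pause(), resume()

class Trainer:
    def setup(self):
        # Allocations made inside region() are tracked so they can be released later.
        with torch_memory_saver.region():
            self.model = torch.nn.Linear(8192, 8192).cuda()

    def sleep(self):
        # Free the tracked GPU memory so colocated inference actors can use it.
        torch_memory_saver.pause()

    def wake(self):
        # Re-materialize the paused allocations before training resumes.
        torch_memory_saver.resume()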
Support for Autoscaling
Services currently do not support autoscaling, but they can and should be extended to do so. An API could look something like the following:
ServiceConfig(min_replicas=1, max_replicas=4, num_replicas=2, **other_args)
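For illustration, here is a hedged sketch of how a service could act on those bounds; service, queue_depth, scale_up/scale_down are hypothetical names, not an existing Forge API:
async def autoscale_step(service, cfg):
    # Hypothetical control loop: stay within [min_replicas, max_replicas]
    # based on a simple load signal (pending requests per replica).
    load = service.queue_depth() / max(1, service.num_replicas())
    if load > 2.0 and service.num_replicas() < cfg.max_replicas:
        await service.scale_up(1)    # would go through the resource manager
    elif load < 0.5 and service.num_replicas() > cfg.min_replicas:
        await service.scale_down(1)  # release hosts back to the pool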
Initial Proposal
Some initial thoughts (feel free to disregard):
I would create a global resource manager (RM) that is:
- a Monarch controller, and
- initialized at the beginning of the workload
This GPU manager is a simple starting point for this pattern and should be folded into the RM.
Services and the RM communicate with each other through resource requests, so we need some type of common language:
@dataclass
class ResourceRequest:
    resource_group: str  # e.g. trainer, inference, trainer_inference
    procs: int
    hosts: int
    with_gpus: bool  # RM assigns
The resource manager should be initialized with (1) scheduler information and (2) service information.
@dataclass
class ResourceManager:
    scheduler: str  # e.g. "slurm"
    num_hosts: int  # e.g. 16 total in the cluster
    # min_hosts/max_hosts?
    services: dict[str, ServiceConfig]  # info on the total expected min/max replicas, how resources are shared, etc.
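As a rough sketch of how the RM might serve these requests (the class and method names below are hypothetical, and the policy is deliberately simplistic): services in the same resource_group reuse the same host pool, everything else gets dedicated hosts.
class ResourceManagerImpl:
    def __init__(self, num_hosts: int):
        self.free_hosts = num_hosts
        self.group_pools: dict[str, int] = {}  # resource_group -> hosts reserved for that group

    def request(self, req: ResourceRequest) -> int:
        # Returns a host count the caller can allocate a HostMesh against.
        if req.resource_group in self.group_pools:
            # Colocated groups (e.g. trainer_inference) share already-reserved hosts.
            return self.group_pools[req.resource_group]
        if req.hosts > self.free_hosts:
            raise RuntimeError("insufficient hosts; queue the request or autoscale instead")
        self.free_hosts -= req.hosts
        self.group_pools[req.resource_group] = req.hosts
        return req.hosts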
ServiceConfig would need to incorporate some heuristics as well:
@dataclass
class ServiceConfig:
    hosts_per_replica: int
    procs_per_replica: int
    num_replicas: int
    min_replicas: int
    max_replicas: int
    resource_group: str  # e.g. trainer, inference, trainer_inference
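To make the resource_group heuristic concrete, here is a hedged example of two services declaring that they may share hosts; the exact field values are illustrative:
# Trainer and generator declare the same resource_group, signaling to the RM
# that they may be scheduled onto the same hosts (the sleep/wake colocation above).
trainer_cfg = ServiceConfig(
    hosts_per_replica=2, procs_per_replica=16,
    num_replicas=1, min_replicas=1, max_replicas=1,
    resource_group="trainer_inference",
)
generator_cfg = ServiceConfig(
    hosts_per_replica=2, procs_per_replica=16,
    num_replicas=1, min_replicas=1, max_replicas=2,
    resource_group="trainer_inference",
)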
Status and Current Workarounds
(as of 9/11/2025)
This integration work is blocked by:
- HostMesh is not fully implemented in Monarch yet (TODO - tag issue/ETA)
- Actors created on proc_meshes owned by other Actors cannot communicate with each other (known bug)
Our current workaround for multi-host is to (1) use Monarch to create "a giant proc mesh" and (2) slice at the actor level:
# Create one large proc mesh spanning every host, then slice per role at the actor level.
procs = mast_host_mesh.spawn_procs(per_host={"procs": 8})
actors = procs.spawn("trainer", Trainer)
actors = actors.slice(**actor_slice())
other_actors = procs.spawn("generator", Generators)
other_actors = other_actors.slice(**other_actors_slice())
which we will have in Forge (TODO - tag PR). This puts some of the scaffolding in place, but does not fully address the other requirements listed above. I've tagged pieces in the codebase with this issue so we can grep where things can and should be changed as we make progress here.