Enables dynamic GPU allocation for local workloads #91

allenwang28 · 2025-08-28T21:09:12Z

Context for this PR: I want to be able to demo a service that spawns N replicas of vLLM. For that to work, we need dynamic GPU allocation. Our current approach hard-codes GPU assignment.

So this PR does a few things:

- Introduces GpuResourceManager, a singleton actor (or "controller") responsible for tracking and releasing GPUs. This only works for a local proc mesh for now; it will expand to multi-host once/if needed. Since it's a singleton, there's no concern about race conditions.
- Updates proc and service types to also track num_gpus, which is the signal for receiving GPU IDs.
- Adds stubs for the new hostmesh APIs. These are currently broken because of some not-yet-implemented (NYI) pieces in Monarch. We probably need to work through a redesign of procmesh management/allocation.
- Updates applications (sft_v2, rl, grpo_main, vllm) accordingly.
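For readers skimming the description, here is a minimal sketch of the kind of bookkeeping a GPU-tracking controller does. It is not the code in this PR: `SimpleGpuManager`, `get_gpus`, and `release_gpus` are hypothetical names, it is a plain in-process class rather than a Monarch actor, and it assumes GPUs are identified by integer indices.

```python
class SimpleGpuManager:
    """Hypothetical in-process stand-in for a GPU-tracking controller."""

    def __init__(self, total_gpus: int = 8) -> None:
        # Assumption: GPUs are identified by integer indices 0..total_gpus-1.
        self._total = total_gpus
        self._available = set(range(total_gpus))

    def get_gpus(self, num_gpus: int) -> list[int]:
        """Reserve `num_gpus` free GPU ids, or raise if too few are free."""
        if num_gpus < 0:
            raise ValueError("num_gpus must be non-negative")
        if num_gpus > len(self._available):
            raise RuntimeError(
                f"requested {num_gpus} GPUs, only {len(self._available)} free"
            )
        # Hand out the lowest-numbered free ids so results are deterministic.
        allocated = sorted(self._available)[:num_gpus]
        self._available.difference_update(allocated)
        return allocated

    def release_gpus(self, gpu_ids: list[int]) -> None:
        """Return previously reserved GPU ids to the free pool."""
        for gpu_id in gpu_ids:
            if not 0 <= gpu_id < self._total:
                raise ValueError(f"unknown GPU id: {gpu_id}")
        self._available.update(gpu_ids)


# Usage: two replicas take two GPUs each, then the first replica shuts down.
manager = SimpleGpuManager(total_gpus=4)
first = manager.get_gpus(2)
second = manager.get_gpus(2)
manager.release_gpus(first)
third = manager.get_gpus(1)
```

Because every request goes through one object, the free set has a single owner, which is the property the singleton design relies on.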

Next up:

- Create a policy-specific Replica that uses get_proc_mesh() correctly

Copilot

Pull Request Overview

This PR introduces dynamic GPU allocation for local workloads by implementing a GPU resource manager that tracks and distributes GPU resources instead of hard-coding GPU assignments. This enables spawning multiple replicas of GPU-intensive services like vLLM.

- Adds GpuResourceManager singleton actor to track and allocate GPU resources
- Updates process and service types to include num_gpus field for GPU allocation requests
- Integrates GPU allocation into process mesh creation and service configuration
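One common way to pass allocated GPU ids to a spawned process is the `CUDA_VISIBLE_DEVICES` environment variable. This summary does not say which mechanism proc_mesh.py actually uses, so the following is only a sketch under that assumption; `build_proc_env` is a hypothetical helper, not a function from this PR.

```python
import os
from typing import Optional


def build_proc_env(
    gpu_ids: list[int], base_env: Optional[dict[str, str]] = None
) -> dict[str, str]:
    """Return a copy of the environment restricted to the given GPU ids.

    Assumption: the child process honors CUDA_VISIBLE_DEVICES.
    """
    env = dict(os.environ if base_env is None else base_env)
    if gpu_ids:
        # Comma-separated device indices, e.g. "2,3".
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    # With no GPUs requested, the environment is left unchanged in this sketch.
    return env


gpu_env = build_proc_env([2, 3], base_env={})
cpu_env = build_proc_env([], base_env={})
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen` or an equivalent launcher.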

Reviewed Changes

Copilot reviewed 16 out of 19 changed files in this pull request and generated 2 comments.

File	Description
tests/unit_tests/test_service.py	Updates test configurations to specify `gpus_per_replica=0` for CPU-only tests
tests/unit_tests/test_gpu_manager.py	Comprehensive test suite for GPU manager allocation and release functionality
src/forge/types.py	Adds `num_gpus` field to `ProcessConfig` and `gpus_per_replica` to `ServiceConfig`
src/forge/controller/service/spawn.py	Updates import paths after service module restructuring
src/forge/controller/service/service.py	Updates import paths after service module restructuring
src/forge/controller/service/metrics.py	Updates import paths after service module restructuring
src/forge/controller/service/__init__.py	Creates service module package with proper exports
src/forge/controller/proc_mesh.py	Integrates GPU allocation into process mesh creation and environment setup
src/forge/controller/custom_actors/service_registry.py	Stub implementation for future service tracking functionality
src/forge/controller/custom_actors/gpu_manager.py	Core GPU manager actor implementation with allocation and release logic
src/forge/controller/custom_actors/__init__.py	Exports GPU manager utility functions
src/forge/controller/__init__.py	Updates exports after service module restructuring
apps/sft_v2/main.py	Updates comment to reflect correct module path
apps/sft_v2/llama3_8b.yaml	Adds GPU allocation configuration and fixes tokenizer path
apps/rl/llama3_8b.yaml	Adds GPU allocation configuration for trainer and replay buffer
apps/grpo/main.py	Adds GPU allocation to service configuration and removes hardcoded device assignment
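To make the two new fields in the table concrete, here is a rough sketch of how they might look as dataclasses. Only the names `ProcessConfig`, `ServiceConfig`, `num_gpus`, and `gpus_per_replica` come from the summary above; the other fields, the defaults, and the `total_gpus` helper are hypothetical and are not taken from src/forge/types.py.

```python
from dataclasses import dataclass


@dataclass
class ProcessConfig:
    num_procs: int = 1  # hypothetical field, for illustration only
    num_gpus: int = 0   # how many GPU ids this process group requests


@dataclass
class ServiceConfig:
    num_replicas: int = 1      # hypothetical field, for illustration only
    gpus_per_replica: int = 0  # 0 means a CPU-only replica

    def total_gpus(self) -> int:
        """Hypothetical helper: GPUs needed across all replicas."""
        return self.num_replicas * self.gpus_per_replica


# A CPU-only test service and a two-replica service with one GPU each.
cpu_service = ServiceConfig(num_replicas=2, gpus_per_replica=0)
gpu_service = ServiceConfig(num_replicas=2, gpus_per_replica=1)
```

Setting `gpus_per_replica=0` mirrors how the updated unit tests mark CPU-only services.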

pbontrager

These are a lot of changes, but they seem to be mostly renaming and import changes. I like the addition of the GPU manager.

Allen Wang added 13 commits August 28, 2025 07:31

initial commit

befe538

Merge branch 'main' into dynamic_gpus

47aa7e0

add back spawn

ea0cd09

stash

ac50eba

park

286e841

add gpu resource management

2ee010d

add gpu resource management

efb5661

update test apis

8f0380c

Merge branch 'main' into dynamic_gpus

8b00bb2

stash

caa147a

sft v2 works again

4aa8778

renames, adds stop capability

d1dce29

some updates

730365c

allenwang28 requested a review from Copilot August 28, 2025 21:09

meta-cla bot added the CLA Signed label Aug 28, 2025

allenwang28 requested a review from joecummings August 28, 2025 21:09

Copilot AI reviewed Aug 28, 2025

src/forge/controller/custom_actors/service_registry.py (outdated)

src/forge/controller/proc_mesh.py (outdated)

Allen Wang added 2 commits August 28, 2025 14:11

typo fix

1ce8635

fix test

97f04b6

allenwang28 requested a review from pbontrager August 28, 2025 21:22

missing import

4b83d16

joecummings approved these changes Aug 28, 2025

apps/grpo/main.py (outdated)

src/forge/controller/__init__.py (outdated)

Allen Wang added 2 commits August 28, 2025 14:45

no nested submodule

d11dcae

check

2ea4d3e

pbontrager approved these changes Aug 28, 2025

apps/grpo/main.py (outdated)

Allen Wang added 2 commits August 28, 2025 15:07

num_gpus => with_gpus

00c1a2b

proc mesh update

2d30f2f

allenwang28 merged commit ce91430 into meta-pytorch:main Aug 28, 2025
4 checks passed

allenwang28 deleted the dynamic_gpus branch August 28, 2025 22:19
