allenwang28
Contributor

Context for this PR: I want to be able to demo a service that spawns N replicas of vLLM. For that to work, we need dynamic GPU allocation; our current approach hard-codes GPU assignment.

So this PR does a few things:

  • Introduces GpuResourceManager, a singleton actor (or "controller") responsible for tracking, allocating, and releasing GPUs. For now this only works for a local proc mesh; it will be expanded to multi-host once/if needed. Since it's a singleton, there's no concern about race conditions etc.
  • Updates proc and service types to now track num_gpus too, which is the signal for receiving GPU ids
  • Adds stubs for the new hostmesh APIs. These are currently broken because some of the required pieces are not yet implemented (NYI) in Monarch. We probably need to work through a re-design of procmesh management/allocation
  • Updates applications (sft_v2, rl, grpo_main, vllm) accordingly
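
As a rough illustration of the first bullet, the bookkeeping that such a manager performs could look like the sketch below. It is plain Python rather than a Monarch actor, and the class name `GpuPool` and its methods are hypothetical, not taken from this PR. It only shows the allocate/release idea, assuming GPUs are identified by integer ids.

```python
from __future__ import annotations


class GpuPool:
    """Hypothetical stand-in for the GPU bookkeeping described above.

    Not the PR's GpuResourceManager: no actor framework, single process only.
    """

    def __init__(self, num_gpus: int) -> None:
        # Every GPU id on the (assumed local) host starts out free.
        self._available: set[int] = set(range(num_gpus))

    @property
    def num_available(self) -> int:
        return len(self._available)

    def allocate(self, num_gpus: int) -> list[int]:
        """Reserve `num_gpus` ids, lowest ids first, or raise if too few are free."""
        if num_gpus > len(self._available):
            raise RuntimeError(
                f"requested {num_gpus} GPUs but only {len(self._available)} are free"
            )
        gpu_ids = sorted(self._available)[:num_gpus]
        self._available.difference_update(gpu_ids)
        return gpu_ids

    def release(self, gpu_ids: list[int]) -> None:
        """Return previously allocated ids to the pool."""
        self._available.update(gpu_ids)
```

Because all requests go through one object, two callers can never be handed the same id, which is the property the singleton design relies on.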

Next up:

  • Create a policy specific Replica which will use get_proc_mesh() correctly

@allenwang28 allenwang28 requested a review from Copilot August 28, 2025 21:09
@meta-cla meta-cla bot added the CLA Signed label Aug 28, 2025

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces dynamic GPU allocation for local workloads by implementing a GPU resource manager that tracks and distributes GPU resources instead of hard-coding GPU assignments. This enables spawning multiple replicas of GPU-intensive services like vLLM.

  • Adds GpuResourceManager singleton actor to track and allocate GPU resources
  • Updates process and service types to include num_gpus field for GPU allocation requests
  • Integrates GPU allocation into process mesh creation and service configuration
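
To make the link between a `num_gpus` request and environment setup concrete, here is a minimal sketch. It assumes the allocated ids are passed to the child process through the `CUDA_VISIBLE_DEVICES` environment variable, which this summary does not spell out. `ProcSpec` and `build_env` are hypothetical names for illustration, not the PR's `ProcessConfig` or any forge API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProcSpec:
    """Hypothetical process request; only `num_gpus` mirrors a field named in the PR."""

    num_procs: int = 1
    num_gpus: int = 0


def build_env(
    spec: ProcSpec,
    gpu_ids: list[int],
    base_env: dict[str, str] | None = None,
) -> dict[str, str]:
    """Return a copy of `base_env` that exposes only the allocated GPUs."""
    if len(gpu_ids) != spec.num_gpus:
        raise ValueError(
            f"spec asks for {spec.num_gpus} GPUs but {len(gpu_ids)} ids were given"
        )
    env = dict(base_env or {})
    # An empty value hides all GPUs, which matches a CPU-only request (num_gpus=0).
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env
```

For example, `build_env(ProcSpec(num_gpus=2), [2, 3])` yields an environment in which the process sees only physical GPUs 2 and 3, renumbered by CUDA as devices 0 and 1.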

Reviewed Changes

Copilot reviewed 16 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/unit_tests/test_service.py | Updates test configurations to specify gpus_per_replica=0 for CPU-only tests |
| tests/unit_tests/test_gpu_manager.py | Comprehensive test suite for GPU manager allocation and release functionality |
| src/forge/types.py | Adds num_gpus field to ProcessConfig and gpus_per_replica to ServiceConfig |
| src/forge/controller/service/spawn.py | Updates import paths after service module restructuring |
| src/forge/controller/service/service.py | Updates import paths after service module restructuring |
| src/forge/controller/service/metrics.py | Updates import paths after service module restructuring |
| src/forge/controller/service/\_\_init\_\_.py | Creates service module package with proper exports |
| src/forge/controller/proc_mesh.py | Integrates GPU allocation into process mesh creation and environment setup |
| src/forge/controller/custom_actors/service_registry.py | Stub implementation for future service tracking functionality |
| src/forge/controller/custom_actors/gpu_manager.py | Core GPU manager actor implementation with allocation and release logic |
| src/forge/controller/custom_actors/\_\_init\_\_.py | Exports GPU manager utility functions |
| src/forge/controller/\_\_init\_\_.py | Updates exports after service module restructuring |
| apps/sft_v2/main.py | Updates comment to reflect correct module path |
| apps/sft_v2/llama3_8b.yaml | Adds GPU allocation configuration and fixes tokenizer path |
| apps/rl/llama3_8b.yaml | Adds GPU allocation configuration for trainer and replay buffer |
| apps/grpo/main.py | Adds GPU allocation to service configuration and removes hardcoded device assignment |

@allenwang28 allenwang28 requested a review from pbontrager August 28, 2025 21:22
Contributor

@pbontrager pbontrager left a comment

These are a lot of changes, but they seem to be mostly renaming and import changes. I like the addition of the GPU manager.

@allenwang28 allenwang28 merged commit ce91430 into meta-pytorch:main Aug 28, 2025
4 checks passed
@allenwang28 allenwang28 deleted the dynamic_gpus branch August 28, 2025 22:19