Docs Content Part 2: Concepts #449
Open: AlannaBurke wants to merge 23 commits into main from docs/pr2-split-concepts-final
+1,335
−20
23 commits
b1794a9 Updating homepage, getting started, concepts. (AlannaBurke)
087e2ff Update documentation with blog post insights: enhanced homepage, comp… (AlannaBurke)
a0b2412 Update docs/source/getting_started.md (AlannaBurke)
b6d466c Update docs/source/index.md (AlannaBurke)
b564175 Update docs/source/index.md (AlannaBurke)
92ca627 Minor fixes and updates. (AlannaBurke)
f4b951b Merge branch 'getting-started' of github.com:meta-pytorch/forge into … (AlannaBurke)
34640e7 Update docs/source/getting_started.md (AlannaBurke)
32c8d78 Restructing info. (AlannaBurke)
e448c90 Merge branch 'main' of github.com:meta-pytorch/forge into getting-sta… (AlannaBurke)
ce9b472 Update docs/source/getting_started.md (AlannaBurke)
e998d94 Merge branch 'getting-started' of github.com:meta-pytorch/forge into … (AlannaBurke)
c89393c Updating gpu references. (AlannaBurke)
7a31e26 Updating toctree entries. (AlannaBurke)
af4eae7 Removing FAQs (AlannaBurke)
9d49ee6 Removing FAQ references. (AlannaBurke)
c410375 Update docs/source/getting_started.md (AlannaBurke)
6c70c8f Merge branch 'main' into getting-started (AlannaBurke)
f9b136a docs: Improve homepage and getting started pages (AlannaBurke)
1e9245e docs: Split concepts page into focused sub-pages (AlannaBurke)
4cad9f2 Updating getting started. (AlannaBurke)
79c6b50 Minor fixes. (AlannaBurke)
cdd3c29 Minor fixes. (AlannaBurke)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
# Architecture | ||
|
||
This guide provides a deep dive into TorchForge's architecture, explaining how Monarch, Services, and TorchStore work together to enable distributed RL. | ||
|
||
## The Foundation: Monarch | ||
|
||
At TorchForge's core is **Monarch**, a PyTorch-native distributed programming framework that brings single-controller orchestration to entire GPU clusters. | ||
|
||
### Single-Controller vs SPMD | ||
|
||
Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: | ||
- Asynchronous generation and training | ||
- Multiple heterogeneous components (policy, reward model, reference model) | ||
- Dynamic resource allocation | ||
- Fault tolerance across components | ||
|
||
**Monarch's single-controller model** changes this entirely. You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. | ||
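
As a rough illustration of the single-controller idea, the sketch below uses only Python's standard `asyncio` module. It is an analogy, not Monarch code: the `Worker` class and the `controller` function are hypothetical names, and the "remote" work is simulated locally.

```python
import asyncio


class Worker:
    """Hypothetical stand-in for a remote component (policy, reward model, ...)."""

    def __init__(self, name):
        self.name = name

    async def run(self, task):
        await asyncio.sleep(0)  # pretend to do remote work
        return f"{self.name} handled {task}"


async def controller():
    # One script holds a handle to every component and sequences the workflow,
    # instead of each process running its own copy with a local view.
    policy, reward = Worker("policy"), Worker("reward")
    response = await policy.run("prompt")
    score = await reward.run(response)
    return score


result = asyncio.run(controller())
print(result)
```

The point of the design is that control flow lives in one place: ordering, branching, and error handling are ordinary Python in `controller`, rather than being coordinated implicitly across many identical scripts.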
|
||
### Actor Meshes | ||
|
||
Monarch organizes resources into multidimensional arrays called **meshes**: | ||
|
||
**Process Mesh** | ||
: An array of processes spread across many hosts, typically one process per GPU | ||
|
||
**Actor Mesh** | ||
: An array of actors, each running inside a separate process | ||
|
||
Like array programming in NumPy or PyTorch, meshes make it simple to dispatch operations efficiently across large systems. You can slice meshes, broadcast operations, and operate on entire meshes with simple APIs. | ||
|
||
```python
from monarch.actor import Actor, endpoint, this_host

# Create a process mesh with 8 processes (one per GPU)
procs = this_host().spawn_procs({"gpus": 8})

# Define an actor
class PolicyActor(Actor):
    @endpoint
    def generate(self, prompt):
        # self.model is assumed to be loaded in the actor's __init__ (not shown)
        return self.model.generate(prompt)

# Spawn actors across the mesh
actors = procs.spawn("policy", PolicyActor)

# Call the endpoint on every actor in the mesh and wait for the results
results = actors.generate.call("Hello world").get()
```
|
||
### Fault Tolerance | ||
|
||
Monarch provides **progressive fault handling** - you write your code as if nothing fails. When something does fail, Monarch fails fast by default, stopping the whole program like an uncaught exception. | ||
|
||
But you can progressively add fine-grained fault handling exactly where you need it: | ||
|
||
```python
# Runs inside an async function; `policy` is a service handle and
# `ActorFailure` is the failure exception exposed by the framework.
try:
    result = await policy.generate.route(prompt)
except ActorFailure:
    # Handle failure - maybe retry with different replica
    result = await policy.generate.route(prompt)
```
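
One way to grow the single retry above into a reusable policy is a small backoff helper. The sketch below is plain `asyncio` and makes several assumptions: `ReplicaFailure`, `call_with_retry`, and `flaky_generate` are hypothetical names invented for illustration, and the real exception type depends on the framework version in use.

```python
import asyncio


class ReplicaFailure(Exception):
    """Hypothetical stand-in for a framework-specific failure exception."""


async def call_with_retry(fn, *args, attempts=3, base_delay=0.01):
    """Retry an async call with exponential backoff; re-raise on the last attempt."""
    for attempt in range(attempts):
        try:
            return await fn(*args)
        except ReplicaFailure:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


calls = {"n": 0}


async def flaky_generate(prompt):
    # Simulated endpoint that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ReplicaFailure("replica down")
    return f"response to {prompt}"


result = asyncio.run(call_with_retry(flaky_generate, "hi"))
print(result, "after", calls["n"], "attempts")
```

Wrapping only the calls that need it keeps the fail-fast default everywhere else, which matches the progressive approach described in this section.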
|
||
For long-running RL training, this is crucial. Hardware failures are common at scale: Meta's Llama 3 training saw 419 unexpected interruptions over a 54-day period on a 16K-GPU job, roughly one every 3 hours.
|
||
### RDMA and Data Plane | ||
|
||
Monarch separates the **control plane** (messaging) from the **data plane** (bulk data transfers). This enables direct GPU-to-GPU memory transfers across your cluster using RDMA (Remote Direct Memory Access). | ||
|
||
Control commands go through one optimized path, while large data transfers (like model weights) go through another path optimized for bandwidth. | ||
|
||
## Services: RL-Friendly Actor Abstraction | ||
|
||
**Services** wrap Monarch's ActorMesh with patterns common in RL. A service is a managed group of actor replicas with built-in load balancing, fault tolerance, and routing primitives. | ||
|
||
```python
# Create a policy service with 16 replicas, each using 8 processes
policy = PolicyActor.options(
    procs=8,
    with_gpus=True,
    num_replicas=16
).as_service()
```
|
||
### Service Adverbs | ||
|
||
Services provide intuitive operations called "adverbs": | ||
|
||
**route()**
: Load-balanced request to one replica
```python
response = await policy.generate.route(prompt)
```

**fanout()**
: Broadcast to ALL replicas in parallel
```python
await policy.update_weights.fanout(version)
```

**session()**
: Sticky sessions for stateful operations (maintains KV cache consistency)
```python
async with policy.session():
    response1 = await policy.generate.route(prompt1)
    response2 = await policy.generate.route(prompt2)  # Same replica
```
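
To make the three adverbs concrete, here is a toy model of their semantics in plain `asyncio`. It is not TorchForge's implementation: `ToyReplica` and `ToyService` are hypothetical, routing is simple round-robin, and the session state is a single attribute, so the sketch is not safe for concurrent sessions.

```python
import asyncio
import itertools
from contextlib import asynccontextmanager


class ToyReplica:
    def __init__(self, idx):
        self.idx = idx

    async def generate(self, prompt):
        return f"replica{self.idx}:{prompt}"


class ToyService:
    def __init__(self, num_replicas):
        self.replicas = [ToyReplica(i) for i in range(num_replicas)]
        self._round_robin = itertools.cycle(self.replicas)
        self._sticky = None

    async def route(self, prompt):
        # Inside a session, stick to one replica; otherwise round-robin.
        replica = self._sticky or next(self._round_robin)
        return await replica.generate(prompt)

    async def fanout(self, prompt):
        # Send the same request to every replica in parallel.
        return await asyncio.gather(*(r.generate(prompt) for r in self.replicas))

    @asynccontextmanager
    async def session(self):
        self._sticky = next(self._round_robin)
        try:
            yield
        finally:
            self._sticky = None


async def demo():
    svc = ToyService(3)
    routed = [await svc.route("a"), await svc.route("b")]
    everyone = await svc.fanout("w")
    async with svc.session():
        sticky = [await svc.route("x"), await svc.route("y")]
    return routed, everyone, sticky


routed, everyone, sticky = asyncio.run(demo())
print(routed, everyone, sticky)
```

In the demo, the two plain `route` calls land on different replicas, `fanout` returns one result per replica, and both calls inside the session land on the same replica.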
|
||
### Why Services Matter for RL | ||
|
||
Services solve critical infrastructure challenges: | ||
|
||
**Heterogeneous Scaling** | ||
: Different components need different resources. Your policy might need 16 replicas × 8 processes for high-throughput vLLM inference. Your reward model might need 4 replicas × 4 processes. Your coding environment might need 16 lightweight CPU-only replicas. Services let each component scale independently. | ||
|
||
**Load Balancing** | ||
: In async RL, multiple `continuous_rollouts()` tasks run concurrently. Services automatically distribute these rollouts across available replicas - no manual worker pool management. | ||
|
||
**Fault Tolerance** | ||
: If a replica fails during a rollout, services detect it, mark it unhealthy, and route subsequent requests to healthy replicas. The failed replica gets restarted automatically. Your RL code never sees the failure. | ||
|
||
**Ephemeral Infrastructure** | ||
: Services are created with your job and torn down when finished. Want to try a new reward model? Change your Python code. No standing deployments to maintain, no infrastructure to provision ahead of time. | ||
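
The fault-tolerance behavior described above (detect a failure, mark the replica unhealthy, send later requests elsewhere) can be sketched in a few lines. This is an illustrative model only: `Replica` and `route_healthy` are hypothetical names, failures are simulated with `RuntimeError`, and automatic restart is left as a comment.

```python
import asyncio


class Replica:
    def __init__(self, idx, broken=False):
        self.idx = idx
        self.broken = broken
        self.healthy = True

    async def generate(self, prompt):
        if self.broken:
            raise RuntimeError(f"replica {self.idx} crashed")
        return f"replica{self.idx}:{prompt}"


async def route_healthy(replicas, prompt):
    """Try healthy replicas in order; mark a replica unhealthy when it raises."""
    for replica in replicas:
        if not replica.healthy:
            continue
        try:
            return await replica.generate(prompt)
        except RuntimeError:
            replica.healthy = False  # a real system would also schedule a restart
    raise RuntimeError("no healthy replicas")


replicas = [Replica(0, broken=True), Replica(1)]
first = asyncio.run(route_healthy(replicas, "p1"))   # fails over to replica 1
second = asyncio.run(route_healthy(replicas, "p2"))  # skips replica 0 entirely
print(first, second, replicas[0].healthy)
```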
|
||
## TorchStore: Distributed Weight Storage | ||
|
||
In async RL, every training step produces new policy weights that must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes of data. **TorchStore** makes this efficient. | ||
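
A back-of-the-envelope calculation shows the scale involved. The figures below assume 2 bytes per parameter (bf16 or fp16 weights) and that every replica receives a full copy; both are assumptions for illustration, and the actual volume depends on precision and on how replicas shard the model.

```python
params = 70e9            # 70B parameters
bytes_per_param = 2      # assumption: bf16/fp16 weights
replicas = 16            # assumption: each replica receives a full copy

per_copy_gb = params * bytes_per_param / 1e9
total_tb = per_copy_gb * replicas / 1e3

print(f"{per_copy_gb:.0f} GB per copy, {total_tb:.2f} TB across {replicas} replicas")
```

Under these assumptions a single copy is about 140 GB, and pushing one to each of the 16 replicas moves about 2.24 TB per update.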
|
||
### The Weight Synchronization Challenge | ||
|
||
Traditionally, you have two options: | ||
1. **Build complex p2p mappings** between training and inference sharding strategies (fast but extremely complex)
2. **Use a network filesystem** such as NFS (simple but slow, with high infrastructure cost)
|
||
TorchStore combines the **UX of central storage** with the **performance of in-memory p2p operations**. | ||
|
||
### How TorchStore Works | ||
|
||
TorchStore is a distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives: | ||
|
||
```python
import torch
import torchstore as ts
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

# `model` is assumed to be defined elsewhere in each process (not shown).

# Training process: store sharded weights
async def store_weights():
    device_mesh = init_device_mesh("cuda", (4,))
    tensor = model.state_dict()['layer.weight']
    dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)])

    # Each rank stores its shard
    await ts.put("policy_weights_v123", dtensor)

# Inference process: fetch with different sharding
async def load_weights():
    device_mesh = init_device_mesh("cuda", (2, 2))  # Different topology!
    tensor = torch.empty_like(model.state_dict()['layer.weight'])
    # A 2-D mesh needs one placement per mesh dimension
    dtensor = distribute_tensor(tensor, device_mesh, [Shard(0), Shard(1)])

    # TorchStore handles resharding automatically
    await ts.get("policy_weights_v123", dtensor)
```
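
What "resharding" means can be pictured without GPUs. The sketch below uses nested Python lists in place of tensors: the same 8x4 matrix is first split into 4 row blocks (like `Shard(0)` on a 4-device mesh) and then re-cut into a 2x2 grid of blocks. The helper names are hypothetical, and the explicit gather step is a simplification; a real system would move only the pieces each destination needs.

```python
# An 8x4 "weight matrix" as nested lists, standing in for a tensor.
weight = [[r * 4 + c for c in range(4)] for r in range(8)]


def shard_rows(matrix, n):
    """Split rows into n contiguous blocks (like Shard(0) on a 1-D mesh)."""
    step = len(matrix) // n
    return [matrix[i * step:(i + 1) * step] for i in range(n)]


def shard_grid(matrix, rows, cols):
    """Split into a rows x cols grid of blocks (rows first, then columns)."""
    col_step = len(matrix[0]) // cols
    return [
        [[row[j * col_step:(j + 1) * col_step] for row in block] for j in range(cols)]
        for block in shard_rows(matrix, rows)
    ]


trainer_shards = shard_rows(weight, 4)                      # 4 shards of shape 2x4
full = [row for shard in trainer_shards for row in shard]   # logical gather
inference_shards = shard_grid(full, 2, 2)                   # 2x2 grid of 4x2 blocks

print(trainer_shards[1])
print(inference_shards[1][1])
```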
|
||
**Key Features:** | ||
|
||
**Automatic Resharding** | ||
: Handles complex weight transfer between different sharding strategies transparently | ||
|
||
**DTensor Native** | ||
: Works seamlessly with PyTorch's distributed tensors | ||
|
||
**RDMA Transfers** | ||
: Uses RDMA for high-bandwidth data movement without blocking GPUs | ||
|
||
**Asynchronous Updates** | ||
: Training and inference can read/write weights independently, enabling true async RL | ||
|
||
**Flexible Storage** | ||
: Store tensors co-located with trainers, on their own storage tier, sharded or replicated - change with minimal code modifications | ||
|
||
### Why TorchStore Matters | ||
|
||
Weight synchronization becomes a bottleneck in async RL. Traditional approaches either: | ||
- Require synchronous GPU-to-GPU transfers (blocking training) | ||
- Use slow network filesystems (minutes per update) | ||
- Demand complex manual resharding logic (error-prone, hard to maintain) | ||
|
||
TorchStore solves all of these, keeping data distributed across the cluster until requested and moving it efficiently with RDMA. | ||
|
||
## Distributed Training Strategies | ||
|
||
|
||
TorchForge leverages multiple parallelism strategies through TorchTitan. See the [TorchTitan repository](https://github.com/pytorch/torchtitan) for details.
|
||
## See Also | ||
|
||
- {doc}`concepts` - Core philosophy and key abstractions | ||
- {doc}`technology_stack` - Understanding the dependency stack | ||
- {doc}`rl_workflows` - Writing RL algorithms with these components | ||
- {doc}`getting_started` - Installation and setup |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,150 @@ | ||
# Concepts | ||
|
||
- This guide covers the fundamental concepts and architecture behind TorchForge,
- helping you understand how the system works and how to effectively use its components.
+ This guide introduces the fundamental principles and concepts behind TorchForge, helping you understand the philosophy that drives the system.
|
||
## The Core Philosophy | ||
|
||
TorchForge is built on one principle: **researchers should write algorithms, not infrastructure**. | ||
|
||
The traditional approach to distributed RL requires you to write complex coordination logic, retry mechanisms, resource management, and synchronization code. TorchForge abstracts all of this away, letting you express RL algorithms as naturally as pseudocode while powerful infrastructure handles the distributed complexity underneath. | ||
|
||
## Key Abstractions | ||
|
||
Understanding these core abstractions helps you use TorchForge effectively: | ||
|
||
### Actor | ||
|
||
A component that encapsulates a model along with its execution logic. Actors provide: | ||
- **Isolation**: Independent resources and failure domains | ||
- **Flexibility**: Different parallelism strategies per actor | ||
- **Composability**: Combine actors to create complex pipelines | ||
|
||
### Service | ||
|
||
A managed group of actor replicas with built-in routing, load balancing, and fault tolerance. Services handle operational complexity so your RL code stays clean. Think of services as horizontally scaled actors with automatic load distribution. | ||
|
||
### DTensor (Distributed Tensor) | ||
|
||
A tensor sharded across multiple devices. TorchStore handles resharding DTensors between different topologies automatically, making distributed tensor operations transparent. | ||
|
||
### Episode | ||
|
||
A complete RL interaction sequence containing: | ||
- **Prompt**: Input to the policy | ||
- **Response**: Generated output | ||
- **Reward**: Feedback signal | ||
- **Metadata**: Additional context (timestamps, model versions, etc.) | ||
|
||
Episodes flow through your system from generation to replay buffer to training. | ||
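
As a minimal sketch, an episode with the four parts listed above could be modeled as a dataclass. The class and field names here are assumptions chosen to mirror the list; TorchForge's actual episode type may differ.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Episode:
    """Hypothetical container mirroring the four parts listed above."""

    prompt: str                  # input to the policy
    response: str                # generated output
    reward: float                # feedback signal
    metadata: dict[str, Any] = field(default_factory=dict)  # timestamps, versions, ...


episode = Episode(
    prompt="2 + 2 = ?",
    response="4",
    reward=1.0,
    metadata={"policy_version": 3},
)
print(episode)
```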
|
||
### Replay Buffer | ||
|
||
Stores episodes for training. Can be implemented with various strategies: | ||
- **FIFO**: Simple queue for on-policy algorithms | ||
- **Prioritized**: Importance sampling for off-policy learning | ||
- **Reservoir**: Uniform sampling from history | ||
- **Hybrid**: Mix multiple strategies | ||
|
||
Integrates with TorchStore for efficient distributed storage. | ||
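
Two of the strategies above are simple enough to sketch with the standard library. The classes below are illustrative only (`FIFOBuffer` and `ReservoirBuffer` are hypothetical names, and nothing here touches TorchStore); the reservoir variant uses the classic Algorithm R so that every episode seen so far is equally likely to be retained.

```python
import random
from collections import deque


class FIFOBuffer:
    """Bounded queue: the oldest episodes are dropped first and consumed first."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def add(self, episode):
        self._items.append(episode)

    def sample(self, n):
        """Pop up to n of the oldest episodes."""
        return [self._items.popleft() for _ in range(min(n, len(self._items)))]


class ReservoirBuffer:
    """Reservoir sampling (Algorithm R): a uniform sample of everything seen."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self._items = []
        self._seen = 0
        self._rng = random.Random(seed)

    def add(self, episode):
        self._seen += 1
        if len(self._items) < self.capacity:
            self._items.append(episode)
        else:
            # Keep the new episode with probability capacity / seen.
            j = self._rng.randrange(self._seen)
            if j < self.capacity:
                self._items[j] = episode

    def sample(self, n):
        return self._rng.sample(self._items, min(n, len(self._items)))


fifo = FIFOBuffer(capacity=3)
reservoir = ReservoirBuffer(capacity=10)
for i in range(5):
    fifo.add(i)
for i in range(100):
    reservoir.add(i)

oldest = fifo.sample(2)
kept = reservoir.sample(10)
print(oldest, sorted(kept))
```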
|
||
## Design Principles | ||
|
||
### Single-Controller Model | ||
|
||
Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: | ||
- Asynchronous generation and training | ||
- Multiple heterogeneous components (policy, reward model, reference model) | ||
- Dynamic resource allocation | ||
- Fault tolerance across components | ||
|
||
TorchForge adopts **Monarch's single-controller model**: You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. | ||
|
||
### Composable Components | ||
|
||
Write your core logic once, compose it into any paradigm: | ||
- **Synchronous on-policy** (PPO, GRPO) | ||
- **Asynchronous off-policy** (continuous rollouts + training) | ||
- **Hybrid approaches** (batch collection with async training) | ||
|
||
The same `generate_episode()` function works everywhere. Just change how you compose it. | ||
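
The sketch below shows what "compose it differently" can look like in plain `asyncio`. Everything in it is hypothetical: this `generate_episode` fakes a rollout locally, whereas a real one would call policy and reward services, and the queue stands in for a replay buffer.

```python
import asyncio


async def generate_episode(prompt):
    """Hypothetical rollout; a real version would call policy and reward services."""
    await asyncio.sleep(0)
    return {"prompt": prompt, "response": prompt.upper(), "reward": float(len(prompt))}


async def synchronous_batch(prompts):
    # On-policy style: collect the whole batch, then hand it to training.
    return await asyncio.gather(*(generate_episode(p) for p in prompts))


async def continuous_rollouts(prompts, queue):
    # Off-policy style: push episodes as they finish; a trainer consumes concurrently.
    for prompt in prompts:
        await queue.put(await generate_episode(prompt))
    await queue.put(None)  # sentinel: no more episodes


async def trainer(queue):
    seen = []
    while (episode := await queue.get()) is not None:
        seen.append(episode)
    return seen


async def demo():
    batch = await synchronous_batch(["a", "bb"])
    queue = asyncio.Queue()
    _, streamed = await asyncio.gather(
        continuous_rollouts(["ccc"], queue), trainer(queue)
    )
    return batch, streamed


batch, streamed = asyncio.run(demo())
print(batch, streamed)
```

The rollout function is identical in both paths; only the surrounding composition (gather a batch versus stream through a queue) changes.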
|
||
### Ephemeral Infrastructure | ||
|
||
Services are created with your job and torn down when finished: | ||
- No standing deployments to maintain | ||
- No infrastructure to provision ahead of time | ||
- Want to try a new reward model? Change your Python code and rerun | ||
|
||
This dramatically reduces operational overhead and enables rapid experimentation. | ||
|
||
### Progressive Fault Tolerance | ||
|
||
Write code as if nothing fails. When failures do occur: | ||
- Monarch fails fast by default (like uncaught exceptions) | ||
- Add fine-grained fault handling exactly where you need it | ||
- Services automatically route around failed replicas | ||
- Failed actors restart automatically | ||
|
||
You choose your fault tolerance granularity based on your needs. | ||
|
||
## Best Practices | ||
|
||
### Model Selection | ||
|
||
- Start with smaller models for prototyping | ||
- Use pre-configured model setups when available | ||
- Validate configurations before large-scale training | ||
|
||
### Data Preparation | ||
|
||
- Ensure balanced and diverse training data | ||
- Implement proper train/validation splits | ||
- Monitor data quality throughout training | ||
- Verify token distributions match expectations | ||
|
||
### Training Strategy | ||
|
||
- Begin with SFT before attempting GRPO | ||
- Use gradient accumulation for larger effective batch sizes | ||
- Monitor KL divergence to prevent policy collapse | ||
- Implement regular checkpointing for fault tolerance | ||
- Apply warmup schedules for stable training | ||
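
For the KL-monitoring item above, one common approach is a sample-based estimate computed from per-token log-probabilities of tokens sampled from the current policy. The function below is a minimal sketch under that assumption; `estimate_kl` is a hypothetical helper, not a TorchForge API, and production code would typically operate on tensors.

```python
import math


def estimate_kl(policy_logprobs, reference_logprobs):
    """Monte Carlo estimate of KL(policy || reference).

    Both inputs are log-probabilities of the same tokens, which are assumed
    to have been sampled from the policy: mean(log p_policy - log p_ref).
    """
    diffs = [p - r for p, r in zip(policy_logprobs, reference_logprobs)]
    return sum(diffs) / len(diffs)


# Toy check: the policy assigns 0.5 to each sampled token, the reference 0.25.
kl = estimate_kl([math.log(0.5)] * 2, [math.log(0.25)] * 2)
print(kl)
```

A value that keeps growing over training suggests the policy is drifting far from the reference model, which is the signal this practice is meant to catch.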
|
||
### Resource Optimization | ||
|
||
- Profile memory usage to identify bottlenecks | ||
- Tune batch sizes for your hardware configuration | ||
- Consider mixed precision training to reduce memory | ||
- Use appropriate parallelism strategies for your model size | ||
|
||
### Debugging | ||
|
||
- Start with single-GPU training to isolate issues | ||
- Enable verbose logging for distributed runs | ||
- Use profiling tools to identify bottlenecks | ||
- Validate data pipelines before full training | ||
- Monitor loss curves and generation quality | ||
|
||
## Validation | ||
|
||
TorchForge has been validated in real-world deployments: | ||
|
||
- **Stanford Collaboration**: Integration with the Weaver weak verifier project, training models that hill-climb on challenging reasoning benchmarks (MATH, GPQA) | ||
- **CoreWeave**: Large-scale training runs on a 512-GPU H100 cluster with smooth, efficient performance
- **Scale**: Tested across hundreds of GPUs with continuous rollouts and asynchronous training | ||
|
||
## Learn More | ||
|
||
Dive deeper into specific topics: | ||
|
||
```{toctree}
:maxdepth: 1

architecture
technology_stack
rl_workflows
```
|
||
**Related Documentation:** | ||
- {doc}`getting_started` - Installation and first training run | ||
- {doc}`api` - Complete API reference |
this all looks good to me, but cc @LucasLLC in case of any desired big changes