Merged (28 commits)
- `b1794a9` Updating homepage, getting started, concepts. (AlannaBurke, Oct 8, 2025)
- `087e2ff` Update documentation with blog post insights: enhanced homepage, comp… (AlannaBurke, Oct 8, 2025)
- `a0b2412` Update docs/source/getting_started.md (AlannaBurke, Oct 10, 2025)
- `b6d466c` Update docs/source/index.md (AlannaBurke, Oct 10, 2025)
- `b564175` Update docs/source/index.md (AlannaBurke, Oct 10, 2025)
- `92ca627` Minor fixes and updates. (AlannaBurke, Oct 10, 2025)
- `f4b951b` Merge branch 'getting-started' of github.com:meta-pytorch/forge into … (AlannaBurke, Oct 10, 2025)
- `34640e7` Update docs/source/getting_started.md (AlannaBurke, Oct 10, 2025)
- `32c8d78` Restructing info. (AlannaBurke, Oct 11, 2025)
- `e448c90` Merge branch 'main' of github.com:meta-pytorch/forge into getting-sta… (AlannaBurke, Oct 14, 2025)
- `ce9b472` Update docs/source/getting_started.md (AlannaBurke, Oct 14, 2025)
- `e998d94` Merge branch 'getting-started' of github.com:meta-pytorch/forge into … (AlannaBurke, Oct 15, 2025)
- `c89393c` Updating gpu references. (AlannaBurke, Oct 15, 2025)
- `7a31e26` Updating toctree entries. (AlannaBurke, Oct 15, 2025)
- `af4eae7` Removing FAQs (AlannaBurke, Oct 15, 2025)
- `9d49ee6` Removing FAQ references. (AlannaBurke, Oct 15, 2025)
- `c410375` Update docs/source/getting_started.md (AlannaBurke, Oct 15, 2025)
- `6c70c8f` Merge branch 'main' into getting-started (AlannaBurke, Oct 15, 2025)
- `f9b136a` docs: Improve homepage and getting started pages (AlannaBurke, Oct 17, 2025)
- `c41f035` Updating index and getting started pages. (AlannaBurke, Oct 17, 2025)
- `e39f29b` Removing broken links and references to concepts. (AlannaBurke, Oct 17, 2025)
- `c6c309a` Minor doc changes. (AlannaBurke, Oct 18, 2025)
- `babf5bb` Updating download instructions. (AlannaBurke, Oct 20, 2025)
- `b02389f` Updating download instructions. (AlannaBurke, Oct 20, 2025)
- `d904304` Fixing errors. (AlannaBurke, Oct 21, 2025)
- `b36ea66` Fixing typo causing error. (AlannaBurke, Oct 21, 2025)
- `f213765` Removing concepts page entirely. (AlannaBurke, Oct 21, 2025)
- `af774cf` Update docs/source/getting_started.md (svekars, Oct 21, 2025)
4 changes: 0 additions & 4 deletions docs/source/concepts.md

This file was deleted.

3 changes: 2 additions & 1 deletion docs/source/conf.py
@@ -140,8 +140,8 @@ def get_version_path():
    "navbar_center": "navbar-nav",
    "canonical_url": "https://meta-pytorch.org/forge/",
    "header_links_before_dropdown": 7,
    "show_nav_level": 2,
    "show_toc_level": 2,
    "navigation_depth": 3,
}

theme_variables = pytorch_sphinx_theme2.get_theme_variables()
@@ -173,6 +173,7 @@ def get_version_path():
    "colon_fence",
    "deflist",
    "html_image",
    "substitution",
]

# Configure MyST parser to treat mermaid code blocks as mermaid directives
284 changes: 278 additions & 6 deletions docs/source/getting_started.md
@@ -1,9 +1,281 @@

**Removed** (old introduction):

# Get Started

Welcome to TorchForge! This guide will help you get up and running with TorchForge, a PyTorch-native platform specifically designed for post-training generative AI models.

TorchForge specializes in post-training techniques for large language models, including:

- **Supervised Fine-Tuning (SFT)**: Adapt pre-trained models to specific tasks using labeled data
- **Group Relative Policy Optimization (GRPO)**: Advanced reinforcement learning for model alignment
- **Multi-GPU Distributed Training**: Efficient scaling across multiple GPUs and nodes

**Added** (new page content):

# Getting Started

This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job.

## System Requirements

Before installing TorchForge, ensure your system meets the following requirements.

| Component | Requirement | Notes |
|-----------|-------------|-------|
| **Operating System** | Linux (Fedora/Ubuntu/Debian) | macOS and Windows not currently supported |
| **Python** | 3.10 or higher | Python 3.11 recommended |
| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported |
| **Minimum GPUs** | 2+ for SFT, 3+ for GRPO | More GPUs enable larger models |
| **CUDA** | 12.8 | Required for GPU training |
| **RAM** | 32GB+ recommended | Depends on model size |
| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints |
| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) |
| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system |
| **vLLM** | v0.10.0+ | Fast inference with PagedAttention |
| **TorchTitan** | Latest | Production training infrastructure |
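
If you want to sanity-check a machine against this table before installing anything, a few standard shell commands cover most of it (illustrative only; exact output varies by distribution):

```bash
python3 --version      # expect Python 3.10 or newer
nvidia-smi             # lists visible GPUs and the driver's supported CUDA version
df -h .                # free disk space where you plan to clone and train
free -g                # available RAM in GB
```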


## Prerequisites

- **Conda or Miniconda**: For environment management
- Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html)

- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies
- Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation)
- After installing, authenticate with: `gh auth login`
- You can use either HTTPS or SSH as the authentication protocol

- **Git**: For cloning the repository
- Usually pre-installed on Linux systems
- Verify with: `git --version`
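
To confirm the prerequisites above are in place, you can run these quick checks (assuming a standard Linux shell):

```bash
conda --version        # Conda or Miniconda is installed
gh auth status         # GitHub CLI is installed and authenticated
git --version          # Git is installed
```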


**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included.

## Installation

TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable.

1. **Clone the Repository**

```bash
git clone https://github.com/meta-pytorch/forge.git
cd forge
```

2. **Create Conda Environment**

```bash
conda create -n forge python=3.10
conda activate forge
```

3. **Run Installation Script**
> **Contributor comment:** this may be ok for now, but even possibly as soon as EOD today we may have different instructions cc @joecummings. If we keep a script, what the script does will be different. I think we can ship this for now, and update this once we're done.


```bash
./scripts/install.sh
```

The installation script will:
- Install system dependencies using DNF (or your package manager)
- Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan
- Install TorchForge and all Python dependencies
- Configure the environment for GPU training

```{tip}
**Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use:
`./scripts/install.sh --use-sudo`
```

```{warning}
When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM.
```
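
For example, after adding a package to `pyproject.toml` by hand, re-sync with the flag so the pre-installed wheels stay in place:

```bash
# Re-sync the environment after editing pyproject.toml.
# --inexact keeps packages that uv did not install itself (e.g. the
# Monarch and vLLM wheels) instead of uninstalling them.
uv sync --inexact
```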

## Verifying Your Setup

After installation, verify that all components are working correctly:

1. **Check GPU Availability**

```bash
python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')"
```

Expected output: `GPUs available: 2` (or more)

2. **Check CUDA Version**

```bash
python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
```

Expected output: `CUDA version: 12.8`

3. **Check All Dependencies**

```bash
# Check core components
python -c "import torch, forge, monarch, vllm; print('All imports successful')"

# Check specific versions
python -c "
import torch
import forge
import vllm

print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'vLLM: {vllm.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

4. **Verify Monarch**

```bash
python -c "
from monarch.actor import Actor, this_host

# Test basic Monarch functionality
procs = this_host().spawn_procs({'gpus': 1})
print('Monarch: Process spawning works')
"
```

## Quick Start Examples

Now that TorchForge is installed, let's run some training examples.

Here's what the complete workflow looks like with TorchForge, from environment setup through launching training:

```bash
# Install dependencies
conda create -n forge python=3.10
conda activate forge
git clone https://github.com/meta-pytorch/forge
cd forge
./scripts/install.sh

# Download a model
hf download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir /tmp/Meta-Llama-3.1-8B-Instruct --exclude "original/consolidated.00.pth"

# Run SFT training (requires 2+ GPUs)
uv run forge run --nproc_per_node 2 \
apps/sft/main.py --config apps/sft/llama3_8b.yaml

# Run GRPO training (requires 3+ GPUs)
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

### Example 1: Supervised Fine-Tuning (SFT)

Fine-tune Llama 3 8B on your data. **Requires: 2+ GPUs**

1. **Access the Model**

```{note}
Model downloads are no longer required, but Hugging Face authentication is still needed to access the models.

Run `huggingface-cli login` first if you haven't already.
```
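
If you haven't authenticated yet, this is all that's needed (you'll be prompted for a Hugging Face access token):

```bash
huggingface-cli login
```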

2. **Run Training**

```bash
uv run forge run --nproc_per_node 2 \
  apps/sft/main.py --config apps/sft/llama3_8b.yaml
```

**What's Happening:**
- `--nproc_per_node 2`: Use 2 GPUs for training
- `apps/sft/main.py`: SFT training script
- `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters
- **TorchTitan** handles model sharding across the 2 GPUs
- **Monarch** coordinates the distributed training
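
If your node has more GPUs, the same launch form scales by changing the process count (assuming the node actually has that many GPUs and the config's sharding settings allow it):

```bash
# Same SFT job sharded across 4 GPUs instead of 2
uv run forge run --nproc_per_node 4 \
  apps/sft/main.py --config apps/sft/llama3_8b.yaml
```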

### Example 2: GRPO Training

Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs**

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

**What's Happening:**
- GPU 0: Trainer model (being trained, powered by TorchTitan)
- GPU 1: Reference model (frozen baseline, powered by TorchTitan)
- GPU 2: Policy model (scoring outputs, powered by vLLM)
- **Monarch** orchestrates all three components
- **TorchStore** handles weight synchronization from training to inference
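
To see the three roles occupying separate GPUs while the job runs, standard NVIDIA tooling is enough (not TorchForge-specific):

```bash
# Refresh per-GPU memory and utilization every second during GRPO training
watch -n 1 nvidia-smi
```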

## Understanding Configuration Files

TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config:

```yaml
# Example: apps/sft/llama3_8b.yaml
model:
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
  path: /tmp/Meta-Llama-3.1-8B-Instruct

training:
  batch_size: 4
  learning_rate: 1e-5
  num_epochs: 10
  gradient_accumulation_steps: 4

distributed:
  strategy: fsdp  # Managed by TorchTitan
  precision: bf16

checkpointing:
  save_interval: 1000
  output_dir: /tmp/checkpoints
```

**Key Sections:**
- **model**: Model path and settings
- **training**: Hyperparameters like batch size and learning rate
- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan
- **checkpointing**: Where and when to save model checkpoints
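
Because these are plain YAML files, you can also inspect or tweak them programmatically before launching. A minimal sketch using PyYAML (the key names come from the example config above, not necessarily the real shipped file, and TorchForge may use its own config loader internally):

```bash
python -c "
import yaml

with open('apps/sft/llama3_8b.yaml') as f:
    cfg = yaml.safe_load(f)

# Print the training hyperparameters section
print(cfg['training'])
"
```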

## Next Steps

Now that you have TorchForge installed and verified:

1. **Explore Examples**: Check the `apps/` directory for more training examples
2. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides
3. **API Documentation**: Explore {doc}`api` for detailed API reference

## Getting Help

If you encounter issues:

1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues)
2. **File a Bug Report**: Create a new issue with:
- Your system configuration (output of diagnostic command below)
- Full error message
- Steps to reproduce
- Expected vs actual behavior

**Diagnostic command:**
> **Contributor comment:** this is a really good idea - let's keep this, and I think we should come up with a script for this in our issue templates. cc @joecummings @daniellepintz ? not sure who to tag here

```bash
python -c "
import torch
import forge

try:
    import monarch
    monarch_status = 'OK'
except Exception as e:
    monarch_status = str(e)

try:
    import vllm
    vllm_version = vllm.__version__
except Exception as e:
    vllm_version = str(e)

print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'Monarch: {monarch_status}')
print(f'vLLM: {vllm_version}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

Include this output in your bug reports!

## Additional Resources

- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md)
- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md)
- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch)
- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai)
- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan)