-
Notifications
You must be signed in to change notification settings - Fork 20
Docs Content Part 1: Homepage and getting started #448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
+467
−24
Merged
Changes from all commits
Commits
Show all changes
28 commits
Select commit
Hold shift + click to select a range
b1794a9
Updating homepage, getting started, concepts.
AlannaBurke 087e2ff
Update documentation with blog post insights: enhanced homepage, comp…
AlannaBurke a0b2412
Update docs/source/getting_started.md
AlannaBurke b6d466c
Update docs/source/index.md
AlannaBurke b564175
Update docs/source/index.md
AlannaBurke 92ca627
Minor fixes and updates.
AlannaBurke f4b951b
Merge branch 'getting-started' of github.com:meta-pytorch/forge into …
AlannaBurke 34640e7
Update docs/source/getting_started.md
AlannaBurke 32c8d78
Restructing info.
AlannaBurke e448c90
Merge branch 'main' of github.com:meta-pytorch/forge into getting-sta…
AlannaBurke ce9b472
Update docs/source/getting_started.md
AlannaBurke e998d94
Merge branch 'getting-started' of github.com:meta-pytorch/forge into …
AlannaBurke c89393c
Updating gpu references.
AlannaBurke 7a31e26
Updating toctree entries.
AlannaBurke af4eae7
Removing FAQs
AlannaBurke 9d49ee6
Removing FAQ references.
AlannaBurke c410375
Update docs/source/getting_started.md
AlannaBurke 6c70c8f
Merge branch 'main' into getting-started
AlannaBurke f9b136a
docs: Improve homepage and getting started pages
AlannaBurke c41f035
Updating index and getting started pages.
AlannaBurke e39f29b
Removing broken links and references to concepts.
AlannaBurke c6c309a
Minor doc changes.
AlannaBurke babf5bb
Updating download instructions.
AlannaBurke b02389f
Updating download instructions.
AlannaBurke d904304
Fixing errors.
AlannaBurke b36ea66
Fixing typo causing error.
AlannaBurke f213765
Removing concepts page entirely.
AlannaBurke af774cf
Update docs/source/getting_started.md
svekars File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file was deleted.
Oops, something went wrong.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,281 @@ | ||
# Get Started | ||
# Getting Started | ||
|
||
Welcome to TorchForge! This guide will help you get up and running with TorchForge, a PyTorch-native platform specifically designed for post-training generative AI models. | ||
This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job. | ||
|
||
TorchForge specializes in post-training techniques for large language models, including: | ||
## System Requirements | ||
|
||
- **Supervised Fine-Tuning (SFT)**: Adapt pre-trained models to specific tasks using labeled data | ||
- **Group Relative Policy Optimization (GRPO)**: Advanced reinforcement learning for model alignment | ||
- **Multi-GPU Distributed Training**: Efficient scaling across multiple GPUs and nodes | ||
Before installing TorchForge, ensure your system meets the following requirements. | ||
|
||
| Component | Requirement | Notes | | ||
|-----------|-------------|-------| | ||
| **Operating System** | Linux (Fedora/Ubuntu/Debian) | MacOS and Windows not currently supported | | ||
| **Python** | 3.10 or higher | Python 3.11 recommended | | ||
| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported | | ||
| **Minimum GPUs** | 2+ for SFT, 3+ for GRPO | More GPUs enable larger models | | ||
| **CUDA** | 12.8 | Required for GPU training | | ||
| **RAM** | 32GB+ recommended | Depends on model size | | ||
| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints | | ||
| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) | | ||
| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system | | ||
| **vLLM** | v0.10.0+ | Fast inference with PagedAttention | | ||
| **TorchTitan** | Latest | Production training infrastructure | | ||
|
||
|
||
## Prerequisites | ||
|
||
- **Conda or Miniconda**: For environment management | ||
- Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html) | ||
|
||
- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies | ||
- Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation) | ||
- After installing, authenticate with: `gh auth login` | ||
- You can use either HTTPS or SSH as the authentication protocol | ||
|
||
- **Git**: For cloning the repository | ||
- Usually pre-installed on Linux systems | ||
- Verify with: `git --version` | ||
|
||
|
||
**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included. | ||
|
||
## Installation | ||
|
||
TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable. | ||
|
||
1. **Clone the Repository** | ||
|
||
```bash | ||
git clone https://github.com/meta-pytorch/forge.git | ||
cd forge | ||
``` | ||
|
||
2. **Create Conda Environment** | ||
|
||
```bash | ||
conda create -n forge python=3.10 | ||
conda activate forge | ||
``` | ||
|
||
3. **Run Installation Script** | ||
|
||
```bash | ||
./scripts/install.sh | ||
``` | ||
|
||
The installation script will: | ||
- Install system dependencies using DNF (or your package manager) | ||
- Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan | ||
- Install TorchForge and all Python dependencies | ||
- Configure the environment for GPU training | ||
|
||
```{tip} | ||
**Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use: | ||
`./scripts/install.sh --use-sudo` | ||
``` | ||
|
||
```{warning} | ||
When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM. | ||
``` | ||
|
||
## Verifying Your Setup | ||
|
||
After installation, verify that all components are working correctly: | ||
|
||
1. **Check GPU Availability** | ||
|
||
```bash | ||
python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')" | ||
``` | ||
|
||
Expected output: `GPUs available: 2` (or more) | ||
|
||
2. **Check CUDA Version** | ||
|
||
```bash | ||
python -c "import torch; print(f'CUDA version: {torch.version.cuda}')" | ||
``` | ||
|
||
Expected output: `CUDA version: 12.8` | ||
3. **Check All Dependencies** | ||
|
||
```bash | ||
# Check core components | ||
python -c "import torch, forge, monarch, vllm; print('All imports successful')" | ||
|
||
# Check specific versions | ||
python -c " | ||
import torch | ||
import forge | ||
import vllm | ||
|
||
print(f'PyTorch: {torch.__version__}') | ||
print(f'TorchForge: {forge.__version__}') | ||
print(f'vLLM: {vllm.__version__}') | ||
print(f'CUDA: {torch.version.cuda}') | ||
print(f'GPUs: {torch.cuda.device_count()}') | ||
" | ||
``` | ||
|
||
4. **Verify Monarch** | ||
|
||
```bash | ||
python -c " | ||
from monarch.actor import Actor, this_host | ||
|
||
# Test basic Monarch functionality | ||
procs = this_host().spawn_procs({'gpus': 1}) | ||
print('Monarch: Process spawning works') | ||
" | ||
``` | ||
|
||
## Quick Start Examples | ||
|
||
Now that TorchForge is installed, let's run some training examples. | ||
|
||
Here's what training looks like with TorchForge: | ||
|
||
```bash | ||
# Install dependencies | ||
conda create -n forge python=3.10 | ||
conda activate forge | ||
git clone https://github.com/meta-pytorch/forge | ||
cd forge | ||
./scripts/install.sh | ||
|
||
# Download a model | ||
hf download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir /tmp/Meta-Llama-3.1-8B-Instruct --exclude "original/consolidated.00.pth" | ||
|
||
# Run SFT training (requires 2+ GPUs) | ||
uv run forge run --nproc_per_node 2 \ | ||
apps/sft/main.py --config apps/sft/llama3_8b.yaml | ||
|
||
# Run GRPO training (requires 3+ GPUs) | ||
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml | ||
``` | ||
|
||
### Example 1: Supervised Fine-Tuning (SFT) | ||
|
||
Fine-tune Llama 3 8B on your data. **Requires: 2+ GPUs** | ||
|
||
1. **Access the Model** | ||
|
||
```{note} | ||
Model downloads are no longer required, but Hugging Face authentication is required to access the models. | ||
|
||
Run `huggingface-cli login` first if you haven't already. | ||
``` | ||
|
||
2. **Run Training** | ||
|
||
```bash | ||
python -m apps.sft.main --config apps/sft/llama3_8b.yaml | ||
``` | ||
|
||
**What's Happening:** | ||
- `--nproc_per_node 2`: Use 2 GPUs for training | ||
- `apps/sft/main.py`: SFT training script | ||
- `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters | ||
- **TorchTitan** handles model sharding across the 2 GPUs | ||
- **Monarch** coordinates the distributed training | ||
|
||
### Example 2: GRPO Training | ||
|
||
Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs** | ||
|
||
```bash | ||
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml | ||
``` | ||
|
||
**What's Happening:** | ||
- GPU 0: Trainer model (being trained, powered by TorchTitan) | ||
- GPU 1: Reference model (frozen baseline, powered by TorchTitan) | ||
- GPU 2: Policy model (scoring outputs, powered by vLLM) | ||
- **Monarch** orchestrates all three components | ||
- **TorchStore** handles weight synchronization from training to inference | ||
|
||
## Understanding Configuration Files | ||
|
||
TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config: | ||
|
||
```yaml | ||
# Example: apps/sft/llama3_8b.yaml | ||
model: | ||
name: meta-llama/Meta-Llama-3.1-8B-Instruct | ||
path: /tmp/Meta-Llama-3.1-8B-Instruct | ||
|
||
training: | ||
batch_size: 4 | ||
learning_rate: 1e-5 | ||
num_epochs: 10 | ||
gradient_accumulation_steps: 4 | ||
|
||
distributed: | ||
strategy: fsdp # Managed by TorchTitan | ||
precision: bf16 | ||
|
||
checkpointing: | ||
save_interval: 1000 | ||
output_dir: /tmp/checkpoints | ||
``` | ||
|
||
**Key Sections:** | ||
- **model**: Model path and settings | ||
- **training**: Hyperparameters like batch size and learning rate | ||
- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan | ||
- **checkpointing**: Where and when to save model checkpoints | ||
|
||
## Next Steps | ||
|
||
Now that you have TorchForge installed and verified: | ||
|
||
1. **Explore Examples**: Check the `apps/` directory for more training examples | ||
2. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides | ||
3. **API Documentation**: Explore {doc}`api` for detailed API reference | ||
|
||
## Getting Help | ||
|
||
If you encounter issues: | ||
|
||
1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues) | ||
2. **File a Bug Report**: Create a new issue with: | ||
- Your system configuration (output of diagnostic command below) | ||
- Full error message | ||
- Steps to reproduce | ||
- Expected vs actual behavior | ||
|
||
**Diagnostic command:** | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is a really good idea - let's keep this, and I think we should come up with a script for this in our issue templates.. cc @joecummings @daniellepintz ? not sure who to tag here |
||
```bash | ||
python -c " | ||
import torch | ||
import forge | ||
|
||
try: | ||
import monarch | ||
monarch_status = 'OK' | ||
except Exception as e: | ||
monarch_status = str(e) | ||
|
||
try: | ||
import vllm | ||
vllm_version = vllm.__version__ | ||
except Exception as e: | ||
vllm_version = str(e) | ||
|
||
print(f'PyTorch: {torch.__version__}') | ||
print(f'TorchForge: {forge.__version__}') | ||
print(f'Monarch: {monarch_status}') | ||
print(f'vLLM: {vllm_version}') | ||
print(f'CUDA: {torch.version.cuda}') | ||
print(f'GPUs: {torch.cuda.device_count()}') | ||
" | ||
``` | ||
|
||
Include this output in your bug reports! | ||
|
||
## Additional Resources | ||
|
||
- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) | ||
- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) | ||
- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch) | ||
- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai) | ||
- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan) |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may be ok for now, but even possibly as soon as EOD today we may have different instructions cc @joecummings
If we keep a script, what the script does will be different. I think we can ship this for now, and update this once we're done