8 changes: 6 additions & 2 deletions README.md
@@ -6,6 +6,10 @@ An e2e framework for creating, deploying and using isolated execution environments
[![Discord](https://img.shields.io/badge/Discord-OpenEnv-7289da?style=flat&logo=discord&logoColor=white)](https://discord.gg/YsTYBh6PD9)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-pytorch/OpenEnv/blob/main/examples/OpenEnv_Tutorial.ipynb) **← Try the Interactive Tutorial!**

---

**🚀 Featured Example:** Train LLMs to play BlackJack using [torchforge](https://github.com/meta-pytorch/torchforge) (PyTorch's agentic RL framework): [`examples/grpo_blackjack/`](examples/grpo_blackjack/)

## OpenEnv on partner platforms:

- [Lightning AI Studio](https://lightning.ai/environments?section=featured)
@@ -178,10 +182,10 @@ client.close() # Stops and removes container
- smolagents (for coding environment)

## Supported RL Tools
The goal of this project is to support a broad set of open and closed tools to help standardize the agentic RL community. If you have a project that supports OpenEnv environments, please put up a PR to add your tool name along with a link to your documentation.

### torchforge
See the GRPO BlackJack training example: [`examples/grpo_blackjack/`](examples/grpo_blackjack/)

### TRL
(coming soon)
191 changes: 191 additions & 0 deletions examples/grpo_blackjack/README.md
@@ -0,0 +1,191 @@
# Training LLMs to Play BlackJack with GRPO + OpenEnv

This example demonstrates how to train language models to play BlackJack using **GRPO (Group Relative Policy Optimization)** and **OpenEnv**.

## 🎯 What This Example Shows

- **OpenEnv**: Universal RL environment interface for 70+ environments
- **GRPO**: Efficient RL algorithm (used by DeepSeek R1) that needs only two models (policy and reference) instead of three, since it drops the separate critic model
- **Forge**: PyTorch-native agentic RL library for production training
- **End-to-End Training**: From random policy (~35% win rate) to trained agent

## 📁 Files

- `grpo_blackjack_tutorial.ipynb` - Interactive tutorial notebook (recommended starting point)
- `grpo_utils.py` - Production GRPO utilities and helper functions
- `blackjack.yaml` - Training configuration file
- `README.md` - This file

## 🚀 Quick Start

### Prerequisites

1. **Install OpenEnv**:
```bash
# Clone OpenEnv repo
git clone https://github.com/meta-pytorch/OpenEnv.git
cd OpenEnv
pip install -e .
```

2. **Install Forge** (PyTorch's agentic RL library):
```bash
git clone https://github.com/meta-pytorch/torchforge.git
cd torchforge
pip install -e .
```

3. **Start OpenEnv BlackJack Server**:
```bash
# In a separate terminal
export OPENENV_PATH="/path/to/OpenEnv/src"
export PYTHONPATH="${OPENENV_PATH}:${PYTHONPATH}"

OPENSPIEL_GAME=blackjack python -m envs.openspiel_env.server.app --port 8004
```

### Run the Tutorial

Open the Jupyter notebook:
```bash
jupyter notebook grpo_blackjack_tutorial.ipynb
```

Follow the cells to:
1. **Explore OpenEnv** - Connect to BlackJack environment
2. **Benchmark baseline** - Test random policy performance
3. **Learn about GRPO** - Understand the training algorithm
4. **Train with Forge** - Run production GRPO training
5. **Switch environments** - See how to train on other games

## 📚 What You'll Learn

### OpenEnv: Universal RL Environment Spec

OpenEnv is **not a game engine** - it's a **specification** that wraps ANY RL environment:

```python
# Same interface works for 70+ environments
result = env.reset() # Start episode
result = env.step(action) # Take action
state = env.state() # Get state
env.close() # Cleanup
```

Change one environment variable → train on different games!
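
For instance, here is a minimal random-policy baseline against the BlackJack server started above. This is a sketch only: the `OpenSpielEnv` / `OpenSpielAction` names, the `base_url` argument, and the result fields are assumptions, so check the OpenEnv source if your version differs.

```python
# Hedged sketch, not the notebook's exact code: OpenSpielEnv, OpenSpielAction,
# base_url, and the .observation/.reward/.done result fields are assumed names.
import random

from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

env = OpenSpielEnv(base_url="http://localhost:8004")

episodes, wins = 100, 0
for _ in range(episodes):
    result = env.reset()
    while not result.done:
        # Pick HIT or STAND uniformly at random from the legal actions
        action_id = random.choice(result.observation.legal_actions)
        result = env.step(OpenSpielAction(action_id=action_id, game_name="blackjack"))
    wins += int(result.reward > 0)

print(f"Random-policy win rate: {wins / episodes:.0%}")  # roughly ~35%
env.close()
```

With a uniform random policy you should land near the ~35% win rate quoted above.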

### Forge: PyTorch-Native Agentic RL

Forge handles all distributed systems complexity:
- **Generator (vLLM)**: Fast LLM inference
- **RLTrainer**: Distributed training with FSDP
- **ReplayBuffer**: Off-policy learning
- **ReferenceModel**: KL penalty computation
- **Torchstore**: Distributed weight management

You just write:
```python
trainer = await setup_forge_training("blackjack.yaml")
await trainer.run(steps=100)
```

Everything else is automated!

## 🎓 Educational Resources

This tutorial is inspired by the excellent [Unsloth RL Guide](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide). We highly recommend reading it for deeper insights!

### Further Reading

- **OpenEnv**: [GitHub](https://github.com/meta-pytorch/OpenEnv)
- **GRPO Paper**: [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
- **Forge**: [GitHub](https://github.com/meta-pytorch/torchforge) | [Docs](https://meta-pytorch.org/torchforge/)
- **Unsloth RL Guide**: [docs.unsloth.ai](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide)

## 💡 Key Concepts

### "Patience Is All You Need" for RL

RL works by patience: if the correct answer has *any* non-zero probability, we'll eventually find it through sampling. While waiting:
1. Learn from **bad answers** → decrease their probability
2. When finding **good answers** → increase their probability

Over time, the model learns not just *what* to do, but *why* (reasoning process).
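
A minimal sketch of the group-relative advantage behind this idea: sample a group of responses for the same prompt, score them, and normalize within the group so above-average answers get positive advantages. The tutorial's `ComputeAdvantages` in `grpo_utils.py` may differ in details.

```python
# Group-relative advantages, the core GRPO trick: no learned critic, just
# normalize each response's reward against the other responses sampled for
# the same prompt.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape [group_size], one reward per sampled response."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled decisions for one game state: one shaped win, two losses, one push.
print(group_relative_advantages(torch.tensor([2.0, -1.0, -1.0, 0.5])))
# The winning response gets a positive advantage; the losses get negative ones.
```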

### Reward Functions

Reward functions tell the model what's good/bad. For BlackJack:

```python
def evaluate_response(prompt, response, game_reward):
    reward = float(game_reward)  # +1 (win), -1 (loss), 0 (push)

    # Reward shaping
    if game_reward > 0:
        reward = 2.0  # Wins more valuable
    elif game_reward == 0:
        reward = 0.5  # Pushes better than losses

    return reward
```

The key: **Reward functions must be verifiable**. You can verify "is the answer correct?" but not "is this creative?"

## 🔄 Switching to Other Games

The beauty of OpenEnv: **same code works for any environment!**

### Try Tic-Tac-Toe
```bash
OPENSPIEL_GAME=tic_tac_toe python -m envs.openspiel_env.server.app --port 8005
```
Update `blackjack.yaml`: `server_url: "http://localhost:8005"`

### Try Chess
```bash
OPENSPIEL_GAME=chess python -m envs.openspiel_env.server.app --port 8006
```

### Try Atari
```bash
python -m envs.atari_env.server.app --game pong --port 8007
```

Everything else stays the same! Same GRPO code, same Forge infrastructure.
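
Under the same assumptions as the baseline sketch earlier, the client-side change is just the URL and game name:

```python
# Same client calls as the BlackJack baseline above, pointed at the tic-tac-toe
# server; class and field names are the same assumptions as before.
import random

from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

env = OpenSpielEnv(base_url="http://localhost:8005")
result = env.reset()
while not result.done:
    action_id = random.choice(result.observation.legal_actions)
    result = env.step(OpenSpielAction(action_id=action_id, game_name="tic_tac_toe"))
env.close()
```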

## 🛠️ Customization

All code is in `grpo_utils.py`:
- Modify `BlackJackReward.evaluate_response()` for reward shaping
- Adjust `ComputeAdvantages.compute()` for advantage computation
- Tweak `simple_grpo_loss()` for KL penalty (`beta` parameter); a sketch follows this list
- Change `format_prompt()` for different prompt templates
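
As an illustration of where the `beta` KL penalty enters, here is a hedged sketch of a simple GRPO-style loss; the actual `simple_grpo_loss()` may handle masking, clipping, and the KL estimator differently.

```python
# Hedged sketch of a simple GRPO-style loss with a KL penalty toward the
# reference model; not the exact implementation in grpo_utils.py.
import torch

def grpo_loss_sketch(
    logprobs: torch.Tensor,      # [batch, seq] policy log-probs of response tokens
    ref_logprobs: torch.Tensor,  # [batch, seq] frozen reference-model log-probs
    advantages: torch.Tensor,    # [batch, 1] group-relative advantages
    mask: torch.Tensor,          # [batch, seq] 1 for response tokens, 0 for padding
    beta: float = 0.1,           # strength of the KL penalty
) -> torch.Tensor:
    # k3 KL estimator: always non-negative, low variance
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1
    # Policy-gradient surrogate: push up tokens of above-average responses
    pg = torch.exp(logprobs - logprobs.detach()) * advantages
    per_token_loss = -(pg - beta * kl)
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```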

Edit `blackjack.yaml` for:
- Different model sizes (1B to 70B+)
- More training steps
- Larger group sizes
- Parallel rollout collection

## 📊 Expected Results

- **Random policy**: ~35% win rate
- **After GRPO training**: Improves toward optimal BlackJack strategy (~43% win rate)
- **Training time**: Varies based on model size and training steps

The model learns both strategy AND reasoning process (similar to DeepSeek R1's `<think>` tokens).

## 🤝 Credits

- **OpenEnv**: Meta PyTorch team
- **Forge**: Meta PyTorch team
- **GRPO**: DeepSeek research team
- **Tutorial inspiration**: Unsloth team

## 📝 License

This example follows the same license as the parent OpenEnv repository.

## 🙏 Acknowledgments

Big thanks to the **Unsloth team** for their educational approach to RL! This tutorial's GRPO section is heavily inspired by their excellent guide.
155 changes: 155 additions & 0 deletions examples/grpo_blackjack/blackjack.yaml
@@ -0,0 +1,155 @@
# BlackJack GRPO Training Configuration
# >>> python -m apps.grpo.blackjack_main --config apps/grpo/blackjack.yaml
#
# Prerequisites:
# 1. Start BlackJack server:
# cd /path/to/OpenEnv
# export PYTHONPATH="/path/to/OpenEnv/src:${PYTHONPATH}"
# OPENSPIEL_GAME=blackjack python -m envs.openspiel_env.server.app
#
# 2. Run training:
# python -m apps.grpo.blackjack_main --config apps/grpo/blackjack.yaml

# Global configuration
group_size: 4 # Number of parallel games per rollout
local_batch_size: 8 # Per-device batch size
max_req_tokens: 512 # Max tokens for prompt (BlackJack prompts are ~200-300 tokens)
max_res_tokens: 32 # Max tokens for response (just "HIT" or "STAND" + thinking)
model: "Qwen/Qwen3-1.7B"
off_by_n: 1 # Off-policy tolerance

# Main loop configuration
rollout_threads: 1 # Number of parallel rollout threads

# Observability configuration
metric_logging:
  wandb:
    project: "blackjack-grpo-tutorial"
    group: "blackjack_exp_${oc.env:USER}"
    reduce_across_ranks: True
  console:
    reduce_across_ranks: True

# BlackJack environment configuration
blackjack_env:
  server_url: "http://localhost:8004"
  model: ${model}

# Policy configuration (generator)
policy:
  engine_args: # https://docs.vllm.ai/en/v0.10.0/api/vllm/engine/arg_utils.html#vllm.engine.arg_utils.EngineArgs
    model: ${model}
    tensor_parallel_size: 1
    pipeline_parallel_size: 1
    enforce_eager: false
  sampling_params: # https://docs.vllm.ai/en/v0.10.0/api/vllm/sampling_params.html#vllm.sampling_params.SamplingParams
    n: 1 # Generate 1 response per game state (not group_size, since we play full games)
    max_tokens: ${max_res_tokens}
    temperature: 1.0
    top_p: 1.0

# Trainer configuration
trainer:
  model:
    name: qwen3
    flavor: 1.7B
    hf_assets_path: hf://${model}
  optimizer:
    name: AdamW
    lr: 1e-5
    eps: 1e-8
  lr_scheduler:
    warmup_steps: 1
  training:
    local_batch_size: ${local_batch_size}
    seq_len: 1024 # Shorter than GSM8K since BlackJack episodes are shorter
    max_norm: 1.0
    steps: 1000 # Tutorial: 1000 steps (increase for production)
    dtype: bfloat16
    gc_freq: 1
  compile:
    enable: false
  parallelism:
    data_parallel_replicate_degree: 1
    data_parallel_shard_degree: 1
    tensor_parallel_degree: 1
    pipeline_parallel_degree: 1
    context_parallel_degree: 1
    expert_parallel_degree: 1
    disable_loss_parallel: true
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"
  activation_checkpoint:
    mode: selective
    selective_ac_option: op

# Replay buffer configuration
replay_buffer:
  batch_size: ${local_batch_size}
  max_policy_age: ${off_by_n}
  dp_size: ${trainer.parallelism.data_parallel_shard_degree}

# Reference model configuration
ref_model:
  model:
    name: qwen3
    flavor: 1.7B
    hf_assets_path: hf://${model}
  training:
    seq_len: ${trainer.training.seq_len}
    dtype: bfloat16
    gc_freq: 1
  compile:
    enable: false
  parallelism:
    data_parallel_replicate_degree: 1
    data_parallel_shard_degree: 1
    tensor_parallel_degree: 1
    pipeline_parallel_degree: 1
    context_parallel_degree: 1
    expert_parallel_degree: 1
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true

# All resource allocations
services:
  policy:
    procs: ${policy.engine_args.tensor_parallel_size}
    num_replicas: 1
    mesh_name: policy
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    mesh_name: ref_model
    with_gpus: true
  reward_actor:
    procs: 1
    num_replicas: 1
    mesh_name: reward_actor
    with_gpus: false

actors:
  blackjack_env:
    procs: 1
    with_gpus: false
    mesh_name: blackjack_env
  trainer:
    procs: 1
    with_gpus: true
    mesh_name: trainer
  replay_buffer:
    procs: 1
    with_gpus: false
    mesh_name: replay_buffer
  compute_advantages:
    procs: 1
    with_gpus: false
    mesh_name: compute_advantages