8 changes: 6 additions & 2 deletions README.md
@@ -6,6 +6,10 @@ An e2e framework for creating, deploying and using isolated execution environments
[![Discord](https://img.shields.io/badge/Discord-OpenEnv-7289da?style=flat&logo=discord&logoColor=white)](https://discord.gg/YsTYBh6PD9)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-pytorch/OpenEnv/blob/main/examples/OpenEnv_Tutorial.ipynb) **← Try the Interactive Tutorial!**

---

**🚀 Featured Example:** Train LLMs to play BlackJack using [torchforge](https://github.com/meta-pytorch/torchforge) (PyTorch's agentic RL framework): [`examples/grpo_blackjack/`](examples/grpo_blackjack/)

## OpenEnv on partner platforms:

- [Lightning AI Studio](https://lightning.ai/environments?section=featured)
@@ -178,10 +182,10 @@ client.close() # Stops and removes container
- smolagents (for coding environment)

## Supported RL Tools
The goal of this project is to support a broad set of open and closed tools to help standardize the agentic RL community. If you have a project that supports OpenEnv environments, please put up a PR to add your tool name along with a link to your documentation.

### torchforge
See the GRPO BlackJack training example: [`examples/grpo_blackjack/`](examples/grpo_blackjack/)

### TRL
(coming soon)
191 changes: 191 additions & 0 deletions examples/grpo_blackjack/README.md
@@ -0,0 +1,191 @@
# Training LLMs to Play BlackJack with GRPO + OpenEnv

This example demonstrates how to train language models to play BlackJack using **GRPO (Group Relative Policy Optimization)** and **OpenEnv**.

## 🎯 What This Example Shows

- **OpenEnv**: Universal RL environment interface for 70+ environments
- **GRPO**: Efficient RL algorithm (used by DeepSeek R1) that needs only two models (policy and reference) instead of three, since it drops the separate critic model
- **Forge**: PyTorch-native agentic RL library for production training
- **End-to-End Training**: From random policy (~35% win rate) to trained agent

## 📁 Files

- `grpo_blackjack_tutorial.ipynb` - Interactive tutorial notebook (recommended starting point)
- `grpo_utils.py` - Production GRPO utilities and helper functions
- `blackjack.yaml` - Training configuration file
- `README.md` - This file

## 🚀 Quick Start

### Prerequisites

1. **Install OpenEnv**:
```bash
# Clone OpenEnv repo
git clone https://github.com/meta-pytorch/OpenEnv.git
cd OpenEnv
pip install -e .
```

2. **Install Forge** (PyTorch's agentic RL library):
```bash
git clone https://github.com/meta-pytorch/torchforge.git
cd torchforge
pip install -e .
```

3. **Start OpenEnv BlackJack Server**:
```bash
# In a separate terminal
export OPENENV_PATH="/path/to/OpenEnv/src"
export PYTHONPATH="${OPENENV_PATH}:${PYTHONPATH}"

OPENSPIEL_GAME=blackjack python -m envs.openspiel_env.server.app --port 8004
```

### Run the Tutorial

Open the Jupyter notebook:
```bash
jupyter notebook grpo_blackjack_tutorial.ipynb
```

Follow the cells to:
1. **Explore OpenEnv** - Connect to BlackJack environment
2. **Benchmark baseline** - Test random policy performance
3. **Learn about GRPO** - Understand the training algorithm
4. **Train with Forge** - Run production GRPO training
5. **Switch environments** - See how to train on other games

## 📚 What You'll Learn

### OpenEnv: Universal RL Environment Spec

OpenEnv is **not a game engine** - it's a **specification** that wraps ANY RL environment:

```python
# Same interface works for 70+ environments
result = env.reset() # Start episode
result = env.step(action) # Take action
state = env.state() # Get state
env.close() # Cleanup
```

Change one environment variable → train on different games!
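
For instance, here is a minimal random-policy baseline against the BlackJack server started above. This is a sketch only: the `OpenSpielEnv` / `OpenSpielAction` names, the `base_url` argument, and the result fields are assumptions, so check the OpenEnv source if your version differs.

```python
# Hedged sketch, not the notebook's exact code: OpenSpielEnv, OpenSpielAction,
# base_url, and the .observation/.reward/.done result fields are assumed names.
import random

from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

env = OpenSpielEnv(base_url="http://localhost:8004")

episodes, wins = 100, 0
for _ in range(episodes):
    result = env.reset()
    while not result.done:
        # Pick HIT or STAND uniformly at random from the legal actions
        action_id = random.choice(result.observation.legal_actions)
        result = env.step(OpenSpielAction(action_id=action_id, game_name="blackjack"))
    wins += int(result.reward > 0)

print(f"Random-policy win rate: {wins / episodes:.0%}")  # roughly ~35%
env.close()
```

With a uniform random policy you should land near the ~35% win rate quoted above.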

### Forge: PyTorch-Native Agentic RL

Forge handles all distributed systems complexity:
- **Generator (vLLM)**: Fast LLM inference
- **RLTrainer**: Distributed training with FSDP
- **ReplayBuffer**: Off-policy learning
- **ReferenceModel**: KL penalty computation
- **Torchstore**: Distributed weight management

You just write:
```python
trainer = await setup_forge_training("blackjack.yaml")
await trainer.run(steps=100)
```

Everything else is automated!

## 🎓 Educational Resources

This tutorial is inspired by the excellent [Unsloth RL Guide](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide). We highly recommend reading it for deeper insights!

### Further Reading

- **OpenEnv**: [GitHub](https://github.com/meta-pytorch/OpenEnv)
- **GRPO Paper**: [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
- **Forge**: [GitHub](https://github.com/meta-pytorch/torchforge) | [Docs](https://meta-pytorch.org/torchforge/)
- **Unsloth RL Guide**: [docs.unsloth.ai](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide)

## 💡 Key Concepts

### "Patience Is All You Need" for RL

RL works by patience: if the correct answer has *any* non-zero probability, we'll eventually find it through sampling. While waiting:
1. Learn from **bad answers** → decrease their probability
2. When finding **good answers** → increase their probability

Over time, the model learns not just *what* to do, but *why* (reasoning process).
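
A minimal sketch of the group-relative advantage behind this idea: sample a group of responses for the same prompt, score them, and normalize within the group so above-average answers get positive advantages. The tutorial's `ComputeAdvantages` in `grpo_utils.py` may differ in details.

```python
# Group-relative advantages, the core GRPO trick: no learned critic, just
# normalize each response's reward against the other responses sampled for
# the same prompt.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape [group_size], one reward per sampled response."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled decisions for one game state: one shaped win, two losses, one push.
print(group_relative_advantages(torch.tensor([2.0, -1.0, -1.0, 0.5])))
# The winning response gets a positive advantage; the losses get negative ones.
```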

### Reward Functions

Reward functions tell the model what's good/bad. For BlackJack:

```python
def evaluate_response(prompt, response, game_reward):
    reward = float(game_reward)  # +1 (win), -1 (loss), 0 (push)

    # Reward shaping
    if game_reward > 0:
        reward = 2.0  # Wins more valuable
    elif game_reward == 0:
        reward = 0.5  # Pushes better than losses

    return reward
```

The key: **Reward functions must be verifiable**. You can verify "is the answer correct?" but not "is this creative?"

## 🔄 Switching to Other Games

The beauty of OpenEnv: **same code works for any environment!**

### Try Tic-Tac-Toe
```bash
OPENSPIEL_GAME=tic_tac_toe python -m envs.openspiel_env.server.app --port 8005
```
Update `blackjack.yaml`: `server_url: "http://localhost:8005"`

### Try Chess
```bash
OPENSPIEL_GAME=chess python -m envs.openspiel_env.server.app --port 8006
```

### Try Atari
```bash
python -m envs.atari_env.server.app --game pong --port 8007
```

Everything else stays the same! Same GRPO code, same Forge infrastructure.
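
Under the same assumptions as the baseline sketch earlier, the client-side change is just the URL and game name:

```python
# Same client calls as the BlackJack baseline above, pointed at the tic-tac-toe
# server; class and field names are the same assumptions as before.
import random

from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

env = OpenSpielEnv(base_url="http://localhost:8005")
result = env.reset()
while not result.done:
    action_id = random.choice(result.observation.legal_actions)
    result = env.step(OpenSpielAction(action_id=action_id, game_name="tic_tac_toe"))
env.close()
```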

## 🛠️ Customization

All code is in `grpo_utils.py`:
- Modify `BlackJackReward.evaluate_response()` for reward shaping
- Adjust `ComputeAdvantages.compute()` for advantage computation
- Tweak `simple_grpo_loss()` for KL penalty (`beta` parameter); a sketch follows this list
- Change `format_prompt()` for different prompt templates
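
As an illustration of where the `beta` KL penalty enters, here is a hedged sketch of a simple GRPO-style loss; the actual `simple_grpo_loss()` may handle masking, clipping, and the KL estimator differently.

```python
# Hedged sketch of a simple GRPO-style loss with a KL penalty toward the
# reference model; not the exact implementation in grpo_utils.py.
import torch

def grpo_loss_sketch(
    logprobs: torch.Tensor,      # [batch, seq] policy log-probs of response tokens
    ref_logprobs: torch.Tensor,  # [batch, seq] frozen reference-model log-probs
    advantages: torch.Tensor,    # [batch, 1] group-relative advantages
    mask: torch.Tensor,          # [batch, seq] 1 for response tokens, 0 for padding
    beta: float = 0.1,           # strength of the KL penalty
) -> torch.Tensor:
    # k3 KL estimator: always non-negative, low variance
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1
    # Policy-gradient surrogate: push up tokens of above-average responses
    pg = torch.exp(logprobs - logprobs.detach()) * advantages
    per_token_loss = -(pg - beta * kl)
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```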

Edit `blackjack.yaml` for:
- Different model sizes (1B to 70B+)
- More training steps
- Larger group sizes
- Parallel rollout collection

## 📊 Expected Results

- **Random policy**: ~35% win rate
- **After GRPO training**: Improves toward optimal BlackJack strategy (~43% win rate)
- **Training time**: Varies based on model size and training steps

The model learns both strategy AND reasoning process (similar to DeepSeek R1's `<think>` tokens).

## 🤝 Credits

- **OpenEnv**: Meta PyTorch team
- **Forge**: Meta PyTorch team
- **GRPO**: DeepSeek research team
- **Tutorial inspiration**: Unsloth team

## 📝 License

This example follows the same license as the parent OpenEnv repository.

## 🙏 Acknowledgments

Big thanks to the **Unsloth team** for their educational approach to RL! This tutorial's GRPO section is heavily inspired by their excellent guide.
155 changes: 155 additions & 0 deletions examples/grpo_blackjack/blackjack.yaml
@@ -0,0 +1,155 @@
# BlackJack GRPO Training Configuration
# >>> python -m apps.grpo.blackjack_main --config apps/grpo/blackjack.yaml
#
# Prerequisites:
# 1. Start BlackJack server:
# cd /path/to/OpenEnv
# export PYTHONPATH="/path/to/OpenEnv/src:${PYTHONPATH}"
# OPENSPIEL_GAME=blackjack python -m envs.openspiel_env.server.app
#
# 2. Run training:
# python -m apps.grpo.blackjack_main --config apps/grpo/blackjack.yaml

# Global configuration
group_size: 4 # Number of parallel games per rollout
local_batch_size: 8 # Per-device batch size
max_req_tokens: 512 # Max tokens for prompt (BlackJack prompts are ~200-300 tokens)
max_res_tokens: 32 # Max tokens for response (just "HIT" or "STAND" + thinking)
model: "Qwen/Qwen3-1.7B"
off_by_n: 1 # Off-policy tolerance

# Main loop configuration
rollout_threads: 1 # Number of parallel rollout threads

# Observability configuration
metric_logging:
  wandb:
    project: "blackjack-grpo-tutorial"
    group: "blackjack_exp_${oc.env:USER}"
    reduce_across_ranks: True
  console:
    reduce_across_ranks: True

# BlackJack environment configuration
blackjack_env:
  server_url: "http://localhost:8004"
  model: ${model}

# Policy configuration (generator)
policy:
  engine_args: # https://docs.vllm.ai/en/v0.10.0/api/vllm/engine/arg_utils.html#vllm.engine.arg_utils.EngineArgs
    model: ${model}
    tensor_parallel_size: 1
    pipeline_parallel_size: 1
    enforce_eager: false
  sampling_params: # https://docs.vllm.ai/en/v0.10.0/api/vllm/sampling_params.html#vllm.sampling_params.SamplingParams
    n: 1 # Generate 1 response per game state (not group_size, since we play full games)
    max_tokens: ${max_res_tokens}
    temperature: 1.0
    top_p: 1.0

# Trainer configuration
trainer:
  model:
    name: qwen3
    flavor: 1.7B
    hf_assets_path: hf://${model}
  optimizer:
    name: AdamW
    lr: 1e-5
    eps: 1e-8
  lr_scheduler:
    warmup_steps: 1
  training:
    local_batch_size: ${local_batch_size}
    seq_len: 1024 # Shorter than GSM8K since BlackJack episodes are shorter
    max_norm: 1.0
    steps: 1000 # Tutorial: 1000 steps (increase for production)
    dtype: bfloat16
    gc_freq: 1
  compile:
    enable: false
  parallelism:
    data_parallel_replicate_degree: 1
    data_parallel_shard_degree: 1
    tensor_parallel_degree: 1
    pipeline_parallel_degree: 1
    context_parallel_degree: 1
    expert_parallel_degree: 1
    disable_loss_parallel: true
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"
  activation_checkpoint:
    mode: selective
    selective_ac_option: op

# Replay buffer configuration
replay_buffer:
  batch_size: ${local_batch_size}
  max_policy_age: ${off_by_n}
  dp_size: ${trainer.parallelism.data_parallel_shard_degree}

# Reference model configuration
ref_model:
  model:
    name: qwen3
    flavor: 1.7B
    hf_assets_path: hf://${model}
  training:
    seq_len: ${trainer.training.seq_len}
    dtype: bfloat16
    gc_freq: 1
  compile:
    enable: false
  parallelism:
    data_parallel_replicate_degree: 1
    data_parallel_shard_degree: 1
    tensor_parallel_degree: 1
    pipeline_parallel_degree: 1
    context_parallel_degree: 1
    expert_parallel_degree: 1
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true

# All resource allocations
services:
  policy:
    procs: ${policy.engine_args.tensor_parallel_size}
    num_replicas: 1
    mesh_name: policy
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    mesh_name: ref_model
    with_gpus: true
  reward_actor:
    procs: 1
    num_replicas: 1
    mesh_name: reward_actor
    with_gpus: false

actors:
  blackjack_env:
    procs: 1
    with_gpus: false
    mesh_name: blackjack_env
  trainer:
    procs: 1
    with_gpus: true
    mesh_name: trainer
  replay_buffer:
    procs: 1
    with_gpus: false
    mesh_name: replay_buffer
  compute_advantages:
    procs: 1
    with_gpus: false
    mesh_name: compute_advantages