# Companion Interactive Labs for *Complete Reinforcement Learning Journey: From Basics to RLHF*

Don't just read about algorithms: watch them think.
Each chapter of the book has a companion browser-based interactive lab where you can:
- 🔵 Step through algorithms cell by cell and see values update in real time
- 🎛️ Tweak parameters (γ, ε, learning rate) with sliders and instantly see the effect
- 🤖 Watch agents navigate grids, solve problems, and learn from mistakes
- 📊 Inspect Q-values, policy arrows, and convergence logs live
No installation required. Open in any browser. Works on desktop and mobile.
| Chapter | Lab | Concepts Covered | Try It |
|---|---|---|---|
| Ch 2 | MDP Explorer | States, actions, rewards, transitions, deterministic vs stochastic, policy, value function | ▶ Launch |
| Ch 3 | Policy Iteration on FrozenLake | Bellman equations, policy evaluation (sweep by sweep), policy improvement, convergence | ▶ Launch |
| Ch 4 | Monte Carlo Blackjack (coming soon) | First-visit MC, exploring starts, episode replay | ⏳ |
| Ch 5 | TD Learning & SARSA (coming soon) | TD(0), SARSA, Q-learning, cliff walking | ⏳ |
| Ch 6 | DQN on CartPole (coming soon) | Experience replay, target networks, training curves | ⏳ |
| Ch 7 | Policy Gradients (coming soon) | REINFORCE, baselines, variance reduction | ⏳ |
## Ch 2 · MDP Explorer

Understand the building blocks of every RL algorithm. Explore a 5×5 Gridworld MDP interactively.
| Mode | What You Learn |
|---|---|
| 🔍 Explore | Click any cell → see its state (r,c), reward, transition probabilities for each action, and Q-values |
| π Policy | See policy arrows on every cell. Click to cycle through actions and build your own policy |
| V Value | Color-coded heatmap of V(s): green = high value, red = low value |
- Deterministic vs stochastic: slide slip from 0 to 0.6 and watch the transition probabilities change
- Click any cell → full breakdown of transitions, rewards, and Q(s,a) for all 4 actions
- ⚡ Solve Optimal Policy → finds π* and shows the value heatmap
- 🤖 Run Robot → animated step-by-step episode
- 🤖×10 Run 10 Episodes → shows the success rate (deterministic vs stochastic)
- ✏️ Edit Grid → paint walls, pits, goals, and start positions to create your own MDP
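The transition breakdown the Explore mode shows for each cell can be reproduced in a few lines. Below is a minimal sketch of a gridworld with a slip parameter; the grid size, slip model (slip probability split evenly over the four directions), and all function names are illustrative assumptions, not the lab's actual code:

```python
import random

N = 5                      # 5x5 grid; states are (row, col) tuples
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def clip_move(state, action):
    """Apply a move, bouncing off the grid edges."""
    (r, c), (dr, dc) = state, ACTIONS[action]
    return (max(0, min(N - 1, r + dr)), max(0, min(N - 1, c + dc)))

def step(state, action, slip=0.0):
    """Sample one transition: with probability `slip`, a random direction is taken."""
    if random.random() < slip:
        action = random.choice(list(ACTIONS))
    return clip_move(state, action)

def transition_probs(state, action, slip=0.0):
    """Exact P(s' | s, a): intended move gets 1 - slip, plus slip/4 per direction."""
    probs = {}
    for a, p in [(action, 1 - slip)] + [(a, slip / 4) for a in ACTIONS]:
        nxt = clip_move(state, a)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```

Sliding the slip slider in the lab corresponds to changing the `slip` argument here: at `slip=0` the intended next cell gets probability 1, and as slip grows, probability mass leaks to the neighboring cells.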
- Click cells → inspect state, action, reward, transition
- ⚡ Solve with slip=0 → observe the shortest path
- Set slip=0.3 → Solve again → the policy becomes cautious near pits!
- Run 🤖 with slip=0 → it always reaches the goal
- Run 🤖×10 with slip=0.3 → some episodes fail!
- Compare γ=0.3 vs γ=0.99 → the value function changes dramatically
- Edit the grid: add more pits near the goal → watch the policy adapt
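The γ comparison in that experiment can also be checked numerically. Here is a hedged sketch on a tiny deterministic chain (the chain layout, reward placement, and iteration count are illustrative assumptions, not the lab's grid): a state far from the goal is worth roughly γ^(distance−1), so a myopic agent barely sees the goal at all.

```python
def value_iteration(n_states=5, gamma=0.99, iters=500):
    """Deterministic chain: each state moves right; reward 1 on entering the terminal end."""
    V = [0.0] * n_states
    for _ in range(iters):
        for s in range(n_states - 1):          # last state is terminal, V stays 0
            reward = 1.0 if s + 1 == n_states - 1 else 0.0
            V[s] = reward + gamma * V[s + 1]   # single action, so the max is trivial
    return V

print(value_iteration(gamma=0.3)[0])    # ≈ 0.3**3  = 0.027
print(value_iteration(gamma=0.99)[0])   # ≈ 0.99**3 ≈ 0.970
```

The same effect drives the dramatic heatmap change in the lab: with γ=0.3 the value signal from the goal decays to near zero within a few cells, while with γ=0.99 it propagates across the whole grid.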
## Ch 3 · Policy Iteration on FrozenLake

Step through the Policy Iteration algorithm on a 4×4 FrozenLake grid.
| Button | What Happens |
|---|---|
| ▶ One Eval Sweep | Each cell lights up blue as its value updates via the Bellman equation |
| ⏩ Full Evaluation | Runs evaluation sweeps until V^π converges |
| ⚡ Improve Policy | Arrows change one by one to the greedy action, with a green flash on changes |
| ▶▶ Auto-Run | Runs the full evaluate → improve loop with pauses between iterations |
| 🤖 Run Robot | An animated robot walks the grid following the current policy |
| ↺ Reset | Start fresh with new parameters |
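What the One Eval Sweep button animates is a single Bellman expectation backup over every state. A minimal sketch for a tabular MDP is below; the `P[s][a]` format of `(prob, next_state, reward, done)` tuples mirrors Gymnasium's FrozenLake `env.P`, and the function and variable names are assumptions:

```python
def eval_sweep(V, policy, P, gamma=0.99):
    """One in-place sweep of iterative policy evaluation.
    P[s][a] is a list of (prob, next_state, reward, done) tuples."""
    delta = 0.0
    for s in range(len(V)):
        a = policy[s]
        # Bellman expectation backup: expected reward plus discounted next-state value.
        v_new = sum(p * (r + gamma * V[s2] * (not done))
                    for p, s2, r, done in P[s][a])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    return delta    # max change this sweep; near 0 means V^pi has converged
```

Calling `eval_sweep` repeatedly until `delta` drops below a tolerance is what the Full Evaluation button does in one shot.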
- Press ▶ One Sweep → watch cells light up one by one
- Press ▶ again → values get more accurate with each sweep
- Press ⏩ Full Eval → converge V^π completely
- Press ⚡ → watch the arrows change direction!
- Repeat ⏩ → ⚡ until π* is found
- Press 🤖 → watch the robot navigate!
- Try γ=0.5 vs γ=0.99 → compare the policies
- Try slip=0 vs slip=0.5 → deterministic vs stochastic
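The evaluate-then-improve loop that Auto-Run animates can be written as a self-contained sketch. As above, the `P[s][a]` format of `(prob, next_state, reward, done)` tuples mirrors Gymnasium's FrozenLake `env.P`; everything else (names, tolerance) is an assumption:

```python
def policy_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    """Alternate full policy evaluation with greedy improvement until the policy is stable."""
    V = [0.0] * n_states
    policy = [0] * n_states

    def q(s, a):
        """Action value under the current V estimate."""
        return sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])

    while True:
        # Policy evaluation: sweep until V^pi converges (the "Full Evaluation" phase).
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: make each state greedy (the "Improve Policy" phase).
        stable = True
        for s in range(n_states):
            best = max(range(n_actions), key=lambda a: q(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                 # no arrow changed: pi* found
            return policy, V
```

Each pass through the outer loop corresponds to one pause in the Auto-Run animation, and the `stable` check is exactly the "no green flashes" stopping condition.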
## Repository Structure

```
rl-book-labs/
├── README.md
├── ch2/
│   └── index.html   # MDP Explorer (5×5 Gridworld)
├── ch3/
│   └── index.html   # Policy Iteration on FrozenLake
├── ch4/             # (coming soon)
├── ch5/             # (coming soon)
└── ch6/             # (coming soon)
```
Each lab is a single HTML file: no build step, no dependencies, no frameworks. Just open it in any browser.
## About the Book

*Complete Reinforcement Learning Journey: From Basics to RLHF*

The only book that takes you from "What is a Markov Decision Process?" all the way to "How do we align language models with human values?", with intuition, math, code, and interactive labs at every step.
- 📐 Intuition → Math → Code triple for every concept
- 🤖 DeliBot running example that grows with the theory
- 🧠 Think Like an Agent boxes for building intuition
- ⚠️ Common Misconceptions boxes to prevent errors
- 🔬 Interactive Labs (this repo!) for hands-on learning
- 📝 Quizzes with detailed answer keys for each chapter
## Contributing

Found a bug in a lab? Have an idea for a new visualization? Contributions are welcome!

- Fork the repo
- Create a branch (`git checkout -b feature/new-lab`)
- Commit your changes
- Open a Pull Request
## License

MIT License: free to use, modify, and distribute.

Built with ❤️ as a companion to the book.
"The best way to learn an algorithm is to watch it think."