Implementations of actor-critic reinforcement learning algorithms on the CartPole-v1 environment.
Four algorithms are implemented across multiple backends:
| Algorithm | Description |
|---|---|
| `td` | TD Actor-Critic — online per-step updates using TD error δ |
| `reinforce` | Vanilla REINFORCE — no critic; actor updated with raw discounted returns G_t |
| `advantage` | A2C (single worker) — actor updated with normalised advantage (G_t − V(s_t)) |
| `a2c` | A2C Parallel Workers — synchronous multi-worker batch updates with entropy bonus |
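For orientation, the `td` update is a single bootstrapped step: the TD error δ = r + γ·V(s′) − V(s) trains the critic and weights the actor's log-probability gradient. Below is a minimal PyTorch sketch of that step; the `actor`/`critic` modules, optimisers, and argument layout are illustrative assumptions, not the repo's actual code.

```python
# Illustrative one-step TD actor-critic update (PyTorch). A sketch, not the
# repo's exact code: `actor` maps a state to action logits, `critic` maps a
# state to a scalar V(s); both are assumed to be nn.Modules with optimisers.
import torch

def td_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, done, gamma=0.99):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # TD error: delta = r + gamma * V(s') - V(s), with V(s') = 0 at terminal states
    with torch.no_grad():
        target = r + gamma * critic(s_next) * (1.0 - float(done))
    delta = target - critic(s)

    critic_loss = delta.pow(2).sum()  # move V(s) toward the one-step target
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Policy-gradient step weighted by the (detached) TD error
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -(delta.detach() * log_prob).sum()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```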
Three Python backends share the same algorithm set and command-line interface.
```
pip install gymnasium torch tensorflow pygame rich
```

| Script | Backend | Notes |
|---|---|---|
| `actor_critic.py` | PyTorch | Neural network actor/critic (128-unit hidden layer) |
| `actor_critic_tf.py` | TensorFlow | Same architecture in Keras |
| `actor_critic_lfa.py` | NumPy | Linear function approximation — no autograd |
| `actor_critic_cont_tf.py` | TensorFlow | Continuous action space variant (Gaussian policy) |
```
python actor_critic.py          [--algo ALGO] [--viz VIZ] [--workers N]
python actor_critic_tf.py       [--algo ALGO] [--viz VIZ] [--workers N]
python actor_critic_lfa.py      [--algo ALGO] [--viz VIZ] [--workers N]
python actor_critic_cont_tf.py  [--algo ALGO] [--viz VIZ] [--workers N]
```

Choices for `--algo`: `td` | `reinforce` | `advantage` | `a2c`. Default: `a2c`.
| Value | Algorithm |
|---|---|
| `td` | TD Actor-Critic — step-level TD error updates |
| `reinforce` | Vanilla REINFORCE — episode-level policy gradient, no critic |
| `advantage` | Actor-Critic with Advantage — episode-level with normalised advantage |
| `a2c` | A2C Parallel Workers — batched multi-worker with entropy bonus |
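The `advantage` variant's episode-level update hinges on computing discounted returns and normalising the advantage. A minimal NumPy sketch of that step, assuming per-episode `rewards` and critic `values` arrays (illustrative, not the repo's code):

```python
import numpy as np

def normalised_advantages(rewards, values, gamma=0.99):
    """Discounted returns G_t, then (G_t - V(s_t)) normalised to zero mean, unit std."""
    G, returns = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G  # accumulate the return backwards in time
        returns[t] = G
    adv = returns - np.asarray(values)
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```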
Choices for `--viz`: `text` | `gui` | `interactive` | `none`. Default: `text`.
| Value | Description |
|---|---|
| `text` | Colour heatmaps printed to the terminal every 50 episodes |
| `gui` | Live pygame window showing policy and value function heatmaps |
| `interactive` | Pygame window with axis/slice controls to explore the 4D state space |
| `none` | No visualisation — fastest training |
`--workers` (type: int, default: 8): number of synchronous environments used when `--algo a2c`; has no effect for other algorithms.
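For intuition, synchronous multi-worker batching can look like the following sketch built on gymnasium's `SyncVectorEnv`; the random actions stand in for the actor's sampled actions, and the scripts may organise this differently:

```python
import gymnasium as gym
import numpy as np

n_workers = 8
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(n_workers)])
obs, _ = envs.reset(seed=0)
for _ in range(5):                                       # one short batch of synchronous steps
    actions = np.random.randint(0, 2, size=n_workers)    # placeholder for sampled actor actions
    obs, rewards, terms, truncs, _ = envs.step(actions)  # all workers advance in lockstep
envs.close()
```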
| Parameter | Value |
|---|---|
| Discount factor γ | 0.99 |
| Actor learning rate α | 0.001 |
| Critic learning rate α | 0.005 |
| Max episodes | 5 000 |
| Solved threshold (avg 100 ep.) | 295.0 |
The linear FA script defaults to tile coding. Edit the `FEATURES` variable at the top of `actor_critic_lfa.py` to switch:
| Value | Feature type | Reference |
|---|---|---|
| `tile` (default) | Tile coding (8 tilings × 8 tiles) | S&B §9.5.4 |
| `polynomial` | Polynomial basis (degree 3) | S&B §9.5.1 |
| `fourier` | Fourier basis (order 3) | S&B §9.5.2 |
| `rbf` | Radial Basis Functions (5 centres/dim) | S&B §9.5.5 |
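As a reference point for what such features look like, here is a minimal NumPy sketch of the Fourier basis (S&B §9.5.2), with cos(π·c·s) over a state scaled to [0, 1] per dimension. It is illustrative only, not the script's implementation:

```python
import itertools
import numpy as np

def fourier_features(s, low, high, order=3):
    """Fourier basis: phi_i(s) = cos(pi * c_i . s_scaled), c_i in {0..order}^d."""
    s_scaled = (np.asarray(s) - low) / (high - low)  # scale each dim to [0, 1]
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(s_scaled))))
    return np.cos(np.pi * coeffs @ s_scaled)         # (order+1)^d features
```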
```
# A2C with 16 workers, no visualisation
python actor_critic.py --algo a2c --workers 16 --viz none

# TD actor-critic with live pygame heatmap
python actor_critic.py --algo td --viz gui

# Vanilla REINFORCE on TensorFlow with terminal heatmaps
python actor_critic_tf.py --algo reinforce --viz text

# Linear FA, advantage algorithm, interactive heatmap explorer
python actor_critic_lfa.py --algo advantage --viz interactive
```

After training, a rendered episode is displayed automatically using the trained actor.
The PyTorch advantage variant also saves the trained actor to `actor.pt`.
An in-browser training engine that mirrors the Python algorithms. No server required — everything runs in Web Workers.
Try it online: https://mochan.dev/actorcritic/
```
cd js
npx serve .   # or any static file server
```

Open http://localhost:3000 in your browser.
| Control | Options |
|---|---|
| Approximator | Linear FA · Neural Network |
| Algorithm | TD Actor-Critic · REINFORCE · Advantage · A2C |
| Feature extractor (Linear FA only) | Tile Coding · Polynomial · Fourier · RBF |
| Actor α / Critic α | Learning rates (pre-filled with sensible defaults per approximator) |
| Skip episodes | Fast-forward N episodes without rendering |
Three live canvases update as training runs:
- CartPole — animated environment rendering
- Value heatmap — V(s) over a 2D slice of the 4D state space
- Policy heatmap — P(push right | s) over the same slice
Use the axis selector controls to choose which two state dimensions to display on the axes, and set fixed values for the remaining two dimensions. A snapshot history slider lets you scrub back through earlier checkpoints.
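Conceptually, each heatmap sweeps the two chosen state dimensions over a grid while holding the other two fixed. A Python sketch of that idea (the `policy_prob_right` callable, axis ranges, and grid size are placeholders, not the app's code):

```python
import numpy as np

def policy_heatmap(policy_prob_right, x_dim=0, y_dim=2, fixed=(0.0, 0.0), size=40):
    """P(push right | s) over a 2D slice; the remaining two dims stay at `fixed`."""
    xs = np.linspace(-2.4, 2.4, size)    # assumed range for cart position
    ys = np.linspace(-0.21, 0.21, size)  # assumed range for pole angle (radians)
    grid = np.zeros((size, size))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            s = np.zeros(4)
            s[x_dim], s[y_dim] = x, y
            other = [d for d in range(4) if d not in (x_dim, y_dim)]
            s[other[0]], s[other[1]] = fixed
            grid[i, j] = policy_prob_right(s)
    return grid
```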
Click Replay at any time to run a greedy episode with the current policy.
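A greedy replay simply rolls the environment forward taking the most probable action at each step. Sketched in Python below; `actor_probs` is a hypothetical stand-in for the trained policy:

```python
import gymnasium as gym
import numpy as np

def greedy_episode(actor_probs, render=True):
    env = gym.make("CartPole-v1", render_mode="human" if render else None)
    s, _ = env.reset()
    total, done = 0.0, False
    while not done:
        a = int(np.argmax(actor_probs(s)))  # greedy: pick the most probable action
        s, r, terminated, truncated, _ = env.step(a)
        total += r
        done = terminated or truncated
    env.close()
    return total
```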
| File | Purpose |
|---|---|
| `index.html` / `style.css` | UI layout and styling |
| `main.js` | UI logic, rendering, and worker coordination |
| `worker.js` | Training loop running off the main thread |
| `algorithms.js` | Trainer classes (TD, REINFORCE, Advantage, A2C) |
| `linear_fa.js` | Linear function approximation (features, actor, critic) |
| `nn_fa.js` | Neural network actor and critic |
| `cartpole.js` | CartPole environment simulation |
| `heatmap_worker.js` | Off-thread heatmap computation |
| `benchmark_heatmap.js` | Heatmap benchmark utility |
| `test_linear_fa.js` | Unit tests for linear FA |