
Actor-Critic CartPole

Implementations of actor-critic reinforcement learning algorithms on the CartPole-v1 environment.

Four algorithms are implemented across multiple backends:

| Algorithm | Description |
|---|---|
| `td` | TD Actor-Critic — online per-step updates using the TD error δ |
| `reinforce` | Vanilla REINFORCE — no critic; actor updated with raw discounted returns G_t |
| `advantage` | A2C (single worker) — actor updated with the normalised advantage (G_t − V(s_t)) |
| `a2c` | A2C Parallel Workers — synchronous multi-worker batch updates with an entropy bonus |
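
As a concrete illustration of the first row, here is a minimal per-step TD actor-critic update in PyTorch. This is a sketch, not the repository's code: the `actor`/`critic` modules, optimisers, and tensor shapes are assumptions.

```python
import torch

def td_step(actor, critic, actor_opt, critic_opt,
            s, a, r, s_next, done, gamma=0.99):
    # One-step TD target: r + gamma * V(s'), with V(s') = 0 at terminal states.
    with torch.no_grad():
        v_next = torch.tensor(0.0) if done else critic(s_next)
        target = r + gamma * v_next

    # Critic update: regress V(s) toward the target (squared TD error).
    critic_loss = (target - critic(s)).pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: policy gradient weighted by the TD error delta.
    delta = (target - critic(s)).detach()
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -delta * dist.log_prob(a)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```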

Python

Three Python backends share the same algorithm set and command-line interface.

Requirements

pip install gymnasium torch tensorflow pygame rich

Scripts

| Script | Backend | Notes |
|---|---|---|
| `actor_critic.py` | PyTorch | Neural network actor/critic (128-unit hidden layer) |
| `actor_critic_tf.py` | TensorFlow | Same architecture in Keras |
| `actor_critic_lfa.py` | NumPy | Linear function approximation — no autograd |
| `actor_critic_cont_tf.py` | TensorFlow | Continuous action space variant (Gaussian policy) |
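
For reference, a single 128-unit hidden layer actor/critic like the one the table describes might look as follows in PyTorch. Only the hidden width comes from the table; layer choice and activation are assumptions.

```python
import torch.nn as nn

def make_net(out_dim):
    return nn.Sequential(
        nn.Linear(4, 128),   # CartPole-v1 observations are 4-dimensional
        nn.ReLU(),
        nn.Linear(128, out_dim),
    )

actor = make_net(2)   # logits over the two actions (push left / push right)
critic = make_net(1)  # scalar state value V(s)
```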

Usage

python actor_critic.py [--algo ALGO] [--viz VIZ] [--workers N]
python actor_critic_tf.py [--algo ALGO] [--viz VIZ] [--workers N]
python actor_critic_lfa.py [--algo ALGO] [--viz VIZ] [--workers N]
python actor_critic_cont_tf.py [--algo ALGO] [--viz VIZ] [--workers N]

Command-Line Arguments

`--algo` — Training algorithm

Choices: `td` | `reinforce` | `advantage` | `a2c` · Default: `a2c`

| Value | Algorithm |
|---|---|
| `td` | TD Actor-Critic — step-level TD error updates |
| `reinforce` | Vanilla REINFORCE — episode-level policy gradient, no critic |
| `advantage` | Actor-Critic with Advantage — episode-level with normalised advantage |
| `a2c` | A2C Parallel Workers — batched multi-worker with entropy bonus |

`--viz` — Visualisation mode

Choices: `text` | `gui` | `interactive` | `none` · Default: `text`

| Value | Description |
|---|---|
| `text` | Colour heatmaps printed to the terminal every 50 episodes |
| `gui` | Live pygame window showing policy and value function heatmaps |
| `interactive` | Pygame window with axis/slice controls to explore the 4D state space |
| `none` | No visualisation — fastest training |

`--workers` — Number of parallel workers (A2C only)

Type: `int` · Default: `8`

Number of synchronous environments used when `--algo a2c`; it has no effect with other algorithms.
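
This kind of synchronous batching is what gymnasium's vector API provides; a hedged sketch (the repository may implement its workers differently):

```python
import gymnasium as gym
import numpy as np

n_workers = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(n_workers)]
)

obs, _ = envs.reset(seed=0)
for _ in range(5):
    # Stand-in random policy; a real A2C samples from the actor in a batch.
    actions = np.random.randint(0, 2, size=n_workers)
    obs, rewards, terminations, truncations, _ = envs.step(actions)
    # Every array has leading dimension n_workers, so one gradient step
    # can consume the experience of all workers at once.
envs.close()
```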

Hyperparameters

| Parameter | Value |
|---|---|
| Discount factor γ | 0.99 |
| Actor learning rate α | 0.001 |
| Critic learning rate α | 0.005 |
| Max episodes | 5000 |
| Solved threshold (mean return over the last 100 episodes) | 295.0 |
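
The solved threshold in the last row implies a stopping rule along these lines (a sketch; the names are made up):

```python
from collections import deque

recent_returns = deque(maxlen=100)

def solved(episode_return, threshold=295.0):
    recent_returns.append(episode_return)
    return len(recent_returns) == 100 and sum(recent_returns) / 100 >= threshold
```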

Feature Extractors (`actor_critic_lfa.py`)

The linear FA script defaults to tile coding. Edit the `FEATURES` variable at the top of `actor_critic_lfa.py` to switch:

| Value | Feature type | Reference |
|---|---|---|
| `tile` (default) | Tile coding (8 tilings × 8 tiles) | S&B §9.5.4 |
| `polynomial` | Polynomial basis (degree 3) | S&B §9.5.1 |
| `fourier` | Fourier basis (order 3) | S&B §9.5.2 |
| `rbf` | Radial basis functions (5 centres/dim) | S&B §9.5.5 |
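
As an example of what these extractors compute, the Fourier basis (S&B §9.5.2) maps a state scaled to [0, 1]^d to features φ_i(s) = cos(π c_i · s), one per integer coefficient vector c_i in {0, …, order}^d. A sketch, with the scaling bounds as assumptions (CartPole's velocity dimensions are unbounded and must be clipped in practice):

```python
import itertools
import numpy as np

def fourier_features(s, low, high, order=3):
    # Scale the state into [0, 1]^d using the assumed bounds.
    s01 = (np.asarray(s) - low) / (np.asarray(high) - np.asarray(low))
    # All coefficient vectors in {0, ..., order}^d.
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(s01))))
    return np.cos(np.pi * coeffs @ s01)  # (order + 1)**d features
```

With order 3 over CartPole's 4 state dimensions this yields (3 + 1)^4 = 256 features.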

Examples

# A2C with 16 workers, no visualisation
python actor_critic.py --algo a2c --workers 16 --viz none

# TD actor-critic with live pygame heatmap
python actor_critic.py --algo td --viz gui

# Vanilla REINFORCE on TensorFlow with terminal heatmaps
python actor_critic_tf.py --algo reinforce --viz text

# Linear FA, advantage algorithm, interactive heatmap explorer
python actor_critic_lfa.py --algo advantage --viz interactive

After training, a rendered episode is displayed automatically using the trained actor.

The PyTorch `advantage` variant also saves the trained actor to `actor.pt`.
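
Replaying that checkpoint outside the script could look like this sketch (it assumes `actor.pt` holds a full module saved with `torch.save` and a discrete two-action policy):

```python
import gymnasium as gym
import torch

actor = torch.load("actor.pt", weights_only=False)  # assumes a pickled nn.Module
env = gym.make("CartPole-v1", render_mode="human")

obs, _ = env.reset()
done = False
while not done:
    with torch.no_grad():
        logits = actor(torch.as_tensor(obs, dtype=torch.float32))
    action = int(logits.argmax())  # greedy action
    obs, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```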


JavaScript

An in-browser training engine that mirrors the Python algorithms. No server required — everything runs in Web Workers.

Try it online: https://mochan.dev/actorcritic/

Running locally

cd js
npx serve .        # or any static file server

Open http://localhost:3000 in your browser.

Controls

| Control | Options |
|---|---|
| Approximator | Linear FA · Neural Network |
| Algorithm | TD Actor-Critic · REINFORCE · Advantage · A2C |
| Feature extractor (Linear FA only) | Tile Coding · Polynomial · Fourier · RBF |
| Actor α / Critic α | Learning rates (pre-filled with sensible defaults per approximator) |
| Skip episodes | Fast-forward N episodes without rendering |

Visualisation

Three live canvases update as training runs:

  • CartPole — animated environment rendering
  • Value heatmap — V(s) over a 2D slice of the 4D state space
  • Policy heatmap — P(push right | s) over the same slice

Use the axis selectors to choose which two state dimensions appear on the heatmap axes, and set fixed values for the remaining two. A snapshot history slider lets you scrub back through earlier checkpoints.
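
Computing such a slice reduces to sweeping two dimensions over a grid while holding the other two fixed. An illustrative sketch in Python (the site does this in JS; every name here is hypothetical):

```python
import numpy as np

def value_slice(V, dims=(0, 2), fixed=(0.0, 0.0), lo=-2.4, hi=2.4, n=50):
    xs = np.linspace(lo, hi, n)
    grid = np.empty((n, n))
    others = [d for d in range(4) if d not in dims]
    for i, x in enumerate(xs):
        for j, y in enumerate(xs):
            s = np.zeros(4)
            s[dims[0]], s[dims[1]] = x, y        # swept axes
            s[others[0]], s[others[1]] = fixed   # held constant
            grid[i, j] = V(s)                    # critic's value estimate
    return grid
```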

Click Replay at any time to run a greedy episode with the current policy.

Files

| File | Purpose |
|---|---|
| `index.html` / `style.css` | UI layout and styling |
| `main.js` | UI logic, rendering, and worker coordination |
| `worker.js` | Training loop running off the main thread |
| `algorithms.js` | Trainer classes (TD, REINFORCE, Advantage, A2C) |
| `linear_fa.js` | Linear function approximation (features, actor, critic) |
| `nn_fa.js` | Neural network actor and critic |
| `cartpole.js` | CartPole environment simulation |
| `heatmap_worker.js` | Off-thread heatmap computation |
| `benchmark_heatmap.js` | Heatmap benchmark utility |
| `test_linear_fa.js` | Unit tests for linear FA |
