ch.ethz.idsc.subare

Java 8 implementation of algorithms, examples, and exercises from the 2nd edition of

Sutton and Barto: Reinforcement Learning

Version 0.1.6

Our implementation is inspired by the Python code by Shangtong Zhang.

Our implementation differs from the reference in two respects:

  • the algorithms are implemented separately from the problem scenarios (see the sketch below)
  • the math is carried out in exact precision, which reproduces symmetries in the results whenever the problem features symmetries
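To illustrate the first point, here is a minimal sketch of what such a separation can look like. The names DiscreteModel and ValueIterationSketch are made up for illustration and do not reflect the library's actual API; the library also performs the math in exact precision, whereas the sketch uses double for brevity and assumes deterministic transitions:

```java
import java.util.List;
import java.util.Map;

/** Hypothetical interface: a problem scenario is described
 * independently of any algorithm that operates on it. */
interface DiscreteModel<S, A> {
  List<S> states();                 // finite state space
  List<A> actions(S state);         // actions available in a state (empty if terminal)
  S move(S state, A action);        // transition (deterministic here, for brevity)
  double reward(S state, A action); // immediate reward
}

/** An algorithm such as value iteration is then written once, against the
 * interface, and reused for gridworld, gambler, racetrack, etc. */
class ValueIterationSketch<S, A> {
  /** performs one sweep over all states; returns the largest change */
  double sweep(DiscreteModel<S, A> model, Map<S, Double> v, double gamma) {
    double delta = 0;
    for (S s : model.states()) {
      // the value of a terminal state (no actions) stays 0
      double best = model.actions(s).isEmpty() ? 0 : Double.NEGATIVE_INFINITY;
      for (A a : model.actions(s))
        best = Math.max(best, model.reward(s, a) + gamma * v.get(model.move(s, a)));
      delta = Math.max(delta, Math.abs(best - v.get(s)));
      v.put(s, best);
    }
    return delta;
  }
}
```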

List of algorithms:

  • Iterative Policy Evaluation (parallel, in 4.1, p.59)
  • Value Iteration to determine V*(s) (parallel, in 4.4, p.65)
  • Action-Value Iteration to determine Q*(s,a) (parallel)
  • First Visit Policy Evaluation (in 5.1, p.74)
  • Monte Carlo Exploring Starts (in 5.3, p.79)
  • Constant-alpha Monte Carlo
  • Tabular Temporal Difference (in 6.1, p.96)
  • Sarsa: An on-policy TD control algorithm (in 6.4, p.104)
  • Q-learning: An off-policy TD control algorithm (in 6.5, p.105; see the sketch after this list)
  • Expected Sarsa (in 6.6, p.107)
  • Double Sarsa, Double Expected Sarsa, Double Q-Learning (in 6.7, p.109)
  • n-step Temporal Difference for estimating V(s) (in 7.1, p.115)
  • n-step Sarsa, n-step Expected Sarsa, n-step Q-Learning (in 7.2, p.118)
  • Random-sample one-step tabular Q-planning (parallel, in 8.1, p.131)
  • Tabular Dyna-Q (in 8.2, p.133)
  • Prioritized Sweeping (in 8.4, p.137)
  • Semi-gradient Tabular Temporal Difference (in 9.3, p.164)
  • True Online Sarsa (in 12.8, p.309)
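As a concrete example of one entry in the list, the following is a textbook tabular Q-learning agent on a toy random-walk chain. It is a generic, self-contained sketch, not the library's implementation; the problem (states 0..6, reward 1 at the right end) is chosen only to keep the demo short:

```java
import java.util.Random;

/** Textbook tabular Q-learning on a 1-dimensional chain;
 * states 0..6 with terminal states 0 (reward 0) and 6 (reward 1). */
public class QLearningDemo {
  public static void main(String[] args) {
    int n = 6;            // states 0..n; actions: 0 = left, 1 = right
    double alpha = 0.1;   // step size
    double gamma = 1.0;   // discount factor
    double epsilon = 0.1; // exploration rate
    double[][] q = new double[n + 1][2];
    Random random = new Random(42);
    for (int episode = 0; episode < 10_000; ++episode) {
      int s = n / 2; // start in the middle
      while (s != 0 && s != n) {
        int a = random.nextDouble() < epsilon // epsilon-greedy action selection
            ? random.nextInt(2)
            : (q[s][1] > q[s][0] ? 1 : 0);
        int t = a == 1 ? s + 1 : s - 1; // deterministic move
        double r = t == n ? 1 : 0;      // reward on reaching the right end
        double target = (t == 0 || t == n) //
            ? r
            : r + gamma * Math.max(q[t][0], q[t][1]);
        q[s][a] += alpha * (target - q[s][a]); // Q-learning update
        s = t;
      }
    }
    for (int s = 1; s < n; ++s)
      System.out.printf("q(%d,left)=%.3f  q(%d,right)=%.3f%n", s, q[s][0], s, q[s][1]);
  }
}
```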

Examples from the book

4.1 Gridworld

  • AV-Iteration q(s,a) (image: gridworld_qsa_avi)
  • TabularQPlan (image: gridworld_qsa_rstqp)
  • Monte Carlo (image: gridworld_qsa_mces)
  • Q-Learning (image: gridworld_qsa_qlearning)
  • Expected-Sarsa (image: gridworld_qsa_expected)
  • Sarsa (image: gridworld_qsa_original)
  • 3-step Q-Learning (image: gridworld_qsa_qlearning3)
  • 3-step E-Sarsa (image: gridworld_qsa_expected3)
  • 3-step Sarsa (image: gridworld_qsa_original3)
  • True Online Sarsa (image: gridworld_tos_original)
  • True Online Expected Sarsa (image: gridworld_tos_expected)
  • True Online Q-Learning (image: gridworld_tos_qlearning)

4.2 Jack's car rental

  • Value Iteration v(s) (image: carrental_vi_true)

4.4 Gambler's problem

  • Value Iteration v(s) (image: gambler_sv)
  • Action Value Iteration and optimal policy (image: gambler_avi)
  • Monte Carlo q(s,a) (image: gambler_qsa_mces)
  • E-Sarsa q(s,a) (image: gambler_qsa_esarsa)
  • Q-Learning q(s,a) (image: gambler_qsa_qlearn)
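For orientation, the gambler's problem admits a very compact value iteration. The sketch below is a direct transcription of the textbook formulation (capital 1..99, goal 100, heads probability 0.4 as in the book's figure) and is independent of the library:

```java
/** Value iteration for the gambler's problem: from capital s the gambler
 * stakes a in 1..min(s, 100-s); heads (probability 0.4) wins the stake. */
public class GamblerValueIteration {
  public static void main(String[] args) {
    double ph = 0.4;
    double[] v = new double[101];
    v[100] = 1; // reaching the goal has value 1
    double delta;
    do { // sweep until the value function stops changing
      delta = 0;
      for (int s = 1; s < 100; ++s) {
        double best = 0;
        for (int a = 1; a <= Math.min(s, 100 - s); ++a)
          best = Math.max(best, ph * v[s + a] + (1 - ph) * v[s - a]);
        delta = Math.max(delta, Math.abs(best - v[s]));
        v[s] = best;
      }
    } while (1e-12 < delta);
    System.out.println("v(50) = " + v[50]); // probability of winning from capital 50
  }
}
```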

5.1 Blackjack

  • Monte Carlo Exploring Starts (image: blackjack_mces)

5.2 Wireloop

  • AV-Iteration (image: wire5_avi)
  • TabularQPlan (image: wire5_qsa_rstqp)
  • Q-Learning (image: wire5_qsa_qlearning)
  • E-Sarsa (image: wire5_qsa_expected)
  • Sarsa (image: wire5_qsa_original)
  • Monte Carlo (image: wire5_mces)

5.8 Racetrack

Paths obtained using value iteration:

  • track 1 (image: track1)
  • track 2 (image: track2)

6.5 Windygrid

  • Action Value Iteration (image: windygrid_qsa_avi)
  • TabularQPlan (image: windygrid_qsa_rstqp)

6.6 Cliffwalk

  • Action Value Iteration (image: cliffwalk_qsa_avi)
  • Q-Learning (image: cliffwalk_qsa_qlearning)
  • TabularQPlan (image: cliffwalk_qsa_rstqp)
  • Expected Sarsa (image: cliffwalk_qsa_expected)

8.1 Dynamaze

  • Action Value Iteration (image: maze5_qsa_avi)
  • Prioritized Sweeping (image: maze2_ps_qlearning)


Additional Examples

Repeated Prisoner's Dilemma

  • Exact expected reward of two adversarial optimistic agents as a function of their initial configuration (image: opts)
  • Exact expected reward of two adversarial Upper-Confidence-Bound agents as a function of their initial configuration (image: ucbs)
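The Upper-Confidence-Bound rule referenced above selects the action maximizing q(a) + c * sqrt(ln t / n(a)). The following is a minimal sketch of that selection rule with illustrative parameters; the exact agent setup used in the experiment may differ:

```java
/** UCB action selection for a 2-action bandit, such as one round of the
 * prisoner's dilemma; q, n, t, and c below are illustrative values. */
public class UcbSelect {
  static int select(double[] q, int[] n, int t, double c) {
    int best = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int a = 0; a < q.length; ++a) {
      // untried actions are considered maximizing and are chosen first
      double score = n[a] == 0 //
          ? Double.POSITIVE_INFINITY
          : q[a] + c * Math.sqrt(Math.log(t) / n[a]);
      if (bestScore < score) {
        bestScore = score;
        best = a;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    double[] q = { 0.3, 0.7 }; // empirical mean rewards per action
    int[] n = { 5, 5 };        // visit counts per action
    System.out.println(select(q, n, 10, 1.0)); // prints 1
  }
}
```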

Integration

Specify the dependency and repository of the subare library in the pom.xml file of your Maven project:

<dependencies>
  <dependency>
    <groupId>ch.ethz.idsc</groupId>
    <artifactId>subare</artifactId>
    <version>0.1.6</version>
  </dependency>
</dependencies>

<repositories>
  <repository>
    <id>subare-mvn-repo</id>
    <url>https://raw.github.com/idsc-frazzoli/subare/mvn-repo/</url>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
    </snapshots>
  </repository>
</repositories>

The source code is attached to every release.

Contributors

Jan Hakenberg, Christian Fluri