ch.ethz.idsc.subare

Java 8 implementation of algorithms, examples, and exercises from the 2nd edition of

Sutton and Barto: Reinforcement Learning

Version 0.1.6

Our implementation is inspired by the Python code by Shangtong Zhang.

Our implementation differs from the reference in two respects:

  • the algorithms are implemented separately from the problem scenarios (see the sketch below)
  • the math is carried out in exact precision, which reproduces symmetries in the results whenever the problem features symmetries
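To illustrate the first point, here is a minimal sketch of what such a separation can look like. The names DiscreteModel and ValueIterationSketch are made up for illustration and do not reflect the library's actual API; the library also performs the math in exact precision, whereas the sketch uses double for brevity and assumes deterministic transitions:

```java
import java.util.List;
import java.util.Map;

/** Hypothetical interface: a problem scenario is described
 * independently of any algorithm that operates on it. */
interface DiscreteModel<S, A> {
  List<S> states();                 // finite state space
  List<A> actions(S state);         // actions available in a state (empty if terminal)
  S move(S state, A action);        // transition (deterministic here, for brevity)
  double reward(S state, A action); // immediate reward
}

/** An algorithm such as value iteration is then written once, against the
 * interface, and reused for gridworld, gambler, racetrack, etc. */
class ValueIterationSketch<S, A> {
  /** performs one sweep over all states; returns the largest change */
  double sweep(DiscreteModel<S, A> model, Map<S, Double> v, double gamma) {
    double delta = 0;
    for (S s : model.states()) {
      // the value of a terminal state (no actions) stays 0
      double best = model.actions(s).isEmpty() ? 0 : Double.NEGATIVE_INFINITY;
      for (A a : model.actions(s))
        best = Math.max(best, model.reward(s, a) + gamma * v.get(model.move(s, a)));
      delta = Math.max(delta, Math.abs(best - v.get(s)));
      v.put(s, best);
    }
    return delta;
  }
}
```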

List of algorithms:

  • Iterative Policy Evaluation (parallel, in 4.1, p.59)
  • Value Iteration to determine V*(s) (parallel, in 4.4, p.65)
  • Action-Value Iteration to determine Q*(s,a) (parallel)
  • First Visit Policy Evaluation (in 5.1, p.74)
  • Monte Carlo Exploring Starts (in 5.3, p.79)
  • Constant-alpha Monte Carlo
  • Tabular Temporal Difference (in 6.1, p.96)
  • Sarsa: An on-policy TD control algorithm (in 6.4, p.104)
  • Q-learning: An off-policy TD control algorithm (in 6.5, p.105; see the sketch after this list)
  • Expected Sarsa (in 6.6, p.107)
  • Double Sarsa, Double Expected Sarsa, Double Q-Learning (in 6.7, p.109)
  • n-step Temporal Difference for estimating V(s) (in 7.1, p.115)
  • n-step Sarsa, n-step Expected Sarsa, n-step Q-Learning (in 7.2, p.118)
  • Random-sample one-step tabular Q-planning (parallel, in 8.1, p.131)
  • Tabular Dyna-Q (in 8.2, p.133)
  • Prioritized Sweeping (in 8.4, p.137)
  • Semi-gradient Tabular Temporal Difference (in 9.3, p.164)
  • True Online Sarsa (in 12.8, p.309)
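As a concrete example of one entry in the list, the following is a textbook tabular Q-learning agent on a toy random-walk chain. It is a generic, self-contained sketch, not the library's implementation; the problem (states 0..6, reward 1 at the right end) is chosen only to keep the demo short:

```java
import java.util.Random;

/** Textbook tabular Q-learning on a 1-dimensional chain;
 * states 0..6 with terminal states 0 (reward 0) and 6 (reward 1). */
public class QLearningDemo {
  public static void main(String[] args) {
    int n = 6;            // states 0..n; actions: 0 = left, 1 = right
    double alpha = 0.1;   // step size
    double gamma = 1.0;   // discount factor
    double epsilon = 0.1; // exploration rate
    double[][] q = new double[n + 1][2];
    Random random = new Random(42);
    for (int episode = 0; episode < 10_000; ++episode) {
      int s = n / 2; // start in the middle
      while (s != 0 && s != n) {
        int a = random.nextDouble() < epsilon // epsilon-greedy action selection
            ? random.nextInt(2)
            : (q[s][1] > q[s][0] ? 1 : 0);
        int t = a == 1 ? s + 1 : s - 1; // deterministic move
        double r = t == n ? 1 : 0;      // reward on reaching the right end
        double target = (t == 0 || t == n) //
            ? r
            : r + gamma * Math.max(q[t][0], q[t][1]);
        q[s][a] += alpha * (target - q[s][a]); // Q-learning update
        s = t;
      }
    }
    for (int s = 1; s < n; ++s)
      System.out.printf("q(%d,left)=%.3f  q(%d,right)=%.3f%n", s, q[s][0], s, q[s][1]);
  }
}
```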

Examples from the book

4.1 Gridworld

  • AV-Iteration q(s,a) (image: gridworld_qsa_avi)
  • TabularQPlan (image: gridworld_qsa_rstqp)
  • Monte Carlo (image: gridworld_qsa_mces)
  • Q-Learning (image: gridworld_qsa_qlearning)
  • Expected-Sarsa (image: gridworld_qsa_expected)
  • Sarsa (image: gridworld_qsa_original)
  • 3-step Q-Learning (image: gridworld_qsa_qlearning3)
  • 3-step E-Sarsa (image: gridworld_qsa_expected3)
  • 3-step Sarsa (image: gridworld_qsa_original3)
  • True Online Sarsa (image: gridworld_tos_original)
  • True Online Expected Sarsa (image: gridworld_tos_expected)
  • True Online Q-Learning (image: gridworld_tos_qlearning)

4.2 Jack's car rental

  • Value Iteration v(s) (image: carrental_vi_true)

4.4 Gambler's problem

  • Value Iteration v(s) (image: gambler_sv)
  • Action Value Iteration and optimal policy (image: gambler_avi)
  • Monte Carlo q(s,a) (image: gambler_qsa_mces)
  • E-Sarsa q(s,a) (image: gambler_qsa_esarsa)
  • Q-Learning q(s,a) (image: gambler_qsa_qlearn)
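For orientation, the gambler's problem admits a very compact value iteration. The sketch below is a direct transcription of the textbook formulation (capital 1..99, goal 100, heads probability 0.4 as in the book's figure) and is independent of the library:

```java
/** Value iteration for the gambler's problem: from capital s the gambler
 * stakes a in 1..min(s, 100-s); heads (probability 0.4) wins the stake. */
public class GamblerValueIteration {
  public static void main(String[] args) {
    double ph = 0.4;
    double[] v = new double[101];
    v[100] = 1; // reaching the goal has value 1
    double delta;
    do { // sweep until the value function stops changing
      delta = 0;
      for (int s = 1; s < 100; ++s) {
        double best = 0;
        for (int a = 1; a <= Math.min(s, 100 - s); ++a)
          best = Math.max(best, ph * v[s + a] + (1 - ph) * v[s - a]);
        delta = Math.max(delta, Math.abs(best - v[s]));
        v[s] = best;
      }
    } while (1e-12 < delta);
    System.out.println("v(50) = " + v[50]); // probability of winning from capital 50
  }
}
```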

5.1 Blackjack

  • Monte Carlo Exploring Starts (image: blackjack_mces)

5.2 Wireloop

  • AV-Iteration (image: wire5_avi)
  • TabularQPlan (image: wire5_qsa_rstqp)
  • Q-Learning (image: wire5_qsa_qlearning)
  • E-Sarsa (image: wire5_qsa_expected)
  • Sarsa (image: wire5_qsa_original)
  • Monte Carlo (image: wire5_mces)

5.8 Racetrack

Paths obtained using value iteration:

  • track 1 (image: track1)
  • track 2 (image: track2)

6.5 Windygrid

  • Action Value Iteration (image: windygrid_qsa_avi)
  • TabularQPlan (image: windygrid_qsa_rstqp)

6.6 Cliffwalk

  • Action Value Iteration (image: cliffwalk_qsa_avi)
  • Q-Learning (image: cliffwalk_qsa_qlearning)
  • TabularQPlan (image: cliffwalk_qsa_rstqp)
  • Expected Sarsa (image: cliffwalk_qsa_expected)

8.1 Dynamaze

  • Action Value Iteration (image: maze5_qsa_avi)
  • Prioritized Sweeping (image: maze2_ps_qlearning)


Additional Examples

Repeated Prisoner's Dilemma

  • Exact expected reward of two adversarial optimistic agents as a function of their initial configuration (image: opts)
  • Exact expected reward of two adversarial Upper-Confidence-Bound agents as a function of their initial configuration (image: ucbs)
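The Upper-Confidence-Bound rule referenced above selects the action maximizing q(a) + c * sqrt(ln t / n(a)). The following is a minimal sketch of that selection rule with illustrative parameters; the exact agent setup used in the experiment may differ:

```java
/** UCB action selection for a 2-action bandit, such as one round of the
 * prisoner's dilemma; q, n, t, and c below are illustrative values. */
public class UcbSelect {
  static int select(double[] q, int[] n, int t, double c) {
    int best = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int a = 0; a < q.length; ++a) {
      // untried actions are considered maximizing and are chosen first
      double score = n[a] == 0 //
          ? Double.POSITIVE_INFINITY
          : q[a] + c * Math.sqrt(Math.log(t) / n[a]);
      if (bestScore < score) {
        bestScore = score;
        best = a;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    double[] q = { 0.3, 0.7 }; // empirical mean rewards per action
    int[] n = { 5, 5 };        // visit counts per action
    System.out.println(select(q, n, 10, 1.0)); // prints 1
  }
}
```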

Integration

Specify the dependency and repository of the subare library in the pom.xml file of your Maven project:

<dependencies>
  <dependency>
    <groupId>ch.ethz.idsc</groupId>
    <artifactId>subare</artifactId>
    <version>0.1.6</version>
  </dependency>
</dependencies>

<repositories>
  <repository>
    <id>subare-mvn-repo</id>
    <url>https://raw.github.com/idsc-frazzoli/subare/mvn-repo/</url>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
    </snapshots>
  </repository>
</repositories>

The source code is attached to every release.

Contributors

Jan Hakenberg, Christian Fluri