ishikota committed Nov 26, 2016
1 parent 8c3cf80 commit a5dca26
Showing 1 changed file with 7 additions and 153 deletions.
160 changes: 7 additions & 153 deletions
@@ -3,169 +3,23 @@
[![Coverage Status](](
## Implemented algorithms
## Supported algorithms
- MonteCarlo
- Sarsa
- QLearning
- SarsaLambda
- QLambda
- deep Q-network (DQN)
- deep QLearning (from DQN paper)
- MonteCarloTreeSearch

- [Sutton & Barto Book: Reinforcement Learning: An Introduction](
- [Human-level control through deep reinforcement learning](

# Getting Started
## Motivation
RL (Reinforcement Learning) algorithms learn which actions are good or bad through **trial and error**.
So what we need to do is **express our learning task in RL format**.

This library provides two template classes for expressing your task in RL format.
- `BaseDomain` class, which represents our learning task
- `ValueFunction` class, which the RL algorithm uses to save the results of trial and error

So let's see how to use these template classes through a simple *maze* example.

## Example. Find the best policy to escape from the maze
Here we use an RL algorithm to find the best policy for escaping from the maze below.
S: start, G: goal, X: wall
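
The layout can be reconstructed from the values that `transit_state` in Step1 uses (a 6 x 9 grid, the listed wall positions, start (0, 0), goal (0, 8)). As a small sketch, the following script prints that grid with the legend above; the `-` symbol for open cells and the `render_maze` helper are choices made for this illustration, not part of kyoka.

```python
# Maze parameters copied from the MazeDomain code in Step1.
HEIGHT, WIDTH = 6, 9
START, GOAL = (0, 0), (0, 8)
WALLS = {(1, 2), (2, 2), (3, 2), (4, 5), (0, 7), (1, 7), (2, 7)}


def render_maze():
    """Return the maze as text: S start, G goal, X wall, - open cell."""
    lines = []
    for row in range(HEIGHT):
        cells = []
        for col in range(WIDTH):
            if (row, col) == START:
                cells.append("S")
            elif (row, col) == GOAL:
                cells.append("G")
            elif (row, col) in WALLS:
                cells.append("X")
            else:
                cells.append("-")
        lines.append("".join(cells))
    return "\n".join(lines)


print(render_maze())
# S------XG
# --X----X-
# --X----X-
# --X------
# -----X---
# ---------
```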

### Step1. Create MazeDomain class
`BaseDomain` requires you to implement five methods:
- `generate_initial_state()`
  - returns the initial state from which the RL algorithm starts its simulation.
- `generate_possible_actions(state)`
  - returns the valid actions in the passed state. The RL algorithm chooses its next action from these.
- `transit_state(state, action)`
  - returns the next state after applying the passed action to the passed state.
- `calculate_reward(state)`
  - returns how good the passed state is.
- `is_terminal_state(state)`
  - returns whether the passed state is a terminal state.
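
To see how these five methods fit together, here is a minimal sketch of an episode loop that calls them in the order an algorithm plausibly would. `CorridorDomain` and `run_episode` are hypothetical names written for this illustration (they are not part of kyoka), and the toy domain deliberately does not inherit from `BaseDomain` so the snippet runs on its own.

```python
import random


class CorridorDomain:
    """Toy stand-in for a domain: walk along 5 cells, goal is cell 4."""

    def generate_initial_state(self):
        return 0

    def generate_possible_actions(self, state):
        return [-1, +1]  # step left or step right

    def transit_state(self, state, action):
        return min(4, max(0, state + action))  # stay inside the corridor

    def calculate_reward(self, state):
        return 1 if self.is_terminal_state(state) else 0

    def is_terminal_state(self, state):
        return state == 4


def run_episode(domain, choose_action, max_steps=1000):
    """Play one episode and return (final_state, total_reward, steps)."""
    state = domain.generate_initial_state()
    total_reward, steps = 0, 0
    while not domain.is_terminal_state(state) and steps < max_steps:
        actions = domain.generate_possible_actions(state)
        action = choose_action(state, actions)
        state = domain.transit_state(state, action)
        total_reward += domain.calculate_reward(state)
        steps += 1
    return state, total_reward, steps


# A purely random policy: trial and error without any learning.
print(run_episode(CorridorDomain(), lambda state, actions: random.choice(actions)))
```

An RL algorithm differs from this loop mainly in how `choose_action` is decided and in what it records after each step.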

Here is `MazeDomain`, which implements these methods for our maze.

    from kyoka.domain.base_domain import BaseDomain

    class MazeDomain(BaseDomain):

        # Action constants. Their definitions are not shown in this snippet, so the
        # values here are assumed; they must be 0..3 because Step2 uses them as table indices.
        ACTION_UP, ACTION_DOWN, ACTION_RIGHT, ACTION_LEFT = 0, 1, 2, 3

        # we use the current position in the maze as the "state", so here we return the start position.
        def generate_initial_state(self):
            return (0, 0)

        # the position of the goal is (row=0, column=8)
        def is_terminal_state(self, state):
            return (0, 8) == state

        # we can always try to move in 4 directions.
        def generate_possible_actions(self, state):
            return [self.ACTION_UP, self.ACTION_DOWN, self.ACTION_RIGHT, self.ACTION_LEFT]

        # the RL algorithm gets a reward only when it reaches the goal.
        def calculate_reward(self, state):
            return 1 if self.is_terminal_state(state) else 0

        def transit_state(self, state, action):
            row, col = state
            wall_position = [(1,2), (2,2), (3,2), (4,5), (0,7), (1,7), (2,7)]
            height, width = 6, 9
            if action == self.ACTION_UP:
                row = max(0, row-1)
            elif action == self.ACTION_DOWN:
                row = min(height-1, row+1)
            elif action == self.ACTION_RIGHT:
                col = min(width-1, col+1)
            elif action == self.ACTION_LEFT:
                col = max(0, col-1)
            if (row, col) not in wall_position:
                return (row, col)
            return state  # if the destination is a wall or the edge of the maze, the position does not change.

OK! Next is `ValueFunction`!

### Step2. Create MazeActionValueFunction class
`BaseActionValueFunction` requires you to implement two methods.
- `calculate_value(state, action)`
  - fetches the current value of the passed state-action pair.
- `update_function(state, action, new_value)`
  - updates the value of the passed state-action pair to the passed value.

The state-action space of this example is very small (|state| x |action| = (6 x 9) x 4 = 216 entries).
So we prepare a table (a 3-dimensional array) and store the values in it.

    from kyoka.value_function.base_action_value_function import BaseActionValueFunction

    class MazeActionValueFunction(BaseActionValueFunction):

        # call this method before learning starts
        def setUp(self):
            # the maze has 6 rows and 9 columns; the table is indexed as table[row][col][action]
            maze_height, maze_width, action_num = 6, 9, 4
            self.table = [[[0 for k in range(action_num)] for j in range(maze_width)] for i in range(maze_height)]

        # just take the value from the table
        def calculate_value(self, state, action):
            row, col = state
            return self.table[row][col][action]

        # just insert the value into the table
        def update_function(self, state, action, new_value):
            row, col = state
            self.table[row][col][action] = new_value
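
For intuition about how an algorithm could use these two methods, here is a minimal sketch of a single tabular Q-learning update. `TableValueFunction` and `q_learning_update` are hypothetical names written for this illustration and are not kyoka's API; the update rule is the standard one described in the Sutton & Barto book, and the default `alpha` and `gamma` simply mirror the values used in Step3.

```python
class TableValueFunction:
    """Stand-alone table with the same two methods (illustration only)."""

    def __init__(self, height=6, width=9, action_num=4):
        # table[row][col][action], all values start at 0
        self.table = [[[0.0] * action_num for _ in range(width)]
                      for _ in range(height)]

    def calculate_value(self, state, action):
        row, col = state
        return self.table[row][col][action]

    def update_function(self, state, action, new_value):
        row, col = state
        self.table[row][col][action] = new_value


def q_learning_update(vf, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.7):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    current = vf.calculate_value(state, action)
    best_next = max(
        (vf.calculate_value(next_state, a) for a in next_actions),
        default=0.0,  # no actions available, e.g. at a terminal state
    )
    target = reward + gamma * best_next
    new_value = current + alpha * (target - current)
    vf.update_function(state, action, new_value)
    return new_value


vf = TableValueFunction()
# One step to the right from the start that happens to earn reward 1.
print(q_learning_update(vf, (0, 0), 2, 1, (0, 1), [0, 1, 2, 3]))
```

Because the algorithm only talks to the value function through these two methods, the table could later be swapped for another representation without changing the update code.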

### Step3. Run the RL algorithm and see its result
OK, let's try `QLearning` for our *maze* task.

    from kyoka.policy.epsilon_greedy_policy import EpsilonGreedyPolicy
    from kyoka.algorithm.td_learning.q_learning import QLearning

    domain = MazeDomain()
    policy = EpsilonGreedyPolicy(eps=0.1)
    value_function = MazeActionValueFunction()

    # You can easily swap in another algorithm, e.g. "rl_algo = Sarsa(alpha=0.1, gamma=0.7)"
    rl_algo = QLearning(alpha=0.1, gamma=0.7)
    rl_algo.setUp(domain, policy, value_function)

That's all! Let's visualize the value function that QLearning learned.
(If you are interested in the `MazeHelper` utility class, please check out the [complete code](
>>> print MazeHelper.visualize_policy(domain, value_function)
S -> v-X-vvvX^

Great! QLearning found a policy that leads us to the goal in 14 steps. (14 steps is the minimum number of steps to the goal!)
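
The 14-step claim can be checked independently of the learning algorithm. The sketch below runs a breadth-first search over the same maze (size, walls, start, and goal copied from `MazeDomain`); `shortest_path_length` is a helper written for this check, not part of kyoka.

```python
from collections import deque

# Maze parameters copied from MazeDomain in Step1.
HEIGHT, WIDTH = 6, 9
START, GOAL = (0, 0), (0, 8)
WALLS = {(1, 2), (2, 2), (3, 2), (4, 5), (0, 7), (1, 7), (2, 7)}


def shortest_path_length(start, goal):
    """Breadth-first search: fewest moves from start to goal, or None."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (row, col), dist = queue.popleft()
        if (row, col) == goal:
            return dist
        for d_row, d_col in ((-1, 0), (1, 0), (0, 1), (0, -1)):
            nxt = (row + d_row, col + d_col)
            in_bounds = 0 <= nxt[0] < HEIGHT and 0 <= nxt[1] < WIDTH
            if in_bounds and nxt not in WALLS and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal is unreachable


print(shortest_path_length(START, GOAL))  # 14
```

The detour is forced by the wall in column 7, which is only open from row 3 downward, so any path must go down three rows and come back up.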
## Getting Started

## Sample code
In the sample directory, we have prepared more practical sample code as Jupyter notebooks and scripts.
You can also check out another RL task example, *tick-tack-toe*.
- [sample: Learning how to escape from maze by RL](
- [sample: Learning tick-tack-toe by RL](
- [example1: Learning how to escape from maze by RL](./examples/maze)
- [example2: Learning tick-tack-toe by RL](./examples/ticktacktoe)

# Installation
You can use pip like this.