Add page for QLearning
ishikota committed Dec 7, 2016
1 parent aecc525 commit 0969cf6
Showing 1 changed file with 110 additions and 12 deletions: docs/sources/algorithm/q_learning.md
# QLearning - off-policy TD learning method
The QLearning method updates the value of a state-action pair `Q(s,a)` in the following way.
```
s  : current state
a  : action to take at state s, chosen by policy PI
r  : reward for the transition (s, a)
s' : next state after taking action a at s
ga : greedy action at s' under the current value function

Q(s,a) = Q(s,a) + alpha * [ r + gamma * Q(s', ga) - Q(s, a) ]
```
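To make the arithmetic concrete, here is the update computed once with made-up numbers:
```python
alpha, gamma = 0.5, 0.9          # learning rate, discounting factor
q_sa, r, q_next = 1.0, 2.0, 3.0  # Q(s,a), reward r, Q(s', ga)

# Q(s,a) <- Q(s,a) + alpha * [ r + gamma * Q(s', ga) - Q(s,a) ]
q_sa = q_sa + alpha * (r + gamma * q_next - q_sa)
print(q_sa)                      # 1.0 + 0.5 * (2.0 + 2.7 - 1.0) = 2.85
```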
The new keyword **greedy action** means the **action which has the maximum estimated value** under the current value function.
(If multiple actions are greedy, choose one of them at random.)
You can get the greedy action like this:
```python
import random

acts = task.generate_possible_actions(state)
vals = [value_function.predict_value(state, action) for action in acts]
max_val = max(vals)
# Collect every action whose value ties for the maximum, then break ties randomly.
greedy_value_and_actions = [(v, a) for v, a in zip(vals, acts) if v == max_val]
_, greedy_action = random.choice(greedy_value_and_actions)
```

This method is also called an *off-policy TD learning* method.
**Off-policy** means that the algorithm uses different policies to choose `a` and `a'`.
QLearning must use the *greedy policy* (the policy that always chooses the greedy action) to choose action `a'`.
But to choose `a`, you can use any policy (in most cases an *epsilon-greedy policy*).
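A minimal sketch of such an epsilon-greedy behavior policy, reusing the `task` / `value_function` interface from the snippet above (`epsilon_greedy_action` is a hypothetical helper, not part of the library):
```python
import random

def epsilon_greedy_action(task, value_function, state, eps=0.1):
    acts = task.generate_possible_actions(state)
    if random.random() < eps:
        return random.choice(acts)  # explore: pick a random action
    vals = [value_function.predict_value(state, a) for a in acts]
    max_val = max(vals)
    # Exploit: pick the greedy action, breaking ties at random.
    return random.choice([a for v, a in zip(vals, acts) if v == max_val])
```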

## Algorithm
```
Parameter:
  a <- alpha : learning rate, in [0,1]
  g <- gamma : discounting factor, in [0,1]

Initialize:
  T  <- your RL task
  PI <- policy used in the algorithm
  Q  <- action value function

Repeat until computational budget runs out:
  S <- generate initial state of task T
  A <- choose action at S by following policy PI
  Repeat until S is terminal state:
    S' <- next state of S after taking action A
    R  <- reward gained by taking action A at state S
    A' <- next action at S' by following policy PI
    GA <- greedy action at S' under action value function Q
    Q(S, A) <- Q(S, A) + a * [ R + g * Q(S', GA) - Q(S, A) ]
    S, A <- S', A'
```
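As a rough, self-contained Python rendering of this loop (the `initial_state`, `step`, `is_terminal`, `possible_actions` and `policy` callables are stand-ins for your task, not library API):
```python
def run_q_learning(episodes, initial_state, step, is_terminal, possible_actions,
                   policy, alpha=0.1, gamma=0.99):
    Q = {}  # action value table keyed by (state, action), default value 0.0
    for _ in range(episodes):
        s = initial_state()
        a = policy(Q, s)
        while not is_terminal(s):
            s2, r = step(s, a)               # S' and R
            if is_terminal(s2):
                td_target, a2 = r, None      # no future value at a terminal state
            else:
                # GA: greedy action at S' under the current Q
                ga = max(possible_actions(s2), key=lambda act: Q.get((s2, act), 0.0))
                td_target = r + gamma * Q.get((s2, ga), 0.0)
                a2 = policy(Q, s2)           # A' chosen by the behavior policy PI
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
            s, a = s2, a2
    return Q
```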

## Value function
The QLearning method provides both **tabular** and **approximation** types of value function.

### QLearningTabularActionValueFunction
If your task is *tabular size*, you can use `QLearningTabularActionValueFunction`.
>If you can store the values of all state-action pairs in memory (e.g. in an array), your task is **tabular** size.

`QLearningTabularActionValueFunction` has 3 abstract methods which define the table for your task:

- `generate_initial_table` : initialize table object and return it here
- `fetch_value_from_table` : define how to fetch value from your table
- `insert_value_into_table` : define how to insert new value into your table

If the shape of your state-action space is S x A, the implementation would look like this:
```python
class MyTabularActionValueFunction(QLearningTabularActionValueFunction):

    def generate_initial_table(self):
        # S x A table of action values, initialized to zero
        # (S = number of states, A = number of actions in your task).
        return [[0 for j in range(A)] for i in range(S)]

    def fetch_value_from_table(self, table, state, action):
        return table[state][action]

    def insert_value_into_table(self, table, state, action, new_value):
        table[state][action] = new_value
```
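Nothing ties the table to a nested list: a `dict` keyed by `(state, action)` tuples would work just as well (for example with sparse state spaces), as long as the three methods agree on the representation.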

### QLearningApproxActionValueFunction
If your task is not *tabular* size, you can use `QLearningApproxActionValueFunction`.

`QLearningApproxActionValueFunction` has 3 abstract methods. You would wrap some prediction model (e.g. a neural network) in these methods:

- `construct_features` : transform state-action pair into feature representation
- `approx_predict_value` : predict value of state-action pair with prediction model you want to use
- `approx_backup` : update your model in supervised learning way with passed input and output pair

An implementation with some neural-network library would look like this:
```python
class MyApproxActionValueFunction(QLearningApproxActionValueFunction):

    def setup(self):
        super(MyApproxActionValueFunction, self).setup()
        self.neuralnet = build_neuralnet_in_some_way()  # build your prediction model here

    def construct_features(self, state, action):
        # Turn the state-action pair into the feature vector your model expects.
        feature1 = do_something(state, action)
        feature2 = do_another_thing(state, action)
        return [feature1, feature2]

    def approx_predict_value(self, features):
        return self.neuralnet.predict(features)

    def approx_backup(self, features, backup_target, alpha):
        # One step of supervised training toward the backup target.
        self.neuralnet.incremental_training(X=features, Y=backup_target)
```
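Here `features` is the output of `construct_features` for the pair `(S, A)` being updated, and `backup_target` is presumably the TD target `R + gamma * Q(S', GA)` from the update rule above, so `approx_backup` amounts to one step of supervised training toward that target.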

## Sample code to start learning
```python
nb_iteration = 1000  # number of GPI iterations (episodes) to run
task = MyTask()      # your own task implementation
policy = EpsilonGreedyPolicy(eps=0.1)
value_func = MyTabularActionValueFunction()
algorithm = QLearning(gamma=0.99)
algorithm.setup(task, policy, value_func)
algorithm.run_gpi(nb_iteration)
```
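To train with the approximation type instead, swap in `value_func = MyApproxActionValueFunction()` from the previous section; the rest of the flow stays the same.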
