## Q-Learning Tutorial
I wanted to understand reinforcement learning a little better for a hackathon project at work. One approach I remember learning about in a machine learning class in college is called Q-learning. I found a blog post that covers this approach at a very high level, so to understand it better I decided to code up the example from the post. The blog post is at: https://towardsdatascience.com/q-bay-explaining-q-learning-with-simulated-auctions-f85bac990c60

As the blog mentions, there are two matrices (or tables) to keep track of in Q-learning. The first is the R table and the second is the Q table.

The R table holds the reward for each move from one place to another.

In the example given in the blog post, we want to teach the computer to reach a certain subway station from any other starting point. I've included the reward table from the blog below:

|Stations |Arsenal | Finsbury Park | Manor House | Seven Sisters | Stamford Hill | Tottenham Hale |
|:--- |:------------:|:-------------: | :-----: |:-------------: |:-------------:| :-----: |
|**Arsenal** |0 | 0 | - | - | - | - |
|**Finsbury Park** |0 | 0 | 0 | 0 | - | - |
|**Manor House** |- | 0 | 0 | - | - | - |
|**Seven Sisters** |- | 0 | - | 0 | 0 | 100 |
|**Stamford Hill** |- | - | - | 0 | 0 | - |
|**Tottenham Hale** |- | - | - | 0 | - | 100 |

Think of the rows as the states (where you are now) and the columns as the actions (where you can go). A dash means that move is not possible. In this case Tottenham Hale is where we want to end up. Let's pretend that the only way to get there is through the Seven Sisters station, and that once we reach Tottenham Hale we want to stay there. That is why both of those positions have a reward of 100.

The Q matrix has exactly the same shape, except that every entry is initialized to 0. I put each of these tables into its own pandas DataFrame below.

In [4]:
import pandas as pd
import numpy as np

In [7]:
Q = np.zeros((6,6))
R = np.zeros((6,6))

# Fill in the rewards
R[3,5] = 100
R[5,5] = 100

# Fill in state action pairs with null values if it is not possible to go from state to action
R[2:,0] = np.nan
R[4:,1] = np.nan
R[0,2] = np.nan
R[3:,2] = np.nan
R[0,3] = np.nan
R[2,3] = np.nan
R[:3,4] = np.nan
R[5,4] = np.nan
R[:3,5] = np.nan
R[4,5] = np.nan

Q_table = pd.DataFrame(Q)
R_table = pd.DataFrame(R)

In [8]:
R_table

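| |0|1|2|3|4|5|
|:--- |:---:|:---:|:---:|:---:|:---:|:---:|
|**0** |0.0|0.0|NaN|NaN|NaN|NaN|
|**1** |0.0|0.0|0.0|0.0|NaN|NaN|
|**2** |NaN|0.0|0.0|NaN|NaN|NaN|
|**3** |NaN|0.0|NaN|0.0|0.0|100.0|
|**4** |NaN|NaN|NaN|0.0|0.0|NaN|
|**5** |NaN|NaN|NaN|0.0|NaN|100.0|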


In [9]:
Q_table

| |0|1|2|3|4|5|
|:--- |:---:|:---:|:---:|:---:|:---:|:---:|
|**0** |0.0|0.0|0.0|0.0|0.0|0.0|
|**1** |0.0|0.0|0.0|0.0|0.0|0.0|
|**2** |0.0|0.0|0.0|0.0|0.0|0.0|
|**3** |0.0|0.0|0.0|0.0|0.0|0.0|
|**4** |0.0|0.0|0.0|0.0|0.0|0.0|
|**5** |0.0|0.0|0.0|0.0|0.0|0.0|
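
The slice assignments used to fill in R are compact but hard to check against the station table by eye. As a rough alternative sketch, the same matrix can be built from an adjacency mapping and labeled with station names. The names `station_names`, `reachable`, `R_alt`, and `R_labeled` are my own and are not from the blog post; the mapping is simply read off the reward table above.

```python
import numpy as np
import pandas as pd

station_names = ["Arsenal", "Finsbury Park", "Manor House",
                 "Seven Sisters", "Stamford Hill", "Tottenham Hale"]

# Hypothetical adjacency mapping read off the reward table:
# state index -> indices of the states that can be reached from it
reachable = {
    0: [0, 1],
    1: [0, 1, 2, 3],
    2: [1, 2],
    3: [1, 3, 4, 5],
    4: [3, 4],
    5: [3, 5],
}

R_alt = np.full((6, 6), np.nan)      # start with every move marked impossible
for state, actions in reachable.items():
    R_alt[state, actions] = 0.0      # allowed moves get a reward of 0
R_alt[3, 5] = 100.0                  # Seven Sisters -> Tottenham Hale
R_alt[5, 5] = 100.0                  # staying at Tottenham Hale

# Label rows (states) and columns (actions) with the station names
R_labeled = pd.DataFrame(R_alt, index=station_names, columns=station_names)
print(R_labeled)
```

With labels in place, a lookup such as `R_labeled.loc["Seven Sisters", "Tottenham Hale"]` reads the same way as the table from the blog.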


Now that both tables are initialized, the first step is to pick a random starting state and a random action that is valid from that state.

In [24]:
import random

In [54]:
start_state = random.randint(0,5)
print("Start state: {0}".format(start_state))

# Not all actions are available in a given state, so choose only among the ones that are
indices = R_table.iloc[start_state,:].index[R_table.iloc[start_state,:].notnull()].values
start_action = random.choice(indices)
print("Start action: {0}".format(start_action))

Start state: 4


3

In [51]:
indices

array([3, 5])

In [29]:
R_table.iloc[start_state,:]

0    0.0
1    0.0
2    NaN
3    NaN
4    NaN
5    NaN
Name: 0, dtype: float64
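
The selection step above can also be wrapped in a couple of small helpers so it is easy to repeat in a training loop later. This is only a sketch: `valid_actions` and `random_state_action` are names I made up, and the version below works on the raw NumPy matrix rather than the DataFrame.

```python
import random
import numpy as np

# Rebuild the reward matrix from the cells above so this sketch runs on its own
R = np.zeros((6, 6))
R[3, 5] = R[5, 5] = 100
R[2:, 0] = R[4:, 1] = np.nan
R[0, 2] = R[3:, 2] = np.nan
R[0, 3] = R[2, 3] = np.nan
R[:3, 4] = R[5, 4] = np.nan
R[:3, 5] = R[4, 5] = np.nan

def valid_actions(R, state):
    """Hypothetical helper: column indices in row `state` that are not NaN."""
    return np.flatnonzero(~np.isnan(R[state]))

def random_state_action(R):
    """Hypothetical helper: pick a random state, then a random valid action from it."""
    state = random.randrange(R.shape[0])
    # Convert to a plain list so random.choice sees an ordinary sequence
    action = random.choice(valid_actions(R, state).tolist())
    return state, action

state, action = random_state_action(R)
print("Start state: {0}, start action: {1}".format(state, action))
```

Because both the state and the action are drawn at random, the printed values change from run to run.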

Given our starting point, we now need to determine which action to take. This is where the Q-learning formula comes in (see the blog for more info). First, initialize the parameters used in that equation:

In [21]:
alpha = .1 # learning rate
gamma = .1 # gamma is the discount factor

# Also create dictionary with stations corresponding to state
stations = {0:'Arsenal',1:'Finsbury Park',2:'Manor House',3:'Seven Sisters',4:'Stamford Hill',5:'Tottenham Hale'}

#### Q equation

In [None]:
(1 - alpha) * Q[start_state,:]

Now we use the Q-learning equation to update the matrix Q:

Q(s<sub>t</sub>, a<sub>t</sub>) = (1 - alpha) Q(s<sub>t</sub>, a<sub>t</sub>) + alpha [R(s<sub>t</sub>, a<sub>t</sub>) + gamma max{Q(s<sub>t+1</sub>, a<sub>t+1</sub>)}]
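
To make the update rule concrete, here is a minimal sketch of a single update in NumPy. The function name `q_update` is hypothetical. It also assumes, as in this station example, that taking action `a` moves you to the station with the same index, so the next state is simply the action. For illustration it uses state 3 (Seven Sisters) and action 5 (Tottenham Hale), with the same `alpha` and `gamma` as above.

```python
import numpy as np

def q_update(Q, R, state, action, alpha=0.1, gamma=0.1):
    """Apply one Q-learning update to Q[state, action] in place and return the new value.

    Assumption for this example: the action index is also the next state's index.
    """
    next_state = action
    best_next = np.max(Q[next_state])   # max over actions of Q(s_{t+1}, a)
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (
        R[state, action] + gamma * best_next
    )
    return Q[state, action]

Q = np.zeros((6, 6))
R = np.zeros((6, 6))       # NaN entries left out: this sketch only touches a valid pair
R[3, 5] = 100
R[5, 5] = 100

new_value = q_update(Q, R, state=3, action=5)
print(new_value)
```

Since every entry of Q starts at 0, the first update reduces to alpha times the reward, which is 0.1 * 100 = 10 for this pair.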

Suppose we start in state 3 (Seven Sisters). According to the R table, we have 4 valid actions to choose from there: Finsbury Park, Seven Sisters, Stamford Hill, and Tottenham Hale. Say we randomly choose Tottenham Hale. Our Bellman equation will then be:

Q(s<sub>t</sub>, a<sub>t</sub>) 
