
We now have a chain environment with 5 states and 2 discrete actions, and we want to estimate its Q-values with Q-learning.
Q-learning updates the Q-values as follows, where $\alpha$ is the learning rate and $\gamma$ is the discount factor:
$$
Q^{new}(s,a) = Q^{old}(s,a) + \alpha [r(s,a,s') + \gamma \max_{a'}Q^{old}(s', a') - Q^{old}(s,a)]
$$
We can therefore store these Q-values in a 2D array $Q_{state, action}$, with one row per state and one column per action.
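To make the update rule concrete, here is a minimal sketch of a single update on such an array. The helper name `q_update` and the example transition are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def q_update(q, state, action, reward, next_state, lr, discount):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = reward + discount * np.max(q[next_state])
    td_error = td_target - q[state, action]
    q[state, action] += lr * td_error
    return td_error

q = np.zeros((5, 2))  # 5 states, 2 actions, all estimates start at 0
# Assumed transition: in state 4, action 0 pays reward 1 and stays in state 4.
q_update(q, state=4, action=0, reward=1.0, next_state=4, lr=0.2, discount=0.9)
print(q[4, 0])  # 0.2, i.e. 0 + 0.2 * (1 + 0.9 * 0 - 0)
```

Only the visited entry `q[4, 0]` changes; every other entry stays at its previous value.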

In [31]:
import numpy as np
import random

In [40]:
# Constants definition
STATES = 5
ACTIONS = 2
REWARDS = np.array([
    [0, 0, 0, 0, 1],    # Action A
    [0.2, 0, 0, 0, 0],  # Action B
]).transpose()  # Transpose it to be (state, action), like q

TRANSITION = np.array([
    [1, 2, 3, 4, 4],  # Action A
    [0, 0, 0, 0, 0],  # Action B
]).transpose()  # Transpose it to be (state, action), like q
DELTA_THRESHOLD = 0.0001
DISCOUNT = 0.9
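
As a quick sanity check on how the two tables are indexed, the sketch below wraps them in a hypothetical `step` helper (not part of the notebook) that returns the next state and reward for a given state and action. The tables are rebuilt here only so the snippet runs on its own.

```python
import numpy as np

# Same 5-state tables as in the cell above, rebuilt so the sketch is self-contained.
REWARDS = np.array([[0, 0, 0, 0, 1], [0.2, 0, 0, 0, 0]]).transpose()
TRANSITION = np.array([[1, 2, 3, 4, 4], [0, 0, 0, 0, 0]]).transpose()

def step(state, action):
    """Hypothetical helper: return (next_state, reward) for a state-action pair."""
    return int(TRANSITION[state, action]), float(REWARDS[state, action])

print(step(0, 0))  # (1, 0.0): action A moves one state to the right, no reward
print(step(4, 0))  # (4, 1.0): action A in the last state stays there and pays 1
print(step(0, 1))  # (0, 0.2): action B in the first state stays there and pays 0.2
```

Both tables are indexed by the current state and the chosen action, which is why they are transposed to match the layout of `q`.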

In [58]:
def solve_q_learning(lr, epsilon):
    q = np.zeros((STATES, ACTIONS))
    for _ in range(10000):
        for state in range(STATES):
            # Pick an action using epsilon-greedy:
            # explore with probability epsilon, otherwise act greedily
            if random.random() < epsilon:
                action = random.randint(0, ACTIONS - 1)
            else:
                action = np.argmax(q[state])
            # Move the estimate towards the TD target
            next_state = TRANSITION[state, action]
            td_target = REWARDS[state, action] + DISCOUNT * np.max(q[next_state])
            q[state, action] += lr * (td_target - q[state, action])
    return q
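
The action-selection step can also be isolated in a small function. This is a minimal sketch that assumes the conventional reading of epsilon-greedy, where `epsilon` is the probability of exploring; the name `epsilon_greedy` is illustrative and does not appear in the notebook.

```python
import random
import numpy as np

def epsilon_greedy(q_row, epsilon):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))  # uniform random action
    return int(np.argmax(q_row))             # greedy action

random.seed(0)
picks = [epsilon_greedy(np.array([0.0, 1.0]), epsilon=0.1) for _ in range(10_000)]
# Action 1 is greedy, so its share should be close to 0.9 + 0.1 / 2 = 0.95.
freq = sum(p == 1 for p in picks) / len(picks)
print(freq)
```

Because Q-learning is off-policy, the amount of exploration mainly affects how quickly every state-action pair gets visited, not the values the estimates converge to.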

In [59]:
solve_q_learning(0.2, 0.1)

array([[ 3.87420489,  3.6867844 ],
       [ 4.3046721 ,  3.4867844 ],
       [ 4.782969  ,  3.4867844 ],
       [ 5.31441   ,  3.4867844 ],
       [ 5.9049    ,  3.4867844 ],
       [ 6.561     ,  3.4867844 ],
       [ 7.29      ,  3.4867844 ],
       [ 8.1       ,  3.4867844 ],
       [ 9.        ,  3.4867844 ],
       [10.        ,  3.4867844 ]])
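
The array printed above has ten rows, so it can be cross-checked against the 10-state version of the chain. The sketch below does this with plain value iteration on the Bellman optimality equation, which is a different technique from the notebook's Q-learning and is used here only as an independent reference. The function name `optimal_q` and the tolerance are assumptions.

```python
import numpy as np

def optimal_q(rewards, transition, discount, tol=1e-10):
    """Iterate Q <- r + discount * max_a' Q(s', a') until the change is below tol."""
    q = np.zeros(rewards.shape)
    while True:
        # q.max(axis=1) holds the best value per state; index it by next state.
        new_q = rewards + discount * q.max(axis=1)[transition]
        if np.abs(new_q - q).max() < tol:
            return new_q
        q = new_q

# 10-state chain tables, laid out as (state, action).
rewards = np.array([[0] * 9 + [1], [0.2] + [0] * 9]).transpose()
transition = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 9], [0] * 10]).transpose()

q_star = optimal_q(rewards, transition, discount=0.9)
print(np.round(q_star, 4))
```

The last state under action A satisfies $Q = 1 + 0.9\,Q$, giving 10, and each step further left multiplies that by 0.9, which agrees with the first column of the printed array.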

# Now, repeat for 10 states!

In [54]:
# Constants definition
STATES = 10
ACTIONS = 2
REWARDS = np.array([
    [0] * 9 + [1],    # Action A
    [0.2] + [0] * 9,  # Action B
]).transpose()  # Transpose it to be (state, action), like q

TRANSITION = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 9],  # Action A
    [0] * 10,                        # Action B
]).transpose()  # Transpose it to be (state, action), like q
DELTA_THRESHOLD = 0.0001
DISCOUNT = 0.9

In [None]:
solve_q_learning(0.2,0.1)

And now let's run a grid search over the learning rate and epsilon!

In [64]:
value_range = np.arange(0.1,0.95, 0.05)
size = len(value_range)
results = np.zeros((size, size, STATES, ACTIONS))
for i, lr in enumerate(value_range):
    for j, epsilon in enumerate(value_range):
        results[i, j] = solve_q_learning(lr, epsilon)
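
A 4D array like `results` is hard to read when printed in full. One possible way to summarize it, sketched here on a small synthetic array rather than on the notebook's actual results, is to reduce each (learning rate, epsilon) cell to its largest absolute gap from a reference Q-table. The name `max_deviation` and the reference table are assumptions for illustration.

```python
import numpy as np

def max_deviation(results, reference):
    """Largest absolute gap to `reference` for every (lr, epsilon) grid cell."""
    return np.abs(results - reference).max(axis=(2, 3))

# Synthetic stand-in for `results`: a 2x2 grid over 3 states and 2 actions.
reference = np.arange(6, dtype=float).reshape(3, 2)
results = np.broadcast_to(reference, (2, 2, 3, 2)).copy()
results[1, 0, 2, 1] += 0.5  # pretend one configuration has not converged yet

deviation = max_deviation(results, reference)
print(deviation)
```

The output has one number per grid cell, so a single glance shows which hyperparameter combinations ended up away from the reference.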

In [65]:
results

array([[[[ 3.87420489,  3.6867844 ],
         [ 4.3046721 ,  3.4867844 ],
         [ 4.782969  ,  3.4867844 ],
         ...,
         [ 8.1       ,  3.4867844 ],
         [ 9.        ,  3.4867844 ],
         [10.        ,  3.4867844 ]],

        [[ 3.87420489,  3.6867844 ],
         [ 4.3046721 ,  3.4867844 ],
         [ 4.782969  ,  3.4867844 ],
         ...,
         [ 8.1       ,  3.4867844 ],
         [ 9.        ,  3.4867844 ],
         [10.        ,  3.4867844 ]],

        [[ 3.87420489,  3.6867844 ],
         [ 4.3046721 ,  3.4867844 ],
         [ 4.782969  ,  3.4867844 ],
         ...,
         [ 8.1       ,  3.4867844 ],
         [ 9.        ,  3.4867844 ],
         [10.        ,  3.4867844 ]],

        ...,

        [[ 3.87420489,  3.6867844 ],
         [ 4.3046721 ,  3.4867844 ],
         [ 4.782969  ,  3.4867844 ],
         ...,
         [ 8.1       ,  3.4867844 ],
         [ 9.        ,  3.4867844 ],
         [10.        ,  3.4867844 ]],

        [[ 3.87420489,  3.6867844 