![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

## Lecture 2 Support Notebook
# TOP

### Table of Contents
<p>
<div class="lev1">
    <a href="#Law-of-Large-Numbers">
        <span class="toc-item-num">1&nbsp;&nbsp;</span>
        Law of Large Numbers
    </a>
</div>
<div class="lev1">
    <a href="#Cleaning-Robot-GridWorld"><span class="toc-item-num">2&nbsp;&nbsp;</span>
        Cleaning Robot GridWorld
    </a>
</div>
<div class="lev1">
    <a href="#Monte-Carlo-Method"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>
        Monte Carlo Method
    </a>
</div>
<div class="lev1">
    <a href="#MC-Exploring-Starts"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>
        MC Exploring Starts
    </a>
</div>
<div class="lev1">
    <a href="#MC-Control"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>
        MC Control
    </a>
</div>
<div class="lev1">
    <a href="#TD(0)"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>
        TD(0)
    </a>
</div>
<div class="lev1">
    <a href="#TD(lambda)"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>
        TD(lambda)
    </a>
</div>
<div class="lev1">
    <a href="#SARSA"><span class="toc-item-num">5&nbsp;&nbsp;</span>
        SARSA
    </a>
</div>
<div class="lev1">
    <a href="#Q-learning"><span class="toc-item-num">6&nbsp;&nbsp;</span>
        Q-learning
    </a>
</div>

In [1]:
#!/usr/bin/env python

#MIT License
#Copyright (c) 2017 Massimiliano Patacchiola
#
#Permission is hereby granted, free of charge, to any person obtaining a copy
#of this software and associated documentation files (the "Software"), to deal
#in the Software without restriction, including without limitation the rights
#to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
#copies of the Software, and to permit persons to whom the Software is
#furnished to do so, subject to the following conditions:
#
#The above copyright notice and this permission notice shall be included in all
#copies or substantial portions of the Software.
#
#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
#IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
#FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
#AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
#LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
#OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
#SOFTWARE.

In [3]:
# INSTALL THIS PACKAGE THE FIRST TIME YOU RUN THIS INSTANCE
# !pip install tqdm

In [1]:
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.patches as mpatches
import matplotlib.lines as mlines
#%pylab inline
import random
%matplotlib inline

import sys

if "../" not in sys.path:
    sys.path.append("../")  # this line is for use in regular SageMaker
if "MLU-Repo-RL/RL/" not in sys.path:
    sys.path.append("MLU-Repo-RL/RL/")  # this line is for use in SageMaker Studio
from lib_rl.gridworld import GridWorld

# Law of Large Numbers
Rolling a fair six-sided die has the expected value
$(1+2+3+4+5+6)/6 = 3.5$


In [2]:
# Throwing a die N times and estimating the expectation from the sample mean
dice = np.random.randint(low=1, high=7, size=3)
print("Expectation (rolling 3 times): " + str(np.mean(dice)))
dice = np.random.randint(low=1, high=7, size=10)
print("Expectation (rolling 10 times): " + str(np.mean(dice)))
dice = np.random.randint(low=1, high=7, size=100)
print("Expectation (rolling 100 times): " + str(np.mean(dice)))
dice = np.random.randint(low=1, high=7, size=1000)
print("Expectation (rolling 1000 times): " + str(np.mean(dice)))
dice = np.random.randint(low=1, high=7, size=100000)
print("Expectation (rolling 100000 times): " + str(np.mean(dice)))

Expectation (rolling 3 times): 4.666666666666667
Expectation (rolling 10 times): 3.1
Expectation (rolling 100 times): 3.41
Expectation (rolling 1000 times): 3.444
Expectation (rolling 100000 times): 3.49895


## Learn
As you can see, the estimate of the expectation converges to the true value of 3.5. <br/>
In MC reinforcement learning we do exactly the same thing, except that we estimate the **value** of each state from the returns observed in each episode. <br/>
As with the die, the more episodes we take into account, the more accurate our estimate will be.
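
The same estimate can also be maintained incrementally, one sample at a time, which is closer to how a value estimate is refined episode by episode. The following is a minimal sketch and not part of the original notebook; the helper name `running_mean` and the fixed seed are illustrative assumptions.

```python
import numpy as np

def running_mean(samples):
    """Return the sample mean after each new sample (hypothetical helper)."""
    mean = 0.0
    means = []
    for n, x in enumerate(samples, start=1):
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        mean += (x - mean) / n
        means.append(mean)
    return np.array(means)

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
rolls = rng.integers(low=1, high=7, size=100000)
estimates = running_mean(rolls)
print("Estimate after 10 rolls:", estimates[9])
print("Estimate after 100000 rolls:", estimates[-1])
```

The incremental form gives the same result as `np.mean` over all samples seen so far, without storing the running sum and the count separately.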

<a href="#TOP">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        TOP
</a>

# Cleaning Robot GridWorld
+ The class GridWorld creates a grid world of any size and lets us add obstacles and terminal states.
+ The cleaning robot moves in the grid world following a specific policy. Let's bring our 4x3 world to life.

The class GridWorld has many similarities with the [OpenAI Gym package](https://gym.openai.com/):
+ The method **step()** advances the environment to time t+1 and returns:
    + the **reward**,
    + the **observation** (position of the robot), and
    + a variable called **done**, which is True when the episode is finished (the robot reached a terminal state).


In [3]:
# Declare our environment variable
# The world has 3 rows and 4 columns
env = GridWorld(3, 4)
# Define the state matrix
# Adding obstacle at position (1,1)
# Adding the two terminal states
state_matrix = np.zeros((3,4))
state_matrix[0, 3] = 1
state_matrix[1, 3] = 1
state_matrix[1, 1] = -1
print(state_matrix)

[[ 0.  0.  0.  1.]
 [ 0. -1.  0.  1.]
 [ 0.  0.  0.  0.]]


In [4]:
# Define the reward matrix
# The reward is -0.04 for all states but the terminal
reward_matrix = np.full((3,4), -0.04)
reward_matrix[0, 3] = 1
reward_matrix[1, 3] = -1
# Define the transition matrix
# For each one of the four actions there is a probability
transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
                              [0.1, 0.8, 0.1, 0.0],
                              [0.0, 0.1, 0.8, 0.1],
                              [0.1, 0.0, 0.1, 0.8]])
print(transition_matrix)

[[0.8 0.1 0.  0.1]
 [0.1 0.8 0.1 0. ]
 [0.  0.1 0.8 0.1]
 [0.1 0.  0.1 0.8]]
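
One way to read this matrix: row `a` holds the probabilities that the move actually executed is UP, RIGHT, DOWN, or LEFT when action `a` is requested (using the 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT coding of this notebook). The sketch below samples the executed move under that reading. It is an assumption about how `GridWorld.step()` uses the matrix, and `sample_executed_action` is a hypothetical helper, not part of `lib_rl`.

```python
import numpy as np

transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
                              [0.1, 0.8, 0.1, 0.0],
                              [0.0, 0.1, 0.8, 0.1],
                              [0.1, 0.0, 0.1, 0.8]])

def sample_executed_action(action, transition_matrix, rng):
    """Sample the move actually executed when `action` is requested."""
    return int(rng.choice(4, p=transition_matrix[int(action)]))

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Request UP (0) many times and count which move is actually executed
executed = [sample_executed_action(0, transition_matrix, rng) for _ in range(10000)]
frequencies = np.bincount(executed, minlength=4) / len(executed)
print(frequencies)
```

Under this reading, a request for UP should be executed as UP roughly 80% of the time, as RIGHT or LEFT roughly 10% each, and never as DOWN.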


In [5]:
# Define the policy matrix
# 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT, NaN=Obstacle, -1=NoAction
# This is the optimal policy for world with reward=-0.04
policy_matrix = np.array([[1,      1,  1,  -1],
                          [0, np.nan,  0,  -1],
                          [0,      3,  3,   3]])
# Set the matrices 
env.setStateMatrix(state_matrix)
env.setRewardMatrix(reward_matrix)
env.setTransitionMatrix(transition_matrix)
print(policy_matrix)

[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]


+ In a few lines I defined a grid world with the properties of our example.
+ The policy is the optimal policy for a reward of -0.04, as we saw in the first lecture.
+ Let's reset the environment (move the robot to the starting position) and use the render() method to display the world.

In [6]:
#Reset the environment
observation = env.reset()
#Display the world printing on terminal
env.render()

 -  -  -  * 
 -  #  -  * 
 ○  -  -  - 



Now we can run an episode using a for loop:

In [7]:
for _ in range(1000):
    action = policy_matrix[observation[0], observation[1]]
    observation, reward, done = env.step(action)
    print("")
    print("ACTION: " + str(action))
    print("REWARD: " + str(reward))
    print("DONE: " + str(done))
    env.render()
    if done: break


ACTION: 0.0
REWARD: -0.04
DONE: False
 -  -  -  * 
 ○  #  -  * 
 -  -  -  - 


ACTION: 0.0
REWARD: -0.04
DONE: False
 ○  -  -  * 
 -  #  -  * 
 -  -  -  - 


ACTION: 1.0
REWARD: -0.04
DONE: False
 -  ○  -  * 
 -  #  -  * 
 -  -  -  - 


ACTION: 1.0
REWARD: -0.04
DONE: False
 -  -  ○  * 
 -  #  -  * 
 -  -  -  - 


ACTION: 1.0
REWARD: -0.04
DONE: False
 -  -  -  * 
 -  #  ○  * 
 -  -  -  - 


ACTION: 0.0
REWARD: -0.04
DONE: False
 -  -  ○  * 
 -  #  -  * 
 -  -  -  - 


ACTION: 1.0
REWARD: 1.0
DONE: True
 -  -  -  ○ 
 -  #  -  * 
 -  -  -  - 



<a href="#TOP">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        TOP
</a>

# Monte Carlo Method
Here I will use: 
+ A discount factor of $\gamma=0.999$
+ The best policy $\pi^*$
+ The same transition model used in the previous lecture*. 

*Remember that with the current transition model the robot moves in the desired direction only 80% of the time.

### First, the Return Function
$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
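
As a worked example of the return, consider a short reward sequence of two -0.04 steps followed by the +1 terminal reward, with the discount factor 0.999 used in this section. The sketch below is illustrative only (the reward sequence is an assumption, not output from the environment) and computes the return both as a direct sum and by the equivalent backward recursion.

```python
gamma = 0.999
rewards = [-0.04, -0.04, 1.0]  # illustrative sequence ending in the +1 terminal state

# Direct sum: G = sum over k of gamma^k * r_k
direct = sum(r * gamma**k for k, r in enumerate(rewards))

# Equivalent backward recursion: G_k = r_k + gamma * G_{k+1}
backward = 0.0
for r in reversed(rewards):
    backward = r + gamma * backward

print(direct, backward)
```

Both forms give approximately 0.918041, that is, $-0.04 + 0.999 \cdot (-0.04) + 0.999^2 \cdot 1$.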

In [8]:
def get_return(state_list, gamma):
    """Compute the discounted return of a (partial) episode.

    Parameters
    ----------
    state_list : list of tuple
        List of (position, reward) tuples, one per visit.
    gamma : float
        The discount factor.

    Returns
    -------
    return_value : float
        The discounted sum of the rewards in state_list.
    """
    return_value = 0
    for counter, visit in enumerate(state_list):
        reward = visit[1]
        # Discount each reward by gamma raised to its step index
        return_value += reward * np.power(gamma, counter)
    return return_value

In [9]:
# Defining an empty utility matrix
utility_matrix = np.zeros((3,4))
# init with 1.0e-10 to avoid division by zero
running_mean_matrix = np.full((3,4), 1.0e-10) 
gamma = 0.999 #discount factor
tot_epoch = 50000
print_epoch = 10000

for epoch in range(tot_epoch):
    #Starting a new episode
    episode_list = list()
    #Reset and return the first observation
    observation= env.reset(exploring_starts=False)
    for _ in range(1000):
        # Take the action from the action matrix
        action = policy_matrix[observation[0], observation[1]]
        # Move one step in the environment and get obs and reward
        observation, reward, done = env.step(action)
        # Append the visit in the episode list
        episode_list.append((observation, reward))
        if done: break
    # The episode is finished, now estimating the utilities
    counter = 0
    # Checkup to identify if it is the first visit to a state
    checkup_matrix = np.zeros((3,4))
    # This cycle is the implementation of First-Visit MC.
    # For each state stored in the episode list it checks if it
    # is the first visit and then estimates the return.
    for visit in episode_list:
        observation = visit[0]
        row = observation[0]
        col = observation[1]
        reward = visit[1]
        if(checkup_matrix[row, col] == 0):
            return_value = get_return(episode_list[counter:], gamma)
            running_mean_matrix[row, col] += 1
            utility_matrix[row, col] += return_value
            checkup_matrix[row, col] = 1
        counter += 1
    if(epoch % print_epoch == 0):
        print("Utility matrix after " + str(epoch+1) + " iterations:") 
        print(utility_matrix / running_mean_matrix)

#Time to check the state-value matrix obtained
print("Utility matrix after " + str(tot_epoch) + " iterations:")
print(utility_matrix / running_mean_matrix)

Utility matrix after 1 iterations:
[[0.71385957 0.75461418 0.83624584 1.        ]
 [0.67314571 0.         0.87712296 0.        ]
 [0.         0.         0.         0.        ]]
Utility matrix after 10001 iterations:
[[ 0.81258549  0.87021201  0.92147643  1.        ]
 [ 0.76184848  0.          0.70314085 -1.        ]
 [ 0.70591512  0.65226754  0.          0.        ]]
Utility matrix after 20001 iterations:
[[ 0.81053044  0.86855081  0.91984667  1.        ]
 [ 0.75965483  0.          0.68475416 -1.        ]
 [ 0.70532575  0.65204334  0.          0.        ]]
Utility matrix after 30001 iterations:
[[ 0.80999002  0.86791341  0.91915506  1.        ]
 [ 0.75895368  0.          0.68257497 -1.        ]
 [ 0.70340358  0.65265229  0.          0.        ]]
Utility matrix after 40001 iterations:
[[ 0.80932971  0.86716144  0.9184321   1.        ]
 [ 0.75819847  0.          0.67667488 -1.        ]
 [ 0.70267753  0.65264453  0.          0.        ]]
Utility matrix after 50000 iterations:
[[ 0.8089530

The script above prints the estimate of the state-value matrix every 10,000 iterations (the value of print_epoch).
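
The first-visit bookkeeping in the loop above calls get_return() on a slice of the episode for every new state. A possible alternative, sketched below under the assumption that states are stored as hashable (row, col) tuples, computes all returns in a single backward pass and then keeps the return that follows the first visit to each state. The name `first_visit_returns` and the sample episode are hypothetical.

```python
def first_visit_returns(episode, gamma):
    """Map each state to the return following its first visit.

    `episode` is a list of (state, reward) tuples, with states given
    as hashable (row, col) tuples (an assumption of this sketch).
    """
    # Backward pass: G_i = r_i + gamma * G_{i+1}
    returns = [0.0] * len(episode)
    running = 0.0
    for i in reversed(range(len(episode))):
        running = episode[i][1] + gamma * running
        returns[i] = running
    # Forward pass: keep only the return at the first visit of each state
    first = {}
    for (state, _), g in zip(episode, returns):
        if state not in first:
            first[state] = g
    return first

# Illustrative episode: the robot revisits (2, 0) before reaching the goal
episode = [((2, 0), -0.04), ((1, 0), -0.04), ((2, 0), -0.04), ((0, 3), 1.0)]
result = first_visit_returns(episode, gamma=0.999)
print(result)
```

For the revisited state (2, 0), only the return from its first occurrence is kept, which is the defining property of First-Visit MC.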

<a href="#TOP">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        TOP
</a>

# MC Exploring Starts
+ To enable exploring starts in our code, the only thing to do is set the parameter exploring_starts in the reset() function to True.
+ Every time a new episode begins, the robot will start from a random position. 
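
A random start of this kind could be drawn as in the sketch below. This is an assumption about what `reset(exploring_starts=True)` does, not the actual `lib_rl` implementation: it picks uniformly among the free cells, meaning those marked 0 in the state matrix. The helper `random_start` is hypothetical.

```python
import numpy as np

state_matrix = np.zeros((3, 4))
state_matrix[0, 3] = 1   # terminal state
state_matrix[1, 3] = 1   # terminal state
state_matrix[1, 1] = -1  # obstacle

def random_start(state_matrix, rng):
    """Pick a uniformly random free cell (value 0) as (row, col)."""
    free_cells = np.argwhere(state_matrix == 0)
    row, col = free_cells[rng.integers(len(free_cells))]
    return int(row), int(col)

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
starts = {random_start(state_matrix, rng) for _ in range(1000)}
print(sorted(starts))
```

Starting from every free cell lets the estimate cover states that the fixed start and the fixed policy rarely or never reach, such as the bottom-right corner.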

In [10]:
# Defining an empty utility matrix
utility_matrix = np.zeros((3,4))
# init with 1.0e-10 to avoid division by zero
running_mean_matrix = np.full((3,4), 1.0e-10) 
gamma = 0.999 #discount factor
tot_epoch = 50000
print_epoch = 10000

for epoch in range(tot_epoch):
    #Starting a new episode
    episode_list = list()
    #Reset and return the first observation
    observation= env.reset(exploring_starts=True)
    for _ in range(1000):
        # Take the action from the action matrix
        action = policy_matrix[observation[0], observation[1]]
        # Move one step in the environment and get obs and reward
        observation, reward, done = env.step(action)
        # Append the visit in the episode list
        episode_list.append((observation, reward))
        if done: break
    # The episode is finished, now estimating the utilities
    counter = 0
    # Checkup to identify if it is the first visit to a state
    checkup_matrix = np.zeros((3,4))
    # This cycle is the implementation of First-Visit MC.
    # For each state stored in the episode list it checks if it
    # is the first visit and then estimates the return.
    for visit in episode_list:
        observation = visit[0]
        row = observation[0]
        col = observation[1]
        reward = visit[1]
        if(checkup_matrix[row, col] == 0):
            return_value = get_return(episode_list[counter:], gamma)
            running_mean_matrix[row, col] += 1
            utility_matrix[row, col] += return_value
            checkup_matrix[row, col] = 1
        counter += 1
    if(epoch % print_epoch == 0):
        print("Utility matrix after " + str(epoch+1) + " iterations:") 
        print(utility_matrix / running_mean_matrix)

#Time to check the utility matrix obtained
print("Utility matrix after " + str(tot_epoch) + " iterations:")
print(utility_matrix / running_mean_matrix)

Utility matrix after 1 iterations:
[[0.87712296 0.918041   0.959      1.        ]
 [0.79540959 0.         0.         0.        ]
 [0.         0.         0.         0.        ]]
Utility matrix after 10001 iterations:
[[ 0.81026248  0.868537    0.91979558  1.        ]
 [ 0.75772467  0.          0.66832482 -1.        ]
 [ 0.70289022  0.65622204  0.60444678  0.34647217]]
Utility matrix after 20001 iterations:
[[ 0.80925457  0.86744368  0.91851261  1.        ]
 [ 0.75718637  0.          0.65995054 -1.        ]
 [ 0.70006232  0.65313244  0.60595218  0.33536195]]
Utility matrix after 30001 iterations:
[[ 0.80834709  0.86671298  0.91794931  1.        ]
 [ 0.75569356  0.          0.65826502 -1.        ]
 [ 0.69747972  0.64923281  0.60282439  0.34505005]]
Utility matrix after 40001 iterations:
[[ 0.80707175  0.8653966   0.91682541  1.        ]
 [ 0.75492972  0.          0.65628554 -1.        ]
 [ 0.69742164  0.64818843  0.59982525  0.35682387]]
Utility matrix after 50000 iterations:
[[ 0.8084053

<a href="#TOP">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        TOP
</a>

# MC Control
We will again use the function **get_return()**, but this time the input will be a list of tuples (observation, action, reward).

In [11]:
def get_return(state_list, gamma):
    '''Compute the discounted return for a list of visits.

    Each visit is a tuple (observation, action, reward).
    @return the discounted return
    '''
    return_value = 0
    for counter, visit in enumerate(state_list):
        reward = visit[2]
        return_value += reward * np.power(gamma, counter)
    return return_value

We will also use a new function, **update_policy()**, which makes the policy greedy with respect to the current state-action function.

In [12]:
def update_policy(episode_list, policy_matrix, state_action_matrix):
    '''Make the policy greedy with respect to the state-action matrix.

    Only the states visited in episode_list are updated.
    @return the updated policy
    '''
    for visit in episode_list:
        observation = visit[0]
        # Flatten (row, col) into a column index of the 4-column grid
        col = observation[1] + (observation[0]*4)
        if(policy_matrix[observation[0], observation[1]] != -1):      
            policy_matrix[observation[0], observation[1]] = \
                np.argmax(state_action_matrix[:,col])
    return policy_matrix

+ The update_policy() function implements the improvement step of generalized policy iteration (GPI), and it is essential for convergence to an optimal policy.
+ I will also use the function print_policy(), already used in the previous lecture, to print the policy on the terminal using the symbols ^, >, v, <, *, #.
+ In the main() function, I initialize a random policy matrix and the state_action_matrix, which contains the utility of each state-action pair.
+ The matrix can be initialized to zeros or to random values; it does not matter. With the random initialization used here, entries for state-action pairs that are never visited (the terminal states and the obstacle) print as very large numbers, because they are divided by the initial count of 1.0e-10.
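
The greedy improvement step can also be written in vectorized form. The sketch below is an illustration rather than the notebook's method: unlike update_policy(), which only touches states visited in the last episode, it makes the policy greedy in every state at once. The name `greedy_policy` is hypothetical, and the column layout (column index = col + row * number of columns) follows the convention used in this notebook.

```python
import numpy as np

def greedy_policy(state_action_matrix, policy_matrix):
    """Return a copy of policy_matrix made greedy w.r.t. the Q matrix.

    state_action_matrix has shape (n_actions, n_rows * n_cols).
    """
    n_rows, n_cols = policy_matrix.shape
    greedy = np.argmax(state_action_matrix, axis=0).reshape(n_rows, n_cols)
    new_policy = policy_matrix.copy()
    # Keep terminal states (-1) and obstacles (NaN) untouched
    movable = ~np.isnan(policy_matrix) & (policy_matrix != -1)
    new_policy[movable] = greedy[movable]
    return new_policy

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
q = rng.random((4, 12))         # random Q values, for illustration only
policy = np.zeros((3, 4))
policy[1, 1] = np.nan                 # obstacle
policy[0, 3] = policy[1, 3] = -1      # terminal states
new_policy = greedy_policy(q, policy)
print(new_policy)
```

Updating only visited states, as the notebook does, keeps the per-episode cost proportional to the episode length; the vectorized form is mainly convenient for inspecting the greedy policy implied by a given Q matrix.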

In [13]:
def print_policy(policy_matrix):
    '''Print the policy using specific symbol.
    * terminal state
    ^ > v < up, right, down, left
    # obstacle
    '''
    counter = 0
    shape = policy_matrix.shape
    policy_string = ""
    for row in range(shape[0]):
        for col in range(shape[1]):
            if(policy_matrix[row,col] == -1): policy_string += " *  "            
            elif(policy_matrix[row,col] == 0): policy_string += " ^  "
            elif(policy_matrix[row,col] == 1): policy_string += " >  "
            elif(policy_matrix[row,col] == 2): policy_string += " v  "           
            elif(policy_matrix[row,col] == 3): policy_string += " <  "
            elif(np.isnan(policy_matrix[row,col])): policy_string += " #  "
            counter += 1
        policy_string += '\n'
    print(policy_string)

Here we have the main loop of the algorithm, which is not so different from the loop used for MC prediction.

In [14]:
def main():

    env = GridWorld(3, 4)

    #Define the state matrix
    state_matrix = np.zeros((3,4))
    state_matrix[0, 3] = 1
    state_matrix[1, 3] = 1
    state_matrix[1, 1] = -1
    print("State Matrix:")
    print(state_matrix)

    #Define the reward matrix
    reward_matrix = np.full((3,4), -0.04)
    reward_matrix[0, 3] = 1
    reward_matrix[1, 3] = -1
    print("Reward Matrix:")
    print(reward_matrix)

    #Define the transition matrix
    transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
                                  [0.1, 0.8, 0.1, 0.0],
                                  [0.0, 0.1, 0.8, 0.1],
                                  [0.1, 0.0, 0.1, 0.8]])

    #Random policy
    policy_matrix = np.random.randint(low=0, high=4, size=(3, 4)).astype(np.float32)
    policy_matrix[1,1] = np.nan #NaN for the obstacle at (1,1)
    policy_matrix[0,3] = policy_matrix[1,3] = -1 #No action for the terminal states

    #Set the matrices in the world
    env.setStateMatrix(state_matrix)
    env.setRewardMatrix(reward_matrix)
    env.setTransitionMatrix(transition_matrix)

    state_action_matrix = np.random.random_sample((4,12)) # Q
    #init with 1.0e-10 to avoid division by zero
    running_mean_matrix = np.full((4,12), 1.0e-10) 
    gamma = 0.999
    tot_epoch = 500000
    print_epoch = 3000

    for epoch in range(tot_epoch):
        #Starting a new episode
        episode_list = list()
        #Reset and return the first observation and reward
        observation = env.reset(exploring_starts=True)
        #action = np.random.choice(4, 1)
        #action = policy_matrix[observation[0], observation[1]]
        #episode_list.append((observation, action, reward))
        is_starting = True
        for _ in range(1000):
            #Take the action from the action matrix
            action = policy_matrix[observation[0], observation[1]]
            #If the episode just started then it is
            #necessary to choose a random action (exploring starts)
            if(is_starting): 
                action = np.random.randint(0, 4)
                is_starting = False      
            #Move one step in the environment and get obs and reward
            new_observation, reward, done = env.step(action)
            #Append the visit in the episode list
            episode_list.append((observation, action, reward))
            observation = new_observation
            if done: break
        #The episode is finished, now estimating the utilities
        counter = 0
        #Checkup to identify if it is the first visit to a state
        checkup_matrix = np.zeros((4,12))
        #This loop is the implementation of First-Visit MC.
        #For each state-action pair stored in the episode list, check if it
        #is the first visit and then estimate the return.
        for visit in episode_list:
            observation = visit[0]
            action = visit[1]
            col = int(observation[1] + (observation[0]*4))
            row = int(action)
            if(checkup_matrix[row, col] == 0):
                return_value = get_return(episode_list[counter:], gamma)
                running_mean_matrix[row, col] += 1
                state_action_matrix[row, col] += return_value
                checkup_matrix[row, col] = 1
            counter += 1
        #Policy Update
        policy_matrix = update_policy(episode_list, 
                                      policy_matrix, 
                                      state_action_matrix/running_mean_matrix)
        #Printing
        if(epoch % print_epoch == 0):
            print("")
            print("State-Action matrix after " + str(epoch+1) + " iterations:") 
            print(state_action_matrix / running_mean_matrix)
            print("Policy matrix after " + str(epoch+1) + " iterations:") 
            print(policy_matrix)
            print_policy(policy_matrix)
    #Time to check the state-action matrix obtained
    print("State-Action matrix after " + str(tot_epoch) + " iterations:")
    print(state_action_matrix / running_mean_matrix)


if __name__ == "__main__":
    main()

State Matrix:
[[ 0.  0.  0.  1.]
 [ 0. -1.  0.  1.]
 [ 0.  0.  0.  0.]]
Reward Matrix:
[[-0.04 -0.04 -0.04  1.  ]
 [-0.04 -0.04 -0.04 -1.  ]
 [-0.04 -0.04 -0.04 -0.04]]

State-Action matrix after 1 iterations:
[[ 7.71696287e+09  9.33849552e+08  3.97822401e+09  6.94396100e+09
  -2.49301551e+01  8.55541615e+09  8.31173144e+09  2.26406827e+09
   7.33239643e+09  6.03950161e+08  3.88134064e+09  9.70449289e+09]
 [ 6.41951536e+09 -2.48911585e+01  7.67571513e+08  5.51195807e+09
   2.51512823e+09  2.12100012e+09  5.45856439e+08  2.14481050e+09
   1.58636431e+09  8.80395855e+09  5.57438991e+08  7.02723037e+09]
 [ 9.64446449e+08 -2.47451203e+01  8.16166341e+08  1.16510824e+09
   7.63951689e+09  5.30423046e+09  9.95403416e+09  5.01049261e+09
   9.34478792e+09  3.87166691e+09  4.90449884e+09  3.34983171e+09]
 [-2.45579184e+01  2.10857095e+09  9.87472961e+09  1.14452396e+09
   9.58739992e+09  3.45244244e+09  1.96063362e+09  5.59657106e+09
   5.62846834e+09  5.93031962e+09  1.53663511e+09  8.43722100


State-Action matrix after 27001 iterations:
[[ 6.86398393e-01  5.73156852e-01  8.77615740e-01  6.94396100e+09
   7.87262572e-01  8.55541615e+09  6.95310121e-01  2.26406827e+09
   7.18793442e-01  3.84775721e-01  5.86178310e-01 -8.45356943e-01]
 [ 8.17010013e-01  9.05600649e-01  9.54084881e-01  5.51195807e+09
   6.41758090e-01  2.12100012e+09 -6.43000142e-01  2.14481050e+09
   1.67473685e-01 -3.56836503e-03 -7.19440024e-02  6.34310038e-02]
 [ 6.12647000e-01  8.01640308e-01  6.45049906e-01  1.16510824e+09
   5.64179658e-01  5.30423046e+09  3.34176950e-01  5.01049261e+09
   4.73551800e-01  4.27931500e-01  1.44580428e-01  1.95867935e-01]
 [ 7.30179133e-01  6.61175838e-01  7.42198927e-01  1.14452396e+09
   4.17356277e-01  3.45244244e+09  5.28509358e-01  5.59657106e+09
   5.51289966e-01  6.31124744e-01  4.43154496e-01  3.62625095e-01]]
Policy matrix after 27001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  


State-Act


State-Action matrix after 54001 iterations:
[[ 7.47309328e-01  6.77558505e-01  9.03840340e-01  6.94396100e+09
   7.91667037e-01  8.55541615e+09  6.99820671e-01  2.26406827e+09
   7.29571086e-01  4.49539664e-01  6.09122204e-01 -7.87792775e-01]
 [ 8.31272684e-01  9.04781967e-01  9.55348090e-01  5.51195807e+09
   7.00417668e-01  2.12100012e+09 -6.46444262e-01  2.14481050e+09
   3.43748411e-01  2.30238635e-01  1.29171708e-01  1.75406347e-01]
 [ 6.89071549e-01  8.35304453e-01  6.68916921e-01  1.16510824e+09
   6.33536582e-01  5.30423046e+09  4.08386799e-01  5.01049261e+09
   5.93964732e-01  5.38360022e-01  3.34360469e-01  2.62997994e-01]
 [ 7.65573707e-01  7.24567237e-01  8.02484638e-01  1.14452396e+09
   5.62806540e-01  3.45244244e+09  5.94020215e-01  5.59657106e+09
   6.15748639e-01  6.65182546e-01  5.40174296e-01  4.02508300e-01]]
Policy matrix after 54001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  


State-Act


State-Action matrix after 81001 iterations:
[[ 7.63547010e-01  7.22790448e-01  9.07039523e-01  6.94396100e+09
   7.91712014e-01  8.55541615e+09  6.98033670e-01  2.26406827e+09
   7.32476004e-01  4.88954038e-01  6.17590477e-01 -7.70606063e-01]
 [ 8.34976296e-01  9.03047781e-01  9.54268504e-01  5.51195807e+09
   7.14293996e-01  2.12100012e+09 -6.45999821e-01  2.14481050e+09
   4.21543554e-01  3.45400146e-01  2.05106817e-01  1.91241598e-01]
 [ 7.12558625e-01  8.44639970e-01  6.87961602e-01  1.16510824e+09
   6.55429738e-01  5.30423046e+09  4.06852523e-01  5.01049261e+09
   6.28410173e-01  5.76169843e-01  4.08103518e-01  2.98838003e-01]
 [ 7.75403161e-01  7.59085242e-01  8.14981787e-01  1.14452396e+09
   6.21136721e-01  3.45244244e+09  6.19855462e-01  5.59657106e+09
   6.46551820e-01  6.74248226e-01  5.76779595e-01  4.06592493e-01]]
Policy matrix after 81001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  


State-Act


State-Action matrix after 108001 iterations:
[[ 7.78026671e-01  7.51046701e-01  9.09695451e-01  6.94396100e+09
   7.92952877e-01  8.55541615e+09  7.00504509e-01  2.26406827e+09
   7.34354725e-01  5.10469927e-01  6.20972847e-01 -7.49998293e-01]
 [ 8.38840535e-01  9.04118356e-01  9.55255790e-01  5.51195807e+09
   7.23738564e-01  2.12100012e+09 -6.31522566e-01  2.14481050e+09
   4.69608248e-01  4.00797823e-01  2.62563603e-01  2.09446127e-01]
 [ 7.26491373e-01  8.47200425e-01  6.93206779e-01  1.16510824e+09
   6.67517793e-01  5.30423046e+09  4.29494514e-01  5.01049261e+09
   6.45202122e-01  5.97665985e-01  4.31742047e-01  3.12655114e-01]
 [ 7.84806140e-01  7.74144147e-01  8.27788124e-01  1.14452396e+09
   6.52785858e-01  3.45244244e+09  6.36588500e-01  5.59657106e+09
   6.63585787e-01  6.78078973e-01  5.94826744e-01  4.09146258e-01]]
Policy matrix after 108001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  


State-A


State-Action matrix after 135001 iterations:
[[ 7.86229596e-01  7.71299939e-01  9.08080663e-01  6.94396100e+09
   7.94017251e-01  8.55541615e+09  7.00069926e-01  2.26406827e+09
   7.35984843e-01  5.30898209e-01  6.22492187e-01 -7.36098093e-01]
 [ 8.40943229e-01  9.04638779e-01  9.55705490e-01  5.51195807e+09
   7.30969552e-01  2.12100012e+09 -6.33480475e-01  2.14481050e+09
   4.99972467e-01  4.36884566e-01  2.97394072e-01  2.21428284e-01]
 [ 7.35131186e-01  8.50513248e-01  6.94864436e-01  1.16510824e+09
   6.79330586e-01  5.30423046e+09  4.37694998e-01  5.01049261e+09
   6.53287292e-01  6.05279365e-01  4.59089871e-01  3.22924002e-01]
 [ 7.88201208e-01  7.83332075e-01  8.31694810e-01  1.14452396e+09
   6.70940855e-01  3.45244244e+09  6.32807196e-01  5.59657106e+09
   6.71121764e-01  6.81356466e-01  6.01910213e-01  4.09153515e-01]]
Policy matrix after 135001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  


State-A


State-Action matrix after 162001 iterations:
[[ 7.91236568e-01  7.82751074e-01  9.09272066e-01  6.94396100e+09
   7.94702868e-01  8.55541615e+09  6.98056363e-01  2.26406827e+09
   7.36884077e-01  5.41970296e-01  6.21590067e-01 -7.32251395e-01]
 [ 8.42339062e-01  9.04757214e-01  9.55829376e-01  5.51195807e+09
   7.36268400e-01  2.12100012e+09 -6.37035277e-01  2.14481050e+09
   5.25223566e-01  4.54475404e-01  3.15790612e-01  2.36136232e-01]
 [ 7.41732245e-01  8.54004678e-01  6.95674170e-01  1.16510824e+09
   6.83636980e-01  5.30423046e+09  4.34110812e-01  5.01049261e+09
   6.61937203e-01  6.11008316e-01  4.73157445e-01  3.27888833e-01]
 [ 7.90097579e-01  7.90645259e-01  8.32465594e-01  1.14452396e+09
   6.86390821e-01  3.45244244e+09  6.40571032e-01  5.59657106e+09
   6.76920658e-01  6.82996390e-01  6.08092382e-01  4.08263785e-01]]
Policy matrix after 162001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  


State-A


State-Action matrix after 189001 iterations:
[[ 7.96096380e-01  7.93261731e-01  9.10724568e-01  6.94396100e+09
   7.95029893e-01  8.55541615e+09  6.98378828e-01  2.26406827e+09
   7.37369731e-01  5.51167378e-01  6.22309070e-01 -7.26539301e-01]
 [ 8.43201365e-01  9.04779970e-01  9.55911644e-01  5.51195807e+09
   7.37809428e-01  2.12100012e+09 -6.37521819e-01  2.14481050e+09
   5.42761379e-01  4.74122507e-01  3.27490671e-01  2.36184647e-01]
 [ 7.48242443e-01  8.51959457e-01  6.99808841e-01  1.16510824e+09
   6.88732105e-01  5.30423046e+09  4.35562657e-01  5.01049261e+09
   6.66385726e-01  6.16763088e-01  4.86233162e-01  3.35717677e-01]
 [ 7.93384405e-01  7.95180645e-01  8.34607174e-01  1.14452396e+09
   6.93529472e-01  3.45244244e+09  6.47093633e-01  5.59657106e+09
   6.79669656e-01  6.83805060e-01  6.15808943e-01  4.08296953e-01]]
Policy matrix after 189001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  


State-A


State-Action matrix after 216001 iterations:
[[ 7.96860449e-01  8.00813723e-01  9.11196867e-01  6.94396100e+09
   7.95506200e-01  8.55541615e+09  6.98662909e-01  2.26406827e+09
   7.37422823e-01  5.59790515e-01  6.23384921e-01 -7.23225042e-01]
 [ 8.43949704e-01  9.04961045e-01  9.56171453e-01  5.51195807e+09
   7.40714670e-01  2.12100012e+09 -6.41225515e-01  2.14481050e+09
   5.56235569e-01  4.83692700e-01  3.39884296e-01  2.36177201e-01]
 [ 7.52933344e-01  8.54398236e-01  7.05489554e-01  1.16510824e+09
   6.92599558e-01  5.30423046e+09  4.36562953e-01  5.01049261e+09
   6.71008230e-01  6.19434247e-01  4.99446317e-01  3.39878364e-01]
 [ 7.94800654e-01  7.99656199e-01  8.34854796e-01  1.14452396e+09
   7.03361725e-01  3.45244244e+09  6.50188585e-01  5.59657106e+09
   6.83355776e-01  6.83859170e-01  6.18644600e-01  4.08227126e-01]]
Policy matrix after 216001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  


State-A


State-Action matrix after 243001 iterations:
[[ 7.97872072e-01  8.07712156e-01  9.12623813e-01  6.94396100e+09
   7.95770910e-01  8.55541615e+09  6.98423593e-01  2.26406827e+09
   7.37603235e-01  5.68454968e-01  6.24217252e-01 -7.23723748e-01]
 [ 8.44230299e-01  9.04973062e-01  9.56335611e-01  5.51195807e+09
   7.43200814e-01  2.12100012e+09 -6.42394346e-01  2.14481050e+09
   5.67043114e-01  4.93572620e-01  3.48207211e-01  2.38019192e-01]
 [ 7.54985242e-01  8.54557307e-01  7.06622105e-01  1.16510824e+09
   6.95748896e-01  5.30423046e+09  4.37861458e-01  5.01049261e+09
   6.72018491e-01  6.21587796e-01  5.07564032e-01  3.43636732e-01]
 [ 7.94804875e-01  8.00081728e-01  8.37052173e-01  1.14452396e+09
   7.09089030e-01  3.45244244e+09  6.52786590e-01  5.59657106e+09
   6.86016746e-01  6.84555813e-01  6.21533325e-01  4.08930649e-01]]
Policy matrix after 243001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  




State-Action matrix after 270001 iterations:
[[ 7.99927215e-01  8.11934497e-01  9.12791716e-01  6.94396100e+09
   7.96132132e-01  8.55541615e+09  6.98496957e-01  2.26406827e+09
   7.38007202e-01  5.74152630e-01  6.24423149e-01 -7.23302846e-01]
 [ 8.44703120e-01  9.04894843e-01  9.56410848e-01  5.51195807e+09
   7.43443205e-01  2.12100012e+09 -6.39738715e-01  2.14481050e+09
   5.75522423e-01  5.03704639e-01  3.53732633e-01  2.38867418e-01]
 [ 7.56842782e-01  8.54996519e-01  7.06933886e-01  1.16510824e+09
   6.97152581e-01  5.30423046e+09  4.39102424e-01  5.01049261e+09
   6.75036720e-01  6.25351564e-01  5.12275068e-01  3.49161779e-01]
 [ 7.95548756e-01  8.01916004e-01  8.38284491e-01  1.14452396e+09
   7.14056608e-01  3.45244244e+09  6.57160914e-01  5.59657106e+09
   6.88453826e-01  6.85048816e-01  6.23038543e-01  4.08518582e-01]]
Policy matrix after 270001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  0.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   ^   <  




State-Action matrix after 297001 iterations:
[[ 8.01678480e-01  8.14448649e-01  9.13140411e-01  6.94396100e+09
   7.96332884e-01  8.55541615e+09  6.98410532e-01  2.26406827e+09
   7.38463005e-01  5.78154103e-01  6.25185732e-01 -7.23948538e-01]
 [ 8.44993226e-01  9.04824928e-01  9.56396612e-01  5.51195807e+09
   7.43694429e-01  2.12100012e+09 -6.40523652e-01  2.14481050e+09
   5.83419358e-01  5.11852953e-01  3.64616480e-01  2.39156307e-01]
 [ 7.57460946e-01  8.55735583e-01  7.05772373e-01  1.16510824e+09
   6.99857727e-01  5.30423046e+09  4.42945151e-01  5.01049261e+09
   6.77008568e-01  6.26664902e-01  5.15962609e-01  3.57954647e-01]
 [ 7.94576675e-01  8.03299306e-01  8.40178296e-01  1.14452396e+09
   7.16588302e-01  3.45244244e+09  6.58061032e-01  5.59657106e+09
   6.89512884e-01  6.85760187e-01  6.29163083e-01  4.11859958e-01]]
Policy matrix after 297001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   <   <  




State-Action matrix after 324001 iterations:
[[ 8.02181340e-01  8.19273399e-01  9.13126278e-01  6.94396100e+09
   7.96275959e-01  8.55541615e+09  6.97901135e-01  2.26406827e+09
   7.38443248e-01  5.83431533e-01  6.24818129e-01 -7.19622148e-01]
 [ 8.45228596e-01  9.04741572e-01  9.56280324e-01  5.51195807e+09
   7.43520816e-01  2.12100012e+09 -6.43411338e-01  2.14481050e+09
   5.89145766e-01  5.18609074e-01  3.70667092e-01  2.38451308e-01]
 [ 7.58097929e-01  8.56932700e-01  7.07412679e-01  1.16510824e+09
   7.01564299e-01  5.30423046e+09  4.39267802e-01  5.01049261e+09
   6.77645150e-01  6.27514354e-01  5.23311711e-01  3.62173440e-01]
 [ 7.95128187e-01  8.04341211e-01  8.38937481e-01  1.14452396e+09
   7.19027202e-01  3.45244244e+09  6.57079681e-01  5.59657106e+09
   6.91718705e-01  6.86024148e-01  6.34254407e-01  4.13127457e-01]]
Policy matrix after 324001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   <   <  




State-Action matrix after 351001 iterations:
[[ 8.03194976e-01  8.22552440e-01  9.13792941e-01  6.94396100e+09
   7.96679697e-01  8.55541615e+09  6.97995591e-01  2.26406827e+09
   7.38824507e-01  5.88385899e-01  6.24747626e-01 -7.18615133e-01]
 [ 8.45719174e-01  9.04954553e-01  9.56433586e-01  5.51195807e+09
   7.44657122e-01  2.12100012e+09 -6.43406317e-01  2.14481050e+09
   5.94831475e-01  5.25958072e-01  3.75236396e-01  2.37661821e-01]
 [ 7.59793577e-01  8.56198246e-01  7.05423545e-01  1.16510824e+09
   7.02773299e-01  5.30423046e+09  4.39300022e-01  5.01049261e+09
   6.79047258e-01  6.28704863e-01  5.28870550e-01  3.66095648e-01]
 [ 7.96747308e-01  8.05232017e-01  8.38732004e-01  1.14452396e+09
   7.21862985e-01  3.45244244e+09  6.59189594e-01  5.59657106e+09
   6.93463076e-01  6.86768770e-01  6.38152681e-01  4.14410004e-01]]
Policy matrix after 351001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   <   <  




State-Action matrix after 378001 iterations:
[[ 8.04644103e-01  8.25195673e-01  9.13644455e-01  6.94396100e+09
   7.96592179e-01  8.55541615e+09  6.97398325e-01  2.26406827e+09
   7.38820918e-01  5.91937527e-01  6.24722218e-01 -7.16818565e-01]
 [ 8.45776299e-01  9.04910717e-01  9.56369955e-01  5.51195807e+09
   7.45808541e-01  2.12100012e+09 -6.46223770e-01  2.14481050e+09
   5.98753875e-01  5.31221647e-01  3.80761130e-01  2.37903205e-01]
 [ 7.60822042e-01  8.56762495e-01  7.03377153e-01  1.16510824e+09
   7.03082834e-01  5.30423046e+09  4.40956157e-01  5.01049261e+09
   6.80093177e-01  6.30204158e-01  5.33708244e-01  3.67981958e-01]
 [ 7.96600789e-01  8.06149927e-01  8.39511791e-01  1.14452396e+09
   7.24010088e-01  3.45244244e+09  6.58348406e-01  5.59657106e+09
   6.95218177e-01  6.86630261e-01  6.38386151e-01  4.15249483e-01]]
Policy matrix after 378001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   <   <  




State-Action matrix after 405001 iterations:
[[ 8.05588236e-01  8.27533106e-01  9.14507161e-01  6.94396100e+09
   7.96931126e-01  8.55541615e+09  6.96882639e-01  2.26406827e+09
   7.39197607e-01  5.95087072e-01  6.24579867e-01 -7.13648782e-01]
 [ 8.46219201e-01  9.05246218e-01  9.56546780e-01  5.51195807e+09
   7.46521540e-01  2.12100012e+09 -6.46886781e-01  2.14481050e+09
   6.03387644e-01  5.36670609e-01  3.83672651e-01  2.40629135e-01]
 [ 7.62528615e-01  8.57066642e-01  7.02435836e-01  1.16510824e+09
   7.04136534e-01  5.30423046e+09  4.40723780e-01  5.01049261e+09
   6.81276092e-01  6.31291184e-01  5.36975033e-01  3.69926350e-01]
 [ 7.97168052e-01  8.07403995e-01  8.40524121e-01  1.14452396e+09
   7.25887297e-01  3.45244244e+09  6.59014202e-01  5.59657106e+09
   6.95594578e-01  6.87321004e-01  6.39122667e-01  4.15837195e-01]]
Policy matrix after 405001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   <   <  




State-Action matrix after 432001 iterations:
[[ 8.05655604e-01  8.30491964e-01  9.14400083e-01  6.94396100e+09
   7.96859809e-01  8.55541615e+09  6.96657080e-01  2.26406827e+09
   7.39183700e-01  5.98616918e-01  6.24575982e-01 -7.12319667e-01]
 [ 8.46280412e-01  9.05186454e-01  9.56563739e-01  5.51195807e+09
   7.47776191e-01  2.12100012e+09 -6.44377490e-01  2.14481050e+09
   6.07480455e-01  5.41060444e-01  3.85132508e-01  2.41039752e-01]
 [ 7.63474236e-01  8.55867334e-01  7.03901486e-01  1.16510824e+09
   7.04316082e-01  5.30423046e+09  4.39829277e-01  5.01049261e+09
   6.83022515e-01  6.32255493e-01  5.39460701e-01  3.72018438e-01]
 [ 7.96752404e-01  8.08551174e-01  8.41564533e-01  1.14452396e+09
   7.28018764e-01  3.45244244e+09  6.57990351e-01  5.59657106e+09
   6.96812458e-01  6.87344581e-01  6.39664448e-01  4.15945499e-01]]
Policy matrix after 432001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   <   <  




State-Action matrix after 459001 iterations:
[[ 8.06067018e-01  8.32410151e-01  9.14129768e-01  6.94396100e+09
   7.96688759e-01  8.55541615e+09  6.95942674e-01  2.26406827e+09
   7.38902198e-01  6.00946057e-01  6.24669272e-01 -7.11706109e-01]
 [ 8.46252713e-01  9.05091005e-01  9.56415624e-01  5.51195807e+09
   7.47800558e-01  2.12100012e+09 -6.41272020e-01  2.14481050e+09
   6.11037283e-01  5.45176914e-01  3.88978290e-01  2.42014969e-01]
 [ 7.64058877e-01  8.57396769e-01  7.02612553e-01  1.16510824e+09
   7.04951067e-01  5.30423046e+09  4.39119915e-01  5.01049261e+09
   6.82589512e-01  6.32817541e-01  5.39975602e-01  3.74982474e-01]
 [ 7.96943798e-01  8.09307776e-01  8.42317569e-01  1.14452396e+09
   7.29884790e-01  3.45244244e+09  6.56232017e-01  5.59657106e+09
   6.97043688e-01  6.87118510e-01  6.39606845e-01  4.17092699e-01]]
Policy matrix after 459001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   <   <  




State-Action matrix after 486001 iterations:
[[ 8.06275975e-01  8.34485730e-01  9.14826114e-01  6.94396100e+09
   7.96693685e-01  8.55541615e+09  6.96567515e-01  2.26406827e+09
   7.38943221e-01  6.03296378e-01  6.24536477e-01 -7.11855124e-01]
 [ 8.46402781e-01  9.05190582e-01  9.56492119e-01  5.51195807e+09
   7.47798942e-01  2.12100012e+09 -6.41806161e-01  2.14481050e+09
   6.14114965e-01  5.49574388e-01  3.91912551e-01  2.43146894e-01]
 [ 7.63864081e-01  8.58464429e-01  7.04501051e-01  1.16510824e+09
   7.06145788e-01  5.30423046e+09  4.39963715e-01  5.01049261e+09
   6.83072815e-01  6.33995347e-01  5.42559730e-01  3.77194559e-01]
 [ 7.97833609e-01  8.10191067e-01  8.42277911e-01  1.14452396e+09
   7.31575225e-01  3.45244244e+09  6.59654326e-01  5.59657106e+09
   6.96383875e-01  6.87320840e-01  6.41039288e-01  4.18108740e-01]]
Policy matrix after 486001 iterations:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
 >   >   >   *  
 ^   #   ^   *  
 ^   <   <   <  



Comparing the code for MC control with the one used in MC for prediction reveals some important differences, for example the following condition:

if(is_starting): 
    action = np.random.randint(0, 4)
    is_starting = False 

This condition implements the exploring starts. The MC algorithm converges to the optimal solution only if exploring starts are guaranteed. In MC for control it is not sufficient to select random starting states: the algorithm can improve the policy only if every action has a non-zero probability of being chosen. For this reason, when an episode starts we select a random action, and we do this only for the starting state.
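The following is a minimal sketch of the same idea, independent of the GridWorld class. The helper name `choose_action` and the toy policy are assumptions made for illustration and are not part of the notebook's code. Only the first action of an episode is random; every later action comes from the policy.

```python
import numpy as np

def choose_action(policy_matrix, observation, is_starting, rng):
    """Return (action, is_starting) for the current step.

    Hypothetical helper: on the first step of an episode the action is
    drawn uniformly from the 4 actions (exploring start); on every later
    step the action prescribed by the policy is used.
    """
    if is_starting:
        return int(rng.integers(0, 4)), False
    return int(policy_matrix[observation[0], observation[1]]), False

rng = np.random.default_rng(0)
policy_matrix = np.ones((3, 4))  # toy policy: always action 1

# Actions chosen at the start of 200 episodes vs. actions chosen later on
first_actions = [choose_action(policy_matrix, (2, 0), True, rng)[0]
                 for _ in range(200)]
later_actions = [choose_action(policy_matrix, (2, 0), False, rng)[0]
                 for _ in range(200)]
print(sorted(set(first_actions)))
print(sorted(set(later_actions)))
```

The starting actions are spread over the action space, while the later ones always match the policy.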

There is another subtle difference that we must analyse. The code distinguishes between observation and new_observation, meaning the observation at time t and the observation at time t+1. What we store in the episode list is the observation at t, the action taken at t and the reward obtained at t+1. Remember that we are interested in the utility of taking a certain action in a certain state.
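To make the role of these tuples concrete, here is a small sketch of how an episode list could be turned into first-visit returns for state-action pairs. The function name `first_visit_returns` and the toy episode are assumptions for illustration; the notebook's own implementation may differ.

```python
def first_visit_returns(episode, gamma):
    """Map each (observation, action) pair to the discounted return
    that follows its first occurrence in the episode.

    episode: list of (observation_t, action_t, reward_t_plus_1) tuples.
    """
    returns = {}
    g = 0.0
    # Walk backwards so the return accumulates in a single pass.
    # Later writes overwrite earlier ones, so the value kept for a pair
    # is the one from its earliest occurrence in time.
    for observation, action, reward in reversed(episode):
        g = reward + gamma * g
        returns[(observation, action)] = g
    return returns

# Toy episode (made up): two -0.04 steps followed by the +1 reward
episode = [((2, 0), 0, -0.04), ((1, 0), 0, -0.04), ((0, 0), 1, 1.0)]
returns = first_visit_returns(episode, gamma=0.9)
for pair, value in returns.items():
    print(pair, value)
```

Each return is attached to a state-action pair rather than to a state alone, which is what the state-action matrix needs.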

It is time to run the script and see what we obtain. Before doing so, remember that for the special 4x3 world we already know the optimal policy. If you go back to the first post you will see that we found the optimal policy for a reward of -0.04 (for non-terminal states) and a transition model with 80-10-10 percent probabilities. This optimal policy is the following:

<img src="../images/RL-lecture2-9.png" alt="Drawing" style="width: 150px;">

In the optimal policy the robot will move far away from the stairs at state (4, 2) and will reach the charging station through the longest path. Now I will show you the evolution of the policy once we run the script for MC control estimation:

## Conclusions
At the beginning the MC method is initialized with a random policy, so it is no surprise that the first policy is complete nonsense. 
After 3000 iterations the algorithm finds a sub-optimal policy. In this policy the robot moves close to the stairs in order to reach the charging station. As we said in the previous post, this is risky because the robot can fall down. At iteration 78000 the algorithm finds another policy, which is still sub-optimal but slightly better than the previous one. Finally, at iteration 405000 the algorithm finds the optimal policy and sticks to it until the end.

The MC method cannot converge to any sub-optimal policy. The GPI scheme makes this clear: if the algorithm converged to a sub-optimal policy, the utility function would eventually converge to the utility function for that policy, and that in turn would cause the policy to change. Stability is reached only when both the policy and the utility function are optimal. Convergence to this optimal fixed point seems inevitable but has not yet been formally proved.
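The improvement half of GPI can be sketched in a couple of lines. The sketch assumes the state-action matrix has one row per action and one column per state (4 x 12, as in the printed output) and that states are numbered row by row over the 3x4 grid; the function name `greedy_policy` and the toy values are made up for illustration.

```python
import numpy as np

def greedy_policy(state_action_matrix, shape=(3, 4)):
    """Pick, for every state, the action with the largest estimated value."""
    return np.argmax(state_action_matrix, axis=0).reshape(shape)

# Toy estimates (not taken from the run above): 4 actions x 12 states
q = np.zeros((4, 12))
q[1, 0] = 0.8   # action 1 looks best in state 0 -> grid cell (0, 0)
q[2, 4] = 0.7   # action 2 looks best in state 4 -> grid cell (1, 0)
q[3, 9] = 0.6   # action 3 looks best in state 9 -> grid cell (2, 1)

policy = greedy_policy(q)
print(policy)
```

Whenever the estimates change enough to move an argmax, the policy changes too, which is why a sub-optimal policy cannot remain stable.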

I would like to reflect for a moment on the beauty of the MC algorithm. In MC for control the method can estimate the best policy starting from nothing. The robot moves in the environment, trying different actions and following the consequences of those actions until the end. That’s all. The robot does not know the reward function, it does not know the transition model and it does not have any policy to follow. Nevertheless the algorithm improves until it reaches the optimal strategy.


<a href="#TOP">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        TOP
</a>

# TD(0)
Using our cleaning robot we can easily see the difference between TD and MC learning and what each one does at each step:

+ The update rule found in the previous part is the simplest form of TD learning, the TD(0) algorithm. 
+ TD(0) allows estimating the utility values following a specific policy. 
+ We are in the **passive learning** case for prediction, and we are in **model-free reinforcement learning**, meaning that we do not have the **transition model**. 
+ To estimate the state-value function, we can only move in the world. 
+ Let's see what it means to apply the TD algorithm to a single episode. 
+ I am going to use the same episode where the robot starts at (1,1) and reaches the terminal state at (4,3) after seven steps.
<img src="../images/RL-lecture2-1.png" alt="Drawing" style="width: 500px;">


Applying the TD algorithm means moving step by step, considering only the state at t and the state at t+1. After each step we get the utility value and the reward at t+1, and we update the value at t. The TD(0) algorithm ignores the past states, which is shown by the shadow added above those states. Applying the algorithm to the episode ($\gamma=0.9, \alpha=0.1$) leads to the following changes in the utility matrix:
<img src="../images/RL-lecture2-2.png" alt="Drawing" style="width: 500px;">

+ The red frame highlights the utility value that has been updated at each visit. The matrix is initialised with zeros. 
+ At k=1 the state (1,1) is updated, since the robot is now in state (1,2) and the first reward (-0.04) is available. 
+ The calculation for updating the utility at (1,1) is: $0.0 + 0.1 \, (-0.04 + 0.9 \cdot 0.0 - 0.0) = -0.004$. The algorithm updates the state (1,2) in the same way. 
+ At k=3 the robot goes back and the calculation takes the form: $0.0 + 0.1 \, (-0.04 + 0.9 \cdot (-0.004) - 0.0) = -0.00436$. 
+ At k=4 the robot changes direction again. In this case the algorithm updates the state (1,2) for the second time, as follows: $-0.004 + 0.1 \, (-0.04 + 0.9 \cdot (-0.00436) - (-0.004)) = -0.0079924$. 
+ The same process is applied until the end of the episode.
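The arithmetic above can be checked with a few lines of Python. This is only a sketch of the rule $U(s_t) \leftarrow U(s_t) + \alpha \, [r_{t+1} + \gamma U(s_{t+1}) - U(s_t)]$; the function name `td0_update` is hypothetical, and the comment for k=3 assumes that the state being updated there is (1,3), the one the robot has just left.

```python
def td0_update(u_t, u_t1, reward, alpha=0.1, gamma=0.9):
    """One TD(0) update of U(s_t) given U(s_t+1) and the reward at t+1."""
    return u_t + alpha * (reward + gamma * u_t1 - u_t)

# All utilities start at zero; every step of the episode pays -0.04
u_11 = td0_update(0.0, 0.0, -0.04)     # k=1: update (1,1)
u_12 = td0_update(0.0, 0.0, -0.04)     # k=2: update (1,2)
u_13 = td0_update(0.0, u_12, -0.04)    # k=3: robot goes back (assumed: update (1,3))
u_12 = td0_update(u_12, u_13, -0.04)   # k=4: second update of (1,2)

print(u_11, u_13, u_12)
```

Up to floating point rounding, the printed values are the ones derived by hand in the list above.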

We are using the class GridWorld contained in the module gridworld.py again. <br/>
I will use again the 4x3 world with a charging station at (4,3) and the stairs at (4,2). <br/>
The optimal policy and the state-values of this world are the same ones we obtained in the previous example:
<img src="../images/RL-lecture2-3.png" alt="Drawing" style="width: 500px;">

The update rule of TD(0) can be implemented in a few lines:



In [15]:
def update_utility(utility_matrix, observation, new_observation, 
                   reward, alpha, gamma):
    '''Return the updated utility matrix
    @param utility_matrix the matrix before the update
    @param observation the state observed at t
    @param new_observation the state observed at t+1
    @param reward the reward observed after the action
    @param alpha the step size (learning rate)
    @param gamma the discount factor
    @return the updated utility matrix
    '''
    u = utility_matrix[observation[0], observation[1]]
    u_t1 = utility_matrix[new_observation[0], new_observation[1]]
    utility_matrix[observation[0], observation[1]] += \
        alpha * (reward + gamma * u_t1 - u)
    return utility_matrix

The main loop is much simpler than the one used for the MC methods. In this case there is no first-visit constraint, and the only thing to do is apply the update rule.

In [16]:
def main():

    env = GridWorld(3, 4)

    #Define the state matrix
    state_matrix = np.zeros((3,4))
    state_matrix[0, 3] = 1
    state_matrix[1, 3] = 1
    state_matrix[1, 1] = -1
    print("State Matrix:")
    print(state_matrix)

    #Define the reward matrix
    reward_matrix = np.full((3,4), -0.04)
    reward_matrix[0, 3] = 1
    reward_matrix[1, 3] = -1
    print("Reward Matrix:")
    print(reward_matrix)

    #Define the transition matrix
    transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
                                  [0.1, 0.8, 0.1, 0.0],
                                  [0.0, 0.1, 0.8, 0.1],
                                  [0.1, 0.0, 0.1, 0.8]])

    #Define the policy matrix
    #This is the optimal policy for world with reward=-0.04
    policy_matrix = np.array([[1,      1,  1,  -1],
                              [0, np.nan,  0,  -1],
                              [0,      3,  3,   3]])
    print("Policy Matrix:")
    print(policy_matrix)

    env.setStateMatrix(state_matrix)
    env.setRewardMatrix(reward_matrix)
    env.setTransitionMatrix(transition_matrix)

    utility_matrix = np.zeros((3,4))
    gamma = 0.999
    alpha = 0.1 #constant step size
    tot_epoch = 300000
    print_epoch = 50000

    for epoch in range(tot_epoch):
        #Reset and return the first observation
        observation = env.reset(exploring_starts=False)
        for step in range(1000):
            #Take the action from the action matrix
            action = policy_matrix[observation[0], observation[1]]
            #Move one step in the environment and get obs and reward
            new_observation, reward, done = env.step(action)
            utility_matrix = update_utility(utility_matrix, observation, 
                                            new_observation, reward, alpha, gamma)
            observation = new_observation
            #print(utility_matrix)
            if done: break

        if(epoch % print_epoch == 0):
            print("")
            print("Utility matrix after " + str(epoch+1) + " iterations:") 
            print(utility_matrix)
    #Time to check the utility matrix obtained
    print("Utility matrix after " + str(tot_epoch) + " iterations:")
    print(utility_matrix)



if __name__ == "__main__":
    main()

State Matrix:
[[ 0.  0.  0.  1.]
 [ 0. -1.  0.  1.]
 [ 0.  0.  0.  0.]]
Reward Matrix:
[[-0.04 -0.04 -0.04  1.  ]
 [-0.04 -0.04 -0.04 -1.  ]
 [-0.04 -0.04 -0.04 -0.04]]
Policy Matrix:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]

Utility matrix after 1 iterations:
[[-0.004 -0.004  0.1    0.   ]
 [-0.004  0.     0.     0.   ]
 [-0.004  0.     0.     0.   ]]

Utility matrix after 50001 iterations:
[[0.86135319 0.92089811 0.95584791 0.        ]
 [0.81809754 0.         0.58523962 0.        ]
 [0.74892387 0.70315465 0.         0.        ]]

Utility matrix after 100001 iterations:
[[0.85271854 0.91885718 0.96663391 0.        ]
 [0.80126948 0.         0.89066707 0.        ]
 [0.74468331 0.69195097 0.         0.        ]]

Utility matrix after 150001 iterations:
[[0.79229969 0.83139772 0.90190888 0.        ]
 [0.75893635 0.         0.37511175 0.        ]
 [0.71456017 0.68488362 0.         0.        ]]

Utility matrix after 200001 iterations:
[[0.83677492 0.89617591 0.93833395 0.  

<a href="#TOP">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        TOP
</a>

# TD(lambda)
The Python implementation of TD(λ) is straightforward. We only need to add an eligibility matrix and its update rule.

In [17]:
def update_utility(utility_matrix, trace_matrix, alpha, delta):
    '''Return the updated utility matrix
    @param utility_matrix the matrix before the update
    @param trace_matrix the eligibility traces matrix
    @param alpha the step size (learning rate)
    @param delta the error (Target - OldEstimate)
    @return the updated utility matrix
    '''
    utility_matrix += alpha * delta * trace_matrix
    return utility_matrix

In [18]:
def update_eligibility(trace_matrix, gamma, lambda_):
    '''Return the updated trace_matrix
    @param trace_matrix the eligibility traces matrix
    @param gamma discount factor
    @param lambda_ the decaying value
    @return the updated trace_matrix
    '''
    trace_matrix = trace_matrix * gamma * lambda_
    return trace_matrix

+ The main loop introduces some new components compared to the TD(0) case. 
+ The error delta is estimated on a separate line, and the trace_matrix is managed in two lines. 
+ First the trace of the visited state is incremented (+1); then, after the utility update, all the traces are decayed.
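The increment-then-decay mechanics are easy to see on a toy example. The sketch below uses a made-up 1-D world with four states and leaves out the utility update; only the values of gamma and lambda_ are taken from the script.

```python
import numpy as np

gamma, lambda_ = 0.999, 0.5
trace = np.zeros(4)                  # toy 1-D world with 4 states

for state in [0, 1, 2]:              # visit the states left to right
    trace[state] += 1                # bump the state just visited
    trace = trace * gamma * lambda_  # decay every trace
    print(state, trace)
```

Each trace shrinks by a factor gamma * lambda_ = 0.4995 per step, so the most recently visited state carries the largest trace and states that were never visited stay at zero.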

In [19]:
def main():

    env = GridWorld(3, 4)

    #Define the state matrix
    state_matrix = np.zeros((3,4))
    state_matrix[0, 3] = 1
    state_matrix[1, 3] = 1
    state_matrix[1, 1] = -1
    print("State Matrix:")
    print(state_matrix)

    #Define the reward matrix
    reward_matrix = np.full((3,4), -0.04)
    reward_matrix[0, 3] = 1
    reward_matrix[1, 3] = -1
    print("Reward Matrix:")
    print(reward_matrix)

    #Define the transition matrix
    transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
                                  [0.1, 0.8, 0.1, 0.0],
                                  [0.0, 0.1, 0.8, 0.1],
                                  [0.1, 0.0, 0.1, 0.8]])

    #Define the policy matrix
    #This is the optimal policy for world with reward=-0.04
    policy_matrix = np.array([[1,      1,  1,  -1],
                              [0, np.nan,  0,  -1],
                              [0,      3,  3,   3]])
    print("Policy Matrix:")
    print(policy_matrix)

    #Define and print the eligibility trace matrix
    trace_matrix = np.zeros((3,4))
    print("Trace Matrix:")
    print(trace_matrix)

    env.setStateMatrix(state_matrix)
    env.setRewardMatrix(reward_matrix)
    env.setTransitionMatrix(transition_matrix)

    utility_matrix = np.zeros((3,4))
    gamma = 0.999 #discount rate
    alpha = 0.1 #constant step size
    lambda_ = 0.5 #decaying factor
    tot_epoch = 300000
    print_epoch = 50000

    for epoch in range(tot_epoch):
        #Reset and return the first observation
        observation = env.reset(exploring_starts=True)        
        for step in range(1000):
            #Take the action from the action matrix
            action = policy_matrix[observation[0], observation[1]]
            #Move one step in the environment and get obs and reward
            new_observation, reward, done = env.step(action)
            #Estimate the error delta (Target - OldEstimate)
            delta = reward + gamma * utility_matrix[new_observation[0], new_observation[1]] - \
                                     utility_matrix[observation[0], observation[1]]
            #Adding +1 in the trace matrix for the state visited
            trace_matrix[observation[0], observation[1]] += 1
            #Update the utility matrix
            utility_matrix = update_utility(utility_matrix, trace_matrix, alpha, delta)
            #Update the trace matrix (decaying)
            trace_matrix = update_eligibility(trace_matrix, gamma, lambda_)
            observation = new_observation
            if done: break #return

        if(epoch % print_epoch == 0):
            print("")
            print("Utility matrix after " + str(epoch+1) + " iterations:") 
            print(utility_matrix)
            print(trace_matrix)
    #Time to check the utility matrix obtained
    print("Utility matrix after " + str(tot_epoch) + " iterations:")
    print(utility_matrix)



if __name__ == "__main__":
    main()

State Matrix:
[[ 0.  0.  0.  1.]
 [ 0. -1.  0.  1.]
 [ 0.  0.  0.  0.]]
Reward Matrix:
[[-0.04 -0.04 -0.04  1.  ]
 [-0.04 -0.04 -0.04 -1.  ]
 [-0.04 -0.04 -0.04 -0.04]]
Policy Matrix:
[[ 1.  1.  1. -1.]
 [ 0. nan  0. -1.]
 [ 0.  3.  3.  3.]]
Trace Matrix:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Utility matrix after 1 iterations:
[[0.01895203 0.04595    0.1        0.        ]
 [0.00479687 0.         0.         0.        ]
 [0.         0.         0.         0.        ]]
[[0.12462537 0.24950025 0.4995     0.        ]
 [0.09334444 0.         0.         0.        ]
 [0.         0.         0.         0.        ]]

Utility matrix after 50001 iterations:
[[0.8834698  0.94154579 1.00014216 0.        ]
 [0.83124815 0.         0.75262683 0.        ]
 [0.77955405 0.73278148 0.69709184 0.59609685]]
[[1.24626311e-01 2.49505888e-01 4.99809008e-01 0.00000000e+00]
 [6.22508424e-02 0.00000000e+00 1.65572368e-04 0.00000000e+00]
 [3.88522718e-02 2.13423268e-02 9.66840731e-04 4.82936931e-04]]


+ Comparing the final utility matrix with the one obtained by TD(0) without eligibility traces, you will notice similar values. 
+ What’s the advantage of eligibility traces? 
    + The eligibility traces version converges faster. 

This advantage becomes clear when dealing with sparse rewards in a large state space. In this case the eligibility trace mechanism can considerably speed up convergence by propagating what is learnt at t+1 back to the most recently visited states.
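A toy experiment illustrates the point. The sketch below is not the GridWorld script: it assumes a made-up chain of five states walked once from left to right, with a reward of 1 only on the last step. The function name `run_episode` is hypothetical. Setting lambda_ to 0 wipes the traces of earlier states after every step, which reduces the update to TD(0).

```python
import numpy as np

def run_episode(n_states, alpha, gamma, lambda_):
    """Walk once from state 0 to the end of a chain and return the utilities.

    Only the last transition pays a reward (1.0); all others pay 0.
    """
    utility = np.zeros(n_states + 1)   # last entry is the terminal state
    trace = np.zeros(n_states + 1)
    for state in range(n_states):
        reward = 1.0 if state == n_states - 1 else 0.0
        delta = reward + gamma * utility[state + 1] - utility[state]
        trace[state] += 1                  # mark the visited state
        utility += alpha * delta * trace   # credit all eligible states
        trace *= gamma * lambda_           # decay the traces
    return utility[:n_states]

td0 = run_episode(n_states=5, alpha=0.1, gamma=0.9, lambda_=0.0)
td_lambda = run_episode(n_states=5, alpha=0.1, gamma=0.9, lambda_=0.5)

print("TD(0):     ", td0)
print("TD(lambda):", td_lambda)
```

After a single episode TD(0) has changed only the state next to the reward, while the version with traces has already pushed a discounted share of it back to every state visited along the way.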

<a href="#TOP">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        TOP
</a>