# Introduction to Q-Learning 

## CSCI E-82A

## Stephen Elston


In the previous lesson we explored Monte Carlo **reinforcement learning**. MC RL required that the returns for an entire episode be computed before any values are available for use. The disadvantage of this approach is the the full set of returns are required for state value or action value estimates. But, how can we get state values or action-values in fewer time steps? It turns out there are algorithms which compute estimates in as few as one step known as **time difference learning** or **TD-learning** and **Q-learning**. 

Recall that reinforcement learning has several distinctive characteristics, which differentiate this method from other machine learning and dynamic programming:
- **No Markov model** needs to be specified for reinforcement learning, in contrast to dynamic programming.
- Like dynamic programming, reinforcement learning **optimizes a reward function**. This is in contrast to supervised and unsupervised learning which use an error or objective function.  
- Reinforcement learning algorithms learn by **experience**. Over time, the algorithm learns a model of the environment and these results are used to optimize the expected reward. Learning from experience is in contrast to supervised learning which uses known marked cases. 
- Reinforcement learning agents take **actions** and only receive **state** and **rewards** from the environment. These are the only interaction between the RL agent and the environment.    

The interaction between a reinforcement learning agent and the environment are illustrated in the figure below. Notice that the only feedback the agent receives from the environment is reward and state.   

<img src="img/RL_AgentModel.JPG" alt="Drawing" style="width:500px; height:300px"/>
<center> **Reinforcement Learning Agent and Environment** </center>  

The ability to learn from experience is an attractive concept. This method of learning seems to mimic human learning. However, reinforcement learning has proven difficult to use in real-world applications. For a review of successes and problems arising when applying RL to robotics see [Kobler et. al.](https://www.ias.informatik.tu-darmstadt.de/uploads/Publications/Kober_IJRR_2013.pdf). At the present time, RL has mostly succeeded in cases where simulations can be used to gain experience. 

**Suggested readings** for TD and Q reinforcement learning, Chapters 6 and 7 of Sutton and Barto, second edition, provides a good introductions, including many alternative algorithms and details not discussed here.   

## On Policy vs. Off Policy Algorithms

In this lesson we will explore examples of two broad categories of RL algorithms known as **on policy** and **off policy** methods. 

On policy methods evaluate and improve a single policy. On policy methods converge quickly and often to good solution. In general, **exploration** is performed using $\epsilon$-greedy methods. The TD(0) and MC algorithms we have examined are examples of on policy methods. On policy algorithms are known to have good convergence properties. 

In contrast, off policy methods use two policies. The policy the agent is following is called the **behavior policy**, denoted $b(A_t | S_t)$. The policy being improved is known as the **target policy**, denoted $\pi (A_t | S_t)$. The agent obtains samples of the environment while following the behavior policy. These samples are used to improve the target policy. An advantage of off policy methods is that a deterministic behavior policy can be used while a better target policy is developed. 

## Example of Time Difference RL

With this short introduction TD RL in mind, let's try an example. We will sample the value function using a basic TD(0) algorithm here. 

As discussed in other labs, **Navigation** to a goal is a significant problem in robotics. Real-world navigation is rather complex. Therefore, in this example we will use a simple analog called a **grid world**. The grid world for this problem is shown below. 

<img src="img/GridWorld.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **A 4x4 Grid World with Terminal State** </center>

The grid world consists of a 4x4 set of positions the robot can occupy. Each position is considered a state. The goal is to navigate to state 0, the goal, in the minimum steps. We will explore methods to find policies which reach this goal and achieve maximum reward. 

Grid position 0 is the goal and a **terminal state**. There are no possible state transitions out of this position. The presence of a terminal state makes this an **episodic Markov random process**. For each episode sampled the robot can start in any other random position, $\{ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \}$. This random selection process makes this a **random start** TD algorithm. The episode terminates when the robot enters the terminal position (state 0).  

In reality, an RL agent may need to explore to find the possible actions when it is in some particular state. To simplify our example, we encode, or represent, these possibilities in a dictionary as shown in the code block below. We use a dictionary of dictionaries to perform the lookup. The keys of the outer dictionary are the identifiers (numbers) of the states. The keys of the inner dictionary are the possible actions and the values are the **successor state**, $s'$, for that transition.  

In each state, there are four possible actions the robot can take:
- up, u
- down, d,
- left, l
- right, r

The TD RL agent has no model for the environment. Therefore, beyond these allowed actions, all other information is encapsulated in the environment and is unobservable by the agent. This is the key difference between reinforcement learning and dynamic programming. 

In [1]:
## import numpy for latter
import numpy as np
import numpy.random as nr
import pandas as pd

## Define the transition dictonary of dictionaries:
neighbors = {0:{'u':0, 'd':0, 'l':0, 'r':0},
          1:{'u':1, 'd':5, 'l':0, 'r':2},
          2:{'u':2, 'd':6, 'l':1, 'r':3},
          3:{'u':3, 'd':7, 'l':2, 'r':3},
          4:{'u':0, 'd':8, 'l':4, 'r':5},
          5:{'u':1, 'd':9, 'l':4, 'r':6},
          6:{'u':2, 'd':10, 'l':5, 'r':7},
          7:{'u':3, 'd':11, 'l':6, 'r':7},
          8:{'u':4, 'd':12, 'l':8, 'r':9},
          9:{'u':5, 'd':13, 'l':8, 'r':10},
          10:{'u':6, 'd':14, 'l':9, 'r':11},
          11:{'u':7, 'd':15, 'l':10, 'r':11},
          12:{'u':8, 'd':12, 'l':12, 'r':13},
          13:{'u':9, 'd':13, 'l':12, 'r':14},
          14:{'u':10, 'd':14, 'l':13, 'r':15},
          15:{'u':11, 'd':15, 'l':14, 'r':15}}

rewards = {0:{'u':10.0, 'd':10.0, 'l':10.0, 'r':10.0},
          1:{'u':-1, 'd':-0.1, 'l':10.0, 'r':-0.1},
          2:{'u':-1.0, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1.0, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          4:{'u':10.0, 'd':-0.1, 'l':-1.0, 'r':-0.1},
          5:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          6:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          7:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          8:{'u':-0.1, 'd':-0.1, 'l':-1.0, 'r':-0.1},
          9:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          10:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          11:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          12:{'u':-0.1, 'd':-1.0, 'l':-1.0, 'r':-0.1},
          13:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-0.1},
          14:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-0.1},
          15:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-1.0}}


def action_lookup(index):
    """Helper function returns action given an index"""
    action_dic = {0:'u', 1:'d', 2:'l', 3:'r'}
    return action_dic[index]

def index_lookup(action):
    """Helper function returns index given action"""
    index_dic = {'u':0, 'd':1, 'l':2, 'r':3}
    return index_dic[action]


def next_state(state, action_index, neighbors = neighbors, action_lookup = action_lookup):
    return(neighbors[state][action_lookup[action_index]])

def simulate_environment(s, action, neighbors = neighbors, rewards = rewards, terminal = 0):
    """
    Function simulates the environment for Q-learning.
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward_prime = np.array([rewards[s_prime][a] for a in rewards[0].keys()])
    return (s_prime, reward_prime, is_terminal(s_prime, terminal))
    

def is_terminal(state, terminal = 0):
    return state == terminal

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(simulate_environment(1, a))
    

(1, array([-1. , -0.1, 10. , -0.1]), False)
(5, array([-0.1, -0.1, -0.1, -0.1]), False)
(2, array([-1. , -0.1, -0.1, -0.1]), False)
(0, array([10., 10., 10., 10.]), True)


## Q-Learning

As we have just seen, the SARSA algorithm is an on policy action value TD estimation method. The **Q-learning** algorithm is a **off policy** TD action value estimation method. 

The update formula for single step Q-learning or **Q-learning(0)** is:

$$Q(S_t,A_t) = Q(S_t,A_t) + \alpha \big[ R_{t+1} + \gamma\ max_a Q(S_{t+1},a) - Q(S_t,A_t) \big]$$  

Where,   
$\delta_t = R_{t+1} + \gamma max_a Q(S_{t+1},a) - Q(S_t,A_t) = $ the TD error,   
$max_a = $ the maximum operator applied to all possible actions in state $S_{t+1}$,   
$Q(S_t,A_t) = $ is the action value in state S given action A,  
$R_{t+1} = $ is the reward for the next time step,   
$\alpha = $ the learning rate,   
$\gamma = $ discount factor.  

The use of the operator $max_a$ makes Q-learning greedy. But, why does using this operator result in an off-policy algorithm? To answer this question, examine the backup diagram shown below. 

<img src="img/Q-Learning.JPG" alt="Drawing" style="width:200px; height:150px"/>
<center> **Backup Diagram for one-step Q-Learning** </center>

The $max_a$ greedily picks the action with the greatest value, regardless of policy. Therefore, Q-learning is an off-policy algorithm. 

### Why is Q-learning off-policy

$$\pi(S_t,A_t)$$

### Q-Learning Example

The code in the cell below implements the one step Q-learning(0) algorithm. The code is nearly identical to the SARSA(0) code shown previously. The main difference is the addition of the $max_a$ operation when computing the TD error, $\delta_t$.  Additional details on this algorithm can be seen by reading the code comments.  
 
Execute this code for the random walk policy on the grid world and examine the results. 

In [2]:
initial_policy = {0:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        1:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}, 
                        2:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        3:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        4:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        5:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        6:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        7:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        8:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        9:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        10:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        11:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        12:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        13:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        14:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        15:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}}

def start_episode(n_states, n_actions):
    '''Function to find a random starting values for the episode
    that is not the terminal state'''
    state = nr.choice(range(n_states))
    while(is_terminal(state)):  ## Make sure not starting at the terminal state
         state = nr.choice(range(n_states))
    ## Now find a random starting action index
    a_index = nr.choice(range(4), size = 1)[0]
    s_prime, reward, terminal = simulate_environment(state, action_lookup(a_index))   
    return state, a_index, reward[a_index] ## action_lookup(a_index), reward[a_index]

## test the function to make sure never starting in terminal state
[start_episode(15,4) for _ in range(10)]

[(2, 2, 10.0),
 (5, 3, -0.1),
 (14, 2, -0.1),
 (10, 0, -0.1),
 (12, 0, -0.1),
 (7, 1, -0.1),
 (1, 1, -0.1),
 (12, 3, -0.1),
 (1, 2, 10.0),
 (4, 3, -0.1)]

In [3]:
def take_action(state, policy):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    action = action_lookup(nr.choice(range(len(policy[0].keys())), p = list(policy[state].values()))) 
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for several states
for s in range(16):
    print(take_action(s, initial_policy))

('l', 0, array([10., 10., 10., 10.]), True)
('r', 2, array([-1. , -0.1, -0.1, -0.1]), False)
('r', 3, array([-1. , -0.1, -0.1, -1. ]), False)
('r', 3, array([-1. , -0.1, -0.1, -1. ]), False)
('r', 5, array([-0.1, -0.1, -0.1, -0.1]), False)
('u', 1, array([-1. , -0.1, 10. , -0.1]), False)
('u', 2, array([-1. , -0.1, -0.1, -0.1]), False)
('r', 7, array([-0.1, -0.1, -0.1, -1. ]), False)
('u', 4, array([10. , -0.1, -1. , -0.1]), False)
('l', 8, array([-0.1, -0.1, -1. , -0.1]), False)
('d', 14, array([-0.1, -1. , -0.1, -0.1]), False)
('u', 7, array([-0.1, -0.1, -0.1, -1. ]), False)
('l', 12, array([-0.1, -1. , -1. , -0.1]), False)
('u', 9, array([-0.1, -0.1, -0.1, -0.1]), False)
('r', 15, array([-0.1, -1. , -0.1, -1. ]), False)
('l', 14, array([-0.1, -1. , -0.1, -0.1]), False)


In [4]:
def print_Q(Q):
    Q = pd.DataFrame(Q, columns = ['up', 'down', 'left', 'right'])
    print(Q)

def update_Q(Q, current_state, a_index, reward, alpha, gamma):
    """Function to update the actions values in the Q matrix"""
    ## Get s_prime given s and a
    s_prime, reward_prime, terminal = simulate_environment(current_state, action_lookup(a_index))
    a_prime_index = nr.choice(np.where(reward_prime == max(reward_prime))[0], size = 1)[0]
    ## Update the action values 
    Q[current_state,a_index] = Q[current_state,a_index] + alpha * (reward + gamma * (Q[s_prime,a_prime_index] - Q[current_state,a_index]))
    return Q, s_prime, reward_prime, terminal, a_prime_index

def Q_learning_0(policy, episodes, alpha = 0.2, gamma = 0.9):
    """
    Function to perform Q-learning(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    n_actions = len(policy[0].keys())
    
    ## Initialize Q matrix
    Q = np.zeros((n_states,n_actions))
    
    for _ in range(episodes): # Loop over the episodes
        terminal = False
        ## Find the inital state, action index and reward
        current_state, a_index, reward = start_episode(n_states,n_actions)
        
        while(not terminal): # Episode ends where get to terminal state   
            ## Update the action values in Q
            Q, s_prime, reward_prime, terminal, a_prime_index = update_Q(Q, current_state, a_index, reward, alpha, gamma)
            ## Set action, reward and state for next iteration
            a_index = a_prime_index
            current_state = s_prime
            reward = reward_prime[a_prime_index]
    return(Q)

Q = Q_learning_0(initial_policy, 1000)
print_Q(Q)

           up      down       left      right
0    0.000000  0.000000   0.000000   0.000000
1    8.623594  8.516038  11.111111   8.300862
2    8.006739  9.388375  12.296909   9.400343
3    7.627920  9.058208  10.583861   7.595816
4   11.111111  7.611493   9.575815  10.020892
5   10.843975  9.391161  10.838495   9.641176
6    9.788479  8.671816  10.233252   8.826360
7    9.672429  8.639434   9.571359   6.236508
8   11.203943  9.111985   7.978905   9.232373
9   10.169199  8.804533   9.122590   8.623188
10   9.332563  8.391409   8.976757   8.359066
11   9.310721  8.271786   8.710405   5.830758
12  10.370852  7.105834   6.592921   8.797197
13   9.197513  5.987868   9.184293   8.476149
14   8.588262  5.400399   8.730751   8.285717
15   8.565980  6.753404   8.319820   6.592903


In [5]:
def update_policy(policy, Q, epsilon):
    '''Updates the policy based on estiamtes of Q using 
    an epslion greedy algorithm. The action with the highest
    action value is used.'''
    
    ## Find the keys for the actions in the policy
    keys = list(policy[0].keys())
    
    ## Iterate over the states and find the maximm action value.
    for state in range(len(policy)):
        ## First find the index of the max Q values  
        q = Q[state,:]
        max_action_index = np.where(q == max(q))[0]

        ## Find the probabilities for the transitions
        n_transitions = float(len(q))
        n_max_transitions = float(len(max_action_index))
        p_max_transitions = (1.0 - epsilon *(n_transitions - n_max_transitions))/(n_max_transitions)
  
        ## Now assign the probabilities to the policy as epsilon greedy.
        for key in keys:
            if(index_lookup(key) in max_action_index): policy[state][key] = p_max_transitions
            else: policy[state][key] = epsilon
    return(policy)                

update_policy(initial_policy, Q, 0.1)    

{0: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 1: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 2: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 3: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 4: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 5: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 6: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 7: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 8: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 9: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 10: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 11: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 12: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 13: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 14: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 15: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1}}

### Double Q-Learning

The Q-learning algorithm just presented has a significant **bias**. To understand why this might be consider the following thought experiment. In most cases the sampled action values are inaccurate. Some action values will have a positive error and some will have a negative error. In the error is on the order of the values themselves, the $max_a$ operator has a reasonable chance of selecting an action value that is the largest because of this error. However, the $max_a$ operator will never select an action with a low value simply because of the errors. The net result is a bias toward action values with the largest positive error. 

What can be done to correct this situation? One relatively simple and effective algorithm is known as **double Q-learning**. Double Q-learning maintains two tables of action values. The values from one table are used to perform the bootstrap updates of the other table and vice versa. This approach averages out the bias. For two tables, $Q_1$ and $Q_2$ we can express double Q-learning as follows:

$$Q_1(S_t,A_t) = Q_1(S_t,A_t) + \alpha \big[ R_{t+1} + \gamma Q_2 \big( S_{t+1},argmax_a Q_1(S_{t+1} , a) \big) - Q_1(S_t,A_t) \big] \\
Q_2(S_t,A_t) = Q_2(S_t,A_t) + \alpha \big[ R_{t+1} + \gamma Q_1 \big( S_{t+1},argmax_a Q_2(S_{t+1} , a) \big) - Q_2(S_t,A_t) \big]$$  

With a 0.5 probability one or the other of these expressions is used for the TD update at each time step. While double Q-learning requires twice as much memory, to maintain the two tables, the computational complexity is the same when compared to Q-learning. 

> **Note:** Another **unbiased** one step off policy TD algorithm is known as **expected SARSA**. See Section 6.6 of Sutton and Barto, second edition, for details.   

### Example of Double Q-Learning

The code in the cell below implements the double Q-learning algorithm. The steps are essentially the same as the foregoing code, except for the updates of the two Q tables. Additional details on this algorithm can be seen by reading the code comments.  

Execute this code and examine the results. 

In [6]:
def update_double_Q(q1, q2, current_state, a_index, reward, alpha, gamma):
    """Function to update the actions values in the Q matrix"""
    ## Get s_prime given s and a
    s_prime, reward_prime, terminal = simulate_environment(current_state, action_lookup(a_index))
    a_prime_index = nr.choice(np.where(reward_prime == max(reward_prime))[0], size = 1)[0]
    ## Update the action values 
    q1[current_state,a_index] = q1[current_state,a_index] + alpha * (reward + gamma * (q2[s_prime,a_prime_index] - q1[current_state,a_index]))
    return q1, s_prime, reward_prime, terminal, a_prime_index


def double_Q_learning_0(policy, episodes, alpha = 0.2, gamma = 0.9):
    """
    Function to perform Q-learning(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    n_actions = len(policy[0].keys())
    
    ## Initialize both Q matricies
    Q1 = np.zeros((n_states,n_actions))
    Q2 = np.zeros((n_states,n_actions))
    
    for _ in range(episodes): # Loop over the episodes
        terminal = False
        ## Find the inital state, action index and reward
        current_state, a_index, reward = start_episode(n_states,n_actions)
        
        while(not terminal): # Episode ends where get to terminal state   
            ## Update the action values in Q1 or Q2 based on random choice
            if(nr.uniform() <= 0.5):
                Q1, s_prime, reward_prime, terminal, a_prime_index = update_double_Q(Q1, Q2, current_state, a_index, reward, alpha, gamma)
            else:
                Q2, s_prime, reward_prime, terminal, a_prime_index = update_double_Q(Q2, Q1, current_state, a_index, reward, alpha, gamma)
            ## Set action, reward and state for next iteration
            a_index = a_prime_index
            current_state = s_prime
            reward = reward_prime[a_prime_index]
    return(Q1)

Q = double_Q_learning_0(initial_policy, 1000)
print_Q(Q)

           up      down       left     right
0    0.000000  0.000000   0.000000  0.000000
1    7.955846  7.648373  11.111111  6.260142
2    2.378590  7.893437  11.084556  7.793346
3    4.752731  7.318153  10.010388  6.109519
4   11.111111  4.429610   9.050392  5.642359
5   10.999939  7.804638  10.981116  8.177650
6    7.503754  7.444300   9.073326  7.756483
7    8.035907  7.149989   8.530987  3.953227
8   11.808918  7.994408   7.226267  8.055853
9    8.189998  7.614877   8.054167  7.567628
10   8.178048  7.378599   8.456243  7.112965
11   7.424702  6.873993   7.291253  2.440998
12   9.303148  2.862937   4.957945  7.579751
13   8.207038  4.253232   8.265794  7.450918
14   7.422226  3.700480   7.501077  6.851646
15   7.224980  2.373863   7.481431  3.853542


In [7]:
update_policy(initial_policy, Q, 0.1)    

{0: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 1: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 2: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 3: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 4: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 5: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 6: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 7: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 8: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 9: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 10: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 11: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 12: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 13: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 14: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 15: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}}

### GPI with Double Q Learning

The code in the cell below applies general policy iteration using double Q-learning(0) to estimate the action values.  Additional details on this algorithm can be seen by reading the code comments.  

Execute this code and examine the results. 

In [12]:
from copy import deepcopy
def double_Q_learning_0_GPI(policy, neighbors, reward, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = double_Q_learning_0(policy, neighbors, rewards, episodes = episodes, goal = goal)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
 

Double_Q_0_Policy = double_Q_learning_0_GPI(initial_policy, neighbors, rewards, cycles = 10, episodes = 500, goal = 0, alpha = 0.2, epsilon = 0.01)
Double_Q_0_Policy 

TypeError: double_Q_learning_0() got multiple values for argument 'episodes'

In [None]:
np.round(np.array(td_0_state_values(Double_Q_0_Policy, n_samps = 10000, goal = 0)).reshape((4,4)), 4)

## N-Step Off-Policy Learning with Importance Sampling

For n-step off-policy learning we update a target policy $\pi(A_t|S_t)$ using samples from a behavior policy $b(A_t|S_t)$. Since the two policies differ, the probabilities of an action given the state will undoubtedly differ. For example, the behavior policy can be exploratory whereas, the target policy is greedy. 

To account for the different probabilities of sampling we reweight by the **importance sampling ratio**. For an n-step algorithm at time step $t$ the importance sampling ratio can be expressed as:

$$\rho_{t:t + n -1} = \prod_{k=\tau}^{min(t + n -1,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

The n-step TD update then becomes:

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha\ \rho_{t:t+n-1} \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big],\ 0 \leq t < T]$$

And the SARSA update becomes:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\ \rho_{t:t+n-1} \big[ G_{t:t+n} - Q_{t+n-1}(S_t,A_t) \big],\ 0 \leq t < T]$$

For both of the above update equations consider the effect of importance sampling ratio. If the action given state is more likely under the target policy that the behavior policy, more weight is given to updating with the error term. However, If the action given state is less likely under the target policy that the behavior policy, less weight is given to updating with the error term. In this way, the weighting by the importance sampling ratio gives the correct updates for the target policy regardless of the transition probabilities of the behavior policy. 

> **NOte:** Considerably more detail on n-step off-policy RL algorithms can be found in Sutton and Barto, second edition, Sections 7.3, 7.4 and 7.5. 

#### Copyright 2018, 2019, Stephen F Elston. All rights reserved. 