# Homework 8

## CSCI E-82A


In the a previous homework assignments, you used two different dynamic programming algorithms and Monte Carlo reinforcement learning to solve a robot navigation problem by finding optimal paths to a goal in a simplified warehouse environment. Now you will use time differencing reinforcement learning to find optimal paths in the same environment.

The configuration of the warehouse environment is illustrated in the figure below.

<img src="GridWorldFactory.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **Grid World for Factory Navigation Example** </center>

The goal is for the robot to deliver some material to position (state) 12, shown in blue. Since there is a goal state or **terminal state** this an **episodic task**. 

There are some barriers comprised of the states $\{ 6, 7, 8 \}$ and $\{ 16, 17, 18 \}$, shown with hash marks. In a real warehouse, these positions might be occupied by shelving or equipment. We do not want the robot to hit these barriers. Thus, we say that transitioning to these barrier states is **taboo**.

As before, we do not want the robot to hit the edges of the grid world, which represent the outer walls of the warehouse. 

## Representation

You are, no doubt, familiar with the representation for this problem by now.    

As with many such problems, the starting place is creating the **representation**. In the cell below encode your representation for the possible action-state transitions. From each state there are 4 possible actions:
- up, u
- down, d,
- left, l
- right, r

There are a few special cases you need to consider:
- Any action transitioning state off the grid or into a barrier should keep the state unchanged. 
- Any action in the goal state keeps the state unchanged. 
- Any transition within the taboo (barrier) states can keep the state unchanged. If you experiment, you will see that other encodings work as well since the value of a barrier states are always zero and there are no actions transitioning into these states. 

> **Hint:** It may help you create a pencil and paper sketch of the transitions, rewards, and probabilities or policy. This can help you to keep the bookkeeping correct. 

In [1]:
## import numpy for latter
import numpy as np
import numpy.random as nr



In [2]:
neighbors = {0:{'u':0, 'd':5, 'l':0, 'r':1},
          1:{'u':1, 'd':1, 'l':0, 'r':2},
          2:{'u':2, 'd':2, 'l':1, 'r':3},
          3:{'u':3, 'd':3, 'l':2, 'r':4},
          4:{'u':4, 'd':9, 'l':3, 'r':4},
          
          5:{'u':0, 'd':10, 'l':5, 'r':5},
          6:{'u':6, 'd':6, 'l':6, 'r':6},###barrier
          7:{'u':7, 'd':7, 'l':7, 'r':7},###barrier
          8:{'u':8, 'd':8, 'l':8, 'r':8},###barrier
          9:{'u':4, 'd':14, 'l':9, 'r':9},
          
          10:{'u':5, 'd':15, 'l':10, 'r':11},
          11:{'u':11, 'd':11, 'l':10, 'r':12},
          12:{'u':12, 'd':12, 'l':12, 'r':12},#goal
          13:{'u':13, 'd':13, 'l':12, 'r':14},
          14:{'u':9, 'd':19, 'l':13, 'r':14},
          
          15:{'u':10, 'd':20, 'l':15, 'r':15},
          16:{'u':16, 'd':16, 'l':16, 'r':16},###barrier
          17:{'u':17, 'd':17, 'l':17, 'r':17},###barrier
          18:{'u':18, 'd':18, 'l':18, 'r':18},###barrier
          19:{'u':14, 'd':24, 'l':19, 'r':19},
          
          20:{'u':15, 'd':20, 'l':20, 'r':21},
          21:{'u':21, 'd':21, 'l':20, 'r':22},
          22:{'u':22, 'd':22, 'l':21, 'r':23},
          23:{'u':23, 'd':23, 'l':22, 'r':24},
          24:{'u':19, 'd':24, 'l':23, 'r':24}}

You need to define the initial transition probabilities for the Markov process. Set the probabilities for each transition as a **uniform distribution** leading to random action by the robot. 

> **Note:** As these are just starting values, the exact values of the transition probabilities are not actually all that important in terms of solving the RL problem. Also, notice that it does not matter how the taboo state transitions are encoded. The point of the DP algorithm is to learn the transition policy.

In [3]:
policy = {0:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          1:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          2:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          3:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          4:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          
          5:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          6:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          7:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          8:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          9:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          
          10:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          11:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          12:{'u':0, 'd':0, 'l':0, 'r':0},#goal
          13:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          14:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          
          15:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          16:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          17:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          18:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          19:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          
          20:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          21:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          22:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          23:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          24:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}}

The robot receives the following rewards:
- 10 for entering position 0. 
- -1 for attempting to leave the grid. In other words, we penalize the robot for hitting the edges of the grid.  
- -0.1 for all other state transitions, which is the cost for the robot to move from one state to another. If we did not have this penalty, the robot could follow any random plan to the goal which did not hit the edges. 

This **reward structure is unknown to the MC RL agent**. The agent must **learn** the rewards by sampling the environment. 

In the code cell below encode your representation of this reward structure you will use in your simulated environment.  

In [4]:
reward = {0:{'u':-1, 'd':-0.1, 'l':-1, 'r':-0.1},
          1:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          2:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          4:{'u':-1, 'd':-0.1, 'l':-1, 'r':-0.1},
          
          5:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          6:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          7:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          8:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          9:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          
          10:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-0.1},
          11:{'u':-1, 'd':-1, 'l':-0.1, 'r':10},
          12:{'u':0, 'd':0, 'l':0, 'r':0},#goal
          13:{'u':-1, 'd':-1, 'l':10, 'r':-0.1},
          14:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1},
          
          15:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          16:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          17:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          18:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          19:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          
          20:{'u':-0.1, 'd':-1, 'l':-0.1, 'r':-0.1},
          21:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          22:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          23:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          24:{'u':-0.1, 'd':-1, 'l':-0.1, 'r':-1}}

You will find it useful to create a list of taboo states, which you can encode in the cell below.

In [5]:
taboo = [6,7,8,16,17,18]


## TD(0) Policy Evaluation

With your representations defined, you can now create and test functions to perform TD(0) **policy evaluation**. 

As a first step you will need a function to find the rewards and next state given a state and an action. You are welcome to start with the `state_values` function from the TD/Q-learning notebook. However, keep in mind that you must modify this code to correctly treat the taboo states of the barrier. Specifically, taboo states should not be visited. 

Execute your code to test it for each possible action from state 11.  

In [6]:
def state_values(s, action, neighbors = neighbors, rewards = reward):
    """
    Function simulates the environment
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward = rewards[s][action]
    return (s_prime,reward)

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(state_values(11, a))
    

(11, -1)
(11, -1)
(12, 10)
(10, -0.1)


Examine your results. Are the action values consistent with the transitions?

ANS: No, they differ by the rewards values

Next, you need to create a function to compute the state values using the TD(0) algorithm. You should use the function you just created  to find the rewards and next state given a state and action. You are welcome to use the `td_0_state_values` function from the TD/Q-learning notebook as a starting point.  

Execute your function for 1,000 episodes and examine the results.

In [7]:
def td_0_state_values(policy, n_samps, goal, alpha = 0.2, gamma = 0.9):
    """
    Function for TD(0) policy 
    """
    ## Initialize the state list and state values
    states = list(policy.keys())
    v = [0]*len(list(policy.keys()))
    action_index = list(range(len(list(policy[0].keys()))))
    for _ in range(n_samps):
        s = nr.choice(states, size =1)[0]
        probs = list(policy[s].values())
        if(s not in taboo+[goal]):
            a = list(policy[s].keys())[nr.choice(action_index, size = 1, p = probs)[0]]
        else:
            a = list(policy[s].keys())[nr.choice(action_index, size = 1)[0]]
        transistion = state_values(s, a)
        v[s] = v[s] + alpha * (transistion[1] +  gamma * v[transistion[0]] - v[s])
    return(v)
    
   

In [8]:
nr.seed(345)    
np.round(np.array(td_0_state_values(policy, n_samps = 1000, goal = 12)).reshape((5,5)), 4) 

array([[-2.1482, -2.349 , -2.6565, -2.6894, -2.4085],
       [-2.3095,  0.    ,  0.    ,  0.    , -2.6015],
       [ 2.0263,  4.7749,  0.    ,  0.9428, -1.7944],
       [-1.2251,  0.    ,  0.    ,  0.    , -2.3868],
       [-1.9846, -2.4509, -2.8161, -2.8375, -2.7372]])

Examine your results and answer the following questions to ensure you action value function operates correctly:
1. Are the values of the taboo states 0? ANS: Yes
2. Are the states with the highest values adjacent to the terminal state? ANS: Partialy, because weirdly the right side,13, is smaller than 10. 
3. Are the values of the states decreasing as the distance from the terminal state increases? ANS: Yes


## SARSA(0) Policy Improvement

Now you will perform policy improvement using the SARSA(0) algorithm.  You are welcome to start with the `select_a_prime` and `SARSA_0` functions from the TD/Q-learning notebooks.    

Execute your code for 1,000 episodes, and with $\alpha = 0.2$, and $\epsilon = 0.1$)

In [9]:
import copy

def select_a_prime(s_prime, policy, action_index, greedy, goal):
    ## Randomly select an action prime 
    ## Make sure to handle the terminal state
    if(s_prime != goal and greedy): 
        probs = list(policy[s_prime].values())
        a_prime_index = nr.choice(action_index, size = 1, p = probs)[0]
        a_prime = list(policy[s_prime].keys())[a_prime_index]
    else: ## Don't probability weight for terminal state or non-greedy selecttion
        a_prime_index = nr.choice(action_index, size = 1)[0]
        a_prime = list(policy[s_prime].keys())[a_prime_index]   
    return(a_prime_index, a_prime)


def SARSA_0(policy, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform SARSA(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(action_index),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample a state at random ensuring it is not terminal state
        s = nr.choice(states, size = 1)[0]
        while(s in taboo+[goal]): s = nr.choice(states, size = 1)[0]
        ## Now choose action given policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = float('inf') # Value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state 
            ## The next state given the action
            s_prime, reward = state_values(s, a)
            a_prime_index, a_prime = select_a_prime(s_prime, current_policy, action_index, True, goal)
     
            ## Update the action values
            Q[a_index,s] = Q[a_index,s] + alpha * (reward + gamma * Q[a_prime_index,s_prime] - Q[a_index,s])
            
            ## Set action and state for next iteration
            a = a_prime
            a_index = a_prime_index
            s = s_prime

    return(Q)



In [10]:
Q = SARSA_0(policy, 1000, goal = 12, alpha = 0.2, epsilon = 0.1)

for i in range(4):
    print(np.round(Q[i,:].reshape((5,5)), 4))

[[-4.2282 -5.2765 -5.6288 -5.4639 -5.7089]
 [-3.586   0.      0.      0.     -4.5257]
 [-2.6974  2.564   0.      0.7909 -3.1976]
 [-0.5514  0.      0.      0.     -1.7296]
 [-2.0087 -5.3606 -5.1868 -5.1598 -3.2489]]
[[-2.4912 -4.8844 -5.3963 -5.4337 -3.73  ]
 [-0.9637  0.      0.      0.     -1.911 ]
 [-2.4687  0.3427  0.     -0.9032 -3.3185]
 [-3.3496  0.      0.      0.     -3.869 ]
 [-4.4689 -4.5704 -5.2045 -5.2181 -4.821 ]]
[[-4.2811 -3.464  -4.14   -4.6461 -5.5983]
 [-2.9792  0.      0.      0.     -4.5057]
 [-0.2172  0.3087  0.     10.      1.4675]
 [-3.6716  0.      0.      0.     -4.386 ]
 [-3.0332 -3.1384 -3.794  -4.4014 -4.4273]]
[[-4.4285 -4.2415 -4.4059 -4.3376 -4.2544]
 [-3.2421  0.      0.      0.     -4.5583]
 [ 4.698  10.      0.     -1.9334 -3.2374]
 [-3.9916  0.      0.      0.     -3.1249]
 [-4.0677 -4.3701 -4.5946 -3.9306 -4.4064]]


Examine the action values you have computed. Ensure that the action values are 0 for the goal and taboo states. Also check that the actions with the largest values for each state make sense in terms of reaching the goal. 

With the action value function completed, you will now create and test code to perform GPI with SARSA(0).  You are welcome to use the `SRASA_0_GPI` function from the TD/Q-learning notebook as a starting point. 

Execute your code for 10 cycles of 100 episodes, with $\alpha = 0.2$, $\gamma = 0.9$ and $\epsilon = 0.01$, and examine the results.

In [11]:
def SARSA_0_GPI(policy, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = SARSA_0(policy, episodes = episodes, goal = goal, alpha = alpha, epsilon = epsilon)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    


In [12]:
 

SARSA_0_Policy = SARSA_0_GPI(policy,goal=12, cycles = 10, episodes = 100)
SARSA_0_Policy

{0: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 1: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 2: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 3: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 4: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 5: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 6: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 7: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 8: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 9: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 10: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 11: {'d': 0.10000000000000002,
  'l': 0.10000000

Verify that your results make sense? For example, starting at state 2 or 22, do the most probable actions follow a shortest path?

ANS: Yes, both state 2 or 22 show the path to go left side.

## Apply Double Q-Learning

As a next step, you will apply Double Q-learning(0) to the warehouse navigation problem. In the cell below create and test a function to perform Double Q-Learning for this problem. You are welcome to use the `double_Q_learning` function from the TD/Q-learning notebook as a starting point.

Execute your code for 10 cycles of 500 episodes, with $\alpha = 0.2$, and $\gamma = 0.9$ and examine the results.

In [13]:
def Q_learning_0(policy, neighbors, rewards, episodes, goal, alpha = 0.2, gamma = 0.9):
    """
    Function to perform Q-learning(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    possible_actions = list(rewards[0].keys())
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(possible_actions),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample an intial state at random but make sure it is not goal
        s = nr.choice(states, size = 1)[0]
        while(s in taboo+ [goal]): s = nr.choice(states, size = 1)[0]
        ## Now choose action following policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = n_states + 1 # Dummy value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state   
            ## Get s_prime given s and a
            s_prime = neighbors[s][a]
            
            ## Find the index or indices of maximum action values for s_prime
            ## Break any tie with multiple max values by random selection
            action_values = Q[:,s_prime]
            a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
            a_prime = possible_actions[a_prime_index]
            
            ## Lookup the reward 
            reward = rewards[s][a]
            
            ## Update the action values
            Q[a_index,s] = Q[a_index,s] + alpha * (reward + gamma * Q[a_prime_index,s_prime] - Q[a_index,s])
            
            ## Set action and state for next iteration
            a = a_prime
            a_index = a_prime_index
            s = s_prime

    return(Q)


In [14]:

QQ = Q_learning_0(policy, neighbors, reward, episodes=500, goal = 12)

for i in range(4):
    print(np.round(QQ[i,:].reshape((5,5)), 4))

[[4.6576 2.3519 3.0003 0.8702 1.6673]
 [4.5709 0.     0.     0.     4.5546]
 [3.0681 6.6243 0.     5.9028 5.0752]
 [7.91   0.     0.     0.     7.91  ]
 [7.019  1.6248 1.6534 3.5173 7.019 ]]
[[7.019  2.9334 2.5275 2.3373 7.0189]
 [7.91   0.     0.     0.     7.91  ]
 [3.5445 6.6524 0.     6.8731 5.1443]
 [4.6978 0.     0.     0.     3.6438]
 [4.5848 3.1666 2.3795 3.3294 4.7006]]
[[ 3.0979  6.2171  5.4916  1.2128  1.3888]
 [ 3.9946  0.      0.      0.      4.9742]
 [ 6.2168  4.455   0.     10.      8.9   ]
 [ 0.9325  0.      0.      0.      5.2453]
 [ 4.3729  6.2166  5.4281  1.0831  3.9313]]
[[ 4.2212  4.2177  3.1193  6.1009  5.4883]
 [ 5.2951  0.      0.      0.      4.4764]
 [ 8.9    10.      0.      5.2973  5.1199]
 [ 4.745   0.      0.      0.      5.2005]
 [ 3.3908  2.5138  2.1374  6.2125  1.3775]]


In [15]:
def double_Q_learning_0(policy, neighbors, rewards, episodes, goal, alpha = 0.2, gamma = 0.9):
    """
    Function to perform SARSA(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    possible_actions = list(rewards[0].keys())
    action_index = list(range(len(list(policy[0].keys()))))
    Q1 = np.zeros((len(possible_actions),len(states)))
    Q2 = np.zeros((len(possible_actions),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample an intial state at random but make sure it is not goal
        s = nr.choice(states, size = 1)[0]
        while(s in taboo+[goal]): s = nr.choice(states, size = 1)[0]
        ## Now choose action following policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = n_states + 1 # Dummy value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state   
            ## Get s_prime given s and a
            s_prime = neighbors[s][a]
            
            ## Update one or the other action values at random
            if(nr.uniform() <= 0.5):
                ## Find the index or indices of maximum action values for s_prime
                ## Break any tie with multiple max values by random selection
                action_values = Q1[:,s_prime]
                a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
                a_prime = possible_actions[a_prime_index]
                ## Lookup the reward 
                reward = rewards[s][a]
                ## Update Q1 
                Q1[a_index,s] = Q1[a_index,s] + alpha * (reward + gamma * Q2[a_prime_index,s_prime] - Q1[a_index,s])
            
                ## Set action and state for next iteration
                a = a_prime
                a_index = a_prime_index
                s = s_prime
            
            else:
                ## Find the index or indices of maximum action values for s_prime
                ## Break any tie with multiple max values by random selection
                action_values = Q2[:,s_prime]
                a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
                a_prime = possible_actions[a_prime_index]
                ## Lookup the reward 
                reward = rewards[s][a]
                ## Update Q2
                Q2[a_index,s] = Q2[a_index,s] + alpha * (reward + gamma * Q1[a_prime_index,s_prime] - Q2[a_index,s])
            
                ## Set action and state for next iteration
                a = a_prime
                a_index = a_prime_index
                s = s_prime

    return(Q1)


In [16]:

doudble_Q = double_Q_learning_0(policy, neighbors, reward, 500, goal = 12)

for i in range(4):
    print(np.round(doudble_Q[i,:].reshape((5,5)), 4))

[[ 2.9053  1.0291 -0.0575  1.0663  0.7141]
 [ 3.5645  0.      0.      0.      4.1165]
 [ 2.0151  3.8248  0.      1.44    3.9789]
 [ 7.9073  0.      0.      0.      7.91  ]
 [ 6.8824  0.2245  0.3559  0.8978  7.0167]]
[[ 6.9182 -0.4105 -0.532   0.6655  7.0148]
 [ 7.9079  0.      0.      0.      7.9097]
 [ 2.3816  3.904   0.      2.752   4.4786]
 [ 1.1378  0.      0.      0.      3.5121]
 [ 0.5783 -0.726  -0.2829  1.2675  2.5116]]
[[ 3.6394  5.6451  0.8454  2.314   1.4896]
 [ 0.9125  0.      0.      0.      1.7264]
 [ 2.1984  3.5504  0.     10.      8.9   ]
 [ 0.9653  0.      0.      0.      3.265 ]
 [ 0.5424  4.0794  0.3063  2.1038  0.9592]]
[[ 0.4252  0.2685  4.9937  6.1278  2.7775]
 [ 1.7723  0.      0.      0.      3.6496]
 [ 8.9    10.      0.      5.2978  4.0568]
 [ 1.5804  0.      0.      0.      3.5042]
 [ 0.2847  2.0146  5.0325  6.179   3.0988]]


Examine the action values you have computed. Ensure that the action values are 0 for the goal and taboo states. Also check that the actions with the largest values for each state make sense in terms of reaching the goal. 

With the action value function completed, you will now create and test code to perform GPI with Double Q-Learning(0).  You are welcome to use the `double_Q_learning_0_GPI` function from the TD/Q-learning notebook as a starting point. 

Execute your code for 10 cycles of 500 episodes, with $\alpha = 0.2$, $\gamma = 0.9$ and $\epsilon = 0.01$, and examine the results. 

In [17]:
def double_Q_learning_0_GPI(policy, neighbors, reward, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = double_Q_learning_0(policy, neighbors, reward, episodes = episodes, goal = goal)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    


In [18]:
Double_Q_0_Policy = double_Q_learning_0_GPI(policy, neighbors, reward, cycles = 10, episodes = 500, goal = 12)
Double_Q_0_Policy 

{0: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 1: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 2: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 3: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 4: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 5: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 6: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 7: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 8: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 9: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 10: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 11: {'d': 0.10000000000000002,
  'l': 0.10000000

Verify that your results make sense? For example, starting at state 2 or 22, do the most probable actions follow a shortest path?

ANS: Um.. at state 2 is something weird, because it shows go 'r', whereas it was 'l' in SARSA.
    But it is still shortest path though. 
    State 22 is 'l' which is same as before.

## N-Step TD Learning

Finally, you will apply N-Step TD learning and N-Step SARSA to the warehouse navigation problem.  First create a function to perform N-step TD policy evaluation. You are welcome to start with the `TD_n` policy evaluation function from the TD/Q-Learning notebook. 

Test your function using 1,000 episodes, $n = 4$, $\gamma = 0.9$, and $\alpha = 0.2$.

In [20]:
def TD_n(policy, episodes, n, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform TD(N) policy evaluation.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    v = [0]*len(list(policy.keys()))
    
    current_policy = copy.deepcopy(policy)
    
    
    ## sample an initial state at random and make sure is not terminal state
    s = nr.choice(states, size = 1)[0]
    while(s in taboo+[goal]):
        s = nr.choice(states, size = 1)[0]  
        
    for _ in range(episodes): # Loop over the episodes
        T = float("inf")
        tau = 0
        reward_list = []
        t = 0
        
        while(tau != T - 1): # Episode ends where get to terminal state 
            if(t < T):
                ## Choose action given policy
                probs = list(policy[s].values())
                a = list(policy[s].keys())[nr.choice(action_index, size = 1, p = probs)[0]]
                ## The next state given the action
                s_prime, reward = state_values(s, a)
                reward_list.append(reward)  # append the reward to the list
                if(s_prime == goal): T = t + 1  # We reached the terminal state
                
            tau = t - n + 1 ## update the time step being updated

            if(tau >= 0): # Check if enough time steps to compute return
                ## Compute the return
                ## The formula for the first index in the loop is different from Sutton and Barto
                ## but seems to be correct at least for Python.
                G = 0.0 
                for i in range(tau, min(tau + n - 1, T)):
                    G = G + gamma**(i-tau) * reward_list[i]   
                ## Deal with case of where we are not at the terminal state
                if(tau + n < T): G = G + gamma**n * v[s_prime]
                ## Update v
                v[s] = v[s] + alpha * (G - v[s])
            
            ## Set state for next iteration
            if(s_prime != goal):
                s = s_prime
            t = t +1
    return(v)



In [23]:
np.round(np.array(TD_n(policy, episodes = 1000, n = 4, goal = 12, alpha = 0.2, gamma = 0.9)).reshape((5,5)), 4)

array([[-5.1827, -5.3007, -4.6006, -4.4811, -4.0039],
       [-4.2994,  0.    ,  0.    ,  0.    , -3.6184],
       [-1.8351,  8.5854,  0.    ,  2.3435, -2.7753],
       [-2.9688,  0.    ,  0.    ,  0.    , -4.4164],
       [-2.8594, -3.539 , -4.8624, -5.0437, -4.8238]])

Verify that the result you obtained appears correct. Are the values of the goal and taboo states all 0? Do the state values decrease with distance from the goal?

YES

Now that you have an estimate of the best values for the number of steps and the learning rate you can compute the action values using multi-step SARSA. In the cell below, create and test a function to compute the action values using N-step SARSA. You are welcome to use the `SRARSA_n` function from the TD/Q-learning notebook as a starting point. 

Test your function by executing 4-step SARSA for 1,000 episodes with $\alpha = 0.2$ and $\gamma = 0.9$ and using the optimum number of steps and learning rate you have determined. 

In [25]:
def SARSA_n(policy, episodes, n, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform SARSA(N) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(action_index),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample a state at random and make sure is not terminal state
        s = nr.choice(states, size = 1)[0]
        while(s in taboo+[goal]):
            s = nr.choice(states, size = 1)[0]
        ## Now choose action given policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        t = 0 # Initialize the time step count
        T = float("inf")
        tau = 0
        reward_list = []
        while(tau != T - 1): # Episode ends where get to terminal state 
            if(t < T):
                ## The next state given the action
                s_prime, reward = state_values(s, a)
                reward_list.append(reward)  # append the reward to the list
                if(s_prime == goal): T = t + 1  # We reached the terminal state
                else:
                    # Select and store the next action using the policy
                    a_prime_index, a_prime = select_a_prime(s_prime, current_policy, action_index, True, goal)
                
                
            tau = t - n + 1 ## update the time step being updated
  
            if(tau >= 0): # Check if enough time steps to compute return
                ## Compute the return
                ## The formula for the first index in the loop is different from Sutton and Barto
                ## but seems to be correct at least for Python.
                G = 0.0 
                for i in range(tau, min(tau + n, T)):
                    G = G + gamma**(i-tau) * reward_list[i]   
                ## Deal with case of where we are not at the terminal state
                if(tau + n < T): G = G + gamma**n * Q[a_prime_index,s_prime]
                ## Finally, update Q
                Q[a_index,s] = Q[a_index,s] + alpha * (G - Q[a_index,s])
            
            ## Set action and state for next iteration
            if(s_prime != goal):
                s = s_prime   
                a = a_prime 
                a_index = a_prime_index
                
            
            ## increment t
            t = t + 1
    return(Q)


In [26]:

Q = SARSA_n(policy, episodes = 1000, n = 4, goal = 12, alpha = 0.2, gamma = 0.9)

for i in range(4):
    print(np.round(Q[i,:].reshape((5,5)), 4))

[[-5.0316 -6.1924 -6.4394 -6.2852 -5.4243]
 [-4.2968  0.      0.      0.     -4.5333]
 [-3.8832 -2.934   0.     -0.1248 -4.6013]
 [-3.5688  0.      0.      0.     -3.4199]
 [-4.2063 -5.829  -5.4088 -6.195  -4.6529]]
[[-4.4785 -5.8728 -6.0345 -6.3933 -4.1263]
 [-3.9808  0.      0.      0.     -4.3189]
 [-4.514  -1.808   0.     -3.2603 -4.2205]
 [-4.1924  0.      0.      0.     -5.3263]
 [-4.6897 -5.4482 -5.9443 -6.1827 -6.2488]]
[[-5.4124 -4.611  -5.2169 -5.5703 -6.0322]
 [-4.452   0.      0.      0.     -4.6965]
 [-4.5925 -3.7707  0.      8.1643 -1.1079]
 [-5.1071  0.      0.      0.     -4.9628]
 [-4.2205 -3.9273 -5.0363 -5.178  -5.7837]]
[[-5.2485 -5.7384 -5.5723 -5.3475 -4.6591]
 [-5.2497  0.      0.      0.     -5.0167]
 [ 0.2827  8.5595  0.     -3.1464 -4.0874]
 [-5.0973  0.      0.      0.     -4.9852]
 [-3.7471 -4.9646 -5.1118 -5.1226 -6.2858]]


Verify that the results you have computed appear correct using the aforementioned criteria. 

Finally, create a function to use the GPI algorithm with N-step SARSA in the cell below. You are welcome to start with the `SARSA_n_GPI` function from the TD/Q-learning notebook. 

Execute your function using 4 step SARSA for 5 cycles of 500 episodes, with $\alpha = 0.2$, $\epsilon = 0.1$, and $\gamma = 0.9$.

In [27]:
def SARSA_n_GPI(policy, n, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = SARSA_n(policy, episodes, n, goal = goal, alpha = alpha, gamma = gamma, epsilon = epsilon)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
 


In [28]:

SARSA_N_Policy = SARSA_n_GPI(policy, n = 4, cycles = 5, episodes = 500, goal = 12, alpha = 0.2, epsilon = 0.1)
SARSA_N_Policy

{0: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 1: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 2: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 3: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 4: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 5: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 6: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 7: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 8: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 9: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 10: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 11: {'d': 0.10000000000000002,
  'l': 0.10000000

Examine your results. Verify that the most probable paths to the goal from states 2 and 22 are the shortest possible.  

At the states 2 and 22, it shows go left and right respectively, 
which is indicating the shortest path correctly