# Homework 8

## CSCI E-82A


In the a previous homework assignments, you used two different dynamic programming algorithms and Monte Carlo reinforcement learning to solve a robot navigation problem by finding optimal paths to a goal in a simplified warehouse environment. Now you will use time differencing reinforcement learning to find optimal paths in the same environment.

The configuration of the warehouse environment is illustrated in the figure below.

<img src="GridWorldFactory.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **Grid World for Factory Navigation Example** </center>

The goal is for the robot to deliver some material to position (state) 12, shown in blue. Since there is a goal state or **terminal state** this an **episodic task**. 

There are some barriers comprised of the states $\{ 6, 7, 8 \}$ and $\{ 16, 17, 18 \}$, shown with hash marks. In a real warehouse, these positions might be occupied by shelving or equipment. We do not want the robot to hit these barriers. Thus, we say that transitioning to these barrier states is **taboo**.

As before, we do not want the robot to hit the edges of the grid world, which represent the outer walls of the warehouse. 

## Representation

You are, no doubt, familiar with the representation for this problem by now.    

As with many such problems, the starting place is creating the **representation**. In the cell below encode your representation for the possible action-state transitions. From each state there are 4 possible actions:
- up, u
- down, d,
- left, l
- right, r

There are a few special cases you need to consider:
- Any action transitioning state off the grid or into a barrier should keep the state unchanged. 
- Any action in the goal state keeps the state unchanged. 
- Any transition within the taboo (barrier) states can keep the state unchanged. If you experiment, you will see that other encodings work as well since the value of a barrier states are always zero and there are no actions transitioning into these states. 

> **Hint:** It may help you create a pencil and paper sketch of the transitions, rewards, and probabilities or policy. This can help you to keep the bookkeeping correct. 

In [218]:
## import numpy for latter
import numpy as np
import numpy.random as nr
import pandas as pd


## Reference

- [TD in RL](https://towardsdatascience.com/td-in-reinforcement-learning-the-easy-way-f92ecfa9f3ce)

You need to define the initial transition probabilities for the Markov process. Set the probabilities for each transition as a **uniform distribution** leading to random action by the robot. 

> **Note:** As these are just starting values, the exact values of the transition probabilities are not actually all that important in terms of solving the RL problem. Also, notice that it does not matter how the taboo state transitions are encoded. The point of the DP algorithm is to learn the transition policy.

In [219]:
neighbors = {0:{'u':0, 'd':5, 'l':0, 'r':1},
          1:{'u':1, 'd':1, 'l':0, 'r':2},
          2:{'u':2, 'd':2, 'l':1, 'r':3},
          3:{'u':3, 'd':3, 'l':2, 'r':4},
          4:{'u':4, 'd':9, 'l':3, 'r':4},
          5:{'u':0, 'd':10, 'l':5, 'r':5},
          6:{'u':6, 'd':6, 'l':6, 'r':6},
          7:{'u':7, 'd':7, 'l':7, 'r':7},
          8:{'u':8, 'd':8, 'l':8, 'r':8},
          9:{'u':4, 'd':14, 'l':9, 'r':9},
          10:{'u':5, 'd':15, 'l':10, 'r':11},
          11:{'u':11, 'd':11, 'l':10, 'r':12},
          12:{'u':12, 'd':12, 'l':12, 'r':12},
          13:{'u':13, 'd':13, 'l':12, 'r':14},
          14:{'u':9, 'd':19, 'l':13, 'r':14},
          15:{'u':10, 'd':20, 'l':15, 'r':15},
          16:{'u':16, 'd':16, 'l':16, 'r':16},
          17:{'u':17, 'd':17, 'l':17, 'r':17},
          18:{'u':18, 'd':18, 'l':18, 'r':18},
          19:{'u':14, 'd':24, 'l':19, 'r':19},
          20:{'u':15, 'd':20, 'l':20, 'r':21},
          21:{'u':21, 'd':21, 'l':20, 'r':22},
          22:{'u':22, 'd':22, 'l':21, 'r':23},
          23:{'u':23, 'd':23, 'l':22, 'r':24},
          24:{'u':19, 'd':24, 'l':23, 'r':24}}

The robot receives the following rewards:
- 10 for entering position 0. 
- -1 for attempting to leave the grid. In other words, we penalize the robot for hitting the edges of the grid.  
- -0.1 for all other state transitions, which is the cost for the robot to move from one state to another. If we did not have this penalty, the robot could follow any random plan to the goal which did not hit the edges. 

This **reward structure is unknown to the MC RL agent**. The agent must **learn** the rewards by sampling the environment. 

In the code cell below encode your representation of this reward structure you will use in your simulated environment.  

In [220]:
rewards =  {0:{'u':-1, 'd':-0.1, 'l':-1, 'r':-0.1},
          1:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          2:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          4:{'u':-1, 'd':-0.1, 'l':-0.1, 'r':-1},
          5:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          6:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          7:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          8:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          9:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          10:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-0.1},
          11:{'u':-1, 'd':-1, 'l':-0.1, 'r':10},
          12:{'u':10.0, 'd':10.0, 'l':10.0, 'r':10.0}, # 12:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0}
          13:{'u':-1, 'd':-1, 'l':10, 'r':-0.1},
          14:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1},
          15:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          16:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          17:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          18:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          19:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          20:{'u':-0.1, 'd':-1, 'l':-1, 'r':-0.1},
          21:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          22:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          23:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          24:{'u':-0.1, 'd':-1, 'l':-0.1, 'r':-1}}

You will find it useful to create a list of taboo states, which you can encode in the cell below.

In [221]:
taboos = [6, 7, 8, 16, 17, 18]

## TD(0) Policy Evaluation

With your representations defined, you can now create and test functions to perform TD(0) **policy evaluation**. 

As a first step you will need a function to find the rewards and next state given a state and an action. You are welcome to start with the `state_values` function from the TD/Q-learning notebook. However, keep in mind that you must modify this code to correctly treat the taboo states of the barrier. Specifically, taboo states should not be visited. 

Execute your code to test it for each possible action from state 11.  

In [222]:
def action_lookup(index):
    """Helper function returns action given an index"""
    action_dic = {0:'u', 1:'d', 2:'l', 3:'r'}
    return action_dic[index]

def index_lookup(action):
    """Helper function returns index given action"""
    index_dic = {'u':0, 'd':1, 'l':2, 'r':3}
    return index_dic[action]


def next_state(state, action_index, neighbors = neighbors, action_lookup = action_lookup):
    return(neighbors[state][action_lookup[action_index]])

def simulate_environment(s, action, neighbors = neighbors, rewards = rewards, terminal = 12):
    """
    Function simulates the environment for Q-learning.
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward_prime = np.array([rewards[s_prime][a] for a in rewards[0].keys()])
    return (s_prime, reward_prime, is_terminal(s_prime, terminal))
    

def is_terminal(state, terminal = 12):
    return state == terminal

#adding a function to take care of the taboos
def is_taboos(state, taboo = taboos):
    return state in taboo

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(simulate_environment(11, a,terminal=12))
    
    

(11, array([-1. , -1. , -0.1, 10. ]), False)
(11, array([-1. , -1. , -0.1, 10. ]), False)
(12, array([10., 10., 10., 10.]), True)
(10, array([-0.1, -0.1, -1. , -0.1]), False)


We have the expected results here from 11 it can only to 12 or 10

In [223]:
initial_policy  = {0:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      1:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      2:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      3:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      4:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      5:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      6:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      7:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      8:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      9:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      10:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      11:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      12:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                      13:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      14:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      15:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      16:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      17:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      18:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      19:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      20:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      21:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      22:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      23:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      24:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25}}

In [224]:
def start_episode(n_states, n_actions):
    '''Function to find a random starting values for the episode
    that is not the terminal state'''
    state = nr.choice(range(n_states))
    while(is_terminal(state) or is_taboos(state)):  ## Make sure not starting at the terminal state
         state = nr.choice(range(n_states))
    ## Now find a random starting action index
    a_index = nr.choice(range(4), size = 1)[0]
    s_prime, reward, terminal = simulate_environment(state, action_lookup(a_index))   
    return state, a_index, reward[a_index] ## action_lookup(a_index), reward[a_index]

## test the function to make sure never starting in terminal state
test = [start_episode(25,4) for _ in range(25)]
starts = [i[0] for i in test]
True not in list(set([i in taboos for i in starts]))

True

the algo cannot starts from the terminal state or a taboo state

In [225]:
def take_action(state, policy):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    action = action_lookup(nr.choice(range(len(policy[0].keys())), p = list(policy[state].values()))) 
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for several states
for s in range(25):
    print('{} {}'.format(s, take_action(s, initial_policy)))
    

0 ('r', 1, array([-1. , -1. , -0.1, -0.1]), False)
1 ('r', 2, array([-1. , -1. , -0.1, -0.1]), False)
2 ('u', 2, array([-1. , -1. , -0.1, -0.1]), False)
3 ('d', 3, array([-1. , -1. , -0.1, -0.1]), False)
4 ('d', 9, array([-0.1, -0.1, -1. , -1. ]), False)
5 ('l', 5, array([-0.1, -0.1, -1. , -1. ]), False)
6 ('d', 6, array([-1, -1, -1, -1]), False)
7 ('r', 7, array([-1, -1, -1, -1]), False)
8 ('d', 8, array([-1, -1, -1, -1]), False)
9 ('u', 4, array([-1. , -0.1, -0.1, -1. ]), False)
10 ('l', 10, array([-0.1, -0.1, -1. , -0.1]), False)
11 ('d', 11, array([-1. , -1. , -0.1, 10. ]), False)
12 ('u', 12, array([10., 10., 10., 10.]), True)
13 ('l', 12, array([10., 10., 10., 10.]), True)
14 ('l', 13, array([-1. , -1. , 10. , -0.1]), False)
15 ('u', 10, array([-0.1, -0.1, -1. , -0.1]), False)
16 ('d', 16, array([-1, -1, -1, -1]), False)
17 ('l', 17, array([-1, -1, -1, -1]), False)
18 ('u', 18, array([-0.1, -0.1, -0.1, -0.1]), False)
19 ('r', 19, array([-0.1, -0.1, -1. , -1. ]), False)
20 ('r', 2

In [226]:
res = []
for s in range(25):
    if s not in taboos:
        tmp = take_action(s, initial_policy)
        print('{} {}'.format(s, tmp))
        res.append(tmp[1] in taboos)
True in list(set(res))

0 ('u', 0, array([-1. , -0.1, -1. , -0.1]), False)
1 ('d', 1, array([-1. , -1. , -0.1, -0.1]), False)
2 ('u', 2, array([-1. , -1. , -0.1, -0.1]), False)
3 ('r', 4, array([-1. , -0.1, -0.1, -1. ]), False)
4 ('r', 4, array([-1. , -0.1, -0.1, -1. ]), False)
5 ('u', 0, array([-1. , -0.1, -1. , -0.1]), False)
9 ('l', 9, array([-0.1, -0.1, -1. , -1. ]), False)
10 ('u', 5, array([-0.1, -0.1, -1. , -1. ]), False)
11 ('d', 11, array([-1. , -1. , -0.1, 10. ]), False)
12 ('d', 12, array([10., 10., 10., 10.]), True)
13 ('u', 13, array([-1. , -1. , 10. , -0.1]), False)
14 ('l', 13, array([-1. , -1. , 10. , -0.1]), False)
15 ('d', 20, array([-0.1, -1. , -1. , -0.1]), False)
19 ('l', 19, array([-0.1, -0.1, -1. , -1. ]), False)
20 ('r', 21, array([-1. , -1. , -0.1, -0.1]), False)
21 ('r', 22, array([-1. , -1. , -0.1, -0.1]), False)
22 ('d', 22, array([-1. , -1. , -0.1, -0.1]), False)
23 ('l', 22, array([-1. , -1. , -0.1, -0.1]), False)
24 ('l', 23, array([-1. , -1. , -0.1, -0.1]), False)


False

taboo states are not visited.

In [227]:
def print_Q(Q):
    Q = pd.DataFrame(Q, columns = ['up', 'down', 'left', 'right'])
    print(Q)

def update_Q(Q, current_state, a_index, reward, alpha, gamma):
    """Function to update the actions values in the Q matrix"""
    ## Get s_prime given s and a
    s_prime, reward_prime, terminal = simulate_environment(current_state, action_lookup(a_index))
    a_prime_index = nr.choice(np.where(reward_prime == max(reward_prime))[0], size = 1)[0]
    ## Update the action values 
    Q[current_state,a_index] = Q[current_state,a_index] + alpha * (reward + gamma * (Q[s_prime,a_prime_index] - Q[current_state,a_index]))
    return Q, s_prime, reward_prime, terminal, a_prime_index

def Q_learning_0(policy, episodes, alpha = 0.2, gamma = 0.9):
    """
    Function to perform Q-learning(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    n_actions = len(policy[0].keys())
    
    ## Initialize Q matrix
    Q = np.zeros((n_states,n_actions))
    
    for _ in range(episodes): # Loop over the episodes
        terminal = False
        ## Find the inital state, action index and reward
        current_state, a_index, reward = start_episode(n_states,n_actions)
        
        while(not terminal): # Episode ends where get to terminal state   
            ## Update the action values in Q
            Q, s_prime, reward_prime, terminal, a_prime_index = update_Q(Q, current_state, a_index, reward, alpha, gamma)
            ## Set action, reward and state for next iteration
            a_index = a_prime_index
            current_state = s_prime
            reward = reward_prime[a_prime_index]
    return(Q)

Q = Q_learning_0(initial_policy, 1000)
print_Q(Q)

           up      down       left      right
0    5.208680  8.852522   5.430972   7.542826
1    5.581559  5.251197   8.038281   7.387402
2    5.096938  5.637530   7.513451   7.633368
3    3.863071  5.960940   7.435222   7.825248
4    5.852391  8.364024   7.390184   5.732424
5    7.660039  9.343824   6.316889   6.723548
6    0.000000  0.000000   0.000000   0.000000
7    0.000000  0.000000   0.000000   0.000000
8    0.000000  0.000000   0.000000   0.000000
9    7.788300  9.312319   5.880178   6.262584
10   8.293990  8.631622   6.325854  11.000000
11   9.490425  9.656732   7.214887  11.111111
12   0.000000  0.000000   0.000000   0.000000
13   9.759785  9.620741  11.111111   8.111195
14   8.377784  8.870751  11.621894   4.843872
15   9.873292  8.044817   6.497006   6.112811
16   0.000000  0.000000   0.000000   0.000000
17   0.000000  0.000000   0.000000   0.000000
18   0.000000  0.000000   0.000000   0.000000
19  10.160714  7.788941   6.037405   7.215170
20   9.224196  5.877220   5.115425

In [190]:
def update_policy(policy, Q, epsilon):
    '''Updates the policy based on estiamtes of Q using 
    an epslion greedy algorithm. The action with the highest
    action value is used.'''
    
    ## Find the keys for the actions in the policy
    keys = list(policy[0].keys())
    
    ## Iterate over the states and find the maximm action value.
    for state in range(len(policy)):
        ## First find the index of the max Q values  
        q = Q[state,:]
        max_action_index = np.where(q == max(q))[0]

        ## Find the probabilities for the transitions
        n_transitions = float(len(q))
        n_max_transitions = float(len(max_action_index))
        p_max_transitions = (1.0 - epsilon *(n_transitions - n_max_transitions))/(n_max_transitions)
  
        ## Now assign the probabilities to the policy as epsilon greedy.
        for key in keys:
            if(index_lookup(key) in max_action_index): policy[state][key] = p_max_transitions
            else: policy[state][key] = epsilon
    return(policy)                

update_policy(initial_policy, Q, 0.1) 

{0: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 1: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 2: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 3: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 4: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 5: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 6: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 7: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 8: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 9: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 10: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 11: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 12: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 13: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 14: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 15: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 16: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 17: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 18: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 19: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 20: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0

Examine your results. Are the action values consistent with the transitions?

ANS:  Yes

Next, you need to create a function to compute the state values using the TD(0) algorithm. You should use the function you just created  to find the rewards and next state given a state and action. You are welcome to use the `td_0_state_values` function from the TD/Q-learning notebook as a starting point.  

Execute your function for 1,000 episodes and examine the results.

## Pseudo Code for TD(0)

```
Evaluate_Policy(policy):
  randomly_initialize_non_terminal_states_values()
Loop number_of_episodes:
  let s = start_state()
  # Play episode until the end
  Loop until game_over():
    let a = get_action(policy, s, 0.1) 
                      # get action to perform on state s according 
                      # to the given policy 90% of the time, and a
                      # random action 10% of the time.
    let (s', r) = make_move(s, a) #make move from s using a and get 
                                  #the new state s' and the reward r

     # incrementally compute the average at V(s). Notice that V(s)
     # depends on an estimate of V(s') and not on the return 
     # G as in MC 
     let V(s) = V(s) + alpha * [r + gamma * V(s') - V(s)]
    let s = s'
 End Loop
End Loop
```

In [228]:
def td_0_state_values(policy, n_samps, alpha = 0.2, gamma = 1.0):
    """
    Function for TD(0) policy evalutation
    """
    
    ## Find the starting state
    n_states = len(policy)
    n_actions = len(policy[0].keys())
    current_state = start_episode(n_states, n_actions )[0]
    terminal = False
    ## Array for state values
    v = np.zeros((n_states,1))
    
    for _ in range(n_samps):
        ## Find the next action and reward
        print('{} {}'.format(_, current_state))
        action, s_prime, reward, terminal = take_action(current_state, policy)
        ## Compute the TD error
        
        delta = reward + gamma*v[s_prime] - v[current_state]
        print('reward={}'.format(reward))
        print('delta={}'.format(delta))
            ## Update the state value
        if current_state not in taboos:
            v[current_state] = v[current_state] + alpha*delta
        else:
            v[current_state] = 0
        current_state = s_prime
        if(terminal): ## start new episode when terminal
            current_state = start_episode(n_states, n_actions)[0]
    return(v)

td_0_state_values(initial_policy, 1000).reshape((5,5)) 

0 0
reward=[-1.  -0.1 -1.  -0.1]
delta=[-1.  -0.1 -1.  -0.1]


ValueError: could not broadcast input array from shape (4) into shape (1)

## From IntroductionToTDLearning

In [229]:
import numpy as np
import numpy.random as nr
import pandas as pd



In [230]:
neighbors = {0:{'u':0, 'd':5, 'l':0, 'r':1},
          1:{'u':1, 'd':1, 'l':0, 'r':2},
          2:{'u':2, 'd':2, 'l':1, 'r':3},
          3:{'u':3, 'd':3, 'l':2, 'r':4},
          4:{'u':4, 'd':9, 'l':3, 'r':4},
          5:{'u':0, 'd':10, 'l':5, 'r':5},
          6:{'u':6, 'd':6, 'l':6, 'r':6},
          7:{'u':7, 'd':7, 'l':7, 'r':7},
          8:{'u':8, 'd':8, 'l':8, 'r':8},
          9:{'u':4, 'd':14, 'l':9, 'r':9},
          10:{'u':5, 'd':15, 'l':10, 'r':11},
          11:{'u':11, 'd':11, 'l':10, 'r':12},
          12:{'u':12, 'd':12, 'l':12, 'r':12},
          13:{'u':13, 'd':13, 'l':12, 'r':14},
          14:{'u':9, 'd':19, 'l':13, 'r':14},
          15:{'u':10, 'd':20, 'l':15, 'r':15},
          16:{'u':16, 'd':16, 'l':16, 'r':16},
          17:{'u':17, 'd':17, 'l':17, 'r':17},
          18:{'u':18, 'd':18, 'l':18, 'r':18},
          19:{'u':14, 'd':24, 'l':19, 'r':19},
          20:{'u':15, 'd':20, 'l':20, 'r':21},
          21:{'u':21, 'd':21, 'l':20, 'r':22},
          22:{'u':22, 'd':22, 'l':21, 'r':23},
          23:{'u':23, 'd':23, 'l':22, 'r':24},
          24:{'u':19, 'd':24, 'l':23, 'r':24}}

In [231]:
rewards =  {0:{'u':-1, 'd':-0.1, 'l':-1, 'r':-0.1},
          1:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          2:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          4:{'u':-1, 'd':-0.1, 'l':-0.1, 'r':-1},
          5:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          6:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          7:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          8:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          9:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          10:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-0.1},
          11:{'u':-1, 'd':-1, 'l':-0.1, 'r':10},
          12:{'u':10.0, 'd':10.0, 'l':10.0, 'r':10.0}, # 12:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0}
          13:{'u':-1, 'd':-1, 'l':10, 'r':-0.1},
          14:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1},
          15:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          16:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          17:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          18:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          19:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          20:{'u':-0.1, 'd':-1, 'l':-1, 'r':-0.1},
          21:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          22:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          23:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          24:{'u':-0.1, 'd':-1, 'l':-0.1, 'r':-1}}

In [232]:
taboos = [6, 7, 8, 16, 17, 18]

In [233]:
def simulate_environment(s, action, neighbors = neighbors, rewards = rewards, terminal = 12):
    """
    Function simulates the environment
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward = rewards[s][action]
    return (s_prime, reward, is_terminal(s_prime, terminal))

def is_terminal(state, terminal = 12):
    return state == terminal

#adding a function to take care of the taboos
def is_taboos(state, taboo = taboos):
    return state in taboo

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(simulate_environment(11, a,terminal=12))

(11, -1, False)
(11, -1, False)
(12, 10, True)
(10, -0.1, False)


In [234]:
initial_policy  = {0:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      1:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      2:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      3:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      4:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      5:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      6:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      7:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      8:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      9:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      10:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      11:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      12:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                      13:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      14:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      15:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      16:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      17:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      18:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      19:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      20:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      21:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      22:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      23:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      24:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25}}

In [235]:
def start_episode(n_states):
    '''Function to find a random starting value for the episode
    that is not the terminal state'''
    state = nr.choice(range(n_states))
    while(is_terminal(state) or is_taboos(state)):
         state = nr.choice(range(n_states))
    return state

## test the function to make sure never starting in terminal state
[start_episode(25) for _ in range(10)]

[23, 22, 14, 24, 13, 4, 4, 1, 15, 10]

In [236]:
starts = [i[0] for i in test]
True not in list(set([i in taboos for i in starts]))

True

In [237]:
def take_action(state, policy, actions = {1:'u', 2:'d', 3:'l', 4:'r'}):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    action = actions[nr.choice(range(len(actions)), p = list(policy[state].values())) + 1]
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for several states
for s in range(25):
    print('{} {}'.format(s,take_action(s, initial_policy)))

0 ('u', 0, -1, False)
1 ('l', 0, -0.1, False)
2 ('r', 3, -0.1, False)
3 ('r', 4, -0.1, False)
4 ('u', 4, -1, False)
5 ('u', 0, -0.1, False)
6 ('l', 6, -1, False)
7 ('l', 7, -1, False)
8 ('l', 8, -1, False)
9 ('r', 9, -1, False)
10 ('d', 15, -0.1, False)
11 ('r', 12, 10, True)
12 ('u', 12, 10.0, True)
13 ('l', 12, 10, True)
14 ('l', 13, -0.1, False)
15 ('u', 10, -0.1, False)
16 ('l', 16, -1, False)
17 ('d', 17, -1, False)
18 ('u', 18, -0.1, False)
19 ('d', 24, -0.1, False)
20 ('u', 15, -0.1, False)
21 ('u', 21, -1, False)
22 ('u', 22, -1, False)
23 ('r', 24, -0.1, False)
24 ('r', 24, -1, False)


In [238]:
res = []
for s in range(25):
    if s not in taboos:
        tmp = take_action(s, initial_policy)
        print('{} {}'.format(s, tmp))
        res.append(tmp[1] in taboos)
True in list(set(res))

0 ('r', 1, -0.1, False)
1 ('l', 0, -0.1, False)
2 ('d', 2, -1, False)
3 ('r', 4, -0.1, False)
4 ('u', 4, -1, False)
5 ('u', 0, -0.1, False)
9 ('l', 9, -1, False)
10 ('l', 10, -1, False)
11 ('d', 11, -1, False)
12 ('u', 12, 10.0, True)
13 ('l', 12, 10, True)
14 ('l', 13, -0.1, False)
15 ('u', 10, -0.1, False)
19 ('r', 19, -1, False)
20 ('l', 20, -1, False)
21 ('u', 21, -1, False)
22 ('d', 22, -1, False)
23 ('l', 22, -0.1, False)
24 ('d', 24, -1, False)


False

In [240]:
def take_action(state, policy, actions = {1:'u', 2:'d', 3:'l', 4:'r'}):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    action = actions[nr.choice(range(len(actions)), p = list(policy[state].values())) + 1]
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for several states
for s in range(16):
    print(take_action(s, initial_policy))

('d', 5, -0.1, False)
('d', 1, -1, False)
('d', 2, -1, False)
('d', 3, -1, False)
('d', 9, -0.1, False)
('u', 0, -0.1, False)
('u', 6, -1, False)
('d', 7, -1, False)
('l', 8, -1, False)
('l', 9, -1, False)
('d', 15, -0.1, False)
('r', 12, 10, True)
('l', 12, 10.0, True)
('d', 13, -1, False)
('r', 14, -1, False)
('u', 10, -0.1, False)


Examine your results and answer the following questions to ensure you action value function operates correctly:
1. Are the values of the taboo states 0? ANS:
2. Are the states with the highest values adjacent to the terminal state? ANS: 
3. Are the values of the states decreasing as the distance from the terminal state increases? ANS: 


## SARSA(0) Policy Improvement

Now you will perform policy improvement using the SARSA(0) algorithm.  You are welcome to start with the `select_a_prime` and `SARSA_0` functions from the TD/Q-learning notebooks.    

Execute your code for 1,000 episodes, and with $\alpha = 0.2$, and $\epsilon = 0.1$)

In [239]:
def print_Q(Q):
    Q = pd.DataFrame(Q, columns = ['up', 'down', 'left', 'right'])
    print(Q)

def new_episode(n_states, policy):
    '''This function provides a start for a TD
    episode making sure the first transition is not 
    the termnal state'''
    n_actions = len(policy[0].keys())
    current_state = start_episode(n_states,n_actions )[0]
    ## Find fist action and reward
    action, s_prime, reward, terminal = take_action(current_state, policy)
    return(current_state, action, s_prime, reward, terminal)    


def SARSA_0(policy, n_samps, alpha = 0.1, gamma = 0.9, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    """
    Function for TD(0) policy evalutation
    """
    
    ## Find the starting state
    n_states = len(policy)
    n_actions = len(policy[0].keys())
    current_state, action, s_prime, reward, terminal = new_episode(n_states, policy)
    action_idx = action_index[action]
    
    ## Array for state values
    q = np.zeros((n_states, len(policy[0])))
    
    for _ in range(n_samps):
        ## Find the next action and reward
        action_prime, s_prime_prime, reward_prime, terminal_prime = take_action(s_prime, policy)
        action_idx_prime = action_index[action_prime]
        ## Compute the TD error
        delta = reward + gamma*q[s_prime, action_idx_prime] - q[current_state, action_idx]
        print(delta)
        ## Update the action values
        q[current_state, action_idx] = q[current_state, action_idx] + alpha*delta
        ## Update the state, action and reward for the next time step
        current_state = s_prime
        s_prime = s_prime_prime
        action = action_prime
        reward = reward_prime
        terminal = terminal_prime
        action_idx = action_idx_prime

        ## Check if end of episode
        if(terminal): 
            ## start new episode
            current_state, action, s_prime, reward, terminal = new_episode(n_states, policy)        
    return(q)


Q = SARSA_0(initial_policy, 1000, alpha = 0.2, gamma = 0.1)
print_Q(Q)

TypeError: start_episode() takes 1 positional argument but 2 were given

Examine the action values you have computed. Ensure that the action values are 0 for the goal and taboo states. Also check that the actions with the largest values for each state make sense in terms of reaching the goal. 

With the action value function completed, you will now create and test code to perform GPI with SARSA(0).  You are welcome to use the `SRASA_0_GPI` function from the TD/Q-learning notebook as a starting point. 

Execute your code for 10 cycles of 100 episodes, with $\alpha = 0.2$, $\gamma = 0.9$ and $\epsilon = 0.01$, and examine the results.

Verify that your results make sense? For example, starting at state 2 or 22, do the most probable actions follow a shortest path?

ANS: 

## Apply Double Q-Learning

As a next step, you will apply Double Q-learning(0) to the warehouse navigation problem. In the cell below create and test a function to perform Double Q-Learning for this problem. You are welcome to use the `double_Q_learning` function from the TD/Q-learning notebook as a starting point.

Execute your code for 10 cycles of 500 episodes, with $\alpha = 0.2$, and $\gamma = 0.9$ and examine the results.

Examine the action values you have computed. Ensure that the action values are 0 for the goal and taboo states. Also check that the actions with the largest values for each state make sense in terms of reaching the goal. 

With the action value function completed, you will now create and test code to perform GPI with Double Q-Learning(0).  You are welcome to use the `double_Q_learning_0_GPI` function from the TD/Q-learning notebook as a starting point. 

Execute your code for 10 cycles of 500 episodes, with $\alpha = 0.2$, $\gamma = 0.9$ and $\epsilon = 0.01$, and examine the results. 

Verify that your results make sense? For example, starting at state 2 or 22, do the most probable actions follow a shortest path?

ANS: 

## N-Step TD Learning

Finally, you will apply N-Step TD learning and N-Step SARSA to the warehouse navigation problem.  First create a function to perform N-step TD policy evaluation. You are welcome to start with the `TD_n` policy evaluation function from the TD/Q-Learning notebook. 

Test your function using 1,000 episodes, $n = 4$, $\gamma = 0.9$, and $\alpha = 0.2$.

Verify that the result you obtained appears correct. Are the values of the goal and taboo states all 0? Do the state values decrease with distance from the goal?

Now that you have an estimate of the best values for the number of steps and the learning rate you can compute the action values using multi-step SARSA. In the cell below, create and test a function to compute the action values using N-step SARSA. You are welcome to use the `SRARSA_n` function from the TD/Q-learning notebook as a starting point. 

Test your function by executing 4-step SARSA for 1,000 episodes with $\alpha = 0.2$ and $\gamma = 0.9$ and using the optimum number of steps and learning rate you have determined. 

Verify that the results you have computed appear correct using the aforementioned criteria. 

Finally, create a function to use the GPI algorithm with N-step SARSA in the cell below. You are welcome to start with the `SARSA_n_GPI` function from the TD/Q-learning notebook. 

Execute your function using 4 step SARSA for 5 cycles of 500 episodes, with $\alpha = 0.2$, $\epsilon = 0.1$, and $\gamma = 0.9$.

Examine your results. Verify that the most probable paths to the goal from states 2 and 22 are the shortest possible.  