# Homework 8

## CSCI E-82A


In the a previous homework assignments, you used two different dynamic programming algorithms and Monte Carlo reinforcement learning to solve a robot navigation problem by finding optimal paths to a goal in a simplified warehouse environment. Now you will use time differencing reinforcement learning to find optimal paths in the same environment.

The configuration of the warehouse environment is illustrated in the figure below.

<img src="GridWorldFactory.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **Grid World for Factory Navigation Example** </center>

The goal is for the robot to deliver some material to position (state) 12, shown in blue. Since there is a goal state or **terminal state** this an **episodic task**. 

There are some barriers comprised of the states $\{ 6, 7, 8 \}$ and $\{ 16, 17, 18 \}$, shown with hash marks. In a real warehouse, these positions might be occupied by shelving or equipment. We do not want the robot to hit these barriers. Thus, we say that transitioning to these barrier states is **taboo**.

As before, we do not want the robot to hit the edges of the grid world, which represent the outer walls of the warehouse. 

## Representation

You are, no doubt, familiar with the representation for this problem by now.    

As with many such problems, the starting place is creating the **representation**. In the cell below encode your representation for the possible action-state transitions. From each state there are 4 possible actions:
- up, u
- down, d,
- left, l
- right, r

There are a few special cases you need to consider:
- Any action transitioning state off the grid or into a barrier should keep the state unchanged. 
- Any action in the goal state keeps the state unchanged. 
- Any transition within the taboo (barrier) states can keep the state unchanged. If you experiment, you will see that other encodings work as well since the value of a barrier states are always zero and there are no actions transitioning into these states. 

> **Hint:** It may help you create a pencil and paper sketch of the transitions, rewards, and probabilities or policy. This can help you to keep the bookkeeping correct. 

In [395]:
## import numpy for latter
import numpy as np
import numpy.random as nr
import pandas as pd


## Reference

- [TD in RL](https://towardsdatascience.com/td-in-reinforcement-learning-the-easy-way-f92ecfa9f3ce)

**please note that I spoke to Danka and she helps me on the GPI algo as I was having a index issue**

You need to define the initial transition probabilities for the Markov process. Set the probabilities for each transition as a **uniform distribution** leading to random action by the robot. 

> **Note:** As these are just starting values, the exact values of the transition probabilities are not actually all that important in terms of solving the RL problem. Also, notice that it does not matter how the taboo state transitions are encoded. The point of the DP algorithm is to learn the transition policy.

In [396]:
neighbors = {0:{'u':0, 'd':5, 'l':0, 'r':1},
          1:{'u':1, 'd':1, 'l':0, 'r':2},
          2:{'u':2, 'd':2, 'l':1, 'r':3},
          3:{'u':3, 'd':3, 'l':2, 'r':4},
          4:{'u':4, 'd':9, 'l':3, 'r':4},
          5:{'u':0, 'd':10, 'l':5, 'r':5},
          6:{'u':6, 'd':6, 'l':6, 'r':6},
          7:{'u':7, 'd':7, 'l':7, 'r':7},
          8:{'u':8, 'd':8, 'l':8, 'r':8},
          9:{'u':4, 'd':14, 'l':9, 'r':9},
          10:{'u':5, 'd':15, 'l':10, 'r':11},
          11:{'u':11, 'd':11, 'l':10, 'r':12},
          12:{'u':12, 'd':12, 'l':12, 'r':12},
          13:{'u':13, 'd':13, 'l':12, 'r':14},
          14:{'u':9, 'd':19, 'l':13, 'r':14},
          15:{'u':10, 'd':20, 'l':15, 'r':15},
          16:{'u':16, 'd':16, 'l':16, 'r':16},
          17:{'u':17, 'd':17, 'l':17, 'r':17},
          18:{'u':18, 'd':18, 'l':18, 'r':18},
          19:{'u':14, 'd':24, 'l':19, 'r':19},
          20:{'u':15, 'd':20, 'l':20, 'r':21},
          21:{'u':21, 'd':21, 'l':20, 'r':22},
          22:{'u':22, 'd':22, 'l':21, 'r':23},
          23:{'u':23, 'd':23, 'l':22, 'r':24},
          24:{'u':19, 'd':24, 'l':23, 'r':24}}

The robot receives the following rewards:
- 10 for entering position 0. 
- -1 for attempting to leave the grid. In other words, we penalize the robot for hitting the edges of the grid.  
- -0.1 for all other state transitions, which is the cost for the robot to move from one state to another. If we did not have this penalty, the robot could follow any random plan to the goal which did not hit the edges. 

This **reward structure is unknown to the MC RL agent**. The agent must **learn** the rewards by sampling the environment. 

In the code cell below encode your representation of this reward structure you will use in your simulated environment.  

In [397]:
rewards =  {0:{'u':-1, 'd':-0.1, 'l':-1, 'r':-0.1},
          1:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          2:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          4:{'u':-1, 'd':-0.1, 'l':-0.1, 'r':-1},
          5:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          6:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          7:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          8:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          9:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          10:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-0.1},
          11:{'u':-1, 'd':-1, 'l':-0.1, 'r':10},
          12:{'u':10.0, 'd':10.0, 'l':10.0, 'r':10.0}, # 12:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0}
          13:{'u':-1, 'd':-1, 'l':10, 'r':-0.1},
          14:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1},
          15:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          16:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          17:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          18:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          19:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          20:{'u':-0.1, 'd':-1, 'l':-1, 'r':-0.1},
          21:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          22:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          23:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          24:{'u':-0.1, 'd':-1, 'l':-0.1, 'r':-1}}

You will find it useful to create a list of taboo states, which you can encode in the cell below.

In [398]:
taboos = [6, 7, 8, 16, 17, 18]

## TD(0) Policy Evaluation

With your representations defined, you can now create and test functions to perform TD(0) **policy evaluation**. 

As a first step you will need a function to find the rewards and next state given a state and an action. You are welcome to start with the `state_values` function from the TD/Q-learning notebook. However, keep in mind that you must modify this code to correctly treat the taboo states of the barrier. Specifically, taboo states should not be visited. 

Execute your code to test it for each possible action from state 11.  

In [399]:
def simulate_environment(s, action, neighbors = neighbors, rewards = rewards, terminal = 12):
    """
    Function simulates the environment
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward = rewards[s][action]
    return (s_prime, reward, is_terminal(s_prime, terminal))

def is_terminal(state, terminal = 12):
    return state == terminal

#adding a function to take care of the taboos
def is_taboos(state, taboo = taboos):
    return state in taboo

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(simulate_environment(11, a,terminal=12))
    

(11, -1, False)
(11, -1, False)
(12, 10, True)
(10, -0.1, False)


We have the expected results here from 11 it can only to 12 or 10

In [400]:
initial_policy  = {0:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      1:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      2:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      3:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      4:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      5:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      6:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      7:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      8:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      9:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      10:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      11:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      12:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                      13:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      14:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      15:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      16:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      17:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      18:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      19:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      20:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      21:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      22:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      23:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      24:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25}}

In [401]:
def start_episode(n_states):
    '''Function to find a random starting value for the episode
    that is not the terminal state'''
    state = nr.choice(range(n_states))
    while(is_terminal(state) or is_taboos(state)):
         state = nr.choice(range(n_states))
    return state

## test the function to make sure never starting in terminal state
[start_episode(25) for _ in range(10)]

[9, 24, 1, 10, 24, 0, 11, 23, 5, 23]

In [402]:
starts = [i[0] for i in test]
True not in list(set([i in taboos for i in starts]))

True

In [403]:
def take_action(state, policy, actions = {1:'u', 2:'d', 3:'l', 4:'r'}):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    action = actions[nr.choice(range(len(actions)), p = list(policy[state].values())) + 1]
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for several states
for s in range(25):
    print('{} {}'.format(s,take_action(s, initial_policy)))

0 ('r', 1, -0.1, False)
1 ('d', 1, -1, False)
2 ('d', 2, -1, False)
3 ('d', 3, -1, False)
4 ('d', 9, -0.1, False)
5 ('u', 0, -0.1, False)
6 ('r', 6, -1, False)
7 ('r', 7, -1, False)
8 ('r', 8, -1, False)
9 ('r', 9, -1, False)
10 ('r', 11, -0.1, False)
11 ('d', 11, -1, False)
12 ('r', 12, 10.0, True)
13 ('u', 13, -1, False)
14 ('u', 9, -0.1, False)
15 ('u', 10, -0.1, False)
16 ('u', 16, -1, False)
17 ('l', 17, -1, False)
18 ('u', 18, -0.1, False)
19 ('l', 19, -1, False)
20 ('r', 21, -0.1, False)
21 ('r', 22, -0.1, False)
22 ('d', 22, -1, False)
23 ('l', 22, -0.1, False)
24 ('d', 24, -1, False)


In [404]:
res = []
for s in range(25):
    if s not in taboos:
        tmp = take_action(s, initial_policy)
        print('{} {}'.format(s, tmp))
        res.append(tmp[1] in taboos)
True in list(set(res))

0 ('u', 0, -1, False)
1 ('l', 0, -0.1, False)
2 ('d', 2, -1, False)
3 ('d', 3, -1, False)
4 ('r', 4, -1, False)
5 ('r', 5, -1, False)
9 ('l', 9, -1, False)
10 ('l', 10, -1, False)
11 ('d', 11, -1, False)
12 ('r', 12, 10.0, True)
13 ('u', 13, -1, False)
14 ('l', 13, -0.1, False)
15 ('r', 15, -1, False)
19 ('d', 24, -0.1, False)
20 ('l', 20, -1, False)
21 ('u', 21, -1, False)
22 ('d', 22, -1, False)
23 ('d', 23, -1, False)
24 ('d', 24, -1, False)


False

the algo cannot starts from the terminal state or a taboo state

taboo states are not visited.

Examine your results. Are the action values consistent with the transitions?

ANS:  Yes

Next, you need to create a function to compute the state values using the TD(0) algorithm. You should use the function you just created  to find the rewards and next state given a state and action. You are welcome to use the `td_0_state_values` function from the TD/Q-learning notebook as a starting point.  

Execute your function for 1,000 episodes and examine the results.

In [405]:
def td_0_state_values(policy, n_samps, alpha = 0.2, gamma = 1.0):
    """
    Function for TD(0) policy evalutation
    """
    
    ## Find the starting state
    n_states = len(policy)
    current_state = start_episode(n_states)
    terminal = False
    
    ## Array for state values
    v = np.zeros((n_states,1))
    
    for _ in range(n_samps):
        ## Find the next action and reward
        action, s_prime, reward, terminal = take_action(current_state, policy)
        ## Compute the TD error

        delta = reward + gamma*v[s_prime] - v[current_state]
        ## Update the state value
        v[current_state] = v[current_state] + alpha*delta
        #print(v[current_state])
        current_state = s_prime
        if(terminal): ## start new episode when terminal
            current_state = start_episode(n_states)
    return(v)

output = td_0_state_values(initial_policy, 200000).reshape((5,5))  
print(output)

[[-30.8364 -34.3273 -40.1091 -37.8503 -34.518 ]
 [-23.9818   0.       0.       0.     -32.2474]
 [-18.5754  -3.1457   0.      -9.0533 -31.2816]
 [-26.5844   0.       0.       0.     -35.2863]
 [-33.3377 -35.2329 -37.4605 -37.8578 -37.4012]]


## From IntroductionToTDLearning

Examine your results and answer the following questions to ensure you action value function operates correctly:
1. Are the values of the taboo states 0? ANS:
YES
2. Are the states with the highest values adjacent to the terminal state? ANS: 
YES
3. Are the values of the states decreasing as the distance from the terminal state increases? ANS: 
YES


## SARSA(0) Policy Improvement

Now you will perform policy improvement using the SARSA(0) algorithm.  You are welcome to start with the `select_a_prime` and `SARSA_0` functions from the TD/Q-learning notebooks.    

Execute your code for 1,000 episodes, and with $\alpha = 0.2$, and $\epsilon = 0.1$)

In [406]:
def print_Q(Q):
    Q = pd.DataFrame(Q, columns = ['up', 'down', 'left', 'right'])
    print(Q)

def new_episode(n_states, policy):
    '''This function provides a start for a TD
    episode making sure the first transition is not 
    the termnal state'''
    current_state = start_episode(n_states)
    ## Find fist action and reward
    action, s_prime, reward, terminal = take_action(current_state, policy)
    return(current_state, action, s_prime, reward, terminal)    


def SARSA_0(policy, n_samps, alpha = 0.1, gamma = 0.9, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    """
    Function for TD(0) policy evalutation
    """
    
    ## Find the starting state
    n_states = len(policy)
    current_state, action, s_prime, reward, terminal = new_episode(n_states, policy)
    action_idx = action_index[action]
    
    ## Array for state values
    q = np.zeros((n_states, len(policy[0])))
    
    for _ in range(n_samps):
        ## Find the next action and reward
        action_prime, s_prime_prime, reward_prime, terminal_prime = take_action(s_prime, policy)
        action_idx_prime = action_index[action_prime]
        ## Compute the TD error
        delta = reward + gamma*q[s_prime, action_idx_prime] - q[current_state, action_idx]
        ## Update the action values
        q[current_state, action_idx] = q[current_state, action_idx] + alpha*delta
        ## Update the state, action and reward for the next time step
        current_state = s_prime
        s_prime = s_prime_prime
        action = action_prime
        reward = reward_prime
        terminal = terminal_prime
        action_idx = action_idx_prime

        ## Check if end of episode
        if(terminal): 
            ## start new episode
            current_state, action, s_prime, reward, terminal = new_episode(n_states, policy)        
    return(q)


Q = SARSA_0(initial_policy, 20000, alpha = 0.2, gamma = 0.99)
print_Q(Q)

           up       down       left      right
0  -18.856493 -16.829844 -19.180059 -18.584431
1  -19.447865 -19.257480 -18.736408 -18.677141
2  -19.683658 -19.613434 -18.719325 -18.096243
3  -19.013997 -18.996370 -18.691407 -17.555762
4  -17.888972 -14.677286 -18.352936 -18.608698
5  -17.448085 -14.704784 -16.793949 -16.236492
6    0.000000   0.000000   0.000000   0.000000
7    0.000000   0.000000   0.000000   0.000000
8    0.000000   0.000000   0.000000   0.000000
9  -17.653009 -11.935030 -16.296525 -15.846807
10 -15.492232 -17.253777 -14.721795  -9.872872
11 -11.312768  -9.871202 -14.772483  -2.314001
12   0.000000   0.000000   0.000000   0.000000
13  -9.714979  -8.949564  -1.219648 -14.301063
14 -14.797107 -15.702988  -9.551889 -12.217672
15 -13.333360 -19.861271 -17.762895 -17.171149
16   0.000000   0.000000   0.000000   0.000000
17   0.000000   0.000000   0.000000   0.000000
18   0.000000   0.000000   0.000000   0.000000
19 -11.821810 -17.634935 -16.054488 -16.240086
20 -16.211919

Examine the action values you have computed. Ensure that the action values are 0 for the goal and taboo states. Also check that the actions with the largest values for each state make sense in terms of reaching the goal. 

With the action value function completed, you will now create and test code to perform GPI with SARSA(0).  You are welcome to use the `SRASA_0_GPI` function from the TD/Q-learning notebook as a starting point. 

Execute your code for 10 cycles of 100 episodes, with $\alpha = 0.2$, $\gamma = 0.9$ and $\epsilon = 0.01$, and examine the results.

it seems that there is an issue on the sarasa algo

In [407]:
def update_policy(policy, Q, epsilon, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    '''Updates the policy based on estiamtes of Q using 
    an epslion greedy algorithm. The action with the highest
    action value is used.'''
    
    ## Find the keys for the actions in the policy
    keys = list(policy[0].keys())
    
    ## Iterate over the states and find the maximm action value.
    for state in range(len(policy)):
        ## First find the index of the max Q values  
        q = Q[state,:]
        max_action_index = np.where(q == max(q))[0]

        ## Find the probabilities for the transitions
        n_transitions = float(len(q))
        n_max_transitions = float(len(max_action_index))
        p_max_transitions = (1.0 - epsilon *(n_transitions - n_max_transitions))/(n_max_transitions)
  
        ## Now assign the probabilities to the policy as epsilon greedy.
        for key in keys:
            if(action_index[key] in max_action_index): policy[state][key] = p_max_transitions
            else: policy[state][key] = epsilon
    return(policy)                

update_policy(initial_policy, Q, 0.1)    

{0: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 1: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 2: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 3: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 4: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 5: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 6: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 7: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 8: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 9: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 10: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 11: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 12: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 13: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 14: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 15: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 16: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 17: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 18: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 19: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 20: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0

In [422]:
def SARSA_GPI(policy, n_samples, n_cycles, epsilon = 0.1, n_actions = 4):
    '''Function perfoms GPI using Monte Carlo value estimation.
    Updates to policy are epsilon greedy to prevent the algorithm
    from being trapped at some point.'''
    Q = np.zeros((len(policy), n_actions))
    ## Iterate over the required number of cycles
    for _ in range(n_cycles):
        Q = SARSA_0(policy, n_samples, alpha = 0.2, gamma = 0.99)
        policy = update_policy(policy, Q, epsilon = epsilon)
    return(policy)

improved_policy = SARSA_GPI(initial_policy, 100, 50, epsilon = 0.1)  
for state in range(25):
    print('{} {}'.format(state, improved_policy[state]))

0 {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1}
1 {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
2 {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7}
3 {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
4 {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
5 {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
6 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
7 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
8 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
9 {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1}
10 {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1}
11 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
12 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
13 {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1}
14 {'u': 0.4, 'd': 0.4, 'l': 0.1, 'r': 0.1}
15 {'u': 0.3, 'd': 0.3, 'l': 0.1, 'r': 0.3}
16 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
17 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
18 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
19 {'u': 0.1, 'd': 0.3, 'l': 0.3, 'r': 0.3}
20 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
21 {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}
22

Verify that your results make sense? For example, starting at state 2 or 22, do the most probable actions follow a shortest path?

ANS: yes it seems to make sense

## Apply Double Q-Learning

As a next step, you will apply Double Q-learning(0) to the warehouse navigation problem. In the cell below create and test a function to perform Double Q-Learning for this problem. You are welcome to use the `double_Q_learning` function from the TD/Q-learning notebook as a starting point.

Execute your code for 10 cycles of 500 episodes, with $\alpha = 0.2$, and $\gamma = 0.9$ and examine the results.

In [409]:
import copy

In [410]:
def action_lookup(index):
    """Helper function returns action given an index"""
    action_dic = {0:'u', 1:'d', 2:'l', 3:'r'}
    return action_dic[index]

def index_lookup(action):
    """Helper function returns index given action"""
    index_dic = {'u':0, 'd':1, 'l':2, 'r':3}
    return index_dic[action]


def next_state(state, action_index, neighbors = neighbors, action_lookup = action_lookup):
    return(neighbors[state][action_lookup[action_index]])

def simulate_environment(s, action, neighbors = neighbors, rewards = rewards, terminal = 12):
    """
    Function simulates the environment for Q-learning.
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward_prime = np.array([rewards[s_prime][a] for a in rewards[0].keys()])
    return (s_prime, reward_prime, is_terminal(s_prime, terminal))
    

def is_terminal(state, terminal = 12):
    return state == terminal

#adding a function to take care of the taboos
def is_taboos(state, taboo = taboos):
    return state in taboo

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(simulate_environment(1, a))

(1, array([-1. , -1. , -0.1, -0.1]), False)
(1, array([-1. , -1. , -0.1, -0.1]), False)
(2, array([-1. , -1. , -0.1, -0.1]), False)
(0, array([-1. , -0.1, -1. , -0.1]), False)


In [411]:
def start_episode(n_states, n_actions):
    '''Function to find a random starting values for the episode
    that is not the terminal state'''
    state = nr.choice(range(n_states))
    while(is_terminal(state) or is_taboos(state)):  ## Make sure not starting at the terminal state
         state = nr.choice(range(n_states))
    ## Now find a random starting action index
    a_index = nr.choice(range(4), size = 1)[0]
    s_prime, reward, terminal = simulate_environment(state, action_lookup(a_index))   
    return state, a_index, reward[a_index] ## action_lookup(a_index), reward[a_index]

## test the function to make sure never starting in terminal state
a = [start_episode(15,4) for _ in range(25)]

assert( 12 not in list(set([x[0] for x in a])))

In [412]:
def take_action(state, policy):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    action = action_lookup(nr.choice(range(len(policy[0].keys())), p = list(policy[state].values()))) 
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for several states
for s in range(25):
    print(take_action(s, initial_policy))

('r', 1, array([-1. , -1. , -0.1, -0.1]), False)
('d', 1, array([-1. , -1. , -0.1, -0.1]), False)
('l', 1, array([-1. , -1. , -0.1, -0.1]), False)
('l', 2, array([-1. , -1. , -0.1, -0.1]), False)
('u', 4, array([-1. , -0.1, -0.1, -1. ]), False)
('u', 0, array([-1. , -0.1, -1. , -0.1]), False)
('d', 6, array([-1, -1, -1, -1]), False)
('u', 7, array([-1, -1, -1, -1]), False)
('d', 8, array([-1, -1, -1, -1]), False)
('d', 14, array([-0.1, -0.1, -0.1, -1. ]), False)
('r', 11, array([-1. , -1. , -0.1, 10. ]), False)
('l', 10, array([-0.1, -0.1, -1. , -0.1]), False)
('d', 12, array([10., 10., 10., 10.]), True)
('l', 12, array([10., 10., 10., 10.]), True)
('r', 14, array([-0.1, -0.1, -0.1, -1. ]), False)
('u', 10, array([-0.1, -0.1, -1. , -0.1]), False)
('l', 16, array([-1, -1, -1, -1]), False)
('l', 17, array([-1, -1, -1, -1]), False)
('d', 18, array([-0.1, -0.1, -0.1, -0.1]), False)
('r', 19, array([-0.1, -0.1, -1. , -1. ]), False)
('u', 15, array([-0.1, -0.1, -1. , -1. ]), False)
('l', 20,

In [413]:
def update_double_Q(q1, q2, current_state, a_index, reward, alpha, gamma):
    """Function to update the actions values in the Q matrix"""
    ## Get s_prime given s and a
    s_prime, reward_prime, terminal = simulate_environment(current_state, action_lookup(a_index))
    a_prime_index = nr.choice(np.where(reward_prime == max(reward_prime))[0], size = 1)[0]
    ## Update the action values 
    q1[current_state,a_index] = q1[current_state,a_index] + alpha * (reward + gamma * (q2[s_prime,a_prime_index] - q1[current_state,a_index]))
    return q1, s_prime, reward_prime, terminal, a_prime_index


def double_Q_learning_0(policy, episodes, alpha = 0.2, gamma = 0.9):
    """
    Function to perform Q-learning(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    n_actions = len(policy[0].keys())
    
    ## Initialize both Q matricies
    Q1 = np.zeros((n_states,n_actions))
    Q2 = np.zeros((n_states,n_actions))
    
    for _ in range(episodes): # Loop over the episodes
        terminal = False
        ## Find the inital state, action index and reward
        current_state, a_index, reward = start_episode(n_states,n_actions)
        
        while(not terminal): # Episode ends where get to terminal state   
            ## Update the action values in Q1 or Q2 based on random choice
            if(nr.uniform() <= 0.5):
                Q1, s_prime, reward_prime, terminal, a_prime_index = update_double_Q(Q1, Q2, current_state, a_index, reward, alpha, gamma)
            else:
                Q2, s_prime, reward_prime, terminal, a_prime_index = update_double_Q(Q2, Q1, current_state, a_index, reward, alpha, gamma)
            ## Set action, reward and state for next iteration
            a_index = a_prime_index
            current_state = s_prime
            reward = reward_prime[a_prime_index]
    return(Q1)

Q = double_Q_learning_0(initial_policy, 1000)
print_Q(Q)

          up      down       left      right
0   3.647743  7.780917   3.721580   6.612731
1   3.616720  2.314130   8.040339   7.000128
2   2.873600  4.046767   6.687332   6.308640
3   1.791154  2.947040   7.025245   7.352395
4   4.439645  7.070568   6.315840   0.075755
5   7.646268  9.220973   4.646529   3.671837
6   0.000000  0.000000   0.000000   0.000000
7   0.000000  0.000000   0.000000   0.000000
8   0.000000  0.000000   0.000000   0.000000
9   7.215010  8.836026   3.019240   4.438518
10  7.958803  8.038468   4.469527  11.197680
11  6.292600  5.478753   6.997341  11.111111
12  0.000000  0.000000   0.000000   0.000000
13  4.486317  7.914651  11.111111   3.681902
14  7.549500  7.334963  11.000121   6.176568
15  8.696996  6.863987   2.795281   2.374631
16  0.000000  0.000000   0.000000   0.000000
17  0.000000  0.000000   0.000000   0.000000
18  0.000000  0.000000   0.000000   0.000000
19  8.920531  7.014369   5.567586   3.194307
20  7.781656  2.509616   3.327837   6.896219
21  2.0746

In [414]:
Q_df = pd.DataFrame(Q, columns = ['up', 'down', 'left', 'right'])
Q_df.idxmax('columns')

0      down
1      left
2      left
3     right
4      down
5      down
6        up
7        up
8        up
9      down
10    right
11    right
12       up
13     left
14     left
15       up
16       up
17       up
18       up
19       up
20       up
21     left
22    right
23    right
24       up
dtype: object

Examine the action values you have computed. Ensure that the action values are 0 for the goal and taboo states. Also check that the actions with the largest values for each state make sense in terms of reaching the goal. 

Actual values on terminal states make senses

With the action value function completed, you will now create and test code to perform GPI with Double Q-Learning(0).  You are welcome to use the `double_Q_learning_0_GPI` function from the TD/Q-learning notebook as a starting point. 

Execute your code for 10 cycles of 500 episodes, with $\alpha = 0.2$, $\gamma = 0.9$ and $\epsilon = 0.01$, and examine the results. 

In [415]:
import copy

def double_Q_learning_0_GPI(policy, neighbors, reward, cycles, episodes, goal, 
                            alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with double Q learning
        Q = double_Q_learning_0(policy, episodes, alpha, gamma)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[s,:] == max(Q[s,:]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            actions_count = len(current_policy[s])
            max_len = len(max_index)
            prob_for_policy = 1.0/float(max_len)
            prob_for_policy -= epsilon * (float(actions_count) - float(max_len))
            if(actions_count > max_len):
                remainder = epsilon
            else:
                remainder = 0.25            
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
 

Double_Q_0_Policy = double_Q_learning_0_GPI(initial_policy, neighbors, rewards, cycles=10, episodes=500, goal = 0,
                                            alpha = 0.2, epsilon = 0.01)
Double_Q_0_Policy 

{0: {'u': 0.01, 'd': 0.97, 'l': 0.01, 'r': 0.01},
 1: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 2: {'u': 0.01, 'd': 0.01, 'l': 0.01, 'r': 0.97},
 3: {'u': 0.01, 'd': 0.01, 'l': 0.01, 'r': 0.97},
 4: {'u': 0.01, 'd': 0.97, 'l': 0.01, 'r': 0.01},
 5: {'u': 0.01, 'd': 0.97, 'l': 0.01, 'r': 0.01},
 6: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 7: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 8: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 9: {'u': 0.01, 'd': 0.97, 'l': 0.01, 'r': 0.01},
 10: {'u': 0.01, 'd': 0.01, 'l': 0.01, 'r': 0.97},
 11: {'u': 0.01, 'd': 0.01, 'l': 0.01, 'r': 0.97},
 12: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 13: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 14: {'u': 0.01, 'd': 0.01, 'l': 0.97, 'r': 0.01},
 15: {'u': 0.97, 'd': 0.01, 'l': 0.01, 'r': 0.01},
 16: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 17: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 18: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 19: {'u': 0.97, 'd': 0.01, 'l': 0.01, 'r

Verify that your results make sense? For example, starting at state 2 or 22, do the most probable actions follow a shortest path?

ANS: yes it makes senses

## N-Step TD Learning

Finally, you will apply N-Step TD learning and N-Step SARSA to the warehouse navigation problem.  First create a function to perform N-step TD policy evaluation. You are welcome to start with the `TD_n` policy evaluation function from the TD/Q-Learning notebook. 

Test your function using 1,000 episodes, $n = 4$, $\gamma = 0.9$, and $\alpha = 0.2$.

In [416]:
def start_episode(n_states):
    '''Function to find a random starting value for the episode
    that is not the terminal state'''
    state = nr.choice(range(n_states))
    while(is_terminal(state) or is_taboos(state)):
         state = nr.choice(range(n_states))
    return state

In [417]:
def simulate_environment(s, action, neighbors = neighbors, rewards = rewards, terminal = 12):
    """
    Function simulates the environment
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward = rewards[s][action]
    return (s_prime, reward, is_terminal(s_prime, terminal))

def is_terminal(state, terminal = 12):
    return state == terminal

#adding a function to take care of the taboos
def is_taboos(state, taboo = taboos):
    return state in taboo

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(simulate_environment(11, a,terminal=12))
    

(11, -1, False)
(11, -1, False)
(12, 10, True)
(10, -0.1, False)


In [418]:
def take_action(state, policy, actions = {1:'u', 2:'d', 3:'l', 4:'r'}):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    action = actions[nr.choice(range(len(actions)), p = list(policy[state].values())) + 1]
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for several states
for s in range(25):
    print('{} {}'.format(s,take_action(s, initial_policy)))

0 ('r', 1, -0.1, False)
1 ('r', 2, -0.1, False)
2 ('r', 3, -0.1, False)
3 ('u', 3, -1, False)
4 ('u', 4, -1, False)
5 ('u', 0, -0.1, False)
6 ('l', 6, -1, False)
7 ('l', 7, -1, False)
8 ('u', 8, -1, False)
9 ('u', 4, -0.1, False)
10 ('r', 11, -0.1, False)
11 ('l', 10, -0.1, False)
12 ('d', 12, 10.0, True)
13 ('d', 13, -1, False)
14 ('u', 9, -0.1, False)
15 ('u', 10, -0.1, False)
16 ('d', 16, -1, False)
17 ('d', 17, -1, False)
18 ('u', 18, -0.1, False)
19 ('r', 19, -1, False)
20 ('u', 15, -0.1, False)
21 ('d', 21, -1, False)
22 ('d', 22, -1, False)
23 ('r', 24, -0.1, False)
24 ('l', 23, -0.1, False)


In [419]:
def TD_n(policy, episodes, n, alpha = 0.2, gamma = 0.9, epsilon = 0.1, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    """
    Function to perform TD(N) policy evaluation.
    """
    ## Initialize the state list and action values
#    action_index = list(range(len(list(policy[0].keys()))))
    states = list(policy.keys())
    n_states = len(states)
    n_actions = len(policy[0].keys())
    v = np.zeros((n_states))
    
    for _ in range(episodes):
        ## Initialize variables
        T = float("inf")
        tau = 0
        t = 0
        rewards = []
       
        ## Get the random initial state
        current_state = start_episode(n_states)   
        state = [current_state]
        ## Initial action
        action, s_prime, reward, terminal = take_action(current_state, policy)
        state.append(s_prime)
        
        while(not (tau == T - 1)):
            if(t < T):
                ## Append the reward to the list
                rewards.append(reward)             
                if(terminal): 
                    ## update T if at terminal state
                    T = t + 1
                else: 
                    ## Get the next action state and rewards
                    action_prime, s_prime_prime, reward_prime, terminal_prime = take_action(current_state, policy)
                    state.append(s_prime_prime)
                      
            ## Update tau
            tau = t - n + 1
            
            if(tau > 0):
                G = 0.0
                for i in range(tau + 1, min(tau + n, T)):
                    exponent = i + tau - 1
                    G = G + gamma**exponent * rewards[i]
                if(tau + n < T): G += gamma**n * v[state[tau + n]] 
                v[state[tau]] = v[state[tau]] + alpha * (G - v[state[tau]])    
                
            
            ## Update variables for the next step
            t += 1
            current_state = s_prime
            if(not terminal):
                action = action_prime
                s_prime = s_prime_prime
                reward = reward_prime
                terminal = terminal_prime
    return(v)     

output = TD_n(initial_policy, 1000, 4).reshape((5,5))
np.set_printoptions(precision=4)
print(output)

[[-0.0195 -0.0707 -0.0912 -0.0339 -0.0141]
 [ 0.0284  0.      0.      0.      0.0186]
 [ 0.8324  1.3761  0.      0.1847  0.0015]
 [ 0.4333  0.      0.      0.     -0.0327]
 [ 0.1109 -0.0128 -0.0092 -0.0055 -0.0204]]


Verify that the result you obtained appears correct. Are the values of the goal and taboo states all 0? Do the state values decrease with distance from the goal?

Yes they seems correct

Now that you have an estimate of the best values for the number of steps and the learning rate you can compute the action values using multi-step SARSA. In the cell below, create and test a function to compute the action values using N-step SARSA. You are welcome to use the `SRARSA_n` function from the TD/Q-learning notebook as a starting point. 

Test your function by executing 4-step SARSA for 1,000 episodes with $\alpha = 0.2$ and $\gamma = 0.9$ and using the optimum number of steps and learning rate you have determined. 

In [420]:
def SARSA_n(policy, episodes, n, alpha = 0.2, gamma = 0.9, epsilon = 0.1, action_index = {'u':0, 'd':1, 'l':2, 'r':3}):
    """
    Function to perform TD(N) policy evaluation.
    """
    ## Initialize the state list and action values
#    action_index = list(range(len(list(policy[0].keys()))))
    states = list(policy.keys())
    n_states = len(states)
    n_actions = len(policy[0].keys())
    q = np.zeros((n_states, n_actions))
    
    for _ in range(episodes):
        ## Initialize variables
        T = float("inf")
        tau = 0
        t = 0
        rewards = []
       
        ## Get the random initial state
        current_state = start_episode(n_states)   
        state = [current_state]
        ## Initial action
        action, s_prime, reward, terminal = take_action(current_state, policy)
        state.append(s_prime)
        
        while(not (tau == T - 1)):
            if(t < T):
                ## Append the reward to the list
                rewards.append(reward)             
                if(terminal): 
                    ## update T if at terminal state
                    T = t + 1
                else: 
                    ## Get the next action state and rewards
                    action_prime, s_prime_prime, reward_prime, terminal_prime = take_action(current_state, policy)
                    state.append(s_prime_prime)
                      
            ## Update tau
            tau = t - n + 1
            
            if(tau > 0):
                G = 0.0
                for i in range(tau + 1, min(tau + n, T)):
                    exponent = i + tau - 1
                    G = G + gamma**exponent * rewards[i]
                if(tau + n < T): G += gamma**n * q[state[tau + n], action_index[action_prime]] 
                q[state[tau], action_index[action]] = q[state[tau], action_index[action]] + alpha * (G - q[state[tau], action_index[action]])    
                
            
            ## Update variables for the next step
            t += 1
            current_state = s_prime
            if(not terminal):
                action = action_prime
                s_prime = s_prime_prime
                reward = reward_prime
                terminal = terminal_prime
    return(q)              
            
Q = SARSA_n(initial_policy, 1000, 4)
print_Q(Q)

          up      down      left     right
0  -0.134792 -0.095305 -0.089533 -0.078448
1  -0.116082 -0.498497 -0.094211 -0.197105
2  -0.367159 -0.196308 -0.108810 -0.116112
3  -0.046306 -0.054957 -0.088467 -0.048756
4  -0.038729 -0.032936 -0.044112 -0.037341
5  -0.349516 -0.047144 -0.114560 -0.071097
6   0.000000  0.000000  0.000000  0.000000
7   0.000000  0.000000  0.000000  0.000000
8   0.000000  0.000000  0.000000  0.000000
9  -0.028553 -0.038992 -0.047846 -0.041077
10 -0.034873 -0.037511 -0.086519  1.119001
11 -0.558291 -0.139025 -0.256826  0.404793
12  0.000000  0.000000  0.000000  0.000000
13 -0.041624 -0.054438  0.125726 -0.099365
14 -0.029839 -0.046961 -0.016068 -0.039273
15 -0.023386 -0.124809 -0.014423  0.252861
16  0.000000  0.000000  0.000000  0.000000
17  0.000000  0.000000  0.000000  0.000000
18  0.000000  0.000000  0.000000  0.000000
19 -0.022965 -0.103897 -0.028593 -0.058018
20 -0.009831 -0.890913 -0.939223 -0.140337
21 -0.024220 -0.106176 -0.057902 -0.132241
22 -0.22361

Verify that the results you have computed appear correct using the aforementioned criteria. 

THe actions seems ok 0 on the taboo and goal nodes

Finally, create a function to use the GPI algorithm with N-step SARSA in the cell below. You are welcome to start with the `SARSA_n_GPI` function from the TD/Q-learning notebook. 

Execute your function using 4 step SARSA for 5 cycles of 500 episodes, with $\alpha = 0.2$, $\epsilon = 0.1$, and $\gamma = 0.9$.

In [421]:
def SARSA_n_GPI(policy, n, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = SARSA_n(policy, episodes, n, alpha = alpha, gamma = gamma, epsilon = epsilon)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[s,:] == max(Q[s,:]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            actions_count = len(current_policy[s])
            max_len = len(max_index)
            prob_for_policy = 1.0/float(max_len)
            prob_for_policy -= epsilon * (float(actions_count) - float(max_len))
            if(actions_count > max_len):
                remainder = epsilon
            else:
                remainder = 0.25
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
            
    return(current_policy)           
 

SARSA_N_Policy = SARSA_n_GPI(initial_policy, n = 4, cycles = 5, episodes = 1000, goal = 0, alpha = 0.2, epsilon = 0.1)
SARSA_N_Policy

{0: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 1: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 2: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 3: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 4: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 5: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 6: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 7: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 8: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 9: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 10: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 11: {'u': 0.1, 'd': 0.1, 'l': 0.1, 'r': 0.7},
 12: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 13: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 14: {'u': 0.1, 'd': 0.7, 'l': 0.1, 'r': 0.1},
 15: {'u': 0.1, 'd': 0.1, 'l': 0.7, 'r': 0.1},
 16: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 17: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 18: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25},
 19: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0.1},
 20: {'u': 0.7, 'd': 0.1, 'l': 0.1, 'r': 0

Examine your results. Verify that the most probable paths to the goal from states 2 and 22 are the shortest possible.  

there is an issue with the algo for example from state 2 you can go to step 7, a forbidden state