# Homework 8

## CSCI E-82A


In the previous homework assignments, you used two different dynamic programming algorithms and Monte Carlo reinforcement learning to solve a robot navigation problem by finding optimal paths to a goal in a simplified warehouse environment. Now you will use temporal difference (TD) reinforcement learning to find optimal paths in the same environment.

The configuration of the warehouse environment is illustrated in the figure below.

<img src="GridWorldFactory.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **Grid World for Factory Navigation Example** </center>

The goal is for the robot to deliver some material to position (state) 12, shown in blue. Since there is a goal state or **terminal state**, this is an **episodic task**. 

There are some barriers comprised of the states $\{ 6, 7, 8 \}$ and $\{ 16, 17, 18 \}$, shown with hash marks. In a real warehouse, these positions might be occupied by shelving or equipment. We do not want the robot to hit these barriers. Thus, we say that transitioning to these barrier states is **taboo**.

As before, we do not want the robot to hit the edges of the grid world, which represent the outer walls of the warehouse. 

## Representation

You are, no doubt, familiar with the representation for this problem by now.    

As with many such problems, the starting place is creating the **representation**. In the cell below encode your representation for the possible action-state transitions. From each state there are 4 possible actions:
- up, u
- down, d
- left, l
- right, r

There are a few special cases you need to consider:
- Any action that would move the robot off the grid or into a barrier leaves the state unchanged. 
- Any action in the goal state leaves the state unchanged. 
- Any transition within the taboo (barrier) states can leave the state unchanged. If you experiment, you will see that other encodings work as well, since the values of the barrier states are always zero and no actions transition into these states. 

> **Hint:** It may help to create a pencil and paper sketch of the transitions, rewards, and probabilities or policy. This can help you keep the bookkeeping correct. 
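The special cases listed above can also be generated programmatically, which gives a way to cross-check a hand-coded table. The following is a minimal sketch, assuming a 5 x 5 grid numbered row by row from 0, with the goal at state 12 and barriers at states 6, 7, 8, 16, 17 and 18. The name `build_neighbors` and its parameters are hypothetical and are not taken from the course notebooks.

```python
def build_neighbors(n_rows=5, n_cols=5, goal=12, taboos=(6, 7, 8, 16, 17, 18)):
    """Build {state: {action: next_state}} for a row-major grid (hypothetical helper)."""
    moves = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}
    table = {}
    for s in range(n_rows * n_cols):
        row, col = divmod(s, n_cols)
        table[s] = {}
        for a, (d_row, d_col) in moves.items():
            new_row, new_col = row + d_row, col + d_col
            s_prime = new_row * n_cols + new_col
            off_grid = not (0 <= new_row < n_rows and 0 <= new_col < n_cols)
            # Goal and barrier states, and any blocked move, leave the state unchanged
            if s == goal or s in taboos or off_grid or s_prime in taboos:
                s_prime = s
            table[s][a] = s_prime
    return table

generated = build_neighbors()
print(generated[11])  # {'u': 11, 'd': 11, 'l': 10, 'r': 12}
```

Comparing the generated dictionary with your own encoding, state by state, is one way to catch bookkeeping slips.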

In [5]:
## import numpy for later
import numpy as np
import numpy.random as nr



You need to define the initial transition probabilities for the Markov process. Set the probabilities for each transition as a **uniform distribution** leading to random action by the robot. 

> **Note:** As these are just starting values, the exact values of the transition probabilities are not all that important for solving the RL problem. Also, notice that it does not matter how the taboo state transitions are encoded. The point of the RL algorithm is to learn the policy.

In [9]:
neighbors = {0:{'u':0, 'd':5, 'l':0, 'r':1},
          1:{'u':1, 'd':1, 'l':0, 'r':2},
          2:{'u':2, 'd':2, 'l':1, 'r':3},
          3:{'u':3, 'd':3, 'l':2, 'r':4},
          4:{'u':4, 'd':9, 'l':3, 'r':4},
          5:{'u':0, 'd':10, 'l':5, 'r':5},
          6:{'u':6, 'd':6, 'l':6, 'r':6},
          7:{'u':7, 'd':7, 'l':7, 'r':7},
          8:{'u':8, 'd':8, 'l':8, 'r':8},
          9:{'u':4, 'd':14, 'l':9, 'r':9},
          10:{'u':5, 'd':15, 'l':10, 'r':11},
          11:{'u':11, 'd':11, 'l':10, 'r':12},
          12:{'u':12, 'd':12, 'l':12, 'r':12},
          13:{'u':13, 'd':13, 'l':12, 'r':14},
          14:{'u':9, 'd':19, 'l':13, 'r':14},
          15:{'u':10, 'd':20, 'l':15, 'r':15},
          16:{'u':16, 'd':16, 'l':16, 'r':16},
          17:{'u':17, 'd':17, 'l':17, 'r':17},
          18:{'u':18, 'd':18, 'l':18, 'r':18},
          19:{'u':14, 'd':24, 'l':19, 'r':19},
          20:{'u':15, 'd':20, 'l':20, 'r':21},
          21:{'u':21, 'd':21, 'l':20, 'r':22},
          22:{'u':22, 'd':22, 'l':21, 'r':23},
          23:{'u':23, 'd':23, 'l':22, 'r':24},
          24:{'u':19, 'd':24, 'l':23, 'r':24}}

The robot receives the following rewards:
- 10 for entering position 12, the goal. 
- -1 for attempting to leave the grid or to enter a barrier. In other words, we penalize the robot for hitting the edges of the grid or the barriers.  
- -0.1 for all other state transitions, which is the cost for the robot to move from one state to another. If we did not have this penalty, the robot could follow any random plan to the goal which did not hit the edges. 

This **reward structure is unknown to the TD RL agent**. The agent must **learn** the rewards by sampling the environment. 
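The reward rules can be written as a small function of the current and next state. This is only a sketch, assuming the goal is state 12 as in the figure and that a blocked move is recognizable because the state does not change. The name `reward_for` is hypothetical. Rewards assigned while sitting in the goal or a barrier state are arbitrary here, since the agent never acts from those states.

```python
def reward_for(s, s_prime, goal=12):
    """Reward for one transition under the rules listed above (hypothetical helper)."""
    if s_prime == goal and s != goal:
        return 10.0    # entered the goal
    if s_prime == s:
        return -1.0    # blocked by a wall or barrier, so the state did not change
    return -0.1        # ordinary move

print(reward_for(11, 12))  # 10.0
print(reward_for(1, 1))    # -1.0
print(reward_for(0, 5))    # -0.1
```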

In the code cell below encode your representation of this reward structure you will use in your simulated environment.  

In [6]:
rewards =  {0:{'u':-1, 'd':-0.1, 'l':-1, 'r':-0.1},
          1:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          2:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          4:{'u':-1, 'd':-0.1, 'l':-0.1, 'r':-1},
          5:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          6:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          7:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          8:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          9:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          10:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-0.1},
          11:{'u':-1, 'd':-1, 'l':-0.1, 'r':10},
          12:{'u':10.0, 'd':10.0, 'l':10.0, 'r':10.0}, # 12:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0}
          13:{'u':-1, 'd':-1, 'l':10, 'r':-0.1},
          14:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1},
          15:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          16:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          17:{'u':-1, 'd':-1, 'l':-1, 'r':-1},
          18:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          19:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          20:{'u':-0.1, 'd':-1, 'l':-1, 'r':-0.1},
          21:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          22:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          23:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          24:{'u':-0.1, 'd':-1, 'l':-0.1, 'r':-1}}

You will find it useful to create a list of taboo states, which you can encode in the cell below.

In [7]:
taboos = [6, 7, 8, 16, 17, 18]

## TD(0) Policy Evaluation

With your representations defined, you can now create and test functions to perform TD(0) **policy evaluation**. 

As a first step you will need a function to find the rewards and next state given a state and an action. You are welcome to start with the `state_values` function from the TD/Q-learning notebook. However, keep in mind that you must modify this code to correctly treat the taboo states of the barrier. Specifically, taboo states should not be visited. 

Execute your code to test it for each possible action from state 11.  

In [10]:
def simulate_environment(s, action, neighbors = neighbors, rewards = rewards, terminal = 12):
    """
    Function simulates the environment
    returns s_prime, reward and a terminal flag given s and action
    """
    s_prime = neighbors[s][action]
    reward = rewards[s][action]
    return (s_prime, reward, is_terminal(s_prime, terminal))

def is_terminal(state, terminal = 12):
    """Returns True if state is the goal (terminal) state"""
    return state == terminal

## Test the function for each action from state 11
for a in ['u', 'd', 'r', 'l']:
    print(simulate_environment(11, a))

(11, -1, False)
(11, -1, False)
(12, 10, True)
(10, -0.1, False)


Examine your results. Are the next states and rewards consistent with the transitions you encoded?

ANS: 

Next, you need to create a function to compute the state values using the TD(0) algorithm. You should use the function you just created  to find the rewards and next state given a state and action. You are welcome to use the `td_0_state_values` function from the TD/Q-learning notebook as a starting point.  
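As a reminder of what a single TD(0) step does, the sketch below applies one update to a table of state values. It assumes the usual TD(0) rule, in which the value of the current state moves a fraction $\alpha$ toward the target $r + \gamma v(s')$. The name `td0_update` is hypothetical and is not the `td_0_state_values` function from the notebook.

```python
import numpy as np

def td0_update(v, s, reward, s_prime, alpha=0.2, gamma=1.0):
    """Apply one TD(0) update to v in place and return the TD error."""
    delta = reward + gamma * v[s_prime] - v[s]
    v[s] += alpha * delta
    return delta

v = np.zeros(25)
# One step from state 11 into the goal (state 12) with reward 10
delta = td0_update(v, s=11, reward=10.0, s_prime=12)
print(delta, v[11])  # 10.0 2.0
```

Because the goal state is never updated, its value stays at 0 and the target for the final step of an episode is just the reward.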

Execute your function for 1,000 episodes and examine the results.

In [None]:
initial_policy  = {0:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      1:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      2:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      3:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      4:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      5:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      6:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      7:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      8:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      9:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      10:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      11:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      12:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0},
                      13:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      14:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      15:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      16:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      17:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      18:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      19:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      20:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      21:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      22:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      23:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25},
                      24:{'u':0.25, 'd':0.25, 'l': 0.25, 'r':0.25}}

In [None]:
def start_episode(n_states):
    '''Function to find a random starting value for the episode
    that is neither the terminal state nor a taboo (barrier) state'''
    state = nr.choice(range(n_states))
    while(is_terminal(state) or state in taboos):
        state = nr.choice(range(n_states))
    return state

## test the function to make sure it never starts in a terminal or taboo state
[start_episode(25) for _ in range(10)]

In [None]:
def take_action(state, policy, actions = {1:'u', 2:'d', 3:'l', 4:'r'}):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transition probabilities defined by the policy.
    action = actions[nr.choice(range(len(actions)), p = list(policy[state].values())) + 1]
    s_prime, reward, terminal = simulate_environment(state, action)
    return (action, s_prime, reward, terminal)

## Test function for every state except the goal (12), where all action probabilities are 0
for s in range(25):
    if s == 12:
        continue
    print(take_action(s, initial_policy))

In [None]:
def td_0_state_values(policy, n_samps, alpha = 0.2, gamma = 1.0):
    """
    Function for TD(0) policy evaluation
    """
    
    ## Find the starting state
    n_states = len(policy)
    current_state = start_episode(n_states)
    terminal = False
    
    ## Array for state values
    v = np.zeros((n_states,1))
    
    for _ in range(n_samps):
        ## Find the next action and reward
        action, s_prime, reward, terminal = take_action(current_state, policy)
        ## Compute the TD error
        delta = reward + gamma*v[s_prime] - v[current_state]
        ## Update the state value
        v[current_state] = v[current_state] + alpha*delta
        current_state = s_prime
        if(terminal): ## start new episode when terminal
            current_state = start_episode(n_states)
    return(v)

td_0_state_values(initial_policy, 2000).reshape((5,5))   

Examine your results and answer the following questions to ensure your state value function operates correctly:
1. Are the values of the taboo states 0? ANS:
2. Are the states with the highest values adjacent to the terminal state? ANS: 
3. Are the values of the states decreasing as the distance from the terminal state increases? ANS: 


## SARSA(0) Policy Improvement

Now you will perform policy improvement using the SARSA(0) algorithm.  You are welcome to start with the `select_a_prime` and `SARSA_0` functions from the TD/Q-learning notebooks.    

Execute your code for 1,000 episodes, with $\alpha = 0.2$ and $\epsilon = 0.1$.

Examine the action values you have computed. Ensure that the action values are 0 for the goal and taboo states. Also check that the actions with the largest values for each state make sense in terms of reaching the goal. 
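For orientation, the core of SARSA(0) is an $\epsilon$-greedy action choice followed by an update whose target uses the action actually selected in the next state. The sketch below shows one such step under those assumptions. The names `epsilon_greedy` and `sarsa_update`, and the array layout of `q` (25 states by 4 actions), are hypothetical and may differ from the `select_a_prime` and `SARSA_0` functions in the notebook.

```python
import numpy as np
import numpy.random as nr

ACTIONS = ['u', 'd', 'l', 'r']

def epsilon_greedy(q, s, epsilon=0.1):
    """Return an action index: random with probability epsilon, else greedy."""
    if nr.uniform() < epsilon:
        return int(nr.choice(len(ACTIONS)))
    return int(np.argmax(q[s]))

def sarsa_update(q, s, a, reward, s_prime, a_prime, alpha=0.2, gamma=0.9):
    """One SARSA(0) update: the target uses the action actually chosen in s_prime."""
    target = reward + gamma * q[s_prime, a_prime]
    q[s, a] += alpha * (target - q[s, a])

q = np.zeros((25, len(ACTIONS)))
# Moving right ('r', index 3) from state 11 enters the goal with reward 10
a_prime = epsilon_greedy(q, 12)
sarsa_update(q, s=11, a=3, reward=10.0, s_prime=12, a_prime=a_prime)
print(q[11, 3])  # 2.0
```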

With the action value function completed, you will now create and test code to perform GPI with SARSA(0).  You are welcome to use the `SARSA_0_GPI` function from the TD/Q-learning notebook as a starting point. 

Execute your code for 10 cycles of 100 episodes, with $\alpha = 0.2$, $\gamma = 0.9$ and $\epsilon = 0.01$, and examine the results.
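The policy improvement half of a GPI cycle turns the current action values into a new $\epsilon$-greedy policy. A minimal sketch is shown below, assuming action values are stored in a NumPy array with one row per state and that ties between best actions split the greedy probability evenly. The name `improve_policy` is hypothetical. Unlike the initial policy above, this sketch also assigns probabilities in the goal state, which is harmless because no action is ever taken there.

```python
import numpy as np

def improve_policy(q, epsilon=0.01, actions=('u', 'd', 'l', 'r')):
    """Return an epsilon-greedy policy {state: {action: probability}} from q."""
    n_states, n_actions = q.shape
    policy = {}
    for s in range(n_states):
        best = np.flatnonzero(q[s] == q[s].max())   # ties share the greedy mass
        probs = np.full(n_actions, epsilon / n_actions)
        probs[best] += (1.0 - epsilon) / len(best)
        policy[s] = dict(zip(actions, probs.tolist()))
    return policy

q = np.zeros((25, 4))
q[11, 3] = 2.0                      # pretend 'r' looks best in state 11
policy = improve_policy(q)
print(round(policy[11]['r'], 4))    # 0.9925
```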

Verify that your results make sense. For example, starting at state 2 or 22, do the most probable actions follow a shortest path?

ANS: 

## Apply Double Q-Learning

As a next step, you will apply Double Q-learning(0) to the warehouse navigation problem. In the cell below create and test a function to perform Double Q-Learning for this problem. You are welcome to use the `double_Q_learning` function from the TD/Q-learning notebook as a starting point.

Execute your code for 10 cycles of 500 episodes, with $\alpha = 0.2$, and $\gamma = 0.9$ and examine the results.
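The distinguishing step of Double Q-learning is that one table selects the best next action while the other table evaluates it. The sketch below shows a single update under that assumption, with the table to update chosen by a fair coin flip. The name `double_q_update` and the array layout are hypothetical and are not the notebook's `double_Q_learning` function.

```python
import numpy as np
import numpy.random as nr

def double_q_update(q1, q2, s, a, reward, s_prime, alpha=0.2, gamma=0.9):
    """One Double Q-learning update on a randomly chosen table."""
    if nr.uniform() < 0.5:
        q_sel, q_eval = q1, q2
    else:
        q_sel, q_eval = q2, q1
    a_star = int(np.argmax(q_sel[s_prime]))            # select with one table
    target = reward + gamma * q_eval[s_prime, a_star]  # evaluate with the other
    q_sel[s, a] += alpha * (target - q_sel[s, a])

q1 = np.zeros((25, 4))
q2 = np.zeros((25, 4))
# Moving right (index 3) from state 11 into the goal; only one table is updated
double_q_update(q1, q2, s=11, a=3, reward=10.0, s_prime=12)
print(q1[11, 3] + q2[11, 3])  # 2.0
```

When choosing actions during learning, a common choice is to act $\epsilon$-greedily on the sum or average of the two tables.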

Examine the action values you have computed. Ensure that the action values are 0 for the goal and taboo states. Also check that the actions with the largest values for each state make sense in terms of reaching the goal. 

With the action value function completed, you will now create and test code to perform GPI with Double Q-Learning(0).  You are welcome to use the `double_Q_learning_0_GPI` function from the TD/Q-learning notebook as a starting point. 

Execute your code for 10 cycles of 500 episodes, with $\alpha = 0.2$, $\gamma = 0.9$ and $\epsilon = 0.01$, and examine the results. 

Verify that your results make sense. For example, starting at state 2 or 22, do the most probable actions follow a shortest path?

ANS: 

## N-Step TD Learning

Finally, you will apply N-Step TD learning and N-Step SARSA to the warehouse navigation problem.  First create a function to perform N-step TD policy evaluation. You are welcome to start with the `TD_n` policy evaluation function from the TD/Q-Learning notebook. 

Test your function using 1,000 episodes, $n = 4$, $\gamma = 0.9$, and $\alpha = 0.2$.
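The quantity that N-step TD bootstraps toward is the n-step return, $G = \sum_{k=0}^{n-1} \gamma^k r_{k+1} + \gamma^n v(s_n)$, where $v(s_n)$ is the current estimate for the state reached after $n$ steps (or 0 if the episode ended first). The sketch below computes it for a short reward sequence. The names `n_step_return` and `n_step_td_update` are hypothetical and are not the notebook's `TD_n` function.

```python
def n_step_return(rewards, v_bootstrap, gamma=0.9):
    """Discounted sum of the rewards plus the discounted bootstrap value."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * v_bootstrap

def n_step_td_update(v_s, g, alpha=0.2):
    """Move the current estimate v_s a fraction alpha toward the n-step return g."""
    return v_s + alpha * (g - v_s)

# Three ordinary moves followed by a move into the goal, whose value is 0
g = n_step_return([-0.1, -0.1, -0.1, 10.0], v_bootstrap=0.0)
print(round(g, 3))                         # 7.019
print(round(n_step_td_update(0.0, g), 4))  # 1.4038
```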

Verify that the result you obtained appears correct. Are the values of the goal and taboo states all 0? Do the state values decrease with distance from the goal?

Now that you have an estimate of the best values for the number of steps and the learning rate you can compute the action values using multi-step SARSA. In the cell below, create and test a function to compute the action values using N-step SARSA. You are welcome to use the `SARSA_n` function from the TD/Q-learning notebook as a starting point. 

Test your function by executing 4-step SARSA for 1,000 episodes with $\alpha = 0.2$ and $\gamma = 0.9$ and using the optimum number of steps and learning rate you have determined. 

Verify that the results you have computed appear correct using the aforementioned criteria. 

Finally, create a function to use the GPI algorithm with N-step SARSA in the cell below. You are welcome to start with the `SARSA_n_GPI` function from the TD/Q-learning notebook. 

Execute your function using 4 step SARSA for 5 cycles of 500 episodes, with $\alpha = 0.2$, $\epsilon = 0.1$, and $\gamma = 0.9$.

Examine your results. Verify that the most probable paths to the goal from states 2 and 22 are the shortest possible.  
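One way to carry out this check is to follow the most probable action from a start state until the goal is reached. The sketch below assumes a policy stored as `{state: {action: probability}}` and a transition table stored as `{state: {action: next_state}}`, as in the cells above. The name `greedy_path` and the toy corridor used to exercise it are hypothetical.

```python
def greedy_path(policy, neighbors, start, goal, max_steps=50):
    """Follow the most probable action from start; stop at goal or after max_steps."""
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        action = max(policy[state], key=policy[state].get)
        state = neighbors[state][action]
        path.append(state)
    return path

# Toy 3-state corridor (0 - 1 - 2) used only to exercise the function
toy_neighbors = {0: {'l': 0, 'r': 1}, 1: {'l': 0, 'r': 2}, 2: {'l': 2, 'r': 2}}
toy_policy = {0: {'l': 0.1, 'r': 0.9}, 1: {'l': 0.1, 'r': 0.9}, 2: {'l': 0.5, 'r': 0.5}}
print(greedy_path(toy_policy, toy_neighbors, start=0, goal=2))  # [0, 1, 2]
```

On the warehouse grid, the barriers force a detour, so a shortest path from state 2 or state 22 to state 12 takes six moves. A returned path of seven states, counting the start, is therefore what to look for. The `max_steps` guard stops the loop if the learned policy cycles.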