# Challenge Assignment
## Cliff Walking with Reinforcement Learning

## CSCI E-82A

>**Make sure** you include your name along with the name of your team and team members in the notebook you submit.

**Your name and team name here:** 

## Introduction

In this challenge you will apply Monte Carlo reinforcement learning algorithms to a classic problem in reinforcement learning, known as the **cliff walking problem**. The cliff walking problem is a type of game. The goal is for the agent to find the highest reward (lowest cost) path from a starting state to the goal.   

There are a number of versions of the cliff walking problems which have been used as research benchmarks over the years. You can find a short discussion of the cliff walking problem on page 132 of Sutton and Barto, second edition.    

In the general cliff walking problem the agent starts in one corner of the state-space and must travel to goal, or terminal state, in another corner of the state-space. Between the starting state and goal state there is an area with a **cliff**. If the agent falls off a cliff it is sent back to the starting state. A schematic diagram of the state-space is shown in the diagram below.      

<img src="CliffWalkingDiagram.JPG" alt="Drawing" style="width:500px; height:400px"/>
<center> State-space of cliff-walking problem </center>



### Problem Description

The agent must learn a policy to navigate from the starting state to the terminal state. The properties this problem are as follows:

1. The state-space has two **continuous variables**, x and y.
2. The starting state is at $x = 0.0$, $y = 0.0$. 
3. The terminal state has two segments:
  - At $y = 0.0$ is in the range $9.0 \le x \le 10.0$. 
  - At $x = 10.0$ is in the range $0.0 \le y \le 1.0$.  
4. The cliff zone is bounded by:
  - $0.0 \le y \le 1.0$ and 
  - $1.0 \le x \le 9.0$. 
5. An agent entering the cliff zone is returned to the starting state.
6. The agent moves 1.0 units per time step. 
7. The 8 possible **discrete actions** are moves in the following directions:  
  - +x, 
  - +x, +y,
  - +y
  - -x, +y,
  - -y,
  - -x, -y,
  - -y, and
  - +x, -y. 
8. The rewards are:
  - -1 for a time step in the state-space,
  - -10 for colliding with an edge (barrier) of the state-space,
  - -100 for falling off the cliff and returning to the starting state, and 
  - +1000 for reaching the terminal or goal state. 
  


## Instructions

In this challenge you and your team will do the following. Include commentary on each component of your algorithms. Make sure you answer the questions.  

### Environment Simulator   

Your reinforcement learning agent cannot contain any information about the environment other that the starting state and the possible actions. Therefore, you must create an environment simulator, with the following input and output:
- Input: Arguments of state, the $(x,y)$ tuple, and discrete action
- Output: the new state (s'), reward, and if the new state meets the terminal or goal criteria.

Make sure you test your simulator functions carefully. The test cases must include, steps with each of the actions, falling off the cliff from each edge, hitting the barriers, and reaching the goal (terminal) edges. Errors in the simulator will make the rest of this challenge difficult.   

> **Note**: For this problem, coordinate state is represented by a tuple of continuous variables. Make sure that you maintain coordinate state as continuous variables for this problem. 

In [26]:
from math import cos
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
%matplotlib inline

def sim_environment(x, y, action):
    fell = False
    hit_wall = False
    # calculate new location
    x_prime = x
    y_prime = y
    if (action[0] == 1):
        if (action[1] == 1):
            x_prime = x + .707
            y_prime = y + .707
        elif(action[1] == -1):
            x_prime = x + .707
            y_prime = y - .707
        else:
            X_prime = x + 1
    elif (action[0] == -1):
        if (action[1] == 1):
            x_prime = x - .707
            y_prime = y + .707
        elif(action[1] == -1):
            x_prime = x - .707
            y_prime = y - .707
        else:
            X_prime = x - 1
    else:
        if (action[1] == 1):
            y_prime = y + 1
        elif (action[1] == -1):
            y_prime = y - 1
             
    #ensure new location is in bounds
    if (x_prime >= 10.0):
        x_prime = 0.0
        y_prime = 0.0
        hit_wall = True
    elif (x_prime <= 0.0):
        x_prime = 0.0
        y_prime = 0.0
        hit_wall = True
    if (y_prime >= 10.0):
        x_prime = 0.0
        y_prime = 0.0
        hit_wall = True
    elif (y_prime <= 0.0):
        x_prime = 0.0
        y_prime = 0.0  
        hit_wall = True
             
    #ensure new location is not off cliff
    if (x_prime >= 1.0 and x_prime <= 9.0):
        if (y_prime >= 0.0 and y_prime <= 1.0):
             x_prime = 0.0
             y_prime = 0.0
             fell = True
    
    ## At the terminal state or not and set reward
    if (in_terminal(x_prime, y_prime)):
        done = True
        reward = 1000
    elif (fell):
        done = False
        reward = -100
    elif (hit_wall):
        done = False
        reward = -10
        x_prime = x
        y_prime = y
    else:
        done = False
        reward = -1.0
    # output new state (s'), reward, and if the new state meets the terminal or goal criteria
    return(x_prime, y_prime, done, reward)

def in_terminal(x, y):
    if (y == 0.0 and x >= 9.0):
        return True
    elif (x == 10.0 and y <= 1.0):
        return True
    return False

In [None]:
from itertools import permutations as perm
moves = list(perm([0, 1, -1], 2)) + [[1, 1], [-1, -1]]

def test_sim(init, moves):
    """
    Test simulator outcomes from initial state.
    Results are determined for each move in moves.
    """
    print(f"initial location: {init}")
    for m in moves:
        res, v, f = sim_environment(init, m)
        move = [round(i, 3) for i in m]
        res = [round(i, 3) for i in res]
        print(f"move: {move} ==> {res, v, f}")

        
# Some basic tests
inits = [[0,0],                                   # start
         [1,0],                                   # left of cliff
         [9,1],                                   # right of cliff
         [5,2],                                   # middle/above cliff
         [5,5],                                   # middle of grid
         [8,1],                                   # border terminal
         [9,2]                                    # border terminal
         ]
for i in inits:
    test_sim(i, moves)


### Grid Approximation

The state-space of the cliff walking problem is continuous. Therefor, you will need to use a **grid approximation** to construct a policy. The policy is specified as the probability of action for each grid cell. For this problem, use a 10x10 grid. 

> **Note:** While the policy uses a grid approximation, state should be represented as continuous variables.

In [41]:
def state(x, x_lims = (0.0,10.0), n_tiles = 10):
    """Function to compute tile state given positon"""
    state = int((x - x_lims[0])/(x_lims[1] - x_lims[0]) * float(n_tiles))
    if(state > n_tiles - 1): state = n_tiles - 1
    return(state)

def get_grid_state(x, y):
    return (state(x) + 10 * state(y))

### Initial Policy

Start with a uniform initial policy. A uniform policy has an equal probability of taking any of the 8 possible actions for each cell in the grid representation.     

> **Note:** As has already been stated, the coordinate state representation for this problem is a tuple of coordinate values. However, policy, state-values and action-values are represented with a grid approximation. 

> **Hint:** You may wish to use a 3-dimensional numpy array to code the policy for this problem. With 8 possible actions, this approach will be easier to work with. 



In [23]:
actions = [.125]*8
x_actions = [actions] * 10
initial_policy = [x_actions] * 10
initial_policy

[[[0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125]],
 [[0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
  [0.125, 0.125, 0.125, 0.125, 0.125, 0

### Monte Carlo State Value Estimation   

For the initial uniform policy, compute the state values using the Monte Carlo RL algorithm:
1. Compute and print the state values for each grid in the representation. Use at least 1,000 episodes. This will take some time to execute.      
2. Plot the grid of state values, as an image (e.g. matplotlib [imshow](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.imshow.html)). 
3. Compute the Forbenious norm (Euclidean norm) of the state value array with [numpy.linalg.norm](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html). You will use this figure as a basis to compare your improved policy. 

Study your plot to ensure your state values seem correct. Do these state values seem reasonable given the uniform policy and why? Make sure you pay attention to the state values of the cliff zone.    

> **Hint:** Careful testing at each stage of your algorithm development will potentially save you considerable time. Test your function(s) to for a single episode to make sure your algorithm converges. Then test for say 10 episodes to ensure the state values update in a reasonable manner at each episode.    

> **Note:** The Monte Carlo episodes can be executed in parallel for production systems. The Markov chain of each episode is statistically independent. 

In [47]:
actions = [(0,0), (0,1), (0,-1), (1,0), (1,1), (1,-1), (-1,0), (-1,1), (-1,-1)]
def take_action(x, y, policy):
    '''Function takes action given state using the transition probabilities 
    of the policy'''
    ## Find the action given the transistion probabilities defined by the policy.
    x_grid = state(x)
    y_grid = state(y)
    action = actions[nr.choice(8, p = policy[x_grid][y_grid]) + 1]
    x_prime, y_prime, is_terminal, reward = sim_environment(x,y, action)
    return (action, x_prime, y_prime, is_terminal, reward)

#print(initial_policy[0][0])
#take_action(0,0,initial_policy)

def MC_episode(policy, G, n_visits): 
    '''Function creates the Monte Carlo samples of one episode.
    This function does most of the real work'''
    ## For each episode we use a list to keep track of states we have visited.
    ## Once we visit a state we need to accumulate values to get the returns
    states_visited = []
        
    ## Find the starting state
    x_current = 0.0
    y_current = 0.0
    current_state = get_grid_state(x_current, y_current)
    terminal = False
    g = 0.0
    
    while(not terminal):
        ## Find the next action and reward
        action, x_prime, y_prime, is_terminal, reward = take_action(x_current, y_current, policy)
        print("from: " + str((action, x_current, y_current, is_terminal, reward)))
        print("to: " + str((action, x_prime, y_prime, is_terminal, reward)))
        ## Add the reward to the states visited if this is a first visit  
        if(current_state not in states_visited):
            ## Mark that the current state has been visited 
            states_visited.append(current_state) 
            ## Add the reward to states visited 
            for state in states_visited:
                n_visits[state] = n_visits[state] + 1.0
                G[state] = G[state] + (reward - G[state])/n_visits[state]
        
        ## Update the current state for next transition
        current_state = get_grid_state(x_prime, y_prime) 
        x_current = x_prime
        y_current = y_prime
    return (G, n_visits) 

def MC_state_values(policy, n_episodes):
    '''Function that evaluates the state value of 
    a policy using the Monte Carlo method.'''
    ## Create list of states 
    n_states = 100
    
    ## An array to hold the accumulated returns as we visit states
    G = np.zeros((n_states))
    
    ## An array to keep track of how many times we visit each state so we can 
    ## compute the mean
    n_visits = np.zeros((n_states))
    print(n_visits)
    
    ## Iterate over the episodes
    for i in range(n_episodes):
        G, n_visits = MC_episode(policy, G, n_visits) # neighbors, i, n_states)
    return(G) 

MC_state_values(initial_policy, 1)
#actions[nr.choice(8, p = initial_policy[0][0]) + 1]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]
from: ((1, 1), 0.0, 0.0, False, -1.0)
to: ((1, 1), 0.707, 0.707, False, -1.0)
from: ((1, 1), 0.707, 0.707, False, -1.0)
to: ((1, 1), 1.414, 1.414, False, -1.0)
from: ((1, 1), 1.414, 1.414, False, -1.0)
to: ((1, 1), 2.121, 2.121, False, -1.0)
from: ((1, 0), 2.121, 2.121, False, -1.0)
to: ((1, 0), 2.121, 2.121, False, -1.0)
from: ((1, -1), 2.121, 2.121, False, -1.0)
to: ((1, -1), 2.828, 1.4140000000000001, False, -1.0)
from: ((1, -1), 2.828, 1.4140000000000001, False, -100)
to: ((1, -1), 0.0, 0.0, False, -100)
from: ((0, 1), 0.0, 0.0, False, -10)
to: ((0, 1), 0.0, 0.0, False, -10)
from: ((-1, -1), 0.0, 0.0, False, -10)
to: ((-1, -1), 0.0, 0.0, False, -10)
from: ((1, 1), 0.0, 0.0, False, 

from: ((-1, -1), 2.828, 8.241999999999999, False, -1.0)
to: ((-1, -1), 2.121, 7.534999999999999, False, -1.0)
from: ((0, 1), 2.121, 7.534999999999999, False, -1.0)
to: ((0, 1), 2.121, 8.535, False, -1.0)
from: ((-1, 0), 2.121, 8.535, False, -1.0)
to: ((-1, 0), 2.121, 8.535, False, -1.0)
from: ((-1, 0), 2.121, 8.535, False, -1.0)
to: ((-1, 0), 2.121, 8.535, False, -1.0)
from: ((1, 1), 2.121, 8.535, False, -1.0)
to: ((1, 1), 2.828, 9.242, False, -1.0)
from: ((-1, -1), 2.828, 9.242, False, -1.0)
to: ((-1, -1), 2.121, 8.535, False, -1.0)
from: ((1, -1), 2.121, 8.535, False, -1.0)
to: ((1, -1), 2.828, 7.828, False, -1.0)
from: ((1, 0), 2.828, 7.828, False, -1.0)
to: ((1, 0), 2.828, 7.828, False, -1.0)
from: ((0, 1), 2.828, 7.828, False, -1.0)
to: ((0, 1), 2.828, 8.828, False, -1.0)
from: ((0, -1), 2.828, 8.828, False, -1.0)
to: ((0, -1), 2.828, 7.827999999999999, False, -1.0)
from: ((0, -1), 2.828, 7.827999999999999, False, -1.0)
to: ((0, -1), 2.828, 6.827999999999999, False, -1.0)
from: ((

from: ((1, 0), 0.707, 0.707, False, -1.0)
to: ((1, 0), 0.707, 0.707, False, -1.0)
from: ((-1, -1), 0.707, 0.707, False, -10)
to: ((-1, -1), 0.707, 0.707, False, -10)
from: ((-1, 1), 0.707, 0.707, False, -10)
to: ((-1, 1), 0.707, 0.707, False, -10)
from: ((-1, 1), 0.707, 0.707, False, -10)
to: ((-1, 1), 0.707, 0.707, False, -10)
from: ((1, 1), 0.707, 0.707, False, -1.0)
to: ((1, 1), 1.414, 1.414, False, -1.0)
from: ((1, -1), 1.414, 1.414, False, -100)
to: ((1, -1), 0.0, 0.0, False, -100)
from: ((-1, -1), 0.0, 0.0, False, -10)
to: ((-1, -1), 0.0, 0.0, False, -10)
from: ((1, -1), 0.0, 0.0, False, -10)
to: ((1, -1), 0.0, 0.0, False, -10)
from: ((1, 0), 0.0, 0.0, False, -10)
to: ((1, 0), 0.0, 0.0, False, -10)
from: ((1, -1), 0.0, 0.0, False, -10)
to: ((1, -1), 0.0, 0.0, False, -10)
from: ((-1, -1), 0.0, 0.0, False, -10)
to: ((-1, -1), 0.0, 0.0, False, -10)
from: ((1, 1), 0.0, 0.0, False, -1.0)
to: ((1, 1), 0.707, 0.707, False, -1.0)
from: ((-1, 0), 0.707, 0.707, False, -1.0)
to: ((-1, 0), 0

from: ((-1, 1), 1.414, 1.414, False, -1.0)
to: ((-1, 1), 0.707, 2.121, False, -1.0)
from: ((1, 1), 0.707, 2.121, False, -1.0)
to: ((1, 1), 1.414, 2.828, False, -1.0)
from: ((1, 0), 1.414, 2.828, False, -1.0)
to: ((1, 0), 1.414, 2.828, False, -1.0)
from: ((-1, 0), 1.414, 2.828, False, -1.0)
to: ((-1, 0), 1.414, 2.828, False, -1.0)
from: ((1, 0), 1.414, 2.828, False, -1.0)
to: ((1, 0), 1.414, 2.828, False, -1.0)
from: ((0, 1), 1.414, 2.828, False, -1.0)
to: ((0, 1), 1.414, 3.828, False, -1.0)
from: ((1, -1), 1.414, 3.828, False, -1.0)
to: ((1, -1), 2.121, 3.121, False, -1.0)
from: ((-1, 0), 2.121, 3.121, False, -1.0)
to: ((-1, 0), 2.121, 3.121, False, -1.0)
from: ((-1, -1), 2.121, 3.121, False, -1.0)
to: ((-1, -1), 1.4140000000000001, 2.414, False, -1.0)
from: ((-1, 0), 1.4140000000000001, 2.414, False, -1.0)
to: ((-1, 0), 1.4140000000000001, 2.414, False, -1.0)
from: ((-1, -1), 1.4140000000000001, 2.414, False, -1.0)
to: ((-1, -1), 0.7070000000000002, 1.7070000000000003, False, -1.0)
fr

from: ((1, 1), 0.707, 6.1209999999999996, False, -1.0)
to: ((1, 1), 1.414, 6.827999999999999, False, -1.0)
from: ((-1, -1), 1.414, 6.827999999999999, False, -1.0)
to: ((-1, -1), 0.707, 6.1209999999999996, False, -1.0)
from: ((-1, 0), 0.707, 6.1209999999999996, False, -1.0)
to: ((-1, 0), 0.707, 6.1209999999999996, False, -1.0)
from: ((1, 1), 0.707, 6.1209999999999996, False, -1.0)
to: ((1, 1), 1.414, 6.827999999999999, False, -1.0)
from: ((1, -1), 1.414, 6.827999999999999, False, -1.0)
to: ((1, -1), 2.121, 6.1209999999999996, False, -1.0)
from: ((0, 1), 2.121, 6.1209999999999996, False, -1.0)
to: ((0, 1), 2.121, 7.1209999999999996, False, -1.0)
from: ((-1, 0), 2.121, 7.1209999999999996, False, -1.0)
to: ((-1, 0), 2.121, 7.1209999999999996, False, -1.0)
from: ((-1, 0), 2.121, 7.1209999999999996, False, -1.0)
to: ((-1, 0), 2.121, 7.1209999999999996, False, -1.0)
from: ((-1, -1), 2.121, 7.1209999999999996, False, -1.0)
to: ((-1, -1), 1.4140000000000001, 6.414, False, -1.0)
from: ((-1, -1),

from: ((-1, 0), 2.121, 6.051, False, -1.0)
to: ((-1, 0), 2.121, 6.051, False, -1.0)
from: ((-1, 0), 2.121, 6.051, False, -1.0)
to: ((-1, 0), 2.121, 6.051, False, -1.0)
from: ((-1, -1), 2.121, 6.051, False, -1.0)
to: ((-1, -1), 1.4140000000000001, 5.344, False, -1.0)
from: ((1, -1), 1.4140000000000001, 5.344, False, -1.0)
to: ((1, -1), 2.121, 4.6370000000000005, False, -1.0)
from: ((0, -1), 2.121, 4.6370000000000005, False, -1.0)
to: ((0, -1), 2.121, 3.6370000000000005, False, -1.0)
from: ((1, 0), 2.121, 3.6370000000000005, False, -1.0)
to: ((1, 0), 2.121, 3.6370000000000005, False, -1.0)
from: ((1, 0), 2.121, 3.6370000000000005, False, -1.0)
to: ((1, 0), 2.121, 3.6370000000000005, False, -1.0)
from: ((-1, 1), 2.121, 3.6370000000000005, False, -1.0)
to: ((-1, 1), 1.4140000000000001, 4.344, False, -1.0)
from: ((1, 0), 1.4140000000000001, 4.344, False, -1.0)
to: ((1, 0), 1.4140000000000001, 4.344, False, -1.0)
from: ((1, -1), 1.4140000000000001, 4.344, False, -1.0)
to: ((1, -1), 2.121, 3.

from: ((0, -1), 0.0, 0.0, False, -10)
to: ((0, -1), 0.0, 0.0, False, -10)
from: ((-1, 1), 0.0, 0.0, False, -10)
to: ((-1, 1), 0.0, 0.0, False, -10)
from: ((0, -1), 0.0, 0.0, False, -10)
to: ((0, -1), 0.0, 0.0, False, -10)
from: ((-1, 0), 0.0, 0.0, False, -10)
to: ((-1, 0), 0.0, 0.0, False, -10)
from: ((1, -1), 0.0, 0.0, False, -10)
to: ((1, -1), 0.0, 0.0, False, -10)
from: ((-1, -1), 0.0, 0.0, False, -10)
to: ((-1, -1), 0.0, 0.0, False, -10)
from: ((-1, -1), 0.0, 0.0, False, -10)
to: ((-1, -1), 0.0, 0.0, False, -10)
from: ((1, 0), 0.0, 0.0, False, -10)
to: ((1, 0), 0.0, 0.0, False, -10)
from: ((1, 1), 0.0, 0.0, False, -1.0)
to: ((1, 1), 0.707, 0.707, False, -1.0)
from: ((1, 0), 0.707, 0.707, False, -1.0)
to: ((1, 0), 0.707, 0.707, False, -1.0)
from: ((1, 0), 0.707, 0.707, False, -1.0)
to: ((1, 0), 0.707, 0.707, False, -1.0)
from: ((1, -1), 0.707, 0.707, False, -10)
to: ((1, -1), 0.707, 0.707, False, -10)
from: ((0, 1), 0.707, 0.707, False, -1.0)
to: ((0, 1), 0.707, 1.7069999999999999, 

to: ((-1, 1), 2.1210000000000004, 9.191000000000003, False, -1.0)
from: ((1, 0), 2.1210000000000004, 9.191000000000003, False, -1.0)
to: ((1, 0), 2.1210000000000004, 9.191000000000003, False, -1.0)
from: ((0, -1), 2.1210000000000004, 9.191000000000003, False, -1.0)
to: ((0, -1), 2.1210000000000004, 8.191000000000003, False, -1.0)
from: ((-1, 0), 2.1210000000000004, 8.191000000000003, False, -1.0)
to: ((-1, 0), 2.1210000000000004, 8.191000000000003, False, -1.0)
from: ((-1, -1), 2.1210000000000004, 8.191000000000003, False, -1.0)
to: ((-1, -1), 1.4140000000000006, 7.484000000000003, False, -1.0)
from: ((1, -1), 1.4140000000000006, 7.484000000000003, False, -1.0)
to: ((1, -1), 2.1210000000000004, 6.777000000000003, False, -1.0)
from: ((1, 0), 2.1210000000000004, 6.777000000000003, False, -1.0)
to: ((1, 0), 2.1210000000000004, 6.777000000000003, False, -1.0)
from: ((-1, -1), 2.1210000000000004, 6.777000000000003, False, -1.0)
to: ((-1, -1), 1.4140000000000006, 6.070000000000003, False, -1

from: ((-1, 1), 0.0, 0.0, False, -10)
to: ((-1, 1), 0.0, 0.0, False, -10)
from: ((1, -1), 0.0, 0.0, False, -10)
to: ((1, -1), 0.0, 0.0, False, -10)
from: ((-1, 1), 0.0, 0.0, False, -10)
to: ((-1, 1), 0.0, 0.0, False, -10)
from: ((1, 0), 0.0, 0.0, False, -10)
to: ((1, 0), 0.0, 0.0, False, -10)
from: ((0, 1), 0.0, 0.0, False, -10)
to: ((0, 1), 0.0, 0.0, False, -10)
from: ((-1, -1), 0.0, 0.0, False, -10)
to: ((-1, -1), 0.0, 0.0, False, -10)
from: ((1, 0), 0.0, 0.0, False, -10)
to: ((1, 0), 0.0, 0.0, False, -10)
from: ((-1, 1), 0.0, 0.0, False, -10)
to: ((-1, 1), 0.0, 0.0, False, -10)
from: ((1, 0), 0.0, 0.0, False, -10)
to: ((1, 0), 0.0, 0.0, False, -10)
from: ((-1, 0), 0.0, 0.0, False, -10)
to: ((-1, 0), 0.0, 0.0, False, -10)
from: ((1, 0), 0.0, 0.0, False, -10)
to: ((1, 0), 0.0, 0.0, False, -10)
from: ((0, -1), 0.0, 0.0, False, -10)
to: ((0, -1), 0.0, 0.0, False, -10)
from: ((1, 1), 0.0, 0.0, False, -1.0)
to: ((1, 1), 0.707, 0.707, False, -1.0)
from: ((0, 1), 0.707, 0.707, False, -1.0)


from: ((1, -1), 2.8280000000000003, 3.8279999999999994, False, -1.0)
to: ((1, -1), 3.535, 3.1209999999999996, False, -1.0)
from: ((-1, 1), 3.535, 3.1209999999999996, False, -1.0)
to: ((-1, 1), 2.8280000000000003, 3.8279999999999994, False, -1.0)
from: ((1, -1), 2.8280000000000003, 3.8279999999999994, False, -1.0)
to: ((1, -1), 3.535, 3.1209999999999996, False, -1.0)
from: ((1, 0), 3.535, 3.1209999999999996, False, -1.0)
to: ((1, 0), 3.535, 3.1209999999999996, False, -1.0)
from: ((-1, 0), 3.535, 3.1209999999999996, False, -1.0)
to: ((-1, 0), 3.535, 3.1209999999999996, False, -1.0)
from: ((1, 0), 3.535, 3.1209999999999996, False, -1.0)
to: ((1, 0), 3.535, 3.1209999999999996, False, -1.0)
from: ((-1, 1), 3.535, 3.1209999999999996, False, -1.0)
to: ((-1, 1), 2.8280000000000003, 3.8279999999999994, False, -1.0)
from: ((0, -1), 2.8280000000000003, 3.8279999999999994, False, -1.0)
to: ((0, -1), 2.8280000000000003, 2.8279999999999994, False, -1.0)
from: ((-1, -1), 2.8280000000000003, 2.8279999

from: ((0, 1), 4.949, 8.878999999999996, False, -1.0)
to: ((0, 1), 4.949, 9.878999999999996, False, -1.0)
from: ((1, 1), 4.949, 9.878999999999996, False, -10)
to: ((1, 1), 4.949, 9.878999999999996, False, -10)
from: ((1, 0), 4.949, 9.878999999999996, False, -1.0)
to: ((1, 0), 4.949, 9.878999999999996, False, -1.0)
from: ((0, 1), 4.949, 9.878999999999996, False, -10)
to: ((0, 1), 4.949, 9.878999999999996, False, -10)
from: ((-1, -1), 4.949, 9.878999999999996, False, -1.0)
to: ((-1, -1), 4.242, 9.171999999999995, False, -1.0)
from: ((1, 1), 4.242, 9.171999999999995, False, -1.0)
to: ((1, 1), 4.949, 9.878999999999996, False, -1.0)
from: ((1, 0), 4.949, 9.878999999999996, False, -1.0)
to: ((1, 0), 4.949, 9.878999999999996, False, -1.0)
from: ((1, 0), 4.949, 9.878999999999996, False, -1.0)
to: ((1, 0), 4.949, 9.878999999999996, False, -1.0)
from: ((1, 0), 4.949, 9.878999999999996, False, -1.0)
to: ((1, 0), 4.949, 9.878999999999996, False, -1.0)
from: ((1, 0), 4.949, 9.878999999999996, False

to: ((-1, 0), 0.0, 0.0, False, -10)
from: ((-1, -1), 0.0, 0.0, False, -10)
to: ((-1, -1), 0.0, 0.0, False, -10)
from: ((-1, 0), 0.0, 0.0, False, -10)
to: ((-1, 0), 0.0, 0.0, False, -10)
from: ((0, -1), 0.0, 0.0, False, -10)
to: ((0, -1), 0.0, 0.0, False, -10)
from: ((1, 1), 0.0, 0.0, False, -1.0)
to: ((1, 1), 0.707, 0.707, False, -1.0)
from: ((-1, 1), 0.707, 0.707, False, -10)
to: ((-1, 1), 0.707, 0.707, False, -10)
from: ((-1, -1), 0.707, 0.707, False, -10)
to: ((-1, -1), 0.707, 0.707, False, -10)
from: ((-1, 1), 0.707, 0.707, False, -10)
to: ((-1, 1), 0.707, 0.707, False, -10)
from: ((1, -1), 0.707, 0.707, False, -10)
to: ((1, -1), 0.707, 0.707, False, -10)
from: ((-1, 0), 0.707, 0.707, False, -1.0)
to: ((-1, 0), 0.707, 0.707, False, -1.0)
from: ((-1, 1), 0.707, 0.707, False, -10)
to: ((-1, 1), 0.707, 0.707, False, -10)
from: ((-1, 1), 0.707, 0.707, False, -10)
to: ((-1, 1), 0.707, 0.707, False, -10)
from: ((-1, -1), 0.707, 0.707, False, -10)
to: ((-1, -1), 0.707, 0.707, False, -10)


to: ((-1, 1), 3.535, 9.777000000000001, False, -10)
from: ((0, 1), 3.535, 9.777000000000001, False, -10)
to: ((0, 1), 3.535, 9.777000000000001, False, -10)
from: ((-1, 1), 3.535, 9.777000000000001, False, -10)
to: ((-1, 1), 3.535, 9.777000000000001, False, -10)
from: ((0, 1), 3.535, 9.777000000000001, False, -10)
to: ((0, 1), 3.535, 9.777000000000001, False, -10)
from: ((0, 1), 3.535, 9.777000000000001, False, -10)
to: ((0, 1), 3.535, 9.777000000000001, False, -10)
from: ((-1, 1), 3.535, 9.777000000000001, False, -10)
to: ((-1, 1), 3.535, 9.777000000000001, False, -10)
from: ((-1, -1), 3.535, 9.777000000000001, False, -1.0)
to: ((-1, -1), 2.8280000000000003, 9.07, False, -1.0)
from: ((0, -1), 2.8280000000000003, 9.07, False, -1.0)
to: ((0, -1), 2.8280000000000003, 8.07, False, -1.0)
from: ((-1, 0), 2.8280000000000003, 8.07, False, -1.0)
to: ((-1, 0), 2.8280000000000003, 8.07, False, -1.0)
from: ((-1, 1), 2.8280000000000003, 8.07, False, -1.0)
to: ((-1, 1), 2.1210000000000004, 8.7770000

from: ((0, 1), 6.661338147750939e-16, 4.585999999999999, False, -1.0)
to: ((0, 1), 6.661338147750939e-16, 5.585999999999999, False, -1.0)
from: ((-1, 0), 6.661338147750939e-16, 5.585999999999999, False, -1.0)
to: ((-1, 0), 6.661338147750939e-16, 5.585999999999999, False, -1.0)
from: ((1, -1), 6.661338147750939e-16, 5.585999999999999, False, -1.0)
to: ((1, -1), 0.7070000000000006, 4.879, False, -1.0)
from: ((-1, 1), 0.7070000000000006, 4.879, False, -1.0)
to: ((-1, 1), 6.661338147750939e-16, 5.585999999999999, False, -1.0)
from: ((-1, 0), 6.661338147750939e-16, 5.585999999999999, False, -1.0)
to: ((-1, 0), 6.661338147750939e-16, 5.585999999999999, False, -1.0)
from: ((-1, -1), 6.661338147750939e-16, 5.585999999999999, False, -10)
to: ((-1, -1), 6.661338147750939e-16, 5.585999999999999, False, -10)
from: ((1, -1), 6.661338147750939e-16, 5.585999999999999, False, -1.0)
to: ((1, -1), 0.7070000000000006, 4.879, False, -1.0)
from: ((1, 0), 0.7070000000000006, 4.879, False, -1.0)
to: ((1, 0),

from: ((-1, 1), 4.949, 4.534999999999999, False, -1.0)
to: ((-1, 1), 4.242, 5.241999999999999, False, -1.0)
from: ((0, 1), 4.242, 5.241999999999999, False, -1.0)
to: ((0, 1), 4.242, 6.241999999999999, False, -1.0)
from: ((-1, 1), 4.242, 6.241999999999999, False, -1.0)
to: ((-1, 1), 3.535, 6.948999999999999, False, -1.0)
from: ((1, -1), 3.535, 6.948999999999999, False, -1.0)
to: ((1, -1), 4.242, 6.241999999999999, False, -1.0)
from: ((1, 0), 4.242, 6.241999999999999, False, -1.0)
to: ((1, 0), 4.242, 6.241999999999999, False, -1.0)
from: ((0, 1), 4.242, 6.241999999999999, False, -1.0)
to: ((0, 1), 4.242, 7.241999999999999, False, -1.0)
from: ((1, 0), 4.242, 7.241999999999999, False, -1.0)
to: ((1, 0), 4.242, 7.241999999999999, False, -1.0)
from: ((1, -1), 4.242, 7.241999999999999, False, -1.0)
to: ((1, -1), 4.949, 6.534999999999999, False, -1.0)
from: ((1, 1), 4.949, 6.534999999999999, False, -1.0)
to: ((1, 1), 5.656, 7.241999999999999, False, -1.0)
from: ((-1, 1), 5.656, 7.2419999999999

from: ((-1, -1), 0.7070000000000002, 6.535000000000001, False, -1.0)
to: ((-1, -1), 2.220446049250313e-16, 5.828000000000001, False, -1.0)
from: ((0, -1), 2.220446049250313e-16, 5.828000000000001, False, -1.0)
to: ((0, -1), 2.220446049250313e-16, 4.828000000000001, False, -1.0)
from: ((1, 0), 2.220446049250313e-16, 4.828000000000001, False, -1.0)
to: ((1, 0), 2.220446049250313e-16, 4.828000000000001, False, -1.0)
from: ((1, -1), 2.220446049250313e-16, 4.828000000000001, False, -1.0)
to: ((1, -1), 0.7070000000000002, 4.121000000000001, False, -1.0)
from: ((1, 0), 0.7070000000000002, 4.121000000000001, False, -1.0)
to: ((1, 0), 0.7070000000000002, 4.121000000000001, False, -1.0)
from: ((0, -1), 0.7070000000000002, 4.121000000000001, False, -1.0)
to: ((0, -1), 0.7070000000000002, 3.1210000000000013, False, -1.0)
from: ((1, 1), 0.7070000000000002, 3.1210000000000013, False, -1.0)
to: ((1, 1), 1.4140000000000001, 3.828000000000001, False, -1.0)
from: ((1, 0), 1.4140000000000001, 3.828000000

KeyboardInterrupt: 

ANS:

### Monte Carlo State Policy Improvement   

Finally, you will perform Monte Carlo RL policy improvement:
1. Starting with the uniform policy, compute action-values for each grid in the representation. Use at least 1,000 episodes.      
2. Use these action values to find an improved policy.
3. To evaluate your updated policy compute the state-values for this policy.  
4. Plot the grid of state values for the improved policy, as an image. 
5. Compute the Forbenious norm (Euclidean norm) of the state value array. 

Compare the state value plot for the improved policy to the one for the initial uniform policy. Does the improved state values increase generally as distance to the terminal states decreases?  Is this what you expect and why?    

Compare the norm of the state values with your improved policy to the norm for the uniform policy. Is the increase significant?  

> **Hint:** Careful testing at each stage of your algorithm development will potentially save you considerable time. Test your function(s) to for a single episode to make sure your algorithm converges. Then test for say 10 episodes to ensure the state values update in a reasonable manner at each episode.   

> **Note:** You could continue to improve policy using the general policy improvement algorithm (GPI). In the interest of time, you are not required to do so here. 

ANS:

ANS:

ANS:

## Solution

Create cells below for your solution to the stated problem. Be sure to include some Markdown text and code comments to explain each component of your algorithm. 