# **Dynamic Programming (Basics)**
### 2022/04/24, A. J. Zerouali

 

## **1 - Introduction**
These notes follow Lazy Programmer's Reinforcement Learning course. I'm starting with Section 5 on Dynamic Programming, which covers the following topics:
* **Prediction:** The iterative policy evaluation algorithm;
* **Control:** The policy iteration and value iteration methods for policy improvement.

The contents are divided into the following sections:

    1 - Introduction
    2 - Iterative policy evaluation
    3 - Policy iteration
    4 - Value iteration
    5 - Dynamic programming in optimal control

### 22/04/21 note:
Need to debug policy iteration, creating new file for that. Present notebook is becoming unreadable due to code. Consider switching to \*.py files and including them in the notebook. 
### 22/04/22 note:
The main algorithm works. Compare results of section 3 with those of Lecture 57.


In [1]:
#####################################################
##### IMPORTANT: ALWAYS EXECUTE THIS CELL FIRST #####
#####################################################
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 2 - Iterative policy evaluation

Below are my notes from the relevant lectures. 

**From Lecture 48:**
* For this part, we assume we know the *finite* MDP: The transition probabilities and the policy. 
* The goal is to evaluate the value function using the Bellman equation:

$$V_\pi(s) = \sum_a\pi(a|s)\sum_{s',r}P(s',r|s,a)\left(r+\gamma V_\pi(s')\right)$$

(Remark: We're assuming the rewards are also stochastic.)

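* In principle, since the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are finite, the Bellman equation is a linear system in the unknowns $V_\pi(s)$ and could be solved directly. That is not the preferred approach, since a dense direct solve scales roughly cubically with the number of states.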
* Iterative policy evaluation is a fixed-point algorithm.
* Implementation remark: In lecture 48, Lazy Programmer recommends dictionaries rather than arrays or lists for storing the "value function", and makes the same suggestion for the transition probabilities and the policy. With states as $(i,j)$ tuples, a dictionary can be indexed by state directly and only needs entries for the states (or transitions) that actually occur.
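
To make the "linear system vs. fixed-point" remark concrete, here is a minimal sketch on a made-up 2-state chain (not the GridWorld of the course). The names `P_pi` and `r_pi` are my own: they stand for the transition matrix and expected one-step reward under a fixed policy, restricted to non-terminal states.

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy (terminal state omitted):
# state 0 moves to state 1 with reward 0; state 1 moves to the terminal state with reward 1.
gamma = 0.9
P_pi = np.array([[0.0, 1.0],
                 [0.0, 0.0]])   # P_pi[s, s'] between non-terminal states
r_pi = np.array([0.0, 1.0])     # expected one-step reward from each state

# Direct solve of the linear system (I - gamma * P_pi) V = r_pi
V_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Fixed-point iteration V <- r_pi + gamma * P_pi V
V_iter = np.zeros(2)
for k in range(100):
    V_next = r_pi + gamma * P_pi @ V_iter
    converged = np.max(np.abs(V_next - V_iter)) < 1e-8
    V_iter = V_next
    if converged:
        break

print(V_direct, V_iter)
```

Both approaches should agree on this toy problem; the iterative version is the one that carries over to the dictionary-based code used in the rest of these notes.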

**Lecture 49:**
* This lecture describes the general layout of a reinforcement learning algorithm.

**Lecture 50: Gridworld in code**
* Link to GitHub repo: https://github.com/lazyprogrammer/machine_learning_examples/blob/master/rl/grid_world.py
* I'll write a simplified version below. 
* Here's what an episode execution should look like:
        g = GridWorld(rows, cols, initial_state)
        while not g.game_over():
            s = g.current_state()
            a = policy(s)
            r = g.move(a)
    Above, the GridWorld methods used are:
        # Return current state of the agent:
        g.current_state()
        # Return reward  
        reward = g.move(action)
        # Check if game (episode) is over, i.e. if current state terminal.
        g.game_over()
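
The episode loop above can be tried out without the full GridWorld class. The sketch below uses a hypothetical `CorridorEnv` (my own stand-in, not part of the course code) that exposes the same three methods, together with a trivial policy that always moves right.

```python
class CorridorEnv:
    """Hypothetical stand-in for GridWorld: cells 0..3 in a row, cell 3 is terminal."""
    def __init__(self, initial_state=0):
        self.s = initial_state

    def current_state(self):
        return self.s

    def game_over(self):
        return self.s == 3

    def move(self, action):
        # "R" moves one cell right, anything else one cell left (clipped at cell 0)
        self.s = min(self.s + 1, 3) if action == "R" else max(self.s - 1, 0)
        # Reward of 1 on arrival at the terminal cell, 0 otherwise
        return 1 if self.s == 3 else 0

def policy(s):
    # Trivial deterministic policy: always go right
    return "R"

g = CorridorEnv(initial_state=0)
total_reward, n_steps = 0, 0
while not g.game_over():
    s = g.current_state()
    a = policy(s)
    r = g.move(a)
    total_reward += r
    n_steps += 1

print(n_steps, total_reward)
```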
        
**Lecture 51: Iterative policy evaluation**

* We'll start with the deterministic version of GridWorld. There will be several helper functions to consider too and additional data structures.
        

* First we want to visualize the value function and the policy. We'll write it in such a way that we can see it on a grid.

* Regarding the transition probabilities, we will make a dictionary to store $p(s'|s,a)$. Note that the rewards are deterministic here, and that we'll need a helper function to build this dictionary.

* The policy will also be a dictionary, where the keys are the admissible states and the values are the actions "U", "D", "L" and "R".

* We'll create a function implementing the policy evaluation algorithm.

* The code discussed in the lecture can be found here: https://github.com/lazyprogrammer/machine_learning_examples/blob/master/rl/iterative_policy_evaluation_deterministic.py

* I copied Lazy Programmer's printing functions and modified them a little. For the actual implementation of iterative policy evaluation, I made a specific function, and I wrote a simplified version compared to his. I get the same results as those of Lecture 51.
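
The policy dictionary described above maps each state to a single action. Later cells in this notebook use a slightly more general format, `{state: {action: probability}}`, so that stochastic policies fit too. Here is a small sketch of the conversion; `to_stochastic` is a helper name I made up for illustration.

```python
# Deterministic policy: one action per state (format described in Lecture 51)
pi_det = {(2, 0): "U", (1, 0): "U", (0, 0): "R"}

def to_stochastic(pi_det):
    # Wrap each action in a one-entry distribution {action: 1.0}
    return {s: {a: 1.0} for s, a in pi_det.items()}

pi_sto = to_stochastic(pi_det)

# A genuinely stochastic choice in one state, e.g. 50/50 between "U" and "R"
pi_sto[(2, 0)] = {"U": 0.5, "R": 0.5}

# Each state's action probabilities should sum to 1
for s, dist in pi_sto.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

print(pi_sto)
```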

**Lecture 52: Windy GridWorld**

* The idea here is to make a slight generalization where the transition probabilities are not just 0 or 1. In the instructor's code, the transitions $P(s'|s,a)$ are now encoded as a dictionary where the keys are the state-action pairs $(s,a)$, and where the corresponding values are dictionaries $\{s': P(s'|s,a); s'\in\mathcal{S}\}$. Although not immediately intuitive, this approach is better for generalizing iterative policy evaluation to stochastic policies, and for the "Windy GridWorld" environment.

* Windy GridWorld is a variant where some of the squares in the grid push the agent to another state. The code for Windy Gridworld is in the same file, while the code for stochastic policy evaluation is at: https://github.com/lazyprogrammer/machine_learning_examples/blob/master/rl/iterative_policy_evaluation_probabilistic.py.

* I will have to modify the GridWorld class to account for stochastic transitions. The *set()* method now needs to take the transition probabilities into account, and the *get_next_state()* method will not be used anymore. The *move()* method is now more interesting, and calls upon *np.random.choice()*.
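
Here is a minimal sketch of the nested-dictionary encoding and of sampling a next state from it. The two entries are the ones used for state (1,2) in the environment below; the helper `sample_next_state` and the seeded generator `rng` are my own additions for illustration (the class itself calls `np.random.choice()`).

```python
import numpy as np

# Transition probabilities keyed by (state, action); values are {next_state: probability}
transition_probs = {
    ((1, 2), "U"): {(0, 2): 0.5, (1, 3): 0.5},  # windy square: pushed to (1,3) half the time
    ((1, 2), "D"): {(2, 2): 1.0},
}

rng = np.random.default_rng(0)  # seeded generator (an assumption, for reproducibility)

def sample_next_state(s, a):
    dist = transition_probs[(s, a)]
    next_states = list(dist.keys())
    probs = list(dist.values())
    # Sample an index rather than a state: a list of (i, j) tuples would be
    # interpreted as a 2-D array instead of a list of candidates
    idx = rng.choice(len(next_states), p=probs)
    return next_states[idx]

samples = [sample_next_state((1, 2), "U") for _ in range(1000)]
frac_up = samples.count((0, 2)) / len(samples)
print(frac_up)
```

The printed fraction should be close to 0.5, up to sampling noise.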

### Windy GridWorld Class

In [2]:
##### WINDY GRIDWORLD #####
# Updated: 22/04/07, A. J. Zerouali
# The Windy GridWorld environment used in Lazy Programmer's course.
# This is a 3x4 grid, with wall at (1,1), +1 reward at the terminal 
# square (0,3), and -1 reward at the terminal square (1,3).
# For the "windy" variant, the main changes occur in the move() method.
# States are (i,j) tuples, actions are characters, containers are dictionaries.


# GridWorld_simple with only 3x4 grid. This is the environment.
class GridWorld_Windy_small():
    def __init__(self, rows, cols, ini_state, non_term_states, term_states, actions):
        # Attributes rows and cols are dimensions of the grid
        self.rows = rows
        self.cols = cols
        # Coordinates of agent
        self.i = ini_state[0]
        self.j = ini_state[1]
        # State and action spaces
        self.non_term_states = non_term_states
        self.term_states = term_states
        self.actions = actions 
        # The next attributes are populated using the set() method
        self.adm_actions = {}
        self.rewards = {}
        self.transition_probs = {}
        
    # Method setting up the actions, rewards, and transition probabilities
    def set(self, rewards, adm_actions, transition_probs):
        # INPUT: adm_actions: Dictionary of (i,j):[a_i] = (row,col):[action list]
        #        rewards: Dictionary of (i,j):r = (row,col):reward
        #        transition_probs: Dictionary of (i,j):{a_i:p_ij}= ...
        #                          .. (row,col):{dictionary of probs for each action}
        # WARNING: Do not confuse self.adm_actions with self.actions. Latter is the action space,
        #          adm_actions are the accessible actions from a state (dict. {s_i:[a_ij]}).
        self.rewards = rewards
        self.adm_actions = adm_actions
        self.transition_probs = transition_probs
    
    # Method that sets current state of agent
    def set_state(self, s):
        # INPUT: s: (i,j)=(row,col), coord. of agent
        self.i = s[0]
        self.j = s[1]

    # Method to return current state of agent
    def current_state(self):
        return (self.i, self.j)
    
    # Method to check if current agent state is terminal
    # Note: Lazy Prog does not store the terminal states explicitly
    def is_terminal(self, s):
        return (s in self.term_states)
    
    # HAS TO BE MODIFIED FOR WINDY GRIDWORLD
    # Method to perform action in environment
    def move(self, action):
        # Input:  action: New action to execute
        # Output: reward
        # Comments: - Requires transition probabilities. 
        #           - Calls numpy.random.choice(), doesn't work with dictionaries.
        
        # Check if action is admissible in current state
        if action in self.adm_actions[self.current_state()]:
            
            # Convert transition_probs to lists compatible with np.random.choice().
            # Recall self.transition_probs[(self.current_state(), action)] is a dictionary,
            # while np.random.choice() works with ints or ndarrays.
            next_states = list(self.transition_probs[(self.current_state(), action)].keys())
            next_states_probs = list(self.transition_probs[(self.current_state(), action)].values())
            
            # Generate a random index (this Numpy function is tricky)
            rand_ind = np.random.choice(a = len(next_states), p = next_states_probs)
            # Set new state of agent
            s_new = next_states[rand_ind] # Not necessary, for debug
            self.set_state(s_new)
        # END IF
   
        # Return reward. If not in given dictionary, return 0
        return self.rewards.get((self.i, self.j), 0)

    # Method to check if agent is currently in terminal state
    def game_over(self):
        # Output true if agent is in terminal states (0,3) or (1,3)
        return ( (self.i, self.j) in self.term_states)
    
    # Method returning all admissible states, i.e. not in the wall (1,1)
    def all_states(self):
        return (self.non_term_states | self.term_states )
# END CLASS

"""
#### Scrap for move() method
# Create lists compatible with np.random.choice().
# self.transition_probs[(self.current_state(), action)] is a dictionary
next_states = list(transition_probs[((1,2),"U")].keys())
next_states_probs = list(transition_probs[((1,2),"U")].values())
            
# Generate a random index (this numpy function is tricky)
rand_ind = np.random.choice(a = len(next_states), p = next_states_probs)
# Set new state of agent
s_new = next_states[rand_ind] # Not necessary, for debug
print(s_new)

"""
    
# Helper function to construct an environment.
# Consists mainly of initializations.
def windy_standard_grid(penalty=0):
    # Input: penalty: Float. Penalty for moving to non terminal state.
    # Output: env. GridWorld_Windy_small() object (the environment).
    
    # Start at bottom left (randomize later)
    ini_state = (2,0)
    # Action space 
    ACTION_SPACE = {"U", "D", "L", "R"}
    # Non terminal states
    NON_TERMINAL_STATES = {(0,0), (0,1), (0,2), (1,0), (1,2), (2,0), (2,1), (2,2), (2,3)}
    # Terminal states
    TERMINAL_STATES = {(0,3), (1,3)}
    
    # Instantiate:
    env = GridWorld_Windy_small(3, 4, ini_state, NON_TERMINAL_STATES, TERMINAL_STATES, ACTION_SPACE)

    
    # Dictionary of rewards
    # Not storing 0s if penalty=0
    rewards = {(0,3):1, (1,3): -1}
    # Populate non terminal states for penalty != 0
    if penalty != 0:
        for s in NON_TERMINAL_STATES:
            rewards[s] = penalty
    
    # Dictionary of admissible actions per state
    adm_actions = {
        (0,0): ("D", "R"),
        (0,1): ("L", "R"),
        (0,2): ("L", "R", "D"),
        (1,0): ("D", "U"),
        (1,2): ("U", "D", "R"),
        (2,0): ("U", "R"),
        (2,1): ("L", "R"),
        (2,2): ("U", "R", "L"),
        (2,3): ("U", "L"),
    }
    
    # Dictionary of transition probabilities
    # NOTE: I've modified the instructor's implementation.
    #       I've removed all tautologies (agent doesn't stay in current state).
    transition_probs = {
        ((2, 0), 'U'): {(1, 0): 1.0},
        ((2, 0), 'R'): {(2, 1): 1.0},
        
        ((1, 0), 'U'): {(0, 0): 1.0},
        ((1, 0), 'D'): {(2, 0): 1.0},
        
        ((0, 0), 'D'): {(1, 0): 1.0},
        ((0, 0), 'R'): {(0, 1): 1.0},
        
        ((0, 1), 'L'): {(0, 0): 1.0},
        ((0, 1), 'R'): {(0, 2): 1.0},
        
        ((0, 2), 'D'): {(1, 2): 1.0},
        ((0, 2), 'L'): {(0, 1): 1.0},
        ((0, 2), 'R'): {(0, 3): 1.0},
        
        ((2, 1), 'L'): {(2, 0): 1.0},
        ((2, 1), 'R'): {(2, 2): 1.0},
        
        ((2, 2), 'U'): {(1, 2): 1.0},
        ((2, 2), 'L'): {(2, 1): 1.0},
        ((2, 2), 'R'): {(2, 3): 1.0},
        
        ((2, 3), 'U'): {(1, 3): 1.0},
        ((2, 3), 'L'): {(2, 2): 1.0},
        
        ((1, 2), 'U'): {(0, 2): 0.5, (1, 3): 0.5},
        ((1, 2), 'D'): {(2, 2): 1.0},
        ((1, 2), 'R'): {(1, 3): 1.0},
    }
    
    # Assign missing environment attributes
    env.set(rewards, adm_actions, transition_probs)
    
    # Output line
    return env

### Printing functions

In [3]:
##### PRINTING FUNCTIONS #####
# 2022/04/06, AJ Zerouali
# Modified from Lazy Prog's GitHub

def print_values(Val_fn, env):
    print(f"## VALUE FUNCTION ##")
    for i in range(env.rows):
        print("------------------------")
        for j in range(env.cols):
            v = Val_fn.get((i,j), 0)
            if v >= 0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="") # -ve sign takes up an extra space
        print("")
    print("------------------------")

def print_policy(Pi_fn, env):
    # REMARK: WILL ONLY PRINT A DETERMINISTIC POLICY WITH {(i,j):{"action":1.0}}
    print(f"##  POLICY  ##")
    for i in range(env.rows):
        print("------------------------")
        for j in range(env.cols):
            if (i,j) in Pi_fn:
                # WARNING: Only the first action stored for the state is printed
                a = list(Pi_fn[(i,j)].keys())[0]
                print("  %s  |" % a, end="")
            else:
                # Wall (1,1) and terminal states have no action: print a blank cell
                print("  %s  |" % " ", end="")
        print("")
    print("------------------------")

"""
The policy looks like this:

pi = {
    (2, 0): {'U': 1.0},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'U': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'U': 1.0},
    (2, 3): {'L': 1.0},
  }

"""

### Iterative policy evaluation

A more general function for stochastic policies and non-trivial transitions.

In [4]:
##### ITERATIVE POLICY EVALUATION #####
## 2022/04/08, AJ Zerouali

def iter_policy_eval(Pi, V_ini, P_trans, Rwds, adm_actions, non_term_states, term_states, epsilon, gamma):
    # ARGUMENTS:
    #  Pi: Dict. Policy function to be evaluated, from main() function.
    #  V_ini: Dict. Initial value fn, from main() function.
    #  P_trans: Dict. Transition probabilities of MDP, from main() function.
    #  Rwds: Dict. Rewards keyed by arrival state (i,j); missing states give 0. From main() function.
    #  adm_actions: Dict. Admissible actions in a given state, from grid attributes.
    #  non_term_states: Set. Non terminal states, from grid attributes.
    #  term_states: Set. Terminal states only, from grid attributes.
    #  epsilon: Float. Convergence threshold (for sup norm of value function), from main() function.
    #  gamma: Float. Discount factor, from main() function.
    
    # OUTPUT:
    #  V_pi: Dict. Value function corresp. to Pi
    #  k: Number of iterations for convergence of policy eval.
    
    
    # INITIALIZATIONS
    # V_k and V_(k+1) ini. (get switched in while loop)
    V_new = dict(V_ini)  # Copy, so the caller's dictionary is not modified in place
    for s in term_states:
        V_new[s] = 0
    V_old = {}
    # Iteration counter ini
    k = 0
    # Stopping Boolean ini
    V_is_stable = False
    
    
    # MAIN LOOP
    # Iterates over k
    while not V_is_stable:
        
        # Initialize V_k and V_(k+1)
        V_old = V_new
        V_new = {}
        for s in term_states:
            V_new[s] = 0
        # Initialize sup|V_(k+1) - V_k|
        Delta_V = 0
        
        # EVALUATE V_(k+1)=V_new
        # Loop over non terminal states
        for s in non_term_states:  
            
            # COMPUTE V_(k+1)(s)
            
            # Initialize
            V_s_new = 0
            
            # Loop over admissible actions in state s
            for a in adm_actions[s]:
                
                # Add sum over s_ind only if pi(a|s) is non-zero:
                if (Pi[s].get(a,0) != 0):
                
                    # This loop is only over non-trivial transitions
                    for s_ind in P_trans[(s,a)].keys(): 
                        # UPDATE V_s_new
                        V_s_new += Pi[s].get(a,0)*P_trans[(s,a)].get(s_ind,0) \
                                    *( Rwds.get(s_ind,0) + gamma*V_old[s_ind] )  
                    # END FOR OVER s_ind
                    
                # END IF
                
            # END FOR OVER a
            
            # Assign V_(k+1)(s)
            V_new[s] = V_s_new
            
            # Update sup|V_(k+1) - V_k|
            Delta_V = max(Delta_V, abs(V_s_new-V_old.get(s,0)) )
            
        # END FOR OVER s     
        
        # Update stopping Boolean
        V_is_stable = (Delta_V < epsilon)
        
        # Update iteration counter
        k += 1
    # END WHILE
    
    # Return V_pi and number of iterations
    return V_new, k

### Main

This is where we run all of the above. I get the same results as Lazy Programmer's (5:56min in Lecture 53):

    ### Value Function ###
    ------------------------
     0.81| 0.90| 1.00| 0.00|
    ------------------------
     0.73| 0.00|-0.05| 0.00|
    ------------------------
     0.31|-0.04|-0.04|-0.04|
    ------------------------
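
As a sanity check, a few entries of this table can be recomputed by hand from the Bellman equation. This assumes the settings of the cell below ($\gamma=0.9$, zero penalty, $\pi(U|(2,0))=\pi(R|(2,0))=0.5$) and the 50/50 wind at $(1,2)$. Since $V_\pi((0,2))=1$ and the terminal states have value 0:

$$V_\pi((1,2)) = 0.5\left(0+\gamma\cdot 1\right) + 0.5\left(-1+\gamma\cdot 0\right) = 0.45-0.5 = -0.05$$

$$V_\pi((2,2)) = \gamma V_\pi((1,2)) = -0.045, \qquad V_\pi((2,1)) = \gamma V_\pi((2,2)) = -0.0405$$

With $V_\pi((1,0))=\gamma^3=0.729$, the start state gives:

$$V_\pi((2,0)) = 0.5\,\gamma V_\pi((1,0)) + 0.5\,\gamma V_\pi((2,1)) = 0.5\cdot 0.6561 - 0.5\cdot 0.03645 \approx 0.31$$

These are the exact fixed-point values. The iterative algorithm only gets within the threshold $\epsilon$ of them, and $-0.045$ sits right at the rounding boundary of the two-decimal printout.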

In [8]:
##### MAIN - ITERATIVE POLICY EVALUATION IN WINDY GRIDWORLD #####
# 2022/04/08, AJ Zerouali
# Loosely follows lectures 52-53.
#

# Create environment
# adm_actions, rewards and transition_probs are attributes of grid
grid = windy_standard_grid()


### The policy dictionary ###
pi = {
    (2, 0): {'U': 0.5, 'R': 0.5},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'U': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'U': 1.0},
    (2, 3): {'L': 1.0},
  }

### Initial value function ###
# Just a dictionary of 0s
V = {}
for s in grid.all_states():
    V[s] = 0

# Discount factor and convergence threshold
gamma = 0.9
epsilon = 1e-3

# Compute V_pi
# Signature: iter_policy_eval(Pi, V_ini, P_trans, Rwds, adm_actions, non_term_states, term_states, epsilon, gamma)
V_pi, N_iter = iter_policy_eval(pi, V, grid.transition_probs, grid.rewards, grid.adm_actions,\
                                grid.non_term_states, grid.term_states, epsilon, gamma)

# Print the value function obtained
print(f"For the policy pi specified above, iterative policy evaluation converged after N_iter={N_iter} iterations. V_pi is:")
print_values(V_pi, grid)

For the policy pi specified above, iterative policy evaluation converged after N_iter=6 iterations. V_pi is:
## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00|-0.05| 0.00|
------------------------
 0.31|-0.04|-0.04|-0.04|
------------------------


### Test case (non windy)

Below is a test case with deterministic (non-windy) transitions and a deterministic policy. It should give the following table of values:

        Value function
        
        ------------------------
         0.81| 0.90| 1.00| 0.00|
        ------------------------
         0.73| 0.00| 0.90| 0.00|
        ------------------------
         0.66| 0.73| 0.81| 0.73|
        ------------------------
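
The expected table can be checked by hand. With deterministic transitions, zero penalty, and the deterministic policy used in this test, only the final move into (0,3) is rewarded, so $V_\pi(s)=\gamma^{n-1}$ where $n$ is the number of moves from $s$ to (0,3). The dictionary `moves_to_goal` below is my own hand count along that policy, not something taken from the course code.

```python
gamma = 0.9

# Number of moves needed to reach (0,3) from each state under the deterministic
# test policy (counted by hand from the policy and transitions of this test)
moves_to_goal = {
    (0, 2): 1, (0, 1): 2, (0, 0): 3, (1, 0): 4, (2, 0): 5,
    (1, 2): 2, (2, 2): 3, (2, 1): 4, (2, 3): 4,
}

# Only the final move is rewarded (+1), so V(s) = gamma**(moves - 1)
V_expected = {s: gamma ** (n - 1) for s, n in moves_to_goal.items()}

for s in sorted(V_expected):
    print(s, "%.2f" % V_expected[s])
```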

In [5]:
##### (NON)WINDY GRIDWORLD (TEST) #####
# Updated: 22/04/08, A. J. Zerouali
# Test with non-windy case and deterministic policy

def test_standard_grid():
    # Start at bottom left (randomize later)
    ini_state = (2,0)
    # Action space 
    ACTION_SPACE = {"U", "D", "L", "R"}
    # Non terminal states
    NON_TERMINAL_STATES = {(0,0), (0,1), (0,2), (1,0), (1,2), (2,0), (2,1), (2,2), (2,3)}
    # Terminal states
    TERMINAL_STATES = {(0,3), (1,3)}
    
    # Instantiate:
    # 
    env = GridWorld_Windy_small(3, 4, ini_state, NON_TERMINAL_STATES, TERMINAL_STATES, ACTION_SPACE)

    
    # Dictionary of rewards
    # Not storing 0s
    rewards = {(0,3):1, (1,3): -1}
    
    # Dictionary of admissible actions per state
    adm_actions = {
        (0,0): ("D", "R"),
        (0,1): ("L", "R"),
        (0,2): ("L", "R", "D"),
        (1,0): ("D", "U"),
        (1,2): ("U", "D", "R"),
        (2,0): ("U", "R"),
        (2,1): ("L", "R"),
        (2,2): ("U", "R", "L"),
        (2,3): ("U", "L"),
    }
    
    # Dictionary of deterministic transitions:
    transition_probs = {
        ((2, 0), 'U'): {(1, 0): 1.0},
        ((2, 0), 'R'): {(2, 1): 1.0},
        
        ((1, 0), 'U'): {(0, 0): 1.0},
        ((1, 0), 'D'): {(2, 0): 1.0},
        
        ((0, 0), 'D'): {(1, 0): 1.0},
        ((0, 0), 'R'): {(0, 1): 1.0},
        
        ((0, 1), 'L'): {(0, 0): 1.0},
        ((0, 1), 'R'): {(0, 2): 1.0},
        
        ((0, 2), 'D'): {(1, 2): 1.0},
        ((0, 2), 'L'): {(0, 1): 1.0},
        ((0, 2), 'R'): {(0, 3): 1.0},
        
        ((2, 1), 'L'): {(2, 0): 1.0},
        ((2, 1), 'R'): {(2, 2): 1.0},
        
        ((2, 2), 'U'): {(1, 2): 1.0},
        ((2, 2), 'L'): {(2, 1): 1.0},
        ((2, 2), 'R'): {(2, 3): 1.0},
        
        ((2, 3), 'U'): {(1, 3): 1.0},
        ((2, 3), 'L'): {(2, 2): 1.0},
        
        ((1, 2), 'U'): {(0, 2): 1.0},
        ((1, 2), 'D'): {(2, 2): 1.0},
        ((1, 2), 'R'): {(1, 3): 1.0},
    }
    
    # Assign missing environment attributes
    env.set(rewards, adm_actions, transition_probs)
    
    # Output line
    return env

# Create environment
# adm_actions, rewards and transition_probs are attributes of grid

grid = test_standard_grid()


### The policy dictionary ###
pi = {
    (2, 0): {'U': 1.0},
    (1, 0): {'U': 1.0},
    (0, 0): {'R': 1.0},
    (0, 1): {'R': 1.0},
    (0, 2): {'R': 1.0},
    (1, 2): {'U': 1.0},
    (2, 1): {'R': 1.0},
    (2, 2): {'U': 1.0},
    (2, 3): {'L': 1.0},
  }

### Initial value function ###
# Just a dictionary of 0s
V = {}
for s in grid.all_states():
    V[s] = 0

# Discount factor and convergence threshold
gamma = 0.9
epsilon = 1e-3

# Compute V_pi
# Signature: iter_policy_eval(Pi, V_ini, P_trans, Rwds, adm_actions, non_term_states, term_states, epsilon, gamma)
V_pi, N_iter = iter_policy_eval(pi, V, grid.transition_probs, grid.rewards, grid.adm_actions,\
                                grid.non_term_states, grid.term_states, epsilon, gamma)

# Print the value function obtained
print(f"The Windy GridWorld deterministic test converged after N_iter={N_iter} iterations. V_pi is:")
print_values(V_pi, grid)

The Windy GridWorld deterministic test converged after N_iter=6 iterations. V_pi is:
## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00| 0.90| 0.00|
------------------------
 0.66| 0.73| 0.81| 0.73|
------------------------


## 3 - Policy improvement by policy iteration

The control step of dynamic programming is where we improve the policy. This is done iteratively and calls upon policy evaluation at each step. Value iteration is more efficient as it blends these two steps (see next section).

**Lecture 54: Policy improvement**
* Idea is to start from a policy and improve it iteratively (obviously). Lazy Programmer only considers a deterministic policy in lecture 54 (similar to sections 4.2 and 4.3 of Sutton-Barto).

* Suppose we have a deterministic policy $\pi$ for which the value function is $V_\pi(s) = Q_\pi\left(s,\pi(s)\right)$ (expectation reduces to one term only). Now consider the action $a'_s=\arg\max_aQ_\pi(s,a)$, which satisfies $Q_\pi(s,a'_s)\ge V_\pi(s)$. In other words, we get a state-action value at least as high by taking an action possibly different from $\pi(s)$ in the first step, while still following $\pi$ for the remaining steps of the episode.

* Going further, if we now decide to *always* use action $a'_s$ when in state $s$, we are in fact following a new policy $\tilde{\pi}$ such that $\tilde{\pi}(s')=\pi(s')$ for $s'\ne s$ and $\tilde{\pi}(s)=\arg\max_aQ_\pi(s,a)$. This is the basis of policy improvement.

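* The Policy Improvement Theorem states that if $Q_\pi\left(s,\tilde{\pi}(s)\right)\ge V_\pi(s)$ **for all** $s\in\mathcal{S}$, **then** $V_{\tilde{\pi}}(s)\ge V_\pi(s)$ **for all** $s\in\mathcal{S}$. When $\tilde{\pi}$ differs from $\pi$ at a single state $s_0$ only, the hypothesis holds with equality at every other state, so it is enough to check $Q_\pi\left(s_0,\tilde{\pi}(s_0)\right)\ge V_\pi(s_0)$. Thus, improving $\pi$ iteratively over the states guarantees obtaining a policy $\tilde{\pi}$ that is at least as good overall. This is the main idea behind policy iteration.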

* Note that the Policy Improvement Theorem is not that obvious. In particular, the Bellman equation for updates does not hold anymore since we're switching policies.

* Suppose we improve the policy $\pi$ to $\tilde{\pi}$ iteratively. The iteration has converged once $\tilde{\pi}=\pi$, which can be detected by checking that $V_{\tilde{\pi}}=V_\pi$. In that case, the value function satisfies **Bellman's optimality equation**:

$$V_\pi(s) = \max_a \left\{\sum_{s'}p(s'|s,a)\left(r(s,a,s')+\gamma\cdot V_\pi(s')\right)\right\}.$$

  The policy satisfying this equation is typically denoted $\pi^\ast$, and $V^\ast:=V_{\pi^\ast}$.
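
The improvement step and the statements above can be checked numerically on a toy problem. The sketch below uses a hypothetical 2-state MDP (states "A" and "B", my own invention) in the same dictionary format as the rest of the notebook. For brevity, `evaluate` updates the value dictionary in place, which differs from the two-dictionary version of `iter_policy_eval()`.

```python
gamma = 0.9

# Hypothetical 2-state MDP in the same dictionary format as the notebook
P = {
    ("A", "right"): {"B": 1.0},
    ("A", "exit"): {"LOSE": 1.0},
    ("B", "right"): {"WIN": 1.0},
    ("B", "left"): {"A": 1.0},
}
R = {"WIN": 1.0, "LOSE": -1.0}   # rewards keyed by arrival state
actions = {"A": ("right", "exit"), "B": ("right", "left")}

def q_value(V, s, a):
    # One-step lookahead: sum over s' of p(s'|s,a) * (r(s') + gamma * V(s'))
    return sum(p * (R.get(s2, 0.0) + gamma * V[s2]) for s2, p in P[(s, a)].items())

def evaluate(pi, tol=1e-10):
    # Iterative policy evaluation for a deterministic policy {state: action}
    V = {"A": 0.0, "B": 0.0, "WIN": 0.0, "LOSE": 0.0}
    while True:
        delta = 0.0
        for s, a in pi.items():
            v_new = q_value(V, s, a)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def greedy(V):
    # pi_tilde(s) = argmax_a Q_pi(s, a)
    return {s: max(acts, key=lambda a: q_value(V, s, a)) for s, acts in actions.items()}

pi = {"A": "exit", "B": "left"}   # deliberately poor starting policy
V_pi = evaluate(pi)
pi_new = greedy(V_pi)
V_new = evaluate(pi_new)

# Policy improvement theorem: the greedy policy is at least as good in every state
assert all(V_new[s] >= V_pi[s] for s in actions)
# Once the greedy step no longer changes the policy, V satisfies
# the Bellman optimality equation
assert greedy(V_new) == pi_new
assert all(abs(V_new[s] - max(q_value(V_new, s, a) for a in actions[s])) < 1e-9
           for s in actions)

print(pi_new, V_new)
```

On this toy problem a single improvement step already reaches a policy that the greedy step leaves unchanged; in general several evaluation/improvement rounds are needed.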

**Lecture 55: Policy iteration**

* This lecture covers the high level implementation of policy iteration, with an explicit algorithm for the theoretical considerations above. There are fewer things to note in this lecture and the next 2, so I'll just recap the important points.

* The optimal value function is unique, but that is not necessarily the case for optimal policies. The simple 3x4 GridWorld example starting from the bottom left (2,0) is a perfect illustration: Assuming all rewards aside from the terminal states are 0, there are 2 optimal paths to reach the winning state (0,3), one passing over the top of the wall (1,1) and one passing around its right side.

* When implementing policy iteration, a good practice would be to store the value function, and check whether a new policy improvement step changes the value function. Without this verification, there is a possibility that the policy iteration loop never ends, and just alternates between the 2 optimal policies.

**Lecture 56: Policy iteration (cont'd)**

* Lazy Programmer walks the listener through his implementation of policy iteration for (non windy) GridWorld.

* Some highlights: 4:00: Constructing a random policy. 5:00: Policy improvement. 6:25: Comments on Bellman optimality eq'n. 7:35: Comments on convergence.

**Lecture 57: Policy iteration in Windy GridWorld**

* Since there's a "wind push" in state (1,2), the optimal policy is less obvious. To see how rewards affect optimal policies, it's also useful to introduce penalties for visiting states other than terminal ones. (**Comment:** Note how this means I should modify the environment generation function.)

* In this lecture, Lazy Prog tests 2 environments, and gives a good illustration of how the optimal (deterministic) policy changes with different penalty values. He shows the cases where the penalties are 0, -0.1, -0.2, -0.4, -0.5 and -2. Notice how in the last 2 cases, the agent simply goes immediately to the losing state when near it, since the paths to the winning state are too long (leading to much lower cumulative returns).
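
To get a feel for why large penalties make the losing state attractive, here is a rough back-of-the-envelope sketch. It compares the immediate $-1$ for stepping from (2,3) into (1,3) with the discounted return of one particular wind-free route from (2,3) to (0,3), namely the 8-move detour along the bottom row, up the left column and along the top row. The function `detour_return` is my own illustration and only covers that single route; it is not a computation of the optimal policy.

```python
gamma = 0.9

def detour_return(penalty, n_moves=8):
    # n_moves - 1 arrivals in non-terminal squares (each paying `penalty`),
    # then a final arrival in the winning square (+1), all discounted
    step_costs = sum(penalty * gamma**t for t in range(n_moves - 1))
    return step_costs + gamma**(n_moves - 1) * 1.0

for pen in [0.0, -0.1, -0.2, -0.4, -0.5, -2]:
    print(pen, round(detour_return(pen), 3), "vs. -1 for stepping into (1,3)")
```

For small penalties the detour is still worth more than $-1$, while for a penalty of $-2$ it is far worse. Routes through the windy square (1,2) are shorter but risky, and are not covered by this estimate.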

**Functions that we'll need:**
1) Need a function that will implement policy improvement.

2) Policy iteration will call iterative policy evaluation and policy improvement function in (1).

3) Need a function that generates a random initial policy.

4) Need a function that compares two value functions.


### Random policy generator and value function comparator

In [5]:
##### RANDOM DETERMINISTIC POLICY GENERATOR #####
## 2022/04/08, AJ Zerouali
# Recall: rand_ind = np.random.choice(a = len(next_states), p = next_states_probs)

def gen_random_policy(env):
    # Input: env, GridWorld_Windy_small object (environment).
    # Output: Pi, a (deterministic) policy dictionary.
    non_term_states = env.non_term_states
    adm_actions = env.adm_actions
    Pi = {}
    
    for s in non_term_states:
        actions_list = list(adm_actions[s])
        a_random = actions_list[np.random.randint(len(actions_list))]
        Pi[s] = {a_random:1.0}
    
    return Pi

In [6]:
##### COMPARE VALUE FUNCTION #####
## 2022/04/22, AJ Zerouali

def compare_value_fns(V_old, V_new, non_term_states):
    # ARGUMENTS: - V_old and V_new: Dictionaries of 2 value functions to compare
    #            - non_term_states: Set of non-terminal states in the environment
    # OUTPUT: delta_V = sup_{s in S} |V_old(s)- V_new(s)|
    delta_V = 0
    for s in non_term_states:
        delta_V = max(delta_V, abs(V_old[s]-V_new[s]))
        
    return delta_V

### Policy improvement function

The main things to keep in mind:
* Create a dictionary for $Q_\pi$.
* Use the *max()* function to extract the argmax from $Q_\pi(s,\cdot)$ for each $s$. Syntax is as follows:

                {argmax in dict} = max(dict, key = dict.get)
* Working with $V_\pi$ is more memory efficient than $Q_\pi$.
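
A quick illustration of the argmax syntax, with made-up action values for a single state. Note the behaviour on ties: Python's `max()` returns the first maximal element it encounters, which for a dictionary means the first such key in insertion order.

```python
# Hypothetical action values Q_pi(s, .) for a single state s
Q_s = {"U": 0.73, "R": 0.81, "L": 0.66}
a_best = max(Q_s, key=Q_s.get)
print(a_best)  # R

# With ties, max() returns the first maximal key in the dictionary's iteration order
Q_tie = {"U": 0.81, "R": 0.81}
a_tie = max(Q_tie, key=Q_tie.get)
print(a_tie)  # U
```

This tie-breaking is one reason two equally good policies can both show up during policy iteration, as mentioned in the notes on Lecture 55.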


In [7]:
###########################################
## POLICY IMPROVEMENT - improve_policy() ##
###########################################
# 2022/04/22 - A. J. Zerouali
# This function is called in the main loop of the policy iteration algorithm.

def improve_policy(Pi, V_pi, P_trans, Rwds, adm_actions, non_term_states, term_states, gamma):
    
    
    # Initialize policy_is_stable
    policy_is_stable = True
    
    for s in non_term_states:
        
        # Store old action
        a_old = list(Pi[s].keys())[0]
        
        # Initialize Vs_dict (dictionary for Q_pi(s,-))
        Vs_dict = {}
        
        # Loop over admissible actions
        for a in adm_actions[s]:
            
            V_temp = 0
            
            # Loop over non-zero probability transitions
            # Accumulate Q_pi(s,a) by one-step lookahead
            for s_ind in P_trans[(s,a)].keys(): 
                V_temp += P_trans[(s,a)].get(s_ind,0)*\
                            ( Rwds.get(s_ind,0) + gamma*V_pi[s_ind] )
            # END FOR over s_ind
            
            # Store V_temp in Vs_dict
            Vs_dict[a] = V_temp     
            
        # END FOR over a in adm_actions[s]
        
        # Get argmax
        a_new = max(Vs_dict, key = Vs_dict.get)
        
        # Update policy with argmax:
        Pi[s] = {a_new:1.0}
        # Update V? Not necessary, gets evaluated again at beginning of loop
        
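        # If the greedy action differs from the old one in any state, the policy
        # has changed, so another evaluation/improvement round is needed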
        if a_old != a_new:
            policy_is_stable = False
        
    # END FOR s in non_term_states
        
    return Pi, policy_is_stable
###########################################
## END OF improve_policy()               ##
###########################################

### Policy iteration function

This function calls both the policy evaluation and policy improvement functions

In [8]:
################################
## POLICY ITERATION ALGORITHM ##
################################
# 2022/04/22, AJ Zerouali


def Policy_Iteration(Pi, V_ini, P_trans, Rwds, adm_actions, non_term_states, term_states, epsilon, gamma):
    
    

    # Initialize policy iteration counter
    # (policy_is_stable is assigned inside the loop before it is first read)
    N_iter = 0
    
    # Init. V_old
    V_old = V_ini

    # Loop until policy_is_stable = True
    while True:
        #######################
        ## POLICY EVALUATION ##
        #######################

        # Execute policy eval function
        V_new, k = iter_policy_eval(Pi, V_old, P_trans, Rwds, adm_actions, non_term_states, term_states, epsilon, gamma)

        # DEBUG:
        print(f"Policy evaluation fn iter_policy_eval() converged after {k} iterations.")

        ###########################################
        ## POLICY IMPROVEMENT - improve_policy() ##
        ###########################################

        Pi, policy_is_stable = improve_policy(Pi, V_new, P_trans, Rwds, adm_actions, non_term_states, term_states, gamma)

        # Break condition (Tricky)####
        # Update policy iteration counter
        N_iter += 1

        # Compare value functions:
        delta_V = compare_value_fns(V_old, V_new, non_term_states)
        # Update value function
        V_old = V_new

        # Stop as soon as the policy is stable or the value function has stopped changing
        if policy_is_stable or delta_V <= epsilon:
            break

    # END WHILE not policy_is_stable

    # Return the final value function, the final policy and the no. of policy iterations
    return V_new, Pi, N_iter

# END DEF Policy_Iteration

### Windy GridWorld with various penalties

In this part I'm attempting to reproduce the results of Lecture 57.

In [9]:
### Windy GridWorld with various penalties
# 2022/04/22, AJ Zerouali
### SIGNATURES:
# windy_standard_grid(penalty=0)
# Policy_Iteration(Pi_ini, V_ini, ---grid attributes---)
# print_values(Val_fn, env)
# print_policy(Pi, env)

# Discount factor and error threshold
gamma = 0.9
epsilon = 1e-3

# Penalty list
penalties = [0.0, -0.1, -0.2, -0.4, -0.5, -2]

# Loop over penalties
for pen in penalties:
    
    # Create environment
    grid = windy_standard_grid(penalty=pen)
    print(f"Windy GridWorld environment with penalty = {pen} created ... \n")

    # Initialize policy
    Pi = gen_random_policy(grid)

    # Print initial (random) policy
    print(f"Printing initial policy ...")
    print_policy(Pi, grid)
    print("\n")

    # Initialize value function
    V_ini = {}
    for s in (grid.non_term_states | grid.term_states):
        V_ini[s] = 0

    ##############################
    ## EXECUTE POLICY ITERATION ##
    ##############################

    print(f"Executing policy iteration algorithm ...")
    # SIGNATURE: Policy_Iteration(Pi, V_ini, P_trans, Rwds, adm_actions, non_term_states, term_states, epsilon, gamma)
    (V_star, Pi_star, N_iter) = Policy_Iteration(Pi, V_ini, grid.transition_probs, grid.rewards, grid.adm_actions,
                                                grid.non_term_states, grid.term_states, epsilon, gamma)


    ###################
    ## PRINT RESULTS ##
    ###################

    # Print N_iter
    print(f"Policy_Iteration() converged after {N_iter} iterations ...\n")

    # Print optimal value function
    print(f"Printing optimal value function ...")
    print_values(V_star, grid)

    # Print optimal (deterministic) policy
    print(f"Printing optimal policy ...")
    print_policy(Pi_star, grid)
    
    # Separator
    print("_____________________________________________\n\n")

Windy GridWorld environment with penalty = 0.0 created ... 

Printing initial policy ...
##  POLICY  ##
------------------------
  R  |  L  |  D  |
------------------------
  U  |     |  R  |
------------------------
  U  |  R  |  L  |  U  |
------------------------


Executing policy iteration algorithm ...
Policy evaluation fn iter_policy_eval() converged after 3 iterations.
Policy evaluation fn iter_policy_eval() converged after 2 iterations.
Policy evaluation fn iter_policy_eval() converged after 2 iterations.
Policy evaluation fn iter_policy_eval() converged after 2 iterations.
Policy evaluation fn iter_policy_eval() converged after 4 iterations.
Policy evaluation fn iter_policy_eval() converged after 3 iterations.
Policy_Iteration() converged after 6 iterations ...

Printing optimal value function ...
## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00| 0.48| 0.00|
------------------------
 0.66| 0.59| 0.53| 0.48|
----------

## 4 - Value iteration

**Lecture 58:**

* From a practical standpoint, policy iteration consists of 2 nested loops, each waiting for a different quantity to converge: the inner loop for the value function of the current policy, and the outer loop for the policy itself. This is computationally expensive when the state and action spaces are large.

* There are 2 ideas we can use to improve on the previous algorithm. First, the policy is not really needed during the iterations. Although we are ultimately interested in the argmax:

    $\pi_{k+1}(s) = \arg\max_a \sum_{s'}T(s'|s,a)\{r(s,a)+\gamma V_k(s')\},$
    
    what we really use is the quantity:
    
    $V_{k+1}(s) = \max_a \sum_{s'}T(s'|s,a)\{r(s,a)+\gamma V_k(s')\}.$

    The second idea is to limit the number of policy evaluation iterations, i.e. to stop the inner evaluation loop before it has converged. (**Comment:** I'm not sure that I fully get this point. My guess is that, instead of re-computing the value function of a policy at each iteration, we could work with a stored value function, find the argmax at each given $s\in\mathcal S$, and update the value function at that state.)

* Here's the pseudocode of value iteration:

        Initialize V(s) with V(s_terminal) = 0, fix threshold epsilon
        Loop:
            Delta = 0
            for s in non_term_states:
                V_old = V(s)
                V(s) = max_a sum_s' T(s'|s,a)[r+gamma*V(s')]
                Delta = max(Delta, |V_old - V(s)|)
            if Delta < epsilon:
                break
        
        for s in non_term_states:
            pi*(s) = argmax_a sum_s' T(s'|s,a)[r+gamma*V(s')]

    Notice here that the first loop computes (an approximation of) $V^\ast = V_{\pi^\ast}$, and the final "for" loop extracts the optimal policy $\pi^\ast$ from it.

* The main advantage of value iteration is that there is only one open-ended loop. One subtlety arises when there's more than one optimal policy: the argmax is then not unique, and the extracted policy depends on how ties are broken.

* Interesting point: Value iteration was proposed by Bellman in 1957. The crux is that the RHS of the Bellman optimality equation is treated as an update rule.
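
The two ideas from Lecture 58 can be checked numerically on a toy problem. The sketch below is only an illustration: the two-state MDP, the names `q_value`, `max_backup` and `greedy_then_one_eval_sweep`, and the dictionary layout (`P[(s, a)]` mapping next states to probabilities, `R` indexed by the state entered) are assumptions chosen to mimic this notebook's conventions, not code from the course. For synchronous updates, it suggests that a greedy policy improvement followed by a *single* sweep of policy evaluation gives exactly the max update used by value iteration.

```python
# Toy MDP (hypothetical): two non-terminal states "A", "B" and a terminal state "T".
gamma = 0.9
states = ("A", "B")
actions = {"A": ("stay", "go"), "B": ("stay", "go")}
# P[(s, a)] = {s_next: probability}
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"B": 0.8, "A": 0.2},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"T": 1.0},
}
# Reward collected on entering a state (zeros are not stored)
R = {"T": 1.0}


def q_value(V, s, a):
    """One-step lookahead: sum_s' T(s'|s,a) * (r + gamma * V(s'))."""
    return sum(p * (R.get(s2, 0.0) + gamma * V[s2]) for s2, p in P[(s, a)].items())


def max_backup(V):
    """Synchronous value iteration sweep: V_{k+1}(s) = max_a Q_k(s, a)."""
    V_next = dict(V)
    for s in states:
        V_next[s] = max(q_value(V, s, a) for a in actions[s])
    return V_next


def greedy_then_one_eval_sweep(V):
    """Greedy policy improvement followed by ONE sweep of policy evaluation."""
    pi = {s: max(actions[s], key=lambda a: q_value(V, s, a)) for s in states}
    V_next = dict(V)
    for s in states:
        V_next[s] = q_value(V, s, pi[s])
    return V_next


V = {"A": 0.0, "B": 0.0, "T": 0.0}
for k in range(50):
    V_vi = max_backup(V)
    V_pi = greedy_then_one_eval_sweep(V)
    assert V_vi == V_pi  # both routes produce the same V_{k+1}
    V = V_vi

print({s: round(v, 3) for s, v in V.items()})
```

The in-place variant in the pseudocode above differs only in that values updated earlier in a sweep are reused later in the same sweep.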

**Lecture 59: Value iteration in code**

* After writing my own implementation (without looking at lecture 59), I re-executed the cases studied previously: Windy GridWorld with various penalties and the non-windy GridWorld. I obtained the same results as before.

* Note that Value Iteration considerably reduces the execution time.

* The next subsection contains my implementation of value iteration. The next ones are where I do the tests.
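
One implementation detail is worth isolating before reading the code: in Python, `V_old = V_new` binds a second name to the same dictionary rather than copying it, so a sweep written this way updates the value function in place. The sketch below uses a hypothetical three-state chain (every name in it is an illustrative assumption, not the notebook's environment class) to compare in-place sweeps with synchronous sweeps that read from a snapshot. Both reach the same fixed point here; only the number of sweeps differs.

```python
gamma, epsilon = 0.9, 1e-3
# Hypothetical chain 0 -> 1 -> 2 -> "T" (terminal); reward 1 on entering "T".
order = (2, 1, 0)                 # sweep order over the non-terminal states
actions = ("stay", "right")
nxt = {0: 1, 1: 2, 2: "T"}
R = {"T": 1.0}


def q_value(V, s, a):
    """Deterministic one-step lookahead for the chain."""
    s2 = s if a == "stay" else nxt[s]
    return R.get(s2, 0.0) + gamma * V[s2]


def run(in_place):
    """Return (V, number of sweeps performed, including the final one)."""
    V = {0: 0.0, 1: 0.0, 2: 0.0, "T": 0.0}
    sweeps = 0
    while True:
        # Alias (same dict) for in-place sweeps, snapshot for synchronous ones
        V_read = V if in_place else dict(V)
        delta = 0.0
        for s in order:
            v_before = V[s]
            V[s] = max(q_value(V_read, s, a) for a in actions)
            delta = max(delta, abs(v_before - V[s]))
        sweeps += 1
        if delta < epsilon:
            return V, sweeps


d1 = {"x": 0.0}
d2 = d1                    # alias, not a copy
d2["x"] = 1.0
print(d1 is d2, d1["x"])   # True 1.0

V_in, n_in = run(in_place=True)
V_sync, n_sync = run(in_place=False)
print(n_in, n_sync)        # 2 4
```

With this sweep order the in-place version propagates the terminal reward through the whole chain in a single sweep, which is why it needs fewer sweeps; with the opposite order the two versions would behave alike.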


### a) Value Iteration Algorithm

In [4]:
################################
## VALUE ITERATION ALGORITHM  ##
################################
## 2022/04/24, AJ Zerouali
# This is Bellman's famous algorithm of 1957.
# REMARK: Check the break condition is correct.

def Value_Iteration(env, epsilon, gamma):
    # ARGUMENTS: - env: Environment. Gives the state and action spaces.
    #            - epsilon: Convergence threshold
    #            - gamma: Discount factor
    # OUTPUT (in return order):
    #            - V_star: Optimal value function
    #            - N_iter: No. of sweeps before the final, converged one
    #            - Pi_star: Optimal policy
    # NOTE: Pi_star is obtained from the actions that gave the last update of V_new = V_star
    
    # Initialize env. attributes
    term_states = env.term_states
    non_term_states = env.non_term_states
    P_trans = env.transition_probs
    Rwds = env.rewards
    adm_actions = env.adm_actions
    
    # Initialize V and Q to zero
    V_new = {}
    Q = {}
    for s in term_states:
        V_new[s] = 0.0
    for s in non_term_states:
        V_new[s] = 0.0
        Q[s] = {}
        for a in adm_actions[s]:
            Q[s][a] = 0.0
            
    # DEBUG: Initialize Pi_star
    Pi_star = {}
    
    # Init. iteration counter
    N_iter = 0
    
    ## MAIN LOOP
    while True:
        
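        # NOTE: V_old is an alias of V_new (same dict, no copy), so the sweep below is
        # in-place: states updated earlier in the sweep are already used by later ones
        V_old = V_new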
        
        Delta_V = 0.0
        
        for s in non_term_states:
            
            # Store V_old(s)
            Vs_old = V_old[s]
            
            # This loop computes V(s)
            for a in adm_actions[s]:
                
                # Init. Q_sa
                Q_sa = 0.0
                
                # This loop computes Q(s,a)
                # Loop only over non-zero probability transitions
                for s_ind in P_trans[(s,a)].keys():
                    # Bellman update
                    # Template: V_temp += P_trans[(s,a)].get(s_ind,0)*( Rwds.get(s_ind,0) + gamma*V_pi[s_ind] )
                    Q_sa += P_trans[(s,a)].get(s_ind,0)*( Rwds.get(s_ind,0) + gamma*V_old[s_ind] )
                
                # Update Q(s,a)
                Q[s][a] = Q_sa
                
            # END FOR a in adm_actions[s]
            
            # Get max over a's
            V_new[s] = max(Q[s].values())
            
            # Pi_star debug:
            # Store argmax
            a_star = max(Q[s], key = Q[s].get)
            Pi_star[s] = {a_star:1.0}
            
            # Update Delta_V
            Delta_V = max(Delta_V, abs(Vs_old - V_new[s]))
            
        # END FOR s in non_term_states
        
        if Delta_V < epsilon:
            break
        
        # Update iteration counter
        N_iter += 1
        
    # END WHILE
    
    # Return optimal value fn, no. of iterations and greedy policy
    return V_new, N_iter, Pi_star

# END DEF Value_Iteration()


################################
##      FIND OPTIMAL POLICY   ##
################################
## This function simply extracts argmaxes.
## Should necessarily be executed after value iteration.

def Get_Pi_Star(V_star, env, epsilon, gamma):
    # ARGUMENTS: - V_star: Optimal value function
    #            - env: Environment. Gives the state and action spaces.
    #            - epsilon: Convergence threshold (unused in this function)
    #            - gamma: Discount factor
    # OUTPUT:    - Pi := Pi_star, optimal policy
    
    # Init. env. attributes
    P_trans = env.transition_probs
    Rwds = env.rewards
    non_term_states = env.non_term_states
    term_states = env.term_states
    adm_actions = env.adm_actions

    # Init. Q and Pi (output)
    Pi = {}
    Q = {}
    for s in non_term_states:
        Pi[s] = {}
        Q[s] = {}
        for a in adm_actions[s]:
            Q[s][a] = 0.0
    
    # Compute Q(s,a) for every admissible action, then extract the argmax.
    # Recall that the argmax over a dictionary is given by
    # a_new = max(Vs_dict, key = Vs_dict.get)
    for s in non_term_states:
        
        for a in adm_actions[s]:
            
            Q_sa = 0.0
            
            # Loop over non-zero transitions
            for s_ind in P_trans[(s,a)].keys():
                
                # Bellman equation
                Q_sa += P_trans[(s,a)].get(s_ind,0)*(Rwds.get(s_ind, 0)+gamma*V_star[s_ind])
                
            # END FOR s_ind in admissible
            
            # Store above sum
            Q[s][a] = Q_sa
            
        # END FOR a in adm_actions[s]
        
        # Get argmax and store in Pi
        a_star = max(Q[s], key = Q[s].get)
        Pi[s] = {a_star:1.0}
        #Pi[s] = {max(Q[s], key = Q[s].get):1.0}
        
    # END FOR s in non_term_states    
    
    # Return optimal policy
    return Pi

# END DEF Get_Pi_Star()
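
Both functions above extract the greedy action with `max(Q[s], key = Q[s].get)`. When several actions attain the same maximal value, which is the situation of multiple optimal policies, Python's `max` returns the first maximal key in the dictionary's iteration order, so the printed policy depends on the order in which the actions were inserted. The snippet below is a small illustration with made-up Q-values; `all_greedy_actions` and its tolerance `tol` are hypothetical helpers, not part of the notebook.

```python
import math


def all_greedy_actions(Q_s, tol=1e-9):
    """Return every action whose value is within tol of the maximum."""
    best = max(Q_s.values())
    return [a for a, q in Q_s.items() if math.isclose(q, best, rel_tol=0.0, abs_tol=tol)]


# Made-up values with an exact tie between "U" and "R"
Q_s = {"U": 0.6561, "R": 0.6561, "L": 0.5}

# max() keeps the first maximal key it encounters (insertion order for dicts)
print(max(Q_s, key=Q_s.get))        # U
print(all_greedy_actions(Q_s))      # ['U', 'R']

# Same values, different insertion order: a different, equally good action is picked
Q_s_reordered = {"R": 0.6561, "U": 0.6561, "L": 0.5}
print(max(Q_s_reordered, key=Q_s_reordered.get))  # R
```

Listing all maximizers, as `all_greedy_actions` does, is one way to see whether two different printed policies are in fact equally good.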

### b) Non-windy GridWorld

The optimal value function is:

        Value function
        
        ------------------------
         0.81| 0.90| 1.00| 0.00|
        ------------------------
         0.73| 0.00| 0.90| 0.00|
        ------------------------
         0.66| 0.73| 0.81| 0.73|
        ------------------------
        
The optimal policy is:

        ##  POLICY  ##
        ------------------------
          R  |  R  |  R  |
        ------------------------
          U  |     |  U  |
        ------------------------
          U  |  R  |  U  |  L  |
        ------------------------

In [5]:
##### (NON)WINDY GRIDWORLD (TEST) #####
# Updated: 22/04/24, A. J. Zerouali
# Test with non-windy case and deterministic policy
# Adapted from previous cell for Value Iteration.

def test_standard_grid():
    # Start at bottom left (randomize later)
    ini_state = (2,0)
    # Action space 
    ACTION_SPACE = {"U", "D", "L", "R"}
    # Non terminal states
    NON_TERMINAL_STATES = {(0,0), (0,1), (0,2), (1,0), (1,2), (2,0), (2,1), (2,2), (2,3)}
    # Terminal states
    TERMINAL_STATES = {(0,3), (1,3)}
    
    # Instantiate the environment
    env = GridWorld_Windy_small(3, 4, ini_state, NON_TERMINAL_STATES, TERMINAL_STATES, ACTION_SPACE)

    
    # Dictionary of rewards
    # Not storing 0s
    rewards = {(0,3):1, (1,3): -1}
    
    # Dictionary of admissible actions per state
    adm_actions = {
        (0,0): ("D", "R"),
        (0,1): ("L", "R"),
        (0,2): ("L", "R", "D"),
        (1,0): ("D", "U"),
        (1,2): ("U", "D", "R"),
        (2,0): ("U", "R"),
        (2,1): ("L", "R"),
        (2,2): ("U", "R", "L"),
        (2,3): ("U", "L"),
    }
    
    # Dictionary of deterministic transitions:
    transition_probs = {
        ((2, 0), 'U'): {(1, 0): 1.0},
        ((2, 0), 'R'): {(2, 1): 1.0},
        
        ((1, 0), 'U'): {(0, 0): 1.0},
        ((1, 0), 'D'): {(2, 0): 1.0},
        
        ((0, 0), 'D'): {(1, 0): 1.0},
        ((0, 0), 'R'): {(0, 1): 1.0},
        
        ((0, 1), 'L'): {(0, 0): 1.0},
        ((0, 1), 'R'): {(0, 2): 1.0},
        
        ((0, 2), 'D'): {(1, 2): 1.0},
        ((0, 2), 'L'): {(0, 1): 1.0},
        ((0, 2), 'R'): {(0, 3): 1.0},
        
        ((2, 1), 'L'): {(2, 0): 1.0},
        ((2, 1), 'R'): {(2, 2): 1.0},
        
        ((2, 2), 'U'): {(1, 2): 1.0},
        ((2, 2), 'L'): {(2, 1): 1.0},
        ((2, 2), 'R'): {(2, 3): 1.0},
        
        ((2, 3), 'U'): {(1, 3): 1.0},
        ((2, 3), 'L'): {(2, 2): 1.0},
        
        ((1, 2), 'U'): {(0, 2): 1.0},
        ((1, 2), 'D'): {(2, 2): 1.0},
        ((1, 2), 'R'): {(1, 3): 1.0},
    }
    
    # Assign missing environment attributes
    env.set(rewards, adm_actions, transition_probs)
    
    # Output line
    return env

# Create environment
# adm_actions, rewards and transition_probs are attributes of grid

grid = test_standard_grid()

print(f"Test with non-windy GridWorld environment ... \n")

# Discount factor and error threshold
gamma = 0.9
epsilon = 1e-3

##############################
## EXECUTE VALUE ITERATION  ##
##############################

print(f"Executing value iteration algorithm ...")
# SIGNATURE: V_star, N_iter, Pi_star = Value_Iteration(env, epsilon, gamma)
(V_star, N_iter, Pi_star) = Value_Iteration(grid, epsilon, gamma)

##############################
##  GET OPTIMAL POLICY      ##
##############################

# SIGNATURE: Pi_star = Get_Pi_Star(V_star, env, epsilon, gamma)
Pi_computed = Get_Pi_Star(V_star, grid, epsilon, gamma)


###################
## PRINT RESULTS ##
###################

# Print N_iter
print(f"Value_Iteration() converged after {N_iter} iterations ...\n")

# Print optimal value function
print(f"Printing optimal value function ...")
print_values(V_star, grid)

# Print greedy policy tracked inside Value_Iteration()
print(f"Printing policy obtained from Value_Iteration()...")
print_policy(Pi_star, grid)

# Print policy extracted from V_star by Get_Pi_Star()
print(f"Printing policy obtained from Get_Pi_star()...")
print_policy(Pi_computed, grid)

Test with non-windy GridWorld environment ... 

Executing value iteration algorithm ...
Value_Iteration() converged after 3 iterations ...

Printing optimal value function ...
## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00| 0.90| 0.00|
------------------------
 0.66| 0.73| 0.81| 0.73|
------------------------
Printing policy obtained from Value_Iteration()...
##  POLICY  ##
------------------------
  R  |  R  |  R  |
------------------------
  U  |     |  U  |
------------------------
  U  |  R  |  U  |  L  |
------------------------
Printing policy obtained from Get_Pi_star()...
##  POLICY  ##
------------------------
  R  |  R  |  R  |
------------------------
  U  |     |  U  |
------------------------
  U  |  R  |  U  |  L  |
------------------------


### c) Windy GridWorld with various penalties

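Here I am redoing the cases of Lecture 57 with value iteration. We get the same results with fewer iterations: for penalty = 0, for instance, value iteration reports 5 iterations, whereas policy iteration needed 6 iterations with a total of 16 reported policy evaluation iterations.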

In [6]:
### Windy GridWorld with various penalties
## This time using Value Iteration Algorithm instead of Policy Iteration
# 2022/04/24

# Discount factor and error threshold
gamma = 0.9
epsilon = 1e-3

# Penalty list
penalties = [0.0, -0.1, -0.2, -0.4, -0.5, -2]

# Loop over penalties
for pen in penalties:
    
    # Create environment
    grid = windy_standard_grid(penalty=pen)
    print(f"Windy GridWorld environment with penalty = {pen} created ... \n")

    ##############################
    ## EXECUTE VALUE ITERATION  ##
    ##############################

    print(f"Executing value iteration algorithm ...")
    # SIGNATURE: V_star, N_iter, Pi_star = Value_Iteration(env, epsilon, gamma)
    (V_star, N_iter, Pi_star) = Value_Iteration(grid, epsilon, gamma)

    ##############################
    ##  GET OPTIMAL POLICY      ##
    ##############################

    # SIGNATURE: Pi_star = Get_Pi_Star(V_star, env, epsilon, gamma)
    Pi_computed = Get_Pi_Star(V_star, grid, epsilon, gamma)


    ###################
    ## PRINT RESULTS ##
    ###################

    # Print N_iter
    print(f"Value_Iteration() converged after {N_iter} iterations ...\n")

    # Print optimal value function
    print(f"Printing optimal value function ...")
    print_values(V_star, grid)

    # Print greedy policy tracked inside Value_Iteration()
    print(f"Printing policy obtained from Value_Iteration()...")
    print_policy(Pi_star, grid)

    # Print policy extracted from V_star by Get_Pi_Star()
    print(f"Printing policy obtained from Get_Pi_star()...")
    print_policy(Pi_computed, grid)
    
    # Separator
    print("_____________________________________________\n\n")

Windy GridWorld environment with penalty = 0.0 created ... 

Executing value iteration algorithm ...
Value_Iteration() converged after 5 iterations ...

Printing optimal value function ...
## VALUE FUNCTION ##
------------------------
 0.81| 0.90| 1.00| 0.00|
------------------------
 0.73| 0.00| 0.48| 0.00|
------------------------
 0.66| 0.59| 0.53| 0.48|
------------------------
Printing policy obtained from Value_Iteration()...
##  POLICY  ##
------------------------
  R  |  R  |  R  |
------------------------
  U  |     |  D  |
------------------------
  U  |  L  |  L  |  L  |
------------------------
Printing policy obtained from Get_Pi_star()...
##  POLICY  ##
------------------------
  R  |  R  |  R  |
------------------------
  U  |     |  D  |
------------------------
  U  |  L  |  L  |  L  |
------------------------
_____________________________________________


Windy GridWorld environment with penalty = -0.1 created ... 

Executing value iteration algorithm ...
Value_Itera