#  A different kind of Dynamic Programming: Value Iteration.

- Dynamic Programming: We know all there is to know about the model and we want to find the optimal policy.
- Iterative Process: Starting from a random guess for the values, we repeatedly apply the Bellman Operator and exploit the fact that it is a _contraction_.

![image.png](attachment:image.png)


The next approximation of the values ($v_{k+1}(S)$) is the Bellman Operator applied to the current approximation ($v_k(S)$). 
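Written out, and assuming the same notation as the docstring of `update_values` further down (transition probability $p$, discount $\gamma$), one step of the iteration reads:

$$
v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_k(s') \,\bigr]
$$

For $\gamma < 1$ this map shrinks the max-norm distance between any two value functions by at least a factor $\gamma$, which is why repeated application converges to a unique fixed point.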

This does not impose any constraints on the dynamics of the environment (there can be loops!), so it is more general than the "start from the end" approach we used for the Travelling Salesman.

__Q__: Is there a "proper" ordering for selecting the states to update? 

# The GridWorld Environment in Episodic tasks:

"Gridworld" is a paradigmatic environment for simple RL worlds: a square-cell world (... a grid), where an agent learns to find the optimal path from an initial state S to one (or more) goal states, sometimes avoiding dangers. 

It can be defined in many ways, but we will use the _episodic_ gridworld: once the agent arrives at a goal state, it receives a reward R and it _stops moving_. Goals are terminal states!

The agent can try and move up, down, left, right _but_ 

- certain sites are blocked 
- the agent cannot go outside the initial perimeter. 

If the agent attempts a forbidden move, it stays still.
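The "stay still" rule can be illustrated on a one-dimensional corridor, which is deliberately simpler than the grid used in the exercise. The function `step_1d` and the choice of marking blocked cells with -1 are assumptions made for this sketch only.

```python
import numpy as np

# Hypothetical corridor: 0 = free cell, -1 = blocked cell.
corridor = np.array([0, 0, -1, 0, 0])

def step_1d(pos, move, corridor):
    """Return the position reached from `pos` when trying to move by `move` (+1 or -1)."""
    new_pos = pos + move
    # Leaving the corridor is forbidden: stay still.
    if new_pos < 0 or new_pos >= len(corridor):
        return pos
    # Entering a blocked cell is forbidden: stay still.
    if corridor[new_pos] == -1:
        return pos
    return new_pos

print(step_1d(0, -1, corridor))  # tries to leave the corridor
print(step_1d(1, +1, corridor))  # tries to enter the blocked cell
print(step_1d(3, +1, corridor))  # allowed move
```

The two-dimensional case follows the same pattern, with one boundary check per axis.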


## Exercise

Let us try and code all we need to solve GridWorld. We are going to need a few things.
- A function to construct the matrix describing the world ("_new_world_").
- A function to code the transition (move) of the agent ("_p_transition_deterministic_").
- A function to return the reward of a state/action pair.
- A function to implement one single step of value iteration.

In [None]:
import numpy as np
# TYPICAL (GRID)WORLD


def new_world(Lx, Ly, Nblocks, goal, rewards):
    """
    Construct a gridworld of width Lx and height Ly, 
    with a number of blocks Nblocks (to be distributed randomly)
    a list of (row, column) tuples for the goal positions, and a list of the corresponding rewards 
    """
    
    # Checks that the number of goals is consistent with the number of rewards
    assert len(goal) == len(rewards)
    
    # Constructs the empty matrix
    World = np.zeros((Ly,Lx))
    
    # Fill the empty matrix with Nblocks blocks and goals
    # -------------
    # ADD HERE!
    # -------------
    
    # Fill the entries of the matrix with:
    # -1 - if site is a block
    # rewards[i] if site is in position goal[i]
    # 0 if site is neither a block nor a goal
    
    return World


In [None]:
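# Just plot a new gridworld to see if it works!
# (plot_world is defined in the plotting-helpers cell at the end of this notebook: run that cell first.)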

Lx = 10
Ly = 15
Nblocks=10
goal = [(Ly-1,Lx-1)]
rewards = [1]

World = new_world(Lx, Ly, Nblocks, goal, rewards)

print("Visual representation of the gridworld:")
plot_world(World)

print("Matrix representation of the gridworld: ")
print(World)

# GridWorld as an MDP:

__State__ S: Position S=(i,j)

__Action__ A: Discrete. Up, Down, Left, Right     [ A=((+1,0), (-1,0), (0, -1), (0, +1)) ]

__Transition__ p: Deterministic.

p(S' | S, A) = 
- 1  _if    S' == S+A     and S' is allowed_
- 0  _else_

__Reward__ R: Only on reaching a goal (=terminal state!): 0 everywhere, R when the agent moves _into_ a Goal state.
      


# Transition and Rewards

 - Transitions: Given a state S and an action A, we return the new state S', making sure that forbidden moves leave the agent where it is.

 - Rewards: Given a state S, an action A and the new state S', we return the reward R received for that transition.

__PS: Achtung!__
The convention for python arrays is $A[i_y, i_x]$, where $i_y$ is the row-index and $i_x$ is the column-index... So the convention for the _up_, _down_, _right_ and _left_ directions is consistent, but may be a bit confusing: Use special care!
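A quick check of the indexing convention, as a small sketch with made-up sizes (the array `A` and the variable `step` are only for illustration):

```python
import numpy as np

Lx, Ly = 4, 3                      # width (columns) and height (rows)
A = np.zeros((Ly, Lx), dtype=int)  # the shape is (rows, columns) = (Ly, Lx)

i_y, i_x = 1, 2
A[i_y, i_x] = 7                    # row 1, column 2

step = np.array([1, 0])            # the first component acts on the row index
new_pos = np.array([i_y, i_x]) + step

print(A.shape)
print(new_pos)
print(A)
```

Note that `imshow` draws row 0 at the top by default, so an action that increases the row index appears as a downward move in the plots.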

In [None]:
# The list of actions I can take: Actions = [Up, Down, Right, Left]
Actions = np.array([[1,0],[-1,0],[0,1],[0,-1]])

def p_transition_deterministic(S, A, World):
    """
    Takes the current position S and selected action A,
    and returns the resulting new S given a world World.
    """
    # Find the new position
    # ADD HERE!
    
    S_new = ...
    
    # Find the new position with the constraints that:
    # S_new can never go out of the world boundaries!
    # S_new can never be on a block
    
    return S_new

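# Note: this function reuses the name `rewards`, which an earlier cell bound to the list
# of goal rewards; use a different name for that list when calling new_world again.
def rewards(S, A, S_new, World):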
    """
    Takes the current position S and selected action A,
    and returns the resulting reward given that it has ended up in S_new and 
    the gridworld is World.
    """
    # Find the reward associated to the new position
    # ADD HERE!
    
    return reward


# Bellman Operator code: one-step update.

We are now going to code a single update with the Bellman Operator.


In [None]:
def update_values(Values, World, gamma):
    """
    Takes the current matrix of *Values* (V_k(s)),
    the associated gridworld *World*,
    and computes the Bellman operator for a discount *gamma*:
    V_(k+1)(s) = max_a { sum_{s',r} p(r, s' | s, a) (r + gamma V_k(s')) }
               = max_a { sum_{s'}   p(s' | s, a) (r(s', s, a) + gamma V_k(s')) }
                 
    and the corresponding greedy policy
    pi_(k+1)(s) = argmax_a { sum_{s',r} p(r, s' | s, a) (r + gamma V_k(s')) }
    
    Returns V_(k+1)(s) in *NewValues* and pi_(k+1)(s) in *NewPolicy*
    """
    
    # -----------------------------------------------------------
    # The dimension of the world
    Ly, Lx = World.shape
    # initialize the vectors to store the new values and policy
    NewValues = np.zeros((Ly,Lx))
    NewPolicy = np.zeros((Ly,Lx,2))
    
    # --------------- UPDATE -------------------------------------
    # Do one Bellman update!
    # Cycle over all the states
        # Try all possible actions
            # Find the action that maximizes R + gamma V(s')
            # Store the new value, store the new best action
   
    # -----------
    # ADD HERE!
    # -----------
    
    # REMEMBER THAT the Value for Terminal states is ALWAYS ZERO!
    # --------------------------------------------------------------
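    # Locate the goals from the world itself (same convention as the plotting helpers):
    # goals are the sites with a reward different from 0 (free) and -1 (block).
    goal = np.where(np.logical_or(World > 0.0, World < -1.0))
    for gy, gx in zip(goal[0], goal[1]):
        NewValues[gy, gx] = 0
        NewPolicy[gy, gx] = [0,0]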
    # --------------------------------------------------------------
    return NewValues, NewPolicy

## To convergence!

Now that all the tools are in place, run consecutive iterations of the algorithm until convergence.
Finally, plot the results.
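The stopping rule can be sketched independently of the gridworld: keep applying the update until the largest change between two consecutive value arrays falls below a tolerance. The helper `iterate_to_convergence` and the toy 4-state chain below are assumptions made for illustration; in the exercise the operator would be your own `update_values`.

```python
import numpy as np

def iterate_to_convergence(operator, v0, tol=1e-8, max_iter=10_000):
    """Apply `operator` repeatedly until the largest change falls below `tol`."""
    v = v0
    for n_iter in range(1, max_iter + 1):
        v_new = operator(v)
        delta = np.max(np.abs(v_new - v))  # max-norm change in this sweep
        v = v_new
        if delta < tol:
            break
    return v, n_iter

# Toy operator: a chain of 4 states where state 3 is terminal and
# entering it pays a reward of 1; from each other state one can go left or right.
gamma = 0.9

def toy_operator(v):
    new_v = np.zeros_like(v)
    for s in range(3):
        left = max(s - 1, 0)               # bumping into the wall means staying put
        right = s + 1
        r_right = 1.0 if right == 3 else 0.0
        new_v[s] = max(gamma * v[left], r_right + gamma * v[right])
    # The terminal state keeps value 0.
    return new_v

v_star, n_iter = iterate_to_convergence(toy_operator, np.zeros(4))
print(v_star, n_iter)
```

The values decay by a factor `gamma` for each extra step needed to reach the goal, which is the pattern to expect in the gridworld plots as well.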

In [None]:
# Create a world to solve
# ADD HERE!


# Define the starting values 
# ADD HERE!

# Check the starting values
plot_world_values(World, Values)


# Define a tolerance
# And do iterative updates of the value matrix until tolerance!


# Check what is the final value and the constructed best policy.
plot_world_values_policy(World, NewValues, Policy)

# Optional Exercises / Experiments.

This is the end of the exercise. 
In the _solution_ notebook you can find all the solutions so far, *plus* several other small tweaks that slightly change the environment. Try to think about the questions below, and check what actually happens in that notebook! 

__Q:__ If actions are deterministic, does a Goal with negative reward have any impact?

__Q:__ What happens if actions are stochastic, i.e. choosing up does not always mean _going_ up?

In [None]:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

plt.rcParams['figure.figsize'] = [10, 7]
plt.rcParams['figure.dpi'] = 100 
plt.rcParams['font.size'] = 6

def plot_world(World):
    # ------------------
    Ly, Lx = World.shape

    fig, ax = plt.subplots()
    im = ax.imshow(World, cmap=plt.get_cmap("Spectral"))
    
    # We want to show all ticks...
    ax.set_xticks(np.arange(Lx))
    ax.set_yticks(np.arange(Ly))

    goal = np.where(np.logical_or( World > 0.0, World < -1.0))
    blocks = np.where(World == -1.0)
    # Loop over data dimensions and create text annotations.
    for i in range(Lx):
        for j in range(Ly):
            if np.logical_and(goal[0]==j,goal[1]==i).any():
                text = ax.text(i,j, 'G{}'.format(int(World[j,i])), ha="center", va="center", color="black")
            elif np.logical_and(blocks[0]==j,blocks[1]==i).any():
                text = ax.text(i,j, 'X', ha="center", va="center", color="black", backgroundcolor="black")
            else:
                pass
    plt.show()
    # -------------------

    

def plot_world_values(World, Values):
    # ------------------
    Ly, Lx = World.shape

    fig, (ax, ax2) = plt.subplots(1,2)
    im = ax.imshow(World, cmap=plt.get_cmap("Spectral"))

    # We want to show all ticks...
    ax.set_xticks(np.arange(Lx))
    ax.set_yticks(np.arange(Ly))

    goal = np.where(np.logical_or( World > 0.0, World < -1.0))
    blocks = np.where(World == -1.0)
    # Loop over data dimensions and create text annotations.
    for i in range(Lx):
        for j in range(Ly):
            if np.logical_and(goal[0]==j,goal[1]==i).any():
                text = ax.text(i,j, 'G{}'.format(World[j,i]), ha="center", va="center", color="black")
            elif np.logical_and(blocks[0]==j,blocks[1]==i).any():
                text = ax.text(i,j, 'X', ha="center", va="center", color="black", backgroundcolor="black")
            else:
                pass

    im2 = ax2.imshow(Values, cmap=plt.get_cmap("Spectral"))

    # We want to show all ticks...
    ax2.set_xticks(np.arange(Lx))
    ax2.set_yticks(np.arange(Ly))

    # Loop over data dimensions and create text annotations.
    for i in range(Lx):
        for j in range(Ly):
            if np.logical_and(goal[0]==j, goal[1]==i).any():
                text = ax2.text(i,j, 'G{}'.format(World[j,i]), ha="center", va="center", color="black")
            elif np.logical_and(blocks[0]==j,blocks[1]==i).any():
                text = ax2.text(i,j, 'X', ha="center", va="center", color="black", backgroundcolor="black")
            else:
                text = ax2.text(i, j, '{:.2f}'.format(Values[j, i]), ha="center", va="center", color="black")
                
                
    plt.show()
    # -------------------

    

def plot_world_values_policy(World, Values, Policy):
    # ------------------
    Ly, Lx = World.shape

    fig, (ax, ax2, ax3) = plt.subplots(1,3)
    im = ax.imshow(World, cmap=plt.get_cmap("Spectral"))

    # We want to show all ticks...
    ax.set_xticks(np.arange(Lx))
    ax.set_yticks(np.arange(Ly))

    goal = np.where(np.logical_or( World > 0.0, World < -1.0))
    blocks = np.where(World == -1.0)
    # Loop over data dimensions and create text annotations.
    for i in range(Lx):
        for j in range(Ly):
            if np.logical_and(goal[0]==j,goal[1]==i).any():
                text = ax.text(i,j, 'G{}'.format(World[j,i]), ha="center", va="center", color="black")
            elif np.logical_and(blocks[0]==j,blocks[1]==i).any():
                text = ax.text(i,j, 'X', ha="center", va="center", color="black", backgroundcolor="black")
            else:
                pass

    im2 = ax2.imshow(Values, cmap=plt.get_cmap("Spectral"))

    # We want to show all ticks...
    ax2.set_xticks(np.arange(Lx))
    ax2.set_yticks(np.arange(Ly))

    # Loop over data dimensions and create text annotations.
    for i in range(Lx):
        for j in range(Ly):
            if np.logical_and(goal[0]==j, goal[1]==i).any():
                text = ax2.text(i,j, 'G{}'.format(World[j,i]), ha="center", va="center", color="black")
                text = ax3.text(i,j, 'G{}'.format(World[j,i]), ha="center", va="center", color="black")
            elif np.logical_and(blocks[0]==j,blocks[1]==i).any():
                text = ax2.text(i,j, 'X', ha="center", va="center", color="black", backgroundcolor="black")
                text = ax3.text(i,j, 'X', ha="center", va="center", color="black", backgroundcolor="black")
            else:
                text = ax2.text(i, j, '{:.2f}'.format(Values[j, i]), ha="center", va="center", color="black")
    
    im3 = ax3.imshow(Values, cmap=plt.get_cmap("Spectral"))
    X = np.arange(Lx)
    Y = np.arange(Ly)
    U, V = Policy[:,:,1], -Policy[:,:,0]
    q = ax3.quiver(X, Y, U, V, color="black")

    plt.show()
    # -------------------
    