# Policy iteration and value iteration

In this notebook, you will implement different dynamic programming approaches described in [Sutton and Barto's book, Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html). A grid ```World``` class and iterative policy evaluation have already been implemented. Feel free to add more actions, rewards and/or terminals, or to modify the code to suit your needs.

### Install dependencies

In [1]:
! pip install numpy pandas



### Imports

In [2]:
import numpy as np
import sys          # We use sys to get the max value of a float
import pandas as pd # We only use pandas for displaying tables nicely
pd.options.display.float_format = '{:,.3f}'.format

### ```World``` class and globals

The ```World``` is a grid represented as a two-dimensional array of characters where each character can represent free space, an obstacle, or a terminal. Each non-obstacle cell is associated with a reward that an agent gets for moving to that cell (can be 0). The size of the world is _width_ $\times$ _height_ characters.

A _state_ is a tuple $(x,y)$.

An empty world is created in the ```__init__``` method. Obstacles, rewards and terminals can then be added with ```add_obstacle```, ```add_reward``` and ```add_terminal```.

To calculate the next state of an agent (that is, an agent is in some state $s = (x,y)$ and performs an action, $a$), ```get_next_state()``` should be called. It will only be relevant to call this function later on, when we do learning based on interaction with the environment and where an agent actually has to move.

For now, you will only need the probabilities over next states given an action, $a$, that is, call ```get_state_transition_probabilities```.
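
The way the chosen action and random moves combine into next-state probabilities can be sketched in isolation. This is a minimal, standalone illustration and not the ```World``` implementation itself: the names `transition_probabilities` and `deterministic_next` are made up for the example, and the agent is assumed to sit in the top-left corner of an empty grid.

```python
ACTIONS = ("up", "down", "left", "right")

def transition_probabilities(deterministic_next, action, p_random):
    """deterministic_next maps each action to the state it would lead to."""
    probs = {}
    for a in ACTIONS:
        # With probability p_random the chosen action is ignored and one of
        # the actions is drawn uniformly (possibly the chosen one again).
        p = p_random / len(ACTIONS)
        if a == action:
            p += 1 - p_random
        if p > 0:
            s = deterministic_next[a]
            # Several actions can end in the same state (e.g. at a wall),
            # so probabilities are accumulated per state.
            probs[s] = probs.get(s, 0.0) + p
    return probs

# Agent in the corner (0, 0): "up" and "left" hit the boundary and leave
# the agent where it is.
corner = {"up": (0, 0), "left": (0, 0), "down": (0, 1), "right": (1, 0)}
result = transition_probabilities(corner, "right", p_random=0.2)
print(result)
```

The probabilities always sum to 1, and states that can be reached by several actions collect the probability of each of them.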

In [3]:
# Globals:
ACTIONS = ("up", "down", "left", "right")

# Rewards, terminals and obstacles are characters:
REWARDS = {" ": 0, ".": 0.1, "+": 10, "-": -10}
TERMINALS = ("+", "-") # Note a terminal should also have a reward assigned
OBSTACLES = ("#")

# Discount factor
gamma = 1

# The probability of a random move:
rand_move_probability = 0

class World:
  def __init__(self, width, height):
    self.width = width
    self.height = height
    # Create an empty world where the agent can move to all cells
    self.grid = np.full((width, height), ' ', dtype='U1')

  def add_obstacle(self, start_x, start_y, end_x=None, end_y=None):
    """
    Create an obstacle in either a single cell or rectangle.
    """
    if end_x is None: end_x = start_x
    if end_y is None: end_y = start_y

    self.grid[start_x:end_x + 1, start_y:end_y + 1] = OBSTACLES[0]

  def add_reward(self, x, y, reward):
    assert reward in REWARDS, f"{reward} not in {REWARDS}"
    self.grid[x, y] = reward

  def add_terminal(self, x, y, terminal):
    assert terminal in TERMINALS, f"{terminal} not in {TERMINALS}"
    self.grid[x, y] = terminal

  def is_obstacle(self, x, y):
    if x < 0 or x >= self.width or y < 0 or y >= self.height:
      return True
    else:
      return self.grid[x ,y] in OBSTACLES

  def is_terminal(self, x, y):
    return self.grid[x ,y] in TERMINALS

  def get_reward(self, x, y):
    """
    Return the reward associated with a given location
    """
    return REWARDS[self.grid[x, y]]

  def get_next_state(self, current_state, action, deterministic=False):
    """
    Get the next state given a current state and an action. Can either be
    deterministic (no random actions) or non-deterministic,
    where rand_move_probability determines the probability of ignoring the
    action and performing a random move.
    """
    assert action in ACTIONS, f"Unknown action {action} must be one of {ACTIONS}"

    x, y = current_state

    # If our current state is a terminal, there is no next state
    if self.grid[x, y] in TERMINALS:
      return None

    # Check if a random action should be performed:
    if not deterministic and np.random.rand() < rand_move_probability:
      action = np.random.choice(ACTIONS)

    if action == "up":      y -= 1
    elif action == "down":  y += 1
    elif action == "left":  x -= 1
    elif action == "right": x += 1

    # If the next state is an obstacle, stay in the current state
    return (x, y) if not self.is_obstacle(x, y) else current_state

  def get_state_transition_probabilities(self, current_state, action):
    """
    Returns a dict where key = state and value = probability given current state
    is (x,y) and "action" is performed.
    """
    assert action in ACTIONS, f"Unknown action {action} must be one of {ACTIONS}"

    x, y = current_state
    if self.is_terminal(x, y):
      return {}

    next_state_probabilities = {}
    # Since there is rand_move_probability of performing any action, we have to
    # go through all actions and check what their next state would be:
    for a in ACTIONS:
      next_state = self.get_next_state((x, y), a, deterministic=True)
      if a == action:
        prob = 1 - rand_move_probability + rand_move_probability / len(ACTIONS)
      else:
        prob = rand_move_probability / len(ACTIONS)

      if next_state in next_state_probabilities:
        next_state_probabilities[next_state] += prob
      else:
        if prob > 0.0:
          next_state_probabilities[next_state] = prob

    return next_state_probabilities

## Basic examples: World, obstacles, rewards and terminals

Below are some examples to illustrate how the ```World``` class works.

First, we create a 4x4 world:

In [4]:
world = World(4, 4)

# Note, that we have to transpose the 2D array (.T) for (x,y)
# to match the convention when displayed
print(world.grid.T)

[[' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ']]


Obstacles and terminals are all represented as single characters:

In [5]:
print(f"Obstacles: {OBSTACLES}")
print(f"Terminals: {TERMINALS}")

Obstacles: #
Terminals: ('+', '-')


Rewards are also represented as characters in the world, but they have an associated value (note that defining a value for an empty space " " is equivalent to the agent receiving that reward each time a move is made):

In [6]:
print(f"Rewards: {REWARDS}")

Rewards: {' ': 0, '.': 0.1, '+': 10, '-': -10}


To assign rewards to terminal states, just use the same character in the ```REWARDS``` dict and in the ```TERMINALS``` tuple.

In [7]:
for t in TERMINALS:
  print(f"Terminal {t} has reward {REWARDS[t]}")

world.add_terminal(0, 0, "+")
world.add_terminal(3, 3, "-")


print(world.grid.T)

# An alternative way of displaying the world using pandas:
display(pd.DataFrame(world.grid.T))

Terminal + has reward 10
Terminal - has reward -10
[['+' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ']
 [' ' ' ' ' ' '-']]


Unnamed: 0,0,1,2,3
0,+,,,
1,,,,
2,,,,
3,,,,-


## Policies ($\pi$)

Recall that a policy, $\pi(a|s) = \Pr(A_t = a | S_t = s)$, maps states to action probabilities. In the code below, we let policies return the probabilities of each possible action given a state. States are $(x, y)$ coordinates and the policy must return action probabilities as a dict where the action is the ```key``` and the corresponding ```value``` is the probability of taking that action in the given state. Deterministic policies, for instance, return a dict with only one entry (e.g. ```{ "up": 1 } ``` if the action for the current state is ```up```).

A random policy can be defined as follows:


In [8]:
def equiprobable_random_policy(x, y):
  return { k : 1 / len(ACTIONS) for k in ACTIONS }

# Example (since the action probabilities do not depend on the state in this
# basic policy, we can just call it for state (0, 0)):
print(equiprobable_random_policy(0, 0))

{'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}


## Iterative policy evaluation


Iterative policy evaluation takes a ```World```, a discount factor, $\gamma$ (```gamma```, defined above in the ```World``` code cell), a policy, $\pi$, and a threshold, $\theta$ (```theta```), that determines when to stop the iteration. You can also cap the number of iterations with the ```max_iterations``` argument, which can be useful for debugging.

**IMPORTANT:** Remember that in iterative policy evaluation, we just learn state values ($V_\pi$) given a policy $\pi$. We are **not** trying to learn a policy.

(See pages 74-75 of [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html) for an explanation and the algorithm.)
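
For reference, the update applied to every state in each sweep can be written as follows. The notation $p(s'|s,a)$ for the state transition probabilities and $r(s')$ for the reward of entering cell $s'$ is an assumption chosen to match how this notebook attaches rewards to cells; the book uses a more general form.

$$
V(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s'} p(s'|s,a)\,\bigl[\, r(s') + \gamma V(s') \,\bigr]
$$

The iteration stops once the largest change $|V_{\text{new}}(s) - V_{\text{old}}(s)|$ over all states in a sweep is at most $\theta$.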


In [9]:
def iterative_policy_evaluation(world, policy, theta=1e-5, max_iterations=1e3):

  # Our initial estimates for all states in the world is 0:
  V = np.full((world.width, world.height), 0.0)

  while True:
    # delta keeps track of the largest change in one iteration, so we set it to
    # 0 at the start of each iteration:
    delta = 0

    # Loop over all states (x,y)
    for y in range(world.height):
      for x in range(world.width):
        if not world.is_obstacle(x, y):
          # Get action probabilities for the current state:
          actions = policy(x, y)

          # v is the new estimate that will be updated in the loop:
          v = 0

          # loop over all actions that our policy says that we can perform
          # in the current state:
          for action, action_prob in actions.items():
            # For each action, get state transition probabilities and
            # accumulate in v rewards weighted with action and state transition
            # probabilities:
            for (xi, yi), state_prob in world.get_state_transition_probabilities((x, y), action).items():
              v += action_prob * state_prob * (world.get_reward(xi, yi) + gamma * V[xi, yi])

          # update delta (largest change in estimate so far)
          delta = max(delta, abs(v - V[x, y]))
          V[x, y] = v

    # check if current state value estimates are close enough to end:
    if delta <= theta:
      break

    max_iterations -= 1
    if max_iterations == 0:
      break

  # Return the state value estimates
  return V


## Implementation of Example 4.1 from the book

Below is an implementation of Example 4.1 on page 76 of [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html).

In [10]:
# World is 4x4
world = World(4, 4)

# Rewards are -1 for each move (including when hitting a terminal state, "+"):
REWARDS = {" ": -1, "+": -1}


# Add terminal states in two corners
world.add_terminal(0, 0, "+")
world.add_terminal(3, 3, "+")

print(world.grid.T)

[['+' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ']
 [' ' ' ' ' ' '+']]


In [11]:
V = iterative_policy_evaluation(world, equiprobable_random_policy)

display(pd.DataFrame(V.T))

Unnamed: 0,0,1,2,3
0,0.0,-14.0,-20.0,-22.0
1,-14.0,-18.0,-20.0,-20.0
2,-20.0,-20.0,-18.0,-14.0
3,-22.0,-20.0,-14.0,0.0


## Exercise - policy and discount factor

Experiment with example 4.1: what effect does it have to change the policy, e.g. so that an agent always goes left or always goes right? What effect does it have on state values to change the value of the discount factor (```gamma```)?

In [12]:
# TODO: implement your code here

# Deterministic policies that always pick the same action, regardless of state
def left_policy(x, y):
  return {"left": 1}

def right_policy(x, y):
  return {"right": 1}

# Use an undiscounted return for this experiment
gamma = 1

# Example (since the action does not depend on the state in this
# basic policy, we can just call it for state (0, 0)):
print(left_policy(0, 0))

V1 = iterative_policy_evaluation(world, left_policy)
VR = iterative_policy_evaluation(world, right_policy)

print("Left policy:")
display(pd.DataFrame(V1.T))
print("\n")
print("Right policy:")
display(pd.DataFrame(VR.T))

{'left': 1}
Left policy:


Unnamed: 0,0,1,2,3
0,0.0,-1.0,-2.0,-3.0
1,-1000.0,-1001.0,-1002.0,-1003.0
2,-1000.0,-1001.0,-1002.0,-1003.0
3,-1000.0,-1001.0,-1002.0,0.0




Right policy:


Unnamed: 0,0,1,2,3
0,0.0,-1000.0,-1000.0,-1000.0
1,-1000.0,-1000.0,-1000.0,-1000.0
2,-1000.0,-1000.0,-1000.0,-1000.0
3,-3.0,-2.0,-1.0,0.0


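Here we can see that the left policy works well for the top row but very poorly for the other rows, while the right policy works well only for the bottom row. The values around -1000 have not converged: with ```gamma = 1``` those states never reach a terminal, so the estimate drops by 1 in every sweep until ```max_iterations``` (1000) ends the loop.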
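
The values near -1000 in the tables above come from the iteration cap rather than from convergence. A minimal sketch of that effect, assuming a single hypothetical state that bumps into a wall forever and receives -1 per move (the helper `evaluate_self_loop` is illustrative and not part of the notebook):

```python
def evaluate_self_loop(reward, gamma, theta=1e-5, max_iterations=1000):
    """Policy evaluation for one state whose only transition leads back to itself."""
    v = 0.0
    sweeps = 0
    while sweeps < max_iterations:
        new_v = reward + gamma * v   # expected update for a self-loop
        delta = abs(new_v - v)
        v = new_v
        sweeps += 1
        if delta <= theta:           # converged
            break
    return v, sweeps

value, sweeps = evaluate_self_loop(reward=-1, gamma=1.0)
print(value, sweeps)
```

With ```gamma = 1``` the change per sweep stays at 1, so the threshold is never met and the loop only stops at the cap.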

Testing the discount factor:


In [13]:
gamma = 0.5

print("equiprobable:")
V = iterative_policy_evaluation(world, equiprobable_random_policy)
display(pd.DataFrame(V.T))

equiprobable:


Unnamed: 0,0,1,2,3
0,0.0,-1.695,-1.949,-1.983
1,-1.695,-1.915,-1.966,-1.949
2,-1.949,-1.966,-1.915,-1.695
3,-1.983,-1.949,-1.695,0.0


Lowering ```gamma``` from 1 to 0.5 made the state values much less negative, because rewards further in the future now count for less.
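
A short calculation shows why no value in the table drops below -2. Assuming a reward of -1 for every move, as in Example 4.1, the most negative possible return is the one of an agent that never reaches a terminal:

$$
\sum_{k=0}^{\infty} \gamma^{k} \cdot (-1) = -\frac{1}{1-\gamma} = -2 \quad \text{for } \gamma = 0.5
$$

Every episode that does reach a terminal cuts this sum short, so all state values lie between -2 and 0.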

Try to write a policy that is deterministic, but where the action performed differs between states. You can implement it as a two-dimensional array with the dimensions corresponding to the world dimensions and have each entry be an action for the corresponding state.

In [14]:
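# TODO: implement your code here
def deterministic_policy(x, y):
  # Top row moves left and bottom row moves right, towards the terminals
  if y == 0:
    return {"left": 1}
  elif y == 3:
    return {"right": 1}
  # Left column moves up and right column moves down
  elif x == 0:
    return {"up": 1}
  elif x == 3:
    return {"down": 1}
  # Remaining inner cells move towards the nearest outer column
  elif x == 2:
    return {"right": 1}
  elif x == 1:
    return {"left": 1}
  else:
    return {"right": 1}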


gamma = 0.5

print(f"deterministic:")
Vd = iterative_policy_evaluation(world, deterministic_policy)
display(pd.DataFrame(Vd.T))

deterministic:


Unnamed: 0,0,1,2,3
0,0.0,-1.0,-1.5,-1.75
1,-1.0,-1.5,-1.75,-1.5
2,-1.5,-1.75,-1.5,-1.0
3,-1.75,-1.5,-1.0,0.0
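
The exercise also suggests storing the policy as a two-dimensional array. A minimal sketch of that representation, assuming a 4x4 grid indexed as `[x, y]`; the names `policy_grid` and `grid_policy` and the chosen actions are only illustrative:

```python
import numpy as np

# One action per state; dtype "U5" is wide enough for the longest name ("right")
policy_grid = np.full((4, 4), "left", dtype="U5")
policy_grid[3, :] = "down"    # rightmost column moves down
policy_grid[:, 3] = "right"   # bottom row moves right (overrides the corner)

def grid_policy(x, y):
    # Same interface as the other policies: a dict of action probabilities
    return {str(policy_grid[x, y]): 1}

print(grid_policy(0, 0))
print(policy_grid.T)  # transpose so that rows correspond to y when displayed
```

Because the array can be modified in place, this form is convenient later when a policy has to be improved state by state.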


## Exercise - stochasticity

You can adjust the degree of stochasticity in the environment by setting the global variable ```rand_move_probability``` (the probability of the world ignoring an action and performing a random move instead). What effect does stochasticity have on the state-value estimates?


In [15]:
# TODO: implement your code here
rand_move_probability = 0.5
print(f"deterministic:")
Vd = iterative_policy_evaluation(world, deterministic_policy)
display(pd.DataFrame(Vd.T))


deterministic:


Unnamed: 0,0,1,2,3
0,0.0,-1.298,-1.75,-1.893
1,-1.298,-1.721,-1.871,-1.75
2,-1.75,-1.871,-1.721,-1.298
3,-1.893,-1.75,-1.298,0.0


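The state values are lower than in the deterministic environment (for example, -1.298 instead of -1.0 next to a terminal): with ```rand_move_probability = 0.5```, the chosen action is only carried out with probability 0.625, so the agent needs more moves on average to reach a terminal.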

## Exercise - robot, cake and mouse trap

Implement a robot, cake and mouse trap example and compute state value estimates under different policies (equiprobable, always right, and 50% right / 50% up) with and without stochasticity.


In [16]:
# TODO: implement your code here
cakeWorld = World(4, 4)
cakeWorld.add_terminal(3, 3, "+")
cakeWorld.add_obstacle(1,1)

REWARDS = {" ": -1, "+": 5}

print("equiprobable:")
Vc = iterative_policy_evaluation(cakeWorld, equiprobable_random_policy)
display(pd.DataFrame(Vc.T))


equiprobable:


Unnamed: 0,0,1,2,3
0,-1.998,-1.995,-1.972,-1.94
1,-1.995,0.0,-1.868,-1.669
2,-1.972,-1.868,-1.435,0.128
3,-1.94,-1.669,0.128,0.0


Here we try adding a penalty for hitting a mouse trap.


In [17]:

gamma = 1

cakeWorld2 = World(4, 4)
REWARDS = {" ": 0, "+": 10, "-":-10}
cakeWorld2.add_terminal(3, 3, "+")
cakeWorld2.add_terminal(0,0,"-")

cakeWorld2.add_reward(0,0,"-")
cakeWorld2.add_reward(3,3,"+")

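# NOTE: 0 disables random moves, so this run is deterministic despite the label
# printed below; set a value > 0 (e.g. 0.5) to actually evaluate with stochasticity
rand_move_probability = 0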

def eq_5050_policy(x,y):
  return {"up" : 0.5, "right" : 0.5}



print("with stochasticity:")
print(f"\n")

print("equiprobable:")
Vc2 = iterative_policy_evaluation(cakeWorld2, equiprobable_random_policy)
display(pd.DataFrame(Vc2.T))

print(f"\n")
print("right:")
Vr2 = iterative_policy_evaluation(cakeWorld2, right_policy)
display(pd.DataFrame(Vr2.T))

print(f"\n")
print("50/50 up and right:")
V50= iterative_policy_evaluation(cakeWorld2, eq_5050_policy)
display(pd.DataFrame(V50.T))

with stochasticity:


equiprobable:


Unnamed: 0,0,1,2,3
0,0.0,-4.615,-1.539,-0.0
1,-4.615,-2.308,-0.0,1.538
2,-1.539,-0.0,2.308,4.615
3,-0.0,1.538,4.615,0.0




right:


Unnamed: 0,0,1,2,3
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,10.0,10.0,10.0,0.0




50/50 up and right:


Unnamed: 0,0,1,2,3
0,0.0,0.0,0.0,0.0
1,-5.0,0.0,0.0,0.0
2,-2.5,0.0,0.0,0.0
3,0.0,2.5,5.0,0.0


In [18]:
rand_move_probability = 0

print("without stochasticity:")
print(f"\n")

print("equiprobable:")
Vc2 = iterative_policy_evaluation(cakeWorld2, equiprobable_random_policy)
display(pd.DataFrame(Vc2.T))

print(f"\n")
print("right:")
Vr2 = iterative_policy_evaluation(cakeWorld2, right_policy)
display(pd.DataFrame(Vr2.T))

print(f"\n")
print("50/50 up and right:")
V50= iterative_policy_evaluation(cakeWorld2, eq_5050_policy)
display(pd.DataFrame(V50.T))

without stochasticity:


equiprobable:


Unnamed: 0,0,1,2,3
0,0.0,-4.615,-1.539,-0.0
1,-4.615,-2.308,-0.0,1.538
2,-1.539,-0.0,2.308,4.615
3,-0.0,1.538,4.615,0.0




right:


Unnamed: 0,0,1,2,3
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,10.0,10.0,10.0,0.0




50/50 up and right:


Unnamed: 0,0,1,2,3
0,0.0,0.0,0.0,0.0
1,-5.0,0.0,0.0,0.0
2,-2.5,0.0,0.0,0.0
3,0.0,2.5,5.0,0.0


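Note that both runs above used ```rand_move_probability = 0```, so the two sets of tables are identical and do not yet show the effect of stochasticity. Stochasticity should make no difference under the equiprobable policy, because its moves are already uniformly random. Under the other two policies, random moves give the agent a chance of reaching cells, including the cake, that the policy alone would never visit.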
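
The claim about the equiprobable policy can be checked with a short calculation. Writing $p$ for ```rand_move_probability``` and $|\mathcal{A}| = 4$ for the number of actions (both symbols are introduced here for illustration), the probability that move $a$ is actually carried out in state $s$ is

$$
\tilde{\pi}(a|s) = (1 - p)\,\pi(a|s) + \frac{p}{|\mathcal{A}|}
$$

For the equiprobable policy, $\pi(a|s) = \tfrac{1}{4}$, so $\tilde{\pi}(a|s) = \tfrac{1-p}{4} + \tfrac{p}{4} = \tfrac{1}{4}$ for every $p$. For the always-right policy with $p = 0.5$, "right" is carried out with probability $0.625$ and each other move with probability $0.125$.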

## Exercise - action-value function

Based on a set of calculated state values, try to implement an action-value function, that is, $q_\pi(s, a)$ (if in doubt, see page 78 in [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html)). Note: you have to use the ```get_state_transition_probabilities()``` method on ```World``` to be able to handle stochastic environments where performing ```a``` does not lead to a deterministic outcome.
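
In the notation of this notebook, with $p(s'|s,a)$ for the state transition probabilities and $r(s')$ for the reward of entering cell $s'$ (both symbols are assumptions chosen to match the ```World``` class), the action value can be written as

$$
q_\pi(s, a) = \sum_{s'} p(s'|s,a)\,\bigl[\, r(s') + \gamma\, v_\pi(s') \,\bigr]
$$

The state value is then the policy-weighted average of the action values, $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s, a)$.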

In [19]:
def action_value(world, V, state, action):
  # q(s, a): reward plus discounted value of each possible next state,
  # weighted by the state transition probabilities
  q = 0
  for (xi, yi), probability in world.get_state_transition_probabilities(state, action).items():
    q += probability * (world.get_reward(xi, yi) + gamma * V[xi, yi])
  return q


Vc2 = iterative_policy_evaluation(cakeWorld2, equiprobable_random_policy)
display(pd.DataFrame(Vc2.T))

action_value(cakeWorld2, Vc2, (1,2), "up")

Unnamed: 0,0,1,2,3
0,0.0,-4.615,-1.539,-0.0
1,-4.615,-2.308,-0.0,1.538
2,-1.539,-0.0,2.308,4.615
3,-0.0,1.538,4.615,0.0


-2.307770874224413

## Exercise - policy iteration

You are now ready to implement policy iteration. That is, first estimate state values under a given policy, then improve the policy based on those estimates and action values, estimate state values again, and so on. See page 80 in [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html).

You will need an explicit representation of your policy that you can easily change.

Test your implementation and print out the policies found.
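
The evaluate-then-improve loop can be sketched on a problem small enough to check by hand. This is a standalone illustration under stated assumptions, not the notebook's solution: a hypothetical corridor of five cells where cell 4 is the terminal, every move costs -1, moves are deterministic, and the discount factor is 0.9. All names (`step`, `q_value`, `evaluate`, `policy_iteration`) are made up for the example.

```python
GAMMA = 0.9
N_STATES = 5                  # cells 0..4 in a corridor; cell 4 is the terminal
ACTIONS = ("left", "right")

def step(s, a):
    """Deterministic move; the left wall keeps the agent in place. Reward is -1."""
    s2 = max(s - 1, 0) if a == "left" else min(s + 1, N_STATES - 1)
    return s2, -1.0

def q_value(V, s, a):
    s2, r = step(s, a)
    return r + GAMMA * V[s2]

def evaluate(policy, theta=1e-8):
    """Iterative policy evaluation for a deterministic policy (list of actions)."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES - 1):          # the terminal keeps value 0
            v = q_value(V, s, policy[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta <= theta:
            return V

def policy_iteration():
    policy = ["left"] * (N_STATES - 1)         # deliberately poor starting policy
    while True:
        V = evaluate(policy)
        stable = True
        for s in range(N_STATES - 1):
            best = max(ACTIONS, key=lambda a: q_value(V, s, a))
            # Switch only on a clear improvement so that ties keep the old action
            if q_value(V, s, best) > q_value(V, s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

policy, V = policy_iteration()
print(policy)
print([round(v, 3) for v in V])
```

Keeping the old action on ties is a deliberate choice: without it, two equally good actions could be swapped back and forth and the loop would never report a stable policy.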


In [20]:

# TODO: Implement your code here

# Grid-based deterministic policy: one action per state, indexed as [x, y]
pol4x4 = np.full((4, 4), "right")

def explicit(x, y):
  return {pol4x4[x, y]: 1}


def pol_improve(world, policy, theta=1e-3):
  policy_stable = False
  while not policy_stable:
    # Policy evaluation: estimate state values under the current policy
    V = iterative_policy_evaluation(world, policy, theta)

    # Policy improvement: act greedily with respect to the new estimates
    policy_stable = True
    for y in range(world.height):
      for x in range(world.width):
        if world.is_obstacle(x, y) or world.is_terminal(x, y):
          continue
        old_action = pol4x4[x, y]
        old_action_val = action_value(world, V, (x, y), old_action)
        for action in ACTIONS:
          # Switch only on a strict improvement, so ties keep the old action
          new_action_val = action_value(world, V, (x, y), action)
          if new_action_val > old_action_val:
            pol4x4[x, y] = action
            old_action_val = new_action_val
        if pol4x4[x, y] != old_action:
          policy_stable = False

  return V


vc5 = pol_improve(cakeWorld2, explicit)
display(pd.DataFrame(vc5.T))
display(pd.DataFrame(pol4x4.T))

## Exercise - value iteration


Value iteration is often more efficient than policy iteration because it does not run a full policy evaluation between policy improvements. Implement value iteration below. See page 83 in [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html).

Test your implementation and display the policies found (i.e., a grid with the preferred action in each cell).
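
As a point of comparison, value iteration can be sketched on a small hand-checkable problem. This is a standalone illustration under stated assumptions, not the notebook's solution: a hypothetical corridor of five cells where cell 4 is the terminal, every move costs -1, moves are deterministic, and the discount factor is 0.9. The names `step`, `q_value` and `value_iteration` are made up for the example.

```python
GAMMA = 0.9
N_STATES = 5                  # cells 0..4 in a corridor; cell 4 is the terminal
ACTIONS = ("left", "right")

def step(s, a):
    """Deterministic move; the left wall keeps the agent in place. Reward is -1."""
    s2 = max(s - 1, 0) if a == "left" else min(s + 1, N_STATES - 1)
    return s2, -1.0

def q_value(V, s, a):
    s2, r = step(s, a)
    return r + GAMMA * V[s2]

def value_iteration(theta=1e-8):
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES - 1):          # the terminal keeps value 0
            # Bellman optimality update: take the best action value directly
            best = max(q_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta <= theta:
            break
    # Read off the greedy policy from the converged values
    greedy = [max(ACTIONS, key=lambda a: q_value(V, s, a))
              for s in range(N_STATES - 1)]
    return V, greedy

V, greedy = value_iteration()
print([round(v, 3) for v in V])
print(greedy)
```

Unlike policy iteration, there is no explicit policy during the sweeps; the policy is only extracted once at the end from the final state values.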

In [None]:
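# TODO: Implement your code here

def value_iteration(world, theta=1e-2):
  V = np.zeros((world.width, world.height))  # Initialize state values to 0

  while True:
    delta = 0
    for y in range(world.height):
      for x in range(world.width):
        if world.is_obstacle(x, y) or world.is_terminal(x, y):
          continue
        v = V[x, y]
        # Bellman optimality update: value of the best action in this state
        V[x, y] = max(action_value(world, V, (x, y), a) for a in ACTIONS)
        delta = max(delta, abs(v - V[x, y]))
    if delta < theta:
      break

  # Read off the greedy policy from the converged state values
  policy = np.full((world.width, world.height), " ", dtype="U5")
  for y in range(world.height):
    for x in range(world.width):
      if not (world.is_obstacle(x, y) or world.is_terminal(x, y)):
        policy[x, y] = max(ACTIONS, key=lambda a: action_value(world, V, (x, y), a))

  return V, policy


V_opt, pol_opt = value_iteration(cakeWorld2)
display(pd.DataFrame(V_opt.T))
display(pd.DataFrame(pol_opt.T))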

