
**Dynamic Programming** | Prediction Problem | Iterative Policy Evaluation

Given a policy, compute its state-value function V(s): the expected return from each state when that policy is followed.

- Policy (how an action is chosen in each state): 1) probabilistic (uniform random over the available actions), 2) deterministic (one fixed action per state)
- State transitions (the next state and reward for a given state-action pair): deterministic (every p(s',r|s,a) is either 1 or 0)
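For reference, the update that both loops in this notebook apply is the Bellman expectation equation, written here as a sketch in standard notation. The shorthand r(s, a) and s'(s, a) for "the single reward and next state that follow from (s, a)" is an assumption introduced for this note, not notation from the notebook.

```latex
% General form
V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V_\pi(s')\bigr]

% Deterministic transitions: each (s, a) leads to exactly one (s', r),
% so the inner sum collapses to a single term
V_{k+1}(s) = \sum_{a} \pi(a \mid s)\,\bigl[r(s, a) + \gamma V_k\bigl(s'(s, a)\bigr)\bigr]
```

The loops in this notebook overwrite V in place, so within a sweep some of the values on the right-hand side have already been updated. The fixed point is the same.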

In [0]:
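# Python 2/3 compatibility imports
from __future__ import print_function, division
from builtins import range
import numpy as np
# Download the grid-world environment (-nc skips the download if the file already exists)
!wget -nc "https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/rl/grid_world.py"
from grid_world import standard_grid
# Download the helpers that print the value table and the policy
!wget -nc "https://raw.githubusercontent.com/maggieliuzzi/reinforcement_learning/master/environments/utils.py"
from utils import print_values, print_policy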


In [0]:
SMALL_ENOUGH = 1e-3  # convergence threshold: stop when the largest change in V over a sweep is below this

In [0]:
grid = standard_grid()
states = grid.all_states()

1) **Probabilistic policy (uniformly-random)**

In [0]:
V = {}
for s in states:
  V[s] = 0

# gamma = 1.0 means future rewards are not discounted
gamma = 1.0 # discount factor

In [0]:
# repeat until convergence
while True:
  biggest_change = 0
  for s in states:
    old_v = V[s]

    # terminal states have no entry in grid.actions, so their value stays 0
    if s in grid.actions:

      new_v = 0 # accumulates the expected value over all available actions
      p_a = 1.0 / len(grid.actions[s]) # each action has equal probability (uniform random)
      for a in grid.actions[s]:
        grid.set_state(s)
        r = grid.move(a)
        new_v += p_a * (r + gamma * V[grid.current_state()])
      V[s] = new_v
      biggest_change = max(biggest_change, np.abs(old_v - V[s]))

  if biggest_change < SMALL_ENOUGH:
    break
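The loop above can be restated as a self-contained sketch that does not depend on `grid_world.py`. Everything named here (`evaluate_policy`, the three-state corridor `A`, `B`, `END`, `toy_step`, `uniform_random`) is hypothetical and exists only for illustration. The update rule and the stopping test mirror the notebook's loop.

```python
def evaluate_policy(states, actions, step, action_prob, gamma=1.0, tol=1e-3):
    """In-place iterative policy evaluation for deterministic transitions.

    states      -- iterable of all states
    actions     -- dict: non-terminal state -> list of available actions
    step        -- function (s, a) -> (next_state, reward)
    action_prob -- function (s, a) -> probability of taking a in s
    """
    V = {s: 0.0 for s in states}
    while True:
        biggest_change = 0.0
        for s in states:
            if s not in actions:  # terminal state: value stays 0
                continue
            old_v = V[s]
            new_v = 0.0
            for a in actions[s]:
                next_s, r = step(s, a)
                new_v += action_prob(s, a) * (r + gamma * V[next_s])
            V[s] = new_v
            biggest_change = max(biggest_change, abs(old_v - new_v))
        if biggest_change < tol:
            return V


# Hypothetical corridor: A <-> B -> END, reward 1 only on reaching END
toy_states = ["A", "B", "END"]
toy_actions = {"A": ["R"], "B": ["L", "R"]}
toy_transitions = {
    ("A", "R"): ("B", 0.0),
    ("B", "L"): ("A", 0.0),
    ("B", "R"): ("END", 1.0),
}


def toy_step(s, a):
    return toy_transitions[(s, a)]


def uniform_random(s, a):
    return 1.0 / len(toy_actions[s])


V_toy = evaluate_policy(toy_states, toy_actions, toy_step, uniform_random,
                        gamma=0.9, tol=1e-6)
print(V_toy)
```

For this corridor the equations can be solved by hand: V(B) = 0.5 + 0.45 V(A) and V(A) = 0.9 V(B), which gives V(B) ≈ 0.840 and V(A) ≈ 0.756. The iteration should approach those values.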

In [17]:
print("values for uniformly random actions:")
print_values(V, grid)

values for uniformly random actions:
---------------------------
-0.03| 0.09| 0.22| 0.00|
---------------------------
-0.16| 0.00|-0.44| 0.00|
---------------------------
-0.29|-0.41|-0.54|-0.77|


2) **Deterministic policy (fixed)**

In [18]:
policy = {
  (2, 0): 'U',
  (1, 0): 'U',
  (0, 0): 'R',
  (0, 1): 'R',
  (0, 2): 'R',
  (1, 2): 'R',
  (2, 1): 'R',
  (2, 2): 'R',
  (2, 3): 'U',
}
print_policy(policy, grid)

---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


In [0]:
# With discount factor
V = {}
for s in states:
  V[s] = 0

gamma = 0.9 # discount factor

In [0]:
# repeat until convergence
while True:
  biggest_change = 0
  for s in states:
    old_v = V[s]

    # states without an entry in the policy (e.g. terminal states) keep value 0
    if s in policy:
      a = policy[s]
      grid.set_state(s)
      r = grid.move(a)
      V[s] = r + gamma * V[grid.current_state()]
      biggest_change = max(biggest_change, np.abs(old_v - V[s]))

  if biggest_change < SMALL_ENOUGH:
    break

In [21]:
print("values for fixed policy:")
print_values(V, grid)

values for fixed policy:
---------------------------
 0.81| 0.90| 1.00| 0.00|
---------------------------
 0.73| 0.00|-1.00| 0.00|
---------------------------
 0.66|-0.81|-0.90|-1.00|
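These numbers can be checked by hand. Under a fixed policy with deterministic transitions, each state follows a single path, so its value is the terminal reward discounted once per step after the first. The sketch below assumes, based on the printed values, that the only nonzero rewards are +1 on entering (0, 3) and -1 on entering (1, 3). The step counts in `steps_and_reward` are read off the policy printed earlier, and both names are hypothetical.

```python
gamma = 0.9

# state -> (number of moves until a terminal state is entered, reward on entering it)
steps_and_reward = {
    (0, 2): (1, 1.0),
    (0, 1): (2, 1.0),
    (0, 0): (3, 1.0),
    (1, 0): (4, 1.0),
    (2, 0): (5, 1.0),
    (1, 2): (1, -1.0),
    (2, 3): (1, -1.0),
    (2, 2): (2, -1.0),
    (2, 1): (3, -1.0),
}

# The reward arrives on the k-th move, so it is discounted k - 1 times
closed_form = {s: gamma ** (k - 1) * r for s, (k, r) in steps_and_reward.items()}

for s in sorted(closed_form):
    print(s, round(closed_form[s], 2))
```

Rounded to two decimals, these should agree with the table above, for example 0.9^3 ≈ 0.73 at (1, 0) and 0.9^4 ≈ 0.66 at (2, 0).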
