# Week 12 - Sequential Decision Making I
## 2. Policy Iteration

In this notebook, you will be tested on your ability to **complete the policy iteration algorithm**. <br>
You can find details about the algorithm on slide 47 of the slide deck. <br>
The algorithm will be tested on a simple Gridworld similar to the one presented on slide 12. <br>
This Gridworld is simpler, however, because the MDP is deterministic. <br>

The code was adapted from: https://github.com/lazyprogrammer/machine_learning_examples/tree/master/rl <br>
and then from: https://github.com/omerbsezer/Reinforcement_learning_tutorial_with_demo

### 2.1 Setup

In [3]:
#imports
import numpy as np
from gridWorldGame import standard_grid, negative_grid, print_values, print_policy

Let's set some variables. <br>
SMALL_ENOUGH is the threshold we will use to determine whether the algorithm has converged <br>
GAMMA is the discount factor explained on slide 36 <br>
ALL_POSSIBLE_ACTIONS are the actions you can take in the Gridworld, as on slide 12

In [5]:
SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

Now we will set up the Gridworld. <br>
To find a shorter path to the goal, we will use a negative grid that gives a negative reward of -0.1 in the non-absorbing states.

In [6]:
grid = negative_grid()
print("rewards:")
print_values(grid.rewards, grid)

rewards:
---------------------------
-0.10|-0.10|-0.10| 1.00|
---------------------------
-0.10| 0.00|-0.10|-1.00|
---------------------------
-0.10|-0.10|-0.10|-0.10|
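To see why the -0.1 step reward favors shorter paths, it can help to compare discounted returns directly. The sketch below is illustrative only: `discounted_return`, `STEP_REWARD`, and `GOAL_REWARD` are hypothetical names, not part of gridWorldGame, and it assumes a path collects -0.1 on every move except the last one, which enters the goal and collects +1.

```python
GAMMA = 0.9          # discount factor, same value as in this notebook
STEP_REWARD = -0.1   # reward for entering a non-absorbing state
GOAL_REWARD = 1.0    # reward for entering the goal state

def discounted_return(n_moves, gamma=GAMMA):
    """Discounted return of a path that reaches the goal on its n-th move."""
    rewards = [STEP_REWARD] * (n_moves - 1) + [GOAL_REWARD]
    return sum(gamma ** t * r for t, r in enumerate(rewards))

short_path = discounted_return(3)
long_path = discounted_return(5)
print("3 moves:", round(short_path, 4))
print("5 moves:", round(long_path, 4))
```

Under these assumptions the shorter path has the higher return, because it pays fewer step penalties and its goal reward is discounted less.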


Next, we will define a random initial policy. <br>
Remember that a policy maps states to actions.

In [7]:
policy = {}
for s in grid.actions.keys():
  policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

# initial policy
print("initial policy:")
print_policy(policy, grid)

initial policy:
---------------------------
  D  |  R  |  L  |     |
---------------------------
  R  |     |  R  |     |
---------------------------
  R  |  D  |  R  |  U  |


Next, we will randomly initialize the value function.

In [8]:
# initialize V(s) - value function
V = {}
states = grid.all_states()
for s in states:
  if s in grid.actions:
    V[s] = np.random.random()
  else:
    # terminal state
    V[s] = 0

# initial value for all states in grid
print_values(V, grid)

---------------------------
 0.95| 0.46| 0.19| 0.00|
---------------------------
 0.56| 0.00| 0.88| 0.00|
---------------------------
 0.45| 0.33| 0.76| 0.48|


Note that the values of the terminal states are undefined (null). <br>
We set them to 0 so that the print_values() function can run.

### 2.2 Policy iteration - code completion

You will now have to complete the policy iteration algorithm. <br>
Remember that the algorithm works in two phases. <br>
First, in the *policy evaluation* phase, the value function is updated with the formula:

$$
V^\pi(s) = \sum_{s'} p(s'|s,\pi(s)) \big( r + \gamma V^\pi(s') \big)
$$
This part of the algorithm is already coded for you. <br>
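When the MDP is deterministic, the sum over s' reduces to a single term, so each update is just r + GAMMA * V(s'). The following is a minimal sketch of that evaluation loop on a hypothetical three-state chain (A, B, GOAL). The `transitions` dictionary and `evaluate_policy` are assumptions made for illustration and do not use the gridWorldGame API.

```python
GAMMA = 0.9
SMALL_ENOUGH = 1e-3

# Hypothetical deterministic chain: A <-> B -> GOAL (terminal).
# transitions[state][action] = (next_state, reward)
transitions = {
    "A": {"R": ("B", -0.1), "L": ("A", -0.1)},
    "B": {"R": ("GOAL", 1.0), "L": ("A", -0.1)},
}

def evaluate_policy(policy, transitions, gamma=GAMMA, tol=SMALL_ENOUGH):
    """Sweep over the states until the largest value change falls below tol."""
    V = {s: 0.0 for s in transitions}
    V["GOAL"] = 0.0  # terminal state: fixed at 0
    while True:
        biggest_change = 0.0
        for s, a in policy.items():
            old_v = V[s]
            next_state, r = transitions[s][a]
            V[s] = r + gamma * V[next_state]  # deterministic: one successor
            biggest_change = max(biggest_change, abs(old_v - V[s]))
        if biggest_change < tol:
            return V

V = evaluate_policy({"A": "R", "B": "R"}, transitions)
print(V)
```

For the always-right policy, B is worth the goal reward of 1.0 and A is worth -0.1 + 0.9 * 1.0, which the sweeps reach after a few passes.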

Second, in the *policy improvement* step, the policy is updated with the formula:

$$
\pi'(s) = \underset{a}{\arg\max} \Big\{ \sum_{s'} p(s'|s,a) \big( r + \gamma V^\pi(s') \big) \Big\}
$$

This is the part of the code you will have to complete. <br>

Note that in the current Gridworld, p(s'|s,a) is deterministic. <br>
Run the algorithm until convergence.
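As a rough illustration of the greedy improvement idea, here is a sketch on a hypothetical three-state chain rather than on the Gridworld. The names `transitions` and `improve_policy` are assumptions for this example only, and the Gridworld cell below still has to be written against the grid object.

```python
GAMMA = 0.9

# Hypothetical deterministic chain: A <-> B -> GOAL (terminal).
# transitions[state][action] = (next_state, reward)
transitions = {
    "A": {"R": ("B", -0.1), "L": ("A", -0.1)},
    "B": {"R": ("GOAL", 1.0), "L": ("A", -0.1)},
}

def improve_policy(policy, V, transitions, gamma=GAMMA):
    """Make the policy greedy w.r.t. V in place; return True if nothing changed."""
    is_policy_converged = True
    for s in policy:
        old_a = policy[s]
        best_a, best_value = None, float("-inf")
        for a, (next_state, r) in transitions[s].items():
            v = r + gamma * V[next_state]
            if v > best_value:  # strict '>': ties keep the first action seen
                best_value, best_a = v, a
        policy[s] = best_a
        if best_a != old_a:
            is_policy_converged = False
    return is_policy_converged

# Values of the always-left policy, computed by hand:
# V(A) = -0.1 / (1 - 0.9) = -1.0 and V(B) = -0.1 + 0.9 * V(A) = -1.0
policy = {"A": "L", "B": "L"}
V = {"A": -1.0, "B": -1.0, "GOAL": 0.0}
converged = improve_policy(policy, V, transitions)
print(policy, converged)
```

In this toy case the improvement step switches B to the action that reaches the goal and reports that the policy changed, so another round of evaluation would follow.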

In [9]:
iteration=0
# repeat until convergence
# when policy does not change, it will finish
while True:
  iteration+=1
  print("values %d: " % iteration)
  print_values(V, grid)
  print("policy %d: " % iteration)
  print_policy(policy, grid)
  print('\n\n')

  # policy evaluation step
  while True:
    biggest_change = 0
    for s in states:
      old_v = V[s]

      # V(s) only has value if it's not a terminal state
      if s in policy:
        a = policy[s]
        grid.set_state(s)
        r = grid.move(a) # reward
        next_state = grid.current_state() # s' 
        V[s] = r + GAMMA * V[next_state]
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))

    if biggest_change < SMALL_ENOUGH:
      break

  # policy improvement step
  is_policy_converged = True
  for s in states:
    if s in policy:
      old_a = policy[s]
      new_a = None
      best_value = float('-inf')
      # loop through all possible actions to find the best current action
      for a in ALL_POSSIBLE_ACTIONS:
        
        ## Implement This!
        
        ## Hint:
        ##   - keep track of the changes to the policy with the variable is_policy_converged.
        ##     Use it to break the loop upon convergence


  if is_policy_converged:
    break

IndentationError: expected an indented block (<ipython-input-9-b8c398e3b83e>, line 47)

Now print your policy and make sure it leads to the upper-right corner, which is the terminal state with the highest reward.

In [10]:
print("final values:")
print_values(V, grid)
print("final policy:")
print_policy(policy, grid)

final values:
---------------------------
 0.95| 0.46| 0.19| 0.00|
---------------------------
 0.56| 0.00| 0.88| 0.00|
---------------------------
 0.45| 0.33| 0.76| 0.48|
final policy:
---------------------------
  D  |  R  |  L  |     |
---------------------------
  R  |     |  R  |     |
---------------------------
  R  |  D  |  R  |  U  |
