# Week 12 - Sequential Decision Making I
## Value and Policy Iteration Exercises

Author: Massimo Caccia massimo.p.caccia@gmail.com <br>

The code was adapted from: https://github.com/lazyprogrammer/machine_learning_examples/tree/master/rl <br>
and then from: https://github.com/omerbsezer/Reinforcement_learning_tutorial_with_demo

## 0. Preliminaries

Before we jump into the value and policy iteration exercises, we will test your understanding of a Markov Decision Process (MDP). <br>
Let's take a simple example: Tic-Tac-Toe.

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://bjc.edc.org/bjc-r/img/3-lists/TTT1_img/Three%20States%20of%20TTT.png")

Can you describe the MDP? Specifically, what are the states, actions, transition function and rewards?

## 1. Value Iteration

The exercises will test your capacity to **complete the value iteration algorithm**.

You can find details about the algorithm on slide 46 of the [slide deck](http://www.cs.toronto.edu/~lcharlin/courses/80-629/slides_rl.pdf). <br>

The algorithm will be tested on a simple Gridworld similar to the one presented on slide 12. 
This Gridworld is simpler, however, because the MDP is deterministic. In other words, there is no uncertainty: each state-action pair leads to a single possible next state.
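To make the distinction concrete, here is a minimal sketch of how a stochastic and a deterministic transition function could be written down as plain dictionaries. The state names, the probabilities and the helper `is_deterministic` are hypothetical and only serve as an illustration; they are not part of `gridWorldGame`.

```python
# Hypothetical toy transition tables (not the gridWorldGame API).
# Each (state, action) pair maps to a distribution over next states s'.

# Stochastic MDP: the same (s, a) can lead to several next states.
stochastic_p = {
    ("s0", "R"): {"s1": 0.8, "s0": 0.2},
}

# Deterministic MDP: all the probability mass sits on a single next state.
deterministic_p = {
    ("s0", "R"): {"s1": 1.0},
}


def is_deterministic(p):
    """Return True if every (s, a) pair has exactly one next state with probability 1."""
    return all(
        len(dist) == 1 and abs(sum(dist.values()) - 1.0) < 1e-12
        for dist in p.values()
    )


print(is_deterministic(stochastic_p))
print(is_deterministic(deterministic_p))
```

In the deterministic case the sum over next states in the Bellman updates reduces to a single term, which is why the exercises below never need an explicit probability table.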

### 1.1 Setup

In [18]:
#imports
import numpy as np
from gridWorldGame import standard_grid, negative_grid, print_values, print_policy

Let's set some variables. <br>
`SMALL_ENOUGH` is the threshold we will use to determine the convergence of value iteration. <br>
`GAMMA` is the discount factor, denoted $\gamma$ in the slides (see slide 36). <br>
`ALL_POSSIBLE_ACTIONS` are the actions you can take in the Gridworld, as in slide 12. In this simple Gridworld, we will have four actions: Up, Down, Left, Right. 

In [3]:
SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
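As a quick illustration of what the discount factor does, the sketch below computes the discounted return of a short reward sequence. The function `discounted_return` is a hypothetical helper written for this note, not something provided by the course code.

```python
GAMMA = 0.9  # same discount factor as in the cell above


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over a reward sequence r_0, r_1, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# A +1 reward received after two zero-reward steps is discounted twice,
# so it is worth roughly 0.9**2 = 0.81 from the starting state.
print(discounted_return([0, 0, 1]))
```

The further away a reward is, the less it contributes, which is what pushes the optimal policy toward short paths to the positive terminal state.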

Now we will set up the Gridworld. <br>

In [22]:
grid = standard_grid()
print("rewards:")
print_values(grid.rewards, grid)

rewards:
---------------------------
 0.00| 0.00| 0.00| 1.00|
---------------------------
 0.00| 0.00| 0.00|-1.00|
---------------------------
 0.00| 0.00| 0.00| 0.00|


Note that in this grid, not only are spots (1,4) and (2,4) absorbing states, but (2,2) is one as well.

Next, we will define a random initial policy $\pi$. <br>
Remember that a policy maps states to actions $\pi : S \rightarrow A$.

In [23]:
policy = {}
for s in grid.actions.keys():
  policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

# initial policy
print("initial policy:")
print_policy(policy, grid)

initial policy:
---------------------------
  U  |  R  |  U  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  L  |  U  |  U  |  R  |


Note that there is no policy in the absorbing/terminal states.

Next, we will randomly initialize the value function.

In [6]:
V = {}
states = grid.all_states()
for s in states:
  if s in grid.actions:
    V[s] = np.random.random()
  else:
    # terminal state
    V[s] = 0

# initial value for all states in grid
print_values(V, grid)

---------------------------
 0.78| 0.43| 0.22| 0.00|
---------------------------
 0.89| 0.00| 0.47| 0.00|
---------------------------
 0.96| 0.53| 0.74| 0.30|


Note that the values of the terminal states are conceptually undefined (null). <br> 
We set them to 0 so that the print_values() function can run.

### 1.2 Value iteration algorithms - code completion

You will now have to complete the value iteration algorithm. <br>
Remember that, at each iteration, each state s needs to be updated with the formula:

$$
V(s) = \underset{a}{\max}\Big\{ \sum_{s'}  p(s'|s,a)\big(r + \gamma V(s')\big) \Big\}
$$
Note that in the current gridWorld, p(s'|s,a) is deterministic. <br>
Also, remember that in value iteration, the policy is implicit. <br> Thus, you don't need to update it at every iteration. <br>
Run the algorithm until convergence.
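To see the update rule in action without touching the Gridworld, here is a minimal, self-contained sketch of value iteration on a hypothetical 1-D corridor with four states, where stepping into the rightmost state yields a reward of 1. The environment, the `step` function and the constant names are assumptions made for this illustration; the exercise cell below uses `grid.set_state`, `grid.move` and `grid.current_state` instead.

```python
# Hypothetical 1-D corridor (not the Gridworld): states 0..3, state 3 is terminal.
GAMMA = 0.9
SMALL_ENOUGH = 1e-3
N_STATES = 4
TERMINAL = 3
ACTIONS = ("L", "R")


def step(s, a):
    """Deterministic transition: return (reward, next_state)."""
    s_next = max(s - 1, 0) if a == "L" else s + 1
    r = 1.0 if s_next == TERMINAL else 0.0
    return r, s_next


# Start from zero everywhere; the terminal state keeps value 0.
V = {s: 0.0 for s in range(N_STATES)}

while True:
    biggest_change = 0.0
    for s in range(N_STATES):
        if s == TERMINAL:
            continue
        old_v = V[s]
        # Bellman optimality backup. Because the MDP is deterministic,
        # the sum over s' collapses to a single term per action.
        V[s] = max(r + GAMMA * V[s_next]
                   for r, s_next in (step(s, a) for a in ACTIONS))
        biggest_change = max(biggest_change, abs(old_v - V[s]))
    # Stop once no state value moved by more than the threshold.
    if biggest_change < SMALL_ENOUGH:
        break

print(V)
```

On this corridor the values settle at roughly 0.81, 0.9 and 1.0 for states 0, 1 and 2: each step away from the goal multiplies the value by $\gamma$.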


In [7]:
iteration=0
while True:
  iteration+=1
  print("values %d: " % iteration)
  print_values(V, grid)
  print("policy %d: " % iteration)
  print_policy(policy, grid)
  print("\n\n")

  biggest_change = 0
  
  for s in states:
    old_v = V[s]

    # V(s) only has value if it's not a terminal/absorbing state
    if s in policy:
        
      new_v = float('-inf')
      for a in ALL_POSSIBLE_ACTIONS:
        grid.set_state(s)
        # get reward
        r = grid.move(a)
        # get s'
        next_state = grid.current_state()
        
        ## Implement This!

        ## hints:
        ##   - at every iteration, use the biggest_change variable to keep track of the biggest state value change.
        ##     Use it to break out of the loop upon convergence.
        ##   - compute this V[s] = max[a]{ sum[s',r] { p(s',r|s,a)[r + gamma*V[s']] } }

  if biggest_change < SMALL_ENOUGH:
    break


values 1: 
---------------------------
 0.78| 0.43| 0.22| 0.00|
---------------------------
 0.89| 0.00| 0.47| 0.00|
---------------------------
 0.96| 0.53| 0.74| 0.30|
policy 1: 
---------------------------
  R  |  R  |  R  |     |
---------------------------
  L  |     |  D  |     |
---------------------------
  U  |  U  |  R  |  U  |





Now that the value function has converged, use it to find the optimal policy.

In [8]:
for s in policy.keys():
  best_a = None
  best_value = float('-inf')
  # loop through all possible actions to find the best current action
  for a in ALL_POSSIBLE_ACTIONS:
    grid.set_state(s)
    r = grid.move(a)
    next_state = grid.current_state()
    v = r + GAMMA * V[next_state]
    if v > best_value:
      best_value = v
      best_a = a
  policy[s] = best_a

Now print your policy and make sure it leads to the upper-right corner, which is the terminal state with the highest reward.

In [9]:
print("values:")
print_values(V, grid)
print("policy:")
print_policy(policy, grid)

values:
---------------------------
 0.78| 0.43| 0.22| 0.00|
---------------------------
 0.89| 0.00| 0.47| 0.00|
---------------------------
 0.96| 0.53| 0.74| 0.30|
policy:
---------------------------
  D  |  L  |  R  |     |
---------------------------
  D  |     |  D  |     |
---------------------------
  D  |  L  |  D  |  L  |


## 2. Policy Iteration

You will be tested on your capacity to **complete the policy iteration algorithm**. <br>
You can find details about the algorithm on slide 47 of the slide deck. <br>
The algorithm will be tested on a simple Gridworld similar to the one presented on slide 12. <br>
This Gridworld is simpler, however, because the MDP is deterministic. <br>

First, we will define a random initial policy. <br>
Remember that a policy maps states to actions.

In [14]:
policy = {}
for s in grid.actions.keys():
  policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

# initial policy
print("initial policy:")
print_policy(policy, grid)

initial policy:
---------------------------
  R  |  U  |  D  |     |
---------------------------
  L  |     |  D  |     |
---------------------------
  R  |  U  |  U  |  D  |


Next, we will randomly initialize the value function.

In [15]:
# initialize V(s) - value function
V = {}
states = grid.all_states()
for s in states:
  if s in grid.actions:
    V[s] = np.random.random()
  else:
    # terminal state
    V[s] = 0

# initial value for all states in grid
print_values(V, grid)

---------------------------
 0.68| 0.19| 0.63| 0.00|
---------------------------
 0.68| 0.00| 0.76| 0.00|
---------------------------
 0.01| 0.66| 0.34| 0.23|


Note that the values of the terminal states are conceptually undefined (null). <br> 
We set them to 0 so that the print_values() function can run.

### 2.2 Policy iteration - code completion

You will now have to complete the policy iteration algorithm. <br>
Remember that the algorithm works in two phases. <br>
First, in the *policy evaluation* phase, the value function is updated with the formula:

$$
V^\pi(s) =  \sum_{s'}  p(s'|s,\pi(s))\big(r + \gamma V^\pi(s')\big)
$$
This part of the algorithm is already coded for you. <br>

Second, in the *policy improvement* phase, the policy is updated with the formula:

$$
\pi'(s) = \underset{a}{\arg\max}\Big\{ \sum_{s'}  p(s'|s,a)\big(r + \gamma V^\pi(s')\big) \Big\}
$$

This is the part of the code you will have to complete. <br>

Note that in the current gridWorld, p(s'|s,a) is deterministic. <br>
Run the algorithm until convergence.
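The two phases can be sketched on a hypothetical 1-D corridor with four states, where stepping into the rightmost state yields a reward of 1. This is only an illustration under those assumptions: the environment, the `step` function and the deliberately poor starting policy are made up for this note, and the Gridworld cell below relies on the `grid` object instead.

```python
# Hypothetical 1-D corridor (not the Gridworld): states 0..3, state 3 is terminal.
GAMMA = 0.9
SMALL_ENOUGH = 1e-3
N_STATES = 4
TERMINAL = 3
ACTIONS = ("L", "R")


def step(s, a):
    """Deterministic transition: return (reward, next_state)."""
    s_next = max(s - 1, 0) if a == "L" else s + 1
    r = 1.0 if s_next == TERMINAL else 0.0
    return r, s_next


# Start with a deliberately poor policy: always go left.
policy = {s: "L" for s in range(N_STATES) if s != TERMINAL}
V = {s: 0.0 for s in range(N_STATES)}

while True:
    # Policy evaluation: iterate V under the current policy until it stabilizes.
    while True:
        biggest_change = 0.0
        for s in policy:
            old_v = V[s]
            r, s_next = step(s, policy[s])
            V[s] = r + GAMMA * V[s_next]
            biggest_change = max(biggest_change, abs(old_v - V[s]))
        if biggest_change < SMALL_ENOUGH:
            break

    # Policy improvement: act greedily with respect to the evaluated V.
    is_policy_converged = True
    for s in policy:
        old_a = policy[s]
        best_a, best_value = None, float("-inf")
        for a in ACTIONS:
            r, s_next = step(s, a)
            v = r + GAMMA * V[s_next]
            if v > best_value:
                best_value, best_a = v, a
        policy[s] = best_a
        if best_a != old_a:
            is_policy_converged = False

    # Stop when a full improvement sweep leaves the policy unchanged.
    if is_policy_converged:
        break

print(policy)
print(V)
```

On this corridor the policy ends up choosing "R" in every non-terminal state. Ties are broken in favor of the first action in `ACTIONS`, which is a design choice of this sketch rather than part of the algorithm.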

In [16]:
iteration=0
# repeat until convergence
# when policy does not change, it will finish
while True:
  iteration+=1
  print("values %d: " % iteration)
  print_values(V, grid)
  print("policy %d: " % iteration)
  print_policy(policy, grid)
  print('\n\n')

  # policy evaluation step
  while True:
    biggest_change = 0
    for s in states:
      old_v = V[s]

      # V(s) only has value if it's not a terminal state
      if s in policy:
        a = policy[s]
        grid.set_state(s)
        r = grid.move(a) # reward
        next_state = grid.current_state() # s' 
        V[s] = r + GAMMA * V[next_state]
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))

    if biggest_change < SMALL_ENOUGH:
      break

  # policy improvement step
  is_policy_converged = True
  for s in states:
    if s in policy:
      old_a = policy[s]
      new_a = None
      best_value = float('-inf')
      # loop through all possible actions to find the best current action
      for a in ALL_POSSIBLE_ACTIONS:
        
        ## Implement This!
        
        ## Hint:
        ##   - keep track of the changes to the policy with the variable is_policy_converged.
        ##     Use it to break the loop upon convergence


  if is_policy_converged:
    break

IndentationError: expected an indented block (<ipython-input-16-b8c398e3b83e>, line 47)

Now print your policy and make sure it leads to the upper-right corner, which is the terminal state with the highest reward.

In [17]:
print("final values:")
print_values(V, grid)
print("final policy:")
print_policy(policy, grid)

final values:
---------------------------
 0.68| 0.19| 0.63| 0.00|
---------------------------
 0.68| 0.00| 0.76| 0.00|
---------------------------
 0.01| 0.66| 0.34| 0.23|
final policy:
---------------------------
  R  |  U  |  D  |     |
---------------------------
  L  |     |  D  |     |
---------------------------
  R  |  U  |  U  |  D  |
