# Week 12 - Sequential Decision Making I
## 1. Value Iteration

Author: Massimo Caccia massimo.p.caccia@gmail.com <br>

The code was adapted from: https://github.com/lazyprogrammer/machine_learning_examples/tree/master/rl <br>
and then from: https://github.com/omerbsezer/Reinforcement_learning_tutorial_with_demo

The exercises will test your capacity to **complete the value iteration algorithm**.

You can find details about the algorithm on slide 46 of the [slide deck](http://www.cs.toronto.edu/~lcharlin/courses/80-629/slides_rl.pdf). <br>

The algorithm will be tested on a simple Gridworld similar to the one presented on slide 12. 
This Gridworld is simpler, however, because the MDP is deterministic: each state-action pair leads to exactly one next state, so there is no uncertainty in the transitions.
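
To make the distinction concrete, here is a minimal sketch of how a deterministic transition model differs from a stochastic one. The state names, the probabilities, and the helper `expected_value` are hypothetical and are not part of `gridWorldGame`.

```python
# Hypothetical deterministic model: (state, action) -> the single next state.
deterministic_T = {
    ('s0', 'R'): 's1',
    ('s1', 'R'): 's2',
}

# Hypothetical stochastic model: (state, action) -> distribution over next states.
stochastic_T = {
    ('s0', 'R'): {'s1': 0.8, 's0': 0.2},
}

def expected_value(dist, V):
    """Expected value of the next state under a distribution {state: prob}."""
    return sum(p * V[s] for s, p in dist.items())

V = {'s0': 0.0, 's1': 1.0, 's2': 2.0}

# Deterministic case: no expectation is needed, just look up the next state.
det_value = V[deterministic_T[('s0', 'R')]]

# Stochastic case: the value of the next state is a probability-weighted sum.
sto_value = expected_value(stochastic_T[('s0', 'R')], V)

print(det_value, sto_value)
```

In the deterministic case the lookup replaces the weighted sum entirely, which is what makes the Gridworld used here easier to work with.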




### 1.1 Setup

In [1]:
#imports
import numpy as np
from gridWorldGame import standard_grid, negative_grid, print_values, print_policy

Let's set some variables. <br>
`SMALL_ENOUGH` is the threshold we will use to decide when value iteration has converged. <br>
`GAMMA` is the discount factor, denoted $\gamma$ in the slides (see slide 36). <br>
`ALL_POSSIBLE_ACTIONS` are the actions you can take in the Gridworld, as on slide 12. In this simple Gridworld, there are four actions: Up, Down, Left, Right. 

In [2]:
SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
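
As an illustration of what the discount factor does, the sketch below computes the discounted return of a short reward sequence. The sequence itself is a made-up example (two steps at -0.1 followed by reaching a +1 goal), not output from the grid.

```python
GAMMA = 0.9

# Hypothetical rewards collected over three consecutive steps.
rewards = [-0.1, -0.1, 1.0]

# Discounted return: sum over t of GAMMA**t * r_t.
discounted_return = sum(GAMMA ** t * r for t, r in enumerate(rewards))

print(round(discounted_return, 4))
```

The later a reward arrives, the more it is shrunk by powers of `GAMMA`, so shorter paths to the goal are worth more.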

Now we will set up the Gridworld. <br>
To encourage a shorter path to the goal, we will use a negative grid that gives a reward of -0.1 in every non-absorbing state.

In [3]:
grid = negative_grid()
print("rewards:")
print_values(grid.rewards, grid)

rewards:
---------------------------
-0.10|-0.10|-0.10| 1.00|
---------------------------
-0.10| 0.00|-0.10|-1.00|
---------------------------
-0.10|-0.10|-0.10|-0.10|


Next, we will define a random initial policy $\pi$. <br>
Remember that a policy maps states to actions $\pi : S \rightarrow A$.

In [4]:
policy = {}
for s in grid.actions.keys():
  policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

# initial policy
print("initial policy:")
print_policy(policy, grid)

initial policy:
---------------------------
  L  |  D  |  L  |     |
---------------------------
  L  |     |  D  |     |
---------------------------
  L  |  U  |  D  |  L  |


Note that there is no policy in the absorbing/terminal states.

Next, we will randomly initialize the value function.

In [5]:
V = {}
states = grid.all_states()
for s in states:
  if s in grid.actions:
    V[s] = np.random.random()
  else:
    # terminal state
    V[s] = 0

# initial value for all states in grid
print_values(V, grid)

---------------------------
 0.62| 0.38| 0.04| 0.00|
---------------------------
 0.83| 0.00| 0.29| 0.00|
---------------------------
 0.44| 0.31| 0.47| 0.70|


Note that the terminal states have no value to learn (conceptually, their value is null). <br> 
We set them to 0 so that the `print_values()` function can run.

### 1.2 Value iteration algorithms - code completion

You will now complete the value iteration algorithm. <br>
Remember that, at each iteration, every state $s$ needs to be updated with the formula:

$$
V(s) = \max_{a} \Big\{ \sum_{s'} p(s'|s,a)\big(r + \gamma V(s')\big) \Big\}
$$
Note that in the current Gridworld, $p(s'|s,a)$ is deterministic, so the sum reduces to a single term for the one reachable next state. <br>
Also, remember that in value iteration, the policy is implicit. <br> Thus, you don't need to update it at every iteration. <br>
Run the algorithm until convergence.
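
To illustrate the update for a single state, here is a small sketch of one backup in a deterministic setting. The outcome table, the state names, and the helper `backup` are hypothetical and chosen only for illustration; they do not use the `grid` object, and the full loop over states and the convergence check are left to the exercise.

```python
GAMMA = 0.9

# Hypothetical deterministic outcomes for one state 's':
# action -> (reward, next state).
outcomes = {
    'U': (-0.1, 'a'),
    'D': (-0.1, 'b'),
    'L': (-0.1, 's'),     # blocked move, stays in place (assumption)
    'R': (1.0, 'goal'),   # steps into a terminal state
}

# Hypothetical current value estimates; terminal states are fixed at 0.
V = {'s': 0.2, 'a': 0.5, 'b': 0.1, 'goal': 0.0}

def backup(outcomes, V, gamma):
    """Max over actions of r + gamma * V(s'), one term per action."""
    return max(r + gamma * V[s_next] for r, s_next in outcomes.values())

new_v = backup(outcomes, V, GAMMA)
change = abs(new_v - V['s'])  # how much this state's value moved

print(new_v, change)
```

Because each action has exactly one outcome, the inner sum over next states disappears and only the max over actions remains.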


In [6]:
iteration=0
while True:
  iteration+=1
  print("values %d: " % iteration)
  print_values(V, grid)
  print("policy %d: " % iteration)
  print_policy(policy, grid)
  print("\n\n")

  biggest_change = 0
  
  for s in states:
    old_v = V[s]

    # V(s) only has value if it's not a terminal/absorbing state
    if s in policy:
        
      new_v = float('-inf')
      for a in ALL_POSSIBLE_ACTIONS:
        grid.set_state(s)
        # get reward
        r = grid.move(a)
        # get s'
        next_state = grid.current_state()
        
        ## Implement This!

        ## hints:
        ##   - at every iteration, use the biggest_change variable to keep track of the biggest state value change.
        ##     Use it to break out of the loop upon convergence.
        ##   - compute this V[s] = max[a]{ sum[s',r] { p(s',r|s,a)[r + gamma*V[s']] } }

  if biggest_change < SMALL_ENOUGH:
    break


values 1: 
---------------------------
 0.62| 0.38| 0.04| 0.00|
---------------------------
 0.83| 0.00| 0.29| 0.00|
---------------------------
 0.44| 0.31| 0.47| 0.70|
policy 1: 
---------------------------
  L  |  D  |  L  |     |
---------------------------
  L  |     |  D  |     |
---------------------------
  L  |  U  |  D  |  L  |





Now that the value function has converged, use it to find the optimal policy.

In [7]:
for s in policy.keys():
  best_a = None
  best_value = float('-inf')
  # loop through all possible actions to find the best current action
  for a in ALL_POSSIBLE_ACTIONS:
    grid.set_state(s)
    r = grid.move(a)
    next_state = grid.current_state()
    v = r + GAMMA * V[next_state]
    if v > best_value:
      best_value = v
      best_a = a
  policy[s] = best_a

Now print your policy and make sure it leads to the upper-right corner, which is the terminal state with the highest reward.

In [8]:
print("values:")
print_values(V, grid)
print("policy:")
print_policy(policy, grid)

values:
---------------------------
 0.62| 0.38| 0.04| 0.00|
---------------------------
 0.83| 0.00| 0.29| 0.00|
---------------------------
 0.44| 0.31| 0.47| 0.70|
policy:
---------------------------
  D  |  L  |  R  |     |
---------------------------
  L  |     |  D  |     |
---------------------------
  U  |  R  |  R  |  D  |
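
One way to check a policy programmatically, rather than by eye, is to follow it from every state and confirm that it ends at the goal. The sketch below does this on a hypothetical stand-alone copy of the 3x4 layout. The coordinates, the wall position at `(1, 1)`, the rule that blocked moves leave the agent in place, and the helpers `step` and `reaches_goal` are all assumptions and not part of `gridWorldGame`.

```python
ROWS, COLS = 3, 4
WALL = (1, 1)                 # assumed blocked cell
GOAL, TRAP = (0, 3), (1, 3)   # assumed +1 and -1 terminal states
MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

def step(state, action):
    """Deterministic move; a blocked move leaves the state unchanged."""
    di, dj = MOVES[action]
    nxt = (state[0] + di, state[1] + dj)
    inside = 0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS
    return nxt if inside and nxt != WALL else state

def reaches_goal(policy, start, max_steps=50):
    """Follow the policy from start; True only if it ends at GOAL."""
    state = start
    for _ in range(max_steps):
        if state in (GOAL, TRAP):
            break
        state = step(state, policy[state])
    return state == GOAL

# Hand-written example policy (hypothetical, not the notebook's output).
example_policy = {
    (0, 0): 'R', (0, 1): 'R', (0, 2): 'R',
    (1, 0): 'U',              (1, 2): 'U',
    (2, 0): 'U', (2, 1): 'R', (2, 2): 'U', (2, 3): 'L',
}

all_ok = all(reaches_goal(example_policy, s) for s in example_policy)
print(all_ok)
```

The `max_steps` cap guards against policies that loop forever, such as one that keeps walking into a wall.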
