# Value Iteration
An implementation of value iteration on a grid world, learning a deterministic policy.



The gridworld has shape (3, 4), with a winning state "w" at (0, 3), a losing state "l" at (1, 3), an invalid state "x" at (1, 1), and a start state "s" at (2, 0).

|  |  |  |  |
|---|---|---|---|
|  |  |  | w |
|  | x |  | l |
| s |  |  |  |

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import grid_world
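
The `grid_world` module itself is not shown in this notebook. The sketch below is a hypothetical stand-in for the parts of its interface used here (`negative_grid`, `all_states`, `actions`, `rewards`, `set_state`, `move`, `current_state`, `width`, `height`). The action sets, the rule that an illegal move leaves the agent in place, and the choice to attach rewards to the destination state are all assumptions, picked to be consistent with the rewards and values printed in this notebook; the real module may differ.

```python
class Grid:
    """Hypothetical stand-in for grid_world.Grid (assumed interface only)."""

    def __init__(self, rows, cols, start):
        # The notebook's helpers iterate grid.width as rows and grid.height
        # as columns, so the attribute names follow that usage here.
        self.width = rows
        self.height = cols
        self.i, self.j = start
        self.rewards = {}
        self.actions = {}

    def set_state(self, s):
        self.i, self.j = s

    def current_state(self):
        return (self.i, self.j)

    def all_states(self):
        # Non-terminal states (keys of actions) plus rewarded terminal states.
        return set(self.actions) | set(self.rewards)

    def move(self, action):
        # Assumption: an action that is not legal in the current state is a no-op.
        if action in self.actions.get((self.i, self.j), ()):
            di, dj = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}[action]
            self.i += di
            self.j += dj
        # Assumption: the reward belongs to the state the agent ends up in.
        return self.rewards.get((self.i, self.j), 0)

    @classmethod
    def negative_grid(cls, step_cost=-0.1):
        g = cls(3, 4, (2, 0))
        g.actions = {
            (0, 0): ("D", "R"), (0, 1): ("L", "R"), (0, 2): ("L", "D", "R"),
            (1, 0): ("U", "D"), (1, 2): ("U", "D", "R"),
            (2, 0): ("U", "R"), (2, 1): ("L", "R"), (2, 2): ("L", "R", "U"),
            (2, 3): ("L", "U"),
        }
        # Every non-terminal state costs step_cost to enter; terminals pay +1 / -1.
        g.rewards = {s: step_cost for s in g.actions}
        g.rewards[(0, 3)] = 1.0
        g.rewards[(1, 3)] = -1.0
        return g
```

With such a stand-in, `grid_world.Grid.negative_grid(step_cost=-0.1)` in the cells below could be replaced by `Grid.negative_grid(step_cost=-0.1)` to experiment without the original file.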

### Define the convergence threshold: the loop stops once the largest value change in a sweep falls below it

In [2]:
CONVERGENCE_THRESHOLD = 10e-5  # note: 10e-5 equals 1e-4

### Auxiliary function to display per-state values (state values or rewards) on the grid

In [3]:
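def print_values(V, grid):
    # grid.width is iterated as rows and grid.height as columns
    # (the printed grids in this notebook have 3 rows and 4 columns).
    for i in range(grid.width):
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i, j), 0)  # states without an entry print as 0
            if v >= 0:
                print(" %.2f|" % v, end="")  # leading space aligns with negative numbers
            else:
                print("%.2f|" % v, end="")
        print("")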

### Auxiliary function to display a policy (one action per state)

In [4]:
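def print_policy(P, grid):
    # grid.width is iterated as rows and grid.height as columns, as in print_values.
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i, j), ' ')
            if isinstance(a, dict):
                # For a dict of action probabilities, show the most probable action.
                a = max(a, key=a.get)
            print("  %s  |" % a, end="")
        print("")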

### From our grid world file, create a negative grid, retrieve all states and actions, and print the grid rewards
A negative grid charges a small cost on every step, which encourages the agent to find the shortest path to the goal.
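
To see why a step cost favors short paths, the sketch below compares the discounted return of a 5-move path and a 7-move path to the winning state. The helper name `discounted_return` and its parameters are made up for this illustration; the numbers (step cost -0.1, goal reward 1.0, discount 0.9) are the ones used in this notebook.

```python
def discounted_return(n_moves, step_cost=-0.1, goal_reward=1.0, gamma=0.9):
    """Return of a path that pays step_cost on every move except the last,
    which enters the winning state and collects goal_reward."""
    rewards = [step_cost] * (n_moves - 1) + [goal_reward]
    return sum(gamma ** t * r for t, r in enumerate(rewards))


short_path = discounted_return(5)
long_path = discounted_return(7)
print(f"5 moves: {short_path:.4f}, 7 moves: {long_path:.4f}")
# 5 moves: 0.3122, 7 moves: 0.0629
```

The shorter path earns the larger return, so a greedy agent prefers it. Discounting alone already favors short paths; the step cost widens the gap.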

In [5]:
grid = grid_world.Grid.negative_grid(step_cost = -0.1)
states = grid.all_states()
actions = list({action for action_tup in grid.actions.values() for action in action_tup})

In [6]:
actions

['U', 'R', 'L', 'D']

In [7]:
print("Rewards of negative grid")
print_values(grid.rewards,grid)

Rewards of negative grid
--------------------------
-0.10|-0.10|-0.10| 1.00|
--------------------------
-0.10| 0.00|-0.10|-1.00|
--------------------------
-0.10|-0.10|-0.10|-0.10|


### Implement "value iteration" algorithm
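
For a deterministic grid like this one, each sweep of value iteration can be written as the Bellman optimality backup below. Here $s'$ is the state reached by taking action $a$ in state $s$, $r(s, a)$ is the reward received for that move, and $\gamma$ and $\theta$ correspond to the `gamma` and `theta` arguments of `value_iteration`. The notation is chosen for this note and is not taken from the `grid_world` module.

$$
V(s) \leftarrow \max_{a} \left[ r(s, a) + \gamma V(s') \right]
$$

The sweeps stop once $\max_{s} \left| V_{\text{new}}(s) - V_{\text{old}}(s) \right| < \theta$, and the greedy policy is then read off as $\pi(s) = \arg\max_{a} \left[ r(s, a) + \gamma V(s') \right]$.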

In [8]:
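def value_iteration(gamma, theta):
    # Uses the module-level `grid` and `actions` defined above.
    V = defaultdict(lambda: 0)  # states that are never updated keep the value 0
    policy = {}

    while True:
        max_change = 0
        best_state_actions = {}

        for s in grid.actions.keys():
            current_Vs = V[s]

            # Start from -inf so the greedy action is tracked correctly
            # even when every action value is negative.
            best_state_value = float("-inf")
            best_state_actions[s] = actions[0]

            for a in actions:
                # Simulate taking action a from state s (deterministic transition).
                grid.set_state(s)
                r = grid.move(a)
                s_prime = grid.current_state()

                # One-step lookahead: immediate reward plus discounted next-state value.
                v = r + gamma * V[s_prime]

                if v > best_state_value:
                    best_state_value = v
                    best_state_actions[s] = a

            # Bellman optimality backup, applied in place.
            V[s] = best_state_value
            max_change = max(max_change, abs(current_Vs - V[s]))

        # Stop once a full sweep changes no state value by theta or more.
        if max_change < theta:
            break

    # Encode the greedy policy as one-hot action probabilities per state.
    for s in grid.actions.keys():
        policy[s] = {a: 1.0 if a == best_state_actions[s] else 0.0 for a in actions}

    return policy, V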

## Run value iteration

In [9]:
policy,V = value_iteration(0.9,CONVERGENCE_THRESHOLD)

In [10]:
print("Learned policy is:")
det_policy = {}
for state in policy.keys():
    for action in policy[state].keys():
        if policy[state][action] == 1.0:
            det_policy[state] = action

print_policy(det_policy,grid)

Learned policy is:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  R  |  U  |  L  |


In [11]:
print("The state-values of the learned policy are:")
print_values(grid=grid, V=V)

The state-values of the learned policy are:
--------------------------
 0.62| 0.80| 1.00| 0.00|
--------------------------
 0.46| 0.00| 0.80| 0.00|
--------------------------
 0.31| 0.46| 0.62| 0.46|
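
The printed values can be checked by hand along the learned path from the start state, (2,0) → (1,0) → (0,0) → (0,1) → (0,2) → win. The sketch below walks that path backwards from the goal, assuming (as the printed rewards suggest) that the reward is collected on entering a state and that terminal states have value 0.

```python
gamma, step_cost, win_reward = 0.9, -0.1, 1.0

# Value of (0,2): a single move into the winning state, whose own value is 0.
v = win_reward
values = [v]

# Each earlier state on the path pays the step cost and discounts the next value.
for _ in range(4):
    v = step_cost + gamma * v
    values.append(v)

print([round(x, 2) for x in values])
# [1.0, 0.8, 0.62, 0.46, 0.31]
```

These match the table entries for (0,2), (0,1), (0,0), (1,0) and (2,0) respectively.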


## Conclusions
* The learned policy and its state-value function are the same as those found by policy iteration, but value iteration is more efficient: it does not require a loop inside a loop, because it combines policy evaluation and policy improvement in a single step.