# Iterative policy evaluation
An implementation of iterative policy evaluation on a grid world, used to evaluate two different policies.

The grid world has shape (3, 4), with a winning state "w" at (0, 3), a losing state "l" at (1, 3), an invalid state "x" at (1, 1), and a start state "s" at (2, 0).

|  |  |  |  |
|---|---|---|---|
|  |  |  | w |
|  | x |  | l |
| s |  |  |  |

In [12]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import grid_world

### Define the convergence threshold (the loop stops once the largest value change in a sweep falls below it)

In [2]:
CONVERGENCE_THRESHOLD = 10e-4  # note: 10e-4 equals 1e-3

### Auxiliary function to display the values of a policy after finishing iterative policy evaluation

In [3]:
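def print_values(V, grid):
    # In this grid, grid.width is the number of rows and grid.height the number of columns
    for i in range(grid.width):
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i, j), 0)  # states without a stored value are shown as 0
            if v >= 0:
                print(" %.2f|" % v, end="")  # leading space keeps columns aligned with negative values
            else:
                print("%.2f|" % v, end="")
        print("")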

### Auxiliary function to display a policy (for a stochastic policy, only the first listed action of each state is shown)

In [4]:
def print_policy(P, grid):
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i, j), ' ')  # states without a policy entry are left blank
            if isinstance(a, dict):
                a = list(a)[0]  # show only the first action listed for this state
            print("  %s  |" % a, end="")
        print("")

### From our grid_world module, create a standard grid and get all of its states

In [5]:
grid = grid_world.Grid.standard_grid()
states = grid.all_states()
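
The `grid_world` module itself is not shown in this notebook. As a rough guide to what the cells here rely on, the sketch below is a hypothetical stand-in (`MiniGrid` is an invented name) that exposes only the attributes and methods used in this notebook: `width`, `height`, `actions`, `set_state`, `move`, `current_state`, `all_states`, and `standard_grid`. The reward values (+1 and -1 at the terminal states) and the allowed actions per state are assumptions chosen to match the layout above, not the module's actual code.

```python
class MiniGrid:
    """Hypothetical stand-in for grid_world.Grid (the real module is not shown)."""

    # Row/column offsets for each action label (assumed labels: U, D, L, R)
    MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

    def __init__(self, width, height, start):
        # As in the printing helpers, width counts rows and height counts columns
        self.width = width
        self.height = height
        self.i, self.j = start
        self.rewards = {}  # state -> reward received on entering it
        self.actions = {}  # non-terminal state -> allowed actions

    def set_state(self, s):
        self.i, self.j = s

    def current_state(self):
        return (self.i, self.j)

    def move(self, action):
        # Move only if the action is allowed in the current state; otherwise stay put
        if action in self.actions.get((self.i, self.j), ()):
            di, dj = self.MOVES[action]
            self.i += di
            self.j += dj
        return self.rewards.get((self.i, self.j), 0)

    def all_states(self):
        return set(self.actions) | set(self.rewards)

    @classmethod
    def standard_grid(cls):
        g = cls(3, 4, (2, 0))
        g.rewards = {(0, 3): 1, (1, 3): -1}  # assumed: +1 for winning, -1 for losing
        g.actions = {
            (0, 0): ('D', 'R'), (0, 1): ('L', 'R'), (0, 2): ('L', 'D', 'R'),
            (1, 0): ('U', 'D'), (1, 2): ('U', 'D', 'R'),
            (2, 0): ('U', 'R'), (2, 1): ('L', 'R'), (2, 2): ('L', 'R', 'U'),
            (2, 3): ('L', 'U'),
        }
        return g


demo = MiniGrid.standard_grid()
demo.set_state((0, 2))
print(demo.move('R'), demo.current_state())
```

Terminal states and the invalid cell have no entry in `actions`, which is why looping over `grid.actions.keys()` visits only the states whose values need updating.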

### Implement the iterative policy evaluation algorithm
A function that takes the policy to evaluate, the discount factor gamma, and the convergence threshold as parameters.
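
For reference, the update applied to each state in every sweep can be written as follows. This assumes deterministic transitions, as the code in this notebook does: taking action $a$ in state $s$ always leads to a single next state $s'$ with reward $r(s, a)$.

$$
V(s) \leftarrow \sum_{a} \pi(a \mid s) \left[ r(s, a) + \gamma \, V(s') \right]
$$

The sweep over all states is repeated until the largest change $|V_{\text{old}}(s) - V_{\text{new}}(s)|$ in a sweep is below the convergence threshold.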

In [6]:
def iterative_policy_evaluation(policy, gamma=1, convergence_threshold=0.1):
    V = defaultdict(lambda: 0)  # the value of every state starts at 0

    while True:
        max_change = 0  # maximum difference between old and new value in this sweep
        for s in grid.actions.keys():  # update the value of every state that has actions
            old_value = V[s]  # keep a copy of the old value of the state

            new_value = 0  # accumulator for the new value after applying the Bellman equation
            for action in policy[s].keys():  # add the expected return of every action the policy can take in s
                action_prob = policy[s][action]  # probability of taking action "a" in state "s"

                grid.set_state(s)
                reward = grid.move(action)  # immediate reward of the action

                new_value += action_prob * (reward + gamma * V[grid.current_state()])

            V[s] = new_value  # store the new value in place, so later states in this sweep already see it
            max_change = max(max_change, np.abs(old_value - new_value))  # track the largest change

        if max_change < convergence_threshold:
            break

    return V
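
To see the same update loop without depending on `grid_world`, here is a minimal sketch on a hypothetical two-step corridor. The names `evaluate_chain`, `transitions`, and the states `A`, `B`, `GOAL` are invented for illustration and are not part of this notebook's code; transitions are assumed to be deterministic.

```python
def evaluate_chain(policy, transitions, gamma=0.9, threshold=1e-3):
    """In-place iterative policy evaluation on a deterministic toy MDP.

    transitions[s][a] -> (next_state, reward); states that do not appear
    as keys of `transitions` are terminal and keep value 0.
    """
    V = {}
    while True:
        max_change = 0.0
        for s in transitions:
            old = V.get(s, 0.0)
            new = 0.0
            for a, prob in policy[s].items():
                next_state, reward = transitions[s][a]
                new += prob * (reward + gamma * V.get(next_state, 0.0))
            V[s] = new
            max_change = max(max_change, abs(old - new))
        if max_change < threshold:
            return V


# Hypothetical corridor: A -> B -> GOAL, with reward 1 on reaching GOAL
transitions = {
    'A': {'right': ('B', 0.0)},
    'B': {'right': ('GOAL', 1.0)},
}
policy = {'A': {'right': 1.0}, 'B': {'right': 1.0}}

values = evaluate_chain(policy, transitions)
print(values)
```

In the first sweep only `B` picks up the reward; `A` receives the discounted value of `B` one sweep later, which is how value spreads backwards from the terminal state.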

## Compare the value of different policies

### Evaluate a uniform stochastic policy (every action has the same probability)

In [7]:
uniform_policy = {}

for s in grid.actions.keys():
    actions = grid.actions[s]
    uniform_policy[s] = {}
    
    for action in actions: #every action in this state has the same probability
        uniform_policy[s][action] = 1.0/len(actions)
        

In [8]:
random_policy_values = iterative_policy_evaluation(policy=uniform_policy,gamma=1,convergence_threshold=CONVERGENCE_THRESHOLD)
print("The state-values of this policy are:")
print_values(grid=grid,V=random_policy_values)

The state-values of this policy are:
--------------------------
-0.03| 0.09| 0.22| 0.00|
--------------------------
-0.16| 0.00|-0.44| 0.00|
--------------------------
-0.29|-0.41|-0.54|-0.77|


### Evaluate a deterministic policy

In [9]:
# In position (2,0) take action UP, in position (1,0) take action UP, and so on.
fixed_policy = {
    (2,0):{'U':1},
    (1,0):{'U':1},
    (0,0):{'R':1},
    (0,1):{'R':1},
    (0,2):{'R':1},
    (1,2):{'R':1},
    (2,1):{'R':1},
    (2,2):{'R':1},
    (2,3):{'U':1}
}

fixed_policy_values = iterative_policy_evaluation(policy=fixed_policy,gamma=0.9,convergence_threshold=CONVERGENCE_THRESHOLD)

In [10]:
print("The state-values of this policy are:")
print_values(grid=grid,V=fixed_policy_values)

The state-values of this policy are:
--------------------------
 0.81| 0.90| 1.00| 0.00|
--------------------------
 0.73| 0.00|-1.00| 0.00|
--------------------------
 0.66|-0.81|-0.90|-1.00|


In [11]:
print("The deterministic policy is:")
print_policy(grid=grid,P=fixed_policy)

The deterministic policy is:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


## Conclusions
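* The two evaluations use different discount factors (gamma = 1 for the uniform policy, gamma = 0.9 for the fixed policy), so their values are not directly comparable. With that caveat, the fixed policy looks better for the states it routes to the winning state, including the start state (2,0): 0.66 versus -0.29.
* Under the uniform random policy, most state values are negative; only (0,1) and (0,2), next to the winning state, are positive. This suggests that from most positions random moves end in the losing state more often than in the winning state.
* Under the fixed policy, the value shrinks by a factor of gamma for every extra step to the winning state: 1.00, 0.90, 0.81, 0.73, 0.66.
* The fixed policy sends (1,2), (2,1), (2,2), and (2,3) into the losing state, so their values are negative: -1.00 right next to it, then -0.90 and -0.81 farther away.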
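
The distance pattern in the fixed-policy values can be checked by hand. Assuming a reward of +1 on entering the winning state and 0 elsewhere along the way, a state that is k moves before the final winning move should have value gamma ** k. The short sketch below recomputes those numbers for the path the fixed policy follows from the start state; the `path` list is written out by hand from the policy printed above.

```python
gamma = 0.9

# States on the fixed policy's route to the winning state, nearest first
path = [(0, 2), (0, 1), (0, 0), (1, 0), (2, 0)]

# A state k moves before the final winning move is worth gamma ** k
expected = {state: gamma ** k for k, state in enumerate(path)}

for state in path:
    print(state, "%.2f" % expected[state])
```

The printed values (1.00, 0.90, 0.81, 0.73, 0.66) agree with the fixed-policy table above.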