# First Visit policy evaluation with stochastic dynamics
An implementation of MC first vist policy evaluation using a windy grid_world(sthocastic dynamics)

The gridworld has the shape(3,4) with a winning state "w"(0,3), and a lossing state "l"(1,3), a non valid state "x"(2,1) and a start state s(3,0)

|  |  |  |  |
|---|---|---|---|
|  |  |  | w |
|  |  |  | l |
|  | x |  |  |
| s |  |  |  |

### This is a simple not optimal policy evaluation

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import grid_world

### Define the maximum change before finish the loop

In [2]:
CONVERGENCE_THRESHOLD = 10e-5
GAMMA = 0.9

### Auxiliary function to display the values of a policy after finishing iterative policy evaluation

In [3]:
def print_values(V,grid):
    for i in range(grid.width):
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i,j),0)
            if v >= 0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="")
        print("")

### Auxiliary function to display a stochastic policy

In [4]:
def print_policy(P,grid):
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i,j),' ')
            if isinstance(a,dict):
                a = list(a)[0]
            print("  %s  |" % a, end="")
        print("")

### From or defined grid world file, import a negative grid ,retrieve all actions and states and print grid rewards
Negative grid is used to encourage the agent to find a shortest path to the goal

In [5]:
grid = grid_world.Grid.negative_grid(step_cost = -0.1)
states = grid.all_states()
actions = list(set([action   for action_tup in grid.actions.values() for action in action_tup]))

In [6]:
actions

['U', 'D', 'L', 'R']

### Make the dynamics sthocastic by randomizing p(s',r | s,a)
In windy world, the agent decides an action but with 0.5 probability takes that action  and  with 0.5/3 takes other action

In [7]:
def randomize_action(a):
    p = np.random.random()
    
    if p < 0.5:
        return a
    else:
        tmp = list(actions)
        tmp.remove(a)
        
        return np.random.choice(tmp)

In [8]:
def play_game(grid,policy,gamma=0.9):
    # exploring starts
    start_states = list(grid.actions.keys())
    start_index = np.random.choice(len(start_states))
    grid.set_state(start_states[start_index])
    
    s = grid.current_state()
    states_and_rewards = [] 
    
    # play a game until game is finished and record all new states and rewards
    while not grid.game_over():
        old_s = s
        a = randomize_action(policy[s]) #the policy provides an action for state s, but since its windy actual action is different
        r = grid.move(a)
        s = grid.current_state()
        
        states_and_rewards.append((old_s,r))
    
        
    # calculate every state returns by going backwards
    states_and_returns = []
    
    G=0
    for s,r in reversed(states_and_rewards):
        
        G = r + gamma*G
        states_and_returns.append((s,G))
        
    return states_and_returns

In [9]:
grid = grid_world.Grid.standard_grid()

In [10]:
print("Rewards of grid")
print_values(grid.rewards,grid)

Rewards of grid
--------------------------
 0.00| 0.00| 0.00| 1.00|
--------------------------
 0.00| 0.00| 0.00|-1.00|
--------------------------
 0.00| 0.00| 0.00| 0.00|


In [11]:
policy = {
    (2,0):'U',
    (1,0):'U',
    (0,0):'R',
    (0,1):'R',
    (0,2):'R',
    (1,2):'U',
    (2,1):'L',
    (2,2):'U',
    (2,3):'L'
}

In [12]:
play_game(grid,policy,GAMMA)

[((0, 2), 1.0), ((0, 1), 0.9), ((0, 1), 0.81)]

In [15]:
def firt_visit_MC(grid,policy,episodes,gamma):
    V = defaultdict(lambda:0)
    state_visit_count = defaultdict(lambda:0)
    
    for i in range(episodes):
        states_and_returns = play_game(grid,policy,gamma)
        visited_states = set()
        
        for s,G in states_and_returns:
            
            if s not in visited_states:
                
                V[s]  = V[s] + G
                visited_states.add(s)
                state_visit_count[s] += 1
            
                
    for s in V.keys():
        if state_visit_count[s] != 0 :
            V[s] = V[s]/state_visit_count[s]
        else:
            V[s] = 0
    
    return V
        
V = firt_visit_MC(grid,policy,100000,GAMMA)

In [16]:
print("Values")
print_values(V,grid)

Values
--------------------------
 0.58| 0.72| 0.86| 0.00|
--------------------------
 0.45| 0.00| 0.23| 0.00|
--------------------------
 0.34| 0.23| 0.13|-0.19|


In [17]:
print("Policy")
print_policy(policy,grid)

Policy
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  L  |  U  |  L  |


## Conclusions
* The result is the same as "iterative policy evaluation" in dynamic programming but seems to converge faster
* One of the advantages over dynamyc programing is this does not need a model of the dynamics, its learned by experience