# TD(0) Policy Evaluation
An implementation of TD(0) Policy Evaluation using a gridworld.

The code is intentionally not optimal, to keep it legible and easy to understand: the TD(0) updates are not performed "online" at each step but after the episode has finished.

The gridworld has shape (3,4), with a winning state "w" at (0,3), a losing state "l" at (1,3), an invalid state "x" at (1,1), and a start state "s" at (2,0).

|  |  |  |  |
|---|---|---|---|
|  |  |  | w |
|  | x |  | l |
| s |  |  |  |

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import grid_world

### Discount factor and step size

In [2]:

GAMMA = 0.9
ALPHA = 0.1
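
For reference, this is where the two constants enter the TD(0) update performed later in `TD0`. The symbols $s'$ (next state) and $r$ (reward received on the transition) are notation chosen here, not names taken from the code:

$$
V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]
$$

As a worked instance with $\alpha = 0.1$ and $\gamma = 0.9$: for a first transition into the winning state ($r = 1$, $V(s') = 0$ because terminal values are never updated, and $V(s) = 0$ initially), the update gives $V(s) \leftarrow 0 + 0.1\,(1 + 0.9 \cdot 0 - 0) = 0.1$.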

### Auxiliary function to display the state values after policy evaluation

In [3]:
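def print_values(V,grid):
    # print V as a table; note that this loop uses grid.width for the row index i
    # and grid.height for the column index j
    for i in range(grid.width):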
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i,j),0)
            if v >= 0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="")
        print("")

### Auxiliary function to display a policy (for a stochastic policy, only the first listed action is shown)

In [4]:
def print_policy(P,grid):
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i,j),' ')
            if isinstance(a,dict):
                a = list(a)[0]
            print("  %s  |" % a, end="")
        print("")

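### From our grid world file, import the standard grid, retrieve all actions and states, and print the grid rewards
The standard grid has no per-step penalty: in the rewards printed further down, only the two terminal states are non-zero. A negative grid, which penalizes every step, could be used instead to encourage the agent to find the shortest path to the goal.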

In [5]:
grid = grid_world.Grid.standard_grid()
states = grid.all_states()
actions = list({action for action_tup in grid.actions.values() for action in action_tup})
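
The `grid_world` module is not shown here, so the following is only a guess at a compatible interface, reconstructed from the calls the notebook makes (`standard_grid`, `all_states`, `actions`, `rewards`, `set_state`, `current_state`, `move`, `game_over`, `width`, `height`). The class name `SketchGrid`, the `MOVES` table, the per-state action lists, and the deterministic moves are all assumptions; the real module may differ.

```python
# Hypothetical stand-in for grid_world.Grid; not the original implementation.
MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}


class SketchGrid:
    def __init__(self, rows, cols, start):
        # The notebook's print helpers loop over `width` for rows and
        # `height` for columns, so that naming is mirrored here.
        self.width = rows
        self.height = cols
        self.i, self.j = start
        self.rewards = {}   # state -> reward received on entering it
        self.actions = {}   # non-terminal state -> allowed actions

    def set_state(self, s):
        self.i, self.j = s

    def current_state(self):
        return (self.i, self.j)

    def move(self, action):
        # Move only if the action is allowed in the current state,
        # otherwise stay in place. Return the reward of the resulting state.
        if action in self.actions.get((self.i, self.j), ()):
            di, dj = MOVES[action]
            self.i += di
            self.j += dj
        return self.rewards.get((self.i, self.j), 0)

    def game_over(self):
        # A state with no listed actions is treated as terminal.
        return (self.i, self.j) not in self.actions

    def all_states(self):
        return set(self.actions) | set(self.rewards)

    @classmethod
    def standard_grid(cls):
        g = cls(3, 4, (2, 0))
        g.rewards = {(0, 3): 1, (1, 3): -1}
        g.actions = {
            (0, 0): ('D', 'R'),
            (0, 1): ('L', 'R'),
            (0, 2): ('L', 'D', 'R'),
            (1, 0): ('U', 'D'),
            (1, 2): ('U', 'D', 'R'),
            (2, 0): ('U', 'R'),
            (2, 1): ('L', 'R'),
            (2, 2): ('L', 'R', 'U'),
            (2, 3): ('L', 'U'),
        }
        return g


sketch = SketchGrid.standard_grid()
sketch.set_state((0, 2))
reward = sketch.move('R')   # step into the winning state
```

Under these assumptions, stepping right from (0,2) lands in the winning state, returns its reward, and ends the episode.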

In [6]:
def argmax_dict(dictionary):
    # returns the argmax key and the max value from a dictionary
    # will be used for policy improvement from Q
    max_key = None
    max_val = float("-inf")
    
    for k,v in dictionary.items():
        if v > max_val:
            max_val = v
            max_key = k
            
    return max_key,max_val
        
argmax_dict({"a":1,"b":2})

('b', 2)
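
As a side note, the same lookup can be sketched with Python's built-in `max`. The helper name `argmax_dict_builtin` is hypothetical. It behaves like `argmax_dict` for non-empty dictionaries (the first maximal key wins on ties), but unlike `argmax_dict` it raises `ValueError` on an empty dictionary instead of returning `(None, -inf)`.

```python
def argmax_dict_builtin(dictionary):
    # hypothetical alternative to argmax_dict, for non-empty dictionaries only
    max_key = max(dictionary, key=dictionary.get)
    return max_key, dictionary[max_key]


result = argmax_dict_builtin({"a": 1, "b": 2})
print(result)  # ('b', 2)
```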

In [7]:
actions

['D', 'U', 'R', 'L']

In [8]:
def epsilon_greedy_action(policy,state,epsilon=0.1):
    # with probability epsilon pick a random non-greedy action,
    # otherwise follow the policy's (greedy) action
    probability = np.random.random()
    greedy_action = policy[state]
    # {greedy_action} rather than set(greedy_action): set() on a string splits it into characters
    non_greedy_actions = list(set(actions) - {greedy_action})
    
    if probability < epsilon:
        return np.random.choice(non_greedy_actions)
    else: 
        return greedy_action
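
To see what this selection rule does in practice, the sketch below re-implements it with the standard `random` module (instead of NumPy, so it runs without dependencies) and measures how often the greedy action is chosen. The names `epsilon_greedy_sketch`, `DEMO_ACTIONS` and `demo_policy` are hypothetical. Note that this rule explores only among the non-greedy actions, so the greedy action is picked with probability 1 - epsilon; the common textbook variant instead samples uniformly from all actions when exploring.

```python
import random

DEMO_ACTIONS = ['U', 'D', 'L', 'R']  # assumed action set, matching the notebook's output


def epsilon_greedy_sketch(policy, state, epsilon=0.1, rng=random):
    """With probability epsilon return a random non-greedy action, else the greedy one."""
    greedy_action = policy[state]
    if rng.random() < epsilon:
        return rng.choice([a for a in DEMO_ACTIONS if a != greedy_action])
    return greedy_action


rng = random.Random(0)            # fixed seed so the check is repeatable
demo_policy = {(2, 0): 'U'}
n = 20000
greedy_count = sum(
    epsilon_greedy_sketch(demo_policy, (2, 0), 0.1, rng) == 'U' for _ in range(n)
)
greedy_fraction = greedy_count / n
print(greedy_fraction)            # expected to be close to 1 - epsilon = 0.9
```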

In [9]:
def play_game(grid,policy,gamma=0.9,epsilon  = 0.1,alpha=1):
    # executes a complete episode and returns the states and rewards in the episode
    # using an epsilon-greedy policy (gamma and alpha are accepted but not used here)
    
    s = (2,0)
    grid.set_state(s)
    states_and_rewards = [(s,0)]
    
    while not grid.game_over():
        a = epsilon_greedy_action(policy,s,epsilon)
        r = grid.move(a)
        s = grid.current_state()
        
        states_and_rewards.append((s,r))
        
    return states_and_rewards


In [10]:
grid = grid_world.Grid.standard_grid()

In [11]:
print("Rewards of grid")
print_values(grid.rewards,grid)

Rewards of grid
--------------------------
 0.00| 0.00| 0.00| 1.00|
--------------------------
 0.00| 0.00| 0.00|-1.00|
--------------------------
 0.00| 0.00| 0.00| 0.00|


### Initialize  policy

In [12]:
policy = {(2,0):'U',
         (1,0):'U',
         (0,0):'R',
         (0,1):'R',
         (0,2):'R',
         (1,2):'R',
         (2,1):'R',
         (2,2):'R',
         (2,3):'U'}

print_policy(policy,grid)

---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


In [13]:
def TD0(grid,policy,episodes,gamma =1,epsilon=0.1,alpha=1):
    V = defaultdict(lambda:0)
    
    for i in range(episodes):
        states_and_rewards = play_game(grid,policy,gamma,epsilon,alpha)
        
        # perform the TD(0) updates (done after the episode rather than online; this is intentional, to maintain clarity)
        for t in range(len(states_and_rewards)-1):
            s = states_and_rewards[t][0]
            s1 = states_and_rewards[t+1][0]
            r = states_and_rewards[t+1][1]
            
            V[s] = V[s] + alpha*(r + (gamma*V[s1]) -V[s] )
        
    return V
        
V = TD0(grid,policy,10000,GAMMA,0.1,ALPHA)

In [14]:
policy

{(0, 0): 'R',
 (0, 1): 'R',
 (0, 2): 'R',
 (1, 0): 'U',
 (1, 2): 'R',
 (2, 0): 'U',
 (2, 1): 'R',
 (2, 2): 'R',
 (2, 3): 'U'}

In [15]:
print("Policy")
print_policy(policy,grid)

Policy
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


In [16]:
for state in policy.keys():
    print("state  |policy action| state value")
    print(state,"|      ",policy[state] , "    |", V[state])
    

state  |policy action| state value
(2, 0) |       U     | 0.5838810033674448
state  |policy action| state value
(1, 0) |       U     | 0.6710959132776831
state  |policy action| state value
(0, 0) |       R     | 0.7234236767056433
state  |policy action| state value
(0, 1) |       R     | 0.77008863398827
state  |policy action| state value
(0, 2) |       R     | 0.8903916223953797
state  |policy action| state value
(1, 2) |       R     | -0.93092332460674
state  |policy action| state value
(2, 1) |       R     | -0.7282953092171845
state  |policy action| state value
(2, 2) |       R     | -0.8795926533738817
state  |policy action| state value
(2, 3) |       U     | -0.9939348818040227


In [17]:
print_values(grid=grid,V=V)

--------------------------
 0.72| 0.77| 0.89| 0.00|
--------------------------
 0.67| 0.00|-0.93| 0.00|
--------------------------
 0.58|-0.73|-0.88|-0.99|
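
As a rough sanity check on the top-left values, the sketch below computes the discounted return along the policy's path from the start state when no exploration happens. It assumes deterministic moves and the rewards printed earlier (1 on entering the winning state, 0 elsewhere); the names `greedy_path` and `reference` are hypothetical.

```python
gamma = 0.9

# States visited when the policy is followed exactly from the start state;
# the next move from (0, 2) enters the winning state with reward 1.
greedy_path = [(2, 0), (1, 0), (0, 0), (0, 1), (0, 2)]

# A state k moves before the final rewarded move has return gamma ** k.
reference = {
    s: gamma ** (len(greedy_path) - 1 - k) for k, s in enumerate(greedy_path)
}

for s, v in reference.items():
    print(s, round(v, 4))
```

These reference values (for example $0.9^4 = 0.6561$ at the start state) are somewhat higher than the learned ones (0.58 at the start state). That is consistent with the fact that the episodes were generated with epsilon-greedy exploration, so the estimates describe the exploring behaviour rather than the exact deterministic policy.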


## Conclusions
* TD(0) learns from experience, like Monte Carlo methods, and doesn't need a complete model of the environment's dynamics
* It bootstraps, like dynamic programming, so it does not need to wait for episodes to end
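
To illustrate the second point, here is a minimal sketch of an online TD(0) loop that updates `V` immediately after every step instead of replaying the episode afterwards. It is not the notebook's implementation: the function `td0_online`, the `step` callback signature, and the tiny 4-state chain environment are assumptions made so the example is self-contained. With a fixed policy, the step-by-step updates are applied in the same order as in the replayed version, so only the timing of the updates changes.

```python
from collections import defaultdict


def td0_online(step, start, is_terminal, policy, episodes, gamma=0.9, alpha=0.1):
    """Online TD(0): update V[s] right after each transition s -> s_next."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = start
        while not is_terminal(s):
            s_next, r = step(s, policy[s])
            # TD(0) update using the bootstrapped target r + gamma * V[s_next]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V


# Hypothetical chain 0 -> 1 -> 2 -> 3, where 3 is terminal and
# entering it yields reward 1; every other transition yields 0.
def chain_step(s, a):
    s_next = s + 1 if a == 'R' else max(s - 1, 0)
    return s_next, (1 if s_next == 3 else 0)


chain_policy = {0: 'R', 1: 'R', 2: 'R'}
V_chain = td0_online(chain_step, 0, lambda s: s == 3, chain_policy, episodes=2000)
print({s: round(V_chain[s], 3) for s in (0, 1, 2)})
```

On this deterministic chain the estimates should approach the exact discounted returns 0.81, 0.9 and 1.0 for states 0, 1 and 2.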