# TD(λ) for prediction
An implementation of TD(λ) ("td-lambda") prediction using a gridworld.

Based on: https://youtu.be/PnHCvfgC_ZA?t=5546

More information can be found in "Reinforcement Learning: An Introduction", 2nd edition, by Sutton and Barto.


The gridworld has shape (3, 4), with a winning state "w" at (0, 3), a losing state "l" at (1, 3), an invalid (wall) state "x" at (1, 1), and a start state "s" at (2, 0). These coordinates match the rewards grid and the policy printed further down.

|  |  |  |  |
|---|---|---|---|
|  |  |  | w |
|  | x |  | l |
| s |  |  |  |

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
import pixiedust

import grid_world

Pixiedust database opened successfully


### Discount factor and step size

In [2]:

GAMMA = 0.9
ALPHA = 0.1

### Auxiliary function to display the values of a policy after finishing iterative policy evaluation

In [3]:
def print_values(V,grid):
    for i in range(grid.width):
        print("--------------------------")
        for j in range(grid.height):
            v = V.get((i,j),0)
            if v >= 0:
                print(" %.2f|" % v, end="")
            else:
                print("%.2f|" % v, end="")
        print("")

### Auxiliary function to display a policy (for a stochastic policy, only the first listed action of each state is shown)

In [4]:
def print_policy(P,grid):
    for i in range(grid.width):
        print("---------------------------")
        for j in range(grid.height):
            a = P.get((i,j),' ')
            if isinstance(a,dict):
                a = list(a)[0]
            print("  %s  |" % a, end="")
        print("")

### From our grid world file, create a negative grid, retrieve all actions and states, and print the grid rewards
A negative grid gives a small negative reward on every non-terminal step, which encourages the agent to find the shortest path to the goal.
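
The effect of the step penalty can be illustrated with a minimal sketch. It does not use `grid_world`; `path_return` is a hypothetical helper, and the rewards (-0.1 per step, +1 at the goal) are taken from the rewards grid printed below. It compares a 5-step and a 7-step path to the goal, first with the penalty and discounting, then with neither.

```python
def path_return(n_steps, step_reward=-0.1, goal_reward=1.0, gamma=0.9):
    """Discounted return of a path that reaches the goal on its last step."""
    rewards = [step_reward] * (n_steps - 1) + [goal_reward]
    g = 0.0
    # Accumulate the return backwards: G = r + gamma * G_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g


for n_steps in (5, 7):
    with_penalty = path_return(n_steps)
    no_penalty_no_discount = path_return(n_steps, step_reward=0.0, gamma=1.0)
    print(n_steps, round(with_penalty, 4), round(no_penalty_no_discount, 4))
```

Without a step penalty and without discounting, both paths have the same return, so nothing distinguishes the shorter one; with the penalty, the shorter path has the higher return.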

In [5]:
grid = grid_world.Grid.negative_grid()
states = grid.all_states()
actions = list({action for action_tup in grid.actions.values() for action in action_tup})

In [6]:
actions

['R', 'D', 'U', 'L']

In [7]:
print("Rewards of grid")
print_values(grid.rewards,grid)

Rewards of grid
--------------------------
-0.10|-0.10|-0.10| 1.00|
--------------------------
-0.10| 0.00|-0.10|-1.00|
--------------------------
-0.10|-0.10|-0.10|-0.10|


### Initialize policy

In [8]:
policy = {(2,0):'U',
         (1,0):'U',
         (0,0):'R',
         (0,1):'R',
         (0,2):'R',
         (1,2):'R',
         (2,1):'R',
         (2,2):'R',
         (2,3):'U'}

print_policy(policy,grid)

---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


In [9]:
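def elegibility_trace_increment(elegibility_trace, previous_step, gamma, lambda_val):
    # Decay the trace of every state by gamma * lambda
    for state in states:
        elegibility_trace[state] = gamma * lambda_val * elegibility_trace[state]

    # Accumulating trace: add 1 for the state that was just left
    elegibility_trace[previous_step] += 1

    return elegibility_trace


def update_V(V, td_error, elegibility_trace, alpha=0.5):
    # Move every known state value along the TD error, scaled by its trace
    for state in V.keys():
        V[state] = V[state] + alpha * td_error * elegibility_trace[state]

    return V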
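
As a reading aid, the two functions above correspond to the backward-view update with accumulating traces. The notation is an assumption borrowed from the usual textbook convention ($S_t$ for the state left at step $t$, $R_{t+1}$ for the reward received on leaving it); the code itself uses `elegibility_trace`, `td_error`, `alpha`, `gamma` and `lambda_val`.

$$
\begin{aligned}
E_t(s) &= \gamma \lambda \, E_{t-1}(s) + \mathbb{1}[s = S_t] \\
\delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
V(s) &\leftarrow V(s) + \alpha \, \delta_t \, E_t(s) \quad \text{for all } s
\end{aligned}
$$

The first line is `elegibility_trace_increment`, the second is the `td_error` computed inside the training loop, and the third is `update_V`.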

In [24]:
#%%pixie_debugger
def td_lambda(policy,episodes =1,gamma = 0.9,alpha = 0.9,lambda_val = 0.5):
    """
    Estimate V for a fixed policy with backward-view TD(lambda).

    gamma = discount factor
    alpha = step size
    lambda_val = trace decay
    """
    V = defaultdict(lambda:0)
                    
    for episode in range(episodes):
        s = (2,0)
        grid.set_state(s)
        E = defaultdict(lambda:0) # eligibility trace (one entry per state), reset every episode
        finished = False
        
        while not finished:
            a = policy[s]
            r = grid.move(a)
            s1 = grid.current_state()
            finished = grid.game_over()
            
            E = elegibility_trace_increment(E,s,gamma,lambda_val)
            td_error = r + gamma*V[s1] - V[s]
            V = update_V(V,td_error,E,alpha)
            
            s = s1
        
    return V

V = td_lambda(policy,episodes = 100,gamma = 0.9,alpha = 0.5,lambda_val = 0.5)
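
Because `grid_world` is an external module, the same loop can be tried in isolation on a stand-in environment. The sketch below is an assumption-laden simplification, not the notebook's grid: `td_lambda_chain` is a hypothetical function that runs the same backward-view update on a deterministic chain whose rewards mimic the path the policy follows from the start state (four steps at -0.1, then +1 at the goal).

```python
from collections import defaultdict


def td_lambda_chain(rewards, episodes=100, gamma=0.9, alpha=0.5, lambda_val=0.5):
    """Backward-view TD(lambda) on a hypothetical deterministic chain.

    State i always moves to state i + 1 and receives rewards[i];
    state len(rewards) is terminal.
    """
    n = len(rewards)
    V = defaultdict(float)
    for _ in range(episodes):
        E = defaultdict(float)  # traces are reset at the start of each episode
        for s in range(n):
            s1 = s + 1
            # Decay every existing trace, then bump the state just left
            for state in list(E):
                E[state] *= gamma * lambda_val
            E[s] += 1.0
            # The terminal state never gets a trace, so V[n] stays 0
            td_error = rewards[s] + gamma * V[s1] - V[s]
            for state, trace in E.items():
                V[state] += alpha * td_error * trace
    return V


path_rewards = [-0.1, -0.1, -0.1, -0.1, 1.0]
V_chain = td_lambda_chain(path_rewards)
print([round(V_chain[s], 2) for s in range(len(path_rewards))])
```

With the same `gamma`, `alpha` and `lambda_val` as the call above, the chain values should settle close to the values the notebook prints for the states along the policy's path.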



In [25]:
policy


{(0, 0): 'R',
 (0, 1): 'R',
 (0, 2): 'R',
 (1, 0): 'U',
 (1, 2): 'R',
 (2, 0): 'U',
 (2, 1): 'R',
 (2, 2): 'R',
 (2, 3): 'U'}

In [26]:
print("Policy")

print_policy(policy,grid)

Policy
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  R  |     |
---------------------------
  U  |  R  |  R  |  U  |


In [27]:
print_values(grid=grid,V=V)

--------------------------
 0.62| 0.80| 1.00| 0.00|
--------------------------
 0.46| 0.00| 0.00| 0.00|
--------------------------
 0.31| 0.00| 0.00| 0.00|
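
These values can be cross-checked by hand. Assuming moves are deterministic and the reward is received on entering a state (the `grid_world` source is not shown here, so both are assumptions), the policy follows the path (2,0) → (1,0) → (0,0) → (0,1) → (0,2) → (0,3), and each value satisfies $V(s) = r + \gamma V(s')$ with $\gamma = 0.9$ and $V(0,3) = 0$ at the terminal state:

$$
\begin{aligned}
V(0,2) &= 1.0 + 0.9 \cdot 0 = 1.00 \\
V(0,1) &= -0.1 + 0.9 \cdot 1.00 = 0.80 \\
V(0,0) &= -0.1 + 0.9 \cdot 0.80 = 0.62 \\
V(1,0) &= -0.1 + 0.9 \cdot 0.62 = 0.458 \approx 0.46 \\
V(2,0) &= -0.1 + 0.9 \cdot 0.458 = 0.3122 \approx 0.31
\end{aligned}
$$

The remaining states stay at 0.00 because every episode starts at (2,0) and the policy never visits them, so their traces are never incremented.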


## Conclusions
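* TD(λ) with eligibility traces (the backward view) is an efficient, incremental mechanism for the forward-view λ-return, a weighted average of n-step returns. The two are exactly equivalent when updates are applied offline (at the end of the episode) and a close approximation when applied online, as in this notebook.
* The eligibility trace combines recency and frequency: states visited recently or often receive a larger share of each TD error in the backward view.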
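
Both points can be made concrete with a small sketch. The helper names `lambda_return_weights` and `trace_history` are hypothetical and not part of the notebook; the first lists the weights the λ-return places on the n-step returns of an episode, the second follows the accumulating trace of a single state over time.

```python
def lambda_return_weights(lambda_val, steps_to_end):
    """Weights the lambda-return gives to the n-step returns of an episode.

    The n-step returns with n < steps_to_end get (1 - lambda) * lambda**(n - 1);
    the remaining mass, lambda**(steps_to_end - 1), goes to the full return.
    """
    weights = [(1 - lambda_val) * lambda_val ** (n - 1) for n in range(1, steps_to_end)]
    weights.append(lambda_val ** (steps_to_end - 1))
    return weights


def trace_history(visits, gamma=0.9, lambda_val=0.5):
    """Accumulating trace of one state; visits[t] is True when it is visited at step t."""
    trace, history = 0.0, []
    for visited in visits:
        # Decay by gamma * lambda on every step, add 1 on each visit
        trace = gamma * lambda_val * trace + (1.0 if visited else 0.0)
        history.append(trace)
    return history


weights = lambda_return_weights(0.5, steps_to_end=5)
print(weights, sum(weights))

history = trace_history([True, False, True, False, False])
print([round(e, 4) for e in history])
```

The weights always sum to 1, which is what makes the λ-return an average. In the trace history, the second visit pushes the trace above its value after the first visit (frequency), and the trace then decays geometrically once the visits stop (recency).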