# Write-up and code for Feb 13

## To Do
- Implement the Forward-View TD($\lambda$) algorithm for Value Function Prediction
- Implement Backward-View TD($\lambda$), i.e., the Eligibility Traces algorithm, for Value Function Prediction
- Implement these algorithms as offline or online algorithms (offline means updates are applied only after a full simulation trace, online means updates are applied at every time step)
- Test these algorithms on some example MDPs, compare them against DP Policy Evaluation, and plot their accuracy as a function of $\lambda$
- Prove that Offline Forward-View TD($\lambda$) and Offline Backward-View TD($\lambda$) are equivalent. We covered the proof for $\lambda = 1$ in class. Do the proof for arbitrary $\lambda$ (using a telescoping argument similar to the one from class) for the case where a state appears only once in an episode.
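
The backward-view item in the list above, together with the offline/online distinction, can be sketched as follows. This is a minimal, self-contained illustration rather than the course's reference implementation: the function name `backward_view_td_lambda`, the `(state, reward, next_state)` episode format, and the `online` flag are all assumptions made for this example, and it uses accumulating traces.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Tuple

# Hypothetical episode format: a list of (state, reward, next_state) transitions,
# where next_state is None once the episode has terminated.
Transition = Tuple[Hashable, float, Optional[Hashable]]


def backward_view_td_lambda(
    episodes: List[List[Transition]],
    gamma: float,
    lambd: float,
    alpha: float,
    online: bool = True,
) -> Dict[Hashable, float]:
    v: Dict[Hashable, float] = defaultdict(float)
    for episode in episodes:
        trace: Dict[Hashable, float] = defaultdict(float)    # eligibility traces
        pending: Dict[Hashable, float] = defaultdict(float)  # offline increments
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else v[s_next]
            delta = r + gamma * v_next - v[s]  # one-step TD error
            # decay every trace, then bump the trace of the visited state
            for x in list(trace):
                trace[x] *= gamma * lambd
            trace[s] += 1.0
            for x, e in trace.items():
                if online:
                    v[x] += alpha * delta * e        # apply immediately
                else:
                    pending[x] += alpha * delta * e  # apply at episode end
        for x, dv in pending.items():
            v[x] += dv
    return dict(v)
```

For example, on the single episode `[("A", 0.0, "B"), ("B", 1.0, None)]` with `gamma=1.0`, `alpha=0.5`, and `online=False`, setting `lambd=1.0` moves both states halfway toward the Monte Carlo return of 1, while `lambd=0.0` leaves `A` at 0 because its only TD error was computed against the still-zero estimate for `B`.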

## Forward View TD($\lambda$) for Value Function Prediction
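
For reference, the quantities the code in this section computes can be written out as below. The notation is the standard textbook one and is assumed here, so it may differ slightly from the lecture slides: $T$ is the time step at which the episode ends, $V$ of a terminal state is $0$, and $G_t$ is the full return from time $t$.

$$G_t^{(n)} = \sum_{k=1}^{n} \gamma^{k-1} R_{t+k} + \gamma^{n} V(S_{t+n})$$

$$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$$

$$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\lambda} - V(S_t) \right)$$

The weights in $G_t^{\lambda}$ sum to $1$, so $\lambda = 0$ recovers the TD(0) target $G_t^{(1)}$ and $\lambda = 1$ recovers the Monte Carlo return $G_t$.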

In [3]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [8]:
from typing import List
from modules.MDP import V, Policy
from modules.state_action_vars import S, A
import numpy as np

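def n_step_return(return_series: List[float], state_series: List[S], gamma: float, n: int, vf: V) -> np.ndarray:
    # returns all the k-step returns for k = 1 up to n, as a 1-D array
    # assumes return_series[i] is the reward received on leaving state_series[i]
    n = min(n, len(return_series))  # cannot look past the end of the path
    g_n = np.zeros(n)

    for i in range(n):
        # the reward at step i contributes to every k-step return with k > i
        g_n[i:] += np.power(gamma, i) * return_series[i]
        # bootstrap the (i+1)-step return from the state reached after i+1 steps;
        # a state past the end of the path is treated as terminal (value 0)
        if i + 1 < len(state_series):
            g_n[i] += np.power(gamma, i + 1) * vf[state_series[i + 1]]

    return g_n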


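def sum_n_step_return(g_n: np.ndarray, lambd: float) -> float:
    # take in all the n-step returns and return the lambda summation from the slides
    # on a finite path the sum is truncated, so the leftover weight lambd**(N-1) is
    # given to the longest return; the weights then sum to 1 and lambd = 1 gives
    # the full return instead of 0
    g_n = np.ravel(g_n)
    g_lambda = 0.0
    for j, g in enumerate(g_n[:-1]):
        g_lambda += (1 - lambd) * np.power(lambd, j) * g

    return float(g_lambda + np.power(lambd, len(g_n) - 1) * g_n[-1])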

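def forward_view_TD_lambda(policy: Policy, mdp: "MDP", lambd: float, num_epi: int, num_steps: int, alpha: float = 0.1) -> V:
    # implementation of forward view TD(lambda) as it is outlined in the lecture slides
    # NOTE: MDP is quoted because it is not imported in this cell, and generate_path
    # is assumed to be defined elsewhere in the project (it is not shown here)
    # alpha is the step size; it was previously undefined, and 0.1 is only a placeholder default
    v = {}
    gamma = mdp.gamma
    for s in mdp.States:
        v[s] = 0.0

    for i in range(num_epi):
        # generate an episode
        s_list, a_list, r_list = generate_path(policy, mdp, num_steps)

        for j in range(len(r_list)):
            # only len(r_list) - j rewards remain after time step j
            g_n = n_step_return(r_list[j:], s_list[j:], gamma, len(r_list) - j, v)
            g_lambd = sum_n_step_return(g_n, lambd)
            v[s_list[j]] += alpha * (g_lambd - v[s_list[j]])

    return v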

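
As a quick sanity check of the $\lambda$-return used by forward-view TD($\lambda$), the sketch below computes it for one short, hand-made path without relying on the project modules. The helper name `lambda_return` and its argument layout are assumptions made for this example only: `rewards[k]` is the reward after $k+1$ steps, and `next_values[k]` is the current value estimate of the state reached after $k+1$ steps (0 for a terminal state).

```python
import numpy as np


def lambda_return(rewards, next_values, gamma, lambd):
    """Lambda-return for a finite path (hypothetical helper for illustration)."""
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)  # gamma^0, gamma^1, ..., gamma^(n-1)
    # k-step returns for k = 1..n: discounted rewards plus a bootstrapped tail
    g_n = np.cumsum(discounts * rewards) + gamma * discounts * next_values
    # geometric weights, with the leftover mass placed on the longest return
    weights = (1.0 - lambd) * lambd ** np.arange(n)
    weights[-1] = lambd ** (n - 1)
    return float(weights @ g_n)


rewards = [1.0, 2.0, 3.0]
next_values = [0.5, 0.25, 0.0]  # the last state is terminal
td0_target = lambda_return(rewards, next_values, gamma=0.9, lambd=0.0)
mc_return = lambda_return(rewards, next_values, gamma=0.9, lambd=1.0)
mixed = lambda_return(rewards, next_values, gamma=0.9, lambd=0.5)
```

Here the three n-step returns are 1.45, 3.0025, and 5.23, so `td0_target` is the one-step target 1.45, `mc_return` is the full discounted return 5.23, and `mixed` weights the three returns by 0.5, 0.25, and 0.25.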