## Chapter 6: Eligibility Traces

Eligibility traces are a basic mechanism of reinforcement learning that bridges Temporal-Difference (TD) and Monte-Carlo (MC) methods.

Combined with TD methods such as Q-learning or SARSA, they yield a more general family of methods that may learn more efficiently.



### TD$(\lambda)$

![n_step-2.png](attachment:n_step-2.png)

In Temporal-Difference value prediction, the return $G_t$ is approximated by bootstrapping from the value estimate of the next state:
</br>
</br>
<font size="3">
$$\begin{align}
G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \approx R_{t+1} + \gamma V(S_{t+1})
\end{align}$$
</font>

In MC, the return of state $S_t$ is computed at the end of the episode, whereas the bootstrapped TD return is computed one step after $S_t$.

#### n-step TD Prediction

We can consider an intermediate form between the TD and MC returns, in which bootstrapping is performed after several steps of observed rewards:
</br>
</br>
<font size="3">
$$\begin{align}
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^{n}V(S_{t+n})
\end{align}$$
</font>
which is called the **n-step return**.
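The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `n_step_return`, the list-based episode layout (`rewards[k]` holds $R_{k+1}$, `values[k]` holds $V(S_k)$) and the example numbers are all assumptions made for this sketch. The terminal state is assumed to have value 0, so the return is truncated when $t + n$ reaches the end of the episode.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n) for one finished episode.

    Assumed layout (hypothetical, for illustration only):
      rewards[k] is R_{k+1}, the reward received after leaving S_k
      values[k]  is the current estimate V(S_k), k = 0..T-1
    The terminal state S_T is assumed to have value 0.
    """
    T = len(rewards)
    last = min(t + n, T)  # truncate at the end of the episode
    g = 0.0
    for k in range(t, last):
        g += gamma ** (k - t) * rewards[k]
    if t + n < T:
        g += gamma ** n * values[t + n]  # bootstrap from V(S_{t+n})
    return g


# Hypothetical three-step episode (T = 3).
rewards = [1.0, 0.0, 2.0]   # R_1, R_2, R_3
values = [0.5, 1.0, 1.5]    # V(S_0), V(S_1), V(S_2)
gamma = 0.9

one_step = n_step_return(rewards, values, t=0, n=1, gamma=gamma)  # TD target
two_step = n_step_return(rewards, values, t=0, n=2, gamma=gamma)
mc = n_step_return(rewards, values, t=0, n=3, gamma=gamma)        # full return
```

With $n = 1$ the function reduces to the one-step TD target, and with $n \ge T - t$ it reduces to the Monte-Carlo return, since no bootstrapping term remains.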

The n-step return then serves as the target of the n-step TD prediction update:
</br>
</br>
<font size="3">
$$\begin{align}
V(S_t) \leftarrow V(S_t) + \alpha(G_t^{(n)} - V(S_t))
\end{align}$$
</font>
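As a rough sketch of how this update might be applied, the snippet below processes one finished episode and updates every visited state toward its n-step return. The function name `n_step_td_episode`, the dictionary-based value table and the toy episode are assumptions for illustration. States are processed in time order using the latest estimates, and the terminal state is assumed to have value 0.

```python
def n_step_td_episode(states, rewards, V, n, alpha, gamma):
    """Apply the n-step TD update to each visited state of one episode.

    Assumed layout (hypothetical): states[k] is S_k for k = 0..T-1,
    rewards[k] is R_{k+1}, and V maps state -> value estimate.
    V is updated in place.
    """
    T = len(rewards)
    for t in range(T):
        last = min(t + n, T)
        # Discounted sum of the next n rewards (fewer near the end).
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, last))
        if t + n < T:
            g += gamma ** n * V[states[t + n]]  # bootstrap term
        V[states[t]] += alpha * (g - V[states[t]])
    return V


# Toy chain A -> B -> C -> terminal with a reward of 1 on the last step.
V2 = {"A": 0.0, "B": 0.0, "C": 0.0}
n_step_td_episode(["A", "B", "C"], [0.0, 0.0, 1.0], V2, n=2, alpha=0.5, gamma=1.0)

V1 = {"A": 0.0, "B": 0.0, "C": 0.0}
n_step_td_episode(["A", "B", "C"], [0.0, 0.0, 1.0], V1, n=1, alpha=0.5, gamma=1.0)
```

In this toy episode the two-step update moves both `B` and `C` toward the final reward, whereas the one-step update changes only `C`. This illustrates how a larger $n$ propagates reward information further back in a single episode.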


#### $\lambda$-return

We can also average n-step returns over different values of n, which interpolates between TD and Monte-Carlo methods.
The $\lambda$-return is a weighted average of all n-step returns $G_t^{(n)}$, with $\lambda \in [0, 1]$:
</br>
</br>
<font size="3">
$$\begin{align}
G_t^{(\lambda)} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
\end{align}$$
</font>

In the $\lambda$-return, the one-step return is given the largest weight, $(1-\lambda)$; the two-step return is given $(1-\lambda)\lambda$, the three-step return $(1-\lambda)\lambda^2$, and so on. After a terminal state has been reached, all subsequent n-step returns are equal to the full return $G_t$. Separating these post-termination terms from the main sum yields

</br>
</br>
<font size="3">
$$\begin{align}
G_t^{(\lambda)} &= (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t \\
&= (1-\lambda) G_t^{(1)} + (1-\lambda)\lambda G_t^{(2)} + \cdots + \lambda^{T-t-1} G_t
\end{align}$$
</font>

The last term of $G_t^{(\lambda)}$ is $\lambda^{T-t-1} G_t$ because the weights of all n-step returns have to sum to 1. Setting $\lambda = 0$ recovers the one-step TD target, and $\lambda = 1$ recovers the Monte-Carlo return.
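The episodic form of the $\lambda$-return can be checked numerically with a short sketch. The function name `lambda_return`, the list layout (`rewards[k]` is $R_{k+1}$, `values[k]` is $V(S_k)$) and the example numbers are assumptions for illustration. The terminal state is assumed to have value 0.

```python
def lambda_return(rewards, values, t, lam, gamma):
    """Episodic lambda-return G_t^(lambda) as a weighted sum of n-step returns.

    Assumed layout (hypothetical): rewards[k] is R_{k+1}, values[k] is V(S_k),
    and the terminal state S_T (T = len(rewards)) has value 0.
    """
    T = len(rewards)

    def n_step(n):
        last = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, last))
        if t + n < T:
            g += gamma ** n * values[t + n]
        return g

    g_lam = 0.0
    for n in range(1, T - t):  # n = 1 .. T-t-1
        g_lam += (1 - lam) * lam ** (n - 1) * n_step(n)
    # All post-termination terms collapse into the full return G_t.
    g_lam += lam ** (T - t - 1) * n_step(T - t)
    return g_lam


rewards = [1.0, 0.0, 2.0]   # hypothetical episode, T = 3
values = [0.5, 1.0, 1.5]
gamma = 0.9

g_td = lambda_return(rewards, values, t=0, lam=0.0, gamma=gamma)   # one-step TD target
g_mix = lambda_return(rewards, values, t=0, lam=0.5, gamma=gamma)
g_mc = lambda_return(rewards, values, t=0, lam=1.0, gamma=gamma)   # Monte-Carlo return

# Weights for t = 0, T = 3, lambda = 0.5: two n-step terms plus the tail term.
lam, T, t = 0.5, 3, 0
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, T - t)] + [lam ** (T - t - 1)]
```

Here `weights` is `[0.5, 0.25, 0.25]`, which sums to 1 as the text requires, and the two extreme settings of $\lambda$ reproduce the TD target and the Monte-Carlo return.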

![lambda_return.png](attachment:lambda_return.png)


### Forward View of TD$(\lambda)$

There are two ways to view eligibility traces. The first, the forward view of TD$(\lambda)$, is the more theoretical one.

![forward_view.png](attachment:forward_view.png)

The forward view of TD$(\lambda)$:

- Updates the value function toward the $\lambda$-return
- Looks into the future to compute $G_t^{(\lambda)}$
- Can only be computed from complete episodes
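The points above can be sketched as an offline update that waits for the episode to finish, computes every $\lambda$-return, and only then changes the value function. The sketch uses the recursive form $G_t^{(\lambda)} = R_{t+1} + \gamma\,[(1-\lambda)V(S_{t+1}) + \lambda G_{t+1}^{(\lambda)}]$, which is algebraically equivalent to the weighted sum of n-step returns. The function name `forward_view_update`, the data layout and the example values are assumptions for illustration. The terminal state is assumed to have value 0.

```python
def forward_view_update(states, rewards, V, lam, alpha, gamma):
    """Offline forward-view TD(lambda) update for one complete episode.

    Assumed layout (hypothetical): states[k] is S_k for k = 0..T-1,
    rewards[k] is R_{k+1}, V maps state -> value estimate (updated in place).
    Returns the list of lambda-returns used as targets.
    """
    T = len(rewards)
    targets = [0.0] * T
    g_next = 0.0  # lambda-return from the terminal state
    v_next = 0.0  # V(S_T) = 0 by assumption
    for t in reversed(range(T)):
        targets[t] = rewards[t] + gamma * ((1 - lam) * v_next + lam * g_next)
        g_next = targets[t]
        v_next = V[states[t]]

    # Offline: accumulate all increments with the old V, then apply them.
    increments = {}
    for t in range(T):
        s = states[t]
        increments[s] = increments.get(s, 0.0) + alpha * (targets[t] - V[s])
    for s, inc in increments.items():
        V[s] += inc
    return targets


V = {"A": 0.5, "B": 1.0, "C": 1.5}
targets = forward_view_update(
    ["A", "B", "C"], [1.0, 0.0, 2.0], V, lam=0.5, alpha=0.1, gamma=0.9
)
```

The backward loop is what makes the "complete episode" requirement concrete: the target for $S_t$ depends on rewards and values from every later time step.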

### Backward View of TD$(\lambda)$

The backward view of TD$(\lambda)$ is a mechanistic view of eligibility traces. Unlike the forward view, the backward view updates online, at every step, and can therefore learn from incomplete episodes.

![backward_view.png](attachment:backward_view.png)

In the backward view of TD$(\lambda)$, each state carries an additional variable called an **eligibility trace**. The eligibility trace for state $s$ at time $t$ is a random variable denoted $E_t(s) \in \mathbb{R}^+$. On each step $t$, $E_t(s)$ is updated as follows:
</br>
</br>
<font size="3">
$$\begin{align}
E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbb{1}(S_t = s)
\end{align}$$
</font>

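where $\gamma$ is the discount rate and $\lambda$ is the *trace-decay* hyperparameter. On every step, the trace of each state decays by a factor of $\gamma\lambda$, and $E_t(s)$ is additionally incremented by **1** if state $s$ is the one visited at time $t$ (that is, $S_t = s$).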

**Eligibility traces** keep a simple record of which states have recently been visited, where "recently" is measured in terms of $\gamma\lambda$. They indicate the degree to which each state is eligible for learning from the reinforcing events, which here are the **one-step TD errors**.
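To make the decay-and-increment behaviour concrete, here is a small sketch of the trace update on a two-state example. The helper name `update_traces`, the dictionary representation and the visit sequence are assumptions for illustration. Traces are assumed to start at zero.

```python
def update_traces(E, visited, gamma, lam):
    """One step of the accumulating trace update, in place.

    E maps state -> trace value; `visited` is the state S_t seen at this step.
    """
    for s in E:
        E[s] *= gamma * lam  # every trace decays by gamma * lambda
    E[visited] += 1.0        # the visited state's trace is bumped by 1
    return E


E = {"A": 0.0, "B": 0.0}     # traces assumed to start at zero
history = []
for s in ["A", "B", "B", "A"]:   # hypothetical visit sequence
    update_traces(E, s, gamma=1.0, lam=0.5)
    history.append(dict(E))

# history[0] == {"A": 1.0,   "B": 0.0}
# history[1] == {"A": 0.5,   "B": 1.0}
# history[2] == {"A": 0.25,  "B": 1.5}
# history[3] == {"A": 1.125, "B": 0.75}
```

The trace of `B` exceeds 1 after two consecutive visits, which shows that these traces accumulate with repeated visits rather than being capped at 1.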

In the backward view of TD$(\lambda)$, the TD errors propagate to all recently visited states as follows:
</br>
</br>
<font size="3">
$$\begin{align}
V_{t+1}(s) = V_t(s) + \alpha \delta_t E_t(s), \quad \text{for all} \; s \in \mathcal{S}
\end{align}$$
</font>

where $\delta_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$ is the one-step TD error.
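Putting the trace update and the value update together gives a compact sketch of tabular backward-view TD$(\lambda)$ for a single episode. The function name `td_lambda_episode`, the `(state, reward, next_state, done)` transition format and the toy chain are assumptions for illustration. Traces are assumed to be reset to zero at the start of the episode, and the terminal state is assumed to have value 0.

```python
def td_lambda_episode(transitions, V, lam, alpha, gamma):
    """Online backward-view TD(lambda) over one episode, updating V in place.

    transitions: list of (state, reward, next_state, done) tuples
                 (hypothetical format chosen for this sketch).
    V: dict mapping every state to its current value estimate.
    """
    E = {s: 0.0 for s in V}  # eligibility traces, reset per episode
    for s, r, s_next, done in transitions:
        # Decay all traces, then bump the trace of the visited state.
        for k in E:
            E[k] *= gamma * lam
        E[s] += 1.0

        # One-step TD error; the terminal state has value 0.
        v_next = 0.0 if done else V[s_next]
        delta = r + gamma * v_next - V[s]

        # Broadcast the TD error to all states in proportion to their trace.
        for k in V:
            V[k] += alpha * delta * E[k]
    return V


# Toy chain A -> B -> C -> terminal with a reward of 1 on the last step.
V = {"A": 0.0, "B": 0.0, "C": 0.0}
episode = [("A", 0.0, "B", False), ("B", 0.0, "C", False), ("C", 1.0, None, True)]
td_lambda_episode(episode, V, lam=0.5, alpha=0.5, gamma=1.0)
```

In this toy run the single non-zero TD error at the last step updates all three states at once, with `A`, `B` and `C` receiving credit in proportion to their traces (0.25, 0.5 and 1).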

### Online and Offline updates of forward and backward TD($\lambda$)