# Chapter 6 - Temporal-Difference Learning
This chapter introduces TD learning, which combines ideas from both Monte Carlo methods (Chapter 5) and Dynamic Programming (Chapter 4). Key aspects:

- Like MC, it learns from experience (samples of sequences of states, actions and rewards from interacting with an environment or a simulation), not from complete knowledge of the environment.

- Like DP, it updates its state- and action-values based partially on other learned estimates, that is, it bootstraps.

### Exercise 6.1

Give an example in which a TD update would be better on average than an MC update.

__Answer__: Following the hint offered in the problem, suppose we already have good estimates of $V(s')$ for all $s' \in \mathcal{S'} \subset \mathcal{S}$, a subset of $\mathcal{S}$. The TD update for a new state $s \notin \mathcal{S'}$ is then made immediately after the transition from $s$ to $s'$:

$$
V^{TD}(s) \leftarrow V^{TD}(s) + \alpha \left[ r + \gamma V^{TD}(s') - V^{TD}(s) \right]
$$

The MC update, on the other hand, is made only at the end of the episode. Writing $R(\cdot)$ for a single sampled return from a state:

$$
V^{MC}(s) \leftarrow V^{MC}(s) + \alpha \left[ R(s) - V^{MC}(s) \right] = V^{MC}(s) + \alpha \left[ r + \gamma R(s') - V^{MC}(s) \right]
$$

Since we assumed that $V(s')$ is already very good, i.e. $V(s') \approx \mathbb{E} [R(s')]$, the only difference between the two updates for $V(s)$ is the estimate of the return from the next state $s'$: TD uses $V(s')$, a good approximation of the expected return, whereas MC uses $R(s')$, the return from a single episode. After many episodes both converge to the expected return, but TD gets there more quickly because it only needs to learn the expected value of the reward $r$ received on the transition from $s$ to $s'$, not of the entire return from $s$.

The better estimate of $V^{TD}(s)$ also becomes available sooner and propagates faster to the states that precede $s$ in an episode, so all states reach their expected values faster overall, especially the ones that come before those in $\mathcal{S'}$.

Another way to see this situation is by looking at the two estimates as expectations:

$$
\begin{aligned}
    V^{MC}(s) &= \mathbb{E} \left[ R_t \mid s_t = s \right] \\
    V^{TD}(s) &= \mathbb{E} \left[ r_{t+1} + \gamma V^{TD} (s_{t+1}) \mid s_t = s \right] \\
\end{aligned}
$$

For a previously unseen state $s$, the MC estimate needs a large number of newly sampled episodes (with their corresponding full returns) to converge to the expected value of $R_t$. The TD estimate instead exploits the fact that $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ and that we might already have a good estimate, $V(s_{t'})$, of the return $R_{t'}$ from some later time $t' > t$. TD can therefore also be more sample efficient: we only need to generate transitions from $s_t$ to $s_{t'}$ and use the (good) estimate $V(s_{t'})$ from there on.
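The argument above can be checked numerically with a toy setup. The sketch below is only an illustration under assumptions that are not part of the exercise: a new state $s$ always moves to a state $s'$ whose value is already known exactly, the one-step reward has mean 0 with small Gaussian noise, and the sampled return from $s'$ has mean 1 with much larger Gaussian noise. The function name `simulate` and all numeric settings are hypothetical.

```python
import random

def simulate(n_episodes=20, n_runs=1000, alpha=0.1, gamma=1.0, seed=0):
    """Return (mse_td, mse_mc) for V(s) after n_episodes visits to a new state s."""
    rng = random.Random(seed)
    true_v_next = 1.0                      # assumed: V(s') already equals E[R(s')]
    true_v_s = 0.0 + gamma * true_v_next   # E[r] = 0 in this toy setup
    se_td = se_mc = 0.0
    for _ in range(n_runs):
        v_td = v_mc = 0.0                  # s is new, so both estimates start at 0
        for _ in range(n_episodes):
            r = rng.gauss(0.0, 0.1)                 # noisy one-step reward
            ret_next = rng.gauss(true_v_next, 1.0)  # noisy sampled return from s'
            # TD target bootstraps on the (assumed accurate) value of s'
            v_td += alpha * (r + gamma * true_v_next - v_td)
            # MC target uses the single sampled return from s'
            v_mc += alpha * (r + gamma * ret_next - v_mc)
        se_td += (v_td - true_v_s) ** 2
        se_mc += (v_mc - true_v_s) ** 2
    return se_td / n_runs, se_mc / n_runs

mse_td, mse_mc = simulate()
print(f"TD mean squared error: {mse_td:.4f}")
print(f"MC mean squared error: {mse_mc:.4f}")
```

Both estimates share the same bias from starting at 0, so under these assumptions any gap between the two errors comes from the extra noise that the sampled return $R(s')$ injects into the MC target.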

### Exercise 6.2

From Figure 6.6 it appears that the first episode results in a change only in $V(A)$. What does this tell you about what happened on the first episode? Why was only the estimate for this one state changed? By exactly how much was it changed?

__Answer__: The first episode must have ended in the left terminal state, because $V(A)$ was reduced from its initial estimate of $0.5$ and the only positive reward is given for finishing in the right terminal state. On this episode only the very last state before the terminal changed its value. All non-terminal states start at $0.5$, the reward on every transition between them is $0$, and $\gamma = 1$, so the TD error is zero for those transitions and the estimates of all other states remain unchanged:

$$
    V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right] = 0.5 + 0.1 (0 + 1 \cdot 0.5 - 0.5) = 0.5
$$

For the last transition, from $A$ to the left terminal ($LT$), the update was:

$$
    V(A) \leftarrow V(A) + \alpha \left[ r + \gamma V(LT) - V(A) \right] = 0.5 + 0.1 (0 + 1 \cdot 0 - 0.5) = 0.5 - 0.05 = 0.45.
$$
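The two updates above can be replayed in a few lines. This is a minimal sketch, assuming $\alpha = 0.1$, $\gamma = 1$ and initial values of $0.5$ as in the calculation above. The trajectory is hypothetical (the figure does not reveal the actual path, only that it ended on the left), and `td0_episode` is an illustrative helper name.

```python
ALPHA, GAMMA = 0.1, 1.0

def td0_episode(values, trajectory, rewards, alpha=ALPHA, gamma=GAMMA):
    """Apply TD(0) updates along one episode and return the updated values."""
    v = dict(values)
    for s, s_next, r in zip(trajectory, trajectory[1:], rewards):
        v[s] += alpha * (r + gamma * v[s_next] - v[s])
    return v

# LT and RT are the left and right terminal states, with value fixed at 0.
values = {"LT": 0.0, "A": 0.5, "B": 0.5, "C": 0.5, "D": 0.5, "E": 0.5, "RT": 0.0}

# Hypothetical first episode that ends on the left: C -> B -> C -> B -> A -> LT.
trajectory = ["C", "B", "C", "B", "A", "LT"]
rewards = [0.0] * (len(trajectory) - 1)   # reward is 0 unless the walk ends on the right

after = td0_episode(values, trajectory, rewards)
for state in "ABCDE":
    print(state, f"{after[state]:.2f}")
```

Any other path that terminates on the left would give the same result, since every transition between two non-terminal states has a TD error of zero at this point.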

### Exercise 6.3

Do you think that by choosing the step-size parameter $\alpha$ differently, but still leaving it a constant, either algorithm could have done significantly better than shown in Figure 6.7? Why or why not?

__Answer__: Section 6.2 mentions that the constant-$\alpha$ update rule has been proved to converge in the mean for any sufficiently small $\alpha$.
We can write the update rule for both TD and MC in a general form:

$$
    V(s) \leftarrow V(s) + \alpha \left[ Target - V(s) \right] = (1 - \alpha) V(s) + \alpha \cdot Target
$$

So the new estimate of $V(s)$ can be seen as a weighted average of the $Target$ and the old estimate, with weights $\alpha$ and $1-\alpha$ respectively, hence $\alpha$ should lie in $(0, 1)$. Intuitively, the weight on the $Target$ should also be smaller than the weight on the old estimate, i.e. $\alpha < 1-\alpha$, since the $Target$ comes from a single state transition (for TD) or a single episode (for MC) and can therefore introduce high variance. A small $\alpha$ makes the estimate conservative, whereas a larger $\alpha$ chases the $Target$ faster, with potentially big jumps, at least in the first few updates.
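The trade-off described above can be explored empirically. The sketch below is a rough experiment, not a reproduction of Figure 6.7: it assumes the five-state random walk with true values $1/6, \dots, 5/6$, initial estimates of $0.5$, $\gamma = 1$, and 100 runs of 100 episodes each. The function name `rms_after` and the chosen values of $\alpha$ are illustrative.

```python
import random

TRUE_VALUES = [i / 6 for i in range(1, 6)]   # true values of states A..E

def rms_after(alpha, n_episodes=100, n_runs=100, seed=0):
    """Average RMS error over A..E after n_episodes of TD(0) on the random walk."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        # Index 0 and 6 are the terminal states (value 0); 1..5 are A..E.
        v = [0.0] + [0.5] * 5 + [0.0]
        for _ in range(n_episodes):
            s = 3                                    # every episode starts in C
            while 0 < s < 6:
                s_next = s + rng.choice((-1, 1))     # step left or right
                r = 1.0 if s_next == 6 else 0.0      # +1 only on the right terminal
                v[s] += alpha * (r + v[s_next] - v[s])   # TD(0) with gamma = 1
                s = s_next
        mse = sum((v[i + 1] - TRUE_VALUES[i]) ** 2 for i in range(5)) / 5
        total += mse ** 0.5
    return total / n_runs

errors = {alpha: rms_after(alpha) for alpha in (0.05, 0.1, 0.15)}
for alpha, err in errors.items():
    print(f"alpha={alpha:.2f}  RMS error after 100 episodes: {err:.3f}")
```

With a constant $\alpha$ the estimates keep fluctuating around the true values, so one would expect the larger step sizes to end with a higher residual error even though they move away from the initial guess faster in early episodes.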