# Temporal Difference Learning

Temporal Difference (TD) learning combines the strengths of Dynamic Programming (DP) and Monte Carlo (MC) methods.
\
It takes sampling from MC and bootstrapping from DP.

Consider the incremental update rule: __NewEstimate = OldEstimate + StepSize (Target - OldEstimate)__. With the sampled return $G_t$ as the target, this gives the constant-$\alpha$ Monte Carlo update:

$$ V(S_t) \leftarrow V(S_t) + \alpha \left [ G_t - V(S_t) \right ] $$

The return satisfies the recursion $G_t = R_{t+1} + \gamma . G_{t+1}$, so the update can be written as

$$ V(S_t) \leftarrow V(S_t) + \alpha \left [ R_{t+1} + \gamma . G_{t+1} - V(S_t) \right ] $$

__Bootstrap__ by replacing the unknown return $G_{t+1}$ with the current estimate $V(S_{t+1})$, which gives the TD(0) update:

$$ V(S_t) \leftarrow V(S_t) + \alpha \left [ R_{t+1} + \gamma . V(S_{t+1}) - V(S_t) \right ] $$

__TD error__: $\delta_t = R_{t+1} + \gamma . V(S_{t+1}) - V(S_t)$
\
__TD target__: $R_{t+1} + \gamma . V(S_{t+1})$

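* Unlike DP, TD does not require a model of the environment dynamics.
* Unlike MC, TD does not need to wait until the episode terminates; it can update after every step.
* In practice, TD usually converges faster than constant-$\alpha$ MC on stochastic tasks, although this has not been proven in general.
* Tabular TD(0) has been proven to converge to $v_\pi$ for any fixed policy $\pi$, provided the step size is sufficiently small (convergence in the mean) or decreases according to the usual stochastic approximation conditions (convergence with probability 1).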
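
As a minimal sketch of the TD(0) update above, the function below applies a single update to a tabular value estimate. The name `td0_update`, the dictionary-based table `V`, and the two-state example are assumptions made for illustration, not part of the original notes.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one tabular TD(0) update to V in place and return the TD error."""
    # The value of a terminal state is defined to be zero.
    v_next = 0.0 if terminal else V[s_next]
    td_target = r + gamma * v_next   # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[s]      # delta_t
    V[s] += alpha * td_error         # move V(S_t) a step towards the target
    return td_error


# Hypothetical two-state example: one observed transition A -> B with reward 0.
V = {"A": 0.5, "B": 0.5}
delta = td0_update(V, "A", r=0.0, s_next="B", alpha=0.1, gamma=0.9)
print(delta, V["A"])
```

The update uses only one sampled transition, so it needs neither the environment dynamics nor the end of the episode.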

## Sarsa (TD with GPI)

* On-policy
* We need action-value functions for control
* Sarsa is a sample-based algorithm for solving the Bellman equation for action values

Analogous to the TD update for state values $V$ above, the update for action values is:

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left [ R_{t+1} + \gamma . Q(S_{t+1},A_{t+1}) - Q(S_t,A_t) \right ] $$

Note the similarity to the __Bellman equation__ for action values: the Sarsa target replaces the expectations over $s'$, $r$ and $a'$ with a single sample.

$$ q_{\pi}(s,a) = \sum_{s',r} p(s',r | s,a) \left [ r + \gamma . \sum_{a'} \pi (a'|s') . q_{\pi}(s',a') \right ] $$
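
The Sarsa update can be sketched in a few lines of Python. The helper names `epsilon_greedy` and `sarsa_update`, the `(state, action)` keyed table, and the example values are hypothetical choices for this illustration. The point to notice is that `a_next` is the action the behavior policy actually selected, which is what makes the method on-policy.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """One Sarsa update using the sampled tuple (S, A, R, S', A')."""
    q_next = 0.0 if terminal else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])


# Hypothetical example with two states and two actions.
rng = random.Random(0)
actions = ["left", "right"]
Q = defaultdict(float)
Q[(0, "left")] = 1.0
Q[(1, "right")] = 2.0

# The next action comes from the same policy that is being improved.
a_next = epsilon_greedy(Q, 1, actions, epsilon=0.0, rng=rng)
sarsa_update(Q, s=0, a="left", r=1.0, s_next=1, a_next=a_next, alpha=0.5, gamma=0.9)
print(a_next, Q[(0, "left")])
```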

## Q-learning

* Off-policy (target policy is greedy but the behavior policy need not be)
* Uses the Bellman optimality equation for action values

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left [ R_{t+1} + \gamma . \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t) \right ] $$

__Bellman optimality equation__ for action values:

$$ q_{\pi_*}(s,a) = \sum_{s',r} p(s',r | s,a) \left [ r + \gamma . \max_{a'} q_{\pi_*}(s',a') \right ] $$
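
A minimal sketch of the Q-learning update, assuming a table keyed by `(state, action)` pairs; the function name `q_learning_update` and the example values are illustrative only. The target takes the maximum over next actions, so it does not depend on which action the behavior policy takes next.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """One Q-learning update: bootstrap from the greedy value of S'."""
    max_q_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * max_q_next - Q[(s, a)])


# Hypothetical example with two states and two actions.
actions = ["left", "right"]
Q = defaultdict(float)
Q[(0, "left")] = 1.0
Q[(1, "left")] = 0.5
Q[(1, "right")] = 2.0

q_learning_update(Q, s=0, a="left", r=1.0, s_next=1, actions=actions, alpha=0.5, gamma=0.9)
print(Q[(0, "left")])
```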

## Expected Sarsa

In Sarsa, we sample the next state from the environment and the next action from the policy.
\
Only the environment is unknown, so why sample from the policy?

Instead of sampling the next action from its policy, Expected Sarsa computes the expected value of the next action value under the policy.

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left [ R_{t+1} + \gamma . \sum_{a'} \pi (a'|S_{t+1}) . Q(S_{t+1},a') - Q(S_t,A_t) \right ] $$

Note that this is still a sample-based version of the __Bellman equation__ for action values; only the expectation over $s'$ and $r$ is sampled now.

$$ q_{\pi}(s,a) = \sum_{s',r} p(s',r | s,a) \left [ r + \gamma . \sum_{a'} \pi (a'|s') . q_{\pi}(s',a') \right ] $$

* Expected Sarsa update targets have lower variance than Sarsa targets.
* Larger step sizes can be used, since the randomness of the policy's action selection is averaged out.
* More expensive as the number of actions grows, since the expectation over actions is computed at every update step.
* Since the target averages over all actions and does not depend on the action actually selected in $S_{t+1}$, we can use a different policy for action selection.
\
  (Expected Sarsa can be __Off-policy__).
* Q-learning is a special case of Expected Sarsa when the target policy is greedy.
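
The sketch below computes the Expected Sarsa target for an epsilon-greedy target policy and checks that it reduces to the Q-learning target when the policy is greedy. The helper names, the choice of epsilon-greedy as the target policy, and the numbers are assumptions for illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a | s) for an epsilon-greedy policy over a list of action values."""
    n = len(q_values)
    greedy = max(range(n), key=lambda i: q_values[i])  # ties go to the lowest index
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def expected_sarsa_target(r, q_next, probs, gamma):
    """R + gamma * sum_a' pi(a' | S') * Q(S', a')."""
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    return r + gamma * expected_q


# Hypothetical action values Q(S', .) for two actions in the next state.
q_next = [1.0, 3.0]

soft_target = expected_sarsa_target(1.0, q_next, epsilon_greedy_probs(q_next, 0.5), gamma=0.9)
greedy_target = expected_sarsa_target(1.0, q_next, epsilon_greedy_probs(q_next, 0.0), gamma=0.9)
q_learning_target = 1.0 + 0.9 * max(q_next)

print(soft_target, greedy_target, q_learning_target)
```

With `epsilon = 0` all probability mass sits on the greedy action, so the expectation collapses to the maximum and the two targets coincide.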
