# Temporal-Difference methods

Temporal-difference (TD) learning is an incremental, model-free method.

The TD target is obtained by following the current policy one step further and then bootstrapping from the current value estimate.

## TD learning of state values

Our goal is to estimate $v_{\pi}(s)$ for all $s\in\mathcal{S}$. Suppose we have some experience samples $(s_{0},r_{1},s_{1},\dots,s_{t},r_{t+1},s_{t+1},\dots)$ generated following $\pi$. The following TD algorithm can estimate state values using these samples:

$$
\begin{aligned}
v_{t+1}(s_{t}) &= v_{t}(s_{t}) - \alpha_{t}(s_{t})\left[v_{t}(s_{t}) - (r_{t+1} + \gamma v_{t}(s_{t+1}))\right]\\
v_{t+1}(s) &= v_{t}(s),\quad\text{for all }s\ne s_{t}
\end{aligned}
$$

At time $t$, only the value of the visited state $s_{t}$ is updated. This algorithm can be viewed as a special stochastic approximation algorithm for solving the Bellman equation, where $R$ is the immediate reward and $S$ is the next state:

$$v_\pi(s) = \mathbb{E}\left[R + \gamma v_{\pi}(S)|s\right]$$

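Define $g_{s_{t}}(w) = w - \mathbb{E}[R + \gamma v_{\pi}(S)|s_{t}]$, so that $v_{\pi}(s_{t})$ is the root of $g_{s_{t}}(w) = 0$. Since $r_{t+1}$ and $s_{t+1}$ are samples of $R$ and $S$, a noisy observation of $g_{s_{t}}(w)$, with $\eta$ denoting the observation noise, is: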

$$\tilde{g}_{s_{t}}(w,\eta) = w - (r_{t+1} + \gamma v_{\pi}(s_{t+1}))$$

The only difference between the TD algorithm and the Robbins-Monro algorithm is that the TD algorithm uses the current estimate $v_{t}(s_{t+1})$ in place of the unknown $v_{\pi}(s_{t+1})$.
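For reference, formally applying the Robbins-Monro iteration to $\tilde{g}_{s_{t}}$, with $w_{t}$ standing for the estimate of $v_{\pi}(s_{t})$ (a symbol introduced here only for illustration), would give:

$$
w_{t+1} = w_{t} - \alpha_{t}\tilde{g}_{s_{t}}(w_{t},\eta_{t}) = w_{t} - \alpha_{t}\left[w_{t} - (r_{t+1} + \gamma v_{\pi}(s_{t+1}))\right]
$$

Setting $w_{t} = v_{t}(s_{t})$ and making the substitution described above recovers the TD update.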

The TD target, which looks one step ahead, is:

$$r_{t+1} + \gamma v_{t}(s_{t+1})$$

TD error:

$$v_{t}(s_{t}) - (r_{t+1} + \gamma v_{t}(s_{t+1}))$$

TD learning is online: it can update the state/action values immediately after receiving an experience sample. MC learning is offline: it must wait until an episode has been completed.
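The state-value update, TD target, and TD error can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `td0_update`, the dictionary-based value table, and the two-state example are assumptions made here, and the TD error uses the sign convention of these notes (estimate minus target).

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD update to v[s] in place and return the TD error.

    v is a dict mapping states to value estimates (hypothetical layout).
    """
    td_target = r + gamma * v[s_next]   # r_{t+1} + gamma * v_t(s_{t+1})
    td_error = v[s] - td_target         # v_t(s_t) - TD target
    v[s] -= alpha * td_error            # only the visited state changes
    return td_error


# Hypothetical two-state example: one transition A -> B with reward 1.
v = {"A": 0.0, "B": 1.0}
delta = td0_update(v, "A", r=1.0, s_next="B", alpha=0.5, gamma=0.9)
print(v, delta)
```

Because the update needs only a single transition $(s_{t}, r_{t+1}, s_{t+1})$, it can be called as soon as each sample arrives, which is what makes the method online.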

## TD learning of action values: Sarsa

Suppose we have some experience samples generated following $\pi$: $s_{0},a_{0},r_{1},s_{1},a_{1},\dots,s_{t},a_{t},r_{t+1},s_{t+1},a_{t+1},\dots$. We can use the following Sarsa algorithm to estimate action values:

$$
\begin{aligned}
q_{t+1}(s_{t},a_{t}) &= q_{t}(s_{t},a_{t}) - \alpha_{t}(s_{t},a_{t})\left[q_{t}(s_{t},a_{t}) - (r_{t+1} + \gamma q_{t}(s_{t+1},a_{t+1}))\right]\\
q_{t+1}(s,a) &= q_{t}(s,a),\quad\text{for all }(s,a)\ne (s_{t},a_{t})
\end{aligned}
$$

Why is this algorithm called "Sarsa"? Because each iteration requires the tuple $(s_{t},a_{t},r_{t+1},s_{t+1},a_{t+1})$: state, action, reward, state, action.

Sarsa is a stochastic approximation algorithm for solving the Bellman equation:

$$q_{\pi}(s, a) = \mathbb{E}\left[R_{t+1} + \gamma q_{\pi}(S_{t+1},A_{t+1})|S_{t}=s,A_{t}=a\right]$$
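A single Sarsa step might look like the following sketch. The function name `sarsa_update`, the `defaultdict` table keyed by `(state, action)` pairs, and the sample transition are illustrative assumptions, not part of the original notes.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one Sarsa update to q[(s, a)] in place and return the new value."""
    # The target bootstraps from the action actually taken next, a_{t+1}.
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] -= alpha * (q[(s, a)] - td_target)
    return q[(s, a)]


# Hypothetical transition: (s0, right) -> reward 1 -> (s1, right).
q = defaultdict(float)          # unseen pairs default to 0.0
q[("s1", "right")] = 2.0
new_value = sarsa_update(q, "s0", "right", 1.0, "s1", "right",
                         alpha=0.5, gamma=0.9)
print(new_value)
```

The update consumes exactly the five quantities $(s_{t},a_{t},r_{t+1},s_{t+1},a_{t+1})$, so $a_{t+1}$ must be chosen by the policy before $q(s_{t},a_{t})$ can be updated.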

## n-step Sarsa

$$
\begin{aligned}
\text{Sarsa}\leftarrow G_{t}^{(1)} &= R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1})\\
G_{t}^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} q_{\pi}(S_{t+2}, A_{t+2})\\
&\vdots\\
\text{n-step Sarsa}\leftarrow G_{t}^{(n)} &= R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n} q_{\pi}(S_{t+n}, A_{t+n})\\
&\vdots\\
\text{MC}\leftarrow G_{t}^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots
\end{aligned}
$$

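Sarsa's target involves fewer random variables, so it has small variance but large bias, because it bootstraps from the current estimate. MC learning has large variance but small bias. n-step Sarsa trades off between the two.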
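To make the family of targets concrete, the sketch below computes $G_{t}^{(n)}$ from a list of sampled rewards and a bootstrap value. The helper `n_step_return`, its argument names, and the numbers are assumptions chosen for illustration; for simplicity the same bootstrap value is reused for every $n$, whereas in practice it would be the estimate at $(S_{t+n}, A_{t+n})$.

```python
def n_step_return(rewards, q_bootstrap, gamma, n):
    """Return R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n}
    + gamma^n * q_bootstrap.

    rewards[k] holds R_{t+1+k}; q_bootstrap stands in for the
    estimate of q(S_{t+n}, A_{t+n}).
    """
    discounted_rewards = sum(gamma**k * rewards[k] for k in range(n))
    return discounted_rewards + gamma**n * q_bootstrap


# Hypothetical samples: three rewards of 1 and a bootstrap value of 10.
rewards = [1.0, 1.0, 1.0]
g1 = n_step_return(rewards, 10.0, gamma=0.5, n=1)  # Sarsa target
g2 = n_step_return(rewards, 10.0, gamma=0.5, n=2)
g3 = n_step_return(rewards, 10.0, gamma=0.5, n=3)
print(g1, g2, g3)
```

As $n$ grows, more sampled rewards enter the target and the bootstrap term is discounted more heavily, which is the source of the variance and bias trade-off.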

## TD learning of optimal action values: Q-learning

The Q-learning algorithm is:

$$
\begin{aligned}
q_{t+1}(s_{t},a_{t}) &= q_{t}(s_{t},a_{t}) - \alpha_{t}(s_{t},a_{t})\left[q_{t}(s_{t},a_{t}) - (r_{t+1} + \gamma \underset{a\in\mathcal{A}}{\max}q_{t}(s_{t+1},a))\right]\\
q_{t+1}(s,a) &= q_{t}(s,a),\quad\text{for all }(s,a)\ne (s_{t},a_{t})
\end{aligned}
$$

Q-learning is a stochastic approximation algorithm for solving the Bellman optimality equation:

$$q(s,a) = \mathbb{E}\left[R_{t+1} + \gamma\underset{a}{\max}q(S_{t+1},a)|S_{t}=s,A_{t}=a\right]$$
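A single Q-learning step can be sketched as follows. The function name `q_learning_update`, the explicit `actions` list, and the example values are assumptions made for illustration. The only change from the Sarsa update is that the target takes a maximum over actions in the next state instead of using a sampled next action.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to q[(s, a)] in place and return the new value."""
    # The target uses the greedy value in s_{t+1}; no a_{t+1} is needed.
    best_next = max(q[(s_next, b)] for b in actions)
    td_target = r + gamma * best_next
    q[(s, a)] -= alpha * (q[(s, a)] - td_target)
    return q[(s, a)]


# Hypothetical transition: (s0, left) -> reward 0 -> s1.
actions = ["left", "right"]
q = defaultdict(float)          # unseen pairs default to 0.0
q[("s1", "left")] = 1.0
q[("s1", "right")] = 3.0
new_value = q_learning_update(q, "s0", "left", 0.0, "s1", actions,
                              alpha=0.5, gamma=0.9)
print(new_value)
```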