# Chapter 7: $n$-step Bootstrapping

$n$-step methods interpolate between two extremes: one-step TD methods and Monte Carlo methods.

## 7.1 $n$-step TD Prediction

> _Exercise 7.1_ In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don’t change from step to step. Show that the $n$-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don’t change), generalizing the earlier result.

Recall that the TD error is defined as

$$
\begin{align*}
\delta_t &\;\dot{=}\; R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
\end{align*}
$$

Also note that:

$$
\begin{align*}
G_{t:t+n} &= \gamma^n V(S_{t+n}) + \sum_{k=0}^{n-1}{\gamma^{k} R_{t+k+1}}\\\\
\Rightarrow R_t + \gamma G_{t:t+n} &= \gamma^{n+1} V(S_{t+n}) + \sum_{k=0}^{n}{\gamma^k R_{t+k}}\\\\
\Rightarrow R_t + \gamma G_{t:t+n} &= \gamma^{n+1} V(S_{t+n}) + \gamma^n R_{t+n}+ \sum_{k=0}^{n-1}{\gamma^k R_{t+k}}\\\\
\Rightarrow R_t + \gamma G_{t:t+n} &= \gamma^{n}\left(\gamma V(S_{t+n}) + R_{t+n}\right)+ \sum_{k=0}^{n-1}{\gamma^k R_{t+k}}\\\\
\Rightarrow R_t + \gamma G_{t:t+n} &= \gamma^{n}\left(\delta_{t+n-1} + V(S_{t+n-1})\right)+ \sum_{k=0}^{n-1}{\gamma^k R_{t+k}}\\\\
\Rightarrow R_t + \gamma G_{t:t+n} &= \gamma^{n}\delta_{t+n-1} + G_{t-1:t+n-1}\\\\
\Rightarrow G_{t:t+n} &= R_{t+1} + \gamma G_{t+1:t+n+1} - \gamma^{n}\delta_{t+n}\\\\
\end{align*}
$$

So we can write

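$$
\begin{align}
G_{t:t+n} - V(S_t) &= R_{t+1} + \gamma G_{t+1:t+n+1} - \gamma^n\delta_{t+n} - V(S_t)\\\\
&= R_{t+1} + \gamma G_{t+1:t+n+1} - \gamma^n\delta_{t+n} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1})\\\\
&= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) - \gamma^n\delta_{t+n} + \gamma G_{t+1:t+n+1} - \gamma V(S_{t+1})\\\\
&= \delta_t - \gamma^n\delta_{t+n} + \gamma \left(G_{t+1:t+n+1} - V(S_{t+1})\right)\\\\
&= \sum_{k=0}^{\infty}{\gamma^k(\delta_{t+k} - \gamma^n\delta_{t+k+n})}\\\\
&= \sum_{k=0}^{n-1}{\gamma^k\delta_{t+k}}
\end{align}
$$

The second-to-last line unrolls the recursion; for an episodic task we take $\delta_k = 0$ for $k \ge T$ and the value of the terminal state to be 0, so the sum is finite. The last line follows because each subtracted term $\gamma^{k+n}\delta_{t+k+n}$ cancels the term with index $k+n$ in the first part of the sum, leaving only the first $n$ TD errors.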
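
The identity $G_{t:t+n} - V(S_t) = \sum_{k=0}^{n-1}\gamma^k\delta_{t+k}$ can also be checked numerically. The following is a minimal sketch, assuming a single fixed trajectory with random rewards and a value table that never changes; the helper names `td_error` and `nstep_return` and all the numbers are made up for illustration.

```julia
using Random

# Illustrative setup: V[t] stands for V(S_t) and R[t] for R_t on one fixed
# trajectory (time t maps directly to index t; R[1] is never used).
Random.seed!(0)
γ, n, L = 0.9, 3, 20
V = randn(L + 1)    # value estimates, held fixed
R = randn(L + 1)    # rewards

# TD error δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t)
td_error(t) = R[t+1] + γ * V[t+1] - V[t]

# n-step return G_{t:t+n} = Σ_{k=0}^{n-1} γ^k R_{t+k+1} + γ^n V(S_{t+n})
nstep_return(t) = sum(γ^k * R[t+k+1] for k in 0:n-1) + γ^n * V[t+n]

t = 5
lhs = nstep_return(t) - V[t]                        # n-step error
rhs = sum(γ^k * td_error(t + k) for k in 0:n-1)     # discounted sum of TD errors
@assert isapprox(lhs, rhs; atol=1e-12)
```

The check only covers the case where $t+n$ falls before the end of the episode; the terminal case would need the zero-value convention for the terminal state.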

> _Exercise 7.2 (Programming)_ With an n-step method, the value estimates do change from step to step, so an algorithm that used the sum of TD errors (see previous exercise) in place of the error in (7.2) would actually be a slightly different algorithm. Would it be a better algorithm or a worse one? Devise and program a small experiment to answer this question empirically.

> _Exercise 7.3_ Why do you think a larger random walk task (19 states instead of 5) was used in the examples of this chapter? Would a smaller walk have shifted the advantage to a different value of $n$? How about the change in left-side outcome from 0 to -1 made in the larger walk? Do you think that made any difference in the best value of $n$?

A larger walk makes episodes longer on average, so a wider range of $n$ behaves differently from Monte Carlo: once $n$ is comparable to the typical episode length, the $n$-step return reaches the end of most episodes and $n$-step TD reduces to Monte Carlo estimation. A smaller walk would therefore likely have shifted the best value of $n$ toward smaller values, but because so few values of $n$ remain distinct from Monte Carlo on a short walk, the comparison may be noisy.

## 7.2 $n$-step Sarsa

> _Exercise 7.4_ Prove that the $n$-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as
>
> $$
> G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min{(t+n, T)}-1}{\gamma^{k-t}}\left[R_{k+1}+\gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k)\right]
> $$

$$
\begin{align}
G_{t:t+n} &= \gamma^n Q(S_{t+n}, A_{t+n}) + \sum_{k=0}^{n-1}{\gamma^{k} R_{t+k+1}}\\\\
\Rightarrow R_t + \gamma G_{t:t+n} &= \gamma^{n+1} Q(S_{t+n}, A_{t+n}) + \sum_{k=0}^{n}{\gamma^k R_{t+k}}\\\\
\Rightarrow R_t + \gamma G_{t:t+n} &= \gamma^{n+1} Q(S_{t+n}, A_{t+n}) + \gamma^n R_{t+n}+ \sum_{k=0}^{n-1}{\gamma^k R_{t+k}}\\\\
\Rightarrow R_t + \gamma G_{t:t+n} &= \gamma^{n}\left(\gamma Q(S_{t+n}, A_{t+n}) + R_{t+n}\right)+ \sum_{k=0}^{n-1}{\gamma^k R_{t+k}}\\\\
\end{align}
$$
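
The steps above stop before reaching the stated identity. One way to finish, sketched here under the assumptions that the subtracted term in the bracket is $Q_{k-1}(S_k, A_k)$ (which is what makes the sum telescope) and that action values at the terminal state are zero, is to expand the sum directly. Writing $h = \min(t+n, T)$:

$$
\begin{align*}
\sum_{k=t}^{h-1}{\gamma^{k-t}\left[R_{k+1}+\gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k)\right]}
&= \sum_{k=t}^{h-1}{\gamma^{k-t}R_{k+1}} + \sum_{k=t+1}^{h}{\gamma^{k-t}Q_{k-1}(S_k, A_k)} - \sum_{k=t}^{h-1}{\gamma^{k-t}Q_{k-1}(S_k, A_k)}\\\\
&= \sum_{k=t}^{h-1}{\gamma^{k-t}R_{k+1}} + \gamma^{h-t}Q_{h-1}(S_h, A_h) - Q_{t-1}(S_t, A_t)\\\\
&= G_{t:t+n} - Q_{t-1}(S_t, A_t)
\end{align*}
$$

The second sum is the first $Q$ term re-indexed with $k \to k+1$. When $h = t+n < T$ the remaining terms are exactly the $n$-step Sarsa return (7.4); when $h = T$ the bootstrap term vanishes and they reduce to the full return.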

## 7.3 $n$-step Off-policy Learning

## 7.4 Per-decision Methods with Control Variates

> _Exercise 7.5_ Write the pseudocode for the off-policy state-value prediction algorithm described above.

In [42]:
using Random
using StatsBase
using Distributions

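function td_control_variates(
        π, b, 𝓢, 𝓐, initial, dynamics, reward, terminals;
        n=3, α=0.1, episodes=1000, γ=0.99)
    # Assumed interface: π(s), b(s), dynamics(s, a) and initial are probability
    # vectors (over 𝓐, 𝓐, 𝓢 and 𝓢 respectively); reward(s, a) is a Distribution.
    # The initial state is assumed to be non-terminal.
    V = Dict(s => 0.0 for s in 𝓢)          # estimates persist across episodes
    action_index = Dict(a => i for (i, a) in enumerate(𝓐))
    for _ in 1:episodes
        # S[t] is the state at step t, A[t] the action taken there, and
        # R[t] the reward that follows it (R_{t+1} in the book's notation).
        S, A, R = [sample(𝓢, Weights(initial))], eltype(𝓐)[], Float64[]
        T = typemax(Int)
        t = 1
        while true
            if t < T
                push!(A, sample(𝓐, Weights(b(S[t]))))   # act with the behaviour policy
                push!(R, rand(reward(S[t], A[t])))
                push!(S, sample(𝓢, Weights(dynamics(S[t], A[t]))))
                S[t+1] in terminals && (T = t + 1)
            end

            τ = t - n + 1                  # step whose estimate is updated
            if τ ≥ 1
                h = min(τ + n, T)          # horizon of the n-step return
                G = h < T ? V[S[h]] : 0.0  # G_{h:h} = V(S_h); zero at the terminal state
                for k in (h-1):-1:τ        # recursion (7.13), from the horizon back to τ
                    j = action_index[A[k]]
                    ρ = π(S[k])[j] / b(S[k])[j]
                    G = ρ*(R[k] + γ*G) + (1 - ρ)*V[S[k]]
                end
                V[S[τ]] += α*(G - V[S[τ]])
            end
            τ == T - 1 && break            # last non-terminal state has been updated
            t += 1
        end
    end
    return V
end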


> _Exercise 7.6_ Prove that the control variate in the above equations does not change the expected value of the return.

$$
\begin{align*}
\mathbb{E}\left[G^{CV}_{t:h} \mid S_t\right] &= \mathbb{E}\left[\rho_t(R_{t+1} + \gamma G_{t+1:h}) + (1 - \rho_t)V_{h-1}(S_t) \mid S_t\right]\\\\
&= \mathbb{E}\left[\rho_t(R_{t+1} + \gamma G_{t+1:h}) \mid S_t\right] + V_{h-1}(S_t)\left(1 - \mathbb{E}\left[\rho_t \mid S_t\right]\right) \quad \text{since } V_{h-1}(S_t) \text{ is fixed given } S_t\\\\
&= \mathbb{E}\left[\rho_t(R_{t+1} + \gamma G_{t+1:h}) \mid S_t\right] \quad \text{since } \mathbb{E}\left[\rho_t \mid S_t\right] = \sum_a{b(a \mid S_t)\frac{\pi(a \mid S_t)}{b(a \mid S_t)}} = 1\\\\
&= \mathbb{E}\left[G_{t:h} \mid S_t\right]
\end{align*}
$$
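
The control-variate term vanishes in expectation because the importance-sampling ratio has expected value 1 when actions are drawn from the behaviour policy. A minimal numerical illustration for a single state follows; the two probability vectors are made-up numbers, not taken from any task in the chapter.

```julia
# Illustrative policies over three actions in one state (made-up numbers).
π_probs = [0.7, 0.2, 0.1]   # target policy π(·|s)
b_probs = [0.2, 0.3, 0.5]   # behaviour policy b(·|s), nonzero wherever π is

ρ = π_probs ./ b_probs               # importance-sampling ratio for each action
expected_ρ = sum(b_probs .* ρ)       # E_b[ρ | S_t = s] = Σ_a b(a|s) ρ(a)
@assert isapprox(expected_ρ, 1.0)
```

The expectation reduces to $\sum_a \pi(a \mid s)$, so it equals 1 for any valid pair of policies as long as $b$ covers $\pi$.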

> _Exercise 7.7_ Write the pseudocode for the off-policy action-value prediction algorithm described immediately above. Pay particular attention to the termination conditions for the recursion upon hitting the horizon or the end of the episode.

In [None]:
using Random
using StatsBase
using Distributions

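function action_control_variates(
        π, b, 𝓢, 𝓐, initial, dynamics, reward, terminals;
        n=3, α=0.1, episodes=1000, γ=0.99)
    # Assumed interface: π(s), b(s), dynamics(s, a) and initial are probability
    # vectors (over 𝓐, 𝓐, 𝓢 and 𝓢 respectively); reward(s, a) is a Distribution.
    # The initial state is assumed to be non-terminal.
    Q = Dict((s, a) => 0.0 for s in 𝓢, a in 𝓐)   # estimates persist across episodes
    action_index = Dict(a => i for (i, a) in enumerate(𝓐))
    # Expected action value under the target policy, as in (7.8)
    expected_q(s) = sum(π(s)[i] * Q[(s, a)] for (i, a) in enumerate(𝓐))
    for _ in 1:episodes
        # S[t] is the state at step t, A[t] the action taken there, and
        # R[t] the reward that follows it (R_{t+1} in the book's notation).
        S = [sample(𝓢, Weights(initial))]
        A = [sample(𝓐, Weights(b(S[1])))]           # act with the behaviour policy
        R = Float64[]
        T = typemax(Int)
        t = 1
        while true
            if t < T
                push!(R, rand(reward(S[t], A[t])))
                push!(S, sample(𝓢, Weights(dynamics(S[t], A[t]))))
                if S[t+1] in terminals
                    T = t + 1
                else
                    push!(A, sample(𝓐, Weights(b(S[t+1]))))
                end
            end

            τ = t - n + 1                            # step whose estimate is updated
            if τ ≥ 1
                h = min(τ + n, T)                    # horizon of the n-step return
                # If the horizon comes first, the recursion starts from Q(S_h, A_h).
                G = h < T ? Q[(S[h], A[h])] : 0.0
                for k in (h-1):-1:τ
                    if k + 1 == T
                        G = R[k]                     # episode ended: G_{T-1:h} = R_T
                    else
                        s_next, a_next = S[k+1], A[k+1]
                        j = action_index[a_next]
                        ρ = π(s_next)[j] / b(s_next)[j]
                        G = R[k] + γ*(ρ*(G - Q[(s_next, a_next)]) + expected_q(s_next))
                    end
                end
                Q[(S[τ], A[τ])] += α*(G - Q[(S[τ], A[τ])])
            end
            τ == T - 1 && break                      # last non-terminal pair has been updated
            t += 1
        end
    end
    return Q
end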

> _Exercise 7.8_ Show the general (off-policy) version of the $n$-step return (7.13) can still be written exactly and compactly as the sum of state-based TD errors (6.5) if the approximate state value function does not change.
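
A possible route, sketched under the assumption that $V$ is held fixed so that $\delta_k = R_{k+1} + \gamma V(S_{k+1}) - V(S_k)$, is to subtract $V(S_t)$ from the recursion (7.13), $G_{t:h} = \rho_t(R_{t+1} + \gamma G_{t+1:h}) + (1 - \rho_t)V(S_t)$, and unroll it down to $G_{h:h} = V(S_h)$:

$$
\begin{align*}
G_{t:h} - V(S_t) &= \rho_t\left(R_{t+1} + \gamma G_{t+1:h} - V(S_t)\right)\\\\
&= \rho_t\left(\delta_t + \gamma\left(G_{t+1:h} - V(S_{t+1})\right)\right)\\\\
&= \sum_{k=t}^{h-1}{\gamma^{k-t}\left(\prod_{i=t}^{k}{\rho_i}\right)\delta_k}
\end{align*}
$$

The last line repeats the second step until the horizon, where $G_{h:h} - V(S_h) = 0$ ends the recursion. Setting every $\rho_i = 1$ recovers the on-policy sum of TD errors.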

> _Exercise 7.9_ Repeat the above exercise for the action version of the off-policy $n$-step return (7.14) and the Expected Sarsa TD error (the quantity in brackets in Equation 6.9).

> _Exercise 7.10 (programming)_ Devise a small off-policy prediction problem and use it to show that the off-policy learning algorithm using (7.13) and (7.2) is more data efficient than the simpler algorithm using (7.1) and (7.9).

## 7.5 Off-policy Learning Without Importance Sampling: The $n$-step Tree Backup Algorithm

> _Exercise 7.11_ Show that if the approximate action values are unchanging, then the tree-backup return (7.16) can be written as a sum of expectation-based TD errors:
>
> $$
> G_{t:t+n} = Q(S_t, A_t) + \sum_{k=t}^{\min(t+n-1, T-1)}{\delta_k\prod_{i=t+1}^k{\gamma\pi(A_i\mid S_i)}}
> $$
>
> Where $\delta_t \dot= R_{t+1} + \gamma \bar{V}_t(S_{t+1}) - Q(S_t, A_t)$ and $\bar{V}_t$ is given by (7.8).



## 7.6 A Unifying Algorithm: $n$-step $Q(\sigma)$

## 7.7 Summary