# Chapter 10: On-Policy Control with Approximation

We now turn from the _prediction_ problem to the _control_ problem for the case where the learned value function is approximate. We therefore consider a parameterized action-value function $\hat{q}(s, a, \mathbf{w}) \approx q_*(s, a)$ and restrict attention to the on-policy case, implementing _semi-gradient Sarsa_. The continuing case must be reformulated to keep the problem well posed: the new formulation relies on an "average-reward" ordering of policies and uses differential value functions and differential TD errors.

# 10.1 Episodic Semi-gradient Control

In the case of episodic tasks, it is straightforward to extend semi-gradient value estimation to action values and therefore to the control task. One simply uses any suitable update target $U_t$, such as the full return $U_t = G_t$ for Monte Carlo estimation or $U_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)$ for one-step Sarsa.

The general gradient update rule can be written as:

$$
\begin{align}
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\big[U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)\big]\nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\tag{10.1}
\end{align}
$$

For one-step Sarsa in particular, the update becomes

$$
\begin{align}
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\big[R_{t+1} + \gamma\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\big]\nabla \hat{q}(S_t, A_t, \mathbf{w}_t)\tag{10.2}
\end{align}
$$

In the case of an action set that is discrete and not too large, we can explicitly improve the policy at any time by selecting

$$
\pi(a\mid s) = \begin{cases}
1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|} & a = \arg\max_{a'} \hat{q}(s, a', \mathbf{w}) \\ 
\frac{\epsilon}{|\mathcal{A}(s)|} & \mathrm{otherwise}
\end{cases}
$$

for any state $s$.
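The $\epsilon$-greedy rule and the one-step Sarsa update above can be sketched for a linear approximator $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s, a)$. This is only a minimal sketch under that linearity assumption: the names `qhat`, `ε_greedy`, and `sarsa_update!`, and the feature function `x`, are hypothetical and do not come from the text.

```julia
using Random

# Linear action-value estimate q̂(s, a, w) = wᵀx(s, a).
# `x` is an assumed feature function returning a vector of the same length as `w`.
qhat(w, x, s, a) = w' * x(s, a)

# ε-greedy selection over a finite collection of actions.
function ε_greedy(w, x, s, actions, ε; rng = Random.default_rng())
    rand(rng) < ε && return rand(rng, actions)        # explore uniformly
    return argmax(a -> qhat(w, x, s, a), actions)     # otherwise act greedily
end

# One semi-gradient Sarsa step. For a linear q̂ the gradient is just x(s, a).
# Pass `terminal = true` when `sp` is terminal so the bootstrap term is dropped.
function sarsa_update!(w, x, s, a, r, sp, ap, α, γ; terminal = false)
    target = terminal ? r : r + γ * qhat(w, x, sp, ap)
    δ = target - qhat(w, x, s, a)
    w .+= α * δ * x(s, a)
    return w
end
```

Passing the weights explicitly keeps both helpers free of global state, so the same functions can be reused inside an episode loop.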

**Example 10.1 Mountain Car Task**

_Exercise 10.1_ We have not explicitly considered or given pseudocode for any Monte Carlo methods in this chapter. What would they be like? Why is it reasonable not to give pseudocode for them? How would they perform on the Mountain Car task?

Monte Carlo methods for on-policy control can be obtained by instantiating the $n$-step Sarsa control method with $n = \infty$, so that the target is the full return $G_t$; since they are just this special case, separate pseudocode is unnecessary. They would likely perform poorly on the Mountain Car task, however, because they do not update any action values until the end of the first episode, which may last an extremely long time under the initial, essentially random policy.

_Exercise 10.2_ Give pseudocode for semi-gradient one-step _Expected_ Sarsa for control.

In [4]:
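function expected_sarsa(𝓢, 𝓐, N, α, ε, γ, s0_sampler, stepper, isterminal, d, x)
    # x(s, a) is assumed to return a length-d feature vector for the pair (s, a)
    w = zeros(d)
    q(s, a) = w' * x(s, a)
    greedy(s) = argmax(a -> q(s, a), 𝓐(s))
    # ε-greedy behaviour policy and its action probabilities
    π(s) = rand() < ε ? rand(𝓐(s)) : greedy(s)
    π_prob(a, s) =
        a == greedy(s) ? 1 - ε + ε / length(𝓐(s)) : ε / length(𝓐(s))
    for i in 1:N
        s = s0_sampler()
        while !isterminal(s)
            a = π(s)
            sp, r = stepper(s, a)
            # expected next-state value under π; zero if sp is terminal
            # (the parentheses matter: `?:` binds more loosely than `+`)
            G = r + (isterminal(sp) ? 0.0 : γ * sum(π_prob(ap, sp) * q(sp, ap) for ap in 𝓐(sp)))
            # semi-gradient update; the gradient of a linear q is x(s, a)
            w .+= α * (G - q(s, a)) * x(s, a)
            s = sp
        end
    end
    return w
end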

expected_sarsa (generic function with 1 method)

_Exercise 10.3_ Why do the results shown in Figure 10.4 have higher standard errors at large $n$ than at small $n$?

At large $n$, the $n$-step returns incorporate more of the variance of the rewards, in exchange for lower bias (less overall dependence on the initialization of $\hat{q}(s, a, \mathbf{w})$).


# 10.3 Average Reward: A New Problem Setting for Continuing Tasks

_Exercise 10.4_ Give pseudocode for a differential version of semi-gradient Q-learning.

In [None]:
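function differential_q_learning(𝓢, 𝓐, α, β, d, s0_sampler, stepper, isterminal, N, ε, x)
    # x(s, a) is assumed to return a length-d feature vector for the pair (s, a)
    w = zeros(d)
    R̄ = 0.0    # running estimate of the average reward
    q(s, a) = w' * x(s, a)
    π(s) = rand() < ε ? rand(𝓐(s)) : argmax(a -> q(s, a), 𝓐(s))
    for i in 1:N
        s = s0_sampler()
        # in a continuing task `isterminal` only serves as a stopping criterion
        while !isterminal(s)
            a = π(s)
            sp, r = stepper(s, a)
            # differential TD error, maximising over the actions available in sp
            δ = r - R̄ + maximum(ap -> q(sp, ap), 𝓐(sp)) - q(s, a)
            R̄ += β * δ
            # semi-gradient update; the gradient of a linear q is x(s, a)
            w .+= α * δ * x(s, a)
            s = sp
        end
    end
    return w, R̄
end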

_Exercise 10.5_ What equations are needed (beyond 10.10) to specify the differential version of TD(0)?
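One possible answer, sketched on the assumption that (10.10) is the differential TD error for state values, $\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$: two further updates are needed, one for the weights and one for the average-reward estimate, with step sizes $\alpha$ and $\beta$ respectively.

$$
\begin{align}
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\delta_t\nabla\hat{v}(S_t, \mathbf{w}_t)\\
\bar{R}_{t+1} &= \bar{R}_t + \beta\delta_t
\end{align}
$$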

_Exercise 10.6_ Suppose there is an MDP that under any policy produces the deterministic sequence of rewards $+1,0,+1,0,+1,0,\dots$ going on forever. Technically, this violates ergodicity; there is no stationary limiting distribution $\mu_\pi$ and the limit (10.7) does not exist. Nevertheless, the average reward (10.6) is well defined. What is it? Now consider two states in this MDP. From $\verb|A|$, the reward sequence is exactly as described above, starting with a $+1$, whereas, from $\verb|B|$, the reward sequence starts with a $0$ and then continues with $+1, 0, +1, 0,\dots$. We would like to compute the differential values of $\verb|A|$ and $\verb|B|$. Unfortunately, the differential return (10.9) is not well defined when starting from these states as the implicit limit does not exist. To repair this, one could alternatively define the differential value of a state as

$$
v_\pi(s) = \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h{\gamma^t\big(\mathbb{E}_\pi\left[R_{t+1}\mid S_0 = s\right] - r(\pi)\big)}.
$$

Under this definition, what are the differential values of $\verb|A|$ and $\verb|B|$?

The average reward is:

$$
\begin{align}
r(\pi) &= \lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}[R_t\mid S_0, A_{0:t-1}\sim\pi]\\
&= \lim_{h\to\infty}\frac{1}{h}\left(1 + 0 + 1 + 0 + \dots\right)\\
&= \lim_{h\to\infty}\frac{1}{h}\bigg\lceil\frac{h}{2}\bigg\rceil = \frac{1}{2}
\end{align}
$$

The alternate value of state $\verb|A|$ is

$$
\begin{align}
v_\pi(\verb|A|) &= \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h{\gamma^t\big(\mathbb{E}_\pi\left[R_{t+1}\mid S_0 = \verb|A|\right] - \frac{1}{2}\big)}\\
&= \frac{1}{2}\lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\gamma^t(-1)^t\\
&= \frac{1}{2}\lim_{\gamma\to1}\frac{1}{1+\gamma}\\
&= \frac{1}{4}
\end{align}
$$

The alternate value of state $\verb|B|$ is

$$
\begin{align}
v_\pi(\verb|B|) &= \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h{\gamma^t\big(\mathbb{E}_\pi\left[R_{t+1}\mid S_0 = \verb|B|\right] - \frac{1}{2}\big)}\\
&= \frac{1}{2}\lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\gamma^t(-1)^{t+1}\\
&= -\frac{1}{2}\lim_{\gamma\to1}\frac{1}{1+\gamma}\\
&= -\frac{1}{4}
\end{align}
$$
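Both limits can be sanity-checked numerically by evaluating the truncated discounted sum for $\gamma$ close to 1. This is only an illustrative sketch: the helper name `discounted_differential_sum`, the horizon `h`, and the choice $\gamma = 0.9999$ are assumptions, not part of the exercise.

```julia
# Truncated discounted sum of (reward - average reward) for a periodic reward
# sequence. `rewards` holds one period, starting with the first reward received.
function discounted_differential_sum(rewards, γ; h = 1_000_000)
    rbar = sum(rewards) / length(rewards)   # average reward over one period
    total = 0.0
    discount = 1.0                          # equals γ^t at step t
    for t in 0:h
        total += discount * (rewards[mod1(t + 1, length(rewards))] - rbar)
        discount *= γ
    end
    return total
end

vA = discounted_differential_sum([1, 0], 0.9999)   # from A: +1, 0, +1, 0, ...
vB = discounted_differential_sum([0, 1], 0.9999)   # from B: 0, +1, 0, +1, ...
```

Moving $\gamma$ closer to 1 (with a correspondingly longer horizon) should bring `vA` and `vB` closer to the limiting values derived above.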

_Exercise 10.7_ Consider a Markov reward process consisting of a ring of three states $\verb|A|, \verb|B|$, and $\verb|C|$, with state transitions going deterministically around the ring. A reward of $+1$ is received upon arrival in $\verb|A|$ and otherwise the reward is $0$. What are the differential values of the three states, using (10.13)?

The alternate value of state $\verb|A|$ is

$$
\begin{align}
v_\pi(\verb|A|) &= \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h{\gamma^t\big(\mathbb{E}_\pi\left[R_{t+1}\mid S_0 = \verb|A|\right] - \frac{1}{3}\big)}\\
&= \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\left(-\frac{1}{3}\gamma^{3t} - \frac{1}{3}\gamma^{3t+1}+\frac{2}{3}\gamma^{3t+2}\right)\\
&= \frac{1}{3}\lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\gamma^{3t}(2\gamma^2 - \gamma - 1)\\
&= \frac{1}{3}\lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\gamma^{3t}(2\gamma + 1)(\gamma - 1)\\
&= \frac{1}{3}\lim_{\gamma\to1}\frac{(2\gamma + 1)(\gamma - 1)}{1-\gamma^3}\\
&= -\frac{1}{3}\lim_{\gamma\to1}\frac{2\gamma + 1}{1 + \gamma + \gamma^2}\\
&= -\frac{1}{3}
\end{align}
$$

The alternate value of state $\verb|B|$ is

$$
\begin{align}
v_\pi(\verb|B|) &= \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h{\gamma^t\big(\mathbb{E}_\pi\left[R_{t+1}\mid S_0 = \verb|B|\right] - \frac{1}{3}\big)}\\
&= \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\left(-\frac{1}{3}\gamma^{3t} + \frac{2}{3}\gamma^{3t+1} - \frac{1}{3}\gamma^{3t+2}\right)\\
&= \frac{1}{3}\lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\gamma^{3t}(-\gamma^2 + 2\gamma - 1)\\
&= -\frac{1}{3}\lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\gamma^{3t}(1 - \gamma)^2\\
&= -\frac{1}{3}\lim_{\gamma\to1}\frac{(1 - \gamma)^2}{1-\gamma^3}\\
&= -\frac{1}{3}\lim_{\gamma\to1}\frac{1 - \gamma}{1 + \gamma + \gamma^2}\\
&= 0
\end{align}
$$

The alternate value of state $\verb|C|$ is

$$
\begin{align}
v_\pi(\verb|C|) &= \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h{\gamma^t\big(\mathbb{E}_\pi\left[R_{t+1}\mid S_0 = \verb|C|\right] - \frac{1}{3}\big)}\\
&= \lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\left(\frac{2}{3}\gamma^{3t} - \frac{1}{3}\gamma^{3t+1} - \frac{1}{3}\gamma^{3t+2}\right)\\
&= \frac{1}{3}\lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\gamma^{3t}(-\gamma^2 - \gamma + 2)\\
&= \frac{1}{3}\lim_{\gamma\to1}\lim_{h\to\infty}\sum_{t=0}^h\gamma^{3t}(\gamma + 2)(1 - \gamma)\\
&= \frac{1}{3}\lim_{\gamma\to1}\frac{(\gamma + 2)(1 - \gamma)}{1-\gamma^3}\\
&= \frac{1}{3}\lim_{\gamma\to1}\frac{\gamma + 2}{1 + \gamma + \gamma^2}\\
&= \frac{1}{3}
\end{align}
$$
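The ring's differential values can also be cross-checked by solving the differential Bellman equations $v(s) = r(s) - r(\pi) + v(s')$ as a linear system. This is a sketch under an assumption not stated in the text: that the limit definition selects the solution whose mean under the (uniform) stationary distribution of the ring is zero. The names `P`, `r`, `rbar`, `M`, and `b` are illustrative.

```julia
using LinearAlgebra

# Ring A → B → C → A; the reward is +1 only on the transition that arrives in A.
P = [0.0 1.0 0.0;    # A → B
     0.0 0.0 1.0;    # B → C
     1.0 0.0 0.0]    # C → A
r = [0.0, 0.0, 1.0]  # reward received on leaving A, B, C respectively
rbar = 1 / 3         # average reward of the ring

# (I - P) v = r - rbar determines v only up to an additive constant, so a
# centering row (mean of v under the uniform distribution equals 0) is appended.
M = [I - P; ones(1, 3) / 3]
b = [r .- rbar; 0.0]
v = M \ b            # least-squares solve; the system is consistent
```

The entries of `v` are ordered as $\verb|A|$, $\verb|B|$, $\verb|C|$ and can be compared against the closed-form limits.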

_Exercise 10.8_ The pseudocode in the box on page 251 updates $\bar{R}_t$ using $\delta_t$ as an error rather than simply $R_{t+1} - \bar{R}_t$. Both errors work, but using $\delta_t$ is better. To see why, consider the ring MRP of three states from Exercise 10.7. The estimate of the average reward should tend towards its true value of $\frac{1}{3}$. Suppose it was already there and was held stuck there. What would the sequence of $R_{t+1} - \bar{R}_t$ errors be? What would the sequence of $\delta_t$ errors be (using Equation 10.10)? Which error sequence would produce a more stable estimate of the average reward if the estimate were allowed to change in response to the errors? Why?

Starting in state $\verb|A|$, the sequence of $R_{t+1} - \bar{R}_t$ errors would be:

$$
-\frac{1}{3}, -\frac{1}{3}, \frac{2}{3}, -\frac{1}{3}, -\frac{1}{3}, \frac{2}{3}, \dots
$$

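The $\delta_t$ errors depend on the value estimates. If those have converged to the differential values $v(\verb|A|) = -\frac{1}{3}$, $v(\verb|B|) = 0$, $v(\verb|C|) = \frac{1}{3}$, then $\delta_t = R_{t+1} - \bar{R}_t + v(S_{t+1}) - v(S_t)$ vanishes on every transition, and the sequence of $\delta_t$ errors would be:

$$
0, 0, 0, 0, 0, 0, \dots
$$

Updating $\bar{R}$ with $\delta_t$ would therefore leave the estimate at $\frac{1}{3}$, whereas the $R_{t+1} - \bar{R}_t$ errors never settle and would make $\bar{R}$ oscillate around $\frac{1}{3}$ for any constant $\beta$. The $\delta_t$ errors thus produce the more stable estimate.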
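The effect of the two error signals on $\bar{R}$ can be illustrated with a short simulation of the ring. This is a minimal sketch under stated assumptions: the value estimates are held fixed at $v(\verb|A|) = -\frac{1}{3}$, $v(\verb|B|) = 0$, $v(\verb|C|) = \frac{1}{3}$, both estimates start at $\frac{1}{3}$, and the function name, the step size `β`, and the number of steps are arbitrary choices.

```julia
# Ring MRP A → B → C → A (states 1, 2, 3) with reward +1 on arriving in A.
# Returns the trajectories of R̄ under the two candidate error signals.
function average_reward_traces(; β = 0.1, steps = 30)
    v = [-1/3, 0.0, 1/3]           # assumed converged differential values
    Rbar_simple = 1 / 3            # updated with R_{t+1} - R̄
    Rbar_td = 1 / 3                # updated with the TD error δ
    simple, td = Float64[], Float64[]
    s = 1                          # start in A
    for _ in 1:steps
        sp = mod1(s + 1, 3)        # deterministic move around the ring
        r = sp == 1 ? 1.0 : 0.0
        δ = r - Rbar_td + v[sp] - v[s]
        Rbar_td += β * δ
        Rbar_simple += β * (r - Rbar_simple)
        push!(simple, Rbar_simple)
        push!(td, Rbar_td)
        s = sp
    end
    return simple, td
end

simple, td = average_reward_traces()
```

Comparing the spread of `simple` with that of `td` shows how much each update rule disturbs an estimate that is already correct.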

_Exercise 10.9_ In the differential semi-gradient $n$-step Sarsa algorithm, the step-size parameter on the average reward, $\beta$, needs to be quite small so that $\bar{R}$ becomes a good long-term estimate of the average reward. Unfortunately, $\bar{R}$ will then be biased by its initial value for many steps, which may make learning inefficient. Alternatively, one could use a sample average of the observed rewards for $\bar{R}$. That would initially adapt rapidly but in the long run would also adapt slowly. As the policy slowly changed, $\bar{R}$ would also change; the potential for such long-term nonstationarity makes sample-average methods ill-suited. In fact, the step-size parameter on the average reward is a perfect place to use the unbiased constant-step-size trick from Exercise 2.7. Describe the specific changes needed to the boxed algorithm for differential semi-gradient $n$-step Sarsa to use this trick.
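A sketch of one possible answer, assuming the trick from Exercise 2.7 is the step size $\beta_t = \beta / \bar{o}_t$ with the trace $\bar{o}_t = \bar{o}_{t-1} + \beta(1 - \bar{o}_{t-1})$ and $\bar{o}_0 = 0$: initialize an extra scalar $\bar{o} = 0$ alongside $\bar{R}$, and replace the single $\bar{R}$ update in the boxed algorithm with the pair of updates

$$
\begin{align}
\bar{o} &\leftarrow \bar{o} + \beta(1 - \bar{o})\\
\bar{R} &\leftarrow \bar{R} + \frac{\beta}{\bar{o}}\,\delta
\end{align}
$$

On the first update $\bar{o} = \beta$, so the effective step size is $1$; since $\bar{o}$ then rises towards $1$, the effective step size decays towards $\beta$. The rest of the algorithm would be unchanged.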