# Chapter 10 - Exercises

### Exercise 10.1 

**Q**

We have not explicitly considered or given pseudocode for any Monte Carlo methods in this chapter. What would they be like? Why is it reasonable not to give pseudocode for them? How would they perform on the Mountain Car task?

**A**

A Monte Carlo method is equivalent to the n-step method with $n = \infty$ (in practice, any $n$ at least as large as the episode length $T$), so it is a special case of the n-step pseudocode already given, which is why separate pseudocode is unnecessary.

For the episodic pseudocode (the only kind given up to this point), the Monte Carlo method amounts to performing the updates only after the episode has ended. This can be obtained by setting $n$ to any value of at least $T$ once the end of the episode is reached (for example, setting $n = T + 1$ right after the line $T \gets t + 1$). Monte Carlo methods cannot be applied to continuing tasks, because they require the episode to end before any update is made.
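
As a minimal sketch of this equivalence (the function names and the toy episode below are illustrative assumptions, not part of the book's pseudocode), the following Python code computes the n-step return used in the pseudocode and compares it with the full Monte Carlo return. Once $n$ exceeds the episode length, the bootstrap term is never added and the two coincide:

```python
def n_step_return(rewards, bootstrap_values, tau, n, gamma):
    """n-step return G_{tau:tau+n} for an episode of length T = len(rewards).

    rewards[i - 1] holds R_i; bootstrap_values[i] holds the estimate
    q_hat(S_i, A_i, w) used for bootstrapping (hypothetical inputs).
    """
    T = len(rewards)
    end = min(tau + n, T)
    G = sum(gamma ** (i - tau - 1) * rewards[i - 1] for i in range(tau + 1, end + 1))
    if tau + n < T:
        # Bootstrap only when the n-step window ends before the episode does.
        G += gamma ** n * bootstrap_values[tau + n]
    return G


def monte_carlo_return(rewards, tau, gamma):
    """Full discounted return from time tau until the end of the episode."""
    T = len(rewards)
    return sum(gamma ** (i - tau - 1) * rewards[i - 1] for i in range(tau + 1, T + 1))


rewards = [-1.0, -1.0, -1.0, -1.0]   # R_1..R_4, so T = 4 (toy episode)
estimates = [5.0, 3.0, 2.0, 1.0]     # arbitrary estimates for (S_0, A_0)..(S_3, A_3)
gamma = 0.9
T = len(rewards)

mc = [monte_carlo_return(rewards, tau, gamma) for tau in range(T)]
big_n = [n_step_return(rewards, estimates, tau, T + 1, gamma) for tau in range(T)]
one_step = [n_step_return(rewards, estimates, tau, 1, gamma) for tau in range(T)]

print(mc == big_n)      # n = T + 1 never bootstraps
print(mc == one_step)   # n = 1 bootstraps, so the returns differ
```

With $n = 1$, the arbitrary estimates leak into the target through the bootstrap term, which is exactly what the Monte Carlo case avoids.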

In the Mountain Car task, performance improves from $n = 1$ to $n = 4$, which is the best case, and then degrades as $n$ increases further. Monte Carlo would therefore be expected to perform worse than all the cases shown: the worst of them is $n = 16$, the largest value of $n$ tested, and the trend is toward worse performance as $n$ grows. It's important to note, though, that the task was only evaluated over the first 50 episodes. Monte Carlo may compare more favorably over a very large number of episodes, because its target is the actual return, which does not depend on $\textbf{w}$, so its updates are true gradient updates rather than semi-gradient ones and carry no bootstrapping bias.

### Exercise 10.2

**Q**

Give pseudocode for semi-gradient one-step Expected Sarsa for control.

**A**

The pseudocode of the given n-step Sarsa is:

**Episodic semi-gradient *n*-step Sarsa for estimating $\widehat{q} \approx q_*$ or $q_{\pi}$**

> Input: a differentiable action-value function parameterization $\widehat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$<br/>
> Input: a policy $\pi$ (if estimating $q_{\pi}$)<br/>
> Algorithm parameters: step size $\alpha > 0$, small $\epsilon > 0$, a positive integer $n$<br/>
> Initialize value-function weights $\textbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\textbf{w} = \textbf{0}$)<br/>
> All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n + 1$
>
> Loop for each episode:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Initialize and store $S_0 \neq terminal$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Select and store an action $A_0 \sim \pi(\cdot | S_0)$ or $\epsilon$-greedy wrt $\widehat{q}(S_0, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;$T \gets \infty$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Loop for $t = 0, 1, 2, ...$:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $t < T$, then:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Take action $A_t$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $S_{t+1}$ is terminal, then:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$T \gets t + 1$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Select and store an action $A_{t+1} \sim \pi(\cdot | S_{t+1})$ or $\epsilon$-greedy wrt $\widehat{q}(S_{t+1}, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\tau \gets t - n + 1$ ($\tau$ is the time whose estimate is being updated)<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $\tau \geq 0$:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i - \tau - 1} R_i$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $\tau + n < T$, then $G \gets G + \gamma^n \widehat{q}(S_{\tau + n}, A_{\tau + n}, \textbf{w})$ $\quad (G_{\tau:\tau+n})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\textbf{w} \gets \textbf{w} + \alpha [G - \widehat{q}(S_{\tau}, A_{\tau}, \textbf{w})] \nabla \widehat{q}(S_{\tau}, A_{\tau}, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;until $\tau = T - 1$

One-step Expected Sarsa is similar to one-step Sarsa (the pseudocode above with $n = 1$), but it updates the value of a state-action pair using the expected value of the next state, which is the sum of the action values of all actions weighted by the probability of choosing each action under the policy (or under the $\epsilon$-greedy policy when estimating $q_*$).

The pseudocode is the same as the one above (n-step Sarsa), but setting $n = 1$ and changing the line:

>If $\tau + n < T$, then $G \gets G + \gamma^n \widehat{q}(S_{\tau + n}, A_{\tau + n}, \textbf{w})$

to:

>If $\tau + n < T$, then $G \gets G + \gamma^n \sum_{a \in \mathcal{A}(S_{\tau + n})} \pi(a | S_{\tau + n}) \widehat{q}(S_{\tau + n}, a, \textbf{w})$

where $\pi$ is the policy being estimated, or, if the objective is estimating $q_*$, the $\epsilon$-greedy policy over $\widehat{q}$.

It's important to note that the pseudocode can be generalized for any value of $n$.

For $n = 1$, specifically, the algorithm can be simplified to:

**Episodic semi-gradient one-step Expected Sarsa for estimating $\widehat{q} \approx q_*$ or $q_{\pi}$**

> Input: a differentiable action-value function parameterization $\widehat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$<br/>
> Input: a policy $\pi$ (if estimating $q_{\pi}$)<br/>
> Algorithm parameters: step size $\alpha > 0$, small $\epsilon > 0$<br/>
> Initialize value-function weights $\textbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\textbf{w} = \textbf{0}$)<br/>
>
> Loop for each episode:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Initialize $S \gets S_0 \neq terminal$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Select an action $A \gets A_0 \sim \pi(\cdot | S_0)$ or $\epsilon$-greedy wrt $\widehat{q}(S_0, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Loop until $S$ is terminal:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Take action $A$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Observe the next reward as $R$ and the next state as $NS$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$G \gets R$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $NS$ is not terminal, then $G \gets G + \gamma \sum_{a \in \mathcal{A}(NS)} \pi(a | NS) \widehat{q}(NS, a, \textbf{w})$ (taking $\pi$ as the $\epsilon$-greedy policy wrt $\widehat{q}(NS, \cdot, \textbf{w})$ when estimating $q_*$)<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\textbf{w} \gets \textbf{w} + \alpha [G - \widehat{q}(S, A, \textbf{w})] \nabla \widehat{q}(S, A, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$S \gets NS$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $S$ is not terminal, then select an action $A \sim \pi(\cdot | S)$ or $\epsilon$-greedy wrt $\widehat{q}(S, \cdot, \textbf{w})$<br/>
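
The one-step pseudocode above can be sketched in Python as follows. This is only an illustration under stated assumptions: the corridor environment, the one-hot features (which make the linear $\widehat{q}$ effectively tabular), and all names and hyperparameter values are hypothetical choices, not taken from the book.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # hypothetical corridor: states 0..4, terminal at 5
TERMINAL = N_STATES


def step(state, action):
    """Toy dynamics (an assumption): action 1 moves right, action 0 moves left; reward -1 per step."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return -1.0, next_state


def features(state, action):
    """One-hot features, so the linear q_hat is effectively tabular."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[state * N_ACTIONS + action] = 1.0
    return x


def q_hat(state, action, w):
    return w @ features(state, action)


def eps_greedy_probs(state, w, eps):
    """Return pi(. | state) for the eps-greedy policy wrt q_hat, plus the action values."""
    q = np.array([q_hat(state, a, w) for a in range(N_ACTIONS)])
    probs = np.full(N_ACTIONS, eps / N_ACTIONS)
    probs[np.argmax(q)] += 1.0 - eps
    return probs, q


def expected_sarsa(episodes=500, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = rng.choice(N_ACTIONS, p=eps_greedy_probs(s, w, eps)[0])
        while s != TERMINAL:
            r, ns = step(s, a)
            g = r
            if ns != TERMINAL:
                probs, q = eps_greedy_probs(ns, w, eps)
                g += gamma * (probs @ q)   # expectation over next actions
            # For a linear q_hat, the gradient wrt w is the feature vector.
            w += alpha * (g - q_hat(s, a, w)) * features(s, a)
            s = ns
            if s != TERMINAL:
                a = rng.choice(N_ACTIONS, p=eps_greedy_probs(s, w, eps)[0])
    return w


w = expected_sarsa()
greedy = [int(np.argmax([q_hat(s, a, w) for a in range(N_ACTIONS)])) for s in range(N_STATES)]
print(greedy)   # greedy action per state (1 means "right")
```

Because the target averages over all next actions instead of sampling one, the only randomness in each update of this deterministic toy environment comes from which state-action pairs are visited.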

### Exercise 10.3

**Q**

Why do the results shown in Figure 10.4 have higher standard errors at large n than at small n?

**A**

Larger values of n rely more on sampled rewards and less on bootstrapping. In the long run this tends to reduce the error introduced by the bootstrapped (semi-gradient) target, especially when n exceeds the number of steps in the episode, in which case the actual return is used and no bootstrapping takes place. On the other hand, an n-step return with large n depends on more random action selections and transitions, so its variance grows with n. The action-value estimates therefore need many more episodes to converge and fluctuate widely at the beginning. Since only the first 50 episodes of each run were used, performance at large n differs considerably from run to run, which shows up as higher standard errors (and worse average performance) than at small n.
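
The role of variance in this argument can be illustrated with a small simulation. The sketch below assumes a toy setting (i.i.d. noisy rewards and a fixed bootstrap value, neither of which comes from the Mountain Car experiment) and estimates how the variance of the sampled part of an n-step return grows with n:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 1.0
n_values = [1, 2, 4, 8, 16]

# Hypothetical noisy rewards: mean -1 with unit-variance noise, 10,000 sampled trajectories.
rewards = -1.0 + rng.standard_normal((10_000, max(n_values)))
discounts = gamma ** np.arange(max(n_values))

variances = {}
for n in n_values:
    # Sampled part of the n-step return. The bootstrap term is treated as a
    # fixed constant here, so it adds no variance in this simplified setting.
    returns = (rewards[:, :n] * discounts[:n]).sum(axis=1)
    variances[n] = float(returns.var())

for n in n_values:
    print(n, round(variances[n], 2))
```

In this simplified setting the variance grows roughly linearly with n, so averages over a fixed number of runs are noisier for larger n.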

### Exercise 10.4

**Q**

Give pseudocode for a differential version of semi-gradient Q-learning.

**A**

First, the **non-differential** pseudocode is almost the same as n-step Sarsa, but it bootstraps from the largest action value of the state reached after the last sampled reward:

>If $\tau + n < T$, then $G \gets G + \gamma^n \operatorname{max}_a \widehat{q}(S_{\tau + n}, a, \textbf{w})$

**Episodic semi-gradient *n*-step Q-learning for estimating $\widehat{q} \approx q_*$ or $q_{\pi}$**

> Input: a differentiable action-value function parameterization $\widehat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$<br/>
> Input: a policy $\pi$ (if estimating $q_{\pi}$)<br/>
> Algorithm parameters: step size $\alpha > 0$, small $\epsilon > 0$, a positive integer $n$<br/>
> Initialize value-function weights $\textbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\textbf{w} = \textbf{0}$)<br/>
> All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n + 1$
>
> Loop for each episode:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Initialize and store $S_0 \neq terminal$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Select and store an action $A_0 \sim \pi(\cdot | S_0)$ or $\epsilon$-greedy wrt $\widehat{q}(S_0, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;$T \gets \infty$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Loop for $t = 0, 1, 2, ...$:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $t < T$, then:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Take action $A_t$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $S_{t+1}$ is terminal, then:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$T \gets t + 1$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Select and store an action $A_{t+1} \sim \pi(\cdot | S_{t+1})$ or $\epsilon$-greedy wrt $\widehat{q}(S_{t+1}, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\tau \gets t - n + 1$ ($\tau$ is the time whose estimate is being updated)<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $\tau \geq 0$:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i - \tau - 1} R_i$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $\tau + n < T$, then $G \gets G + \gamma^n \operatorname{max}_a \widehat{q}(S_{\tau + n}, a, \textbf{w})$ $\quad (G_{\tau:\tau+n})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\textbf{w} \gets \textbf{w} + \alpha [G - \widehat{q}(S_{\tau}, A_{\tau}, \textbf{w})] \nabla \widehat{q}(S_{\tau}, A_{\tau}, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;until $\tau = T - 1$

The **differential** semi-gradient Q-learning pseudocode should use the differential form of the TD error with action-values:

$$
\delta_t \doteq R_{t+1} - \overline{R}_t + \widehat{q}(S_{t+1}, A_{t+1}, \textbf{w}_t) - \widehat{q}(S_t, A_t, \textbf{w}_t) \tag{10.11}
$$

with the weights update being:

$$
\textbf{w}_{t+1} \doteq \textbf{w}_t + \alpha \delta_t \nabla \widehat{q}(S_t, A_t, \textbf{w}_t) \tag{10.12}
$$

In the case of Q-learning, the next action-value used in the update is the maximum among all possible actions:

$$
\delta_t \doteq R_{t+1} - \overline{R}_t + \operatorname{max}_a \widehat{q}(S_{t+1}, a, \textbf{w}_t) - \widehat{q}(S_t, A_t, \textbf{w}_t)
$$

**Differential semi-gradient Q-learning for estimating $\widehat{q} \approx q_*$**

> Input: a differentiable action-value function parameterization $\widehat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$<br/>
> Algorithm parameters: step sizes $\alpha, \beta > 0$<br/>
> Initialize value-function weights $\textbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\textbf{w} = \textbf{0}$)<br/>
> Initialize average reward estimate $\overline{R} \in \mathbb{R}$ arbitrarily (e.g., $\overline{R} = 0$)<br/>
>
> Initialize state S, action A<br/>
> Loop for each step:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Take action A, observe R, S'<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;$\delta \gets R - \overline{R} + \operatorname{max}_a \widehat{q}(S', a, \textbf{w}) - \widehat{q}(S, A, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;$\overline{R} \gets \overline{R} + \beta \delta$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;$\textbf{w} \gets \textbf{w} + \alpha \delta \nabla \widehat{q}(S, A, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;$S \gets S'$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Choose A as a function of $\widehat{q}(S, \cdot, \textbf{w})$ (e.g., $\epsilon$-greedy)<br/>
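
A minimal Python sketch of the differential pseudocode above is given below. The two-state continuing MDP, the tabular weights (equivalent to a linear $\widehat{q}$ with one-hot features, whose gradient is an indicator), and the hyperparameter values are all assumptions made for illustration:

```python
import numpy as np


def step(state, action):
    """Hypothetical continuing MDP: action 0 stays, action 1 switches state.

    The reward is +1 only for staying in state 1, so the best average reward is 1.
    """
    reward = 1.0 if (state == 1 and action == 0) else 0.0
    next_state = state if action == 0 else 1 - state
    return reward, next_state


def differential_q_learning(steps=50_000, alpha=0.1, beta=0.01, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros((2, 2))      # w[s, a] plays the role of q_hat(s, a, w)
    avg_reward = 0.0          # estimate of the average reward (R-bar)
    s = 0
    for _ in range(steps):
        # eps-greedy action selection wrt the current estimates
        if rng.random() < eps:
            a = int(rng.integers(2))
        else:
            a = int(np.argmax(w[s]))
        r, ns = step(s, a)
        # Differential TD error with a max over next actions (Q-learning target)
        delta = r - avg_reward + np.max(w[ns]) - w[s, a]
        avg_reward += beta * delta
        w[s, a] += alpha * delta   # gradient of a one-hot linear q_hat is an indicator
        s = ns
    return w, avg_reward


w, avg_reward = differential_q_learning()
greedy_policy = [int(np.argmax(w[s])) for s in range(2)]
print(round(avg_reward, 3), greedy_policy)
```

In this toy problem the greedy policy should switch out of state 0 and stay in state 1. Only differences between action values are meaningful here, since adding a constant to every weight leaves all the TD errors unchanged.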

### Exercise 10.5 

**Q**

What equations are needed (beyond 10.10) to specify the differential version of TD(0)?

**A**

Aside from equation 10.10:

$$
\delta_t \doteq R_{t+1} - \overline{R}_t + \widehat{v}(S_{t+1}, \textbf{w}_t) - \widehat{v}(S_t, \textbf{w}_t) \tag{10.10}
$$

TD(0) updates state values, so it uses the counterpart of equation 10.12 with state values instead of action values:

$$
\textbf{w}_{t+1} \doteq \textbf{w}_t + \alpha \delta_t \nabla \widehat{v}(S_t, \textbf{w}_t)
$$

Finally, it needs an equation to update the average reward based on the samples:

$$
\overline{R}_{t+1} \gets \overline{R}_t + \beta \delta_t
$$

where $\beta > 0$ is the step size applied to the TD error. As the error goes to 0, $\overline{R}$ stabilizes (from then on, $\overline{R}_{t+1} \approx \overline{R}_t$).

Alternatively, the average reward could be updated as a sample average, using a step counter $C$ (initialized to 0):

\begin{align*}
C &\gets C + 1 \\
\overline{R}_{t+1} &\gets \overline{R}_t + \frac{1}{C}[R_{t+1} - \overline{R}_t]
\end{align*}

The only issue with the above approach is that very old rewards (including those received at the very beginning, under early policies) are never forgotten, but for a fixed policy the update above should make $\overline{R}$ converge to the true average reward as the number of steps approaches $\infty$.

It's important to consider that TD(0) estimates state values, which works well in a Markov Reward Process (MRP), where there are no actions, or when evaluating a fixed policy. For control in an MDP with more than one action, state values alone do not indicate which action to select unless a model of the environment is used to simulate transitions, so an approach based on action values, like Sarsa, can be used instead.
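
The three updates above (the TD error 10.10, the weight update, and the average-reward update with step size $\beta$) can be sketched on a small MRP. The example below is an assumption chosen for illustration: two states, A and B, that alternate deterministically, with reward +1 when leaving A and 0 when leaving B, and tabular state values (a linear $\widehat{v}$ with one-hot features).

```python
import numpy as np


def differential_td0(steps=5_000, alpha=0.1, beta=0.01):
    v = np.zeros(2)       # v[0] is the value of A, v[1] is the value of B
    avg_reward = 0.0      # estimate of the average reward (R-bar)
    s = 0                 # start in A
    for _ in range(steps):
        r = 1.0 if s == 0 else 0.0    # +1 when leaving A, 0 when leaving B
        ns = 1 - s                    # the states alternate deterministically
        delta = r - avg_reward + v[ns] - v[s]   # differential TD error (10.10)
        avg_reward += beta * delta              # average-reward update
        v[s] += alpha * delta                   # weight update (one-hot gradient)
        s = ns
    return v, avg_reward


v, avg_reward = differential_td0()
print(round(avg_reward, 6), round(v[0] - v[1], 6))
```

Setting both TD errors to zero gives $\overline{R} = 0.5$ and $\widehat{v}(A) - \widehat{v}(B) = 0.5$, which is what the sketch should approach. Only the difference between the two values is determined by these equations, not the values themselves.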

### Exercise 10.6

**Q**

Suppose there is an MDP that under any policy produces the deterministic sequence of rewards +1, 0, +1, 0, +1, 0,... going on forever. Technically, this is not allowed because it violates ergodicity; there is no stationary limiting distribution $\mu_{\pi}$ and the limit (10.7) does not exist. Nevertheless, the average reward (10.6) is well defined. What is it? Now consider two states in this MDP. From A, the reward sequence is exactly as described above, starting with a +1, whereas, from B, the reward sequence starts with a 0 and then continues with +1, 0, +1, 0, .... The differential return (10.9) is not well defined for this case as the limit does not exist. To repair this, one could alternately define the value of a state as

$$
v_{\pi}(s) \doteq \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^h \gamma^t (\mathbb{E}_{\pi}[R_{t+1} | S_0=s] - r(\pi)). \tag{10.13}
$$

Under this definition, what are the values of states A and B?

**A**

The average reward is the reward received per step in the long run (the mean reward across a very large number of steps). In the example, the rewards alternate between +1 and 0, so the average reward is 0.5.

The sequence of states can be seen as an infinite loop alternating between states A and B, with the reward of going from A to B being +1, and the reward of going from B to A being 0.

We have:

$$
r(\pi) = 0.5
$$

$$
\mathbb{E}_{\pi}[R_{t+1} | S_t=A] = 1
$$

$$
\mathbb{E}_{\pi}[R_{t+1} | S_t=B] = 0
$$

When $S_0 = A$, $R_{2t + 1} = 1$ and $R_{2t + 2} = 0$ for any time step $t \geq 0$ (it alternates the rewards, with $R_1=1, R_2=0, R_3=1, R_4=0, ...$).

When $S_0 = B$, $R_{2t + 1} = 0$ and $R_{2t + 2} = 1$ for any time step $t \geq 0$ (it alternates the rewards, with $R_1=0, R_2=1, R_3=0, R_4=1, ...$).

So:

\begin{align*}
v_{\pi}(A) &\doteq \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^h \gamma^t (\mathbb{E}_{\pi}[R_{t+1} | S_0=A] - r(\pi)) \\
&= \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} [\gamma^{2t} (\mathbb{E}_{\pi}[R_{2t+1} | S_0=A] - r(\pi)) + \gamma^{2t+1} (\mathbb{E}_{\pi}[R_{2t+2} | S_0=A] - r(\pi))] \\
&= \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} [\gamma^{2t} (1 - 0.5) + \gamma^{2t+1} (0 - 0.5)] \\
&= \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} [0.5\gamma^{2t} - 0.5\gamma^{2t+1}] \\
&= \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} 0.5 \gamma^{2t} (1 - \gamma) \\
&= \operatorname*{lim}_{\gamma \to 1} 0.5 (1 - \gamma) \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} \gamma^{2t} \\
&= \operatorname*{lim}_{\gamma \to 1} 0.5 (1 - \gamma) \operatorname*{lim}_{h \to \infty} \frac{1 - \gamma^{h+2}}{1 - \gamma^2} \\
&= \operatorname*{lim}_{\gamma \to 1} 0.5 (1 - \gamma) \frac{1}{1 - \gamma^2} \\
&= \operatorname*{lim}_{\gamma \to 1} 0.5 (1 - \gamma) \frac{1}{(1 + \gamma)(1 - \gamma)} \\
&= \operatorname*{lim}_{\gamma \to 1} \frac{0.5}{1 + \gamma} \\
&= \frac{0.5}{1 + 1} \\
&= 0.25
\end{align*}

Similarly:

\begin{align*}
v_{\pi}(B) &\doteq \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^h \gamma^t (\mathbb{E}_{\pi}[R_{t+1} | S_0=B] - r(\pi)) \\
&= \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} [\gamma^{2t} (\mathbb{E}_{\pi}[R_{2t+1} | S_0=B] - r(\pi)) + \gamma^{2t+1} (\mathbb{E}_{\pi}[R_{2t+2} | S_0=B] - r(\pi))] \\
&= \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} [\gamma^{2t} (0 - 0.5) + \gamma^{2t+1} (1 - 0.5)] \\
&= \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} [(-0.5\gamma^{2t}) + (0.5\gamma^{2t+1})] \\
&= \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^{\frac{h}{2}} -1 \times [0.5\gamma^{2t} - 0.5\gamma^{2t+1}] \\
&= -1 \times v_{\pi}(A) \\
&= -0.25
\end{align*}
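
The two results can be checked numerically. The sketch below (the function name and the truncation horizon are assumptions) evaluates the truncated sum in (10.13) for values of $\gamma$ approaching 1 and compares it with the closed form $\frac{0.5}{1 + \gamma}$ obtained in the derivation:

```python
def discounted_differential_value(first_reward, gamma, horizon):
    """Truncated sum in (10.13) for rewards alternating between 1 and 0."""
    avg_reward = 0.5
    total, discount, reward = 0.0, 1.0, first_reward
    for _ in range(horizon + 1):
        total += discount * (reward - avg_reward)
        discount *= gamma
        reward = 1.0 - reward   # the rewards alternate: 1, 0, 1, 0, ... or 0, 1, 0, 1, ...
    return total


results = []
for gamma in (0.9, 0.99, 0.999):
    # A horizon of 50,000 steps makes gamma**horizon negligible for these gammas.
    v_a = discounted_differential_value(1.0, gamma, horizon=50_000)
    v_b = discounted_differential_value(0.0, gamma, horizon=50_000)
    results.append((gamma, v_a, v_b))
    print(gamma, round(v_a, 6), round(v_b, 6), round(0.5 / (1 + gamma), 6))
```

As $\gamma$ gets closer to 1, the computed values approach $0.25$ for A and $-0.25$ for B.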

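Alternatively, due to the alternating nature of the states and rewards (for example, $\mathbb{E}_{\pi}[R_{t+2} | S_0=A] = \mathbb{E}_{\pi}[R_{t+1} | S_0=B]$), and considering that $h \to \infty$ (so $\operatorname*{lim}_{h \to \infty} \sum_{t=1}^h \gamma^t f(t) = \gamma \operatorname*{lim}_{h \to \infty} \sum_{t=0}^h \gamma^t f(t+1)$), we can write a recursion between the two values. In the steps below, a $v_{\pi}$ that appears inside $\operatorname*{lim}_{\gamma \to 1}$ denotes the discounted sum for a fixed $\gamma < 1$, before the outer limit is taken: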

\begin{align*}
v_{\pi}(A) &\doteq \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^h \gamma^t (\mathbb{E}_{\pi}[R_{t+1} | S_0=A] - r(\pi)) \\
&= R_1 - r(\pi) + \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=1}^h \gamma^t (\mathbb{E}_{\pi}[R_{t+1} | S_0=A] - r(\pi)) \\
&= R_1 - r(\pi) + \operatorname*{lim}_{\gamma \to 1} \operatorname*{lim}_{h \to \infty} \sum_{t=0}^h \gamma^{t+1} (\mathbb{E}_{\pi}[R_{t+2} | S_0=A] - r(\pi)) \\
&= R_1 - r(\pi) + \operatorname*{lim}_{\gamma \to 1} \gamma \left[ \operatorname*{lim}_{h \to \infty} \sum_{t=0}^h \gamma^t (\mathbb{E}_{\pi}[R_{t+2} | S_0=A] - r(\pi)) \right] \\
&= R_1 - r(\pi) + \operatorname*{lim}_{\gamma \to 1} \gamma \left[ \operatorname*{lim}_{h \to \infty} \sum_{t=0}^h \gamma^t (\mathbb{E}_{\pi}[R_{t+1} | S_0=B] - r(\pi)) \right] \\
&= R_1 - r(\pi) + \operatorname*{lim}_{\gamma \to 1} \gamma v_{\pi}(B) \\
&= 1 - 0.5 + \operatorname*{lim}_{\gamma \to 1} \gamma v_{\pi}(B) \\
&= \operatorname*{lim}_{\gamma \to 1} \gamma v_{\pi}(B) + 0.5
\end{align*}

And the equivalent for the state B:

\begin{align*}
v_{\pi}(B) &= R_1 - r(\pi) + \operatorname*{lim}_{\gamma \to 1} \gamma v_{\pi}(A) \\
&= 0 - 0.5 + \operatorname*{lim}_{\gamma \to 1} \gamma v_{\pi}(A) \\
&= \operatorname*{lim}_{\gamma \to 1} \gamma v_{\pi}(A) - 0.5
\end{align*}

Finally:

\begin{align*}
v_{\pi}(A) &= \operatorname*{lim}_{\gamma \to 1} \gamma v_{\pi}(B) + 0.5 \\
&= \operatorname*{lim}_{\gamma \to 1} \gamma [\gamma v_{\pi}(A) - 0.5] + 0.5 \\
&= \operatorname*{lim}_{\gamma \to 1} \gamma^2 v_{\pi}(A) - 0.5 \gamma + 0.5 \\
\operatorname*{lim}_{\gamma \to 1} v_{\pi}(A) - \gamma^2 v_{\pi}(A) &= \operatorname*{lim}_{\gamma \to 1} \frac{1 - \gamma}{2} \\
v_{\pi}(A) &= \operatorname*{lim}_{\gamma \to 1} \frac{1 - \gamma}{2(1 - \gamma^2)} \\
&= \operatorname*{lim}_{\gamma \to 1} \frac{1 - \gamma}{2(1 + \gamma)(1 - \gamma)} \\
&= \operatorname*{lim}_{\gamma \to 1} \frac{1}{2(1 + \gamma)} \\
&= \frac{1}{2(1 + 1)} \\
v_{\pi}(A) &= 0.25
\end{align*}

and:

\begin{align*}
v_{\pi}(B) &= \operatorname*{lim}_{\gamma \to 1} \gamma v_{\pi}(A) - 0.5 \\
&= \operatorname*{lim}_{\gamma \to 1} 0.25 \gamma - 0.5 \\
&= 0.25 - 0.5 \\
&= -0.25
\end{align*}