# Chapter 10 - Exercises

### Exercise 10.1 

**Q**

We have not explicitly considered or given pseudocode for any Monte Carlo methods in this chapter. What would they be like? Why is it reasonable not to give pseudocode for them? How would they perform on the Mountain Car task?

**A**

The Monte Carlo method is equivalent to the n-step TD method with $n = \infty$, so it can be seen as a special case of the n-step pseudocode already given. That is why separate pseudocode for it is unnecessary.

For the episodic pseudocode (the only one given up to this point), the Monte Carlo method amounts to computing $\tau$ and performing updates only after the end of the episode has been reached, with $n$ set to any value of at least $T$ once $T$ is known (for example, setting $n = T + 1$ right after the line $T \gets t + 1$). The Monte Carlo method cannot be applied to continuing tasks, because it requires that the end of an episode be reached.
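
The equivalence can be checked numerically. The following is a minimal sketch, not part of the book's pseudocode: the function names and the list `bootstrap_values` (standing in for precomputed $\widehat{q}(S_i, A_i, \textbf{w})$ values) are assumptions made for illustration. It computes $G_{\tau:\tau+n}$ exactly as in the n-step pseudocode and compares it with the plain Monte Carlo return.

```python
def n_step_return(rewards, bootstrap_values, tau, n, gamma):
    """Compute G_{tau:tau+n} for one recorded episode.

    rewards[i] holds R_{i+1}, so len(rewards) == T.
    bootstrap_values[i] is a stand-in for q_hat(S_i, A_i, w), i < T
    (hypothetical precomputed values, used only for illustration).
    """
    T = len(rewards)
    end = min(tau + n, T)
    # Discounted sum of R_{tau+1}, ..., R_{end}
    G = sum(gamma ** (i - tau - 1) * rewards[i - 1]
            for i in range(tau + 1, end + 1))
    # Bootstrap only when the n-step window stops before the episode ends
    if tau + n < T:
        G += gamma ** n * bootstrap_values[tau + n]
    return G


def monte_carlo_return(rewards, tau, gamma):
    """Full discounted return from time tau to the end of the episode."""
    T = len(rewards)
    return sum(gamma ** (i - tau - 1) * rewards[i - 1]
               for i in range(tau + 1, T + 1))
```

For any `n` that reaches the end of the episode from `tau`, the bootstrap branch is never taken and `n_step_return` returns the same value as `monte_carlo_return`; for smaller `n` the result depends on the bootstrapped estimate.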

On the Mountain Car task, performance improves from $n = 1$ up to $n = 4$, where it is best, and then degrades as $n$ increases further. Monte Carlo would therefore be expected to perform worse than all the cases shown: the worst of them is $n = 16$, the largest value of $n$ tested, and the trend is toward worse performance as $n$ grows. It's important to note, though, that the task was only evaluated over the first 50 episodes. The Monte Carlo method may end up better over a very large number of episodes, because its targets are actual returns that do not depend on $\textbf{w}$, so its updates are true gradient updates rather than semi-gradient ones and the bias introduced by bootstrapping disappears.

### Exercise 10.2

**Q**

Give pseudocode for semi-gradient one-step Expected Sarsa for control.

**A**

The given pseudocode for n-step Sarsa is:

**Episodic semi-gradient *n*-step Sarsa for estimating $\widehat{q} \approx q_*$ or $q_{\pi}$**

> Input: a differentiable action-value function parameterization $\widehat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$<br/>
> Input: a policy $\pi$ (if estimating $q_{\pi}$)<br/>
> Algorithm parameters: step size $\alpha > 0$, small $\epsilon > 0$, a positive integer $n$<br/>
> Initialize value-function weights $\textbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\textbf{w} = \textbf{0}$)<br/>
> All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n + 1$
>
> Loop for each episode:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Initialize and store $S_0 \neq terminal$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Select and store an action $A_0 \sim \pi(\cdot | S_0)$ or $\epsilon$-greedy wrt $\widehat{q}(S_0, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;$T \gets \infty$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Loop for $t = 0, 1, 2, ...$:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $t < T$, then:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Take action $A_t$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $S_{t+1}$ is terminal, then:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$T \gets t + 1$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Select and store an action $A_{t+1} \sim \pi(\cdot | S_{t+1})$ or $\epsilon$-greedy wrt $\widehat{q}(S_{t+1}, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\tau \gets t - n + 1$ ($\tau$ is the time whose estimate is being updated)<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $\tau \geq 0$:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$G \gets \sum_{i = \tau + 1}^{min(\tau + n, T)} \gamma^{i - \tau - 1} R_i$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $\tau + n < T$, then $G \gets G + \gamma^n \widehat{q}(S_{\tau + n}, A_{\tau + n}, \textbf{w})$ $\quad (G_{\tau:\tau+n})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\textbf{w} \gets \textbf{w} + \alpha [G - \widehat{q}(S_{\tau}, A_{\tau}, \textbf{w})] \nabla \widehat{q}(S_{\tau}, A_{\tau}, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;until $\tau = T - 1$

One-step Expected Sarsa is similar to one-step Sarsa (the pseudocode above with $n = 1$), but it updates the value of a state-action pair using the expected value of the next state: the sum of the action values of all actions, each weighted by the probability of choosing that action under the policy (or under the $\epsilon$-greedy policy when estimating $q_*$).

The pseudocode would be exactly the same as the pseudocode above (n-step Sarsa), but defining $n = 1$ and changing the line:

>If $\tau + n < T$, then $G \gets G + \gamma^n \widehat{q}(S_{\tau + n}, A_{\tau + n}, \textbf{w})$

to:

>If $\tau + n < T$, then $G \gets G + \gamma^n \sum_{a \in \mathcal{A}(S_{\tau + n})} \pi(a | S_{\tau + n}) \widehat{q}(S_{\tau + n}, a, \textbf{w})$

where $\pi$ is the policy being evaluated or, if the objective is to estimate $q_*$, the $\epsilon$-greedy policy with respect to $\widehat{q}$.
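
For the $q_*$ case, the expectation in the modified line can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and splitting the greedy probability equally among tied actions is an assumption, since the text does not say how ties are handled.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values.

    Assumption: tied greedy actions share the greedy mass equally.
    """
    n_actions = len(q_values)
    best = max(q_values)
    greedy = [i for i, q in enumerate(q_values) if q == best]
    # Every action gets the exploratory share epsilon / |A|
    probs = [epsilon / n_actions] * n_actions
    # The remaining 1 - epsilon is spread over the greedy action(s)
    for i in greedy:
        probs[i] += (1.0 - epsilon) / len(greedy)
    return probs


def expected_sarsa_target(reward, next_q_values, gamma, epsilon, terminal):
    """One-step Expected Sarsa target: R + gamma * sum_a pi(a|S') q_hat(S', a, w)."""
    if terminal:
        return reward
    probs = epsilon_greedy_probs(next_q_values, epsilon)
    return reward + gamma * sum(p * q for p, q in zip(probs, next_q_values))
```

Here `next_q_values` stands for the list of $\widehat{q}(S_{\tau + n}, a, \textbf{w})$ over all actions $a$. When evaluating a fixed policy $\pi$ instead, `epsilon_greedy_probs` would be replaced by that policy's probabilities.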

It's important to note that the pseudocode can be generalized for any value of $n$.

For $n = 1$, specifically, the algorithm can be simplified to:

**Episodic semi-gradient one-step Expected Sarsa for estimating $\widehat{q} \approx q_*$ or $q_{\pi}$**

> Input: a differentiable action-value function parameterization $\widehat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$<br/>
> Input: a policy $\pi$ (if estimating $q_{\pi}$)<br/>
> Algorithm parameters: step size $\alpha > 0$, small $\epsilon > 0$<br/>
> Initialize value-function weights $\textbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\textbf{w} = \textbf{0}$)<br/>
>
> Loop for each episode:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Initialize $S \gets S_0 \neq terminal$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Select an action $A \gets A_0 \sim \pi(\cdot | S_0)$ or $\epsilon$-greedy wrt $\widehat{q}(S_0, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;Loop until $S$ is terminal:<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Take action $A$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Observe the next reward as $R$ and the next state as $NS$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$G \gets R$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $NS$ is not terminal, then $G \gets G + \gamma \sum_{a \in \mathcal{A}(NS)} \pi(a | NS) \widehat{q}(NS, a, \textbf{w})$, where $\pi(\cdot | NS)$ is the given policy or the $\epsilon$-greedy policy wrt $\widehat{q}(NS, \cdot, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\textbf{w} \gets \textbf{w} + \alpha [G - \widehat{q}(S, A, \textbf{w})] \nabla \widehat{q}(S, A, \textbf{w})$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$S \gets NS$<br/>
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If $S$ is not terminal, then select an action $A \sim \pi(\cdot | S)$ or $\epsilon$-greedy wrt $\widehat{q}(S, \cdot, \textbf{w})$<br/>
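
The one-step pseudocode above can be turned into a small runnable sketch. Everything specific here is an assumption chosen for illustration and does not come from the book: a toy corridor environment with a reward of $-1$ per step, one-hot features (so $\widehat{q}$ is linear and its gradient is the feature vector), the $\epsilon$-greedy policy for estimating $q_*$, and all function and constant names.

```python
import random

N_STATES = 5          # non-terminal states 0..4; reaching state 5 ends the episode
ACTIONS = (-1, +1)    # move left or right (toy environment, an assumption)

def features(state, action_index):
    """One-hot feature vector x(s, a) (hypothetical choice of features)."""
    x = [0.0] * (N_STATES * len(ACTIONS))
    x[state * len(ACTIONS) + action_index] = 1.0
    return x

def q_hat(state, action_index, w):
    """Linear approximation: q_hat(s, a, w) = w . x(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, features(state, action_index)))

def policy_probs(state, w, epsilon):
    """Epsilon-greedy probabilities wrt q_hat(state, ., w); ties split equally."""
    qs = [q_hat(state, a, w) for a in range(len(ACTIONS))]
    best = max(qs)
    greedy = [a for a, q in enumerate(qs) if q == best]
    return [epsilon / len(ACTIONS)
            + ((1 - epsilon) / len(greedy) if a in greedy else 0.0)
            for a in range(len(ACTIONS))]

def step(state, action_index):
    """Toy dynamics: reward -1 per step, wall at state 0, terminal at N_STATES."""
    next_state = max(0, state + ACTIONS[action_index])
    return -1.0, next_state, next_state == N_STATES

def run(episodes=200, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * len(ACTIONS))
    for _ in range(episodes):
        s, done = 0, False
        a = rng.choices(range(len(ACTIONS)), weights=policy_probs(s, w, epsilon))[0]
        while not done:
            r, ns, done = step(s, a)
            g = r
            if not done:
                # Expected value of the next state under the epsilon-greedy policy
                probs = policy_probs(ns, w, epsilon)
                g += gamma * sum(p * q_hat(ns, b, w) for b, p in enumerate(probs))
            # Semi-gradient update: for linear q_hat the gradient is x(s, a)
            delta = g - q_hat(s, a, w)
            w = [wi + alpha * delta * xi for wi, xi in zip(w, features(s, a))]
            s = ns
            if not done:
                a = rng.choices(range(len(ACTIONS)),
                                weights=policy_probs(s, w, epsilon))[0]
    return w
```

Calling `run()` returns the learned weight vector, and `q_hat(s, a, w)` reads off the estimate for a state-action pair. With one-hot features this reduces to the tabular case; a different `features` function (for example tile coding) could be substituted without changing the update itself.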

### Exercise 10.3

**Q**

Why do the results shown in Figure 10.4 have higher standard errors at large n than at small n?

**A**

The tendency is that larger values of $n$ give smaller errors in the long run, especially when $n$ exceeds the number of steps in the episodes, because the error due to the semi-gradient (bootstrapped) updates shrinks as $n$ grows (for $n$ greater than $T$ there is no bootstrapping at all in that episode and the actual returns are used). On the other hand, an n-step return with large $n$ depends on many more random actions and transitions, so it has higher variance, particularly at the beginning of learning. Convergence of the action values, and consequently the reduction of the error, therefore requires a very large number of episodes for large $n$. The 50 episodes used in each run were not enough, so the results for large $n$ vary more from run to run, which shows up as higher standard errors, and are worse than those for smaller values of $n$.