# Chapter 13 - Exercises

### Exercise 13.1

**Q**

Use your knowledge of the gridworld and its dynamics to determine an exact symbolic expression for the optimal probability of selecting the right action in Example 13.1.

**A**

Let the state values, from left to right, be $V_1$ (the start state, $V_{S_0}$), $V_2$, $V_3$ and $V_G$ (the terminal state, with $V_G = 0$). Writing each one as a function of the other state values and of the probability of choosing the right action, $p = \pi(right)$, we have:

$$
\begin{align*}
V_1(p) &= -1 + p V_2(p) + (1 - p) V_1(p)
\\
V_2(p) &= -1 + p V_1(p) + (1 - p) V_3(p)
\\
V_3(p) &= -1 + p V_G + (1 - p) V_2(p) = -1 + (1 - p) V_2(p)
\end{align*}
$$

The equations above are derived from the general definition of state-values:

$$
\begin{align*}
V(s) &\doteq \sum_a \pi(a | s) \sum_{r, s'} p(r, s' | s, a) [r + V(s')] 
\\
&= \sum_a \pi(a | s) \sum_{s'} p(s' | s, a) \cdot [-1 + V(s')] 
\\
&= -1 + \sum_a \pi(a | s) \sum_{s'} p(s' | s, a) V(s')
\\
&= -1 + [\pi(right) \cdot V(s_{right}')] + [\pi(left) \cdot V(s_{left}')]
\\
&= -1 + p V(s_{right}') + (1 - p) V(s_{left}')
\end{align*}
$$

(In the given environment the task is undiscounted ($\gamma = 1$), the reward is always $-1$, that is, $r(s, a) = -1$, and the transitions are deterministic: given a state $s$ and an action $a$, there is exactly one possible next state $s'$, so $p(s' | s, a) = 1$.)

The probability of choosing an action is written $\pi(right)$ rather than $\pi(right | s)$ because, under the function approximation used in the example, all states look identical to the agent. The policy can therefore only assign one probability to each action, independent of the actual state, and the goal is to find the probability that maximizes the return. This is the real challenge: if the agent could tell the states apart, it could simply use the deterministic policy $\pi(right | s_{left}) = 1$, $\pi(left | s_{middle}) = 1$ and $\pi(right | s_{right}) = 1$ (in the middle state the actions are reversed, so left moves the agent to the right).
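For any fixed $p$, the three Bellman equations for $V_1$, $V_2$ and $V_3$ form a linear system, so the state values can be computed directly. The following is a minimal sketch using exact rational arithmetic; the function names `solve_linear` and `corridor_values` are illustrative choices, not part of the book or of any library.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination (exact when given Fractions)."""
    n = len(b)
    # Augmented matrix; copying keeps the caller's lists untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in a row whose entry in this column is nonzero.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [x / pivot_value for x in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def corridor_values(p):
    """Return [V_1, V_2, V_3] for p = pi(right), assuming 0 < p < 1."""
    p = Fraction(p)
    q = 1 - p
    zero, one = Fraction(0), Fraction(1)
    # Bellman equations rearranged with the unknowns on the left:
    #    p*V_1 - p*V_2           = -1
    #   -p*V_1 +   V_2 - q*V_3   = -1
    #          - q*V_2 +   V_3   = -1
    a = [[p, -p, zero],
         [-p, one, -q],
         [zero, -q, one]]
    b = [Fraction(-1)] * 3
    return solve_linear(a, b)


print(corridor_values(Fraction(1, 2)))
```

For $p = 1/2$ this gives $V_1 = -12$, $V_2 = -10$ and $V_3 = -6$. Passing a `Fraction` (rather than a float) keeps every intermediate result exact.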

Then the best probability of choosing the right action is the value of $p$ that maximizes the state values:

$$
\pi(right) = \operatorname*{argmax}_p \begin{cases}
  -1 + p V_2(p) + (1 - p) V_1(p) \\
  -1 + p V_1(p) + (1 - p) V_3(p) \\
  -1 + (1 - p) V_2(p)
\end{cases}
$$

The episodes start at the leftmost state $(S_0)$, so the objective is actually to maximize $V_1 = V_{S_0}$:

$$
\pi(right) = \operatorname*{argmax}_p V_1(p)
$$

(The value of $V_1$ depends on $V_2$, which in turn depends on $V_3$, so $p$ must be chosen to make the start-state value as large as possible while accounting for how $p$ changes the values of the other states.)

The definition above characterizes the answer, but to obtain the actual probability we eliminate the other state values, express $V_1(p)$ as a function of $p$ alone, and then find the value of $p$ that maximizes it. We start by substituting the equation for $V_2$ into the equation for $V_3$:

$$
\begin{align*}
V_3 &= -1 + (1 - p) V_2
\\
V_3 &= -1 + (1 - p)  [-1 + p V_1 + (1 - p) V_3]
\\
V_3 &= -1 + p - 1 + p(1 - p)V_1 + (1 - p)^2 V_3
\\
V_3 &= -1 + p - 1 + p(1 - p)V_1 + (1 - 2p + p^2) V_3
\\
V_3 &= p - 2 + p(1 - p)V_1 + [V_3 - 2pV_3 + p^2 V_3]
\\
2pV_3 - p^2 V_3 &= p(1 - p)V_1 + p - 2
\\
V_3 &= \frac{p(1 - p)V_1 + p - 2}{2p - p^2}
\end{align*}
$$

Then, solving the first equation for $V_1$ and substituting the expressions for $V_2$ and $V_3$:

$$
\begin{align*}
V_1 &= -1 + p V_2 + (1 - p) V_1
\\
V_1 &= p V_2 - 1 + V_1 - p V_1
\\
p V_1 &= p V_2 - 1
\\
V_1 &= V_2 - \frac{1}{p}
\\
V_1 &= -1 + p V_1 + (1 - p) V_3 - \frac{1}{p}
\\
(1 - p)V_1 &= (1 - p) \frac{p(1 - p)V_1 + p - 2}{2p - p^2} - \frac{1}{p} - 1
\\
V_1 &= \frac{p(1 - p)V_1 + p - 2}{2p - p^2} - \frac{1}{p(1 - p)} - \frac{1}{1 - p}
\\
V_1 &= \frac{p(1 - p)V_1 + p - 2}{p(2 - p)} - \frac{1}{p(1 - p)} - \frac{1}{1 - p}
\\
V_1 &= \frac{(1 - p)[p(1 - p)V_1 + p - 2] - [2 - p] - p[2 - p]}{p(1 - p)(2 - p)}
\\
p(1 - p)(2 - p) V_1 &= p(1 - p)^2V_1 + p(1 - p) - 2(1 - p) - (2 - p) - (2p - p^2)
\\
p(1 - p)[(2 - p) - (1 - p)] V_1 &= p(1 - p) - 2(1 - p) - (2 - p) - (2p - p^2)
\\
p(1 - p)[2 - p - 1 + p] V_1 &= p - p^2 + 2p - 2 + p - 2 + p^2 - 2p
\\
p(1 - p) V_1 &= 2p - 4
\\
V_1 &= \frac{2p - 4}{p(1 - p)}
\end{align*}
$$
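As a sanity check on the algebra, the closed form can be compared with a purely numerical evaluation of the same policy. The sketch below runs iterative policy evaluation on the three Bellman equations until the values stop changing; the function names, tolerance and sweep limit are illustrative assumptions.

```python
def closed_form_v1(p):
    """V_1 as derived above: (2p - 4) / (p (1 - p))."""
    return (2 * p - 4) / (p * (1 - p))


def iterative_v1(p, tol=1e-12, max_sweeps=1_000_000):
    """Estimate V_1 by repeatedly applying the three Bellman equations."""
    v1 = v2 = v3 = 0.0
    for _ in range(max_sweeps):
        new_v1 = -1 + p * v2 + (1 - p) * v1
        new_v2 = -1 + p * v1 + (1 - p) * v3
        new_v3 = -1 + (1 - p) * v2
        # Largest change in this sweep, used as the stopping criterion.
        delta = max(abs(new_v1 - v1), abs(new_v2 - v2), abs(new_v3 - v3))
        v1, v2, v3 = new_v1, new_v2, new_v3
        if delta < tol:
            break
    return v1


for p in (0.3, 0.5, 0.7):
    print(p, closed_form_v1(p), iterative_v1(p))
```

The two columns should agree to many decimal places for any $p$ strictly between 0 and 1, although convergence becomes slow as $p$ approaches either end of the interval.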

Let's consider the function:

$$
f(p) = \frac{2p - 4}{p(1 - p)}
$$

To find the critical points, we need to take the derivative of $f(p)$ with respect to $p$, then set it equal to zero.

**Differentiate $f(p)$ using the quotient rule**

The quotient rule states that for a function $f(p) = \frac{g(p)}{h(p)}$, the derivative is:

$$
f'(p) = \frac{g'(p)h(p) - g(p)h'(p)}{[h(p)]^2}
$$

Here, we have:

- $g(p) = 2p - 4$
- $h(p) = p(1 - p) = p - p^2$

Now, let's compute the derivatives:

- $g'(p) = 2$
- $h'(p) = 1 - 2p$

Now, applying the quotient rule:

$$
\begin{align*}
f'(p) &= \frac{2(p - p^2) - (2p - 4)(1 - 2p)}{(p - p^2)^2}
\\
&= \frac{(2p - 2p^2) - (2p - 4p^2 - 4 + 8p)}{(p - p^2)^2}
\\
&= \frac{2p - 2p^2 - 2p + 4p^2 + 4 - 8p}{(p - p^2)^2}
\\
&= \frac{2p^2 - 8p + 4}{(p - p^2)^2}
\end{align*}
$$
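The derivative can be checked numerically with a central finite difference. This is only a sketch: the step size `h` and the sample points are arbitrary choices made for illustration.

```python
def f(p):
    """Start-state value as a function of p."""
    return (2 * p - 4) / (p * (1 - p))


def f_prime(p):
    """Derivative obtained from the quotient rule."""
    return (2 * p**2 - 8 * p + 4) / (p - p**2) ** 2


def central_difference(func, p, h=1e-6):
    """Second-order finite-difference approximation of func'(p)."""
    return (func(p + h) - func(p - h)) / (2 * h)


for p in (0.2, 0.5, 0.8):
    print(p, f_prime(p), central_difference(f, p))
```

The analytic and numerical values should match closely at each sample point, with a positive slope to the left of the maximum and a negative slope to the right of it.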

**Set $f'(p) = 0$**

To find the critical points, we set the numerator equal to zero:

$$
\begin{align*}
2p^2 - 8p + 4 &= 0
\\
p^2 - 4p + 2 &= 0
\end{align*}
$$

Now solve this quadratic equation using the quadratic formula:

$$
p = \frac{-(-4) \pm \sqrt{(-4)^2 - 4(1)(2)}}{2(1)}
$$

$$
p = \frac{4 \pm \sqrt{16 - 8}}{2}
$$

$$
p = \frac{4 \pm \sqrt{8}}{2}
$$

$$
p = \frac{4 \pm 2\sqrt{2}}{2}
$$

$$
p = 2 \pm \sqrt{2}
$$

Thus, the two possible values of $p$ are:

$$
p = 2 + \sqrt{2} \quad \text{or} \quad p = 2 - \sqrt{2}
$$

**Analyze the critical points**

Since $p$ must be between 0 and 1 (as $p$ is a probability), the only valid solution is:

$$
p = 2 - \sqrt{2}
$$

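On $(0, 1)$ the numerator of $f'(p)$ is positive for $p < 2 - \sqrt{2}$ and negative for $p > 2 - \sqrt{2}$, so $f$ first increases and then decreases, and this critical point is a maximum. Thus, the value of $p$ that maximizes $f(p)$ is: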

$$
p = 2 - \sqrt{2} \approx 0.5858
$$

So, the optimal probability of selecting the right action is:

$$
\pi(right) = 2 - \sqrt{2} \approx 0.5858
$$
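The result can also be confirmed without calculus by maximizing $f(p)$ numerically. The sketch below uses a ternary search, which is valid here because $f$ has a single peak on $(0, 1)$; the search bounds and iteration count are illustrative assumptions.

```python
import math


def f(p):
    """Start-state value as a function of p."""
    return (2 * p - 4) / (p * (1 - p))


def ternary_search_max(func, lo, hi, iterations=200):
    """Locate the maximizer of a single-peaked function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if func(m1) < func(m2):
            lo = m1  # the peak cannot lie to the left of m1
        else:
            hi = m2  # the peak cannot lie to the right of m2
    return (lo + hi) / 2


# Stay strictly inside (0, 1), where f is defined.
p_star = ternary_search_max(f, 1e-9, 1 - 1e-9)
print(p_star, 2 - math.sqrt(2))
print(f(p_star))
```

The search should land very close to $2 - \sqrt{2}$, where the start-state value is $-(6 + 4\sqrt{2})$, roughly $-11.66$.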


### Exercise 13.2

**Q**

Generalize the box on page 199, the policy gradient theorem (13.5), the proof of the policy gradient theorem (page 325), and the steps leading to the REINFORCE update equation (13.8), so that (13.8) ends up with a factor of $\gamma^t$ and thus aligns with the general algorithm given in the pseudocode.

**A**

The policy gradient theorem (13.5) is:

$$
\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_{\pi}(s, a) \nabla \pi(a | s, \theta) \tag{13.5}
$$

The REINFORCE update equation (13.8) is:

$$
\theta_{t+1} \doteq \theta_t + \alpha G_t \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \tag{13.8}
$$

In the box on page 199, we have:

$$
\eta(s) = h(s) + \sum_{\overline{s}} \eta(\overline{s}) \sum_a \pi(a | \overline{s}) p(s | \overline{s}, a), \quad \text{for all } s \in \mathcal{S} \tag{9.2}
$$

with:

$$
\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}, \quad \text{for all } s \in \mathcal{S} \tag{9.3}
$$
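To make (9.2) and (9.3) concrete, the sketch below evaluates them for the undiscounted case on the short corridor of Example 13.1, with every episode starting in the leftmost state. Applying them to this particular environment, the state indexing and the function name are assumptions made purely for illustration.

```python
def on_policy_distribution(p, tol=1e-12, max_sweeps=1_000_000):
    """Return (eta, mu) for the short corridor under p = pi(right)."""
    # successors[s][action] is the next state; None marks the terminal state.
    # State 1 has its actions reversed, as in Example 13.1.
    successors = {
        0: {"right": 1, "left": 0},
        1: {"right": 0, "left": 2},
        2: {"right": None, "left": 1},
    }
    action_prob = {"right": p, "left": 1 - p}
    h = [1.0, 0.0, 0.0]  # episodes always start in state 0
    eta = h[:]
    for _ in range(max_sweeps):
        # (9.2): eta(s) = h(s) + sum over predecessors and actions.
        new_eta = h[:]
        for s_bar, moves in successors.items():
            for action, s in moves.items():
                if s is not None:
                    new_eta[s] += eta[s_bar] * action_prob[action]
        delta = max(abs(x - y) for x, y in zip(new_eta, eta))
        eta = new_eta
        if delta < tol:
            break
    total = sum(eta)
    mu = [x / total for x in eta]  # (9.3): normalize eta to sum to one
    return eta, mu


eta, mu = on_policy_distribution(0.5)
print(eta, mu)
```

For $p = 1/2$ the expected visit counts come out close to $\eta = (6, 4, 2)$. Their sum, 12, is the expected episode length, which matches $-V_1$ from Exercise 13.1 because every step has reward $-1$.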