# Chapter 13 - Exercises

### Exercise 13.1

**Q**

Use your knowledge of the gridworld and its dynamics to determine an exact symbolic expression for the optimal probability of selecting the right action in Example 13.1.

**A**

Let $V_1$ $(= V_{S_0})$, $V_2$, $V_3$ and $V_G$ $(= V_{terminal} = 0)$ denote the values of the states from left to right. Writing each one as a function of the other state-values and of the probability of choosing the right action, $p = \pi(right)$, we have:

$$
\begin{align*}
V_1(p) &= -1 + p V_2(p) + (1 - p) V_1(p)
\\
V_2(p) &= -1 + p V_1(p) + (1 - p) V_3(p)
\\
V_3(p) &= -1 + p V_G + (1 - p) V_2(p) = -1 + (1 - p) V_2(p)
\end{align*}
$$

The equations above are derived from the general definition of state-values:

$$
\begin{align*}
V(s) &\doteq \sum_a \pi(a | s) \sum_{r, s'} p(r, s' | s, a) [r + V(s')] 
\\
&= \sum_a \pi(a | s) \sum_{s'} p(s' | s, a) \cdot [-1 + V(s')] 
\\
&= -1 + \sum_a \pi(a | s) \sum_{s'} p(s' | s, a) V(s')
\\
&= -1 + [\pi(right) \cdot V(s_{right}')] + [\pi(left) \cdot V(s_{left}')]
\\
&= -1 + p V(s_{right}') + (1 - p) V(s_{left}')
\end{align*}
$$

(In the given environment the reward is always $-1$, so $r(s, a) = -1$, and the transitions are deterministic: given a state $s$ and an action $a$, there is exactly one possible next state $s'$, for which $p(s' | s, a) = 1$.)

The probability of choosing an action is written $\pi(right)$ instead of $\pi(right | s)$ because, under the function approximation used in the example, all states look identical to the agent. The policy can therefore only define a single probability for each action, independent of the actual state, and that probability must be chosen to maximize the return. This is the real challenge: if the agent could tell the states apart, it could pick the best action in each state with a deterministic policy, $\pi(right | s_{left}) = 1$, $\pi(left | s_{middle}) = 1$ and $\pi(right | s_{right}) = 1$.
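The dynamics described above can be simulated directly. The following is a minimal Python sketch, assuming the states are numbered 0 to 3 from left to right with 3 as the goal; the function names `run_episode` and `estimate_start_value` are illustrative choices, not anything defined in the book.

```python
import random


def run_episode(p_right, rng, max_steps=100_000):
    """Return the undiscounted return of one short-corridor episode."""
    state = 0  # non-terminal states are 0, 1, 2 (left to right); 3 is the goal
    total_reward = 0
    for _ in range(max_steps):
        if state == 3:
            break
        action_right = rng.random() < p_right  # same probability in every state
        # In the middle state the effect of the two actions is swapped.
        move_right = (not action_right) if state == 1 else action_right
        # Moving left in the leftmost state leaves the agent where it is.
        state = state + 1 if move_right else max(state - 1, 0)
        total_reward -= 1  # every transition yields a reward of -1
    return total_reward


def estimate_start_value(p_right, episodes=50_000, seed=0):
    """Monte Carlo estimate of the value of the start state."""
    rng = random.Random(seed)
    return sum(run_episode(p_right, rng) for _ in range(episodes)) / episodes
```

Averaging many episodes for a fixed `p_right` gives an estimate of $V_1(p)$ that can be compared with the values obtained from the equations above.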

Then, the best probability of choosing the right action is given by:

$$
\pi(right) = \operatorname*{argmax}_p \begin{cases}
  -1 + p V_2(p) + (1 - p) V_1(p) \\
  -1 + p V_1(p) + (1 - p) V_3(p) \\
  -1 + (1 - p) V_2(p)
\end{cases}
$$

The episodes start at the leftmost state $(S_0)$, so the objective is actually to maximize $V_1 = V_{S_0}$:

$$
\pi(right) = \operatorname*{argmax}_p V_1(p)
$$

(The value of $V_1$ depends on $V_2$, which in turn depends on $V_3$, so $p$ must be chosen to make the value of the start state as high as possible, taking into account how $p$ changes the values of the other states.)

The above definition is enough to answer this exercise, but to verify the actual probability we can simplify the equations, defining $V_1(p)$ as a function that depends only on $p$, without other state-values, and then find the value of $p$ that maximizes $V_1(p)$.

$$
\begin{align*}
V_3 &= -1 + (1 - p) V_2
\\
V_3 &= -1 + (1 - p)  [-1 + p V_1 + (1 - p) V_3]
\\
V_3 &= -1 + p - 1 + p(1 - p)V_1 + (1 - p)^2 V_3
\\
V_3 &= -1 + p - 1 + p(1 - p)V_1 + (1 - 2p + p^2) V_3
\\
V_3 &= p - 2 + p(1 - p)V_1 + [V_3 - 2pV_3 + p^2 V_3]
\\
2pV_3 - p^2 V_3 &= p(1 - p)V_1 + p - 2
\\
V_3 &= \frac{p(1 - p)V_1 + p - 2}{2p - p^2}
\end{align*}
$$

Then:

$$
\begin{align*}
V_1 &= -1 + p V_2 + (1 - p) V_1
\\
V_1 &= p V_2 - 1 + V_1 - p V_1
\\
p V_1 &= p V_2 - 1
\\
V_1 &= V_2 - \frac{1}{p}
\\
V_1 &= -1 + p V_1 + (1 - p) V_3 - \frac{1}{p}
\\
(1 - p)V_1 &= (1 - p) \frac{p(1 - p)V_1 + p - 2}{2p - p^2} - \frac{1}{p} - 1
\\
V_1 &= \frac{p(1 - p)V_1 + p - 2}{2p - p^2} - \frac{1}{p(1 - p)} - \frac{1}{1 - p}
\\
V_1 &= \frac{p(1 - p)V_1 + p - 2}{p(2 - p)} - \frac{1}{p(1 - p)} - \frac{1}{1 - p}
\\
V_1 &= \frac{(1 - p)[p(1 - p)V_1 + p - 2] - [2 - p] - p[2 - p]}{p(1 - p)(2 - p)}
\\
p(1 - p)(2 - p) V_1 &= p(1 - p)^2V_1 + p(1 - p) - 2(1 - p) - (2 - p) - (2p - p^2)
\\
p(1 - p)[(2 - p) - (1 - p)] V_1 &= p(1 - p) - 2(1 - p) - (2 - p) - (2p - p^2)
\\
p(1 - p)[2 - p - 1 + p] V_1 &= p - p^2 + 2p - 2 + p - 2 + p^2 - 2p
\\
p(1 - p) V_1 &= 2p - 4
\\
V_1 &= \frac{2p - 4}{p(1 - p)}
\end{align*}
$$
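As a check on the algebra, the three Bellman equations can be solved numerically as a linear system and compared with the closed form. This is a minimal sketch in plain Python; `solve_linear`, `state_values` and `closed_form_v1` are illustrative names, and the small Gauss-Jordan routine is only meant for this 3x3 system.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def state_values(p):
    """Return [V_1, V_2, V_3] for a given probability p of choosing right."""
    # Each row is one Bellman equation with all unknowns moved to the left side.
    a = [[p, -p, 0.0],
         [-p, 1.0, -(1.0 - p)],
         [0.0, -(1.0 - p), 1.0]]
    b = [-1.0, -1.0, -1.0]
    return solve_linear(a, b)


def closed_form_v1(p):
    """Closed-form expression for V_1(p) derived above."""
    return (2.0 * p - 4.0) / (p * (1.0 - p))
```

For any $0 < p < 1$, `state_values(p)[0]` and `closed_form_v1(p)` should agree up to floating-point error.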

Let's consider the function:

$$
f(p) = \frac{2p - 4}{p(1 - p)}
$$

To find the critical points, we need to take the derivative of $f(p)$ with respect to $p$, then set it equal to zero.

**Differentiate the numerator and the denominator using the quotient rule:**

The quotient rule states that for a function $f(p) = \frac{g(p)}{h(p)}$, the derivative is:

$$
f'(p) = \frac{g'(p)h(p) - g(p)h'(p)}{[h(p)]^2}
$$

Here, we have:

- $g(p) = 2p - 4$
- $h(p) = p(1 - p) = p - p^2$

Now, let's compute the derivatives:

- $g'(p) = 2$
- $h'(p) = 1 - 2p$

Now, applying the quotient rule:

$$
\begin{align*}
f'(p) &= \frac{2(p - p^2) - (2p - 4)(1 - 2p)}{(p - p^2)^2}
\\
&= \frac{(2p - 2p^2) - (2p - 4p^2 - 4 + 8p)}{(p - p^2)^2}
\\
&= \frac{2p - 2p^2 - 2p + 4p^2 + 4 - 8p}{(p - p^2)^2}
\\
&= \frac{2p^2 - 8p + 4}{(p - p^2)^2}
\end{align*}
$$
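The derivative can be checked numerically with a central difference. The sketch below is only a sanity check under the assumption that a step of `1e-6` is small enough for this smooth function; the helper names are illustrative.

```python
def f(p):
    """Value of the start state as a function of p."""
    return (2.0 * p - 4.0) / (p * (1.0 - p))


def f_prime(p):
    """Derivative obtained with the quotient rule."""
    return (2.0 * p ** 2 - 8.0 * p + 4.0) / (p - p ** 2) ** 2


def central_difference(func, p, h=1e-6):
    """Numerical approximation of the derivative of func at p."""
    return (func(p + h) - func(p - h)) / (2.0 * h)
```

Comparing `f_prime(p)` with `central_difference(f, p)` at a few points inside $(0, 1)$ should give matching values.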

**Set $f'(p) = 0$**

To find the critical points, we set the numerator equal to zero:

$$
\begin{align*}
2p^2 - 8p + 4 &= 0
\\
p^2 - 4p + 2 &= 0
\end{align*}
$$

Now solve this quadratic equation using the quadratic formula:

$$
p = \frac{-(-4) \pm \sqrt{(-4)^2 - 4(1)(2)}}{2(1)}
$$

$$
p = \frac{4 \pm \sqrt{16 - 8}}{2}
$$

$$
p = \frac{4 \pm \sqrt{8}}{2}
$$

$$
p = \frac{4 \pm 2\sqrt{2}}{2}
$$

$$
p = 2 \pm \sqrt{2}
$$

Thus, the two possible values of $p$ are:

$$
p = 2 + \sqrt{2} \quad \text{or} \quad p = 2 - \sqrt{2}
$$

**Analyze the critical points**

Since $p$ must be between 0 and 1 (as $p$ is a probability), the only valid solution is:

$$
p = 2 - \sqrt{2}
$$

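Because $f(p) \to -\infty$ both as $p \to 0^+$ and as $p \to 1^-$, this single critical point in $(0, 1)$ is a maximum. Thus, the value of $p$ that maximizes $f(p)$ is: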

$$
p = 2 - \sqrt{2} \approx 0.5858
$$

So, the optimal probability of selecting the right action is:

$$
\pi(right) = 2 - \sqrt{2} \approx 0.5858
$$
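The result can also be confirmed by brute force. The following sketch evaluates $V_1(p)$ on a uniform grid and returns the best grid point; the grid size and the names `start_value` and `best_probability` are arbitrary choices made for illustration.

```python
import math


def start_value(p):
    """Closed-form value of the start state for probability p of choosing right."""
    return (2.0 * p - 4.0) / (p * (1.0 - p))


def best_probability(grid_size=100_000):
    """Return the grid point in (0, 1) with the highest start-state value."""
    candidates = (i / grid_size for i in range(1, grid_size))
    return max(candidates, key=start_value)


if __name__ == "__main__":
    print(best_probability(), 2.0 - math.sqrt(2.0))
```

The grid maximizer should agree with $2 - \sqrt{2}$ up to the grid resolution.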


### Exercise 13.2

**Q**

Generalize the box on page 199, the policy gradient theorem (13.5), the proof of the policy gradient theorem (page 325), and the steps leading to the REINFORCE update equation (13.8), so that (13.8) ends up with a factor of $\gamma^t$ and thus aligns with the general algorithm given in the pseudocode.

**A**

The policy gradient theorem (13.5) is:

$$
\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_{\pi}(s, a) \nabla \pi(a | s, \theta) \tag{13.5}
$$

The REINFORCE update equation (13.8) is:

$$
\theta_{t+1} \doteq \theta_t + \alpha G_t \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \tag{13.8}
$$

In the box on page 199, we have:

$$
\eta(s) = h(s) + \sum_{\overline{s}} \eta(\overline{s}) \sum_a \pi(a | \overline{s}) p(s | \overline{s}, a), \quad \text{for all } s \in \mathcal{S} \tag{9.2}
$$

with:

$$
\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}, \quad \text{for all } s \in \mathcal{S} \tag{9.3}
$$

The proof of the policy gradient theorem (page 325) is:

$$
\begin{align*}
\nabla v_{\pi} (s) &= \nabla \left[ \sum_a \pi(a | s) q_{\pi} (s, a) \right] , \quad \text{for all } s \in \mathcal{S} \tag{Exercise 3.18}
\\
&= \sum_a \left[\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \nabla q_{\pi}(s, a) \right] \tag{product rule of calculus}
\\
&= \sum_a \left[\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \nabla \sum_{s', r} p(s', r | s, a)(r + v_{\pi}(s')) \right] \tag{Exercise 3.19 and Equation 3.2}
\\
&= \sum_a \left[\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \sum_{s'} p(s' | s, a) \nabla v_{\pi}(s') \right] \tag{Eq. 3.4}
\\
&= \sum_a \bigg[\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \sum_{s'} p(s' | s, a) \tag{unrolling}
\\
&\quad \quad \quad \sum_{a'} \bigg[\nabla \pi(a'|s') q_{\pi}(s', a') + \pi(a'|s') \sum_{s''} p(s'' | s', a') \nabla v_{\pi}(s'') \bigg] \bigg]
\\
&= \sum_{x \in \mathcal{S}} \sum_{k = 0}^{\infty} Pr(s \to x, k, \pi) \sum_a \nabla \pi(a|x) q_{\pi}(x, a)
\end{align*}
$$

after repeated unrolling, where $Pr(s \to x, k, \pi)$ is the probability of transitioning from state $s$ to state $x$ in $k$ steps under policy $\pi$. It is then immediate that

$$
\begin{align*}
\nabla J(\theta) &= \nabla v_{\pi}(s_0)
\\
&= \sum_s \left( \sum_{k=0}^{\infty} Pr(s_0 \to s, k, \pi) \right) \sum_a \nabla \pi(a | s) q_{\pi}(s, a)
\\
&= \sum_s \eta (s) \sum_a \nabla \pi(a|s) q_{\pi} (s, a) \tag{box page 199}
\\
&= \sum_{s'} \eta (s') \sum_s \frac{\eta (s)}{\sum_{s'} \eta (s')} \sum_a \nabla \pi(a|s) q_{\pi} (s, a)
\\
&= \sum_{s'} \eta (s') \sum_s \mu(s) \sum_a \nabla \pi(a|s) q_{\pi} (s, a) \tag{Eq. 9.3}
\\
&\propto \sum_s \mu(s) \sum_a \nabla \pi(a|s) q_{\pi} (s, a) \tag{Q.E.D.}
\end{align*}
$$

To include the discount factor $\gamma$, we repeat the proof with the discounted action value $q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a)(r + \gamma v_{\pi}(s'))$ (instead of $q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a)(r + v_{\pi}(s'))$):

$$
\begin{align*}
\nabla v_{\pi} (s) &= \nabla \left[ \sum_a \pi(a | s) q_{\pi} (s, a) \right] , \quad \text{for all } s \in \mathcal{S} \tag{Exercise 3.18}
\\
&= \sum_a \left[\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \nabla q_{\pi}(s, a) \right] \tag{product rule of calculus}
\\
&= \sum_a \left[\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \nabla \sum_{s', r} p(s', r | s, a)(r + \gamma v_{\pi}(s')) \right] \tag{Exercise 3.19 and Equation 3.2}
\\
&= \sum_a \left[\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \sum_{s'} p(s' | s, a) \gamma \nabla v_{\pi}(s') \right] \tag{Eq. 3.4, $\gamma$ is constant}
\\
&= \sum_a \bigg[\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \sum_{s'} p(s' | s, a) \gamma \tag{unrolling}
\\
&\quad \quad \quad \sum_{a'} \bigg[\nabla \pi(a'|s') q_{\pi}(s', a') + \pi(a'|s') \sum_{s''} p(s'' | s', a') \gamma \nabla v_{\pi}(s'') \bigg] \bigg]
\\
&= \sum_{x \in \mathcal{S}} \sum_{k = 0}^{\infty} \left[ Pr(s \to x, k, \pi) \gamma^k \right] \sum_a \nabla \pi(a|x) q_{\pi}(x, a)
\end{align*}
$$

And then, noting that $Pr(s_0 \to s, k, \pi)$ is the probability that $S_k = s$ when the episode starts in $s_0$ and follows $\pi$:

$$
\begin{align*}
\nabla J(\theta) &= \nabla v_{\pi}(s_0)
\\
&= \sum_s \left( \sum_{k=0}^{\infty} \gamma^k Pr(s_0 \to s, k, \pi) \right) \sum_a \nabla \pi(a | s, \theta) q_{\pi}(s, a)
\\
&= \sum_{k=0}^{\infty} \gamma^k \sum_s Pr(s_0 \to s, k, \pi) \sum_a \nabla \pi(a | s, \theta) q_{\pi}(s, a) \tag{exchanging the order of the sums}
\\
&= \sum_{k=0}^{\infty} \gamma^k \, \mathbb{E}_{\pi} \left[ \sum_a q_{\pi} (S_k, a) \nabla \pi(a|S_k, \theta) \right] \tag{the sum over $s$ is an expectation over $S_k$}
\\
&= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t \sum_a \pi(a|S_t, \theta) q_{\pi} (S_t, a) \frac{\nabla \pi(a|S_t, \theta)}{\pi(a|S_t, \theta)} \right] \tag{renaming $k$ to $t$}
\\
&= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t q_{\pi} (S_t, A_t) \frac{\nabla \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \right] \tag{replacing $a$ by the sample $A_t \sim \pi$}
\\
&= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \right] \tag{because $\mathbb{E}_{\pi}[G_t | S_t, A_t] = q_{\pi}(S_t, A_t)$}
\end{align*}
$$

Terms for time steps after termination are zero, so the sum effectively runs up to $T - 1$.

The quantity in parentheses in the second line is the discounted counterpart of $\eta(s)$ from the box on page 199. It satisfies (9.2) with a factor of $\gamma$ in the second term, $\eta(s) = h(s) + \gamma \sum_{\overline{s}} \eta(\overline{s}) \sum_a \pi(a | \overline{s}) p(s | \overline{s}, a)$, so the policy gradient theorem (13.5) keeps its form when $\mu$ is defined from this discounted $\eta$ through (9.3).

The expression inside the last expectation is a sum over the time steps of an episode, and the term for step $t$ is sampled by $\gamma^t G_t \frac{\nabla \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)}$. Applying one stochastic gradient-ascent step per time step gives:

$$
\theta_{t+1} \doteq \theta_t + \alpha \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \theta_t)}{\pi(A_t|S_t, \theta_t)} = \theta_t + \alpha \gamma^t G_t \nabla \ln \pi(A_t|S_t, \theta_t)
$$

as defined in the pseudocode.
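To make the role of the factor $\gamma^t$ concrete, the sketch below computes the returns $G_t$ of one episode and the scalar $\gamma^t G_t$ that multiplies $\alpha \nabla \ln \pi(A_t | S_t, \theta_t)$ in each update. It assumes that `rewards[t]` holds $R_{t+1}$; the function names are illustrative and not taken from the book's pseudocode.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where rewards[t] is R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def reinforce_step_weights(rewards, gamma):
    """Return gamma^t * G_t for every time step of the episode."""
    returns = discounted_returns(rewards, gamma)
    return [gamma ** t * g for t, g in enumerate(returns)]
```

With $\gamma = 1$ the weights reduce to the plain returns, which recovers the undiscounted update (13.8).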

### Exercise 13.3

**Q**

In Section 13.1 we considered policy parameterizations using the soft-max in action preferences (13.2) with linear action preferences (13.3). For this parameterization, prove that the eligibility vector is

$$
\nabla \ln \pi(a|s, \theta) = \textbf{x}(s, a) - \sum_b \pi(b|s, \theta) \textbf{x}(s, b)
$$

using the definitions and elementary calculus.

**A**

The soft-max in action preferences (13.2) is defined below:

$$
\pi(a|s, \theta) \doteq \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}} \tag{13.2}
$$

When the action preferences are linear, we have:

$$
h(s, a, \theta) = \theta^T \textbf{x}(s, a) \tag{13.3}
$$

The derivative of $\ln x$ is $\frac{1}{x}$ as demonstrated below:

$$
\begin{align*}
\frac{d}{dx} (\ln x) &= \operatorname*{lim}_{\Delta x \to 0} \frac{\ln{(x + \Delta x)} - \ln{(x)}}{\Delta x}
\\
&= \operatorname*{lim}_{\Delta x \to 0} \frac{\ln{(\frac{x + \Delta x}{x})}}{\Delta x} \tag{using the quotient property of logarithms}
\\
&= \operatorname*{lim}_{\Delta x \to 0} \frac{\ln{(1 + \frac{\Delta x}{x})}}{\Delta x}
\\
&= \operatorname*{lim}_{\Delta x \to 0} \frac{\frac{\Delta x}{x}}{\Delta x} \tag{because $\ln(1 + u) / u \to 1$ as $u \to 0$, with $u = \frac{\Delta x}{x}$}
\\
&= \operatorname*{lim}_{\Delta x \to 0} \frac{1}{x}
\\
&= \frac{1}{x}
\end{align*}
$$

Alternatively, let $y = \ln x$, so that $e^y = x$, and recall that $\frac{d e^x}{dx} = e^x$:

$$
\begin{align*}
\frac{d e^y}{dx} &= \frac{dx}{dx}
\\
\frac{d e^y}{dy} \frac{dy}{dx} &= 1
\\
e^y \frac{d \ln x}{dx} &= 1
\\
x \frac{d \ln x}{dx} &= 1
\\
\frac{d \ln x}{dx} &= \frac{1}{x}
\end{align*}
$$

Also, $\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) = \frac{g(x) f'(x) - f(x) g'(x)}{[g(x)]^2}$ as demonstrated below, considering $h(x) \doteq \frac{f(x)}{g(x)}$:

$$
\begin{align*}
h'(x) &\doteq \operatorname*{lim}_{\Delta x \to 0} \frac{h(x + \Delta x) - h(x)}{\Delta x}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \operatorname*{lim}_{\Delta x \to 0} \frac{\frac{f(x + \Delta x)}{g(x + \Delta x)} - \frac{f(x)}{g(x)}}{\Delta x}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \operatorname*{lim}_{\Delta x \to 0} \frac{\frac{f(x + \Delta x)g(x) - f(x)g(x + \Delta x)}{g(x) g(x + \Delta x)}}{\Delta x}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \operatorname*{lim}_{\Delta x \to 0} \frac{f(x + \Delta x)g(x) - f(x)g(x + \Delta x)}{g(x) g(x + \Delta x) \Delta x}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \operatorname*{lim}_{\Delta x \to 0} \frac{\frac{f(x + \Delta x)}{\Delta x}g(x) - f(x)\frac{g(x + \Delta x)}{\Delta x}}{g(x)g(x + \Delta x)}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \operatorname*{lim}_{\Delta x \to 0} \frac{\frac{f(x + \Delta x)}{\Delta x}g(x) - \frac{f(x)g(x)}{\Delta x} + \frac{f(x)g(x)}{\Delta x} - f(x)\frac{g(x + \Delta x)}{\Delta x}}{g(x)g(x + \Delta x)}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \operatorname*{lim}_{\Delta x \to 0} \frac{\left[ \frac{f(x + \Delta x)}{\Delta x}g(x) - \frac{f(x)}{\Delta x} g(x) \right] - \left[ f(x)\frac{g(x + \Delta x)}{\Delta x} - f(x) \frac{g(x)}{\Delta x} \right]}{g(x)g(x + \Delta x)}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \operatorname*{lim}_{\Delta x \to 0} \frac{\frac{f(x + \Delta x) - f(x)}{\Delta x}g(x) - f(x)\frac{g(x + \Delta x) - g(x)}{\Delta x}}{g(x)g(x + \Delta x)}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \operatorname*{lim}_{\Delta x \to 0} \frac{f'(x)g(x) - f(x)g'(x)}{g(x)g(x + \Delta x)}
\\
\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) &= \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}
\end{align*}
$$

Both rules extend to gradients componentwise, because each component of a gradient is an ordinary partial derivative.

Making $f(\theta) \doteq e^{h(s, a, \theta)}$ ($s$ and $a$ are constant w.r.t. $\theta$) and $g(\theta) \doteq \sum_b e^{h(s, b, \theta)}$, with:

$$
q(\theta) \doteq \frac{f(\theta)}{g(\theta)} = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}} = \pi(a|s, \theta)
$$

we have:

$$
\begin{align*}
\nabla_{\theta} f(\theta) &= \nabla_{\theta} e^{h(s, a, \theta)}
\\
&= \nabla_h e^{h(s, a, \theta)} \nabla_{\theta} h(s, a, \theta)
\\
&= e^{h(s, a, \theta)} \nabla_{\theta} [\theta^T \textbf{x}(s, a)]
\\
&= e^{h(s, a, \theta)} \textbf{x}(s, a)
\end{align*}
$$

and:

$$
\begin{align*}
\nabla_{\theta} g(\theta) &= \nabla_{\theta} \sum_b e^{h(s, b, \theta)}
\\
&= \sum_b \nabla_{\theta} e^{h(s, b, \theta)}
\\
&= \sum_b \nabla_h e^{h(s, b, \theta)} \nabla_{\theta} h(s, b, \theta)
\\
&= \sum_b e^{h(s, b, \theta)} \nabla_{\theta} [\theta^T \textbf{x}(s, b)]
\\
&= \sum_b e^{h(s, b, \theta)} \textbf{x}(s, b)
\end{align*}
$$

So:

$$
\begin{align*}
\nabla_{\theta} \ln \pi(a|s, \theta) &= \nabla_{\theta} \ln q(\theta)
\\
&= [\nabla_q \ln q(\theta)] [\nabla_{\theta} q(\theta)]
\\
&= \left[ \frac{1}{q(\theta)} \right] \left[ \nabla_{\theta} \frac{f(\theta)}{g(\theta)} \right]
\\
&= \left[ \frac{g(\theta)}{f(\theta)} \right] \left[ \frac{[\nabla_{\theta} f(\theta)] g(\theta) - f(\theta) [\nabla_{\theta} g(\theta)]}{[g(\theta)]^2} \right]
\\
&= \frac{[\nabla_{\theta} f(\theta)] g(\theta) - f(\theta) [\nabla_{\theta} g(\theta)]}{f(\theta) g(\theta)}
\\
&= \frac{\nabla_{\theta} f(\theta)}{f(\theta)} - \frac{\nabla_{\theta} g(\theta)}{g(\theta)}
\\
&= \frac{e^{h(s, a, \theta)} \textbf{x}(s, a)}{e^{h(s, a, \theta)}} - \frac{\sum_b e^{h(s, b, \theta)} \textbf{x}(s, b)}{\sum_c e^{h(s, c, \theta)}}
\\
&= \textbf{x}(s, a) - \sum_b \frac{e^{h(s, b, \theta)}}{\sum_{c} e^{h(s, c, \theta)}} \textbf{x}(s, b)
\\
&= \textbf{x}(s, a) - \sum_b \pi(b|s, \theta) \textbf{x}(s, b)
\end{align*}
$$
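The identity can be checked numerically by comparing the analytic eligibility vector with a finite-difference gradient of $\ln \pi(a|s, \theta)$. The sketch below assumes a single state whose feature vectors $\textbf{x}(s, b)$ are passed as a list with one entry per action; all names are illustrative.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def softmax_policy(theta, features):
    """Action probabilities for linear preferences h(s, b) = theta . x(s, b)."""
    prefs = [dot(theta, x) for x in features]
    top = max(prefs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(h - top) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def eligibility(theta, features, action):
    """Analytic gradient: x(s, a) minus the policy-weighted mean feature vector."""
    probs = softmax_policy(theta, features)
    mean = [sum(p * x[i] for p, x in zip(probs, features))
            for i in range(len(theta))]
    return [xa - m for xa, m in zip(features[action], mean)]


def numeric_eligibility(theta, features, action, h=1e-6):
    """Central-difference gradient of ln pi(action | s, theta)."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        log_up = math.log(softmax_policy(up, features)[action])
        log_down = math.log(softmax_policy(down, features)[action])
        grad.append((log_up - log_down) / (2.0 * h))
    return grad
```

For any parameter vector and feature set, the two gradients should agree to within the finite-difference error.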

### Exercise 13.4

**Q**

Show that for the Gaussian policy parameterization (13.19) the eligibility vector has the following two parts:

$$
\begin{align*}
\nabla \ln \pi(a | s, \theta_{\mu}) &= \frac{\nabla \pi(a | s, \theta_{\mu})}{\pi(a | s, \theta)} = \frac{1}{\sigma(s, \theta_{\sigma})^2} (a - \mu(s, \theta_{\mu})) \textbf{x}_{\mu}(s)\text{, and}
\\
\nabla \ln \pi(a | s, \theta_{\sigma}) &= \frac{\nabla \pi(a | s, \theta_{\sigma})}{\pi(a | s, \theta)} = \left( \frac{(a - \mu(s, \theta_{\mu}))^2}{\sigma(s, \theta_{\sigma})^2} - 1 \right) \textbf{x}_{\sigma}(s)
\end{align*}
$$

**A**

The Gaussian policy parameterization (13.19) is:

$$
\pi(a | s, \theta) \doteq \frac{1}{\sigma(s, \theta_{\sigma}) \sqrt{2 \pi}} \exp{\left(- \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right)} \tag{13.19}
$$

where $\mu : \mathcal{S} \times \mathbb{R}^{d'} \to \mathbb{R}$ and $\sigma : \mathcal{S} \times \mathbb{R}^{d'} \to \mathbb{R}^+$ are two parameterized function approximators.

Also:

$$
\mu(s, \theta_{\mu}) \doteq \theta_{\mu}^T \textbf{x}_{\mu}(s) \quad \text{and} \quad \sigma(s, \theta_{\sigma}) \doteq \exp{(\theta_{\sigma}^T \textbf{x}_{\sigma}(s))} \tag{13.20}
$$

and:

$$
\begin{align*}
\frac{d}{dx} \ln{f(x)} &= \left[ \frac{d}{d f(x)} \ln{f(x)} \right] \left[ \frac{d}{dx} f(x) \right] \tag{chain rule of calculus}
\\
&= \frac{1}{f(x)} \frac{d}{dx} f(x) \tag{because $\frac{d \ln x}{dx} = \frac{1}{x}$ as demonstrated in Exercise 13.3}
\\
&= \frac{\frac{d}{dx} f(x)}{f(x)} \tag{a1}
\end{align*}
$$

The same argument applies to gradients, which proves the first equality in each of the two parts of the exercise statement.

Defining $f(\theta)$ as:

$$
f(\theta) \doteq - \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \tag{a2}
$$

we have:

$$
\begin{align*}
\nabla_{\theta_{\mu}} \ln \pi(a | s, \theta) &= \frac{\nabla_{\theta_{\mu}} \pi(a | s, \theta)}{\pi(a | s, \theta)} \tag{from a1}
\\
&= \frac{\nabla_{\theta_{\mu}} \frac{1}{\sigma(s, \theta_{\sigma}) \sqrt{2 \pi}} \exp{\left(- \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right)}}{\frac{1}{\sigma(s, \theta_{\sigma}) \sqrt{2 \pi}} \exp{\left(- \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right)}}
\\
&= \frac{1}{\sigma(s, \theta_{\sigma}) \sqrt{2 \pi}} \frac{\nabla_{\theta_{\mu}} \exp{\left(- \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right)}}{\frac{1}{\sigma(s, \theta_{\sigma}) \sqrt{2 \pi}} \exp{\left(- \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right)}}
\\
&= \frac{\nabla_{\theta_{\mu}} \exp{\left(- \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right)}}{\exp{\left(- \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right)}}
\\
&= \frac{\nabla_{\theta_{\mu}} \exp{f(\theta)}}{\exp{f(\theta)}} \tag{from a2}
\\
&= \frac{\left[ \nabla_f \exp{f(\theta)} \right] \left[ \nabla_{\theta_{\mu}} f(\theta) \right]}{\exp{f(\theta)}}
\\
&= \frac{\exp{f(\theta)} \left[ \nabla_{\theta_{\mu}} f(\theta) \right]}{\exp{f(\theta)}}
\\
&= \nabla_{\theta_{\mu}} f(\theta)
\\
&= \nabla_{\theta_{\mu}} \left[ - \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right]
\\
&= - \frac{1}{2 \sigma(s, \theta_{\sigma})^2} \nabla_{\theta_{\mu}} (a - \mu(s, \theta_{\mu}))^2
\\
&= - \frac{1}{2 \sigma(s, \theta_{\sigma})^2} \cdot 2 (a - \mu(s, \theta_{\mu})) \nabla_{\theta_{\mu}} (a - \mu(s, \theta_{\mu})) \tag{chain rule of gradients}
\\
&= - \frac{a - \mu(s, \theta_{\mu})}{\sigma(s, \theta_{\sigma})^2} \nabla_{\theta_{\mu}} (a - \mu(s, \theta_{\mu}))
\\
&= - \frac{a - \mu(s, \theta_{\mu})}{\sigma(s, \theta_{\sigma})^2} \cdot (-1) \cdot \nabla_{\theta_{\mu}} \mu(s, \theta_{\mu}) \tag{chain rule of gradients}
\\
&= \frac{a - \mu(s, \theta_{\mu})}{\sigma(s, \theta_{\sigma})^2} \nabla_{\theta_{\mu}} [\theta_{\mu}^T \textbf{x}_{\mu}(s)]
\\
&= \frac{a - \mu(s, \theta_{\mu})}{\sigma(s, \theta_{\sigma})^2} \textbf{x}_{\mu}(s) \tag{first part proved}
\end{align*}
$$

For the second part we have:

$$
\begin{align*}
\nabla_{\theta_{\sigma}} \ln \pi(a | s, \theta) &= \frac{\nabla_{\theta_{\sigma}} \pi(a | s, \theta)}{\pi(a | s, \theta)} \tag{from a1}
\\
&= \frac{\nabla_{\theta_{\sigma}} \frac{1}{\sigma(s, \theta_{\sigma}) \sqrt{2 \pi}} \exp{f(\theta)}}{\frac{1}{\sigma(s, \theta_{\sigma}) \sqrt{2 \pi}} \exp{f(\theta)}}
\\
&= \frac{1}{\sqrt{2 \pi}} \frac{\nabla_{\theta_{\sigma}} \frac{1}{\sigma(s, \theta_{\sigma})} \exp{f(\theta)}}{\frac{1}{\sigma(s, \theta_{\sigma}) \sqrt{2 \pi}} \exp{f(\theta)}}
\\
&= \frac{\nabla_{\theta_{\sigma}} \frac{1}{\sigma(s, \theta_{\sigma})} \exp{f(\theta)}}{\frac{1}{\sigma(s, \theta_{\sigma})} \exp{f(\theta)}}
\\
&= \frac{\left[ \nabla_{\theta_{\sigma}} \frac{1}{\sigma(s, \theta_{\sigma})} \right] \exp{f(\theta)} + \frac{1}{\sigma(s, \theta_{\sigma})} \left[ \nabla_{\theta_{\sigma}} \exp{f(\theta)} \right]}{\frac{1}{\sigma(s, \theta_{\sigma})} \exp{f(\theta)}} \tag{product rule of calculus}
\\
&= \frac{\nabla_{\theta_{\sigma}} \frac{1}{\sigma(s, \theta_{\sigma})}}{\frac{1}{\sigma(s, \theta_{\sigma})}} + \frac{\nabla_{\theta_{\sigma}} \exp{f(\theta)}}{\exp{f(\theta)}}
\\
&= \frac{\left[ \nabla_{\sigma} \frac{1}{\sigma(s, \theta_{\sigma})} \right] \left[ \nabla_{\theta_{\sigma}} \sigma(s, \theta_{\sigma}) \right]}{\frac{1}{\sigma(s, \theta_{\sigma})}} + \frac{[\nabla_f \exp{f(\theta)}] [\nabla_{\theta_{\sigma}} f(\theta)]}{\exp{f(\theta)}} \tag{chain rule of gradients}
\\
&= \frac{\left[ - \frac{1}{\sigma(s, \theta_{\sigma})^2} \right] \left[ \nabla_{\theta_{\sigma}} \sigma(s, \theta_{\sigma}) \right]}{\frac{1}{\sigma(s, \theta_{\sigma})}} + \frac{\exp{f(\theta)} [\nabla_{\theta_{\sigma}} f(\theta)]}{\exp{f(\theta)}}
\\
&= -\frac{\nabla_{\theta_{\sigma}} \sigma(s, \theta_{\sigma})}{\sigma(s, \theta_{\sigma})} + \nabla_{\theta_{\sigma}} f(\theta)
\\
&= -\frac{\nabla_{\theta_{\sigma}} \exp{(\theta_{\sigma}^T \textbf{x}_{\sigma}(s))}}{\sigma(s, \theta_{\sigma})} + \nabla_{\theta_{\sigma}} \left[ - \frac{(a - \mu(s, \theta_{\mu}))^2}{2 \sigma(s, \theta_{\sigma})^2} \right]
\\
&= -\frac{\exp{(\theta_{\sigma}^T \textbf{x}_{\sigma}(s))} \nabla_{\theta_{\sigma}} [\theta_{\sigma}^T \textbf{x}_{\sigma}(s)]}{\sigma(s, \theta_{\sigma})} - (a - \mu(s, \theta_{\mu}))^2 \nabla_{\theta_{\sigma}} \frac{1}{2 \sigma(s, \theta_{\sigma})^2} \tag{chain rule of gradients in the left term}
\\
&= -\frac{\sigma(s, \theta_{\sigma}) \nabla_{\theta_{\sigma}} [\theta_{\sigma}^T \textbf{x}_{\sigma}(s)]}{\sigma(s, \theta_{\sigma})} - (a - \mu(s, \theta_{\mu}))^2 \frac{1}{2} \frac{-2}{\sigma(s, \theta_{\sigma})^3} \nabla_{\theta_{\sigma}} \sigma(s, \theta_{\sigma}) \tag{chain rule of gradients in the right term}
\\
&= -\textbf{x}_{\sigma}(s) + (a - \mu(s, \theta_{\mu}))^2 \frac{1}{\sigma(s, \theta_{\sigma})^3} \nabla_{\theta_{\sigma}} \exp{(\theta_{\sigma}^T \textbf{x}_{\sigma}(s))}
\\
&= -\textbf{x}_{\sigma}(s) + (a - \mu(s, \theta_{\mu}))^2 \frac{1}{\sigma(s, \theta_{\sigma})^3} \exp{(\theta_{\sigma}^T \textbf{x}_{\sigma}(s))} \nabla_{\theta_{\sigma}} [\theta_{\sigma}^T \textbf{x}_{\sigma}(s)]
\\
&= -\textbf{x}_{\sigma}(s) + (a - \mu(s, \theta_{\mu}))^2 \frac{1}{\sigma(s, \theta_{\sigma})^3} \sigma(s, \theta_{\sigma}) \nabla_{\theta_{\sigma}} [\theta_{\sigma}^T \textbf{x}_{\sigma}(s)]
\\
&= -\textbf{x}_{\sigma}(s) + (a - \mu(s, \theta_{\mu}))^2 \frac{1}{\sigma(s, \theta_{\sigma})^2} \nabla_{\theta_{\sigma}} [\theta_{\sigma}^T \textbf{x}_{\sigma}(s)]
\\
&= -\textbf{x}_{\sigma}(s) + \frac{(a - \mu(s, \theta_{\mu}))^2}{\sigma(s, \theta_{\sigma})^2} \textbf{x}_{\sigma}(s)
\\
&= \frac{(a - \mu(s, \theta_{\mu}))^2}{\sigma(s, \theta_{\sigma})^2} \textbf{x}_{\sigma}(s) - \textbf{x}_{\sigma}(s)
\\
&= \left[ \frac{(a - \mu(s, \theta_{\mu}))^2}{\sigma(s, \theta_{\sigma})^2} - 1 \right] \textbf{x}_{\sigma}(s) \tag{second part proved}
\end{align*}
$$
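Both parts can be checked numerically against finite differences of the log-density. The sketch below assumes the linear mean and exponential standard deviation defined above; the argument names (`theta_mu`, `x_sigma`, and so on) and the helper `numeric_gradient` are illustrative.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def log_policy(a, theta_mu, theta_sigma, x_mu, x_sigma):
    """Log-density of the Gaussian policy at action a."""
    mu = dot(theta_mu, x_mu)
    sigma = math.exp(dot(theta_sigma, x_sigma))
    return (-math.log(sigma * math.sqrt(2.0 * math.pi))
            - (a - mu) ** 2 / (2.0 * sigma ** 2))


def eligibility_mu(a, theta_mu, theta_sigma, x_mu, x_sigma):
    """First part: gradient with respect to theta_mu."""
    mu = dot(theta_mu, x_mu)
    sigma = math.exp(dot(theta_sigma, x_sigma))
    return [(a - mu) / sigma ** 2 * xi for xi in x_mu]


def eligibility_sigma(a, theta_mu, theta_sigma, x_mu, x_sigma):
    """Second part: gradient with respect to theta_sigma."""
    mu = dot(theta_mu, x_mu)
    sigma = math.exp(dot(theta_sigma, x_sigma))
    return [((a - mu) ** 2 / sigma ** 2 - 1.0) * xi for xi in x_sigma]


def numeric_gradient(func, params, h=1e-6):
    """Central-difference gradient of func with respect to a list of parameters."""
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grad.append((func(up) - func(down)) / (2.0 * h))
    return grad
```

Calling `numeric_gradient` on `log_policy` with one parameter vector varied and the other held fixed should reproduce `eligibility_mu` and `eligibility_sigma` respectively.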

### Exercise 13.5

**Q**

A *Bernoulli-logistic unit* is a stochastic neuron-like unit used in some ANNs (Section 9.6). Its input at time $t$ is a feature vector $\textbf{x}(S_t)$; its output, $A_t$, is a random variable having two values, 0 and 1, with $Pr\{A_t = 1\} = P_t$ and $Pr\{A_t = 0\} = 1 - P_t$ (the Bernoulli distribution). Let $h(s, 0, \theta)$ and $h(s, 1, \theta)$ be the preferences in state $s$ for the unit’s two actions given policy parameter $\theta$. Assume that the difference between the action preferences is given by a weighted sum of the unit’s input vector, that is, assume that $h(s, 1, \theta) - h(s, 0, \theta) = \theta^T \textbf{x}(s)$, where $\theta$ is the unit’s weight vector.

(a) Show that if the exponential soft-max distribution (13.2) is used to convert action preferences to policies, then $P_t = \pi(1|S_t, \theta_t) = 1/(1 + \exp(-\theta_t^T \textbf{x}(S_t)))$ (the logistic function).

(b) What is the Monte-Carlo REINFORCE update of $\theta_t$ to $\theta_{t+1}$ upon receipt of return $G_t$?

(c) Express the eligibility $\nabla \ln \pi(a | s, \theta)$ for a Bernoulli-logistic unit, in terms of $a$, $\textbf{x}(s)$, and $\pi(a | s, \theta)$ by calculating the gradient.

Hint for part (c): Define $P = \pi(1 | s, \theta)$ and compute the derivative of the logarithm, for each action, using the chain rule on $P$. Combine the two results into one expression that depends on $a$ and $P$, and then use the chain rule again, this time on $\theta^T \textbf{x}(s)$, noting that the derivative of the logistic function $f(x) = 1/(1 + e^{-x})$ is $f(x)(1 - f(x))$.

**A**

(a)

The soft-max in action preferences (13.2) is defined below:

$$
\pi(a|s, \theta) \doteq \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}} \tag{13.2}
$$

Also, $h(s, 1, \theta) - h(s, 0, \theta) = \theta^T \textbf{x}(s)$ means that:

$$
h(s, 0, \theta) = h(s, 1, \theta) - \theta^T \textbf{x}(s)
$$

We have:

$$
\begin{align*}
P_t &= \pi(1|S_t, \theta_t) 
\\
&= \frac{e^{h(S_t, 1, \theta_t)}}{\sum_b e^{h(S_t, b, \theta_t)}}
\\
&= \frac{e^{h(S_t, 1, \theta_t)}}{e^{h(S_t, 1, \theta_t)} + e^{h(S_t, 0, \theta_t)}}
\\
&= \frac{e^{h(S_t, 1, \theta_t)}}{e^{h(S_t, 1, \theta_t)} + e^{h(S_t, 1, \theta_t) - \theta_t^T \textbf{x}(S_t)}}
\\
&= \frac{e^{h(S_t, 1, \theta_t)}}{e^{h(S_t, 1, \theta_t)} + \frac{e^{h(S_t, 1, \theta_t)}}{e^{\theta_t^T \textbf{x}(S_t)}}}
\\
&= \frac{e^{h(S_t, 1, \theta_t)}}{e^{h(S_t, 1, \theta_t)} \left(1 + \frac{1}{e^{\theta_t^T \textbf{x}(S_t)}} \right)}
\\
&= \frac{1}{1 + \frac{1}{e^{\theta_t^T \textbf{x}(S_t)}}}
\\
&= \frac{1}{1 + e^{- \theta_t^T \textbf{x}(S_t)}}
\\
&= \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}
\end{align*}
$$
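A quick numerical check of part (a): for two actions, the soft-max probability of action 1 depends only on the difference of the two preferences and equals the logistic function of that difference. The sketch below uses illustrative names and arbitrary preference values.

```python
import math


def softmax_prob_of_one(h0, h1):
    """Soft-max probability of action 1 given the two action preferences."""
    top = max(h0, h1)  # subtracting the maximum avoids overflow in exp
    e0, e1 = math.exp(h0 - top), math.exp(h1 - top)
    return e1 / (e0 + e1)


def logistic(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))
```

Shifting both preferences by the same constant leaves `softmax_prob_of_one` unchanged, which is why only $\theta^T \textbf{x}(s)$ matters.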

(b)

The Monte-Carlo REINFORCE update for the general case is:

$$
\theta_{t+1} \doteq \theta_t + \alpha G_t \frac{\nabla \pi(A_t | S_t, \theta_t)}{\pi(A_t | S_t, \theta_t)} \tag{13.8}
$$

We also have, taking into account that $\pi(0 | S_t, \theta_t) = 1 - \pi(1 | S_t, \theta_t)$ and the result of (a):

$$
\pi(0|S_t, \theta_t) \doteq 1 - \pi(1|S_t, \theta_t) = 1 - \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}} = \frac{1 + \exp{(- \theta_t^T \textbf{x}(S_t))} - 1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}} = \frac{\exp{(- \theta_t^T \textbf{x}(S_t))}}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}
$$

If $A_t = 1$, then:

$$
\begin{align*}
\theta_{t+1} &\doteq \theta_t + \alpha G_t \frac{\nabla \pi(1 | S_t, \theta_t)}{\pi(1 | S_t, \theta_t)}
\\
&= \theta_t + \alpha G_t \frac{\nabla_{\theta} \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}}{\frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}} \tag{according to (a)}
\\
&= \theta_t + \alpha G_t [1 + \exp{(- \theta_t^T \textbf{x}(S_t))}] \nabla_{\theta} \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}
\\
&= \theta_t - \alpha G_t [1 + \exp{(- \theta_t^T \textbf{x}(S_t))}] \frac{1}{[1 + \exp{(- \theta_t^T \textbf{x}(S_t))}]^2} \nabla_{\theta} [1 + \exp{(- \theta_t^T \textbf{x}(S_t))}]
\\
&= \theta_t - \alpha G_t \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}} \nabla_{\theta} \exp{(- \theta_t^T \textbf{x}(S_t))}
\\
&= \theta_t - \alpha G_t \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}} \exp{(- \theta_t^T \textbf{x}(S_t))} \nabla_{\theta} (- \theta_t^T \textbf{x}(S_t))
\\
&= \theta_t + \alpha G_t \frac{\exp{(- \theta_t^T \textbf{x}(S_t))}}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}} \nabla_{\theta} \theta_t^T \textbf{x}(S_t)
\\
&= \theta_t + \alpha G_t \frac{\exp{(- \theta_t^T \textbf{x}(S_t))}}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}} \textbf{x}(S_t)
\\
&= \theta_t + \alpha G_t \pi(0 | S_t, \theta_t) \textbf{x}(S_t)
\\
&= \theta_t + \alpha G_t [1 - \pi(1 | S_t, \theta_t)] \textbf{x}(S_t)
\end{align*}
$$

For $A_t = 0$, we have:

$$
\begin{align*}
\theta_{t+1} &\doteq \theta_t + \alpha G_t \frac{\nabla \pi(0 | S_t, \theta_t)}{\pi(0 | S_t, \theta_t)}
\\
&= \theta_t + \alpha G_t \frac{\nabla (1 - \pi(1 | S_t, \theta_t))}{1 - \pi(1 | S_t, \theta_t)} \tag{$\pi(0 | S_t, \theta_t) = 1 - \pi(1 | S_t, \theta_t)$}
\\
&= \theta_t - \alpha G_t \frac{\nabla \pi(1 | S_t, \theta_t)}{1 - \pi(1 | S_t, \theta_t)}
\\
&= \theta_t - \alpha G_t \frac{\nabla_{\theta} \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}}{1 - \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}} \tag{according to (a)}
\\
&= \theta_t - \alpha G_t \frac{\nabla_{\theta} \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}}{\frac{1 + \exp{(- \theta_t^T \textbf{x}(S_t))} - 1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}}
\\
&= \theta_t - \alpha G_t \frac{\nabla_{\theta} \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}}{\frac{\exp{(- \theta_t^T \textbf{x}(S_t))}}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}}
\\
&= \theta_t - \alpha G_t \frac{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}{\exp{(- \theta_t^T \textbf{x}(S_t))}} \nabla_{\theta} \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}
\\
&= \theta_t + \alpha G_t \frac{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}}{\exp{(- \theta_t^T \textbf{x}(S_t))}} \frac{1}{[1 + \exp{(- \theta_t^T \textbf{x}(S_t))}]^2} \nabla_{\theta} [1 + \exp{(- \theta_t^T \textbf{x}(S_t))}]
\\
&= \theta_t + \alpha G_t \frac{1}{\exp{(- \theta_t^T \textbf{x}(S_t))} [1 + \exp{(- \theta_t^T \textbf{x}(S_t))}]} \nabla_{\theta} \exp{(- \theta_t^T \textbf{x}(S_t))}
\\
&= \theta_t + \alpha G_t \frac{1}{\exp{(- \theta_t^T \textbf{x}(S_t))} [1 + \exp{(- \theta_t^T \textbf{x}(S_t))}]} \exp{(- \theta_t^T \textbf{x}(S_t))} \nabla_{\theta} (- \theta_t^T \textbf{x}(S_t))
\\
&= \theta_t - \alpha G_t \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}} \nabla_{\theta} \theta_t^T \textbf{x}(S_t)
\\
&= \theta_t - \alpha G_t \frac{1}{1 + \exp{(- \theta_t^T \textbf{x}(S_t))}} \textbf{x}(S_t)
\\
&= \theta_t - \alpha G_t \pi(1 | S_t, \theta_t) \textbf{x}(S_t)
\\
&= \theta_t - \alpha G_t [1 - \pi(0 | S_t, \theta_t)] \textbf{x}(S_t)
\end{align*}
$$

More generally, we can represent the update as:

$$
\theta_{t+1} = \theta_t + \alpha G_t [2 A_t - 1] [1 - \pi(A_t | S_t, \theta_t)] \textbf{x}(S_t)
$$

where $2 A_t - 1$ is a term added to have the value $1$ when $A_t = 1$ and $-1$ when $A_t = 0$.
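The combined update can be written as a small function. This is a minimal sketch in plain Python, assuming $\theta$ and $\textbf{x}(S_t)$ are given as lists of floats; `reinforce_update` and its argument names are illustrative.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_update(theta, x, action, g, alpha):
    """One REINFORCE step for a Bernoulli-logistic unit (action is 0 or 1)."""
    p_one = logistic(dot(theta, x))  # pi(1 | S_t, theta_t)
    prob_taken = p_one if action == 1 else 1.0 - p_one
    # The factor (2 * action - 1) is +1 for action 1 and -1 for action 0.
    scale = alpha * g * (2 * action - 1) * (1.0 - prob_taken)
    return [th + scale * xi for th, xi in zip(theta, x)]
```

Since $[2 A_t - 1][1 - \pi(A_t | S_t, \theta_t)]$ equals $1 - P_t$ for $A_t = 1$ and $-P_t$ for $A_t = 0$, the same step can be written more compactly with the factor $A_t - P_t$.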

(c)

Let's define:

$$
P \doteq \pi(1 | s, \theta)
$$

Consequently:

$$
\pi(0 | s, \theta) \doteq 1 - P
$$

Also, for $f(x) = \frac{1}{1 + e^{-x}}$:

$$
\begin{align*}
f'(x) &= - \frac{1}{[1 + e^{-x}]^2} \frac{d}{dx} [1 + e^{-x}] 
\\
&= - \frac{1}{[1 + e^{-x}]^2} \frac{d}{dx} e^{-x} 
\\
&= - \frac{1}{[1 + e^{-x}]^2} e^{-x} \frac{d}{dx} (-x) 
\\
&= \frac{1}{[1 + e^{-x}]^2} e^{-x} \frac{dx}{dx} 
\\
&= \frac{e^{-x}}{[1 + e^{-x}]^2}
\\
&= \frac{1}{1 + e^{-x}} \frac{e^{-x}}{1 + e^{-x}}
\\
&= f(x) \frac{1 + e^{-x} - 1}{1 + e^{-x}}
\\
&= f(x) \left( \frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}} \right)
\\
&= f(x) \left( 1 - f(x) \right) \tag{a1}
\end{align*}
$$

For $a = 1$, we have:

$$
\begin{align*}
\nabla_{\theta} \ln \pi(a | s, \theta) &= \nabla_{\theta} \ln \pi(1 | s, \theta)
\\
&= \nabla_{\theta} \ln P
\\
&= \frac{1}{P} \nabla_{\theta} P \tag{because $\frac{d \ln x}{dx} = \frac{1}{x}$ as demonstrated in Exercise 13.3}
\\
&= \frac{1}{\pi(1 | s, \theta)} \nabla_{\theta} P
\end{align*}
$$

For $a = 0$, we have:

$$
\begin{align*}
\nabla_{\theta} \ln \pi(a | s, \theta) &= \nabla_{\theta} \ln \pi(0 | s, \theta)
\\
&= \nabla_{\theta} \ln [1 - P]
\\
&= \frac{1}{1 - P} \nabla_{\theta} [1 - P]
\\
&= - \frac{1}{\pi(0 | s, \theta)} \nabla_{\theta} P
\end{align*}
$$

So, for any $a$ (either 0 or 1):

$$
\nabla_{\theta} \ln \pi(a | s, \theta) = [2a - 1] \frac{1}{\pi(a | s, \theta)} \nabla_{\theta} P
$$

where $2a - 1$ is a term added to have the value $1$ when $a = 1$ and $-1$ when $a = 0$.

Expanding $\nabla_{\theta} P$ to remove the remaining gradient:

$$
\begin{align*}
\nabla_{\theta} \ln \pi(a | s, \theta) &= [2a - 1] \frac{1}{\pi(a | s, \theta)} \nabla_{\theta} P
\\
&= [2a - 1] \frac{1}{\pi(a | s, \theta)} \nabla_{\theta} \pi(1 | s, \theta)
\\
&= [2a - 1] \frac{1}{\pi(a | s, \theta)} \nabla_{\theta} \frac{1}{1 + \exp{(- \theta^T \textbf{x}(s))}} \tag{according to (a)}
\\
&= [2a - 1] \frac{1}{\pi(a | s, \theta)} \left[ \frac{1}{1 + \exp{(- \theta^T \textbf{x}(s))}} \right] \left[1 - \frac{1}{1 + \exp{(- \theta^T \textbf{x}(s))}} \right] \nabla_{\theta} [\theta^T \textbf{x}(s)]
\\
\tag{according to a1 and the chain rule}
\\
&= [2a - 1] \frac{1}{\pi(a | s, \theta)} [\pi(1 | s, \theta)] [1 - \pi(1 | s, \theta)] \textbf{x}(s)
\\
&= [2a - 1] \frac{\pi(1 | s, \theta) \pi(0 | s, \theta)}{\pi(a | s, \theta)} \textbf{x}(s)
\\
&= [2a - 1] \frac{\pi(a | s, \theta) [1 - \pi(a | s, \theta)]}{\pi(a | s, \theta)} \textbf{x}(s)
\\
\tag{$\pi(a | s, \theta)$ is either $\pi(1 | s, \theta)$ or $\pi(0 | s, \theta)$; $1 - \pi(a | s, \theta)$ is the other}
\\
&= [2a - 1] [1 - \pi(a | s, \theta)] \textbf{x}(s)
\end{align*}
$$

Note that this is the same eligibility that appears in the update in (b):

$$
\theta_{t+1} = \theta_t + \alpha G_t [2 A_t - 1] [1 - \pi(A_t | S_t, \theta_t)] \textbf{x}(S_t) = \theta_t + \alpha G_t \nabla_{\theta} \ln \pi(A_t | S_t, \theta_t)
$$
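The eligibility from part (c) can be checked against a finite-difference gradient of $\ln \pi(a | s, \theta)$. The sketch below is a sanity check under the assumption that $\theta$ and $\textbf{x}(s)$ are plain lists of floats; the function names are illustrative.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_policy(theta, x, action):
    """ln pi(action | s, theta) for a Bernoulli-logistic unit."""
    p_one = logistic(dot(theta, x))
    return math.log(p_one if action == 1 else 1.0 - p_one)


def eligibility(theta, x, action):
    """Analytic eligibility: (2a - 1) * (1 - pi(a | s, theta)) * x(s)."""
    prob_taken = math.exp(log_policy(theta, x, action))
    return [(2 * action - 1) * (1.0 - prob_taken) * xi for xi in x]


def numeric_eligibility(theta, x, action, h=1e-6):
    """Central-difference gradient of ln pi(action | s, theta)."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        diff = log_policy(up, x, action) - log_policy(down, x, action)
        grad.append(diff / (2.0 * h))
    return grad
```

For both actions, the analytic and numerical vectors should agree to within the finite-difference error.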