### Policy Gradient Theorem Derivation
The goal is to maximize the expected discounted return:
$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{\infty} \gamma^t\,r(s_t, a_t)\right]
$$

Written as an integral over trajectories, this is:
$$
J(\theta) = \int p_\theta(\tau)\left[\sum_{t=0}^{\infty} \gamma^t\,r(s_t,a_t)\right]\,d\tau.
$$

Taking the gradient with respect to $\theta$:
$$
\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\left[\sum_{t=0}^{\infty}\gamma^t\,r(s_t,a_t)\right]d\tau.
$$
Assuming we can interchange the gradient and the integral:
$$
\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\left[\sum_{t=0}^{\infty}\gamma^t\,r(s_t,a_t)\right]d\tau.
$$

Applying the log-derivative trick, $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)$, gives:
$$
\nabla_\theta J(\theta) = \int p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) \left[\sum_{t=0}^{\infty}\gamma^t\,r(s_t,a_t)\right]d\tau.
$$
In expectation notation:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau) \,\sum_{t=0}^{\infty}\gamma^t\,r(s_t,a_t)\right].
$$
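
The log-derivative identity can be checked numerically on a toy problem. The sketch below is only an illustration: it assumes a three-outcome softmax distribution standing in for $p_\theta(\tau)$ and a fixed return per outcome (the function names and the numbers are hypothetical), and it compares the score-function form of the gradient with a finite-difference reference.

```python
import math

def softmax(theta):
    """Categorical probabilities p_theta(x) from logits theta."""
    m = max(theta)
    exps = [math.exp(v - m) for v in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, returns):
    """J(theta) = sum_x p_theta(x) * R(x), computed exactly by enumeration."""
    return sum(p * r for p, r in zip(softmax(theta), returns))

def score_function_gradient(theta, returns):
    """E_x[grad log p_theta(x) * R(x)], computed exactly by enumeration.

    For softmax logits, d/d theta_i log p_theta(x) = 1[x == i] - p_i.
    """
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for x, (px, rx) in enumerate(zip(p, returns)):
        for i in range(len(theta)):
            score = (1.0 if i == x else 0.0) - p[i]
            grad[i] += px * score * rx
    return grad

def finite_difference_gradient(theta, returns, eps=1e-6):
    """Central finite differences of J(theta), used as a reference."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        diff = expected_return(up, returns) - expected_return(down, returns)
        grad.append(diff / (2 * eps))
    return grad

theta = [0.2, -0.1, 0.5]    # arbitrary logits (assumed values)
returns = [1.0, 0.0, 2.0]   # assumed return of each outcome
g_score = score_function_gradient(theta, returns)
g_fd = finite_difference_gradient(theta, returns)
```

The two gradients agree up to finite-difference error, which is the content of the identity: the gradient of an expectation can be written as an expectation under the same distribution.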


Since the trajectory probability factorizes as
$$
p_\theta(\tau) = p(s_0)\prod_{t=0}^{\infty} \left[\pi_\theta(a_t \mid s_t) \,P(s_{t+1} \mid s_t,a_t)\right],
$$
its logarithm is
$$
\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{\infty}\Bigl[\log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t,a_t)\Bigr].
$$
Only the terms $\log \pi_\theta(a_t \mid s_t)$ depend on $\theta$. Therefore,
$$
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{\infty}\nabla_\theta \log \pi_\theta(a_t \mid s_t).
$$
Substitute back into our gradient expression:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\left(\sum_{t=0}^{\infty}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=0}^{\infty}\gamma^t\,r(s_t,a_t)\right)\right].
$$

For each time step $t$, decompose the total return as:
$$
\sum_{m=0}^{\infty}\gamma^m\,r(s_m,a_m)
=\underbrace{\sum_{m=0}^{t-1}\gamma^m\,r(s_m,a_m)}_{\text{past (independent of \(a_t\))}}
+\underbrace{\sum_{m=t}^{\infty}\gamma^m\,r(s_m,a_m)}_{\text{future (dependent on \(a_t\))}}.
$$
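The past portion does not depend on $a_t$, and $\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\right] = 0$, so the product of the past rewards with $\nabla_\theta \log \pi_\theta(a_t\mid s_t)$ vanishes in expectation. Hence, we have: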
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{\infty}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(\sum_{m=t}^{\infty}\gamma^m\,r(s_m,a_m)\right)\right].
$$
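
The inner sum $\sum_{m=t}^{\infty}\gamma^m\,r(s_m,a_m)$ can be computed for every $t$ in one backward pass. A minimal sketch, assuming the episode is truncated to a finite list of rewards (the helper name `discounted_rewards_to_go` is hypothetical):

```python
def discounted_rewards_to_go(rewards, gamma):
    """Return [sum_{m >= t} gamma^m * r_m for t = 0..T-1] for a finite episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each tail sum reuses the one computed just after it.
    for t in reversed(range(len(rewards))):
        running = (gamma ** t) * rewards[t] + running
        out[t] = running
    return out

tails = discounted_rewards_to_go([1.0, 2.0, 3.0], gamma=0.5)
# tails == [2.75, 1.75, 0.75]
```

Note that the discount exponent is the absolute time index $m$, matching the expression above, so later entries are scaled down relative to a return that restarts discounting at step $t$.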

The Q-function is defined as:
$$
Q^\pi(s_t,a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^k\,r(s_{t+k},a_{t+k}) \,\Bigm|\, s_t,a_t\right].
$$
Since $\sum_{m=t}^{\infty}\gamma^m\,r(s_m,a_m) = \gamma^t\sum_{k=0}^{\infty}\gamma^k\,r(s_{t+k},a_{t+k})$, the tail sum, conditioned on $(s_t,a_t)$, is an unbiased sample of $\gamma^t\,Q^\pi(s_t,a_t)$. Therefore,
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{\infty}\gamma^t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,Q^\pi(s_t,a_t)\right].
$$


Each trajectory $\tau$ consists of a sequence of state–action pairs $(s_0,a_0), (s_1,a_1), \ldots$. The $\gamma^t$-weighted sum over time steps is an integral against the **discounted occupancy measure** $\mu_\pi(s,a) = \sum_{t=0}^{\infty}\gamma^t\,\Pr(s_t = s, a_t = a \mid \pi_\theta)$, which weights each state–action pair by how often, and how early, $\pi_\theta$ visits it. Normalizing $\mu_\pi$ to a probability distribution multiplies it by $(1-\gamma)$, which only rescales the gradient by a constant. With $\mathbb{E}_{(s,a) \sim \mu_\pi}$ denoting integration against $\mu_\pi$, we can write:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim \mu_\pi}\left[\nabla_\theta \log \pi_\theta(a \mid s)\,Q^\pi(s,a)\right].
$$

---

**Final Policy Gradient Theorem**

$$
\boxed{
\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim \mu_\pi}\left[\nabla_\theta \log \pi_\theta(a \mid s)\,Q^\pi(s,a)\right].
}
$$
A common variance-reduced version uses the advantage function $A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)$:
$$
\boxed{
\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim \mu_\pi}\left[\nabla_\theta \log \pi_\theta(a \mid s)\,A^\pi(s,a)\right].
}
$$
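
The advantage form leaves the gradient unchanged because a state-dependent baseline contributes nothing in expectation. A short check, assuming a discrete action space (for continuous actions the sum becomes an integral):

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\,V^\pi(s)\right]
= V^\pi(s)\sum_a \pi_\theta(a \mid s)\,\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
= V^\pi(s)\,\nabla_\theta \sum_a \pi_\theta(a \mid s)
= V^\pi(s)\,\nabla_\theta 1 = 0.
$$

Subtracting $V^\pi(s)$ from $Q^\pi(s,a)$ therefore changes only the variance of a sample-based estimate, not its mean.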


---

### REINFORCE
**Objective:**

The expected return is given by

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1}\gamma^t\,r(s_t,a_t)\right],
$$

where $\tau = (s_0,a_0,\dots,s_{T-1},a_{T-1})$ is a complete episode.

**Policy Gradient Theorem (using Monte Carlo return):**

The gradient of this discounted objective is

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t\,\nabla_\theta \log \pi_\theta(a_t\mid s_t) \, G_t\right],
$$

with the return

$$
G_t = \sum_{k=t}^{T-1}\gamma^{k-t}\,r(s_k,a_k).
$$

**Update Rule:**

For each episode, update the parameters with a single-sample estimate of this gradient. The $\gamma^t$ weight is commonly dropped in practice, which gives

$$
\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t\mid s_t).
$$
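
The update rule above can be sketched for a tabular softmax policy. This is a minimal illustration under stated assumptions, not a reference implementation: the logit table `theta[s][a]`, the episode format `(state, action, reward)`, and all function names are hypothetical choices made for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def returns_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def reinforce_update(theta, episode, alpha, gamma):
    """One REINFORCE step for a tabular softmax policy.

    theta[s][a] are logits; episode is a list of (state, action, reward).
    The gradient is accumulated at the old parameters and a new table is returned.
    """
    rewards = [r for _, _, r in episode]
    returns = returns_to_go(rewards, gamma)
    grad = [[0.0] * len(row) for row in theta]
    for (s, a, _), g_t in zip(episode, returns):
        pi = softmax(theta[s])
        for i in range(len(pi)):
            # d/d theta[s][i] log pi(a | s) = 1[i == a] - pi[i]
            grad[s][i] += g_t * ((1.0 if i == a else 0.0) - pi[i])
    return [[w + alpha * d for w, d in zip(row, g_row)]
            for row, g_row in zip(theta, grad)]

theta = [[0.0, 0.0]]                       # one state, two actions
new_theta = reinforce_update(theta, [(0, 1, 1.0)], alpha=0.5, gamma=0.9)
# The rewarded action's logit rises and the other falls by the same amount.
```

In this one-step example the return is $G_0 = 1$ and the policy is uniform, so the logits move from $(0, 0)$ to $(-0.25, 0.25)$.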

### Actor Critic (A2C)
**Architecture:**

We have an actor (the policy $\pi_\theta$) and a critic (a value function $V_\phi$). The critic estimates the state value

$$
V_\phi(s) \approx V^\pi(s) = \mathbb{E}\left[G_t \mid s_t = s\right].
$$

**Advantage Estimate:**

A common choice is the one-step temporal difference (TD) error, computed with the critic in place of the unknown $V^\pi$:

$$
A_t = r(s_t,a_t) + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).
$$

**Loss Functions:**

- **Actor Loss:**

$$
L_{\text{actor}}(\theta) = -\mathbb{E}\left[\log \pi_\theta(a_t\mid s_t)\,A_t\right].
$$

- **Critic Loss:**

$$
L_{\text{critic}}(\phi) = \frac{1}{2}\mathbb{E}\left[\left(r(s_t,a_t) + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2\right].
$$

**Update Rules:**

- **Actor Update:**

$$
\theta \leftarrow \theta + \alpha\,\nabla_\theta \log \pi_\theta(a_t\mid s_t) \, A_t.
$$

- **Critic Update:**

$$
\phi \leftarrow \phi - \beta\,\nabla_\phi L_{\text{critic}}(\phi).
$$
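
The two updates can be sketched together for a single transition. The example below assumes a tabular critic `V[s]` and tabular softmax logits `theta[s][a]`; the `done` flag, which cuts the bootstrap term at terminal states, and all names are assumptions added for illustration. The TD target is treated as a constant when differentiating the critic loss (a semi-gradient step).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def a2c_step(theta, V, s, a, r, s_next, done, alpha, beta, gamma):
    """One-step actor-critic update on a single transition (updates in place)."""
    # The TD error doubles as the advantage estimate A_t.
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]
    # Actor: theta <- theta + alpha * grad log pi(a | s) * A_t,
    # where d/d theta[s][i] log pi(a | s) = 1[i == a] - pi[i].
    pi = softmax(theta[s])
    for i in range(len(pi)):
        theta[s][i] += alpha * delta * ((1.0 if i == a else 0.0) - pi[i])
    # Critic: with the target held fixed, -dL/dV[s] = delta.
    V[s] += beta * delta
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]   # two states, two actions
V = [0.0, 2.0]
delta = a2c_step(theta, V, s=0, a=0, r=1.0, s_next=1, done=False,
                 alpha=0.1, beta=0.5, gamma=0.5)
```

Here the TD error is $1 + 0.5 \cdot 2 - 0 = 2$, so the chosen action's logit rises by $0.1$ and the critic's estimate for state 0 moves from $0$ to $1$.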


### Natural Policy Gradient (NPG)
**Key Idea:**

The standard gradient is preconditioned by the inverse of the Fisher information matrix $F(\theta)$ to obtain the natural gradient:

$$
\Delta \theta = F(\theta)^{-1} \, \nabla_\theta J(\theta),
$$

where

$$
F(\theta) = \mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a\mid s) \, \nabla_\theta \log \pi_\theta(a\mid s)^\top\right].
$$

**Update Rule:**

$$
\theta \leftarrow \theta + \alpha \, \Delta \theta = \theta + \alpha \, F(\theta)^{-1} \, \nabla_\theta J(\theta).
$$
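
A one-parameter example makes the effect of the preconditioner visible. The sketch below assumes a single state with two actions and $\pi_\theta(a{=}1) = \sigma(\theta)$; the rewards `r0`, `r1` and the function names are hypothetical. In this setting the Fisher information is the scalar $p(1-p)$ with $p = \sigma(\theta)$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vanilla_and_natural_gradient(theta, r0, r1):
    """Gradients for a two-action, single-state problem with pi(a=1) = sigmoid(theta).

    J(theta) = p * r1 + (1 - p) * r0, so grad J = p (1 - p) (r1 - r0).
    The score is d/dtheta log pi(a) = a - p, hence F = E[(a - p)^2] = p (1 - p).
    """
    p = sigmoid(theta)
    grad = p * (1.0 - p) * (r1 - r0)
    fisher = p * (1.0 - p)
    return grad, grad / fisher

g_mid, ng_mid = vanilla_and_natural_gradient(0.0, r0=0.0, r1=1.0)
g_sat, ng_sat = vanilla_and_natural_gradient(4.0, r0=0.0, r1=1.0)
```

The vanilla gradient shrinks as the policy saturates (`g_sat` is much smaller than `g_mid`), while the natural gradient stays at $r_1 - r_0$ for every $\theta$, which illustrates why the natural gradient is less sensitive to the parameterization.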

### Trust Region Policy Optimization (TRPO)
The goal of reinforcement learning is to maximize the expected return:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]=\eta(\pi_{\theta})
$$
Consider the expected return under the new policy $\pi_{\theta'}$:

$$
J(\theta') = \mathbb{E}_{\tau \sim \pi_{\theta'}}[R(\tau)]
$$

Using importance sampling with respect to the old policy $\pi_{\theta_{\text{old}}}$, we rewrite:

$$
J(\theta') = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_{\theta'}(\tau)}{\pi_{\theta_{\text{old}}}(\tau)}R(\tau)\right]
$$

However, the trajectory-level ratio $\frac{\pi_{\theta'}(\tau)}{\pi_{\theta_{\text{old}}}(\tau)}$ is a product of per-step ratios, so estimating it over entire trajectories has very high variance. Therefore, TRPO approximates this by using single-step importance sampling and the advantage function $A^{\pi_{\theta_{\text{old}}}}(s,a)$:

$$
J(\theta') \approx \eta(\pi_{\theta_{\text{old}}}) + \sum_s d^{\pi_{\theta_{\text{old}}}}(s)\sum_a \pi_{\theta'}(a|s)A^{\pi_{\theta_{\text{old}}}}(s,a)
$$

We define this approximation as the surrogate objective:

$$
L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta'}) = \eta(\pi_{\theta_{\text{old}}}) + \sum_s d^{\pi_{\theta_{\text{old}}}}(s)\sum_a \pi_{\theta'}(a|s)A^{\pi_{\theta_{\text{old}}}}(s,a)
$$

Recall the **Conservative Policy Iteration (CPI)** lower bound on the true return of the new policy:
$$
\eta(\pi_{\theta'}) \geq \eta(\pi_{\theta_{\text{old}}}) + \sum_{s} d^{\pi_{\theta_{\text{old}}}}(s)\sum_{a}\pi_{\theta'}(a|s)A^{\pi_{\theta_{\text{old}}}}(s,a) - \frac{2\gamma\epsilon}{(1-\gamma)^2} D_{\text{KL}}^{\text{max}}(\pi_{\theta_{\text{old}}}\|\pi_{\theta'})
$$
Here $\epsilon$ is a bound on the magnitude of the advantage under the old policy, and the right-hand side is the surrogate $L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta'})$ minus a KL penalty. The bound is theoretically insightful but practically restrictive because $D_{\text{KL}}^{\text{max}}(\pi_{\theta_{\text{old}}}\|\pi_{\theta'}) = \max_s D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s)\|\pi_{\theta'}(\cdot|s))$ requires bounding the divergence at **every state**.

In practice, we replace the max KL divergence with an **average KL divergence**, a heuristic that can be estimated from samples:

$$
D_{\text{KL}}^{\text{max}}(\pi_{\theta_{\text{old}}}\|\pi_{\theta'}) \approx \mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}}}[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s)\|\pi_{\theta'}(\cdot|s))]
$$

This leads to a more manageable approximate lower bound:

$$
\eta(\pi_{\theta'}) \gtrsim \eta(\pi_{\theta_{\text{old}}}) + \sum_{s} d^{\pi_{\theta_{\text{old}}}}(s)\sum_{a}\pi_{\theta'}(a|s)A^{\pi_{\theta_{\text{old}}}}(s,a) - \frac{2\gamma\epsilon}{(1-\gamma)^2}\cdot\mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}}}[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s)\|\pi_{\theta'}(\cdot|s))]
$$
To ensure policy improvement, we aim to maximize the lower bound above. Because the penalty coefficient prescribed by the theory leads to very small steps in practice, TRPO instead poses a constrained optimization problem. Since $\eta(\pi_{\theta_{\text{old}}})$ is constant, we seek a new policy $\pi_{\theta'}$ that maximizes the remaining term of the surrogate, rewritten with importance sampling over actions:

$$
\sum_{s} d^{\pi_{\theta_{\text{old}}}}(s)\sum_{a}\pi_{\theta'}(a|s)A^{\pi_{\theta_{\text{old}}}}(s,a)=\mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}},\, a \sim \pi_{\theta_{\text{old}}}} \left[ \frac{\pi_{\theta'}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}A^{\pi_{\theta_{\text{old}}}}(s,a)\right]
$$

subject to a constraint on the KL divergence between the old policy and the new policy:

$$
\mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}}}[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s)\|\pi_{\theta'}(\cdot|s))] \leq \delta
$$

Here, $\delta$ is a hyperparameter chosen to control how aggressively the policy can change in each update.

Thus, the optimization problem is:

$$
\max_{\pi_{\theta'}} L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta'}) \quad\text{s.t.}\quad \mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}}}[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s)\|\pi_{\theta'}(\cdot|s))] \leq \delta
$$
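
Both quantities in this problem are expectations under the old policy, so they can be estimated from sampled states and actions. The sketch below assumes discrete action distributions stored as lists of probabilities and advantage estimates that are already available; the function name and data layout are hypothetical.

```python
import math

def surrogate_and_mean_kl(old_probs, new_probs, actions, advantages):
    """Sample estimates of the importance-weighted surrogate and the average KL.

    old_probs[i], new_probs[i]: action distributions at the i-th sampled state
    actions[i]: action taken under the old policy at that state
    advantages[i]: advantage estimate for that state-action pair
    Assumes new_probs is positive wherever old_probs is positive.
    """
    n = len(actions)
    surrogate = 0.0
    mean_kl = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        surrogate += (p_new[a] / p_old[a]) * adv
        mean_kl += sum(po * math.log(po / pn)
                       for po, pn in zip(p_old, p_new) if po > 0.0)
    return surrogate / n, mean_kl / n

old = [[0.5, 0.5], [0.25, 0.75]]
new = [[0.75, 0.25], [0.25, 0.75]]
actions = [0, 1]
adv = [1.0, -2.0]
s_same, kl_same = surrogate_and_mean_kl(old, old, actions, adv)
s_new, kl_new = surrogate_and_mean_kl(old, new, actions, adv)
```

At $\theta' = \theta_{\text{old}}$ every ratio is 1 and the KL term is 0, so the estimate reduces to the mean advantage. Shifting probability toward the positively rewarded action raises the surrogate while using up part of the KL budget $\delta$.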

We approximate the surrogate objective around $\theta_{\text{old}}$ with a first-order (linear) Taylor expansion:

$$
L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta'}) \approx L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta_{\text{old}}}) + \nabla_{\theta} L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta})\big|_{\theta=\theta_{\text{old}}}^\top (\theta' - \theta_{\text{old}})
$$

For the KL divergence constraint, we apply a second-order (quadratic) Taylor expansion around $\theta_{\text{old}}$. The constant and linear terms vanish because the divergence attains its minimum value of zero at $\theta' = \theta_{\text{old}}$:

$$
\mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}}}\left[D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s)\|\pi_{\theta'}(\cdot|s))\right] \approx \frac{1}{2}(\theta' - \theta_{\text{old}})^\top F(\theta_{\text{old}})(\theta' - \theta_{\text{old}})
$$

where the Fisher Information Matrix (FIM), $F(\theta_{\text{old}})$, is defined as:

$$
F(\theta_{\text{old}}) = \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\left[\nabla_{\theta}\log\pi_{\theta}(a|s)\nabla_{\theta}\log\pi_{\theta}(a|s)^\top\right]\Big|_{\theta=\theta_{\text{old}}}
$$

Combining the above approximations yields a simpler constrained optimization problem:

$$
\max_{\theta'} \nabla_{\theta} L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta})\big|_{\theta=\theta_{\text{old}}}^\top(\theta' - \theta_{\text{old}}) \quad \text{subject to} \quad \frac{1}{2}(\theta' - \theta_{\text{old}})^\top F(\theta_{\text{old}})(\theta' - \theta_{\text{old}}) \leq \delta
$$

Using the method of Lagrange multipliers, we obtain the optimal update step in closed form:

$$
\theta_{\text{new}} = \theta_{\text{old}} + \sqrt{\frac{2\delta}{g^\top F^{-1} g}} F^{-1} g
$$

where:

- $g = \nabla_{\theta} L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta})\big|_{\theta=\theta_{\text{old}}}$ is the policy gradient evaluated at $\theta_{\text{old}}$, computed with the log-derivative trick.
- $F$ is the Fisher Information Matrix evaluated at $\theta_{\text{old}}$.
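
The Lagrange-multiplier step can be spelled out as follows. This is a sketch assuming $F$ is positive definite and $g \neq 0$; the shorthand $\Delta = \theta' - \theta_{\text{old}}$ and the multiplier $\lambda > 0$ are introduced here for the derivation.

$$
\begin{aligned}
\mathcal{L}(\Delta,\lambda) &= g^\top \Delta - \lambda\left(\tfrac{1}{2}\Delta^\top F \Delta - \delta\right), \\
\nabla_\Delta \mathcal{L} = 0 \;&\Rightarrow\; g - \lambda F \Delta = 0 \;\Rightarrow\; \Delta = \tfrac{1}{\lambda}F^{-1}g, \\
\tfrac{1}{2}\Delta^\top F \Delta = \delta \;&\Rightarrow\; \frac{1}{2\lambda^2}\,g^\top F^{-1} g = \delta \;\Rightarrow\; \frac{1}{\lambda} = \sqrt{\frac{2\delta}{g^\top F^{-1} g}}.
\end{aligned}
$$

Because the objective is linear in $\Delta$, the constraint is active at the optimum, which is why it is imposed with equality in the last line.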

In practice, directly computing the inverse $F^{-1}$ is computationally expensive. Thus, TRPO uses the conjugate gradient method to approximate the product $F^{-1} g$ efficiently without explicitly computing the inverse.

---
**TRPO Algorithm**

**Given:** initial policy parameters $\theta_0$, KL-divergence constraint parameter $\delta$

**for** iteration $k=0,1,2,\dots$ **do**:

1. **Collect trajectories** by executing policy $\pi_{\theta_k}(a|s)$.

2. **Compute advantages** $A^{\pi_{\theta_k}}(s,a)$ using collected data.

3. **Compute policy gradient:**
$$
g = \mathbb{E}_{s,a \sim \pi_{\theta_k}}\left[\nabla_{\theta'}\log\pi_{\theta'}(a|s)\big|_{\theta'=\theta_k} A^{\pi_{\theta_k}}(s,a)\right]
$$

4. **Estimate Fisher Information Matrix (FIM)**:
$$
F(\theta_k) = \mathbb{E}_{s,a \sim \pi_{\theta_k}}\left[\nabla_{\theta_k}\log\pi_{\theta_k}(a|s)\nabla_{\theta_k}\log\pi_{\theta_k}(a|s)^\top\right]
$$

5. **Compute policy update direction** by approximately solving:
$$
F(\theta_k)x = g
$$
using the **conjugate gradient method**.

6. **Update policy parameters**:
$$
\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^\top x}}\,x
$$

**end for**  
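
Steps 5 and 6 can be sketched with a plain conjugate gradient solver that touches $F$ only through matrix-vector products. The example below is a minimal illustration under stated assumptions: $F$ is given as a small explicit positive-definite matrix and $g$ as a fixed vector (both made-up values), whereas in practice both would be estimated from the collected data. Refinements that practical implementations commonly add on top of this step are omitted.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v)."""
    x = [0.0] * len(g)
    r = list(g)     # residual g - F x (x starts at zero)
    p = list(g)     # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Fp = fvp(p)
        step = rr / dot(p, Fp)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def trpo_step(theta, fvp, g, delta):
    """theta + sqrt(2 * delta / (g^T x)) * x, with x approximately F^{-1} g."""
    x = conjugate_gradient(fvp, g)
    scale = math.sqrt(2.0 * delta / dot(g, x))
    return [t + scale * xi for t, xi in zip(theta, x)]

F = [[2.0, 0.0], [0.0, 1.0]]   # assumed Fisher matrix
g = [2.0, 1.0]                 # assumed policy gradient
theta_new = trpo_step([0.0, 0.0], lambda v: matvec(F, v), g, delta=0.01)
```

For this input the solver returns $x \approx (1, 1)$, and the resulting step satisfies $\tfrac{1}{2}\Delta^\top F \Delta \approx \delta$, so the quadratic approximation of the KL constraint is met with equality.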


### Proximal Policy Optimization (PPO)