# Actor-Critic Methods

## The simplest actor-critic methods (QAC)

Recall that the policy gradient method searches for an optimal policy by maximizing a scalar metric $J(\theta)$. The gradient-ascent algorithm for maximizing $J(\theta)$ is:

$$
\begin{aligned}
\theta_{t+1} &= \theta_{t} + \alpha\nabla_{\theta}J(\theta)\\
&= \theta_{t} + \alpha\mathbb{E}_{S\sim\eta,A\sim\pi}\left[ \nabla_{\theta}\ln\pi(A|S,\theta_{t})q_{\pi}(S,A) \right]
\end{aligned}
$$

where $\eta$ is a distribution over the states. Since the expectation cannot be computed exactly in general, it is approximated by a stochastic gradient:

$$\theta_{t+1} = \theta_{t} + \alpha\nabla_{\theta}\ln\pi(a_{t}|s_{t},\theta_{t})q_{t}(s_{t},a_{t})$$

* If $q_{t}(s_{t}, a_{t})$ is estimated by Monte Carlo learning, the corresponding algorithm is called REINFORCE.

* If $q_{t}(s_{t}, a_{t})$ is estimated by TD learning, the corresponding algorithms are called actor-critic.

(QAC) At time step $t$, generate $(s_{t},a_{t},r_{t+1},s_{t+1},a_{t+1})$, then:

* Actor: $\theta_{t+1} = \theta_{t} + \alpha_{\theta}\nabla_{\theta}\ln\pi(a_{t}|s_{t},\theta_{t})q(s_{t},a_{t},w_{t})$, where the estimate $q_{t}(s_{t},a_{t})$ is supplied by the critic as $q(s_{t},a_{t},w_{t})$

* Critic: $w_{t+1} = w_{t} + \alpha_{w}\left[r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_{t}) - q(s_{t}, a_{t}, w_{t})\right]\nabla_{w}q(s_{t}, a_{t}, w_{t})$
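
The two QAC updates above can be sketched for a small tabular problem. This is a minimal illustration rather than a reference implementation: it assumes a softmax policy with one preference `theta[s][a]` per state-action pair and a tabular critic `w[s][a]`, and the function names, step sizes, and experience tuple are all hypothetical.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def qac_step(theta, w, s, a, r, s_next, a_next,
             alpha_theta=0.1, alpha_w=0.1, gamma=0.9):
    """Apply one QAC update in place and return the critic's TD error."""
    pi = softmax(theta[s])
    q_sa = w[s][a]  # q(s_t, a_t, w_t), read before the critic changes it
    # Actor: for a tabular softmax, d ln pi(a|s) / d theta[s][b] = 1{b == a} - pi(b|s).
    for b in range(len(pi)):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_theta * grad_log * q_sa
    # Critic: Sarsa-style TD update; the gradient of a table entry is 1 at (s, a), 0 elsewhere.
    td_error = r + gamma * w[s_next][a_next] - q_sa
    w[s][a] += alpha_w * td_error
    return td_error

# Two states, two actions; the experience tuple below is made up for illustration.
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [[0.0, 0.5], [0.0, 0.0]]
td_error = qac_step(theta, w, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

Because the critic's estimate for the chosen action is positive here, the actor raises the preference for that action and lowers the other one by the same amount.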

## Advantage actor-critic (A2C)

One interesting property of the policy gradient is that it is invariant to an additional baseline. That is, for any scalar function $b(S)$ of the state,

$$ \mathbb{E}_{S\sim\eta,A\sim\pi}\left[ \nabla_{\theta}\ln\pi(A|S,\theta_{t})q_{\pi}(S,A) \right] =  \mathbb{E}_{S\sim\eta,A\sim\pi}\left[ \nabla_{\theta}\ln\pi(A|S,\theta_{t})(q_{\pi}(S,A) - b(S)) \right]$$
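
The invariance holds because the baseline term has zero expectation. A short derivation, written for a finite action space (an assumption made here for readability):

$$
\begin{aligned}
\mathbb{E}_{S\sim\eta,A\sim\pi}\left[ \nabla_{\theta}\ln\pi(A|S,\theta_{t})b(S) \right]
&= \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\pi(a|s,\theta_{t})\nabla_{\theta}\ln\pi(a|s,\theta_{t})b(s)\\
&= \sum_{s\in\mathcal{S}}\eta(s)b(s)\sum_{a\in\mathcal{A}}\nabla_{\theta}\pi(a|s,\theta_{t})\\
&= \sum_{s\in\mathcal{S}}\eta(s)b(s)\nabla_{\theta}\sum_{a\in\mathcal{A}}\pi(a|s,\theta_{t})\\
&= \sum_{s\in\mathcal{S}}\eta(s)b(s)\nabla_{\theta}1 = 0
\end{aligned}
$$

The second line uses $\pi\nabla_{\theta}\ln\pi = \nabla_{\theta}\pi$, and the last line uses the fact that the action probabilities sum to one in every state.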

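The baseline is useful because it can reduce the variance of the estimate when samples are used to approximate the true gradient. Let $X = \nabla_{\theta}\ln\pi(A|S,\theta_{t})(q_{\pi}(S,A) - b(S))$ denote the sampled gradient: the baseline leaves $\mathbb{E}[X]$ unchanged but changes $\text{var}(X)$. In fact, the optimal baseline that minimizes $\text{var}(X)$ (measured by the trace of the covariance matrix, since $X$ is a vector) is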

$$b^{\ast}(s) = \frac{\mathbb{E}_{A\sim\pi}\left[\left \| \nabla_{\theta}\ln\pi(A|s,\theta_{t}) \right \|^{2}q_{\pi}(s, A)\right]}{\mathbb{E}_{A\sim\pi}\left[\left \| \nabla_{\theta}\ln\pi(A|s,\theta_{t}) \right \|^{2}\right]}$$

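This optimal baseline is too complex to use in practice. Removing the weights $\left \| \nabla_{\theta}\ln\pi(A|s,\theta_{t}) \right \|^{2}$ gives the state value, which serves as a simpler, suboptimal baseline: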

$$b(s) = \mathbb{E}_{A\sim\pi}[q_{\pi}(s, A)] = v_{\pi}(s)$$

Now the gradient-ascent algorithm becomes:

$$\theta_{t+1} = \theta_{t} + \alpha\mathbb{E}_{S\sim\eta,A\sim\pi}\left[ \nabla_{\theta}\ln\pi(A|S,\theta_{t})(q_{\pi}(S,A) -  v_{\pi}(S)) \right]$$
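
The effect of the state-value baseline can be checked exactly on a toy case. The sketch below assumes a single state, two actions, a tabular softmax policy, and made-up action values; it enumerates the actions instead of sampling, so the mean and variance are exact. All names and numbers are hypothetical.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(pi, a):
    """Gradient of ln pi(a|s) with respect to the preferences of state s."""
    return [(1.0 if b == a else 0.0) - pi[b] for b in range(len(pi))]

def gradient_mean_and_variance(pi, q, baseline):
    """Exact mean and trace of the covariance of X = grad ln pi(A|s) * (q(s, A) - b(s))."""
    n = len(pi)
    xs = [[g * (q[a] - baseline) for g in grad_log_softmax(pi, a)] for a in range(n)]
    mean = [sum(pi[a] * xs[a][i] for a in range(n)) for i in range(n)]
    second_moment = sum(pi[a] * sum(x * x for x in xs[a]) for a in range(n))
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

pi = softmax([0.0, 0.0])                    # uniform policy over two actions
q = [10.0, 12.0]                            # illustrative action values
v = sum(p * qa for p, qa in zip(pi, q))     # state value v(s) = E[q(s, A)]

mean_plain, var_plain = gradient_mean_and_variance(pi, q, baseline=0.0)
mean_adv, var_adv = gradient_mean_and_variance(pi, q, baseline=v)
```

In this example both choices give the same expected gradient, while subtracting $v_{\pi}(s)$ lowers the variance. How large the reduction is depends on the problem, so this single case should not be read as a general guarantee.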

(A2C) At time step $t$, generate $(s_{t},a_{t},r_{t+1},s_{t+1})$ (the next action $a_{t+1}$ is not needed, because the critic estimates state values), then:

* Advantage (TD error): $\delta_{t} = r_{t+1} + \gamma v(s_{t+1}, w_{t}) - v(s_{t},w_{t})$, where $r_{t+1} + \gamma v(s_{t+1}, w_{t})$ approximates $q_{\pi}(s_{t},a_{t})$ and $v(s_{t},w_{t})$ approximates $v_{\pi}(s_{t})$

* Actor (policy update): $\theta_{t+1} = \theta_{t} + \alpha_{\theta}\delta_{t}\nabla_{\theta}\ln\pi(a_{t}|s_{t},\theta_{t})$

* Critic (value update): $w_{t+1} = w_{t} + \alpha_{w}\delta_{t}\nabla_{w}v(s_{t},w_{t})$
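
The three A2C steps above can be sketched in the same tabular setting. This is an illustrative sketch under assumptions not stated in the text: a softmax policy with preferences `theta[s][a]`, a tabular critic `v[s]`, and hypothetical step sizes and transition.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def a2c_step(theta, v, s, a, r, s_next,
             alpha_theta=0.1, alpha_w=0.1, gamma=0.9):
    """Apply one A2C update in place and return the TD error delta_t."""
    pi = softmax(theta[s])
    # Advantage estimate: TD error computed from the current value table.
    delta = r + gamma * v[s_next] - v[s]
    # Actor: step along delta * grad ln pi(a|s), with the tabular softmax gradient.
    for b in range(len(pi)):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_theta * delta * grad_log
    # Critic: TD(0) update; the gradient of a table entry v[s] is 1.
    v[s] += alpha_w * delta
    return delta

# Made-up transition: in state 0 take action 1, receive reward 1, land in state 1.
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 1.0]
delta = a2c_step(theta, v, s=0, a=1, r=1.0, s_next=1)
```

The same scalar `delta` drives both updates, which is why A2C needs only a state-value critic and no separate action-value estimate.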

## Off-policy actor-critic

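(Importance sampling) Estimate $\mathbb{E}_{X\sim p_{0}}[X]$ with i.i.d. samples $\{x_{i}\}_{i=1}^{n}$ drawn from $p_{1}$, assuming $p_{1}(x) > 0$ wherever $p_{0}(x) > 0$:

$$\mathbb{E}_{X\sim p_{0}}[X] = \sum_{x\in\mathcal{X}}p_{0}(x)x = \sum_{x\in\mathcal{X}}p_{1}(x)\frac{p_{0}(x)}{p_{1}(x)}x = \mathbb{E}_{X\sim p_{1}}\left[\frac{p_{0}(X)}{p_{1}(X)}X\right]$$

where $\frac{p_{0}(x)}{p_{1}(x)}$ is the importance weight. The expectation can therefore be estimated by the weighted sample mean $\frac{1}{n}\sum_{i=1}^{n}\frac{p_{0}(x_{i})}{p_{1}(x_{i})}x_{i}$.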
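
A small numerical sketch of the importance sampling identity, using two hypothetical distributions over $\{+1, -1\}$. The distributions, seed, and sample size are illustrative choices, not values from the text.

```python
import random

def importance_sampling_mean(samples, p0, p1):
    """Estimate E_{X~p0}[X] from samples drawn from p1, weighting each by p0(x)/p1(x)."""
    return sum(p0[x] / p1[x] * x for x in samples) / len(samples)

# Hypothetical distributions over X in {+1, -1}.
p0 = {1: 0.5, -1: 0.5}   # target distribution, true mean 0
p1 = {1: 0.8, -1: 0.2}   # sampling distribution, true mean 0.6

rng = random.Random(0)   # fixed seed so the run is reproducible
samples = rng.choices(list(p1), weights=list(p1.values()), k=20000)

plain_mean = sum(samples) / len(samples)                    # estimates E_{X~p1}[X]
weighted_mean = importance_sampling_mean(samples, p0, p1)   # estimates E_{X~p0}[X]
```

The unweighted mean tracks the mean of $p_{1}$, while the weighted mean moves toward the mean of $p_{0}$, even though no sample was drawn from $p_{0}$.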

Suppose that $\beta$ is a behavior policy. Our goal is to use the samples generated by $\beta$ to learn a target policy $\pi$ that maximizes the following metric:

$$J(\theta) = \sum_{s\in\mathcal{S}}d_{\beta}(s)v_{\pi}(s) = \mathbb{E}_{S\sim d_{\beta}}[v_{\pi}(S)]$$

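where $d_{\beta}$ is the stationary state distribution under $\beta$. The gradient of $J(\theta)$ is:

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{S\sim\rho,A\sim\beta}\left[ \frac{\pi(A|S,\theta)}{\beta(A|S)}\nabla_{\theta}\ln\pi(A|S,\theta)q_{\pi}(S,A) \right]$$

where $\rho$ is a state distribution and $\frac{\pi(A|S,\theta)}{\beta(A|S)}$ is the importance weight that corrects for sampling actions from $\beta$ instead of $\pi$.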

(Off-policy actor-critic based on importance sampling) At time step $t$, generate $a_{t}$ following $\beta(\cdot|s_{t})$ and observe $r_{t+1},s_{t+1}$, then:

* Advantage (TD error): $\delta_{t} = r_{t+1} + \gamma v(s_{t+1}, w_{t}) - v(s_{t},w_{t})$

* Actor (policy update): $\theta_{t+1} = \theta_{t} + \alpha_{\theta}\frac{\pi(a_{t}|s_{t},\theta_{t})}{\beta(a_{t}|s_{t})}\delta_{t}\nabla_{\theta}\ln\pi(a_{t}|s_{t},\theta_{t})$

* Critic (value update): $w_{t+1} = w_{t} + \alpha_{w}\frac{\pi(a_{t}|s_{t},\theta_{t})}{\beta(a_{t}|s_{t})}\delta_{t}\nabla_{w}v(s_{t},w_{t})$
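
The off-policy updates above differ from A2C only by the importance weight, as the following sketch illustrates. It again assumes a tabular softmax target policy and a tabular critic, and it takes the behavior policy as a fixed table `beta[s][a]`; these names and the numbers in the example are hypothetical.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def off_policy_ac_step(theta, v, beta, s, a, r, s_next,
                       alpha_theta=0.1, alpha_w=0.1, gamma=0.9):
    """Apply one importance-weighted actor-critic update in place.

    The action a is assumed to have been drawn from beta[s], so beta[s][a] > 0.
    Returns the importance weight and the TD error.
    """
    pi = softmax(theta[s])
    weight = pi[a] / beta[s][a]                 # pi(a_t|s_t, theta_t) / beta(a_t|s_t)
    delta = r + gamma * v[s_next] - v[s]        # TD error used as the advantage estimate
    # Actor: same direction as A2C, scaled by the importance weight.
    for b in range(len(pi)):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_theta * weight * delta * grad_log
    # Critic: importance-weighted TD(0) update on the value table.
    v[s] += alpha_w * weight * delta
    return weight, delta

# Made-up example: the behavior policy rarely picks action 0, the target picks it half the time.
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 1.0]
beta = [[0.25, 0.75], [0.5, 0.5]]
weight, delta = off_policy_ac_step(theta, v, beta, s=0, a=0, r=1.0, s_next=1)
```

Actions that the behavior policy selects less often than the target policy receive a weight above one, so their updates are amplified; actions it over-selects are damped.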