# Policy Approximation

We can do *__control using action-value functions__* and a policy that selects actions based on estimated action values. 

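We can also *__directly learn a parameterized policy__* $\pi(a|s,\theta)$ that selects actions without consulting an action-value function. For any parameter vector $\theta$, the policy must be a valid probability distribution over actions in every state: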

$$ \pi(a|s, \theta) \geq 0\ \ \forall a \in A, s \in S $$

$$ \sum_{a\in A} \pi(a|s,\theta) = 1\ \ \forall s \in S $$

If the action space is discrete and not too large, we can parameterize the policy with numerical preferences $h(s,a,\theta)$ for each state-action pair and select actions using a softmax over those preferences.

$$\pi(a|s,\theta) \doteq \frac{e^{h(s,a,\theta)}}{\sum_{b \in A}e^{h(s,b,\theta)}} $$
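The softmax above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `softmax_policy` and the example preference values are assumptions made for the example, and the max-subtraction is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import math

def softmax_policy(preferences):
    """Map action preferences h(s, a, theta) for one state to probabilities."""
    # Subtracting the maximum does not change the result,
    # but it keeps math.exp from overflowing for large preferences.
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical preferences for three actions in a single state.
probs = softmax_policy([1.0, 2.0, 0.5])
```

The result satisfies both constraints stated earlier: every probability is positive and they sum to 1.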

#### Why prefer Policy Approximation?
* The policy can start out exploratory, and the agent can learn to become more greedy (approaching a deterministic policy) over time.
* Stochastic policies are possible, which matters when the best policy is itself stochastic.
* The policy can be a simpler function to approximate than the action-value function.
* Stronger convergence guarantees are available for policy gradient methods than for action-value methods.

## Policy Gradient Theorem

Let us define the performance measure $ J(\theta) $ in an episodic task as the value of the start state of the episode.

$$ J(\theta) \doteq v_{\pi_{\theta}}(s_0) $$

The policy gradient theorem gives us the gradient of the performance measure, which we can then use to perform gradient ascent to maximize the performance.

$$ \bigtriangledown J(\theta) \propto \sum_s \mu(s) \sum_a \bigtriangledown \pi(a|s,\theta).q_\pi(s,a) $$

For episodic tasks, the proportionality constant is the average length of an episode.
\
For continuing tasks, the proportionality constant is 1.

## REINFORCE: Monte Carlo Policy Gradient

The policy parameter update equation for gradient ascent is:

$$ \theta_{t+1} = \theta_{t} + \alpha \widehat{\bigtriangledown J(\theta_t)} $$

where $ \widehat{\bigtriangledown J(\theta_t)} $ is a stochastic estimate whose expectation approximates $\bigtriangledown J(\theta_t)$, the gradient of the performance measure with respect to the policy parameters, evaluated at $\theta_t$.

The policy gradient theorem gives us an expression proportional to the gradient $\bigtriangledown J(\theta_t)$. 
\
All we need is a way of sampling whose expectation equals or approximates $\bigtriangledown J(\theta_t)$.

$$ \bigtriangledown J(\theta) \propto \sum_s \mu(s) \sum_a \bigtriangledown \pi(a|s,\theta).q_\pi(s,a) $$

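Since $\mu(s)$ is the on-policy distribution of states under $\pi$ (how often each state occurs), the weighted sum over states can be written as an expectation over the states $S_t$ visited while following $\pi$. The constant of proportionality can be absorbed into the step size $\alpha$, so we write an equality from here on.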

$$ \bigtriangledown J(\theta) =  \mathop{{}\mathbb{E}_\pi} \left [ \sum_a \bigtriangledown \pi(a|S_t,\theta).q_\pi(S_t,a) \right ] $$

Using this, we can write the update equation as:

$$ \theta_{t+1} = \theta_{t} + \alpha \sum_a \bigtriangledown \pi(a|S_t,\theta).\hat{q}(S_t,a, \textbf{w}) $$

where $\hat{q}(S_t,a, \textbf{w})$ is a learned approximation to $q_\pi$.

Next, we sample actions the same way we sampled states: we want $A_t$ in place of $a$, so that the sum $\sum_a$ becomes an expectation over actions.
\
For that, each term in the sum must be weighted by $\pi(a|S_t,\theta)$, so we multiply and divide by $\pi(a|S_t,\theta)$:

$$ \bigtriangledown J(\theta) =  \mathop{{}\mathbb{E}_\pi} \left [ \sum_a \bigtriangledown \pi(a|S_t,\theta).q_\pi(S_t,a) \frac{\pi(a|S_t,\theta)}{\pi(a|S_t,\theta)} \right ] $$

$$ \bigtriangledown J(\theta) =  \mathop{{}\mathbb{E}_\pi} \left [ \sum_a q_\pi(S_t,a).\pi(a|S_t,\theta). \frac{\bigtriangledown \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)} \right ] $$

$$ \bigtriangledown J(\theta) =  \mathop{{}\mathbb{E}_\pi} \left [ q_\pi(S_t, A_t) \frac{\bigtriangledown \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \right ] $$

Since $q_\pi(S_t, A_t) = \mathop{{}\mathbb{E}_\pi} [G_t | S_t, A_t]$,
and $\bigtriangledown ln\ f(x) = \frac{\bigtriangledown f(x)}{f(x)}$,

$$ \bigtriangledown J(\theta) =  \mathop{{}\mathbb{E}_\pi} \left [ G_t \bigtriangledown ln\ \pi(A_t|S_t,\theta) \right ] $$

So, the update equation becomes:

$$ \theta_{t+1} = \theta_{t} + \alpha\ G_t \bigtriangledown ln\ \pi(A_t|S_t,\theta) $$
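The REINFORCE update can be sketched as follows. Several things here are assumptions made for illustration and are not part of the notes above: the preferences are taken to be linear, $h(s,a,\theta) = \theta^\top x(s,a)$, with hypothetical feature vectors $x(s,a)$, in which case $\bigtriangledown ln\ \pi(a|s,\theta) = x(s,a) - \sum_b \pi(b|s,\theta)\,x(s,b)$; returns are undiscounted; and the function names and episode format are invented for the example.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, features, action):
    """Gradient of ln pi(action | s, theta) for linear softmax preferences.

    features[b] is the (hypothetical) feature vector x(s, b) of action b.
    """
    prefs = [sum(t * x for t, x in zip(theta, xb)) for xb in features]
    probs = softmax(prefs)
    # Expected feature vector under the current policy.
    expected = [sum(p * xb[i] for p, xb in zip(probs, features))
                for i in range(len(theta))]
    return [features[action][i] - expected[i] for i in range(len(theta))]

def reinforce_update(theta, episode, alpha):
    """Apply theta <- theta + alpha * G_t * grad ln pi(A_t | S_t, theta) per step.

    episode is a list of (features, action, reward) tuples, reward = R_{t+1}.
    """
    # Undiscounted returns G_t, accumulated from the end of the episode.
    returns, g = [], 0.0
    for _, _, reward in reversed(episode):
        g += reward
        returns.append(g)
    returns.reverse()
    for (features, action, _), g_t in zip(episode, returns):
        grad = grad_log_pi(theta, features, action)
        theta = [t + alpha * g_t * d for t, d in zip(theta, grad)]
    return theta

# One-step episode: two actions with one-hot features, action 0 earns reward 1.
features = [[1.0, 0.0], [0.0, 1.0]]
episode = [(features, 0, 1.0)]
new_theta = reinforce_update([0.0, 0.0], episode, alpha=0.1)
```

In this toy episode the parameter tied to the rewarded action increases and the other decreases, so the action that led to a positive return becomes more likely.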

## REINFORCE with Baseline

We can include a baseline $b(s)$ in the policy gradient theorem as follows:

$$ \bigtriangledown J(\theta) \propto \sum_s \mu(s) \sum_a \bigtriangledown \pi(a|s,\theta) \bigl(q_\pi(s,a) - b(s) \bigr ) $$

The baseline can be any function __as long as it does not depend on $a$__.

Consider $\sum_a b(s) \bigtriangledown \pi(a|s,\theta)$

$$ \sum_a b(s) \bigtriangledown \pi(a|s,\theta) = b(s) \sum_a \bigtriangledown \pi(a|s,\theta) = b(s) \bigtriangledown 1 = 0 $$

So, subtracting $b(s)$ __doesn't change the expectation__ of the update as long as $b(s)$ does not depend on $a$. However, it can have a __large effect on the variance__ of the update.
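Both claims can be checked exactly on a tiny example. The sketch below assumes a single state with two actions, tabular softmax preferences (so $\partial\, ln\ \pi(a) / \partial h_k = 1[a=k] - \pi(k)$), and deterministic returns equal to hypothetical action values; all names and numbers are invented for illustration.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(prefs, q_values, baseline, k=0):
    """Exact mean and variance of (q(a) - b) * d ln pi(a) / d h_k for a ~ pi."""
    probs = softmax(prefs)
    # One possible sample of the gradient estimate per action.
    samples = [(q - baseline) * ((1.0 if a == k else 0.0) - probs[k])
               for a, q in enumerate(q_values)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

prefs = [0.0, 0.0]        # hypothetical preferences, so pi = [0.5, 0.5]
q_values = [10.0, 11.0]   # hypothetical action values
mean_plain, var_plain = estimator_stats(prefs, q_values, baseline=0.0)
mean_base, var_base = estimator_stats(prefs, q_values, baseline=10.5)
```

In this example the two means agree, while the variance drops sharply once the baseline is set to the average action value. How much variance a baseline removes in general depends on the problem.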

The update equation for this generalization of REINFORCE is:

$$ \theta_{t+1} = \theta_{t} + \alpha\ \bigl(G_t - b(S_t) \bigr) \bigtriangledown ln\ \pi(A_t|S_t,\theta) $$

One common choice for the baseline is a learned estimate of the state-value function, $\hat{v}(S_t, \textbf{w})$.

## Actor-Critic Methods

Actor-Critic methods have two components:
* __Actor__: takes an action based on the policy
* __Critic__: provides feedback to the actor by estimating the value of the states

The value function in REINFORCE with Baseline is a baseline, not a critic: we don't use it for bootstrapping. So, REINFORCE with Baseline is not an actor-critic method.

REINFORCE with Baseline will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly because its estimates have high variance. It is also inconvenient to use online or on continuing problems, since each update needs the full return.

For faster learning, like in TD, we have to introduce bootstrapping. The critic part of Actor-Critic is where we will do this.

From the previous section, we have for $b(S_t) = \hat{v}(S_t,\textbf{w})$,

$$ \theta_{t+1} = \theta_{t} + \alpha\ \bigl(G_{t} - \hat{v}(S_t,\textbf{w}) \bigr) \bigtriangledown ln\ \pi(A_t|S_t,\theta) $$

Let's introduce bootstrapping by replacing the full return $G_t$ with the one-step return $R_{t+1} + \gamma \hat{v}(S_{t+1}, \textbf{w})$:

$$ \theta_{t+1} = \theta_{t} + \alpha\ \bigl(R_{t+1} + \gamma \hat{v}(S_{t+1}, \textbf{w}) - \hat{v}(S_t,\textbf{w}) \bigr) \bigtriangledown ln\ \pi(A_t|S_t,\theta) $$

$$ \theta_{t+1} = \theta_{t} + \alpha\ \delta_t \bigtriangledown ln\ \pi(A_t|S_t,\theta) $$

Through bootstrapping, we introduce bias and an asymptotic dependence on the quality of the function approximation. However, this bias is often beneficial because it reduces variance and accelerates learning.

#### The update equations for Actor-Critic methods

$$ \textbf{w} \leftarrow \textbf{w} + \alpha^{\textbf{w}} . \delta_t . \bigtriangledown \hat{v}(S_t, \textbf{w}) $$

$$ \theta \leftarrow \theta + \alpha^\theta . \delta_t . \bigtriangledown ln\ \pi(A_t|S_t,\theta) $$

where $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \textbf{w}) - \hat{v}(S_t,\textbf{w})$ is the TD error.

Usually $\alpha^{\theta} < \alpha^{\textbf{w}}$, so that the critic learns faster than the actor.
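A single actor-critic step can be sketched as below. The sketch assumes a linear critic $\hat{v}(s,\textbf{w}) = \textbf{w}^\top x(s)$ (so its gradient is just $x(s)$), a softmax actor with linear preferences over hypothetical state-action features, and a terminal-state value of zero; the function name, argument names and feature vectors are all invented for illustration.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, w, x_s, x_next, features, action, reward,
                      alpha_theta, alpha_w, gamma, done):
    """One-step actor-critic update; returns (new_theta, new_w, delta).

    x_s, x_next: state features of S_t and S_{t+1} used by the critic.
    features[b]: state-action features x(S_t, b) used by the actor.
    """
    v_s = sum(wi * xi for wi, xi in zip(w, x_s))
    # Assumption: the value of a terminal state is taken to be zero.
    v_next = 0.0 if done else sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v_s  # TD error

    # Critic update: for a linear critic, grad v_hat(S_t, w) = x(S_t).
    new_w = [wi + alpha_w * delta * xi for wi, xi in zip(w, x_s)]

    # Actor update: grad ln pi = x(S_t, A_t) - sum_b pi(b | S_t) x(S_t, b).
    prefs = [sum(t * x for t, x in zip(theta, xb)) for xb in features]
    probs = softmax(prefs)
    grad = [features[action][i]
            - sum(p * xb[i] for p, xb in zip(probs, features))
            for i in range(len(theta))]
    new_theta = [t + alpha_theta * delta * g for t, g in zip(theta, grad)]
    return new_theta, new_w, delta

# Hypothetical transition: action 1 taken, reward 1, non-terminal next state.
new_theta, new_w, delta = actor_critic_step(
    theta=[0.0, 0.0], w=[0.0], x_s=[1.0], x_next=[1.0],
    features=[[1.0, 0.0], [0.0, 1.0]], action=1, reward=1.0,
    alpha_theta=0.1, alpha_w=0.5, gamma=0.9, done=False)
```

Unlike REINFORCE, this update needs only one transition, so it can be applied online at every time step.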


## Policy Parameterization for Continuous Actions

Policy-based methods offer practical ways of dealing with large or even continuous action spaces.

Instead of learning a probability for each of the many actions, we learn the statistics of a probability distribution over actions.

### Gaussian Policies

The probability density function for the normal distribution with mean $\mu$ and standard deviation $\sigma$ is given as (here $\pi$ is the number $\pi \approx 3.14159$, not the policy):

$$ p(x) \doteq \frac{1}{\sigma \sqrt{2\pi}}\exp \Bigl( - \frac{(x-\mu)^2}{2 \sigma^2} \Bigr) $$

To produce a policy parameterization, the policy can be defined as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators that depend on the state.

$$ \pi(a|s,\theta) \doteq \frac{1}{\sigma(s,\theta) \sqrt{2\pi}}\exp \Bigl( - \frac{(a-\mu(s,\theta))^2}{2 \sigma(s,\theta)^2} \Bigr) $$
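A Gaussian policy of this kind can be sketched as follows. The specific parameterization is an assumption chosen for illustration: a linear mean $\mu(s,\theta) = \theta_\mu^\top x(s)$ and a standard deviation $\sigma(s,\theta) = \exp(\theta_\sigma^\top x(s))$, where the exponential keeps $\sigma$ positive. The state features `x` and all function names are hypothetical.

```python
import math
import random

def gaussian_policy_params(theta_mu, theta_sigma, x):
    """Mean and standard deviation of the action distribution in state x."""
    mu = sum(t * xi for t, xi in zip(theta_mu, x))
    # exp keeps the standard deviation strictly positive.
    sigma = math.exp(sum(t * xi for t, xi in zip(theta_sigma, x)))
    return mu, sigma

def gaussian_density(a, mu, sigma):
    """Normal probability density at action a."""
    return (math.exp(-((a - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def grad_log_density(a, mu, sigma, x):
    """Gradients of ln pi(a | s, theta) w.r.t. theta_mu and theta_sigma."""
    g_mu = [(a - mu) / sigma ** 2 * xi for xi in x]
    g_sigma = [(((a - mu) ** 2) / sigma ** 2 - 1.0) * xi for xi in x]
    return g_mu, g_sigma

x = [1.0, 0.5]  # hypothetical state features
mu, sigma = gaussian_policy_params([0.2, 0.4], [0.0, 0.0], x)
action = random.gauss(mu, sigma)  # sample a real-valued action
g_mu, g_sigma = grad_log_density(action, mu, sigma, x)
```

These gradients can be plugged into the same REINFORCE or actor-critic updates in place of the softmax gradient, since those updates only require $\bigtriangledown ln\ \pi(A_t|S_t,\theta)$.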
