# Chapter 13: Policy Gradient Methods

## Intro:

So far, all methods have been action-value methods: they learn the values of actions and then select actions based on those estimates. Now we consider methods that learn a *parameterized policy* that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but it isn't required for action selection. The notation $\theta \in \mathbb{R}^{d'}$ is used for the policy's parameter vector. So we write $\pi(a | s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$ for the probability that action $a$ is taken at time $t$, given that the environment is in state $s$ at time $t$ with parameter $\theta$. If a method also uses a learned value function, its weight vector is written $\textbf{w} \in \mathbb{R}^d$ as before.

The methods considered here learn the policy parameter based on the gradient of some performance measure $J(\theta)$ with respect to the policy parameter. The methods aim to maximize performance, so the updates approximate gradient ascent in $J$:

$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$

Here $\widehat{\nabla J(\theta_t)} \in \mathbb{R}^{d'}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to (w.r.t.) its argument $\theta_t$. All methods following this general outline are called *policy-gradient methods*, whether or not they also learn an approximate value function. Methods that learn approximations to both the policy and the value function are called *actor-critic methods*, where the *actor* is the learned policy and the *critic* is the learned value function.
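The update rule above can be illustrated with a minimal sketch. It is not a policy-gradient algorithm: it assumes a toy performance measure $J(\theta) = -\frac{1}{2}\lVert\theta - \theta^*\rVert^2$ and a hypothetical `noisy_gradient` helper that returns the true gradient plus zero-mean noise, so that the estimate is correct only in expectation. The names `target`, `noise_scale`, and `gradient_ascent` are illustrative assumptions.

```python
import numpy as np

def noisy_gradient(theta, target, rng, noise_scale=0.1):
    """Stochastic estimate of the gradient of J(theta) = -0.5 * ||theta - target||^2.

    The true gradient is (target - theta). Zero-mean Gaussian noise is added,
    so the estimate equals the true gradient only in expectation.
    """
    true_grad = target - theta
    return true_grad + noise_scale * rng.standard_normal(theta.shape)

def gradient_ascent(theta0, target, alpha=0.1, steps=500, seed=0):
    """Repeatedly apply theta <- theta + alpha * (gradient estimate)."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta + alpha * noisy_gradient(theta, target, rng)
    return theta

target = np.array([1.0, -2.0])  # maximizer of the toy J (assumed for illustration)
theta_final = gradient_ascent(np.zeros(2), target)
```

With a constant step size the parameter does not settle exactly on the maximizer; it keeps fluctuating in a small neighborhood of it whose size depends on $\alpha$ and the noise level.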

## 13.1: Policy Approximation and its Advantages

We can parameterize the policy in any way, as long as $\pi(a | s, \theta)$ is differentiable w.r.t. its parameters. That is, as long as $\nabla \pi(a | s, \theta)$ (the column vector of partial derivatives of $\pi(a|s, \theta)$ w.r.t. the components of $\theta$) exists and is finite for all $s \in S, a \in A(s), \theta \in \mathbb{R}^{d'}$. To ensure exploration, we generally require that the policy never becomes deterministic, i.e. that $\pi(a|s, \theta) \in (0, 1)$ for all $s, a, \theta$.

If the action space is discrete and not too large, then a natural way to parameterize the policy is to form parameterized numerical preferences $h(s, a, \theta) \in \mathbb{R}$ for each state-action pair. The actions with the highest preferences in each state are given the highest probabilities of being selected, for example according to an exponential softmax distribution:

$\pi(a|s, \theta) \stackrel{.}{=} \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}$

Here $e \approx 2.71828$ is the base of the natural logarithm. The denominator is exactly what is required for the action probabilities in each state to sum to one. This kind of policy parameterization is called *softmax in action preferences*.
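As a minimal sketch of the softmax in action preferences for a single state, the function below maps a vector of preferences $h(s, \cdot, \theta)$ to action probabilities. The function name `softmax_policy` and the example preference values are assumptions for illustration. Subtracting the maximum preference before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged, because the shift cancels between numerator and denominator.

```python
import numpy as np

def softmax_policy(preferences):
    """Return pi(.|s, theta) given the preference vector h(s, ., theta)."""
    h = np.asarray(preferences, dtype=float)
    z = h - h.max()        # shift for numerical stability; ratios are unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()  # the denominator normalizes the probabilities

h_s = np.array([2.0, 1.0, 0.0])  # hypothetical preferences for three actions
pi_s = softmax_policy(h_s)       # roughly [0.665, 0.245, 0.090]
```

Only differences between preferences matter: adding the same constant to every preference in a state yields the same policy.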

The action preferences themselves can be parameterized arbitrarily. For example, they might be computed by a neural network, in which case $\theta$ is the vector of all the connection weights in the network. Alternatively, the preferences could simply be linear in features:

$h(s, a, \theta) = \theta^T \textbf{x}(s, a)$

using feature vectors $\textbf{x}(s, a) \in \mathbb{R}^{d'}$. One advantage of parameterizing policies as a softmax in action preferences is that the approximate policy can approach a deterministic policy, whereas with $\varepsilon$-greedy action selection over action values there is always a probability $\varepsilon$ of selecting a random action. You could instead select actions according to a softmax distribution based on action values, but this alone would not allow the policy to approach a deterministic policy. The action-value estimates would converge to their corresponding true values, which differ by a finite amount, so the resulting action probabilities would settle at specific values other than 0 and 1.
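The contrast with $\varepsilon$-greedy can be sketched numerically. The example below assumes hypothetical one-hot features for three actions in a single state and an arbitrary parameter vector; `linear_softmax_policy` and `epsilon_greedy_policy` are illustrative names. Scaling $\theta$ stands in for what learning could do to the preferences, and is not itself a learning rule.

```python
import numpy as np

def linear_softmax_policy(theta, features):
    """pi(.|s, theta) with h(s, a, theta) = theta^T x(s, a).

    `features` holds one row x(s, a) per action available in state s.
    """
    h = features @ theta
    exp_h = np.exp(h - h.max())  # shifted for numerical stability
    return exp_h / exp_h.sum()

def epsilon_greedy_policy(action_values, epsilon):
    """Action probabilities under epsilon-greedy selection for one state."""
    n = len(action_values)
    probs = np.full(n, epsilon / n)                       # uniform exploration mass
    probs[int(np.argmax(action_values))] += 1.0 - epsilon  # remaining mass on greedy action
    return probs

x_s = np.eye(3)                      # hypothetical one-hot features x(s, a)
theta = np.array([1.0, 0.5, 0.0])    # hypothetical policy parameter

# Widening the preference gaps pushes the greedy probability toward 1.
greedy_probs = [linear_softmax_policy(c * theta, x_s).max() for c in (1, 10, 100)]

# Under epsilon-greedy, each non-greedy action keeps probability epsilon / |A|.
eg = epsilon_greedy_policy(np.array([1.0, 0.5, 0.0]), epsilon=0.1)
```

Because preferences are unbounded, the softmax can place arbitrarily little probability on non-preferred actions, while the $\varepsilon$-greedy probabilities are bounded below by $\varepsilon / |A(s)|$ for a fixed $\varepsilon$.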

If the softmax distribution includes a temperature parameter, the temperature can be reduced over time to approach a deterministic policy. In practice, however, it would be difficult to choose the reduction schedule, or even the initial temperature, without more prior knowledge of the true action values than we would like to assume.
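The effect of a temperature can be seen in a short sketch, assuming the common form in which the action values are divided by a temperature $T > 0$ before the softmax is applied. The function name, the action-value estimates, and the list of temperatures are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax_with_temperature(action_values, temperature):
    """Softmax over q / T; a smaller T concentrates mass on the greedy action."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    z = np.asarray(action_values, dtype=float) / temperature
    exp_z = np.exp(z - z.max())  # shifted for numerical stability
    return exp_z / exp_z.sum()

q = np.array([1.0, 0.9, 0.5])  # hypothetical action-value estimates

# Policies at decreasing temperatures: near-uniform first, near-greedy last.
policies = {T: softmax_with_temperature(q, T) for T in (10.0, 1.0, 0.1, 0.01)}
```

How quickly the policy sharpens depends on the gaps between the action values relative to $T$, which is why a sensible schedule is hard to pick without knowing the scale of those values in advance.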