# Chapter 13: Policy Gradient Methods

## Intro:

So far, all methods have been action-value methods: they learn the values of actions and then select actions based on those estimates. Now we consider methods that learn a *parameterized policy* that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but it is not required for action selection. The policy's parameter vector is denoted $\theta \in \mathbb{R}^{d'}$, so we write $\pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$ for the probability that action $a$ is taken at time $t$, given that the environment is in state $s$ at time $t$ and the parameter is $\theta$. If a method also uses a learned value function, its weight vector keeps the same notation as before ($\mathbf{w} \in \mathbb{R}^d$).
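As a concrete illustration of a policy that picks actions without consulting a value function, the following is a minimal sketch. It assumes one particular parameterization that the text above does not prescribe: a softmax over linear action preferences. The function name `softmax_policy`, the feature matrix, and the parameter values are all hypothetical, chosen only for this example.

```python
import numpy as np

def softmax_policy(theta, features):
    """Return the action probabilities pi(. | s, theta) for a single state.

    features: array of shape (num_actions, d'), one hypothetical feature
              vector per action in this state.
    theta:    policy parameter vector of shape (d',).
    """
    prefs = features @ theta        # one scalar preference per action
    prefs = prefs - prefs.max()     # shift for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

rng = np.random.default_rng(0)

# Assumed toy setup: 3 actions, d' = 2.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
theta = np.array([0.5, -0.5])

probs = softmax_policy(theta, features)
# The action is sampled from the policy alone; no value estimate is involved.
action = rng.choice(len(probs), p=probs)
```

Changing `theta` changes the action probabilities directly, which is what a policy-gradient update acts on.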

The methods considered here learn the policy parameter based on the gradient of some scalar performance measure $J(\theta)$ with respect to the policy parameter. They seek to maximize performance, so their updates approximate gradient ascent in $J$:

$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$

Here $\widehat{\nabla J(\theta_t)} \in \mathbb{R}^{d'}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument $\theta_t$. All methods that follow this general schema are called *policy-gradient methods*, whether or not they also learn an approximate value function. Methods that learn approximations to both the policy and the value function are called *actor-critic methods*, where the *actor* is the learned policy and the *critic* is the learned value function.
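The update rule itself can be sketched in a few lines. The example below is not a reinforcement learning algorithm: it assumes a toy quadratic performance measure with a known maximizer `theta_star`, and it simulates the stochastic estimate by adding zero-mean Gaussian noise to the true gradient, so that the estimate's expectation equals the gradient. All names, the step size, and the noise scale are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])  # assumed maximizer of the toy J

def J(theta):
    """Toy performance measure, maximized at theta_star."""
    return -np.sum((theta - theta_star) ** 2)

def grad_J_estimate(theta):
    """Noisy gradient estimate whose expectation is the true gradient."""
    true_grad = -2.0 * (theta - theta_star)
    noise = rng.normal(scale=0.5, size=theta.shape)  # zero-mean noise
    return true_grad + noise

theta = np.zeros(2)
alpha = 0.05  # step size
for t in range(500):
    # theta_{t+1} = theta_t + alpha * (estimate of the gradient of J at theta_t)
    theta = theta + alpha * grad_J_estimate(theta)
```

With a constant step size the iterate does not settle exactly; it keeps fluctuating in a small neighborhood of `theta_star` because each step still contains noise.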

## 13.1: Policy Approximation and its Advantages

