# **Policy Gradient**

Policy gradient methods in reinforcement learning (RL) are a class of algorithms that directly optimize the agent’s policy by estimating the gradient of the expected cumulative reward with respect to the policy parameters. Policy-based RL algorithms such as PPO and A2C build on the policy gradient.

The action space in RL can be discrete or continuous. Value-based methods such as Q-learning, SARSA, and DQN struggle with continuous action spaces: they need a value for every action and must maximize over all of them, so the curse of dimensionality becomes a serious problem when we build the Q-table or the DNN model.

Policy gradient methods parameterize the policy and estimate the gradient of the cumulative reward with respect to the policy parameters. This gradient is used directly to optimize the policy by updating the parameters. As a result, even when we don't know the number of available actions, or when the action space is continuous, we can still model the RL problem and sidestep the curse of dimensionality, which allows better efficiency and more flexible exploration. Policy gradient is also a foundation of RLHF (Reinforcement Learning from Human Feedback), which has contributed a lot to AI chatbots.
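To make the idea of a parameterized policy over a continuous action space concrete, here is a minimal sketch of a one-dimensional Gaussian policy whose mean is a linear function of the state. The linear form, the fixed standard deviation `sigma`, and all names (`weight`, `bias`, `gaussian_policy_sample`, `gaussian_policy_log_prob`) are assumptions chosen for illustration only.

```python
import math
import random

def gaussian_policy_sample(state, weight, bias, sigma, rng):
    """Sample a continuous action a ~ N(weight * state + bias, sigma^2)."""
    mean = weight * state + bias
    return rng.gauss(mean, sigma)

def gaussian_policy_log_prob(action, state, weight, bias, sigma):
    """log pi_theta(a | s) for the Gaussian policy defined above."""
    mean = weight * state + bias
    z = (action - mean) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
weight, bias = 0.8, -0.1   # hypothetical policy parameters (theta)
state, sigma = 1.5, 0.3    # hypothetical observation and fixed standard deviation

# The policy never enumerates actions; it only needs to sample and score them.
actions = [gaussian_policy_sample(state, weight, bias, sigma, rng) for _ in range(5)]
for a in actions:
    print(round(a, 3), round(gaussian_policy_log_prob(a, state, weight, bias, sigma), 3))
```

Because the policy only samples actions and evaluates their log-probabilities, no table over the action space is needed.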

### **Deterministic / Stochastic policy**

Given an observation:

* **Deterministic policy** returns one specific action to take.

* **Stochastic policy** returns a probability distribution that gives the probability of taking each action.

With a stochastic policy, the same observation can lead to different actions in different iterations, which improves the agent's ability to explore the action space and helps it avoid getting trapped in local optima.

Policy gradient methods tend to use a stochastic policy and let the agent choose an action by sampling from the action probability distribution, which combines exploration and exploitation.
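The difference between the two kinds of policy can be sketched in a few lines. The example below assumes a discrete action space with three actions and a softmax over hypothetical scores (`logits`); neither the softmax form nor the numbers come from the text above.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def deterministic_policy(logits):
    """Always return the action with the highest score."""
    return max(range(len(logits)), key=lambda a: logits[a])

def stochastic_policy(logits, rng):
    """Sample an action from the softmax distribution over the scores."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [1.0, 2.0, 0.5]  # hypothetical scores for 3 actions given one observation

greedy_actions = [deterministic_policy(logits) for _ in range(1000)]
sampled_actions = [stochastic_policy(logits, rng) for _ in range(1000)]

print(set(greedy_actions))           # the same action every time
print(sorted(set(sampled_actions)))  # several different actions for the same observation
```

The deterministic policy never tries the lower-scored actions, while the stochastic one visits them in proportion to their probability.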

### **Terminology of policy gradient**

* $\pi_{\theta}$ is a policy with parameter set $\theta$; usually $\theta$ represents the weights of the neural network that models the policy.

* $\tau$ represents a trajectory, that is, the sequence of states and actions from the initial state to the terminal state.

* $J(\theta)$ represents the expected return when following the policy $\pi_{\theta}$. It is also the objective function for policy gradient ascent.

* $R(\tau)$ refers to the total reward over trajectory $\tau$.

## **Policy update**

As described above, the objective function is the expected return,

$$J(\theta)=E_{\tau\sim\pi_{\theta}}[R(\tau)]=E_{\tau\sim\pi_{\theta}}[\sum_{t=1}^{T}r(s_t,a_t)]$$

Given $N$ sampled trajectories, with $(s_{i,t},a_{i,t})$ denoting the state and action at step $t$ of the $i$-th trajectory, the objective function can be approximated as,

$$J(\theta)=E_{\tau\sim\pi_{\theta}}[\sum_{t=1}^{T}r(s_t,a_t)]\approx\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}r(s_{i,t},a_{i,t})$$
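The sample average above can be checked on a toy problem. The sketch below assumes a deliberately simple setup that is not part of the original text: a one-parameter Bernoulli policy with $\pi_{\theta}(a=1|s)=\text{sigmoid}(\theta)$ in every state, and a reward $r(s,a)=a$. Under these assumptions the exact value $J(\theta)=T\cdot\text{sigmoid}(\theta)$ is known, so the Monte Carlo estimate can be compared against it.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_return(theta, horizon, rng):
    """Roll out one trajectory and return R(tau), the sum of its rewards."""
    p_one = sigmoid(theta)          # pi_theta(a=1 | s), identical in every state
    total = 0.0
    for _ in range(horizon):
        action = 1 if rng.random() < p_one else 0
        total += float(action)      # toy reward (assumption): r(s, a) = a
    return total

def estimate_objective(theta, horizon, num_trajectories, rng):
    """Monte Carlo estimate of J(theta) = E[R(tau)] from N sampled trajectories."""
    returns = [sample_return(theta, horizon, rng) for _ in range(num_trajectories)]
    return sum(returns) / num_trajectories

rng = random.Random(0)
theta, horizon = 0.5, 5
estimate = estimate_objective(theta, horizon, 20000, rng)
exact = horizon * sigmoid(theta)    # closed form exists only because the toy is so simple
print(estimate, exact)
```

In a realistic environment no closed form is available, and the sample average is the only practical estimate of $J(\theta)$.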

The goal of the policy update is to find the optimal parameters $\theta^{*}$,

$$\theta^{*}=\arg\max_{\theta}E_{\tau\sim\pi_{\theta}}[\sum_{t=1}^{T}r(s_t,a_t)]$$

To optimize the objective function, we need its gradient with respect to $\theta$,

$$\nabla_{\theta}J(\theta)=\nabla_{\theta}E_{\tau\sim\pi_{\theta}}[\sum_{t=1}^{T}r(s_t,a_t)]$$

The expected return can be written as $\sum_{\tau}P(\tau|\theta)R(\tau)$ in the discrete case and as $\int_{\tau}P(\tau|\theta)R(\tau)\,d\tau$ in the continuous case, so the gradient of the objective function can be rewritten as,

$$\nabla_{\theta}J(\theta)=\nabla_{\theta}\sum_{\tau}P(\tau|\theta)R(\tau)$$  
or  
$$\nabla_{\theta}J(\theta)=\nabla_{\theta}\int_{\tau}P(\tau|\theta)R(\tau)\,d\tau$$

Breaking the probability of a trajectory down into the transitions from the initial state to the terminal state gives,

$$P(\tau|\theta)=\rho_0(s_1)\prod_{t=1}^{T}P(s_{t+1}|s_t,a_t)\pi_{\theta}(a_t|s_t)$$

where $\rho_0$ is the initial state distribution. To simplify the calculation, we take the logarithm of both sides, which turns the product into a sum,

$$\log(P(\tau|\theta))=\log{\rho_0(s_1)}+\sum_{t=1}^{T}[\log(P(s_{t+1}|s_t,a_t))+\log(\pi_{\theta}(a_t|s_t))]$$

Taking the gradient with respect to $\theta$,

$$\nabla_{\theta}\log(P(\tau|\theta))=\nabla_{\theta}\log{\rho_0(s_1)}+\nabla_{\theta}\sum_{t=1}^{T}[\log(P(s_{t+1}|s_t,a_t))+\log(\pi_{\theta}(a_t|s_t))]$$

The initial state distribution and the transition probabilities do not depend on $\theta$, so their gradients vanish and only the policy terms remain,

$$\nabla_{\theta}\log(P(\tau|\theta))=\nabla_{\theta}\sum_{t=1}^{T}[\log(\pi_{\theta}(a_t|s_t))]$$
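The fact that only the policy terms contribute to the gradient of the log-probability can be verified numerically. The sketch below assumes a one-parameter Bernoulli policy with $\pi_{\theta}(a=1|s)=\text{sigmoid}(\theta)$, for which $\nabla_{\theta}\log\pi_{\theta}(a|s)=a-\text{sigmoid}(\theta)$. The trajectory, the initial-state probability, and the transition probabilities are made-up constants used only for this check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_pi(theta, action):
    """log pi_theta(a | s) for a Bernoulli policy with P(a=1) = sigmoid(theta)."""
    p = sigmoid(theta)
    return math.log(p) if action == 1 else math.log(1.0 - p)

def log_traj_prob(theta, actions, init_prob, transition_probs):
    """log P(tau | theta): initial-state term + dynamics terms + policy terms."""
    dynamics = math.log(init_prob) + sum(math.log(p) for p in transition_probs)
    policy = sum(log_pi(theta, a) for a in actions)
    return dynamics + policy

def score(theta, actions):
    """Analytic sum over t of grad_theta log pi_theta(a_t | s_t)."""
    return sum(a - sigmoid(theta) for a in actions)

theta = 0.3
actions = [1, 0, 1, 1]                   # hypothetical sampled actions
init_prob = 0.5                          # hypothetical rho_0 value
transition_probs = [0.9, 0.2, 0.7, 0.6]  # hypothetical P(s_{t+1} | s_t, a_t) values

# Central finite difference of the full log-probability, dynamics terms included.
eps = 1e-6
numeric = (log_traj_prob(theta + eps, actions, init_prob, transition_probs)
           - log_traj_prob(theta - eps, actions, init_prob, transition_probs)) / (2 * eps)
analytic = score(theta, actions)
print(numeric, analytic)
```

The two numbers agree even though the finite difference includes the dynamics terms, which is why the environment model is not needed to compute the gradient.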

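Using the identity $\nabla_{\theta}P(\tau|\theta)=P(\tau|\theta)\nabla_{\theta}\log(P(\tau|\theta))$, the gradient of the objective function can be rewritten again as an expectation, which in turn can be approximated with $N$ sampled trajectories,

$$\nabla_{\theta}J(\theta)=E_{\tau\sim\pi_{\theta}}[(\sum_{t=1}^{T}\nabla_{\theta}\log(\pi_{\theta}(a_t|s_t)))(\sum_{t=1}^{T}r(s_t,a_t))]\approx\frac{1}{N}\sum_{i=1}^{N}[\sum_{t=1}^{T}\nabla_{\theta}\log(\pi_{\theta}(a_{i,t}|s_{i,t}))][\sum_{t=1}^{T}r(s_{i,t},a_{i,t})]$$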
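The sample-based gradient estimator above can be sketched directly in code. As a simplifying assumption, the example uses a one-parameter Bernoulli policy with $\pi_{\theta}(a=1|s)=\text{sigmoid}(\theta)$ and a toy reward $r(s,a)=a$, so that the exact gradient $T\cdot\text{sigmoid}(\theta)(1-\text{sigmoid}(\theta))$ is available for comparison. The function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_trajectory(theta, horizon, rng):
    """Sample the actions and rewards of one trajectory under pi_theta."""
    p_one = sigmoid(theta)
    actions = [1 if rng.random() < p_one else 0 for _ in range(horizon)]
    rewards = [float(a) for a in actions]   # toy reward (assumption): r(s, a) = a
    return actions, rewards

def policy_gradient_estimate(theta, horizon, num_trajectories, rng):
    """Average of (sum_t grad log pi(a_t|s_t)) * (sum_t r(s_t, a_t)) over N trajectories."""
    p_one = sigmoid(theta)
    total = 0.0
    for _ in range(num_trajectories):
        actions, rewards = sample_trajectory(theta, horizon, rng)
        score_sum = sum(a - p_one for a in actions)  # grad_theta log pi for this policy
        total += score_sum * sum(rewards)
    return total / num_trajectories

rng = random.Random(0)
theta, horizon = 0.5, 5
estimate = policy_gradient_estimate(theta, horizon, 50000, rng)
exact = horizon * sigmoid(theta) * (1.0 - sigmoid(theta))  # known only for this toy problem
print(estimate, exact)
```

The estimator uses only sampled actions and rewards; it never evaluates the transition probabilities.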

With the gradient of the objective function, we can update the policy parameters $\theta$ by,

$$\theta_{k+1} \leftarrow \theta_{k}+\alpha\nabla_{\theta}J(\theta)|_{\theta_{k}}$$

where $k$ is the iteration count and $\alpha$ is the learning rate.
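Putting the pieces together, the update rule above amounts to a simple gradient ascent loop. The following is a minimal sketch under the same kind of toy assumptions as before (a one-parameter Bernoulli policy with $\pi_{\theta}(a=1|s)=\text{sigmoid}(\theta)$ and reward $r(s,a)=a$); the hyperparameter values and function names are illustrative choices, not recommendations.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def policy_gradient_estimate(theta, horizon, num_trajectories, rng):
    """Sample-based estimate of grad_theta J(theta) for the toy Bernoulli policy."""
    p_one = sigmoid(theta)
    total = 0.0
    for _ in range(num_trajectories):
        actions = [1 if rng.random() < p_one else 0 for _ in range(horizon)]
        total_reward = float(sum(actions))           # toy reward (assumption): r(s, a) = a
        score_sum = sum(a - p_one for a in actions)  # sum_t grad log pi(a_t | s_t)
        total += score_sum * total_reward
    return total / num_trajectories

def train(theta, horizon, num_trajectories, learning_rate, num_iterations, rng):
    """Gradient ascent: theta <- theta + alpha * estimated gradient."""
    history = [theta]
    for _ in range(num_iterations):
        grad = policy_gradient_estimate(theta, horizon, num_trajectories, rng)
        theta = theta + learning_rate * grad
        history.append(theta)
    return theta, history

rng = random.Random(0)
final_theta, history = train(theta=0.0, horizon=5, num_trajectories=200,
                             learning_rate=0.1, num_iterations=200, rng=rng)
print(sigmoid(history[0]), sigmoid(final_theta))  # probability of the rewarded action, before and after
```

Since the reward favors action 1, the probability that the policy assigns to it should rise toward 1 as training proceeds.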