# Proximal Policy Optimization

paper: [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf)

Core idea: avoid changing the policy network too much in a single update.

## Loss

### 1. Trust Region Objective

### 2. Clipped Surrogate Objective (our code)

$$r_t(\theta) = \frac{\pi_{\theta} (a_t|s_t) }{  \pi_{\theta_{old}} (a_t|s_t) }$$

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t[   \text{min}(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t )   ]$$

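The clip removes the incentive to push $r_t(\theta)$ outside the interval $[1 - \epsilon, 1 + \epsilon]$, which discourages policy changes beyond that comfort zone.

Note that the $\min$ is taken element-wise (per sample) between the unclipped term $r_t(\theta) \hat{A}_t$ and the clipped term, and that the quantity being optimized is $r_t(\theta) \hat{A}_t$.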
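
To make the effect of the $\min$ concrete, here is a minimal plain-Python sketch of the per-sample objective. The helper names `clip` and `clipped_surrogate` and the example numbers are illustrative assumptions, not part of the notebook code.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_surrogate(ratio, advantage, eps=0.1):
    """Per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


# (ratio, advantage) pairs covering the four interesting cases (made-up numbers).
cases = [
    (1.5, +1.0),  # good action, ratio already too high: capped at (1 + eps) * A
    (0.5, +1.0),  # good action, ratio too low: the unclipped term is kept
    (1.5, -1.0),  # bad action, ratio too high: the unclipped term is kept
    (0.5, -1.0),  # bad action, ratio already too low: capped at (1 - eps) * A
]
for ratio, adv in cases:
    print(ratio, adv, clipped_surrogate(ratio, adv))
```

In the third case a plain clamp would report $-1.1$, while the $\min$ keeps the more pessimistic $-1.5$, so the objective still penalizes a policy that has made a bad action much more likely.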


## Generalized Advantage Estimation (GAE)
GAE blends Monte Carlo and TD estimates; see details [here](https://medium.com/@jonathan_hui/rl-policy-gradients-explained-advanced-topic-20c2b81a9a8b)

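The final advantage function for GAE is an exponentially weighted average of the $k$-step advantage estimates $\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}$:
$$\hat{A}_t^{GAE(\gamma, \lambda)} := (1 - \lambda) ( \hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \dots )$$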

Both the n-step return and the $\lambda$-return trade off bias against variance in the estimator.


When $\lambda$ is 1, GAE reduces to the Monte Carlo estimate. When $\lambda$ is 0, it reduces to TD with a one-step look-ahead.

$$\text{GAE}(\gamma, 0): \hat{A}_t := \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\text{GAE}(\gamma, 1): \hat{A}_t := \sum_{l=0}^{\infty}{\gamma^l \delta_{t+l}} = \sum_{l=0}^{\infty}{\gamma^l r_{t+l}} - V(s_t)$$
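
The two limiting cases can be checked numerically. The sketch below is plain Python with made-up rewards and value estimates, and it assumes a finite episode whose terminal state has value 0, so the infinite sums stop at the end of the episode.

```python
def gae(deltas, gamma, lmbda):
    """Backward recursion gae_t = delta_t + gamma * lmbda * gae_{t+1}."""
    out = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lmbda * running
        out[t] = running
    return out


gamma = 0.98
rewards = [1.0, 1.0, 1.0]      # made-up rewards r_0..r_2
values = [0.5, 0.4, 0.3, 0.0]  # made-up V(s_0)..V(s_3); s_3 is terminal, so V = 0
T = len(rewards)

# One-step TD residuals delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]

# lambda = 0: each advantage is just the TD residual.
td_advantages = gae(deltas, gamma, lmbda=0.0)

# lambda = 1: the value terms telescope, leaving discounted return minus V(s_t).
mc_advantages = gae(deltas, gamma, lmbda=1.0)
mc_returns = [sum(gamma ** l * rewards[t + l] for l in range(T - t)) for t in range(T)]

print(td_advantages, deltas)
print(mc_advantages, [mc_returns[t] - values[t] for t in range(T)])
```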



### 1. n-step return

See supplementary material section on page 143:15 [here](https://arxiv.org/pdf/1804.02717.pdf)

The n-step return is defined as $R_t^{n} = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^n V(s_{t+n})$. It is computed by truncating the sum of rewards after $n$ steps and bootstrapping from the value estimate of the state reached. Compared with the Monte Carlo return, it has lower variance at the cost of introducing some bias.

+ The 1-step return $R_t^{1} = r_{t} + \gamma V(s_{t+1})$ is commonly used in Q-learning. It provides a **biased** but **lower-variance** estimator.

+ The Monte Carlo return is $R_t^{\infty} = \sum_{l=0}^{T-t} \gamma^l r_{t+l}$. It provides an **unbiased** sample of the expected return at the given step, but with **high variance**.

+ $n$ therefore acts as a trade-off between bias and variance for the value estimator.
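
The definitions above can be sketched in a few lines of plain Python. The function name `n_step_return`, the convention that `values` carries one extra entry for the terminal state, and the numbers are all assumptions made for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^n = sum_{l<n} gamma^l r_{t+l} + gamma^n V(s_{t+n}).

    `values` has one more entry than `rewards`; the last entry is the
    value of the terminal state, assumed to be 0 here.
    """
    horizon = len(rewards)
    n = min(n, horizon - t)  # past the episode end, R_t^n is the Monte Carlo return
    discounted = sum(gamma ** l * rewards[t + l] for l in range(n))
    return discounted + gamma ** n * values[t + n]


rewards = [1.0, 1.0, 1.0]      # made-up rewards
values = [0.5, 0.4, 0.3, 0.0]  # made-up value estimates, terminal value 0
gamma = 0.5

print(n_step_return(rewards, values, t=0, n=1, gamma=gamma))   # 1-step (TD) return
print(n_step_return(rewards, values, t=0, n=2, gamma=gamma))   # 2-step return
print(n_step_return(rewards, values, t=0, n=10, gamma=gamma))  # Monte Carlo return
```

Small `n` leans on the (possibly wrong) value estimate, while large `n` leans on the sampled rewards.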


### 2. $\lambda$-return

$$R_t^{(\lambda)} = (1-\lambda) \sum_{n=1}^{\infty} {\lambda^{n-1} R_t^n} = (1-\lambda)(R_t^1 + \lambda R_t^2 + \lambda^2 R_t^3 + \dots) $$
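
As a sanity check on how the $\lambda$-return relates to GAE, the plain-Python sketch below compares $R_t^{(\lambda)} - V(s_t)$ with the GAE recursion. It assumes a finite episode with terminal value 0, in which case all the weight beyond the episode end falls on the Monte Carlo return. Function names and numbers are illustrative assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return, truncated at the end of the episode."""
    n = min(n, len(rewards) - t)
    return sum(gamma ** l * rewards[t + l] for l in range(n)) + gamma ** n * values[t + n]


def lambda_return(rewards, values, t, gamma, lmbda):
    """Finite-episode lambda-return.

    Weights (1 - lmbda) * lmbda^(n-1) on R_t^n for n < N, and the leftover
    weight lmbda^(N-1) on the Monte Carlo return, where N = steps remaining.
    """
    remaining = len(rewards) - t
    total = 0.0
    for n in range(1, remaining):
        total += (1.0 - lmbda) * lmbda ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    total += lmbda ** (remaining - 1) * n_step_return(rewards, values, t, remaining, gamma)
    return total


def gae(deltas, gamma, lmbda):
    """Backward recursion gae_t = delta_t + gamma * lmbda * gae_{t+1}."""
    out, running = [0.0] * len(deltas), 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lmbda * running
        out[t] = running
    return out


gamma, lmbda = 0.98, 0.95
rewards = [1.0, 0.0, 2.0, 1.0]      # made-up rewards
values = [0.7, 0.2, 0.9, 0.4, 0.0]  # made-up value estimates, terminal value 0
deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(len(rewards))]

advantages = gae(deltas, gamma, lmbda)
for t in range(len(rewards)):
    print(t, lambda_return(rewards, values, t, gamma, lmbda) - values[t], advantages[t])
```

Under these assumptions the two columns agree, so GAE can be read as the $\lambda$-return with the baseline $V(s_t)$ subtracted.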




### GAE Code
```
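import torch

def GAE(advantages, gamma, lmbda):
    # `advantages` holds the one-step TD residuals delta_t, ordered by time step.
    gae_advantages = torch.zeros_like(advantages)
    gae = 0

    # Walk backwards so that `gae` holds gae_{t+1} when gae_t is computed.
    for ri in reversed(range(len(advantages))):
        gae = gae * gamma * lmbda + advantages[ri]
        gae_advantages[ri] = gae
    return gae_advantages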
```

The code implements the following recursion, where $\delta_t$ is the one-step TD residual:

$$gae_t = \delta_{t} + \gamma \lambda \delta_{t+1} + \gamma^2 \lambda^2 \delta_{t+2} + ... = \delta_t + \gamma \lambda gae_{t+1} $$


## Points
+ PPO builds on Advantage Actor-Critic, so it trains on partial samples (short rollouts) rather than waiting for complete episodes
+ The $\log \pi_{\theta}(a_t|s_t)$ term in the policy-gradient loss is replaced by the probability ratio $r_t(\theta)$
+ Each partial sample is trained on several times, and every update produces a new policy. Each new policy is compared with the old policy that collected the sample

## Questions
1. a/b == exp(log(a)-log(b)), why do we bother to use log and exp?

2. Why should we select the minimum instead of simply clamping the result?
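
One possible answer to the first question, offered as an assumption rather than the notebook's own reasoning: working with log-probabilities avoids underflow when an action probability is a product of many small factors, for example with many independent action dimensions. With the single two-way action used here the two forms are numerically almost identical. The function names and numbers below are illustrative.

```python
import math


def ratio_direct(new_probs, old_probs):
    """Ratio of joint probabilities computed as product / product."""
    return math.prod(new_probs) / math.prod(old_probs)


def ratio_log_space(new_probs, old_probs):
    """Same ratio computed as exp(sum of logs - sum of logs)."""
    log_ratio = sum(math.log(p) for p in new_probs) - sum(math.log(q) for q in old_probs)
    return math.exp(log_ratio)


# A few factors: both forms agree.
new_few, old_few = [0.6, 0.3], [0.5, 0.4]
print(ratio_direct(new_few, old_few), ratio_log_space(new_few, old_few))

# Many small factors: both products underflow to 0.0, so the direct form fails.
new_many, old_many = [0.11] * 400, [0.1] * 400
try:
    print(ratio_direct(new_many, old_many))
except ZeroDivisionError:
    print("direct ratio failed: both products underflowed to 0.0")
print(ratio_log_space(new_many, old_many))  # finite, equal to 1.1 ** 400 up to rounding
```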

## 1. Import packages

```
import gym
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical
import torch.nn.functional as F
```

## 2. Define constants

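```
gamma = 0.98        # discount factor
lmbda = 0.95        # GAE lambda
num_epochs = 3000   # number of training episodes
num_rollouts = 20   # maximum number of steps in one partial sample
reward_div = 100    # rewards are divided by this value before training
k_epoch = 3         # gradient updates per partial sample
eps = 0.1           # clipping range for the probability ratio

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```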

## 3. Prepare data

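```
env = gym.make("CartPole-v0")

def get_sample(env, policy):
    """Play one episode and yield it in chunks of at most `num_rollouts` steps."""
    done = False
    s = env.reset() # (state_size, ); older gym API, where reset() returns only the observation
    while not done:
        ss, aa, rr, s_primes, done_masks = list(), list(), list(), list(), list()
        probs = list()
        for t in range(num_rollouts):
            a = policy.sample_action(torch.Tensor(s).to(device))
            s_prime, r, done, _ = env.step(a) # a is 0 or 1; older gym API, where step() returns a 4-tuple
            ss.append(s)
            aa.append(a)
            rr.append(r)
            s_primes.append(s_prime)
            done_mask = 0.0 if done else 1.0 # 0.0 stops bootstrapping from a terminal state
            done_masks.append(done_mask)
            # Probability of the chosen action under the data-collecting (old) policy, stored as a plain float
            probs.append(policy.policy(torch.Tensor(s).to(device))[a].item())
            s = s_prime
            if done:
                break

        # Stack the state lists with numpy first, which avoids a slow list-of-arrays conversion
        sample = (
            torch.Tensor(np.array(ss)).to(device),
            torch.LongTensor(aa).to(device),
            torch.Tensor(rr).to(device),
            torch.Tensor(np.array(s_primes)).to(device),
            torch.Tensor(done_masks).to(device),
            torch.Tensor(probs).to(device),
        )
        yield sample
```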

## 4. Build model

```
def GAE(advantages, gamma, lmbda):
    # `advantages` holds the one-step TD residuals delta_t, ordered by time step
    gae_advantages = torch.zeros_like(advantages)
    gae = 0

    # Walk backwards so that `gae` holds gae_{t+1} when gae_t is computed
    for ri in reversed(range(len(advantages))):
        gae = gae * gamma * lmbda + advantages[ri]
        gae_advantages[ri] = gae
    return gae_advantages


class PPO(nn.Module):
    def __init__(self):
        super(PPO, self).__init__()

        self.fc1 = nn.Linear(4, 256)    # shared hidden layer; CartPole states have 4 features
        self.fc_pi = nn.Linear(256, 2)  # policy head: one logit per action
        self.fc_v = nn.Linear(256, 1)   # value head

        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0005, betas=(0.9, 0.99))

    def policy(self, state, softmax_dim=0):
        net = F.relu(self.fc1(state)) # (B, 256) # !!! Do not forget ReLU
        net = self.fc_pi(net) # (B, 2)
        probs = F.softmax(net, dim=softmax_dim)
        return probs

    def sample_action(self, state, softmax_dim=0): # state: (4,) => nn.Linear accepts inputs without a batch dimension
        probs = self.policy(state, softmax_dim)
        m = Categorical(probs) # !!! Sampling differs between CPU and GPU, so the same seed can give different actions on different devices
        a_pred = m.sample().item()
        return a_pred # predicted action: 0 or 1

    def value(self, state):
        net = F.relu(self.fc1(state)) # !!! Do not forget ReLU
        return self.fc_v(net)

    def fit(self, sample): # sample: (s, a, r, s_prime, done_mask, old_probs) for one partial rollout
        (s, a, r, ns, done_mask, old_probs) = sample
        rewards = r / reward_div # (num_rollouts,)

        for i in range(k_epoch):
            td_target = (rewards + gamma * self.value(ns).squeeze(1) * done_mask).unsqueeze(1) # (num_rollouts, 1)
            vs = self.value(s) # (num_rollouts, 1)
            advantages = td_target - vs # (num_rollouts, 1), one-step TD residuals

            advantages = GAE(advantages, gamma, lmbda).detach() # !!! detach the advantages

            probs = self.policy(s, softmax_dim=1) # (num_rollouts, action_size=2)
            probs = probs.gather(1, a.unsqueeze(1)) # (num_rollouts, 1)

            # !!! old_probs needs unsqueeze(1): a (20, 1) tensor minus a (20,) tensor would broadcast to (20, 20)
            ratio = torch.exp(torch.log(probs) - torch.log(old_probs.unsqueeze(1))) # (num_rollouts, 1)
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages

            # Clipped policy loss plus value loss; smooth_l1_loss already returns a scalar mean
            loss = torch.mean(-torch.min(surr1, surr2) + F.smooth_l1_loss(vs, td_target.detach()))

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

ppo = PPO().to(device)
```

## 5. Train

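```
score = 0.0

for epoch in range(num_epochs):
    sample_iter = get_sample(env, ppo) # one episode, split into partial samples
    for sample in sample_iter:
        ppo.fit(sample)
        rewards = sample[2]
        score += rewards.sum().item() # accumulate as a plain float

    if epoch % 100 == 0:
        # Running average over all episodes played so far, not over a recent window
        print('Epoch %d || Average Score: %.6f'%(epoch, score / (epoch + 1)))
```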

Epoch 0 || Average Score: 21.000000
Epoch 100 || Average Score: 66.871284
Epoch 200 || Average Score: 118.323380
Epoch 300 || Average Score: 131.312286
Epoch 400 || Average Score: 141.441406
Epoch 500 || Average Score: 151.720566
Epoch 600 || Average Score: 159.149750
Epoch 700 || Average Score: 164.716125
Epoch 800 || Average Score: 168.885147
Epoch 900 || Average Score: 170.413986
Epoch 1000 || Average Score: 171.368622
Epoch 1100 || Average Score: 173.969131
Epoch 1200 || Average Score: 175.488754
Epoch 1300 || Average Score: 176.553421
Epoch 1400 || Average Score: 177.571732
Epoch 1500 || Average Score: 178.611603
Epoch 1600 || Average Score: 176.448471
Epoch 1700 || Average Score: 175.454437
Epoch 1800 || Average Score: 176.618546
Epoch 1900 || Average Score: 176.685425
Epoch 2000 || Average Score: 176.619202
Epoch 2100 || Average Score: 177.288437
Epoch 2200 || Average Score: 176.102219
Epoch 2300 || Average Score: 176.391129
Epoch 2400 || Average Score: 177.010406
Epoch 2500 || 