<img src="images/ppo_cover.jpg" width=25% align="right"/>
# Proximal Policy Optimization Algorithms
Author: Jin Yeom (jinyeom@utexas.edu)  
Original authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

## Contents
- [Configuration](#Configuration)
- [Environment](#Environment)
- [Model](#Model)
- [PPO](#PPO)
- [Training](#Training)
- [References](#References)

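**[Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347)** algorithms are a family of policy gradient methods that optimize a clipped surrogate objective,

$$
L^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t\right)\right]
$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is an advantage estimate, and $\epsilon$ is the clipping range. PPO builds on [Trust Region Policy Optimization (TRPO)](https://arxiv.org/abs/1502.05477), but is simpler to implement, and the original paper reports performance on par with or better than comparable policy gradient methods on its benchmark tasks.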
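As a minimal sketch of how the clipped objective could be computed in PyTorch, following the form given in the PPO paper, and assuming per-step log probabilities under the old and new policies and advantage estimates are already available. The function name `clipped_surrogate`, the default `eps=0.2`, and the toy values are illustrative assumptions, not part of this notebook:

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Return the clipped surrogate objective (a quantity to be maximized)."""
    # Probability ratio r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t),
    # computed from log probabilities for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Only the ratio is clipped; the advantage multiplies the clipped ratio.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise minimum gives a pessimistic bound on the unclipped term.
    return torch.min(unclipped, clipped).mean()

# Toy values, chosen only for illustration: the ratios are 1.8, 1.0, and 0.2.
old_lp = torch.log(torch.tensor([0.5, 0.5, 0.5]))
new_lp = torch.log(torch.tensor([0.9, 0.5, 0.1]))
adv = torch.tensor([1.0, 1.0, -1.0])
objective = clipped_surrogate(new_lp, old_lp, adv)
```

With these values the per-step terms are 1.2, 1.0, and -0.8, so the first and last steps are the ones where clipping takes effect. For gradient descent, the loss would be the negated objective.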

In [2]:
import numpy as np
import gym
import torch
from torch import nn
from torch import optim
from torch.nn import functional as F
from torch.distributions import Categorical
from torchsummary import summary
from matplotlib import pyplot as plt
from tqdm import tnrange
from tqdm import tqdm_notebook as tqdm
from IPython import display

## Configuration

In [27]:
# random seed
SEED = 42
# discount rate
GAMMA = 0.99
# learning rate
ALPHA = 3e-3
# number of episodes of training
N_EPISODES = 2000
# number of iterations per episode
N_ITERS = 20

In [4]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device =", DEVICE)

device = cpu


## Environment

In [5]:
env = gym.make("CartPole-v0")
env.seed(SEED)
print(env.observation_space)
print(env.action_space)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Box(4,)
Discrete(2)


In [37]:
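class Episode(object):
    """Container for the transitions collected during one episode."""

    def __init__(self):
        self.observations = [] # agent observation (same as state in MDP)
        self.actions = [] # selected action (index)
        self.act_probs = [] # probability of each action
        self.log_probs = [] # log probability of the selected action
        self.rewards = [] # extrinsic reward signals
        
    def returns(self, gamma, normalize=True):
        """Discounted returns G_t = r_t + gamma * G_{t+1}, optionally standardized."""
        returns = [0.0]
        for i, r in enumerate(reversed(self.rewards)):
            returns.append(r + gamma * returns[i])
        returns = torch.tensor(list(reversed(returns[1:])))
        # std() of a single element is NaN, so only normalize longer episodes
        if normalize and len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns
            
    def append(self, observation, action, reward, act_probs=None, log_prob=None):
        self.observations.append(observation)
        self.actions.append(action)
        self.rewards.append(reward)
        # optional extras, stored only when the caller provides them
        if act_probs is not None:
            self.act_probs.append(act_probs)
        if log_prob is not None:
            self.log_probs.append(log_prob)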
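
The backward recursion used in `Episode.returns` can be checked by hand on a short reward sequence. The sketch below is a plain-Python stand-in (the name `discounted_returns` is hypothetical and it skips the normalization step), with values chosen so the arithmetic is easy to follow:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step, scanning backwards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Three steps of reward 1.0 with gamma = 0.5:
# G_2 = 1.0, G_1 = 1.0 + 0.5 * 1.0 = 1.5, G_0 = 1.0 + 0.5 * 1.5 = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```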

## Model

In [7]:
class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(obs_dim, 128)
        self.fc_actor = nn.Linear(128, act_dim)
        self.fc_critic = nn.Linear(128, 1)
        
    def actor(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc_actor(x), dim=-1)
    
    def critic(self, x):
        x = F.relu(self.fc1(x))
        return self.fc_critic(x)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        act_probs = F.softmax(self.fc_actor(x), dim=-1)
        value = self.fc_critic(x)
        return act_probs, value

In [8]:
def sel_action(act_probs):
    dist = Categorical(act_probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)
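
To illustrate what `sel_action` returns, here is a small sketch with a fixed two-action distribution. The probabilities and the seed are arbitrary assumptions made for the example; the point is that the returned log probability belongs to the sampled action only:

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)  # only to make the sketch repeatable
probs = torch.tensor([0.25, 0.75])
dist = Categorical(probs)
action = dist.sample()            # 0-dim tensor holding the sampled index
log_prob = dist.log_prob(action)  # log of the probability of that index

# The log probability matches the entry of `probs` selected by the action.
assert torch.isclose(log_prob, torch.log(probs[action]))
```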

In [9]:
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.n
model = ActorCritic(obs_dim, act_dim).to(DEVICE)
summary(model, (obs_dim,))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                  [-1, 128]             640
            Linear-2                    [-1, 2]             258
            Linear-3                    [-1, 1]             129
Total params: 1,027
Trainable params: 1,027
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
----------------------------------------------------------------
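
The parameter counts in the summary can be reproduced by hand: a fully connected layer has one weight per input-output pair plus one bias per output. A short sketch (the helper name `linear_params` is hypothetical):

```python
def linear_params(in_features, out_features):
    # weights (in * out) plus biases (out)
    return in_features * out_features + out_features

obs_dim, act_dim, hidden = 4, 2, 128  # CartPole sizes and the hidden width used above
counts = [
    linear_params(obs_dim, hidden),  # fc1
    linear_params(hidden, act_dim),  # fc_actor
    linear_params(hidden, 1),        # fc_critic
]
print(counts, sum(counts))  # [640, 258, 129] 1027
```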


## Training

In [10]:
optimizer = optim.Adam(model.parameters(), lr=ALPHA)

In [47]:
def update(model, episode):
    # placeholder: the PPO update step is not implemented yet
    pass
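
One way the update step might eventually look is sketched below. This is not the notebook's implementation: `ppo_update`, `TinyActorCritic`, and the hyperparameters `eps`, `value_coef`, and `n_epochs` are all assumptions introduced for illustration. The sketch also uses a simple return-minus-value advantage instead of the truncated generalized advantage estimation described in the PPO paper, and it omits the entropy bonus.

```python
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.distributions import Categorical

class TinyActorCritic(nn.Module):
    """Stand-in with the same (act_probs, value) interface as ActorCritic."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 16)
        self.fc_actor = nn.Linear(16, act_dim)
        self.fc_critic = nn.Linear(16, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc_actor(x), dim=-1), self.fc_critic(x)

def ppo_update(model, optimizer, obs, actions, old_log_probs, returns,
               eps=0.2, value_coef=0.5, n_epochs=4):
    """Hypothetical PPO step: several epochs over one batch of transitions."""
    for _ in range(n_epochs):
        act_probs, values = model(obs)
        values = values.squeeze(-1)
        new_log_probs = Categorical(act_probs).log_prob(actions)
        # Assumed advantage estimate: return minus a detached value baseline.
        advantages = returns - values.detach()
        ratio = torch.exp(new_log_probs - old_log_probs)
        surrogate = torch.min(ratio * advantages,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
        # Minimize the negated clipped objective plus a value regression term.
        loss = -surrogate.mean() + value_coef * F.mse_loss(values, returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()

# Toy batch with random data, only to show the call shape.
torch.manual_seed(0)
toy_model = TinyActorCritic(obs_dim=4, act_dim=2)
toy_optimizer = optim.Adam(toy_model.parameters(), lr=3e-3)
toy_obs = torch.randn(8, 4)
with torch.no_grad():
    toy_dist = Categorical(toy_model(toy_obs)[0])
    toy_actions = toy_dist.sample()
    toy_old_log_probs = toy_dist.log_prob(toy_actions)
toy_returns = torch.randn(8)
final_loss = ppo_update(toy_model, toy_optimizer, toy_obs, toy_actions,
                        toy_old_log_probs, toy_returns)
```

The old log probabilities are recorded under `torch.no_grad()` because they act as constants during the update; only the new policy's log probabilities carry gradients.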

In [48]:
def train(env, model, update, n_ep, n_iter):
    model.train()
    # keep inputs on the same device as the model parameters
    device = next(model.parameters()).device
    for ep in tnrange(n_ep, desc="episode"):
        episode = Episode()
        obs = env.reset()
        for i in range(n_iter):
            obs = torch.as_tensor(obs, dtype=torch.float32, device=device)
            act_probs, value = model(obs)
            action, act_log_prob = sel_action(act_probs)
            next_obs, reward, done, info = env.step(action)
            episode.append(obs, action, reward)
            obs = next_obs
            if done:
                break
                
        # update the model
        update(model, episode)
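
The loop above relies on the `reset()` and `step()` return shapes of the Gym version this notebook was run with. Gym 0.26 and later (and Gymnasium) return `(obs, info)` from `reset()` and a 5-tuple `(obs, reward, terminated, truncated, info)` from `step()`. A minimal sketch of two hypothetical helpers, `reset_env` and `step_env`, that accept either shape:

```python
def reset_env(env):
    """Return only the observation, whichever reset() signature env uses."""
    out = env.reset()
    # Newer API returns (observation, info); older API returns the observation alone.
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out

def step_env(env, action):
    """Return (obs, reward, done, info) for both 4-tuple and 5-tuple step() results."""
    out = env.step(action)
    if len(out) == 5:
        obs, reward, terminated, truncated, info = out
        # Treat either termination or truncation as the end of the episode.
        return obs, reward, terminated or truncated, info
    return out
```

Calling `reset_env(env)` and `step_env(env, action)` in place of `env.reset()` and `env.step(action)` would leave the rest of the loop unchanged.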

In [49]:
train(env, model, update, 1, 50)





## References

- https://arxiv.org/abs/1707.06347 (Proximal Policy Optimization Algorithms)
- https://arxiv.org/abs/1502.05477 (Trust Region Policy Optimization)