<img src="images/ppo_cover.jpg" width=25% align="right"/>
# Proximal Policy Optimization Algorithms
Author: Jin Yeom (jinyeom@utexas.edu)  
Original authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

## Contents
- [Implementation](#Implementation)
    - [Policy](#Policy)
    - [Environment](#Environment)
    - [PPO](#PPO)
- [Training](#Training)
- [References](#References)

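**[Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347)** algorithms are a family of policy gradient methods that optimize a clipped surrogate objective,

$$
L^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t\right)\right]
$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is an estimate of the advantage at timestep $t$, and $\epsilon$ is the clipping hyperparameter. Note that the clip is applied to the ratio $r_t(\theta)$ alone, not to the product $r_t(\theta)\hat{A}_t$. PPO builds on [Trust Region Policy Optimization (TRPO)](https://arxiv.org/abs/1502.05477), but is simpler to implement and, in the paper's experiments on simulated robotic locomotion and Atari tasks, outperforms other online policy gradient methods.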
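As a minimal sketch of the clipped objective above, the function below computes its negative (so that a standard optimizer can minimize it) from per-timestep log-probabilities and advantage estimates. The function name `clipped_surrogate_loss`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions, not part of this notebook or of the OpenAI implementation. Following the paper, the clip is applied to the probability ratio, which is then multiplied by the advantage.

```python
import torch


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Return -L^CLIP averaged over a batch of timesteps (hypothetical helper).

    new_log_probs: log pi_theta(a_t | s_t) under the policy being optimized
    old_log_probs: log pi_theta_old(a_t | s_t) under the policy that collected the data
    advantages:    advantage estimates, one per timestep
    """
    # r_t(theta), computed in log space for numerical stability.
    # The old policy is treated as a constant, so no gradient flows through it.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    # Clip the ratio (not the product) to [1 - eps, 1 + eps].
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Elementwise minimum, then negate so that minimizing maximizes the objective.
    return -torch.min(unclipped, clipped).mean()


# Toy batch: the ratios are roughly 1.8, 1.0 and 0.2.
old_log_probs = torch.log(torch.tensor([0.5, 0.5, 0.5]))
new_log_probs = torch.log(torch.tensor([0.9, 0.5, 0.1]))
advantages = torch.tensor([1.0, 1.0, -1.0])
loss = clipped_surrogate_loss(new_log_probs, old_log_probs, advantages)
print(loss.item())
```

In the toy batch, the first ratio is clipped to 1.2 because its advantage is positive, and the third is clipped to 0.8 because its advantage is negative, so moving the policy further in either direction no longer improves the objective.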

In [19]:
import torch
from torch import nn
from torch.nn import functional as F
from torch.autograd import Variable
import gym
import numpy as np

## Implementation

This implementation is based on [OpenAI's TensorFlow implementation](https://github.com/openai/baselines). In this notebook, however, we use PyTorch.

### Policy

In [23]:
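class NatureDQN(nn.Module):
    """Convolutional network with the Nature DQN layout, used here as the policy."""

    def __init__(self, in_channels=4, act_dim=18):
        super().__init__()
        # Three unpadded convolutions followed by two fully connected layers.
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        # 7 * 7 * 64 is the flattened conv3 output for 84x84 input frames.
        self.fc4 = nn.Linear(7 * 7 * 64, 512)
        self.fc5 = nn.Linear(512, act_dim)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)  # flatten to (batch, 7 * 7 * 64)
        x = F.relu(self.fc4(x))
        return self.fc5(x)  # one unnormalized score per action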

In [28]:
model = NatureDQN()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of trainable parameters: {n_params}")

Number of trainable parameters: 1693362
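The `7 * 7 * 64` input size of `fc4` and the parameter count above can be checked by hand. The sketch below does that arithmetic in plain Python, assuming 84x84 input frames (an assumption: the notebook does not state the frame size, but it is consistent with the hard-coded `7 * 7 * 64`). The helper names `conv_out`, `conv_params`, and `linear_params` are hypothetical and only serve this illustration.

```python
def conv_out(size, kernel, stride):
    """Output length along one spatial dimension of an unpadded convolution."""
    return (size - kernel) // stride + 1


def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of a square-kernel Conv2d layer."""
    return in_channels * out_channels * kernel * kernel + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer."""
    return in_features * out_features + out_features


side = 84  # assumed frame height and width
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # conv1, conv2, conv3
    side = conv_out(side, kernel, stride)  # 84 -> 20 -> 9 -> 7

flat_features = side * side * 64  # conv3 has 64 output channels

total_params = (
    conv_params(4, 32, 8)
    + conv_params(32, 64, 4)
    + conv_params(64, 64, 3)
    + linear_params(flat_features, 512)
    + linear_params(512, 18)
)
print(side, flat_features, total_params)  # prints: 7 3136 1693362
```

The total agrees with the number of trainable parameters reported for `NatureDQN()` with its default arguments.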


### Environment

In [29]:
!pip install 'gym[all]'


### PPO

In [None]:
env = gym.make

## Training

## References

- https://arxiv.org/abs/1707.06347 (Proximal Policy Optimization Algorithms)
- https://arxiv.org/abs/1502.05477 (Trust Region Policy Optimization)