<img src="images/ppo_cover.jpg" width=25% align="right"/>
# Proximal Policy Optimization Algorithms
Author: Jin Yeom (jinyeom@utexas.edu)  
Original authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

## Contents
- [Configuration](#Configuration)
- [Environment](#Environment)
- [Model](#Model)
- [PPO](#PPO)
- [Training](#Training)
- [References](#References)

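**[Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347)** algorithms are a family of policy gradient methods built around a novel clipped surrogate objective,

$$
L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t\right)\right]
$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $A_t$ is the advantage estimate at timestep $t$, and $\epsilon$ is the clipping range. PPO builds on [Trust Region Policy Optimization (TRPO)](https://arxiv.org/abs/1502.05477) but is simpler to implement, and the original paper reports strong empirical performance on its benchmark tasks.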
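
As a minimal sketch of the clipped objective above (not necessarily how this notebook implements it), the loss can be written in a few lines of PyTorch. The function name `clipped_surrogate_loss`, the default `eps=0.2`, and the toy numbers are assumptions made for illustration. The probability ratio is computed in log space, the clip is applied to the ratio itself before multiplying by the advantage, and the result is negated so that an optimizer can minimize it.

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Negative clipped surrogate objective (hypothetical helper, eps assumed)."""
    # r_t(theta) = pi_new(a_t | s_t) / pi_old(a_t | s_t), computed in log space.
    # The old log-probabilities are treated as constants.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    # Clip the ratio (not the product) to [1 - eps, 1 + eps].
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Elementwise minimum gives a pessimistic bound; the batch mean
    # approximates the expectation. Negate to turn it into a loss.
    return -torch.min(unclipped, clipped).mean()

# Toy batch with made-up probabilities and advantages.
old_log_probs = torch.log(torch.tensor([0.5, 0.5, 0.5]))
new_log_probs = torch.log(torch.tensor([0.9, 0.5, 0.1]))
advantages = torch.tensor([1.0, 1.0, -1.0])

loss = clipped_surrogate_loss(new_log_probs, old_log_probs, advantages)
print("loss =", loss.item())
```

In the toy batch, the first sample's ratio (1.8) is clipped to 1.2, so the objective stops rewarding further increases in that action's probability. For the third sample the ratio (0.2) is clipped up to 0.8; with a negative advantage the clipped term is the smaller one, so it is the term the minimum selects.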

## Configuration

In [16]:
import numpy as np
import gym
import torch
from torch import nn
from torch.nn import functional as F
from torch.distributions import Categorical
from torchsummary import summary
from matplotlib import pyplot as plt
from IPython import display

In [8]:
SEED = 42

In [7]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device =", DEVICE)

device = cuda


## Environment

In [17]:
env = gym.make("CartPole-v0")
env.seed(SEED)
print(env.observation_space)
print(env.action_space)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Box(4,)
Discrete(2)


## Model

In [18]:
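class ActorCritic(nn.Module):
    r"""Actor-critic network with a shared hidden layer."""
    def __init__(self, obs_dim, act_dim):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(obs_dim, 128)  # shared feature layer
        self.fc_actor = nn.Linear(128, act_dim)  # action logits
        self.fc_critic = nn.Linear(128, 1)  # state-value estimate

    def forward(self, x):
        x = F.relu(self.fc1(x))
        # Normalize over the action dimension; passing dim explicitly
        # avoids the implicit-dimension deprecation warning.
        act_probs = F.softmax(self.fc_actor(x), dim=-1)
        value = self.fc_critic(x)
        return act_probs, value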

In [19]:
def sel_action(act_probs):
    r"""Given a vector of action probabilities, return a selected action
    and its log-probability.
    """
    dist = Categorical(act_probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)
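
The log-probability returned alongside the action is what the ratio $r_t(\theta)$ in the PPO objective is built from. The following is a small illustrative sketch, using `Categorical` directly with made-up probability vectors (`act_probs` and `new_probs` are hypothetical values, not outputs of the model above), of how a stored log-probability turns into a ratio once the policy has changed.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical action probabilities from the policy that collected the data.
act_probs = torch.tensor([0.25, 0.75])
dist = Categorical(act_probs)
action = dist.sample()
log_prob = dist.log_prob(action)  # equals log(act_probs[action])

# Hypothetical probabilities for the same state after a policy update.
new_probs = torch.tensor([0.4, 0.6])
new_log_prob = Categorical(new_probs).log_prob(action)

# Probability ratio between the new and old policies for the sampled action.
ratio = torch.exp(new_log_prob - log_prob)
print("action =", action.item(), "ratio =", ratio.item())
```

Working with log-probabilities and exponentiating the difference is a common choice because it is numerically more stable than dividing two small probabilities.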

In [15]:
model = ActorCritic(4, 2).to(DEVICE)
summary(model, (4,))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                  [-1, 128]             640
            Linear-2                    [-1, 2]             258
            Linear-3                    [-1, 1]             129
Total params: 1,027
Trainable params: 1,027
Non-trainable params: 0
----------------------------------------------------------------



## PPO

## Coffee break

## References

- https://arxiv.org/abs/1707.06347 (Proximal Policy Optimization Algorithms)
- https://arxiv.org/abs/1502.05477 (Trust Region Policy Optimization)