# PPO Implementation

This notebook builds a basic policy that trains the agent to kick the ball toward the target. The algorithm is described in the PPO paper, which can be found [here](https://arxiv.org/pdf/1707.06347.pdf).

## Setup
Hyperparameters and other preliminaries.

### Imports

In [3]:
from dm_control import suite
from dm_control import viewer
import numpy as np
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

### Constants

Select the training device: use the GPU when one is available, otherwise fall back to the CPU.

In [4]:
_DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

Constants describing the MuJoCo environment's observations. The `_c` suffix denotes the *cardinality* (the element *count*) of the value.

In [5]:
_walls_c = 3
_num_walls = 4
_ball_state_c = 9
_egocentric_state_c = 44

Network Hyperparameters:

In [6]:
_INPUT_DIM = _walls_c * _num_walls + _ball_state_c + _egocentric_state_c
_GAMMA = 0.99  # Discount factor
_MINIBATCH_SIZE = 32
_LEARNING_RATE = 0.0015
_ITERATIONS = 1000000
_EPOCHS = 10
_MEMORY_SIZE = 10000

_HIDDEN_LAYER_1 = 64
_HIDDEN_LAYER_2 = 32

_SEED = 2019
_EPSILON = 0.2  # Probability clip
_DROPOUT_PROB = 0.5

### Set seeds

In [7]:
torch.manual_seed(_SEED)
np.random.seed(_SEED)
random.seed(_SEED)

## Define the environment

### Define observation and agent inputs

Here, an agent observation is converted into the input for PPO. The observed features used are:
* Wall vectors for the left, right, top, and back walls of the goal
* The ball's x, y, z positions and velocities relative to the agent
* The state of the agent itself (joints, etc.)

Each feature is flattened to one dimension, and the results are concatenated in this order:
$$\begin{bmatrix} \text{left} \\
                  \text{right} \\
                  \text{top} \\
                  \text{back} \\
                  \text{ball-state} \\
                  \text{egocentric-state} \end{bmatrix}$$

In [8]:
def to_input(obs):
  left, right, top, back = obs['goal_walls_positions']
  ball_state = obs['ball_state']
  egocentric_state = obs['egocentric_state']
  
  return np.concatenate((
    left.ravel(),
    right.ravel(),
    top.ravel(),
    back.ravel(),
    ball_state.ravel(),
    egocentric_state.ravel()
  ))
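
As a quick shape check, the sketch below feeds `to_input` a synthetic observation. The array shapes (four walls of three values each, a 9-value ball state, a 44-value egocentric state) are assumptions inferred from the constants defined earlier, not values read from the environment, and `fake_obs` is a made-up stand-in for `timestep.observation`.

```python
import numpy as np

def to_input(obs):
  # Same flattening as the cell above, repeated so this sketch runs on its own.
  left, right, top, back = obs['goal_walls_positions']
  return np.concatenate((
    left.ravel(), right.ravel(), top.ravel(), back.ravel(),
    obs['ball_state'].ravel(),
    obs['egocentric_state'].ravel(),
  ))

# Synthetic observation; the shapes are assumptions based on the constants above.
fake_obs = {
  'goal_walls_positions': np.zeros((4, 3)),
  'ball_state': np.ones(9),
  'egocentric_state': np.full(44, 2.0),
}

flat = to_input(fake_obs)
print(flat.shape)  # 4 * 3 + 9 + 44 = 65 entries
```

If the real observation shapes differ from these assumptions, `_INPUT_DIM` would need to change to match.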

### Define reward function

In [9]:
def reward(physics):
  return 0

### Create the environment

In [10]:
task_kwargs = {
  'reward_func': reward
}

env = suite.load(domain_name="quadruped", 
                 task_name="soccer", 
                 visualize_reward=True, 
                 task_kwargs=task_kwargs)

Read the action dimensionality from the environment; it sets the output size of the PPO network.

In [11]:
_OUTPUT_DIM = env.action_spec().shape[0]

## Model Creation

The model is a simple feed-forward network with two hidden layers. In this actor-critic model, the actor head fits the policy and the critic head fits the value function. Both heads share the same base network, *(hopefully)* to converge faster.

In [12]:
class PPO(nn.Module):
  def __init__(self):
    super(PPO, self).__init__()
    
    self.network_base = nn.Sequential(
      nn.Linear(_INPUT_DIM, _HIDDEN_LAYER_1), nn.Dropout(_DROPOUT_PROB), nn.Tanh(),
      nn.Linear(_HIDDEN_LAYER_1, _HIDDEN_LAYER_2), nn.Dropout(_DROPOUT_PROB), nn.Tanh(),
    )
    
    self.policy_mu = nn.Linear(_HIDDEN_LAYER_2, _OUTPUT_DIM)
    self.policy_log_std = nn.Parameter(torch.zeros(1, _OUTPUT_DIM))
    self.value = nn.Linear(_HIDDEN_LAYER_2, 1)
    
  def forward(self, x):
    latent_state = self.network_base(x)
    
    mus = self.policy_mu(latent_state)
    # exp(log_std) is the standard deviation itself, not the variance
    sigma = torch.exp(self.policy_log_std)
    value_s = self.value(latent_state)
    
    return mus, sigma, value_s

Create the network and verify the layers are good as-is.

In [13]:
PPO()

PPO(
  (network_base): Sequential(
    (0): Linear(in_features=65, out_features=64, bias=True)
    (1): Dropout(p=0.5, inplace=False)
    (2): Tanh()
    (3): Linear(in_features=64, out_features=32, bias=True)
    (4): Dropout(p=0.5, inplace=False)
    (5): Tanh()
  )
  (actor): Linear(in_features=32, out_features=12, bias=True)
  (critic): Linear(in_features=32, out_features=1, bias=True)
)
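
The policy head produces one mean per action dimension, and `policy_log_std` is a state-independent log standard deviation. One way to turn these into an action and its log-probability is a diagonal Gaussian built with `torch.distributions.Normal`. This is a minimal sketch, not necessarily how the notebook samples actions: the helper name `sample_action` and the stand-in tensors are hypothetical.

```python
import torch
from torch.distributions import Normal

def sample_action(mus, log_std):
  """Sample from a diagonal Gaussian and return (action, log_prob)."""
  # exp(log std) gives the standard deviation, broadcast to the batch shape.
  std = torch.exp(log_std).expand_as(mus)
  dist = Normal(mus, std)
  action = dist.sample()
  # Sum per-dimension log-probs to get the joint log-prob of the action vector.
  log_prob = dist.log_prob(action).sum(dim=-1)
  return action, log_prob

torch.manual_seed(0)
mus = torch.zeros(1, 12)      # stand-in for the policy head output (12 actions, as printed above)
log_std = torch.zeros(1, 12)  # mirrors the zero initialisation of policy_log_std
action, log_prob = sample_action(mus, log_std)
print(action.shape, log_prob.shape)
```

The log-probability is what a PPO update would later compare between the current and old policies.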

### Replay Memory

Recall that PPO learns from trajectories collected by the policy. Create a standardized memory structure so that batches can be sampled from it later.

In [14]:
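import collections  # not part of the Imports cell above

# One step of experience; the variable name matches the namedtuple's type name.
Transition = collections.namedtuple('Transition',
                                    ['state', 'action', 'reward', 'next_state', 'terminating'])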
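
To illustrate how a stored trajectory might be consumed, the sketch below computes discounted returns over a list of transitions. The function `discounted_returns` is a hypothetical helper and the toy rewards are made up for illustration. The default `gamma` mirrors `_GAMMA`, while the example call uses 0.5 so the arithmetic is easy to follow.

```python
import collections

Transition = collections.namedtuple(
  'Transition', ['state', 'action', 'reward', 'next_state', 'terminating'])

def discounted_returns(trajectory, gamma=0.99):
  """Return G_t = r_t + gamma * G_{t+1} for each step, resetting at terminal steps."""
  returns = []
  running = 0.0
  # Walk backwards so each return can reuse the one that follows it.
  for step in reversed(trajectory):
    if step.terminating:
      running = 0.0
    running = step.reward + gamma * running
    returns.append(running)
  returns.reverse()
  return returns

trajectory = [
  Transition(state=0, action=0, reward=1.0, next_state=1, terminating=False),
  Transition(state=1, action=0, reward=0.0, next_state=2, terminating=False),
  Transition(state=2, action=0, reward=2.0, next_state=3, terminating=True),
]
returns = discounted_returns(trajectory, gamma=0.5)
print(returns)  # [1.5, 1.0, 2.0]
```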

## Training

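Create the network being trained (`policy`) and a second copy (`policy_old`) that starts with identical weights and is used to collect experience.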

In [15]:
policy = PPO().float().to(_DEVICE)
policy_old = PPO().float().to(_DEVICE)
policy_old.load_state_dict(policy.state_dict())

<All keys matched successfully>
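
Two copies are kept because PPO's clipped surrogate objective compares action probabilities under the current policy with those under the old one. A minimal sketch of that loss follows, with the default clip range matching `_EPSILON`. The function name `clipped_surrogate_loss` and the toy tensors are hypothetical, and advantage estimation is assumed to happen elsewhere.

```python
import torch

def clipped_surrogate_loss(log_prob_new, log_prob_old, advantages, epsilon=0.2):
  """Negative PPO clipped objective, averaged over the batch."""
  # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
  ratio = torch.exp(log_prob_new - log_prob_old)
  unclipped = ratio * advantages
  clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
  # Take the pessimistic (minimum) term, then negate so an optimizer can minimize it.
  return -torch.min(unclipped, clipped).mean()

log_prob_old = torch.log(torch.tensor([0.5, 0.5]))
log_prob_new = torch.log(torch.tensor([1.0, 0.25]))  # ratios of 2.0 and 0.5
advantages = torch.tensor([1.0, -1.0])
loss = clipped_surrogate_loss(log_prob_new, log_prob_old, advantages)
print(loss.item())
```

In this toy batch both ratios fall outside the clip range, so both samples contribute their clipped terms (1.2 and -0.8) and the loss is about -0.2.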

In [31]:
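for i in range(_ITERATIONS):
  timestep = env.reset()
  memory = []
  
  while not timestep.last():
    input_ = to_input(timestep.observation)
    state = torch.from_numpy(input_).float().to(_DEVICE)
    
    with torch.no_grad():
      # forward() returns three values: action means, std, and state value
      mus, sigma, value = policy_old(state)
      # Sample from the Gaussian policy; sigma has shape (1, _OUTPUT_DIM)
      action = torch.normal(mus, sigma.squeeze(0))

    # Advance the environment; without this step the episode never ends
    timestep = env.step(action.cpu().numpy())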

tensor([-0.4105, -0.0643,  0.0494, -0.2165,  0.2117,  0.0917, -0.3746, -0.1664,
         0.0418, -0.2445, -0.5868, -0.3083])


In [30]:
input_ = torch.from_numpy(np.random.random(_INPUT_DIM)).float()
input_

tensor([0.9035, 0.3931, 0.6240, 0.6379, 0.8805, 0.2992, 0.7022, 0.9032, 0.8814,
        0.4057, 0.4524, 0.2671, 0.1629, 0.8892, 0.1485, 0.9847, 0.0324, 0.5154,
        0.2011, 0.8860, 0.5136, 0.5783, 0.2993, 0.8372, 0.5266, 0.1048, 0.2781,
        0.0466, 0.5091, 0.4724, 0.9045, 0.9435, 0.7034, 0.8463, 0.9280, 0.8194,
        0.8452, 0.7915, 0.1710, 0.2900, 0.3045, 0.1477, 0.5738, 0.8636, 0.3233,
        0.2756, 0.6822, 0.1914, 0.5810, 0.8626, 0.2345, 0.2899, 0.3829, 0.3496,
        0.3288, 0.9402, 0.0380, 0.7768, 0.3846, 0.7167, 0.4525, 0.6990, 0.3536,
        0.7942, 0.2971])

In [None]:
# `ppo` is never defined; call the `policy` network on its own device instead
policy(input_.to(_DEVICE))

In [None]:
timestep = env.reset()
timestep

In [None]:
to_input(timestep.observation)

In [None]:
env.action_spec().shape