# PPO and ProcGen notebook

This notebook contains runnable code (with some exceptions). It first imports the necessary packages and code snippets from the repository, then sets up the parameters for the training loop and runs the loop with several different setups. The trained (updated) policy is then used to play on unseen data, and the results are plotted.

The cell below installs `procgen` and downloads a small `utils.py` script that contains some utility functions. It also downloads the Python scripts with the data augmentation functions, which are imported later.

In [13]:
!pip install procgen
!wget https://raw.githubusercontent.com/nicklashansen/ppo-procgen-utils/main/utils.py
!wget https://raw.githubusercontent.com/hlynurarni/Deep_learning_02456_FProject/master/HPC%20Scripts/data_aug.py
!wget https://raw.githubusercontent.com/hlynurarni/Deep_learning_02456_FProject/master/HPC%20Scripts/TransformLayer.py

--2020-12-27 18:08:21--  https://raw.githubusercontent.com/nicklashansen/ppo-procgen-utils/main/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14807 (14K) [text/plain]
Saving to: ‘utils.py.1’


2020-12-27 18:08:22 (53.2 MB/s) - ‘utils.py.1’ saved [14807/14807]

--2020-12-27 18:08:22--  https://raw.githubusercontent.com/hlynurarni/Deep_learning_02456_FProject/master/HPC%20Scripts/data_aug.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4697 (4.6K) [text/plain]
Saving to: ‘data_aug.py.1’


2020-12-27 18:08:22 (64.0 MB/s) - ‘data_

Hyperparameters; the values can be changed. For readability and to showcase a working notebook, only a short run is made for a few different combinations: with `total_steps = 1e2` the training loop stops after a single iteration of `num_envs * num_steps` = 8192 steps. Good starting values for a full run are given in the comment next to each parameter.

In [3]:
# Hyperparameters
total_steps = 1e2 # 8e6
num_envs = 32 # 32
num_levels = 100 # 10
num_steps = 256 # 256
num_epochs = 3 # 3
batch_size = 512 # 512
eps = .2 # .2
grad_eps = .5 # .5
value_coef = .5 # .5
entropy_coef = .01 # .01
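
To make the relationship between these values concrete, the sketch below derives the loop counts they imply. The helper name `rollout_schedule` is hypothetical and not part of the notebook or `utils.py`. It assumes the loop advances by `num_envs * num_steps` environment steps per iteration, as the training cell does, and that each epoch is split into full minibatches of `batch_size`.

```python
import math

def rollout_schedule(total_steps, num_envs, num_steps, batch_size, num_epochs):
    """Derive loop counts from the hyperparameters (illustrative helper)."""
    steps_per_iter = num_envs * num_steps                  # transitions collected per rollout
    iterations = math.ceil(total_steps / steps_per_iter)   # passes of `while step < total_steps`
    minibatches = steps_per_iter // batch_size             # full minibatches per epoch (assumption)
    return {
        "steps_per_iter": steps_per_iter,
        "iterations": iterations,
        "minibatches_per_epoch": minibatches,
        "minibatches_per_iter": minibatches * num_epochs,
    }

# Values used in this notebook versus the commented starting point for a full run
demo = rollout_schedule(total_steps=1e2, num_envs=32, num_steps=256, batch_size=512, num_epochs=3)
full = rollout_schedule(total_steps=8e6, num_envs=32, num_steps=256, batch_size=512, num_epochs=3)
print(demo)
print(full["iterations"])
```

With the demo value the loop runs once over 8192 steps, while the commented value of `8e6` would need 977 iterations.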

To store the results, Google Drive is mounted. This step can be changed or replaced depending on where the user wants to save the data.

In [11]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive
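
For readers who do not run on Colab, one possible way to make the save location configurable is sketched below. The helper `resolve_save_dir` and the fallback folder name are hypothetical; only the Drive path is taken from the training cell of this notebook.

```python
import tempfile
from pathlib import Path

def resolve_save_dir(preferred, fallback):
    """Return `preferred` if it exists (e.g. a mounted drive), else create and return `fallback`."""
    preferred = Path(preferred)
    if preferred.is_dir():
        return preferred
    fallback = Path(fallback)
    fallback.mkdir(parents=True, exist_ok=True)
    return fallback

# Drive folder used later in the notebook; the local fallback is an assumption
drive_dir = '/content/gdrive/MyDrive/Deep Learning Project 2020/data'
local_dir = Path(tempfile.gettempdir()) / 'ppo_procgen_runs'
save_dir = resolve_save_dir(drive_dir, local_dir)
print(save_dir / 'Run1.pt')
```

The checkpoint path in the training cell could then be built from `save_dir` instead of a hard-coded string.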


The policy network uses the popular `NatureDQN` encoder architecture (see below), while the policy and value functions are linear projections of the encodings. Here the necessary libraries are imported, along with the data augmentation functions from the downloaded Python files.

In [15]:
# Add the necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from utils import make_env, Storage, orthogonal_init
from data_aug import RandGray,random_color_jitter,random_cutout
from time import time

Implementation of the `Impala` encoder from [this paper](https://arxiv.org/pdf/1802.01561.pdf) minus the LSTM.

In [6]:
# Impala encoder
#  Added instead of the given Encoder 
def xavier_uniform_init(module, gain=1.0):
    if isinstance(module, nn.Linear) or isinstance(module, nn.Conv2d):
        nn.init.xavier_uniform_(module.weight.data, gain)
        nn.init.constant_(module.bias.data, 0)
    return module

class ResidualBlock(nn.Module):
    def __init__(self,in_channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=in_channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=in_channels, out_channels=in_channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        out = nn.ReLU()(x)
        out = self.conv1(out)
        out = nn.ReLU()(out)
        out = self.conv2(out)
        return out + x

class ImpalaBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(ImpalaBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, stride=1, padding=1)
        self.res1 = ResidualBlock(out_channels)
        self.res2 = ResidualBlock(out_channels)

    def forward(self, x):
        x = self.conv(x)
        x = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)(x)
        x = self.res1(x)
        x = self.res2(x)
        return x

class Encoder(nn.Module):
    def __init__(self,in_channels,out_features,**kwargs):
        super().__init__()
        self.block1 = ImpalaBlock(in_channels=in_channels, out_channels=16)
        self.block2 = ImpalaBlock(in_channels=16, out_channels=32)
        self.block3 = ImpalaBlock(in_channels=32, out_channels=32)
        self.fc = nn.Linear(in_features=32 * 8 * 8, out_features=out_features)

        self.output_dim = out_features  # was `feature_dim`, which is not defined in this scope
        self.apply(xavier_uniform_init)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = nn.ReLU()(x)
        x = Flatten()(x)
        x = self.fc(x)
        x = nn.ReLU()(x)
        return x
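
The `in_features` values of the linear layers (`32 * 8 * 8` here and `1024` in the `NatureDQN`-style `Encoder2`) follow from the standard output-size formula for convolution and pooling layers. The sketch below checks both for 64x64 observations, the size reported in the observation space of the training output. The helper `conv_out` is illustrative and assumes floor rounding and dilation 1, which are the PyTorch defaults.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# Impala encoder: the 3x3 convolutions with padding 1 keep the size,
# and MaxPool2d(kernel_size=3, stride=2, padding=1) halves it in each block.
size = 64
for _ in range(3):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # convolution: unchanged
    size = conv_out(size, kernel=3, stride=2, padding=1)  # max pool: halved
impala_features = 32 * size * size

# NatureDQN-style encoder: three unpadded convolutions
n = 64
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    n = conv_out(n, kernel, stride)
nature_features = 64 * n * n

print(size, impala_features, n, nature_features)  # 8 2048 4 1024
```

If the observation size changes, both `in_features` values have to be recomputed in the same way.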


In [7]:
def imshow(img):
    """ show an image """
    plt.figure(figsize=(10,8))
    plt.imshow(np.transpose(img, (1, 2, 0)))

In [8]:
class Flatten(nn.Module):
    def forward(self, x):
        return x.view(x.size(0), -1)

class Encoder2(nn.Module):
    def __init__(self, in_channels, feature_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1), nn.ReLU(),
            Flatten(),
            nn.Linear(in_features=1024, out_features=feature_dim), nn.ReLU()
        )
        self.apply(orthogonal_init)

    def forward(self, x):
        return self.layers(x)

class Policy(nn.Module):
    def __init__(self, encoder, feature_dim, num_actions):
        super().__init__()
        self.encoder = encoder
        self.policy = orthogonal_init(nn.Linear(feature_dim, num_actions), gain=.01)
        self.value = orthogonal_init(nn.Linear(feature_dim, 1), gain=1.)

    def act(self, x):
        with torch.no_grad():
            x = x.cuda().contiguous()
            dist, value = self.forward(x)
            action = dist.sample()
            log_prob = dist.log_prob(action)

        return action.cpu(), log_prob.cpu(), value.cpu()

    def forward(self, x):
        x = self.encoder(x)
        logits = self.policy(x)
        value = self.value(x).squeeze(1)
        dist = torch.distributions.Categorical(logits=logits)

        return dist, value
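
`torch.distributions.Categorical(logits=...)` normalizes the logits with a log-softmax, so the log-probability of an action is its logit minus the log-sum-exp of all logits. The sketch below reproduces this in plain Python for a single observation; the helper names are illustrative and not part of the notebook. It also suggests why a small gain on the policy head is a common choice: small weights give small logits, so the initial policy is close to uniform over the 15 actions.

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def sample_action(logits, rng):
    """Sample an action index and return it with its log-probability."""
    log_probs = log_softmax(logits)
    probs = [math.exp(lp) for lp in log_probs]
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return action, log_probs[action]

def entropy(logits):
    """Entropy of the categorical distribution defined by the logits."""
    log_probs = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in log_probs)

rng = random.Random(0)
# Tiny logits stand in for the output of a policy head initialized with a small gain
tiny_logits = [0.01 * rng.uniform(-1, 1) for _ in range(15)]
action, log_prob = sample_action(tiny_logits, rng)
print(action, log_prob, entropy(tiny_logits), math.log(15))
```

The entropy printed here is close to `log(15)`, the maximum for 15 actions, which is the quantity the entropy bonus in the loss tries to keep high.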

In [9]:
#  Example runs
setup_run = {
    'Run1': {
      'Encoder':'regular',
      'Data_aug': 'regular',
      'Mixreg': False,
      'Policy': None
    }
}

setup_run['Run2'] = {
    'Encoder':'impala',
    'Data_aug': 'regular',
    'Mixreg': False,
    'Policy': None
}

setup_run['Run3'] = {
    'Encoder':'impala',
    'Data_aug': 'grayscale',
    'Mixreg': False,
    'Policy': None
}

setup_run['Run4'] = {
    'Encoder':'impala',
    'Data_aug': 'color_jitter',
    'Mixreg': False,
    'Policy': None
}

setup_run['Run5'] = {
    'Encoder':'impala',
    'Data_aug': 'random_cutout',  # must match the string checked in the training loop
    'Mixreg': False,
    'Policy': None
}
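
The `Data_aug` strings above are matched by `if`/`elif` chains in the training loop. A lookup table is one way to avoid those chains and to fail loudly on an unknown key instead of silently training without augmentation. The sketch below uses hypothetical stand-in functions on plain Python tuples; in the notebook the table would map to `RandGray`, `random_color_jitter` and `random_cutout` from `data_aug.py`, whose exact call signatures are not assumed here.

```python
def identity(pixels):
    """No augmentation: return the observation unchanged."""
    return pixels

def grayscale(pixels):
    """Stand-in augmentation: replace each RGB triple by its channel mean."""
    return [(sum(p) / 3,) * 3 for p in pixels]

# Map each `Data_aug` key to a callable (stand-ins for illustration only)
AUGMENTATIONS = {
    "regular": identity,
    "grayscale": grayscale,
}

def get_augmentation(name):
    """Look up an augmentation by name and reject unknown names."""
    try:
        return AUGMENTATIONS[name]
    except KeyError:
        raise ValueError(
            f"Unknown Data_aug {name!r}; expected one of {sorted(AUGMENTATIONS)}"
        ) from None

setup = {"Encoder": "impala", "Data_aug": "grayscale", "Mixreg": False, "Policy": None}
augment = get_augmentation(setup["Data_aug"])
print(augment([(0.0, 0.5, 1.0)]))  # [(0.5, 0.5, 0.5)]
```

With this pattern the loop body reduces to a single `obs = augment(next_obs)` call regardless of the chosen setup.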

In [16]:
#  ============================       RUNNING LOOP      ==============================

run_setups =['Run1','Run2','Run3','Run4','Run5'] # Specify a list of setups to run

for run in run_setups: # Run the training for all our setups
    # Define environment
    # check the utils.py file for info on arguments
    env_name = 'starpilot'
    env = make_env(num_envs, num_levels=num_levels, env_name=env_name)
    print('Observation space:', env.observation_space)
    print('Action space:', env.action_space.n)
    print(f'Run setup for {run}: playing {env_name}')
    for setup_key in setup_run[run]:
        print(f'{setup_key}: {setup_run[run][setup_key]}', end = '\t')
    print('')

    # Read in the setup
    do_mixreg = setup_run[run]['Mixreg']
    data_aug = setup_run[run]['Data_aug']
    encoder_use = setup_run[run]['Encoder']

    # Define network
    feature_dim = 512
    lambda_mix = 0.95
    num_actions = env.action_space.n
    in_channels = env.observation_space.shape[0]

    # Define the encoder
    if encoder_use == 'impala':
        print('Using Impala') 
        encoder = Encoder(in_channels, feature_dim) # added
    else:
        encoder = Encoder2(in_channels, feature_dim) # added

    # Initialize the policy
    policy = Policy(encoder, feature_dim, num_actions) # added
    policy.cuda()

    # Define optimizer
    # these are reasonable values but probably not optimal
    optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4, eps=1e-5)

    # Define temporary storage
    # we use this to collect transitions during each iteration
    storage = Storage(
          env.observation_space.shape,
          num_steps,
          num_envs
      )

    # Run training
    obs = env.reset()
    nenv = env.num_envs
    device = torch.device('cpu')

    # Change the first observations to desired augmentation
    if data_aug == 'grayscale':
        obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
        obs[:] = env.reset()

        # Do the grayscale and transfer to tensor
        augs_funcs = RandGray(batch_size=num_envs, p_rand=1) # added
        obs = augs_funcs.do_augmentation(obs)
        obs = torch.from_numpy(obs)
    elif data_aug == 'random_cutout':
        # Initialize as a numpy array then convert to tensor
        obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
        obs[:] = env.reset()

        # Do the cutout and transfer to tensor
        obs = random_cutout(obs,12,24)
        obs = torch.from_numpy(obs)
    elif data_aug == 'color_jitter':
        obs = random_color_jitter(obs,p=0.5)

    step = 0
    # Initialize the reward histories that are stored at the end of each run
    mean_rewards = []
    mean_rewards_done = []
    first_loop = True
    start = time() # Measure how long each training run takes
    while step < total_steps:
        # Use policy to collect data for num_steps steps
        policy.eval()
        for _ in range(num_steps):
            # Use policy
            action, log_prob, value = policy.act(obs)

            # Take step in environment
            next_obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)

            # numpy obs
            next_obs[:], reward, done, info = env.step(action)

            # Store data
            storage.store(obs, action, reward, done, info, log_prob, value)

            # Make augmented transformation, probably possible to do this another way, like in a class to avoid the if statements
            if data_aug == 'grayscale':
                obs = augs_funcs.do_augmentation(next_obs)
                obs = torch.from_numpy(obs)
            elif data_aug == 'random_cutout':
                obs = random_cutout(next_obs,12,24)
                obs = torch.from_numpy(obs)
            elif data_aug == 'color_jitter':
                obs = torch.from_numpy(next_obs)
                obs = random_color_jitter(obs,p=0.5)
            else:
                obs = torch.from_numpy(next_obs)


        # Add the last observation to collected data
        _, _, value = policy.act(obs)
        storage.store_last(obs, value)

        # Compute return and advantage
        storage.compute_return_advantage()

        # Optimize policy
        policy.train()
        for epoch in range(num_epochs):

            # Iterate over batches of transitions
            generator = storage.get_generator(batch_size)
            for batch in generator:
                b_obs, b_action, b_log_prob, b_value, b_returns, b_advantage = batch

                if do_mixreg: 
                    # Sample index pairs; `high` is exclusive, so pass the batch length itself
                    index_ij = torch.randint(0, b_obs.size(0), (b_obs.size(0), 2))
                    b_obs = lambda_mix*b_obs[index_ij[:,0]] + (1-lambda_mix)*b_obs[index_ij[:,1]]
                    b_log_prob = lambda_mix*b_log_prob[index_ij[:,0]] + (1-lambda_mix)*b_log_prob[index_ij[:,1]]
                    b_value = lambda_mix*b_value[index_ij[:,0]] + (1-lambda_mix)*b_value[index_ij[:,1]]
                    b_returns = lambda_mix*b_returns[index_ij[:,0]] + (1-lambda_mix)*b_returns[index_ij[:,1]]
                    b_advantage = lambda_mix*b_advantage[index_ij[:,0]] + (1-lambda_mix)*b_advantage[index_ij[:,1]]
                    if (lambda_mix >= 0.5):
                        b_action = b_action[index_ij[:,0]]
                    else:
                        b_action = b_action[index_ij[:,1]]

                # The update below runs once per minibatch, so it belongs inside the batch loop
                # Get current policy outputs
                new_dist, new_value = policy(b_obs)
                new_log_prob = new_dist.log_prob(b_action)

                # Clipped policy objective
                ratio = torch.exp(new_log_prob - b_log_prob) # added
                # ratio = b_log_prob/new_log_prob # added
                clipped_ratio = ratio.clamp(min=1.0 - eps,max=1.0 + eps) # added
                # pi_loss = torch.min(rt_theta*b_advantage,) # added
                pi_loss = -torch.min(ratio * b_advantage,clipped_ratio * b_advantage).mean() # added

                # Clipped value function objective
                clipped_value = b_value + (new_value - b_value).clamp(min=-eps, max=eps) # added
                # value_loss = (new_value - b_value)**2 # added
                # The unclipped term uses new_value so that gradients reach the value head
                value_loss = 0.5 * torch.max((new_value - b_returns) ** 2, (clipped_value - b_returns) **2).mean() # added

                # Entropy loss
                entropy_loss = -new_dist.entropy().mean() # added

                # Backpropagate losses
                loss = pi_loss + value_coef*value_loss + entropy_coef*entropy_loss # added
                loss.backward()

                # Clip gradients
                torch.nn.utils.clip_grad_norm_(policy.parameters(), grad_eps) # added

                # Update policy
                optimizer.step()
                optimizer.zero_grad()

        # Update stats
        mean_rewards.append(storage.get_reward(normalized_reward=False))
        done_reward = sum((sum(storage.reward)/(sum(storage.done)+1)))/num_envs

        # TODO: If you never die implement an if statement that doesn't include the plus 1
        mean_rewards_done.append(done_reward)
        step += num_envs * num_steps
        print(f'Step: {step}\tMean reward: {storage.get_reward(normalized_reward=False)}, \tMean reward done: {done_reward}')
        if first_loop:
            end = time()
            time_total = end-start
            estimated_time = (8e6/(8192/time_total))/3600
            print(f'Estimated time of completion: {estimated_time} hours')
        first_loop = False
        
    # While loop ended, save results
    end = time()
    time_total = end-start
    
    # Save the final checkpoint of this run once training has finished
    torch.save({
              'Setup': setup_run[run], # The configuration used for this run
              'policy_state_dict': policy.state_dict(), # This is the policy
              'encoder_state_dict': encoder.state_dict(),
              'optimizer_state_dict': optimizer.state_dict(), # The optimizer used
              'Mean Reward': mean_rewards,
              'Mean Reward Done': mean_rewards_done,
              'Training time': time_total,
              }, f'/content/gdrive/MyDrive/Deep Learning Project 2020/data/{run}.pt')
    print(f'Completed training of {run}!')
    torch.save(policy.state_dict(), f'{run}_policy.pt')

print('Completed all runs!')
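
The loss terms in the training cell follow the PPO clipped surrogate objective. To see what the clipping does, the sketch below evaluates the same expressions for a single sample in plain Python. The function `ppo_terms` is illustrative only, and it uses the usual form of the clipped value loss, in which the unclipped term is computed from the new value prediction.

```python
import math

def ppo_terms(new_log_prob, old_log_prob, advantage, new_value, old_value, ret, eps=0.2):
    """Per-sample PPO policy and value losses with clipping range `eps`."""
    ratio = math.exp(new_log_prob - old_log_prob)            # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    pi_loss = -min(ratio * advantage, clipped_ratio * advantage)

    # The value prediction may move at most `eps` away from the old prediction
    clipped_value = old_value + min(max(new_value - old_value, -eps), eps)
    value_loss = 0.5 * max((new_value - ret) ** 2, (clipped_value - ret) ** 2)
    return ratio, pi_loss, value_loss

# The action probability rose from 0.2 to 0.3 (ratio 1.5) with a positive advantage,
# so the objective is capped at a ratio of 1 + eps = 1.2.
ratio, pi_loss, value_loss = ppo_terms(
    new_log_prob=math.log(0.3), old_log_prob=math.log(0.2),
    advantage=2.0, new_value=1.0, old_value=0.5, ret=2.0,
)
print(ratio, pi_loss, value_loss)
```

Here the policy loss is about -2.4 rather than -3.0, so pushing the ratio further above 1.2 yields no additional gain, which is what keeps each update close to the policy that collected the data.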

Observation space: Box(0.0, 1.0, (3, 64, 64), float32)
Action space: 15
Run setup for Run1: playing starpilot
Encoder: regular	Data_aug: regular	Mixreg: False	Policy: None	
Step: 8192	Mean reward: 8.401390075683594, 	Mean reward done: 2.391955614089966
Estimated time of completion: 3.1214913017013006 hours
Completed training of Run1!
Observation space: Box(0.0, 1.0, (3, 64, 64), float32)
Action space: 15
Run setup for Run2: playing starpilot
Encoder: impala	Data_aug: regular	Mixreg: False	Policy: None	
Using Impala
Step: 8192	Mean reward: 8.384974479675293, 	Mean reward done: 2.3830788135528564
Estimated time of completion: 4.3119494027147685 hours
Completed training of Run2!
Observation space: Box(0.0, 1.0, (3, 64, 64), float32)
Action space: 15
Run setup for Run3: playing starpilot
Encoder: impala	Data_aug: grayscale	Mixreg: False	Policy: None	
Using Impala
Step: 8192	Mean reward: 8.401407241821289, 	Mean reward done: 2.3919591903686523
Estimated time of completion: 4.396456578332517