<a href="https://colab.research.google.com/github/hegde95/RLColab/blob/master/PPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This Colab notebook trains a PPO agent on a basic Gym environment of your choice. It is part of the following GitHub project:
[https://github.com/hegde95/RLColab/](https://github.com/hegde95/RLColab/)

The main aim of this notebook is to help RL enthusiasts (such as myself) avoid being limited by hardware.

![alt text](https://media.giphy.com/media/5Zesu5VPNGJlm/giphy.gif)

This notebook only trains an agent (without rendering anything) and saves it to the Google Drive folder you provide. To test the agent, run the play.py script from the GitHub project.

## Steps for this project:

1.   Clone this repo:
>```
$ git clone https://github.com/hegde95/RLColab.git
```    

2.   Add the lib, runs and checkpoints folders from the repo to your Google Drive. The checkpoints folder is the place where the models will be saved.

3.   Configure Section 3 as needed and run this Colab notebook for as long as needed, following the instructions below. Since Google can terminate your session at any time, the model is saved to the checkpoints folder whenever the test reward improves. You can therefore load a previously trained model in Section 3 when you resume.


4.   Download the model you need (.dat file) from the checkpoints folder on Drive to the checkpoints folder in your local repo.

5.   Test the model by running play.py
>Ensure you have the following dependencies (gym, torch) installed on your local machine.
```
pip install gym
pip install torch
```
Then run play.py:
```
python play.py
```
Note that this only tests the last model that was downloaded to the local checkpoints folder.


#### So Let's Start
---


## Takeoff!!! 


---


Before running, make sure that the GPU is enabled:
Edit -> Notebook settings -> Hardware accelerator -> GPU

To run the notebook:
Runtime -> Run all


---


Run this code in the browser inspector to keep Colab from disconnecting.

Press Ctrl + Shift + I to open the inspector, then go to the Console tab and run the following:


```
function ClickConnect(){
console.log("Working"); 
document.querySelector("colab-toolbar-button#connect").click() 
}
setInterval(ClickConnect,60000)
```


---



# Section 1: Get Current GPU data

In [0]:
#@title
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
!nvidia-smi
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn't guaranteed
gpu = GPUs[0]
def printm():
 process = psutil.Process(os.getpid())
 print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
 print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()



---


# Section 2: Set up and imports
Execution pauses here every time the session restarts. Expand this section and authenticate with Google so the notebook can access your Drive.

In [0]:
#@title # Add Google Drive path here{ run: "auto" }
import sys
#@markdown Enter the path like so: /content/drive/My Drive/RL/PPO/
path_to_project = '/content/drive/My Drive/RL/PPO1/' #@param {type:"string"}
sys.path.append(path_to_project)
#@markdown This Folder should be set up in your Google drive and must contain the lib, checkpoints and runs folders

In [0]:
#@title ##Connect to drive
#@markdown ###please authenticate if needed
from google.colab import drive
drive.mount('/content/drive')

In [0]:
#@title Install tensorboardX and box2d-py
!pip install tensorboardX
!pip install box2d-py

In [0]:
#@title Imports
import torch
import torch.optim as optim
import os
import gym
import numpy as np
from datetime import datetime

import os
from tensorboardX import SummaryWriter
import warnings


from lib.common import mkdir
from lib.Model import ActorCritic
from lib.multiprocessing_env import SubprocVecEnv



---


# Section 3: Variables

In [0]:
#@title # Configure Hyper-parameters here{ run: "auto" }

NUM_ENVS = 8 #@param {type:"integer"}
ENV_ID = "BipedalWalker-v3" #@param {type:"string"}
HIDDEN_SIZE = 256 #@param {type:"integer"}
LEARNING_RATE = 1e-4 #@param {type:"number"}
GAMMA = 0.99 #@param {type:"number"}
GAE_LAMBDA = 0.95 #@param {type:"number"}
PPO_EPSILON = 0.2 #@param {type:"number"}
CRITIC_DISCOUNT = 0.5 #@param {type:"number"}
ENTROPY_BETA = 0.001 #@param {type:"number"}
PPO_STEPS = 1024 #@param {type:"integer"}
MINI_BATCH_SIZE = 64 #@param {type:"integer"}
PPO_EPOCHS = 10 #@param {type:"integer"}
TEST_EPOCHS = 10 #@param {type:"integer"}
NUM_TESTS = 5 #@param {type:"integer"}
TARGET_REWARD = 2500 #@param {type:"integer"}

#@markdown ------

#@markdown If you have a previously trained model for the same ENV_ID and HIDDEN_SIZE, you may choose Latest, Best or Custom

#@markdown Select New, If you want to create a new model

#@markdown Select Latest, If you want to choose the latest model

#@markdown Select Best, If you want to choose the best model

#@markdown Select Custom, If you want to load a custom model

#@markdown If running for the first time, choose New

LOAD_MODEL = "New"  #@param ["New", "Custom", "Latest", "Best"]
#@markdown NOTE: If you selected an option other than New, MAKE SURE THE MODEL ARCHITECTURE IS NOT DIFFERENT

#@markdown If you selected Custom for LOAD_MODEL, enter the name of the model to be restored.

import glob
import re  # needed by getScore below
import sys

def getScore(s):
    c= re.findall(r"[-+]?\d*\.\d+|\d+", s)
    return float(c[-2])

if LOAD_MODEL == "Latest":
    list_of_files = glob.glob(path_to_project+'checkpoints/*')
    model_name = max(list_of_files, key=os.path.getctime)
    print('Loading the following file:')
    print(model_name)
elif LOAD_MODEL == "Best":
    list_of_files = glob.glob(path_to_project+'checkpoints/*')
    bs = -99999
    model_name = ""
    for model in list_of_files:
        if getScore(model)>bs:
            model_name = model
            bs = getScore(model)
    print('Loading the following file:')
    print(model_name)
elif LOAD_MODEL == "Custom":
    CUSTOM_MODEL_NAME = "BipedalWalker-v3_best_+12.372_20480.dat" #@param {type:"string"}  
    # Keep the full path so that torch.load can find the file in Section 5
    model_name = path_to_project+'checkpoints/'+CUSTOM_MODEL_NAME
    list_of_files = glob.glob(path_to_project+'checkpoints/*')
    if model_name in list_of_files:
        print('Loading the following file:')
        print(model_name)
    else:
        print('Model not found')
        sys.exit("Model Not Found")
else:
    print('Creating new model')
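
As a small illustration of how the Best option ranks checkpoints, the sketch below applies the same regular expression as `getScore` to a few file names. The file names are hypothetical examples in the `%s_best_%+.3f_%d.dat` format that the training loop uses when saving, and `get_score` is only a renamed copy for illustration.

```python
import re

def get_score(path):
    # Same pattern as getScore: signed decimals or plain integers.
    numbers = re.findall(r"[-+]?\d*\.\d+|\d+", path)
    # File names end with <reward>_<frame_idx>.dat, so the reward is second to last.
    return float(numbers[-2])

# Hypothetical checkpoint names, not real training results.
names = [
    "checkpoints/BipedalWalker-v3_best_-45.120_10240.dat",
    "checkpoints/BipedalWalker-v3_best_+12.372_20480.dat",
    "checkpoints/BipedalWalker-v3_best_+3.500_15360.dat",
]

scores = [get_score(n) for n in names]
best = max(names, key=get_score)

print(scores)  # [-45.12, 12.372, 3.5]
print(best)    # checkpoints/BipedalWalker-v3_best_+12.372_20480.dat
```

Because the score is read as the second-to-last number, digits elsewhere in the path (for example in the project folder name or the environment ID) do not affect the result.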



---


# Section 4: PPO Methods
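
As a reading aid, the quantities computed by `compute_gae` and `ppo_update` in this section can be summarised as follows. The notation is an assumption introduced here, not part of the original notebook: $m_t$ is the mask `1 - done`, $\epsilon$ is `PPO_EPSILON`, $c_1$ is `CRITIC_DISCOUNT`, $\beta$ is `ENTROPY_BETA`, and $\rho_t$ is the probability ratio between the new and old policy.

$$
\delta_t = r_t + \gamma\, m_t\, V(s_{t+1}) - V(s_t)
$$

$$
\hat{A}_t = \delta_t + \gamma \lambda\, m_t\, \hat{A}_{t+1}, \qquad \hat{R}_t = \hat{A}_t + V(s_t)
$$

$$
\rho_t(\theta) = \exp\big(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_\text{old}}(a_t \mid s_t)\big)
$$

$$
L(\theta) = c_1\, \operatorname{mean}_t\big[(\hat{R}_t - V_\theta(s_t))^2\big] - \operatorname{mean}_t\big[\min\big(\rho_t \hat{A}_t,\ \operatorname{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, \hat{A}_t\big)\big] - \beta\, \operatorname{mean}_t\big[H(\pi_\theta(\cdot \mid s_t))\big]
$$

Note that `compute_gae` returns $\hat{R}_t$ rather than $\hat{A}_t$. The advantage used in the loss is recomputed in Section 5 as the returns minus the values and then normalized.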

In [0]:
# Based on https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb
# Based on https://github.com/colinskow/move37/blob/master/ppo/ppo_train.py

def make_env():
    # returns a function which creates a single environment
    def _thunk():
        env = gym.make(ENV_ID)
        return env
    return _thunk


def test_env(env, model, device, deterministic=True):
    state = env.reset()
    done = False
    total_reward = 0
    i = 0
    while (not done) and (i<1024):
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        dist, _ = model(state)
        action = dist.mean.detach().cpu().numpy()[0] if deterministic \
            else dist.sample().cpu().numpy()[0]
        next_state, reward, done, _ = env.step(action)
        state = next_state
        total_reward += reward
        i +=1
    return total_reward


def normalize(x):
    x -= x.mean()
    x /= (x.std() + 1e-8)
    return x


def compute_gae(next_value, rewards, masks, values, gamma=GAMMA, lam=GAE_LAMBDA):
    values = values + [next_value]
    gae = 0
    returns = []
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * \
            values[step + 1] * masks[step] - values[step]
        gae = delta + gamma * lam * masks[step] * gae
        # prepend to get correct order back
        returns.insert(0, gae + values[step])
    return returns


def ppo_iter(states, actions, log_probs, returns, advantage):
    batch_size = states.size(0)
    # yields batch_size // MINI_BATCH_SIZE random mini-batches; indices are drawn
    # with replacement, so the full batch is only covered approximately
    for _ in range(batch_size // MINI_BATCH_SIZE):
        rand_ids = np.random.randint(0, batch_size, MINI_BATCH_SIZE)
        yield states[rand_ids, :], actions[rand_ids, :], log_probs[rand_ids, :], returns[rand_ids, :], advantage[rand_ids, :]


def ppo_update(frame_idx, states, actions, log_probs, returns, advantages, clip_param=PPO_EPSILON):
    count_steps = 0
    sum_returns = 0.0
    sum_advantage = 0.0
    sum_loss_actor = 0.0
    sum_loss_critic = 0.0
    sum_entropy = 0.0
    sum_loss_total = 0.0

    # PPO EPOCHS is the number of times we will go through ALL the training data to make updates
    for _ in range(PPO_EPOCHS):
        # grabs random mini-batches several times until we have covered all data
        for state, action, old_log_probs, return_, advantage in ppo_iter(states, actions, log_probs, returns, advantages):
            dist, value = model(state)
            entropy = dist.entropy().mean()
            new_log_probs = dist.log_prob(action)

            ratio = (new_log_probs - old_log_probs).exp()
            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1.0 - clip_param,
                                1.0 + clip_param) * advantage

            actor_loss = - torch.min(surr1, surr2).mean()
            critic_loss = (return_ - value).pow(2).mean()
            loss = CRITIC_DISCOUNT * critic_loss + actor_loss - ENTROPY_BETA * entropy

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


            # track statistics as Python floats so no tensors are kept alive
            sum_returns += return_.mean().item()
            sum_advantage += advantage.mean().item()
            sum_loss_actor += actor_loss.item()
            sum_loss_critic += critic_loss.item()
            sum_loss_total += loss.item()
            sum_entropy += entropy.item()

            count_steps += 1

    writer.add_scalar("returns", sum_returns / count_steps, frame_idx)
    writer.add_scalar("advantage", sum_advantage / count_steps, frame_idx)
    writer.add_scalar("loss_actor", sum_loss_actor / count_steps, frame_idx)
    writer.add_scalar("loss_critic", sum_loss_critic / count_steps, frame_idx)
    writer.add_scalar("entropy", sum_entropy / count_steps, frame_idx)
    writer.add_scalar("loss_total", sum_loss_total / count_steps, frame_idx)
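
To make the recursion in `compute_gae` easier to follow, here is a minimal worked example on plain Python floats instead of tensors. The function `compute_gae_scalar` and all the numbers are hypothetical, chosen so the arithmetic can be checked by hand; the loop body mirrors the one above.

```python
def compute_gae_scalar(next_value, rewards, masks, values, gamma=0.99, lam=0.95):
    # Append the bootstrap value so values[step + 1] exists for the last step.
    values = values + [next_value]
    gae = 0.0
    returns = []
    for step in reversed(range(len(rewards))):
        # TD error; the mask zeroes the bootstrap term at episode ends.
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        gae = delta + gamma * lam * masks[step] * gae
        # Prepend so the returns end up in chronological order.
        returns.insert(0, gae + values[step])
    return returns

# Three steps with reward 1; the episode ends on the last step (mask 0).
returns = compute_gae_scalar(
    next_value=10.0,          # ignored here because the last mask is 0
    rewards=[1.0, 1.0, 1.0],
    masks=[1.0, 1.0, 0.0],
    values=[0.5, 0.5, 0.5],
    gamma=0.5,
    lam=1.0,
)
print(returns)  # [1.75, 1.5, 1.0]
```

With `lam=1.0` the result equals the plain discounted return (1 + 0.5 + 0.25 = 1.75 for the first step), which is a quick sanity check on the recursion.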



---


# Section 5: Main Method



In [0]:
#@title
runs = str(path_to_project) + "runs/"
%load_ext tensorboard
%tensorboard --logdir "$runs"

warnings.filterwarnings('ignore')

writer = SummaryWriter(path_to_project+'runs/'+str(datetime.now())+'/',comment="ppo_" + "AlienGo")
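# Fall back to the CPU if no GPU was assigned to this session
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")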
print('Device:', device)
# Prepare environments
envs = [make_env() for i in range(NUM_ENVS)]
envs = SubprocVecEnv(envs)
env = gym.make(ENV_ID)


num_inputs = envs.observation_space.shape[0]
num_outputs = envs.action_space.shape[0]

model = ActorCritic(num_inputs, num_outputs, HIDDEN_SIZE).to(device)
if LOAD_MODEL != "New":
    model.load_state_dict(torch.load(model_name))
    print("Loaded the file:"+model_name)

print(model)
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

frame_idx = 0
train_epoch = 0
if LOAD_MODEL != "New":
    best_reward = getScore(model_name)
else:
    best_reward = -9999

state = envs.reset()
early_stop = False
while not early_stop:

    log_probs = []
    values = []
    states = []
    actions = []
    rewards = []
    masks = []

    for _ in range(PPO_STEPS):
        state = torch.FloatTensor(state).to(device)
        dist, value = model(state)

        action = dist.sample()
        # each state, reward, done is a list of results from each parallel environment
        next_state, reward, done, _ = envs.step(action.cpu().numpy())
        log_prob = dist.log_prob(action)

        log_probs.append(log_prob)
        values.append(value)
        rewards.append(torch.FloatTensor(reward).unsqueeze(1).to(device))
        masks.append(torch.FloatTensor(1 - done).unsqueeze(1).to(device))

        states.append(state)
        actions.append(action)

        state = next_state
        frame_idx += 1

    next_state = torch.FloatTensor(next_state).to(device)
    _, next_value = model(next_state)
    returns = compute_gae(next_value, rewards, masks, values)

    returns = torch.cat(returns).detach()
    log_probs = torch.cat(log_probs).detach()
    values = torch.cat(values).detach()
    states = torch.cat(states)
    actions = torch.cat(actions)
    advantage = returns - values
    advantage = normalize(advantage)

    ppo_update(frame_idx, states, actions, log_probs, returns, advantage)
    train_epoch += 1

    if train_epoch % TEST_EPOCHS == 0:
        test_reward = np.mean([test_env(env, model, device)
                                for _ in range(NUM_TESTS)])
        writer.add_scalar("test_rewards", test_reward, frame_idx)
        print('Frame %s. reward: %s' % (frame_idx, test_reward))
        # Save a checkpoint every time we achieve a best reward
        if best_reward is None or best_reward < test_reward:
            if best_reward is not None:
                print("Best reward updated: %.3f -> %.3f" %
                      (best_reward, test_reward))
                name = "%s_best_%+.3f_%d.dat" % (ENV_ID,
                                                  test_reward, frame_idx)
                fname = os.path.join('.', path_to_project+'checkpoints', name)
                torch.save(model.state_dict(), fname)
            best_reward = test_reward
        if test_reward > TARGET_REWARD:
            early_stop = True
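
The loop above normalizes the advantages before passing them to `ppo_update`, where they are multiplied by the clipped probability ratio. The sketch below reproduces both steps on plain Python floats. It is only an illustration under stated assumptions: `clipped_surrogate` is a hypothetical helper name, the inputs are made up, and `statistics.stdev` is used because it applies the same n - 1 correction as the default `torch.Tensor.std()`.

```python
import math
import statistics

def normalize(xs, eps=1e-8):
    # Zero mean and unit sample standard deviation, as in normalize() in Section 4.
    mean = statistics.mean(xs)
    std = statistics.stdev(xs)
    return [(x - mean) / (std + eps) for x in xs]

def clipped_surrogate(new_log_prob, old_log_prob, advantage, clip_param=0.2):
    # Probability ratio between the new and the old policy.
    ratio = math.exp(new_log_prob - old_log_prob)
    surr1 = ratio * advantage
    surr2 = max(1.0 - clip_param, min(ratio, 1.0 + clip_param)) * advantage
    # Taking the minimum gives the pessimistic (lower) estimate.
    return min(surr1, surr2)

advantages = normalize([1.0, 2.0, 3.0])  # roughly [-1.0, 0.0, 1.0]

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + 0.2) * A.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# Ratio 1.5 with a negative advantage: the unclipped, more negative term is kept.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)

print(advantages)
print(capped, uncapped)
```

The asymmetry is the point of the clipping: the policy gets no extra credit for moving further than the clip range in a good direction, but is still fully penalised for moving in a bad one.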

---
# Landing:
This notebook was just something I tried out of curiosity and is nowhere close to cutting-edge RL research, but it can be extended to fit your use case.

Future work:


*   Implement custom environments. These Gym wrapper classes can be placed in the same Drive folder and imported from this notebook
*   Try using TPUs for RL


If you like this work give me a shoutout :)

![alt text](https://media.giphy.com/media/ui1hpJSyBDWlG/giphy.gif)

website: https://hegde95.github.io/

LinkedIn: https://www.linkedin.com/in/karkala-shashank-hegde/


