<a href="https://colab.research.google.com/github/lauraminicucci/ReinforcementLearning/blob/master/LunarLanderContinuous_V2_PPO2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Deep Learning Labs S01 E05: BipedalWalker-v2

This Colab notebook lets you train, evaluate, and visualize your results. Because Google Colab doesn't support `env.render()`, we use a workaround: we "fake" a display, record a video, and then display it.

The notebook runs on the classic **CPU** environment as well as on **GPU** and **TPU**.

> To run everything, select `Runtime` in the menu and choose `Run all`.

![Nextgrid Deep learning labs](https://nextgrid.ai/wp-content/uploads/2019/12/Deck-wallpaper-logo-scaled.jpg)

### Stable Baselines OpenAI Gym BipedalWalker-v2

Notebook by [nextgrid.ai](https://nextgrid.ai) for [Deep learning labs](https://nextgrid.ai/deep-learning-labs/) #5.


Documentation for Stable Baselines is available at: [https://stable-baselines.readthedocs.io/](https://stable-baselines.readthedocs.io/)


Notebook authored by M.   
[linkedin](https://www.linkedin.com/in/imathias) / [twitter](https://twitter.com/mathiiias123)   



## Install system-wide packages
Install Linux server packages using `apt-get` and Python packages using `pip`.

In [3]:
!apt-get install swig cmake libopenmpi-dev zlib1g-dev xvfb x11-utils ffmpeg -qq #remove -qq for full output
!pip install stable-baselines[mpi] box2d box2d-kengz pyvirtualdisplay pyglet==1.3.1 --quiet #remove --quiet for full output 
# Stable Baselines only supports tensorflow 1.x for now
%tensorflow_version 1.x

Selecting previously unselected package libxxf86dga1:amd64.
(Reading database ... 145674 files and directories currently installed.)
Preparing to unpack .../libxxf86dga1_2%3a1.1.4-1_amd64.deb ...
Unpacking libxxf86dga1:amd64 (2:1.1.4-1) ...
Selecting previously unselected package swig3.0.
Preparing to unpack .../swig3.0_3.0.12-1_amd64.deb ...
Unpacking swig3.0 (3.0.12-1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_3.0.

## Dependencies
Import the dependencies required to run and train our model and to record a video.

In [4]:
import gym
import imageio
import numpy as np
import base64
import IPython
import PIL.Image
import pyvirtualdisplay

# Video stuff 
from pathlib import Path
from IPython import display as ipythondisplay

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import VecVideoRecorder, SubprocVecEnv, DummyVecEnv
from stable_baselines import PPO2

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## Define variables & functions
Here we define our variables and also create a couple of functions 

In [0]:
# Set the environment variables that we will use in our code
env_id = 'BipedalWalker-v2'
video_folder = '/videos'
video_length = 100

# Set up our initial environment
env = DummyVecEnv([lambda: gym.make(env_id)])
obs = env.reset()

In [0]:
# Evaluation Function
def evaluate(model, num_steps=1000):
  """
  Evaluate a RL agent
  :param model: (BaseRLModel object) the RL Agent
  :param num_steps: (int) number of timesteps to evaluate it
  :return: (float) Mean reward for the last 100 episodes
  """
  episode_rewards = [0.0]
  obs = env.reset()
  for i in range(num_steps):
      # _states are only useful when using LSTM policies
      action, _states = model.predict(obs)

      # env is a DummyVecEnv, so reward and done are arrays of length 1
      obs, reward, done, info = env.step(action)
      
      # Stats
      episode_rewards[-1] += reward[0]
      if done[0]:
          # The vectorized env resets itself when an episode ends
          episode_rewards.append(0.0)
  # Compute mean reward for the last 100 episodes
  mean_100ep_reward = round(np.mean(episode_rewards[-100:]), 1)
  print("Mean reward:", mean_100ep_reward, "Num episodes:", len(episode_rewards))
  
  return mean_100ep_reward
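
The `evaluate` function above always keeps a trailing entry in `episode_rewards` for the episode still in progress, so the reported mean and episode count include one episode that has not finished. Below is a minimal sketch of the same bookkeeping that counts only completed episodes. It uses plain lists in place of a real environment, and `summarize_episodes` is a hypothetical helper, not part of Stable Baselines.

```python
def summarize_episodes(rewards, dones, last_n=100):
    """Group per-step rewards into episodes and average the completed ones."""
    completed = []   # total reward of each finished episode
    running = 0.0    # reward accumulated in the episode in progress
    for reward, done in zip(rewards, dones):
        running += float(reward)
        if done:
            completed.append(running)
            running = 0.0
    recent = completed[-last_n:]
    # No finished episode yet: report NaN rather than a misleading 0.0
    mean_reward = sum(recent) / len(recent) if recent else float("nan")
    return mean_reward, len(completed)


# Toy data: two finished episodes (totals 3.0 and -1.0) plus one unfinished step
rewards = [1.0, 2.0, -1.0, 5.0]
dones = [False, True, True, False]
mean_reward, num_episodes = summarize_episodes(rewards, dones)
print("Mean reward:", mean_reward, "Num episodes:", num_episodes)
```

Whether to drop the unfinished episode is a judgment call. With long episodes and few evaluation steps, it can noticeably shift the mean.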

In [0]:
# Make video
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'
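
The `os.system` call above launches Xvfb through a shell and gives no indication if the binary is missing. The following is a hedged alternative sketch using `subprocess`. The helper names `xvfb_command` and `start_fake_display` are hypothetical, and the display number and screen geometry simply mirror the values used above.

```python
import os
import shutil
import subprocess


def xvfb_command(display=1, width=1024, height=768, depth=24):
    """Build the Xvfb argument list for one virtual screen."""
    return ["Xvfb", ":{}".format(display), "-screen", "0",
            "{}x{}x{}".format(width, height, depth)]


def start_fake_display(display=1):
    """Start Xvfb in the background; return the process, or None if Xvfb is not installed."""
    if shutil.which("Xvfb") is None:
        return None
    process = subprocess.Popen(xvfb_command(display))
    # Point X clients (such as the Gym renderer) at the virtual display
    os.environ["DISPLAY"] = ":{}".format(display)
    return process


print(" ".join(xvfb_command()))
```

Keeping the returned process handle makes it possible to call `terminate()` on it once the videos have been recorded.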

In [0]:
# Record video
def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # Start the video at step=0 and record video_length steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

In [0]:
# Display video
def show_videos(video_path='', prefix=''):
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
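
`show_videos` works because a browser can play a video whose bytes are embedded directly in the page as a base64 data URI, so no file server is needed. Here is a small sketch of that encoding step in isolation. `video_tag` is a hypothetical helper, and the bytes used here are a placeholder, not a playable clip.

```python
import base64


def video_tag(video_bytes, alt="", height_px=400):
    """Return an HTML <video> element with the clip embedded as a base64 data URI."""
    encoded = base64.b64encode(video_bytes).decode("ascii")
    return (
        '<video alt="{}" autoplay loop controls style="height: {}px;">'
        '<source src="data:video/mp4;base64,{}" type="video/mp4" />'
        "</video>"
    ).format(alt, height_px, encoded)


# Placeholder bytes stand in for the contents of an .mp4 file
tag = video_tag(b"not a real mp4", alt="demo")
print(tag[:40])
```

Base64 inflates the payload by roughly a third, so this approach suits short clips like the ones recorded here.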

## Define & configure our reinforcement learning algorithm
In this example we use PPO2 (Proximal Policy Optimization) with its default settings. Read more about how to define your PPO2 [parameters](https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html#parameters).
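
As a starting point for tuning, the sketch below collects a few PPO2 keyword arguments in a dictionary and checks that the rollout splits evenly into minibatches. The values are illustrative assumptions, not tuned settings for BipedalWalker, and `ppo2_kwargs` is a hypothetical name.

```python
# Assumed, illustrative values; the notebook itself uses the PPO2 defaults
ppo2_kwargs = dict(
    n_steps=2048,          # steps collected per environment before each update
    nminibatches=32,       # minibatches per update
    noptepochs=10,         # optimization passes over each rollout
    gamma=0.99,            # discount factor
    lam=0.95,              # GAE bias/variance trade-off
    ent_coef=0.0,          # entropy bonus weight
    learning_rate=2.5e-4,
    cliprange=0.2,
)

n_envs = 1  # this notebook wraps a single environment in a DummyVecEnv
rollout_size = n_envs * ppo2_kwargs["n_steps"]
# The rollout should divide evenly into minibatches
assert rollout_size % ppo2_kwargs["nminibatches"] == 0
minibatch_size = rollout_size // ppo2_kwargs["nminibatches"]
print("Samples per minibatch:", minibatch_size)

# In the notebook, these would be passed as:
# model = PPO2(MlpPolicy, env, verbose=1, **ppo2_kwargs)
```

Change one parameter at a time and compare the mean reward from `evaluate` to see which changes actually help.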

In [10]:
# Define the model
model = PPO2(MlpPolicy, env, verbose=1) # Add and tweak the default parameters, measure your output, and improve (see the parameters link above); the defaults also work

Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
Use keras.layers.flatten instead.
Instructions for updating:
Please use `layer.__call__` method instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

## Train the model for 50k steps & evaluate results
Here we train, evaluate, and save the model, then record and display a video.
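
With the default `n_steps=128` and a single environment, PPO2 collects 128 timesteps per update, which matches the training log where `total_timesteps` is 256 at `nupdates` 2. Below is a rough sketch of the arithmetic, assuming the update count is the integer quotient of the requested timesteps by the rollout size. `ppo2_update_count` is a hypothetical helper.

```python
def ppo2_update_count(total_timesteps, n_steps=128, n_envs=1):
    """Return (number of updates, timesteps actually consumed) for a training run."""
    rollout_size = n_steps * n_envs
    updates = total_timesteps // rollout_size
    return updates, updates * rollout_size


updates_50k, steps_50k = ppo2_update_count(50000)
updates_500k, steps_500k = ppo2_update_count(500000)
print(updates_50k, steps_50k)
print(updates_500k, steps_500k)
```

Under this assumption, a request that is not a multiple of the rollout size trains for slightly fewer timesteps than asked.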

In [11]:

# Random Agent, before training
mean_reward_before_train = evaluate(model, num_steps=10000)

# Train model
model.learn(total_timesteps=50000)

# Save model
model.save("ppo2-walker-50000")

# Trained agent, after training
mean_reward_after_train = evaluate(model, num_steps=1000)

Mean reward: -107.1 Num episodes: 22
--------------------------------------
| approxkl           | 0.00077803154 |
| clipfrac           | 0.0           |
| ep_rewmean         | nan           |
| eplenmean          | nan           |
| explained_variance | -0.0746       |
| fps                | 238           |
| nupdates           | 2             |
| policy_entropy     | 5.6758256     |
| policy_loss        | -0.009089731  |
| serial_timesteps   | 256           |
| time_elapsed       | 3.58e-06      |
| total_timesteps    | 256           |
| value_loss         | 0.35657626    |
--------------------------------------
--------------------------------------
| approxkl           | 0.00040489502 |
| clipfrac           | 0.0           |
| ep_rewmean         | nan           |
| eplenmean          | nan           |
| explained_variance | -0.0489       |
| fps                | 432           |
| nupdates           | 3             |
| policy_entropy     | 5.6759953     |
| policy_loss        | -0.0

In [12]:
# Record & show video
record_video('BipedalWalker-v2', model, video_length=1500, prefix='ppo2-walker-50000')
show_videos('videos', prefix='ppo2-walker-50000')

Saving video to  /content/videos/ppo2-walker-50000-step-0-to-step-1500.mp4


## Train the model for another 500k steps & evaluate results


In [13]:
# Agent before additional training (already trained for 50k steps)
mean_reward_before_train = evaluate(model, num_steps=10000)

# Train model
model.learn(total_timesteps=500000)

# Save model
model.save("ppo2-walker-500000")

# Trained agent, after training
mean_reward_after_train = evaluate(model, num_steps=10000)

Mean reward: -109.8 Num episodes: 61
-------------------------------------
| approxkl           | 0.0010612659 |
| clipfrac           | 0.0078125    |
| ep_rewmean         | nan          |
| eplenmean          | nan          |
| explained_variance | 0.167        |
| fps                | 376          |
| nupdates           | 2            |
| policy_entropy     | 5.432322     |
| policy_loss        | -0.001787466 |
| serial_timesteps   | 256          |
| time_elapsed       | 2.15e-06     |
| total_timesteps    | 256          |
| value_loss         | 181.98203    |
-------------------------------------
--------------------------------------
| approxkl           | 0.0005322192  |
| clipfrac           | 0.0           |
| ep_rewmean         | nan           |
| eplenmean          | nan           |
| explained_variance | 0.203         |
| fps                | 396           |
| nupdates           | 3             |
| policy_entropy     | 5.4321585     |
| policy_loss        | -0.0017671142 |
| s

In [14]:
# Record & show video
record_video('BipedalWalker-v2', model, video_length=1500, prefix='ppo2-walker-500000')
show_videos('videos', prefix='ppo2-walker-500000')

Saving video to  /content/videos/ppo2-walker-500000-step-0-to-step-1500.mp4
