<a href="https://colab.research.google.com/github/john-vastola/RL-HMS-group/blob/main/basics/RL_group_gym_and_baselines_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Intro to OpenAI Gym and Stable Baselines | RL discussion group</center>
## <center>A tutorial notebook by John Vastola</center>

<center><img src="https://github.com/john-vastola/RL-HMS-group/blob/main/basics/images/walker_pic.png?raw=true" alt="drawing" width="400"/></center>

The goal of this notebook is to briefly introduce two useful resources for developing reinforcement learning (RL) models in practice: 

(i) [OpenAI Gym](https://www.gymlibrary.dev/), which provides environments and a useful API for both working with them and constructing your own; and 

(ii) [Stable Baselines](https://stable-baselines.readthedocs.io/en/master/), which provides reference implementations of standard RL algorithms.


The structure of the notebook is as follows:

1.   Getting started with OpenAI Gym environments
2.   Using RL algorithms provided by Stable Baselines
3.   Importing pre-trained RL agents


**Previous tutorials**: [RL basics](https://github.com/john-vastola/RL-HMS-group/blob/main/basics/RL_group_Basics_notebook.ipynb) (TD-learning, Q-learning, actor-critic, etc.)

**Credit attribution note**: The code here is based on a [notebook](https://colab.research.google.com/github/NeuromatchAcademy/course-content-dl/blob/main/projects/ReinforcementLearning/lunar_lander.ipynb) made for [Neuromatch Academy](https://academy.neuromatch.io/), as well as a past [ML from scratch](https://github.com/DrugowitschLab/ML-from-scratch-seminar) session on [actor-critic RL models](https://github.com/DrugowitschLab/ML-from-scratch-seminar/tree/master/AdvancedRL) by [Zach Cohen](https://twitter.com/zachacohen?lang=en) and myself. Assume good code is theirs and errors are mine.

-----------

## 1. Getting started with OpenAI Gym environments

[OpenAI Gym](https://www.gymlibrary.dev/) provides a collection of environments and a common API for interacting with them. An environment defines the task an RL agent has to solve, including which actions it is allowed to take and how it gets rewarded. 

Famous RL environments include [Atari games](https://www.gymlibrary.dev/environments/atari/), [MuJoCo physics simulators](https://www.gymlibrary.dev/environments/mujoco/), and classical control tasks like [Cart Pole](https://www.gymlibrary.dev/environments/classic_control/cart_pole/). All of these are available through Gym.

First, we need to install and import everything. This can be annoying on Colab, but let's try; see if you can run the cells below.

In [21]:
# @title Install dependencies


!pip install rarfile --quiet
!pip install stable-baselines3 > /dev/null

!pip install huggingface_hub
!pip install huggingface_sb3

!pip install box2d-py > /dev/null
!pip install gym[all]
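!sudo apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
# pyvirtualdisplay is imported by the video-playing cell below
!pip install pyvirtualdisplay > /dev/null 2>&1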
!pip install pyglet==1.5.11

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lz4>=3.1.0
  Using cached lz4-4.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting ale-py~=0.7.1
  Using cached ale_py-0.7.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting box2d-py==2.3.5
  Using cached box2d_py-2.3.5-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Collecting mujoco-py<2.0,>=1.50
  Using cached mujoco-py-1.50.1.68.tar.gz (120 kB)
Building wheels for collected packages: mujoco-py
  Building wheel for mujoco-py (setup.py) ... error
  ERROR: Failed building wheel for mujoco-py
  Running setup.py clean for mujoco-py
Failed to build mujoco-py
Installing collected packages: mujoco-py, lz4, box2d-p

In [22]:
import io
import os
import glob
import torch
import base64


import numpy as np
import matplotlib.pyplot as plt

import gym
from gym import spaces
from gym.wrappers import Monitor

import stable_baselines3
from stable_baselines3 import DQN
from stable_baselines3.common.results_plotter import ts2xy, load_results
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_atari_env




print("Gym version:", gym.version.VERSION)

Gym version: 0.21.0


In [4]:
# @title Video-playing code
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

display = Display(visible=0, size=(1400, 900))
display.start()

"""
Utility functions to enable video recording of a gym environment
and displaying it.
To enable video, just do "env = wrap_env(env)".
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else:
    print("Could not find video")


def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

If you could successfully run the above cells, run the below cells and see if you can simulate and visualize an RL agent with a random policy on one of the available OpenAI Gym environments. For example, try simulating an agent interacting with the [Lunar Lander environment](https://www.gymlibrary.dev/environments/box2d/lunar_lander/). 

Alternatively, look through the [Gym documentation](https://www.gymlibrary.dev/) for another environment to try. Two alternatives are the [Bipedal Walker environment](https://www.gymlibrary.dev/environments/box2d/bipedal_walker/) and the [Car Racing environment](https://www.gymlibrary.dev/environments/box2d/car_racing/).

In [5]:
# Make environment
env = wrap_env(gym.make("LunarLanderContinuous-v2"))     # Make Lunar Lander environment
#env = wrap_env(gym.make("BipedalWalker-v3"))              # Make Bipedal Walker environment
#env = wrap_env(gym.make("CarRacing-v0"))                 # Make Car Racing environment

In [6]:
state = env.reset()                                      # Reset the environment


# Let agent with a uniform random policy play the game for one episode
total_reward = 0   


num_frames_to_try = 300    # make this number smaller to make this cell run faster
for i in range(num_frames_to_try):
  env.render()                                           # used to render video frames
  
  action = env.action_space.sample()                     # choose a random action
  state, reward, done, info = env.step(action)           # perform your chosen action

  total_reward += reward                                 # increment total reward

  if done:                                               # if episode ended (e.g. if lander landed or crashed), end session
    break


# Print total reward accumulated, and video of episode
print("Total reward:", total_reward, "(solved if >= 200)")
env.close()
show_video()

Total reward: -300.04206197544784 (solved if >= 200)


The video-playing code makes the above cell kind of slow. If you don't visualize the entire episode, things run faster, as you can verify by running the below cell.

In [26]:
state = env.reset()                                      # Reset the environment


# Let agent with a uniform random policy play the game for one episode
total_reward = 0   


num_frames_to_try = 300    # make this number smaller to make this cell run faster
for i in range(num_frames_to_try):
  action = env.action_space.sample()                     # choose a random action
  state, reward, done, info = env.step(action)           # perform your chosen action

  total_reward += reward                                 # increment total reward

  if done:                                               # if episode ended (e.g. if lander landed or crashed), end session
    break


# Print total reward accumulated (no video this time)
print("Total reward:", total_reward, "(solved if >= 200)")
env.close()

Total reward: -223.63032426370643 (solved if >= 200)


Above, I simulated agent-environment interactions for a fixed number of frames, since we are just checking that things work and would like the code to run quickly. To simulate the interaction until the episode is over, change the `for i in range(num_frames_to_try):` statement to `while True:`. 

When the episode ends depends on the environment. In the Lunar Lander task, the episode ends when your lander either lands successfully or crashes. In the Bipedal Walker task, the episode ends if the walker falls over, or successfully moves a certain distance to the right.
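
To make the episode loop concrete without any Gym installation, here is a minimal sketch of an environment that follows the same `reset()` / `step()` pattern used above. `ToyCorridorEnv` and `run_random_episode` are hypothetical names invented for this illustration (they are not part of Gym), and the reward values are arbitrary choices.

```python
import random


class ToyCorridorEnv:
    """Hypothetical stand-in for a Gym environment (not part of Gym).

    The agent starts at position 0 and must reach position `length`.
    Actions: 0 = step left, 1 = step right.
    """

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.position = 0
        self.steps_taken = 0
        return self.position

    def step(self, action):
        # Move left or right, never going below position 0.
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        self.steps_taken += 1

        reached_goal = self.position == self.length
        timed_out = self.steps_taken >= self.max_steps
        reward = 10.0 if reached_goal else -1.0
        done = reached_goal or timed_out
        # Same 4-tuple shape as the env.step(...) calls above.
        return self.position, reward, done, {}


def run_random_episode(env, rng):
    """Run one full episode with a uniform random policy."""
    state = env.reset()
    total_reward = 0.0
    while True:                        # loop until the environment says "done"
        action = rng.choice([0, 1])    # plays the role of env.action_space.sample()
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


print("Total reward:", run_random_episode(ToyCorridorEnv(), random.Random(0)))
```

Here the episode ends either when the goal is reached or when a step limit is hit, which mirrors how different Gym environments define their own termination conditions.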

----------------

## 2. Using RL algorithms provided by Stable Baselines

If we want to simulate agents that do something more complicated than taking random actions, we need to endow agents with some kind of RL (control) algorithm for learning from trial and error. Two common kinds of approaches are Q-Learning and actor-critic architectures, both of which were introduced in the previous [RL basics tutorial notebook](https://github.com/john-vastola/RL-HMS-group/blob/main/basics/RL_group_Basics_notebook.ipynb).

But the form of these algorithms is essentially task-agnostic, so it would be a waste of time if we had to reimplement them whenever we wanted to train a new RL agent. This is where [Stable Baselines](https://stable-baselines.readthedocs.io/en/master/) comes in: it provides reference implementations of many popular RL algorithms, so that all you need to do is choose hyperparameters. 

For example, if you want to use a [Deep Q network](https://stable-baselines.readthedocs.io/en/master/mod), Stable Baselines allows you to pick the number of neural network layers, learning rate, and so on, without actually having to design a neural network from scratch (although you can still do this if you want). 

Let's try applying a Deep Q network to the Lunar Lander environment.

In [7]:
nn_layers = [64,64] #This is the configuration of your neural network. Currently, we have two layers, each consisting of 64 neurons.
                    #If you want three layers with 64 neurons each, set the value to [64,64,64] and so on.

learning_rate = 0.001 #This is the step-size with which the gradient descent is carried out.
                      #Tip: Use smaller step-sizes for larger networks.
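
To get a feel for what `nn_layers` controls, the sketch below computes the layer shapes and parameter count of a plain fully connected network. It assumes the network maps Lunar Lander's 8 observation values to one output per discrete action (4 actions); the helper names are hypothetical, and the model Stable Baselines actually builds may differ in detail (for example, DQN also keeps a separate target network).

```python
def mlp_layer_shapes(input_dim, net_arch, output_dim):
    """Return (in_features, out_features) for each fully connected layer."""
    sizes = [input_dim] + list(net_arch) + [output_dim]
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(layer_shapes):
    """Weights (in * out) plus biases (out) for every layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)


# Assumed dimensions: 8 observation values in, 4 discrete actions out.
shapes = mlp_layer_shapes(input_dim=8, net_arch=[64, 64], output_dim=4)
print(shapes)                    # [(8, 64), (64, 64), (64, 4)]
print(count_parameters(shapes))  # 4996
```

Adding a third hidden layer of 64 units adds another 64 * 64 + 64 = 4160 parameters, which is one reason larger networks tend to train more slowly.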

In [8]:
log_dir = "/tmp/gym/"
os.makedirs(log_dir, exist_ok=True)

# Create environment
env = gym.make('LunarLander-v2')
#You can also load other environments like cartpole, MountainCar, Acrobot. Refer to https://gym.openai.com/docs/ for descriptions.
#For example, if you would like to load Cartpole, just replace the above statement with "env = gym.make('CartPole-v1')".

env = stable_baselines3.common.monitor.Monitor(env, log_dir )

callback = EvalCallback(env,log_path = log_dir, deterministic=True) #For evaluating the performance of the agent periodically and logging the results.
policy_kwargs = dict(activation_fn=torch.nn.ReLU,
                     net_arch=nn_layers)
model = DQN("MlpPolicy", env,policy_kwargs = policy_kwargs,
            learning_rate=learning_rate,
            batch_size=1,  #for simplicity, we are not doing batch update.
            buffer_size=1, #size of the experience replay buffer. Set to 1 because batch updates are not done
            learning_starts=1, #learning starts immediately!
            gamma=0.99, #discount factor. range is between 0 and 1.
            tau = 1,  #the soft update coefficient for updating the target network
            target_update_interval=1, #update the target network immediately.
            train_freq=(1,"step"), #train the network at every step.
            max_grad_norm = 10, #the maximum value for the gradient clipping
            exploration_initial_eps = 1, #initial value of random action probability
            exploration_fraction = 0.5, #fraction of entire training period over which the exploration rate is reduced
            gradient_steps = 1, #number of gradient steps
            seed = 1, #seed for the pseudo random generators
            verbose=0) #Set verbose to 1 to observe training logs. We encourage you to set the verbose to 1.

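# You can also experiment with other RL algorithms like A2C, PPO, DDPG etc. Refer to  https://stable-baselines3.readthedocs.io/en/master/guide/examples.html
# for documentation. Note that several arguments above (e.g. the exploration_* settings and target_update_interval) are specific to DQN
# and must be removed or changed for other algorithms, and that DDPG needs a continuous action space (e.g. "LunarLanderContinuous-v2").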
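
The `exploration_initial_eps` and `exploration_fraction` settings above describe how the probability of taking a random action shrinks during training. The sketch below is one way to picture that schedule, assuming a linear decay and a final value of 0.05 (which I believe is the Stable Baselines3 default for `exploration_final_eps`, but treat it as an assumption). `linear_epsilon` is a hypothetical helper, not a library function.

```python
def linear_epsilon(step, total_steps, initial_eps=1.0, final_eps=0.05, fraction=0.5):
    """Random-action probability at a given training step.

    Epsilon decays linearly from initial_eps to final_eps over the first
    `fraction` of training, then stays at final_eps.
    """
    progress = step / total_steps
    if progress >= fraction:
        return final_eps
    return initial_eps + (progress / fraction) * (final_eps - initial_eps)


# With 20,000 training steps, the decay finishes at step 10,000.
for step in (0, 5000, 10000, 20000):
    print(step, round(linear_epsilon(step, total_steps=20000), 3))
```

Under these assumptions the agent acts entirely at random at the start and mostly greedily during the second half of training.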

Here's a video of how it does before training:

In [9]:
test_env = wrap_env(gym.make("LunarLander-v2"))
observation = test_env.reset()
total_reward = 0
while True:
  test_env.render()
  action, states = model.predict(observation, deterministic=True)
  observation, reward, done, info = test_env.step(action)
  total_reward += reward
  if done:
    break

# print(total_reward)
test_env.close()
show_video()

Train it...

In [16]:
model.learn(total_timesteps=20000, log_interval=10, callback=callback)   # 10,000 time steps takes around a minute
# With verbose=1, training statistics are printed every 10 episodes (log_interval). Change log_interval to 1 if you wish to
# view the performance at every training episode.

Eval num_timesteps=9000, episode_reward=-64.39 +/- 78.77
Episode length: 257.80 +/- 35.32
New best mean reward!
Eval num_timesteps=19000, episode_reward=-98.36 +/- 59.23
Episode length: 919.00 +/- 162.00


<stable_baselines3.dqn.dqn.DQN at 0x7fa3608b1a90>
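
The `episode_reward=... +/- ...` lines above summarize several evaluation episodes at once. As a rough sketch of how such a summary can be computed, assuming it is the mean and (population) standard deviation of the per-episode total rewards, consider the following. The reward values are made up for illustration and are not taken from the run above.

```python
import statistics


def summarize_episode_rewards(rewards):
    """Mean and population standard deviation of per-episode total rewards."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return mean, std


# Hypothetical total rewards from five evaluation episodes.
example_rewards = [-120.0, -30.0, 15.0, -95.0, -60.0]
mean, std = summarize_episode_rewards(example_rewards)
print(f"episode_reward={mean:.2f} +/- {std:.2f}")
```

A large spread relative to the mean, as in the logged output, suggests that a single evaluation episode (or a single video) says little about how good the agent really is.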

Now here's its performance after training:

In [17]:
test_env = wrap_env(gym.make("LunarLander-v2"))
observation = test_env.reset()
total_reward = 0
while True:
  test_env.render()
  action, states = model.predict(observation, deterministic=True)
  observation, reward, done, info = test_env.step(action)
  total_reward += reward
  if done:
    break

# print(total_reward)
test_env.close()
show_video()

Try playing around with the algorithm / hyperparameters / training time / environment choice to see how the final agent's performance varies!

For other simple examples from the Stable Baselines documentation, see [here](https://stable-baselines.readthedocs.io/en/master/guide/quickstart.html) and [here](https://stable-baselines.readthedocs.io/en/master/guide/examples.html).

-------------------

## 3. Importing pre-trained RL agents

Training your own RL agents can be hard and time-consuming. Is there an easy way to import pre-trained agents?

Fortunately, there are a number of ways to do this. One way is to import models others have uploaded to [Hugging Face](https://huggingface.co/), which among other things is a platform for sharing machine learning models.

[Here](https://huggingface.co/blog/sb3) is a Hugging Face blog post explaining how to import a pre-trained RL agent that uses one of the algorithms included in Stable Baselines. 

Run the code below, which I essentially took from that blog post, to import an RL agent that was trained on the Bipedal Walker task.

In [26]:
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO, DDPG

Download the model.

In [42]:
# Retrieve the model from the hub
## repo_id = id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name})
## filename = name of the model zip file from the repository including the extension .zip
checkpoint = load_from_hub(
    repo_id="sb3/ddpg-BipedalWalker-v3",
    filename="ddpg-BipedalWalker-v3.zip",
)
model = DDPG.load(checkpoint)


# This step is optional. Assigning an environment to the model
# (which by default doesn't have one) lets us continue training it if we want.
env = gym.make("BipedalWalker-v3")
model.set_env(env)
print(model.get_env())

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
<stable_baselines3.common.vec_env.dummy_vec_env.DummyVecEnv object at 0x7fa35d1acb90>


Evaluate and visualize the model's performance.

In [47]:
test_env = wrap_env(gym.make("BipedalWalker-v3"))
observation = test_env.reset()
total_reward = 0
while True:
  test_env.render()
  action, states = model.predict(observation, deterministic=True)
  observation, reward, done, info = test_env.step(action)
  total_reward += reward
  if done:
    break

# print(total_reward)
test_env.close()
show_video()

You can find other RL agents that have been trained on various Gym environments at [this link](https://huggingface.co/sb3). See if you can import a different agent, or an agent trained on a different environment.

After we've imported a model, we can also do other things with it. For example, we can keep training it:

In [45]:
model.learn(total_timesteps=100)   # a very short run; 10,000 time steps takes around a minute

<stable_baselines3.ddpg.ddpg.DDPG at 0x7fa3613de490>

But be careful: it's quite possible to make the performance worse by doing this!
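
One way to guard against this is to evaluate before and after the extra training and keep whichever version scores better. The sketch below shows that pattern in plain Python. `fine_tune_with_rollback`, `train_fn`, and `eval_fn` are hypothetical names, and the "model" is just a dictionary so that the example runs on its own.

```python
import copy


def fine_tune_with_rollback(model, train_fn, eval_fn):
    """Fine-tune a copy of `model`; keep it only if it scores at least as well."""
    score_before = eval_fn(model)
    candidate = copy.deepcopy(model)   # leave the original untouched
    train_fn(candidate)                # mutates the candidate in place
    score_after = eval_fn(candidate)
    if score_after >= score_before:
        return candidate, score_after
    return model, score_before


# Toy demonstration: the "model" is a dict holding a single number.
def harmful_training(m):
    m["skill"] -= 1.0


def helpful_training(m):
    m["skill"] += 1.0


def score(m):
    return m["skill"]


toy_model = {"skill": 3.0}
print(fine_tune_with_rollback(toy_model, harmful_training, score))  # original kept
print(fine_tune_with_rollback(toy_model, helpful_training, score))  # fine-tuned copy kept
```

With a real Stable Baselines3 agent, the same idea could be approximated by saving the model with `model.save(...)` before calling `model.learn(...)`, and scoring it with `evaluate_policy` from `stable_baselines3.common.evaluation`; since evaluation is noisy, averaging over many episodes is advisable.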