<a href="https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/2_gym_wrappers_saving_loading.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines3 Tutorial - Gym wrappers, saving and loading models

Github repo: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3/

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo


## Introduction

In this notebook, you will learn how to use *Gym Wrappers*, which make it possible to add monitoring, normalization, step limits, feature augmentation, and more.


You will also see the *loading* and *saving* functions, and how to read the output files for possible exporting.

## Install Dependencies and Stable Baselines3 Using Pip

In [0]:
# !apt install swig
# !pip install stable-baselines3[extra]

In [1]:
import os
import sys
path_to_curr_file=os.path.realpath(os.getcwd())
proj_root=os.path.dirname(os.path.dirname(path_to_curr_file))
if proj_root not in sys.path:
    sys.path.insert(0,proj_root)
import stable_baselines3
stable_baselines3.__version__

'0.11.0a0'

In [2]:
import gym
from stable_baselines3 import A2C, SAC, PPO, TD3

# Saving and loading

Saving and loading stable-baselines models is straightforward: you can directly call `.save()` and `.load()` on the models.

In [11]:
import os
output_dir = os.path.join(os.path.expanduser('~'),'share','Data','study','sbl3')
# Create save dir
save_dir = os.path.join(output_dir,"tmp/gym/")
os.makedirs(save_dir, exist_ok=True)

model = PPO('MlpPolicy', 'Pendulum-v0', verbose=0).learn(8000)
# The model will be saved under PPO_tutorial.zip
model.save(save_dir + "/PPO_tutorial")

# sample an observation from the environment
obs = model.env.observation_space.sample()

# Check prediction before saving
print("pre saved", model.predict(obs, deterministic=True))

del model # delete trained model to demonstrate loading

loaded_model = PPO.load(save_dir + "/PPO_tutorial")
# Check that the prediction is the same after loading (for the same observation)
print("loaded", loaded_model.predict(obs, deterministic=True))

pre saved (array([0.06651051], dtype=float32), None)
loaded (array([0.06651051], dtype=float32), None)


Saving in stable-baselines is quite powerful, because the training hyperparameters are saved together with the current weights. In practice, this means you can simply load a custom model, without redefining the parameters, and continue learning.

The loading function can also update the model's class variables when loading.

In [12]:
import os
from stable_baselines3.common.vec_env import DummyVecEnv

# Create save dir
save_dir = os.path.join(output_dir,"tmp/gym/")
os.makedirs(save_dir, exist_ok=True)

model = A2C('MlpPolicy', 'Pendulum-v0', verbose=0, gamma=0.9, n_steps=20).learn(8000)
# The model will be saved under A2C_tutorial.zip
model.save(save_dir + "/A2C_tutorial")

del model # delete trained model to demonstrate loading

# load the model, and when loading set verbose to 1
loaded_model = A2C.load(save_dir + "/A2C_tutorial", verbose=1)

# show the saved hyperparameters
print("loaded:", "gamma =", loaded_model.gamma, "n_steps =", loaded_model.n_steps)

# as the environment is not serializable, we need to set a new instance of the environment
loaded_model.set_env(DummyVecEnv([lambda: gym.make('Pendulum-v0')]))
# and continue training
loaded_model.learn(8000)

loaded: gamma = 0.9 n_steps = 20


<stable_baselines3.a2c.a2c.A2C at 0x7f782284ec90>

# Gym and VecEnv wrappers

## Anatomy of a gym wrapper

A gym wrapper follows the [gym](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html) interface: it has a `reset()` method and a `step()` method.

Because a wrapper sits *around* an environment, the wrapped environment is accessible as `self.env`, which makes it easy to interact with it without modifying the original env.
Many wrappers are predefined; for a complete list, refer to the [gym documentation](https://github.com/openai/gym/tree/master/gym/wrappers).

In [13]:
class CustomWrapper(gym.Wrapper):
    """
    :param env: (gym.Env) Gym environment that will be wrapped
    """
    def __init__(self, env):
        # Call the parent constructor, so we can access self.env later
        super(CustomWrapper, self).__init__(env)
  
    def reset(self):
        """
        Reset the environment 
        """
        obs = self.env.reset()
        return obs

    def step(self, action):
        """
        :param action: ([float] or int) Action taken by the agent
        :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional information
        """
        obs, reward, done, info = self.env.step(action)
        return obs, reward, done, info
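
The delegation pattern described above can be sketched without gym at all. The classes below (`ToyEnv`, `ToyWrapper`, `DoubleReward`) are hypothetical stand-ins invented for this illustration, not part of gym or Stable-Baselines3; they only mimic the `reset()`/`step()` interface to show how wrappers forward calls to `self.env` and how they can be stacked.

```python
class ToyEnv:
    """Hypothetical stand-in for a gym.Env: counts up to 3, reward 1.0 per step."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        done = self.state >= 3
        return self.state, 1.0, done, {}


class ToyWrapper:
    """Mimics the role of gym.Wrapper: keeps the wrapped env in self.env."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


class DoubleReward(ToyWrapper):
    """Only step() is overridden; reset() is inherited and simply forwarded."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, 2.0 * reward, done, info


# Wrappers compose: each one only talks to the env (or wrapper) directly below it
env = DoubleReward(DoubleReward(ToyEnv()))
obs = env.reset()
obs, reward, done, info = env.step(action=0)
print(obs, reward, done)  # 1 4.0 False
```

The innermost environment is never modified: the reward is doubled twice on the way out, while the observation and `done` flag pass through untouched.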


## First example: limit the episode length

One practical use case of a wrapper is limiting the number of steps per episode. To do that, you need to overwrite the `done` signal when the limit is reached. It is also good practice to pass that information in the `info` dictionary.

In [14]:
class TimeLimitWrapper(gym.Wrapper):
    """
    :param env: (gym.Env) Gym environment that will be wrapped
    :param max_steps: (int) Max number of steps per episode
    """
    def __init__(self, env, max_steps=100):
        # Call the parent constructor, so we can access self.env later
        super(TimeLimitWrapper, self).__init__(env)
        self.max_steps = max_steps
        # Counter of steps per episode
        self.current_step = 0
  
    def reset(self):
        """
        Reset the environment 
        """
        # Reset the counter
        self.current_step = 0
        return self.env.reset()

    def step(self, action):
        """
        :param action: ([float] or int) Action taken by the agent
        :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional information
        """
        self.current_step += 1
        obs, reward, done, info = self.env.step(action)
        # Overwrite the done signal when the step limit is reached
        if self.current_step >= self.max_steps:
            done = True
            # Update the info dict to signal that the limit was exceeded
            info['time_limit_reached'] = True
        return obs, reward, done, info


#### Test the wrapper

In [15]:
from gym.envs.classic_control.pendulum import PendulumEnv

# Here we create the environment directly, because gym.make() would otherwise already wrap the environment in a TimeLimit wrapper
env = PendulumEnv()
# Wrap the environment
env = TimeLimitWrapper(env, max_steps=100)

In [16]:
obs = env.reset()
done = False
n_steps = 0
while not done:
    # Take random actions
    random_action = env.action_space.sample()
    obs, reward, done, info = env.step(random_action)
    n_steps += 1

print(n_steps, info)

100 {'time_limit_reached': True}


In practice, `gym` already has a wrapper for that, named `TimeLimit` (`gym.wrappers.TimeLimit`), which is used by most environments.

## Second example: normalize actions

It is usually a good idea to normalize observations and actions before giving them to the agent; this prevents [hard-to-debug issues](https://github.com/hill-a/stable-baselines/issues/473).

In this example, we are going to normalize the action space of *Pendulum-v0* so it lies in [-1, 1] instead of [-2, 2].

Note: here we are dealing with continuous actions, hence the `gym.spaces.Box` space.

In [17]:
import numpy as np

class NormalizeActionWrapper(gym.Wrapper):
    """
    :param env: (gym.Env) Gym environment that will be wrapped
    """
    def __init__(self, env):
        # Retrieve the action space
        action_space = env.action_space
        assert isinstance(action_space, gym.spaces.Box), "This wrapper only works with continuous action space (spaces.Box)"
        # Retrieve the max/min values
        self.low, self.high = action_space.low, action_space.high

        # We modify the action space, so all actions will lie in [-1, 1]
        env.action_space = gym.spaces.Box(low=-1, high=1, shape=action_space.shape, dtype=np.float32)

        # Call the parent constructor, so we can access self.env later
        super(NormalizeActionWrapper, self).__init__(env)
  
    def rescale_action(self, scaled_action):
        """
        Rescale the action from [-1, 1] to [low, high]
        (no need for symmetric action space)
        :param scaled_action: (np.ndarray)
        :return: (np.ndarray)
        """
        return self.low + (0.5 * (scaled_action + 1.0) * (self.high -  self.low))

    def reset(self):
        """
        Reset the environment 
        """
        # Nothing to reset in this wrapper, simply forward the call
        return self.env.reset()

    def step(self, action):
        """
        :param action: (np.ndarray) Action taken by the agent, scaled to [-1, 1]
        :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional information
        """
        # Rescale action from [-1, 1] to original [low, high] interval
        rescaled_action = self.rescale_action(action)
        obs, reward, done, info = self.env.step(rescaled_action)
        return obs, reward, done, info
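
The affine mapping used by `rescale_action` can be checked with plain arithmetic. The sketch below restates the same formula as a standalone function and adds an inverse mapping; `scale_action` and the asymmetric bounds `[0, 10]` are assumptions added for illustration and do not appear in the wrapper above.

```python
def rescale_action(scaled_action, low, high):
    """Map an action from [-1, 1] to [low, high] (same formula as the wrapper)."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)


def scale_action(action, low, high):
    """Inverse mapping, from [low, high] back to [-1, 1] (illustrative helper)."""
    return 2.0 * (action - low) / (high - low) - 1.0


# Pendulum bounds quoted in the text: [-2, 2]
print(rescale_action(-1.0, -2.0, 2.0))  # -2.0
print(rescale_action(0.0, -2.0, 2.0))   # 0.0
print(rescale_action(1.0, -2.0, 2.0))   # 2.0

# Asymmetric bounds (made up for illustration): [0, 10]
print(rescale_action(0.0, 0.0, 10.0))   # 5.0
print(scale_action(5.0, 0.0, 10.0))     # 0.0
```

The asymmetric case shows why the formula does not require a symmetric action space: the midpoint of `[-1, 1]` always lands on the midpoint of `[low, high]`.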


#### Test before rescaling actions

In [18]:
original_env = gym.make("Pendulum-v0")

print(original_env.action_space.low)
for _ in range(10):
    print(original_env.action_space.sample())

[-2.]
[1.9651746]
[-1.0854412]
[-1.1056249]
[1.3483766]
[1.8395236]
[0.31834376]
[-1.8840069]
[-0.1397737]
[1.8065475]
[1.9078236]


#### Test the NormalizeAction wrapper

In [19]:
env = NormalizeActionWrapper(gym.make("Pendulum-v0"))

print(env.action_space.low)

for _ in range(10):
    print(env.action_space.sample())

[-1.]
[-0.9651875]
[-0.07952247]
[-0.6028893]
[0.7646301]
[0.65237325]
[0.4377256]
[-0.0738201]
[0.23016843]
[0.10372037]
[0.6582585]


#### Test with a RL algorithm

We are going to use the `Monitor` wrapper of stable baselines, which allows monitoring training stats (mean episode reward, mean episode length).

In [20]:
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv

In [21]:
env = Monitor(gym.make('Pendulum-v0'))
env = DummyVecEnv([lambda: env])

In [22]:
model = A2C("MlpPolicy", env, verbose=1).learn(int(1000))

Using cuda device


With the action wrapper

In [23]:
normalized_env = Monitor(gym.make('Pendulum-v0'))
# Note that we can use multiple wrappers
normalized_env = NormalizeActionWrapper(normalized_env)
normalized_env = DummyVecEnv([lambda: normalized_env])

In [24]:
model_2 = A2C("MlpPolicy", normalized_env, verbose=1).learn(int(1000))

Using cuda device


## Additional wrappers: VecEnvWrappers

In the same vein as gym wrappers, stable baselines provides wrappers for `VecEnv`. Among the existing ones (and you can create your own), you should know:

- VecNormalize: it computes a running mean and standard deviation to normalize observations and rewards
- VecFrameStack: it stacks several consecutive observations (useful to integrate time in the observation, e.g. successive frames of an Atari game)

More info in the [documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#wrappers)
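
To make the idea of frame stacking concrete, here is a minimal sketch of the buffering logic using only the standard library. `ToyFrameStack` is a hypothetical class written for this illustration; it is not the `VecFrameStack` implementation, and the padding choice at reset (copies of the first observation) is an assumption of the sketch.

```python
from collections import deque


class ToyFrameStack:
    """Keeps the last n_stack observations, oldest first (simplified sketch)."""

    def __init__(self, n_stack):
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_obs):
        # Assumption: pad the buffer with copies of the first observation
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)
        return list(self.frames)

    def step(self, obs):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(obs)
        return list(self.frames)


stack = ToyFrameStack(n_stack=3)
print(stack.reset(0))  # [0, 0, 0]
print(stack.step(1))   # [0, 0, 1]
print(stack.step(2))   # [0, 1, 2]
print(stack.step(3))   # [1, 2, 3]
```

With several consecutive observations visible at once, the agent can infer quantities such as velocity that a single frame does not carry.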

Note: when using the `VecNormalize` wrapper, you must save the running mean and std along with the model, otherwise you will not get proper results when loading the agent again. If you use the [rl zoo](https://github.com/DLR-RM/rl-baselines3-zoo), this is done automatically.
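
The reason the statistics must be saved can be illustrated with a small, library-free sketch. `RunningMeanStd` below is a simplified scalar version written for this example (Welford's online algorithm); it is an assumption-laden stand-in, not the code used by `VecNormalize`, and the JSON round trip only mimics the idea of persisting the statistics next to the model.

```python
import json
import math


class RunningMeanStd:
    """Scalar running mean/variance using Welford's online algorithm."""

    def __init__(self, count=0, mean=0.0, m2=0.0):
        self.count, self.mean, self.m2 = count, mean, m2

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Population variance; defaults to 1.0 before any data is seen
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, x, epsilon=1e-8):
        return (x - self.mean) / math.sqrt(self.var + epsilon)


stats = RunningMeanStd()
for value in [1.0, 2.0, 3.0, 4.0]:
    stats.update(value)
print(stats.mean, stats.var)  # 2.5 1.25

# Persist the statistics (here as a JSON string), then restore them
saved = json.dumps({"count": stats.count, "mean": stats.mean, "m2": stats.m2})
restored = RunningMeanStd(**json.loads(saved))
print(restored.normalize(4.0) == stats.normalize(4.0))  # True

# A fresh normalizer has different statistics, hence different outputs
fresh = RunningMeanStd()
print(fresh.normalize(4.0) == stats.normalize(4.0))  # False
```

An agent trained on inputs normalized with `stats` would receive differently scaled inputs from `fresh`, which is why the running statistics belong with the saved model.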

In [25]:
from stable_baselines3.common.vec_env import VecNormalize, VecFrameStack

env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
normalized_vec_env = VecNormalize(env)

In [26]:
obs = normalized_vec_env.reset()
for _ in range(10):
    action = [normalized_vec_env.action_space.sample()]
    obs, reward, _, _ = normalized_vec_env.step(action)
    print(obs, reward)

[[-0.00527694  0.00616744  0.00057067]] [-1.999992]
[[-0.65529925 -0.5405193   0.99902487]] [-1.2315251]
[[-1.2173524 -1.1682727  1.2077568]] [-0.9157799]
[[-1.5202762 -1.5309641  1.4869303]] [-0.74434817]
[[-1.6348554 -1.6861827  1.4533983]] [-0.65050536]
[[-1.7181369 -1.8215219  1.5987649]] [-0.58862704]
[[-1.7070214 -1.8726633  1.4195371]] [-0.558044]
[[-1.6502717 -1.8903327  1.2630758]] [-0.5252843]
[[-1.5862261 -1.9403621  1.4296852]] [-0.4952729]
[[-1.4697304 -1.9853543  1.5101444]] [-0.48356533]


## Exercise: code your own monitor wrapper

Now that you know how a wrapper works and what you can do with it, it's time to experiment.

The goal here is to create a wrapper that will monitor the training progress, storing both the episode reward (sum of rewards for one episode) and episode length (number of steps in the last episode).

You will return those values using the `info` dict at the end of each episode.

In [32]:
class MyMonitorWrapper(gym.Wrapper):
    """
    :param env: (gym.Env) Gym environment that will be wrapped
    """
    def __init__(self, env):
        # Call the parent constructor, so we can access self.env later
        super(MyMonitorWrapper, self).__init__(env)
        # === YOUR CODE HERE ===#
        # Initialize the variables that will be used
        # to store the episode length and episode reward
        self.episode_length = 0
        self.episode_reward = 0
        # ====================== #
  
    def reset(self):
        """
        Reset the environment 
        """
        obs = self.env.reset()
        # === YOUR CODE HERE ===#
        # Reset the variables
        self.episode_length = 0
        self.episode_reward = 0
        # ====================== #
        return obs

    def step(self, action):
        """
        :param action: ([float] or int) Action taken by the agent
        :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional information
        """
        obs, reward, done, info = self.env.step(action)
        # === YOUR CODE HERE ===#
        # Update the current episode reward and episode length
        self.episode_length += 1
        self.episode_reward += reward
        # ====================== #

        if done:
            # === YOUR CODE HERE ===#
            # Store the episode length and episode reward in the info dict
            info.update({'episode_length':self.episode_length,'episode_reward':self.episode_reward})
            # ====================== #
        return obs, reward, done, info

#### Test your wrapper

In [0]:
# To use LunarLander, you need to install box2d box2d-kengz (pip) and swig (apt-get)
# !pip install box2d box2d-kengz

In [33]:
env = gym.make("LunarLander-v2")
# === YOUR CODE HERE ===#
# Wrap the environment
m_env = MyMonitorWrapper(env)
# Reset the environment
m_env.reset()
# Take random actions in the environment and check
# that it returns the correct values after the end of each episode
done = False
while not done:
    act=m_env.action_space.sample()
    obs, reward, done, info = m_env.step(act)
    print(f"reward:{reward}, info: {info}")
# ====================== #

reward:1.4683554209927945, info: {}
reward:-0.27325179125492016, info: {}
reward:-2.085364526437206, info: {}
reward:1.5233190924687403, info: {}
reward:1.4249219925166312, info: {}
reward:1.537682636871723, info: {}
reward:0.14881778498312428, info: {}
reward:0.7509820120052666, info: {}
reward:-0.7565978232524515, info: {}
reward:-1.3459700215774604, info: {}
reward:0.6791455570620997, info: {}
reward:1.5111778460104983, info: {}
reward:-0.4291856617637382, info: {}
reward:-2.7437364727416194, info: {}
reward:-2.771581491492088, info: {}
reward:-2.8573061101489143, info: {}
reward:-3.1496797179432745, info: {}
reward:0.6139052256416164, info: {}
reward:-3.0232326499184863, info: {}
reward:-0.3846756158179676, info: {}
reward:-4.302866561995432, info: {}
reward:1.581184446899216, info: {}
reward:0.6199706571430283, info: {}
reward:-0.3220596598479222, info: {}
reward:-2.6510550827109, info: {}
reward:1.453962956933226, info: {}
reward:0.5403078956290983, info: {}
reward:-1.64023521736

# Conclusion

In this notebook, we have seen:
- how to easily save and load a model
- what a wrapper is and what we can do with it
- how to create your own wrapper

## Wrapper Bonus: changing the observation space: a wrapper for episodes of fixed length

In [0]:
from gym.wrappers import TimeLimit

class TimeFeatureWrapper(gym.Wrapper):
    """
    Add remaining time to observation space for fixed length episodes.
    See https://arxiv.org/abs/1712.00378 and https://github.com/aravindr93/mjrl/issues/13.

    :param env: (gym.Env)
    :param max_steps: (int) Max number of steps of an episode
        if it is not wrapped in a TimeLimit object.
    :param test_mode: (bool) In test mode, the time feature is constant,
        equal to one. This allows checking that the agent did not overfit this feature,
        learning a deterministic pre-defined sequence of actions.
    """
    def __init__(self, env, max_steps=1000, test_mode=False):
        assert isinstance(env.observation_space, gym.spaces.Box)
        # Add a time feature to the observation
        low, high = env.observation_space.low, env.observation_space.high
        low, high = np.concatenate((low, [0])), np.concatenate((high, [1.]))
        env.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

        super(TimeFeatureWrapper, self).__init__(env)

        if isinstance(env, TimeLimit):
            self._max_steps = env._max_episode_steps
        else:
            self._max_steps = max_steps
        self._current_step = 0
        self._test_mode = test_mode

    def reset(self):
        self._current_step = 0
        return self._get_obs(self.env.reset())

    def step(self, action):
        self._current_step += 1
        obs, reward, done, info = self.env.step(action)
        return self._get_obs(obs), reward, done, info

    def _get_obs(self, obs):
        """
        Concatenate the time feature to the current observation.

        :param obs: (np.ndarray)
        :return: (np.ndarray)
        """
        # Remaining time is more general
        time_feature = 1 - (self._current_step / self._max_steps)
        if self._test_mode:
            time_feature = 1.0
        # Optionally: concatenate [time_feature, time_feature ** 2]
        return np.concatenate((obs, [time_feature]))
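
The value appended by `_get_obs` is a linear countdown from 1 to 0 over the episode. The standalone function below restates that formula so the numbers can be checked by hand; `time_feature` as a free function is an illustrative helper, not part of the wrapper.

```python
def time_feature(current_step, max_steps, test_mode=False):
    """Remaining-time feature, mirroring the formula used in _get_obs."""
    if test_mode:
        # Constant value, as in the wrapper code
        return 1.0
    return 1 - (current_step / max_steps)


max_steps = 1000  # default value of the wrapper
for step in (0, 250, 500, 1000):
    print(step, time_feature(step, max_steps))
# 0 1.0
# 250 0.75
# 500 0.5
# 1000 0.0
```

Because the feature always stays in `[0, 1]`, the observation space bounds only need one extra dimension with `low=0` and `high=1`, which is what the constructor adds.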

## Going further - Saving format 

The format for saving and loading models is a zip archive that contains a JSON dump of the class parameters and the serialized PyTorch objects:
```
saved_model.zip/
├── data                        JSON file of class-parameters (dictionary)
├── pytorch_variables.pth       Additional PyTorch variables
├── policy.pth                  PyTorch state dictionary of the policy
├── policy.optimizer.pth        PyTorch state dictionary of the policy optimizer
├── _stable_baselines3_version  SB3 version with which the model was saved
```

## Save and find 

In [34]:
# Create save dir
save_dir = os.path.join(output_dir,"tmp/gym/")
os.makedirs(save_dir, exist_ok=True)

model = PPO('MlpPolicy', 'Pendulum-v0', verbose=0).learn(8000)
model.save(save_dir + "/PPO_tutorial")

In [36]:
!ls {save_dir+"PPO_tutorial*"}

/home/gkoren2/share/Data/study/sbl3/tmp/gym/PPO_tutorial.zip


In [37]:
import zipfile

archive = zipfile.ZipFile(save_dir+"PPO_tutorial.zip", 'r')
for f in archive.filelist:
    print(f.filename)

data
pytorch_variables.pth
policy.pth
policy.optimizer.pth
_stable_baselines3_version
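
Individual entries can be read straight from the archive with the standard library. The sketch below assumes that `data` is a JSON file, as described in the saving format section. To stay runnable without a trained model, it builds a stand-in archive in memory; the keys and values written into it are invented for illustration and are not taken from a real saved model.

```python
import io
import json
import zipfile

# Build a stand-in archive in memory; the contents below are invented
# for illustration and do not come from a real saved model.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data", json.dumps({"gamma": 0.99, "n_steps": 2048}))
    archive.writestr("_stable_baselines3_version", "0.11.0a0")

# Read the entries back the same way one would with a file on disk
with zipfile.ZipFile(buffer, "r") as archive:
    names = archive.namelist()
    data = json.loads(archive.read("data").decode("utf-8"))
    version = archive.read("_stable_baselines3_version").decode("utf-8")

print(names)    # ['data', '_stable_baselines3_version']
print(data)     # {'gamma': 0.99, 'n_steps': 2048}
print(version)  # 0.11.0a0
```

With a real model, `buffer` would be replaced by the path to the saved zip file. Some entries of a real `data` file may hold encoded objects rather than plain values, so they may need extra decoding before use.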


## Exporting saved models

And finally, some further reading for those who want to export to TensorFlow.js or Java.

https://stable-baselines.readthedocs.io/en/master/guide/export.html