<a href="https://colab.research.google.com/github/iskra3138/stable-baselines/blob/main/stable_baselines_2_gym_wrappers_saving_loading.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines Tutorial - Gym wrappers, saving and loading models

Github repo: https://github.com/araffin/rl-tutorial-jnrr19

Stable-Baselines: https://github.com/hill-a/stable-baselines

Documentation: https://stable-baselines.readthedocs.io/en/master/

RL Baselines zoo: https://github.com/araffin/rl-baselines-zoo


## Introduction

In this notebook, you will learn how to use *Gym wrappers*, which make it possible to monitor training, normalize inputs and outputs, limit the number of steps per episode, augment features, and more.

You will also see the *saving* and *loading* functions, and how to read the output files for possible exporting.

## Install Dependencies and Stable Baselines Using Pip

In [None]:
# Stable Baselines only supports tensorflow 1.x for now
%tensorflow_version 1.x
!apt install swig
!pip install stable-baselines[mpi]==2.10.0

Reading package lists... Done
Building dependency tree       
Reading state information... Done
swig is already the newest version (3.0.12-1).
0 upgraded, 0 newly installed, 0 to remove and 6 not upgraded.


In [None]:
import gym
from stable_baselines import A2C, SAC, PPO2, TD3

# Saving and loading

Saving and loading stable-baselines models is straightforward: you can call `.save()` and `.load()` directly on the models.

In [None]:
import os

# Create save dir
save_dir = "/tmp/gym/"
os.makedirs(save_dir, exist_ok=True)

model = PPO2('MlpPolicy', 'Pendulum-v0', verbose=0).learn(8000)
# The model will be saved under PPO2_tutorial.zip
model.save(save_dir + "/PPO2_tutorial")

# sample an observation from the environment
obs = model.env.observation_space.sample()

# Check prediction before saving
print("pre saved", model.predict(obs, deterministic=True))

del model # delete trained model to demonstrate loading

loaded_model = PPO2.load(save_dir + "/PPO2_tutorial")
# Check that the prediction is the same after loading (for the same observation)
print("loaded", loaded_model.predict(obs, deterministic=True))

pre saved (array([-0.19883569], dtype=float32), None)
loaded (array([-0.19883569], dtype=float32), None)


Saving in stable-baselines is quite powerful: the training hyperparameters are saved together with the current weights. In practice, this means you can load a custom model and continue learning without redefining its parameters.

Keyword arguments passed to `.load()` can also update the model's attributes (for example `verbose`) at loading time.

In [None]:
import os
from stable_baselines.common.vec_env import DummyVecEnv

# Create save dir
save_dir = "/tmp/gym/"
os.makedirs(save_dir, exist_ok=True)

model = A2C('MlpPolicy', 'Pendulum-v0', verbose=0, gamma=0.9, n_steps=20).learn(8000)
# The model will be saved under A2C_tutorial.zip
model.save(save_dir + "/A2C_tutorial")

del model # delete trained model to demonstrate loading

# load the model, and when loading set verbose to 1
loaded_model = A2C.load(save_dir + "/A2C_tutorial", verbose=1)

# show the save hyperparameters
print("loaded:", "gamma =", loaded_model.gamma, "n_steps =", loaded_model.n_steps)

# the environment is not serializable and is not saved with the model,
# so we need to set a new instance of the environment
loaded_model.set_env(DummyVecEnv([lambda: gym.make('Pendulum-v0')]))
# and continue training
loaded_model.learn(8000)

Loading a model without an environment, this model cannot be trained until it has a valid environment.
loaded: gamma = 0.9 n_steps = 20
---------------------------------
| explained_variance | 0.00487  |
| fps                | 90       |
| nupdates           | 1        |
| policy_entropy     | 1.42     |
| total_timesteps    | 20       |
| value_loss         | 2.56e+03 |
---------------------------------
---------------------------------
| explained_variance | 0.003    |
| fps                | 893      |
| nupdates           | 100      |
| policy_entropy     | 1.42     |
| total_timesteps    | 2000     |
| value_loss         | 2.46e+03 |
---------------------------------
---------------------------------
| explained_variance | 0.0093   |
| fps                | 948      |
| nupdates           | 200      |
| policy_entropy     | 1.42     |
| total_timesteps    | 4000     |
| value_loss         | 2.19e+03 |
---------------------------------
---------------------------------
| explained_va

<stable_baselines.a2c.a2c.A2C at 0x7f2f317d3240>
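
To make the idea behind `A2C.load(..., verbose=1)` more concrete, here is a minimal, library-free sketch of a model that stores its hyperparameters next to its weights and lets keyword arguments override the stored values at load time. `ToyModel`, its JSON layout, and the file name are hypothetical and only illustrate the principle; this is not how stable-baselines implements saving.

```python
import json
import os
import tempfile


class ToyModel:
    """Hypothetical stand-in for an RL model; not part of stable-baselines."""

    def __init__(self, gamma=0.99, n_steps=5, verbose=0):
        self.gamma = gamma
        self.n_steps = n_steps
        self.verbose = verbose
        self.weights = [0.0, 0.0]  # placeholder for learned parameters

    def save(self, path):
        # Store hyperparameters and weights together in a single file
        payload = {
            "hyperparams": {
                "gamma": self.gamma,
                "n_steps": self.n_steps,
                "verbose": self.verbose,
            },
            "weights": self.weights,
        }
        with open(path, "w") as f:
            json.dump(payload, f)

    @classmethod
    def load(cls, path, **overrides):
        with open(path) as f:
            payload = json.load(f)
        # Keyword arguments given at load time replace the stored values
        hyperparams = {**payload["hyperparams"], **overrides}
        model = cls(**hyperparams)
        model.weights = payload["weights"]
        return model


path = os.path.join(tempfile.mkdtemp(), "toy_model.json")
model = ToyModel(gamma=0.9, n_steps=20)
model.weights = [0.5, -1.25]
model.save(path)

# gamma and n_steps come from the file, verbose is overridden
loaded = ToyModel.load(path, verbose=1)
print(loaded.gamma, loaded.n_steps, loaded.verbose, loaded.weights)
```

The stored hyperparameters are restored without being passed again, while `verbose` takes the value given to `load`.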

# Gym and VecEnv wrappers

## Anatomy of a gym wrapper

A gym wrapper follows the [gym](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html) interface: it has a `reset()` and a `step()` method.

Because a wrapper is *around* an environment, we can access the wrapped environment with `self.env`. This makes it easy to interact with it without modifying the original env.
Many wrappers are predefined; for a complete list, refer to the [gym documentation](https://github.com/openai/gym/tree/master/gym/wrappers).

In [None]:
class CustomWrapper(gym.Wrapper):
  """
  :param env: (gym.Env) Gym environment that will be wrapped
  """
  def __init__(self, env):
    # Call the parent constructor, so we can access self.env later
    super(CustomWrapper, self).__init__(env)
  
  def reset(self):
    """
    Reset the environment 
    """
    obs = self.env.reset()
    return obs

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional informations
    """
    obs, reward, done, info = self.env.step(action)
    return obs, reward, done, info
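
The delegation pattern above can be tried without installing gym. The sketch below uses plain Python classes that only mimic the `reset()`/`step()` interface; `CountingEnv`, `PassThroughWrapper`, and `DoubleRewardWrapper` are hypothetical names made up for this illustration. It shows that wrappers can be stacked and that each one only changes what it needs to.

```python
class CountingEnv:
    """Hypothetical toy environment exposing reset() and step() like a gym env."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        reward = float(action)
        done = self.state >= 3
        return self.state, reward, done, {}


class PassThroughWrapper:
    """Keeps a reference to the wrapped env and forwards every call to it."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)


class DoubleRewardWrapper(PassThroughWrapper):
    """Changes only the reward; everything else is delegated to self.env."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, 2.0 * reward, done, info


# Wrappers can be nested: the outer one wraps the inner one
env = DoubleRewardWrapper(PassThroughWrapper(CountingEnv()))
obs = env.reset()
obs, reward, done, info = env.step(1)
print(obs, reward, done, info)
```

With gym installed, the same structure would be obtained by subclassing `gym.Wrapper`, as in `CustomWrapper`.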


## First example: limit the episode length

One practical use case of a wrapper is limiting the number of steps per episode. To do so, you need to overwrite the `done` signal when the limit is reached. It is also good practice to pass that information in the `info` dictionary.


In [None]:
class TimeLimitWrapper(gym.Wrapper):
  """
  :param env: (gym.Env) Gym environment that will be wrapped
  :param max_steps: (int) Max number of steps per episode
  """
  def __init__(self, env, max_steps=100):
    # Call the parent constructor, so we can access self.env later
    super(TimeLimitWrapper, self).__init__(env)
    self.max_steps = max_steps
    # Counter of steps per episode
    self.current_step = 0
  
  def reset(self):
    """
    Reset the environment 
    """
    # Reset the counter
    self.current_step = 0
    return self.env.reset()

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional informations
    """
    self.current_step += 1
    obs, reward, done, info = self.env.step(action)
    # Overwrite the done signal when the step limit is reached
    if self.current_step >= self.max_steps:
      done = True
      # Update the info dict to signal that the limit was exceeded
      info['time_limit_reached'] = True
    return obs, reward, done, info


#### Test the wrapper

In [None]:
from gym.envs.classic_control.pendulum import PendulumEnv

# Here we create the environment directly because gym.make() would otherwise already wrap the environment in a TimeLimit wrapper
env = PendulumEnv()
# Wrap the environment
env = TimeLimitWrapper(env, max_steps=100)

In [None]:
obs = env.reset()
done = False
n_steps = 0
while not done:
  # Take random actions
  random_action = env.action_space.sample()
  obs, reward, done, info = env.step(random_action)
  n_steps += 1

print(n_steps, info)

100 {'time_limit_reached': True}


In practice, `gym` already has a wrapper for that, named `TimeLimit` (`gym.wrappers.TimeLimit`), which is used by most environments.

## Second example: normalize actions

It is usually a good idea to normalize observations and actions before giving them to the agent; this prevents [hard-to-debug issues](https://github.com/hill-a/stable-baselines/issues/473).

In this example, we are going to normalize the action space of *Pendulum-v0* so that it lies in [-1, 1] instead of [-2, 2].

Note: here we are dealing with continuous actions, hence the `gym.spaces.Box` space.

In [None]:
import numpy as np

class NormalizeActionWrapper(gym.Wrapper):
  """
  :param env: (gym.Env) Gym environment that will be wrapped
  """
  def __init__(self, env):
    # Retrieve the action space
    action_space = env.action_space
    assert isinstance(action_space, gym.spaces.Box), "This wrapper only works with continuous action space (spaces.Box)"
    # Retrieve the max/min values
    self.low, self.high = action_space.low, action_space.high

    # We modify the action space, so all actions will lie in [-1, 1]
    env.action_space = gym.spaces.Box(low=-1, high=1, shape=action_space.shape, dtype=np.float32)

    # Call the parent constructor, so we can access self.env later
    super(NormalizeActionWrapper, self).__init__(env)
  
  def rescale_action(self, scaled_action):
    """
    Rescale the action from [-1, 1] to [low, high]
    (no need for symmetric action space)
    :param scaled_action: (np.ndarray)
    :return: (np.ndarray)
    """
    return self.low + (0.5 * (scaled_action + 1.0) * (self.high - self.low))

  def reset(self):
    """
    Reset the environment 
    """
    # This wrapper keeps no per-episode state, so simply forward the call
    return self.env.reset()

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional informations
    """
    # Rescale action from [-1, 1] to original [low, high] interval
    rescaled_action = self.rescale_action(action)
    obs, reward, done, info = self.env.step(rescaled_action)
    return obs, reward, done, info
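
The affine mapping used by `rescale_action` can be checked by hand on plain floats. The sketch below repeats the same formula outside the wrapper and adds its inverse; `normalize_action` is a hypothetical helper that does not appear in the notebook and is included only to show that the mapping is reversible.

```python
def rescale_action(scaled_action, low, high):
    """Map a value from [-1, 1] to [low, high] (same formula as the wrapper)."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)


def normalize_action(action, low, high):
    """Inverse mapping, from [low, high] back to [-1, 1] (hypothetical helper)."""
    return 2.0 * (action - low) / (high - low) - 1.0


low, high = -2.0, 2.0  # Pendulum-v0 action bounds, as stated in the text
for a in (-1.0, 0.0, 0.5, 1.0):
    print(a, "->", rescale_action(a, low, high))

# The formula does not require a symmetric interval
print(rescale_action(0.0, 0.0, 10.0))
```

For the Pendulum bounds, the endpoints -1 and 1 map to -2 and 2, and 0 maps to 0; for the asymmetric interval [0, 10], 0 maps to the midpoint 5.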


#### Test before rescaling actions

In [None]:
original_env = gym.make("Pendulum-v0")

print(original_env.action_space.low)
for _ in range(10):
  print(original_env.action_space.sample())

[-2.]
[1.1236833]
[0.8284715]
[1.1950923]
[0.38227656]
[0.91574895]
[-0.42847648]
[1.580651]
[1.3893884]
[-1.5455852]
[1.3891894]


#### Test the NormalizeAction wrapper

In [None]:
env = NormalizeActionWrapper(gym.make("Pendulum-v0"))

print(env.action_space.low)

for _ in range(10):
  print(env.action_space.sample())

[-1.]
[0.31806344]
[-0.9056247]
[-0.7164699]
[0.43610492]
[0.31373236]
[0.5897452]
[0.8344767]
[-0.5719794]
[-0.44559914]
[-0.84648055]


#### Test with a RL algorithm

We are going to use the Monitor wrapper of stable baselines, which allows us to monitor training stats (mean episode reward, mean episode length).

In [None]:
from stable_baselines.bench import Monitor
from stable_baselines.common.vec_env import DummyVecEnv

In [None]:
env = Monitor(gym.make('Pendulum-v0'), filename=None, allow_early_resets=True)
env = DummyVecEnv([lambda: env])

In [None]:
model = A2C("MlpPolicy", env, verbose=1).learn(int(1000))

---------------------------------
| explained_variance | 0.00532  |
| fps                | 24       |
| nupdates           | 1        |
| policy_entropy     | 1.42     |
| total_timesteps    | 5        |
| value_loss         | 501      |
---------------------------------
----------------------------------
| ep_len_mean        | 200       |
| ep_reward_mean     | -1.32e+03 |
| explained_variance | -0.0102   |
| fps                | 586       |
| nupdates           | 100       |
| policy_entropy     | 1.42      |
| total_timesteps    | 500       |
| value_loss         | 759       |
----------------------------------
----------------------------------
| ep_len_mean        | 200       |
| ep_reward_mean     | -1.08e+03 |
| explained_variance | 0.0809    |
| fps                | 680       |
| nupdates           | 200       |
| policy_entropy     | 1.42      |
| total_timesteps    | 1000      |
| value_loss         | 35.9      |
----------------------------------


With the action wrapper

In [None]:
normalized_env = Monitor(gym.make('Pendulum-v0'), filename=None, allow_early_resets=True)
# Note that we can use multiple wrappers
normalized_env = NormalizeActionWrapper(normalized_env)
normalized_env = DummyVecEnv([lambda: normalized_env])

In [None]:
model_2 = A2C("MlpPolicy", normalized_env, verbose=1).learn(int(1000))

---------------------------------
| explained_variance | -0.0237  |
| fps                | 26       |
| nupdates           | 1        |
| policy_entropy     | 1.42     |
| total_timesteps    | 5        |
| value_loss         | 166      |
---------------------------------
----------------------------------
| ep_len_mean        | 200       |
| ep_reward_mean     | -1.04e+03 |
| explained_variance | -0.00917  |
| fps                | 594       |
| nupdates           | 100       |
| policy_entropy     | 1.42      |
| total_timesteps    | 500       |
| value_loss         | 470       |
----------------------------------
----------------------------------
| ep_len_mean        | 200       |
| ep_reward_mean     | -1.06e+03 |
| explained_variance | 0.00124   |
| fps                | 653       |
| nupdates           | 200       |
| policy_entropy     | 1.42      |
| total_timesteps    | 1000      |
| value_loss         | 1.18e+03  |
----------------------------------


## Additional wrappers: VecEnvWrappers

In the same vein as gym wrappers, stable baselines provides wrappers for `VecEnv`. Among the ones that exist (and you can create your own), you should know:

- VecNormalize: it computes a running mean and standard deviation to normalize observations and returns
- VecFrameStack: it stacks several consecutive observations (useful to integrate time in the observation, e.g. successive frames of an Atari game)

More info in the [documentation](https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html#wrappers)
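
To illustrate what stacking consecutive observations means, here is a small library-free sketch. `FrameStackSketch` is a hypothetical helper, observations are plain lists, and padding the buffer with zeros at reset is a choice made for this sketch; it is not a description of how `VecFrameStack` is implemented.

```python
from collections import deque


class FrameStackSketch:
    """Hypothetical helper that keeps the last n_stack observations."""

    def __init__(self, n_stack):
        self.n_stack = n_stack
        # A bounded deque drops the oldest frame automatically
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_obs):
        # Pad with zero frames so the stacked size is constant from step 0
        self.frames.clear()
        for _ in range(self.n_stack - 1):
            self.frames.append([0.0] * len(first_obs))
        self.frames.append(list(first_obs))
        return self.stacked()

    def push(self, obs):
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        # Flatten the frames, oldest first
        return [value for frame in self.frames for value in frame]


stack = FrameStackSketch(n_stack=3)
print(stack.reset([1.0, 2.0]))
print(stack.push([3.0, 4.0]))
print(stack.push([5.0, 6.0]))
print(stack.push([7.0, 8.0]))
```

Each stacked observation contains the current frame and the two before it, which gives the agent access to short-term history such as velocity.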

Note: when using the `VecNormalize` wrapper, you must save the running mean and std along with the model; otherwise you will not get proper results when loading the agent again. If you use the [rl zoo](https://github.com/araffin/rl-baselines-zoo), this is done automatically.

In [None]:
from stable_baselines.common.vec_env import VecNormalize, VecFrameStack

env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
normalized_vec_env = VecNormalize(env)

In [None]:
obs = normalized_vec_env.reset()
for _ in range(10):
  action = [normalized_vec_env.action_space.sample()]
  obs, reward, _, _ = normalized_vec_env.step(action)
  print(obs, reward)

[[ 0.0069937  -0.00201678 -0.00252103]] [-1.8756421]
[[-0.30897885 -0.89460939 -0.99908061]] [-1.3867719]
[[-0.85137728 -1.25493211 -1.1367014 ]] [-1.3548794]
[[-1.38664769 -1.48970432 -1.46118034]] [-1.2629999]
[[-1.6032098  -1.55958178 -1.20179075]] [-1.3572179]
[[-1.79896311 -1.66288467 -1.52662146]] [-1.0916774]
[[-1.93717621 -1.73059521 -1.5940493 ]] [-1.1513095]
[[-2.06685232 -1.78746985 -1.74711613]] [-1.121458]
[[-2.17348952 -1.8130202  -1.81983126]] [-1.1469169]
[[-2.25909019 -1.79650298 -1.87612573]] [-1.1441946]
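
The following sketch shows, on scalars and without any library, what a running mean and standard deviation look like and why they have to be stored with the model. `RunningStats` and its JSON serialization are hypothetical and simplified (Welford's algorithm on single values); they are not the code used by `VecNormalize`.

```python
import json
import math


class RunningStats:
    """Hypothetical scalar running mean/variance tracker (Welford's algorithm)."""

    def __init__(self, count=0, mean=0.0, m2=0.0):
        self.count = count
        self.mean = mean
        self.m2 = m2  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 before any sample has been seen
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 1.0

    def normalize(self, x, epsilon=1e-8):
        return (x - self.mean) / (self.std + epsilon)

    def to_json(self):
        return json.dumps({"count": self.count, "mean": self.mean, "m2": self.m2})

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

# Restoring the saved statistics reproduces the same normalization,
# while a fresh tracker maps the same input to a very different value
restored = RunningStats.from_json(stats.to_json())
fresh = RunningStats()
print(stats.mean, stats.std)
print(stats.normalize(9.0), restored.normalize(9.0), fresh.normalize(9.0))
```

An agent trained on normalized inputs expects the statistics it was trained with, so evaluating it with fresh statistics changes the observations it sees.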


## Exercise: code your own monitor wrapper

Now that you know how a wrapper works and what you can do with it, it's time to experiment.

The goal here is to create a wrapper that will monitor the training progress, storing both the episode reward (sum of rewards for one episode) and the episode length (number of steps in the last episode).

You will return those values using the `info` dict at the end of each episode.

In [None]:
class MyMonitorWrapper(gym.Wrapper):
  """
  :param env: (gym.Env) Gym environment that will be wrapped
  """
  def __init__(self, env):
    # Call the parent constructor, so we can access self.env later
    super(MyMonitorWrapper, self).__init__(env)
    # === YOUR CODE HERE ===#
    # Initialize the variables that will be used
    # to store the episode length and episode reward

    # ====================== #
  
  def reset(self):
    """
    Reset the environment 
    """
    obs = self.env.reset()
    # === YOUR CODE HERE ===#
    # Reset the variables

    # ====================== #
    return obs

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional informations
    """
    obs, reward, done, info = self.env.step(action)
    # === YOUR CODE HERE ===#
    # Update the current episode reward and episode length

    # ====================== #

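    if done:
      # === YOUR CODE HERE ===#
      # Store the episode length and episode reward in the info dict

      # ====================== #
      pass  # placeholder so the block is valid before you add your code

    # Return on every step, not only at the end of an episode
    return obs, reward, done, info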

#### Test your wrapper

In [None]:
# To use LunarLander, you need to install box2d box2d-kengz (pip) and swig (apt-get)
!pip install box2d box2d-kengz

Collecting box2d
  Downloading https://files.pythonhosted.org/packages/a9/0b/d48d42dd9e19ce83a3fb4eee074e785b6c6ea612a2244dc2ef69427d338b/Box2D-2.3.10-cp36-cp36m-manylinux1_x86_64.whl (1.3MB)
Collecting box2d-kengz
  Downloading https://files.pythonhosted.org/packages/81/20/51d6c0c87f7642efb709c518fb0ca8e5eab068259588552c41da5926ae27/Box2D-kengz-2.3.3.tar.gz (425kB)
Building wheels for collected packages: box2d-kengz
  Building wheel for box2d-kengz (setup.py) ... done
  Created wheel for box2d-kengz: filename=Box2D_kengz-2.3.3-cp36-cp36m-linux_x86_64.whl size=2014815 sha256=72e1ac9548cf46cf3219e48e8e282a3905034c4570fe48e60606e483c88b3d22
  Stored in directory: /root/.cache/pip/wheels/75/ae/e5/8bc678d262caad94659c199c540550e59d03dd3bd3684d4f1a
Successfully built box2d-kengz
Installing collected packages: box2d, box2d-kengz
Successfully

In [None]:
env = gym.make("LunarLander-v2")
# === YOUR CODE HERE ===#
# Wrap the environment

# Reset the environment

# Take random actions in the environment and check
# that it returns the correct values after the end of each episode

# ====================== #
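
One possible shape for a solution is sketched below on a toy environment, so that it runs without gym or Box2D. `EpisodeMonitorSketch`, `FixedLengthEnv`, and the `"episode"`, `"r"`, and `"l"` keys are assumptions made for this illustration; with gym, the monitor would subclass `gym.Wrapper` as in the template above.

```python
class EpisodeMonitorSketch:
    """Hypothetical monitor; with gym it would subclass gym.Wrapper instead."""

    def __init__(self, env):
        self.env = env
        self.episode_reward = 0.0
        self.episode_length = 0

    def reset(self):
        # Start counting from scratch for the new episode
        self.episode_reward = 0.0
        self.episode_length = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.episode_reward += reward
        self.episode_length += 1
        if done:
            # Report the totals only when the episode ends
            info["episode"] = {"r": self.episode_reward, "l": self.episode_length}
        return obs, reward, done, info


class FixedLengthEnv:
    """Toy environment: reward 1.0 per step, episode ends after 4 steps."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 4, {}


env = EpisodeMonitorSketch(FixedLengthEnv())
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(0)
print(info)
```

The toy episode lasts four steps with a reward of 1.0 each, so the final `info` reports a total reward of 4.0 and a length of 4.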

# Conclusion

In this notebook, we have seen:
- how to easily save and load a model
- what a wrapper is and what we can do with it
- how to create your own wrapper

## Wrapper bonus: changing the observation space with a wrapper for fixed-length episodes

In [None]:
from gym.wrappers import TimeLimit

class TimeFeatureWrapper(gym.Wrapper):
    """
    Add remaining time to observation space for fixed length episodes.
    See https://arxiv.org/abs/1712.00378 and https://github.com/aravindr93/mjrl/issues/13.

    :param env: (gym.Env)
    :param max_steps: (int) Max number of steps of an episode
        if it is not wrapped in a TimeLimit object.
    :param test_mode: (bool) In test mode, the time feature is constant,
        equal to one (see `_get_obs`). This allows checking that the agent did not
        overfit this feature, learning a deterministic pre-defined sequence of actions.
    """
    def __init__(self, env, max_steps=1000, test_mode=False):
        assert isinstance(env.observation_space, gym.spaces.Box)
        # Add a time feature to the observation
        low, high = env.observation_space.low, env.observation_space.high
        low, high = np.concatenate((low, [0.0])), np.concatenate((high, [1.0]))
        env.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

        super(TimeFeatureWrapper, self).__init__(env)

        if isinstance(env, TimeLimit):
            self._max_steps = env._max_episode_steps
        else:
            self._max_steps = max_steps
        self._current_step = 0
        self._test_mode = test_mode

    def reset(self):
        self._current_step = 0
        return self._get_obs(self.env.reset())

    def step(self, action):
        self._current_step += 1
        obs, reward, done, info = self.env.step(action)
        return self._get_obs(obs), reward, done, info

    def _get_obs(self, obs):
        """
        Concatenate the time feature to the current observation.

        :param obs: (np.ndarray)
        :return: (np.ndarray)
        """
        # Remaining time is more general
        time_feature = 1 - (self._current_step / self._max_steps)
        if self._test_mode:
            time_feature = 1.0
        # Optionally: concatenate [time_feature, time_feature ** 2]
        return np.concatenate((obs, [time_feature]))
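
The time feature computed in `_get_obs` is simple arithmetic and can be checked outside the wrapper. The sketch below repeats the same logic on plain Python lists; `time_feature` and `augment` are hypothetical standalone functions written for this illustration, and the observation values are made up.

```python
def time_feature(current_step, max_steps, test_mode=False):
    """Fraction of the episode remaining, following the _get_obs logic."""
    if test_mode:
        # Constant value in test mode, as in the wrapper code
        return 1.0
    return 1 - (current_step / max_steps)


def augment(obs, current_step, max_steps, test_mode=False):
    """Append the time feature to an observation given as a list."""
    return list(obs) + [time_feature(current_step, max_steps, test_mode)]


max_steps = 200
for step in (0, 50, 200):
    print(step, augment([0.1, -0.3], step, max_steps))

# In test mode the feature no longer depends on the step
print(augment([0.1, -0.3], 50, max_steps, test_mode=True))
```

The feature starts at 1.0 when the episode begins and decreases linearly to 0.0 at the last step.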

## Going further - Saving format 

The format for saving and loading models was revamped in Stable-Baselines releases after 2.7.0.

A saved model is a zip archive containing JSON dumps and a NumPy zip archive of the arrays:
```
saved_model.zip/
├── data              JSON file of class-parameters (dictionary)
├── parameter_list    JSON file of model parameters and their ordering (list)
├── parameters        Bytes from numpy.savez (a zip file of the numpy arrays). ...
    ├── ...           Being a zip-archive itself, this object can also be opened ...
        ├── ...       as a zip-archive and browsed.
```

## Save and find 

In [None]:
# Create save dir
save_dir = "/tmp/gym/"
os.makedirs(save_dir, exist_ok=True)

model = PPO2('MlpPolicy', 'Pendulum-v0', verbose=0).learn(8000)
model.save(save_dir + "/PPO2_tutorial")

In [None]:
!ls /tmp/gym/PPO2_tutorial*

In [None]:
import zipfile

archive = zipfile.ZipFile("/tmp/gym/PPO2_tutorial.zip", 'r')
for f in archive.filelist:
  print(f.filename)
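
Beyond listing the file names, the JSON entries can be parsed with the standard library. The sketch below builds a small stand-in archive in memory with the same three entry names as the layout described above, then reads it back. The hyperparameter values, the parameter names, and the placeholder bytes are invented for the example, and a real archive may contain values encoded in ways this sketch does not handle.

```python
import io
import json
import zipfile

# Build a stand-in archive in memory (contents are made up for illustration)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data", json.dumps({"gamma": 0.99, "n_steps": 128}))
    archive.writestr("parameter_list", json.dumps(["model/pi/w:0", "model/pi/b:0"]))
    # In a real archive this entry holds the bytes written by numpy.savez
    archive.writestr("parameters", b"placeholder bytes")

# Read the archive back and parse the two JSON entries
buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as archive:
    names = archive.namelist()
    data = json.loads(archive.read("data").decode("utf-8"))
    parameter_list = json.loads(archive.read("parameter_list").decode("utf-8"))

print(names)
print(data["gamma"], parameter_list)
```

To inspect a model saved by the notebook, the `io.BytesIO` buffer would be replaced by the path to the `.zip` file.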

## Exporting saved models

And finally, some further reading for those who want to export to TensorFlow.js or Java.

https://stable-baselines.readthedocs.io/en/master/guide/export.html