<a href="https://colab.research.google.com/github/mariaelisagmt/CartPole/blob/main/A2C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



## Install Dependencies and Stable Baselines Using Pip

List of full dependencies can be found in the [README](https://github.com/hill-a/stable-baselines).

```
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev zlib1g-dev
```


```
pip install stable-baselines[mpi]
```

In [4]:
# Stable Baselines only supports tensorflow 1.x for now
%tensorflow_version 1.x
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
!pip install stable-baselines[mpi]==2.10.0

Reading package lists... Done
Building dependency tree       
Reading state information... Done
freeglut3-dev is already the newest version (2.8.1-3).
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
xvfb is already the newest version (2:1.19.6-1ubuntu4.8).
0 upgraded, 0 newly installed, 0 to remove and 10 not upgraded.


## Imports

Stable-Baselines works on environments that follow the [gym interface](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html).
You can find a list of available environment [here](https://gym.openai.com/envs/#classic_control).

It is also recommended to check the [source code](https://github.com/openai/gym) to learn more about the observation and action space of each env, as gym does not have a proper documentation.
Not all algorithms can work with all action spaces, you can find more in this [recap table](https://stable-baselines.readthedocs.io/en/master/guide/algos.html)

In [5]:
import gym
import numpy as np

The first thing you need to import is the RL model, check the documentation to know what you can use on which problem

In [6]:
from stable_baselines import PPO2

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).
This step is optional as you can directly use strings in the constructor: 

```PPO2('MlpPolicy', env)``` instead of ```PPO2(MlpPolicy, env)```

Note that some algorithms like `SAC` have their own `MlpPolicy` (different from `stable_baselines.common.policies.MlpPolicy`), that's why using string for the policy is the recommened option.

In [7]:
from stable_baselines.common.policies import MlpPolicy

## Create the Gym env and instantiate the agent

For this example, we will use CartPole environment, a classic control problem.

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. "

Cartpole environment: [https://gym.openai.com/envs/CartPole-v1/](https://gym.openai.com/envs/CartPole-v1/)

![Cartpole](https://cdn-images-1.medium.com/max/1143/1*h4WTQNVIsvMXJTCpXm_TAw.gif)


We chose the MlpPolicy because the observation of the CartPole task is a feature vector, not images.

The type of action to use (discrete/continuous) will be automatically deduced from the environment action space

Here we are using the [Proximal Policy Optimization](https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html) algorithm (PPO2 is the version optimized for GPU), which is an Actor-Critic method: it uses a value function to improve the policy gradient descent (by reducing the variance).

It combines ideas from [A2C](https://stable-baselines.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://stable-baselines.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drops in performance).

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy alorithms like [DQN](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines.readthedocs.io/en/master/modules/td3.html), but is much faster regarding wall-clock time.


In [25]:
ambiente = gym.make('CartPole-v1')

model = PPO2(MlpPolicy, ambiente, verbose=0)

In [28]:
modelo = PPO2(MlpPolicy, ambiente, verbose = 1)
modelo.learn(total_timesteps = 10000)

obs = ambiente.reset()

for i in range(1000):
  action, estados = modelo.predict(obs, deterministic = True)
  obs, reward, done, info = ambiente.step(acao)
  ambiente.render()
  
  if done:
    obs = ambiente.reset()

ambiente.close()

Wrapping the env in a DummyVecEnv.
--------------------------------------
| approxkl           | 0.00010015384 |
| clipfrac           | 0.0           |
| explained_variance | -0.0216       |
| fps                | 307           |
| n_updates          | 1             |
| policy_entropy     | 0.69305027    |
| policy_loss        | -0.0017282362 |
| serial_timesteps   | 128           |
| time_elapsed       | 1.76e-05      |
| total_timesteps    | 128           |
| value_loss         | 46.98421      |
--------------------------------------
---------------------------------------
| approxkl           | 3.9955044e-05  |
| clipfrac           | 0.0            |
| explained_variance | -0.0451        |
| fps                | 519            |
| n_updates          | 2              |
| policy_entropy     | 0.69257385     |
| policy_loss        | -0.00069640053 |
| serial_timesteps   | 256            |
| time_elapsed       | 0.417          |
| total_timesteps    | 256            |
| value_loss      

NameError: ignored

We create a helper function to evaluate the agent:

In [10]:
def evaluate(model, num_episodes=100):
    """
    Evaluate a RL agent
    :param model: (BaseRLModel object) the RL Agent
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float) Mean reward for the last num_episodes
    """
    # This function will only work for a single Environment
    env = model.get_env()
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        while not done:
            # _states are only useful when using LSTM policies
            action, _states = model.predict(obs)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward

Let's evaluate the un-trained agent, this should be a random agent.

In [11]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_episodes=100)

Mean reward: 24.22 Num episodes: 100


Stable-Baselines already provides you with that helper:

In [13]:
from stable_baselines.common.evaluation import evaluate_policy

In [14]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:9.65 +/- 0.67


## Train the agent and evaluate it

In [15]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10000)

<stable_baselines.ppo2.ppo2.PPO2 at 0x7f22159eba58>

In [16]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:317.84 +/- 123.42


Apparently the training went well, the mean reward increased a lot ! 

### Prepare video recording

In [17]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [18]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the video, showing only the only starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

We will record a video using the [VecVideoRecorder](https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper, you will learn about those wrapper in the next notebook.

In [19]:
from stable_baselines.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # Start the video at step=0 and record 500 steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

### Visualize trained agent



In [20]:
record_video('CartPole-v1', model, video_length=500, prefix='ppo2-cartpole')

Saving video to  /content/videos/ppo2-cartpole-step-0-to-step-500.mp4


In [21]:
show_videos('videos', prefix='ppo2')

## Bonus: Train a RL Model in One Line

The policy class to use will be inferred and the environment will be automatically created. This works because both are [registered](https://stable-baselines.readthedocs.io/en/master/guide/quickstart.html).

In [22]:
model = PPO2('MlpPolicy', "CartPole-v1", verbose=1).learn(1000)

Creating environment from the given name, wrapped in a DummyVecEnv.
---------------------------------------
| approxkl           | 6.8005334e-06  |
| clipfrac           | 0.0            |
| explained_variance | -0.00471       |
| fps                | 299            |
| n_updates          | 1              |
| policy_entropy     | 0.6931432      |
| policy_loss        | -4.1479943e-06 |
| serial_timesteps   | 128            |
| time_elapsed       | 2.43e-05       |
| total_timesteps    | 128            |
| value_loss         | 70.77068       |
---------------------------------------
--------------------------------------
| approxkl           | 7.150585e-05  |
| clipfrac           | 0.0           |
| explained_variance | -0.000758     |
| fps                | 496           |
| n_updates          | 2             |
| policy_entropy     | 0.6929811     |
| policy_loss        | -0.0014476473 |
| serial_timesteps   | 256           |
| time_elapsed       | 0.428         |
| total_timesteps    |

## Train a DQN agent

In the previous example, we have used PPO, which one of the many algorithms provided by stable-baselines.

In the next example, we are going train a [Deep Q-Network agent (DQN)](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), and try to see possible improvements provided by its extensions (Double-DQN, Dueling-DQN, Prioritized Experience Replay).

The essential point of this section is to show you how simple it is to tweak hyperparameters.

The main advantage of stable-baselines is that it provides a common interface to use the algorithms, so the code will be quite similar.


DQN paper: https://arxiv.org/abs/1312.5602

Dueling DQN: https://arxiv.org/abs/1511.06581

Double-Q Learning: https://arxiv.org/abs/1509.06461

Prioritized Experience Replay: https://arxiv.org/abs/1511.05952

### Vanilla DQN: DQN without extensions

In [None]:
# Same as before we instantiate the agent along with the environment
from stable_baselines import DQN

# Deactivate all the DQN extensions to have the original version
# In practice, it is recommend to have them activated
kwargs = {'double_q': False, 'prioritized_replay': False, 'policy_kwargs': dict(dueling=False)}

# Note that the MlpPolicy of DQN is different from the one of PPO
# but stable-baselines handles that automatically if you pass a string
dqn_model = DQN('MlpPolicy', 'CartPole-v1', verbose=1, **kwargs)

In [None]:
# Random Agent, before training
mean_reward_before_train = evaluate(dqn_model, num_episodes=100)

Mean reward: 9.29 Num episodes: 100


In [None]:
# Train the agent for 10000 steps
dqn_model.learn(total_timesteps=10000, log_interval=10)

In [None]:
# Evaluate the trained agent
mean_reward = evaluate(dqn_model, num_episodes=100)

Mean reward: 130.02 Num episodes: 100


### DQN + Prioritized Replay

In [None]:
# Activate only the prioritized replay
kwargs = {'double_q': False, 'prioritized_replay': True, 'policy_kwargs': dict(dueling=False)}

dqn_per_model = DQN('MlpPolicy', 'CartPole-v1', verbose=1, **kwargs)

In [None]:
dqn_per_model.learn(total_timesteps=10000, log_interval=10)

In [None]:
# Evaluate the trained agent
mean_reward = evaluate(dqn_per_model, num_episodes=100)

Mean reward: 110.18 Num episodes: 100


### DQN + Prioritized Experience Replay + Double Q-Learning + Dueling

In [None]:
# Activate all extensions
kwargs = {'double_q': True, 'prioritized_replay': True, 'policy_kwargs': dict(dueling=True)}

dqn_full_model = DQN('MlpPolicy', 'CartPole-v1', verbose=1, **kwargs)

Creating environment from the given name, wrapped in a DummyVecEnv.


In [None]:
dqn_full_model.learn(total_timesteps=10000, log_interval=10)

In [None]:
mean_reward = evaluate(dqn_per_model, num_episodes=100)

Mean reward: 110.02 Num episodes: 100


In this particular example, the extensions does not seem to give any improvement compared to the simple DQN version.
They are several reasons for that:

1. `CartPole-v1` is a pretty simple environment
2. We trained DQN for very few timesteps, not enough to see any difference
3. The default hyperparameters for DQN are tuned for atari games, where the number of training timesteps is much larger (10^6) and input observations are images
4. We have only compared one random seed per experiment

## Conclusion

In this notebook we have seen:
- how to define and train a RL model using stable baselines, it takes only one line of code ;)
- how to use different RL algorithms and change some hyperparameters