# Preparation part

Import the basic dependencies we need

In [1]:
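# Create a virtual display so rendering works on a remote (headless) machine.
# It is named virtual_display so the later `from IPython import display` does not shadow it.
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1, 1))
virtual_display.start()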

import matplotlib.pyplot as plt
%matplotlib inline
from IPython import display
import gym
import numpy as np

Function for model evaluation

In [2]:
def evaluate_model(model, env, num_episodes=100):
    """
    Evaluate an RL agent
    :param model: (BaseRLModel object) the RL agent
    :param env: (Gym environment) the environment to evaluate on, here wrapped in a DummyVecEnv
    :param num_episodes: (int) number of episodes
    :return: (list) List of rewards for episodes
    """
    episode_rewards = []
    obs = env.reset()
    for i in range(num_episodes):
        episode_rewards.append(0.0)
        done = False
        while not done:
            # _states are only useful when using LSTM policies
            action, _states = model.predict(obs)
            obs, reward, done, info = env.step(action)
            # Stats
            episode_rewards[-1] += reward
            if done:
                obs = env.reset()
    # Compute the mean reward over all evaluated episodes
    mean_100ep_reward = round(np.mean(episode_rewards), 1)
    print("Mean reward:", mean_100ep_reward, "Num episodes:", len(episode_rewards))
    return episode_rewards
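
The evaluation loop can be sanity-checked without training anything by running it against stand-ins. The sketch below repeats the same loop in plain Python; `FixedLengthEnv`, `ConstantModel` and `run_episodes` are hypothetical helpers written for this illustration and are not part of gym or stable-baselines.

```python
class FixedLengthEnv:
    """Hypothetical stand-in env: reward 1.0 per step, episode ends after `length` steps."""

    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return 0.0, 1.0, done, {}  # obs, reward, done, info


class ConstantModel:
    """Hypothetical stand-in agent that always returns action 0."""

    def predict(self, obs):
        return 0, None  # action, recurrent state (unused here)


def run_episodes(model, env, num_episodes):
    """Same structure as the evaluation loop: accumulate reward until `done`."""
    episode_rewards = []
    obs = env.reset()
    for _ in range(num_episodes):
        total = 0.0
        done = False
        while not done:
            action, _state = model.predict(obs)
            obs, reward, done, _info = env.step(action)
            total += reward
        episode_rewards.append(total)
        obs = env.reset()
    return episode_rewards


rewards = run_episodes(ConstantModel(), FixedLengthEnv(length=5), num_episodes=3)
print(rewards)  # [5.0, 5.0, 5.0]
```

Each episode lasts exactly five steps with a reward of 1.0 per step, so every entry should be 5.0. A different result would point to a bookkeeping bug in the loop rather than in the agent.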

# Training part

## Basic cartpole solution
In this example we will use PPO2 (be aware of different hyperparameters when changing the algorithm):  
**Full list** of algos available with stable-baselines: https://stable-baselines.readthedocs.io/en/master/guide/algos.html  
**PPO2 documentation:** https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html

In [3]:
from stable_baselines.common.policies import MlpPolicy # our policy (neural network that will perform actions)
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2 # import model


### Tensorboard
We can also enable tensorboard when training with stable-baselines, which lets us explore the training process in more detail.
To use tensorboard you need to:
1. Pass the `tensorboard_log` argument to your model; it sets the path where the logs are written;
2. Run tensorboard separately (preferred) or inside the notebook to explore the logs. To start a new tensorboard instance, run in a terminal:   
    - `tensorboard --logdir <path to logs>`; the output shows the port tensorboard is listening on, and you can open it in the browser by replacing the Jupyter port (8888) in the URL with the tensorboard port (6006 by default);

Here we set tensorboard logs directory:

In [4]:
tensorboard_dir = "/root/tensorboard"

Create an instance with CartPole-v0 environment for our model. 

In [5]:
env = gym.make('CartPole-v0')
env = DummyVecEnv([lambda: env]) # Single-process wrapper; use it when training on one core



Train our model. Note that we use the **magic command ```%%time```** to measure the run time of this cell.

In [6]:
%%time
n_steps = 128
model = PPO2(
    MlpPolicy, 
    env, 
    verbose=1, 
    n_steps=n_steps, 
    learning_rate=0.003, 
    ent_coef=0.0,
    tensorboard_log=tensorboard_dir
)
model.learn(total_timesteps=25000, log_interval=100)










-------------------------------------
| approxkl           | 0.007997252  |
| clipfrac           | 0.119140625  |
| explained_variance | 0.00387      |
| fps                | 269          |
| n_updates          | 1            |
| policy_entropy     | 0.6859102    |
| policy_loss        | -0.017660078 |
| serial_timesteps   | 128          |
| time_elapsed       | 3.58e-06     |
| total_timesteps    | 128          |
| value_loss         | 56.71531     |
-------------------------------------
--------------------------------------
| approxkl           | 0.0013344522  |
| clipfrac           | 0.00390625    |
| explained_variance | 0.11          |
| fps                | 774           |
| n_updates          | 100           |
| policy_entropy     | 0.5843034

<stable_baselines.ppo2.ppo2.PPO2 at 0x7fafa92afbe0>

And now the **evaluation** of the trained model

In [7]:
rewards = evaluate_model(model, env, num_episodes=100)
np.std(rewards)

Mean reward: 200.0 Num episodes: 100


0.0

CartPole-v0 is considered solved when the average reward over 100 consecutive episodes is at least 195; the maximum reward per episode is 200.  
You should see that it took us around 30 seconds to train PPO2 on cartpole in colab.  
However, we can make training faster by parallelizing the environment. 
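
The "solved" criterion can be written as a small check over the list returned by the evaluation function. This is a minimal sketch: `is_solved` is a hypothetical helper, not a gym or stable-baselines function, and the threshold is passed in explicitly rather than hard-coded.

```python
from statistics import mean


def is_solved(episode_rewards, threshold, window=100):
    """Return True if the mean reward over the last `window` episodes reaches `threshold`."""
    if len(episode_rewards) < window:
        return False  # not enough episodes to judge yet
    return mean(episode_rewards[-window:]) >= threshold


# 100 evaluation episodes that all reached the maximum reward, as in the run above
example_rewards = [200.0] * 100

# gym registers CartPole-v0 with a reward threshold of 195.0
print(is_solved(example_rewards, threshold=195.0))  # True
```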


## Parallelized environment on cartpole example
Here we can simply use the stable-baselines documentation as a cookbook and run our training in parallel.  
https://stable-baselines.readthedocs.io/en/master/guide/examples.html#multiprocessing-unleashing-the-power-of-vectorized-environments

In [8]:
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds, make_vec_env
from stable_baselines import PPO2

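Here we use `make_vec_env`, which creates one copy of the environment per CPU. By default it wraps the copies in a `DummyVecEnv`, which steps them sequentially in one process; recent stable-baselines versions accept `vec_env_cls=SubprocVecEnv` to run each copy in its own process.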

In [11]:
env_name = "CartPole-v0"
num_cpu = 8
vec_env = make_vec_env(env_name, n_envs=num_cpu, seed=0)



You need to scale the number of steps (```n_steps```) by the number of CPUs to get comparable results, because this parameter defines the number of steps collected per environment, and these are aggregated into one batch afterwards.
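
The scaling can be checked with a little arithmetic. The sketch below uses the values from this notebook (`n_steps = 128`, 8 environments, 25000 total timesteps); `rollout_batch_size` is a hypothetical helper name used only for illustration.

```python
def rollout_batch_size(n_steps, n_envs):
    """Samples collected per update: each environment contributes n_steps transitions."""
    return n_steps * n_envs


total_timesteps = 25000

# Single environment: 128 steps per update
single = rollout_batch_size(n_steps=128, n_envs=1)

# Eight environments with scaled n_steps: 16 steps each, same batch size overall
parallel = rollout_batch_size(n_steps=128 // 8, n_envs=8)

n_updates = total_timesteps // parallel

print(single, parallel)  # 128 128
print(n_updates)         # 195
```

Without the scaling, eight environments would collect 1024 samples per update, so the same `total_timesteps` would yield far fewer gradient updates and the two runs would no longer be comparable.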

In [12]:
%%time
model = PPO2(
    MlpPolicy, 
    vec_env, 
    verbose=1, 
    n_steps=n_steps // num_cpu, # Scaling number of steps
    learning_rate=0.003, 
    ent_coef=0.0,
    tensorboard_log=tensorboard_dir
)
model.learn(total_timesteps=25000, log_interval=100)

--------------------------------------
| approxkl           | 0.0002593151  |
| clipfrac           | 0.0           |
| ep_len_mean        | 13.5          |
| ep_reward_mean     | 13.5          |
| explained_variance | -0.00439      |
| fps                | 347           |
| n_updates          | 1             |
| policy_entropy     | 0.6928569     |
| policy_loss        | 0.00090540387 |
| serial_timesteps   | 16            |
| time_elapsed       | 4.05e-06      |
| total_timesteps    | 128           |
| value_loss         | 22.222218     |
--------------------------------------
--------------------------------------
| approxkl           | 0.009314638   |
| clipfrac           | 0.12890625    |
| ep_len_mean        | 109           |
| ep_reward_mean     | 109           |
| explained_variance | 0.194         |
| fps                | 2052          |
| n_updates          | 100           |
| policy_entropy     | 0.38300434    |
| policy_loss        | -0.0047525167 |
| serial_timesteps   | 16

<stable_baselines.ppo2.ppo2.PPO2 at 0x7fb027583668>

### Saving the model
To evaluate the model with the same function, it has to run in a single-environment setup. However, stable-baselines doesn't provide an easy way to transfer the model, so we simply **save our model** to disk and then load it with the environment we need.

In [11]:
model.save("ppo2_parallel_test")

Loading the model with a new environment (for one CPU)

In [12]:
env = gym.make(env_name)
env = DummyVecEnv([lambda: env])
model = PPO2.load("ppo2_parallel_test", env=env)



And model **evaluation** again.

In [13]:
rewards = evaluate_model(model, env, num_episodes=100)
np.std(rewards)

Mean reward: 200.0 Num episodes: 100


0.0

As you can see, the speed-up is not proportional to the number of parallel environments, but training is still significantly faster.
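
As a rough illustration, the speed-up can be estimated from the `fps` values in the training logs above at `n_updates = 100`: 774 for the single environment and 2052 for the run with 8 environments. `fps` fluctuates between updates and machines, so this is an estimate rather than a benchmark.

```python
# fps values copied from the training logs above (n_updates = 100)
fps_single = 774
fps_parallel = 2052
n_envs = 8

speedup = fps_parallel / fps_single
efficiency = speedup / n_envs  # 1.0 would mean perfect linear scaling

print(f"speed-up: {speedup:.2f}x")      # speed-up: 2.65x
print(f"efficiency: {efficiency:.2f}")  # efficiency: 0.33
```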

# Bonus: Changing architecture of our policy network
A policy is a neural network that chooses actions, and stable-baselines lets you configure its architecture.  
https://stable-baselines.readthedocs.io/en/master/guide/custom_policy.html

Here we will use the stable-baselines `FeedForwardPolicy` to build our custom feed-forward net.  
We define `CustomPolicy`, which consists of two networks:
- a policy network with 2 layers of 64 neurons each;
- a value-function network with 3 layers: the first two have 64 neurons and the last has 32.

In [14]:
from stable_baselines.common.policies import FeedForwardPolicy, register_policy

# Custom MLP policy: policy branch with two layers of 64, value branch with layers of 64, 64 and 32
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[64, 64],          # Policy network layers size
                                                          vf=[64, 64, 32])],    # Value function layers size
                                           feature_extraction="mlp") # The feature extraction type ("cnn" or "mlp") (could be also custom network)

# Register the policy, it will check that the name is not already taken
register_policy('CustomPolicy', CustomPolicy)

# Because the policy is now registered, you can pass
# a string to the agent constructor instead of passing a class
# model = A2C(policy='CustomPolicy', env='LunarLander-v2', verbose=1).learn(total_timesteps=100000)
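
To get a feel for the size of the two branches in `net_arch`, the sketch below counts the weights and biases of a plain fully connected network. It assumes CartPole's 4-dimensional observation and 2 discrete actions, and it assumes a single linear output layer per branch; the actual stable-baselines policy adds its own output heads, so the real parameter count will differ. `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(input_dim, hidden_sizes, output_dim):
    """Count weights and biases of a fully connected net with the given layer sizes."""
    sizes = [input_dim] + list(hidden_sizes) + [output_dim]
    # Each layer has n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


obs_dim = 4    # CartPole observation size
n_actions = 2  # CartPole has two discrete actions

# pi=[64, 64] with an assumed linear output of size n_actions
policy_params = mlp_param_count(obs_dim, [64, 64], n_actions)

# vf=[64, 64, 32] with an assumed scalar value output
value_params = mlp_param_count(obs_dim, [64, 64, 32], 1)

print(policy_params)  # 4610
print(value_params)   # 6593
```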


Creating our vectorized (multi-CPU) environment

In [15]:
env_name = "CartPole-v0"
num_cpu = 4
vec_env = make_vec_env(env_name, n_envs=num_cpu, seed=0)



The same steps as before

In [16]:
%%time
model = PPO2(
    CustomPolicy, # Instead of MlpPolicy now we have our custom policy
    vec_env, 
    verbose=1, 
    n_steps=n_steps // num_cpu, # Scaling number of steps
    learning_rate=0.003, 
    ent_coef=0.0,
    tensorboard_log=tensorboard_dir
)
model.learn(total_timesteps=25000, log_interval=100)

-------------------------------------
| approxkl           | 0.0070039397 |
| clipfrac           | 0.107421875  |
| ep_len_mean        | 19.4         |
| ep_reward_mean     | 19.4         |
| explained_variance | 0.0272       |
| fps                | 300          |
| n_updates          | 1            |
| policy_entropy     | 0.68663955   |
| policy_loss        | -0.011830209 |
| serial_timesteps   | 32           |
| time_elapsed       | 3.1e-06      |
| total_timesteps    | 128          |
| value_loss         | 27.326387    |
-------------------------------------
-------------------------------------
| approxkl           | 0.0051283794 |
| clipfrac           | 0.07421875   |
| ep_len_mean        | 123          |
| ep_reward_mean     | 123          |
| explained_variance | 0.877        |
| fps                | 1435         |
| n_updates          | 100          |
| policy_entropy     | 0.4683259    |
| policy_loss        | 0.0047310414 |
| serial_timesteps   | 3200         |
| time_elaps

<stable_baselines.ppo2.ppo2.PPO2 at 0x7f7d8e74d978>

Save and load the model

In [17]:
model.save("custom_ppo2_parallel")

In [18]:
env = gym.make(env_name)
env = DummyVecEnv([lambda: env])
model = PPO2.load("custom_ppo2_parallel", env=env)



Evaluation

In [19]:
rewards = evaluate_model(model, env, num_episodes=100)
np.std(rewards)

Mean reward: 200.0 Num episodes: 100


0.0