# Useful tips and tricks

This notebook collects a few useful tricks that can improve your training workflow and performance.

In [1]:
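# Create a virtual display so rendering works on a remote (headless) machine
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1, 1))  # named so it is not shadowed by IPython's `display` module
virtual_display.start()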

import matplotlib.pyplot as plt
%matplotlib inline
from IPython import display
import gym
import numpy as np
from pathlib import Path

from stable_baselines.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.common import set_global_seeds, make_vec_env

from utils import evaluate_model_vec




In [2]:
from stable_baselines.common.policies import MlpPolicy # our policy (neural network that will perform actions)
from stable_baselines import PPO2

# Tensorboard
We can also use TensorBoard when training with stable-baselines to explore the training process in more detail. \
To use TensorBoard you need to:
1. Pass the `tensorboard_log` argument to your model; it sets the path where the logs are written;
2. Run TensorBoard:
    - on your VM **TensorBoard is already running** on port 6006 and watches two folders for logs, `your_work` and `tensorboard`. So you just need to open the VM's IP address on port 6006 to reach it.
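
Since TensorBoard only shows runs it finds in the watched folders, a small helper like the one below can confirm which run subfolders exist under a log directory once training has written something. This is a minimal sketch: `list_runs` is a hypothetical helper, and the folder names `PPO2_1` and `PPO2_2` in the demo are an assumption about how runs are typically named, not something this notebook shows.

```python
from pathlib import Path
import tempfile


def list_runs(log_root):
    """Return the sorted names of the run subfolders found in log_root."""
    root = Path(log_root)
    if not root.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Demo on a temporary directory; the run names are assumed, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "PPO2_1").mkdir()
    (Path(tmp) / "PPO2_2").mkdir()
    demo_runs = list_runs(tmp)

print(demo_runs)
```

On the VM you would call `list_runs` with the log directory you passed as `tensorboard_log`; an empty list suggests that the logs went somewhere else.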

Here we set tensorboard logs directory:

In [3]:
tensorboard_dir = "/root/tensorboard"

In [4]:
env = gym.make('CartPole-v0')
env = DummyVecEnv([lambda: env])  # Single-process vectorized wrapper, suitable when training on one core



Train our model. Note that we use the **magic command ```%%time```** to measure how long this cell takes to run. \
The log directory is simply passed to the model as the `tensorboard_log` argument.

In [5]:
%%time
n_steps = 128
model = PPO2(
    MlpPolicy, 
    env, 
    verbose=1, 
    n_steps=n_steps, 
    learning_rate=0.003, 
    ent_coef=0.0,
    tensorboard_log=tensorboard_dir # path to store tensorboard logs
)
model.learn(total_timesteps=25000, log_interval=100)










-------------------------------------
| approxkl           | 0.0068953326 |
| clipfrac           | 0.109375     |
| explained_variance | -0.0591      |
| fps                | 264          |
| n_updates          | 1            |
| policy_entropy     | 0.6864212    |
| policy_loss        | -0.018464664 |
| serial_timesteps   | 128          |
| time_elapsed       | 4.05e-06     |
| total_timesteps    | 128          |
| value_loss         | 34.298294    |
-------------------------------------
--------------------------------------
| approxkl           | 0.0032370728  |
| clipfrac           | 0.041015625   |
| explained_variance | 0.0894        |
| fps                | 882           |
| n_updates          | 100           |
| policy_entropy     | 0.5306942

<stable_baselines.ppo2.ppo2.PPO2 at 0x7f4b7b6984a8>

# Parallelized environment

We can use the stable-baselines documentation as a cookbook and run our training in parallel.  
https://stable-baselines.readthedocs.io/en/master/guide/examples.html#multiprocessing-unleashing-the-power-of-vectorized-environments

Let's do a quick performance comparison. \
First, we will run single-core training:

In [6]:
env = gym.make('CartPole-v0')
env = DummyVecEnv([lambda: env])



In [7]:
%%time
n_steps = 128
model = PPO2(
    MlpPolicy, 
    env, 
    verbose=1, 
    n_steps=n_steps, 
    learning_rate=0.003, 
    ent_coef=0.0,
)
model.learn(total_timesteps=10000, log_interval=100)

---------------------------------------
| approxkl           | 0.0012090533   |
| clipfrac           | 0.001953125    |
| explained_variance | -0.00806       |
| fps                | 310            |
| n_updates          | 1              |
| policy_entropy     | 0.6919473      |
| policy_loss        | -0.00010843319 |
| serial_timesteps   | 128            |
| time_elapsed       | 3.34e-06       |
| total_timesteps    | 128            |
| value_loss         | 38.902607      |
---------------------------------------
CPU times: user 18.8 s, sys: 3.77 s, total: 22.6 s
Wall time: 11.7 s


<stable_baselines.ppo2.ppo2.PPO2 at 0x7f4bfb84e2e8>

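Next, let's run parallelized training. \
The only thing we need is a vectorized environment that holds several copies of the environment. Note that `make_vec_env` wraps the copies in `DummyVecEnv` by default, which steps them one after another in a single process; pass `vec_env_cls=SubprocVecEnv` to run each copy in its own process.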
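
PPO2 collects `n_steps` transitions from every environment copy before each update, so the training cell further down divides `n_steps` by the number of environments. The arithmetic below is a small sketch of why that keeps the two runs comparable; the variable names `batch_single`, `batch_parallel` and `n_updates` are only illustrative.

```python
n_steps = 128            # rollout length used in the single-environment run
num_cpu = 4              # number of environment copies in the vectorized run
total_timesteps = 10000  # training budget used in both runs

# Single environment: one update consumes n_steps transitions.
batch_single = n_steps * 1

# Vectorized run: each copy contributes n_steps // num_cpu transitions.
steps_per_env = n_steps // num_cpu
batch_parallel = steps_per_env * num_cpu

# Number of full updates that fit into the training budget.
n_updates = total_timesteps // batch_parallel

print(steps_per_env, batch_parallel, n_updates)  # 32 128 78
```

This matches the training logs in this section: the first update of the vectorized run reports `serial_timesteps` of 32 and `total_timesteps` of 128.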

In [8]:
env_name = "CartPole-v0"
num_cpu = 4
vec_env = make_vec_env(env_name, n_envs=num_cpu, seed=0)



In [9]:
%%time
model = PPO2(
    MlpPolicy, 
    vec_env, 
    verbose=1, 
    n_steps=n_steps // num_cpu, # Scaling number of steps
    learning_rate=0.003, 
    ent_coef=0.0,
)
model.learn(total_timesteps=10000, log_interval=100)

--------------------------------------
| approxkl           | 0.0006608284  |
| clipfrac           | 0.0           |
| ep_len_mean        | 16.5          |
| ep_reward_mean     | 16.5          |
| explained_variance | -0.00216      |
| fps                | 381           |
| n_updates          | 1             |
| policy_entropy     | 0.6925019     |
| policy_loss        | -0.0010020696 |
| serial_timesteps   | 32            |
| time_elapsed       | 3.1e-06       |
| total_timesteps    | 128           |
| value_loss         | 24.10811      |
--------------------------------------
CPU times: user 11.6 s, sys: 2.81 s, total: 14.4 s
Wall time: 6.28 s


<stable_baselines.ppo2.ppo2.PPO2 at 0x7f4b7b748d68>

As you can see, training on 4 environments at once cuts the wall time from 11.7 s to 6.28 s, a speed-up of roughly 1.9x.
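
To make the comparison explicit, the timings printed by the two `%%time` cells can be turned into ratios. The helper `speedup` below is a hypothetical name used only for this sketch; the numbers are copied from the outputs above.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Return how many times faster the parallel run was than the baseline."""
    return baseline_seconds / parallel_seconds


# Wall-clock time: what you actually wait for.
wall_speedup = speedup(11.7, 6.28)

# Total CPU time (user + sys): how much work the machine did overall.
cpu_speedup = speedup(22.6, 14.4)

print(f"wall: {wall_speedup:.2f}x, cpu: {cpu_speedup:.2f}x")  # wall: 1.86x, cpu: 1.57x
```

These are single measurements, so treat the ratios as rough indications rather than a benchmark.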

# Faster evaluation for parallelized environments

In `utils.py` in this folder (which you can also find on our [github](https://github.com/nextgrid/notebooks)) you can find the `evaluate_model_vec` function.  \
This function can be very useful if you use parallelized environments because it runs the evaluation in parallel as well.
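
The source of `evaluate_model_vec` is not shown in this notebook, so the following is only a sketch of the general idea behind evaluating on several environments at once, not the actual implementation. `CountdownEnv` and `evaluate_lockstep` are hypothetical names, and the toy environment returns a simplified `(observation, reward, done)` tuple.

```python
class CountdownEnv:
    """Toy stand-in for an environment (illustration only).

    Every episode lasts `horizon` steps and pays a reward of 1.0 per step.
    """

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done


def evaluate_lockstep(policy, envs, num_episodes):
    """Step all environments together and collect finished episode returns."""
    observations = [env.reset() for env in envs]
    running = [0.0] * len(envs)  # return accumulated so far in each environment
    finished = []                # returns of completed episodes
    while len(finished) < num_episodes:
        actions = policy(observations)  # one batched call for all environments
        for i, (env, action) in enumerate(zip(envs, actions)):
            obs, reward, done = env.step(action)
            running[i] += reward
            if done:
                finished.append(running[i])
                running[i] = 0.0
                obs = env.reset()
            observations[i] = obs
    return finished[:num_episodes]


envs = [CountdownEnv(horizon=h) for h in (3, 5, 3, 5)]
returns = evaluate_lockstep(lambda obs: [0] * len(obs), envs, num_episodes=10)
mean_return = sum(returns) / len(returns)
print(returns, mean_return)
```

The gain comes from calling the policy once per step for all environments instead of once per environment. One caveat of this simple scheme is that episodes are recorded in the order they finish, so short episodes can be over-represented when the run stops at a fixed episode count.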

Let's do a simple comparison and run both evaluation functions on the same model:

In [10]:
env_name = "LunarLander-v2"
num_cpu = 4
train_env = make_vec_env(env_name, n_envs=num_cpu, seed=0)

model = PPO2(
    MlpPolicy, 
    train_env, 
    verbose=1, 
    n_steps=n_steps // num_cpu, # Scaling number of steps
    learning_rate=0.003, 
    ent_coef=0.0,
)



Let's run our function:

In [11]:
%%time
vec_env = make_vec_env(env_name, n_envs=num_cpu, seed=0)
rewards = evaluate_model_vec(model, vec_env, num_episodes=100)



Mean reward: -482.1 Num episodes: 100
CPU times: user 8.09 s, sys: 328 ms, total: 8.42 s
Wall time: 7 s


And evaluation from stable-baselines:

In [12]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=100, deterministic=True)
print("Mean reward is", new_evaluation[0])

Mean reward is -505.01254
CPU times: user 18.6 s, sys: 1.7 s, total: 20.3 s
Wall time: 14.8 s


As you can see, parallelizing the evaluation gives roughly a 2x speed-up (7 s versus 14.8 s of wall time).

# Custom policy networks

The policy is the neural network that selects actions, and its architecture can be configured with stable-baselines tools:  
https://stable-baselines.readthedocs.io/en/master/guide/custom_policy.html

Here we will use the stable-baselines `FeedForwardPolicy` to construct our custom feed-forward net.  
We define a `CustomPolicy` that consists of two networks:
- Policy with 2 layers of 64 neurons each;
- Value function with 3 layers: the first two have 64 neurons each and the last has 32.
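
To get a feeling for the size of these two networks, the sketch below counts the weights and biases of a plain fully connected net with the layer sizes listed above. It assumes CartPole's 4-dimensional observation, 2 discrete actions, and a single linear output layer per network; `mlp_param_count` is a hypothetical helper, and the graph that stable-baselines actually builds may contain additional output heads.

```python
def mlp_param_count(input_dim, hidden, output_dim):
    """Count weights and biases of a fully connected net with these layer sizes."""
    sizes = [input_dim, *hidden, output_dim]
    # Each dense layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


obs_dim, n_actions = 4, 2  # CartPole: 4 observation values, 2 discrete actions

policy_params = mlp_param_count(obs_dim, [64, 64], n_actions)  # pi=[64, 64]
value_params = mlp_param_count(obs_dim, [64, 64, 32], 1)       # vf=[64, 64, 32]

print(policy_params, value_params)  # 4610 6593
```

Both networks are tiny, which is why CartPole trains in seconds on a CPU.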

In [13]:
from stable_baselines.common.policies import FeedForwardPolicy, register_policy

# Custom MLP policy: two policy layers of 64 units; value function layers of 64, 64 and 32 units
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[64, 64],          # Policy network layers size
                                                          vf=[64, 64, 32])],    # Value function layers size
                                           feature_extraction="mlp") # The feature extraction type ("cnn" or "mlp") (could be also custom network)

# Register the policy, it will check that the name is not already taken
register_policy('CustomPolicy', CustomPolicy)

In [14]:
env_name = "CartPole-v0"
num_cpu = 4
vec_env = make_vec_env(env_name, n_envs=num_cpu, seed=0)

In [15]:
%%time
model = PPO2(
    CustomPolicy, # Instead of MlpPolicy now we have our custom policy
    vec_env, 
    verbose=1, 
    n_steps=n_steps // num_cpu, # Scaling number of steps
    learning_rate=0.003, 
    ent_coef=0.0,
)
model.learn(total_timesteps=25000, log_interval=100)

-------------------------------------
| approxkl           | 0.004477359  |
| clipfrac           | 0.048828125  |
| ep_len_mean        | 22.2         |
| ep_reward_mean     | 22.2         |
| explained_variance | -0.014       |
| fps                | 342          |
| n_updates          | 1            |
| policy_entropy     | 0.6891707    |
| policy_loss        | -0.004464022 |
| serial_timesteps   | 32           |
| time_elapsed       | 3.58e-06     |
| total_timesteps    | 128          |
| value_loss         | 33.687138    |
-------------------------------------
--------------------------------------
| approxkl           | 0.002528545   |
| clipfrac           | 0.009765625   |
| ep_len_mean        | 120           |
| ep_reward_mean     | 120           |
| explained_variance | 0.000405      |
| fps                | 1901          |
| n_updates          | 100           |
| policy_entropy     | 0.6042605     |
| policy_loss        | -0.0017906395 |
| serial_timesteps   | 3200          |
|

<stable_baselines.ppo2.ppo2.PPO2 at 0x7f4b70d267b8>

In [16]:
%%time
vec_env = make_vec_env(env_name, n_envs=num_cpu, seed=0)
rewards = evaluate_model_vec(model, vec_env, num_episodes=100)



Mean reward: 200.0 Num episodes: 100
CPU times: user 5.69 s, sys: 692 ms, total: 6.38 s
Wall time: 4.4 s
