# Training half-cheetah

This notebook shows how to start from an untrained agent that behaves like this:

<video controls height=300 src="videos/cheetah_random.mp4" />

and train it into an agent that runs like this (or even better):

<video controls height=300 src="videos/cheetah_trained.mp4" />

## Preparation part

Import everything we will need.

In [1]:
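# Create a virtual display to render on a remote machine.
# It is named virtual_display so that the IPython `display` import below does not shadow it.
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1, 1))
virtual_display.start()

import matplotlib.pyplot as plt
%matplotlib inline
from IPython import display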
import gym
import numpy as np
from pathlib import Path

from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.vec_env import VecVideoRecorder
from stable_baselines.common.evaluation import evaluate_policy

from utils import record_and_show, evaluate_model_vec




The `record_and_show` helper imported from `utils` above records a video of the agent and shows it in the notebook.

## Training part

Here we set the path for the TensorBoard logs.  
[TensorBoard](https://www.tensorflow.org/tensorboard) is a tool for visualising machine learning experiments.

In [7]:
tensorboard_dir = "/root/tensorboard"

In this example we will use SAC (Soft Actor-Critic).  
**Full list** of algorithms available in stable-baselines: https://stable-baselines.readthedocs.io/en/master/guide/algos.html  
**SAC documentation:** https://stable-baselines.readthedocs.io/en/master/modules/sac.html

In [4]:
from stable_baselines.sac.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds, make_vec_env
from stable_baselines import SAC

Before we start training, we need to create an instance of the environment for our model.

In [9]:
env_name = "HalfCheetah-v2"
num_cpu = 8
environment = DummyVecEnv([lambda: gym.make(env_name)])

Now we create an instance of the model.  
Note that this is the step where you can later change the parameters of the algorithm.

In [10]:
model = SAC(
    MlpPolicy, 
    environment, 
    verbose=1,
    tensorboard_log=tensorboard_dir
)
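
As a rough sketch of what "changing the parameters" can look like: SAC's constructor accepts keyword arguments such as `learning_rate`, `buffer_size`, `batch_size` and `gamma`. The helper name `build_sac` and the values below are illustrative assumptions, not tuned recommendations; check the SAC documentation linked above before relying on them.

```python
# Illustrative overrides only; these values are assumptions, not tuned for HalfCheetah.
sac_overrides = {
    "learning_rate": 3e-4,  # same value as current_lr in the training logs
    "buffer_size": 100000,  # size of the replay buffer
    "batch_size": 256,      # minibatch size for each gradient update
    "gamma": 0.99,          # discount factor
}


def build_sac(env, tensorboard_log=None, **overrides):
    """Hypothetical helper: create a SAC model with the given keyword overrides."""
    # Imported lazily so that defining this helper does not load TensorFlow.
    from stable_baselines import SAC
    from stable_baselines.sac.policies import MlpPolicy

    return SAC(MlpPolicy, env, verbose=1, tensorboard_log=tensorboard_log, **overrides)


# Usage inside this notebook (assumes `environment` and `tensorboard_dir` from the cells above):
# model = build_sac(environment, tensorboard_log=tensorboard_dir, **sac_overrides)
```

Keeping the overrides in one dictionary makes it easy to see which settings differ from the defaults when you compare runs in TensorBoard.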











We use this function to record the behaviour of our model in the environment. The model is still untrained here, so the agent moves at random.

In [11]:
record_and_show(env_name, model, name="half_cheetah_random")

Saving video to  /root/notebooks/videos/half_cheetah_random-step-0-to-step-500.mp4
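
The source of `record_and_show` is not shown in this notebook. As a hedged sketch of how such a recording could work, the hypothetical function `record_rollout` below steps a vectorized environment with the model's actions. The commented wrapping with `VecVideoRecorder` and the `"videos"` folder are assumptions about what the helper does, not its actual implementation.

```python
def record_rollout(model, vec_env, video_length=500):
    """Step `vec_env` with the model's actions for video_length + 1 steps, then close it.

    `vec_env` is expected to be a vectorized environment, ideally already wrapped
    in a video recorder, so that closing it finalizes the video file.
    """
    obs = vec_env.reset()
    for _ in range(video_length + 1):
        action, _state = model.predict(obs, deterministic=True)
        # Vectorized environments reset themselves when an episode ends.
        obs, _reward, _done, _info = vec_env.step(action)
    vec_env.close()


# Assumed wrapping, using the classes imported at the top of the notebook:
# recording_env = VecVideoRecorder(
#     DummyVecEnv([lambda: gym.make(env_name)]),
#     "videos",                                    # hypothetical output folder
#     record_video_trigger=lambda step: step == 0,  # start recording at step 0
#     video_length=500,
#     name_prefix="half_cheetah_random",
# )
# record_rollout(model, recording_env, video_length=500)
```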


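`evaluate_policy` is the standard stable-baselines function for **model evaluation**. \
The most important argument to keep track of is `n_eval_episodes`, \
which sets how many episodes the model is evaluated on. \
With `return_episode_rewards=False` it returns the mean and the standard deviation of the episode reward, so `new_evaluation[0]` is the mean.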

In [12]:
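%%time
# A separate environment for evaluation; it has to exist before the first evaluate_policy call
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True, 
    render=False, callback=None, reward_threshold=None, return_episode_rewards=False
)
print("Mean reward is", new_evaluation[0])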

Mean reward is -402.14676
CPU times: user 7.59 s, sys: 460 ms, total: 8.05 s
Wall time: 6.96 s
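
To make the role of `n_eval_episodes` concrete, here is a simplified sketch of an evaluation loop. It is not the library's implementation: `evaluate_sketch`, `ToyEnv` and `ToyModel` are hypothetical names, and the toy environment only exists so that the cell runs without MuJoCo. It assumes a single, non-vectorized environment with the classic gym `reset`/`step` interface.

```python
import statistics


def evaluate_sketch(model, env, n_eval_episodes=10, deterministic=True):
    """Run full episodes and return (mean, std) of the summed episode rewards."""
    episode_returns = []
    for _ in range(n_eval_episodes):
        obs = env.reset()
        done = False
        total_reward = 0.0
        while not done:
            action, _state = model.predict(obs, deterministic=deterministic)
            obs, reward, done, _info = env.step(action)
            total_reward += reward
        episode_returns.append(total_reward)
    return statistics.mean(episode_returns), statistics.pstdev(episode_returns)


class ToyEnv:
    """Stand-in environment: every episode lasts 5 steps with reward 1.0 per step."""

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= 5, {}


class ToyModel:
    """Stand-in model that always returns the same action."""

    def predict(self, obs, deterministic=True):
        return 0.0, None


mean_reward, std_reward = evaluate_sketch(ToyModel(), ToyEnv(), n_eval_episodes=3)
print("Mean reward is", mean_reward)
```

More episodes give a less noisy estimate of the mean reward, at the cost of a longer evaluation.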


Now we train the model. Note that we use the **magic command ```%%time```** to measure how long this cell takes to run.

In [12]:
%%time
model.learn(total_timesteps=20000, log_interval=10)



----------------------------------------
| current_lr              | 0.0003     |
| ent_coef                | 0.07141871 |
| ent_coef_loss           | -19.234632 |
| entropy                 | 7.1757326  |
| episodes                | 10         |
| fps                     | 193        |
| mean 100 episode reward | -253       |
| n_updates               | 8900       |
| policy_loss             | -19.36945  |
| qf1_loss                | 0.6281452  |
| qf2_loss                | 0.50562    |
| time_elapsed            | 46         |
| total timesteps         | 9000       |
| value_loss              | 0.23295787 |
----------------------------------------
------------------------------------------
| current_lr              | 0.0003       |
| ent_coef                | 0.0075087165 |
| ent_coef_loss           | -0.84746796  |
| entropy                 | 4.8486214    |
| episodes                | 20           |
| fps                     | 195          |
| mean 100 episode reward | -255         

<stable_baselines.sac.sac.SAC at 0x7f3e0b1f9c88>

After the run we can record video and see how our agent behaves in the simulation:

In [13]:
record_and_show(env_name, model, name="half_sac_20k")



Saving video to  /root/notebooks/videos/half_sac_20k-step-0-to-step-500.mp4


At this point you should get a half-cheetah that behaves in an unpredictably strange way :)

In [14]:
%%time
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True, 
    render=False, callback=None, reward_threshold=None, return_episode_rewards=False
)
print("Mean reward is", new_evaluation[0])

Mean reward is -428.0861
CPU times: user 8.42 s, sys: 0 ns, total: 8.42 s
Wall time: 6.94 s


So, let's train it a little longer (execution of this cell should take around **6 minutes**):

In [15]:
%%time
model.learn(total_timesteps=80000, log_interval=50)

-----------------------------------------
| current_lr              | 0.0003      |
| ent_coef                | 0.030020183 |
| ent_coef_loss           | -1.6741768  |
| entropy                 | 4.8553376   |
| episodes                | 50          |
| fps                     | 194         |
| mean 100 episode reward | 300         |
| n_updates               | 48900       |
| policy_loss             | -16.513357  |
| qf1_loss                | 0.73105353  |
| qf2_loss                | 0.781108    |
| time_elapsed            | 252         |
| total timesteps         | 49000       |
| value_loss              | 0.19938016  |
-----------------------------------------
CPU times: user 13min 44s, sys: 2min 29s, total: 16min 14s
Wall time: 6min 52s


<stable_baselines.sac.sac.SAC at 0x7f3e0b1f9c88>

Let's take a look at what we got after 100k training steps in total (20k + 80k).

In [16]:
record_and_show(env_name, model, name="hcheetah_sac_100k")



Saving video to  /root/notebooks/videos/hcheetah_sac_100k-step-0-to-step-500.mp4


In [17]:
%%time
eval_env = DummyVecEnv([lambda: gym.make(env_name)])
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True, 
    render=False, callback=None, reward_threshold=None, return_episode_rewards=False
)
print("Mean reward is", new_evaluation[0])



Mean reward is 2434.8992
CPU times: user 8.5 s, sys: 0 ns, total: 8.5 s
Wall time: 6.95 s


You should see a cheetah that has already learned some way to run, so let's save this result.

In [18]:
model.save("hcheetah_sac_100k")

Now let's run it for another 400k steps and see how well our cheetah can run. \
(**Warning**: this cell will take around 30 minutes to run, so open a new notebook and play there in the meantime)

In [19]:
%%time
model.learn(total_timesteps=400000, log_interval=100)

-----------------------------------------
| current_lr              | 0.0003      |
| ent_coef                | 0.13606983  |
| ent_coef_loss           | -0.18048215 |
| entropy                 | 4.1546736   |
| episodes                | 100         |
| fps                     | 195         |
| mean 100 episode reward | 3.04e+03    |
| n_updates               | 98900       |
| policy_loss             | -210.80276  |
| qf1_loss                | 2.8582304   |
| qf2_loss                | 2.440816    |
| time_elapsed            | 505         |
| total timesteps         | 99000       |
| value_loss              | 2.5658822   |
-----------------------------------------
----------------------------------------
| current_lr              | 0.0003     |
| ent_coef                | 0.18854176 |
| ent_coef_loss           | 1.5034069  |
| entropy                 | 4.331627   |
| episodes                | 200        |
| fps                     | 196        |
| mean 100 episode reward | 4.32e+03   |


<stable_baselines.sac.sac.SAC at 0x7f3e0b1f9c88>

Visualization:

In [20]:
record_and_show(env_name, model, name="hcheetah_sac_500k")

Saving video to  /root/notebooks/videos/hcheetah_sac_500k-step-0-to-step-500.mp4


Evaluation:

In [21]:
%%time
new_evaluation = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True, 
    render=False, callback=None, reward_threshold=None, return_episode_rewards=False
)
print("Mean reward is", new_evaluation[0])

Mean reward is 6552.7397
CPU times: user 8.53 s, sys: 0 ns, total: 8.53 s
Wall time: 7.04 s


Saving the model

In [22]:
model.save("hcheetah_sac_500k")

So, at this point you should have a successfully trained policy that makes the cheetah run.  \
Here is some advice on what to do next:
1. Train this algorithm in another environment (start with simple ones like `LunarLanderContinuous-v2`; SAC only supports continuous action spaces, so the discrete `LunarLander` will not work with it)
2. Train an agent with another algorithm (different algorithms perform best on different problems)