# Tutorial 6 - Stable Baselines3 Demo

A few things about Stable Baselines3:
* `stable-baselines3` https://github.com/DLR-RM/stable-baselines3 contains the core algorithms.
* `rl-baselines3-zoo` https://github.com/DLR-RM/rl-baselines3-zoo contains additional scripts for training, evaluation, tuning and recording.
* Documentation (RL tips and tricks): https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html

In [29]:
# Install dependencies
# pip install stable-baselines3[extra]
#!apt-get update && apt-get install swig cmake
#!pip install box2d-py
#!pip install "stable-baselines3[extra]>=2.0.0a4"

### Demo - Playing Donkey Kong using DQN

Suppose that we have to design an RL agent to play the Atari variant of [Donkey Kong](https://gymnasium.farama.org/environments/atari/donkey_kong/). The following information is given:

| Information        | Value                       |
|--------------------|-----------------------------|
| Action Space       | Discrete(18)                |
| Observation Space  | Box(0, 255, (210, 160, 3), uint8) |
| Import             | `gymnasium.make("ALE/DonkeyKong-v5")` |

[Deep Q-learning](https://en.wikipedia.org/wiki/Q-learning) was originally demonstrated working directly on frames from Atari games (the states are images from the game). This is possible because the action-value function $q(s, a, \mathbf{w})$ is approximated by a convolutional neural network (CNN), which naturally handles image data. DQN can also be used with other function approximators, but a multilayer perceptron (MLP), for example, requires feature vectors as input.

Deep Q-learning can be seen as an extension of Q-learning. As in Q-learning, $\epsilon$-greedy action selection is used for exploration, and the target policy is the deterministic greedy policy:
$$
    \pi_*(s) = \underset{a \in \mathcal{A}}{\text{argmax}} \; q_*(s,a).
$$
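
As a minimal sketch of the two policies just described (plain Python, not Stable-Baselines3 code): the hypothetical helpers `greedy_action` and `epsilon_greedy_action` take a list of action values $q(s,a)$ for a single state. The behaviour policy explores with probability $\epsilon$, while the target policy always takes the argmax.

```python
import random


def greedy_action(q_values):
    """Return the index of the largest action value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


# Hypothetical action values for a state with 4 actions.
q_values = [0.1, 0.7, 0.3, 0.2]
print(greedy_action(q_values))                       # greedy target policy
print(epsilon_greedy_action(q_values, epsilon=0.1))  # exploratory behaviour policy
```

Both helpers need one value per action, which is why this scheme only applies to discrete action spaces.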
Just like Q-learning, DQN is not compatible with continuous action spaces, as can be seen from the stable-baselines3 documentation on which Gymnasium spaces are supported:

| Space            | Action     | Observation |
|------------------|------------|-------------|
| Discrete         | ✔️          | ✔️          |
| Box              | ❌          | ✔️          |
| MultiDiscrete    | ❌          | ✔️          |
| MultiBinary      | ❌          | ✔️          |
| Dict             | ❌          | ✔️          |




In [30]:
import gymnasium as gym
import os
from stable_baselines3 import DQN
from stable_baselines3.common.vec_env import VecFrameStack, VecVideoRecorder, DummyVecEnv
from stable_baselines3.common.env_util import make_atari_env

First, we apply the pre-processing that is commonly used for Atari environments. This is done with the `make_atari_env` function together with the `VecFrameStack` wrapper. The pre-processing steps are:
* Noop reset: obtain the initial state by taking a random number of no-ops on reset.
* Frame skipping: 4 by default.
* Max-pooling: most recent two observations.
* Termination signal when a life is lost.
* Resize to a square image: 84x84 by default.
* Grayscale observation.
* Clip reward to $\{-1, 0, 1\}$.
* Stack multiple frames together to provide the agent with temporal information.

So the input to our CNN, which approximates the action-value function, is a stack of 4 pre-processed $84 \times 84$ frames from the game.
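
To illustrate two of the steps listed above, here is a small sketch of reward clipping and frame stacking in plain Python. `clip_reward` and `FrameStacker` are hypothetical helpers, not the wrappers that `make_atari_env` and `VecFrameStack` actually apply, and padding the stack with blank frames on reset is an assumption of this sketch.

```python
from collections import deque


def clip_reward(reward):
    """Map a raw game reward to its sign: -1, 0 or 1."""
    return (reward > 0) - (reward < 0)


class FrameStacker:
    """Keep the n_stack most recent frames (hypothetical helper, not the SB3 class)."""

    def __init__(self, n_stack=4):
        self.n_stack = n_stack
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame, blank_frame):
        # Assumption of this sketch: pad with blank frames before the first real frame.
        self.frames.clear()
        self.frames.extend([blank_frame] * (self.n_stack - 1))
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return list(self.frames)


stacker = FrameStacker(n_stack=4)
print(stacker.reset("f0", blank_frame="blank"))  # three blanks followed by f0
print(stacker.step("f1"))                        # oldest blank dropped, f1 appended
print([clip_reward(r) for r in (-200, 0, 100)])  # rewards mapped to their sign
```

Strings stand in for the $84 \times 84$ grayscale images here. With real frames, the stacked observation carries motion information that a single image cannot.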

In [31]:
vec_env = make_atari_env("ALE/DonkeyKong-v5", n_envs=4, seed=0)  # n_envs=4: run 4 instances of the environment in parallel.
vec_env = VecFrameStack(vec_env, n_stack=4)

Next, we instantiate a DQN object. The first argument of the constructor specifies which model to use as the function approximator.

In [32]:
if not os.path.exists("./dqn_dk.zip"):
    model = DQN(
        "CnnPolicy",                  # Model used to approximate the Q-function.
        vec_env,
        verbose=1,
        train_freq=4,
        gradient_steps=1,
        exploration_fraction=0.1,     # Fraction of training over which epsilon is decayed.
        exploration_final_eps=0.1,    # Final epsilon of the epsilon-greedy schedule.
        learning_rate=1e-4,
        batch_size=32,
        learning_starts=100000,
        target_update_interval=1000,
        buffer_size=100000,           # Replay buffer size.
        optimize_memory_usage=False,
    )

Using cuda device
Wrapping the env in a VecTransposeImage.


Note that the choice of hyperparameters is important. In general, deep reinforcement learning algorithms are sensitive to the choice of hyperparameters such as the learning rate. Instead of tuning the hyperparameters ourselves, we simply take them from the list of pre-tuned hyperparameters in rl-baselines3-zoo: https://github.com/DLR-RM/rl-baselines3-zoo/tree/master/hyperparams.
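
To make the role of `exploration_fraction` and `exploration_final_eps` concrete, here is a sketch of a linear epsilon decay in plain Python. `linear_epsilon` is a hypothetical helper, not a library function. It assumes an initial epsilon of 1.0 (which should be the Stable-Baselines3 default for `exploration_initial_eps`) and the $10^7$ total timesteps that this tutorial passes to `learn`.

```python
def linear_epsilon(step, total_timesteps, exploration_fraction=0.1,
                   initial_eps=1.0, final_eps=0.1):
    """Decay epsilon linearly over the first exploration_fraction of training."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        # After the decay phase epsilon stays at its final value.
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


total = 10_000_000
for step in (0, 700, 500_000, 1_000_000, 5_000_000):
    print(step, round(linear_epsilon(step, total), 3))
```

Under these assumptions epsilon reaches 0.1 after one million steps, and the value at 700 steps rounds to 0.999, which agrees with the `exploration_rate` shown in the training log.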

In Stable-baselines3 the environment interaction is encapsulated inside the algorithm class. To run the algorithm on the environment for $n$ timesteps we simply call its `learn` method, similar to something like `model.fit` in Keras.

In [33]:
if not os.path.exists("./dqn_dk.zip"):
    model.learn(total_timesteps=int(1e7))  # Pass the number of timesteps as an integer.
    model.save("dqn_dk")
else:
    model = DQN.load("./dqn_dk.zip")

----------------------------------
| rollout/            |          |
|    exploration_rate | 1        |
| time/               |          |
|    episodes         | 4        |
|    fps              | 439      |
|    time_elapsed     | 0        |
|    total_timesteps  | 312      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 680      |
|    ep_rew_mean      | 125      |
|    exploration_rate | 0.999    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 577      |
|    time_elapsed     | 1        |
|    total_timesteps  | 700      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 680      |
|    ep_rew_mean      | 125      |
|    exploration_rate | 0.999    |
| time/               |          |
|    episodes         | 12       |
|    fps              | 619      |
|    time_elapsed   


### Evaluating the DQN agent

### Game Recording

Stable-baselines3 also provides functionality for recording your RL agent through the `VecVideoRecorder` class. This feature can also be used with your own custom RL agent.

In [None]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'
# !apt-get install ffmpeg freeglut3-dev xvfb  -y # For visualization

def record_video_atari(eval_env, model, video_length=500, prefix="", video_folder="videos/"):
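    """
    Record a video of the agent acting in the environment.

    :param eval_env: (VecEnv) vectorized environment to record.
    :param model: (RL model) agent exposing a ``predict(obs)`` method.
    :param video_length: (int) number of steps to record.
    :param prefix: (str) file-name prefix of the saved video.
    :param video_folder: (str) folder in which the video is saved.
    """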
    eval_env.metadata['render_fps'] = 60
    vec_env_record = VecVideoRecorder(eval_env,
                                      video_folder,
                                      record_video_trigger=lambda x: x == 0,    #  Function that defines when to start recording.
                                      video_length=video_length,
                                      name_prefix=prefix)

    obs = vec_env_record.reset()
    for _ in range(video_length + 1):
        action, _ = model.predict(obs)
        obs, _, _, _ = vec_env_record.step(action)

    # Close the video recorder
    vec_env_record.close()
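
Because `record_video_atari` only calls `model.predict(obs)` and unpacks two return values, any object with such a method could in principle be recorded. Below is a minimal sketch with a hypothetical `RandomAgent` (an illustrative name, not part of Stable-Baselines3). It assumes the 18 discrete actions listed for Donkey Kong and one action per parallel environment.

```python
import random


class RandomAgent:
    """Hypothetical agent that picks a uniformly random action for every environment."""

    def __init__(self, n_actions=18, n_envs=1, seed=0):
        self.n_actions = n_actions
        self.n_envs = n_envs
        self.rng = random.Random(seed)

    def predict(self, obs):
        # Mirror the (actions, state) pair unpacked in record_video_atari;
        # the observation is ignored and no recurrent state is kept.
        actions = [self.rng.randrange(self.n_actions) for _ in range(self.n_envs)]
        return actions, None


agent = RandomAgent(n_actions=18, n_envs=1)
actions, _ = agent.predict(obs=None)
print(actions)
```

Stable-Baselines3 models return their actions as a NumPy array. Whether a plain list is accepted may depend on the vectorized environment, so converting with `numpy.asarray(actions)` might be needed in practice.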

In [None]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay


def show_videos(video_path="", prefix=""):
    """
    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append(
            """<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>""".format(
                mp4, video_b64.decode("ascii")
            )
        )
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))



In [None]:
test_env = make_atari_env("ALE/DonkeyKong-v5", n_envs=1, seed=0)
# Frame-stacking with 4 frames
test_env = VecFrameStack(test_env, n_stack=4)

record_video_atari(test_env, model, video_length=5000, prefix="dqn-dk")
show_videos("videos", prefix="dqn-dk")



Saving video to /home/matthijs/bsc/BachelorProject/notebooks/videos/dqn-dk-step-0-to-step-5000.mp4
Moviepy - Building video /home/matthijs/bsc/BachelorProject/notebooks/videos/dqn-dk-step-0-to-step-5000.mp4.
Moviepy - Writing video /home/matthijs/bsc/BachelorProject/notebooks/videos/dqn-dk-step-0-to-step-5000.mp4



                                                                  

Moviepy - Done !
Moviepy - video ready /home/matthijs/bsc/BachelorProject/notebooks/videos/dqn-dk-step-0-to-step-5000.mp4


