# Continuous Actions with Stable Baselines3

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pablo-sampaio/rl_facil/blob/main/cap09/cap09-3-DDPG-stablebaselines.ipynb)

We will use the **DDPG**, **TD3**, and **SAC** algorithms in this Google Colab notebook.

References:
- Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3
- Documentation: https://stable-baselines3.readthedocs.io/en/master/guide/rl.html


## 1 - Required setup

### 1.1 Installing packages

In [None]:
import sys
from IPython.display import clear_output

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
    !pip install "stable-baselines3[extra]==2.0.0"
    #clear_output()

In [None]:
!mkdir -p log_dir

### 1.2 Saving videos

In [None]:
if IN_COLAB:
    # Set up fake display; otherwise rendering will fail
    import os
    os.system("Xvfb :1 -screen 0 1024x768x24 &")
    os.environ['DISPLAY'] = ':1'

Recording is done with the [VecVideoRecorder](https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper.

In [None]:
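import gymnasium as gym
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  Record a video of the agent acting in the environment.

  :param env_id: (str) Gymnasium environment id
  :param model: (RL model) agent used to choose the actions
  :param video_length: (int) number of steps to record
  :param prefix: (str) prefix of the video file name
  :param video_folder: (str) folder where the video is saved
  """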
  eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])
  # Start the video at step=0 and record the given number of steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

In [None]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
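
The `show_videos` function above embeds each video directly in the HTML as a base64 data URI. A minimal sketch of that encoding step on its own, using stand-in bytes instead of a real MP4 file (the helper name `to_data_uri` is hypothetical):

```python
import base64

def to_data_uri(raw_bytes, mime_type="video/mp4"):
    """Encode raw bytes as a base64 data URI that an HTML tag can use as its src."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Stand-in bytes; in the notebook these come from mp4.read_bytes()
fake_video = b"not a real video"
uri = to_data_uri(fake_video)

# Decoding the payload after the comma gives back the original bytes
payload = uri.split(",", 1)[1]
print(uri[:22], base64.b64decode(payload) == fake_video)
```

Because the whole file is inlined in the page, this approach suits short clips like the ones recorded here.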

### 1.3 Imports

In [None]:
import gymnasium as gym
import numpy as np

import tensorboard
%load_ext tensorboard

import stable_baselines3
stable_baselines3.__version__

In [None]:
from stable_baselines3 import DDPG, TD3, SAC
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise


## 2 - Start TensorBoard

Keep an eye mainly on the `ep_rew_mean` metric, which is the **mean reward per episode** (= **mean return**).

It is also worth comparing different algorithms in terms of wall-clock time: choose `RELATIVE` for the horizontal axis.

In [None]:
%tensorboard --logdir log_dir
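
To make the `ep_rew_mean` metric concrete, here is a minimal sketch of how such a value can be computed, assuming it averages the undiscounted returns of the most recent episodes. The window size of 100, the function names, and the reward values are assumptions for illustration, not taken from the library's code.

```python
from collections import deque

def episode_return(rewards):
    """Undiscounted return: the sum of all rewards collected in one episode."""
    return sum(rewards)

def rolling_mean_return(returns, window=100):
    """Mean of the most recent `window` episode returns."""
    recent = deque(returns, maxlen=window)
    return sum(recent) / len(recent)

# Hypothetical rewards from three short episodes
episodes = [[-1.0, -2.0, -3.0], [-1.0, -1.0], [0.0, -0.5, -0.5]]
returns = [episode_return(r) for r in episodes]

mean_all = rolling_mean_return(returns)                  # all three episodes
mean_last_two = rolling_mean_return(returns, window=2)   # only the last two
print(returns, mean_all, mean_last_two)
```

A rising curve therefore means that recent episodes are collecting more total reward than earlier ones.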

## 3 - Create and train an agent

In [None]:
# Some environments with continuous actions:
# 'Pendulum-v1', 'LunarLanderContinuous-v2', 'MountainCarContinuous-v0', 'BipedalWalker-v3'
ENVIRONMENT_ID = "Pendulum-v1"

env = gym.make(ENVIRONMENT_ID)

# The noise objects for DDPG and TD3
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
policy_kwargs = None
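
The cell above uses Gaussian noise, while the imports also bring in Ornstein-Uhlenbeck noise. The NumPy sketch below only illustrates the difference between the two kinds of exploration noise; it is not the library's implementation. The `theta` and `dt` values, the action bounds of [-2, 2] (assumed for `Pendulum-v1`), and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(n_steps, n_actions, sigma=0.1):
    """Independent Gaussian noise: each step is drawn separately."""
    return rng.normal(0.0, sigma, size=(n_steps, n_actions))

def ou_noise(n_steps, n_actions, sigma=0.1, theta=0.15, dt=1e-2):
    """Ornstein-Uhlenbeck noise with mean 0: each sample drifts from the previous one."""
    noise = np.zeros((n_steps, n_actions))
    prev = np.zeros(n_actions)
    for t in range(n_steps):
        prev = prev - theta * prev * dt + sigma * np.sqrt(dt) * rng.normal(size=n_actions)
        noise[t] = prev
    return noise

def noisy_action(action, noise, low=-2.0, high=2.0):
    """Add exploration noise and keep the result inside the action bounds."""
    return np.clip(action + noise, low, high)

g = gaussian_noise(1000, 1)
ou = ou_noise(1000, 1)

# Consecutive OU samples stay close to each other; Gaussian samples do not
print(np.abs(np.diff(g, axis=0)).mean(), np.abs(np.diff(ou, axis=0)).mean())
# An action near the upper bound is clipped after the noise is added
print(noisy_action(np.array([1.95]), np.array([0.1])))
```

Temporally correlated noise tends to push the agent in the same direction for several steps, which can help exploration in tasks with inertia.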

Uncomment the code below if you want to set the number of units per hidden layer of the networks that represent the policy **pi** (key `pi` of the dictionary) and the critic **Q** (key `qf`).

More information: https://stable-baselines.readthedocs.io/en/master/guide/custom_policy.html

In [None]:
# Actor with two layers of 128 and 256 units / Critic with two layers of 256 units each
#policy_kwargs = dict( net_arch=dict(pi=[128, 256], qf=[256, 256]) )
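
To get a feeling for what the `net_arch` dictionary above implies, the sketch below lists the layer shapes and counts the parameters of each network. It assumes plain fully connected layers with biases, a deterministic actor that maps the observation to the action (as in DDPG and TD3), a critic that receives the observation concatenated with the action, and the `Pendulum-v1` sizes of 3 observation values and 1 action value. The helper names are hypothetical.

```python
def mlp_layer_shapes(input_dim, hidden_sizes, output_dim):
    """List the (in, out) shape of each fully connected layer of an MLP."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    return list(zip(sizes[:-1], sizes[1:]))

def count_parameters(shapes):
    """Weights (in * out) plus biases (out) for every layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)

obs_dim, act_dim = 3, 1   # assumed sizes for Pendulum-v1
net_arch = dict(pi=[128, 256], qf=[256, 256])

# Actor: observation -> action; critic: (observation, action) -> scalar Q-value
actor_shapes = mlp_layer_shapes(obs_dim, net_arch["pi"], act_dim)
critic_shapes = mlp_layer_shapes(obs_dim + act_dim, net_arch["qf"], 1)

print(actor_shapes, count_parameters(actor_shapes))
print(critic_shapes, count_parameters(critic_shapes))
```

Most of the parameters sit in the connection between the two hidden layers, so widening both layers at once grows the network roughly quadratically.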

Choose one of the algorithms below by uncommenting the corresponding line. More information:
- **DDPG**: https://stable-baselines3.readthedocs.io/en/master/modules/ddpg.html
- **TD3**: https://stable-baselines3.readthedocs.io/en/master/modules/td3.html
- **SAC**: https://stable-baselines3.readthedocs.io/en/master/modules/sac.html

In [None]:
# Create the agent
model = DDPG("MlpPolicy", env, policy_kwargs=policy_kwargs, action_noise=action_noise, tensorboard_log="log_dir", verbose=1)
#model = TD3("MlpPolicy", env, policy_kwargs=policy_kwargs, action_noise=action_noise, tensorboard_log="log_dir", verbose=1)
#model = SAC("MlpPolicy", env, policy_kwargs=policy_kwargs, tensorboard_log="log_dir", verbose=1)

In [None]:
# Evaluate the agent before training
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean return: {mean_reward:.2f} +/- {std_reward:.2f}")
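
The evaluation above reports the mean and the standard deviation of the episode returns. The sketch below shows the general idea of such an evaluation loop with a tiny, hypothetical environment (`ToyEnv`) that only imitates the Gymnasium `reset`/`step` signatures. It is an illustration under these assumptions, not the code of `evaluate_policy`.

```python
import statistics

class ToyEnv:
    """Hypothetical 1-D environment following the Gymnasium reset/step signatures."""
    def __init__(self, horizon=5):
        self.horizon = horizon

    def reset(self):
        self.t, self.x = 0, 1.0
        return self.x, {}

    def step(self, action):
        self.x += action
        self.t += 1
        reward = -abs(self.x)                # the closer to zero, the better
        truncated = self.t >= self.horizon   # time limit reached
        return self.x, reward, False, truncated, {}

def evaluate(policy, env, n_eval_episodes=10):
    """Run full episodes and return the mean and standard deviation of the returns."""
    returns = []
    for _ in range(n_eval_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return statistics.fmean(returns), statistics.pstdev(returns)

# A policy that moves halfway towards zero at every step
mean_return, std_return = evaluate(lambda obs: -0.5 * obs, ToyEnv())
print(f"Mean return: {mean_return:.2f} +/- {std_return:.2f}")
```

Here both the policy and the environment are deterministic, so every episode has the same return and the standard deviation is zero. With a stochastic environment or policy, the spread shows how consistent the agent is.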

In [None]:
# Run the training
model.learn(total_timesteps=20_000, log_interval=10)
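
As a rough guide to what the training call above produces, the arithmetic below estimates the number of episodes and of log entries. It assumes that every `Pendulum-v1` episode runs for its full time limit of 200 steps and that `log_interval` is counted in episodes for these off-policy algorithms. The helper name is hypothetical and the result is an approximation.

```python
def training_budget(total_timesteps, max_episode_steps, log_interval):
    """Rough number of episodes and log entries for a task with fixed-length episodes."""
    n_episodes = total_timesteps // max_episode_steps
    n_log_entries = n_episodes // log_interval
    return n_episodes, n_log_entries

# 200 steps per episode is the assumed time limit of Pendulum-v1
n_episodes, n_log_entries = training_budget(20_000, 200, 10)
print(n_episodes, n_log_entries)
```

For environments whose episodes can end early, such as `LunarLanderContinuous-v2`, the number of episodes for the same budget will be larger and will vary between runs.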

## 4 - Show and evaluate the agent

In [None]:
record_video(ENVIRONMENT_ID, model, video_length=1000, prefix='alg-treinado')
show_videos('videos', prefix='alg-treinado')

In [None]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean return: {mean_reward:.2f} +/- {std_reward:.2f}")