<a href="https://colab.research.google.com/github/pablo-sampaio/rl_facil/blob/main/cap09/cap09-continuous-stablebaselines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Continuous Actions with Stable Baselines3


We will use the **DDPG**, **TD3**, and **SAC** algorithms in this Google Colab notebook.

References:
- Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3
- Documentation: https://stable-baselines3.readthedocs.io/en/master/


## 1 - Required setup

### 1.1 Installing packages

In [1]:
from IPython.display import clear_output
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
!pip install gym[all]==0.25.1
!pip install gym[atari,accept-rom-license]==0.25.1
!pip install pyglet
!pip install stable-baselines3[extra]

clear_output()

In [None]:
!mkdir log_dir

### 1.2 Saving video

In [2]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

Recording is done with the [VecVideoRecorder](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper.

In [3]:
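import gym
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  Record a video of the model acting in a fresh copy of the environment.

  :param env_id: (str) Gym environment id
  :param model: (RL model) agent used to choose the actions
  :param video_length: (int) number of steps to record
  :param prefix: (str) prefix of the video file name
  :param video_folder: (str) folder where the video is saved
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])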
  # Start the video at step=0 and record the given number of steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

In [4]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

### 1.3 Imports

In [5]:
import gym
import numpy as np
import tensorboard

%load_ext tensorboard

import stable_baselines3
stable_baselines3.__version__

'1.6.0'

In [8]:
import gym
import numpy as np

from stable_baselines3 import DDPG, TD3, SAC
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise


## 2 - Launch TensorBoard

In [None]:
%tensorboard --logdir log_dir

## 3 - Create and Train an Agent

In [13]:
# Some continuous-action environment options:
# 'Pendulum-v1', 'HalfCheetahBulletEnv-v0', 'LunarLanderContinuous-v2', 'MountainCarContinuous-v0', 'BipedalWalker-v2', 'CarRacing-v0'
ENVIRONMENT_ID = "Pendulum-v1"  

In [None]:
env = gym.make(ENVIRONMENT_ID)

# The noise objects for DDPG and TD3
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG("MlpPolicy", env, action_noise=action_noise, tensorboard_log="log_dir", verbose=1)
#model = TD3("MlpPolicy", env, action_noise=action_noise, tensorboard_log="log_dir", verbose=1)
#model = SAC("MlpPolicy", env, tensorboard_log="log_dir", verbose=1)
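
The cell above uses `NormalActionNoise`, while `OrnsteinUhlenbeckActionNoise` is imported but not used. As a rough illustration of how the two kinds of exploration noise differ, here is a minimal standard-library sketch. It is not Stable-Baselines3's implementation; the helper names and the values of `theta`, `dt`, and `sigma` are assumptions chosen for illustration only.

```python
import math
import random

def gaussian_noise(n_steps, sigma, rng):
    """Independent zero-mean Gaussian samples, one per step."""
    return [rng.gauss(0.0, sigma) for _ in range(n_steps)]

def ou_noise(n_steps, sigma, rng, theta=0.15, dt=1e-2, mean=0.0):
    """Euler-Maruyama discretisation of an Ornstein-Uhlenbeck process.

    Each sample depends on the previous one, so the noise drifts smoothly
    instead of jumping independently at every step.
    """
    samples, x = [], 0.0
    for _ in range(n_steps):
        x += theta * (mean - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

def lag1_autocorrelation(xs):
    """Sample correlation between consecutive values of a sequence."""
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

rng = random.Random(0)  # fixed seed so the comparison is reproducible
white = gaussian_noise(5000, sigma=0.1, rng=rng)
correlated = ou_noise(5000, sigma=0.1, rng=rng)

print(f"lag-1 autocorrelation, Gaussian: {lag1_autocorrelation(white):+.3f}")
print(f"lag-1 autocorrelation, OU:       {lag1_autocorrelation(correlated):+.3f}")
```

The Gaussian sequence should show an autocorrelation near zero, while the Ornstein-Uhlenbeck sequence should be close to one. Temporally correlated noise of this kind pushes the agent in a consistent direction for several steps, which may help exploration in tasks with momentum, such as the pendulum.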

In [None]:
model.learn(total_timesteps=10000, log_interval=10)

## 4 - Display and evaluate the agent

In [22]:
record_video(ENVIRONMENT_ID, model, video_length=1000, prefix='alg-sem-treino')
show_videos('videos', prefix='alg-sem-treino')

Saving video to /content/videos/alg-sem-treino-step-0-to-step-1000.mp4


In [23]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=30)

print(f"Mean return: {mean_reward:.2f} +/- {std_reward:.2f}")

Mean return: -173.20 +/- 108.69
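
To make the two reported numbers concrete, the sketch below shows how a mean and a standard deviation can be obtained from a list of per-episode returns. It assumes the population standard deviation (the NumPy default), which is my understanding of what `evaluate_policy` reports. The helper `summarize_returns` and the return values are hypothetical, not taken from the run above.

```python
import statistics

def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)  # divides by N, like numpy.std
    return mean, std

# Hypothetical returns from four evaluation episodes (illustrative only)
example_returns = [-150.0, -120.0, -300.0, -130.0]
example_mean, example_std = summarize_returns(example_returns)

print(f"Mean return: {example_mean:.2f} +/- {example_std:.2f}")
```

A single poor episode (here, -300) is enough to inflate the standard deviation, so a large spread like the one reported above may reflect a few bad episodes rather than uniformly unstable behavior.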
