# Practical Activity II - Training and Validating RL Models

**Student:** Marco Antonio Batista

**Course:** Reinforcement Learning - Class II

**Date:** 21/08/2021



In this assignment we use `Gym`, `Stable-Baselines3` and `RL Baselines Zoo` to train and validate reinforcement learning models. Your task is to:

1. Select a `Gym` environment of your choice, provided it is also covered by the models available in `RL Baselines Zoo`.
2. Select three algorithms from the `Stable-Baselines3` library to solve this problem. Consult the library documentation to find which algorithms best suit the chosen environment, and justify your choice.
3. Train each of the three models (you may tune the model parameters if you find it necessary) and save the models to disk.
4. With the trained models saved, load them and evaluate each one for 10 episodes. Report the mean results and generate the cumulative reward curve provided by `TensorBoard`.
5. Compare the results of your trained models with those obtained by the existing model(s) in `RL Baselines Zoo` for the chosen environment.
6. Generate a video of the best model you trained and of the model chosen from `RL Baselines Zoo`. Check each library's documentation on creating videos and displaying them in notebooks.



* **Due date:** 04/09/2021
* **Submission location:** AVA.
* **Document type:** Notebook (`.ipynb`).



In [1]:
import gym

from stable_baselines3 import DQN
from stable_baselines3 import PPO
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback

import time
import imageio
import numpy as np

# Create environment
env = gym.make('LunarLander-v2')

In [3]:
# Instantiate the agent
model = DQN('MlpPolicy', env, verbose=0, tensorboard_log="./lunar_tensorboard/train_dqn/")
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("lunar_dqn")
del model

2021-09-01 19:36:25.842035: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-01 19:36:25.842090: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [4]:
# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=0, tensorboard_log="./lunar_tensorboard/train_ppo/")
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("lunar_ppo")
del model

In [5]:
# Instantiate the agent
model = A2C('MlpPolicy', env, verbose=0, tensorboard_log="./lunar_tensorboard/train_a2c/")
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("lunar_a2c")
del model

In [6]:
# inspect the training data in TensorBoard
#!tensorboard --logdir ./lunar_tensorboard/

In [7]:
# Load the trained agent
model = DQN.load("lunar_dqn", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
obs = env.reset()
for i in range(3000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    time.sleep(0.0003)
    if done:
        obs = env.reset()

env.close()

mean_reward:-96.67 +/- 46.50
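
The figure printed above summarises 10 evaluation episodes as a mean and a standard deviation of per-episode returns. The following is a minimal sketch of that aggregation, not the library's implementation: `episode_returns` is a hypothetical helper, the step data is made up, and the use of the population standard deviation is an assumption about how `evaluate_policy` computes its second value.

```python
from statistics import mean, pstdev


def episode_returns(rewards, dones):
    """Sum step rewards into one return per finished episode."""
    returns, total = [], 0.0
    for reward, done in zip(rewards, dones):
        total += reward
        if done:
            # Episode finished: store its return and start a new one
            returns.append(total)
            total = 0.0
    return returns


# Hypothetical step data covering two short episodes
rewards = [1.0, -2.0, 3.0, 0.5, 0.5]
dones = [False, False, True, False, True]

returns = episode_returns(rewards, dones)
mean_reward = mean(returns)
std_reward = pstdev(returns)  # population std (assumed)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")
```

With only 10 episodes the standard deviation is large relative to the mean, so small differences between agents should be read with caution.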


In [8]:
# Load the trained agent
model = PPO.load("lunar_ppo", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
obs = env.reset()
for i in range(3000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    time.sleep(0.0003)
    if done:
        obs = env.reset()

env.close()

In [9]:
# Load the trained agent
model = A2C.load("lunar_a2c", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
obs = env.reset()
for i in range(3000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    time.sleep(0.0003)
    if done:
        obs = env.reset()

env.close()

In [10]:
#!git clone --recursive https://github.com/DLR-RM/rl-baselines3-zoo
%cd rl-baselines3-zoo/
#!pip install -r requirements.txt

/home/marco/rf/rl-baselines3-zoo


In [11]:
# RL Baselines3 Zoo - evaluation via TensorBoard (DQN)
!python train.py --algo dqn --env LunarLander-v2 --verbose 0 --tensorboard-log ../lunar_tensorboard/

Seed: 287680758
Log path: logs/dqn/LunarLander-v2_4
2021-09-01 19:46:57.891537: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-01 19:46:57.891575: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Eval num_timesteps=10000, episode_reward=-94.04 +/- 40.17
Episode length: 677.60 +/- 396.07
New best mean reward!
Eval num_timesteps=20000, episode_reward=42.89 +/- 149.97
Episode length: 278.00 +/- 182.54
New best mean reward!
Eval num_timesteps=30000, episode_reward=165.46 +/- 99.77
Episode length: 597.80 +/- 247.05
New best mean reward!
Eval num_timesteps=40000, episode_reward=152.68 +/- 94.75
Episode length: 586.20 +/- 348.68
Eval num_timesteps=50000, episode_reward=260.36 +/- 12.92
Episode length: 326.80 +/- 51.59
New best mean reward!
Eval num_time

In [12]:
# RL Baselines3 Zoo - evaluation via TensorBoard (PPO)
!python train.py --algo ppo --env LunarLander-v2 --verbose 0 --tensorboard-log ../lunar_tensorboard/

Seed: 3698498684
Log path: logs/ppo/LunarLander-v2_4
2021-09-01 19:54:36.165229: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-01 19:54:36.165265: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Eval num_timesteps=10000, episode_reward=-148.90 +/- 61.77
Episode length: 71.20 +/- 12.06
New best mean reward!
Eval num_timesteps=20000, episode_reward=-161.41 +/- 61.85
Episode length: 98.20 +/- 39.97
Eval num_timesteps=30000, episode_reward=-193.31 +/- 132.50
Episode length: 110.20 +/- 42.34
Eval num_timesteps=40000, episode_reward=-778.93 +/- 473.02
Episode length: 221.40 +/- 109.70
Eval num_timesteps=50000, episode_reward=-155.99 +/- 227.62
Episode length: 414.80 +/- 294.00
Eval num_timesteps=60000, episode_reward=-215.00 +/- 158.19
Episode length

Eval num_timesteps=810000, episode_reward=248.62 +/- 17.09
Episode length: 355.60 +/- 23.10
Eval num_timesteps=820000, episode_reward=239.38 +/- 17.61
Episode length: 356.20 +/- 16.71
Eval num_timesteps=830000, episode_reward=251.25 +/- 22.85
Episode length: 352.80 +/- 10.13
Eval num_timesteps=840000, episode_reward=254.02 +/- 22.25
Episode length: 373.40 +/- 15.45
Eval num_timesteps=850000, episode_reward=231.16 +/- 18.78
Episode length: 358.20 +/- 12.92
Eval num_timesteps=860000, episode_reward=248.52 +/- 12.74
Episode length: 393.00 +/- 16.25
Eval num_timesteps=870000, episode_reward=240.40 +/- 6.17
Episode length: 411.60 +/- 18.38
Eval num_timesteps=880000, episode_reward=237.59 +/- 21.37
Episode length: 396.20 +/- 15.03
Eval num_timesteps=890000, episode_reward=230.89 +/- 14.87
Episode length: 405.80 +/- 18.13
Eval num_timesteps=900000, episode_reward=237.77 +/- 29.19
Episode length: 421.60 +/- 13.47
Eval num_timesteps=910000, episode_reward=248.48 +/- 22.40
Episode length: 399.40

In [14]:
# RL Baselines3 Zoo - evaluation via TensorBoard (A2C)
!python train.py --algo a2c --env LunarLander-v2 --verbose 0 --tensorboard-log ../lunar_tensorboard/

Seed: 599132
Log path: logs/a2c/LunarLander-v2_5
2021-09-01 20:20:03.431910: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-01 20:20:03.432005: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Eval num_timesteps=10000, episode_reward=-4159.30 +/- 1010.45
Episode length: 844.40 +/- 82.42
New best mean reward!
Eval num_timesteps=20000, episode_reward=-1588.27 +/- 85.35
Episode length: 1000.00 +/- 0.00
New best mean reward!
Eval num_timesteps=30000, episode_reward=-1255.24 +/- 112.63
Episode length: 1000.00 +/- 0.00
New best mean reward!
Eval num_timesteps=40000, episode_reward=-148.62 +/- 247.28
Episode length: 917.40 +/- 125.53
New best mean reward!
Eval num_timesteps=50000, episode_reward=-64.51 +/- 173.73
Episode length: 634.00 +/- 196.41
New be
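
To compare the Zoo runs with the locally trained agents, the `Eval num_timesteps=...` lines in the logs above can be turned into numbers. This is a small sketch that assumes the log format stays exactly as printed here. `parse_eval_log` is a hypothetical helper, and the sample text is two evaluation entries copied from the DQN run.

```python
import re

# Matches lines such as:
# Eval num_timesteps=10000, episode_reward=-94.04 +/- 40.17
EVAL_LINE = re.compile(
    r"Eval num_timesteps=(\d+), "
    r"episode_reward=(-?\d+(?:\.\d+)?) \+/- (\d+(?:\.\d+)?)"
)


def parse_eval_log(text):
    """Return a list of (timesteps, mean_reward, std_reward) tuples."""
    return [(int(t), float(m), float(s)) for t, m, s in EVAL_LINE.findall(text)]


log = """\
Eval num_timesteps=10000, episode_reward=-94.04 +/- 40.17
Episode length: 677.60 +/- 396.07
New best mean reward!
Eval num_timesteps=50000, episode_reward=260.36 +/- 12.92
Episode length: 326.80 +/- 51.59
"""

evals = parse_eval_log(log)
best = max(evals, key=lambda entry: entry[1])  # highest mean reward
print(best)
```

Truncated lines (such as the cut-off endings of the logs above) do not match the pattern and are skipped.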

In [18]:
%cd ../

/home/marco/rf


In [19]:
# inspect the training data from our runs and from RL Baselines3 Zoo in TensorBoard
#!tensorboard --logdir ./lunar_tensorboard/
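
Reward curves for this environment are noisy, so they are usually compared after smoothing. The sketch below applies a plain exponential moving average, similar in spirit to the smoothing slider in TensorBoard. It does not reproduce TensorBoard's exact formula. The `ema` helper and the `weight` value are assumptions made for illustration, and the input values are the first five DQN evaluation means from the Zoo log.

```python
def ema(values, weight=0.6):
    """Exponential moving average; higher weight means smoother output."""
    if not values:
        return []
    smoothed, last = [], values[0]
    for value in values:
        # Blend the previous smoothed value with the new observation
        last = weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed


# First five DQN evaluation means reported by the Zoo run
dqn_means = [-94.04, 42.89, 165.46, 152.68, 260.36]
smoothed = ema(dqn_means, weight=0.6)
print([round(v, 2) for v in smoothed])
```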

In [2]:
model = PPO.load("lunar_ppo", env=env)

images = []
obs = env.reset()
img = env.render(mode='rgb_array')
for i in range(1300):
    images.append(img)
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    img = env.render(mode='rgb_array')
    if done:
        obs = env.reset()

imageio.mimsave('lander_ppo.gif', [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=29)

env.close()
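
The cell above records 1300 frames and writes every second one to the GIF. A rough sketch of what that means for the output length follows. Integers stand in for the RGB arrays, `subsample` is a hypothetical helper, and the duration estimate assumes the GIF writer honours `fps` exactly, which may not hold in practice.

```python
def subsample(frames, step=2):
    """Keep every `step`-th frame, starting with the first."""
    return frames[::step]


n_steps, fps = 1300, 29
frames = list(range(n_steps))  # placeholders for rendered RGB frames
kept = subsample(frames, step=2)
duration_s = len(kept) / fps   # estimated playback time in seconds
print(len(kept), round(duration_s, 1))
```

Because the loop resets the environment whenever `done` is true, the recorded frames can span more than one episode.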

In [3]:
model = A2C.load("lunar_a2c", env=env)

images = []
obs = env.reset()
img = env.render(mode='rgb_array')
for i in range(1300):
    images.append(img)
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    img = env.render(mode='rgb_array')
    if done:
        obs = env.reset()

imageio.mimsave('lander_a2c.gif', [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=29)

env.close()

In [4]:
model = DQN.load("lunar_dqn", env=env)

images = []
obs = env.reset()
img = env.render(mode='rgb_array')
for i in range(1300):
    images.append(img)
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    img = env.render(mode='rgb_array')
    if done:
        obs = env.reset()

imageio.mimsave('lander_dqn.gif', [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=29)

env.close()

In [6]:
%cd rl-baselines3-zoo/
#!python -m utils.record_video --algo ppo --env LunarLander-v2 -n 1000
!python -m utils.record_training --algo ppo --env LunarLander-v2 -n 1000 -f logs --deterministic --gif

[Errno 2] No such file or directory: 'rl-baselines3-zoo/'
/home/marco/rf/rl-baselines3-zoo
Loading latest experiment, id=4
Saving video to /home/marco/rf/rl-baselines3-zoo/logs/ppo/LunarLander-v2_4/videos/final-model-ppo-LunarLander-v2-step-0-to-step-1000.mp4
Exception ignored in: <function VecVideoRecorder.__del__ at 0x7f520d5e11f0>
Traceback (most recent call last):
  File "/home/marco/rf/rf/lib/python3.8/site-packages/stable_baselines3/common/vec_env/vec_video_recorder.py", line 113, in __del__
  File "/home/marco/rf/rf/lib/python3.8/site-packages/stable_baselines3/common/vec_env/vec_video_recorder.py", line 109, in close
AttributeError: 'NoneType' object has no attribute 'close'
Saving video to /home/marco/rf/rl-baselines3-zoo/logs/ppo/LunarLander-v2_4/videos/best-model-ppo-LunarLander-v2-step-0-to-step-1000.mp4
Exception ignored in: <function VecVideoRecorder.__del__ at 0x7fb9ed19b1f0>
Traceback (most recent call last):
  File "/home/marco/rf/rf/lib/python3.8/site-packages/stable_