<a href="https://colab.research.google.com/github/marcosfmmota/RL-mo436/blob/main/notebooks/project2_mo436.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2 - Reinforcement Learning - MO436


## Problem

In this project we analyze the behavior of Deep Reinforcement Learning algorithms on Atari games, focusing on the Atari version of Pacman (the `MsPacman-v0` environment).

Applying reinforcement learning algorithms to Atari games was the first breakthrough in combining deep learning methods with RL, and it gave rise to the field of Deep Reinforcement Learning [1]. However, such implementations generally train for several days on machines with substantial computational resources in order to reach state-of-the-art results.

In this project we analyze the DQN (Deep Q-Networks) [1] and PPO (Proximal Policy Optimization) [2] algorithms on the Atari version of Pacman.

In [2]:
!pip install stable-baselines3[extra]



### Helper Functions

In [1]:
# Video Recorder
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [2]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv
from stable_baselines3.common.env_util import make_atari_env

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  eval_env = make_atari_env(env_id)
  # Start the video at step=0 and record video_length steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

## Off-policy method (DQN)

In [1]:
import gym
from stable_baselines3 import DQN
from stable_baselines3.dqn import MlpPolicy, CnnPolicy
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.atari_wrappers import WarpFrame, ClipRewardEnv, AtariWrapper

BUFFER_SIZE = 500000
TOTAL_TIMESTEPS = 100000
env = make_atari_env('MsPacman-v0')

In [None]:
model_dqn = DQN(CnnPolicy, env, buffer_size=BUFFER_SIZE)
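
The `buffer_size` argument caps how many past transitions DQN keeps for off-policy replay. As a rough illustration only (this is not the stable-baselines3 implementation), the sketch below shows a fixed-capacity replay buffer and the one-step TD target $r + \gamma \max_{a'} Q(s', a')$ used in [1]. The names `ReplayBuffer` and `td_target` are hypothetical, and Q-values are plain lists rather than network outputs.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest ones are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size, rng=random):
        # Draw a uniform random minibatch of stored transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r + gamma * max_a' Q(s', a'), with no bootstrap at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Toy usage (illustrative values): capacity 3, five transitions added.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(obs=step, action=0, reward=1.0, next_obs=step + 1, done=(step == 4))

oldest_obs = buffer.storage[0][0]  # transitions from steps 0 and 1 were evicted
batch = buffer.sample(2)
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.9)
```

A larger buffer keeps older experience available for sampling at the cost of memory, which matters for image observations such as Atari frames.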

In [None]:
# Use the constants defined above (the lowercase names were undefined)
model_dqn.learn(total_timesteps=TOTAL_TIMESTEPS)
model_dqn.save(f"dqn_mspacman-{BUFFER_SIZE}-{TOTAL_TIMESTEPS}")

In [2]:
loaded_dqn_model = DQN.load(f"dqn_mspacman-{BUFFER_SIZE}-{TOTAL_TIMESTEPS}")

In [3]:
from stable_baselines3.common.evaluation import evaluate_policy
env_eval = make_atari_env('MsPacman-v0')
mean_reward, std_reward = evaluate_policy(loaded_dqn_model, env_eval, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:23.13 +/- 15.92
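
`evaluate_policy` summarizes the per-episode returns as a mean and a standard deviation. The sketch below reproduces that kind of summary with the standard library only, assuming the population standard deviation (NumPy's default) is what is reported. The helper `summarize_returns` is hypothetical, and the values in `episode_returns` are made up for illustration; they are not results from this notebook.

```python
import statistics


def summarize_returns(episode_returns):
    """Return (mean, population std) of a list of per-episode returns."""
    mean_return = statistics.fmean(episode_returns)
    std_return = statistics.pstdev(episode_returns)
    return mean_return, std_return


# Hypothetical returns from five evaluation episodes (illustrative only).
episode_returns = [10.0, 20.0, 30.0, 40.0, 50.0]
mean_return, std_return = summarize_returns(episode_returns)
print(f"mean_reward:{mean_return:.2f} +/- {std_return:.2f}")
```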


In [8]:
record_video('MsPacman-v0', loaded_dqn_model, video_length=500, prefix='dqn-mspacman')

Saving video to  /content/videos/dqn-mspacman-step-0-to-step-500.mp4


In [9]:
show_videos('videos', 'dqn-mspacman')

## On-policy method (PPO)

In [4]:
import gym

from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack


# Parallel environments
env = make_atari_env('MsPacman-v0', n_envs=4)
TOTAL_TIMESTEPS = 500000
# Note: MlpPolicy flattens the image observations; CnnPolicy is the usual choice for Atari frames
model_ppo = PPO(MlpPolicy, env)
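
PPO [2] limits how far each update can move the policy by clipping the probability ratio between the new and the old policy. A minimal sketch of the clipped surrogate term for a single sample follows. The function name `clipped_surrogate` is hypothetical, and `clip_range=0.2` is assumed here as the clipping parameter.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the ratio is capped at 1 + eps.
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)
# Negative advantage: the unclipped, more pessimistic term is kept.
uncapped_loss = clipped_surrogate(ratio=1.5, advantage=-2.0)
# Ratio inside the clip range: the objective is simply ratio * advantage.
unclipped = clipped_surrogate(ratio=1.1, advantage=2.0)
```

Taking the minimum makes the objective a pessimistic bound: the policy gets no extra credit for moving the ratio outside the clip range in the direction the advantage favors.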

In [5]:
model_ppo.learn(total_timesteps=TOTAL_TIMESTEPS)
model_ppo.save(f"ppo_mspacman-t-{TOTAL_TIMESTEPS}")
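
With 4 parallel environments, each PPO update consumes one rollout of `n_steps` steps per environment. Assuming the stable-baselines3 default of `n_steps=2048` (the notebook does not set it) and that training proceeds in whole rollouts, a back-of-the-envelope count of the updates behind 500000 timesteps looks like this. `count_rollouts` is a hypothetical helper.

```python
import math


def count_rollouts(total_timesteps, n_envs, n_steps=2048):
    """Number of whole rollouts needed to reach at least total_timesteps."""
    steps_per_rollout = n_envs * n_steps
    n_rollouts = math.ceil(total_timesteps / steps_per_rollout)
    return n_rollouts, n_rollouts * steps_per_rollout


n_rollouts, steps_collected = count_rollouts(total_timesteps=500000, n_envs=4)
print(n_rollouts, steps_collected)
```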

In [10]:
from stable_baselines3.common.evaluation import evaluate_policy
env_eval = make_atari_env('MsPacman-v0')
mean_reward, std_reward = evaluate_policy(model_ppo, env_eval, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:6.91 +/- 9.64
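
To judge whether the gap between the two evaluation results is larger than the episode-to-episode noise, one option is to compare it with the standard error of the difference of means. The sketch below uses the means and standard deviations printed by the two evaluation cells (100 episodes each) and assumes the episodes are independent. `standard_error` is a hypothetical helper.

```python
import math


def standard_error(std, n_episodes):
    """Standard error of a mean estimated from n_episodes returns."""
    return std / math.sqrt(n_episodes)


# Values reported by evaluate_policy in this notebook (100 episodes each).
dqn_mean, dqn_std = 23.13, 15.92
ppo_mean, ppo_std = 6.91, 9.64

difference = dqn_mean - ppo_mean
se_difference = math.hypot(standard_error(dqn_std, 100), standard_error(ppo_std, 100))
print(f"difference: {difference:.2f}, standard error: {se_difference:.2f}")
```

Under these assumptions the gap spans several standard errors. The two agents were trained with different timestep budgets and policy types, so this is not a like-for-like comparison of the algorithms.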


In [6]:
record_video('MsPacman-v0', model_ppo, video_length=500, prefix='ppo-mspacman-t-500000')

In [7]:
show_videos('videos', 'ppo-mspacman-t')

## References

[1] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015, doi: 10.1038/nature14236.

[2] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv, pp. 1–12, 2017.