## LAB 5 - TASK 4 submission. ML 2024-25.
**Deep Reinforcement Learning to play ATARI games**


FILL IN THIS BOX WITH YOUR DETAILS

**NAME AND NIP**: 

- Ignacio Pastore Benaim, 920576
- David Padilla Orenga, 946874

## Deep RL example with ATARI
This colab is a tutorial built following materials published by [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx) at [class t81-558](https://sites.wustl.edu/jeffheaton/t81-558/). **Atari Games with Stable Baselines Neural Networks** [[Notebook]](t81_558_class_12_4_atari.ipynb)

## Colab setup (run only first time)

Install the necessary libraries. You may need to restart the Colab runtime after all the installation steps have finished.

In [None]:
try:
    from google.colab import drive
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

if COLAB:
  !pip install stable-baselines3[extra] gymnasium
  !pip install gymnasium[atari]
  !pip install pyvirtualdisplay
  !sudo apt-get install -y python3-opengl ffmpeg  # python-opengl is the older (Python 2) package name
  !sudo apt-get install -y xvfb

## Gymnasium Atari Breakout

In artificial intelligence research, and particularly in reinforcement learning, Atari Breakout has been adapted as an environment within **OpenAI's Gym toolkit** (now maintained as **Gymnasium**, the library imported in this notebook), a collection of environments that provide a standardized interface for algorithm development and benchmarking.

**Stable Baselines** is a set of implementations of reinforcement learning algorithms to train and evaluate agents on various tasks, including playing Atari games like Breakout.

The adaptation of the **Breakout game** to the Gym interface (registered under IDs such as 'Breakout-v0' or 'BreakoutDeterministic-v4'; this notebook uses 'BreakoutNoFrameskip-v4') abstracts the game's mechanics into observations, actions, and rewards that an AI agent can interact with.
In this setup, the agent observes the game state (pixel data from the screen), selects actions (do nothing, fire to launch the ball, or move the paddle left or right), and receives rewards (the score for breaking bricks).
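
The observe/act/reward cycle described above can be sketched without any Atari dependency. The snippet below is only an illustration: `ToyPaddleEnv` is a hypothetical class that mimics the `reset()`/`step()` interface of Gymnasium, it is not the real Breakout environment, and the random policy stands in for a trained agent.

```python
import random


class ToyPaddleEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset()/step() interface."""

    N_ACTIONS = 3  # 0 = stay, 1 = left, 2 = right (illustrative, not Breakout's action set)

    def __init__(self, width=5, max_steps=20, seed=0):
        self.width = width
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.paddle = self.width // 2
        self.ball = self.rng.randrange(self.width)
        self.steps = 0
        return (self.paddle, self.ball), {}  # observation, info

    def step(self, action):
        if action == 1:
            self.paddle = max(0, self.paddle - 1)
        elif action == 2:
            self.paddle = min(self.width - 1, self.paddle + 1)
        # Reward of 1 when the paddle ends up under the ball, else 0.
        reward = 1.0 if self.paddle == self.ball else 0.0
        self.ball = self.rng.randrange(self.width)
        self.steps += 1
        terminated = False                         # this toy game has no losing state
        truncated = self.steps >= self.max_steps   # time limit reached
        return (self.paddle, self.ball), reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    obs, _info = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward


policy_rng = random.Random(1)
episode_return = run_episode(
    ToyPaddleEnv(), lambda obs: policy_rng.randrange(ToyPaddleEnv.N_ACTIONS)
)
print(f"Episode return with a random policy: {episode_return}")
```

In the real notebook the observation is a stack of screen images instead of a pair of integers, and the policy is the trained network, but the loop has the same shape.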

## Training the Agent

The following code configures and runs the training of the DQN agent. Training can take many hours if you substantially increase `TIMESTEPS`, the number of environment steps to run. While training runs, the cell output reports the loss and the average episode return. The reported losses are averages over individual training batches.

In [None]:
# VERSION 1 of the DQN
import ale_py  # Import this before making any Atari environments
import gymnasium as gym
from stable_baselines3 import PPO, DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Choose the game name
GAME_NAME = 'Breakout'  # Or 'Atlantis'

# Create the game environment, note that we wrap it with VecFrameStack for preprocessing
env_id = f"{GAME_NAME}NoFrameskip-v4" # pick desired game version
env = make_atari_env(env_id, n_envs=4, seed=0)
env = VecFrameStack(env, n_stack=4)

# Initialize the DQN to use a CNN model as feature extractor
model = DQN('CnnPolicy', env, verbose=1, tensorboard_log="./atari_dqn_tensorboard/")

# Optionally load a previously trained model to continue training where you stopped.
# Set pretrained = True and make sure model_path points to the saved model.
pretrained = False
if pretrained:
    model_path = f"{GAME_NAME}_dqn_model.zip"
    model = DQN.load(model_path)
    model.set_env(env)

# Train the agent
TIMESTEPS = 100_000  # total environment steps (an integer, not the float 1e5)
model.learn(total_timesteps=TIMESTEPS, progress_bar=True)

# Save the model
model.save(f"{GAME_NAME}_dqn_model")

# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)

print(f"Mean reward: {mean_reward} +/- {std_reward}")

# Don't forget to close the environment when you are done
env.close()
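
The training cell wraps the environment with `VecFrameStack(env, n_stack=4)` so that each observation contains the last four frames, which lets the network infer motion such as the ball's direction. The idea can be sketched in a few lines. `FrameStack` below is a hypothetical helper written for illustration, not the library wrapper, and strings stand in for the 84x84 images.

```python
from collections import deque


class FrameStack:
    """Hypothetical helper: keeps the n most recent frames, oldest first."""

    def __init__(self, n_stack=4):
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Fill the whole stack with the first frame so the shape is constant from step 0.
        # (A simplifying choice for this sketch; the library wrapper may pad differently.)
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(n_stack=4)
stacked = stack.reset("f0")  # strings stand in for grayscale images
for name in ["f1", "f2"]:
    stacked = stack.push(name)
print(stacked)
```

Each call to `push` returns the newest frame together with the three frames before it, which is the kind of input the `CnnPolicy` receives at every step.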


In [None]:
# VERSION 2 of the DQN (including your variations)
# - remember to SAVE THE SAMPLE VIDEO for VERSION 1 with the code below before running your new version of the DQN!!

## Videos

The following functions allow us to watch the agent play the game in the notebook.

In [None]:
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecVideoRecorder
from stable_baselines3 import PPO, DQN
import os

# Set the game name here
GAME_NAME = 'Breakout'  # If you prefer, you can try the 'Atlantis' game as well

# Load your previously trained model
model_path = f"{GAME_NAME}_dqn_model.zip"
#model = PPO.load(model_path)
model = DQN.load(model_path)

# Create the Atari environment and apply the correct wrappers
env_id = f"{GAME_NAME}NoFrameskip-v4"
env = make_atari_env(env_id, n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)

# Record the environment
video_folder = '/content/videos'
os.makedirs(video_folder, exist_ok=True)

env = VecVideoRecorder(env, video_folder,
                       record_video_trigger=lambda step: step == 0,
                       video_length=500,
                       name_prefix=f"{GAME_NAME}-agent")

# Reset the environment and observe the initial observation shape
obs = env.reset()
print("Initial observation shape:", obs.shape)  # Should be (1, 84, 84, 4)

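# Run until the first episode ends. `dones` holds one flag per environment,
# so index the single environment explicitly. Note that the Atari wrappers
# may end an episode when a life is lost, not only when the game is over.
done = False
while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)
    done = bool(dones[0])
    env.render()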

# Close the environment which should also save the video
env.close()


In [None]:
import os
from IPython.display import HTML
from base64 import b64encode

# Load a recorded video and encode it
video_path = '/content/videos'  # Make sure this matches the path where the videos are saved
video_files = sorted(f for f in os.listdir(video_path) if f.endswith('.mp4'))

if video_files:
    video_filename = video_files[-1]  # if you expect multiple videos, modify this to select the correct one
    full_video_filename = os.path.join(video_path, video_filename)
    with open(full_video_filename, 'rb') as video_file:
        mp4 = video_file.read()
    encoded = b64encode(mp4).decode('ascii')
    html = HTML(data=f'<video width="640" height="480" controls><source src="data:video/mp4;base64,{encoded}" type="video/mp4"></video>')
else:
    html = HTML(data="Error: No video found")

html


In [None]:
# RUN this command to visualize tensorboard logs stored
# More info on what has been logged here: https://stable-baselines3.readthedocs.io/en/master/guide/tensorboard.html
%load_ext tensorboard
%tensorboard --logdir /content/atari_dqn_tensorboard/
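
One of the quantities reported during training is the loss. As background on what it measures, the sketch below computes one-step temporal-difference targets and a smooth L1 (Huber-style) loss for a tiny batch in plain Python. It rests on assumptions: the discount factor `gamma=0.99` and the smooth L1 loss reflect my understanding of the Stable Baselines3 DQN defaults and should be checked against the library documentation, and all numbers are made-up placeholders, not values logged by this notebook.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step targets: r + gamma * max_a Q_target(s', a), with no bootstrap on terminal steps."""
    return [
        r + (0.0 if d else gamma * max(q_next))
        for r, q_next, d in zip(rewards, next_q_values, dones)
    ]


def smooth_l1(prediction, target, beta=1.0):
    """Huber-style loss for one sample: quadratic near zero, linear for large errors."""
    error = abs(prediction - target)
    return 0.5 * error ** 2 / beta if error < beta else error - 0.5 * beta


# Illustrative placeholder numbers, not values logged by the notebook.
rewards = [1.0, 0.0, 0.0]
dones = [False, False, True]
next_q_values = [[0.2, 0.5], [1.0, 0.4], [0.3, 0.9]]  # target-network Q(s', a) per action
current_q = [1.0, 3.0, 0.5]                            # online-network Q(s, a) for the action taken

targets = td_targets(rewards, next_q_values, dones)
batch_loss = sum(smooth_l1(q, t) for q, t in zip(current_q, targets)) / len(targets)
print("targets:", targets)
print("mean batch loss:", batch_loss)
```

Because the targets themselves move as the network learns, a falling loss does not by itself mean the agent plays better, which is why the episode return is usually examined alongside it.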

## **Question 1**:
Train a different model than the provided one (if possible, for a similar number of iterations).

For this new version, you need to **change at least two elements** related to any of the following: the CNN used, the policy, other agent configuration parameters or model compilation options (The number of training iterations does not count ;-).
Check the documentation of the implementation of DQN used here:
[Stable_Baselines3 DQN documentation](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#module-stable_baselines3.dqn)

**Briefly explain your variations and your intuition about how they could affect the results.**

In [None]:
# If you don't know where to start, you can think about modifying these parameters when defining the DQN model
#exploration_fraction (float) – fraction of entire training period over which the exploration rate is reduced
#exploration_initial_eps (float) – initial value of random action probability
#exploration_final_eps (float) – final value of random action probability
#print(model.exploration_final_eps, model.exploration_initial_eps, model.exploration_fraction)
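
To build intuition for the three exploration parameters listed above, the sketch below reproduces a linear decay of the random-action probability in plain Python. `linear_epsilon` is a hypothetical helper written for this illustration, and the default values (1.0, 0.05, 0.1) are my understanding of the library defaults, so treat them as assumptions and verify them with the `print` statement above.

```python
def linear_epsilon(progress, initial_eps=1.0, final_eps=0.05, exploration_fraction=0.1):
    """Random-action probability after a fraction `progress` (0..1) of training."""
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + (final_eps - initial_eps) * progress / exploration_fraction


total_timesteps = 100_000
for step in (0, 5_000, 10_000, 50_000):
    eps = linear_epsilon(step / total_timesteps)
    print(f"step {step:>6}: epsilon = {eps:.3f}")
```

With these values the agent stops decaying epsilon after the first 10% of training. Raising `exploration_fraction` keeps the agent acting randomly for longer, which may help it discover rewards early on at the cost of slower exploitation.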

ANSWER: [YOUR-ANSWER-HERE] (max 5 lines)

## **Question 2**:
Save and visualize the video for the two trained models, and log/save the main statistics you think are useful (you most likely already have everything you need in the TensorBoard log folder and in the information printed in the cell output during training).

**Which model do you think is better? Which metric/s or information are you using to decide this?** *Even if neither of your models is good, think about which information you would need to compare two solutions to this problem.*
(More ideas for questions you can try to answer in the discussion: Do you think the videos are enough to evaluate the method? Which parameters or statistics from the logged information might you find useful?)

**Do you think it is better/necessary to run the model with a different configuration during final evaluation than during training? Can you investigate which one?** You can write additional code if you need it for your evaluation tests.
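
When comparing the two versions, a single video or a single mean can be misleading because episode returns vary a lot from run to run. The sketch below shows one possible way to summarize per-episode returns. `summarize_returns` is a hypothetical helper, and the two lists are placeholder numbers for illustration only, not results from this lab. Per-episode returns could be collected, for example, with `evaluate_policy(..., return_episode_rewards=True)`.

```python
import math
import statistics


def summarize_returns(episode_returns):
    """Mean, sample standard deviation, and standard error of per-episode returns."""
    n = len(episode_returns)
    mean = statistics.mean(episode_returns)
    std = statistics.stdev(episode_returns) if n > 1 else 0.0
    return {"n": n, "mean": mean, "std": std, "stderr": std / math.sqrt(n)}


# Placeholder returns for illustration only; replace with values from your own evaluation runs.
returns_v1 = [2.0, 4.0, 3.0, 5.0, 1.0]
returns_v2 = [6.0, 2.0, 7.0, 3.0, 7.0]

for name, returns in (("VERSION 1", returns_v1), ("VERSION 2", returns_v2)):
    s = summarize_returns(returns)
    print(f"{name}: mean={s['mean']:.2f} std={s['std']:.2f} "
          f"stderr={s['stderr']:.2f} (n={s['n']})")
```

If the difference between the two means is small compared with the standard errors, more evaluation episodes are probably needed before declaring one model better.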

ANSWER: [YOUR-ANSWER-HERE] (max 10 lines)