# Atari Breakout: Deep Q-Learning

Breakout is a classic game from old computers and consoles. Do you remember it from the introduction?

This example is computationally very demanding (the agent sees only visual data), so it needs a powerful graphics card and a long training time.
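
To get a feel for why learning from pixels is heavy, here is a rough size estimate. It is a sketch under assumptions: the native Atari frame is taken as 210x160 RGB, the preprocessed observation as 84x84 grayscale with 4 stacked frames (one byte per value), and the replay buffer is assumed to keep both the observation and the next observation. The buffer sizes are the ones chosen later in the training section; the helper names are made up for illustration.

```python
# Back-of-the-envelope sizes. The shapes below are assumptions about the preprocessing.
RAW_FRAME = (210, 160, 3)   # native Atari RGB frame
STACKED_OBS = (84, 84, 4)   # assumed: 84x84 grayscale, 4 frames stacked


def num_values(shape):
    """Number of values in an array of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def replay_buffer_gib(buffer_size, obs_shape, copies=2):
    """Approximate buffer memory in GiB, one byte per value.

    copies=2 assumes the observation and the next observation are both stored.
    """
    return buffer_size * num_values(obs_shape) * copies / 2**30


raw_values = num_values(RAW_FRAME)
obs_values = num_values(STACKED_OBS)
gpu_buffer_gib = replay_buffer_gib(200_000, STACKED_OBS)  # buffer size of the CUDA setup
cpu_buffer_gib = replay_buffer_gib(5_000, STACKED_OBS)    # buffer size of the CPU test setup

print(f"raw frame: {raw_values:,} values, stacked observation: {obs_values:,} values")
print(f"replay buffer: ~{gpu_buffer_gib:.1f} GiB (CUDA), ~{cpu_buffer_gib:.2f} GiB (CPU test)")
```

Even after downscaling, every single observation carries tens of thousands of values, which is why both the network and the replay buffer are so much heavier than in the earlier low-dimensional examples.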

## Training

In [None]:
!pip install stable-baselines3[extra,atari]
!apt-get install -y xvfb python3-opengl
!pip install gymnasium PyOpenGL pyvirtualdisplay
!pip install imageio

In [None]:
import os
import gymnasium as gym
import numpy as np
from datetime import datetime
import io
import imageio.v3 as iio
import matplotlib.pyplot as plt
import torch
from stable_baselines3 import PPO, DQN
# from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
from pathlib import Path
from base64 import b64encode
from IPython.display import HTML

display = Display(visible=0, size=(400, 300))
display.start()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# Ignore IMAGEIO FFMPEG_WRITER WARNING
warnings.filterwarnings("ignore", category=UserWarning, module="imageio_ffmpeg")

from google.colab import drive
drive.mount('/content/gdrive')

model_root = '/content/gdrive/My Drive/Colab Notebooks/Discover - Machine Learning/Models'

In [None]:
env = make_atari_env('BreakoutNoFrameskip-v4', n_envs=16)
env = VecFrameStack(env, n_stack=4)
env.metadata['render_fps'] = 30

In [None]:
if not torch.cuda.is_available():  # must be called; the bare function object is always truthy
  # CPU - TEST
  LEARNING_STARTS = 1_000
  BUFFER_SIZE = 5_000
  TOTAL_TIMESTEPS = 10_000
  REPEAT_TRAINING = 4
else:
  # CUDA
  LEARNING_STARTS = 20_000
  BUFFER_SIZE = 200_000
  TOTAL_TIMESTEPS = 500_000
  REPEAT_TRAINING = 20

BUFFER_SIZE

In [None]:
# FUNCTION FOR RENDERING
PLOT_SINGLE_ENV = True


def get_frames(env, model=None, max_steps=500):
    # either render random behavior or a policy given by the model, if provided
    if model is None:
      get_action = lambda obs: [
          env.action_space.sample() for _ in range(len(obs))]
    else:
      get_action = lambda obs: model.predict(obs, deterministic=True)[0]
    crop_img = lambda img: img[:210, :160].copy() if PLOT_SINGLE_ENV else img

    images = []
    obs = env.reset()
    img = env.render('rgb_array')
    images.append(crop_img(img))

    for i in range(max_steps):
        action = get_action(obs)
        obs, *_ = env.step(action)
        img = env.render('rgb_array')
        images.append(crop_img(img))

    return images


def show_frames(frames, width=None):
  # encode the frames as an in-memory MP4 and embed it as a base64 data URL
  mp4_image = iio.imwrite("<bytes>", frames, extension=".mp4")
  data_url = "data:video/mp4;base64," + b64encode(mp4_image).decode()
  width_attr = "" if width is None else f'width="{width}" '

  return HTML(f"""
  <video {width_attr}controls>
        <source src="{data_url}" type="video/mp4">
  </video>
  """)


def render_env(env, model=None, max_steps=500, width=None):
  frames = get_frames(env, model=model, max_steps=max_steps)
  # fix the dimensions for conversion to video
  h, w = frames[0].shape[:2]
  h = (h // 16) * 16
  w = (w // 16) * 16
  frames = [f[:h, :w, :] for f in frames]
  return show_frames(frames, width=width)

In [None]:
render_env(env)

In [None]:
model = DQN('CnnPolicy', env, verbose=0, learning_starts=LEARNING_STARTS,
            buffer_size=BUFFER_SIZE)

In [None]:
models = []

t0 = datetime.now()
print(f'{t0} -- Starting training.')
timesteps = 0

for i in range(REPEAT_TRAINING):
  model.learn(total_timesteps=TOTAL_TIMESTEPS)
  tt = datetime.now()
  timesteps += model.num_timesteps
  print(f'{tt} -- Finished iteration {i + 1} / {REPEAT_TRAINING}, elapsed {tt - t0},'
        f' total timesteps {timesteps:,}')
  # store it under a new file name, so I do not overwrite the existing ones
  model.save(f'{model_root}/new_breakout_model_{i}')

## What has the agent learnt?

In [None]:
env = make_atari_env('BreakoutNoFrameskip-v4', n_envs=16)
env = VecFrameStack(env, n_stack=4)
env.metadata['render_fps'] = 30

### Untrained agent (random)

In [None]:
render_env(env, width=400)

### Trained models

In [None]:
os.listdir(model_root)

In [None]:
model_name = 'breakout_model'

In [None]:
i = 0  # After 500k steps, about 15 minutes of training
model = DQN.load(f'{model_root}/{model_name}_{i}')

In [None]:
model

In [None]:
render_env(env, model=model, width=400)

In [None]:
i = 3  # After 2M steps, about 1 hour of training
model = DQN.load(f'{model_root}/{model_name}_{i}')
render_env(env, model=model, width=400)

In [None]:
i = 19  # After 10M steps, almost 5 hours of training
model = DQN.load(f'{model_root}/{model_name}_{i}')
render_env(env, model=model, width=400)

### Not as good as DeepMind's model!
- **Playing Atari with Deep Reinforcement Learning**
- https://arxiv.org/abs/1312.5602


At the same time, our setup is very basic. Ten years ago this was quite a milestone, and you had to be DeepMind / Google to train such models. Now basically everybody can do it, and we haven't even pulled out all the tricks.

It is possible to use different learning algorithms, tweak hyperparameters, train longer, or use multiple GPUs (distributed training is not simple, but it is common practice nowadays).
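
As a hedged illustration of the hyperparameter tweaks mentioned above, the sketch below collects a few `DQN` constructor arguments in one place. The argument names are Stable-Baselines3 `DQN` parameters; the values are untested assumptions to experiment with, not the settings used for the models in this notebook. `dqn_settings` and `build_dqn` are hypothetical helpers.

```python
# Candidate DQN settings to experiment with.
# The values are illustrative assumptions, not tuned results.
DQN_KWARGS = {
    "learning_rate": 1e-4,
    "batch_size": 32,
    "gamma": 0.99,
    "train_freq": 4,
    "gradient_steps": 1,
    "target_update_interval": 1_000,
    "exploration_fraction": 0.1,
    "exploration_final_eps": 0.01,
}


def dqn_settings(**overrides):
    """Hypothetical helper: the defaults above with selected values replaced."""
    return {**DQN_KWARGS, **overrides}


def build_dqn(env, buffer_size, learning_starts, **overrides):
    """Hypothetical helper: create a DQN model from the merged settings."""
    # Imported lazily so the settings can be inspected without Stable-Baselines3 installed.
    from stable_baselines3 import DQN

    return DQN("CnnPolicy", env, buffer_size=buffer_size,
               learning_starts=learning_starts, verbose=0,
               **dqn_settings(**overrides))
```

For example, `build_dqn(env, BUFFER_SIZE, LEARNING_STARTS, learning_rate=5e-5)` would create a model with a smaller learning rate while keeping the other values; whether that actually helps has to be checked by training and comparing.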

It also costs money: using a larger machine with a good GPU on Colab costs about $1 per hour.

Estimates of the compute cost of training large language models are about $1M per 1B parameters. GPT-3, the model family behind the basic ChatGPT, has 175B parameters, and GPT-4 is estimated to have about 10x as many. That's really expensive and only the biggest players can do it. Nvidia is hot.
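
The figures above can be put side by side with a little arithmetic. The sketch below uses only numbers quoted in this notebook (about $1 per Colab GPU hour, almost 5 hours for the 10M-step model, about $1M per 1B parameters, a 175B-parameter model and a 10x larger one). All of them are rough estimates, and the function names are made up for illustration.

```python
COLAB_USD_PER_HOUR = 1.0            # rough Colab GPU price quoted above
LLM_USD_PER_BILLION_PARAMS = 1e6    # rough training-cost estimate quoted above


def colab_cost_usd(hours):
    """Approximate cost of training on a paid Colab GPU machine."""
    return hours * COLAB_USD_PER_HOUR


def llm_cost_usd(billions_of_params):
    """Approximate compute cost of training a large language model."""
    return billions_of_params * LLM_USD_PER_BILLION_PARAMS


breakout_cost = colab_cost_usd(5)    # about 5 hours for the 10M-step model
cost_175b = llm_cost_usd(175)        # 175B-parameter model
cost_10x = llm_cost_usd(175 * 10)    # a model estimated to be 10x larger

print(f"Breakout agent:       ~${breakout_cost:,.0f}")
print(f"175B-parameter model: ~${cost_175b:,.0f}")
print(f"10x larger model:     ~${cost_10x:,.0f}")
```

Under these estimates, the Breakout agent costs a few dollars to train, while the large language models land in the range of hundreds of millions to billions of dollars.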