# **INF8953DE Reinforcement Learning (Fall 2021)**

## **Final Project – Analysis of Deep Q-Network for Playing Atari**

### **Authors:** Resende Silva, Luiz$^{1}$ & Talotsing, Gaëlle P. M.$^{2}$

\
$1$ - luiz.resende-silva@polymtl.ca
$2$ - gaelle-patricia-megouo.talotsing@polymtl.ca

## **Python modules, classes and functions**

### _MUST RUN_

The cells below install the necessary Python modules and download and import environment ROMs and experiment scripts.

### *Installing Python modules*

In [None]:
!python -m pip install --upgrade pip
!python -m pip install torch==1.10.0+cu111 torchaudio==0.10.0+cu111 torchvision==0.11.1+cu111
!python -m pip install albumentations==1.1.0 ale-py==0.7.3 cmake==3.12.0 datascience==0.17.0 folium==0.12.1.post1 gym==0.21.0 imageio-ffmpeg==0.4.5 matplotlib==3.2.2 numpy==1.19.5 pandas==1.1.5 pyglet==1.5.21 pyvirtualdisplay==2.2 opencv-python==4.5.4.60  pygame==2.1.0 wandb==0.12.7 yellowbrick==1.3.post1
!sudo apt-get install xvfb  # COMMENT-OUT IF RUNNING ON WINDOWS



### *Weights & Biases*

The experimental setup uses [Weights & Biases](https://wandb.ai/site) to log training information. The ``wandb`` module supports automation, real-time visualization of the training process and cleaner code.

To use Weights & Biases, the user must create an account, open the account settings, retrieve the personal API key (a 40-character string) and replace ``<key>`` with it. Users who do not want to use the ``wandb`` module should skip this cell and set ``'use_wandb_logging': False`` inside the dictionary ``config_training``. The ``AgentDQN`` object will still expose the attributes ``episodes_scores`` and ``episodes_losses``, which can be used as training quality metrics.
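When logging to Weights & Biases is disabled, the per-episode scores can still be smoothed offline to judge training progress. The snippet below is only a sketch: it assumes ``episodes_scores`` is a flat sequence of numeric episode returns, and ``moving_average`` is a hypothetical helper that is not part of the project scripts.

```python
from collections import deque


def moving_average(values, window=100):
    """Return the trailing mean over at most `window` most recent values."""
    recent = deque(maxlen=window)  # old entries fall out automatically
    averages = []
    for value in values:
        recent.append(float(value))
        averages.append(sum(recent) / len(recent))
    return averages


# Hypothetical usage once an agent has been trained:
# smoothed = moving_average(agent.episodes_scores, window=100)
```

Plotting the smoothed curve with ``matplotlib`` gives a rough substitute for the online dashboard.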

**NOTE FOR WINDOWS**: remove the "!" if running on Windows.

In [None]:
# LOGIN TO WEIGHTS & BIASES LOGGING MODULE
!wandb login <key>

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


### *Atari Environment ROMS*

Downloading and installing the Atari ROMS for the [Arcade Learning Environment (ALE)](https://github.com/mgbellemare/Arcade-Learning-Environment), as well as the [MinAtar](https://github.com/kenjyoung/MinAtar) testbed and the modified ``env/registration.py`` script for [OpenAI Gym](https://gym.openai.com/docs/) (allows creating MinAtar-Gym environment).

**NOTE FOR WINDOWS**: when running this Python notebook on Windows, the commands ``!wget`` and ``!unzip`` are not supported. The user should download and extract the files from the links below and change the directory paths used to import the Atari ROMS.
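Where ``wget`` and ``unzip`` are unavailable, the download-and-extract step can be approximated with the Python standard library. This is a minimal sketch, not part of the project scripts: ``fetch_and_unzip`` is a hypothetical helper name, and the ``ale-import-roms`` and ``pip install ./MinAtar`` steps still have to be run separately.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_unzip(url, target_dir="."):
    """Download a zip archive from `url` and extract it into `target_dir`."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Read the whole archive into memory; the archives used here are small.
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(target)
    return target


# Hypothetical usage with the archives referenced in the cell below:
# base = 'https://github.com/luiz-resende/ReinforcementLearning/raw/main/A03_DQN/'
# fetch_and_unzip(base + 'ROMS.zip')
# fetch_and_unzip(base + 'MinAtarModule.zip')
```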

In [None]:
# RETRIEVING ATARI ROMS, MINATAR MODULE AND GYM ENVIRONMENT REGISTRATION ALLOWING MINATAR
!wget https://github.com/luiz-resende/ReinforcementLearning/raw/main/A03_DQN/ROMS.zip
!wget https://github.com/luiz-resende/ReinforcementLearning/raw/main/A03_DQN/MinAtar_gym.zip
!wget https://github.com/luiz-resende/ReinforcementLearning/raw/main/A03_DQN/MinAtarModule.zip
!unzip './ROMS.zip'
!unzip './MinAtarModule.zip'
!ale-import-roms './ROMS'

!python -m pip install ./MinAtar


In [None]:
import os
import numpy as np
import urllib.request
import zipfile
import warnings


warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=UserWarning)


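# Locate site-packages through the numpy install path (.../site-packages/numpy/__init__.py)
# and extract the modified registration files into the installed gym package.
site_packages = os.path.dirname(os.path.dirname(os.path.abspath(np.__file__)))
with zipfile.ZipFile('MinAtar_gym.zip', 'r') as archive:
    archive.extractall(os.path.join(site_packages, 'gym', 'envs'))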

### *Agent, Model, Memory Buffer and Environment Wrapper scripts*

[Luiz Resende Silva](mailto:luiz.resende-silva@polymtl.ca) implemented the scripts for the different classes used in the experiments and keeps them on his [GitHub page](https://github.com/luiz-resende/ReinforcementLearning/tree/main/A03_DQN). This keeps the ``.ipynb`` file small and clean and allows edits to the code to be made in one central place. The scripts are downloaded and imported into this Python notebook using the code in the cells below.

In [None]:
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=UserWarning)

def import_github_script(URL: str, file_name: str) -> None:
    r"""
    Retrieve a Python script from a GitHub page and create an importable file with extension ``.py``.

    This function allows code modifications to be made on GitHub, keeping the notebook clean
    and making modifications readily available in any copy of the notebook.

    Parameters
    ----------
    URL : ``str``
        URL address containing the file.
    file_name : ``str``
        Name to be given to the downloaded file.

    Returns
    -------
    ``None``
    """
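    path = file_name + '.py'
    # Download only if the script is not already present locally.
    if not os.path.isfile(path):
        raw_bytes = urllib.request.urlopen(URL).read()
        raw_str = raw_bytes.decode('utf-8')
        # Mode 'x' creates the file and fails if it already exists.
        with open(path, 'x', encoding='utf-8') as text_file:
            text_file.write(raw_str)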


In [None]:
# DOWNLOADING SCRIPTS FROM GITHUB
import_github_script(r'https://raw.githubusercontent.com/luiz-resende/ReinforcementLearning/main/A03_DQN/utils/dqn_memory_buffer.py',
                     r'dqn_memory_buffer')
import_github_script(r'https://raw.githubusercontent.com/luiz-resende/ReinforcementLearning/main/A03_DQN/models/dqn_models_torch.py',
                     r'dqn_models_torch')
import_github_script(r'https://raw.githubusercontent.com/luiz-resende/ReinforcementLearning/main/A03_DQN/utils/dqn_wrappers_env.py',
                     r'dqn_wrappers_env')
import_github_script(r'https://raw.githubusercontent.com/luiz-resende/ReinforcementLearning/main/A03_DQN/agent/dqn_agent_atari.py',
                     r'dqn_agent_atari')


## **Training Pipeline**

The different experiments performed are shown and commented below.

In [None]:
import time
from dqn_agent_atari import AgentDQN  # IMPORTING EXPERIMENT AGENT SCRIPT


### 1 - Testing size of experience replay memory

All experiments in this section used the hyper-parameters shown in the dictionaries below.

In [None]:
config_env = {'game_id': 'Pong-v4',
              'is_minatar': False,
              'render_mode': 'rgb_array',
              'max_episode_steps': None,
              'no_op_reset_env': True,
              'no_op_max_env': 30,
              'skip_frames_env': True,
              'skip_frames_env_n': 4,
              'wrap_env': True,
              'clip_rewards': True,
              'episodic_life': True,
              'scale_frame': False,
              'stack_frames': True,
              'warp_frames': True,
              'warp_frames_greyscale': True,
              }

config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 2,
                'use_batch_norm': False,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

config_optim = {'optimizer': 'RMSprop',
                'loss_criterion': 'smooth_l1_loss',  # 'huber_loss'
                'gamma_disc': 0.99,
                'learn_rate': 0.00025,
                'grad_momentum': 0.95,
                'grad_momentum_square': 0.95,
                'min_sqr_grad': 0.01,
                'epsilon_max': 1.00,
                'epsilon_min': 0.10,
                'eps_decay_interval': 1000000,
                'exponential_decay': False,
                'target_network_update': 10000,
                }

config_training = {'seed': 895359,
                   'use_wandb_logging': True,
                   'experiment_project_name': 'RL_ProjectTest_-_DQN_Atari_Pong-v4',
                   'train_in_episodes': False,
                   'max_number_training_frames': 5000000,
                   }
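With ``'exponential_decay': False``, the exploration keys in ``config_optim`` read like the linear annealing schedule described by Mnih et al. (2015). The snippet below is a sketch under that assumption: ``linear_epsilon`` is a hypothetical helper, and how ``AgentDQN`` actually counts steps is defined in ``dqn_agent_atari.py``, which is not shown here.

```python
def linear_epsilon(step, eps_max=1.0, eps_min=0.1, decay_interval=1_000_000):
    """Linearly anneal epsilon from eps_max to eps_min over decay_interval steps."""
    # Fraction of the decay interval already elapsed, clamped to [0, 1].
    fraction = min(max(step, 0) / decay_interval, 1.0)
    return eps_max + fraction * (eps_min - eps_max)


# Exploration rate at the start, halfway through, and after the decay interval.
for step in (0, 500_000, 1_000_000, 5_000_000):
    print(step, round(linear_epsilon(step), 3))
```

Under this reading, exploration reaches its floor after the first fifth of a 5M-frame run and stays there.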


#### 1.1 - Experience replay memory capacity equal to $100$k samples.

In [None]:
config_memory = {'memory_capacity': 100000,
                 'sample_batch_size': 32,
                 'initial_memory': 50000,
                  }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_ReplayMemory_size=100k',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames and 100k memory capacity.')
                  )
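The keys in ``config_memory`` suggest a fixed-capacity buffer that overwrites its oldest transitions once full and is sampled uniformly in mini-batches. The class below is a minimal sketch of that idea only; it is not the implementation in ``dqn_memory_buffer.py``, and the name ``ReplayMemory`` and its methods are hypothetical.

```python
import random


class ReplayMemory:
    """Fixed-capacity ring buffer with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index of the next slot to write
        self.rng = random.Random(seed)

    def push(self, transition):
        # Grow until full, then overwrite the oldest entry.
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement within one batch.
        return self.rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)


memory = ReplayMemory(capacity=3, seed=0)
for transition in range(5):
    memory.push(transition)
print(len(memory), sorted(memory.storage))
```

A larger capacity keeps older experience available for sampling at the cost of more memory, which is the trade-off the runs in this section compare.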


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [7:23:53<00:00, 187.73 frames/s, AvgRewardEps=16.7, AvgSteps=1.98e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.18 +/- 4.048


#### 1.2 - Experience replay memory capacity equal to $250$k samples.

In [None]:
config_memory = {'memory_capacity': 250000,
                 'sample_batch_size': 32,
                 'initial_memory': 50000,
                  }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_ReplayMemory_size=250k',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames and 250k memory capacity.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [7:23:28<00:00, 187.91 frames/s, AvgRewardEps=16, AvgSteps=1.98e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.74 +/- 1.82


#### 1.3 - Experience replay memory capacity equal to $500$k samples.

In [None]:
config_memory = {'memory_capacity': 500000,
                 'sample_batch_size': 32,
                 'initial_memory': 50000,
                  }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_ReplayMemory_size=500k',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames and 500k memory capacity.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [9:33:06<00:00, 145.40 frames/s, AvgRewardEps=16.2, AvgSteps=1983.33, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 21.0 +/- 0.0


#### 1.4 - Experience replay memory capacity equal to $1$M samples.

In [None]:
config_memory = {'memory_capacity': 1000000,
                 'sample_batch_size': 32,
                 'initial_memory': 50000,
                  }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_ReplayMemory_size=1M',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames and 1M memory capacity.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [41:42:50<00:00, 33.30 frames/s, AvgRewardEps=15.3, AvgSteps=2.02e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.52 +/- 0.608


### 2 - Testing policy model update frequency

All experiments in this section used the hyper-parameters shown in the dictionaries below.

In [None]:
config_env = {'game_id': 'Pong-v4',
              'is_minatar': False,
              'render_mode': 'rgb_array',
              'max_episode_steps': None,
              'no_op_reset_env': True,
              'no_op_max_env': 30,
              'skip_frames_env': True,
              'skip_frames_env_n': 4,
              'wrap_env': True,
              'clip_rewards': True,
              'episodic_life': True,
              'scale_frame': False,
              'stack_frames': True,
              'warp_frames': True,
              'warp_frames_greyscale': True,
              }

config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 2,
                'use_batch_norm': False,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

config_optim = {'optimizer': 'RMSprop',
                'loss_criterion': 'smooth_l1_loss',  # 'huber_loss'
                'gamma_disc': 0.99,
                'learn_rate': 0.00025,
                'grad_momentum': 0.95,
                'grad_momentum_square': 0.95,
                'min_sqr_grad': 0.01,
                'epsilon_max': 1.00,
                'epsilon_min': 0.10,
                'eps_decay_interval': 1000000,
                'exponential_decay': False,
                'target_network_update': 10000,
                }

config_memory = {'memory_capacity': 500000,
                 'sample_batch_size': 32,
                 'initial_memory': 50000,
                  }

config_training = {'seed': 895359,
                   'use_wandb_logging': True,
                   'experiment_project_name': 'RL_ProjectTest_-_DQN_Atari_Pong-v4',
                   'train_in_episodes': False,
                   'max_number_training_frames': 5000000,
                   }
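The update frequency $k$ mainly changes how many gradient steps are taken for a fixed frame budget. The estimate below is a rough sketch: it assumes one gradient update every $k$ agent steps once ``initial_memory`` transitions have been collected, and that ``max_number_training_frames`` counts agent steps. Neither assumption is confirmed by the code shown in this notebook, and ``count_gradient_updates`` is a hypothetical helper.

```python
def count_gradient_updates(total_steps, update_frequency, warmup_steps=0):
    """Count steps in (warmup_steps, total_steps] that are multiples of update_frequency."""
    if update_frequency < 1:
        raise ValueError("update_frequency must be a positive integer")
    warmup_steps = min(warmup_steps, total_steps)
    return total_steps // update_frequency - warmup_steps // update_frequency


# Values of k tested in this section, with the frame budget and warm-up from the configs.
for k in (1, 4, 5, 10):
    print(k, count_gradient_updates(5_000_000, k, warmup_steps=50_000))
```

Under these assumptions, $k=1$ performs roughly ten times as many updates as $k=10$, so a much longer wall-clock training time for small $k$ is to be expected.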


#### 2.1 - Update policy model after every $k=1$ steps.

In [None]:
agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_UpdateFrequency_k=1',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 1 frame and 500k memory capacity.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=1,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [32:12:39<00:00, 43.12 frames/s, AvgRewardEps=14.7, AvgSteps=1.97e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.68 +/- 0.466


#### 2.2 - Update policy model after every $k=4$ steps (baseline).

In [None]:
agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_UpdateFrequency_k=4_(baseline)',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames and 500k memory capacity.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [9:33:06<00:00, 145.40 frames/s, AvgRewardEps=16.2, AvgSteps=1983.33, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 21.0 +/- 0.0


#### 2.3 - Update policy model after every $k=5$ steps.

In [None]:
agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_UpdateFrequency_k=5',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 5 frames and 500k memory capacity.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=5,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [6:53:07<00:00, 201.71 frames/s, AvgRewardEps=16.1, AvgSteps=1.98e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()  # Evaluation time 3.9 min


[Final average score] : 20.96 +/- 0.28


#### 2.4 - Update policy model after every $k=10$ steps.

In [None]:
agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_UpdateFrequency_k=10',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 10 frames and 500k memory capacity.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=10,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [5:37:36<00:00, 246.83 frames/s, AvgRewardEps=16.3, AvgSteps=2.04e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.82 +/- 0.384


### 3 - Testing different network architectures

In [None]:
config_env = {'game_id': 'Pong-v4',
              'is_minatar': False,
              'render_mode': 'rgb_array',
              'max_episode_steps': None,
              'no_op_reset_env': True,
              'no_op_max_env': 30,
              'skip_frames_env': True,
              'skip_frames_env_n': 4,
              'wrap_env': True,
              'clip_rewards': True,
              'episodic_life': True,
              'scale_frame': False,
              'stack_frames': True,
              'warp_frames': True,
              'warp_frames_greyscale': True,
              }

config_optim = {'optimizer': 'RMSprop',
                'loss_criterion': 'smooth_l1_loss',  # 'huber_loss'
                'gamma_disc': 0.99,
                'learn_rate': 0.00025,
                'grad_momentum': 0.95,
                'grad_momentum_square': 0.95,
                'min_sqr_grad': 0.01,
                'epsilon_max': 1.00,
                'epsilon_min': 0.10,
                'eps_decay_interval': 1000000,
                'exponential_decay': False,
                'target_network_update': 10000,
                }

config_memory = {'memory_capacity': 500000,
                 'sample_batch_size': 32,
                 'initial_memory': 50000,
                 }

config_training = {'seed': 895359,
                   'use_wandb_logging': True,
                   'experiment_project_name': 'RL_ProjectTest_-_DQN_Atari_Pong-v4',
                   'train_in_episodes': False,
                   'max_number_training_frames': 5000000,
                   }


#### 3.1 - Using network architecture from Mnih et al. (2015) with only 1 convolutional layer.

In [None]:
config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 3,
                'use_batch_norm': False,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_Architecture_3_minimal',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 3.' +
                                        ' Using model update frequency of every 4 frames, 500k memory capacity' +
                                        ' and a network architecture with a single conv. layer.')
                  )
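The values of ``shape_input``, ``kernel``, ``stride`` and ``padding`` in ``config_model`` determine the spatial size of the first convolutional feature map through the standard formula $\lfloor (n + 2p - k)/s \rfloor + 1$. The snippet below is a sketch of that arithmetic; ``conv_output_size`` is a hypothetical helper, and the three-layer stack is the one reported by Mnih et al. (2015), not necessarily the layer list built by ``dqn_models_torch.py``.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# First layer as configured: 84x84 input, 8x8 kernel, stride 4, no padding.
first_layer = conv_output_size(84, kernel=8, stride=4, padding=0)

# (kernel, stride) pairs of the three conv layers in Mnih et al. (2015).
size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)

# The last conv layer in that paper has 64 filters.
flattened_features = 64 * size * size
print(first_layer, size, flattened_features)
```

With a single convolutional layer, the flattened input to the linear layer is far larger than with the full stack, because no further strided convolutions reduce the feature map.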


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [8:37:57<00:00, 160.89 frames/s, AvgRewardEps=-13, AvgSteps=1.43e+3, RewardMax=-2]


In [None]:
agent.evaluate_agent()


[Final average score] : -3.86 +/- 3.633


#### 3.2 - Using network architecture from Mnih et al. (2013) with parameters from Mnih et al. (2015).

In [None]:
config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 1,
                'use_batch_norm': False,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_Architecture_1_2013',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 1.' +
                                        ' Using model update frequency of every 4 frames and 500k memory capacity' +
                                        ' and a network architecture with two conv. layers (Mnih et al., 2013).')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [7:16:08<00:00, 191.07 frames/s, AvgRewardEps=16.8, AvgSteps=2.08e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.44 +/- 0.92


#### 3.3 - Using network architecture from Mnih et al. (2015) with an additional convolutional layer.

In [None]:
config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 4,
                'use_batch_norm': False,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_Architecture_4_larger',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 4.' +
                                        ' Using model update frequency of every 4 frames and 500k memory capacity' +
                                        ' and a network architecture with four conv. layers.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [8:14:02<00:00, 168.68 frames/s, AvgRewardEps=15.3, AvgSteps=1.95e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.3 +/- 2.934


#### 3.4 Using network architecture from Mnih et al. (2013) with parameters from Mnih et al. (2015) and applying batch normalization.

In [None]:
config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 1,
                'use_batch_norm': True,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_Architecture_1_2013_batch_norm',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 1.' +
                                        ' Using model update frequency of every 4 frames and 500k memory capacity' +
                                        ' and a network architecture with two conv. layers (Mnih et al., 2013) and batch norm.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [7:50:12<00:00, 177.23 frames/s, AvgRewardEps=17.7, AvgSteps=1.96e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 18.6 +/- 3.194


#### 3.5 Using network architecture from Mnih et al. (2015) and applying batch normalization.

In [None]:
config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 2,
                'use_batch_norm': True,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_Architecture_2_2015_batch_norm_2',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames and 500k memory capacity' +
                                        ' and a network architecture with three conv. layers (Mnih et al., 2015) and batch norm.')
                  )
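
The `shape_input`, `kernel`, `stride` and `padding` entries of `config_model` determine the spatial size of the first feature map. The sketch below checks that arithmetic by hand. Only the first layer is taken from the configuration above; the second and third layers (64 filters of 4x4 with stride 2, then 64 filters of 3x3 with stride 1) are assumptions taken from Mnih et al. (2015), since the layer definitions of `AgentDQN` are not shown in this section. The helper `conv_output_size` is a hypothetical name introduced for illustration.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of one convolution along a single dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Layer 1: values from config_model above.
# Layers 2 and 3: ASSUMED from Mnih et al. (2015); not shown in this notebook section.
layers = [
    {"out_channels": 32, "kernel": 8, "stride": 4, "padding": 0},
    {"out_channels": 64, "kernel": 4, "stride": 2, "padding": 0},
    {"out_channels": 64, "kernel": 3, "stride": 1, "padding": 0},
]

size = 84  # shape_input is square (84, 84), so one dimension is enough
sizes = []
for layer in layers:
    size = conv_output_size(size, layer["kernel"], layer["stride"], layer["padding"])
    sizes.append(size)

# Number of features entering the first fully connected layer.
flattened_features = sizes[-1] ** 2 * layers[-1]["out_channels"]
print(sizes, flattened_features)
```

Under these assumptions the feature maps shrink from 84 to 20, 9 and 7 pixels per side, giving 7 x 7 x 64 = 3136 inputs to the linear layer of `out_features_linear` = 512 units.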


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [10:54:36<00:00, 127.30 frames/s, AvgRewardEps=17.1, AvgSteps=1.89e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()  # Evaluation time is approximately 5.7 min


[Final average score] : 20.8 +/- 1.4


#### 3.6 Using network architecture from Mnih et al. (2015) with an additional convolutional layer and applying batch normalization.

In [None]:
config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 4,
                'use_batch_norm': True,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_Architecture_4_larger_batch_norm',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 4.' +
                                        ' Using model update frequency of every 4 frames and 500k memory capacity' +
                                        ' and a network architecture with four conv. layers and batch norm.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] :  82%|████████▏ | 4078304/5000000 [10:56:29<3:23:00, 75.67 frames/s, AvgRewardEps=16.6, AvgSteps=1.93e+3, RewardMax=21] 

In [None]:
agent.evaluate_agent()


#### 3.7 Using network architecture from Mnih et al. (2015) with only 1 convolutional layer and batch normalization.

In [None]:
config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 3,
                'use_batch_norm': True,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_Architecture_3_minimal_batch_norm',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 3.' +
                                        ' Using model update frequency of every 4 frames, 500k memory capacity' +
                                        ' and a network architecture with a single conv. layer and batch norm.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [7:30:23<00:00, 185.02 frames/s, AvgRewardEps=17.3, AvgSteps=1.93e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()  # Evaluation time is approximately 4 min


[Final average score] : 20.9 +/- 0.7
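
To compare the architecture runs side by side, the evaluation scores and training throughputs printed above can be collected in a small table. The sketch below is only a convenience and is not part of the experiment code. The numbers are copied from the cell outputs in this section, keyed by `experiment_run_name`; the value after `+/-` is stored as `spread` because the outputs do not state what statistic it is. The run `Test_Architecture_4_larger_batch_norm` is left out because no evaluation score is printed for it.

```python
# Values copied from the outputs above: (mean score, spread, training frames/s).
results = {
    "Test_Architecture_1_2013": (20.44, 0.92, 191.07),
    "Test_Architecture_4_larger": (20.3, 2.934, 168.68),
    "Test_Architecture_1_2013_batch_norm": (18.6, 3.194, 177.23),
    "Test_Architecture_2_2015_batch_norm_2": (20.8, 1.4, 127.30),
    "Test_Architecture_3_minimal_batch_norm": (20.9, 0.7, 185.02),
}

# Sort runs from highest to lowest mean evaluation score.
ranking = sorted(results.items(), key=lambda item: item[1][0], reverse=True)

print(f"{'run':<42}{'score':>8}{'spread':>8}{'frames/s':>10}")
for name, (score, spread, fps) in ranking:
    print(f"{name:<42}{score:>8.2f}{spread:>8.3f}{fps:>10.2f}")

best_run = ranking[0][0]
```

All listed runs use a single seed, so differences smaller than the reported spread should be read with caution.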


### 4 - Testing different mini-batch sizes

In [None]:
config_env = {'game_id': 'Pong-v4',
              'is_minatar': False,
              'render_mode': 'rgb_array',
              'max_episode_steps': None,
              'no_op_reset_env': True,
              'no_op_max_env': 30,
              'skip_frames_env': True,
              'skip_frames_env_n': 4,
              'wrap_env': True,
              'clip_rewards': True,
              'episodic_life': True,
              'scale_frame': False,
              'stack_frames': True,
              'warp_frames': True,
              'warp_frames_greyscale': True,
              }

config_optim = {'optimizer': 'RMSprop',
                'loss_criterion': 'smooth_l1_loss',  # 'huber_loss'
                'gamma_disc': 0.99,
                'learn_rate': 0.00025,
                'grad_momentum': 0.95,
                'grad_momentum_square': 0.95,
                'min_sqr_grad': 0.01,
                'epsilon_max': 1.00,
                'epsilon_min': 0.10,
                'eps_decay_interval': 1000000,
                'exponential_decay': False,
                'target_network_update': 10000,
                }

config_model = {'in_channels': 4,
                'out_channel': 32,
                'shape_input': (84, 84),
                'kernel': (8, 8),
                'stride': (4, 4),
                'padding': (0, 0),
                'out_features_linear': 512,
                'agent_architecture': 2,
                'use_batch_norm': False,
                'scale_batch_input': 255.0,
                'device': 'gpu',
                }

config_training = {'seed': 895359,
                   'use_wandb_logging': True,
                   'experiment_project_name': 'RL_ProjectTest_-_DQN_Atari_Pong-v4',
                   'train_in_episodes': False,
                   'max_number_training_frames': 5000000,
                   }
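
With `exponential_decay` set to `False`, the entries `epsilon_max`, `epsilon_min` and `eps_decay_interval` in `config_optim` suggest a linear annealing of the exploration rate, as in Mnih et al. (2015). The sketch below shows what such a schedule looks like with the values above. It is an illustration under assumptions: the function `linear_epsilon` is hypothetical, and whether `AgentDQN` counts the interval in frames or agent steps, or starts it only after the initial memory is filled, is not shown in this section.

```python
def linear_epsilon(step, epsilon_max=1.00, epsilon_min=0.10, decay_interval=1_000_000):
    """Linearly anneal epsilon from epsilon_max to epsilon_min, then hold it constant."""
    # Fraction of the decay interval already elapsed, clamped to [0, 1].
    fraction = min(max(step, 0) / decay_interval, 1.0)
    return epsilon_max + fraction * (epsilon_min - epsilon_max)


# Exploration rate at a few points of a 5M-step run (ASSUMED step unit).
for step in (0, 250_000, 500_000, 1_000_000, 5_000_000):
    print(f"step {step:>9,d}: epsilon = {linear_epsilon(step):.3f}")
```

Under this schedule the agent would act greedily 90% of the time for the last 4M of the 5M training steps.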


#### 4.1 Using mini-batches of 16 samples.

In [None]:
config_memory = {'memory_capacity': 500000,
                 'sample_batch_size': 16,
                 'initial_memory': 50000,
                 }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_MiniBatchSize_b=16',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames, 500k memory capacity' +
                                        ' and a mini-batch of 16 samples.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [6:36:45<00:00, 210.04 frames/s, AvgRewardEps=10.8, AvgSteps=2.11e+3, RewardMax=20]


In [None]:
agent.evaluate_agent()


[Final average score] : 17.64 +/- 3.497


#### 4.2 Using mini-batches of 64 samples.

In [None]:
config_memory = {'memory_capacity': 500000,
                 'sample_batch_size': 64,
                 'initial_memory': 50000,
                 }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_MiniBatchSize_b=64',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames, 500k memory capacity' +
                                        ' and a mini-batch of 64 samples.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [9:19:07<00:00, 149.04 frames/s, AvgRewardEps=17.6, AvgSteps=1.98e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.28 +/- 0.801


#### 4.3 Using mini-batches of 128 samples.

In [None]:
config_memory = {'memory_capacity': 500000,
                 'sample_batch_size': 128,
                 'initial_memory': 50000,
                 }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_MiniBatchSize_b=128',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames, 500k memory capacity' +
                                        ' and a mini-batch of 128 samples.')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [23:42:32<00:00, 58.58 frames/s, AvgRewardEps=16.8, AvgSteps=1.99e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.68 +/- 0.786


#### 4.4 Using resizable mini-batches.

In [None]:
config_memory = {'memory_capacity': 500000,
                 'sample_batch_size': [16, 32, 64, 128],
                 'initial_memory': 50000,
                 }

agent = AgentDQN(configuration_environment=config_env,
                  configuration_dqn_models=config_model,
                  configuration_optimization=config_optim,
                  configuration_memory_replay=config_memory,
                  seed=config_training['seed'],
                  use_wandb_logging=config_training['use_wandb_logging'],
                  experiment_project_name=config_training['experiment_project_name'],
                  experiment_run_name='Test_MiniBatchSize_b=[16, 32, 64, 128]',
                  experiment_run_notes=('Training run with a maximum of 5M frames and architecture 2.' +
                                        ' Using model update frequency of every 4 frames, 500k memory capacity' +
                                        ' and a resizable mini-batch in [16, 32, 64, 128].')
                  )


In [None]:
scores = agent.train_agent(train_in_episodes=config_training['train_in_episodes'],
                           max_number_training_frames=config_training['max_number_training_frames'],
                           update_frequency_model=4,
                           render=False,
                           render_mode='rgb_array',
                           save_tensors_in_memory_buffer=True,
                           load_agent_state=False,
                           load_agent_info=None,
                           save_iterruption=True
                           )


[Training AgentDQN] : 100%|██████████| 5000000/5000000 [12:35:03<00:00, 110.37 frames/s, AvgRewardEps=17.2, AvgSteps=2.03e+3, RewardMax=21]


In [None]:
agent.evaluate_agent()


[Final average score] : 20.64 +/- 2.52


In [None]:
agent.save_agent_state(base_file_name='', postfix='batch_size_resizeable')
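
The mini-batch runs above trade training throughput against final score. The sketch below gathers the numbers printed in sections 4.1 to 4.4 and converts the reported frames per second into wall-clock hours for the 5M-frame budget. It is a convenience for reading the outputs and not part of the experiment code; the dictionary keys and the name `spread` (the value after `+/-`, whose statistic is not stated in the outputs) are chosen for illustration.

```python
TOTAL_FRAMES = 5_000_000  # max_number_training_frames in config_training

# Values copied from the outputs above.
runs = {
    "16": {"score": 17.64, "spread": 3.497, "fps": 210.04},
    "64": {"score": 20.28, "spread": 0.801, "fps": 149.04},
    "128": {"score": 20.68, "spread": 0.786, "fps": 58.58},
    "[16, 32, 64, 128]": {"score": 20.64, "spread": 2.52, "fps": 110.37},
}

# Wall-clock training time implied by the average throughput.
hours = {name: TOTAL_FRAMES / run["fps"] / 3600 for name, run in runs.items()}

print(f"{'batch size':<20}{'score':>8}{'spread':>8}{'frames/s':>10}{'hours':>8}")
for name, run in runs.items():
    print(f"{name:<20}{run['score']:>8.2f}{run['spread']:>8.3f}"
          f"{run['fps']:>10.2f}{hours[name]:>8.2f}")

fastest = min(hours, key=hours.get)
```

The computed hours agree with the elapsed times shown in the progress bars, for example about 6.6 h for a batch size of 16 and about 23.7 h for a batch size of 128.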
