# DQN on Images

This notebook implements **DQN on Images** for the [Breakout](https://gym.openai.com/envs/Breakout-v0/) game. The Keras-RL2 abstraction library is used.

For general theory and intuition, see `../ReinforcementLearning_Guide.md`.

Overview:
1. Imports
2. Environment Setup
3. Image Processing
4. Network Model
5. Agent
6. Training & Storing

## 1. Imports

In [2]:
# Image processing
from PIL import Image
import numpy as np
import gym

# CNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from tensorflow.keras.optimizers import Adam

# Keras-RL2
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint # for tracking results

## 2. Environment Setup

The Atari games come in two versions: the RAM version provides the console memory as observation, which contains relevant information such as some object positions; the image version provides only screen frames, so a CNN is needed to infer the object poses.

Note that the Atari environments were removed from the OpenAI Gym GitHub repository and are now maintained in the Arcade-Learning-Environment repository: [Issue 2407](https://github.com/openai/gym/issues/2407). We can still install them with `pip install "gym[atari,accept-rom-license]"`, but the links to the source code on the OpenAI environments website are broken.

Further important links:

- Atari environments were removed from OpenAI Gym and moved to the Arcade-Learning-Environment repo: [https://github.com/openai/gym/issues/2407](https://github.com/openai/gym/issues/2407)

- Difference between v0, v4 & Deterministic: [https://github.com/openai/gym/issues/1280](https://github.com/openai/gym/issues/1280)

- Atari environment source code: [https://github.com/mgbellemare/Arcade-Learning-Environment/blob/master/src/gym/envs/atari/environment.py](https://github.com/mgbellemare/Arcade-Learning-Environment/blob/master/src/gym/envs/atari/environment.py)

We choose the environment `BreakoutDeterministic-v4` following the comments from the link above: [Issue 1280](https://github.com/openai/gym/issues/1280)

In [28]:
env_name = "BreakoutDeterministic-v4"
env = gym.make(env_name)
nb_actions = env.action_space.n

In [5]:
nb_actions

4

## 3. Image Processing

In [6]:
IMG_SHAPE = (84, 84)
WINDOW_LENGTH = 4

In [8]:
class ImageProcessor(Processor):
    def process_observation(self, observation):
        # numpy -> PIL
        img = Image.fromarray(observation)
        # scale / resize
        img = img.resize(IMG_SHAPE)
        # grayscale (luminance)
        img = img.convert("L")
        # PIL -> numpy
        img = np.array(img)
        # save storage (optional)
        return img.astype('uint8')
    def process_state_batch(self, batch):
        # scale grayvalues to [0,1]
        processed_batch = batch.astype('float32') / 255.
        return processed_batch
    def process_reward(self, reward):
        # clip reward to [-1,1]
        return np.clip(reward, -1.0, 1.0)

## 4. Network Model

In [12]:
# We pass images in sequences of 4 frames!
input_shape = (WINDOW_LENGTH, IMG_SHAPE[0], IMG_SHAPE[1])
input_shape

(4, 84, 84)

Note that our states are `(4, 84, 84)` arrays (frame window first), whereas, according to the documentation of `Convolution2D`, the convolutional network expects channels-last input by default, i.e., `(BatchSize, 84, 84, 4)`. We use a `Permute` layer to reorder the axes accordingly.

- [Keras Permute](https://keras.io/api/layers/reshaping_layers/permute/)
- [Keras Convolution2d](https://keras.io/api/layers/convolution_layers/convolution2d/)
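
The axis reordering can be checked without building a model. The following is a minimal sketch in plain Python; `permuted_shape` is a hypothetical helper (not part of Keras) that assumes the `Permute` convention of 1-indexed dimensions that exclude the batch axis.

```python
def permuted_shape(input_shape, dims):
    """Return the per-sample shape after a Permute-style axis reordering.

    `dims` is 1-indexed and excludes the batch axis (assumed to match
    the convention of the Keras `Permute` layer).
    """
    if sorted(dims) != list(range(1, len(input_shape) + 1)):
        raise ValueError("dims must be a permutation of 1..len(input_shape)")
    return tuple(input_shape[d - 1] for d in dims)


# Frame window first, as delivered by the replay memory
window_first = (4, 84, 84)
# Same pattern as Permute((2, 3, 1)): height, width, then the 4 frames as channels
channels_last = permuted_shape(window_first, (2, 3, 1))
print(channels_last)  # (84, 84, 4)
```

With this ordering, the 4 stacked frames play the role that color channels play in an ordinary image.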

In [15]:
model = Sequential()
# Change dimension places: (4,84,84) --> (84,84,4)
model.add(Permute((2, 3, 1), input_shape=input_shape))
# 32 filters, 8x8 kernel size
# Default kernel initialization is Glorot, but some publications report better results with He
model.add(Convolution2D(32, (8, 8), strides=(4, 4),kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
permute_1 (Permute)          (None, 84, 84, 4)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 20, 20, 32)        8224      
_________________________________________________________________
activation_5 (Activation)    (None, 20, 20, 32)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
activation_6 (Activation)    (None, 9, 9, 64)          0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 7, 7, 64)          36928     
_________________________________________________________________
activation_7 (Activation)    (None, 7, 7, 64)          0         
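
The output shapes and parameter counts of the three convolutional layers in the summary can be reproduced by hand. This is a small sketch in plain Python, assuming 'valid' padding (the Keras default) and square kernels; the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter, for a square kernel."""
    return kernel * kernel * in_channels * filters + filters


# (filters, kernel, stride) of the three convolutional layers above
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # input after Permute: 84 x 84 x 4
shapes, params = [], []
for filters, kernel, stride in layers:
    params.append(conv_params(kernel, channels, filters))
    size, channels = conv_output_size(size, kernel, stride), filters
    shapes.append((size, size, channels))

print(shapes)  # [(20, 20, 32), (9, 9, 64), (7, 7, 64)]
print(params)  # [8224, 32832, 36928]
```

The final `(7, 7, 64)` feature map flattens to 7 * 7 * 64 = 3136 values, which feed the dense layer with 512 units.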

## 5. Agent

In [16]:
# Replay buffer
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
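
A rough estimate shows why the processor keeps observations as `uint8`. The sketch below assumes that each 84x84 observation is stored once in the buffer, and it ignores actions, rewards and Python overhead; `buffer_bytes` is a hypothetical helper.

```python
FRAME_SHAPE = (84, 84)
MEMORY_LIMIT = 1_000_000  # same limit as the replay buffer above


def buffer_bytes(n_frames, frame_shape, bytes_per_pixel):
    """Bytes needed to hold n_frames single-channel frames."""
    height, width = frame_shape
    return n_frames * height * width * bytes_per_pixel


uint8_gb = buffer_bytes(MEMORY_LIMIT, FRAME_SHAPE, 1) / 1e9
float32_gb = buffer_bytes(MEMORY_LIMIT, FRAME_SHAPE, 4) / 1e9
print(f"uint8:   {uint8_gb:.2f} GB")    # uint8:   7.06 GB
print(f"float32: {float32_gb:.2f} GB")  # float32: 28.22 GB
```

Under these assumptions a full buffer needs about 7 GB of RAM with `uint8` frames, and four times as much if the frames were stored as `float32`; a smaller `limit` may be necessary on machines with less memory.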

In [17]:
# Image processor
processor = ImageProcessor()

In [24]:
# Policy: Linear decay
# Random (exploration) or best (exploitation) action chosen
# depending on epsilon in [value_min, value_max], decreased by steps.
# value_test: evaluation can be performed at a fixed epsilon (should be small: exploitation)
# nb_steps: we match our sequential memory size
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr='eps',
                              value_max=1.0,
                              value_min=0.1,
                              value_test=.05,
                              nb_steps=1000000)
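
The schedule described in the comments above can be sketched as a plain function. This is an illustration of linear annealing under the stated parameters, not the library code itself; `linear_epsilon` is a hypothetical name.

```python
def linear_epsilon(step, value_max=1.0, value_min=0.1, nb_steps=1_000_000):
    """Epsilon after `step` training steps under linear annealing.

    Epsilon falls linearly from value_max to value_min over nb_steps
    and stays at value_min afterwards.
    """
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


for step in (0, 250_000, 500_000, 1_000_000, 2_000_000):
    print(f"step {step:>9,}: epsilon = {linear_epsilon(step):.3f}")
```

Halfway through the schedule, epsilon is 0.55; after `nb_steps` it remains at `value_min`, so 10% of the training actions are still random.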

In [25]:
# DQN Agent
# We now pass all elements we have to the agent;
# nb_steps_warmup: our burn-in = how many steps of experience are collected before training updates begin
# target_model_update: every how many steps we update the weights of the frozen (target) model
# Optional: batch_size, gamma
dqn = DQNAgent(model=model,
               nb_actions=nb_actions,
               policy=policy,
               memory=memory,
               processor=processor,
               nb_steps_warmup=50000,
               gamma=.99,
               target_model_update=10000,
               train_interval=4,
               delta_clip=1)
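
To get a feeling for how `nb_steps_warmup`, `train_interval` and `target_model_update` interact, we can count events over a number of steps. The following is a simplified sketch based on assumptions about the agent's scheduling (a gradient update every `train_interval` steps once the warm-up is over, and a hard target update every `target_model_update` steps); the function names are hypothetical.

```python
def is_training_step(step, nb_steps_warmup=50_000, train_interval=4):
    """Assumed rule: train every train_interval steps after the warm-up."""
    return step > nb_steps_warmup and step % train_interval == 0


def is_target_update_step(step, target_model_update=10_000):
    """Assumed rule: copy weights to the target model every N steps."""
    return step % target_model_update == 0


total_steps = 100_000
n_gradient_updates = sum(is_training_step(s) for s in range(1, total_steps + 1))
n_target_updates = sum(is_target_update_step(s) for s in range(1, total_steps + 1))
print(n_gradient_updates)  # 12500
print(n_target_updates)    # 10
```

Under these assumptions, the first 50,000 steps only fill the replay buffer, and afterwards there are 2,500 gradient updates between two consecutive target updates.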

In [26]:
# We need to pass the optimizer for the model and the metric(s)
# 'mae': Mean Absolute Error
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

## 6. Training & Storing 

The DQN agent needs to play and train unattended for a long time, so it makes sense to store model checkpoints along the way; this is achieved with a callback.

In [47]:
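# Store the weights: despite the .h5f ending, TensorFlow writes its checkpoint format here (HDF5 is used for .h5 endings);
# 2 files are stored (.h5f.data* and .h5f.index), but we refer to the .h5f ending only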
weights_filename = 'dqn_breakout_weights.h5f'
checkpoint_weights_filename = 'dqn_' + env_name + '_weights_{step}.h5f'
# Every interval steps, model weights saved
checkpoint_callback = ModelIntervalCheckpoint(checkpoint_weights_filename, interval=100000)
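
The `{step}` placeholder in the checkpoint filename is filled in with the current step count. The sketch below lists the filenames we would expect for a run of 500,000 steps, assuming that the callback formats the template with `str.format` and saves whenever the step count is a multiple of `interval`.

```python
env_name = "BreakoutDeterministic-v4"
template = "dqn_" + env_name + "_weights_{step}.h5f"
interval = 100_000
total_steps = 500_000

# One checkpoint per completed interval
expected_files = [
    template.format(step=step)
    for step in range(interval, total_steps + 1, interval)
]
for name in expected_files:
    print(name)
# First line printed: dqn_BreakoutDeterministic-v4_weights_100000.h5f
```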

We can also load the weights of a pre-trained network and continue training from them; however, in that case the `epsilon` schedule needs to be adjusted, because a freshly created policy would otherwise start again from its `value_max`!

In [48]:
weights_path = "C:/Users/Mikel/Dropbox/Learning/PythonLab/udemy_rl_ai/notebooks/08-Deep-Q-Learning-On-Images/weights/dqn_BreakoutDeterministic-v4_weights_900000.h5f"

In [49]:
# Example: load pre-trained model from course at step 900,000; epsilon: 0.3 -> 0.1
model.load_weights(weights_path)
# Update policy with new epsilon
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr='eps',
                              value_max=0.3,
                              value_min=.1,
                              value_test=.05,
                              nb_steps=100000)
# DQN Agent
dqn = DQNAgent(model=model,
               nb_actions=nb_actions,
               policy=policy,
               memory=memory,
               processor=processor,
               nb_steps_warmup=50000,
               gamma=.99,
               target_model_update=10000)
# Compile
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

In [None]:
# Train
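# log_interval: number of steps between the interval summaries printed during training
# (checkpoints are saved by checkpoint_callback every `interval` steps, as set above)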
dqn.fit(env, nb_steps=500000, callbacks=[checkpoint_callback], log_interval=10000, visualize=False)

Training for 500000 steps ...
Interval 1 (0 steps performed)
    1/10000 [..............................] - ETA: 28:56 - reward: 0.0000e+00



23 episodes - episode_reward: 7.130 [2.000, 14.000] - lives: 2.695

Interval 2 (10000 steps performed)
24 episodes - episode_reward: 7.042 [2.000, 13.000] - lives: 2.725

Interval 3 (20000 steps performed)
21 episodes - episode_reward: 8.476 [2.000, 14.000] - lives: 2.779

Interval 4 (30000 steps performed)
21 episodes - episode_reward: 9.333 [3.000, 21.000] - lives: 2.828

Interval 5 (40000 steps performed)
21 episodes - episode_reward: 8.143 [1.000, 16.000] - lives: 2.741

Interval 6 (50000 steps performed)
 1291/10000 [==>...........................] - ETA: 13:55 - reward: 0.0163