# DQN on Images with Keras-RL2 - Pong

This notebook implements **DQN on Images** for the [Pong](https://gym.openai.com/envs/Pong-v0/) game using the Keras-RL2 library. It builds on the previous notebook and introduces no major new concepts.

For general theory and intuition, see `../ReinforcementLearning_Guide.md`.

Overview:
1. Imports
2. Environment Setup
3. Image Processing
4. Network Model
5. Agent
6. Training & Storing
7. Test & Use

## 1. Imports

In [3]:
# Image processing
from PIL import Image
import numpy as np
import gym
import random
#from gym.utils import play

# CNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from tensorflow.keras.optimizers import Adam

# Keras-RL2
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint # for tracking results


## 2. Environment Setup

In [4]:
env_name = "Pong-v0"
env = gym.make(env_name)
nb_actions = env.action_space.n

In [5]:
nb_actions

6

In [6]:
env.unwrapped.get_action_meanings()

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

In [7]:
#play.play(env)

## 3. Image Processing

In [8]:
IMG_SHAPE = (84, 84)
WINDOW_LENGTH = 4

In [9]:
class ImageProcessor(Processor):
    def process_observation(self, observation):
        img = Image.fromarray(observation)
        img = img.resize(IMG_SHAPE)
        img = img.convert("L")
        img = np.array(img)
        return img.astype('uint8')
    def process_state_batch(self, batch):
        processed_batch = batch.astype('float32') / 255.
        return processed_batch
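
The processor returns frames as `uint8` and only converts to `float32` per batch in `process_state_batch`. A rough sketch of why this matters for the replay buffer, assuming one stored 84x84 frame per step, 1 byte per `uint8` pixel and 4 bytes per `float32` pixel, and ignoring any other overhead (the helper `frame_bytes` is hypothetical and only for illustration):

```python
# Back-of-the-envelope replay-memory size for stored frames.
IMG_SHAPE = (84, 84)

def frame_bytes(shape, bytes_per_pixel):
    """Bytes needed for one grayscale frame of the given shape."""
    height, width = shape
    return height * width * bytes_per_pixel

# Compare the two memory limits used in this notebook.
for limit in (100_000, 1_000_000):
    as_uint8 = limit * frame_bytes(IMG_SHAPE, 1) / 1e9
    as_float32 = limit * frame_bytes(IMG_SHAPE, 4) / 1e9
    print(f"limit={limit}: uint8 ~{as_uint8:.2f} GB, float32 ~{as_float32:.2f} GB")
```

Under these assumptions, storing `float32` frames would need four times the memory, which is why the normalization is deferred to batch time.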

## 4. Network Model

In [10]:
input_shape = (WINDOW_LENGTH, IMG_SHAPE[0], IMG_SHAPE[1])
# Model definition
model = Sequential()
model.add(Permute((2, 3, 1), input_shape=input_shape))
model.add(Convolution2D(32, (8, 8), strides=(4, 4), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
# Print model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
permute (Permute)            (None, 84, 84, 4)         0         
_________________________________________________________________
conv2d (Conv2D)              (None, 20, 20, 32)        8224      
_________________________________________________________________
activation (Activation)      (None, 20, 20, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
activation_1 (Activation)    (None, 9, 9, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 7, 7, 64)          36928     
_________________________________________________________________
activation_2 (Activation)    (None, 7, 7, 64)          0
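
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes unpadded ("valid") convolutions, which is the Keras default, and the helper names `conv_out` and `conv_params` are hypothetical. The summary above stops after the last convolution, so the `Flatten` and `Dense` figures at the end are computed here and are not taken from the notebook output.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1

def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square kernel."""
    return kernel * kernel * in_channels * filters + filters

size, channels = 84, 4  # 84x84 frames, WINDOW_LENGTH = 4 stacked frames
for kernel, stride, filters in [(8, 4, 32), (4, 2, 64), (3, 1, 64)]:
    params = conv_params(kernel, channels, filters)
    size, channels = conv_out(size, kernel, stride), filters
    print(f"conv {kernel}x{kernel}/{stride}: output {size}x{size}x{channels}, params {params}")

flat = size * size * channels      # width of the Flatten output
dense_params = flat * 512 + 512    # Dense(512): weights + biases
output_params = 512 * 6 + 6        # Dense(nb_actions) with nb_actions = 6
print(flat, dense_params, output_params)
```

Most of the parameters sit in the first dense layer, not in the convolutions.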

## 5. Agent

In [11]:
# Replay buffer
memory = SequentialMemory(limit=100000, window_length=WINDOW_LENGTH) # Ideally 10x larger (1000000) for a full training run

In [12]:
processor = ImageProcessor()

In [13]:
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr='eps',
                              value_max=1.,
                              value_min=.1,
                              value_test=.05,
                              nb_steps=100000) # Ideally 10x larger for a full training run
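
To make the annealing concrete, here is a minimal sketch of a linear epsilon schedule with the values used above. It reflects how `LinearAnnealedPolicy` is understood to behave during training, not the library's actual code; the function name `annealed_eps` is hypothetical. During testing, the fixed `value_test` is used instead.

```python
def annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=100_000):
    """Linearly decay epsilon from value_max to value_min over nb_steps, then hold."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)

# Epsilon at a few points of the training run
for step in (0, 25_000, 50_000, 100_000, 200_000):
    print(step, round(annealed_eps(step), 3))
```

The agent starts fully random and settles at 10% random actions once `nb_steps` is reached.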

In [14]:
dqn = DQNAgent(model=model,
               nb_actions=nb_actions,
               policy=policy,
               memory=memory,
               processor=processor,
               nb_steps_warmup=50000,
               gamma=0.99,
               target_model_update=10000,
               train_interval=4,
               delta_clip=1)
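
Two of these arguments are easy to misread. `delta_clip` is the clipping parameter of a Huber-style loss on the TD error, and `nb_steps_warmup` together with `train_interval` decides when gradient updates happen. The sketch below illustrates both under stated assumptions: `huber` is the textbook Huber loss, and `trains_at` encodes the assumed rule "update only after warm-up, every `train_interval` steps". Both function names are hypothetical and neither is the library's code.

```python
def huber(td_error, delta_clip=1.0):
    """Quadratic for small errors, linear beyond delta_clip."""
    abs_err = abs(td_error)
    if abs_err <= delta_clip:
        return 0.5 * td_error ** 2
    return delta_clip * (abs_err - 0.5 * delta_clip)

def trains_at(step, nb_steps_warmup=50_000, train_interval=4):
    """Assumed rule: update only after warm-up, every train_interval steps."""
    return step > nb_steps_warmup and step % train_interval == 0

print(huber(0.5), huber(3.0))               # small vs. large TD error
print(trains_at(10_000), trains_at(50_004))  # during vs. after warm-up
```

If the assumed rule holds, a run shorter than 50000 steps only fills the replay buffer and never updates the network.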

In [15]:
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

## 6. Training & Storing

In [16]:
# Weights file name: saving writes 2 files (*.h5f.data* and *.h5f.index), but we refer to them by the .h5f name only
weights_filename = 'dqn_' + env_name + '_weights.h5f'
checkpoint_weights_filename = 'dqn_' + env_name + '_weights_{step}.h5f'
# The model weights are saved every `interval` steps
checkpoint_callback = ModelIntervalCheckpoint(checkpoint_weights_filename, interval=10000) # Should be 10x larger
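
The `{step}` placeholder in the checkpoint file name is a standard Python format field. A small sketch of the names this template yields, assuming the callback fills in the current step count at each interval:

```python
env_name = "Pong-v0"
template = "dqn_" + env_name + "_weights_{step}.h5f"
interval = 10_000

# File names for the first three checkpoints
for step in range(interval, 3 * interval + 1, interval):
    print(template.format(step=step))
```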

In [17]:
# Train
# log_interval: how often (in steps) progress is logged
# nb_steps: number of training steps; a pretrained model needs fewer,
# but in any case far more than 10000 steps are required
dqn.fit(env, nb_steps=10000, callbacks=[checkpoint_callback], log_interval=10000, visualize=False)

Training for 10000 steps ...
Interval 1 (0 steps performed)
   30/10000 [..............................] - ETA: 35s - reward: 0.0000e+00



done, took 33.094 seconds


<keras.callbacks.History at 0x244dca4fe88>

In [18]:
# Save
dqn.save_weights(weights_filename, overwrite=True)

## 7. Test & Use

In [19]:
# This is a very bad agent because it was not trained long enough
# (10000 steps is below nb_steps_warmup=50000, so it likely never ran a gradient update)
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: -21.000, steps: 1016
Episode 2: reward: -21.000, steps: 1019
Episode 3: reward: -21.000, steps: 1026
Episode 4: reward: -21.000, steps: 1014
Episode 5: reward: -21.000, steps: 1010


<keras.callbacks.History at 0x244d65e36c8>

### Model Provided in the Course

In [21]:
weights_path = "C:/Users/Mikel/Dropbox/Learning/PythonLab/udemy_rl_ai/notebooks/08-Deep-Q-Learning-On-Images/weights_exercise/dqn_PONG_weights_1500000.h5f"
model.load_weights(weights_path)
# Redefinition of memory & policy
memory = SequentialMemory(limit=1000000,
                          window_length=WINDOW_LENGTH)
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr='eps',
                              value_max=1,
                              value_min=.1,
                              value_test=.05,
                              nb_steps=100000)
processor = ImageProcessor()
dqn = DQNAgent(model=model,
               nb_actions=nb_actions,
               policy=policy,
               memory=memory,
               processor=processor,
               nb_steps_warmup=50000,
               gamma=0.99,
               target_model_update=10000)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

In [22]:
# This is much better, although the agent still loses every episode
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: -13.000, steps: 4615
Episode 2: reward: -7.000, steps: 4858
Episode 3: reward: -11.000, steps: 4344
Episode 4: reward: -17.000, steps: 4053
Episode 5: reward: -13.000, steps: 4694


<keras.callbacks.History at 0x244ded54888>
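
To compare the two test runs at a glance, the episode rewards printed above can be summarized. The values below are copied by hand from the two test outputs. In Pong an episode ends when one side reaches 21 points, so rewards range from -21 to +21.

```python
from statistics import mean

short_run = [-21, -21, -21, -21, -21]       # agent trained for 10000 steps
course_weights = [-13, -7, -11, -17, -13]   # weights provided in the course

for label, rewards in (("10000-step agent", short_run),
                       ("course weights", course_weights)):
    print(f"{label}: mean {mean(rewards):.1f}, best {max(rewards)}, worst {min(rewards)}")
```

The course weights score points in every episode, but a negative mean reward means the agent still loses on average.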