# DQN on Custom Environments - Snake Game: Training AI Agent

This notebook trains a DQN agent with Keras-RL2 to play our custom Snake game. See Section 7 of the guide `../ReinforcementLearning_Guide.md` for more information.

Before running this notebook, work through the previous two notebooks and make sure the game folder structure `snake/` has been created correctly.

This notebook is essentially a modification of the notebook `04_2_DQN_Images_Keras_RL2_Breakout.ipynb`; the main change is that we create an instance of our Snake game environment instead of Breakout.

Overview:
1. Imports
2. Environment Setup
3. Image Processing
4. Network Model
5. Agent
6. Training & Storing
7. Test & Use

## 1. Imports

In [18]:
from PIL import Image
import numpy as np
import gym

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from tensorflow.keras.optimizers import Adam

# Keras-RL2
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint

## 2. Environment Setup

In [19]:
# Our custom environment
env = gym.make("snake:snake-v0")
nb_actions = env.action_space.n

## 3. Image Processing

In [20]:
# Even though the game is 200x200,
# we downsample it so that training takes a feasible amount of time
IMG_SHAPE = (84, 84)
WINDOW_LENGTH = 4
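
`WINDOW_LENGTH = 4` means that each state passed to the network is a stack of the 4 most recent frames, so that motion can in principle be inferred from a single state. Keras-RL2 builds these stacks internally from the replay memory. The sketch below only illustrates the idea with a plain `deque`; the frame labels are hypothetical placeholders for real 84x84 images.

```python
from collections import deque

WINDOW_LENGTH = 4

# Hypothetical stand-ins for processed 84x84 frames.
frames = [f"frame_{t}" for t in range(6)]

# A deque with maxlen drops the oldest frame automatically.
window = deque(maxlen=WINDOW_LENGTH)
states = []
for frame in frames:
    window.append(frame)
    # Only emit a state once the window is full.
    if len(window) == WINDOW_LENGTH:
        states.append(list(window))

print(states[0])   # first full stack of 4 frames
print(states[-1])  # most recent stack of 4 frames
```

Consecutive states overlap in 3 of their 4 frames, which is why storing single frames and stacking them on demand is cheaper than storing full stacks.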

In [21]:
class ImageProcessor(Processor):
    def process_observation(self, observation):
        # numpy -> PIL
        img = Image.fromarray(observation)
        img = img.resize(IMG_SHAPE)
        # RGB -> grayscale
        img = img.convert("L")
        # PIL -> numpy
        img = np.array(img)
        return img.astype('uint8') # compress
    
    def process_state_batch(self, batch):
        # [0,255] -> [0, 1], for training the neural network
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1.0, 1.0)
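
The processor keeps observations as `uint8` and only converts to `float32` when a batch is sampled. A rough estimate shows why this matters for a replay buffer of 1,000,000 frames. This is illustrative arithmetic only: it assumes one stored frame per step and ignores any Python object overhead.

```python
# Rough replay-buffer size estimate (illustrative arithmetic only).
IMG_SHAPE = (84, 84)
REPLAY_LIMIT = 1_000_000  # same limit passed to SequentialMemory in this notebook


def buffer_gigabytes(bytes_per_pixel, n_frames=REPLAY_LIMIT, shape=IMG_SHAPE):
    """Approximate size of n_frames stored frames, in GB (10**9 bytes)."""
    return n_frames * shape[0] * shape[1] * bytes_per_pixel / 1e9


uint8_gb = buffer_gigabytes(1)    # uint8: 1 byte per pixel
float32_gb = buffer_gigabytes(4)  # float32: 4 bytes per pixel

print(f"uint8:   {uint8_gb:.2f} GB")    # uint8:   7.06 GB
print(f"float32: {float32_gb:.2f} GB")  # float32: 28.22 GB
```

Under these assumptions, storing frames as `float32` would need about four times the memory, so deferring the conversion to `process_state_batch` is the cheaper choice.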

## 4. Network Model

The right model depends heavily on the type of custom game/environment. Read papers on similar environments to decide which architecture to use.

In [22]:
input_shape = (WINDOW_LENGTH, IMG_SHAPE[0], IMG_SHAPE[1])
input_shape

(4, 84, 84)

In [23]:
model = Sequential()
model.add(Permute((2, 3, 1), input_shape=input_shape))

model.add(Convolution2D(32, (8, 8), strides=(4, 4), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
model.summary()  # summary() prints the table itself; wrapping it in print() would also print "None"


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
permute_1 (Permute)          (None, 84, 84, 4)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 20, 20, 32)        8224      
_________________________________________________________________
activation_5 (Activation)    (None, 20, 20, 32)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
activation_6 (Activation)    (None, 9, 9, 64)          0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 7, 7, 64)          36928     
_________________________________________________________________
activation_7 (Activation)    (None, 7, 7, 64)          0         
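
The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes unpadded (`'valid'`) convolutions, which is the Keras default when no `padding` argument is given; the helper names `conv_out` and `conv_params` are introduced here for illustration.

```python
def conv_out(size, kernel, stride):
    """Output length along one spatial axis for an unpadded convolution."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size, channels = 84, 4  # 84x84 frames, WINDOW_LENGTH = 4 channels after Permute
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (filters, kernel, stride)

shapes, params = [], []
for filters, kernel, stride in layers:
    params.append(conv_params(kernel, channels, filters))
    size = conv_out(size, kernel, stride)
    shapes.append((size, size, filters))
    channels = filters  # this layer's filters are the next layer's input channels

flat_size = size * size * channels  # number of values entering the first Dense layer

print(shapes)     # [(20, 20, 32), (9, 9, 64), (7, 7, 64)]
print(params)     # [8224, 32832, 36928]
print(flat_size)  # 3136
```

These values match the `Output Shape` and `Param #` columns of the three `Conv2D` rows above.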

## 5. Agent

In [24]:
# Replay buffer
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)

In [25]:
# Image processor
processor = ImageProcessor()

In [26]:
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr='eps',
                              value_max=1.0,
                              value_min=0.1,
                              value_test=0.05,
                              nb_steps=1000000)
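
With these settings, the exploration rate `eps` decays linearly from 1.0 to 0.1 over the first 1,000,000 steps and stays at 0.1 afterwards. The function below is a minimal re-implementation of that schedule for illustration, not the Keras-RL2 source; `annealed_eps` is a hypothetical helper name.

```python
def annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=1_000_000):
    """Linearly anneal eps from value_max to value_min, then hold at value_min."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


for step in (0, 75_000, 500_000, 1_000_000, 1_500_000):
    print(step, round(annealed_eps(step), 4))
```

At step 75,000 this gives about 0.9325. That is the midpoint of the steps between the 50,000-step warm-up and the end of the first 100,000-step interval, and it is consistent with the `mean_eps: 0.932` reported in the training log, assuming that metric is averaged over post-warm-up steps only.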

In [27]:
dqn = DQNAgent(model=model,
               nb_actions=nb_actions,
               policy=policy,
               memory=memory,
               processor=processor,
               nb_steps_warmup=50000,
               gamma=.99,
               target_model_update=10000,
               train_interval=4,
               delta_clip=1)
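
Three of these arguments shape the learning signal. `gamma` discounts future rewards. `target_model_update=10000` copies the online weights to the target network every 10,000 steps. `delta_clip=1` switches the loss from squared error to a Huber loss. The sketch below shows the textbook DQN target and the Huber loss on plain floats; it is an illustration under those standard definitions, not the library's code, and the function names are hypothetical.

```python
def td_target(reward, next_q_values, terminal, gamma=0.99):
    """DQN target: r if the episode ended, else r + gamma * max_a Q_target(s', a)."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def huber(error, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error ** 2
    return delta * (abs_err - 0.5 * delta)


# Hypothetical target-network Q-values for the next state, one per action.
next_q = [0.5, 2.0, -0.3]

print(td_target(1.0, next_q, terminal=False))  # bootstraps from the best next action
print(td_target(-1.0, next_q, terminal=True))  # terminal step: just the reward
print(huber(0.5), huber(3.0))                  # quadratic regime vs. linear regime
```

The linear regime limits how strongly a single large TD error can move the weights, which complements the reward clipping done in `process_reward`.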

In [28]:
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

In [29]:
# Files for trained weights + checkpoints
weights_filename = 'test_dqn_snake_weights.h5f'
checkpoint_weights_filename = 'test_dqn_snake_weights_{step}.h5f'
checkpoint_callback = ModelIntervalCheckpoint(checkpoint_weights_filename, interval=100000)
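
The `{step}` placeholder in the checkpoint filename is filled in with the current step count each time a checkpoint is written. Assuming the callback formats the template with `str.format(step=...)` and saves whenever the step count is a multiple of `interval`, a 1,500,000-step run would produce the following files:

```python
template = 'test_dqn_snake_weights_{step}.h5f'
interval = 100_000
total_steps = 1_500_000

# One checkpoint per full interval, including the final step.
checkpoint_names = [
    template.format(step=step)
    for step in range(interval, total_steps + 1, interval)
]

print(len(checkpoint_names))  # 15
print(checkpoint_names[0])    # test_dqn_snake_weights_100000.h5f
print(checkpoint_names[-1])   # test_dqn_snake_weights_1500000.h5f
```

Keeping intermediate checkpoints makes it possible to stop training early and still load a usable set of weights.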

## 6. Training & Storing

In [13]:
# Train and save checkpoint weights
# Note that 1.5M steps take many hours to train!
# We can stop it and load the weights provided in the course
dqn.fit(env,
        nb_steps=1500000,
        callbacks=[checkpoint_callback],
        log_interval=100000,
        visualize=False)

# Save final weights
dqn.save_weights(weights_filename, overwrite=True)

Training for 1500000 steps ...
Interval 1 (0 steps performed)
    14/100000 [..............................] - ETA: 6:26 - reward: 0.0000e+00   



2296 episodes - episode_reward: -0.895 [-1.000, 2.000] - loss: 0.003 - mae: 0.108 - mean_q: 0.138 - mean_eps: 0.932 - score: 0.081

Interval 2 (100000 steps performed)
 21001/100000 [=====>........................] - ETA: 36:46 - reward: -0.0197
done, took 2130.279 seconds


## 7. Test & Use

In [30]:
# Load the weights
model.load_weights("./snake_weights/dqn_snake_weights_1200000.h5f")
# Replay buffer
memory = SequentialMemory(limit=1000000,
                          window_length=WINDOW_LENGTH)
# Updated Policy
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr='eps',
                              value_max=1,
                              value_min=.1,
                              value_test=.05,
                              nb_steps=100000)
# Image Processor
processor = ImageProcessor()
# Initialize the DQNAgent with the new model and updated policy and compile it
dqn = DQNAgent(model=model,
               nb_actions=nb_actions,
               policy=policy,
               memory=memory,
               processor=processor,
               nb_steps_warmup=50000,
               gamma=.99,
               target_model_update=10000)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

In [36]:
# Set sleep, otherwise the snake is going to move too fast!
env.sleep = 0.2

In [39]:
# Rendering in this loop does not seem to work on my computer;
# it appears to be a Windows issue.
# The rewards, however, look reasonable.
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: 6.000, steps: 86
Episode 2: reward: 7.000, steps: 107
Episode 3: reward: 7.000, steps: 106
Episode 4: reward: 6.000, steps: 132
Episode 5: reward: 5.000, steps: 139


<keras.callbacks.History at 0x194e8e5fc08>
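
A quick summary of the five test episodes helps when comparing checkpoints. The sketch below copies the values from the output above into plain lists. If you keep the object returned by `dqn.test(...)`, the same numbers should be available under `history.history['episode_reward']` and `history.history['nb_steps']`; treat those key names as an assumption to verify against your Keras-RL2 version.

```python
from statistics import mean

# Values copied from the test output above.
episode_rewards = [6.0, 7.0, 7.0, 6.0, 5.0]
episode_steps = [86, 107, 106, 132, 139]

mean_reward = mean(episode_rewards)
mean_steps = mean(episode_steps)
reward_per_step = sum(episode_rewards) / sum(episode_steps)

print(f"mean reward:     {mean_reward:.2f}")      # mean reward:     6.20
print(f"mean steps:      {mean_steps}")           # mean steps:      114
print(f"reward per step: {reward_per_step:.4f}")  # reward per step: 0.0544
```

Five episodes is a small sample, so these averages are only a rough indication of the agent's performance.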