# Deep Q-learning (DQN) for Doom

![](vizdoom.png)

## Doom game rules, the BASIC scenario

* The map is a rectangle with walls, ceiling and floor
* A monster is spawned randomly somewhere along the opposite wall
* The player can only go left/right or shoot
* One hit is enough to kill the monster
* The episode ends when the monster is killed or on timeout (300 tics).

Rewards:
* +100 for killing the monster
* -1 for every time tick (the agent takes one action per tick: left, right or shoot)
* -5 for a missed shot
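
As a small illustration of how these rewards add up, the sketch below sums the per-tick rewards of one episode. The reward sequence is copied from the first random-play episode printed further down in this document. The helper names `episode_return` and `discounted_return` are made up for this example, and the discount factor 0.95 is the `gamma` value used later for training.

```python
def episode_return(rewards):
    """Undiscounted sum of the per-tick rewards of one episode."""
    return sum(rewards)


def discounted_return(rewards, gamma=0.95):
    """Discounted sum: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# six ticks at -1, the tick on which the kill is rewarded (100), one more tick at -1
rewards = [-1.0] * 6 + [100.0, -1.0]
print(episode_return(rewards))  # 93.0
print(discounted_return(rewards))
```

The undiscounted sum matches the `Result: 93.0` line of that episode. The discounted value is lower because the large kill reward arrives several ticks into the episode.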


## Installation

There is no specific reason for these versions; they are simply the ones I used, and they worked. To avoid problems you could use the same versions, although newer versions are available.

* python 3.7
* tensorflow 1.15
* scikit-image (skimage), latest version (`conda install scikit-image`)
* vizdoom 1.1.7

Installing vizdoom is easiest without using conda or pip:
1. Download version 1.1.7 from https://github.com/mwydmuch/ViZDoom/releases
2. Unpack the zip in `...\Anaconda3\envs\<your-conda-env>\Lib\site-packages`
3. You should now have a folder named `vizdoom` in the folder `site-packages`.
4. Inside the `vizdoom` folder you will find a scenario folder. Copy `basic.cfg` and `basic.wad` from that scenario folder into the folder with your own code. This step assumes you want to play the basic scenario.
5. If your Python version doesn't match, change the version in the file `...\Anaconda3\envs\<your-conda-env>\Lib\site-packages\vizdoom\__init__.py`.
6. In `basic.cfg` (the copy in the folder with your own code), change the screen format line to `screen_format = GRAY8`.
7. Activate the conda environment, open the Python prompt and try:

```
import vizdoom
game = vizdoom.DoomGame()
game.init()
```

If a small graphical window opens, the vizdoom installation works.

A remark about the `basic.cfg` file: if a statement contains a typo, no error message is displayed. The parser simply ignores the statement, which leads to unexpected behavior. For example, in step 6 I first tried `screen_format = GRAY8  # used to be: CRCGCB` instead of `screen_format = GRAY8`. The parser can't handle a comment after a statement on the same line, so the statement was ignored. The image remained 3-channel color instead of single-channel grayscale, which led to the error `ValueError: ('Cannot warp empty image with dimensions', (0, 180, 320))`.
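
Since the parser stays silent, it can help to scan the config file yourself before starting the game. The following is only a sketch: `find_trailing_comments` is a hypothetical helper, not part of vizdoom, and it assumes that whole-line comments start with `#` and are accepted by the parser, while a `#` after a statement causes the problem described above. The example text is illustrative, not a copy of the real `basic.cfg`.

```python
def find_trailing_comments(cfg_text):
    """Return (line_number, line) pairs where a '#' follows a statement on the same line."""
    suspicious = []
    for number, line in enumerate(cfg_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and whole-line comments are assumed to be fine
        if "#" in stripped:
            suspicious.append((number, stripped))
    return suspicious


example = """# whole-line comment
screen_resolution = RES_320X240
screen_format = GRAY8  # used to be: CRCGCB
"""
print(find_trailing_comments(example))
# [(3, 'screen_format = GRAY8  # used to be: CRCGCB')]
```

To check the real file, pass `open("basic.cfg").read()` to the function and fix every line it reports.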

More info on Vizdoom:
* [vizdoom tutorial](http://vizdoom.cs.put.edu.pl/tutorial)
* [vizdoom source code](https://github.com/mwydmuch/ViZDoom)

## Training

Training for 500 episodes takes about half an hour, although this depends heavily on the type of computer and on whether you use the GPU. The first few episodes take a lot of time, but as the agent improves, later episodes finish faster. Because the learned model is saved every 5 episodes, you can stop training before all 500 episodes are done without losing the training effort.

Start TensorBoard to watch the loss decrease:
* run `tensorboard --logdir tensorboard_logs`
* open http://localhost:6006/ in a browser

## Running

After training, the agent plays 100 episodes. It is quite impressive to watch the result in real time!

```python
import tensorflow as tf
import numpy as np
from vizdoom import *  # Doom environment
import random
import time
from skimage import transform
from collections import deque
import matplotlib.pyplot as plt


def create_environment():
    game = DoomGame()
    game.load_config("basic.cfg")
    game.set_doom_scenario_path("basic.wad")
    game.init()

    # each action is a one-hot list over the three buttons
    left = [1, 0, 0]
    right = [0, 1, 0]
    shoot = [0, 0, 1]
    actions = [left, right, shoot]

    return game, actions


def test_environment():
    game, actions = create_environment()

    episodes = 3
    for i in range(episodes):
        game.new_episode()
        while not game.is_episode_finished():
            state = game.get_state()
            img = state.screen_buffer  # current frame (not used in this test)
            misc = state.game_variables  # game variables (not used in this test)
            action = random.choice(actions)
            reward = game.make_action(action)
            print("action: {}, reward: {}".format(action, reward))
            time.sleep(0.1)
        print("Result:", game.get_total_reward())
        time.sleep(2)
    game.close()


# just to playtest if the vizdoom environment works
test_environment()
```


```
action: [1, 0, 0], reward: -1.0
action: [1, 0, 0], reward: -1.0
action: [0, 0, 1], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [0, 0, 1], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [1, 0, 0], reward: 100.0
action: [0, 1, 0], reward: -1.0
Result: 93.0
action: [0, 1, 0], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [1, 0, 0], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [0, 0, 1], reward: -1.0
action: [1, 0, 0], reward: -1.0
action: [0, 0, 1], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [1, 0, 0], reward: -6.0
action: [1, 0, 0], reward: -1.0
action: [0, 0, 1], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [0, 0, 1], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [1, 0, 0], reward: -1.0
action: [1, 0, 0], reward: -1.0
action: [1, 0, 0], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [0, 1, 0], reward: -1.0
action: [0, 0, 1], reward:
```

```python
# normalize and resize the frame
def preprocess_frame(frame):
    # grayscale conversion is already done in basic.cfg: "screen_format = GRAY8"
    cropped_frame = frame[30:-10, 30:-30]  # crop the screen (remove the ceiling because it contains no information)
    normalized_frame = cropped_frame / 255.0  # scale pixel values to [0, 1]
    preprocessed_frame = transform.resize(normalized_frame, [84, 84])  # resize the frame to shape (84, 84)

    return preprocessed_frame
```
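
The crop indices also explain the error message quoted in the installation notes. The following NumPy-only sketch assumes a 320x240 screen (the resolution implied by that error message) and a channel-first color frame, and it leaves out the `transform.resize` step. `crop` is a hypothetical stand-in for the slicing done in `preprocess_frame`.

```python
import numpy as np


def crop(frame):
    # same slicing as in preprocess_frame
    return frame[30:-10, 30:-30]


gray = np.zeros((240, 320), dtype=np.uint8)      # GRAY8: (height, width)
color = np.zeros((3, 240, 320), dtype=np.uint8)  # assumed color layout: (channels, height, width)

print(crop(gray).shape)   # (200, 260)
print(crop(color).shape)  # (0, 180, 320)
```

On the color frame the first slice is applied to the channel axis, which has only 3 entries, so nothing is left of it. That empty array is what the resize step then refuses to process.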


```python
# initialize deque with zero images; one array for each image
# (np.int was deprecated in NumPy 1.20; the preprocessed frames are floats anyway)
stack_size = 4  # we stack 4 frames
stacked_frames = deque([np.zeros((84, 84), dtype=np.float64) for i in range(stack_size)], maxlen=stack_size)


def stack_frames(stacked_frames, state, is_new_episode):
    frame = preprocess_frame(state)

    if is_new_episode:
        # clear stacked_frames
        stacked_frames = deque([np.zeros((84, 84), dtype=np.float64) for i in range(stack_size)], maxlen=stack_size)

        # because we're in a new episode, copy the same frame 4x
        for _ in range(stack_size):
            stacked_frames.append(frame)
    else:
        # append frame to deque; this automatically removes the oldest frame
        stacked_frames.append(frame)

    # stack the frames; the result is an array of shape (84, 84, 4)
    stacked_state = np.stack(stacked_frames, axis=2)

    return stacked_state, stacked_frames
```
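
To see how the bounded deque behaves, here is a small stand-alone sketch that does not need vizdoom. The frames are dummy arrays filled with a constant value, so it is easy to check which ones survive.

```python
from collections import deque

import numpy as np

stack = deque([np.zeros((84, 84)) for _ in range(4)], maxlen=4)

# push five dummy frames filled with 1.0, 2.0, ..., 5.0
for value in range(1, 6):
    stack.append(np.full((84, 84), float(value)))

state = np.stack(stack, axis=2)
print(state.shape)     # (84, 84, 4)
print(state[0, 0, :])  # [2. 3. 4. 5.]
```

The frame filled with 1.0 has been pushed out: with `maxlen=4` the deque always holds the four most recent frames, oldest first.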


```python
game, possible_actions = create_environment()
state_size = [84, 84, 4]  # input is a stack of 4 frames, hence 84x84x4 (height, width, stacked frames)
action_size = game.get_available_buttons_size()  # 3 possible actions: left, right, shoot

# learning parameters
learning_rate = 0.0002
n_episodes = 500
max_steps = 100  # max possible steps in an episode
batch_size = 64
gamma = 0.95  # discounting rate

# exploration parameters for epsilon greedy strategy
explore_start = 1.0   # exploration probability at start
explore_stop = 0.01   # minimum exploration probability
decay_rate = 0.0001   # exponential decay rate for exploration prob

# memory hyperparameters
pretrain_length = batch_size  # number of experiences stored in the memory when initialized for the first time
memory_size = 1000000         # number of experiences the memory can keep

training = True
```


```python
class DQNetwork:
    def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate

        with tf.variable_scope(name):
            # create the placeholders
            # *state_size unpacks the elements of state_size, as if we wrote [None, 84, 84, 4]
            self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")
            self.actions_ = tf.placeholder(tf.float32, [None, action_size], name="actions_")

            # target_Q is R(s,a) + gamma * max Q_hat(s', a')  (Q_hat is the estimated Q)
            self.target_Q = tf.placeholder(tf.float32, [None], name="target")

            # First conv block: convolution, batch normalization, ELU
            # input is 84x84x4
            self.conv1 = tf.layers.conv2d(inputs=self.inputs_,
                                          filters=32,
                                          kernel_size=[8, 8],
                                          strides=[4, 4],
                                          padding="VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                          name="conv1")

            # training=True: always normalize with the statistics of the current batch
            self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,
                                                                 training=True,
                                                                 epsilon=1e-5,
                                                                 name='batch_norm1')

            self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name="conv1_out")
            ## --> [20, 20, 32]

            # Second conv block: convolution, batch normalization, ELU
            self.conv2 = tf.layers.conv2d(inputs=self.conv1_out,
                                          filters=64,
                                          kernel_size=[4, 4],
                                          strides=[2, 2],
                                          padding="VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                          name="conv2")

            self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,
                                                                 training=True,
                                                                 epsilon=1e-5,
                                                                 name='batch_norm2')

            self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name="conv2_out")
            ## --> [9, 9, 64]

            # Third conv block: convolution, batch normalization, ELU
            self.conv3 = tf.layers.conv2d(inputs=self.conv2_out,
                                          filters=128,
                                          kernel_size=[4, 4],
                                          strides=[2, 2],
                                          padding="VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                          name="conv3")

            self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,
                                                                 training=True,
                                                                 epsilon=1e-5,
                                                                 name='batch_norm3')

            self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name="conv3_out")
            ## --> [3, 3, 128]

            self.flatten = tf.layers.flatten(self.conv3_out)
            ## --> [1152]

            self.fc = tf.layers.dense(inputs=self.flatten,
                                      units=512,
                                      activation=tf.nn.elu,
                                      kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                      name="fc1")

            # one output per action (3 in the basic scenario), no activation
            self.output = tf.layers.dense(inputs=self.fc,
                                          kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                          units=action_size,
                                          activation=None)

            # Q is the predicted Q-value of the action that was actually taken
            # (self.actions_ is one-hot, so the sum picks out a single output)
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)

            # the loss is the MSE between the predicted Q-values and the Q-targets
            self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
            self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)


# instantiate the DQNetwork
# note: this rebinds the name DQNetwork from the class to the instance
tf.reset_default_graph()
DQNetwork = DQNetwork(state_size, action_size, learning_rate)
```
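
The layer sizes noted in the comments (`[20, 20, 32]`, `[9, 9, 64]`, `[3, 3, 128]`, `[1152]`) follow from the usual output-size formula for a convolution with `VALID` padding. A short sketch to verify them; `conv_output_size` is a helper written for this example, not a TensorFlow function.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Output length along one axis for a convolution with VALID padding."""
    return (input_size - kernel_size) // stride + 1


size = 84
for kernel_size, stride in [(8, 4), (4, 2), (4, 2)]:
    size = conv_output_size(size, kernel_size, stride)
    print(size)  # 20, then 9, then 3

flattened = size * size * 128  # 128 filters in the last conv layer
print(flattened)  # 1152
```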

```python
class Memory():
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # draw batch_size distinct experiences uniformly at random
        buffer_size = len(self.buffer)
        index = np.random.choice(np.arange(buffer_size),
                                 size=batch_size,
                                 replace=False)

        return [self.buffer[i] for i in index]


# instantiate memory
memory = Memory(max_size=memory_size)
```
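
Because `sample` draws without replacement, the buffer has to contain at least `batch_size` experiences before the first call. That is presumably why the memory is pre-filled with `pretrain_length = batch_size` random experiences before training starts. A quick sketch of what happens otherwise, with a made-up buffer size of 10:

```python
import numpy as np

buffer_size = 10
index = np.random.choice(np.arange(buffer_size), size=4, replace=False)
print(len(set(index.tolist())))  # 4: all indices are distinct

try:
    # asking for more samples than the buffer holds
    np.random.choice(np.arange(buffer_size), size=64, replace=False)
except ValueError as error:
    print("sampling failed:", error)
```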

```python
# pre-fill the memory with experiences from random actions
game.new_episode()

for i in range(pretrain_length):

    if i == 0:  # the first step
        state = game.get_state().screen_buffer  # first we need a state
        state, stacked_frames = stack_frames(stacked_frames, state, True)

    action = random.choice(possible_actions)  # random action
    reward = game.make_action(action)
    done = game.is_episode_finished()  # check whether the episode is finished

    if done:  # the episode is over (monster killed or timeout)
        next_state = np.zeros(state.shape)  # there is no next state, so store zeros
        memory.add((state, action, reward, next_state, done))  # add experience to memory

        game.new_episode()  # start a new episode
        state = game.get_state().screen_buffer  # first we need a state
        state, stacked_frames = stack_frames(stacked_frames, state, True)  # stack the frames

    else:
        next_state = game.get_state().screen_buffer  # get the next state
        next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

        memory.add((state, action, reward, next_state, done))  # add experience to memory
        state = next_state
```
        

```python
writer = tf.summary.FileWriter("./tensorboard_logs/dqn")  # set up the TensorBoard writer
tf.summary.scalar("Loss", DQNetwork.loss)
write_op = tf.summary.merge_all()
```


```python
# the Q-learning part
# choose action a from state s using an epsilon greedy policy
# (uses the global session `sess` that is opened in the training and play cells)
def predict_action(explore_start, explore_stop, decay_rate, decay_step, state, possible_actions):
    exp_exp_tradeoff = np.random.rand()
    explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)

    if explore_probability > exp_exp_tradeoff:
        action = random.choice(possible_actions)  # random action (exploration)

    else:
        # get action from Q-network (exploitation)
        # estimate the Q-values for this state
        Qs = sess.run(DQNetwork.output, feed_dict={DQNetwork.inputs_: state.reshape((1, *state.shape))})
        choice = np.argmax(Qs)  # take the biggest Q-value (= the best action)
        action = possible_actions[int(choice)]

    return action, explore_probability
```
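
To get a feeling for how quickly exploration fades, the decay formula can be evaluated on its own. The sketch below uses the hyperparameter values set earlier as defaults; the function `explore_probability` and the chosen step counts are only for illustration.

```python
import math


def explore_probability(decay_step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    """Exploration probability after decay_step training steps."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * decay_step)


for decay_step in [0, 1000, 10000, 50000]:
    print(decay_step, round(explore_probability(decay_step), 4))
```

At step 0 the probability equals `explore_start`. It then decays exponentially towards `explore_stop` without ever dropping below it, so the agent keeps a small amount of exploration for the whole run.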


```python
saver = tf.train.Saver()  # used to save the model

if training:
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        decay_step = 0
        loss = float("nan")  # defined up front so the episode summary can print before the first update
        game.init()

        for episode in range(n_episodes):
            step = 0
            episode_rewards = []

            game.new_episode()
            state = game.get_state().screen_buffer  # observe the first state

            state, stacked_frames = stack_frames(stacked_frames, state, True)  # stack_frames() also calls preprocess_frame()

            while step < max_steps:
                step += 1
                decay_step += 1

                action, explore_probability = predict_action(explore_start, explore_stop, decay_rate, decay_step, state, possible_actions)
                reward = game.make_action(action)
                done = game.is_episode_finished()
                episode_rewards.append(reward)

                if done:
                    # the episode ended, so there is no next state; store an all-zero state instead
                    # (for terminal transitions the target is the reward alone, so its content is never used)
                    next_state = np.zeros(state.shape)

                    step = max_steps  # set step = max_steps to end the episode

                    total_reward = np.sum(episode_rewards)  # get the total reward of the episode

                    print('Episode: {}'.format(episode),
                          'Total reward: {}'.format(total_reward),
                          'Training loss: {:.4f}'.format(loss),
                          'Epsilon: {:.4f}'.format(explore_probability))

                    memory.add((state, action, reward, next_state, done))

                else:
                    next_state = game.get_state().screen_buffer  # get the next state
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)  # stack the frame of the next_state
                    memory.add((state, action, reward, next_state, done))  # add experience to memory
                    state = next_state  # new state becomes current state

                # learning part
                # obtain random mini-batch from memory
                batch = memory.sample(batch_size)
                states_mb = np.array([each[0] for each in batch], ndmin=3)
                actions_mb = np.array([each[1] for each in batch])
                rewards_mb = np.array([each[2] for each in batch])
                next_states_mb = np.array([each[3] for each in batch], ndmin=3)
                dones_mb = np.array([each[4] for each in batch])

                target_Qs_batch = []

                # get Q-values for next_state
                Qs_next_state = sess.run(DQNetwork.output, feed_dict={DQNetwork.inputs_: next_states_mb})

                # set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
                for i in range(0, len(batch)):
                    terminal = dones_mb[i]

                    # for a terminal transition the target is just the reward
                    if terminal:
                        target_Qs_batch.append(rewards_mb[i])

                    else:
                        target = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
                        target_Qs_batch.append(target)

                targets_mb = np.array([each for each in target_Qs_batch])

                loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],
                                   feed_dict={DQNetwork.inputs_: states_mb,
                                              DQNetwork.target_Q: targets_mb,
                                              DQNetwork.actions_: actions_mb})

                # write TF summaries
                summary = sess.run(write_op, feed_dict={DQNetwork.inputs_: states_mb,
                                                        DQNetwork.target_Q: targets_mb,
                                                        DQNetwork.actions_: actions_mb})
                writer.add_summary(summary, episode)
                writer.flush()

            # save model every 5 episodes
            if episode % 5 == 0:
                save_path = saver.save(sess, "./models/model.ckpt")
                print("Model Saved")
```

```
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
Model Saved
Episode: 2 Total reward: 95.0 Training loss: 0.6641 Epsilon: 0.9798
Episode: 3 Total reward: 93.0 Training loss: 164.5786 Epsilon: 0.9790
Episode: 4 Total reward: 95.0 Training loss: 38.6239 Epsilon: 0.9785
Model Saved
Episode: 7 Total reward: 95.0 Training loss: 5.9613 Epsilon: 0.9587
Episode: 8 Total reward: 94.0 Training loss: 19.2348 Epsilon: 0.9580
Model Saved
Episode: 11 Total reward: 87.0 Training loss: 0.4275 Epsilon: 0.9380
Episode: 12 Total reward: 92.0 Training loss: 1.3743 Epsilon: 0.9371
Model Saved
Episode: 16 Total reward: 93.0 Training loss: 3.3141 Epsilon: 0.9090
Episode: 18 Total reward: 92.0 Training loss: 3.3379 Epsilon: 0.8993
Episode: 20 Total reward: 95.0 Training loss: 3.1877 Epsilon: 0.8899
Model Saved
Episode: 22 Total reward: 95.0 Training loss: 0.2893 Epsilon: 0.8806
Episode: 23 Total reward: 92.0 Training loss: 3.4163 Epsilon: 0.8798
Episode: 24 Total reward: 92.0 Training loss: 9.9500 Epsilon: 0.8791
Model Saved
Episode: 28 Total reward: 93.0 Training loss
```
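
The target computation in the training loop goes through the mini-batch one transition at a time. The same rule (target = r for terminal transitions, r + gamma * max Q(s', a') otherwise) can also be written in vectorized form. This is a NumPy-only sketch with made-up numbers, not the code used for the run above; `q_targets` is a hypothetical helper.

```python
import numpy as np


def q_targets(rewards, next_q_values, dones, gamma=0.95):
    """Target r for terminal transitions, r + gamma * max_a' Q(s', a') otherwise."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    max_next_q = np.max(next_q_values, axis=1)  # best Q-value per next state
    return np.where(dones, rewards, rewards + gamma * max_next_q)


rewards = [-1.0, 100.0, -6.0]
next_q_values = np.array([[1.0, 2.0, 3.0],
                          [9.0, 9.0, 9.0],   # ignored: this transition is terminal
                          [0.0, 10.0, 5.0]])
dones = [False, True, False]
print(q_targets(rewards, next_q_values, dones))  # approximately [1.85, 100.0, 3.5]
```

For the terminal transition the Q-values of the next state are ignored, which is why the content of the zero `next_state` stored at the end of an episode does not matter.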

Finished training the agent! Let the agent play!

```python
# this cell relies on the imports, stack_frames, DQNetwork and saver from the cells above
with tf.Session() as sess:

    game, possible_actions = create_environment()
    n_episodes = 100
    totalScore = 0
    saver.restore(sess, "./models/model.ckpt")  # load the model
    game.init()

    for i in range(n_episodes):

        done = False
        game.new_episode()

        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)

        while not game.is_episode_finished():
            Qs = sess.run(DQNetwork.output, feed_dict={DQNetwork.inputs_: state.reshape((1, *state.shape))})
            choice = np.argmax(Qs)  # greedy policy
            action = possible_actions[int(choice)]

            game.make_action(action)
            done = game.is_episode_finished()
            score = game.get_total_reward()

            if done:
                break
            else:
                next_state = game.get_state().screen_buffer
                next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
                state = next_state

        score = game.get_total_reward()
        print("Score: ", score)
    game.close()
```

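Note: this cell depends on names defined in the earlier cells (`tf`, `saver`, `DQNetwork`, `stack_frames`, `create_environment`). If it is run in a fresh kernel without executing those cells first, it fails with `NameError: name 'tf' is not defined`.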