# Deep Q learning with Doom 🕹️
In this notebook we'll implement an agent <b>that plays Doom by using a Deep Q learning architecture.</b> <br>
Our agent playing Doom:

<img src="assets/doom.gif" style="max-width: 600px;" alt="Deep Q learning with Doom"/>


## This notebook is part of the free Deep Reinforcement Learning Course 📝
<img src="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/assets/img/preview.jpg" alt="Deep Reinforcement Course" style="width: 500px;"/>

<p> Deep Reinforcement Learning Course is a free series of blog posts about Deep Reinforcement Learning, where we'll learn the main algorithms <b>and how to implement them in TensorFlow.</b></p>

<p>The goal of these articles is to <b>explain each topic step by step, from the big picture</b> and the mathematical details behind it to the implementation with TensorFlow.</p>


<a href="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/">Syllabus</a><br>
<a href="https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419">Part 0: Introduction to Reinforcement Learning </a><br>
<a href=""> Part 1: Q-learning with FrozenLake</a><br>
<a href=""> Part 2: Deep Q-learning with Doom</a><br>
<a href=""> Part 3: Policy Gradients with Doom </a><br>


## Any questions? 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> GitHub: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">Twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">GitHub</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be notified of the new articles that I publish </b></p>
    
## Important note 🤔
<b> You can run it on your computer, but it's better to run it on a GPU-based service </b> (unless your computer has a GPU or you're able to wait 10 years 😅). Personally, I use Microsoft Azure and their Deep Learning Virtual Machine (they offer $170)
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning
<br>
⚠️ I don't have any business relations with them. I just loved their excellent customer service.

If you have trouble using Microsoft Azure, follow the explanations in this excellent article (skip the last part about fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429

## Step 1: Importing the libraries 📚

In [1]:
import tensorflow as tf      # Deep Learning library
import numpy as np           # Handle matrices
from vizdoom import *        # Doom Environment
import random                # Handling random number generation
import time                  # Handling time calculation
from skimage import transform # Help us to preprocess the frames

from collections import deque # Ordered collection with ends
import matplotlib.pyplot as plt # Display graphs

## Step 2: Create our environment 🎮
- Now that we have imported the libraries/dependencies, we will create our environment.
- The Doom environment takes:
    - A configuration file that handles all the options (size of the frame, possible actions...)
    - A scenario file that generates the correct scenario (in our case basic, but you're invited to try other scenarios).
- Note: We represent the actions as an identity array, but instead of 0 and 1 we use bool values. Each row is then a one-hot encoded action.
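
To make the note about the bool identity array concrete, here is a minimal sketch. It assumes three available buttons; in the notebook the real number comes from the ViZDoom config, so `n_buttons` is only a stand-in.

```python
import numpy as np

# Assumption: 3 buttons, standing in for game.get_available_buttons_size()
n_buttons = 3

# Each row is one action: exactly one button pressed (True), the others released (False)
possible_actions = np.identity(n_buttons, dtype=bool).tolist()

for action in possible_actions:
    print(action)
```

Because every row has a single `True`, a row can be passed to the game as a button list and can also serve as a one-hot action mask.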
 

In [2]:
import itertools as it
"""
Here we create our environment
"""
def create_environment():
    game = DoomGame()
    
    # Load the correct configuration
    game.load_config("basic.cfg")
    
    # Load the correct scenario (in our case basic scenario)
    game.set_doom_scenario_path("basic.wad")
    
    game.init()
    
    # Returns bool identity array [[T, F, F], [F, T, F], [F, F, T]]
    possible_actions = np.identity(game.get_available_buttons_size(),dtype=bool).tolist()
    
    return game, possible_actions
       
"""
Here we perform random actions to test the environment
"""
def test_environment():
    game = DoomGame()
    game.load_config("basic.cfg")
    game.set_doom_scenario_path("basic.wad")
    game.init()
    shoot = [0, 0, 1]
    left = [1, 0, 0]
    right = [0, 1, 0]
    actions = [shoot, left, right]

    episodes = 10
    for i in range(episodes):
        game.new_episode()
        while not game.is_episode_finished():
            state = game.get_state()
            img = state.screen_buffer
            misc = state.game_variables
            action = random.choice(actions)
            print(action)
            reward = game.make_action(action)
            #print ("\treward:", reward)
            time.sleep(0.02)
        print ("Result:", game.get_total_reward())
        time.sleep(2)

In [3]:
game, possible_actions = create_environment()

In [4]:
# Training hyperparameters
total_episodes = 100
max_steps = 100

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob


# Q-learning settings
learning_rate = 0.00025
discount_factor = 0.99

epochs = 20

learning_steps_per_epoch = 2000

replay_memory_size = 10000

batch_size = 1

frame_repeat = 4

pretrain_length = 1
memory_size = 100

state_size = [84,84,4]
action_size = 3
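
The three exploration parameters above combine into an exponentially decaying exploration probability. The sketch below only illustrates that formula; the function name `exploration_probability` and the argument `decay_step` (a count of steps taken so far) are hypothetical names chosen for this example.

```python
import numpy as np

explore_start = 1.0    # exploration probability at start
explore_stop = 0.01    # minimum exploration probability
decay_rate = 0.0001    # exponential decay rate

def exploration_probability(decay_step):
    """Epsilon after `decay_step` steps: decays exponentially from explore_start towards explore_stop."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)

for decay_step in (0, 10000, 50000):
    print(decay_step, round(float(exploration_probability(decay_step)), 4))
```

With such a small `decay_rate`, the agent keeps exploring for tens of thousands of steps, so a short run will act almost entirely at random.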

## Step 3 : Define the preprocessing functions ⚙️
### preprocess_frame
Preprocessing is an important step, <b>because we want to reduce the complexity of our states to reduce the computation time needed for training.</b>
<br><br>
Our steps:
- Grayscale each of our frames (because <b> color does not add important information </b>). But this is already done by the config file.
- Crop the screen (in our case we remove the roof because it contains no information)
- We normalize pixel values
- Finally we resize the preprocessed frame
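
The crop, normalize and resize steps can be checked on a synthetic frame. This is only a sketch under assumptions: the 240x320 frame size is made up for the example, and `nearest_resize` is a hypothetical NumPy-only helper used here in place of `skimage.transform.resize`, so the pixel values will differ from what skimage produces.

```python
import numpy as np

def nearest_resize(image, out_h, out_w):
    """Nearest-neighbour resize with plain NumPy indexing (stand-in for skimage.transform.resize)."""
    rows = np.arange(out_h) * image.shape[0] // out_h
    cols = np.arange(out_w) * image.shape[1] // out_w
    return image[rows[:, None], cols]

# Assumption: a 240x320 greyscale frame with uint8 pixels
frame = np.random.randint(0, 256, size=(240, 320), dtype=np.uint8)

cropped = frame[30:-10, 30:-30]   # drop 30 rows on top, 10 at the bottom, 30 columns on each side
normalized = cropped / 255.0      # scale pixel values into [0, 1]
preprocessed = nearest_resize(normalized, 84, 84)

print(cropped.shape, preprocessed.shape)
```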

In [5]:
"""
    preprocess_frame:
    Take a frame.
    Resize it.
        __________________
        |                 |
        |                 |
        |                 |
        |                 |
        |_________________|
        
        to
        _____________
        |            |
        |            |
        |            |
        |____________|
    Normalize it.
    
    return preprocessed_frame
    
    """
def preprocess_frame(frame):
    # Greyscale frame already done in our vizdoom config
    # x = np.mean(frame,-1)
    
    # Crop the screen (remove the roof because it contains no information)
    cropped_frame = frame[30:-10,30:-30]
    
    # Normalize Pixel Values
    normalized_frame = cropped_frame/255.0
    
    # Resize
    preprocessed_frame = transform.resize(normalized_frame, [84,84])
    
    return preprocessed_frame

### stack_frames
👏 This part was made possible thanks to the help of <a href="https://github.com/Miffyli">Anssi</a><br>
Stacking frames is really important because it gives our Neural Network a sense of motion.
- First we preprocess the frame
- Then we append the frame to the deque, which automatically removes the oldest frame
- Finally we build the stacked state

This is how the stack works:
- For the first frame, we fill the other 3 slots with blank frames
- At each timestep, we add the new frame to the deque and then stack its contents to form a new stacked state
- And so on
<img src="assets/stack.png" alt="stack">
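
The stacking behaviour described above can be seen with dummy frames. In this sketch each fake frame is filled with its timestep number (an assumption made purely so the contents of the stack are easy to read).

```python
from collections import deque
import numpy as np

stack_size = 4
# Start with blank frames, as described for the first timestep
frames = deque([np.zeros((84, 84)) for _ in range(stack_size)], maxlen=stack_size)

for t in range(1, 6):
    # Stand-in for a preprocessed frame: every pixel holds the timestep number
    frames.append(np.full((84, 84), float(t)))
    stacked_state = np.stack(frames, axis=2)

# After 5 frames the deque has dropped frame 1 and holds timesteps 2 to 5 along the last axis
print(stacked_state.shape, stacked_state[0, 0, :])
```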

In [6]:
stack_size = 4
# Initialize deque with one blank (all-zero) 84x84 frame per slot
stacked_frames  =  deque([np.zeros((84,84), dtype=int) for i in range(stack_size)], maxlen=stack_size) 

def stack_frames(stacked_frames, state):
    # Preprocess frame
    frame = preprocess_frame(state)
        
    # Append frame to deque, automatically removes the oldest frame
    stacked_frames.append(frame)
       
    # Build the stacked state (the last dimension indexes the different frames)
    stacked_state = np.stack(stacked_frames, axis=2)
    
    return stacked_state

## Step 3: Create our Deep Q learning Neural Network model 🧠

In [7]:
class DQNetwork:
    def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        
        with tf.variable_scope(name):
            # We create the placeholders
            self.inputs_ = tf.placeholder(tf.float32, [None, 84, 84, 4], name="inputs")
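            # The actions we feed are already one-hot rows (one entry per action),
            # so applying tf.one_hot again would produce a [None, 3, 3] tensor.
            self.actions_ = tf.placeholder(tf.float32, [None, 3], name="actions_")
            self.actions_one_hot = self.actions_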
    
            self.target_Q = tf.placeholder(tf.float32, [None], name="target")
            
            """
            First convnet:
            CNN
            BatchNormalization
            ELU
            """
            # Input is 84x84x4
            self.conv1 = tf.layers.conv2d(inputs = self.inputs_,
                                         filters = 16,
                                         kernel_size = [8,8],
                                         strides = [4,4],
                                         padding = "VALID",
                                         name = "conv1")
             
            self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm1')
            
            self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name="conv1_out")
            
            """
            Second convnet:
            CNN
            BatchNormalization
            ELU
            """
            self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,
                                 filters = 32,
                                 kernel_size = [4,4],
                                 strides = [2,2],
                                 padding = "VALID",
                                 name = "conv2")
        
            self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm2')

            self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name="conv2_out")
            
            """
            Third convnet:
            CNN
            BatchNormalization
            ELU
            """
            self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,
                                 filters = 64,
                                 kernel_size = [4,4],
                                 strides = [2,2],
                                 padding = "VALID",
                                 name = "conv3")
        
            self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm3')

            self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name="conv3_out")

            self.flatten = tf.layers.flatten(self.conv3_out)
            
            self.fc = tf.layers.dense(inputs = self.flatten,
                                  units = 512,
                                  activation = tf.nn.elu,
                                name="fc1")
            
            self.output = tf.layers.dense(inputs = self.fc, 
                                          units = 3, 
                                        activation=None)


  
            ### Train with loss (targetQ - Q)^2
            # output has length 3, one Q value per action. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_one_hot), axis=1)
            
            
            # (Qtarget - Qhat)
            self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
            
            self.optimizer = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)

In [8]:
tf.reset_default_graph()
DQNetwork = DQNetwork(state_size, action_size, learning_rate)
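
As a sanity check on the network above, the spatial size after each `VALID` convolution can be computed by hand. The helper name `valid_conv_output` is hypothetical; the kernel sizes, strides and filter count are the ones used in the class.

```python
def valid_conv_output(size, kernel, stride):
    """Output length of a VALID (no padding) convolution along one spatial axis."""
    return (size - kernel) // stride + 1

size = 84  # input frames are 84x84
for name, kernel, stride in [("conv1", 8, 4), ("conv2", 4, 2), ("conv3", 4, 2)]:
    size = valid_conv_output(size, kernel, stride)
    print(name, "->", size, "x", size)

flatten_units = size * size * 64   # conv3 has 64 filters
print("flatten ->", flatten_units)
```

So the dense layer with 512 units receives a fairly small flattened vector, which is why the three strided convolutions are enough without pooling.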

## Step: Experience Replay
This part was made possible by the excellent notebook: _ by Udacity

In [9]:
class Memory():
    def __init__(self, max_size):
        self.buffer = deque(maxlen = max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
    
    def sample(self, batch_size):
        buffer_size = len(self.buffer)
        index = np.random.choice(np.arange(buffer_size),
                                size = batch_size,
                                replace = False)
        
        return [self.buffer[i] for i in index]

<p> Here we'll deal with the empty memory problem: we pre-populate our memory by taking random actions and storing the experience (state, action, reward, new_state).</p>
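
To see how a fixed-size memory behaves while it is being filled, here is a small standalone sketch. `ReplayBuffer` and `demo_memory` are hypothetical names for this example, and the stored experiences are placeholders (integers instead of stacked frames).

```python
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size memory: the oldest experience is dropped once max_size is reached."""
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Draw distinct indices so an experience appears at most once per batch
        index = np.random.choice(len(self.buffer), size=batch_size, replace=False)
        return [self.buffer[i] for i in index]

demo_memory = ReplayBuffer(max_size=5)
for step in range(8):
    # Placeholder experience: (state, action, reward, next_state)
    demo_memory.add((step, [True, False, False], 0.0, step + 1))

batch = demo_memory.sample(batch_size=3)
print(len(demo_memory.buffer), [experience[0] for experience in batch])
```

Because sampling is done without replacement, the memory must hold at least `batch_size` experiences before the first training step, which is the reason for pre-populating it.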

In [10]:
memory_size = 10           # memory capacity
batch_size = 10 
pretrain_length = batch_size

In [11]:
# Start a new episode
game.new_episode()

# Instantiate memory
memory = Memory(max_size = memory_size)

for i in range(pretrain_length):
    
    if i == 0:
        # First we need a state
        state = game.get_state().screen_buffer
        state = stack_frames(stacked_frames, state)
    
    # Random action
    action = random.choice(possible_actions)
    
    # Get the rewards
    reward = game.make_action(action)
    
    # Look if the episode is finished
    done = game.is_episode_finished()
    
    
    if done:
        # We finished the episode: an all-zero next_state marks the terminal transition
        next_state = np.zeros(state.shape)
        
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start a new episode and observe its first state
        game.new_episode()
        state = game.get_state().screen_buffer
        state = stack_frames(stacked_frames, state)
    else:
        # Get the next state
        next_state = game.get_state().screen_buffer
        next_state = stack_frames(stacked_frames, next_state)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state



## Step 4: Train our Model 🏃‍♂️

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode $\leftarrow 1$ **to** $M$ **do**
  * Observe $s_0$
  * **For** $t \leftarrow 0$ **to** $T-1$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle$ in memory $D$
     * Sample random mini-batch from $D$: $\langle s_j, a_j, r_j, s'_j \rangle$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**
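
The target and loss steps of the algorithm above can be worked through with plain NumPy. All numbers in this sketch are made up; it only illustrates how terminal transitions drop the bootstrap term and how a one-hot mask selects $Q(s_j, a_j)$.

```python
import numpy as np

gamma = 0.99

# Made-up mini-batch of 3 transitions with 3 actions each
rewards = np.array([1.0, -1.0, 100.0])
next_q = np.array([[0.5, 2.0, 1.0],     # Q(s'_j, a') for every action a'
                   [0.0, -1.0, 3.0],
                   [4.0, 4.0, 4.0]])
episode_ends = np.array([False, False, True])

# Q-hat_j = r_j for terminal transitions, r_j + gamma * max_a' Q(s'_j, a') otherwise
targets = rewards + gamma * np.max(next_q, axis=1) * (~episode_ends)

# Q(s_j, a_j): pick the value of the action actually taken using a one-hot mask
q_values = np.array([[1.0, 0.0, 2.0],
                     [0.5, 0.5, 0.5],
                     [90.0, 0.0, 0.0]])
actions_one_hot = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
q_taken = np.sum(q_values * actions_one_hot, axis=1)

# Mean squared error between targets and the Q values of the taken actions
loss = np.mean((targets - q_taken) ** 2)
print(targets, q_taken, loss)
```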

In [12]:
rewards_list = []
decay_step = 0   # total steps taken so far, so epsilon keeps decaying across episodes
loss = 0.0       # defined up front so it can always be printed

with tf.Session() as sess:
    # Initialize the variables
    sess.run(tf.global_variables_initializer())
    
    for episode in range(total_episodes):
        
        game.new_episode()
        
        total_reward = 0
        step = 0
        
        # Start each episode from a stack of blank frames
        stacked_frames = deque([np.zeros((84,84)) for i in range(stack_size)], maxlen=stack_size)
        
        # Observe the first state
        frame = game.get_state().screen_buffer
        state = stack_frames(stacked_frames, frame)
        
        while step < max_steps:
            step += 1
            decay_step += 1
            
            ## EPSILON GREEDY STRATEGY
            # Choose action a from state s using epsilon greedy.
            ## First we randomize a number
            exp_exp_tradeoff = np.random.rand()
            
            # Here we'll use an improved version of our epsilon greedy strategy used in Q-learning notebook
            explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)
            
            if (explore_probability > exp_exp_tradeoff):
                # Make a random action
                action = random.choice(possible_actions) 
            
            else:
                # Get action from Q-network
                # Estimate the Q values for this state (add a batch dimension first)
                Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})
                # Take the biggest Q value (= the best action) and map it back to a button list
                action = possible_actions[int(np.argmax(Qs))]
            
            # Do the action
            reward = game.make_action(action)
            
            # Look if the episode is finished
            done = game.is_episode_finished()
            
            total_reward += reward
            
            if done:
                # The episode ends so there is no next state:
                # an all-zero next_state marks the terminal transition
                next_state = np.zeros(state.shape)
            else:
                # Get the next state
                next_state = game.get_state().screen_buffer
                next_state = stack_frames(stacked_frames, next_state)
            
            # Add experience to memory
            memory.add((state, action, reward, next_state))
            state = next_state
            
            ### LEARNING PART            
            # Obtain random mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Get Q values for next_state 
            target_Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states})
            
            # Terminal transitions were stored with an all-zero next_state:
            # their future value is 0
            episode_ends = (next_states == 0).all(axis=(1,2,3))
            max_next_Qs = np.max(target_Qs, axis=1)
            max_next_Qs[episode_ends] = 0
            
            # Set target Q_target = r + gamma * max_a' Q(s',a')
            targets = rewards + discount_factor * max_next_Qs
            
            loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],
                                feed_dict={DQNetwork.inputs_: states,
                                           DQNetwork.target_Q: targets,
                                           DQNetwork.actions_: actions})
            
            # If the game is finished
            if done:
                print('Episode: {}'.format(episode),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_probability))
                rewards_list.append((episode, total_reward))
                break

            




## Step 5: Watch our Agent play 👀