# Reinforcement Learning: Deep Q-Learning

## Table of Contents
- [1 - Fundamentals](#1)
    - [1.1 - Characteristics](#1.1)
    - [1.2 - Limitations of Q-Learning with Q-Tables](#1.2)
    - [1.3 - Deep-Q-Learning](#1.3)
    - [1.4 - Deep-Q-Network](#1.4)
    - [1.5 - Experience Replay](#1.5)
    - [1.6 - Target Network](#1.6)
    - [1.7 - When to use Deep Q-Learning](#1.7)

<a name='1'></a>
# 1 - Fundamentals

<a name='1.1'></a>
## 1.1 - Characteristics

Deep Q-Learning is a model-free, value-based, off-policy deep reinforcement learning algorithm that computes its updates using the temporal difference method.

<a name='1.2'></a>
## 1.2 - Limitations of Q-Learning with Q-Tables

The Q-learning algorithm does a pretty decent job in relatively small state spaces, but its performance drops off considerably in more complex and sophisticated environments.

Think about a video game where a player has a large environment to roam around in. Each state in the environment would be represented by a set of pixels, and the agent may be able to take several actions from each state. Iteratively computing and updating Q-values for every state-action pair in such a large state space becomes computationally inefficient, and perhaps infeasible, because of the memory and time it requires.
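
To get a feel for the scale, the rough calculation below counts how many distinct states an image-based environment could have. The 84x84 grayscale frame with 256 intensity levels and the four actions are assumptions chosen purely for illustration; they are not taken from the text above.

```python
import math

# Assumed (hypothetical) frame format: 84 x 84 grayscale pixels, 256 levels each.
HEIGHT, WIDTH, LEVELS = 84, 84, 256
NUM_ACTIONS = 4  # assumed number of actions available in every state

# The number of distinct frames is LEVELS ** (HEIGHT * WIDTH). We work with
# log10 so that this enormous integer never has to be built or printed.
log10_states = HEIGHT * WIDTH * math.log10(LEVELS)
digits_in_state_count = math.floor(log10_states) + 1

print(f"Distinct frames: a number with about {digits_in_state_count} decimal digits")
print(f"A Q-table would need {NUM_ACTIONS} entries for every one of them")
```

Even if only a tiny fraction of these frames can actually occur in the game, a table with one row per state is clearly out of reach.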

So what can we do when we want to manage more sophisticated environments with large state spaces? Well, rather than using value iteration to directly compute Q-values and find the optimal Q-function, we instead use function approximation to estimate the optimal Q-function.

<a name='1.3'></a>
## 1.3 - Deep-Q-Learning

We'll make use of a deep neural network to estimate the Q-values for each state-action pair in a given environment, and in turn, the network will approximate the optimal Q-function. The act of combining Q-learning with a deep neural network is called *Deep-Q-Learning*, and a deep neural network that approximates a Q-function is called a *Deep-Q-Network* or *DQN*.

<a name='1.4'></a>
## 1.4 - Deep-Q-Network

Suppose we have some arbitrary deep neural network that accepts states from a given environment as input. For each given state input, the network outputs estimated Q-values for each action that can be taken from that state. The objective of this network is to approximate the optimal Q-function, and remember that the optimal Q-function will satisfy the Bellman equation.

<img src="images/deep_q_network.png" style="width:400;height:400px;">
<caption><center><b>Figure 1</b>: Deep-Q-Network</center></caption>

With this in mind, the loss from the network is calculated by comparing the outputted Q-values to the target Q-values from the right-hand side of the Bellman equation, and as with any network, the objective here is to minimize this loss.
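
Written out, the comparison looks roughly as follows. The discount factor $\gamma$ and the network weights $\theta$ are standard symbols assumed here rather than introduced above, and the squared error is only one common choice of loss:

$$q_{*}(s_{t}, a_{t}) = \mathbb{E}\left[r_{t+1} + \gamma \max_{a'} q_{*}(s_{t+1}, a')\right]$$

$$\text{loss} = \left(r_{t+1} + \gamma \max_{a'} q(s_{t+1}, a'; \theta) - q(s_{t}, a_{t}; \theta)\right)^{2}$$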
    
After the loss is calculated, the weights within the network are updated via stochastic gradient descent (SGD) and backpropagation, again, just like with any other typical network. This process is repeated over and over for the states the agent encounters in the environment until the loss is sufficiently minimized and we get an approximately optimal Q-function.
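
As a minimal sketch of such an update, the example below applies repeated gradient steps to a hypothetical *linear* Q-function instead of a deep network, so the gradient can be written by hand without backpropagation. The function names, the learning rate, and the fixed target value are all assumptions made for illustration.

```python
# Hypothetical linear Q-function: q(s, a) = sum_i weights[a][i] * s[i]
def q_value(weights, state, action):
    return sum(w * x for w, x in zip(weights[action], state))

def sgd_step(weights, state, action, target, learning_rate=0.1):
    """One gradient-descent step on 0.5 * (target - q(s, a)) ** 2.

    The target is treated as a constant, so the gradient with respect to
    weights[action][i] is -(target - q) * state[i].
    """
    error = target - q_value(weights, state, action)
    weights[action] = [
        w + learning_rate * error * x for w, x in zip(weights[action], state)
    ]
    return 0.5 * error ** 2  # loss measured before the update

weights = [[0.0, 0.0], [0.0, 0.0]]      # 2 actions, 2 state features
state, action, target = [1.0, 0.5], 1, 2.0

losses = [sgd_step(weights, state, action, target) for _ in range(50)]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

Only the weights of the action that was actually taken are changed; the row for the other action stays untouched.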
    
**The Input**
    
The network accepts states from the environment as input. In more complex environments, like video games, images can be used as input. These inputs are usually preprocessed first, for example by converting them to grayscale, cropping, or downscaling.

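Sometimes a single frame is not enough to represent a single input state, because a still image carries no information about motion, such as the direction or speed of an object. In that case, we stack a few consecutive frames to represent a single input.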
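
Frame stacking can be sketched with a bounded queue. The class name `FrameStack`, the stack size of four, and the string placeholders standing in for image frames are assumptions for illustration only.

```python
from collections import deque

STACK_SIZE = 4  # assumed number of frames that make up one state

class FrameStack:
    """Keeps the most recent frames and exposes them together as one state."""

    def __init__(self, stack_size=STACK_SIZE):
        self.frames = deque(maxlen=stack_size)

    def reset(self, first_frame):
        # At the start of an episode, repeat the first frame to fill the stack.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.state()

    def state(self):
        return tuple(self.frames)

stack = FrameStack()
state = stack.reset("frame_0")
state = stack.push("frame_1")
state = stack.push("frame_2")
print(state)
```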
    
**The Layers**

The layers in a *Deep-Q-Network* are no different from the layers in other well-known networks. Many *Deep-Q-Networks* consist simply of a few convolutional layers, each followed by a non-linear activation function, and a couple of fully connected layers at the end.
    
**The Output**
    
The output layer is a fully connected layer and it produces the Q-value for each action that can be taken from the given state that was passed as input. There is no activation function after the output layer since we want the raw, non-transformed Q-values from the network. 
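
The input-to-output flow can be sketched with a toy network. This is a simplified stand-in and not a real DQN: it takes a small feature vector instead of images, leaves out the convolutional layers, and uses hypothetical names (`TinyQNetwork`, `dense`, `init_layer`) and layer sizes.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

class TinyQNetwork:
    """Hypothetical toy network: state features -> hidden layer -> Q-values."""

    def __init__(self, n_features, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.hidden = init_layer(n_features, n_hidden, rng)
        self.output = init_layer(n_hidden, n_actions, rng)

    def forward(self, state):
        h = relu(dense(state, *self.hidden))  # non-linear hidden layer
        return dense(h, *self.output)         # raw Q-values, no activation

net = TinyQNetwork(n_features=4, n_hidden=8, n_actions=3)
q_values = net.forward([0.1, -0.2, 0.3, 0.0])
greedy_action = max(range(len(q_values)), key=q_values.__getitem__)
print(q_values, greedy_action)
```

A single forward pass yields one Q-value per action, so the greedy action is simply the index of the largest output.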

<a name='1.5'></a>
## 1.5 - Experience Replay

With Deep-Q-Networks, a technique called experience replay is often used during training. With this technique, the agent's experience at each time step is stored in a data set called the *replay memory*.

At time *t*, the agent's experience $e_{t}$ is defined as this tuple:

$$e_{t}=(s_{t}, a_{t}, r_{t+1}, s_{t+1})$$

All of the agent's experiences at each time step, over all episodes played by the agent, are stored in the *replay memory*. In practice, a finite size limit *N* is usually set, and only the last *N* experiences are stored.

Why is the network trained on random samples from replay memory, rather than on the sequential experiences as they occur in the environment? If the network learned only from consecutive samples of experience as they occurred sequentially in the environment, the samples would be highly correlated and would therefore lead to inefficient and unstable learning. Taking random samples from replay memory breaks this correlation.
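
A replay memory with a finite size limit and uniform random sampling might look like the sketch below. The class and method names (`ReplayMemory`, `push`, `sample`) and the integer placeholders used as states are assumptions for illustration.

```python
import random
from collections import deque, namedtuple

# Field names mirror the tuple e_t = (s_t, a_t, r_{t+1}, s_{t+1}).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    def __init__(self, capacity):
        # A bounded deque discards the oldest experience once it is full,
        # so only the last `capacity` experiences are kept.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive experiences.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

memory = ReplayMemory(capacity=5)
for t in range(8):
    memory.push(state=t, action=t % 2, reward=1.0, next_state=t + 1)

batch = memory.sample(batch_size=3)
print(len(memory), [e.state for e in batch])
```

After eight pushes into a memory of capacity five, only the experiences from the last five time steps remain available for sampling.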

**Training a Deep-Q-Network with Replay Memory**

After storing an experience in replay memory, a random batch of experiences is sampled from replay memory. The state from each sampled experience is then passed to the network as input. The input state data then forward propagates through the network, just as in any other neural network. The model then outputs an estimated Q-value for each possible action from the given input state.

The loss is then calculated. This is done by comparing the Q-value output from the network for the action in the experience tuple with the corresponding *target Q-value* for the same action. Remember, the target Q-value is calculated using the expression from the right-hand side of the Bellman equation. So, the loss is based on the difference between the target Q-value and the network's Q-value for the same state-action pair, for example the squared difference.

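To compute the target Q-value for any given state-action pair, the next state *s'* is passed to the policy network, which outputs the Q-values for each state-action pair using *s'* as the state and each of the possible next actions as *a'*. The maximum of these Q-values is then plugged into the right-hand side of the Bellman equation, together with the reward and the discount factor.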
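
For a single sampled experience, the target and the loss can be computed as in the sketch below. The discount factor of 0.99, the squared-error loss, and the hard-coded Q-values standing in for network outputs are assumptions. The sketch also ignores terminal states, for which the target is usually just the reward.

```python
GAMMA = 0.99  # assumed discount factor

def td_target(reward, next_q_values, gamma=GAMMA):
    """Right-hand side of the Bellman equation for one experience."""
    return reward + gamma * max(next_q_values)

def squared_td_loss(q_values, action, target):
    """Squared difference between the target and the Q-value of the taken action."""
    return (target - q_values[action]) ** 2

# Hypothetical network outputs for one sampled experience (s, a, r, s').
q_values_s = [0.5, 1.2, -0.3]       # Q(s, .) for the state in the experience
q_values_s_next = [0.0, 2.0, 1.0]   # Q(s', .) for the next state
action, reward = 1, 1.0

target = td_target(reward, q_values_s_next)
loss = squared_td_loss(q_values_s, action, target)
print(target, loss)
```

Only the Q-value of the action stored in the experience tuple enters the loss; the outputs for the other actions are not compared with anything.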

<a name='1.6'></a>
## 1.6 - Target Network

The target network is a second network that is used to calculate the target Q-values. Rather than being calculated by the policy network, the target Q-values are obtained from a completely separate network, appropriately called the *target network*.

The target network is a clone of the policy network. Its weights are frozen at a copy of the policy network's weights and are only updated to the policy network's current weights every certain number of time steps. This number of time steps can be looked at as yet another hyperparameter. With a single network, every weight update also shifts the targets the network is trying to reach, so the network ends up chasing a moving target. As it turns out, the use of a target network removes much of the instability introduced by using only one network to calculate both the Q-values and the target Q-values.
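
The periodic update can be sketched as follows. The `Network` class is a hypothetical stand-in that only holds a list of weights, the update interval of 100 steps is an assumed hyperparameter value, and the "training update" is a placeholder rather than real learning.

```python
import copy

TARGET_UPDATE_EVERY = 100  # assumed hyperparameter, in time steps

class Network:
    """Stand-in for a real network; only its weights matter for this sketch."""

    def __init__(self, weights):
        self.weights = weights

policy_net = Network(weights=[0.0, 0.0])
target_net = copy.deepcopy(policy_net)  # the target network starts as a clone

sync_steps = []
for step in range(1, 251):
    # Placeholder for a training update that changes only the policy network.
    policy_net.weights = [w + 0.01 for w in policy_net.weights]

    if step % TARGET_UPDATE_EVERY == 0:
        # Copy the current policy weights into the otherwise frozen target network.
        target_net.weights = copy.deepcopy(policy_net.weights)
        sync_steps.append(step)

print(sync_steps)
```

Between two updates the target network's weights stay fixed, so the targets do not move while the policy network keeps learning.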

<a name='1.7'></a>
## 1.7 - When to use Deep Q-Learning

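Deep Q-Learning is best suited to problems with discrete action spaces that are trained in a single process, since the network outputs one Q-value per action. To increase performance, extensions such as Double DQN (DDQN), Dueling DQN, DRQN, and prioritized experience replay can be used.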