# Q-Values from Colors

Typically, I don't do much RL from pixels or images on this blog because 1) my research doesn't deal with image processing, and more importantly, 2) it takes a lot of compute! The extra compute goes to the **Convolutional Neural Network** (CNN) that the image must pass through before it is fed to the fully-connected layers. Think about the task we're setting before our neural network for a moment. It not only has to learn how to play a game with no prior models to assist it, but it also has to learn relationships between all of those pixels and how they change through time, again, with no prior model. No wonder it takes so many computational resources to train!

Starting by learning to [play CartPole with a DQN](https://www.datahubbs.com/deep-q-learning-101/) is a good way to test your algorithm for bugs. Here, we'll push it to do more by following the [DeepMind paper](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) to play Atari from the pixels on the screen. Just by "watching" the screen and making movements, these algorithms were able to achieve the impressive accomplishment of surpassing human performance in many games.

## TL;DR

We build a DQN with a convolutional neural network (CNN) in order to learn to play from the pixels on the screen. This is the way a lot of games are played with deep reinforcement learning and makes these techniques applicable to images.

<img src="https://i0.wp.com/www.datahubbs.com/wp-content/uploads/2019/02/atari_games.png?w=790&ssl=1">



## Basics of CNNs

To go into depth on CNNs would take us far afield here (see this post [here](https://skymind.ai/wiki/convolutional-network) and video [here](https://www.youtube.com/watch?v=FmpDIaiMIeA) for more), but we can cover the basics that you need to grasp the workings of the DQN algorithm. 

Convolutional networks are used in image recognition tasks and typically take three-dimensional tensors as inputs. Images have a height and width (two of the dimensions) while the third is devoted to the **RGB color**: each color we see is represented digitally as a combination of three values for red, green, and blue. For CNNs, the entries along this third dimension are commonly referred to as **channels**.

<img src="https://www.datahubbs.com/wp-content/uploads/2019/04/3dmatrix.png">

The image is transformed via a **kernel** (also called a **filter**). If you have an RGB image, then each filter has one 2D slice of weights for each color channel of your image, so a $3 \times 3$ filter applied to an RGB image actually holds $3 \times 3 \times 3$ weights. 

In the picture below (borrowed from [cs231n notes](http://cs231n.github.io/convolutional-networks/) which has a handy animation), the kernel moves across each channel and calculates the dot product of the kernel and the values it covers. This is repeated across the whole image, with the kernel shifting by a fixed number of pixels each time, called the **stride**. 

<img src="https://www.datahubbs.com/wp-content/uploads/2019/04/karpathy-convnet-labels.png">

It is challenging to clearly explain what is going on in the image above, so stick with me and we'll walk through a computational example for the highlighted portion. The blue and gray matrices on the left are our inputs (`x`) with each channel (0, 1, 2) corresponding to some RGB value. The gray boxes represent the **padding** that's applied to each image, which is simply 0 values or whitespace around the images we're processing. We apply our filters/kernels (red matrices) to each of those in turn, then sum those values and add the bias to get our output (green matrices). We can calculate the example above with `numpy` as follows:

In [89]:
import numpy as np

x0 = np.array([[2, 1, 0], [1, 2, 0], [2, 1, 0]])
f0 = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]])
x1 = np.array([[2, 0, 0], [2, 2, 0], [1, 0, 0]])
f1 = np.array([[0, -1, 0], [0, 1, 1], [-1, 1, -1]]) 
x2 = np.array([[0, 2, 0], [2, 1, 0], [1, 0, 0]])
f2 = np.array([[1, 0, 1], [-1, 1, 0], [-1, 1, 1]])
b = 1
print("x0:\n{}".format(x0))
print("f0:\n{}".format(f0))
print("x1:\n{}".format(x1))
print("f1:\n{}".format(f1))
print("x2:\n{}".format(x2))
print("f2:\n{}".format(f2))

dot0 = np.dot(x0.flatten(), f0.flatten())
dot1 = np.dot(x1.flatten(), f1.flatten())
dot2 = np.dot(x2.flatten(), f2.flatten())
out = dot0 + dot1 + dot2 + b
print("Output value: {}".format(out))

x0:
[[2 1 0]
 [1 2 0]
 [2 1 0]]
f0:
[[1 1 1]
 [1 1 1]
 [1 1 1]]
x1:
[[2 0 0]
 [2 2 0]
 [1 0 0]]
f1:
[[ 0 -1  0]
 [ 0  1  1]
 [-1  1 -1]]
x2:
[[0 2 0]
 [2 1 0]
 [1 0 0]]
f2:
[[ 1  0  1]
 [-1  1  0]
 [-1  1  1]]
Output value: 9


Each local window of the input and each kernel is stretched into a vector (hence the `flatten()` method) before the dot product is calculated. Doing this for every window lets the whole convolution be computed as a single, large matrix multiplication between the kernels and all of the pixel locations. The resulting values fill in our output matrix. 
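To make the "single, large matrix multiplication" idea concrete, here is a minimal sketch that compares a sliding-window convolution with the flattened-patch version. The input and filter values are random stand-ins (not the numbers from the figure), the input is assumed to be padded already, and the stride of 2 is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(3, 7, 7))   # (channels, height, width), already padded
f = rng.integers(-1, 2, size=(3, 3, 3))  # one filter: (channels, K, K)
b, K, S = 1, 3, 2                        # bias, kernel size, stride (assumed values)
n_out = (x.shape[1] - K) // S + 1        # output width and height

# Direct approach: slide the window and take one dot product per location
direct = np.zeros((n_out, n_out), dtype=int)
for i in range(n_out):
    for j in range(n_out):
        patch = x[:, i * S:i * S + K, j * S:j * S + K]
        direct[i, j] = np.sum(patch * f) + b

# Matrix approach: every flattened patch becomes one column
cols = np.stack([x[:, i * S:i * S + K, j * S:j * S + K].flatten()
                 for i in range(n_out) for j in range(n_out)], axis=1)
as_matmul = (f.flatten() @ cols + b).reshape(n_out, n_out)

print(cols.shape)                          # (27, 9)
print(np.array_equal(direct, as_matmul))   # True
```

With more than one filter, the flattened filters would simply be stacked as rows of a matrix, giving one output channel per row.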

We can see how these inputs get transformed and operated on through the convolutional layer, but what do their output dimensions look like? Thankfully, we have some handy formulas for this.

If we have a 2D image (plus a color channel) we have a $W_{in} \times H_{in} \times D_{in}$ input. We convolve our images with the kernel which has its size $K$ ($K=3$ in our example above), number of kernels, $F$, stride $S$, and padding $P$. The output dimensions are then given as $W_{out} \times H_{out} \times D_{out}$ and are calculated as:

$$W_{out} = \frac{W_{in} - K + 2P}{S}+1$$

$$H_{out} = \frac{H_{in} - K + 2P}{S}+1$$

$$D_{out} = F$$
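These formulas are easy to wrap in a small helper. The sketch below uses a hypothetical function name, `conv_output_size`, and applies floor division, which matches how PyTorch rounds when the stride doesn't divide the padded input evenly. The example values (a 5-pixel input with $K=3$, $P=1$, $S=2$) are assumed for illustration.

```python
def conv_output_size(size_in, kernel, padding=0, stride=1):
    """Output width (or height) of a convolution along one dimension."""
    # Floor division handles strides that don't divide the input evenly
    return (size_in - kernel + 2 * padding) // stride + 1

# Assumed example: 5-pixel input, 3x3 kernel, padding of 1, stride of 2
print(conv_output_size(5, kernel=3, padding=1, stride=2))  # 3
```

Because width and height are treated independently, the same function covers both dimensions of a non-square image.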

## Pooling Layer

After convolving the input, there typically follows a **max pooling** layer. This operator is relatively simple compared to what we just examined. It has some kernel size and stride like we saw before, but rather than taking the dot product of the kernel, it returns the maximum value in the kernel window. Thus it creates a new matrix of maximum values of the various subsets it scans as shown in the color-coded image below.

<img src="https://www.datahubbs.com/wp-content/uploads/2019/04/MaxpoolSample2.png">

Although this function wasn't applied in the DeepMind paper we'll be modeling our network off of, it is very common when building CNNs. 

The important takeaway is to understand a bit of what is going on in the convolutional portion of the neural network, particularly because the dimensions are so important to get right when constructing these things. Take it from my experience: it can be very frustrating to build a functioning CNN without a good understanding of why it fits together the way it does.
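For a feel of what max pooling does numerically, here is a minimal `numpy` sketch. The function name `max_pool2d` and the input values are made up for illustration, and it handles only a single 2D channel with no padding.

```python
import numpy as np

def max_pool2d(x, kernel=2, stride=2):
    """Max pooling over a single 2D channel (no padding)."""
    h_out = (x.shape[0] - kernel) // stride + 1
    w_out = (x.shape[1] - kernel) // stride + 1
    out = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + kernel,
                       j * stride:j * stride + kernel]
            out[i, j] = window.max()  # keep only the largest value in the window
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool2d(x))
# [[6 8]
#  [3 4]]
```

A $2 \times 2$ window with a stride of 2 halves both the height and the width, which is why pooling is a cheap way to shrink the feature maps.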

To get the hang of putting some of these layers together in PyTorch, let's just build a simple example CNN.

## Building a CNN

We can play with any of the [Atari environments](https://gym.openai.com/envs/#atari) in OpenAI Gym to get a feel for how this works. 

If you need to install the Atari environments, it's just:

`pip install gym[atari]`

Last I checked, the OpenAI Gym Atari environments won't run on a Windows machine. It appears there are [workarounds](https://stackoverflow.com/questions/42605769/openai-gym-atari-on-windows), but I haven't tried them myself. If you want to go ahead with this on Windows, then give that a shot, or finally do what the cool people do and grab a [Linux distro](https://www.techradar.com/news/best-linux-distro).

In [91]:
import torch
from torch import nn
import matplotlib.pyplot as plt
import gym

%matplotlib inline

In [None]:
env = gym.make('Breakout-v0')
state = env.reset()
plt.figure(figsize=(12,8))
plt.imshow(state)
plt.axis('off')
plt.show()

<img src="https://www.datahubbs.com/wp-content/uploads/2019/04/breakout_s_0.png">

I imported *Breakout* and showed a starting state with `matplotlib`. From here, we can inspect the `state`, which is just a 3D numpy array.

In [98]:
state.shape

(210, 160, 3)

Let's put together our single-layer convolutional network.

In [115]:
CNN = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=8, stride=2, padding=1))

We should be able to predict our output dimension from our above settings:

$$H_{out} = \frac{210 - 8 + 2\times1}{2}+1 = 103$$

$$W_{out} = \frac{160 - 8 + 2\times1}{2}+1 = 78$$

PyTorch expects your color channel to be first for your input, so we need to re-organize a bit.

In [107]:
_state = np.moveaxis(state, 2, 0)
_state.shape

(3, 210, 160)

`np.moveaxis` does what the name implies: it lets us move axes around to our heart's content. 

The next adjustment we need to do is add a fourth dimension to the front of our array. This doesn't affect the image, but simply allows us to pass multiple images to the network at once in a single **batch**. Typically we'll do this with something like `np.stack([])` to combine states into batches, but in this case we'll just use `np.newaxis` because we've only got one observation to work with.

In [110]:
_state = _state[np.newaxis, :]
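For comparison, here is a small sketch of the `np.stack` route for when you have several observations to batch. The frames are blank stand-ins with the same channel-first shape as our `_state` before the batch dimension was added.

```python
import numpy as np

# Three stand-in frames, already in channel-first layout
frames = [np.zeros((3, 210, 160)) for _ in range(3)]

batch = np.stack(frames)           # adds a new leading batch axis
print(batch.shape)                 # (3, 3, 210, 160)

single = frames[0][np.newaxis, :]  # same idea for a lone observation
print(single.shape)                # (1, 3, 210, 160)
```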

Now, pass this to our `CNN` object and we can check the results.

In [116]:
CNN(torch.FloatTensor(_state)).shape

torch.Size([1, 32, 103, 78])

The dimensions match up according to our prediction and we got no errors, so we're good! 

Before we get to building our DQN, we need to understand a bit about the steps DeepMind took to enable their DQN algorithm to learn from the Atari inputs.

## Preparing the Pixels

Learning to play from pixels isn't as straightforward as plugging the raw values into a network and letting it run. You can try that, but it's going to be costly in terms of compute (and by extension, in terms of your [AWS bill](https://amzn.to/2GevJdg)). We're going to help our algorithm along by applying a bit of pre-processing like found in the [DeepMind paper](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf).

Let's refer back to our starting state image.

<img src="https://www.datahubbs.com/wp-content/uploads/2019/04/breakout_s_0.png">

Here, we've got lots of colors and a lot of extra space and resolution. We can reduce our computational load by reducing the resolution and simplifying the colors. DeepMind took the maximum pixel value over consecutive frames to reduce the flickering caused by the limitations of the Atari platform, and then scaled each frame from its original $210 \times 160 \times 3$ resolution down to $84 \times 84$. We'll do something similar, except that we'll simply convert the colors to grayscale and skip the maximum-pixel-value step.

To convert to grayscale, we compute the luminance (denoted as $Y$) of each pixel as a weighted sum of its red, green, and blue values, with linear weights taken from the [relative luminance](https://en.wikipedia.org/wiki/Grayscale#Luma_coding_in_video_systems): 
 
$$Y = 0.299R + 0.587G + 0.114B$$

We can do this easily with a small function.

In [None]:
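def scale_lumininance(img):
    # Weighted sum over the RGB axis: (H, W, 3) -> (H, W)
    return np.dot(img[...,:3], [0.299, 0.587, 0.114])

s = state  # the starting frame we displayed above
s_g = scale_lumininance(s)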
plt.figure(figsize=(12,8))
plt.imshow(s_g, cmap=plt.get_cmap('gray'))
plt.axis('off')
plt.show()

<img src="https://www.datahubbs.com/wp-content/uploads/2019/04/breakout_s_0_bw.png">

Now we've got our grayscale images and we've reduced our dimensionality by one.

In [15]:
print("Original image dimensions {}".format(s.shape))
print("Gray image dimensions {}".format(s_g.shape))

Original image dimensions (210, 160, 3)
Gray image dimensions (210, 160)


Next we need to go from $210 \times 160$ down to $84 \times 84$. I've found the easiest way to do this is with [`scikit-image`](http://scikit-image.org/docs/stable/auto_examples/transform/plot_rescale.html). Import this, and use the `transform.resize()` function and you'll quickly downsample your image to whatever value you specify.

In [120]:
from skimage import transform

s_g84 = transform.resize(s_g, (84, 84))

print(s_g84.shape)

plt.figure(figsize=(12,8))
plt.imshow(s_g84, cmap=plt.get_cmap('gray'))
plt.axis('off')
plt.show()

(84, 84)


<img src="https://www.datahubbs.com/wp-content/uploads/2019/04/breakout_s_0_resize.png">

Let's go ahead and tie these pre-processing steps together for our convenience.

In [133]:
def preprocess_observations(obs):
    obs_gray = scale_lumininance(obs)
    obs_trans = transform.resize(obs_gray, (84, 84))
    return np.moveaxis(obs_trans, 1, 0)
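If you did want to include the flicker-reduction step that DeepMind describes, a minimal sketch might look like the following. The function name `max_over_frames` is hypothetical and the frames are synthetic; the idea is just an element-wise maximum over the current and previous raw frames, applied before the grayscale and resize steps.

```python
import numpy as np

def max_over_frames(frame, prev_frame):
    """Element-wise maximum of two consecutive raw frames."""
    return np.maximum(frame, prev_frame)

# Synthetic frames: each "sprite" is drawn in only one of the two frames
prev_frame = np.zeros((210, 160, 3), dtype=np.uint8)
frame = np.zeros((210, 160, 3), dtype=np.uint8)
prev_frame[100, 80] = 200
frame[50, 40] = 150

merged = max_over_frames(frame, prev_frame)
print(merged[100, 80, 0], merged[50, 40, 0])  # both sprites survive the merge
```

The merged frame could then be passed to `preprocess_observations` in place of the raw observation.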

Finally, before we can move on, there's one more element to discuss.

Take a look at the photo below:

<img src="https://www.datahubbs.com/wp-content/uploads/2019/03/ball-catch.jpg">

I can't tell if the ball is moving up or down; whether the girl just tossed the ball or is getting ready to catch it. But if you were to give me at least two sequential images, then I would be able to tell which direction the ball is moving. The same principle applies for a game like *Breakout*. Without this sequence of information, the agent won't be able to tell which direction the ball is moving.[<sup>1</sup>](#fn1)

To help out, I'm going to introduce a parameter `tau` ($\tau$) which is set to 4 by default. This means that each state we pass to the agent will actually have 4 frames stacked together so that it can pick up on the direction of motion. 

We'll link these frames together using `deque` from the `collections` package and set up two, one for the current state (`state_buffer`) and one for the next state (`next_state_buffer`). 

In [134]:
from collections import deque

tau = 4
state_buffer = deque(maxlen=tau)
next_state_buffer = deque(maxlen=tau)
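Here is one way the two buffers could be used during an episode. This is a sketch under assumptions, not necessarily how the full implementation does it: the `fake_frame` helper is a hypothetical stand-in for a pre-processed $84 \times 84$ frame, and filling the buffer with copies of the first frame at the start of an episode is just one reasonable choice.

```python
from collections import deque
import numpy as np

tau = 4
state_buffer = deque(maxlen=tau)

def fake_frame(value):
    # Hypothetical stand-in for a pre-processed 84x84 observation
    return np.full((84, 84), value, dtype=np.float32)

# Start of an episode: repeat the first frame tau times (assumed strategy)
for _ in range(tau):
    state_buffer.append(fake_frame(0))

for t in (1, 2):
    # The next state is the current stack with the newest frame appended;
    # maxlen makes the oldest frame drop off automatically
    next_state_buffer = state_buffer.copy()
    next_state_buffer.append(fake_frame(t))

    state = np.stack(list(state_buffer))            # shape (4, 84, 84)
    next_state = np.stack(list(next_state_buffer))  # shape (4, 84, 84)

    state_buffer = next_state_buffer                # advance one time step

print(state[:, 0, 0].tolist())       # [0.0, 0.0, 0.0, 1.0]
print(next_state[:, 0, 0].tolist())  # [0.0, 0.0, 1.0, 2.0]
```

Each stacked state has $\tau$ frames along its first axis, which is what lets the network compare frames and infer motion.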

## Building the DQN

With our pre-processing steps out of the way, we can turn to constructing our Deep Q-Network. We're going to skip ahead to the network and the implementation ([see this post if you want to see the algorithm](https://www.datahubbs.com/deep-q-learning-101/)). 

DeepMind placed their model architecture in the *Methods* section at the end of the paper along with most of the implementation details. They used a three-layer CNN followed by a fully-connected network consisting of one hidden layer and one output layer. 

|Layer|Filters|Kernel Size|Stride|Hidden Nodes|Activation Function|
|-----|-------|-----------|------|------------|-------------------|
|CNN1|32|8|4|NA|ReLU|
|CNN2|64|4|2|NA|ReLU|
|CNN3|64|3|1|NA|ReLU|
|FC1 |NA|NA|NA|512|ReLU|
|Output|NA|NA|NA|One per action|None (linear)|

To implement this in PyTorch, we stack our layers with the activation function in between as we would for any other network architecture.

In [135]:
# CNN layers
cnn = nn.Sequential(
    nn.Conv2d(tau, 32, kernel_size=8, stride=4),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1),
    nn.ReLU()
    )

The number of input channels for the first layer is set by our $\tau$ value, since each state consists of $\tau$ stacked frames. 

Unfortunately, we can't just slap our fully connected layer at the end because the dimensions won't match. The output of our CNN is a four-dimensional tensor (batch, channels, height, width), while our fully connected layer expects a 2D input (batch, features), so we need to flatten it before passing it to the final layers. To size that layer, we have two options: use the CNN dimensionality equations we discussed above to calculate the dimensions, or let PyTorch do the work for us. Personally, I prefer the latter. 

We can get the answer from our network by passing in a sample state and simply looking at the output dimensions. In our case, we're resizing everything to an $84\times84$ input, so we'll put $\tau$ arrays together in our `state_buffer`, stack it, and send it through the CNN.

In [152]:
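input_dim = (84, 84)
# Fill the buffer with tau blank frames to act as a sample state
for _ in range(tau):
    state_buffer.append(np.zeros(input_dim))
# Stack into a batch of one: shape (1, tau, 84, 84)
state_t = torch.FloatTensor(np.stack([state_buffer]))
output = cnn(state_t)
print(output.shape)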

torch.Size([1, 64, 7, 7])


When we flatten our output tensor, we get our answer.

In [155]:
fc_input_dim = output.flatten().shape[0]
print(fc_input_dim)

3136
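As a sanity check, the same number falls out of the output-size formula from earlier. Applying it layer by layer to the $84 \times 84$ input, with the default padding of $P=0$ since we didn't pass a `padding` argument to `nn.Conv2d`:

$$\text{CNN1: } \frac{84 - 8 + 2\times0}{4}+1 = 20$$

$$\text{CNN2: } \frac{20 - 4 + 2\times0}{2}+1 = 9$$

$$\text{CNN3: } \frac{9 - 3 + 2\times0}{1}+1 = 7$$

The last layer has 64 filters, so flattening gives $64 \times 7 \times 7 = 3136$ inputs for the fully connected layer.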


All we need to make this work is a function that will take our CNN output, reshape it to the proper dimensions, and then pass that to our fully connected layers. With that, we've got our network. 

In [157]:
fully_connected = nn.Sequential(
    nn.Linear(fc_input_dim, 512, bias=True),
    nn.ReLU(),
    nn.Linear(512, env.action_space.n))

def get_qvals(state):
    state_t = torch.FloatTensor(state)
    cnn_out = cnn(state_t).reshape(-1, fc_input_dim)
    return fully_connected(cnn_out)

get_qvals(np.stack([state_buffer]))

tensor([[ 0.0086,  0.0376, -0.0117,  0.0186]], grad_fn=<AddmmBackward>)

Putting this together into a `QNetwork` class is rather straightforward.

In [161]:
class QNetwork(nn.Module):
    
    def __init__(self, env, learning_rate=1e-3, 
        tau=4, device='cpu', input_dim=(84,84), *args, **kwargs):
        super(QNetwork, self).__init__()
        self.device = device
        self.actions = np.arange(env.action_space.n)
        self.tau = tau
        self.n_outputs = env.action_space.n

        # CNN modeled off of Mnih et al.
        self.cnn = nn.Sequential(
            nn.Conv2d(tau, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )
        
        self.fc_layer_inputs = self.cnn_out_dim(input_dim)
        
        self.fully_connected = nn.Sequential(
            nn.Linear(self.fc_layer_inputs, 512, bias=True),
            nn.ReLU(),
            nn.Linear(512, self.n_outputs))
        
        # Set device for GPU's
        if self.device == 'cuda':
            self.cnn.cuda()
            self.fully_connected.cuda()
        
        self.optimizer = torch.optim.Adam(self.parameters(),
                                          lr=learning_rate)
                
    def get_action(self, state, epsilon=0.05):
        if np.random.random() < epsilon:
            action = np.random.choice(self.actions)
        else:
            action = self.greedy_action(state)
        return action
    
    def greedy_action(self, state):
        qvals = self.get_qvals(state)
        return torch.max(qvals, dim=-1)[1].item()
    
    def get_qvals(self, state):
        state_t = torch.FloatTensor(state).to(device=self.device)
        cnn_out = self.cnn(state_t).reshape(-1, self.fc_layer_inputs)
        return self.fully_connected(cnn_out)

    def cnn_out_dim(self, input_dim):
        return self.cnn(torch.zeros(1, self.tau, *input_dim)
            ).flatten().shape[0]

And just to show you it works:

In [162]:
dqn = QNetwork(env)
dqn.get_qvals(np.stack([state_buffer]))

tensor([[-0.0019,  0.0051,  0.0113, -0.0224]], grad_fn=<AddmmBackward>)

With that, we can plug this new network into the [previous DQN algorithm](https://www.datahubbs.com/deep-q-learning-101/) with only one small alteration: passing our states through the `preprocess_observations` function before loading them in the `state_buffer`. Rather than re-create the full algorithm here, you can build it yourself or just pull the CNN version from [GitHub](https://github.com/hubbs5/rl_blog/blob/master/q_learning/deep/dqn_cnn.py). If you want to run the algorithm with the same hyperparameters as DeepMind did, I've inserted the table from their paper below. 

Be warned, this can take some time to train! The code works, but it could take 2-3 days to reach a reasonable level of performance for each game depending on your hardware configuration. GPUs definitely help speed up the training, but neither the hyperparameters nor the code is optimized for fast performance; instead, I wrote it for clarity. Keep an eye out for a future post where we'll dive into parallelization techniques, profiling, and hyperparameter optimization to get the most out of your hardware.

Between this post and the last (and lots of time and computational resources), you should be able to build a DQN to reproduce DeepMind's groundbreaking results in *Nature*. Good luck!

<img src="https://www.datahubbs.com/wp-content/uploads/2019/04/hyperparameters.png">

<span id="fn1">1. [Mnih et al.](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) refer to this as the *agent history length*. It's also worth noting that this type of concatenation strategy is critical for capturing motion when learning from pixels, as is done in the original paper. There are also RAM environments that we can train on for most Atari games. If we were to use these, then it may not be necessary to stack the frames together because it is likely that there is a value in the state that defines the direction of travel for the ball. Setting $\tau=1$ will allow you to test that.</span>