Convolution Neural Networks

Introduction

Convolutional Neural Networks (CNN) are now a standard way of image classification – there are publicly accessible deep learning frameworks, trained models and services. It’s more time consuming to install stuff like caffe¹ than to perform state-of-the-art object classification or detection. We also have many methods of getting knowledge -there is a large number of deep learning courses² /MOOCs³, free e-books⁴ or even direct ways of accessing to the strongest Deep/Machine Learning minds such as Yoshua Bengio⁵, Andrew NG⁶ or Yann Lecun⁷ by Quora, Facebook or G+.

Nevertheless, when I wanted to get deeper insight in CNN, I could not find a “CNN backpropagation for dummies”. Notoriously I met with statements like: “If you understand backpropagation in standard neural networks, there should not be a problem with understanding it in CNN” or “All things are nearly the same, except matrix multiplications are replaced by convolutions”. And of course I saw tons of ready equations.

It was a little consoling, when I found out that I am not alone, for example: Hello, when computing the gradients CNN, the weights need to be rotated, Why ?⁸

δ_j^l = f′(u_j^l) ⊙ conv2(δ_j^l + 1, rot180(k_j^l + 1), ′full′)

The answer on above question, that concerns the need of rotation on weights in gradient computing, will be a result of this long post.

Back Propagation

We start from multilayer perceptron and counting delta errors on fingers:

We see on above picture that δ₁¹ is proportional to deltas from next layer that are scaled by weights.

But how do we connect concept of MLP with Convolutional Neural Network? Let’s play with MLP:

Transforming Multilayer Perceptron to Convolutional Neural Network.

If you are not sure that after connections cutting and weights sharing we get one layer Convolutional Neural Network, I hope that below picture will convince you:

Feedforward in CNN is identical with convolution operation.

The idea behind this figure is to show, that such neural network configuration is identical with a 2D convolution operation and weights are just filters (also called kernels, convolution matrices, or masks).

Now we can come back to gradient computing by counting on fingers, but from now we will be only focused on CNN. Let’s begin:

Backpropagation also results with convolution.

No magic here, we have just summed in “blue layer” scaled by weights gradients from “orange” layer. Same process as in MLP’s backpropagation. However, in the standard approach we talk about dot products and here we have … yup, again convolution:

Yeah, it is a bit different convolution than in previous (forward) case. There we did so called valid convolution, while here we do a full convolution (more about nomenclature here⁹ ). What is more, we rotate our kernel by 180 degrees. But still, we are talking about convolution!

Now, I have some good news and some bad news:

you see (BTW, sorry for pictures aesthetics), that matrix dot products are replaced by convolution operations both in feed forward and backpropagation.
you know that seeing something and understanding something … yup, we are going now to get our hands dirty and prove above statement before getting next, I recommend to read, mentioned already in the disclaimer, chapter 2¹⁰ of M. Nielsen book. I tried to make all quantities to be consistent with work of Michael.

In the standard MLP, we can define an error of neuron j as:

$$\delta_j^l = \frac{\partial C}{\partial z_j^l}$$

where z_j^l is just:

z_j^l = ∑_kw_jk^la_k^l − 1 + b_j^l

and for clarity, a_j^l = σ(z_j^l), where σ is an activation function such as sigmoid, hyperbolic tangent or relu¹¹.

But here, we do not have MLP but CNN and matrix multiplications are replaced by convolutions as we discussed before. So instead of z_j we do have a z_x, y:

z_x, y^l + 1 = w^l + 1 * σ(z_x, y^l) + b_x, y^l + 1 = ∑_a∑_bw_a, b^l + 1σ(z_{x − a, y − b}^l) + b_x, y^l + 1

Above equation is just a convolution operation during feedforward phase illustrated in the above picture titled ‘Feedforward in CNN is identical with convolution operation’¹²

Now we can get to the point and answer the question Hello, when computing the gradients CNN, the weights need to be rotated, Why ?¹³

We start from statement:

$$\delta_{x,y}^l = \frac{\partial C}{\partial z_{x,y}^l} = \sum_{x'} \sum_{y'}\frac{\partial C}{\partial z_{x',y'}^{l+1}} \frac{\partial z_{x',y'}^{l+1}}{\partial z_{x,y}^l}$$

We know that z_x, y^l is in relation to z_x′, y′^l + 1 which is indirectly showed in the above picture titled ‘Backpropagation also results with convolution’. So sums are the result of chain rule. Let’s move on:

$$\begin{aligned} \frac{\partial C}{\partial z_{x,y}^l} & = \sum_{x'} \sum_{y'}\frac{\partial C}{\partial z_{x',y'}^{l+1}} \frac{\partial z_{x',y'}^{l+1}}{\partial z_{x,y}^l} \\\ & = \sum_{x'} \sum_{y'} \delta_{x',y'}^{l+1} \frac{\partial(\sum_{a}\sum_{b}w_{a,b}^{l+1} \sigma(z_{x'-a, y'-b}^l) + b_{x',y'}^{l+1})}{\partial z_{x,y}^l} \end{aligned}$$

First term is replaced by definition of error, while second has become large because we put it here expression on z_x′, y′^l + 1. However, we do not have to fear of this big monster – all components of sums equal 0, except these ones that are indexed: x = x′ − a and y = y′ − b. So:

$$\sum_{x'} \sum_{y'} \delta_{x',y'}^{l+1} \frac{\partial(\sum_{a}\sum_{b}w_{a,b}^{l+1} \sigma(z_{x'-a, y'-b}^l) + b_{x',y'}^{l+1})} {\partial z_{x,y}^l} = \sum_{x'} \sum_{y'} \delta_{x',y'}^{l+1} w_{a,b}^{l+1} \sigma'(z_{x,y}^l)$$

If "math;`x=x'-a` and y = y′ − b then it is obvious that a = x′ − x and b = y′ − y so we can reformulate above equation to:

∑_x′∑_y′δ_x′, y′^l + 1w_a, b^l + 1σ′(z_x, y^l) = ∑_x′∑_y′δ_x′, y′^l + 1w_{x′ − x, y′ − y}^l + 1σ′(z_x, y^l)

OK, our last equation is just …

∑_x′∑_y′δ_x′, y′^l + 1w_{x′ − x, y′ − y}^l + 1σ′(z_x, y^l) = δ^l + 1 * w_{− x, − y}^l + 1σ′(z_x, y^l)

Where is the rotation of weights? Actually ROT180(w_x, y^l + 1) = w_{− x, − y}^l + 1.

So the answer on question Hello, when computing the gradients CNN, the weights need to be rotated, Why ?¹⁴ is simple: the rotation of the weights just results from derivation of delta error in Convolution Neural Network.

OK, we are really close to the end. One more ingredient of backpropagation algorithm is update of weights $\frac{\partial C}{\partial w_{a,b}^l}$:

$$\begin{aligned} \frac{\partial C}{\partial w_{a,b}^l} & = \sum_{x} \sum_{y} \frac{\partial C}{\partial z_{x,y}^l}\frac{\partial z_{x,y}^l}{\partial w_{a,b}^l} \\\ & = \sum_{x}\sum_{y}\delta_{x,y}^l \frac{\partial(\sum_{a'}\sum_{b'}w_{a',b'}^l\sigma(z_{x-a', y-b'}^l) + b_{x,y}^l)}{\partial w_{a,b}^l} \\\ & =\sum_{x}\sum_{y} \delta_{x,y}^l \sigma(z_{x-a,y-b}^{l-1}) \\\ & = \delta_{a,b}^l * \sigma(z_{-a,-b}^{l-1}) \\\ & =\delta_{a,b}^l * \sigma(ROT180(z_{a,b}^{l-1})) \end{aligned}$$

So paraphrasing the backpropagation algorithm¹⁵ for CNN:

Input x: set the corresponding activation a¹ for the input layer.
Feedforward: for each l = 2, 3, ⋯, L, compute z_x, y^l = w^l * σ(z_x, y^l − 1) + b_x, y^l and a_x, y^l = σ(z_x, y^l)
Output error δ^L: Compute the vector δ^L = ∇_aC ⊙ σ′(z^L)
Backpropagate the error: For each l = L − 1, L − 2, ⋯, 2, compute δ_x, y^l = δ^l + 1 * ROT180(w_x, y^l + 1)σ′(z_x, y^l)
Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w_{a,b}^l} =\delta_{a,b}^l * \sigma(ROT180(z_{a,b}^{l-1}))$

Visualizing Features

It's been shown many times that convolutional neural nets are very good at recognizing patterns in order to classify images. But what patterns are they actually looking for?

I attempted to recreate the techniques described in¹⁶ to project features in the convnet back to pixel space.

In order to do this, we first need to define and train a convolutional network. Due to lack of training power, I couldn't train on ImageNet and had to use CIFAR-10, a dataset of 32x32 images in 10 classes. The network structure was pretty standard: two convolutional layers, each with 2x2 max pooling and a reLu gate, followed by a fully-connected layer and a softmax classifier.

We're only trying to visualize the features in the convolutional layers, so we can effectively ignore the fully-connected and softmax layers.

Features in a convolutional network are simply numbers that represent how present a certain pattern is. The intuition behind displaying these features is pretty simple: we input one image, and retrieve the matrix of features. We set every feature to 0 except one, and pass it backwards through the network until reaching the pixel layer. The challenge here lies in how to effectively pass data backwards through a convolutional network.

We can approach this problem step-by-step. There are three main portions to a convolutional layer. The actual convolution, some max-pooling, and a nonlinearity (in our case, a rectified linear unit). If we can figure out how to calculate the inputs to these units given their outputs, we can pass any feature back to the pixel input.

image from¹⁷.

Here, the paper introduces a structure called a deconvolutional layer. However, in practice, this is simply a regular convolutional layer with its filters transposed. By applying these transposed filters to the output of a convolutional layer, the input can be retrieved.

A max-pool gate cannot be reversed on its own, as data about the non-maximum features is lost. The paper describes a method in which the positions of each maximum is recorded and saved during forward propagation, and when features are passed backwards, they are placed where the maximums had originated from. In my recreation, I took an even simpler route and just set the whole 2x2 square equal to the maximum activation.

Finally, the rectified linear unit. It's the easiest one to reverse, we just need to pass the data through a reLu again when moving backwards.

To test these techniques out, I trained a standard convolutional network on CIFAR-10. First, I passed one of the training images, a dog, through the network and recorded the various features.

our favorite dog.

first convolutional layer (features 1-32).

As you can see, there are quite a variety of patterns the network is looking for. You can see evidence of the original dog picture in these feature activations, most prominently the arms.

Now, let's see how these features change when different images are passed through.

first convolutional layer (feature #7).

This image shows all the different pixel representations of the activations of feature #7, when a variety of images are used. It's clear that this feature activates when green is present. You can really see the original picture in this feature, since it probably just captures the overall color green rather than some specific pattern.

Finally, to gain some intuition of how images activated each feature, I passed in a whole batch of images and saved the maximum activations.

maximum activations for 32 features.

Which features were activated by which images? There's some interesting stuff going on here. Some of the features are activated simply by the presence of a certain color. The green frog and red car probably contained the most of their respective colors in the batch of images.

two activations from the above image.

However, here are two features which are activated the most by a red frog image. The feature activations show an outline, but one is in red and the other is in blue. Most likely, this feature isn't getting activated by the frog itself, but by the black background. Visualizing the features of a convolutional network allows us to see such details.

So, what happens if we go farther, and look at the second convolutional layer?

second convolutional layer (64 features).

I took the feature activations for the dog again, this time on the second convolutional layer. Already some differences can be spotted. The presence of the original image here is much harder to see.

It's a good sign that all the features are activated in different places. Ideally, we want features to have minimal correlation with one another.

Finally, let's examine how a second layer feature activates when various images are passed in.

second convolutional layer (feature #9).

For the majority of these images, feature #9 activated at dark locations of the original image. However, there are still outliers to this, so there is probably more to this feature than that.

For most features, it's a lot harder to tell what part of the image activated it, since second layer features are made of any linear combination of first layer features. I'm sure that if the network was trained on a higher resolution image set, these features would become more apparent.

Code

cnn_visualization.py

http://caffe.berkeleyvision.org/↩
http://cs224d.stanford.edu/↩
https://www.udacity.com/course/deep-learning--ud730 ↩
http://www.deeplearningbook.org/↩
https://plus.google.com/+YoshuaBengio/posts ↩
https://www.quora.com/session/Andrew-Ng/1 ↩
https://www.facebook.com/yann.lecun?fref=ts ↩
https://plus.google.com/111541909707081118542/posts/P8bZBNpg84Z ↩
http://www.johnloomis.org/ece563/notes/filter/conv/convolution.html ↩
http://neuralnetworksanddeeplearning.com/chap2.html ↩
https://en.wikipedia.org/wiki/Rectifier_(neural_networks)↩
https://grzegorzgwardys.wordpress.com/2016/04/22/8/#unique-identifier ↩
https://plus.google.com/111541909707081118542/posts/P8bZBNpg84Z ↩
https://plus.google.com/111541909707081118542/posts/P8bZBNpg84Z ↩
http://neuralnetworksanddeeplearning.com/chap2.html#the_backpropagation_algorithm ↩
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer International Publishing, 2014.↩
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer International Publishing, 2014.↩

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CNN.rst

CNN.rst

Convolution Neural Networks

Introduction

Back Propagation

Visualizing Features

Code

Files

CNN.rst

Latest commit

History

CNN.rst

File metadata and controls

Convolution Neural Networks

Introduction

Back Propagation

Visualizing Features

Code