# Convolutional Neural Networks

The main part of the chapter is an introduction to one of the most widely used types of deep network: deep convolutional networks. We'll work through a detailed example - code and all - of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set:

<table style="width:100%">
  <tr>
    <th><img src="photos/digits.png" alt="Drawing" style="width:400px;"/></th>
  </tr>
</table>

We'll start our account of convolutional networks with the shallow networks used to attack this problem earlier in the book. Through many iterations we'll build up more and more powerful networks. As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others. The result will be a system that offers near-human performance. Of the 10,000 MNIST test images - images not seen during training! - our system will classify 9,967 correctly. Here's a peek at the 33 images which are misclassified. Note that the correct classification is in the top right; our program's classification is in the bottom right:

<table style="width:100%">
  <tr>
    <th><img src="photos/ensemble_errors.png" alt="Drawing" style="width:400px;"/></th>
  </tr>
</table>

Many of these are tough even for a human to classify. Consider, for example, the third image in the top row. To me it looks more like a "9" than an "8", which is the official classification. Our network also thinks it's a "9". This kind of "error" is at the very least understandable, and perhaps even commendable. We conclude our discussion of image recognition with a survey of some of the spectacular recent progress using networks (particularly convolutional nets) to do image recognition.

### Introducing convolutional networks

In earlier chapters, we taught our neural networks to do a pretty good job recognizing images of handwritten digits.

We did this using networks in which adjacent network layers are fully connected to one another. That is, every neuron in the network is connected to every neuron in adjacent layers:

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz41.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

In particular, for each pixel in the input image, we encoded the pixel's intensity as the value for a corresponding neuron in the input layer. For the 28×28 pixel images we've been using, this means our network has 784 (=28×28) input neurons. We then trained the network's weights and biases so that the network's output would - we hope! - correctly identify the input image: '0', '1', '2', ..., '8', or '9'.

Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling. Let's look at each of these ideas in turn.

**Local receptive fields:** In the fully-connected layers shown earlier, the inputs were depicted as a vertical line of neurons. In a convolutional net, it'll help to think instead of the inputs as a 28×28 square of neurons, whose values correspond to the 28×28 pixel intensities we're using as inputs:

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz42.png" alt="Drawing" style="width:300px;"/></th>
  </tr>
</table>

As per usual, we'll connect the input pixels to a layer of hidden neurons. But we won't connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image.

To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a 5×5 region, corresponding to 25 input pixels. So, for a particular hidden neuron, we might have connections that look like this:

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz43.png" alt="Drawing" style="width:300px;"/></th>
  </tr>
</table>

That region in the input image is called the local receptive field for the hidden neuron. It's a little window on the input pixels. Each connection learns a weight. And the hidden neuron learns an overall bias as well. You can think of that particular hidden neuron as learning to analyze its particular local receptive field.

We then slide the local receptive field across the entire input image. For each local receptive field, there is a different hidden neuron in the first hidden layer. To illustrate this concretely, let's start with a local receptive field in the top-left corner: 

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz44.png" alt="Drawing" style="width:300px;"/></th>
  </tr>
</table>

Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz45.png" alt="Drawing" style="width:300px;"/></th>
  </tr>
</table>

And so on, building up the first hidden layer. Note that if we have a 28×28 input image, and 5×5 local receptive fields, then there will be 24×24 neurons in the hidden layer. This is because we can only move the local receptive field 23 neurons across (or 23 neurons down), before colliding with the right-hand side (or bottom) of the input image.

I've shown the local receptive field being moved by one pixel at a time. In fact, sometimes a different stride length is used. For instance, we might move the local receptive field 2 pixels to the right (or down), in which case we'd say a stride length of 2 is used. In this chapter we'll mostly stick with stride length 1, but it's worth knowing that people sometimes experiment with different stride lengths.
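The size of the hidden layer follows directly from the sliding procedure described above. Here is a minimal sketch of that arithmetic; the function name `hidden_layer_size` is made up for illustration, and it assumes the receptive field must fit entirely inside the input (no padding).

```python
def hidden_layer_size(input_size, field_size, stride=1):
    """Number of positions a field_size window can occupy along one axis."""
    if field_size > input_size:
        raise ValueError("receptive field is larger than the input")
    # The window starts flush with one edge and can travel
    # (input_size - field_size) pixels in total before hitting the other edge.
    # With a given stride, that allows (input_size - field_size) // stride moves,
    # plus one for the starting position.
    return (input_size - field_size) // stride + 1


print(hidden_layer_size(28, 5))            # 24
print(hidden_layer_size(28, 5, stride=2))  # 12
```

With stride length 1 this reproduces the 24×24 hidden layer: 23 moves across plus the starting position.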

**Shared weights and biases:** I've said that each hidden neuron has a bias and 5×5 weights connected to its local receptive field. What I did not yet mention is that we're going to use the same weights and bias for each of the 24×24 hidden neurons. In other words, for the $j,k$th hidden neuron, the output is:

\begin{eqnarray} 
  \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4  w_{l,m} a_{j+l, k+m} \right).
\tag{125}\end{eqnarray}

Here, $\sigma$ is the neural activation function - perhaps the sigmoid function we used in earlier chapters. $b$ is the shared value for the bias. $w_{l,m}$ is a 5×5 array of shared weights. And, finally, we use $a_{x,y}$ to denote the input activation at position $x,y$.
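To make the formula above concrete, here is a small NumPy sketch that evaluates it at every position $j,k$. It is only an illustration: the function name `feature_map`, the random stand-in image, and the random weights are all assumptions, and the explicit loops favour clarity over speed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def feature_map(a, w, b):
    """Apply the shared weights w and shared bias b to every local receptive field of a.

    Assumes a square input a and a square weight array w.
    """
    f = w.shape[0]            # receptive field size, e.g. 5
    n = a.shape[0] - f + 1    # hidden layer size, e.g. 28 - 5 + 1 = 24
    out = np.empty((n, n))
    for j in range(n):
        for k in range(n):
            field = a[j:j + f, k:k + f]                  # a_{j+l, k+m} for l, m = 0..f-1
            out[j, k] = sigmoid(b + np.sum(w * field))   # same w and b at every position
    return out


rng = np.random.default_rng(0)
image = rng.random((28, 28))            # stand-in for an MNIST image
weights = rng.standard_normal((5, 5))   # hypothetical shared weights
bias = 0.0                              # hypothetical shared bias

hidden = feature_map(image, weights, bias)
print(hidden.shape)  # (24, 24)
```

Note that only 26 numbers (25 weights and one bias) determine all 576 hidden activations.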

This means that all the neurons in the first hidden layer detect exactly the same feature, just at different locations in the input image. (I haven't precisely defined the notion of a feature. Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape.) To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field. That ability is also likely to be useful at other places in the image. And so it is useful to apply the same feature detector everywhere in the image. To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat.

For this reason, we sometimes call the map from the input layer to the hidden layer a **feature map**. We call the weights defining the feature map the shared weights. And we call the bias defining the feature map in this way the shared bias. The shared weights and bias are often said to define a kernel or filter. In the literature, people sometimes use these terms in slightly different ways, and for that reason I'm not going to be more precise; rather, in a moment, we'll look at some concrete examples.

The network structure I've described so far can detect just a single kind of localized feature. To do image recognition we'll need more than one feature map. And so a complete convolutional layer consists of several different feature maps:

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz46.png" alt="Drawing" style="width:300px;"/></th>
  </tr>
</table>

 In the example shown, there are 3 feature maps. Each feature map is defined by a set of 5×5 shared weights, and a single shared bias. The result is that the network can detect 3 different kinds of features, with each feature being detectable across the entire image.

I've shown just 3 feature maps, to keep the diagram above simple. However, in practice convolutional networks may use more (and perhaps many more) feature maps. One of the early convolutional networks, LeNet-5, used 6 feature maps, each associated to a 5×5 local receptive field, to recognize MNIST digits. So the example illustrated above is actually pretty close to LeNet-5. In the examples we develop later in the chapter we'll use convolutional layers with 20 and 40 feature maps. Let's take a quick peek at some of the features which are learned.




<table style="width:100%">
  <tr>
    <th><img src="photos/net_full_layer_0.png" alt="Drawing" style="width:300px;"/></th>
  </tr>
</table>

The 20 images correspond to 20 different feature maps (or filters, or kernels). Each map is represented as a 5×5 block image, corresponding to the 5×5 weights in the local receptive field. Whiter blocks mean a smaller (typically, more negative) weight, so the feature map responds less to corresponding input pixels. Darker blocks mean a larger weight, so the feature map responds more to the corresponding input pixels. Very roughly speaking, the images above show the type of features the convolutional layer responds to.

A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network. For each feature map we need 25=5×5 shared weights, plus a single shared bias. So each feature map requires 26 parameters. If we have 20 feature maps that's a total of 20×26=520 parameters defining the convolutional layer. By comparison, suppose we had a fully connected first layer, with 784=28×28 input neurons, and a relatively modest 30 hidden neurons, as we used in many of the examples earlier in the book. That's a total of 784×30 weights, plus an extra 30 biases, for a total of 23,550 parameters. In other words, the fully-connected layer would have more than 40 times as many parameters as the convolutional layer.
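The parameter counts in the comparison above are easy to check. The two helper functions below are hypothetical names introduced only for this sketch.

```python
def conv_layer_params(num_feature_maps, field_size):
    # Each feature map has field_size x field_size shared weights plus one shared bias.
    return num_feature_maps * (field_size * field_size + 1)


def fully_connected_params(num_inputs, num_outputs):
    # One weight per (input, output) pair, plus one bias per output neuron.
    return num_inputs * num_outputs + num_outputs


conv = conv_layer_params(20, 5)
full = fully_connected_params(28 * 28, 30)

print(conv)                    # 520
print(full)                    # 23550
print(round(full / conv, 1))   # 45.3
```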

Of course, we can't really do a direct comparison between the number of parameters, since the two models are different in essential ways. But, intuitively, it seems likely that the use of translation invariance by the convolutional layer will reduce the number of parameters it needs to get the same performance as the fully-connected model. That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.

Incidentally, the name convolutional comes from the fact that the operation in Equation (125) is sometimes known as a convolution. A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation. We're not going to make any deep use of the mathematics of convolutions, so you don't need to worry too much about this connection. But it's worth at least knowing where the name comes from.

**Pooling layers:** In addition to the convolutional layers just described, convolutional neural networks also contain pooling layers. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.

In detail, a pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of (say) 2×2 neurons in the previous layer. As a concrete example, one common procedure for pooling is known as max-pooling. In max-pooling, a pooling unit simply outputs the maximum activation in the 2×2 input region, as illustrated in the following diagram:

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz47.png" alt="Drawing" style="width:300px;"/></th>
  </tr>
</table>

Note that since we have 24×24 neurons output from the convolutional layer, after pooling we have 12×12 neurons.
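As a rough illustration of max-pooling over non-overlapping 2×2 regions, here is a NumPy sketch. The function name `max_pool` is an assumption, and the sketch requires the feature map's sides to be divisible by the pooling size.

```python
import numpy as np


def max_pool(feature_map, size=2):
    """Maximum activation in each non-overlapping size x size region."""
    rows, cols = feature_map.shape
    if rows % size or cols % size:
        raise ValueError("feature map sides must be divisible by the pooling size")
    # Split each axis into (block index, position within block),
    # then take the maximum over the two within-block axes.
    blocks = feature_map.reshape(rows // size, size, cols // size, size)
    return blocks.max(axis=(1, 3))


example = np.arange(16).reshape(4, 4)
print(max_pool(example).tolist())  # [[5, 7], [13, 15]]

conv_output = np.zeros((24, 24))
print(max_pool(conv_output).shape)  # (12, 12)
```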

As mentioned above, the convolutional layer usually involves more than a single feature map. We apply max-pooling to each feature map separately. So if there were three feature maps, the combined convolutional and max-pooling layers would look like:

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz48.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been found, its exact location isn't as important as its rough location relative to other features. A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.

Max-pooling isn't the only technique used for pooling. Another common approach is known as **L2 pooling**. Here, instead of taking the maximum activation of a 2×2 region of neurons, we take the square root of the sum of the squares of the activations in the 2×2 region. While the details are different, the intuition is similar to max-pooling: L2 pooling is a way of condensing information from the convolutional layer. In practice, both techniques have been widely used. And sometimes people use other types of pooling operation. If you're really trying to optimize performance, you may use validation data to compare several different approaches to pooling, and choose the approach which works best. But we're not going to worry about that kind of detailed optimization.
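L2 pooling differs from max-pooling only in how each 2×2 region is summarized. A minimal sketch, with the hypothetical name `l2_pool` and the same divisibility assumption as before:

```python
import numpy as np


def l2_pool(feature_map, size=2):
    """Square root of the sum of squared activations in each size x size region."""
    rows, cols = feature_map.shape
    if rows % size or cols % size:
        raise ValueError("feature map sides must be divisible by the pooling size")
    blocks = feature_map.reshape(rows // size, size, cols // size, size)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))


region = np.array([[3.0, 4.0],
                   [0.0, 0.0]])
print(l2_pool(region).tolist())  # [[5.0]]
```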

**Putting it all together:** We can now put all these ideas together to form a complete convolutional neural network. It's similar to the architecture we were just looking at, but has the addition of a layer of 10 output neurons, corresponding to the 10 possible values for MNIST digits ('0', '1', '2', etc.):

<table style="width:100%">
  <tr>
    <th><img src="photos/tikz49.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

The network begins with 28×28 input neurons, which are used to encode the pixel intensities for the MNIST image. This is then followed by a convolutional layer using a 5×5 local receptive field and 3 feature maps. The result is a layer of 3×24×24 hidden feature neurons. The next step is a max-pooling layer, applied to 2×2 regions, across each of the 3 feature maps. The result is a layer of 3×12×12 hidden feature neurons.

The final layer of connections in the network is a fully-connected layer. That is, this layer connects every neuron from the max-pooled layer to every one of the 10 output neurons. This fully-connected architecture is the same as we used in earlier chapters. Note, however, that in the diagram above, I've used a single arrow, for simplicity, rather than showing all the connections. Of course, you can easily imagine the connections.
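To see how the layer shapes fit together, here is a sketch of a single forward pass through the architecture just described, using random (untrained) parameters. Everything here is an assumption made for illustration: the variable names, the random stand-in image, and the choice of sigmoid neurons in both the convolutional and output layers.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
image = rng.random((28, 28))               # stand-in for one MNIST image

# Convolutional layer: 3 feature maps, each with 5x5 shared weights and one shared bias.
conv_w = rng.standard_normal((3, 5, 5))
conv_b = rng.standard_normal(3)
conv_out = np.empty((3, 24, 24))
for f in range(3):
    for j in range(24):
        for k in range(24):
            field = image[j:j + 5, k:k + 5]
            conv_out[f, j, k] = sigmoid(conv_b[f] + np.sum(conv_w[f] * field))

# Max-pooling over 2x2 regions, applied to each feature map separately.
pooled = conv_out.reshape(3, 12, 2, 12, 2).max(axis=(2, 4))

# Fully-connected layer: every pooled neuron connects to every one of the 10 outputs.
fc_w = rng.standard_normal((10, 3 * 12 * 12))
fc_b = rng.standard_normal(10)
output = sigmoid(fc_w @ pooled.ravel() + fc_b)

print(conv_out.shape, pooled.shape, output.shape)  # (3, 24, 24) (3, 12, 12) (10,)
```

In this sketch the fully-connected layer holds 432×10 weights and 10 biases, far more than the 78 parameters of the convolutional layer.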

This convolutional architecture is quite different to the architectures used in earlier chapters. But the overall picture is similar: a network made of many simple units, whose behaviors are determined by their weights and biases. And the overall goal is still the same: to use training data to train the network's weights and biases so that the network does a good job classifying input digits.

## Why are deep neural networks hard to train?

Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:



<table style="width:100%">
  <tr>
    <th><img src="photos/dl.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

You're dumbfounded, and tell your boss: "The customer is crazy!"

Your boss replies: "I think they're crazy, too. But what the customer wants, they get."

In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you AND together as many inputs as you want. And you're also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.

But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:

<table style="width:100%">
  <tr>
    <th><img src="photos/dl2.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the sub-tasks down into smaller units than I've described. But you get the general idea.

So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous series of papers in the early 1980s showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.
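The pairwise scheme for parity described above can be sketched in a few lines. This assumes two-input XOR gates as the building block (the text doesn't fix a gate set), and the function name `parity_tree` is made up for the example. It returns the parity along with the depth and gate count of the resulting circuit.

```python
def parity_tree(bits):
    """Compute parity by XOR-ing pairs, then pairs of pairs, and so on.

    Returns (parity, depth, gates), where depth is the number of layers
    and gates is the total number of two-input XOR gates used.
    """
    if not bits:
        raise ValueError("need at least one bit")
    level = list(bits)
    depth = 0
    gates = 0
    while len(level) > 1:
        # One XOR gate per adjacent pair in the current layer.
        next_level = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        gates += len(next_level)
        if len(level) % 2:
            next_level.append(level[-1])   # an unpaired bit passes through unchanged
        level = next_level
        depth += 1
    return level[0], depth, gates


print(parity_tree([1, 0, 1, 1, 0, 1, 0, 0]))  # (0, 3, 7)
```

For $n$ bits this uses $n-1$ gates arranged in roughly $\log_2 n$ layers, so the circuit stays small as long as it is allowed to grow deeper.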

Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers):



<table style="width:100%">
  <tr>
    <th><img src="photos/dl3.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be more powerful:



<table style="width:100%">
  <tr>
    <th><img src="photos/dl4.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks.

For certain problems and network architectures this is proved in *On the number of response regions of deep feed forward networks with piece-wise linear activations*, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of *Learning deep architectures for AI*, by Yoshua Bengio (2009).

How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm - stochastic gradient descent by backpropagation. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep learning to attack image recognition problems.

## The vanishing gradient problem

So, what goes wrong when we try to train a deep network?

To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation.

From a Python shell, we first load the MNIST data and set up our network:

    import mnist_loader
    import network2
    training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
    net = network2.Network([784, 30, 10])

This network has 784 neurons in the input layer, corresponding to the 28×28=784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', … , '9').

Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η=0.1, and regularization parameter λ=5.0. As we train we'll monitor the classification accuracy on the validation_data:

    net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
            evaluation_data=validation_data, monitor_evaluation_accuracy=True)

We get a classification accuracy of 96.48 percent (or thereabouts - it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.

Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

    net = network2.Network([784, 30, 30, 10])
    net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
            evaluation_data=validation_data, monitor_evaluation_accuracy=True)

This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:

    net = network2.Network([784, 30, 30, 30, 10])
    net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
            evaluation_data=validation_data, monitor_evaluation_accuracy=True)
    
That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

    net = network2.Network([784, 30, 30, 30, 30, 10])
    net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
            evaluation_data=validation_data, monitor_evaluation_accuracy=True)

The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.

This behavior seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing.

So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm, and how to do better.

To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a [784,30,30,10] network, i.e., a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are:

<table style="width:100%">
  <tr>
    <th><img src="photos/dl5.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?

To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as $\delta^l_j = \partial C / \partial b^l_j$, i.e., the gradient for the $j$th neuron in the $l$th layer. We can think of the gradient $\delta^1$ as a vector whose entries determine how quickly the first hidden layer learns, and $\delta^2$ as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length $\|\delta^1\|$ measures the speed at which the first hidden layer is learning, while the length $\|\delta^2\|$ measures the speed at which the second hidden layer is learning.

With these definitions, and in the same configuration as was plotted above, we find $\|\delta^1\| = 0.07\ldots$ and $\|\delta^2\| = 0.31\ldots$. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.

What happens if we add more hidden layers? If we have three hidden layers, in a [784,30,30,30,10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.
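If you'd like to experiment with this kind of measurement yourself, here is a self-contained NumPy sketch that computes the length of the bias gradient in each hidden layer of a freshly initialized network. It is not the code used to produce the numbers above, and it makes several assumptions: a random vector stands in for an MNIST image, the cost is the cross-entropy with sigmoid neurons, weights are Gaussian scaled by $1/\sqrt{n_{\rm in}}$, and the gradient is taken for a single training example. The function names are made up for the example, so expect different values from those quoted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_network(sizes, rng):
    # Assumed initialization: Gaussian weights scaled by 1/sqrt(n_in), Gaussian biases.
    weights = [rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
    return weights, biases


def bias_gradients(weights, biases, x, y):
    """Backpropagate one example; return dC/db for each layer, first to last."""
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w @ activations[-1] + b))
    # Output error for sigmoid neurons with the cross-entropy cost.
    delta = activations[-1] - y
    deltas = [delta]
    # Move backward through the hidden layers.
    for l in range(len(weights) - 1, 0, -1):
        a = activations[l]
        delta = (weights[l].T @ delta) * a * (1 - a)   # sigma'(z) = a (1 - a)
        deltas.append(delta)
    return deltas[::-1]


rng = np.random.default_rng(0)
sizes = [784, 30, 30, 30, 10]
weights, biases = init_network(sizes, rng)
x = rng.random(784)            # stand-in for an MNIST image
y = np.zeros(10)
y[3] = 1.0                     # arbitrary one-hot target

grads = bias_gradients(weights, biases, x, y)
for layer, g in enumerate(grads[:-1], start=1):
    print(f"hidden layer {layer}: length of gradient = {np.linalg.norm(g):.4f}")
```

Changing `sizes` to add or remove hidden layers lets you repeat the comparison for other depths.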

We've been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let's return to look at the network with just two hidden layers. The speed of learning changes as follows:



<table style="width:100%">
  <tr>
    <th><img src="photos/dl6.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different than the way we usually train - I've used no mini-batches, and just 1,000 training images, rather than the full 50,000 image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.

In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.

What about more complex networks? Here are the results of a similar experiment, but this time with three hidden layers (a [784,30,30,30,10] network):



<table style="width:100%">
  <tr>
    <th><img src="photos/dl7.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a [784,30,30,30,30,10] network), and see what happens when we train:

<table style="width:100%">
  <tr>
    <th><img src="photos/dl8.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!

We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the **vanishing gradient problem**.

Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.

One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative f′(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?

Of course, this isn't the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784,30,30,30,10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem.

## What's causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer. Here's a network with three hidden layers:

<table style="width:100%">
  <tr>
    <th><img src="photos/dl9.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

Here, $w_1,w_2,…$ are the weights, $b_1,b_2,…$ are the biases, and C is some cost function. Just to remind you how this works, the output $a_j$ from the jth neuron is $σ(z_j)$, where σ is the usual sigmoid activation function, and $z_j=w_ja_{j-1}+b_j$ is the weighted input to the neuron. I've drawn the cost C at the end to emphasize that the cost is a function of the network's output, $a_4$: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.

We're going to study the gradient $∂C/∂b_1$ associated to the first hidden neuron. We'll figure out an expression for $∂C/∂b_1$, and by studying that expression we'll understand why the vanishing gradient problem occurs.

I'll start by simply showing you the expression for $∂C/∂b_1$. It looks forbidding, but it's actually got a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that σ′ is just the derivative of the σ function):



<table style="width:100%">
  <tr>
    <th><img src="photos/dl10.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

The structure in the expression is as follows: there is a $σ′(z_j)$ term in the product for each neuron in the network; a weight $w_j$ term for each weight in the network; and a final $∂C/∂a_4$ term, corresponding to the cost function at the end. Notice that I've placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.

You're welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There's no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there's also a simple explanation of why the expression is true, and so it's fun (and perhaps enlightening) to take a look at that explanation.

Imagine we make a small change $Δb_1$ in the bias $b_1$. That will set off a cascading series of changes in the rest of the network. First, it causes a change $Δa_1$ in the output from the first hidden neuron. That, in turn, will cause a change $Δz_2$ in the weighted input to the second hidden neuron. Then a change $Δa_2$ in the output from the second hidden neuron. And so on, all the way through to a change ΔC in the cost at the output. We have


$$\frac{\partial C}{\partial b_1}≈\frac{\Delta C}{\Delta b_1}.$$

This suggests that we can figure out an expression for the gradient $∂C/∂b_1$ by carefully tracking the effect of each step in this cascade.

To do this, let's think about how $Δb_1$ causes the output $a_1$ from the first hidden neuron to change. We have $a_1=σ(z_1)=σ(w_1a_0+b_1)$, so

$$\Delta a_1≈\frac{∂σ(w_1a_0+b_1)}{∂b_1}\Delta b_1$$
$$=σ′(z_1)\Delta b_1.$$

That $σ′(z_1)$ term should look familiar: it's the first term in our claimed expression for the gradient $∂C/∂b_1$. Intuitively, this term converts a change $Δb_1$ in the bias into a change $Δa_1$ in the output activation. That change $Δa_1$ in turn causes a change in the weighted input $z_2=w_2a_1+b_2$ to the second hidden neuron:

$$\Delta z_2≈\frac{∂z_2}{∂a_1}Δa_1$$
$$=w_2Δa_1.$$

Combining our expressions for $Δz_2$ and $Δa_1$, we see how the change in the bias $b_1$ propagates along the network to affect $z_2$:

$$Δz_2≈σ′(z_1)w_2Δb_1.$$

Again, that should look familiar: we've now got the first two terms in our claimed expression for the gradient $∂C/∂b_1$. We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a $σ′(z_j)$ term, and through each weight we pick up a $w_j$ term. The end result is an expression relating the final change ΔC in cost to the initial change $Δb_1$ in the bias:

$$ ΔC≈σ′(z_1)w_2σ′(z_2)…σ′(z_4)\frac{∂C}{∂a_4}Δb_1.$$

Dividing by $Δb_1$ we do indeed get the desired expression for the gradient:

$$\frac{∂C}{∂b_1}=σ′(z_1)w_2σ′(z_2)…σ′(z_4)\frac{∂C}{∂a_4}.$$
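The cascade argument can be checked numerically. The following is a minimal sketch, not code from the book: it builds the four-neuron chain, evaluates $∂C/∂b_1$ from the product formula, and compares it with a finite-difference estimate of $ΔC/Δb_1$. The quadratic cost, the input $a_0$, the target `y`, and the particular weights and biases are all assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(a0, weights, biases):
    """Return the weighted inputs z_j and the activations a_0..a_4 of the chain."""
    zs, activations = [], [a0]
    for w, b in zip(weights, biases):
        z = w * activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations

def cost(a_out, y):
    # Quadratic cost: an assumption made only for this illustration.
    return 0.5 * (a_out - y) ** 2

def grad_b1_product(a0, weights, biases, y):
    """dC/db_1 as sigma'(z_1) w_2 sigma'(z_2) ... sigma'(z_4) dC/da_4."""
    zs, activations = forward(a0, weights, biases)
    grad = sigmoid_prime(zs[0])
    for w, z in zip(weights[1:], zs[1:]):
        grad *= w * sigmoid_prime(z)
    return grad * (activations[-1] - y)  # dC/da_4 for the quadratic cost

def grad_b1_numeric(a0, weights, biases, y, eps=1e-6):
    """Central finite-difference estimate of dC/db_1."""
    plus = [biases[0] + eps] + list(biases[1:])
    minus = [biases[0] - eps] + list(biases[1:])
    c_plus = cost(forward(a0, weights, plus)[1][-1], y)
    c_minus = cost(forward(a0, weights, minus)[1][-1], y)
    return (c_plus - c_minus) / (2 * eps)

# Hypothetical parameter values, chosen arbitrarily.
a0, y = 0.5, 1.0
weights = [0.8, -1.2, 0.5, 1.5]
biases = [0.1, -0.3, 0.2, 0.0]

print("product formula  :", grad_b1_product(a0, weights, biases, y))
print("finite difference:", grad_b1_numeric(a0, weights, biases, y))
```

The two printed values should agree to many decimal places, which is what the approximation $∂C/∂b_1≈ΔC/Δb_1$ predicts for a small $Δb_1$.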

### Why the vanishing gradient problem occurs: 
To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:

$$\frac{∂C}{∂b_1}=σ′(z_1)w_2σ′(z_2)w_3σ′(z_3)w_4σ′(z_4)\frac{∂C}{∂a_4}.$$

Excepting the very last term, this expression is a product of terms of the form $w_jσ′(z_j)$. To understand how each of those terms behaves, let's look at a plot of the function σ′:







<table style="width:100%">
  <tr>
    <th><img src="photos/dl11.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

The derivative reaches a maximum at σ′(0)=1/4. Now, if we use our standard approach to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy $|w_j|<1$. Putting these observations together, we see that the terms $w_jσ′(z_j)$ will usually satisfy $|w_jσ′(z_j)|<1/4$. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
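A quick simulation makes the word "usually" concrete. This is only a sketch: it draws weights from a Gaussian with mean 0 and standard deviation 1, as in the text, and counts how often $|w_j|<1$ and $|w_jσ′(z_j)|<1/4$. The weighted inputs `zs` are also drawn from a standard Gaussian, which is purely an assumption for illustration; in a real network they depend on the data and the earlier layers.

```python
import math
import random

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

random.seed(0)
n = 100_000

# Standard initialization from the text: Gaussian, mean 0, standard deviation 1.
weights = [random.gauss(0.0, 1.0) for _ in range(n)]
# Assumption for illustration only: weighted inputs are also standard normal.
zs = [random.gauss(0.0, 1.0) for _ in range(n)]

frac_small_w = sum(abs(w) < 1 for w in weights) / n
frac_small_term = sum(
    abs(w * sigmoid_prime(z)) < 0.25 for w, z in zip(weights, zs)
) / n

print("sigma'(0)                    =", sigmoid_prime(0.0))
print("fraction with |w| < 1        =", frac_small_w)
print("fraction with |w sigma'| < 1/4 =", frac_small_term)
```

Roughly two thirds of the sampled weights have $|w_j|<1$. The fraction of terms below $1/4$ is at least as large, because $σ′(z_j)$ is often well below its maximum.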

To make this all a bit more explicit, let's compare the expression for $∂C/∂b_1$ to an expression for the gradient with respect to a later bias, say $∂C/∂b_3$. Of course, we haven't explicitly worked out an expression for $∂C/∂b_3$, but it follows the same pattern described above for $∂C/∂b_1$. Here's the comparison of the two expressions:



<table style="width:100%">
  <tr>
    <th><img src="photos/dl12.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

The two expressions share many terms. But the gradient $∂C/∂b_1$ includes two extra terms each of the form $w_jσ′(z_j)$. As we've seen, such terms are typically less than 1/4 in magnitude. And so the gradient $∂C/∂b_1$ will usually be a factor of 16 (or more) smaller than $∂C/∂b_3$. This is the essential origin of the vanishing gradient problem.
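To spell out where the factor of 16 comes from, here is the ratio of the two gradients written out. It assumes $∂C/∂b_3≠0$, so that the division makes sense, and $|w_2|,|w_3|<1$, as is usual under the standard initialization:

$$\frac{∂C/∂b_1}{∂C/∂b_3}=\frac{σ′(z_1)w_2σ′(z_2)w_3σ′(z_3)w_4σ′(z_4)\frac{∂C}{∂a_4}}{σ′(z_3)w_4σ′(z_4)\frac{∂C}{∂a_4}}=σ′(z_1)w_2σ′(z_2)w_3,$$

$$\left|\frac{∂C/∂b_1}{∂C/∂b_3}\right|≤\frac{1}{4}|w_2|\cdot\frac{1}{4}|w_3|<\frac{1}{16}.$$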

Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights $w_j$ could grow during training. If they do, it's possible the terms $w_jσ′(z_j)$ in the product will no longer satisfy $|w_jσ′(z_j)|<1/4$. Indeed, if the terms get large enough - greater than 1 - then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.

### The exploding gradient problem: 

Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility, they really can happen.

There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say $w_1=w_2=w_3=w_4=100$. Second, we'll choose the biases so that the $σ′(z_j)$ terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is $z_j=0$ (and so $σ′(z_j)=1/4$). So, for instance, we want $z_1=w_1a_0+b_1=0$. We can achieve this by setting $b_1=-100a_0$. We can use the same idea to select the other biases. When we do this, we see that all the terms $w_jσ′(z_j)$ are equal to $100\cdot\frac{1}{4}=25$. With these choices we get an exploding gradient.
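The construction is easy to reproduce. In this sketch the function name `bias_gradient_factors` and the input activation `a0 = 0.3` are hypothetical; the weights of 100 and the choice of biases that forces $z_j=0$ follow the text. The function returns each term $w_jσ′(z_j)$, together with the factor multiplying $∂C/∂a_4$ in each $∂C/∂b_k$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def bias_gradient_factors(w, a0, n_layers=4):
    """Build the single-neuron chain with every weight equal to w and every
    bias chosen so that z_j = 0, then return (terms, factors), where
    terms[j] = w * sigma'(z_j) and factors[k] multiplies dC/da_4 in dC/db_k."""
    a = a0
    primes, terms = [], []
    for _ in range(n_layers):
        b = -w * a            # bias chosen so that z = w*a + b = 0
        z = w * a + b
        primes.append(sigmoid_prime(z))
        terms.append(w * sigmoid_prime(z))
        a = sigmoid(z)        # equals 1/2 because z = 0
    factors = []
    for k in range(n_layers):
        f = primes[k]
        for t in terms[k + 1:]:
            f *= t            # pick up w_j sigma'(z_j) for each later neuron
        factors.append(f)
    return terms, factors

terms, factors = bias_gradient_factors(w=100.0, a0=0.3)
print("w_j sigma'(z_j):", terms)      # [25.0, 25.0, 25.0, 25.0]
print("factors for b_1..b_4:", factors)  # [3906.25, 156.25, 6.25, 0.25]
```

Each step backward through the chain multiplies the gradient by 25, so $∂C/∂b_1$ is $25^3=15625$ times larger than $∂C/∂b_4$.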

### The unstable gradient problem: 
The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.

### The prevalence of the vanishing gradient problem: 

We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression $|wσ′(z)|$. To avoid the vanishing gradient problem we need $|wσ′(z)|≥1$. You might think this could happen easily if w is very large. However, it's more difficult than it looks. The reason is that the σ′(z) term also depends on $w$: $σ′(z)=σ′(wa+b)$, where a is the input activation. So when we make w large, we need to be careful that we're not simultaneously making σ′(wa+b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa+b very large. Looking at the graph of σ′ you can see that this puts us off in the "wings" of the σ′ function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.
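To get a feel for how narrow that range is, here is a small sketch. The function `activation_window_width` is a hypothetical helper, derived here from the identity $σ′(z)=σ(z)(1-σ(z))$ rather than taken from the text. It returns the width of the set of input activations a for which $|wσ′(wa+b)|≥1$. The bias b only shifts that set, so it does not appear in the result.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def activation_window_width(w):
    """Width of the interval of activations a with |w * sigma'(w*a + b)| >= 1.

    With s = sigmoid(z), the condition s(1 - s) >= 1/|w| holds exactly when
    |z| <= log((1 + r) / (1 - r)), where r = sqrt(1 - 4/|w|).
    Dividing the width in z by |w| gives the width in a.
    """
    w = abs(w)
    if w < 4:
        return 0.0  # sigma' never exceeds 1/4, so |w sigma'| stays below 1
    r = math.sqrt(1.0 - 4.0 / w)
    z_max = math.log((1.0 + r) / (1.0 - r))  # edge of the window in z
    return 2.0 * z_max / w

for w in (3, 5, 7, 10, 100, 1000):
    print(f"|w| = {w:5d}   window width in a = {activation_window_width(w):.4f}")
```

The window is empty for $|w|<4$, never gets wider than about half a unit, and shrinks again as $|w|$ grows. Making the weight larger does not make the condition easier to satisfy.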

## Unstable gradients in more complex networks

We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?

<table style="width:100%">
  <tr>
    <th><img src="photos/dl13.png" alt="Drawing" style="width:500px;"/></th>
  </tr>
</table>

In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the l th layer of an L layer network is given by:

$$δ^l=Σ′(z^l)(w^{l+1})^TΣ′(z^{l+1})(w^{l+2})^T…Σ′(z^L)∇_aC$$

Here, $Σ′(z^l)$ is a diagonal matrix whose entries are the σ′(z) values for the weighted inputs to the l th layer. The $w^l$ are the weight matrices for the different layers. And $∇_aC$ is the vector of partial derivatives of C with respect to the output activations.

This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form ${(w^j)}^TΣ′(z^j)$. What's more, the matrices $Σ′(z^j)$ have small entries on the diagonal, none larger than $1/4$. Provided the weight matrices $w^j$ aren't too large, each additional term ${(w^j)}^TΣ′(z^j)$ tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. Empirically, it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.
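The matrix expression can be evaluated directly. The sketch below is an illustration under stated assumptions, not the book's training code: `layer_deltas` is a hypothetical helper, the weights and biases are drawn from a standard Gaussian, the cost is assumed quadratic (so $∇_aC=a^L-y$), and the input is random numbers rather than an MNIST image. Multiplying elementwise by `sigmoid_prime(z)` plays the role of the diagonal matrix $Σ′(z^l)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def layer_deltas(sizes, x, y, rng):
    """Return delta^l for every non-input layer of a randomly initialized net."""
    weights = [rng.standard_normal((n_out, n_in))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

    # Forward pass, remembering each weighted input z^l.
    a, zs = x, []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)

    # Backward pass: delta^L = Sigma'(z^L) grad_a C, and then
    # delta^l = Sigma'(z^l) (w^{l+1})^T delta^{l+1} for earlier layers.
    delta = sigmoid_prime(zs[-1]) * (a - y)  # quadratic cost assumed
    deltas = [delta]
    for w_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = sigmoid_prime(z) * (w_next.T @ delta)
        deltas.append(delta)
    return deltas[::-1]  # ordered from first hidden layer to output layer

rng = np.random.default_rng(0)
sizes = [784, 30, 30, 30, 30, 10]
x = rng.random(784)      # stand-in for an image; not real MNIST data
y = np.eye(10)[3]        # one-hot target for an arbitrary digit
for l, d in enumerate(layer_deltas(sizes, x, y, rng), start=1):
    print(f"layer {l}: ||delta|| = {np.linalg.norm(d):.3e}")
```

The norm of each $δ^l$ gives a rough measure of how quickly that layer would learn. Different seeds and inputs give different profiles, which is itself a small illustration of how unstable the product is.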

## Other obstacles to deep learning

In this chapter we've focused on vanishing gradients - and, more generally, unstable gradients - as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.

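As a first example, in 2010 Glorot and Bengio found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.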

As a second example, in 2013 Sutskever, Martens, Dahl and Hinton studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.

These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research. This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.

