The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:
Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.
The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.
Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,
and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.
In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.
We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.
Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.
So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms: \begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{threshold} \end{array} \right. \tag{1}\end{eqnarray}
That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:
Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of $5$ for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting $1$ whenever the weather is good, and $0$ whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.
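To make this concrete, here's a minimal sketch in Python of that decision rule, with the weights and threshold chosen above (the function name and the $0/1$ encoding of the three factors are illustrative choices, not anything from the text or a library):

def perceptron(inputs, weights, threshold):
    # Output 1 if the weighted sum of the evidence exceeds the threshold, else 0.
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# x1 = good weather, x2 = partner wants to come, x3 = near public transit
weights = [6, 2, 2]
threshold = 5
print(perceptron([1, 0, 0], weights, threshold))  # good weather alone: outputs 1
print(perceptron([0, 1, 1], weights, threshold))  # bad weather: outputs 0 regardless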
By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of $3$. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:
Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.
Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\mbox{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten: \begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right. \tag{2}\end{eqnarray} You can think of the bias as a measure of how easy it is to get the perceptron to output a $1$. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a $1$. But if the bias is very negative, then it's difficult for the perceptron to output a $1$. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.
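In code, the bias form is a natural fit for a dot product. Here's a small sketch of the rule in Equation (2), reusing the festival example with $b = -5$ (numpy is used purely for illustration):

import numpy as np

def perceptron_output(w, x, b):
    # Equation (2): output 1 if w . x + b > 0, else 0.
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([6, 2, 2])  # weights from the festival example
b = -5                   # bias = -threshold
print(perceptron_output(w, np.array([1, 0, 0]), b))  # -> 1
print(perceptron_output(w, np.array([0, 1, 1]), b))  # -> 0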
I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:

Then we see that the input $00$ produces the output $1$, since $(-2) \cdot 0 + (-2) \cdot 0 + 3 = 3$ is positive. Similar calculations show that the inputs $01$ and $10$ also produce the output $1$. But the input $11$ produces the output $0$, since $(-2) \cdot 1 + (-2) \cdot 1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!

The NAND example shows that we can use perceptrons to compute simple logical functions.
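As a quick sanity check, here's a sketch that runs that perceptron (weights $-2$, $-2$, bias $3$) over all four input pairs:

def nand_perceptron(x1, x2):
    # Perceptron with weights -2, -2 and bias 3.
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print("%d NAND %d = %d" % (x1, x2, nand_perceptron(x1, x2)))
# Prints 1, 1, 1, 0 - the NAND truth table.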
In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:
The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.
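To see that universality in action, here's a sketch of the two-bit adder built from nothing but the NAND perceptron above. The wiring is one standard NAND half-adder layout, offered as an illustration rather than a transcription of the exact diagram in the text:

def nand(x1, x2):
    # The NAND perceptron: weights -2, -2, bias 3.
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

def half_adder(x1, x2):
    # Computes the bitwise sum x1 XOR x2 and the carry bit x1 AND x2.
    a = nand(x1, x2)
    total = nand(nand(x1, a), nand(x2, a))  # x1 XOR x2
    carry = nand(a, a)                      # x1 AND x2
    return total, carry

for x1 in (0, 1):
    for x2 in (0, 1):
        total, carry = half_adder(x1, x2)
        print("%d + %d -> sum %d, carry %d" % (x1, x2, total, carry))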
The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!
However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.
Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):
If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.
The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.
We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.
Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function, and is defined by: \begin{eqnarray} \sigma(z) \equiv \frac{1}{1+e^{-z}}. \tag{3}\end{eqnarray}
At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.
To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x+b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x+b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x +b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x+b$ is of modest size that there's much deviation from the perceptron model.
What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:
This shape is a smoothed out version of a step function:
If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w\cdot x+b$ was positive or negative* *Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by \begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}\end{eqnarray} where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
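Here's a small numerical sketch of both claims: that $\sigma$ behaves like a perceptron for large $|z|$, and that the linear approximation in Equation (5) tracks the true change in output for small changes in the weights and bias (the particular numbers are arbitrary illustrations):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(20.0))   # ~1.0: perceptron-like for large positive z
print(sigmoid(-20.0))  # ~0.0: perceptron-like for large negative z

# Compare the true change in output with the linear approximation (5).
w, x, b = np.array([0.4, -0.6]), np.array([0.5, 0.9]), 0.1
dw, db = np.array([0.001, -0.002]), 0.0005
out = sigmoid(np.dot(w, x) + b)
true_change = sigmoid(np.dot(w + dw, x) + b + db) - out
sprime = out * (1.0 - out)  # sigma'(z) = sigma(z) * (1 - sigma(z))
approx_change = np.dot(sprime * x, dw) + sprime * db
print("%.9f %.9f" % (true_change, approx_change))  # nearly identical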
If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly-used in work on neural nets, and is the activation function we'll use most often in this book.
How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or $1$. They can have as output any real number between $0$ and $1$, so values such as $0.173\ldots$ and $0.689\ldots$ are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a $0$ or a $1$, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a "9", and any output less than $0.5$ as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.
In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:
The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values of less than $0.5$ indicating "input image is not a 9", and values greater than $0.5$ indicating "input image is a 9".
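A sketch of that input encoding, with a hypothetical stand-in for the image data:

import numpy as np

image = np.random.rand(64, 64)           # stand-in for a 64 x 64 greyscale image, values in [0, 1]
input_activations = image.reshape(4096)  # 4,096 = 64 x 64 input neuron values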
While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.
Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.
However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.
Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.
Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image
into six separate images,
We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,
is a 5.
We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.
To recognize individual digits we will use a three-layer neural network:
The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.
The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.
The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.
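In code, that final step is just an argmax over the ten output activations. A sketch with made-up activation values:

import numpy as np

output_activations = np.array([0.02, 0.01, 0.05, 0.01, 0.03, 0.02, 0.91, 0.04, 0.02, 0.01])
print(np.argmax(output_activations))  # -> 6: the network's guess for the digit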
You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering why using $10$ output neurons works better. Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?
To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:
It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:
As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown earlier:
So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.
Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$. If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.
Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.
Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here are a few images from MNIST:
As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!
The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.
We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector. For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.
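Here's a sketch of that encoding as a small helper function (the name vectorized_result is just an illustrative choice):

import numpy as np

def vectorized_result(j):
    # Return a 10-dimensional column vector with 1.0 in the j-th position.
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

print(vectorized_result(6).T)  # desired output y(x) for an image of a 6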
What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function* *Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks. : \begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \tag{6}\end{eqnarray} Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
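In code the quadratic cost is a short loop. Here's a sketch, where outputs and targets are hypothetical lists holding the network's output $a$ and the desired output $y(x)$ for each training input:

import numpy as np

def quadratic_cost(outputs, targets):
    # C = (1/2n) * sum over x of ||y(x) - a||^2
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2 for a, y in zip(outputs, targets)) / (2.0 * n)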
Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.
Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.
Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.
Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:
What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.
One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!
(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)
Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.
Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?
To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}\end{eqnarray} We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. We denote the gradient vector by $\nabla C$, i.e.: \begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}\end{eqnarray} In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.
With these definitions, the expression (7) for $\Delta C$ can be rewritten as \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{9}\end{eqnarray} This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray} where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount: \begin{eqnarray} v \rightarrow v' = v -\eta \nabla C. \tag{11}\end{eqnarray} Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.
Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:
Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!
To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1,\ldots,v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray} where the gradient $\nabla C$ is the vector \begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T. \tag{13}\end{eqnarray} Just as for the two variable case, we can choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray} and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule \begin{eqnarray} v \rightarrow v' = v-\eta \nabla C. \tag{15}\end{eqnarray} You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
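Here's a tiny sketch of the update rule (15) in action, minimizing the toy function $C(v) = v_1^2 + v_2^2$, whose gradient $\nabla C = (2 v_1, 2 v_2)^T$ we can write down by hand:

import numpy as np

def grad_C(v):
    return 2 * v  # gradient of C(v) = v1^2 + v2^2

v = np.array([2.0, -3.0])    # starting position
eta = 0.1                    # learning rate
for step in range(100):
    v = v - eta * grad_C(v)  # update rule (15): v -> v - eta * grad C
print(v)  # very close to the minimum at (0, 0)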
Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives* *Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$. Still, you get the point.! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.
How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{17}\end{eqnarray} By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.
There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.
An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.
To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is, \begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray} where the second sum is over the entire set of training data. Swapping sides we get \begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray} confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.
To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those, \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray} where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
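Here's a sketch of one epoch of stochastic gradient descent at the level of mini-batch bookkeeping. The update_mini_batch argument stands in for the update rules (20) and (21) and is hypothetical:

import random

def sgd_epoch(training_data, mini_batch_size, eta, update_mini_batch):
    # One epoch: shuffle the data, split into mini-batches, train on each.
    random.shuffle(training_data)
    mini_batches = [training_data[k:k + mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for mini_batch in mini_batches:
        update_mini_batch(mini_batch, eta)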
Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.
We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.
We're focusing on handwriting recognition because it's an excellent
prototype problem for learning about neural networks in general. As a
prototype it hits a sweet spot: it's challenging - it's no small
feat to recognize handwritten digits - but it's not so difficult as
to require an extremely complicated solution, or tremendous
computational power. Furthermore, it's a great way to develop more
advanced techniques, such as deep learning. And so throughout the
book we'll return repeatedly to the problem of handwriting
recognition. Later in the book, we'll discuss how these ideas may be
applied to other problems in computer vision, and also in speech,
natural language processing, and other domains.
Of course, if the point of the chapter was only to write a computer
program to recognize handwritten digits, then the chapter would be
much shorter! But along the way we'll develop many key ideas about
neural networks, including two important types of artificial neuron
(the perceptron and the sigmoid neuron), and the standard learning
algorithm for neural networks, known as stochastic gradient descent.
Throughout, I focus on explaining why things are done the way
they are, and on building your neural networks intuition. That
requires a lengthier discussion than if I just presented the basic
mechanics of what's going on, but it's worth it for the deeper
understanding you'll attain. Amongst the payoffs, by the end of the
chapter we'll be in position to understand what deep learning is, and
why it matters.
Perceptrons
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.
So how do perceptrons work? A perceptron takes several binary inputs,
$x_1, x_2, \ldots$, and produces a single binary output:
In the example shown the perceptron has three inputs, $x_1, x_2, x_3$.
In general it could have more or fewer inputs. Rosenblatt proposed a
simple rule to compute the output. He introduced
weights, $w_1,w_2,\ldots$, real numbers
expressing the importance of the respective inputs to the output. The
neuron's output, $0$ or $1$, is determined by whether the weighted sum
$\sum_j w_j x_j$ is less than or greater than some threshold
value. Just like the weights, the
threshold is a real number which is a parameter of the neuron. To put
it in more precise algebraic terms:
\begin{eqnarray}
\mbox{output} & = & \left\{ \begin{array}{ll}
0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\
1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold}
\end{array} \right.
\tag{1}\end{eqnarray}
That's all there is to how a perceptron works!
That's the basic mathematical model. A way you can think about the
perceptron is that it's a device that makes decisions by weighing up
evidence. Let me give an example. It's not a very realistic example,
but it's easy to understand, and we'll soon get to more realistic
examples. Suppose the weekend is coming up, and you've heard that
there's going to be a cheese festival in your city. You like cheese,
and are trying to decide whether or not to go to the festival. You
might make your decision by weighing up three factors:
- Is the weather good?
- Does your boyfriend or girlfriend want to accompany you?
- Is the festival near public transit? (You don't own a car).
We can represent these three factors by corresponding binary variables
$x_1, x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the
weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2
= 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if
not. And similarly again for $x_3$ and public transit.
Now, suppose you absolutely adore cheese, so much so that you're happy
to go to the festival even if your boyfriend or girlfriend is
uninterested and the festival is hard to get to. But perhaps you
really loathe bad weather, and there's no way you'd go to the festival
if the weather is bad. You can use perceptrons to model this kind of
decision-making. One way to do this is to choose a weight $w_1 = 6$
for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions.
The larger value of $w_1$ indicates that the weather matters a lot to
you, much more than whether your boyfriend or girlfriend joins you, or
the nearness of public transit. Finally, suppose you choose a
threshold of $5$ for the perceptron. With these choices, the
perceptron implements the desired decision-making model, outputting
$1$ whenever the weather is good, and $0$ whenever the weather is bad.
It makes no difference to the output whether your boyfriend or
girlfriend wants to go, or whether public transit is nearby.
By varying the weights and the threshold, we can get different models
of decision-making. For example, suppose we instead chose a threshold
of $3$. Then the perceptron would decide that you should go to the
festival whenever the weather was good or when both the
festival was near public transit and your boyfriend or
girlfriend was willing to join you. In other words, it'd be a
different model of decision-making. Dropping the threshold means
you're more willing to go to the festival.
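To make this concrete, here's a minimal Python sketch of the decision rule (the helper name perceptron_output is my own, not code from the book's repository):

def perceptron_output(weights, threshold, inputs):
    # Return 1 if the weighted sum of the inputs exceeds the threshold.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

weights = [6, 2, 2]  # weather matters much more than the other factors
# Bad weather, but your partner is keen and transit is nearby:
print(perceptron_output(weights, 5, [0, 1, 1]))  # 0 - stay home
print(perceptron_output(weights, 3, [0, 1, 1]))  # 1 - the lower threshold says go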
Obviously, the perceptron isn't a complete model of human
decision-making! But what the example illustrates is how a perceptron
can weigh up different kinds of evidence in order to make decisions.
And it should seem plausible that a complex network of perceptrons
could make quite subtle decisions:
In this network, the first column of perceptrons - what we'll call
the first layer of perceptrons - is making three very simple
decisions, by weighing the input evidence. What about the perceptrons
in the second layer? Each of those perceptrons is making a decision
by weighing up the results from the first layer of decision-making.
In this way a perceptron in the second layer can make a decision at a
more complex and more abstract level than perceptrons in the first
layer. And even more complex decisions can be made by the perceptron
in the third layer. In this way, a many-layer network of perceptrons
can engage in sophisticated decision making.
Incidentally, when I defined perceptrons I said that a perceptron has
just a single output. In the network above the perceptrons look like
they have multiple outputs. In fact, they're still single output.
The multiple output arrows are merely a useful way of indicating that
the output from a perceptron is being used as the input to several
other perceptrons. It's less unwieldy than drawing a single output
line which then splits.
Let's simplify the way we describe perceptrons. The condition $\sum_j
w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two
notational changes to simplify it.
The first change is to write
$\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$,
where $w$ and $x$ are vectors whose components are the weights and
inputs, respectively. The second change is to move the threshold to
the other side of the inequality, and to replace it by what's known as
the perceptron's bias, $b \equiv
-\mbox{threshold}$. Using the bias instead of the threshold, the
perceptron rule can be
rewritten:
\begin{eqnarray}
\mbox{output} = \left\{
\begin{array}{ll}
0 & \mbox{if } w\cdot x + b \leq 0 \\
1 & \mbox{if } w\cdot x + b > 0
\end{array}
\right.
\tag{2}\end{eqnarray}
You can think of the bias as a measure of how easy it is to get the
perceptron to output a $1$. Or to put it in more biological terms,
the bias is a measure of how easy it is to get the perceptron to
fire. For a perceptron with a really big bias, it's extremely
easy for the perceptron to output a $1$. But if the bias is very
negative, then it's difficult for the perceptron to output a $1$.
Obviously, introducing the bias is only a small change in how we
describe perceptrons, but we'll see later that it leads to further
notational simplifications. Because of this, in the remainder of the
book we won't use the threshold, we'll always use the bias.
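In code, the bias form of the rule is just a dot product and a comparison. Here's a one-line sketch (assuming numpy, which the book's later code also uses):

import numpy as np

def perceptron(w, b, x):
    # Output 1 if w . x + b > 0, and 0 otherwise, as in Equation (2).
    return 1 if np.dot(w, x) + b > 0 else 0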
I've described perceptrons as a method for weighing evidence to make
decisions. Another way perceptrons can be used is to compute the
elementary logical functions we usually think of as underlying
computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:
Then we see that input $00$ produces output $1$, since
$(-2)*0+(-2)*0+3 = 3$ is positive. Here, I've introduced the $*$
symbol to make the multiplications explicit. Similar calculations
show that the inputs $01$ and $10$ produce output $1$. But the input
$11$ produces output $0$, since $(-2)*1+(-2)*1+3 = -1$ is negative.
And so our perceptron implements a NAND gate!
The NAND example shows that we can use perceptrons to compute simple logical functions.
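Here's a quick check of that truth table in code (a sketch of my own, not from the book's repository):

def nand_perceptron(x1, x2):
    # Weights -2, -2 and bias 3, exactly as in the example above.
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print("%d %d -> %d" % (x1, x2, nand_perceptron(x1, x2)))
# Inputs 00, 01 and 10 produce 1; input 11 produces 0.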
In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:
To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:
One notable aspect of this network of perceptrons is that the output
from the leftmost perceptron is used twice as input to the bottommost
perceptron. When I defined the perceptron model I didn't say whether
this kind of double-output-to-the-same-place was allowed. Actually,
it doesn't much matter. If we don't want to allow this kind of thing,
then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to $3$, and a single weight of $-4$, as marked:
Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables
floating to the left of the network of perceptrons. In fact, it's
conventional to draw an extra layer of perceptrons - the input
layer - to encode the inputs:
This notation for input perceptrons, in which we have an output, but
no inputs,
is a shorthand. It doesn't actually mean a perceptron with no inputs.
To see this, suppose we did have a perceptron with no inputs. Then
the weighted sum $\sum_j w_j x_j$ would always be zero, and so the
perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$. That
is, the perceptron would simply output a fixed value, not the desired
value ($x_1$, in the example above). It's better to think of the
input perceptrons as not really being perceptrons at all, but rather
special units which are simply defined to output the desired values,
$x_1, x_2,\ldots$.
The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.
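To see this universality in action, here's a sketch (my own illustration, not the book's code) that wires copies of the NAND perceptron from above into the two-bit adder circuit:

def nand(x1, x2):
    # The NAND perceptron: weights -2, -2, bias 3.
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

def add_bits(x1, x2):
    # Compute the bitwise sum x1 XOR x2 and the carry bit x1 AND x2,
    # using nothing but NAND perceptrons.
    a = nand(x1, x2)
    bitwise_sum = nand(nand(x1, a), nand(x2, a))
    carry = nand(a, a)
    return bitwise_sum, carry

print(add_bits(1, 1))  # (0, 1): one plus one is binary 10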
The computational universality of perceptrons is simultaneously
reassuring and disappointing. It's reassuring because it tells us
that networks of perceptrons can be as powerful as any other computing
device. But it's also disappointing, because it makes it seem as
though perceptrons are merely a new type of NAND gate. That's hardly big news!
However, the situation is better than this view suggests. It turns
out that we can devise learning
algorithms which can
automatically tune the weights and biases of a network of artificial
neurons. This tuning happens in response to external stimuli, without
direct intervention by a programmer. These learning algorithms enable
us to use artificial neurons in a way which is radically different to
conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.
Sigmoid neurons
Learning algorithms sound terrific. But how can we devise such
algorithms for a neural network? Suppose we have a network of
perceptrons that we'd like to use to learn to solve some problem. For
example, the inputs to the network might be the raw pixel data from a
scanned, handwritten image of a digit. And we'd like the network to
learn weights and biases so that the output from the network correctly
classifies the digit. To see how learning might work, suppose we make
a small change in some weight (or bias) in the network. What we'd
like is for this small change in weight to cause only a small
corresponding change in the output from the network. As we'll see in
a moment, this property will make learning possible. Schematically,
here's what we want (obviously this network is too simple to do
handwriting recognition!):
If it were true that a small change in a weight (or bias) causes only
a small change in output, then we could use this fact to modify the
weights and biases to get our network to behave more in the manner we
want. For example, suppose the network was mistakenly classifying an
image as an "8" when it should be a "9". We could figure out how
to make a small change in the weights and biases so the network gets a
little closer to classifying the image as a "9". And then we'd
repeat this, changing the weights and biases over and over to produce
better and better output. The network would be learning.
The problem is that this isn't what happens when our network contains
perceptrons. In fact, a small change in the weights or bias of any
single perceptron in the network can sometimes cause the output of
that perceptron to completely flip, say from $0$ to $1$. That flip
may then cause the behaviour of the rest of the network to completely
change in some very complicated way. So while your "9" might now be
classified correctly, the behaviour of the network on all the other
images is likely to have completely changed in some hard-to-control
way. That makes it difficult to see how to gradually modify the
weights and biases so that the network gets closer to the desired
behaviour. Perhaps there's some clever way of getting around this
problem. But it's not immediately obvious how we can get a network of
perceptrons to learn.
We can overcome this problem by introducing a new type of artificial
neuron called a sigmoid neuron.
Sigmoid neurons are similar to perceptrons, but modified so that small
changes in their weights and bias cause only a small change in their
output. That's the crucial fact which will allow a network of sigmoid
neurons to learn.
Okay, let me describe the sigmoid neuron. We'll depict sigmoid
neurons in the same way we depicted perceptrons:
Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2,
\ldots$. But instead of being just $0$ or $1$, these inputs can also
take on any values between $0$ and $1$. So, for instance,
$0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a
perceptron, the sigmoid neuron has weights for each input, $w_1, w_2,
\ldots$, and an overall bias, $b$. But the output is not $0$ or $1$.
Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the
sigmoid function* *Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology., and is defined by:
\begin{eqnarray}
\sigma(z) \equiv \frac{1}{1+e^{-z}}.
\tag{3}\end{eqnarray}
To put it all a little more explicitly, the output of a sigmoid neuron
with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is
\begin{eqnarray}
\frac{1}{1+\exp(-\sum_j w_j x_j-b)}.
\tag{4}\end{eqnarray}
At first sight, sigmoid neurons appear very different to perceptrons.
The algebraic form of the sigmoid function may seem opaque and
forbidding if you're not already familiar with it. In fact, there are
many similarities between perceptrons and sigmoid neurons, and the
algebraic form of the sigmoid function turns out to be more of a
technical detail than a true barrier to understanding.
To understand the similarity to the perceptron model, suppose $z
\equiv w \cdot x + b$ is a large positive number. Then $e^{-z}
\approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w
\cdot x+b$ is large and positive, the output from the sigmoid neuron
is approximately $1$, just as it would have been for a perceptron.
Suppose on the other hand that $z = w \cdot x+b$ is very negative.
Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when
$z = w \cdot x +b$ is very negative, the behaviour of a sigmoid neuron
also closely approximates a perceptron. It's only when $w \cdot x+b$
is of modest size that there's much deviation from the perceptron
model.
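A small numerical check of this limiting behaviour (a sketch; numpy is assumed):

import numpy as np

def sigmoid(z):
    # The sigmoid function of Equation (3).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10.0))   # ~0.99995: output near 1, like a firing perceptron
print(sigmoid(-10.0))  # ~0.00005: output near 0
print(sigmoid(0.0))    # 0.5: in the modest-z regime the two models differ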
What about the algebraic form of $\sigma$? How can we understand
that? In fact, the exact form of $\sigma$ isn't so important - what
really matters is the shape of the function when plotted. Here's the
shape:
This shape is a smoothed-out version of a step function:
If $\sigma$ had in fact been a step function, then the sigmoid neuron
would be a perceptron, since the output would be $1$ or $0$
depending on whether $w\cdot x+b$ was positive or
negative* *Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.. By using the actual $\sigma$ function we get, as already implied above, a smoothed-out perceptron. Indeed,
it's the smoothness of the $\sigma$ function that is the crucial fact,
not its detailed form. The smoothness of $\sigma$ means that small
changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will
produce a small change $\Delta \mbox{output}$ in the output from the
neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is
well approximated by
\begin{eqnarray}
\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
\Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b,
\tag{5}\end{eqnarray}
where the sum is over all the weights, $w_j$, and $\partial \,
\mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial
b$ denote partial derivatives of the $\mbox{output}$ with respect to
$w_j$ and $b$, respectively. Don't panic if you're not comfortable
with partial derivatives! While the expression above looks
complicated, with all the partial derivatives, it's actually saying
something very simple (and which is very good news): $\Delta
\mbox{output}$ is a linear function of the changes $\Delta w_j$
and $\Delta b$ in the weights and bias. This linearity makes it easy
to choose small changes in the weights and biases to achieve any
desired small change in the output. So while sigmoid neurons have
much of the same qualitative behaviour as perceptrons, they make it
much easier to figure out how changing the weights and biases will
change the output.
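Equation (5) is easy to verify numerically. For a sigmoid neuron the partial derivatives work out to $\partial \, \mbox{output}/\partial w_j = \sigma'(z) x_j$ and $\partial \, \mbox{output}/\partial b = \sigma'(z)$, where $\sigma'(z) = \sigma(z)(1-\sigma(z))$. Here's a sketch (made-up numbers, my own helper names) comparing the exact change in output against the linear approximation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    return sigmoid(z) * (1.0 - sigmoid(z))

w, x, b = np.array([0.7, -0.2]), np.array([0.5, 0.9]), 0.1
dw, db = np.array([0.001, -0.002]), 0.0005  # small changes to weights and bias

z = np.dot(w, x) + b
exact = sigmoid(np.dot(w + dw, x) + b + db) - sigmoid(z)
approx = sigmoid_prime(z) * (np.dot(x, dw) + db)  # Equation (5)
print("%.8f %.8f" % (exact, approx))  # the two agree closely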
If it's the shape of $\sigma$ which really matters, and not its exact
form, then why use the particular form used for $\sigma$ in
Equation (3)\begin{eqnarray}
\sigma(z) \equiv \frac{1}{1+e^{-z}} \nonumber\end{eqnarray}? In fact, later in the book we will
occasionally consider neurons where the output is $f(w \cdot x + b)$
for some other activation function $f(\cdot)$. The main thing
that changes when we use a different activation function is that the
particular values for the partial derivatives in
Equation (5)\begin{eqnarray}
\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
\Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b \nonumber\end{eqnarray} change. It turns out that when we
compute those partial derivatives later, using $\sigma$ will simplify
the algebra, simply because exponentials have lovely properties when
differentiated. In any case, $\sigma$ is commonly used in work on
neural nets, and is the activation function we'll use most often in
this book.
How should we interpret the output from a sigmoid neuron? Obviously,
one big difference between perceptrons and sigmoid neurons is that
sigmoid neurons don't just output $0$ or $1$. They can have as output
any real number between $0$ and $1$, so values such as $0.173\ldots$
and $0.689\ldots$ are legitimate outputs. This can be useful, for
example, if we want to use the output value to represent the average
intensity of the pixels in an image input to a neural network. But
sometimes it can be a nuisance. Suppose we want the output from the
network to indicate either "the input image is a 9" or "the input
image is not a 9". Obviously, it'd be easiest to do this if the
output was a $0$ or a $1$, as in a perceptron. But in practice we can
set up a convention to deal with this, for example, by deciding to
interpret any output of at least $0.5$ as indicating a "9", and any
output less than $0.5$ as indicating "not a 9". I'll always
explicitly state when we're using such a convention, so it shouldn't
cause any confusion.
The architecture of neural networks
In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:
The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values of less than $0.5$ indicating "input image is not a 9", and values greater than $0.5$ indicating "input image is a 9".
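For instance, encoding such an image as input activations might look like this sketch (the image itself is a made-up stand-in):

import numpy as np

# A hypothetical 64 x 64 greyscale image with pixel values from 0 to 255.
image = np.random.randint(0, 256, size=(64, 64))
# Flatten into 4,096 input activations scaled into [0, 1].
input_activations = image.reshape(4096) / 255.0
print(input_activations.shape)  # (4096,)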
While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.
Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.
However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.
Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.
Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image
into six separate images,
We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,
is a 5.
We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.
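As a rough sketch of that scoring idea (everything here is a hypothetical stand-in, not real code from this book):

def score_segmentation(segments, classifier_confidence):
    # A trial segmentation is only as good as its worst segment: if the
    # digit classifier is unsure anywhere, the split was probably wrong.
    return min(classifier_confidence(segment) for segment in segments)

# We'd then keep whichever trial segmentation scores highest, e.g.:
# best = max(trials, key=lambda segs: score_segmentation(segs, confidence))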
To recognize individual digits we will use a three-layer neural network:
The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.
The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.
The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.
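Anticipating the program we'll write later, here's a sketch of how such a network turns an input into an output (the shapes match the 784-15-10 architecture just described; this is an illustration, not the book's actual code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized weights and biases for a 784-15-10 network.
w1, b1 = np.random.randn(15, 784), np.random.randn(15, 1)
w2, b2 = np.random.randn(10, 15), np.random.randn(10, 1)

x = np.random.rand(784, 1)                 # stand-in for one input image
hidden = sigmoid(np.dot(w1, x) + b1)       # 15 hidden activations
output = sigmoid(np.dot(w2, hidden) + b2)  # 10 output activations
print(np.argmax(output))  # the network's guess: the most active output neuron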
You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering why using $10$ output neurons works better. Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?
To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:
It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:
As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown earlier:
So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.
Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$. If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.
Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.
Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:
As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!
The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.
We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector. For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.
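In code, $y(x)$ is just a one-hot column vector. A sketch (the helper name is mine):

import numpy as np

def one_hot(digit):
    # A 10-dimensional column vector with 1.0 in the digit's slot.
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

print(one_hot(6).T)  # row form of the desired output for a 6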
What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function* *Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks. : \begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \tag{6}\end{eqnarray} Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
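Expressed in code, Equation (6) is short. A sketch, given matching lists of desired outputs $y(x)$ and network outputs $a$:

import numpy as np

def quadratic_cost(desired, outputs):
    # C = (1 / 2n) * sum over x of ||y(x) - a||^2
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for y, a in zip(desired, outputs)) / (2.0 * n)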
Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.
Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}. Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray} works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.
Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.
Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:
What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.
One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!
(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)
Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.
Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?
To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}\end{eqnarray} We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. We denote the gradient vector by $\nabla C$, i.e.: \begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}\end{eqnarray} In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.
With these definitions, the expression (7)\begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \nonumber\end{eqnarray} for $\Delta C$ can be rewritten as \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{9}\end{eqnarray} This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray} where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray} tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10)\begin{eqnarray} \Delta v = -\eta \nabla C \nonumber\end{eqnarray}. (Within, of course, the limits of the approximation in Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}). This is exactly the property we wanted! And so we'll take Equation (10)\begin{eqnarray} \Delta v = -\eta \nabla C \nonumber\end{eqnarray} to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10)\begin{eqnarray} \Delta v = -\eta \nabla C \nonumber\end{eqnarray} to compute a value for $\Delta v$, then move the ball's position $v$ by that amount: \begin{eqnarray} v \rightarrow v' = v -\eta \nabla C. \tag{11}\end{eqnarray} Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.
Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:
Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!
To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray} is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray} remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1,\ldots,v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray} where the gradient $\nabla C$ is the vector \begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T. \tag{13}\end{eqnarray} Just as for the two variable case, we can choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray} and we're guaranteed that our (approximate) expression (12)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray} for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule \begin{eqnarray} v \rightarrow v' = v-\eta \nabla C. \tag{15}\end{eqnarray} You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
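Here is the update rule (15) at work on a toy cost, $C(v) = v_1^2 + v_2^2$, whose gradient is $\nabla C = 2v$ (a sketch, just to watch $C$ shrink):

import numpy as np

def grad_C(v):
    # Gradient of the toy cost C(v) = v_1^2 + v_2^2.
    return 2.0 * v

v = np.array([3.0, -4.0])  # an arbitrary starting position
eta = 0.1                  # the learning rate
for step in range(100):
    v = v - eta * grad_C(v)  # v -> v' = v - eta * grad C
print(v)  # very close to the minimum at (0, 0)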
Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives* *Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$. Still, you get the point.! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.
How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}. To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{17}\end{eqnarray} By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.
There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}. Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.
An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.
To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is, \begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray} where the second sum is over the entire set of training data. Swapping sides we get \begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray} confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.
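To see Equation (19) in action, here's a toy sketch (my own illustration; the array grad_Cx is hypothetical stand-in data for the per-example gradients $\nabla C_x$):

import numpy as np

rng = np.random.RandomState(0)
n, m = 60000, 10
# Stand-in per-example gradients: one row per training example,
# five parameters, with noise around a common mean
grad_Cx = 0.5 + rng.randn(n, 5)

full_gradient = grad_Cx.mean(axis=0)           # (1/n) sum_x grad C_x
sample = grad_Cx[rng.choice(n, m, replace=False)]
estimate = sample.mean(axis=0)                 # (1/m) sum_j grad C_{X_j}

print(full_gradient)
print(estimate)  # noisy, but usually points in roughly the right direction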
To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those, \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray} where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
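In code, one epoch of this procedure might look like the following sketch (my own illustration, not the book's implementation; update_mini_batch is a hypothetical function assumed to apply the update rules (20) and (21) to a single mini-batch):

import random

def SGD_epoch(training_data, mini_batch_size, eta, update_mini_batch):
    """Run one epoch of stochastic gradient descent.

    training_data is a list of (x, y) pairs; update_mini_batch is
    assumed to apply the update rules (20) and (21) to one
    mini-batch, with learning rate eta."""
    random.shuffle(training_data)
    mini_batches = [training_data[k:k+mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for mini_batch in mini_batches:
        update_mini_batch(mini_batch, eta)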
Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.
We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.
Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.
Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
If you don't use git then you can download the data and code here.
Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000 image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000 image data set, not the original 60,000 image data set* *As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link)..
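If you want to make such a split by hand, it's a couple of lines of Numpy slicing. A minimal sketch, with placeholder arrays standing in for the real MNIST data:

import numpy as np

# Placeholders standing in for the real 60,000 MNIST training images/labels
images = np.zeros((60000, 784))
labels = np.zeros(60000, dtype=int)

training_images, validation_images = images[:50000], images[50000:]
training_labels, validation_labels = labels[:50000], labels[50000:]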
Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.
Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:
class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:
net = Network([2, 3, 1])
The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean $0$ and standard deviation $1$. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix $w$. It's a matrix such that $w_{jk}$ is the weight for the connection between the $k^{\rm th}$ neuron in the second layer, and the $j^{\rm th}$ neuron in the third layer. This ordering of the $j$ and $k$ indices may seem strange - surely it'd make more sense to swap the $j$ and $k$ indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is: \begin{eqnarray} a' = \sigma(w a + b). \tag{22}\end{eqnarray} There's quite a bit going on in this equation, so let's unpack it piece by piece. $a$ is the vector of activations of the second layer of neurons. To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases. We then apply the function $\sigma$ elementwise to every entry in the vector $w a + b$. (This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.
Exercise
With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:
def sigmoid(z):
return 1.0/(1.0+np.exp(-z))
Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output* *It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.. All the method does is apply Equation (22) for each layer:
def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a
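For example, with the net = Network([2, 3, 1]) created earlier, we can compute the output for a single input like so (a usage sketch; the input values are arbitrary, and note the (2, 1) shape):

import numpy as np

net = Network([2, 3, 1])
x = np.array([[0.5], [0.8]])  # a (2, 1) ndarray, not a (2,) vector
print(net.feedforward(x))     # a (1, 1) array holding the network's output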
How the backpropagation algorithm works
By Michael Nielsen / Jan 2017
In the last chapter we saw how neural networks can
learn their weights and biases using the gradient descent algorithm.
There was, however, a gap in our explanation: we didn't discuss how to
compute the gradient of the cost function. That's quite a gap! In
this chapter I'll explain a fast algorithm for computing such
gradients, an algorithm known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s,
but its importance wasn't fully appreciated until a
famous
1986 paper by
David
Rumelhart,
Geoffrey
Hinton, and
Ronald
Williams. That paper describes several
neural networks where backpropagation works far faster than earlier
approaches to learning, making it possible to use neural nets to solve
problems which had previously been insoluble. Today, the
backpropagation algorithm is the workhorse of learning in neural
networks.
This chapter is more mathematically involved than the rest of the
book. If you're not crazy about mathematics you may be tempted to
skip the chapter, and to treat backpropagation as a black box whose
details you're willing to ignore. Why take the time to study those
details?
The reason, of course, is understanding. At the heart of
backpropagation is an expression for the partial derivative $\partial
C / \partial w$ of the cost function $C$ with respect to any weight
$w$ (or bias $b$) in the network. The expression tells us how quickly
the cost changes when we change the weights and biases. And while the
expression is somewhat complex, it also has a beauty to it, with each
element having a natural, intuitive interpretation. And so
backpropagation isn't just a fast algorithm for learning. It actually
gives us detailed insights into how changing the weights and biases
changes the overall behaviour of the network. That's well worth
studying in detail.
With that said, if you want to skim the chapter, or jump straight to
the next chapter, that's fine. I've written the rest of the book to
be accessible even if you treat backpropagation as a black box. There
are, of course, points later in the book where I refer back to results
from this chapter. But at those points you should still be able to
understand the main conclusions, even if you don't follow all the
reasoning.
Warm up: a fast matrix-based approach to computing the output
from a neural network
Before discussing backpropagation, let's warm up with a fast
matrix-based algorithm to compute the output from a neural network.
We actually already briefly saw this algorithm
near
the end of the last chapter, but I described it quickly, so it's
worth revisiting in detail. In particular, this is a good way of
getting comfortable with the notation used in backpropagation, in a
familiar context.
Let's begin with a notation which lets us refer to weights in the
network in an unambiguous way. We'll use $w^l_{jk}$ to denote the
weight for the connection from the $k^{\rm th}$ neuron in the
$(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$
layer. So, for example, the diagram below shows the weight on a
connection from the fourth neuron in the second layer to the second
neuron in the third layer of a network:
This notation is cumbersome at first, and it does take some work to
master. But with a little effort you'll find the notation becomes
easy and natural. One quirk of the notation is the ordering of the
$j$ and $k$ indices. You might think that it makes more sense to use
$j$ to refer to the input neuron, and $k$ to the output neuron, not
vice versa, as is actually done. I'll explain the reason for this
quirk below.

We use a similar notation for the network's biases and activations.
Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in
the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the
$j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram
shows examples of these notations in use:
With these notations, the activation $a^{l}_j$ of the $j^{\rm th}$
neuron in the $l^{\rm th}$ layer is related to the activations in the
$(l-1)^{\rm th}$ layer by the equation (compare
Equation (4) and surrounding
discussion in the last chapter)
\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),
\tag{23}\end{eqnarray}
where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To
rewrite this expression in a matrix form we define a weight
matrix $w^l$ for each layer, $l$. The entries of the weight matrix
$w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons,
that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$.
Similarly, for each layer $l$ we define a bias vector, $b^l$.
You can probably guess how this works - the components of the bias
vector are just the values $b^l_j$, one component for each neuron in
the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$
whose components are the activations $a^l_j$.

The last ingredient we need to rewrite (23) in a
matrix form is the idea of vectorizing a function such as $\sigma$.
We met vectorization briefly in the last chapter, but to recap, the
idea is that we want to apply a function such as $\sigma$ to every
element in a vector $v$. We use the obvious notation $\sigma(v)$ to
denote this kind of elementwise application of a function. That is,
the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.
As an example, if we have the function $f(x) = x^2$ then the
vectorized form of $f$ has the effect
\begin{eqnarray}
f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)
= \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]
= \left[ \begin{array}{c} 4 \\ 9 \end{array} \right],
\tag{24}\end{eqnarray}
that is, the vectorized $f$ just squares every element of the vector.
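Numpy gives us this elementwise behaviour for free, since arithmetic on arrays is already vectorized. A quick check of Equation (24) (illustrative code only):

import numpy as np

def f(x):
    return x**2  # acts elementwise when x is a Numpy array

v = np.array([[2.0], [3.0]])
print(f(v))  # [[4.], [9.]] - each component squared, as in Equation (24)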
With these notations in mind, Equation (23) can
be rewritten in the beautiful and compact vectorized form
\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l).
\tag{25}\end{eqnarray}
This expression gives us a much more global way of thinking about how
the activations in one layer relate to activations in the previous
layer: we just apply the weight matrix to the activations, then add
the bias vector, and finally apply the $\sigma$ function*
*By the way, it's this expression that
motivates the quirk in the $w^l_{jk}$ notation mentioned earlier.
If we used $j$ to index the input neuron, and $k$ to index the
output neuron, then we'd need to replace the weight matrix in
Equation (25) by the transpose of the
weight matrix. That's a small change, but annoying, and we'd lose
the easy simplicity of saying (and thinking) "apply the weight
matrix to the activations".. That global view is often easier and
more succinct (and involves fewer indices!) than the neuron-by-neuron
view we've taken to now. Think of it as a way of escaping index hell,
while remaining precise about what's going on. The expression is also
useful in practice, because most matrix libraries provide fast ways of
implementing matrix multiplication, vector addition, and
vectorization. Indeed, the
code
in the last chapter made implicit use of this expression to compute
the behaviour of the network.
When using Equation (25) to compute $a^l$,
we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$
along the way. This quantity turns out to be useful enough to be
worth naming: we call $z^l$ the weighted input to the neurons
in layer $l$. We'll make considerable use of the weighted input $z^l$
later in the chapter. Equation (25) is
sometimes written in terms of the weighted input, as $a^l =
\sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j
= \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, that is, $z^l_j$ is just the
weighted input to the activation function for neuron $j$ in layer $l$.
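As a sketch of how this looks in code (my own illustration, in the list conventions of the last chapter's Network class): a feedforward pass that also records the weighted inputs $z^l$, which will be needed below:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def feedforward_with_zs(weights, biases, a):
    """Return (activations, zs) for every layer, given the input
    activation a. weights and biases are lists of Numpy arrays."""
    activations, zs = [a], []
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b   # z^l = w^l a^{l-1} + b^l
        zs.append(z)
        a = sigmoid(z)         # a^l = sigma(z^l)
        activations.append(a)
    return activations, zs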
The two assumptions we need about the cost function
The goal of backpropagation is to compute the partial derivatives
$\partial C / \partial w$ and $\partial C / \partial b$ of the cost
function $C$ with respect to any weight $w$ or bias $b$ in the
network. For backpropagation to work we need to make two main
assumptions about the form of the cost function. Before stating those
assumptions, though, it's useful to have an example cost function in
mind. We'll use the quadratic cost function from last chapter
(c.f. Equation (6)). In the notation of
the last section, the quadratic cost has the form
\begin{eqnarray}
C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
where: $n$ is the total number of training examples; the sum is over
individual training examples, $x$; $y = y(x)$ is the corresponding
desired output; $L$ denotes the number of layers in the network; and
$a^L = a^L(x)$ is the vector of activations output from the network
when $x$ is input.
Okay, so what assumptions do we need to make about our cost function,
$C$, in order that backpropagation can be applied? The first
assumption we need is that the cost function can be written as an
average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for
individual training examples, $x$. This is the case for the quadratic
cost function, where the cost for a single training example is $C_x =
\frac{1}{2} \|y-a^L \|^2$. This assumption will also hold true for
all the other cost functions we'll meet in this book.
The reason we need this assumption is because what backpropagation
actually lets us do is compute the partial derivatives $\partial C_x
/ \partial w$ and $\partial C_x / \partial b$ for a single training
example. We then recover $\partial C / \partial w$ and $\partial C
/ \partial b$ by averaging over training examples. In fact, with this
assumption in mind, we'll suppose the training example $x$ has been
fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$.
We'll eventually put the $x$ back in, but for now it's a notational
nuisance that is better left implicit.
The second assumption we make about the cost is that it can be written
as a function of the outputs from the neural network: $C = C(a^L)$.

For example, the quadratic cost function satisfies this requirement,
since the quadratic cost for a single training example $x$ may be
written as
\begin{eqnarray}
C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\tag{27}\end{eqnarray}
and thus is a function of the output activations. Of course, this
cost function also depends on the desired output $y$, and you may
wonder why we're not regarding the cost also as a function of $y$.
Remember, though, that the input training example $x$ is fixed, and so
the output $y$ is also a fixed parameter. In particular, it's not
something we can modify by changing the weights and biases in any way,
i.e., it's not something which the neural network learns. And so it
makes sense to regard $C$ as a function of the output activations
$a^L$ alone, with $y$ merely a parameter that helps define that
function.

The Hadamard product, $s \odot t$
The backpropagation algorithm is based on common linear algebraic
operations - things like vector addition, multiplying a vector by a
matrix, and so on. But one of the operations is a little less
commonly used. In particular, suppose $s$ and $t$ are two vectors of
the same dimension. Then we use $s \odot t$ to denote the
elementwise product of the two vectors. Thus the components of
$s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,
\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right]
\odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}
This kind of elementwise multiplication is sometimes called the
Hadamard product or Schur product. We'll refer to it as
the Hadamard product. Good matrix libraries usually provide fast
implementations of the Hadamard product, and that comes in handy when
implementing backpropagation.
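In Numpy no special function is needed: the ordinary * operator on two arrays of the same shape multiplies elementwise. For instance (illustrative):

import numpy as np

s = np.array([[1], [2]])
t = np.array([[3], [4]])
print(s * t)  # [[3], [8]] - the Hadamard product of s and t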
The four fundamental equations behind backpropagation
Backpropagation is about understanding how changing the weights and
biases in a network changes the cost function. Ultimately, this means
computing the partial derivatives $\partial C / \partial w^l_{jk}$ and
$\partial C / \partial b^l_j$. But to compute those, we first
introduce an intermediate quantity, $\delta^l_j$, which we call the
error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
Backpropagation will give us a procedure to compute the error
$\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C
/ \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.
To understand how the error is defined, imagine there is a demon in
our neural network:
The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the
neuron comes in, the demon messes with the neuron's operation. It
adds a little change $\Delta z^l_j$ to the neuron's weighted input, so
that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs
$\sigma(z^l_j+\Delta z^l_j)$. This change propagates through later
layers in the network, finally causing the overall cost to change by
an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

Now, this demon is a good demon, and is trying to help you improve the
cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the
cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large
value (either positive or negative). Then the demon can lower the
cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign
to $\frac{\partial C}{\partial z^l_j}$. By contrast, if
$\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon
can't improve the cost much at all by perturbing the weighted input
$z^l_j$. So far as the demon can tell, the neuron is already pretty
near optimal*
*This is only the case for small changes $\Delta
z^l_j$, of course. We'll assume that the demon is constrained to
make such small changes.. And so there's a heuristic sense in
which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in
the neuron.
Motivated by this story, we define the error $\delta^l_j$ of neuron
$j$ in layer $l$ by
\begin{eqnarray}
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.
\tag{29}\end{eqnarray}
As per our usual conventions, we use $\delta^l$ to denote the vector
of errors associated with layer $l$. Backpropagation will give us a
way of computing $\delta^l$ for every layer, and then relating those
errors to the quantities of real interest, $\partial C / \partial
w^l_{jk}$ and $\partial C / \partial b^l_j$.
You might wonder why the demon is changing the weighted input $z^l_j$.
Surely it'd be more natural to imagine the demon changing the output
activation $a^l_j$, with the result that we'd be using $\frac{\partial
C}{\partial a^l_j}$ as our measure of error. In fact, if you do
this things work out quite similarly to the discussion below. But it
turns out to make the presentation of backpropagation a little more
algebraically complicated. So we'll stick with $\delta^l_j =
\frac{\partial C}{\partial z^l_j}$ as our measure of error*
*In
classification problems like MNIST the term "error" is sometimes
used to mean the classification failure rate. E.g., if the neural
net correctly classifies 96.0 percent of the digits, then the error
is 4.0 percent. Obviously, this has quite a different meaning from
our $\delta$ vectors. In practice, you shouldn't have trouble
telling which meaning is intended in any given usage..
Plan of attack: Backpropagation is based around four
fundamental equations. Together, those equations give us a way of
computing both the error $\delta^l$ and the gradient of the cost
function. I state the four equations below. Be warned, though: you
shouldn't expect to instantaneously assimilate the equations. Such an
expectation will lead to disappointment. In fact, the backpropagation
equations are so rich that understanding them well requires
considerable time and patience as you gradually delve deeper into the
equations. The good news is that such patience is repaid many times
over. And so the discussion in this section is merely a beginning,
helping you on the way to a thorough understanding of the equations.
Here's a preview of the ways we'll delve more deeply into the
equations later in the chapter: I'll
give
a short proof of the equations, which helps explain why they are
true; we'll restate
the equations in algorithmic form as pseudocode, and
see how the
pseudocode can be implemented as real, running Python code; and, in
the final
section of the chapter, we'll develop an intuitive picture of what
the backpropagation equations mean, and how someone might discover
them from scratch. Along the way we'll return repeatedly to the four
fundamental equations, and as you deepen your understanding those
equations will come to seem comfortable and, perhaps, even beautiful
and natural.
An equation for the error in the output layer, $\delta^L$:
The components of $\delta^L$ are given by
\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
\tag{BP1}\end{eqnarray}
This is a very natural expression. The first term on the right,
$\partial C / \partial a^L_j$, just measures how fast the cost is
changing as a function of the $j^{\rm th}$ output activation. If, for
example, $C$ doesn't depend much on a particular output neuron, $j$,
then $\delta^L_j$ will be small, which is what we'd expect. The
second term on the right, $\sigma'(z^L_j)$, measures how fast the
activation function $\sigma$ is changing at $z^L_j$.
Notice that everything in (BP1) is easily computed. In
particular, we compute $z^L_j$ while computing the behaviour of the
network, and it's only a small additional overhead to compute
$\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$
will, of course, depend on the form of the cost function. However,
provided the cost function is known there should be little trouble
computing $\partial C / \partial a^L_j$. For example, if we're using
the quadratic cost function then $C = \frac{1}{2} \sum_j
(y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a_j^L-y_j)$,
which obviously is easily computable.
Equation (BP1) is a componentwise expression for $\delta^L$.
It's a perfectly good expression, but not the matrix-based form we
want for backpropagation. However, it's easy to rewrite the equation
in a matrix-based form, as
\begin{eqnarray}
\delta^L = \nabla_a C \odot \sigma'(z^L).
\tag{BP1a}\end{eqnarray}
Here, $\nabla_a C$ is defined to be a vector whose components are the
partial derivatives $\partial C / \partial a^L_j$. You can think of
$\nabla_a C$ as expressing the rate of change of $C$ with respect to
the output activations. It's easy to see that Equations (BP1a) and (BP1) are equivalent, and for that reason from now on we'll use (BP1) interchangeably to refer to both equations. As an example, in the case of the quadratic cost we have $\nabla_a C = (a^L-y)$, and so the fully matrix-based form of (BP1) becomes
\begin{eqnarray}
\delta^L = (a^L-y) \odot \sigma'(z^L).
\tag{30}\end{eqnarray}
As you can see, everything in this expression has a nice vector form,
and is easily computed using a library such as Numpy.
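Here's Equation (30) as a small Numpy sketch (my own code, assuming the quadratic cost; a_L, y and z_L are (n, 1) arrays):

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid function
    return sigmoid(z)*(1-sigmoid(z))

def output_error(a_L, y, z_L):
    """Equation (30): delta^L = (a^L - y) Hadamard sigma'(z^L)."""
    return (a_L - y) * sigmoid_prime(z_L)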
An equation for the error $\delta^l$ in terms of the error in
the next layer, $\delta^{l+1}$: In particular
\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}
where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for
the $(l+1)^{\rm th}$ layer. This equation appears complicated, but
each element has a nice interpretation. Suppose we know the error
$\delta^{l+1}$ at the $l+1^{\rm th}$ layer. When we apply the
transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of
this as moving the error backward through the network, giving
us some sort of measure of the error at the output of the $l^{\rm th}$
layer. We then take the Hadamard product $\odot \sigma'(z^l)$. This
moves the error backward through the activation function in layer $l$,
giving us the error $\delta^l$ in the weighted input to layer $l$.
By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network. We start by using (BP1) to compute $\delta^L$, then apply Equation (BP2) to compute $\delta^{L-1}$, then Equation (BP2) again to compute $\delta^{L-2}$, and so on, all the way back through the network.
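Continuing the sketch style used above (hypothetical names; weights and zs are the per-layer lists from the feedforward pass, and delta_L comes from (BP1)):

def backpropagate_errors(weights, zs, delta_L):
    """Return [delta^2, ..., delta^L], computed via (BP2).

    Relies on np and sigmoid_prime as defined in the sketch above."""
    deltas = [delta_L]
    # Work backward from layer L-1 down to layer 2, applying (BP2)
    for w_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = np.dot(w_next.transpose(), deltas[0]) * sigmoid_prime(z)
        deltas.insert(0, delta)
    return deltas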
An equation for the rate of change of the cost with respect to
any bias in the network: In particular:
\begin{eqnarray} \frac{\partial C}{\partial b^l_j} =
\delta^l_j.
\tag{BP3}\end{eqnarray}
That is, the error $\delta^l_j$ is exactly equal to the rate of
change $\partial C / \partial b^l_j$. This is great news, since
(BP1) and (BP2) have already told us how to compute $\delta^l_j$. We can rewrite (BP3) in shorthand as
\begin{eqnarray}
\frac{\partial C}{\partial b} = \delta,
\tag{31}\end{eqnarray}
where it is understood that $\delta$ is being evaluated at the same
neuron as the bias $b$.
An equation for the rate of change of the cost with respect to
any weight in the network: In particular:
\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}
This tells us how to compute the partial derivatives $\partial C
/ \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and
$a^{l-1}$, which we already know how to compute. The equation can be
rewritten in a less index-heavy notation as
\begin{eqnarray} \frac{\partial
C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}
where it's understood that $a_{\rm in}$ is the activation of the
neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of
the neuron output from the weight $w$. Zooming in to look at just the
weight $w$, and the two neurons connected by that weight, we can
depict this as:
A nice consequence of Equation (32) is
that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx
0$, the gradient term $\partial C / \partial w$ will also tend to be
small. In this case, we'll say the weight learns slowly,
meaning that it's not changing much during gradient descent. In other
words, one consequence of (BP4) is that weights output from low-activation neurons learn slowly.

There are other insights along these lines which can be obtained from (BP1)-(BP4). Let's start by looking at the output layer. Consider the term $\sigma'(z^L_j)$ in (BP1). Recall
from the graph of the sigmoid
function in the last chapter that the $\sigma$ function becomes
very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$. When this
occurs we will have $\sigma'(z^L_j) \approx 0$. And so the lesson is
that a weight in the final layer will learn slowly if the output
neuron is either low activation ($\approx 0$) or high activation
($\approx 1$). In this case it's common to say the output neuron has
saturated and, as a result, the weight has stopped learning (or
is learning slowly). Similar remarks hold also for the biases of output neurons.
We can obtain similar insights for earlier layers. In particular,
note the $\sigma'(z^l)$ term in (BP2). This means that
$\delta^l_j$ is likely to get small if the neuron is near saturation.
And this, in turn, means that any weights input to a saturated neuron
will learn slowly*
*This reasoning won't hold if ${w^{l+1}}^T
\delta^{l+1}$ has large enough entries to compensate for the
smallness of $\sigma'(z^l_j)$. But I'm speaking of the general
tendency..
Summing up, we've learnt that a weight will learn slowly if either the
input neuron is low-activation, or if the output neuron has saturated,
i.e., is either high- or low-activation.
None of these observations is too greatly surprising. Still, they
help improve our mental model of what's going on as a neural network
learns. Furthermore, we can turn this type of reasoning around. The
four fundamental equations turn out to hold for any activation
function, not just the standard sigmoid function (that's because, as
we'll see in a moment, the proofs don't use any special properties of
$\sigma$). And so we can use these equations to design
activation functions which have particular desired learning
properties. As an example to give you the idea, suppose we were to
choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$
is always positive, and never gets close to zero. That would prevent
the slow-down of learning that occurs when ordinary sigmoid neurons
saturate. Later in the book we'll see examples where this kind of
modification is made to the activation function. Keeping the four
equations (BP1)-(BP4) in mind can help explain why such
modifications are tried, and what impact they can have.
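Before moving on to the proofs, here's how (BP3) and (BP4) look in the sketch style used above: given the per-layer errors and activations, the gradient is assembled from quantities we've already computed (hypothetical names, matching the earlier sketches):

def cost_gradients(deltas, activations):
    """Return (nabla_b, nabla_w) via (BP3) and (BP4).

    deltas is [delta^2, ..., delta^L] and activations is
    [a^1, ..., a^L], all (n, 1) Numpy arrays."""
    nabla_b = deltas                                    # (BP3)
    nabla_w = [np.dot(delta, a_in.transpose())          # (BP4)
               for delta, a_in in zip(deltas, activations[:-1])]
    return nabla_b, nabla_w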
Problem
We'll now prove the four fundamental equations (BP1)-(BP4). All four are consequences of the chain rule from multivariable calculus. If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.
Let's begin with Equation (BP1), which gives an expression for the output error, $\delta^L$. To prove this equation, recall that by definition \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial z^L_j}. \tag{36}\end{eqnarray} Applying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations, \begin{eqnarray} \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}, \tag{37}\end{eqnarray} where the sum is over all neurons $k$ in the output layer. Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k = j$. And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq j$. As a result we can simplify the previous equation to \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}. \tag{38}\end{eqnarray} Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the right can be written as $\sigma'(z^L_j)$, and the equation becomes \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j), \tag{39}\end{eqnarray} which is just (BP1), in component form.
Next, we'll prove (BP2), which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$. To do this, we want to rewrite $\delta^l_j = \partial C / \partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$. We can do this using the chain rule, \begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \tag{42}\end{eqnarray} where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of $\delta^{l+1}_k$. To evaluate the first term on the last line, note that \begin{eqnarray} z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k. \tag{43}\end{eqnarray} Differentiating, we obtain \begin{eqnarray} \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44}\end{eqnarray} Substituting back into (42) we obtain \begin{eqnarray} \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}\end{eqnarray} This is just (BP2) written in component form.
The final two equations we want to prove are (BP3) and (BP4). These also follow from the chain rule, in a manner similar to the proofs of the two equations above. I leave them to you as an exercise.
That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated. But it's really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That's all there really is to backpropagation - the rest is details.
The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:

1. Input $x$: Set the corresponding activation $a^{1}$ for the input layer.
2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.
3. Output error $\delta^{L}$: Compute the vector $\delta^{L} = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l})$.
5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
Examining the algorithm you can see why it's called backpropagation. We compute the error vectors $\delta^l$ backward, starting from the final layer. It may seem peculiar that we're going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.
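Assembled into a single routine, the algorithm above looks like this sketch (my own consolidation in the style of the earlier sketches, not the book's backprop method, though it follows the same five steps; the quadratic cost is assumed, so $\nabla_a C = a^L - y$):

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

def backprop_sketch(weights, biases, x, y):
    """Return (nabla_b, nabla_w), the gradient of C_x for one example."""
    # Steps 1-2: input and feedforward, storing activations and z's
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # Step 3: output error, via (BP1), with the quadratic cost
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [delta]
    nabla_w = [np.dot(delta, activations[-2].transpose())]
    # Steps 4-5: backpropagate the error and accumulate the gradient,
    # via (BP2), (BP3) and (BP4)
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l+1].transpose(), delta) * sigmoid_prime(zs[-l])
        nabla_b.insert(0, delta)
        nabla_w.insert(0, np.dot(delta, activations[-l-1].transpose()))
    return nabla_b, nabla_w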
As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

1. Input a set of training examples.
2. For each training example $x$: Set the corresponding input activation $a^{x,1}$, and perform the following steps: Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$. Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.
Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:
class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
In the last chapter we saw how neural networks can
learn their weights and biases using the gradient descent algorithm.
There was, however, a gap in our explanation: we didn't discuss how to
compute the gradient of the cost function. That's quite a gap! In
this chapter I'll explain a fast algorithm for computing such
gradients, an algorithm known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s,
but its importance wasn't fully appreciated until a
famous
1986 paper by
David
Rumelhart,
Geoffrey
Hinton, and
Ronald
Williams. That paper describes several
neural networks where backpropagation works far faster than earlier
approaches to learning, making it possible to use neural nets to solve
problems which had previously been insoluble. Today, the
backpropagation algorithm is the workhorse of learning in neural
networks.
This chapter is more mathematically involved than the rest of the
book. If you're not crazy about mathematics you may be tempted to
skip the chapter, and to treat backpropagation as a black box whose
details you're willing to ignore. Why take the time to study those
details?
The reason, of course, is understanding. At the heart of
backpropagation is an expression for the partial derivative $\partial
C / \partial w$ of the cost function $C$ with respect to any weight
$w$ (or bias $b$) in the network. The expression tells us how quickly
the cost changes when we change the weights and biases. And while the
expression is somewhat complex, it also has a beauty to it, with each
element having a natural, intuitive interpretation. And so
backpropagation isn't just a fast algorithm for learning. It actually
gives us detailed insights into how changing the weights and biases
changes the overall behaviour of the network. That's well worth
studying in detail.
With that said, if you want to skim the chapter, or jump straight to
the next chapter, that's fine. I've written the rest of the book to
be accessible even if you treat backpropagation as a black box. There
are, of course, points later in the book where I refer back to results
from this chapter. But at those points you should still be able to
understand the main conclusions, even if you don't follow all the
reasoning.
Warm up: a fast matrix-based approach to computing the output
from a neural network
Before discussing backpropagation, let's warm up with a fast
matrix-based algorithm to compute the output from a neural network.
We actually already briefly saw this algorithm
near
the end of the last chapter, but I described it quickly, so it's
worth revisiting in detail. In particular, this is a good way of
getting comfortable with the notation used in backpropagation, in a
familiar context.
Let's begin with a notation which lets us refer to weights in the
network in an unambiguous way. We'll use $w^l_{jk}$ to denote the
weight for the connection from the $k^{\rm th}$ neuron in the
$(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$
layer. So, for example, the diagram below shows the weight on a
connection from the fourth neuron in the second layer to the second
neuron in the third layer of a network:
This notation is cumbersome at first, and it does take some work to
master. But with a little effort you'll find the notation becomes
easy and natural. One quirk of the notation is the ordering of the
$j$ and $k$ indices. You might think that it makes more sense to use
$j$ to refer to the input neuron, and $k$ to the output neuron, not
vice versa, as is actually done. I'll explain the reason for this
quirk below.We use a similar notation for the network's biases and activations.
Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in
the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the
$j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram
shows examples of these notations in use:
With these notations, the activation $a^{l}_j$ of the $j^{\rm th}$
neuron in the $l^{\rm th}$ layer is related to the activations in the
$(l-1)^{\rm th}$ layer by the equation (compare
Equation (4)\begin{eqnarray}
\frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray} and surrounding
discussion in the last chapter)
\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),
\tag{23}\end{eqnarray}
where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To
rewrite this expression in a matrix form we define a weight
matrix $w^l$ for each layer, $l$. The entries of the weight matrix
$w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons,
that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$.
Similarly, for each layer $l$ we define a bias vector, $b^l$.
You can probably guess how this works - the components of the bias
vector are just the values $b^l_j$, one component for each neuron in
the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$
whose components are the activations $a^l_j$.The last ingredient we need to rewrite (23)\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right) \nonumber\end{eqnarray} in a
matrix form is the idea of vectorizing a function such as $\sigma$.
We met vectorization briefly in the last chapter, but to recap, the
idea is that we want to apply a function such as $\sigma$ to every
element in a vector $v$. We use the obvious notation $\sigma(v)$ to
denote this kind of elementwise application of a function. That is,
the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.
As an example, if we have the function $f(x) = x^2$ then the
vectorized form of $f$ has the effect
\begin{eqnarray}
f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)
= \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]
= \left[ \begin{array}{c} 4 \\ 9 \end{array} \right],
\tag{24}\end{eqnarray}
that is, the vectorized $f$ just squares every element of the vector.
With these notations in mind, Equation (23)\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right) \nonumber\end{eqnarray} can
be rewritten in the beautiful and compact vectorized form
\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l).
\tag{25}\end{eqnarray}
This expression gives us a much more global way of thinking about how
the activations in one layer relate to activations in the previous
layer: we just apply the weight matrix to the activations, then add
the bias vector, and finally apply the $\sigma$ function*
*By the way, it's this expression that
motivates the quirk in the $w^l_{jk}$ notation mentioned earlier.
If we used $j$ to index the input neuron, and $k$ to index the
output neuron, then we'd need to replace the weight matrix in
Equation (25)\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray} by the transpose of the
weight matrix. That's a small change, but annoying, and we'd lose
the easy simplicity of saying (and thinking) "apply the weight
matrix to the activations".. That global view is often easier and
more succinct (and involves fewer indices!) than the neuron-by-neuron
view we've taken to now. Think of it as a way of escaping index hell,
while remaining precise about what's going on. The expression is also
useful in practice, because most matrix libraries provide fast ways of
implementing matrix multiplication, vector addition, and
vectorization. Indeed, the
code
in the last chapter made implicit use of this expression to compute
the behaviour of the network.
When using Equation (25)\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray} to compute $a^l$,
we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$
along the way. This quantity turns out to be useful enough to be
worth naming: we call $z^l$ the weighted input to the neurons
in layer $l$. We'll make considerable use of the weighted input $z^l$
later in the chapter. Equation (25)\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray} is
sometimes written in terms of the weighted input, as $a^l =
\sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j
= \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, that is, $z^l_j$ is just the
weighted input to the activation function for neuron $j$ in layer $l$.
The two assumptions we need about the cost function
The goal of backpropagation is to compute the partial derivatives
$\partial C / \partial w$ and $\partial C / \partial b$ of the cost
function $C$ with respect to any weight $w$ or bias $b$ in the
network. For backpropagation to work we need to make two main
assumptions about the form of the cost function. Before stating those
assumptions, though, it's useful to have an example cost function in
mind. We'll use the quadratic cost function from last chapter
(c.f. Equation (6)\begin{eqnarray} C(w,b) \equiv
\frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}). In the notation of
the last section, the quadratic cost has the form
\begin{eqnarray}
C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
where: $n$ is the total number of training examples; the sum is over
individual training examples, $x$; $y = y(x)$ is the corresponding
desired output; $L$ denotes the number of layers in the network; and
$a^L = a^L(x)$ is the vector of activations output from the network
when $x$ is input.
Okay, so what assumptions do we need to make about our cost function,
$C$, in order that backpropagation can be applied? The first
assumption we need is that the cost function can be written as an
average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for
individual training examples, $x$. This is the case for the quadratic
cost function, where the cost for a single training example is $C_x =
\frac{1}{2} \|y-a^L \|^2$. This assumption will also hold true for
all the other cost functions we'll meet in this book.
The reason we need this assumption is because what backpropagation
actually lets us do is compute the partial derivatives $\partial C_x
/ \partial w$ and $\partial C_x / \partial b$ for a single training
example. We then recover $\partial C / \partial w$ and $\partial C
/ \partial b$ by averaging over training examples. In fact, with this
assumption in mind, we'll suppose the training example $x$ has been
fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$.
We'll eventually put the $x$ back in, but for now it's a notational
nuisance that is better left implicit.
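To make the averaging assumption concrete, here's a small Numpy sketch (the helper names are hypothetical) of the quadratic cost written as an average over per-example costs $C_x$:

import numpy as np

def quadratic_cost_x(a_L, y):
    """The per-example cost C_x = (1/2) ||y - a^L||^2."""
    return 0.5 * np.linalg.norm(y - a_L) ** 2

def quadratic_cost(outputs, targets):
    """The full cost C = (1/n) sum_x C_x, an average over training examples."""
    return np.mean([quadratic_cost_x(a, y) for a, y in zip(outputs, targets)])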
The second assumption we make about the cost is that it can be written
as a function of the output activations $a^L$ from the neural network.
For example, the quadratic cost function satisfies this requirement,
since the quadratic cost for a single training example $x$ may be
written as
\begin{eqnarray}
C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\tag{27}\end{eqnarray}
and thus is a function of the output activations. Of course, this
cost function also depends on the desired output $y$, and you may
wonder why we're not regarding the cost also as a function of $y$.
Remember, though, that the input training example $x$ is fixed, and so
the output $y$ is also a fixed parameter. In particular, it's not
something we can modify by changing the weights and biases in any way,
i.e., it's not something which the neural network learns. And so it
makes sense to regard $C$ as a function of the output activations
$a^L$ alone, with $y$ merely a parameter that helps define that
function.

The Hadamard product, $s \odot t$
The backpropagation algorithm is based on common linear algebraic
operations - things like vector addition, multiplying a vector by a
matrix, and so on. But one of the operations is a little less
commonly used. In particular, suppose $s$ and $t$ are two vectors of
the same dimension. Then we use $s \odot t$ to denote the
elementwise product of the two vectors. Thus the components of
$s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,
\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right]
\odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}
This kind of elementwise multiplication is sometimes called the
Hadamard product or Schur product. We'll refer to it as
the Hadamard product. Good matrix libraries usually provide fast
implementations of the Hadamard product, and that comes in handy when
implementing backpropagation.
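In Numpy, for instance, no special routine is needed, since the ordinary * operator on arrays is already elementwise:

import numpy as np

s = np.array([[1], [2]])
t = np.array([[3], [4]])
print(s * t)  # the Hadamard product: [[3], [8]]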
The four fundamental equations behind backpropagation
Backpropagation is about understanding how changing the weights and
biases in a network changes the cost function. Ultimately, this means
computing the partial derivatives $\partial C / \partial w^l_{jk}$ and
$\partial C / \partial b^l_j$. But to compute those, we first
introduce an intermediate quantity, $\delta^l_j$, which we call the
error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
Backpropagation will give us a procedure to compute the error
$\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C
/ \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.
To understand how the error is defined, imagine there is a demon in
our neural network:
The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the
neuron comes in, the demon messes with the neuron's operation. It
adds a little change $\Delta z^l_j$ to the neuron's weighted input, so
that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs
$\sigma(z^l_j+\Delta z^l_j)$. This change propagates through later
layers in the network, finally causing the overall cost to change by
an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

Now, this demon is a good demon, and is trying to help you improve the
cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the
cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large
value (either positive or negative). Then the demon can lower the
cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign
to $\frac{\partial C}{\partial z^l_j}$. By contrast, if
$\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon
can't improve the cost much at all by perturbing the weighted input
$z^l_j$. So far as the demon can tell, the neuron is already pretty
near optimal*
*This is only the case for small changes $\Delta
z^l_j$, of course. We'll assume that the demon is constrained to
make such small changes.. And so there's a heuristic sense in
which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in
the neuron.
Motivated by this story, we define the error $\delta^l_j$ of neuron
$j$ in layer $l$ by
\begin{eqnarray}
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.
\tag{29}\end{eqnarray}
As per our usual conventions, we use $\delta^l$ to denote the vector
of errors associated with layer $l$. Backpropagation will give us a
way of computing $\delta^l$ for every layer, and then relating those
errors to the quantities of real interest, $\partial C / \partial
w^l_{jk}$ and $\partial C / \partial b^l_j$.
You might wonder why the demon is changing the weighted input $z^l_j$.
Surely it'd be more natural to imagine the demon changing the output
activation $a^l_j$, with the result that we'd be using $\frac{\partial
C}{\partial a^l_j}$ as our measure of error. In fact, if you do
this things work out quite similarly to the discussion below. But it
turns out to make the presentation of backpropagation a little more
algebraically complicated. So we'll stick with $\delta^l_j =
\frac{\partial C}{\partial z^l_j}$ as our measure of error*
*In
classification problems like MNIST the term "error" is sometimes
used to mean the classification failure rate. E.g., if the neural
net correctly classifies 96.0 percent of the digits, then the error
is 4.0 percent. Obviously, this has quite a different meaning from
our $\delta$ vectors. In practice, you shouldn't have trouble
telling which meaning is intended in any given usage..
Plan of attack: Backpropagation is based around four
fundamental equations. Together, those equations give us a way of
computing both the error $\delta^l$ and the gradient of the cost
function. I state the four equations below. Be warned, though: you
shouldn't expect to instantaneously assimilate the equations. Such an
expectation will lead to disappointment. In fact, the backpropagation
equations are so rich that understanding them well requires
considerable time and patience as you gradually delve deeper into the
equations. The good news is that such patience is repaid many times
over. And so the discussion in this section is merely a beginning,
helping you on the way to a thorough understanding of the equations.
Here's a preview of the ways we'll delve more deeply into the
equations later in the chapter: I'll
give
a short proof of the equations, which helps explain why they are
true; we'll restate
the equations in algorithmic form as pseudocode, and
see how the
pseudocode can be implemented as real, running Python code; and, in
the final
section of the chapter, we'll develop an intuitive picture of what
the backpropagation equations mean, and how someone might discover
them from scratch. Along the way we'll return repeatedly to the four
fundamental equations, and as you deepen your understanding those
equations will come to seem comfortable and, perhaps, even beautiful
and natural.
An equation for the error in the output layer, $\delta^L$:
The components of $\delta^L$ are given by
\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
\tag{BP1}\end{eqnarray}
This is a very natural expression. The first term on the right,
$\partial C / \partial a^L_j$, just measures how fast the cost is
changing as a function of the $j^{\rm th}$ output activation. If, for
example, $C$ doesn't depend much on a particular output neuron, $j$,
then $\delta^L_j$ will be small, which is what we'd expect. The
second term on the right, $\sigma'(z^L_j)$, measures how fast the
activation function $\sigma$ is changing at $z^L_j$.
Notice that everything in (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} is easily computed. In
particular, we compute $z^L_j$ while computing the behaviour of the
network, and it's only a small additional overhead to compute
$\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$
will, of course, depend on the form of the cost function. However,
provided the cost function is known there should be little trouble
computing $\partial C / \partial a^L_j$. For example, if we're using
the quadratic cost function then $C = \frac{1}{2} \sum_j
(y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a_j^L-y_j)$,
which obviously is easily computable.
Equation (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} is a componentwise expression for $\delta^L$.
It's a perfectly good expression, but not the matrix-based form we
want for backpropagation. However, it's easy to rewrite the equation
in a matrix-based form, as
\begin{eqnarray}
\delta^L = \nabla_a C \odot \sigma'(z^L).
\tag{BP1a}\end{eqnarray}
Here, $\nabla_a C$ is defined to be a vector whose components are the
partial derivatives $\partial C / \partial a^L_j$. You can think of
$\nabla_a C$ as expressing the rate of change of $C$ with respect to
the output activations. It's easy to see that Equations (BP1a)\begin{eqnarray}
\delta^L = \nabla_a C \odot \sigma'(z^L) \nonumber\end{eqnarray}
and (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} are equivalent, and for that reason from now on we'll
use (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} interchangeably to refer to both equations. As an
example, in the case of the quadratic cost we have $\nabla_a C =
(a^L-y)$, and so the fully matrix-based form of (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} becomes
\begin{eqnarray}
\delta^L = (a^L-y) \odot \sigma'(z^L).
\tag{30}\end{eqnarray}
As you can see, everything in this expression has a nice vector form,
and is easily computed using a library such as Numpy.
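For instance, here's a sketch of how (30) might be computed in Numpy (the function name is my own invention; a_L, y, and z_L are column vectors for the output layer):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1.0 - sigmoid(z))

def output_error(a_L, y, z_L):
    """Equation (30): delta^L = (a^L - y) * sigma'(z^L), the
    quadratic-cost instance of (BP1a)."""
    return (a_L - y) * sigmoid_prime(z_L)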
An equation for the error $\delta^l$ in terms of the error in
the next layer, $\delta^{l+1}$: In particular
\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}
where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for
the $(l+1)^{\rm th}$ layer. This equation appears complicated, but
each element has a nice interpretation. Suppose we know the error
$\delta^{l+1}$ at the $(l+1)^{\rm th}$ layer. When we apply the
transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of
this as moving the error backward through the network, giving
us some sort of measure of the error at the output of the $l^{\rm th}$
layer. We then take the Hadamard product $\odot \sigma'(z^l)$. This
moves the error backward through the activation function in layer $l$,
giving us the error $\delta^l$ in the weighted input to layer $l$.
By combining (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} with (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} we can compute the error
$\delta^l$ for any layer in the network. We start by
using (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} to compute $\delta^L$, then apply
Equation (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} to compute $\delta^{L-1}$, then
Equation (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} again to compute $\delta^{L-2}$, and so on, all
the way back through the network.
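In the same style, (BP2) is a transpose-matrix multiplication followed by a Hadamard product. A sketch, again with my own names and assumed column-vector shapes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backpropagate_error(w_next, delta_next, z):
    """Equation (BP2): delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l).
    w_next is w^{l+1}, delta_next is delta^{l+1}, z is z^l."""
    return np.dot(w_next.transpose(), delta_next) * sigmoid_prime(z)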
An equation for the rate of change of the cost with respect to
any bias in the network: In particular:
\begin{eqnarray} \frac{\partial C}{\partial b^l_j} =
\delta^l_j.
\tag{BP3}\end{eqnarray}
That is, the error $\delta^l_j$ is exactly equal to the rate of
change $\partial C / \partial b^l_j$. This is great news, since
(BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} and (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} have already told us how to compute
$\delta^l_j$. We can rewrite (BP3)\begin{eqnarray} \frac{\partial C}{\partial b^l_j} =
\delta^l_j \nonumber\end{eqnarray} in shorthand as
\begin{eqnarray}
\frac{\partial C}{\partial b} = \delta,
\tag{31}\end{eqnarray}
where it is understood that $\delta$ is being evaluated at the same
neuron as the bias $b$.
An equation for the rate of change of the cost with respect to
any weight in the network: In particular:
\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}
This tells us how to compute the partial derivatives $\partial C
/ \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and
$a^{l-1}$, which we already know how to compute. The equation can be
rewritten in a less index-heavy notation as
\begin{eqnarray} \frac{\partial
C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}
where it's understood that $a_{\rm in}$ is the activation of the
neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of
the neuron output from the weight $w$. Zooming in to look at just the
weight $w$, and the two neurons connected by that weight, we can
depict this as:
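In code, equations (BP3) and (BP4) are equally direct. Here's a sketch (my own names: delta is the column vector $\delta^l$, a_prev is $a^{l-1}$), in which (BP4) becomes an outer product:

import numpy as np

def gradients_for_layer(delta, a_prev):
    """Equations (BP3) and (BP4): the bias gradient is delta itself, and
    the weight gradient has entries dC/dw^l_{jk} = a^{l-1}_k delta^l_j,
    i.e. the outer product of delta with the previous activations."""
    nabla_b = delta                              # (BP3)
    nabla_w = np.dot(delta, a_prev.transpose())  # (BP4)
    return nabla_b, nabla_w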
A nice consequence of Equation (32)\begin{eqnarray} \frac{\partial
C}{\partial w} = a_{\rm in} \delta_{\rm out} \nonumber\end{eqnarray} is
that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx
0$, the gradient term $\partial C / \partial w$ will also tend to be
small. In this case, we'll say the weight learns slowly,
meaning that it's not changing much during gradient descent. In other
words, one consequence of (BP4)\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray} is that weights output from
low-activation neurons learn slowly.

There are other insights along these lines which can be obtained
from (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}-(BP4)\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}. Let's start by looking at the output
layer. Consider the term $\sigma'(z^L_j)$ in (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}. Recall
from the graph of the sigmoid
function in the last chapter that the $\sigma$ function becomes
very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$. When this
occurs we will have $\sigma'(z^L_j) \approx 0$. And so the lesson is
that a weight in the final layer will learn slowly if the output
neuron is either low activation ($\approx 0$) or high activation
($\approx 1$). In this case it's common to say the output neuron has
saturated and, as a result, the weight has stopped learning (or
is learning slowly). Similar remarks hold also for the biases of
output neurons.
We can obtain similar insights for earlier layers. In particular,
note the $\sigma'(z^l)$ term in (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}. This means that
$\delta^l_j$ is likely to get small if the neuron is near saturation.
And this, in turn, means that any weights input to a saturated neuron
will learn slowly*
*This reasoning won't hold if ${w^{l+1}}^T
\delta^{l+1}$ has large enough entries to compensate for the
smallness of $\sigma'(z^l_j)$. But I'm speaking of the general
tendency..
Summing up, we've learnt that a weight will learn slowly if either the
input neuron is low-activation, or if the output neuron has saturated,
i.e., is either high- or low-activation.
None of these observations is too greatly surprising. Still, they
help improve our mental model of what's going on as a neural network
learns. Furthermore, we can turn this type of reasoning around. The
four fundamental equations turn out to hold for any activation
function, not just the standard sigmoid function (that's because, as
we'll see in a moment, the proofs don't use any special properties of
$\sigma$). And so we can use these equations to design
activation functions which have particular desired learning
properties. As an example to give you the idea, suppose we were to
choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$
is always positive, and never gets close to zero. That would prevent
the slow-down of learning that occurs when ordinary sigmoid neurons
saturate. Later in the book we'll see examples where this kind of
modification is made to the activation function. Keeping the four
equations (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}-(BP4)\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray} in mind can help explain why such
modifications are tried, and what impact they can have.
Proof of the four fundamental equations
We'll now prove the four fundamental equations (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}-(BP4)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}. All four are consequences of the chain rule from multivariable calculus. If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.
Let's begin with Equation (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}, which gives an expression for the output error, $\delta^L$. To prove this equation, recall that by definition \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial z^L_j}. \tag{36}\end{eqnarray} Applying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations, \begin{eqnarray} \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}, \tag{37}\end{eqnarray} where the sum is over all neurons $k$ in the output layer. Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k = j$. And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq j$. As a result we can simplify the previous equation to \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}. \tag{38}\end{eqnarray} Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the right can be written as $\sigma'(z^L_j)$, and the equation becomes \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j), \tag{39}\end{eqnarray} which is just (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}, in component form.
Next, we'll prove (BP2)\begin{eqnarray} \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}, which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$. To do this, we want to rewrite $\delta^l_j = \partial C / \partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$. We can do this using the chain rule, \begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \tag{42}\end{eqnarray} where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of $\delta^{l+1}_k$. To evaluate the first term on the last line, note that \begin{eqnarray} z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k. \tag{43}\end{eqnarray} Differentiating, we obtain \begin{eqnarray} \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44}\end{eqnarray} Substituting back into (42)\begin{eqnarray} & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \nonumber\end{eqnarray} we obtain \begin{eqnarray} \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}\end{eqnarray} This is just (BP2)\begin{eqnarray} \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} written in component form.
The final two equations we want to prove are (BP3)\begin{eqnarray} \frac{\partial C}{\partial b^l_j} = \delta^l_j \nonumber\end{eqnarray} and (BP4)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}. These also follow from the chain rule, in a manner similar to the proofs of the two equations above. I leave them to you as an exercise.
That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated. But it's really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That's all there really is to backpropagation - the rest is details.
The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:
1. Input $x$: Set the corresponding activation $a^{1}$ for the input layer.
2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.
3. Output error $\delta^{L}$: Compute the vector $\delta^{L} = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l})$.
5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
Examining the algorithm you can see why it's called backpropagation. We compute the error vectors $\delta^l$ backward, starting from the final layer. It may seem peculiar that we're going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.
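Putting the five steps together for a single training example gives the following sketch. This is my own condensed version, not the book's network.py (whose backprop method we'll meet in a moment), and it assumes the quadratic cost, but it follows the same structure:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(weights, biases, x, y):
    """Return (nabla_b, nabla_w), the gradient of the cost C_x for a
    single training example, computed using (BP1)-(BP4)."""
    # Feedforward: store all weighted inputs z^l and activations a^l.
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    # Output error, Equation (BP1), for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta                                       # (BP3)
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())  # (BP4)
    # Backpropagate the error with (BP2), layer by layer.
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l+1].transpose(), delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    return nabla_b, nabla_w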
As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:
1. Input a set of training examples.
2. For each training example $x$: Set the corresponding input activation $a^{x,1}$, and perform the following steps: Feedforward: for each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$. Output error $\delta^{x,L}$: compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$. Backpropagate the error: for each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.
Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:
class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
...
def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
In what sense is backpropagation a fast algorithm? To answer this question, let's consider another approach to computing the gradient. Imagine it's the early days of neural networks research. Maybe it's the 1950s or 1960s, and you're the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient. But after playing around a bit, the algebra looks complicated, and you get discouraged. So you try to find another approach. You decide to regard the cost as a function of the weights $C = C(w)$ alone (we'll get back to the biases in a moment). You number the weights $w_1, w_2, \ldots$, and want to compute $\partial C / \partial w_j$ for some particular weight $w_j$. An obvious way of doing that is to use the approximation \begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon}, \tag{46}\end{eqnarray} where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j^{\rm th}$ direction. In other words, we can estimate $\partial C / \partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$, and then applying Equation (46)\begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}. The same idea will let us compute the partial derivatives $\partial C / \partial b$ with respect to the biases.
This approach looks very promising. It's simple conceptually, and extremely easy to implement, using just a few lines of code. Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!
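Indeed, here's roughly what those few lines might look like (a sketch; cost stands for any function mapping the flattened weight vector to the cost):

import numpy as np

def numerical_gradient(cost, w, epsilon=1e-5):
    """Estimate dC/dw_j for every weight using Equation (46).  Requires
    one evaluation of the cost per weight, plus one for the base point."""
    base = cost(w)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        w_step = w.copy()
        w_step[j] += epsilon
        grad[j] = (cost(w_step) - base) / epsilon
    return grad

# Toy check: for C(w) = (1/2)||w||^2 the gradient is w itself.
print(numerical_gradient(lambda w: 0.5 * np.sum(w**2),
                         np.array([1.0, -2.0, 3.0])))  # ~ [1, -2, 3]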
Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w+\epsilon e_j)$ in order to compute $\partial C / \partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that's a total of a million and one passes through the network.
What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C / \partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass* *This should be plausible, but it requires some analysis to make a careful statement. It's plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it's multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (46)\begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}! And so even though backpropagation appears superficially more complex than the approach based on (46)\begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}, it's actually much, much faster.
This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we'll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.
As I've explained it, backpropagation presents two mysteries. First, what's the algorithm really doing? We've developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place? It's one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn't mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I'll address both these mysteries.
To improve our intuition about what the algorithm is doing, let's imagine that we've made a small change $\Delta w^l_{jk}$ to some weight in the network, $w^l_{jk}$:
The change $\Delta w^l_{jk}$ will cause a change in the output activation from the corresponding neuron. That, in turn, will cause changes in all the activations in the next layer, and those changes will propagate layer by layer, all the way through to the final layer, and then to the cost: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{47}\end{eqnarray} This suggests that a possible approach to computing $\partial C / \partial w^l_{jk}$ is to carefully track how a small change in $w^l_{jk}$ propagates to cause a small change in $C$. If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute $\partial C / \partial w^l_{jk}$.

Let's try to carry this out. The change $\Delta w^l_{jk}$ causes a small change $\Delta a^{l}_j$ in the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. This change is given by \begin{eqnarray} \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{48}\end{eqnarray} The change in activation $\Delta a^l_{j}$ will cause changes in all the activations in the next layer, i.e., the $(l+1)^{\rm th}$ layer. We'll concentrate on the way just a single one of those activations is affected, say $a^{l+1}_q$.
What I've been providing up to now is a heuristic argument, a way of thinking about what's going on when you perturb a weight in a network. Let me sketch out a line of thinking you could use to further develop this argument. First, you could derive explicit expressions for all the individual partial derivatives in Equation (53)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}. That's easy to do with a bit of calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight. After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm! And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths. Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.
Now, I'm not going to work through all this here. It's messy and requires considerable care to work through all the details. If you're up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.
What about the other mystery - how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier* *There is one clever step required. In Equation (53)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray} the intermediate variables are activations like $a_q^{l+1}$. The clever idea is to switch to using weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If you don't have this idea, and instead continue using the activations $a^{l+1}_q$, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter. - short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It's just a lot of hard work simplifying the proof I've sketched in this section.
Improving the way neural networks learn

When a golf player is first learning to play golf, they usually spend most of their time developing a basic swing. Only gradually do they develop other shots, learning to chip, draw and fade the ball, building on and modifying their basic swing. In a similar way, up to now we've focused on understanding the backpropagation algorithm. It's our "basic swing", the foundation for learning in most work on neural networks. In this chapter I explain a suite of techniques which can be used to improve on our vanilla implementation of backpropagation, and so improve the way our networks learn.
The techniques we'll develop in this chapter include: a better choice of cost function, known as the cross-entropy cost function; four so-called "regularization" methods (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data; a better method for initializing the weights in the network; and a set of heuristics to help choose good hyper-parameters for the network. I'll also overview several other techniques in less depth. The discussions are largely independent of one another, and so you may jump ahead if you wish. We'll also implement many of the techniques in running code, and use them to improve the results obtained on the handwriting classification problem studied in Chapter 1.
Of course, we're only covering a few of the many, many techniques which have been developed for use in neural nets. The philosophy is that the best entree to the plethora of available techniques is in-depth study of a few of the most important. Mastering those important techniques is not just useful in its own right, but will also deepen your understanding of what problems can arise when you use neural networks. That will leave you well prepared to quickly pick up other techniques, as you need them.
The cross-entropy cost function

Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn't continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we're decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our errors are less well-defined.
Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let's look at a toy example. The example involves a neuron with just one input:
We'll train this neuron to do something ridiculously easy: take the input $1$ to the output $0$. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let's take a look at how the neuron learns.
To make things definite, I'll pick the initial weight to be $0.6$ and the initial bias to be $0.9$. These are generic choices used as a place to begin learning; I wasn't picking them to be special in any way. The initial output from the neuron is $0.82$, so quite a bit of learning will be needed before our neuron gets near the desired output, $0.0$. Click on "Run" in the bottom right corner below to see how the neuron learns an output much closer to $0.0$. Note that this isn't a pre-recorded animation; your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is $\eta = 0.15$, which turns out to be slow enough that we can follow what's happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, $C$, introduced back in Chapter 1. I'll remind you of the exact form of the cost function shortly, so there's no need to go and dig up the definition. Note that you can run the animation multiple times by clicking on "Run" again.
As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about $0.09$. That's not quite the desired output, $0.0$, but it is pretty good. Suppose, however, that we instead choose both the starting weight and the starting bias to be $2.0$. In this case the initial output is $0.98$, which is very badly wrong. Let's look at how the neuron learns to output $0$ in this case. Click on "Run" again:
Although this example uses the same learning rate ($\eta = 0.15$), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weight and bias don't change much at all. Then the learning kicks in and, much as in our first example, the neuron's output rapidly moves closer to $0.0$.
This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we're badly wrong about something. But we've just seen that our artificial neuron has a lot of difficulty learning when it's badly wrong - far more difficulty than when it's just a little wrong. What's more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?
To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, $\partial C/\partial w$ and $\partial C / \partial b$. So saying "learning is slow" is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let's compute the partial derivatives. Recall that we're using the quadratic cost function, which, from Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}, is given by \begin{eqnarray} C = \frac{(y-a)^2}{2}, \tag{54}\end{eqnarray} where $a$ is the neuron's output when the training input $x = 1$ is used, and $y = 0$ is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that $a = \sigma(z)$, where $z = wx+b$. Using the chain rule to differentiate with respect to the weight and bias we get \begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{55}\\ \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z), \tag{56}\end{eqnarray} where I have substituted $x = 1$ and $y = 0$. To understand the behaviour of these expressions, let's look more closely at the $\sigma'(z)$ term on the right-hand side. Recall the shape of the $\sigma$ function:
We can see from this graph that when the neuron's output is close to $1$, the curve gets very flat, and so $\sigma'(z)$ gets very small. Equations (55)\begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray} and (56)\begin{eqnarray} \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray} then tell us that $\partial C / \partial w$ and $\partial C / \partial b$ get very small. This is the origin of the learning slowdown. What's more, as we shall see a little later, the learning slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we've been playing with.
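If you'd like to see the slowdown numerically, here's a minimal sketch of the toy setup (input $x = 1$, target $y = 0$, quadratic cost, $\eta = 0.15$), using Equations (55) and (56) for the gradient:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, eta = 2.0, 2.0, 0.15   # the "badly wrong" starting point
for epoch in range(301):
    a = sigmoid(w + b)                 # output for input x = 1
    grad = a * a * (1 - a)             # Equations (55)-(56): a * sigma'(z)
    w, b = w - eta * grad, b - eta * grad
    if epoch % 100 == 0:
        print(epoch, round(a, 3))      # the output barely moves at first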
How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let's move a little away from our super-simple toy model. We'll suppose instead that we're trying to train a neuron with several input variables, $x_1, x_2, \ldots$, corresponding weights $w_1, w_2, \ldots$, and a bias, $b$:
The output from the neuron is, of course, $a = \sigma(z)$, where $z = \sum_j w_j x_j+b$ is the weighted sum of the inputs. We define the cross-entropy cost function for this neuron by \begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right], \tag{57}\end{eqnarray} where $n$ is the total number of items of training data, the sum is over all training inputs, $x$, and $y$ is the corresponding desired output. It's not obvious that the expression (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray} fixes the learning slowdown problem. In fact, frankly, it's not even obvious that it makes sense to call this a cost function! Before addressing the learning slowdown, let's see in what sense the cross-entropy can be interpreted as a cost function.
Two properties in particular make it reasonable to interpret the cross-entropy as a cost function. First, it's non-negative, that is, $C \geq 0$. To see this, notice that: (a) all the individual terms in the sum in (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray} are non-positive, since both logarithms are of numbers in the range $0$ to $1$; and (b) there is a minus sign out the front of the sum.
Second, if the neuron's actual output is close to the desired output for all training inputs, $x$, then the cross-entropy will be close to zero* *To prove this I will need to assume that the desired outputs $y$ are all either $0$ or $1$. This is usually the case when solving classification problems, for example, or when computing Boolean functions. To understand what happens when we don't make this assumption, see the exercises at the end of this section.. To see this, suppose for example that $y = 0$ and $a \approx 0$ for some input $x$. This is a case when the neuron is doing a good job on that input. We see that the first term in the expression (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray} for the cost vanishes, since $y = 0$, while the second term is just $-\ln (1-a) \approx 0$. A similar analysis holds when $y = 1$ and $a \approx 1$. And so the contribution to the cost will be low provided the actual output is close to the desired output.
Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, $y$, for all training inputs, $x$. These are both properties we'd intuitively expect for a cost function. Indeed, both properties are also satisfied by the quadratic cost. So that's good news for the cross-entropy. But the cross-entropy cost function has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let's compute the partial derivative of the cross-entropy cost with respect to the weights. We substitute $a = \sigma(z)$ into (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}, and apply the chain rule twice, obtaining: \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left( \frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{58}\\ & = & -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j. \tag{59}\end{eqnarray} Putting everything over a common denominator and simplifying this becomes: \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))} (\sigma(z)-y). \tag{60}\end{eqnarray} Using the definition of the sigmoid function, $\sigma(z) = 1/(1+e^{-z})$, and a little algebra we can show that $\sigma'(z) = \sigma(z)(1-\sigma(z))$. I'll ask you to verify this in an exercise below, but for now let's accept it as given. We see that the $\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$ terms cancel in the equation just above, and it simplifies to become: \begin{eqnarray} \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y). \tag{61}\end{eqnarray} This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by $\sigma(z)-y$, i.e., by the error in the output. The larger the error, the faster the neuron will learn. This is just what we'd intuitively expect. In particular, it avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation (55)\begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray}. When we use the cross-entropy, the $\sigma'(z)$ term gets canceled out, and we no longer need worry about it being small. This cancellation is the special miracle ensured by the cross-entropy cost function. Actually, it's not really a miracle. As we'll see later, the cross-entropy was specially chosen to have just this property.
In a similar way, we can compute the partial derivative for the bias. I won't go through all the details again, but you can easily verify that \begin{eqnarray} \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y). \tag{62}\end{eqnarray} Again, this avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation (56)\begin{eqnarray} \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}.
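Here's a sketch of the cross-entropy cost and the weight gradient of Equation (61) in Numpy (my own helper names, with shapes noted in the comments); the absence of any $\sigma'(z)$ factor is visible directly in the code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(a, y):
    """Equation (57): C = -(1/n) sum_x [y ln a + (1-y) ln(1-a)]."""
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def cross_entropy_grad_w(x, z, y):
    """Equation (61): dC/dw_j = (1/n) sum_x x_j (sigma(z) - y).
    Shapes: x is (num_inputs, n); z and y are (1, n)."""
    return np.mean(x * (sigmoid(z) - y), axis=1, keepdims=True)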
Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight $0.6$ and starting bias $0.9$. Press "Run" to see what happens when we replace the quadratic cost by the cross-entropy:
Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before, with the weight and bias both starting at $2.0$:
Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the initial flat region on the corresponding curve for the quadratic cost. It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.
I didn't say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used $\eta = 0.15$. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it's not possible to say precisely what it means to use the "same" learning rate; it's an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you're still curious, despite my disavowal, here's the lowdown: I used $\eta = 0.005$ in the examples just given.
You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn't about the absolute speed of learning. It's about how the speed of learning changes. In particular, when we use the quadratic cost learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don't depend on how the learning rate is set.
We've been studying the cross-entropy for a single neuron. However, it's easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose $y = y_1, y_2, \ldots$ are the desired values at the output neurons, i.e., the neurons in the final layer, while $a^L_1, a^L_2, \ldots$ are the actual output values. Then we define the cross-entropy by \begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]. \tag{63}\end{eqnarray} This is the same as our earlier expression, Equation (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}, except now we've got the $\sum_j$ summing over all the output neurons. I won't explicitly work through a derivation, but it should be plausible that using the expression (63)\begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] \nonumber\end{eqnarray} avoids a learning slowdown in many-neuron networks. If you're interested, you can work through the derivation in the problem below.
When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we're setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input - that is, an output neuron will have saturated near $1$, when it should be $0$, or vice versa. If we're using the quadratic cost that will slow down learning. It won't stop learning completely, since the weights will continue learning from other training inputs, but it's obviously undesirable.
The cross-entropy is easy to implement as part of a program which
learns using gradient descent and backpropagation. We'll do that
later in the
chapter, developing an improved version of our
earlier
program for classifying the MNIST handwritten digits,
network.py. The new program is called network2.py, and
incorporates not just the cross-entropy, but also several other
techniques developed in this chapter*
*The code is available
on
GitHub.. For now, let's look at how well our new program
classifies MNIST digits. As was the case in Chapter 1, we'll use a
network with $30$ hidden neurons, and we'll use a mini-batch size of
$10$. We set the learning rate to $\eta = 0.5$*
*In Chapter 1
we used the quadratic cost and a learning rate of $\eta = 3.0$. As
discussed above, it's not possible to say precisely what it means to
use the "same" learning rate when the cost function is changed.
For both cost functions I experimented to find a learning rate that
provides near-optimal performance, given the other hyper-parameter
choices.
There is, incidentally, a very rough
general heuristic for relating the learning rate for the
cross-entropy and the quadratic cost. As we saw earlier, the
gradient terms for the quadratic cost have an extra $\sigma' =
\sigma(1-\sigma)$ term in them. Suppose we average this over values
for $\sigma$, $\int_0^1 d\sigma \sigma(1-\sigma) = 1/6$. We see
that (very roughly) the quadratic cost learns an average of $6$
times slower, for the same learning rate. This suggests that a
reasonable starting point is to divide the learning rate for the
quadratic cost by $6$. Of course, this argument is far from
rigorous, and shouldn't be taken too seriously. Still, it can
sometimes be a useful starting point. and we train for $30$ epochs.
The interface to network2.py is slightly different than
network.py, but it should still be clear what is going on. You
can, by the way, get documentation about network2.py's
interface by using commands such as help(network2.Network.SGD)
in a Python shell.
>>> import mnist_loader
When a golf player is first learning to play golf, they usually spend
most of their time developing a basic swing. Only gradually do they
develop other shots, learning to chip, draw and fade the ball,
building on and modifying their basic swing. In a similar way, up to
now we've focused on understanding the backpropagation algorithm.
It's our "basic swing", the foundation for learning in most work on
neural networks. In this chapter I explain a suite of techniques
which can be used to improve on our vanilla implementation of
backpropagation, and so improve the way our networks learn.
The techniques we'll develop in this chapter include: a better choice
of cost function, known as
the
cross-entropy cost function; four so-called
"regularization"
methods (L1 and L2 regularization, dropout, and artificial
expansion of the training data), which make our networks better at
generalizing beyond the training data; a
better method for
initializing the weights in the network; and a
set
of heuristics to help choose good hyper-parameters for the network.
I'll also overview several other
techniques in less depth. The discussions are largely independent
of one another, and so you may jump ahead if you wish. We'll also
implement
many of the techniques in running code, and use them to improve the
results obtained on the handwriting classification problem studied in
Chapter 1.
Of course, we're only covering a few of the many, many techniques
which have been developed for use in neural nets. The philosophy is
that the best entree to the plethora of available techniques is
in-depth study of a few of the most important. Mastering those
important techniques is not just useful in its own right, but will
also deepen your understanding of what problems can arise when you use
neural networks. That will leave you well prepared to quickly pick up
other techniques, as you need them.
The cross-entropy cost function
Most of us find it unpleasant to be wrong. Soon after beginning to
learn the piano I gave my first performance before an audience. I was
nervous, and began playing the piece an octave too low. I got
confused, and couldn't continue until someone pointed out my error. I
was very embarrassed. Yet while unpleasant, we also learn quickly when
we're decisively wrong. You can bet that the next time I played
before an audience I played in the correct octave! By contrast, we
learn more slowly when our errors are less well-defined.
Ideally, we hope and expect that our neural networks will learn fast
from their errors. Is this what happens in practice? To answer this
question, let's look at a toy example. The example involves a neuron
with just one input:
We'll train this neuron to do something ridiculously easy: take the
input $1$ to the output $0$. Of course, this is such a trivial task
that we could easily figure out an appropriate weight and bias by
hand, without using a learning algorithm. However, it turns out to be
illuminating to use gradient descent to attempt to learn a weight and
bias. So let's take a look at how the neuron learns.
To make things definite, I'll pick the initial weight to be $0.6$ and
the initial bias to be $0.9$. These are generic choices used as a place to begin learning; I wasn't picking them to be special in any
way. The initial output from the neuron is $0.82$, so quite a bit of
learning will be needed before our neuron gets near the desired
output, $0.0$. Click on "Run" in the bottom right corner below to
see how the neuron learns an output much closer to $0.0$. Note that
this isn't a pre-recorded animation; your browser is actually
computing the gradient, then using the gradient to update the weight
and bias, and displaying the result. The learning rate is $\eta =
0.15$, which turns out to be slow enough that we can follow what's
happening, but fast enough that we can get substantial learning in
just a few seconds. The cost is the quadratic cost function, $C$,
introduced back in Chapter 1. I'll remind you of the exact form of
the cost function shortly, so there's no need to go and dig up the
definition. Note that you can run the animation multiple times by
clicking on "Run" again.
As you can see, the neuron rapidly learns a weight and bias that
drives down the cost, and gives an output from the neuron of about
$0.09$. That's not quite the desired output, $0.0$, but it is pretty
good. Suppose, however, that we instead choose both the starting
weight and the starting bias to be $2.0$. In this case the initial
output is $0.98$, which is very badly wrong. Let's look at how the
neuron learns to output $0$ in this case. Click on "Run" again:
Although this example uses the same learning rate ($\eta = 0.15$), we
can see that learning starts out much more slowly. Indeed, for the
first 150 or so learning epochs, the weights and biases don't change
much at all. Then the learning kicks in and, much as in our first
example, the neuron's output rapidly moves closer to $0.0$.
This behaviour is strange when contrasted with human learning. As I
said at the beginning of this section, we often learn fastest when
we're badly wrong about something. But we've just seen that our
artificial neuron has a lot of difficulty learning when it's badly
wrong - far more difficulty than when it's just a little wrong.
What's more, it turns out that this behaviour occurs not just in this
toy model, but in more general networks. Why is learning so slow?
And can we find a way of avoiding this slowdown?
To understand the origin of the problem, consider that our neuron
learns by changing the weight and bias at a rate determined by the
partial derivatives of the cost function, $\partial C/\partial w$ and
$\partial C / \partial b$. So saying "learning is slow" is really
the same as saying that those partial derivatives are small. The
challenge is to understand why they are small. To understand that,
let's compute the partial derivatives. Recall that we're using the
quadratic cost function, which, from
Equation (6), is given by
\begin{eqnarray}
C = \frac{(y-a)^2}{2},
\tag{54}\end{eqnarray}
where $a$ is the neuron's output when the training input $x = 1$ is
used, and $y = 0$ is the corresponding desired output. To write this
more explicitly in terms of the weight and bias, recall that $a =
\sigma(z)$, where $z = wx+b$. Using the chain rule to differentiate
with respect to the weight and bias we get
\begin{eqnarray}
\frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{55}\\
\frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z),
\tag{56}\end{eqnarray}
where I have substituted $x = 1$ and $y = 0$. To understand the
behaviour of these expressions, let's look more closely at the
$\sigma'(z)$ term on the right-hand side. Recall the shape of the
$\sigma$ function:
We can see from this graph that when the neuron's output is close to
$1$, the curve gets very flat, and so $\sigma'(z)$ gets very small.
Equations (55) and (56) then tell us that
$\partial C / \partial w$ and $\partial C / \partial b$ get very
small. This is the origin of the learning slowdown. What's more, as
we shall see a little later, the learning slowdown occurs for
essentially the same reason in more general neural networks, not just
the toy example we've been playing with.
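Incidentally, the demos above are easy to reproduce outside the browser. Here's a minimal Python sketch using the same setup ($x = 1$, $y = 0$, $\eta = 0.15$, quadratic cost); the epoch count and function names are my own choices, not taken from the demo code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_quadratic(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    """Gradient descent on the quadratic cost C = (y-a)^2/2 for a
    one-input sigmoid neuron."""
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        # Equations (55) and (56): the gradients carry a sigma'(z) = a(1-a) factor.
        delta = (a - y) * a * (1 - a)
        w -= eta * delta * x
        b -= eta * delta
    return sigmoid(w * x + b)

print(train_quadratic(0.6, 0.9))  # heads briskly down toward 0
print(train_quadratic(2.0, 2.0))  # still badly wrong: the learning slowdown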
Introducing the cross-entropy cost function
How can we address the learning slowdown? It turns out that we can
solve the problem by replacing the quadratic cost with a different
cost function, known as the cross-entropy. To understand the
cross-entropy, let's move a little away from our super-simple toy
model. We'll suppose instead that we're trying to train a neuron with
several input variables, $x_1, x_2, \ldots$, corresponding weights
$w_1, w_2, \ldots$, and a bias, $b$:
The output from the neuron is, of course, $a = \sigma(z)$, where $z =
\sum_j w_j x_j+b$ is the weighted sum of the inputs. We define the
cross-entropy cost function for this neuron by
\begin{eqnarray}
C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right],
\tag{57}\end{eqnarray}
where $n$ is the total number of items of training data, the sum is
over all training inputs, $x$, and $y$ is the corresponding desired
output.
It's not obvious that the expression (57)
fixes the learning slowdown problem. In fact, frankly, it's not even
obvious that it makes sense to call this a cost function! Before
addressing the learning slowdown, let's see in what sense the
cross-entropy can be interpreted as a cost function.
Two properties in particular make it reasonable to interpret the
cross-entropy as a cost function. First, it's non-negative, that is,
$C > 0$. To see this, notice that: (a) all the individual terms in
the sum in (57) are negative, since both
logarithms are of numbers in the range $0$ to $1$; and (b) there is a
minus sign out the front of the sum.
Second, if the neuron's actual output is close to the desired output
for all training inputs, $x$, then the cross-entropy will be close to
zero*
*To prove this I will need to assume that the desired
outputs $y$ are all either $0$ or $1$. This is usually the case
when solving classification problems, for example, or when computing
Boolean functions. To understand what happens when we don't make
this assumption, see the exercises at the end of this section.. To
see this, suppose for example that $y = 0$ and $a \approx 0$ for some
input $x$. This is a case when the neuron is doing a good job on that
input. We see that the first term in the
expression (57) for the cost vanishes, since
$y = 0$, while the second term is just $-\ln (1-a) \approx 0$. A
similar analysis holds when $y = 1$ and $a \approx 1$. And so the
contribution to the cost will be low provided the actual output is
close to the desired output.
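Both properties are easy to check numerically. Here is Equation (57) as a short NumPy sketch, with two illustrative single-input cases (the function name is mine):

import numpy as np

def cross_entropy_cost(a, y):
    """Equation (57): average cross-entropy over the training inputs.
    a: array of neuron outputs; y: array of desired outputs (0 or 1)."""
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Nearly right (y = 0, a close to 0): cost close to zero...
print(cross_entropy_cost(np.array([0.01]), np.array([0.0])))  # ~0.01
# ...badly wrong (y = 0, a close to 1): cost large, but still positive.
print(cross_entropy_cost(np.array([0.99]), np.array([0.0])))  # ~4.6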
Summing up, the cross-entropy is positive, and tends toward zero as
the neuron gets better at computing the desired output, $y$, for all
training inputs, $x$. These are both properties we'd intuitively
expect for a cost function. Indeed, both properties are also
satisfied by the quadratic cost. So that's good news for the
cross-entropy. But the cross-entropy cost function has the benefit
that, unlike the quadratic cost, it avoids the problem of learning
slowing down. To see this, let's compute the partial derivative of
the cross-entropy cost with respect to the weights. We substitute $a
= \sigma(z)$ into (57), and apply the chain
rule twice, obtaining:
\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left(
\frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)
\frac{\partial \sigma}{\partial w_j} \tag{58}\\
& = & -\frac{1}{n} \sum_x \left(
\frac{y}{\sigma(z)}
-\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j.
\tag{59}\end{eqnarray}
Putting everything over a common denominator and simplifying this
becomes:
\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & \frac{1}{n}
\sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))}
(\sigma(z)-y).
\tag{60}\end{eqnarray}
Using the definition of the sigmoid function, $\sigma(z) =
1/(1+e^{-z})$, and a little algebra we can show that $\sigma'(z) =
\sigma(z)(1-\sigma(z))$. I'll ask you to verify this in an exercise
below, but for now let's accept it as given. We see that the
$\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$ terms cancel in the equation
just above, and it simplifies to become:
\begin{eqnarray}
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y).
\tag{61}\end{eqnarray}
This is a beautiful expression. It tells us that the rate at which
the weight learns is controlled by $\sigma(z)-y$, i.e., by the error
in the output. The larger the error, the faster the neuron will
learn. This is just what we'd intuitively expect. In particular, it
avoids the learning slowdown caused by the $\sigma'(z)$ term in the
analogous equation for the quadratic cost, Equation (55).
When we use the cross-entropy, the $\sigma'(z)$ term gets canceled
out, and we no longer need worry about it being small. This
cancellation is the special miracle ensured by the cross-entropy cost
function. Actually, it's not really a miracle. As we'll see later,
the cross-entropy was specially chosen to have just this property.
In a similar way, we can compute the partial derivative for the bias.
I won't go through all the details again, but you can easily verify
that
\begin{eqnarray}
\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y).
\tag{62}\end{eqnarray}
Again, this avoids the learning slowdown caused by the $\sigma'(z)$
term in the analogous equation for the quadratic cost,
Equation (56).
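To make the contrast concrete, compare the per-example bias gradients from Equations (56) and (62) at the stuck starting point $w = b = 2.0$ of the toy example (a sketch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a, y = sigmoid(2.0 * 1.0 + 2.0), 0.0  # w = b = 2.0 and x = 1, so a = sigmoid(4)
print((a - y) * a * (1 - a))  # quadratic, Equation (56): ~0.017, learning crawls
print(a - y)                  # cross-entropy, Equation (62): ~0.98, a healthy gradient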
Exercise
Verify that $\sigma'(z) = \sigma(z)(1-\sigma(z))$.
Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight $0.6$ and starting bias $0.9$. Press "Run" to see what happens when we replace the quadratic cost by the cross-entropy:
Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before, with the weight and bias both starting at $2.0$:
Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the flat region at the start of the corresponding curve for the quadratic cost. It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.
I didn't say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used $\eta = 0.15$. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it's not possible to say precisely what it means to use the "same" learning rate; it's an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you're still curious, despite my disavowal, here's the lowdown: I used $\eta = 0.005$ in the examples just given.
You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn't about the absolute speed of learning. It's about how the speed of learning changes. In particular, when we use the quadratic cost learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don't depend on how the learning rate is set.
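Here's the cross-entropy version of the earlier training sketch, using the $\eta = 0.005$ just mentioned; the epoch count is again an arbitrary choice:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_cross_entropy(w, b, eta=0.005, epochs=1000, x=1.0, y=0.0):
    """Gradient descent on the cross-entropy for the one-input neuron.
    Returns the cost at each epoch, so the shape of the curve is visible."""
    costs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        costs.append(-(y * np.log(a) + (1 - y) * np.log(1 - a)))
        # Equations (61) and (62): no sigma'(z) factor, hence no initial plateau.
        w -= eta * (a - y) * x
        b -= eta * (a - y)
    return costs

costs = train_cross_entropy(2.0, 2.0)
print(costs[0], costs[100], costs[200])  # the curve falls steadily from the very start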
We've been studying the cross-entropy for a single neuron. However, it's easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose $y = y_1, y_2, \ldots$ are the desired values at the output neurons, i.e., the neurons in the final layer, while $a^L_1, a^L_2, \ldots$ are the actual output values. Then we define the cross-entropy by \begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]. \tag{63}\end{eqnarray} This is the same as our earlier expression, Equation (57), except now we've got the $\sum_j$ summing over all the output neurons. I won't explicitly work through a derivation, but it should be plausible that using the expression (63) avoids a learning slowdown in many-neuron networks. If you're interested, you can work through the derivation in the problem below.
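In code, the per-example piece of Equation (63) might be packaged like the following sketch; the network2.py program we'll meet shortly does something along these lines. The np.nan_to_num call turns the nan produced by $0 \ln 0$ into its correct limiting value, $0$:

import numpy as np

class CrossEntropyCost:
    @staticmethod
    def fn(a, y):
        """Equation (63) for a single training example; a and y are
        vectors of output activations and desired outputs."""
        return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))

    @staticmethod
    def delta(z, a, y):
        """The output error; the sigma'(z) factor has cancelled."""
        return a - y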
When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we're setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input - that is, an output neuron will have saturated near $1$, when it should be $0$, or vice versa. If we're using the quadratic cost that will slow down learning. It won't stop learning completely, since the weights will continue learning from other training inputs, but it's obviously undesirable.
The cross-entropy is easy to implement as part of a program which
learns using gradient descent and backpropagation. We'll do that
later in the
chapter, developing an improved version of our
earlier
program for classifying the MNIST handwritten digits,
network.py. The new program is called network2.py, and
incorporates not just the cross-entropy, but also several other
techniques developed in this chapter*
*The code is available
on
GitHub.. For now, let's look at how well our new program
classifies MNIST digits. As was the case in Chapter 1, we'll use a
network with $30$ hidden neurons, and we'll use a mini-batch size of
$10$. We set the learning rate to $\eta = 0.5$*
*In Chapter 1
we used the quadratic cost and a learning rate of $\eta = 3.0$. As
discussed above, it's not possible to say precisely what it means to
use the "same" learning rate when the cost function is changed.
For both cost functions I experimented to find a learning rate that
provides near-optimal performance, given the other hyper-parameter
choices.
There is, incidentally, a very rough
general heuristic for relating the learning rate for the
cross-entropy and the quadratic cost. As we saw earlier, the
gradient terms for the quadratic cost have an extra $\sigma' =
\sigma(1-\sigma)$ term in them. Suppose we average this over values
for $\sigma$, $\int_0^1 d\sigma \sigma(1-\sigma) = 1/6$. We see
that (very roughly) the quadratic cost learns an average of $6$
times slower, for the same learning rate. This suggests that a
reasonable starting point is to divide the learning rate for the
quadratic cost by $6$. Of course, this argument is far from
rigorous, and shouldn't be taken too seriously. Still, it can
sometimes be a useful starting point. and we train for $30$ epochs.
The interface to network2.py is slightly different than
network.py, but it should still be clear what is going on. You
can, by the way, get documentation about network2.py's
interface by using commands such as help(network2.Network.SGD)
in a Python shell.
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True)
Note, by the way, that the net.large_weight_initializer() command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we'll change the default weight initialization in our networks. The result from running the above sequence of commands is a network with $95.49$ percent accuracy. This is pretty close to the result we obtained in Chapter 1, $95.42$ percent, using the quadratic cost.
Let's look also at the case where we use $100$ hidden neurons, the cross-entropy, and otherwise keep the parameters the same. In this case we obtain an accuracy of $96.82$ percent. That's a substantial improvement over the results from Chapter 1, where we obtained a classification accuracy of $96.59$ percent, using the quadratic cost. That may look like a small change, but consider that the error rate has dropped from $3.41$ percent to $3.18$ percent. That is, we've eliminated about one in fourteen of the original errors. That's quite a handy improvement.
It's encouraging that the cross-entropy cost gives us similar or better results than the quadratic cost. However, these results don't conclusively prove that the cross-entropy is a better choice. The reason is that I've put only a little effort into choosing hyper-parameters such as learning rate, mini-batch size, and so on. For the improvement to be really convincing we'd need to do a thorough job optimizing such hyper-parameters. Still, the results are encouraging, and reinforce our earlier theoretical argument that the cross-entropy is a better choice than the quadratic cost.
This, by the way, is part of a general pattern that we'll see through this chapter and, indeed, through much of the rest of the book. We'll develop a new technique, we'll try it out, and we'll get "improved" results. It is, of course, nice that we see such improvements. But the interpretation of such improvements is always problematic. They're only truly convincing if we see an improvement after putting tremendous effort into optimizing all the other hyper-parameters. That's a great deal of work, requiring lots of computing power, and we're not usually going to do such an exhaustive investigation. Instead, we'll proceed on the basis of informal tests like those done above. Still, you should keep in mind that such tests fall short of definitive proof, and remain alert to signs that the arguments are breaking down.
By now, we've discussed the cross-entropy at great length. Why go to so much effort when it gives only a small improvement to our MNIST results? Later in the chapter we'll see other techniques - notably, regularization - which give much bigger improvements. So why so much focus on cross-entropy? Part of the reason is that the cross-entropy is a widely-used cost function, and so is worth understanding well. But the more important reason is that neuron saturation is an important problem in neural nets, a problem we'll return to repeatedly throughout the book. And so I've discussed the cross-entropy at length because it's a good laboratory to begin understanding neuron saturation and how it may be addressed.
What does the cross-entropy mean? Where does it come from?
Our discussion of the cross-entropy has focused on algebraic analysis and practical implementation. That's useful, but it leaves unanswered broader conceptual questions, like: what does the cross-entropy mean? Is there some intuitive way of thinking about the cross-entropy? And how could we have dreamed up the cross-entropy in the first place?
Let's begin with the last of these questions: what could have motivated us to think up the cross-entropy in the first place? Suppose we'd discovered the learning slowdown described earlier, and understood that the origin was the $\sigma'(z)$ terms in Equations (55) and (56). After staring at those equations for a bit, we might wonder if it's possible to choose a cost function so that the $\sigma'(z)$ term disappeared. In that case, the cost $C = C_x$ for a single training example $x$ would satisfy \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & x_j(a-y) \tag{71}\\ \frac{\partial C}{\partial b } & = & (a-y). \tag{72}\end{eqnarray} If we could choose the cost function to make these equations true, then they would capture in a simple way the intuition that the greater the initial error, the faster the neuron learns. They'd also eliminate the problem of a learning slowdown. In fact, starting from these equations we'll now show that it's possible to derive the form of the cross-entropy, simply by following our mathematical noses. To see this, note that from the chain rule we have \begin{eqnarray} \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \sigma'(z). \tag{73}\end{eqnarray} Using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$ the last equation becomes \begin{eqnarray} \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} a(1-a). \tag{74}\end{eqnarray} Comparing to Equation (72) we obtain \begin{eqnarray} \frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}. \tag{75}\end{eqnarray} Integrating this expression with respect to $a$ gives \begin{eqnarray} C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant}, \tag{76}\end{eqnarray} for some constant of integration. This is the contribution to the cost from a single training example, $x$. To get the full cost function we must average over training examples, obtaining \begin{eqnarray} C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant}, \tag{77}\end{eqnarray} where the constant here is the average of the individual constants for each training example. And so we see that Equations (71) and (72) uniquely determine the form of the cross-entropy, up to an overall constant term. The cross-entropy isn't something that was miraculously pulled out of thin air. Rather, it's something that we could have discovered in a simple and natural way.
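If you'd like to double-check the integration step, a quick symbolic computation (using sympy, and dropping the constant) confirms that the cost (76) has derivative (75):

import sympy as sp

a, y = sp.symbols('a y')
C = -(y * sp.log(a) + (1 - y) * sp.log(1 - a))  # Equation (76), constant dropped
assert sp.simplify(sp.diff(C, a) - (a - y) / (a * (1 - a))) == 0  # Equation (75)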
What about the intuitive meaning of the cross-entropy? How should we think about it? Explaining this in depth would take us further afield than I want to go. However, it is worth mentioning that there is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. In particular, our neuron is trying to compute the function $x \rightarrow y = y(x)$. But instead it computes the function $x \rightarrow a = a(x)$. Suppose we think of $a$ as our neuron's estimated probability that $y$ is $1$, and $1-a$ is the estimated probability that the right value for $y$ is $0$. Then the cross-entropy measures how "surprised" we are, on average, when we learn the true value for $y$. We get low surprise if the output is what we expect, and high surprise if the output is unexpected. Of course, I haven't said exactly what "surprise" means, and so this perhaps seems like empty verbiage. But in fact there is a precise information-theoretic way of saying what is meant by surprise. Unfortunately, I don't know of a good, short, self-contained discussion of this subject that's available online. But if you want to dig deeper, then Wikipedia contains a brief summary that will get you started down the right track. And the details can be filled in by working through the materials about the Kraft inequality in chapter 5 of the book about information theory by Cover and Thomas.
Softmax
In this chapter we'll mostly use the cross-entropy cost to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. We're not actually going to use softmax layers in the remainder of the chapter, so if you're in a great hurry, you can skip to the next section. However, softmax is still worth understanding, in part because it's intrinsically interesting, and in part because we'll use softmax layers in Chapter 6, in our discussion of deep neural networks.
The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs* *In describing the softmax we'll make frequent use of notation introduced in the last chapter. You may wish to revisit that chapter if you need to refresh your memory about the meaning of the notation. $z^L_j = \sum_{k} w^L_{jk} a^{L-1}_k + b^L_j$. However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the $z^L_j$. According to this function, the activation $a^L_j$ of the $j$th output neuron is \begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{78}\end{eqnarray} where in the denominator we sum over all the output neurons.
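In code, the softmax function is just a few lines of NumPy. In the sketch below, subtracting max(z) before exponentiating is a standard numerical-stability trick of my own adding, not part of Equation (78); it multiplies numerator and denominator by the same factor, so the output is unchanged:

import numpy as np

def softmax(z):
    """Equation (78): exponentiate the weighted inputs, then normalize."""
    e = np.exp(z - np.max(z))  # stability shift; cancels in the ratio
    return e / np.sum(e)

a = softmax(np.array([1.0, 2.0, 3.0, 4.0]))
print(a)        # all positive; the largest z gets the largest activation
print(a.sum())  # 1.0, as Equation (79) guarantees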
If you're not familiar with the softmax function, Equation (78) may look pretty opaque. It's certainly not obvious why we'd want to use this function. And it's also not obvious that this will help us address the learning slowdown problem. To better understand Equation (78), suppose we have a network with four output neurons, and four corresponding weighted inputs, which we'll denote $z^L_1, z^L_2, z^L_3$, and $z^L_4$. Shown below are adjustable sliders showing possible values for the weighted inputs, and a graph of the corresponding output activations. A good place to start exploration is by using the bottom slider to increase $z^L_4$:
[Interactive figure: sliders for the weighted inputs $z^L_1, \ldots, z^L_4$, with bar graphs of the corresponding output activations $a^L_1, \ldots, a^L_4$.]
As you increase $z^L_4$, you'll see an increase in the corresponding output activation, $a^L_4$, and a decrease in the other output activations. Similarly, if you decrease $z^L_4$ then $a^L_4$ will decrease, and all the other output activations will increase. In fact, if you look closely, you'll see that in both cases the total change in the other activations exactly compensates for the change in $a^L_4$. The reason is that the output activations are guaranteed to always sum up to $1$, as we can prove using Equation (78) and a little algebra: \begin{eqnarray} \sum_j a^L_j & = & \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \tag{79}\end{eqnarray} As a result, if $a^L_4$ increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains $1$. And, of course, similar statements hold for all the other activations.
Equation (78) also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers which sum up to $1$. In other words, the output from the softmax layer can be thought of as a probability distribution.
The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation $a^L_j$ as the network's estimate of the probability that the correct output is $j$. So, for instance, in the MNIST classification problem, we can interpret $a^L_j$ as the network's estimated probability that the correct digit classification is $j$.
By contrast, if the output layer was a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible that the activations from a sigmoid layer won't in general form a probability distribution. And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.
We're starting to build up some feel for the softmax function and the way softmax layers behave. Just to review where we're at: the exponentials in Equation (78) ensure that all the output activations are positive. And the sum in the denominator of Equation (78) ensures that the softmax outputs sum to $1$. So that particular form no longer appears so mysterious: rather, it is a natural way to ensure that the output activations form a probability distribution. You can think of softmax as a way of rescaling the $z^L_j$, and then squishing them together to form a probability distribution.
The learning slowdown problem: We've now built up considerable familiarity with softmax layers of neurons. But we haven't yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let's define the log-likelihood cost function. We'll use $x$ to denote a training input to the network, and $y$ to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is \begin{eqnarray} C \equiv -\ln a^L_y. \tag{80}\end{eqnarray} So, for instance, if we're training with MNIST images, and input an image of a $7$, then the log-likelihood cost is $-\ln a^L_7$. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a $7$. In that case it will estimate a value for the corresponding probability $a^L_7$ which is close to $1$, and so the cost $-\ln a^L_7$ will be small. By contrast, when the network isn't doing such a good job, the probability $a^L_7$ will be smaller, and the cost $-\ln a^L_7$ will be larger. So the log-likelihood cost behaves as we'd expect a cost function to behave.
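Numerically, Equation (80) behaves just as described. A tiny sketch, with a hypothetical softmax output:

import numpy as np

a = np.array([0.02, 0.03, 0.90, 0.05])  # hypothetical softmax output, four classes
print(-np.log(a[2]))  # ~0.11: confident and correct, so the cost is small
print(-np.log(a[0]))  # ~3.9: if class 0 were actually correct, the cost is large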
What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities $\partial C / \partial w^L_{jk}$ and $\partial C / \partial b^L_j$. I won't go through the derivation explicitly - I'll ask you to do it in the problems, below - but with a little algebra you can show that* *Note that I'm abusing notation here, using $y$ in a slightly different way than in the last paragraph. In the last paragraph we used $y$ to denote the desired output from the network - e.g., output a "$7$" if an image of a $7$ was input. But in the equations which follow I'm using $y$ to denote the vector of output activations which corresponds to $7$, that is, a vector which is all $0$s, except for a $1$ in the $7$th location. \begin{eqnarray} \frac{\partial C}{\partial b^L_j} & = & a^L_j-y_j \tag{81}\\ \frac{\partial C}{\partial w^L_{jk}} & = & a^{L-1}_k (a^L_j-y_j). \tag{82}\end{eqnarray} These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67), $\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j)$. It's the same equation, albeit in the latter I've averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
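Since $z^L_j$ enters only through $b^L_j$ (holding the weights and earlier activations fixed), Equation (81) can be checked by finite differences. A sketch, with illustrative values:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.0, 2.0, 0.5, 3.0])  # illustrative weighted inputs
correct = 3                         # suppose class 3 is the right answer
a = softmax(z)
y = np.eye(len(z))[correct]         # one-hot desired output
cost = -np.log(a[correct])          # Equation (80)

eps = 1e-6
numeric = np.array([
    (-np.log(softmax(z + eps * np.eye(len(z))[j])[correct]) - cost) / eps
    for j in range(len(z))])
print(numeric)  # matches...
print(a - y)    # ...Equation (81): dC/db_j = a_j - y_j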
Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we'll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we'll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That's not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.
Overfitting and regularization
The Nobel prizewinning physicist Enrico Fermi was once asked his opinion of a mathematical model some colleagues had proposed as the solution to an important unsolved physics problem. The model gave excellent agreement with experiment, but Fermi was skeptical. He asked how many free parameters could be set in the model. "Four" was the answer. Fermi replied* *The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model. A four-parameter elephant may be found here.: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that doesn't make it a good model. It may just mean there's enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn't been exposed to before.
Fermi and von Neumann were suspicious of models with four parameters. Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters! That's a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters. Should we trust the results?
Let's sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We'll use our 30 hidden neuron network, with its 23,860 parameters. But we won't train the network using all 50,000 MNIST training images. Instead, we'll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We'll train in a similar way to before, using the cross-entropy cost function, with a learning rate of $\eta = 0.5$ and a mini-batch size of $10$. However, we'll train for 400 epochs, a somewhat larger number than before, because we're not using as many training examples. Let's use network2 to look at the way the cost function changes:
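The snippet itself isn't reproduced here, but a session along the following lines matches the parameters just described (the monitor_training_cost flag is an assumption about network2.py's monitoring interface):
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5,
... evaluation_data=test_data, monitor_evaluation_accuracy=True,
... monitor_training_cost=True)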
+Note, by the way, that the net.large_weight_initializer()
command is used to initialize the weights and biases in the same way
as described in Chapter 1. We need to run this command because later
in this chapter we'll change the default weight initialization in our
networks. The result from running the above sequence of commands is a
network with $95.49$ percent accuracy. This is pretty close to the
result we obtained in Chapter 1, $95.42$ percent, using the quadratic
cost.
Let's look also at the case where we use $100$ hidden neurons, the
cross-entropy, and otherwise keep the parameters the same. In this
case we obtain an accuracy of $96.82$ percent. That's a substantial
improvement over the results from Chapter 1, where we obtained a
classification accuracy of $96.59$ percent, using the quadratic cost.
That may look like a small change, but consider that the error rate
has dropped from $3.41$ percent to $3.18$ percent. That is, we've
eliminated about one in fourteen of the original errors. That's quite
a handy improvement.
It's encouraging that the cross-entropy cost gives us similar or
better results than the quadratic cost. However, these results don't
conclusively prove that the cross-entropy is a better choice. The
reason is that I've put only a little effort into choosing
hyper-parameters such as learning rate, mini-batch size, and so on.
For the improvement to be really convincing we'd need to do a thorough
job optimizing such hyper-parameters. Still, the results are
encouraging, and reinforce our earlier theoretical argument that the
cross-entropy is a better choice than the quadratic cost.
This, by the way, is part of a general pattern that we'll see through
this chapter and, indeed, through much of the rest of the book. We'll
develop a new technique, we'll try it out, and we'll get "improved"
results. It is, of course, nice that we see such improvements. But
the interpretation of such improvements is always problematic.
They're only truly convincing if we see an improvement after putting
tremendous effort into optimizing all the other hyper-parameters.
That's a great deal of work, requiring lots of computing power, and
we're not usually going to do such an exhaustive investigation.
Instead, we'll proceed on the basis of informal tests like those done
above. Still, you should keep in mind that such tests fall short of
definitive proof, and remain alert to signs that the arguments are
breaking down.
By now, we've discussed the cross-entropy at great length. Why go to
so much effort when it gives only a small improvement to our MNIST
results? Later in the chapter we'll see other techniques - notably,
regularization - which
give much bigger improvements. So why so much focus on cross-entropy?
Part of the reason is that the cross-entropy is a widely-used cost
function, and so is worth understanding well. But the more important
reason is that neuron saturation is an important problem in neural
nets, a problem we'll return to repeatedly throughout the book. And
so I've discussed the cross-entropy at length because it's a good
laboratory to begin understanding neuron saturation and how it may be
addressed.
What does the cross-entropy mean? Where does it come from?
Our discussion of the cross-entropy has focused on algebraic analysis
and practical implementation. That's useful, but it leaves unanswered
broader conceptual questions, like: what does the cross-entropy mean?
Is there some intuitive way of thinking about the cross-entropy? And
how could we have dreamed up the cross-entropy in the first place?
Let's begin with the last of these questions: what could have
motivated us to think up the cross-entropy in the first place?
Suppose we'd discovered the learning slowdown described earlier, and
understood that the origin was the $\sigma'(z)$ terms in
Equations (55)\begin{eqnarray}
\frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray} and (56)\begin{eqnarray}
\frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}. After staring at
those equations for a bit, we might wonder if it's possible to choose
a cost function so that the $\sigma'(z)$ term disappeared. In that
case, the cost $C = C_x$ for a single training example $x$ would
satisfy
\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & x_j(a-y) \tag{71}\\
\frac{\partial C}{\partial b } & = & (a-y).
\tag{72}\end{eqnarray}
If we could choose the cost function to make these equations true,
then they would capture in a simple way the intuition that the greater
the initial error, the faster the neuron learns. They'd also
eliminate the problem of a learning slowdown. In fact, starting from
these equations we'll now show that it's possible to derive the form
of the cross-entropy, simply by following our mathematical noses. To
see this, note that from the chain rule we have
\begin{eqnarray}
\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}
\sigma'(z).
\tag{73}\end{eqnarray}
Using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$ the last equation
becomes
\begin{eqnarray}
\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}
a(1-a).
\tag{74}\end{eqnarray}
Comparing to Equation (72)\begin{eqnarray}
\frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray} we obtain
\begin{eqnarray}
\frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}.
\tag{75}\end{eqnarray}
Integrating this expression with respect to $a$ gives
\begin{eqnarray}
C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant},
\tag{76}\end{eqnarray}
for some constant of integration. This is the contribution to the
cost from a single training example, $x$. To get the full cost
function we must average over training examples, obtaining
\begin{eqnarray}
C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant},
\tag{77}\end{eqnarray}
where the constant here is the average of the individual constants for
each training example. And so we see that
Equations (71)\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & x_j(a-y) \nonumber\end{eqnarray}
and (72)\begin{eqnarray}
\frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray} uniquely determine the form
of the cross-entropy, up to an overall constant term. The
cross-entropy isn't something that was miraculously pulled out of thin
air. Rather, it's something that we could have discovered in a simple
and natural way.
What about the intuitive meaning of the cross-entropy? How should we
think about it? Explaining this in depth would take us further afield
than I want to go. However, it is worth mentioning that there is a
standard way of interpreting the cross-entropy that comes from the
field of information theory. Roughly speaking, the idea is that the
cross-entropy is a measure of surprise. In particular, our neuron is
trying to compute the function $x \rightarrow y = y(x)$. But instead
it computes the function $x \rightarrow a = a(x)$. Suppose we think
of $a$ as our neuron's estimated probability that $y$ is $1$, and
$1-a$ is the estimated probability that the right value for $y$ is
$0$. Then the cross-entropy measures how "surprised" we are, on
average, when we learn the true value for $y$. We get low surprise if
the output is what we expect, and high surprise if the output is
unexpected. Of course, I haven't said exactly what "surprise"
means, and so this perhaps seems like empty verbiage. But in fact
there is a precise information-theoretic way of saying what is meant
by surprise. Unfortunately, I don't know of a good, short,
self-contained discussion of this subject that's available online.
But if you want to dig deeper, then Wikipedia contains a
brief
summary that will get you started down the right track. And the
details can be filled in by working through the materials about the
Kraft inequality in chapter 5 of the book about information theory by
Cover and Thomas.
Problem
In this chapter we'll mostly use the cross-entropy cost to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. We're not actually going to use softmax layers in the remainder of the chapter, so if you're in a great hurry, you can skip to the next section. However, softmax is still worth understanding, in part because it's intrinsically interesting, and in part because we'll use softmax layers in Chapter 6, in our discussion of deep neural networks.
The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs* *In describing the softmax we'll make frequent use of notation introduced in the last chapter. You may wish to revisit that chapter if you need to refresh your memory about the meaning of the notation. $z^L_j = \sum_{k} w^L_{jk} a^{L-1}_k + b^L_j$. However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the $z^L_j$. According to this function, the activation $a^L_j$ of the $j$th output neuron is \begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{78}\end{eqnarray} where in the denominator we sum over all the output neurons.
If you're not familiar with the softmax function, Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} may look pretty opaque. It's certainly not obvious why we'd want to use this function. And it's also not obvious that this will help us address the learning slowdown problem. To better understand Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}, suppose we have a network with four output neurons, and four corresponding weighted inputs, which we'll denote $z^L_1, z^L_2, z^L_3$, and $z^L_4$. Shown below are adjustable sliders showing possible values for the weighted inputs, and a graph of the corresponding output activations. A good place to start exploration is by using the bottom slider to increase $z^L_4$:
$z^L_1 = $ |
$a^L_1 = $
|
$z^L_2$ = |
$a^L_2 = $
|
$z^L_3$ = |
$a^L_3 = $
|
$z^L_4$ = |
$a^L_4 = $
|
As you increase $z^L_4$, you'll see an increase in the corresponding output activation, $a^L_4$, and a decrease in the other output activations. Similarly, if you decrease $z^L_4$ then $a^L_4$ will decrease, and all the other output activations will increase. In fact, if you look closely, you'll see that in both cases the total change in the other activations exactly compensates for the change in $a^L_4$. The reason is that the output activations are guaranteed to always sum up to $1$, as we can prove using Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} and a little algebra: \begin{eqnarray} \sum_j a^L_j & = & \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \tag{79}\end{eqnarray} As a result, if $a^L_4$ increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains $1$. And, of course, similar statements hold for all the other activations.
Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers which sum up to $1$. In other words, the output from the softmax layer can be thought of as a probability distribution.
The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation $a^L_j$ as the network's estimate of the probability that the correct output is $j$. So, for instance, in the MNIST classification problem, we can interpret $a^L_j$ as the network's estimated probability that the correct digit classification is $j$.
By contrast, if the output layer was a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible that the activations from a sigmoid layer won't in general form a probability distribution. And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.
We're starting to build up some feel for the softmax function and the way softmax layers behave. Just to review where we're at: the exponentials in Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} ensure that all the output activations are positive. And the sum in the denominator of Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} ensures that the softmax outputs sum to $1$. So that particular form no longer appears so mysterious: rather, it is a natural way to ensure that the output activations form a probability distribution. You can think of softmax as a way of rescaling the $z^L_j$, and then squishing them together to form a probability distribution.
The learning slowdown problem: We've now built up considerable familiarity with softmax layers of neurons. But we haven't yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let's define the log-likelihood cost function. We'll use $x$ to denote a training input to the network, and $y$ to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is \begin{eqnarray} C \equiv -\ln a^L_y. \tag{80}\end{eqnarray} So, for instance, if we're training with MNIST images, and input an image of a $7$, then the log-likelihood cost is $-\ln a^L_7$. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a $7$. In that case it will estimate a value for the corresponding probability $a^L_7$ which is close to $1$, and so the cost $-\ln a^L_7$ will be small. By contrast, when the network isn't doing such a good job, the probability $a^L_7$ will be smaller, and the cost $-\ln a^L_7$ will be larger. So the log-likelihood cost behaves as we'd expect a cost function to behave.
What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities $\partial C / \partial w^L_{jk}$ and $\partial C / \partial b^L_j$. I won't go through the derivation explicitly - I'll ask you to do it in the problems, below - but with a little algebra you can show that* *Note that I'm abusing notation here, using $y$ in a slightly different way to the last paragraph. In the last paragraph we used $y$ to denote the desired output from the network - e.g., output a "$7$" if an image of a $7$ was input. But in the equations which follow I'm using $y$ to denote the vector of output activations which corresponds to $7$, that is, a vector which is all $0$s, except for a $1$ in the $7$th location. \begin{eqnarray} \frac{\partial C}{\partial b^L_j} & = & a^L_j-y_j \tag{81}\\ \frac{\partial C}{\partial w^L_{jk}} & = & a^{L-1}_k (a^L_j-y_j) \tag{82}\end{eqnarray} These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67), $\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j)$. It's the same equation, albeit in the latter I've averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
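Before doing the derivation yourself, here's a quick empirical check of these expressions, again not part of network2.py. It compares the analytic gradient $a^L_j - y_j$ of Equation (81) (note that $\partial C / \partial b^L_j = \partial C / \partial z^L_j$, since $z^L_j$ depends on $b^L_j$ with coefficient $1$) against a central finite-difference estimate:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loglikelihood_cost(z, y_index):
    """The log-likelihood cost of Equation (80), -ln a^L_y."""
    return -np.log(softmax(z)[y_index])

z = np.array([0.5, -1.2, 2.0, 0.3])
y_index = 2                      # the "correct" output neuron
y = np.eye(len(z))[y_index]      # one-hot version of the label

# Equation (81): dC/db^L_j (equivalently dC/dz^L_j) equals a^L_j - y_j.
analytic = softmax(z) - y

# An independent check by central finite differences.
eps = 1e-6
numeric = np.array(
    [(loglikelihood_cost(z + eps * np.eye(len(z))[j], y_index) -
      loglikelihood_cost(z - eps * np.eye(len(z))[j], y_index)) / (2 * eps)
     for j in range(len(z))])

print(np.allclose(analytic, numeric))   # True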
Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we'll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we'll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That's not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.
The Nobel prizewinning physicist Enrico Fermi was once asked his opinion of a mathematical model some colleagues had proposed as the solution to an important unsolved physics problem. The model gave excellent agreement with experiment, but Fermi was skeptical. He asked how many free parameters could be set in the model. "Four" was the answer. Fermi replied* *The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model. A four-parameter elephant may be found here.: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that doesn't make it a good model. It may just mean there's enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn't been exposed to before.
Fermi and von Neumann were suspicious of models with four parameters. Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters! That's a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters. Should we trust the results?
Let's sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We'll use our 30 hidden neuron network, with its 23,860 parameters. But we won't train the network using all 50,000 MNIST training images. Instead, we'll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We'll train in a similar way to before, using the cross-entropy cost function, with a learning rate of $\eta = 0.5$ and a mini-batch size of $10$. However, we'll train for 400 epochs, a somewhat larger number than before, because we're not using as many training examples. Let's use network2 to look at the way the cost function changes:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True, monitor_training_cost=True)
Up to now we've been using the training_data and
test_data, and ignoring the validation_data. The
validation_data contains $10,000$ images of digits, images
which are different from the $50,000$ images in the MNIST training
set, and the $10,000$ images in the MNIST test set. Instead of using
the test_data to prevent overfitting, we will use the
validation_data. To do this, we'll use much the same strategy
as was described above for the test_data. That is, we'll
compute the classification accuracy on the validation_data at
the end of each epoch. Once the classification accuracy on the
validation_data has saturated, we stop training. This strategy
is called early stopping. Of course, in practice we won't
immediately know when the accuracy has saturated. Instead, we
continue training until we're confident that the accuracy has
saturated*
*It requires some judgment to determine when to
stop. In my earlier graphs I identified epoch 280 as the place at
which accuracy saturated. It's possible that was too pessimistic.
Neural networks sometimes plateau for a while in training, before
continuing to improve. I wouldn't be surprised if more learning
could have occurred even after epoch 400, although the magnitude of
any further improvement would likely be small. So it's possible to
adopt more or less aggressive strategies for early stopping.
Why use the validation_data to prevent overfitting, rather than
the test_data? In fact, this is part of a more general
strategy, which is to use the validation_data to evaluate
different trial choices of hyper-parameters such as the number of
epochs to train for, the learning rate, the best network architecture,
and so on. We use such evaluations to find and set good values for
the hyper-parameters. Indeed, although I haven't mentioned it until
now, that is, in part, how I arrived at the hyper-parameter choices
made earlier in this book. (More on this
later.)
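To make the early-stopping strategy concrete, here's a minimal sketch of such a training loop. The train_one_epoch method is a hypothetical stand-in for one epoch of net.SGD, not something network2.py provides; the accuracy method, which returns the number of correctly classified images, is the one network2.py actually has:

def sgd_with_early_stopping(net, training_data, validation_data,
                            patience=10, max_epochs=400):
    """Train until validation accuracy hasn't improved for `patience`
    consecutive epochs."""
    best, epochs_since_best = 0, 0
    for epoch in range(max_epochs):
        net.train_one_epoch(training_data)    # hypothetical helper
        acc = net.accuracy(validation_data)   # as in network2.py
        if acc > best:
            best, epochs_since_best = acc, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:     # accuracy has saturated
            print("Stopping early at epoch {0}".format(epoch))
            break
    return best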
Of course, that doesn't in any way answer the question of why we're
using the validation_data to prevent overfitting, rather than
the test_data. Instead, it replaces it with a more general
question, which is why we're using the validation_data rather
than the test_data to set good hyper-parameters? To understand
why, consider that when setting hyper-parameters we're likely to try
many different choices for the hyper-parameters. If we set the
hyper-parameters based on evaluations of the test_data it's
possible we'll end up overfitting our hyper-parameters to the
test_data. That is, we may end up finding hyper-parameters
which fit particular peculiarities of the test_data, but where
the performance of the network won't generalize to other data sets.
We guard against that by figuring out the hyper-parameters using the
validation_data. Then, once we've got the hyper-parameters we
want, we do a final evaluation of accuracy using the test_data.
That gives us confidence that our results on the test_data are
a true measure of how well our neural network generalizes. To put it
another way, you can think of the validation data as a type of
training data that helps us learn good hyper-parameters. This
approach to finding good hyper-parameters is sometimes known as the
hold out method, since the validation_data is kept apart
or "held out" from the training_data.
Now, in practice, even after evaluating performance on the
test_data we may change our minds and want to try another
approach - perhaps a different network architecture - which will
involve finding a new set of hyper-parameters. If we do this, isn't
there a danger we'll end up overfitting to the test_data as
well? Do we need a potentially infinite regress of data sets, so we
can be confident our results will generalize? Addressing this concern
fully is a deep and difficult problem. But for our practical
purposes, we're not going to worry too much about this question.
Instead, we'll plunge ahead, using the basic hold out method, based on
the training_data, validation_data, and
test_data, as described above.
We've been looking so far at overfitting when we're just using 1,000
training images. What happens when we use the full training set of
50,000 images? We'll keep all the other parameters the same (30
hidden neurons, learning rate 0.5, mini-batch size of 10), but train
using all 50,000 images for 30 epochs. Here's a graph showing the
results for the classification accuracy on both the training data and
the test data. Note that I've used the test data here, rather than
the validation data, in order to make the results more directly
comparable with the earlier graphs.
As you can see, the accuracy on the test and training data remain much
closer together than when we were using 1,000 training examples. In
particular, the best classification accuracy of $97.86$ percent on the
training data is only $2.53$ percent higher than the $95.33$ percent
on the test data. That's compared to the $17.73$ percent gap we had
earlier! Overfitting is still going on, but it's been greatly
reduced. Our network is generalizing much better from the training
data to the test data. In general, one of the best ways of reducing
overfitting is to increase the size of the training data. With enough
training data it is difficult for even a very large network to
overfit. Unfortunately, training data can be expensive or difficult
to acquire, so this is not always a practical option.
Regularization
Increasing the amount of training data is one way of reducing
overfitting. Are there other ways we can reduce the extent to which
overfitting occurs? One possible approach is to reduce the size of
our network. However, large networks have the potential to be more
powerful than small networks, and so this is an option we'd only adopt
reluctantly.
Fortunately, there are other techniques which can reduce overfitting,
even when we have a fixed network and fixed training data. These are
known as regularization techniques. In this section I describe
one of the most commonly used regularization techniques, a technique
sometimes known as weight decay or L2 regularization.
The idea of L2 regularization is to add an extra term to the cost
function, a term called the regularization term. Here's the
regularized cross-entropy:
\begin{eqnarray} C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j+(1-y_j) \ln
(1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2.
\tag{85}\end{eqnarray}
The first term is just the usual expression for the cross-entropy.
But we've added a second term, namely the sum of the squares of all
the weights in the network. This is scaled by a factor $\lambda /
2n$, where $\lambda > 0$ is known as the regularization
parameter, and $n$ is, as usual, the size of our training set.
I'll discuss later how $\lambda$ is chosen. It's also worth noting
that the regularization term doesn't include the biases. I'll
also come back to that below.
Of course, it's possible to regularize other cost functions, such as
the quadratic cost. This can be done in a similar way:
\begin{eqnarray} C = \frac{1}{2n} \sum_x \|y-a^L\|^2 +
\frac{\lambda}{2n} \sum_w w^2.
\tag{86}\end{eqnarray}
In both cases we can write the regularized cost function as
\begin{eqnarray} C = C_0 + \frac{\lambda}{2n}
\sum_w w^2,
\tag{87}\end{eqnarray} where $C_0$ is the original, unregularized cost
function.
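As a small illustration, here's Equation (87) written out in Python. This is a sketch, not code from network2.py, though it follows that program's convention of spelling the regularization parameter lmbda (lambda being a reserved word in Python, a point we'll return to below):

import numpy as np

def regularized_cost(cost0, weights, lmbda, n):
    """Equation (87): the unregularized cost plus the L2 penalty
    (lmbda / 2n) times the sum of the squared weights.  Here weights
    is a list of per-layer weight matrices, as in network2.py."""
    return cost0 + 0.5 * (lmbda / n) * sum(np.sum(w ** 2) for w in weights)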
Intuitively, the effect of regularization is to make it so the network
prefers to learn small weights, all other things being equal. Large
weights will only be allowed if they considerably improve the first
part of the cost function. Put another way, regularization can be
viewed as a way of compromising between finding small weights and
minimizing the original cost function. The relative importance of the
two elements of the compromise depends on the value of $\lambda$: when
$\lambda$ is small we prefer to minimize the original cost function,
but when $\lambda$ is large we prefer small weights.
Now, it's really not at all obvious why making this kind of compromise
should help reduce overfitting! But it turns out that it does. We'll
address the question of why it helps in the next section. But first,
let's work through an example showing that regularization really does
reduce overfitting.
To construct such an example, we first need to figure out how to apply
our stochastic gradient descent learning algorithm in a regularized
neural network. In particular, we need to know how to compute the
partial derivatives $\partial C / \partial w$ and $\partial C
/ \partial b$ for all the weights and biases in the network. Taking
the partial derivatives of Equation (87) gives
\begin{eqnarray}
\frac{\partial C}{\partial w} & = & \frac{\partial C_0}{\partial w} +
\frac{\lambda}{n} w \tag{88}\\
\frac{\partial C}{\partial b} & = & \frac{\partial C_0}{\partial b}.
\tag{89}\end{eqnarray}
The $\partial C_0 / \partial w$ and $\partial C_0 / \partial b$ terms
can be computed using backpropagation, as described in
the last chapter. And so we see that it's easy to
compute the gradient of the regularized cost function: just use
backpropagation, as usual, and then add $\frac{\lambda}{n} w$ to the
partial derivative of all the weight terms. The partial derivatives
with respect to the biases are unchanged, and so the gradient descent
learning rule for the biases doesn't change from the usual rule:
\begin{eqnarray}
b & \rightarrow & b -\eta \frac{\partial C_0}{\partial b}.
\tag{90}\end{eqnarray}
The learning rule for the weights becomes:
\begin{eqnarray}
w & \rightarrow & w-\eta \frac{\partial C_0}{\partial
w}-\frac{\eta \lambda}{n} w \tag{91}\\
& = & \left(1-\frac{\eta \lambda}{n}\right) w -\eta \frac{\partial
C_0}{\partial w}.
\tag{92}\end{eqnarray}
This is exactly the same as the usual gradient descent learning rule,
except we first rescale the weight $w$ by a factor $1-\frac{\eta
\lambda}{n}$. This rescaling is sometimes referred to as
weight decay, since it makes the weights smaller. At first
glance it looks as though this means the weights are being driven
unstoppably toward zero. But that's not right, since the other term
may lead the weights to increase, if so doing causes a decrease in the
unregularized cost function.
Okay, that's how gradient descent works. What about stochastic
gradient descent? Well, just as in unregularized stochastic gradient
descent, we can estimate $\partial C_0 / \partial w$ by averaging over
a mini-batch of $m$ training examples. Thus the regularized learning
rule for stochastic gradient descent becomes
(c.f. Equation (20), $w_k \rightarrow w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k}$)
\begin{eqnarray}
w \rightarrow \left(1-\frac{\eta \lambda}{n}\right) w -\frac{\eta}{m}
\sum_x \frac{\partial C_x}{\partial w},
\tag{93}\end{eqnarray}
where the sum is over training examples $x$ in the mini-batch, and
$C_x$ is the (unregularized) cost for each training example. This is
exactly the same as the usual rule for stochastic gradient descent,
except for the $1-\frac{\eta \lambda}{n}$ weight decay factor.
Finally, and for completeness, let me state the regularized learning
rule for the biases. This is, of course, exactly the same as in the
unregularized case (c.f. Equation (21), $b_l \rightarrow b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}$),
\begin{eqnarray}
b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b},
\tag{94}\end{eqnarray}
where the sum is over training examples $x$ in the mini-batch.
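Putting Equations (93) and (94) together, a single regularized update step looks like the following sketch. It's close in spirit to what network2.py's update_mini_batch method does, though the function here is illustrative rather than the book's actual code:

import numpy as np

def regularized_update(weights, biases, nabla_w, nabla_b,
                       eta, lmbda, n, m):
    """One mini-batch step following Equations (93) and (94).  Here
    nabla_w and nabla_b are the gradients summed over the mini-batch,
    m is the mini-batch size, and n is the size of the full training
    set; weights and biases are lists of per-layer arrays."""
    # Equation (93): apply the weight decay factor, then the usual step.
    weights = [(1 - eta * lmbda / n) * w - (eta / m) * nw
               for w, nw in zip(weights, nabla_w)]
    # Equation (94): the bias update is unchanged by regularization.
    biases = [b - (eta / m) * nb for b, nb in zip(biases, nabla_b)]
    return weights, biases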
Let's see how regularization changes the performance of our neural
network. We'll use a network with $30$ hidden neurons, a mini-batch
size of $10$, a learning rate of $0.5$, and the cross-entropy cost
function. However, this time we'll use a regularization parameter of
$\lambda = 0.1$. Note that in the code, we use the variable name
lmbda, because lambda is a reserved word in Python, with
an unrelated meaning. I've also used the test_data again, not
the validation_data. Strictly speaking, we should use the
validation_data, for all the reasons we discussed earlier. But
I decided to use the test_data because it makes the results
more directly comparable with our earlier, unregularized results. You
can easily change the code to use the validation_data instead,
and you'll find that it gives similar results.
Using $100$ hidden neurons, a regularization parameter of $\lambda = 5.0$, and training on the full 50,000 image training set for 30 epochs gives a further improvement:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)
The final result is a classification accuracy of $97.92$ percent on
the validation data. That's a big jump from the 30 hidden neuron
case. In fact, tuning just
a little more, to run for 60 epochs at $\eta = 0.1$ and $\lambda =
5.0$ we break the $98$ percent barrier, achieving $98.04$ percent
classification accuracy on the validation data. Not bad for what
turns out to be 152 lines of code!
I've described regularization as a way to reduce overfitting and to
increase classification accuracies. In fact, that's not the only
benefit. Empirically, when doing multiple runs of our MNIST networks,
but with different (random) weight initializations, I've found that
the unregularized runs will occasionally get "stuck", apparently
caught in local minima of the cost function. The result is that
different runs sometimes provide quite different results. By
contrast, the regularized runs have provided much more easily
replicable results.
Why is this going on? Heuristically, if the cost function is
unregularized, then the length of the weight vector is likely to grow,
all other things being equal. Over time this can lead to the weight
vector being very large indeed. This can cause the weight vector to
get stuck pointing in more or less the same direction, since changes
due to gradient descent only make tiny changes to the direction, when
the length is long. I believe this phenomenon is making it hard for
our learning algorithm to properly explore the weight space, and
consequently harder to find good minima of the cost function.
Why does regularization help reduce overfitting?
We've seen empirically that regularization helps reduce overfitting.
That's encouraging but, unfortunately, it's not obvious why
regularization helps! A standard story people tell to explain what's
going on is along the following lines: smaller weights are, in some
sense, lower complexity, and so provide a simpler and more powerful
explanation for the data, and should thus be preferred. That's a
pretty terse story, though, and contains several elements that perhaps
seem dubious or mystifying. Let's unpack the story and examine it
critically. To do that, let's suppose we have a simple data set for
which we wish to build a model:
Implicitly, we're studying some real-world phenomenon here, with $x$
and $y$ representing real-world data. Our goal is to build a model
which lets us predict $y$ as a function of $x$. We could try using
neural networks to build such a model, but I'm going to do something
even simpler: I'll try to model $y$ as a polynomial in $x$. I'm doing
this instead of using neural nets because using polynomials will make
things particularly transparent. Once we've understood the polynomial
case, we'll translate to neural networks. Now, there are ten points
in the graph above, which means we can find a unique $9$th-order
polynomial $y = a_0 x^9 + a_1 x^8 + \ldots + a_9$ which fits the data
exactly. Here's the graph of that polynomial*
*I won't show
the coefficients explicitly, although they are easy to find using a
routine such as Numpy's polyfit. You can view the exact form
of the polynomial in the source code
for the graph if you're curious. It's the function p(x)
defined starting on line 14 of the program which produces the
graph.:
That provides an exact fit. But we can also get a good fit using the
linear model $y = 2x$:
Which of these is the better model? Which is more likely to be true?
And which model is more likely to generalize well to other examples of
the same underlying real-world phenomenon?
These are difficult questions. In fact, we can't determine with
certainty the answer to any of the above questions, without much more
information about the underlying real-world phenomenon. But let's
consider two possibilities: (1) the $9$th order polynomial is, in
fact, the model which truly describes the real-world phenomenon, and
the model will therefore generalize perfectly; (2) the correct model
is $y = 2x$, but there's a little additional noise due to, say,
measurement error, and that's why the model isn't an exact fit.
It's not a priori possible to say which of these two
possibilities is correct. (Or, indeed, if some third possibility
holds). Logically, either could be true. And it's not a trivial
difference. It's true that on the data provided there's only a small
difference between the two models. But suppose we want to predict the
value of $y$ corresponding to some large value of $x$, much larger
than any shown on the graph above. If we try to do that there will be
a dramatic difference between the predictions of the two models, as
the $9$th order polynomial model comes to be dominated by the $x^9$
term, while the linear model remains, well, linear.
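You can see this extrapolation blow-up directly with Numpy's polyfit, mentioned in the footnote above. Here's a sketch; the data below is illustrative - ten points near $y = 2x$ with a little noise - not the data used in the graphs:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = 2 * x + 0.05 * rng.standard_normal(10)

# Ten points determine a unique 9th-order polynomial exactly (numpy may
# warn that the fit is poorly conditioned - that's the high-order
# polynomial's instability showing itself)...
p9 = np.polyfit(x, y, 9)
# ...while the best first-order fit stays close to y = 2x.
p1 = np.polyfit(x, y, 1)

print(np.polyval(p9, x) - y)    # essentially zero: an exact fit
print(p1)                       # roughly [2, 0]

# Far outside the data the models disagree dramatically, with the
# 9th-order polynomial dominated by its x^9 term:
print(np.polyval(p9, 5.0), np.polyval(p1, 5.0))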
One point of view is to say that in science we should go with the
simpler explanation, unless compelled not to. When we find a simple
model that seems to explain many data points we are tempted to shout
"Eureka!" After all, it seems unlikely that a simple explanation
should occur merely by coincidence. Rather, we suspect that the model
must be expressing some underlying truth about the phenomenon. In the
case at hand, the model $y = 2x+{\rm noise}$ seems much simpler than
$y = a_0 x^9 + a_1 x^8 + \ldots$. It would be surprising if that
simplicity had occurred by chance, and so we suspect that $y = 2x+{\rm
noise}$ expresses some underlying truth. In this point of view, the
9th order model is really just learning the effects of local
noise. And so while the 9th order model works perfectly for these
particular data points, the model will fail to generalize to other
data points, and the noisy linear model will have greater predictive
power.
Let's see what this point of view means for neural networks. Suppose
our network mostly has small weights, as will tend to happen in a
regularized network. The smallness of the weights means that the
behaviour of the network won't change too much if we change a few
random inputs here and there. That makes it difficult for a
regularized network to learn the effects of local noise in the data.
Think of it as a way of making it so single pieces of evidence don't
matter too much to the output of the network. Instead, a regularized
network learns to respond to types of evidence which are seen often
across the training set. By contrast, a network with large weights
may change its behaviour quite a bit in response to small changes in
the input. And so an unregularized network can use large weights to
learn a complex model that carries a lot of information about the
noise in the training data. In a nutshell, regularized networks are
constrained to build relatively simple models based on patterns seen
often in the training data, and are resistant to learning
peculiarities of the noise in the training data. The hope is that
this will force our networks to do real learning about the phenomenon
at hand, and to generalize better from what they learn.
With that said, this idea of preferring simpler explanations should
make you nervous. People sometimes refer to this idea as "Occam's
Razor", and will zealously apply it as though it has the status of
some general scientific principle. But, of course, it's not a general
scientific principle. There is no a priori logical reason to
prefer simple explanations over more complex explanations. Indeed,
sometimes the more complex explanation turns out to be correct.
Let me describe two examples where more complex explanations have
turned out to be correct. In the 1940s the physicist Marcel Schein
announced the discovery of a new particle of nature. The company he
worked for, General Electric, was ecstatic, and publicized the
discovery widely. But the physicist Hans Bethe was skeptical. Bethe
visited Schein, and looked at the plates showing the tracks of
Schein's new particle. Schein showed Bethe plate after plate, but on
each plate Bethe identified some problem that suggested the data
should be discarded. Finally, Schein showed Bethe a plate that looked
good. Bethe said it might just be a statistical fluke. Schein:
"Yes, but the chance that this would be statistics, even according to
your own formula, is one in five." Bethe: "But we have already
looked at five plates." Finally, Schein said: "But on my plates,
each one of the good plates, each one of the good pictures, you
explain by a different theory, whereas I have one hypothesis that
explains all the plates, that they are [the new particle]." Bethe
replied: "The sole difference between your and my explanations is
that yours is wrong and all of mine are right. Your single
explanation is wrong, and all of my multiple explanations are right."
Subsequent work confirmed that Nature agreed with Bethe, and Schein's
particle is no more*
*The story is related by the physicist
Richard Feynman in an
interview
with the historian Charles Weiner.
As a second example, in 1859 the astronomer Urbain Le Verrier observed
that the orbit of the planet Mercury doesn't have quite the shape that
Newton's theory of gravitation says it should have. It was a tiny,
tiny deviation from Newton's theory, and several of the explanations
proffered at the time boiled down to saying that Newton's theory was
more or less right, but needed a tiny alteration. In 1916, Einstein
showed that the deviation could be explained very well using his
general theory of relativity, a theory radically different to
Newtonian gravitation, and based on much more complex mathematics.
Despite that additional complexity, today it's accepted that
Einstein's explanation is correct, and Newtonian gravity, even in its
modified forms, is wrong. This is in part because we now know that
Einstein's theory explains many other phenomena which Newton's theory
has difficulty with. Furthermore, and even more impressively,
Einstein's theory accurately predicts several phenomena which aren't
predicted by Newtonian gravity at all. But these impressive qualities
weren't entirely obvious in the early days. If one had judged merely
on the grounds of simplicity, then some modified form of Newton's
theory would arguably have been more attractive.
There are three morals to draw from these stories. First, it can be
quite a subtle business deciding which of two explanations is truly
"simpler". Second, even if we can make such a judgment, simplicity
is a guide that must be used with great caution! Third, the true test
of a model is not simplicity, but rather how well it does in
predicting new phenomena, in new regimes of behaviour.
With that said, and keeping the need for caution in mind, it's an
empirical fact that regularized neural networks usually generalize
better than unregularized networks. And so through the remainder of
the book we will make frequent use of regularization. I've included
the stories above merely to help convey why no-one has yet developed
an entirely convincing theoretical explanation for why regularization
helps networks generalize. Indeed, researchers continue to write
papers where they try different approaches to regularization, compare
them to see which works better, and attempt to understand why different
approaches work better or worse. And so you can view regularization
as something of a kludge. While it often helps, we don't have an
entirely satisfactory systematic understanding of what's going on,
merely incomplete heuristics and rules of thumb.
There's a deeper set of issues here, issues which go to the heart of
science. It's the question of how we generalize. Regularization may
give us a computational magic wand that helps our networks generalize
better, but it doesn't give us a principled understanding of how
generalization works, nor of what the best approach is*
*These
issues go back to the
problem
of induction, famously discussed by the Scottish philosopher
David Hume in "An
Enquiry Concerning Human Understanding" (1748). The problem of
induction has been given a modern machine learning form in the
no-free lunch theorem
(link)
of David Wolpert and William Macready (1997).
This is particularly galling because in everyday life, we humans
generalize phenomenally well. Shown just a few images of an elephant
a child will quickly learn to recognize other elephants. Of course,
they may occasionally make mistakes, perhaps confusing a rhinoceros
for an elephant, but in general this process works remarkably
accurately. So we have a system - the human brain - with a huge
number of free parameters. And after being shown just one or a few
training images that system learns to generalize to other images. Our
brains are, in some sense, regularizing amazingly well! How do we do
it? At this point we don't know. I expect that in years to come we
will develop more powerful techniques for regularization in artificial
neural networks, techniques that will ultimately enable neural nets to
generalize well even from small data sets.
In fact, our networks already generalize better than one might a
priori expect. A network with 100 hidden neurons has nearly 80,000
parameters. We have only 50,000 images in our training data. It's
like trying to fit an 80,000th degree polynomial to 50,000 data
points. By all rights, our network should overfit terribly. And yet,
as we saw earlier, such a network actually does a pretty good job
generalizing. Why is that the case? It's not well understood. It
has been conjectured*
*In
Gradient-Based
Learning Applied to Document Recognition, by Yann LeCun,
Léon Bottou, Yoshua Bengio, and Patrick Haffner
(1998). that "the dynamics of gradient descent learning in
multilayer nets has a `self-regularization' effect". This is
exceptionally fortunate, but it's also somewhat disquieting that we
don't understand why it's the case. In the meantime, we will adopt
the pragmatic approach and use regularization whenever we can. Our
neural networks will be the better for it.
Let me conclude this section by returning to a detail which I left
unexplained earlier: the fact that L2 regularization doesn't
constrain the biases. Of course, it would be easy to modify the
regularization procedure to regularize the biases. Empirically, doing
this often doesn't change the results very much, so to some extent
it's merely a convention whether to regularize the biases or not.
However, it's worth noting that having a large bias doesn't make a
neuron sensitive to its inputs in the same way as having large
weights. And so we don't need to worry about large biases enabling
our network to learn the noise in our training data. At the same
time, allowing large biases gives our networks more flexibility in
behaviour - in particular, large biases make it easier for neurons
to saturate, which is sometimes desirable. For these reasons we don't
usually include bias terms when regularizing.
Other techniques for regularization
There are many regularization techniques other than L2 regularization.
In fact, so many techniques have been developed that I can't possibly
summarize them all. In this section I briefly describe three other
approaches to reducing overfitting: L1 regularization, dropout, and
artificially increasing the training set size. We won't go into
nearly as much depth studying these techniques as we did earlier.
Instead, the purpose is to get familiar with the main ideas, and to
appreciate something of the diversity of regularization techniques
available.
L1 regularization: In this approach we modify the
unregularized cost function by adding the sum of the absolute values
of the weights:
\begin{eqnarray} C = C_0 + \frac{\lambda}{n} \sum_w |w|.
\tag{95}\end{eqnarray}
Intuitively, this is similar to L2 regularization, penalizing large
weights, and tending to make the network prefer small weights. Of
course, the L1 regularization term isn't the same as the L2
regularization term, and so we shouldn't expect to get exactly the
same behaviour. Let's try to understand how the behaviour of a
network trained using L1 regularization differs from a network trained
using L2 regularization.
To do that, we'll look at the partial derivatives of the cost
function. Differentiating (95) we obtain:
\begin{eqnarray} \frac{\partial C}{\partial
w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm
sgn}(w),
\tag{96}\end{eqnarray}
where ${\rm sgn}(w)$ is the sign of $w$, that is, $+1$ if $w$ is
positive, and $-1$ if $w$ is negative. Using this expression, we can
easily modify backpropagation to do stochastic gradient descent using
L1 regularization. The resulting update rule for an L1 regularized
network is
\begin{eqnarray} w \rightarrow w' =
w-\frac{\eta \lambda}{n} \mbox{sgn}(w) - \eta \frac{\partial
C_0}{\partial w},
\tag{97}\end{eqnarray}
where, as per usual, we can estimate $\partial C_0 / \partial w$ using
a mini-batch average, if we wish. Compare that to the update rule for
L2 regularization (c.f. Equation (93)),
\begin{eqnarray}
w \rightarrow w' = w\left(1 - \frac{\eta \lambda}{n} \right)
- \eta \frac{\partial C_0}{\partial w}.
\tag{98}\end{eqnarray}
In both expressions the effect of regularization is to shrink the
weights. This accords with our intuition that both kinds of
regularization penalize large weights. But the way the weights shrink
is different. In L1 regularization, the weights shrink by a constant
amount toward $0$. In L2 regularization, the weights shrink by an
amount which is proportional to $w$. And so when a particular weight
has a large magnitude, $|w|$, L1 regularization shrinks the weight
much less than L2 regularization does. By contrast, when $|w|$ is
small, L1 regularization shrinks the weight much more than L2
regularization. The net result is that L1 regularization tends to
concentrate the weight of the network in a relatively small number of
high-importance connections, while the other weights are driven toward
zero.
I've glossed over an issue in the above discussion, which is that the
partial derivative $\partial C / \partial w$ isn't defined when $w =
0$. The reason is that the function $|w|$ has a sharp "corner" at
$w = 0$, and so isn't differentiable at that point. That's okay,
though. What we'll do is just apply the usual (unregularized) rule
for stochastic gradient descent when $w = 0$. That should be okay -
intuitively, the effect of regularization is to shrink weights, and
obviously it can't shrink a weight which is already $0$. To put it
more precisely, we'll use Equations (96) and (97) with the convention
that $\mbox{sgn}(0) = 0$.
That gives a nice, compact rule for doing stochastic gradient descent
with L1 regularization.
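In code, the rule is a one-liner. Conveniently, numpy's sign function already uses the $\mbox{sgn}(0) = 0$ convention. The function below is a sketch, not part of network2.py:

import numpy as np

def l1_update(w, nabla_c0, eta, lmbda, n):
    """The L1-regularized update of Equation (97).  Note np.sign(0)
    is 0, matching the sgn(0) = 0 convention adopted above."""
    return w - (eta * lmbda / n) * np.sign(w) - eta * nabla_c0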
Dropout: Dropout is a radically different technique for
regularization. Unlike L1 and L2 regularization, dropout doesn't rely
on modifying the cost function. Instead, in dropout we modify the
network itself. Let me describe the basic mechanics of how dropout
works, before getting into why it works, and what the results are.
Suppose we're trying to train a network:
In particular, suppose we have a training input $x$ and corresponding
desired output $y$. Ordinarily, we'd train by forward-propagating $x$
through the network, and then backpropagating to determine the
contribution to the gradient. With dropout, this process is modified.
We start by randomly (and temporarily) deleting half the hidden
neurons in the network, while leaving the input and output neurons
untouched. After doing this, we'll end up with a network along the
following lines. Note that the dropout neurons, i.e., the neurons
which have been temporarily deleted, are still ghosted in:
We forward-propagate the input $x$ through the modified network, and
then backpropagate the result, also through the modified network.
After doing this over a mini-batch of examples, we update the
appropriate weights and biases. We then repeat the process, first
restoring the dropout neurons, then choosing a new random subset of
hidden neurons to delete, estimating the gradient for a different
mini-batch, and updating the weights and biases in the network.
By repeating this process over and over, our network will learn a set
of weights and biases. Of course, those weights and biases will have
been learnt under conditions in which half the hidden neurons were
dropped out. When we actually run the full network that means that
twice as many hidden neurons will be active. To compensate for that,
we halve the weights outgoing from the hidden neurons.
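Here's a sketch of what a dropout forward pass might look like, for a sigmoid network stored as lists of weight matrices and bias vectors in the style of network2.py. The dropout logic itself is illustrative, not code from the papers discussed below:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_dropout(a, weights, biases, training=True, p=0.5):
    """Forward pass with dropout.  During training, each hidden layer
    keeps a random fraction p of its neurons.  When the full network
    is run, the weights outgoing from the hidden neurons are scaled
    by p to compensate, as described above."""
    last = len(weights) - 1
    for l, (w, b) in enumerate(zip(weights, biases)):
        # Weights fed by a (possibly dropped-out) hidden layer get
        # scaled when the full network is used.
        if not training and l > 0:
            w = p * w
        a = sigmoid(np.dot(w, a) + b)
        # Temporarily delete a random subset of the hidden neurons,
        # leaving the output layer untouched.
        if training and l < last:
            a = a * (np.random.rand(*a.shape) < p)
    return a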
This dropout procedure may seem strange and ad hoc. Why would
we expect it to help with regularization? To explain what's going on,
I'd like you to briefly stop thinking about dropout, and instead
imagine training neural networks in the standard way (no dropout). In
particular, imagine we train several different neural networks, all
using the same training data. Of course, the networks may not start
out identical, and as a result after training they may sometimes give
different results. When that happens we could use some kind of
averaging or voting scheme to decide which output to accept. For
instance, if we have trained five networks, and three of them are
classifying a digit as a "3", then it probably really is a "3".
The other two networks are probably just making a mistake. This kind
of averaging scheme is often found to be a powerful (though expensive)
way of reducing overfitting. The reason is that the different
networks may overfit in different ways, and averaging may help
eliminate that kind of overfitting.
What's this got to do with dropout? Heuristically, when we dropout
different sets of neurons, it's rather like we're training different
neural networks. And so the dropout procedure is like averaging the
effects of a very large number of different networks. The different
networks will overfit in different ways, and so, hopefully, the net
effect of dropout will be to reduce overfitting.
A related heuristic explanation for dropout is given in one of the
earliest papers to use the
technique*
*ImageNet
Classification with Deep Convolutional Neural Networks, by Alex
Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This
technique reduces complex co-adaptations of neurons, since a neuron
cannot rely on the presence of particular other neurons. It is,
therefore, forced to learn more robust features that are useful in
conjunction with many different random subsets of the other neurons."
In other words, if we think of our network as a model which is making
predictions, then we can think of dropout as a way of making sure that
the model is robust to the loss of any individual piece of evidence.
In this, it's somewhat similar to L1 and L2 regularization, which tend
to reduce weights, and thus make the network more robust to losing any
individual connection in the network.
Of course, the true measure of dropout is that it has been very
successful in improving the performance of neural networks. The
original
paper*
*Improving
neural networks by preventing co-adaptation of feature detectors
by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya
Sutskever, and Ruslan Salakhutdinov (2012). Note that the paper
discusses a number of subtleties that I have glossed over in this
brief introduction. introducing the technique applied it to many
different tasks. For us, it's of particular interest that they applied
dropout to MNIST digit classification, using a vanilla feedforward
neural network along lines similar to those we've been considering.
The paper noted that the best result anyone had achieved up to that
point using such an architecture was $98.4$ percent classification
accuracy on the test set. They improved that to $98.7$ percent
accuracy using a combination of dropout and a modified form of L2
regularization. Similarly impressive results have been obtained for
many other tasks, including problems in image and speech recognition,
and natural language processing. Dropout has been especially useful
in training large, deep networks, where the problem of overfitting is
often acute.
Artificially expanding the training data: We saw earlier that
our MNIST classification accuracy dropped down to percentages in the
mid-80s when we used only 1,000 training images. It's not surprising
that this is the case, since less training data means our network will
be exposed to fewer variations in the way human beings write digits.
Let's try training our 30 hidden neuron network with a variety of
different training data set sizes, to see how performance varies. We
train using a mini-batch size of 10, a learning rate $\eta = 0.5$,
and the cross-entropy cost
function. We will train for 30 epochs when the full training data set
is used, and scale up the number of epochs proportionally when smaller
training sets are used. To ensure the weight decay factor remains the
same across training sets, we will use a regularization parameter of
$\lambda = 5.0$ when the full training data set is used, and scale
down $\lambda$ proportionally when smaller training sets are
used*
*This and the next two graphs are produced with the program
more_data.py.
As you can see, the classification accuracies improve considerably as
we use more training data. Presumably this improvement would continue
still further if more data was available. Of course, looking at the
graph above it does appear that we're getting near saturation.
Suppose, however, that we redo the graph with the training set size
plotted logarithmically:
It seems clear that the graph is still going up toward the end. This
suggests that if we used vastly more training data - say, millions
or even billions of handwriting samples, instead of just 50,000 -
then we'd likely get considerably better performance, even from this
very small network.
Obtaining more training data is a great idea. Unfortunately, it can be
expensive, and so is not always possible in practice. However,
there's another idea which can work nearly as well, and that's to
artificially expand the training data. Suppose, for example, that we
take an MNIST training image of a five,
and rotate it by a small amount, let's say 15 degrees:
It's still recognizably the same digit. And yet at the pixel level
it's quite different to any image currently in the MNIST training
data. It's conceivable that adding this image to the training data
might help our network learn more about how to classify digits.
What's more, obviously we're not limited to adding just this one
image. We can expand our training data by making many small
rotations of all the MNIST training images, and then using the
expanded training data to improve our network's performance.
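As a sketch of how such an expansion might be coded, here's one way to pair each image with a randomly rotated copy, using scipy's ndimage module. The reshaping assumes MNIST-style images stored as 784-entry column vectors, as in this book's mnist_loader:

import numpy as np
from scipy.ndimage import rotate

def expand_with_rotations(images, max_angle=15):
    """Return the original images together with a randomly rotated
    copy of each, rotations drawn from [-max_angle, max_angle]."""
    expanded = []
    for img in images:
        expanded.append(img)
        angle = np.random.uniform(-max_angle, max_angle)
        rotated = rotate(img.reshape(28, 28), angle, reshape=False)
        expanded.append(rotated.reshape(784, 1))
    return expanded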
This idea is very powerful and has been widely used. Let's look at
some of the results from a
paper*
*Best
Practices for Convolutional Neural Networks Applied to Visual
Document Analysis, by Patrice Simard, Dave Steinkraus, and John
Platt (2003). which applied several variations of the idea to
MNIST. One of the neural network architectures they considered was
along similar lines to what we've been using, a feedforward network
with 800 hidden neurons and using the cross-entropy cost function.
Running the network with the standard MNIST training data they
achieved a classification accuracy of 98.4 percent on their test set.
But then they expanded the training data, using not just rotations, as
I described above, but also translating and skewing the images. By
training on the expanded data set they increased their network's
accuracy to 98.9 percent. They also experimented with what they
called "elastic distortions", a special type of image distortion
intended to emulate the random oscillations found in hand muscles. By
using the elastic distortions to expand the data they achieved an even
higher accuracy, 99.3 percent. Effectively, they were broadening the
experience of their network by exposing it to the sort of variations
that are found in real handwriting.
Variations on this idea can be used to improve performance on many
learning tasks, not just handwriting recognition. The general
principle is to expand the training data by applying operations that
reflect real-world variation. It's not difficult to think of ways of
doing this. Suppose, for example, that you're building a neural
network to do speech recognition. We humans can recognize speech even
in the presence of distortions such as background noise. And so you
can expand your data by adding background noise. We can also
recognize speech if it's sped up or slowed down. So that's another way
we can expand the training data. These techniques are not always used
- for instance, instead of expanding the training data by adding
noise, it may well be more efficient to clean up the input to the
network by first applying a noise reduction filter. Still, it's worth
keeping the idea of expanding the training data in mind, and looking
for opportunities to apply it.
Exercise
An aside on big data and what it means to compare classification accuracies: Let's look again at how our neural network's accuracy varies with training set size:
Suppose that instead of using a neural network we use some other machine learning technique to classify digits. For instance, let's try using the support vector machines (SVM) which we met briefly back in Chapter 1. As was the case in Chapter 1, don't worry if you're not familiar with SVMs, we don't need to understand their details. Instead, we'll use the SVM supplied by the scikit-learn library. Here's how SVM performance varies as a function of training set size. I've plotted the neural net results as well, to make comparison easy* *This graph was produced with the program more_data.py (as were the last few graphs).:
Probably the first thing that strikes you about this graph is that our neural network outperforms the SVM for every training set size. That's nice, although you shouldn't read too much into it, since I just used the out-of-the-box settings from scikit-learn's SVM, while we've done a fair bit of work improving our neural network. A more subtle but more interesting fact about the graph is that if we train our SVM using 50,000 images then it actually has better performance (94.48 percent accuracy) than our neural network does when trained using 5,000 images (93.24 percent accuracy). In other words, more training data can sometimes compensate for differences in the machine learning algorithm used.
Something even more interesting can occur. Suppose we're trying to solve a problem using two machine learning algorithms, algorithm A and algorithm B. It sometimes happens that algorithm A will outperform algorithm B with one set of training data, while algorithm B will outperform algorithm A with a different set of training data. We don't see that above - it would require the two graphs to cross - but it does happen* *Striking examples may be found in Scaling to very very large corpora for natural language disambiguation, by Michele Banko and Eric Brill (2001).. The correct response to the question "Is algorithm A better than algorithm B?" is really: "What training data set are you using?"
All this is a caution to keep in mind, both when doing development, and when reading research papers. Many papers focus on finding new tricks to wring out improved performance on standard benchmark data sets. "Our whiz-bang technique gave us an improvement of X percent on standard benchmark Y" is a canonical form of research claim. Such claims are often genuinely interesting, but they must be understood as applying only in the context of the specific training data set used. Imagine an alternate history in which the people who originally created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data. It's entirely possible that the "improvement" due to the whiz-bang technique would disappear on a larger data set. In other words, the purported improvement might be just an accident of history. The message to take away, especially in practical applications, is that what we want is both better algorithms and better training data. It's fine to look for better algorithms, but make sure you're not focusing on better algorithms to the exclusion of easy wins getting more or better training data.
Summing up: We've now completed our dive into overfitting and regularization. Of course, we'll return again to the issue. As I've mentioned several times, overfitting is a major problem in neural networks, especially as computers get more powerful, and we have the ability to train larger networks. As a result there's a pressing need to develop powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.
When we create our neural networks, we have to make choices for the initial weights and biases. Up to now, we've been choosing them according to a prescription which I discussed only briefly back in Chapter 1. Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean $0$ and standard deviation $1$. While this approach has worked well, it was quite ad hoc, and it's worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.
It turns out that we can do quite a bit better than initializing with normalized Gaussians. To see why, suppose we're working with a network with a large number - say $1,000$ - of input neurons. And let's suppose we've used normalized Gaussians to initialize the weights connecting to the first hidden layer. For now I'm going to concentrate specifically on the weights connecting the input neurons to the first neuron in the hidden layer, and ignore the rest of the network:
We'll suppose for simplicity that we're trying to train using a training input $x$ in which half the input neurons are on, i.e., set to $1$, and half the input neurons are off, i.e., set to $0$. The argument which follows applies more generally, but you'll get the gist from this special case. Let's consider the weighted sum $z = \sum_j w_j x_j+b$ of inputs to our hidden neuron. $500$ terms in this sum vanish, because the corresponding input $x_j$ is zero. And so $z$ is a sum over a total of $501$ normalized Gaussian random variables, accounting for the $500$ weight terms and the $1$ extra bias term. Thus $z$ is itself distributed as a Gaussian with mean zero and standard deviation $\sqrt{501} \approx 22.4$. That is, $z$ has a very broad Gaussian distribution, not sharply peaked at all:
In particular, we can see from this graph that it's quite likely that $|z|$ will be pretty large, i.e., either $z \gg 1$ or $z \ll -1$. If that's the case then the output $\sigma(z)$ from the hidden neuron will be very close to either $1$ or $0$. That means our hidden neuron will have saturated. And when that happens, as we know, making small changes in the weights will make only absolutely miniscule changes in the activation of our hidden neuron. That miniscule change in the activation of the hidden neuron will, in turn, barely affect the rest of the neurons in the network at all, and we'll see a correspondingly miniscule change in the cost function. As a result, those weights will only learn very slowly when we use the gradient descent algorithm* *We discussed this in more detail in Chapter 2, where we used the equations of backpropagation to show that weights input to saturated neurons learned slowly.. It's similar to the problem we discussed earlier in this chapter, in which output neurons which saturated on the wrong value caused learning to slow down. We addressed that earlier problem with a clever choice of cost function. Unfortunately, while that helped with saturated output neurons, it does nothing at all for the problem with saturated hidden neurons.
I've been talking about the weights input to the first hidden layer. Of course, similar arguments apply also to later hidden layers: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to $0$ or $1$, and learning will proceed very slowly.
Is there some way we can choose better initializations for the weights and biases, so that we don't get this kind of saturation, and so avoid a learning slowdown? Suppose we have a neuron with $n_{\rm in}$ input weights. Then we shall initialize those weights as Gaussian random variables with mean $0$ and standard deviation $1/\sqrt{n_{\rm in}}$. That is, we'll squash the Gaussians down, making it less likely that our neuron will saturate. We'll continue to choose the bias as a Gaussian with mean $0$ and standard deviation $1$, for reasons I'll return to in a moment. With these choices, the weighted sum $z = \sum_j w_j x_j + b$ will again be a Gaussian random variable with mean $0$, but it'll be much more sharply peaked than it was before. Suppose, as we did earlier, that $500$ of the inputs are zero and $500$ are $1$. Then it's easy to show (see the exercise below) that $z$ has a Gaussian distribution with mean $0$ and standard deviation $\sqrt{3/2} = 1.22\ldots$. This is much more sharply peaked than before, so much so that even the graph below understates the situation, since I've had to rescale the vertical axis, when compared to the earlier graph:
Such a neuron is much less likely to saturate, and correspondingly much less likely to have problems with a learning slowdown.
I stated above that we'll continue to initialize the biases as before, as Gaussian random variables with a mean of $0$ and a standard deviation of $1$. This is okay, because it doesn't make it too much more likely that our neurons will saturate. In fact, it doesn't much matter how we initialize the biases, provided we avoid the problem with saturation. Some people go so far as to initialize all the biases to $0$, and rely on gradient descent to learn appropriate biases. But since it's unlikely to make much difference, we'll continue with the same initialization procedure as before.
Let's compare the results for both our old and new approaches to weight initialization, using the MNIST digit classification task. As before, we'll use $30$ hidden neurons, a mini-batch size of $10$, a regularization parameter $\lambda = 5.0$, and the cross-entropy cost function. We will decrease the learning rate slightly from $\eta = 0.5$ to $0.1$, since that makes the results a little more easily visible in the graphs. We can train using the old method of weight initialization:
>>> import mnist_loader
The final result is a classification accuracy of $97.92$ percent on
the validation data. That's a big jump from the 30 hidden neuron
case. In fact, tuning just
a little more, to run for 60 epochs at $\eta = 0.1$ and $\lambda =
5.0$ we break the $98$ percent barrier, achieving $98.04$ percent
classification accuracy on the validation data. Not bad for what
turns out to be 152 lines of code!
I've described regularization as a way to reduce overfitting and to
increase classification accuracies. In fact, that's not the only
benefit. Empirically, when doing multiple runs of our MNIST networks,
but with different (random) weight initializations, I've found that
the unregularized runs will occasionally get "stuck", apparently
caught in local minima of the cost function. The result is that
different runs sometimes provide quite different results. By
contrast, the regularized runs have provided much more easily
replicable results.
Why is this going on? Heuristically, if the cost function is
unregularized, then the length of the weight vector is likely to grow,
all other things being equal. Over time this can lead to the weight
vector being very large indeed. This can cause the weight vector to
get stuck pointing in more or less the same direction, since changes
due to gradient descent only make tiny changes to the direction, when
the length is long. I believe this phenomenon is making it hard for
our learning algorithm to properly explore the weight space, and
consequently harder to find good minima of the cost function.
Why does regularization help reduce overfitting?
We've seen empirically that regularization helps reduce overfitting.
That's encouraging but, unfortunately, it's not obvious why
regularization helps! A standard story people tell to explain what's
going on is along the following lines: smaller weights are, in some
sense, lower complexity, and so provide a simpler and more powerful
explanation for the data, and should thus be preferred. That's a
pretty terse story, though, and contains several elements that perhaps
seem dubious or mystifying. Let's unpack the story and examine it
critically. To do that, let's suppose we have a simple data set for
which we wish to build a model:
Implicitly, we're studying some real-world phenomenon here, with $x$
and $y$ representing real-world data. Our goal is to build a model
which lets us predict $y$ as a function of $x$. We could try using
neural networks to build such a model, but I'm going to do something
even simpler: I'll try to model $y$ as a polynomial in $x$. I'm doing
this instead of using neural nets because using polynomials will make
things particularly transparent. Once we've understood the polynomial
case, we'll translate to neural networks. Now, there are ten points
in the graph above, which means we can find a unique $9$th-order
polynomial $y = a_0 x^9 + a_1 x^8 + \ldots + a_9$ which fits the data
exactly. Here's the graph of that polynomial*
*I won't show
the coefficients explicitly, although they are easy to find using a
routine such as Numpy's polyfit. You can view the exact form
of the polynomial in the source code
for the graph if you're curious. It's the function p(x)
defined starting on line 14 of the program which produces the
graph.:
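Incidentally, if you'd like to play with fits like these yourself, they're easy to compute with polyfit. Here's a rough sketch; the data points below are hypothetical stand-ins, since I haven't reproduced the original ten points here:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 10)                   # ten hypothetical data points
y = 2 * x + 0.3 * rng.standard_normal(10)   # roughly y = 2x, plus noise

# A 9th-order polynomial through 10 points fits them exactly
# (NumPy may warn that the fit is poorly conditioned):
exact_coeffs = np.polyfit(x, y, 9)
print(np.polyval(exact_coeffs, x) - y)      # residuals near zero

# A linear fit recovers a slope close to 2:
print(np.polyfit(x, y, 1))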
That provides an exact fit. But we can also get a good fit using the
linear model $y = 2x$:
Which of these is the better model? Which is more likely to be true?
And which model is more likely to generalize well to other examples of
the same underlying real-world phenomenon?
These are difficult questions. In fact, we can't determine with
certainty the answer to any of the above questions, without much more
information about the underlying real-world phenomenon. But let's
consider two possibilities: (1) the $9$th order polynomial is, in
fact, the model which truly describes the real-world phenomenon, and
the model will therefore generalize perfectly; (2) the correct model
is $y = 2x$, but there's a little additional noise due to, say,
measurement error, and that's why the model isn't an exact fit.
It's not a priori possible to say which of these two
possibilities is correct. (Or, indeed, if some third possibility
holds). Logically, either could be true. And it's not a trivial
difference. It's true that on the data provided there's only a small
difference between the two models. But suppose we want to predict the
value of $y$ corresponding to some large value of $x$, much larger
than any shown on the graph above. If we try to do that there will be
a dramatic difference between the predictions of the two models, as
the $9$th order polynomial model comes to be dominated by the $x^9$
term, while the linear model remains, well, linear.
One point of view is to say that in science we should go with the
simpler explanation, unless compelled not to. When we find a simple
model that seems to explain many data points we are tempted to shout
"Eureka!" After all, it seems unlikely that a simple explanation
should occur merely by coincidence. Rather, we suspect that the model
must be expressing some underlying truth about the phenomenon. In the
case at hand, the model $y = 2x+{\rm noise}$ seems much simpler than
$y = a_0 x^9 + a_1 x^8 + \ldots$. It would be surprising if that
simplicity had occurred by chance, and so we suspect that $y = 2x+{\rm
noise}$ expresses some underlying truth. In this point of view, the
9th order model is really just learning the effects of local
noise. And so while the 9th order model works perfectly for these
particular data points, the model will fail to generalize to other
data points, and the noisy linear model will have greater predictive
power.
Let's see what this point of view means for neural networks. Suppose
our network mostly has small weights, as will tend to happen in a
regularized network. The smallness of the weights means that the
behaviour of the network won't change too much if we change a few
random inputs here and there. That makes it difficult for a
regularized network to learn the effects of local noise in the data.
Think of it as a way of making it so single pieces of evidence don't
matter too much to the output of the network. Instead, a regularized
network learns to respond to types of evidence which are seen often
across the training set. By contrast, a network with large weights
may change its behaviour quite a bit in response to small changes in
the input. And so an unregularized network can use large weights to
learn a complex model that carries a lot of information about the
noise in the training data. In a nutshell, regularized networks are
constrained to build relatively simple models based on patterns seen
often in the training data, and are resistant to learning
peculiarities of the noise in the training data. The hope is that
this will force our networks to do real learning about the phenomenon
at hand, and to generalize better from what they learn.
With that said, this idea of preferring simpler explanations should
make you nervous. People sometimes refer to this idea as "Occam's
Razor", and will zealously apply it as though it has the status of
some general scientific principle. But, of course, it's not a general
scientific principle. There is no a priori logical reason to
prefer simple explanations over more complex explanations. Indeed,
sometimes the more complex explanation turns out to be correct.
Let me describe two examples where more complex explanations have
turned out to be correct. In the 1940s the physicist Marcel Schein
announced the discovery of a new particle of nature. The company he
worked for, General Electric, was ecstatic, and publicized the
discovery widely. But the physicist Hans Bethe was skeptical. Bethe
visited Schein, and looked at the plates showing the tracks of
Schein's new particle. Schein showed Bethe plate after plate, but on
each plate Bethe identified some problem that suggested the data
should be discarded. Finally, Schein showed Bethe a plate that looked
good. Bethe said it might just be a statistical fluke. Schein:
"Yes, but the chance that this would be statistics, even according to
your own formula, is one in five." Bethe: "But we have already
looked at five plates." Finally, Schein said: "But on my plates,
each one of the good plates, each one of the good pictures, you
explain by a different theory, whereas I have one hypothesis that
explains all the plates, that they are [the new particle]." Bethe
replied: "The sole difference between your and my explanations is
that yours is wrong and all of mine are right. Your single
explanation is wrong, and all of my multiple explanations are right."
Subsequent work confirmed that Nature agreed with Bethe, and Schein's
particle is no more*
*The story is related by the physicist
Richard Feynman in an
interview
with the historian Charles Weiner..
As a second example, in 1859 the astronomer Urbain Le Verrier observed
that the orbit of the planet Mercury doesn't have quite the shape that
Newton's theory of gravitation says it should have. It was a tiny,
tiny deviation from Newton's theory, and several of the explanations
proffered at the time boiled down to saying that Newton's theory was
more or less right, but needed a tiny alteration. In 1916, Einstein
showed that the deviation could be explained very well using his
general theory of relativity, a theory radically different to
Newtonian gravitation, and based on much more complex mathematics.
Despite that additional complexity, today it's accepted that
Einstein's explanation is correct, and Newtonian gravity, even in its
modified forms, is wrong. This is in part because we now know that
Einstein's theory explains many other phenomena which Newton's theory
has difficulty with. Furthermore, and even more impressively,
Einstein's theory accurately predicts several phenomena which aren't
predicted by Newtonian gravity at all. But these impressive qualities
weren't entirely obvious in the early days. If one had judged merely
on the grounds of simplicity, then some modified form of Newton's
theory would arguably have been more attractive.
There are three morals to draw from these stories. First, it can be
quite a subtle business deciding which of two explanations is truly
"simpler". Second, even if we can make such a judgment, simplicity
is a guide that must be used with great caution! Third, the true test
of a model is not simplicity, but rather how well it does in
predicting new phenomena, in new regimes of behaviour.
With that said, and keeping the need for caution in mind, it's an
empirical fact that regularized neural networks usually generalize
better than unregularized networks. And so through the remainder of
the book we will make frequent use of regularization. I've included
the stories above merely to help convey why no-one has yet developed
an entirely convincing theoretical explanation for why regularization
helps networks generalize. Indeed, researchers continue to write
papers where they try different approaches to regularization, compare
them to see which works better, and attempt to understand why different
approaches work better or worse. And so you can view regularization
as something of a kludge. While it often helps, we don't have an
entirely satisfactory systematic understanding of what's going on,
merely incomplete heuristics and rules of thumb.
There's a deeper set of issues here, issues which go to the heart of
science. It's the question of how we generalize. Regularization may
give us a computational magic wand that helps our networks generalize
better, but it doesn't give us a principled understanding of how
generalization works, nor of what the best approach is*
*These
issues go back to the
problem
of induction, famously discussed by the Scottish philosopher
David Hume in "An
Enquiry Concerning Human Understanding" (1748). The problem of
induction has been given a modern machine learning form in the
no-free-lunch theorem of David Wolpert and William Macready (1997)..
This is particularly galling because in everyday life, we humans
generalize phenomenally well. Shown just a few images of an elephant
a child will quickly learn to recognize other elephants. Of course,
they may occasionally make mistakes, perhaps confusing a rhinoceros
for an elephant, but in general this process works remarkably
accurately. So we have a system - the human brain - with a huge
number of free parameters. And after being shown just one or a few
training images that system learns to generalize to other images. Our
brains are, in some sense, regularizing amazingly well! How do we do
it? At this point we don't know. I expect that in years to come we
will develop more powerful techniques for regularization in artificial
neural networks, techniques that will ultimately enable neural nets to
generalize well even from small data sets.
In fact, our networks already generalize better than one might a
priori expect. A network with 100 hidden neurons has nearly 80,000
parameters. We have only 50,000 images in our training data. It's
like trying to fit an 80,000th degree polynomial to 50,000 data
points. By all rights, our network should overfit terribly. And yet,
as we saw earlier, such a network actually does a pretty good job
generalizing. Why is that the case? It's not well understood. It
has been conjectured*
*In
Gradient-Based
Learning Applied to Document Recognition, by Yann LeCun,
Léon Bottou, Yoshua Bengio, and Patrick Haffner
(1998). that "the dynamics of gradient descent learning in
multilayer nets has a 'self-regularization' effect". This is
exceptionally fortunate, but it's also somewhat disquieting that we
don't understand why it's the case. In the meantime, we will adopt
the pragmatic approach and use regularization whenever we can. Our
neural networks will be the better for it.
Let me conclude this section by returning to a detail which I left
unexplained earlier: the fact that L2 regularization doesn't
constrain the biases. Of course, it would be easy to modify the
regularization procedure to regularize the biases. Empirically, doing
this often doesn't change the results very much, so to some extent
it's merely a convention whether to regularize the biases or not.
However, it's worth noting that having a large bias doesn't make a
neuron sensitive to its inputs in the same way as having large
weights. And so we don't need to worry about large biases enabling
our network to learn the noise in our training data. At the same
time, allowing large biases gives our networks more flexibility in
behaviour - in particular, large biases make it easier for neurons
to saturate, which is sometimes desirable. For these reasons we don't
usually include bias terms when regularizing.
Other techniques for regularization
There are many regularization techniques other than L2 regularization.
In fact, so many techniques have been developed that I can't possibly
summarize them all. In this section I briefly describe three other
approaches to reducing overfitting: L1 regularization, dropout, and
artificially increasing the training set size. We won't go into
nearly as much depth studying these techniques as we did earlier.
Instead, the purpose is to get familiar with the main ideas, and to
appreciate something of the diversity of regularization techniques
available.
L1 regularization: In this approach we modify the
unregularized cost function by adding the sum of the absolute values
of the weights:
\begin{eqnarray} C = C_0 + \frac{\lambda}{n} \sum_w |w|.
\tag{95}\end{eqnarray}
Intuitively, this is similar to L2 regularization, penalizing large
weights, and tending to make the network prefer small weights. Of
course, the L1 regularization term isn't the same as the L2
regularization term, and so we shouldn't expect to get exactly the
same behaviour. Let's try to understand how the behaviour of a
network trained using L1 regularization differs from a network trained
using L2 regularization.
To do that, we'll look at the partial derivatives of the cost
function. Differentiating (95) we obtain:
\begin{eqnarray} \frac{\partial C}{\partial
w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm
sgn}(w),
\tag{96}\end{eqnarray}
where ${\rm sgn}(w)$ is the sign of $w$, that is, $+1$ if $w$ is
positive, and $-1$ if $w$ is negative. Using this expression, we can
easily modify backpropagation to do stochastic gradient descent using
L1 regularization. The resulting update rule for an L1 regularized
network is
\begin{eqnarray} w \rightarrow w' =
w-\frac{\eta \lambda}{n} \mbox{sgn}(w) - \eta \frac{\partial
C_0}{\partial w},
\tag{97}\end{eqnarray}
where, as per usual, we can estimate $\partial C_0 / \partial w$ using
a mini-batch average, if we wish. Compare that to the update rule for
L2 regularization (c.f. Equation (93)),
\begin{eqnarray}
w \rightarrow w' = w\left(1 - \frac{\eta \lambda}{n} \right)
- \eta \frac{\partial C_0}{\partial w}.
\tag{98}\end{eqnarray}
In both expressions the effect of regularization is to shrink the
weights. This accords with our intuition that both kinds of
regularization penalize large weights. But the way the weights shrink
is different. In L1 regularization, the weights shrink by a constant
amount toward $0$. In L2 regularization, the weights shrink by an
amount which is proportional to $w$. And so when a particular weight
has a large magnitude, $|w|$, L1 regularization shrinks the weight
much less than L2 regularization does. By contrast, when $|w|$ is
small, L1 regularization shrinks the weight much more than L2
regularization. The net result is that L1 regularization tends to
concentrate the weight of the network in a relatively small number of
high-importance connections, while the other weights are driven toward
zero.
I've glossed over an issue in the above discussion, which is that the
partial derivative $\partial C / \partial w$ isn't defined when $w =
0$. The reason is that the function $|w|$ has a sharp "corner" at
$w = 0$, and so isn't differentiable at that point. That's okay,
though. What we'll do is just apply the usual (unregularized) rule
for stochastic gradient descent when $w = 0$. That should be okay -
intuitively, the effect of regularization is to shrink weights, and
obviously it can't shrink a weight which is already $0$. To put it
more precisely, we'll use Equations (96) and (97) with the convention
that $\mbox{sgn}(0) = 0$.
That gives a nice, compact rule for doing stochastic gradient descent
with L1 regularization.
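In code the rule really is compact. Here's a sketch of a single update step; the names are mine (grad_C0 stands for a mini-batch estimate of $\partial C_0 / \partial w$), and conveniently np.sign returns $0$ at $0$, implementing the convention just described:

import numpy as np

def l1_step(w, grad_C0, eta, lmbda, n):
    # Equation (97): shrink each weight by a constant amount toward 0,
    # then take the usual gradient step.  np.sign(0) == 0, so a weight
    # that is exactly zero gets no regularization shrinkage.
    return w - (eta * lmbda / n) * np.sign(w) - eta * grad_C0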
Dropout: Dropout is a radically different technique for
regularization. Unlike L1 and L2 regularization, dropout doesn't rely
on modifying the cost function. Instead, in dropout we modify the
network itself. Let me describe the basic mechanics of how dropout
works, before getting into why it works, and what the results are.
Suppose we're trying to train a network:
In particular, suppose we have a training input $x$ and corresponding
desired output $y$. Ordinarily, we'd train by forward-propagating $x$
through the network, and then backpropagating to determine the
contribution to the gradient. With dropout, this process is modified.
We start by randomly (and temporarily) deleting half the hidden
neurons in the network, while leaving the input and output neurons
untouched. After doing this, we'll end up with a network along the
following lines. Note that the dropout neurons, i.e., the neurons
which have been temporarily deleted, are still ghosted in:
We forward-propagate the input $x$ through the modified network, and
then backpropagate the result, also through the modified network.
After doing this over a mini-batch of examples, we update the
appropriate weights and biases. We then repeat the process, first
restoring the dropout neurons, then choosing a new random subset of
hidden neurons to delete, estimating the gradient for a different
mini-batch, and updating the weights and biases in the network.
By repeating this process over and over, our network will learn a set
of weights and biases. Of course, those weights and biases will have
been learnt under conditions in which half the hidden neurons were
dropped out. When we actually run the full network that means that
twice as many hidden neurons will be active. To compensate for that,
we halve the weights outgoing from the hidden neurons.
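To make the mechanics concrete, here's a rough sketch of what the forward pass might look like for a single hidden layer. All the names (w1, b1, and so on) are hypothetical, and this is an illustration of the idea, not how our network code is actually organized:

import numpy as np

rng = np.random.default_rng()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2, p_drop=0.5, training=True):
    hidden = sigmoid(w1 @ x + b1)
    if training:
        # Temporarily "delete" a random half of the hidden neurons.
        mask = (rng.random(hidden.shape) >= p_drop).astype(float)
        return sigmoid(w2 @ (hidden * mask) + b2)
    # At test time all hidden neurons are active, so halve the
    # weights outgoing from the hidden layer to compensate.
    return sigmoid((1.0 - p_drop) * (w2 @ hidden) + b2)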
This dropout procedure may seem strange and ad hoc. Why would
we expect it to help with regularization? To explain what's going on,
I'd like you to briefly stop thinking about dropout, and instead
imagine training neural networks in the standard way (no dropout). In
particular, imagine we train several different neural networks, all
using the same training data. Of course, the networks may not start
out identical, and as a result after training they may sometimes give
different results. When that happens we could use some kind of
averaging or voting scheme to decide which output to accept. For
instance, if we have trained five networks, and three of them are
classifying a digit as a "3", then it probably really is a "3".
The other two networks are probably just making a mistake. This kind
of averaging scheme is often found to be a powerful (though expensive)
way of reducing overfitting. The reason is that the different
networks may overfit in different ways, and averaging may help
eliminate that kind of overfitting.
What's this got to do with dropout? Heuristically, when we dropout
different sets of neurons, it's rather like we're training different
neural networks. And so the dropout procedure is like averaging the
effects of a very large number of different networks. The different
networks will overfit in different ways, and so, hopefully, the net
effect of dropout will be to reduce overfitting.
A related heuristic explanation for dropout is given in one of the
earliest papers to use the
technique*
*ImageNet
Classification with Deep Convolutional Neural Networks, by Alex
Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This
technique reduces complex co-adaptations of neurons, since a neuron
cannot rely on the presence of particular other neurons. It is,
therefore, forced to learn more robust features that are useful in
conjunction with many different random subsets of the other neurons."
In other words, if we think of our network as a model which is making
predictions, then we can think of dropout as a way of making sure that
the model is robust to the loss of any individual piece of evidence.
In this, it's somewhat similar to L1 and L2 regularization, which tend
to reduce weights, and thus make the network more robust to losing any
individual connection in the network.
Of course, the true measure of dropout is that it has been very
successful in improving the performance of neural networks. The
original
paper*
*Improving
neural networks by preventing co-adaptation of feature detectors
by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya
Sutskever, and Ruslan Salakhutdinov (2012). Note that the paper
discusses a number of subtleties that I have glossed over in this
brief introduction. introducing the technique applied it to many
different tasks. For us, it's of particular interest that they applied
dropout to MNIST digit classification, using a vanilla feedforward
neural network along lines similar to those we've been considering.
The paper noted that the best result anyone had achieved up to that
point using such an architecture was $98.4$ percent classification
accuracy on the test set. They improved that to $98.7$ percent
accuracy using a combination of dropout and a modified form of L2
regularization. Similarly impressive results have been obtained for
many other tasks, including problems in image and speech recognition,
and natural language processing. Dropout has been especially useful
in training large, deep networks, where the problem of overfitting is
often acute.
Artificially expanding the training data: We saw earlier that
our MNIST classification accuracy dropped down to percentages in the
mid-80s when we used only 1,000 training images. It's not surprising
that this is the case, since less training data means our network will
be exposed to fewer variations in the way human beings write digits.
Let's try training our 30 hidden neuron network with a variety of
different training data set sizes, to see how performance varies. We
train using a mini-batch size of 10, a learning rate $\eta = 0.5$, a
regularization parameter $\lambda = 5.0$, and the cross-entropy cost
function. We will train for 30 epochs when the full training data set
is used, and scale up the number of epochs proportionally when smaller
training sets are used. To ensure the weight decay factor remains the
same across training sets, we will use a regularization parameter of
$\lambda = 5.0$ when the full training data set is used, and scale
down $\lambda$ proportionally when smaller training sets are
used*
*This and the next two graphs are produced with the
program
more_data.py..
As you can see, the classification accuracies improve considerably as
we use more training data. Presumably this improvement would continue
still further if more data was available. Of course, looking at the
graph above it does appear that we're getting near saturation.
Suppose, however, that we redo the graph with the training set size
plotted logarithmically:
It seems clear that the graph is still going up toward the end. This
suggests that if we used vastly more training data - say, millions
or even billions of handwriting samples, instead of just 50,000 -
then we'd likely get considerably better performance, even from this
very small network.
Obtaining more training data is a great idea. Unfortunately, it can be
expensive, and so is not always possible in practice. However,
there's another idea which can work nearly as well, and that's to
artificially expand the training data. Suppose, for example, that we
take an MNIST training image of a five,
and rotate it by a small amount, let's say 15 degrees:
It's still recognizably the same digit. And yet at the pixel level
it's quite different to any image currently in the MNIST training
data. It's conceivable that adding this image to the training data
might help our network learn more about how to classify digits.
What's more, obviously we're not limited to adding just this one
image. We can expand our training data by making many small
rotations of all the MNIST training images, and then using the
expanded training data to improve our network's performance.
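Here's a sketch of how such an expansion might be coded up, using scipy.ndimage.rotate; the particular angles are illustrative, and this isn't the program used for the results discussed below:

import numpy as np
from scipy.ndimage import rotate

def expand_with_rotations(images, angles=(-15, -10, -5, 5, 10, 15)):
    # images: array of shape (n, 28, 28), pixel values in [0, 1].
    expanded = [images]
    for angle in angles:
        # reshape=False keeps each rotated image at 28x28.
        rotated = rotate(images, angle, axes=(1, 2), reshape=False)
        expanded.append(np.clip(rotated, 0.0, 1.0))
    return np.concatenate(expanded, axis=0)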
This idea is very powerful and has been widely used. Let's look at
some of the results from a
paper*
*Best
Practices for Convolutional Neural Networks Applied to Visual
Document Analysis, by Patrice Simard, Dave Steinkraus, and John
Platt (2003). which applied several variations of the idea to
MNIST. One of the neural network architectures they considered was
along similar lines to what we've been using, a feedforward network
with 800 hidden neurons and using the cross-entropy cost function.
Running the network with the standard MNIST training data they
achieved a classification accuracy of 98.4 percent on their test set.
But then they expanded the training data, using not just rotations, as
I described above, but also translating and skewing the images. By
training on the expanded data set they increased their network's
accuracy to 98.9 percent. They also experimented with what they
called "elastic distortions", a special type of image distortion
intended to emulate the random oscillations found in hand muscles. By
using the elastic distortions to expand the data they achieved an even
higher accuracy, 99.3 percent. Effectively, they were broadening the
experience of their network by exposing it to the sort of variations
that are found in real handwriting.
Variations on this idea can be used to improve performance on many
learning tasks, not just handwriting recognition. The general
principle is to expand the training data by applying operations that
reflect real-world variation. It's not difficult to think of ways of
doing this. Suppose, for example, that you're building a neural
network to do speech recognition. We humans can recognize speech even
in the presence of distortions such as background noise. And so you
can expand your data by adding background noise. We can also
recognize speech if it's sped up or slowed down. So that's another way
we can expand the training data. These techniques are not always used
- for instance, instead of expanding the training data by adding
noise, it may well be more efficient to clean up the input to the
network by first applying a noise reduction filter. Still, it's worth
keeping the idea of expanding the training data in mind, and looking
for opportunities to apply it.
Exercise
An aside on big data and what it means to compare classification accuracies: Let's look again at how our neural network's accuracy varies with training set size:
Suppose that instead of using a neural network we use some other machine learning technique to classify digits. For instance, let's try using the support vector machine (SVM) we met briefly back in Chapter 1. As was the case in Chapter 1, don't worry if you're not familiar with SVMs; we don't need to understand their details. Instead, we'll use the SVM supplied by the scikit-learn library. Here's how SVM performance varies as a function of training set size. I've plotted the neural net results as well, to make comparison easy* *This graph was produced with the program more_data.py (as were the last few graphs).:
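For concreteness, the SVM baseline is of the following out-of-the-box kind. This is only a sketch: the dummy arrays stand in for flattened MNIST images and their labels, which in practice you'd load for real:

from sklearn import svm
import numpy as np

# Dummy stand-ins for MNIST: images of shape (n, 784), labels 0-9.
rng = np.random.default_rng(0)
train_images = rng.random((200, 784))
train_labels = rng.integers(0, 10, 200)
test_images = rng.random((50, 784))
test_labels = rng.integers(0, 10, 50)

clf = svm.SVC()                      # out-of-the-box settings, no tuning
clf.fit(train_images, train_labels)
print(clf.score(test_images, test_labels))  # fraction correctly classified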
Probably the first thing that strikes you about this graph is that our neural network outperforms the SVM for every training set size. That's nice, although you shouldn't read too much into it, since I just used the out-of-the-box settings from scikit-learn's SVM, while we've done a fair bit of work improving our neural network. A more subtle but more interesting fact about the graph is that if we train our SVM using 50,000 images then it actually has better performance (94.48 percent accuracy) than our neural network does when trained using 5,000 images (93.24 percent accuracy). In other words, more training data can sometimes compensate for differences in the machine learning algorithm used.
Something even more interesting can occur. Suppose we're trying to solve a problem using two machine learning algorithms, algorithm A and algorithm B. It sometimes happens that algorithm A will outperform algorithm B with one set of training data, while algorithm B will outperform algorithm A with a different set of training data. We don't see that above - it would require the two graphs to cross - but it does happen* *Striking examples may be found in Scaling to very very large corpora for natural language disambiguation, by Michele Banko and Eric Brill (2001).. The correct response to the question "Is algorithm A better than algorithm B?" is really: "What training data set are you using?"
All this is a caution to keep in mind, both when doing development, and when reading research papers. Many papers focus on finding new tricks to wring out improved performance on standard benchmark data sets. "Our whiz-bang technique gave us an improvement of X percent on standard benchmark Y" is a canonical form of research claim. Such claims are often genuinely interesting, but they must be understood as applying only in the context of the specific training data set used. Imagine an alternate history in which the people who originally created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data. It's entirely possible that the "improvement" due to the whiz-bang technique would disappear on a larger data set. In other words, the purported improvement might be just an accident of history. The message to take away, especially in practical applications, is that what we want is both better algorithms and better training data. It's fine to look for better algorithms, but make sure you're not focusing on better algorithms to the exclusion of easy wins getting more or better training data.
Summing up: We've now completed our dive into overfitting and regularization. Of course, we'll return again to the issue. As I've mentioned several times, overfitting is a major problem in neural networks, especially as computers get more powerful, and we have the ability to train larger networks. As a result there's a pressing need to develop powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.
When we create our neural networks, we have to make choices for the initial weights and biases. Up to now, we've been choosing them according to a prescription which I discussed only briefly back in Chapter 1. Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean $0$ and standard deviation $1$. While this approach has worked well, it was quite ad hoc, and it's worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.
It turns out that we can do quite a bit better than initializing with normalized Gaussians. To see why, suppose we're working with a network with a large number - say $1,000$ - of input neurons. And let's suppose we've used normalized Gaussians to initialize the weights connecting to the first hidden layer. For now I'm going to concentrate specifically on the weights connecting the input neurons to the first neuron in the hidden layer, and ignore the rest of the network:
We'll suppose for simplicity that we're trying to train using a training input $x$ in which half the input neurons are on, i.e., set to $1$, and half the input neurons are off, i.e., set to $0$. The argument which follows applies more generally, but you'll get the gist from this special case. Let's consider the weighted sum $z = \sum_j w_j x_j+b$ of inputs to our hidden neuron. $500$ terms in this sum vanish, because the corresponding input $x_j$ is zero. And so $z$ is a sum over a total of $501$ normalized Gaussian random variables, accounting for the $500$ weight terms and the $1$ extra bias term. Thus $z$ is itself distributed as a Gaussian with mean zero and standard deviation $\sqrt{501} \approx 22.4$. That is, $z$ has a very broad Gaussian distribution, not sharply peaked at all:
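A quick simulation confirms this spread, under the stated assumption of $500$ active inputs plus the bias:

import numpy as np

rng = np.random.default_rng(0)
# z is a sum of 501 independent N(0, 1) variables: 500 weights
# (one per active input) plus the bias.
z = rng.standard_normal((100000, 501)).sum(axis=1)
print(z.std())   # approximately sqrt(501), i.e. about 22.4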
In particular, we can see from this graph that it's quite likely that $|z|$ will be pretty large, i.e., either $z \gg 1$ or $z \ll -1$. If that's the case then the output $\sigma(z)$ from the hidden neuron will be very close to either $1$ or $0$. That means our hidden neuron will have saturated. And when that happens, as we know, making small changes in the weights will make only absolutely minuscule changes in the activation of our hidden neuron. That minuscule change in the activation of the hidden neuron will, in turn, barely affect the rest of the neurons in the network at all, and we'll see a correspondingly minuscule change in the cost function. As a result, those weights will only learn very slowly when we use the gradient descent algorithm* *We discussed this in more detail in Chapter 2, where we used the equations of backpropagation to show that weights input to saturated neurons learned slowly.. It's similar to the problem we discussed earlier in this chapter, in which output neurons which saturated on the wrong value caused learning to slow down. We addressed that earlier problem with a clever choice of cost function. Unfortunately, while that helped with saturated output neurons, it does nothing at all for the problem with saturated hidden neurons.
I've been talking about the weights input to the first hidden layer. Of course, similar arguments apply also to later hidden layers: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to $0$ or $1$, and learning will proceed very slowly.
Is there some way we can choose better initializations for the weights and biases, so that we don't get this kind of saturation, and so avoid a learning slowdown? Suppose we have a neuron with $n_{\rm in}$ input weights. Then we shall initialize those weights as Gaussian random variables with mean $0$ and standard deviation $1/\sqrt{n_{\rm in}}$. That is, we'll squash the Gaussians down, making it less likely that our neuron will saturate. We'll continue to choose the bias as a Gaussian with mean $0$ and standard deviation $1$, for reasons I'll return to in a moment. With these choices, the weighted sum $z = \sum_j w_j x_j + b$ will again be a Gaussian random variable with mean $0$, but it'll be much more sharply peaked than it was before. Suppose, as we did earlier, that $500$ of the inputs are zero and $500$ are $1$. Then it's easy to show (see the exercise below) that $z$ has a Gaussian distribution with mean $0$ and standard deviation $\sqrt{3/2} = 1.22\ldots$. This is much more sharply peaked than before, so much so that even the graph below understates the situation, since I've had to rescale the vertical axis, when compared to the earlier graph:
Such a neuron is much less likely to saturate, and correspondingly much less likely to have problems with a learning slowdown.
I stated above that we'll continue to initialize the biases as before, as Gaussian random variables with a mean of $0$ and a standard deviation of $1$. This is okay, because it doesn't make it too much more likely that our neurons will saturate. In fact, it doesn't much matter how we initialize the biases, provided we avoid the problem with saturation. Some people go so far as to initialize all the biases to $0$, and rely on gradient descent to learn appropriate biases. But since it's unlikely to make much difference, we'll continue with the same initialization procedure as before.
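Put as code, the new scheme looks something like the following sketch, along the lines of the initializer in network2.py (the function name and layout here are my own):

import numpy as np

def new_initializer(sizes):
    # sizes: the layer sizes, e.g. [784, 30, 10].
    # Weights: mean 0, standard deviation 1/sqrt(n_in) for each neuron.
    weights = [np.random.randn(y, x) / np.sqrt(x)
               for x, y in zip(sizes[:-1], sizes[1:])]
    # Biases: mean 0, standard deviation 1, just as before.
    biases = [np.random.randn(y, 1) for y in sizes[1:]]
    return weights, biases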
Let's compare the results for both our old and new approaches to weight initialization, using the MNIST digit classification task. As before, we'll use $30$ hidden neurons, a mini-batch size of $10$, a regularization parameter $\lambda = 5.0$, and the cross-entropy cost function. We will decrease the learning rate slightly from $\eta = 0.5$ to $0.1$, since that makes the results a little more easily visible in the graphs. We can train using the old method of weight initialization:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
...     evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Here, for reference, is how network2.py implements the cross-entropy cost:

class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        return (a-y)
Let's break this down. The first thing to observe is that even though
the cross-entropy is, mathematically speaking, a function, we've
implemented it as a Python class, not a Python function. Why have I
made that choice? The reason is that the cost plays two different
roles in our network. The obvious role is that it's a measure of how
well an output activation, a, matches the desired output,
y. This role is captured by the CrossEntropyCost.fn
method. (Note, by the way, that the np.nan_to_num call inside
CrossEntropyCost.fn ensures that Numpy deals correctly with the
log of numbers very close to zero.) But there's also a second way the
cost function enters our network. Recall from
Chapter
2 that when running the backpropagation algorithm we need to
compute the network's output error, $\delta^L$. The form of the output
error depends on the choice of cost function: different cost function,
different form for the output error. For the cross-entropy the output
error is, as we saw in Equation (66),
\begin{eqnarray}
\delta^L = a^L-y.
\tag{99}\end{eqnarray}
For this reason we define a second method,
CrossEntropyCost.delta, whose purpose is to tell our network
how to compute the output error. And then we bundle these two methods
up into a single class containing everything our networks need to know
about the cost function.
In a similar way, network2.py also contains a class to
represent the quadratic cost function. This is included for
comparison with the results of Chapter 1, since going forward we'll
mostly use the cross entropy. The code is just below. The
QuadraticCost.fn method is a straightforward computation of the
quadratic cost associated to the actual output, a, and the
desired output, y. The value returned by
QuadraticCost.delta is based on the
expression (30), $\delta^L = (a^L-y) \odot \sigma'(z^L)$, for the output error for the
quadratic cost, which we derived back in Chapter 2.
class QuadraticCost(object):
    @staticmethod
    def fn(a, y):
        return 0.5*np.linalg.norm(a-y)**2
    @staticmethod
    def delta(z, a, y):
        return (a-y) * sigmoid_prime(z)
...
That's better! And so we can continue, individually adjusting each
hyper-parameter, gradually improving performance. Once we've explored
to find an improved value for $\eta$, then we move on to find a good
value for $\lambda$. Then experiment with a more complex
architecture, say a network with 10 hidden neurons. Then adjust the
values for $\eta$ and $\lambda$ again. Then increase to 20 hidden
neurons. And then adjust other hyper-parameters some more. And so
on, at each stage evaluating performance using our held-out validation
data, and using those evaluations to find better and better
hyper-parameters. As we do so, it typically takes longer to witness
the impact due to modifications of the hyper-parameters, and so we can
gradually decrease the frequency of monitoring.
This all looks very promising as a broad strategy. However, I want to
return to that initial stage of finding hyper-parameters that enable a
network to learn anything at all. In fact, even the above discussion
conveys too positive an outlook. It can be immensely frustrating to
work with a network that's learning nothing. You can tweak
hyper-parameters for days, and still get no meaningful response. And
so I'd like to re-emphasize that during the early stages you should
make sure you can get quick feedback from experiments. Intuitively,
it may seem as though simplifying the problem and the architecture
will merely slow you down. In fact, it speeds things up, since you
much more quickly find a network with a meaningful signal. Once
you've got such a signal, you can often get rapid improvements by
tweaking the hyper-parameters. As with many things in life, getting
started can be the hardest thing to do.
Okay, that's the broad strategy. Let's now look at some specific
recommendations for setting hyper-parameters. I will focus on the
learning rate, $\eta$, the L2 regularization parameter, $\lambda$, and
the mini-batch size. However, many of the remarks apply also to other
hyper-parameters, including those associated to network architecture,
other forms of regularization, and some hyper-parameters we'll meet
later in the book, such as the momentum co-efficient.
Learning rate: Suppose we run three MNIST networks with three
different learning rates, $\eta = 0.025$, $\eta = 0.25$ and $\eta =
2.5$, respectively. We'll set the other hyper-parameters as for the
experiments in earlier sections, running over 30 epochs, with a
mini-batch size of 10, and with $\lambda = 5.0$. We'll also return to
using the full $50,000$ training images. Here's a graph showing the
behaviour of the training cost as we train*
*The graph was
generated by
multiple_eta.py.:
With $\eta = 0.025$ the cost decreases smoothly until the final epoch.
With $\eta = 0.25$ the cost initially decreases, but after about $20$
epochs it is near saturation, and thereafter most of the changes are
merely small and apparently random oscillations. Finally, with $\eta
= 2.5$ the cost makes large oscillations right from the start. To
understand the reason for the oscillations, recall that stochastic
gradient descent is supposed to step us gradually down into a valley
of the cost function.
However, if $\eta$ is too large then the steps will be so large that
they may actually overshoot the minimum, causing the algorithm to
climb up out of the valley instead. That's likely*
*This
picture is helpful, but it's intended as an intuition-building
illustration of what may go on, not as a complete, exhaustive
explanation. Briefly, a more complete explanation is as follows:
gradient descent uses a first-order approximation to the cost
function as a guide to how to decrease the cost. For large $\eta$,
higher-order terms in the cost function become more important, and
may dominate the behaviour, causing gradient descent to break down.
This is especially likely as we approach minima and quasi-minima of
the cost function, since near such points the gradient becomes
small, making it easier for higher-order terms to dominate
behaviour. what's causing the cost to oscillate when $\eta = 2.5$.
When we choose $\eta = 0.25$ the initial steps do take us toward a
minimum of the cost function, and it's only once we get near that
minimum that we start to suffer from the overshooting problem. And
when we choose $\eta = 0.025$ we don't suffer from this problem at all
during the first $30$ epochs. Of course, choosing $\eta$ so small
creates another problem, namely, that it slows down stochastic
gradient descent. An even better approach would be to start with
$\eta = 0.25$, train for $20$ epochs, and then switch to $\eta =
0.025$. We'll discuss such variable learning rate schedules later.
For now, though, let's stick to figuring out how to find a single good
value for the learning rate, $\eta$.
With this picture in mind, we can set $\eta$ as follows. First, we
estimate the threshold value for $\eta$ at which the cost on the
training data immediately begins decreasing, instead of oscillating or
increasing. This estimate doesn't need to be too accurate. You can
estimate the order of magnitude by starting with $\eta = 0.01$. If
the cost decreases during the first few epochs, then you should
successively try $\eta = 0.1, 1.0, \ldots$ until you find a value for
$\eta$ where the cost oscillates or increases during the first few
epochs. Alternately, if the cost oscillates or increases during the
first few epochs when $\eta = 0.01$, then try $\eta = 0.001, 0.0001,
\ldots$ until you find a value for $\eta$ where the cost decreases
during the first few epochs. Following this procedure will give us an
order of magnitude estimate for the threshold value of $\eta$. You
may optionally refine your estimate, to pick out the largest value of
$\eta$ at which the cost decreases during the first few epochs, say
$\eta = 0.5$ or $\eta = 0.2$ (there's no need for this to be
super-accurate). This gives us an estimate for the threshold value of
$\eta$.
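To make the procedure concrete, here's a minimal sketch of the search. It assumes a hypothetical helper train_and_get_costs(eta, epochs), which trains a fresh network at learning rate eta for a few epochs and returns the list of per-epoch training costs; no such helper exists in network2.py:

def cost_decreases(eta, epochs=5):
    # Crude test: did the training cost fall over the first few epochs?
    costs = train_and_get_costs(eta, epochs)   # hypothetical helper
    return costs[-1] < costs[0]

def estimate_eta_threshold():
    eta = 0.01
    if cost_decreases(eta):
        # Cost decreases: scale up by 10 until it no longer does.
        while cost_decreases(eta * 10):
            eta *= 10
    else:
        # Cost oscillates or increases: scale down until it decreases.
        while not cost_decreases(eta):
            eta /= 10
    return eta   # largest tested order of magnitude at which the cost decreases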
Obviously, the actual value of $\eta$ that you use should be no larger
than the threshold value. In fact, if the value of $\eta$ is to
remain usable over many epochs then you likely want to use a value
for $\eta$ that is smaller, say, a factor of two below the threshold.
Such a choice will typically allow you to train for many epochs,
without causing too much of a slowdown in learning.
In the case of the MNIST data, following this strategy leads to an
estimate of $0.1$ for the order of magnitude of the threshold value of
$\eta$. After some more refinement, we obtain a threshold value $\eta
= 0.5$. Following the prescription above, this suggests using $\eta =
0.25$ as our value for the learning rate. In fact, I found that using
$\eta = 0.5$ worked well enough over $30$ epochs that for the most
part I didn't worry about using a lower value of $\eta$.
This all seems quite straightforward. However, using the training
cost to pick $\eta$ appears to contradict what I said earlier in this
section, namely, that we'd pick hyper-parameters by evaluating
performance using our held-out validation data. In fact, we'll use
validation accuracy to pick the regularization hyper-parameter, the
mini-batch size, and network parameters such as the number of layers
and hidden neurons, and so on. Why do things differently for the
learning rate? Frankly, this choice is my personal aesthetic
preference, and is perhaps somewhat idiosyncratic. The reasoning is
that the other hyper-parameters are intended to improve the final
classification accuracy on the test set, and so it makes sense to
select them on the basis of validation accuracy. However, the
learning rate is only incidentally meant to impact the final
classification accuracy. Its primary purpose is really to control
the step size in gradient descent, and monitoring the training cost is
the best way to detect if the step size is too big. With that said,
this is a personal aesthetic preference. Early on during learning the
training cost usually only decreases if the validation accuracy
improves, and so in practice it's unlikely to make much difference
which criterion you use.
Use early stopping to determine the number of training
epochs: As we discussed earlier in the chapter, early stopping means
that at the end of each epoch we should compute the classification
accuracy on the validation data. When that stops improving,
terminate. This makes setting the number of epochs very simple. In
particular, it means that we don't need to worry about explicitly
figuring out how the number of epochs depends on the other
hyper-parameters. Instead, that's taken care of automatically.
Furthermore, early stopping also automatically prevents us from
overfitting. This is, of course, a good thing, although in the early
stages of experimentation it can be helpful to turn off early
stopping, so you can see any signs of overfitting, and use it to
inform your approach to regularization.
To implement early stopping we need to say more precisely what it
means that the classification accuracy has stopped improving. As
we've seen, the accuracy can jump around quite a bit, even when the
overall trend is to improve. If we stop the first time the accuracy
decreases then we'll almost certainly stop when there are more
improvements to be had. A better rule is to terminate if the best
classification accuracy doesn't improve for quite some time. Suppose,
for example, that we're doing MNIST. Then we might elect to terminate
if the classification accuracy hasn't improved during the last ten
epochs. This ensures that we don't stop too soon, in response to bad
luck in training, but also that we're not waiting around forever for
an improvement that never comes.
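In code, the rule is just a counter. Here's a minimal sketch, assuming hypothetical train_one_epoch and validation_accuracy helpers (they aren't functions in network2.py), with the patience parameter set to ten:

best_accuracy = 0.0
epochs_since_improvement = 0
patience = 10          # the "ten" in no-improvement-in-ten
max_epochs = 400       # generous cap; early stopping usually triggers first

for epoch in range(max_epochs):
    train_one_epoch()                  # hypothetical: one epoch of SGD
    accuracy = validation_accuracy()   # hypothetical: evaluate on validation data
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        epochs_since_improvement = 0
    else:
        epochs_since_improvement += 1
    if epochs_since_improvement >= patience:
        break   # no improvement in `patience` epochs: stop training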
This no-improvement-in-ten rule is good for initial exploration of
MNIST. However, networks can sometimes plateau near a particular
classification accuracy for quite some time, only to then begin
improving again. If you're trying to get really good performance, the
no-improvement-in-ten rule may be too aggressive about stopping. In
that case, I suggest using the no-improvement-in-ten rule for initial
experimentation, and gradually adopting more lenient rules, as you
better understand the way your network trains:
no-improvement-in-twenty, no-improvement-in-fifty, and so on. Of
course, this introduces a new hyper-parameter to optimize! In
practice, however, it's usually easy to set this hyper-parameter to
get pretty good results. Similarly, for problems other than MNIST,
the no-improvement-in-ten rule may be much too aggressive or not
nearly aggressive enough, depending on the details of the problem.
However, with a little experimentation it's usually easy to find a
pretty good strategy for early stopping.
We haven't used early stopping in our MNIST experiments to date. The
reason is that we've been doing a lot of comparisons between different
approaches to learning. For such comparisons it's helpful to use the
same number of epochs in each case. However, it's well worth
modifying network2.py to implement early stopping:
Problem
Modify network2.py so that it implements early stopping using a no-improvement-in-$n$ epochs strategy, where $n$ is a parameter that can be set.
Learning rate schedule: We've been holding the learning rate $\eta$ constant. However, it's often advantageous to vary the learning rate. Early on during the learning process it's likely that the weights are badly wrong. And so it's best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
How should we set our learning rate schedule? Many approaches are possible. One natural approach is to use the same basic idea as early stopping. The idea is to hold the learning rate constant until the validation accuracy starts to get worse. Then decrease the learning rate by some amount, say a factor of two or ten. We repeat this many times, until, say, the learning rate is a factor of 1,024 (or 1,000) times lower than the initial value. Then we terminate.
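Schematically, such a schedule might look like the following sketch, which reuses the no-improvement-in-$n$ counter from above (train_one_epoch and validation_accuracy remain hypothetical helpers):

eta = 0.5                  # illustrative starting learning rate
final_eta = eta / 1024.0   # terminate once eta has been halved ten times
best_accuracy, stalled = 0.0, 0

while eta > final_eta:
    train_one_epoch(eta)
    accuracy = validation_accuracy()
    if accuracy > best_accuracy:
        best_accuracy, stalled = accuracy, 0
    else:
        stalled += 1
    if stalled >= 10:   # validation accuracy has stopped improving
        eta /= 2.0      # halve the learning rate...
        stalled = 0     # ...and give the network time at the new rate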
A variable learning schedule can improve performance, but it also opens up a world of possible choices for the learning schedule. Those choices can be a headache - you can spend forever trying to optimize your learning schedule. For first experiments my suggestion is to use a single, constant value for the learning rate. That'll get you a good first approximation. Later, if you want to obtain the best performance from your network, it's worth experimenting with a learning schedule, along the lines I've described* *A readable recent paper which demonstrates the benefits of variable learning rates in attacking MNIST is Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber (2010)..
The regularization parameter, $\lambda$: I suggest starting initially with no regularization ($\lambda = 0.0$), and determining a value for $\eta$, as above. Using that choice of $\eta$, we can then use the validation data to select a good value for $\lambda$. Start by trialling $\lambda = 1.0$* *I don't have a good principled justification for using this as a starting value. If anyone knows of a good principled discussion of where to start with $\lambda$, I'd appreciate hearing it (mn@michaelnielsen.org)., and then increase or decrease by factors of $10$, as needed to improve performance on the validation data. Once you've found a good order of magnitude, you can fine tune your value of $\lambda$. That done, you should return and re-optimize $\eta$ again.
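As a sketch, the order-of-magnitude stage of that search could look like this, where validation_accuracy_for(lmbda) is a hypothetical helper that trains with the chosen $\eta$ and the given regularization parameter, then returns the validation accuracy:

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]   # factors of 10 around 1.0
scores = {lmbda: validation_accuracy_for(lmbda) for lmbda in candidates}
best_lmbda = max(scores, key=scores.get)
# Then fine-tune within that order of magnitude, e.g. if best_lmbda is
# 10.0, try values such as 2.0, 5.0, 20.0, 50.0, before re-optimizing eta.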
How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you'll find that you get values for $\eta$ and $\lambda$ which don't always exactly match the values I've used earlier in the book. The reason is that the book has narrative constraints that have sometimes made it impractical to optimize the hyper-parameters. Think of all the comparisons we've made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on. To make such comparisons meaningful, I've usually tried to keep hyper-parameters constant across the approaches being compared (or to scale them in an appropriate way). Of course, there's no reason for the same hyper-parameters to be optimal for all the different approaches to learning, so the hyper-parameters I've used are something of a compromise.
As an alternative to this compromise, I could have tried to optimize the heck out of the hyper-parameters for every single approach to learning. In principle that'd be a better, fairer approach, since then we'd see the best from every approach to learning. However, we've made dozens of comparisons along these lines, and in practice I found it too computationally expensive. That's why I've adopted the compromise of using pretty good (but not necessarily optimal) choices for the hyper-parameters.
Mini-batch size: How should we set the mini-batch size? To answer this question, let's first suppose that we're doing online learning, i.e., that we're using a mini-batch size of $1$.
The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.
Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. In a problem in the last chapter I pointed out that it's possible to use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of your hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size $100$, rather than computing the mini-batch gradient estimate by looping over the $100$ training examples separately. It might take (say) only $50$ times as long, rather than $100$ times as long.
Now, at first it seems as though this doesn't help us that much. With our mini-batch of size $100$ the learning rule for the weights looks like: \begin{eqnarray} w \rightarrow w' = w-\eta \frac{1}{100} \sum_x \nabla C_x, \tag{100}\end{eqnarray} where the sum is over training examples in the mini-batch. This is versus \begin{eqnarray} w \rightarrow w' = w-\eta \nabla C_x \tag{101}\end{eqnarray} for online learning. Even if it only takes $50$ times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor $100$, so the update rule becomes \begin{eqnarray} w \rightarrow w' = w-\eta \sum_x \nabla C_x. \tag{102}\end{eqnarray} That's a lot like doing $100$ separate instances of online learning with a learning rate of $\eta$. But it only takes $50$ times as long as doing a single instance of online learning. Of course, it's not truly the same as $100$ instances of online learning, since in the mini-batch the $\nabla C_x$'s are all evaluated for the same set of weights, as opposed to the cumulative learning that occurs in the online case. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
With these factors in mind, choosing the best mini-batch size is a compromise. Too small, and you don't get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. Too large and you're simply not updating your weights often enough. What you need is to choose a compromise value which maximizes the speed of learning. Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so you don't need to have optimized those hyper-parameters in order to find a good mini-batch size. The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then trial a number of different mini-batch sizes, scaling $\eta$ as above. Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance. With the mini-batch size chosen you can then proceed to optimize the other hyper-parameters.
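A sketch of that experiment, again with hypothetical train_for_one_epoch and validation_accuracy helpers, and with $\eta$ scaled linearly in the mini-batch size per the argument above:

import time

base_eta = 0.05   # illustrative value for mini-batch size 1
curves = {}       # mini-batch size -> list of (elapsed seconds, accuracy)

for batch_size in [1, 10, 100, 1000]:
    eta = base_eta * batch_size   # scale eta with the mini-batch size
    start, curve = time.time(), []
    for epoch in range(30):
        train_for_one_epoch(batch_size, eta)   # hypothetical helper
        curve.append((time.time() - start, validation_accuracy()))
    curves[batch_size] = curve
# Plot accuracy against elapsed wall-clock time (not epochs!) for each
# mini-batch size, and pick whichever size improves fastest.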
Of course, as you've no doubt realized, I haven't done this optimization in our work. Indeed, our implementation doesn't use the faster approach to mini-batch updates at all. I've simply used a mini-batch size of $10$ without comment or explanation in nearly all examples. Because of this, we could have sped up learning by reducing the mini-batch size. I haven't done this, in part because I wanted to illustrate the use of mini-batches beyond size $1$, and in part because my preliminary experiments suggested the speedup would be rather modest. In practical implementations, however, we would most certainly implement the faster approach to mini-batch updates, and then make an effort to optimize the mini-batch size, in order to maximize our overall speed.
Automated techniques: I've been describing these heuristics as though you're optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* *Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012). by James Bergstra and Yoshua Bengio. Many more sophisticated approaches have also been proposed. I won't review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters* *Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams.. The code from the paper is publicly available, and has been used with some success by other researchers.
Summing up: Following the rules-of-thumb I've described won't give you the absolute best possible results from your neural network. But it will likely give you a good start and a basis for further improvements. In particular, I've discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with $\eta$, feel that you've got it just right, then start to optimize for $\lambda$, only to find that it's messing up your optimization for $\eta$. In practice, it helps to bounce backward and forward, gradually closing in on good values. Above all, keep in mind that the heuristics I've described are rules of thumb, not rules cast in stone. You should be on the lookout for signs that things aren't working, and be willing to experiment. In particular, this means carefully monitoring your network's behaviour, especially the validation accuracy.
The difficulty of choosing hyper-parameters is exacerbated by the fact that the lore about how to choose hyper-parameters is widely spread, across many research papers and software programs, and often is only available inside the heads of individual practitioners. There are many, many papers setting out (sometimes contradictory) recommendations for how to proceed. However, there are a few particularly useful papers that synthesize and distill out much of this lore. Yoshua Bengio has a 2012 paper* *Practical recommendations for gradient-based training of deep architectures, by Yoshua Bengio (2012). that gives some practical recommendations for using backpropagation and gradient descent to train neural networks, including deep neural nets. Bengio discusses many issues in much more detail than I have, including how to do more systematic hyper-parameter searches. Another good paper is a 1998 paper* *Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998). by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller. Both these papers appear in an extremely useful 2012 book that collects many tricks commonly used in neural nets* *Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller.. The book is expensive, but many of the articles have been placed online by their respective authors with, one presumes, the blessing of the publisher, and may be located using a search engine.
One thing that becomes clear as you read these articles and, especially, as you engage in your own experiments, is that hyper-parameter optimization is not a problem that is ever completely solved. There's always another trick you can try to improve performance. There is a saying common among writers that books are never finished, only abandoned. The same is also true of neural network optimization: the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity. So your goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that's important.
The challenge of setting hyper-parameters has led some people to complain that neural networks require a lot of work when compared with other machine learning techniques. I've heard many variations on the following complaint: "Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or$\ldots$ insert your own favorite technique] and it just works. I don't have time to figure out just the right neural network." Of course, from a practical point of view it's good to have easy-to-apply techniques. This is particularly true when you're just getting started on a problem, and it may not be obvious whether machine learning can help solve the problem at all. On the other hand, if getting optimal performance is important, then you may need to try approaches that require more specialist knowledge. While it would be nice if machine learning were always easy, there is no a priori reason it should be trivially simple.
Each technique developed in this chapter is valuable to know in its own right, but that's not the only reason I've explained them. The larger point is to familiarize you with some of the problems which can occur in neural networks, and with a style of analysis which can help overcome those problems. In a sense, we've been learning how to think about neural nets. Over the remainder of this chapter I briefly sketch a handful of other techniques. These sketches are less in-depth than the earlier discussions, but should convey some feeling for the diversity of techniques available for use in neural networks.
Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.
Hessian technique: To begin our discussion it helps to put neural networks aside for a bit. Instead, we're just going to consider the abstract problem of minimizing a cost function $C$ which is a function of many variables, $w = w_1, w_2, \ldots$, so $C = C(w)$. By Taylor's theorem, the cost function can be approximated near a point $w$ by \begin{eqnarray} C(w+\Delta w) & = & C(w) + \sum_j \frac{\partial C}{\partial w_j} \Delta w_j \nonumber \\ & & + \frac{1}{2} \sum_{jk} \Delta w_j \frac{\partial^2 C}{\partial w_j \partial w_k} \Delta w_k + \ldots \tag{103}\end{eqnarray} We can rewrite this more compactly as \begin{eqnarray} C(w+\Delta w) = C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w + \ldots, \tag{104}\end{eqnarray} where $\nabla C$ is the usual gradient vector, and $H$ is a matrix known as the Hessian matrix, whose $jk$th entry is $\partial^2 C / \partial w_j \partial w_k$. Suppose we approximate $C$ by discarding the higher-order terms represented by $\ldots$ above, \begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w. \tag{105}\end{eqnarray} Using calculus we can show that the expression on the right-hand side can be minimized* *Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite. Intuitively, this means that the function $C$ looks like a valley locally, not a mountain or a saddle. by choosing \begin{eqnarray} \Delta w = -H^{-1} \nabla C. \tag{106}\end{eqnarray} Provided (105) is a good approximate expression for the cost function, we'd expect that moving from the point $w$ to $w+\Delta w = w-H^{-1} \nabla C$ should significantly decrease the cost function. That suggests a possible algorithm for minimizing the cost: choose a starting point $w$; update it to a new point $w' = w - H^{-1} \nabla C$, with the Hessian $H$ and gradient $\nabla C$ computed at $w$; and repeat, recomputing $H$ and $\nabla C$ at each new point.
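Outside neural networks, where $H$ is small enough to handle, the update is easy to demonstrate. Here's a minimal sketch on a made-up two-variable quadratic cost $C(w) = \frac{1}{2} w^T A w - b \cdot w$, whose Hessian is just $A$:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # positive definite, so C has a unique minimum
b = np.array([1.0, 1.0])

def grad_C(w):
    return A @ w - b             # gradient of C at w

w = np.array([5.0, -3.0])        # arbitrary starting point
for step in range(5):
    # w -> w - H^{-1} grad C; solve(H, g) avoids forming H^{-1} explicitly.
    w = w - np.linalg.solve(A, grad_C(w))

print(w, np.linalg.solve(A, b))  # both print the minimizer A^{-1} b

For a truly quadratic cost a single step lands exactly on the minimum; for a general cost $H$ and $\nabla C$ must be recomputed at each new point.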
This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization. There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function it's possible for the Hessian approach to avoid many pathologies that can occur in gradient descent. Furthermore, there are versions of the backpropagation algorithm which can be used to compute the Hessian.
If Hessian optimization is so great, why aren't we using it in our neural networks? Unfortunately, while it has many desirable properties, it has one very undesirable property: it's very difficult to apply in practice. Part of the problem is the sheer size of the Hessian matrix. Suppose you have a neural network with $10^7$ weights and biases. Then the corresponding Hessian matrix will contain $10^7 \times 10^7 = 10^{14}$ entries. That's a lot of entries! And that makes computing $H^{-1} \nabla C$ extremely difficult in practice. However, that doesn't mean that it's not useful to understand. In fact, there are many variations on gradient descent which are inspired by Hessian optimization, but which avoid the problem with overly-large matrices. Let's take a look at one such technique, momentum-based gradient descent.
Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of "velocity" for the parameters we're trying to optimize. The gradient acts to change the velocity, not (directly) the "position", in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.
Let's give a more precise mathematical description. We introduce velocity variables $v = v_1, v_2, \ldots$, one for each corresponding $w_j$ variable* *In a neural net the $w_j$ variables would, of course, include all weights and biases.. Then we replace the gradient descent update rule $w \rightarrow w'= w-\eta \nabla C$ by \begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \tag{107}\\ w & \rightarrow & w' = w+v'. \tag{108}\end{eqnarray} In these equations, $\mu$ is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it's helpful to first consider the case where $\mu = 1$, which corresponds to no friction. When that's the case, inspection of the equations shows that the "force" $\nabla C$ is now modifying the velocity, $v$, and the velocity is controlling the rate of change of $w$. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we're moving straight down a slope:
With each step the velocity gets larger down the slope, so we move more and more quickly to the bottom of the valley. This can enable the momentum technique to work much faster than standard gradient descent. Of course, a problem is that once we reach the bottom of the valley we will overshoot. Or, if the gradient should change rapidly, then we could find ourselves moving in the wrong direction. That's the reason for the $\mu$ hyper-parameter in (107). I said earlier that $\mu$ controls the amount of friction in the system; to be a little more precise, you should think of $1-\mu$ as the amount of friction in the system. When $\mu = 1$, as we've seen, there is no friction, and the velocity is completely driven by the gradient $\nabla C$. By contrast, when $\mu = 0$ there's a lot of friction, the velocity can't build up, and Equations (107) and (108) reduce to the usual equation for gradient descent, $w \rightarrow w'=w-\eta \nabla C$. In practice, using a value of $\mu$ intermediate between $0$ and $1$ can give us much of the benefit of being able to build up speed, but without causing overshooting. We can choose such a value for $\mu$ using the held-out validation data, in much the same way as we select $\eta$ and $\lambda$.
I've avoided naming the hyper-parameter $\mu$ up to now. The reason is that the standard name for $\mu$ is badly chosen: it's called the momentum co-efficient. This is potentially confusing, since $\mu$ is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum co-efficient is widely used, so we will continue to use it.
A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it's done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
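A minimal sketch of that modification, using a made-up quadratic cost so the fragment runs on its own (in a real network grad_C would come from backpropagation on a mini-batch):

import numpy as np

target = np.ones(3)          # made-up: C(w) = 0.5*||w - target||^2
def grad_C(w):
    return w - target

mu, eta = 0.9, 0.1           # illustrative momentum co-efficient and learning rate
w = np.zeros(3)              # the parameters being optimized
v = np.zeros_like(w)         # one velocity variable per parameter

for step in range(200):
    v = mu * v - eta * grad_C(w)   # Equation (107): the gradient drives the velocity
    w = w + v                      # Equation (108): the velocity drives the position

print(w)   # close to target: momentum-based gradient descent has found the minimum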
Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn't universal agreement on which is the best approach. As you go deeper into neural networks it's worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier* *Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998). introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results* *See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012). is Nesterov's accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we'll stick to stochastic gradient descent through the remainder of this book.
Up to now we've built our neural networks using sigmoid neurons. In principle, a network built from sigmoid neurons can compute any function. In practice, however, networks built using other model neurons sometimes outperform sigmoid networks. Depending on the application, networks based on such alternate models may learn faster, generalize better to test data, or perhaps do both. Let me mention a couple of alternate model neurons, to give you the flavor of some variations in common use.
Perhaps the simplest variation is the tanh (pronounced "tanch") neuron, which replaces the sigmoid function by the hyperbolic tangent function. The output of a tanh neuron with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \tanh(w \cdot x+b), \tag{109}\end{eqnarray} where $\tanh$ is, of course, the hyperbolic tangent function. It turns out that this is very closely related to the sigmoid neuron. To see this, recall that the $\tanh$ function is defined by \begin{eqnarray} \tanh(z) \equiv \frac{e^z-e^{-z}}{e^z+e^{-z}}. \tag{110}\end{eqnarray} With a little algebra it can easily be verified that \begin{eqnarray} \sigma(z) = \frac{1+\tanh(z/2)}{2}, \tag{111}\end{eqnarray} that is, $\tanh$ is just a rescaled version of the sigmoid function. We can also see graphically that the $\tanh$ function has the same shape as the sigmoid function,
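The "little algebra" is short enough to show in full. Writing $\tanh(z/2)$ out and simplifying:
\begin{eqnarray}
\frac{1+\tanh(z/2)}{2} & = & \frac{1}{2} \left( 1 + \frac{e^{z/2}-e^{-z/2}}{e^{z/2}+e^{-z/2}} \right) = \frac{1}{2} \cdot \frac{2 e^{z/2}}{e^{z/2}+e^{-z/2}} \nonumber \\
& = & \frac{e^{z/2}}{e^{z/2}+e^{-z/2}} = \frac{1}{1+e^{-z}} = \sigma(z), \nonumber
\end{eqnarray}
where the last equality divides the numerator and denominator by $e^{z/2}$.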
One difference between tanh neurons and sigmoid neurons is that the output from tanh neurons ranges from -1 to 1, not 0 to 1. This means that if you're going to build a network based on tanh neurons you may need to normalize your outputs (and, depending on the details of the application, possibly your inputs) a little differently than in sigmoid networks.
Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function* *There are some technical caveats to this statement for both tanh and sigmoid neurons, as well as for the rectified linear neurons discussed below. However, informally it's usually fine to think of neural networks as being able to approximate any function to arbitrary accuracy. mapping inputs to the range -1 to 1. Furthermore, ideas such as backpropagation and stochastic gradient descent are as easily applied to a network of tanh neurons as to a network of sigmoid neurons.
Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better* *See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010).. Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we're using sigmoid neurons, so all activations in our network are positive. Let's consider the weights $w^{l+1}_{jk}$ input to the $j$th neuron in the $l+1$th layer. The rules for backpropagation (see here) tell us that the associated gradient will be $a^l_k \delta^{l+1}_j$. Because the activations are positive the sign of this gradient will be the same as the sign of $\delta^{l+1}_j$. What this means is that if $\delta^{l+1}_j$ is positive then all the weights $w^{l+1}_{jk}$ will decrease during gradient descent, while if $\delta^{l+1}_j$ is negative then all the weights $w^{l+1}_{jk}$ will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That's a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as $\tanh$, which allows both positive and negative activations. Indeed, because $\tanh$ is symmetric about zero, $\tanh(-z) = -\tanh(z)$, we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.
How seriously should we take this argument? While the argument is suggestive, it's a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don't yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.
Another variation on the sigmoid neuron is the rectified linear neuron or rectified linear unit. The output of a rectified linear unit with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \max(0, w \cdot x+b). \tag{112}\end{eqnarray} Graphically, the rectifying function $\max(0, z)$ looks like this:
Obviously such neurons are quite different from both sigmoid and tanh neurons. However, like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.
When should you use rectified linear units instead of sigmoid or tanh neurons? Some recent work on image recognition* *See, for example, What is the Best Multi-Stage Architecture for Object Recognition?, by Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun (2009), Deep Sparse Rectifier Neural Networks, by Xavier Glorot, Antoine Bordes, and Yoshua Bengio (2011), and ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012). Note that these papers fill in important details about how to set up the output layer, cost function, and regularization in networks using rectified linear units. I've glossed over all these details in this brief account. The papers also discuss in more detail the benefits and drawbacks of using rectified linear units. Another informative paper is Rectified Linear Units Improve Restricted Boltzmann Machines, by Vinod Nair and Geoffrey Hinton (2010), which demonstrates the benefits of using rectified linear units in a somewhat different approach to neural networks. has found considerable benefit in using rectified linear units through much of the network. However, as with tanh neurons, we do not yet have a really deep understanding of when, exactly, rectified linear units are preferable, nor why. To give you the flavor of some of the issues, recall that sigmoid neurons stop learning when they saturate, i.e., when their output is near either $0$ or $1$. As we've seen repeatedly in this chapter, the problem is that $\sigma'$ terms reduce the gradient, and that slows down learning. Tanh neurons suffer from a similar problem when they saturate. By contrast, increasing the weighted input to a rectified linear unit will never cause it to saturate, and so there is no corresponding learning slowdown. On the other hand, when the weighted input to a rectified linear unit is negative, the gradient vanishes, and so the neuron stops learning entirely. These are just two of the many issues that make it non-trivial to understand when and why rectified linear units perform better than sigmoid or tanh neurons.
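To make the saturation contrast concrete, here's a small numeric sketch comparing the derivative factors that scale the gradient for each neuron type:

import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # nearly 0 once the sigmoid saturates

def tanh_prime(z):
    return 1.0 - np.tanh(z)**2      # also nearly 0 for large |z|

def relu_prime(z):
    return 1.0 if z > 0 else 0.0    # 1 for positive input, exactly 0 otherwise

for z in [-10.0, 0.0, 10.0]:
    print(z, sigmoid_prime(z), tanh_prime(z), relu_prime(z))

At $z = 10$ the sigmoid and tanh derivatives are roughly $5 \times 10^{-5}$ and $8 \times 10^{-9}$, so learning has all but stopped, while the rectified linear unit still passes a gradient of $1$; at $z = -10$ the rectified linear unit's gradient is exactly $0$, and the neuron stops learning entirely.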
I've painted a picture of uncertainty here, stressing that we do not yet have a solid theory of how activation functions should be chosen. Indeed, the problem is harder even than I have described, for there are infinitely many possible activation functions. Which is the best for any given problem? Which will result in a network which learns fastest? Which will give the highest test accuracies? I am surprised how little really deep and systematic investigation has been done of these questions. Ideally, we'd have a theory which tells us, in detail, how to choose (and perhaps modify-on-the-fly) our activation functions. On the other hand, we shouldn't let the lack of a full theory stop us! We have powerful tools already at hand, and can make a lot of progress with those tools. Through the remainder of this book I'll continue to use sigmoid neurons as our go-to neuron, since they're powerful and provide concrete illustrations of the core ideas about neural nets. But keep in the back of your mind that these same ideas can be applied to other types of neuron, and that there are sometimes advantages in doing so.
Question: How do you
approach utilizing and researching machine learning techniques that
are supported almost entirely empirically, as opposed to
mathematically? Also in what situations have you noticed some of
these techniques fail? Answer: You have to realize that our theoretical
tools are very weak. Sometimes, we have good mathematical intuitions
for why a particular technique should work. Sometimes our intuition
ends up being wrong [...] The questions become: how well does my
method work on this particular problem, and how large is the set of
problems on which it works well.
- Question and answer with neural networks researcher Yann LeCun
Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with "I'm very sympathetic to your point of view, but [...]". Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn't always have unarguable new progress to report.
You may have noticed a verbal tic similar to "I'm very sympathetic [...]" in the current book. To explain what we're seeing I've often fallen back on saying "Heuristically, [...]", or "Roughly speaking, [...]", following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I've presented has often been pretty thin. If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?
In many parts of science - especially those parts that deal with simple phenomena - it's possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it's exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.
One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works* *From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we've discussed. Each heuristic is not just a (potential) explanation, it's also a challenge to investigate and understand in more detail.
Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
That's better! And so we can continue, individually adjusting each hyper-parameter, gradually improving performance. Once we've explored to find an improved value for $\eta$, then we move on to find a good value for $\lambda$. Then experiment with a more complex architecture, say a network with 10 hidden neurons. Then adjust the values for $\eta$ and $\lambda$ again. Then increase to 20 hidden neurons. And then adjust other hyper-parameters some more. And so on, at each stage evaluating performance using our held-out validation data, and using those evaluations to find better and better hyper-parameters. As we do so, it typically takes longer to witness the impact due to modifications of the hyper-parameters, and so we can gradually decrease the frequency of monitoring.
This all looks very promising as a broad strategy. However, I want to return to that initial stage of finding hyper-parameters that enable a network to learn anything at all. In fact, even the above discussion conveys too positive an outlook. It can be immensely frustrating to work with a network that's learning nothing. You can tweak hyper-parameters for days, and still get no meaningful response. And so I'd like to re-emphasize that during the early stages you should make sure you can get quick feedback from experiments. Intuitively, it may seem as though simplifying the problem and the architecture will merely slow you down. In fact, it speeds things up, since you much more quickly find a network with a meaningful signal. Once you've got such a signal, you can often get rapid improvements by tweaking the hyper-parameters. As with many things in life, getting started can be the hardest thing to do.
Okay, that's the broad strategy. Let's now look at some specific recommendations for setting hyper-parameters. I will focus on the learning rate, $\eta$, the L2 regularization parameter, $\lambda$, and the mini-batch size. However, many of the remarks apply also to other hyper-parameters, including those associated to network architecture, other forms of regularization, and some hyper-parameters we'll meet later in the book, such as the momentum co-efficient.
Learning rate: Suppose we run three MNIST networks with three different learning rates, $\eta = 0.025$, $\eta = 0.25$ and $\eta = 2.5$, respectively. We'll set the other hyper-parameters as for the experiments in earlier sections, running over 30 epochs, with a mini-batch size of 10, and with $\lambda = 5.0$. We'll also return to using the full $50,000$ training images. Here's a graph showing the behaviour of the training cost as we train* *The graph was generated by multiple_eta.py.:
With $\eta = 0.025$ the cost decreases smoothly until the final epoch. With $\eta = 0.25$ the cost initially decreases, but after about $20$ epochs it is near saturation, and thereafter most of the changes are merely small and apparently random oscillations. Finally, with $\eta = 2.5$ the cost makes large oscillations right from the start. To understand the reason for the oscillations, recall that stochastic gradient descent is supposed to step us gradually down into a valley of the cost function,
However, if $\eta$ is too large then the steps will be so large that they may actually overshoot the minimum, causing the algorithm to climb up out of the valley instead. That's likely* *This picture is helpful, but it's intended as an intuition-building illustration of what may go on, not as a complete, exhaustive explanation. Briefly, a more complete explanation is as follows: gradient descent uses a first-order approximation to the cost function as a guide to how to decrease the cost. For large $\eta$, higher-order terms in the cost function become more important, and may dominate the behaviour, causing gradient descent to break down. This is especially likely as we approach minima and quasi-minima of the cost function, since near such points the gradient becomes small, making it easier for higher-order terms to dominate behaviour. what's causing the cost to oscillate when $\eta = 2.5$. When we choose $\eta = 0.25$ the initial steps do take us toward a minimum of the cost function, and it's only once we get near that minimum that we start to suffer from the overshooting problem. And when we choose $\eta = 0.025$ we don't suffer from this problem at all during the first $30$ epochs. Of course, choosing $\eta$ so small creates another problem, namely, that it slows down stochastic gradient descent. An even better approach would be to start with $\eta = 0.25$, train for $20$ epochs, and then switch to $\eta = 0.025$. We'll discuss such variable learning rate schedules later. For now, though, let's stick to figuring out how to find a single good value for the learning rate, $\eta$.
With this picture in mind, we can set $\eta$ as follows. First, we estimate the threshold value for $\eta$ at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing. This estimate doesn't need to be too accurate. You can estimate the order of magnitude by starting with $\eta = 0.01$. If the cost decreases during the first few epochs, then you should successively try $\eta = 0.1, 1.0, \ldots$ until you find a value for $\eta$ where the cost oscillates or increases during the first few epochs. Alternately, if the cost oscillates or increases during the first few epochs when $\eta = 0.01$, then try $\eta = 0.001, 0.0001, \ldots$ until you find a value for $\eta$ where the cost decreases during the first few epochs. Following this procedure will give us an order of magnitude estimate for the threshold value of $\eta$. You may optionally refine your estimate, to pick out the largest value of $\eta$ at which the cost decreases during the first few epochs, say $\eta = 0.5$ or $\eta = 0.2$ (there's no need for this to be super-accurate). This gives us an estimate for the threshold value of $\eta$.
Obviously, the actual value of $\eta$ that you use should be no larger than the threshold value. In fact, if the value of $\eta$ is to remain usable over many epochs then you likely want to use a value for $\eta$ that is smaller, say, a factor of two below the threshold. Such a choice will typically allow you to train for many epochs, without causing too much of a slowdown in learning.
In the case of the MNIST data, following this strategy leads to an estimate of $0.1$ for the order of magnitude of the threshold value of $\eta$. After some more refinement, we obtain a threshold value $\eta = 0.5$. Following the prescription above, this suggests using $\eta = 0.25$ as our value for the learning rate. In fact, I found that using $\eta = 0.5$ worked well enough over $30$ epochs that for the most part I didn't worry about using a lower value of $\eta$.
This all seems quite straightforward. However, using the training cost to pick $\eta$ appears to contradict what I said earlier in this section, namely, that we'd pick hyper-parameters by evaluating performance using our held-out validation data. In fact, we'll use validation accuracy to pick the regularization hyper-parameter, the mini-batch size, and network parameters such as the number of layers and hidden neurons, and so on. Why do things differently for the learning rate? Frankly, this choice is my personal aesthetic preference, and is perhaps somewhat idiosyncratic. The reasoning is that the other hyper-parameters are intended to improve the final classification accuracy on the test set, and so it makes sense to select them on the basis of validation accuracy. However, the learning rate is only incidentally meant to impact the final classification accuracy. Its primary purpose is really to control the step size in gradient descent, and monitoring the training cost is the best way to detect if the step size is too big. With that said, this is a personal aesthetic preference. Early on during learning the training cost usually only decreases if the validation accuracy improves, and so in practice it's unlikely to make much difference which criterion you use.
Use early stopping to determine the number of training epochs: As we discussed earlier in the chapter, early stopping means that at the end of each epoch we should compute the classification accuracy on the validation data. When that stops improving, terminate. This makes setting the number of epochs very simple. In particular, it means that we don't need to worry about explicitly figuring out how the number of epochs depends on the other hyper-parameters. Instead, that's taken care of automatically. Furthermore, early stopping also automatically prevents us from overfitting. This is, of course, a good thing, although in the early stages of experimentation it can be helpful to turn off early stopping, so you can see any signs of overfitting, and use it to inform your approach to regularization.
To implement early stopping we need to say more precisely what it means that the classification accuracy has stopped improving. As we've seen, the accuracy can jump around quite a bit, even when the overall trend is to improve. If we stop the first time the accuracy decreases then we'll almost certainly stop when there are more improvements to be had. A better rule is to terminate if the best classification accuracy doesn't improve for quite some time. Suppose, for example, that we're doing MNIST. Then we might elect to terminate if the classification accuracy hasn't improved during the last ten epochs. This ensures that we don't stop too soon, in response to bad luck in training, but also that we're not waiting around forever for an improvement that never comes.
This no-improvement-in-ten rule is good for initial exploration of MNIST. However, networks can sometimes plateau near a particular classification accuracy for quite some time, only to then begin improving again. If you're trying to get really good performance, the no-improvement-in-ten rule may be too aggressive about stopping. In that case, I suggest using the no-improvement-in-ten rule for initial experimentation, and gradually adopting more lenient rules, as you better understand the way your network trains: no-improvement-in-twenty, no-improvement-in-fifty, and so on. Of course, this introduces a new hyper-parameter to optimize! In practice, however, it's usually easy to set this hyper-parameter to get pretty good results. Similarly, for problems other than MNIST, the no-improvement-in-ten rule may be much too aggressive or not nearly aggressive enough, depending on the details of the problem. However, with a little experimentation it's usually easy to find a pretty good strategy for early stopping.
We haven't used early stopping in our MNIST experiments to date. The reason is that we've been doing a lot of comparisons between different approaches to learning. For such comparisons it's helpful to use the same number of epochs in each case. However, it's well worth modifying network2.py to implement early stopping:
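Here is a minimal sketch of what such a modification might look like, written as a self-contained loop. The helpers train_one_epoch and validation_accuracy are hypothetical stand-ins for the corresponding operations in network2.py, not functions the program actually defines:

def sgd_with_early_stopping(train_one_epoch, validation_accuracy,
                            patience=10, max_epochs=1000):
    # Train until the best validation accuracy hasn't improved for
    # `patience` consecutive epochs (no-improvement-in-ten, by default).
    best_accuracy, epochs_since_best = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                   # one full pass of mini-batch SGD
        accuracy = validation_accuracy()    # accuracy on the validation data
        if accuracy > best_accuracy:
            best_accuracy, epochs_since_best = accuracy, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            break
    return best_accuracy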
Learning rate schedule: We've been holding the learning rate $\eta$ constant. However, it's often advantageous to vary the learning rate. Early on during the learning process it's likely that the weights are badly wrong. And so it's best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
How should we set our learning rate schedule? Many approaches are possible. One natural approach is to use the same basic idea as early stopping. The idea is to hold the learning rate constant until the validation accuracy starts to get worse. Then decrease the learning rate by some amount, say a factor of two or ten. We repeat this many times, until, say, the learning rate is a factor of 1,024 (or 1,000) times lower than the initial value. Then we terminate.
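In code, such a schedule might look something like the following sketch. The helpers train_one_epoch and validation_accuracy are again hypothetical stand-ins, and the particular constants (halving the rate, a patience of ten epochs) are merely illustrative:

def sgd_with_eta_schedule(train_one_epoch, validation_accuracy,
                          eta=0.5, patience=10, stop_factor=1024):
    # Hold eta constant until validation accuracy stops improving for
    # `patience` epochs, then halve eta; terminate once eta has dropped
    # by a factor of `stop_factor` from its initial value.
    final_eta = eta / stop_factor
    best_accuracy, epochs_since_best = 0.0, 0
    while eta > final_eta:
        train_one_epoch(eta)
        accuracy = validation_accuracy()
        if accuracy > best_accuracy:
            best_accuracy, epochs_since_best = accuracy, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            eta, epochs_since_best = eta / 2.0, 0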
A variable learning schedule can improve performance, but it also opens up a world of possible choices for the learning schedule. Those choices can be a headache - you can spend forever trying to optimize your learning schedule. For first experiments my suggestion is to use a single, constant value for the learning rate. That'll get you a good first approximation. Later, if you want to obtain the best performance from your network, it's worth experimenting with a learning schedule, along the lines I've described* *A readable recent paper which demonstrates the benefits of variable learning rates in attacking MNIST is Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber (2010)..
The regularization parameter, $\lambda$: I suggest starting initially with no regularization ($\lambda = 0.0$), and determining a value for $\eta$, as above. Using that choice of $\eta$, we can then use the validation data to select a good value for $\lambda$. Start by trialling $\lambda = 1.0$* *I don't have a good principled justification for using this as a starting value. If anyone knows of a good principled discussion of where to start with $\lambda$, I'd appreciate hearing it (mn@michaelnielsen.org)., and then increase or decrease by factors of $10$, as needed to improve performance on the validation data. Once you've found a good order of magnitude, you can fine tune your value of $\lambda$. That done, you should return and re-optimize $\eta$ again.
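As a rough sketch of the coarse search, assuming a hypothetical helper validation_accuracy_for(lmbda) which trains a fresh network with the given regularization parameter and returns its validation accuracy:

def coarse_lambda_search(validation_accuracy_for):
    # Trial values of lambda spaced by factors of 10, centered on the
    # starting value 1.0; keep whichever scores best on the validation
    # data, then fine tune around the winner.
    best_accuracy, best_lmbda = 0.0, None
    for lmbda in (0.01, 0.1, 1.0, 10.0, 100.0):
        accuracy = validation_accuracy_for(lmbda)
        if accuracy > best_accuracy:
            best_accuracy, best_lmbda = accuracy, lmbda
    return best_lmbda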
How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you'll find that you get values for $\eta$ and $\lambda$ which don't always exactly match the values I've used earlier in the book. The reason is that the book has narrative constraints that have sometimes made it impractical to optimize the hyper-parameters. Think of all the comparisons we've made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on. To make such comparisons meaningful, I've usually tried to keep hyper-parameters constant across the approaches being compared (or to scale them in an appropriate way). Of course, there's no reason for the same hyper-parameters to be optimal for all the different approaches to learning, so the hyper-parameters I've used are something of a compromise.
As an alternative to this compromise, I could have tried to optimize the heck out of the hyper-parameters for every single approach to learning. In principle that'd be a better, fairer approach, since then we'd see the best from every approach to learning. However, we've made dozens of comparisons along these lines, and in practice I found it too computationally expensive. That's why I've adopted the compromise of using pretty good (but not necessarily optimal) choices for the hyper-parameters.
Mini-batch size: How should we set the mini-batch size? To answer this question, let's first suppose that we're doing online learning, i.e., that we're using a mini-batch size of $1$.
The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.
Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. In a problem in the last chapter I pointed out that it's possible to use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of your hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size $100$, rather than computing the mini-batch gradient estimate by looping over the $100$ training examples separately. It might take (say) only $50$ times as long, rather than $100$ times as long.
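The following toy comparison shows where the saving comes from. Computing the weighted inputs for $100$ examples one at a time takes $100$ matrix-vector products; stacking the examples as the columns of a single matrix reduces that to one matrix-matrix product, which optimized linear algebra libraries handle far more efficiently:

import numpy as np

W = np.random.randn(30, 784)     # weights for a 784-to-30 layer
X = np.random.randn(784, 100)    # a mini-batch: one example per column

# One example at a time: 100 separate matrix-vector products.
zs_loop = [np.dot(W, X[:, i]) for i in range(100)]

# All examples at once: a single matrix-matrix product.
Z = np.dot(W, X)

print(np.allclose(Z, np.column_stack(zs_loop)))   # True: same result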
Now, at first it seems as though this doesn't help us that much. With our mini-batch of size $100$ the learning rule for the weights looks like: \begin{eqnarray} w \rightarrow w' = w-\eta \frac{1}{100} \sum_x \nabla C_x, \tag{100}\end{eqnarray} where the sum is over training examples in the mini-batch. This is versus \begin{eqnarray} w \rightarrow w' = w-\eta \nabla C_x \tag{101}\end{eqnarray} for online learning. Even if it only takes $50$ times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor $100$, so the update rule becomes \begin{eqnarray} w \rightarrow w' = w-\eta \sum_x \nabla C_x. \tag{102}\end{eqnarray} That's a lot like doing $100$ separate instances of online learning with a learning rate of $\eta$. But it only takes $50$ times as long as doing a single instance of online learning. Of course, it's not truly the same as $100$ instances of online learning, since in the mini-batch the $\nabla C_x$'s are all evaluated for the same set of weights, as opposed to the cumulative learning that occurs in the online case. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
With these factors in mind, choosing the best mini-batch size is a compromise. Too small, and you don't get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. Too large and you're simply not updating your weights often enough. What you need is to choose a compromise value which maximizes the speed of learning. Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so you don't need to have optimized those hyper-parameters in order to find a good mini-batch size. The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then trial a number of different mini-batch sizes, scaling $\eta$ as above. Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance. With the mini-batch size chosen you can then proceed to optimize the other hyper-parameters.
Of course, as you've no doubt realized, I haven't done this optimization in our work. Indeed, our implementation doesn't use the faster approach to mini-batch updates at all. I've simply used a mini-batch size of $10$ without comment or explanation in nearly all examples. Because of this, we could have sped up learning by reducing the mini-batch size. I haven't done this, in part because I wanted to illustrate the use of mini-batches beyond size $1$, and in part because my preliminary experiments suggested the speedup would be rather modest. In practical implementations, however, we would most certainly implement the faster approach to mini-batch updates, and then make an effort to optimize the mini-batch size, in order to maximize our overall speed.
Automated techniques: I've been describing these heuristics as though you're optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* *Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012). by James Bergstra and Yoshua Bengio. Many more sophisticated approaches have also been proposed. I won't review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters* *Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams.. The code from the paper is publicly available, and has been used with some success by other researchers.
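To give the flavor of the Bergstra-Bengio suggestion, here's a toy random search over $\eta$ and $\lambda$, sampling each log-uniformly. The evaluate function is a hypothetical stand-in which trains a network with the given settings and returns its validation accuracy, and the sampling ranges are merely illustrative:

import random

def random_search(evaluate, trials=20):
    best_accuracy, best_params = 0.0, None
    for _ in range(trials):
        eta = 10 ** random.uniform(-2, 1)      # eta between 0.01 and 10
        lmbda = 10 ** random.uniform(-3, 2)    # lambda between 0.001 and 100
        accuracy = evaluate(eta, lmbda)
        if accuracy > best_accuracy:
            best_accuracy, best_params = accuracy, (eta, lmbda)
    return best_params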
Summing up: Following the rules-of-thumb I've described won't give you the absolute best possible results from your neural network. But it will likely give you a good start and a basis for further improvements. In particular, I've discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with $\eta$, feel that you've got it just right, then start to optimize for $\lambda$, only to find that it's messing up your optimization for $\eta$. In practice, it helps to bounce backward and forward, gradually closing in on good values. Above all, keep in mind that the heuristics I've described are rules of thumb, not rules cast in stone. You should be on the lookout for signs that things aren't working, and be willing to experiment. In particular, this means carefully monitoring your network's behaviour, especially the validation accuracy.
The difficulty of choosing hyper-parameters is exacerbated by the fact that the lore about how to choose hyper-parameters is widely spread, across many research papers and software programs, and often is only available inside the heads of individual practitioners. There are many, many papers setting out (sometimes contradictory) recommendations for how to proceed. However, there are a few particularly useful papers that synthesize and distill out much of this lore. Yoshua Bengio has a 2012 paper* *Practical recommendations for gradient-based training of deep architectures, by Yoshua Bengio (2012). that gives some practical recommendations for using backpropagation and gradient descent to train neural networks, including deep neural nets. Bengio discusses many issues in much more detail than I have, including how to do more systematic hyper-parameter searches. Another good paper is a 1998 paper* *Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998) by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller. Both these papers appear in an extremely useful 2012 book that collects many tricks commonly used in neural nets* *Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller.. The book is expensive, but many of the articles have been placed online by their respective authors with, one presumes, the blessing of the publisher, and may be located using a search engine.
One thing that becomes clear as you read these articles and, especially, as you engage in your own experiments, is that hyper-parameter optimization is not a problem that is ever completely solved. There's always another trick you can try to improve performance. There is a saying common among writers that books are never finished, only abandoned. The same is also true of neural network optimization: the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity. So your goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that's important.
The challenge of setting hyper-parameters has led some people to complain that neural networks require a lot of work when compared with other machine learning techniques. I've heard many variations on the following complaint: "Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or$\ldots$ insert your own favorite technique] and it just works. I don't have time to figure out just the right neural network." Of course, from a practical point of view it's good to have easy-to-apply techniques. This is particularly true when you're just getting started on a problem, and it may not be obvious whether machine learning can help solve the problem at all. On the other hand, if getting optimal performance is important, then you may need to try approaches that require more specialist knowledge. While it would be nice if machine learning were always easy, there is no a priori reason it should be trivially simple.
Each technique developed in this chapter is valuable to know in its own right, but that's not the only reason I've explained them. The larger point is to familiarize you with some of the problems which can occur in neural networks, and with a style of analysis which can help overcome those problems. In a sense, we've been learning how to think about neural nets. Over the remainder of this chapter I briefly sketch a handful of other techniques. These sketches are less in-depth than the earlier discussions, but should convey some feeling for the diversity of techniques available for use in neural networks.
Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.
Hessian technique: To begin our discussion it helps to put neural networks aside for a bit. Instead, we're just going to consider the abstract problem of minimizing a cost function $C$ which is a function of many variables, $w = w_1, w_2, \ldots$, so $C = C(w)$. By Taylor's theorem, the cost function can be approximated near a point $w$ by \begin{eqnarray} C(w+\Delta w) & = & C(w) + \sum_j \frac{\partial C}{\partial w_j} \Delta w_j \nonumber \\ & & + \frac{1}{2} \sum_{jk} \Delta w_j \frac{\partial^2 C}{\partial w_j \partial w_k} \Delta w_k + \ldots \tag{103}\end{eqnarray} We can rewrite this more compactly as \begin{eqnarray} C(w+\Delta w) = C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w + \ldots, \tag{104}\end{eqnarray} where $\nabla C$ is the usual gradient vector, and $H$ is a matrix known as the Hessian matrix, whose $jk$th entry is $\partial^2 C / \partial w_j \partial w_k$. Suppose we approximate $C$ by discarding the higher-order terms represented by $\ldots$ above, \begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w. \tag{105}\end{eqnarray} Using calculus we can show that the expression on the right-hand side can be minimized* *Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite. Intuitively, this means that the function $C$ looks like a valley locally, not a mountain or a saddle. by choosing \begin{eqnarray} \Delta w = -H^{-1} \nabla C. \tag{106}\end{eqnarray} Provided (105)\begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w \nonumber\end{eqnarray} is a good approximate expression for the cost function, then we'd expect that moving from the point $w$ to $w+\Delta w = w-H^{-1} \nabla C$ should significantly decrease the cost function. That suggests a possible algorithm for minimizing the cost:
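Namely: choose a starting point, $w$; compute the gradient $\nabla C$ and Hessian $H$ at $w$; update $w$ to a new point $w' = w - H^{-1} \nabla C$; and then repeat, recomputing the gradient and Hessian at each new point. As a concrete illustration, here's a sketch of the update on a toy quadratic cost $C(w) = \frac{1}{2} w^T A w - b^T w$, whose gradient is $Aw - b$ and whose Hessian is the constant matrix $A$. The particular $A$ and $b$ are invented for illustration; in a network the gradient and Hessian would instead come from (a variant of) backpropagation:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # positive definite, so C has a minimum
b = np.array([1.0, 1.0])

w = np.zeros(2)                       # starting point
for step in range(3):
    grad = np.dot(A, w) - b           # gradient of C at w
    w = w - np.linalg.solve(A, grad)  # w -> w - H^{-1} grad C
    print(step, w)

Because the cost here is exactly quadratic, the very first step lands on the minimum; the remaining iterations simply confirm that the gradient has vanished.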
This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization. There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function it's possible for the Hessian approach to avoid many pathologies that can occur in gradient descent. Furthermore, there are versions of the backpropagation algorithm which can be used to compute the Hessian.
If Hessian optimization is so great, why aren't we using it in our neural networks? Unfortunately, while it has many desirable properties, it has one very undesirable property: it's very difficult to apply in practice. Part of the problem is the sheer size of the Hessian matrix. Suppose you have a neural network with $10^7$ weights and biases. Then the corresponding Hessian matrix will contain $10^7 \times 10^7 = 10^{14}$ entries. That's a lot of entries! And that makes computing $H^{-1} \nabla C$ extremely difficult in practice. However, that doesn't mean that it's not useful to understand. In fact, there are many variations on gradient descent which are inspired by Hessian optimization, but which avoid the problem with overly-large matrices. Let's take a look at one such technique, momentum-based gradient descent.
Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of "velocity" for the parameters we're trying to optimize. The gradient acts to change the velocity, not (directly) the "position", in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.
Let's give a more precise mathematical description. We introduce velocity variables $v = v_1, v_2, \ldots$, one for each corresponding $w_j$ variable* *In a neural net the $w_j$ variables would, of course, include all weights and biases.. Then we replace the gradient descent update rule $w \rightarrow w'= w-\eta \nabla C$ by \begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \tag{107}\\ w & \rightarrow & w' = w+v'. \tag{108}\end{eqnarray} In these equations, $\mu$ is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it's helpful to first consider the case where $\mu = 1$, which corresponds to no friction. When that's the case, inspection of the equations shows that the "force" $\nabla C$ is now modifying the velocity, $v$, and the velocity is controlling the rate of change of $w$. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we're moving straight down a slope:
With each step the velocity gets larger down the slope, so we move more and more quickly to the bottom of the valley. This can enable the momentum technique to work much faster than standard gradient descent. Of course, a problem is that once we reach the bottom of the valley we will overshoot. Or, if the gradient should change rapidly, then we could find ourselves moving in the wrong direction. That's the reason for the $\mu$ hyper-parameter in (107)\begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \nonumber\end{eqnarray}. I said earlier that $\mu$ controls the amount of friction in the system; to be a little more precise, you should think of $1-\mu$ as the amount of friction in the system. When $\mu = 1$, as we've seen, there is no friction, and the velocity is completely driven by the gradient $\nabla C$. By contrast, when $\mu = 0$ there's a lot of friction, the velocity can't build up, and Equations (107)\begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \nonumber\end{eqnarray} and (108)\begin{eqnarray} w & \rightarrow & w' = w+v' \nonumber\end{eqnarray} reduce to the usual equation for gradient descent, $w \rightarrow w'=w-\eta \nabla C$. In practice, using a value of $\mu$ intermediate between $0$ and $1$ can give us much of the benefit of being able to build up speed, but without causing overshooting. We can choose such a value for $\mu$ using the held-out validation data, in much the same way as we select $\eta$ and $\lambda$.
I've avoided naming the hyper-parameter $\mu$ up to now. The reason is that the standard name for $\mu$ is badly chosen: it's called the momentum co-efficient. This is potentially confusing, since $\mu$ is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum co-efficient is widely used, so we will continue to use it.
A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it's done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
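To make the "almost no work" concrete, here's a sketch of the modified update loop on a toy cost. The function grad_C stands in for whatever computes the gradient (in a network, backpropagation over a mini-batch), and the toy cost itself is invented for illustration:

import numpy as np

def grad_C(w):
    # Gradient of the toy cost C(w) = 1.5 w_0^2 + 0.25 w_1^2.
    return np.array([3.0 * w[0], 0.5 * w[1]])

eta, mu = 0.1, 0.9                 # learning rate and momentum co-efficient
w = np.array([1.0, 1.0])           # starting position
v = np.zeros_like(w)               # one velocity variable per parameter
for step in range(100):
    v = mu * v - eta * grad_C(w)   # the gradient modifies the velocity ...
    w = w + v                      # ... and the velocity moves the position
print(w)                           # close to the minimum at (0, 0)

Setting mu = 0 recovers ordinary gradient descent: only the two update lines differ from the standard rule.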
Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn't universal agreement on which is the best approach. As you go deeper into neural networks it's worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier* *Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998). introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results* *See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012). is Nesterov's accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we'll stick to stochastic gradient descent through the remainder of this book.
Up to now we've built our neural networks using sigmoid neurons. In principle, a network built from sigmoid neurons can compute any function. In practice, however, networks built using other model neurons sometimes outperform sigmoid networks. Depending on the application, networks based on such alternate models may learn faster, generalize better to test data, or perhaps do both. Let me mention a couple of alternate model neurons, to give you the flavor of some variations in common use.
Perhaps the simplest variation is the tanh (pronounced "tanch") neuron, which replaces the sigmoid function by the hyperbolic tangent function. The output of a tanh neuron with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \tanh(w \cdot x+b), \tag{109}\end{eqnarray} where $\tanh$ is, of course, the hyperbolic tangent function. It turns out that this is very closely related to the sigmoid neuron. To see this, recall that the $\tanh$ function is defined by \begin{eqnarray} \tanh(z) \equiv \frac{e^z-e^{-z}}{e^z+e^{-z}}. \tag{110}\end{eqnarray} With a little algebra it can easily be verified that \begin{eqnarray} \sigma(z) = \frac{1+\tanh(z/2)}{2}, \tag{111}\end{eqnarray} that is, $\tanh$ is just a rescaled version of the sigmoid function. We can also see graphically that the $\tanh$ function has the same shape as the sigmoid function,
One difference between tanh neurons and sigmoid neurons is that the output from tanh neurons ranges from -1 to 1, not 0 to 1. This means that if you're going to build a network based on tanh neurons you may need to normalize your outputs (and, depending on the details of the application, possibly your inputs) a little differently than in sigmoid networks.
Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function* *There are some technical caveats to this statement for both tanh and sigmoid neurons, as well as for the rectified linear neurons discussed below. However, informally it's usually fine to think of neural networks as being able to approximate any function to arbitrary accuracy. mapping inputs to the range -1 to 1. Furthermore, ideas such as backpropagation and stochastic gradient descent are as easily applied to a network of tanh neurons as to a network of sigmoid neurons.
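Incidentally, the rescaling relation (111) is easy to confirm numerically; here's a quick check:

import numpy as np

z = np.linspace(-5.0, 5.0, 101)
sigma = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(sigma, (1.0 + np.tanh(z / 2.0)) / 2.0))   # True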
Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better* *See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010).. Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we're using sigmoid neurons, so all activations in our network are positive. Let's consider the weights $w^{l+1}_{jk}$ input to the $j$th neuron in the $l+1$th layer. The rules for backpropagation (see here) tell us that the associated gradient will be $a^l_k \delta^{l+1}_j$. Because the activations are positive the sign of this gradient will be the same as the sign of $\delta^{l+1}_j$. What this means is that if $\delta^{l+1}_j$ is positive then all the weights $w^{l+1}_{jk}$ will decrease during gradient descent, while if $\delta^{l+1}_j$ is negative then all the weights $w^{l+1}_{jk}$ will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That's a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as $\tanh$, which allows both positive and negative activations. Indeed, because $\tanh$ is symmetric about zero, $\tanh(-z) = -\tanh(z)$, we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.
How seriously should we take this argument? While the argument is suggestive, it's a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don't yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.
Another variation on the sigmoid neuron is the rectified linear neuron or rectified linear unit. The output of a rectified linear unit with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \max(0, w \cdot x+b). \tag{112}\end{eqnarray} Graphically, the rectifying function $\max(0, z)$ looks like this:
Obviously such neurons are quite different from both sigmoid and tanh neurons. However, like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.
When should you use rectified linear units instead of sigmoid or tanh neurons? Some recent work on image recognition* *See, for example, What is the Best Multi-Stage Architecture for Object Recognition?, by Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun (2009), Deep Sparse Rectifier Neural Networks, by Xavier Glorot, Antoine Bordes, and Yoshua Bengio (2011), and ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012). Note that these papers fill in important details about how to set up the output layer, cost function, and regularization in networks using rectified linear units. I've glossed over all these details in this brief account. The papers also discuss in more detail the benefits and drawbacks of using rectified linear units. Another informative paper is Rectified Linear Units Improve Restricted Boltzmann Machines, by Vinod Nair and Geoffrey Hinton (2010), which demonstrates the benefits of using rectified linear units in a somewhat different approach to neural networks. has found considerable benefit in using rectified linear units through much of the network. However, as with tanh neurons, we do not yet have a really deep understanding of when, exactly, rectified linear units are preferable, nor why. To give you the flavor of some of the issues, recall that sigmoid neurons stop learning when they saturate, i.e., when their output is near either $0$ or $1$. As we've seen repeatedly in this chapter, the problem is that $\sigma'$ terms reduce the gradient, and that slows down learning. Tanh neurons suffer from a similar problem when they saturate. By contrast, increasing the weighted input to a rectified linear unit will never cause it to saturate, and so there is no corresponding learning slowdown. On the other hand, when the weighted input to a rectified linear unit is negative, the gradient vanishes, and so the neuron stops learning entirely. These are just two of the many issues that make it non-trivial to understand when and why rectified linear units perform better than sigmoid or tanh neurons.
I've painted a picture of uncertainty here, stressing that we do not yet have a solid theory of how activation functions should be chosen. Indeed, the problem is harder even than I have described, for there are infinitely many possible activation functions. Which is the best for any given problem? Which will result in a network which learns fastest? Which will give the highest test accuracies? I am surprised how little really deep and systematic investigation has been done of these questions. Ideally, we'd have a theory which tells us, in detail, how to choose (and perhaps modify-on-the-fly) our activation functions. On the other hand, we shouldn't let the lack of a full theory stop us! We have powerful tools already at hand, and can make a lot of progress with those tools. Through the remainder of this book I'll continue to use sigmoid neurons as our go-to neuron, since they're powerful and provide concrete illustrations of the core ideas about neural nets. But keep in the back of your mind that these same ideas can be applied to other types of neuron, and that there are sometimes advantages in doing so.
Question: How do you approach utilizing and researching machine learning techniques that are supported almost entirely empirically, as opposed to mathematically? Also in what situations have you noticed some of these techniques fail?

Answer: You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.
- Question and answer with neural networks researcher Yann LeCun
Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with "I'm very sympathetic to your point of view, but [...]". Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn't always have unarguable new progress to report.
You may have noticed a verbal tic similar to "I'm very sympathetic [...]" in the current book. To explain what we're seeing I've often fallen back on saying "Heuristically, [...]", or "Roughly speaking, [...]", following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I've presented has often been pretty thin. If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?
In many parts of science - especially those parts that deal with simple phenomena - it's possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it's exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.
One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works* *From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we've discussed. Each heuristic is not just a (potential) explanation, it's also a challenge to investigate and understand in more detail.
Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
In the last chapter we learned that deep neural networks are often much harder to train than shallow neural networks. That's unfortunate, since we have good reason to believe that if we could train deep nets they'd be much more powerful than shallow nets. But while the news from the last chapter is discouraging, we won't let it stop us. In this chapter, we'll develop techniques which can be used to train deep networks, and apply them in practice. We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications. And we'll take a brief, speculative look at what the future may hold for neural nets, and for artificial intelligence.
The chapter is a long one. To help you navigate, let's take a tour. The sections are only loosely coupled, so provided you have some basic familiarity with neural nets, you can jump to whatever most interests you.
The main part of the chapter is an introduction to one of the most widely used types of deep network: deep convolutional networks. We'll work through a detailed example - code and all - of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set:
We'll start our account of convolutional networks with the shallow networks used to attack this problem earlier in the book. Through many iterations we'll build up more and more powerful networks. As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others. The result will be a system that offers near-human performance. Of the 10,000 MNIST test images - images not seen during training! - our system will classify 9,967 correctly. Here's a peek at the 33 images which are misclassified. Note that the correct classification is in the top right; our program's classification is in the bottom right:
Many of these are tough even for a human to classify. Consider, for example, the third image in the top row. To me it looks more like a "9" than an "8", which is the official classification. Our network also thinks it's a "9". This kind of "error" is at the very least understandable, and perhaps even commendable. We conclude our discussion of image recognition with a survey of some of the spectacular recent progress using networks (particularly convolutional nets) to do image recognition.
The remainder of the chapter discusses deep learning from a broader and less detailed perspective. We'll briefly survey other models of neural networks, such as recurrent neural nets and long short-term memory units, and how such models can be applied to problems in speech recognition, natural language processing, and other areas. And we'll speculate about the future of neural networks and deep learning, ranging from ideas like intention-driven user interfaces, to the role of deep learning in artificial intelligence.
The chapter builds on the earlier chapters in the book, making use of and integrating ideas such as backpropagation, regularization, the softmax function, and so on. However, to read the chapter you don't need to have worked in detail through all the earlier chapters. It will, however, help to have read Chapter 1, on the basics of neural networks. When I use concepts from Chapters 2 to 5, I provide links so you can familiarize yourself, if necessary.
It's worth noting what the chapter is not. It's not a tutorial on the latest and greatest neural networks libraries. Nor are we going to be training deep networks with dozens of layers to solve problems at the very leading edge. Rather, the focus is on understanding some of the core principles behind deep neural networks, and applying them in the simple, easy-to-understand context of the MNIST problem. Put another way: the chapter is not going to bring you right up to the frontier. Rather, the intent of this and earlier chapters is to focus on fundamentals, and so to prepare you to understand a wide range of current work.
In earlier chapters, we taught our neural networks to do a pretty good job recognizing images of handwritten digits:
We did this using networks in which adjacent network layers are fully connected to one another. That is, every neuron in the network is connected to every neuron in adjacent layers:
In particular, for each pixel in the input image, we encoded the pixel's intensity as the value for a corresponding neuron in the input layer. For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$) input neurons. We then trained the network's weights and biases so that the network's output would - we hope! - correctly identify the input image: '0', '1', '2', ..., '8', or '9'.
Our earlier networks work pretty well: we've obtained a classification accuracy better than 98 percent, using training and test data from the MNIST handwritten digit data set. But upon reflection, it's strange to use networks with fully-connected layers to classify images. The reason is that such a network architecture does not take into account the spatial structure of the images. For instance, it treats input pixels which are far apart and close together on exactly the same footing. Such concepts of spatial structure must instead be inferred from the training data. But what if, instead of starting with a network architecture which is tabula rasa, we used an architecture which tries to take advantage of the spatial structure? In this section I describe convolutional neural networks* *The origins of convolutional neural networks go back to the 1970s. But the seminal paper establishing the modern subject of convolutional networks was a 1998 paper, "Gradient-based learning applied to document recognition", by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. LeCun has since made an interesting remark on the terminology for convolutional nets: "The [biological] neural inspiration in models like convolutional nets is very tenuous. That's why I call them 'convolutional nets' not 'convolutional neural nets', and why we call the nodes 'units' and not 'neurons' ". Despite this remark, convolutional nets use many of the same ideas as the neural networks we've studied up to now: ideas such as backpropagation, gradient descent, regularization, non-linear activation functions, and so on. And so we will follow common practice, and consider them a type of neural network. I will use the terms "convolutional neural network" and "convolutional net(work)" interchangeably. I will also use the terms "[artificial] neuron" and "unit" interchangeably.. These networks use a special architecture which is particularly well-adapted to classify images. Using this architecture makes convolutional networks fast to train. This, in turn, helps us train deep, many-layer networks, which are very good at classifying images. Today, deep convolutional networks or some close variant are used in most neural networks for image recognition.
Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling. Let's look at each of these ideas in turn.
Local receptive fields: In the fully-connected layers shown earlier, the inputs were depicted as a vertical line of neurons. In a convolutional net, it'll help to think instead of the inputs as a $28 \times 28$ square of neurons, whose values correspond to the $28 \times 28$ pixel intensities we're using as inputs:
As per usual, we'll connect the input pixels to a layer of hidden neurons. But we won't connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image.
To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a $5 \times 5$ region, corresponding to $25$ input pixels. So, for a particular hidden neuron, we might have connections that look like this:
That region in the input image is called the local receptive field for the hidden neuron. It's a little window on the input pixels. Each connection learns a weight. And the hidden neuron learns an overall bias as well. You can think of that particular hidden neuron as learning to analyze its particular local receptive field.
We then slide the local receptive field across the entire input image. For each local receptive field, there is a different hidden neuron in the first hidden layer. To illustrate this concretely, let's start with a local receptive field in the top-left corner:
Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:
And so on, building up the first hidden layer. Note that if we have a $28 \times 28$ input image, and $5 \times 5$ local receptive fields, then there will be $24 \times 24$ neurons in the hidden layer. This is because we can only move the local receptive field $23$ neurons across (or $23$ neurons down), before colliding with the right-hand side (or bottom) of the input image.
I've shown the local receptive field being moved by one pixel at a time. In fact, sometimes a different stride length is used. For instance, we might move the local receptive field $2$ pixels to the right (or down), in which case we'd say a stride length of $2$ is used. In this chapter we'll mostly stick with stride length $1$, but it's worth knowing that people sometimes experiment with different stride lengths* *As was done in earlier chapters, if we're interested in trying different stride lengths then we can use validation data to pick out the stride length which gives the best performance. For more details, see the earlier discussion of how to choose hyper-parameters in a neural network. The same approach may also be used to choose the size of the local receptive field - there is, of course, nothing special about using a $5 \times 5$ local receptive field. In general, larger local receptive fields tend to be helpful when the input images are significantly larger than the $28 \times 28$ pixel MNIST images..
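The arithmetic behind the size of the hidden layer is worth making explicit. With no padding, an input of side $n$, a local receptive field of side $f$, and stride length $s$ give a hidden layer of side $(n - f)/s + 1$. A one-line check:

def hidden_side(n=28, f=5, s=1):
    # Number of hidden neurons along one dimension: the field can slide
    # (n - f) // s steps beyond its starting position.
    return (n - f) // s + 1

print(hidden_side())           # 24, as in the text
print(hidden_side(s=2))        # 12, with a stride length of 2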
Shared weights and biases: I've said that each hidden neuron has a bias and $5 \times 5$ weights connected to its local receptive field. What I did not yet mention is that we're going to use the same weights and bias for each of the $24 \times 24$ hidden neurons. In other words, for the $j, k$th hidden neuron, the output is: \begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right). \tag{125}\end{eqnarray} Here, $\sigma$ is the neural activation function - perhaps the sigmoid function we used in earlier chapters. $b$ is the shared value for the bias. $w_{l,m}$ is a $5 \times 5$ array of shared weights. And, finally, we use $a_{x, y}$ to denote the input activation at position $x, y$.
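Equation (125) is simple enough to transcribe directly into numpy. Here's a deliberately naive sketch computing a single feature map from a $28 \times 28$ input; note that the same $5 \times 5$ weight array w and bias b are reused at every position:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    # a: 28x28 input activations; w: 5x5 shared weights; b: shared bias.
    # Returns the 24x24 activations of one feature map, per Equation (125).
    out = np.zeros((24, 24))
    for j in range(24):
        for k in range(24):
            out[j, k] = sigmoid(b + np.sum(w * a[j:j+5, k:k+5]))
    return out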
This means that all the neurons in the first hidden layer detect exactly the same feature* *I haven't precisely defined the notion of a feature. Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape. , just at different locations in the input image. To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field. That ability is also likely to be useful at other places in the image. And so it is useful to apply the same feature detector everywhere in the image. To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat* *In fact, for the MNIST digit classification problem we've been studying, the images are centered and size-normalized. So MNIST has less translation invariance than images found "in the wild", so to speak. Still, features like edges and corners are likely to be useful across much of the input space. .
For this reason, we sometimes call the map from the input layer to the hidden layer a feature map. We call the weights defining the feature map the shared weights. And we call the bias defining the feature map in this way the shared bias. The shared weights and bias are often said to define a kernel or filter. In the literature, people sometimes use these terms in slightly different ways, and for that reason I'm not going to be more precise; rather, in a moment, we'll look at some concrete examples.
The network structure I've described so far can detect just a single kind of localized feature. To do image recognition we'll need more than one feature map. And so a complete convolutional layer consists of several different feature maps:
I've shown just $3$ feature maps, to keep the diagram above simple. However, in practice convolutional networks may use more (and perhaps many more) feature maps. One of the early convolutional networks, LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$ local receptive field, to recognize MNIST digits. So the example illustrated above is actually pretty close to LeNet-5. In the examples we develop later in the chapter we'll use convolutional layers with $20$ and $40$ feature maps. Let's take a quick peek at some of the features which are learned* *The feature maps illustrated come from the final convolutional network we train, see here.:
The $20$ images correspond to $20$ different feature maps (or filters, or kernels). Each map is represented as a $5 \times 5$ block image, corresponding to the $5 \times 5$ weights in the local receptive field. Whiter blocks mean a smaller (typically, more negative) weight, so the feature map responds less to corresponding input pixels. Darker blocks mean a larger weight, so the feature map responds more to the corresponding input pixels. Very roughly speaking, the images above show the type of features the convolutional layer responds to.
So what can we conclude from these feature maps? It's clear there is spatial structure here beyond what we'd expect at random: many of the features have clear sub-regions of light and dark. That shows our network really is learning things related to the spatial structure. However, beyond that, it's difficult to see what these feature detectors are learning. Certainly, we're not learning (say) the Gabor filters which have been used in many traditional approaches to image recognition. In fact, there's now a lot of work on better understanding the features learnt by convolutional networks. If you're interested in following up on that work, I suggest starting with the paper Visualizing and Understanding Convolutional Networks by Matthew Zeiler and Rob Fergus (2013).
A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network. For each feature map we need $25 = 5 \times 5$ shared weights, plus a single shared bias. So each feature map requires $26$ parameters. If we have $20$ feature maps that's a total of $20 \times 26 = 520$ parameters defining the convolutional layer. By comparison, suppose we had a fully connected first layer, with $784 = 28 \times 28$ input neurons, and a relatively modest $30$ hidden neurons, as we used in many of the examples earlier in the book. That's a total of $784 \times 30$ weights, plus an extra $30$ biases, for a total of $23,550$ parameters. In other words, the fully-connected layer would have more than $40$ times as many parameters as the convolutional layer.
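As a quick sanity check on that arithmetic:

conv_params = 20 * (5 * 5 + 1)    # 20 feature maps, 25 weights + 1 bias each
fc_params = 784 * 30 + 30         # fully-connected layer: weights + biases
print(conv_params)                # 520
print(fc_params)                  # 23550
print(fc_params // conv_params)   # 45: more than 40 times as many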
Of course, we can't really do a direct comparison between the number of parameters, since the two models are different in essential ways. But, intuitively, it seems likely that the use of translation invariance by the convolutional layer will reduce the number of parameters it needs to get the same performance as the fully-connected model. That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.
Incidentally, the name convolutional comes from the fact that the operation in Equation (125)\begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray} is sometimes known as a convolution. A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation. We're not going to make any deep use of the mathematics of convolutions, so you don't need to worry too much about this connection. But it's worth at least knowing where the name comes from.
Pooling layers: In addition to the convolutional layers just described, convolutional neural networks also contain pooling layers. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.
In detail, a pooling layer takes each feature map* *The nomenclature is being used loosely here. In particular, I'm using "feature map" to mean not the function computed by the convolutional layer, but rather the activation of the hidden neurons output from the layer. This kind of mild abuse of nomenclature is pretty common in the research literature. output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of (say) $2 \times 2$ neurons in the previous layer. As a concrete example, one common procedure for pooling is known as max-pooling. In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region, as illustrated in the following diagram:
Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.
As mentioned above, the convolutional layer usually involves more than a single feature map. We apply max-pooling to each feature map separately. So if there were three feature maps, the combined convolutional and max-pooling layers would look like:
We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been found, its exact location isn't as important as its rough location relative to other features. A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.
Max-pooling isn't the only technique used for pooling. Another common approach is known as L2 pooling. Here, instead of taking the maximum activation of a $2 \times 2$ region of neurons, we take the square root of the sum of the squares of the activations in the $2 \times 2$ region. While the details are different, the intuition is similar to max-pooling: L2 pooling is a way of condensing information from the convolutional layer. In practice, both techniques have been widely used. And sometimes people use other types of pooling operation. If you're really trying to optimize performance, you may use validation data to compare several different approaches to pooling, and choose the approach which works best. But we're not going to worry about that kind of detailed optimization.
Putting it all together: We can now put all these ideas together to form a complete convolutional neural network. It's similar to the architecture we were just looking at, but has the addition of a layer of $10$ output neurons, corresponding to the $10$ possible values for MNIST digits ('0', '1', '2', etc):
The network begins with $28 \times 28$ input neurons, which are used to encode the pixel intensities for the MNIST image. This is then followed by a convolutional layer using a $5 \times 5$ local receptive field and $3$ feature maps. The result is a layer of $3 \times 24 \times 24$ hidden feature neurons. The next step is a max-pooling layer, applied to $2 \times 2$ regions, across each of the $3$ feature maps. The result is a layer of $3 \times 12 \times 12$ hidden feature neurons.
The final layer of connections in the network is a fully-connected layer. That is, this layer connects every neuron from the max-pooled layer to every one of the $10$ output neurons. This fully-connected architecture is the same as we used in earlier chapters. Note, however, that in the diagram above, I've used a single arrow, for simplicity, rather than showing all the connections. Of course, you can easily imagine the connections.
This convolutional architecture is quite different to the architectures used in earlier chapters. But the overall picture is similar: a network made of many simple units, whose behaviors are determined by their weights and biases. And the overall goal is still the same: to use training data to train the network's weights and biases so that the network does a good job classifying input digits.
In particular, just as earlier in the book, we will train our network using stochastic gradient descent and backpropagation. This mostly proceeds in exactly the same way as in earlier chapters. However, we do need to make a few modifications to the backpropagation procedure. The reason is that our earlier derivation of backpropagation was for networks with fully-connected layers. Fortunately, it's straightforward to modify the derivation for convolutional and max-pooling layers. If you'd like to understand the details, then I invite you to work through the following problem. Be warned that the problem will take some time to work through, unless you've really internalized the earlier derivation of backpropagation (in which case it's easy).
We've now seen the core ideas behind convolutional neural networks. Let's look at how they work in practice, by implementing some convolutional networks, and applying them to the MNIST digit classification problem. The program we'll use to do this is called network3.py, and it's an improved version of the programs network.py and network2.py developed in earlier chapters* *Note also that network3.py incorporates ideas from the Theano library's documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil's implementation of dropout, and from Chris Olah.. If you wish to follow along, the code is available on GitHub. Note that we'll work through the code for network3.py itself in the next section. In this section, we'll use network3.py as a library to build convolutional networks.
The programs network.py and network2.py were implemented using Python and the matrix library Numpy. Those programs worked from first principles, and got right down into the details of backpropagation, stochastic gradient descent, and so on. But now that we understand those details, for network3.py we're going to use a machine learning library known as Theano* *See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Ravzan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010). Theano is also the basis for the popular Pylearn2 and Keras neural networks libraries. Other popular neural nets libraries at the time of this writing include Caffe and Torch. . Using Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than our earlier code (which was written to be easy to understand, not fast), and this makes it practical to train more complex networks. In particular, one great feature of Theano is that it can run code on either a CPU or, if available, a GPU. Running on a GPU provides a substantial speedup and, again, helps make it practical to train more complex networks.
If you wish to follow along, then you'll need to get Theano running on your system. To install Theano, follow the instructions at the project's homepage. The examples which follow were run using Theano 0.6* *As I release this chapter, the current version of Theano has changed to version 0.7. I've actually rerun the examples under Theano 0.7 and get extremely similar results to those reported in the text.. Some were run under Mac OS X Yosemite, with no GPU. Some were run on Ubuntu 14.04, with an NVIDIA GPU. And some of the experiments were run under both. To get network3.py running you'll need to set the GPU flag to either True or False (as appropriate) in the network3.py source. Beyond that, to get Theano up and running on a GPU you may find the instructions here helpful. There are also tutorials on the web, easily found using Google, which can help you get things working. If you don't have a GPU available locally, then you may wish to look into Amazon Web Services EC2 G2 spot instances. Note that even with a GPU the code will take some time to execute. Many of the experiments take from minutes to hours to run. On a CPU it may take days to run the most complex of the experiments. As in earlier chapters, I suggest setting things running, and continuing to read, occasionally coming back to check the output from the code. If you're using a CPU, you may wish to reduce the number of training epochs for the more complex experiments, or perhaps omit them entirely.
To get a baseline, we'll start with a shallow architecture using just
a single hidden layer, containing $100$ hidden neurons. We'll train
for $60$ epochs, using a learning rate of $\eta = 0.1$, a mini-batch
size of $10$, and no regularization. Here we go*
*Code for the
experiments in this section may be found
in
this script. Note that the code in the script simply duplicates
and parallels the discussion in this section.
Note also that
throughout the section I've explicitly specified the number of
training epochs. I've done this for clarity about how we're
training. In practice, it's worth using
early stopping, that is,
tracking accuracy on the validation set, and stopping training when
we are confident the validation accuracy has stopped improving.:
>>> import network3
In the last chapter we learned that deep neural
networks are often much harder to train than shallow neural networks.
That's unfortunate, since we have good reason to believe that
if we could train deep nets they'd be much more powerful than
shallow nets. But while the news from the last chapter is
discouraging, we won't let it stop us. In this chapter, we'll develop
techniques which can be used to train deep networks, and apply them in
practice. We'll also look at the broader picture, briefly reviewing
recent progress on using deep nets for image recognition, speech
recognition, and other applications. And we'll take a brief,
speculative look at what the future may hold for neural nets, and for
artificial intelligence.
The chapter is a long one. To help you navigate, let's take a tour.
The sections are only loosely coupled, so provided you have some basic
familiarity with neural nets, you can jump to whatever most interests
you.
The main part of the chapter is an
introduction to one of the most widely used types of deep network:
deep convolutional networks. We'll work through a detailed example
- code and all - of using convolutional nets to solve the problem
of classifying handwritten digits from the MNIST data set:
We'll start our account of convolutional networks with the shallow
networks used to attack this problem earlier in the book. Through
many iterations we'll build up more and more powerful networks. As we
go we'll explore many powerful techniques: convolutions, pooling, the
use of GPUs to do far more training than we did with our shallow
networks, the algorithmic expansion of our training data (to reduce
overfitting), the use of the dropout technique (also to reduce
overfitting), the use of ensembles of networks, and others. The
result will be a system that offers near-human performance. Of the
10,000 MNIST test images - images not seen during training! - our
system will classify 9,967 correctly. Here's a peek at the 33 images
which are misclassified. Note that the correct classification is in
the top right; our program's classification is in the bottom right:
Many of these are tough even for a human to classify. Consider, for
example, the third image in the top row. To me it looks more like a
"9" than an "8", which is the official classification. Our
network also thinks it's a "9". This kind of "error" is at the
very least understandable, and perhaps even commendable. We conclude
our discussion of image recognition with a
survey of some of the
spectacular recent progress using networks (particularly
convolutional nets) to do image recognition.
The remainder of the chapter discusses deep learning from a broader
and less detailed perspective. We'll
briefly
survey other models of neural networks, such as recurrent neural
nets and long short-term memory units, and how such models can be
applied to problems in speech recognition, natural language
processing, and other areas. And we'll
speculate about the
future of neural networks and deep learning, ranging from ideas
like intention-driven user interfaces, to the role of deep learning in
artificial intelligence.
The chapter builds on the earlier chapters in the book, making use of
and integrating ideas such as backpropagation, regularization, the
softmax function, and so on. However, to read the chapter you don't
need to have worked in detail through all the earlier chapters. It
will, however, help to have read Chapter 1, on the
basics of neural networks. When I use concepts from Chapters 2 to 5,
I provide links so you can familiarize yourself, if necessary.
It's worth noting what the chapter is not. It's not a tutorial on the
latest and greatest neural networks libraries. Nor are we going to be
training deep networks with dozens of layers to solve problems at the
very leading edge. Rather, the focus is on understanding some of the
core principles behind deep neural networks, and applying them in the
simple, easy-to-understand context of the MNIST problem. Put another
way: the chapter is not going to bring you right up to the frontier.
Rather, the intent of this and earlier chapters is to focus on
fundamentals, and so to prepare you to understand a wide range of
current work.
Introducing convolutional networks
In earlier chapters, we taught our neural networks to do a pretty good
job recognizing images of handwritten digits:
We did this using networks in which adjacent network layers are fully
connected to one another. That is, every neuron in the network is
connected to every neuron in adjacent layers:
In particular, for each pixel in the input image, we encoded the
pixel's intensity as the value for a corresponding neuron in the input
layer. For the $28 \times 28$ pixel images we've been using, this
means our network has $784$ ($= 28 \times 28$) input neurons. We then
trained the network's weights and biases so that the network's output
would - we hope! - correctly identify the input image: '0', '1',
'2', ..., '8', or '9'.
Our earlier networks work pretty well: we've
obtained a classification accuracy better
than 98 percent, using training and test data from the
MNIST handwritten
digit data set. But upon reflection, it's strange to use networks
with fully-connected layers to classify images. The reason is that
such a network architecture does not take into account the spatial
structure of the images. For instance, it treats input pixels which
are far apart and close together on exactly the same footing. Such
concepts of spatial structure must instead be inferred from the
training data. But what if, instead of starting with a network
architecture which is tabula rasa, we used an architecture
which tries to take advantage of the spatial structure? In this
section I describe convolutional neural networks*
*The
origins of convolutional neural networks go back to the 1970s. But
the seminal paper establishing the modern subject of convolutional
networks was a 1998 paper,
"Gradient-based
learning applied to document recognition", by Yann LeCun,
Léon Bottou, Yoshua Bengio, and Patrick Haffner.
LeCun has since made an interesting
remark
on the terminology for convolutional nets: "The [biological] neural
inspiration in models like convolutional nets is very
tenuous. That's why I call them 'convolutional nets' not
'convolutional neural nets', and why we call the nodes 'units' and
not 'neurons' ". Despite this remark, convolutional nets use many
of the same ideas as the neural networks we've studied up to now:
ideas such as backpropagation, gradient descent, regularization,
non-linear activation functions, and so on. And so we will follow
common practice, and consider them a type of neural network. I will
use the terms "convolutional neural network" and "convolutional
net(work)" interchangeably. I will also use the terms
"[artificial] neuron" and "unit" interchangeably.. These
networks use a special architecture which is particularly well-adapted
to classify images. Using this architecture makes convolutional
networks fast to train. This, in turn, helps us train deep,
many-layer networks, which are very good at classifying images.
Today, deep convolutional networks or some close variant are used in
most neural networks for image recognition.
Convolutional neural networks use three basic ideas: local
receptive fields, shared weights, and pooling. Let's
look at each of these ideas in turn.
Local receptive fields: In the fully-connected layers shown
earlier, the inputs were depicted as a vertical line of neurons. In a
convolutional net, it'll help to think instead of the inputs as a $28
\times 28$ square of neurons, whose values correspond to the $28
\times 28$ pixel intensities we're using as inputs:
As per usual, we'll connect the input pixels to a layer of hidden
neurons. But we won't connect every input pixel to every hidden
neuron. Instead, we only make connections in small, localized regions
of the input image.
To be more precise, each neuron in the first hidden layer will be
connected to a small region of the input neurons, say, for example, a
$5 \times 5$ region, corresponding to $25$ input pixels. So, for a
particular hidden neuron, we might have connections that look like
this:
That region in the input image is called the local receptive
field for the hidden neuron. It's a little window on the input
pixels. Each connection learns a weight. And the hidden neuron
learns an overall bias as well. You can think of that particular
hidden neuron as learning to analyze its particular local receptive
field.
We then slide the local receptive field across the entire input image.
For each local receptive field, there is a different hidden neuron in
the first hidden layer. To illustrate this concretely, let's start
with a local receptive field in the top-left corner:
Then we slide the local receptive field over by one pixel to the right
(i.e., by one neuron), to connect to a second hidden neuron:
And so on, building up the first hidden layer. Note that if we have a
$28 \times 28$ input image, and $5 \times 5$ local receptive fields,
then there will be $24 \times 24$ neurons in the hidden layer. This
is because we can only move the local receptive field $23$ neurons
across (or $23$ neurons down), before colliding with the right-hand
side (or bottom) of the input image.
I've shown the local receptive field being moved by one pixel at a
time. In fact, sometimes a different stride length is used.
For instance, we might move the local receptive field $2$ pixels to
the right (or down), in which case we'd say a stride length of $2$ is
used. In this chapter we'll mostly stick with stride length $1$, but
it's worth knowing that people sometimes experiment with different
stride lengths*
*As was done in earlier chapters, if we're
interested in trying different stride lengths then we can use
validation data to pick out the stride length which gives the best
performance. For more details, see the
earlier
discussion of how to choose hyper-parameters in a neural network.
The same approach may also be used to choose the size of the local
receptive field - there is, of course, nothing special about using
a $5 \times 5$ local receptive field. In general, larger local
receptive fields tend to be helpful when the input images are
significantly larger than the $28 \times 28$ pixel MNIST images..
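By the way, the arithmetic determining the hidden layer's size is easy to check in code. Here's a tiny helper - purely illustrative, and not part of network3.py:
def hidden_size(input_size=28, field=5, stride=1):
    """Number of neurons along one dimension of the hidden layer, for a
    square input, a square local receptive field, and the given stride."""
    return (input_size - field) // stride + 1

print(hidden_size())           # 24, giving the 24 x 24 layer described above
print(hidden_size(stride=2))   # 12, if we instead use a stride length of 2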
Shared weights and biases: I've said that each hidden neuron
has a bias and $5 \times 5$ weights connected to its local receptive
field. What I did not yet mention is that we're going to use the
same weights and bias for each of the $24 \times 24$ hidden
neurons. In other words, for the $j, k$th hidden neuron, the output
is:
\begin{eqnarray}
\sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right).
\tag{125}\end{eqnarray}
Here, $\sigma$ is the neural activation function - perhaps the
sigmoid function we used in
earlier chapters. $b$ is the shared value for the bias. $w_{l,m}$ is
a $5 \times 5$ array of shared weights. And, finally, we use $a_{x,
y}$ to denote the input activation at position $x, y$.
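To make Equation (125) concrete, here's a minimal NumPy sketch which computes a single feature map for a $28 \times 28$ input. I stress that this is illustrative code written for this discussion, not code from network3.py:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    """Slide the shared 5 x 5 weights w (and the shared bias b) over the
    28 x 28 input a, applying Equation (125) at each position."""
    out = np.zeros((24, 24))
    for j in range(24):
        for k in range(24):
            out[j, k] = sigmoid(b + np.sum(w * a[j:j+5, k:k+5]))
    return out

a = np.random.rand(28, 28)   # stand-in for the input pixel intensities
w = np.random.randn(5, 5)    # the shared weights
b = np.random.randn()        # the shared bias
print(feature_map(a, w, b).shape)   # (24, 24)
Note that w * a[j:j+5, k:k+5] is an elementwise product, so the np.sum over it is exactly the double sum in Equation (125).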
This means that all the neurons in the first hidden layer detect
exactly the same feature*
*I haven't precisely defined the
notion of a feature. Informally, think of the feature detected by a
hidden neuron as the kind of input pattern that will cause the
neuron to activate: it might be an edge in the image, for instance,
or maybe some other type of shape. , just at different locations in
the input image. To see why this makes sense, suppose the weights and
bias are such that the hidden neuron can pick out, say, a vertical
edge in a particular local receptive field. That ability is also
likely to be useful at other places in the image. And so it is useful
to apply the same feature detector everywhere in the image. To put it
in slightly more abstract terms, convolutional networks are well
adapted to the translation invariance of images: move a picture of a
cat (say) a little ways, and it's still an image of a cat*
*In
fact, for the MNIST digit classification problem we've been
studying, the images are centered and size-normalized. So MNIST has
less translation invariance than images found "in the wild", so to
speak. Still, features like edges and corners are likely to be
useful across much of the input space. .
For this reason, we sometimes call the map from the input layer to the
hidden layer a feature map. We call the weights defining the
feature map the shared weights. And we call the bias defining
the feature map in this way the shared bias. The shared
weights and bias are often said to define a kernel or
filter. In the literature, people sometimes use these terms in
slightly different ways, and for that reason I'm not going to be more
precise; rather, in a moment, we'll look at some concrete examples.
The network structure I've described so far can detect just a single
kind of localized feature. To do image recognition we'll need more
than one feature map. And so a complete convolutional layer consists
of several different feature maps:
In the example shown, there are $3$ feature maps. Each feature map is
defined by a set of $5 \times 5$ shared weights, and a single shared
bias. The result is that the network can detect $3$ different kinds
of features, with each feature being detectable across the entire
image.
I've shown just $3$ feature maps, to keep the diagram above simple.
However, in practice convolutional networks may use more (and perhaps
many more) feature maps. One of the early convolutional networks,
LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$
local receptive field, to recognize MNIST digits. So the example
illustrated above is actually pretty close to LeNet-5. In the
examples we develop later in the chapter we'll use convolutional
layers with $20$ and $40$ feature maps. Let's take a quick peek at
some of the features which are learned*
*The feature maps
illustrated come from the final convolutional network we train, see
here.:
The $20$ images correspond to $20$ different feature maps (or filters,
or kernels). Each map is represented as a $5 \times 5$ block image,
corresponding to the $5 \times 5$ weights in the local receptive
field. Whiter blocks mean a smaller (typically, more negative)
weight, so the feature map responds less to corresponding input
pixels. Darker blocks mean a larger weight, so the feature map
responds more to the corresponding input pixels. Very roughly
speaking, the images above show the type of features the convolutional
layer responds to.
So what can we conclude from these feature maps? It's clear there is
spatial structure here beyond what we'd expect at random: many of the
features have clear sub-regions of light and dark. That shows our
network really is learning things related to the spatial structure.
However, beyond that, it's difficult to see what these feature
detectors are learning. Certainly, we're not learning (say) the
Gabor filters which
have been used in many traditional approaches to image recognition.
In fact, there's now a lot of work on better understanding the
features learnt by convolutional networks. If you're interested in
following up on that work, I suggest starting with the paper
Visualizing and Understanding
Convolutional Networks by Matthew Zeiler and Rob Fergus (2013).
A big advantage of sharing weights and biases is that it greatly
reduces the number of parameters involved in a convolutional network.
For each feature map we need $25 = 5 \times 5$ shared weights, plus a
single shared bias. So each feature map requires $26$ parameters. If
we have $20$ feature maps that's a total of $20 \times 26 = 520$
parameters defining the convolutional layer. By comparison, suppose
we had a fully connected first layer, with $784 = 28 \times 28$ input
neurons, and a relatively modest $30$ hidden neurons, as we used in
many of the examples earlier in the book. That's a total of $784
\times 30$ weights, plus an extra $30$ biases, for a total of $23,550$
parameters. In other words, the fully-connected layer would have more
than $40$ times as many parameters as the convolutional layer.
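The bookkeeping is easy to verify - this is just the arithmetic from the last paragraph, spelled out:
conv_params = 20 * (5 * 5 + 1)   # 20 feature maps, each with 25 shared weights and 1 shared bias
fc_params = 784 * 30 + 30        # fully-connected: 784 x 30 weights, plus 30 biases
print(conv_params, fc_params)    # 520 23550
print(fc_params // conv_params)  # 45: more than 40 times as many parameters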
Of course, we can't really do a direct comparison between the number
of parameters, since the two models are different in essential ways.
But, intuitively, it seems likely that the use of translation
invariance by the convolutional layer will reduce the number of
parameters it needs to get the same performance as the fully-connected
model. That, in turn, will result in faster training for the
convolutional model, and, ultimately, will help us build deep networks
using convolutional layers.
Incidentally, the name convolutional comes from the fact that
the operation in Equation (125) is sometimes known as a
convolution. A little more precisely, people sometimes write
that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the
set of output activations from one feature map, $a^0$ is the set of
input activations, and $*$ is called a convolution operation. We're
not going to make any deep use of the mathematics of convolutions, so
you don't need to worry too much about this connection. But it's
worth at least knowing where the name comes from.
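If you're curious, the connection is easy to see in code. Strictly speaking, the double sum in Equation (125) is what mathematicians call a cross-correlation - a genuine convolution flips the kernel before sliding it - which is why, in the illustrative sketch below, it's scipy's correlate2d rather than convolve2d that reproduces the loop-based feature_map shown earlier:
import numpy as np
from scipy.signal import correlate2d

a0 = np.random.rand(28, 28)                 # input activations
w = np.random.randn(5, 5)                   # shared weights
b = np.random.randn()                       # shared bias
z = correlate2d(a0, w, mode='valid') + b    # the sum in Equation (125), for every j, k at once
a1 = 1.0 / (1.0 + np.exp(-z))               # sigma applied elementwise; a1 has shape (24, 24)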
Pooling layers: In addition to the convolutional layers just
described, convolutional neural networks also contain pooling
layers. Pooling layers are usually used immediately after
convolutional layers. What the pooling layers do is simplify the
information in the output from the convolutional layer.
In detail, a pooling layer takes each feature map*
*The
nomenclature is being used loosely here. In particular, I'm using
"feature map" to mean not the function computed by the
convolutional layer, but rather the activation of the hidden neurons
output from the layer. This kind of mild abuse of nomenclature is
pretty common in the research literature. output from the
convolutional layer and prepares a condensed feature map. For
instance, each unit in the pooling layer may summarize a region of
(say) $2 \times 2$ neurons in the previous layer. As a concrete
example, one common procedure for pooling is known as
max-pooling. In max-pooling, a pooling unit simply outputs the
maximum activation in the $2 \times 2$ input region, as illustrated in
the following diagram:
Note that since we have $24 \times 24$ neurons output from the
convolutional layer, after pooling we have $12 \times 12$ neurons.
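Max-pooling is simple enough to express in a few lines of NumPy. As with the earlier sketches, this is illustrative code, not part of network3.py:
import numpy as np

def max_pool_2x2(fm):
    """Condense a feature map by taking the maximum activation
    over each non-overlapping 2 x 2 block."""
    rows, cols = fm.shape
    blocks = fm.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.random.rand(24, 24)     # stand-in for a feature map's output
print(max_pool_2x2(fm).shape)   # (12, 12)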
As mentioned above, the convolutional layer usually involves more than
a single feature map. We apply max-pooling to each feature map
separately. So if there were three feature maps, the combined
convolutional and max-pooling layers would look like:
We can think of max-pooling as a way for the network to ask whether a
given feature is found anywhere in a region of the image. It then
throws away the exact positional information. The intuition is that
once a feature has been found, its exact location isn't as important
as its rough location relative to other features. A big benefit is
that there are many fewer pooled features, and so this helps reduce
the number of parameters needed in later layers.
Max-pooling isn't the only technique used for pooling. Another common
approach is known as L2 pooling. Here, instead of taking the
maximum activation of a $2 \times 2$ region of neurons, we take the
square root of the sum of the squares of the activations in the $2
\times 2$ region. While the details are different, the intuition is
similar to max-pooling: L2 pooling is a way of condensing information
from the convolutional layer. In practice, both techniques have been
widely used. And sometimes people use other types of pooling
operation. If you're really trying to optimize performance, you may
use validation data to compare several different approaches to
pooling, and choose the approach which works best. But we're not
going to worry about that kind of detailed optimization.
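In code, L2 pooling differs from the max-pooling sketch above only in its final line:
import numpy as np

def l2_pool_2x2(fm):
    """Take the square root of the sum of squares over each 2 x 2 block."""
    rows, cols = fm.shape
    blocks = fm.reshape(rows // 2, 2, cols // 2, 2)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))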
Putting it all together: We can now put all these ideas
together to form a complete convolutional neural network. It's
similar to the architecture we were just looking at, but has the
addition of a layer of $10$ output neurons, corresponding to the $10$
possible values for MNIST digits ('0', '1', '2', etc):
The network begins with $28 \times 28$ input neurons, which are used
to encode the pixel intensities for the MNIST image. This is then
followed by a convolutional layer using a $5 \times 5$ local receptive
field and $3$ feature maps. The result is a layer of $3 \times 24
\times 24$ hidden feature neurons. The next step is a max-pooling
layer, applied to $2 \times 2$ regions, across each of the $3$ feature
maps. The result is a layer of $3 \times 12 \times 12$ hidden feature
neurons.
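If you like, you can trace these shapes with a few lines of arithmetic (again, purely illustrative):
n_maps, field, pool = 3, 5, 2
conv_side = 28 - field + 1              # 24
pool_side = conv_side // pool           # 12
print((n_maps, conv_side, conv_side))   # (3, 24, 24) hidden feature neurons
print((n_maps, pool_side, pool_side))   # (3, 12, 12) after max-pooling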
The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the
max-pooled layer to every one of the $10$ output neurons. This
fully-connected architecture is the same as we used in earlier
chapters. Note, however, that in the diagram above, I've used a
single arrow, for simplicity, rather than showing all the connections.
Of course, you can easily imagine the connections.
This convolutional architecture is quite different to the
architectures used in earlier chapters. But the overall picture is
similar: a network made of many simple units, whose behaviors are
determined by their weights and biases. And the overall goal is still
the same: to use training data to train the network's weights and
biases so that the network does a good job classifying input digits.
In particular, just as earlier in the book, we will train our network
using stochastic gradient descent and backpropagation. This mostly
proceeds in exactly the same way as in earlier chapters. However, we
do need to make a few modifications to the backpropagation procedure.
The reason is that our earlier derivation of
backpropagation was for networks with fully-connected layers.
Fortunately, it's straightforward to modify the derivation for
convolutional and max-pooling layers. If you'd like to understand the
details, then I invite you to work through the following problem. Be
warned that the problem will take some time to work through, unless
you've really internalized the earlier derivation of
backpropagation (in which case it's easy).
Problem
We've now seen the core ideas behind convolutional neural networks. Let's look at how they work in practice, by implementing some convolutional networks, and applying them to the MNIST digit classification problem. The program we'll use to do this is called network3.py, and it's an improved version of the programs network.py and network2.py developed in earlier chapters* *Note also that network3.py incorporates ideas from the Theano library's documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil's implementation of dropout, and from Chris Olah.. If you wish to follow along, the code is available on GitHub. Note that we'll work through the code for network3.py itself in the next section. In this section, we'll use network3.py as a library to build convolutional networks.
The programs network.py and network2.py were implemented using Python and the matrix library NumPy. Those programs worked from first principles, and got right down into the details of backpropagation, stochastic gradient descent, and so on. But now that we understand those details, for network3.py we're going to use a machine learning library known as Theano* *See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010). Theano is also the basis for the popular Pylearn2 and Keras neural network libraries. Other popular neural net libraries at the time of this writing include Caffe and Torch.. Using Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than our earlier code (which was written to be easy to understand, not fast), and this makes it practical to train more complex networks. In particular, one great feature of Theano is that it can run code on either a CPU or, if available, a GPU. Running on a GPU provides a substantial speedup and, again, helps make it practical to train more complex networks.
If you wish to follow along, then you'll need to get Theano running on your system. To install Theano, follow the instructions at the project's homepage. The examples which follow were run using Theano 0.6* *As I release this chapter, the current version of Theano has changed to version 0.7. I've actually rerun the examples under Theano 0.7 and get extremely similar results to those reported in the text.. Some were run under Mac OS X Yosemite, with no GPU. Some were run on Ubuntu 14.04, with an NVIDIA GPU. And some of the experiments were run under both. To get network3.py running you'll need to set the GPU flag to either True or False (as appropriate) in the network3.py source. Beyond that, to get Theano up and running on a GPU you may find the instructions here helpful. There are also tutorials on the web, easily found using Google, which can help you get things working. If you don't have a GPU available locally, then you may wish to look into Amazon Web Services EC2 G2 spot instances. Note that even with a GPU the code will take some time to execute. Many of the experiments take from minutes to hours to run. On a CPU it may take days to run the most complex of the experiments. As in earlier chapters, I suggest setting things running, and continuing to read, occasionally coming back to check the output from the code. If you're using a CPU, you may wish to reduce the number of training epochs for the more complex experiments, or perhaps omit them entirely.
To get a baseline, we'll start with a shallow architecture using just
a single hidden layer, containing $100$ hidden neurons. We'll train
for $60$ epochs, using a learning rate of $\eta = 0.1$, a mini-batch
size of $10$, and no regularization. Here we go*
*Code for the
experiments in this section may be found
in
this script. Note that the code in the script simply duplicates
and parallels the discussion in this section.
Note also that
throughout the section I've explicitly specified the number of
training epochs. I've done this for clarity about how we're
training. In practice, it's worth using
early stopping, that is,
tracking accuracy on the validation set, and stopping training when
we are confident the validation accuracy has stopped improving.:
>>> import network3
>>> from network3 import Network
>>> from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([FullyConnectedLayer(n_in=784, n_out=100),
                   SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)