The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:
Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.
The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.
Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,
and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.
In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.
We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.
Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.
So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms: \begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{threshold} \end{array} \right. \tag{1}\end{eqnarray}
That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:
Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of $5$ for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting $1$ whenever the weather is good, and $0$ whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.
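To make this concrete, here's a minimal sketch in Python of that decision rule, with the weights and threshold chosen above (the function name and the $0/1$ encoding of the three factors are illustrative choices, not anything from the text or a library):

def perceptron(inputs, weights, threshold):
    # Output 1 if the weighted sum of the evidence exceeds the threshold, else 0.
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# x1 = good weather, x2 = partner wants to come, x3 = near public transit
weights = [6, 2, 2]
threshold = 5
print(perceptron([1, 0, 0], weights, threshold))  # good weather alone: outputs 1
print(perceptron([0, 1, 1], weights, threshold))  # bad weather: outputs 0 regardless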
By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of $3$. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:
Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.
Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\mbox{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten: \begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right. \tag{2}\end{eqnarray} You can think of the bias as a measure of how easy it is to get the perceptron to output a $1$. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a $1$. But if the bias is very negative, then it's difficult for the perceptron to output a $1$. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.
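In code, the bias form is a natural fit for a dot product. Here's a small sketch of the rule in Equation (2), reusing the festival example with $b = -5$ (numpy is used purely for illustration):

import numpy as np

def perceptron_output(w, x, b):
    # Equation (2): output 1 if w . x + b > 0, else 0.
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([6, 2, 2])  # weights from the festival example
b = -5                   # bias = -threshold
print(perceptron_output(w, np.array([1, 0, 0]), b))  # -> 1
print(perceptron_output(w, np.array([0, 1, 1]), b))  # -> 0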
I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:

Then we see that the input $00$ produces the output $1$, since $(-2) \cdot 0 + (-2) \cdot 0 + 3 = 3$ is positive. Similar calculations show that the inputs $01$ and $10$ also produce the output $1$. But the input $11$ produces the output $0$, since $(-2) \cdot 1 + (-2) \cdot 1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!

The NAND example shows that we can use perceptrons to compute simple logical functions.
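As a quick sanity check, here's a sketch that runs that perceptron (weights $-2$, $-2$, bias $3$) over all four input pairs:

def nand_perceptron(x1, x2):
    # Perceptron with weights -2, -2 and bias 3.
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print("%d NAND %d = %d" % (x1, x2, nand_perceptron(x1, x2)))
# Prints 1, 1, 1, 0 - the NAND truth table.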
In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:
The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.
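To see that universality in action, here's a sketch of the two-bit adder built from nothing but the NAND perceptron above. The wiring is one standard NAND half-adder layout, offered as an illustration rather than a transcription of the exact diagram in the text:

def nand(x1, x2):
    # The NAND perceptron: weights -2, -2, bias 3.
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

def half_adder(x1, x2):
    # Computes the bitwise sum x1 XOR x2 and the carry bit x1 AND x2.
    a = nand(x1, x2)
    total = nand(nand(x1, a), nand(x2, a))  # x1 XOR x2
    carry = nand(a, a)                      # x1 AND x2
    return total, carry

for x1 in (0, 1):
    for x2 in (0, 1):
        total, carry = half_adder(x1, x2)
        print("%d + %d -> sum %d, carry %d" % (x1, x2, total, carry))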
The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!
However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.
Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):
If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.
The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.
We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.
Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function, and is defined by: \begin{eqnarray} \sigma(z) \equiv \frac{1}{1+e^{-z}}. \tag{3}\end{eqnarray}
At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.
To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x+b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x+b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x +b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x+b$ is of modest size that there's much deviation from the perceptron model.
What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:
This shape is a smoothed out version of a step function:
If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w\cdot x+b$ was positive or negative* *Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by \begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}\end{eqnarray} where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
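Here's a small numerical sketch of both claims: that $\sigma$ behaves like a perceptron for large $|z|$, and that the linear approximation in Equation (5) tracks the true change in output for small changes in the weights and bias (the particular numbers are arbitrary illustrations):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(20.0))   # ~1.0: perceptron-like for large positive z
print(sigmoid(-20.0))  # ~0.0: perceptron-like for large negative z

# Compare the true change in output with the linear approximation (5).
w, x, b = np.array([0.4, -0.6]), np.array([0.5, 0.9]), 0.1
dw, db = np.array([0.001, -0.002]), 0.0005
out = sigmoid(np.dot(w, x) + b)
true_change = sigmoid(np.dot(w + dw, x) + b + db) - out
sprime = out * (1.0 - out)  # sigma'(z) = sigma(z) * (1 - sigma(z))
approx_change = np.dot(sprime * x, dw) + sprime * db
print("%.9f %.9f" % (true_change, approx_change))  # nearly identical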
If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly-used in work on neural nets, and is the activation function we'll use most often in this book.
How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or $1$. They can have as output any real number between $0$ and $1$, so values such as $0.173\ldots$ and $0.689\ldots$ are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a $0$ or a $1$, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a "9", and any output less than $0.5$ as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.
In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:
The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values of less than $0.5$ indicating "input image is not a 9", and values greater than $0.5$ indicating "input image is a 9".
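A sketch of that input encoding, with a hypothetical stand-in for the image data:

import numpy as np

image = np.random.rand(64, 64)           # stand-in for a 64 x 64 greyscale image, values in [0, 1]
input_activations = image.reshape(4096)  # 4,096 = 64 x 64 input neuron values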
While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.
Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.
However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.
Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.
Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image
into six separate images,
We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,
is a 5.
We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.
To recognize individual digits we will use a three-layer neural network:
The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.
The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.
The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.
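In code, that final step is just an argmax over the ten output activations. A sketch with made-up activation values:

import numpy as np

output_activations = np.array([0.02, 0.01, 0.05, 0.01, 0.03, 0.02, 0.91, 0.04, 0.02, 0.01])
print(np.argmax(output_activations))  # -> 6: the network's guess for the digit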
You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering why using $10$ output neurons works better. Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?
To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:
It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:
As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown earlier:
So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.
Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$. If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.
Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.
Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here are a few images from MNIST:
As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!
The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.
We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector. For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.
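Here's a sketch of that encoding as a small helper function (the name vectorized_result is just an illustrative choice):

import numpy as np

def vectorized_result(j):
    # Return a 10-dimensional column vector with 1.0 in the j-th position.
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

print(vectorized_result(6).T)  # desired output y(x) for an image of a 6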
What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function* *Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks. : \begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \tag{6}\end{eqnarray} Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
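In code the quadratic cost is a short loop. Here's a sketch, where outputs and targets are hypothetical lists holding the network's output $a$ and the desired output $y(x)$ for each training input:

import numpy as np

def quadratic_cost(outputs, targets):
    # C = (1/2n) * sum over x of ||y(x) - a||^2
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2 for a, y in zip(outputs, targets)) / (2.0 * n)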
Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.
Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.
Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.
Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:
What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.
One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!
(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)
Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.
Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?
To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}\end{eqnarray} We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. We denote the gradient vector by $\nabla C$, i.e.: \begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}\end{eqnarray} In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.
With these definitions, the expression (7) for $\Delta C$ can be rewritten as \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{9}\end{eqnarray} This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray} where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount: \begin{eqnarray} v \rightarrow v' = v -\eta \nabla C. \tag{11}\end{eqnarray} Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.
Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:
Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!
To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1,\ldots,v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray} where the gradient $\nabla C$ is the vector \begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T. \tag{13}\end{eqnarray} Just as for the two variable case, we can choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray} and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule \begin{eqnarray} v \rightarrow v' = v-\eta \nabla C. \tag{15}\end{eqnarray} You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
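Here's a tiny sketch of the update rule (15) in action, minimizing the toy function $C(v) = v_1^2 + v_2^2$, whose gradient $\nabla C = (2 v_1, 2 v_2)^T$ we can write down by hand:

import numpy as np

def grad_C(v):
    return 2 * v  # gradient of C(v) = v1^2 + v2^2

v = np.array([2.0, -3.0])    # starting position
eta = 0.1                    # learning rate
for step in range(100):
    v = v - eta * grad_C(v)  # update rule (15): v -> v - eta * grad C
print(v)  # very close to the minimum at (0, 0)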
Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives* *Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$. Still, you get the point.! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.
How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{17}\end{eqnarray} By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.
There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.
An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.
To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is, \begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray} where the second sum is over the entire set of training data. Swapping sides we get \begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray} confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.
To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those, \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray} where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
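Here's a sketch of one epoch of stochastic gradient descent at the level of mini-batch bookkeeping. The update_mini_batch argument stands in for the update rules (20) and (21) and is hypothetical:

import random

def sgd_epoch(training_data, mini_batch_size, eta, update_mini_batch):
    # One epoch: shuffle the data, split into mini-batches, train on each.
    random.shuffle(training_data)
    mini_batches = [training_data[k:k + mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for mini_batch in mini_batches:
        update_mini_batch(mini_batch, eta)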
Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.
We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.
We're focusing on handwriting recognition because it's an excellent
prototype problem for learning about neural networks in general. As a
prototype it hits a sweet spot: it's challenging - it's no small
feat to recognize handwritten digits - but it's not so difficult as
to require an extremely complicated solution, or tremendous
computational power. Furthermore, it's a great way to develop more
advanced techniques, such as deep learning. And so throughout the
book we'll return repeatedly to the problem of handwriting
recognition. Later in the book, we'll discuss how these ideas may be
applied to other problems in computer vision, and also in speech,
natural language processing, and other domains.
Of course, if the point of the chapter was only to write a computer
program to recognize handwritten digits, then the chapter would be
much shorter! But along the way we'll develop many key ideas about
neural networks, including two important types of artificial neuron
(the perceptron and the sigmoid neuron), and the standard learning
algorithm for neural networks, known as stochastic gradient descent.
Throughout, I focus on explaining why things are done the way
they are, and on building your neural networks intuition. That
requires a lengthier discussion than if I just presented the basic
mechanics of what's going on, but it's worth it for the deeper
understanding you'll attain. Amongst the payoffs, by the end of the
chapter we'll be in position to understand what deep learning is, and
why it matters.
Perceptrons
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.
So how do perceptrons work? A perceptron takes several binary inputs,
$x_1, x_2, \ldots$, and produces a single binary output:
In the example shown the perceptron has three inputs, $x_1, x_2, x_3$.
In general it could have more or fewer inputs. Rosenblatt proposed a
simple rule to compute the output. He introduced
weights, $w_1,w_2,\ldots$, real numbers
expressing the importance of the respective inputs to the output. The
neuron's output, $0$ or $1$, is determined by whether the weighted sum
$\sum_j w_j x_j$ is less than or greater than some threshold
value. Just like the weights, the
threshold is a real number which is a parameter of the neuron. To put
it in more precise algebraic terms:
\begin{eqnarray}
\mbox{output} & = & \left\{ \begin{array}{ll}
0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\
1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold}
\end{array} \right.
\tag{1}\end{eqnarray}
That's all there is to how a perceptron works!
That's the basic mathematical model. A way you can think about the
perceptron is that it's a device that makes decisions by weighing up
evidence. Let me give an example. It's not a very realistic example,
but it's easy to understand, and we'll soon get to more realistic
examples. Suppose the weekend is coming up, and you've heard that
there's going to be a cheese festival in your city. You like cheese,
and are trying to decide whether or not to go to the festival. You
might make your decision by weighing up three factors:
- Is the weather good?
- Does your boyfriend or girlfriend want to accompany you?
- Is the festival near public transit? (You don't own a car).
We can represent these three factors by corresponding binary variables
$x_1, x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the
weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2
= 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if
not. And similarly again for $x_3$ and public transit.
Now, suppose you absolutely adore cheese, so much so that you're happy
to go to the festival even if your boyfriend or girlfriend is
uninterested and the festival is hard to get to. But perhaps you
really loathe bad weather, and there's no way you'd go to the festival
if the weather is bad. You can use perceptrons to model this kind of
decision-making. One way to do this is to choose a weight $w_1 = 6$
for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions.
The larger value of $w_1$ indicates that the weather matters a lot to
you, much more than whether your boyfriend or girlfriend joins you, or
the nearness of public transit. Finally, suppose you choose a
threshold of $5$ for the perceptron. With these choices, the
perceptron implements the desired decision-making model, outputting
$1$ whenever the weather is good, and $0$ whenever the weather is bad.
It makes no difference to the output whether your boyfriend or
girlfriend wants to go, or whether public transit is nearby.
By varying the weights and the threshold, we can get different models
of decision-making. For example, suppose we instead chose a threshold
of $3$. Then the perceptron would decide that you should go to the
festival whenever the weather was good or when both the
festival was near public transit and your boyfriend or
girlfriend was willing to join you. In other words, it'd be a
different model of decision-making. Dropping the threshold means
you're more willing to go to the festival.
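To make this concrete, here's a minimal Python sketch of the decision rule (the helper name perceptron_output is my own, not code from the book's repository):

def perceptron_output(weights, threshold, inputs):
    # Return 1 if the weighted sum of the inputs exceeds the threshold.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

weights = [6, 2, 2]  # weather matters much more than the other factors
# Bad weather, but your partner is keen and transit is nearby:
print(perceptron_output(weights, 5, [0, 1, 1]))  # 0 - stay home
print(perceptron_output(weights, 3, [0, 1, 1]))  # 1 - the lower threshold says go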
Obviously, the perceptron isn't a complete model of human
decision-making! But what the example illustrates is how a perceptron
can weigh up different kinds of evidence in order to make decisions.
And it should seem plausible that a complex network of perceptrons
could make quite subtle decisions:
In this network, the first column of perceptrons - what we'll call
the first layer of perceptrons - is making three very simple
decisions, by weighing the input evidence. What about the perceptrons
in the second layer? Each of those perceptrons is making a decision
by weighing up the results from the first layer of decision-making.
In this way a perceptron in the second layer can make a decision at a
more complex and more abstract level than perceptrons in the first
layer. And even more complex decisions can be made by the perceptron
in the third layer. In this way, a many-layer network of perceptrons
can engage in sophisticated decision making.
Incidentally, when I defined perceptrons I said that a perceptron has
just a single output. In the network above the perceptrons look like
they have multiple outputs. In fact, they're still single output.
The multiple output arrows are merely a useful way of indicating that
the output from a perceptron is being used as the input to several
other perceptrons. It's less unwieldy than drawing a single output
line which then splits.
Let's simplify the way we describe perceptrons. The condition $\sum_j
w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two
notational changes to simplify it.
The first change is to write
$\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$,
where $w$ and $x$ are vectors whose components are the weights and
inputs, respectively. The second change is to move the threshold to
the other side of the inequality, and to replace it by what's known as
the perceptron's bias, $b \equiv
-\mbox{threshold}$. Using the bias instead of the threshold, the
perceptron rule can be
rewritten:
\begin{eqnarray}
\mbox{output} = \left\{
\begin{array}{ll}
0 & \mbox{if } w\cdot x + b \leq 0 \\
1 & \mbox{if } w\cdot x + b > 0
\end{array}
\right.
\tag{2}\end{eqnarray}
You can think of the bias as a measure of how easy it is to get the
perceptron to output a $1$. Or to put it in more biological terms,
the bias is a measure of how easy it is to get the perceptron to
fire. For a perceptron with a really big bias, it's extremely
easy for the perceptron to output a $1$. But if the bias is very
negative, then it's difficult for the perceptron to output a $1$.
Obviously, introducing the bias is only a small change in how we
describe perceptrons, but we'll see later that it leads to further
notational simplifications. Because of this, in the remainder of the
book we won't use the threshold, we'll always use the bias.
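In code, the bias form of the rule is just a dot product and a comparison. Here's a one-line sketch (assuming numpy, which the book's later code also uses):

import numpy as np

def perceptron(w, b, x):
    # Output 1 if w . x + b > 0, and 0 otherwise, as in Equation (2).
    return 1 if np.dot(w, x) + b > 0 else 0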
I've described perceptrons as a method for weighing evidence to make
decisions. Another way perceptrons can be used is to compute the
elementary logical functions we usually think of as underlying
computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:
Then we see that input $00$ produces output $1$, since
$(-2)*0+(-2)*0+3 = 3$ is positive. Here, I've introduced the $*$
symbol to make the multiplications explicit. Similar calculations
show that the inputs $01$ and $10$ produce output $1$. But the input
$11$ produces output $0$, since $(-2)*1+(-2)*1+3 = -1$ is negative.
And so our perceptron implements a NAND gate!
The NAND example shows that we can use perceptrons to compute simple logical functions.
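Here's a quick check of that truth table in code (a sketch of my own, not from the book's repository):

def nand_perceptron(x1, x2):
    # Weights -2, -2 and bias 3, exactly as in the example above.
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print("%d %d -> %d" % (x1, x2, nand_perceptron(x1, x2)))
# Inputs 00, 01 and 10 produce 1; input 11 produces 0.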
In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:
To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:
One notable aspect of this network of perceptrons is that the output
from the leftmost perceptron is used twice as input to the bottommost
perceptron. When I defined the perceptron model I didn't say whether
this kind of double-output-to-the-same-place was allowed. Actually,
it doesn't much matter. If we don't want to allow this kind of thing,
then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to $3$, and a single weight of $-4$, as marked:
Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables
floating to the left of the network of perceptrons. In fact, it's
conventional to draw an extra layer of perceptrons - the input
layer - to encode the inputs:
This notation for input perceptrons, in which we have an output, but
no inputs,
is a shorthand. It doesn't actually mean a perceptron with no inputs.
To see this, suppose we did have a perceptron with no inputs. Then
the weighted sum $\sum_j w_j x_j$ would always be zero, and so the
perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$. That
is, the perceptron would simply output a fixed value, not the desired
value ($x_1$, in the example above). It's better to think of the
input perceptrons as not really being perceptrons at all, but rather
special units which are simply defined to output the desired values,
$x_1, x_2,\ldots$.
The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.
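To see this universality in action, here's a sketch (my own illustration, not the book's code) that wires copies of the NAND perceptron from above into the two-bit adder circuit:

def nand(x1, x2):
    # The NAND perceptron: weights -2, -2, bias 3.
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

def add_bits(x1, x2):
    # Compute the bitwise sum x1 XOR x2 and the carry bit x1 AND x2,
    # using nothing but NAND perceptrons.
    a = nand(x1, x2)
    bitwise_sum = nand(nand(x1, a), nand(x2, a))
    carry = nand(a, a)
    return bitwise_sum, carry

print(add_bits(1, 1))  # (0, 1): one plus one is binary 10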
The computational universality of perceptrons is simultaneously
reassuring and disappointing. It's reassuring because it tells us
that networks of perceptrons can be as powerful as any other computing
device. But it's also disappointing, because it makes it seem as
though perceptrons are merely a new type of NAND gate. That's hardly big news!
However, the situation is better than this view suggests. It turns
out that we can devise learning
algorithms which can
automatically tune the weights and biases of a network of artificial
neurons. This tuning happens in response to external stimuli, without
direct intervention by a programmer. These learning algorithms enable
us to use artificial neurons in a way which is radically different to
conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.
Sigmoid neurons
Learning algorithms sound terrific. But how can we devise such
algorithms for a neural network? Suppose we have a network of
perceptrons that we'd like to use to learn to solve some problem. For
example, the inputs to the network might be the raw pixel data from a
scanned, handwritten image of a digit. And we'd like the network to
learn weights and biases so that the output from the network correctly
classifies the digit. To see how learning might work, suppose we make
a small change in some weight (or bias) in the network. What we'd
like is for this small change in weight to cause only a small
corresponding change in the output from the network. As we'll see in
a moment, this property will make learning possible. Schematically,
here's what we want (obviously this network is too simple to do
handwriting recognition!):
If it were true that a small change in a weight (or bias) causes only
a small change in output, then we could use this fact to modify the
weights and biases to get our network to behave more in the manner we
want. For example, suppose the network was mistakenly classifying an
image as an "8" when it should be a "9". We could figure out how
to make a small change in the weights and biases so the network gets a
little closer to classifying the image as a "9". And then we'd
repeat this, changing the weights and biases over and over to produce
better and better output. The network would be learning.
The problem is that this isn't what happens when our network contains
perceptrons. In fact, a small change in the weights or bias of any
single perceptron in the network can sometimes cause the output of
that perceptron to completely flip, say from $0$ to $1$. That flip
may then cause the behaviour of the rest of the network to completely
change in some very complicated way. So while your "9" might now be
classified correctly, the behaviour of the network on all the other
images is likely to have completely changed in some hard-to-control
way. That makes it difficult to see how to gradually modify the
weights and biases so that the network gets closer to the desired
behaviour. Perhaps there's some clever way of getting around this
problem. But it's not immediately obvious how we can get a network of
perceptrons to learn.
We can overcome this problem by introducing a new type of artificial
neuron called a sigmoid neuron.
Sigmoid neurons are similar to perceptrons, but modified so that small
changes in their weights and bias cause only a small change in their
output. That's the crucial fact which will allow a network of sigmoid
neurons to learn.
Okay, let me describe the sigmoid neuron. We'll depict sigmoid
neurons in the same way we depicted perceptrons:
Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2,
\ldots$. But instead of being just $0$ or $1$, these inputs can also
take on any values between $0$ and $1$. So, for instance,
$0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a
perceptron, the sigmoid neuron has weights for each input, $w_1, w_2,
\ldots$, and an overall bias, $b$. But the output is not $0$ or $1$.
Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the
sigmoid function* *Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology., and is defined by:
\begin{eqnarray}
\sigma(z) \equiv \frac{1}{1+e^{-z}}.
\tag{3}\end{eqnarray}
To put it all a little more explicitly, the output of a sigmoid neuron
with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is
\begin{eqnarray}
\frac{1}{1+\exp(-\sum_j w_j x_j-b)}.
\tag{4}\end{eqnarray}
At first sight, sigmoid neurons appear very different to perceptrons.
The algebraic form of the sigmoid function may seem opaque and
forbidding if you're not already familiar with it. In fact, there are
many similarities between perceptrons and sigmoid neurons, and the
algebraic form of the sigmoid function turns out to be more of a
technical detail than a true barrier to understanding.
To understand the similarity to the perceptron model, suppose $z
\equiv w \cdot x + b$ is a large positive number. Then $e^{-z}
\approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w
\cdot x+b$ is large and positive, the output from the sigmoid neuron
is approximately $1$, just as it would have been for a perceptron.
Suppose on the other hand that $z = w \cdot x+b$ is very negative.
Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when
$z = w \cdot x +b$ is very negative, the behaviour of a sigmoid neuron
also closely approximates a perceptron. It's only when $w \cdot x+b$
is of modest size that there's much deviation from the perceptron
model.
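A small numerical check of this limiting behaviour (a sketch; numpy is assumed):

import numpy as np

def sigmoid(z):
    # The sigmoid function of Equation (3).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10.0))   # ~0.99995: output near 1, like a firing perceptron
print(sigmoid(-10.0))  # ~0.00005: output near 0
print(sigmoid(0.0))    # 0.5: in the modest-z regime the two models differ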
What about the algebraic form of $\sigma$? How can we understand
that? In fact, the exact form of $\sigma$ isn't so important - what
really matters is the shape of the function when plotted. Here's the
shape:
This shape is a smoothed-out version of a step function:
If $\sigma$ had in fact been a step function, then the sigmoid neuron
would be a perceptron, since the output would be $1$ or $0$
depending on whether $w\cdot x+b$ was positive or
negative* *Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.. By using the actual $\sigma$ function we get, as already implied above, a smoothed-out perceptron. Indeed,
it's the smoothness of the $\sigma$ function that is the crucial fact,
not its detailed form. The smoothness of $\sigma$ means that small
changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will
produce a small change $\Delta \mbox{output}$ in the output from the
neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is
well approximated by
\begin{eqnarray}
\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
\Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b,
\tag{5}\end{eqnarray}
where the sum is over all the weights, $w_j$, and $\partial \,
\mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial
b$ denote partial derivatives of the $\mbox{output}$ with respect to
$w_j$ and $b$, respectively. Don't panic if you're not comfortable
with partial derivatives! While the expression above looks
complicated, with all the partial derivatives, it's actually saying
something very simple (and which is very good news): $\Delta
\mbox{output}$ is a linear function of the changes $\Delta w_j$
and $\Delta b$ in the weights and bias. This linearity makes it easy
to choose small changes in the weights and biases to achieve any
desired small change in the output. So while sigmoid neurons have
much of the same qualitative behaviour as perceptrons, they make it
much easier to figure out how changing the weights and biases will
change the output.
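Equation (5) is easy to verify numerically. For a sigmoid neuron the partial derivatives work out to $\partial \, \mbox{output}/\partial w_j = \sigma'(z) x_j$ and $\partial \, \mbox{output}/\partial b = \sigma'(z)$, where $\sigma'(z) = \sigma(z)(1-\sigma(z))$. Here's a sketch (made-up numbers, my own helper names) comparing the exact change in output against the linear approximation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    return sigmoid(z) * (1.0 - sigmoid(z))

w, x, b = np.array([0.7, -0.2]), np.array([0.5, 0.9]), 0.1
dw, db = np.array([0.001, -0.002]), 0.0005  # small changes to weights and bias

z = np.dot(w, x) + b
exact = sigmoid(np.dot(w + dw, x) + b + db) - sigmoid(z)
approx = sigmoid_prime(z) * (np.dot(x, dw) + db)  # Equation (5)
print("%.8f %.8f" % (exact, approx))  # the two agree closely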
If it's the shape of $\sigma$ which really matters, and not its exact
form, then why use the particular form used for $\sigma$ in
Equation (3)\begin{eqnarray}
\sigma(z) \equiv \frac{1}{1+e^{-z}} \nonumber\end{eqnarray}? In fact, later in the book we will
occasionally consider neurons where the output is $f(w \cdot x + b)$
for some other activation function $f(\cdot)$. The main thing
that changes when we use a different activation function is that the
particular values for the partial derivatives in
Equation (5)\begin{eqnarray}
\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
\Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b \nonumber\end{eqnarray} change. It turns out that when we
compute those partial derivatives later, using $\sigma$ will simplify
the algebra, simply because exponentials have lovely properties when
differentiated. In any case, $\sigma$ is commonly used in work on
neural nets, and is the activation function we'll use most often in
this book.
How should we interpret the output from a sigmoid neuron? Obviously,
one big difference between perceptrons and sigmoid neurons is that
sigmoid neurons don't just output $0$ or $1$. They can have as output
any real number between $0$ and $1$, so values such as $0.173\ldots$
and $0.689\ldots$ are legitimate outputs. This can be useful, for
example, if we want to use the output value to represent the average
intensity of the pixels in an image input to a neural network. But
sometimes it can be a nuisance. Suppose we want the output from the
network to indicate either "the input image is a 9" or "the input
image is not a 9". Obviously, it'd be easiest to do this if the
output was a $0$ or a $1$, as in a perceptron. But in practice we can
set up a convention to deal with this, for example, by deciding to
interpret any output of at least $0.5$ as indicating a "9", and any
output less than $0.5$ as indicating "not a 9". I'll always
explicitly state when we're using such a convention, so it shouldn't
cause any confusion.
The architecture of neural networks
In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:
The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values of less than $0.5$ indicating "input image is not a 9", and values greater than $0.5$ indicating "input image is a 9".
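For instance, encoding such an image as input activations might look like this sketch (the image itself is a made-up stand-in):

import numpy as np

# A hypothetical 64 x 64 greyscale image with pixel values from 0 to 255.
image = np.random.randint(0, 256, size=(64, 64))
# Flatten into 4,096 input activations scaled into [0, 1].
input_activations = image.reshape(4096) / 255.0
print(input_activations.shape)  # (4096,)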
While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.
Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.
However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.
Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.
Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image
into six separate images,
We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,
is a 5.
We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.
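As a rough sketch of that scoring idea (everything here is a hypothetical stand-in, not real code from this book):

def score_segmentation(segments, classifier_confidence):
    # A trial segmentation is only as good as its worst segment: if the
    # digit classifier is unsure anywhere, the split was probably wrong.
    return min(classifier_confidence(segment) for segment in segments)

# We'd then keep whichever trial segmentation scores highest, e.g.:
# best = max(trials, key=lambda segs: score_segmentation(segs, confidence))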
To recognize individual digits we will use a three-layer neural network:
The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.
The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.
The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.
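Anticipating the program we'll write later, here's a sketch of how such a network turns an input into an output (the shapes match the 784-15-10 architecture just described; this is an illustration, not the book's actual code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized weights and biases for a 784-15-10 network.
w1, b1 = np.random.randn(15, 784), np.random.randn(15, 1)
w2, b2 = np.random.randn(10, 15), np.random.randn(10, 1)

x = np.random.rand(784, 1)                 # stand-in for one input image
hidden = sigmoid(np.dot(w1, x) + b1)       # 15 hidden activations
output = sigmoid(np.dot(w2, hidden) + b2)  # 10 output activations
print(np.argmax(output))  # the network's guess: the most active output neuron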
You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering why using $10$ output neurons works better. Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?
To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:
It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:
As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown earlier:
So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.
Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$. If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.
Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.
Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:
As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!
The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.
We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector. For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.
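In code, $y(x)$ is just a one-hot column vector. A sketch (the helper name is mine):

import numpy as np

def one_hot(digit):
    # A 10-dimensional column vector with 1.0 in the digit's slot.
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

print(one_hot(6).T)  # row form of the desired output for a 6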
What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function* *Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks. : \begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \tag{6}\end{eqnarray} Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
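Expressed in code, Equation (6) is short. A sketch, given matching lists of desired outputs $y(x)$ and network outputs $a$:

import numpy as np

def quadratic_cost(desired, outputs):
    # C = (1 / 2n) * sum over x of ||y(x) - a||^2
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for y, a in zip(desired, outputs)) / (2.0 * n)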
Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.
Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}. Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray} works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.
Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.
Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:
What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.
One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!
(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)
Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.
Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?
To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}\end{eqnarray} We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. We denote the gradient vector by $\nabla C$, i.e.: \begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}\end{eqnarray} In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.
With these definitions, the expression (7)\begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \nonumber\end{eqnarray} for $\Delta C$ can be rewritten as \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{9}\end{eqnarray} This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray} where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray} tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10)\begin{eqnarray} \Delta v = -\eta \nabla C \nonumber\end{eqnarray}. (Within, of course, the limits of the approximation in Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray}). This is exactly the property we wanted! And so we'll take Equation (10)\begin{eqnarray} \Delta v = -\eta \nabla C \nonumber\end{eqnarray} to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10)\begin{eqnarray} \Delta v = -\eta \nabla C \nonumber\end{eqnarray} to compute a value for $\Delta v$, then move the ball's position $v$ by that amount: \begin{eqnarray} v \rightarrow v' = v -\eta \nabla C. \tag{11}\end{eqnarray} Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.
Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:
Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!
To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray} is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray} remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1,\ldots,v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is \begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray} where the gradient $\nabla C$ is the vector \begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T. \tag{13}\end{eqnarray} Just as for the two variable case, we can choose \begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray} and we're guaranteed that our (approximate) expression (12)\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v \nonumber\end{eqnarray} for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule \begin{eqnarray} v \rightarrow v' = v-\eta \nabla C. \tag{15}\end{eqnarray} You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
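Here is the update rule (15) at work on a toy cost, $C(v) = v_1^2 + v_2^2$, whose gradient is $\nabla C = 2v$ (a sketch, just to watch $C$ shrink):

import numpy as np

def grad_C(v):
    # Gradient of the toy cost C(v) = v_1^2 + v_2^2.
    return 2.0 * v

v = np.array([3.0, -4.0])  # an arbitrary starting position
eta = 0.1                  # the learning rate
for step in range(100):
    v = v - eta * grad_C(v)  # v -> v' = v - eta * grad C
print(v)  # very close to the minimum at (0, 0)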
Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives* *Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$. Still, you get the point.! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.
How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}. To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{17}\end{eqnarray} By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.
There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}. Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.
An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.
To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is, \begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray} where the second sum is over the entire set of training data. Swapping sides we get \begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray} confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.
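To see Equation (19) in action, here's a toy sketch (my own illustration; the array grad_Cx is hypothetical stand-in data for the per-example gradients $\nabla C_x$):

import numpy as np

rng = np.random.RandomState(0)
n, m = 60000, 10
# Stand-in per-example gradients: one row per training example,
# five parameters, with noise around a common mean
grad_Cx = 0.5 + rng.randn(n, 5)

full_gradient = grad_Cx.mean(axis=0)           # (1/n) sum_x grad C_x
sample = grad_Cx[rng.choice(n, m, replace=False)]
estimate = sample.mean(axis=0)                 # (1/m) sum_j grad C_{X_j}

print(full_gradient)
print(estimate)  # noisy, but usually points in roughly the right direction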
To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those, \begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray} where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
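In code, one epoch of this procedure might look like the following sketch (my own illustration, not the book's implementation; update_mini_batch is a hypothetical function assumed to apply the update rules (20) and (21) to a single mini-batch):

import random

def SGD_epoch(training_data, mini_batch_size, eta, update_mini_batch):
    """Run one epoch of stochastic gradient descent.

    training_data is a list of (x, y) pairs; update_mini_batch is
    assumed to apply the update rules (20) and (21) to one
    mini-batch, with learning rate eta."""
    random.shuffle(training_data)
    mini_batches = [training_data[k:k+mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for mini_batch in mini_batches:
        update_mini_batch(mini_batch, eta)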
Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.
We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.
Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.
Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
If you don't use git then you can download the data and code here.
Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000 image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000 image data set, not the original 60,000 image data set* *As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link)..
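If you want to make such a split by hand, it's a couple of lines of Numpy slicing. A minimal sketch, with placeholder arrays standing in for the real MNIST data:

import numpy as np

# Placeholders standing in for the real 60,000 MNIST training images/labels
images = np.zeros((60000, 784))
labels = np.zeros(60000, dtype=int)

training_images, validation_images = images[:50000], images[50000:]
training_labels, validation_labels = labels[:50000], labels[50000:]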
Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.
Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:
class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:
net = Network([2, 3, 1])
The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean $0$ and standard deviation $1$. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix $w$. It's a matrix such that $w_{jk}$ is the weight for the connection between the $k^{\rm th}$ neuron in the second layer, and the $j^{\rm th}$ neuron in the third layer. This ordering of the $j$ and $k$ indices may seem strange - surely it'd make more sense to swap the $j$ and $k$ indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is: \begin{eqnarray} a' = \sigma(w a + b). \tag{22}\end{eqnarray} There's quite a bit going on in this equation, so let's unpack it piece by piece. $a$ is the vector of activations of the second layer of neurons. To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases. We then apply the function $\sigma$ elementwise to every entry in the vector $w a + b$. (This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.
Exercise
With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:
def sigmoid(z):
return 1.0/(1.0+np.exp(-z))
Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output* *It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.. All the method does is apply Equation (22) for each layer:
def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a
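For example, with the net = Network([2, 3, 1]) created earlier, we can compute the output for a single input like so (a usage sketch; the input values are arbitrary, and note the (2, 1) shape):

import numpy as np

net = Network([2, 3, 1])
x = np.array([[0.5], [0.8]])  # a (2, 1) ndarray, not a (2,) vector
print(net.feedforward(x))     # a (1, 1) array holding the network's output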
How the backpropagation algorithm works
By Michael Nielsen / Jan 2017
In the last chapter we saw how neural networks can
learn their weights and biases using the gradient descent algorithm.
There was, however, a gap in our explanation: we didn't discuss how to
compute the gradient of the cost function. That's quite a gap! In
this chapter I'll explain a fast algorithm for computing such
gradients, an algorithm known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s,
but its importance wasn't fully appreciated until a
famous
1986 paper by
David
Rumelhart,
Geoffrey
Hinton, and
Ronald
Williams. That paper describes several
neural networks where backpropagation works far faster than earlier
approaches to learning, making it possible to use neural nets to solve
problems which had previously been insoluble. Today, the
backpropagation algorithm is the workhorse of learning in neural
networks.
This chapter is more mathematically involved than the rest of the
book. If you're not crazy about mathematics you may be tempted to
skip the chapter, and to treat backpropagation as a black box whose
details you're willing to ignore. Why take the time to study those
details?
The reason, of course, is understanding. At the heart of
backpropagation is an expression for the partial derivative $\partial
C / \partial w$ of the cost function $C$ with respect to any weight
$w$ (or bias $b$) in the network. The expression tells us how quickly
the cost changes when we change the weights and biases. And while the
expression is somewhat complex, it also has a beauty to it, with each
element having a natural, intuitive interpretation. And so
backpropagation isn't just a fast algorithm for learning. It actually
gives us detailed insights into how changing the weights and biases
changes the overall behaviour of the network. That's well worth
studying in detail.
With that said, if you want to skim the chapter, or jump straight to
the next chapter, that's fine. I've written the rest of the book to
be accessible even if you treat backpropagation as a black box. There
are, of course, points later in the book where I refer back to results
from this chapter. But at those points you should still be able to
understand the main conclusions, even if you don't follow all the
reasoning.
Warm up: a fast matrix-based approach to computing the output
from a neural network
Before discussing backpropagation, let's warm up with a fast
matrix-based algorithm to compute the output from a neural network.
We actually already briefly saw this algorithm
near
the end of the last chapter, but I described it quickly, so it's
worth revisiting in detail. In particular, this is a good way of
getting comfortable with the notation used in backpropagation, in a
familiar context.
Let's begin with a notation which lets us refer to weights in the
network in an unambiguous way. We'll use $w^l_{jk}$ to denote the
weight for the connection from the $k^{\rm th}$ neuron in the
$(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$
layer. So, for example, the diagram below shows the weight on a
connection from the fourth neuron in the second layer to the second
neuron in the third layer of a network:
This notation is cumbersome at first, and it does take some work to
master. But with a little effort you'll find the notation becomes
easy and natural. One quirk of the notation is the ordering of the
$j$ and $k$ indices. You might think that it makes more sense to use
$j$ to refer to the input neuron, and $k$ to the output neuron, not
vice versa, as is actually done. I'll explain the reason for this
quirk below.

We use a similar notation for the network's biases and activations.
Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in
the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the
$j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram
shows examples of these notations in use:
With these notations, the activation $a^{l}_j$ of the $j^{\rm th}$
neuron in the $l^{\rm th}$ layer is related to the activations in the
$(l-1)^{\rm th}$ layer by the equation (compare
Equation (4) and surrounding
discussion in the last chapter)
\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),
\tag{23}\end{eqnarray}
where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To
rewrite this expression in a matrix form we define a weight
matrix $w^l$ for each layer, $l$. The entries of the weight matrix
$w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons,
that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$.
Similarly, for each layer $l$ we define a bias vector, $b^l$.
You can probably guess how this works - the components of the bias
vector are just the values $b^l_j$, one component for each neuron in
the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$
whose components are the activations $a^l_j$.

The last ingredient we need to rewrite (23) in a
matrix form is the idea of vectorizing a function such as $\sigma$.
We met vectorization briefly in the last chapter, but to recap, the
idea is that we want to apply a function such as $\sigma$ to every
element in a vector $v$. We use the obvious notation $\sigma(v)$ to
denote this kind of elementwise application of a function. That is,
the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.
As an example, if we have the function $f(x) = x^2$ then the
vectorized form of $f$ has the effect
\begin{eqnarray}
f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)
= \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]
= \left[ \begin{array}{c} 4 \\ 9 \end{array} \right],
\tag{24}\end{eqnarray}
that is, the vectorized $f$ just squares every element of the vector.
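Numpy gives us this elementwise behaviour for free, since arithmetic on arrays is already vectorized. A quick check of Equation (24) (illustrative code only):

import numpy as np

def f(x):
    return x**2  # acts elementwise when x is a Numpy array

v = np.array([[2.0], [3.0]])
print(f(v))  # [[4.], [9.]] - each component squared, as in Equation (24)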
With these notations in mind, Equation (23) can
be rewritten in the beautiful and compact vectorized form
\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l).
\tag{25}\end{eqnarray}
This expression gives us a much more global way of thinking about how
the activations in one layer relate to activations in the previous
layer: we just apply the weight matrix to the activations, then add
the bias vector, and finally apply the $\sigma$ function*
*By the way, it's this expression that
motivates the quirk in the $w^l_{jk}$ notation mentioned earlier.
If we used $j$ to index the input neuron, and $k$ to index the
output neuron, then we'd need to replace the weight matrix in
Equation (25) by the transpose of the
weight matrix. That's a small change, but annoying, and we'd lose
the easy simplicity of saying (and thinking) "apply the weight
matrix to the activations".. That global view is often easier and
more succinct (and involves fewer indices!) than the neuron-by-neuron
view we've taken to now. Think of it as a way of escaping index hell,
while remaining precise about what's going on. The expression is also
useful in practice, because most matrix libraries provide fast ways of
implementing matrix multiplication, vector addition, and
vectorization. Indeed, the
code
in the last chapter made implicit use of this expression to compute
the behaviour of the network.
When using Equation (25) to compute $a^l$,
we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$
along the way. This quantity turns out to be useful enough to be
worth naming: we call $z^l$ the weighted input to the neurons
in layer $l$. We'll make considerable use of the weighted input $z^l$
later in the chapter. Equation (25) is
sometimes written in terms of the weighted input, as $a^l =
\sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j
= \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, that is, $z^l_j$ is just the
weighted input to the activation function for neuron $j$ in layer $l$.
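As a sketch of how this looks in code (my own illustration, in the list conventions of the last chapter's Network class): a feedforward pass that also records the weighted inputs $z^l$, which will be needed below:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def feedforward_with_zs(weights, biases, a):
    """Return (activations, zs) for every layer, given the input
    activation a. weights and biases are lists of Numpy arrays."""
    activations, zs = [a], []
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b   # z^l = w^l a^{l-1} + b^l
        zs.append(z)
        a = sigmoid(z)         # a^l = sigma(z^l)
        activations.append(a)
    return activations, zs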
The two assumptions we need about the cost function
The goal of backpropagation is to compute the partial derivatives
$\partial C / \partial w$ and $\partial C / \partial b$ of the cost
function $C$ with respect to any weight $w$ or bias $b$ in the
network. For backpropagation to work we need to make two main
assumptions about the form of the cost function. Before stating those
assumptions, though, it's useful to have an example cost function in
mind. We'll use the quadratic cost function from last chapter
(c.f. Equation (6)). In the notation of
the last section, the quadratic cost has the form
\begin{eqnarray}
C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
where: $n$ is the total number of training examples; the sum is over
individual training examples, $x$; $y = y(x)$ is the corresponding
desired output; $L$ denotes the number of layers in the network; and
$a^L = a^L(x)$ is the vector of activations output from the network
when $x$ is input.
Okay, so what assumptions do we need to make about our cost function,
$C$, in order that backpropagation can be applied? The first
assumption we need is that the cost function can be written as an
average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for
individual training examples, $x$. This is the case for the quadratic
cost function, where the cost for a single training example is $C_x =
\frac{1}{2} \|y-a^L \|^2$. This assumption will also hold true for
all the other cost functions we'll meet in this book.
The reason we need this assumption is because what backpropagation
actually lets us do is compute the partial derivatives $\partial C_x
/ \partial w$ and $\partial C_x / \partial b$ for a single training
example. We then recover $\partial C / \partial w$ and $\partial C
/ \partial b$ by averaging over training examples. In fact, with this
assumption in mind, we'll suppose the training example $x$ has been
fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$.
We'll eventually put the $x$ back in, but for now it's a notational
nuisance that is better left implicit.
The second assumption we make about the cost is that it can be written
as a function of the outputs from the neural network: $C = C(a^L)$.

For example, the quadratic cost function satisfies this requirement,
since the quadratic cost for a single training example $x$ may be
written as
\begin{eqnarray}
C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\tag{27}\end{eqnarray}
and thus is a function of the output activations. Of course, this
cost function also depends on the desired output $y$, and you may
wonder why we're not regarding the cost also as a function of $y$.
Remember, though, that the input training example $x$ is fixed, and so
the output $y$ is also a fixed parameter. In particular, it's not
something we can modify by changing the weights and biases in any way,
i.e., it's not something which the neural network learns. And so it
makes sense to regard $C$ as a function of the output activations
$a^L$ alone, with $y$ merely a parameter that helps define that
function.

The Hadamard product, $s \odot t$
The backpropagation algorithm is based on common linear algebraic
operations - things like vector addition, multiplying a vector by a
matrix, and so on. But one of the operations is a little less
commonly used. In particular, suppose $s$ and $t$ are two vectors of
the same dimension. Then we use $s \odot t$ to denote the
elementwise product of the two vectors. Thus the components of
$s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,
\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right]
\odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}
This kind of elementwise multiplication is sometimes called the
Hadamard product or Schur product. We'll refer to it as
the Hadamard product. Good matrix libraries usually provide fast
implementations of the Hadamard product, and that comes in handy when
implementing backpropagation.
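In Numpy no special function is needed: the ordinary * operator on two arrays of the same shape multiplies elementwise. For instance (illustrative):

import numpy as np

s = np.array([[1], [2]])
t = np.array([[3], [4]])
print(s * t)  # [[3], [8]] - the Hadamard product of s and t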
The four fundamental equations behind backpropagation
Backpropagation is about understanding how changing the weights and
biases in a network changes the cost function. Ultimately, this means
computing the partial derivatives $\partial C / \partial w^l_{jk}$ and
$\partial C / \partial b^l_j$. But to compute those, we first
introduce an intermediate quantity, $\delta^l_j$, which we call the
error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
Backpropagation will give us a procedure to compute the error
$\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C
/ \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.
To understand how the error is defined, imagine there is a demon in
our neural network:
The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the
neuron comes in, the demon messes with the neuron's operation. It
adds a little change $\Delta z^l_j$ to the neuron's weighted input, so
that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs
$\sigma(z^l_j+\Delta z^l_j)$. This change propagates through later
layers in the network, finally causing the overall cost to change by
an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

Now, this demon is a good demon, and is trying to help you improve the
cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the
cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large
value (either positive or negative). Then the demon can lower the
cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign
to $\frac{\partial C}{\partial z^l_j}$. By contrast, if
$\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon
can't improve the cost much at all by perturbing the weighted input
$z^l_j$. So far as the demon can tell, the neuron is already pretty
near optimal*
*This is only the case for small changes $\Delta
z^l_j$, of course. We'll assume that the demon is constrained to
make such small changes.. And so there's a heuristic sense in
which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in
the neuron.
Motivated by this story, we define the error $\delta^l_j$ of neuron
$j$ in layer $l$ by
\begin{eqnarray}
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.
\tag{29}\end{eqnarray}
As per our usual conventions, we use $\delta^l$ to denote the vector
of errors associated with layer $l$. Backpropagation will give us a
way of computing $\delta^l$ for every layer, and then relating those
errors to the quantities of real interest, $\partial C / \partial
w^l_{jk}$ and $\partial C / \partial b^l_j$.
You might wonder why the demon is changing the weighted input $z^l_j$.
Surely it'd be more natural to imagine the demon changing the output
activation $a^l_j$, with the result that we'd be using $\frac{\partial
C}{\partial a^l_j}$ as our measure of error. In fact, if you do
this things work out quite similarly to the discussion below. But it
turns out to make the presentation of backpropagation a little more
algebraically complicated. So we'll stick with $\delta^l_j =
\frac{\partial C}{\partial z^l_j}$ as our measure of error*
*In
classification problems like MNIST the term "error" is sometimes
used to mean the classification failure rate. E.g., if the neural
net correctly classifies 96.0 percent of the digits, then the error
is 4.0 percent. Obviously, this has quite a different meaning from
our $\delta$ vectors. In practice, you shouldn't have trouble
telling which meaning is intended in any given usage..
Plan of attack: Backpropagation is based around four
fundamental equations. Together, those equations give us a way of
computing both the error $\delta^l$ and the gradient of the cost
function. I state the four equations below. Be warned, though: you
shouldn't expect to instantaneously assimilate the equations. Such an
expectation will lead to disappointment. In fact, the backpropagation
equations are so rich that understanding them well requires
considerable time and patience as you gradually delve deeper into the
equations. The good news is that such patience is repaid many times
over. And so the discussion in this section is merely a beginning,
helping you on the way to a thorough understanding of the equations.
Here's a preview of the ways we'll delve more deeply into the
equations later in the chapter: I'll
give
a short proof of the equations, which helps explain why they are
true; we'll restate
the equations in algorithmic form as pseudocode, and
see how the
pseudocode can be implemented as real, running Python code; and, in
the final
section of the chapter, we'll develop an intuitive picture of what
the backpropagation equations mean, and how someone might discover
them from scratch. Along the way we'll return repeatedly to the four
fundamental equations, and as you deepen your understanding those
equations will come to seem comfortable and, perhaps, even beautiful
and natural.
An equation for the error in the output layer, $\delta^L$:
The components of $\delta^L$ are given by
\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
\tag{BP1}\end{eqnarray}
This is a very natural expression. The first term on the right,
$\partial C / \partial a^L_j$, just measures how fast the cost is
changing as a function of the $j^{\rm th}$ output activation. If, for
example, $C$ doesn't depend much on a particular output neuron, $j$,
then $\delta^L_j$ will be small, which is what we'd expect. The
second term on the right, $\sigma'(z^L_j)$, measures how fast the
activation function $\sigma$ is changing at $z^L_j$.
Notice that everything in (BP1) is easily computed. In
particular, we compute $z^L_j$ while computing the behaviour of the
network, and it's only a small additional overhead to compute
$\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$
will, of course, depend on the form of the cost function. However,
provided the cost function is known there should be little trouble
computing $\partial C / \partial a^L_j$. For example, if we're using
the quadratic cost function then $C = \frac{1}{2} \sum_j
(y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a_j^L-y_j)$,
which obviously is easily computable.
Equation (BP1) is a componentwise expression for $\delta^L$.
It's a perfectly good expression, but not the matrix-based form we
want for backpropagation. However, it's easy to rewrite the equation
in a matrix-based form, as
\begin{eqnarray}
\delta^L = \nabla_a C \odot \sigma'(z^L).
\tag{BP1a}\end{eqnarray}
Here, $\nabla_a C$ is defined to be a vector whose components are the
partial derivatives $\partial C / \partial a^L_j$. You can think of
$\nabla_a C$ as expressing the rate of change of $C$ with respect to
the output activations. It's easy to see that Equations (BP1a) and (BP1) are equivalent, and for that reason from now on we'll use (BP1) interchangeably to refer to both equations. As an example, in the case of the quadratic cost we have $\nabla_a C = (a^L-y)$, and so the fully matrix-based form of (BP1) becomes
\begin{eqnarray}
\delta^L = (a^L-y) \odot \sigma'(z^L).
\tag{30}\end{eqnarray}
As you can see, everything in this expression has a nice vector form,
and is easily computed using a library such as Numpy.
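Here's Equation (30) as a small Numpy sketch (my own code, assuming the quadratic cost; a_L, y and z_L are (n, 1) arrays):

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid function
    return sigmoid(z)*(1-sigmoid(z))

def output_error(a_L, y, z_L):
    """Equation (30): delta^L = (a^L - y) Hadamard sigma'(z^L)."""
    return (a_L - y) * sigmoid_prime(z_L)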
An equation for the error $\delta^l$ in terms of the error in
the next layer, $\delta^{l+1}$: In particular
\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}
where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for
the $(l+1)^{\rm th}$ layer. This equation appears complicated, but
each element has a nice interpretation. Suppose we know the error
$\delta^{l+1}$ at the $l+1^{\rm th}$ layer. When we apply the
transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of
this as moving the error backward through the network, giving
us some sort of measure of the error at the output of the $l^{\rm th}$
layer. We then take the Hadamard product $\odot \sigma'(z^l)$. This
moves the error backward through the activation function in layer $l$,
giving us the error $\delta^l$ in the weighted input to layer $l$.
By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network. We start by using (BP1) to compute $\delta^L$, then apply Equation (BP2) to compute $\delta^{L-1}$, then Equation (BP2) again to compute $\delta^{L-2}$, and so on, all the way back through the network.
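Continuing the sketch style used above (hypothetical names; weights and zs are the per-layer lists from the feedforward pass, and delta_L comes from (BP1)):

def backpropagate_errors(weights, zs, delta_L):
    """Return [delta^2, ..., delta^L], computed via (BP2).

    Relies on np and sigmoid_prime as defined in the sketch above."""
    deltas = [delta_L]
    # Work backward from layer L-1 down to layer 2, applying (BP2)
    for w_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = np.dot(w_next.transpose(), deltas[0]) * sigmoid_prime(z)
        deltas.insert(0, delta)
    return deltas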
An equation for the rate of change of the cost with respect to
any bias in the network: In particular:
\begin{eqnarray} \frac{\partial C}{\partial b^l_j} =
\delta^l_j.
\tag{BP3}\end{eqnarray}
That is, the error $\delta^l_j$ is exactly equal to the rate of
change $\partial C / \partial b^l_j$. This is great news, since
(BP1) and (BP2) have already told us how to compute $\delta^l_j$. We can rewrite (BP3) in shorthand as
\begin{eqnarray}
\frac{\partial C}{\partial b} = \delta,
\tag{31}\end{eqnarray}
where it is understood that $\delta$ is being evaluated at the same
neuron as the bias $b$.
An equation for the rate of change of the cost with respect to
any weight in the network: In particular:
\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}
This tells us how to compute the partial derivatives $\partial C
/ \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and
$a^{l-1}$, which we already know how to compute. The equation can be
rewritten in a less index-heavy notation as
\begin{eqnarray} \frac{\partial
C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}
where it's understood that $a_{\rm in}$ is the activation of the
neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of
the neuron output from the weight $w$. Zooming in to look at just the
weight $w$, and the two neurons connected by that weight, we can
depict this as:
A nice consequence of Equation (32) is
that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx
0$, the gradient term $\partial C / \partial w$ will also tend to be
small. In this case, we'll say the weight learns slowly,
meaning that it's not changing much during gradient descent. In other
words, one consequence of (BP4) is that weights output from low-activation neurons learn slowly.

There are other insights along these lines which can be obtained from (BP1)-(BP4). Let's start by looking at the output layer. Consider the term $\sigma'(z^L_j)$ in (BP1). Recall
from the graph of the sigmoid
function in the last chapter that the $\sigma$ function becomes
very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$. When this
occurs we will have $\sigma'(z^L_j) \approx 0$. And so the lesson is
that a weight in the final layer will learn slowly if the output
neuron is either low activation ($\approx 0$) or high activation
($\approx 1$). In this case it's common to say the output neuron has
saturated and, as a result, the weight has stopped learning (or
is learning slowly). Similar remarks hold also for the biases of output neurons.
We can obtain similar insights for earlier layers. In particular,
note the $\sigma'(z^l)$ term in (BP2). This means that
$\delta^l_j$ is likely to get small if the neuron is near saturation.
And this, in turn, means that any weights input to a saturated neuron
will learn slowly*
*This reasoning won't hold if ${w^{l+1}}^T
\delta^{l+1}$ has large enough entries to compensate for the
smallness of $\sigma'(z^l_j)$. But I'm speaking of the general
tendency..
Summing up, we've learnt that a weight will learn slowly if either the
input neuron is low-activation, or if the output neuron has saturated,
i.e., is either high- or low-activation.
None of these observations is too greatly surprising. Still, they
help improve our mental model of what's going on as a neural network
learns. Furthermore, we can turn this type of reasoning around. The
four fundamental equations turn out to hold for any activation
function, not just the standard sigmoid function (that's because, as
we'll see in a moment, the proofs don't use any special properties of
$\sigma$). And so we can use these equations to design
activation functions which have particular desired learning
properties. As an example to give you the idea, suppose we were to
choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$
is always positive, and never gets close to zero. That would prevent
the slow-down of learning that occurs when ordinary sigmoid neurons
saturate. Later in the book we'll see examples where this kind of
modification is made to the activation function. Keeping the four
equations (BP1)-(BP4) in mind can help explain why such
modifications are tried, and what impact they can have.
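Before moving on to the proofs, here's how (BP3) and (BP4) look in the sketch style used above: given the per-layer errors and activations, the gradient is assembled from quantities we've already computed (hypothetical names, matching the earlier sketches):

def cost_gradients(deltas, activations):
    """Return (nabla_b, nabla_w) via (BP3) and (BP4).

    deltas is [delta^2, ..., delta^L] and activations is
    [a^1, ..., a^L], all (n, 1) Numpy arrays."""
    nabla_b = deltas                                    # (BP3)
    nabla_w = [np.dot(delta, a_in.transpose())          # (BP4)
               for delta, a_in in zip(deltas, activations[:-1])]
    return nabla_b, nabla_w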
Problem
We'll now prove the four fundamental equations (BP1)-(BP4). All four are consequences of the chain rule from multivariable calculus. If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.
Let's begin with Equation (BP1), which gives an expression for the output error, $\delta^L$. To prove this equation, recall that by definition \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial z^L_j}. \tag{36}\end{eqnarray} Applying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations, \begin{eqnarray} \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}, \tag{37}\end{eqnarray} where the sum is over all neurons $k$ in the output layer. Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k = j$. And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq j$. As a result we can simplify the previous equation to \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}. \tag{38}\end{eqnarray} Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the right can be written as $\sigma'(z^L_j)$, and the equation becomes \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j), \tag{39}\end{eqnarray} which is just (BP1), in component form.
Next, we'll prove (BP2), which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$. To do this, we want to rewrite $\delta^l_j = \partial C / \partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$. We can do this using the chain rule, \begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \tag{42}\end{eqnarray} where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of $\delta^{l+1}_k$. To evaluate the first term on the last line, note that \begin{eqnarray} z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k. \tag{43}\end{eqnarray} Differentiating, we obtain \begin{eqnarray} \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44}\end{eqnarray} Substituting back into (42) we obtain \begin{eqnarray} \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}\end{eqnarray} This is just (BP2) written in component form.
The final two equations we want to prove are (BP3) and (BP4). These also follow from the chain rule, in a manner similar to the proofs of the two equations above. I leave them to you as an exercise.
That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated. But it's really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That's all there really is to backpropagation - the rest is details.
The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:

1. Input $x$: Set the corresponding activation $a^{1}$ for the input layer.
2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.
3. Output error $\delta^{L}$: Compute the vector $\delta^{L} = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l})$.
5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
Examining the algorithm you can see why it's called backpropagation. We compute the error vectors $\delta^l$ backward, starting from the final layer. It may seem peculiar that we're going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.
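Assembled into a single routine, the algorithm above looks like this sketch (my own consolidation in the style of the earlier sketches, not the book's backprop method, though it follows the same five steps; the quadratic cost is assumed, so $\nabla_a C = a^L - y$):

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

def backprop_sketch(weights, biases, x, y):
    """Return (nabla_b, nabla_w), the gradient of C_x for one example."""
    # Steps 1-2: input and feedforward, storing activations and z's
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # Step 3: output error, via (BP1), with the quadratic cost
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [delta]
    nabla_w = [np.dot(delta, activations[-2].transpose())]
    # Steps 4-5: backpropagate the error and accumulate the gradient,
    # via (BP2), (BP3) and (BP4)
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l+1].transpose(), delta) * sigmoid_prime(zs[-l])
        nabla_b.insert(0, delta)
        nabla_w.insert(0, np.dot(delta, activations[-l-1].transpose()))
    return nabla_b, nabla_w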
As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

1. Input a set of training examples.
2. For each training example $x$: Set the corresponding input activation $a^{x,1}$, and perform the following steps: Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$. Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.
Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:
class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
In the last chapter we saw how neural networks can
learn their weights and biases using the gradient descent algorithm.
There was, however, a gap in our explanation: we didn't discuss how to
compute the gradient of the cost function. That's quite a gap! In
this chapter I'll explain a fast algorithm for computing such
gradients, an algorithm known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s,
but its importance wasn't fully appreciated until a
famous
1986 paper by
David
Rumelhart,
Geoffrey
Hinton, and
Ronald
Williams. That paper describes several
neural networks where backpropagation works far faster than earlier
approaches to learning, making it possible to use neural nets to solve
problems which had previously been insoluble. Today, the
backpropagation algorithm is the workhorse of learning in neural
networks.
This chapter is more mathematically involved than the rest of the
book. If you're not crazy about mathematics you may be tempted to
skip the chapter, and to treat backpropagation as a black box whose
details you're willing to ignore. Why take the time to study those
details?
The reason, of course, is understanding. At the heart of
backpropagation is an expression for the partial derivative $\partial
C / \partial w$ of the cost function $C$ with respect to any weight
$w$ (or bias $b$) in the network. The expression tells us how quickly
the cost changes when we change the weights and biases. And while the
expression is somewhat complex, it also has a beauty to it, with each
element having a natural, intuitive interpretation. And so
backpropagation isn't just a fast algorithm for learning. It actually
gives us detailed insights into how changing the weights and biases
changes the overall behaviour of the network. That's well worth
studying in detail.
With that said, if you want to skim the chapter, or jump straight to
the next chapter, that's fine. I've written the rest of the book to
be accessible even if you treat backpropagation as a black box. There
are, of course, points later in the book where I refer back to results
from this chapter. But at those points you should still be able to
understand the main conclusions, even if you don't follow all the
reasoning.
Warm up: a fast matrix-based approach to computing the output
from a neural network
Before discussing backpropagation, let's warm up with a fast
matrix-based algorithm to compute the output from a neural network.
We actually already briefly saw this algorithm
near
the end of the last chapter, but I described it quickly, so it's
worth revisiting in detail. In particular, this is a good way of
getting comfortable with the notation used in backpropagation, in a
familiar context.
Let's begin with a notation which lets us refer to weights in the
network in an unambiguous way. We'll use $w^l_{jk}$ to denote the
weight for the connection from the $k^{\rm th}$ neuron in the
$(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$
layer. So, for example, the diagram below shows the weight on a
connection from the fourth neuron in the second layer to the second
neuron in the third layer of a network:
This notation is cumbersome at first, and it does take some work to
master. But with a little effort you'll find the notation becomes
easy and natural. One quirk of the notation is the ordering of the
$j$ and $k$ indices. You might think that it makes more sense to use
$j$ to refer to the input neuron, and $k$ to the output neuron, not
vice versa, as is actually done. I'll explain the reason for this
quirk below.We use a similar notation for the network's biases and activations.
Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in
the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the
$j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram
shows examples of these notations in use:
With these notations, the activation $a^{l}_j$ of the $j^{\rm th}$
neuron in the $l^{\rm th}$ layer is related to the activations in the
$(l-1)^{\rm th}$ layer by the equation (compare
Equation (4)\begin{eqnarray}
\frac{1}{1+\exp(-\sum_j w_j x_j-b)} \nonumber\end{eqnarray} and surrounding
discussion in the last chapter)
\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),
\tag{23}\end{eqnarray}
where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To
rewrite this expression in a matrix form we define a weight
matrix $w^l$ for each layer, $l$. The entries of the weight matrix
$w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons,
that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$.
Similarly, for each layer $l$ we define a bias vector, $b^l$.
You can probably guess how this works - the components of the bias
vector are just the values $b^l_j$, one component for each neuron in
the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$
whose components are the activations $a^l_j$.The last ingredient we need to rewrite (23)\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right) \nonumber\end{eqnarray} in a
matrix form is the idea of vectorizing a function such as $\sigma$.
We met vectorization briefly in the last chapter, but to recap, the
idea is that we want to apply a function such as $\sigma$ to every
element in a vector $v$. We use the obvious notation $\sigma(v)$ to
denote this kind of elementwise application of a function. That is,
the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.
As an example, if we have the function $f(x) = x^2$ then the
vectorized form of $f$ has the effect
\begin{eqnarray}
f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)
= \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]
= \left[ \begin{array}{c} 4 \\ 9 \end{array} \right],
\tag{24}\end{eqnarray}
that is, the vectorized $f$ just squares every element of the vector.
With these notations in mind, Equation (23)\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right) \nonumber\end{eqnarray} can
be rewritten in the beautiful and compact vectorized form
\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l).
\tag{25}\end{eqnarray}
This expression gives us a much more global way of thinking about how
the activations in one layer relate to activations in the previous
layer: we just apply the weight matrix to the activations, then add
the bias vector, and finally apply the $\sigma$ function*
*By the way, it's this expression that
motivates the quirk in the $w^l_{jk}$ notation mentioned earlier.
If we used $j$ to index the input neuron, and $k$ to index the
output neuron, then we'd need to replace the weight matrix in
Equation (25)\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray} by the transpose of the
weight matrix. That's a small change, but annoying, and we'd lose
the easy simplicity of saying (and thinking) "apply the weight
matrix to the activations".. That global view is often easier and
more succinct (and involves fewer indices!) than the neuron-by-neuron
view we've taken to now. Think of it as a way of escaping index hell,
while remaining precise about what's going on. The expression is also
useful in practice, because most matrix libraries provide fast ways of
implementing matrix multiplication, vector addition, and
vectorization. Indeed, the
code
in the last chapter made implicit use of this expression to compute
the behaviour of the network.
When using Equation (25)\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray} to compute $a^l$,
we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$
along the way. This quantity turns out to be useful enough to be
worth naming: we call $z^l$ the weighted input to the neurons
in layer $l$. We'll make considerable use of the weighted input $z^l$
later in the chapter. Equation (25)\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l) \nonumber\end{eqnarray} is
sometimes written in terms of the weighted input, as $a^l =
\sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j
= \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, that is, $z^l_j$ is just the
weighted input to the activation function for neuron $j$ in layer $l$.
The two assumptions we need about the cost function
The goal of backpropagation is to compute the partial derivatives
$\partial C / \partial w$ and $\partial C / \partial b$ of the cost
function $C$ with respect to any weight $w$ or bias $b$ in the
network. For backpropagation to work we need to make two main
assumptions about the form of the cost function. Before stating those
assumptions, though, it's useful to have an example cost function in
mind. We'll use the quadratic cost function from last chapter
(c.f. Equation (6)\begin{eqnarray} C(w,b) \equiv
\frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}). In the notation of
the last section, the quadratic cost has the form
\begin{eqnarray}
C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
where: $n$ is the total number of training examples; the sum is over
individual training examples, $x$; $y = y(x)$ is the corresponding
desired output; $L$ denotes the number of layers in the network; and
$a^L = a^L(x)$ is the vector of activations output from the network
when $x$ is input.
Okay, so what assumptions do we need to make about our cost function,
$C$, in order that backpropagation can be applied? The first
assumption we need is that the cost function can be written as an
average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for
individual training examples, $x$. This is the case for the quadratic
cost function, where the cost for a single training example is $C_x =
\frac{1}{2} \|y-a^L \|^2$. This assumption will also hold true for
all the other cost functions we'll meet in this book.
The reason we need this assumption is because what backpropagation
actually lets us do is compute the partial derivatives $\partial C_x
/ \partial w$ and $\partial C_x / \partial b$ for a single training
example. We then recover $\partial C / \partial w$ and $\partial C
/ \partial b$ by averaging over training examples. In fact, with this
assumption in mind, we'll suppose the training example $x$ has been
fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$.
We'll eventually put the $x$ back in, but for now it's a notational
nuisance that is better left implicit.
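To make the averaging assumption concrete, here's a small Numpy sketch (the helper names are hypothetical) of the quadratic cost written as an average over per-example costs $C_x$:

import numpy as np

def quadratic_cost_x(a_L, y):
    """The per-example cost C_x = (1/2) ||y - a^L||^2."""
    return 0.5 * np.linalg.norm(y - a_L) ** 2

def quadratic_cost(outputs, targets):
    """The full cost C = (1/n) sum_x C_x, an average over training examples."""
    return np.mean([quadratic_cost_x(a, y) for a, y in zip(outputs, targets)])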
The second assumption we make about the cost is that it can be written
as a function of the output activations $a^L$ from the neural network.
For example, the quadratic cost function satisfies this requirement,
since the quadratic cost for a single training example $x$ may be
written as
\begin{eqnarray}
C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\tag{27}\end{eqnarray}
and thus is a function of the output activations. Of course, this
cost function also depends on the desired output $y$, and you may
wonder why we're not regarding the cost also as a function of $y$.
Remember, though, that the input training example $x$ is fixed, and so
the output $y$ is also a fixed parameter. In particular, it's not
something we can modify by changing the weights and biases in any way,
i.e., it's not something which the neural network learns. And so it
makes sense to regard $C$ as a function of the output activations
$a^L$ alone, with $y$ merely a parameter that helps define that
function.

The Hadamard product, $s \odot t$
The backpropagation algorithm is based on common linear algebraic
operations - things like vector addition, multiplying a vector by a
matrix, and so on. But one of the operations is a little less
commonly used. In particular, suppose $s$ and $t$ are two vectors of
the same dimension. Then we use $s \odot t$ to denote the
elementwise product of the two vectors. Thus the components of
$s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,
\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right]
\odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}
This kind of elementwise multiplication is sometimes called the
Hadamard product or Schur product. We'll refer to it as
the Hadamard product. Good matrix libraries usually provide fast
implementations of the Hadamard product, and that comes in handy when
implementing backpropagation.
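In Numpy, for instance, no special routine is needed, since the ordinary * operator on arrays is already elementwise:

import numpy as np

s = np.array([[1], [2]])
t = np.array([[3], [4]])
print(s * t)  # the Hadamard product: [[3], [8]]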
The four fundamental equations behind backpropagation
Backpropagation is about understanding how changing the weights and
biases in a network changes the cost function. Ultimately, this means
computing the partial derivatives $\partial C / \partial w^l_{jk}$ and
$\partial C / \partial b^l_j$. But to compute those, we first
introduce an intermediate quantity, $\delta^l_j$, which we call the
error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
Backpropagation will give us a procedure to compute the error
$\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C
/ \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.
To understand how the error is defined, imagine there is a demon in
our neural network:
The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the
neuron comes in, the demon messes with the neuron's operation. It
adds a little change $\Delta z^l_j$ to the neuron's weighted input, so
that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs
$\sigma(z^l_j+\Delta z^l_j)$. This change propagates through later
layers in the network, finally causing the overall cost to change by
an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

Now, this demon is a good demon, and is trying to help you improve the
cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the
cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large
value (either positive or negative). Then the demon can lower the
cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign
to $\frac{\partial C}{\partial z^l_j}$. By contrast, if
$\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon
can't improve the cost much at all by perturbing the weighted input
$z^l_j$. So far as the demon can tell, the neuron is already pretty
near optimal*
*This is only the case for small changes $\Delta
z^l_j$, of course. We'll assume that the demon is constrained to
make such small changes.. And so there's a heuristic sense in
which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in
the neuron.
Motivated by this story, we define the error $\delta^l_j$ of neuron
$j$ in layer $l$ by
\begin{eqnarray}
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.
\tag{29}\end{eqnarray}
As per our usual conventions, we use $\delta^l$ to denote the vector
of errors associated with layer $l$. Backpropagation will give us a
way of computing $\delta^l$ for every layer, and then relating those
errors to the quantities of real interest, $\partial C / \partial
w^l_{jk}$ and $\partial C / \partial b^l_j$.
You might wonder why the demon is changing the weighted input $z^l_j$.
Surely it'd be more natural to imagine the demon changing the output
activation $a^l_j$, with the result that we'd be using $\frac{\partial
C}{\partial a^l_j}$ as our measure of error. In fact, if you do
this things work out quite similarly to the discussion below. But it
turns out to make the presentation of backpropagation a little more
algebraically complicated. So we'll stick with $\delta^l_j =
\frac{\partial C}{\partial z^l_j}$ as our measure of error*
*In
classification problems like MNIST the term "error" is sometimes
used to mean the classification failure rate. E.g., if the neural
net correctly classifies 96.0 percent of the digits, then the error
is 4.0 percent. Obviously, this has quite a different meaning from
our $\delta$ vectors. In practice, you shouldn't have trouble
telling which meaning is intended in any given usage..
Plan of attack: Backpropagation is based around four
fundamental equations. Together, those equations give us a way of
computing both the error $\delta^l$ and the gradient of the cost
function. I state the four equations below. Be warned, though: you
shouldn't expect to instantaneously assimilate the equations. Such an
expectation will lead to disappointment. In fact, the backpropagation
equations are so rich that understanding them well requires
considerable time and patience as you gradually delve deeper into the
equations. The good news is that such patience is repaid many times
over. And so the discussion in this section is merely a beginning,
helping you on the way to a thorough understanding of the equations.
Here's a preview of the ways we'll delve more deeply into the
equations later in the chapter: I'll
give
a short proof of the equations, which helps explain why they are
true; we'll restate
the equations in algorithmic form as pseudocode, and
see how the
pseudocode can be implemented as real, running Python code; and, in
the final
section of the chapter, we'll develop an intuitive picture of what
the backpropagation equations mean, and how someone might discover
them from scratch. Along the way we'll return repeatedly to the four
fundamental equations, and as you deepen your understanding those
equations will come to seem comfortable and, perhaps, even beautiful
and natural.
An equation for the error in the output layer, $\delta^L$:
The components of $\delta^L$ are given by
\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
\tag{BP1}\end{eqnarray}
This is a very natural expression. The first term on the right,
$\partial C / \partial a^L_j$, just measures how fast the cost is
changing as a function of the $j^{\rm th}$ output activation. If, for
example, $C$ doesn't depend much on a particular output neuron, $j$,
then $\delta^L_j$ will be small, which is what we'd expect. The
second term on the right, $\sigma'(z^L_j)$, measures how fast the
activation function $\sigma$ is changing at $z^L_j$.
Notice that everything in (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} is easily computed. In
particular, we compute $z^L_j$ while computing the behaviour of the
network, and it's only a small additional overhead to compute
$\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$
will, of course, depend on the form of the cost function. However,
provided the cost function is known there should be little trouble
computing $\partial C / \partial a^L_j$. For example, if we're using
the quadratic cost function then $C = \frac{1}{2} \sum_j
(y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a_j^L-y_j)$,
which obviously is easily computable.
Equation (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} is a componentwise expression for $\delta^L$.
It's a perfectly good expression, but not the matrix-based form we
want for backpropagation. However, it's easy to rewrite the equation
in a matrix-based form, as
\begin{eqnarray}
\delta^L = \nabla_a C \odot \sigma'(z^L).
\tag{BP1a}\end{eqnarray}
Here, $\nabla_a C$ is defined to be a vector whose components are the
partial derivatives $\partial C / \partial a^L_j$. You can think of
$\nabla_a C$ as expressing the rate of change of $C$ with respect to
the output activations. It's easy to see that Equations (BP1a)\begin{eqnarray}
\delta^L = \nabla_a C \odot \sigma'(z^L) \nonumber\end{eqnarray}
and (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} are equivalent, and for that reason from now on we'll
use (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} interchangeably to refer to both equations. As an
example, in the case of the quadratic cost we have $\nabla_a C =
(a^L-y)$, and so the fully matrix-based form of (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} becomes
\begin{eqnarray}
\delta^L = (a^L-y) \odot \sigma'(z^L).
\tag{30}\end{eqnarray}
As you can see, everything in this expression has a nice vector form,
and is easily computed using a library such as Numpy.
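For instance, here's a sketch of how (30) might be computed in Numpy (the function name is my own invention; a_L, y, and z_L are column vectors for the output layer):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1.0 - sigmoid(z))

def output_error(a_L, y, z_L):
    """Equation (30): delta^L = (a^L - y) * sigma'(z^L), the
    quadratic-cost instance of (BP1a)."""
    return (a_L - y) * sigmoid_prime(z_L)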
An equation for the error $\delta^l$ in terms of the error in
the next layer, $\delta^{l+1}$: In particular
\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}
where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for
the $(l+1)^{\rm th}$ layer. This equation appears complicated, but
each element has a nice interpretation. Suppose we know the error
$\delta^{l+1}$ at the $(l+1)^{\rm th}$ layer. When we apply the
transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of
this as moving the error backward through the network, giving
us some sort of measure of the error at the output of the $l^{\rm th}$
layer. We then take the Hadamard product $\odot \sigma'(z^l)$. This
moves the error backward through the activation function in layer $l$,
giving us the error $\delta^l$ in the weighted input to layer $l$.
By combining (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} with (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} we can compute the error
$\delta^l$ for any layer in the network. We start by
using (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} to compute $\delta^L$, then apply
Equation (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} to compute $\delta^{L-1}$, then
Equation (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} again to compute $\delta^{L-2}$, and so on, all
the way back through the network.
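In the same style, (BP2) is a transpose-matrix multiplication followed by a Hadamard product. A sketch, again with my own names and assumed column-vector shapes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backpropagate_error(w_next, delta_next, z):
    """Equation (BP2): delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l).
    w_next is w^{l+1}, delta_next is delta^{l+1}, z is z^l."""
    return np.dot(w_next.transpose(), delta_next) * sigmoid_prime(z)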
An equation for the rate of change of the cost with respect to
any bias in the network: In particular:
\begin{eqnarray} \frac{\partial C}{\partial b^l_j} =
\delta^l_j.
\tag{BP3}\end{eqnarray}
That is, the error $\delta^l_j$ is exactly equal to the rate of
change $\partial C / \partial b^l_j$. This is great news, since
(BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray} and (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} have already told us how to compute
$\delta^l_j$. We can rewrite (BP3)\begin{eqnarray} \frac{\partial C}{\partial b^l_j} =
\delta^l_j \nonumber\end{eqnarray} in shorthand as
\begin{eqnarray}
\frac{\partial C}{\partial b} = \delta,
\tag{31}\end{eqnarray}
where it is understood that $\delta$ is being evaluated at the same
neuron as the bias $b$.
An equation for the rate of change of the cost with respect to
any weight in the network: In particular:
\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}
This tells us how to compute the partial derivatives $\partial C
/ \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and
$a^{l-1}$, which we already know how to compute. The equation can be
rewritten in a less index-heavy notation as
\begin{eqnarray} \frac{\partial
C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}
where it's understood that $a_{\rm in}$ is the activation of the
neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of
the neuron output from the weight $w$. Zooming in to look at just the
weight $w$, and the two neurons connected by that weight, we can
depict this as:
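In code, equations (BP3) and (BP4) are equally direct. Here's a sketch (my own names: delta is the column vector $\delta^l$, a_prev is $a^{l-1}$), in which (BP4) becomes an outer product:

import numpy as np

def gradients_for_layer(delta, a_prev):
    """Equations (BP3) and (BP4): the bias gradient is delta itself, and
    the weight gradient has entries dC/dw^l_{jk} = a^{l-1}_k delta^l_j,
    i.e. the outer product of delta with the previous activations."""
    nabla_b = delta                              # (BP3)
    nabla_w = np.dot(delta, a_prev.transpose())  # (BP4)
    return nabla_b, nabla_w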
A nice consequence of Equation (32)\begin{eqnarray} \frac{\partial
C}{\partial w} = a_{\rm in} \delta_{\rm out} \nonumber\end{eqnarray} is
that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx
0$, the gradient term $\partial C / \partial w$ will also tend to be
small. In this case, we'll say the weight learns slowly,
meaning that it's not changing much during gradient descent. In other
words, one consequence of (BP4)\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray} is that weights output from
low-activation neurons learn slowly.

There are other insights along these lines which can be obtained
from (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}-(BP4)\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}. Let's start by looking at the output
layer. Consider the term $\sigma'(z^L_j)$ in (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}. Recall
from the graph of the sigmoid
function in the last chapter that the $\sigma$ function becomes
very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$. When this
occurs we will have $\sigma'(z^L_j) \approx 0$. And so the lesson is
that a weight in the final layer will learn slowly if the output
neuron is either low activation ($\approx 0$) or high activation
($\approx 1$). In this case it's common to say the output neuron has
saturated and, as a result, the weight has stopped learning (or
is learning slowly). Similar remarks hold also for the biases of
output neurons.
We can obtain similar insights for earlier layers. In particular,
note the $\sigma'(z^l)$ term in (BP2)\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}. This means that
$\delta^l_j$ is likely to get small if the neuron is near saturation.
And this, in turn, means that any weights input to a saturated neuron
will learn slowly*
*This reasoning won't hold if ${w^{l+1}}^T
\delta^{l+1}$ has large enough entries to compensate for the
smallness of $\sigma'(z^l_j)$. But I'm speaking of the general
tendency..
Summing up, we've learnt that a weight will learn slowly if either the
input neuron is low-activation, or if the output neuron has saturated,
i.e., is either high- or low-activation.
None of these observations is too greatly surprising. Still, they
help improve our mental model of what's going on as a neural network
learns. Furthermore, we can turn this type of reasoning around. The
four fundamental equations turn out to hold for any activation
function, not just the standard sigmoid function (that's because, as
we'll see in a moment, the proofs don't use any special properties of
$\sigma$). And so we can use these equations to design
activation functions which have particular desired learning
properties. As an example to give you the idea, suppose we were to
choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$
is always positive, and never gets close to zero. That would prevent
the slow-down of learning that occurs when ordinary sigmoid neurons
saturate. Later in the book we'll see examples where this kind of
modification is made to the activation function. Keeping the four
equations (BP1)\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}-(BP4)\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray} in mind can help explain why such
modifications are tried, and what impact they can have.
Proof of the four fundamental equations
We'll now prove the four fundamental equations (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}-(BP4)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}. All four are consequences of the chain rule from multivariable calculus. If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.
Let's begin with Equation (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}, which gives an expression for the output error, $\delta^L$. To prove this equation, recall that by definition \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial z^L_j}. \tag{36}\end{eqnarray} Applying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations, \begin{eqnarray} \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}, \tag{37}\end{eqnarray} where the sum is over all neurons $k$ in the output layer. Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k = j$. And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq j$. As a result we can simplify the previous equation to \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}. \tag{38}\end{eqnarray} Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the right can be written as $\sigma'(z^L_j)$, and the equation becomes \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j), \tag{39}\end{eqnarray} which is just (BP1)\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}, in component form.
Next, we'll prove (BP2)\begin{eqnarray} \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}, which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$. To do this, we want to rewrite $\delta^l_j = \partial C / \partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$. We can do this using the chain rule, \begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \tag{42}\end{eqnarray} where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of $\delta^{l+1}_k$. To evaluate the first term on the last line, note that \begin{eqnarray} z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k. \tag{43}\end{eqnarray} Differentiating, we obtain \begin{eqnarray} \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44}\end{eqnarray} Substituting back into (42)\begin{eqnarray} & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \nonumber\end{eqnarray} we obtain \begin{eqnarray} \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}\end{eqnarray} This is just (BP2)\begin{eqnarray} \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray} written in component form.
The final two equations we want to prove are (BP3)\begin{eqnarray} \frac{\partial C}{\partial b^l_j} = \delta^l_j \nonumber\end{eqnarray} and (BP4)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}. These also follow from the chain rule, in a manner similar to the proofs of the two equations above. I leave them to you as an exercise.
That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated. But it's really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That's all there really is to backpropagation - the rest is details.
The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:
1. Input $x$: Set the corresponding activation $a^{1}$ for the input layer.
2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.
3. Output error $\delta^{L}$: Compute the vector $\delta^{L} = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l})$.
5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
Examining the algorithm you can see why it's called backpropagation. We compute the error vectors $\delta^l$ backward, starting from the final layer. It may seem peculiar that we're going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.
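Putting the five steps together for a single training example gives the following sketch. This is my own condensed version, not the book's network.py (whose backprop method we'll meet in a moment), and it assumes the quadratic cost, but it follows the same structure:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(weights, biases, x, y):
    """Return (nabla_b, nabla_w), the gradient of the cost C_x for a
    single training example, computed using (BP1)-(BP4)."""
    # Feedforward: store all weighted inputs z^l and activations a^l.
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    # Output error, Equation (BP1), for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta                                       # (BP3)
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())  # (BP4)
    # Backpropagate the error with (BP2), layer by layer.
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l+1].transpose(), delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    return nabla_b, nabla_w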
As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:
1. Input a set of training examples.
2. For each training example $x$: Set the corresponding input activation $a^{x,1}$, and perform the following steps: Feedforward: for each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$. Output error $\delta^{x,L}$: compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$. Backpropagate the error: for each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.
Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:
class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
...
def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
In what sense is backpropagation a fast algorithm? To answer this question, let's consider another approach to computing the gradient. Imagine it's the early days of neural networks research. Maybe it's the 1950s or 1960s, and you're the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient. But after playing around a bit, the algebra looks complicated, and you get discouraged. So you try to find another approach. You decide to regard the cost as a function of the weights $C = C(w)$ alone (we'll get back to the biases in a moment). You number the weights $w_1, w_2, \ldots$, and want to compute $\partial C / \partial w_j$ for some particular weight $w_j$. An obvious way of doing that is to use the approximation \begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon}, \tag{46}\end{eqnarray} where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j^{\rm th}$ direction. In other words, we can estimate $\partial C / \partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$, and then applying Equation (46)\begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}. The same idea will let us compute the partial derivatives $\partial C / \partial b$ with respect to the biases.
This approach looks very promising. It's simple conceptually, and extremely easy to implement, using just a few lines of code. Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!
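Indeed, here's roughly what those few lines might look like (a sketch; cost stands for any function mapping the flattened weight vector to the cost):

import numpy as np

def numerical_gradient(cost, w, epsilon=1e-5):
    """Estimate dC/dw_j for every weight using Equation (46).  Requires
    one evaluation of the cost per weight, plus one for the base point."""
    base = cost(w)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        w_step = w.copy()
        w_step[j] += epsilon
        grad[j] = (cost(w_step) - base) / epsilon
    return grad

# Toy check: for C(w) = (1/2)||w||^2 the gradient is w itself.
print(numerical_gradient(lambda w: 0.5 * np.sum(w**2),
                         np.array([1.0, -2.0, 3.0])))  # ~ [1, -2, 3]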
Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w+\epsilon e_j)$ in order to compute $\partial C / \partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that's a total of a million and one passes through the network.
What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C / \partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass* *This should be plausible, but it requires some analysis to make a careful statement. It's plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it's multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (46)\begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}! And so even though backpropagation appears superficially more complex than the approach based on (46)\begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon} \nonumber\end{eqnarray}, it's actually much, much faster.
This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we'll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.
As I've explained it, backpropagation presents two mysteries. First, what's the algorithm really doing? We've developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place? It's one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn't mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I'll address both these mysteries.
To improve our intuition about what the algorithm is doing, let's imagine that we've made a small change $\Delta w^l_{jk}$ to some weight in the network, $w^l_{jk}$:
The change $\Delta w^l_{jk}$ will cause a change in the output activation from the corresponding neuron. That, in turn, will cause changes in all the activations in the next layer, and those changes will propagate layer by layer, all the way through to the final layer, and then to the cost: \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{47}\end{eqnarray} This suggests that a possible approach to computing $\partial C / \partial w^l_{jk}$ is to carefully track how a small change in $w^l_{jk}$ propagates to cause a small change in $C$. If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute $\partial C / \partial w^l_{jk}$.

Let's try to carry this out. The change $\Delta w^l_{jk}$ causes a small change $\Delta a^{l}_j$ in the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. This change is given by \begin{eqnarray} \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{48}\end{eqnarray} The change in activation $\Delta a^l_{j}$ will cause changes in all the activations in the next layer, i.e., the $(l+1)^{\rm th}$ layer. We'll concentrate on the way just a single one of those activations is affected, say $a^{l+1}_q$.
What I've been providing up to now is a heuristic argument, a way of thinking about what's going on when you perturb a weight in a network. Let me sketch out a line of thinking you could use to further develop this argument. First, you could derive explicit expressions for all the individual partial derivatives in Equation (53)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray}. That's easy to do with a bit of calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight. After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm! And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths. Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.
Now, I'm not going to work through all this here. It's messy and requires considerable care to work through all the details. If you're up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.
What about the other mystery - how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier* *There is one clever step required. In Equation (53)\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \nonumber\end{eqnarray} the intermediate variables are activations like $a_q^{l+1}$. The clever idea is to switch to using weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If you don't have this idea, and instead continue using the activations $a^{l+1}_q$, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter. - short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It's just a lot of hard work simplifying the proof I've sketched in this section.
Improving the way neural networks learn

When a golf player is first learning to play golf, they usually spend most of their time developing a basic swing. Only gradually do they develop other shots, learning to chip, draw and fade the ball, building on and modifying their basic swing. In a similar way, up to now we've focused on understanding the backpropagation algorithm. It's our "basic swing", the foundation for learning in most work on neural networks. In this chapter I explain a suite of techniques which can be used to improve on our vanilla implementation of backpropagation, and so improve the way our networks learn.
The techniques we'll develop in this chapter include: a better choice of cost function, known as the cross-entropy cost function; four so-called "regularization" methods (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data; a better method for initializing the weights in the network; and a set of heuristics to help choose good hyper-parameters for the network. I'll also overview several other techniques in less depth. The discussions are largely independent of one another, and so you may jump ahead if you wish. We'll also implement many of the techniques in running code, and use them to improve the results obtained on the handwriting classification problem studied in Chapter 1.
Of course, we're only covering a few of the many, many techniques which have been developed for use in neural nets. The philosophy is that the best entree to the plethora of available techniques is in-depth study of a few of the most important. Mastering those important techniques is not just useful in its own right, but will also deepen your understanding of what problems can arise when you use neural networks. That will leave you well prepared to quickly pick up other techniques, as you need them.
The cross-entropy cost function

Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn't continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we're decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our errors are less well-defined.
Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let's look at a toy example. The example involves a neuron with just one input:
We'll train this neuron to do something ridiculously easy: take the input $1$ to the output $0$. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let's take a look at how the neuron learns.
To make things definite, I'll pick the initial weight to be $0.6$ and the initial bias to be $0.9$. These are generic choices used as a place to begin learning; I wasn't picking them to be special in any way. The initial output from the neuron is $0.82$, so quite a bit of learning will be needed before our neuron gets near the desired output, $0.0$. Click on "Run" in the bottom right corner below to see how the neuron learns an output much closer to $0.0$. Note that this isn't a pre-recorded animation; your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is $\eta = 0.15$, which turns out to be slow enough that we can follow what's happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, $C$, introduced back in Chapter 1. I'll remind you of the exact form of the cost function shortly, so there's no need to go and dig up the definition. Note that you can run the animation multiple times by clicking on "Run" again.
As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about $0.09$. That's not quite the desired output, $0.0$, but it is pretty good. Suppose, however, that we instead choose both the starting weight and the starting bias to be $2.0$. In this case the initial output is $0.98$, which is very badly wrong. Let's look at how the neuron learns to output $0$ in this case. Click on "Run" again:
Although this example uses the same learning rate ($\eta = 0.15$), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weight and bias don't change much at all. Then the learning kicks in and, much as in our first example, the neuron's output rapidly moves closer to $0.0$.
This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we're badly wrong about something. But we've just seen that our artificial neuron has a lot of difficulty learning when it's badly wrong - far more difficulty than when it's just a little wrong. What's more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?
To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, $\partial C/\partial w$ and $\partial C / \partial b$. So saying "learning is slow" is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let's compute the partial derivatives. Recall that we're using the quadratic cost function, which, from Equation (6)\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}, is given by \begin{eqnarray} C = \frac{(y-a)^2}{2}, \tag{54}\end{eqnarray} where $a$ is the neuron's output when the training input $x = 1$ is used, and $y = 0$ is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that $a = \sigma(z)$, where $z = wx+b$. Using the chain rule to differentiate with respect to the weight and bias we get \begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{55}\\ \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z), \tag{56}\end{eqnarray} where I have substituted $x = 1$ and $y = 0$. To understand the behaviour of these expressions, let's look more closely at the $\sigma'(z)$ term on the right-hand side. Recall the shape of the $\sigma$ function:
We can see from this graph that when the neuron's output is close to $1$, the curve gets very flat, and so $\sigma'(z)$ gets very small. Equations (55)\begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray} and (56)\begin{eqnarray} \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray} then tell us that $\partial C / \partial w$ and $\partial C / \partial b$ get very small. This is the origin of the learning slowdown. What's more, as we shall see a little later, the learning slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we've been playing with.
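If you'd like to see the slowdown numerically, here's a minimal sketch of the toy setup (input $x = 1$, target $y = 0$, quadratic cost, $\eta = 0.15$), using Equations (55) and (56) for the gradient:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, eta = 2.0, 2.0, 0.15   # the "badly wrong" starting point
for epoch in range(301):
    a = sigmoid(w + b)                 # output for input x = 1
    grad = a * a * (1 - a)             # Equations (55)-(56): a * sigma'(z)
    w, b = w - eta * grad, b - eta * grad
    if epoch % 100 == 0:
        print(epoch, round(a, 3))      # the output barely moves at first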
How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let's move a little away from our super-simple toy model. We'll suppose instead that we're trying to train a neuron with several input variables, $x_1, x_2, \ldots$, corresponding weights $w_1, w_2, \ldots$, and a bias, $b$:
The output from the neuron is, of course, $a = \sigma(z)$, where $z = \sum_j w_j x_j+b$ is the weighted sum of the inputs. We define the cross-entropy cost function for this neuron by \begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right], \tag{57}\end{eqnarray} where $n$ is the total number of items of training data, the sum is over all training inputs, $x$, and $y$ is the corresponding desired output. It's not obvious that the expression (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray} fixes the learning slowdown problem. In fact, frankly, it's not even obvious that it makes sense to call this a cost function! Before addressing the learning slowdown, let's see in what sense the cross-entropy can be interpreted as a cost function.
Two properties in particular make it reasonable to interpret the cross-entropy as a cost function. First, it's non-negative, that is, $C \geq 0$. To see this, notice that: (a) all the individual terms in the sum in (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray} are non-positive, since both logarithms are of numbers in the range $0$ to $1$; and (b) there is a minus sign out the front of the sum.
Second, if the neuron's actual output is close to the desired output for all training inputs, $x$, then the cross-entropy will be close to zero* *To prove this I will need to assume that the desired outputs $y$ are all either $0$ or $1$. This is usually the case when solving classification problems, for example, or when computing Boolean functions. To understand what happens when we don't make this assumption, see the exercises at the end of this section.. To see this, suppose for example that $y = 0$ and $a \approx 0$ for some input $x$. This is a case when the neuron is doing a good job on that input. We see that the first term in the expression (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray} for the cost vanishes, since $y = 0$, while the second term is just $-\ln (1-a) \approx 0$. A similar analysis holds when $y = 1$ and $a \approx 1$. And so the contribution to the cost will be low provided the actual output is close to the desired output.
Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, $y$, for all training inputs, $x$. These are both properties we'd intuitively expect for a cost function. Indeed, both properties are also satisfied by the quadratic cost. So that's good news for the cross-entropy. But the cross-entropy cost function has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let's compute the partial derivative of the cross-entropy cost with respect to the weights. We substitute $a = \sigma(z)$ into (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}, and apply the chain rule twice, obtaining: \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left( \frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{58}\\ & = & -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j. \tag{59}\end{eqnarray} Putting everything over a common denominator and simplifying this becomes: \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))} (\sigma(z)-y). \tag{60}\end{eqnarray} Using the definition of the sigmoid function, $\sigma(z) = 1/(1+e^{-z})$, and a little algebra we can show that $\sigma'(z) = \sigma(z)(1-\sigma(z))$. I'll ask you to verify this in an exercise below, but for now let's accept it as given. We see that the $\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$ terms cancel in the equation just above, and it simplifies to become: \begin{eqnarray} \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y). \tag{61}\end{eqnarray} This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by $\sigma(z)-y$, i.e., by the error in the output. The larger the error, the faster the neuron will learn. This is just what we'd intuitively expect. In particular, it avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation (55)\begin{eqnarray} \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray}. When we use the cross-entropy, the $\sigma'(z)$ term gets canceled out, and we no longer need worry about it being small. This cancellation is the special miracle ensured by the cross-entropy cost function. Actually, it's not really a miracle. As we'll see later, the cross-entropy was specially chosen to have just this property.
In a similar way, we can compute the partial derivative for the bias. I won't go through all the details again, but you can easily verify that \begin{eqnarray} \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y). \tag{62}\end{eqnarray} Again, this avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation (56)\begin{eqnarray} \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}.
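Here's a sketch of the cross-entropy cost and the weight gradient of Equation (61) in Numpy (my own helper names, with shapes noted in the comments); the absence of any $\sigma'(z)$ factor is visible directly in the code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(a, y):
    """Equation (57): C = -(1/n) sum_x [y ln a + (1-y) ln(1-a)]."""
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def cross_entropy_grad_w(x, z, y):
    """Equation (61): dC/dw_j = (1/n) sum_x x_j (sigma(z) - y).
    Shapes: x is (num_inputs, n); z and y are (1, n)."""
    return np.mean(x * (sigmoid(z) - y), axis=1, keepdims=True)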
Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight $0.6$ and starting bias $0.9$. Press "Run" to see what happens when we replace the quadratic cost by the cross-entropy:
Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before, with the weight and bias both starting at $2.0$:
Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the initial flat region on the corresponding curve for the quadratic cost. It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.
I didn't say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used $\eta = 0.15$. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it's not possible to say precisely what it means to use the "same" learning rate; it's an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you're still curious, despite my disavowal, here's the lowdown: I used $\eta = 0.005$ in the examples just given.
You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn't about the absolute speed of learning. It's about how the speed of learning changes. In particular, when we use the quadratic cost learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don't depend on how the learning rate is set.
We've been studying the cross-entropy for a single neuron. However, it's easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose $y = y_1, y_2, \ldots$ are the desired values at the output neurons, i.e., the neurons in the final layer, while $a^L_1, a^L_2, \ldots$ are the actual output values. Then we define the cross-entropy by \begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]. \tag{63}\end{eqnarray} This is the same as our earlier expression, Equation (57)\begin{eqnarray} C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}, except now we've got the $\sum_j$ summing over all the output neurons. I won't explicitly work through a derivation, but it should be plausible that using the expression (63)\begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] \nonumber\end{eqnarray} avoids a learning slowdown in many-neuron networks. If you're interested, you can work through the derivation in the problem below.
When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we're setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input - that is, an output neuron will have saturated near $1$, when it should be $0$, or vice versa. If we're using the quadratic cost that will slow down learning. It won't stop learning completely, since the weights will continue learning from other training inputs, but it's obviously undesirable.
The cross-entropy is easy to implement as part of a program which
learns using gradient descent and backpropagation. We'll do that
later in the
chapter, developing an improved version of our
earlier
program for classifying the MNIST handwritten digits,
network.py. The new program is called network2.py, and
incorporates not just the cross-entropy, but also several other
techniques developed in this chapter*
*The code is available
on
GitHub.. For now, let's look at how well our new program
classifies MNIST digits. As was the case in Chapter 1, we'll use a
network with $30$ hidden neurons, and we'll use a mini-batch size of
$10$. We set the learning rate to $\eta = 0.5$*
*In Chapter 1
we used the quadratic cost and a learning rate of $\eta = 3.0$. As
discussed above, it's not possible to say precisely what it means to
use the "same" learning rate when the cost function is changed.
For both cost functions I experimented to find a learning rate that
provides near-optimal performance, given the other hyper-parameter
choices.
There is, incidentally, a very rough
general heuristic for relating the learning rate for the
cross-entropy and the quadratic cost. As we saw earlier, the
gradient terms for the quadratic cost have an extra $\sigma' =
\sigma(1-\sigma)$ term in them. Suppose we average this over values
for $\sigma$, $\int_0^1 d\sigma \sigma(1-\sigma) = 1/6$. We see
that (very roughly) the quadratic cost learns an average of $6$
times slower, for the same learning rate. This suggests that a
reasonable starting point is to divide the learning rate for the
quadratic cost by $6$. Of course, this argument is far from
rigorous, and shouldn't be taken too seriously. Still, it can
sometimes be a useful starting point. and we train for $30$ epochs.
The interface to network2.py is slightly different than
network.py, but it should still be clear what is going on. You
can, by the way, get documentation about network2.py's
interface by using commands such as help(network2.Network.SGD)
in a Python shell.
>>> import mnist_loader
When a golf player is first learning to play golf, they usually spend
most of their time developing a basic swing. Only gradually do they
develop other shots, learning to chip, draw and fade the ball,
building on and modifying their basic swing. In a similar way, up to
now we've focused on understanding the backpropagation algorithm.
It's our "basic swing", the foundation for learning in most work on
neural networks. In this chapter I explain a suite of techniques
which can be used to improve on our vanilla implementation of
backpropagation, and so improve the way our networks learn.
The techniques we'll develop in this chapter include: a better choice
of cost function, known as
the
cross-entropy cost function; four so-called
"regularization"
methods (L1 and L2 regularization, dropout, and artificial
expansion of the training data), which make our networks better at
generalizing beyond the training data; a
better method for
initializing the weights in the network; and a
set
of heuristics to help choose good hyper-parameters for the network.
I'll also overview several other
techniques in less depth. The discussions are largely independent
of one another, and so you may jump ahead if you wish. We'll also
implement
many of the techniques in running code, and use them to improve the
results obtained on the handwriting classification problem studied in
Chapter 1.
Of course, we're only covering a few of the many, many techniques
which have been developed for use in neural nets. The philosophy is
that the best entree to the plethora of available techniques is
in-depth study of a few of the most important. Mastering those
important techniques is not just useful in its own right, but will
also deepen your understanding of what problems can arise when you use
neural networks. That will leave you well prepared to quickly pick up
other techniques, as you need them.
The cross-entropy cost function
Most of us find it unpleasant to be wrong. Soon after beginning to
learn the piano I gave my first performance before an audience. I was
nervous, and began playing the piece an octave too low. I got
confused, and couldn't continue until someone pointed out my error. I
was very embarrassed. Yet while unpleasant, we also learn quickly when
we're decisively wrong. You can bet that the next time I played
before an audience I played in the correct octave! By contrast, we
learn more slowly when our errors are less well-defined.
Ideally, we hope and expect that our neural networks will learn fast
from their errors. Is this what happens in practice? To answer this
question, let's look at a toy example. The example involves a neuron
with just one input:
We'll train this neuron to do something ridiculously easy: take the
input $1$ to the output $0$. Of course, this is such a trivial task
that we could easily figure out an appropriate weight and bias by
hand, without using a learning algorithm. However, it turns out to be
illuminating to use gradient descent to attempt to learn a weight and
bias. So let's take a look at how the neuron learns.
To make things definite, I'll pick the initial weight to be $0.6$ and
the initial bias to be $0.9$. These are generic choices used as a place to begin learning; I wasn't picking them to be special in any
way. The initial output from the neuron is $0.82$, so quite a bit of
learning will be needed before our neuron gets near the desired
output, $0.0$. Click on "Run" in the bottom right corner below to
see how the neuron learns an output much closer to $0.0$. Note that
this isn't a pre-recorded animation; your browser is actually
computing the gradient, then using the gradient to update the weight
and bias, and displaying the result. The learning rate is $\eta =
0.15$, which turns out to be slow enough that we can follow what's
happening, but fast enough that we can get substantial learning in
just a few seconds. The cost is the quadratic cost function, $C$,
introduced back in Chapter 1. I'll remind you of the exact form of
the cost function shortly, so there's no need to go and dig up the
definition. Note that you can run the animation multiple times by
clicking on "Run" again.
As you can see, the neuron rapidly learns a weight and bias that
drives down the cost, and gives an output from the neuron of about
$0.09$. That's not quite the desired output, $0.0$, but it is pretty
good. Suppose, however, that we instead choose both the starting
weight and the starting bias to be $2.0$. In this case the initial
output is $0.98$, which is very badly wrong. Let's look at how the
neuron learns to output $0$ in this case. Click on "Run" again:
Although this example uses the same learning rate ($\eta = 0.15$), we
can see that learning starts out much more slowly. Indeed, for the
first 150 or so learning epochs, the weights and biases don't change
much at all. Then the learning kicks in and, much as in our first
example, the neuron's output rapidly moves closer to $0.0$.
This behaviour is strange when contrasted with human learning. As I
said at the beginning of this section, we often learn fastest when
we're badly wrong about something. But we've just seen that our
artificial neuron has a lot of difficulty learning when it's badly
wrong - far more difficulty than when it's just a little wrong.
What's more, it turns out that this behaviour occurs not just in this
toy model, but in more general networks. Why is learning so slow?
And can we find a way of avoiding this slowdown?
To understand the origin of the problem, consider that our neuron
learns by changing the weight and bias at a rate determined by the
partial derivatives of the cost function, $\partial C/\partial w$ and
$\partial C / \partial b$. So saying "learning is slow" is really
the same as saying that those partial derivatives are small. The
challenge is to understand why they are small. To understand that,
let's compute the partial derivatives. Recall that we're using the
quadratic cost function, which, from
Equation (6), is given by
\begin{eqnarray}
C = \frac{(y-a)^2}{2},
\tag{54}\end{eqnarray}
where $a$ is the neuron's output when the training input $x = 1$ is
used, and $y = 0$ is the corresponding desired output. To write this
more explicitly in terms of the weight and bias, recall that $a =
\sigma(z)$, where $z = wx+b$. Using the chain rule to differentiate
with respect to the weight and bias we get
\begin{eqnarray}
\frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{55}\\
\frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z),
\tag{56}\end{eqnarray}
where I have substituted $x = 1$ and $y = 0$. To understand the
behaviour of these expressions, let's look more closely at the
$\sigma'(z)$ term on the right-hand side. Recall the shape of the
$\sigma$ function:
We can see from this graph that when the neuron's output is close to
$1$, the curve gets very flat, and so $\sigma'(z)$ gets very small.
Equations (55) and (56) then tell us that
$\partial C / \partial w$ and $\partial C / \partial b$ get very
small. This is the origin of the learning slowdown. What's more, as
we shall see a little later, the learning slowdown occurs for
essentially the same reason in more general neural networks, not just
the toy example we've been playing with.
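Incidentally, the demos above are easy to reproduce outside the browser. Here's a minimal Python sketch using the same setup ($x = 1$, $y = 0$, $\eta = 0.15$, quadratic cost); the epoch count and function names are my own choices, not taken from the demo code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_quadratic(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    """Gradient descent on the quadratic cost C = (y-a)^2/2 for a
    one-input sigmoid neuron."""
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        # Equations (55) and (56): the gradients carry a sigma'(z) = a(1-a) factor.
        delta = (a - y) * a * (1 - a)
        w -= eta * delta * x
        b -= eta * delta
    return sigmoid(w * x + b)

print(train_quadratic(0.6, 0.9))  # heads briskly down toward 0
print(train_quadratic(2.0, 2.0))  # still badly wrong: the learning slowdown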
Introducing the cross-entropy cost function
How can we address the learning slowdown? It turns out that we can
solve the problem by replacing the quadratic cost with a different
cost function, known as the cross-entropy. To understand the
cross-entropy, let's move a little away from our super-simple toy
model. We'll suppose instead that we're trying to train a neuron with
several input variables, $x_1, x_2, \ldots$, corresponding weights
$w_1, w_2, \ldots$, and a bias, $b$:
The output from the neuron is, of course, $a = \sigma(z)$, where $z =
\sum_j w_j x_j+b$ is the weighted sum of the inputs. We define the
cross-entropy cost function for this neuron by
\begin{eqnarray}
C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right],
\tag{57}\end{eqnarray}
where $n$ is the total number of items of training data, the sum is
over all training inputs, $x$, and $y$ is the corresponding desired
output.
It's not obvious that the expression (57)
fixes the learning slowdown problem. In fact, frankly, it's not even
obvious that it makes sense to call this a cost function! Before
addressing the learning slowdown, let's see in what sense the
cross-entropy can be interpreted as a cost function.
Two properties in particular make it reasonable to interpret the
cross-entropy as a cost function. First, it's non-negative, that is,
$C > 0$. To see this, notice that: (a) all the individual terms in
the sum in (57) are negative, since both
logarithms are of numbers in the range $0$ to $1$; and (b) there is a
minus sign out the front of the sum.
Second, if the neuron's actual output is close to the desired output
for all training inputs, $x$, then the cross-entropy will be close to
zero*
*To prove this I will need to assume that the desired
outputs $y$ are all either $0$ or $1$. This is usually the case
when solving classification problems, for example, or when computing
Boolean functions. To understand what happens when we don't make
this assumption, see the exercises at the end of this section.. To
see this, suppose for example that $y = 0$ and $a \approx 0$ for some
input $x$. This is a case when the neuron is doing a good job on that
input. We see that the first term in the
expression (57) for the cost vanishes, since
$y = 0$, while the second term is just $-\ln (1-a) \approx 0$. A
similar analysis holds when $y = 1$ and $a \approx 1$. And so the
contribution to the cost will be low provided the actual output is
close to the desired output.
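Both properties are easy to check numerically. Here is Equation (57) as a short NumPy sketch, with two illustrative single-input cases (the function name is mine):

import numpy as np

def cross_entropy_cost(a, y):
    """Equation (57): average cross-entropy over the training inputs.
    a: array of neuron outputs; y: array of desired outputs (0 or 1)."""
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Nearly right (y = 0, a close to 0): cost close to zero...
print(cross_entropy_cost(np.array([0.01]), np.array([0.0])))  # ~0.01
# ...badly wrong (y = 0, a close to 1): cost large, but still positive.
print(cross_entropy_cost(np.array([0.99]), np.array([0.0])))  # ~4.6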
Summing up, the cross-entropy is positive, and tends toward zero as
the neuron gets better at computing the desired output, $y$, for all
training inputs, $x$. These are both properties we'd intuitively
expect for a cost function. Indeed, both properties are also
satisfied by the quadratic cost. So that's good news for the
cross-entropy. But the cross-entropy cost function has the benefit
that, unlike the quadratic cost, it avoids the problem of learning
slowing down. To see this, let's compute the partial derivative of
the cross-entropy cost with respect to the weights. We substitute $a
= \sigma(z)$ into (57), and apply the chain
rule twice, obtaining:
\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left(
\frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)
\frac{\partial \sigma}{\partial w_j} \tag{58}\\
& = & -\frac{1}{n} \sum_x \left(
\frac{y}{\sigma(z)}
-\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j.
\tag{59}\end{eqnarray}
Putting everything over a common denominator and simplifying this
becomes:
\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & \frac{1}{n}
\sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))}
(\sigma(z)-y).
\tag{60}\end{eqnarray}
Using the definition of the sigmoid function, $\sigma(z) =
1/(1+e^{-z})$, and a little algebra we can show that $\sigma'(z) =
\sigma(z)(1-\sigma(z))$. I'll ask you to verify this in an exercise
below, but for now let's accept it as given. We see that the
$\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$ terms cancel in the equation
just above, and it simplifies to become:
\begin{eqnarray}
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y).
\tag{61}\end{eqnarray}
This is a beautiful expression. It tells us that the rate at which
the weight learns is controlled by $\sigma(z)-y$, i.e., by the error
in the output. The larger the error, the faster the neuron will
learn. This is just what we'd intuitively expect. In particular, it
avoids the learning slowdown caused by the $\sigma'(z)$ term in the
analogous equation for the quadratic cost, Equation (55).
When we use the cross-entropy, the $\sigma'(z)$ term gets canceled
out, and we no longer need worry about it being small. This
cancellation is the special miracle ensured by the cross-entropy cost
function. Actually, it's not really a miracle. As we'll see later,
the cross-entropy was specially chosen to have just this property.
In a similar way, we can compute the partial derivative for the bias.
I won't go through all the details again, but you can easily verify
that
\begin{eqnarray}
\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y).
\tag{62}\end{eqnarray}
Again, this avoids the learning slowdown caused by the $\sigma'(z)$
term in the analogous equation for the quadratic cost,
Equation (56).
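To make the contrast concrete, compare the per-example bias gradients from Equations (56) and (62) at the stuck starting point $w = b = 2.0$ of the toy example (a sketch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a, y = sigmoid(2.0 * 1.0 + 2.0), 0.0  # w = b = 2.0 and x = 1, so a = sigmoid(4)
print((a - y) * a * (1 - a))  # quadratic, Equation (56): ~0.017, learning crawls
print(a - y)                  # cross-entropy, Equation (62): ~0.98, a healthy gradient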
Exercise
Verify that $\sigma'(z) = \sigma(z)(1-\sigma(z))$.
Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight $0.6$ and starting bias $0.9$. Press "Run" to see what happens when we replace the quadratic cost by the cross-entropy:
Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before, with the weight and bias both starting at $2.0$:
Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the flat region at the start of the corresponding curve for the quadratic cost. It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.
I didn't say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used $\eta = 0.15$. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it's not possible to say precisely what it means to use the "same" learning rate; it's an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you're still curious, despite my disavowal, here's the lowdown: I used $\eta = 0.005$ in the examples just given.
You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn't about the absolute speed of learning. It's about how the speed of learning changes. In particular, when we use the quadratic cost learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don't depend on how the learning rate is set.
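Here's the cross-entropy version of the earlier training sketch, using the $\eta = 0.005$ just mentioned; the epoch count is again an arbitrary choice:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_cross_entropy(w, b, eta=0.005, epochs=1000, x=1.0, y=0.0):
    """Gradient descent on the cross-entropy for the one-input neuron.
    Returns the cost at each epoch, so the shape of the curve is visible."""
    costs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        costs.append(-(y * np.log(a) + (1 - y) * np.log(1 - a)))
        # Equations (61) and (62): no sigma'(z) factor, hence no initial plateau.
        w -= eta * (a - y) * x
        b -= eta * (a - y)
    return costs

costs = train_cross_entropy(2.0, 2.0)
print(costs[0], costs[100], costs[200])  # the curve falls steadily from the very start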
We've been studying the cross-entropy for a single neuron. However, it's easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose $y = y_1, y_2, \ldots$ are the desired values at the output neurons, i.e., the neurons in the final layer, while $a^L_1, a^L_2, \ldots$ are the actual output values. Then we define the cross-entropy by \begin{eqnarray} C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]. \tag{63}\end{eqnarray} This is the same as our earlier expression, Equation (57), except now we've got the $\sum_j$ summing over all the output neurons. I won't explicitly work through a derivation, but it should be plausible that using the expression (63) avoids a learning slowdown in many-neuron networks. If you're interested, you can work through the derivation in the problem below.
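In code, the per-example piece of Equation (63) might be packaged like the following sketch; the network2.py program we'll meet shortly does something along these lines. The np.nan_to_num call turns the nan produced by $0 \ln 0$ into its correct limiting value, $0$:

import numpy as np

class CrossEntropyCost:
    @staticmethod
    def fn(a, y):
        """Equation (63) for a single training example; a and y are
        vectors of output activations and desired outputs."""
        return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))

    @staticmethod
    def delta(z, a, y):
        """The output error; the sigma'(z) factor has cancelled."""
        return a - y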
When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we're setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input - that is, an output neuron will have saturated near $1$, when it should be $0$, or vice versa. If we're using the quadratic cost that will slow down learning. It won't stop learning completely, since the weights will continue learning from other training inputs, but it's obviously undesirable.
The cross-entropy is easy to implement as part of a program which
learns using gradient descent and backpropagation. We'll do that
later in the
chapter, developing an improved version of our
earlier
program for classifying the MNIST handwritten digits,
network.py. The new program is called network2.py, and
incorporates not just the cross-entropy, but also several other
techniques developed in this chapter*
*The code is available
on
GitHub.. For now, let's look at how well our new program
classifies MNIST digits. As was the case in Chapter 1, we'll use a
network with $30$ hidden neurons, and we'll use a mini-batch size of
$10$. We set the learning rate to $\eta = 0.5$*
*In Chapter 1
we used the quadratic cost and a learning rate of $\eta = 3.0$. As
discussed above, it's not possible to say precisely what it means to
use the "same" learning rate when the cost function is changed.
For both cost functions I experimented to find a learning rate that
provides near-optimal performance, given the other hyper-parameter
choices.
There is, incidentally, a very rough
general heuristic for relating the learning rate for the
cross-entropy and the quadratic cost. As we saw earlier, the
gradient terms for the quadratic cost have an extra $\sigma' =
\sigma(1-\sigma)$ term in them. Suppose we average this over values
for $\sigma$, $\int_0^1 d\sigma \sigma(1-\sigma) = 1/6$. We see
that (very roughly) the quadratic cost learns an average of $6$
times slower, for the same learning rate. This suggests that a
reasonable starting point is to divide the learning rate for the
quadratic cost by $6$. Of course, this argument is far from
rigorous, and shouldn't be taken too seriously. Still, it can
sometimes be a useful starting point. and we train for $30$ epochs.
The interface to network2.py is slightly different than
network.py, but it should still be clear what is going on. You
can, by the way, get documentation about network2.py's
interface by using commands such as help(network2.Network.SGD)
in a Python shell.
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True)
Note, by the way, that the net.large_weight_initializer() command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we'll change the default weight initialization in our networks. The result from running the above sequence of commands is a network with $95.49$ percent accuracy. This is pretty close to the result we obtained in Chapter 1, $95.42$ percent, using the quadratic cost.
Let's look also at the case where we use $100$ hidden neurons, the cross-entropy, and otherwise keep the parameters the same. In this case we obtain an accuracy of $96.82$ percent. That's a substantial improvement over the results from Chapter 1, where we obtained a classification accuracy of $96.59$ percent, using the quadratic cost. That may look like a small change, but consider that the error rate has dropped from $3.41$ percent to $3.18$ percent. That is, we've eliminated about one in fourteen of the original errors. That's quite a handy improvement.
It's encouraging that the cross-entropy cost gives us similar or better results than the quadratic cost. However, these results don't conclusively prove that the cross-entropy is a better choice. The reason is that I've put only a little effort into choosing hyper-parameters such as learning rate, mini-batch size, and so on. For the improvement to be really convincing we'd need to do a thorough job optimizing such hyper-parameters. Still, the results are encouraging, and reinforce our earlier theoretical argument that the cross-entropy is a better choice than the quadratic cost.
This, by the way, is part of a general pattern that we'll see through this chapter and, indeed, through much of the rest of the book. We'll develop a new technique, we'll try it out, and we'll get "improved" results. It is, of course, nice that we see such improvements. But the interpretation of such improvements is always problematic. They're only truly convincing if we see an improvement after putting tremendous effort into optimizing all the other hyper-parameters. That's a great deal of work, requiring lots of computing power, and we're not usually going to do such an exhaustive investigation. Instead, we'll proceed on the basis of informal tests like those done above. Still, you should keep in mind that such tests fall short of definitive proof, and remain alert to signs that the arguments are breaking down.
By now, we've discussed the cross-entropy at great length. Why go to so much effort when it gives only a small improvement to our MNIST results? Later in the chapter we'll see other techniques - notably, regularization - which give much bigger improvements. So why so much focus on cross-entropy? Part of the reason is that the cross-entropy is a widely-used cost function, and so is worth understanding well. But the more important reason is that neuron saturation is an important problem in neural nets, a problem we'll return to repeatedly throughout the book. And so I've discussed the cross-entropy at length because it's a good laboratory to begin understanding neuron saturation and how it may be addressed.
What does the cross-entropy mean? Where does it come from?
Our discussion of the cross-entropy has focused on algebraic analysis and practical implementation. That's useful, but it leaves unanswered broader conceptual questions, like: what does the cross-entropy mean? Is there some intuitive way of thinking about the cross-entropy? And how could we have dreamed up the cross-entropy in the first place?
Let's begin with the last of these questions: what could have motivated us to think up the cross-entropy in the first place? Suppose we'd discovered the learning slowdown described earlier, and understood that the origin was the $\sigma'(z)$ terms in Equations (55) and (56). After staring at those equations for a bit, we might wonder if it's possible to choose a cost function so that the $\sigma'(z)$ term disappeared. In that case, the cost $C = C_x$ for a single training example $x$ would satisfy \begin{eqnarray} \frac{\partial C}{\partial w_j} & = & x_j(a-y) \tag{71}\\ \frac{\partial C}{\partial b } & = & (a-y). \tag{72}\end{eqnarray} If we could choose the cost function to make these equations true, then they would capture in a simple way the intuition that the greater the initial error, the faster the neuron learns. They'd also eliminate the problem of a learning slowdown. In fact, starting from these equations we'll now show that it's possible to derive the form of the cross-entropy, simply by following our mathematical noses. To see this, note that from the chain rule we have \begin{eqnarray} \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \sigma'(z). \tag{73}\end{eqnarray} Using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$ the last equation becomes \begin{eqnarray} \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} a(1-a). \tag{74}\end{eqnarray} Comparing to Equation (72) we obtain \begin{eqnarray} \frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}. \tag{75}\end{eqnarray} Integrating this expression with respect to $a$ gives \begin{eqnarray} C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant}, \tag{76}\end{eqnarray} for some constant of integration. This is the contribution to the cost from a single training example, $x$. To get the full cost function we must average over training examples, obtaining \begin{eqnarray} C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant}, \tag{77}\end{eqnarray} where the constant here is the average of the individual constants for each training example. And so we see that Equations (71) and (72) uniquely determine the form of the cross-entropy, up to an overall constant term. The cross-entropy isn't something that was miraculously pulled out of thin air. Rather, it's something that we could have discovered in a simple and natural way.
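If you'd like to double-check the integration step, a quick symbolic computation (using sympy, and dropping the constant) confirms that the cost (76) has derivative (75):

import sympy as sp

a, y = sp.symbols('a y')
C = -(y * sp.log(a) + (1 - y) * sp.log(1 - a))  # Equation (76), constant dropped
assert sp.simplify(sp.diff(C, a) - (a - y) / (a * (1 - a))) == 0  # Equation (75)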
What about the intuitive meaning of the cross-entropy? How should we think about it? Explaining this in depth would take us further afield than I want to go. However, it is worth mentioning that there is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. In particular, our neuron is trying to compute the function $x \rightarrow y = y(x)$. But instead it computes the function $x \rightarrow a = a(x)$. Suppose we think of $a$ as our neuron's estimated probability that $y$ is $1$, and $1-a$ is the estimated probability that the right value for $y$ is $0$. Then the cross-entropy measures how "surprised" we are, on average, when we learn the true value for $y$. We get low surprise if the output is what we expect, and high surprise if the output is unexpected. Of course, I haven't said exactly what "surprise" means, and so this perhaps seems like empty verbiage. But in fact there is a precise information-theoretic way of saying what is meant by surprise. Unfortunately, I don't know of a good, short, self-contained discussion of this subject that's available online. But if you want to dig deeper, then Wikipedia contains a brief summary that will get you started down the right track. And the details can be filled in by working through the materials about the Kraft inequality in chapter 5 of the book about information theory by Cover and Thomas.
Softmax
In this chapter we'll mostly use the cross-entropy cost to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. We're not actually going to use softmax layers in the remainder of the chapter, so if you're in a great hurry, you can skip to the next section. However, softmax is still worth understanding, in part because it's intrinsically interesting, and in part because we'll use softmax layers in Chapter 6, in our discussion of deep neural networks.
The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs* *In describing the softmax we'll make frequent use of notation introduced in the last chapter. You may wish to revisit that chapter if you need to refresh your memory about the meaning of the notation. $z^L_j = \sum_{k} w^L_{jk} a^{L-1}_k + b^L_j$. However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the $z^L_j$. According to this function, the activation $a^L_j$ of the $j$th output neuron is \begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{78}\end{eqnarray} where in the denominator we sum over all the output neurons.
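In code, the softmax function is just a few lines of NumPy. In the sketch below, subtracting max(z) before exponentiating is a standard numerical-stability trick of my own adding, not part of Equation (78); it multiplies numerator and denominator by the same factor, so the output is unchanged:

import numpy as np

def softmax(z):
    """Equation (78): exponentiate the weighted inputs, then normalize."""
    e = np.exp(z - np.max(z))  # stability shift; cancels in the ratio
    return e / np.sum(e)

a = softmax(np.array([1.0, 2.0, 3.0, 4.0]))
print(a)        # all positive; the largest z gets the largest activation
print(a.sum())  # 1.0, as Equation (79) guarantees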
If you're not familiar with the softmax function, Equation (78) may look pretty opaque. It's certainly not obvious why we'd want to use this function. And it's also not obvious that this will help us address the learning slowdown problem. To better understand Equation (78), suppose we have a network with four output neurons, and four corresponding weighted inputs, which we'll denote $z^L_1, z^L_2, z^L_3$, and $z^L_4$. Shown below are adjustable sliders showing possible values for the weighted inputs, and a graph of the corresponding output activations. A good place to start exploration is by using the bottom slider to increase $z^L_4$:
[Interactive figure: sliders for the weighted inputs $z^L_1, \ldots, z^L_4$, with bar graphs of the corresponding output activations $a^L_1, \ldots, a^L_4$.]
As you increase $z^L_4$, you'll see an increase in the corresponding output activation, $a^L_4$, and a decrease in the other output activations. Similarly, if you decrease $z^L_4$ then $a^L_4$ will decrease, and all the other output activations will increase. In fact, if you look closely, you'll see that in both cases the total change in the other activations exactly compensates for the change in $a^L_4$. The reason is that the output activations are guaranteed to always sum up to $1$, as we can prove using Equation (78) and a little algebra: \begin{eqnarray} \sum_j a^L_j & = & \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \tag{79}\end{eqnarray} As a result, if $a^L_4$ increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains $1$. And, of course, similar statements hold for all the other activations.
Equation (78) also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers which sum up to $1$. In other words, the output from the softmax layer can be thought of as a probability distribution.
The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation $a^L_j$ as the network's estimate of the probability that the correct output is $j$. So, for instance, in the MNIST classification problem, we can interpret $a^L_j$ as the network's estimated probability that the correct digit classification is $j$.
By contrast, if the output layer was a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible that the activations from a sigmoid layer won't in general form a probability distribution. And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.
We're starting to build up some feel for the softmax function and the way softmax layers behave. Just to review where we're at: the exponentials in Equation (78) ensure that all the output activations are positive. And the sum in the denominator of Equation (78) ensures that the softmax outputs sum to $1$. So that particular form no longer appears so mysterious: rather, it is a natural way to ensure that the output activations form a probability distribution. You can think of softmax as a way of rescaling the $z^L_j$, and then squishing them together to form a probability distribution.
The learning slowdown problem: We've now built up considerable familiarity with softmax layers of neurons. But we haven't yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let's define the log-likelihood cost function. We'll use $x$ to denote a training input to the network, and $y$ to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is \begin{eqnarray} C \equiv -\ln a^L_y. \tag{80}\end{eqnarray} So, for instance, if we're training with MNIST images, and input an image of a $7$, then the log-likelihood cost is $-\ln a^L_7$. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a $7$. In that case it will estimate a value for the corresponding probability $a^L_7$ which is close to $1$, and so the cost $-\ln a^L_7$ will be small. By contrast, when the network isn't doing such a good job, the probability $a^L_7$ will be smaller, and the cost $-\ln a^L_7$ will be larger. So the log-likelihood cost behaves as we'd expect a cost function to behave.
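Numerically, Equation (80) behaves just as described. A tiny sketch, with a hypothetical softmax output:

import numpy as np

a = np.array([0.02, 0.03, 0.90, 0.05])  # hypothetical softmax output, four classes
print(-np.log(a[2]))  # ~0.11: confident and correct, so the cost is small
print(-np.log(a[0]))  # ~3.9: if class 0 were actually correct, the cost is large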
What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities $\partial C / \partial w^L_{jk}$ and $\partial C / \partial b^L_j$. I won't go through the derivation explicitly - I'll ask you to do it in the problems, below - but with a little algebra you can show that* *Note that I'm abusing notation here, using $y$ in a slightly different way than in the last paragraph. In the last paragraph we used $y$ to denote the desired output from the network - e.g., output a "$7$" if an image of a $7$ was input. But in the equations which follow I'm using $y$ to denote the vector of output activations which corresponds to $7$, that is, a vector which is all $0$s, except for a $1$ in the $7$th location. \begin{eqnarray} \frac{\partial C}{\partial b^L_j} & = & a^L_j-y_j \tag{81}\\ \frac{\partial C}{\partial w^L_{jk}} & = & a^{L-1}_k (a^L_j-y_j). \tag{82}\end{eqnarray} These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67), $\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j)$. It's the same equation, albeit in the latter I've averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
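Since $z^L_j$ enters only through $b^L_j$ (holding the weights and earlier activations fixed), Equation (81) can be checked by finite differences. A sketch, with illustrative values:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.0, 2.0, 0.5, 3.0])  # illustrative weighted inputs
correct = 3                         # suppose class 3 is the right answer
a = softmax(z)
y = np.eye(len(z))[correct]         # one-hot desired output
cost = -np.log(a[correct])          # Equation (80)

eps = 1e-6
numeric = np.array([
    (-np.log(softmax(z + eps * np.eye(len(z))[j])[correct]) - cost) / eps
    for j in range(len(z))])
print(numeric)  # matches...
print(a - y)    # ...Equation (81): dC/db_j = a_j - y_j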
Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we'll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we'll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That's not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.
Overfitting and regularization
The Nobel prizewinning physicist Enrico Fermi was once asked his opinion of a mathematical model some colleagues had proposed as the solution to an important unsolved physics problem. The model gave excellent agreement with experiment, but Fermi was skeptical. He asked how many free parameters could be set in the model. "Four" was the answer. Fermi replied* *The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model. A four-parameter elephant may be found here.: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that doesn't make it a good model. It may just mean there's enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn't been exposed to before.
Fermi and von Neumann were suspicious of models with four parameters. Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters! That's a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters. Should we trust the results?
Let's sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We'll use our 30 hidden neuron network, with its 23,860 parameters. But we won't train the network using all 50,000 MNIST training images. Instead, we'll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We'll train in a similar way to before, using the cross-entropy cost function, with a learning rate of $\eta = 0.5$ and a mini-batch size of $10$. However, we'll train for 400 epochs, a somewhat larger number than before, because we're not using as many training examples. Let's use network2 to look at the way the cost function changes:
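The snippet itself isn't reproduced here, but a session along the following lines matches the parameters just described (the monitor_training_cost flag is an assumption about network2.py's monitoring interface):
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5,
... evaluation_data=test_data, monitor_evaluation_accuracy=True,
... monitor_training_cost=True)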
+Note, by the way, that the net.large_weight_initializer()
command is used to initialize the weights and biases in the same way
as described in Chapter 1. We need to run this command because later
in this chapter we'll change the default weight initialization in our
networks. The result from running the above sequence of commands is a
network with $95.49$ percent accuracy. This is pretty close to the
result we obtained in Chapter 1, $95.42$ percent, using the quadratic
cost.
Let's look also at the case where we use $100$ hidden neurons, the
cross-entropy, and otherwise keep the parameters the same. In this
case we obtain an accuracy of $96.82$ percent. That's a substantial
improvement over the results from Chapter 1, where we obtained a
classification accuracy of $96.59$ percent, using the quadratic cost.
That may look like a small change, but consider that the error rate
has dropped from $3.41$ percent to $3.18$ percent. That is, we've
eliminated about one in fourteen of the original errors. That's quite
a handy improvement.
It's encouraging that the cross-entropy cost gives us similar or
better results than the quadratic cost. However, these results don't
conclusively prove that the cross-entropy is a better choice. The
reason is that I've put only a little effort into choosing
hyper-parameters such as learning rate, mini-batch size, and so on.
For the improvement to be really convincing we'd need to do a thorough
job optimizing such hyper-parameters. Still, the results are
encouraging, and reinforce our earlier theoretical argument that the
cross-entropy is a better choice than the quadratic cost.
This, by the way, is part of a general pattern that we'll see through
this chapter and, indeed, through much of the rest of the book. We'll
develop a new technique, we'll try it out, and we'll get "improved"
results. It is, of course, nice that we see such improvements. But
the interpretation of such improvements is always problematic.
They're only truly convincing if we see an improvement after putting
tremendous effort into optimizing all the other hyper-parameters.
That's a great deal of work, requiring lots of computing power, and
we're not usually going to do such an exhaustive investigation.
Instead, we'll proceed on the basis of informal tests like those done
above. Still, you should keep in mind that such tests fall short of
definitive proof, and remain alert to signs that the arguments are
breaking down.
By now, we've discussed the cross-entropy at great length. Why go to
so much effort when it gives only a small improvement to our MNIST
results? Later in the chapter we'll see other techniques - notably,
regularization - which
give much bigger improvements. So why so much focus on cross-entropy?
Part of the reason is that the cross-entropy is a widely-used cost
function, and so is worth understanding well. But the more important
reason is that neuron saturation is an important problem in neural
nets, a problem we'll return to repeatedly throughout the book. And
so I've discussed the cross-entropy at length because it's a good
laboratory to begin understanding neuron saturation and how it may be
addressed.
What does the cross-entropy mean? Where does it come from?
Our discussion of the cross-entropy has focused on algebraic analysis
and practical implementation. That's useful, but it leaves unanswered
broader conceptual questions, like: what does the cross-entropy mean?
Is there some intuitive way of thinking about the cross-entropy? And
how could we have dreamed up the cross-entropy in the first place?
Let's begin with the last of these questions: what could have
motivated us to think up the cross-entropy in the first place?
Suppose we'd discovered the learning slowdown described earlier, and
understood that the origin was the $\sigma'(z)$ terms in
Equations (55)\begin{eqnarray}
\frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \nonumber\end{eqnarray} and (56)\begin{eqnarray}
\frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}. After staring at
those equations for a bit, we might wonder if it's possible to choose
a cost function so that the $\sigma'(z)$ term disappeared. In that
case, the cost $C = C_x$ for a single training example $x$ would
satisfy
\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & x_j(a-y) \tag{71}\\
\frac{\partial C}{\partial b } & = & (a-y).
\tag{72}\end{eqnarray}
If we could choose the cost function to make these equations true,
then they would capture in a simple way the intuition that the greater
the initial error, the faster the neuron learns. They'd also
eliminate the problem of a learning slowdown. In fact, starting from
these equations we'll now show that it's possible to derive the form
of the cross-entropy, simply by following our mathematical noses. To
see this, note that from the chain rule we have
\begin{eqnarray}
\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}
\sigma'(z).
\tag{73}\end{eqnarray}
Using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$ the last equation
becomes
\begin{eqnarray}
\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}
a(1-a).
\tag{74}\end{eqnarray}
Comparing to Equation (72)\begin{eqnarray}
\frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray} we obtain
\begin{eqnarray}
\frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}.
\tag{75}\end{eqnarray}
Integrating this expression with respect to $a$ gives
\begin{eqnarray}
C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant},
\tag{76}\end{eqnarray}
for some constant of integration. This is the contribution to the
cost from a single training example, $x$. To get the full cost
function we must average over training examples, obtaining
\begin{eqnarray}
C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant},
\tag{77}\end{eqnarray}
where the constant here is the average of the individual constants for
each training example. And so we see that
Equations (71)\begin{eqnarray}
\frac{\partial C}{\partial w_j} & = & x_j(a-y) \nonumber\end{eqnarray}
and (72)\begin{eqnarray}
\frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray} uniquely determine the form
of the cross-entropy, up to an overall constant term. The
cross-entropy isn't something that was miraculously pulled out of thin
air. Rather, it's something that we could have discovered in a simple
and natural way.
What about the intuitive meaning of the cross-entropy? How should we
think about it? Explaining this in depth would take us further afield
than I want to go. However, it is worth mentioning that there is a
standard way of interpreting the cross-entropy that comes from the
field of information theory. Roughly speaking, the idea is that the
cross-entropy is a measure of surprise. In particular, our neuron is
trying to compute the function $x \rightarrow y = y(x)$. But instead
it computes the function $x \rightarrow a = a(x)$. Suppose we think
of $a$ as our neuron's estimated probability that $y$ is $1$, and
$1-a$ is the estimated probability that the right value for $y$ is
$0$. Then the cross-entropy measures how "surprised" we are, on
average, when we learn the true value for $y$. We get low surprise if
the output is what we expect, and high surprise if the output is
unexpected. Of course, I haven't said exactly what "surprise"
means, and so this perhaps seems like empty verbiage. But in fact
there is a precise information-theoretic way of saying what is meant
by surprise. Unfortunately, I don't know of a good, short,
self-contained discussion of this subject that's available online.
But if you want to dig deeper, then Wikipedia contains a
brief
summary that will get you started down the right track. And the
details can be filled in by working through the materials about the
Kraft inequality in chapter 5 of the book about information theory by
Cover and Thomas.
Problem
In this chapter we'll mostly use the cross-entropy cost to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. We're not actually going to use softmax layers in the remainder of the chapter, so if you're in a great hurry, you can skip to the next section. However, softmax is still worth understanding, in part because it's intrinsically interesting, and in part because we'll use softmax layers in Chapter 6, in our discussion of deep neural networks.
The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs* *In describing the softmax we'll make frequent use of notation introduced in the last chapter. You may wish to revisit that chapter if you need to refresh your memory about the meaning of the notation. $z^L_j = \sum_{k} w^L_{jk} a^{L-1}_k + b^L_j$. However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the $z^L_j$. According to this function, the activation $a^L_j$ of the $j$th output neuron is \begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{78}\end{eqnarray} where in the denominator we sum over all the output neurons.
If you're not familiar with the softmax function, Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} may look pretty opaque. It's certainly not obvious why we'd want to use this function. And it's also not obvious that this will help us address the learning slowdown problem. To better understand Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray}, suppose we have a network with four output neurons, and four corresponding weighted inputs, which we'll denote $z^L_1, z^L_2, z^L_3$, and $z^L_4$. Shown below are adjustable sliders showing possible values for the weighted inputs, and a graph of the corresponding output activations. A good place to start exploration is by using the bottom slider to increase $z^L_4$:
$z^L_1 = $ |
$a^L_1 = $
|
$z^L_2$ = |
$a^L_2 = $
|
$z^L_3$ = |
$a^L_3 = $
|
$z^L_4$ = |
$a^L_4 = $
|
As you increase $z^L_4$, you'll see an increase in the corresponding output activation, $a^L_4$, and a decrease in the other output activations. Similarly, if you decrease $z^L_4$ then $a^L_4$ will decrease, and all the other output activations will increase. In fact, if you look closely, you'll see that in both cases the total change in the other activations exactly compensates for the change in $a^L_4$. The reason is that the output activations are guaranteed to always sum up to $1$, as we can prove using Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} and a little algebra: \begin{eqnarray} \sum_j a^L_j & = & \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \tag{79}\end{eqnarray} As a result, if $a^L_4$ increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains $1$. And, of course, similar statements hold for all the other activations.
Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers which sum up to $1$. In other words, the output from the softmax layer can be thought of as a probability distribution.
The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation $a^L_j$ as the network's estimate of the probability that the correct output is $j$. So, for instance, in the MNIST classification problem, we can interpret $a^L_j$ as the network's estimated probability that the correct digit classification is $j$.
By contrast, if the output layer was a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible that the activations from a sigmoid layer won't in general form a probability distribution. And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.
We're starting to build up some feel for the softmax function and the way softmax layers behave. Just to review where we're at: the exponentials in Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} ensure that all the output activations are positive. And the sum in the denominator of Equation (78)\begin{eqnarray} a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \nonumber\end{eqnarray} ensures that the softmax outputs sum to $1$. So that particular form no longer appears so mysterious: rather, it is a natural way to ensure that the output activations form a probability distribution. You can think of softmax as a way of rescaling the $z^L_j$, and then squishing them together to form a probability distribution.
The learning slowdown problem: We've now built up considerable familiarity with softmax layers of neurons. But we haven't yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let's define the log-likelihood cost function. We'll use $x$ to denote a training input to the network, and $y$ to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is \begin{eqnarray} C \equiv -\ln a^L_y. \tag{80}\end{eqnarray} So, for instance, if we're training with MNIST images, and input an image of a $7$, then the log-likelihood cost is $-\ln a^L_7$. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a $7$. In that case it will estimate a value for the corresponding probability $a^L_7$ which is close to $1$, and so the cost $-\ln a^L_7$ will be small. By contrast, when the network isn't doing such a good job, the probability $a^L_7$ will be smaller, and the cost $-\ln a^L_7$ will be larger. So the log-likelihood cost behaves as we'd expect a cost function to behave.
What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities $\partial C / \partial w^L_{jk}$ and $\partial C / \partial b^L_j$. I won't go through the derivation explicitly - I'll ask you to do it in the problems, below - but with a little algebra you can show that* *Note that I'm abusing notation here, using $y$ in a slightly different way to the last paragraph. In the last paragraph we used $y$ to denote the desired output from the network - e.g., output a "$7$" if an image of a $7$ was input. But in the equations which follow I'm using $y$ to denote the vector of output activations which corresponds to $7$, that is, a vector which is all $0$s, except for a $1$ in the $7$th location. \begin{eqnarray} \frac{\partial C}{\partial b^L_j} & = & a^L_j-y_j \tag{81}\\ \frac{\partial C}{\partial w^L_{jk}} & = & a^{L-1}_k (a^L_j-y_j) \tag{82}\end{eqnarray} These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67), $\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j-y_j)$. It's the same equation, albeit in the latter I've averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
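Before doing the derivation yourself, here's a quick empirical check of these expressions, again not part of network2.py. It compares the analytic gradient $a^L_j - y_j$ of Equation (81) (note that $\partial C / \partial b^L_j = \partial C / \partial z^L_j$, since $z^L_j$ depends on $b^L_j$ with coefficient $1$) against a central finite-difference estimate:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loglikelihood_cost(z, y_index):
    """The log-likelihood cost of Equation (80), -ln a^L_y."""
    return -np.log(softmax(z)[y_index])

z = np.array([0.5, -1.2, 2.0, 0.3])
y_index = 2                      # the "correct" output neuron
y = np.eye(len(z))[y_index]      # one-hot version of the label

# Equation (81): dC/db^L_j (equivalently dC/dz^L_j) equals a^L_j - y_j.
analytic = softmax(z) - y

# An independent check by central finite differences.
eps = 1e-6
numeric = np.array(
    [(loglikelihood_cost(z + eps * np.eye(len(z))[j], y_index) -
      loglikelihood_cost(z - eps * np.eye(len(z))[j], y_index)) / (2 * eps)
     for j in range(len(z))])

print(np.allclose(analytic, numeric))   # True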
Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we'll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we'll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That's not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.
The Nobel prizewinning physicist Enrico Fermi was once asked his opinion of a mathematical model some colleagues had proposed as the solution to an important unsolved physics problem. The model gave excellent agreement with experiment, but Fermi was skeptical. He asked how many free parameters could be set in the model. "Four" was the answer. Fermi replied* *The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model. A four-parameter elephant may be found here.: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that doesn't make it a good model. It may just mean there's enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn't been exposed to before.
Fermi and von Neumann were suspicious of models with four parameters. Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters! That's a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters. Should we trust the results?
Let's sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We'll use our 30 hidden neuron network, with its 23,860 parameters. But we won't train the network using all 50,000 MNIST training images. Instead, we'll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We'll train in a similar way to before, using the cross-entropy cost function, with a learning rate of $\eta = 0.5$ and a mini-batch size of $10$. However, we'll train for 400 epochs, a somewhat larger number than before, because we're not using as many training examples. Let's use network2 to look at the way the cost function changes:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True, monitor_training_cost=True)
Up to now we've been using the training_data and
test_data, and ignoring the validation_data. The
validation_data contains $10,000$ images of digits, images
which are different from the $50,000$ images in the MNIST training
set, and the $10,000$ images in the MNIST test set. Instead of using
the test_data to prevent overfitting, we will use the
validation_data. To do this, we'll use much the same strategy
as was described above for the test_data. That is, we'll
compute the classification accuracy on the validation_data at
the end of each epoch. Once the classification accuracy on the
validation_data has saturated, we stop training. This strategy
is called early stopping. Of course, in practice we won't
immediately know when the accuracy has saturated. Instead, we
continue training until we're confident that the accuracy has
saturated*
*It requires some judgment to determine when to
stop. In my earlier graphs I identified epoch 280 as the place at
which accuracy saturated. It's possible that was too pessimistic.
Neural networks sometimes plateau for a while in training, before
continuing to improve. I wouldn't be surprised if more learning
could have occurred even after epoch 400, although the magnitude of
any further improvement would likely be small. So it's possible to
adopt more or less aggressive strategies for early stopping.
Why use the validation_data to prevent overfitting, rather than
the test_data? In fact, this is part of a more general
strategy, which is to use the validation_data to evaluate
different trial choices of hyper-parameters such as the number of
epochs to train for, the learning rate, the best network architecture,
and so on. We use such evaluations to find and set good values for
the hyper-parameters. Indeed, although I haven't mentioned it until
now, that is, in part, how I arrived at the hyper-parameter choices
made earlier in this book. (More on this
later.)
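To make the early-stopping strategy concrete, here's a minimal sketch of such a training loop. The train_one_epoch method is a hypothetical stand-in for one epoch of net.SGD, not something network2.py provides; the accuracy method, which returns the number of correctly classified images, is the one network2.py actually has:

def sgd_with_early_stopping(net, training_data, validation_data,
                            patience=10, max_epochs=400):
    """Train until validation accuracy hasn't improved for `patience`
    consecutive epochs."""
    best, epochs_since_best = 0, 0
    for epoch in range(max_epochs):
        net.train_one_epoch(training_data)    # hypothetical helper
        acc = net.accuracy(validation_data)   # as in network2.py
        if acc > best:
            best, epochs_since_best = acc, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:     # accuracy has saturated
            print("Stopping early at epoch {0}".format(epoch))
            break
    return best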
Of course, that doesn't in any way answer the question of why we're
using the validation_data to prevent overfitting, rather than
the test_data. Instead, it replaces it with a more general
question, which is why we're using the validation_data rather
than the test_data to set good hyper-parameters? To understand
why, consider that when setting hyper-parameters we're likely to try
many different choices for the hyper-parameters. If we set the
hyper-parameters based on evaluations of the test_data it's
possible we'll end up overfitting our hyper-parameters to the
test_data. That is, we may end up finding hyper-parameters
which fit particular peculiarities of the test_data, but where
the performance of the network won't generalize to other data sets.
We guard against that by figuring out the hyper-parameters using the
validation_data. Then, once we've got the hyper-parameters we
want, we do a final evaluation of accuracy using the test_data.
That gives us confidence that our results on the test_data are
a true measure of how well our neural network generalizes. To put it
another way, you can think of the validation data as a type of
training data that helps us learn good hyper-parameters. This
approach to finding good hyper-parameters is sometimes known as the
hold out method, since the validation_data is kept apart
or "held out" from the training_data.
Now, in practice, even after evaluating performance on the
test_data we may change our minds and want to try another
approach - perhaps a different network architecture - which will
involve finding a new set of hyper-parameters. If we do this, isn't
there a danger we'll end up overfitting to the test_data as
well? Do we need a potentially infinite regress of data sets, so we
can be confident our results will generalize? Addressing this concern
fully is a deep and difficult problem. But for our practical
purposes, we're not going to worry too much about this question.
Instead, we'll plunge ahead, using the basic hold out method, based on
the training_data, validation_data, and
test_data, as described above.
We've been looking so far at overfitting when we're just using 1,000
training images. What happens when we use the full training set of
50,000 images? We'll keep all the other parameters the same (30
hidden neurons, learning rate 0.5, mini-batch size of 10), but train
using all 50,000 images for 30 epochs. Here's a graph showing the
results for the classification accuracy on both the training data and
the test data. Note that I've used the test data here, rather than
the validation data, in order to make the results more directly
comparable with the earlier graphs.
As you can see, the accuracy on the test and training data remain much
closer together than when we were using 1,000 training examples. In
particular, the best classification accuracy of $97.86$ percent on the
training data is only $2.53$ percent higher than the $95.33$ percent
on the test data. That's compared to the $17.73$ percent gap we had
earlier! Overfitting is still going on, but it's been greatly
reduced. Our network is generalizing much better from the training
data to the test data. In general, one of the best ways of reducing
overfitting is to increase the size of the training data. With enough
training data it is difficult for even a very large network to
overfit. Unfortunately, training data can be expensive or difficult
to acquire, so this is not always a practical option.
Regularization
Increasing the amount of training data is one way of reducing
overfitting. Are there other ways we can reduce the extent to which
overfitting occurs? One possible approach is to reduce the size of
our network. However, large networks have the potential to be more
powerful than small networks, and so this is an option we'd only adopt
reluctantly.
Fortunately, there are other techniques which can reduce overfitting,
even when we have a fixed network and fixed training data. These are
known as regularization techniques. In this section I describe
one of the most commonly used regularization techniques, a technique
sometimes known as weight decay or L2 regularization.
The idea of L2 regularization is to add an extra term to the cost
function, a term called the regularization term. Here's the
regularized cross-entropy:
\begin{eqnarray} C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j+(1-y_j) \ln
(1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2.
\tag{85}\end{eqnarray}
The first term is just the usual expression for the cross-entropy.
But we've added a second term, namely the sum of the squares of all
the weights in the network. This is scaled by a factor $\lambda /
2n$, where $\lambda > 0$ is known as the regularization
parameter, and $n$ is, as usual, the size of our training set.
I'll discuss later how $\lambda$ is chosen. It's also worth noting
that the regularization term doesn't include the biases. I'll
also come back to that below.
Of course, it's possible to regularize other cost functions, such as
the quadratic cost. This can be done in a similar way:
\begin{eqnarray} C = \frac{1}{2n} \sum_x \|y-a^L\|^2 +
\frac{\lambda}{2n} \sum_w w^2.
\tag{86}\end{eqnarray}
In both cases we can write the regularized cost function as
\begin{eqnarray} C = C_0 + \frac{\lambda}{2n}
\sum_w w^2,
\tag{87}\end{eqnarray} where $C_0$ is the original, unregularized cost
function.
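As a small illustration, here's Equation (87) written out in Python. This is a sketch, not code from network2.py, though it follows that program's convention of spelling the regularization parameter lmbda (lambda being a reserved word in Python, a point we'll return to below):

import numpy as np

def regularized_cost(cost0, weights, lmbda, n):
    """Equation (87): the unregularized cost plus the L2 penalty
    (lmbda / 2n) times the sum of the squared weights.  Here weights
    is a list of per-layer weight matrices, as in network2.py."""
    return cost0 + 0.5 * (lmbda / n) * sum(np.sum(w ** 2) for w in weights)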
Intuitively, the effect of regularization is to make it so the network
prefers to learn small weights, all other things being equal. Large
weights will only be allowed if they considerably improve the first
part of the cost function. Put another way, regularization can be
viewed as a way of compromising between finding small weights and
minimizing the original cost function. The relative importance of the
two elements of the compromise depends on the value of $\lambda$: when
$\lambda$ is small we prefer to minimize the original cost function,
but when $\lambda$ is large we prefer small weights.
Now, it's really not at all obvious why making this kind of compromise
should help reduce overfitting! But it turns out that it does. We'll
address the question of why it helps in the next section. But first,
let's work through an example showing that regularization really does
reduce overfitting.
To construct such an example, we first need to figure out how to apply
our stochastic gradient descent learning algorithm in a regularized
neural network. In particular, we need to know how to compute the
partial derivatives $\partial C / \partial w$ and $\partial C
/ \partial b$ for all the weights and biases in the network. Taking
the partial derivatives of Equation (87) gives
\begin{eqnarray}
\frac{\partial C}{\partial w} & = & \frac{\partial C_0}{\partial w} +
\frac{\lambda}{n} w \tag{88}\\
\frac{\partial C}{\partial b} & = & \frac{\partial C_0}{\partial b}.
\tag{89}\end{eqnarray}
The $\partial C_0 / \partial w$ and $\partial C_0 / \partial b$ terms
can be computed using backpropagation, as described in
the last chapter. And so we see that it's easy to
compute the gradient of the regularized cost function: just use
backpropagation, as usual, and then add $\frac{\lambda}{n} w$ to the
partial derivative of all the weight terms. The partial derivatives
with respect to the biases are unchanged, and so the gradient descent
learning rule for the biases doesn't change from the usual rule:
\begin{eqnarray}
b & \rightarrow & b -\eta \frac{\partial C_0}{\partial b}.
\tag{90}\end{eqnarray}
The learning rule for the weights becomes:
\begin{eqnarray}
w & \rightarrow & w-\eta \frac{\partial C_0}{\partial
w}-\frac{\eta \lambda}{n} w \tag{91}\\
& = & \left(1-\frac{\eta \lambda}{n}\right) w -\eta \frac{\partial
C_0}{\partial w}.
\tag{92}\end{eqnarray}
This is exactly the same as the usual gradient descent learning rule,
except we first rescale the weight $w$ by a factor $1-\frac{\eta
\lambda}{n}$. This rescaling is sometimes referred to as
weight decay, since it makes the weights smaller. At first
glance it looks as though this means the weights are being driven
unstoppably toward zero. But that's not right, since the other term
may lead the weights to increase, if so doing causes a decrease in the
unregularized cost function.
Okay, that's how gradient descent works. What about stochastic
gradient descent? Well, just as in unregularized stochastic gradient
descent, we can estimate $\partial C_0 / \partial w$ by averaging over
a mini-batch of $m$ training examples. Thus the regularized learning
rule for stochastic gradient descent becomes
(c.f. Equation (20), $w_k \rightarrow w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k}$)
\begin{eqnarray}
w \rightarrow \left(1-\frac{\eta \lambda}{n}\right) w -\frac{\eta}{m}
\sum_x \frac{\partial C_x}{\partial w},
\tag{93}\end{eqnarray}
where the sum is over training examples $x$ in the mini-batch, and
$C_x$ is the (unregularized) cost for each training example. This is
exactly the same as the usual rule for stochastic gradient descent,
except for the $1-\frac{\eta \lambda}{n}$ weight decay factor.
Finally, and for completeness, let me state the regularized learning
rule for the biases. This is, of course, exactly the same as in the
unregularized case (c.f. Equation (21), $b_l \rightarrow b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}$),
\begin{eqnarray}
b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b},
\tag{94}\end{eqnarray}
where the sum is over training examples $x$ in the mini-batch.
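Putting Equations (93) and (94) together, a single regularized update step looks like the following sketch. It's close in spirit to what network2.py's update_mini_batch method does, though the function here is illustrative rather than the book's actual code:

import numpy as np

def regularized_update(weights, biases, nabla_w, nabla_b,
                       eta, lmbda, n, m):
    """One mini-batch step following Equations (93) and (94).  Here
    nabla_w and nabla_b are the gradients summed over the mini-batch,
    m is the mini-batch size, and n is the size of the full training
    set; weights and biases are lists of per-layer arrays."""
    # Equation (93): apply the weight decay factor, then the usual step.
    weights = [(1 - eta * lmbda / n) * w - (eta / m) * nw
               for w, nw in zip(weights, nabla_w)]
    # Equation (94): the bias update is unchanged by regularization.
    biases = [b - (eta / m) * nb for b, nb in zip(biases, nabla_b)]
    return weights, biases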
Let's see how regularization changes the performance of our neural
network. We'll use a network with $30$ hidden neurons, a mini-batch
size of $10$, a learning rate of $0.5$, and the cross-entropy cost
function. However, this time we'll use a regularization parameter of
$\lambda = 0.1$. Note that in the code, we use the variable name
lmbda, because lambda is a reserved word in Python, with
an unrelated meaning. I've also used the test_data again, not
the validation_data. Strictly speaking, we should use the
validation_data, for all the reasons we discussed earlier. But
I decided to use the test_data because it makes the results
more directly comparable with our earlier, unregularized results. You
can easily change the code to use the validation_data instead,
and you'll find that it gives similar results.
Using $100$ hidden neurons, a regularization parameter of $\lambda = 5.0$, and training on the full 50,000 image training set for 30 epochs gives a further improvement:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)
The final result is a classification accuracy of $97.92$ percent on
the validation data. That's a big jump from the 30 hidden neuron
case. In fact, tuning just
a little more, to run for 60 epochs at $\eta = 0.1$ and $\lambda =
5.0$ we break the $98$ percent barrier, achieving $98.04$ percent
classification accuracy on the validation data. Not bad for what
turns out to be 152 lines of code!
I've described regularization as a way to reduce overfitting and to
increase classification accuracies. In fact, that's not the only
benefit. Empirically, when doing multiple runs of our MNIST networks,
but with different (random) weight initializations, I've found that
the unregularized runs will occasionally get "stuck", apparently
caught in local minima of the cost function. The result is that
different runs sometimes provide quite different results. By
contrast, the regularized runs have provided much more easily
replicable results.
Why is this going on? Heuristically, if the cost function is
unregularized, then the length of the weight vector is likely to grow,
all other things being equal. Over time this can lead to the weight
vector being very large indeed. This can cause the weight vector to
get stuck pointing in more or less the same direction, since changes
due to gradient descent only make tiny changes to the direction, when
the length is long. I believe this phenomenon is making it hard for
our learning algorithm to properly explore the weight space, and
consequently harder to find good minima of the cost function.
Why does regularization help reduce overfitting?
We've seen empirically that regularization helps reduce overfitting.
That's encouraging but, unfortunately, it's not obvious why
regularization helps! A standard story people tell to explain what's
going on is along the following lines: smaller weights are, in some
sense, lower complexity, and so provide a simpler and more powerful
explanation for the data, and should thus be preferred. That's a
pretty terse story, though, and contains several elements that perhaps
seem dubious or mystifying. Let's unpack the story and examine it
critically. To do that, let's suppose we have a simple data set for
which we wish to build a model:
Implicitly, we're studying some real-world phenomenon here, with $x$
and $y$ representing real-world data. Our goal is to build a model
which lets us predict $y$ as a function of $x$. We could try using
neural networks to build such a model, but I'm going to do something
even simpler: I'll try to model $y$ as a polynomial in $x$. I'm doing
this instead of using neural nets because using polynomials will make
things particularly transparent. Once we've understood the polynomial
case, we'll translate to neural networks. Now, there are ten points
in the graph above, which means we can find a unique $9$th-order
polynomial $y = a_0 x^9 + a_1 x^8 + \ldots + a_9$ which fits the data
exactly. Here's the graph of that polynomial*
*I won't show
the coefficients explicitly, although they are easy to find using a
routine such as Numpy's polyfit. You can view the exact form
of the polynomial in the source code
for the graph if you're curious. It's the function p(x)
defined starting on line 14 of the program which produces the
graph.:
That provides an exact fit. But we can also get a good fit using the
linear model $y = 2x$:
Which of these is the better model? Which is more likely to be true?
And which model is more likely to generalize well to other examples of
the same underlying real-world phenomenon?
These are difficult questions. In fact, we can't determine with
certainty the answer to any of the above questions, without much more
information about the underlying real-world phenomenon. But let's
consider two possibilities: (1) the $9$th order polynomial is, in
fact, the model which truly describes the real-world phenomenon, and
the model will therefore generalize perfectly; (2) the correct model
is $y = 2x$, but there's a little additional noise due to, say,
measurement error, and that's why the model isn't an exact fit.
It's not a priori possible to say which of these two
possibilities is correct. (Or, indeed, if some third possibility
holds). Logically, either could be true. And it's not a trivial
difference. It's true that on the data provided there's only a small
difference between the two models. But suppose we want to predict the
value of $y$ corresponding to some large value of $x$, much larger
than any shown on the graph above. If we try to do that there will be
a dramatic difference between the predictions of the two models, as
the $9$th order polynomial model comes to be dominated by the $x^9$
term, while the linear model remains, well, linear.
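You can see this extrapolation blow-up directly with Numpy's polyfit, mentioned in the footnote above. Here's a sketch; the data below is illustrative - ten points near $y = 2x$ with a little noise - not the data used in the graphs:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = 2 * x + 0.05 * rng.standard_normal(10)

# Ten points determine a unique 9th-order polynomial exactly (numpy may
# warn that the fit is poorly conditioned - that's the high-order
# polynomial's instability showing itself)...
p9 = np.polyfit(x, y, 9)
# ...while the best first-order fit stays close to y = 2x.
p1 = np.polyfit(x, y, 1)

print(np.polyval(p9, x) - y)    # essentially zero: an exact fit
print(p1)                       # roughly [2, 0]

# Far outside the data the models disagree dramatically, with the
# 9th-order polynomial dominated by its x^9 term:
print(np.polyval(p9, 5.0), np.polyval(p1, 5.0))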
One point of view is to say that in science we should go with the
simpler explanation, unless compelled not to. When we find a simple
model that seems to explain many data points we are tempted to shout
"Eureka!" After all, it seems unlikely that a simple explanation
should occur merely by coincidence. Rather, we suspect that the model
must be expressing some underlying truth about the phenomenon. In the
case at hand, the model $y = 2x+{\rm noise}$ seems much simpler than
$y = a_0 x^9 + a_1 x^8 + \ldots$. It would be surprising if that
simplicity had occurred by chance, and so we suspect that $y = 2x+{\rm
noise}$ expresses some underlying truth. In this point of view, the
9th order model is really just learning the effects of local
noise. And so while the 9th order model works perfectly for these
particular data points, the model will fail to generalize to other
data points, and the noisy linear model will have greater predictive
power.
Let's see what this point of view means for neural networks. Suppose
our network mostly has small weights, as will tend to happen in a
regularized network. The smallness of the weights means that the
behaviour of the network won't change too much if we change a few
random inputs here and there. That makes it difficult for a
regularized network to learn the effects of local noise in the data.
Think of it as a way of making it so single pieces of evidence don't
matter too much to the output of the network. Instead, a regularized
network learns to respond to types of evidence which are seen often
across the training set. By contrast, a network with large weights
may change its behaviour quite a bit in response to small changes in
the input. And so an unregularized network can use large weights to
learn a complex model that carries a lot of information about the
noise in the training data. In a nutshell, regularized networks are
constrained to build relatively simple models based on patterns seen
often in the training data, and are resistant to learning
peculiarities of the noise in the training data. The hope is that
this will force our networks to do real learning about the phenomenon
at hand, and to generalize better from what they learn.
With that said, this idea of preferring simpler explanations should
make you nervous. People sometimes refer to this idea as "Occam's
Razor", and will zealously apply it as though it has the status of
some general scientific principle. But, of course, it's not a general
scientific principle. There is no a priori logical reason to
prefer simple explanations over more complex explanations. Indeed,
sometimes the more complex explanation turns out to be correct.
Let me describe two examples where more complex explanations have
turned out to be correct. In the 1940s the physicist Marcel Schein
announced the discovery of a new particle of nature. The company he
worked for, General Electric, was ecstatic, and publicized the
discovery widely. But the physicist Hans Bethe was skeptical. Bethe
visited Schein, and looked at the plates showing the tracks of
Schein's new particle. Schein showed Bethe plate after plate, but on
each plate Bethe identified some problem that suggested the data
should be discarded. Finally, Schein showed Bethe a plate that looked
good. Bethe said it might just be a statistical fluke. Schein:
"Yes, but the chance that this would be statistics, even according to
your own formula, is one in five." Bethe: "But we have already
looked at five plates." Finally, Schein said: "But on my plates,
each one of the good plates, each one of the good pictures, you
explain by a different theory, whereas I have one hypothesis that
explains all the plates, that they are [the new particle]." Bethe
replied: "The sole difference between your and my explanations is
that yours is wrong and all of mine are right. Your single
explanation is wrong, and all of my multiple explanations are right."
Subsequent work confirmed that Nature agreed with Bethe, and Schein's
particle is no more*
*The story is related by the physicist
Richard Feynman in an
interview
with the historian Charles Weiner.
As a second example, in 1859 the astronomer Urbain Le Verrier observed
that the orbit of the planet Mercury doesn't have quite the shape that
Newton's theory of gravitation says it should have. It was a tiny,
tiny deviation from Newton's theory, and several of the explanations
proffered at the time boiled down to saying that Newton's theory was
more or less right, but needed a tiny alteration. In 1916, Einstein
showed that the deviation could be explained very well using his
general theory of relativity, a theory radically different to
Newtonian gravitation, and based on much more complex mathematics.
Despite that additional complexity, today it's accepted that
Einstein's explanation is correct, and Newtonian gravity, even in its
modified forms, is wrong. This is in part because we now know that
Einstein's theory explains many other phenomena which Newton's theory
has difficulty with. Furthermore, and even more impressively,
Einstein's theory accurately predicts several phenomena which aren't
predicted by Newtonian gravity at all. But these impressive qualities
weren't entirely obvious in the early days. If one had judged merely
on the grounds of simplicity, then some modified form of Newton's
theory would arguably have been more attractive.
There are three morals to draw from these stories. First, it can be
quite a subtle business deciding which of two explanations is truly
"simpler". Second, even if we can make such a judgment, simplicity
is a guide that must be used with great caution! Third, the true test
of a model is not simplicity, but rather how well it does in
predicting new phenomena, in new regimes of behaviour.
With that said, and keeping the need for caution in mind, it's an
empirical fact that regularized neural networks usually generalize
better than unregularized networks. And so through the remainder of
the book we will make frequent use of regularization. I've included
the stories above merely to help convey why no-one has yet developed
an entirely convincing theoretical explanation for why regularization
helps networks generalize. Indeed, researchers continue to write
papers where they try different approaches to regularization, compare
them to see which works better, and attempt to understand why different
approaches work better or worse. And so you can view regularization
as something of a kludge. While it often helps, we don't have an
entirely satisfactory systematic understanding of what's going on,
merely incomplete heuristics and rules of thumb.
There's a deeper set of issues here, issues which go to the heart of
science. It's the question of how we generalize. Regularization may
give us a computational magic wand that helps our networks generalize
better, but it doesn't give us a principled understanding of how
generalization works, nor of what the best approach is*
*These
issues go back to the
problem
of induction, famously discussed by the Scottish philosopher
David Hume in "An
Enquiry Concerning Human Understanding" (1748). The problem of
induction has been given a modern machine learning form in the
no-free lunch theorem
(link)
of David Wolpert and William Macready (1997).
This is particularly galling because in everyday life, we humans
generalize phenomenally well. Shown just a few images of an elephant
a child will quickly learn to recognize other elephants. Of course,
they may occasionally make mistakes, perhaps confusing a rhinoceros
for an elephant, but in general this process works remarkably
accurately. So we have a system - the human brain - with a huge
number of free parameters. And after being shown just one or a few
training images that system learns to generalize to other images. Our
brains are, in some sense, regularizing amazingly well! How do we do
it? At this point we don't know. I expect that in years to come we
will develop more powerful techniques for regularization in artificial
neural networks, techniques that will ultimately enable neural nets to
generalize well even from small data sets.
In fact, our networks already generalize better than one might a
priori expect. A network with 100 hidden neurons has nearly 80,000
parameters. We have only 50,000 images in our training data. It's
like trying to fit an 80,000th degree polynomial to 50,000 data
points. By all rights, our network should overfit terribly. And yet,
as we saw earlier, such a network actually does a pretty good job
generalizing. Why is that the case? It's not well understood. It
has been conjectured*
*In
Gradient-Based
Learning Applied to Document Recognition, by Yann LeCun,
Léon Bottou, Yoshua Bengio, and Patrick Haffner
(1998). that "the dynamics of gradient descent learning in
multilayer nets has a `self-regularization' effect". This is
exceptionally fortunate, but it's also somewhat disquieting that we
don't understand why it's the case. In the meantime, we will adopt
the pragmatic approach and use regularization whenever we can. Our
neural networks will be the better for it.
Let me conclude this section by returning to a detail which I left
unexplained earlier: the fact that L2 regularization doesn't
constrain the biases. Of course, it would be easy to modify the
regularization procedure to regularize the biases. Empirically, doing
this often doesn't change the results very much, so to some extent
it's merely a convention whether to regularize the biases or not.
However, it's worth noting that having a large bias doesn't make a
neuron sensitive to its inputs in the same way as having large
weights. And so we don't need to worry about large biases enabling
our network to learn the noise in our training data. At the same
time, allowing large biases gives our networks more flexibility in
behaviour - in particular, large biases make it easier for neurons
to saturate, which is sometimes desirable. For these reasons we don't
usually include bias terms when regularizing.
Other techniques for regularization
There are many regularization techniques other than L2 regularization.
In fact, so many techniques have been developed that I can't possibly
summarize them all. In this section I briefly describe three other
approaches to reducing overfitting: L1 regularization, dropout, and
artificially increasing the training set size. We won't go into
nearly as much depth studying these techniques as we did earlier.
Instead, the purpose is to get familiar with the main ideas, and to
appreciate something of the diversity of regularization techniques
available.
L1 regularization: In this approach we modify the
unregularized cost function by adding the sum of the absolute values
of the weights:
\begin{eqnarray} C = C_0 + \frac{\lambda}{n} \sum_w |w|.
\tag{95}\end{eqnarray}
Intuitively, this is similar to L2 regularization, penalizing large
weights, and tending to make the network prefer small weights. Of
course, the L1 regularization term isn't the same as the L2
regularization term, and so we shouldn't expect to get exactly the
same behaviour. Let's try to understand how the behaviour of a
network trained using L1 regularization differs from a network trained
using L2 regularization.
To do that, we'll look at the partial derivatives of the cost
function. Differentiating (95) we obtain:
\begin{eqnarray} \frac{\partial C}{\partial
w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm
sgn}(w),
\tag{96}\end{eqnarray}
where ${\rm sgn}(w)$ is the sign of $w$, that is, $+1$ if $w$ is
positive, and $-1$ if $w$ is negative. Using this expression, we can
easily modify backpropagation to do stochastic gradient descent using
L1 regularization. The resulting update rule for an L1 regularized
network is
\begin{eqnarray} w \rightarrow w' =
w-\frac{\eta \lambda}{n} \mbox{sgn}(w) - \eta \frac{\partial
C_0}{\partial w},
\tag{97}\end{eqnarray}
where, as per usual, we can estimate $\partial C_0 / \partial w$ using
a mini-batch average, if we wish. Compare that to the update rule for
L2 regularization (c.f. Equation (93)),
\begin{eqnarray}
w \rightarrow w' = w\left(1 - \frac{\eta \lambda}{n} \right)
- \eta \frac{\partial C_0}{\partial w}.
\tag{98}\end{eqnarray}
In both expressions the effect of regularization is to shrink the
weights. This accords with our intuition that both kinds of
regularization penalize large weights. But the way the weights shrink
is different. In L1 regularization, the weights shrink by a constant
amount toward $0$. In L2 regularization, the weights shrink by an
amount which is proportional to $w$. And so when a particular weight
has a large magnitude, $|w|$, L1 regularization shrinks the weight
much less than L2 regularization does. By contrast, when $|w|$ is
small, L1 regularization shrinks the weight much more than L2
regularization. The net result is that L1 regularization tends to
concentrate the weight of the network in a relatively small number of
high-importance connections, while the other weights are driven toward
zero.
I've glossed over an issue in the above discussion, which is that the
partial derivative $\partial C / \partial w$ isn't defined when $w =
0$. The reason is that the function $|w|$ has a sharp "corner" at
$w = 0$, and so isn't differentiable at that point. That's okay,
though. What we'll do is just apply the usual (unregularized) rule
for stochastic gradient descent when $w = 0$. That should be okay -
intuitively, the effect of regularization is to shrink weights, and
obviously it can't shrink a weight which is already $0$. To put it
more precisely, we'll use Equations (96) and (97) with the convention
that $\mbox{sgn}(0) = 0$.
That gives a nice, compact rule for doing stochastic gradient descent
with L1 regularization.
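In code, the rule is a one-liner. Conveniently, numpy's sign function already uses the $\mbox{sgn}(0) = 0$ convention. The function below is a sketch, not part of network2.py:

import numpy as np

def l1_update(w, nabla_c0, eta, lmbda, n):
    """The L1-regularized update of Equation (97).  Note np.sign(0)
    is 0, matching the sgn(0) = 0 convention adopted above."""
    return w - (eta * lmbda / n) * np.sign(w) - eta * nabla_c0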
Dropout: Dropout is a radically different technique for
regularization. Unlike L1 and L2 regularization, dropout doesn't rely
on modifying the cost function. Instead, in dropout we modify the
network itself. Let me describe the basic mechanics of how dropout
works, before getting into why it works, and what the results are.
Suppose we're trying to train a network:
In particular, suppose we have a training input $x$ and corresponding
desired output $y$. Ordinarily, we'd train by forward-propagating $x$
through the network, and then backpropagating to determine the
contribution to the gradient. With dropout, this process is modified.
We start by randomly (and temporarily) deleting half the hidden
neurons in the network, while leaving the input and output neurons
untouched. After doing this, we'll end up with a network along the
following lines. Note that the dropout neurons, i.e., the neurons
which have been temporarily deleted, are still ghosted in:
We forward-propagate the input $x$ through the modified network, and
then backpropagate the result, also through the modified network.
After doing this over a mini-batch of examples, we update the
appropriate weights and biases. We then repeat the process, first
restoring the dropout neurons, then choosing a new random subset of
hidden neurons to delete, estimating the gradient for a different
mini-batch, and updating the weights and biases in the network.
By repeating this process over and over, our network will learn a set
of weights and biases. Of course, those weights and biases will have
been learnt under conditions in which half the hidden neurons were
dropped out. When we actually run the full network that means that
twice as many hidden neurons will be active. To compensate for that,
we halve the weights outgoing from the hidden neurons.
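Here's a sketch of what a dropout forward pass might look like, for a sigmoid network stored as lists of weight matrices and bias vectors in the style of network2.py. The dropout logic itself is illustrative, not code from the papers discussed below:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_dropout(a, weights, biases, training=True, p=0.5):
    """Forward pass with dropout.  During training, each hidden layer
    keeps a random fraction p of its neurons.  When the full network
    is run, the weights outgoing from the hidden neurons are scaled
    by p to compensate, as described above."""
    last = len(weights) - 1
    for l, (w, b) in enumerate(zip(weights, biases)):
        # Weights fed by a (possibly dropped-out) hidden layer get
        # scaled when the full network is used.
        if not training and l > 0:
            w = p * w
        a = sigmoid(np.dot(w, a) + b)
        # Temporarily delete a random subset of the hidden neurons,
        # leaving the output layer untouched.
        if training and l < last:
            a = a * (np.random.rand(*a.shape) < p)
    return a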
This dropout procedure may seem strange and ad hoc. Why would
we expect it to help with regularization? To explain what's going on,
I'd like you to briefly stop thinking about dropout, and instead
imagine training neural networks in the standard way (no dropout). In
particular, imagine we train several different neural networks, all
using the same training data. Of course, the networks may not start
out identical, and as a result after training they may sometimes give
different results. When that happens we could use some kind of
averaging or voting scheme to decide which output to accept. For
instance, if we have trained five networks, and three of them are
classifying a digit as a "3", then it probably really is a "3".
The other two networks are probably just making a mistake. This kind
of averaging scheme is often found to be a powerful (though expensive)
way of reducing overfitting. The reason is that the different
networks may overfit in different ways, and averaging may help
eliminate that kind of overfitting.
What's this got to do with dropout? Heuristically, when we dropout
different sets of neurons, it's rather like we're training different
neural networks. And so the dropout procedure is like averaging the
effects of a very large number of different networks. The different
networks will overfit in different ways, and so, hopefully, the net
effect of dropout will be to reduce overfitting.
A related heuristic explanation for dropout is given in one of the
earliest papers to use the
technique*
*ImageNet
Classification with Deep Convolutional Neural Networks, by Alex
Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This
technique reduces complex co-adaptations of neurons, since a neuron
cannot rely on the presence of particular other neurons. It is,
therefore, forced to learn more robust features that are useful in
conjunction with many different random subsets of the other neurons."
In other words, if we think of our network as a model which is making
predictions, then we can think of dropout as a way of making sure that
the model is robust to the loss of any individual piece of evidence.
In this, it's somewhat similar to L1 and L2 regularization, which tend
to reduce weights, and thus make the network more robust to losing any
individual connection in the network.
Of course, the true measure of dropout is that it has been very
successful in improving the performance of neural networks. The
original
paper*
*Improving
neural networks by preventing co-adaptation of feature detectors
by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya
Sutskever, and Ruslan Salakhutdinov (2012). Note that the paper
discusses a number of subtleties that I have glossed over in this
brief introduction. introducing the technique applied it to many
different tasks. For us, it's of particular interest that they applied
dropout to MNIST digit classification, using a vanilla feedforward
neural network along lines similar to those we've been considering.
The paper noted that the best result anyone had achieved up to that
point using such an architecture was $98.4$ percent classification
accuracy on the test set. They improved that to $98.7$ percent
accuracy using a combination of dropout and a modified form of L2
regularization. Similarly impressive results have been obtained for
many other tasks, including problems in image and speech recognition,
and natural language processing. Dropout has been especially useful
in training large, deep networks, where the problem of overfitting is
often acute.
Artificially expanding the training data: We saw earlier that
our MNIST classification accuracy dropped down to percentages in the
mid-80s when we used only 1,000 training images. It's not surprising
that this is the case, since less training data means our network will
be exposed to fewer variations in the way human beings write digits.
Let's try training our 30 hidden neuron network with a variety of
different training data set sizes, to see how performance varies. We
train using a mini-batch size of 10, a learning rate $\eta = 0.5$,
and the cross-entropy cost
function. We will train for 30 epochs when the full training data set
is used, and scale up the number of epochs proportionally when smaller
training sets are used. To ensure the weight decay factor remains the
same across training sets, we will use a regularization parameter of
$\lambda = 5.0$ when the full training data set is used, and scale
down $\lambda$ proportionally when smaller training sets are
used*
*This and the next two graphs are produced with the program
more_data.py.
As you can see, the classification accuracies improve considerably as
we use more training data. Presumably this improvement would continue
still further if more data was available. Of course, looking at the
graph above it does appear that we're getting near saturation.
Suppose, however, that we redo the graph with the training set size
plotted logarithmically:
It seems clear that the graph is still going up toward the end. This
suggests that if we used vastly more training data - say, millions
or even billions of handwriting samples, instead of just 50,000 -
then we'd likely get considerably better performance, even from this
very small network.
Obtaining more training data is a great idea. Unfortunately, it can be
expensive, and so is not always possible in practice. However,
there's another idea which can work nearly as well, and that's to
artificially expand the training data. Suppose, for example, that we
take an MNIST training image of a five,
and rotate it by a small amount, let's say 15 degrees:
It's still recognizably the same digit. And yet at the pixel level
it's quite different to any image currently in the MNIST training
data. It's conceivable that adding this image to the training data
might help our network learn more about how to classify digits.
What's more, obviously we're not limited to adding just this one
image. We can expand our training data by making many small
rotations of all the MNIST training images, and then using the
expanded training data to improve our network's performance.
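As a sketch of how such an expansion might be coded, here's one way to pair each image with a randomly rotated copy, using scipy's ndimage module. The reshaping assumes MNIST-style images stored as 784-entry column vectors, as in this book's mnist_loader:

import numpy as np
from scipy.ndimage import rotate

def expand_with_rotations(images, max_angle=15):
    """Return the original images together with a randomly rotated
    copy of each, rotations drawn from [-max_angle, max_angle]."""
    expanded = []
    for img in images:
        expanded.append(img)
        angle = np.random.uniform(-max_angle, max_angle)
        rotated = rotate(img.reshape(28, 28), angle, reshape=False)
        expanded.append(rotated.reshape(784, 1))
    return expanded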
This idea is very powerful and has been widely used. Let's look at
some of the results from a
paper*
*Best
Practices for Convolutional Neural Networks Applied to Visual
Document Analysis, by Patrice Simard, Dave Steinkraus, and John
Platt (2003). which applied several variations of the idea to
MNIST. One of the neural network architectures they considered was
along similar lines to what we've been using, a feedforward network
with 800 hidden neurons and using the cross-entropy cost function.
Running the network with the standard MNIST training data they
achieved a classification accuracy of 98.4 percent on their test set.
But then they expanded the training data, using not just rotations, as
I described above, but also translating and skewing the images. By
training on the expanded data set they increased their network's
accuracy to 98.9 percent. They also experimented with what they
called "elastic distortions", a special type of image distortion
intended to emulate the random oscillations found in hand muscles. By
using the elastic distortions to expand the data they achieved an even
higher accuracy, 99.3 percent. Effectively, they were broadening the
experience of their network by exposing it to the sort of variations
that are found in real handwriting.
Variations on this idea can be used to improve performance on many
learning tasks, not just handwriting recognition. The general
principle is to expand the training data by applying operations that
reflect real-world variation. It's not difficult to think of ways of
doing this. Suppose, for example, that you're building a neural
network to do speech recognition. We humans can recognize speech even
in the presence of distortions such as background noise. And so you
can expand your data by adding background noise. We can also
recognize speech if it's sped up or slowed down. So that's another way
we can expand the training data. These techniques are not always used
- for instance, instead of expanding the training data by adding
noise, it may well be more efficient to clean up the input to the
network by first applying a noise reduction filter. Still, it's worth
keeping the idea of expanding the training data in mind, and looking
for opportunities to apply it.
Exercise
An aside on big data and what it means to compare classification accuracies: Let's look again at how our neural network's accuracy varies with training set size:
Suppose that instead of using a neural network we use some other machine learning technique to classify digits. For instance, let's try using the support vector machines (SVM) which we met briefly back in Chapter 1. As was the case in Chapter 1, don't worry if you're not familiar with SVMs, we don't need to understand their details. Instead, we'll use the SVM supplied by the scikit-learn library. Here's how SVM performance varies as a function of training set size. I've plotted the neural net results as well, to make comparison easy* *This graph was produced with the program more_data.py (as were the last few graphs).:
Probably the first thing that strikes you about this graph is that our neural network outperforms the SVM for every training set size. That's nice, although you shouldn't read too much into it, since I just used the out-of-the-box settings from scikit-learn's SVM, while we've done a fair bit of work improving our neural network. A more subtle but more interesting fact about the graph is that if we train our SVM using 50,000 images then it actually has better performance (94.48 percent accuracy) than our neural network does when trained using 5,000 images (93.24 percent accuracy). In other words, more training data can sometimes compensate for differences in the machine learning algorithm used.
Something even more interesting can occur. Suppose we're trying to solve a problem using two machine learning algorithms, algorithm A and algorithm B. It sometimes happens that algorithm A will outperform algorithm B with one set of training data, while algorithm B will outperform algorithm A with a different set of training data. We don't see that above - it would require the two graphs to cross - but it does happen* *Striking examples may be found in Scaling to very very large corpora for natural language disambiguation, by Michele Banko and Eric Brill (2001).. The correct response to the question "Is algorithm A better than algorithm B?" is really: "What training data set are you using?"
All this is a caution to keep in mind, both when doing development, and when reading research papers. Many papers focus on finding new tricks to wring out improved performance on standard benchmark data sets. "Our whiz-bang technique gave us an improvement of X percent on standard benchmark Y" is a canonical form of research claim. Such claims are often genuinely interesting, but they must be understood as applying only in the context of the specific training data set used. Imagine an alternate history in which the people who originally created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data. It's entirely possible that the "improvement" due to the whiz-bang technique would disappear on a larger data set. In other words, the purported improvement might be just an accident of history. The message to take away, especially in practical applications, is that what we want is both better algorithms and better training data. It's fine to look for better algorithms, but make sure you're not focusing on better algorithms to the exclusion of easy wins getting more or better training data.
Summing up: We've now completed our dive into overfitting and regularization. Of course, we'll return again to the issue. As I've mentioned several times, overfitting is a major problem in neural networks, especially as computers get more powerful, and we have the ability to train larger networks. As a result there's a pressing need to develop powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.
When we create our neural networks, we have to make choices for the initial weights and biases. Up to now, we've been choosing them according to a prescription which I discussed only briefly back in Chapter 1. Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean $0$ and standard deviation $1$. While this approach has worked well, it was quite ad hoc, and it's worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.
It turns out that we can do quite a bit better than initializing with normalized Gaussians. To see why, suppose we're working with a network with a large number - say $1,000$ - of input neurons. And let's suppose we've used normalized Gaussians to initialize the weights connecting to the first hidden layer. For now I'm going to concentrate specifically on the weights connecting the input neurons to the first neuron in the hidden layer, and ignore the rest of the network:
We'll suppose for simplicity that we're trying to train using a training input $x$ in which half the input neurons are on, i.e., set to $1$, and half the input neurons are off, i.e., set to $0$. The argument which follows applies more generally, but you'll get the gist from this special case. Let's consider the weighted sum $z = \sum_j w_j x_j+b$ of inputs to our hidden neuron. $500$ terms in this sum vanish, because the corresponding input $x_j$ is zero. And so $z$ is a sum over a total of $501$ normalized Gaussian random variables, accounting for the $500$ weight terms and the $1$ extra bias term. Thus $z$ is itself distributed as a Gaussian with mean zero and standard deviation $\sqrt{501} \approx 22.4$. That is, $z$ has a very broad Gaussian distribution, not sharply peaked at all:
In particular, we can see from this graph that it's quite likely that $|z|$ will be pretty large, i.e., either $z \gg 1$ or $z \ll -1$. If that's the case then the output $\sigma(z)$ from the hidden neuron will be very close to either $1$ or $0$. That means our hidden neuron will have saturated. And when that happens, as we know, making small changes in the weights will make only absolutely miniscule changes in the activation of our hidden neuron. That miniscule change in the activation of the hidden neuron will, in turn, barely affect the rest of the neurons in the network at all, and we'll see a correspondingly miniscule change in the cost function. As a result, those weights will only learn very slowly when we use the gradient descent algorithm* *We discussed this in more detail in Chapter 2, where we used the equations of backpropagation to show that weights input to saturated neurons learned slowly.. It's similar to the problem we discussed earlier in this chapter, in which output neurons which saturated on the wrong value caused learning to slow down. We addressed that earlier problem with a clever choice of cost function. Unfortunately, while that helped with saturated output neurons, it does nothing at all for the problem with saturated hidden neurons.
I've been talking about the weights input to the first hidden layer. Of course, similar arguments apply also to later hidden layers: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to $0$ or $1$, and learning will proceed very slowly.
Is there some way we can choose better initializations for the weights and biases, so that we don't get this kind of saturation, and so avoid a learning slowdown? Suppose we have a neuron with $n_{\rm in}$ input weights. Then we shall initialize those weights as Gaussian random variables with mean $0$ and standard deviation $1/\sqrt{n_{\rm in}}$. That is, we'll squash the Gaussians down, making it less likely that our neuron will saturate. We'll continue to choose the bias as a Gaussian with mean $0$ and standard deviation $1$, for reasons I'll return to in a moment. With these choices, the weighted sum $z = \sum_j w_j x_j + b$ will again be a Gaussian random variable with mean $0$, but it'll be much more sharply peaked than it was before. Suppose, as we did earlier, that $500$ of the inputs are zero and $500$ are $1$. Then it's easy to show (see the exercise below) that $z$ has a Gaussian distribution with mean $0$ and standard deviation $\sqrt{3/2} = 1.22\ldots$. This is much more sharply peaked than before, so much so that even the graph below understates the situation, since I've had to rescale the vertical axis, when compared to the earlier graph:
Such a neuron is much less likely to saturate, and correspondingly much less likely to have problems with a learning slowdown.
I stated above that we'll continue to initialize the biases as before, as Gaussian random variables with a mean of $0$ and a standard deviation of $1$. This is okay, because it doesn't make it too much more likely that our neurons will saturate. In fact, it doesn't much matter how we initialize the biases, provided we avoid the problem with saturation. Some people go so far as to initialize all the biases to $0$, and rely on gradient descent to learn appropriate biases. But since it's unlikely to make much difference, we'll continue with the same initialization procedure as before.
Let's compare the results for both our old and new approaches to weight initialization, using the MNIST digit classification task. As before, we'll use $30$ hidden neurons, a mini-batch size of $10$, a regularization parameter $\lambda = 5.0$, and the cross-entropy cost function. We will decrease the learning rate slightly from $\eta = 0.5$ to $0.1$, since that makes the results a little more easily visible in the graphs. We can train using the old method of weight initialization:
>>> import mnist_loader
The final result is a classification accuracy of $97.92$ percent on
the validation data. That's a big jump from the 30 hidden neuron
case. In fact, tuning just
a little more, to run for 60 epochs at $\eta = 0.1$ and $\lambda =
5.0$ we break the $98$ percent barrier, achieving $98.04$ percent
classification accuracy on the validation data. Not bad for what
turns out to be 152 lines of code!
I've described regularization as a way to reduce overfitting and to
increase classification accuracies. In fact, that's not the only
benefit. Empirically, when doing multiple runs of our MNIST networks,
but with different (random) weight initializations, I've found that
the unregularized runs will occasionally get "stuck", apparently
caught in local minima of the cost function. The result is that
different runs sometimes provide quite different results. By
contrast, the regularized runs have provided much more easily
replicable results.
Why is this going on? Heuristically, if the cost function is
unregularized, then the length of the weight vector is likely to grow,
all other things being equal. Over time this can lead to the weight
vector being very large indeed. This can cause the weight vector to
get stuck pointing in more or less the same direction, since changes
due to gradient descent only make tiny changes to the direction, when
the length is long. I believe this phenomenon is making it hard for
our learning algorithm to properly explore the weight space, and
consequently harder to find good minima of the cost function.
Why does regularization help reduce overfitting?
We've seen empirically that regularization helps reduce overfitting.
That's encouraging but, unfortunately, it's not obvious why
regularization helps! A standard story people tell to explain what's
going on is along the following lines: smaller weights are, in some
sense, lower complexity, and so provide a simpler and more powerful
explanation for the data, and should thus be preferred. That's a
pretty terse story, though, and contains several elements that perhaps
seem dubious or mystifying. Let's unpack the story and examine it
critically. To do that, let's suppose we have a simple data set for
which we wish to build a model:
Implicitly, we're studying some real-world phenomenon here, with $x$
and $y$ representing real-world data. Our goal is to build a model
which lets us predict $y$ as a function of $x$. We could try using
neural networks to build such a model, but I'm going to do something
even simpler: I'll try to model $y$ as a polynomial in $x$. I'm doing
this instead of using neural nets because using polynomials will make
things particularly transparent. Once we've understood the polynomial
case, we'll translate to neural networks. Now, there are ten points
in the graph above, which means we can find a unique $9$th-order
polynomial $y = a_0 x^9 + a_1 x^8 + \ldots + a_9$ which fits the data
exactly. Here's the graph of that polynomial*
*I won't show
the coefficients explicitly, although they are easy to find using a
routine such as Numpy's polyfit. You can view the exact form
of the polynomial in the source code
for the graph if you're curious. It's the function p(x)
defined starting on line 14 of the program which produces the
graph.:
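Incidentally, if you'd like to play with fits like these yourself, they're easy to compute with polyfit. Here's a rough sketch; the data points below are hypothetical stand-ins, since I haven't reproduced the original ten points here:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 10)                   # ten hypothetical data points
y = 2 * x + 0.3 * rng.standard_normal(10)   # roughly y = 2x, plus noise

# A 9th-order polynomial through 10 points fits them exactly
# (NumPy may warn that the fit is poorly conditioned):
exact_coeffs = np.polyfit(x, y, 9)
print(np.polyval(exact_coeffs, x) - y)      # residuals near zero

# A linear fit recovers a slope close to 2:
print(np.polyfit(x, y, 1))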
That provides an exact fit. But we can also get a good fit using the
linear model $y = 2x$:
Which of these is the better model? Which is more likely to be true?
And which model is more likely to generalize well to other examples of
the same underlying real-world phenomenon?
These are difficult questions. In fact, we can't determine with
certainty the answer to any of the above questions, without much more
information about the underlying real-world phenomenon. But let's
consider two possibilities: (1) the $9$th order polynomial is, in
fact, the model which truly describes the real-world phenomenon, and
the model will therefore generalize perfectly; (2) the correct model
is $y = 2x$, but there's a little additional noise due to, say,
measurement error, and that's why the model isn't an exact fit.
It's not a priori possible to say which of these two
possibilities is correct. (Or, indeed, if some third possibility
holds). Logically, either could be true. And it's not a trivial
difference. It's true that on the data provided there's only a small
difference between the two models. But suppose we want to predict the
value of $y$ corresponding to some large value of $x$, much larger
than any shown on the graph above. If we try to do that there will be
a dramatic difference between the predictions of the two models, as
the $9$th order polynomial model comes to be dominated by the $x^9$
term, while the linear model remains, well, linear.
One point of view is to say that in science we should go with the
simpler explanation, unless compelled not to. When we find a simple
model that seems to explain many data points we are tempted to shout
"Eureka!" After all, it seems unlikely that a simple explanation
should occur merely by coincidence. Rather, we suspect that the model
must be expressing some underlying truth about the phenomenon. In the
case at hand, the model $y = 2x+{\rm noise}$ seems much simpler than
$y = a_0 x^9 + a_1 x^8 + \ldots$. It would be surprising if that
simplicity had occurred by chance, and so we suspect that $y = 2x+{\rm
noise}$ expresses some underlying truth. In this point of view, the
9th order model is really just learning the effects of local
noise. And so while the 9th order model works perfectly for these
particular data points, the model will fail to generalize to other
data points, and the noisy linear model will have greater predictive
power.
Let's see what this point of view means for neural networks. Suppose
our network mostly has small weights, as will tend to happen in a
regularized network. The smallness of the weights means that the
behaviour of the network won't change too much if we change a few
random inputs here and there. That makes it difficult for a
regularized network to learn the effects of local noise in the data.
Think of it as a way of making it so single pieces of evidence don't
matter too much to the output of the network. Instead, a regularized
network learns to respond to types of evidence which are seen often
across the training set. By contrast, a network with large weights
may change its behaviour quite a bit in response to small changes in
the input. And so an unregularized network can use large weights to
learn a complex model that carries a lot of information about the
noise in the training data. In a nutshell, regularized networks are
constrained to build relatively simple models based on patterns seen
often in the training data, and are resistant to learning
peculiarities of the noise in the training data. The hope is that
this will force our networks to do real learning about the phenomenon
at hand, and to generalize better from what they learn.
With that said, this idea of preferring simpler explanations should
make you nervous. People sometimes refer to this idea as "Occam's
Razor", and will zealously apply it as though it has the status of
some general scientific principle. But, of course, it's not a general
scientific principle. There is no a priori logical reason to
prefer simple explanations over more complex explanations. Indeed,
sometimes the more complex explanation turns out to be correct.
Let me describe two examples where more complex explanations have
turned out to be correct. In the 1940s the physicist Marcel Schein
announced the discovery of a new particle of nature. The company he
worked for, General Electric, was ecstatic, and publicized the
discovery widely. But the physicist Hans Bethe was skeptical. Bethe
visited Schein, and looked at the plates showing the tracks of
Schein's new particle. Schein showed Bethe plate after plate, but on
each plate Bethe identified some problem that suggested the data
should be discarded. Finally, Schein showed Bethe a plate that looked
good. Bethe said it might just be a statistical fluke. Schein:
"Yes, but the chance that this would be statistics, even according to
your own formula, is one in five." Bethe: "But we have already
looked at five plates." Finally, Schein said: "But on my plates,
each one of the good plates, each one of the good pictures, you
explain by a different theory, whereas I have one hypothesis that
explains all the plates, that they are [the new particle]." Bethe
replied: "The sole difference between your and my explanations is
that yours is wrong and all of mine are right. Your single
explanation is wrong, and all of my multiple explanations are right."
Subsequent work confirmed that Nature agreed with Bethe, and Schein's
particle is no more*
*The story is related by the physicist
Richard Feynman in an
interview
with the historian Charles Weiner..
As a second example, in 1859 the astronomer Urbain Le Verrier observed
that the orbit of the planet Mercury doesn't have quite the shape that
Newton's theory of gravitation says it should have. It was a tiny,
tiny deviation from Newton's theory, and several of the explanations
proffered at the time boiled down to saying that Newton's theory was
more or less right, but needed a tiny alteration. In 1916, Einstein
showed that the deviation could be explained very well using his
general theory of relativity, a theory radically different to
Newtonian gravitation, and based on much more complex mathematics.
Despite that additional complexity, today it's accepted that
Einstein's explanation is correct, and Newtonian gravity, even in its
modified forms, is wrong. This is in part because we now know that
Einstein's theory explains many other phenomena which Newton's theory
has difficulty with. Furthermore, and even more impressively,
Einstein's theory accurately predicts several phenomena which aren't
predicted by Newtonian gravity at all. But these impressive qualities
weren't entirely obvious in the early days. If one had judged merely
on the grounds of simplicity, then some modified form of Newton's
theory would arguably have been more attractive.
There are three morals to draw from these stories. First, it can be
quite a subtle business deciding which of two explanations is truly
"simpler". Second, even if we can make such a judgment, simplicity
is a guide that must be used with great caution! Third, the true test
of a model is not simplicity, but rather how well it does in
predicting new phenomena, in new regimes of behaviour.
With that said, and keeping the need for caution in mind, it's an
empirical fact that regularized neural networks usually generalize
better than unregularized networks. And so through the remainder of
the book we will make frequent use of regularization. I've included
the stories above merely to help convey why no-one has yet developed
an entirely convincing theoretical explanation for why regularization
helps networks generalize. Indeed, researchers continue to write
papers where they try different approaches to regularization, compare
them to see which works better, and attempt to understand why different
approaches work better or worse. And so you can view regularization
as something of a kludge. While it often helps, we don't have an
entirely satisfactory systematic understanding of what's going on,
merely incomplete heuristics and rules of thumb.
There's a deeper set of issues here, issues which go to the heart of
science. It's the question of how we generalize. Regularization may
give us a computational magic wand that helps our networks generalize
better, but it doesn't give us a principled understanding of how
generalization works, nor of what the best approach is*
*These
issues go back to the
problem
of induction, famously discussed by the Scottish philosopher
David Hume in "An
Enquiry Concerning Human Understanding" (1748). The problem of
induction has been given a modern machine learning form in the
no-free-lunch theorem of David Wolpert and William Macready (1997)..
This is particularly galling because in everyday life, we humans
generalize phenomenally well. Shown just a few images of an elephant
a child will quickly learn to recognize other elephants. Of course,
they may occasionally make mistakes, perhaps confusing a rhinoceros
for an elephant, but in general this process works remarkably
accurately. So we have a system - the human brain - with a huge
number of free parameters. And after being shown just one or a few
training images that system learns to generalize to other images. Our
brains are, in some sense, regularizing amazingly well! How do we do
it? At this point we don't know. I expect that in years to come we
will develop more powerful techniques for regularization in artificial
neural networks, techniques that will ultimately enable neural nets to
generalize well even from small data sets.
In fact, our networks already generalize better than one might a
priori expect. A network with 100 hidden neurons has nearly 80,000
parameters. We have only 50,000 images in our training data. It's
like trying to fit an 80,000th degree polynomial to 50,000 data
points. By all rights, our network should overfit terribly. And yet,
as we saw earlier, such a network actually does a pretty good job
generalizing. Why is that the case? It's not well understood. It
has been conjectured*
*In
Gradient-Based
Learning Applied to Document Recognition, by Yann LeCun,
Léon Bottou, Yoshua Bengio, and Patrick Haffner
(1998). that "the dynamics of gradient descent learning in
multilayer nets has a 'self-regularization' effect". This is
exceptionally fortunate, but it's also somewhat disquieting that we
don't understand why it's the case. In the meantime, we will adopt
the pragmatic approach and use regularization whenever we can. Our
neural networks will be the better for it.
Let me conclude this section by returning to a detail which I left
unexplained earlier: the fact that L2 regularization doesn't
constrain the biases. Of course, it would be easy to modify the
regularization procedure to regularize the biases. Empirically, doing
this often doesn't change the results very much, so to some extent
it's merely a convention whether to regularize the biases or not.
However, it's worth noting that having a large bias doesn't make a
neuron sensitive to its inputs in the same way as having large
weights. And so we don't need to worry about large biases enabling
our network to learn the noise in our training data. At the same
time, allowing large biases gives our networks more flexibility in
behaviour - in particular, large biases make it easier for neurons
to saturate, which is sometimes desirable. For these reasons we don't
usually include bias terms when regularizing.
Other techniques for regularization
There are many regularization techniques other than L2 regularization.
In fact, so many techniques have been developed that I can't possibly
summarize them all. In this section I briefly describe three other
approaches to reducing overfitting: L1 regularization, dropout, and
artificially increasing the training set size. We won't go into
nearly as much depth studying these techniques as we did earlier.
Instead, the purpose is to get familiar with the main ideas, and to
appreciate something of the diversity of regularization techniques
available.
L1 regularization: In this approach we modify the
unregularized cost function by adding the sum of the absolute values
of the weights:
\begin{eqnarray} C = C_0 + \frac{\lambda}{n} \sum_w |w|.
\tag{95}\end{eqnarray}
Intuitively, this is similar to L2 regularization, penalizing large
weights, and tending to make the network prefer small weights. Of
course, the L1 regularization term isn't the same as the L2
regularization term, and so we shouldn't expect to get exactly the
same behaviour. Let's try to understand how the behaviour of a
network trained using L1 regularization differs from a network trained
using L2 regularization.
To do that, we'll look at the partial derivatives of the cost
function. Differentiating (95) we obtain:
\begin{eqnarray} \frac{\partial C}{\partial
w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm
sgn}(w),
\tag{96}\end{eqnarray}
where ${\rm sgn}(w)$ is the sign of $w$, that is, $+1$ if $w$ is
positive, and $-1$ if $w$ is negative. Using this expression, we can
easily modify backpropagation to do stochastic gradient descent using
L1 regularization. The resulting update rule for an L1 regularized
network is
\begin{eqnarray} w \rightarrow w' =
w-\frac{\eta \lambda}{n} \mbox{sgn}(w) - \eta \frac{\partial
C_0}{\partial w},
\tag{97}\end{eqnarray}
where, as per usual, we can estimate $\partial C_0 / \partial w$ using
a mini-batch average, if we wish. Compare that to the update rule for
L2 regularization (c.f. Equation (93)),
\begin{eqnarray}
w \rightarrow w' = w\left(1 - \frac{\eta \lambda}{n} \right)
- \eta \frac{\partial C_0}{\partial w}.
\tag{98}\end{eqnarray}
In both expressions the effect of regularization is to shrink the
weights. This accords with our intuition that both kinds of
regularization penalize large weights. But the way the weights shrink
is different. In L1 regularization, the weights shrink by a constant
amount toward $0$. In L2 regularization, the weights shrink by an
amount which is proportional to $w$. And so when a particular weight
has a large magnitude, $|w|$, L1 regularization shrinks the weight
much less than L2 regularization does. By contrast, when $|w|$ is
small, L1 regularization shrinks the weight much more than L2
regularization. The net result is that L1 regularization tends to
concentrate the weight of the network in a relatively small number of
high-importance connections, while the other weights are driven toward
zero.
I've glossed over an issue in the above discussion, which is that the
partial derivative $\partial C / \partial w$ isn't defined when $w =
0$. The reason is that the function $|w|$ has a sharp "corner" at
$w = 0$, and so isn't differentiable at that point. That's okay,
though. What we'll do is just apply the usual (unregularized) rule
for stochastic gradient descent when $w = 0$. That should be okay -
intuitively, the effect of regularization is to shrink weights, and
obviously it can't shrink a weight which is already $0$. To put it
more precisely, we'll use Equations (96) and (97) with the convention
that $\mbox{sgn}(0) = 0$.
That gives a nice, compact rule for doing stochastic gradient descent
with L1 regularization.
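In code the rule really is compact. Here's a sketch of a single update step; the names are mine (grad_C0 stands for a mini-batch estimate of $\partial C_0 / \partial w$), and conveniently np.sign returns $0$ at $0$, implementing the convention just described:

import numpy as np

def l1_step(w, grad_C0, eta, lmbda, n):
    # Equation (97): shrink each weight by a constant amount toward 0,
    # then take the usual gradient step.  np.sign(0) == 0, so a weight
    # that is exactly zero gets no regularization shrinkage.
    return w - (eta * lmbda / n) * np.sign(w) - eta * grad_C0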
Dropout: Dropout is a radically different technique for
regularization. Unlike L1 and L2 regularization, dropout doesn't rely
on modifying the cost function. Instead, in dropout we modify the
network itself. Let me describe the basic mechanics of how dropout
works, before getting into why it works, and what the results are.
Suppose we're trying to train a network:
In particular, suppose we have a training input $x$ and corresponding
desired output $y$. Ordinarily, we'd train by forward-propagating $x$
through the network, and then backpropagating to determine the
contribution to the gradient. With dropout, this process is modified.
We start by randomly (and temporarily) deleting half the hidden
neurons in the network, while leaving the input and output neurons
untouched. After doing this, we'll end up with a network along the
following lines. Note that the dropout neurons, i.e., the neurons
which have been temporarily deleted, are still ghosted in:
We forward-propagate the input $x$ through the modified network, and
then backpropagate the result, also through the modified network.
After doing this over a mini-batch of examples, we update the
appropriate weights and biases. We then repeat the process, first
restoring the dropout neurons, then choosing a new random subset of
hidden neurons to delete, estimating the gradient for a different
mini-batch, and updating the weights and biases in the network.
By repeating this process over and over, our network will learn a set
of weights and biases. Of course, those weights and biases will have
been learnt under conditions in which half the hidden neurons were
dropped out. When we actually run the full network that means that
twice as many hidden neurons will be active. To compensate for that,
we halve the weights outgoing from the hidden neurons.
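To make the mechanics concrete, here's a rough sketch of what the forward pass might look like for a single hidden layer. All the names (w1, b1, and so on) are hypothetical, and this is an illustration of the idea, not how our network code is actually organized:

import numpy as np

rng = np.random.default_rng()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2, p_drop=0.5, training=True):
    hidden = sigmoid(w1 @ x + b1)
    if training:
        # Temporarily "delete" a random half of the hidden neurons.
        mask = (rng.random(hidden.shape) >= p_drop).astype(float)
        return sigmoid(w2 @ (hidden * mask) + b2)
    # At test time all hidden neurons are active, so halve the
    # weights outgoing from the hidden layer to compensate.
    return sigmoid((1.0 - p_drop) * (w2 @ hidden) + b2)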
This dropout procedure may seem strange and ad hoc. Why would
we expect it to help with regularization? To explain what's going on,
I'd like you to briefly stop thinking about dropout, and instead
imagine training neural networks in the standard way (no dropout). In
particular, imagine we train several different neural networks, all
using the same training data. Of course, the networks may not start
out identical, and as a result after training they may sometimes give
different results. When that happens we could use some kind of
averaging or voting scheme to decide which output to accept. For
instance, if we have trained five networks, and three of them are
classifying a digit as a "3", then it probably really is a "3".
The other two networks are probably just making a mistake. This kind
of averaging scheme is often found to be a powerful (though expensive)
way of reducing overfitting. The reason is that the different
networks may overfit in different ways, and averaging may help
eliminate that kind of overfitting.
What's this got to do with dropout? Heuristically, when we dropout
different sets of neurons, it's rather like we're training different
neural networks. And so the dropout procedure is like averaging the
effects of a very large number of different networks. The different
networks will overfit in different ways, and so, hopefully, the net
effect of dropout will be to reduce overfitting.
A related heuristic explanation for dropout is given in one of the
earliest papers to use the
technique*
*ImageNet
Classification with Deep Convolutional Neural Networks, by Alex
Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This
technique reduces complex co-adaptations of neurons, since a neuron
cannot rely on the presence of particular other neurons. It is,
therefore, forced to learn more robust features that are useful in
conjunction with many different random subsets of the other neurons."
In other words, if we think of our network as a model which is making
predictions, then we can think of dropout as a way of making sure that
the model is robust to the loss of any individual piece of evidence.
In this, it's somewhat similar to L1 and L2 regularization, which tend
to reduce weights, and thus make the network more robust to losing any
individual connection in the network.
Of course, the true measure of dropout is that it has been very
successful in improving the performance of neural networks. The
original
paper*
*Improving
neural networks by preventing co-adaptation of feature detectors
by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya
Sutskever, and Ruslan Salakhutdinov (2012). Note that the paper
discusses a number of subtleties that I have glossed over in this
brief introduction. introducing the technique applied it to many
different tasks. For us, it's of particular interest that they applied
dropout to MNIST digit classification, using a vanilla feedforward
neural network along lines similar to those we've been considering.
The paper noted that the best result anyone had achieved up to that
point using such an architecture was $98.4$ percent classification
accuracy on the test set. They improved that to $98.7$ percent
accuracy using a combination of dropout and a modified form of L2
regularization. Similarly impressive results have been obtained for
many other tasks, including problems in image and speech recognition,
and natural language processing. Dropout has been especially useful
in training large, deep networks, where the problem of overfitting is
often acute.
Artificially expanding the training data: We saw earlier that
our MNIST classification accuracy dropped down to percentages in the
mid-80s when we used only 1,000 training images. It's not surprising
that this is the case, since less training data means our network will
be exposed to fewer variations in the way human beings write digits.
Let's try training our 30 hidden neuron network with a variety of
different training data set sizes, to see how performance varies. We
train using a mini-batch size of 10, a learning rate $\eta = 0.5$, a
regularization parameter $\lambda = 5.0$, and the cross-entropy cost
function. We will train for 30 epochs when the full training data set
is used, and scale up the number of epochs proportionally when smaller
training sets are used. To ensure the weight decay factor remains the
same across training sets, we will use a regularization parameter of
$\lambda = 5.0$ when the full training data set is used, and scale
down $\lambda$ proportionally when smaller training sets are
used*
*This and the next two graphs are produced with the
program
more_data.py..
As you can see, the classification accuracies improve considerably as
we use more training data. Presumably this improvement would continue
still further if more data was available. Of course, looking at the
graph above it does appear that we're getting near saturation.
Suppose, however, that we redo the graph with the training set size
plotted logarithmically:
It seems clear that the graph is still going up toward the end. This
suggests that if we used vastly more training data - say, millions
or even billions of handwriting samples, instead of just 50,000 -
then we'd likely get considerably better performance, even from this
very small network.
Obtaining more training data is a great idea. Unfortunately, it can be
expensive, and so is not always possible in practice. However,
there's another idea which can work nearly as well, and that's to
artificially expand the training data. Suppose, for example, that we
take an MNIST training image of a five,
and rotate it by a small amount, let's say 15 degrees:
It's still recognizably the same digit. And yet at the pixel level
it's quite different to any image currently in the MNIST training
data. It's conceivable that adding this image to the training data
might help our network learn more about how to classify digits.
What's more, obviously we're not limited to adding just this one
image. We can expand our training data by making many small
rotations of all the MNIST training images, and then using the
expanded training data to improve our network's performance.
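Here's a sketch of how such an expansion might be coded up, using scipy.ndimage.rotate; the particular angles are illustrative, and this isn't the program used for the results discussed below:

import numpy as np
from scipy.ndimage import rotate

def expand_with_rotations(images, angles=(-15, -10, -5, 5, 10, 15)):
    # images: array of shape (n, 28, 28), pixel values in [0, 1].
    expanded = [images]
    for angle in angles:
        # reshape=False keeps each rotated image at 28x28.
        rotated = rotate(images, angle, axes=(1, 2), reshape=False)
        expanded.append(np.clip(rotated, 0.0, 1.0))
    return np.concatenate(expanded, axis=0)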
This idea is very powerful and has been widely used. Let's look at
some of the results from a
paper*
*Best
Practices for Convolutional Neural Networks Applied to Visual
Document Analysis, by Patrice Simard, Dave Steinkraus, and John
Platt (2003). which applied several variations of the idea to
MNIST. One of the neural network architectures they considered was
along similar lines to what we've been using, a feedforward network
with 800 hidden neurons and using the cross-entropy cost function.
Running the network with the standard MNIST training data they
achieved a classification accuracy of 98.4 percent on their test set.
But then they expanded the training data, using not just rotations, as
I described above, but also translating and skewing the images. By
training on the expanded data set they increased their network's
accuracy to 98.9 percent. They also experimented with what they
called "elastic distortions", a special type of image distortion
intended to emulate the random oscillations found in hand muscles. By
using the elastic distortions to expand the data they achieved an even
higher accuracy, 99.3 percent. Effectively, they were broadening the
experience of their network by exposing it to the sort of variations
that are found in real handwriting.
Variations on this idea can be used to improve performance on many
learning tasks, not just handwriting recognition. The general
principle is to expand the training data by applying operations that
reflect real-world variation. It's not difficult to think of ways of
doing this. Suppose, for example, that you're building a neural
network to do speech recognition. We humans can recognize speech even
in the presence of distortions such as background noise. And so you
can expand your data by adding background noise. We can also
recognize speech if it's sped up or slowed down. So that's another way
we can expand the training data. These techniques are not always used
- for instance, instead of expanding the training data by adding
noise, it may well be more efficient to clean up the input to the
network by first applying a noise reduction filter. Still, it's worth
keeping the idea of expanding the training data in mind, and looking
for opportunities to apply it.
Exercise
An aside on big data and what it means to compare classification accuracies: Let's look again at how our neural network's accuracy varies with training set size:
Suppose that instead of using a neural network we use some other machine learning technique to classify digits. For instance, let's try using the support vector machine (SVM) we met briefly back in Chapter 1. As was the case in Chapter 1, don't worry if you're not familiar with SVMs; we don't need to understand their details. Instead, we'll use the SVM supplied by the scikit-learn library. Here's how SVM performance varies as a function of training set size. I've plotted the neural net results as well, to make comparison easy* *This graph was produced with the program more_data.py (as were the last few graphs).:
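For concreteness, the SVM baseline is of the following out-of-the-box kind. This is only a sketch: the dummy arrays stand in for flattened MNIST images and their labels, which in practice you'd load for real:

from sklearn import svm
import numpy as np

# Dummy stand-ins for MNIST: images of shape (n, 784), labels 0-9.
rng = np.random.default_rng(0)
train_images = rng.random((200, 784))
train_labels = rng.integers(0, 10, 200)
test_images = rng.random((50, 784))
test_labels = rng.integers(0, 10, 50)

clf = svm.SVC()                      # out-of-the-box settings, no tuning
clf.fit(train_images, train_labels)
print(clf.score(test_images, test_labels))  # fraction correctly classified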
Probably the first thing that strikes you about this graph is that our neural network outperforms the SVM for every training set size. That's nice, although you shouldn't read too much into it, since I just used the out-of-the-box settings from scikit-learn's SVM, while we've done a fair bit of work improving our neural network. A more subtle but more interesting fact about the graph is that if we train our SVM using 50,000 images then it actually has better performance (94.48 percent accuracy) than our neural network does when trained using 5,000 images (93.24 percent accuracy). In other words, more training data can sometimes compensate for differences in the machine learning algorithm used.
Something even more interesting can occur. Suppose we're trying to solve a problem using two machine learning algorithms, algorithm A and algorithm B. It sometimes happens that algorithm A will outperform algorithm B with one set of training data, while algorithm B will outperform algorithm A with a different set of training data. We don't see that above - it would require the two graphs to cross - but it does happen* *Striking examples may be found in Scaling to very very large corpora for natural language disambiguation, by Michele Banko and Eric Brill (2001).. The correct response to the question "Is algorithm A better than algorithm B?" is really: "What training data set are you using?"
All this is a caution to keep in mind, both when doing development, and when reading research papers. Many papers focus on finding new tricks to wring out improved performance on standard benchmark data sets. "Our whiz-bang technique gave us an improvement of X percent on standard benchmark Y" is a canonical form of research claim. Such claims are often genuinely interesting, but they must be understood as applying only in the context of the specific training data set used. Imagine an alternate history in which the people who originally created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data. It's entirely possible that the "improvement" due to the whiz-bang technique would disappear on a larger data set. In other words, the purported improvement might be just an accident of history. The message to take away, especially in practical applications, is that what we want is both better algorithms and better training data. It's fine to look for better algorithms, but make sure you're not focusing on better algorithms to the exclusion of easy wins getting more or better training data.
Summing up: We've now completed our dive into overfitting and regularization. Of course, we'll return again to the issue. As I've mentioned several times, overfitting is a major problem in neural networks, especially as computers get more powerful, and we have the ability to train larger networks. As a result there's a pressing need to develop powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.
When we create our neural networks, we have to make choices for the initial weights and biases. Up to now, we've been choosing them according to a prescription which I discussed only briefly back in Chapter 1. Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean $0$ and standard deviation $1$. While this approach has worked well, it was quite ad hoc, and it's worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.
It turns out that we can do quite a bit better than initializing with normalized Gaussians. To see why, suppose we're working with a network with a large number - say $1,000$ - of input neurons. And let's suppose we've used normalized Gaussians to initialize the weights connecting to the first hidden layer. For now I'm going to concentrate specifically on the weights connecting the input neurons to the first neuron in the hidden layer, and ignore the rest of the network:
We'll suppose for simplicity that we're trying to train using a training input $x$ in which half the input neurons are on, i.e., set to $1$, and half the input neurons are off, i.e., set to $0$. The argument which follows applies more generally, but you'll get the gist from this special case. Let's consider the weighted sum $z = \sum_j w_j x_j+b$ of inputs to our hidden neuron. $500$ terms in this sum vanish, because the corresponding input $x_j$ is zero. And so $z$ is a sum over a total of $501$ normalized Gaussian random variables, accounting for the $500$ weight terms and the $1$ extra bias term. Thus $z$ is itself distributed as a Gaussian with mean zero and standard deviation $\sqrt{501} \approx 22.4$. That is, $z$ has a very broad Gaussian distribution, not sharply peaked at all:
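A quick simulation confirms this spread, under the stated assumption of $500$ active inputs plus the bias:

import numpy as np

rng = np.random.default_rng(0)
# z is a sum of 501 independent N(0, 1) variables: 500 weights
# (one per active input) plus the bias.
z = rng.standard_normal((100000, 501)).sum(axis=1)
print(z.std())   # approximately sqrt(501), i.e. about 22.4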
In particular, we can see from this graph that it's quite likely that $|z|$ will be pretty large, i.e., either $z \gg 1$ or $z \ll -1$. If that's the case then the output $\sigma(z)$ from the hidden neuron will be very close to either $1$ or $0$. That means our hidden neuron will have saturated. And when that happens, as we know, making small changes in the weights will make only absolutely minuscule changes in the activation of our hidden neuron. That minuscule change in the activation of the hidden neuron will, in turn, barely affect the rest of the neurons in the network at all, and we'll see a correspondingly minuscule change in the cost function. As a result, those weights will only learn very slowly when we use the gradient descent algorithm* *We discussed this in more detail in Chapter 2, where we used the equations of backpropagation to show that weights input to saturated neurons learned slowly.. It's similar to the problem we discussed earlier in this chapter, in which output neurons which saturated on the wrong value caused learning to slow down. We addressed that earlier problem with a clever choice of cost function. Unfortunately, while that helped with saturated output neurons, it does nothing at all for the problem with saturated hidden neurons.
I've been talking about the weights input to the first hidden layer. Of course, similar arguments apply also to later hidden layers: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to $0$ or $1$, and learning will proceed very slowly.
Is there some way we can choose better initializations for the weights and biases, so that we don't get this kind of saturation, and so avoid a learning slowdown? Suppose we have a neuron with $n_{\rm in}$ input weights. Then we shall initialize those weights as Gaussian random variables with mean $0$ and standard deviation $1/\sqrt{n_{\rm in}}$. That is, we'll squash the Gaussians down, making it less likely that our neuron will saturate. We'll continue to choose the bias as a Gaussian with mean $0$ and standard deviation $1$, for reasons I'll return to in a moment. With these choices, the weighted sum $z = \sum_j w_j x_j + b$ will again be a Gaussian random variable with mean $0$, but it'll be much more sharply peaked than it was before. Suppose, as we did earlier, that $500$ of the inputs are zero and $500$ are $1$. Then it's easy to show (see the exercise below) that $z$ has a Gaussian distribution with mean $0$ and standard deviation $\sqrt{3/2} = 1.22\ldots$. This is much more sharply peaked than before, so much so that even the graph below understates the situation, since I've had to rescale the vertical axis, when compared to the earlier graph:
Such a neuron is much less likely to saturate, and correspondingly much less likely to have problems with a learning slowdown.
I stated above that we'll continue to initialize the biases as before, as Gaussian random variables with a mean of $0$ and a standard deviation of $1$. This is okay, because it doesn't make it too much more likely that our neurons will saturate. In fact, it doesn't much matter how we initialize the biases, provided we avoid the problem with saturation. Some people go so far as to initialize all the biases to $0$, and rely on gradient descent to learn appropriate biases. But since it's unlikely to make much difference, we'll continue with the same initialization procedure as before.
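Put as code, the new scheme looks something like the following sketch, along the lines of the initializer in network2.py (the function name and layout here are my own):

import numpy as np

def new_initializer(sizes):
    # sizes: the layer sizes, e.g. [784, 30, 10].
    # Weights: mean 0, standard deviation 1/sqrt(n_in) for each neuron.
    weights = [np.random.randn(y, x) / np.sqrt(x)
               for x, y in zip(sizes[:-1], sizes[1:])]
    # Biases: mean 0, standard deviation 1, just as before.
    biases = [np.random.randn(y, 1) for y in sizes[1:]]
    return weights, biases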
Let's compare the results for both our old and new approaches to weight initialization, using the MNIST digit classification task. As before, we'll use $30$ hidden neurons, a mini-batch size of $10$, a regularization parameter $\lambda = 5.0$, and the cross-entropy cost function. We will decrease the learning rate slightly from $\eta = 0.5$ to $0.1$, since that makes the results a little more easily visible in the graphs. We can train using the old method of weight initialization:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
...     evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Here, for reference, is how network2.py implements the cross-entropy cost:

class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        return (a-y)
Let's break this down. The first thing to observe is that even though
the cross-entropy is, mathematically speaking, a function, we've
implemented it as a Python class, not a Python function. Why have I
made that choice? The reason is that the cost plays two different
roles in our network. The obvious role is that it's a measure of how
well an output activation, a, matches the desired output,
y. This role is captured by the CrossEntropyCost.fn
method. (Note, by the way, that the np.nan_to_num call inside
CrossEntropyCost.fn ensures that Numpy deals correctly with the
log of numbers very close to zero.) But there's also a second way the
cost function enters our network. Recall from
Chapter
2 that when running the backpropagation algorithm we need to
compute the network's output error, $\delta^L$. The form of the output
error depends on the choice of cost function: different cost function,
different form for the output error. For the cross-entropy the output
error is, as we saw in Equation (66),
\begin{eqnarray}
\delta^L = a^L-y.
\tag{99}\end{eqnarray}
For this reason we define a second method,
CrossEntropyCost.delta, whose purpose is to tell our network
how to compute the output error. And then we bundle these two methods
up into a single class containing everything our networks need to know
about the cost function.
In a similar way, network2.py also contains a class to
represent the quadratic cost function. This is included for
comparison with the results of Chapter 1, since going forward we'll
mostly use the cross entropy. The code is just below. The
QuadraticCost.fn method is a straightforward computation of the
quadratic cost associated to the actual output, a, and the
desired output, y. The value returned by
QuadraticCost.delta is based on the
expression (30), $\delta^L = (a^L-y) \odot \sigma'(z^L)$, for the output error for the
quadratic cost, which we derived back in Chapter 2.
class QuadraticCost(object):
    @staticmethod
    def fn(a, y):
        return 0.5*np.linalg.norm(a-y)**2
    @staticmethod
    def delta(z, a, y):
        return (a-y) * sigmoid_prime(z)
...
That's better! And so we can continue, individually adjusting each
hyper-parameter, gradually improving performance. Once we've explored
to find an improved value for $\eta$, then we move on to find a good
value for $\lambda$. Then experiment with a more complex
architecture, say a network with 10 hidden neurons. Then adjust the
values for $\eta$ and $\lambda$ again. Then increase to 20 hidden
neurons. And then adjust other hyper-parameters some more. And so
on, at each stage evaluating performance using our held-out validation
data, and using those evaluations to find better and better
hyper-parameters. As we do so, it typically takes longer to witness
the impact due to modifications of the hyper-parameters, and so we can
gradually decrease the frequency of monitoring.
This all looks very promising as a broad strategy. However, I want to
return to that initial stage of finding hyper-parameters that enable a
network to learn anything at all. In fact, even the above discussion
conveys too positive an outlook. It can be immensely frustrating to
work with a network that's learning nothing. You can tweak
hyper-parameters for days, and still get no meaningful response. And
so I'd like to re-emphasize that during the early stages you should
make sure you can get quick feedback from experiments. Intuitively,
it may seem as though simplifying the problem and the architecture
will merely slow you down. In fact, it speeds things up, since you
much more quickly find a network with a meaningful signal. Once
you've got such a signal, you can often get rapid improvements by
tweaking the hyper-parameters. As with many things in life, getting
started can be the hardest thing to do.
Okay, that's the broad strategy. Let's now look at some specific
recommendations for setting hyper-parameters. I will focus on the
learning rate, $\eta$, the L2 regularization parameter, $\lambda$, and
the mini-batch size. However, many of the remarks apply also to other
hyper-parameters, including those associated to network architecture,
other forms of regularization, and some hyper-parameters we'll meet
later in the book, such as the momentum co-efficient.
Learning rate: Suppose we run three MNIST networks with three
different learning rates, $\eta = 0.025$, $\eta = 0.25$ and $\eta =
2.5$, respectively. We'll set the other hyper-parameters as for the
experiments in earlier sections, running over 30 epochs, with a
mini-batch size of 10, and with $\lambda = 5.0$. We'll also return to
using the full $50,000$ training images. Here's a graph showing the
behaviour of the training cost as we train*
*The graph was
generated by
multiple_eta.py.:
With $\eta = 0.025$ the cost decreases smoothly until the final epoch.
With $\eta = 0.25$ the cost initially decreases, but after about $20$
epochs it is near saturation, and thereafter most of the changes are
merely small and apparently random oscillations. Finally, with $\eta
= 2.5$ the cost makes large oscillations right from the start. To
understand the reason for the oscillations, recall that stochastic
gradient descent is supposed to step us gradually down into a valley
of the cost function.
However, if $\eta$ is too large then the steps will be so large that
they may actually overshoot the minimum, causing the algorithm to
climb up out of the valley instead. That's likely*
*This
picture is helpful, but it's intended as an intuition-building
illustration of what may go on, not as a complete, exhaustive
explanation. Briefly, a more complete explanation is as follows:
gradient descent uses a first-order approximation to the cost
function as a guide to how to decrease the cost. For large $\eta$,
higher-order terms in the cost function become more important, and
may dominate the behaviour, causing gradient descent to break down.
This is especially likely as we approach minima and quasi-minima of
the cost function, since near such points the gradient becomes
small, making it easier for higher-order terms to dominate
behaviour. what's causing the cost to oscillate when $\eta = 2.5$.
When we choose $\eta = 0.25$ the initial steps do take us toward a
minimum of the cost function, and it's only once we get near that
minimum that we start to suffer from the overshooting problem. And
when we choose $\eta = 0.025$ we don't suffer from this problem at all
during the first $30$ epochs. Of course, choosing $\eta$ so small
creates another problem, namely, that it slows down stochastic
gradient descent. An even better approach would be to start with
$\eta = 0.25$, train for $20$ epochs, and then switch to $\eta =
0.025$. We'll discuss such variable learning rate schedules later.
For now, though, let's stick to figuring out how to find a single good
value for the learning rate, $\eta$.
With this picture in mind, we can set $\eta$ as follows. First, we
estimate the threshold value for $\eta$ at which the cost on the
training data immediately begins decreasing, instead of oscillating or
increasing. This estimate doesn't need to be too accurate. You can
estimate the order of magnitude by starting with $\eta = 0.01$. If
the cost decreases during the first few epochs, then you should
successively try $\eta = 0.1, 1.0, \ldots$ until you find a value for
$\eta$ where the cost oscillates or increases during the first few
epochs. Alternately, if the cost oscillates or increases during the
first few epochs when $\eta = 0.01$, then try $\eta = 0.001, 0.0001,
\ldots$ until you find a value for $\eta$ where the cost decreases
during the first few epochs. Following this procedure will give us an
order of magnitude estimate for the threshold value of $\eta$. You
may optionally refine your estimate, to pick out the largest value of
$\eta$ at which the cost decreases during the first few epochs, say
$\eta = 0.5$ or $\eta = 0.2$ (there's no need for this to be
super-accurate). This gives us an estimate for the threshold value of
$\eta$.
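To make the procedure concrete, here's a minimal sketch of the search. It assumes a hypothetical helper train_and_get_costs(eta, epochs), which trains a fresh network at learning rate eta for a few epochs and returns the list of per-epoch training costs; no such helper exists in network2.py:

def cost_decreases(eta, epochs=5):
    # Crude test: did the training cost fall over the first few epochs?
    costs = train_and_get_costs(eta, epochs)   # hypothetical helper
    return costs[-1] < costs[0]

def estimate_eta_threshold():
    eta = 0.01
    if cost_decreases(eta):
        # Cost decreases: scale up by 10 until it no longer does.
        while cost_decreases(eta * 10):
            eta *= 10
    else:
        # Cost oscillates or increases: scale down until it decreases.
        while not cost_decreases(eta):
            eta /= 10
    return eta   # largest tested order of magnitude at which the cost decreases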
Obviously, the actual value of $\eta$ that you use should be no larger
than the threshold value. In fact, if the value of $\eta$ is to
remain usable over many epochs then you likely want to use a value
for $\eta$ that is smaller, say, a factor of two below the threshold.
Such a choice will typically allow you to train for many epochs,
without causing too much of a slowdown in learning.
In the case of the MNIST data, following this strategy leads to an
estimate of $0.1$ for the order of magnitude of the threshold value of
$\eta$. After some more refinement, we obtain a threshold value $\eta
= 0.5$. Following the prescription above, this suggests using $\eta =
0.25$ as our value for the learning rate. In fact, I found that using
$\eta = 0.5$ worked well enough over $30$ epochs that for the most
part I didn't worry about using a lower value of $\eta$.
This all seems quite straightforward. However, using the training
cost to pick $\eta$ appears to contradict what I said earlier in this
section, namely, that we'd pick hyper-parameters by evaluating
performance using our held-out validation data. In fact, we'll use
validation accuracy to pick the regularization hyper-parameter, the
mini-batch size, and network parameters such as the number of layers
and hidden neurons, and so on. Why do things differently for the
learning rate? Frankly, this choice is my personal aesthetic
preference, and is perhaps somewhat idiosyncratic. The reasoning is
that the other hyper-parameters are intended to improve the final
classification accuracy on the test set, and so it makes sense to
select them on the basis of validation accuracy. However, the
learning rate is only incidentally meant to impact the final
classification accuracy. Its primary purpose is really to control
the step size in gradient descent, and monitoring the training cost is
the best way to detect if the step size is too big. With that said,
this is a personal aesthetic preference. Early on during learning the
training cost usually only decreases if the validation accuracy
improves, and so in practice it's unlikely to make much difference
which criterion you use.
Use early stopping to determine the number of training
epochs: As we discussed earlier in the chapter, early stopping means
that at the end of each epoch we should compute the classification
accuracy on the validation data. When that stops improving,
terminate. This makes setting the number of epochs very simple. In
particular, it means that we don't need to worry about explicitly
figuring out how the number of epochs depends on the other
hyper-parameters. Instead, that's taken care of automatically.
Furthermore, early stopping also automatically prevents us from
overfitting. This is, of course, a good thing, although in the early
stages of experimentation it can be helpful to turn off early
stopping, so you can see any signs of overfitting, and use it to
inform your approach to regularization.
To implement early stopping we need to say more precisely what it
means that the classification accuracy has stopped improving. As
we've seen, the accuracy can jump around quite a bit, even when the
overall trend is to improve. If we stop the first time the accuracy
decreases then we'll almost certainly stop when there are more
improvements to be had. A better rule is to terminate if the best
classification accuracy doesn't improve for quite some time. Suppose,
for example, that we're doing MNIST. Then we might elect to terminate
if the classification accuracy hasn't improved during the last ten
epochs. This ensures that we don't stop too soon, in response to bad
luck in training, but also that we're not waiting around forever for
an improvement that never comes.
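In code, the rule is just a counter. Here's a minimal sketch, assuming hypothetical train_one_epoch and validation_accuracy helpers (they aren't functions in network2.py), with the patience parameter set to ten:

best_accuracy = 0.0
epochs_since_improvement = 0
patience = 10          # the "ten" in no-improvement-in-ten
max_epochs = 400       # generous cap; early stopping usually triggers first

for epoch in range(max_epochs):
    train_one_epoch()                  # hypothetical: one epoch of SGD
    accuracy = validation_accuracy()   # hypothetical: evaluate on validation data
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        epochs_since_improvement = 0
    else:
        epochs_since_improvement += 1
    if epochs_since_improvement >= patience:
        break   # no improvement in `patience` epochs: stop training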
This no-improvement-in-ten rule is good for initial exploration of
MNIST. However, networks can sometimes plateau near a particular
classification accuracy for quite some time, only to then begin
improving again. If you're trying to get really good performance, the
no-improvement-in-ten rule may be too aggressive about stopping. In
that case, I suggest using the no-improvement-in-ten rule for initial
experimentation, and gradually adopting more lenient rules, as you
better understand the way your network trains:
no-improvement-in-twenty, no-improvement-in-fifty, and so on. Of
course, this introduces a new hyper-parameter to optimize! In
practice, however, it's usually easy to set this hyper-parameter to
get pretty good results. Similarly, for problems other than MNIST,
the no-improvement-in-ten rule may be much too aggressive or not
nearly aggressive enough, depending on the details of the problem.
However, with a little experimentation it's usually easy to find a
pretty good strategy for early stopping.
We haven't used early stopping in our MNIST experiments to date. The
reason is that we've been doing a lot of comparisons between different
approaches to learning. For such comparisons it's helpful to use the
same number of epochs in each case. However, it's well worth
modifying network2.py to implement early stopping:
Problem
Modify network2.py so that it implements early stopping using a no-improvement-in-$n$ epochs strategy, where $n$ is a parameter that can be set.
Learning rate schedule: We've been holding the learning rate $\eta$ constant. However, it's often advantageous to vary the learning rate. Early on during the learning process it's likely that the weights are badly wrong. And so it's best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
How should we set our learning rate schedule? Many approaches are possible. One natural approach is to use the same basic idea as early stopping. The idea is to hold the learning rate constant until the validation accuracy starts to get worse. Then decrease the learning rate by some amount, say a factor of two or ten. We repeat this many times, until, say, the learning rate is a factor of 1,024 (or 1,000) times lower than the initial value. Then we terminate.
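Schematically, such a schedule might look like the following sketch, which reuses the no-improvement-in-$n$ counter from above (train_one_epoch and validation_accuracy remain hypothetical helpers):

eta = 0.5                  # illustrative starting learning rate
final_eta = eta / 1024.0   # terminate once eta has been halved ten times
best_accuracy, stalled = 0.0, 0

while eta > final_eta:
    train_one_epoch(eta)
    accuracy = validation_accuracy()
    if accuracy > best_accuracy:
        best_accuracy, stalled = accuracy, 0
    else:
        stalled += 1
    if stalled >= 10:   # validation accuracy has stopped improving
        eta /= 2.0      # halve the learning rate...
        stalled = 0     # ...and give the network time at the new rate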
A variable learning schedule can improve performance, but it also opens up a world of possible choices for the learning schedule. Those choices can be a headache - you can spend forever trying to optimize your learning schedule. For first experiments my suggestion is to use a single, constant value for the learning rate. That'll get you a good first approximation. Later, if you want to obtain the best performance from your network, it's worth experimenting with a learning schedule, along the lines I've described* *A readable recent paper which demonstrates the benefits of variable learning rates in attacking MNIST is Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber (2010)..
The regularization parameter, $\lambda$: I suggest starting initially with no regularization ($\lambda = 0.0$), and determining a value for $\eta$, as above. Using that choice of $\eta$, we can then use the validation data to select a good value for $\lambda$. Start by trialling $\lambda = 1.0$* *I don't have a good principled justification for using this as a starting value. If anyone knows of a good principled discussion of where to start with $\lambda$, I'd appreciate hearing it (mn@michaelnielsen.org)., and then increase or decrease by factors of $10$, as needed to improve performance on the validation data. Once you've found a good order of magnitude, you can fine tune your value of $\lambda$. That done, you should return and re-optimize $\eta$ again.
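As a sketch, the order-of-magnitude stage of that search could look like this, where validation_accuracy_for(lmbda) is a hypothetical helper that trains with the chosen $\eta$ and the given regularization parameter, then returns the validation accuracy:

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]   # factors of 10 around 1.0
scores = {lmbda: validation_accuracy_for(lmbda) for lmbda in candidates}
best_lmbda = max(scores, key=scores.get)
# Then fine-tune within that order of magnitude, e.g. if best_lmbda is
# 10.0, try values such as 2.0, 5.0, 20.0, 50.0, before re-optimizing eta.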
How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you'll find that you get values for $\eta$ and $\lambda$ which don't always exactly match the values I've used earlier in the book. The reason is that the book has narrative constraints that have sometimes made it impractical to optimize the hyper-parameters. Think of all the comparisons we've made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on. To make such comparisons meaningful, I've usually tried to keep hyper-parameters constant across the approaches being compared (or to scale them in an appropriate way). Of course, there's no reason for the same hyper-parameters to be optimal for all the different approaches to learning, so the hyper-parameters I've used are something of a compromise.
As an alternative to this compromise, I could have tried to optimize the heck out of the hyper-parameters for every single approach to learning. In principle that'd be a better, fairer approach, since then we'd see the best from every approach to learning. However, we've made dozens of comparisons along these lines, and in practice I found it too computationally expensive. That's why I've adopted the compromise of using pretty good (but not necessarily optimal) choices for the hyper-parameters.
Mini-batch size: How should we set the mini-batch size? To answer this question, let's first suppose that we're doing online learning, i.e., that we're using a mini-batch size of $1$.
The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.
Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. In a problem in the last chapter I pointed out that it's possible to use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of your hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size $100$, rather than computing the mini-batch gradient estimate by looping over the $100$ training examples separately. It might take (say) only $50$ times as long, rather than $100$ times as long.
Now, at first it seems as though this doesn't help us that much. With our mini-batch of size $100$ the learning rule for the weights looks like: \begin{eqnarray} w \rightarrow w' = w-\eta \frac{1}{100} \sum_x \nabla C_x, \tag{100}\end{eqnarray} where the sum is over training examples in the mini-batch. This is versus \begin{eqnarray} w \rightarrow w' = w-\eta \nabla C_x \tag{101}\end{eqnarray} for online learning. Even if it only takes $50$ times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor $100$, so the update rule becomes \begin{eqnarray} w \rightarrow w' = w-\eta \sum_x \nabla C_x. \tag{102}\end{eqnarray} That's a lot like doing $100$ separate instances of online learning with a learning rate of $\eta$. But it only takes $50$ times as long as doing a single instance of online learning. Of course, it's not truly the same as $100$ instances of online learning, since in the mini-batch the $\nabla C_x$'s are all evaluated for the same set of weights, as opposed to the cumulative learning that occurs in the online case. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
With these factors in mind, choosing the best mini-batch size is a compromise. Too small, and you don't get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. Too large and you're simply not updating your weights often enough. What you need is to choose a compromise value which maximizes the speed of learning. Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so you don't need to have optimized those hyper-parameters in order to find a good mini-batch size. The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then trial a number of different mini-batch sizes, scaling $\eta$ as above. Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance. With the mini-batch size chosen you can then proceed to optimize the other hyper-parameters.
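A sketch of that experiment, again with hypothetical train_for_one_epoch and validation_accuracy helpers, and with $\eta$ scaled linearly in the mini-batch size per the argument above:

import time

base_eta = 0.05   # illustrative value for mini-batch size 1
curves = {}       # mini-batch size -> list of (elapsed seconds, accuracy)

for batch_size in [1, 10, 100, 1000]:
    eta = base_eta * batch_size   # scale eta with the mini-batch size
    start, curve = time.time(), []
    for epoch in range(30):
        train_for_one_epoch(batch_size, eta)   # hypothetical helper
        curve.append((time.time() - start, validation_accuracy()))
    curves[batch_size] = curve
# Plot accuracy against elapsed wall-clock time (not epochs!) for each
# mini-batch size, and pick whichever size improves fastest.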
Of course, as you've no doubt realized, I haven't done this optimization in our work. Indeed, our implementation doesn't use the faster approach to mini-batch updates at all. I've simply used a mini-batch size of $10$ without comment or explanation in nearly all examples. Because of this, we could have sped up learning by reducing the mini-batch size. I haven't done this, in part because I wanted to illustrate the use of mini-batches beyond size $1$, and in part because my preliminary experiments suggested the speedup would be rather modest. In practical implementations, however, we would most certainly implement the faster approach to mini-batch updates, and then make an effort to optimize the mini-batch size, in order to maximize our overall speed.
Automated techniques: I've been describing these heuristics as though you're optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* *Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012). by James Bergstra and Yoshua Bengio. Many more sophisticated approaches have also been proposed. I won't review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters* *Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams.. The code from the paper is publicly available, and has been used with some success by other researchers.
Summing up: Following the rules-of-thumb I've described won't give you the absolute best possible results from your neural network. But it will likely give you a good start and a basis for further improvements. In particular, I've discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with $\eta$, feel that you've got it just right, then start to optimize for $\lambda$, only to find that it's messing up your optimization for $\eta$. In practice, it helps to bounce backward and forward, gradually closing in on good values. Above all, keep in mind that the heuristics I've described are rules of thumb, not rules cast in stone. You should be on the lookout for signs that things aren't working, and be willing to experiment. In particular, this means carefully monitoring your network's behaviour, especially the validation accuracy.
The difficulty of choosing hyper-parameters is exacerbated by the fact that the lore about how to choose hyper-parameters is widely spread, across many research papers and software programs, and often is only available inside the heads of individual practitioners. There are many, many papers setting out (sometimes contradictory) recommendations for how to proceed. However, there are a few particularly useful papers that synthesize and distill out much of this lore. Yoshua Bengio has a 2012 paper* *Practical recommendations for gradient-based training of deep architectures, by Yoshua Bengio (2012). that gives some practical recommendations for using backpropagation and gradient descent to train neural networks, including deep neural nets. Bengio discusses many issues in much more detail than I have, including how to do more systematic hyper-parameter searches. Another good paper is a 1998 paper* *Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998). by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller. Both these papers appear in an extremely useful 2012 book that collects many tricks commonly used in neural nets* *Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller.. The book is expensive, but many of the articles have been placed online by their respective authors with, one presumes, the blessing of the publisher, and may be located using a search engine.
One thing that becomes clear as you read these articles and, especially, as you engage in your own experiments, is that hyper-parameter optimization is not a problem that is ever completely solved. There's always another trick you can try to improve performance. There is a saying common among writers that books are never finished, only abandoned. The same is also true of neural network optimization: the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity. So your goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that's important.
The challenge of setting hyper-parameters has led some people to complain that neural networks require a lot of work when compared with other machine learning techniques. I've heard many variations on the following complaint: "Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or$\ldots$ insert your own favorite technique] and it just works. I don't have time to figure out just the right neural network." Of course, from a practical point of view it's good to have easy-to-apply techniques. This is particularly true when you're just getting started on a problem, and it may not be obvious whether machine learning can help solve the problem at all. On the other hand, if getting optimal performance is important, then you may need to try approaches that require more specialist knowledge. While it would be nice if machine learning were always easy, there is no a priori reason it should be trivially simple.
Each technique developed in this chapter is valuable to know in its own right, but that's not the only reason I've explained them. The larger point is to familiarize you with some of the problems which can occur in neural networks, and with a style of analysis which can help overcome those problems. In a sense, we've been learning how to think about neural nets. Over the remainder of this chapter I briefly sketch a handful of other techniques. These sketches are less in-depth than the earlier discussions, but should convey some feeling for the diversity of techniques available for use in neural networks.
Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.
Hessian technique: To begin our discussion it helps to put neural networks aside for a bit. Instead, we're just going to consider the abstract problem of minimizing a cost function $C$ which is a function of many variables, $w = w_1, w_2, \ldots$, so $C = C(w)$. By Taylor's theorem, the cost function can be approximated near a point $w$ by \begin{eqnarray} C(w+\Delta w) & = & C(w) + \sum_j \frac{\partial C}{\partial w_j} \Delta w_j \nonumber \\ & & + \frac{1}{2} \sum_{jk} \Delta w_j \frac{\partial^2 C}{\partial w_j \partial w_k} \Delta w_k + \ldots \tag{103}\end{eqnarray} We can rewrite this more compactly as \begin{eqnarray} C(w+\Delta w) = C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w + \ldots, \tag{104}\end{eqnarray} where $\nabla C$ is the usual gradient vector, and $H$ is a matrix known as the Hessian matrix, whose $jk$th entry is $\partial^2 C / \partial w_j \partial w_k$. Suppose we approximate $C$ by discarding the higher-order terms represented by $\ldots$ above, \begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w. \tag{105}\end{eqnarray} Using calculus we can show that the expression on the right-hand side can be minimized* *Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite. Intuitively, this means that the function $C$ looks like a valley locally, not a mountain or a saddle. by choosing \begin{eqnarray} \Delta w = -H^{-1} \nabla C. \tag{106}\end{eqnarray} Provided (105) is a good approximate expression for the cost function, we'd expect that moving from the point $w$ to $w+\Delta w = w-H^{-1} \nabla C$ should significantly decrease the cost function. That suggests a possible algorithm for minimizing the cost: choose a starting point $w$; update it to a new point $w' = w - H^{-1} \nabla C$, with the Hessian $H$ and gradient $\nabla C$ computed at $w$; and repeat, recomputing $H$ and $\nabla C$ at each new point.
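Outside neural networks, where $H$ is small enough to handle, the update is easy to demonstrate. Here's a minimal sketch on a made-up two-variable quadratic cost $C(w) = \frac{1}{2} w^T A w - b \cdot w$, whose Hessian is just $A$:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # positive definite, so C has a unique minimum
b = np.array([1.0, 1.0])

def grad_C(w):
    return A @ w - b             # gradient of C at w

w = np.array([5.0, -3.0])        # arbitrary starting point
for step in range(5):
    # w -> w - H^{-1} grad C; solve(H, g) avoids forming H^{-1} explicitly.
    w = w - np.linalg.solve(A, grad_C(w))

print(w, np.linalg.solve(A, b))  # both print the minimizer A^{-1} b

For a truly quadratic cost a single step lands exactly on the minimum; for a general cost $H$ and $\nabla C$ must be recomputed at each new point.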
This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization. There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function it's possible for the Hessian approach to avoid many pathologies that can occur in gradient descent. Furthermore, there are versions of the backpropagation algorithm which can be used to compute the Hessian.
If Hessian optimization is so great, why aren't we using it in our neural networks? Unfortunately, while it has many desirable properties, it has one very undesirable property: it's very difficult to apply in practice. Part of the problem is the sheer size of the Hessian matrix. Suppose you have a neural network with $10^7$ weights and biases. Then the corresponding Hessian matrix will contain $10^7 \times 10^7 = 10^{14}$ entries. That's a lot of entries! And that makes computing $H^{-1} \nabla C$ extremely difficult in practice. However, that doesn't mean that it's not useful to understand. In fact, there are many variations on gradient descent which are inspired by Hessian optimization, but which avoid the problem with overly-large matrices. Let's take a look at one such technique, momentum-based gradient descent.
Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of "velocity" for the parameters we're trying to optimize. The gradient acts to change the velocity, not (directly) the "position", in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.
Let's give a more precise mathematical description. We introduce velocity variables $v = v_1, v_2, \ldots$, one for each corresponding $w_j$ variable* *In a neural net the $w_j$ variables would, of course, include all weights and biases.. Then we replace the gradient descent update rule $w \rightarrow w'= w-\eta \nabla C$ by \begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \tag{107}\\ w & \rightarrow & w' = w+v'. \tag{108}\end{eqnarray} In these equations, $\mu$ is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it's helpful to first consider the case where $\mu = 1$, which corresponds to no friction. When that's the case, inspection of the equations shows that the "force" $\nabla C$ is now modifying the velocity, $v$, and the velocity is controlling the rate of change of $w$. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we're moving straight down a slope:
With each step the velocity gets larger down the slope, so we move more and more quickly to the bottom of the valley. This can enable the momentum technique to work much faster than standard gradient descent. Of course, a problem is that once we reach the bottom of the valley we will overshoot. Or, if the gradient should change rapidly, then we could find ourselves moving in the wrong direction. That's the reason for the $\mu$ hyper-parameter in (107). I said earlier that $\mu$ controls the amount of friction in the system; to be a little more precise, you should think of $1-\mu$ as the amount of friction in the system. When $\mu = 1$, as we've seen, there is no friction, and the velocity is completely driven by the gradient $\nabla C$. By contrast, when $\mu = 0$ there's a lot of friction, the velocity can't build up, and Equations (107) and (108) reduce to the usual equation for gradient descent, $w \rightarrow w'=w-\eta \nabla C$. In practice, using a value of $\mu$ intermediate between $0$ and $1$ can give us much of the benefit of being able to build up speed, but without causing overshooting. We can choose such a value for $\mu$ using the held-out validation data, in much the same way as we select $\eta$ and $\lambda$.
I've avoided naming the hyper-parameter $\mu$ up to now. The reason is that the standard name for $\mu$ is badly chosen: it's called the momentum co-efficient. This is potentially confusing, since $\mu$ is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum co-efficient is widely used, so we will continue to use it.
A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it's done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
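A minimal sketch of that modification, using a made-up quadratic cost so the fragment runs on its own (in a real network grad_C would come from backpropagation on a mini-batch):

import numpy as np

target = np.ones(3)          # made-up: C(w) = 0.5*||w - target||^2
def grad_C(w):
    return w - target

mu, eta = 0.9, 0.1           # illustrative momentum co-efficient and learning rate
w = np.zeros(3)              # the parameters being optimized
v = np.zeros_like(w)         # one velocity variable per parameter

for step in range(200):
    v = mu * v - eta * grad_C(w)   # Equation (107): the gradient drives the velocity
    w = w + v                      # Equation (108): the velocity drives the position

print(w)   # close to target: momentum-based gradient descent has found the minimum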
Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn't universal agreement on which is the best approach. As you go deeper into neural networks it's worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier* *Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998). introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results* *See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012). is Nesterov's accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we'll stick to stochastic gradient descent through the remainder of this book.
Up to now we've built our neural networks using sigmoid neurons. In principle, a network built from sigmoid neurons can compute any function. In practice, however, networks built using other model neurons sometimes outperform sigmoid networks. Depending on the application, networks based on such alternate models may learn faster, generalize better to test data, or perhaps do both. Let me mention a couple of alternate model neurons, to give you the flavor of some variations in common use.
Perhaps the simplest variation is the tanh (pronounced "tanch") neuron, which replaces the sigmoid function by the hyperbolic tangent function. The output of a tanh neuron with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \tanh(w \cdot x+b), \tag{109}\end{eqnarray} where $\tanh$ is, of course, the hyperbolic tangent function. It turns out that this is very closely related to the sigmoid neuron. To see this, recall that the $\tanh$ function is defined by \begin{eqnarray} \tanh(z) \equiv \frac{e^z-e^{-z}}{e^z+e^{-z}}. \tag{110}\end{eqnarray} With a little algebra it can easily be verified that \begin{eqnarray} \sigma(z) = \frac{1+\tanh(z/2)}{2}, \tag{111}\end{eqnarray} that is, $\tanh$ is just a rescaled version of the sigmoid function. We can also see graphically that the $\tanh$ function has the same shape as the sigmoid function,
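The "little algebra" is short enough to show in full. Writing $\tanh(z/2)$ out and simplifying:
\begin{eqnarray}
\frac{1+\tanh(z/2)}{2} & = & \frac{1}{2} \left( 1 + \frac{e^{z/2}-e^{-z/2}}{e^{z/2}+e^{-z/2}} \right) = \frac{1}{2} \cdot \frac{2 e^{z/2}}{e^{z/2}+e^{-z/2}} \nonumber \\
& = & \frac{e^{z/2}}{e^{z/2}+e^{-z/2}} = \frac{1}{1+e^{-z}} = \sigma(z), \nonumber
\end{eqnarray}
where the last equality divides the numerator and denominator by $e^{z/2}$.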
One difference between tanh neurons and sigmoid neurons is that the output from tanh neurons ranges from -1 to 1, not 0 to 1. This means that if you're going to build a network based on tanh neurons you may need to normalize your outputs (and, depending on the details of the application, possibly your inputs) a little differently than in sigmoid networks.
Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function* *There are some technical caveats to this statement for both tanh and sigmoid neurons, as well as for the rectified linear neurons discussed below. However, informally it's usually fine to think of neural networks as being able to approximate any function to arbitrary accuracy. mapping inputs to the range -1 to 1. Furthermore, ideas such as backpropagation and stochastic gradient descent are as easily applied to a network of tanh neurons as to a network of sigmoid neurons.
Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better* *See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010).. Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we're using sigmoid neurons, so all activations in our network are positive. Let's consider the weights $w^{l+1}_{jk}$ input to the $j$th neuron in the $l+1$th layer. The rules for backpropagation (see here) tell us that the associated gradient will be $a^l_k \delta^{l+1}_j$. Because the activations are positive the sign of this gradient will be the same as the sign of $\delta^{l+1}_j$. What this means is that if $\delta^{l+1}_j$ is positive then all the weights $w^{l+1}_{jk}$ will decrease during gradient descent, while if $\delta^{l+1}_j$ is negative then all the weights $w^{l+1}_{jk}$ will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That's a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as $\tanh$, which allows both positive and negative activations. Indeed, because $\tanh$ is symmetric about zero, $\tanh(-z) = -\tanh(z)$, we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.
How seriously should we take this argument? While the argument is suggestive, it's a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don't yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.
Another variation on the sigmoid neuron is the rectified linear neuron or rectified linear unit. The output of a rectified linear unit with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \max(0, w \cdot x+b). \tag{112}\end{eqnarray} Graphically, the rectifying function $\max(0, z)$ looks like this:
Obviously such neurons are quite different from both sigmoid and tanh neurons. However, like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.
When should you use rectified linear units instead of sigmoid or tanh neurons? Some recent work on image recognition* *See, for example, What is the Best Multi-Stage Architecture for Object Recognition?, by Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun (2009), Deep Sparse Rectifier Neural Networks, by Xavier Glorot, Antoine Bordes, and Yoshua Bengio (2011), and ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012). Note that these papers fill in important details about how to set up the output layer, cost function, and regularization in networks using rectified linear units. I've glossed over all these details in this brief account. The papers also discuss in more detail the benefits and drawbacks of using rectified linear units. Another informative paper is Rectified Linear Units Improve Restricted Boltzmann Machines, by Vinod Nair and Geoffrey Hinton (2010), which demonstrates the benefits of using rectified linear units in a somewhat different approach to neural networks. has found considerable benefit in using rectified linear units through much of the network. However, as with tanh neurons, we do not yet have a really deep understanding of when, exactly, rectified linear units are preferable, nor why. To give you the flavor of some of the issues, recall that sigmoid neurons stop learning when they saturate, i.e., when their output is near either $0$ or $1$. As we've seen repeatedly in this chapter, the problem is that $\sigma'$ terms reduce the gradient, and that slows down learning. Tanh neurons suffer from a similar problem when they saturate. By contrast, increasing the weighted input to a rectified linear unit will never cause it to saturate, and so there is no corresponding learning slowdown. On the other hand, when the weighted input to a rectified linear unit is negative, the gradient vanishes, and so the neuron stops learning entirely. These are just two of the many issues that make it non-trivial to understand when and why rectified linear units perform better than sigmoid or tanh neurons.
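To make the saturation contrast concrete, here's a small numeric sketch comparing the derivative factors that scale the gradient for each neuron type:

import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # nearly 0 once the sigmoid saturates

def tanh_prime(z):
    return 1.0 - np.tanh(z)**2      # also nearly 0 for large |z|

def relu_prime(z):
    return 1.0 if z > 0 else 0.0    # 1 for positive input, exactly 0 otherwise

for z in [-10.0, 0.0, 10.0]:
    print(z, sigmoid_prime(z), tanh_prime(z), relu_prime(z))

At $z = 10$ the sigmoid and tanh derivatives are roughly $5 \times 10^{-5}$ and $8 \times 10^{-9}$, so learning has all but stopped, while the rectified linear unit still passes a gradient of $1$; at $z = -10$ the rectified linear unit's gradient is exactly $0$, and the neuron stops learning entirely.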
I've painted a picture of uncertainty here, stressing that we do not yet have a solid theory of how activation functions should be chosen. Indeed, the problem is harder even than I have described, for there are infinitely many possible activation functions. Which is the best for any given problem? Which will result in a network which learns fastest? Which will give the highest test accuracies? I am surprised how little really deep and systematic investigation has been done of these questions. Ideally, we'd have a theory which tells us, in detail, how to choose (and perhaps modify-on-the-fly) our activation functions. On the other hand, we shouldn't let the lack of a full theory stop us! We have powerful tools already at hand, and can make a lot of progress with those tools. Through the remainder of this book I'll continue to use sigmoid neurons as our go-to neuron, since they're powerful and provide concrete illustrations of the core ideas about neural nets. But keep in the back of your mind that these same ideas can be applied to other types of neuron, and that there are sometimes advantages in doing so.
Question: How do you
approach utilizing and researching machine learning techniques that
are supported almost entirely empirically, as opposed to
mathematically? Also in what situations have you noticed some of
these techniques fail? Answer: You have to realize that our theoretical
tools are very weak. Sometimes, we have good mathematical intuitions
for why a particular technique should work. Sometimes our intuition
ends up being wrong [...] The questions become: how well does my
method work on this particular problem, and how large is the set of
problems on which it works well.
- Question and answer with neural networks researcher Yann LeCun
Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with "I'm very sympathetic to your point of view, but [...]". Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn't always have unarguable new progress to report.
You may have noticed a verbal tic similar to "I'm very sympathetic [...]" in the current book. To explain what we're seeing I've often fallen back on saying "Heuristically, [...]", or "Roughly speaking, [...]", following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I've presented has often been pretty thin. If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?
In many parts of science - especially those parts that deal with simple phenomena - it's possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it's exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.
One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works* *From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we've discussed. Each heuristic is not just a (potential) explanation, it's also a challenge to investigate and understand in more detail.
Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
That's better! And so we can continue, individually adjusting each hyper-parameter, gradually improving performance. Once we've explored to find an improved value for $\eta$, then we move on to find a good value for $\lambda$. Then experiment with a more complex architecture, say a network with 10 hidden neurons. Then adjust the values for $\eta$ and $\lambda$ again. Then increase to 20 hidden neurons. And then adjust other hyper-parameters some more. And so on, at each stage evaluating performance using our held-out validation data, and using those evaluations to find better and better hyper-parameters. As we do so, it typically takes longer to witness the impact due to modifications of the hyper-parameters, and so we can gradually decrease the frequency of monitoring.
This all looks very promising as a broad strategy. However, I want to return to that initial stage of finding hyper-parameters that enable a network to learn anything at all. In fact, even the above discussion conveys too positive an outlook. It can be immensely frustrating to work with a network that's learning nothing. You can tweak hyper-parameters for days, and still get no meaningful response. And so I'd like to re-emphasize that during the early stages you should make sure you can get quick feedback from experiments. Intuitively, it may seem as though simplifying the problem and the architecture will merely slow you down. In fact, it speeds things up, since you much more quickly find a network with a meaningful signal. Once you've got such a signal, you can often get rapid improvements by tweaking the hyper-parameters. As with many things in life, getting started can be the hardest thing to do.
Okay, that's the broad strategy. Let's now look at some specific recommendations for setting hyper-parameters. I will focus on the learning rate, $\eta$, the L2 regularization parameter, $\lambda$, and the mini-batch size. However, many of the remarks apply also to other hyper-parameters, including those associated to network architecture, other forms of regularization, and some hyper-parameters we'll meet later in the book, such as the momentum co-efficient.
Learning rate: Suppose we run three MNIST networks with three different learning rates, $\eta = 0.025$, $\eta = 0.25$ and $\eta = 2.5$, respectively. We'll set the other hyper-parameters as for the experiments in earlier sections, running over 30 epochs, with a mini-batch size of 10, and with $\lambda = 5.0$. We'll also return to using the full $50,000$ training images. Here's a graph showing the behaviour of the training cost as we train* *The graph was generated by multiple_eta.py.:
With $\eta = 0.025$ the cost decreases smoothly until the final epoch. With $\eta = 0.25$ the cost initially decreases, but after about $20$ epochs it is near saturation, and thereafter most of the changes are merely small and apparently random oscillations. Finally, with $\eta = 2.5$ the cost makes large oscillations right from the start. To understand the reason for the oscillations, recall that stochastic gradient descent is supposed to step us gradually down into a valley of the cost function,
However, if $\eta$ is too large then the steps will be so large that they may actually overshoot the minimum, causing the algorithm to climb up out of the valley instead. That's likely* *This picture is helpful, but it's intended as an intuition-building illustration of what may go on, not as a complete, exhaustive explanation. Briefly, a more complete explanation is as follows: gradient descent uses a first-order approximation to the cost function as a guide to how to decrease the cost. For large $\eta$, higher-order terms in the cost function become more important, and may dominate the behaviour, causing gradient descent to break down. This is especially likely as we approach minima and quasi-minima of the cost function, since near such points the gradient becomes small, making it easier for higher-order terms to dominate behaviour. what's causing the cost to oscillate when $\eta = 2.5$. When we choose $\eta = 0.25$ the initial steps do take us toward a minimum of the cost function, and it's only once we get near that minimum that we start to suffer from the overshooting problem. And when we choose $\eta = 0.025$ we don't suffer from this problem at all during the first $30$ epochs. Of course, choosing $\eta$ so small creates another problem, namely, that it slows down stochastic gradient descent. An even better approach would be to start with $\eta = 0.25$, train for $20$ epochs, and then switch to $\eta = 0.025$. We'll discuss such variable learning rate schedules later. For now, though, let's stick to figuring out how to find a single good value for the learning rate, $\eta$.
With this picture in mind, we can set $\eta$ as follows. First, we estimate the threshold value for $\eta$ at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing. This estimate doesn't need to be too accurate. You can estimate the order of magnitude by starting with $\eta = 0.01$. If the cost decreases during the first few epochs, then you should successively try $\eta = 0.1, 1.0, \ldots$ until you find a value for $\eta$ where the cost oscillates or increases during the first few epochs. Alternately, if the cost oscillates or increases during the first few epochs when $\eta = 0.01$, then try $\eta = 0.001, 0.0001, \ldots$ until you find a value for $\eta$ where the cost decreases during the first few epochs. Following this procedure will give us an order of magnitude estimate for the threshold value of $\eta$. You may optionally refine your estimate, to pick out the largest value of $\eta$ at which the cost decreases during the first few epochs, say $\eta = 0.5$ or $\eta = 0.2$ (there's no need for this to be super-accurate). This gives us an estimate for the threshold value of $\eta$.
Obviously, the actual value of $\eta$ that you use should be no larger than the threshold value. In fact, if the value of $\eta$ is to remain usable over many epochs then you likely want to use a value for $\eta$ that is smaller, say, a factor of two below the threshold. Such a choice will typically allow you to train for many epochs, without causing too much of a slowdown in learning.
In the case of the MNIST data, following this strategy leads to an estimate of $0.1$ for the order of magnitude of the threshold value of $\eta$. After some more refinement, we obtain a threshold value $\eta = 0.5$. Following the prescription above, this suggests using $\eta = 0.25$ as our value for the learning rate. In fact, I found that using $\eta = 0.5$ worked well enough over $30$ epochs that for the most part I didn't worry about using a lower value of $\eta$.
This all seems quite straightforward. However, using the training cost to pick $\eta$ appears to contradict what I said earlier in this section, namely, that we'd pick hyper-parameters by evaluating performance using our held-out validation data. In fact, we'll use validation accuracy to pick the regularization hyper-parameter, the mini-batch size, and network parameters such as the number of layers and hidden neurons, and so on. Why do things differently for the learning rate? Frankly, this choice is my personal aesthetic preference, and is perhaps somewhat idiosyncratic. The reasoning is that the other hyper-parameters are intended to improve the final classification accuracy on the test set, and so it makes sense to select them on the basis of validation accuracy. However, the learning rate is only incidentally meant to impact the final classification accuracy. Its primary purpose is really to control the step size in gradient descent, and monitoring the training cost is the best way to detect if the step size is too big. With that said, this is a personal aesthetic preference. Early on during learning the training cost usually only decreases if the validation accuracy improves, and so in practice it's unlikely to make much difference which criterion you use.
Use early stopping to determine the number of training epochs: As we discussed earlier in the chapter, early stopping means that at the end of each epoch we should compute the classification accuracy on the validation data. When that stops improving, terminate. This makes setting the number of epochs very simple. In particular, it means that we don't need to worry about explicitly figuring out how the number of epochs depends on the other hyper-parameters. Instead, that's taken care of automatically. Furthermore, early stopping also automatically prevents us from overfitting. This is, of course, a good thing, although in the early stages of experimentation it can be helpful to turn off early stopping, so you can see any signs of overfitting, and use it to inform your approach to regularization.
To implement early stopping we need to say more precisely what it means that the classification accuracy has stopped improving. As we've seen, the accuracy can jump around quite a bit, even when the overall trend is to improve. If we stop the first time the accuracy decreases then we'll almost certainly stop when there are more improvements to be had. A better rule is to terminate if the best classification accuracy doesn't improve for quite some time. Suppose, for example, that we're doing MNIST. Then we might elect to terminate if the classification accuracy hasn't improved during the last ten epochs. This ensures that we don't stop too soon, in response to bad luck in training, but also that we're not waiting around forever for an improvement that never comes.
This no-improvement-in-ten rule is good for initial exploration of MNIST. However, networks can sometimes plateau near a particular classification accuracy for quite some time, only to then begin improving again. If you're trying to get really good performance, the no-improvement-in-ten rule may be too aggressive about stopping. In that case, I suggest using the no-improvement-in-ten rule for initial experimentation, and gradually adopting more lenient rules, as you better understand the way your network trains: no-improvement-in-twenty, no-improvement-in-fifty, and so on. Of course, this introduces a new hyper-parameter to optimize! In practice, however, it's usually easy to set this hyper-parameter to get pretty good results. Similarly, for problems other than MNIST, the no-improvement-in-ten rule may be much too aggressive or not nearly aggressive enough, depending on the details of the problem. However, with a little experimentation it's usually easy to find a pretty good strategy for early stopping.
We haven't used early stopping in our MNIST experiments to date. The reason is that we've been doing a lot of comparisons between different approaches to learning. For such comparisons it's helpful to use the same number of epochs in each case. However, it's well worth modifying network2.py to implement early stopping:
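Here is a minimal sketch of what such a modification might look like, written as a self-contained loop. The helpers train_one_epoch and validation_accuracy are hypothetical stand-ins for the corresponding operations in network2.py, not functions the program actually defines:

def sgd_with_early_stopping(train_one_epoch, validation_accuracy,
                            patience=10, max_epochs=1000):
    # Train until the best validation accuracy hasn't improved for
    # `patience` consecutive epochs (no-improvement-in-ten, by default).
    best_accuracy, epochs_since_best = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                   # one full pass of mini-batch SGD
        accuracy = validation_accuracy()    # accuracy on the validation data
        if accuracy > best_accuracy:
            best_accuracy, epochs_since_best = accuracy, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            break
    return best_accuracy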
Learning rate schedule: We've been holding the learning rate $\eta$ constant. However, it's often advantageous to vary the learning rate. Early on during the learning process it's likely that the weights are badly wrong. And so it's best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
How should we set our learning rate schedule? Many approaches are possible. One natural approach is to use the same basic idea as early stopping. The idea is to hold the learning rate constant until the validation accuracy starts to get worse. Then decrease the learning rate by some amount, say a factor of two or ten. We repeat this many times, until, say, the learning rate is a factor of 1,024 (or 1,000) times lower than the initial value. Then we terminate.
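In code, such a schedule might look something like the following sketch. The helpers train_one_epoch and validation_accuracy are again hypothetical stand-ins, and the particular constants (halving the rate, a patience of ten epochs) are merely illustrative:

def sgd_with_eta_schedule(train_one_epoch, validation_accuracy,
                          eta=0.5, patience=10, stop_factor=1024):
    # Hold eta constant until validation accuracy stops improving for
    # `patience` epochs, then halve eta; terminate once eta has dropped
    # by a factor of `stop_factor` from its initial value.
    final_eta = eta / stop_factor
    best_accuracy, epochs_since_best = 0.0, 0
    while eta > final_eta:
        train_one_epoch(eta)
        accuracy = validation_accuracy()
        if accuracy > best_accuracy:
            best_accuracy, epochs_since_best = accuracy, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            eta, epochs_since_best = eta / 2.0, 0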
A variable learning schedule can improve performance, but it also opens up a world of possible choices for the learning schedule. Those choices can be a headache - you can spend forever trying to optimize your learning schedule. For first experiments my suggestion is to use a single, constant value for the learning rate. That'll get you a good first approximation. Later, if you want to obtain the best performance from your network, it's worth experimenting with a learning schedule, along the lines I've described* *A readable recent paper which demonstrates the benefits of variable learning rates in attacking MNIST is Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber (2010)..
The regularization parameter, $\lambda$: I suggest starting initially with no regularization ($\lambda = 0.0$), and determining a value for $\eta$, as above. Using that choice of $\eta$, we can then use the validation data to select a good value for $\lambda$. Start by trialling $\lambda = 1.0$* *I don't have a good principled justification for using this as a starting value. If anyone knows of a good principled discussion of where to start with $\lambda$, I'd appreciate hearing it (mn@michaelnielsen.org)., and then increase or decrease by factors of $10$, as needed to improve performance on the validation data. Once you've found a good order of magnitude, you can fine tune your value of $\lambda$. That done, you should return and re-optimize $\eta$ again.
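As a rough sketch of the coarse search, assuming a hypothetical helper validation_accuracy_for(lmbda) which trains a fresh network with the given regularization parameter and returns its validation accuracy:

def coarse_lambda_search(validation_accuracy_for):
    # Trial values of lambda spaced by factors of 10, centered on the
    # starting value 1.0; keep whichever scores best on the validation
    # data, then fine tune around the winner.
    best_accuracy, best_lmbda = 0.0, None
    for lmbda in (0.01, 0.1, 1.0, 10.0, 100.0):
        accuracy = validation_accuracy_for(lmbda)
        if accuracy > best_accuracy:
            best_accuracy, best_lmbda = accuracy, lmbda
    return best_lmbda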
How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you'll find that you get values for $\eta$ and $\lambda$ which don't always exactly match the values I've used earlier in the book. The reason is that the book has narrative constraints that have sometimes made it impractical to optimize the hyper-parameters. Think of all the comparisons we've made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on. To make such comparisons meaningful, I've usually tried to keep hyper-parameters constant across the approaches being compared (or to scale them in an appropriate way). Of course, there's no reason for the same hyper-parameters to be optimal for all the different approaches to learning, so the hyper-parameters I've used are something of a compromise.
As an alternative to this compromise, I could have tried to optimize the heck out of the hyper-parameters for every single approach to learning. In principle that'd be a better, fairer approach, since then we'd see the best from every approach to learning. However, we've made dozens of comparisons along these lines, and in practice I found it too computationally expensive. That's why I've adopted the compromise of using pretty good (but not necessarily optimal) choices for the hyper-parameters.
Mini-batch size: How should we set the mini-batch size? To answer this question, let's first suppose that we're doing online learning, i.e., that we're using a mini-batch size of $1$.
The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.
Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. In a problem in the last chapter I pointed out that it's possible to use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of your hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size $100$, rather than computing the mini-batch gradient estimate by looping over the $100$ training examples separately. It might take (say) only $50$ times as long, rather than $100$ times as long.
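The following toy comparison shows where the saving comes from. Computing the weighted inputs for $100$ examples one at a time takes $100$ matrix-vector products; stacking the examples as the columns of a single matrix reduces that to one matrix-matrix product, which optimized linear algebra libraries handle far more efficiently:

import numpy as np

W = np.random.randn(30, 784)     # weights for a 784-to-30 layer
X = np.random.randn(784, 100)    # a mini-batch: one example per column

# One example at a time: 100 separate matrix-vector products.
zs_loop = [np.dot(W, X[:, i]) for i in range(100)]

# All examples at once: a single matrix-matrix product.
Z = np.dot(W, X)

print(np.allclose(Z, np.column_stack(zs_loop)))   # True: same result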
Now, at first it seems as though this doesn't help us that much. With our mini-batch of size $100$ the learning rule for the weights looks like: \begin{eqnarray} w \rightarrow w' = w-\eta \frac{1}{100} \sum_x \nabla C_x, \tag{100}\end{eqnarray} where the sum is over training examples in the mini-batch. This is versus \begin{eqnarray} w \rightarrow w' = w-\eta \nabla C_x \tag{101}\end{eqnarray} for online learning. Even if it only takes $50$ times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor $100$, so the update rule becomes \begin{eqnarray} w \rightarrow w' = w-\eta \sum_x \nabla C_x. \tag{102}\end{eqnarray} That's a lot like doing $100$ separate instances of online learning with a learning rate of $\eta$. But it only takes $50$ times as long as doing a single instance of online learning. Of course, it's not truly the same as $100$ instances of online learning, since in the mini-batch the $\nabla C_x$'s are all evaluated for the same set of weights, as opposed to the cumulative learning that occurs in the online case. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
With these factors in mind, choosing the best mini-batch size is a compromise. Too small, and you don't get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. Too large and you're simply not updating your weights often enough. What you need is to choose a compromise value which maximizes the speed of learning. Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so you don't need to have optimized those hyper-parameters in order to find a good mini-batch size. The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then trial a number of different mini-batch sizes, scaling $\eta$ as above. Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance. With the mini-batch size chosen you can then proceed to optimize the other hyper-parameters.
Of course, as you've no doubt realized, I haven't done this optimization in our work. Indeed, our implementation doesn't use the faster approach to mini-batch updates at all. I've simply used a mini-batch size of $10$ without comment or explanation in nearly all examples. Because of this, we could have sped up learning by reducing the mini-batch size. I haven't done this, in part because I wanted to illustrate the use of mini-batches beyond size $1$, and in part because my preliminary experiments suggested the speedup would be rather modest. In practical implementations, however, we would most certainly implement the faster approach to mini-batch updates, and then make an effort to optimize the mini-batch size, in order to maximize our overall speed.
Automated techniques: I've been describing these heuristics as though you're optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* *Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012). by James Bergstra and Yoshua Bengio. Many more sophisticated approaches have also been proposed. I won't review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters* *Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams.. The code from the paper is publicly available, and has been used with some success by other researchers.
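To give the flavor of the Bergstra-Bengio suggestion, here's a toy random search over $\eta$ and $\lambda$, sampling each log-uniformly. The evaluate function is a hypothetical stand-in which trains a network with the given settings and returns its validation accuracy, and the sampling ranges are merely illustrative:

import random

def random_search(evaluate, trials=20):
    best_accuracy, best_params = 0.0, None
    for _ in range(trials):
        eta = 10 ** random.uniform(-2, 1)      # eta between 0.01 and 10
        lmbda = 10 ** random.uniform(-3, 2)    # lambda between 0.001 and 100
        accuracy = evaluate(eta, lmbda)
        if accuracy > best_accuracy:
            best_accuracy, best_params = accuracy, (eta, lmbda)
    return best_params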
Summing up: Following the rules-of-thumb I've described won't give you the absolute best possible results from your neural network. But it will likely give you a good start and a basis for further improvements. In particular, I've discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with $\eta$, feel that you've got it just right, then start to optimize for $\lambda$, only to find that it's messing up your optimization for $\eta$. In practice, it helps to bounce backward and forward, gradually closing in on good values. Above all, keep in mind that the heuristics I've described are rules of thumb, not rules cast in stone. You should be on the lookout for signs that things aren't working, and be willing to experiment. In particular, this means carefully monitoring your network's behaviour, especially the validation accuracy.
The difficulty of choosing hyper-parameters is exacerbated by the fact that the lore about how to choose hyper-parameters is widely spread, across many research papers and software programs, and often is only available inside the heads of individual practitioners. There are many, many papers setting out (sometimes contradictory) recommendations for how to proceed. However, there are a few particularly useful papers that synthesize and distill out much of this lore. Yoshua Bengio has a 2012 paper* *Practical recommendations for gradient-based training of deep architectures, by Yoshua Bengio (2012). that gives some practical recommendations for using backpropagation and gradient descent to train neural networks, including deep neural nets. Bengio discusses many issues in much more detail than I have, including how to do more systematic hyper-parameter searches. Another good paper is a 1998 paper* *Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998) by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller. Both these papers appear in an extremely useful 2012 book that collects many tricks commonly used in neural nets* *Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller.. The book is expensive, but many of the articles have been placed online by their respective authors with, one presumes, the blessing of the publisher, and may be located using a search engine.
One thing that becomes clear as you read these articles and, especially, as you engage in your own experiments, is that hyper-parameter optimization is not a problem that is ever completely solved. There's always another trick you can try to improve performance. There is a saying common among writers that books are never finished, only abandoned. The same is also true of neural network optimization: the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity. So your goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that's important.
The challenge of setting hyper-parameters has led some people to complain that neural networks require a lot of work when compared with other machine learning techniques. I've heard many variations on the following complaint: "Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or$\ldots$ insert your own favorite technique] and it just works. I don't have time to figure out just the right neural network." Of course, from a practical point of view it's good to have easy-to-apply techniques. This is particularly true when you're just getting started on a problem, and it may not be obvious whether machine learning can help solve the problem at all. On the other hand, if getting optimal performance is important, then you may need to try approaches that require more specialist knowledge. While it would be nice if machine learning were always easy, there is no a priori reason it should be trivially simple.
Each technique developed in this chapter is valuable to know in its own right, but that's not the only reason I've explained them. The larger point is to familiarize you with some of the problems which can occur in neural networks, and with a style of analysis which can help overcome those problems. In a sense, we've been learning how to think about neural nets. Over the remainder of this chapter I briefly sketch a handful of other techniques. These sketches are less in-depth than the earlier discussions, but should convey some feeling for the diversity of techniques available for use in neural networks.
Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.
Hessian technique: To begin our discussion it helps to put neural networks aside for a bit. Instead, we're just going to consider the abstract problem of minimizing a cost function $C$ which is a function of many variables, $w = w_1, w_2, \ldots$, so $C = C(w)$. By Taylor's theorem, the cost function can be approximated near a point $w$ by \begin{eqnarray} C(w+\Delta w) & = & C(w) + \sum_j \frac{\partial C}{\partial w_j} \Delta w_j \nonumber \\ & & + \frac{1}{2} \sum_{jk} \Delta w_j \frac{\partial^2 C}{\partial w_j \partial w_k} \Delta w_k + \ldots \tag{103}\end{eqnarray} We can rewrite this more compactly as \begin{eqnarray} C(w+\Delta w) = C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w + \ldots, \tag{104}\end{eqnarray} where $\nabla C$ is the usual gradient vector, and $H$ is a matrix known as the Hessian matrix, whose $jk$th entry is $\partial^2 C / \partial w_j \partial w_k$. Suppose we approximate $C$ by discarding the higher-order terms represented by $\ldots$ above, \begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w. \tag{105}\end{eqnarray} Using calculus we can show that the expression on the right-hand side can be minimized* *Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite. Intuitively, this means that the function $C$ looks like a valley locally, not a mountain or a saddle. by choosing \begin{eqnarray} \Delta w = -H^{-1} \nabla C. \tag{106}\end{eqnarray} Provided (105)\begin{eqnarray} C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w + \frac{1}{2} \Delta w^T H \Delta w \nonumber\end{eqnarray} is a good approximate expression for the cost function, then we'd expect that moving from the point $w$ to $w+\Delta w = w-H^{-1} \nabla C$ should significantly decrease the cost function. That suggests a possible algorithm for minimizing the cost:
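Namely: choose a starting point, $w$; compute the gradient $\nabla C$ and Hessian $H$ at $w$; update $w$ to a new point $w' = w - H^{-1} \nabla C$; and then repeat, recomputing the gradient and Hessian at each new point. As a concrete illustration, here's a sketch of the update on a toy quadratic cost $C(w) = \frac{1}{2} w^T A w - b^T w$, whose gradient is $Aw - b$ and whose Hessian is the constant matrix $A$. The particular $A$ and $b$ are invented for illustration; in a network the gradient and Hessian would instead come from (a variant of) backpropagation:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # positive definite, so C has a minimum
b = np.array([1.0, 1.0])

w = np.zeros(2)                       # starting point
for step in range(3):
    grad = np.dot(A, w) - b           # gradient of C at w
    w = w - np.linalg.solve(A, grad)  # w -> w - H^{-1} grad C
    print(step, w)

Because the cost here is exactly quadratic, the very first step lands on the minimum; the remaining iterations simply confirm that the gradient has vanished.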
This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization. There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function it's possible for the Hessian approach to avoid many pathologies that can occur in gradient descent. Furthermore, there are versions of the backpropagation algorithm which can be used to compute the Hessian.
If Hessian optimization is so great, why aren't we using it in our neural networks? Unfortunately, while it has many desirable properties, it has one very undesirable property: it's very difficult to apply in practice. Part of the problem is the sheer size of the Hessian matrix. Suppose you have a neural network with $10^7$ weights and biases. Then the corresponding Hessian matrix will contain $10^7 \times 10^7 = 10^{14}$ entries. That's a lot of entries! And that makes computing $H^{-1} \nabla C$ extremely difficult in practice. However, that doesn't mean that it's not useful to understand. In fact, there are many variations on gradient descent which are inspired by Hessian optimization, but which avoid the problem with overly-large matrices. Let's take a look at one such technique, momentum-based gradient descent.
Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of "velocity" for the parameters we're trying to optimize. The gradient acts to change the velocity, not (directly) the "position", in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.
Let's give a more precise mathematical description. We introduce velocity variables $v = v_1, v_2, \ldots$, one for each corresponding $w_j$ variable* *In a neural net the $w_j$ variables would, of course, include all weights and biases.. Then we replace the gradient descent update rule $w \rightarrow w'= w-\eta \nabla C$ by \begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \tag{107}\\ w & \rightarrow & w' = w+v'. \tag{108}\end{eqnarray} In these equations, $\mu$ is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it's helpful to first consider the case where $\mu = 1$, which corresponds to no friction. When that's the case, inspection of the equations shows that the "force" $\nabla C$ is now modifying the velocity, $v$, and the velocity is controlling the rate of change of $w$. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we're moving straight down a slope:
With each step the velocity gets larger down the slope, so we move more and more quickly to the bottom of the valley. This can enable the momentum technique to work much faster than standard gradient descent. Of course, a problem is that once we reach the bottom of the valley we will overshoot. Or, if the gradient should change rapidly, then we could find ourselves moving in the wrong direction. That's the reason for the $\mu$ hyper-parameter in (107)\begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \nonumber\end{eqnarray}. I said earlier that $\mu$ controls the amount of friction in the system; to be a little more precise, you should think of $1-\mu$ as the amount of friction in the system. When $\mu = 1$, as we've seen, there is no friction, and the velocity is completely driven by the gradient $\nabla C$. By contrast, when $\mu = 0$ there's a lot of friction, the velocity can't build up, and Equations (107)\begin{eqnarray} v & \rightarrow & v' = \mu v - \eta \nabla C \nonumber\end{eqnarray} and (108)\begin{eqnarray} w & \rightarrow & w' = w+v' \nonumber\end{eqnarray} reduce to the usual equation for gradient descent, $w \rightarrow w'=w-\eta \nabla C$. In practice, using a value of $\mu$ intermediate between $0$ and $1$ can give us much of the benefit of being able to build up speed, but without causing overshooting. We can choose such a value for $\mu$ using the held-out validation data, in much the same way as we select $\eta$ and $\lambda$.
I've avoided naming the hyper-parameter $\mu$ up to now. The reason is that the standard name for $\mu$ is badly chosen: it's called the momentum co-efficient. This is potentially confusing, since $\mu$ is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum co-efficient is widely used, so we will continue to use it.
A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it's done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
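To make the "almost no work" concrete, here's a sketch of the modified update loop on a toy cost. The function grad_C stands in for whatever computes the gradient (in a network, backpropagation over a mini-batch), and the toy cost itself is invented for illustration:

import numpy as np

def grad_C(w):
    # Gradient of the toy cost C(w) = 1.5 w_0^2 + 0.25 w_1^2.
    return np.array([3.0 * w[0], 0.5 * w[1]])

eta, mu = 0.1, 0.9                 # learning rate and momentum co-efficient
w = np.array([1.0, 1.0])           # starting position
v = np.zeros_like(w)               # one velocity variable per parameter
for step in range(100):
    v = mu * v - eta * grad_C(w)   # the gradient modifies the velocity ...
    w = w + v                      # ... and the velocity moves the position
print(w)                           # close to the minimum at (0, 0)

Setting mu = 0 recovers ordinary gradient descent: only the two update lines differ from the standard rule.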
Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn't universal agreement on which is the best approach. As you go deeper into neural networks it's worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier* *Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998). introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results* *See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012). is Nesterov's accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we'll stick to stochastic gradient descent through the remainder of this book.
Up to now we've built our neural networks using sigmoid neurons. In principle, a network built from sigmoid neurons can compute any function. In practice, however, networks built using other model neurons sometimes outperform sigmoid networks. Depending on the application, networks based on such alternate models may learn faster, generalize better to test data, or perhaps do both. Let me mention a couple of alternate model neurons, to give you the flavor of some variations in common use.
Perhaps the simplest variation is the tanh (pronounced "tanch") neuron, which replaces the sigmoid function by the hyperbolic tangent function. The output of a tanh neuron with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \tanh(w \cdot x+b), \tag{109}\end{eqnarray} where $\tanh$ is, of course, the hyperbolic tangent function. It turns out that this is very closely related to the sigmoid neuron. To see this, recall that the $\tanh$ function is defined by \begin{eqnarray} \tanh(z) \equiv \frac{e^z-e^{-z}}{e^z+e^{-z}}. \tag{110}\end{eqnarray} With a little algebra it can easily be verified that \begin{eqnarray} \sigma(z) = \frac{1+\tanh(z/2)}{2}, \tag{111}\end{eqnarray} that is, $\tanh$ is just a rescaled version of the sigmoid function. We can also see graphically that the $\tanh$ function has the same shape as the sigmoid function,
One difference between tanh neurons and sigmoid neurons is that the output from tanh neurons ranges from -1 to 1, not 0 to 1. This means that if you're going to build a network based on tanh neurons you may need to normalize your outputs (and, depending on the details of the application, possibly your inputs) a little differently than in sigmoid networks.
Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function* *There are some technical caveats to this statement for both tanh and sigmoid neurons, as well as for the rectified linear neurons discussed below. However, informally it's usually fine to think of neural networks as being able to approximate any function to arbitrary accuracy. mapping inputs to the range -1 to 1. Furthermore, ideas such as backpropagation and stochastic gradient descent are as easily applied to a network of tanh neurons as to a network of sigmoid neurons.
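Incidentally, the rescaling relation (111) is easy to confirm numerically; here's a quick check:

import numpy as np

z = np.linspace(-5.0, 5.0, 101)
sigma = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(sigma, (1.0 + np.tanh(z / 2.0)) / 2.0))   # True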
Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better* *See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010).. Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we're using sigmoid neurons, so all activations in our network are positive. Let's consider the weights $w^{l+1}_{jk}$ input to the $j$th neuron in the $l+1$th layer. The rules for backpropagation (see here) tell us that the associated gradient will be $a^l_k \delta^{l+1}_j$. Because the activations are positive the sign of this gradient will be the same as the sign of $\delta^{l+1}_j$. What this means is that if $\delta^{l+1}_j$ is positive then all the weights $w^{l+1}_{jk}$ will decrease during gradient descent, while if $\delta^{l+1}_j$ is negative then all the weights $w^{l+1}_{jk}$ will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That's a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as $\tanh$, which allows both positive and negative activations. Indeed, because $\tanh$ is symmetric about zero, $\tanh(-z) = -\tanh(z)$, we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.
How seriously should we take this argument? While the argument is suggestive, it's a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don't yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.
Another variation on the sigmoid neuron is the rectified linear neuron or rectified linear unit. The output of a rectified linear unit with input $x$, weight vector $w$, and bias $b$ is given by \begin{eqnarray} \max(0, w \cdot x+b). \tag{112}\end{eqnarray} Graphically, the rectifying function $\max(0, z)$ looks like this:
Obviously such neurons are quite different from both sigmoid and tanh neurons. However, like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.
When should you use rectified linear units instead of sigmoid or tanh neurons? Some recent work on image recognition* *See, for example, What is the Best Multi-Stage Architecture for Object Recognition?, by Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun (2009), Deep Sparse Rectifier Neural Networks, by Xavier Glorot, Antoine Bordes, and Yoshua Bengio (2011), and ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012). Note that these papers fill in important details about how to set up the output layer, cost function, and regularization in networks using rectified linear units. I've glossed over all these details in this brief account. The papers also discuss in more detail the benefits and drawbacks of using rectified linear units. Another informative paper is Rectified Linear Units Improve Restricted Boltzmann Machines, by Vinod Nair and Geoffrey Hinton (2010), which demonstrates the benefits of using rectified linear units in a somewhat different approach to neural networks. has found considerable benefit in using rectified linear units through much of the network. However, as with tanh neurons, we do not yet have a really deep understanding of when, exactly, rectified linear units are preferable, nor why. To give you the flavor of some of the issues, recall that sigmoid neurons stop learning when they saturate, i.e., when their output is near either $0$ or $1$. As we've seen repeatedly in this chapter, the problem is that $\sigma'$ terms reduce the gradient, and that slows down learning. Tanh neurons suffer from a similar problem when they saturate. By contrast, increasing the weighted input to a rectified linear unit will never cause it to saturate, and so there is no corresponding learning slowdown. On the other hand, when the weighted input to a rectified linear unit is negative, the gradient vanishes, and so the neuron stops learning entirely. These are just two of the many issues that make it non-trivial to understand when and why rectified linear units perform better than sigmoid or tanh neurons.
I've painted a picture of uncertainty here, stressing that we do not yet have a solid theory of how activation functions should be chosen. Indeed, the problem is harder even than I have described, for there are infinitely many possible activation functions. Which is the best for any given problem? Which will result in a network which learns fastest? Which will give the highest test accuracies? I am surprised how little really deep and systematic investigation has been done of these questions. Ideally, we'd have a theory which tells us, in detail, how to choose (and perhaps modify-on-the-fly) our activation functions. On the other hand, we shouldn't let the lack of a full theory stop us! We have powerful tools already at hand, and can make a lot of progress with those tools. Through the remainder of this book I'll continue to use sigmoid neurons as our go-to neuron, since they're powerful and provide concrete illustrations of the core ideas about neural nets. But keep in the back of your mind that these same ideas can be applied to other types of neuron, and that there are sometimes advantages in doing so.
Question: How do you approach utilizing and researching machine learning techniques that are supported almost entirely empirically, as opposed to mathematically? Also in what situations have you noticed some of these techniques fail?

Answer: You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.
- Question and answer with neural networks researcher Yann LeCun
Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with "I'm very sympathetic to your point of view, but [...]". Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn't always have unarguable new progress to report.
You may have noticed a verbal tic similar to "I'm very sympathetic [...]" in the current book. To explain what we're seeing I've often fallen back on saying "Heuristically, [...]", or "Roughly speaking, [...]", following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I've presented has often been pretty thin. If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?
In many parts of science - especially those parts that deal with simple phenomena - it's possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it's exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.
One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works* *From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we've discussed. Each heuristic is not just a (potential) explanation, it's also a challenge to investigate and understand in more detail.
Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
In the last chapter we learned that deep neural networks are often much harder to train than shallow neural networks. That's unfortunate, since we have good reason to believe that if we could train deep nets they'd be much more powerful than shallow nets. But while the news from the last chapter is discouraging, we won't let it stop us. In this chapter, we'll develop techniques which can be used to train deep networks, and apply them in practice. We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications. And we'll take a brief, speculative look at what the future may hold for neural nets, and for artificial intelligence.
The chapter is a long one. To help you navigate, let's take a tour. The sections are only loosely coupled, so provided you have some basic familiarity with neural nets, you can jump to whatever most interests you.
The main part of the chapter is an introduction to one of the most widely used types of deep network: deep convolutional networks. We'll work through a detailed example - code and all - of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set:
We'll start our account of convolutional networks with the shallow networks used to attack this problem earlier in the book. Through many iterations we'll build up more and more powerful networks. As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others. The result will be a system that offers near-human performance. Of the 10,000 MNIST test images - images not seen during training! - our system will classify 9,967 correctly. Here's a peek at the 33 images which are misclassified. Note that the correct classification is in the top right; our program's classification is in the bottom right:
Many of these are tough even for a human to classify. Consider, for example, the third image in the top row. To me it looks more like a "9" than an "8", which is the official classification. Our network also thinks it's a "9". This kind of "error" is at the very least understandable, and perhaps even commendable. We conclude our discussion of image recognition with a survey of some of the spectacular recent progress using networks (particularly convolutional nets) to do image recognition.
The remainder of the chapter discusses deep learning from a broader and less detailed perspective. We'll briefly survey other models of neural networks, such as recurrent neural nets and long short-term memory units, and how such models can be applied to problems in speech recognition, natural language processing, and other areas. And we'll speculate about the future of neural networks and deep learning, ranging from ideas like intention-driven user interfaces, to the role of deep learning in artificial intelligence.
The chapter builds on the earlier chapters in the book, making use of and integrating ideas such as backpropagation, regularization, the softmax function, and so on. However, to read the chapter you don't need to have worked in detail through all the earlier chapters. It will, however, help to have read Chapter 1, on the basics of neural networks. When I use concepts from Chapters 2 to 5, I provide links so you can familiarize yourself, if necessary.
It's worth noting what the chapter is not. It's not a tutorial on the latest and greatest neural networks libraries. Nor are we going to be training deep networks with dozens of layers to solve problems at the very leading edge. Rather, the focus is on understanding some of the core principles behind deep neural networks, and applying them in the simple, easy-to-understand context of the MNIST problem. Put another way: the chapter is not going to bring you right up to the frontier. Rather, the intent of this and earlier chapters is to focus on fundamentals, and so to prepare you to understand a wide range of current work.
In earlier chapters, we taught our neural networks to do a pretty good job recognizing images of handwritten digits:
We did this using networks in which adjacent network layers are fully connected to one another. That is, every neuron in the network is connected to every neuron in adjacent layers:
In particular, for each pixel in the input image, we encoded the pixel's intensity as the value for a corresponding neuron in the input layer. For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$) input neurons. We then trained the network's weights and biases so that the network's output would - we hope! - correctly identify the input image: '0', '1', '2', ..., '8', or '9'.
Our earlier networks work pretty well: we've obtained a classification accuracy better than 98 percent, using training and test data from the MNIST handwritten digit data set. But upon reflection, it's strange to use networks with fully-connected layers to classify images. The reason is that such a network architecture does not take into account the spatial structure of the images. For instance, it treats input pixels which are far apart and close together on exactly the same footing. Such concepts of spatial structure must instead be inferred from the training data. But what if, instead of starting with a network architecture which is tabula rasa, we used an architecture which tries to take advantage of the spatial structure? In this section I describe convolutional neural networks* *The origins of convolutional neural networks go back to the 1970s. But the seminal paper establishing the modern subject of convolutional networks was a 1998 paper, "Gradient-based learning applied to document recognition", by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. LeCun has since made an interesting remark on the terminology for convolutional nets: "The [biological] neural inspiration in models like convolutional nets is very tenuous. That's why I call them 'convolutional nets' not 'convolutional neural nets', and why we call the nodes 'units' and not 'neurons' ". Despite this remark, convolutional nets use many of the same ideas as the neural networks we've studied up to now: ideas such as backpropagation, gradient descent, regularization, non-linear activation functions, and so on. And so we will follow common practice, and consider them a type of neural network. I will use the terms "convolutional neural network" and "convolutional net(work)" interchangeably. I will also use the terms "[artificial] neuron" and "unit" interchangeably.. These networks use a special architecture which is particularly well-adapted to classify images. Using this architecture makes convolutional networks fast to train. This, in turn, helps us train deep, many-layer networks, which are very good at classifying images. Today, deep convolutional networks or some close variant are used in most neural networks for image recognition.
Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling. Let's look at each of these ideas in turn.
Local receptive fields: In the fully-connected layers shown earlier, the inputs were depicted as a vertical line of neurons. In a convolutional net, it'll help to think instead of the inputs as a $28 \times 28$ square of neurons, whose values correspond to the $28 \times 28$ pixel intensities we're using as inputs:
As per usual, we'll connect the input pixels to a layer of hidden neurons. But we won't connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image.
To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a $5 \times 5$ region, corresponding to $25$ input pixels. So, for a particular hidden neuron, we might have connections that look like this:
That region in the input image is called the local receptive field for the hidden neuron. It's a little window on the input pixels. Each connection learns a weight. And the hidden neuron learns an overall bias as well. You can think of that particular hidden neuron as learning to analyze its particular local receptive field.
We then slide the local receptive field across the entire input image. For each local receptive field, there is a different hidden neuron in the first hidden layer. To illustrate this concretely, let's start with a local receptive field in the top-left corner:
Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:
And so on, building up the first hidden layer. Note that if we have a $28 \times 28$ input image, and $5 \times 5$ local receptive fields, then there will be $24 \times 24$ neurons in the hidden layer. This is because we can only move the local receptive field $23$ neurons across (or $23$ neurons down), before colliding with the right-hand side (or bottom) of the input image.
I've shown the local receptive field being moved by one pixel at a time. In fact, sometimes a different stride length is used. For instance, we might move the local receptive field $2$ pixels to the right (or down), in which case we'd say a stride length of $2$ is used. In this chapter we'll mostly stick with stride length $1$, but it's worth knowing that people sometimes experiment with different stride lengths* *As was done in earlier chapters, if we're interested in trying different stride lengths then we can use validation data to pick out the stride length which gives the best performance. For more details, see the earlier discussion of how to choose hyper-parameters in a neural network. The same approach may also be used to choose the size of the local receptive field - there is, of course, nothing special about using a $5 \times 5$ local receptive field. In general, larger local receptive fields tend to be helpful when the input images are significantly larger than the $28 \times 28$ pixel MNIST images..
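The arithmetic behind the size of the hidden layer is worth making explicit. With no padding, an input of side $n$, a local receptive field of side $f$, and stride length $s$ give a hidden layer of side $(n - f)/s + 1$. A one-line check:

def hidden_side(n=28, f=5, s=1):
    # Number of hidden neurons along one dimension: the field can slide
    # (n - f) // s steps beyond its starting position.
    return (n - f) // s + 1

print(hidden_side())           # 24, as in the text
print(hidden_side(s=2))        # 12, with a stride length of 2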
Shared weights and biases: I've said that each hidden neuron has a bias and $5 \times 5$ weights connected to its local receptive field. What I did not yet mention is that we're going to use the same weights and bias for each of the $24 \times 24$ hidden neurons. In other words, for the $j, k$th hidden neuron, the output is: \begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right). \tag{125}\end{eqnarray} Here, $\sigma$ is the neural activation function - perhaps the sigmoid function we used in earlier chapters. $b$ is the shared value for the bias. $w_{l,m}$ is a $5 \times 5$ array of shared weights. And, finally, we use $a_{x, y}$ to denote the input activation at position $x, y$.
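Equation (125) is simple enough to transcribe directly into numpy. Here's a deliberately naive sketch computing a single feature map from a $28 \times 28$ input; note that the same $5 \times 5$ weight array w and bias b are reused at every position:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    # a: 28x28 input activations; w: 5x5 shared weights; b: shared bias.
    # Returns the 24x24 activations of one feature map, per Equation (125).
    out = np.zeros((24, 24))
    for j in range(24):
        for k in range(24):
            out[j, k] = sigmoid(b + np.sum(w * a[j:j+5, k:k+5]))
    return out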
This means that all the neurons in the first hidden layer detect exactly the same feature* *I haven't precisely defined the notion of a feature. Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape. , just at different locations in the input image. To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field. That ability is also likely to be useful at other places in the image. And so it is useful to apply the same feature detector everywhere in the image. To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat* *In fact, for the MNIST digit classification problem we've been studying, the images are centered and size-normalized. So MNIST has less translation invariance than images found "in the wild", so to speak. Still, features like edges and corners are likely to be useful across much of the input space. .
For this reason, we sometimes call the map from the input layer to the hidden layer a feature map. We call the weights defining the feature map the shared weights. And we call the bias defining the feature map in this way the shared bias. The shared weights and bias are often said to define a kernel or filter. In the literature, people sometimes use these terms in slightly different ways, and for that reason I'm not going to be more precise; rather, in a moment, we'll look at some concrete examples.
The network structure I've described so far can detect just a single kind of localized feature. To do image recognition we'll need more than one feature map. And so a complete convolutional layer consists of several different feature maps:
I've shown just $3$ feature maps, to keep the diagram above simple. However, in practice convolutional networks may use more (and perhaps many more) feature maps. One of the early convolutional networks, LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$ local receptive field, to recognize MNIST digits. So the example illustrated above is actually pretty close to LeNet-5. In the examples we develop later in the chapter we'll use convolutional layers with $20$ and $40$ feature maps. Let's take a quick peek at some of the features which are learned* *The feature maps illustrated come from the final convolutional network we train, see here.:
The $20$ images correspond to $20$ different feature maps (or filters, or kernels). Each map is represented as a $5 \times 5$ block image, corresponding to the $5 \times 5$ weights in the local receptive field. Whiter blocks mean a smaller (typically, more negative) weight, so the feature map responds less to corresponding input pixels. Darker blocks mean a larger weight, so the feature map responds more to the corresponding input pixels. Very roughly speaking, the images above show the type of features the convolutional layer responds to.
So what can we conclude from these feature maps? It's clear there is spatial structure here beyond what we'd expect at random: many of the features have clear sub-regions of light and dark. That shows our network really is learning things related to the spatial structure. However, beyond that, it's difficult to see what these feature detectors are learning. Certainly, we're not learning (say) the Gabor filters which have been used in many traditional approaches to image recognition. In fact, there's now a lot of work on better understanding the features learnt by convolutional networks. If you're interested in following up on that work, I suggest starting with the paper Visualizing and Understanding Convolutional Networks by Matthew Zeiler and Rob Fergus (2013).
A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network. For each feature map we need $25 = 5 \times 5$ shared weights, plus a single shared bias. So each feature map requires $26$ parameters. If we have $20$ feature maps that's a total of $20 \times 26 = 520$ parameters defining the convolutional layer. By comparison, suppose we had a fully connected first layer, with $784 = 28 \times 28$ input neurons, and a relatively modest $30$ hidden neurons, as we used in many of the examples earlier in the book. That's a total of $784 \times 30$ weights, plus an extra $30$ biases, for a total of $23,550$ parameters. In other words, the fully-connected layer would have more than $40$ times as many parameters as the convolutional layer.
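As a quick sanity check on that arithmetic:

conv_params = 20 * (5 * 5 + 1)    # 20 feature maps, 25 weights + 1 bias each
fc_params = 784 * 30 + 30         # fully-connected layer: weights + biases
print(conv_params)                # 520
print(fc_params)                  # 23550
print(fc_params // conv_params)   # 45: more than 40 times as many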
Of course, we can't really do a direct comparison between the number of parameters, since the two models are different in essential ways. But, intuitively, it seems likely that the use of translation invariance by the convolutional layer will reduce the number of parameters it needs to get the same performance as the fully-connected model. That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.
Incidentally, the name convolutional comes from the fact that the operation in Equation (125)\begin{eqnarray} \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray} is sometimes known as a convolution. A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation. We're not going to make any deep use of the mathematics of convolutions, so you don't need to worry too much about this connection. But it's worth at least knowing where the name comes from.
Pooling layers: In addition to the convolutional layers just described, convolutional neural networks also contain pooling layers. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.
In detail, a pooling layer takes each feature map* *The nomenclature is being used loosely here. In particular, I'm using "feature map" to mean not the function computed by the convolutional layer, but rather the activation of the hidden neurons output from the layer. This kind of mild abuse of nomenclature is pretty common in the research literature. output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of (say) $2 \times 2$ neurons in the previous layer. As a concrete example, one common procedure for pooling is known as max-pooling. In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region, as illustrated in the following diagram:
Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.
As mentioned above, the convolutional layer usually involves more than a single feature map. We apply max-pooling to each feature map separately. So if there were three feature maps, the combined convolutional and max-pooling layers would look like:
We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been found, its exact location isn't as important as its rough location relative to other features. A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.
Max-pooling isn't the only technique used for pooling. Another common approach is known as L2 pooling. Here, instead of taking the maximum activation of a $2 \times 2$ region of neurons, we take the square root of the sum of the squares of the activations in the $2 \times 2$ region. While the details are different, the intuition is similar to max-pooling: L2 pooling is a way of condensing information from the convolutional layer. In practice, both techniques have been widely used. And sometimes people use other types of pooling operation. If you're really trying to optimize performance, you may use validation data to compare several different approaches to pooling, and choose the approach which works best. But we're not going to worry about that kind of detailed optimization.
Putting it all together: We can now put all these ideas together to form a complete convolutional neural network. It's similar to the architecture we were just looking at, but has the addition of a layer of $10$ output neurons, corresponding to the $10$ possible values for MNIST digits ('0', '1', '2', etc):
The network begins with $28 \times 28$ input neurons, which are used to encode the pixel intensities for the MNIST image. This is then followed by a convolutional layer using a $5 \times 5$ local receptive field and $3$ feature maps. The result is a layer of $3 \times 24 \times 24$ hidden feature neurons. The next step is a max-pooling layer, applied to $2 \times 2$ regions, across each of the $3$ feature maps. The result is a layer of $3 \times 12 \times 12$ hidden feature neurons.
The final layer of connections in the network is a fully-connected layer. That is, this layer connects every neuron from the max-pooled layer to every one of the $10$ output neurons. This fully-connected architecture is the same as we used in earlier chapters. Note, however, that in the diagram above, I've used a single arrow, for simplicity, rather than showing all the connections. Of course, you can easily imagine the connections.
This convolutional architecture is quite different to the architectures used in earlier chapters. But the overall picture is similar: a network made of many simple units, whose behaviors are determined by their weights and biases. And the overall goal is still the same: to use training data to train the network's weights and biases so that the network does a good job classifying input digits.
In particular, just as earlier in the book, we will train our network using stochastic gradient descent and backpropagation. This mostly proceeds in exactly the same way as in earlier chapters. However, we do need to make a few modifications to the backpropagation procedure. The reason is that our earlier derivation of backpropagation was for networks with fully-connected layers. Fortunately, it's straightforward to modify the derivation for convolutional and max-pooling layers. If you'd like to understand the details, then I invite you to work through the following problem. Be warned that the problem will take some time to work through, unless you've really internalized the earlier derivation of backpropagation (in which case it's easy).
We've now seen the core ideas behind convolutional neural networks. Let's look at how they work in practice, by implementing some convolutional networks, and applying them to the MNIST digit classification problem. The program we'll use to do this is called network3.py, and it's an improved version of the programs network.py and network2.py developed in earlier chapters* *Note also that network3.py incorporates ideas from the Theano library's documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil's implementation of dropout, and from Chris Olah.. If you wish to follow along, the code is available on GitHub. Note that we'll work through the code for network3.py itself in the next section. In this section, we'll use network3.py as a library to build convolutional networks.
The programs network.py and network2.py were implemented using Python and the matrix library Numpy. Those programs worked from first principles, and got right down into the details of backpropagation, stochastic gradient descent, and so on. But now that we understand those details, for network3.py we're going to use a machine learning library known as Theano* *See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Ravzan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010). Theano is also the basis for the popular Pylearn2 and Keras neural networks libraries. Other popular neural nets libraries at the time of this writing include Caffe and Torch. . Using Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than our earlier code (which was written to be easy to understand, not fast), and this makes it practical to train more complex networks. In particular, one great feature of Theano is that it can run code on either a CPU or, if available, a GPU. Running on a GPU provides a substantial speedup and, again, helps make it practical to train more complex networks.
If you wish to follow along, then you'll need to get Theano running on your system. To install Theano, follow the instructions at the project's homepage. The examples which follow were run using Theano 0.6* *As I release this chapter, the current version of Theano has changed to version 0.7. I've actually rerun the examples under Theano 0.7 and get extremely similar results to those reported in the text.. Some were run under Mac OS X Yosemite, with no GPU. Some were run on Ubuntu 14.04, with an NVIDIA GPU. And some of the experiments were run under both. To get network3.py running you'll need to set the GPU flag to either True or False (as appropriate) in the network3.py source. Beyond that, to get Theano up and running on a GPU you may find the instructions here helpful. There are also tutorials on the web, easily found using Google, which can help you get things working. If you don't have a GPU available locally, then you may wish to look into Amazon Web Services EC2 G2 spot instances. Note that even with a GPU the code will take some time to execute. Many of the experiments take from minutes to hours to run. On a CPU it may take days to run the most complex of the experiments. As in earlier chapters, I suggest setting things running, and continuing to read, occasionally coming back to check the output from the code. If you're using a CPU, you may wish to reduce the number of training epochs for the more complex experiments, or perhaps omit them entirely.
To get a baseline, we'll start with a shallow architecture using just
a single hidden layer, containing $100$ hidden neurons. We'll train
for $60$ epochs, using a learning rate of $\eta = 0.1$, a mini-batch
size of $10$, and no regularization. Here we go*
*Code for the
experiments in this section may be found
in
this script. Note that the code in the script simply duplicates
and parallels the discussion in this section.
Note also that
throughout the section I've explicitly specified the number of
training epochs. I've done this for clarity about how we're
training. In practice, it's worth using
early stopping, that is,
tracking accuracy on the validation set, and stopping training when
we are confident the validation accuracy has stopped improving.:
>>> import network3
In the last chapter we learned that deep neural
networks are often much harder to train than shallow neural networks.
That's unfortunate, since we have good reason to believe that
if we could train deep nets they'd be much more powerful than
shallow nets. But while the news from the last chapter is
discouraging, we won't let it stop us. In this chapter, we'll develop
techniques which can be used to train deep networks, and apply them in
practice. We'll also look at the broader picture, briefly reviewing
recent progress on using deep nets for image recognition, speech
recognition, and other applications. And we'll take a brief,
speculative look at what the future may hold for neural nets, and for
artificial intelligence.
The chapter is a long one. To help you navigate, let's take a tour.
The sections are only loosely coupled, so provided you have some basic
familiarity with neural nets, you can jump to whatever most interests
you.
The main part of the chapter is an
introduction to one of the most widely used types of deep network:
deep convolutional networks. We'll work through a detailed example
- code and all - of using convolutional nets to solve the problem
of classifying handwritten digits from the MNIST data set:
We'll start our account of convolutional networks with the shallow
networks used to attack this problem earlier in the book. Through
many iterations we'll build up more and more powerful networks. As we
go we'll explore many powerful techniques: convolutions, pooling, the
use of GPUs to do far more training than we did with our shallow
networks, the algorithmic expansion of our training data (to reduce
overfitting), the use of the dropout technique (also to reduce
overfitting), the use of ensembles of networks, and others. The
result will be a system that offers near-human performance. Of the
10,000 MNIST test images - images not seen during training! - our
system will classify 9,967 correctly. Here's a peek at the 33 images
which are misclassified. Note that the correct classification is in
the top right; our program's classification is in the bottom right:
Many of these are tough even for a human to classify. Consider, for
example, the third image in the top row. To me it looks more like a
"9" than an "8", which is the official classification. Our
network also thinks it's a "9". This kind of "error" is at the
very least understandable, and perhaps even commendable. We conclude
our discussion of image recognition with a
survey of some of the
spectacular recent progress using networks (particularly
convolutional nets) to do image recognition.
The remainder of the chapter discusses deep learning from a broader
and less detailed perspective. We'll
briefly
survey other models of neural networks, such as recurrent neural
nets and long short-term memory units, and how such models can be
applied to problems in speech recognition, natural language
processing, and other areas. And we'll
speculate about the
future of neural networks and deep learning, ranging from ideas
like intention-driven user interfaces, to the role of deep learning in
artificial intelligence.
The chapter builds on the earlier chapters in the book, making use of
and integrating ideas such as backpropagation, regularization, the
softmax function, and so on. However, to read the chapter you don't
need to have worked in detail through all the earlier chapters. It
will, however, help to have read Chapter 1, on the
basics of neural networks. When I use concepts from Chapters 2 to 5,
I provide links so you can familiarize yourself, if necessary.
It's worth noting what the chapter is not. It's not a tutorial on the
latest and greatest neural networks libraries. Nor are we going to be
training deep networks with dozens of layers to solve problems at the
very leading edge. Rather, the focus is on understanding some of the
core principles behind deep neural networks, and applying them in the
simple, easy-to-understand context of the MNIST problem. Put another
way: the chapter is not going to bring you right up to the frontier.
Rather, the intent of this and earlier chapters is to focus on
fundamentals, and so to prepare you to understand a wide range of
current work.
Introducing convolutional networks
In earlier chapters, we taught our neural networks to do a pretty good
job recognizing images of handwritten digits:
We did this using networks in which adjacent network layers are fully
connected to one another. That is, every neuron in the network is
connected to every neuron in adjacent layers:
In particular, for each pixel in the input image, we encoded the
pixel's intensity as the value for a corresponding neuron in the input
layer. For the $28 \times 28$ pixel images we've been using, this
means our network has $784$ ($= 28 \times 28$) input neurons. We then
trained the network's weights and biases so that the network's output
would - we hope! - correctly identify the input image: '0', '1',
'2', ..., '8', or '9'.
Our earlier networks work pretty well: we've
obtained a classification accuracy better
than 98 percent, using training and test data from the
MNIST handwritten
digit data set. But upon reflection, it's strange to use networks
with fully-connected layers to classify images. The reason is that
such a network architecture does not take into account the spatial
structure of the images. For instance, it treats input pixels which
are far apart and close together on exactly the same footing. Such
concepts of spatial structure must instead be inferred from the
training data. But what if, instead of starting with a network
architecture which is tabula rasa, we used an architecture
which tries to take advantage of the spatial structure? In this
section I describe convolutional neural networks*
*The
origins of convolutional neural networks go back to the 1970s. But
the seminal paper establishing the modern subject of convolutional
networks was a 1998 paper,
"Gradient-based
learning applied to document recognition", by Yann LeCun,
Léon Bottou, Yoshua Bengio, and Patrick Haffner.
LeCun has since made an interesting
remark
on the terminology for convolutional nets: "The [biological] neural
inspiration in models like convolutional nets is very
tenuous. That's why I call them 'convolutional nets' not
'convolutional neural nets', and why we call the nodes 'units' and
not 'neurons' ". Despite this remark, convolutional nets use many
of the same ideas as the neural networks we've studied up to now:
ideas such as backpropagation, gradient descent, regularization,
non-linear activation functions, and so on. And so we will follow
common practice, and consider them a type of neural network. I will
use the terms "convolutional neural network" and "convolutional
net(work)" interchangeably. I will also use the terms
"[artificial] neuron" and "unit" interchangeably.. These
networks use a special architecture which is particularly well-adapted
to classify images. Using this architecture makes convolutional
networks fast to train. This, in turn, helps us train deep,
many-layer networks, which are very good at classifying images.
Today, deep convolutional networks or some close variant are used in
most neural networks for image recognition.
Convolutional neural networks use three basic ideas: local
receptive fields, shared weights, and pooling. Let's
look at each of these ideas in turn.
Local receptive fields: In the fully-connected layers shown
earlier, the inputs were depicted as a vertical line of neurons. In a
convolutional net, it'll help to think instead of the inputs as a $28
\times 28$ square of neurons, whose values correspond to the $28
\times 28$ pixel intensities we're using as inputs:
As per usual, we'll connect the input pixels to a layer of hidden
neurons. But we won't connect every input pixel to every hidden
neuron. Instead, we only make connections in small, localized regions
of the input image.
To be more precise, each neuron in the first hidden layer will be
connected to a small region of the input neurons, say, for example, a
$5 \times 5$ region, corresponding to $25$ input pixels. So, for a
particular hidden neuron, we might have connections that look like
this:
That region in the input image is called the local receptive
field for the hidden neuron. It's a little window on the input
pixels. Each connection learns a weight. And the hidden neuron
learns an overall bias as well. You can think of that particular
hidden neuron as learning to analyze its particular local receptive
field.
We then slide the local receptive field across the entire input image.
For each local receptive field, there is a different hidden neuron in
the first hidden layer. To illustrate this concretely, let's start
with a local receptive field in the top-left corner:
Then we slide the local receptive field over by one pixel to the right
(i.e., by one neuron), to connect to a second hidden neuron:
And so on, building up the first hidden layer. Note that if we have a
$28 \times 28$ input image, and $5 \times 5$ local receptive fields,
then there will be $24 \times 24$ neurons in the hidden layer. This
is because we can only move the local receptive field $23$ neurons
across (or $23$ neurons down), before colliding with the right-hand
side (or bottom) of the input image.
I've shown the local receptive field being moved by one pixel at a
time. In fact, sometimes a different stride length is used.
For instance, we might move the local receptive field $2$ pixels to
the right (or down), in which case we'd say a stride length of $2$ is
used. In this chapter we'll mostly stick with stride length $1$, but
it's worth knowing that people sometimes experiment with different
stride lengths*
*As was done in earlier chapters, if we're
interested in trying different stride lengths then we can use
validation data to pick out the stride length which gives the best
performance. For more details, see the
earlier
discussion of how to choose hyper-parameters in a neural network.
The same approach may also be used to choose the size of the local
receptive field - there is, of course, nothing special about using
a $5 \times 5$ local receptive field. In general, larger local
receptive fields tend to be helpful when the input images are
significantly larger than the $28 \times 28$ pixel MNIST images..
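By the way, the arithmetic determining the hidden layer's size is easy to check in code. Here's a tiny helper - purely illustrative, and not part of network3.py:
def hidden_size(input_size=28, field=5, stride=1):
    """Number of neurons along one dimension of the hidden layer, for a
    square input, a square local receptive field, and the given stride."""
    return (input_size - field) // stride + 1

print(hidden_size())           # 24, giving the 24 x 24 layer described above
print(hidden_size(stride=2))   # 12, if we instead use a stride length of 2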
Shared weights and biases: I've said that each hidden neuron
has a bias and $5 \times 5$ weights connected to its local receptive
field. What I did not yet mention is that we're going to use the
same weights and bias for each of the $24 \times 24$ hidden
neurons. In other words, for the $j, k$th hidden neuron, the output
is:
\begin{eqnarray}
\sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right).
\tag{125}\end{eqnarray}
Here, $\sigma$ is the neural activation function - perhaps the
sigmoid function we used in
earlier chapters. $b$ is the shared value for the bias. $w_{l,m}$ is
a $5 \times 5$ array of shared weights. And, finally, we use $a_{x,
y}$ to denote the input activation at position $x, y$.
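To make Equation (125) concrete, here's a minimal NumPy sketch which computes a single feature map for a $28 \times 28$ input. I stress that this is illustrative code written for this discussion, not code from network3.py:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    """Slide the shared 5 x 5 weights w (and the shared bias b) over the
    28 x 28 input a, applying Equation (125) at each position."""
    out = np.zeros((24, 24))
    for j in range(24):
        for k in range(24):
            out[j, k] = sigmoid(b + np.sum(w * a[j:j+5, k:k+5]))
    return out

a = np.random.rand(28, 28)   # stand-in for the input pixel intensities
w = np.random.randn(5, 5)    # the shared weights
b = np.random.randn()        # the shared bias
print(feature_map(a, w, b).shape)   # (24, 24)
Note that w * a[j:j+5, k:k+5] is an elementwise product, so the np.sum over it is exactly the double sum in Equation (125).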
This means that all the neurons in the first hidden layer detect
exactly the same feature*
*I haven't precisely defined the
notion of a feature. Informally, think of the feature detected by a
hidden neuron as the kind of input pattern that will cause the
neuron to activate: it might be an edge in the image, for instance,
or maybe some other type of shape. , just at different locations in
the input image. To see why this makes sense, suppose the weights and
bias are such that the hidden neuron can pick out, say, a vertical
edge in a particular local receptive field. That ability is also
likely to be useful at other places in the image. And so it is useful
to apply the same feature detector everywhere in the image. To put it
in slightly more abstract terms, convolutional networks are well
adapted to the translation invariance of images: move a picture of a
cat (say) a little ways, and it's still an image of a cat*
*In
fact, for the MNIST digit classification problem we've been
studying, the images are centered and size-normalized. So MNIST has
less translation invariance than images found "in the wild", so to
speak. Still, features like edges and corners are likely to be
useful across much of the input space. .
For this reason, we sometimes call the map from the input layer to the
hidden layer a feature map. We call the weights defining the
feature map the shared weights. And we call the bias defining
the feature map in this way the shared bias. The shared
weights and bias are often said to define a kernel or
filter. In the literature, people sometimes use these terms in
slightly different ways, and for that reason I'm not going to be more
precise; rather, in a moment, we'll look at some concrete examples.
The network structure I've described so far can detect just a single
kind of localized feature. To do image recognition we'll need more
than one feature map. And so a complete convolutional layer consists
of several different feature maps:
In the example shown, there are $3$ feature maps. Each feature map is
defined by a set of $5 \times 5$ shared weights, and a single shared
bias. The result is that the network can detect $3$ different kinds
of features, with each feature being detectable across the entire
image.
I've shown just $3$ feature maps, to keep the diagram above simple.
However, in practice convolutional networks may use more (and perhaps
many more) feature maps. One of the early convolutional networks,
LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$
local receptive field, to recognize MNIST digits. So the example
illustrated above is actually pretty close to LeNet-5. In the
examples we develop later in the chapter we'll use convolutional
layers with $20$ and $40$ feature maps. Let's take a quick peek at
some of the features which are learned*
*The feature maps
illustrated come from the final convolutional network we train, see
here.:
The $20$ images correspond to $20$ different feature maps (or filters,
or kernels). Each map is represented as a $5 \times 5$ block image,
corresponding to the $5 \times 5$ weights in the local receptive
field. Whiter blocks mean a smaller (typically, more negative)
weight, so the feature map responds less to corresponding input
pixels. Darker blocks mean a larger weight, so the feature map
responds more to the corresponding input pixels. Very roughly
speaking, the images above show the type of features the convolutional
layer responds to.
So what can we conclude from these feature maps? It's clear there is
spatial structure here beyond what we'd expect at random: many of the
features have clear sub-regions of light and dark. That shows our
network really is learning things related to the spatial structure.
However, beyond that, it's difficult to see what these feature
detectors are learning. Certainly, we're not learning (say) the
Gabor filters which
have been used in many traditional approaches to image recognition.
In fact, there's now a lot of work on better understanding the
features learnt by convolutional networks. If you're interested in
following up on that work, I suggest starting with the paper
Visualizing and Understanding
Convolutional Networks by Matthew Zeiler and Rob Fergus (2013).
A big advantage of sharing weights and biases is that it greatly
reduces the number of parameters involved in a convolutional network.
For each feature map we need $25 = 5 \times 5$ shared weights, plus a
single shared bias. So each feature map requires $26$ parameters. If
we have $20$ feature maps that's a total of $20 \times 26 = 520$
parameters defining the convolutional layer. By comparison, suppose
we had a fully connected first layer, with $784 = 28 \times 28$ input
neurons, and a relatively modest $30$ hidden neurons, as we used in
many of the examples earlier in the book. That's a total of $784
\times 30$ weights, plus an extra $30$ biases, for a total of $23,550$
parameters. In other words, the fully-connected layer would have more
than $40$ times as many parameters as the convolutional layer.
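The bookkeeping is easy to verify - this is just the arithmetic from the last paragraph, spelled out:
conv_params = 20 * (5 * 5 + 1)   # 20 feature maps, each with 25 shared weights and 1 shared bias
fc_params = 784 * 30 + 30        # fully-connected: 784 x 30 weights, plus 30 biases
print(conv_params, fc_params)    # 520 23550
print(fc_params // conv_params)  # 45: more than 40 times as many parameters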
Of course, we can't really do a direct comparison between the number
of parameters, since the two models are different in essential ways.
But, intuitively, it seems likely that the use of translation
invariance by the convolutional layer will reduce the number of
parameters it needs to get the same performance as the fully-connected
model. That, in turn, will result in faster training for the
convolutional model, and, ultimately, will help us build deep networks
using convolutional layers.
Incidentally, the name convolutional comes from the fact that
the operation in Equation (125) is sometimes known as a
convolution. A little more precisely, people sometimes write
that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the
set of output activations from one feature map, $a^0$ is the set of
input activations, and $*$ is called a convolution operation. We're
not going to make any deep use of the mathematics of convolutions, so
you don't need to worry too much about this connection. But it's
worth at least knowing where the name comes from.
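If you're curious, the connection is easy to see in code. Strictly speaking, the double sum in Equation (125) is what mathematicians call a cross-correlation - a genuine convolution flips the kernel before sliding it - which is why, in the illustrative sketch below, it's scipy's correlate2d rather than convolve2d that reproduces the loop-based feature_map shown earlier:
import numpy as np
from scipy.signal import correlate2d

a0 = np.random.rand(28, 28)                 # input activations
w = np.random.randn(5, 5)                   # shared weights
b = np.random.randn()                       # shared bias
z = correlate2d(a0, w, mode='valid') + b    # the sum in Equation (125), for every j, k at once
a1 = 1.0 / (1.0 + np.exp(-z))               # sigma applied elementwise; a1 has shape (24, 24)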
Pooling layers: In addition to the convolutional layers just
described, convolutional neural networks also contain pooling
layers. Pooling layers are usually used immediately after
convolutional layers. What the pooling layers do is simplify the
information in the output from the convolutional layer.
In detail, a pooling layer takes each feature map*
*The
nomenclature is being used loosely here. In particular, I'm using
"feature map" to mean not the function computed by the
convolutional layer, but rather the activation of the hidden neurons
output from the layer. This kind of mild abuse of nomenclature is
pretty common in the research literature. output from the
convolutional layer and prepares a condensed feature map. For
instance, each unit in the pooling layer may summarize a region of
(say) $2 \times 2$ neurons in the previous layer. As a concrete
example, one common procedure for pooling is known as
max-pooling. In max-pooling, a pooling unit simply outputs the
maximum activation in the $2 \times 2$ input region, as illustrated in
the following diagram:
Note that since we have $24 \times 24$ neurons output from the
convolutional layer, after pooling we have $12 \times 12$ neurons.
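Max-pooling is simple enough to express in a few lines of NumPy. As with the earlier sketches, this is illustrative code, not part of network3.py:
import numpy as np

def max_pool_2x2(fm):
    """Condense a feature map by taking the maximum activation
    over each non-overlapping 2 x 2 block."""
    rows, cols = fm.shape
    blocks = fm.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.random.rand(24, 24)     # stand-in for a feature map's output
print(max_pool_2x2(fm).shape)   # (12, 12)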
As mentioned above, the convolutional layer usually involves more than
a single feature map. We apply max-pooling to each feature map
separately. So if there were three feature maps, the combined
convolutional and max-pooling layers would look like:
We can think of max-pooling as a way for the network to ask whether a
given feature is found anywhere in a region of the image. It then
throws away the exact positional information. The intuition is that
once a feature has been found, its exact location isn't as important
as its rough location relative to other features. A big benefit is
that there are many fewer pooled features, and so this helps reduce
the number of parameters needed in later layers.
Max-pooling isn't the only technique used for pooling. Another common
approach is known as L2 pooling. Here, instead of taking the
maximum activation of a $2 \times 2$ region of neurons, we take the
square root of the sum of the squares of the activations in the $2
\times 2$ region. While the details are different, the intuition is
similar to max-pooling: L2 pooling is a way of condensing information
from the convolutional layer. In practice, both techniques have been
widely used. And sometimes people use other types of pooling
operation. If you're really trying to optimize performance, you may
use validation data to compare several different approaches to
pooling, and choose the approach which works best. But we're not
going to worry about that kind of detailed optimization.
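In code, L2 pooling differs from the max-pooling sketch above only in its final line:
import numpy as np

def l2_pool_2x2(fm):
    """Take the square root of the sum of squares over each 2 x 2 block."""
    rows, cols = fm.shape
    blocks = fm.reshape(rows // 2, 2, cols // 2, 2)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))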
Putting it all together: We can now put all these ideas
together to form a complete convolutional neural network. It's
similar to the architecture we were just looking at, but has the
addition of a layer of $10$ output neurons, corresponding to the $10$
possible values for MNIST digits ('0', '1', '2', etc):
The network begins with $28 \times 28$ input neurons, which are used
to encode the pixel intensities for the MNIST image. This is then
followed by a convolutional layer using a $5 \times 5$ local receptive
field and $3$ feature maps. The result is a layer of $3 \times 24
\times 24$ hidden feature neurons. The next step is a max-pooling
layer, applied to $2 \times 2$ regions, across each of the $3$ feature
maps. The result is a layer of $3 \times 12 \times 12$ hidden feature
neurons.
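If you like, you can trace these shapes with a few lines of arithmetic (again, purely illustrative):
n_maps, field, pool = 3, 5, 2
conv_side = 28 - field + 1              # 24
pool_side = conv_side // pool           # 12
print((n_maps, conv_side, conv_side))   # (3, 24, 24) hidden feature neurons
print((n_maps, pool_side, pool_side))   # (3, 12, 12) after max-pooling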
The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the
max-pooled layer to every one of the $10$ output neurons. This
fully-connected architecture is the same as we used in earlier
chapters. Note, however, that in the diagram above, I've used a
single arrow, for simplicity, rather than showing all the connections.
Of course, you can easily imagine the connections.
This convolutional architecture is quite different to the
architectures used in earlier chapters. But the overall picture is
similar: a network made of many simple units, whose behaviors are
determined by their weights and biases. And the overall goal is still
the same: to use training data to train the network's weights and
biases so that the network does a good job classifying input digits.
In particular, just as earlier in the book, we will train our network
using stochastic gradient descent and backpropagation. This mostly
proceeds in exactly the same way as in earlier chapters. However, we
do need to make a few modifications to the backpropagation procedure.
The reason is that our earlier derivation of
backpropagation was for networks with fully-connected layers.
Fortunately, it's straightforward to modify the derivation for
convolutional and max-pooling layers. If you'd like to understand the
details, then I invite you to work through the following problem. Be
warned that the problem will take some time to work through, unless
you've really internalized the earlier derivation of
backpropagation (in which case it's easy).
Problem
We've now seen the core ideas behind convolutional neural networks. Let's look at how they work in practice, by implementing some convolutional networks, and applying them to the MNIST digit classification problem. The program we'll use to do this is called network3.py, and it's an improved version of the programs network.py and network2.py developed in earlier chapters* *Note also that network3.py incorporates ideas from the Theano library's documentation on convolutional neural nets (notably the implementation of LeNet-5), from Misha Denil's implementation of dropout, and from Chris Olah.. If you wish to follow along, the code is available on GitHub. Note that we'll work through the code for network3.py itself in the next section. In this section, we'll use network3.py as a library to build convolutional networks.
The programs network.py and network2.py were implemented using Python and the matrix library NumPy. Those programs worked from first principles, and got right down into the details of backpropagation, stochastic gradient descent, and so on. But now that we understand those details, for network3.py we're going to use a machine learning library known as Theano* *See Theano: A CPU and GPU Math Expression Compiler in Python, by James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010). Theano is also the basis for the popular Pylearn2 and Keras neural network libraries. Other popular neural net libraries at the time of this writing include Caffe and Torch.. Using Theano makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano is also quite a bit faster than our earlier code (which was written to be easy to understand, not fast), and this makes it practical to train more complex networks. In particular, one great feature of Theano is that it can run code on either a CPU or, if available, a GPU. Running on a GPU provides a substantial speedup and, again, helps make it practical to train more complex networks.
If you wish to follow along, then you'll need to get Theano running on your system. To install Theano, follow the instructions at the project's homepage. The examples which follow were run using Theano 0.6* *As I release this chapter, the current version of Theano has changed to version 0.7. I've actually rerun the examples under Theano 0.7 and get extremely similar results to those reported in the text.. Some were run under Mac OS X Yosemite, with no GPU. Some were run on Ubuntu 14.04, with an NVIDIA GPU. And some of the experiments were run under both. To get network3.py running you'll need to set the GPU flag to either True or False (as appropriate) in the network3.py source. Beyond that, to get Theano up and running on a GPU you may find the instructions here helpful. There are also tutorials on the web, easily found using Google, which can help you get things working. If you don't have a GPU available locally, then you may wish to look into Amazon Web Services EC2 G2 spot instances. Note that even with a GPU the code will take some time to execute. Many of the experiments take from minutes to hours to run. On a CPU it may take days to run the most complex of the experiments. As in earlier chapters, I suggest setting things running, and continuing to read, occasionally coming back to check the output from the code. If you're using a CPU, you may wish to reduce the number of training epochs for the more complex experiments, or perhaps omit them entirely.
To get a baseline, we'll start with a shallow architecture using just
a single hidden layer, containing $100$ hidden neurons. We'll train
for $60$ epochs, using a learning rate of $\eta = 0.1$, a mini-batch
size of $10$, and no regularization. Here we go*
*Code for the
experiments in this section may be found
in
this script. Note that the code in the script simply duplicates
and parallels the discussion in this section.
Note also that
throughout the section I've explicitly specified the number of
training epochs. I've done this for clarity about how we're
training. In practice, it's worth using
early stopping, that is,
tracking accuracy on the validation set, and stopping training when
we are confident the validation accuracy has stopped improving.:
>>> import network3
>>> from network3 import Network
>>> from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([FullyConnectedLayer(n_in=784, n_out=100),
                   SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)