**Chapter 4. Introduction to neural learning: gradient descent**


In this chapter
- Do neural networks make accurate predictions?
- Why measure error?
- Hot and cold learning
- Calculating both direction and amount from error
- Gradient descent
- Learning is just reducing error
- Derivatives and how to use them to learn
- Divergence and alpha


> **_“The only relevant test of the validity of a hypothesis is comparison of its predictions with experience.”_**
>
> <div align="right">—— Milton Friedman, Essays in Positive Economics (University of Chicago Press, 1953)</div>



# Predict, compare, and learn
The predict step perhaps raised the question, “How do we set weight values so the network predicts accurately?” Answering this question is the main focus of this chapter, as we cover the next two steps of the paradigm: compare and learn.

# Compare

**Comparing gives a measurement of how much a prediction “missed” by**

We’ll consider the analogy of an archer hitting a target: whether the shot is too low by an inch or too high by an inch, the error is still just 1 inch. In the neural network compare step, you need to consider these kinds of properties when measuring error.

As a heads-up, in this chapter we evaluate only one simple way of measuring error: **_mean squared error_**. It’s but one of many ways to evaluate the accuracy of a neural network.

It won’t tell you why you missed, in what direction you missed, or what you should do to fix the error. It more or less says “big miss,” “little miss,” or “perfect prediction.” What to do about the error is captured in the next step, **_learn_**.



# Learn

**Learning tells each weight how it can change to reduce the error**

In this chapter, we’ll spend many pages looking at the most popular version of the deep learning blame game: **_gradient descent_**.

# Compare: Does your network make good predictions?

**Let’s measure the error and find out!**

Execute the following code in your Jupyter notebook. It should print approximately `0.3025` (floating-point rounding adds a few trailing digits):

<div align="center">
    <img src="images/4.1.jpg">
</div>             


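```python
knob_weight = .5
input = .5
goal_pred = .8

# Predict, then compare the prediction against the goal.
pred = input * knob_weight

# Squaring keeps the error positive.
error = (pred - goal_pred) ** 2

print(error)
```

```
0.30250000000000005
```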


**What is the `goal_pred` variable?**
Much like `input`, `goal_pred` is a number you recorded in the real world somewhere. But it’s usually something hard to observe, like “the percentage of people who did wear sweatsuits,” given the temperature; or “whether the batter did hit a home run,” given his batting average. 

**Why is the error squared?**

Think about an archer hitting a target. When the shot hits 2 inches too high, how much did the archer miss by? When the shot hits 2 inches too low, how much did the archer miss by? Both times, the archer missed by only 2 inches. The primary reason to square “how much you missed” is that it forces the output to be positive. `(pred - goal_pred)` could be negative in some situations, unlike actual error.
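As a quick check of the archer analogy, here is a minimal sketch. The helper name `squared_error` and the two example predictions are illustrative assumptions, not part of the chapter's code. It shows that overshooting and undershooting by the same amount give raw differences of opposite sign but the same squared error.

```python
def squared_error(pred, goal_pred):
    """Squared difference between a prediction and its target."""
    return (pred - goal_pred) ** 2

goal_pred = 0.8
too_high = 1.0  # overshoots the goal by 0.2
too_low = 0.6   # undershoots the goal by 0.2

# The raw differences have opposite signs...
raw_high = too_high - goal_pred
raw_low = too_low - goal_pred

# ...but squaring makes both errors positive and (up to rounding) equal.
err_high = squared_error(too_high, goal_pred)
err_low = squared_error(too_low, goal_pred)

print(raw_high, raw_low)
print(err_high, err_low)
```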

**Doesn’t squaring make big errors (>1) bigger and small errors (<1) smaller?**

Yeah ... It’s kind of a weird way of measuring error, but it turns out that amplifying big errors and reducing small errors is OK. Later, you’ll use this error to help the network learn, and you’d rather it pay attention to the big errors and not worry so much about the small ones. Good parents are like this, too: they practically ignore errors if they’re small enough (breaking the lead on your pencil) but may go nuclear for big errors (crashing the car). See why squaring is valuable? 

# Why measure error?
**Measuring error simplifies the problem**

The goal of training a neural network is to make correct predictions. That’s what you want. 

**Different ways of measuring error prioritize error differently.**

 By **squaring** the error, numbers that are less than 1 get **smaller**, whereas numbers that are greater than 1 get **bigger**. 


By measuring error this way, you can prioritize big errors over smaller ones. When you have somewhat large pure errors (say, 10), you’ll tell yourself that you have very large error (`10**2 == 100`); and in contrast, when you have small pure errors (say, 0.01), you’ll tell yourself that you have very small error (`0.01**2 == 0.0001`). See what I mean about prioritizing? It’s just modifying what you consider to be error so that you amplify big ones and largely ignore small ones.

In contrast, if you took the absolute value instead of squaring the error, you wouldn’t have this type of prioritization. The error would just be the positive version of the pure error—which would be fine, but different. More on this later. 
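To make the prioritization concrete, the following sketch compares the two measures on the pure errors used in the text (10 and 0.01). The variable names are chosen here for illustration only.

```python
pure_errors = [10, 0.01]  # one large miss, one small miss

# Squaring amplifies misses above 1 and shrinks misses below 1.
squared_errors = [e ** 2 for e in pure_errors]

# The absolute value only removes the sign; the size of each miss is unchanged.
absolute_errors = [abs(e) for e in pure_errors]

for pure, sq, ab in zip(pure_errors, squared_errors, absolute_errors):
    print("pure:", pure, "squared:", sq, "absolute:", ab)
```

Under squaring, the large miss counts roughly a million times more than the small one; under the absolute value, only a thousand times more.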

**Why do you want only positive error?**

Eventually, you’ll be working with millions of `input` -> `goal_prediction` pairs, and you’ll still want to make accurate predictions. So, you’ll try to take the average error down to 0.

This presents a problem if the error can be positive and negative. Imagine if you were trying to get the neural network to correctly predict two datapoints—two `input` -> `goal_prediction` pairs. If the first had an error of 1,000 and the second had an error of –1,000, then the average error would be zero! You’d fool yourself into thinking you predicted perfectly, when you missed by 1,000 each time! That would be really bad. Thus, you want the error of each prediction to always be positive so they don’t accidentally cancel each other out when you average them. 
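The cancellation described above can be checked in a few lines. This is a small sketch using the two hypothetical errors from the text (1,000 and –1,000). The variable names are illustrative.

```python
# Signed misses for two datapoints: one too high, one too low.
pure_errors = [1000, -1000]

# Averaging the signed errors cancels them out and hides both misses.
mean_signed = sum(pure_errors) / len(pure_errors)

# Squaring first keeps every term positive, so the misses stay visible.
mean_squared = sum(e ** 2 for e in pure_errors) / len(pure_errors)

print(mean_signed)   # 0.0
print(mean_squared)  # 1000000.0
```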

# What’s the simplest form of neural learning?

**Learning using the hot and cold method**

At the end of the day, learning is really about one thing: adjusting `knob_weight` either up or down so the error is reduced. If you keep doing this and the error goes to 0, you’re done learning! How do you know whether to turn the knob up or down? Well, you try both up and down and see which one reduces the error! Whichever one reduces the error is used to update `knob_weight`. It’s simple but effective. After you do this over and over again, eventually error == 0, which means the neural network is predicting with perfect accuracy. 

<div align="center">
    <img src="images/4.2.jpg">
</div>      


<div align="center">
    <img src="images/4.3.jpg">
</div>      


<div align="center">
    <img src="images/4.4.jpg">
</div>      


<div align="center">
    <img src="images/4.5.jpg">
</div>   


<div align="center">
    <img src="images/4.6.jpg">
</div>      

This reveals what learning in neural networks really is: a search problem. You’re searching for the best possible configuration of weights so the network’s error falls to 0 (and predicts perfectly). As with all other forms of search, you might not find exactly what you’re looking for, and even if you do, it may take some time. Next, we’ll use hot and cold learning for a slightly more difficult prediction so you can see this searching in action! 

# Hot and cold learning

**This is perhaps the simplest form of learning**

Execute the following code in your Jupyter notebook. This code attempts to correctly predict 0.8:

```python
weight = .5
input = .5
goal_prediction = .8

# How far to move the weight on each iteration.
step_amount = .001

for iteration in range(1101):
    prediction = input * weight
    error = (prediction - goal_prediction) ** 2

    # Only log once the prediction is close to the goal.
    if error < 0.0001:
        print(f"Iteration: {iteration} Error: {error}\t Prediction: {prediction}\t Weight: {weight}")

    # Try a slightly higher weight.
    up_prediction = input * (weight + step_amount)
    up_error = (up_prediction - goal_prediction) ** 2

    # Try a slightly lower weight.
    down_prediction = input * (weight - step_amount)
    down_error = (down_prediction - goal_prediction) ** 2

    # Move in whichever direction gave the smaller error.
    if down_error < up_error:
        weight = weight - step_amount
    if down_error > up_error:
        weight = weight + step_amount
```

```
Iteration: 1081 Error: 9.025000000060451e-05	 Prediction: 0.7904999999999682	 Weight: 1.5809999999999365
Iteration: 1082 Error: 8.100000000057368e-05	 Prediction: 0.7909999999999682	 Weight: 1.5819999999999363
Iteration: 1083 Error: 7.225000000054275e-05	 Prediction: 0.7914999999999681	 Weight: 1.5829999999999362
Iteration: 1084 Error: 6.40000000005117e-05	 Prediction: 0.7919999999999681	 Weight: 1.5839999999999361
Iteration: 1085 Error: 5.625000000048055e-05	 Prediction: 0.792499999999968	 Weight: 1.584999999999936
Iteration: 1086 Error: 4.9000000000449285e-05	 Prediction: 0.792999999999968	 Weight: 1.585999999999936
Iteration: 1087 Error: 4.225000000041791e-05	 Prediction: 0.7934999999999679	 Weight: 1.5869999999999358
Iteration: 1088 Error: 3.6000000000386424e-05	 Prediction: 0.7939999999999678	 Weight: 1.5879999999999357
Iteration: 1089 Error: 3.0250000000354826e-05	 Prediction: 0.7944999999999678	 Weight: 1.5889999999999356
Iteration: 1090 Error: 2.500000000032312e-05	 Prediction:
```

# Characteristics of hot and cold learning

**It’s simple**

Hot and cold learning is simple. After making a prediction, you predict two more times, once with a slightly higher weight and again with a slightly lower weight. You then move weight depending on which direction gave a smaller error. Repeating this enough times eventually reduces error to 0.

**Why did I iterate exactly 1,101 times?**

The neural network in the example reaches 0.8 after exactly that many iterations. If you go past that, it wiggles back and forth between 0.8 and just above or below 0.8, making for a less pretty error log than the one printed above. Feel free to try it.

**Problem 1: It’s inefficient**

You have to predict multiple times (three times per iteration in the code above) to make a single `knob_weight` update. This seems very inefficient.

**Problem 2: Sometimes it’s impossible to predict the exact goal prediction**
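One way this can happen, sketched below under assumed values: with a fixed step size, the weight can only land on a grid of values, and the weight that gives a perfect prediction may not be on that grid. The `step_amount` of 0.25 is a deliberately coarse value chosen for illustration and does not come from the chapter's code. The perfect weight here is 1.6, which the search can never reach.

```python
weight = 0.5
input = 0.5
goal_prediction = 0.8
step_amount = 0.25  # illustrative, deliberately coarse step

history = []  # (weight, error) recorded at each iteration
for iteration in range(12):
    prediction = input * weight
    error = (prediction - goal_prediction) ** 2
    history.append((weight, error))

    up_error = (input * (weight + step_amount) - goal_prediction) ** 2
    down_error = (input * (weight - step_amount) - goal_prediction) ** 2

    if down_error < up_error:
        weight -= step_amount
    elif down_error > up_error:
        weight += step_amount

# The weight ends up bouncing between the two grid points around 1.6.
for w, e in history[-4:]:
    print("Weight:", w, "Error:", e)
```

In this run the weight alternates between 1.5 and 1.75, and the error never reaches 0.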



# Calculating both direction and amount from error

**Let’s measure the error and find the direction and amount!**

Execute this code in your Jupyter notebook:

<div align="center">
    <img src="images/4.7.jpg">
</div>      

```python
weight = .5
input = .5
goal_prediction = .8

for iteration in range(20):
    pred = input * weight
    error = (pred - goal_prediction) ** 2

    # Pure error scaled by the input gives both direction and amount.
    direction_and_amount = (pred - goal_prediction) * input
    weight = weight - direction_and_amount

    print("Error: " + str(error) + " Prediction: " + str(pred))
```

```
Error: 0.30250000000000005 Prediction: 0.25
Error: 0.17015625000000004 Prediction: 0.3875
Error: 0.095712890625 Prediction: 0.49062500000000003
Error: 0.05383850097656251 Prediction: 0.56796875
Error: 0.03028415679931642 Prediction: 0.6259765625
Error: 0.0170348381996155 Prediction: 0.669482421875
Error: 0.00958209648728372 Prediction: 0.70211181640625
Error: 0.005389929274097089 Prediction: 0.7265838623046875
Error: 0.0030318352166796153 Prediction: 0.7449378967285156
Error: 0.0017054073093822882 Prediction: 0.7587034225463867
Error: 0.0009592916115275371 Prediction: 0.76902756690979
Error: 0.0005396015314842384 Prediction: 0.7767706751823426
Error: 0.000303525861459885 Prediction: 0.7825780063867569
Error: 0.00017073329707118678 Prediction: 0.7869335047900676
Error: 9.603747960254256e-05 Prediction: 0.7902001285925507
Error: 5.402108227642978e-05 Prediction: 0.7926500964444131
Error: 3.038685878049206e-05 Prediction: 0.7944875723333098
Error: 1.7092608064027242e-05 Prediction: 0.7958
```

What you see here is a superior form of learning known as gradient descent. This method allows you to calculate, in a single line of code (the `direction_and_amount` assignment), both the direction and the amount you should change `weight` to reduce error.

**What is `direction_and_amount`?**

`direction_and_amount` represents how you want to change `weight`. The first part is what I call pure error, which equals `(pred - goal_prediction)`. (More about this shortly.) The second part is the multiplication by `input`, which performs scaling, negative reversal, and stopping, modifying the pure error so it’s ready to update `weight`.

**What is the pure error?**

The pure error is `(pred - goal_prediction)`, which indicates the raw direction and amount you missed. If this is a positive number, you predicted too high, and vice versa. If this is a big number, you missed by a big amount, and so on.
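As a small illustration of reading the pure error, the sketch below classifies three example predictions against a goal of 0.8. The helper `describe_miss` and the example predictions are hypothetical and only serve this illustration.

```python
def describe_miss(pred, goal_prediction):
    """Return the pure error and a label for the direction of the miss."""
    pure_error = pred - goal_prediction
    if pure_error > 0:
        verdict = "too high"
    elif pure_error < 0:
        verdict = "too low"
    else:
        verdict = "exact"
    return pure_error, verdict

goal_prediction = 0.8
for pred in (1.0, 0.6, 0.8):
    pure_error, verdict = describe_miss(pred, goal_prediction)
    print(pred, pure_error, verdict)
```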

**What are scaling, negative reversal, and stopping?**

These three attributes have the combined effect of translating the pure error into the absolute amount you want to change `weight`. They do so by addressing three major edge cases where the pure error isn’t sufficient to make a good modification to `weight`.

**What is stopping?**

Stopping is the first (and simplest) effect on the pure error caused by multiplying it by `input`. Imagine plugging a CD player into your stereo. If you turned the volume all the way up but the CD player was off, the volume change wouldn’t matter. Stopping addresses this in a neural network. If `input` is 0, then it will force `direction_and_amount` to also be 0. You don’t learn (change the volume) when `input` is 0, because there’s nothing to learn. Every `weight` value has the same error, and moving it makes no difference because `pred` is always 0.
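A minimal sketch of stopping, assuming the same `goal_prediction` as the earlier examples and three arbitrary weights chosen here for illustration: with `input` set to 0, every weight produces the same error and a zero update.

```python
goal_prediction = 0.8
input = 0.0  # the CD player is switched off

errors = []
updates = []
for weight in (-1.0, 0.5, 3.0):
    pred = input * weight  # always 0, whatever the weight is
    errors.append((pred - goal_prediction) ** 2)
    updates.append((pred - goal_prediction) * input)

print(errors)   # the same error for every weight
print(updates)  # every update is zero, so the weight never moves
```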

**What is negative reversal?**

This is probably the most difficult and important effect. Normally (when `input` is positive), moving `weight` upward makes the prediction move upward. But if `input` is negative, then all of a sudden `weight` changes directions! When `input` is negative, moving `weight` up makes the prediction go down. It’s reversed! How do you address this? Well, multiplying the pure error by `input` will reverse the sign of `direction_and_amount` in the event that `input` is negative. This is negative reversal, ensuring that `weight` moves in the correct direction even if `input` is negative.
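The sketch below illustrates negative reversal with two assumed inputs, 2.0 and -2.0, and a hypothetical helper `weight_update`. In the second case the prediction is too low (negative pure error), yet the weight still has to go down to raise the prediction, and multiplying by the negative `input` produces exactly that.

```python
def weight_update(input, weight, goal_prediction):
    """Return the pure error and direction_and_amount for one prediction."""
    pred = input * weight
    pure_error = pred - goal_prediction
    direction_and_amount = pure_error * input
    return pure_error, direction_and_amount

weight = 0.5
goal_prediction = 0.8

# Positive input: prediction is 1.0 (too high), so the weight should decrease.
pure_pos, update_pos = weight_update(2.0, weight, goal_prediction)

# Negative input: prediction is -1.0 (too low). Raising the prediction
# requires a LOWER weight, because a negative input flips the relationship.
pure_neg, update_neg = weight_update(-2.0, weight, goal_prediction)

# Both updates are positive, so `weight - direction_and_amount` decreases
# the weight in both cases, even though the pure errors have opposite signs.
print(pure_pos, update_pos)
print(pure_neg, update_neg)
```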

**What is scaling?**

Scaling is the third effect on the pure error caused by multiplying it by `input`. Logically, if `input` is big, your `weight` update should also be big. This is more of a side effect, because it often goes out of control. Later, you’ll use alpha to address when that happens.
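To see scaling go out of control, the following sketch runs the same update rule with two inputs. The function `run_updates` and the input value 2.0 are assumptions made for this illustration; 0.5 is the input used in the chapter's example.

```python
def run_updates(input, weight=0.5, goal_prediction=0.8, steps=5):
    """Apply the update rule repeatedly; return the error at each step."""
    errors = []
    for _ in range(steps):
        pred = input * weight
        errors.append((pred - goal_prediction) ** 2)
        # A larger input produces a proportionally larger weight update.
        weight = weight - (pred - goal_prediction) * input
    return errors

small_input_errors = run_updates(input=0.5)  # error shrinks each step
large_input_errors = run_updates(input=2.0)  # updates overshoot; error grows

print(small_input_errors)
print(large_input_errors)
```

With the larger input, each update overshoots the goal by more than the previous miss, so the error grows instead of shrinking.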

In this example, you saw gradient descent in action in a bit of an oversimplified environment. Next, you’ll see it in its more native environment. Some terminology will be different, but I’ll code it in a way that makes it more obviously applicable to other kinds of networks (such as those with multiple inputs and outputs).

# One iteration of gradient descent
