In [1]:
import numpy as np

# Introduction to neural learning: gradient descent

So far we have discovered how to make a prediction, but how do we make a **good prediction**?

Big question:

* how do we define a **good prediction**?

What we will do can be summarized in the following sentence:

**predict, compare, learn**

* **predict**: means making a prediction based on a set of weights;
* **compare**: means measuring how much a prediction "missed" by;
* **learn**: means adjusting the weights in order to make a "better" prediction.

So far we have seen how to make a **prediction**, i.e. the predict part. Let's now understand how to do the other two...

### The **error**

We will now focus on a specific type of error, the **mean squared error**, defined as follows:

$$
MSE(\hat{y}, y) = \sum_{i=1}^n \frac{(\hat{y}_i-y_i)^2}{n}
$$

In [3]:

# here I change the shapes of the matrices to make them less trivial, and to keep track of the shapes

input = np.random.rand(3, 1) # 3x1

A = np.random.rand(4, 3) # 4x3

B = np.random.rand(5, 4) # 5x4

# output will be 5x1

weights = [A, B]

def neural_network(input, weights):

    # hidden layer
    hid = weights[0].dot(input)

    # final layer, i.e. the prediction
    pred = weights[1].dot(hid)

    return pred

pred = neural_network(input, weights)

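# column vector (5x1) so it lines up with pred; a flat (5,) array would broadcast the difference to 5x5
goal_pred = np.array([0.3, 0.2, 1.0, 0.5, 0.6]).reshape(5, 1)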

error = np.mean((pred - goal_pred)**2)

print(f"Error: {error}")

Error: 0.19968081021308165
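
When a column-vector prediction is compared with its targets, the two shapes have to match, otherwise NumPy broadcasting silently builds a matrix of all pairwise differences and the mean is taken over that. The sketch below only illustrates the shape behaviour; the array values and the names `goal_flat`, `goal_col`, `diff_flat` and `diff_col` are made up for the example.

```python
import numpy as np

pred = np.zeros((5, 1))               # column vector, like the network output
goal_flat = np.arange(5.0)            # shape (5,)
goal_col = goal_flat.reshape(5, 1)    # shape (5, 1)

diff_flat = pred - goal_flat          # broadcasts to a 5x5 matrix
diff_col = pred - goal_col            # element by element, stays 5x1

print(diff_flat.shape, diff_col.shape)
```

Only the second difference has one entry per output, which is what the MSE formula above expects.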


### One question about MSE and two reasons

**Why MSE?**

Answers:

1. Because we need a positive error
2. Because squared is better

Why?

1. Think about predicting a value that is 0: the first time you predict 50, the second time -50. If you average the raw errors you get 0! Perfect prediction, right? Squaring makes every error positive, so they cannot cancel out.

2. Because, like in real life, it's better to focus on bigger errors: with the square, small errors (smaller than 1) become even smaller and big errors (larger than 1) become bigger.
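
Both reasons can be checked with a few numbers. This is a small sketch using the 50 / -50 example from the text; `small` and `big` are arbitrary values picked for illustration.

```python
import numpy as np

goal = 0.0
preds = np.array([50.0, -50.0])

raw_errors = preds - goal
mean_raw = raw_errors.mean()          # positive and negative misses cancel out
mean_sq = (raw_errors ** 2).mean()    # squared misses cannot cancel

print(mean_raw, mean_sq)

# squaring shrinks errors below 1 and amplifies errors above 1
small, big = 0.1, 10.0
print(small ** 2, big ** 2)
```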

# Hot and cold learning

Learning at the end of the day is just one simple thing:

**adjusting the weights in order to reduce the error**

The simplest (and stupidest) form of learning is to nudge the weights up or down by a small fixed amount until you end up with a better (smaller) error.

In [9]:
weight = 0.1

step_amount = 0.001

input = 8.0

goal_pred = 1.0

def neural_network(input, weight):

    return input*weight

pred = neural_network(input, weight)

print(f"First prediction: {pred}")

err = goal_pred-pred

err = err**2

max_iter = 20

count = 0

while err > 0.001 and count < max_iter:
    
    pred = neural_network(input, weight)
    print(f"Current prediction: {pred}")

    if pred > goal_pred:
        weight -= step_amount
    else:
        weight += step_amount

    err = (pred - goal_pred)**2
    print(f"Current weight: {weight}")
    print(f"Current error: {err}")
    print(f"_"*10)

    count += 1

First prediction: 0.8
Current prediction: 0.808
Current weight: 0.101
Current error: 0.03686399999999998
__________
Current prediction: 0.8160000000000001
Current weight: 0.10200000000000001
Current error: 0.033855999999999976
__________
Current prediction: 0.8240000000000001
Current weight: 0.10300000000000001
Current error: 0.030975999999999976
__________
Current prediction: 0.8320000000000001
Current weight: 0.10400000000000001
Current error: 0.028223999999999975
__________
Current prediction: 0.8400000000000001
Current weight: 0.10500000000000001
Current error: 0.025599999999999973
__________
Current prediction: 0.8480000000000001
Current weight: 0.10600000000000001
Current error: 0.023103999999999975
__________
Current prediction: 0.8560000000000001
Current weight: 0.10700000000000001
Current error: 0.020735999999999973
__________
Current prediction: 0.8640000000000001
Current weight: 0.10800000000000001
Current error: 0.01849599999999997
__________
Current prediction: 0.872000000

We now understand that learning in a NN is simply a **search problem**!

Some problems with hot and cold learning:

* unless the perfect weight is exactly the starting weight plus $n \cdot step\_amount$, at some point you will end up alternating up and down around the correct value
* the $step\_amount$ does not depend on the $error$, i.e. it is the same amount whether the error is big or small!
* we are moving the weights based on the **prediction**, not on the **error**, but what we want to minimize is the **error**...
* we know the correct direction, but we don't know the correct $step\_amount$
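
The first problem is easy to reproduce. The sketch below reuses the input and goal from the hot and cold cell but with a coarser, hypothetical `step_amount` of 0.01, so the ideal weight 1.0 / 8.0 = 0.125 can never be hit exactly; `input_value` and `history` are names introduced only for this example.

```python
weight, step_amount = 0.1, 0.01       # coarser step, chosen for illustration
input_value, goal_pred = 8.0, 1.0     # ideal weight is 1.0 / 8.0 = 0.125

history = []
for _ in range(8):
    pred = input_value * weight
    if pred > goal_pred:
        weight -= step_amount
    else:
        weight += step_amount
    history.append(round(weight, 2))

print(history)
```

After the first few steps the weight keeps bouncing between 0.12 and 0.13, one step below and one step above the ideal value.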


A good idea could be the following:

* calculate the **direction** using the **error**
* calculate the $step\_amount$ from the **error** as well

# Gradient descent

Let's now use gradient descent for the first time!

In [14]:
weight = 0.5

input = 1.2

goal_pred = 1.0

for i in range(10):

    # compute current prediction
    pred = input*weight

    # compute current error
    error = (pred - goal_pred) ** 2

    # compute both the amount and the direction of the weight's correction
    amount_and_direction = (pred - goal_pred) * input

    # update the weight
    weight = weight - amount_and_direction

    print(f"Prediction: {pred} New weight: {weight}")
    print(f"-"*10)


Prediction: 0.6 New weight: 0.98
----------
Prediction: 1.176 New weight: 0.7688
----------
Prediction: 0.92256 New weight: 0.8617279999999999
----------
Prediction: 1.0340736 New weight: 0.8208396800000001
----------
Prediction: 0.9850076160000001 New weight: 0.8388305408
----------
Prediction: 1.00659664896 New weight: 0.830914562048
----------
Prediction: 0.9970974744575999 New weight: 0.8343975926988801
----------
Prediction: 1.001277111238656 New weight: 0.8328650592124929
----------
Prediction: 0.9994380710549914 New weight: 0.8335393739465032
----------
Prediction: 1.0002472487358038 New weight: 0.8332426754635386
----------


# NB

Now we don't guess the direction, i.e. we don't go up and down to see which one is good; instead the sign of $(pred - goal\_pred)$ tells us the correct direction. We also don't guess the $step\_amount$: at each iteration the size of $(pred - goal\_pred)$, scaled by $input$, tells us the amount.

One may ask, "Ok cool, but why??". It seems a little bit like magic!

Right now we are here

$$
\hat{y} = f(x) = x \cdot w
$$

$$
MSE(\hat{y}, y) = (\hat{y}-y)^2 = (x\cdot w - y)^2
$$

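$$
\frac{\partial MSE}{\partial w} = 2 \cdot (\hat{y} - y) \cdot \frac{\partial \hat{y}}{\partial w} = 2 \cdot (x \cdot w - y) \cdot x
$$

The code drops the constant factor 2, which only rescales the step:

$$
w_{new} = w_{old} - (\hat{y} - y) \cdot x = w_{old} - (x \cdot w_{old} - y) \cdot x
$$


**Why is it that we use $x$ instead of $f(x)$???**

Because of the chain rule: we differentiate with respect to $w$, and the derivative of $\hat{y} = x \cdot w$ with respect to $w$ is $x$.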
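
A numerical check helps with this question. The sketch below estimates the slope of the error with respect to `weight` by a central finite difference and compares it with the term the gradient descent loop subtracts, `(pred - goal_pred) * input`. The helper `error_of`, the step `eps` and the name `input_value` are assumptions made for this example.

```python
input_value, goal_pred, weight = 1.2, 1.0, 0.5

def error_of(w):
    """Squared error written directly as a function of the weight."""
    return (input_value * w - goal_pred) ** 2

eps = 1e-6
# central finite difference: slope of the error curve at the current weight
numeric_slope = (error_of(weight + eps) - error_of(weight - eps)) / (2 * eps)

# the term the loop subtracts from the weight
update_term = (input_value * weight - goal_pred) * input_value

print(numeric_slope, 2 * update_term)
```

The two printed numbers agree, so the update term is the slope of the error up to a constant factor of 2, and the multiplication by the input is exactly what the slope requires.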

# NB

Notice that in the previous lines:

```python

    # compute current prediction
    pred = input*weight

    # compute current error
    error = (pred - goal_pred) ** 2
```

`error` is a function of `pred`. Let's write it in another way:


```python
    # compute the current error as a function of weight
    error = (input*weight - goal_pred) ** 2
```

Now `error` is an explicit function of `weight`, so we can calculate how to change `weight` in order to reduce `error`.
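
To see what "a function of `weight`" means in practice, one can evaluate the error over a grid of candidate weights. This is only a sketch: the grid `weights` and the name `input_value` are chosen for illustration, with the input and goal taken from the gradient descent cell.

```python
import numpy as np

input_value, goal_pred = 1.2, 1.0

weights = np.linspace(0.0, 2.0, 201)                 # candidate weights, spacing 0.01
errors = (input_value * weights - goal_pred) ** 2    # error for each candidate

best_weight = weights[np.argmin(errors)]
print(best_weight, goal_pred / input_value)
```

The error is a parabola in `weight`, and the grid minimum lands next to `goal_pred / input_value`, the weight that gradient descent approaches in the run above.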

# Several steps of learning

In [19]:
weight, goal_pred, input = (0.0, 0.8, 1.1)

for it in range(10):
    pred = weight*input
    delta = pred - goal_pred
    weight_delta = delta*input
    weight = weight - weight_delta
    print("Prediction: ", pred)
    print("Goal prediction: ", goal_pred)
    print("Weight: ", weight)
    print("Weigh delta: ", weight_delta)
    print("-"*10)

Prediction:  0.0
Goal prediction:  0.8
Weight:  0.8800000000000001
Weigh delta:  -0.8800000000000001
----------
Prediction:  0.9680000000000002
Goal prediction:  0.8
Weight:  0.6951999999999999
Weigh delta:  0.1848000000000002
----------
Prediction:  0.76472
Goal prediction:  0.8
Weight:  0.734008
Weigh delta:  -0.0388080000000001
----------
Prediction:  0.8074088
Goal prediction:  0.8
Weight:  0.72585832
Weigh delta:  0.008149679999999992
----------
Prediction:  0.798444152
Goal prediction:  0.8
Weight:  0.7275697528
Weigh delta:  -0.0017114328000000902
----------
Prediction:  0.80032672808
Goal prediction:  0.8
Weight:  0.727210351912
Weigh delta:  0.0003594008880000055
----------
Prediction:  0.7999313871032001
Goal prediction:  0.8
Weight:  0.7272858260984799
Weigh delta:  -7.54741864799291e-05
----------
Prediction:  0.800014408708328
Goal prediction:  0.8
Weight:  0.7272699765193192
Weigh delta:  1.5849579160709395e-05
----------
Prediction:  0.7999969741712513
Goal prediction:  

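Let's notice some things:
* the weight delta changes in amount: it shrinks as the prediction gets closer to the goal
* the weight delta changes in direction: its sign flips at every step, because each update overshoots the goal slightly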
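
Both observations follow from one relation. For this one-weight model, substituting the update into the next prediction gives `delta_new = delta * (1 - input**2)`. The sketch below reruns the loop with the same starting values and checks the ratio between consecutive deltas; `input_value`, `deltas` and `ratios` are names introduced for the example.

```python
weight, goal_pred, input_value = 0.0, 0.8, 1.1

deltas = []
for _ in range(6):
    pred = weight * input_value
    delta = pred - goal_pred
    weight -= delta * input_value
    deltas.append(delta)

# ratio between each delta and the previous one
ratios = [b / a for a, b in zip(deltas, deltas[1:])]
factor = 1 - input_value ** 2

print(ratios)
print(factor)
```

Every ratio matches the factor, which is negative (the sign flips) and smaller than 1 in absolute value (the amount shrinks).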

# Fuck calculus!

We will look up all the derivatives in a table and fuck calculus!!!

# Let's try to break it

The values below break gradient descent: `(0.0, 0.08, 11.1)`

In [35]:
weight, goal_pred, input = (0.0, 0.08, 11.1)

it = 1
max_it = 100
delta = 100

while it < max_it and abs(delta) > 0.0001:
    pred = weight*input
    delta = pred - goal_pred
    weight_delta = delta*input
    weight = weight - weight_delta
    it += 1
    if it % 10 == 0:
        print("Iteration: ", it)
        print("Weight: ", weight)
        print("Prediction: ", pred)
        print("Goal prediction: ", goal_pred)
        print("Delta: ", delta)
        print("-"*10)

print("Iteration: ", it)
print("Weight: ", weight)
print("Prediction: ", pred)
print("Goal prediction: ", goal_pred)
print("Delta: ", delta)

Iteration:  10
Weight:  4.3825583301688264e+16
Prediction:  -3980557848365434.5
Goal prediction:  0.08
Delta:  -3980557848365434.5
----------
Iteration:  20
Weight:  3.2568304217005156e+37
Prediction:  -2.958089982888121e+36
Goal prediction:  0.08
Delta:  -2.958089982888121e+36
----------
Iteration:  30
Weight:  2.4202631423516864e+58
Prediction:  -2.198258806980093e+57
Goal prediction:  0.08
Delta:  -2.198258806980093e+57
----------
Iteration:  40
Weight:  1.7985811110077823e+79
Prediction:  -1.6336020237448968e+78
Goal prediction:  0.08
Delta:  -1.6336020237448968e+78
----------
Iteration:  50
Weight:  1.3365877272876836e+100
Prediction:  -1.2139860709347261e+99
Goal prediction:  0.08
Delta:  -1.2139860709347261e+99
----------
Iteration:  60
Weight:  9.93264491549709e+120
Prediction:  -9.021549673677906e+119
Goal prediction:  0.08
Delta:  -9.021549673677906e+119
----------
Iteration:  70
Weight:  7.381291403711612e+141
Prediction:  -6.704225070059643e+140
Goal prediction:  0.08
Delta

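Wow, the prediction exploded. Why?

Because `input` is large (11.1), every `weight_delta = delta*input` overshoots the goal by more than the previous miss. Each update multiplies `delta` by $(1 - input^2) = -122.21$, so the error grows about 122 times per iteration while flipping sign, and the prediction explodes.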

# Introducing $\alpha$

In [37]:
weight, goal_pred, input = (0.0, 0.08, 11.1)

it = 1
max_it = 100
delta = 100
alpha = 0.01

while it < max_it and abs(delta) > 0.0001:
    pred = weight*input
    delta = pred - goal_pred
    weight_delta = delta*input
    weight = weight - weight_delta*alpha
    it += 1
    if it % 10 == 0:
        print("Iteration: ", it)
        print("Weight: ", weight)
        print("Prediction: ", pred)
        print("Goal prediction: ", goal_pred)
        print("Delta: ", delta)
        print("-"*10)

print("Iteration: ", it)
print("Weight: ", weight)
print("Prediction: ", pred)
print("Goal prediction: ", goal_pred)
print("Delta: ", delta)

Iteration:  7
Weight:  0.007206080482413969
Prediction:  0.08005388472729402
Goal prediction:  0.08
Delta:  5.388472729402072e-05


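$\alpha$ scales down every weight update so that it no longer overshoots. How to choose $\alpha$? Mostly by guessing: try $(0.1, 0.01, 0.001, 0.0001, ...)$ and keep a value for which the error shrinks quickly instead of exploding.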
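
The guessing can be automated with a small sweep. The sketch below is one possible way to do it, not a standard recipe: `iterations_to_converge` is a hypothetical helper that counts how many updates are needed before `abs(delta)` drops to the tolerance, using the same starting values as the cell above. For this one-weight model each update multiplies `delta` by $(1 - \alpha \cdot input^2)$, so the loop can only converge when $\alpha < 2 / input^2$, which is about 0.016 here.

```python
def iterations_to_converge(alpha, weight=0.0, goal_pred=0.08, input_value=11.1,
                           tol=1e-4, max_it=100):
    """Return the number of updates needed, or None if max_it is reached."""
    for it in range(1, max_it + 1):
        delta = weight * input_value - goal_pred
        if abs(delta) <= tol:
            return it - 1
        weight -= alpha * delta * input_value
    return None

candidates = (1.0, 0.1, 0.01, 0.001, 0.0001)
results = {alpha: iterations_to_converge(alpha) for alpha in candidates}
print(results)
```

A `None` means that value did not get there within `max_it` updates, either because it diverges (too large) or because it crawls (too small).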