### Gradient Descent
To evaluate the predictions we make with a neural network, we need the actual value associated with the input data. We call this value the ground truth.

In [1]:
W = 0.5
X = 0.5
# Ground Truth
GT = 0.8

# Prediction
P = X*W
# Taking the squared error
E = (GT - P)**2
print(E)

0.30250000000000005


Intuitively, how can we get closer to the ground truth? We need to keep updating the weight so that the prediction moves toward it. One way to do this is the **Hot and Cold** method.

1. We add and subtract a small step amount from the weight and check whether the error is reduced in either case.
2. If it is, we move the weight by the step amount in the direction that reduced the error, and repeat.

In [3]:
LR = 0.01

def neural_network(X, W):
    P = X*W
    return P

def squared_difference(P, GT):
    return (GT-P)**2

# number of rooms
X = 0.5
W = 0.5
# house price
GT = 0.8

for i in range(1101):
    P = neural_network(X, W)
    E = squared_difference(P, GT)
    print(f"Iteration: {i} Error: {E} Prediction: {P}")

    HP = neural_network(X, W+LR)
    HE = squared_difference(HP, GT)

    LP = neural_network(X, W-LR)
    LE = squared_difference(LP, GT)

    if LE < HE:
        W -= LR
    else:
        W += LR


Iteration: 0 Error: 0.30250000000000005 Prediction: 0.25
Iteration: 1 Error: 0.29702500000000004 Prediction: 0.255
Iteration: 2 Error: 0.2916 Prediction: 0.26
Iteration: 3 Error: 0.286225 Prediction: 0.265
Iteration: 4 Error: 0.28090000000000004 Prediction: 0.27
Iteration: 5 Error: 0.275625 Prediction: 0.275
Iteration: 6 Error: 0.27040000000000003 Prediction: 0.28
Iteration: 7 Error: 0.265225 Prediction: 0.28500000000000003
Iteration: 8 Error: 0.2601 Prediction: 0.29000000000000004
Iteration: 9 Error: 0.255025 Prediction: 0.29500000000000004
Iteration: 10 Error: 0.25 Prediction: 0.30000000000000004
Iteration: 11 Error: 0.245025 Prediction: 0.30500000000000005
Iteration: 12 Error: 0.24009999999999998 Prediction: 0.31000000000000005
Iteration: 13 Error: 0.235225 Prediction: 0.31500000000000006
Iteration: 14 Error: 0.2304 Prediction: 0.32000000000000006
Iteration: 15 Error: 0.225625 Prediction: 0.32500000000000007
Iteration: 16 Error: 0.22089999999999999 Prediction: 0.33000000000000007
It
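The loop above runs for a fixed number of iterations, so once the weight is within one step of the best value it will likely keep hopping back and forth around it. Below is a minimal sketch of one possible refinement, assuming we stop as soon as neither direction lowers the error. The `hot_and_cold` function and its `step` and `max_iters` parameters are hypothetical names introduced here, not part of the original cell.

```python
def neural_network(X, W):
    # Single-weight "network": the prediction is input times weight
    return X * W

def squared_difference(P, GT):
    return (GT - P) ** 2

def hot_and_cold(X, W, GT, step=0.01, max_iters=2000):
    """Nudge W up or down by `step` until neither direction lowers the error."""
    for i in range(max_iters):
        E = squared_difference(neural_network(X, W), GT)
        HE = squared_difference(neural_network(X, W + step), GT)
        LE = squared_difference(neural_network(X, W - step), GT)
        if min(HE, LE) >= E:
            # Neither neighbouring weight is better, so stop here
            return W, E, i
        # Move in whichever direction gives the lower error
        W = W + step if HE < LE else W - step
    return W, squared_difference(neural_network(X, W), GT), max_iters

W, E, iters = hot_and_cold(X=0.5, W=0.5, GT=0.8)
print(f"Stopped after {iters} steps: W={W:.2f}, error={E:.6f}")
```

Note that the final accuracy is still limited by the step size: the weight can never land closer to the best value than the grid of steps allows.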

Here's where we introduce some calculus, but let's focus on the intuition first. Using the hot and cold method, we have identified what we need to optimize in this whole process: the error. Our goal is to find the weights that lead to the minimal error. In calculus terms, we need to find the global (or a local) minimum. For that, we need to know how much the error changes, and in which direction, with a change in the weight.

That is exactly what differentiation is all about. We will use the chain rule.

What is our cost function?

$$E = (P-GT)^2$$

Taking the derivative of the cost function (w.r.t. our prediction):

$$\frac{dE}{dP} = 2\times(P-GT)$$

What is the prediction?

$$P = X\times W$$

$$\frac{dP}{dW} = X$$

Combining the two gives the rate of change of the error w.r.t. the weight:

$$\frac{dE}{dW} = \frac{dE}{dP} \times \frac{dP}{dW} = 2 \times (P-GT) \times X$$
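One way to sanity-check a derivative like this is to compare it against a finite-difference approximation. The sketch below does that for the formula above; the function names and the step size `h` are illustrative choices, not part of the original notebook.

```python
def error(W, X, GT):
    # E = (P - GT)^2 with P = X * W
    return (X * W - GT) ** 2

def analytic_gradient(W, X, GT):
    # Chain rule result: dE/dW = 2 * (P - GT) * X
    return 2 * (X * W - GT) * X

def numeric_gradient(W, X, GT, h=1e-6):
    # Central difference approximation of dE/dW
    return (error(W + h, X, GT) - error(W - h, X, GT)) / (2 * h)

W, X, GT = 0.5, 0.5, 0.8
analytic = analytic_gradient(W, X, GT)
numeric = numeric_gradient(W, X, GT)
print(f"analytic: {analytic:.6f}  numeric: {numeric:.6f}")
```

If the two values agree to several decimal places, the chain-rule expression is very likely correct. The negative sign here says that increasing the weight would reduce the error.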

In [12]:
# Calculating the weight derivative
def weight_derivative(P, GT, X):
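    # Note: the constant factor 2 from dE/dP = 2*(P-GT) is left out here.
    # It only rescales the step, and the printed outputs correspond to this version.
    dE_dP = (P-GT)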
    dP_dW = X
    dE_dW = dE_dP * dP_dW
    return dE_dW

W, GT, X = (0.0, 0.8, 0.5)

def error_optimization(X, W, epochs):
    for i in range(epochs):
        P = neural_network(X, W)
        E = squared_difference(P, GT)
        W -= weight_derivative(P, GT, X)
        print(f"Iteration: {i} Error: {E} Prediction: {P}")

error_optimization(X, W, 50)

Iteration: 0 Error: 0.6400000000000001 Prediction: 0.0
Iteration: 1 Error: 0.3600000000000001 Prediction: 0.2
Iteration: 2 Error: 0.2025 Prediction: 0.35000000000000003
Iteration: 3 Error: 0.11390625000000001 Prediction: 0.4625
Iteration: 4 Error: 0.06407226562500003 Prediction: 0.546875
Iteration: 5 Error: 0.036040649414062535 Prediction: 0.61015625
Iteration: 6 Error: 0.020272865295410177 Prediction: 0.6576171875
Iteration: 7 Error: 0.011403486728668217 Prediction: 0.693212890625
Iteration: 8 Error: 0.006414461284875877 Prediction: 0.71990966796875
Iteration: 9 Error: 0.0036081344727426873 Prediction: 0.7399322509765625
Iteration: 10 Error: 0.0020295756409177616 Prediction: 0.7549491882324219
Iteration: 11 Error: 0.001141636298016239 Prediction: 0.7662118911743164
Iteration: 12 Error: 0.0006421704176341359 Prediction: 0.7746589183807373
Iteration: 13 Error: 0.00036122085991920354 Prediction: 0.7809941887855529
Iteration: 14 Error: 0.000203186733704552 Prediction: 0.7857456415891647
I

What happens if we scale our input a bit?

In [13]:
X = 2

error_optimization(X, W, 50)

Iteration: 0 Error: 0.6400000000000001 Prediction: 0.0
Iteration: 1 Error: 5.760000000000002 Prediction: 3.2
Iteration: 2 Error: 51.84000000000002 Prediction: -6.400000000000001
Iteration: 3 Error: 466.56000000000006 Prediction: 22.400000000000002
Iteration: 4 Error: 4199.04 Prediction: -64.0
Iteration: 5 Error: 37791.35999999999 Prediction: 195.2
Iteration: 6 Error: 340122.2399999998 Prediction: -582.3999999999999
Iteration: 7 Error: 3061100.1599999983 Prediction: 1750.3999999999994
Iteration: 8 Error: 27549901.439999983 Prediction: -5247.999999999998
Iteration: 9 Error: 247949112.95999986 Prediction: 15747.199999999995
Iteration: 10 Error: 2231542016.639999 Prediction: -47238.39999999999
Iteration: 11 Error: 20083878149.759995 Prediction: 141718.39999999997
Iteration: 12 Error: 180754903347.83994 Prediction: -425151.99999999994
Iteration: 13 Error: 1626794130130.559 Prediction: 1275459.1999999997
Iteration: 14 Error: 14641147171175.031 Prediction: -3826374.399999999
Iteration: 15 Err

Wow. Our prediction has skyrocketed to enormous numbers while alternating between negative and positive. With every update to our weight, we land further away from the actual minimum, so we keep swinging back and forth around a target that we miss by more each time. The process is overcorrecting.

The problem was that our input was quite large, which made our weight update large even when our error was small. Imagine you are a long jumper with a slight disadvantage: with every jump you make, you overshoot the target by 2 times the amount you planned. So if you plan to jump 1m, you end up jumping 1+2 = 3m.

You've been asked to land on a particular spot 1m away and you make your first attempt. You end up 2m ahead of the spot. To go back, you plan a 2m jump backwards, but because of your quirk you end up 4m behind the spot. You jump towards the spot again, this time planning to cover 4m, but you land 8m ahead of it. And so on and so forth.

This is the same problem happening in our gradient descent algorithm.
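To see where the overcorrection comes from, it may help to trace one update by hand. Assume the update exactly as written in the code above, $W \leftarrow W - (P-GT) \times X$ (without the factor of 2 and without any step-size scaling), and write $P_{new}$ for the prediction after the update (a symbol introduced here for illustration):

$$P_{new} = X \times \big(W - (P-GT) \times X\big) = P - X^2 \times (P-GT)$$

$$P_{new} - GT = (1 - X^2) \times (P-GT)$$

So each update multiplies the gap between the prediction and the ground truth by $(1 - X^2)$. With $X = 0.5$ the multiplier is $0.75$, and the gap shrinks steadily: $-0.8 \rightarrow -0.6 \rightarrow -0.45$, matching the predictions $0.0$, $0.2$, $0.35$ in the first run. With $X = 2$ the multiplier is $-3$, so the gap flips sign and triples on every step: $-0.8 \rightarrow 2.4 \rightarrow -7.2$, matching the predictions $0.0$, $3.2$, $-6.4$. Under this assumption the updates only settle down when $|1 - X^2| < 1$.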

But since we know our problem is overjumping, why don't we account for it and correct it? How? We simply apply a penalty to every jump. If we plan to jump _x_ amount, then let's jump $x \times \text{penalty}$ amount instead!

We call this penalty term the **learning rate**. It makes sure that we update our weights at a stable rate only. If our learning rate is too high, we will overshoot our target. If our learning rate is too low, we will take a really long time to reach our target.

In [14]:
alpha = 0.1

def error_optimization(X, W, epochs, alpha):
    for i in range(epochs):
        P = neural_network(X, W)
        E = squared_difference(P, GT)
        W -= weight_derivative(P, GT, X)*alpha
        print(f"Iteration: {i} Error: {E} Prediction: {P}")

error_optimization(X, W, 50, alpha)

Iteration: 0 Error: 0.6400000000000001 Prediction: 0.0
Iteration: 1 Error: 0.2304 Prediction: 0.32000000000000006
Iteration: 2 Error: 0.08294400000000002 Prediction: 0.512
Iteration: 3 Error: 0.029859840000000023 Prediction: 0.6272
Iteration: 4 Error: 0.0107495424 Prediction: 0.69632
Iteration: 5 Error: 0.0038698352640000053 Prediction: 0.737792
Iteration: 6 Error: 0.0013931406950400036 Prediction: 0.7626752
Iteration: 7 Error: 0.0005015306502144003 Prediction: 0.77760512
Iteration: 8 Error: 0.0001805510340771829 Prediction: 0.7865630720000001
Iteration: 9 Error: 6.499837226778621e-05 Prediction: 0.7919378432
Iteration: 10 Error: 2.3399414016403033e-05 Prediction: 0.79516270592
Iteration: 11 Error: 8.423789045904834e-06 Prediction: 0.7970976235520001
Iteration: 12 Error: 3.0325640565258178e-06 Prediction: 0.7982585741312
Iteration: 13 Error: 1.0917230603492479e-06 Prediction: 0.7989551444787201
Iteration: 14 Error: 3.9302030172572924e-07 Prediction: 0.7993730866872321
Iteration: 15 Err

As you can see here, we kept the same input ($X = 2$) that was leading to an explosion, but because of the learning rate $\alpha$, we were able to converge to the minimum.
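The trade-off between a learning rate that is too high and one that is too low can be sketched by running the same update with a few different values of $\alpha$. This is a rough illustration under stated assumptions: the `run` helper, the tolerance `tol`, the divergence threshold and the three `alpha` values are hypothetical choices, and the gradient omits the factor of 2 to match the `weight_derivative` function used in this section.

```python
def run(X, GT, alpha, W=0.0, tol=1e-6, max_steps=1000):
    """Apply W -= alpha * (P - GT) * X until the squared error is below tol.

    Returns a (status, steps) tuple.
    """
    for step in range(max_steps):
        P = X * W
        E = (P - GT) ** 2
        if E < tol:
            return "converged", step
        if E > 1e12:
            # Error is exploding: the updates are overshooting
            return "diverged", step
        W -= alpha * (P - GT) * X
    return "not yet converged", max_steps

results = {}
for alpha in (0.001, 0.1, 0.6):
    results[alpha] = run(X=2.0, GT=0.8, alpha=alpha)
    status, steps = results[alpha]
    print(f"alpha={alpha}: {status} after {steps} steps")
```

With these particular settings, the smallest rate is still creeping toward the target when the step budget runs out, the middle rate converges quickly, and the largest rate overshoots until the error explodes.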