In [61]:
import numpy as np

To learn the maths we will solve y = ax + b. If you consider a = 5 and b = 10 then for some value of x we will get some y. Lets define our function.

In [62]:
def lin(a, b, x):
    return a * x + b

Lets define some other functions to calculate MSE(Mean squared errors). Here y_hat is for some given input x and output y if we presume a random value for a and b then y_hat is the ourput.

In [63]:
def mse(y_hat, y):
    return ((y - y_hat) ** 2).mean()

In [64]:
def mse_loss(a, b, x, y):
    return mse(lin(a, b, x), y)

As mentioned we need some value of x and y. Lets define a function to get some fake data.

In [65]:
def gen_fake_data(n, a, b):
    x = np.random.uniform(0, 1, n)
    y = lin(a, b, x) + 0.1 * np.random.normal(0, 3, n)
    return x, y

Now we have our function for calculating error and getting data. Lets get some fake data and see the error if we choose some different a and b.

In [66]:
index = 0
x, y = gen_fake_data(bs, 2., 10.)
a_guess, b_guess = 1., 1.
mse_loss(a_guess, b_guess, x, y)

87.74562285385427

Now here comes the interesting part.
As we saw earlier $$ error = (y-y`)^2  $$
                         $$= (y`-(ax+b))^2 \text{  as  } y`=ax+b$$
                         $$= y^2+(ax+b)^2-2y(ax+b)$$
                         $$= y^2+a^2x^2+2axb+b^2-2axy-2by$$

Now if we change the value of a bit what will be its impact on error?
$$ \frac{de}{da}=\frac{d}{da} (y^2+a^2x^2+2axb+b^2-2axy-2by) $$
               $$ =2x^2a+2xb-2xy $$
               $$ =2x(ax+b-y) $$
               $$ =2x(y`-y) $$
And if we change the value of b bit what will be its impact on error?
$$ \frac{de}{db}=\frac{d}{db} (y^2+a^2x^2+2axb+b^2-2axy-2by) $$
               $$ =2ax+2b-2y $$
               $$ =2(ax+b-y) $$
               $$ =2(y`-y) $$

Now we are implementing function to calculate our derivatives

In [67]:
def get_derivatives():
    global a_guess, b_guess
    y_pred = lin(a_guess, b_guess, x[index])
    dedb = 2 * (y_pred - y[index])
    deda = dedb * x[index]
    return deda, dedb

Now all we need to define a learning rate. Then we will able get our new value of a and b from our derivatives and learning rate.

In [68]:
def sgd():
    global a_guess, b_guess
    deda, dedb = get_derivatives()
    a_guess -= (deda * lr)
    b_guess -= (dedb * lr)

SGD is not that efficient when the surface curve is more steep. So we use a hyperameter called momentum. Basically momentum is we should keep going the way we are going and update a bit. We calculate it by the [linear interpolation](https://en.wikipedia.org/wiki/Linear_interpolation) of the derivatives of our last value(a and b) and the direction we were going last time.

In [81]:
beta = .9
alpha = 1 - beta
mga = 1.
mgb = 1.


def momentum():
    global a_guess, b_guess, mga, mgb
    deda, dedb = get_derivatives()
    a_guess -= (mga * lr)
    b_guess -= (mgb * lr)
    mga = (beta * mga) + (alpha * deda)
    mgb = (beta * mgb) + (alpha * dedb)

When we updating out values of a and b certain time we will understand that we are updating our values very slowly. To fix that father of neural network Geoffrey Hinton suggest that we should  divide the learning rate by an exponentially decaying average of squared gradients which is called RMSProp

In [79]:
srda = 1.
srdb = 1.


def rmsprop():
    global a_guess, b_guess, srda, srdb
    deda, dedb = get_derivatives()
    a_guess -= (deda * lr / np.sqrt(srda))
    b_guess -= (dedb * lr / np.sqrt(srdb))
    srda = ((beta * srda) + (alpha * deda)) ** 2
    srdb = ((beta * srdb) + (alpha * dedb)) ** 2

Finally Adam(Adaptive Moment Estimation) is the combination of RMSProp and Momentum. Basically adam prevents the roughness of RMSProp. How? In adam we use 2 momentum. One is the momentum of our gradient and another is the momentum of our squared gradient.

In [71]:
ada1 = 1.
adb1 = 1.
ada2 = 1.
adb2 = 1.
beta1 = .7
alpha1 = 1 - .7

def adam():
    global a_guess, b_guess, ada1, adb1, ada2, adb2
    deda, dedb = get_derivatives()
    a_guess -= (ada1 * lr / np.sqrt(ada2))
    b_guess -= (adb1 * lr / np.sqrt(adb2))
    ada1 = (beta1 * ada1) + (alpha1 * deda)
    adb1 = (beta1 * adb1) + (alpha1 * dedb)
    ada2 = ((beta * ada2) + (alpha * deda)) ** 2
    adb2 = ((beta * adb2) + (alpha * dedb)) ** 2

Lets see how our each optimization technique performs.

In [83]:
index = 0
lr = .01
bs = 10
epoch = 3
a_guess, b_guess = 1., 1.
for i in range(epoch):
    for j in range (bs-1):
        sgd()
        #momentum()
        #rmsprop()
        #adam()
        
print("a: " + str(a_guess) + " b: " + str(b_guess) + " error: " + str(mse_loss(a_guess, b_guess, x, y)))

a: 4.91647177550759 b: 1.0 error: 54.66376869315951
