# Implementation of manual backprop in neural nets

## 1. Derivatives

### 1.1  Scalar case

Given a function f : R → R, the derivative of f at a point x ∈ R is defined as: \
![scalar derivative formula](resources/scalar-derivative-formula.png)
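
Written out, the standard limit definition of the derivative (presumably what the image shows, and what the next cells approximate with h = 0.001) is:

$$
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
$$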

In [1]:
f = lambda x: 3*x
f(4)

12

In [2]:
# derivative of the function 'f' by the given formula
# taking h = 0.001

(f(4+0.001) - f(4))/0.001

3.0000000000001137
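
The same check can be wrapped in a small helper. This is a minimal sketch: the names `forward_difference` and `central_difference` are hypothetical, and the central variant is an addition that is not used elsewhere in this notebook. It is usually more accurate for the same `h`.

```python
def forward_difference(f, x, h=1e-3):
    """Approximate f'(x) with the forward-difference formula used above."""
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h=1e-3):
    """Approximate f'(x) by stepping h in both directions."""
    return (f(x + h) - f(x - h)) / (2 * h)


square = lambda x: x * x  # exact derivative at x = 4 is 8

forward = forward_difference(square, 4.0)  # approximately 8.001
central = central_difference(square, 4.0)  # approximately 8.0
print(forward, central)
```

For `x*x` the forward difference is off by roughly `h`, while the central difference is exact up to floating-point rounding.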

### 1.2  Jacobian

In [3]:
import numpy as np

In [4]:
X = np.random.rand(5)
X

array([0.32150969, 0.87640208, 0.15695178, 0.26655405, 0.9600369 ])

In [5]:
f = lambda x: x*x
f(X)

array([0.10336848, 0.76808061, 0.02463386, 0.07105106, 0.92167085])
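
For a vector-valued function, the Jacobian collects every partial derivative: entry `(i, j)` is the derivative of output `i` with respect to input `j`. Because `x*x` acts elementwise, its Jacobian should be the diagonal matrix `diag(2*x)`. The sketch below checks this numerically. The helper `numerical_jacobian` and the input `x_demo` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def numerical_jacobian(f, x, h=1e-6):
    """Forward-difference Jacobian of f at x, built one input column at a time."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    jac = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += h                      # perturb only input j
        jac[:, j] = (f(x_step) - fx) / h    # effect on every output
    return jac


x_demo = np.array([0.5, 1.0, 2.0])
J = numerical_jacobian(lambda x: x * x, x_demo)
print(np.round(J, 4))
```

The off-diagonal entries come out as zero because each output depends only on its own input.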

## Building a sample dataset
Because the focus here is on implementing backprop, we generate a small synthetic dataset instead of loading a real one.

In [7]:
x = np.random.rand(10,5)
y = x @ [[0.8], [0.3], [0.75], [0.5], [0.2]] + 0.43
y.shape

(10, 1)

## Initializing parameters of our model
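When initializing model parameters, it's important to do it randomly, so that the units of a layer do not all start out with identical weights. Here `np.random.rand` draws every value uniformly from [0, 1).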

In [6]:
w1 = np.random.rand(5,5)
b1 = np.random.rand(5)

w2 = np.random.rand(5,1)
b2 = np.random.rand(1)
w1,b1, w2,b2

(array([[0.27593355, 0.95201515, 0.72612994, 0.15686858, 0.85350509],
        [0.91957641, 0.51753238, 0.73489154, 0.6862969 , 0.79144822],
        [0.88665725, 0.5395053 , 0.86953145, 0.06929613, 0.61247942],
        [0.17410728, 0.40034224, 0.385399  , 0.61107948, 0.49672354],
        [0.06588156, 0.09678029, 0.88800686, 0.43524464, 0.20779338]]),
 array([0.68163852, 0.5895935 , 0.4460282 , 0.20865587, 0.83078669]),
 array([[0.88479021],
        [0.38111535],
        [0.40518791],
        [0.86754415],
        [0.53291447]]),
 array([0.11526387]))

In [8]:
z1 = x @ w1 + b1
z1.shape

(10, 5)

In [9]:
z2 = z1 @ w2 + b2
z2.shape

(10, 1)

In [10]:
# half mean squared error: sum of squared residuals / (2 * number of samples)
loss = np.sum((y - z2)**2) / (2 * len(z2))

In [11]:
loss

8.749912940692557
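
In equation form, the forward pass and loss computed in the cells above are as follows. Here `n = len(z2)` is the number of samples (10 in this notebook), and the symbol `L` for the loss is introduced only for this summary. No activation function is applied between the two layers, so the model is linear in `x`.

$$
z_1 = x W_1 + b_1, \qquad
z_2 = z_1 W_2 + b_2, \qquad
L = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - z_{2,i} \right)^2
$$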

In [12]:
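# gradient of the loss with respect to z2
# note: the loss above carries a factor 1/2, so its exact gradient is (z2 - y)/len(z2);
# the 2/len(z2) used here is twice that, which only rescales the effective learning rate
dldz2 = (z2 - y)*(2/len(z2))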
dldz2

array([[0.73549705],
       [0.79404644],
       [0.73413312],
       [0.82781968],
       [0.74989001],
       [0.80534645],
       [1.01739876],
       [0.85827613],
       [1.02154031],
       [0.76028097]])

In [32]:
# chain rule through z2 = z1 @ w2 + b2: (5, 10) @ (10, 1) -> (5, 1), same shape as w2
dldw2 = z1.transpose() @ dldz2
dldw2

array([[16.34830403],
       [15.72584442],
       [20.29328326],
       [10.48304998],
       [19.94432344]])

In [28]:
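# b2 is added to every row, so its gradient is the sum over the batch
dldb2 = np.sum(dldz2, keepdims=True)
# propagate the gradient back through the second layer: (10, 1) @ (1, 5) -> (10, 5)
dldz1 = dldz2 @ w2.transpose()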
dldb2

array([[8.30422892]])

In [15]:
dldz1.shape

(10, 5)
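
The matrix expressions in the cells above follow from applying the chain rule to `z2 = z1 @ w2 + b2`, taking the upstream gradient ∂L/∂z2 (`dldz2`) as given. As a sketch, with `i` indexing the rows (samples) of the batch:

$$
\frac{\partial L}{\partial W_2} = z_1^{\top} \frac{\partial L}{\partial z_2}, \qquad
\frac{\partial L}{\partial b_2} = \sum_{i} \left( \frac{\partial L}{\partial z_2} \right)_{i}, \qquad
\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} W_2^{\top}
$$

The shapes line up as (5, 10)(10, 1) = (5, 1) for `dldw2` and (10, 1)(1, 5) = (10, 5) for `dldz1`. The first layer repeats the same pattern with `x` in place of `z1` and `w1` in place of `w2`.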

In [29]:
# same pattern as dldw2, one layer down: (5, 10) @ (10, 5) -> (5, 5), same shape as w1
dldw1 = x.transpose() @ dldz1
dldw1

array([[3.78051605, 1.62842299, 1.7312798 , 3.7068274 , 2.27702759],
       [4.65059966, 2.00320361, 2.12973285, 4.55995161, 2.80108419],
       [3.78718605, 1.63129603, 1.73433431, 3.71336738, 2.28104496],
       [2.68945525, 1.15845844, 1.23163067, 2.63703321, 1.61987509],
       [4.71357049, 2.03032772, 2.15857021, 4.62169502, 2.83901189]])

In [30]:
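# sum over the batch axis; keepdims=True gives shape (1, 5), which broadcasts against b1
dldb1 = np.sum(dldz1, axis=0, keepdims=True)
# gradient with respect to the input (not needed for the parameter update)
dldx = dldz1 @ w1.transpose()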
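
A common way to validate hand-derived gradients is to compare them against finite differences of the loss. The sketch below does this on freshly generated data, so it does not depend on the notebook state. It assumes the loss `sum((y - z2)**2) / (2*n)` together with its exact gradient `(z2 - y) / n`, that is, without the extra factor of 2 in the `dldz2` cell above. The seed, the helper names `loss_fn` and `numeric_grad`, and the random targets are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical fixed seed, for repeatability
x = rng.random((10, 5))
y = rng.random((10, 1))
w1, b1 = rng.random((5, 5)), rng.random(5)
w2, b2 = rng.random((5, 1)), rng.random(1)


def loss_fn(w1, b1, w2, b2):
    """Half mean squared error of the two-layer linear model."""
    z2 = (x @ w1 + b1) @ w2 + b2
    return np.sum((y - z2) ** 2) / (2 * len(z2))


def numeric_grad(f, p, h=1e-5):
    """Central-difference gradient of the scalar function f with respect to array p."""
    grad = np.zeros_like(p)
    for idx in np.ndindex(*p.shape):
        step = np.zeros_like(p)
        step[idx] = h
        grad[idx] = (f(p + step) - f(p - step)) / (2 * h)
    return grad


# Analytic gradients, following the same steps as the cells above.
z1 = x @ w1 + b1
z2 = z1 @ w2 + b2
dldz2 = (z2 - y) / len(z2)
dldw2 = z1.T @ dldz2
dldb2 = dldz2.sum(axis=0)
dldz1 = dldz2 @ w2.T
dldw1 = x.T @ dldz1
dldb1 = dldz1.sum(axis=0)

# Compare each analytic gradient with its numerical estimate.
checks = {
    "w1": np.allclose(numeric_grad(lambda p: loss_fn(p, b1, w2, b2), w1), dldw1, atol=1e-6),
    "b1": np.allclose(numeric_grad(lambda p: loss_fn(w1, p, w2, b2), b1), dldb1, atol=1e-6),
    "w2": np.allclose(numeric_grad(lambda p: loss_fn(w1, b1, p, b2), w2), dldw2, atol=1e-6),
    "b2": np.allclose(numeric_grad(lambda p: loss_fn(w1, b1, w2, p), b2), dldb2, atol=1e-6),
}
print(checks)
```

If any entry prints `False`, the corresponding analytic formula (or its scaling factor) is worth rechecking.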

We have now calculated the gradients of the loss with respect to all the parameters (w2, b2, w1, b1) of our model.
We can use these gradients to update the parameters in the direction that reduces the loss.
We can use the gradient descent optimizer for that:

In [34]:
lr = 0.1
w2 = w2 - lr*dldw2
b2 = b2 - lr*dldb2
w1 = w1 - lr*dldw1
b1 = b1 - lr*dldb1
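
A single update only takes one step. Repeating the forward pass, backward pass, and update in a loop gives a basic training procedure. The following is a minimal, self-contained sketch under some assumptions that are not from the cells above: a fixed seed, 500 steps, a deliberately small learning rate of 0.001 chosen conservatively for this sketch, and the gradient `(z2 - y) / n` that matches the `1/(2n)` loss exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical fixed seed, for repeatability

# Synthetic data with the same shapes and target rule as the dataset cell.
x = rng.random((10, 5))
y = x @ np.array([[0.8], [0.3], [0.75], [0.5], [0.2]]) + 0.43

# Random initialization, uniform in [0, 1) like np.random.rand.
w1, b1 = rng.random((5, 5)), rng.random(5)
w2, b2 = rng.random((5, 1)), rng.random(1)

lr = 0.001       # hypothetical step size for this sketch
losses = []
for step in range(500):
    # forward pass
    z1 = x @ w1 + b1
    z2 = z1 @ w2 + b2
    n = len(z2)
    losses.append(np.sum((y - z2) ** 2) / (2 * n))

    # backward pass (all gradients use the parameters from before the update)
    dldz2 = (z2 - y) / n
    dldw2 = z1.T @ dldz2
    dldb2 = dldz2.sum(axis=0)
    dldz1 = dldz2 @ w2.T
    dldw1 = x.T @ dldz1
    dldb1 = dldz1.sum(axis=0)

    # gradient descent update
    w2 = w2 - lr * dldw2
    b2 = b2 - lr * dldb2
    w1 = w1 - lr * dldw1
    b1 = b1 - lr * dldb1

print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.4f}")
```

Tracking `losses` makes it easy to see whether the chosen learning rate is small enough: if the values grow instead of shrinking, the step size is likely too large.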