# Neural network learning from scratch (Math + Numpy)

This is a major milestone in my self-taught machine learning journey. Working through backpropagation and the calculus behind it was quite an experience. That said, this is about the simplest implementation possible :p.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Normalization
from tensorflow.keras.activations import linear, relu, sigmoid, softmax
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from keras.datasets import mnist

### Objective: Determine a tumor's malignancy given its texture_mean and radius_mean

#### The dataset

In [2]:
df = pd.read_csv("bdiag.csv")[["diagnosis", "texture_mean", "radius_mean"]]
df["diagnosis"] = df["diagnosis"].replace({"M":1, "B":0})
df

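|  | diagnosis | texture_mean | radius_mean |
|---|---|---|---|
| 0 | 1 | 10.38 | 17.99 |
| 1 | 1 | 17.77 | 20.57 |
| 2 | 1 | 21.25 | 19.69 |
| 3 | 1 | 20.38 | 11.42 |
| 4 | 1 | 14.34 | 20.29 |
| ... | ... | ... | ... |
| 564 | 1 | 22.39 | 21.56 |
| 565 | 1 | 28.25 | 20.13 |
| 566 | 1 | 28.08 | 16.60 |
| 567 | 1 | 29.33 | 20.60 |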


### Train-test split

In [3]:
X = df[["texture_mean", "radius_mean"]].values
y = df["diagnosis"].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [4]:
print('X_train: ' + str(X_train.shape))
print('Y_train: ' + str(y_train.shape))
print('X_test:  '  + str(X_test.shape))
print('Y_test:  '  + str(y_test.shape))

X_train: (455, 2)
Y_train: (455, 1)
X_test:  (114, 2)
Y_test:  (114, 1)


### Feature normalization

In [5]:
norm = Normalization(axis=1)
norm.adapt(X_train)

In [6]:
X_train = np.array(norm(X_train))  # convert back to NumPy: the from-scratch loop below indexes one scalar at a time, which is slower on tensors
X_test = np.array(norm(X_test))
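
For reference, here is a minimal NumPy sketch of what I understand the `Normalization` layer to be doing: standardizing each feature with the mean and standard deviation of the training set, then reusing those same statistics on the test set. The `*_demo` arrays are made-up stand-ins for `X_train` and `X_test`, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical stand-ins for X_train / X_test (two features per row).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=[19.0, 14.0], scale=[4.0, 3.5], size=(455, 2))
X_test_demo = rng.normal(loc=[19.0, 14.0], scale=[4.0, 3.5], size=(114, 2))

# Statistics are computed on the training set only ...
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)

# ... and reused unchanged on the test set, so no test information leaks in.
X_train_norm = (X_train_demo - mean) / std
X_test_norm = (X_test_demo - mean) / std

print(X_train_norm.mean(axis=0))  # approximately 0 for each feature
print(X_train_norm.std(axis=0))   # approximately 1 for each feature
```

The test set will generally not come out with exactly zero mean and unit variance, which is expected.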

### The Model overview

![Model overview: 2 input features, a hidden layer of 2 relu neurons, and 1 output neuron](model_nn_scratch.jpg)

## Doing it the "easy" way (using library)

In [7]:
model = Sequential([
    
    Dense(units=2, activation=relu),
    Dense(units=1, activation=linear)
])

model.compile(loss=BinaryCrossentropy(from_logits=True),
              optimizer=Adam(learning_rate=0.001),
              metrics=["accuracy"]  # passed as a list; newer Keras versions reject a bare string
             )
model.fit(X_train, y_train, epochs=30)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x284a16fdaf0>

### Evaluation

In [8]:
model.evaluate(X_test, y_test)



[0.46033743023872375, 0.6491228342056274]
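
Since the model is compiled with `from_logits=True` and ends in a linear layer, its raw outputs are logits rather than probabilities. A small sketch of turning logits into probabilities and classes, using made-up logit values (not actual model output):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits, standing in for the raw outputs of the linear layer.
logits = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])

probs = sigmoid(logits)

# Thresholding the probability at 0.5 is the same as thresholding the logit at 0.
classes_from_probs = (probs > 0.5).astype(int)
classes_from_logits = (logits > 0).astype(int)

print(probs)
print(classes_from_probs, classes_from_logits)
```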

## Doing it the hard way (scratch)

### The math

2 features --> 1 hidden layer (2 neuron relu) --> 1 output layer (1 neuron sigmoid)

#### Forward propagation:

$$z_{1}^{[1]} = w_{1}^{[1]}x_{1} + w_{2}^{[1]}x_{2} + b_{1}^{[1]} $$ <br>
$$z_{2}^{[1]} = w_{3}^{[1]}x_{1} + w_{4}^{[1]}x_{2} + b_{2}^{[1]} $$ <br>
$$a_{1}^{[1]} = \text{relu}(z_{1}^{[1]})$$ <br>
$$a_{2}^{[1]} = \text{relu}(z_{2}^{[1]})$$ <br>
$$z^{[2]} = w_{1}^{[2]}a_{1}^{[1]} + w_{2}^{[2]}a_{2}^{[1]} + b^{[2]} $$ <br>
$$a^{[2]} = \frac{1}{1 + e^{-z^{[2]}}}$$ <br>
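
To make the forward pass concrete, here is a worked example for a single input. All weights, biases and inputs are arbitrary values picked for illustration; the variable names follow the attribute names used in the class further down (`w11` ... `w14` for the hidden layer, `w21`, `w22` for the output layer).

```python
import numpy as np

# Hypothetical parameters and input, chosen only for illustration.
w11, w12, b11 = 0.5, -1.0, 0.1   # hidden neuron 1
w13, w14, b12 = 1.5, 0.5, -0.2   # hidden neuron 2
w21, w22, b2 = 2.0, -1.0, 0.3    # output neuron
x1, x2 = 1.0, 2.0

# Hidden layer: weighted sum, then relu.
z11 = w11 * x1 + w12 * x2 + b11   # about -1.4
z12 = w13 * x1 + w14 * x2 + b12   # about 2.3
a11 = max(z11, 0.0)               # 0.0, the neuron is inactive
a12 = max(z12, 0.0)               # about 2.3

# Output layer: weighted sum of the hidden activations, then sigmoid.
z2 = w21 * a11 + w22 * a12 + b2   # about -2.0
a2 = 1.0 / (1.0 + np.exp(-z2))    # about 0.119

print(z11, z12, a11, a12, z2, a2)
```

Note that the output layer consumes the hidden activations, not the raw inputs.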

#### cost & loss function: <br>
$$\text{cross entropy cost}(\vec{a}^{[2]}, \vec{y}) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log(a^{[2]}_{ij})$$ <br>
$$\text{binary cross entropy cost}(\vec{a}^{[2]}, \vec{y}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_{i}\log(a^{[2]}_{i}) + (1-y_{i})\log(1-a^{[2]}_{i})\right]$$ <br>
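
A quick numeric sketch of the binary cross entropy cost, using three made-up predictions and labels. The function name `binary_cross_entropy_cost` is my own for this example.

```python
import numpy as np

def binary_cross_entropy_cost(a, y):
    """Return (mean cost, per-example losses) for predictions a and labels y."""
    a = np.asarray(a, dtype=float)
    y = np.asarray(y, dtype=float)
    losses = -(y * np.log(a) + (1 - y) * np.log(1 - a))
    return losses.mean(), losses

# Hypothetical predicted probabilities and true labels.
a_demo = [0.9, 0.2, 0.6]
y_demo = [1, 0, 0]

cost, losses = binary_cross_entropy_cost(a_demo, y_demo)
print(losses)  # roughly [0.105, 0.223, 0.916]
print(cost)    # roughly 0.415
```

The confident correct prediction (0.9 for a positive) contributes little, while the wrong-leaning one (0.6 for a negative) dominates the cost.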

##### Derivative of cost function w.r.t. parameters
$$\frac{\partial\,\text{binary cross entropy cost}(\vec{a}^{[2]}, \vec{y})}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\,\text{binary cross entropy loss}(a^{[2]}_{i}, y_{i})}{\partial w}\space \text{for } n=\text{number of examples}$$<br>

$$\text{binary cross entropy loss}(a^{[2]}_{i}, y_{i}) = -(y_{i}\log(a^{[2]}_{i}) + (1-y_{i})\log(1-a^{[2]}_{i}))$$

- The point of backpropagation is to determine the gradient of the cost function with respect to each parameter $w$.
- Intuitively, the gradient of the cost function with respect to a parameter $w$ is the average of the gradients of the loss function with respect to that same $w$, evaluated at each of the $n$ training examples.
- Backpropagation therefore derives the gradient of the loss with respect to each parameter for a single example; these per-example gradients are then averaged for the gradient descent update.
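
The averaging idea in the bullets above can be checked numerically. This sketch uses a deliberately tiny stand-in model with one weight, $a = \text{sigmoid}(wx)$, rather than the full network, and made-up data. It relies on the result $\partial\text{loss}/\partial z = a - y$ that is derived in the backward propagation section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # Binary cross entropy loss of a one-weight model for one example.
    a = sigmoid(w * x)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def cost(w, xs, ys):
    # Cost = average of the per-example losses.
    return np.mean([loss(w, x, y) for x, y in zip(xs, ys)])

# Hypothetical data and weight.
xs = np.array([0.5, -1.2, 2.0, 0.3])
ys = np.array([1, 0, 1, 0])
w = 0.7

# Per-example gradients of the loss w.r.t. w, then their average.
per_example_grads = (sigmoid(w * xs) - ys) * xs
avg_grad = per_example_grads.mean()

# Gradient of the cost itself, estimated by central finite differences.
eps = 1e-6
numeric_grad = (cost(w + eps, xs, ys) - cost(w - eps, xs, ys)) / (2 * eps)

print(avg_grad, numeric_grad)  # the two values should agree closely
```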

#### Backward propagation:

##### output layer
$\frac{\partial \text{cost}(a^{[2]})}{\partial z^{[2]}}=\frac{\partial a^{[2]}}{\partial z^{[2]}}\times\frac{\partial\text{cost}(a^{[2]})}{\partial a^{[2]}} = a^{[2]}(1-a^{[2]})\times \frac{a^{[2]}-y}{a^{[2]}(1-a^{[2]})} = a^{[2]} - y$


$$\frac{\partial\text{cost}(a^{[2]})}{\partial w_{1}^{[2]}} = \frac{\partial z^{[2]}}{\partial w_{1}^{[2]}}\times\frac{\partial \text{cost}(a^{[2]})}{\partial z^{[2]}} = a_{1}^{[1]}\times (a^{[2]} - y)$$

$$\frac{\partial\text{cost}(a^{[2]})}{\partial w_{2}^{[2]}} = \frac{\partial z^{[2]}}{\partial w_{2}^{[2]}}\times\frac{\partial \text{cost}(a^{[2]})}{\partial z^{[2]}} = a_{2}^{[1]}\times (a^{[2]} - y)$$

$$\frac{\partial\text{cost}(a^{[2]})}{\partial b^{[2]}} = \frac{\partial z^{[2]}}{\partial b^{[2]}}\times\frac{\partial \text{cost}(a^{[2]})}{\partial z^{[2]}} = a^{[2]} - y$$
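
As a sanity check on the output layer result, the sketch below multiplies the two chain rule factors (the sigmoid derivative $a(1-a)$ and the loss derivative $\frac{a-y}{a(1-a)}$) and compares the product against $a - y$ and against a finite-difference estimate. The $(z, y)$ pairs are arbitrary test values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_from_z(z, y):
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

eps = 1e-6
max_error = 0.0
for z, y in [(-1.5, 1), (0.2, 0), (2.0, 1)]:  # arbitrary test points
    a = sigmoid(z)
    da_dz = a * (1 - a)                  # derivative of sigmoid
    dloss_da = (a - y) / (a * (1 - a))   # derivative of the loss w.r.t. a
    chain = da_dz * dloss_da             # chain rule product
    numeric = (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)
    max_error = max(max_error, abs(chain - (a - y)), abs(numeric - (a - y)))
    print(f"z={z}, y={y}: chain={chain:.6f}, a-y={a - y:.6f}, numeric={numeric:.6f}")
```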


##### hidden layer
$\frac{\partial \text{cost}(a^{[2]})}{\partial z_{1}^{[1]}}=\frac{\partial a_{1}^{[1]}}{\partial z_{1}^{[1]}}\times\frac{\partial z^{[2]}}{\partial a_{1}^{[1]}}\times\frac{\partial \text{cost}(a^{[2]})}{\partial z^{[2]}} = \text{relu'}(z_1^{[1]}) \times w_1^{[2]}\times (a^{[2]} - y)$ <br>

$$\frac{\partial\text{cost}(a^{[2]})}{\partial w_{1}^{[1]}} = \frac{\partial z_{1}^{[1]}}{\partial w_{1}^{[1]}}\times \frac{\partial \text{cost}(a^{[2]})}{\partial z_{1}^{[1]}} = x_1\times \text{relu'}(z_1^{[1]}) \times w_1^{[2]}\times (a^{[2]} - y)$$
$$\frac{\partial\text{cost}(a^{[2]})}{\partial w_{2}^{[1]}} = \frac{\partial z_{1}^{[1]}}{\partial w_{2}^{[1]}}\times \frac{\partial \text{cost}(a^{[2]})}{\partial z_{1}^{[1]}} = x_2\times \text{relu'}(z_1^{[1]}) \times w_1^{[2]}\times (a^{[2]} - y)$$ 
$$\frac{\partial\text{cost}(a^{[2]})}{\partial b_{1}^{[1]}} = \frac{\partial z_{1}^{[1]}}{\partial b_{1}^{[1]}}\times \frac{\partial \text{cost}(a^{[2]})}{\partial z_{1}^{[1]}} = \text{relu'}(z_1^{[1]}) \times w_1^{[2]}\times (a^{[2]} - y)$$ 
<br>

$\frac{\partial \text{cost}(a^{[2]})}{\partial z_{2}^{[1]}}=\frac{\partial a_{2}^{[1]}}{\partial z_{2}^{[1]}}\times\frac{\partial z^{[2]}}{\partial a_{2}^{[1]}}\times\frac{\partial \text{cost}(a^{[2]})}{\partial z^{[2]}} = \text{relu'}(z_2^{[1]}) \times w_2^{[2]}\times (a^{[2]} - y)$ <br>

$$\frac{\partial\text{cost}(a^{[2]})}{\partial w_{3}^{[1]}} = \frac{\partial z_{2}^{[1]}}{\partial w_{3}^{[1]}}\times \frac{\partial \text{cost}(a^{[2]})}{\partial z_{2}^{[1]}} = x_1\times \text{relu'}(z_2^{[1]}) \times w_2^{[2]}\times (a^{[2]} - y)$$
$$\frac{\partial\text{cost}(a^{[2]})}{\partial w_{4}^{[1]}} = \frac{\partial z_{2}^{[1]}}{\partial w_{4}^{[1]}}\times \frac{\partial \text{cost}(a^{[2]})}{\partial z_{2}^{[1]}} = x_2\times \text{relu'}(z_2^{[1]}) \times w_2^{[2]}\times (a^{[2]} - y)$$ 
$$\frac{\partial\text{cost}(a^{[2]})}{\partial b_{2}^{[1]}} = \frac{\partial z_{2}^{[1]}}{\partial b_{2}^{[1]}}\times \frac{\partial \text{cost}(a^{[2]})}{\partial z_{2}^{[1]}} = \text{relu'}(z_2^{[1]}) \times w_2^{[2]}\times (a^{[2]} - y)$$ 
<br>
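
A handy way to double-check hand-derived gradients like these is a finite-difference gradient check. The sketch below compares the analytic formulas for all nine parameters against numerical estimates on one example. The parameter values and the input are arbitrary (picked so that both hidden neurons are active), and the helper names are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(p, x1, x2):
    z11 = p["w11"] * x1 + p["w12"] * x2 + p["b11"]
    z12 = p["w13"] * x1 + p["w14"] * x2 + p["b12"]
    a11, a12 = max(z11, 0.0), max(z12, 0.0)
    a2 = sigmoid(p["w21"] * a11 + p["w22"] * a12 + p["b2"])
    return z11, z12, a11, a12, a2

def loss(p, x1, x2, y):
    a2 = forward(p, x1, x2)[-1]
    return -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

def analytic_grads(p, x1, x2, y):
    # Direct transcription of the backward propagation formulas.
    z11, z12, a11, a12, a2 = forward(p, x1, x2)
    error = a2 - y
    dz11 = (1.0 if z11 > 0 else 0.0) * p["w21"] * error
    dz12 = (1.0 if z12 > 0 else 0.0) * p["w22"] * error
    return {"w21": a11 * error, "w22": a12 * error, "b2": error,
            "w11": x1 * dz11, "w12": x2 * dz11, "b11": dz11,
            "w13": x1 * dz12, "w14": x2 * dz12, "b12": dz12}

def numeric_grads(p, x1, x2, y, eps=1e-6):
    # Central finite differences, one parameter at a time.
    grads = {}
    for name in p:
        plus, minus = dict(p), dict(p)
        plus[name] += eps
        minus[name] -= eps
        grads[name] = (loss(plus, x1, x2, y) - loss(minus, x1, x2, y)) / (2 * eps)
    return grads

# Hypothetical parameters and a single made-up example.
params = {"w11": 0.5, "w12": 0.3, "b11": 0.1,
          "w13": -0.4, "w14": 0.8, "b12": 0.2,
          "w21": 1.2, "w22": -0.7, "b2": 0.05}
analytic = analytic_grads(params, 1.0, 2.0, 1)
numeric = numeric_grads(params, 1.0, 2.0, 1)
max_diff = max(abs(analytic[k] - numeric[k]) for k in params)
print(f"largest analytic vs numeric difference: {max_diff:.2e}")
```

The check is unreliable when a hidden pre-activation sits very close to 0, because relu is not differentiable there.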

### No vectorization and non-modular implementation

This implementation is written for learning, not for production use. It is deliberately explicit and avoids vectorization. It is also non-modular: it only supports the architecture described above (2 input features, 2 hidden relu neurons, 1 sigmoid output).

In [9]:
class NeuralNetwork():
    
    def initialize(self):
        self.w11, self.w12, self.w13, self.w14, self.w21, self.w22, self.b11, self.b12, self.b2 = np.random.randn(9)
    
    def relu(self, x):
        return np.maximum(x, 0)
    
    def sigmoid(self, x):
        return 1/(1+np.exp(-x))
    
    def binary_cross_entropy(self, x, y_true):
        return -(y_true * np.log(x) + (1 - y_true) * np.log(1 - x))
    
    def derivative_relu(self, x):
        return 1 if x>0 else 0
    
    def forward_prop(self, x1, x2):
        forward_dict = {} 
        
        # hidden layer
        forward_dict["z11"] = self.w11 * x1 + self.w12 * x2 + self.b11
        forward_dict["z12"] = self.w13 * x1 + self.w14 * x2 + self.b12
        forward_dict["a11"] = self.relu(forward_dict["z11"])
        forward_dict["a12"] = self.relu(forward_dict["z12"])
        
        # output layer
        forward_dict["z2"] = self.w21 * forward_dict["a11"] + self.w22 * forward_dict["a12"] + self.b2
        forward_dict["a2"] = self.sigmoid(forward_dict["z2"])
        
        return forward_dict

    def back_prop(self, x1, x2, y_true, forward_dict):
        # gradients are computed for a single training example here and averaged later in fit()
        
        deriva_dict = {}
        
        # output layer
        error = forward_dict["a2"] - y_true
        
        deriva_dict["dloss_dw21"] = forward_dict["a11"] * error
        deriva_dict["dloss_dw22"] = forward_dict["a12"] * error
        deriva_dict["dloss_db2"] = error
        
        # hidden layer
        dcost_dz11 = self.derivative_relu(forward_dict["z11"]) * self.w21 * error
        dcost_dz12 = self.derivative_relu(forward_dict["z12"]) * self.w22 * error
        
        deriva_dict["dloss_dw11"] = x1 * dcost_dz11
        deriva_dict["dloss_dw12"] = x2 * dcost_dz11
        deriva_dict["dloss_db11"] = dcost_dz11
        
        deriva_dict["dloss_dw13"] = x1 * dcost_dz12
        deriva_dict["dloss_dw14"] = x2 * dcost_dz12
        deriva_dict["dloss_db12"] = dcost_dz12
        
        return deriva_dict
    
    def get_prob_class(self, x1, x2):
        prob = self.forward_prop(x1, x2)["a2"]
        class_ = self.encode_class(prob)
        return prob, class_
    
    def encode_class(self, x):
        return 1 if x>0.5 else 0
    
    def get_accuracy_cost(self, X, y):
        n = X.shape[0]
        n_correct = 0
        cost = 0  # accumulates the summed (not averaged) loss over all examples
        for i in range(n):
            x1, x2 = X[i]
            prediction = self.get_prob_class(x1, x2)
            if prediction[1] == y[i][0]:
                n_correct += 1
            cost += self.binary_cross_entropy(prediction[0], y[i][0])
        return n_correct/n, cost

    def fit(self, X, y, alpha, epochs):
        n, m = X.shape
        
        self.initialize()
        costs = []

        for i in range(1, epochs+1):
            dcost_dw21 = 0
            dcost_dw22 = 0
            dcost_db2 = 0
            dcost_dw11 = 0
            dcost_dw12 = 0
            dcost_db11 = 0
            dcost_dw13 = 0
            dcost_dw14 = 0
            dcost_db12 = 0
            
            for j in range(n):
                x1 , x2 = X[j]
                y_true = y[j][0]
                forward_prop = self.forward_prop(x1, x2)
                back_prop = self.back_prop(x1, x2, y_true, forward_prop)
                
                dcost_dw21 += back_prop["dloss_dw21"]
                dcost_dw22 += back_prop["dloss_dw22"]
                dcost_db2 += back_prop["dloss_db2"]
                dcost_dw11 += back_prop["dloss_dw11"]
                dcost_dw12 += back_prop["dloss_dw12"]
                dcost_db11 += back_prop["dloss_db11"]
                dcost_dw13 += back_prop["dloss_dw13"]
                dcost_dw14 += back_prop["dloss_dw14"]
                dcost_db12 += back_prop["dloss_db12"]
                    
            self.w21 -= alpha*(1/n)*dcost_dw21 
            self.w22 -= alpha*(1/n)*dcost_dw22 
            self.b2  -= alpha*(1/n)*dcost_db2
            self.w11 -= alpha*(1/n)*dcost_dw11 
            self.w12 -= alpha*(1/n)*dcost_dw12 
            self.b11 -= alpha*(1/n)*dcost_db11
            self.w13 -= alpha*(1/n)*dcost_dw13
            self.w14 -= alpha*(1/n)*dcost_dw14
            self.b12 -= alpha*(1/n)*dcost_db12
            
            accuracy, cost = self.get_accuracy_cost(X, y)
            costs.append(cost)
            
            print(f"epoch: {i}, cost: {cost:.2f}, accuracy: {accuracy*100:.2f}%")
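
For contrast, here is a rough sketch of how the same 2 -> 2 (relu) -> 1 (sigmoid) network could be trained with vectorized NumPy operations instead of the per-example loop. It is not the class above, just one possible rewrite: the function name `fit_vectorized`, the matrix layout (`W1`, `b1`, `W2`, `b2`), the seed and the synthetic demo data are all assumptions of this example. It also records the mean cost rather than the summed one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_vectorized(X, y, alpha=0.1, epochs=100, seed=0):
    """Full-batch gradient descent; X has shape (n, 2), y has shape (n, 1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W1 = rng.standard_normal((2, 2))  # column j holds the weights of hidden neuron j
    b1 = rng.standard_normal((1, 2))
    W2 = rng.standard_normal((2, 1))
    b2 = rng.standard_normal((1, 1))
    costs = []

    for _ in range(epochs):
        # Forward pass for all examples at once.
        Z1 = X @ W1 + b1            # (n, 2)
        A1 = np.maximum(Z1, 0)      # relu
        Z2 = A1 @ W2 + b2           # (n, 1)
        A2 = sigmoid(Z2)

        # Mean binary cross entropy before this epoch's update (clipped to avoid log(0)).
        A2_safe = np.clip(A2, 1e-12, 1 - 1e-12)
        costs.append(float(-np.mean(y * np.log(A2_safe) + (1 - y) * np.log(1 - A2_safe))))

        # Backward pass: same formulas as the scalar version, averaged over n.
        dZ2 = A2 - y                          # (n, 1)
        dW2 = A1.T @ dZ2 / n
        db2 = dZ2.mean(axis=0, keepdims=True)
        dZ1 = (dZ2 @ W2.T) * (Z1 > 0)         # relu' is 1 where Z1 > 0, else 0
        dW1 = X.T @ dZ1 / n
        db1 = dZ1.mean(axis=0, keepdims=True)

        # Gradient descent update.
        W1 -= alpha * dW1
        b1 -= alpha * db1
        W2 -= alpha * dW2
        b2 -= alpha * db2

    return (W1, b1, W2, b2), costs

# Synthetic, made-up data just to exercise the function.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float).reshape(-1, 1)

params, costs = fit_vectorized(X_demo, y_demo)
print(f"first cost: {costs[0]:.3f}, last cost: {costs[-1]:.3f}")
```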

### Training

In [10]:
classifier = NeuralNetwork()
classifier.fit(X_train, y_train, 0.1, 100)

epoch: 1, cost: 1538.06, accuracy: 37.58%
epoch: 2, cost: 1408.38, accuracy: 37.58%
epoch: 3, cost: 1289.91, accuracy: 37.58%
epoch: 4, cost: 1180.47, accuracy: 37.58%
epoch: 5, cost: 1078.23, accuracy: 37.58%
epoch: 6, cost: 981.71, accuracy: 37.58%
epoch: 7, cost: 889.76, accuracy: 37.58%
epoch: 8, cost: 801.73, accuracy: 37.58%
epoch: 9, cost: 717.54, accuracy: 37.58%
epoch: 10, cost: 637.78, accuracy: 37.58%
epoch: 11, cost: 563.64, accuracy: 37.80%
epoch: 12, cost: 496.53, accuracy: 40.44%
epoch: 13, cost: 437.57, accuracy: 43.96%
epoch: 14, cost: 387.07, accuracy: 50.33%
epoch: 15, cost: 344.58, accuracy: 55.38%
epoch: 16, cost: 309.22, accuracy: 60.66%
epoch: 17, cost: 279.88, accuracy: 64.40%
epoch: 18, cost: 255.56, accuracy: 68.79%
epoch: 19, cost: 235.33, accuracy: 73.19%
epoch: 20, cost: 218.39, accuracy: 74.29%
epoch: 21, cost: 204.14, accuracy: 76.70%
epoch: 22, cost: 192.12, accuracy: 79.56%
epoch: 23, cost: 181.93, accuracy: 80.88%
epoch: 24, cost: 173.25, accuracy: 81.

### Evaluation

In [11]:
classifier.get_accuracy_cost(X_test, y_test)

(0.9122807017543859, 35.2690970777859)
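
One caveat when comparing this with the Keras result: `get_accuracy_cost` returns the loss summed over all examples, while the Keras loss is an average. Dividing by the 114 test examples gives a per-example figure:

```python
summed_cost = 35.2690970777859  # value returned by get_accuracy_cost above
n_test = 114                    # number of test examples

mean_cost = summed_cost / n_test
print(round(mean_cost, 4))  # 0.3094
```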

#### not bad at all :p