<a href="https://www.kaggle.com/code/mohamedyosef101/mnist-neural-network-with-numpy?scriptVersionId=155164406" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <span style="font-weight: bold; color: #00773E;">Building a Neural Network with NumPy</span>
We will be using NumPy to build the neural network instead of TensorFlow or PyTorch. This will help you and me understand the fundamentals of neural network architectures and backpropagation from scratch. 

**PS.** 
*As usual, you can find more resources at the end of the notebook.*

<hr style="background-color: #00773E; border: 0px; height: 12px;">

# <span style="color: #00773E;">Step 0.</span><span style="font-weight: bold; color: #BD0000;"> Set it up</span>


In [1]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# customize the style
pd.options.display.float_format = '{:.5f}'.format
pd.options.display.max_rows = 12

# load the data
filepath = '/kaggle/input/digit-recognizer/train.csv'
df = pd.read_csv(filepath)

df.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<hr style="background-color: #00773E; border: 0px; height: 12px;">

# <span style="color: #BD0000;">Step 1.</span>**<span style="color: #00773E"> Move to NumPy & Shuffling.</span>**

### **Now, I <span style="color: #bd0000">do not AGREE</span> with using pandas**
So, we will go one step further than traditional ML and "convert" the data to a NumPy array to do some fancy linear algebra 😎.

**The PS.** *There is a great linear algebra course in the resources*. Just remember to check it out!

In [2]:
# put the dataset in a numpy array
data = np.array(df)

# computing the shape 
m, n = data.shape
# NOTE: 
# m is the number of rows, n is the number of columns

print("No. of rows(m) =", m, "\nNo. of columns(n) =", n)

# shuffling the data before splitting
np.random.shuffle(data)

No. of rows(m) = 42000 
No. of columns(n) = 785


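**<span style="color: #00773E; font-size: 130%;">Shuffling</span>** randomly reorders the samples in the data. If the file happens to be sorted or grouped in some way (for example, by label), a split taken from the unshuffled array would give training and validation sets with different class mixes. Shuffling first makes both sets representative of the whole dataset.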

#### **<span style="color: #bd0000">Why did I compute the shape before shuffling?</span>**
Shuffling only reorders the rows, so the shape is the same before and after. I computed m and n up front simply because we need them later to slice the data when splitting. <span style="color: #FFFFFF; background-color: #00373e; border-radius: 4px; padding: 2px 4px;">The shape alone does not make the shuffle reversible; for that you would have to keep the permutation that was applied.</span>
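
To make the shuffling step concrete, here is a minimal sketch on a toy array. The names `toy`, `rng`, and `perm` are illustrative and not part of this notebook, and the seeded generator is an assumption added so the run is repeatable (the notebook itself calls `np.random.shuffle`). It shows that shuffling reorders rows without changing the shape, and that undoing it requires the permutation itself.

```python
import numpy as np

# Toy stand-in for the dataset: 5 samples (rows), 3 columns.
toy = np.arange(15).reshape(5, 3)

# Seeded generator (an assumption, added for reproducibility).
rng = np.random.default_rng(0)

# Draw a permutation of the row indices and apply it.
perm = rng.permutation(toy.shape[0])
shuffled = toy[perm]

# Shuffling reorders rows but never changes the shape.
print(toy.shape, shuffled.shape)

# Undoing the shuffle needs the permutation, not the shape:
# row i of `shuffled` came from row perm[i] of `toy`.
restored = np.empty_like(shuffled)
restored[perm] = shuffled
print(np.array_equal(restored, toy))
```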

<hr style="background-color: #00773E; border: 0px; height: 12px;">

# <span style="color: #BD0000">Step 2.</span> **<span style="color: #00773e;">Splitting data</span>**
Before splitting, I want you to know why we do it in the first place.

So, the simple answer is that **the validation set stands in for new, real-world data, which lets us experiment and tune rigorously without touching the final test set.**

In [3]:
# splitting data into 75% training and 25% validation

# 25% for validation
val_data = data[0: 10500].T
y_val = val_data[0] # first row (after the transpose) holds the labels
X_val = val_data[1:n] # the remaining rows are the pixel features

# 75% for training
train_data = data[10500: m].T
y_train = train_data[0]
X_train = train_data[1: n]

# see the new shapes
print(train_data.shape, y_train.shape)

(785, 31500) (31500,)


You can see that there is a **T** at `val_data = data[0: 10500].T`, which transposes the data so that each sample becomes a column and each feature/attribute becomes a row. That is why `train_data` has the shape `(785, 31500)`.


![transpose](https://allinpython.com/wp-content/uploads/2022/09/A-7-1024x565.png)
*image from [allinpython.com](https://allinpython.com/transpose-of-a-matrix-in-python-with-user-input/)*

<div><br></div>

### **<span style="color: #bd0000">Why did I do it?</span>**
I did it just to make extracting the y and X from the original data easier and to show you the concept of transpose. *But it's okay to do the splitting in your preferred way.*
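
As a small illustration of what the transpose does to the layout, here is a sketch on a made-up array with four samples, where column 0 is the label and the other two columns are features. The array `toy` and its values are hypothetical. It also shows that slicing columns without transposing first gives the same labels and features.

```python
import numpy as np

# Toy dataset: 4 samples (rows); column 0 is the label, columns 1-2 are features.
toy = np.array([
    [7, 10, 11],
    [3, 20, 21],
    [7, 30, 31],
    [1, 40, 41],
])

# After transposing, each column is one sample and each row is one field.
toy_T = toy.T
y = toy_T[0]     # row 0 now holds every label
X = toy_T[1:]    # remaining rows hold the features, one column per sample

print(toy_T.shape, y, X.shape)

# Equivalent extraction without transposing the whole array first.
y_alt = toy[:, 0]
X_alt = toy[:, 1:].T
print(np.array_equal(y, y_alt), np.array_equal(X, X_alt))
```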

<hr style="background-color: #00773E; border: 0px; height: 12px;">

# <span style="color: #BD0000">Step 3.</span> **<span style="color: #00773e;">Data Normalization</span>**
Normalization puts all the input features on the same scale. Here the pixel values range from 0 to 255, and we rescale them to the range 0 to 1. Inputs on a small, consistent scale keep the pre-activations and gradients in a reasonable range, which makes training with gradient descent more stable and usually faster.

In [4]:
# Find the min and max values so we know what scale to apply
min_val = data.min()
max_val = data.max()

print(min_val, max_val)

0 255


In [5]:
# Now let's rescale the pixel values to the range [0, 1]
X_val = X_val / 255
X_train = X_train / 255

# check if the normalisation worked
X_val.max()

1.0

I essentially divided the training set into training and validation sets and applied the scaling by dividing the data values by the <code style="background:#00373E; color: #E3EEFC; border-radius: 4px;">max_val</code>, which is <code style="background:#00373E; color: #E3EEFC; border-radius: 4px;">255</code>.
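
Dividing by 255 is a special case of min-max scaling in which the minimum is 0. A minimal sketch of the general form follows; the helper name `min_max_scale` and the sample pixel values are made up for illustration.

```python
import numpy as np

def min_max_scale(x, lo, hi):
    """Map values from the range [lo, hi] to the range [0, 1]."""
    return (x - lo) / (hi - lo)

# A few hypothetical pixel intensities.
pixels = np.array([0, 64, 128, 255])
scaled = min_max_scale(pixels, 0, 255)

print(scaled.min(), scaled.max())

# With lo = 0 this is exactly the division used in the notebook.
print(np.allclose(scaled, pixels / 255))
```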

<hr style="background-color: #00773E; border: 0px; height: 12px;">

# <span style="color: #BD0000">Step 4.</span> **<span style="color: #00773e;">Model Building from scratch.</span>**
Building a neural network model from scratch using NumPy is a great way to really understand what's going on under the hood.

### 4.1. **Initializers**
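This function initializes the weight matrices (W1, W2) and bias vectors (b1, b2) with random values drawn from a normal distribution and scaled down. The shapes define the network: 784 inputs, a hidden layer of 10 units, and 10 outputs.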


In [6]:
def init_params():
    W1 = np.random.normal(size=(10, 784)) * np.sqrt(1./(784))
    b1 = np.random.normal(size=(10, 1)) * np.sqrt(1./10)
    W2 = np.random.normal(size=(10, 10)) * np.sqrt(1./20)
    b2 = np.random.normal(size=(10, 1)) * np.sqrt(1./(784))
    return W1, b1, W2, b2
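
To see why a factor like `np.sqrt(1./784)` is useful, the sketch below compares the spread of `W @ X` with and without it. This is an illustration under stated assumptions: the inputs are unit-variance random numbers rather than MNIST pixels, and the names `W_raw`, `W_scaled`, `std_raw`, and `std_scaled` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, hidden, samples = 784, 10, 1000

# Toy inputs with unit variance (an assumption; real pixels lie in [0, 1]).
X = rng.normal(size=(fan_in, samples))

# Standard-normal weights, with and without the 1/sqrt(fan_in) factor.
W_raw = rng.normal(size=(hidden, fan_in))
W_scaled = W_raw * np.sqrt(1.0 / fan_in)

# Each pre-activation is a sum of 784 terms, so without scaling its
# standard deviation grows like sqrt(784) = 28.
std_raw = (W_raw @ X).std()
std_scaled = (W_scaled @ X).std()
print(f"unscaled std: {std_raw:.2f}, scaled std: {std_scaled:.2f}")
```

Keeping the pre-activations near unit scale helps avoid a saturated softmax and very large gradients in the first iterations.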

### 4.2. **Activation functions**
* **ReLU** thresholds inputs at 0 to introduce nonlinearity.
* **Softmax** squashes outputs to probability-like values that sum to 1. *Used for multi-class classification.*

In [7]:
def ReLU(Z):
    return np.maximum(Z, 0)

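def softmax(Z):
    # subtract the column-wise max for numerical stability (the result is unchanged)
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    # normalize each column (one sample) so that it sums to 1
    return expZ / np.sum(expZ, axis=0, keepdims=True)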
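
Here is a small, self-contained sketch of both activations on a made-up matrix where each column is one sample. The helper names `relu` and `softmax_columns` are hypothetical. The second column holds very large values to show why subtracting the column maximum before `np.exp` is a common safeguard: it leaves the probabilities unchanged but avoids overflow.

```python
import numpy as np

def relu(z):
    # Negative entries become 0, positive entries pass through.
    return np.maximum(z, 0)

def softmax_columns(z):
    # Shift each column by its max so the largest exponent is exp(0) = 1.
    shifted = z - z.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

# 3 classes (rows) x 2 samples (columns); values are made up.
Z = np.array([[ 1.0, 1000.0],
              [-2.0, 1000.0],
              [ 0.5,  999.0]])

print(relu(Z[:, 0]))

probs = softmax_columns(Z)
print(probs.sum(axis=0))  # each column sums to 1
```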

### 4.3. **Forward Propagation**
Performs the forward pass through the network. It takes in the parameters, input data X, calculates the linear combinations with weights/biases, applies activations, and returns activations & pre-activation values for backprop.

In [8]:
def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1
    A1 = ReLU(Z1)
    Z2 = W2.dot(A1) + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2
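
It can help to trace the array shapes through the forward pass. The sketch below uses a hypothetical batch of 5 random "images" and zero biases purely to show the shapes; it re-implements the steps inline so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 5  # hypothetical number of samples

X = rng.random((784, batch))            # one column per sample
W1 = rng.normal(size=(10, 784))
b1 = np.zeros((10, 1))
W2 = rng.normal(size=(10, 10))
b2 = np.zeros((10, 1))

Z1 = W1 @ X + b1                        # (10, 784) @ (784, 5) -> (10, 5); b1 broadcasts over columns
A1 = np.maximum(Z1, 0)                  # ReLU keeps the shape
Z2 = W2 @ A1 + b2                       # (10, 10) @ (10, 5) -> (10, 5)
E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = E / E.sum(axis=0, keepdims=True)   # softmax per column

for name, arr in [("Z1", Z1), ("A1", A1), ("Z2", Z2), ("A2", A2)]:
    print(name, arr.shape)
```

Every intermediate result has one column per sample, which is why the biases are stored with shape `(10, 1)`: broadcasting adds them to every column.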

### 4.4. **Backward Propagation**
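This step performs the backward pass. Starting from the prediction error (the softmax output minus the one-hot labels), it applies the chain rule layer by layer to compute the gradient of the loss with respect to each parameter.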
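
For reference, the gradients of a two-layer network with a softmax output can be written as below. This assumes a mean cross-entropy loss, which the notebook never names explicitly, although `dZ2 = A2 - one_hot_Y` is the gradient for exactly that pairing. Here $m$ is the number of samples in the batch, $Y$ is the one-hot label matrix, $\odot$ is the element-wise product, and the sums run over the sample columns.

```latex
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \frac{1}{m}\, dZ^{[2]} \left(A^{[1]}\right)^{\top} \\
db^{[2]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[2]}_{:,\,i} \\
dZ^{[1]} &= \left(W^{[2]}\right)^{\top} dZ^{[2]} \odot \mathrm{ReLU}'\!\left(Z^{[1]}\right) \\
dW^{[1]} &= \frac{1}{m}\, dZ^{[1]} X^{\top} \\
db^{[1]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[1]}_{:,\,i}
\end{aligned}
```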

In [9]:
# The derivative of ReLU (to be used in backpropagation).
def ReLU_deriv(Z):
    return Z > 0

# Converts a vector of class index labels Y 
# to a one-hot encoded matrix.
def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    one_hot_Y = one_hot_Y.T
    return one_hot_Y

# The BACKWARD propagation
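def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = Y.size  # number of samples in this batch (not the global row count)
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y
    dW2 = 1 / m * dZ2.dot(A1.T)
    # sum over the samples (axis=1) so each unit gets its own bias gradient
    db2 = 1 / m * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = W2.T.dot(dZ2) * ReLU_deriv(Z1)
    dW1 = 1 / m * dZ1.dot(X.T)
    db1 = 1 / m * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2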
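
A common way to sanity-check backpropagation is to compare the analytic gradients against finite differences. The sketch below does this on a tiny network with made-up sizes and random data. It is self-contained rather than calling the notebook's functions, it assumes a mean cross-entropy loss, and the names `forward`, `loss`, and `numeric_grad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, m = 6, 4, 3, 8      # tiny hypothetical sizes

X = rng.random((n_in, m))
Y = rng.integers(0, n_out, size=m)
Y_onehot = np.eye(n_out)[Y].T           # shape (n_out, m)

W1 = rng.normal(size=(n_hid, n_in))
b1 = rng.normal(size=(n_hid, 1))
W2 = rng.normal(size=(n_out, n_hid))
b2 = rng.normal(size=(n_out, 1))

def forward(W1, b1, W2, b2):
    Z1 = W1 @ X + b1
    A1 = np.maximum(Z1, 0)
    Z2 = W2 @ A1 + b2
    E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
    return Z1, A1, E / E.sum(axis=0, keepdims=True)

def loss(W1, b1, W2, b2):
    # Mean cross-entropy over the m samples (assumed loss).
    A2 = forward(W1, b1, W2, b2)[2]
    return -np.sum(Y_onehot * np.log(A2)) / m

# Analytic gradients, with bias gradients summed over samples per unit.
Z1, A1, A2 = forward(W1, b1, W2, b2)
dZ2 = A2 - Y_onehot
dW2 = dZ2 @ A1.T / m
db2 = dZ2.sum(axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ X.T / m
db1 = dZ1.sum(axis=1, keepdims=True) / m

def numeric_grad(param, eps=1e-6):
    # Central differences: nudge one entry at a time, in place, then restore it.
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = loss(W1, b1, W2, b2)
        param[idx] = old - eps
        minus = loss(W1, b1, W2, b2)
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

max_err = max(
    np.abs(numeric_grad(p) - g).max()
    for p, g in [(W1, dW1), (b1, db1), (W2, dW2), (b2, db2)]
)
print(f"largest gap between numeric and analytic gradients: {max_err:.2e}")
```

A gap close to zero suggests the formulas are consistent with the loss. A large gap in one parameter usually points to a shape or axis mistake in that gradient.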

### 4.5. **Final touches**
1. **Updating the parameters** by applying the gradients, using a learning rate alpha.

2. **Getting class predictions** by taking the argmax of outputs.

3. **Comparing predictions** to true labels to calculate classification accuracy.

4. Finally, **put it all together** to iteratively train the network with gradient descent: compute the gradients, update the parameters, and repeat for the given number of iterations.

In [10]:
# Updating the parameters
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1    
    W2 = W2 - alpha * dW2  
    b2 = b2 - alpha * db2    
    return W1, b1, W2, b2

# Class Predictions
def get_predictions(A2):
    return np.argmax(A2, 0)

# Accuracy for evaluation
def get_accuracy(predictions, Y):
    print(predictions, Y)
    return np.sum(predictions == Y) / Y.size

# Put it all together
def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params()
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:
            print("Iteration: ", i)
            predictions = get_predictions(A2)
            print(get_accuracy(predictions, Y))
    return W1, b1, W2, b2
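
To illustrate the prediction and accuracy steps from the list above, here is a tiny example with a made-up softmax output for 3 classes and 4 samples. The values of `A2` and `Y` are hypothetical.

```python
import numpy as np

# Hypothetical softmax output: 3 classes (rows) x 4 samples (columns).
A2 = np.array([
    [0.7, 0.1, 0.2, 0.3],
    [0.2, 0.8, 0.3, 0.3],
    [0.1, 0.1, 0.5, 0.4],
])
Y = np.array([0, 1, 2, 0])                # true labels

predictions = np.argmax(A2, axis=0)       # most probable class in each column
accuracy = np.mean(predictions == Y)      # fraction of correct predictions
print(predictions, accuracy)
```

Three of the four predictions match the labels, so the accuracy is 0.75.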

<hr style="background-color: #00773E; border: 0px; height: 12px;">

# <span style="color: #BD0000">Step 5.</span> **<span style="color: #00773e;">Starting training and evaluating</span>**

In [11]:
# Put the functions into action
W1, b1, W2, b2 = gradient_descent(X_train, y_train, 0.10, 500)

Iteration:  0
[2 2 2 ... 2 2 2] [4 9 2 ... 8 7 9]
0.1006984126984127
Iteration:  10
[2 2 2 ... 2 2 2] [4 9 2 ... 8 7 9]
0.16984126984126985
Iteration:  20
[2 7 2 ... 2 2 2] [4 9 2 ... 8 7 9]
0.26587301587301587
Iteration:  30
[2 7 2 ... 5 7 2] [4 9 2 ... 8 7 9]
0.3797142857142857
Iteration:  40
[2 7 2 ... 5 7 2] [4 9 2 ... 8 7 9]
0.4914603174603175
Iteration:  50
[2 7 2 ... 3 7 7] [4 9 2 ... 8 7 9]
0.556
Iteration:  60
[2 7 2 ... 3 7 7] [4 9 2 ... 8 7 9]
0.594
Iteration:  70
[2 7 2 ... 3 7 7] [4 9 2 ... 8 7 9]
0.6471111111111111
Iteration:  80
[2 7 2 ... 3 7 9] [4 9 2 ... 8 7 9]
0.697047619047619
Iteration:  90
[4 7 2 ... 3 7 9] [4 9 2 ... 8 7 9]
0.7260634920634921
Iteration:  100
[4 7 2 ... 3 7 9] [4 9 2 ... 8 7 9]
0.7520952380952381
Iteration:  110
[4 7 2 ... 3 7 9] [4 9 2 ... 8 7 9]
0.7702539682539683
Iteration:  120
[4 9 2 ... 3 7 9] [4 9 2 ... 8 7 9]
0.784952380952381
Iteration:  130
[4 9 2 ... 3 7 9] [4 9 2 ... 8 7 9]
0.7964444444444444
Iteration:  140
[4 9 2 ... 3 7 9] [4 9 2 ..

~90% accuracy on training set.
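
The number above is measured on the training set. To evaluate on the held-out validation set, one option is a small helper like the sketch below. The function name `evaluate` is hypothetical and not part of the original notebook. It repeats the forward pass inline so that it runs on its own.

```python
import numpy as np

def evaluate(X, Y, W1, b1, W2, b2):
    """Return the classification accuracy of the two-layer network on (X, Y)."""
    Z1 = W1 @ X + b1
    A1 = np.maximum(Z1, 0)               # ReLU
    Z2 = W2 @ A1 + b2
    # Softmax preserves the ordering within each column, so the argmax of Z2
    # is the same as the argmax of the softmax probabilities.
    predictions = np.argmax(Z2, axis=0)
    return np.mean(predictions == Y)

# Intended usage in this notebook, after training has finished:
# val_accuracy = evaluate(X_val, y_val, W1, b1, W2, b2)
# print(f"Validation accuracy: {val_accuracy:.3f}")
```

A validation accuracy close to the training accuracy suggests the network is not overfitting; a much lower one suggests it is.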

**If you are looking for a higher score, check out my notebook about [how to build your first neural network with Keras](https://www.kaggle.com/code/mohamedyosef101/build-your-first-neural-network/edit/run/153871112).**

<hr style="background-color: #00773E; border: 0px; height: 12px;">

# <span style="color: #BD0000">Useful</span> **<span style="color: #00773e;">Resources</span>**

#### For using NumPy
* Samson Zhang. 2020. [*Simple MNIST NN from scratch with numpy*](https://www.kaggle.com/code/wwsalmon/simple-mnist-nn-from-scratch-numpy-no-tf-keras/notebook). Kaggle

#### Tutorials on Deep Learning
- Misra Turp. 2023. [*50 Days of Deep Learning*](https://youtube.com/playlist?list=PLM8lYG2MzHmQn55ii0duXdO9QSoDF5myF&si=s1pe9cRtFjKCPqR5). YouTube.
- Grant Sanderson. 2017. [*Neural Networks, Deep Learning*](https://youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&si=ieGLDzRU2Ln9L0RO). YouTube.

#### The math behind neural networks
- Grant Sanderson. 2016. [*The Essence of Linear Algebra*](https://youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&si=ieGLDzRU2Ln9L0RO). 3Blue1Brown. YouTube.
- Kimberly Brehm. 2019. [*Linear Algebra (Entire Course)*](https://youtube.com/playlist?list=PLl-gb0E4MII03hiCrZa7YqxUMEeEPmZqK&si=TT-bemenvZWQIGG2). YouTube.
- Grant Sanderson. 2016. [*Multivariable Calculus*](https://www.khanacademy.org/math/multivariable-calculus). Khan Academy.
- [*Statistics and Probability*](https://www.khanacademy.org/math/statistics-probability). 2008. Khan Academy.

<div><br></div>

---

**Check out more of my work on [Github](https://github.com/mohamedyosef101)**.
<br> 
💬 And, if you have any questions, feel free to send me a message on [**LinkedIn**](https://linkedin.com/in/mohamedyosef101).