<a href="https://www.kaggle.com/code/pony1013/implementation-of-mlp-no-tf-keras-pytorch?scriptVersionId=162038852" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

This is a simple 4-layer MLP for classifying the MNIST dataset. ReLU is used as the activation function in the two hidden layers and sigmoid in the output layer.

Each of the 42,000 training images belongs to one of 10 classes.

# 1. Preprocessing Stage

- Before feeding the images into the network, we preprocess them so that training is more stable and accuracy improves.

- `X_train = X_train/255.0` normalizes the training pixel values from the range [0, 255] to [0, 1].
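
The normalization step can be sketched on a small synthetic array. The random data below is only a stand-in for the real CSV (an assumption for illustration), laid out the same way: the label in column 0 and 784 pixel values in the remaining columns.

```python
import numpy as np

# Synthetic stand-in for train.csv: 5 rows, column 0 = label, columns 1..784 = pixels in [0, 255]
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=(5, 1))
pixels = rng.integers(0, 256, size=(5, 784))
data = np.hstack([labels, pixels])

# Separate labels from pixels, put one sample per column, and scale pixels to [0, 1]
y = data[:, 0]
X = data[:, 1:].T / 255.0

print(X.shape)           # (784, 5)
print(X.min(), X.max())  # both values lie within [0, 1]
```

Keeping one sample per column matches the `np.dot(w, X)` convention used by the model in this notebook.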




# 2. FeedForward Propagation

The output of each neuron is:

$\sum = wx + b$

$o = ReLU(\sum)$

Let $w_1,w_2,w_3$ and $b_1,b_2,b_3$ be the weights and biases of layers 2-4 respectively. The outputs are:

$o_1 = ReLU(\sum w_1x + b_1)$

$o_2 = ReLU(\sum w_2o_1 +b_2)$

$o_3 = sigmoid(\sum w_3o_2 +b_3)$
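
The three equations above can be sketched in NumPy as follows. This is a minimal illustration, not the notebook's `Model` class: the helper names, the random input batch, and the layer sizes (784, 64, 64, 10) with a `sqrt(2 / fan_in)` weight scale are assumptions chosen for the example.

```python
import numpy as np

def relu(s):
    return np.maximum(0, s)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, w1, b1, w2, b2, w3, b3):
    # Each layer computes a weighted sum plus bias, then applies its activation
    o1 = relu(w1 @ x + b1)
    o2 = relu(w2 @ o1 + b2)
    o3 = sigmoid(w3 @ o2 + b3)
    return o1, o2, o3

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, m = 784, 64, 10, 3  # m = number of samples (one per column)
w1 = rng.standard_normal((n_hidden, n_in)) * np.sqrt(2.0 / n_in)
w2 = rng.standard_normal((n_hidden, n_hidden)) * np.sqrt(2.0 / n_hidden)
w3 = rng.standard_normal((n_out, n_hidden)) * np.sqrt(2.0 / n_hidden)
b1, b2, b3 = np.zeros((n_hidden, 1)), np.zeros((n_hidden, 1)), np.zeros((n_out, 1))

x = rng.random((n_in, m))       # random stand-in for normalized images
o1, o2, o3 = forward(x, w1, b1, w2, b2, w3, b3)
print(o3.shape)                 # (10, 3)
print(np.argmax(o3, axis=0))    # predicted class index for each sample
```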

# 3. BackwardPropagation

This algorithm uses gradient descent as its optimization method.

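Let $\delta_1,\delta_2,\delta_3$ be the **responsibility** for the error in layers 2-4 respectively, let $E$ be the total loss of the model, and let $y$ be the one-hot label. Taking $E$ to be the squared error $E = \frac{1}{2}\sum (o_3 - y)^2$, which matches the error term `output_3 - one_hot_labels` in the code, we have for the output layer:

$\delta_3 = \frac{\partial E}{\partial \sum_3} = \frac{\partial E}{\partial o_3} \cdot \frac{\partial o_3}{\partial \sum_3} = (o_3 - y)\cdot o_3(1-o_3)$

**Thus, the weight correction for the weights in layer 3-4 is:**

$\varDelta w_3 = -\gamma \cdot \delta_3 \cdot o_2^T$ where $\gamma$ is the learning rate and $o_2$ is the input to these weights.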

For layers 2 and 3, let $o_1,o_2$ be their outputs respectively. Both layers use ReLU, so the activation derivative is $ReLU'$ rather than the sigmoid derivative:

$\delta_2 = \frac{\partial E}{\partial \sum_2} = \frac{\partial E}{\partial o_2} \cdot \frac{\partial o_2}{\partial \sum_2} = ReLU'(\sum_2) \cdot w_3^T \delta_3$

The weight correction for the layer 2-3 weights is:

$\varDelta w_2 = -\gamma \cdot \delta_2 \cdot o_1^T$

The weights in layer 1-2 follow the same formulas, since both are hidden layers: $\delta_1 = ReLU'(\sum_1) \cdot w_2^T \delta_2$ and $\varDelta w_1 = -\gamma \cdot \delta_1 \cdot x^T$.
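
One way to sanity-check a backpropagation derivation is to compare the analytic gradient with a finite-difference estimate. The sketch below does this on a tiny network with the same ReLU, ReLU, sigmoid structure, assuming a squared-error loss and taking each $\delta$ as the derivative of the loss with respect to the layer's pre-activation. The function names, layer sizes, and learning rate are hypothetical and chosen only for the example.

```python
import numpy as np

def relu(s):
    return np.maximum(0, s)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss_and_grads(p, x, y):
    # Forward pass, keeping the pre-activations s1..s3
    s1 = p["w1"] @ x + p["b1"]; o1 = relu(s1)
    s2 = p["w2"] @ o1 + p["b2"]; o2 = relu(s2)
    s3 = p["w3"] @ o2 + p["b3"]; o3 = sigmoid(s3)
    loss = 0.5 * np.sum((o3 - y) ** 2)       # assumed squared-error loss
    # Backward pass: delta_k = dE/ds_k
    d3 = (o3 - y) * o3 * (1 - o3)            # sigmoid derivative written via the output
    d2 = (p["w3"].T @ d3) * (s2 > 0)         # ReLU derivative is 1 where s2 > 0
    d1 = (p["w2"].T @ d2) * (s1 > 0)
    grads = {"w1": d1 @ x.T, "w2": d2 @ o1.T, "w3": d3 @ o2.T}
    return loss, grads

def numerical_grad(p, x, y, name, i, j, eps=1e-5):
    # Central finite difference of the loss with respect to p[name][i, j]
    plus = {k: v.copy() for k, v in p.items()}
    minus = {k: v.copy() for k, v in p.items()}
    plus[name][i, j] += eps
    minus[name][i, j] -= eps
    return (loss_and_grads(plus, x, y)[0] - loss_and_grads(minus, x, y)[0]) / (2 * eps)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, m = 4, 5, 3, 6
x = rng.random((n_in, m))
y = np.eye(n_out)[rng.integers(0, n_out, size=m)].T   # one-hot labels, one column per sample
p = {
    "w1": rng.standard_normal((n_hidden, n_in)) * 0.5, "b1": np.zeros((n_hidden, 1)),
    "w2": rng.standard_normal((n_hidden, n_hidden)) * 0.5, "b2": np.zeros((n_hidden, 1)),
    "w3": rng.standard_normal((n_out, n_hidden)) * 0.5, "b3": np.zeros((n_out, 1)),
}

loss, grads = loss_and_grads(p, x, y)
print(grads["w3"][0, 0], numerical_grad(p, x, y, "w3", 0, 0))  # the two values should agree closely

# One gradient descent step: w <- w - gamma * dE/dw
gamma = 0.1
for name in ("w1", "w2", "w3"):
    p[name] -= gamma * grads[name]
```

If the analytic and numerical values disagree, the usual suspects are an activation derivative evaluated at the wrong quantity or a weight matrix from the wrong layer in the $\delta$ recursion.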

# 4. Activation Functions

1. ReLU
- ReLU is one of the most widely used activation functions in deep learning. It is defined as:

$ReLU(x) = x$ for x>0

$ReLU(x) = 0$ for x≤0

ReLU is known for the simplicity of its derivative:

$ReLU'(x) = 0$ for x<0

$ReLU'(x) = 1$ for x>0

At x=0 the derivative is undefined; the code below uses 0 there.

2. Softmax
- Softmax is another popular activation function. It transforms the outputs into a probability distribution over the classes and is defined as:

$softmax(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$

3. Sigmoid
- Sigmoid is an activation function with an S-shaped curve. The sigmoid function used in this model is:

$Sigmoid(x) = \frac{1}{1+e^{-x}}$
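
The definitions above translate almost directly into NumPy. The following is a minimal sketch with hypothetical standalone function names (not the notebook's class methods). The softmax subtracts the per-column maximum before exponentiating, a common trick to avoid overflow, and assumes one sample per column.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # Uses 0 at x == 0 by convention
    return np.where(x > 0, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    # Subtracting the column-wise max leaves the result unchanged but avoids overflow in exp
    e_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e_z / np.sum(e_z, axis=0, keepdims=True)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))             # [0. 0. 3.]
print(relu_derivative(x))  # [0. 0. 1.]
print(sigmoid(0.0))        # 0.5

# Two samples (columns) with three classes each; the second column has very large logits
z = np.array([[1.0, 1000.0], [2.0, 1000.0], [3.0, 1000.0]])
print(softmax(z).sum(axis=0))  # each column sums to 1 (up to rounding)
```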


In [1]:
import numpy as np
import pandas as pd
import scipy as sci
import scipy.special  # makes sure sci.special (expit) is loaded
import matplotlib.pyplot as plt
# Importing libraries


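# Paths to the Kaggle Digit Recognizer training and test CSV files
path = "/kaggle/input/digit-recognizer/train.csv"
path1 = "/kaggle/input/digit-recognizer/test.csv"
data = pd.read_csv(path, engine='c')
test_data = pd.read_csv(path1, engine='c')
# Scale test pixels to [0, 1] and transpose to shape (784, num_samples)
test_data = test_data/255.0
test_data = test_data.to_numpy().T
# Labels as a 1-D integer array, later used to build one-hot targets
label_1 = data['label'].to_numpy()

data = np.array(data)
m,n = data.shape
# Transpose so each column is one sample: row 0 holds the labels, rows 1..n-1 the pixels
train_data = data[0:m].T
X_train = train_data[1:n]
# Scale training pixels to [0, 1]
X_train = X_train/255.0
# Preprocessing the data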

class Model(object):

    def __init__(self, input_dim=784, output_dim=10):
        self.num_neurons = 64
        
        self.w_1 = np.random.randn(self.num_neurons, input_dim) * np.sqrt(2./input_dim)
        self.w_2 = np.random.randn(self.num_neurons, self.num_neurons) * np.sqrt(2./10)
        self.w_3 = np.random.randn(output_dim, self.num_neurons) * np.sqrt(2./10)
        
        self.b_1 = np.zeros((self.num_neurons, 1))
        self.b_2 = np.zeros((self.num_neurons, 1))
        self.b_3 = np.zeros((output_dim,1))
        
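        self.learning_rate = 7.1e-6
        self.epochs = 1000
        self.alpha = 1.0  # ELU scale read by ELU/ELU_deriv (assumed value; ELU is not used by the layers below)
        
# He (Kaiming)-style initialization: w_1 is scaled by sqrt(2/fan_in) to keep the activation variance stable through ReLU layers; w_2 and w_3 use a fixed scale of sqrt(2/10).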

    def sigmoid(self,x):
        return sci.special.expit(x)
    
    def sigmoid_derivative(self,x):
        return self.sigmoid(x)*(1-self.sigmoid(x))
    
    def ReLU(self, x):
        return np.maximum(0, x)

    def ReLU_derivative(self, x):
        return np.where(x > 0, 1.0, 0.0)

    def ELU(self, x):
        return np.where(x >= 0.0, x, self.alpha * (np.exp(x) - 1))

    def ELU_deriv(self,x):
        return np.where(x >= 0, 1, self.alpha * np.exp(x))

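    def softmax(self, z):
        # Subtract the per-sample (column-wise) max for numerical stability, then normalize
        e_z = np.exp(z - np.max(z, axis=0, keepdims=True))
        return e_z / np.sum(e_z, axis=0, keepdims=True)
    
    def softmax_backward(self,z):
        # Diagonal of the softmax Jacobian only: ds_i/dz_i = s_i * (1 - s_i)
        s = self.softmax(z)
        return s * (1 - s)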
        
    
#Defining a set of activation functions for the convenience of changing act. functions below


    def Forward(self, X_train):
        self.sum_1 = np.dot(self.w_1, X_train) + self.b_1
        self.output_1 = self.ReLU(self.sum_1)
        #Second layer, using ReLU as activation

        self.sum_2 = np.dot(self.w_2, self.output_1) + self.b_2
        self.output_2 = self.ReLU(self.sum_2)
        #Third layer, using ReLU as activation
        
        self.sum_3 = np.dot(self.w_3, self.output_2)+self.b_3
        self.output_3 = self.sigmoid(self.sum_3)
        #Fourth Layer, using Sigmoid as activation
        
        self.predictions = np.argmax(self.output_3, axis=0)
        return self.predictions
    
    #FeedForward Propagation, used to predict the label of the data.
    
    
    def Backward(self, label_1, X_train):
        
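        # One-hot encode the labels, shape (10, num_samples)
        one_hot_labels = np.eye(10)[label_1].T
        self.error = self.output_3 - one_hot_labels
        # sigmoid_derivative expects the pre-activation, so pass sum_3 rather than output_3
        self.delta_3 = self.error*self.sigmoid_derivative(self.sum_3)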
        self.d_w_3 = np.dot(self.delta_3, self.output_2.T)  
        self.d_b_3 = np.sum(self.delta_3, axis=1, keepdims=True)

        self.delta_2 = np.dot(self.w_3.T, self.delta_3)*self.ReLU_derivative(self.output_2)
        self.d_w_2 = np.dot(self.delta_2, self.output_1.T)
        self.d_b_2 = np.sum(self.delta_2, axis=1, keepdims=True)

        self.delta_1 = np.dot(self.w_2.T, self.delta_2) * self.ReLU_derivative(self.output_1)
        self.d_w_1 = np.dot(self.delta_1, X_train.T)
        self.d_b_1 = np.sum(self.delta_1, axis=1, keepdims=True)

#Backpropagation: computes the gradients that gradient descent uses to optimize the model
        

    def update_params(self):
        self.w_1 -= self.learning_rate * self.d_w_1
        self.w_2 -= self.learning_rate * self.d_w_2
        self.w_3 -= self.learning_rate * self.d_w_3
        
        self.b_1 -= self.learning_rate * self.d_b_1
        self.b_2 -= self.learning_rate * self.d_b_2
        self.b_3 -= self.learning_rate * self.d_b_3
        
#Updating parameters

    def compute_accuracy(self, label_1):
        correct_predictions = np.sum(self.predictions == label_1)
        total_predictions = self.predictions.shape[0]
        self.accuracy = correct_predictions / total_predictions
        

    def fit(self, X_train, label_1):
        
        # Accuracy history, created once so it is kept across epochs
        list_1 = list()
        for epoch in range(self.epochs):
            self.Forward(X_train)
            self.Backward(label_1, X_train)
            self.update_params()
            self.compute_accuracy(label_1)
            list_1.append(self.accuracy)
            print(f"Epoch {epoch + 1}/{self.epochs} Accuracy: {self.accuracy * 100}%")
            #Training loop: forward pass, backward pass, parameter update
        return list_1
        
    def test(self,test_data):
        self.Forward(test_data)
        predictions = self.predictions

        # Create a DataFrame directly with 'ImageId' and 'Label'
        submission = pd.DataFrame({
            'ImageId': range(1, len(predictions) + 1),
            'Label': predictions
        })

        # Save the DataFrame to a CSV file
        submission.to_csv("/kaggle/working/submission.csv", index=False)
        print(submission)
        

# Create an instance of the Model class
model = Model()

# Train the model
model.fit(X_train, label_1)
model.test(test_data)

Epoch 1/1000 Accuracy: 9.814285714285715%
Epoch 2/1000 Accuracy: 10.504761904761905%
Epoch 3/1000 Accuracy: 9.776190476190475%
Epoch 4/1000 Accuracy: 13.53095238095238%
Epoch 5/1000 Accuracy: 17.926190476190477%
Epoch 6/1000 Accuracy: 15.790476190476191%
Epoch 7/1000 Accuracy: 13.497619047619047%
Epoch 8/1000 Accuracy: 25.1%
Epoch 9/1000 Accuracy: 21.888095238095236%
Epoch 10/1000 Accuracy: 34.21666666666667%
Epoch 11/1000 Accuracy: 34.05952380952381%
Epoch 12/1000 Accuracy: 43.60952380952381%
Epoch 13/1000 Accuracy: 43.114285714285714%
Epoch 14/1000 Accuracy: 50.964285714285715%
Epoch 15/1000 Accuracy: 50.211904761904755%
Epoch 16/1000 Accuracy: 55.22857142857143%
Epoch 17/1000 Accuracy: 55.42857142857143%
Epoch 18/1000 Accuracy: 58.11904761904761%
Epoch 19/1000 Accuracy: 58.04285714285714%
Epoch 20/1000 Accuracy: 60.29047619047619%
Epoch 21/1000 Accuracy: 59.452380952380956%
Epoch 22/1000 Accuracy: 61.59285714285714%
Epoch 23/1000 Accuracy: 60.6452380952381%
Epoch 24/1000 Accuracy: 6