# A neural network for classification of handwritten digits
## Introduction
In this notebook I present an implementation of a neural network, written from scratch, that recognizes handwritten digits from the MNIST dataset. The dataset is in *comma-separated values* (CSV) format and was obtained from [Dariel Dato-on on Kaggle](https://www.kaggle.com/oddrationale/mnist-in-csv).

The neural network is implemented as a Python class. It has one hidden layer whose number of units is tunable, and it is trained by backpropagation. The goal here was not performance but to write everything from scratch, as an exercise.

We will use the following modules:
* pandas for data import
* NumPy for all the calculations, which are mostly vectorized
* Matplotlib Pyplot for plotting
* time to measure training time

Optionally, the `OneHotEncoder` class from scikit-learn can be used to recode the output into a matrix of 0s and 1s, but a manual implementation is used here instead.
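
As a rough illustration of what recoding the output into a matrix of 0s and 1s means, here is a minimal sketch. The labels in `Y` are made up for the example, and the sketch uses NumPy index arrays rather than the explicit loop found in the class, so it is a variant and not the notebook's exact code.

```python
import numpy as np

# Hypothetical labels for illustration: one digit per row, shape (n_samples, 1)
Y = np.array([[3], [0], [9], [3]])
n_labels = 10

# Start from zeros, then set a single 1 per row in the column given by the label
Y_onehot = np.zeros((Y.shape[0], n_labels))
Y_onehot[np.arange(Y.shape[0]), Y.ravel()] = 1

print(Y_onehot.shape)
```

Each row then contains exactly one 1, and `np.argmax` along the rows recovers the original labels.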

In [1]:
import pandas as pd
import numpy as np
#from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import time

## Definition of a neural network class
The `NeuralNet` class contains the methods and parameters that perform the digit classification. The docstrings and comments explain each step.
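
For reference, the cost that `forward_prop` returns as `J_reg` can be written as follows. The notation is read off the code and is an assumption of this sketch: $m$ is the number of samples, $K$ the number of labels, $h$ the network output, $\Theta^{(1)}$ and $\Theta^{(2)}$ stand for `Theta1` and `Theta2`, and the bias column (index 0) of each weight matrix is excluded from the penalty.

$$
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log h_k^{(i)} + \left(1 - y_k^{(i)}\right) \log\left(1 - h_k^{(i)}\right) \right] + \frac{\lambda}{2m} \left( \sum_{j} \sum_{l \geq 1} \left(\Theta^{(1)}_{jl}\right)^2 + \sum_{j} \sum_{l \geq 1} \left(\Theta^{(2)}_{jl}\right)^2 \right)
$$

In the same notation, `backward_prop` computes the following quantities, where $a^{(1)}$ is the input with its intercept column, $a^{(2)}$ is the hidden activation with its intercept column, $\sigma'$ is the sigmoid gradient, and $\tilde{\Theta}^{(l)}$ is $\Theta^{(l)}$ with its first column set to zero:

$$
\delta^{(3)} = h - Y, \qquad \delta^{(2)} = \left(\delta^{(3)} \, \Theta^{(2)}_{:,1:}\right) \odot \sigma'\left(z^{(2)}\right), \qquad \frac{\partial J}{\partial \Theta^{(l)}} = \frac{1}{m} \left(\delta^{(l+1)}\right)^{\top} a^{(l)} + \frac{\lambda}{m} \tilde{\Theta}^{(l)}
$$

Each weight matrix is then updated as $\Theta^{(l)} \leftarrow \Theta^{(l)} - \alpha \, \partial J / \partial \Theta^{(l)}$.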

In [3]:
class NeuralNet(object):
    """This class provides methods to perform classification of data using
a neural network.
alpha        : learning rate
n_iter       : number of iterations of forward/backward propagation
lamb         : lambda parameter for regularization
hidden_size  : hidden layer size
n_labels     : number of labels, i.e. number of output classes"""
    def __init__(self, alpha=1, n_iter=400, lamb=1, hidden_size=None, n_labels=None):
        self.alpha = alpha
        self.n_iter = n_iter
        self.lamb = lamb
        self.hidden_size = hidden_size
        self.n_labels = n_labels

    def rand_weights(self, c_in, c_out, epsilon=0.12):
        """Return randomly initialized weights of a layer with c_in incoming
        connections and c_out outgoing connections."""
        return np.random.random((c_out, c_in + 1)) * 2 * epsilon - epsilon

    def add_intercept(self, X):
        """Return matrix with an added column of ones."""
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def _s(self, z):
        """Return the output of the sigmoid function with z as input."""
        return 1 / (1 + np.exp(-z))

    def _sgrad(self, z):
        """Return the gradient of sigmoid function with z as input."""
        return self._s(z) * (1 - self._s(z))

    def forward_prop(self, X, Y, Theta1, Theta2):
        """Return values calculated during forward propagation and cost."""
        z_2 = np.dot(X, Theta1.T)
        a_2 = self._s(z_2)
        a_2 = self.add_intercept(a_2)
        z_3 = np.dot(a_2, Theta2.T)
        h = self._s(z_3)

        #calculate unregularized cost
        m = X.shape[0]
        J = np.sum(np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h))) / -m
        #calculate regularization parameters
        T1 = Theta1[:, 1:]
        T2 = Theta2[:, 1:]
        A = self.lamb / 2 / m
        B = np.sum(np.sum(T1 * T1, axis=1))
        C = np.sum(np.sum(T2 * T2, axis=1))
        #calculate regularized cost
        J_reg = J + A * (B + C)

        return z_2, a_2, z_3, h, J_reg

    def backward_prop(self, z_2, a_2, z_3, h, X, Y, Theta1, Theta2, lamb, alpha):
        """Return updated Theta1 and Theta2."""
        #perform backward propagation per se
        m = Y.shape[0]
        T2 = Theta2[:, 1:]
        d_3 = h - Y
        d_2 = np.dot(d_3, T2) * self._sgrad(z_2)
        Delta_1 = np.dot(d_2.T, X)
        Delta_2 = np.dot(d_3.T, a_2)

        #calculate unregularized gradient
        Theta1_grad = Delta_1 / m
        Theta2_grad = Delta_2 / m

        #regularization of gradient
        #replace first column with 0 to avoid regularization of bias
        Theta1_copy = np.copy(Theta1)
        Theta2_copy = np.copy(Theta2)
        Theta1_copy[:, 0] = 0
        Theta2_copy[:, 0] = 0
        #scale Theta1/2 and add to gradient
        Theta1_grad_reg = Theta1_grad + (lamb / m) * Theta1_copy
        Theta2_grad_reg = Theta2_grad + (lamb / m) * Theta2_copy

        #update Theta1/2
        Theta1_updated = Theta1 - alpha * Theta1_grad_reg
        Theta2_updated = Theta2 - alpha * Theta2_grad_reg

        return Theta1_updated, Theta2_updated

    def predict(self, Theta1, Theta2, X, Y):
        """Return predicted output and accuracy of model."""
        #add intercept to X
        X_1 = self.add_intercept(X)

        #calculate outputs of neural network layers
        h1 = self._s(np.dot(X_1, Theta1.T))
        h1_1 = self.add_intercept(h1)
        h2 = self._s(np.dot(h1_1, Theta2.T))

        #predicted output (remember indexing starts at zero)
        #prediction must have the same shape as Y
        prediction = (np.argmax(h2, axis=1))[:, np.newaxis]

        #accuracy
        accuracy = np.mean(prediction == Y)

        return prediction, accuracy

    def fit(self, X, Y):
        """Train the neural network."""
        #add intercept to X
        X_1 = self.add_intercept(X)

        #get network architecture parameters
        in_size = X.shape[1] #input layer size
        hid_size = self.hidden_size #hidden layer size
        n_lab = self.n_labels #output layer size

        #recode Y into one-hot labels
        #for Octave data set use Y.item(i) - 1
        Y_1 = np.zeros((Y.shape[0], n_lab))
        for i in range(Y.shape[0]):
            Y_1[i, Y.item(i)] = 1
        #could import OneHotEncoder from sklearn.preprocessing
        #then Y_1 = OneHotEncoder(sparse_output=False).fit_transform(Y)
        #(scikit-learn versions before 1.2 use sparse=False instead)
        
        #initialize Theta matrices randomly
        Theta1 = self.rand_weights(in_size, hid_size, 0.12)
        Theta2 = self.rand_weights(hid_size, n_lab, 0.12)

        #define array to store cost history
        cost_history = np.zeros(self.n_iter)

        #train the neural network by repeating forward and backward propagation
        for i in range(self.n_iter):
            #forward propagation
            z_2, a_2, z_3, h, J_reg = self.forward_prop(X_1, Y_1, Theta1, Theta2)
            cost_history[i] = J_reg

            #back prop
            Theta1, Theta2 = self.backward_prop(z_2, a_2, z_3, h, X_1, Y_1, 
                                                Theta1, Theta2, 
                                                self.lamb, self.alpha)

        return Theta1, Theta2, cost_history

    def plot_history(self, history):
        """Plot cost history to check that the cost is being minimized."""
        fig, ax = plt.subplots()
        ax.plot(range(self.n_iter), history)
        ax.set_xlabel("Number of iterations")
        ax.set_ylabel("Cost")
        ax.set_title("Cost history of neural network parameters")
        plt.show()
        
    def export_theta(self, filename, Theta1, Theta2):
        """Export parameters as csv file."""
        #flatten matrices
        thetas = np.concatenate((Theta1.flatten(), Theta2.flatten()))
        
        #save flat array as csv file
        np.savetxt(filename, thetas, delimiter=",")
        
    def import_theta(self, filename, X):
        """Import csv file containing parameters and convert to matrices of
        relevant size. X is needed to infer the shape of Theta1."""
        #import csv file
        data = pd.read_csv(filename, header=None)
        weights = np.array(data).flatten()
        
        #extract and reshape matrices
        Theta1 = weights[:self.hidden_size * (X.shape[1] + 1)]
        Theta1 = Theta1.reshape((self.hidden_size, X.shape[1] + 1))
        Theta2 = weights[self.hidden_size * (X.shape[1] + 1):]
        Theta2 = Theta2.reshape((self.n_labels, self.hidden_size + 1))
        
        return Theta1, Theta2
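
The way `export_theta` and `import_theta` pack and unpack the two weight matrices can be checked in memory, without writing a file. The following is a minimal sketch under stated assumptions: the layer sizes are hypothetical and much smaller than the ones used in this notebook, and the random matrices only stand in for trained weights.

```python
import numpy as np

# Hypothetical small architecture, for illustration only
in_size, hidden_size, n_labels = 4, 3, 2

rng = np.random.default_rng(0)
Theta1 = rng.random((hidden_size, in_size + 1))   # +1 column for the bias
Theta2 = rng.random((n_labels, hidden_size + 1))  # +1 column for the bias

# Same flattening order as export_theta: Theta1 first, then Theta2
flat = np.concatenate((Theta1.flatten(), Theta2.flatten()))

# Same split point and shapes as import_theta
split = hidden_size * (in_size + 1)
Theta1_back = flat[:split].reshape((hidden_size, in_size + 1))
Theta2_back = flat[split:].reshape((n_labels, hidden_size + 1))

print(flat.size)
```

The split index depends on both the hidden layer size and the number of input features, which is why `import_theta` needs `X` in addition to the model's own parameters.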

## Neural network learning and prediction
Let's define a model with 60 units in the hidden layer, a learning rate $\alpha = 0.4$ and a regularization parameter $\lambda = 1$. The neural network will be trained for 100 iterations.

In [4]:
model = NeuralNet(hidden_size=60, n_labels=10, n_iter=100, lamb=1, alpha=0.4)

Now let's import the data. The training set contains 60,000 images of 28 x 28 pixels, and the testing set contains 10,000 such images.

In [6]:
data_train = np.array(pd.read_csv("mnist_train.csv"))
#the first column of data_train contains output values
X_train = data_train[:, 1:]
Y_train = data_train[:, 0][:, np.newaxis]

data_test = np.array(pd.read_csv("mnist_test.csv"))
#the first column contains output values
X_test = data_test[:, 1:]
Y_test = data_test[:, 0][:, np.newaxis]

We now train the model on the data. Training on a subset of the training set is one way to speed things up. We also measure the training time, since you may want to do something else while the neural network is learning. On my computer, training takes about one minute with the above parameters, so treat that as a rough lower bound.

In [7]:
start = time.time()
Theta1, Theta2, cost_history = model.fit(X_train, Y_train)
stop = time.time()
minutes = "{} minutes".format(int((stop - start) // 60))
seconds = "{} seconds".format(int((stop - start) % 60))
print("Time to train the neural network :", minutes, seconds)

Time to train the neural network : 0 minutes 59 seconds
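
One possible way to select a subset of the training set is to draw random row indices without replacement and apply them to both the inputs and the labels. This is a minimal sketch: the arrays `X_demo` and `Y_demo` are synthetic stand-ins with the same layout as `X_train` and `Y_train` (rows are samples), and the subset size of 200 is arbitrary.

```python
import numpy as np

# Synthetic stand-ins for X_train / Y_train (assumption: rows are samples)
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 256, size=(1000, 784))
Y_demo = rng.integers(0, 10, size=(1000, 1))

# Draw 200 distinct row indices and apply them to both arrays,
# so that each image keeps its own label
subset_size = 200
idx = rng.choice(X_demo.shape[0], size=subset_size, replace=False)
X_sub, Y_sub = X_demo[idx], Y_demo[idx]

print(X_sub.shape, Y_sub.shape)
```

With the notebook's arrays, the same indices would be applied as `model.fit(X_train[idx], Y_train[idx])`.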


The `predict` method of the `NeuralNet` class returns the predicted labels together with the accuracy of the model, i.e. the proportion of correct predictions. We can compare the accuracy on the training set to the accuracy on the testing set, which contains data the model did not see during training.

In [8]:
#accuracy on the training data (pred contains the predicted outputs)
pred, accuracy_train = model.predict(Theta1, Theta2, X_train, Y_train)
print("Accuracy on the training data : {:.4f}".format(accuracy_train))
#accuracy on the testing data
pred2, accuracy_test = model.predict(Theta1, Theta2, X_test, Y_test)
print("Accuracy on the testing data : {:.4f}".format(accuracy_test))

Accuracy on the training data : 0.9016
Accuracy on the testing data : 0.8998
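
The comment in `predict` stating that the prediction must have the same shape as `Y` matters because of NumPy broadcasting. The toy arrays below are made up for illustration; the sketch shows how comparing a flat vector with a column vector silently produces a matrix instead of an element-wise comparison.

```python
import numpy as np

Y = np.array([[1], [2], [3]])          # column vector, shape (3, 1)
pred_flat = np.array([1, 2, 0])        # flat vector, shape (3,)
pred_col = pred_flat[:, np.newaxis]    # shape (3, 1), as done in predict

# Matching shapes: element-wise comparison, 2 of 3 predictions are correct
acc_ok = np.mean(pred_col == Y)

# Mismatched shapes broadcast to a (3, 3) matrix,
# so the mean is no longer the accuracy
acc_wrong = np.mean(pred_flat == Y)

print(acc_ok, acc_wrong)
```

This is why `predict` adds a new axis to the result of `np.argmax` before comparing it with `Y`.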
