## Two-Layer Artificial Neural Network
### Joseph Melby
### 11/26/18

#### Given training data: MNIST_X_train.csv (feature values), MNIST_y_train.csv (labels)

#### Test data: MNIST_X_test.csv (feature values), MNIST_y_test.csv (labels)

#### File House feature MNIST description.csv gives a brief introduction to these data sets.

#### Here I normalize and prepare the data set:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline 

def read_dataset(feature_file, label_file):
    ''' Read data set in *.csv to data frame in Pandas
    and return numpy arrays for features and labels respectively.
    
    Args:
    feature_file: string, path to the CSV file containing features
    label_file: string, path to the CSV file containing labels
    
    Returns:
    numpy array: features, shape=(num_samples, num_features)
    numpy array: labels, shape=(num_samples, 1)
    '''
    df_X = pd.read_csv(feature_file)
    df_y = pd.read_csv(label_file)
    X = df_X.values # convert values in dataframe to numpy array (features)
    y = df_y.values # convert values in dataframe to numpy array (label)
    return X, y



X_train, y_train = read_dataset('MNIST_X_train.csv', 'MNIST_y_train.csv')
X_test, y_test = read_dataset('MNIST_X_test.csv', 'MNIST_y_test.csv')

print(X_train.shape)
print(X_test.shape)

def plot_digit(feature_vector): 
    ''' Plot a given feature vector as a digit image in grayscale
    
    Args:
    feature_vector: numpy array, shape=(num_features, )
    '''
    plt.gray() 
    plt.matshow(feature_vector.reshape(28,28))
    plt.show() 

def normalize_features(X_train, X_test):
    ''' Normalize features in training and test data using StandardScaler
    
    Args:
    X_train: numpy array, shape=(num_train_samples, num_features)
        Training features
    X_test: numpy array, shape=(num_test_samples, num_features)
        Test features
        
    Returns:
    numpy array: normalized training features, shape=(num_train_samples, num_features)
    numpy array: normalized test features, shape=(num_test_samples, num_features)
    '''
    from sklearn.preprocessing import StandardScaler # import the scaler class
    scaler = StandardScaler() # create a scaler object
    scaler.fit(X_train) # calculate mean, std in X_train
    X_train_norm = scaler.transform(X_train) # apply normalization on X_train
    X_test_norm = scaler.transform(X_test) # we use the same normalization on X_test
    return X_train_norm, X_test_norm


X_train_norm, X_test_norm = normalize_features(X_train, X_test)


def one_hot_encoder(y_train, y_test):
    ''' Convert label to a vector under one-hot-code fashion
    
    Args:
    y_train: numpy array, shape=(num_train_samples, 1)
        Training labels
    y_test: numpy array, shape=(num_test_samples, 1)
        Test labels
        
    Returns:
    numpy array: one-hot encoded training labels, shape=(num_train_samples, num_classes)
    numpy array: one-hot encoded test labels, shape=(num_test_samples, num_classes)
    '''
    from sklearn import preprocessing
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_train)
    y_train_ohe = lb.transform(y_train)
    y_test_ohe = lb.transform(y_test)
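    return y_train_ohe, y_test_ohe


# One-hot encode the labels; y_train_ohe is needed to train the network below
y_train_ohe, y_test_ohe = one_hot_encoder(y_train, y_test)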


(2000, 784)
(500, 784)
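To make the preprocessing above concrete, here is a minimal sketch on toy data, using plain NumPy rather than scikit-learn. The arrays `X_train_toy`, `X_test_toy`, and `y_toy` are made up for illustration. The sketch assumes that constant features should be left unscaled, which, as far as I know, matches what `StandardScaler` does with zero-variance columns.

```python
import numpy as np

# Toy data: 4 training samples, 3 features. The last feature is constant,
# much like an always-black border pixel in MNIST.
X_train_toy = np.array([[0.0, 10.0, 5.0],
                        [2.0, 20.0, 5.0],
                        [4.0, 30.0, 5.0],
                        [6.0, 40.0, 5.0]])
X_test_toy = np.array([[3.0, 25.0, 5.0]])

# Standardize with statistics computed on the training set only,
# then apply the same shift and scale to the test set.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)
std[std == 0] = 1.0  # assumption: leave constant features unscaled
X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std

# One-hot encode integer labels by picking rows of an identity matrix.
y_toy = np.array([2, 0, 1, 2])
num_classes = 3
y_toy_ohe = np.eye(num_classes, dtype=int)[y_toy]

print(X_train_std.mean(axis=0))
print(X_test_std)
print(y_toy_ohe)
```

The test sample sits exactly at the training mean, so it maps to the zero vector, and each label becomes a row with a single 1 in the column of its class.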




### Define the two-layer ANN class (two hidden layers), as well as necessary supporting functions:

#### Note: I was not able to complete the regularization portion of the assignment. I planned to use a regularized loss function, and I believe we are supposed to use an L2-norm regularization term, but I did not figure out how to implement it in time.
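For reference, a minimal sketch of what an L2-norm penalty could look like. This is not part of the class in this notebook: the helper names `l2_regularized_loss` and `l2_regularized_step` and the strength `lam` are hypothetical, and the sketch assumes that only the weight matrices (not the biases) are penalized.

```python
import numpy as np

def l2_regularized_loss(y, y_hat, weights, lam):
    """Summed cross-entropy plus (lam / 2) * sum of squared weights.

    weights: list of weight matrices (biases are left unregularized here).
    lam: hypothetical regularization strength.
    """
    data_loss = -np.sum(y * np.log(y_hat + 1e-10))
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_loss + penalty

def l2_regularized_step(W, dW, lr, lam):
    """One gradient-descent update; the penalty adds lam * W to the gradient."""
    return W - lr * (dW + lam * W)

# Tiny example with one sample, two classes, and a single weight matrix.
y = np.array([[1.0, 0.0]])
y_hat = np.array([[0.5, 0.5]])
W = np.array([[1.0, -2.0], [0.0, 3.0]])

loss = l2_regularized_loss(y, y_hat, [W], lam=0.1)
W_new = l2_regularized_step(W, dW=np.zeros_like(W), lr=0.5, lam=0.1)
print(loss)
print(W_new)
```

With a zero data gradient, the update shrinks every weight by the factor `1 - lr * lam`, which is why this penalty is often called weight decay.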

In [3]:
class twolayer_NN:
    """
    A class for a two-layer neural network.

    Attributes:
    -----------
    X : numpy array
        Input data of shape (number of samples, number of features)
    y : numpy array
        Target data of shape (number of samples, number of output units)
    hidden_layer_nn : int
        Number of neurons in the hidden layer
    lr : float
        Learning rate for gradient descent

    Methods:
    --------
    __init__(self, X, y, hidden_layer_nn=100, lr=0.01):
        Initializes the weights and biases of the network and sets the attributes.

    feed_forward(self):
        Computes the forward pass of the neural network.

    back_propagation(self):
        Computes the gradients of the neural network using backpropagation and updates the weights and biases using gradient descent.

    cross_entropy_loss(self):
        Computes the cross-entropy loss of the neural network.

    predict(self, X_test):
        Predicts the class labels of new data using the trained neural network.

    """

    def __init__(self, X, y, hidden_layer_nn=100, lr=0.01):
        """
        Initializes the weights and biases of the network and sets the attributes.

        Parameters:
        ----------
        X : numpy array
            Input data of shape (number of samples, number of features)
        y : numpy array
            Target data of shape (number of samples, number of output units)
        hidden_layer_nn : int, optional
            Number of neurons in the hidden layer, by default 100
        lr : float, optional
            Learning rate for gradient descent, by default 0.01
        """
        self.X = X
        self.y = y
        self.hidden_layer_nn = hidden_layer_nn
        self.lr = lr
        # Initialize the weights
        self.nn = X.shape[1] # number of neurons in the input layer = number of features (784 for MNIST)
        self.W1 = np.random.randn(self.nn, hidden_layer_nn) / np.sqrt(self.nn) # shape (nn, hidden_layer_nn)
        self.b1 = np.zeros((1, hidden_layer_nn))
        
        # shape (hidden_layer_nn, hidden_layer_nn); note the scale uses sqrt(nn), not sqrt(hidden_layer_nn)
        self.W2 = np.random.randn(hidden_layer_nn, hidden_layer_nn) / np.sqrt(self.nn)
        self.b2 = np.zeros((1, hidden_layer_nn))
        
        self.output_layer_nn = y.shape[1]
        self.W3 = np.random.randn(hidden_layer_nn, self.output_layer_nn) / np.sqrt(hidden_layer_nn)
        self.b3 = np.zeros((1, self.output_layer_nn))
    
    def feed_forward(self):
        """
        Performs a forward pass through the neural network and stores the
        predicted class probabilities in self.y_hat (nothing is returned).

        Sets:
        -----
        self.y_hat : numpy.ndarray
            Array of shape (num_samples, num_classes) with the predicted class probabilities.
        """
        # Hidden layer 1: z1 = X W1 + b1, followed by ReLU applied in place
        self.z1 = np.dot(self.X, self.W1) + self.b1
        self.z1[self.z1 < 0] = 0
        self.f1 = self.z1 # f1 and z1 refer to the same array
        # Hidden layer 2: z2 = f1 W2 + b2, followed by ReLU applied in place
        self.z2 = np.dot(self.f1, self.W2) + self.b2
        self.z2[self.z2 < 0] = 0
        self.f2 = self.z2 # f2 and z2 refer to the same array
        
        # Output layer: z3 = f2 W3 + b3, followed by softmax
        self.z3 = np.dot(self.f2, self.W3) + self.b3
        self.y_hat = softmax(self.z3)
        
    def back_propagation(self):
        """
        Back propagation step of the neural network. Calculates the gradients of the weights and biases
        using the chain rule of derivatives and updates the weights and biases using the gradient descent algorithm.
        """
        # Output layer: for softmax with cross-entropy loss, dL/dz3 = y_hat - y
        d3 = self.y_hat - self.y
        dW3 = np.dot(self.f2.T, d3)
        db3 = np.sum(d3, axis = 0, keepdims = True) # axis = 0 sums over the samples
        
        # Hidden layer 2. The exact ReLU derivative is 1 where the activation is
        # positive; this code uses 0.001 instead, which scales down the gradient
        # reaching W2 and b2.
        relu_grad2 = np.where(self.z2 > 0, 0.001, 0.0)
        d2 = np.dot(d3, self.W3.T)*relu_grad2
        dW2 = np.dot(self.f1.T, d2)
        db2 = np.sum(d2, axis = 0, keepdims = True)
        
        # Hidden layer 1, with 0.0001 in place of the ReLU derivative
        relu_grad1 = np.where(self.z1 > 0, 0.0001, 0.0)
        d1 = np.dot(d2, self.W2.T)*relu_grad1
        dW1 = np.dot(self.X.T, d1)
        db1 = np.sum(d1, axis = 0, keepdims = True)
        
        # Gradient descent update
        self.W1 = self.W1 - self.lr*dW1
        self.b1 = self.b1 - self.lr*db1
        
        self.W2 = self.W2 - self.lr*dW2
        self.b2 = self.b2 - self.lr*db2
        
        self.W3 = self.W3 - self.lr*dW3
        self.b3 = self.b3 - self.lr*db3

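    def cross_entropy_loss(self):
        """
        Runs a forward pass and stores the cross-entropy loss, summed over all
        training samples (not averaged), in self.loss.
        """
        self.feed_forward() # update self.y_hat
        self.loss = -np.sum(self.y*np.log(self.y_hat + 1e-10)) # 1e-10 avoids log(0)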
        
    def predict(self, X_test):
        """
        Predicts the class labels for the given input data X_test.

        Args:
            X_test (numpy.ndarray): A numpy array of shape (num_test_samples, num_features) containing the test input data.

        Returns:
            numpy.ndarray: An array of shape (num_test_samples,) containing the predicted class labels for the input data.
        """
        z1 = np.dot(X_test, self.W1) + self.b1
        z1[z1 < 0] = 0
        f1 = z1
        
        z2 = np.dot(f1, self.W2) + self.b2
        z2[z2 < 0] = 0
        f2 = z2
        
        # Output layer
        z3 = np.dot(f2, self.W3) + self.b3
        y_hat_test = softmax(z3)
        
        # The predicted label is the index of the highest probability in each row
        # (this assumes the classes are the digits 0-9 in sorted order)
        ypred = np.argmax(y_hat_test, axis=1)
        return ypred

    
    
# def ReLu(t):
#     if t<0:
#         return 0
#     else:
#         return t
# vReLu = np.vectorize(ReLu)
    
# def dReLu(t):
#     if t<0:
#         return 0
#     else:
#         return 1
# vdReLu = np.vectorize(dReLu)

def accuracy(ypred, yexact):
    """
    Computes the accuracy score given predicted labels and true labels.
    
    Parameters:
        ypred (numpy.ndarray): 1D array of predicted labels.
        yexact (numpy.ndarray): 1D array of true labels.
    
    Returns:
        float: The accuracy score.
    """
    p = np.array(ypred == yexact, dtype=int)
    return np.sum(p) / float(len(yexact))


def softmax(z):
    """
    Computes the softmax function for an input array of logits.
    
    Parameters:
        z (numpy.ndarray): The input array of logits.
    
    Returns:
        numpy.ndarray: The softmax scores for each input logit.
    """
    exp_value = np.exp(z - np.amax(z, axis=1, keepdims=True))
    softmax_scores = exp_value / np.sum(exp_value, axis=1, keepdims=True)
    return softmax_scores
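
The first line of `back_propagation` relies on the fact that, for softmax followed by cross-entropy loss, the gradient of the loss with respect to the logits is `y_hat - y`. One way to sanity-check that formula is a finite-difference comparison. The sketch below is an assumption-laden illustration, not part of the assignment: it redefines `softmax` so the block is self-contained, and `summed_cross_entropy` is a hypothetical helper that mirrors the loss used in the class.

```python
import numpy as np

def softmax(z):
    # Same numerically stable softmax as in the notebook.
    exp_value = np.exp(z - np.amax(z, axis=1, keepdims=True))
    return exp_value / np.sum(exp_value, axis=1, keepdims=True)

def summed_cross_entropy(z, y):
    # Cross-entropy summed over samples, as in cross_entropy_loss.
    return -np.sum(y * np.log(softmax(z) + 1e-10))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))   # logits for 4 samples and 3 classes
y = np.eye(3)[[0, 2, 1, 2]]   # one-hot labels

# Analytic gradient used in back_propagation.
analytic = softmax(z) - y

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[i, j] += eps
        z_minus[i, j] -= eps
        numeric[i, j] = (summed_cross_entropy(z_plus, y)
                         - summed_cross_entropy(z_minus, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

If the two gradients agree to within a small tolerance, the output-layer step is consistent with the loss. The same kind of check could be extended to the hidden layers, where it would expose the effect of replacing the ReLU derivative with a small constant.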


#### 1200 iterations seem to give reasonable accuracy for comparing learning rates and numbers of neurons

### Test with a learning rate of 0.001:

In [4]:
for n in [20,50,100,150,200]:
    #initialize a class 
    myNN2 = twolayer_NN(X_train_norm, y_train_ohe, hidden_layer_nn = n, lr = 0.001)
    epoch_num = 1200 #num of iterations
    print('Model for %d neurons with learning rate of 0.001 and %d epochs:' % (n, epoch_num))
    for i in range(epoch_num):
        myNN2.feed_forward()
        myNN2.back_propagation()
        myNN2.cross_entropy_loss()
        current_loss = myNN2.loss
        if ((i+1)%300 == 0):
            print('epoch = %d, current loss = %.5f' % (i+1, myNN2.loss))

    # Validate trained model against test data:

    ypred = myNN2.predict(X_test_norm)
    print('Accuracy of our model ', accuracy(ypred, y_test.ravel()))

Model for 20 neurons with learning rate of 0.001 and 1200 epochs:
epoch = 300, current loss = 3513.90022
epoch = 600, current loss = 2954.01716
epoch = 900, current loss = 2729.03528
epoch = 1200, current loss = 2623.42821
Accuracy of our model  0.546
Model for 50 neurons with learning rate of 0.001 and 1200 epochs:
epoch = 300, current loss = 2257.91672
epoch = 600, current loss = 1723.27663
epoch = 900, current loss = 1502.87426
epoch = 1200, current loss = 1376.44580
Accuracy of our model  0.716
Model for 100 neurons with learning rate of 0.001 and 1200 epochs:
epoch = 300, current loss = 1295.58845
epoch = 600, current loss = 910.65087
epoch = 900, current loss = 720.35496
epoch = 1200, current loss = 595.17689
Accuracy of our model  0.824
Model for 150 neurons with learning rate of 0.001 and 1200 epochs:
epoch = 300, current loss = 951.22536
epoch = 600, current loss = 624.86180
epoch = 900, current loss = 458.22548
epoch = 1200, current loss = 348.33651
Accuracy of our model  0.8

### Test with a learning rate of 0.01:

In [5]:
for n in [20,50,100,150,200]:
    #initialize a class 
    myNN2 = twolayer_NN(X_train_norm, y_train_ohe, hidden_layer_nn = n, lr = 0.01)
    epoch_num = 1200 #num of iterations
    print('Model for %d neurons with learning rate of 0.01 and %d epochs:' % (n, epoch_num))
    for i in range(epoch_num):
        myNN2.feed_forward()
        myNN2.back_propagation()
        myNN2.cross_entropy_loss()
        current_loss = myNN2.loss
        if ((i+1)%100 == 0):
            print('epoch = %d, current loss = %.5f' % (i+1, myNN2.loss))

    # Validate trained model against test data:

    ypred = myNN2.predict(X_test_norm)
    print('Accuracy of our model ', accuracy(ypred, y_test.ravel()))

Model for 20 neurons with learning rate of 0.01 and 1200 epochs:
epoch = 100, current loss = 3268.34392
epoch = 200, current loss = 2868.33519
epoch = 300, current loss = 2639.81562
epoch = 400, current loss = 2453.46863
epoch = 500, current loss = 2384.48748
epoch = 600, current loss = 2238.85384
epoch = 700, current loss = 2207.90627
epoch = 800, current loss = 2124.08280
epoch = 900, current loss = 2084.17572
epoch = 1000, current loss = 2052.33707
epoch = 1100, current loss = 2030.98488
epoch = 1200, current loss = 2010.75838
Accuracy of our model  0.528
Model for 50 neurons with learning rate of 0.01 and 1200 epochs:
epoch = 100, current loss = 1682.46125
epoch = 200, current loss = 1183.09130
epoch = 300, current loss = 911.16276
epoch = 400, current loss = 714.70906
epoch = 500, current loss = 565.70943
epoch = 600, current loss = 452.00085
epoch = 700, current loss = 362.15363
epoch = 800, current loss = 292.65695
epoch = 900, current loss = 237.87700
epoch = 1000, current loss

### Test with a learning rate of 0.1:

In [6]:
for n in [20,50,100,150,200]:
    #initialize a class 
    myNN2 = twolayer_NN(X_train_norm, y_train_ohe, hidden_layer_nn = n, lr = 0.1)
    epoch_num = 1200 #num of iterations
    print('Model for %d neurons with learning rate of 0.1 and %d epochs:' % (n, epoch_num))
    for i in range(epoch_num):
        myNN2.feed_forward()
        myNN2.back_propagation()
        myNN2.cross_entropy_loss()
        current_loss = myNN2.loss
        if ((i+1)%100 == 0):
            print('epoch = %d, current loss = %.5f' % (i+1, myNN2.loss))

    # Validate trained model against test data:

    ypred = myNN2.predict(X_test_norm)
    print('Accuracy of our model ', accuracy(ypred, y_test.ravel()))

Model for 20 neurons with learning rate of 0.1 and 1200 epochs:
epoch = 100, current loss = 38065.08830
epoch = 200, current loss = 37897.81498
epoch = 300, current loss = 37956.62633
epoch = 400, current loss = 37361.22751
epoch = 500, current loss = 41860.23905
epoch = 600, current loss = 41123.61540
epoch = 700, current loss = 40617.59488
epoch = 800, current loss = 38523.54915
epoch = 900, current loss = 38828.04360
epoch = 1000, current loss = 38047.07759
epoch = 1100, current loss = 41560.27138
epoch = 1200, current loss = 37818.42181
Accuracy of our model  0.072
Model for 50 neurons with learning rate of 0.1 and 1200 epochs:
epoch = 100, current loss = 41804.34724
epoch = 200, current loss = 36768.42924
epoch = 300, current loss = 40657.66685
epoch = 400, current loss = 41207.46372
epoch = 500, current loss = 33279.35753
epoch = 600, current loss = 39366.02957
epoch = 700, current loss = 35666.38923
epoch = 800, current loss = 34866.54790
epoch = 900, current loss = 37813.16

### From the above tests, a learning rate of 0.01 with 1200 epochs and 200 neurons appears to produce the most accurate model so far. Below is one final test to see whether this changes with more neurons:

In [7]:
for n in [250, 300]:
    #initialize a class 
    myNN2 = twolayer_NN(X_train_norm, y_train_ohe, hidden_layer_nn = n, lr = 0.01)
    epoch_num = 1200 #num of iterations
    print('Model for %d neurons with learning rate of 0.01 and %d epochs:' % (n, epoch_num))
    for i in range(epoch_num):
        myNN2.feed_forward()
        myNN2.back_propagation()
        myNN2.cross_entropy_loss()
        current_loss = myNN2.loss
        if ((i+1)%600 == 0):
            print('epoch = %d, current loss = %.5f' % (i+1, myNN2.loss))

    # Validate trained model against test data:

    ypred = myNN2.predict(X_test_norm)
    print('Accuracy of our model ', accuracy(ypred, y_test.ravel()))

Model for 250 neurons with learning rate of 0.01 and 1200 epochs:
epoch = 600, current loss = 13.40052
epoch = 1200, current loss = 5.24166
Accuracy of our model  0.86
Model for 300 neurons with learning rate of 0.01 and 1200 epochs:
epoch = 600, current loss = 11.88643
epoch = 1200, current loss = 4.65041
Accuracy of our model  0.866


#### With 250 neurons, we lose some accuracy even though the loss is lower, which suggests overfitting. With 300 neurons, accuracy increases slightly and the loss decreases further. Given this, the model seems to perform best with roughly 200 or 300 neurons.
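
One way to probe the overfitting question is to compare accuracy on the training data with accuracy on the test data, since a large gap between the two is a common symptom. The sketch below is illustrative only: `train_test_gap` and `toy_predict` are hypothetical names, and the toy data stands in for the MNIST arrays.

```python
import numpy as np

def accuracy(ypred, yexact):
    # Fraction of predictions that match the true labels.
    return float(np.mean(ypred == yexact))

def train_test_gap(predict, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, gap) for a predict function."""
    train_acc = accuracy(predict(X_train), y_train)
    test_acc = accuracy(predict(X_test), y_test)
    return train_acc, test_acc, train_acc - test_acc

def toy_predict(X):
    # Toy stand-in for a trained model: class 1 if the first feature is positive.
    return (X[:, 0] > 0).astype(int)

X_tr = np.array([[1.0], [-1.0], [2.0], [-2.0]])
y_tr = np.array([1, 0, 1, 0])
X_te = np.array([[1.0], [-1.0], [3.0], [-3.0]])
y_te = np.array([1, 0, 0, 0])

train_acc, test_acc, gap = train_test_gap(toy_predict, X_tr, y_tr, X_te, y_te)
print(train_acc, test_acc, gap)
```

With the objects from this notebook, the same helper could presumably be called as `train_test_gap(myNN2.predict, X_train_norm, y_train.ravel(), X_test_norm, y_test.ravel())` after training each model.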