# Assignment #3
## P556: Applied Machine Learning

More often than not, we will use a deep learning library (Tensorflow, Pytorch, or the wrapper known as Keras) to implement our models. However, the abstraction afforded by those libraries can make it hard to troubleshoot issues if we don't understand what is going on under the hood. In this assignment you will implement a fully-connected and a convolutional neural network from scratch. To simplify the implementation, we are asking you to implement static architectures, but you are free to support variable number of layers/neurons/activations/optimizers/etc. We recommend that you make use of private methods so you can easily troubleshoot small parts of your model as you develop them, instead of trying to figure out which parts are not working correctly after implementing everything. Also, keep in mind that there is code from your fully-connected neural network that can be re-used on the CNN. 

Problem #1.1 (40 points): Implement a fully-connected neural network from scratch. The neural network will have the following architecture:

- Input layer
- Dense hidden layer with 512 neurons, using relu as the activation function
- Dropout with a value of 0.2
- Dense hidden layer with 512 neurons, using relu as the activation function
- Dropout with a value of 0.2
- Output layer, using softmax as the activation function

The model will use categorical crossentropy as its loss function. 
We will optimize the gradient descent using RMSProp, with a learning rate of 0.001 and a rho value of 0.9.
We will evaluate the model using accuracy.

Why this architecture? We are trying to reproduce from scratch the following [example from the Keras documentation](https://keras.io/examples/mnist_mlp/). This means that you can compare your results by running the Keras code provided above to see if you are on the right track.

In [0]:
# Reference for Working of Neural Network: Week 4&5 of Andrew Ng's 'Machine Learning' course on Coursera
# https://www.coursera.org/learn/machine-learning/home/welcome

class NeuralNetwork(object):
  def __init__(self,epochs,batch_size,learning_rate):
    self.epochs=epochs
    self.learning_rate=learning_rate
    self.batch_size=batch_size
  
  def fit(self,X_train,Y_train,neurons_each_layer,rho):
    # weights initialization
    # Reference for normalizing weights: https://www.freecodecamp.org/news/building-a-neural-network-from-scratch/
    wi=np.random.randn(neurons_each_layer[1],neurons_each_layer[0])*0.01
    wh1=np.random.randn(neurons_each_layer[2],neurons_each_layer[1])*0.01
    wh2=np.random.randn(neurons_each_layer[3],neurons_each_layer[2])*0.01

    # bias initialization
    bias_i=np.zeros((neurons_each_layer[1], 1))
    bias_h1=np.zeros((neurons_each_layer[2], 1))
    bias_h2=np.zeros((neurons_each_layer[3], 1))

    # RMSProp initialization
    vdwi,vdbi,vdwh1,vdbh1,vdwh2,vdbh2=float(0),float(0),float(0),float(0),float(0),float(0)

    # number_of_batches=int(X_train.shape[0]/self.batch_size)

    # Running Dataset through Epochs
    # Reference: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/
    print("Running 20 epochs...")
    for j in range(0,self.epochs):
      # print("epoch "+str(j))
      count=0  
      i=0
      # Shuffling samples in dataset
      # Reference: https://adventuresinmachinelearning.com/stochastic-gradient-descent/
      shuffled_indeces = np.random.permutation(X_train.shape[0])
      X_train=X_train[shuffled_indeces]
      Y_train=Y_train[shuffled_indeces] 

      # Iterations through batches begins 
      while(i<X_train.shape[0]):
        X = X_train[i:i+self.batch_size]
        Y = Y_train[i:i+self.batch_size]
        i=i+self.batch_size
        count=count+1

        # Forward Propogation with Dropout
        (A2,D2,A3,D3,Y_pred)=self.forward_prop_with_dropout(X,[wi,bias_i,wh1,bias_h1,wh2,bias_h2])

        # Back Propogation with Dropout
        n=X.shape[0]
        (dwh2,dbias_h2,dwh1,dbias_h1,dwi,dbias_i)=self.back_prop_with_dropout(n,X,Y,[A2,D2,A3,D3,Y_pred],[wi,bias_i,wh1,bias_h1,wh2,bias_h2])
    
        # Gradient Descent with RMSProp
        parameters1=[vdwi,vdbi,vdwh1,vdbh1,vdwh2,vdbh2]
        parameters2=[dwh2,dbias_h2,dwh1,dbias_h1,dwi,dbias_i]
        parameters3=[wi,bias_i,wh1,bias_h1,wh2,bias_h2]
        (vdwi,vdbi,vdwh1,vdbh1,vdwh2,vdbh2,wi,bias_i,wh1,bias_h1,wh2,bias_h2)=self.gradient_descent_with_RMSProp(parameters1,parameters2,parameters3,rho)

    return (wi,bias_i,wh1,bias_h1,wh2,bias_h2)


  # Forward Propogation with Dropout  
  # Reference for forward prop: https://www.freecodecamp.org/news/building-a-neural-network-from-scratch/
  # Reference for dropout: https://www.coursera.org/learn/deep-neural-network/lecture/eM33A/dropout-regularization
  def forward_prop_with_dropout(self,X,parameters):
    [wi,bias_i,wh1,bias_h1,wh2,bias_h2]=parameters
    
    A2=self.relu(np.transpose(X),wi,bias_i)
    D2,A2=self.AwithDropout(A2) 

    A3=self.relu(A2,wh1,bias_h1)
    D3,A3=self.AwithDropout(A3)

    Y_pred=self.softmax(A3,wh2,bias_h2)

    return (A2,D2,A3,D3,Y_pred)

  # Backward Propogation with Dropout 
  # References:
  # Formula for Gradient values: https://www.freecodecamp.org/news/building-a-neural-network-from-scratch/
  # Flow/Chain Rule of Backprop: https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html
  def back_prop_with_dropout(self,n,X,Y,parameters1,parameters2):
    [A2,D2,A3,D3,Y_pred]=parameters1
    [wi,bias_i,wh1,bias_h1,wh2,bias_h2]=parameters2

    # dY_pred=Y_pred-np.transpose(Y)
    # del3=np.multiply(dY_pred,self.softmax_backpropogation(Y_pred))

    # Derivative wrt to loss is just Y'-Y
    # Reference: https://www.coursera.org/learn/deep-neural-network/lecture/LCsCH/training-a-softmax-classifier
    del3=Y_pred-np.transpose(Y)
    dwh2=float(1)/float(n)*(np.dot(del3,np.transpose(A3)))
    dbias_h2=float(1)/float(n)*np.sum(del3,axis=1,keepdims=True)
    dA3=np.dot(np.transpose(wh2),del3)
    dA3=(dA3*D3)/0.8

    del2=np.multiply(dA3,self.relu_backprop(A3))
    dwh1=float(1)/float(n)*(np.dot(del2,np.transpose(A2)))
    dbias_h1=float(1)/float(n)*np.sum(del2,axis=1,keepdims=True)
    dA2=np.dot(np.transpose(wh1),del2)
    dA2=(dA2*D2)/0.8
    
    del1=np.multiply(dA2,self.relu_backprop(A2))
    dwi=float(1)/float(n)*(np.dot(del1,X))
    dbias_i=float(1)/float(n)*np.sum(del1,axis=1,keepdims=True)

    return (dwh2,dbias_h2,dwh1,dbias_h1,dwi,dbias_i)

  # Gradient Descent with RMSProp Optimizer
  # References:
  # Implementing RMSProp: https://towardsdatascience.com/a-look-at-gradient-descent-and-rmsprop-optimizers-f77d483ef08b
  # Changing value of learning rate with RMSProp: https://towardsdatascience.com/10-gradient-descent-optimisation-algorithms-86989510b5e9
  def gradient_descent_with_RMSProp(self,parameters1,parameters2,parameters3,rho):
    [vdwi,vdbi,vdwh1,vdbh1,vdwh2,vdbh2]=parameters1
    [dwh2,dbias_h2,dwh1,dbias_h1,dwi,dbias_i]=parameters2
    [wi,bias_i,wh1,bias_h1,wh2,bias_h2]=parameters3

    vdwi=(rho*vdwi) +((1-rho)*(dwi*dwi))
    vdbi=(rho*vdbi)+((1-rho)*(dbias_i*dbias_i))
    new_learning_rate_w=self.learning_rate/(np.sqrt(vdwi+0.00001))
    wi=wi-(new_learning_rate_w*dwi)
    new_learning_rate_b=self.learning_rate/(np.sqrt(vdbi+0.00001))
    bias_i=bias_i-(new_learning_rate_b*dbias_i)

    vdwh1=(rho*vdwh1)+((1-rho)*(dwh1*dwh1))
    vdbh1=(rho*vdbh1)+((1-rho)*(dbias_h1*dbias_h1))
    new_learning_rate_w=self.learning_rate/(np.sqrt(vdwh1+0.00001))
    wh1=wh1-(new_learning_rate_w*dwh1)
    new_learning_rate_b=self.learning_rate/(np.sqrt(vdbh1+0.00001))
    bias_h1=bias_h1-(new_learning_rate_b*dbias_h1)

    vdwh2=(rho*vdwh2)+((1-rho)*(dwh2*dwh2))
    vdbh2=(rho*vdbh2)+((1-rho)*(dbias_h2*dbias_h2))
    new_learning_rate_w=self.learning_rate/(np.sqrt(vdwh2+0.00001))
    wh2=wh2-(new_learning_rate_w*dwh2)
    new_learning_rate_b=self.learning_rate/(np.sqrt(vdbh2+0.00001))
    bias_h2=bias_h2-(new_learning_rate_b*dbias_h2)

    return (vdwi,vdbi,vdwh1,vdbh1,vdwh2,vdbh2,wi,bias_i,wh1,bias_h1,wh2,bias_h2)

  # References for ReLU and derivative of ReLU:
  # https://stats.stackexchange.com/questions/333394/what-is-the-derivative-of-the-relu-activation-function
  # https://www.coursera.org/lecture/neural-networks-deep-learning/derivatives-of-activation-functions-qcG1j
  def relu(self,A,W,BIAS):
    Z=np.dot(W,A)+BIAS
    # print("Z_relu "+str(Z.shape))
    return np.maximum(0,Z)

  def relu_backprop(self,A):
    return float(1)*(A>0)     
  
  # Softmax Calculation
  # Reference: https://stackoverflow.com/questions/54880369/implementation-of-softmax-function-returns-nan-for-high-inputs
  # https://www.coursera.org/learn/deep-neural-network/lecture/LCsCH/training-a-softmax-classifier
  def softmax(self,A,W,BIAS):
    Z=np.dot(W,A)+BIAS
    # print("Z_softmax "+str(Z.shape))
    E=np.exp(Z-np.max(Z))
    return E/np.sum(E, axis=0)

  # Dropout Optimization 
  # Reference for creating a dropout matrix:
  # https://gluon.mxnet.io/chapter03_deep-neural-networks/mlp-dropout-scratch.html
  # https://www.coursera.org/learn/deep-neural-network/lecture/eM33A/dropout-regularization
  def AwithDropout(self,A):
    Dropout_matrix=np.random.rand(A.shape[0], A.shape[1])<0.8
    # print(Dropout_matrix)
    return Dropout_matrix,(A*Dropout_matrix)/0.8

  # This function calls Forward Prop and returns accuracy and loss for the prediction as is returned by the evaluate function in Keras Documentation
  def evaluate(self,X,Y,parameters):
    [wi,bias_i,wh1,bias_h1,wh2,bias_h2]=parameters

    # Running one Forward Prop
    Y_pred=self.forward_prop_with_dropout(X,parameters)[4]

    # Calculating Accuracy for Prediction(Number of correct predictions/Number of samples*100)
    Y_pred=np.transpose(Y_pred)
    count=0
    for i in range(0,len(Y_pred)):
      # np.argmax reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html
      if np.argmax(Y_pred[i])==np.argmax(Y[i]):
        count=count+1
    # print("count "+str(count))   
    accuracy=float(count)/float(Y.shape[0])*100

    # Calculating Loss incurred
    # Reference: https://medium.com/towards-artificial-intelligence/logistic-regression-in-python-from-scratch-954c0196d258
    # https://medium.com/data-science-bootcamp/understand-cross-entropy-loss-in-minutes-9fb263caee9a
    Y_pred=np.transpose(Y_pred)
    Y=np.transpose(Y)
    # After using cross entropy for binary classification, I got the error "divide by zero encountered in log" 
    # Then from the above link, I referred the correct implementation of cross-entropy for multivariate classification
    cost_function=(-float(1)/float(Y.shape[1]))*np.sum(np.multiply(Y,np.log(Y_pred)))
    # accuracy=(cost_function/float(Y_TRAIN.shape[1]))*100
    return accuracy,cost_function

Problem #1.2 (10 points): Train your fully-connected neural network on the Fashion-MNIST dataset using 5-fold cross validation. Report accuracy on the folds, as well as on the test set.

In [4]:
# To simplify the usage of our dataset, we will be importing it from the Keras 
# library. Keras can be installed using pip: python -m pip install keras

# Original source for the dataset:
# https://github.com/zalandoresearch/fashion-mnist

# Reference to the Fashion-MNIST's Keras function: 
# https://keras.io/datasets/#fashion-mnist-database-of-fashion-articles

# Concepts and approaches have been discussed with Deepthi Raghu(draghu)

from keras.datasets import fashion_mnist
import keras.utils
import numpy as np
from sklearn.model_selection import KFold

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
# print(x_train.shape)
x_train = x_train.reshape(60000, 784)

x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)


N=NeuralNetwork(20,128,0.001)
# (wi,bias_i,wh1,bias_h1,wh2,bias_h2)=N.fit(x_train,y_train,[784,512,512,10],0.9)
# accuracy=N.evaluate(x_train,y_train,[wi,bias_i,wh1,bias_h1,wh2,bias_h2])
# print(accuracy)

# 5-Fold Cross Validation
# Reference:https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
KF=KFold(n_splits=5)
count=0
for i,j in KF.split(x_train):
  count=count+1
  X_TRAIN=x_train[i]
  X_VALIDATION=x_train[j]
  Y_TRAIN=y_train[i]
  Y_VALIDATION=y_train[j]
  print('\n')
  print('\n')
  print("Fit for fold "+str(count))
  (wi,bias_i,wh1,bias_h1,wh2,bias_h2)=N.fit(X_TRAIN,Y_TRAIN,[784,512,512,10],0.9)
  accuracy,cost_function=N.evaluate(X_TRAIN,Y_TRAIN,[wi,bias_i,wh1,bias_h1,wh2,bias_h2])
  print('\n')
  print("Loss for Train Set "+str(cost_function))
  print("Accuracy for Train Set "+str(accuracy))
  print('\n')
  accuracy,cost_function=N.evaluate(X_VALIDATION,Y_VALIDATION,[wi,bias_i,wh1,bias_h1,wh2,bias_h2])
  print("Loss for Validation Set "+str(cost_function))
  print("Accuracy for Validation Set "+str(accuracy))
  print('\n')
  accuracy,cost_function=N.evaluate(x_test,y_test,[wi,bias_i,wh1,bias_h1,wh2,bias_h2])
  print("Loss for Test Set "+str(cost_function))
  print("Accuracy for Test Set "+str(accuracy))
  print('\n')
  print('\n')

60000 train samples
10000 test samples




Fit for fold 1
Running 20 epochs...


Loss for Train Set 0.2527936119598102
Accuracy for Train Set 90.46041666666666


Loss for Validation Set 0.33706246278769797
Accuracy for Validation Set 87.73333333333333


Loss for Test Set 0.36287380329087143
Accuracy for Test Set 87.1








Fit for fold 2
Running 20 epochs...


Loss for Train Set 0.22756219574228012
Accuracy for Train Set 91.53333333333333


Loss for Validation Set 0.3244919396184544
Accuracy for Validation Set 88.63333333333333


Loss for Test Set 0.3572972424286358
Accuracy for Test Set 87.74








Fit for fold 3
Running 20 epochs...


Loss for Train Set 0.22200405528878206
Accuracy for Train Set 91.62916666666666


Loss for Validation Set 0.30218567727123735
Accuracy for Validation Set 89.3


Loss for Test Set 0.33708962821009386
Accuracy for Test Set 88.14999999999999








Fit for fold 4
Running 20 epochs...


Loss for Train Set 0.22088261353095642
Accuracy for Train Set 91.7

Problem #2.1 (40 points): Implement a Convolutional Neural Network from scratch. Similarly to problem 1.1, we will be implementing the same architecture as the one shown in [Keras' CNN documentation](https://keras.io/examples/mnist_cnn/). That is:

- Input layer
- Convolutional hidden layer with 32 neurons, a kernel size of (3,3), and relu activation function
- Convolutional hidden layer with 64 neurons, a kernel size of (3,3), and relu activation function
- Maxpooling with a pool size of (2,2)
- Dropout with a value of 0.25
- Flatten layer
- Dense hidden layer, with 128 neurons, and relu activation function
- Dropout with a value of 0.5
- Output layer, using softmax as the activation function

Our loss function is categorical crossentropy and the evaluation will be done using accuracy, as in Problem 1.1. However, we will not be using the gradient optimizer known as Adadelta.

In [0]:
class ConvolutionalNeuralNetwork(object):
  def __init__(epochs, learning_rate):
    pass
  
  def fit(self):
    pass
  
  def evaluate(self):
    pass

Problem #2.2 (10 points): Train your convolutional neural network on the Fashion-MNIST dataset using 5-fold cross validation. Report accuracy on the folds, as well as on the test set.

In [0]:
import keras
from keras.datasets import fashion_mnist

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)