
# Assignment #3
## P556: Applied Machine Learning

More often than not, we will use a deep learning library (TensorFlow, PyTorch, or the wrapper known as Keras) to implement our models. However, the abstraction afforded by those libraries can make it hard to troubleshoot issues if we don't understand what is going on under the hood. In this assignment you will implement a fully-connected and a convolutional neural network from scratch. To simplify the implementation, we are asking you to implement static architectures, but you are free to support a variable number of layers, neurons, activations, optimizers, etc. We recommend that you make use of private methods so you can troubleshoot small parts of your model as you develop them, instead of trying to figure out which parts are not working after implementing everything. Also, keep in mind that some code from your fully-connected neural network can be reused in the CNN.

Problem #1.1 (40 points): Implement a fully-connected neural network from scratch. The neural network will have the following architecture:

- Input layer
- Dense hidden layer with 512 neurons, using relu as the activation function
- Dropout with a value of 0.2
- Dense hidden layer with 512 neurons, using relu as the activation function
- Dropout with a value of 0.2
- Output layer, using softmax as the activation function
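
As a quick sanity check on the architecture above, the number of trainable parameters can be counted by hand. The sketch below assumes 784 inputs (flattened 28x28 images) and 10 output classes; `count_parameters` is a hypothetical helper, not part of the assignment. Dropout layers add no parameters.

```python
# Layer widths: 784 inputs (assumed 28x28 images), two hidden layers of 512, 10 classes.
layer_sizes = [784, 512, 512, 10]

def count_parameters(sizes):
    """Return the (weights + biases) count of each dense layer in the stack."""
    return [n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:])]

per_layer = count_parameters(layer_sizes)
total = sum(per_layer)
print(per_layer, total)
```

If a from-scratch implementation ends up with a different total, a weight or bias shape is probably off.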

The model will use categorical crossentropy as its loss function. 
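
To make the loss concrete, here is a minimal sketch of softmax followed by categorical crossentropy. It assumes one sample per column (shape `(classes, samples)`) and one-hot labels; the function names and the `eps` guard are illustrative choices, not requirements of the assignment.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; each column of z holds the logits of one sample."""
    shifted = z - np.max(z, axis=0, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over samples; eps (assumed) keeps log finite."""
    m = y_true.shape[1]
    return -np.sum(y_true * np.log(y_pred + eps)) / m

# Two samples, three classes: one confident prediction and one uniform prediction.
logits = np.array([[2.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 0.0]])
labels = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])
probs = softmax(logits)
loss = categorical_crossentropy(labels, probs)
print(probs.sum(axis=0), loss)
```

Only the predicted probability of the true class contributes to the loss, so the uniform prediction in the second column costs `log(3)` on its own.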
We will optimize the gradient descent using RMSProp, with a learning rate of 0.001 and a rho value of 0.9.
We will evaluate the model using accuracy.
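
The RMSProp update mentioned above can be sketched for a single parameter array as follows. The learning rate and rho come from the problem statement; `rmsprop_step` and the `eps` value are assumptions made for illustration.

```python
import numpy as np

def rmsprop_step(param, grad, cache, learning_rate=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update; cache is the running average of squared gradients."""
    cache = rho * cache + (1.0 - rho) * grad ** 2
    param = param - learning_rate * grad / (np.sqrt(cache) + eps)
    return param, cache

w = np.array([1.0, -2.0])
cache = np.zeros_like(w)          # the running average starts at zero
grad = np.array([0.5, -4.0])      # gradients of very different magnitude
w, cache = rmsprop_step(w, grad, cache)
print(w, cache)
```

Because each gradient is divided by the root of its own running average, both entries move by roughly the same amount on this first step even though their gradients differ by a factor of eight.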

Why this architecture? We are trying to reproduce from scratch the following [example from the Keras documentation](https://keras.io/examples/mnist_mlp/). This means that you can compare your results by running the Keras code provided above to see if you are on the right track.

In [0]:
import numpy as np
import math
from sklearn.model_selection import KFold

# For this assignment I have watched the following 2 courses on Coursera,
# https://www.coursera.org/learn/neural-networks-deep-learning/
# https://www.coursera.org/learn/deep-neural-network
# The theory, notation and vectorized implementations used in this assignment are from these courses
# The reference links in this assignment mostly point to videos from these two courses


class NeuralNetwork(object):
  """
  forward propagation:
  input layer is a row in x_train; weights are initialized uniformly in [-0.01, 0.01], biases at zero
  2 dense hidden layers, each with 512 neurons, ReLU activation and a dropout rate of 0.2
  output layer: softmax, compared against the one-hot y_train

  backward propagation:
  loss function: categorical cross-entropy
  gradient descent: optimized using RMSProp, learning rate 0.001, rho 0.9

  model evaluation:
  accuracy
  """

  '''
  Hyperparameters:
  learning rate - 0.001
  #iterations- 
  #hidden layers
  #hidden units
  choice of activation function
  dropout - 0.2
  '''

  def __init__(self, epochs, batch_size, drop_out):
    self.epochs = epochs
    self.learning_rate = 0.001
    self.rho = 0.9
    self.batch_size = batch_size
    self.keep_prob = 1 - drop_out
    self.Sdw1, self.Sdb1, self.Sdw2, self.Sdb2, self.Sdw3, self.Sdb3 = 0, 0, 0, 0, 0, 0
    
# Initializes Parameters of the model and returns the initialized parameters
  def initialize_parameters(self, layers):
    # Initializing the weights W1, W2, W3 and biases b1, b2, b3 for the hidden layer 1, hidden layer 2 and the Output layer
    # layer dimensions are 784 (layer 0 or input layer), 512 (hidden layer 1), 512 (hidden layer 2), 10(output layer or number of classes)
    
    # Previously initialized the Weights this way, resulted in overflow in softmax activation
    # so then multiplied it by 0.01 to make the values closer to zero as suggested by Andrew Ng in one of his videos in the above listed courses
    # W1 = np.random.uniform(-1, 1, layers[1] * layers[0]) 
    # W1 = W1.reshape(layers[1], layers[0])
    # W2 = np.random.uniform(-1, 1, layers[2] * layers[1]) 
    # W2 = W2.reshape(layers[2], layers[1])
    # W3 = np.random.uniform(-1, 1, layers[3] * layers[2])
    # W3 = W3.reshape(layers[3], layers[2])

    # Wl is of shape - (layer[l], layer[l-1])
    # W1 - (512, 784), W2 - (512, 512), W3 - (10, 512)
    W1 = np.random.uniform(-1, 1, layers[1] * layers[0]) * 0.01
    W1 = W1.reshape(layers[1], layers[0])
    W2 = np.random.uniform(-1, 1, layers[2] * layers[1]) * 0.01
    W2 = W2.reshape(layers[2], layers[1])
    W3 = np.random.uniform(-1, 1, layers[3] * layers[2]) * 0.01
    W3 = W3.reshape(layers[3], layers[2])
    
    # bl is of shape - (layer[l], 1)
    # b1 - (512, 1), b2 - (512, 1), b3 - (10, 1)
    b1 = np.zeros((layers[1], 1))
    b2 = np.zeros((layers[2], 1))
    b3 = np.zeros((layers[3], 1))

    return W1, W2, W3, b1, b2, b3


# For forward_dropout and backward_dropout
# Referred Coursera video: 'Drop out regularization' in Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (Week 1)
# https://www.coursera.org/learn/deep-neural-network/lecture/eM33A/dropout-regularization
# Dropout regularization using technique called inverted dropout
  def forward_dropout(self, A):
    D = np.random.rand(A.shape[0], A.shape[1]) < self.keep_prob
    A = np.multiply(A, D)
    A /= self.keep_prob
    return A, D

  def backward_dropout(self, W, dZ, D):
    dA = np.dot(W.T, dZ)
    dA = np.multiply(dA, D)
    dA /= self.keep_prob
    return dA

# For implementation of RMSProp
# Ref. https://www.coursera.org/learn/deep-neural-network/lecture/BhJlm/rmsprop
# Returns updated values of weight and bias
  def RMSProp(self, W, b, dW, db, Sdw, Sdb):
    # With epsilon value 10 ^ -8 as said by Andrew Ng in his video, accuracy was low
    # Try different values of epsilon: 10 ^ -3, 10 ^ -4
    epsilon = math.pow(10, -4)
    Sdw = self.rho * Sdw + (1 - self.rho) * np.multiply(dW, dW)
    Sdb = self.rho * Sdb + (1 - self.rho) * np.multiply(db, db)
    W = W - self.learning_rate * np.divide(dW, np.power(Sdw, 0.5) + epsilon)
    b = b - self.learning_rate * np.divide(db, np.power(Sdb, 0.5) + epsilon)
    return W, b, Sdw, Sdb

# For softmax activation, derivative and cost function
# Ref. https://www.coursera.org/learn/deep-neural-network/lecture/LCsCH/training-a-softmax-classifier
# https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/categorical-crossentropy
# Returns the cost using categorical cross entropy function for softmax
  def Cost(self, Y_dash, Y):
    m = Y.shape[1]
    # A small constant keeps np.log finite if a predicted probability underflows to 0
    cost = (-1/m) * np.sum(np.multiply(Y, np.log(Y_dash + 1e-12)))
    return cost

# For Forward Propagation function and Back Propagation function 
# Referred the vectorized implementation from the following resources
# https://www.coursera.org/learn/neural-networks-deep-learning/lecture/Wh8NI/gradient-descent-for-neural-networks
# returns predicted Y, weights, biases and activations

  def forward_propagation(self, X, W1, W2, W3, b1, b2, b3):
    ### Hidden Layer 1, 512 neurons, Relu activation function, dropout = 0.2 
    Z1 = np.dot(W1, X) + b1
    A1 = np.maximum(Z1, 0)  #RELU activation
    A1, D1 = self.forward_dropout(A1) #Dropout

    ### Hidden Layer 2, 512 neurons, Relu activation function, dropout = 0.2
    Z2 = np.dot(W2, A1) + b2
    A2 = np.maximum(Z2, 0)  # RELU activation
    A2, D2 = self.forward_dropout(A2) #Dropout

    ### Output layer, 10 classes, Softmax activation function
    Z3 = np.dot(W3, A2) + b3
    # Softmax activation
    # A3 = np.exp(Z3) / np.sum(np.exp(Z3), axis=0) gives overflow error
    # Ref. for error: https://stackoverflow.com/questions/54880369/implementation-of-softmax-function-returns-nan-for-high-inputs
    # Subtract each column's own maximum so every sample keeps at least one exp(0) = 1 term
    f = np.exp(Z3 - np.max(Z3, axis=0, keepdims=True))
    A3 = f / np.sum(f, axis=0)

    Y_dash = A3
    activations = (A1, A2, A3)
    weights = (W1, W2, W3)
    biases = (b1, b2, b3)
    D = (D1, D2)

    return Y_dash, activations, weights, biases, D

  def backward_propagation(self, X, Y, activations, weights, biases, D):    
    m = X.shape[1]
    (A1, A2, A3) = activations
    (W1, W2, W3) = weights
    (b1, b2, b3) = biases
    (D1, D2) = D
    
    ### Output layer, softmax derivative
    dZ3 = A3 - Y # Softmax derivative
    dW3 = (1 / m) * np.dot(dZ3, A2.T)
    db3 = (1 / m) * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = self.backward_dropout(W3, dZ3, D2) # Dropout

    ### Hidden layer, Relu derivative
    # RELU derivative g'(z) = 0 if z < 0, g'(z) = 1 if z > 0
    dZ2 = np.multiply(dA2, A2 > 0) # RELU derivative
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    
    dA1 = self.backward_dropout(W2, dZ2, D1) # Dropout

    # Hidden layer, Relu derivative
    dZ1 = np.multiply(dA1, A1 > 0) # RELU derivative
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    return dW3, dW2, dW1, db3, db2, db1

# Fit function fits the passed train set, performing mini batch gradient descent
# The Keras implementation makes use of 20 epochs and batches of size 128
# Mini batch gradient descent: https://www.youtube.com/watch?v=4qJaSmvhxi8
# Returns weights and biases of the trained model
  def fit(self, X, Y, layers):
    W1, W2, W3, b1, b2, b3 = self.initialize_parameters(layers)
    for j in range(0, self.epochs):
      print("epoch :", j+1)
      # Creating batches of size 128, 375 batches for the 48000 samples
      for i in range(0, X.shape[0], self.batch_size):
        # forward propagation
        Y_dash, activations, weights, biases, D = self.forward_propagation(X[i:i+self.batch_size].T, W1, W2, W3, b1, b2, b3)
        # Compute Cost
        # cost = self.Cost(Y_dash, Y[i:i+self.batch_size].T)
        # print("cost:",cost)
        # Backward propagation
        dW3, dW2, dW1, db3, db2, db1 = self.backward_propagation(X[i:i+self.batch_size].T, Y[i:i+self.batch_size].T, activations, weights, biases, D)
        # Using RMSProp to optimize gradient descent and update weights and biases
        W3, b3, self.Sdw3, self.Sdb3 = self.RMSProp(W3, b3, dW3, db3, self.Sdw3, self.Sdb3)
        W2, b2, self.Sdw2, self.Sdb2 = self.RMSProp(W2, b2, dW2, db2, self.Sdw2, self.Sdb2)
        W1, b1, self.Sdw1, self.Sdb1 = self.RMSProp(W1, b1, dW1, db1, self.Sdw1, self.Sdb1)
      # Print the cost of the last mini-batch of this epoch (computed with dropout active)
      cost = self.Cost(Y_dash, Y[i:i+self.batch_size].T)
      print("cost:",cost)    

    return W1, W2, W3, b1, b2, b3

# Predict function with activations and no dropout, returns the vector of predicted values
  def predict(self, X, W1, W2, W3, b1, b2, b3):
    ### Hidden Layer 1, 512 neurons, Relu activation function
    Z1 = np.dot(W1, X) + b1
    A1 = np.maximum(Z1, 0)  #RELU activation

    ### Hidden Layer 2, 512 neurons, Relu activation function
    Z2 = np.dot(W2, A1) + b2
    A2 = np.maximum(Z2, 0)  # RELU activation

    ### Output layer, 10 classes, Softmax activation function
    Z3 = np.dot(W3, A2) + b3
    # Softmax activation
    # A3 = np.exp(Z3) / np.sum(np.exp(Z3), axis=0) gives overflow error
    # Ref. for error: https://stackoverflow.com/questions/54880369/implementation-of-softmax-function-returns-nan-for-high-inputs
    # Subtract each column's own maximum so every sample keeps at least one exp(0) = 1 term
    f = np.exp(Z3 - np.max(Z3, axis=0, keepdims=True))
    A3 = f / np.sum(f, axis=0)

    return A3

  # Evaluation function for the Neural Network, evaluates using the Accuracy metric
  def evaluate(self, X, Y, W1, W2, W3, b1, b2, b3):
    Y_dash = self.predict(X.T, W1, W2, W3, b1, b2, b3)
    Y_dash = np.argmax(Y_dash, axis = 0)
    Y = np.argmax(Y, axis = 1)
    # Accuracy = Number of correct predictions/ Total number of samples 
    # Ref. for getting a count of same elements in 2 numpy arrays: 
    # https://stackoverflow.com/questions/25490641/check-how-many-elements-are-equal-in-two-numpy-arrays-python/25490691
    accuracy = np.sum(Y_dash == Y)/Y.shape[0]
    return accuracy
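
The inverted dropout used in `forward_dropout` can be checked in isolation. The sketch below is a standalone illustration with assumed array sizes and a fixed seed: surviving activations are scaled by `1 / keep_prob`, so the mean activation stays close to its value without dropout.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, for reproducibility only
keep_prob = 0.8                           # corresponds to a dropout rate of 0.2
activations = np.ones((512, 1000))        # dummy activations, all equal to 1

mask = rng.random(activations.shape) < keep_prob   # True where a unit is kept
dropped = activations * mask / keep_prob           # rescale the survivors

print(mask.mean(), dropped.mean())
```

This rescaling is why `predict` can skip dropout entirely without changing the expected scale of the activations.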



Problem #1.2 (10 points): Train your fully-connected neural network on the Fashion-MNIST dataset using 5-fold cross validation. Report accuracy on the folds, as well as on the test set.
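
For orientation, 5-fold cross validation splits the 60,000 training images into five folds; each fold is held out once while the model trains on the other four. The sketch below uses plain NumPy and a hypothetical helper `contiguous_folds` with unshuffled, contiguous folds. It only illustrates the split sizes and is not a replacement for `sklearn.model_selection.KFold`.

```python
import numpy as np

def contiguous_folds(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs using contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    for val_idx in np.array_split(indices, n_splits):
        train_idx = np.setdiff1d(indices, val_idx)  # everything not held out
        yield train_idx, val_idx

fold_sizes = [(len(tr), len(va)) for tr, va in contiguous_folds(60000)]
print(fold_sizes)
```

With a batch size of 128, the 48,000 training samples of each fold give 375 mini-batches per epoch.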

In [0]:
# To simplify the usage of our dataset, we will be importing it from the Keras 
# library. Keras can be installed using pip: python -m pip install keras

# Original source for the dataset:
# https://github.com/zalandoresearch/fashion-mnist

# Reference to the Fashion-MNIST's Keras function: 
# https://keras.io/datasets/#fashion-mnist-database-of-fashion-articles

from keras.datasets import fashion_mnist
from keras.datasets import mnist
import keras.utils

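# NOTE: this run loads MNIST, to compare against the Keras example. For the Fashion-MNIST
# results asked for in Problem #1.2, use fashion_mnist.load_data() instead.
(x_train, y_train), (x_test, y_test) = mnist.load_data()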

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

print(x_train.shape[0], 'Rows: train samples')
print(x_train.shape[1], 'Columns: train samples')
print(x_test.shape[0], 'Rows: test samples')
print(x_test.shape[1], 'Columns: test samples')

# convert class vectors to binary class matrices
num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Dimensions of layer: Number of inputs and neurons in each layer of neural Network
layers = [ 784, 512, 512, 10]

# K- fold validation from the official documentation
# Referred the example here for working of K-fold- 
# https://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html#sphx-glr-auto-examples-exercises-plot-cv-diabetes-py
kf = KFold(n_splits=5)
fold_train_accuracy = []
fold_test_accuracy = []
test_dataset_accuracy = []

# For each fold: fit the model on the training folds, then evaluate it on the held-out fold
# (printed as "Test accuracy") and on the separate test dataset
for i , (train_index, test_index) in enumerate(kf.split(x_train, y_train)):
  print("Fold:", i+1)
  nn = NeuralNetwork(epochs = 20, batch_size = 128, drop_out = 0.2)
  W1, W2, W3, b1, b2, b3 = nn.fit(x_train[train_index], y_train[train_index], layers)

  train_accuracy = nn.evaluate(x_train[train_index], y_train[train_index], W1, W2, W3, b1, b2, b3)
  print("For fold:", i+1)
  print("Train accuracy:",train_accuracy)
  fold_train_accuracy.append(train_accuracy)

  test_accuracy = nn.evaluate(x_train[test_index], y_train[test_index], W1, W2, W3, b1, b2, b3)
  print("Test accuracy:",test_accuracy)
  fold_test_accuracy.append(test_accuracy)

  dataset_accuracy = nn.evaluate(x_test, y_test, W1, W2, W3, b1, b2, b3)
  print("Accuracy on the Test set", dataset_accuracy)
  test_dataset_accuracy.append(dataset_accuracy)

print("5-fold cross validation")
print("The overall (average) training accuracy", sum(fold_train_accuracy)/len(fold_train_accuracy))
print("The overall(average) testing accuracy", sum(fold_test_accuracy)/len(fold_test_accuracy))
print("The overall(average) accuracy on the test dataset", sum(test_dataset_accuracy)/len(test_dataset_accuracy))

print("fitting the entire train set to the model and testing on the given test dataset")
nn = NeuralNetwork(epochs = 20, batch_size = 128, drop_out = 0.2)

W1, W2, W3, b1, b2, b3 = nn.fit(x_train, y_train, layers)
accuracy = nn.evaluate(x_train, y_train, W1, W2, W3, b1, b2, b3)
print("Train set accuracy", accuracy)

accuracy = nn.evaluate(x_test, y_test, W1, W2, W3, b1, b2, b3)
print("Test set accuracy", accuracy)
"""
Comparing results with the Keras implementation:
The link included in the question is for keras implementation on the mnist dataset
I executed the code for keras implementation to get the output on mnist dataset:
Test accuracy: 0.9821

Results of this implementation on Mnist dataset:
The overall (average) training accuracy 0.9980958333333334
The overall(average) testing accuracy 0.9784833333333334
The overall(average) accuracy on the test dataset 0.9804999999999999
Train set accuracy 0.9985833333333334
Test set accuracy 0.9845

I executed the Keras implementation on the fashion-mnist dataset and got a test accuracy of 0.8742

Results of this implementation on Fashion mnist dataset:

5-fold cross validation
The overall (average) training accuracy 0.9220083333333333
The overall(average) testing accuracy 0.8908833333333334
The overall(average) accuracy on the test dataset 0.8828400000000001
fitting the entire train set to the model and testing on the given test dataset
Train set accuracy 0.9236
Test set accuracy 0.8859
"""



Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
60000 Rows: train samples
784 Columns: train samples
10000 Rows: test samples
784 Columns: test samples
Fold: 1
epoch : 1
cost: 0.207351412885461
epoch : 2
cost: 0.19364492279156803
epoch : 3
cost: 0.1877477439888224
epoch : 4
cost: 0.17659390155531832
epoch : 5
cost: 0.12820119806586872
epoch : 6
cost: 0.1578474904465153
epoch : 7
cost: 0.1666165928596442
epoch : 8
cost: 0.1790876128865162
epoch : 9
cost: 0.12590151816571862
epoch : 10
cost: 0.15956686059747738
epoch : 11
cost: 0.14291726518899317
epoch : 12
cost: 0.12079236107607953
epoch : 13
cost: 0.08654107598329869
epoch : 14
cost: 0.08820797107630336
epoch : 15
cost: 0.06289329547388212
epoch : 16
cost: 0.036908168556412116
epoch : 17
cost: 0.1042599603229974
epoch : 18
cost: 0.04801526708002827
epoch : 19
cost: 0.07148436927748822
epoch : 20
cost: 0.09138009489514881
For fold: 1
Train accuracy: 0.9981458333333333
Test accuracy: 0.98025
Accuracy on the Test se
