
# CS6910 - Fundamentals of Deep Learning - Assignment 1


##  Submitted by: NS24Z066 - LIKHITH KUMARA



## Instructions

The goal of this assignment is twofold:

(i) implement and use gradient descent (and its variants) with backpropagation for a classification task

(ii) get familiar with Wandb, which is a cool tool for running and keeping track of a large number of experiments

- This is an individual assignment and no groups are allowed.
- Collaborations and discussions with other students are strictly prohibited.
- You must use Python (NumPy and Pandas) for your implementation.
- You cannot use the following packages from Keras, PyTorch, Tensorflow: optimizers, layers
- If you are using any packages from Keras, PyTorch, Tensorflow then post on Moodle first to check with the instructor.
- You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots that we have asked for below can be (automatically) generated using the APIs provided by wandb.ai. You will upload a link to this report on Gradescope.
- You also need to provide a link to your GitHub code as shown below. Follow good software engineering practices and set up a GitHub repo for the project on Day 1. Please do not write all code on your local machine and push everything to GitHub on the last day. The commits in GitHub should reflect how the code has evolved during the course of the assignment.
- You have to check Moodle regularly for updates regarding the assignment.

## Problem Statement
In this assignment you need to implement a feedforward neural network and write the backpropagation code for training the network. We strongly recommend using numpy for all matrix/vector operations. You are not allowed to use any automatic differentiation packages. This network will be trained and tested using the Fashion-MNIST dataset. Specifically, given an input image (28 x 28 = 784 pixels) from the Fashion-MNIST dataset, the network will be trained to classify the image into 1 of 10 classes.


In [6]:
!pip install -qq -U wandb
# Log in to your W&B account
import wandb
from google.colab import userdata
wandb.login(key = userdata.get('WANDB_API_KEY'), verify = True)

run = wandb.init(
      # Set the project where this run will be logged
      project="CS6910-Assignment-1"
      )





Run history:
accuracy: ▁▄▅▇▇█
epoch: ▁▂▄▅▇█
loss: █▅▄▃▂▁
val_accuracy: ▁▄▆▆▇█
val_loss: █▅▃▂▂▁

Run summary:
accuracy: 0.864
epoch: 5.0
loss: 22015.68
val_accuracy: 0.851
val_loss: 2631.55




## Question 1 (2 Marks)
Download the fashion-MNIST dataset and plot 1 sample image for each class as shown in the grid below. Use `from keras.datasets import fashion_mnist` for getting the fashion mnist dataset.


In [7]:
from keras.datasets import fashion_mnist
import numpy as np

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
fmnist_labels = {
0 :	'T-shirt/top',
1 :	'Trouser',
2 :	'Pullover',
3 :	'Dress',
4 :	'Coat',
5 :	'Sandal',
6 :	'Shirt',
7 :	'Sneaker',
8 :	'Bag',
9 :	'Ankle boot'}


# Pre-processing the data for training (the dataset is already loaded above)

# One hot encoding the output parameters, since this is a classification problem, in case of regression this is not needed
one_hot_label = np.zeros([y_train.shape[0], len(np.unique(y_train)), 1], dtype=int)
for index, item in enumerate(y_train):
  one_hot_label[index, item] = [1]

one_hot_label_test = np.zeros([y_test.shape[0], len(np.unique(y_test)), 1], dtype=int)
for index, item in enumerate(y_test):
  one_hot_label_test[index, item] = [1]

# Flattening each image into shape (784, 1) as Neural network accepts only 1d series of data and then normalizing it with max value
x_train_flattened = x_train.reshape(-1, 784, 1)/np.max(x_train)
x_test_flattened = x_test.reshape(-1, 784, 1)/np.max(x_test)

# It is given that 90% of the dataset is to be considered for training, while 10%  for validation
train_records_count = int(len(x_train)*0.9)
train_data={'inputs':x_train_flattened[:train_records_count], 'labels':one_hot_label[:train_records_count]}
val_data={'inputs':x_train_flattened[train_records_count:], 'labels':one_hot_label[train_records_count:]}
test_data={'inputs':x_test_flattened, 'labels': one_hot_label_test}
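
The preprocessing above one-hot encodes labels with a Python loop and takes the first 90% of the training set in its stored order. As a minimal sketch, assuming the same `(N, classes, 1)` label layout, the encoding can be vectorised with `np.eye`, and the validation split can be drawn after a seeded shuffle. The helper names `one_hot` and `shuffled_split` are hypothetical and are not part of `NeuralNetwork.py`; the tiny arrays stand in for the real dataset so the block runs on its own.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an array of shape (N, num_classes, 1) with a single 1 per sample."""
    encoded = np.eye(num_classes, dtype=int)[labels]   # (N, num_classes)
    return encoded[..., np.newaxis]                     # (N, num_classes, 1)

def shuffled_split(inputs, labels, val_fraction=0.1, seed=0):
    """Shuffle the sample order once, then hold out val_fraction for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(inputs))
    n_train = len(inputs) - int(len(inputs) * val_fraction)
    train_idx, val_idx = order[:n_train], order[n_train:]
    train = {'inputs': inputs[train_idx], 'labels': labels[train_idx]}
    val = {'inputs': inputs[val_idx], 'labels': labels[val_idx]}
    return train, val

# Tiny stand-in data (hypothetical) so the sketch runs without downloading anything
demo_y = np.array([0, 2, 1, 2, 0, 1, 2, 0, 1, 0])
demo_x = np.arange(10 * 4, dtype=float).reshape(10, 4, 1)
demo_labels = one_hot(demo_y, num_classes=3)
demo_train, demo_val = shuffled_split(demo_x, demo_labels, val_fraction=0.2)
print(demo_labels.shape, demo_train['inputs'].shape, demo_val['inputs'].shape)
```

With the real data, the same calls would take `y_train`, `x_train_flattened` and `val_fraction=0.1`; whether a shuffled split changes the reported accuracies has not been checked here.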

In [3]:
# Logging 10 random sets of samples, one image per unique label, from the Keras Fashion-MNIST dataset
for _ in range(10):
  # Each class has 6000 training images, so this index is valid for every class
  rand_int = np.random.randint(6000)
  samples = []
  for i in np.unique(y_train):
    index = np.where(y_train == i)[0][rand_int]
    pixels = x_train[index]
    image = wandb.Image(pixels, caption=f"{fmnist_labels[y_train[index]]}")
    samples.append(image)

  run.log({"Random Samples 10 Labels": samples})


## Question 2 (10 Marks)
Implement a feedforward neural network which takes images from the fashion-mnist data as input and outputs a probability distribution over the 10 classes.
Your code should be flexible such that it is easy to change the number of hidden layers and the number of neurons in each hidden layer.
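
To make the requirement concrete, here is a minimal NumPy sketch of a forward pass whose depth and width are set by a list of layer sizes and whose output is a softmax distribution. It is an illustration only: the function names `init_params` and `forward`, the Glorot-uniform initialisation and the layer sizes are assumptions, and the actual implementation in `NeuralNetwork.py` may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a column vector."""
    exp = np.exp(z - np.max(z))
    return exp / np.sum(exp)

def init_params(layer_sizes, seed=0):
    """Glorot-uniform weights and zero biases for each consecutive layer pair."""
    rng = np.random.default_rng(seed)
    params = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        W = rng.uniform(-limit, limit, size=(fan_out, fan_in))
        b = np.zeros((fan_out, 1))
        params.append((W, b))
    return params

def forward(x, params, activation=np.tanh):
    """Apply the hidden layers with the given activation, then a softmax output layer."""
    a = x
    for W, b in params[:-1]:
        a = activation(W @ a + b)
    W, b = params[-1]
    return softmax(W @ a + b)

# 784 inputs, two hidden layers of 128 neurons, 10 output classes (sizes are illustrative)
sizes = [784, 128, 128, 10]
params = init_params(sizes)
x = np.random.default_rng(1).random((784, 1))   # stand-in for one flattened image
probs = forward(x, params)
print(probs.shape, float(probs.sum()))
```

Changing the number of hidden layers or neurons only requires editing the `sizes` list, which is the kind of flexibility the question asks for.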


In [4]:
# Sample of how to call the NeuralNetwork class; please check NeuralNetwork.py for the complete implementation
from NeuralNetwork import NeuralNetwork

layers = [{'num_neurons': 128, 'activation': 'tanh'}] * 5
nn = NeuralNetwork(input_dim=x_train_flattened.shape[1], output_dim=one_hot_label.shape[1], nn_archtre=layers,
                   last_layer_activation='softmax', weight_initializer='xavier')

# Predicting a class with the freshly initialised network; the weights are untrained, so the prediction is not expected to be accurate
print('Predicted Class:', np.round(nn.feed_forward(x_train_flattened[1]), 2).tolist(), fmnist_labels[np.argmax(nn.feed_forward(x_train_flattened[1]))])

#Actual Class
print('Actual Class:', one_hot_label[1].tolist(), fmnist_labels[np.argmax(one_hot_label[1])])


Predicted Class: [[0.02], [0.01], [0.01], [0.15], [0.0], [0.0], [0.01], [0.06], [0.02], [0.71]] Ankle boot
Actual Class: [[1], [0], [0], [0], [0], [0], [0], [0], [0], [0]] T-shirt/top



## Question 3 (24 Marks)
Implement the backpropagation algorithm with support for the following optimisation functions:

- sgd
- momentum based gradient descent
- nesterov accelerated gradient descent
- rmsprop
- adam
- nadam

(12 marks for the backpropagation framework and 2 marks for each of the optimisation algorithms above)

We will check the code for implementation and ease of use (e.g., how easy it is to add a new optimisation algorithm such as Eve). Note that the code should be flexible enough to work with different batch sizes.
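
One way to keep new optimisers easy to add is a registry that maps a name to an update function with a shared signature. The sketch below shows that idea for three of the listed rules on a toy quadratic. The names `OPTIMIZERS` and `minimize`, the `state` dictionary and the hyperparameter defaults are assumptions made for illustration; this is not the dispatch mechanism used in `NeuralNetwork.py`.

```python
import numpy as np

def sgd(w, grad, state, lr=0.1):
    """Vanilla gradient descent step."""
    return w - lr * grad

def momentum(w, grad, state, lr=0.1, beta=0.9):
    """Momentum step: keep an exponentially decaying sum of past updates."""
    state['u'] = beta * state.get('u', 0.0) + lr * grad
    return w - state['u']

def adam(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step with bias-corrected first and second moment estimates."""
    t = state['t'] = state.get('t', 0) + 1
    state['m'] = beta1 * state.get('m', 0.0) + (1 - beta1) * grad
    state['v'] = beta2 * state.get('v', 0.0) + (1 - beta2) * grad ** 2
    m_hat = state['m'] / (1 - beta1 ** t)
    v_hat = state['v'] / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Adding a new optimiser means writing one function and registering it here
OPTIMIZERS = {'sgd': sgd, 'momentum': momentum, 'adam': adam}

def minimize(name, grad_fn, w0, steps=500):
    """Run the named update rule for a fixed number of steps."""
    update, state = OPTIMIZERS[name], {}
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = update(w, grad_fn(w), state)
    return w

# Toy objective f(w) = ||w - 3||^2, whose gradient is 2 (w - 3)
grad_fn = lambda w: 2.0 * (w - 3.0)
results = {name: minimize(name, grad_fn, w0=[0.0, 10.0]) for name in OPTIMIZERS}
for name, w in results.items():
    print(name, np.round(w, 3))
```

Because every rule receives its own `state` dictionary, the same loop works for any batch size: only the gradient passed in changes.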



In [17]:
# Holding out the last 10% of the training set as validation data
val_records_count = int(len(x_train)*0.1)
train_data={'inputs':x_train_flattened[:-val_records_count], 'labels':one_hot_label[:-val_records_count]}
val_data={'inputs':x_train_flattened[-val_records_count:], 'labels':one_hot_label[-val_records_count:]}

# Training using the parameters that were found to be best by sweeping in wandb
nn.train(train_data=train_data, val_data=val_data, epochs=5, learning_rate=0.001,
                 optimizer='adam', weight_decay=0, batch_size=64, print_every_epoch=1)

#Trying to predict class after training, the prediction is supposed to be accurate
print('Predicted Class:', np.round(nn.feed_forward(x_train_flattened[1]), 2).tolist(), fmnist_labels[np.argmax(nn.feed_forward(x_train_flattened[1]))])

#Actual Class
print('Actual Class:', one_hot_label[1].tolist(), fmnist_labels[np.argmax(one_hot_label[1])])


Using _adam_gradient_descent for traing optimization
Seconds taken 6.04 batch: 1/843 val_loss_acc: (22102.91, 0.102)
Seconds taken 43.19 batch: 423/843 val_loss_acc: (5430.58, 0.686)
Mins taken 2.34 {'loss': 44007.94, 'accuracy': 0.726, 'val_loss': 4859.57, 'val_accuracy': 0.722, 'epoch': 0}
Seconds taken 96.01 batch: 1/843 val_loss_acc: (4827.74, 0.715)
Seconds taken 44.95 batch: 423/843 val_loss_acc: (4663.35, 0.728)
Mins taken 2.33 {'loss': 40005.88, 'accuracy': 0.745, 'val_loss': 4436.92, 'val_accuracy': 0.743, 'epoch': 1}
Seconds taken 95.19 batch: 1/843 val_loss_acc: (4312.39, 0.743)
Seconds taken 43.3 batch: 423/843 val_loss_acc: (4357.08, 0.745)
Mins taken 2.3 {'loss': 37781.39, 'accuracy': 0.759, 'val_loss': 4210.32, 'val_accuracy': 0.755, 'epoch': 2}
Seconds taken 97.01 batch: 1/843 val_loss_acc: (4077.36, 0.758)
Seconds taken 44.25 batch: 423/843 val_loss_acc: (4158.07, 0.757)
Mins taken 2.34 {'loss': 36316.81, 'accuracy': 0.768, 'val_loss': 4065.97, 'val_accuracy': 0.759, '

## Question 4 (10 Marks)
Use the sweep functionality provided by wandb to find the best values for the hyperparameters listed below. Use the standard train/test split of fashion_mnist (use `(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()`). Keep 10% of the training data aside as validation data for this hyperparameter search. Here are some suggestions for different values to try for hyperparameters. As you can quickly see, this leads to an exponential number of combinations. You will have to think about strategies to do this hyperparameter search efficiently. Check out the options provided by wandb.sweep and write down what strategy you chose and why.

- number of epochs: 5, 10
- number of hidden layers: 3, 4, 5
- size of every hidden layer: 32, 64, 128
- weight decay (L2 regularisation): 0, 0.0005, 0.5
- learning rate: 1e-3, 1e-4
- optimizer: sgd, momentum, nesterov, rmsprop, adam, nadam
- batch size: 16, 32, 64
- weight initialisation: random, Xavier
- activation functions: sigmoid, tanh, ReLU

wandb will automatically generate the following plots. Paste these plots below using the "Add Panel to Report" feature. Make sure you use meaningful names for each sweep (e.g. hl_3_bs_16_ac_tanh to indicate that there were 3 hidden layers, batch size was 16 and activation function was tanh) instead of using the default names (whole-sweep, kind-sweep) given by wandb.
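
To see why an exhaustive grid is impractical, the size of the search space can be counted directly from the suggested values, and a short run name of the form given in the example can be built from a configuration. This is a small standard-library sketch; the dictionary keys and the helper `short_run_name` are hypothetical names chosen for illustration.

```python
from math import prod

# Suggested values from the list above (keys are illustrative names)
search_space = {
    'epochs': [5, 10],
    'num_layers': [3, 4, 5],
    'hidden_size': [32, 64, 128],
    'weight_decay': [0, 0.0005, 0.5],
    'learning_rate': [1e-3, 1e-4],
    'optimizer': ['sgd', 'momentum', 'nesterov', 'rmsprop', 'adam', 'nadam'],
    'batch_size': [16, 32, 64],
    'weight_init': ['random', 'xavier'],
    'activation': ['sigmoid', 'tanh', 'relu'],
}

# A full grid tries every combination: the product of the option counts
total_configs = prod(len(values) for values in search_space.values())
print(total_configs)  # 11664

def short_run_name(config):
    """Build a compact name such as hl_3_bs_16_ac_tanh from a config dict."""
    return f"hl_{config['num_layers']}_bs_{config['batch_size']}_ac_{config['activation']}"

example_name = short_run_name({'num_layers': 3, 'batch_size': 16, 'activation': 'tanh'})
print(example_name)  # hl_3_bs_16_ac_tanh
```

At several minutes per run, 11664 configurations are far beyond a reasonable budget, which is the motivation for using a sampling-based sweep method rather than a full grid.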



In [4]:
sweep_config = {
    'program': 'train.py',
    'name': 'hp-searching-sweep',
    'method': 'bayes',
    'metric': {
          'name': 'accuracy',
          'goal': 'maximize'
    },
    'parameters':{
          'epochs': {
                'values': [5, 10]
              },
          'num_layers': {
                'values': [3, 4, 5]
              },
          'hidden_size': {
                'values': [32, 64, 128]
              },
          'weight_decay': {
                'values': [0, 0.0005, 0.5]
              },
          'learning_rate': {
                'values': [1e-3, 1e-4]
              },
          'optimizer': {
                'values': ['stochastic', 'momentum', 'nag', 'rmsprop', 'adam', 'nadam']
              },
          'batch_size': {
                'values': [16, 32, 64]
              },
          'weight_init': {
                'values': ['random', 'xavier']
              },
          'activation': {
                'values': ['sigmoid', 'tanh', 'relu']
              },
    }
    }
sweep_id = wandb.sweep(sweep_config, project="CS6910-Assignment-1")

Create sweep with ID: 7def9p5u
Sweep URL: https://wandb.ai/ns24z066/CS6910-Assignment-1/sweeps/7def9p5u


In [10]:
def train(config=None):
    # Initialize a new wandb run
    with wandb.init(config=config) as run:
        # If called by wandb.agent, as below,
        # this config will be set by the Sweep Controller
        config = wandb.config

        # Name the run after its hyperparameters instead of the wandb default name
        run_name = str(config).replace("': '", ' ').replace("'", '')
        print(run_name)
        run.name = run_name

        layers = [{'num_neurons': config.hidden_size, 'activation': config.activation}] * config.num_layers
        nn = NeuralNetwork(input_dim=x_train_flattened.shape[1], output_dim=one_hot_label.shape[1], nn_archtre=layers,
                           last_layer_activation='softmax', weight_initializer=config.weight_init)

        nn.train(train_data=train_data, val_data=val_data, epochs=config.epochs, learning_rate=config.learning_rate,
                 optimizer=config.optimizer, weight_decay=config.weight_decay, batch_size=config.batch_size, print_every_epoch=1)


wandb.agent(sweep_id, train)

Error in callback <bound method _WandbInit._resume_backend of <wandb.sdk.wandb_init._WandbInit object at 0x7c2dff563ee0>> (for pre_run_cell):


BrokenPipeError: [Errno 32] Broken pipe

wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe
wandb: Agent Starting Run: t0h26b96 with config:
wandb:     activation: relu
wandb:     batch_size: 64
wandb:     epochs: 5
wandb:     hidden_size: 64
wandb:     learning_rate: 0.0001
wandb:     num_layers: 5
wandb:     optimizer: nag
wandb:     weight_decay: 0.5
wandb:     weight_init: random


{activation relu, batch_size: 64, epochs: 5, hidden_size: 64, learning_rate: 0.0001, num_layers: 5, optimizer nag, weight_decay: 0.5, weight_init random}
Using _nag_gradient_descent for traing optimization
Seconds taken 11.07 batch: 1/843 val_loss_acc: (2863704.19, 0.105)
loss isnan or not changing for past 3 epochs reinitializing weights
Seconds taken 80.5 batch: 422/843 val_loss_acc: (nan, 0.105)
Seconds taken 74.21 batch: 843/843 val_loss_acc: (84097485692.92, 0.106)
Mins taken 4.5 {'loss': 756877377315.11, 'accuracy': 0.099, 'val_loss': 84097485692.92, 'val_accuracy': 0.106, 'epoch': 0}
Seconds taken 115.75 batch: 1/843 val_loss_acc: (84096541780.11, 0.106)
Seconds taken 72.16 batch: 422/843 val_loss_acc: (83702600062.12, 0.106)
Seconds taken 70.75 batch: 843/843 val_loss_acc: (83317642830.94, 0.106)
Mins taken 4.36 {'loss': 749858791557.3, 'accuracy': 0.099, 'val_loss': 83317642830.94, 'val_accuracy': 0.106, 'epoch': 1}
Seconds taken 116.72 batch: 1/843 val_loss_acc: (83316742113.

Run history:
accuracy: ▁▁▁▁▁
epoch: ▁▃▅▆█
loss: █▆▄▃▁
val_accuracy: ▁▁▁▁▁
val_loss: █▆▄▃▁

Run summary:
accuracy: 0.099
epoch: 4.0
loss: 730455788031.5
val_accuracy: 0.106
val_loss: 81161753550.3


wandb: Agent Starting Run: x3zcu7sw with config:
wandb:     activation: sigmoid
wandb:     batch_size: 64
wandb:     epochs: 10
wandb:     hidden_size: 128
wandb:     learning_rate: 0.001
wandb:     num_layers: 3
wandb:     optimizer: nag
wandb:     weight_decay: 0.5
wandb:     weight_init: random


{activation sigmoid, batch_size: 64, epochs: 10, hidden_size: 128, learning_rate: 0.001, num_layers: 3, optimizer nag, weight_decay: 0.5, weight_init random}
Using _nag_gradient_descent for traing optimization
Seconds taken 18.49 batch: 1/843 val_loss_acc: (12662608.19, 0.09)
Seconds taken 122.34 batch: 422/843 val_loss_acc: (4449855.13, 0.469)
Seconds taken 134.71 batch: 843/843 val_loss_acc: (1543232.34, 0.659)
Mins taken 7.33 {'loss': 13890426.81, 'accuracy': 0.653, 'val_loss': 1543232.34, 'val_accuracy': 0.659, 'epoch': 0}
Seconds taken 183.88 batch: 1/843 val_loss_acc: (1539381.55, 0.655)
Seconds taken 120.69 batch: 422/843 val_loss_acc: (544176.56, 0.734)
Seconds taken 124.38 batch: 843/843 val_loss_acc: (204223.75, 0.711)
Mins taken 7.2 {'loss': 1838566.09, 'accuracy': 0.702, 'val_loss': 204223.75, 'val_accuracy': 0.711, 'epoch': 1}
Seconds taken 184.96 batch: 1/843 val_loss_acc: (203770.5, 0.707)
Seconds taken 109.44 batch: 422/843 val_loss_acc: (87611.93, 0.578)
Seconds taken 

Run history:
accuracy: ▇█▅▂▁▁▁▁▁▁
epoch: ▁▂▃▃▄▅▆▆▇█
loss: █▂▁▁▁▁▁▁▁▁
val_accuracy: ▇█▅▂▁▁▁▁▁▁
val_loss: █▂▁▁▁▁▁▁▁▁

Run summary:
accuracy: 0.099
epoch: 9.0
loss: 124398.21
val_accuracy: 0.106
val_loss: 13825.15


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: n550ywu9 with config:
wandb:     activation: tanh
wandb:     batch_size: 64
wandb:     epochs: 10
wandb:     hidden_size: 64
wandb:     learning_rate: 0.001
wandb:     num_layers: 4
wandb:     optimizer: nag
wandb:     weight_decay: 0.5
wandb:     weight_init: xavier


{activation tanh, batch_size: 64, epochs: 10, hidden_size: 64, learning_rate: 0.001, num_layers: 4, optimizer nag, weight_decay: 0.5, weight_init xavier}
Using _nag_gradient_descent for traing optimization
Seconds taken 9.91 batch: 1/843 val_loss_acc: (81854.85, 0.084)
Seconds taken 66.53 batch: 422/843 val_loss_acc: (34633.57, 0.7)
Seconds taken 50.61 batch: 843/843 val_loss_acc: (21030.33, 0.732)
Mins taken 3.08 {'loss': 189172.49, 'accuracy': 0.73, 'val_loss': 21030.33, 'val_accuracy': 0.732, 'epoch': 0}
Seconds taken 61.96 batch: 1/843 val_loss_acc: (20992.64, 0.71)
Seconds taken 34.87 batch: 422/843 val_loss_acc: (15982.07, 0.706)
Seconds taken 37.56 batch: 843/843 val_loss_acc: (18123.56, 0.484)
Mins taken 2.25 {'loss': 162697.77, 'accuracy': 0.482, 'val_loss': 18123.56, 'val_accuracy': 0.484, 'epoch': 1}
Seconds taken 65.27 batch: 1/843 val_loss_acc: (18469.37, 0.523)
Seconds taken 42.44 batch: 422/843 val_loss_acc: (18087.32, 0.488)
Seconds taken 39.04 batch: 843/843 val_loss_a

Run history:
accuracy: █▅▅▃▂▁▁▁▁▁
epoch: ▁▂▃▃▄▅▆▆▇█
loss: █▅▄▂▂▁▁▁▁▁
val_accuracy: █▅▅▃▂▁▁▁▁▁
val_loss: █▅▄▂▂▁▁▁▁▁

Run summary:
accuracy: 0.1
epoch: 9.0
loss: 124966.06
val_accuracy: 0.103
val_loss: 13893.95


wandb: Agent Starting Run: lev2grnw with config:
wandb:     activation: relu
wandb:     batch_size: 32
wandb:     epochs: 5
wandb:     hidden_size: 32
wandb:     learning_rate: 0.001
wandb:     num_layers: 3
wandb:     optimizer: nag
wandb:     weight_decay: 0.5
wandb:     weight_init: xavier


{activation relu, batch_size: 32, epochs: 5, hidden_size: 32, learning_rate: 0.001, num_layers: 3, optimizer nag, weight_decay: 0.5, weight_init xavier}
Using _nag_gradient_descent for traing optimization
Seconds taken 3.73 batch: 1/1687 val_loss_acc: (60151.01, 0.187)
Seconds taken 27.34 batch: 844/1687 val_loss_acc: (19384.0, 0.151)
Seconds taken 26.76 batch: 1687/1687 val_loss_acc: (14488.99, 0.1)
Mins taken 1.44 {'loss': 130390.77, 'accuracy': 0.1, 'val_loss': 14488.99, 'val_accuracy': 0.1, 'epoch': 0}
Seconds taken 30.81 batch: 1/1687 val_loss_acc: (14487.27, 0.1)
Seconds taken 30.73 batch: 844/1687 val_loss_acc: (13895.92, 0.098)
Seconds taken 30.33 batch: 1687/1687 val_loss_acc: (13826.51, 0.1)
Mins taken 1.66 {'loss': 124428.77, 'accuracy': 0.1, 'val_loss': 13826.51, 'val_accuracy': 0.1, 'epoch': 1}
Seconds taken 40.61 batch: 1/1687 val_loss_acc: (13826.45, 0.1)
Seconds taken 28.81 batch: 844/1687 val_loss_acc: (13817.31, 0.105)
Seconds taken 27.7 batch: 1687/1687 val_loss_acc:

Run history:
accuracy: ▁▁▁▁▁
epoch: ▁▃▅▆█
loss: █▁▁▁▁
val_accuracy: ▁▁▁▁▁
val_loss: █▁▁▁▁

Run summary:
accuracy: 0.1
epoch: 4.0
loss: 124343.34
val_accuracy: 0.1
val_loss: 13817.03


wandb: Agent Starting Run: xcdm75oj with config:
wandb:     activation: sigmoid
wandb:     batch_size: 64
wandb:     epochs: 5
wandb:     hidden_size: 32
wandb:     learning_rate: 0.0001
wandb:     num_layers: 5
wandb:     optimizer: nadam
wandb:     weight_decay: 0.5
wandb:     weight_init: xavier


{activation sigmoid, batch_size: 64, epochs: 5, hidden_size: 32, learning_rate: 0.0001, num_layers: 5, optimizer nadam, weight_decay: 0.5, weight_init xavier}
Using _nadam_gradient_descent for traing optimization
Seconds taken 4.1 batch: 1/843 val_loss_acc: (48813.54, 0.106)
Seconds taken 50.92 batch: 422/843 val_loss_acc: (39321.62, 0.106)
Seconds taken 52.37 batch: 843/843 val_loss_acc: (36446.26, 0.106)
Mins taken 2.74 {'loss': 329416.46, 'accuracy': 0.099, 'val_loss': 36446.26, 'val_accuracy': 0.106, 'epoch': 0}
Seconds taken 62.39 batch: 1/843 val_loss_acc: (36437.92, 0.106)
Seconds taken 49.89 batch: 422/843 val_loss_acc: (34191.93, 0.106)
Seconds taken 46.97 batch: 843/843 val_loss_acc: (32825.72, 0.106)
Mins taken 2.62 {'loss': 296338.52, 'accuracy': 0.099, 'val_loss': 32825.72, 'val_accuracy': 0.106, 'epoch': 1}
Seconds taken 61.58 batch: 1/843 val_loss_acc: (32821.87, 0.106)
Seconds taken 53.07 batch: 422/843 val_loss_acc: (31582.11, 0.106)
Seconds taken 52.61 batch: 843/843 

Run history:
accuracy: ▁▁▁▁▁
epoch: ▁▃▅▆█
loss: █▅▃▂▁
val_accuracy: ▁▁▁▁▁
val_loss: █▅▃▂▁

Run summary:
accuracy: 0.099
epoch: 4.0
loss: 248239.58
val_accuracy: 0.106
val_loss: 27550.66


wandb: Agent Starting Run: rp9uox19 with config:
wandb:     activation: sigmoid
wandb:     batch_size: 64
wandb:     epochs: 10
wandb:     hidden_size: 64
wandb:     learning_rate: 0.001
wandb:     num_layers: 5
wandb:     optimizer: nag
wandb:     weight_decay: 0
wandb:     weight_init: random


{activation sigmoid, batch_size: 64, epochs: 10, hidden_size: 64, learning_rate: 0.001, num_layers: 5, optimizer nag, weight_decay: 0, weight_init random}
Using _nag_gradient_descent for traing optimization
Seconds taken 8.16 batch: 1/843 val_loss_acc: (43069.91, 0.07)
Seconds taken 61.05 batch: 422/843 val_loss_acc: (17290.45, 0.413)
Seconds taken 64.25 batch: 843/843 val_loss_acc: (15342.33, 0.508)
Mins taken 3.49 {'loss': 137690.21, 'accuracy': 0.51, 'val_loss': 15342.33, 'val_accuracy': 0.508, 'epoch': 0}
Seconds taken 83.34 batch: 1/843 val_loss_acc: (15350.54, 0.511)
Seconds taken 61.81 batch: 422/843 val_loss_acc: (13076.21, 0.534)
Seconds taken 59.74 batch: 843/843 val_loss_acc: (5807.51, 0.681)
Mins taken 3.41 {'loss': 52192.8, 'accuracy': 0.681, 'val_loss': 5807.51, 'val_accuracy': 0.681, 'epoch': 1}
Seconds taken 81.82 batch: 1/843 val_loss_acc: (5775.74, 0.683)
Seconds taken 63.06 batch: 422/843 val_loss_acc: (5242.63, 0.703)
Seconds taken 63.18 batch: 843/843 val_loss_acc:

Run history:
accuracy: ▁▆▇▇▇█████
epoch: ▁▂▃▃▄▅▆▆▇█
loss: █▂▂▁▁▁▁▁▁▁
val_accuracy: ▁▆▇▇▇█████
val_loss: █▂▂▁▁▁▁▁▁▁

Run summary:
accuracy: 0.773
epoch: 9.0
loss: 34916.36
val_accuracy: 0.766
val_loss: 3995.77


wandb: Agent Starting Run: x6dsd64j with config:
wandb:     activation: sigmoid
wandb:     batch_size: 64
wandb:     epochs: 10
wandb:     hidden_size: 128
wandb:     learning_rate: 0.0001
wandb:     num_layers: 3
wandb:     optimizer: nadam
wandb:     weight_decay: 0
wandb:     weight_init: random


{activation sigmoid, batch_size: 64, epochs: 10, hidden_size: 128, learning_rate: 0.0001, num_layers: 3, optimizer nadam, weight_decay: 0, weight_init random}
Using _nadam_gradient_descent for traing optimization
Seconds taken 9.22 batch: 1/843 val_loss_acc: (61286.19, 0.085)


wandb: Ctrl + C detected. Stopping sweep.


Error in callback <bound method _WandbInit._pause_backend of <wandb.sdk.wandb_init._WandbInit object at 0x7c2dff563ee0>> (for post_run_cell):


BrokenPipeError: [Errno 32] Broken pipe



## Question 5 (5 marks)
We would like to see the best accuracy on the validation set across all the models that you train.
wandb automatically generates this plot which summarises the test accuracy of all the models that you tested. Please paste this plot below using the "Add Panel to Report" feature


In [8]:
# The graphs can be found in the report; I got over 85% accuracy on the validation dataset



## Question 6 (20 Marks)
Based on the different experiments that you have run we want you to make some inferences about which configurations worked and which did not.
Here again, wandb automatically generates a "Parallel co-ordinates plot" and a "correlation summary" as shown below. Learn about a "Parallel co-ordinates plot" and how to read it.
By looking at the plots that you get, write down some interesting observations (simple bullet points but should be insightful). You can also refer to the plot in Question 5 while writing these insights. For example, in the above sample plot there are many configurations which give less than 65% accuracy. I would like to zoom into those and see what is happening.
I would also like to see a recommendation for what configuration to use to get close to 95% accuracy.



In [None]:
# Please refer to the report for the graphs

## Question 7 (10 Marks)
For the best model identified above, report the accuracy on the test set of fashion_mnist and plot the confusion matrix as shown below. More marks for creativity (less marks for producing the plot shown below as it is)


In [18]:
network_prediction = np.zeros([y_test.shape[0], len(np.unique(y_test)), 1], dtype=float)
for index, item in enumerate(x_test_flattened):
  network_prediction[index] = nn.feed_forward(item)

cm = wandb.plot.confusion_matrix(
    y_true=one_hot_label_test.reshape(-1, 10).argmax(axis=1), preds=network_prediction.reshape(-1, 10).argmax(axis=1), class_names=list(fmnist_labels.values())
)

wandb.log({"conf_mat": cm})
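
Since the question also asks for the test accuracy, the following is a minimal NumPy sketch of how accuracy and per-class recall can be read off a confusion matrix built from true and predicted class indices. The helper `confusion_matrix` and the small `demo_true` / `demo_pred` arrays are hypothetical stand-ins; with the real model, the two arrays would be the argmax of `one_hot_label_test` and of `network_prediction`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # unbuffered add handles repeated index pairs
    return cm

# Stand-in class indices for seven samples and three classes (illustrative only)
demo_true = np.array([0, 0, 1, 1, 2, 2, 2])
demo_pred = np.array([0, 1, 1, 1, 2, 0, 2])

demo_cm = confusion_matrix(demo_true, demo_pred, num_classes=3)
test_accuracy = np.trace(demo_cm) / demo_cm.sum()          # correct / total
per_class_recall = np.diag(demo_cm) / demo_cm.sum(axis=1)  # correct / true count per class

print(demo_cm)
print(round(float(test_accuracy), 3), np.round(per_class_recall, 3))
```

Per-class recall highlights which classes are confused most often, which a single accuracy number hides.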



## Question 8 (5 Marks)
In all the models above you would have used cross entropy loss. Now compare the cross entropy loss with the squared error loss. I would again like to see some automatically generated plots or your own plots to convince me whether one is better than the other.
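
One commonly cited reason to expect a difference is the gradient at the softmax output: with cross entropy the gradient with respect to the logits is simply the prediction minus the target, while with squared error it is additionally multiplied by the softmax Jacobian, which shrinks when the output saturates. The sketch below compares the two gradients for a single confidently wrong prediction. The logits and function names are made up for illustration, and this does not replace the empirical comparison the question asks for.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

def grad_cross_entropy(z, y):
    """Gradient w.r.t. logits z of -sum(y * log(softmax(z)))."""
    return softmax(z) - y

def grad_squared_error(z, y):
    """Gradient w.r.t. logits z of sum((softmax(z) - y) ** 2)."""
    p = softmax(z)
    jacobian = np.diag(p) - np.outer(p, p)   # d softmax / d z, a symmetric matrix
    return 2.0 * jacobian @ (p - y)

# A confidently wrong prediction: the true class is 0, but logit 1 dominates
z = np.array([-4.0, 6.0, 0.0])
y = np.array([1.0, 0.0, 0.0])

g_ce = grad_cross_entropy(z, y)
g_se = grad_squared_error(z, y)
print(np.linalg.norm(g_ce), np.linalg.norm(g_se))
```

In this example the squared-error gradient is far smaller than the cross-entropy one, so a saturated wrong prediction would be corrected more slowly under squared error.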


In [5]:
# Training using the parameters that were found to be best by sweeping in wandb

# Conducting two trainings, one with cross-entropy and one with MSE (the loss='mse' call, commented out below), to check if there is a difference
# nn.train(train_data=train_data, val_data=val_data, epochs=10, learning_rate=0.001,
#                  optimizer='adam', weight_decay=0, batch_size=64, print_every_epoch=1, loss='mse')
nn.train(train_data=train_data, val_data=val_data, epochs=6, learning_rate=0.001,
                 optimizer='adam', weight_decay=0, batch_size=64, print_every_epoch=1)



Using _adam_gradient_descent for traing optimization
Seconds taken 15.04 batch: 1/843 val_loss_acc: (21407.14, 0.103)
Seconds taken 50.7 batch: 423/843 val_loss_acc: (3518.95, 0.802)
Mins taken 2.71 {'loss': 28450.51, 'accuracy': 0.82, 'val_loss': 3198.3, 'val_accuracy': 0.82, 'epoch': 0}
Seconds taken 101.77 batch: 1/843 val_loss_acc: (3194.13, 0.819)
Seconds taken 46.82 batch: 423/843 val_loss_acc: (3010.71, 0.829)
Mins taken 2.49 {'loss': 25772.27, 'accuracy': 0.838, 'val_loss': 2944.78, 'val_accuracy': 0.834, 'epoch': 1}
Seconds taken 102.62 batch: 1/843 val_loss_acc: (2945.88, 0.832)
Seconds taken 47.34 batch: 423/843 val_loss_acc: (2831.87, 0.838)
Mins taken 2.52 {'loss': 24380.92, 'accuracy': 0.848, 'val_loss': 2822.86, 'val_accuracy': 0.84, 'epoch': 2}
Seconds taken 104.03 batch: 1/843 val_loss_acc: (2824.92, 0.841)
Seconds taken 48.02 batch: 423/843 val_loss_acc: (2722.85, 0.844)
Mins taken 2.48 {'loss': 23410.36, 'accuracy': 0.855, 'val_loss': 2743.95, 'val_accuracy': 0.844, 


## Question 9 (10 Marks)
Paste a link to your github code for this assignment
Example: https://github.com/<user-id>/cs6910_assignment1
We will check for coding style, clarity in using functions and a README file with clear instructions on training and evaluating the model (the 10 marks will be based on this)
We will also run a plagiarism check to ensure that the code is not copied (0 marks in the assignment if we find that the code is plagiarized)
We will also check if the training and test data has been split properly and randomly. You will get 0 marks on the assignment if we find any cheating (e.g., adding test data to training data) to get higher accuracy


In [None]:
## https://github.com/i-618/cs6910_assignment1


## Question 10 (10 Marks)
Based on your learnings above, give me 3 recommendations for what would work for the MNIST dataset (not Fashion-MNIST). Just to be clear, I am asking you to take your learnings based on extensive experimentation with one dataset and see if these learnings help on another dataset. If I give you a budget of running only 3 hyperparameter configurations as opposed to the large number of experiments you have run above then which 3 would you use and why. Report the accuracies that you obtain using these 3 configurations.


In [10]:
from keras.datasets import mnist

# Pre-Processing the data for training
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# One hot encoding the output parameters, since this is a classification problem, in case of regression this is not needed
one_hot_label = np.zeros([y_train.shape[0], len(np.unique(y_train)), 1], dtype=int)
for index, item in enumerate(y_train):
  one_hot_label[index, item] = [1]

one_hot_label_test = np.zeros([y_test.shape[0], len(np.unique(y_test)), 1], dtype=int)
for index, item in enumerate(y_test):
  one_hot_label_test[index, item] = [1]

# Flattening each image into shape (784, 1) as Neural network accepts only 1d series of data and then normalizing it with max value
x_train_flattened = x_train.reshape(-1, 784, 1)/np.max(x_train)
x_test_flattened = x_test.reshape(-1, 784, 1)/np.max(x_test)

# It is given that 90% of the dataset is to be considered for training, while 10%  for validation
train_records_count = int(len(x_train)*0.9)
train_data={'inputs':x_train_flattened[:train_records_count], 'labels':one_hot_label[:train_records_count]}
val_data={'inputs':x_train_flattened[train_records_count:], 'labels':one_hot_label[train_records_count:]}
test_data={'inputs':x_test_flattened, 'labels': one_hot_label_test}

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [11]:
from NeuralNetwork import NeuralNetwork

layers = [{'num_neurons': 128, 'activation': 'tanh'}] * 5
nn = NeuralNetwork(input_dim=x_train_flattened.shape[1], output_dim=one_hot_label.shape[1], nn_archtre=layers,
                   last_layer_activation='softmax', weight_initializer='xavier')
nn.train(train_data=train_data, val_data=val_data, epochs=5, learning_rate=0.001,
                 optimizer='adam', weight_decay=0, batch_size=64, print_every_epoch=1)


Using _adam_gradient_descent for traing optimization
Seconds taken 5.99 batch: 1/843 val_loss_acc: (19872.81, 0.099)
Seconds taken 51.87 batch: 423/843 val_loss_acc: (2385.56, 0.886)
Mins taken 2.51 {'loss': 20688.79, 'accuracy': 0.89, 'val_loss': 1968.73, 'val_accuracy': 0.905, 'epoch': 0}
Seconds taken 97.78 batch: 1/843 val_loss_acc: (2016.29, 0.902)
Seconds taken 47.79 batch: 423/843 val_loss_acc: (1661.16, 0.921)
Mins taken 2.47 {'loss': 16756.34, 'accuracy': 0.91, 'val_loss': 1621.04, 'val_accuracy': 0.92, 'epoch': 1}
Seconds taken 99.54 batch: 1/843 val_loss_acc: (1648.81, 0.918)
Seconds taken 48.99 batch: 423/843 val_loss_acc: (1444.3, 0.932)
Mins taken 2.49 {'loss': 14571.69, 'accuracy': 0.923, 'val_loss': 1439.74, 'val_accuracy': 0.93, 'epoch': 2}
Seconds taken 100.45 batch: 1/843 val_loss_acc: (1457.5, 0.93)
Seconds taken 48.86 batch: 423/843 val_loss_acc: (1325.58, 0.938)
Mins taken 2.45 {'loss': 13135.3, 'accuracy': 0.932, 'val_loss': 1327.83, 'val_accuracy': 0.935, 'epoch

In [12]:
layers = [{'num_neurons': 128, 'activation': 'tanh'}] * 5
nn = NeuralNetwork(input_dim=x_train_flattened.shape[1], output_dim=one_hot_label.shape[1], nn_archtre=layers,
                   last_layer_activation='softmax', weight_initializer='xavier')
nn.train(train_data=train_data, val_data=val_data, epochs=5, learning_rate=0.001,
                 optimizer='nag', weight_decay=0.0005, batch_size=64, print_every_epoch=1)

Using _nag_gradient_descent for traing optimization
Seconds taken 5.57 batch: 1/843 val_loss_acc: (22963.1, 0.1)
Seconds taken 44.77 batch: 423/843 val_loss_acc: (4325.49, 0.792)
Mins taken 2.41 {'loss': 25015.23, 'accuracy': 0.869, 'val_loss': 2426.93, 'val_accuracy': 0.885, 'epoch': 0}
Seconds taken 98.89 batch: 1/843 val_loss_acc: (2399.57, 0.891)
Seconds taken 43.46 batch: 423/843 val_loss_acc: (2150.31, 0.904)
Mins taken 2.48 {'loss': 20874.25, 'accuracy': 0.894, 'val_loss': 1965.38, 'val_accuracy': 0.912, 'epoch': 1}
Seconds taken 105.99 batch: 1/843 val_loss_acc: (2005.07, 0.91)
Seconds taken 44.27 batch: 423/843 val_loss_acc: (1665.1, 0.924)
Mins taken 2.41 {'loss': 19762.27, 'accuracy': 0.901, 'val_loss': 1879.97, 'val_accuracy': 0.912, 'epoch': 2}
Seconds taken 102.29 batch: 1/843 val_loss_acc: (1821.41, 0.915)
Seconds taken 44.93 batch: 423/843 val_loss_acc: (1568.48, 0.93)
Mins taken 2.4 {'loss': 17545.83, 'accuracy': 0.914, 'val_loss': 1730.74, 'val_accuracy': 0.921, 'epoc

In [13]:
layers = [{'num_neurons': 128, 'activation': 'tanh'}] * 3
nn = NeuralNetwork(input_dim=x_train_flattened.shape[1], output_dim=one_hot_label.shape[1], nn_archtre=layers,
                   last_layer_activation='softmax', weight_initializer='xavier')
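# Plain SGD ('stochastic'), matching the "_stochastic_gradient_descent" line in the log below
nn.train(train_data=train_data, val_data=val_data, epochs=5, learning_rate=0.001,
                 optimizer='stochastic', weight_decay=0.0005, batch_size=64, print_every_epoch=1)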

Using _stochastic_gradient_descent for traing optimization
Seconds taken 5.08 batch: 1/843 val_loss_acc: (19884.98, 0.092)
Seconds taken 29.92 batch: 423/843 val_loss_acc: (4988.64, 0.771)
Mins taken 1.71 {'loss': 32109.48, 'accuracy': 0.837, 'val_loss': 3037.01, 'val_accuracy': 0.865, 'epoch': 0}
Seconds taken 71.79 batch: 1/843 val_loss_acc: (3029.21, 0.867)
Seconds taken 31.03 batch: 423/843 val_loss_acc: (2503.03, 0.894)
Mins taken 1.67 {'loss': 24188.18, 'accuracy': 0.882, 'val_loss': 2248.99, 'val_accuracy': 0.904, 'epoch': 1}
Seconds taken 69.0 batch: 1/843 val_loss_acc: (2248.15, 0.902)
Seconds taken 29.87 batch: 423/843 val_loss_acc: (2071.93, 0.912)
Mins taken 1.71 {'loss': 21028.84, 'accuracy': 0.9, 'val_loss': 1960.24, 'val_accuracy': 0.919, 'epoch': 2}
Seconds taken 73.28 batch: 1/843 val_loss_acc: (1961.2, 0.918)
Seconds taken 30.55 batch: 423/843 val_loss_acc: (1858.63, 0.922)
Mins taken 1.69 {'loss': 19126.63, 'accuracy': 0.911, 'val_loss': 1793.86, 'val_accuracy': 0.92