# Assignment 4: Benchmarking Fashion-MNIST with Deep Neural Nets

### CS 4501 Machine Learning - Department of Computer Science - University of Virginia
"The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others." - **Zalando Research, Github Repo.**"

Fashion-MNIST is a dataset from the Zalando's article. Each example is a 28x28 grayscale image, associated with a label from 10 classes. They intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms.

![Here's an example how the data looks (each class takes three-rows):](https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png)

In this assignment, you will attempt to benchmark the Fashion-MNIST using Neural Networks. You must use it to train some neural networks on TensorFlow and predict the final output of 10 classes. For deliverables, you must write code in Python and submit this Jupyter Notebook file (.ipynb) to earn a total of 100 pts. You will gain points depending on how you perform in the following sections.


In [0]:
# You might want to use the following packages
import numpy as np
import os
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) #reduce annoying warning messages
from functools import partial

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


---
## 1. PRE-PROCESSING THE DATA (10 pts)

You can load the Fashion MNIST directly from Tensorflow. **Partition of the dataset** so that you will have 50,000 examples for training, 10,000 examples for validation, and 10,000 examples for testing. Also, make sure that you platten out each of examples so that it contains only a 1-D feature vector.

Write some code to output the dimensionalities of each partition (train, validation, and test sets).



In [2]:
#split into 50000 training, 10000 validation, 10000 testing

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data();
x_valid, x_train = x_train[:10000], x_train[10000:]
y_valid, y_train = y_train[:10000], y_train[10000:]

print("Number of training set examples: ", y_train.size)
print("Number of validation set examples: ", y_valid.size)
print("Number of testing set examples: ", y_test.size)
print("\n")
print("Dimensions training (before flattening): ", x_train.shape)
print("Dimensions testing (before flattening): ", x_test.shape)
print("Dimensions validation (before flattening): ", x_valid.shape)
print("\n")

x_train = x_train.astype(np.float32).reshape(-1, 784) / 255.0
x_test = x_test.astype(np.float32).reshape(-1, 784) / 255.0
x_valid = x_valid.astype(np.float32).reshape(-1, 784) / 255.0

y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
y_valid = y_valid.astype(np.int32)

print("Dimensions training: ", x_train.shape)
print("Dimensions testing: ", x_test.shape)
print("Dimensions validation: ", x_valid.shape)


Number of training set examples:  50000
Number of validation set examples:  10000
Number of testing set examples:  10000


Dimensions training (before flattening):  (50000, 28, 28)
Dimensions testing (before flattening):  (10000, 28, 28)
Dimensions validation (before flattening):  (10000, 28, 28)


Dimensions training:  (50000, 784)
Dimensions testing:  (10000, 784)
Dimensions validation:  (10000, 784)


- - -
## 2. CONSTRUCTION PHASE (30 pts)

In this section, define at least three neural networks with different structures. Make sure that the input layer has the right number of inputs. The best structure often is found through a process of trial and error experimentation:
- You may start with a fully connected network structure with two hidden layers.
- You may try a few settings of the number of nodes in each layer.
- You may try a few activation functions to see if they affect the performance.

**Important Implementation Note:** For the purpose of learning Tensorflow, you must use low-level TensorFlow API to construct the network. Usage of high-level tools (ie. Keras) is not permited. 

In [0]:
# Your code goes here
reset_graph()

# Set some configuration here
n_inputs = 28*28  # Fashion-MNIST
learning_rate = 0.01
n_outputs = 10

# Construct placeholder for the input layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

In [0]:
n_h1 = 300
n_h2 = 100

#implementation of the first net
with tf.name_scope("dnn1"):
  h1 = tf.layers.dense(X, n_h1, name="hidden1", activation=tf.nn.relu)
  h2 = tf.layers.dense(X, n_h2, name="hidden2", activation=tf.nn.relu)
  logits = tf.layers.dense(h2, n_outputs, name = "outputs")

In [0]:
#implementation of the loss function net
with tf.name_scope("loss1"):
  xentropy1 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
  loss1 = tf.reduce_mean(xentropy1, name="loss1")
  loss_summary1 = tf.summary.scalar('log_loss1', loss1)

In [0]:
learn_rate = .01

#implementation of the training optimizer
with tf.name_scope("train1"):
  optimizer1 = tf.train.GradientDescentOptimizer(learn_rate)
  train_op1 = optimizer1.minimize(loss1)

In [0]:
#implementation of the evaluation procedure
with tf.name_scope("eval1"):
  correct1 = tf.nn.in_top_k(logits,y,1)
  accuracy1 = tf.reduce_mean(tf.cast(correct1, tf.float32))
  accuracy_summary1 = tf.summary.scalar('accuracy1', accuracy1)

In [0]:
n_hid1 = 300
n_hid2 = 200
n_hid3 = 100
n_hid4 = 75
n_hid5 = 50

#implementation of the second net
with tf.name_scope("dnn2"):
  hid1 = tf.layers.dense(X, n_hid1, name="hidden1_2", activation=tf.nn.relu)
  hid2 = tf.layers.dense(X, n_hid2, name="hidden2_2", activation=tf.nn.relu)
  hid3 = tf.layers.dense(X, n_hid3, name="hidden3_2", activation=tf.nn.relu)
  hid4 = tf.layers.dense(X, n_hid4, name="hidden4_2", activation=tf.nn.relu)      
  hid5 = tf.layers.dense(X, n_hid5, name="hidden5_2", activation=tf.nn.relu)
  logits = tf.layers.dense(hid5, n_outputs, name = "outputs_2")


In [0]:
#implementation of the loss function net 
with tf.name_scope("loss2"):
  xentropy2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
  loss2 = tf.reduce_mean(xentropy2, name="loss2")
  loss_summary2 = tf.summary.scalar('log_loss2', loss2)

In [0]:
learn_rate = .01

#implementation of the training optimizer
with tf.name_scope("train2"):
  optimizer2 = tf.train.GradientDescentOptimizer(learn_rate)
  train_op2 = optimizer2.minimize(loss2)


In [0]:
#implementation of the evaluation procedure 
with tf.name_scope("eval2"):
  correct2 = tf.nn.in_top_k(logits,y,1)
  accuracy2 = tf.reduce_mean(tf.cast(correct2, tf.float32))
  accuracy_summary2 = tf.summary.scalar('accuracy2', accuracy2)

In [0]:
n_hid1 = 300
n_hid2 = 150
n_hid3 = 100

#implementation of the third net 
with tf.name_scope("dnn3"):
  hi1 = tf.layers.dense(X, n_hid1, name="hidden1_3", activation=tf.nn.tanh)
  hi2 = tf.layers.dense(X, n_hid2, name="hidden2_3", activation=tf.nn.relu)
  hi3 = tf.layers.dense(X, n_hid3, name="hidden3_3", activation=tf.nn.tanh)
  logits = tf.layers.dense(hi3, n_outputs, name = "outputs_3")

In [0]:
#implementation of the loss function net 
with tf.name_scope("loss3"):
  xentropy3 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
  loss3 = tf.reduce_mean(xentropy3, name="loss3")
  loss_summary3 = tf.summary.scalar('log_loss3', loss3)

In [0]:
learn_rate = .01

#implementation of the training optimizer 
with tf.name_scope("train3"):
  optimizer3 = tf.train.GradientDescentOptimizer(learn_rate)
  train_op3 = optimizer3.minimize(loss3)

In [0]:
#implementation of the evaluation procedure 
with tf.name_scope("eval3"):
  correct3 = tf.nn.in_top_k(logits,y,1)
  accuracy3 = tf.reduce_mean(tf.cast(correct3, tf.float32))
  accuracy_summary3 = tf.summary.scalar('accuracy3', accuracy3)


- - -
## 3. EXECUTION PHASE (30 pts)

After you construct the three models of neural networks, you can compute the performance measure as the class accuracy. You will need to define the number of epochs and size of the training batch. You also might need to reset the graph each time your try a different model. To save time and avoid retraining, you should save the trained model and load it from disk to evaluate a test set. Pick the best model and answer the following:
- Which model yields the best performance measure for your dataset? Provide a reason why it yields the best performance.
- Why did you pick this many hidden layers?
- Provide some justifiable reasons for selecting the number of neurons per hidden layers. 
- Which activation functions did you use?

In the next session you will get a chance to finetune it further .



In [0]:
# Your code goes here
init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 100

# shuffle_batch() shuffle the examples in a batch before training
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch


In [17]:
with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    
    # implementation of the training ops 
    for X_batch, y_batch in shuffle_batch(x_train, y_train, batch_size):
      sess.run(train_op1, feed_dict={X: X_batch, y: y_batch})
      sess.run(train_op2, feed_dict={X: X_batch, y: y_batch})
      sess.run(train_op3, feed_dict={X: X_batch, y: y_batch})
    
    a_batch1 = accuracy1.eval(feed_dict={X: X_batch, y: y_batch})
    a_batch2 = accuracy2.eval(feed_dict={X: X_batch, y: y_batch})
    a_batch3 = accuracy3.eval(feed_dict={X: X_batch, y: y_batch})

    # implementation of the validation accuracy 
    a_valid1 = accuracy1.eval(feed_dict={X: x_valid, y: y_valid})
    a_valid2 = accuracy2.eval(feed_dict={X: x_valid, y: y_valid})
    a_valid3 = accuracy3.eval(feed_dict={X: x_valid, y: y_valid})

    print(epoch, "Batch accuracy (1): ", a_batch1, "\tValidation accuracy (1): ", a_valid1)
    print(epoch, "Batch accuracy (2): ", a_batch2, "\tValidation accuracy (2): ", a_valid2)
    print(epoch, "Batch accuracy (3): ", a_batch3, "\tValidation accuracy (3): ", a_valid3)
    print("\n")
    
  save_path = saver.save(sess, "./my_dnn_model.ckpt")

0 Batch accuracy (1):  0.72 	Validation accuracy (1):  0.7589
0 Batch accuracy (2):  0.71 	Validation accuracy (2):  0.7413
0 Batch accuracy (3):  0.71 	Validation accuracy (3):  0.759


1 Batch accuracy (1):  0.84 	Validation accuracy (1):  0.7922
1 Batch accuracy (2):  0.8 	Validation accuracy (2):  0.7867
1 Batch accuracy (3):  0.75 	Validation accuracy (3):  0.7975


2 Batch accuracy (1):  0.81 	Validation accuracy (1):  0.8124
2 Batch accuracy (2):  0.84 	Validation accuracy (2):  0.8056
2 Batch accuracy (3):  0.81 	Validation accuracy (3):  0.8126


3 Batch accuracy (1):  0.78 	Validation accuracy (1):  0.8192
3 Batch accuracy (2):  0.81 	Validation accuracy (2):  0.8163
3 Batch accuracy (3):  0.78 	Validation accuracy (3):  0.8177


4 Batch accuracy (1):  0.85 	Validation accuracy (1):  0.8255
4 Batch accuracy (2):  0.86 	Validation accuracy (2):  0.8223
4 Batch accuracy (3):  0.87 	Validation accuracy (3):  0.8277


5 Batch accuracy (1):  0.88 	Validation accuracy (1):  0.8309


In [18]:
with tf.Session() as sess:
  saver.restore(sess, save_path)
  accuracy_test1 = accuracy1.eval(feed_dict={X: x_test, y: y_test})
  accuracy_test2 = accuracy2.eval(feed_dict={X: x_test, y: y_test})
  accuracy_test3 = accuracy3.eval(feed_dict={X: x_test, y: y_test})

print('Accuracy of dnn1: ', accuracy_test1)
print('Accuracy of dnn2: ', accuracy_test2)
print('Accuracy of dnn3: ', accuracy_test3)

Accuracy of dnn1:  0.8443
Accuracy of dnn2:  0.8431
Accuracy of dnn3:  0.8444


EXPLANATION:

My first model had the best performance measures. This model had three hidden layers, two of which use an relu activation function, while the middle layer uses a tanh activation function. I tested models with 2, 3, and 5 hidden layers, over several different activation function combination, and this was the most successful model.  I adjusted the number of neurons up and down a bit, to ensure that the final number of neurons per hidden layer were ideal. The testing accuracy of this model was .8444.


- - -
## 4. FINETUNING THE NETWORK (25 pts)

The best performance on the Fashion MNIST of a non-neural-net classifier is the Support Vector Classifier {"C":10,"kernel":"poly"} with 0.897 accuracy. In this section, you will see how close you can get to that accuracy, or (better yet) beat it! You will be able to see the performance of other ML methods below:
http://fashion-mnist.s3-website.eu-central-1.amazonaws.com

Use the best model from the previous section and see if you can improve it further. To improve the performance of your model, You must make some modifications based upon the practical guidelines discuss in class. Here are a few decisions about the recommended network configurations you have to make:
1. Initialization: Use He Initialization for your model
2. Activation: Add ELU as the activation function throughout your hidden layers
3. Normalization: Incorporate the batch normalization at every layer
4. Regularization: Configure the dropout policy at 50% rate
5. Optimization: Change Gradient Descent into Adam Optimization
6. Your choice: make any other changes in 1-5 you deem necessary

Keep in mind that the execution phase is essentially the same, so you can just run it from the above. See how much you gain in classification accuracy. Provide some justifications for the gain in performance. 






In [0]:
reset_graph()

import tensorflow as tf

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 150
n_hidden3 = 100
n_outputs = 10

#4. Regularization: Configure the dropout policy at 50% rate
dropout_rate = .5

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

training = tf.placeholder_with_default(False, shape=(), name='training')

# implementation of the new benchmarking DNN
with tf.name_scope("dnnBenchmark"):
  #1. Initialization: Use He Initialization for your model
  he_init = tf.variance_scaling_initializer()
  
  #3. Normalization: Incorporate the batch normalization at every layer
  my_batch_norm_layer = partial(
            tf.layers.batch_normalization,
            training=training,
            momentum=.9)
  
  hidden1BM = tf.layers.dense(X, n_hidden1, name="hidden1BM", activation=tf.nn.relu, kernel_initializer=he_init)
  hidden1_drop = tf.layers.dropout(hidden1BM, dropout_rate, training=training)
  bn1 = tf.nn.elu(my_batch_norm_layer(hidden1_drop))
  
  hidden2BM = tf.layers.dense(X, n_hidden2, name="hidden2BM", activation=tf.nn.relu, kernel_initializer=he_init)
  hidden2_drop = tf.layers.dropout(hidden2BM, dropout_rate, training=training)
  bn2 = tf.nn.elu(my_batch_norm_layer(hidden2_drop))
  
  hidden3BM = tf.layers.dense(X, n_hidden3, name="hidden3BM", activation=tf.nn.relu, kernel_initializer=he_init)
  hidden3_drop = tf.layers.dropout(hidden3BM, dropout_rate, training=training)
  bn3 = tf.nn.elu(my_batch_norm_layer(hidden3_drop))
  
  logits_before_bn = tf.layers.dense(bn3, n_outputs, name = "outputsBM", kernel_initializer=he_init)
  logits = my_batch_norm_layer(logits_before_bn)

In [0]:
#implementation of the loss function net
with tf.name_scope("lossBM"):
  xentropyBM = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
  lossBM = tf.reduce_mean(xentropyBM, name="lossBM")
  loss_summaryBM = tf.summary.scalar('log_lossBM', lossBM)

In [0]:
#5 use Adam Optimization

learn_rate = .01

#implementation of the training optimizer
with tf.name_scope("trainBM"):
  optimizerBM = tf.train.AdamOptimizer(learn_rate)
  train_opBM = optimizerBM.minimize(lossBM)
  

In [0]:
#implementation of the evaluation procedure here
with tf.name_scope("evalBM"):
  correctBM = tf.nn.in_top_k(logits,y,1)
  accuracyBM = tf.reduce_mean(tf.cast(correctBM, tf.float32))
  accuracy_summaryBM = tf.summary.scalar('accuracyBM', accuracyBM)

In [0]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 50

# shuffle_batch() shuffle the examples in a batch before training
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch


In [24]:
with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for X_batch, y_batch in shuffle_batch(x_train, y_train, batch_size):
      sess.run(train_opBM, feed_dict={X: X_batch, y: y_batch})
    
    a_batchBM = accuracyBM.eval(feed_dict={X: X_batch, y: y_batch})
    a_validBM = accuracyBM.eval(feed_dict={X: x_valid, y: y_valid})

    print(epoch, "Batch accuracy: ", a_batchBM, "\tValidation accuracy: ", a_validBM)
    
  save_path = saver.save(sess, "./my_dnn_model.ckpt")

0 Batch accuracy:  0.9 	Validation accuracy:  0.8525
1 Batch accuracy:  0.82 	Validation accuracy:  0.8485
2 Batch accuracy:  0.9 	Validation accuracy:  0.8333
3 Batch accuracy:  0.82 	Validation accuracy:  0.8734
4 Batch accuracy:  0.86 	Validation accuracy:  0.8758
5 Batch accuracy:  0.96 	Validation accuracy:  0.863
6 Batch accuracy:  0.8 	Validation accuracy:  0.879
7 Batch accuracy:  0.92 	Validation accuracy:  0.8795
8 Batch accuracy:  0.9 	Validation accuracy:  0.8768
9 Batch accuracy:  0.9 	Validation accuracy:  0.8697
10 Batch accuracy:  0.82 	Validation accuracy:  0.875
11 Batch accuracy:  0.86 	Validation accuracy:  0.8818
12 Batch accuracy:  0.8 	Validation accuracy:  0.8702
13 Batch accuracy:  0.9 	Validation accuracy:  0.8814
14 Batch accuracy:  0.98 	Validation accuracy:  0.8705
15 Batch accuracy:  0.92 	Validation accuracy:  0.8782
16 Batch accuracy:  0.86 	Validation accuracy:  0.8858
17 Batch accuracy:  0.86 	Validation accuracy:  0.8816
18 Batch accuracy:  0.92 	Vali

In [25]:
with tf.Session() as sess:
  saver.restore(sess, save_path)
  accuracy_testBM = accuracyBM.eval(feed_dict={X: x_test, y: y_test})

print('Accuracy of dnnbenchmark: ', accuracy_testBM)


Accuracy of dnnbenchmark:  0.8729


For this part, I selected my best performing model, the one with three hidden layers, to fine tune further. By making the above improvements, I was able to successfully increase my testing accuracy. I used He initialization, seleccted ELU as the activation function for all hidden layers, implemented batch normalization for each layer, implemented a 50% dropout rate, and changed the optimization function to Adam Optimization. By making the above changes, I improved my testing accuracy from .844 to .8729. This improvement can be attributed to a number of things, most importantly the implementation of a dropout rate. This dropout rates can prevent overfitting of data, so its inclusion here is essential ot the improvement of results. 


- - -
## 5. OUTLOOK (5 pts)

Plan for the outlook of your system: This may lead to the direction of your future project:
- Did your neural network outperform other "traditional ML technique? Why/why not?
- Does your model work well? If not, which model should be further investigated?
- Do you satisfy with your system? What do you think needed to improve?



I think my model works pretty well. There are improvements that can still be made, such as further tuning of the number of neurons per layer, but overall my model preforms well. Compared to the accuracy of other models found at http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/ my model performs very well, having an .8729 accuracy score. This is significantly better than almost all of the other types of models found in the chart. There are only a few types of model that seems to perform above a .87 accuracy, including a Random Forest Classifier, Gradient Boosting Classifier, and Support Vector Classifier, as mentioned in the introduction to section 4.

Although there is always room for improvement, I am satisfied with my model, and feel a .8729 accuracy is pretty good.



- - - 
### NEED HELP?

In case you get stuck in any step in the process, you may find some useful information from:

 * Consult my lectures and/or the textbook
 * Talk to the TA, they are available and there to help you during OH
 * Come talk to me or email me <nn4pj@virginia.edu> with subject starting "CS4501 Assignment 4:...".
 * More on the Fashion-MNIST to be found here: https://hanxiao.github.io/2018/09/28/Fashion-MNIST-Year-In-Review/

Best of luck and have fun!