# Assignment 4: Benchmarking Fashion-MNIST with Deep Neural Nets

### CS 4501 Machine Learning - Department of Computer Science - University of Virginia
"The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others." - **Zalando Research, Github Repo.**"

Fashion-MNIST is a dataset from the Zalando's article. Each example is a 28x28 grayscale image, associated with a label from 10 classes. They intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms.

![Here's an example how the data looks (each class takes three-rows):](https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png)

In this assignment, you will attempt to benchmark the Fashion-MNIST using Neural Networks. You must use it to train some neural networks on TensorFlow and predict the final output of 10 classes. For deliverables, you must write code in Python and submit this Jupyter Notebook file (.ipynb) to earn a total of 100 pts. You will gain points depending on how you perform in the following sections.


In [1]:
# You might want to use the following packages
import numpy as np
import os
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) #reduce annoying warning messages
from functools import partial
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


---
## 1. PRE-PROCESSING THE DATA (10 pts)

You can load the Fashion MNIST directly from Tensorflow. **Partition of the dataset** so that you will have 50,000 examples for training, 10,000 examples for validation, and 10,000 examples for testing. Also, make sure that you platten out each of examples so that it contains only a 1-D feature vector.

Write some code to output the dimensionalities of each partition (train, validation, and test sets).



In [2]:
# Your code goes here for this section
fmnist = tf.keras.datasets.fashion_mnist.load_data();
(x_train, y_train), (x_test, y_test) = fmnist[0], fmnist[1]


# Flattening so that it contains only a 1-D feature vector:
x_train = x_train.flatten().reshape( 60000, 28*28) 
x_test = x_test.flatten().reshape( 10000, 28*28) 


# Splitting x_train into 50,000 examples for training and 10,000 examples for validation: 
x_validation = x_train[0:10000]
y_validation = y_train[0:10000]
x_train = x_train[10000:60000]
y_train = y_train[10000:60000]


# Converting the x's to np.float32 and converting the y's to np.int32 for easier computations:
x_train = x_train.astype( np.float32 )
y_train = y_train.astype( np.int32 )
x_validation = x_validation.astype( np.float32 )
y_validation = y_validation.astype( np.int32 )
x_test = x_test.astype( np.float32 )
y_test = y_test.astype( np.int32 )


# Using Pipeline with Standard Scaler to normalize/scale our data:
pipeline = Pipeline([
        ( 'std_scaler' , StandardScaler() ),
    ])

x_train = pipeline.fit_transform( x_train )
x_validation = pipeline.fit_transform( x_validation )
x_test = pipeline.fit_transform( x_test )


# Printing shapes of x_train, y_train, x_validation, x_test and y_test:
print( "Shape of x_train:", x_train.shape )
print( "Shape of y_train:", y_train.shape )
print( "Shape of x_validation:", x_validation.shape )
print( "Shape of y_validation:", y_validation.shape )
print( "Shape of x_test:", x_test.shape )
print( "Shape of y_test:", y_test.shape )

Shape of x_train: (50000, 784)
Shape of y_train: (50000,)
Shape of x_validation: (10000, 784)
Shape of y_validation: (10000,)
Shape of x_test: (10000, 784)
Shape of y_test: (10000,)


- - -
## 2. CONSTRUCTION PHASE (30 pts)

In this section, define at least three neural networks with different structures. Make sure that the input layer has the right number of inputs. The best structure often is found through a process of trial and error experimentation:
- You may start with a fully connected network structure with two hidden layers.
- You may try a few settings of the number of nodes in each layer.
- You may try a few activation functions to see if they affect the performance.

**Important Implementation Note:** For the purpose of learning Tensorflow, you must use low-level TensorFlow API to construct the network. Usage of high-level tools (ie. Keras) is not permited. 

# **DNN 1:**

In [3]:
# First DNN uses a fully connected network structure with two hidden layers. 
# The first hidden layer has a dimensionality of 300, and the second hidden layer 
# has a dimensionality of 100. I am using the ReLu activation function. 

reset_graph() # need to reset the graph for neural network to work properly every time

n_inputs = 28*28  # Fashion-MNIST
learning_rate = 0.01
n_hidden1 = 300 # dimensionality of first layer
n_hidden2 = 100 # dimensionality of second layer 
n_outputs = 10 # 10 different integer classes (0 - 9)

# Construct placeholder for the input layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

In [4]:
with tf.name_scope("dnn1"):
    # Using 2 hidden layers, and activation function ReLu (tf.nn.relu):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

In [5]:
with tf.name_scope("loss"):
    # Computes sparse softmax cross entropy between logits and labels:
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    # Computes the mean of elements across dimensions of a tensor.
    loss = tf.reduce_mean(xentropy, name="loss")

In [6]:
with tf.name_scope("train"):
    # Optimizer implements gradient descent algorithm
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # Minimizing the loss function 
    training_op = optimizer.minimize(loss)

In [7]:
with tf.name_scope("eval"):
    # Says whether the targets are in the top K predictions.
    correct = tf.nn.in_top_k(logits, y, 1)
    # Computes the mean of elements across dimensions of a tensor.
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

# **DNN 2:**

In [8]:
# Second DNN uses a fully connected network structure with 3 hidden layers. 
# The first one with a dimensionality of 300, the second one with a dimensionality 
# of 100 and the third one with a dimensionality of 100. I am using the SELU activation 
# function. Also, I am implementing a dropout method for regularization. Using momentum 
# optimizer instead of regular gradient descent optimizer. 

reset_graph() # need to reset the graph for neural network to work properly every time


n_inputs = 28*28  # Fashion-MNIST
learning_rate = 0.01
n_hidden1 = 300 # dimensionality of first layer 
n_hidden2 = 100 # dimensionality of second layer
n_hidden3 = 100 # dimensionality of third layer

n_outputs = 10 # 10 different integer classes (0 - 9)

# Construct placeholder for the input layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")


training = tf.placeholder_with_default(False, shape=(), name='training')

dropout_rate = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope("dnn2"):
    # Using 3 hidden layers, and activation function SELU (tf.nn.selu) and dropping
    # at a rate of 0.5 

    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.selu,name="hidden3")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.selu, name="hidden4")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    
    hidden3 = tf.layers.dense(hidden2_drop, n_hidden3, activation=tf.nn.selu, name="hidden5")
    hidden3_drop = tf.layers.dropout(hidden3, dropout_rate, training=training)

    logits = tf.layers.dense(hidden3_drop, n_outputs, name="outputs1")


In [9]:
with tf.name_scope("loss"):
    # Computes sparse softmax cross entropy between logits and labels:
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    # Computes the mean of elements across dimensions of a tensor.
    loss = tf.reduce_mean(xentropy, name="loss")

In [10]:
with tf.name_scope("train"):
    # Optimizer implements the momentum algorithm
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    # Minimizing the loss function 
    training_op = optimizer.minimize(loss)       
      

In [11]:
with tf.name_scope("eval"):
    # Says whether the targets are in the top K predictions.
    correct = tf.nn.in_top_k(logits, y, 1)
    # Computes the mean of elements across dimensions of a tensor.
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

# **DNN 3:**


In [40]:
# Third DNN uses a fully connected network structure with 5 hidden layers. 
# The first one with a dimensionality of 300, the second one with a dimensionality 
# of 100, the third one with a dimensionality of 50, the fourth one with a 
# dimensionality of 50 and the fifth one with a dimensionality of 50. 
# I am using the SELU activation function. I am also training with Gradient Clipping, 
# because large updates to weights during training can cause a numerical overflow or underflow.


reset_graph() # need to reset the graph for neural network to work properly every time


n_inputs = 28*28  # Fashion-MNIST
learning_rate = 0.01
n_hidden1 = 300 # dimensionality of first layer 
n_hidden2 = 100 # dimensionality of second layer 
n_hidden3 = 50 # dimensionality of third layer 
n_hidden4 = 50 # dimensionality of fourth layer 
n_hidden5 = 50 # dimensionality of fifth layer 
n_outputs = 10

# Construct placeholder for the input layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn3"):
    # Using 5 layers with activation function Elu
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden6")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.elu, name="hidden7")
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.elu, name="hidden8")
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.elu, name="hidden9")
    hidden5 = tf.layers.dense(hidden4, n_hidden5, activation=tf.nn.elu, name="hidden10")
    logits = tf.layers.dense(hidden5, n_outputs, name="outputs2")


In [41]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

In [42]:
# Training with Gradient Clipping so that gradients don't "explode". First we clip by 
# using the clip_by_value() function. Then we apply the clip with the threshold. 

threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)  

In [43]:
with tf.name_scope("eval"):
    # Says whether the targets are in the top K predictions.
    correct = tf.nn.in_top_k(logits, y, 1)
    # Computes the mean of elements across dimensions of a tensor.
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

- - -
## 3. EXECUTION PHASE (30 pts)

After you construct the three models of neural networks, you can compute the performance measure as the class accuracy. You will need to define the number of epochs and size of the training batch. You also might need to reset the graph each time your try a different model. To save time and avoid retraining, you should save the trained model and load it from disk to evaluate a test set. Pick the best model and answer the following:
- Which model yields the best performance measure for your dataset? Provide a reason why it yields the best performance.
- Why did you pick this many hidden layers?
- Provide some justifiable reasons for selecting the number of neurons per hidden layers. 
- Which activation functions did you use?

In the next session you will get a chance to finetune it further .



In [28]:
# Your code goes here
init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 200

# shuffle_batch() shuffle the examples in a batch before training
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch


In [29]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        
        for X_batch, y_batch in shuffle_batch(x_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: x_validation, y: y_validation})
        print(epoch, "Validation accuracy:", accuracy_val)        
        
    save_path = saver.save(sess, "./my_dnn_model.ckpt")

0 Validation accuracy: 0.8576
1 Validation accuracy: 0.8584
2 Validation accuracy: 0.8669
3 Validation accuracy: 0.875
4 Validation accuracy: 0.8823
5 Validation accuracy: 0.8776
6 Validation accuracy: 0.8843
7 Validation accuracy: 0.8795
8 Validation accuracy: 0.8845
9 Validation accuracy: 0.8854
10 Validation accuracy: 0.8885
11 Validation accuracy: 0.889
12 Validation accuracy: 0.8899
13 Validation accuracy: 0.887
14 Validation accuracy: 0.8816
15 Validation accuracy: 0.8927
16 Validation accuracy: 0.8847
17 Validation accuracy: 0.8882
18 Validation accuracy: 0.8839
19 Validation accuracy: 0.8934


In [30]:
with tf.Session() as sess:
    saver.restore(sess, "./my_dnn_model.ckpt")
    acc_test = accuracy.eval(feed_dict={X: x_test, y: y_test})
    print("Final test accuracy: {:.2f}%".format(acc_test * 100))

Final test accuracy: 88.80%


# **Answers to questions:**
- Which model yields the best performance measure for your dataset? Provide a reason why it yields the best performance. 
    
The model that yielda the best accuracy is DNN # 2, because it has a final test accuracy of 88%, which is only abou t 1% away from the benchmark (89.7%). This was a fairly simple model composed of 3 hidden layers. The first, second and third layers each had 300 neurons, 100 neurons and 100 neurons, respectively. In my opinion, the model yields the best performance because it is not to complex (only using 3 hidden layers), and I implemented the dropout regularization technique in order to avoid overfitting on my DNN. 

    
    
- Why did you pick this many hidden layers?

I chose only 3 hidden layers because I wanted to keep my model simple, easy to implement and computationally convenient. With too many hidden layers, the computational complexity increases drastically and the time that it takes to train the model also increases greatly.

    
    
- Provide some justifiable reasons for selecting the number of neurons per hidden layers. 

For my first, second and third layers, I chose 300 neurons, 100 neurons and 100 neurons respectively. I chose these many neurons per layer because I wanted to reduce the number of neurons after the first layer by 1/3 to simplify the model and make it more computationally feasable to compute. I also did some trial and error, and found that 100 neurons on the last layer resulted in better final test accuracies, so I decided to stick with it. 

    

- Which activation functions did you use?

I used the SElU activation function for every one of my layers, which seemed to work pretty well. After trying ReLu on every layer, I found that the final test accuracies were much better with the SELU activation function. This might be due to SELU's slightly better performance when determining the activation node probabilities, when compared to ReLu.





- - -
## 4. FINETUNING THE NETWORK (25 pts)

The best performance on the Fashion MNIST of a non-neural-net classifier is the Support Vector Classifier {"C":10,"kernel":"poly"} with 0.897 accuracy. In this section, you will see how close you can get to that accuracy, or (better yet) beat it! You will be able to see the performance of other ML methods below:
http://fashion-mnist.s3-website.eu-central-1.amazonaws.com

Use the best model from the previous section and see if you can improve it further. To improve the performance of your model, You must make some modifications based upon the practical guidelines discuss in class. Here are a few decisions about the recommended network configurations you have to make:
1. Initialization: Use He Initialization for your model
2. Activation: Add ELU as the activation function throughout your hidden layers
3. Normalization: Incorporate the batch normalization at every layer
4. Regularization: Configure the dropout policy at 50% rate
5. Optimization: Change Gradient Descent into Adam Optimization
6. Your choice: make any other changes in 1-5 you deem necessary

Keep in mind that the execution phase is essentially the same, so you can just run it from the above. See how much you gain in classification accuracy. Provide some justifications for the gain in performance. 






## **Finetuned Network**

In [31]:
reset_graph() # need to reset the graph for neural network to work properly every time


n_inputs = 28*28  # Fashion-MNIST
learning_rate = 0.01
n_hidden1 = 300 # dimensionality of first layer 
n_hidden2 = 100 # dimensionality of second layer
n_hidden3 = 100 # dimensionality of third layer

n_outputs = 10 # 10 different integer classes (0 - 9)

# Momentum for the batch normalization 
batch_norm_momentum = 0.9

# Construct placeholder for the input layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")


training = tf.placeholder_with_default(False, shape=(), name='training')

# Dropping at a rate of 0.5
dropout_rate = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)


with tf.name_scope("dnnBenchmark"):
    # Using 3 hidden layers, and activation function ELU (tf.nn.elu) and dropping
    # at a rate of 0.5. Also using kernel_initializer=he_init for He initialization. 
    # We are also using Adam Optimizer instead of Gradient Descent. 
    
    # He Initialization
    he_init = tf.variance_scaling_initializer()

    # To avoid repeating the same parameters over and over again, we can use Python's partial() function
    my_batch_norm_layer = partial(
            tf.layers.batch_normalization, # Using batch normalization 
            training=training,
            momentum=batch_norm_momentum)

    my_dense_layer = partial(
            tf.layers.dense,
            kernel_initializer=he_init)

    hidden1 = my_dense_layer(X_drop, n_hidden1, name="hidden11")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training) # Dropping at a rate of 0.5
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1_drop)) # Using elu as an activation function 
    
    hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden12")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training) # Dropping at a rate of 0.5
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2_drop)) # Using elu as an activation function 
    
    hidden3 = my_dense_layer(bn2, n_hidden3, name="hidden13")
    hidden3_drop = tf.layers.dropout(hidden3, dropout_rate, training=training) # Dropping at a rate of 0.5
    bn3 = tf.nn.elu(my_batch_norm_layer(hidden3_drop)) # Using elu as an activation function 
    
    
    logits_before_bn = my_dense_layer(bn3, n_outputs, name="outputs3")
    logits = my_batch_norm_layer(logits_before_bn)



In [32]:
with tf.name_scope("loss"):
    # Computes sparse softmax cross entropy between logits and labels:
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    # Computes the mean of elements across dimensions of a tensor.
    loss = tf.reduce_mean(xentropy, name="loss")

In [33]:
with tf.name_scope("train"):
    # Optimizer implements the momentum algorithm
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) # Using Adam Optimizer
    # Minimizing the loss function 
    training_op = optimizer.minimize(loss)       

In [34]:
with tf.name_scope("eval"):
    # Says whether the targets are in the top K predictions.
    correct = tf.nn.in_top_k(logits, y, 1)
    # Computes the mean of elements across dimensions of a tensor.
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

- - -
## 5. OUTLOOK (5 pts)

Plan for the outlook of your system: This may lead to the direction of your future project:
- Did your neural network outperform other "traditional ML technique? Why/why not?

The "traditional ML technique" benchmark had an accuracy of 89.7%. After all of the finetuning, I was able to increase the accuracy of my DNN to 88.8%. This is about 1% away from the benchmark, which in my opinion is not so bad. I think I did not outperform the benchmark because I did not have enough time to try all the different combinations of hyperparameters like drop out rates, momentum rates, etc. I also only used 3 hidden layers, which might have affected the accuracy of the model. My intention was to keep the model simple, but maybe if I tried a more complex model with more layers and different combinations of neurons in each layer, I could have improved the performance of the model.
- Does your model work well? If not, which model should be further investigated?

The finetuned model seems to work quite well. In fact, all of the 4 DNNs that I created had accuracies of over 86%. The worst model, which had an accuracy of 86% was using the ReLu activation function. So I suspect thats why it performed so poorly compared to the others. 

- Do you satisfy with your system? What do you think needed to improve?

I am certainly satisfied with my system but I think that if I had more time, I could perform a trial and error process to find the best possible DNN with the right number of layers and right number of nodes for each layer. I could also do a trial and error process in order to find the best hyper parameters and activation functions. 







- - - 
### NEED HELP?

In case you get stuck in any step in the process, you may find some useful information from:

 * Consult my lectures and/or the textbook
 * Talk to the TA, they are available and there to help you during OH
 * Come talk to me or email me <nn4pj@virginia.edu> with subject starting "CS4501 Assignment 4:...".
 * More on the Fashion-MNIST to be found here: https://hanxiao.github.io/2018/09/28/Fashion-MNIST-Year-In-Review/

Best of luck and have fun!