# Neural Network Hyperparameter Tuning - Exercise

In this exercise we will build a neural network to classify digits from the MNIST dataset, and explore how we may tune one of the model's hyperparameters to achieve better performance.

## Part 1: Loading the dataset

We will first load the MNIST data and prepare it for our model.

**Questions:**
1. Run the code given below to fetch the dataset.
2. Examine the shapes of `X` and `y`. Explain in words what the 784 features in each row of `X` represent. (Hint: the images in MNIST are of size 28 x 28).
3. Normalize the values of the elements in `X` to be floats between `0.` and `1.` by dividing by a scalar value. Also cast the values in `y` to ints.
4. Using `train_test_split` from sklearn, split the dataset into `X_train, X_test, y_train, y_test`. Use an 80-20 split. How many samples are in the train and test sets?

In [14]:
### CODE FOR QUESTION 1
from sklearn.datasets import fetch_openml
# Optionally, set data_home to where you want to download your data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)


In [15]:
from sklearn.model_selection import train_test_split
import numpy as np

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

X = X / 255.0
y = y.astype(int)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the number of samples in train and test sets
print("Number of samples in training set:", len(X_train))
print("Number of samples in test set:", len(X_test))

Shape of X: (70000, 784)
Shape of y: (70000,)
Number of samples in training set: 56000
Number of samples in test set: 14000


Each row in `X` represents one image from the MNIST dataset.
The dataset contains grayscale images of 28x28 pixels.
Each row in `X` therefore has 28*28 = 784 features, where each feature is the intensity of one pixel in the image.
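The relationship between the 784 features and the 28x28 pixel grid can be sketched in a few lines. This is a minimal illustration that assumes the images were flattened row by row (row-major order); the helper name `feature_to_pixel` is hypothetical and not part of the exercise.

```python
# Illustrative sketch: map a flattened feature index back to its pixel position.
# Assumption: each 28 x 28 image was flattened row by row (row-major order).
IMAGE_SIDE = 28  # MNIST images are 28 x 28 pixels


def feature_to_pixel(index, side=IMAGE_SIDE):
    """Return the (row, column) of the pixel stored at a flattened index."""
    if not 0 <= index < side * side:
        raise ValueError(f"index must be in [0, {side * side - 1}]")
    # Integer division gives the row, the remainder gives the column.
    return divmod(index, side)


n_features = IMAGE_SIDE * IMAGE_SIDE
print(n_features)             # 784
print(feature_to_pixel(0))    # (0, 0), top-left pixel
print(feature_to_pixel(28))   # (1, 0), first pixel of the second row
print(feature_to_pixel(783))  # (27, 27), bottom-right pixel
```

Under the same assumption, a single row of `X` converted to a NumPy array could be viewed as an image with `.reshape(28, 28)`.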

## Part 2: Building an MLP

We will now build a neural network to classify the digits. We will use a simple MLP ("multilayer perceptron") model; this is also known as a "vanilla" feedforward neural network.

An MLP is a sequential network consisting of multiple feedforward (Dense) layers with nonlinear activation functions.
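To make this description concrete, the following is a rough NumPy sketch of the forward pass of an MLP with one hidden layer. It is not how Keras implements `Dense` layers internally; the function names, the random weights, and the layer sizes (784 inputs, 50 hidden units, 10 classes) are assumptions chosen to mirror this exercise.

```python
import numpy as np


def relu(z):
    """Nonlinear activation: replace negative values with zero."""
    return np.maximum(z, 0.0)


def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass: dense layer + ReLU, then dense layer + softmax."""
    hidden = relu(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)


# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_inputs, n_hidden, n_classes = 784, 50, 10
W1 = rng.normal(scale=0.01, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

batch = rng.random((4, n_inputs))  # 4 fake "images" with values in [0, 1)
probs = mlp_forward(batch, W1, b1, W2, b2)
print(probs.shape)  # (4, 10): one probability per class for each sample
```

Without the nonlinear activation between the two layers, the composition would collapse into a single linear map, which is why the activation is an essential part of the definition.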

**Questions:**
5. Using the imports given below, create a Keras sequential model called `model` to classify the digits. Use the following hints:
  * The model should have a single hidden (Dense) layer with hidden dimension 50 and relu activation.
  * The last layer of the model is also a Dense layer. Consider what its size and activation function should be, given that MNIST is a multiclass classification task with 10 classes (recall when we use sigmoid vs. softmax activations).
  * Don't forget to use parameter `input_dim=...` for the first layer, since we are using the Keras Sequential API. Use a value that makes `model.input_shape` match the shapes of `X_train` and `X_test`.
6. Print out `model.input_shape`, `model.output_shape`, and `model.summary()`. Take a look to make sure that what you see makes sense.
7. How many parameters does the model have?

In [16]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(50, activation='relu', input_dim=X_train.shape[1]))
model.add(Dense(10, activation='softmax'))

print("Input shape:", model.input_shape)
print("Output shape:", model.output_shape)
model.summary()

total_params = model.count_params()
print("Total parameters:", total_params)


Input shape: (None, 784)
Output shape: (None, 10)
Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_8 (Dense)             (None, 50)                39250     
                                                                 
 dense_9 (Dense)             (None, 10)                510       
                                                                 
Total params: 39760 (155.31 KB)
Trainable params: 39760 (155.31 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Total parameters: 39760
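The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. The short sketch below reproduces the numbers; `dense_param_count` is a hypothetical helper written for this check, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden_params = dense_param_count(784, 50)  # hidden layer: 784 inputs -> 50 units
output_params = dense_param_count(50, 10)   # output layer: 50 inputs -> 10 classes
total_params_by_hand = hidden_params + output_params

print(hidden_params)         # 39250
print(output_params)         # 510
print(total_params_by_hand)  # 39760
```

These values agree with the `Param #` column and the total reported by `model.summary()`.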


## Part 3: Training the model

We will now train our model so that it learns to classify MNIST digits. We'll see that it performs much better on this task than the linear models we have seen before.

**Questions:**
8. Compile the model with `sparse_categorical_crossentropy` loss and `adam` optimizer. Also use parameter `metrics='accuracy'` so we can visualize the accuracy as the model trains.
9. Train the model. In `model.fit(...)`, use parameters `validation_split=0.2` and `batch_size=16`. How can we tell how many epochs we should train the model for? (Include an explanation of how you chose the number of epochs in your solution.)
10. What is the best validation loss and accuracy that your model achieved?

**Note:** If you change something and want to train your model from scratch, make sure to re-run the code that created the model (`model = Sequential(...)`) to re-initialize its weights. Otherwise, `model.fit(...)` will continue from where you left off.

In [17]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=10, batch_size=16, validation_split=0.2)


best_val_loss = min(history.history['val_loss'])
best_val_acc = max(history.history['val_accuracy'])

print("Best validation loss:", best_val_loss)
print("Best validation accuracy:", best_val_acc)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Best validation loss: 0.10578353703022003
Best validation accuracy: 0.9708035588264465


I noticed that the validation accuracy improved consistently during the first 10 epochs and began to fluctuate afterwards. From this pattern, I concluded that training for 10 epochs was a reasonable choice.
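One way to make this choice less subjective is to look for the epoch with the lowest validation loss in the recorded history. The sketch below works on a plain list such as `history.history['val_loss']`; the helper names and the example values are hypothetical and are not taken from the training run above.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss, and that loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1, val_losses[best_index]


def epochs_since_improvement(val_losses):
    """Number of epochs trained after the best validation loss was reached."""
    epoch, _ = best_epoch(val_losses)
    return len(val_losses) - epoch


# Hypothetical validation losses, for illustration only.
example_val_loss = [0.30, 0.21, 0.17, 0.15, 0.14, 0.145, 0.15, 0.16]

epoch, loss = best_epoch(example_val_loss)
print(epoch, loss)                                 # 5 0.14
print(epochs_since_improvement(example_val_loss))  # 3
```

If the validation loss has not improved for several epochs, further training is likely to overfit. Keras also offers the `EarlyStopping` callback, which applies the same idea automatically during `model.fit(...)`.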

## Part 4: Hyperparameter Tuning

In choosing our model we set a few hyperparameters, including the hidden layer dimension 50. It might have seemed like a "magic number". In fact, the best way to set hyperparameters like this is to perform a search using the validation set.

For simplicity we will just try a few values for this single hyperparameter and see what gives the best model.

**Questions:**
11. Create new models `model20` and `model100` with hidden layer dimensions of 20 and 100 respectively. Compile and train each model using the same procedure we used in part 3.
12. Out of `20, 50, 100` which hidden layer dimension is best? Explain your answer, and store the best model in a new variable `best_model`.
13. Using `best_model.evaluate(...)`, report the test set loss and accuracy of the model you chose in the previous question.

In [18]:
model20 = Sequential()
model20.add(Dense(20, activation='relu', input_dim=X_train.shape[1]))
model20.add(Dense(10, activation='softmax'))

model20.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

history20 = model20.fit(X_train, y_train, epochs=10, batch_size=16, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [19]:
best_val_loss = min(history20.history['val_loss'])
best_val_acc = max(history20.history['val_accuracy'])

print("Best validation loss:", best_val_loss)
print("Best validation accuracy:", best_val_acc)

Best validation loss: 0.15803465247154236
Best validation accuracy: 0.9544642567634583


In [20]:
model100 = Sequential()
model100.add(Dense(100, activation='relu', input_dim=X_train.shape[1]))
model100.add(Dense(10, activation='softmax'))

model100.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

history100 = model100.fit(X_train, y_train, epochs=10, batch_size=16, validation_split=0.2)

best_val_loss = min(history100.history['val_loss'])
best_val_acc = max(history100.history['val_accuracy'])

print("Best validation loss:", best_val_loss)
print("Best validation accuracy:", best_val_acc)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Best validation loss: 0.0857173502445221
Best validation accuracy: 0.9756249785423279


I trained each model for 10 epochs. The best model is the one with a hidden layer dimension of 100: it achieves the best validation loss (0.0857) and the best validation accuracy (0.9756), although its performance is not much different from that of the model with hidden dimension 50 (0.1058 and 0.9708).
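The comparison can also be written as a small selection step. The sketch below uses the best validation metrics printed in the cells above, rounded to four decimals; the dictionary layout and variable names are assumptions made for this illustration.

```python
# Best validation metrics reported above, keyed by hidden layer dimension
# (values rounded to four decimals).
results = {
    20: {"val_loss": 0.1580, "val_accuracy": 0.9545},
    50: {"val_loss": 0.1058, "val_accuracy": 0.9708},
    100: {"val_loss": 0.0857, "val_accuracy": 0.9756},
}

# Select on validation metrics only; the test set is kept for the final report.
best_by_loss = min(results, key=lambda dim: results[dim]["val_loss"])
best_by_accuracy = max(results, key=lambda dim: results[dim]["val_accuracy"])

print(best_by_loss)      # 100
print(best_by_accuracy)  # 100
```

Both criteria agree here. Since each configuration was trained only once, small gaps such as the one between 50 and 100 units may partly reflect random initialization rather than a real difference.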

In [21]:
best_model = model100

test_loss, test_accuracy = best_model.evaluate(X_test, y_test)
print("Test set loss:", test_loss)
print("Test set accuracy:", test_accuracy)

Test set loss: 0.10400651395320892
Test set accuracy: 0.9738571643829346
