# Exercise 6 - Hyperparameter Optimization

In this exercise, we look at how to find the *optimal hyperparameters* for a simple CNN model that classifies the MNIST dataset.

## Data Preparation

Load tensorflow 2:

In [None]:
#Check if colab is running
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  %tensorflow_version 2.x

#import TF
import tensorflow as tf
from platform import python_version
print("Tensorflow version", tf.__version__)
print("Python version =",python_version())

Tensorflow version 2.5.0
Python version = 3.7.10


In [None]:
# Most of the libraries we will use in this exercise

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D, InputLayer

Let's first load and preprocess the data as we did in exercise 5:

In [None]:
np.random.seed(1338)  # for reproducibility

# input image dimensions
img_rows, img_cols = 28, 28
# number of classes
nb_classes = 10

#Load the data
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

In [None]:
num_train_samples = len(X_train)
print("Number of training samples:", num_train_samples)
num_test_samples = len(X_test)
print("Number of test samples:", num_test_samples)

Number of training samples: 60000
Number of test samples: 10000


In [None]:
#Take just 20k examples for training for speed reasons
X_train_reduced = X_train[:20000]
Y_train_reduced = Y_train[:20000]

Standard scalers are awkward to apply to image data, and since we know that the pixel values lie between 0 and 255, we can simply rescale by dividing by 255. The target vector is converted to a categorical (one-hot) representation.

In [None]:
print(X_train.min(), X_train.max())

# scale with global maximum
X_train_scaled = X_train_reduced/255
X_test_scaled = X_test/255

# add channel dimension (loaded images are [x,y] only, not [x,y,channels])
X_train_ready = X_train_scaled[..., np.newaxis]
X_test_ready = X_test_scaled[..., np.newaxis]

#convert target vector
Y_train_ready = to_categorical(Y_train_reduced, nb_classes)
Y_test_ready = to_categorical(Y_test, nb_classes)

0 255


## Task 1 (recap): Model preparation
Define a convolutional network as a starting point for our optimization. Design it following these points:
- Use two convolutional layers followed by a pooling layer.
- Use one hidden dense layer.
- Use dropout before and after the hidden dense layer.
- How many output nodes do you need? Add a dense output layer accordingly.
- Which activation function should you use in the output layer?
- Choose a loss function that is suitable for our problem at hand.
- Compile the model and look at the summary. How many parameters do you have?

In [None]:
# Start here
model = Sequential([
                    # add layers here
])


### A first estimate: how accurately can the model predict the digits?
Before we look for better ways to build the network, we just run it as is.

In [None]:
x_train, x_val, y_train, y_val = train_test_split(X_train_ready, Y_train_ready, train_size=.7, test_size=.3)

In [None]:
fit_results = model.fit(
    x=x_train,
    y=y_train,
    batch_size=128,
    epochs=15,
    validation_data=(x_val, y_val)
)

In [None]:
loss, accuracy = model.evaluate(x_train, y_train, verbose=0)
print('Training Loss:', loss)
print('Training Accuracy:', accuracy)

val_loss, val_accuracy = model.evaluate(x_val, y_val, verbose=0)
print('Validation Loss:', val_loss)
print('Validation Accuracy:', val_accuracy)

plt.plot(fit_results.history['loss'], label='Training')
plt.plot(fit_results.history['val_loss'], label='Validation')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.plot(fit_results.history['accuracy'], label='Training')
plt.plot(fit_results.history['val_accuracy'], label='Validation')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

### Quick recap question:
The accuracies returned by the model in the history object and obtained via evaluate() are very different for the training data but not for the validation data. Why is this?

In [None]:
print("Training accuracy from history:", fit_results.history['accuracy'][-1])
print("Training accuracy with evaluate():", accuracy)

print("Validation accuracy from history:", fit_results.history['val_accuracy'][-1])
print("Validation accuracy with evaluate():", val_accuracy)

Training accuracy from history: 0.9687857031822205
Training accuracy with evaluate(): 0.9957143068313599
Validation accuracy from history: 0.9800000190734863
Validation accuracy with evaluate(): 0.9800000190734863


This will not be much of a problem in this exercise, as we will mostly be looking at final or best validation accuracies and not so much at loss curves (for now). If you do not remember a possible approach to resolve this discrepancy, look back at exercise 4.

A callback we will definitely need, though, is the model checkpoint callback. With the model checkpoint callback we can store the best network during training.

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

# Hyperparameter optimization: Grid search
Updating parameters manually and re-running the training takes a long time, and keeping track of the progress is difficult. If you want to search for an optimal hyperparameter configuration, a grid search over a parameter space is the better choice. The simplest form of grid search is a loop over the candidate values of a parameter space.

## Loop over a parameter - number of filters

As a first test, we will search over different numbers of filters for our two convolutional layers. We use the following configuration: we set the number of filters for the first convolutional layer and give the second convolutional layer twice as many.

In [None]:

search_results = []

filters_candidates = [
  # very limited test, you can explore more!
  16, 32, 64
]

for nb_filters in filters_candidates:
  print("Start training for nb_filters=", nb_filters)

  ########################################
  # Use your own model here!!
  ########################################
  model = Sequential([
    Conv2D(
        nb_filters, kernel_size=2, padding='valid',
        activation='relu', input_shape=x_train[0].shape
    ),
    Conv2D(
        int(nb_filters*2), kernel_size=2,
        padding='valid', activation='relu'
    ),
    MaxPooling2D(pool_size=2),
    Dropout(0.5),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax')
  ])
  model.compile(
      loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']
  )
  ########################################

  # we choose our best model as the one having the highest validation accuracy
  filepath = f"cnn_paramsearch_filters_{nb_filters}.hdf5"
  checkpoint = ModelCheckpoint(
      filepath, monitor='val_accuracy', verbose=0, save_best_only=True, mode='max'
  )

  fit_results = model.fit(
    x=x_train,
    y=y_train,
    batch_size=128,
    # reduce number of epochs for speed reasons --> should be higher!
    epochs=10,
    validation_data=(x_val, y_val),
    callbacks=[checkpoint],
    verbose=0
  )

  # extract the best validation scores
  best_val_epoch    = np.argmax(fit_results.history['val_accuracy'])
  best_val_acc      = np.max(fit_results.history['val_accuracy'])
  best_val_acc_loss = fit_results.history['val_loss'][best_val_epoch]

  # get correct training accuracy
  best_model = load_model(filepath)
  best_val_acc_train_loss, best_val_acc_train_acc = best_model.evaluate(x_train, y_train)

  # store results
  search_results.append({
      'nb_filters': nb_filters,
      'best_val_acc_train_acc': best_val_acc_train_acc,
      'best_val_acc': best_val_acc,
      'best_val_acc_train_loss': best_val_acc_train_loss,
      'best_val_acc_loss': best_val_acc_loss,
      'best_val_epoch': best_val_epoch
  })

We will inspect the results using pandas DataFrames:

In [None]:
resultsDF = pd.DataFrame(search_results)

# sort values
resultsDF.sort_values('best_val_acc', ascending=False)

## Grid search over a more complex parameter space
The simplest way to extend the example above to a grid containing more parameters is the following one:

In [None]:

search_results = []

filters_candidates = [
  16, 32, 64
]
dense_candidates = [
  32, 64, 128
]
dropout_candidates = [
  .3, .4, .5
]

for nb_filters in filters_candidates:
  for nb_dense in dense_candidates:
    for dropout in dropout_candidates:

      print(f"Start training for (filters={nb_filters} - dense={nb_dense} - dropout={dropout})")

      ########################################
      # Use your own model here!!
      ########################################
      model = Sequential([
        Conv2D(
            nb_filters, kernel_size=2, padding='valid',
            activation='relu', input_shape=x_train[0].shape
        ),
        Conv2D(
            int(nb_filters*2), kernel_size=2,
            padding='valid', activation='relu'
        ),
        MaxPooling2D(pool_size=2),
        Dropout(dropout),
        Flatten(),
        Dense(nb_dense, activation='relu'),
        Dropout(dropout),
        Dense(10, activation='softmax')
      ])
      model.compile(
          loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']
      )
      ########################################

      # we choose our best model as the one having the highest validation accuracy
      filepath = f"cnn_paramsearch_filters_f={nb_filters}_dn={nb_dense}_do={dropout}.hdf5"
      checkpoint = ModelCheckpoint(
          filepath, monitor='val_accuracy', verbose=0, save_best_only=True, mode='max'
      )

      fit_results = model.fit(
        x=x_train,
        y=y_train,
        batch_size=128,
        # reduce number of epochs for speed reasons --> should be higher!
        epochs=10,
        validation_data=(x_val, y_val),
        callbacks=[checkpoint],
        verbose=0
      )

      # extract the best validation scores
      best_val_epoch    = np.argmax(fit_results.history['val_accuracy'])
      best_val_acc      = np.max(fit_results.history['val_accuracy'])
      best_val_acc_loss = fit_results.history['val_loss'][best_val_epoch]

      # get correct training accuracy
      best_model = load_model(filepath)
      best_val_acc_train_loss, best_val_acc_train_acc = best_model.evaluate(x_train, y_train)

      # store results
      search_results.append({
          'nb_filters': nb_filters,
          'nb_dense': nb_dense,
          'dropout': dropout,
          'best_val_acc_train_acc': best_val_acc_train_acc,
          'best_val_acc': best_val_acc,
          'best_val_acc_train_loss': best_val_acc_train_loss,
          'best_val_acc_loss': best_val_acc_loss,
          'best_val_epoch': best_val_epoch
      })

In [None]:
resultsDF = pd.DataFrame(search_results)

# sort values
resultsDF.sort_values('best_val_acc', ascending=False)

Let's include another column that denotes the difference between train and val score:

In [None]:
resultsDF['delta_acc'] = (resultsDF['best_val_acc_train_acc']-resultsDF['best_val_acc'])/resultsDF['best_val_acc']
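
As a small worked example of the relative difference computed above, the following sketch applies the same formula to a single pair of accuracies. The helper name `relative_gap` is hypothetical, and the two numbers are illustrative values taken from one row of a result table that appears later in this exercise.

```python
def relative_gap(train_acc, val_acc):
    """Relative difference between training and validation accuracy."""
    return (train_acc - val_acc) / val_acc

# Illustrative values; any pair of accuracies works the same way
gap = relative_gap(0.983286, 0.975)
print(f"delta_acc = {gap:.4f}")
```

A positive value means the model scores better on the training data than on the validation data; a large positive value can hint at overtraining.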

A seaborn pairplot can be useful to inspect how the accuracies depend on the numerical parameters:

In [None]:
import seaborn as sns
sns.pairplot(resultsDF, x_vars=['nb_filters', 'nb_dense', 'dropout', ], y_vars=['best_val_acc', 'best_val_acc_train_acc', 'delta_acc'], kind='reg',  height=2)

### Task 2: Grid search interpretation
- Which parameter combination shows the best validation loss?
- How do validation and train loss compare to each other?
- What might be a problem with choosing the hyperparameters based on the validation loss?
- How might you further improve the robustness and generalization performance?



Space for answers

Testing different parameters with nested loops, as you just did, quickly becomes tedious. We can improve on this by using a dictionary to pass the parameter space in a more convenient way. Using `itertools.product`, we can produce all combinations from a dictionary containing lists of candidates:

In [None]:
# an even more reduced parameter space, but this time with structural changes
param_space = {
    'num_conv_layers': [1,2,3],
    'max_filters': [32, 64],
    'dropout': [ .4, .5],
    'dense_nodes': [32, 64]
}

# this handy tool generates all combinations of values from the grid above
import itertools
value_combis = itertools.product(*[v for v in param_space.values()])

param_combis = []
for combi in value_combis:
  param_combis.append({key: value for key, value in zip(param_space.keys(), combi)})

# List comprehension short form:
# param_combis = [{key:value for key, value in zip(param_space.keys(), combi)} for combi in value_combis]

print(f"We have a total of {len(param_combis)} combinations:")
print(param_combis)

We have a total of 24 combinations:
[{'num_conv_layers': 1, 'max_filters': 32, 'dropout': 0.4, 'dense_nodes': 32}, {'num_conv_layers': 1, 'max_filters': 32, 'dropout': 0.4, 'dense_nodes': 64}, {'num_conv_layers': 1, 'max_filters': 32, 'dropout': 0.5, 'dense_nodes': 32}, {'num_conv_layers': 1, 'max_filters': 32, 'dropout': 0.5, 'dense_nodes': 64}, {'num_conv_layers': 1, 'max_filters': 64, 'dropout': 0.4, 'dense_nodes': 32}, {'num_conv_layers': 1, 'max_filters': 64, 'dropout': 0.4, 'dense_nodes': 64}, {'num_conv_layers': 1, 'max_filters': 64, 'dropout': 0.5, 'dense_nodes': 32}, {'num_conv_layers': 1, 'max_filters': 64, 'dropout': 0.5, 'dense_nodes': 64}, {'num_conv_layers': 2, 'max_filters': 32, 'dropout': 0.4, 'dense_nodes': 32}, {'num_conv_layers': 2, 'max_filters': 32, 'dropout': 0.4, 'dense_nodes': 64}, {'num_conv_layers': 2, 'max_filters': 32, 'dropout': 0.5, 'dense_nodes': 32}, {'num_conv_layers': 2, 'max_filters': 32, 'dropout': 0.5, 'dense_nodes': 64}, {'num_conv_layers': 2, 'max
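
The total of 24 follows directly from the grid: the number of combinations is the product of the lengths of all candidate lists. A minimal sketch of that arithmetic, reusing the `param_space` dictionary from the cell above (the variable name `n_combinations` is only for illustration):

```python
param_space = {
    'num_conv_layers': [1, 2, 3],
    'max_filters': [32, 64],
    'dropout': [.4, .5],
    'dense_nodes': [32, 64],
}

# Multiply the number of candidates of every parameter: 3 * 2 * 2 * 2
n_combinations = 1
for candidates in param_space.values():
    n_combinations *= len(candidates)

print(n_combinations)
```

Every additional parameter multiplies the number of trainings, which is why the grids in this exercise are kept small.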

With this, we can write the search a bit more condensed:

In [None]:
search_results = []

for idx, params in enumerate(param_combis):

    print(f"Start run {idx+1}/{len(param_combis)}: Parameters: {params}")


    model = Sequential()
    model.add(InputLayer(x_train[0].shape))

    filters = params['max_filters']
    for layer in range(params['num_conv_layers']):
      model.add(Conv2D(
          filters, kernel_size=2, padding='valid',
          activation='relu'
      ))
      filters //= 2  # integer division keeps the filter count an int

    for layer in [
      MaxPooling2D(pool_size=2),
      Dropout(params['dropout']),
      Flatten(),
      Dense(params['dense_nodes'], activation='relu'),
      Dropout(params['dropout']),
      Dense(10, activation='softmax')
    ]:
      model.add(layer)

    model.compile(
        loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']
    )
    ########################################

    # we choose our best model as the one having the highest validation accuracy
    # join the parameters with "_" so that the file name stays readable
    string_config = "_".join(f"{key}={value}" for key, value in params.items())
    filepath = f"cnn_paramsearch_filters_{string_config}.hdf5"
    checkpoint = ModelCheckpoint(
        filepath, monitor='val_accuracy', verbose=0, save_best_only=True, mode='max'
    )

    fit_results = model.fit(
      x=x_train,
      y=y_train,
      batch_size=128,
      # reduce number of epochs for speed reasons --> should be higher!
      epochs=5,
      validation_data=(x_val, y_val),
      callbacks=[checkpoint],
      verbose=0
    )

    # extract the best validation scores
    best_val_epoch    = np.argmax(fit_results.history['val_accuracy'])
    best_val_acc      = np.max(fit_results.history['val_accuracy'])
    best_val_acc_loss = fit_results.history['val_loss'][best_val_epoch]

    # get correct training accuracy
    best_model = load_model(filepath)
    best_val_acc_train_loss, best_val_acc_train_acc = best_model.evaluate(x_train, y_train)

    # store results
    search_results.append({
        **params,
        'best_val_acc_train_acc': best_val_acc_train_acc,
        'best_val_acc': best_val_acc,
        'best_val_acc_train_loss': best_val_acc_train_loss,
        'best_val_acc_loss': best_val_acc_loss,
        'best_val_epoch': best_val_epoch
    })

In [None]:
resultsDF = pd.DataFrame(search_results)

# sort values
resultsDF.sort_values('best_val_acc', ascending=False)

| | num_conv_layers | max_filters | dropout | dense_nodes | best_val_acc_train_acc | best_val_acc | best_val_acc_train_loss | best_val_acc_loss | best_val_epoch |
|---|---|---|---|---|---|---|---|---|---|
| 13 | 2 | 64 | 0.4 | 64 | 0.983286 | 0.975 | 0.056509 | 0.086908 | 4 |
| 21 | 3 | 64 | 0.4 | 64 | 0.981143 | 0.9715 | 0.063531 | 0.091963 | 4 |
| 15 | 2 | 64 | 0.5 | 64 | 0.975214 | 0.967 | 0.082619 | 0.109593 | 4 |
| 23 | 3 | 64 | 0.5 | 64 | 0.971214 | 0.966833 | 0.096322 | 0.118294 | 4 |
| 17 | 3 | 32 | 0.4 | 64 | 0.971286 | 0.966167 | 0.094781 | 0.110285 | 4 |
| 20 | 3 | 64 | 0.4 | 32 | 0.973286 | 0.965667 | 0.088493 | 0.117555 | 4 |
| 12 | 2 | 64 | 0.4 | 32 | 0.976714 | 0.965167 | 0.085527 | 0.116529 | 4 |
| 11 | 2 | 32 | 0.5 | 64 | 0.970786 | 0.963167 | 0.104016 | 0.127279 | 4 |
| 5 | 1 | 64 | 0.4 | 64 | 0.974714 | 0.960667 | 0.090076 | 0.132021 | 4 |
| 14 | 2 | 64 | 0.5 | 32 | 0.968857 | 0.959167 | 0.114533 | 0.138049 | 4 |


## Task 3: Include loss curves
You can also return the loss curves from your search to inspect them and search for overtraining. Include it in the search above and compare the loss curves for training and validation for the best three combinations.

Bonus: We use dropout, therefore it is more correct if we use a callback function to evaluate the training loss and accuracy after the end of every epoch.
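
One possible way to approach the first part of this task is sketched below, under the assumption that each result dictionary in the search loop additionally stores the curves, e.g. `'history': fit_results.history`. The helper `top_runs` and the entries in `demo_results` are hypothetical; the numbers are made-up placeholders, not results of a real training.

```python
def top_runs(search_results, k=3, key='best_val_acc'):
    """Return the k result entries with the highest value of `key`."""
    return sorted(search_results, key=lambda run: run[key], reverse=True)[:k]

# Placeholder entries standing in for the dictionaries collected in the search
demo_results = [
    {'run': 0, 'best_val_acc': 0.95, 'history': {'loss': [0.9, 0.4], 'val_loss': [0.7, 0.5]}},
    {'run': 1, 'best_val_acc': 0.97, 'history': {'loss': [0.8, 0.3], 'val_loss': [0.6, 0.4]}},
    {'run': 2, 'best_val_acc': 0.93, 'history': {'loss': [1.0, 0.6], 'val_loss': [0.8, 0.7]}},
    {'run': 3, 'best_val_acc': 0.96, 'history': {'loss': [0.9, 0.2], 'val_loss': [0.7, 0.6]}},
]

best_three = top_runs(demo_results)
for run in best_three:
    history = run['history']
    # A growing gap between validation and training loss hints at overtraining
    gap = history['val_loss'][-1] - history['loss'][-1]
    print(f"run {run['run']}: final val_loss - loss = {gap:.2f}")
```

In the notebook, the two lists in `history` could be passed to `plt.plot` for each of the selected runs instead of printing the final gap.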


In [None]:
# Space for solutions

# Cross-validation - the grid search's best companion

The grid search we have now up and running is a really nice tool to test different hyperparameter combinations and inspect how the different parameters affect the training outcome.

By using the validation scores to evaluate the generalization performance of our model, we have improved on inspecting only the training score. But especially for rather small data sets, this can introduce another form of bias! If we choose a model based on the validation score, it might simply work very well for this particular validation data set by coincidence, without any real generalization being achieved. This is quite difficult to catch. To investigate it, one has to test the performance on a variety of validation data sets and use their average score rather than the result from a single test. And this is what cross-validation is about.
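
The selection effect described above can be illustrated with a small simulation. This is a simplified sketch, not part of the exercise code: it assumes that all configurations have exactly the same true accuracy (`TRUE_ACC`), that the validation set is small (`N_VAL`), and that the measurements are independent of each other. All three constants are arbitrary choices for illustration.

```python
import random

random.seed(0)

TRUE_ACC = 0.95   # assumed identical true accuracy of every configuration
N_VAL = 200       # assumed (small) number of validation samples
N_CONFIGS = 24    # number of configurations that are compared

def measured_accuracy():
    """Accuracy measured on N_VAL random samples for a model with TRUE_ACC."""
    hits = sum(random.random() < TRUE_ACC for _ in range(N_VAL))
    return hits / N_VAL

scores = [measured_accuracy() for _ in range(N_CONFIGS)]
print("mean measured accuracy:", sum(scores) / len(scores))
print("best measured accuracy:", max(scores))
```

Although no configuration is truly better than another, the best measured score will typically lie above the mean, purely by chance. Picking the winner by a single validation score therefore tends to give an optimistic estimate.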

### Basics of cross validation
For a cross validation, the *training data* is not divided just once into one training and one validation set; instead, several different splits are performed. For a *k-fold cross validation*, the data is split into *k* parts, then the model with a certain hyperparameter configuration is trained on *k-1* of the parts and evaluated on the remaining part. This is repeated until every part has served as validation set once. The overall score can then be computed, e.g., by using the average validation score.

<img src="https://cdn-images-1.medium.com/max/800/1*HITaUenDsDuzjAWO0zZyxg.png" height=300>

(image: https://alexforrest.github.io/you-might-be-leaking-data-even-if-you-cross-validate.html)

This way, the resulting score is more likely to reflect the performance we would see on a truly unknown data set like the test data set. As the title of the image source suggests, there are more sophisticated ways of improving the evaluation of our model's generalization performance, but cross validation certainly is a good start!
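
The k-fold splitting described above can be sketched in a few lines of plain Python. This is a hand-written illustration of the idea (consecutive folds, no shuffling), not the scikit-learn implementation; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k consecutive folds."""
    indices = list(range(n_samples))
    # Distribute the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Each sample ends up in the validation part of exactly one fold and in the training part of the other k-1 folds.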

`scikit-learn` provides a simple API for performing k-fold data splits, similar to the `train_test_split` function.

In [None]:
# yes, we did already import this earlier
from sklearn.model_selection import StratifiedKFold

The *StratifiedKFold* preserves the relative occurrences of the different classes within each split. For this classification problem, we will therefore use it rather than the plain KFold.
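
To make the idea of stratification concrete, here is a minimal hand-rolled sketch that distributes the samples of each class over the folds in round-robin fashion. It only illustrates the principle and is not how scikit-learn implements `StratifiedKFold`; the function name `stratified_folds` and the toy labels are assumptions for this example.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign every sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        # Round robin: consecutive samples of a class go to consecutive folds
        for position, idx in enumerate(idxs):
            folds[position % k].append(idx)
    return folds

labels = [0] * 6 + [1] * 3   # imbalanced toy labels: six zeros, three ones
folds = stratified_folds(labels, k=3)
for fold in folds:
    print(sorted(fold), Counter(labels[i] for i in fold))
```

Every fold contains the two classes in the same 2:1 ratio as the full toy data set, which a purely consecutive split of these labels would not achieve.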

We will use cross-validation with k=3 (for speed) and compute the average validation accuracy over the folds as the score used to find the best parameter set. The best metric for optimization depends on your problem, and finding a good one for your application is a central task of each machine learning investigation. We will not use TensorBoard for now:

In [None]:
# a smaller parameter space for the cross-validated search
param_space = {
    'num_conv_layers': [2,3,4],
    'max_filters': [32, 64],
    'dropout': [.5],
    'dense_nodes': [64]
}

value_combis = itertools.product(*[v for v in param_space.values()])

# List comprehension short form:
param_combis = [{key:value for key, value in zip(param_space.keys(), combi)} for combi in value_combis]

print(f"We have a total of {len(param_combis)} combinations:")
print(param_combis)

We have a total of 6 combinations:
[{'num_conv_layers': 2, 'max_filters': 32, 'dropout': 0.5, 'dense_nodes': 64}, {'num_conv_layers': 2, 'max_filters': 64, 'dropout': 0.5, 'dense_nodes': 64}, {'num_conv_layers': 3, 'max_filters': 32, 'dropout': 0.5, 'dense_nodes': 64}, {'num_conv_layers': 3, 'max_filters': 64, 'dropout': 0.5, 'dense_nodes': 64}, {'num_conv_layers': 4, 'max_filters': 32, 'dropout': 0.5, 'dense_nodes': 64}, {'num_conv_layers': 4, 'max_filters': 64, 'dropout': 0.5, 'dense_nodes': 64}]


In [None]:
def build_model_from_params(params):
  model = Sequential()
  model.add(InputLayer(x_train[0].shape))

  filters = params['max_filters']
  for layer in range(params['num_conv_layers']):
    model.add(Conv2D(
        filters, kernel_size=2, padding='valid',
        activation='relu'
    ))
    filters //= 2  # integer division keeps the filter count an int

  for layer in [
    MaxPooling2D(pool_size=2),
    Dropout(params['dropout']),
    Flatten(),
    Dense(params['dense_nodes'], activation='relu'),
    Dropout(params['dropout']),
    Dense(10, activation='softmax')
  ]:
    model.add(layer)

  model.compile(
      loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']
  )
  return model

search_results = []

k = 3  # number of cross-validation folds (kept small for speed)

for idx, params in enumerate(param_combis):

    print(f"Start run {idx+1}/{len(param_combis)}: Parameters: {params}")

    kf = StratifiedKFold(n_splits=k)

    # we create some lists to append to during the validation folds
    best_val_accs             = []
    best_val_acc_losses       = []
    best_val_acc_train_accs   = []
    best_val_acc_train_losses = []

    y_labels = np.argmax(Y_train_ready, axis=1)
    for k_index, (train_idx, val_idx) in enumerate(kf.split(X_train_ready, y_labels)):
      x_cv_train, x_cv_val = X_train_ready[train_idx], X_train_ready[val_idx]
      y_cv_train, y_cv_val = Y_train_ready[train_idx], Y_train_ready[val_idx]

      filepath = f"cnn_paramsearch_filters_fold={k_index}_"
      for key, value in params.items():
        filepath += key + "=" + str(value) + "_"

      filepath += '.hdf5'
      checkpoint = ModelCheckpoint(
          filepath, monitor='val_accuracy', verbose=0, save_best_only=True, mode='max'
      )

      this_model = build_model_from_params(params)

      fit_results = this_model.fit(
        x=x_cv_train,
        y=y_cv_train,
        batch_size=128,
        # reduce number of epochs for speed reasons --> should be higher!
        epochs=5,
        validation_data=(x_cv_val, y_cv_val),
        callbacks=[checkpoint],
        verbose=0
      )

      # extract the best validation scores
      best_val_epoch = np.argmax(fit_results.history['val_accuracy'])
      best_val_accs.append(np.max(fit_results.history['val_accuracy']))
      best_val_acc_losses.append(fit_results.history['val_loss'][best_val_epoch])

      # get correct training accuracy (evaluated on the training part of this fold)
      best_model = load_model(filepath)
      best_val_acc_train_loss, best_val_acc_train_acc = best_model.evaluate(x_cv_train, y_cv_train)
      best_val_acc_train_losses.append(best_val_acc_train_loss)
      best_val_acc_train_accs.append(best_val_acc_train_acc)

    # store results
    search_results.append({
        **params,
        'best_val_acc': np.mean(best_val_accs),
        'best_val_acc_sem': np.std(best_val_accs)/np.sqrt(k),
        'best_val_acc_train_acc': np.mean(best_val_acc_train_accs),
        'best_val_acc_train_acc_sem': np.std(best_val_acc_train_accs)/np.sqrt(k),
        'best_val_acc_loss': np.mean(best_val_acc_losses),
        'best_val_acc_loss_sem': np.std(best_val_acc_losses)/np.sqrt(k),
        'best_val_acc_train_loss': np.mean(best_val_acc_train_losses),
        'best_val_acc_train_loss_sem': np.std(best_val_acc_train_losses)/np.sqrt(k),
    })
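
The aggregation used in the loop above, the mean over the folds together with a standard error of the mean, can be checked in isolation. The sketch below uses only the standard library; `statistics.pstdev` is the population standard deviation and follows the same convention as `np.std` with its default `ddof=0`. The helper name `mean_and_sem` and the fold accuracies are made up for illustration.

```python
import math
import statistics

def mean_and_sem(values):
    """Mean and standard error of the mean (population standard deviation)."""
    mean = statistics.mean(values)
    sem = statistics.pstdev(values) / math.sqrt(len(values))
    return mean, sem

fold_accs = [0.96, 0.97, 0.98]   # made-up accuracies of three folds
mean, sem = mean_and_sem(fold_accs)
print(f"{mean:.4f} +/- {sem:.4f}")
```

With only a handful of folds, the sample standard deviation (`ddof=1`) gives a somewhat larger and less optimistic uncertainty; which convention to use is a design choice.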

In [None]:
resultsDF = pd.DataFrame(search_results)

# sort values
resultsDF.sort_values('best_val_acc', ascending=False)

You already know one way to inspect a 2D search:

In [None]:
offset = 0

# you will very likely have to tune this!!
ylims     = (0.92, 1)
fill_lims = (0, 0.987)
text_vertical_loc = 0.98

for num_layers, sub_df in resultsDF.groupby('num_conv_layers'):
  sub_df_entries = len(sub_df)

  x = np.arange(sub_df_entries) + offset

  errb1 = plt.errorbar(x, sub_df['best_val_acc_train_acc'], yerr=sub_df['best_val_acc_train_acc_sem'], fmt='o', color='C0', label='Training')
  errb2 = plt.errorbar(x, sub_df['best_val_acc'], yerr=sub_df['best_val_acc_sem'], fmt='o', color='C1', label='Validation')

  plt.fill_betweenx(fill_lims, x[0]-0.25, x[-1]+0.25, alpha=.3)
  plt.text((x[0]+x[-1])/2,text_vertical_loc , "$n_\\mathrm{conv}$="+str(num_layers), horizontalalignment='center')
  offset += sub_df_entries

plt.xlabel('Max. filters')

# max_filters is a list, a list times a number (number of conv layers) leads to the list being repeated!
xlabels = param_space['max_filters']*len(param_space['num_conv_layers'])
plt.xticks(ticks=np.arange(len(xlabels)), labels=xlabels)

plt.ylabel('Accuracy')
plt.ylim(ylims)
plt.legend([errb1, errb2], ['Training', 'Validation'], loc='best')

## Task 4: Perform a sequential grid search to optimize the following hyperparameters. Save the best model for each of the sequential steps into an hdf5 file.

You have heard a lot about grid search and cross validation by now. Time for some final hands-on work!

You can choose whether you want to use a cross validation or a normal grid search. What are the points in favor of and against each?

* Find the best convolution + dense structure: Change the width and the depth and try at least 4 different structures (start with dropout rate 0)
* Find the best activation function for the dense network and the convolutional layers: `'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear'`
* Find the best learning rate within [1e-4, 1e-2] (watch out, log scale!) and batch size within the range `[32, 2048]`
* Find the best dropout rate: Change dropout between 0 and 0.5
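
For the learning rate and batch size step, candidates are usually spaced evenly on a logarithmic scale rather than linearly. A minimal sketch of how such candidate lists could be generated; the helper `log_spaced` and the number of candidates are assumptions for illustration, not a prescribed grid.

```python
def log_spaced(low, high, num):
    """Return `num` values from low to high, evenly spaced on a log scale."""
    ratio = (high / low) ** (1 / (num - 1))
    return [low * ratio ** i for i in range(num)]

learning_rates = log_spaced(1e-4, 1e-2, num=5)
batch_sizes = [2 ** exponent for exponent in range(5, 12)]   # 32, 64, ..., 2048

print([f"{lr:.1e}" for lr in learning_rates])
print(batch_sizes)
```

With NumPy, `np.logspace(-4, -2, 5)` produces the same learning rate candidates.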


In [None]:
# space for ideas

## Tips for Hyperparameter Optimization
This section lists some handy tips to consider when tuning hyperparameters of your neural network.

* **k-fold Cross Validation.** You can see that the results from the examples in this exercise show some variance. For speed reasons, we used cross-validation with k=3, but perhaps k=5 or k=10 would be more stable. Carefully choose your cross validation configuration to ensure your results are stable.
* **Review the whole grid.** Do not just focus on the best result, review the whole grid of results and look for trends to support configuration decisions.
* **Parallelize.** Use all your cores if you can, neural networks are slow to train and we often want to try a lot of different parameters. Consider using cluster instances if available.
* **Use a subsample of your dataset.** Because networks are slow to train, try training them on a smaller sample of your training dataset, just to get an idea of general directions of parameters rather than optimal configurations.
* **Start with coarse grids.** Start with coarse-grained grids and zoom into finer grained grids once you can narrow the scope.
* **Do not transfer results.** Results are generally problem specific. Try to avoid favorite configurations on each new problem that you see. It is unlikely that optimal results you discover on one problem will transfer to your next project. Instead look for broader trends like number of layers or relationships between parameters.
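* **Reproducibility is a problem.** Although we set the seed for the random number generator in `NumPy`, the results are not 100% reproducible: TensorFlow uses its own random number generator, for example for weight initialization and dropout.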
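
Regarding the last tip, a common way to reduce (not eliminate) run-to-run differences is to seed every random number generator involved. The sketch below is one possible helper, with the hypothetical name `set_seeds`; it seeds Python's `random` module and, only if they are installed, NumPy and TensorFlow via `tf.random.set_seed`.

```python
import random

def set_seeds(seed):
    """Seed Python's RNG and, if available, NumPy and TensorFlow."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed, nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed, nothing to seed

set_seeds(1338)
first = random.random()
set_seeds(1338)
second = random.random()
print(first == second)
```

Even with all seeds set, some GPU operations are not deterministic, so small differences between runs can remain.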

## Task 5: Load the best model and evaluate it using the function below
You can load the model using<br>
`from tensorflow.keras.models import load_model`<br>
`model= load_model('filename')`

In [None]:
###################################################################################################

import matplotlib.pyplot as plt
%matplotlib inline

def plot_history(network_history):
    plt.figure()
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.plot(network_history.history['loss'])
    plt.plot(network_history.history['val_loss'])
    plt.legend(['Training', 'Validation'])

    plt.figure()
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    # TensorFlow 2 stores the metric under 'accuracy' / 'val_accuracy'
    plt.plot(network_history.history['accuracy'])
    plt.plot(network_history.history['val_accuracy'])
    plt.legend(['Training', 'Validation'], loc='lower right')
    plt.show()

###################################################################################################

import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # normalize first, so that the plotted colors match the printed numbers
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # print counts as integers and normalized values with two decimals
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

###################################################################################################
import matplotlib.cm as cm
def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)), cmap=cm.Greys, interpolation='nearest')
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1

###################################################################################################
from sklearn.metrics import confusion_matrix,classification_report

def evaluate(X_train, Y_train, X_test, Y_test, model):

    ##Evaluate loss and metrics
    loss, accuracy = model.evaluate(X_test, Y_test, verbose=0)
    print('Test Loss:', loss)
    print('Test Accuracy:', accuracy)
    # Predict the values from the test dataset
    Y_pred = model.predict(X_test)
    # Convert predicted probabilities to class indices
    Y_cls = np.argmax(Y_pred, axis = 1)
    # Convert one-hot test labels to class indices
    Y_true = np.argmax(Y_test, axis = 1)
    print('Classification Report:\n', classification_report(Y_true,Y_cls))

    ## Plot 0 probability including overtraining test
    plt.figure(figsize=(8,8))

    label=0
    #Test prediction
    Y_pred_prob = Y_pred[:,label]
    plt.hist(Y_pred_prob[Y_true == label], alpha=0.5, color='red', range=[0, 1], bins=10, log = True)
    plt.hist(Y_pred_prob[Y_true != label], alpha=0.5, color='blue', range=[0, 1], bins=10, log = True)
    #Train prediction
    Y_train_pred = model.predict(X_train)
    Y_train_pred_prob = Y_train_pred[:,label]
    Y_train_true = np.argmax(Y_train, axis = 1)
    plt.hist(Y_train_pred_prob[Y_train_true == label], alpha=0.5, color='red', range=[0, 1], bins=10, log = True, histtype='step', linewidth=2)
    plt.hist(Y_train_pred_prob[Y_train_true != label], alpha=0.5, color='blue', range=[0, 1], bins=10, log = True, histtype='step', linewidth=2)

    plt.legend(['train == 0', 'train != 0', 'test == 0', 'test != 0'], loc='upper right')
    plt.xlabel('Probability of being 0')
    plt.ylabel('Number of entries')
    plt.show()

    # compute the confusion matrix
    confusion_mtx = confusion_matrix(Y_true, Y_cls)
    # plot the confusion matrix
    plt.figure(figsize=(8,8))
    plot_confusion_matrix(confusion_mtx, classes = range(10))

    #Plot largest errors
    errors = (Y_cls - Y_true != 0)
    Y_cls_errors = Y_cls[errors]
    Y_pred_errors = Y_pred[errors]
    Y_true_errors = Y_true[errors]
    X_test_errors = X_test[errors]
    # Probabilities of the wrong predicted numbers
    Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)
    # Predicted probabilities of the true values in the error set
    true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))
    # Difference between the probability of the predicted label and the true label
    delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors
    # Indices that sort the probability differences in ascending order
    sorted_delta_errors = np.argsort(delta_pred_true_errors)
    # Top 6 errors
    most_important_errors = sorted_delta_errors[-6:]
    # Show the top 6 errors
    display_errors(most_important_errors, X_test_errors, Y_cls_errors, Y_true_errors)

    ##Plot predictions
    slice = 15
    predicted = model.predict(X_test[:slice]).argmax(-1)
    plt.figure(figsize=(16,8))
    for i in range(slice):
        plt.subplot(1, slice, i+1)
        plt.imshow(X_test[i].reshape(28,28), interpolation='nearest')
        plt.text(0, 0, predicted[i], color='black',
                 bbox=dict(facecolor='white', alpha=1))
        plt.axis('off')

In [None]:
# space for solutions


---

# There's more:

We have had a first deep dive into cross validation and it has been quite a ride. We hope you enjoyed it though and are looking forward to more.

Searching for the best hyperparameter combination is pretty much the holy grail of modern machine learning, especially deep learning. Therefore, a lot of programs and packages have been developed that offer smart ways to search in very high-dimensional and even conditional parameter spaces (search this part if this parameter is chosen, else search another part).

Please take the time to look through this nice [blog entry](https://neptune.ai/blog/hyperparameter-tuning-in-python-complete-guide#scikit-learn) where 10 of the currently most popular Python packages for hyperparameter optimization are briefly introduced.
