<img style="float: left; margin-bottom: 1em" src="images/PRACE.png" width="200">
<img style="float: right; margin-bottom: 1em" src="images/surfsara.png" width="150">
<hr style="clear: both"/>

# Hyperparameter tuning
In this notebook you will learn how the different hyperparameters affect the performance of your network. You will start with a small network and extend and tweak it gradually to improve its performance. 

We have prepared a more complicated problem for you and turned the original two-class classification problem (healthy versus diseased leaves) into a ten-class classification problem. All pictures are of tomato plants. Let's load the data set and display the class names.

In [None]:
import lib

X, Y, labels = lib.dataset_plant_village_tomato()

for index, name in enumerate(labels):
    print('{}: {}'.format(str(index).rjust(2), name))

As you can see, we have one healthy class, and nine disease classes.

Let's plot the first 32 examples in the data set:

In [None]:
%config InlineBackend.figure_format = 'retina'

lib.plot_examples(X[:32], Y[:32], labels);

Let's inspect the size of the images `X` and the labels `Y` using NumPy's [`shape`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html) attribute:

In [None]:
print('Shape of input images X: {}'.format(X.shape))
print('Shape of input labels Y: {}'.format(Y.shape))

## Exercise 1
What is the size (width and height) of the examples, in pixels? How many examples do we have?

In [None]:
# FILL IN YOUR ANSWER HERE

## Exercise 2
Implement a model with a single dense hidden layer with 16 units and ReLU activation by filling out the skeleton below.

**Hints**:
* Go back to the slides to find out the activation functions you will need in the hidden and output layers.
* It may help to keep open the first notebook for reference purposes.
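
As a reminder of what these activation functions compute, here is a minimal NumPy sketch of ReLU and softmax. The helper names `relu` and `softmax` and the toy values in `z` are made up for illustration; they are not part of `lib` or Keras.

```python
import numpy as np

def relu(z):
    # ReLU keeps positive values and sets negative values to zero.
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row-wise maximum for numerical stability, then normalise
    # the exponentials so that every row sums to 1.
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical pre-activations for two examples and three classes.
z = np.array([[-1.0, 0.0, 2.0],
              [0.5, 0.5, 0.5]])

hidden = relu(z)
probs = softmax(z)
print(hidden)
print(probs)
print(probs.sum(axis=1))
```

Each row of `probs` can be read as a probability distribution over the classes, which is why softmax is a common choice for the output layer of a multi-class classifier.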

In [None]:
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=X.shape[1:]))
# <FILL IN>
model.add(Dense(10, activation='softmax'))
model.summary()

## Exercise 3
Initialise the Adam optimizer with a learning rate of 0.1 and train the network for 10 epochs by filling out the skeleton below.

Does the network train at all? Why or why not, do you think? Fill in your answer in the cell below.

**Hint**: what would the accuracy be like when we always predict the same class in a ten-class classification problem?

In [None]:
from keras.optimizers import Adam

model.compile(Adam(lr=0.1), loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, Y, epochs=10)

In [None]:
# FILL IN YOUR ANSWER HERE

## Exercise 4
Experiment with the learning rate by changing it in orders of magnitude (e.g. 0.1, 0.01, 0.001) in the cell below. We have filled out the network definition and number of epochs for you.

What seems to be a good learning rate for this problem? What happens if you make the learning rate too small? Why? Fill in your answer in the cell below.
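
To get a feel for what the learning rate controls, the following toy sketch runs plain gradient descent on the one-dimensional function f(w) = w². This is a simplification: Adam adapts its step sizes, so the numbers will not carry over to the network, and the function `gradient_descent` and the chosen rates are assumptions made purely for illustration.

```python
def gradient_descent(lr, steps=25, w=1.0):
    # Minimise f(w) = w**2, whose gradient is 2 * w.
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

# Try a large, a moderate and a very small learning rate (toy values).
results = {lr: gradient_descent(lr) for lr in (1.1, 0.1, 0.0001)}
for lr, w in results.items():
    print('lr={}: w after 25 steps = {:.4f}'.format(lr, w))
```

The minimum is at w = 0. Compare how close each run gets after the same number of steps, and keep that picture in mind when you look at the loss values Keras reports.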

In [None]:
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=X.shape[1:]))
model.add(Dense(16, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()

model.compile(Adam(lr=<FILL IN>), loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, Y, epochs=25)

In [None]:
# FILL IN YOUR ANSWER HERE

## Exercise 5
Our model improves, but slowly. That is because it has low capacity. There is only one hidden layer with a small number of units.

Increase the number of units in the hidden layer and retrain the model by filling out the skeleton below. Try to reach an accuracy of at least 0.7 at epoch 25. How many units do you need? Justify your answer in the cell below.

**Hints**:
* Use the learning rate that worked for you in exercise 4. Adapt it if you think the model does not converge well enough.
* Increase the number of hidden units by factors of 2 or 4.

In [None]:
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=X.shape[1:]))
model.add(Dense(<FILL IN>, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()

model.compile(Adam(lr=<FILL IN>), loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, Y, epochs=25)
lib.plot_history(history);

In [None]:
# FILL IN YOUR ANSWER HERE

## Validation set
We will now introduce a **validation set** into our problem to verify our network and optimizer configuration. This is as simple as providing the Keras [`fit`](https://keras.io/models/sequential/#fit) function with an argument for its `validation_split` parameter. This parameter governs what proportion of the original data set is set aside for validation during training. After each epoch, Keras will run the validation set through the network to obtain the validation loss and accuracy for that epoch.
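
According to the Keras documentation, the validation samples are taken from the end of the arrays passed to `fit`, before any shuffling. The NumPy sketch below mimics that behaviour on toy data; `split_last_fraction` is a hypothetical helper written for this illustration, not a Keras function, and the exact rounding Keras applies may differ.

```python
import numpy as np

def split_last_fraction(x, y, validation_split):
    # Hypothetical helper: hold out the last fraction of the samples.
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Toy data: ten samples with one feature each.
x_toy = np.arange(10).reshape(10, 1)
y_toy = np.arange(10)

(x_train, y_train), (x_val, y_val) = split_last_fraction(x_toy, y_toy, 0.2)
print(len(x_train), len(x_val))
print(y_val)
```

Because the held-out samples come from the end of the data, a data set that is sorted by class could end up with a validation set that is not representative. Whether that applies here depends on how `lib` orders the examples.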

## Exercise 6
In the following cell we have set this parameter to 0.2, so that 20% of our original data is used for validation. Fill out the network by adding two hidden layers: the first with 512 units, the second with 256. Both should have ReLU activation.

Train the network for 25 epochs with learning rate 0.00001 and batch size 256 (the `batch_size` parameter in Keras's `fit` function). Inspect the resulting accuracy and loss curves. How do you explain the divergence between the training and validation loss and accuracy?

In [None]:
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=X.shape[1:]))
# <FILL IN>
model.add(Dense(10, activation='softmax'))
model.summary()

model.compile(Adam(lr=<FILL IN>), loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, Y, epochs=<FILL IN>, batch_size=<FILL IN>, validation_split=0.2)
lib.plot_history(history);

In [None]:
# FILL IN YOUR ANSWER HERE

## Bonus exercise (hard)
How many parameters do we have in the first dense layer of exercise 2, and why? Write down your answer in the cell below.

In [None]:
# FILL IN YOUR ANSWER HERE

## Bonus exercise (hard)
Play around with the `batch_size` parameter of the `fit` call in the cell below. What happens to the training speed and network convergence compared with the previous example? Why? Fill in your answer in the cell below.

**Hint**: try to increase or decrease the batch size by factors of 4 or 8. What happens for very low batch sizes (single digits), and what happens with very high batch sizes (triple digits)?
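
One quantity that changes with the batch size is the number of weight updates per epoch, since the optimizer takes one step per batch. The sketch below computes it for a few batch sizes. The function `updates_per_epoch` and the value of `n_examples` are assumptions for illustration; substitute `X.shape[0]` for the real data set.

```python
import math

def updates_per_epoch(n_examples, batch_size, validation_split=0.2):
    # Number of training examples left after holding out the validation set.
    n_train = int(n_examples * (1.0 - validation_split))
    # One weight update per batch; the last batch may be smaller.
    return math.ceil(n_train / batch_size)

n_examples = 10000  # hypothetical; use X.shape[0] for the real data set
counts = {b: updates_per_epoch(n_examples, b) for b in (8, 64, 512)}
for batch_size, n_updates in counts.items():
    print('batch_size={}: {} updates per epoch'.format(batch_size, n_updates))
```

Keep these counts in mind when you compare the time per epoch and the shape of the loss curves for different batch sizes.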

In [None]:
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=X.shape[1:]))
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()

model.compile(Adam(lr=0.0001), loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, Y, epochs=25, batch_size=512, validation_split=0.2)
lib.plot_history(history);

In [None]:
# FILL IN YOUR ANSWER HERE