This is a Jupyter notebook.  The most important keyboard shortcuts (cf. the "Help" menu) are
* **cursor keys** to select cells
* **Enter** to go from command mode to edit mode (for changing cell contents)
  * (**Esc** would go back to command mode.)
* **Shift+Enter** to *execute and advance* a cell
  * While experimenting with different values in the same cell, **Ctrl+Enter** is also handy, which executes but does not advance the cursor.
* There is an edit mode with a green bar to the left, and an execution/command mode with a blue bar.
* In command mode, some keys have a function:
    * `l`: toggle line numbers
    * `a`: new cell above 
    * `b`: new cell below
    * `h`: help / see more keyboard shortcuts

# SC1235 Introduction to Medical Image Analysis Using Convolutional Neural Networks

<div id="toc"></div>

## 1. First Experiments with Random Data
* Start by importing some required Python modules that implement the layers we will use to build the network.
* We also need a "container" to connect the layers: the "Model"

In [None]:
from keras.layers import InputLayer, Conv2D, MaxPool2D, Flatten, Dense, UpSampling2D, LocallyConnected2D, Dropout
from keras.models import Model, Sequential
from keras import optimizers

import numpy as np

### Create random data

In these examples, we'll use artificial data first, and then switch to real data.

Run the code in the following box, which will create a pair of input data `x_train` and corresponding output data `y_train` for training.

In [None]:
NUM_INST = 100
x_train = np.random.random((NUM_INST, 1000)) # NUM_INST instances with 1000 random features
y_train = np.zeros((NUM_INST,)) # Label vector (initialized with 0s)
y_train[:int(NUM_INST/2)] = 1 # set first half of vector to 1

### Define the model

In [None]:
model = Sequential()
model.add(InputLayer(input_shape = (1000,)))
model.add(Dense(units=256, name="Hidden")) # Play with the number of units (==neurons)
# Note: without an `activation` argument, a Dense layer is linear.
# Optionally increase the number of layers.
#model.add(Dense(units=128))
#model.add(Dense(units=64))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adadelta')
model.summary()
# If only interested in the number of parameters, use this:
# print("Model parameters: {0:,}".format(model.count_params()))
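
To check the numbers reported by `model.summary()`, the parameter count of the dense model above can also be worked out by hand. The following is a minimal sketch; the helper `dense_params` is a hypothetical name introduced only for this illustration and is not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit of a fully connected layer."""
    return n_inputs * n_units + n_units

# Layer sizes taken from the model definition above
hidden = dense_params(1000, 256)  # "Hidden" layer: 1000 inputs, 256 units
output = dense_params(256, 1)     # single sigmoid output neuron
total = hidden + output

print("Hidden: {0:,}  Output: {1:,}  Total: {2:,}".format(hidden, output, total))
```

If you change the number of units or uncomment the extra layers, the total changes accordingly.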

### Training the network

In [None]:
history = model.fit(x_train, y_train, batch_size=10, epochs=100)
# Clicking left to the output once will change the display mode from a scrollable field to a full display and back. Double-clicking it collapses it, so it is not so dominant.

### A remark on optimization
* Optimizers like Adam, Adagrad, Adadelta etc. are variants of Stochastic Gradient Descent (SGD).
* SGD estimates the gradient for parameters based on a batch of examples.
    * The larger the batch, the better the estimated gradient approximates the gradient for the whole dataset.
* It takes about 300 epochs to converge when creating 1000 instances.
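
The remark about batch size can be illustrated numerically. The sketch below does not use the network from above; as a simplifying assumption it takes a linear least-squares model on made-up random data, and compares how far the gradient from a small and a large batch deviates, on average, from the gradient over the whole dataset. All names (`mse_gradient`, `mean_batch_error`) are hypothetical helpers for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.random((n, d))                    # random features
y = (rng.random(n) < 0.5).astype(float)   # random binary targets
w = np.zeros(d)                           # current parameters

def mse_gradient(X_batch, y_batch, w):
    """Gradient of the mean squared error of a linear model."""
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

full_grad = mse_gradient(X, y, w)         # gradient over the whole dataset

def mean_batch_error(batch_size, n_trials=200):
    """Average distance between a batch gradient and the full gradient."""
    errors = []
    for _ in range(n_trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        batch_grad = mse_gradient(X[idx], y[idx], w)
        errors.append(np.linalg.norm(batch_grad - full_grad))
    return float(np.mean(errors))

error_small = mean_batch_error(10)
error_large = mean_batch_error(500)
print("batch 10:", error_small, " batch 500:", error_large)
```

The larger batch should give the smaller average deviation, at the cost of more computation per update.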

### Quiz: Interpreting the result
* What can you observe regarding the loss?
* Why is that possible?
* Change the number of training instances to 1000. Ensure that the classes are equally frequent again. What can you observe?
* Remember that you have to re-create the model to reset the weights. To do this, re-execute the whole cell with the model definition: building the layers anew initializes fresh weights, and the `model.compile()` call resets the optimizer.

### Investigate the "history" object you created
* Try out the following commands and inspect the variables.
* Make use of tab completion, e.g. by typing `hidden_layer.<tab>`

In [None]:
loss_history = history.history['loss']
weights = history.model.get_weights()
hidden_layer = history.model.get_layer("Hidden")
for w in weights:
    print(w.shape)

## 2. Image Classification: _MNIST handwritten digits_

### Read the data

* We want to work on images: the MNIST dataset, which we import and load next.
* You can import it from Keras with one line, because it is one of the standard datasets used for machine learning.

In [None]:
# If you execute this cell, you will overwrite the data above.
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# reduce data by factor 10 / 20 for fast execution during course
x_train = x_train[::10]
y_train = y_train[::10]
x_test = x_test[::20]
y_test = y_test[::20]

# verify resulting array shapes
x_train.shape, y_train.shape, x_test.shape, y_test.shape

### Inspecting the data

Look at the shape of the `x_train` variable to understand how the data is organised.
* The full MNIST training set has 60,000 examples, each of shape 28x28; after the subsampling above, 6,000 of them remain.
* These are images... of size 28x28 pixels.
* The corresponding output is just a long vector of corresponding labels in the range [0...9].

In [None]:
# Inspect the shape of x_train
print(x_train.shape, y_train.shape, y_train.min(), y_train.max())

As we are dealing with *images* now, we want to display them.
* `matplotlib` is a Python package well suited for plotting data and displaying images. Let's import it.
* Then, load and display one of the images "inline" in this notebook.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Look at an image
plt.imshow(x_train[600], cmap='gray')

Apart from displaying images, `matplotlib` also helps you visualise logs. Below, we display the learning success as measured by the loss. Remember that the loss is in the `history` object created while fitting?

In [None]:
fig,ax = plt.subplots(figsize=(18, 4), dpi= 80, facecolor='w', edgecolor='k')
ax.plot(history.history['loss'])
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.grid()
plt.show()

Also, we could be interested in the distribution of labels in our data:

In [None]:
plt.hist(y_train)
plt.show()

### Preparing the labels for a classification network
We want to convert the numeric labels to so-called *"one-hot vectors"*.
* One-hot means that the network does not directly output a number between 0 and 9 representing the digit.
* Rather, we want a vector with 10 dimensions, in which only one entry is 1, all others 0, e.g. `[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]` to label a "2".
* *Rationale:* The digits represent different categorical classes, and we want to penalize all confused digits the same; it is not "better" or "closer" if the network outputs 4.2 given an image depicting a "6" than if the output is 1.
* In general, the one-hot encoding helps with classification problems and lets the neuron with maximal activation "win".

In [None]:
num_labels = 10
# Code to convert labels
y_train_one_hot = (np.arange(num_labels) == y_train[:,np.newaxis]).astype(np.float32)

In [None]:
# Keras offers a convenience function to achieve the same:
from keras.utils import to_categorical
y_train_one_hot = to_categorical(y_train, num_classes=num_labels)
# Same for the testing data
y_test_one_hot  = to_categorical(y_test, num_classes=num_labels)
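
The one-line NumPy conversion above relies on broadcasting, which is easier to follow on a tiny example. The sketch below uses three made-up labels and three classes (both are assumptions for illustration only).

```python
import numpy as np

labels = np.array([2, 0, 1])   # toy labels, not the MNIST data
num_classes = 3

# labels[:, np.newaxis] has shape (3, 1), np.arange(num_classes) has shape (3,).
# Broadcasting compares every label with every class index -> shape (3, 3).
comparison = np.arange(num_classes) == labels[:, np.newaxis]
one_hot = comparison.astype(np.float32)

print(one_hot)
print(one_hot.argmax(axis=-1))  # recovers the integer labels
```

Each row contains a single 1 at the position of the label, and `argmax` along the last axis inverts the encoding.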

### Image classification with a simple neural network
We now want to train the above network on this data. We have to adapt it to use inputs of a different shape, and to produce vector outputs. We have prepared this below:
* Modified the parameter `input_shape=(...)` to adapt to the new data
* Modified the number of dense units in the output layer to reflect the number of classes; 10 in the digits example
* Modified the loss function to deal with multiple classes

In [None]:
model = Sequential()
model.add(InputLayer(input_shape=(28,28)))
model.add(Flatten()) # Layer reshaping the 28x28 arrays into vectors of length 28*28=784
model.add(Dense(units=256)) # Try higher or lower numbers of hidden units!
# Try adding more layers!
model.add(Dense(units=128))
model.add(Dropout(0.5))
model.add(Dense(units=10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta')
model.summary()

In [None]:
# This experiment takes about 1 sec per epoch on an older MacBook Pro.
history = model.fit(x_train, y_train_one_hot, batch_size=500, epochs=100) # In this example, you'll no longer want batches of size 10...

### Evaluate the model on independent test data
* The following cell executes the model on the test data
* The result is a list of 10-vectors (recall the one-hot encoding), only this time there are also values between 0 and 1.
* How can we compare these with the true labels in `y_test_one_hot`? There are many possible ways to evaluate classifiers; in general, you want to define some kind of error, usually based on differences.

In [None]:
pred = model.predict(x_test)
print(x_test.shape, pred.shape)
print(pred[0])

The `argmax()` function may come in handy, which converts from the one-hot representation back to integer indices of the maximally activated classes:

In [None]:
pred.argmax(axis = -1)
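
One simple option among the many possible evaluation measures is the accuracy, i.e. the fraction of examples whose maximally activated class matches the true label. The sketch below uses small made-up arrays (`toy_pred`, `toy_true`) instead of the real network output, so that the result can be checked by hand.

```python
import numpy as np

# Made-up "network outputs" for 4 examples and 3 classes
toy_pred = np.array([[0.1, 0.7, 0.2],
                     [0.8, 0.1, 0.1],
                     [0.3, 0.3, 0.4],
                     [0.2, 0.5, 0.3]])
toy_true = np.array([1, 0, 1, 1])        # made-up integer labels

toy_labels = toy_pred.argmax(axis=-1)    # index of the winning class per example
accuracy = float(np.mean(toy_labels == toy_true))

print(toy_labels, accuracy)
```

Applied to the notebook's variables, the same idea would read `np.mean(pred.argmax(axis=-1) == y_test)`.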

## 3. Image classification with a simple convolutional neural network (CNN)
For an introduction to convolutional layers, see the course notes.

### Weight sharing
Exploring convolutions with and without weight sharing.
* First, your input data now needs to have a "channel" dimension, as the convolutional filter result will be a multi-channel image.
* Next, you will need to remove the 2D nature again to feed into dense layers. 
  * `Flatten()` does this for you.
  * Train the network as before.
  * What do you observe?

Later, we'll explore how a convolutional layer without weight sharing affects the network.
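
Adding the "channel" dimension does not change the pixel values, it only appends an axis of length 1. The following sketch uses a dummy array in place of `x_train` (an assumption for illustration) and shows the two equivalent spellings that appear in this notebook.

```python
import numpy as np

images = np.zeros((5, 28, 28))            # dummy stand-in for a batch of images

with_channel = images[..., np.newaxis]    # append a trailing axis
reshaped = np.reshape(images, images.shape + (1,))  # same result via reshape

print(images.shape, "->", with_channel.shape)
```

Both variants produce an array of shape `(5, 28, 28, 1)`, matching `input_shape=(28,28,1)` up to the batch axis.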

#### With weight sharing

In [None]:
convnet = Sequential()
convnet.add(InputLayer(input_shape=(28,28,1)))
convnet.add(Conv2D(32, kernel_size=(3,3), padding='same'))
convnet.add(Conv2D(32, kernel_size=(3,3), padding='same'))
convnet.add(MaxPool2D())
convnet.add(Conv2D(32, kernel_size=(3,3), padding='same'))
convnet.add(Conv2D(32, kernel_size=(3,3), padding='same'))
convnet.add(MaxPool2D())
convnet.add(Flatten())
convnet.add(Dense(units=128))
convnet.add(Dropout(0.5))
convnet.add(Dense(units=10, activation='softmax'))
convnet.compile(loss='categorical_crossentropy', optimizer='adadelta')
print("convnet parameters: {0:,}".format(convnet.count_params()))
convnet.summary()

In [None]:
convnet_history = convnet.fit(x_train[...,np.newaxis], y_train_one_hot, batch_size=500, epochs=60)

Exercise: Plot the loss (history object, see above):

In [None]:
#plt.plot(...)

If you have scikit-learn installed (try `conda install scikit-learn`), it offers utility methods for computing evaluation metrics such as a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix):

In [None]:
import sklearn.metrics
pred = convnet.predict(x_test[...,np.newaxis])
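# Note: scikit-learn's signature is confusion_matrix(y_true, y_pred). Because the
# predictions are passed first here, rows are predicted and columns are true classes.
cm = sklearn.metrics.confusion_matrix(pred.argmax(axis = -1), y_test)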
cm

It's more intuitive to look at it as a heat map.

In [None]:
plt.matshow(cm)
plt.show()

Side note: Numpy and Matplotlib are two important, central libraries for numeric computing with Python. In addition, there are also more advanced libraries such as Seaborn, which build upon the things introduced above and offer dedicated functions for complex graphics, such as a combined version of the above matrix + heatmap.

In [None]:
import seaborn as sns
ax = sns.heatmap(cm, annot=True)

#### Without weight sharing

Now, let's try convolution without weight sharing.
* Use `LocallyConnected2D` for this. 
* What do you observe? Try training the network.
* Look at the number of parameters. Change the network, if necessary.
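
To get a feeling for the effect of giving up weight sharing, the parameter counts can be estimated by hand. The sketch below assumes 'valid' padding (so a 3x3 kernel shrinks 28 to 26, and 13 to 11 after pooling) and one bias per output position and filter for the locally connected layer; both helper functions are hypothetical names for this illustration. Compare the results with `without_ws.summary()`.

```python
def conv2d_params(kernel, in_channels, filters):
    """Shared weights: one kernel (plus bias) per filter, reused at every position."""
    return kernel * kernel * in_channels * filters + filters

def locally_connected_params(out_size, kernel, in_channels, filters):
    """No sharing: a separate kernel (plus bias) for every output position."""
    positions = out_size * out_size
    return positions * (kernel * kernel * in_channels * filters + filters)

# First layer: 28x28x1 input, 3x3 kernel, 32 filters, output 26x26
shared_first = conv2d_params(3, 1, 32)
local_first = locally_connected_params(26, 3, 1, 32)

# Second layer: 13x13x32 input after pooling, output 11x11
shared_second = conv2d_params(3, 32, 32)
local_second = locally_connected_params(11, 3, 32, 32)

print("first layer:  {0:,} vs {1:,}".format(shared_first, local_first))
print("second layer: {0:,} vs {1:,}".format(shared_second, local_second))
```

Under these assumptions, the locally connected layers need several hundred times more parameters than their convolutional counterparts.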

In [None]:
without_ws = Sequential()
without_ws.add(InputLayer(input_shape=(28,28,1)))
without_ws.add(LocallyConnected2D(32, kernel_size=(3,3)))
#without_ws.add(LocallyConnected2D(32, kernel_size=(3,3)))
without_ws.add(MaxPool2D())
without_ws.add(LocallyConnected2D(32, kernel_size=(3,3)))
#without_ws.add(LocallyConnected2D(32, kernel_size=(3,3)))
without_ws.add(MaxPool2D())
without_ws.add(Flatten())
without_ws.add(Dense(units=128))
without_ws.add(Dropout(0.5))
without_ws.add(Dense(units=10, activation='softmax'))
without_ws.compile(loss='categorical_crossentropy', optimizer='adadelta')
print("without_ws parameters: {0:,}".format(without_ws.count_params()))
without_ws.summary()

In [None]:
wws_history = without_ws.fit(np.reshape(x_train, x_train.shape+(1,)), y_train_one_hot, batch_size=10, epochs=10)

In [None]:
plt.plot(wws_history.history['loss'])
plt.show()
import sklearn.metrics
pred = without_ws.predict(x_test[...,np.newaxis])
cm = sklearn.metrics.confusion_matrix(pred.argmax(axis = -1), y_test)
cm

### From CNN to FCN
* In the following, explore how a network with dense layers is equivalent to a properly configured fully-convolutional network
* Flattening is replaced by a convolutional layer whose kernel spans the full size of the previous output.
    * Replace the `Flatten` layer and subsequent `Dense` layers by `Conv2D` layers.
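    * Note that a `Flatten` layer would still be required to convert into the output vector representation; the code below instead adds two singleton axes to the labels and indexes the prediction with `[:,0,0]`.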
* Convince yourself that the number of trainable parameters is indeed unchanged.
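
The claimed equivalence can also be checked numerically without Keras. The sketch below assumes a 7x7x32 feature map (the output of the second pooling layer for 28x28 inputs) and random made-up weights; it compares a dense layer on the flattened map with a "convolution" whose kernel spans the full 7x7 extent, implemented here with `np.tensordot`.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((7, 7, 32))        # made-up activations
dense_weights = rng.standard_normal((7 * 7 * 32, 128))
bias = rng.standard_normal(128)

# Dense layer applied to the flattened feature map
dense_out = feature_map.reshape(-1) @ dense_weights + bias

# Same weights viewed as a 7x7 kernel with 32 input and 128 output channels;
# with 'valid' padding there is exactly one output position.
conv_kernel = dense_weights.reshape(7, 7, 32, 128)
conv_out = np.tensordot(feature_map, conv_kernel, axes=([0, 1, 2], [0, 1, 2])) + bias

dense_count = dense_weights.size + bias.size
conv_count = conv_kernel.size + bias.size
print(np.allclose(dense_out, conv_out), dense_count, conv_count)
```

Both layers hold the same weights in a different arrangement, so outputs and parameter counts agree.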

In [None]:
fcn = Sequential()
fcn.add(InputLayer(input_shape=(None,None,1)))
fcn.add(Conv2D(32, kernel_size=(3,3), padding='same'))
fcn.add(Conv2D(32, kernel_size=(3,3), padding='same'))
fcn.add(MaxPool2D())
fcn.add(Conv2D(32, kernel_size=(3,3), padding='same'))
fcn.add(Conv2D(32, kernel_size=(3,3), padding='same'))
fcn.add(MaxPool2D())
fcn.add(Conv2D(128, kernel_size=(7,7), padding='valid'))
fcn.add(Dropout(0.5))
fcn.add(Conv2D(10, kernel_size=(1,1), activation='softmax'))
fcn.compile(loss='categorical_crossentropy', optimizer='adadelta')
print("fcn parameters: {0:,}".format(fcn.count_params()))

In [None]:
fcn_history = fcn.fit(x_train[...,np.newaxis],
                      y_train_one_hot[:,np.newaxis,np.newaxis,:],
                      batch_size=500, epochs=100)

In [None]:
plt.plot(fcn_history.history['loss'])
plt.show()
pred = fcn.predict(x_test[...,np.newaxis])[:,0,0]
cm = sklearn.metrics.confusion_matrix(pred.argmax(axis = -1), y_test)
cm

Exercise: Visualize this confusion matrix and compare it with the one from the CNN above.

In [None]:
%%javascript
// This code generates the table of contents at the top of the notebook
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')