# Deep Learning 1

In this practical session, we will take a look at the [Keras](https://keras.io/) deep learning framework for Python. From the documentation:

>Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

Designing and testing neural networks can be a tedious process. Keras provides an intuitive API for training, testing and deploying neural networks without having to worry too much about technical details. To begin, we import it:

In [1]:
import keras

Keras can be used with different *backends*. The backend is another framework which takes care of the low-level details of the implementation, such as GPU optimization and distributed computing. At the time of this writing, Keras supports [TensorFlow](https://www.tensorflow.org/), [Theano](http://deeplearning.net/software/theano/) and [CNTK](https://www.microsoft.com/en-us/cognitive-toolkit/) backends. Upon importing the module, Keras will report which backend (if any) it is using.

## Deep Neural Network basics

A Deep Neural Network (DNN) is built up of several layers. Formally, such a network is given by

$$
    f(x) = g_L(W_Lg_{L-1}(\dots g_1(W_1x + b_1) \dots) + b_L)
$$

Here, $W_1, \dots, W_L$ are matrices (called the *weights* of the network), $b_1, \dots, b_L$ are vectors (called the *biases*) and $g_1, \dots, g_L$ are the *activation functions*. We can picture this as follows:

![A neural network](mlp.png)

Each node in the network computes the inner product of its weight vector, which is a row of that layer's weight matrix, with its input, which is the output of the previous layer, and then adds a bias term. The result is then sent through a non-linear activation function. Examples of typical activation functions are

1. the logistic sigmoid:
$$
    \mathrm{sigmoid}(z) = \frac{1}{1 + \exp(-z)},
$$

2. the rectified linear unit (ReLU):
$$
    \mathrm{relu}(z) = \max(0,z),
$$

3. the scaled exponential linear unit (SELU):
$$
    \mathrm{selu}(z) = \lambda\left\{\begin{matrix}
        z, & \mbox{if $z > 0$}\\
        \alpha(\exp(z)-1), & \mbox{if $z \leq 0$}
    \end{matrix}\right..
$$
Here, $\lambda$ and $\alpha$ are parameters fixed before the network is trained.

4. the softmax:
$$
    \mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^q\exp(z_j)}, \quad i = 1, \dots, q.
$$
The softmax function is typically used only at the very end of a neural network classifier. It has the properties that for any $z \in \mathbb{R}^q$,
$$\begin{aligned}
    \mathrm{softmax}(z) &\in [0,1]^q, & \sum_{i=1}^q\mathrm{softmax}(z)_i &= 1.
\end{aligned}$$
In other words, the output of $\mathrm{softmax}(z)$ can be interpreted as a vector of probabilities over $q$ possible outputs.
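To make the four activation functions above concrete, here is a minimal NumPy sketch. The function names and the SELU constants are assumptions for illustration: the values of $\lambda$ and $\alpha$ are the commonly quoted ones, rounded to double precision. The `softmax` here subtracts the maximum before exponentiating, which does not change the result but avoids overflow.

```python
import numpy as np

# Assumed SELU constants (commonly quoted values, rounded to double precision).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def selu(z):
    # np.minimum keeps exp() from overflowing on the branch that np.where discards.
    negative_branch = SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SELU_LAMBDA * np.where(z > 0, z, negative_branch)

def softmax(z):
    # Shifting by the maximum leaves the output unchanged but is numerically safer.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
p = softmax(z)
print(p, p.sum())  # entries lie in [0, 1] and sum to 1 (up to rounding)
```

The last line checks the two softmax properties stated above on a single example vector.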

We can construct a simple neural network in Keras as follows:

In [12]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(units=64, activation='relu', input_dim=30),
    Dense(units=10, activation='relu'),
    Dense(units=2, activation='softmax')
])

This model consists of three *dense* or *fully-connected* layers. These are the "classical" neural network layers which perform a linear transformation and then apply a specified activation function. In our case, the input has 30 dimensions, the two hidden layers have 64 and 10 nodes respectively, and the output layer has 2 nodes. This gives a total of

$$
    64 \times 30 + 10 \times 64 + 2 \times 10 + 64 + 10 + 2 = 2656
$$

parameters. Of course, the weight matrices and bias vectors have not been set to any meaningful values yet. Finding values for these parameters such that the network maximizes a certain performance measure is the goal of a *learning algorithm*.
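As a sanity check on the parameter count, the following sketch builds weight matrices and bias vectors of the right shapes in plain NumPy and evaluates the formula for $f(x)$ from the previous section. It is not how Keras initializes or stores its parameters: the random initialization, the `forward` helper and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [30, 64, 10, 2]  # input, two hidden layers, output

# One (W, b) pair per layer; W has shape (outputs, inputs), as in the formula for f(x).
params = [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

n_params = sum(W.size + b.size for W, b in params)
print(n_params)  # 2656

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, params):
    # Apply g_l(W_l x + b_l) layer by layer: ReLU, ReLU, then softmax.
    for (W, b), g in zip(params, [relu, relu, softmax]):
        x = g(W @ x + b)
    return x

x = rng.normal(size=30)
y = forward(x, params)
print(y.shape)  # (2,)
```

Note that Keras's `Dense` layer stores its kernel with shape (inputs, outputs), the transpose of the convention used in the formula; the number of parameters is the same either way.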

## Supervised learning

In the supervised learning setting, we are given a data set of observations, each consisting of an input $x$ and a desired output $y$, which is used to train the network. Training the network aims to minimize the so-called *loss function*, which measures how much the output $f(x)$ deviates from the desired output $y$. In a classification problem, the natural choice is the 0/1 loss:

$$
    \ell(f(x),y) = \left\{\begin{matrix}
        1, & \mbox{if $f(x) \neq y$}\\
        0, & \mbox{if $f(x) = y$}
    \end{matrix}\right..
$$

However, in some cases (e.g. regression problems) it makes sense to use other functions such as the squared error

$$
    \ell(f(x),y) = \|f(x)-y\|_2^2.
$$
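Both loss functions are short enough to write out directly. The sketch below uses hypothetical helper names (`zero_one_loss`, `squared_error`) and evaluates each loss on a single sample.

```python
import numpy as np

def zero_one_loss(pred_label, true_label):
    # 1 if the predicted label is wrong, 0 if it is right.
    return float(pred_label != true_label)

def squared_error(pred, target):
    # Squared Euclidean distance between prediction and target.
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.dot(diff, diff))

print(zero_one_loss(1, 0))                    # 1.0
print(zero_one_loss(1, 1))                    # 0.0
print(squared_error([1.0, 2.0], [0.0, 0.0]))  # 5.0
```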

We now need to solve this optimization problem. In practice, this means adjusting our weight matrices and bias vectors until we no longer get any reduction in loss. There exist a variety of optimization algorithms to solve this problem, but we won't get into their details here. What is important to know is that all of these optimization algorithms work iteratively on *mini-batches* of the training data set, never on the entire training set at once unless it is really small (which it often isn't). The *batch size* is the number of samples in a mini-batch, so it also determines the number of mini-batches. The optimization algorithms proceed as follows:

1. Split the training set into mini-batches with a number of samples equal to the batch size.
2. For each mini-batch, compute updates to the weights and biases based solely on the given mini-batch data.
3. Repeat step 2 until convergence.

Each full pass over all mini-batches in step 2 is called an *epoch*. You will always need to specify the number of epochs as well as the batch size whenever you train a neural network. It is convenient to choose a batch size that divides the training set size, so that all mini-batches have the same number of samples. The Keras `fit` method does not require this (the last mini-batch is simply smaller), but some algorithms and layer types may refuse to run otherwise. Typical batch sizes are small powers of two such as 64 or 128. In case the training set size happens to be a prime number (which can occur e.g. if you augment the training set with artificial samples), you may need to subsample.
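The three steps above can be sketched in a few lines of NumPy. To keep the example short, it fits a linear model with the squared error rather than a neural network, runs for a fixed number of epochs instead of testing for convergence, and uses synthetic data; the learning rate, batch size and variable names are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): y = X @ w_true + small noise.
n_samples, n_features = 256, 3
X = rng.normal(size=(n_samples, n_features))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n_samples)

w = np.zeros(n_features)
batch_size, epochs, learning_rate = 64, 50, 0.1

for epoch in range(epochs):
    # Step 1: split the (reshuffled) training set into mini-batches.
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        # Step 2: update the weights using only this mini-batch.
        residual = X[idx] @ w - y[idx]
        grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient of the mean squared error
        w -= learning_rate * grad
    # Step 3: the outer loop repeats this for every epoch.

print(np.round(w, 2))  # should be close to w_true
```

Here 256 samples with a batch size of 64 give four mini-batches, so each epoch performs four weight updates.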

To train our model in Keras, we compile it with a loss function and an optimizer which will attempt to minimize the loss:

In [13]:
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
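The `categorical_crossentropy` loss used in the cell above compares a one-hot target vector with a vector of predicted class probabilities. As a rough NumPy sketch of what it computes (the clipping constant `eps` is an assumption added for numerical safety, and this is not the Keras implementation itself):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then average -sum_i y_i * log(p_i) over the samples.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
confident = np.array([[0.9, 0.1], [0.2, 0.8]])  # mostly correct predictions
unsure = np.array([[0.5, 0.5], [0.5, 0.5]])     # uninformative predictions

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident < loss_unsure)  # True
```

Unlike the 0/1 loss, this quantity changes smoothly with the predicted probabilities, which is what gradient-based optimizers such as `sgd` need.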

Note that the `metrics` argument takes a list, because multiple metrics can be computed for the model. See [the documentation](https://keras.io/metrics/) for an exhaustive list of supported metrics. The model can now be fit to a data set. We will try the Wisconsin breast cancer data set:

In [14]:
import sklearn
import numpy as np
from sklearn.datasets import load_breast_cancer

wisconsin = load_breast_cancer()
x_data = wisconsin['data']
y_data = wisconsin['target']

The network expects our target labels to be 2-dimensional vectors of class probabilities, so we perform one-hot encoding:

In [15]:
y_data = np.array([[1., 0.] if y == 0 else [0., 1.] for y in y_data])
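The list comprehension above is specific to two classes. An equivalent sketch that works for any number of classes indexes the rows of an identity matrix; the `labels` array below is a small hypothetical example, not the Wisconsin data.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])  # hypothetical integer class labels
n_classes = 2

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(n_classes)[labels]

# Same result as the list comprehension used in the cell above.
list_version = np.array([[1., 0.] if y == 0 else [0., 1.] for y in labels])
print(np.array_equal(one_hot, list_version))  # True

# argmax undoes the encoding.
recovered = np.argmax(one_hot, axis=1)
```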

We now shuffle the data set, split it into training and test sets, and normalize both using statistics computed on the training set:

In [16]:
from sklearn.utils import shuffle

x_data, y_data = shuffle(x_data, y_data)

p = .8
idx = int(x_data.shape[0] * p)
x_train, y_train = x_data[:idx], y_data[:idx]
x_test, y_test = x_data[idx:], y_data[idx:]

x_mean, x_std = x_train.mean(), x_train.std()
x_train -= x_mean
x_train /= x_std

x_test -= x_mean
x_test /= x_std
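The cell above uses a single mean and standard deviation computed over the whole training matrix. A common alternative, sketched here on synthetic data with two features on very different scales, is to compute the statistics per feature (`axis=0`) on the training split and reuse them for the test split. The data and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): two features on very different scales.
x = np.column_stack([rng.normal(0.0, 1.0, 200),
                     rng.normal(500.0, 100.0, 200)])
x_tr, x_te = x[:160], x[160:]

# Per-feature statistics, computed on the training split only.
mu = x_tr.mean(axis=0)
sigma = x_tr.std(axis=0)

x_tr_n = (x_tr - mu) / sigma
x_te_n = (x_te - mu) / sigma  # the test split reuses the training statistics

print(x_tr_n.mean(axis=0), x_tr_n.std(axis=0))  # approximately 0 and 1 per feature
```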

Note that we have z-normalized the data using the mean and standard deviation of the training set, computed over all features at once. Now it's up to Keras to fit the model to this data set:

In [17]:
model.fit(x_train, y_train, epochs=100, batch_size=65)

Epoch 1/100
Epoch 2/100
...

<tensorflow.python.keras.callbacks.History at 0x7f98f0f32e48>

Due to randomness in the initialization of the weights and biases as well as the optimization algorithm itself, repeated runs of the `fit` method will yield different results. You should be able to get at least 98% test accuracy after a few runs.

Having trained the model, we can ask it for predictions:

In [18]:
classes = np.argmax(model.predict(x_test, batch_size=65), axis=1)

Using this array, we can compute the accuracy ourselves:

In [19]:
accuracy = np.mean(np.equal(classes, np.argmax(y_test, axis=1)))
print('Accuracy: {}'.format(accuracy))

Accuracy: 0.956140350877193


Or we could let Keras compute it for us:

In [20]:
evals = model.evaluate(x_test, y_test, batch_size=65)
print('Loss: {}'.format(evals[0]))
print('Accuracy: {}'.format(evals[1]))

Loss: 0.14216303825378418
Accuracy: 0.9561403393745422


The classes of the Wisconsin data set are not balanced, so we may want to check out the balanced accuracy as well:

In [21]:
labels = np.argmax(y_test, axis=1)
idx0 = (labels == 0)
idx1 = (labels == 1)

acc0 = np.mean(np.equal(classes[idx0], labels[idx0]))
acc1 = np.mean(np.equal(classes[idx1], labels[idx1]))
bal_acc = (acc0 + acc1) / 2
print('Balanced accuracy: {}'.format(bal_acc))
print('\tClass 0: {}'.format(acc0))
print('\tClass 1: {}'.format(acc1))

Balanced accuracy: 0.9489864864864865
	Class 0: 0.925
	Class 1: 0.972972972972973
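The same computation extends to any number of classes by averaging the per-class accuracies. A minimal sketch, using a hypothetical helper `balanced_accuracy` and made-up labels:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Accuracy restricted to the samples of each class, then averaged over classes.
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Hypothetical imbalanced labels: 4 samples of class 0, 8 of class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0])

bal = balanced_accuracy(y_true, y_pred)
plain = float(np.mean(y_true == y_pred))
print(bal)  # 0.8125
```

Here the plain accuracy is 10/12, slightly higher than the balanced accuracy, because the larger class is also the one that is classified better.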


## Exercise 1

Before we trained the neural network, we preprocessed the data set by z-normalizing it. What happens if you remove this normalization step? Can you explain the observed behavior? Why do we go through the trouble of separately normalizing the training and test data?

## Exercise 2

Keras supports a number of different loss functions (see [the documentation](https://keras.io/losses/)). Try modifying the loss function we used in the example to something else and retrain the network. Why would you choose a certain loss over another? In particular, why is the 0/1 loss not even on this list?

## Exercise 3

Keras has a number of activation functions built in besides the ReLU (see [the documentation](https://keras.io/activations/)). Change the activation functions from ReLU to something else and retrain the network. Can you explain the effects of this choice? **Hint:** try visualizing the activation functions with a plot.

## Exercise 4

Compare the performance of the neural network to more "classical" machine learning algorithms such as random forests or support vector machines. Can you get the network to outperform them all?

## Exercise 5

Keras provides a [number of data sets](https://keras.io/datasets/) for you to experiment with. Try training a neural network on the Boston housing price regression data set. Note that this is a *regression problem*, not a *classification problem* as we have considered up until this point. You'll have to change your approach slightly to tackle this problem.