## Work in progress, use at your own risk

**NB: This worksheet assumes that you've watched the first Deep Learning lecture. If you do the worksheet before that, you may have to google some of the phrases.**

For this worksheet, we need to install Keras. Execute the cell below, or run the command without the exclamation point in a terminal, command prompt or anaconda prompt.

In [None]:
!pip install keras

# Worksheet 4: Deep Learning

## Keras

Keras is a frontend for other deep learning libraries (TensorFlow, Theano, CNTK), which implements most basic neural network architectures in a simple framework. The default backend is TensorFlow, which should be installed automatically with the pip command above (if not, run ```pip install tensorflow```).

Deep learning models are trained by gradient descent. These models are often so complex that we don't want to have to work out the gradient ourselves. Keras and TensorFlow allow the gradient to be computed automatically: we write down our model as a _computation graph_, compile it, and the system works out the gradients for us. This also makes it easier to run things on a GPU if you have one available (this worksheet doesn't require a GPU).
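To make "the system works out the gradients for us" a bit more concrete, here is a minimal sketch of the idea in plain Python. It is not how Keras or TensorFlow are implemented; the names `Scalar`, `backward` and `squared_error` are made up for this illustration, and it only supports addition and multiplication of single numbers.

```python
class Scalar:
    """A number that remembers which other numbers it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Scalar(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Scalar(self.value * other.value,
                      ((self, other.value), (other, self.value)))


def backward(output):
    """Compute d(output)/d(node) for every node in the graph (chain rule)."""
    order, seen = [], set()

    def visit(node):
        # Depth-first walk, so every node is listed after all of its parents.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    for node in order:
        node.grad = 0.0
    output.grad = 1.0
    # Walk from the output back to the inputs, passing gradients along.
    for node in reversed(order):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


def squared_error(w):
    # (3w - 6)^2, written using only + and *
    diff = w * Scalar(3.0) + Scalar(-6.0)
    return diff * diff


w = Scalar(0.0)
for step in range(50):
    out = squared_error(w)
    backward(out)                          # we never derived the gradient by hand
    w = Scalar(w.value - 0.05 * w.grad)    # one gradient descent step

final_w = w.value
final_loss = squared_error(w).value
print(final_w, final_loss)
```

The loss is smallest at w = 2, and gradient descent ends up there without us ever writing down the derivative ourselves.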

Before we start building models, let's familiarize ourselves with this idea.

In [1]:
# The first time you run this, tensorflow is started up, which may take a while.

from keras import backend as K

Using TensorFlow backend.


We'll make two 2x2 matrices and sum them.

In [2]:

x = K.placeholder(shape=(2,2))
y = K.placeholder(shape=(2,2))

z = x + y
print(z)

Tensor("add:0", shape=(2, 2), dtype=float32)


Now, you may be wondering what's going on here. x and y have a shape, but we haven't defined what values are in the matrix. How can we sum them before we have put values in them?

The trick here is that x and y aren't matrices like in numpy, they're placeholders for matrices. ```z``` is actually an _object_ that stores references to objects ```x``` and ```y```. In other words, we have a _computation graph_ with nodes x and y, and a node z that depends on them.

Here's what z looks like under the hood:

In [3]:
z.graph.as_graph_def()

node {
  name: "Placeholder"
  op: "Placeholder"
  attr {
    key: "dtype"
    value {
      type: DT_FLOAT
    }
  }
  attr {
    key: "shape"
    value {
      shape {
        dim {
          size: 2
        }
        dim {
          size: 2
        }
      }
    }
  }
}
node {
  name: "Placeholder_1"
  op: "Placeholder"
  attr {
    key: "dtype"
    value {
      type: DT_FLOAT
    }
  }
  attr {
    key: "shape"
    value {
      shape {
        dim {
          size: 2
        }
        dim {
          size: 2
        }
      }
    }
  }
}
node {
  name: "add"
  op: "Add"
  input: "Placeholder"
  input: "Placeholder_1"
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
}
versions {
  producer: 21
}

You don't need to understand what that means, just that it says that the graph contains three nodes (x, y and z) and that z is derived from x and y by summing.

We can now compile this graph, put some values at nodes x and y, compute it and retrieve the output z. Since Keras isn't meant to be used at this level we'll show you briefly how it works in tensorflow (usually we'll let Keras handle all this):

In [13]:
import tensorflow as tf
import numpy as np

with tf.Session() as session:
    x_value = np.random.rand(2,2)
    y_value = np.random.rand(2,2)
    
    result = session.run(z, feed_dict={x: x_value, y: y_value})
    
result

array([[ 1.77617455,  1.01602781],
       [ 1.50804496,  1.79550886]], dtype=float32)

This code takes the computation graph we've just made and tells TensorFlow that we want to use it as a function with inputs ```x``` and ```y```, and that we want to compute the value of ```z```.

If you want, you can think of x and y as the "input nodes" of a neural network, and z as the "output node".
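If the placeholder idea still feels abstract, the following sketch mimics it in a few lines of plain Python and numpy. The classes `Placeholder` and `Add` and the function `run` are invented for this illustration (they only imitate the TensorFlow behaviour we saw above): building the graph computes nothing, and values only flow through it when we feed the placeholders.

```python
import numpy as np


class Placeholder:
    """A graph node with a shape, but no value yet."""

    def __init__(self, shape):
        self.shape = shape


class Add:
    """A graph node that only stores references to its two inputs."""

    def __init__(self, left, right):
        self.left, self.right = left, right


def run(node, feed_dict):
    """Evaluate a node, looking up placeholder values in feed_dict."""
    if isinstance(node, Placeholder):
        value = np.asarray(feed_dict[node])
        assert value.shape == node.shape, "fed value has the wrong shape"
        return value
    return run(node.left, feed_dict) + run(node.right, feed_dict)


x_node = Placeholder(shape=(2, 2))
y_node = Placeholder(shape=(2, 2))
z_node = Add(x_node, y_node)      # nothing is computed here

# Only now do we put values at the input nodes and compute the output.
result = run(z_node, feed_dict={x_node: np.ones((2, 2)), y_node: np.eye(2)})
print(result)
```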

We can also make a _Variable_. A variable is also a node in the computation graph, but one that actually contains a value that is retained even if the inputs are changed.

In [14]:
a = K.placeholder(shape=(2,2))
b = K.variable(value=np.random.rand(2,2)) # note that we have to provide an actual matrix

c = K.dot(a, b) # matrix multiplication
print(c)

Tensor("MatMul_1:0", shape=(2, 2), dtype=float32)


c is still a node in the computation graph, and again, we can compile and run the graph for some input.

In [18]:
import tensorflow as tf
import numpy as np


with tf.Session() as session:
    session.run(tf.global_variables_initializer()) # low level stuff, normally Keras handles this

    a_value = np.random.rand(2,2)
    
    result = session.run(c, feed_dict={a: a_value})
    
result

array([[ 1.47241163,  0.18284869],
       [ 1.19554937,  0.13533755]], dtype=float32)

Using these principles, we can build neural networks. We use placeholders for the input, outputs and hidden layers, and variables for the weights. The variables persist between inputs and get changed during training.
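As a rough numpy analogy (not actual Keras code), the sketch below keeps a weight matrix around between inputs and then changes it with a single gradient descent step. The squared-error loss, the learning rate of 0.1 and all the names are assumptions made for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
weights = rng.rand(2, 2)          # persistent state, like K.variable


def forward(inputs):
    """Same computation as K.dot(a, b): the input changes, the weights are reused."""
    return inputs.dot(weights)


weights_at_start = weights.copy()
out_1 = forward(rng.rand(2, 2))
out_2 = forward(rng.rand(2, 2))
# Feeding different inputs did not touch the weights.
unchanged = np.array_equal(weights, weights_at_start)

# Training does change them: one gradient descent step on a squared error.
inputs, target = rng.rand(2, 2), np.zeros((2, 2))
loss_before = np.sum((forward(inputs) - target) ** 2)
gradient = 2 * inputs.T.dot(forward(inputs) - target)
weights -= 0.1 * gradient         # in-place update of the "variable"
loss_after = np.sum((forward(inputs) - target) ** 2)

print(unchanged, loss_before, loss_after)
```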

## Classification with a simple neural network model

Now, let's build a simple neural network. We'll start by loading the MNIST data that we saw in the first lecture.

In [56]:
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape) 
print(x_test.shape)

(60000, 28, 28)
(10000, 28, 28)


The training set is one 3-tensor of 60000 images of 28x28 pixels. The test set contains 10000 additional images.

Note that the data is already split into a canonical training and test set.

For this network, we don't care about the structure of the image, so we'll flatten everything into a 784-dimensional feature vector.

In [57]:
x_train = x_train.reshape(60000, -1)
x_test = x_test.reshape(10000, -1)

print(x_train.shape) 
print(x_test.shape)

(60000, 784)
(10000, 784)


The model we'll use is a simple fully connected feedforward network. A fully connected layer is called a Dense layer in Keras. Since fully connected layers are a bit heavy on image data (and you're running this on your laptops), we'll reduce the dimensionality of the data by PCA (see the Methodology 2 lecture).

In [58]:
from sklearn.decomposition import PCA

pca = PCA(n_components=60) # reduce to 60 dimensions
pca.fit(x_train)

x_train = pca.transform(x_train)
x_test = pca.transform(x_test)

print(x_train.shape) 
print(x_test.shape)

(60000, 60)
(10000, 60)


The training labels are encoded as integers. We need these as one-hot vectors instead, so we can match them to the ten outputs of the neural network.

In [59]:
from keras.utils import to_categorical

print(y_train.shape, y_test.shape)

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

print(y_train.shape, y_test.shape)

(60000,) (10000,)
(60000, 10) (10000, 10)


The labels are now encoded as vectors of length 10 with zeros everywhere, except at the true class.

In [60]:
print(y_train[0, :])

[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
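If you're curious what ```to_categorical``` does, the same encoding can be sketched by hand in numpy. This is only an illustration with a handful of example labels, not the Keras implementation.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])    # a few example integer labels

# Row i of the 10x10 identity matrix is exactly the one-hot vector for class i,
# so indexing the identity matrix with the labels gives one row per label.
one_hot = np.eye(10)[labels]

# Going back is an argmax over each row.
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)
print(one_hot[0])
print(recovered)
```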


We are now ready to create a model. Keras has two APIs for this: the Sequential API and the Model API. The Sequential API (the simplest) assumes that your model is a simple sequence of operations, usually neural network layers. The input is passed through the first layer, the result of that is passed through the second, and so on.

This is useful for simple NN models where you are only interested in the input and output. If your model gets more complex, you may want to use the Model API (we'll discuss that below).

In [61]:
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(128, input_shape=(60,))) # first dense layer, 128 hidden units
model.add(Activation('relu'))           # activation layer
model.add(Dense(10))                    # second dense layer
model.add(Activation('softmax'))        # output class probabilities

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_7 (Dense)              (None, 128)               7808      
_________________________________________________________________
activation_7 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 10)                1290      
_________________________________________________________________
activation_8 (Activation)    (None, 10)                0         
Total params: 9,098
Trainable params: 9,098
Non-trainable params: 0
_________________________________________________________________


There are a few things to note:

* For the first layer we need to provide the input shape. For the second, this is not necessary, because Keras can _infer_ the input shape from the layers before it.
* The Dense layers are just linear operations (multiply by a weight matrix, add a bias vector). The activation functions are added as separate layers (the two Activation rows in the summary). You can also pass the activation as an argument to the Dense layer.
* Keras picks a sensible default weight initialization for us, and applies it (this model already has initial weights).
* The last layer has 10 nodes (one for each class) with a softmax activation. This means we can interpret the output as class probabilities.
* Even though we specified a one-dimensional input, the model summary shows two-dimensional shapes with the first dimension always ```None```. This is the _batch dimension_. Neural networks are almost never trained/run one input at a time; we usually feed them several inputs together (a mini-batch). Keras assumes this is the case. So, if we choose a batch size of ten, our input dimension would become (10, 60) and our first hidden layer would have dimensions (10, 128). Keras can be flexible about the batch size so we don't have to specify it now.
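To check the shapes and parameter counts in these notes, here is a numpy sketch of the forward pass this model computes for a mini-batch of ten inputs. The random weights below are made up for the illustration (Keras uses its own default initialization), so only the shapes and counts are meaningful, not the values.

```python
import numpy as np

rng = np.random.RandomState(0)
batch = rng.randn(10, 60)                        # a mini-batch of 10 inputs

# One weight matrix and one bias vector per Dense layer.
W1, b1 = rng.randn(60, 128) * 0.1, np.zeros(128)
W2, b2 = rng.randn(128, 10) * 0.1, np.zeros(10)

hidden = np.maximum(0, batch.dot(W1) + b1)       # Dense(128) followed by relu
logits = hidden.dot(W2) + b2                     # Dense(10)

# Softmax: subtract the row maximum first for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

n_params = W1.size + b1.size + W2.size + b2.size

print(hidden.shape, probs.shape)   # (10, 128) (10, 10)
print(probs.sum(axis=1))           # every row sums to 1
print(n_params)                    # 7808 + 1290 = 9098
```

The parameter count matches the model summary: 60 x 128 weights plus 128 biases for the first layer, and 128 x 10 weights plus 10 biases for the second.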



To get a complete computation graph, we need more than just a model: we need a loss function as well. We also need to specify which optimizer we're going to use. 

Let's use categorical cross-entropy as loss, together with the _Adam_ optimizer (all these optimizers are fancy variations on gradient descent).

With this information, we can compile the model.

In [62]:
from keras.optimizers import SGD, Adam

optimizer = Adam()
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

We've also told the compiler that we'd like it to compute accuracy for us during training (since categorical cross-entropy is a bit hard to interpret).
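To get a feeling for the two numbers Keras will report, here is a small numpy sketch that computes categorical cross-entropy and accuracy for three made-up predictions over three classes. It leaves out the numerical safeguards a library would normally add (such as clipping probabilities away from zero).

```python
import numpy as np

# One-hot true labels and predicted class probabilities (made-up values).
y_true = np.array([[0, 0, 1],
                   [1, 0, 0],
                   [0, 1, 0]], dtype=float)
y_pred = np.array([[0.1, 0.2, 0.7],
                   [0.6, 0.3, 0.1],
                   [0.5, 0.4, 0.1]])

# Cross-entropy: minus the log of the probability assigned to the true class.
per_example = -np.sum(y_true * np.log(y_pred), axis=1)
mean_loss = per_example.mean()

# Accuracy: how often the most probable class is the true class.
accuracy = np.mean(y_pred.argmax(axis=1) == y_true.argmax(axis=1))

print(per_example)
print(mean_loss, accuracy)
```

The third example is misclassified, so it has the largest loss and the accuracy is 2/3. The loss also tells the first two examples apart (0.7 is a more confident correct prediction than 0.6), which accuracy cannot do.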

We're now ready to start training:

In [63]:
# Train the model, iterating on the data in batches of 32 samples
model.fit(x_train, y_train, epochs=5, batch_size=32);

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


On my laptop (no GPU), this takes about 5 seconds per epoch (an epoch is one pass over the whole training set).
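As a rough sketch of what happens inside one epoch, the loop below walks over the 60000 training examples in shuffled mini-batches of 32. It only counts; a real training loop would compute the gradient on each batch and update the weights. The variable names are made up for the illustration.

```python
import numpy as np

n_samples, batch_size = 60000, 32
batches_per_epoch = int(np.ceil(n_samples / batch_size))

rng = np.random.RandomState(0)
order = rng.permutation(n_samples)     # visit the examples in a random order

n_batches, n_seen = 0, 0
for start in range(0, n_samples, batch_size):
    batch_indices = order[start:start + batch_size]
    # here: one gradient descent step on the examples in batch_indices
    n_batches += 1
    n_seen += len(batch_indices)

print(batches_per_epoch, n_batches, n_seen)
```

So one epoch with these settings consists of 1875 weight updates, and every training example is used exactly once.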

Note that these losses/accuracies are on the training data. Of course, **we don't want to use the test data at this stage to see how well we're doing**. We can tell Keras to withhold some validation data, so we can get an indication of the accuracy.

In [64]:
# Train again in batches of 32 samples, withholding 1/6 of the training data for validation
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=1/6)

Train on 50000 samples, validate on 10000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1408ecf60>

Note that the model remembers its weights so we've now trained it for 10 epochs, not 5.
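To see where the "Train on 50000 samples, validate on 10000 samples" line comes from, here is a sketch of the split in numpy. According to the Keras documentation, the validation data is taken from the last samples of the data you pass in, before shuffling; the exact arithmetic below is our assumption about how that fraction is turned into an index.

```python
import numpy as np

n_samples = 60000
validation_split = 1 / 6

# Everything before split_at is used for training, the rest for validation.
split_at = int(n_samples * (1 - validation_split))

indices = np.arange(n_samples)
train_indices = indices[:split_at]
val_indices = indices[split_at:]       # the last part of the data

print(len(train_indices), len(val_indices))
```

Because the split is made before shuffling, the same 10000 examples are withheld every time you call ```fit``` with this setting.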