# Convolutional Neural Networks (CNN) - Workshop

## CPU Version: recognizing small grayscale images of handwritten digits - MNIST
### Ideally suited to be run on a local laptop or PC


### Load Modules and Dataset
Conveniently, Keras already ships with the MNIST dataset, which is often used to test machine learning algorithms because of its small size.

In [None]:
from keras.datasets import mnist
from keras.preprocessing.image import array_to_img
import matplotlib.pyplot as plt
import numpy as np
import random

# Load MNIST data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# add a channel axis (1 color channel): (N, 28, 28) -> (N, 28, 28, 1)
x_train = np.expand_dims(x_train, axis=3)
x_test = np.expand_dims(x_test, axis=3)

### Take a look at your images

In [None]:
# Randomly select images
random.seed(23)
random_ids = random.sample(population=range(0, x_train.shape[0]),k=5)
random_imgs = x_train[random_ids, :]
random_labels = y_train[random_ids]

# display random images and their label as provided by the data set
for i in range(0, len(random_ids)):
    print("Label: %s" % random_labels[i])
    plt.imshow(array_to_img(random_imgs[i,:]).convert('L'), cmap='gray')
    plt.show()

### Data Generators

The Keras ImageDataGenerator implements transformations, pre-processing and serving of image data during training, for example streaming image batches from a directory while a model is being trained. In the GPU version of this workshop, the data pre-processing and streaming is done on the CPU while the model is trained on the GPU, which increases efficiency.

Data augmentation is a process where images are artificially altered to create 'new' data that is similar to the original data. This helps to increase the effective size of your data set and to avoid overfitting. Typical data augmentation operations are flipping, cropping and zooming.
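
To make the idea of augmentation concrete, here is a minimal NumPy sketch of three such operations on a toy 4x4 array. The helper names (`horizontal_flip`, `shift_right`, `center_crop`) are hypothetical and written for illustration only; this is not how ImageDataGenerator implements its transformations internally.

```python
import numpy as np

# A tiny 4x4 "image" standing in for a real 28x28 digit (toy data, an assumption).
img = np.arange(16, dtype=np.float32).reshape(4, 4)

def horizontal_flip(image):
    """Mirror the image left to right."""
    return image[:, ::-1]

def shift_right(image, pixels):
    """Shift the image to the right and fill the vacated columns with zeros."""
    if pixels == 0:
        return image.copy()
    shifted = np.zeros_like(image)
    shifted[:, pixels:] = image[:, :-pixels]
    return shifted

def center_crop(image, size):
    """Cut a size x size window out of the middle of the image."""
    top = (image.shape[0] - size) // 2
    left = (image.shape[1] - size) // 2
    return image[top:top + size, left:left + size]

flipped = horizontal_flip(img)
shifted = shift_right(img, 1)
cropped = center_crop(img, 2)

print(flipped[0])  # first row, mirrored
print(shifted[0])  # first row, moved one pixel to the right
print(cropped)     # the central 2x2 window
```

Not every operation suits every data set: a mirrored handwritten digit is usually no longer a valid digit, so flips should be chosen with the task in mind.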

Data pre-processing operations like the standardization of pixel values have been shown to increase the efficiency of model training. The corresponding ImageDataGenerator options include featurewise_center and featurewise_std_normalization.
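
The following NumPy sketch shows roughly what feature-wise centering and standardization amount to: the mean and standard deviation are computed on the training data only and then applied to both training and test data. The random arrays and the small `eps` constant are assumptions for illustration; the exact computation inside Keras may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for x_train / x_test with pixel values in [0, 255].
train = rng.integers(0, 256, size=(100, 28, 28, 1)).astype(np.float32)
test = rng.integers(0, 256, size=(20, 28, 28, 1)).astype(np.float32)

# Statistics come from the training data only: one value per color channel.
mean = train.mean(axis=(0, 1, 2), keepdims=True)
std = train.std(axis=(0, 1, 2), keepdims=True)
eps = 1e-6  # guards against division by zero (assumed value)

# The same statistics are applied to both data sets.
train_standardized = (train - mean) / (std + eps)
test_standardized = (test - mean) / (std + eps)

print("train mean/std after standardization:",
      float(train_standardized.mean()), float(train_standardized.std()))
```

Using the training statistics for the test data as well keeps both data sets on the same scale, which is why the notebook calls `fit(x_train)` on both generators.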

https://keras.io/preprocessing/image/

In [None]:
# pre-processing of images
from keras.preprocessing.image import ImageDataGenerator

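# data generator for training process
datagen_train = ImageDataGenerator(
    featurewise_center=True,             # subtract the training-set mean
    featurewise_std_normalization=True,  # divide by the training-set std
    rotation_range=20,                   # random rotations of up to 20 degrees
    width_shift_range=0.2,               # random horizontal shifts (fraction of width)
    height_shift_range=0.2,              # random vertical shifts (fraction of height)
    horizontal_flip=True)                # random mirroring; mirrored digits are often not valid digits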

# data generator for testing
# we don't need data augmentation here, but we need to pre-process it in the same way as the training data
datagen_test = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True)

# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
# this has to be done on the training data & is applied on both data generators
datagen_train.fit(x_train)
datagen_test.fit(x_train)

# initialize flow from random data to show what pre-processing of images does
random_datagen = datagen_train.flow(random_imgs, random_labels)

# initialize flow from data
test_datagen = datagen_test.flow(x_test, y_test)
train_datagen = datagen_train.flow(x_train, y_train)


In [None]:
# Now we show what the pre-processing does
# (the built-in next() works on any Python iterator)
data_batch = next(random_datagen)
for i in range(0, len(data_batch[1])):
    print("Label: %s" % data_batch[1][i])
    plt.imshow(array_to_img(data_batch[0][i,:]).convert('L'), cmap='gray')
    plt.show()


If you are not happy with the output, e.g. because the data augmentation is too aggressive, feel free to change the ImageDataGenerator settings and run the cells again.

### Model Architecture
Now we define a model architecture, which consists of sequential layers of operations. The 2 main blocks of a CNN are the convolutional part and the fully connected part.

Convolutional part: These layers extract features from the input (e.g. edges, corners, patterns). This block starts directly at the input and typically consists of a convolutional layer (Conv2D()), followed by an activation (nowadays usually ReLU) and a pooling layer (e.g. MaxPooling2D()). Multiple such constructs can be stacked.
* https://keras.io/layers/convolutional/
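
As a minimal illustration of the convolution, activation and pooling sequence, the NumPy sketch below applies a hand-written vertical edge filter to a toy 6x6 image. The functions `conv2d_valid`, `relu` and `max_pool_2x2` are hypothetical helpers written for this example; in a real CNN the filter values are learned rather than set by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def relu(x):
    """Set negative values to zero."""
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """Keep the maximum of every non-overlapping 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6), dtype=np.float32)
image[:, 3:] = 1.0

# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float32)

features = relu(conv2d_valid(image, kernel))  # 4x4 feature map
pooled = max_pool_2x2(features)               # 2x2 after pooling

print(features)
print(pooled)
```

The feature map is non-zero only where the filter window covers the edge, and pooling keeps that response while halving the resolution.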

The fully connected part represents a classical neural network consisting of an input layer (Flatten()) which takes the features from the convolutional part, some hidden layer(s) (Dense()), and finally an output layer (Dense(), with an activation function that results in probabilities for classification tasks, like sigmoid or softmax).
* https://keras.io/layers/core/
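
To see how the two parts fit together, the short calculation below traces the tensor sizes and parameter counts for the layer settings used in this workshop's model (28x28x1 input, 32 filters of 3x3 without padding, 2x2 pooling, a 128-unit hidden layer and 10 outputs). The helper `conv_output_size` is a hypothetical name; the numbers should match what `model.summary()` reports, assuming the default settings of the layers are used.

```python
def conv_output_size(size, kernel, stride=1):
    """Spatial size after an unpadded ('valid') convolution or pooling step."""
    return (size - kernel) // stride + 1

in_size, in_channels = 28, 1   # MNIST images: 28x28 pixels, 1 channel
filters, kernel = 32, 3        # first convolutional layer

conv_size = conv_output_size(in_size, kernel)          # 28 -> 26
pool_size = conv_output_size(conv_size, 2, stride=2)   # 26 -> 13
flat_features = pool_size * pool_size * filters        # input to the dense part

# weights plus one bias per filter / unit
conv_params = kernel * kernel * in_channels * filters + filters
dense_params = flat_features * 128 + 128
output_params = 128 * 10 + 10
total_params = conv_params + dense_params + output_params

print("feature map after conv:", conv_size, "x", conv_size, "x", filters)
print("feature map after pool:", pool_size, "x", pool_size, "x", filters)
print("flattened features:", flat_features)
print("parameters (conv, dense, output, total):",
      conv_params, dense_params, output_params, total_params)
```

Almost all parameters sit in the first fully connected layer, which is one reason pooling before flattening is useful.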

In [None]:
# import keras model and layers
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

# sequential model (layer based); note the parentheses, which create a model instance
model = Sequential()
# Convolutional layer over 2 dimensions
# 32 filters, each with a size of 3x3 pixels
# activation function is ReLU
model.add(Conv2D(32, kernel_size=(3, 3),activation='relu',input_shape=(28,28,1), name="conv_1"))
# Aggregate data using max pooling (reduce size of feature maps)
model.add(MaxPooling2D(pool_size=(2, 2), name="pool_1"))
# randomly set 50% of all outputs to zero to prevent overfitting
model.add(Dropout(0.5, name="drop_1"))
# this converts our 3D feature maps to a 1D feature vector
model.add(Flatten(name="flatten"))
# fully connected layer with 128 output values
model.add(Dense(128, activation='relu',name="dense_1"))
# randomly set 50% of all outputs to zero to prevent overfitting
model.add(Dropout(0.5,name="drop_2"))
# softmax transformation (multinomial logistic regression) to obtain class probabilities
model.add(Dense(10, activation='softmax', name="output"))

# to take a look at the model we can invoke this command
model.summary()

Feel free to change the architecture. 
* You could for example get rid of all convolutional layers and only use a 'classical' neural network and see what happens.
* Another option is to add more convolutional layers or increase/decrease the number of filters.
* You can also play around with the Dropout() layers to see whether overfitting really is a problem here.

### Model compilation & training
After the model architecture has been defined, we have to compile the model. Important parameters include the optimizer and the loss function. The loss function is what the model tries to minimize during the training process. For binary classification it would be 'binary_crossentropy'; here we distinguish 10 digit classes with integer labels, so we use 'sparse_categorical_crossentropy'. The optimizer defines how the gradients and their updates are calculated. Changing the optimizer can have a great effect on model convergence: if, for example, one switches to the stochastic gradient descent optimizer and chooses a high learning rate, the model may never converge.

* https://keras.io/optimizers/
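
The loss used in the compile call of this section, 'sparse_categorical_crossentropy', can be sketched in a few lines of plain Python: the softmax turns raw scores into probabilities, and the loss is the negative log of the probability assigned to the true class. The functions and the example scores below are hypothetical and only illustrate the principle for a single sample.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Negative log probability of the true class, given as an integer label."""
    return -math.log(probs[label])

logits = [2.0, 1.0, 0.1]   # hypothetical raw scores for 3 classes
probs = softmax(logits)

loss_if_class_0 = sparse_categorical_crossentropy(probs, 0)  # model agrees
loss_if_class_2 = sparse_categorical_crossentropy(probs, 2)  # model disagrees

print("probabilities:", [round(p, 3) for p in probs])
print("loss if true class is 0:", round(loss_if_class_0, 3))
print("loss if true class is 2:", round(loss_if_class_2, 3))
```

The loss is small when the model puts a high probability on the true class and grows quickly when it does not, which is what training tries to minimize on average.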

#### Epochs
Number of full passes over the training data (increasing this number usually improves model performance, at the cost of increased training time).
#### Batch Size
Number of images used simultaneously to calculate one update of the gradients. Using 1 image is stochastic gradient descent (SGD), using more than 1 but fewer than N images is mini-batch gradient descent (usually between 32 and 256 images), and using all N images is (batch) gradient descent.
#### Steps_per_epoch
The number of batches as generated by the data generator to process per epoch. We divide the number of samples by the batch size to make one full pass over the training data per epoch.
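
The arithmetic behind this setting is shown below for the 60,000 MNIST training images and a few example batch sizes. The variable names are chosen for illustration; the step count only yields one full pass if the batch size used here is the same one the data generator actually produces.

```python
import math

n_train = 60000  # number of MNIST training images

steps = {}
leftover = {}
for batch_size in (1, 32, 64, 256):
    steps[batch_size] = n_train // batch_size        # full batches per epoch
    leftover[batch_size] = n_train % batch_size      # images not covered by full batches
    total_batches = math.ceil(n_train / batch_size)  # including a last, smaller batch
    print("batch size %3d: %5d full steps, %2d images left over, %5d batches in total"
          % (batch_size, steps[batch_size], leftover[batch_size], total_batches))
```

With integer division, a few images can be left out of each epoch when the data set size is not a multiple of the batch size.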

In [None]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

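# number of simultaneously processed images
# flow() was called without a batch_size, so the generators use their default (32);
# reading the value from the generator keeps the step counts consistent with it
batch_size = train_datagen.batch_size

# train the model
model.fit_generator(
    train_datagen,
    steps_per_epoch=train_datagen.n // batch_size,
    epochs=2,
    workers=2,
    validation_data=test_datagen,
    validation_steps=test_datagen.n // batch_size)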

Feel free to choose a different optimizer or to increase the number of epochs.

### Evaluating the model
The model quickly learns to distinguish between the different digits, which you can see by observing the accuracy during training. After only a few epochs we can therefore already take a look at some predictions of our model.

In [None]:
# predict on test sample

# get a test batch (the built-in next() works on any Python iterator)
test_batch_data = next(test_datagen)

# calculate predictions
p_test = model.predict_on_batch(test_batch_data[0])

# show some images and their prediction
for i in range(0, len(test_batch_data[1])):
    id_max = np.argmax(p_test[i])
    max_val = np.max(p_test[i])
    print("Predicted a %s with %s percent probability, the true label is %s" %
          (id_max, round(float(max_val * 100), 2), test_batch_data[1][i]))
    # display in grayscale, consistent with the earlier image cells
    plt.imshow(array_to_img(test_batch_data[0][i, :]).convert('L'), cmap='gray')
    plt.show()
    plt.show()
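
Looking at single images is instructive, but a summary number is often useful as well. The sketch below shows how an accuracy could be computed from predicted class probabilities; the arrays `probs` and `labels` are made-up toy values standing in for the model output and the true labels of a batch.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 images and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 2, 1])  # hypothetical true classes

predicted = probs.argmax(axis=1)              # most probable class per image
confidence = probs.max(axis=1)                # probability of that class
accuracy = float((predicted == labels).mean())
wrong = np.flatnonzero(predicted != labels)   # indices of misclassified images

print("predicted classes:", predicted)
print("accuracy:", accuracy)
for i in wrong:
    print("image %d: predicted %d (p=%.2f), true label %d"
          % (i, predicted[i], confidence[i], labels[i]))
```

Listing the misclassified images together with their confidence is a simple way to find examples worth plotting and inspecting by eye.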