# Deep learning

Deep learning refers to neural networks with many layers and a large number of parameters. Training them can be slow, so be prepared to wait if you have a low-end PC.

Let's have a look at some popular frameworks for deep learning. The most popular is [TensorFlow](https://www.tensorflow.org/), which lets you build complex computation graphs in Python while the actual heavy lifting is done by underlying C++ implementations. While TensorFlow itself is really powerful, we'll be having a look at [Keras](https://keras.io/), an abstraction on top of TensorFlow that lets you define neural networks in an easy manner.

If you're new to Jupyter notebooks, you can run a cell by clicking on it and pressing `ctrl` + `enter`. The variables, functions, etc. defined in previously run cells remain in memory, so you can refer to them later in other cells.

This exercise has some ready-made code, and you should implement the parts where it says `#TODO`. Feel free to create your own functions or classes as well.

**First things first, download the [HASYv2](https://zenodo.org/record/259444#.WcZjIZ8xDCI) dataset into the same folder as this notebook, and extract it.**

In [1]:
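# Python 2 compatibility (a __future__ import must be the first statement in the cell)
from __future__ import print_function

import sys
# Machine-specific path to site-packages; adjust or remove it for your own setup
sys.path.append("/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages")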

import numpy as np
import pandas as pd


Implement a function that reads the HASYv2 dataset from the given folder. It should return a tuple containing two numpy arrays: the data as a `(168233, 32, 32)` tensor, and a vector of length `168233` containing the labels (`symbol_id`) as integers.
_Hint:_ Use scipy's `imread` to read the `.png` images as grayscale. Each image should be a 32x32 matrix. Note that `scipy.misc.imread` was removed in SciPy 1.2, so on newer installations you will need another image reader (the deprecation notice pointed to `imageio.imread`).

In [2]:
from scipy.misc import imread
from HASYDataLoader import HASYDataLoader

loader = HASYDataLoader(path = "data/HASYv2/", symbol_range = None, flatten_images = False)
X, y = (loader.images, loader.symbol_ids)

print(X.shape, y.shape) # Should be (168233, 32, 32) (168233,)

(168233, 32, 32) (168233,)


Overfitting is when we fit the model to work well on our training data, but it then performs poorly on data it has not seen before. Fitting a model to predict the training set perfectly is in most cases trivial, and is a simple exercise in [optimization](https://en.wikipedia.org/wiki/Mathematical_optimization). In machine learning, however, we are interested in _generalization error_, or in other words, how well our model works on previously unseen data. Therefore, we want to evaluate the performance of our model on data that it has not yet seen: this is a way to approximate how well it generalizes.

As a reminder from last week, tuning the hyperparameters of your model or choosing your model based on the test data **overfits to the test data**. We don't want this, so we use a so-called _validation set_ to tune our model.

To do this, **split the data into training, validation and test sets. Be sure to shuffle it first, otherwise your validation and test sets will not contain most labels**. Your function should return a tuple containing the training, validation and test data, i.e. `(X_train, y_train, X_val, y_val, X_test, y_test)`. You can use, e.g., the proportions `0.8, 0.1, 0.1`.
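As a rough illustration of the shuffle-then-split idea, here is a minimal NumPy sketch. The function name `shuffle_and_split`, its parameters and the toy data are hypothetical; this is not the implementation behind `HASYDataLoader`, just one way the step could look.

```python
import numpy as np

def shuffle_and_split(X, y, training_ratio=0.8, validation_ratio=0.1, seed=0):
    """Shuffle X and y together, then cut them into train/val/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))  # one shared permutation keeps X and y aligned
    X, y = X[order], y[order]

    n_train = int(len(X) * training_ratio)
    n_val = int(len(X) * validation_ratio)

    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
    X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]  # whatever is left
    return X_train, y_train, X_val, y_val, X_test, y_test

# Toy stand-in for the real data: 10 "images" of shape (2, 2) with labels 0..9
X_toy = np.arange(40).reshape(10, 2, 2)
y_toy = np.arange(10)
parts = shuffle_and_split(X_toy, y_toy)
print([p.shape for p in parts])
```

With 168233 samples and the ratios `0.8` and `0.1`, this rule gives 134586 training, 16823 validation and 16824 test samples.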

In [3]:
X_train, X_test, X_val, y_train, y_test, y_val = loader.get_training_validation_and_test_data(training_ratio = 0.8, validation_ratio = 0.1)

print(X_train.shape, y_train.shape) # Should yield approx (134586, 32, 32) (134586,)

Validation range [0, 16823]
(134586, 32, 32) (134586,)


Since there are 369 different classes with over 150000 images in total, let's reduce the complexity of our task by taking only the first 100 classes. Also note that the `symbol_id` field does not start from 0, but instead has arbitrary numbers.

**Transform the labels so that the numbering for the class goes from 0 to 99, and discard the rest of the classes and corresponding images.**

In [4]:
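# Give new ids to classes such that the smallest symbol_id gets the number 0,
# the next smallest symbol_id gets the number 1, and so forth.
# The mapping is built once from all three splits, so a class that happens to be
# missing from one split cannot shift the numbering in that split.
all_symbol_ids = np.unique(np.concatenate([y_train, y_val, y_test]))

def transform_labels(y):
    # all_symbol_ids is sorted, so the insertion index of each id is its new class number
    return np.searchsorted(all_symbol_ids, y)

y_train, y_val, y_test = map(transform_labels, [y_train, y_val, y_test])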

print(y_train.shape, y_val.shape, y_test.shape) # Should be approx (134586,) (16823,) (16824,)

# Keep only the elements of arr, and their labels, whose label in y_arr lies in [0, 100)
def filter_out(arr, y_arr):
    mask = y_arr < 100
    return (arr[mask], y_arr[mask])

X_train, y_train = filter_out(X_train, y_train)
X_val, y_val = filter_out(X_val, y_val)
X_test, y_test = filter_out(X_test, y_test)

print(y_train.shape, X_train.shape) # Should be approx (34062,) (34062, 32, 32)

(134586,) (16823,) (30283,)
(34009,) (34009, 32, 32)


Currently our labels are single integers that represent the class. However, for neural networks it's common to switch them to a so-called "one-hot" encoding, where each label is represented by a vector whose length is the number of classes, with a one at the position of the class and zeros everywhere else.

For example, if there were 7 classes, we could represent the fifth class (class index 4 when counting from 0) as `0000100`. In the case of the HASYv2 dataset, there are 369 unique classes, but we only picked the first 100, so each label should be a length 100 vector with a single one.

_Hint:_ `keras` has a function for this.
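To see what the encoding looks like independently of Keras, here is a minimal NumPy sketch. The helper `one_hot` is a hypothetical name used for illustration only; it is not the Keras function the hint refers to.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    # Row i of the identity matrix is exactly the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[labels]

labels = np.array([0, 4, 6])
encoded = one_hot(labels, num_classes=7)
print(encoded[1])  # the row for class index 4: a one at position 4, zeros elsewhere
```

Passing the number of classes explicitly keeps the width of the encoding fixed even when some class does not occur in the labels being encoded.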

In [5]:
# Convert labels to one-hot encoding here
from keras.utils import to_categorical

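# num_classes fixes the vector length at 100 even if some class is absent from the data
y_train = to_categorical(y_train, num_classes=100)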
print(y_train.shape) # Should be approx (34062, 100)

Using TensorFlow backend.


(34009, 100)


Next, let's create a simple linear model using Keras to get familiar with it. Your model should consist of a `Flatten` layer that takes `(32, 32)` matrices as input and outputs a vector of length `1024`. The second layer should be a `Dense` layer that has as many outputs as there are classes in the data (i.e. 100) and a "softmax" activation. Check out the [30 seconds to Keras](https://keras.io) intro for an introduction to the framework. If you're stuck, check the next blocks for some of the same ingredients.
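To make it concrete what such a model computes, the following NumPy sketch runs a single forward pass of "flatten, then dense with softmax". The weights `W` and `b` are hypothetical random values standing in for parameters that Keras would learn, and the input batch is fake data.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum keeps exp() from overflowing."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
num_classes = 100

# Hypothetical, randomly initialised parameters (learned during training in practice)
W = rng.normal(scale=0.01, size=(32 * 32, num_classes))
b = np.zeros(num_classes)

images = rng.random((5, 32, 32))        # a fake batch of 5 images
flat = images.reshape(len(images), -1)  # flattening: (5, 32, 32) -> (5, 1024)
probs = softmax(flat @ W + b)           # dense layer followed by softmax

n_params = W.size + b.size              # 1024 * 100 weights + 100 biases
print(flat.shape, probs.shape, n_params)
```

Each row of `probs` sums to one, so it can be read as a probability distribution over the 100 classes, and the parameter count of 102500 is what a dense layer of this size should report in a model summary.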

In [6]:
from keras.models import Sequential
from keras.layers import Dense, Flatten

# This function should return a keras Sequential model that has the appropriate layers
# TODO
def create_linear_model():
    model = Sequential()
    model.add(Flatten(input_shape = (32, 32)))
    model.add(Dense(units = 100, activation = "softmax")) # softmax turns the 100 outputs into class probabilities
    
    return model

model = create_linear_model()
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 1024)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               102500    
Total params: 102,500
Trainable params: 102,500
Non-trainable params: 0
_________________________________________________________________


Here we "compile" the model, specifying the loss function to use as well as the method for minimizing it. In addition, we request that Keras report the accuracy throughout training. We also feed in the validation data, which means that Keras will evaluate the validation loss for us after each epoch (one epoch is one full pass over the training data, so the model sees each training sample once).
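For intuition about the loss being minimized, here is a small NumPy sketch of categorical crossentropy on one-hot labels. It follows the textbook formula; the clipping constant `eps` is an assumption made here to avoid `log(0)`, and the exact numerical details inside Keras may differ.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over the batch of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Two samples, three classes: one-hot labels and predicted probabilities
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])

loss = categorical_crossentropy(y_true, y_pred)
print(loss)
```

Because the labels are one-hot, only the probability assigned to the correct class contributes to each sample's loss, so the loss shrinks as that probability approaches one.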

In [7]:
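# Feel free to try out other optimizers. Categorical crossentropy compares the
# predicted probability distribution over the classes with the one-hot label.
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

# Passing validation_data makes Keras report validation loss and accuracy after each epoch
model.fit(X_train, y_train, epochs=3, batch_size=64,
          validation_data=(X_val, to_categorical(y_val, num_classes=y_train.shape[1])))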

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x10b707f98>

The simple linear model probably didn't do too well. Let's create a CNN (Convolutional Neural Network) next. We've provided the code for the network, so just run these cells for now. Try to experiment here, adding and removing layers and tuning the hyperparameters to get better results on the validation data.

In [8]:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, BatchNormalization, Dropout
from keras.backend import clear_session

def create_convolutional_model():
    model = Sequential()
    
    model.add(Conv2D(128, (3, 3), input_shape=(32, 32, 1))) # A convolutional layer
    model.add(MaxPooling2D((4,4))) # Max pooling downsamples the feature maps, which reduces the work for later layers
    model.add(Dropout(0.4)) # Randomly zeroing activations during training helps against overfitting
    model.add(Conv2D(128, (2, 2), activation="relu"))
    model.add(BatchNormalization()) # Normalizes activations per batch so their scale stays stable during training
    model.add(Flatten())
    model.add(Dense(256, activation="relu"))
    model.add(Dense(128, activation="relu"))
    model.add(BatchNormalization())
    model.add(Dense(y_train.shape[1], activation="softmax"))
    
    return model

clear_session()

model = create_convolutional_model()
model.summary() # Get a summary of all the layers

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 30, 30, 128)       1280      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 128)         0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 7, 7, 128)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 6, 6, 128)         65664     
_________________________________________________________________
batch_normalization_1 (Batch (None, 6, 6, 128)         512       
_________________________________________________________________
flatten_1 (Flatten)          (None, 4608)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               1179904   
__________

As you can see, our model has a lot of parameters. Optimizing this might take a while, depending on your PC.
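The shapes and parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming the Keras defaults used in the code above: no padding and stride 1 for the convolutions, and a pooling stride equal to the pool size. The helper names are hypothetical.

```python
def conv_output_size(size, kernel):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_output_size(size, pool):
    """Spatial size after non-overlapping pooling (any remainder is dropped)."""
    return size // pool

def conv_params(kernel, in_channels, filters):
    """kernel * kernel * in_channels weights per filter, plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size = conv_output_size(32, 3)    # first Conv2D with a 3x3 kernel
size = pool_output_size(size, 4)  # MaxPooling2D((4, 4))
size = conv_output_size(size, 2)  # second Conv2D with a 2x2 kernel
flat = size * size * 128          # Flatten over 128 channels

first_conv = conv_params(3, 1, 128)
second_conv = conv_params(2, 128, 128)
first_dense = flat * 256 + 256    # weights plus biases of Dense(256)
print(size, flat, first_conv, second_conv, first_dense)
```

These values agree with the `Output Shape` and `Param #` columns of the summary: the first dense layer alone holds well over a million parameters, which is where most of the model's size comes from.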

In [None]:
y_val.shape

(4283,)

In [None]:
# Feel free to try out other optimizers. Categorical crossentropy compares the
# predicted probability distribution over the classes with the one-hot label.
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

# Extra axis for the "gray" channel; num_classes keeps the label width equal to the model output
num_classes = y_train.shape[1]
model.fit(X_train[:, :, :, np.newaxis], y_train, epochs=5, batch_size=64,
          validation_data=(X_val[:, :, :, np.newaxis], to_categorical(y_val, num_classes=num_classes)))

Train on 34009 samples, validate on 4283 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5

Finally, let's see how well our model did on the held-out test data. This is what ultimately matters. The second number should be the test accuracy, and you should be able to get approximately 80% with our setup. Try to improve this, but be careful not to overfit on the test data. Always use the validation set to tune your model.

In [None]:
model.evaluate(X_test[:, :, :, np.newaxis], to_categorical(y_test, num_classes=y_train.shape[1]))
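The accuracy reported here is the fraction of samples whose most probable predicted class equals the true class. A minimal NumPy sketch of that calculation, using made-up probabilities rather than real model output and a hypothetical helper name:

```python
import numpy as np

def accuracy_from_probabilities(probs, y_onehot):
    """Fraction of rows where the most probable class equals the true class."""
    return float(np.mean(probs.argmax(axis=1) == y_onehot.argmax(axis=1)))

# Made-up predictions for 4 samples and 3 classes
probs = np.array([[0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.4, 0.1]])
y_onehot = np.eye(3)[[1, 0, 2, 1]]  # true classes: 1, 0, 2, 1

acc = accuracy_from_probabilities(probs, y_onehot)
print(acc)
```

In this toy case three of the four predictions are correct, so the accuracy is 0.75.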