## Data Exploration
Before classifying anything, it helps to understand the data. The data comes from the game Quick Draw, where a user has 20 seconds to draw a doodle of an object while an algorithm tries to guess what they drew. The data is provided in two versions. The raw data contains the x and y pixel positions of the strokes and the time in milliseconds from the first point in the stroke. The simplified data contains pixel positions normalised to the range 0 to 255, with the time information removed. The dataset consists of 340 different classes. An example of the simplified data can be seen below; the code is taken from https://www.kaggle.com/jpmiller/image-based-cnn . Each value in the 'drawing' column is a list of strokes, and each stroke is a pair of lists: one for the x values and one for the corresponding y values. The length of an entry in 'drawing' therefore corresponds to the number of strokes used in that drawing. The other columns hold a country code, a key id, a flag for whether the drawing was recognized, a timestamp of when the image was created and the word that the user was trying to draw.
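
As a minimal sketch of the 'drawing' format described above, the snippet below parses one entry and pairs up the x and y values of each stroke. The string `drawing_str` is a made-up example, not a row from the dataset.

```python
import ast

# Hypothetical 'drawing' entry: two strokes, each a pair [x values, y values]
drawing_str = "[[[10, 50, 90], [20, 80, 20]], [[0, 255], [128, 128]]]"

strokes = ast.literal_eval(drawing_str)   # string -> nested Python lists
n_strokes = len(strokes)                  # one entry per stroke

for idx, (xs, ys) in enumerate(strokes):
    points = list(zip(xs, ys))            # pair up x and y into (x, y) points
    print(f"stroke {idx}: {len(points)} points, first point {points[0]}")
```

The CSV files store the column as text, which is why the cells below call `ast.literal_eval` before plotting or drawing.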



In [None]:
import os
import re
from glob import glob
from tqdm import tqdm
import numpy as np
import pandas as pd
import ast
import gc
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import matplotlib.cbook
warnings.filterwarnings("ignore",category=matplotlib.cbook.mplDeprecation)

In [None]:
fnames = glob('../input/train_simplified/*.csv')
cnames = ['countrycode', 'drawing', 'key_id', 'recognized', 'timestamp', 'word']
drawlist = []
for f in fnames[0:8]:
    first = pd.read_csv(f, nrows=40) # make sure we get a recognized drawing
    first = first[first.recognized==True].head(20)
    drawlist.append(first)
draw_df = pd.DataFrame(np.concatenate(drawlist), columns=cnames)
draw_df.head()

To see what the drawings look like, we plot four sample images from each of 8 classes.

In [None]:

labels = draw_df['word'].unique()
for label in labels:
    # parse the stroke strings once per class instead of once per subplot
    examples = [ast.literal_eval(pts) for pts in draw_df[draw_df['word']==label].drawing.values]
    plt.figure(figsize=(10,10))
    for ii in range(4):
        plt.subplot(2,2,ii+1)
        for x,y in examples[ii]:
            plt.plot(x,y,marker='.')
        plt.gca().invert_yaxis()  # image coordinates: y grows downwards
        plt.axis('off')
        plt.title(label)
    plt.show()

draw_df = None
del draw_df

Here we load the first 50 classes to look at the class distribution and how large each class is.

In [None]:
fnames = glob('../input/train_simplified/*.csv')
cnames = ['countrycode', 'drawing', 'key_id', 'recognized', 'timestamp', 'word']
drawlist = []
for f in fnames[0:50]:
    df = pd.read_csv(f) # read the whole class file; unrecognized drawings are filtered out below
    df = df[df.recognized==True]
    drawlist.append(len(df))
plt.hist(drawlist,bins=15)
plt.title('Histogram of number of images for first 50 classes.')
plt.ylabel('Number of classes')
plt.xlabel('Number of images')

df = None
drawlist = None
fnames = None
cnames = None
del df
del drawlist
del fnames
del cnames

gc.collect()

A quick glance at the distribution makes two things obvious. First, judging by the first 50 classes, most classes contain between 100,000 and 200,000 images. A dataset of that size will very quickly use up all the available memory. Second, the images are not equally distributed across the classes: some classes have three times as many images as others. This can pose a problem because a network can overfit to a few classes, since favouring the most represented classes automatically increases accuracy during training simply because there are more instances of them. Both problems can be solved simultaneously by limiting the number of images per class to the same value for all classes.
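
The capping idea can be sketched in a few lines of plain Python. This is only an illustration: `cap_per_class` and the toy `samples` list are hypothetical names, and the notebook itself achieves the same effect by reading a fixed number of rows per class file.

```python
from collections import Counter, defaultdict

def cap_per_class(samples, max_per_class):
    """Keep at most max_per_class samples for every label, preserving order."""
    kept, seen = [], defaultdict(int)
    for label, item in samples:
        if seen[label] < max_per_class:
            kept.append((label, item))
            seen[label] += 1
    return kept

# Hypothetical imbalanced data: 'cat' has three times as many samples as 'axe'
samples = [("cat", i) for i in range(9)] + [("axe", i) for i in range(3)]
balanced = cap_per_class(samples, max_per_class=3)
counts = Counter(label for label, _ in balanced)
print(counts)
```

With the data in a pandas DataFrame, `df.groupby('word').head(n)` would give a comparable result.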

# Choosing a network

A simple CNN will be used as a baseline. This model is chosen because of its simplicity and its robust performance on the MNIST database, a task that is similar to the one in this kernel. The model is from https://machinelearningmastery.com/handwritten-digit-recognition-using-convolutional-neural-networks-python-keras/ and is made up of:
1. Convolutional layer with 32 feature maps of size 5×5.
2. Pooling layer taking the max over 2×2 patches.
3. Dropout layer with a probability of 20%.
4. Flatten layer.
5. Fully connected layer with 128 neurons and rectifier activation.
6. Output layer.
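
To get a feel for the size of this baseline, the layer shapes can be worked out by hand. The sketch below assumes a 32x32 grayscale input (the image size used later in this kernel), a convolution without padding and non-overlapping pooling; the helper names are made up for the illustration.

```python
def conv_out(size, kernel, padding="valid"):
    """Spatial output size of a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool):
    """Spatial output size of non-overlapping max pooling (floor division)."""
    return size // pool

side = 32                       # assumed input: 32x32 grayscale image
side = conv_out(side, 5)        # 5x5 conv, no padding -> 28
side = pool_out(side, 2)        # 2x2 max pooling      -> 14
flat = side * side * 32         # 32 feature maps flattened

conv_params = (5 * 5 * 1 + 1) * 32      # weights + bias per filter
dense_params = (flat + 1) * 128         # fully connected layer with 128 neurons
print(side, flat, conv_params, dense_params)
```

Under these assumptions almost all of the parameters sit in the fully connected layer, not in the convolution.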
    
The second model is a small network based on the VGG-net structure. The original model can be found here https://www.pyimagesearch.com/2018/04/16/keras-and-convolutional-neural-networks-cnns/. The characteristic feature of VGG-net is that it stacks 3x3 convolutional filters, with more filters as we move further down the network. With more filters in the deeper layers, the network should learn more and more abstract features. The end of a VGG-net is usually a flatten layer followed by two fully connected layers.
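
One commonly cited reason for stacking 3x3 filters is that two stacked layers cover the same 5x5 area as a single larger filter while using fewer weights. The following sketch checks that arithmetic; the channel count of 64 is an assumption chosen only for the comparison.

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels_in, channels_out):
    """Number of weights (without biases) in one convolutional layer."""
    return kernel * kernel * channels_in * channels_out

channels = 64  # assumed channel count, kept constant for the comparison
stacked = 2 * conv_weights(3, channels, channels)   # two 3x3 layers
single = conv_weights(5, channels, channels)        # one 5x5 layer
print(receptive_field(2), stacked, single)
```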

# Initial testing
To get a rough idea of which structure to use, the following approach is taken. The small VGG-net is split into three models of growing complexity. The three models are tested together with the baseline using 5-fold cross validation. Since training on all the data takes quite some time, a small subset of 20 classes with 500 images each is used to get a rough idea of how the networks compare.
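
For readers unfamiliar with k-fold cross validation, here is a minimal sketch of how the indices are split. It is a plain-Python illustration, not the scikit-learn implementation used below, and it assumes the samples have already been shuffled and that all 20 x 500 images take part in the split.

```python
def kfold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs for contiguous, near-equal folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

n_samples = 20 * 500   # 20 classes with 500 images each
folds = list(kfold_indices(n_samples))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each model is trained five times, once per fold, and the mean and standard deviation of the five test scores are reported.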

The following code to load the data is from https://www.kaggle.com/jpmiller/image-based-cnn . It loads N images from each of the selected classes and converts the stroke information into an actual image of size 32x32.

In [None]:
#%% import
import os
from glob import glob
import re
import ast
import numpy as np 
import pandas as pd
from PIL import Image, ImageDraw 
from tqdm import tqdm
from dask import bag
import matplotlib.pyplot as plt
import keras
import gc
%matplotlib inline
## Import up sound alert dependencies
from IPython.display import Audio, display

def allDone():
  display(Audio(url='https://sound.peal.io/ps/audios/000/000/537/original/woo_vu_luvub_dub_dub.wav', autoplay=True))
## Insert whatever audio file you want above


In [None]:
#%% set label dictionary and params
classfiles = os.listdir('../input/train_simplified/')
numstonames = {i: v[:-4].replace(" ", "_") for i, v in enumerate(classfiles)} #adds underscores

num_classes = 20    #340 max 
imheight, imwidth = 32, 32  
ims_per_class = 500  # 1500 starts to crash the kernel often


In [None]:
# faster conversion function
def draw_it(strokes):
    image = Image.new("P", (256,256), color=255)
    image_draw = ImageDraw.Draw(image)
    for stroke in ast.literal_eval(strokes):
        for i in range(len(stroke[0])-1):
            image_draw.line([stroke[0][i], 
                             stroke[1][i],
                             stroke[0][i+1], 
                             stroke[1][i+1]],
                            fill=0, width=5)
    image = image.resize((imheight, imwidth))
    return np.array(image)/255.

#%% get train arrays
train_grand = []
class_paths = glob('../input/train_simplified/*.csv')
word_label = []
for i,c in enumerate(tqdm(class_paths[0: num_classes])):
    train = pd.read_csv(c, usecols=['drawing', 'recognized','word'], nrows=ims_per_class*5//4)
    train = train[train.recognized == True].head(ims_per_class)
    word_label_current = train['word'].replace(' ', '_', regex=True)
    imagebag = bag.from_sequence(train.drawing.values).map(draw_it) 
    trainarray = np.array(imagebag.compute())  # PARALLELIZE
    trainarray = np.reshape(trainarray, (ims_per_class, -1))    
    labelarray = np.full((train.shape[0], 1), i)
    trainarray = np.concatenate((labelarray, trainarray), axis=1)
    train_grand.append(trainarray)
    word_label.append(word_label_current.values)
    
word_label = np.ravel(word_label) # flatten the labels. It's a janky workaround
train_grand = np.array([train_grand.pop(0) for i in np.arange(num_classes)]) # pop(0) keeps the class order aligned with word_label; less memory than np.concatenate
train_grand = train_grand.reshape((-1, (imheight*imwidth+1)))
train_grand= np.append(train_grand, word_label[:,None],axis=1)

trainarray = None
train = None
del trainarray
del train
gc.collect()
allDone()

In [None]:
# memory-friendly alternative to train_test_split?
valfrac = 0.1
cutpt = int(valfrac * train_grand.shape[0])

np.random.shuffle(train_grand)
word_array = train_grand[:,-1]
train_grand = np.delete(train_grand,-1,axis=1)
train_grand = train_grand.astype(np.float64)
y_train, X_train = train_grand[cutpt: , 0], train_grand[cutpt: , 1:]
y_val, X_val = train_grand[0:cutpt, 0], train_grand[0:cutpt, 1:] #validation set is recognized==True
word_train, word_val = word_array[cutpt:], word_array[0:cutpt,]

del train_grand
y_train_full_label = y_train
y_train = keras.utils.to_categorical(y_train, num_classes)
X_train = X_train.reshape(X_train.shape[0], imheight, imwidth, 1)
y_val_full_label = y_val
y_val = keras.utils.to_categorical(y_val, num_classes)
X_val = X_val.reshape(X_val.shape[0], imheight, imwidth, 1)

print(y_train.shape, "\n",
      X_train.shape, "\n",
      y_val.shape, "\n",
      X_val.shape)

plt.imshow(X_train[0,:,:,0])
plt.title('Example from class number: {}'.format(numstonames.get([np.where(y_train[0]==1)][0][0][0])))
plt.axis('off')
plt.show()
allDone()

In [None]:
# Load NN libraries
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.core import Activation
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.normalization import BatchNormalization
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from sklearn.model_selection import GridSearchCV


In [None]:
def baseline_model(height = 32, width = 32, num_classes = 340):
    # create model
    model = Sequential()
    model.add(Conv2D(32, (5, 5), input_shape=(height, width, 1), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
def smallVGGnet1(height = 32, width = 32, num_classes = 340):
	# initialize the model along with the input shape to be
	# "channels last" and the channels dimension itself
	model = Sequential()
	depth = 1
	inputShape = (height, width, depth)
	chanDim = -1

	model.add(Conv2D(32, (3, 3), padding="same",
		input_shape=inputShape))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(3, 3)))
	model.add(Dropout(0.25))

	# first (and only) set of FC => RELU layers
	model.add(Flatten())
	model.add(Dense(1024))
	model.add(Activation("relu"))
	model.add(BatchNormalization())
	model.add(Dropout(0.5))

	# softmax classifier
	model.add(Dense(num_classes))
	model.add(Activation("softmax"))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	# return the constructed network architecture
	return model		

def smallVGGnet2(height = 32, width = 32, num_classes = 340, optimizer = 'adam',dropout_rate = 0.5):
	# initialize the model along with the input shape to be
	# "channels last" and the channels dimension itself
	model = Sequential()
	depth = 1
	inputShape = (height, width, depth)
	chanDim = -1

	model.add(Conv2D(32, (3, 3), padding="same",
		input_shape=inputShape))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(3, 3)))
	model.add(Dropout(0.25))
	# (CONV => RELU) * 2 => POOL
	model.add(Conv2D(64, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(Conv2D(64, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(2, 2)))
	model.add(Dropout(0.25))
    
	# first (and only) set of FC => RELU layers
	model.add(Flatten())
	model.add(Dense(1024))
	model.add(Activation("relu"))
	model.add(BatchNormalization())
	model.add(Dropout(dropout_rate))

	# softmax classifier
	model.add(Dense(num_classes))
	model.add(Activation("softmax"))
	model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
	# return the constructed network architecture
	return model		

def smallVGGnet3(height = 32, width = 32, num_classes = 340):
	# initialize the model along with the input shape to be
	# "channels last" and the channels dimension itself
	model = Sequential()
	depth = 1
	inputShape = (height, width, depth)
	chanDim = -1

	model.add(Conv2D(32, (3, 3), padding="same",
		input_shape=inputShape))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(3, 3)))
	model.add(Dropout(0.25))
	# (CONV => RELU) * 2 => POOL
	model.add(Conv2D(64, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(Conv2D(64, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(2, 2)))
	model.add(Dropout(0.25))
	# (CONV => RELU) * 2 => POOL
	model.add(Conv2D(128, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(Conv2D(128, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(2, 2)))
	model.add(Dropout(0.25))
	# first (and only) set of FC => RELU layers
	model.add(Flatten())
	model.add(Dense(1024))
	model.add(Activation("relu"))
	model.add(BatchNormalization())
	model.add(Dropout(0.5))
    
	# softmax classifier
	model.add(Dense(num_classes))
	model.add(Activation("softmax"))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	# return the constructed network architecture
	return model		

In [None]:
# Functions for plotting loss and accuracy
def plotAccuracy(History):
    '''
    Plots the accuracy of the training and validation of the model
    
    input:
        History = the history of the model.
    
    output:
        plt = the plt handle of the figure. Used for saving the plot
    '''
    fig1, ax_acc = plt.subplots()
    plt.plot(History.history['acc'])
    plt.plot(History.history['val_acc'])
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title('Model - Accuracy')
    plt.legend(['Training', 'Validation'], loc='lower right')
    return plt

def plotLoss(History):
    '''
    Plots the loss of the training and validation of the model
    
    input:
        History = the history of the model.
    
    output:
        plt = the plt handle of the figure. Used for saving the plot  
    '''
    fig2, ax_loss = plt.subplots()
    plt.plot(History.history['loss'])
    plt.plot(History.history['val_loss'])
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Model - Loss')
    # add the legend once, after plotting, so it picks up both curves
    plt.legend(['Training', 'Validation'], loc='upper right')
    return plt

def plotHistory(History):
    '''
    Plots the accuracy and the loss of the training and validation of the model
    
    input:
        History = the history of the model.
    
    output:
        plt_acc = the plt handle of the accuracy figure. Used for saving the plot
        plt_val = the plt handle of the loss figure. Used for saving the plot
    '''    
    plt_acc = plotAccuracy(History)
    plt_val = plotLoss(History)
    return plt_acc,plt_val

In [None]:
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from keras.wrappers.scikit_learn import KerasClassifier

#model = baseline_model(imheight,imwidth,num_classes)
_CV_EPOCH = 30
_CV_BATCH_SIZE = 100
'''
model = KerasClassifier(build_fn=baseline_model,height = imheight, width = imwidth, num_classes = num_classes, nb_epoch=_CV_EPOCH, batch_size=_CV_BATCH_SIZE, verbose=0)
scores = cross_val_score(model, X_train, y_train, cv=5)
print('Mean of baseline CNN %.2f with +- %.2f std' % (np.mean(scores),np.std(scores)))

model = KerasClassifier(build_fn=smallVGGnet1,height = imheight, width = imwidth, num_classes = num_classes, nb_epoch=_CV_EPOCH, batch_size=_CV_BATCH_SIZE, verbose=0)
scores = cross_val_score(model, X_train, y_train, cv=5)
print('Mean of 1 block small VGG %.2f with +- %.2f std' % (np.mean(scores),np.std(scores)))

model = KerasClassifier(build_fn=smallVGGnet2,height = imheight, width = imwidth, num_classes = num_classes, nb_epoch=_CV_EPOCH, batch_size=_CV_BATCH_SIZE, verbose=0)
scores = cross_val_score(model, X_train, y_train, cv=5)
print('Mean of 2 block small VGG %.2f with +- %.2f std' % (np.mean(scores),np.std(scores)))

model = KerasClassifier(build_fn=smallVGGnet3,height = imheight, width = imwidth, num_classes = num_classes, nb_epoch=_CV_EPOCH, batch_size=_CV_BATCH_SIZE, verbose=0)
scores = cross_val_score(model, X_train, y_train, cv=5)
print('Mean of all 3 block small VGG %.2f with +- %.2f std' % (np.mean(scores),np.std(scores)))

model = None
scores = None
del model
del scores
gc.collect()

allDone()
'''
#Mean of baseline CNN 0.39 with +- 0.02 std
#Mean of 1 block small VGG 0.50 with +- 0.06 std
#Mean of 2 block small VGG 0.57 with +- 0.03 std
#Mean of all 3 block small VGG 0.45 with +- 0.05 std

In [None]:
from sklearn.model_selection import GridSearchCV
'''
model = KerasClassifier(build_fn=smallVGGnet2,height = imheight, width = imwidth, num_classes = num_classes, nb_epoch=_CV_EPOCH, batch_size=_CV_BATCH_SIZE, verbose=0)

# define the grid search parameters
batch_size = [10, 20, 50, 100, 200, 400]
param_grid = dict(batch_size=batch_size)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

model = None
grid = None
grid_result = None
del model
del grid
del grid_result
gc.collect()

allDone()
'''
#Best: 0.576333 using {'batch_size': 10}
#0.576333 (0.031041) with: {'batch_size': 10}
#0.564222 (0.021351) with: {'batch_size': 20}
#0.567556 (0.015478) with: {'batch_size': 50}
#0.565556 (0.015110) with: {'batch_size': 100}
#0.551333 (0.010965) with: {'batch_size': 200}
#0.524556 (0.005724) with: {'batch_size': 400}


In [None]:
from sklearn.model_selection import GridSearchCV
'''
model = KerasClassifier(build_fn=smallVGGnet2,height = imheight, width = imwidth, num_classes = num_classes, nb_epoch=_CV_EPOCH, batch_size=100, verbose=0)

# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

del model
del grid
del grid_result
allDone()
'''
#Best: 0.579667 using {'optimizer': 'Adam'}
#0.354444 (0.020531) with: {'optimizer': 'SGD'}
#0.511444 (0.057758) with: {'optimizer': 'RMSprop'}
#0.553778 (0.023892) with: {'optimizer': 'Adagrad'}
#0.546222 (0.005814) with: {'optimizer': 'Adadelta'}
#0.579667 (0.019530) with: {'optimizer': 'Adam'}
#0.558111 (0.002183) with: {'optimizer': 'Adamax'}
#0.545444 (0.021730) with: {'optimizer': 'Nadam'}

In [None]:
from keras.preprocessing.image import ImageDataGenerator
'''
_GEN_EPOCH = 20
_GEN_BATCH_SIZE = 100
model_with_gen1 = smallVGGnet2(imheight,imwidth,num_classes)
model_with_gen2 = smallVGGnet2(imheight,imwidth,num_classes)

datagen = ImageDataGenerator(
    horizontal_flip=True)

# This was the other generator that was tried. It reduced the accuracy of about 5%.
datagen2 = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True) 

history_gen1 = model_with_gen1.fit_generator(datagen.flow(X_train, y_train, batch_size=_GEN_BATCH_SIZE),steps_per_epoch=len(X_train) / _GEN_BATCH_SIZE, epochs=_GEN_EPOCH,verbose = 0)
scores_gen1 = model_with_gen1.evaluate(X_val, y_val, verbose=0)

history_gen2 = model_with_gen2.fit_generator(datagen2.flow(X_train, y_train, batch_size=_GEN_BATCH_SIZE),steps_per_epoch=len(X_train) / _GEN_BATCH_SIZE, epochs=_GEN_EPOCH,verbose = 0)
scores_gen2 = model_with_gen2.evaluate(X_val, y_val, verbose=0)

model = smallVGGnet2(imheight,imwidth,num_classes)
history = model.fit(X_train, y_train, epochs=_GEN_EPOCH, batch_size=_GEN_BATCH_SIZE, verbose=0)
scores = model.evaluate(X_val, y_val, verbose=0)


print("VGG with flip only Accuracy: %.2f%%" % (scores_gen1[1]*100))
print("VGG with rotation, shift and flip Accuracy: %.2f%%" % (scores_gen2[1]*100))
print("VGG without augmentation Accuracy: %.2f%%" % (scores[1]*100))
allDone()
del model_with_gen1, model_with_gen2
del history_gen1, history_gen2
del scores_gen1, scores_gen2
del model
del history
'''
#VGG with flip only Accuracy: 82.60% 
#VGG with rotation, shift and flip Accuracy: 23.00% 
#VGG without augmentation Accuracy: 82.00%

In [None]:
# Updating the input to include variable dropout rates
def smallVGGnet2(height = 32, width = 32, num_classes = 340, optimizer = 'adam',dropout_rate_last = 0.5, dropout_rate_middle=0.25,init_mode = 'he_normal'):
	# initialize the model along with the input shape to be
	# "channels last" and the channels dimension itself
	model = Sequential()
	depth = 1
	inputShape = (height, width, depth)
	chanDim = -1

	model.add(Conv2D(32, (3, 3), padding="same",
		input_shape=inputShape,kernel_initializer=init_mode))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(3, 3)))
	model.add(Dropout(dropout_rate_middle))
	# (CONV => RELU) * 2 => POOL
	model.add(Conv2D(64, (3, 3), padding="same",kernel_initializer=init_mode))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(Conv2D(64, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(2, 2)))
	model.add(Dropout(dropout_rate_middle))
    
	# first (and only) set of FC => RELU layers
	model.add(Flatten())
	model.add(Dense(1024,kernel_initializer=init_mode))
	model.add(Activation("relu"))
	model.add(BatchNormalization())
	model.add(Dropout(dropout_rate_last))

	# softmax classifier
	model.add(Dense(num_classes,kernel_initializer=init_mode))
	model.add(Activation("softmax"))
	model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
	# return the constructed network architecture
	return model		

In [None]:
# Looking at the effect of dropout of the last layer
'''
_GRID_EPOCH = 20
_GRID_BATCH_SIZE = 100
model = KerasClassifier(build_fn=smallVGGnet2,height = imheight, width = imwidth,
                        num_classes = num_classes, nb_epoch=_GRID_EPOCH, batch_size=_GRID_BATCH_SIZE, verbose=0)
dropout_rate_last = [0.1, 0.3, 0.5, 0.7, 0.9]
param_grid = dict(dropout_rate_last=dropout_rate_last)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
del model
del grid
del grid_result
allDone()
'''
#Best: 0.601778 using {'dropout_rate_last': 0.3}
#0.589444 (0.014838) with: {'dropout_rate_last': 0.1}
#0.601778 (0.009445) with: {'dropout_rate_last': 0.3}
#0.580000 (0.009361) with: {'dropout_rate_last': 0.5}
#0.535000 (0.012223) with: {'dropout_rate_last': 0.7}
#0.428111 (0.026609) with: {'dropout_rate_last': 0.9}

In [None]:
# Looking at the effect of dropout of the layers after convolution
'''
_GRID_EPOCH = 20
_GRID_BATCH_SIZE = 100
model = KerasClassifier(build_fn=smallVGGnet2,height = imheight, width = imwidth,
                        num_classes = num_classes, nb_epoch=_GRID_EPOCH, batch_size=_GRID_BATCH_SIZE, verbose=0)
dropout_rate_middle = [0.1, 0.3, 0.5, 0.7, 0.9]
param_grid = dict(dropout_rate_middle=dropout_rate_middle)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

del model
del grid
del grid_result
allDone()
'''
#Best: 0.597444 using {'dropout_rate_middle': 0.1}
#0.597444 (0.005977) with: {'dropout_rate_middle': 0.1}
#0.552000 (0.009193) with: {'dropout_rate_middle': 0.3}
#0.428667 (0.027475) with: {'dropout_rate_middle': 0.5}
#0.199111 (0.021056) with: {'dropout_rate_middle': 0.7}
#0.049667 (0.006771) with: {'dropout_rate_middle': 0.9}


In [None]:
# Looking at the weight initialization
'''
_GRID_EPOCH = 20
_GRID_BATCH_SIZE = 100
model = KerasClassifier(build_fn=smallVGGnet2,height = imheight, width = imwidth, num_classes = num_classes, nb_epoch=_GRID_EPOCH, batch_size=_GRID_BATCH_SIZE, verbose=0)
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(init_mode=init_mode)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
allDone()
'''
#Best: 0.608667 using {'init_mode': 'lecun_uniform'}
#0.382444 (0.015945) with: {'init_mode': 'uniform'}
#0.608667 (0.018409) with: {'init_mode': 'lecun_uniform'}
#0.518778 (0.031582) with: {'init_mode': 'normal'}
#0.044333 (0.001785) with: {'init_mode': 'zero'}
#0.560000 (0.018799) with: {'init_mode': 'glorot_normal'}
#0.558444 (0.024736) with: {'init_mode': 'glorot_uniform'}
#0.578889 (0.014718) with: {'init_mode': 'he_normal'}
#0.569556 (0.009979) with: {'init_mode': 'he_uniform'}


In [None]:
# Fine search the dropout rates
'''
_GRID_EPOCH = 20
_GRID_BATCH_SIZE = 100

model = KerasClassifier(build_fn=smallVGGnet2,height = imheight, width = imwidth, num_classes = num_classes,
                        nb_epoch=_GRID_EPOCH, batch_size=_GRID_BATCH_SIZE, verbose=0)
dropout_rate_last = [0.3, 0.35, 0.40, 0.45, 0.5]
param_grid = dict(dropout_rate_last=dropout_rate_last)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



model = KerasClassifier(build_fn=smallVGGnet2,height = imheight, width = imwidth, num_classes = num_classes,
                        dropout_rate_last=0.45, nb_epoch=_GRID_EPOCH, batch_size=_GRID_BATCH_SIZE, verbose=0)
dropout_rate_middle = [0.1, 0.15, 0.20, 0.25, 0.3]
param_grid = dict(dropout_rate_middle=dropout_rate_middle)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

allDone()
del X_train
del y_train
del X_val
del y_val
'''
#Best: 0.585667 using {'dropout_rate_last': 0.45}
#0.556222 (0.021832) with: {'dropout_rate_last': 0.3}
#0.551667 (0.022283) with: {'dropout_rate_last': 0.35}
#0.561111 (0.016153) with: {'dropout_rate_last': 0.4}
#0.585667 (0.005361) with: {'dropout_rate_last': 0.45}
#0.572667 (0.025692) with: {'dropout_rate_last': 0.5}
#Best: 0.588778 using {'dropout_rate_middle': 0.1}
#0.588778 (0.021666) with: {'dropout_rate_middle': 0.1}
#0.586111 (0.011704) with: {'dropout_rate_middle': 0.15}
#0.586556 (0.005370) with: {'dropout_rate_middle': 0.2}
#0.570889 (0.016511) with: {'dropout_rate_middle': 0.25}
#0.556222 (0.020004) with: {'dropout_rate_middle': 0.3}

Cross validation takes up too much memory to run all at once together with the actual network. Therefore the code has been commented out, but the results are left at the end of each code cell. Uncomment the code to run it yourself. The data used to perform cross validation and grid search was a small subset containing 20 classes with 500 images each.

Out of the four models that were tested, the VGG-net with only 2 convolutional blocks performs best with an accuracy of around 57%, which is a bit surprising. The deepest VGG-net with 3 blocks achieved only 45% accuracy. All three VGG variants outperformed the baseline, which reached 39%. We will continue optimizing the small VGG-net that contains 2 blocks. We perform a grid search over different parameters to get an idea of what to tweak in the network. We will test the effect of batch size, dropout rate, initial weights and optimizer.
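
The comparison can be made a little more explicit by ranking the recorded scores. The numbers below are copied from the commented results in the cross validation cell; the overlap check is only a crude heuristic added for illustration, not a statistical test.

```python
# Mean accuracy and standard deviation as recorded in the notebook output
results = {
    "baseline CNN": (0.39, 0.02),
    "1 block small VGG": (0.50, 0.06),
    "2 block small VGG": (0.57, 0.03),
    "3 block small VGG": (0.45, 0.05),
}

ranked = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
best_name, (best_mean, best_std) = ranked[0]
runner_name, (runner_mean, runner_std) = ranked[1]

# Crude check: do the one-standard-deviation intervals of the top two overlap?
overlap = (best_mean - best_std) <= (runner_mean + runner_std)
for name, (mean, std) in ranked:
    print(f"{name}: {mean:.2f} +- {std:.2f}")
print("top two overlap:", overlap)
```

Since the intervals of the two best models overlap, the ranking between them should be read as an indication and not as a firm result.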

A batch size between 10 and 200 seems to work best. Since a batch size that is too small or too large can lead to over- or underfitting, it is probably a good idea to go with 100, as it is somewhere in the middle.
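
One thing that changes with the batch size is how many weight updates the network receives per epoch. The sketch below computes this for the batch sizes in the grid; `n_train = 6000` is an assumed number of training images per fold, chosen only to make the arithmetic concrete.

```python
import math

n_train = 6000  # assumed number of training images in one fold
batch_sizes = [10, 20, 50, 100, 200, 400]

# Number of weight updates in one epoch for each batch size
updates = {b: math.ceil(n_train / b) for b in batch_sizes}
for b, steps in updates.items():
    print(f"batch size {b:>3}: {steps} updates per epoch")
```

A batch size of 400 gives 40 times fewer updates per epoch than a batch size of 10, which may partly explain why the largest batch size scored lowest for a fixed number of epochs.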

There doesn't seem to be much difference between most of the optimizers, except SGD, which at 35% accuracy scored roughly 16 to 23 percentage points below the rest. Adam performs best, so we will go further with it.

We also looked at the effect of data augmentation on the accuracy. Adding shift, flip and rotation dropped the accuracy to 23%. Horizontal flip alone increased it by 0.6 percentage points (82.6% versus 82.0%), which is not a lot. It could be argued, however, that adding horizontal flip makes the network more robust, since people don't always draw an object facing the same way, so flipped versions occur naturally.
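
To illustrate what a horizontal flip does to a drawing, the sketch below mirrors the x values of the stroke data. Note that this is a stroke-level illustration under the assumption of coordinates in the range 0 to 255; the notebook itself flips the rendered images through `ImageDataGenerator`, and `flip_horizontal` is a made-up helper.

```python
def flip_horizontal(strokes, max_coord=255):
    """Mirror a drawing left-to-right; y values are left unchanged."""
    return [[[max_coord - x for x in xs], list(ys)] for xs, ys in strokes]

# Hypothetical drawing with one stroke going from the left edge to the right
drawing = [[[0, 100, 255], [10, 20, 30]]]
flipped = flip_horizontal(drawing)
print(flipped)
```

Flipping twice returns the original drawing, so the augmentation does not lose any information.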

We checked whether weight initialization plays a significant role in this network. There is a big difference between the initializations, with zero and uniform at the bottom at 4.4% and 38.2% accuracy respectively. The rest are closer to each other, ranging from 51.9% to 60.9%. lecun_uniform will be used as weight initialization since it scored the highest.
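
The scaled uniform initializers differ only in the range they sample from. As a sketch, the limits can be computed from the fan-in and fan-out of a layer using the formulas documented for Keras; the example layer (3x3 convolution, 1 input channel, 32 filters) is an assumption matching the first convolution of the small VGG-net.

```python
import math

def uniform_limits(fan_in, fan_out):
    """Sampling limit of each uniform initializer: weights ~ U(-limit, limit)."""
    return {
        "lecun_uniform": math.sqrt(3.0 / fan_in),
        "glorot_uniform": math.sqrt(6.0 / (fan_in + fan_out)),
        "he_uniform": math.sqrt(6.0 / fan_in),
    }

# Assumed layer: 3x3 convolution, 1 input channel, 32 filters
fan_in = 3 * 3 * 1
fan_out = 3 * 3 * 32
limits = uniform_limits(fan_in, fan_out)
for name, limit in limits.items():
    print(f"{name}: {limit:.4f}")
```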

For the dropout rate, we searched over the dropout after the convolutional blocks (the middle layers) and after the dense layer (the last layer). The best performance was found at a dropout rate of 0.3 to 0.5 for the last dense layer and between 0.1 and 0.3 for the dropout after the convolutions. A finer search of the values in those ranges showed that there wasn't much difference between them. Therefore choosing 0.5 for the last layer and 0.3 for the middle layers is the more conservative choice and should translate better when using all the classes.
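
As a reminder of what the dropout rate controls, here is a minimal sketch of inverted dropout in plain Python. It is an illustration of the general technique, not the Keras implementation, and the function name and test values are made up.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    and scale the survivors by 1 / (1 - rate) to keep the expected sum."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.5, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.2f}, mean after dropout: {mean_activation:.2f}")
```

A higher rate removes more of the signal during training, which fits the sharp drop in accuracy seen for rates of 0.7 and 0.9 in the grid search.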

The hyperparameters that performed best were the following:
* Optimizer = Adam
* Initial weights = lecun_uniform
* Batch size = 100
* Dropout_middle = 0.3
* Dropout_last = 0.5
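
A small sketch of how such a search space can be enumerated is shown below. The candidate values are illustrative assumptions (only some of them are mentioned in the text above, and `rmsprop` is added purely as an example), so this is not the exact grid that was run.

```python
from itertools import product

# candidate values are illustrative assumptions, not the exact grid that was run
param_grid = {
    "optimizer": ["sgd", "rmsprop", "adam"],
    "init_mode": ["zero", "uniform", "lecun_uniform"],
    "batch_size": [10, 100, 200],
    "dropout_middle": [0.1, 0.3],
    "dropout_last": [0.3, 0.5],
}

def iter_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

combinations = list(iter_grid(param_grid))
print("number of combinations:", len(combinations))

# the setting selected in the list above
chosen = {"optimizer": "adam", "init_mode": "lecun_uniform",
          "batch_size": 100, "dropout_middle": 0.3, "dropout_last": 0.5}
print("chosen setting is in the grid:", chosen in combinations)
```

Because a joint grid grows multiplicatively with every added parameter, searching one or two parameters at a time keeps the number of training runs manageable.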



# Final network
To see how it all comes together, the network is trained on the full dataset (all 340 classes). A small subset is split off as validation data to see how well it performs. During optimization, the only time the validation data was used was to see how data augmentation affected the accuracy.

In [None]:
#%% import
import os
from glob import glob
import re
import ast
import numpy as np 
import pandas as pd
from PIL import Image, ImageDraw 
from tqdm import tqdm
from dask import bag
import matplotlib.pyplot as plt
import keras
import gc

%matplotlib inline
## Set up sound alert dependencies
from IPython.display import Audio, display

def allDone():
  display(Audio(url='https://sound.peal.io/ps/audios/000/000/537/original/woo_vu_luvub_dub_dub.wav', autoplay=True))
## Insert whatever audio file you want above

# Load NN libraries
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.core import Activation
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.normalization import BatchNormalization
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.metrics import top_k_categorical_accuracy
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report
from keras.preprocessing.image import ImageDataGenerator
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)


In [None]:
#%% set label dictionary and params
classfiles = os.listdir('../input/train_simplified/')
numstonames = {i: v[:-4].replace(" ", "_") for i, v in enumerate(classfiles)} #adds underscores

num_classes = 340    #340 max 
imheight, imwidth = 32, 32  
ims_per_class = 500  # 1000 kills the kernel. 500 images are able to fit in the memory together with the rest of the code

# faster conversion function
def draw_it(strokes):
    image = Image.new("P", (256,256), color=255)
    image_draw = ImageDraw.Draw(image)
    for stroke in ast.literal_eval(strokes):
        for i in range(len(stroke[0])-1):
            image_draw.line([stroke[0][i], 
                             stroke[1][i],
                             stroke[0][i+1], 
                             stroke[1][i+1]],
                            fill=0, width=5)
    image = image.resize((imwidth, imheight))  # PIL expects (width, height)
    return np.array(image)/255.

# Workaround to load test data without crashing the kernel
test_df = pd.read_csv('../input/test_simplified.csv', index_col=['key_id'])
image_series=test_df['drawing']
test_array = []
for row in range(0,len(image_series)):
    #import pdb; pdb.set_trace()
    image = image_series.iloc[row]
    image = np.array(draw_it(image))
    image = np.reshape(image, image.shape + (1,))
    test_array.append(image)
test_df['image'] = test_array

del image_series, image
gc.collect()

#%% get train arrays
train_grand = []
class_paths = glob('../input/train_simplified/*.csv')
word_label = []
for i,c in enumerate(tqdm(class_paths[0: num_classes])):
    train = pd.read_csv(c, usecols=['drawing', 'recognized','word'], nrows=ims_per_class*5//4)
    train = train[train.recognized == True].head(ims_per_class)
    word_label_current = train['word'].replace(' ', '_', regex=True)
    imagebag = bag.from_sequence(train.drawing.values).map(draw_it) 
    trainarray = np.array(imagebag.compute())  # PARALLELIZE
    trainarray = np.reshape(trainarray, (ims_per_class, -1))    
    labelarray = np.full((train.shape[0], 1), i)
    trainarray = np.concatenate((labelarray, trainarray), axis=1)
    train_grand.append(trainarray)
    word_label.append(word_label_current.values)
    
word_label = np.ravel(word_label) # flatten the labels. It's a janky workaround
word_encoder = LabelEncoder()
word_encoder.fit(word_label)
train_grand = np.array([train_grand.pop() for i in np.arange(num_classes)]) #less memory than np.concatenate
train_grand = train_grand.reshape((-1, (imheight*imwidth+1)))
train_grand= np.append(train_grand, word_label[:,None],axis=1)

del trainarray
del train

# memory-friendly alternative to train_test_split?
valfrac = 0.1
cutpt = int(valfrac * train_grand.shape[0])

np.random.shuffle(train_grand)
word_array = train_grand[:,-1]
train_grand = np.delete(train_grand,-1,axis=1)
train_grand = train_grand.astype(np.float64)
y_train, X_train = train_grand[cutpt: , 0], train_grand[cutpt: , 1:]
y_val, X_val = train_grand[0:cutpt, 0], train_grand[0:cutpt, 1:] #validation set is recognized==True
word_train, word_val = word_array[cutpt:], word_array[0:cutpt,]

del train_grand
y_train_full_label = y_train
y_train = keras.utils.to_categorical(y_train, num_classes)
X_train = X_train.reshape(X_train.shape[0], imheight, imwidth, 1)
y_val_full_label = y_val
y_val = keras.utils.to_categorical(y_val, num_classes)
X_val = X_val.reshape(X_val.shape[0], imheight, imwidth, 1)

print(y_train.shape, "\n",
      X_train.shape, "\n",
      y_val.shape, "\n",
      X_val.shape)

plt.imshow(X_train[0,:,:,0])
plt.title('Example from class number: {}'.format(numstonames.get(np.where(y_train[0]==1)[0][0])))
plt.axis('off')

plt.show()
allDone()

In [None]:
def top_3_accuracy(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=3)

def smallVGGnet2(height = 32, width = 32, num_classes = 340, optimizer = 'adam',dropout_rate_last = 0.35, dropout_rate_middle=0.25, init_mode = 'lecun_uniform'):
	# initialize the model along with the input shape to be
	# "channels last" and the channels dimension itself
	model = Sequential()
	depth = 1
	inputShape = (height, width, depth)
	chanDim = -1

	model.add(Conv2D(32, (3, 3), padding="same",
		input_shape=inputShape,kernel_initializer=init_mode))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(3, 3)))
	model.add(Dropout(dropout_rate_middle))
	# (CONV => RELU) * 2 => POOL
	model.add(Conv2D(64, (3, 3), padding="same",kernel_initializer=init_mode))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(Conv2D(64, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(2, 2)))
	model.add(Dropout(dropout_rate_middle))
	# (CONV => RELU) * 2 => POOL
	model.add(Conv2D(128, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(Conv2D(128, (3, 3), padding="same"))
	model.add(Activation("relu"))
	model.add(BatchNormalization(axis=chanDim))
	model.add(MaxPooling2D(pool_size=(2, 2)))
	model.add(Dropout(dropout_rate_middle))    
	# first (and only) set of FC => RELU layers
	model.add(Flatten())
	model.add(Dense(1024,kernel_initializer=init_mode))
	model.add(Activation("relu"))
	model.add(BatchNormalization())
	model.add(Dropout(dropout_rate_last))

	# softmax classifier
	model.add(Dense(num_classes,kernel_initializer=init_mode))
	model.add(Activation("softmax"))
	model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy', top_3_accuracy])
	# return the constructed network architecture
	return model		

# Functions for plotting loss and accuracy
def plotAccuracy(History):
    '''
    Plots the accuracy of the training and validation of the model
    
    input:
        History = the history of the model.
    
    output:
        plt = the plt handle of the figure. Used for saving the plot
    '''
    fig1, ax_acc = plt.subplots()
    plt.plot(History.history['acc'])
    plt.plot(History.history['val_acc'])
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title('Model - Accuracy')
    plt.legend(['Training', 'Validation'], loc='lower right')
    return plt

def plotLoss(History):
    '''
    Plots the loss of the training and validation of the model
    
    input:
        History = the history of the model.
    
    output:
        plt = the plt handle of the figure. Used for saving the plot  
    '''
    fig2, ax_loss = plt.subplots()
    plt.plot(History.history['loss'])
    plt.plot(History.history['val_loss'])
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Model - Loss')
    # add the legend after plotting so it picks up both curves
    plt.legend(['Training', 'Validation'], loc='upper right')
    return plt

def plotHistory(History):
    '''
    Plots the accuracy and the loss for the training and validation of the model
    
    input:
        History = the history of the model.
    
    output:
        plt_acc = the plt handle of the accuracy figure. Used for saving the plot
        plt_val = the plt handle of the loss figure. Used for saving the plot
    '''    
    plt_acc = plotAccuracy(History)
    plt_val = plotLoss(History)
    return plt_acc,plt_val



In [None]:
_EPOCH = 50
_BATCH_SIZE = 100
_WEIGHT_INI = 'lecun_uniform' 
_DROPOUT_RATE_LAST = 0.5
_DROPOUT_RATE_MIDDLE = 0.3

model = smallVGGnet2(height = imheight, width = imwidth, num_classes = num_classes,
                     dropout_rate_last=_DROPOUT_RATE_LAST,dropout_rate_middle=_DROPOUT_RATE_MIDDLE,
                    init_mode = _WEIGHT_INI)
datagen = ImageDataGenerator(horizontal_flip=True)
#datagen = ImageDataGenerator()

earlystop = EarlyStopping(monitor='val_top_3_accuracy', mode='auto', patience=5) 
callbacks = [earlystop]

history = model.fit_generator(datagen.flow(X_train, y_train, batch_size=_BATCH_SIZE),
                              validation_data=(X_val,y_val),
                              callbacks=callbacks,
                              steps_per_epoch=len(X_train) // _BATCH_SIZE,  # whole number of batches per epoch
                              epochs=_EPOCH,verbose = 0)
#history = model.fit(X_train, y_train, batch_size=_BATCH_SIZE,
#                   validation_data=(X_val,y_val),
#                   callbacks=callbacks,
#                   epochs=_EPOCH,verbose = 0)
scores = model.evaluate(X_val, y_val, verbose=0)

print("Optimized VGG Accuracy on validation set: %.2f%%, Top 3 Accuracy: %.2f%%" % (scores[1]*100,scores[2]*100))

plotHistory(history)
allDone()

In [None]:
from io import StringIO
import re
def report_to_df(report):
    report = re.sub(r" +", " ", report).replace("avg / total", "avg/total").replace("\n ", "\n")
    report_df = pd.read_csv(StringIO("Classes" + report), sep=' ', index_col=0)        
    return(report_df)


test_cat = np.argmax(y_val, 1)
pred_y = model.predict(X_val)
pred_cat = np.argmax(pred_y, 1)
#plt.matshow(confusion_matrix(test_cat, pred_cat))
temp = classification_report(test_cat, pred_cat, 
                           target_names = [x for x in numstonames.values()])
classification_df = report_to_df(temp)

n_best_classes = classification_df.nlargest(10,'f1-score').reset_index()
n_worst_classes = classification_df.nsmallest(10,'f1-score').reset_index()
print(n_best_classes)
print(n_worst_classes)
worst_classes_labels = []
for class_number, class_word in numstonames.items():    
    if class_word in n_worst_classes['Classes'].values:
        worst_classes_labels.append(class_number)

plt.figure(figsize=(10,10))
for x,_class in enumerate(worst_classes_labels[0:5]):
    plt.subplot(3,2,x+1)
    worst_class_array_loc = np.where(test_cat==_class)
    wrong_prediction = np.where(test_cat[worst_class_array_loc]!=pred_cat[worst_class_array_loc])
    wrong_img_loc = wrong_prediction[0][0] # first wrong prediction
    plt.imshow(X_val[worst_class_array_loc][wrong_img_loc,:,:,0])
    plt.axis('off')
    plt.title('Real class: {}\n Predicted class: {}'.format(numstonames.get(test_cat[worst_class_array_loc][wrong_img_loc]),
                                                            numstonames.get(pred_cat[worst_class_array_loc][wrong_img_loc])))
plt.show()

best_classes_labels = []
for class_number, class_word in numstonames.items():   
    if class_word in n_best_classes['Classes'].values:
        best_classes_labels.append(class_number)
plt.figure(figsize=(10,10))
for x,_class in enumerate(best_classes_labels[0:5]):
    plt.subplot(3,2,x+1)
    best_class_array_loc = np.where(test_cat==_class)
    right_prediction = np.where(test_cat[best_class_array_loc]==pred_cat[best_class_array_loc])
    right_img_loc = right_prediction[0][0] # first right prediction
    #print(right_img_loc)
    plt.imshow(X_val[best_class_array_loc][right_img_loc,:,:,0])
    plt.axis('off')
    plt.title('Real class: {}\n Predicted class: {}'.format(numstonames.get(test_cat[best_class_array_loc][right_img_loc]),
                                                            numstonames.get(pred_cat[best_class_array_loc][right_img_loc])))    
    
plt.show()


In [None]:

X_train = None
y_train = None
X_val = None
y_val = None
gc.collect()


In [None]:
ttvlist = []
test_array = test_df['image']
for test_image in test_array:
    test_image = np.expand_dims(test_image, axis=0)
    testpreds = model.predict(test_image, verbose=0)
    ttvs = np.argsort(-testpreds)[:, 0:3]  # top 3
    ttvlist.append(ttvs)
    
ttvarray = np.concatenate(ttvlist)
allDone()

In [None]:
preds_df = pd.DataFrame({'first': ttvarray[:,0], 'second': ttvarray[:,1], 'third': ttvarray[:,2]})
preds_df = preds_df.replace(numstonames)
preds_df['words'] = preds_df['first'] + " " + preds_df['second'] + " " + preds_df['third']

sub = pd.read_csv('../input/sample_submission.csv', index_col=['key_id'])
sub['word'] = preds_df.words.values
sub.to_csv('small_VGG.csv')
print(sub.head())
allDone()

The optimized small VGG network with 2 CNN blocks has an accuracy of 57.69% when using all 340 classes with 500 images per class, of which 10% were used for validation. The top 3 accuracy is 77.98%; this is the relevant figure because the competition scores the top 3 predictions per drawing. We can use the F1-score to measure how well the network classifies a given class, since F1 combines both precision and recall. Looking at the 10 worst-classified classes and at examples where the network fails to classify correctly, we can see that the mistakes it makes are reasonable, like identifying a bear as a mouse or a marker as a candle. Those can be hard to classify even for a human. The 10 best classes, on the other hand, seem rather distinctive, and the examples of correct classification look easy to classify with the human eye.
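
To spell out why F1 reflects both kinds of error, the sketch below computes precision, recall and F1 for a single class from lists of true and predicted labels. The label lists are invented for the illustration (they are not taken from the validation set), and the helper `f1_for_class` is a hypothetical stand-in for what `classification_report` computes per class.

```python
def f1_for_class(y_true, y_pred, cls):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# made-up labels: two bears are mistaken for mice
y_true = ["bear", "bear", "bear", "mouse", "mouse", "candle"]
y_pred = ["bear", "mouse", "mouse", "mouse", "mouse", "candle"]

results = {cls: f1_for_class(y_true, y_pred, cls) for cls in sorted(set(y_true))}
for cls, (precision, recall, f1) in results.items():
    print("%-7s precision=%.2f recall=%.2f f1=%.2f" % (cls, precision, recall, f1))
```

In this toy case "bear" has perfect precision but low recall, while "mouse" has perfect recall but low precision; F1 penalizes both.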

To see how well the network performs on the test set provided by Kaggle, we submit the top 3 predictions for the test set. We get a top 3 accuracy of around 59%, which is substantially lower than on the validation set. There can be multiple reasons for this. The network is overfitting to the training data, which can be seen in the accuracy and loss plots: there is around a 10% difference between train and validation accuracy. But this doesn't explain why the top 3 accuracy on the test set is around 18% lower than on the validation set. Since we don't know the distribution of classes in the test set, one explanation could be that it contains more images from classes that the network struggles to classify. This is probably unlikely, but it cannot really be investigated as long as the test set distribution is unknown. Another possible factor is that the training and validation images were all filtered to recognized drawings, which need not hold for the test set. Out of curiosity, the VGG with 3 CNN blocks was also run on the test data, with the same hyperparameters as before. It actually performed better, with an accuracy of 62% on the test data. It looks like, even though it performed badly with only 20 classes, the deeper network learned better features when extended to 340 classes.
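
One thing that may be worth checking when comparing the validation and leaderboard numbers is the metric itself. Assuming the leaderboard score is MAP@3 (mean average precision at 3), as the competition's evaluation page describes, a correct class ranked second or third earns only 1/2 or 1/3 of a point, so MAP@3 can never exceed plain top 3 accuracy. The sketch below compares the two on made-up predictions; the function names and data are assumptions for the illustration.

```python
def top3_accuracy(y_true, ranked_preds):
    """Fraction of samples whose true label is among the first 3 guesses."""
    hits = sum(1 for t, preds in zip(y_true, ranked_preds) if t in preds[:3])
    return hits / len(y_true)

def map_at_3(y_true, ranked_preds):
    """Mean of 1/rank of the true label within the first 3 guesses (0 if absent)."""
    total = 0.0
    for t, preds in zip(y_true, ranked_preds):
        for rank, p in enumerate(preds[:3], start=1):
            if p == t:
                total += 1.0 / rank
                break
    return total / len(y_true)

# hypothetical ranked predictions for four drawings
y_true = ["bear", "marker", "cat", "dog"]
ranked_preds = [
    ["bear", "mouse", "dog"],     # correct at rank 1
    ["candle", "marker", "pen"],  # correct at rank 2
    ["dog", "lion", "cat"],       # correct at rank 3
    ["cat", "bear", "mouse"],     # not in the top 3
]

acc = top3_accuracy(y_true, ranked_preds)
score = map_at_3(y_true, ranked_preds)
print("top 3 accuracy: %.3f" % acc)
print("MAP@3:          %.3f" % score)
```

Computing MAP@3 on the validation predictions as well would give a number that is directly comparable to the leaderboard score.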

# Experiences with this Kaggle Competition
It took some time to figure out how a Kaggle kernel actually works, because a lot of things are implied or it is assumed that the user already knows them. But once your kernel is up, it is easy to prototype in it. One of the pros of Kaggle is that there are a lot of kernels available, so you can copy some of the trivial stuff like loading the data or submitting the predictions. The biggest pro is without a doubt that you have very easy access to a good GPU. The biggest con of Kaggle is that, for me, the kernel died quite often, making me lose my work and forcing me to start over. I tried different ways to clear the RAM (deleting variables, setting them to None and using the garbage collector), but I could not clear it. That is why I had to comment out my grid searches to be able to commit my kernel. Jupyter also lacks a variable explorer, a debugging tool and other small things that an IDE like Spyder has, which make the workflow much easier. Overall I found the Kaggle platform and competition an interesting experience, and I could see myself doing competitions as a side project after my thesis.
