> **Problem overview**

"Quick, Draw!" was released as an experimental game to educate the public in a playful way about how AI works. The game prompts users to draw an image depicting a certain category, such as “banana,” “table,” etc. The game generated more than 1B drawings, of which a subset was publicly released as the basis for this competition’s training set. That subset contains 50M drawings encompassing 340 label categories.

Sounds fun, right? Here's the challenge: since the training data comes from the game itself, drawings can be incomplete or may not match the label. You’ll need to build a recognizer that can effectively learn from this noisy data and perform well on a manually-labeled test set from a different distribution.

Your task is to build a better classifier for the existing Quick, Draw! dataset. By advancing models on this dataset, Kagglers can improve pattern recognition solutions more broadly. This will have an immediate impact on handwriting recognition and its robust applications in areas including OCR (Optical Character Recognition), ASR (Automatic Speech Recognition) & NLP (Natural Language Processing).

In [None]:
# import python standard library
import ast, os

# import data manipulation library
import numpy as np
import pandas as pd

# import data visualization and progress bar libraries
import matplotlib.pyplot as plt
from tqdm import tqdm

# import image processing library
import cv2

# import tensorflow model class
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import load_model, Sequential

# import sklearn model selection
from sklearn.model_selection import train_test_split

# import tensorflow model evaluation classification metrics
from tensorflow.keras.metrics import top_k_categorical_accuracy

> **Acquiring training and testing data**

We start by acquiring the training and testing datasets into Pandas DataFrames.

In [None]:
# acquiring training and testing data
df_train = pd.concat([pd.read_csv('../input/train_simplified/' + file, nrows=4) for file in os.listdir('../input/train_simplified')], ignore_index=True)
df_test = pd.read_csv('../input/test_simplified.csv', nrows=2)

In [None]:
# visualize head of the training data
df_train.head(n=5)

In [None]:
# visualize tail of the testing data
df_test.tail(n=5)

In [None]:
# dataframe columns name
names = ['countrycode', 'drawing', 'key_id', 'recognized', 'timestamp', 'word']

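# class names and dictionary: clean the names first, then sort, so the indices
# follow the same order as the one-hot 'word_*' columns built by pd.get_dummies
# (sorting raw file names would place 'house plant.csv' before 'house.csv')
class_names = sorted([file[:-4].lower().replace(" ", "_") for file in os.listdir('../input/train_simplified/')])
class_dict = {name: i for i, name in enumerate(class_names)}
classreverse_dict = {v: k for k, v in class_dict.items()}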

# combine training and testing dataframe
df_train['datatype'], df_test['datatype'] = 'training', 'testing'
df_train = df_train[['key_id', 'countrycode', 'drawing', 'datatype', 'word', 'recognized']]
df_test['word'], df_test['recognized'] = '', True
df_data = pd.concat([df_train, df_test], ignore_index=True)
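
One subtlety when a class index is derived from file names: sorting the raw names and sorting the cleaned labels do not always give the same order, because a space sorts before the `.` of the extension. The sketch below illustrates this with a few hypothetical file names (the list is an assumption for illustration; only the naming pattern follows the dataset).

```python
# Hypothetical file names; only the naming pattern mirrors the training folder.
files = ["house.csv", "house plant.csv", "hot dog.csv"]

# Order obtained by sorting raw file names and cleaning afterwards:
# the space in "house plant.csv" sorts before the "." in "house.csv".
by_filename = [f[:-4].replace(" ", "_") for f in sorted(files)]

# Order obtained by cleaning first and sorting afterwards, which matches
# the alphabetical order of one-hot columns such as word_house, word_house_plant.
by_label = sorted(f[:-4].replace(" ", "_") for f in files)

print(by_filename)
print(by_label)
```

If the two orders differ, the index used to decode predictions should follow the order of the one-hot label columns.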

In [None]:
# data dimensions
img_size = 32
num_channels = 1
num_classes = 340

# flat dimensions
img_size_flat = img_size * img_size * num_channels

> **Feature exploration, engineering and cleansing**

Here we explore a sample of the drawings, both as raw strokes and as rasterized pixel images, and generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset’s distribution.

In [None]:
# drawplot: plot each drawing stroke by stroke on a grid of axes
def drawplot(drawing = None, label = None, ncols = 5, nrows = 3):
    fig, axes = plt.subplots(figsize=(4*ncols , 3*nrows), ncols=ncols, nrows=nrows)
    axes = axes.flatten()
    for i in label.index:
        ax = axes[i - label.index[0]]
        # each stroke is [x_coordinates, y_coordinates]
        for stroke in drawing[i]: _ = ax.plot(stroke[0], stroke[1])
        # image coordinates have their origin at the top-left corner
        ax.invert_yaxis()
        ax.set_title(label[i])

In [None]:
# imageplot: show the rasterized pixel images on a grid of axes
def imageplot(pixel = None, label = None, ncols = 5, nrows = 3):
    fig, axes = plt.subplots(figsize=(4*ncols , 3*nrows), ncols=ncols, nrows=nrows)
    axes = axes.flatten()
    for i in range(len(label)): axes[i].imshow(pixel[i].reshape(img_size, img_size), interpolation='spline16')

In [None]:
# feature exploration, engineering and cleansing
def feature_extraction(df_data):
    # feature extraction: drawing (parse the stroke string into nested lists)
    df_data['drawing'] = df_data['drawing'].apply(ast.literal_eval)
    
    # feature extraction: word
    df_data['word'] = df_data['word'].apply(lambda x: x.lower().replace(' ', '_'))
    
    # feature extraction: drawing to pixel
    def drawing2pixel(drawing = None):
        # draw every stroke segment on a 256x256 canvas, then shrink it to img_size x img_size
        image = np.zeros((256, 256))
        for stroke in drawing:
            for i in range(len(stroke[0])-1):
                _ = cv2.line(image, (stroke[0][i], stroke[1][i]), (stroke[0][i + 1], stroke[1][i + 1]), color=1, thickness=5)
        return cv2.resize(image, (img_size, img_size))
    df_data['pixel'] = df_data['drawing'].apply(drawing2pixel)
    
    return df_data
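
To make the stroke-to-pixel step more concrete, here is a minimal NumPy-only sketch of the same idea. It is not the `cv2` pipeline used above: it draws one-pixel-wide lines by sampling points along each segment and shrinks the canvas by block averaging instead of `cv2.resize`. The drawing string and the `rasterize` helper are hypothetical and exist only for illustration.

```python
import ast
import numpy as np

# Hypothetical drawing in the same format as the 'drawing' column:
# a list of strokes, each stroke being [x_coordinates, y_coordinates].
raw = "[[[10, 200], [10, 200]], [[200, 10], [10, 200]]]"
strokes = ast.literal_eval(raw)

def rasterize(strokes, canvas=256, size=32):
    """Draw strokes on a canvas x canvas grid, then block-average down to size x size."""
    image = np.zeros((canvas, canvas))
    for xs, ys in strokes:
        for i in range(len(xs) - 1):
            # Sample enough points along the segment to leave no gaps.
            n = max(abs(xs[i + 1] - xs[i]), abs(ys[i + 1] - ys[i])) + 1
            px = np.linspace(xs[i], xs[i + 1], n).round().astype(int)
            py = np.linspace(ys[i], ys[i + 1], n).round().astype(int)
            image[py, px] = 1.0  # rows are y, columns are x
    block = canvas // size
    # Average each block x block tile into a single output pixel.
    return image.reshape(size, block, size, block).mean(axis=(1, 3))

pixel = rasterize(strokes)
print(pixel.shape, pixel.max())
```

Thin lines lose most of their intensity when averaged down, which may be one reason to draw thick strokes before resizing, as the notebook does with `thickness=5`.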

In [None]:
# feature extraction: step 1
df_data = feature_extraction(df_data)

In [None]:
# feature exploration: image
drawplot(df_data.loc[:19, 'drawing'], df_data.loc[:19, 'word'], nrows=4)

In [None]:
# feature exploration: image
imageplot(df_data.loc[:19, 'pixel'], df_data.loc[:19, 'word'], nrows=4)

After extracting all features, we need to convert the categorical features to numeric features, a format suitable for feeding into our Machine Learning models.
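
As a small illustration of this conversion, the sketch below one-hot encodes a toy `word` column with `pd.get_dummies`. The toy DataFrame is made up for the example; the `dtype` argument is an optional choice here, not something the notebook sets.

```python
import pandas as pd

# Toy frame with a categorical 'word' column (values are illustrative only).
toy = pd.DataFrame({"key_id": [1, 2, 3], "word": ["cat", "apple", "cat"]})

# Each distinct word becomes its own 0/1 column, named word_<value>,
# with the new columns ordered alphabetically by value.
encoded = pd.get_dummies(toy, columns=["word"], dtype="float32")

print(encoded.columns.tolist())
print(encoded)
```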

In [None]:
# feature exploration, engineering and cleansing
def feature_extraction2(df_data):
    # feature extraction: remove countrycode and drawing
    df_data = df_data.drop(['countrycode', 'drawing'], axis=1)
    
    # convert category codes for data dataframe
    df_data = pd.get_dummies(df_data, columns=['datatype', 'word'], drop_first=False)
    
    return df_data

In [None]:
# feature extraction: step 2
df_data = feature_extraction2(df_data)

In [None]:
# describe data dataframe
df_data.describe(include='all')

In [None]:
# verify dtypes object
df_data.info()

In [None]:
# memory clean-up
del df_data, df_train, df_test

In [None]:
# acquiring training data
list_of_data = []
for row in tqdm(range(1, 512, 64)):
    # acquiring training data
    df_data = pd.concat([pd.read_csv('../input/train_simplified/' + file, names=names, nrows=64, skiprows=row) for file in os.listdir('../input/train_simplified')], ignore_index=True)
    
    # combine training dataframe
    df_data['datatype'] = 'training'
    df_data = df_data[['key_id', 'countrycode', 'drawing', 'datatype', 'word', 'recognized']]
    
    # feature extraction: step 1
    df_data = feature_extraction(df_data)
    
    # feature extraction: step 2
    df_data = feature_extraction2(df_data)
    
    # feature extraction: append data dataframe
    list_of_data.append(df_data)
df_data = pd.concat(list_of_data, ignore_index=True)
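
The loop above reads each class file in windows by combining `skiprows` and `nrows`. A minimal sketch of that pattern on an in-memory CSV follows; the column names and values are hypothetical, and `io.StringIO` merely stands in for a file on disk.

```python
import io
import pandas as pd

# In-memory CSV with a header line and 10 data rows (illustrative data only).
csv_text = "word,value\n" + "\n".join(f"w{i},{i}" for i in range(10))
names = ["word", "value"]

chunks = []
for start in range(1, 10, 4):
    # skiprows=start skips the header plus the rows already read;
    # names supplies the column names because the header line is skipped.
    chunk = pd.read_csv(io.StringIO(csv_text), names=names, skiprows=start, nrows=4)
    chunks.append(chunk)

data = pd.concat(chunks, ignore_index=True)
print(data)
```

Starting the range at 1 is what skips the header line in the first window; the last window simply returns fewer rows when the file runs out.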

> **Model, predict and solve the problem**

Now, it is time to feed the features to Machine Learning models.

In [None]:
# select all features
x = np.zeros((df_data.shape[0], img_size, img_size, 1))
for i, df_row in df_data.iterrows(): x[i, :, :, 0] = df_row['pixel']
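# one-hot labels as a float array (newer pandas versions return boolean dummy columns)
y = df_data[[col for col in df_data if col.startswith('word')]].to_numpy(dtype='float32')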
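
A possible vectorized alternative to filling the input array row by row is to stack the per-image arrays in one call. The sketch below uses a plain list of small arrays as a stand-in; in the notebook the list would presumably come from something like `df_data['pixel'].tolist()`, which is an assumption rather than part of the original code.

```python
import numpy as np

img_size = 4  # small hypothetical size to keep the example readable

# Stand-in for the list of per-image 2-D arrays.
pixels = [np.full((img_size, img_size), v) for v in (0.0, 0.5, 1.0)]

# Stack to (n_images, img_size, img_size), then add a trailing channel axis.
x = np.stack(pixels)[..., np.newaxis]

print(x.shape)
```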

In [None]:
# perform train-test (validate) split
x_train, x_validate, y_train, y_validate = train_test_split(x, y, random_state=58, test_size=0.10)

In [None]:
# memory clean-up
del df_data, x, y

Keras assembles the TensorFlow graph for us. Conceptually, that graph consists of the following parts:

* Input tensors (placeholder variables in TensorFlow 1.x) used for feeding data to the graph.
* Variables (the layer weights and biases) that are optimized to make the convolutional network perform better.
* The mathematical formulas for the convolutional network, defined by the stacked layers.
* A loss measure that guides the optimization of the variables.
* An optimization method that updates the variables.
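
The five parts listed above can be sketched without any deep learning framework. The example below is a toy softmax classifier in NumPy, not the convolutional network of this notebook; the data, sizes, learning rate and plain gradient descent are all assumptions chosen to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(58)

# 1. Inputs: a toy batch of 30 samples with 4 features and 3 one-hot classes.
x = rng.normal(size=(30, 4))
y = np.eye(3)[rng.integers(0, 3, size=30)]

# 2. Variables to optimize: weights and biases.
w = np.zeros((4, 3))
b = np.zeros(3)

def forward(x, w, b):
    # 3. Model formula: a linear layer followed by softmax.
    logits = x @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def loss_fn(p, y):
    # 4. Loss: mean categorical cross-entropy.
    return -np.mean(np.sum(y * np.log(p + 1e-12), axis=1))

losses = []
for step in range(50):
    p = forward(x, w, b)
    losses.append(loss_fn(p, y))
    # 5. Optimizer: one step of plain gradient descent.
    grad = (p - y) / len(x)
    w -= 0.5 * (x.T @ grad)
    b -= 0.5 * grad.sum(axis=0)

print(losses[0], losses[-1])
```

In Keras the same roles are played by the input shape, the layer weights, the layer stack, the `loss` argument and the `optimizer` argument of `compile`.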

In [None]:
# top_3_categorical_accuracy function
def top_3_categorical_accuracy(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=3)
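
To clarify what this metric measures, here is a small NumPy sketch of top-k accuracy on hand-made predictions. The helper `top_k_accuracy` and the toy arrays are hypothetical; the sketch mirrors the idea of the Keras metric, not its implementation.

```python
import numpy as np

def top_k_accuracy(y_true, y_pred, k=3):
    """Fraction of rows whose true class is among the k highest-scoring classes."""
    top_k = np.argsort(-y_pred, axis=1)[:, :k]
    true_class = y_true.argmax(axis=1)
    hits = (top_k == true_class[:, None]).any(axis=1)
    return hits.mean()

# One-hot labels for true classes 0, 1 and 3 (4 classes in total).
y_true = np.eye(4)[[0, 1, 3]]
y_pred = np.array([
    [0.1, 0.2, 0.3, 0.4],  # true class 0 has the lowest score: miss
    [0.1, 0.6, 0.2, 0.1],  # true class 1 has the highest score: hit
    [0.3, 0.4, 0.2, 0.1],  # true class 3 has the lowest score: miss
])

print(top_k_accuracy(y_true, y_pred, k=3))
```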

In [None]:
# keras model setup
model_keras = Sequential()
model_keras.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), padding='same', activation='relu', input_shape=(img_size, img_size, 1)))
model_keras.add(MaxPooling2D(pool_size=(2, 2), padding='valid'))
model_keras.add(Conv2D(filters=64, kernel_size=(3, 3), strides=(1, 1), padding='same', activation='relu'))
model_keras.add(MaxPooling2D(pool_size=(2, 2), padding='valid'))
model_keras.add(Dropout(rate=0.2, seed=58))
model_keras.add(Flatten())
model_keras.add(Dense(680, activation='relu'))
model_keras.add(Dropout(rate=0.5, seed=58))
model_keras.add(Dense(num_classes, activation='softmax'))
model_keras.summary()
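
The shapes and parameter counts reported by the summary can be cross-checked by hand. The sketch below redoes that arithmetic for the layer stack above, assuming `'same'` padding keeps the spatial size and each 2x2 pooling halves it; the helper functions are hypothetical.

```python
img_size, num_classes = 32, 340

def conv_params(kernel, in_channels, out_channels):
    # One kernel x kernel filter per (input, output) channel pair, plus one bias per output channel.
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    # Full weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

conv1 = conv_params(3, 1, 32)
side = img_size // 2          # first 2x2 max pooling: 32 -> 16
conv2 = conv_params(3, 32, 64)
side = side // 2              # second 2x2 max pooling: 16 -> 8
flat = side * side * 64       # flattened feature vector
dense1 = dense_params(flat, 680)
dense2 = dense_params(680, num_classes)

total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, flat, dense1, dense2, total)
```

Most of the parameters sit in the first dense layer, so reducing its width or the flattened feature size would be the quickest way to shrink the model.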

In [None]:
# keras model setup
model_keras.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[top_3_categorical_accuracy])

# keras model fit
model_keras.fit(x_train, y_train, batch_size=32, epochs=32, verbose=2, validation_data=(x_validate, y_validate))

# keras model metrics
model_keras_score = model_keras.evaluate(x_validate, y_validate, verbose=1)
print('keras\n  top 3 categorical accuracy score: %0.4f' % model_keras_score[1])

In [None]:
# keras model save
model_keras.save('model_keras.h5')

In [None]:
# memory clean-up
del x_train, x_validate, y_train, y_validate

> **Supply or submit the results**

We now predict the three most likely labels for each test drawing and write them to a submission file for Kaggle. Any suggestions to improve our score are welcome.

In [None]:
# acquiring testing data
df_test = pd.read_csv('../input/test_simplified.csv')

# combine testing dataframe
df_test['datatype'] = 'testing'
df_test['word'], df_test['recognized'] = '', True

# feature extraction: step 1
df_test = feature_extraction(df_test)

# feature extraction: step 2
df_test = feature_extraction2(df_test)

In [None]:
# prepare testing data and compute the top 3 predicted classes
x_test = np.zeros((df_test.shape[0], img_size, img_size, 1))
for i, df_row in df_test.iterrows(): x_test[i, :, :, 0] = df_row['pixel']
y_test = np.argsort(-model_keras.predict(x_test, verbose=1))[:, 0:3]
df_word = pd.DataFrame({'top 1': y_test[:, 0], 'top 2': y_test[:, 1], 'top 3': y_test[:, 2]})
df_word = df_word.replace(classreverse_dict)
df_word['submission'] = df_word['top 1'] + ' ' + df_word['top 2'] + ' ' + df_word['top 3']
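
The decoding step above can be illustrated on a toy probability matrix: sort the negated scores, keep the first three indices, and map them to label names. The label dictionary and probabilities below are made up for the example.

```python
import numpy as np

# Hypothetical index-to-label mapping and predicted probabilities.
labels = {0: "apple", 1: "banana", 2: "cat", 3: "dog"}
probs = np.array([
    [0.05, 0.50, 0.30, 0.15],
    [0.60, 0.10, 0.05, 0.25],
])

# argsort on the negated scores ranks classes from most to least likely.
top3 = np.argsort(-probs, axis=1)[:, :3]

# Join the three label names with spaces, one string per row.
submission = [" ".join(labels[int(i)] for i in row) for row in top3]

print(submission)
```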

In [None]:
# submit the results
out = pd.DataFrame({'key_id': df_test['key_id'], 'word': df_word['submission']})
out.to_csv('submission.csv', index=False)