# Lab: Kaggle Competition - Digit Recognizer

## Using Convolutional Neural Network (CNN) 

**Competition Description**

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.

**Skills**

- Computer vision fundamentals including simple neural networks

In [14]:
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Input, Dropout, Conv2D, MaxPool2D, Flatten
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

### Load training data

Load `train.csv` from Kaggle into a pandas DataFrame.

In [19]:
training = pd.read_csv('train.csv')

In [20]:
training.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
training.shape
# 42000 rows (images), 785 columns (1 label column + 784 pixel columns, one per pixel of a 28x28 image)

(42000, 785)

### Set up X and y

NOTE: Keras expects `numpy` arrays rather than `pandas` objects, so extract the underlying arrays with `.values`.

In [22]:
y = training['label'].values   # label column classifying the numbers
X = training[training.columns[1:]].values  # all the remaining columns after that, containing pixel data

### Preprocessing

1. When dealing with image data, normalize `X` by dividing each value by the maximum pixel intensity (255) so that every feature lies between 0 and 1.
2. Since this is a multiclass classification problem, Keras needs `y` to be a one-hot encoded matrix.

In [23]:
# Normalize X
X = X / 255.

# Reshape each row of 784 pixels into a 28x28 image with a single greyscale channel
X = X.reshape(X.shape[0], 28, 28, 1)

# One-hot encoding on y converting it to categorical
y = to_categorical(y)
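
As a rough illustration of what these three steps do to the data, the sketch below applies them to two made-up images using plain NumPy. The toy arrays and variable names are assumptions for illustration only, and `np.eye` is used as a stand-in for `to_categorical`.

```python
import numpy as np

# Two made-up flattened images (784 pixel values each) with raw intensities 0-255.
toy_pixels = np.zeros((2, 784), dtype=np.uint8)
toy_pixels[0, 0] = 255   # brightest possible pixel
toy_pixels[1, 1] = 51    # a dim pixel
toy_labels = np.array([3, 0])

# 1. Scale intensities from [0, 255] to [0, 1].
scaled = toy_pixels / 255.

# 2. Restore the 28x28 grid and add a single greyscale channel.
images = scaled.reshape(scaled.shape[0], 28, 28, 1)

# 3. One-hot encode the labels: row i of the identity matrix encodes class i.
one_hot = np.eye(10)[toy_labels]

print(images.shape)   # (2, 28, 28, 1)
print(one_hot[0])     # 1.0 in position 3, zeros elsewhere
```

One difference worth keeping in mind: this sketch fixes the number of classes at ten, whereas `to_categorical` infers it from the largest label unless `num_classes` is passed.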

### Train/Test Split

We want to create a validation set that the model will never see to approximate how it's going to do with Kaggle's `test.csv`. Use `sklearn`'s `train_test_split` to do this.

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [25]:
X_train.shape

(28140, 28, 28, 1)
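
The shape above follows from the split arithmetic. A small sketch, assuming scikit-learn rounds a fractional `test_size` up to a whole number of samples and gives the remainder to the training set:

```python
n_samples = 42000      # rows in train.csv
test_percent = 33      # test_size=0.33, kept as an integer to avoid float rounding

# Ceiling division: round the test count up to a whole sample.
n_test = -(-n_samples * test_percent // 100)
n_train = n_samples - n_test

print(n_train, n_test)  # 28140 13860
```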

### Create your neural network

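Create a convolutional neural network using the `Conv2D`, `MaxPool2D`, `Flatten`, and `Dense` layers from `keras`. The activation function for the final output layer needs to be `softmax` to accommodate the ten different classes. The commented-out blocks are alternative architectures, each followed by the metrics recorded for it.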

In [38]:
# # The neural network includes...
# model = Sequential()
# model.add(Conv2D(15, kernel_size=(5,5), input_shape=(28, 28, 1), activation='relu'))  # input layer (features)
# model.add(MaxPool2D((2,2)))                                                           # pooling to reduce spatial size
# model.add(Conv2D(30, kernel_size=(4,4), activation='relu'))
# model.add(MaxPool2D((2,2)))                                                           # pooling layer
# model.add(Conv2D(45, kernel_size=(3,3), activation='relu'))
# model.add(MaxPool2D((2,2)))                                                           # pooling layer
# model.add(Flatten())
# model.add(Dense(50, activation='relu'))
# model.add(Dense(10, activation='softmax'))                                           # output layer (predictions)
# # loss: 0.0458 - acc: 0.9854 - val_loss: 0.0680 - val_acc: 0.9800

model = Sequential()
model.add(Conv2D(20, kernel_size=(5,5), input_shape=(28, 28, 1), activation='relu'))  # input layer (features)
model.add(MaxPool2D((2,2)))                                                           # pooling to reduce spatial size
model.add(Conv2D(40, kernel_size=(4,4), activation='relu'))
model.add(MaxPool2D((2,2)))                                                           # pooling layer
model.add(Conv2D(80, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D((2,2)))                                                           # pooling layer
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(10, activation='softmax'))                                           # output layer (predictions)
# loss: 0.0355 - acc: 0.9886 - val_loss: 0.0592 - val_acc: 0.9825

# model = Sequential()
# model.add(Conv2D(20, kernel_size=(5,5), input_shape=(28, 28, 1), activation='relu'))  # input layer (features)
# model.add(MaxPool2D((2,2)))                                                           # pooling to reduce spatial size
# model.add(Conv2D(40, kernel_size=(4,4), activation='relu'))
# model.add(MaxPool2D((2,2)))                                                           # pooling layer
# model.add(Conv2D(80, kernel_size=(3,3), activation='relu'))
# model.add(MaxPool2D((2,2)))                                                           # pooling layer
# model.add(Flatten())
# model.add(Dense(50, activation='relu'))
# model.add(Dense(25, activation='relu'))
# model.add(Dense(10, activation='softmax'))                                           # output layer (predictions)
# # loss: 0.0390 - acc: 0.9877 - val_loss: 0.0573 - val_acc: 0.9826
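
To see how the active model shrinks a 28x28 image down to a short feature vector, the sketch below traces the spatial size and trainable parameter count layer by layer. It assumes the Keras defaults for these layers (stride 1 and no padding for `Conv2D`, pooling stride equal to the pool size for `MaxPool2D`); the helper names are made up for this illustration.

```python
def conv_out(size, kernel):
    """Output width/height of a stride-1 convolution with no padding."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width/height of non-overlapping pooling (leftover rows/columns are dropped)."""
    return size // pool

# (filters, kernel size) for the three Conv2D layers of the active model.
conv_layers = [(20, 5), (40, 4), (80, 3)]

size, channels, total_params = 28, 1, 0
for filters, kernel in conv_layers:
    # Each filter has kernel*kernel*channels weights plus one bias.
    total_params += (kernel * kernel * channels + 1) * filters
    size = pool_out(conv_out(size, kernel))
    channels = filters
    print(f"after conv {kernel}x{kernel} + pool: {size}x{size}x{channels}")

flat = size * size * channels            # length of the Flatten output
for units in (50, 10):                   # the two Dense layers
    total_params += (flat + 1) * units   # weights plus one bias per unit
    flat = units

print(total_params)
```

Under these assumptions the feature map is 12x12, then 4x4, then 1x1, so `Flatten` hands only 80 values to the first `Dense` layer.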

### Compile your model

Since this is a multiclass classification problem, your loss function is `categorical_crossentropy`.

In [39]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
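
For intuition about what this loss measures, here is a minimal pure-Python sketch of softmax followed by categorical cross-entropy for a single sample. The scores are made up, and Keras's own implementation differs in details (for example, it averages the per-sample value over each batch).

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

scores = [0.0] * 10
scores[7] = 4.0                      # made-up raw outputs favouring class 7
probs = softmax(scores)

target_7 = [1.0 if i == 7 else 0.0 for i in range(10)]
target_2 = [1.0 if i == 2 else 0.0 for i in range(10)]

loss_right = categorical_crossentropy(target_7, probs)  # confident and correct: small loss
loss_wrong = categorical_crossentropy(target_2, probs)  # confident and wrong: large loss
print(loss_right, loss_wrong)
```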

### Fit the model

Pass `X_test` and `y_test` from the `train_test_split` step as the `validation_data` parameter.

In [40]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10)

#loss: 0.0348 - acc: 0.9887 - val_loss: 0.0524 - val_acc: 0.9835
#loss: 0.0186 - acc: 0.9935 - val_loss: 0.0536 - val_acc: 0.9856

Train on 28140 samples, validate on 13860 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x119c45860>

### Load in Kaggle's `test.csv`

Be sure to do the **same** preprocessing you did for your training `X`.

In [46]:
# Load testing dataframe
testing = pd.read_csv('test.csv')

In [47]:
# Normalize the testing dataframe with the same scaling applied to X in training
testing = testing / 255.

# Reshape each row of 784 pixels into a 28x28 image with a single greyscale channel
testing = testing.values.reshape(testing.shape[0], 28, 28, 1)
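
One way to guarantee the training and test pixels get identical treatment is to route both through a single function. The sketch below is an assumption about how that could look, not part of the original lab; `preprocess_pixels` and the fake arrays are hypothetical names.

```python
import numpy as np

def preprocess_pixels(pixels):
    """Scale raw 0-255 pixel rows to [0, 1] and reshape to (n, 28, 28, 1)."""
    pixels = np.asarray(pixels, dtype="float64")
    return (pixels / 255.).reshape(pixels.shape[0], 28, 28, 1)

# Made-up stand-ins for the pixel columns of train.csv and test.csv.
fake_train_pixels = np.full((3, 784), 255)
fake_test_pixels = np.zeros((2, 784))

train_images = preprocess_pixels(fake_train_pixels)
test_images = preprocess_pixels(fake_test_pixels)
print(train_images.shape, test_images.shape)  # (3, 28, 28, 1) (2, 28, 28, 1)
```

Because `np.asarray` accepts a DataFrame, the same function could in principle be given the raw pixel columns directly.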

### Create your predictions

Use `predict_classes` to get the actual numerical values (0-9).

In [49]:
predictions = model.predict_classes(testing)
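
`predict_classes` is, in effect, a row-wise argmax over the softmax output, and later Keras releases dropped the method from `Sequential` in favour of doing that step explicitly. A minimal sketch of the equivalent, using made-up probabilities in place of the real model output:

```python
import numpy as np

# Made-up softmax outputs for three images; in the lab these would come from
# model.predict(testing), which returns one row of ten probabilities per image.
fake_probabilities = np.full((3, 10), 0.01)
fake_probabilities[0, 7] = 0.91
fake_probabilities[1, 2] = 0.91
fake_probabilities[2, 0] = 0.91

# The predicted digit is the index of the largest probability in each row.
predicted_digits = np.argmax(fake_probabilities, axis=1)
print(predicted_digits)  # [7 2 0]
```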



### Prepare your submission

1. Add your predictions to a column called `Label`
2. You'll need to manually create the `ImageId` column, which is just a list of 1..[NUMBER OF TEST SAMPLES]

In [55]:
submission = pd.DataFrame()
submission['ImageId'] = range(1, testing.shape[0] + 1)
submission['Label'] = predictions

### Create your submission csv

Remember to set `index=False`!

In [56]:
submission.to_csv('submission.csv', index=False)
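
To see why `index=False` matters, the sketch below builds a tiny submission from made-up predictions and compares the CSV text with and without the index. The `demo` frame and `fake_predictions` are hypothetical names used only here.

```python
import pandas as pd

# Made-up predictions for three test images.
fake_predictions = [2, 0, 9]

demo = pd.DataFrame({
    'ImageId': range(1, len(fake_predictions) + 1),
    'Label': fake_predictions,
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
without_index = demo.to_csv(index=False).splitlines()
with_index = demo.to_csv().splitlines()

print(without_index[0])  # ImageId,Label
print(with_index[0])     # ,ImageId,Label  (extra unnamed index column)
```

Leaving the index in adds a third, unnamed column, so the file no longer matches the two-column `ImageId`, `Label` layout described above.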

model.summary()  # print each layer's output shape and parameter count