# Lab: Kaggle Competition - Digit Recognizer

## Using Keras

**Competition Description**

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.

**Skills**

- Computer vision fundamentals including simple neural networks

In [1]:
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import np_utils
from sklearn.model_selection import train_test_split

Using Theano backend.


### Load training data

Load `train.csv` from Kaggle into a pandas DataFrame.

In [2]:
training = pd.read_csv('train.csv')

In [3]:
training.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
training.shape
# 42000 rows (images), 785 columns (1 label column + 784 pixel columns, one per pixel of a 28x28 image)

(42000, 785)

### Set up X and y

NOTE: Keras requires a `numpy` matrix; it doesn't work with `pandas` objects directly, so `X` is converted with `.values`.

In [5]:
y = training['label']   # label column classifying the numbers
X = training[training.columns[1:]].values  # all the remaining columns after that, containing pixel data

### Preprocessing

1. When dealing with image data, normalize your `X` by dividing each value by the maximum pixel intensity (255), so that every feature falls between 0 and 1.
2. Since this is a multiclass classification problem, Keras needs `y` to be a one-hot encoded matrix.

In [6]:
# Normalize X
X = X / 255.

# One-hot encoding on y
y = np_utils.to_categorical(y)
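
To make the one-hot step concrete, here is a minimal NumPy sketch of the kind of matrix `to_categorical` returns. The helper name `one_hot` is hypothetical and not part of Keras, and it assumes integer labels in the range 0 to `num_classes - 1`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer labels into a one-hot matrix."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Put a 1 in the column that matches each row's label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# The first five labels shown by training.head() above.
sample_labels = [1, 0, 1, 4, 0]
encoded = one_hot(sample_labels, num_classes=10)
print(encoded.shape)  # (5, 10)
print(encoded[3])     # the single 1 sits at index 4
```

Each row contains exactly one 1, which is what lets the softmax output be compared against `y` class by class.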

### Train/Test Split

We want to hold out a validation set that the model never trains on, to approximate how it will do on Kaggle's `test.csv`. Use `sklearn`'s `train_test_split` to do this.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [8]:
X_train.shape

(28140, 784)
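
The shape above follows from `test_size=0.33`: 33% of the 42,000 rows are held out and the rest are used for training. The sketch below reproduces that arithmetic with a plain NumPy shuffle. The helper `holdout_split` and its seed are illustrative assumptions; this is not how `train_test_split` is implemented internally.

```python
import numpy as np

def holdout_split(n_samples, test_fraction, seed=0):
    """Hypothetical helper: shuffle row indices and cut off a held-out slice."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = holdout_split(42000, 0.33)
print(len(train_idx), len(test_idx))  # 28140 13860
```

Because the split is random, rerunning the notebook gives a different partition each time. Passing `random_state` to `train_test_split` makes the split reproducible.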

### Create your neural network

Create a neural network using the `Dense` and `Dropout` layers from `keras`. Your activation function for the final output layer needs to be `softmax` to accommodate the ten different classes.

In [11]:
# Model: a 784-node hidden layer, 50% dropout, a 50-node hidden layer, and a softmax output
model = Sequential()
model.add(Dense(X_train.shape[1], input_dim=X_train.shape[1], activation='relu'))  # first hidden layer (784 nodes); input_dim declares the 784 input features
model.add(Dropout(.5))                                                             # dropout layer, drops 50% of activations during training
model.add(Dense(50, activation='relu'))                                            # second hidden layer, 50 nodes
model.add(Dense(y_train.shape[1], activation='softmax'))                           # output layer, one node per class
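
As a rough size check on this architecture, the sketch below counts the trainable parameters by hand. It assumes 784 input features and 10 classes, as in this dataset. The helper `dense_params` is hypothetical; `model.summary()` should report the same per-layer figures.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# Input features, followed by the unit count of each Dense layer.
layer_sizes = [784, 784, 50, 10]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts)  # [615440, 39250, 510]
print(total)   # 655200
```

The dropout layer contributes no trainable parameters, so almost all of the model's capacity sits in the first 784-by-784 layer.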

### Compile your model

Since this is a multiclass classification problem, your loss function is `categorical_crossentropy`.

In [12]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
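
To show what this loss measures, here is a small NumPy sketch of softmax followed by categorical cross-entropy. It illustrates the formula only and is not Keras's implementation; the function names and the `eps` clipping value are assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean over samples of -sum(true * log(predicted)).
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# One sample whose true class is 3, scored by an untrained "model"
# that outputs identical logits for all ten classes.
logits = np.zeros((1, 10))
y_true = np.zeros((1, 10))
y_true[0, 3] = 1.0

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(loss)  # about 2.3026, i.e. ln(10)
```

A loss near ln(10) therefore means the network is doing no better than guessing uniformly across the ten digits.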

### Fit the model

Use your `X_test` and `y_test` from the `train_test_split` step for the `validation_data` parameter.

In [13]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=100)

Train on 28140 samples, validate on 13860 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1c141ca6d8>
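
The `epochs` and `batch_size` arguments determine how many weight updates the fit performs. The arithmetic below is a sketch that assumes the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

n_train = 28140   # training samples reported in the output above
batch_size = 100
epochs = 5

batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, last_batch_size, total_updates)  # 282 40 1410
```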

### Load in Kaggle's `test.csv`

Be sure to do the **same** preprocessing you did for your training `X`.

In [16]:
# Load testing dataframe
testing = pd.read_csv('test.csv')

In [17]:
# Normalize columns in testing dataframe (same scaling as the X columns in the training dataframe)
testing = testing / 255.
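
One way to keep training and test preprocessing identical is to route both through a single function. The sketch below is an assumption-laden illustration: `scale_pixels` is a hypothetical helper, and it assumes each row holds 784 pixel values between 0 and 255 with no label column.

```python
import numpy as np

def scale_pixels(pixels, n_features=784, max_intensity=255.0):
    """Hypothetical helper: validate the pixel matrix, then scale it to [0, 1]."""
    pixels = np.asarray(pixels, dtype=np.float32)
    if pixels.ndim != 2 or pixels.shape[1] != n_features:
        raise ValueError(f"expected shape (n, {n_features}), got {pixels.shape}")
    if pixels.min() < 0 or pixels.max() > max_intensity:
        raise ValueError("pixel values outside the expected 0-255 range")
    return pixels / max_intensity

# Random stand-in data in place of real images.
fake_batch = np.random.default_rng(0).integers(0, 256, size=(3, 784))
scaled = scale_pixels(fake_batch)
print(scaled.shape)  # (3, 784)
```

Calling the same helper on `X` and on `testing.values` would surface a stray label or index column as an error instead of a silent shape mismatch.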

### Create your predictions

Use `predict_classes` to get the actual numerical values (0-9).

In [18]:
predictions = model.predict_classes(testing.values)
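
For a softmax output, the predicted class is the index of the largest probability in each row. The sketch below shows that step with made-up probabilities rather than real model output. In newer Keras releases, where `predict_classes` is no longer available, taking `np.argmax` over the result of `model.predict` is the usual equivalent.

```python
import numpy as np

# Made-up softmax outputs for three images over the ten digit classes.
probabilities = np.full((3, 10), 0.02)
probabilities[0, 7] = 0.82
probabilities[1, 0] = 0.82
probabilities[2, 4] = 0.82

# Index of the most probable class in each row.
predicted_digits = np.argmax(probabilities, axis=1)
print(predicted_digits)  # [7 0 4]
```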



### Prepare your submission

1. Add your predictions to a column called `Label`
2. You'll need to manually create the `ImageId` column, which is just a list of 1..[NUMBER OF TEST SAMPLES]

In [19]:
testing['Label'] = predictions
testing['ImageId'] = range(1, testing.shape[0] + 1)

### Create your submission csv

Remember to set `index=False`!

In [20]:
testing[['ImageId', 'Label']].to_csv('submission.csv', index=False)
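
Before uploading, it can help to sanity-check the file format. The sketch below uses only the standard library and works on CSV text, so it runs without the real file. The `check_submission` helper is hypothetical, and the rules it enforces (header `ImageId,Label`, ids counting up from 1, labels 0-9) are taken from the steps above rather than from an official validator.

```python
import csv
import io

def check_submission(csv_text, n_expected):
    """Hypothetical helper: return True if the CSV text looks like a valid submission."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != ["ImageId", "Label"]:
        return False
    body = rows[1:]
    if len(body) != n_expected:
        return False
    for position, row in enumerate(body, start=1):
        if len(row) != 2:
            return False
        image_id, label = row
        # Ids must count up from 1, and labels must be digits 0-9.
        if not (image_id.isdigit() and label.isdigit()):
            return False
        if int(image_id) != position or int(label) > 9:
            return False
    return True

sample = "ImageId,Label\n1,2\n2,0\n3,9\n"
print(check_submission(sample, n_expected=3))  # True
```

To check the real file, the text could be read with `open('submission.csv').read()` and `n_expected` set to the number of rows in `test.csv`.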