2015-12-17

Using this notebook to get up and running. Start with a dummy classifier, and maybe some utilities for creating submissions.

In [None]:
from datetime import datetime
import numpy as np
import pandas as pd

# nb: changed matplotlib backend to 'Agg' in matplotlibrc 
#  (issues w/ MacOSX backend in virtualenvs)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb


# nb: sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.dummy import DummyClassifier
import joblib  # nb: sklearn.externals.joblib is gone from newer scikit-learn releases
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

In [None]:
X_train_full = np.load('data/train-images.npy')
y_train_full = np.load('data/train-labels.npy')

------

## inspection

In [None]:
labels = pd.Series(y_train_full)

In [None]:
labels.hist()

plt.title("all training labels")
plt.xlabel("class")
plt.ylabel("counts")

Since the distribution of classes is pretty uniform, we shouldn't need to worry about stratifying the cross-validation folds in later analysis steps.
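A quick numeric check can back up the histogram. The sketch below uses synthetic labels as a stand-in for `y_train_full` (the names `y_demo`, `class_fractions` and `balance_ratio` are made up for illustration) and reports the fraction of examples per class plus the ratio of the rarest to the most common class.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for y_train_full: 10 classes, 100 examples each
rng = np.random.RandomState(0)
y_demo = rng.permutation(np.repeat(np.arange(10), 100))

# fraction of examples per class, ordered by class label
class_fractions = pd.Series(y_demo).value_counts(normalize=True).sort_index()

# rarest / most common class; 1.0 means perfectly balanced
balance_ratio = class_fractions.min() / class_fractions.max()

print(class_fractions)
print("min/max class ratio: {:.2f}".format(balance_ratio))
```

A ratio close to 1 supports skipping stratification; a ratio well below 1 would be a reason to revisit that.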

----

## models

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_train_full, y_train_full)

print("train size: {}".format(len(X_train)))
print("test size: {}".format(len(X_test)))

Start with a fake classifier...

In [None]:
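# nb: ask for 'stratified' explicitly; newer scikit-learn defaults to 'prior',
#  which always predicts the most frequent class
clf = DummyClassifier(strategy='stratified')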
clf_label = clf.__class__.__name__
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

-----

## diagnostics

In [None]:
# c_v_p returns the array of predictions for each 
#  measurement when it was in the test fold
cv_predict = cross_val_predict(clf, X_train, y_train, cv=5)
cv_predict

In [None]:
confusion_matrix(y_train, cv_predict)

In [None]:
# should be approximately random
sb.heatmap( confusion_matrix(y_train, cv_predict))

plt.title(clf_label)
# confusion_matrix rows are true labels (heatmap y-axis), columns are predictions (x-axis)
plt.xlabel("Pred")
plt.ylabel("True")
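As a sanity check on orientation: in scikit-learn, `confusion_matrix(y_true, y_pred)` puts true classes on the rows and predicted classes on the columns, and a seaborn heatmap draws rows along the y-axis. A tiny sketch with made-up labels (`y_true_demo` and `y_pred_demo` are illustrative only):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_demo = np.array([0, 0, 0, 1, 1, 2])
y_pred_demo = np.array([0, 1, 1, 1, 1, 0])

# cm[i, j] counts samples whose true class is i and predicted class is j
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

row_sums = cm.sum(axis=1)  # how many samples truly belong to each class
col_sums = cm.sum(axis=0)  # how many times each class was predicted
print("true counts:", row_sums)
print("predicted counts:", col_sums)
```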

----

## scaling

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_train_full, y_train_full)

print("train size: {}".format(len(X_train)))
print("test size: {}".format(len(X_test)))

In [None]:
scaler = StandardScaler().fit(X_train)

X_scaled = scaler.transform(X_train)

In [None]:
clf = SGDClassifier(n_jobs=-1)

# k-fold CV (cf ~85% without scaling)
scores = cross_val_score(clf, X_scaled, y_train, n_jobs=-1, cv=3)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
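One caveat with scaling before cross-validating: the scaler has already seen every fold, so a little information leaks from the held-out folds into training. A possible alternative, sketched here on synthetic data (`X_demo`, `y_demo` and `pipe` are stand-ins, not the notebook's variables), is to wrap the scaler and classifier in a pipeline so the scaler is refit inside each fold. The import comes from `sklearn.model_selection`, which is where current scikit-learn releases keep `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: 300 samples, 20 features, 3 classes
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 20))
y_demo = rng.randint(0, 3, size=300)
# shift one feature per class so there is some signal to learn
X_demo[np.arange(300), y_demo] += 3.0

# the scaler is fit on the training part of each fold only
pipe = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=3)

print("Accuracy: %0.2f (+/- %0.2f)" % (pipe_scores.mean(), pipe_scores.std() * 2))
```

For pixel data the difference is probably small, but the pipeline also keeps the scaler and model together when predicting on new images.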

In [None]:
cv_predict = cross_val_predict(clf, X_scaled, y_train, cv=3)
sb.heatmap( confusion_matrix(y_train, cv_predict))

plt.title(clf.__class__.__name__)
# confusion_matrix rows are true labels (heatmap y-axis), columns are predictions (x-axis)
plt.xlabel("Pred")
plt.ylabel("True")
plt.show()

----

## image display

----

## utilities

saving predictions

In [None]:
def create_submission(predictions, sub_name, comment=None, team='DrJ'):
    """Include the given array of image predictions in a properly-formatted 
    submission file.
    """
    now = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S')
    submission_name = '-'.join(sub_name.split())
    with open('submissions/{}_{}.submission'.format(now, submission_name), 'w') as f:
        f.write('#'*20 + ' Generated submission file\n')
        if comment is not None:
            f.write('# ' + comment + '\n')
        f.write('{}\n'.format(team))
        f.write('{}\n'.format(now))
        f.write('{}\n'.format(sub_name))
        for p in predictions:
            f.write('{}\n'.format(p))
    return True
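To check a written file, something like the following could read it back. This is only a sketch: `read_submission` is a hypothetical helper, and it assumes the layout written above (lines starting with `#`, then team, timestamp, submission name, then one prediction per line), that predictions are integer labels, and that the team name does not itself start with `#`. The sample file is written by hand into a temporary directory.

```python
import os
import tempfile


def read_submission(path):
    """Parse a file laid out like the output of create_submission (hypothetical helper)."""
    with open(path) as f:
        lines = [line.rstrip('\n') for line in f]
    # header and comment lines start with '#'
    body = [line for line in lines if not line.startswith('#')]
    team, timestamp, sub_name = body[:3]
    predictions = [int(p) for p in body[3:]]
    return team, timestamp, sub_name, predictions


# hand-written sample mimicking the layout above
sample = '\n'.join([
    '#' * 20 + ' Generated submission file',
    '# demo comment',
    'DrJ',
    '2015-12-17T00:00:00',
    'dummy run',
    '3', '1', '4',
]) + '\n'

demo_path = os.path.join(tempfile.mkdtemp(), 'demo.submission')
with open(demo_path, 'w') as f:
    f.write(sample)

team, timestamp, sub_name, predictions = read_submission(demo_path)
print(team, timestamp, sub_name, predictions)
```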

----------

# Model 1: Class Test Dummies

Use a ``DummyClassifier`` with "stratified" choices (i.e. predictions are drawn at random, following the class distribution of the training set).
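To make the "stratified" behaviour concrete, here is a small sketch on deliberately imbalanced synthetic labels (`X_demo`, `y_demo` and `dummy` are illustrative names). The features are ignored entirely; the predictions should come out roughly 80/20, mirroring the training labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.RandomState(0)

# imbalanced synthetic labels: 80% class 0, 20% class 1
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = rng.normal(size=(1000, 5))  # the dummy never looks at these

dummy = DummyClassifier(strategy='stratified', random_state=0)
dummy.fit(X_demo, y_demo)
demo_pred = dummy.predict(X_demo)

# fraction of predictions that landed on class 0
pred_fraction_class0 = np.mean(demo_pred == 0)
print("predicted class 0 fraction: {:.2f}".format(pred_fraction_class0))
```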

In [None]:
X_test = np.load('data/test-images.npy')

print("test size: {}".format(len(X_test)))

In [None]:
# nb: newer scikit-learn defaults to strategy='prior', so be explicit
clf = DummyClassifier(strategy='stratified')
clf.fit(X_train, y_train)

In [None]:
predictions = clf.predict(X_test)

print("target size: {}".format(len(predictions)))

In [None]:
predictions

# Model 2: Heard It Through The Grapevine

> *Hey, I heard somewhere once that SVMs work well on the MNIST dataset.*
>
> \- the back of my brain

Use a vanilla SVM classifier, because I literally just remember hearing that it was a good and efficient model for 

In [None]:
X_train = np.load('data/train-images.npy')
y_train = np.load('data/train-labels.npy')

X_test = np.load('data/test-images.npy')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train)

print("train size: {}".format(len(X_train)))
print("test size: {}".format(len(X_test)))

In [None]:
clf = SGDClassifier(n_jobs=-1)

In [None]:
# single prediction
#clf.fit(X_train, y_train)
#accuracy_score(y_test, clf.predict(X_test))

# k-fold CV
scores = cross_val_score(clf, X_train, y_train, n_jobs=-1, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [None]:
cv_predict = cross_val_predict(clf, X_train, y_train, cv=5)

In [None]:
sb.heatmap( confusion_matrix(y_train, cv_predict))

plt.title(clf.__class__.__name__)
# confusion_matrix rows are true labels (heatmap y-axis), columns are predictions (x-axis)
plt.xlabel("Pred")
plt.ylabel("True")

-----

Reset the model and fit on the entire training set.

In [None]:
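# nb: the split above overwrote X_train / X_test, so fit on the full
#  training arrays and predict on the real test images
clf = SGDClassifier(n_jobs=-1)
predictions = clf.fit(X_train_full, y_train_full).predict(np.load('data/test-images.npy'))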