In [5]:
from __future__ import division
import numpy as np
import scipy as sp
import cPickle
from sklearn.cross_validation import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

<h3>Image Classification</h3>
Below we load the CIFAR-10 Dataset into python.  Availible at http://www.cs.toronto.edu/~kriz/cifar.html.  This is just some of the basics from http://cs231n.github.io/classification/

In [2]:
files = ['cifar-10-batches-py/data_batch_' + str(i) for i in range(1, 6)]
def unpickle(file):
    import cPickle
    fo = open(file, 'rb')
    dict = cPickle.load(fo)
    fo.close()
    return dict

batches = map(unpickle, files)
X = np.concatenate([b['data'] for b in batches])
y = np.concatenate([b['labels'] for b in batches])

X_test = map(unpickle, ['cifar-10-batches-py/test_batch'])[0]['data']
y_test = map(unpickle, ['cifar-10-batches-py/test_batch'])[0]['labels']

In [3]:
X_test

array([[158, 159, 165, ..., 124, 129, 110],
       [235, 231, 232, ..., 178, 191, 199],
       [158, 158, 139, ...,   8,   3,   7],
       ..., 
       [ 20,  19,  15, ...,  50,  53,  47],
       [ 25,  15,  23, ...,  80,  81,  80],
       [ 73,  98,  99, ...,  94,  58,  26]], dtype=uint8)

These 32x32 images are encoded as flattened numpy arrays;  The first 1024 digits encode the red values, then blue and green, respectively

<h3>Nearest Neighbors Classifier</h3>
The simpliest classified for data is to simply predict the class value $y^*$ for a novel input $x^*$ as the label $y$ of the closest input $x$.  How we defined "closeness" is important; two common choices for this distance metric are the Euclidian Distance (L2) and the Manhatten Distance (L1).


A simple Algorithm to classify a new image would work as follows:

1.  Recieve input $x^*$
2.  For each training data $(x, y)$
    2a. Calculate distance $d(x, x^*)$, storing target class $y$ of closest training input
3. Predict class $y$


Below I take advantage of numpy arrays to perform a slightly faster computation - Though you'll note it is still very slow.  Because our model is the data itself, it scales poorly with data size


In [11]:
def l2(a, b):
    """Calculates L2 distance between two vectors"""
    return np.sum((a - b)**2, axis=1)
def l1(a, b):
    """Calculates L1 distance between two vectors"""
    return np.sum(np.abs(a - b), axis=1)

# Create numpy array of zeroes to store predictions
y_pred = np.zeros_like(y_test)

for i, test_image in enumerate(X_test[:200]):
    y_pred[i] = y[np.argmin(l2(test_image, X))]

In [12]:
errors = 0
for a, b in zip(y_pred[:200], y_test[:200]):
    if (a != b):
        errors +=1
print "Test Accuracy: " + str(errors / 200)

Test Accuracy: 0.695


<h3>K-Nearest Neighbors Classifier</h3>
To improve the performance of the above algorithm, we can take the average (regression) or consesus (classification) of the $k$ nearest datapoints

In [15]:
k = 20
for i, test_image in enumerate(X_test[:200]):
    y_pred[i] = np.argmax(np.bincount(y[np.argsort(l2(test_image, X))][:k]))
    
errors = 0
for a, b in zip(y_pred[:200], y_test[:200]):
    if (a != b):
        errors +=1
print "Test Accuracy: " + str(errors / 200)

Test Accuracy: 0.75


<h3>Improving Performance with K-Means</h3>
Ultimatly we need to improve the performance of the above algorithms if we want to use them at any larger scale.  One method to do so is by preprocessing the data with the K-means algorithm - then running the k-nearest neighbors on the dataset of means

In [None]:
from sklearn.cluster import KMeans
KM = KMeans(n_clusters=100)
X_mod = KM.fit_predict(X)

k = 10
for i, test_image in enumerate(X_test):
    y_pred[i] = np.argmax(np.bincount(y[np.argsort(l2(test_image, X_mod))][:k]))
    
errors = 0
for a, b in zip(y_pred[:200], y_test[:200]):
    if (a != b):
        errors +=1
print "Test Accuracy: " + str(errors / 200)