In [24]:
import numpy as np 
import pandas as pd 

## Purpose:
Classify handwritten digits (MNIST data) using support vector classifiers (SVCs) and kernels.

## Data Set-up 

Import data and use train_test_split to set up training and testing datasets. 


In [35]:

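# import train_test_split and the OpenML dataset loader
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml

# fetch_mldata has been removed from scikit-learn; fetch_openml provides the same MNIST data
mnist = fetch_openml('mnist_784', version=1, as_frame=False)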

X = mnist.data.astype('float64')
y = mnist.target.astype('int64')

# print shapes of X and y
print(X.shape)
print(y.shape)

# set test size to 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

(70000, 784)
(70000,)


## Use PCA from sklearn
We will use Principal Component Analysis (PCA) to reduce the dimensionality of the data before fitting the SVCs. PCA projects a dataset with many correlated variables onto a smaller set of uncorrelated components, chosen so that they retain as much of the original variance as possible.

In [40]:
from sklearn.decomposition import PCA

# Keep enough components to explain 80% of the variance. Get rid of the rest.
pca = PCA(n_components=0.8, whiten=True)

# Use pca to get new X_train, X_test
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

#print the shape of X_train_pca 
print(X_train_pca.shape)

(52500, 43)


What change do you notice between our old training data and our new one?

Answer: The number of features dropped from 784 to 43, while the retained components still explain at least 80% of the variance in the training data.
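To check how much variance a fitted PCA actually keeps, one option is to inspect its `n_components_` and `explained_variance_ratio_` attributes. The sketch below is only an illustration: it uses scikit-learn's small bundled digits dataset (8x8 images, 64 features) as a stand-in for MNIST, and the names `X_demo`, `pca_demo`, `n_kept`, and `variance_kept` are made up for this example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small stand-in dataset (8x8 digit images, 64 features); not the MNIST data above.
X_demo = load_digits().data

# A float n_components asks PCA for the fewest components whose
# cumulative explained variance reaches that fraction.
pca_demo = PCA(n_components=0.8, whiten=True)
X_demo_pca = pca_demo.fit_transform(X_demo)

n_kept = pca_demo.n_components_
variance_kept = pca_demo.explained_variance_ratio_.sum()

print("features before:", X_demo.shape[1])
print("features after: ", n_kept)
print("variance kept:   %.3f" % variance_kept)
```

The same two attributes can be read from the `pca` object fitted on `X_train` to confirm the reduction reported above.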

## SVC and Kernels

Now we will experiment with support vector classifiers and kernels. We will need LinearSVC, SVC, and accuracy_score.

SVMs are interesting because of their dual formulation, in which the computation is expressed entirely in terms of inner products between training points. This means the data can be implicitly lifted into a higher-dimensional space through the "kernel trick". Data that is not linearly separable in a lower dimension can become linearly separable in a higher one, which is why we apply the transform. Let us experiment.
A kernel is a function that computes the inner product two points would have after being lifted into a higher-dimensional space, without constructing that space explicitly. A polynomial kernel corresponds to a feature space whose coordinates are polynomial cross terms of the original features, up to a specified degree.
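The kernel trick can be checked numerically on a tiny example. The sketch below assumes the simplest degree-2 polynomial kernel, k(x, z) = (x . z)^2, which omits the `gamma` and `coef0` terms that scikit-learn's `'poly'` kernel also includes. The helper names `phi` and `poly2_kernel` are hypothetical and exist only for this illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: every pairwise product x_i * x_j."""
    return np.outer(x, x).ravel()

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, computed in the original space."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=5)
z = rng.normal(size=5)

lifted = np.dot(phi(x), phi(z))  # inner product in the 25-dimensional lifted space
direct = poly2_kernel(x, z)      # same quantity from a single 5-dimensional dot product

print(np.isclose(lifted, direct))
```

Both routes give the same number, but the kernel never builds the 25 lifted coordinates. For higher degrees or more input features the lifted space grows quickly, while the cost of the kernel evaluation stays close to that of one dot product.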

In [43]:
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import accuracy_score

# fit the LinearSVC on X_train_pca and y_train and then print train accuracy and test accuracy

lsvc = LinearSVC(dual=False, tol=0.01) 
lsvc.fit(X_train_pca, y_train)
print('train acc: ', accuracy_score(y_train, lsvc.predict(X_train_pca)))
print('test acc: ', accuracy_score(y_test, lsvc.predict(X_test_pca)))

# use SVC with a polynomial kernel. Fit this model on X_train_pca and y_train and print accuracy metrics as before

psvc = SVC(kernel='poly', degree=2, tol=0.01, cache_size=4000)
psvc.fit(X_train_pca, y_train)
print('train acc: ', accuracy_score(y_train, psvc.predict(X_train_pca)))
print('test acc: ', accuracy_score(y_test, psvc.predict(X_test_pca)))

# play around with the degree of the polynomial to see if you can improve your accuracy 




train acc:  0.892476190476
test acc:  0.894057142857
train acc:  0.9896
test acc:  0.980914285714
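One way to experiment with the polynomial degree is to loop over a few values and record the accuracy for each. The sketch below is a rough illustration that runs on scikit-learn's small bundled digits dataset rather than MNIST, so it finishes quickly. The names `Xtr`, `Xte`, `ytr`, `yte`, `pca_demo`, and `degree_scores` are made up for this example, and the scores it prints are not comparable to the MNIST numbers above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small stand-in dataset so the loop runs in seconds.
X_demo, y_demo = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Same preprocessing as in the notebook: fit PCA on the training split only.
pca_demo = PCA(n_components=0.8, whiten=True)
Xtr_pca = pca_demo.fit_transform(Xtr)
Xte_pca = pca_demo.transform(Xte)

# Fit one polynomial-kernel SVC per degree and record its held-out accuracy.
degree_scores = {}
for degree in (1, 2, 3, 4):
    model = SVC(kernel='poly', degree=degree)
    model.fit(Xtr_pca, ytr)
    degree_scores[degree] = accuracy_score(yte, model.predict(Xte_pca))
    print('degree', degree, 'test acc:', degree_scores[degree])
```

Choosing the degree by looking at test accuracy leaks information from the test set. For a fair final estimate, the comparison would usually be made on a separate validation split or with cross-validation.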


The RBF kernel uses a Gaussian function of the distance between two points, which corresponds to an infinite-dimensional feature space. Intuitively, it places a Gaussian peak at each data point. Use the scikit-learn example at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html to work out good values for the gamma and C parameters.

In [None]:
# use SVC with rbf kernel. Fit this model on X_train_pca and y_train and print accuracy metrics.

rsvc = SVC(kernel='rbf', tol=0.01, cache_size=400, C=200, gamma=10)
rsvc.fit(X_train_pca, y_train)
print('train acc: ', accuracy_score(y_train, rsvc.predict(X_train_pca)))
print('test acc: ', accuracy_score(y_test, rsvc.predict(X_test_pca)))
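Rather than picking `C` and `gamma` by hand, a cross-validated grid search can compare several combinations. The sketch below is one possible setup, not a recommendation for MNIST: it uses the small bundled digits dataset as a stand-in, and the grid values and the names `pipe`, `param_grid`, and `search` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Small stand-in dataset so the search runs quickly.
X_demo, y_demo = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Putting PCA inside the pipeline refits it on each training fold,
# so the validation folds never influence the projection.
pipe = make_pipeline(PCA(n_components=0.8, whiten=True), SVC(kernel='rbf'))

# Illustrative grid; make_pipeline names the SVC step 'svc'.
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(Xtr, ytr)

print('best params:', search.best_params_)
print('cv acc:     ', search.best_score_)
print('test acc:   ', search.score(Xte, yte))
```

On the full MNIST training set the same search would be far slower, since each SVC fit scales worse than linearly with the number of samples. Searching on a random subsample first is a common way to narrow the grid.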

## Kernel Ridge Regression 

Let's see what happens when we do ridge regression with the kernel trick. We will use KernelRidge from sklearn.

In [None]:
from sklearn.kernel_ridge import KernelRidge

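# use KernelRidge to see if accuracy scores are improved.
# KernelRidge is a regressor, so its predictions are continuous; round them to the
# nearest digit in 0-9 before computing accuracy.
krr = KernelRidge(alpha=1.0)
krr.fit(X_train_pca, y_train)
train_pred = np.clip(np.rint(krr.predict(X_train_pca)), 0, 9).astype('int64')
test_pred = np.clip(np.rint(krr.predict(X_test_pca)), 0, 9).astype('int64')
print('train acc: ', accuracy_score(y_train, train_pred))
print('test acc: ', accuracy_score(y_test, test_pred))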
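Because KernelRidge is a regressor, regressing directly on the digit value treats the labels as ordered quantities (a 3 is "closer" to a 4 than to a 9), which does not reflect how digits look. A common alternative, sketched here on the small bundled digits dataset as an assumption-laden illustration, is to regress onto one-hot labels and predict the column with the largest output. The RBF kernel, the `gamma` value, and the names `Y_onehot`, `krr_demo`, and `krr_acc` are choices made for this example only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small stand-in dataset; KernelRidge builds an n_samples x n_samples kernel matrix.
X_demo, y_demo = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# One-hot encode the labels: row i has a 1 in column ytr[i], zeros elsewhere.
Y_onehot = np.eye(10)[ytr]

# Illustrative hyperparameters, not tuned.
krr_demo = KernelRidge(alpha=1.0, kernel='rbf', gamma=0.001)
krr_demo.fit(Xtr, Y_onehot)

# Each output column scores one digit; the predicted class is the highest-scoring column.
pred = krr_demo.predict(Xte).argmax(axis=1)
krr_acc = accuracy_score(yte, pred)
print('test acc:', krr_acc)
```

The kernel matrix grows with the square of the number of training samples, so fitting KernelRidge on all 52,500 MNIST training rows may need a very large amount of memory. Training on a subsample is one way to keep it tractable.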

## Conclusions

1) What is a kernel and why is it important?

2) Can we kernelize all types of data? Why or why not?

3) What are some pros/cons of kernels? (look into runtime)
