In this mini-project, we’ll tackle the same email author identification problem as the Naive Bayes mini-project, but now with an SVM. The results will help clarify some of the practical differences between the two algorithms. This project also gives us more room to experiment with parameters than the Naive Bayes one did, so we will do that too.

#!/usr/bin/python

""" 
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use an SVM to identify emails from the Enron corpus by their authors:
    Sara has label 0
    Chris has label 1
"""

In [1]:
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess

In [None]:
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

When creating the classifier, use a linear kernel (if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). What is the accuracy of the classifier?
Place timing code around the fit and predict functions, like you did in the Naive Bayes mini-project. How do the training and prediction times compare to Naive Bayes?
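
To keep the timing code from being repeated around every call, it can be wrapped in a small helper. This is only a minimal sketch: the name `timed` is hypothetical (it is not part of the course tools), the workload is a stand-in so the snippet runs without the Enron data, and it is written to run under both Python 2 and Python 3.

```python
from time import time


def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the elapsed time, and
    return (result, elapsed_seconds).

    Hypothetical helper for illustration; not part of the course code.
    """
    t0 = time()
    result = func(*args, **kwargs)
    elapsed = time() - t0
    print("{} time: {} s".format(label, round(elapsed, 3)))
    return result, elapsed


# Stand-in workload so the sketch runs on its own.
total, seconds = timed("summing", sum, range(1000000))
```

With the classifier from this project, the calls would look something like `timed("training", clf.fit, features_train, labels_train)` and `pred, _ = timed("predicting", clf.predict, features_test)`.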

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC(kernel="linear")
t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

In [None]:
from sklearn.metrics import accuracy_score
t0 = time()
pred = clf.predict(features_test)
print "predicting time:", round(time()-t0, 3), "s"
print 'accuracy = {}'.format(accuracy_score(labels_test, pred))
#print 'accuracy = {}'.format(clf.score(features_test, labels_test))

One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier. 

features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. What’s the accuracy now?
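
The same slicing can be written as a helper that keeps features and labels aligned and makes the fraction explicit. This is a sketch on toy lists rather than the real email features; `first_fraction` and its `divisor` parameter are hypothetical names. Note that it takes the first rows of the data, not a random sample.

```python
def first_fraction(features, labels, divisor=100):
    """Return the first 1/divisor of features and labels, kept aligned.

    Hypothetical helper; with divisor=100 it mirrors the two slicing
    lines above. Floor division (//) keeps the slice index an integer
    under both Python 2 and Python 3.
    """
    n = len(features) // divisor
    return features[:n], labels[:n]


# Toy stand-ins for the real training arrays.
toy_features = [[i, i + 1] for i in range(1000)]
toy_labels = [i % 2 for i in range(1000)]

small_features, small_labels = first_fraction(toy_features, toy_labels)
print("{} of {} samples kept".format(len(small_features), len(toy_features)))
```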

In [None]:
# Floor division (//) keeps the slice index an integer in Python 2 and 3.
features_train_1 = features_train[:len(features_train)//100]
labels_train_1 = labels_train[:len(labels_train)//100]

In [None]:
clf.fit(features_train_1, labels_train_1)
print 'accuracy = {}'.format(clf.score(features_test, labels_test))

Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?
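
To see what "more complex" means for the kernel, the sketch below compares the two kernels on synthetic data that no straight line can separate (one ring of points inside another). The dataset, the split, and the `gamma=1.0` setting are assumptions made for illustration. It shows only that the RBF kernel can draw a curved boundary; it says nothing about which kernel does better on the email features.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic two-class data: an inner ring surrounded by an outer ring.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

scores = {}
for kernel in ("linear", "rbf"):
    # gamma only affects the RBF kernel; the linear kernel ignores it.
    toy_clf = SVC(kernel=kernel, gamma=1.0)
    toy_clf.fit(X_train, y_train)
    scores[kernel] = toy_clf.score(X_test, y_test)
    print("{} kernel accuracy = {}".format(kernel, scores[kernel]))
```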

In [None]:
clf = SVC(kernel="rbf")
clf.fit(features_train_1, labels_train_1)
print 'accuracy = {}'.format(clf.score(features_test, labels_test))

Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?
Once you've optimized the C value for your RBF kernel, what accuracy does it give? Does this C value correspond to a simpler or more complex decision boundary?
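
C sets how heavily the SVM is penalized for misclassifying training points, so a larger C pushes the boundary to fit the training data more closely. The sketch below illustrates this on noisy synthetic data by printing the training accuracy for a few values of C. The dataset, `gamma=10.0`, and the particular C values are assumptions chosen for illustration, not settings from the course.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Noisy synthetic data, so the two classes overlap.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

train_acc = {}
for C in (0.01, 1.0, 100.0, 10000.0):
    toy_svm = SVC(kernel="rbf", C=C, gamma=10.0)
    toy_svm.fit(X, y)
    # Accuracy on the data the model was trained on, not on held-out data.
    train_acc[C] = toy_svm.score(X, y)
    print("C = {} -> training accuracy = {}".format(C, train_acc[C]))
```

A high training accuracy here is a sign of a more flexible boundary, not necessarily of a better classifier, which is why the quiz above scores each C on the test set.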

In [None]:
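import numpy as np
# ParameterGrid lives in sklearn.model_selection since scikit-learn 0.18;
# fall back to the old sklearn.grid_search location on earlier releases.
try:
    from sklearn.model_selection import ParameterGrid
except ImportError:
    from sklearn.grid_search import ParameterGrid

# C = 1, 10, 100, 1000, 10000
C_range = np.logspace(0, 4, 5)
param_grid = dict(C=C_range)
rbf_svm = SVC(kernel="rbf")
for g in ParameterGrid(param_grid):
    rbf_svm.set_params(**g)
    rbf_svm.fit(features_train_1, labels_train_1)
    print "%s -> accuracy = %s" % (g, rbf_svm.score(features_test, labels_test))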

Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?

In [None]:
clf_full_rbf_10000 = SVC(kernel="rbf", C=10000.0)
clf_full_rbf_10000.fit(features_train, labels_train)
print 'accuracy = {}'.format(clf_full_rbf_10000.score(features_test, labels_test))

What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results, so feel free to use that shortcut here.)
To be clear, the data point numbers given here (10, 26, 50) assume a zero-indexed list, so the correct answer for element #100 would be found using something like answer = predictions[100].
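
The zero-indexing convention can be checked on a toy list. The predictions below are made up for illustration and are not results from the classifier.

```python
# Hypothetical predictions for a 60-element test set (0 = Sara, 1 = Chris).
predictions = [0] * 60
predictions[10] = 1
predictions[50] = 1

names = {0: "Sara", 1: "Chris"}
picked = []
for index in (10, 26, 50):
    # predictions[10] is the 11th item, because counting starts at 0.
    label = predictions[index]
    picked.append(label)
    print("element {} -> {} ({})".format(index, names[label], label))
```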

In [None]:
clf_1_rbf_10000 = SVC(kernel="rbf", C=10000.0)
clf_1_rbf_10000.fit(features_train_1, labels_train_1)
['Sara ({})'.format(p) if p==0 else 'Chris ({})'.format(p) for p in clf_1_rbf_10000.predict(features_test[[10,26,50]])]

There are over 1700 test events. How many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)
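
Counting predictions per class does not require the true labels. A minimal sketch using only the standard library is shown here; the `pred` list is made up for illustration, and in the project it would be the array returned by `predict(features_test)`.

```python
from collections import Counter

# Hypothetical predictions (0 = Sara, 1 = Chris).
pred = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

counts = Counter(pred)
print("Sara (0): {}, Chris (1): {}".format(counts[0], counts[1]))
```

Because the labels are 0 and 1, `sum(pred)` gives the same count for the Chris class.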

In [None]:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(labels_test, clf_full_rbf_10000.predict(features_test))
print 'Chris (1) predictions in test set = %s' % conf_mat[:,1].sum()