# SVM Mini-Project

In this mini-project, we’ll tackle the exact same email author ID problem as the Naive Bayes mini-project, but now with an SVM. What we find will help clarify some of the practical differences between the two algorithms. This project also gives us a chance to play around with parameters a lot more than Naive Bayes did, so we will do that too.

In [1]:
#!/usr/bin/python3

"""
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use an SVM to identify emails from the Enron corpus by their authors:
    Sara has label 0
    Chris has label 1
"""

import sys
from time import time

sys.path.append("../tools/")
from tools.email_preprocess import preprocess

### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

No. of Chris training emails :  7936
No. of Sara training emails :  7884


# SVM Author ID Accuracy

Go to the svm directory to find the starter code (svm/svm_author_id.py).

Import, create, train and make predictions with the sklearn SVC classifier. When creating the classifier, use a linear kernel (if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). What is the accuracy of the classifier?

In [2]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

clf = SVC(kernel="linear")
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy = accuracy_score(labels_test, predictions)
print(accuracy)


0.9840728100113766


# SVM Author ID Timing

Place timing code around the fit and predict functions, like you did in the Naive Bayes mini-project. How do the training and prediction times compare to Naive Bayes?

In [3]:
clf = SVC(kernel="linear")
t0 = time()
clf.fit(features_train, labels_train)
print("training time", round(time() - t0, 3), "s")

t1 = time()
predictions = clf.predict(features_test)
print("prediction time", round(time() - t1, 3), "s")

training time 74.53 s
prediction time 7.649 s
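
The same start/stop timing pattern can be wrapped in a small helper so that every fit and predict call is measured the same way. This is only a sketch: `timed` is a hypothetical helper name, not part of the starter code, and the `sum` call is a stand-in workload so the snippet runs without the email data.

```python
from time import perf_counter


def timed(label, func, *args, **kwargs):
    """Run func(*args, **kwargs), print the elapsed wall-clock time,
    and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    elapsed = perf_counter() - start
    print(f"{label} time {elapsed:.3f} s")
    return result, elapsed


# Stand-in workload (assumption) so the sketch is self-contained
total, seconds = timed("summation", sum, range(1_000_000))
```

With the classifier from this project, the calls would look something like `timed("training", clf.fit, features_train, labels_train)` and `timed("prediction", clf.predict, features_test)`.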


# A Smaller Training Set

One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier.

features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. What’s the accuracy now?

In [4]:
features_train = features_train[:int(len(features_train) / 100)]
labels_train = labels_train[:int(len(labels_train) / 100)]
clf = SVC(kernel="linear")
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy = accuracy_score(labels_test, predictions)
print(accuracy)

0.8845278725824801
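
A slice index has to be an integer, and in Python 3 the `/` operator always returns a float, so the 1% cut needs floor division (`//`) or an explicit `int(...)`. The sketch below uses plain lists as a stand-in for the real feature and label arrays (an assumption made so it runs on its own); the length of 15,820 is the sum of the two training counts printed by `preprocess()`.

```python
# Stand-in for the 15,820 training emails (7,936 Chris + 7,884 Sara)
features = list(range(15820))
labels = [i % 2 for i in range(15820)]

n_keep = len(features) // 100  # floor division keeps the index an int
small_features = features[:n_keep]
small_labels = labels[:n_keep]
print(n_keep, len(small_features), len(small_labels))

# True division yields a float, which Python 3 rejects as a slice index
try:
    features[:len(features) / 100]
except TypeError as err:
    print("TypeError:", err)
```

So the 1% training set holds only 158 emails, which helps explain why training becomes so much faster.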


# Deploy an RBF Kernel

Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?

In [5]:
clf = SVC(kernel="rbf")
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy = accuracy_score(labels_test, predictions)
print(accuracy)

0.8953356086461889


# Optimize C Parameter

Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?

In [6]:
clf = SVC(kernel="rbf", C=10)
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy = accuracy_score(labels_test, predictions)
print(accuracy)

0.8998862343572241


In [7]:
clf = SVC(kernel="rbf", C=100)
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy = accuracy_score(labels_test, predictions)
print(accuracy)

0.8998862343572241


In [8]:
clf = SVC(kernel="rbf", C=1000)
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy = accuracy_score(labels_test, predictions)
print(accuracy)

0.8998862343572241


In [9]:
clf = SVC(kernel="rbf", C=10000)
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy = accuracy_score(labels_test, predictions)
print(accuracy)

0.8998862343572241
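
The four cells above differ only in the value of C, so the sweep can also be written as a loop. This is a minimal sketch, assuming synthetic data from `make_classification` in place of the email features so that it runs on its own; `scores` and `best_C` are names introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): swap in the preprocess() outputs
# to run the sweep on the real email features
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for C in (10.0, 100.0, 1000.0, 10000.0):
    clf = SVC(kernel="rbf", C=C)
    clf.fit(X_train, y_train)
    scores[C] = accuracy_score(y_test, clf.predict(X_test))
    print(f"C={C:g}: accuracy={scores[C]:.4f}")

# On a tie, max() returns the first key it saw, i.e. the smallest C
best_C = max(scores, key=scores.get)
print("best C:", best_C)
```

Keeping the accuracies in a dictionary makes it easy to see at a glance whether C changes anything at all, as in the identical results above.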


# Optimized RBF vs. Linear SVM: Accuracy

Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?

In [10]:
features_train, features_test, labels_train, labels_test = preprocess()
clf = SVC(kernel="rbf", C=10)
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy = accuracy_score(labels_test, predictions)
print(accuracy)

No. of Chris training emails :  7936
No. of Sara training emails :  7884
0.9948805460750854


# Extracting Predictions from an SVM

What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that shortcut here.)

And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer for element #100 would be found using something like answer=predictions[100]

In [11]:
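features_train = features_train[:int(len(features_train) / 100)]
labels_train = labels_train[:int(len(labels_train) / 100)]
# Retrain on the 1% slice with C=10000 so the predictions come from this model
clf = SVC(kernel="rbf", C=10000)
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
element10 = predictions[10]
element26 = predictions[26]
element50 = predictions[50]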

print("element 10:{},element 26:{},element 50:{}".format(element10, element26, element50))

element 10:1,element 26:0,element 50:1


# How Many Chris Emails Predicted?

There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)

In [13]:
features_train, features_test, labels_train, labels_test = preprocess()
clf = SVC(kernel="rbf", C=10000)
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
# Count the predictions equal to 1 (Chris)
print("Chris emails", sum(predictions == 1))

No. of Chris training emails :  7936
No. of Sara training emails :  7884
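
When counting the members of a class, the position of the comparison matters: `sum(predictions == 1)` counts the elements equal to 1, while `sum(predictions) == 1` adds everything up and then checks whether that total equals 1. A small sketch with a made-up prediction array (hypothetical values, not the real test set) shows the difference.

```python
import numpy as np

# Hypothetical predictions: 0 = Sara, 1 = Chris
predictions = np.array([1, 0, 1, 1, 0, 1])

# Element-wise comparison first, then sum: the number of Chris predictions
n_chris = int(np.sum(predictions == 1))

# Sum first, then compare: a single boolean about the total
total_is_one = bool(np.sum(predictions) == 1)

print("Chris emails", n_chris)
print("sum(predictions) == 1 ->", total_is_one)
```

Because the labels are 0 and 1, `int(predictions.sum())` gives the same count here, but the explicit comparison also works for other label encodings.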
