## Lesson 3 SVM Mini Project

This is the code to accompany the Lesson 3 (SVM) mini-project.

In this mini-project, we’ll tackle the same email author ID problem as the Naive Bayes mini-project, but now with an SVM. The results will help clarify some of the practical differences between the two algorithms. An SVM also exposes more parameters to tune than Naive Bayes did, so we will experiment with those too.

### Imports and dataset preprocessing

In [2]:
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess

from sklearn import svm
from sklearn.metrics import accuracy_score


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()



no. of Chris training emails: 7936
no. of Sara training emails: 7884


### L3Q28 / L3Q29 - Using a linear kernel, what is the accuracy of the classifier? Is it faster or slower than Naive Bayes?

In [4]:
clf_linear = svm.SVC(kernel='linear')

t0 = time()
clf_linear.fit(features_train, labels_train)
pred = clf_linear.predict(features_test)
accuracy = accuracy_score(labels_test, pred)  # signature is (y_true, y_pred)
print 'Linear kernel - accuracy ', accuracy
print "Fit and prediction time:", round(time()-t0, 3), "s"

Linear kernel - accuracy  0.984072810011
Fit and prediction time: 162.183 s
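
The cell above times fitting and prediction together, and the Naive Bayes side of the comparison is not shown here. A minimal sketch of how the two could be timed separately and side by side follows. It runs on synthetic data from `make_classification`, not on the email dataset, and it assumes `GaussianNB` as the Naive Bayes variant; the helper name `time_classifier` is hypothetical. The timings it prints say nothing about the email problem itself.

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; NOT the email dataset).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)


def time_classifier(clf):
    """Return (fit_seconds, predict_seconds, accuracy) for one classifier."""
    t0 = time()
    clf.fit(X_train, y_train)
    fit_s = time() - t0

    t0 = time()
    pred = clf.predict(X_test)
    predict_s = time() - t0

    return fit_s, predict_s, accuracy_score(y_test, pred)


timings = {
    "GaussianNB": time_classifier(GaussianNB()),
    "SVC(linear)": time_classifier(SVC(kernel="linear")),
}

for name in sorted(timings):
    fit_s, predict_s, acc = timings[name]
    print("%s: fit %.3f s, predict %.3f s, accuracy %.3f"
          % (name, fit_s, predict_s, acc))
```

Splitting the two timers shows whether training or prediction dominates the total, which a single combined number hides.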


### L3Q30 - Re-do with just 1% of the data. What is the accuracy now?

In [5]:
# Keep only the first 1% of the training set (// is integer division)
features_train_reduced = features_train[:len(features_train) // 100]
labels_train_reduced = labels_train[:len(labels_train) // 100]

t0 = time()
clf_linear.fit(features_train_reduced, labels_train_reduced)
pred = clf_linear.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print 'Linear kernel with 1% of the data - accuracy ', accuracy
print "Fit and prediction time:", round(time()-t0, 3), "s"

Linear kernel with 1% of the data - accuracy  0.884527872582
Fit and prediction time: 0.975 s
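
To make "1% of the data" concrete, here is a small worked example of the slice arithmetic. It assumes the two counts printed by `preprocess()` above (7936 and 7884) together make up the whole training set; the list `features` is a hypothetical stand-in for `features_train`.

```python
# Counts copied from the preprocess() output above.
n_chris = 7936
n_sara = 7884

# Assumption: these two counts cover the entire training set.
n_train = n_chris + n_sara      # 15820
n_reduced = n_train // 100      # 158 (integer division drops the remainder)

# Hypothetical stand-in for features_train, just to show the slice length.
features = list(range(n_train))
reduced = features[:len(features) // 100]

print("training emails: %d, 1%% slice: %d" % (n_train, len(reduced)))
```

Under that assumption the reduced classifier is trained on 158 emails, which explains why the fit finishes in about a second.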


### L3Q32 - Re-do with 1% of the data and an RBF kernel

In [6]:
clf_rbf = svm.SVC(kernel='rbf')

t0 = time()
clf_rbf.fit(features_train_reduced, labels_train_reduced)
pred = clf_rbf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print 'Rbf kernel with 1% of the data - accuracy ', accuracy
print "Fit and prediction time:", round(time()-t0, 3), "s"

Rbf kernel with 1% of the data - accuracy  0.616040955631
Fit and prediction time: 1.134 s


### L3Q33 - Try different values for C

In [7]:
c_values = [1.0, 10.0, 100.0, 1000.0, 10000.0]

for c in c_values:
    clf_rbf = svm.SVC(kernel='rbf', C=c)

    t0 = time()
    clf_rbf.fit(features_train_reduced, labels_train_reduced)
    pred = clf_rbf.predict(features_test)
    accuracy = accuracy_score(labels_test, pred)
    print 'Rbf kernel with 1% of the data and C = ', c, ' - accuracy ', accuracy
    print "Fit and prediction time:", round(time()-t0, 3), "s"

Rbf kernel with 1% of the data and C =  1.0  - accuracy  0.616040955631
Fit and prediction time: 1.16 s
Rbf kernel with 1% of the data and C =  10.0  - accuracy  0.616040955631
Fit and prediction time: 1.134 s
Rbf kernel with 1% of the data and C =  100.0  - accuracy  0.616040955631
Fit and prediction time: 1.134 s
Rbf kernel with 1% of the data and C =  1000.0  - accuracy  0.821387940842
Fit and prediction time: 1.15 s
Rbf kernel with 1% of the data and C =  10000.0  - accuracy  0.892491467577
Fit and prediction time: 0.96 s
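
`C` controls the trade-off between a smooth decision boundary and classifying every training point correctly: a larger `C` penalizes training errors more heavily and allows a more complex boundary. The sketch below repeats the same sweep on synthetic data (again from `make_classification`, not the email dataset) and also records training accuracy and the number of support vectors, so the effect of `C` on the fitted model is visible. The numbers depend on the data and on the scikit-learn version, so they will not match the output above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; NOT the email dataset).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

c_values = [1.0, 10.0, 100.0, 1000.0, 10000.0]
results = []

for c in c_values:
    clf = SVC(kernel="rbf", C=c)
    clf.fit(X_train, y_train)
    results.append((
        c,
        clf.score(X_train, y_train),   # accuracy on the training split
        clf.score(X_test, y_test),     # accuracy on the held-out split
        int(clf.n_support_.sum()),     # total support vectors, both classes
    ))

for c, train_acc, test_acc, n_sv in results:
    print("C=%g: train acc %.3f, test acc %.3f, support vectors %d"
          % (c, train_acc, test_acc, n_sv))
```

Comparing training and test accuracy across the sweep is one way to see when a larger `C` starts fitting the training set more closely than it helps on unseen data.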


### L3Q35 - Using C=10000, try the full training set

In [8]:
clf_rbf = svm.SVC(kernel='rbf', C=10000.0)

t0 = time()
clf_rbf.fit(features_train, labels_train)
pred = clf_rbf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print 'Rbf kernel and C=10000 - accuracy ', accuracy
print "Fit and prediction time:", round(time()-t0, 3), "s" 

Rbf kernel and C=10000 - accuracy  0.990898748578
Fit and prediction time: 109.918 s


### L3Q36 - Predicted classes for test elements 10, 26 and 50 (RBF kernel, C=10000, 1% of the training set)

In [9]:
clf_rbf = svm.SVC(kernel='rbf', C=10000.0)

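# Refit on the 1% subset; results go to pred_reduced so that `pred`
# from the previous cell (full training set) is left untouched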
clf_rbf.fit(features_train_reduced, labels_train_reduced)
pred_reduced = clf_rbf.predict(features_test)
print 'Class values for 10, 26 and 50'
print pred_reduced[10], pred_reduced[26], pred_reduced[50]

Class values for 10, 26 and 50
1 0 1
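
The raw output is easier to read once the class labels are mapped to author names. The sketch below does that for the three values printed above. The heading of the next question states that class 1 is "Chris"; treating class 0 as "Sara" is an assumption based on there being only two authors, and the dictionary names are hypothetical.

```python
# Class 1 = Chris is stated in the notebook; 0 = Sara is an assumption.
label_names = {0: "Sara", 1: "Chris"}

# Predicted classes copied from the output above, keyed by test-set index.
predicted = {10: 1, 26: 0, 50: 1}

names = [label_names[predicted[idx]] for idx in sorted(predicted)]

for idx, name in zip(sorted(predicted), names):
    print("test element %d -> %s" % (idx, name))
```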


### L3Q37 - There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)

In [10]:
# `pred` still holds the predictions from In [8] (RBF kernel, C=10000, full training set)
print 'Prediction length: ', len(pred)
print 'Predictions for Chris(1)', sum(1 for a in pred if a == 1)

Prediction length:  1758
Predictions for Chris(1) 877
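
Counting per-class predictions can also be done with NumPy or `collections.Counter` instead of a list comprehension. The sketch below uses a short hypothetical array, `pred_example`, in place of the real `pred`, and then works out the remaining count from the two totals printed above, assuming every prediction is either 0 or 1.

```python
from collections import Counter

import numpy as np

# Hypothetical predictions standing in for `pred`.
pred_example = np.array([1, 0, 1, 1, 0, 1])

# Two equivalent ways to count class-1 predictions.
n_class_1 = int(np.count_nonzero(pred_example == 1))
counts = Counter(pred_example.tolist())

print("class 1 via NumPy: %d, via Counter: %d" % (n_class_1, counts[1]))

# Totals copied from the output above; with two classes the rest are class 0.
n_total = 1758
n_chris_pred = 877
n_other_pred = n_total - n_chris_pred   # 881

print("predicted Chris: %d, predicted other class: %d"
      % (n_chris_pred, n_other_pred))
```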
