In [1]:
# Supervised learning of Support Vector Machine classifer
'''
1. Higher acuuracy than Naive bayes. 
2. It's too much slower than Naive Bayes classifier.
3. Not suitable for high noisy dataset.

Documentation: https://scikit-learn.org/stable/modules/svm.html  

Kernel: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

If speed is a major consideration (and for many real-time machine learning applications, it certainly is) then you may want to sacrifice a bit of accuracy if it means you can train/predict faster. Which of these are applications where you can imagine a very quick-running algorithm is especially important?

predicting the author of an email
flagging credit card fraud, and blocking a transaction before it goes through
voice recognition, like Si

Hopefully it’s becoming clearer what Sebastian meant when he said Naive Bayes is great for text--it’s faster and generally gives better performance than an SVM for this particular problem. Of course, there are plenty of other problems where an SVM might work better. Knowing which one to try when you’re tackling a problem for the first time is part of the art and science of machine learning. In addition to picking your algorithm, depending on which one you try, there are parameter tunes to worry about as well, and the possibility of overfitting (especially if you don’t have lots of training data).

Our general suggestion is to try a few different algorithms for each problem. Tuning the parameters can be a lot of work, but just sit tight for now--toward the end of the class we will introduce you to GridCV, a great sklearn tool that can find an optimal parameter tune almost automatically.
'''



"\n1. Higher acuuracy than Naive bayes. \n2. It's too much slower than Naive Bayes classifier.\n3. Not suitable for high noisy dataset.\n\nDocumentation: https://scikit-learn.org/stable/modules/svm.html  \n\nKernel: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html\n\nIf speed is a major consideration (and for many real-time machine learning applications, it certainly is) then you may want to sacrifice a bit of accuracy if it means you can train/predict faster. Which of these are applications where you can imagine a very quick-running algorithm is especially important?\n\npredicting the author of an email\nflagging credit card fraud, and blocking a transaction before it goes through\nvoice recognition, like Si\n"

In [2]:
""" 
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use a SVM to identify emails from the Enron corpus by their authors:    
    Sara has label 0
    Chris has label 1
"""

import os   

import sys
from time import time

# print(os.getcwd())

sys.path.append("../tools/")
# os.chdir("tools")
from email_preprocess import preprocess

In [3]:
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [4]:
from sklearn import svm
from sklearn.metrics import accuracy_score

In [5]:
#########################################################
clf = svm.SVC(kernel='linear')

t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time()-t0, 3), "s")

t0 = time()
pred = clf.predict(features_test)
print("testing time:", round(time()-t0, 3), "s")

accuracy = accuracy_score(labels_test, pred)
print(accuracy)
#########################################################

training time: 109.604 s
testing time: 11.404 s
0.9840728100113766


In [6]:
'''One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier.

features_train = features_train[:len(features_train)/100]
labels_train = labels_train[:len(labels_train)/100]

'''

# Down dataset to 1% of orginal dataset.
features_train = features_train[:int(len(features_train)/100)]
labels_train = labels_train[:int(len(labels_train)/100)]

t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time()-t0, 3), "s")

t0 = time()
pred = clf.predict(features_test)
print("testing time:", round(time()-t0, 3), "s")

accuracy = accuracy_score(labels_test, pred)
print(accuracy)


training time: 0.113 s
testing time: 0.659 s
0.8845278725824801


In [7]:
'''
Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?
'''

clf = svm.SVC(kernel='rbf')


t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time()-t0, 3), "s")

t0 = time()
pred = clf.predict(features_test)
print("testing time:", round(time()-t0, 3), "s")

accuracy = accuracy_score(labels_test, pred)
print(accuracy)

training time: 0.24 s
testing time: 1.306 s
0.8953356086461889


In [8]:
'''Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?'''

C_VALUES = [10.0, 100.0, 1000.0, 10000.0]
BEST_C_VALUE = 0 
for C_VALUE in C_VALUES:
    clf = svm.SVC(kernel='rbf', C=C_VALUE)
    clf.fit(features_train, labels_train).predict(features_test)
    accuracy = accuracy_score(labels_test, pred)
    print(f'Accuracy of C {C_VALUE} : {accuracy}')
    if BEST_C_VALUE < C_VALUE:
        BEST_C_VALUE = C_VALUE

print(f'Best C value {BEST_C_VALUE}')

Accuracy of C 10.0 : 0.8953356086461889
Accuracy of C 100.0 : 0.8953356086461889
Accuracy of C 1000.0 : 0.8953356086461889
Accuracy of C 10000.0 : 0.8953356086461889
Best C value 10000.0


In [9]:
'''Once you've optimized the C value for your RBF kernel, what accuracy does it give? Does this C value correspond to a simpler or more complex decision boundary?

(If you're not sure about the complexity, go back a few videos to the "SVM C Parameter" part of the lesson. The result that you found there is also applicable here, even though it's now much harder or even impossible to draw the decision boundary in a simple scatterplot.)
'''

# Less C value more complex
# C=1 -> more complex
clf = svm.SVC(kernel='rbf',C=1)


clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print(accuracy)

0.8953356086461889


In [10]:
'''Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?'''

#### Full dataset
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

clf = svm.SVC(kernel='rbf', C=10000.0) 

t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time()-t0, 3), "s")

t0 = time()
pred = clf.predict(features_test)
print("testing time:", round(time()-t0, 3), "s")

accuracy = accuracy_score(labels_test, pred)
print(accuracy)
#########################################################

no. of Chris training emails: 7936
no. of Sara training emails: 7884
training time: 117.092 s
testing time: 19.689 s
0.9960182025028441


In [11]:
'''What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that shortcut here.)

And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer for element #100 would be found using something like answer=predictions[100]


Just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer for element #100 would be found using something like answer=predictions[100]'''

# Down dataset to 1% of orginal dataset.
features_train = features_train[:int(len(features_train)/100)]
labels_train = labels_train[:int(len(labels_train)/100)]


clf = svm.SVC(kernel='rbf', C=10000.0) 

t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time()-t0, 3), "s")

t0 = time()
pred = clf.predict(features_test)
print("testing time:", round(time()-t0, 3), "s")

accuracy = accuracy_score(labels_test, pred)
print(accuracy)

selective_indexes = (10, 26, 50)
for seleted_index in selective_indexes:
    print(f"Index = {seleted_index} : prediction = {pred[seleted_index]}")
#########################################################

training time: 0.109 s
testing time: 1.188 s
0.8998862343572241
Index = 10 : prediction = 1
Index = 26 : prediction = 0
Index = 50 : prediction = 1


In [14]:
'''There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)'''

# Full dataset
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

clf = svm.SVC(kernel='rbf', C=10000.0) 


clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

Class_1_Chris = 0
for i in pred:
    if i == 1:
        Class_1_Chris += 1

print(f'{Class_1_Chris} predicted to be in the "Chris"(1) class')

no. of Chris training emails: 7936
no. of Sara training emails: 7884
866 predicted to be in the "Chris"(1) class
