# Application: Support Vector Machines Part 2

#### References

1. **Support-vector machine:** https://en.wikipedia.org/wiki/Support-vector_machine
2. **Support Vector Machines**: https://scikit-learn.org/stable/modules/svm.html


In this notebook we will use SVMs to classify emails, just as we did in the Naive Bayes application notebook.

In [48]:

import os
import sys
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score

# Need these so that Jupyter can find the datasets and the various utilities 
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

In [15]:
from utilities.email_preprocess import preprocess

In [16]:
# preprocess the data
features_train, features_test, labels_train, labels_test = preprocess(words_file ="data/machine_learning/word_data.pkl",
                                                                     authors_file="data/machine_learning/email_authors.pkl")

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [17]:
# create our SVM classifier
clf_linear = svm.SVC(kernel='linear')

In [18]:

# fit the classifier on the training data
clf_linear.fit(features_train,labels_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [12]:
# predict with the linear classifier
pred_linear = clf_linear.predict(features_test)

In [13]:
# compute accuracy
accuracy = accuracy_score(labels_test, pred_linear)
print("Linear SVM accuracy is ", accuracy)

Linear SVM accuracy is  0.984072810011
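
To see how long training and prediction take, each call can be wrapped with a timer. The following is a minimal sketch that uses `time.perf_counter` and a synthetic dataset from `make_classification` as a stand-in for the email features (an assumption made so the example runs on its own); the timings it prints depend on the machine.

```python
import time

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the email features (an assumption, not the notebook's data)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = svm.SVC(kernel='linear')

# time the training step
start = time.perf_counter()
clf.fit(X_train, y_train)
fit_seconds = time.perf_counter() - start

# time the prediction step
start = time.perf_counter()
pred = clf.predict(X_test)
predict_seconds = time.perf_counter() - start

accuracy = accuracy_score(y_test, pred)
print("fit: %.3f s, predict: %.3f s, accuracy: %.3f"
      % (fit_seconds, predict_seconds, accuracy))
```

The same pattern should carry over to `clf_linear.fit(features_train, labels_train)` and `clf_linear.predict(features_test)` in the cells above.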


### Speeding up the training process

One way to speed up training is to use a smaller training set. The tradeoff is that accuracy usually drops. Let's explore this by slicing the training data and keeping only about 1% of the original examples.

In [22]:

features_train = features_train[:int(len(features_train)/100)]
labels_train = labels_train[:int(len(labels_train)/100)]

print("Size of training examples: ", len(features_train))

Size of training examples:  158
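
The slice above keeps the first 1% of the rows, which is a fair sample only if the rows are already in random order; whether `preprocess` shuffles them is not shown here. As an alternative, a random subsample that preserves the class balance could be drawn with `train_test_split`. This is a sketch on hypothetical stand-in arrays, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays: 1000 samples, 5 features, balanced 0/1 labels
rng = np.random.RandomState(0)
features = rng.rand(1000, 5)
labels = np.array([0, 1] * 500)

# keep 1% of the rows, sampled at random and stratified by label
features_small, _, labels_small, _ = train_test_split(
    features, labels, train_size=0.01, stratify=labels, random_state=0)

print("Size of subsample:", len(features_small))
print("Class counts:", np.bincount(labels_small))
```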


In [23]:
# create our SVM classifier
clf_linear = svm.SVC(kernel='linear')

In [24]:
# fit the classifier on the reduced training data
clf_linear.fit(features_train,labels_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [25]:
# predict with the linear classifier
pred_linear = clf_linear.predict(features_test)

In [26]:
# compute accuracy
accuracy = accuracy_score(labels_test, pred_linear)
print("Linear SVM accuracy is ", accuracy)

Linear SVM accuracy is  0.884527872582
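
The effect of training-set size on accuracy can also be checked at several sizes instead of just two. The sketch below does this on synthetic data (an assumption, so it runs without the email files); the fractions chosen are illustrative, and the numbers it prints will not match the email results.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the notebook's email features)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = []
for fraction in [0.05, 0.25, 1.0]:
    # keep only the first `fraction` of the (already shuffled) training rows
    n = int(len(X_train) * fraction)
    clf = svm.SVC(kernel='linear')
    clf.fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, clf.predict(X_test))
    results.append((n, acc))
    print("training examples: %d, accuracy: %.3f" % (n, acc))
```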


### Use an RBF kernel

In [27]:
# create an SVM classifier with an RBF kernel; 'rbf' is the default, so the kernel argument could be omitted
clf_linear = svm.SVC(kernel='rbf')

In [28]:
# fit the classifier on the reduced training data
clf_linear.fit(features_train,labels_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [29]:
# predict with the RBF classifier
pred_linear = clf_linear.predict(features_test)

In [30]:
# compute accuracy
accuracy = accuracy_score(labels_test, pred_linear)
print("RBF SVM accuracy is ", accuracy)

RBF SVM accuracy is  0.616040955631
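
For reference, the RBF kernel scores the similarity of two points as $\exp(-\gamma \lVert x - z \rVert^2)$. The following NumPy sketch computes it directly; the helper name `rbf_kernel_value`, the two vectors, and the value of `gamma` are made up for illustration.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """Return exp(-gamma * ||x - z||^2) for two 1-D vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])

# identical points have similarity 1
same = rbf_kernel_value(x, x, gamma=0.5)
# squared distance between x and z is 2, so the value is exp(-1)
different = rbf_kernel_value(x, z, gamma=0.5)
print(same, different)
```

Similarity decays with distance, and a larger `gamma` makes it decay faster.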


The RBF kernel does not perform well on the reduced dataset. Let's try some larger values of the $C$ parameter.

In [31]:
c_opts = [10.0, 100.0, 1000.0, 10000.0]

for c in c_opts:
    clf_linear = svm.SVC(kernel='rbf', C=c)
    clf_linear.fit(features_train,labels_train)
    pred_linear = clf_linear.predict(features_test)
    accuracy = accuracy_score(labels_test, pred_linear)
    print("RBF SVM with C %s has accuracy %s "%(c,accuracy))



RBF SVM with C 10.0 has accuracy 0.616040955631 
RBF SVM with C 100.0 has accuracy 0.616040955631 
RBF SVM with C 1000.0 has accuracy 0.821387940842 
RBF SVM with C 10000.0 has accuracy 0.892491467577 


In this experiment, increasing the value of $C$ increased the test accuracy. Recall that this parameter controls the tradeoff between a smooth decision boundary and classifying training points correctly. With a large value of $C$, more training points are classified correctly, so the decision boundary becomes more wiggly.
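
The "classifying training points correctly" side of this tradeoff can be observed by scoring a model on its own training data for several values of $C$. This is a minimal sketch on noisy synthetic data (an assumption); `gamma='scale'` and the three $C$ values are illustrative choices, not the notebook's settings.

```python
from sklearn import svm
from sklearn.datasets import make_classification

# noisy synthetic data: flip_y assigns random labels to about 10% of the points
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1,
                           random_state=0)

results = {}
for c in [0.1, 10.0, 1000.0]:
    clf = svm.SVC(kernel='rbf', C=c, gamma='scale')
    clf.fit(X, y)
    # accuracy on the training set itself, not on held-out data
    train_accuracy = clf.score(X, y)
    results[c] = train_accuracy
    print("C=%s training accuracy=%.3f support vectors=%d"
          % (c, train_accuracy, clf.n_support_.sum()))
```

A high training accuracy at large $C$ does not by itself mean better test accuracy, since a very wiggly boundary can also fit noise.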

### Train on the full dataset

In [32]:
# preprocess the data
features_train, features_test, labels_train, labels_test = preprocess(words_file ="data/machine_learning/word_data.pkl",
                                                                     authors_file="data/machine_learning/email_authors.pkl")

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [33]:
# train an RBF SVM on the full training set with the largest C tried above
c = 10000.0
clf_linear = svm.SVC(kernel='rbf', C=c)
clf_linear.fit(features_train, labels_train)
pred_linear = clf_linear.predict(features_test)
accuracy = accuracy_score(labels_test, pred_linear)
print("RBF SVM with C %s has accuracy %s "%(c,accuracy))



RBF SVM with C 10000.0 has accuracy 0.990898748578 
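
The value of $C$ used here was picked by comparing accuracies on the test set. A common alternative is to choose it by cross-validation on the training data only and keep the test set for the final check. The sketch below uses `GridSearchCV` on synthetic data (an assumption); the grid mirrors the values tried above, while `cv=5` and `gamma='scale'` are illustrative choices.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the notebook's email features)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# candidate values of C, evaluated with 5-fold cross-validation
param_grid = {'C': [10.0, 100.0, 1000.0, 10000.0]}
search = GridSearchCV(svm.SVC(kernel='rbf', gamma='scale'), param_grid, cv=5)
search.fit(X_train, y_train)

print("best C:", search.best_params_['C'])
print("cross-validated accuracy: %.3f" % search.best_score_)
# the test set is used once, after C has been chosen
print("test accuracy: %.3f" % search.score(X_test, y_test))
```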


### Extract Predictions

In [34]:
features_train = features_train[:int(len(features_train)/100)]
labels_train = labels_train[:int(len(labels_train)/100)]

In [38]:
# create an RBF SVM classifier with C=10000.0
clf_rbf = svm.SVC(kernel='rbf', C=10000.0)

In [39]:
# fit the classifier on the reduced training data
clf_rbf.fit(features_train,labels_train)



SVC(C=10000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [40]:
# predict with the RBF classifier trained on the reduced dataset
pred_rbf = clf_rbf.predict(features_test)

In [41]:
print("For element 10: ", pred_rbf[10])
print("For element 26: ", pred_rbf[26])
print("For element 50: ", pred_rbf[50])

For element 10:  1
For element 26:  0
For element 50:  1


In [42]:
# preprocess the data
features_train, features_test, labels_train, labels_test = preprocess(words_file ="data/machine_learning/word_data.pkl",
                                                                     authors_file="data/machine_learning/email_authors.pkl")

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [43]:
# create an RBF SVM classifier with C=10000.0
clf_rbf = svm.SVC(kernel='rbf', C=10000.0)

In [44]:
# fit the classifier on the full training data
clf_rbf.fit(features_train,labels_train)



SVC(C=10000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [49]:
# predict with the RBF classifier and count the predictions per class
pred_rbf = clf_rbf.predict(features_test)
unique, counts = np.unique(pred_rbf, return_counts=True)
print("Predicted classes and counts: ", unique, counts)

Predicted classes and counts:  [0 1] [ 740 1018]
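
To read off the count for a single class, the two arrays returned by `np.unique` can be zipped into a dictionary. This is a small sketch on a made-up prediction array, not the notebook's output.

```python
import numpy as np

# made-up predictions for illustration
pred = np.array([1, 0, 1, 1, 0, 1])

unique, counts = np.unique(pred, return_counts=True)
# cast to plain Python ints so the dictionary prints cleanly
counts_by_label = {int(label): int(count) for label, count in zip(unique, counts)}

print(counts_by_label)                      # {0: 2, 1: 4}
print("Count of 1s:", counts_by_label[1])   # Count of 1s: 4
```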
