## Spam Classification with SVC and sklearn
### Assignment week 10/11 - jdeblase

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  http://archive.ics.uci.edu/ml/datasets/Spambase

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

In [1]:
import pandas as pd

from sklearn import svm
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

For this assignment I will use the Spambase dataset linked above to predict whether emails are spam or not. The classifier is the Support Vector Classifier (SVC) class found in sklearn's svm module, along with several utility functions and classes found in the framework.

The first step is to load the dataset and separate the response variable from the predictor variables. All the predictor variables are already numeric, stored as continuous real or integer values. The response variable is nominal: 1 for spam, 0 for non-spam. More information on the dataset can be found <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names">here</a>.



In [2]:
## load data
# .values returns the underlying NumPy array (as_matrix() was removed in pandas 1.0)
all_data = pd.read_csv("spambase/spambase.data", header=None).values
x = all_data[:,:-1]
y = all_data[:,-1]

The next step is to split the data into training and test sets. This can be done manually, but here we use sklearn's train_test_split() function to handle shuffling and randomizing the samples. The test set is set to 20% of the samples (test_size=0.2).

In [3]:
# split data into test and train arrays for sklearn
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
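For comparison, the manual approach mentioned above can be sketched in plain Python. This is only an illustration: the helper name `manual_train_test_split`, the `seed` argument, and the toy data are hypothetical, and sklearn's own function may round the test-set size differently.

```python
import random

def manual_train_test_split(rows, labels, test_size=0.2, seed=0):
    """Shuffle the sample indices, then hold out a `test_size` fraction as the test set."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data (hypothetical): 10 samples with a single feature each.
toy_x = [[i] for i in range(10)]
toy_y = [i % 2 for i in range(10)]
x_tr, x_te, y_tr, y_te = manual_train_test_split(toy_x, toy_y, test_size=0.2)
print("train: %d, test: %d" % (len(x_tr), len(x_te)))
```

Shuffling indices rather than the rows themselves keeps each feature row paired with its label.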


A basic estimator is set using the SVC class. We will leave the kernel function to its default setting of 'rbf', and leave 'C' and 'gamma' at their default settings as well. 
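To make the role of 'gamma' more concrete, here is a minimal sketch of the RBF kernel, K(a, b) = exp(-gamma * ||a - b||^2), in plain Python. It is not sklearn's implementation; the function name `rbf_kernel` and the two sample points are made up for illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF similarity between two points: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Two hypothetical points whose squared distance is 3^2 + 4^2 = 25.
u = [0.0, 0.0]
v = [3.0, 4.0]

# Smaller gamma means similarity decays more slowly with distance.
for gamma in (0.01, 0.001, 0.0001):
    print("gamma=%g  K(u, v)=%.4f" % (gamma, rbf_kernel(u, v, gamma)))
```

Because the kernel depends on raw distances, features with large ranges can dominate the similarity, which is one reason 'gamma' is worth tuning.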

After we train the model, we will use it to predict the labels of the test set and evaluate its performance with sklearn's confusion_matrix and classification_report.

In [4]:
# basic fit and test
svm_est = svm.SVC()
svm_est.fit(x_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [5]:
pred = svm_est.predict(x_test)

In [6]:
confusion_matrix(y_test,pred)

array([[471,  91],
       [ 77, 282]])

In [7]:
print(classification_report(y_test, pred))

             precision    recall  f1-score   support

        0.0       0.86      0.84      0.85       562
        1.0       0.76      0.79      0.77       359

avg / total       0.82      0.82      0.82       921



We can see from the confusion matrix that our basic model correctly classified 753 of the 921 test emails, with 91 false positives (Type I errors: ham flagged as spam) and 77 false negatives (Type II errors: spam classified as ham). The weighted-average precision and recall were both 0.82, so about 82% of the test emails were labeled correctly; on the spam class alone, precision was 0.76 and recall was 0.79.
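The numbers can be checked by hand from the confusion matrix. sklearn's confusion_matrix puts true classes in rows and predicted classes in columns, so with labels 0 (ham) and 1 (spam) the layout is [[TN, FP], [FN, TP]]. The sketch below, with a hypothetical helper named `binary_metrics`, recomputes the spam-class metrics from the matrix printed above.

```python
def binary_metrics(matrix):
    """Unpack a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        'false_positives': fp,                 # ham predicted as spam
        'false_negatives': fn,                 # spam predicted as ham
        'precision': tp / float(tp + fp),      # of predicted spam, how much is spam
        'recall': tp / float(tp + fn),         # of actual spam, how much was caught
        'accuracy': (tp + tn) / float(total),  # share of all emails labeled correctly
    }

# Confusion matrix of the untuned model, copied from the output above.
baseline = binary_metrics([[471, 91], [77, 282]])
for name in sorted(baseline):
    print("%s: %s" % (name, round(baseline[name], 2)))
```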

We can improve the model even further by using sklearn's cross validation iterator and Grid Search to exhaustively tune the hyper-parameters of 'C' and 'gamma'. The following code references material from sklearn's <a href="http://scikit-learn.org/stable/modules/cross_validation.html"> documentation </a> on cross validation techniques.


In [8]:
# Using CV and fine tuning on hyper-parameters to improve model

# first rebuild training and test sets
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=0)

# create an iterator for CV
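# cross_validation and grid_search were merged into model_selection (removed in scikit-learn 0.20)
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

# the splitter no longer takes the sample count; n_iter was renamed n_splits
cv_iter = ShuffleSplit(n_splits=5, test_size=0.2)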
# cross validate with increasing 'C' and decreasing 'gamma' levels
params = [{'C':[1,10,100,1000,10000],'gamma':[0.01,0.001,0.0001,0.00001]}] 

clf = GridSearchCV(estimator=svm_est, cv=cv_iter, param_grid=params)
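To get a feel for what "exhaustive" means here, the sketch below enumerates the grid above with itertools and counts the model fits implied by 5 cross-validation splits. It only mimics the bookkeeping, not GridSearchCV itself; the variable names are illustrative.

```python
from itertools import product

# Same grid and number of splits as in the cell above.
param_grid = {'C': [1, 10, 100, 1000, 10000],
              'gamma': [0.01, 0.001, 0.0001, 0.00001]}
n_splits = 5

# Every (C, gamma) pair is one candidate model.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

# Each candidate is trained once per cross-validation split.
total_fits = len(candidates) * n_splits
print("%d candidates, %d fits" % (len(candidates), total_fits))
```

With the default settings, GridSearchCV also refits the best candidate once on the full training set, so the search cost grows quickly as values are added to either list.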

We now fit the grid search on the training set and use the parameters of its best estimator (clf.best_estimator_) to create a new SVC model with fine-tuned parameters.

In [9]:
clf.fit(x_train, y_train)
clf.best_estimator_

SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [10]:
new_svm_est = svm.SVC(C=clf.best_estimator_.C,gamma=clf.best_estimator_.gamma)

In [11]:
new_svm_est.fit(x_train,y_train)

SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

We now make predictions with the tuned model.

In [12]:
pred = new_svm_est.predict(x_test)

In [14]:
print(confusion_matrix(y_test, pred))

[[509  29]
 [ 51 332]]


In [15]:
print(classification_report(y_test, pred))

             precision    recall  f1-score   support

        0.0       0.91      0.95      0.93       538
        1.0       0.92      0.87      0.89       383

avg / total       0.91      0.91      0.91       921



We can see that the tuned model performs considerably better. It now classifies 841 of the 921 test emails correctly, with 29 false positives (ham flagged as spam) and 51 false negatives (spam classified as ham). That is an overall accuracy of about 91%, with a precision of 0.92 and a recall of 0.87 on the spam class.
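The "avg / total" row of the classification report is a support-weighted average of the per-class scores. As a rough check, the sketch below rebuilds the per-class rows and that average from the tuned model's confusion matrix. The helper names `per_class_report` and `weighted_average` are hypothetical, and the code assumes every class is predicted at least once.

```python
def per_class_report(matrix):
    """Return {label: (precision, recall, f1, support)} for a square confusion matrix
    with true classes in rows and predicted classes in columns."""
    n = len(matrix)
    report = {}
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        support = sum(matrix[k])                           # row sum
        precision = tp / float(predicted_k)
        recall = tp / float(support)
        f1 = 2 * precision * recall / (precision + recall)
        report[k] = (precision, recall, f1, support)
    return report

def weighted_average(report):
    """Average precision, recall and f1 across classes, weighted by support."""
    total = float(sum(row[3] for row in report.values()))
    return tuple(sum(row[m] * row[3] for row in report.values()) / total
                 for m in range(3))

# Confusion matrix of the tuned model, copied from the output above.
tuned = per_class_report([[509, 29], [51, 332]])
avg = weighted_average(tuned)
print([round(v, 2) for v in avg])
```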