# Data620 Week 10 Assignment

## Justin Hink

Step 1, load in our data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Import spam dataset
spam = pd.read_csv("spambase.csv")

spam_count = len(spam[spam.spamclass == 1])
ham_count = len(spam[spam.spamclass == 0])

print("Spam: %d" % spam_count)
print("Not spam: %d" % ham_count)

Spam: 1813
Not spam: 2788


Step 2, split data into training, test and eval data sets

In [2]:
prop_train = 0.7
prop_val = 0.15
prop_test = 0.15
total_count = len(spam)
trainNum = int(prop_train * total_count)
evalNum = int(prop_val * total_count)
testNum = total_count - trainNum - evalNum

# Hold out the test set first, then carve the eval set out of the remaining rows
trainSet, testSet = train_test_split(spam, test_size=testNum, random_state=9)
trainSet, evalSet = train_test_split(trainSet, test_size=evalNum, random_state=99)

print("Training set count: %d" % len(trainSet))
print("Eval set count: %d" % len(evalSet))
print("Testing set count: %d" % len(testSet))

Training set count: 3220
Eval set count: 690
Testing set count: 691
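
The set sizes above follow from the truncating arithmetic in the cell. A minimal sketch of that arithmetic, assuming the 4601 rows implied by the Step 1 counts (1813 + 2788). `split_sizes` is a hypothetical helper used only for illustration and is not part of the notebook:

```python
def split_sizes(total_count, prop_train, prop_val):
    """Return (train, eval, test) row counts for the two-step split."""
    train_num = int(prop_train * total_count)  # int() truncates toward zero
    eval_num = int(prop_val * total_count)
    # The test set absorbs the rows lost to truncation
    test_num = total_count - train_num - eval_num
    return train_num, eval_num, test_num


sizes = split_sizes(1813 + 2788, 0.7, 0.15)
print("train=%d eval=%d test=%d" % sizes)
```

Because the test size is computed by subtraction, the three sets always add up to the full row count, which is why the test set ends up one row larger than the eval set.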


## Analysis

### Method 1 - Random Forest 

In [4]:
from sklearn import ensemble
import sklearn.metrics as sm

# utility method to print model metrics
def print_model_metrics(actual, pred):
    # labels=[1, 0] orders both rows (actual) and columns (predicted) as spam, ham
    cm = sm.confusion_matrix(actual, pred, labels=[1, 0])

    print("")
    print("true positives: %d" % cm[0, 0])
    print("false positives: %d" % cm[1, 0])
    print("true negatives: %d" % cm[1, 1])
    print("false negatives: %d" % cm[0, 1])
    print("")
    print(sm.classification_report(actual, pred, labels=[1, 0], target_names=["Spam", "Ham"]))

yTest = testSet['spamclass']
xTest = testSet.drop(labels='spamclass', axis=1)
yTrain = trainSet['spamclass']
xTrain = trainSet.drop(labels='spamclass', axis=1)

rf = ensemble.RandomForestClassifier(criterion="entropy", random_state=99)
m1 = rf.fit(xTrain, yTrain)

# check our in sample model efficacy
train1 = m1.predict(xTrain)
print_model_metrics(yTrain, train1)

# check out of sample model efficacy
test1 = m1.predict(xTest)
print_model_metrics(yTest, test1)


true positives: 1263
false positives: 0
true negatives: 1946
false negatives: 11

             precision    recall  f1-score   support

       Spam       1.00      0.99      1.00      1274
        Ham       0.99      1.00      1.00      1946

avg / total       1.00      1.00      1.00      3220


true positives: 238
false positives: 13
true negatives: 414
false negatives: 26

             precision    recall  f1-score   support

       Spam       0.95      0.90      0.92       264
        Ham       0.94      0.97      0.96       427

avg / total       0.94      0.94      0.94       691
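
The precision, recall and f1-score columns in the report can be recomputed from the four confusion counts printed above it. A short sketch using the out of sample random forest counts; `class_metrics` is a hypothetical helper, not a scikit-learn function:

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Out of sample random forest counts reported above (spam is the positive class)
tp, fp, tn, fn = 238, 13, 414, 26

spam_scores = class_metrics(tp, fp, fn)
# For the ham row the roles swap: true negatives are the hits,
# false negatives are ham's false alarms, false positives are ham's misses
ham_scores = class_metrics(tn, fn, fp)

print("Spam: precision %.2f, recall %.2f, f1 %.2f" % spam_scores)
print("Ham:  precision %.2f, recall %.2f, f1 %.2f" % ham_scores)
```

Rounded to two decimals, these values agree with the Spam and Ham rows of the report.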



### Method 2 - SVM 

In [5]:
from sklearn import svm

sv = svm.SVC(random_state=99)
m2 = sv.fit(xTrain, yTrain)

# check in sample model efficacy only; the SVM is not scored on the test set here
sv_train = m2.predict(xTrain)
print_model_metrics(yTrain, sv_train)


true positives: 1139
false positives: 42
true negatives: 1904
false negatives: 135

             precision    recall  f1-score   support

       Spam       0.96      0.89      0.93      1274
        Ham       0.93      0.98      0.96      1946

avg / total       0.95      0.95      0.94      3220
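
The SVM output above is in sample only. A hedged sketch of how the same model type could be scored on both a training and a held-out set so that its false positives can be compared like for like. It runs on synthetic data from `make_classification`, which is an assumption made to keep the example self-contained, so the counts it prints say nothing about the spam data:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: NOT the spambase set, so the counts are illustrative only
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = svm.SVC(random_state=99).fit(X_train, y_train)

results = {}
for name, features, labels in [("in sample", X_train, y_train),
                               ("out of sample", X_test, y_test)]:
    # labels=[1, 0] keeps the notebook's layout: row = actual, column = predicted
    cm = confusion_matrix(labels, model.predict(features), labels=[1, 0])
    results[name] = cm
    print("%s: false positives %d, false negatives %d" % (name, cm[1, 0], cm[0, 1]))
```

In the notebook itself, the equivalent step would be to pass `m2.predict(xTest)` and `yTest` to `print_model_metrics`.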



## Conclusions

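Both the random forest and SVM methods classified the spam with a high degree of effectiveness. Because of the application (email spam), the deciding criterion is the number of false positives (i.e., non-spam emails that are flagged as spam): we are more sensitive to improperly blocking good emails than to falsely allowing spam through the filter. On that criterion, the output above does not support preferring the SVM. In sample it produced 42 false positives against 0 for the random forest, and it was not scored on the test set, where the random forest produced 13. Choosing the SVM would require it to show fewer than 13 false positives on that same test set.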

Also of minor note: the random forest performed much better when evaluated in sample (average f1-score of 1.00) than on the test set (0.94), as expected.
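
Since the training and test sets differ in size, the false positive criterion is easier to compare as a rate than as a raw count. A small sketch using the counts printed in the outputs above; `false_positive_rate` is a hypothetical helper:

```python
def false_positive_rate(fp, tn):
    """Share of legitimate (ham) emails that were flagged as spam."""
    return fp / float(fp + tn)


# (name, false positives, true negatives) copied from the outputs above
runs = [
    ("random forest, in sample", 0, 1946),
    ("random forest, out of sample", 13, 414),
    ("SVM, in sample", 42, 1904),
]

for name, fp, tn in runs:
    rate = 100 * false_positive_rate(fp, tn)
    print("%-30s %.2f%% of ham flagged" % (name, rate))
```

Only rates measured on the same set are directly comparable, so an in sample rate for one model should not be weighed against an out of sample rate for the other.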