## Max Wagner
### Data 620 - Week 10 - Document Classification 
***

Read in the data, and type out the 58 column names by hand.

In [1]:
import nltk
import pandas as pd
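# train_test_split lives in sklearn.model_selection since scikit-learn 0.18;
# the old sklearn.cross_validation module was removed in 0.20.
from sklearn.model_selection import train_test_split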
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn import svm
%matplotlib inline

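# Note: spambase.data has no header row. Without header=None, pandas uses the first
# record as column names, so 4600 of the file's 4601 records end up in the frame.
spam = pd.read_csv("spambase/spambase.data")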
spam.columns = ['word_freq_make','word_freq_address','word_freq_all','word_freq_3d','word_freq_our','word_freq_over',
                'word_freq_remove','word_freq_internet','word_freq_order','word_freq_mail','word_freq_receive','word_freq_will',
                'word_freq_people','word_freq_report','word_freq_addresses','word_freq_free','word_freq_business',
                'word_freq_email','word_freq_you','word_freq_credit','word_freq_your','word_freq_font','word_freq_000',
                'word_freq_money','word_freq_hp','word_freq_hpl','word_freq_george','word_freq_650','word_freq_lab',
                'word_freq_labs','word_freq_telnet','word_freq_857','word_freq_data','word_freq_415','word_freq_85',
                'word_freq_technology','word_freq_1999','word_freq_parts','word_freq_pm','word_freq_direct',
                'word_freq_cs','word_freq_meeting','word_freq_original','word_freq_project','word_freq_re',
                'word_freq_edu','word_freq_table','word_freq_conference','char_freq_;','char_freq_(','char_freq_[',
                'char_freq_!','char_freq_$','char_freq_#','capital_run_length_average','capital_run_length_longest',
                'capital_run_length_total','is_spam'] 
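
One detail worth checking here: `pd.read_csv` treats the first line as a header unless told otherwise, so on a headerless file the first record is consumed as column names. A minimal sketch of the difference, using a tiny in-memory stand-in for the file (the values and the shortened column list are made up for illustration):

```python
import io
import pandas as pd

# Tiny stand-in for a headerless CSV such as spambase.data (values are made up).
raw = "0.1,0.0,1\n0.0,0.2,0\n0.3,0.1,1\n"
cols = ["word_freq_make", "word_freq_address", "is_spam"]  # shortened, illustrative list

# Default behaviour: the first data row is consumed as the header.
lossy = pd.read_csv(io.StringIO(raw))
lossy.columns = cols

# header=None keeps every row; names= assigns the labels in the same call.
full = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(lossy), len(full))
```

With `header=None, names=...` the separate `spam.columns = [...]` assignment is no longer needed.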

***
Split into training and test sets, and check that the sizes came out right.

In [2]:
train, test = train_test_split(spam, test_size = 0.3)
print("spam: " + str(len(spam)) + " | train: " + str(len(train)) + " | test: " + str(len(test)))

spam: 4600 | train: 3220 | test: 1380
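
Splitting the whole frame keeps the `is_spam` label in the same table as the features. A common alternative is to separate features and label first and split them together. The sketch below assumes a small synthetic frame (`demo`, with made-up values) rather than the real data, and adds `stratify` and `random_state`, which the original cell does not use:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam frame: two feature columns plus the label.
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    "word_freq_free": rng.rand(100),
    "char_freq_!": rng.rand(100),
    "is_spam": np.repeat([0, 1], [60, 40]),
})

X = demo.drop(columns=["is_spam"])  # features only
y = demo["is_spam"]                 # label only

# stratify=y keeps the class balance similar in both parts;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```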


***
### Random Forest
Let's try this with random forest first, then maybe another method afterwards, depending on how it goes.

In [3]:
spam_rf = RandomForestClassifier(n_jobs = -1)
# Caution: `train` still includes the is_spam column, so the label is also passed in as a feature.
spam_rf_fit = spam_rf.fit(train, train['is_spam'])

In [4]:
spam_rf_test = spam_rf_fit.predict(test)
# is_spam is 0 for ham and 1 for spam; target_names follows that sorted label order.
print(metrics.classification_report(test['is_spam'], spam_rf_test, target_names=["Ham", "Spam"]))

             precision    recall  f1-score   support

        Ham       1.00      1.00      1.00       838
       Spam       1.00      1.00      1.00       542

avg / total       1.00      1.00      1.00      1380



This says random forest is 100% accurate? That seems fishy, so let's try again with a much smaller training set.

In [5]:
train, test = train_test_split(spam, test_size = 0.99)
print("spam: " + str(len(spam)) + " | train: " + str(len(train)) + " | test: " + str(len(test)))

spam: 4600 | train: 46 | test: 4554


In [6]:
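spam_rf = RandomForestClassifier(n_jobs = -1)
# Caution: `train` and `test` still include the is_spam column as a feature.
spam_rf_fit = spam_rf.fit(train, train['is_spam'])
spam_rf_test = spam_rf_fit.predict(test)
# is_spam is 0 for ham and 1 for spam; target_names follows that sorted label order.
print(metrics.classification_report(test['is_spam'], spam_rf_test, target_names=["Ham", "Spam"]))

             precision    recall  f1-score   support

        Ham       0.98      0.99      0.98      2760
       Spam       0.98      0.96      0.97      1794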

avg / total       0.98      0.98      0.98      4554
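
A note on reading these reports: `classification_report` attaches `target_names` to the class labels in sorted order, so the first name goes with `is_spam = 0` (non-spam) and the second with `is_spam = 1` (spam). A toy check with made-up labels (`output_dict=True` needs a reasonably recent scikit-learn):

```python
from sklearn.metrics import classification_report

# 0 = ham (not spam), 1 = spam, matching the is_spam encoding.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

# target_names is matched to the sorted labels: first name -> 0, second -> 1.
report = classification_report(
    y_true, y_pred, target_names=["Ham", "Spam"], output_dict=True
)
print(report["Ham"]["support"], report["Spam"]["support"])
```

Here the three class-0 rows are counted under "Ham" and the two class-1 rows under "Spam".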



Even with only 46 instances to train on, it still predicts correctly about 98% of the time. I'm assuming this is partly caused by the data set being pretty uniform, without a whole lot of variety in the types of spam, but the near-perfect scores may also come from the `is_spam` column still being among the features. Applying this model to other corpora of text would probably not work too well.
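
One thing stands out in the cells above: `train` and `test` are passed to `fit` and `predict` whole, with the `is_spam` column still inside. A minimal sketch on synthetic data (hypothetical noise columns `f1` to `f3` and random labels, not the spam data) of how much that alone can lift a random forest's score:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 1000
# Pure-noise features and a random label: nothing here is genuinely predictable.
demo = pd.DataFrame(rng.rand(n, 3), columns=["f1", "f2", "f3"])
demo["is_spam"] = rng.randint(0, 2, size=n)

train, test = train_test_split(demo, test_size=0.3, random_state=0)

# Leaky: the label column is still inside the feature frame.
leaky = RandomForestClassifier(n_estimators=100, random_state=0)
leaky.fit(train, train["is_spam"])
leaky_acc = leaky.score(test, test["is_spam"])

# Clean: drop the label before fitting and scoring.
features = ["f1", "f2", "f3"]
clean = RandomForestClassifier(n_estimators=100, random_state=0)
clean.fit(train[features], train["is_spam"])
clean_acc = clean.score(test[features], test["is_spam"])

print("leaky: %.2f  clean: %.2f" % (leaky_acc, clean_acc))
```

The leaky fit should score close to perfect while the clean one hovers near chance, which is the pattern to rule out before trusting the reports above.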

***
### SVM

I wanted to try another method to compare the two. This one uses a support vector machine (`svm.SVC` with default settings).
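
One thing to keep in mind with `SVC`: its default RBF kernel works on distances between rows, so a feature with a large numeric range (here possibly `capital_run_length_total`) can drown out the small word frequencies. The sketch below is an assumption about what might help, not something the cells that follow do. It uses synthetic data with one informative small-scale feature and one large-scale noise feature, and compares a raw `SVC` with one wrapped in `StandardScaler`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 600
signal = rng.randn(n)               # small-scale, informative feature
big_noise = rng.randn(n) * 1000.0   # large-scale, uninformative feature
X = np.column_stack([signal, big_noise])
y = (signal > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Same classifier, with and without standardising the features first.
raw_acc = SVC().fit(X_train, y_train).score(X_test, y_test)
scaled_acc = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train).score(X_test, y_test)
print("raw: %.2f  scaled: %.2f" % (raw_acc, scaled_acc))
```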

In [7]:
train, test = train_test_split(spam, test_size = 0.3)
print("spam: " + str(len(spam)) + " | train: " + str(len(train)) + " | test: " + str(len(test)))

spam: 4600 | train: 3220 | test: 1380


In [8]:
spam_svm = svm.SVC()
# Caution: `train` still includes the is_spam column, so the label is also passed in as a feature.
spam_svm_fit = spam_svm.fit(train, train['is_spam'])

In [9]:
spam_svm_test = spam_svm_fit.predict(test)
# is_spam is 0 for ham and 1 for spam; target_names follows that sorted label order.
print(metrics.classification_report(test['is_spam'], spam_svm_test, target_names=["Ham", "Spam"]))

             precision    recall  f1-score   support

        Ham       0.90      0.85      0.88       845
       Spam       0.79      0.84      0.81       535

avg / total       0.85      0.85      0.85      1380



This is more in line with what I expected from a classification model. Let's try again with the small training set.

In [10]:
train, test = train_test_split(spam, test_size = 0.99)
print("spam: " + str(len(spam)) + " | train: " + str(len(train)) + " | test: " + str(len(test)))

spam: 4600 | train: 46 | test: 4554


In [11]:
spam_svm = svm.SVC()
spam_svm_fit = spam_svm.fit(train, train['is_spam'])

In [12]:
spam_svm_test = spam_svm_fit.predict(test)
# is_spam is 0 for ham and 1 for spam; target_names follows that sorted label order.
print(metrics.classification_report(test['is_spam'], spam_svm_test, target_names=["Ham", "Spam"]))

             precision    recall  f1-score   support

        Ham       0.61      0.97      0.75      2760
       Spam       0.56      0.06      0.11      1794

avg / total       0.59      0.61      0.50      4554



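Worse, and close to what always predicting the larger class would score (2760 of 4554, about 61%), with recall on the smaller class at just 0.06. Still, that is from only 46 training examples!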
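
To put the 0.61 in context, it helps to compute the accuracy of always predicting the larger class, using the support counts from the report above. A small sketch (variable names are illustrative):

```python
# Support counts taken from the 46-row SVM report above.
support_first, support_second = 2760, 1794
total = support_first + support_second

# Accuracy of always predicting the larger class.
baseline = max(support_first, support_second) / float(total)
print("majority-class baseline: %.3f" % baseline)
```

scikit-learn's `DummyClassifier(strategy="most_frequent")` would give the same kind of baseline from a fitted estimator.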

***
### Summary

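Random forest outperformed both my expectations and the SVM. Even with very little training data it still gave a fairly accurate account of what is spam or ham. One caveat: every model here was fit on frames that still contained the `is_spam` column, so these scores (the random forest's especially) are probably inflated and should be re-checked with the label dropped from the features.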
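
The small-sample comparison rests on a single random 46-row split per model, which is a noisy basis for a conclusion. A sketch of how such a run could be repeated over several splits, assuming synthetic data from `make_classification` as a stand-in (roughly the size and class balance of the spam set, but not the spam set itself):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in: 4600 rows, 57 features, about 60/40 class balance.
X, y = make_classification(
    n_samples=4600, n_features=57, n_informative=10,
    weights=[0.6, 0.4], random_state=0,
)

# Ten independent splits, each training on 46 rows and testing on the rest.
splitter = StratifiedShuffleSplit(n_splits=10, train_size=46, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=splitter)

print("mean accuracy: %.3f  std: %.3f" % (scores.mean(), scores.std()))
```

Reporting the mean and spread across splits makes it easier to tell a real difference between models from the luck of one draw.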