DATA620 Fall 2019 Week 10/11 Document Classification Omar Pineda

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set https://archive.ics.uci.edu/ml/datasets/Spambase

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source, such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

First, we import our packages and load our dataset. We will use the Spambase dataset linked in the above instructions. We added headers to the data file and saved it as a .csv file, which we uploaded to GitHub for easy access.

In [41]:
import nltk
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import ensemble

mail = pd.read_csv("https://raw.githubusercontent.com/omarp120/DATA620-Web-Analytics/master/spambase.csv")
mail.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


Next, we split our data into training and test subsets using train_test_split.

In [46]:
features = mail.loc[:, mail.columns != 'spam'].values
target = mail.loc[:, mail.columns == 'spam'].values.ravel()
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state=42, stratify = target)
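The `stratify = target` argument is meant to keep the spam/non-spam ratio the same in both subsets. A minimal sketch of how that could be checked follows; it uses synthetic stand-in labels rather than the Spambase data, and the helper `class_share` and the `demo_*` names are hypothetical, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def class_share(labels):
    """Return the fraction of samples in each class as a dict."""
    values, counts = np.unique(labels, return_counts=True)
    total = counts.sum()
    return {int(v): float(c / total) for v, c in zip(values, counts)}


# Synthetic stand-in (assumption): 1000 samples, 40% labeled 1.
rng = np.random.default_rng(0)
demo_target = np.array([1] * 400 + [0] * 600)
demo_features = rng.random((1000, 5))

# Same split settings as the notebook cell above.
Xtr, Xte, ytr, yte = train_test_split(
    demo_features, demo_target, test_size=0.3, random_state=42, stratify=demo_target
)

# With stratification, both subsets should mirror the overall class ratio.
print("train:", class_share(ytr))
print("test: ", class_share(yte))
```

On the notebook's own data, the same check would be `class_share(y_train)` and `class_share(y_test)`.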

Our first model is a K-Nearest Neighbors classifier, using 5 neighbors to start things off. We chose 5 neighbors because accuracy for this classifier peaked at 3 or 5 neighbors; larger values smooth the decision boundary and tend toward underfitting. The accuracy that we got using this model is 79.15%.

In [50]:
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
knnacc = knn.score(X_test, y_test)
print(knnacc)

0.7914554670528603
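The choice of neighbor count can be compared systematically. The sketch below shows one way to do so with cross-validation on the training data; it is not the procedure used in this notebook. The function `scan_neighbors`, its candidate values, and the synthetic `X_demo`/`y_demo` data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def scan_neighbors(X, y, k_values=(1, 3, 5, 7, 9, 11), folds=5):
    """Return {k: mean cross-validated accuracy} for each candidate k."""
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        fold_scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        scores[k] = float(fold_scores.mean())
    return scores


# Synthetic stand-in data (assumption), not the Spambase features.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

k_scores = scan_neighbors(X_demo, y_demo)
best_k = max(k_scores, key=k_scores.get)  # k with the highest mean accuracy
print(k_scores)
print("best k:", best_k)
```

On the notebook's data the call would be `scan_neighbors(X_train, y_train)`, which keeps the test set out of the selection step.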


Our second model uses a Decision Tree classifier to label our mail as either spam or not. This proves more successful, as our accuracy jumps to 90.80%.

In [52]:
dtclassifier = DecisionTreeClassifier()
dtclassifier.fit(X_train, y_train)
dtacc = dtclassifier.score(X_test, y_test)
print(dtacc)

0.9080376538740044


For our third model, we used a Support Vector Machine classifier, and our accuracy went down to 70.82%.

In [53]:
svmclassifier = svm.SVC(gamma='scale')
svmclassifier.fit(X_train, y_train)
svmacc = svmclassifier.score(X_test, y_test)
print(svmacc)

0.7081824764663287
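One possible reason for the lower SVM accuracy, which we did not test here, is feature scale: in the data preview the word-frequency columns are small values while `capital_run_length_total` reaches the thousands, and an RBF-kernel `SVC` is sensitive to such differences. The sketch below shows how scaling could be added with a pipeline. The function `scaled_svm` and the synthetic data (with one deliberately inflated noise column) are assumptions for illustration only.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def scaled_svm():
    """Standardize each feature to zero mean and unit variance, then fit an SVC."""
    return make_pipeline(StandardScaler(), svm.SVC(gamma="scale"))


# Synthetic stand-in data (assumption). With shuffle=False the last columns are
# pure noise, so inflating the final column mimics one large-scale feature.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0, shuffle=False
)
X_demo[:, -1] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Same classifier with and without the scaling step.
unscaled_acc = svm.SVC(gamma="scale").fit(X_tr, y_tr).score(X_te, y_te)
scaled_acc = scaled_svm().fit(X_tr, y_tr).score(X_te, y_te)
print("unscaled:", unscaled_acc)
print("scaled:  ", scaled_acc)
```

Because the scaler sits inside the pipeline, it is fit on the training rows only, so no information from the test set leaks into the transformation.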


Lastly, we implemented a Random Forest classifier to categorize our mail, and our accuracy was the highest here at 95.37%.

In [54]:
rfclassifier = ensemble.RandomForestClassifier(criterion="entropy")
rfclassifier.fit(X_train, y_train)
rfacc = rfclassifier.score(X_test, y_test)
print(rfacc)

0.9536567704561911
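Because a random forest draws bootstrap samples and random feature subsets, its score can shift slightly between runs unless a seed is fixed (the same holds for `DecisionTreeClassifier`, which also accepts `random_state`). The sketch below shows how a seed could make the result repeatable. The helper `forest_accuracy`, the seed value, and the synthetic data are assumptions, not part of the original notebook.

```python
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def forest_accuracy(X_tr, y_tr, X_te, y_te, seed):
    """Fit an entropy-criterion random forest with a fixed seed; return test accuracy."""
    model = ensemble.RandomForestClassifier(criterion="entropy", random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Synthetic stand-in data (assumption), not the Spambase features.
X_demo, y_demo = make_classification(n_samples=400, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Two independent fits with the same seed should report the same accuracy.
first_run = forest_accuracy(X_tr, y_tr, X_te, y_te, seed=42)
second_run = forest_accuracy(X_tr, y_tr, X_te, y_te, seed=42)
print(first_run, second_run)
```

Fixing the seed keeps the accuracy quoted in the text in step with the cell output when the notebook is re-run.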


In conclusion, we explored four different methods for classifying e-mails as spam or non-spam. Our Support Vector Machine classifier had the lowest accuracy (70.82%), while our Random Forest classifier had the highest (95.37%). A summary of our classifiers and their respective accuracies appears below.

In [55]:
data = [['KNN', knnacc], ['Decision Tree', dtacc], ['Support Vector Machine', svmacc], ['Random Forest', rfacc]]
df = pd.DataFrame(data, columns = ['Classifier', 'Accuracy'])
df

Unnamed: 0,Classifier,Accuracy
0,KNN,0.791455
1,Decision Tree,0.908038
2,Support Vector Machine,0.708182
3,Random Forest,0.953657
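For easier reading, the summary could also be ranked and shown as percentages. The sketch below hard-codes the accuracies printed above so that it runs on its own; the names `summary` and `ranked` are hypothetical, and in the notebook the existing `df` could be used instead.

```python
import pandas as pd

# Accuracies copied from the summary output above.
summary = pd.DataFrame(
    {
        "Classifier": ["KNN", "Decision Tree", "Support Vector Machine", "Random Forest"],
        "Accuracy": [0.791455, 0.908038, 0.708182, 0.953657],
    }
)

# Rank from best to worst and add a percentage column rounded to two decimals.
ranked = summary.sort_values("Accuracy", ascending=False).reset_index(drop=True)
ranked["Accuracy (%)"] = (ranked["Accuracy"] * 100).round(2)
print(ranked)
```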
