## DATA 620

### Week 10 : Document Classification

Group 4: Joshua Hummell, Jiho Kim, Scott Reed

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or taken from another source, such as your own spam folder).



### Data Load

The first thing we want to do is load the packages and the data. The data comes from http://archive.ics.uci.edu/ml/datasets/Spambase and is stored on GitHub for reproducibility.

In [59]:
# Load Libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Import the data file into a pandas DataFrame.

In [3]:
# load dataset
path = 'https://raw.githubusercontent.com/jihokim97/CUNY-SPS-DATA-620/main/Assignment%2010/spambase.data'
spam = pd.read_csv(path, sep = ",", header = None)

In [5]:
spam.head()

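|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |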


The columns have no names yet, so we rename them according to the spambase.names file.

In [6]:
names = ["word_freq_make", 
 "word_freq_address", 
 "word_freq_all", 
 "word_freq_3d", 
 "word_freq_our", 
 "word_freq_over", 
 "word_freq_remove", 
 "word_freq_internet",
 "word_freq_order",        
 "word_freq_mail",         
 "word_freq_receive",      
 "word_freq_will",     
 "word_freq_people",
 "word_freq_report",       
 "word_freq_addresses",    
 "word_freq_free",       
 "word_freq_business",     
 "word_freq_email",        
 "word_freq_you",         
 "word_freq_credit",       
 "word_freq_your",         
 "word_freq_font",         
 "word_freq_000",          
 "word_freq_money",        
 "word_freq_hp",           
 "word_freq_hpl",          
 "word_freq_george",       
 "word_freq_650",          
 "word_freq_lab",          
 "word_freq_labs",         
 "word_freq_telnet",       
 "word_freq_857",          
 "word_freq_data",         
 "word_freq_415",          
 "word_freq_85",           
 "word_freq_technology",   
 "word_freq_1999",         
 "word_freq_parts",        
 "word_freq_pm",           
 "word_freq_direct",       
 "word_freq_cs",           
 "word_freq_meeting",      
 "word_freq_original",     
 "word_freq_project",      
 "word_freq_re",           
 "word_freq_edu",          
 "word_freq_table",        
 "word_freq_conference",   
 "char_freq_;",            
 "char_freq_(",            
 "char_freq_[",            
 "char_freq_!",            
 "char_freq_$",            
 "char_freq_#",            
 "capital_run_length_average",
 "capital_run_length_longest",
 "capital_run_length_total",
 "spam"]
spam.columns = names
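Assigning `spam.columns = names` raises a `ValueError` when the list length does not match the number of columns, but it does not catch a duplicated name. The following is a minimal sketch of a pre-check, assuming a hypothetical helper `check_column_names` and a small stand-in DataFrame rather than the real dataset:

```python
import pandas as pd

def check_column_names(df, names):
    """Hypothetical helper: validate a list of names before using it as df.columns."""
    if len(names) != df.shape[1]:
        raise ValueError(f"expected {df.shape[1]} names, got {len(names)}")
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate column names: {duplicates}")
    return True

# Stand-in frame with three unnamed columns (not the Spambase data).
demo = pd.DataFrame([[0.0, 0.64, 1], [0.21, 0.28, 0]])
demo_names = ["word_freq_make", "word_freq_address", "spam"]

if check_column_names(demo, demo_names):
    demo.columns = demo_names
```

With the real data, the same check would be called as `check_column_names(spam, names)` before the assignment.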

In [7]:
spam.head()

|  | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In the "spam" column, 1 denotes that the email was spam and 0 denotes that it was not spam.

Before we split the data into training and testing sets, we separate it into features and targets. The target, also known as the label, is the value we want to predict (here, the "spam" column). The features are all the remaining columns, which the model uses to generate a prediction.

In [8]:
# Labels are the values we want to predict. 
labels = spam['spam']

# Remove the labels from the data.
features = spam.drop('spam', axis = 1)

### Training and Testing Sets

The data is split so that 30% is withheld as the test set and the remaining 70% forms the training set.

In [10]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.3, random_state = 42)
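The split above is a plain random split, so the share of spam can differ slightly between the two sets. As a possible variation (not what this notebook does), passing `stratify` asks `train_test_split` to preserve the class proportions in both sets. The sketch below uses synthetic stand-in data, and the names `X_demo` and `y_demo` are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 40% positive labels (not the Spambase data).
rng = np.random.default_rng(42)
X_demo = rng.random((1000, 5))
y_demo = np.array([1] * 400 + [0] * 600)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Positive share, train:", y_tr.mean())
print("Positive share, test:", y_te.mean())
```

Applied to this notebook, the equivalent change would be adding `stratify = labels` to the existing call, which would change the resulting split and therefore the reported scores.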

To ensure that we split the data correctly, we examine the shape of each piece. We expect the training and testing features to have the same number of columns, and each set of features to have the same number of rows as its labels.

In [11]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)


Training Features Shape: (3220, 57)
Training Labels Shape: (3220,)
Testing Features Shape: (1381, 57)
Testing Labels Shape: (1381,)
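The row counts above can be checked by hand. This small sketch assumes that the test count is rounded up and the remainder goes to training, which is consistent with the shapes printed above:

```python
import math

n_samples = 3220 + 1381   # total rows, from the shapes printed above
test_size = 0.3

# Assumption: the test count is rounded up and the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 3220 1381
```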


### Naive Bayes and Support Vector Machines (SVM)

In [45]:
nb = GaussianNB()

# Train the model on training data
nb.fit(train_features, train_labels)

#Predict outcome 
nbresults = nb.predict(test_features)


In [54]:
print('Naive Bayes Accuracy:', accuracy_score(test_labels, nbresults, normalize = True))


Naive Bayes Accuracy: 0.8247646632874729


In [None]:
svm = LinearSVC()

# Train the model using the training sets 
svm.fit(train_features, train_labels)

#Predict outcome
svmresults = svm.predict(test_features)

In [50]:
print("LinearSVM accuracy : ", accuracy_score(test_labels, svmresults, normalize = True))



LinearSVM accuracy :  0.8464880521361332
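Linear SVMs are sensitive to feature scale, and the Spambase columns range from small word frequencies to capital run lengths in the thousands, so `LinearSVC` may converge poorly on the raw features. The sketch below shows one common remedy, standardizing inside a pipeline. It is an assumption about what might help, not the method used in this notebook, and it runs on synthetic data from `make_classification`, so its score is not comparable to the accuracy reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (not the Spambase data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The scaler is fit on the training split only, then applied to both splits.
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(random_state=42))
scaled_svm.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, scaled_svm.predict(X_te))
print("Scaled LinearSVC accuracy on synthetic data:", demo_accuracy)
```

Keeping the scaler inside the pipeline means the test set never influences the scaling parameters.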


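Accuracy alone is not always the best indicator of a model's performance, especially when the two kinds of error have different costs. A confusion matrix breaks the predictions down into true negatives, false positives, false negatives, and true positives. In scikit-learn's output, rows are the true classes and columns are the predicted classes, so each matrix below reads [[TN, FP], [FN, TP]]. The first matrix is for Naive Bayes and the second is for LinearSVM.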


In [57]:
print(confusion_matrix(test_labels,nbresults))
print(confusion_matrix(test_labels,svmresults))

[[592 212]
 [ 30 547]]
[[781  23]
 [189 388]]
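To make the matrices easier to read, the sketch below unpacks the four counts and recomputes accuracy from them. It assumes the scikit-learn layout of rows as true classes and columns as predicted classes; the helper names `unpack` and `accuracy` are made up for illustration.

```python
# Confusion matrices copied from the output above (rows: true class, columns: predicted).
nb_cm = [[592, 212], [30, 547]]
svm_cm = [[781, 23], [189, 388]]

def unpack(cm):
    """Return (tn, fp, fn, tp) from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return tn, fp, fn, tp

def accuracy(cm):
    """Share of all predictions that were correct."""
    tn, fp, fn, tp = unpack(cm)
    return (tn + tp) / (tn + fp + fn + tp)

print("Naive Bayes accuracy:", round(accuracy(nb_cm), 4))  # 0.8248
print("LinearSVM accuracy:", round(accuracy(svm_cm), 4))   # 0.8465

# Baseline: always predicting "not spam" is correct for every truly non-spam row.
tn, fp, fn, tp = unpack(nb_cm)
print("Majority-class baseline:", round((tn + fp) / (tn + fp + fn + tp), 4))  # 0.5822
```

Read this way, Naive Bayes flags 212 legitimate emails as spam but misses only 30 spam emails, while LinearSVM flags only 23 legitimate emails but misses 189 spam emails.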


**Recall** measures how well a model finds all the positive samples, which here means the share of actual spam emails that the model flags as spam. **Recall** is computed as **TP / (TP + FN)**.

In [60]:
print("Naive Bayes Recall:", recall_score(test_labels, nbresults))
print("LinearSVM Recall:", recall_score(test_labels, svmresults))



Naive Bayes Recall: 0.9480069324090121
LinearSVM Recall: 0.6724436741767764
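As a cross-check, recall can be worked out by hand from the confusion matrices above, taking the bottom row of each as the false negatives and true positives. The helper name `recall` is made up for illustration.

```python
# Counts taken from the confusion matrices above, laid out as [[TN, FP], [FN, TP]].
nb_fn, nb_tp = 30, 547
svm_fn, svm_tp = 189, 388

def recall(tp, fn):
    """Share of actual positives (spam) that the model found."""
    return tp / (tp + fn)

print("Naive Bayes recall:", round(recall(nb_tp, nb_fn), 4))  # 0.948
print("LinearSVM recall:", round(recall(svm_tp, svm_fn), 4))  # 0.6724
```

Both values agree with the `recall_score` output, so Naive Bayes catches far more of the spam here even though its overall accuracy is lower.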
