# Assignment 6

### Author : Omer Ozeren

## Introduction

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

### Video

Please click [here](https://www.youtube.com/watch?v=_1Ye-HfdYLU) to watch the Assignment 6 video.

### Data Set 

Please click [here](http://archive.ics.uci.edu/ml/datasets/Spambase) to see the spam data set. I used the files spambase.names and spambase.data: spambase.data holds the actual data, while spambase.names holds the column names and the directions for constructing the file.
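As a rough sketch of how the two files fit together, the snippet below pulls column names out of a names-style listing and applies them to header-less data rows. The inline text is a tiny hypothetical stand-in for the real files, and the `name: type.` line layout and the helper `parse_column_names` are my assumptions for illustration, not code used in this assignment.

```python
import io
import pandas as pd

# Illustrative stand-ins for spambase.names and spambase.data
# (assumed layout, trimmed to three attributes).
names_text = """| comment lines start with a pipe
1, 0.    | spam, non-spam classes

word_freq_make:         continuous.
char_freq_!:            continuous.
capital_run_length_total: continuous.
"""
data_text = "0.00,0.778,278,1\n0.21,0.372,1028,1\n0.31,0.000,88,0\n"

def parse_column_names(text):
    """Return attribute names from lines shaped like 'name: type.'."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comment lines, and lines without a 'name: type' pair
        if not line or line.startswith("|") or ":" not in line:
            continue
        names.append(line.rsplit(":", 1)[0].strip())
    return names

# The class label is the last field of each data row
columns = parse_column_names(names_text) + ["spamclass"]
df = pd.read_csv(io.StringIO(data_text), header=None, names=columns)
print(df)
```

With the real files, the same idea would apply by reading the two file paths instead of the inline strings.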

### Import packages

In [1]:
import nltk
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import ensemble
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
import sklearn.metrics as sm
import seaborn as sn
import warnings
import matplotlib.pyplot as plt

In [2]:
raw_data = pd.read_csv(r"https://raw.githubusercontent.com/omerozeren/DATA620/main/Assignment_6/spambase.csv")

In [3]:
raw_data

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spamclass
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


The last column, spamclass, indicates whether an e-mail is spam or ham. **If spamclass is 1, the e-mail is spam; if it is 0, it is not spam.**
For better readability, I will replace the binary values in the spamclass column with labels: 0 becomes 'email' (non-spam) and 1 becomes 'spam'.

In [4]:
raw_data['spamclass'] = raw_data['spamclass'].replace([0], 'email') #change 0's to email 
raw_data['spamclass'] = raw_data['spamclass'].replace([1], 'spam') # change 1's to spam 
raw_data

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spamclass
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,spam
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,spam
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,spam
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,spam
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,spam
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,email
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,email
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,email
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,email
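The relabeling could also be written as a single `map` call with an explicit dictionary. This is only a sketch on a toy column (the four values are made up), shown because `map` makes any unexpected code visible as NaN instead of leaving it untouched.

```python
import pandas as pd

# Toy stand-in for the spamclass column (values are illustrative)
toy = pd.DataFrame({"spamclass": [1, 0, 0, 1]})

# One explicit mapping covers both codes; unmapped values would become NaN
label_map = {0: "email", 1: "spam"}
toy["spamclass"] = toy["spamclass"].map(label_map)

# Class balance after relabeling
class_counts = toy["spamclass"].value_counts()
print(class_counts)
```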


### Raw Data Description

In [5]:
raw_data.describe()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.031869,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.285735,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,10.0,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0


In [6]:
spam_data = raw_data[raw_data.spamclass == 'spam']
ham_data = raw_data[raw_data.spamclass == 'email']

### Spam Data Description

In [7]:
spam_data.describe()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
count,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,...,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0,1813.0
mean,0.152339,0.16465,0.403795,0.164672,0.513955,0.174876,0.275405,0.208141,0.170061,0.350507,...,0.002101,0.020573,0.10897,0.008199,0.513713,0.174478,0.078877,9.519165,104.393271,470.619415
std,0.310645,0.348919,0.480725,2.219087,0.707195,0.321927,0.57211,0.544864,0.354804,0.631384,...,0.026821,0.091621,0.282141,0.047449,0.744183,0.360479,0.611941,49.846186,299.284969,825.081179
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.094,0.0,0.0,2.324,15.0,93.0
50%,0.0,0.0,0.3,0.0,0.29,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.065,0.0,0.331,0.08,0.0,3.621,38.0,194.0
75%,0.17,0.21,0.64,0.0,0.78,0.24,0.34,0.19,0.19,0.51,...,0.0,0.0,0.144,0.0,0.645,0.211,0.018,5.708,84.0,530.0
max,4.54,4.76,3.7,42.81,7.69,2.54,7.27,11.11,3.33,7.55,...,0.77,1.117,9.752,1.171,7.843,6.003,19.829,1102.5,9989.0,15841.0


In [8]:
spam_data.shape

(1813, 58)

### Ham Data Description

In [9]:
ham_data.describe()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
count,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,...,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0,2788.0
mean,0.073479,0.244466,0.200581,0.000886,0.18104,0.044544,0.009383,0.038415,0.038049,0.16717,...,0.051227,0.050281,0.158578,0.022684,0.109984,0.011648,0.021713,2.377301,18.214491,161.470947
std,0.297838,1.633223,0.502959,0.021334,0.614521,0.222888,0.110467,0.247238,0.198517,0.643197,...,0.365153,0.303372,0.260604,0.134927,0.820859,0.069647,0.243912,5.113685,39.084792,355.738403
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.384,4.0,18.75
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0645,0.0,0.0,0.0,0.0,1.857,10.0,54.0
75%,0.0,0.0,0.12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.222,0.0,0.027,0.0,0.0,2.555,18.0,141.0
max,4.34,14.28,5.1,0.87,10.0,5.88,3.07,5.88,5.26,18.18,...,10.0,4.385,5.277,4.081,32.478,2.038,7.407,251.0,1488.0,5902.0


In [10]:
ham_data.shape

(2788, 58)

### Data Preparation

Before splitting the data into training and test sets, I need to separate the independent variables (the features) from the dependent variable (the spamclass label).

In [11]:
y = raw_data['spamclass']
X = raw_data.loc[:,~raw_data.columns.isin(['spamclass'])]

**The data is split so that 20% goes to the test set and 80% to the training set.**

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2, random_state = 42)

#### Train Data Shape

In [13]:
X_train.shape

(3680, 57)

In [14]:
y_train.shape

(3680,)

#### Test Data Shape

In [15]:
X_test.shape

(921, 57)

In [16]:
y_test.shape

(921,)
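Since the split is random, the class balance in the test set can drift from the full data: 1813 of 4601 rows (about 39%) are spam overall, while the test set in this notebook ends up with 390 spam out of 921 (about 42%). A possible variation, not what I ran above, is to pass `stratify` to `train_test_split`. The sketch below uses synthetic data with a similar balance; `X_demo`, `y_demo` and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the ham/spam balance of the data set
rng = np.random.default_rng(42)
y_demo = pd.Series(["email"] * 610 + ["spam"] * 390)
X_demo = pd.DataFrame(rng.random((1000, 3)), columns=["f1", "f2", "f3"])

# stratify=y_demo keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_spam_share = (y_tr == "spam").mean()
test_spam_share = (y_te == "spam").mean()
print(train_spam_share, test_spam_share)
```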

## Modeling 
**I will use four machine learning methods: Naive Bayes, Decision Tree, Random Forest, and Support Vector Machines.**


### Naive Bayes

In [17]:
naive_model = GaussianNB()

# Train the model 
naive_model.fit(X_train, y_train)

#Predict  
naive_predicted = naive_model.predict(X_test)

In [18]:
naive_result = pd.DataFrame()
naive_result['Actual'] = list(y_test)
naive_result['Predicted'] = list(naive_predicted)
naive_result['Prediction_Correction'] = naive_result['Actual'] == naive_result['Predicted'] 
naive_result

Unnamed: 0,Actual,Predicted,Prediction_Correction
0,email,spam,False
1,email,email,True
2,email,email,True
3,spam,spam,True
4,email,email,True
...,...,...,...
916,spam,spam,True
917,email,spam,False
918,email,email,True
919,email,email,True


In [19]:
print("Naive-Bayes Accuracy : ", accuracy_score(y_test, naive_predicted, normalize = True))
print(classification_report(y_test, naive_predicted))

Naive-Bayes Accuracy :  0.8208469055374593
              precision    recall  f1-score   support

       email       0.95      0.73      0.82       531
        spam       0.72      0.95      0.82       390

    accuracy                           0.82       921
   macro avg       0.83      0.84      0.82       921
weighted avg       0.85      0.82      0.82       921
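The report shows high recall for spam (0.95) but lower precision (0.72), which means a fair number of legitimate e-mails are flagged as spam. A confusion matrix makes that trade-off easier to see. The sketch below uses a handful of made-up labels, not the predictions from this notebook, just to show how precision and recall fall out of the matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (illustrative only, not taken from the models above)
y_true = ["email", "email", "email", "spam", "spam", "email"]
y_pred = ["spam", "email", "email", "spam", "spam", "spam"]

# Rows are actual classes, columns are predicted classes, in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=["email", "spam"])
tn, fp, fn, tp = cm.ravel()  # treating "spam" as the positive class

# precision = tp / (tp + fp), recall = tp / (tp + fn)
spam_precision = precision_score(y_true, y_pred, pos_label="spam")
spam_recall = recall_score(y_true, y_pred, pos_label="spam")
print(cm)
print(spam_precision, spam_recall)
```

The same calls should work with `y_test` and `naive_predicted` in place of the toy lists.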



### Decision Tree

In [20]:
tree_model = tree.DecisionTreeClassifier(criterion = "entropy", random_state = 42)

# Train the model 
tree_model.fit(X_train, y_train)

#Predict  
tree_predicted = tree_model.predict(X_test)

In [21]:
tree_result = pd.DataFrame()
tree_result['Actual'] = list(y_test)
tree_result['Predicted'] = list(tree_predicted)
tree_result['Prediction_Correction'] = tree_result['Actual'] == tree_result['Predicted'] 
tree_result

Unnamed: 0,Actual,Predicted,Prediction_Correction
0,email,email,True
1,email,email,True
2,email,email,True
3,spam,spam,True
4,email,email,True
...,...,...,...
916,spam,spam,True
917,email,email,True
918,email,email,True
919,email,spam,False


In [22]:
print("Decision Tree Accuracy : ", accuracy_score(y_test, tree_predicted, normalize = True))
print(classification_report(y_test, tree_predicted))

Decision Tree Accuracy :  0.9315960912052117
              precision    recall  f1-score   support

       email       0.93      0.95      0.94       531
        spam       0.94      0.90      0.92       390

    accuracy                           0.93       921
   macro avg       0.93      0.93      0.93       921
weighted avg       0.93      0.93      0.93       921



### Random Forest

In [23]:
forest_model = ensemble.RandomForestClassifier(criterion = "entropy", random_state = 42)
# Train the model 
forest_model.fit(X_train, y_train)

#Predict  
forest_predicted = forest_model.predict(X_test)

In [24]:
forest_result = pd.DataFrame()
forest_result['Actual'] = list(y_test)
forest_result['Predicted'] = list(forest_predicted)
forest_result['Prediction_Correction'] = forest_result['Actual'] == forest_result['Predicted'] 
forest_result

Unnamed: 0,Actual,Predicted,Prediction_Correction
0,email,email,True
1,email,email,True
2,email,email,True
3,spam,spam,True
4,email,email,True
...,...,...,...
916,spam,spam,True
917,email,email,True
918,email,email,True
919,email,email,True


In [25]:
print("Forest Accuracy : ", accuracy_score(y_test, forest_predicted, normalize = True))
print(classification_report(y_test, forest_predicted))

Forest Accuracy :  0.9576547231270358
              precision    recall  f1-score   support

       email       0.95      0.98      0.96       531
        spam       0.97      0.93      0.95       390

    accuracy                           0.96       921
   macro avg       0.96      0.95      0.96       921
weighted avg       0.96      0.96      0.96       921



### Support Vector Machines

In [26]:
svm_model = svm.SVC(random_state = 42)
# Train the model 
svm_model.fit(X_train, y_train)

#Predict  
svm_predicted = svm_model.predict(X_test)

In [27]:
svm_result = pd.DataFrame()
svm_result['Actual'] = list(y_test)
svm_result['Predicted'] = list(svm_predicted)
svm_result['Prediction_Correction'] = svm_result['Actual'] == svm_result['Predicted'] 
svm_result

Unnamed: 0,Actual,Predicted,Prediction_Correction
0,email,email,True
1,email,email,True
2,email,email,True
3,spam,email,False
4,email,email,True
...,...,...,...
916,spam,spam,True
917,email,email,True
918,email,email,True
919,email,email,True


In [28]:
print("SVM Accuracy : ", accuracy_score(y_test, svm_predicted, normalize = True))
print(classification_report(y_test, svm_predicted))

SVM Accuracy :  0.6623235613463626
              precision    recall  f1-score   support

       email       0.66      0.84      0.74       531
        spam       0.66      0.42      0.51       390

    accuracy                           0.66       921
   macro avg       0.66      0.63      0.63       921
weighted avg       0.66      0.66      0.64       921
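The SVC above was fit on the raw features, whose ranges differ a lot (word frequencies are mostly below 1, while capital_run_length_total goes up to 15841). RBF-kernel SVMs are sensitive to feature scale, so one thing worth trying is standardizing the features first. The sketch below illustrates the effect on synthetic data only: two informative columns plus one large-scale noise column, all of which are my assumptions. I have not rerun the spam data this way, so no new result is claimed for it.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: two small-scale informative columns, one huge noise column
rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)
signal = rng.normal(loc=labels[:, None] * 3.0, scale=1.0, size=(n, 2))
noise = rng.normal(loc=0.0, scale=1000.0, size=(n, 1))
features = np.hstack([signal, noise])

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Same classifier, with and without standardization in front of it
plain_svm = SVC(random_state=42).fit(Xs_train, ys_train)
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=42)).fit(Xs_train, ys_train)

plain_acc = accuracy_score(ys_test, plain_svm.predict(Xs_test))
scaled_acc = accuracy_score(ys_test, scaled_svm.predict(Xs_test))
print("unscaled:", plain_acc, "scaled:", scaled_acc)
```

Putting the scaler inside the pipeline means it is fit on the training rows only, so the test rows do not leak into the scaling.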



### Conclusion
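I built four classification models: Naive Bayes, Decision Tree, Random Forest, and SVM. The Random Forest model has the best accuracy (about 96%), ahead of the Decision Tree (93%), Naive Bayes (82%), and SVM (66%). The SVM was trained with default settings on unscaled features, which may explain part of its low score.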

In [29]:
# Collect each model's test accuracy in a single-row DataFrame
# (DataFrame.append was removed in pandas 2.0, so the row is built directly)
summary_df = pd.DataFrame([{'Decision Tree':accuracy_score(y_test, tree_predicted, normalize = True),
                            'Naive Bayes':accuracy_score(y_test, naive_predicted, normalize = True),
                            'Random Forest':accuracy_score(y_test, forest_predicted, normalize = True),
                            'SVM':accuracy_score(y_test, svm_predicted, normalize = True)}])
summary_df

Unnamed: 0,Decision Tree,Naive Bayes,Random Forest,SVM
0,0.931596,0.820847,0.957655,0.662324
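These accuracies come from a single 80/20 split, so the ranking could shift a little with a different split. One way to check how stable it is would be k-fold cross-validation. The sketch below runs on synthetic data from `make_classification`, not on the spam data, and the names `candidates` and `cv_means` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X_cv, y_cv = make_classification(n_samples=300, n_features=10, random_state=42)

candidates = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, criterion="entropy", random_state=42
    ),
    "SVM": SVC(random_state=42),
}

# Mean accuracy over 5 folds for each model
cv_means = {
    name: cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
for name, score in cv_means.items():
    print(f"{name}: {score:.3f}")
```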
