<h1> Spambase </h1>

<p> The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...

The classification task for this dataset is to determine whether a given email is spam or not.
	
Additional Variable Information
The last column of 'spambase.data' denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0).  Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail.  The run-length attributes (55-57) measure the length of sequences of consecutive capital letters.  For the statistical measures of each attribute, see the end of this file.  Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD 
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail.  A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR 
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average 
= average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest 
= length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total 
= sum of length of uninterrupted sequences of capital letters 
= total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.   </p>
<h4> Reference </h4>
<p> Hopkins, Mark, Reeber, Erik, Forman, George, and Suermondt, Jaap. (1999). Spambase. UCI Machine Learning Repository. https://doi.org/10.24432/C53G6X. </p>
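
<p> The attribute definitions above can be illustrated with a small sketch that computes the same kinds of features from a raw string. The helper names (<code>word_freq</code>, <code>char_freq</code>, <code>capital_run_lengths</code>) and the sample text are hypothetical, and two choices are assumptions rather than documented behaviour of the dataset: word matching is case-insensitive, and a text with no capital letters returns zeros. </p>

```python
import re


def word_freq(text, word):
    """Percentage of words in `text` equal to `word`.

    A word is a maximal run of alphanumeric characters.
    Case-insensitive matching is an assumption of this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)


def char_freq(text, char):
    """Percentage of characters in `text` equal to `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_lengths(text):
    """Return (average, longest, total) length of runs of capital letters.

    Returning zeros when there are no capitals is an assumption of this sketch.
    """
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


sample = "FREE money!!! Call NOW, it is free."
print(word_freq(sample, "free"))    # 2 of 7 words
print(char_freq(sample, "!"))       # 3 of 35 characters
print(capital_run_lengths(sample))  # runs: FREE, C, NOW
```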

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [1]:
# Install ucimlrepo, the Python client for the UCI Machine Learning Repository
%pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3
Note: you may need to restart the kernel to use updated packages.


In [61]:
# Importing data from the UCI Machine Learning Repository
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 

In [62]:
X.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191


In [63]:
X.columns

Index(['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
       'word_freq_our', 'word_freq_over', 'word_freq_remove',
       'word_freq_internet', 'word_freq_order', 'word_freq_mail',
       'word_freq_receive', 'word_freq_will', 'word_freq_people',
       'word_freq_report', 'word_freq_addresses', 'word_freq_free',
       'word_freq_business', 'word_freq_email', 'word_freq_you',
       'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000',
       'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
       'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
       'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
       'word_freq_technology', 'word_freq_1999', 'word_freq_parts',
       'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
       'word_freq_original', 'word_freq_project', 'word_freq_re',
       'word_freq_edu', 'word_freq_table', 'word_freq_conference',


In [64]:
y.head()

Unnamed: 0,Class
0,1
1,1
2,1
3,1
4,1


In [65]:
X.isnull().any()

word_freq_make                False
word_freq_address             False
word_freq_all                 False
word_freq_3d                  False
word_freq_our                 False
word_freq_over                False
word_freq_remove              False
word_freq_internet            False
word_freq_order               False
word_freq_mail                False
word_freq_receive             False
word_freq_will                False
word_freq_people              False
word_freq_report              False
word_freq_addresses           False
word_freq_free                False
word_freq_business            False
word_freq_email               False
word_freq_you                 False
word_freq_credit              False
word_freq_your                False
word_freq_font                False
word_freq_000                 False
word_freq_money               False
word_freq_hp                  False
word_freq_hpl                 False
word_freq_george              False
word_freq_650               

In [66]:
# Next, check whether the data contains duplicate rows.
X.duplicated().any()

True

In [67]:
# There are duplicate rows. Inspect them here before dropping them, since duplicates can bias the model.
X[X.duplicated()]

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
26,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.000,0.196,0.000,0.392,0.196,0.0,5.466,22,82
103,0.0,0.0,0.64,0.0,0.0,0.64,0.0,0.0,0.0,0.0,...,0.0,0.094,0.189,0.284,0.662,0.000,0.0,10.068,131,292
104,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.305,0.611,0.000,1.529,0.000,0.0,5.500,22,66
105,0.0,0.0,0.64,0.0,0.0,0.64,0.0,0.0,0.0,0.0,...,0.0,0.094,0.189,0.284,0.662,0.000,0.0,10.068,131,292
106,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.305,0.611,0.000,1.529,0.000,0.0,5.500,22,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4439,0.0,0.0,0.74,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.036,0.147,0.000,0.147,0.000,0.0,2.587,55,282
4441,0.0,0.0,0.74,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.036,0.147,0.000,0.147,0.000,0.0,2.587,55,282
4537,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.000,0.000,0.000,0.000,0.000,0.0,1.000,1,6
4541,0.0,0.0,0.00,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.000,0.000,0.000,0.000,0.000,0.0,1.000,1,2


In [68]:
# Before dropping the duplicates in X, filter y with the same mask so the labels stay aligned with the features
y = y[~X.duplicated()]
y.shape

(4207, 1)

In [69]:
# Now drop the duplicate rows (reassign instead of inplace=True to avoid modifying a view of the fetched data)
X = X.drop_duplicates()
X.shape

(4207, 57)

In [70]:
# Verify that no duplicates remain.
X.duplicated().any()

False
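
<p> The deduplication steps above can be sketched on a toy frame. <code>duplicated()</code> marks every repeat after the first occurrence, so the same boolean mask keeps the features and labels aligned. The frames <code>X_toy</code> and <code>y_toy</code> are made up for illustration. The last step is an extra check, not part of the notebook: it looks for identical feature rows that carry different labels, which a features-only deduplication would silently collapse. </p>

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 share the same features
X_toy = pd.DataFrame({"a": [1, 1, 2, 1], "b": [0, 0, 5, 0]})
y_toy = pd.DataFrame({"Class": [1, 1, 0, 0]})

# True for the first occurrence of each feature row, False for repeats
keep = ~X_toy.duplicated()

# Apply the same mask to both frames so rows stay aligned
X_dedup = X_toy[keep]
y_dedup = y_toy[keep]
print(X_dedup.shape, y_dedup.shape)

# Extra check: feature rows that appear with more than one label
labels_per_row = (
    X_toy.assign(Class=y_toy["Class"])
    .groupby(list(X_toy.columns))["Class"]
    .nunique()
)
conflicting = labels_per_row[labels_per_row > 1]
print(conflicting)
```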

In [71]:
# The data has now been checked for missing values and duplicates,
# so we can build models to predict whether an email is spam or not.
# First, standardize the explanatory variables.
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
scale.fit(X)
scaled_feature = scale.transform(X)
X_feat = pd.DataFrame(scaled_feature, columns=X.columns)
X_feat.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,-0.3478,1.160663,0.675992,-0.046661,-0.007837,-0.350298,-0.295506,-0.263176,-0.325745,-0.378392,...,-0.116449,-0.15976,-0.525298,-0.164424,0.588847,-0.317216,-0.105109,-0.049173,0.044294,-0.021207
1,0.352102,0.368319,0.404492,-0.046661,-0.269499,0.663876,0.233095,-0.092593,-0.325745,1.052853,...,-0.116449,-0.15976,-0.044094,-0.164424,0.107517,0.433524,0.004976,-0.008214,0.244667,1.190981
2,-0.147828,-0.247949,0.811741,-0.046661,1.31501,0.337891,0.182752,0.029252,1.942148,0.002258,...,-0.116449,-0.120166,-0.003994,-0.164424,-0.006295,0.450207,-0.082175,0.133755,2.168241,3.180587
3,-0.3478,-0.247949,-0.565148,-0.046661,0.442803,-0.350298,0.48481,1.272071,0.772766,0.580847,...,-0.116449,-0.15976,-0.025867,-0.164424,-0.171085,-0.317216,-0.105109,-0.055779,-0.060901,-0.161821
4,-0.3478,-0.247949,-0.565148,-0.046661,0.442803,-0.350298,0.48481,1.272071,0.772766,0.580847,...,-0.116449,-0.15976,-0.033158,-0.164424,-0.173456,-0.317216,-0.105109,-0.055779,-0.060901,-0.161821


In [72]:
# Second, we split the data into training and test sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_feat, y, test_size=0.30, random_state=10)
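
<p> In the cells above the scaler is fitted on all rows before the split, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split first and put the scaler inside a scikit-learn pipeline, so that it is fitted on the training rows only. The sketch below shows that pattern on synthetic data from <code>make_classification</code>; the names <code>X_demo</code>, <code>y_demo</code> and <code>pipe</code> are illustrative and the data is not Spambase. </p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Spambase features)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Split first, using the same proportions and seed as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=10
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics to transform the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(round(test_accuracy, 3))
```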

<h2> Logistic Regression </h2>
<p> We now fit a Logistic Regression model to the training data. </p>

In [73]:
from sklearn.linear_model import LogisticRegression

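logit = LogisticRegression()
# y_train is a one-column DataFrame; pass it as a 1-D array, which is what scikit-learn expects
logit.fit(X_train, y_train.values.ravel())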

In [74]:
# Predicting using the model
log_predict = logit.predict(X_test)

In [75]:
# Classification report of the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test, log_predict))

              precision    recall  f1-score   support

           0       0.91      0.95      0.93       738
           1       0.92      0.86      0.89       525

    accuracy                           0.91      1263
   macro avg       0.91      0.91      0.91      1263
weighted avg       0.91      0.91      0.91      1263



In [76]:
print(confusion_matrix(y_test, log_predict))

[[699  39]
 [ 72 453]]


In [77]:
y_test.groupby("Class")["Class"].count()

Class
0    738
1    525
Name: Class, dtype: int64

<p> Logistic Regression reached an accuracy of about 91% on the test set, meaning that 91% of the test emails were classified correctly. The confusion matrix shows 699 of the 738 non-spam emails classified correctly (true negatives) and 453 of the 525 spam emails classified correctly (true positives). The off-diagonal entries are the errors: 39 non-spam emails flagged as spam (false positives, Type I) and 72 spam emails missed (false negatives, Type II). </p>
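
<p> The figures in the classification report can be recomputed by hand from the Logistic Regression confusion matrix printed above. The sketch below does that arithmetic for the spam class, using scikit-learn's layout where rows are true labels and columns are predictions. The variable names are illustrative. </p>

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[699, 39],
               [72, 453]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all test emails classified correctly
precision_spam = tp / (tp + fp)      # of emails predicted spam, share that are spam
recall_spam = tp / (tp + fn)         # of actual spam emails, share that were caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f"accuracy  = {accuracy:.6f}")
print(f"precision = {precision_spam:.2f}")
print(f"recall    = {recall_spam:.2f}")
print(f"f1        = {f1_spam:.2f}")
```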

<h2> Decision Tree </h2>
<p> We now fit a Decision Tree classifier to the training data. </p>

In [78]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
# Pass the labels as a 1-D array rather than a one-column DataFrame
dtree.fit(X_train, y_train.values.ravel())

In [79]:
dtree_predict = dtree.predict(X_test)

In [80]:
print(classification_report(y_test, dtree_predict))

              precision    recall  f1-score   support

           0       0.93      0.90      0.91       738
           1       0.87      0.90      0.88       525

    accuracy                           0.90      1263
   macro avg       0.90      0.90      0.90      1263
weighted avg       0.90      0.90      0.90      1263



In [81]:
print(confusion_matrix(y_test, dtree_predict))

[[667  71]
 [ 53 472]]


In [82]:
y_test.groupby("Class")["Class"].count()

Class
0    738
1    525
Name: Class, dtype: int64

<p> The Decision Tree classifier reached about 90% accuracy on the test set. The confusion matrix shows 667 of the 738 non-spam emails classified correctly (true negatives) and 472 of the 525 spam emails classified correctly (true positives). The off-diagonal entries are the errors: 71 non-spam emails flagged as spam (false positives, Type I) and 53 spam emails missed (false negatives, Type II). </p>

<h2> Random Forest </h2>
<p> We now fit a Random Forest classifier to the training data. </p>

In [83]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=600)
# Pass the labels as a 1-D array rather than a one-column DataFrame
rfc.fit(X_train, y_train.values.ravel())

In [84]:
rfc_pred = rfc.predict(X_test)

In [85]:
print(classification_report(y_test, rfc_pred))

              precision    recall  f1-score   support

           0       0.94      0.97      0.95       738
           1       0.96      0.91      0.93       525

    accuracy                           0.95      1263
   macro avg       0.95      0.94      0.94      1263
weighted avg       0.95      0.95      0.95      1263



In [86]:
print(confusion_matrix(y_test, rfc_pred))

[[716  22]
 [ 47 478]]


In [87]:
y_test.groupby("Class")["Class"].count()

Class
0    738
1    525
Name: Class, dtype: int64

<p> The Random Forest classifier reached about 95% accuracy on the test set (0.945 before rounding). The confusion matrix shows 716 of the 738 non-spam emails classified correctly (true negatives) and 478 of the 525 spam emails classified correctly (true positives). The off-diagonal entries are the errors: 22 non-spam emails flagged as spam (false positives, Type I) and 47 spam emails missed (false negatives, Type II). </p>

In [88]:
log_acc = accuracy_score(y_test, log_predict)
dtree_acc = accuracy_score(y_test, dtree_predict)
rfc_acc = accuracy_score(y_test, rfc_pred)
accuracy = {
    "Model": ["Logistic", "Decision Tree", "Random Forest"],
    "Accuracy": [log_acc, dtree_acc, rfc_acc],
}
accuracy_feat = pd.DataFrame(accuracy)
accuracy_feat

Unnamed: 0,Model,Accuracy
0,Logistic,0.912114
1,Decision Tree,0.901821
2,Random Forest,0.945368


<p> On this train/test split, Random Forest gave the highest accuracy (0.945), ahead of Logistic Regression (0.912) and Decision Tree (0.902). We therefore recommend Random Forest for predicting whether an email in this data set is spam or not. </p>

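<p> <b>Note:</b> The accuracy may change from run to run. The train/test split is fixed by <code>random_state=10</code>, but the Decision Tree and Random Forest classifiers are created without a <code>random_state</code>, so their results can vary slightly between runs. </p>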
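
<p> One way to make such a comparison repeatable and less dependent on a single split is to fix the seeds and use k-fold cross-validation. The sketch below is an illustration on synthetic data from <code>make_classification</code>, not a rerun of the notebook. The seed values, the 5 folds and the reduced <code>n_estimators=100</code> are assumptions chosen to keep the example fast. </p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spambase features)
X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=0)

# Fixed seeds make the tree-based models reproducible
models = {
    "Logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

<p> Reporting the mean and standard deviation across folds gives a sense of how much of the gap between models is larger than the fold-to-fold variation. </p>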