# **INTRODUCTION**

# **DEFINING OUR QUESTION**

Build a Naive Bayes classifier that detects spam emails.

# **METRIC FOR SUCCESS**

The model should be able to classify emails as spam or not with at least 80% accuracy.

# **UNDERSTANDING CONTEXT**

Classify the test-set samples as spam or non-spam using a Naive Bayes classifier.

# **RECORDING EXPERIMENTAL DESIGN**

1) Load Data

2) Data Cleaning

3) Exploratory Data Analysis

4) Data Modelling

5) Model Evaluation

6) Model Improvement and Tuning

7) Conclusion

# **RELEVANCE OF DATA**

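The data is relevant to the question: each row describes one email by 57 numeric attributes (word frequencies, character frequencies and capital-letter run lengths) together with a spam label.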

# **DATA LOADING**

# **Importing our libraries**

In [1]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Set global parameters
%matplotlib inline
sns.set()
plt.rcParams['figure.figsize'] = (10.0, 8.0)
warnings.filterwarnings('ignore')

In [3]:
with open('spambase.names') as file:
  names = file.read()
  print(names)

| SPAM E-MAIL DATABASE ATTRIBUTES (in .names format)
|
| 48 continuous real [0,100] attributes of type word_freq_WORD 
| = percentage of words in the e-mail that match WORD,
| i.e. 100 * (number of times the WORD appears in the e-mail) / 
| total number of words in e-mail.  A "word" in this case is any 
| string of alphanumeric characters bounded by non-alphanumeric 
| characters or end-of-string.
|
| 6 continuous real [0,100] attributes of type char_freq_CHAR
| = percentage of characters in the e-mail that match CHAR,
| i.e. 100 * (number of CHAR occurences) / total characters in e-mail
|
| 1 continuous real [1,...] attribute of type capital_run_length_average
| = average length of uninterrupted sequences of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_longest
| = length of longest uninterrupted sequence of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_total
| = sum of length of uninterrupted sequences of
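
The frequency formulas in the attribute description can be sketched in a few lines. This is only an illustration of the stated definitions, not the code that produced the dataset: the helper names `word_freq` and `char_freq` are hypothetical, and the case-insensitive matching is an assumption the `.names` text does not confirm.

```python
import re


def word_freq(text: str, word: str) -> float:
    """100 * (occurrences of `word`) / (total words in `text`)."""
    # A "word" is a run of alphanumeric characters bounded by
    # non-alphanumeric characters or end-of-string.
    # Assumption: matching is case-insensitive.
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)


def char_freq(text: str, char: str) -> float:
    """100 * (occurrences of `char`) / (total characters in `text`)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


sample = "Free money! Claim your free prize now!"
print(round(word_freq(sample, "free"), 2))  # 2 of 7 words -> 28.57
print(round(char_freq(sample, "!"), 2))     # 2 of 38 characters -> 5.26
```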

In [5]:
columns_data= ['word_freq_make',
          'word_freq_address',      
          'word_freq_all',          
          'word_freq_3d',          
          'word_freq_our',          
          'word_freq_over',         
          'word_freq_remove',       
          'word_freq_internet',     
          'word_freq_order',        
          'word_freq_mail',         
          'word_freq_receive',      
          'word_freq_will',         
          'word_freq_people',       
          'word_freq_report',       
          'word_freq_addresses',    
          'word_freq_free',         
          'word_freq_business',     
          'word_freq_email',        
          'word_freq_you',          
          'word_freq_credit',       
          'word_freq_your',         
          'word_freq_font',         
          'word_freq_000',          
          'word_freq_money',        
          'word_freq_hp',           
          'word_freq_hpl',          
          'word_freq_george',       
          'word_freq_650',          
          'word_freq_lab',          
          'word_freq_labs',         
          'word_freq_telnet',       
          'word_freq_857',          
          'word_freq_data',         
          'word_freq_415',          
          'word_freq_85',           
          'word_freq_technology',   
          'word_freq_1999',         
          'word_freq_parts',        
          'word_freq_pm',           
          'word_freq_direct',       
          'word_freq_cs',           
          'word_freq_meeting',      
          'word_freq_original',     
          'word_freq_project',      
          'word_freq_re',           
          'word_freq_edu',          
          'word_freq_table',        
          'word_freq_conference',   
          'char_freq_;',            
          'char_freq_(',            
          'char_freq_[',            
          'char_freq_!',            
          'char_freq_$',            
          'char_freq_#',            
          'capital_run_length_average', 
          'capital_run_length_longest', 
          'capital_run_length_total',
          'spam']


In [6]:
# Load data
email_data = pd.read_csv('spambase.data', names=columns_data)

In [7]:
# Preview data
email_data.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In [8]:
email_data.sample(5)

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2730 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 5 | 0 |
| 2952 | 0.0 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.21 | 0.0 | 0.0 | 0.21 | ... | 0.034 | 0.139 | 0.034 | 0.0 | 0.069 | 0.0 | 3.151 | 37 | 312 | 0 |
| 2107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.56 | ... | 0.0 | 0.194 | 0.194 | 0.0 | 0.0 | 0.0 | 3.631 | 17 | 69 | 0 |
| 2676 | 0.2 | 0.06 | 0.2 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.028 | 0.093 | 0.0 | 0.0 | 0.018 | 0.0 | 2.423 | 26 | 693 | 0 |
| 356 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.45 | 0.91 | 0.45 | 0.91 | ... | 0.0 | 0.254 | 0.0 | 0.063 | 0.127 | 0.0 | 4.735 | 46 | 161 | 1 |


In [9]:
# Shape of data
email_data.shape

(4601, 58)

In [None]:
# Information about the data
email_data.info()

In [10]:
# Check for duplicates
email_data.duplicated().sum()

391

In [11]:
# Drop duplicates
email_data.drop_duplicates(inplace=True)
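
Dropping the 391 duplicated rows leaves 4601 - 391 = 4210 rows. The toy frame below is made up for illustration and shows what `duplicated()` counts and what `drop_duplicates()` keeps. The `reset_index(drop=True)` call is an optional extra that the notebook does not use; it only renumbers the rows after the drop.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 3 repeat rows 1 and 0 exactly.
toy = pd.DataFrame({
    "word_freq_free": [0.0, 0.5, 0.5, 0.0],
    "spam": [0, 1, 1, 0],
})

# duplicated() flags every row that is identical to an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)       # 2
print(len(deduped))  # 2
```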

In [13]:
# Check missing values
email_data.isnull().sum()

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

In [14]:
# Check the proportion (%) of each target class before splitting the data
email_data.spam.value_counts(normalize=True) * 100

0    60.118765
1    39.881235
Name: spam, dtype: float64

In [15]:
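# The classes are moderately imbalanced (about 60% non-spam, 40% spam), so apply SMOTE to balance them.
# Note: SMOTE is applied here before the train/test split, so synthetic samples can be interpolated from rows that later land in the test set.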

# Get X and Y
X = email_data.iloc[:, :-1]
y = email_data.spam

# Apply smote to x and y
sm = SMOTE(sampling_strategy='auto', k_neighbors=1, random_state=42)
X_res, y_res = sm.fit_resample(X, y)

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_res, y_res, test_size=.3, random_state=23)
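
To make the effect of `k_neighbors=1` concrete, here is a simplified numpy sketch of the interpolation idea behind SMOTE. It is not imblearn's implementation: the function `smote_like` and the toy points are hypothetical, and details such as neighbour search and sampling order differ in the real library. With one neighbour, every synthetic point lies on the segment between a minority sample and its single nearest minority neighbour.

```python
import numpy as np


def smote_like(X_min, n_new, rng):
    """Simplified SMOTE-style oversampling with one nearest neighbour."""
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = dist.argmin(axis=1)   # index of each point's nearest neighbour
    base = rng.integers(0, len(X_min), size=n_new)  # random base samples
    gap = rng.random((n_new, 1))                    # interpolation weights in [0, 1)
    return X_min[base] + gap * (X_min[nearest[base]] - X_min[base])


rng = np.random.default_rng(42)
# Two tight pairs of minority points: one on the x-axis, one on the line x = 5.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
synthetic = smote_like(X_min, n_new=6, rng=rng)
print(synthetic.shape)  # (6, 2)
```

Because each new point stays between a sample and its closest neighbour, a small `k_neighbors` produces synthetic rows that are near copies of existing ones.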

In [16]:
# Scale data
scaler = MinMaxScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
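
The scaler is fitted on the training data only, so the test data is rescaled with the training minimum and maximum. A small sketch with made-up numbers shows the consequence: test values outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[0.0], [4.0], [7.0]])

# fit() learns min=1 and max=5 from the training data only.
demo_scaler = MinMaxScaler().fit(train)

train_scaled = demo_scaler.transform(train)  # (x - 1) / (5 - 1)
test_scaled = demo_scaler.transform(test)

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [-0.25  0.75  1.5 ]
```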

In [18]:
# Apply LDA decomposition

lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(x_train, y_train)

# Get explained variation by lda
lda.explained_variance_ratio_

array([1.])
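
A ratio of exactly 1 is expected here rather than a sign of an unusually good fit: LDA can produce at most (number of classes - 1) components, so a two-class problem has a single discriminant axis. The sketch below reproduces this on synthetic data; the two Gaussian blobs and the name `lda_demo` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two synthetic classes with 5 features each.
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 5)),
    rng.normal(1.5, 1.0, size=(100, 5)),
])
y_demo = np.repeat([0, 1], 100)

lda_demo = LinearDiscriminantAnalysis(n_components=1).fit(X_demo, y_demo)
ratio = lda_demo.explained_variance_ratio_
projected = lda_demo.transform(X_demo)

print(ratio.shape)      # (1,)
print(projected.shape)  # (200, 1)
```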

In [19]:
# With two classes, LDA yields a single discriminant component, which therefore accounts for 100% of the between-class variance
lda_pred = lda.predict(x_test)
print(classification_report(y_test, lda_pred))

              precision    recall  f1-score   support

           0       0.87      0.93      0.90       749
           1       0.92      0.87      0.89       770

    accuracy                           0.90      1519
   macro avg       0.90      0.90      0.90      1519
weighted avg       0.90      0.90      0.90      1519



The LDA classifier predicts whether an email is spam with 90% accuracy.

In [20]:
# Get the confusion matrix
print(confusion_matrix(y_test, lda_pred, labels=[0,1]))

[[693  56]
 [103 667]]


The LDA classifier predicts the 0 class (non-spam emails) better than the 1 class (spam emails): it produces more false negatives (103 spam emails classified as non-spam) than false positives (56 non-spam emails classified as spam).
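
The figures in the classification report can be recovered directly from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so with `labels=[0, 1]` the flattened matrix reads true negatives, false positives, false negatives, true positives.

```python
import numpy as np

# LDA confusion matrix from the cell above.
cm = np.array([[693, 56],
               [103, 667]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_spam = tp / (tp + fp)  # of emails flagged as spam, share that are spam
recall_spam = tp / (tp + fn)     # of spam emails, share that were flagged
recall_non_spam = tn / (tn + fp)

print(round(accuracy, 2))         # 0.9
print(round(precision_spam, 2))   # 0.92
print(round(recall_spam, 2))      # 0.87
print(round(recall_non_spam, 2))  # 0.93
```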



In [21]:
# Apply a Gaussian Naive Bayes classifier to the LDA-transformed data

x_train_lda = lda.transform(x_train)
x_test_lda = lda.transform(x_test)

gaussian_bayes = GaussianNB()
gaussian_bayes.fit(x_train_lda, y_train)

gaussian_pred = gaussian_bayes.predict(x_test_lda)

print(classification_report(y_test, gaussian_pred))

              precision    recall  f1-score   support

           0       0.87      0.92      0.90       749
           1       0.92      0.87      0.89       770

    accuracy                           0.90      1519
   macro avg       0.90      0.90      0.90      1519
weighted avg       0.90      0.90      0.90      1519
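
Since the LDA projection leaves a single feature, Gaussian Naive Bayes reduces to comparing, for each class, the log prior plus the log density of one normal distribution. The sketch below implements that rule by hand on synthetic data. The helper names are hypothetical, and scikit-learn's `GaussianNB` additionally applies a small variance smoothing term that is omitted here.

```python
import numpy as np


def fit_gaussian_nb_1d(x, y):
    """Return {class: (prior, mean, variance)} for a single feature."""
    params = {}
    for c in np.unique(y):
        xc = x[y == c]
        params[int(c)] = (len(xc) / len(x), xc.mean(), xc.var())
    return params


def predict_gaussian_nb_1d(x, params):
    """Pick the class with the largest log prior + Gaussian log likelihood."""
    classes = sorted(params)
    scores = np.column_stack([
        np.log(prior) - 0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
        for prior, mean, var in (params[c] for c in classes)
    ])
    return np.array(classes)[scores.argmax(axis=1)]


rng = np.random.default_rng(0)
# Synthetic 1-D feature: class 0 centred at -2, class 1 centred at +2.
x_demo = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
y_demo = np.repeat([0, 1], 200)

nb_params = fit_gaussian_nb_1d(x_demo, y_demo)
nb_pred = predict_gaussian_nb_1d(np.array([-3.0, 3.0]), nb_params)
print(nb_pred)  # [0 1]
```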



In [22]:
# Get confusion matrix
matrix = confusion_matrix(y_test, gaussian_pred)
pd.DataFrame(matrix, columns=[0,1], index=[0,1])

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 691 | 58 |
| Actual 1 | 100 | 670 |


In [23]:
# Apply different test size split: 40%

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_res, y_res, test_size=.4, random_state=34)

# Scale data
scaler = MinMaxScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Apply LDA transformation (reuses the lda object fitted on the first 70/30 split; it is not refitted here)
x_train_lda = lda.transform(x_train)
x_test_lda = lda.transform(x_test)

# Modelling
gaussian_bayes = GaussianNB()
gaussian_bayes.fit(x_train_lda, y_train)

gaussian_pred = gaussian_bayes.predict(x_test_lda)

print(classification_report(y_test, gaussian_pred))

              precision    recall  f1-score   support

           0       0.88      0.95      0.91       995
           1       0.95      0.88      0.91      1030

    accuracy                           0.91      2025
   macro avg       0.91      0.91      0.91      2025
weighted avg       0.91      0.91      0.91      2025



In [24]:
# Get confusion matrix
matrix = confusion_matrix(y_test, gaussian_pred)
pd.DataFrame(matrix, columns=[0,1], index=[0,1])

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 944 | 51 |
| Actual 1 | 128 | 902 |
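
The re-split cells refit the scaler but call `lda.transform` with the `lda` object fitted on the first split, so some rows in the new test sets were already seen when LDA was fitted. One way to avoid this, sketched here on synthetic data rather than on the spam data, is to wrap the three steps in a scikit-learn pipeline so that every step is refitted on each new training set. The `evaluate` helper, the synthetic dataset and the seed are assumptions made for the illustration, and the scores it prints are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the resampled email data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)


def evaluate(test_size, seed):
    """Split, then fit scaler, LDA and GaussianNB on the training part only."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=seed, stratify=y_demo
    )
    model = make_pipeline(
        MinMaxScaler(),
        LinearDiscriminantAnalysis(n_components=1),
        GaussianNB(),
    )
    model.fit(x_tr, y_tr)          # all three steps see training rows only
    return model.score(x_te, y_te)  # accuracy on the held-out rows


scores = {ts: evaluate(ts, seed=0) for ts in (0.2, 0.3, 0.4)}
for ts, acc in scores.items():
    print(f"test_size={ts}: accuracy={acc:.3f}")
```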


In [25]:
# Apply different test size split: 20%

# Split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X_res, y_res, test_size=.2, random_state=70)

# Scale data
scaler = MinMaxScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Apply LDA transformation (reuses the lda object fitted on the first 70/30 split; it is not refitted here)
x_train_lda = lda.transform(x_train)
x_test_lda = lda.transform(x_test)

# Modelling
gaussian_bayes = GaussianNB()
gaussian_bayes.fit(x_train_lda, y_train)

gaussian_pred = gaussian_bayes.predict(x_test_lda)

print(classification_report(y_test, gaussian_pred))

              precision    recall  f1-score   support

           0       0.89      0.95      0.92       525
           1       0.94      0.87      0.90       488

    accuracy                           0.91      1013
   macro avg       0.91      0.91      0.91      1013
weighted avg       0.91      0.91      0.91      1013



In [26]:
# Get confusion matrix
matrix = confusion_matrix(y_test, gaussian_pred)
pd.DataFrame(matrix, columns=[0,1], index=[0,1])

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 497 | 28 |
| Actual 1 | 62 | 426 |
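
For a side-by-side view of the three Gaussian Naive Bayes runs, the accuracy and the share of spam emails missed can be recomputed from the confusion matrices reported above and checked against the 80% accuracy target. The dictionary layout is just a convenience for this sketch.

```python
# Confusion matrices reported above, keyed by test size:
# [[TN, FP], [FN, TP]] with rows = actual class, columns = predicted class.
matrices = {
    0.3: [[691, 58], [100, 670]],
    0.4: [[944, 51], [128, 902]],
    0.2: [[497, 28], [62, 426]],
}

TARGET = 0.80  # success metric stated at the top of the notebook

summary = {}
for test_size, ((tn, fp), (fn, tp)) in matrices.items():
    total = tn + fp + fn + tp
    acc = (tn + tp) / total
    missed_spam = fn / (fn + tp)  # share of spam classified as non-spam
    summary[test_size] = (acc, missed_spam)
    print(f"test_size={test_size}: accuracy={acc:.3f}, "
          f"missed spam={missed_spam:.3f}, meets target={acc >= TARGET}")
```

All three runs clear the target, and each misses roughly 12 to 13% of spam emails.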


# **Conclusion**

The best Gaussian Naive Bayes classifiers reach 91% accuracy, obtained with test sets of 20% and 40% of the data; the 30% split gives 90%. All three exceed the 80% accuracy target. To handle the class imbalance, SMOTE with k_neighbors = 1 was applied. The SMOTE-resampled data was then scaled with a MinMaxScaler and reduced to a single component using LDA.