# Email Spam Detection using the Naive-Bayes algorithm

## Importing libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB

## Importing and inspecting the dataset

In [3]:
email_df = pd.read_csv(r"C:\Users\pjhop\OneDrive\Documents\Programming & Coding\Python\Projects\Datasets\emails.csv")
email_df.head()

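| | Email No. | the | to | ect | and | for | of | a | you | hou | ... | connevey | jay | valued | lay | infrastructure | military | allowing | ff | dry | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Email 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Email 2 | 8 | 13 | 24 | 6 | 6 | 2 | 102 | 1 | 27 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Email 3 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Email 4 | 0 | 5 | 22 | 0 | 5 | 1 | 51 | 2 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Email 5 | 7 | 6 | 17 | 1 | 5 | 2 | 57 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |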


In [4]:
# info is a method: it must be called to print the summary of
# column dtypes, non-null counts and memory usage.
email_df.info()

In [5]:
email_df.isnull().sum()

Email No.     0
the           0
to            0
ect           0
and           0
             ..
military      0
allowing      0
ff            0
dry           0
Prediction    0
Length: 3002, dtype: int64

In [6]:
email_df.shape

(5172, 3002)

The dataset consists of 5,172 rows, each representing one email message, and 3,002 columns: an email identifier (`Email No.`), the frequency counts of the 3,000 most commonly used words in the dataset, and the target variable `Prediction`. The target is binary: 0 marks an email as "Normal" and 1 marks it as "Spam".
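As a rough illustration of this layout, the sketch below builds a tiny stand-in frame and separates the identifier, the word-count features and the target. The frame `toy_df` and its values are hypothetical and only mimic the column structure of `emails.csv`; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame with the same column layout as emails.csv:
# one identifier column, word-count columns, and the binary target.
toy_df = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
    "the": [0, 8, 0, 0],
    "to": [0, 13, 0, 5],
    "Prediction": [0, 0, 1, 0],
})

# Word-count features: everything except the identifier and the target.
toy_features = toy_df.drop(columns=["Email No.", "Prediction"])

# Class balance: number of emails per label and the share labelled 1.
toy_class_counts = toy_df["Prediction"].value_counts().sort_index()
toy_spam_share = toy_df["Prediction"].mean()

print(toy_features.shape)
print(toy_class_counts)
print(toy_spam_share)
```

Running the same `value_counts()` call on `email_df['Prediction']` would show how many normal and spam emails the real dataset holds, which is useful context when reading an accuracy figure later on.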

## Splitting the dataset into training and test data

In [7]:
x = email_df.drop(['Prediction', 'Email No.'], axis=1)
y = email_df['Prediction']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=30, test_size=0.25)
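The split above is a plain random split. One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split` so that both subsets keep roughly the same class proportions. The sketch below shows this on made-up data; `toy_X`, `toy_y` and the `s_`-prefixed names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 200 samples, 25% positives.
toy_X = np.arange(200).reshape(-1, 1)
toy_y = np.array([0] * 150 + [1] * 50)

# stratify=toy_y asks for the same class proportions in both subsets.
s_x_train, s_x_test, s_y_train, s_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=30, stratify=toy_y
)

# Share of positives in each subset; both stay close to the overall 25%.
print(s_y_train.mean(), s_y_test.mean())
```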

## Naive-Bayes algorithm classification - fit, predictions and metrics

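Naive Bayes is a family of classification algorithms that apply Bayes' theorem under the "naive" assumption that the features are conditionally independent given the class. There are three main variants: Gaussian Naive Bayes for continuous features, Multinomial Naive Bayes for discrete counts, and Bernoulli Naive Bayes for binary features. We use the Gaussian Naive Bayes classifier in our model, even though the features here are word counts, a case the multinomial variant is designed for.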
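To make the difference between the three variants concrete, the following sketch fits each of them on the same synthetic count data and records the test accuracy. The data is randomly generated (Poisson-distributed "word counts" with class-specific rates), not the emails dataset, and every name in the block is hypothetical. Scores on synthetic data say nothing about which variant would perform best on the real emails.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)

# Synthetic count features (an assumption, not the emails dataset):
# each class draws its 20 "word counts" from Poisson distributions
# with its own per-word rates.
n_per_class, n_words = 300, 20
rates_normal = rng.uniform(0.5, 3.0, size=n_words)
rates_spam = rng.uniform(0.5, 3.0, size=n_words)
demo_X = np.vstack([
    rng.poisson(rates_normal, size=(n_per_class, n_words)),
    rng.poisson(rates_spam, size=(n_per_class, n_words)),
])
demo_y = np.array([0] * n_per_class + [1] * n_per_class)

d_x_train, d_x_test, d_y_train, d_y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0, stratify=demo_y
)

# Fit each Naive Bayes variant on identical data and store test accuracy.
# BernoulliNB binarizes the counts internally (word present vs. absent).
nb_scores = {}
for name, clf in [
    ("gaussian", GaussianNB()),
    ("multinomial", MultinomialNB()),
    ("bernoulli", BernoulliNB()),
]:
    nb_scores[name] = clf.fit(d_x_train, d_y_train).score(d_x_test, d_y_test)

print(nb_scores)
```

Because all three classes share the same `fit`/`predict` interface, trying another variant on the real data would only require swapping the estimator in the cell that creates `nb`.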

In [8]:
nb = GaussianNB()
model = nb.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [9]:
confusion_matrix(y_test, y_pred)

array([[890,  38],
       [ 15, 350]], dtype=int64)

In [10]:
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', round(accuracy * 100, 3), '%')

Accuracy:  95.901 %


In [11]:
f1 = f1_score(y_test, y_pred)
print('F1 score: ', round(f1 * 100, 3), '%')

F1 score:  92.961 %


In [12]:
recall = recall_score(y_test, y_pred)
print('Recall score: ', round(recall * 100, 3), '%')

Recall score:  95.89 %


In [13]:
precision = precision_score(y_test, y_pred)
print('Precision score: ', round(precision * 100, 3), '%')

Precision score:  90.206 %


The model scores well on all four metrics: accuracy (95.9%), precision (90.2%), recall (95.9%) and F1 score (93.0%). The confusion matrix shows 890 true negatives and 350 true positives out of 1,293 test emails, so most messages are classified correctly. However, the 38 false positives (normal emails flagged as spam) and 15 false negatives (spam emails that were missed) show that the model still has room for improvement, particularly in precision.
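The four metrics can be recomputed by hand from the confusion matrix, which makes the link between the error counts and the scores explicit. The sketch below assumes scikit-learn's layout for a binary confusion matrix (rows are true classes, columns are predicted classes); the `_manual` names are introduced here only for illustration.

```python
import numpy as np

# Confusion matrix reported above. In scikit-learn's layout the rows are
# the true class and the columns the predicted class: [[TN, FP], [FN, TP]].
cm = np.array([[890, 38],
               [15, 350]])
tn, fp, fn, tp = cm.ravel()

accuracy_manual = (tp + tn) / cm.sum()   # correct predictions / all emails
precision_manual = tp / (tp + fp)        # flagged emails that really are spam
recall_manual = tp / (tp + fn)           # spam emails that were caught
f1_manual = (2 * precision_manual * recall_manual
             / (precision_manual + recall_manual))

for label, value in [
    ("Accuracy", accuracy_manual),
    ("Precision", precision_manual),
    ("Recall", recall_manual),
    ("F1 score", f1_manual),
]:
    print(f"{label}: {value * 100:.3f} %")
```

The results agree with the values returned by `accuracy_score`, `precision_score`, `recall_score` and `f1_score` above. The 38 false positives appear only in the precision denominator, which is why precision is the lowest of the four scores.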