# Spam Email Classification using a Naive Bayes Classifier and NLTK


The Spam Email Classification system is a machine learning model, built on the Naive Bayes classifier, that categorizes an input email as either spam or not spam. The model is trained on the dataset described below and implemented in Python, with NLTK used to preprocess input emails.

## Importing Libraries

In [51]:
# import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, f1_score
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

## Import Dataset

Dataset description:
The dataset used to train the model is the Email Spam Classification Dataset CSV from Kaggle.
It contains 5172 rows and 3002 columns.
The first column holds the email name, and the last column holds the labels for prediction: 1 for spam, 0 for not spam. The remaining 3000 columns hold the counts of the 3000 most common words across all the emails, after excluding non-alphabetical characters and words.

In [52]:
#import dataset
dataset = pd.read_csv(r'C:\Users\wigox\Downloads\archive (3)\emails.csv')
print(dataset)

       Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  \
0        Email 1    0   0    1    0    0   0    2    0    0  ...         0   
1        Email 2    8  13   24    6    6   2  102    1   27  ...         0   
2        Email 3    0   0    1    0    0   0    8    0    0  ...         0   
3        Email 4    0   5   22    0    5   1   51    2   10  ...         0   
4        Email 5    7   6   17    1    5   2   57    0    9  ...         0   
...          ...  ...  ..  ...  ...  ...  ..  ...  ...  ...  ...       ...   
5167  Email 5168    2   2    2    3    0   0   32    0    0  ...         0   
5168  Email 5169   35  27   11    2    6   5  151    4    3  ...         0   
5169  Email 5170    0   0    1    1    0   0   11    0    0  ...         0   
5170  Email 5171    2   7    1    0    2   1   28    2    0  ...         0   
5171  Email 5172   22  24    5    1    6   5  148    8    2  ...         0   

      jay  valued  lay  infrastructure  military  allowing  ff 

## On the Preprocessing of the Dataset

The dataset stores counts of the most common words across all the emails, so it has already been preprocessed and does not require further steps such as:
- removal of non-alphabetic characters
- tokenization of the dataset
- removal of punctuation

Note that common stop words such as "the", "to" and "and" still appear among the columns shown above, so stop words were evidently not removed from the dataset.

Preprocessing with the NLTK library is still necessary for the content of an input email, because raw email text will most likely contain punctuation, mixed case, and other noise.

In [53]:
dataset.head()

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0


## Training the Model

- The model is trained with a Naive Bayes classifier. It classifies an email as spam or not spam by calculating the likelihood of its words appearing in spam versus non-spam emails, and it assumes that each word contributes independently to the classification.

- Since the dataset is of moderate size, Naive Bayes is expected to perform well: its feature-independence assumption, although a simplification, often works well for text data (like words in an email).
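
To make the independence assumption concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is not the notebook's model (which uses scikit-learn's `MultinomialNB`); the function names `train_multinomial_nb` and `predict_label` and the four toy emails are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocab = sorted({word for doc in docs for word in doc})
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: all word occurrences in the class + alpha per vocab word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_label(doc, log_prior, log_likelihood):
    """Pick the class with the highest summed log probability."""
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log likelihoods simply add up.
        # Words never seen in training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy emails, already tokenized: 1 = spam, 0 = not spam
docs = [
    ["win", "free", "prize"],
    ["free", "offer", "win"],
    ["meeting", "schedule", "report"],
    ["project", "report", "meeting"],
]
labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict_label(["free", "prize", "meeting"], log_prior, log_likelihood))
print(predict_label(["meeting", "report"], log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.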

In [54]:
# Separate the features and labels
# Features: every column except the email name (first) and the labels (last)
X = dataset.iloc[:, 1:-1]
# Target: the labels column (last)
y = dataset.iloc[:, -1]

### Splitting the dataset

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
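
As a quick sanity check on the split, the expected subset sizes can be worked out by hand. This sketch assumes scikit-learn's behaviour of rounding a fractional `test_size` up to a whole number of rows; the variable names are illustrative only.

```python
import math

n_samples = 5172   # rows in the dataset
test_size = 0.2    # fraction passed to train_test_split

# A fractional test size is rounded up to whole rows (assumed sklearn behaviour)
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```

The test count of 1035 agrees with the total support shown in the classification report.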

### Model training

In [56]:
# Initialize and train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)


### Making prediction

In [57]:

y_pred = model.predict(X_test)

## Evaluation Metrics

In [58]:
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred))

Accuracy: 0.9545893719806763
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.95      0.97       739
           1       0.89      0.96      0.92       296

    accuracy                           0.95      1035
   macro avg       0.94      0.96      0.95      1035
weighted avg       0.96      0.95      0.96      1035

F1 Score:  0.9235772357723576
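
To show how these figures relate to one another, the sketch below recomputes them from confusion-matrix counts. The four counts are not printed anywhere in the notebook; they are back-derived from the reported accuracy, F1 score, and class supports, so treat them as an inference rather than recorded output.

```python
# Counts for the spam class (label 1), inferred from the report above
tp, fn = 284, 12    # spam emails: correctly flagged / missed (support 296)
fp, tn = 35, 704    # non-spam emails: wrongly flagged / correctly passed (support 739)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: ", accuracy)
print("Precision:", round(precision, 2))
print("Recall:   ", round(recall, 2))
print("F1 Score: ", f1)
```

Note that `f1_score` with its default arguments reports the F1 of the positive class (spam) only, which is why it matches the `1` row of the report rather than the averaged rows.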


## Using NLTK to Preprocess the Input Email

In this step, the input email is lowercased and broken down into tokens. Non-alphabetic tokens and stop words are then removed, and the remaining tokens are stemmed with the Porter stemmer.

The tokens cannot be passed to the model directly: they first have to be counted against the same 3000 word columns the model was trained on. Given that count vector, the trained model predicts whether the email is spam or not.


In [59]:

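# Download the tokenizer data and the stop word list
# (newer NLTK releases may require 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')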

# Initialize stop words and stemmer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()


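def preprocess(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stop words, then stem."""
    if not isinstance(text, str):
        return []  # non-string input (e.g. NaN) yields no tokens instead of None
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [stemmer.stem(word) for word in tokens]
    return tokens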

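# Apply preprocessing to an input email.
# Note: emails.csv holds only word counts and has no raw-text 'message' column,
# so the text below is a hypothetical sample input.
email_text = "Congratulations! You have won a free prize. Click here to claim it."
email_content = preprocess(email_text)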

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wigox\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wigox\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
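
The remaining step, turning a token list into a row the trained model can score, might look like the sketch below. The helper `tokens_to_counts` is hypothetical and the five-word vocabulary is a stand-in for the real 3000 columns. One caveat worth checking: the dataset's columns appear to be unstemmed words and include stop words ("the", "to", "valued", "allowing"), so stemming and stop word removal on the input may produce tokens and counts that no longer line up with the training columns.

```python
from collections import Counter


def tokens_to_counts(tokens, vocabulary):
    """Count each vocabulary word in the tokens, in vocabulary (column) order."""
    counts = Counter(tokens)
    # Words outside the vocabulary are dropped; absent words count as 0
    return [counts[word] for word in vocabulary]


# Stand-in vocabulary: the first few column names from the dataset
vocabulary = ["the", "to", "ect", "and", "for"]
tokens = ["to", "the", "offer", "to"]   # hypothetical tokenized email

row = tokens_to_counts(tokens, vocabulary)
print(row)  # [1, 2, 0, 0, 0]

# With the notebook's objects, the same idea would be (untested sketch):
#   vocabulary = list(X.columns)
#   row = tokens_to_counts(email_content, vocabulary)
#   model.predict(pd.DataFrame([row], columns=X.columns))
```

Keeping the column order identical to the training features matters because `MultinomialNB` identifies each word by its position in the input row.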
