<a href="https://colab.research.google.com/github/kevinpandya/Email-Spam-Filtering-using-Python/blob/main/SpamFilter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Spam Email Detection using Python**

Email spam, also called junk email, consists of unsolicited messages sent in bulk by email. Here, I build a program that detects spam emails using natural language processing (NLP) in Python.

First, import the libraries the program needs. In natural language processing, very common words that carry little information (such as "the" or "is") are called stop words. To remove them, the stopwords corpus is imported from nltk.corpus.

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

Upload the dataset of emails to Google Colab.

In [2]:
from google.colab import files
uploaded = files.upload() 

Saving emails.csv to emails.csv


Read the dataset file and display the first 5 rows.

In [3]:
df = pd.read_csv('emails.csv')
df.head(5)

|   | text | spam |
|---|------|------|
| 0 | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | Subject: re : indian springs\r\nthis deal is t... | 0 |


Print the shape of the dataset, i.e. the number of rows and columns.

In [4]:
df.shape

(5171, 2)

Remove any duplicate records from the dataset.

In [5]:
df.drop_duplicates(inplace = True)

Checking the number of columns and rows after removing duplicate entries.

In [6]:
df.shape

(4993, 2)
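The two shapes above differ by 5171 - 4993 = 178 rows, which is the number of duplicate records removed. As a minimal sketch on a made-up three-row frame (the `toy` DataFrame and its contents are illustrative and not taken from emails.csv), `duplicated()` can be used to count repeated rows before `drop_duplicates()` removes them:

```python
import pandas as pd

# Hypothetical toy data: the first and third rows are identical.
toy = pd.DataFrame({
    "text": ["Subject: hello", "Subject: offer", "Subject: hello"],
    "spam": [0, 1, 0],
})

# duplicated() flags every repeat of an earlier row (all columns must match).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print(n_duplicates)
print(deduped.shape)
```

Because all columns are compared, two rows with the same text but different spam labels would both be kept.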

Check each column for missing values.

In [7]:
df.isnull().sum()

text    0
spam    0
dtype: int64

In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The function process_text pre-processes the text in the dataset. To decide whether an email is spam, we first convert it into a list of tokens. The function removes punctuation and stop words, which are not useful for detecting spam.

In [9]:
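def process_text(text):
    """Return the words in text with punctuation and English stop words removed."""

    #1 Remove punctuation characters
    nopunc = ''.join(char for char in text if char not in string.punctuation)

    #2 Remove stop words (load the list once per call instead of once per word)
    stop_words = set(stopwords.words('english'))
    clean_words = [word for word in nopunc.split() if word.lower() not in stop_words]

    #3 Return a list of clean words
    return clean_words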

Show the tokens generated by pre-processing the first five emails.

In [10]:
df['text'].head().apply(process_text)

0    [Subject, enron, methanol, meter, 988291, foll...
1    [Subject, hpl, nom, january, 9, 2001, see, att...
2    [Subject, neon, retreat, ho, ho, ho, around, w...
3    [Subject, photoshop, windows, office, cheap, m...
4    [Subject, indian, springs, deal, book, teco, p...
Name: text, dtype: object
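To make the two pre-processing steps easier to follow on a single string, here is a small stand-alone sketch. It does not use NLTK: `TOY_STOP_WORDS` is a hypothetical, hand-written stand-in for `stopwords.words('english')`, and both `process_text_toy` and the sample sentence are made up for illustration.

```python
import string

# Hypothetical stand-in for the NLTK English stop-word list (the real list is much longer).
TOY_STOP_WORDS = {"this", "is", "the", "for", "you", "a", "to"}

def process_text_toy(text, stop_words=TOY_STOP_WORDS):
    # Step 1: drop every punctuation character.
    nopunc = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: split on whitespace and drop stop words (case-insensitive check).
    return [word for word in nopunc.split() if word.lower() not in stop_words]

sample = "Subject: cheap offer!!! This is the best deal for you."
tokens = process_text_toy(sample)
print(tokens)
```

Note that the stop-word check lowercases each word but the returned tokens keep their original case, which is why "Subject" appears capitalized in the output above.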

Convert the text into a matrix of token counts.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
messages_bow = CountVectorizer(analyzer=process_text).fit_transform(df['text'])
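As an illustration of what a matrix of token counts looks like, the sketch below runs CountVectorizer on three made-up documents. It uses the default analyzer rather than process_text, purely to keep the example free of NLTK; the documents and variable names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus: one row per document.
docs = ["cheap software cheap", "meeting schedule", "cheap meeting"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each token to its column index; list the tokens in column order.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = bow.toarray()

print(vocab)
print(dense)
```

Each row is one document and each column is one token, so the first row records that "cheap" occurs twice and "software" once.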

Split the data into training and testing sets. 80% of the rows are used for training and the remaining 20% are held out, so that the model's predictions on the held-out rows can later be compared with their actual labels.

The training features (independent variables) are stored in X_train and the training targets (dependent variable) in y_train. The testing features are stored in X_test and the testing targets in y_test.

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, df['spam'], test_size = 0.20, random_state = 0)

In [13]:
messages_bow.shape

(4993, 50381)
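With 4993 rows and test_size = 0.20, the split should yield 3994 training rows and 999 testing rows. The sketch below checks this on placeholder arrays of the same length (the arrays are dummies, not the real bag-of-words matrix):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4993  # rows left after de-duplication, per the shape shown above

# Dummy features and labels; only their length matters here.
X = np.arange(n_rows).reshape(-1, 1)
y = np.zeros(n_rows, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

print(X_train.shape[0], X_test.shape[0])
```

These sizes match the support totals in the two classification reports in this notebook.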

Create and train a Multinomial Naive Bayes classifier, which is suitable for classification with discrete features (e.g., word counts for text classification).

In [14]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
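The alpha=1.0 in the output above is the additive (Laplace) smoothing parameter: every word count is increased by alpha so that a word never seen in one class does not force that class's probability to zero. The following plain-Python sketch illustrates the idea on four made-up sentences. It is a simplified illustration, not scikit-learn's implementation, and all names in it are hypothetical.

```python
import math
from collections import Counter

# Made-up training data: (text, label) with 1 = spam, 0 = not spam.
train = [
    ("cheap pills cheap offer", 1),
    ("win cheap offer now", 1),
    ("meeting schedule attached", 0),
    ("project meeting notes", 0),
]
alpha = 1.0  # same smoothing value as in the classifier output above

# Count words per class and documents per class.
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1
vocab = set(word_counts[0]) | set(word_counts[1])

def log_score(words, label):
    """Log prior of the class plus the smoothed log likelihood of each known word."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in words:
        if word in vocab:  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(smoothed)
    return score

def predict(text):
    """Return the class (0 or 1) with the higher log score."""
    words = text.split()
    return max((0, 1), key=lambda label: log_score(words, label))

print(predict("cheap offer"), predict("meeting notes"))
```

Here "cheap offer" scores higher under the spam class and "meeting notes" under the non-spam class, even though neither phrase contains any word from the other class.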

Print the classifier's predictions and the actual values for the training data set.

In [15]:
print(classifier.predict(X_train))
print(y_train.values)

[1 0 1 ... 0 0 1]
[1 0 1 ... 0 0 1]


Evaluate the model on the training data set. The classification report, confusion matrix, and accuracy score show how well the Naive Bayes classifier fits the data it was trained on.

In [16]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Predict on the training set and compare with the known labels
pred = classifier.predict(X_train)
print(classification_report(y_train, pred))
print('Confusion Matrix: \n', confusion_matrix(y_train, pred))
print()
print('Accuracy: ', accuracy_score(y_train, pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      2809
           1       0.98      0.99      0.99      1185

    accuracy                           0.99      3994
   macro avg       0.99      0.99      0.99      3994
weighted avg       0.99      0.99      0.99      3994

Confusion Matrix: 
 [[2787   22]
 [  13 1172]]

Accuracy:  0.9912368552829244
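The figures in the report can be recomputed by hand from the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes. A short sketch using the training numbers printed above (variable names are illustrative):

```python
# Training confusion matrix copied from the output above.
# Row 0: actual non-spam; row 1: actual spam. Columns: predicted 0, predicted 1.
tn, fp = 2787, 22
fn, tp = 13, 1172

# Accuracy: share of all emails that were classified correctly.
accuracy = (tn + tp) / (tn + fp + fn + tp)

# Precision for spam: of the emails flagged as spam, how many really were spam.
precision_spam = tp / (tp + fp)

# Recall for spam: of the real spam emails, how many were caught.
recall_spam = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(round(accuracy, 4), round(precision_spam, 2), round(recall_spam, 2), round(f1_spam, 2))
```

The rounded values agree with the row for class 1 and the accuracy line in the report.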


In [17]:
print('Predicted value: ',classifier.predict(X_test))
print('Actual value: ',y_test.values)

Predicted value:  [0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1
 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 1 1 0
 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0
 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0
 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0
 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1
 0 0 0 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 1 0 0
 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1
 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 

Now, evaluate the model on the test data set. Print the classification report, confusion matrix, and accuracy of the model.

In [18]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Predict on the held-out test set and compare with the known labels
pred = classifier.predict(X_test)
print(classification_report(y_test, pred))
print('Confusion Matrix: \n', confusion_matrix(y_test, pred))
print()
print('Accuracy: ', accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       722
           1       0.95      0.96      0.96       277

    accuracy                           0.98       999
   macro avg       0.97      0.97      0.97       999
weighted avg       0.98      0.98      0.98       999

Confusion Matrix: 
 [[709  13]
 [ 11 266]]

Accuracy:  0.975975975975976
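One practical point: the notebook calls `fit_transform` on a CountVectorizer that is never stored in a variable, so its fitted vocabulary cannot be reused to encode a new email. A hedged sketch of one way to classify unseen text is to keep the vectorizer and call `transform` on the new messages. The toy corpus, the variable names, and the use of the default analyzer in place of process_text are all assumptions made to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training emails and labels (1 = spam, 0 = not spam).
train_texts = [
    "cheap software cheap prices",
    "win money now",
    "meeting agenda attached",
    "see attached schedule",
]
train_labels = [1, 1, 0, 0]

# Keep the fitted vectorizer so the same vocabulary can encode new emails.
toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(train_texts)

toy_classifier = MultinomialNB()
toy_classifier.fit(X_toy, train_labels)

# transform (not fit_transform) maps new text onto the existing vocabulary.
new_emails = ["cheap money", "agenda attached"]
predictions = toy_classifier.predict(toy_vectorizer.transform(new_emails))

print(list(predictions))
```

Calling `fit_transform` again on the new emails would build a different vocabulary, and the resulting columns would no longer line up with what the classifier learned.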
