##*Email Spam Detection Using Multinomial Naive Bayes Algorithm and Simple Bag of Words*

By Nakshatra Singh

###**1. Retrieve and Inspect Dataset**

Let's download the dataset, which is hosted on my Google Drive.

In [None]:
!gdown --id 1CLmJed0Qu6DxKChYzAo1iU4ZtT0EO47- 

Downloading...
From: https://drive.google.com/uc?id=1CLmJed0Qu6DxKChYzAo1iU4ZtT0EO47-
To: /content/emails.csv
0.00B [00:00, ?B/s]8.95MB [00:00, 78.6MB/s]


We'll use `pandas` to parse the CSV file.

In [None]:
import pandas as pd
df = pd.read_csv('/content/emails.csv')

Let's take a look at the first few rows of the table to see what's in there.

In [None]:
df.head(5)

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1

What's the shape of the dataframe?


In [None]:
df.shape

(5728, 2)

Does the dataframe contain any null values?

In [None]:
df.isnull().sum() 

text    0
spam    0
dtype: int64

Which columns does the dataframe have?

In [None]:
df.columns

Index(['text', 'spam'], dtype='object')

We'll drop duplicate rows and then check whether any were actually removed.

In [None]:
df.drop_duplicates(inplace=True) 

Did the number of rows go down?


In [None]:
df.shape

(5695, 2)

Yes: the row count fell from 5728 to 5695, so there were 33 duplicate rows, which are now deleted.

###**2. NLTK**

We'll use NLTK's stopword list to remove words that carry little useful information.

In [None]:
import nltk 
from nltk.corpus import stopwords
nltk.download('stopwords') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Next, I'll write a helper function that preprocesses our text for model training.

In [None]:
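import string

# Build the stopword set once instead of calling stopwords.words('english')
# for every single word, which is much slower.
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
  # Remove punctuation
  nopunc = ''.join(char for char in text if char not in string.punctuation)

  # Remove stopwords (case-insensitive comparison)
  clean_words = [word for word in nopunc.split() if word.lower() not in STOP_WORDS]

  # Return the list of cleaned tokens
  return clean_words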
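
To make the two preprocessing steps concrete, here is a minimal sketch of the same logic on a single string. The tiny `DEMO_STOP_WORDS` set and the name `process_text_demo` are assumptions made for illustration; the notebook itself uses NLTK's full English stopword list.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list, so the sketch
# runs without downloading any corpus.
DEMO_STOP_WORDS = {'do', 'not', 'have', 'the', 'a', 'is'}

def process_text_demo(text):
    # Step 1: drop every punctuation character
    nopunc = ''.join(ch for ch in text if ch not in string.punctuation)
    # Step 2: split on whitespace and drop stopwords, ignoring case
    return [w for w in nopunc.split() if w.lower() not in DEMO_STOP_WORDS]

sample = 'Subject: do not have money, get software cds!'
tokens = process_text_demo(sample)
print(tokens)  # ['Subject', 'money', 'get', 'software', 'cds']
```

Note that the original casing is kept ('Subject' stays capitalized); only the stopword comparison is case-insensitive.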

Let's see what the first few rows look like after applying the function.

In [None]:
df['text'].head().apply(process_text) 

0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object

###**3. Further Understanding**

Here is an example of how Bag of Words builds a matrix of count features, with one row per message and one column per unique word.

In [None]:
# Example

message4 = 'hello hello hello world hello play'
message5 = 'test test test one hello hello world'
print(message4)

# Convert the text to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
# Pass a flat list of strings: one string per message
bow4 = CountVectorizer(analyzer=process_text).fit_transform([message4, message5])
print(bow4)          # Sparse matrix of token counts
print()
print(bow4.shape)    # (number of messages, number of unique tokens)

# Each printed entry reads (row, column)  count.
# The row is the message: 0 is the first message, 1 is the second.
# The column is the token's index in the vocabulary, which is sorted
# alphabetically: hello=0, one=1, play=2, test=3, world=4.
# --> (0, 0)  4 means 'hello' appears 4 times in the first message.

hello hello hello world hello play
  (0, 0)	4
  (0, 4)	1
  (0, 2)	1
  (1, 0)	2
  (1, 4)	1
  (1, 3)	3
  (1, 1)	1

(2, 5)
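
The sparse printout can be hard to read, so here is a minimal pure-Python sketch that rebuilds the same counts as a dense table. It is an illustration of the idea, not how `CountVectorizer` is implemented; the helper name `bag_of_words` is hypothetical, and plain whitespace splitting stands in for `process_text` (the two messages contain no punctuation or stopwords that would change the result).

```python
from collections import Counter

def bag_of_words(messages):
    """Return (vocabulary, count matrix) for whitespace-tokenized messages."""
    tokenized = [message.split() for message in messages]
    # One column per unique token, sorted alphabetically
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # a missing token counts as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words([
    'hello hello hello world hello play',
    'test test test one hello hello world',
])
print(vocabulary)  # ['hello', 'one', 'play', 'test', 'world']
for row in matrix:
    print(row)
# [4, 0, 1, 0, 1]
# [2, 1, 0, 3, 1]
```

Every nonzero cell corresponds to one `(row, column)  count` line in the sparse output; the zeros are simply not stored there.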


Let's convert all our text in the dataframe to a bag of words matrix.

In [None]:
# Convert a collection of text to a matrix of tokens
message_bow = CountVectorizer(analyzer=process_text).fit_transform(df['text']) 

Let's see how many unique tokens (excluding stopwords) CountVectorizer found.

In [None]:
message_bow.shape

(5695, 37229)

Let's split the data into training and validation sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, x_validation, Y_train, y_validation = train_test_split(message_bow,
                                                                df['spam'], 
                                                                test_size=0.2,
                                                                random_state=0)


###**4. Multinomial Naive Bayes Algorithm**

We'll use the Multinomial Naive Bayes classifier, which is designed for discrete count features such as word counts and therefore suits this problem.

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, Y_train) 
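
For intuition about what the classifier estimates, below is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to `alpha=1.0`, the default in `MultinomialNB`. The toy documents, labels, and function names are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    classes = sorted(set(labels))
    vocabulary = sorted({tok for doc in docs for tok in doc})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        # Prior: fraction of training documents belonging to class c
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        total = sum(counts.values()) + alpha * len(vocabulary)
        # Likelihood: smoothed relative frequency of each word in class c
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy corpus (1 = spam, 0 = not spam)
docs = [
    ['free', 'money', 'now'],
    ['free', 'prize', 'now'],
    ['meeting', 'agenda', 'monday'],
    ['project', 'report', 'monday'],
]
labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(['free', 'money'], log_prior, log_likelihood))      # 1
print(predict(['monday', 'meeting'], log_prior, log_likelihood))  # 0
```

Working in log space avoids multiplying many small probabilities together, which would underflow on long emails.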

Now, let's take a quick look at whether our model is doing all right or not.

In [None]:
print(classifier.predict(X_train))

print(Y_train.values) 

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]


The arrays are too large to print in full, so only a few entries are shown. The visible predictions match the targets, but the metrics in the next section give the full picture.
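
To classify an email that is not in the dataframe, the fitted vectorizer has to be kept so that new text is mapped onto the same vocabulary. One possible way to do that, sketched here under assumptions, is a scikit-learn pipeline. The toy texts and labels are hypothetical, and the default `CountVectorizer` tokenizer is used instead of `process_text` to keep the sketch self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from emails.csv (1 = spam, 0 = not spam)
train_texts = [
    'win free money now',
    'free prize claim now',
    'meeting agenda for monday',
    'project report attached for review',
]
train_labels = [1, 1, 0, 0]

# The pipeline keeps the fitted vectorizer and the classifier together,
# so new text is mapped onto the vocabulary learned from the training texts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_emails = ['claim your free money', 'monday project meeting']
predictions = model.predict(new_emails)
print(predictions)  # [1 0]
```

Fitting the pipeline on the training split only would also keep validation emails from contributing words to the vocabulary.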

###**5. Model Metrics**

Now that our model is trained, let's print the metrics that show how well it is actually doing.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# On Training data
y_pred = classifier.predict(X_train)
print(classification_report(Y_train, y_pred))       # Classification Report
print()
print('Confusion Matrix: \n', confusion_matrix(Y_train, y_pred)) # Confusion Matrix
print()
print('Accuracy: ', accuracy_score(Y_train, y_pred)) # Accuracy Score

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556


Confusion Matrix: 
 [[3445   12]
 [   1 1098]]

Accuracy:  0.9971466198419666


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# On Validation Data
y_pred = classifier.predict(x_validation)
print(classification_report(y_validation, y_pred))
print()
print('Confusion Matrix: \n', confusion_matrix(y_validation, y_pred)) 
print()
print('Accuracy: ', accuracy_score(y_validation, y_pred))  

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139


Confusion Matrix: 
 [[862   8]
 [  1 268]]

Accuracy:  0.9920983318700615
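
The numbers in the classification report can be recomputed by hand from the validation confusion matrix. The short sketch below does this for the spam class (label 1); the variable names are my own, and the layout follows scikit-learn's convention of true labels in rows and predictions in columns.

```python
# Validation confusion matrix copied from the output above:
# rows are true labels (0 = not spam, 1 = spam), columns are predictions.
confusion = [[862, 8],
             [1, 268]]
tn, fp = confusion[0]
fn, tp = confusion[1]

precision = tp / (tp + fp)   # of all emails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of all spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f'precision={precision:.2f} recall={recall:.2f} '
      f'f1={f1:.2f} accuracy={accuracy:.4f}')
# precision=0.97 recall=1.00 f1=0.98 accuracy=0.9921
```

So 8 legitimate emails were flagged as spam and only 1 spam email slipped through, which is where the 0.97 precision and the rounded 1.00 recall for class 1 come from.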


###**6. Summary**

- We saw how a bag of words works.
- We saw how a simple CountVectorizer with MultinomialNB can give very accurate results (about 99% accuracy on the validation set).
- We used NLTK stopwords to remove common words that don't give us much information.
- We also evaluated the model's performance metrics.

IF YOU LIKED THIS NOTEBOOK, MAKE SURE TO CHECK OUT MY OTHER [REPOS👊](https://github.com/nakshatrasinghh?tab=repositories)!!!