## **Spam and Ham Email Classification**

*Group: Gregg Maloy, Jacob Silver, and Justin Williams*

In [172]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

First, we'll read in a "spam and ham" dataset we found on Kaggle. The data consists of emails from Enron, the energy company that collapsed amid a massive fraud scandal in the early 2000s. As we'll see when we explore the data further, each email is labeled as either "spam" or "ham" (aka not spam). That means that we can both train our model on known label values and later test its accuracy on a subset of the data we set aside.

Let's pull in our data:

In [141]:
df = pd.read_csv('spam_ham_dataset.csv')

In [142]:
df.head()

| | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |


As we can see, in addition to the ham/spam label and raw email text, the data contains a column called **label_num**, which simply assigns a value of 0 to ham emails and 1 to spam. This saves us the trouble of mapping the labels ourselves, a necessary step for modeling, since most classification algorithms expect numerical rather than categorical data.
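
Had the dataset not included **label_num**, the mapping could be done by hand. The following is a minimal sketch on a made-up Series (the `labels` values are illustrative, not rows from the Enron data), assuming the same 0 = ham, 1 = spam convention:

```python
import pandas as pd

# Toy labels standing in for the dataset's `label` column (illustrative only).
labels = pd.Series(["ham", "spam", "ham", "spam"], name="label")

# Explicit mapping that mirrors the convention used by label_num.
label_num = labels.map({"ham": 0, "spam": 1})

print(label_num.tolist())
```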

There is also a column called **Unnamed: 0** which appears wholly unnecessary, and is perhaps a holdover index from a previous, differently-shuffled iteration of the data. Let's remove it so our data is as clean as possible.

In [143]:
df.drop(columns = "Unnamed: 0", inplace = True)

It appears that all of the raw emails in the **text** column begin with the word "Subject". It's useful to know what words are in the email's subject line, but the word "Subject" itself is unnecessary and may muddy our text. Once we confirm that all of the emails begin this way, we can remove it from all of them.

We have a choice whether to consider the subject line differently than other text. Should it be treated as its own set of features? If it contains words taken from the text body, should we remove them to avoid duplication?

For the sake of simplicity and staying true to a "bag of words" approach, we elected to keep the words of the subject line, but treat them as part of our broader text. If the words in the subject line are repeated in the body, so be it; it may make sense for them to be weighted as such.

In [144]:
#check if the first 9 characters of each raw text are identical
df['text'].apply(lambda x: x[:9]).value_counts()

Subject:     5171
Name: text, dtype: int64

In [145]:
#remove that text from all
df['text'] = df['text'].apply(lambda x: x[9:])

At a glance the text all appears to be lower-case, but let's confirm just in case.

In [146]:
df['text'].apply(lambda x: x.islower()).value_counts()

True     5153
False      18
Name: text, dtype: int64

Almost, but not quite; let's make all the text lower for the sake of consistency.

In [147]:
df['text'] = df['text'].apply(str.lower)

In order to build a model based on this text, we need it to be represented numerically. To do this, we break each text into its constituent parts (in this case, words), then represent each text as a collection of counts for each word. All words occurring in the entire corpus are thus treated as variables (or "features") in the model, whose values are the number of occurrences in that particular text. For the vast majority of these words, the count in any given text will be 0, but we still need every text laid out as a vector of the same length.

This process is called **Vectorization**, and thankfully, Python--specifically, the sklearn package--contains useful functionality for the purpose. First, we can instantiate our Vectorizer. Here, we include the argument 'english' for the optional stop_words parameter. This will automatically exclude English-language stop-words (highly common words like "a" and "the") which could clutter our vectors while adding no value to our model.

In [149]:
vect = CountVectorizer(stop_words='english')

Next, by fitting our vectorizer to our corpus, we are asking it to develop the "vocabulary", or unique set of words, for our emails.

In [150]:
vect.fit(df['text'])

In [97]:
len(vect.get_feature_names_out())

50140

The command *vect.get_feature_names_out()* returns an array containing every unique word in the fitted vocabulary, that is, every distinct word across all emails once stop words are excluded. That means our model will have just over 50,000 features.

Our next step is to create a new DataFrame representing our email data as vectors. This will facilitate the later work of splitting the data into training and testing sets.

In [161]:
#While this is not necessary, we appended our label_num column (which is our dependent variable) to all of our
#independent feature columns, which we derive from a 2-dimensional vector array.
df1 = pd.concat([df[['label_num']],
                 pd.DataFrame(vect.transform(df['text']).toarray(), columns=vect.get_feature_names_out())],
                axis = 1)
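
As the comment above notes, building a dense DataFrame is not strictly necessary. The sketch below, on invented toy texts, suggests one alternative: the sparse matrix returned by the vectorizer can be passed to the classifier directly, which avoids materializing tens of thousands of mostly-zero columns. The names `toy_texts`, `toy_labels`, `sparse_counts`, and `sparse_model` are hypothetical and not part of our notebook.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data invented for illustration (1 = spam, 0 = ham).
toy_texts = ["free prize money", "meeting agenda attached",
             "claim free money", "project meeting notes"]
toy_labels = [1, 0, 1, 0]

# fit_transform returns a SciPy sparse matrix; no .toarray() call is needed.
sparse_counts = CountVectorizer().fit_transform(toy_texts)
print(sparse.issparse(sparse_counts), sparse_counts.shape)

# MultinomialNB accepts the sparse matrix as-is.
sparse_model = MultinomialNB().fit(sparse_counts, toy_labels)
print(sparse_model.predict(sparse_counts))
```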

In [163]:
df1.head()

Unnamed: 0,label_num,00,000,0000,000000,000000000002858,000000000049773,000080,000099,0001,...,zynve,zyqtaqlt,zyrtec,zyyqywp,zzezrjok,zzn,zzo,zzocb,zzso,zzsyt
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now that we have a DataFrame with vectorized data, we can split it into training and testing sets using the *train_test_split* functionality in the sklearn package. This function takes in a DataFrame of independent variable features, an array of the dependent variable, and a ratio for breaking up the data.

For the independent *X* DataFrame, we input df1.iloc[:,1:], representing all feature columns in df1. For the dependent *y* array, we use only the label_num column. Finally, we use 0.20 as our test size, following the widely used 80/20 split between training and testing sets.

In [164]:
X_train,X_test,y_train,y_test = train_test_split(df1.iloc[:,1:],df1['label_num'],test_size=0.20)
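
Because the split above is random, the exact rows in each set (and so the scores reported later) will vary from run to run. The sketch below shows, on toy arrays, two optional *train_test_split* parameters that could make a split repeatable and class-balanced. The seed value 42 and the toy arrays are assumptions for illustration; our notebook does not set either parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels (illustrative): 70 ham (0) and 30 spam (1).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.20,
    random_state=42,  # arbitrary seed (an assumption) so the split is repeatable
    stratify=y_toy,   # keep the ham/spam ratio the same in both sets
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of spam in each set
```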

It is now time to build our model. A Multinomial Naive Bayes classifier is commonly used for cases like this, and should be suitable. We can build such a model using sklearn's MultinomialNB functionality.

In [167]:
#instantiate Multinomial Naive Bayes classifier
nb = MultinomialNB()
#fit the model to our training data
nb.fit(X_train,y_train)

Now that we have a model trained on 80% of our emails, we can test it on the remaining 20% reserved for that purpose, using the *predict* method for MultinomialNB classifiers. This returns an array of 0s and 1s, which represent the model's prediction for whether each email is ham (0) or spam (1).

In [170]:
y_pred = nb.predict(X_test)

In [171]:
y_pred

array([0, 0, 0, ..., 0, 1, 0])
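
To classify a brand-new email, the raw text has to pass through the same vectorizer before *predict* is called. One way to keep those two steps together is sklearn's `make_pipeline`. This is a minimal sketch on invented training texts, not the approach used in this notebook; `spam_pipeline`, `train_texts`, and `new_emails` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration (1 = spam, 0 = ham).
train_texts = ["win free money now", "free prize claim today",
               "meeting agenda for monday", "please review the gas nomination"]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then fits the classifier in one call.
spam_pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

# Raw strings go straight in; the pipeline reuses the fitted vocabulary.
new_emails = ["claim your free prize", "agenda for the monday meeting"]
new_pred = spam_pipeline.predict(new_emails)
print(new_pred)
```

A side effect of this design is that the vocabulary is learned only from the texts passed to `fit`, so words that appear only in held-out emails never influence the model.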

Finally, we can test the accuracy of our model using sklearn's *classification_report* function. This produces a report that compares our predicted results against the true ham/spam values we know from our original data.

In [175]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99       731
           1       0.95      0.98      0.97       304

    accuracy                           0.98      1035
   macro avg       0.97      0.98      0.98      1035
weighted avg       0.98      0.98      0.98      1035



Per this report, our model is quite effective! Let's break down some of the findings:

- The model classified ham with 99% **precision** and spam with 95% precision. That means that of all the emails the model guessed were ham, 99% really were, and of the emails it classified as spam, 95% were indeed spam.
- For both ham and spam, the model scored 98% on **recall**. That means that of all the actual spam emails in the test data, the model identified 98% of them. Likewise for ham.
- F1-Score is, per the classification_report documentation, a "harmonic mean of the precision and recall." The best possible F1 score is 1, so scores of .99 and .97 are quite high!
- Accuracy is simply the total number of correct predictions over the total number of predictions. Thus, it can be said that our model is 98% accurate.
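
The definitions above can be checked by hand from a confusion matrix. The sketch below uses ten invented labels (not our test set) and computes each metric for the spam class from the counts of true and false positives and negatives, then compares the results against sklearn's own scoring functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels invented for illustration (1 = spam, 0 = ham).
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # of emails predicted spam, the share that is spam
recall = tp / (tp + fn)      # of actual spam emails, the share that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct over all predictions

print(precision, recall, f1, accuracy)
print(precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy),
      f1_score(y_true_toy, y_pred_toy),
      accuracy_score(y_true_toy, y_pred_toy))
```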