# Email Spam Classification

## Fetch the data

In [1]:
import os
# Lists to store filenames (the length check below skips short-named entries, presumably non-email helper files)
ham_filenames = []              
spam_filenames = []

for each in sorted(os.listdir(r'easy_ham')):
    if len(each)>20:
        ham_filenames.append(each)

for each in sorted(os.listdir(r'spam')):
    if len(each)>20:
        spam_filenames.append(each)
        
print(len(ham_filenames),len(spam_filenames))

2500 500


## Parsing the Emails

In [2]:
import email
import email.parser
import email.policy
# Lists to store respective e-mails
ham_emails = []
spam_emails = []

for each in ham_filenames:
    with open('easy_ham/'+each, "rb") as t:
        ham_emails.append(email.parser.BytesParser(policy=email.policy.default).parse(t))
        
for each in spam_filenames:
    with open('spam/'+each, "rb") as t:
        spam_emails.append(email.parser.BytesParser(policy=email.policy.default).parse(t))
        
print(len(ham_emails),len(spam_emails))

2500 500


In [3]:
type(ham_emails[0])

email.message.EmailMessage

Each parsed message is an instance of the `EmailMessage` class.
See https://docs.python.org/3/library/email.message.html for details.
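
To make the parsing step concrete, here is a minimal sketch that parses a tiny hand-written message from bytes instead of from a dataset file. The addresses and subject are made up for illustration; only the standard `email` package is used.

```python
import email.parser
import email.policy

# A tiny hand-written message (illustrative, not taken from the dataset)
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch on Friday?\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Are you free at noon?\r\n"
)

# parsebytes() is the in-memory counterpart of parse(), which reads from a file object
msg = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw)

print(type(msg).__name__)          # class of the parsed object
print(msg["Subject"])              # headers are looked up by name
print(msg.get_content_type())      # MIME type of the message
print(msg.get_content().strip())   # decoded body text
```

With `policy=email.policy.default` the parser returns the modern `EmailMessage` API, which is what makes `get_content()` available.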

## Example Spam and Ham emails

In [4]:
print(ham_emails[0].get_content().strip())

Date:        Wed, 21 Aug 2002 10:54:46 -0500
    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>


  | I can't reproduce this error.

For me it is very repeatable... (like every time, without fail).

This is the debug log of the pick happening ...

18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}
18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
18:19:04 Ftoc_PickMsgs {{1 hit}}
18:19:04 Marking 1 hits
18:19:04 tkerror: syntax error in expression "int ...

Note, if I run the pick command by hand ...

delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury
1 hit

That's where the "1 hit" comes from (obviously).  The version of nmh I'm
using is ...

delta$ pick -version
pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 

In [5]:
print(spam_emails[5].get_content().strip())

A POWERHOUSE GIFTING PROGRAM You Don't Want To Miss! 
 
  GET IN WITH THE FOUNDERS! 
The MAJOR PLAYERS are on This ONE
For ONCE be where the PlayerS are
This is YOUR Private Invitation

EXPERTS ARE CALLING THIS THE FASTEST WAY 
TO HUGE CASH FLOW EVER CONCEIVED
Leverage $1,000 into $50,000 Over and Over Again

THE QUESTION HERE IS:
YOU EITHER WANT TO BE WEALTHY 
OR YOU DON'T!!!
WHICH ONE ARE YOU?
I am tossing you a financial lifeline and for your sake I 
Hope you GRAB onto it and hold on tight For the Ride of youR life!

Testimonials

Hear what average people are doing their first few days:
"We've received 8,000 in 1 day and we are doing that over and over again!' Q.S. in AL
 "I'm a single mother in FL and I've received 12,000 in the last 4 days." D. S. in FL
"I was not sure about this when I sent off my $1,000 pledge, but I got back $2,000 the very next day!" L.L. in KY
"I didn't have the money, so I found myself a partner to work this with. We have received $4,000 over the last 2 days

## Various structures of emails

In [6]:
# Recursively describe the content type of an email, including its subparts

def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()

# A function to get the type of each email and also count them

from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [7]:
# Most common structures in each category
print('Ham:',structures_counter(ham_emails).most_common(),sep='\n')
print('Spam:',structures_counter(spam_emails).most_common(),sep='\n')

Ham:
[('text/plain', 2408), ('multipart(text/plain, application/pgp-signature)', 66), ('multipart(text/plain, text/html)', 8), ('multipart(text/plain, text/plain)', 4), ('multipart(text/plain)', 3), ('multipart(text/plain, application/octet-stream)', 2), ('multipart(text/plain, text/enriched)', 1), ('multipart(text/plain, application/ms-tnef, text/plain)', 1), ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)', 1), ('multipart(text/plain, video/mng)', 1), ('multipart(text/plain, multipart(text/plain))', 1), ('multipart(text/plain, application/x-pkcs7-signature)', 1), ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)', 1), ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))', 1), ('multipart(text/plain, application/x-java-applet)', 1)]
Spam:
[('text/plain', 218), ('text/html', 183), ('multipart(text/plain, text/html)', 45), ('multipart(text/html)', 20

Ham emails are overwhelmingly plain text (2408 of 2500), and a small number (67) carry a PGP signature. Spam emails contain HTML far more often: at least 248 of the 500 include a `text/html` part.

Here `multipart(...)` denotes an email composed of the listed subparts.
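
As a small illustration of what a multipart email looks like, the sketch below builds one by hand and walks its parts. The subject and bodies are invented; the result has the shape reported above as `multipart(text/plain, text/html)`.

```python
from email.message import EmailMessage

# Build a two-part message by hand (illustrative content)
msg = EmailMessage()
msg["Subject"] = "Demo"
msg.set_content("Plain-text body")                        # starts as a text/plain message
msg.add_alternative("<p>HTML body</p>", subtype="html")   # converts it to multipart and adds text/html

print(msg.is_multipart())            # the payload is now a list of sub-messages
print(len(msg.get_payload()))        # number of direct subparts

# walk() yields the container first, then each subpart in order
types = [part.get_content_type() for part in msg.walk()]
print(types)
```

This is why `get_email_structure` checks whether the payload is a list: for a multipart message each list element is itself a message that may contain further subparts.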

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails+spam_emails, dtype=object)   # Concatenate both lists; dtype=object keeps each email as a single element
y = np.array([0]*len(ham_emails) + [1]*len(spam_emails))    # 0 is appended len(ham_emails) times followed by 1's len(spam_emails) times

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Convert Emails to plain text

In [9]:
import re
from html import unescape

# A function to convert html code to text using regex
def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
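
A quick way to check the conversion is to run it on a small hand-written HTML snippet. The sketch below repeats the function so that it runs on its own; the sample markup is made up. Regex-based stripping like this is a heuristic rather than a real HTML parser, so unusual markup may not be handled cleanly.

```python
import re
from html import unescape

def html_to_plain_text(html):
    # Same steps as the function above, repeated so this snippet is self-contained
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)   # drop the <head> section
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)   # mark links
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)                        # drop remaining tags
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)                   # collapse blank lines
    return unescape(text)                                                       # decode entities such as &amp;

# Illustrative input, not taken from the dataset
sample = (
    "<html><head><title>Ignored</title></head>\n"
    "<body><p>Tom &amp; Jerry</p>\n"
    '<a href="http://example.com">Click here</a>\n'
    "</body></html>"
)

result = html_to_plain_text(sample)
print(repr(result))
```

The title inside `<head>` disappears, the link is replaced by the `HYPERLINK` marker, and `&amp;` is decoded to `&`.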

In [10]:
# Convert an email to plain text: return the first text/plain part, otherwise fall back to converted HTML
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue                              # Ignore the images and other types
        try:
            content = part.get_content()
        except Exception:                         # Fall back to the raw payload in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    if html:
        return html_to_plain_text(html)

In [11]:
print(spam_emails[0].get_content().strip())       # Email before conversion

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content="text/html; charset=windows-1252" http-equiv=Content-Type>
<META content="MSHTML 5.00.2314.1000" name=GENERATOR></HEAD>
<BODY><!-- Inserted by Calypso -->
<TABLE border=0 cellPadding=0 cellSpacing=2 id=_CalyPrintHeader_ rules=none 
style="COLOR: black; DISPLAY: none" width="100%">
  <TBODY>
  <TR>
    <TD colSpan=3>
      <HR color=black noShade SIZE=1>
    </TD></TR></TD></TR>
  <TR>
    <TD colSpan=3>
      <HR color=black noShade SIZE=1>
    </TD></TR></TBODY></TABLE><!-- End Calypso --><!-- Inserted by Calypso --><FONT 
color=#000000 face=VERDANA,ARIAL,HELVETICA size=-2><BR></FONT></TD></TR></TABLE><!-- End Calypso --><FONT color=#ff0000 
face="Copperplate Gothic Bold" size=5 PTSIZE="10">
<CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=#ff0000 
face="Copperplate Gothic Bold" size=5 PTSIZE="10">
<CENTER>Why Spend More Than You Have To?
<CENTER><FONT color=#ff0000 face="Copp

In [12]:
print(email_to_text(spam_emails[0]))              # Email after conversion


Save up to 70% on Life Insurance.
Why Spend More Than You Have To?
Life Quote Savings
    Ensuring your
      family's financial security is very important. Life Quote Savings makes
      buying life insurance simple and affordable. We Provide FREE Access to The
      Very Best Companies and The Lowest Rates.
          Life Quote Savings is FAST, EASY and
            SAVES you money! Let us help you get started with the best values in
            the country on new coverage. You can SAVE hundreds or even thousands
            of dollars by requesting a FREE quote from Lifequote Savings. Our
            service will take you less than 5 minutes to complete. Shop and
            compare. SAVE up to 70% on all types of Life insurance!
             HYPERLINK Click Here For Your
            Free Quote!
          Protecting your family is the best investment you'll ever
          make!
      If you are in receipt of this email
      in error and/or wish to be removed from our list,  HYPERLI

Let's write a transformer class which converts an email to a word counter.
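
The core idea can be sketched on a plain string first. The function below is a simplified, dependency-free stand-in, not the transformer itself: it uses a crude regex in place of `urlextract`, skips stemming, and the name `simple_word_counts` is hypothetical.

```python
import re
from collections import Counter

def simple_word_counts(text):
    """Simplified stand-in for the normalization steps (no urlextract, no stemming)."""
    text = text.lower()                                                # case folding
    text = re.sub(r'https?://\S+', ' URL ', text)                      # crude URL replacement (assumption)
    text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)   # numbers -> NUMBER
    text = re.sub(r'\W+', ' ', text)                                   # punctuation -> single spaces
    return Counter(text.split())                                       # word -> number of occurrences

counts = simple_word_counts("Win $1000 NOW!!! Visit http://example.com now.")
print(counts)
```

Because lowercasing happens before the replacements, the `URL` and `NUMBER` placeholders stay uppercase in this sketch. In the notebook output they appear lowercased, presumably because the stemming step lowercases each word.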

In [13]:
import urlextract
import nltk
from sklearn.base import BaseEstimator, TransformerMixin

url_extractor = urlextract.URLExtract()
stemmer = nltk.PorterStemmer()

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""            # First, convert the email to plain text completely
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor is not None:   # Replace any urls with "URL"
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
            if self.replace_numbers:                              # Replace all the numbers with "NUMBER"
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            if self.remove_punctuation:                           # Remove punctuations
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())             # Split the text into words and get the count of each word
            if self.stemming and stemmer is not None:       # Get the count of only the stem of each word
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [14]:
X_few = X_train[6:8]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'a': 10, 'i': 7, 'to': 7, 'song': 5, 'the': 4, 'and': 4, 'url': 4, 'of': 3, 'playlist': 3, 'digit': 3, 'is': 3, 'not': 3, 'their': 3, 'm': 3, 'up': 3, 'law': 2, 'can': 2, 'number': 2, 'from': 2, 'inform': 2, 'that': 2, 'if': 2, 'it': 2, 'websit': 2, 'with': 2, 'servic': 2, 'as': 2, 'info': 2, 'my': 2, 'anyon': 1, 'heard': 1, 'thi': 1, 'befor': 1, 'q': 1, 'get': 1, 'we': 1, 'are': 1, 'unabl': 1, 'offer': 1, 'perform': 1, 'right': 1, 'in': 1, 'sound': 1, 'record': 1, 'act': 1, 'pass': 1, 'by': 1, 'congress': 1, 'prevent': 1, 'us': 1, 'disclos': 1, 'such': 1, 'state': 1, 'one': 1, 'transmit': 1, 'signal': 1, 'cannot': 1, 'be': 1, 'pre': 1, 'announc': 1, 'music': 1, 'choic': 1, 'polici': 1, 'releas': 1, 'upcom': 1, 'or': 1, 'previous': 1, 'play': 1, 'recent': 1, 'musicchoic': 1, 'upgrad': 1, 'veri': 1, 'import': 1, 'far': 1, 'concern': 1, 'real': 1, 'time': 1, 'directv': 1, 'receiv': 1, 'on': 1, 'shelf': 1, 'display': 1, 'scroll': 1, 'intermitt': 1, 'sure': 1, 'go': 1, 'fir

Classifiers need numeric feature vectors, so we convert these word counts into a sparse matrix.

In [16]:
from scipy.sparse import csr_matrix

# The fit method finds the most common words in the given word counts and stores them in the 'vocabulary',
# whose size is set by the user.

# The transform method creates a sparse matrix holding the number of occurrences of each vocabulary word.
# Words that appear in an email but not in the vocabulary are mapped to column 0, so the first entry
# in each row is the total count of out-of-vocabulary words.

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

In [20]:
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=15)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<2x16 sparse matrix of type '<class 'numpy.int32'>'
	with 30 stored elements in Compressed Sparse Row format>

In [21]:
X_few_vectors.toarray()   # Dense view of the sparse matrix

array([[154,  10,   7,   7,   4,   1,   2,   4,   0,   1,   1,   5,   2,
          4,   1,   2],
       [134,   6,   8,   6,   5,   5,   4,   2,   6,   4,   4,   0,   3,
          1,   4,   2]], dtype=int32)
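
The large first entry in each row comes from how `csr_matrix` handles repeated coordinates: when several `(row, col)` pairs coincide, their values are summed. Every out-of-vocabulary word is assigned column 0, so their counts accumulate there. A minimal sketch with made-up numbers:

```python
from scipy.sparse import csr_matrix

# One email with three distinct words; two of them are out of vocabulary (column 0)
rows = [0, 0, 0]
cols = [0, 0, 2]
data = [4, 3, 5]

m = csr_matrix((data, (rows, cols)), shape=(1, 3))
print(m.toarray())   # the two entries at (0, 0) are added together: 4 + 3 = 7
```

Column 1 is never referenced, so it stays at zero without being stored, which is what keeps the representation compact for large vocabularies.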

In [22]:
vocab_transformer.vocabulary_ # Words selected in vocabulary

{'a': 1,
 'i': 2,
 'to': 3,
 'the': 4,
 'in': 5,
 'that': 6,
 'url': 7,
 'html': 8,
 'thi': 9,
 'get': 10,
 'song': 11,
 'it': 12,
 'and': 13,
 'd': 14,
 'from': 15}

In [24]:
from sklearn.pipeline import Pipeline

final_pipeline = Pipeline([('prep_wordcount',EmailToWordCounterTransformer()),
                           ('prep_sparse',WordCounterToVectorTransformer())])

X_train_transformed = final_pipeline.fit_transform(X_train)

In [26]:
X_train_transformed.toarray()

array([[ 3,  0,  0, ...,  0,  0,  0],
       [41,  0, 11, ...,  0,  0,  0],
       [16,  1,  0, ...,  0,  0,  0],
       ...,
       [86, 29, 20, ...,  0,  0,  0],
       [12,  6,  2, ...,  0,  0,  0],
       [75, 28, 11, ...,  0,  0,  0]], dtype=int32)

## Applying ML Algorithms

### Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score,cross_val_predict

log_cls = LogisticRegression()
print(log_cls)
score = cross_val_score(log_cls,X_train_transformed,y_train,cv=3,verbose=3)
print(score)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
[CV]  ................................................................
[CV] .................................. , score=0.98375, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.0s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV]  ................................................................
[CV] ................................... , score=0.9925, total=   0.2s
[0.98375 0.985   0.9925 ]


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.4s finished


In [30]:
from sklearn.metrics import precision_score,recall_score,f1_score,confusion_matrix

y_pred = cross_val_predict(log_cls,X_train_transformed,y_train,cv=3)
confusion_matrix(y_train,y_pred)

array([[1989,    6],
       [  25,  380]], dtype=int64)
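
The confusion matrix can be turned into the usual metrics by hand. In scikit-learn's convention, rows are actual classes and columns are predicted classes, so with 0 = ham and 1 = spam the four cells read as below. The numbers are copied from the output above.

```python
# Cells of the cross-validated confusion matrix (rows = actual, columns = predicted)
tn, fp = 1989, 6     # actual ham:  correctly kept, wrongly flagged as spam
fn, tp = 25, 380     # actual spam: missed, correctly flagged

precision = tp / (tp + fp)                            # share of flagged emails that really are spam
recall = tp / (tp + fn)                               # share of spam emails that were flagged
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The accuracy computed this way matches the mean of the three cross-validation scores printed earlier, while the recall shows that most of the errors are spam emails slipping through rather than ham being flagged.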

### MLP Classifier

In [32]:
from sklearn.neural_network import MLPClassifier

nn = MLPClassifier()
print(cross_val_score(nn,X_train_transformed,y_train,cv=3,scoring='accuracy'))

[0.985   0.98375 0.99125]


### Random Forest Classifier

In [34]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
print(cross_val_score(rf,X_train_transformed,y_train,cv=3,scoring='accuracy'))

[0.9675  0.9725  0.97125]


### SGD Classifier

In [39]:
from sklearn.linear_model import SGDClassifier

sgd_cls = SGDClassifier(random_state=42,max_iter=5)
print(cross_val_score(sgd_cls,X_train_transformed,y_train,cv=3,scoring='accuracy'))

[0.94625 0.9575  0.9725 ]


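`Logistic Regression` has the highest mean cross-validation accuracy (about `0.987`), narrowly ahead of the MLP and clearly ahead of the random forest and SGD classifiers, so we tune it further.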
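
The comparison can be made explicit by averaging the fold scores. The values below are copied from the outputs above; the dictionary and variable names are only for illustration.

```python
from statistics import mean

# Fold accuracies copied from the cross-validation outputs above
cv_scores = {
    "LogisticRegression": [0.98375, 0.985, 0.9925],
    "MLPClassifier": [0.985, 0.98375, 0.99125],
    "RandomForestClassifier": [0.9675, 0.9725, 0.97125],
    "SGDClassifier": [0.94625, 0.9575, 0.9725],
}

mean_scores = {name: mean(scores) for name, scores in cv_scores.items()}

# Print the models from best to worst mean accuracy
for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {score:.4f}")

best = max(mean_scores, key=mean_scores.get)
print("Best:", best)
```

The gap between logistic regression and the MLP is well below the spread between folds, and the MLP was not given a fixed `random_state`, so a rerun could plausibly reverse their order.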

### Hyperparameter Tuning

In [43]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.003, 0.03, 0.3, 1, 3, 30, 300, 3000]}
grid_search_log_cls = GridSearchCV(log_cls,param_grid,cv=3,verbose=3,scoring='accuracy',return_train_score=True)
grid_search_log_cls.fit(X_train_transformed,y_train)

log_cls = grid_search_log_cls.best_estimator_

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] C=0.003 .........................................................
[CV] ............................ C=0.003, score=0.9625, total=   0.0s
[CV] C=0.003 .........................................................
[CV] ............................ C=0.003, score=0.9675, total=   0.0s
[CV] C=0.003 .........................................................
[CV] ........................... C=0.003, score=0.96625, total=   0.0s
[CV] C=0.03 ..........................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


[CV] ............................... C=0.03, score=0.97, total=   0.0s
[CV] C=0.03 ..........................................................
[CV] ............................ C=0.03, score=0.97625, total=   0.0s
[CV] C=0.03 ..........................................................
[CV] ............................ C=0.03, score=0.98375, total=   0.0s
[CV] C=0.3 ...........................................................
[CV] ............................. C=0.3, score=0.98125, total=   0.0s
[CV] C=0.3 ...........................................................
[CV] .............................. C=0.3, score=0.9825, total=   0.1s
[CV] C=0.3 ...........................................................
[CV] .............................. C=0.3, score=0.9925, total=   0.1s
[CV] C=1 .............................................................
[CV] ............................... C=1, score=0.98375, total=   0.0s
[CV] C=1 .............................................................
[CV] .

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    5.0s finished


In [44]:
X_test_transformed = final_pipeline.transform(X_test)
y_pred = log_cls.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))

print(precision_score(y_test,y_pred),recall_score(y_test,y_pred),f1_score(y_test,y_pred),sep='\n')

[[500   5]
 [  2  93]]
0.9489795918367347
0.9789473684210527
0.9637305699481866


On the held-out test set, the tuned Logistic Regression model reaches a precision of `94.9%` and a recall of `97.9%` (F1 score `96.4%`).
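
These figures follow directly from the test-set confusion matrix. Reading 0 as ham and 1 as spam, with rows as actual classes and columns as predicted classes, gives $TP = 93$, $FP = 5$ and $FN = 2$:

$$
\text{precision} = \frac{TP}{TP + FP} = \frac{93}{98} \approx 0.949, \qquad
\text{recall} = \frac{TP}{TP + FN} = \frac{93}{95} \approx 0.979
$$

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{186}{193} \approx 0.964
$$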