# Spam Classifier

**This notebook contains:** an overview of the Apache SpamAssassin public dataset, the data preprocessing steps, and the training and evaluation of a logistic regression model for classifying spam emails.

### Imports

In [1]:
import os
import warnings
import tarfile
import urllib.request
import shutil
import email
import email.parser
import email.policy
import numpy as np
from email import message_from_string
from pathlib import Path
from collections import Counter
from bs4 import BeautifulSoup
from html import unescape
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
import nltk
import urlextract
import re
from scipy.sparse import csr_matrix
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score

### Fetching data

In [2]:
def fetch_spam_data():
    spam_root = 'http://spamassassin.apache.org/old/publiccorpus/'
    ham_url = spam_root + '20030228_easy_ham.tar.bz2'
    spam_url = spam_root + '20030228_spam.tar.bz2'
    spam_path = Path() / 'datasets'

    spam_path.mkdir(parents=True, exist_ok=True)

    for target_name, url in (('ham', ham_url), ('spam', spam_url)):
        target_dir = spam_path / target_name
        if not target_dir.is_dir():
            path = (spam_path / f'{target_name}.tar.bz2')
            print('Downloading', path)
            urllib.request.urlretrieve(url, path)

            temp_dir = spam_path / f'tmp_{target_name}'
            temp_dir.mkdir(exist_ok=True)
            with tarfile.open(path) as f:
                f.extractall(path=temp_dir)
            
            extracted_root = next(temp_dir.iterdir())
            shutil.move(str(extracted_root), target_dir)

            temp_dir.rmdir()
            os.remove(path)

    return [spam_path / name for name in ('ham', 'spam')]

In [6]:
warnings.filterwarnings("ignore", category=DeprecationWarning)
ham_dir, spam_dir = fetch_spam_data()

Downloading datasets\ham.tar.bz2
Downloading datasets\spam.tar.bz2


### Loading data

In [6]:
ham_filenames = [f for f in sorted(ham_dir.iterdir()) if f.name != 'cmds']
spam_filenames = [f for f in sorted(spam_dir.iterdir()) if f.name != 'cmds']

In [7]:
def load_email(filepath):
    with open(filepath, 'rb') as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [8]:
ham_emails = [load_email(filepath) for filepath in ham_filenames]
spam_emails = [load_email(filepath) for filepath in spam_filenames]
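
To see what the parser returns without touching the downloaded corpus, here is a minimal sketch that parses a hand-written message from bytes. The message text and the names `raw_message` and `parsed` are made up for illustration; only `BytesParser`, `email.policy.default`, and `get_content()` come from the cells above.

```python
import email.parser
import email.policy

# A tiny hand-written message (hypothetical, not taken from the corpus).
raw_message = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch?\r\n"
    b"Content-Type: text/plain; charset=\"utf-8\"\r\n"
    b"\r\n"
    b"Are you free at noon?\r\n"
)

# parsebytes() is the in-memory counterpart of parse(), which reads from a file object.
parser = email.parser.BytesParser(policy=email.policy.default)
parsed = parser.parsebytes(raw_message)

print(parsed["Subject"])           # header lookup is case-insensitive
print(parsed.get_content_type())   # MIME type declared in the headers
print(parsed.get_content())        # decoded body text
```

With `email.policy.default`, the parser returns an `EmailMessage` object, which is why `get_content()` is available in the exploration cells that follow.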

### Exploring data

In [9]:
print(ham_emails[0].get_content())

    Date:        Wed, 21 Aug 2002 10:54:46 -0500
    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>


  | I can't reproduce this error.

For me it is very repeatable... (like every time, without fail).

This is the debug log of the pick happening ...

18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}
18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
18:19:04 Ftoc_PickMsgs {{1 hit}}
18:19:04 Marking 1 hits
18:19:04 tkerror: syntax error in expression "int ...

Note, if I run the pick command by hand ...

delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury
1 hit

That's where the "1 hit" comes from (obviously).  The version of nmh I'm
using is ...

delta$ pick -version
pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55

In [10]:
print(spam_emails[0].get_content())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content="text/html; charset=windows-1252" http-equiv=Content-Type>
<META content="MSHTML 5.00.2314.1000" name=GENERATOR></HEAD>
<BODY><!-- Inserted by Calypso -->
<TABLE border=0 cellPadding=0 cellSpacing=2 id=_CalyPrintHeader_ rules=none 
style="COLOR: black; DISPLAY: none" width="100%">
  <TBODY>
  <TR>
    <TD colSpan=3>
      <HR color=black noShade SIZE=1>
    </TD></TR></TD></TR>
  <TR>
    <TD colSpan=3>
      <HR color=black noShade SIZE=1>
    </TD></TR></TBODY></TABLE><!-- End Calypso --><!-- Inserted by Calypso --><FONT 
color=#000000 face=VERDANA,ARIAL,HELVETICA size=-2><BR></FONT></TD></TR></TABLE><!-- End Calypso --><FONT color=#ff0000 
face="Copperplate Gothic Bold" size=5 PTSIZE="10">
<CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=#ff0000 
face="Copperplate Gothic Bold" size=5 PTSIZE="10">
<CENTER>Why Spend More Than You Have To?
<CENTER><FONT color=#ff0000 face="Copp

In [11]:
for header, value in spam_emails[0].items():
    print(header, ":", value)

Return-Path : <12a1mailbot1@web.de>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received : from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:09:41 +0100
From : 12a1mailbot1@web.de
Received : from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To : dcek1a1@netsgo.com
Subject : Life Insurance - Why Pay More?
Date : Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version : 1.0
Message-ID : <0103c1042001882DD_IT7@dd_it7>
Content-Type : text/html; charset="iso-8859-1"
Content-Transfer-Encoding : qu

In [12]:
def get_email_structure(email):
    if isinstance(email, str): 
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        multipart = ', '.join([get_email_structure(sub) for sub in payload])
        return f'multipart({multipart})'
    else: 
        return email.get_content_type()

In [13]:
def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [14]:
structures_counter(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [15]:
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]
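
To make the structure strings above easier to read, the following sketch builds a small two-part message in memory and prints its structure. `describe_structure` is a hypothetical copy of `get_email_structure`, repeated only so the block runs on its own; the message contents are invented.

```python
from email.message import EmailMessage


def describe_structure(message):
    """Same logic as get_email_structure above, repeated for a standalone run."""
    if isinstance(message, str):
        return message
    payload = message.get_payload()
    if isinstance(payload, list):
        # A multipart message: describe each sub-part recursively.
        inner = ', '.join(describe_structure(part) for part in payload)
        return f'multipart({inner})'
    return message.get_content_type()


# Build a message with a plain-text body and an HTML alternative.
demo = EmailMessage()
demo['Subject'] = 'Demo'
demo.set_content('Plain-text body')
demo.add_alternative('<p>HTML body</p>', subtype='html')

print(describe_structure(demo))
```

An entry such as `multipart/alternative` in the spam counts follows from the `else` branch: the message declares a multipart type, but its payload was not parsed into a list of parts, so only the declared content type is reported.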

### Creating training and test sets

In [16]:
X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### HTML to plain text

In [17]:
def html_to_plain_text(html):
    soup = BeautifulSoup(html, 'html.parser')

    if soup.head:
        soup.head.decompose()

    for a in soup.find_all('a'):
        a.replace_with('HYPERLINK')

    text = soup.get_text()

    lines = [line.strip() for line in text.splitlines() if line.strip()]
    text = '\n'.join(lines)

    return unescape(text)

In [18]:
html_spam_emails = [email for email in X_train[y_train==1]
                   if get_email_structure(email) == 'text/html']

sample_html_spam = html_spam_emails[0]
print(sample_html_spam.get_content())

<HTML>
<BODY BGCOLOR="#ffffff">
<P>
<<HTML>
<TABLE WIDTH=400 BORDER=0 CELLPADDING=0 CELLSPACING=0>
  <TR>
    <TD ALIGN="LEFT" VALIGN="TOP"><FONT FACE="Tahoma, Arial, Verdana" SIZE=2></FONT>
      <H2>
	<FONT COLOR="#FF0000">GET HIGH...LEGALLY!</FONT>
      </H2>
      <P>
      <B>IT REALLY WORKS!<BR>
      PASSES ALL DRUG TESTS!<BR>
      EXTREMELY POTENT!</B>
      <P>
      <A HREF="http://www.greenmatrix.net/herb/index.html"><B>CLICK HERE for more
      info on Salvia Divinorum</B></A>
      <P>
      <B> <A HREF="http://www.greenmatrix.net/herb/5x.html">CLICK HERE for SALVIA
      5X EXTRACT!</A> <BR>
      <P>
      <B> <A HREF="http://www.greenmatrix.net/herb/13x.html">CLICK HERE for SALVIA
      13X</A>. The most POTENT, LEGAL, SMOKABLE herb on the planet! 13 times the
      power of Salvia Divinorum!<BR>
      <P>
      <P>
      <BR>
      <U>Removal Information:</U><BR>
      We are strongly against sending unsolicited emails to those who do not wish
      to receive our sp

In [19]:
print(html_to_plain_text(sample_html_spam.get_content()))

<
GET HIGH...LEGALLY!
IT REALLY WORKS!
PASSES ALL DRUG TESTS!
EXTREMELY POTENT!
HYPERLINK
HYPERLINK
HYPERLINK. The most POTENT, LEGAL, SMOKABLE herb on the planet! 13 times the
power of Salvia Divinorum!
Removal Information:
We are strongly against sending unsolicited emails to those who do not wish
to receive our special mailings. You have opted in to one or more of our
affiliate sites requesting to be notified of any special offers we may run
from time to time. This is NOT unsolicited email. If you do not wish to receive
further mailings, please
HYPERLINK. Please accept our apologies if you have been sent this
email in error. We honor all removal requests within 24 hours.


### Email to plain text

In [20]:
def email_to_text(email):
    for part in email.walk():
        content_type = part.get_content_type()
        if content_type not in ('text/plain', 'text/html'):
            continue
        try:
            content = part.get_content()
        except Exception:  # e.g. an unknown charset: fall back to the raw payload
            content = str(part.get_payload())
        
        if content_type == 'text/plain':
            return content
        else:
            return html_to_plain_text(content)

In [21]:
print(email_to_text(sample_html_spam))

<
GET HIGH...LEGALLY!
IT REALLY WORKS!
PASSES ALL DRUG TESTS!
EXTREMELY POTENT!
HYPERLINK
HYPERLINK
HYPERLINK. The most POTENT, LEGAL, SMOKABLE herb on the planet! 13 times the
power of Salvia Divinorum!
Removal Information:
We are strongly against sending unsolicited emails to those who do not wish
to receive our special mailings. You have opted in to one or more of our
affiliate sites requesting to be notified of any special offers we may run
from time to time. This is NOT unsolicited email. If you do not wish to receive
further mailings, please
HYPERLINK. Please accept our apologies if you have been sent this
email in error. We honor all removal requests within 24 hours.


### Transformers

In [22]:
stemmer = nltk.PorterStemmer()
url_extractor = urlextract.URLExtract()

In [25]:
class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True,
                 remove_punctuation=True, replace_urls=True,
                 replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ''
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=len, reverse=True)
                for url in urls:
                    text = text.replace(url, ' URL ')
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [26]:
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'r': 1}),
       Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'url': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom
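
The regex-based steps of `EmailToWordCounterTransformer` can be tried on a plain string. This is a simplified sketch, not the transformer itself: URL replacement and stemming are left out because they need `urlextract` and `nltk`, and `normalize_and_count` is a hypothetical helper name.

```python
import re
from collections import Counter


def normalize_and_count(text):
    """Lowercase, replace numbers, strip punctuation, then count words."""
    text = text.lower()
    # Same number pattern as the transformer: integers, decimals, exponents.
    text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
    # Collapse every run of non-word characters into a single space.
    text = re.sub(r'\W+', ' ', text)
    return Counter(text.split())


counts = normalize_and_count('Win 1000 dollars NOW!!! Call 555-1234.')
print(counts)
```

The hyphenated phone number is matched as two separate numbers, so `NUMBER` is counted three times for this input. In the transformer, the stemmer then lowercases the token, which is consistent with stemmed tokens such as `url` appearing in lowercase in the output above.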

In [29]:
class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size

    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.vocabulary_ = {word: index + 1
                            for index, (word, count) in enumerate(most_common)}
        return self
    
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)),
                          shape=(len(X), self.vocabulary_size + 1))

In [31]:
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 20 stored elements and shape (3, 11)>

In [34]:
X_few_vectors.toarray()

array([[ 6,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [99, 11,  9,  8,  3,  1,  3,  1,  3,  2,  3],
       [67,  0,  1,  2,  3,  4,  1,  2,  0,  1,  0]])

In [33]:
vocab_transformer.vocabulary_

{'the': 1,
 'of': 2,
 'and': 3,
 'to': 4,
 'url': 5,
 'all': 6,
 'in': 7,
 'christian': 8,
 'on': 9,
 'by': 10}
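
The first column of the array above collects every word that falls outside the vocabulary. The sketch below reproduces that behaviour with plain lists instead of a sparse matrix; `build_vocabulary` and `to_dense_rows` are hypothetical helper names and the two toy documents are invented.

```python
from collections import Counter


def build_vocabulary(word_counts, vocabulary_size, cap=10):
    """Rank words by total count (capped per document) and index them from 1."""
    total = Counter()
    for word_count in word_counts:
        for word, count in word_count.items():
            total[word] += min(count, cap)
    most_common = total.most_common(vocabulary_size)
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}


def to_dense_rows(word_counts, vocabulary, vocabulary_size):
    """One row per document; column 0 sums all out-of-vocabulary words."""
    rows = []
    for word_count in word_counts:
        row = [0] * (vocabulary_size + 1)
        for word, count in word_count.items():
            row[vocabulary.get(word, 0)] += count
        rows.append(row)
    return rows


docs = [
    Counter({'the': 3, 'spam': 1}),
    Counter({'the': 1, 'offer': 2, 'free': 1}),
]
vocab = build_vocabulary(docs, vocabulary_size=2)
rows = to_dense_rows(docs, vocab, vocabulary_size=2)

print(vocab)
print(rows)
```

Summing into `row[0]` mirrors what `csr_matrix` does when it receives several entries for the same `(row, col)` pair: duplicates are added together. That explains the large values in the first column of `X_few_vectors.toarray()`.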

### Selecting and testing model

In [None]:
preprocess_pipeline = Pipeline([
    ('email_to_wordcount', EmailToWordCounterTransformer()),
    ('wordcount_to_vector', WordCounterToVectorTransformer()),
])

X_train_transformed = preprocess_pipeline.fit_transform(X_train)

In [41]:
log_clf = LogisticRegression(max_iter=1000, random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3)
print(f'Mean: {score.mean():.2%}')

Mean: 98.62%
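
`cross_val_score` reports accuracy by default for a classifier, so it helps to compare the score against a trivial baseline. Summing the structure counts printed earlier gives 2,500 ham and 500 spam emails in the full corpus. The sketch below computes the accuracy of a hypothetical classifier that always predicts ham, assuming the training folds keep roughly the same class ratio.

```python
# Class totals obtained by summing the structure counts shown earlier.
n_ham, n_spam = 2500, 500

# A classifier that always answers "ham" is right on every ham email
# and wrong on every spam email.
baseline_accuracy = n_ham / (n_ham + n_spam)

print(f'Always-ham baseline accuracy: {baseline_accuracy:.2%}')
```

Because the classes are imbalanced, a high accuracy is easier to reach than it looks, which is one reason the test cell also reports precision and recall.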


In [43]:
X_test_transformed = preprocess_pipeline.transform(X_test)
log_clf.fit(X_train_transformed, y_train)
y_pred = log_clf.predict(X_test_transformed)

print(f'Precision: {precision_score(y_test, y_pred):.2%}')
print(f'Recall: {recall_score(y_test, y_pred):.2%}')

Precision: 95.88%
Recall: 97.89%
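
For reference, precision and recall can be computed by hand from true positives, false positives, and false negatives. The sketch below uses made-up labels, not the notebook's predictions. `precision_recall` is a hypothetical helper, and returning `0.0` when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention for this sketch: report 0.0 when undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: 1 = spam, 0 = ham.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 1]

precision, recall = precision_recall(y_true_demo, y_pred_demo)
print(f'Precision: {precision:.2%}')
print(f'Recall: {recall:.2%}')
```

In spam filtering terms, precision is the share of flagged emails that really are spam, and recall is the share of spam emails that get flagged.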
