## **EXERCISE 04:**

Exercise: _Build a spam classifier (a more challenging exercise):_

* _Download examples of spam and ham from [Apache SpamAssassin's public datasets](https://homl.info/spamassassin)._
* _Unzip the datasets and familiarize yourself with the data format._
* _Split the datasets into a training set and a test set._
* _Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello," "how," "are," "you," then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word._

_You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL," replace all numbers with "NUMBER," or even perform _stemming_ (i.e., trim off word endings; there are Python libraries available to do this)._

_Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision._
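
The four-word example from the exercise statement can be reproduced with a few lines of plain Python. This is only a sketch of the target representation, not the pipeline built below; the helper name `to_vectors` and the choice to lowercase before counting are assumptions made for illustration.

```python
from collections import Counter

# Toy vocabulary from the exercise statement (lowercased by assumption).
vocabulary = ["hello", "how", "are", "you"]

def to_vectors(text, vocabulary):
    """Return (presence_vector, count_vector) for a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    count_vector = [counts[word] for word in vocabulary]
    presence_vector = [int(count > 0) for count in count_vector]
    return presence_vector, count_vector

presence, counts = to_vectors("Hello you Hello Hello you", vocabulary)
print(presence)  # [1, 0, 0, 1]
print(counts)    # [3, 0, 0, 2]
```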

---

### **DOWNLOAD SPAM DATA:**

In [35]:
import tarfile
from pathlib import Path
import urllib.request

def fetch_spam_data():
    url_root = "https://spamassassin.apache.org/old/publiccorpus/"
    ham_url = url_root + "20030228_hard_ham.tar.bz2"
    spam_url = url_root + "20050311_spam_2.tar.bz2"

    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    for dir_name, tar_name, url in (("hard_ham", "hard_ham", ham_url), ("spam_2", "spam_2", spam_url)):
        if not (data_path / dir_name).is_dir():
            path = (data_path / tar_name).with_suffix(".tar.bz2")
            print("Downloading", path)
            urllib.request.urlretrieve(url, path)
            with tarfile.open(path) as tf:
                tf.extractall(path=data_path)

    return [data_path / dir_name for dir_name in ("hard_ham", "spam_2")]


In [36]:
ham_dir, spam_dir = fetch_spam_data()
# Message files are named like 00001.<hash>; the length check skips any short-named extras.
ham_filenames = [f for f in sorted(ham_dir.iterdir()) if len(f.name) > 20]
spam_filenames = [f for f in sorted(spam_dir.iterdir()) if len(f.name) > 20]

---

### **LOOK AT DATASET:**

In [38]:
len(ham_filenames)

250

In [39]:
len(spam_filenames)

1396

In [40]:
ham_filenames[:10]

[WindowsPath('data/hard_ham/00001.7c7d6921e671bbe18ebb5f893cd9bb35'),
 WindowsPath('data/hard_ham/00002.ca96f74042d05c1a1d29ca30467cfcd5'),
 WindowsPath('data/hard_ham/00003.268fd170a3fc73bee2739d8204856a53'),
 WindowsPath('data/hard_ham/00004.68819fc91d34c82433074d7bd3127dcc'),
 WindowsPath('data/hard_ham/00005.34bcaad58ad5f598f5d6af8cfa0c0465'),
 WindowsPath('data/hard_ham/00006.3409dec8ca4fcf2d6e0582554473b5c9'),
 WindowsPath('data/hard_ham/00007.d24e99a602ee7fb442714c0d448cd08e'),
 WindowsPath('data/hard_ham/00008.b42457819236bee543bebffb61b91e44'),
 WindowsPath('data/hard_ham/00009.ddea79a02a9978cb3dafef3c05ff37a6'),
 WindowsPath('data/hard_ham/00010.e82bd1f5f7eae426682a7f8e4cbf1ae6')]

---

### **PREPROCESSING:**

We can use Python's `email` module to parse these emails (this handles headers, encoding, and so on):

In [41]:
import email
import email.parser
import email.policy

def load_email(filepath):
    with open(filepath, "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
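
To see what the parser returns without touching the dataset, the same `BytesParser` can be applied to a small in-memory message. The addresses and text below are made up for illustration.

```python
import email.parser
import email.policy

# A hand-written message (hypothetical content) in raw RFC 5322 form.
raw_message = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch?\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Are you free at noon?\r\n"
)

parser = email.parser.BytesParser(policy=email.policy.default)
message = parser.parsebytes(raw_message)

subject = message["Subject"]           # Headers are accessed like dictionary keys.
body = message.get_content().strip()   # Decoded text of a non-multipart message.
print(subject, "|", message.get_content_type(), "|", body)
```

With `email.policy.default`, the parser returns an `EmailMessage`, which is why `get_content()` is available in the cells that follow.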

In [42]:
ham_emails = [load_email(filepath) for filepath in ham_filenames]
spam_emails = [load_email(filepath) for filepath in spam_filenames]

In [43]:
ham_emails[:10]

[<email.message.EmailMessage at 0x28a535671c0>,
 <email.message.EmailMessage at 0x28a53566da0>,
 <email.message.EmailMessage at 0x28a53566530>,
 <email.message.EmailMessage at 0x28a53567070>,
 <email.message.EmailMessage at 0x28a53565180>,
 <email.message.EmailMessage at 0x28a53565300>,
 <email.message.EmailMessage at 0x28a53565030>,
 <email.message.EmailMessage at 0x28a53567760>,
 <email.message.EmailMessage at 0x28a535677c0>,
 <email.message.EmailMessage at 0x28a53566a40>]

In [44]:
print(ham_emails[1].get_content().strip())

May 7, 2002


Dear rod-3ds@arsecandle.org:


Congratulations!  On behalf of Frito-Lay, Inc., we are pleased to advise you
 that you've won Fourth Prize in the 3D's(R) Malcolm in the Middle(TM)
 Sweepstakes.   Fourth Prize consists of 1 manufacturer's coupon redeemable at
 participating retailers for 1 free bag of 3D's(R) brand snacks (up to 7 oz.
 size), with an approximate retail value of $2.59 and an expiration date of
 12/31/02.

Follow these instructions to claim your prize:

1.	Print out this email message.

2.	Complete ALL of the information requested.  Print clearly and legibly.  Sign
 where indicated.

3.	If you are under 18 or otherwise under the legal age of majority in your
 state, your parent or legal guardian must co-sign where indicated below.

4.	Mail the completed and signed form to:  3D's(R) Malcolm in the Middle(TM)
 Sweepstakes, Redemption Center, PO Box 1520, Elmhurst IL 60126.  WE MUST
 RECEIVE THIS FORM NO LATER THAN MAY 28, 2002 IN ORDER TO SEND YOU THE PRIZE.

P

In [45]:
print(spam_emails[6].get_content().strip())

NEW PRODUCT ANNOUNCEMENT

From: OUTSOURCE ENG.& MFG. INC.


Sir/Madam;

This note is to inform you of new watchdog board technology for maintaining
continuous unattended operation of PC/Servers etc. that we have released for
distribution.
  
We are proud to announce Watchdog Control Center featuring MAM (Multiple
Applications Monitor) capability.
The key feature of this application enables you to monitor as many
applications as you
have resident on any computer as well as the operating system for
continuous unattended operation.  The Watchdog Control Center featuring
MAM capability expands third party application "control" of a Watchdog as
access to the application's
source code is no longer needed.

Here is how it all works:
Upon installation of the application and Watchdog, the user may select
many configuration options, based on their model of Watchdog, to fit their
operational needs.  If the MAM feature is enabled, the user may select any
executable program that they wish for monit

Some emails are actually multipart, with images and attachments (which can have their own attachments). Let's look at the various types of structures we have:

In [46]:
def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):  # Multipart email (different content types).
        multipart = ", ".join([get_email_structure(sub_email) for sub_email in payload])
        return f"multipart({multipart})"
    else:
        return email.get_content_type()
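
As a quick check of the recursion, the sketch below builds a two-part message with the standard library and describes its structure. The helper `describe_structure` is a renamed copy of the function above so that the snippet runs on its own; the message content is made up.

```python
from email.message import EmailMessage

def describe_structure(message):
    """Same recursion as get_email_structure, copied here to be self-contained."""
    payload = message.get_payload()
    if isinstance(payload, list):  # Multipart: describe each sub-part.
        parts = ", ".join(describe_structure(part) for part in payload)
        return f"multipart({parts})"
    return message.get_content_type()

message = EmailMessage()
message["Subject"] = "Newsletter"
message.set_content("Plain-text version")                       # text/plain part
message.add_alternative("<p>HTML version</p>", subtype="html")  # adds a text/html part

structure = describe_structure(message)
print(structure)  # multipart(text/plain, text/html)
```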

In [47]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [48]:
structures_counter(ham_emails).most_common()

[('text/html', 118),
 ('text/plain', 81),
 ('multipart(text/plain, text/html)', 43),
 ('multipart(text/html)', 2),
 ('multipart(text/plain, image/bmp)', 1),
 ('multipart(multipart(text/plain, text/html))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, image/png, image/png)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif, image/jpeg, image/gif, image/gif, image/gif, image/gif, image/gif, image/gif)',
  1),
 ('multipart(text/plain, text/plain)', 1)]

In [49]:
structures_counter(spam_emails).most_common()

[('text/plain', 597),
 ('text/html', 589),
 ('multipart(text/plain, text/html)', 114),
 ('multipart(text/html)', 29),
 ('multipart(text/plain)', 25),
 ('multipart(multipart(text/html))', 18),
 ('multipart(multipart(text/plain, text/html))', 5),
 ('multipart(text/plain, application/octet-stream, text/plain)', 3),
 ('multipart(text/html, text/plain)', 2),
 ('multipart(text/html, image/jpeg)', 2),
 ('multipart(multipart(text/plain), application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(multipart(text/plain, text/html), image/jpeg, image/jpeg, image/jpeg, image/jpeg, image/jpeg)',
  1),
 ('multipart(multipart(text/plain, text/html), image/jpeg, image/jpeg, image/jpeg, image/jpeg, image/gif)',
  1),
 ('text/plain charset=us-ascii', 1),
 ('multipart(multipart(text/html), image/gif)', 1),
 ('multipart(multipart(text/plain, text/html), application/octet-stream, application/octet-stream, applic

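In this hard-ham sample, HTML-only messages are the largest group of ham (118 of 250), while spam is split almost evenly between plain text (597) and HTML (589). Multipart messages appear in both classes, and one ham email carries an S/MIME signature (`application/x-pkcs7-signature`). The two classes have visibly different structure distributions, so the email structure may be useful information to have.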

Now let's take a look at the email headers:

In [50]:
for header, value in spam_emails[0].items():
    print(header, ":", value)

Return-Path : <ilug-admin@linux.ie>
Delivered-To : yyyy@localhost.netnoteinc.com
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD	for <jm@localhost>; Tue,  6 Aug 2002 06:48:09 -0400 (EDT)
Received : from phobos [127.0.0.1]	by localhost with IMAP (fetchmail-5.9.0)	for jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST)
Received : from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for    <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100
Received : from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org    (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100
Received : from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net    [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for    <ilug@linux.ie>; Fri, 2 Aug 2002 22:50:11 +0100
Received : from 64.0.57.142 [202.63.165.34] by bettyjagessa

Next, split the data into a training set (80%) and a test set (20%):

In [51]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
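
With only 250 ham emails against 1,396 spam, a purely random split can leave the two sets with slightly different class ratios. The sketch below hand-rolls a stratified split on the label counts alone to show the idea; the function `stratified_split` is hypothetical and is not what the cell above does. In scikit-learn, passing `stratify=y` to `train_test_split` serves the same purpose.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # Per-class test quota.
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 250 + [1] * 1396  # Same class counts as the ham/spam lists above.
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts[0], test_counts[1])  # 50 279
```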

The following function first drops the `<head>` section, then converts all `<a>` tags to the word HYPERLINK, then removes all remaining HTML tags, leaving only the plain text. For readability, it also replaces multiple newlines with single newlines, and finally it unescapes HTML entities (such as `&gt;` or `&nbsp;`):

In [52]:
import re
from html import unescape

def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
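
Before running it on a real email, it can help to trace the function on a one-line HTML string. The snippet repeats the function (with raw-string patterns) so that it runs on its own; the sample HTML is made up.

```python
import re
from html import unescape

def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)   # drop <head>
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)   # mark links
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)                        # strip tags
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)                   # squeeze newlines
    return unescape(text)                                                       # decode entities

sample_html = (
    '<html><head><title>T</title></head><body>'
    '<p>Hello &amp; welcome</p><a href="http://x.com">Click</a>'
    '</body></html>'
)
plain = html_to_plain_text(sample_html)
print(repr(plain))  # 'Hello & welcome HYPERLINK Click'
```

Regex-based tag stripping is a pragmatic shortcut; malformed or unusual HTML may not be handled the way a real HTML parser would handle it.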

In [53]:
html_spam_emails = [email for email in X_train[y_train == 1]
                    if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_emails[7]
print(sample_html_spam.get_content().strip()[:1000], "...")

<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>Norton System Works </title>
</head>

<body>

<table border="1" width="100%" cellspacing="0" cellpadding="0">
  <tr>
    <td width="100%">
      <table border="0" width="100%" cellspacing="0" cellpadding="0" bgcolor="#800000">
        <tr>
          <td width="100%">
            <p align="center"><b><font face="Baskerville Old Face" color="#ffffff" size="6">Norton
            System Works</font></b>
          </td>
        </tr>
      </table>
      <table border="0" width="100%" cellspacing="0" cellpadding="0" bgcolor="#FFCC00">
        <tr>
          <td width="100%">
            <table cellSpacing="0" cellPadding="0" width="100%" bgColor="#800000" border="0">
              <tbody>
                <tr>
                  <td

In [54]:
print(html_to_plain_text(sample_html_spam.get_content())[:1000], "...")


            Norton
            System Works
                            A
                            complete problem-solving suite for advanced users
                            and small businesses.
                  Norton
                  System Works Features:
              Norton AntiVirus
                protects your PC from virus threats
              Norton Utilities
                optimizes PC performance and solves problems
              Norton CleanSweep
                cleans out Internet clutter
              GoBack by Roxio
                provides quick and easy system recovery
              Norton Ghost
                clones and upgrades your system easily
              WinFax Basic
                sends and receives professional-looking faxes
             HYPERLINK Order
            Today
            $300.00 +
            Value
            Your
            Price- $29.99
           
 HYPERLINK Click
here to unsubscribe from these
mailings.
 ...


In [55]:
def email_to_text(email):
    html = None
    for part in email.walk():  # Iterate over all parts of the email.
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # Encoding issues: fall back to the raw payload.
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    if html:
        return html_to_plain_text(html)


In [56]:
print(email_to_text(sample_html_spam)[:100], "...")


            Norton
            System Works
                            A
                          ...


Use the Natural Language Toolkit (NLTK) to perform stemming:

In [57]:
import nltk

stemmer = nltk.PorterStemmer()
for word in ("Computations", "Computation", "Computing", "Computed", "Compute",
             "Compulsive"):
    print(word, "=>", stemmer.stem(word))

Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compulsive => compuls

We also need a way to find URLs so they can be replaced with the word "URL". The `urlextract` library can do this:


In [58]:
import urlextract

url_extractor = urlextract.URLExtract()
some_text = "Will it detect github.com and https://youtu.be/7Pq-S557XQU?t=3m32s"
print(url_extractor.find_urls(some_text))

['github.com', 'https://youtu.be/7Pq-S557XQU?t=3m32s']


Put all of these together to build a transformer.

In [59]:
from sklearn.base import BaseEstimator, TransformerMixin

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        # Not used in transform(): email_to_text() only reads body parts, so headers are always dropped.
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=len, reverse=True)  # Replace the longest URLs first.
                for url in urls:
                    text = text.replace(url, " URL ")
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())
            if self.stemming and stemmer:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)
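
The text-normalization steps inside `transform()` can be tried in isolation on a single string. The sketch below keeps only the steps that need nothing beyond the standard library (lowercasing, number replacement, punctuation removal, counting); URL replacement and stemming are left out because they depend on `urlextract` and `nltk`. The helper name `normalize_and_count` and the sample sentence are made up.

```python
import re
from collections import Counter

def normalize_and_count(text):
    """Apply the lowercase / NUMBER / punctuation steps, then count words."""
    text = text.lower()
    # Integers, decimals and scientific notation all become the token NUMBER.
    text = re.sub(r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?", "NUMBER", text)
    # Any run of non-word characters becomes a single space.
    text = re.sub(r"\W+", " ", text)
    return Counter(text.split())

word_counts = normalize_and_count(
    "Call 555-1234 NOW!!! Prices from 3.5e2 dollars, call now."
)
print(word_counts["NUMBER"], word_counts["call"], word_counts["now"])  # 3 2 2
```

Without the stemming step the token stays uppercase as `NUMBER`; in the output of the full transformer it appears as `number` because the Porter stemmer lowercases its input.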


In [60]:
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'number': 164, 'email': 85, 'to': 68, 'you': 63, 'broadcast': 59, 'the': 50, 'a': 49, 'of': 39, 'our': 29, 'order': 26, 'for': 24, 'your': 24, 'have': 24, 'and': 23, 'advertis': 22, 'softwar': 22, 'is': 21, 'with': 21, 'packag': 20, 'if': 19, 'are': 17, 'by': 17, 'can': 15, 'send': 15, 'i': 15, 'free': 14, 'receiv': 14, 'in': 14, 'it': 14, 'or': 13, 'peopl': 13, 'on': 12, 'day': 12, 'that': 12, 'target': 11, 'out': 11, 'thi': 11, 'about': 10, 'address': 10, 'all': 10, 'thank': 10, 'use': 10, 'will': 10, 'internet': 9, 'at': 9, 'say': 8, 'so': 8, 'million': 8, 'they': 8, 'us': 8, 'we': 8, 'now': 7, 'mail': 7, 'through': 7, 'credit': 7, 'card': 7, 's': 6, 'just': 6, 'busi': 6, 'from': 6, 'be': 6, 'them': 6, 'as': 6, 'unlimit': 6, 'ever': 6, 'complet': 6, 'product': 5, 'servic': 5, 'everi': 5, 'what': 5, 'up': 5, 'than': 5, 'postal': 5, 'automat': 5, 'good': 5, 'within': 5, 'not': 5, 'onli': 5, 'custom': 5, 'entir': 5, 'addit': 5, 'm': 5, 'daili': 4, 'week': 4, 'profit': 4

Now we have the word counts, and we need to convert them to vectors. For this, we will build another transformer whose `fit()` method will build the vocabulary (an ordered list of the most common words) and whose `transform()` method will use the vocabulary to convert word counts to vectors. Index 0 is reserved for words that are not in the vocabulary, so the output is a sparse matrix with `vocabulary_size + 1` columns.

In [61]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows, cols, data = [], [], []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size+1))

In [62]:
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<3x11 sparse matrix of type '<class 'numpy.intc'>'
	with 24 stored elements in Compressed Sparse Row format>

In [63]:
X_few_vectors.toarray()

array([[1702,  164,   68,   49,   13,   85,   21,    7,   17,   11,   10],
       [ 235,   26,    0,    1,    0,    0,    0,    3,    0,    0,    1],
       [  29,    0,    2,    0,    1,    1,    1,    1,    1,    1,    0]],
      dtype=int32)
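
In the array above, the first column holds the total count of words that fall outside the 10-word vocabulary, which is why it is so large. The sketch below mirrors the same logic with plain lists and three made-up word counts, so each number can be checked by hand. The helper names are hypothetical, and dense rows stand in for the sparse matrix.

```python
from collections import Counter

def build_vocabulary(word_counts, vocabulary_size, cap=10):
    """Rank words by total count (each email contributes at most `cap`)."""
    totals = Counter()
    for counts in word_counts:
        for word, count in counts.items():
            totals[word] += min(count, cap)
    most_common = totals.most_common()[:vocabulary_size]
    # Indices start at 1; index 0 is kept for out-of-vocabulary words.
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}

def to_dense_rows(word_counts, vocabulary):
    rows = []
    for counts in word_counts:
        row = [0] * (len(vocabulary) + 1)
        for word, count in counts.items():
            row[vocabulary.get(word, 0)] += count  # Unknown words add up in column 0.
        rows.append(row)
    return rows

toy_counts = [
    Counter({"free": 3, "money": 1, "now": 20}),
    Counter({"meeting": 2, "now": 1}),
    Counter({"free": 1, "offer": 1}),
]
vocabulary = build_vocabulary(toy_counts, vocabulary_size=2)
rows = to_dense_rows(toy_counts, vocabulary)
print(vocabulary)  # {'now': 1, 'free': 2}
print(rows)        # [[1, 20, 3], [2, 1, 0], [1, 0, 1]]
```

The cap of 10 only affects the vocabulary ranking: "now" is ranked with a total of 11 rather than 21, but its raw count of 20 still appears in the first row.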

In [64]:
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer()),
])

X_train_transformed = preprocess_pipeline.fit_transform(X_train)

---

### **MODEL SELECTION:**

In [77]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(max_iter=1000, random_state=42)
log_score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3)
log_score.mean()

0.9589630507969198

In [67]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_score = cross_val_score(sgd_clf, X_train_transformed, y_train, cv=3)
sgd_score.mean()

0.8792069287123426

In [68]:
from sklearn.svm import SVC

svc_clf = SVC(kernel="linear", random_state=42)
svc_score = cross_val_score(svc_clf, X_train_transformed, y_train, cv=3)
svc_score.mean()

0.9430125198059794

In [78]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_score = cross_val_score(forest_clf, X_train_transformed, y_train, cv=3)
forest_score.mean()

0.9559258450262288
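
The scores above are accuracies, since `cross_val_score` falls back to the classifier's own `score()` method when no `scoring` argument is given. Because the dataset is imbalanced, it is worth comparing them with the accuracy of always predicting "spam". The arithmetic below uses the full dataset counts reported earlier; the exact ratio inside the training folds may differ slightly.

```python
# Class counts reported earlier in this notebook.
n_ham, n_spam = 250, 1396

# Accuracy of a classifier that labels every email as spam.
baseline_accuracy = n_spam / (n_ham + n_spam)
print(f"Majority-class baseline: {baseline_accuracy:.2%}")  # 84.81%
```

Against this baseline, the `SGDClassifier` score of about 0.879 is only a modest improvement, while the other three models are clearly better.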

In [79]:
from sklearn.metrics import precision_score, recall_score

X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(max_iter=1000, random_state=42)
log_clf.fit(X_train_transformed, y_train)
y_pred = log_clf.predict(X_test_transformed)

print(f"Precision: {precision_score(y_test, y_pred):.2%}")
print(f"Recall: {recall_score(y_test, y_pred):.2%}")

Precision: 96.82%
Recall: 98.56%
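
For reference, precision and recall can be computed by hand from the counts of true positives, false positives and false negatives. The labels below are toy values chosen for illustration, not the test-set predictions from this notebook.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)

# Toy labels: 1 = spam, 0 = ham.
toy_true = [1, 1, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 1, 0]

precision, recall = precision_recall(toy_true, toy_pred)
print(f"Precision: {precision:.2%}")  # 80.00%
print(f"Recall: {recall:.2%}")        # 80.00%
```

Here precision is the share of emails flagged as spam that really are spam, and recall is the share of actual spam that gets flagged.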
