# Spam Classification

I was first exposed to this exercise in [Andrew Ng's Intro to Machine Learning class on Coursera](https://www.coursera.org/learn/machine-learning). I revisited it through the end-of-chapter-3 exercise in [Aurélien Géron's Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) to extend my understanding of the concepts and of the relevant Python tools. I am using Aurélien's code on [GitHub](https://github.com/ageron/handson-ml2), with some tweaks, on the 20050311_spam_2 and 20030228_hard_ham data from the [SpamAssassin public corpus](http://spamassassin.apache.org/old/publiccorpus/).

In [1]:
# setup environment
import os
import urllib.request
import tarfile
import email
import email.parser
import email.policy
import nltk
import urlextract
import numpy as np
import re


from collections import Counter
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score

In [2]:
# get email data from spamAssassin public corpus
ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
SPAM_URL = "".join([ROOT, "20050311_spam_2.tar.bz2"])
HAM_URL = "".join([ROOT, "20030228_hard_ham.tar.bz2"])
DATA_PATH = os.path.join("data", "spamClf")

# define function to pull and extract data
def getSpamData(dataUrl=SPAM_URL, dataPath=DATA_PATH):
    '''pull and extract SpamAssassin email data (spam_2 and hard_ham)'''
    # note: dataUrl is kept for compatibility; both archives are always fetched
    os.makedirs(dataPath, exist_ok=True)
    for f, url in [("spam_2.tar.bz2", SPAM_URL), ("hard_ham.tar.bz2", HAM_URL)]:
        path = os.path.join(dataPath, f)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
            # extract into dataPath (not the global DATA_PATH)
            with tarfile.open(path) as tar_bz2_f:
                tar_bz2_f.extractall(path=dataPath)

In [3]:
getSpamData(dataUrl=SPAM_URL, dataPath=DATA_PATH) # call function

In [4]:
# load data (emails)
SPAM_DIR = os.path.join(DATA_PATH, "spam_2")
HAM_DIR = os.path.join(DATA_PATH, "hard_ham")

# email filenames are 38 characters long; the length filter skips shorter non-email entries
spam_files = [f for f in sorted(os.listdir(SPAM_DIR)) if len(f) >= 35]
ham_files = [f for f in sorted(os.listdir(HAM_DIR)) if len(f) >= 35]

In [5]:
# number of pulled spam and ham files
print(f"Number of spam files: {len(spam_files)}")
print(f"Number of ham files: {len(ham_files)}")

Number of spam files: 1396
Number of ham files: 250


This data set (spam_2 and hard_ham) has a considerably larger ratio of spam to ham files (1396/250) than the "spam" and "easy_ham" sets, where the ratio was 500/2500.
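
The class ratio matters when reading accuracy-style scores later on. As a rough sketch (the helper name `majority_baseline_accuracy` is made up for illustration), the accuracy of a trivial classifier that always predicts the majority class can be computed from the file counts quoted above:

```python
def majority_baseline_accuracy(n_spam, n_ham):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_spam, n_ham) / (n_spam + n_ham)

# file counts quoted in this notebook
hard_set = majority_baseline_accuracy(n_spam=1396, n_ham=250)
easy_set = majority_baseline_accuracy(n_spam=500, n_ham=2500)

print(f"spam_2 + hard_ham baseline: {hard_set:.3f}")
print(f"spam + easy_ham baseline:   {easy_set:.3f}")
```

Any accuracy score on this data should be judged against that baseline rather than against 50%.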

In [6]:
# function to load emails (create email parser instance)
def loadEmail(isSpam, file, spamPath=DATA_PATH):
    if isSpam:
        folder = "spam_2"
    else:
        folder = "hard_ham"
    with open(os.path.join(spamPath, folder, file), "rb") as f:
        # the emails are complete files on disk, so the parser API is used
        # (rather than the incremental feedparser API); create a BytesParser instance
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [7]:
spam_emails = [loadEmail(isSpam=True, file=f) for f in spam_files]
ham_emails = [loadEmail(isSpam=False, file=f) for f in ham_files]

In [8]:
# get email structures
def getEmailStructure(email):
    '''get structure of email'''
    if isinstance(email, str):
        return email
    payload = email.get_payload() # list of multipart message objects
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            getEmailStructure(subEmail) for subEmail in payload
        ]))
    else:
        return email.get_content_type() # email's content type
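
To make the idea of an email "structure" concrete, here is a small sketch that builds a synthetic two-part message with the standard library and describes it in the same nested style. The message contents and the helper name `describe_structure` are assumptions for illustration only; it is not the function used in this notebook.

```python
from email.message import EmailMessage

def describe_structure(message):
    """Return a nested description of a message's MIME structure."""
    if message.is_multipart():
        parts = ", ".join(describe_structure(p) for p in message.iter_parts())
        return f"multipart({parts})"
    return message.get_content_type()

# synthetic message with a plain-text body and an HTML alternative
msg = EmailMessage()
msg["Subject"] = "Synthetic example"
msg.set_content("Plain-text body")
msg.add_alternative("<html><body><p>HTML body</p></body></html>", subtype="html")

structure = describe_structure(msg)
part_types = [part.get_content_type() for part in msg.walk()]

print(structure)   # multipart(text/plain, text/html)
print(part_types)  # ['multipart/alternative', 'text/plain', 'text/html']
```

Note that `walk()` also yields the multipart container itself, which is why code that extracts text has to skip parts that are neither `text/plain` nor `text/html`.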

In [9]:
# split data into training and test sets

# list of email objects
X = np.array(spam_emails + ham_emails, dtype=object)

# target labels
y = np.array([1] * len(spam_emails) + [0] * len(ham_emails))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [10]:
# print a ham sample (the fourth one) from X_train
print(X_train[y_train==0][3].get_content().strip())

I am trying to secure three of four virtual hostnames on our Apache server.
We are not taking credit card orders or user's personal information, but are
merely hoping to secure email and calendar web transactions for our users.
We are not running any secure applications on the root host.

I have been testing this week with CA, client, and host certificate
requests, certificates, and keys, and think I have a fairly good beginner's
grasp of the commands and command line options.


My questions are:

1.  Is it necessary to create a CA certificate for each of the secure
virtual hosts, or can one CA certificate for the root be used to sign each
of the keys for all three common names we are trying to secure?

2.  Even though the root host is not conducting secure transactions, am I
correct in configuring the server with a CACertificateFile in the main body
of httpsd.conf and then setting the CACertificateFile for each virtual host
in the <Virtual . . .> section of httpsd.conf?  This sort of 

In [11]:
# preprocess ...

# function to convert an HTML email body into plain text
def html2text(html):
    soup = BeautifulSoup(html, features="lxml")
    
    if soup.head: soup.head.decompose() # remove the <head> section
    for a in soup.find_all("a"):
        a.replace_with(" HYPERLINK ") # replace each <a> tag with the text HYPERLINK
    for s in soup(["script", "style"]):
        s.decompose() # remove <script> and <style> tags and their contents
    text = ' '.join(soup.stripped_strings) # join the remaining text fragments
    return text

In [12]:
# check html2text works - print spam html
htmlSpamEmails = [email for email in X_train[y_train==1]
                 if getEmailStructure(email) == "text/html"]

sampleHtmlSpam = htmlSpamEmails[10]
# first 500 characters
print(sampleHtmlSpam.get_content().strip()[:500], "...")

<html><body bgColor="#CCCCCC" topmargin=1 onMouseOver="window.status=''; return true" oncontextmenu="return false" ondragstart="return false" onselectstart="return false">
<div align="center">Hello, jm@netnoteinc.com<BR><BR></div><div align="center"></div><p align="center"><b><font face="Arial" size="4">Human Growth Hormone Therapy</font></b></p>
<p align="center"><b><font face="Arial" size="4">Lose weight while building lean muscle mass<br>and reversing the ravages of aging all at once.</font>< ...


In [13]:
# check html2text works - print spam text (using html2text function)
print(html2text(sampleHtmlSpam.get_content())[:500], "...")

Hello, jm@netnoteinc.com Human Growth Hormone Therapy Lose weight while building lean muscle mass and reversing the ravages of aging all at once. Remarkable discoveries about Human Growth Hormones ( HGH ) are changing the way we think about aging and weight loss. Lose Weight Build Muscle Tone Reverse Aging Increased Libido Duration Of Penile Erection Healthier Bones Improved Memory Improved skin New Hair Growth Wrinkle Disappearance HYPERLINK You are receiving this email as a subscr iber to the  ...


In [15]:
# preprocess ...
# function to convert any email content to plain text
def email2text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception: # e.g. encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content # prefer the first plain-text part
        else:
            html = content
    if html:
        return html2text(html)
    # falls through and returns None when the email has no text part

In [16]:
print(email2text(sampleHtmlSpam)[:100], "...")

Hello, jm@netnoteinc.com Human Growth Hormone Therapy Lose weight while building lean muscle mass an ...


In [17]:
# preprocess ...
stemmer = nltk.PorterStemmer() # initialize stemmer
url_extractor = urlextract.URLExtract() # initialize url extractor

In [18]:
# preprocess ...
# class to convert emails to word counters
# nltk.download("stopwords") # uncomment on first run to fetch the stop-word list

class Emails2WordCounts(BaseEstimator, TransformerMixin):
    '''convert emails to word counters'''
    def __init__(self, strip_headers=True, replace_urls=True,
                 replace_numbers=True, remove_punctuation=True,
                 stemming=True, lowercase=True,
                 remove_stopwords=True):
        self.strip_headers = strip_headers
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.remove_punctuation = remove_punctuation
        self.stemming = stemming
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # build the stop-word set once instead of once per word
        stop_words = (set(stopwords.words('english'))
                      if self.remove_stopwords else set())
        X_transformed = []
        for email in X:
            text = email2text(email) or ''
            if self.lowercase:
                text = text.lower()
            if self.replace_urls:
                urls = list(set(url_extractor.find_urls(text)))
                # replace the longest URLs first
                urls.sort(key=len, reverse=True)
                for url in urls:
                    text = text.replace(url, "URL")
            if self.replace_numbers:
                regx = re.compile(r'\d+(?:\.\d*(?:[eE]\d+))?')
                text = regx.sub(string=text, repl="NUMBER")
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
                regx = re.compile(r"([^\w\s]+)|([_-]+)")
                text = regx.sub(string=text, repl=" ")
            words = text.split()
            if self.remove_stopwords:
                words = [word for word in words if word not in stop_words]
            word_counts = Counter(words)
            if self.stemming:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [19]:
# check it on the first two training emails
X_few = X_train[:2]
X_few_wc = Emails2WordCounts().fit_transform(X_few)
X_few_wc

array([Counter({'dnumber': 43, 'bnumber': 38, 'anumb': 29, 'number': 24, 'cnumber': 21, 'url': 14, 'free': 12, 'get': 9, 'cb': 9, 'account': 8, 'ca': 8, 'fnumber': 8, 'download': 7, 'instal': 7, 'softwar': 7, 'open': 7, 'bc': 7, 'ba': 7, 'cd': 6, 'bb': 6, 'purchas': 5, 'fe': 5, 'ac': 5, 'enumb': 5, 'al': 4, 'bonu': 4, 'ce': 4, 'sign': 3, 'ee': 3, 'bf': 3, 'cc': 3, 'cf': 3, 'email': 2, 'member': 2, 'remov': 2, 'l': 2, 'real': 2, 'requir': 2, 'de': 2, 'da': 2, 'dd': 2, 'ab': 2, 'bd': 2, 'df': 2, 'spam': 1, 'list': 1, 'topdollaremail': 1, 'opt': 1, 'servic': 1, 'see': 1, 'receiv': 1, 'easi': 1, 'rea': 1, 'buy': 1, 'extra': 1, 'paypal': 1, 'first': 1, 'purch': 1, 'ase': 1, 'non': 1, 'paid': 1, 'futur': 1, 'mail': 1, 'pleas': 1, 'unsubscrib': 1, 'click': 1, 'user': 1, 'info': 1, 'scroll': 1, 'bottom': 1, 'page': 1, 'lose': 1, 'referr': 1, 'money': 1, 'owe': 1, 'enumbersmtp': 1, 'eb': 1, 'anumberpc': 1, 'anumbersmtp': 1, 'ec': 1, 'fb': 1, 'fd': 1, 'aa': 1, 'ed': 1, 'http': 1, 'www': 1, 'seek
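
Tokens such as `number` in the counter above come from the normalisation steps. The sketch below replays the lowercase, number-replacement, and punctuation steps on a made-up sentence using only the standard library. It reuses the number pattern from the transformer but leaves out URL replacement, stop-word removal, and stemming, so it is an illustration rather than the full transformer.

```python
import re
from collections import Counter

# same number pattern as in the transformer above
NUMBER_RE = re.compile(r'\d+(?:\.\d*(?:[eE]\d+))?')

def normalise(text):
    """Lowercase, replace digit runs with NUMBER, strip punctuation, split."""
    text = text.lower()
    text = NUMBER_RE.sub("NUMBER", text)
    text = re.sub(r'\W+', ' ', text)
    return text.split()

# hypothetical sample sentence
tokens = normalise("Win $1000 now!!! Visit us, 24/7.")
counts = Counter(tokens)

print(tokens)
print(counts["NUMBER"])  # 3
```

In the notebook the stemmer later lowercases these placeholders, which is why they show up as `number` and `url` in the output.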

In [20]:
# preprocess ...
# build an ordered list of the most common words and convert word counts to vectors
class WordCounts2Vector(BaseEstimator, TransformerMixin):
    '''convert word counts to vectors'''
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X: # counter dictionary object
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in 
                            enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), 
                          shape=(len(X), self.vocabulary_size + 1))

In [21]:
vocab_transformer = WordCounts2Vector(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wc)
X_few_vectors

<2x11 sparse matrix of type '<class 'numpy.longlong'>'
	with 13 stored elements in Compressed Sparse Row format>
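
The `min(count, 10)` cap in `fit` keeps a single email that repeats one word many times from dominating the vocabulary ranking. A minimal standalone sketch of that idea follows; the function name `build_vocabulary` and the toy counters are invented for illustration.

```python
from collections import Counter

def build_vocabulary(word_counts, vocabulary_size, cap=10):
    """Rank words by their per-email counts, each capped at `cap`."""
    total = Counter()
    for word_count in word_counts:
        for word, count in word_count.items():
            total[word] += min(count, cap)
    most_common = total.most_common(vocabulary_size)
    # index 0 is reserved for out-of-vocabulary words
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}

# toy data: one email repeats "number" 50 times
emails = [
    Counter({"number": 50, "free": 2}),
    Counter({"free": 9, "url": 3}),
    Counter({"free": 1, "url": 4}),
]

capped = build_vocabulary(emails, vocabulary_size=2)
uncapped = build_vocabulary(emails, vocabulary_size=2, cap=10**9)

print(capped)    # {'free': 1, 'number': 2}
print(uncapped)  # {'number': 1, 'free': 2}
```

With the cap, `free` (spread across three emails) outranks `number` (concentrated in one email); without it, the order flips.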

In [22]:
X_few_vectors.toarray()
# the first column counts the words in that row (email) that
#     do not appear in the vocabulary
# the remaining columns count how often each vocabulary
#     word appears in the email
# the vocabulary itself is available through the vocabulary_
#     attribute of the transformer

array([[184,  14,  24,  12,  38,  43,  21,  29,   9,   9,   8],
       [ 42,   3,   0,   0,   0,   0,   0,   0,   0,   0,   0]],
      dtype=int64)

In [23]:
vocab_transformer.vocabulary_

{'url': 1,
 'number': 2,
 'free': 3,
 'bnumber': 4,
 'dnumber': 5,
 'cnumber': 6,
 'anumb': 7,
 'get': 8,
 'cb': 9,
 'account': 10}
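
To show how a word counter maps onto one row of the matrix, here is a dense, pure-Python sketch of the same indexing scheme. The helper `counts_to_row` and the toy data are assumptions for illustration. The notebook builds a sparse `csr_matrix` instead, where repeated (row, column) entries are summed, which has the same effect on column 0.

```python
from collections import Counter

def counts_to_row(word_count, vocabulary, vocabulary_size):
    """Dense row: index 0 collects out-of-vocabulary counts."""
    row = [0] * (vocabulary_size + 1)
    for word, count in word_count.items():
        row[vocabulary.get(word, 0)] += count
    return row

# toy vocabulary and email
vocabulary = {"url": 1, "number": 2, "free": 3}
email_counts = Counter({"url": 14, "free": 12, "paypal": 1, "bonu": 4})

row = counts_to_row(email_counts, vocabulary, vocabulary_size=3)
print(row)  # [5, 14, 0, 12]
```

Here `paypal` and `bonu` are not in the vocabulary, so their counts (1 + 4) land in column 0.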

In [24]:
# preprocessing pipeline (fit on the training set)
preprocess_pipeline = Pipeline([
    ("email2wordcounts", Emails2WordCounts()),
    ("wordcounts2vector", WordCounts2Vector()),
])

%time X_train_transformed = preprocess_pipeline.fit_transform(X_train)

CPU times: user 2min, sys: 24.8 s, total: 2min 25s
Wall time: 3min 57s


In [25]:
log_clf = LogisticRegression(solver="lbfgs", 
                             max_iter=1000, random_state=42)
score = cross_val_score(log_clf, X_train_transformed,
                        y_train, cv=3, verbose=3)

print(f"\nMean cross validation score: {score.mean():.4f}")

[CV]  ................................................................
[CV] .................................... , score=0.954, total=   0.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV] .................................... , score=0.952, total=   0.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.4s remaining:    0.0s


[CV] .................................... , score=0.952, total=   0.3s

Mean cross validation score: 0.9529


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.7s finished


In [26]:
%time X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(solver="lbfgs", max_iter=1000,
                             random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("\nPrecision: {:.2f}%".format(
    100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

CPU times: user 34.3 s, sys: 7.34 s, total: 41.7 s
Wall time: 1min 8s

Precision: 97.60%
Recall: 98.62%


The logistic regression model gives reasonably high precision and recall scores on the test set. Let's check the performance of scikit-learn's SVM classifier (SVC) tuned with GridSearchCV.

In [27]:
# estimator
svm_clf = SVC()

# grid search parameters
param_grid = {
    'C': np.logspace(-1, 2, 10), # 0.1-100
    'gamma': np.logspace(-1, 1, 10), # 0.1-10 (ignored by the linear kernel)
    'kernel': ["linear", "rbf"]
}

# set up the grid search
grid_search = GridSearchCV(svm_clf, param_grid, cv=3, verbose=1)

In [28]:
# run the grid search; the best model is then refit on the full training set
%time grid_search.fit(X_train_transformed, y_train)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:  3.7min finished


CPU times: user 3min 39s, sys: 807 ms, total: 3min 40s
Wall time: 3min 43s


GridSearchCV(cv=3, estimator=SVC(),
             param_grid={'C': array([  0.1       ,   0.21544347,   0.46415888,   1.        ,
         2.15443469,   4.64158883,  10.        ,  21.5443469 ,
        46.41588834, 100.        ]),
                         'gamma': array([ 0.1       ,  0.16681005,  0.27825594,  0.46415888,  0.77426368,
        1.29154967,  2.15443469,  3.59381366,  5.9948425 , 10.        ]),
                         'kernel': ['linear', 'rbf']},
             verbose=1)

In [29]:
# print best parameters and cross validation score
print(grid_search.best_score_)
print(grid_search.best_params_)

0.9506055342327069
{'C': 0.1, 'gamma': 0.1, 'kernel': 'linear'}
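
The candidate and fit counts reported by the grid search follow directly from the grid dimensions. The sketch below reproduces that arithmetic with the standard library only; the lists are a stand-in for `np.logspace`. It also counts how many candidates are really distinct, given that `SVC` does not use `gamma` with the linear kernel.

```python
from itertools import product

# stand-ins for np.logspace(-1, 2, 10) and np.logspace(-1, 1, 10)
c_values = [10 ** (-1 + 3 * i / 9) for i in range(10)]
gamma_values = [10 ** (-1 + 2 * i / 9) for i in range(10)]
kernels = ["linear", "rbf"]
n_folds = 3

candidates = list(product(c_values, gamma_values, kernels))
n_fits = len(candidates) * n_folds

# gamma has no effect on the linear kernel, so drop it for those candidates
distinct = {
    (c, kernel, gamma if kernel == "rbf" else None)
    for c, gamma, kernel in candidates
}

print(len(candidates), n_fits, len(distinct))  # 200 600 110
```

This also explains why the best parameters list a `gamma` value next to a linear kernel: the value is reported but plays no role in that model. Searching the two kernels with separate parameter grids would avoid the redundant fits.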


In [30]:
# evaluate metrics on the test set
y_pred = grid_search.predict(X_test_transformed)

print("\nPrecision: {:.2f}%".format(
    100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))


Precision: 98.26%
Recall: 97.92%


On the test set, the grid-searched SVM classifier performed comparably to the logistic regression classifier: both pipelines reached precision and recall scores above 97.5%.
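
For reference, precision is the share of emails flagged as spam that really are spam, and recall is the share of actual spam that gets flagged. A minimal pure-Python sketch of both definitions follows. The labels are made up, and the notebook itself uses `precision_score` and `recall_score` from scikit-learn.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# toy labels: 1 = spam, 0 = ham
y_true_toy = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 1, 0, 0]

precision, recall = precision_recall(y_true_toy, y_pred_toy)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")  # Precision: 0.75, Recall: 0.60
```

In the toy data, three of the four emails flagged as spam are spam (precision 0.75), and three of the five spam emails are caught (recall 0.60).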