# Frame the problem and look at the big picture
This is exercise #4 from Chapter 3 of Hands-On Machine Learning, which covers classification. 

1. __Define the objective.__ The task is to build a spam classifier using data from [Apache SpamAssassin's public datasets](https://spamassassin.apache.org/old/publiccorpus/) ([README](https://spamassassin.apache.org/old/publiccorpus/readme.html)). 

2. __How will your solution be used?__ This could be used by an email service like Gmail to automatically flag and filter out messages that have a high probability of being spam. 

3. __What are the current solutions/workarounds (if any)?__ Spam filters already exist in most, if not all, email services. Even so, some spam still gets through, and users have to identify it on their own. When that happens, users can mark the message as spam, and future emails from that address will be automatically marked as spam. 

4. __How should you frame this problem (supervised/unsupervised, online/offline, etc.)?__ This is a __supervised classification task__ because we are making binary predictions (spam vs. ham) and we have labeled training data. Ideally, this would be an online learning task: spammers never stop thinking of new ways to trick people, so the model should keep learning from new data. For the sake of this exercise, however, it will be an offline system trained on static data. 

5. __How should performance be measured?__ This is a bit more complex than the Titanic task, for example, where _accuracy_ was our primary performance metric. Is it better to lean further towards spam and risk misclassifying safe emails as spam? Or is it better to lean further towards "ham" and risk missing spam emails?
    - According to ChatGPT: "For a spam classifier, it is generally more fitting to __prioritize precision over recall__." This is because a spam classifier that marks safe emails as spam is a real nuisance to users, whereas missing a few spam emails is not as big a deal. By focusing on precision, we will minimize false positives and reduce the chances of misclassifying safe emails. 
    
    - `Precision = TP / (TP + FP)`
    
    - That being said, __there should also be a minimum recall.__ I could predict just one instance as spam, and if I'm correct, that means my precision is 100%. However, recall would be way too low.  
    
6. __Is the performance measure aligned with the business objective?__ Yes. Optimizing for precision will reduce the number of false positives, and therefore create a better hypothetical user experience. Since the spam classifier would most likely be used in a commercial product, user experience is the most important thing. 

7. __What would be the minimum performance needed to reach the business objective?__ It seems that a __precision of 95%__ is the widely recommended benchmark for spam classifiers. For the minimum recall, I'll first look at the baseline performance of several models to estimate what a realistic floor might be. 

8. __What are comparable problems? Can you reuse experience or tools?__ Beyond the Titanic dataset, this is really my first end-to-end classification project. From my work on the [MNIST dataset](https://github.com/iherman10/mnist-classification/blob/main/chapter_3.ipynb), I can reuse some charting functions to analyze performance and compare models. From my work on the [Titanic dataset](https://github.com/iherman10/titanic/blob/main/titanic_2.ipynb), I can reuse the custom transformer class structures for data transformation.

9. __Is human expertise available?__ The internet :) I'm working on this solo. 

10. __How would you solve the problem manually?__ When I try to eyeball whether an email is spam or not, I consider:
    - Have I received emails from this sender before?
    - Are there obvious typos?
    - Are there links? 
    - Are they asking for money? Or credit card information? Etc. 
    - Are they writing in all caps? Or all lower-case? 
    
    These considerations might influence the feature engineering part of this project. 

11. __List assumptions you've made so far.__ 
    - I should be able to identify most spam just by looking at it. 
    
    - I'll have to leverage text transformations to process the data. 
    
    - Baseline models should get me most of the way towards my performance goal, and feature engineering/hyperparameter tuning will get me the rest of the way.  

12. __Verify assumptions if possible.__ TBD...
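
To make the metric from item 5 concrete, here is a minimal sketch that computes precision and recall from two label lists and checks them against both targets. The 95% precision target comes from item 7; the `min_recall=0.5` default is only a placeholder assumption until I have baseline numbers, and the function names are my own. In the actual project, `precision_score` and `recall_score` from `sklearn.metrics` would do the counting.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 = spam and 0 = ham."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam flagged as spam
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that got through
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def meets_target(y_true, y_pred, min_precision=0.95, min_recall=0.5):
    """Check both thresholds; min_recall=0.5 is a placeholder, not a measured value."""
    precision, recall = precision_recall(y_true, y_pred)
    return precision >= min_precision and recall >= min_recall


# The degenerate case from item 5: flag a single email as spam and be right about it.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # precision 1.0, recall 0.25
print(meets_target(y_true, y_pred))      # False: recall is below the floor
```

The example shows why precision alone is not enough: the single correct prediction gives perfect precision while three of the four spam messages get through.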

In [38]:
# Libraries 
import pandas as pd 
import numpy as np 

import os
import tarfile 
import urllib.request

import joblib

import email 
import email.parser  # imported explicitly because load_email uses email.parser.BytesParser
import email.policy

from collections import Counter

from sklearn.model_selection import train_test_split

# Get the data
Available data from the SpamAssassin website: 
- `spam`: 500 spam messages 
- `spam_2`: 1396 spam messages, added more recently. 
- `easy_ham`: 2500 non-spam messages, fairly easy to differentiate.  
- `easy_ham_2`: 1400 non-spam messages, added more recently. 
- `hard_ham`: 250 non-spam messages, harder to differentiate. 

In total, there are __6046__ messages with about a 31% spam ratio. 
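
As a quick sanity check on those totals, here is the arithmetic spelled out. The per-folder counts are the ones the notebook prints after listing the files; the dictionary and variable names are just for illustration.

```python
# Per-folder message counts, as printed by the notebook after listing the files
counts = {
    "spam": 500,
    "spam_2": 1396,
    "easy_ham": 2500,
    "easy_ham_2": 1400,
    "hard_ham": 250,
}

spam_total = counts["spam"] + counts["spam_2"]  # both spam folders
total = sum(counts.values())                    # every message, spam and ham
spam_ratio = spam_total / total

print(f"{total} messages, {spam_total} spam ({spam_ratio:.1%})")
# 6046 messages, 1896 spam (31.4%)
```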

In [2]:
# Download data 
DOWNLOAD_ROOT = 'http://spamassassin.apache.org/old/publiccorpus/'

SPAM_URL = DOWNLOAD_ROOT + '20030228_spam.tar.bz2'
SPAM_2_URL = DOWNLOAD_ROOT + '20050311_spam_2.tar.bz2'
EASY_HAM_URL = DOWNLOAD_ROOT + '20030228_easy_ham.tar.bz2'
EASY_HAM_2_URL = DOWNLOAD_ROOT + '20030228_easy_ham_2.tar.bz2'
HARD_HAM_URL = DOWNLOAD_ROOT + '20030228_hard_ham.tar.bz2'

SPAM_PATH = os.path.join('datasets', 'spam')

def fetch_spam_data(spam_url=SPAM_URL, 
                    spam_2_url=SPAM_2_URL, 
                    easy_ham_url=EASY_HAM_URL, 
                    easy_ham_2_url=EASY_HAM_2_URL, 
                    hard_ham_url=HARD_HAM_URL,  
                    spam_path=SPAM_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(spam_path, exist_ok=True)
    # Use the URL arguments (not the module-level constants) so overrides take effect
    for filename, url in (('spam.tar.bz2', spam_url), 
                          ('spam_2.tar.bz2', spam_2_url), 
                          ('easy_ham.tar.bz2', easy_ham_url), 
                          ('easy_ham_2.tar.bz2', easy_ham_2_url), 
                          ('hard_ham.tar.bz2', hard_ham_url)):
        path = os.path.join(spam_path, filename)
        # Download each archive only once
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        # Extract the archive; the context manager closes it even if extraction fails
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

In [3]:
fetch_spam_data()

In [4]:
# List the message files in each folder 
# (the length filter keeps only the long, hash-style message filenames)
SPAM_DIR = os.path.join(SPAM_PATH, 'spam')
SPAM_2_DIR = os.path.join(SPAM_PATH, 'spam_2')
EASY_HAM_DIR = os.path.join(SPAM_PATH, 'easy_ham')
EASY_HAM_2_DIR = os.path.join(SPAM_PATH, 'easy_ham_2')
HARD_HAM_DIR = os.path.join(SPAM_PATH, 'hard_ham')

spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]
spam_2_filenames = [name for name in sorted(os.listdir(SPAM_2_DIR)) if len(name) > 20]
easy_ham_filenames = [name for name in sorted(os.listdir(EASY_HAM_DIR)) if len(name) > 20]
easy_ham_2_filenames = [name for name in sorted(os.listdir(EASY_HAM_2_DIR)) if len(name) > 20]
hard_ham_filenames = [name for name in sorted(os.listdir(HARD_HAM_DIR)) if len(name) > 20]

In [9]:
print(f"""
spam: {len(spam_filenames)} files
spam_2: {len(spam_2_filenames)} files
easy_ham: {len(easy_ham_filenames)} files
easy_ham_2: {len(easy_ham_2_filenames)} files 
hard_ham: {len(hard_ham_filenames)} files 
""")


spam: 500 files
spam_2: 1396 files
easy_ham: 2500 files
easy_ham_2: 1400 files 
hard_ham: 250 files 



In [16]:
# Parse emails 
def load_email(directory, filename, spam_path=SPAM_PATH):
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
    
spam_emails = [load_email(directory='spam', filename=name) for name in spam_filenames]
spam_2_emails = [load_email(directory='spam_2', filename=name) for name in spam_2_filenames]
easy_ham_emails = [load_email(directory='easy_ham', filename=name) for name in easy_ham_filenames]
easy_ham_2_emails = [load_email(directory='easy_ham_2', filename=name) for name in easy_ham_2_filenames]
hard_ham_emails = [load_email(directory='hard_ham', filename=name) for name in hard_ham_filenames]

In [46]:
# Create train and test sets 
X = np.array(spam_emails
             + spam_2_emails
             + easy_ham_emails
             + easy_ham_2_emails
             + hard_ham_emails,
             dtype=object)

# Labels: 1 = spam, 0 = ham
y = np.array([1] * len(spam_emails)
             + [1] * len(spam_2_emails)
             + [0] * len(easy_ham_emails)
             + [0] * len(easy_ham_2_emails)
             + [0] * len(hard_ham_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
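
Since only about a third of the messages are spam, it seems worth checking that the random split keeps a similar spam share in both sets. Below is a minimal sketch of such a check; `spam_fraction` is a helper name I made up, and the toy labels stand in for `y_train` and `y_test`. If the two shares drift apart, one option is to pass `stratify=y` to `train_test_split`.

```python
def spam_fraction(labels):
    """Fraction of labels equal to 1 (spam); returns 0.0 for an empty input."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == 1) / len(labels)


# Toy labels standing in for y_train and y_test
toy_train = [1, 0, 0, 1, 0, 0, 0, 0]
toy_test = [1, 0]

print(f"train: {spam_fraction(toy_train):.2f}, test: {spam_fraction(toy_test):.2f}")
# train: 0.25, test: 0.50
```

With the real arrays, the call would be `spam_fraction(y_train)` and `spam_fraction(y_test)`.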

# Explore the data
Let's start by examining examples of ham vs. spam to understand what the data looks like.

In [26]:
# Ham
print(easy_ham_emails[1].get_content().strip())

Martin A posted:
Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the
 limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the
 Mount Athos monastic community, was ideal for the patriotic sculpture. 
 
 As well as Alexander's granite features, 240 ft high and 170 ft wide, a
 museum, a restored amphitheatre and car park for admiring crowds are
planned
---------------------
So is this mountain limestone or granite?
If it's limestone, it'll weather pretty fast.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
forteana-unsubscribe@egroups.com

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/


In [27]:
# Spam
print(spam_emails[6].get_content().strip())

Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


4139vOLW7-758DoDY1425FRhM1-764SMFc8513fCsLl40


Some emails are multipart, with images and attachments. Let's look at different types of structures. 

In [28]:
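def get_email_structure(message):
    """Describe a message's MIME structure as a string, e.g. 'multipart(text/plain, text/html)'."""
    # A payload item that is already a plain string is returned unchanged
    if isinstance(message, str):
        return message
    payload = message.get_payload()
    # Multipart messages hold a list of sub-messages: describe each one recursively
    if isinstance(payload, list):
        return f'multipart({", ".join([get_email_structure(sub_email) for sub_email in payload])})'
    # Otherwise report the declared content type of this single part
    return message.get_content_type()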

In [31]:
def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [32]:
structures_counter(easy_ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [33]:
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

It seems that ham emails are more often plain text, while spam contains a lot of HTML. Also, quite a few ham emails are signed using PGP, while no spam is. Email structure might be an important feature. 
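
One way to turn that observation into numbers is to measure how many messages contain a part of a given content type anywhere in their MIME tree. This is only a sketch: `has_part` and `share_with_part` are names I invented, and the two toy messages stand in for the real corpus. It relies on `walk()`, which the standard library's message objects provide for visiting every part.

```python
import email
import email.policy


def has_part(message, content_type):
    """True if the message itself or any nested part has the given content type."""
    return any(part.get_content_type() == content_type for part in message.walk())


def share_with_part(messages, content_type):
    """Fraction of messages containing at least one part of the given content type."""
    messages = list(messages)
    if not messages:
        return 0.0
    return sum(has_part(m, content_type) for m in messages) / len(messages)


# Two toy messages standing in for the real corpus
html_msg = email.message_from_string(
    "Content-Type: text/html\n\n<p>Buy now</p>", policy=email.policy.default)
plain_msg = email.message_from_string(
    "Subject: lunch?\n\nSee you at noon.", policy=email.policy.default)

print(share_with_part([html_msg, plain_msg], "text/html"))  # 0.5
```

On the real data, `share_with_part(spam_emails, 'text/html')` and `share_with_part(easy_ham_emails, 'application/pgp-signature')` would give the two shares to compare.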

In [36]:
# Examine email headers 
for header, value in spam_emails[0].items():
    print(header, ':', value)

Return-Path : <12a1mailbot1@web.de>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received : from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:09:41 +0100
From : 12a1mailbot1@web.de
Received : from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To : dcek1a1@netsgo.com
Subject : Life Insurance - Why Pay More?
Date : Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version : 1.0
Message-ID : <0103c1042001882DD_IT7@dd_it7>
Content-Type : text/html; charset="iso-8859-1"
Content-Transfer-Encoding : qu

In [37]:
# Just look at the Subject header
spam_emails[0]['Subject']

'Life Insurance - Why Pay More?'
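
The manual checks from the framing section (links, money requests, all caps), together with the header and structure observations above, suggest a few hand-built features. Here is a rough sketch of what extracting them could look like. Everything in it is an assumption on my part: the function names, the feature names, and especially the keyword pattern, which is a tiny made-up list rather than anything tuned on the data.

```python
import email
import email.policy
import re

URL_PATTERN = re.compile(r"https?://\S+")
# Hypothetical keyword list, for illustration only
MONEY_PATTERN = re.compile(r"\$\s?\d|credit card|income|free", re.IGNORECASE)


def body_text(message):
    """Best-effort text of a message: the preferred plain or HTML body, else ''."""
    body = message.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    try:
        return body.get_content()
    except (LookupError, UnicodeError):  # unknown or broken character set
        return ""


def simple_features(message):
    """Return a dict of hand-built features for one parsed email."""
    subject = str(message["Subject"] or "")
    text = body_text(message)
    letters = [c for c in subject if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "n_links": len(URL_PATTERN.findall(text)),
        "mentions_money": bool(MONEY_PATTERN.search(text)),
        "subject_upper_ratio": upper_ratio,
        "is_html": message.get_content_type() == "text/html",
    }


# Toy message standing in for a real one from the corpus
sample = email.message_from_string(
    "Subject: WIN CASH now\nContent-Type: text/plain\n\n"
    "Make an excellent income: http://example.com/a and http://example.com/b",
    policy=email.policy.default)
features = simple_features(sample)
print(features)
```

Whether the sender has been seen before is left out here, since it would need per-user history that this dataset does not obviously provide.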

# Prepare the data

# Shortlist promising models

# Fine-tune the system

# Present your solution