# Module 2
Kevin Nolasco

MCIS560: Intro to Machine Learning

Cabrini University

01/30/2022

The purpose of this assignment is to get familiar with *Supervised Learning*. Supervised learning is a branch of machine learning in which we provide the model with the expected output (the label) for each training example. A classic example is the Spam vs. Ham (not spam) problem. Below we will download a collection of emails, each already labeled as spam or ham, and practice training a machine learning model that can accurately predict the class of an email. Since the output is categorical, we will train the model using the **classification** approach instead of the **regression** approach.
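To make the idea of "providing the expected output" concrete, here is a minimal sketch on made-up data. The five example emails and the majority-class "model" are assumptions for illustration only; they are not part of the assignment's dataset or of the model trained later.

```python
from collections import Counter

# Hypothetical toy data: each input (email text) is paired with its known label.
labeled_emails = [
    ("Win a free prize now", "spam"),
    ("Meeting moved to 3pm", "ham"),
    ("Lunch tomorrow?", "ham"),
    ("Cheap meds, click here", "spam"),
    ("Quarterly report attached", "ham"),
]

texts = [text for text, _ in labeled_emails]
labels = [label for _, label in labeled_emails]

# Simplest possible classifier: always predict the most common label.
majority_label = Counter(labels).most_common(1)[0][0]
predictions = [majority_label for _ in texts]

# Because the labels are known, we can measure how often the predictions are right.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(majority_label, accuracy)
```

Because the labels are categories ("spam" or "ham") rather than numbers on a continuous scale, this is a classification problem. A baseline like this is also a useful reference point: a trained model should beat it.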

# Get the Data

We will loosely base our code on [Aurélien Géron's repo](https://github.com/knolasco/handson-ml2/blob/master/03_classification.ipynb), specifically Chapter 3.

In [7]:
# ready paths to download the data
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_spam_data(ham_url=HAM_URL, spam_url=SPAM_URL, spam_path=SPAM_PATH):
    # create the target directory if it does not exist yet
    os.makedirs(spam_path, exist_ok=True)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        path = os.path.join(spam_path, filename)
        # download each archive only once
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        # extract into the same directory the archive was saved to
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

In [8]:
# download the data
fetch_spam_data()

In [3]:
# load the emails
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]
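The `len(name) > 20` condition above is easy to miss. The sketch below shows its effect on a made-up directory listing. The filenames are hypothetical, and it is an assumption that the extracted folders contain a short helper file (such as `cmds`) alongside the long, hash-like email filenames.

```python
# Hypothetical directory listing: two email-like names plus a short helper file.
listing = [
    "00002." + "b" * 32,
    "cmds",
    "00001." + "a" * 32,
]

# Same filter as in the cell above: keep only names longer than 20 characters.
email_filenames = [name for name in sorted(listing) if len(name) > 20]

for name in email_filenames:
    print(name, len(name))
```

Sorting first keeps the order of the emails reproducible between runs, since `os.listdir` makes no guarantee about ordering.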

In [4]:
# use the email package to parse the emails
# one function handles both the spam and the ham case
import email
import email.parser
import email.policy

def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = 'spam' if is_spam else 'easy_ham'
    # open in binary mode; BytesParser handles the decoding
    with open(os.path.join(spam_path, directory, filename), 'rb') as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [9]:
# parse every ham and spam email into an EmailMessage object
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]
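To see what the parser does without relying on the downloaded corpus, here is a minimal sketch that parses a small hand-written message held in memory. The message contents and addresses are made up for illustration. `parsebytes` is the in-memory counterpart of the `parse` call used in `load_email`.

```python
import email.parser
import email.policy

# Hypothetical raw email: headers, a blank line, then the body.
raw_message = (
    b"From: alice@example.com\n"
    b"To: bob@example.com\n"
    b"Subject: Lunch tomorrow?\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"Are you free at noon?\n"
)

# Same parser and policy as load_email, but reading from bytes instead of a file.
parser = email.parser.BytesParser(policy=email.policy.default)
message = parser.parsebytes(raw_message)

print(type(message).__name__)       # class of the parsed object
print(message["Subject"])           # headers are accessed like dictionary keys
print(message.get_content_type())   # MIME type of the message
print(message.get_content().strip())  # decoded body text
```

With `email.policy.default`, the parser returns the modern message class, which offers convenience methods such as `get_content()`.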

Let's look at the type of a single email in ham_emails. Knowing the type tells us what kind of object we are working with and which of its methods could be helpful in this analysis.

In [14]:
type(ham_emails[0])

email.message.EmailMessage

Looking up this type, we can [read the documentation](https://docs.python.org/3/library/email.message.html) and learn that the class provides the .as_string() method, which returns the entire message as a single string. Let's test this method and see what we get.

In [13]:

ham_emails[0].as_string()

'Return-Path: <exmh-workers-admin@spamassassin.taint.org>\nDelivered-To: zzzz@localhost.netnoteinc.com\nReceived: from localhost (localhost [127.0.0.1])\n\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id D03E543C36\n\tfor <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400 (EDT)\nReceived: from phobos [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 12:36:16 +0100 (IST)\nReceived: from listman.spamassassin.taint.org (listman.spamassassin.taint.org\n [66.187.233.211]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id\n g7MBYrZ04811 for    <zzzz-exmh@spamassassin.taint.org>; Thu, 22 Aug 2002\n 12:34:53 +0100\nReceived: from listman.spamassassin.taint.org (localhost.localdomain\n [127.0.0.1]) by    listman.redhat.com (Postfix) with ESMTP id 8386540858;\n Thu, 22 Aug 2002    07:35:02 -0400 (EDT)\nDelivered-To: exmh-workers@listman.spamassassin.taint.org\nReceived: from int-mx1.corp.spamassassin.taint.org\n (int-mx1.corp