## Spam Classifier

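This notebook builds a spam classifier on the Apache SpamAssassin public corpus. This part downloads the corpus, lists the ham and spam files, and parses them with Python's `email` module.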

#### Load dependencies

In [30]:
# Record the start time so the total execution time of the
# notebook can be reported at the end
import time
start_time = time.time()

import os
import importlib
import sys
sys.path.append('../../utils')
import datasets
importlib.reload(datasets)
import helpers
importlib.reload(helpers)

import email            # Provides functionality for creating, parsing, and managing email messages, including MIME types
import email.parser     # Provides BytesParser, used below; imported explicitly so it does not depend on another module having loaded it
import email.policy     # Offers predefined and customizable policies to control how email messages are formatted and processed, ensuring compliance with different email standards.

#### Get and load dataset

In [31]:
# Download dataset and decompress
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
DATASET_FILES = ["20030228_easy_ham.tar.bz2", "20030228_spam.tar.bz2"]
DATASET_DIRS = ["easy_ham", "spam"]
SPAM_PATH = os.path.join("../../datasets", "spam")

datasets.download(DOWNLOAD_ROOT, SPAM_PATH, DATASET_FILES, "bz2")

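# List email files; the length filter keeps the long, hash-like message
# filenames and skips any shorter, non-message entries in the directory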
HAM_DIR = os.path.join(SPAM_PATH, DATASET_DIRS[0])
SPAM_DIR = os.path.join(SPAM_PATH, DATASET_DIRS[1])
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]
print("Emails were loaded")

20030228_easy_ham.tar.bz2 is already downloaded.
20030228_spam.tar.bz2 is already downloaded.
Emails were loaded
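
The `datasets.download` helper lives in the project's `utils` folder and is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch under stated assumptions: the function name `download_and_extract` is hypothetical, the archives are assumed to be `.tar.bz2` files, and an archive already present on disk is not downloaded again. It is not the actual implementation.

```python
import os
import tarfile
import urllib.request


def download_and_extract(root_url, dest_dir, filenames):
    """Hypothetical stand-in for datasets.download (assumes .tar.bz2 archives)."""
    os.makedirs(dest_dir, exist_ok=True)
    for filename in filenames:
        archive_path = os.path.join(dest_dir, filename)
        if os.path.isfile(archive_path):
            # Skip the network call when the archive is already on disk
            print(f"{filename} is already downloaded.")
        else:
            urllib.request.urlretrieve(root_url + filename, archive_path)
        # Unpack the archive next to it; each archive holds one top-level folder
        with tarfile.open(archive_path, "r:bz2") as archive:
            archive.extractall(path=dest_dir)
```

With the constants defined in the cell above, a call would look like `download_and_extract(DOWNLOAD_ROOT, SPAM_PATH, DATASET_FILES)`.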


In [32]:
# Review ham files
len(ham_filenames)

2500

In [33]:
# Review spam files
len(spam_filenames)

500
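
The two counts show that the classes are imbalanced. As a quick check, the snippet below computes the spam share, using the counts printed above as literals so that it runs on its own.

```python
# Counts taken from the two cells above
ham_count, spam_count = 2500, 500

# Fraction of all emails that are spam
spam_ratio = spam_count / (ham_count + spam_count)
print(f"Spam share: {spam_ratio:.1%}")  # Spam share: 16.7%
```

A model that always predicts "ham" would therefore already be right about 83% of the time, so accuracy alone is a weak measure on this dataset.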

#### Parse the emails with Python's email module

In [34]:
def load_email(is_spam, filename, spam_path=SPAM_PATH):
    """Parse one email file from the ham or spam directory into an EmailMessage."""
    directory = DATASET_DIRS[1] if is_spam else DATASET_DIRS[0]
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
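
To see what the parser returns without touching the dataset, the sketch below parses a small hand-written message from bytes. The message content and addresses are made up for illustration; `parsebytes` is used instead of `parse` only because the input is an in-memory bytes object rather than an open file.

```python
import email.parser
import email.policy

# A made-up raw message: headers, a blank line, then the body
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch on Friday?\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"\r\n"
    b"Are you free at noon?\r\n"
)

# Same parser and policy as load_email, applied to bytes instead of a file
msg = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw)

print(msg["Subject"])              # header lookup by name
print(msg.get_content_type())      # MIME type of the message
print(msg.get_content().strip())   # decoded body text
```

With `email.policy.default`, the parser produces an `EmailMessage` object, which is what makes `get_content()` available in the cells that follow.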

Parse all emails and review one example from each class

In [35]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]

In [36]:
# Ham email
print(ham_emails[1].get_content().strip())

Martin A posted:
Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the
 limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the
 Mount Athos monastic community, was ideal for the patriotic sculpture. 
 
 As well as Alexander's granite features, 240 ft high and 170 ft wide, a
 museum, a restored amphitheatre and car park for admiring crowds are
planned
---------------------
So is this mountain limestone or granite?
If it's limestone, it'll weather pretty fast.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
forteana-unsubscribe@egroups.com

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/


In [37]:
# Spam email
print(spam_emails[6].get_content().strip())

Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


4139vOLW7-758DoDY1425FRhM1-764SMFc8513fCsLl40
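
Both examples above are single-part plain-text messages, which is why `get_content()` can be called on them directly. Emails can also be multipart (for example, a plain-text and an HTML version of the same body), and those need the text part selected first. The sketch below shows one way to do that with `EmailMessage.get_body`; the helper name `get_plain_text` and the sample message are hypothetical and are not part of the notebook.

```python
from email.message import EmailMessage


def get_plain_text(message):
    """Return the text/plain body of a message, or None if it has none."""
    body = message.get_body(preferencelist=("plain",))
    return None if body is None else body.get_content().strip()


# Build a made-up multipart message with plain-text and HTML alternatives
msg = EmailMessage()
msg["Subject"] = "Weekly update"
msg.set_content("Plain-text version")
msg.add_alternative("<p>HTML version</p>", subtype="html")

print(msg.get_content_type())  # multipart/alternative
print(get_plain_text(msg))     # Plain-text version
```

The same helper also works on single-part plain-text messages, since `get_body` then returns the message itself.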


---

## Total Time

This shows the total execution time of the notebook.

In [38]:
# Compute and print the total execution time
end_time = time.time()
helpers.calculate_execution_time(start_time, end_time)

Total execution time: 0.0 minutes and 1.21 seconds
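
The `helpers.calculate_execution_time` function is defined in the project's `utils` folder and is not shown here. A minimal sketch of a function that produces the same message format might look like the following; the name `format_execution_time` is hypothetical, and the real helper may be implemented differently.

```python
def format_execution_time(start, end):
    """Format an elapsed time in seconds as whole minutes plus remaining seconds."""
    # divmod splits the elapsed seconds into full minutes and the remainder
    minutes, seconds = divmod(end - start, 60)
    return f"Total execution time: {minutes} minutes and {seconds:.2f} seconds"


print(format_execution_time(0.0, 125.5))
# Total execution time: 2.0 minutes and 5.50 seconds
```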
