 4. Build a spam classifier (a more challenging exercise):
 • Download examples of spam and ham from Apache SpamAssassin’s public
 datasets.
 • Unzip the datasets and familiarize yourself with the data format.
 • Split the datasets into a training set and a test set.
 • Write a data preparation pipeline to convert each email into a feature vector.
 Your preparation pipeline should transform an email into a (sparse) vector that
 indicates the presence or absence of each possible word. For example, if all
 emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email
 “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1]
 (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is
 present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of
 each word.
 You may want to add hyperparameters to your preparation pipeline to control
 whether or not to strip off email headers, convert each email to lowercase,
 remove punctuation, replace all URLs with “URL,” replace all numbers with
 “NUMBER,” or even perform stemming (i.e., trim off word endings; there are
 Python libraries available to do this).
 Finally, try out several classifiers and see if you can build a great spam
 classifier, with both high recall and high precision.
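
To make the target representation concrete, here is a minimal sketch of the word-presence / word-count vector described in the exercise, together with a few of the suggested preprocessing switches. The names `tokenize` and `to_vector`, their parameters, and the regular expressions are assumptions made for illustration only; the sketch returns a plain Python list rather than a sparse vector. scikit-learn's `CountVectorizer` (with its `binary` and `lowercase` options) covers similar ground and produces a sparse matrix.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not a complete URL/number grammar)
URL_RE = re.compile(r'https?://\S+')
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')
WORD_RE = re.compile(r'[A-Za-z]+')


def tokenize(text, lowercase=False, replace_urls=False, replace_numbers=False):
    """Split text into words, applying the optional preprocessing steps."""
    if lowercase:
        # Lowercase first so the URL/NUMBER placeholders stay uppercase
        text = text.lower()
    if replace_urls:
        text = URL_RE.sub(' URL ', text)
    if replace_numbers:
        text = NUMBER_RE.sub(' NUMBER ', text)
    return WORD_RE.findall(text)


def to_vector(text, vocabulary, binary=True, **options):
    """Return one entry per vocabulary word: presence (0/1) or count."""
    counts = Counter(tokenize(text, **options))
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]


vocabulary = ["Hello", "how", "are", "you"]
email_text = "Hello you Hello Hello you"
print(to_vector(email_text, vocabulary))                # [1, 0, 0, 1]
print(to_vector(email_text, vocabulary, binary=False))  # [3, 0, 0, 2]
```

Exposing the switches as keyword arguments makes it straightforward to treat them as hyperparameters later, for example in a grid search.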

In [41]:
%pip install scikit-learn numpy pandas scikit-learn-intelex bz2file

Note: you may need to restart the kernel to use updated packages.




In [42]:
from sklearnex import patch_sklearn
patch_sklearn()

Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)


In [43]:
import urllib.request
import tarfile
import os
import random


In [60]:
os.makedirs(os.path.join('data', 'raw'), exist_ok=True)

In [61]:
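from urllib.parse import urlparse

def download_file(url):
    """Download url into data/raw unless the file is already there."""
    filename = os.path.basename(urlparse(url).path)
    print(filename)
    raw_dir = os.path.join('data', 'raw')
    os.makedirs(raw_dir, exist_ok=True)
    file_path = os.path.join(raw_dir, filename)
    # Skip the download if a previous run already fetched the archive
    if os.path.isfile(file_path):
        return
    urllib.request.urlretrieve(url, file_path)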

In [62]:
def extract_file(url):
    """Extract a downloaded archive into data/ham/<archive name>."""
    filename = os.path.basename(urlparse(url).path)
    file_path = os.path.join('data', 'raw', filename)
    # Note: spam archives are extracted under data/ham as well
    extract_folder = os.path.join('data', 'ham', filename.replace('.tar.bz2', ''))
    os.makedirs(extract_folder, exist_ok=True)
    # Only extract if the folder is empty
    if not os.listdir(extract_folder):
        with tarfile.open(file_path) as tar:
            tar.extractall(extract_folder)
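
Archives fetched from the network are safer to unpack with an extraction filter, which recent Python releases support through the `filter` argument of `extractall`. The sketch below is one possible variant of the extraction step; `safe_extract` is a hypothetical helper name, and it falls back to the unfiltered call on interpreters that lack the feature.

```python
import os
import tarfile


def safe_extract(archive_path, dest_dir):
    """Extract archive_path into dest_dir, using the 'data' filter if available."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, 'data_filter'):
            # Rejects absolute paths, links escaping dest_dir, device files, etc.
            tar.extractall(dest_dir, filter='data')
        else:
            # Older interpreters: no filter support, same behaviour as before
            tar.extractall(dest_dir)
```

It could stand in for the `tar.extractall(extract_folder)` call inside `extract_file`.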

In [63]:
def load_data(url, list_ham, list_spam):
    """Append every extracted file path to list_spam or list_ham."""
    filename = os.path.basename(urlparse(url).path)
    folder_name = filename.replace('.tar.bz2', '')
    extract_folder = os.path.join('data', 'ham', folder_name)
    # The archive name decides the label: names containing 'spam' are spam
    target = list_spam if 'spam' in folder_name else list_ham
    # List all files in the extracted folder
    for root, _, files in os.walk(extract_folder):
        for file in files:
            target.append(os.path.join(root, file))
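
To get familiar with the data format, each collected file can be parsed with Python's standard `email` package. The following is a minimal sketch under that assumption; `parse_email` and `email_body` are hypothetical helper names, and the sample message is made up for the demonstration rather than taken from the corpus.

```python
from email import policy
from email.parser import BytesParser


def parse_email(raw_bytes):
    """Parse the raw bytes of one message into an EmailMessage object."""
    return BytesParser(policy=policy.default).parsebytes(raw_bytes)


def email_body(message):
    """Concatenate the text/plain parts of a (possibly multipart) message."""
    parts = []
    for part in message.walk():
        if part.get_content_type() == 'text/plain':
            try:
                parts.append(part.get_content())
            except (LookupError, UnicodeDecodeError):
                # Unknown or broken charset: decode the raw payload leniently
                payload = part.get_payload(decode=True) or b''
                parts.append(payload.decode('latin-1', errors='replace'))
    return '\n'.join(parts)


# Made-up sample message, for illustration only
sample = (b"From: a@example.com\r\n"
          b"Subject: Hi\r\n"
          b"Content-Type: text/plain\r\n"
          b"\r\n"
          b"Hello you\r\n")
msg = parse_email(sample)
print(msg['Subject'])
print(email_body(msg).strip())
```

For a real file, the bytes would come from `open(path, 'rb').read()`, using one of the paths gathered by `load_data`.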

In [64]:
ham_urls = [
    'https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2',
    'https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2',
    'https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2',
]
spam_urls = [
    'https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2',
    'https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2',
    'https://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2',
]

ham_files = []
spam_files = []

# Download, extract, and index every archive
for url in ham_urls + spam_urls:
    download_file(url)
    extract_file(url)
    load_data(url, ham_files, spam_files)

20021010_easy_ham.tar.bz2
20030228_easy_ham.tar.bz2
20030228_easy_ham_2.tar.bz2
20021010_spam.tar.bz2
20030228_spam.tar.bz2
20030228_spam_2.tar.bz2
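
With the two path lists in hand, the next step of the exercise is the train/test split. Below is a minimal sketch using only the standard `random` module; `split_train_test`, the 0 = ham / 1 = spam labels, the 20% test ratio, and the seed are all assumptions for illustration, and the demo lists are placeholders rather than real corpus paths. This simple shuffle is not stratified; scikit-learn's `train_test_split` with its `stratify` argument is an alternative when class proportions should be preserved.

```python
import random


def split_train_test(ham_paths, spam_paths, test_ratio=0.2, seed=42):
    """Label the paths (0 = ham, 1 = spam), shuffle, and split them."""
    samples = [(p, 0) for p in ham_paths] + [(p, 1) for p in spam_paths]
    # A dedicated, seeded generator keeps the split reproducible
    rng = random.Random(seed)
    rng.shuffle(samples)
    n_test = int(len(samples) * test_ratio)
    return samples[n_test:], samples[:n_test]


# Placeholder paths, for illustration only
demo_ham = [f'ham_{i}' for i in range(8)]
demo_spam = [f'spam_{i}' for i in range(2)]
train, test = split_train_test(demo_ham, demo_spam)
print(len(train), len(test))  # 8 2
```

Splitting on file paths before reading any email content keeps the test set untouched while the preparation pipeline is being designed.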


In [None]:
ham_files

['data\\ham\\20021010_easy_ham\\easy_ham\\0001.ea7e79d3153e7469e7a9c3e0af6a357e',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0002.b3120c4bcbf3101e661161ee7efcb8bf',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0003.acfc5ad94bbd27118a0d8685d18c89dd',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0004.e8d5727378ddde5c3be181df593f1712',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0005.8c3b9e9c0f3f183ddaf7592a11b99957',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0006.ee8b0dba12856155222be180ba122058',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0007.c75188382f64b090022fa3b095b020b0',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0008.20bc0b4ba2d99aae1c7098069f611a9b',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0009.435ae292d75abb1ca492dcc2d5cf1570',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0010.4996141de3f21e858c22f88231a9f463',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0011.07b11073b53634cff892a7988289a72e',
 'data\\ham\\20021010_easy_ham\\easy_ham\\0012.d354b2d2f24d1036caf1374dd94f4c94',
 'data\\ham\\200

In [66]:
spam_files

['data\\ham\\20021010_spam\\spam\\0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1',
 'data\\ham\\20021010_spam\\spam\\0001.bfc8d64d12b325ff385cca8d07b84288',
 'data\\ham\\20021010_spam\\spam\\0002.24b47bb3ce90708ae29d0aec1da08610',
 'data\\ham\\20021010_spam\\spam\\0003.4b3d943b8df71af248d12f8b2e7a224a',
 'data\\ham\\20021010_spam\\spam\\0004.1874ab60c71f0b31b580f313a3f6e777',
 'data\\ham\\20021010_spam\\spam\\0005.1f42bb885de0ef7fc5cd09d34dc2ba54',
 'data\\ham\\20021010_spam\\spam\\0006.7a32642f8c22bbeb85d6c3b5f3890a2c',
 'data\\ham\\20021010_spam\\spam\\0007.859c901719011d56f8b652ea071c1f8b',
 'data\\ham\\20021010_spam\\spam\\0008.9562918b57e044abfbce260cc875acde',
 'data\\ham\\20021010_spam\\spam\\0009.c05e264fbf18783099b53dbc9a9aacda',
 'data\\ham\\20021010_spam\\spam\\0010.7f5fb525755c45eb78efc18d7c9ea5aa',
 'data\\ham\\20021010_spam\\spam\\0011.2a1247254a535bac29c476b86c708901',
 'data\\ham\\20021010_spam\\spam\\0012.7bc8e619ad0264979edce15083e70a02',
 'data\\ham\\20021010_spam\\spam\\0013