 4. Build a spam classifier (a more challenging exercise):
 • Download examples of spam and ham from Apache SpamAssassin’s public
 datasets.
 • Unzip the datasets and familiarize yourself with the data format.
 • Split the datasets into a training set and a test set.
 • Write a data preparation pipeline to convert each email into a feature vector.
 Your preparation pipeline should transform an email into a (sparse) vector that
 indicates the presence or absence of each possible word. For example, if all
 emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email
 “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1]
 (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is
 present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of
 each word.
 You may want to add hyperparameters to your preparation pipeline to control
 whether or not to strip off email headers, convert each email to lowercase,
 remove punctuation, replace all URLs with “URL,” replace all numbers with
 “NUMBER,” or even perform stemming (i.e., trim off word endings; there are
 Python libraries available to do this).
 Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.
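
The four-word example in the exercise can be reproduced with a few lines of plain Python. This is only a minimal sketch: the `vectorize` helper and the fixed `vocabulary` list are hypothetical names used for illustration, and a real pipeline would learn the vocabulary from the training set and store the result as a sparse vector.

```python
from collections import Counter

# Hypothetical fixed vocabulary taken from the exercise text; a real
# pipeline would learn it from the training emails.
vocabulary = ["Hello", "how", "are", "you"]

def vectorize(email_text, vocabulary, binary=True):
    """Map an email to word-presence flags (binary=True) or word counts."""
    counts = Counter(email_text.split())
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]

email = "Hello you Hello Hello you"
presence = vectorize(email, vocabulary, binary=True)
occurrences = vectorize(email, vocabulary, binary=False)
print(presence)     # [1, 0, 0, 1]
print(occurrences)  # [3, 0, 0, 2]
```

The `binary` flag plays the same role as the `binary` argument passed to `CountVectorizer` later in this notebook.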

In [107]:
%pip install scikit-learn numpy pandas scikit-learn-intelex bz2file nltk

Note: you may need to restart the kernel to use updated packages.




In [108]:
from sklearnex import patch_sklearn
patch_sklearn()

Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)


In [109]:
import urllib.request
import tarfile
import os
import random


In [110]:
os.makedirs(os.path.join('data', 'raw'), exist_ok=True)

In [111]:
from urllib.parse import urlparse

def download_file(url):
    # Derive the archive file name from the URL path
    path = os.path.basename(urlparse(url).path)
    print(path)
    file_path = os.path.join('data', 'raw', path)
    # Skip the download if the archive is already cached locally
    if os.path.isfile(file_path):
        return
    urllib.request.urlretrieve(url, file_path)

In [112]:
def extract_file(url):
    path = os.path.basename(urlparse(url).path)
    file_path = os.path.join('data', 'raw', path)
    # Both ham and spam archives are extracted under data/ham
    extract_folder = os.path.join('data', 'ham', path.replace('.tar.bz2', ''))
    os.makedirs(extract_folder, exist_ok=True)
    # Only extract if the folder is empty
    if not os.listdir(extract_folder):
        with tarfile.open(file_path) as tar:
            tar.extractall(extract_folder)

In [113]:
def load_data(url, list_ham, list_spam):
    path = os.path.basename(urlparse(url).path)
    folder_name = path.replace('.tar.bz2', '')
    # Must point at the same folder that extract_file wrote to
    extract_folder = os.path.join('data', 'ham', folder_name)
    # Walk the extracted folder and sort files by archive type
    for root, dirs, files in os.walk(extract_folder):
        for file in files:
            file_path = os.path.join(root, file)
            if 'spam' in folder_name:
                list_spam.append(file_path)
            else:
                list_ham.append(file_path)

In [114]:
ham_urls = [
    'https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2',
    'https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2',
    'https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2',
]
spam_urls = [
    'https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2',
    'https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2',
    'https://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2',
]

ham_files = []
spam_files = []

# Download, extract, and index every archive (ham first, then spam)
for url in ham_urls + spam_urls:
    download_file(url)
    extract_file(url)
    load_data(url, ham_files, spam_files)

20021010_easy_ham.tar.bz2
20030228_easy_ham.tar.bz2
20030228_easy_ham_2.tar.bz2
20021010_spam.tar.bz2
20030228_spam.tar.bz2
20030228_spam_2.tar.bz2


In [115]:
print(f'No of Ham files :{len(ham_files)}, No of Spam Files: {len(spam_files)}')

No of Ham files :6453, No of Spam Files: 2400


In [116]:
all_files = ham_files + spam_files
all_labels = [0] * len(ham_files) + [1] * len(spam_files)

In [117]:
from sklearn.model_selection import train_test_split

In [118]:
X_train, X_test, y_train, y_test = train_test_split(
    all_files, all_labels, test_size=0.2, random_state=42, stratify=all_labels
)

In [119]:
# Output split sizes for verification
print(f'\nTraining set size: {len(X_train)} samples')
print(f'Test set size: {len(X_test)} samples')
print(f'Number of ham in training set: {y_train.count(0)}')
print(f'Number of spam in training set: {y_train.count(1)}')
print(f'Number of ham in test set: {y_test.count(0)}')
print(f'Number of spam in test set: {y_test.count(1)}')


Training set size: 7082 samples
Test set size: 1771 samples
Number of ham in training set: 5162
Number of spam in training set: 1920
Number of ham in test set: 1291
Number of spam in test set: 480
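
As a quick sanity check on `stratify=all_labels`, the spam share can be recomputed from the counts printed above. This is a small sketch that hard-codes those printed counts; the `spam_fraction` helper is a hypothetical name, not part of the notebook.

```python
def spam_fraction(n_ham, n_spam):
    """Return the share of spam among n_ham + n_spam emails."""
    return n_spam / (n_ham + n_spam)

# Counts copied from the notebook output above.
full_frac = spam_fraction(6453, 2400)
train_frac = spam_fraction(5162, 1920)
test_frac = spam_fraction(1291, 480)

# With a stratified split, all three fractions should be nearly equal.
print(f"full: {full_frac:.4f}, train: {train_frac:.4f}, test: {test_frac:.4f}")
```

If the three values differed noticeably, that would suggest the split was not stratified as intended.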


----

Read Email and Create Pipeline

In [120]:
import os
from email.parser import Parser
import re
import string

In [121]:
def get_email_content(email_path, strip_headers=False):
    with open(email_path, 'r', encoding='latin-1') as f:
        text = f.read()
    parser = Parser()
    msg = parser.parsestr(text)
    if strip_headers:
        body = ''
        if msg.is_multipart():
            for part in msg.walk():
                ctype = part.get_content_type()
                cdisp = str(part.get('Content-Disposition'))
                if ctype == 'text/plain' and 'attachment' not in cdisp:
                    body += part.get_payload(decode=True).decode('latin-1', errors='ignore')
        else:
            body = msg.get_payload(decode=True).decode('latin-1', errors='ignore')
        return body
    else:
        return text
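
To see what `strip_headers=True` changes, the following sketch parses a tiny hand-written message with the same `email.parser.Parser` class. The sample message is invented for illustration and is not taken from the SpamAssassin corpus.

```python
from email.parser import Parser

# A made-up, single-part plain-text email.
raw = (
    "From: alice@example.com\n"
    "Subject: Hello\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello you\n"
)

msg = Parser().parsestr(raw)
subject = msg["Subject"]
# get_payload(decode=True) returns bytes, so decode them back to text.
body = msg.get_payload(decode=True).decode("latin-1", errors="ignore")

print(msg.is_multipart())  # False
print(repr(subject))
print(repr(body))
```

The header lines are available through the parsed message object, while the payload contains only the text after the blank separator line.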

In [122]:
train_texts, test_texts=[],[]
for path in X_train:
    train_texts.append(get_email_content(path))

for path in X_test:
    test_texts.append(get_email_content(path))

In [123]:
from nltk.stem.porter import PorterStemmer

def get_analyzer(lowercase=True, remove_punct=True, replace_url=True, replace_num=True, stemming=False):
    # Build these once per analyzer instead of relying on a global defined in a later cell
    stemmer = PorterStemmer() if stemming else None
    punct_table = str.maketrans('', '', string.punctuation)

    def analyzer_func(text):
        if lowercase:
            text = text.lower()
        if replace_url:
            text = re.sub(r'(http|https|www)\S+', 'URL', text)
        if replace_num:
            text = re.sub(r'\d+', 'NUMBER', text)
        if remove_punct:
            text = text.translate(punct_table)
        # Tokenize on whitespace
        words = text.split()
        if stemming:
            words = [stemmer.stem(word) for word in words]
        return words
    return analyzer_func
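
The effect of these normalization steps is easiest to see on a single line of text. The sketch below mirrors the lowercase, URL, number, and punctuation steps using only the standard library; it leaves out stemming so that it does not depend on NLTK. The `normalize` helper and the sample sentence are hypothetical.

```python
import re
import string

# Build the punctuation-removal table once instead of on every call.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, replace URLs and numbers, strip punctuation, then tokenize."""
    text = text.lower()
    text = re.sub(r"(http|https|www)\S+", "URL", text)
    text = re.sub(r"\d+", "NUMBER", text)
    text = text.translate(PUNCT_TABLE)
    return text.split()

tokens = normalize("Visit http://example.com NOW for 50% off!!!")
print(tokens)  # ['visit', 'URL', 'now', 'for', 'NUMBER', 'off']
```

Because lowercasing runs first, the `URL` and `NUMBER` placeholders remain uppercase in this sketch and cannot collide with ordinary lowercase words.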

In [124]:
strip_headers = False  # Keep headers for better spam indicators (same as the get_email_content default used above)
lowercase = True
remove_punct = True
replace_url = True
replace_num = True
stemming = True
binary = False

In [125]:
analyzer = get_analyzer(lowercase=lowercase, remove_punct=remove_punct, replace_url=replace_url, 
                        replace_num=replace_num, stemming=stemming)

In [126]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score
from nltk.stem.porter import PorterStemmer

In [127]:
vectorizer = CountVectorizer(analyzer=analyzer, binary=binary)

In [128]:
# Initialize the Porter stemmer
stemmer = PorterStemmer()
# Fit the vocabulary on the training set only, then transform both sets
X_train_vec = vectorizer.fit_transform(train_texts)
X_test_vec = vectorizer.transform(test_texts)

In [129]:
# Example: Training and evaluating classifiers
classifiers = {
    'MultinomialNB': MultinomialNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'SVC': SVC()
}

In [130]:
for name, clf in classifiers.items():
    clf.fit(X_train_vec, y_train)
    y_pred = clf.predict(X_test_vec)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    print(f"{name} - Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")

MultinomialNB - Accuracy: 0.9701, Precision: 0.9954, Recall: 0.8938
LogisticRegression - Accuracy: 0.9944, Precision: 0.9958, Recall: 0.9833
SVC - Accuracy: 0.9870, Precision: 0.9893, Recall: 0.9625
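
Since the exercise asks for both high recall and high precision, one way to compare the three classifiers with a single number is the F1 score, the harmonic mean of the two. The notebook itself does not compute F1; the sketch below derives it from the precision and recall values printed above, and the `f1_from` helper is a hypothetical name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the notebook output above.
results = {
    "MultinomialNB": (0.9954, 0.8938),
    "LogisticRegression": (0.9958, 0.9833),
    "SVC": (0.9893, 0.9625),
}

f1_scores = {name: f1_from(p, r) for name, (p, r) in results.items()}
best = max(f1_scores, key=f1_scores.get)

for name, f1 in f1_scores.items():
    print(f"{name}: F1 = {f1:.4f}")
print(f"Best by F1: {best}")
```

The harmonic mean penalizes a large gap between precision and recall, which is why MultinomialNB ranks lowest here despite its high precision.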
