# Overview

In this notebook we build a spam classifier using the SpamAssassin public corpus. The corpus README describes its contents as follows:

  - spam: 500 spam messages, all received from non-spam-trap sources.

  - easy_ham: 2500 non-spam messages.  These are typically quite easy to
    differentiate from spam, since they frequently do not contain any spammish
    signatures (like HTML etc).

  - hard_ham: 250 non-spam messages which are closer in many respects to
    typical spam: use of HTML, unusual HTML markup, coloured text,
    "spammish-sounding" phrases etc.

  - easy_ham_2: 1400 non-spam messages.  A more recent addition to the set.

  - spam_2: 1397 spam messages.  Again, more recent.

Total count: 6047 messages, with about a 31% spam ratio.

The corpora are prefixed with the date they were assembled.  They are
compressed using "bzip2".  The messages are named by a message number and
their MD5 checksum.
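
As a quick sanity check, the total and the spam ratio quoted above can be recomputed from the per-folder counts. This is only a sketch: the dictionary restates the numbers from the list, and the variable names are illustrative.

```python
# Message counts per folder, copied from the corpus overview above.
counts = {
    "spam": 500,
    "easy_ham": 2500,
    "hard_ham": 250,
    "easy_ham_2": 1400,
    "spam_2": 1397,
}

total = sum(counts.values())
spam_total = counts["spam"] + counts["spam_2"]
spam_ratio = spam_total / total

print(total)                # 6047
print(f"{spam_ratio:.1%}")  # 31.4%
```

The classes are imbalanced at roughly 2:1 in favour of non-spam, which is worth keeping in mind when reading accuracy figures later on.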


# Downloading and Loading the Data

In [None]:
from pathlib import Path
import urllib.request
import tarfile

urls = [
    "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2",
    "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2",
    "https://spamassassin.apache.org/old/publiccorpus/20030228_hard_ham.tar.bz2",
    "https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2",
    "https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2",
]

paths = [
    "datasets/easy_ham.tar.bz2",
    "datasets/easy_ham_2.tar.bz2",
    "datasets/hard_ham.tar.bz2",
    "datasets/spam.tar.bz2",
    "datasets/spam2.tar.bz2",
]

# Download each archive once, then extract it into datasets/
for url, path in zip(urls, paths):
    tarball_path = Path(path)
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as tarball:
            tarball.extractall(path="datasets")

In [1]:
import glob
import os
import re

from bs4 import BeautifulSoup


def read_file(file_path):
    with open(file_path, "r", encoding="ISO-8859-1") as file:
        return file.read()

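def extract_email_body(file_content):
    # Parse the raw message (headers included) as HTML and keep only word tokens.
    # Fall back to the whole document text when there is no <body> element.
    soup = BeautifulSoup(file_content, "lxml")
    source = soup.body if soup.body is not None else soup
    return " ".join(re.findall(r'\b\w+\b', source.text))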

def load_emails(path, label):
    emails = []
    file_paths = glob.glob(os.path.join(path, "*"))
    for file_path in file_paths:
        email_body = extract_email_body(read_file(file_path))
        emails.append({"email": email_body, "label": label})
    return emails
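
The loader above feeds the whole raw file to BeautifulSoup, so mail headers (`Received`, `Message-Id`, and so on) end up in the extracted text. If you would rather keep headers and body apart, one option is the standard-library `email` package. The following is a minimal sketch of that alternative, not what this notebook does; `split_headers_and_body` and the sample message are hypothetical, and real corpus messages may need extra error handling for unusual charsets.

```python
import email
from email import policy


def split_headers_and_body(raw_message):
    """Return (subject, body_text) for a raw RFC 822 style message string."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    subject = str(msg["Subject"] or "")
    # Prefer a plain-text part, fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return subject, body


# A tiny made-up message, used only to illustrate the call.
sample = (
    "Received: from mail.example.com by localhost\n"
    "From: promo@example.com\n"
    "Subject: Win a prize\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Click here to claim your prize.\n"
)

subject, body = split_headers_and_body(sample)
print(subject)
print(body)
```

Whether dropping the headers helps or hurts the classifier is an empirical question, since header fields can themselves carry useful signal.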

In [2]:
import pandas as pd

# Load the email dataset
spam_emails = load_emails("datasets/spam/", 1)
spam2_emails = load_emails("datasets/spam_2/", 1)
easy_ham_emails = load_emails("datasets/easy_ham/", 0)
easy_ham2_emails = load_emails("datasets/easy_ham_2/", 0)
hard_ham_emails = load_emails("datasets/hard_ham/", 0)

# Create a DataFrame
data = pd.DataFrame(spam_emails + easy_ham_emails + hard_ham_emails 
                   + spam2_emails + easy_ham2_emails)
X = data["email"]
y = data["label"]

# Explore the data

Take a look at a spam e-mail sample:

In [13]:
print(X[1])

From fivestarpicks netzero com Sun Sep 22 14 13 11 2002 Return Path Delivered To zzzz localhost spamassassin taint org Received from localhost jalapeno 127 0 0 1 by zzzzason org Postfix with ESMTP id D0B7A16F03 for Sun 22 Sep 2002 14 13 10 0100 IST Received from jalapeno 127 0 0 1 by localhost with IMAP fetchmail 5 9 0 for zzzz localhost single drop Sun 22 Sep 2002 14 13 10 0100 IST Received from webnote net mail webnote net 193 120 211 219 by dogma slashnull org 8 11 6 8 11 6 with ESMTP id g8M2uMC17924 for Sun 22 Sep 2002 03 56 22 0100 Received from 210 126 63 68 217 167 180 65 by webnote net 8 9 3 8 9 3 with SMTP id DAA10727 for Sun 22 Sep 2002 03 56 47 0100 Message Id 200209220256 DAA10727 webnote net Received from 34 57 158 148 34 57 158 148 by rly xr02 mx aol com with local Sep 21 2002 9 36 31 PM 0400 Received from unknown HELO rly xw01 mx aol com 96 213 243 25 by n9 groups yahoo com with asmtp Sep 21 2002 8 46 25 PM 1200 Received from mx rootsystems net 60 127 54 24 by smtp serve

Explore a non-spam e-mail sample:

In [14]:
print(X[1001])



Explore a hard non-spam sample which might look more like spam:

In [21]:
print(X[3101])

Return Path Received from abv sfo acmta2 cnet com abv sfo1 acmta2 cnet com 206 16 1 161 by dogma slashnull org 8 11 6 8 11 6 with ESMTP id g6AKHYJ02058 for Wed 10 Jul 2002 21 17 34 0100 Received from abv sfo1 ac agent4 206 16 0 226 by abv sfo acmta2 cnet com PowerMTA TM v1 5 Wed 10 Jul 2002 16 19 34 0700 envelope from Message ID 1219134 1026332248132 JavaMail root abv sfo1 ac agent4 Date Wed 10 Jul 2002 13 17 28 0700 PDT From CNET Shopper Electronics Edition To qqqqqqqqqq cnet newsletters spamassassin taint org Subject Looking for the perfect camera for your summer vacation CNET SHOPPER Mime Version 1 0 Content Type text html charset ISO 8859 1 Content Transfer Encoding 7bit X Mailer Accucast http www accucast com X Mailer Version 2 8 4 2 CNET Shopper Newsletter Electronics Edition Shopper All CNET The Web 1 Sony Cyber Shot DSC F707 2 Canon PowerShot S40 3 Palm m515 4 Palm i705 5 Nikon Coolpix 995 All most popular Live tech help NOW April s tech award 1 million open jobs News com Top C

After inspecting the samples, we vectorize the e-mails with a TfidfVectorizer, which tokenizes each message and turns it into a vector of TF-IDF weights that the classifier can use as features.
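
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on three made-up documents. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. Tokenization here is a plain whitespace split, which is simpler than what the vectorizer actually does (no stop-word removal, no token pattern).

```python
import math
from collections import Counter

# Toy documents, invented for illustration only.
docs = [
    "win money now",
    "meeting notes attached",
    "win a free prize now",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    term_counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf(tokens) for tokens in tokenized]
print(vectors[0])
```

In the first document, "money" receives a higher weight than "win" because it occurs in only one of the three documents, while "win" occurs in two.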

# Create a test set

In [4]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Prepare Data 

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Vectorize the email data and use a Support Vector Machine (SVM) classifier
svc_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words="english")),
    ('classifier', SVC())
])



# Train and Fine-Tune the Model

In [24]:
from sklearn.model_selection import GridSearchCV

# Use Grid Search to find the best hyperparameters for the SVM
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf']
}

grid_search = GridSearchCV(svc_pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters found by grid search:", grid_search.best_params_)

Best parameters found by grid search: {'classifier__C': 10, 'classifier__kernel': 'linear', 'tfidf__ngram_range': (1, 2)}
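
To get a feel for the cost of this search, the candidate combinations can be enumerated by hand. The sketch below uses only `itertools` and restates the grid from the cell above; with `cv=5`, each candidate is fitted once per fold.

```python
from itertools import product

# Same grid as in the cell above, restated for illustration.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}
n_folds = 5

# Cartesian product of all parameter values, one dict per candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))            # 12
print(len(candidates) * n_folds)  # 60
```

That is 12 candidates and 60 cross-validation fits, plus one final refit of the best candidate on the full training set under the default `refit=True`.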


In [25]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Test the classifier
y_pred = grid_search.best_estimator_.predict(X_test)

# Evaluate the classifier
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9966969446738233

Confusion Matrix:
 [[840   2]
 [  2 367]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       842
           1       0.99      0.99      0.99       369

    accuracy                           1.00      1211
   macro avg       1.00      1.00      1.00      1211
weighted avg       1.00      1.00      1.00      1211



The SVC performs well on the held-out test set: only 4 of 1211 messages are misclassified. Two non-spam e-mails are flagged as spam and two spam e-mails are classified as non-spam, which gives an accuracy of about 99.7%.
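
The reported scores can be reproduced directly from the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]` with spam as the positive class.

```python
# Confusion matrix values copied from the output above.
tn, fp = 840, 2
fn, tp = 2, 367

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

Precision and recall for the spam class both come out at 367/369, which rounds to the 0.99 shown in the classification report.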