In [None]:
from IPython.display import HTML
HTML(open("../style.css", "r").read())

In [None]:
%load_ext nb_mypy
%nb_mypy On

# Spam Detection Using Scikit-Learn

In this notebook, we will build a spam detector using the **Naive Bayes** algorithm provided by the `scikit-learn` library. 

The process consists of the following steps:

  - **Data Loading**: Reading email text files from directories.
  - **Feature Extraction**: Converting text data into numerical vectors (counts) using `CountVectorizer`.
  - **Model Training**: Fitting a `MultinomialNB` classifier on the training data.
  - **Evaluation**: Calculating the <em style='color:blue;'>precision</em> and <em style='color:blue;'>recall</em> using built-in metrics.

## Step 1: Imports and Setup

We need `os` for file handling and several modules from `sklearn` for the machine learning pipeline.

In [None]:
import os

From `scikit-learn`, we import the vectorizer, the classifier, and the evaluation metrics.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, accuracy_score, classification_report

The directory 
https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/6%20Classification/EmailData
contains 960 emails that are divided into four subdirectories:

  - `spam-train` contains 350 spam emails for training,
  - `ham-train`  contains 350 non-spam emails for training,
  - `spam-test`  contains 130 spam emails for testing,
  - `ham-test`   contains 130 non-spam emails for testing.

This data was originally collected by **Ion Androutsopoulos**.  I found it on a now-defunct 
*open classroom* page on https://online.stanford.edu/free-courses provided by Andrew Ng.

In [None]:
spam_dir_train: str = 'EmailData/spam-train/'
ham__dir_train: str = 'EmailData/ham-train/'
spam_dir_test:  str = 'EmailData/spam-test/'
ham__dir_test:  str = 'EmailData/ham-test/'

## Step 2: Loading Data

In the manual implementation, we processed the files one by one during prediction. `scikit-learn` works best when we first load all data into memory as lists of strings.

We define a helper function `load_data` that reads all files from a spam directory and a ham directory, returning a list of email texts (`X`) and a list of labels (`y`).

Convention:
* **1**: Spam
* **0**: Ham

In [None]:
def load_data(spam_dir: str, ham_dir: str) -> tuple[list[str], list[int]]:
    """Read all emails from both directories; label spam as 1 and ham as 0."""
    emails: list[str] = []
    labels: list[int] = []
    for directory, label in [(spam_dir, 1), (ham_dir, 0)]:
        # Sort the filenames so that the order of the samples is reproducible.
        for filename in sorted(os.listdir(directory)):
            path = os.path.join(directory, filename)
            # latin-1 can decode any byte sequence, so no email fails to load.
            with open(path, 'r', encoding='latin-1') as f:
                emails.append(f.read())
            labels.append(label)
    return emails, labels

Now we load the training and testing sets into memory.

In [None]:
X_train_text, y_train = load_data(spam_dir_train, ham__dir_train)
X_test_text, y_test   = load_data(spam_dir_test, ham__dir_test)

print(f"Training samples: {len(X_train_text)}")
print(f"Testing samples:  {len(X_test_text)}")

## Step 3: Vectorization (Feature Extraction)

The Naive Bayes algorithm requires numerical data. In the previous manual implementation, we created a dictionary of "Common Words" and counted their occurrences ourselves.

`scikit-learn` provides `CountVectorizer` which automates this:
1.  **Tokenization**: Splits text into words.
2.  **Vocabulary Building**: Finds all unique words (features).
3.  **Encoding**: Counts how often each word appears in each email.

We fit the vectorizer *only* on the training data to avoid data leakage, then transform both sets.
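
To make the three steps concrete, here is a minimal pure-Python sketch of a bag-of-words encoder. It is an illustration and not `CountVectorizer`'s actual implementation. The helper names `tokenize`, `fit_vocabulary`, and `transform` and the toy texts are hypothetical. The token pattern is assumed to mirror the vectorizer's default behavior (lowercasing, tokens of at least two word characters).

```python
import re
from collections import Counter

# Assumption: lowercase the text and keep tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text: str) -> list[str]:
    """Step 1: split a text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(texts: list[str], max_features: int) -> list[str]:
    """Step 2: keep the max_features most frequent tokens, sorted alphabetically."""
    totals: Counter[str] = Counter()
    for text in texts:
        totals.update(tokenize(text))
    return sorted(word for word, _ in totals.most_common(max_features))

def transform(texts: list[str], vocabulary: list[str]) -> list[list[int]]:
    """Step 3: count how often each vocabulary word occurs in each text."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocabulary])
    return rows

# Toy data (made up for this sketch, not taken from the email corpus).
toy_train = ["cheap pills cheap offer", "cheap pills now", "project meeting notes"]
toy_test = ["Cheap pills for the project"]

# The vocabulary is learned from the training texts only ...
vocab = fit_vocabulary(toy_train, max_features=2)
train_counts = transform(toy_train, vocab)
# ... and then reused unchanged for the test texts.
test_counts = transform(toy_test, vocab)

print(vocab)         # ['cheap', 'pills']
print(train_counts)  # [[2, 1], [1, 1], [0, 0]]
print(test_counts)   # [[1, 1]]
```

Words in the test text that are not part of the learned vocabulary (here `for`, `the`, and `project`) are simply ignored, which is why fitting on the training data alone does not leak information from the test set.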

In [None]:
# Initialize CountVectorizer.
# max_features=2500 keeps only the 2500 most frequent words, which matches the
# "Common_Words" logic of the original notebook. Scikit-Learn could also handle
# the full vocabulary efficiently (max_features=None, the default).
vectorizer = CountVectorizer(max_features=2500)

# Learn vocabulary from training text and vectorize it
X_train = vectorizer.fit_transform(X_train_text)

# Vectorize test text (using the vocabulary learned from training)
X_test = vectorizer.transform(X_test_text)

print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")

## Step 4: Training the Model

We use `MultinomialNB`, which is the standard Naive Bayes variant for data with discrete counts (like word counts).

This replaces the manual calculation of `Spam_Probability` and `Ham_Probability` dictionaries. The parameter `alpha=1.0` in `MultinomialNB` handles the *Laplace smoothing* automatically (just as we added +1 in the manual formula).
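
As a small illustration of what the smoothing does, the following sketch applies the standard additive-smoothing formula $(n_w + \alpha) / (N + \alpha \cdot |V|)$ to a toy table of word counts. The function name `smoothed_probabilities` and the counts are hypothetical. The formula is the textbook one, not a copy of the library's code.

```python
def smoothed_probabilities(counts: dict[str, int], alpha: float = 1.0) -> dict[str, float]:
    """Estimate P(word | class) from word counts using additive smoothing."""
    total = sum(counts.values())   # N: number of word occurrences in the class
    vocabulary_size = len(counts)  # |V|: number of distinct words
    return {
        word: (n + alpha) / (total + alpha * vocabulary_size)
        for word, n in counts.items()
    }

# Toy word counts over all spam training emails (made-up numbers).
spam_counts = {"cheap": 3, "pills": 1, "meeting": 0}
probs = smoothed_probabilities(spam_counts)

for word, p in probs.items():
    print(f"P({word} | spam) = {p:.4f}")
```

Without smoothing, `meeting` would get probability 0 and a single occurrence of it would rule out the class entirely. With `alpha=1.0` it receives 1/7 instead, and the probabilities still sum to 1.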

In [None]:
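# class_prior is given in the order of the sorted class labels: [P(Ham), P(Spam)].
# The strong prior for Ham makes the classifier reluctant to flag an email as spam.
# Since class_prior is set explicitly, the priors are not estimated from the data,
# so fit_prior=False is redundant but harmless.
clf = MultinomialNB(class_prior=[0.99, 0.01], fit_prior=False, alpha=1.0)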
clf.fit(X_train, y_train)
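
The classifier above is created with `class_prior=[0.99, 0.01]`, i.e. a prior of 0.99 for Ham (label 0) and 0.01 for Spam (label 1). The following sketch shows how such a prior shifts a Naive Bayes decision. The word probabilities and the helpers `log_score` and `classify` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-class word probabilities (not estimated from the email data).
p_word_given_ham = {"cheap": 0.1, "meeting": 0.9}
p_word_given_spam = {"cheap": 0.8, "meeting": 0.2}

def log_score(counts: dict[str, int], prior: float, word_probs: dict[str, float]) -> float:
    """Return log P(class) + sum over words of count * log P(word | class)."""
    return math.log(prior) + sum(n * math.log(word_probs[w]) for w, n in counts.items())

def classify(counts: dict[str, int], prior_ham: float, prior_spam: float) -> int:
    """Return 1 (Spam) if the spam score is larger, otherwise 0 (Ham)."""
    ham = log_score(counts, prior_ham, p_word_given_ham)
    spam = log_score(counts, prior_spam, p_word_given_spam)
    return 1 if spam > ham else 0

weak_evidence = {"cheap": 1, "meeting": 0}
strong_evidence = {"cheap": 3, "meeting": 0}

print(classify(weak_evidence, 0.5, 0.5))        # 1: with equal priors, one "cheap" suffices
print(classify(weak_evidence, 0.99, 0.01))      # 0: the Ham prior outweighs weak evidence
print(classify(strong_evidence, 0.99, 0.01))    # 1: strong evidence still overrides the prior
```

A prior that favors Ham this strongly trades recall on spam for fewer legitimate emails being filtered out.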

## Step 5: Evaluation

We can now predict the labels for our test set and calculate **Precision** and **Recall**.

Recall the definitions:
  - *Precision*: percentage of selected items that are relevant (True Positives / (True Positives + False Positives))
  - *Recall*:    percentage of relevant items selected (True Positives / (True Positives + False Negatives))

Note: In this notebook, we defined Spam as `1`, which `scikit-learn` treats as the positive class by default. In the original notebook, precision and recall were calculated with *Ham* as the positive class (in order to avoid filtering out important emails).
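
The effect of the choice of the positive class can be seen in a short pure-Python sketch. The function `precision_recall` and the toy labels are hypothetical and are not results from the email data. In `scikit-learn`, `precision_score` and `recall_score` accept a `pos_label` argument that serves the same purpose.

```python
def precision_recall(y_true: list[int], y_pred: list[int], pos_label: int) -> tuple[float, float]:
    """Compute precision and recall, treating pos_label as the positive class.

    This sketch assumes that at least one sample is predicted positive and
    at least one sample is actually positive, so no division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    return tp / (tp + fp), tp / (tp + fn)

# Toy labels: 1 = Spam, 0 = Ham. One spam email is missed by the classifier.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 0]

spam_precision, spam_recall = precision_recall(toy_true, toy_pred, pos_label=1)
ham_precision, ham_recall = precision_recall(toy_true, toy_pred, pos_label=0)

print(f"Spam as positive: precision = {spam_precision:.3f}, recall = {spam_recall:.3f}")
print(f"Ham  as positive: precision = {ham_precision:.3f}, recall = {ham_recall:.3f}")
```

The single missed spam email lowers the *recall* when Spam is the positive class (2/3), but lowers the *precision* when Ham is the positive class (5/6).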

Below, we print a classification report which shows metrics for *both* classes.

In [None]:
y_pred = clf.predict(X_test)

print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))