# Lab 2: Naïve Bayes Spam Classifier

In this lab, we extend the Naïve Bayes text classification workflow introduced in the seminar by applying it to a larger, real-world dataset and exploring model behavior, feature design, and evaluation in depth.

**Dataset**:
telegram-spam-ham (Hugging Face: https://huggingface.co/datasets/thehamkercat/telegram-spam-ham).
Contains text messages labeled as `spam` or `ham` (not spam).

## Loading libraries
The scripts that install the dependencies required for this lab are in the repository's `/scripts` folder.
- If you are on a YourAI cluster node, run the `install_env.sh` script to download the additional dependencies.
- If you are on Google Colab, open the terminal and run the commands in `install_google_collabs.sh`.

In [None]:
import pandas as pd
import spacy
from datasets import load_from_disk
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

try:
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
except OSError:
    print("Warning: spaCy model 'en_core_web_sm' not found. Please run 'python -m spacy download en_core_web_sm'")
    # Fall back to a blank English pipeline (tokenizer only, so lemmas will not be available)
    nlp = spacy.blank("en")

## Task 1. Load dataset

**Goal**: familiarize yourself with the dataset and ensure it’s ready for training.

**Steps**: 

1. Load the train/test splits.
2. Inspect class distribution, number of documents, and average message length.
3. Clean duplicates and missing values if any.


**Deliverables**:
- Short summary of dataset stats (counts, class ratio, sample messages).
- Code changes that clean and prepare the data.

**Note**: 
Before loading the dataset, run the following command from the repository root:

```sh
python scripts/download_dataset.py "thehamkercat/telegram-spam-ham" data/telegram-spam-ham
```




In [None]:
ds = load_from_disk("../../data/telegram-spam-ham")
df_telegram_all =  ds["train"].to_pandas()
df_telegram_all.columns = ["label", "text"]

# The full dataset is large, so draw a stratified sample of 10,000 messages
df_telegram, _ = train_test_split(
    df_telegram_all, train_size=10_000, stratify=df_telegram_all["label"], random_state=42
)

# Stratified 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df_telegram["text"], df_telegram["label"], test_size=0.2, random_state=42, stratify=df_telegram["label"]
)
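
The inspection and cleaning steps of Task 1 can be sketched as follows. This is a minimal sketch, not the expected solution: the helper names `summarise_dataset` and `clean_dataset` are hypothetical, message length is approximated by whitespace-separated tokens, and the toy rows are made up so the block runs without the dataset.

```python
import pandas as pd

def summarise_dataset(df, text_col="text", label_col="label"):
    """Return basic statistics for a labelled text DataFrame (hypothetical helper)."""
    lengths = df[text_col].str.split().str.len()  # whitespace tokens per message
    return {
        "n_documents": len(df),
        "class_counts": df[label_col].value_counts().to_dict(),
        "class_ratio": df[label_col].value_counts(normalize=True).round(3).to_dict(),
        "avg_length_tokens": float(lengths.mean()),
        "n_missing": int(df[text_col].isna().sum()),
        "n_duplicates": int(df.duplicated(subset=[text_col]).sum()),
    }

def clean_dataset(df, text_col="text"):
    """Drop rows with missing text, then drop repeated messages (hypothetical helper)."""
    return (
        df.dropna(subset=[text_col])
          .drop_duplicates(subset=[text_col])
          .reset_index(drop=True)
    )

# Toy rows standing in for the real data (invented for illustration)
toy = pd.DataFrame({
    "label": ["spam", "ham", "ham", "spam", "ham"],
    "text": ["win money now", "see you at noon", "see you at noon", None, "ok"],
})
toy_clean = clean_dataset(toy)
stats = summarise_dataset(toy_clean)
print(stats)
```

In the notebook, one option is to apply the cleaning to `df_telegram_all` before sampling and splitting, so that duplicated messages cannot end up in both the training and the test split.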

## Task 2. Train a baseline NB spam classifier with basic pre-processing

**Goal**: train a baseline MultinomialNB classifier using a Bag-of-Words representation, reflecting on its performance.

**Steps**: 
1. Record the baseline performance.
2. Inspect the top 300 features/words that are most indicative of each class according to the learned likelihoods `P(w|c)`. Use the provided utility function `print_indicative_features()`.
3. Reflect on the main causes of the current misclassifications.

**Deliverable**: 
- Write down your interpretation of the limitations of this baseline implementation
- Write down a list of potential pre-processing operations you would implement


In [None]:
min_word_count = 1
vectorizer = CountVectorizer(stop_words='english',
                             min_df=min_word_count) # drop words that appear in fewer than 'min_word_count' documents

X_train_bow = vectorizer.fit_transform(X_train) # fit to the training split
X_test_bow  = vectorizer.transform(X_test)      # transform the test split

print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")

nb = MultinomialNB(alpha=1.0)
nb.fit(X_train_bow, y_train)

y_pred = nb.predict(X_test_bow)
print(classification_report(y_test, y_pred))

print("Confusion Matrix.")
cm = confusion_matrix(y_test, y_pred, labels=nb.classes_) 
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=nb.classes_)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix: Naïve Bayes (Simple)")
plt.show()
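
To reflect on the causes of misclassification, it can help to read the misclassified messages themselves. The sketch below collects them into a DataFrame; `collect_errors` is a hypothetical helper (it is not part of the repository), and the toy inputs are invented so the block runs on its own.

```python
import pandas as pd

def collect_errors(texts, y_true, y_pred):
    """Return misclassified messages with true and predicted labels (hypothetical helper)."""
    df = pd.DataFrame({
        "text": list(texts),
        "true": list(y_true),
        "pred": list(y_pred),
    })
    errors = df[df["true"] != df["pred"]].copy()
    # e.g. "ham->spam" marks a false positive, "spam->ham" a false negative
    errors["error_type"] = errors["true"] + "->" + errors["pred"]
    return errors.reset_index(drop=True)

# Toy example (invented messages and predictions)
toy_errors = collect_errors(
    ["free crypto", "lunch?", "call me", "claim prize"],
    ["spam", "ham", "ham", "spam"],
    ["spam", "spam", "ham", "ham"],
)
print(toy_errors)
```

In the notebook, the equivalent call would be `collect_errors(X_test, y_test, y_pred)`.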

In [None]:
# Helper function!
def print_indicative_features(nb, vectorizer, topk=10, verbose=True):
    # 1. Identify class indices (spam = 1, ham = 0)
    classes = nb.classes_.tolist()
    i_spam  = classes.index("spam")
    i_ham   = classes.index("ham")
    
    # 2. Retrieve the learned log probabilities for each word and class
    feature_names = np.array(vectorizer.get_feature_names_out())
    log_pw = nb.feature_log_prob_
    
    # 3. Compute the log-odds for each feature: spam minus ham
    log_odds = log_pw[i_spam] - log_pw[i_ham]
    
    # 4. Build a small DataFrame for inspection
    df_weights = pd.DataFrame({
        "feature": feature_names,
        "logP_w_given_spam": log_pw[i_spam],
        "logP_w_given_ham":  log_pw[i_ham],
        "logodds_spam_minus_ham": log_odds,
        "odds_ratio": np.exp(log_odds)  # P(w|spam) / P(w|ham): how many times likelier the word is under spam than ham
    }).sort_values("logodds_spam_minus_ham", ascending=False)
    
    # 5. Display the top indicative words for each class

    top_spam = df_weights.head(topk)                 # most spam-indicative
    top_ham  = df_weights.tail(topk).iloc[::-1]      # most ham-indicative

    if verbose:
        print("Top spam-indicative features:")
        display(top_spam[["feature", "logodds_spam_minus_ham", "odds_ratio"]])    
        print("\nTop ham-indicative features:")
        display(top_ham[["feature", "logodds_spam_minus_ham", "odds_ratio"]])
    else:    
        spam_words = ", ".join(top_spam["feature"].tolist())
        ham_words  = ", ".join(top_ham["feature"].tolist())
        print(f"Top spam-indicative words ({topk}):\n {spam_words}")
        print()
        print(f"Top ham-indicative words  ({topk}):\n {ham_words}")

## Task 3. Implement NB with advanced pre-processing

**Goal**: design and test your own pre-processing pipeline and features.

**Steps**:
1. Implement the additional pre-processing steps you hypothesised in Task 2.
2. After exploring the dataset, encode additional non-word features.
3. Keep track of the effect of each change on performance.

**Deliverable**:
- Code changes reflecting your experiments.
- Log of the performance observed during your exploration.

In [None]:
import re, html, unicodedata

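def spacy_tokenizer(text, do_normalise=True):
    """
    Tokenize with spaCy, dropping punctuation, whitespace and stop words.
    Returns lemmas if do_normalise is True, otherwise the raw token text.
    """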
    doc = nlp(text)

    tokens = []
    for token in doc:
        if token.is_punct or token.is_space or token.is_stop:
            continue

        lemma = token.lemma_ if do_normalise else token.text
        tokens.append(lemma)

    return [t for t in tokens if t]
    

def run_nb_pipeline(custom_tokenizer, class_prior = None):
    """Train NB. class_prior = [P(ham), P(spam)] or None to learn from data."""
    vectorizer = CountVectorizer(tokenizer=custom_tokenizer, stop_words=None, token_pattern=None)
    X_train_bow = vectorizer.fit_transform(X_train)
    X_test_bow  = vectorizer.transform(X_test)
    
    print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")
    
    nb = MultinomialNB(alpha=1.0, class_prior=class_prior)
    nb.fit(X_train_bow, y_train)
    
    y_pred = nb.predict(X_test_bow)
    print(classification_report(y_test, y_pred))
    
    ConfusionMatrixDisplay.from_estimator(nb, X_test_bow, y_test, cmap="Blues")
    plt.title("Confusion Matrix: Naïve Bayes Spam Classifier with custom pre-processing")
    plt.show()

    # We print out the top indicative words
    print_indicative_features(nb, vectorizer, topk=300, verbose=False)

    return (nb, vectorizer, y_pred)

nb_a, vec_a, y_pred_a = run_nb_pipeline(spacy_tokenizer)
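
One way to encode non-word features is to map patterns such as URLs, @mentions and numbers to placeholder tokens before tokenization, so that the classifier sees "a URL occurred" rather than thousands of distinct URLs. The following is a sketch under assumptions: `normalise_text`, the regular expressions and the placeholder names (`urltoken`, `mentiontoken`, `numtoken`) are illustrative choices, not part of the lab code, and whether they help on this dataset is something to verify experimentally.

```python
import html
import re
import unicodedata

# Illustrative patterns; they are deliberately simple and will miss edge cases
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def normalise_text(text):
    """Replace URLs, @mentions and numbers with placeholder tokens (hypothetical helper)."""
    text = html.unescape(text)                  # decode HTML entities such as "&amp;"
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters (e.g. full-width letters)
    text = URL_RE.sub(" urltoken ", text)       # URLs first, so digits inside them are not counted as numbers
    text = MENTION_RE.sub(" mentiontoken ", text)
    text = NUMBER_RE.sub(" numtoken ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

example = normalise_text("WIN $1,000 now &amp; join @cryptoking at https://t.me/abc123")
print(example)
```

To try it in the pipeline, the helper can be composed with the existing tokenizer, for example `run_nb_pipeline(lambda t: spacy_tokenizer(normalise_text(t)))`.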

## Task 4. Explore changes in priors and the decision threshold (tau)

**Goal**: define a custom decision threshold for the classification probabilities.

**Steps**:
1. Implement a revised version of `run_nb_pipeline()` that classifies a message as spam only when its predicted spam probability reaches a given threshold (tau).
2. Evaluate your pipeline with tau = 0.60 and tau = 0.80 (that is, 60% and 80%).

**Deliverable**:
- Implemented revised function `run_nb_pipeline()`
- Log of performance for the given values of tau
- Reflection on the impact of those values on the classification metrics
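
A thresholded decision rule can be sketched as follows. This is one possible approach, not the required solution: `predict_with_threshold` is a hypothetical helper, tau is read as a probability in [0, 1] (so 60 corresponds to 0.60), and the toy corpus is invented so the block runs without the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def predict_with_threshold(nb, X, tau=0.5, positive="spam", negative="ham"):
    """Predict `positive` only when P(positive | document) >= tau (hypothetical helper)."""
    i_pos = list(nb.classes_).index(positive)
    p_pos = nb.predict_proba(X)[:, i_pos]
    return np.where(p_pos >= tau, positive, negative)

# Toy corpus (invented)
train_docs = ["win cash now", "win prize now", "meeting at noon", "lunch at noon"]
train_labels = ["spam", "spam", "ham", "ham"]
test_docs = ["win lunch", "cash prize now", "meeting now"]

vec = CountVectorizer()
nb_toy = MultinomialNB(alpha=1.0).fit(vec.fit_transform(train_docs), train_labels)
X_new = vec.transform(test_docs)

# Raising tau can only keep or reduce the number of messages flagged as spam
for tau in (0.5, 0.6, 0.8):
    print(tau, predict_with_threshold(nb_toy, X_new, tau=tau))
```

Inside `run_nb_pipeline()`, the call to `nb.predict(X_test_bow)` could be replaced by this helper. Note that `ConfusionMatrixDisplay.from_estimator` calls `nb.predict` internally and would ignore tau, so `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` is the safer choice for the plot. A higher tau typically trades spam recall for spam precision; the `class_prior` argument shifts the same probabilities from the other side.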

## Log of performance
Paste below the performance results you obtain from your exploration. Describe the parameters or processing that led to the results.
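
A lightweight way to keep this log is to append one row per experiment to a list and display it as a table. The sketch below assumes spam is the positive class; `log_run` and the column names are hypothetical, and the toy labels are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

experiment_log = []  # one dict per experiment

def log_run(description, y_true, y_pred, positive="spam", **params):
    """Append spam precision/recall/F1 for one run to the log (hypothetical helper)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label=positive, average="binary", zero_division=0
    )
    experiment_log.append({
        "description": description,
        **params,  # e.g. tau=0.6, min_df=2
        "precision_spam": precision,
        "recall_spam": recall,
        "f1_spam": f1,
    })

# Toy run (invented labels): one of two spam messages is missed
log_run("toy baseline", ["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"], tau=0.5)
log_df = pd.DataFrame(experiment_log)
print(log_df)
```

In the notebook, each call to `run_nb_pipeline()` could be followed by something like `log_run("lemmas + URL placeholder", y_test, y_pred_a)`.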