In [None]:
%matplotlib inline

In [None]:
from collections import Counter
from collections.abc import Iterable
import itertools
from pathlib import Path

import datasets
import gradio as ui
from matplotlib.axes import Axes
import matplotlib.pyplot as plt
import numpy as np
from numpy.typing import ArrayLike
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (
  accuracy_score, classification_report,
  ConfusionMatrixDisplay, PrecisionRecallDisplay,
)
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline
import sklearn.tree as sk_tree
import torch
from transformers.models.bert import (
  BertForSequenceClassification, BertTokenizer
)

from common.config import misc
from common.types import FeatureImportances, HamSpamFeatureImportances
from pipeline.text_classifier_builder import TextClassifierBuilder
from pipeline.utils import get_predictor, get_predictor_name, get_transformers
from tasks.ada_boost_task import (
  AdaBoostClassifierBuilder, AdaBoostTask
)
import tasks.bert_task as bert_task
from tasks.best_bow_task import BestBowTask
from tasks.decision_tree_task import (
  DecisionTreeClassifierBuilder, DecisionTreeTask
)
from tasks.email_preprocess_task import EmailPreprocessTask
from tasks.extra_trees_task import (
  ExtraTreesClassifierBuilder, ExtraTreesTask
)
from tasks.gradient_boosting_task import (
  GradientBoostingClassifierBuilder, GradientBoostingTask
)
from tasks.linear_svm_task import (
  LinearSvmClassifierBuilder, LinearSvmTask
)
from tasks.logistic_regression_task import (
  LogisticRegressionClassifierBuilder, LogisticRegressionTask
)
from tasks.naive_bayes_task import (
  NaiveBayesClassifierBuilder, NaiveBayesTask
)
from tasks.nltk_task import NltkTask
from tasks.poly_svm_task import (
  PolySvmClassifierBuilder, PolySvmTask
)
from tasks.random_forest_task import (
  RandomForestClassifierBuilder, RandomForestTask
)
from tasks.rbf_svm_task import (
  RbfSvmClassifierBuilder, RbfSvmTask
)
from tasks.sms_preprocess_task import SmsPreprocessTask
from tasks.stacking_task import (
  StackingClassifierBuilder, StackingTask
)
from tasks.voting_task import (
  VotingClassifierBuilder, VotingTask
)

# Demonstration of methods for message spam detection
## Introduction
This project demonstrates and compares various message spam detection methods by implementing a general ham/spam binary classification task on datasets obtained from:
- [SpamAssassin public mail corpus](#spamassassin-public-mail-corpus)
- [UCI Machine Learning Repository](#sms-spam-collection---uci-machine-learning-repository)

The ham/spam binary classification task is implemented using the following two groups of [machine learning models](#machine-learning-models):
- [bag-of-words (BoW)](#bag-of-words)
  - [naive Bayes](#naive-bayes)
  - [logistic regression](#logistic-regression)
  - [decision trees](#decision-tree)
  - [support vector machines](#support-vector-machine)
  - [ensembles - bagging (voting), boosting, stacking](#ensemble-learning)
- [bidirectional encoder representations from transformers (BERT)](#bert)

The message spam data is modeled in two different ways depending on the classification approach:
- [term frequency–inverse document frequency (`TF-IDF`)](#tf-idf), a "normalized" `BoW` text model,
  together with custom [feature engineering](#feature-engineering)
- [word embeddings for neural natural language processing](#word-embedding)
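
As a rough illustration of what a "normalized" `BoW` model means, the sketch below computes `TF-IDF` weights in plain Python. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which mirrors the default settings of `Scikit-learn`'s `TfidfVectorizer`. The `tf_idf` helper and the toy messages are hypothetical and not part of this project.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
  """Returns one L2-normalized TF-IDF weight mapping per document."""
  n_docs = len(documents)
  # Document frequency: in how many documents each term occurs.
  doc_freq = Counter(term for doc in documents for term in set(doc))
  weighted = []
  for doc in documents:
    counts = Counter(doc)
    # Term count multiplied by the smoothed IDF (an assumption, see above).
    row = {
      term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
      for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in row.values()))
    if norm:
      row = {term: w / norm for term, w in row.items()}
    weighted.append(row)
  return weighted


docs = [
  "win a free prize now".split(),
  "are you free for lunch".split(),
  "claim your prize now".split(),
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
  print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer messages (here `win`) receive a larger weight than terms shared by several messages (here `free`), which is the effect the normalization is meant to achieve.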

Because the `BoW` approach is widely used for text classification and many `BoW`-based methods exist (some of them mentioned above), the main focus is on building a high-performance `BoW` classifier, while a `BERT` classifier is used mainly to point out the specifics of `BERT` as an alternative approach to text classification. The `BoW` model with the best metrics is therefore selected for comparison with a `BERT` model, though some [limitations](#limitations) apply.

The implementation of the `BoW` classifiers is based on [Scikit-learn](#scikit-learn) and the `Scikit-learn` terms for *estimator*, *transformer* and *predictor* are used.
The implementation of the `BERT` classifier is based on [Hugging Face Transformers](#hugging-face-transformers) and [PyTorch](#pytorch).

There are multiple trade-offs between the `BoW` and `BERT` approaches, and this project attempts to empirically show the strong and weak points of each of the two approaches.

`BoW` models can work with very long text sequences, at the expense of losing most information about word order and grouping. As partial compensation, `BoW` models can count groups of adjacent word-tokens (`n-grams`), but this is not a substitute for technologies like `word2vec`, [Recurrent Neural Networks (RNNs)](#recurrent-neural-network) and [Attention](#attention). Attention in particular is at the core of neural networks like `BERT` and allows them to capture phrases and bidirectional interdependencies within a sequence of input word-tokens.
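
The following minimal sketch illustrates the point about `n-grams`. It is plain Python with a hypothetical `ngram_counts` helper and toy messages, not the vectorizer used by this project. Two messages with the same words in a different order are indistinguishable by unigram counts, but differ once bigrams are counted.

```python
from collections import Counter


def ngram_counts(text: str, n_min: int = 1, n_max: int = 2) -> Counter:
  """Counts all word n-grams with n_min <= n <= n_max."""
  tokens = text.lower().split()
  counts = Counter()
  for n in range(n_min, n_max + 1):
    for start in range(len(tokens) - n + 1):
      counts[" ".join(tokens[start:start + n])] += 1
  return counts


# Same words, different order.
a = "you won not lost"
b = "you lost not won"
unigrams_equal = ngram_counts(a, 1, 1) == ngram_counts(b, 1, 1)
bigrams_equal = ngram_counts(a, 1, 2) == ngram_counts(b, 1, 2)
print(unigrams_equal, bigrams_equal)  # True False
```

Bigrams only capture adjacent pairs, so dependencies between words that are far apart remain invisible to this kind of model.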

Additionally, `BoW` models provide better explainability of how their algorithm works and how the classification output is produced.

<br><br>

## File structure
### Folders (packages)
- *common*: General configuration and types.
- *models*: Saved `BoW` and `BERT` models.
- *pipeline*: Training a `BoW` text classifier based on a `Scikit-learn` pipeline.
- *tasks*: [Luigi](#luigi) tasks and builders of classifiers leveraging [Optuna](#optuna) and related utilities.
### This notebook
- Analysis of text classifiers already built by [Luigi](#luigi) tasks.
### Other files
- Various configuration and setup files, such as *luigi.cfg*.

## Limitations
If `classification.feature_selector_type=svd` is set in *luigi.cfg* and a `BoW` model is built with a feature selector, then the `BoW`-`BERT` comparison is not supported for that model. The reason is that `TF-IDF` feature information is no longer available once dimensionality reduction is applied before the classifier in a pipeline.

## Data retrieval

### Data distribution utilities

Here are some basic utilities that give an overview of how the message data is distributed in terms of letters, language, size, duplicates and ham/spam labels.
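
To make the meaning of "duplicates" concrete, here is a small plain-Python sketch. It assumes the convention of `pandas.DataFrame.duplicated` with its default `keep="first"`: the first occurrence of a row counts as unique and every later repeat counts as a duplicate. The `duplicated_flags` helper and the toy rows are hypothetical.

```python
def duplicated_flags(rows: list[tuple[str, int]]) -> list[bool]:
  """Flags every row that repeats an earlier row (first occurrence kept)."""
  seen = set()
  flags = []
  for row in rows:
    flags.append(row in seen)
    seen.add(row)
  return flags


# (message, is_spam) pairs.
rows = [
  ("free prize", 1),
  ("see you at 5", 0),
  ("free prize", 1),
  ("free prize", 1),
  ("see you at 5", 0),
]
flags = duplicated_flags(rows)
ham_duplicates = sum(
  flag and label == 0 for flag, (_, label) in zip(flags, rows)
)
spam_duplicates = sum(
  flag and label == 1 for flag, (_, label) in zip(flags, rows)
)
uniques = len(rows) - ham_duplicates - spam_duplicates
print(ham_duplicates, spam_duplicates, uniques)  # 1 2 2
```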

In [None]:
def show_letter_distribution(spam_df: pd.DataFrame,
                             title: str,
                             axes: Axes | None = None):
  counter = Counter()
  for message in spam_df.message:
    for character in message:
      if character.isalpha():
        counter[character.lower()] += 1
  letters, counts = zip(*counter.most_common(50))
  to_show = False
  if axes is None:
    axes = plt.gca()
    to_show = True
  axes.bar(letters, counts)
  axes.set_xlabel("Letter")
  axes.set_ylabel("Count")
  axes.set_title(title)
  if to_show:
    plt.show()


def show_language_distribution(spam_df: pd.DataFrame,
                               title: str,
                               axes: Axes | None = None):
  spam_data_count = len(spam_df)
  all_english_words = NltkTask().all_english_words
  english_count = sum(is_likely_english_text(message, all_english_words)
                      for message in spam_df.message)
  nonenglish_count = spam_data_count - english_count
  language_distribution = [english_count, nonenglish_count]
  to_show = False
  if axes is None:
    axes = plt.gca()
    to_show = True
  axes.pie(
    language_distribution,
    labels=[f"English ({english_count})",
            f"other ({nonenglish_count})"],
    autopct="%.2f%%",
  )
  axes.set_title(title)
  if to_show:
    plt.show()


def is_likely_english_text(text: str, all_english_words: set[str]):
  if text is None or not text:
    return False

  words_count = 0
  english_words_count = 0
  for token in text.lower().split():
    if token.isalpha():
      if token in all_english_words:
        english_words_count += 1
      words_count += 1
  return (
    words_count != 0
    and 0.67 < (english_words_count / words_count)
  )


def show_size_distribution(spam_df: pd.DataFrame,
                           title: str,
                           axes: Axes | None = None):
  ham_sizes = [len(message.split())
               for message in spam_df[spam_df.is_spam == 0].message]
  spam_sizes = [len(message.split())
                for message in spam_df[spam_df.is_spam == 1].message]
  to_show = False
  if axes is None:
    axes = plt.gca()
    to_show = True
  axes.hist(ham_sizes, bins="fd", log=True, label="Ham", alpha=0.5)
  axes.hist(spam_sizes, bins="fd", log=True, label="Spam", alpha=0.5)
  axes.set_title(title)
  axes.legend()
  if to_show:
    plt.show()


def show_duplicates_distribution(spam_df: pd.DataFrame,
                                 title: str,
                                 axes: Axes | None = None):
  spam_data_count = len(spam_df)
  # `duplicated()` must be called: it flags rows that repeat an earlier row.
  is_duplicate = spam_df.duplicated()
  ham_duplicates_count = (spam_df[is_duplicate].is_spam == 0).sum()
  spam_duplicates_count = (spam_df[is_duplicate].is_spam == 1).sum()
  uniques_count = (
    spam_data_count - ham_duplicates_count - spam_duplicates_count
  )
  duplicates_distribution = [
    ham_duplicates_count, spam_duplicates_count, uniques_count
  ]
  to_show = False
  if axes is None:
    axes = plt.gca()
    to_show = True
  axes.pie(
    duplicates_distribution,
    explode=[0, 0.5, 0],
    labels=[f"Ham Duplicates ({ham_duplicates_count})",
            f"Spam Duplicates ({spam_duplicates_count})",
            f"Uniques ({uniques_count})"],
    autopct="%.2f%%",
  )
  axes.set_title(title)
  if to_show:
    plt.show()


def show_spam_distribution(spam_df: pd.DataFrame,
                           title: str,
                           axes: Axes | None = None):
  spam_data_count = len(spam_df)
  spam_count = int(spam_df.is_spam.sum())
  ham_count = spam_data_count - spam_count
  spam_distribution = [ham_count, spam_count]
  to_show = False
  if axes is None:
    axes = plt.gca()
    to_show = True
  axes.pie(
    spam_distribution,
    labels=[f"Ham ({ham_count})", f"Spam ({spam_count})"],
    autopct="%.2f%%",
  )
  axes.set_title(title)
  if to_show:
    plt.show()

### Email dataset

This section displays the letter, language, size, duplicates and ham/spam distributions of the output of the `tasks.email_preprocess_task.EmailPreprocessTask` task.

The email letter distribution shows that the most frequent letters are part of the English alphabet, and non-English letters are much less frequent. The latter observation is confirmed by the email language statistics.

In [None]:
email_spam_data_path = Path(EmailPreprocessTask().output().path)
if email_spam_data_path.exists():
  email_spam_df = pd.read_csv(email_spam_data_path)
  _, (ax_size, ax_dup, ax_spam, ax_letter, ax_lang) = \
    plt.subplots(5, 1, figsize=(5, 25))
  show_size_distribution(
    email_spam_df,
    "Email sizes distribution",
    ax_size,
  )
  show_duplicates_distribution(
    email_spam_df,
    "Email spam duplicates distribution",
    ax_dup,
  )
  show_spam_distribution(
    email_spam_df,
    "Email spam distribution",
    ax_spam,
  )
  show_letter_distribution(
    email_spam_df,
    "Email letter distribution",
    ax_letter,
  )
  show_language_distribution(
    email_spam_df,
    "Email language distribution",
    ax_lang,
  )
  plt.show()

### SMS dataset

This section displays the letter, language, size, duplicates and ham/spam distributions of the output of the `tasks.sms_preprocess_task.SmsPreprocessTask` task.

The SMS letter distribution shows that the most frequent letters are part of the English alphabet, and non-English letters are much less frequent. The latter observation is confirmed by the SMS language statistics.

In [None]:
sms_spam_data_path = Path(SmsPreprocessTask().output().path)
if sms_spam_data_path.exists():
  sms_spam_df = pd.read_csv(sms_spam_data_path)
  _, (ax_size, ax_dup, ax_spam, ax_letter, ax_lang) = \
    plt.subplots(5, 1, figsize=(5, 25))
  show_size_distribution(
    sms_spam_df,
    "SMS sizes distribution",
    ax_size,
  )
  show_duplicates_distribution(
    sms_spam_df,
    "SMS spam duplicates distribution",
    ax_dup,
  )
  show_spam_distribution(
    sms_spam_df,
    "SMS spam distribution",
    ax_spam,
  )
  show_letter_distribution(
    sms_spam_df,
    "SMS letter distribution",
    ax_letter,
  )
  show_language_distribution(
    sms_spam_df,
    "SMS language distribution",
    ax_lang,
  )
  plt.show()

### Message dataset

This section loads the output of the `tasks.train_test_split.TrainTestSplit` task.

In [None]:
train_df = pd.read_csv("data/train_messages.csv")
X_train, y_train = train_df.message, train_df.is_spam

test_df = pd.read_csv("data/test_messages.csv")
X_test, y_test = test_df.message, test_df.is_spam

## Common evaluatory functions

This section defines common metric visualization functions:
  - `show_confusion_matrix`: utilizes `sklearn.metrics.ConfusionMatrixDisplay.from_predictions`;
  - `show_precision_recall_curve`: utilizes `sklearn.metrics.PrecisionRecallDisplay.from_predictions`.
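
The confusion matrix in the cell below is drawn with `normalize="true"`, which divides every row (one row per true label) by its total, so each cell reads as "share of true ham/spam messages predicted as ham/spam". A minimal plain-Python sketch of that normalization, with a hypothetical helper name and toy labels:

```python
def row_normalized_confusion(y_true, y_pred, labels=(0, 1)):
  """Confusion matrix whose rows (true labels) each sum to 1."""
  index = {label: i for i, label in enumerate(labels)}
  counts = [[0] * len(labels) for _ in labels]
  for truth, guess in zip(y_true, y_pred):
    counts[index[truth]][index[guess]] += 1
  # Rows without any sample are left as zeros in this sketch.
  return [
    [cell / sum(row) if sum(row) else 0.0 for cell in row]
    for row in counts
  ]


y_true = [0, 0, 0, 0, 1, 1]  # 0 = ham, 1 = spam
y_pred = [0, 0, 0, 1, 1, 0]
matrix = row_normalized_confusion(y_true, y_pred)
print(matrix)  # [[0.75, 0.25], [0.5, 0.5]]
```

In this toy case 75% of the ham messages are recognized as ham, while only half of the spam messages are caught, which a matrix of raw counts would make harder to see on an imbalanced dataset.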

In [None]:
def show_confusion_matrix(
  y: ArrayLike,
  predictions: ArrayLike,
  title: str,
  axes: Axes | None = None,
) -> None:
  ConfusionMatrixDisplay.from_predictions(
    y, predictions, labels=[0, 1],
    normalize="true", values_format=".0%", ax=axes,
  )
  to_show = False
  if axes is None:
    to_show = True
    axes = plt.gca()
  axes.xaxis.set_ticklabels(["Ham", "Spam"])
  axes.yaxis.set_ticklabels(["Ham", "Spam"])
  axes.set_title(title)
  if to_show:
    plt.show()


def show_precision_recall_curve(
  y: ArrayLike,
  predictions: ArrayLike,
  title: str,
  axes: Axes | None = None,
) -> None:
  PrecisionRecallDisplay.from_predictions(
    y, predictions,
    pos_label=1, name=title, ax=axes,
  )
  if axes is None:
    plt.show()

## BoW approach

This section, together with its subsections, contains research that is specific to the `BoW` approach.

### Evaluation

This section defines some evaluation functions intended for `BoW` classifiers:
- `show_scores`: displays metrics such as classification report, confusion matrix and precision-recall curve;
- `get_top_features`: optionally displays and returns separately `TOP_K_FEATURES` ham and `TOP_K_FEATURES` spam features, `TOP_K_FEATURES` ham and/or spam features, or *None*;
- `get_top_clustered_features`: optionally displays and returns separately `TOP_K_FEATURES` ham and `TOP_K_FEATURES` spam features, or *None*;
- `has_clusterable_features`: tests whether a `BoW` classifier has clusterable features, used by `get_top_clustered_features`;
- `show_feature_importances`: a utility function that displays the feature importances, used by `get_top_features` and `get_top_clustered_features`.
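
For predictors that expose signed coefficients, `get_top_features` multiplies each coefficient by the mean `TF-IDF` value of its feature and then takes the `k` most negative products as ham features and the `k` most positive as spam features. The sketch below reproduces that selection in plain Python instead of `numpy.argsort`; the helper name, feature names and numbers are made up for illustration.

```python
def top_k_ham_spam(names, importances, k):
  """Splits signed importances into top-k ham and top-k spam features."""
  order = sorted(range(len(importances)), key=lambda i: importances[i])
  # Most negative importances first: strongest ham indicators.
  ham = [(names[i], importances[i]) for i in order[:k]]
  # Most positive importances first: strongest spam indicators.
  spam = [(names[i], importances[i]) for i in order[::-1][:k]]
  return ham, spam


names = ["meeting", "free", "thanks", "prize", "call"]
coefficients = [-1.2, 2.0, -0.8, 1.5, 0.1]
mean_tfidf = [0.5, 0.25, 0.5, 0.5, 1.0]
importances = [c * m for c, m in zip(coefficients, mean_tfidf)]
ham, spam = top_k_ham_spam(names, importances, k=2)
print([name for name, _ in ham])   # ['meeting', 'thanks']
print([name for name, _ in spam])  # ['prize', 'free']
```

Weighting by the mean `TF-IDF` value means that a large coefficient on a feature that rarely occurs (here `free`) can rank below a smaller coefficient on a more common one (here `prize`).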

In [None]:
TOP_K_FEATURES = 10


def show_scores(
  model: Pipeline,
  X: pd.Series = X_test,
  y: pd.Series = y_test,
) -> None:
  """Displays various metrics for `model`.
  
  Displayed metrics are:
  - classification report;
  - confusion matrix;
  - precision-recall curve.

  Predictions used for precision-recall curve are obtained from
  `predict_proba`, `decision_function` or `predict` - whichever
  method is available, in order of priority.

  Args:
    model: A `Scikit-learn` pipeline.
    X: The messages.
    y: The ham/spam labels.
  """
  predictions = model.predict(X)
  predictor_name = get_predictor_name(model)
  classification_report_str = classification_report(
    y, predictions,
    labels=[0, 1], target_names=["Ham", "Spam"],
    digits=3, zero_division=np.nan,  # type: ignore
  )
  print(f"\n\n{predictor_name}:")
  print("-" * 80)
  print(classification_report_str)
  if hasattr(model, "predict_proba"):
    prob_predictions = model.predict_proba(X)[:, 1]
  elif hasattr(model, "decision_function"):
    prob_predictions = model.decision_function(X)
  else:
    prob_predictions = predictions
  _, (ax_cm, ax_prc) = plt.subplots(2, 1, figsize=(8, 12))
  show_confusion_matrix(y, predictions,
                        predictor_name, ax_cm)
  show_precision_recall_curve(y, prob_predictions,
                              predictor_name, ax_prc)
  plt.show()


def get_top_features(
  model: Pipeline,
  X: pd.Series = X_train,
  y: pd.Series = y_train,
  k: int = TOP_K_FEATURES,
  show_features: bool = True,
) -> (HamSpamFeatureImportances | FeatureImportances | None):
  """Optionally displays and returns top features and their importances.

  Forwards the call to `get_top_clustered_features` if `model`
  uses dimensionality reduction in the feature selection step.

  What features are returned depends on what attributes `model` has.

  Args:
    model: A `Scikit-learn` pipeline.
    X: The messages.
    y: The ham/spam labels.
    k: Number of selected top features.
    show_features: Display or not the features.

  Returns:
    Separately `TOP_K_FEATURES` ham and `TOP_K_FEATURES` spam features,
    `TOP_K_FEATURES` ham and/or spam features, or *None*.
  """
  if has_clusterable_features(model):
    return get_top_clustered_features(model, X, y, k, show_features)

  predictor = get_predictor(model)
  transformers = get_transformers(model)
  names = transformers.get_feature_names_out()
  has_feature_log_prob_ = hasattr(predictor, "feature_log_prob_")
  has_coef_ = hasattr(predictor, "coef_")
  has_ham_spam_importances = has_feature_log_prob_ or has_coef_
  if has_ham_spam_importances:
    if has_feature_log_prob_:
      importances = predictor.feature_log_prob_  # type: ignore
      ham_top_k_indices = np.argsort(importances[0])[-k:][::-1]
      spam_top_k_indices = np.argsort(importances[1])[-k:][::-1]
      ham_top_k_names = names[ham_top_k_indices]
      ham_top_k_importances = importances[0][ham_top_k_indices]
      spam_top_k_names = names[spam_top_k_indices]
      spam_top_k_importances = importances[1][spam_top_k_indices]
    else:
      X_tfidf_mean = transformers.transform(X).mean(axis=0)
      importances = (predictor.coef_.ravel()  # type: ignore
                     * np.asarray(X_tfidf_mean).ravel())
      ham_top_k_indices = np.argsort(importances)[:k]
      spam_top_k_indices = np.argsort(importances)[-k:][::-1]
      ham_top_k_names = names[ham_top_k_indices]
      ham_top_k_importances = importances[ham_top_k_indices]
      spam_top_k_names = names[spam_top_k_indices]
      spam_top_k_importances = importances[spam_top_k_indices]
    if show_features:
      show_feature_importances(ham_top_k_names,
                               ham_top_k_importances,
                               "Most important ham features:")
      show_feature_importances(spam_top_k_names,
                               spam_top_k_importances,
                               "Most important spam features:")
    return (
      list(zip(ham_top_k_names, ham_top_k_importances)),
      list(zip(spam_top_k_names, spam_top_k_importances)),
    )
  elif hasattr(predictor, "feature_importances_"):
    importances = predictor.feature_importances_  # type: ignore
    top_k_indices = np.argsort(importances)[-k:][::-1]
    top_k_names = names[top_k_indices]
    top_k_importances = importances[top_k_indices]
    if show_features:
      show_feature_importances(top_k_names,
                               top_k_importances,
                               "Most important features:")
    return list(zip(top_k_names, top_k_importances))
  else:
    print("No metric available for getting the top features")
    return None


def get_top_clustered_features(
  model: Pipeline,
  X: pd.Series = X_train,
  y: pd.Series = y_train,
  k: int = TOP_K_FEATURES,
  show_features: bool = True,
) -> (HamSpamFeatureImportances | FeatureImportances | None):
  """Optionally displays and returns top features and their importances.

  What features are returned depends on what attributes `model` has.
  Even though clusters and ground truth labels can be expected to not
  match entirely, ham/spam features are still returned based on
  similarity (relevance) calculated via plain accuracy score.

  Args:
    model: A `Scikit-learn` pipeline.
    X: The messages.
    y: The ham/spam labels.
    k: Number of selected top features.
    show_features: Display or not the features.

  Returns:
    Separately `TOP_K_FEATURES` ham and `TOP_K_FEATURES` spam features,
    or *None*.
  """
  transformers = get_transformers(model)
  names = model.named_steps["Features"].get_feature_names_out()
  kmeans = KMeans(n_clusters=2, n_init=10,
                  random_state=misc().random_seed)  # type: ignore
  kmeans.fit(transformers.transform(X))
  svd = model.named_steps["Feature Selector"]["truncatedsvd"]
  top_indices = np.argsort(
    svd.inverse_transform(kmeans.cluster_centers_)
  )[:, ::-1]
  similarity_score = accuracy_score(y, kmeans.labels_)
  print(f"Labels-clusters accuracy={similarity_score:.3f}")
  ham_index, spam_index = ((0, 1)
                           if similarity_score >= 0.5
                           else (1, 0))
  ham_top_k_indices = top_indices[ham_index, :k]
  ham_top_k_names = names[ham_top_k_indices]
  ham_top_k_importances = [np.nan] * len(ham_top_k_indices)
  spam_top_k_indices = top_indices[spam_index, :k]
  spam_top_k_names = names[spam_top_k_indices]
  spam_top_k_importances = [np.nan] * len(spam_top_k_indices)
  if show_features:
    show_feature_importances(ham_top_k_names,
                             ham_top_k_importances,
                             "Most important ham features:")
    show_feature_importances(spam_top_k_names,
                             spam_top_k_importances,
                             "Most important spam features:")
  return (
    list(zip(ham_top_k_names, ham_top_k_importances)),
    list(zip(spam_top_k_names, spam_top_k_importances)),
  )


def has_clusterable_features(model: Pipeline) -> bool:
  if "Feature Selector" in model.named_steps:
    feature_selector = model.named_steps["Feature Selector"]
    return (hasattr(feature_selector, "named_steps")
            and "truncatedsvd" in feature_selector.named_steps)
  else:
    return False


def show_feature_importances(
  feature_names: Iterable[str],
  feature_importances: Iterable[float],
  title: str,
) -> None:
  print(f"\n\n{title}")
  print("-" * 80)
  for importance, name in zip(feature_importances, feature_names):
    print(f"{name} -> {importance:.3f}")

#### Naive Bayes classifier

This section displays the top features and various metrics for `naive_bayes_classifier`, which is the output of `tasks.naive_bayes_task.NaiveBayesTask`.

In [None]:
naive_bayes_builder = NaiveBayesClassifierBuilder()
naive_bayes_classifier = naive_bayes_builder.build(
  NaiveBayesTask().output().path
)
get_top_features(naive_bayes_classifier)
show_scores(naive_bayes_classifier)
naive_bayes_classifier

#### Logistic regression classifier

This section displays the top features and various metrics for `logistic_regression_classifier`, which is the output of `tasks.logistic_regression_task.LogisticRegressionTask`.

In [None]:
logistic_regression_builder = LogisticRegressionClassifierBuilder()
logistic_regression_classifier = logistic_regression_builder.build(
  LogisticRegressionTask().output().path
)
get_top_features(logistic_regression_classifier)
show_scores(logistic_regression_classifier)
logistic_regression_classifier

#### Decision tree classifier

This section displays the top features and various metrics for `decision_tree_classifier`, which is the output of `tasks.decision_tree_task.DecisionTreeTask`.

The decision tree structure is also shown as a text export.

In [None]:
def visualize_tree(model: Pipeline) -> None:
  transformers = get_transformers(model)
  predictor = get_predictor(model)
  print(
    sk_tree.export_text(
      predictor,
      feature_names=transformers.get_feature_names_out(),
      class_names=["Ham", "Spam"],  # type: ignore
      max_depth=predictor.tree_.max_depth,  # type: ignore
    )
  )


decision_tree_builder = DecisionTreeClassifierBuilder()
decision_tree_classifier = decision_tree_builder.build(
  DecisionTreeTask().output().path
)
get_top_features(decision_tree_classifier)
show_scores(decision_tree_classifier)
visualize_tree(decision_tree_classifier)
decision_tree_classifier

#### SVM classifiers

##### Linear SVM classifier

This section displays the top features and various metrics for `linear_svm_classifier`, which is the output of `tasks.linear_svm_task.LinearSvmTask`.

In [None]:
linear_svm_builder = LinearSvmClassifierBuilder()
linear_svm_classifier = linear_svm_builder.build(
  LinearSvmTask().output().path
)
linear_svm_top_features = get_top_features(linear_svm_classifier)
show_scores(linear_svm_classifier)
linear_svm_classifier

##### RBF SVM classifier

This section displays various metrics for `rbf_svm_classifier`, which is the output of `tasks.rbf_svm_task.RbfSvmTask`.

In [None]:
rbf_svm_builder = RbfSvmClassifierBuilder()
rbf_svm_classifier = rbf_svm_builder.build(
  RbfSvmTask().output().path
)
show_scores(rbf_svm_classifier)
rbf_svm_classifier

##### Polynomial SVM classifier

This section displays various metrics for `poly_svm_classifier`, which is the output of `tasks.poly_svm_task.PolySvmTask`.

In [None]:
poly_svm_builder = PolySvmClassifierBuilder()
poly_svm_classifier = poly_svm_builder.build(
  PolySvmTask().output().path
)
show_scores(poly_svm_classifier)
poly_svm_classifier

#### Bagging classifiers

##### Random forest classifier

This section displays the top features and various metrics for `random_forest_classifier`, which is the output of `tasks.random_forest_task.RandomForestTask`.

In [None]:
random_forest_builder = RandomForestClassifierBuilder()
random_forest_classifier = random_forest_builder.build(
  RandomForestTask().output().path
)
get_top_features(random_forest_classifier)
show_scores(random_forest_classifier)
random_forest_classifier

##### Extra-trees classifier

This section displays the top features and various metrics for `extra_trees_classifier`, which is the output of `tasks.extra_trees_task.ExtraTreesTask`.

In [None]:
extra_trees_builder = ExtraTreesClassifierBuilder()
extra_trees_classifier = extra_trees_builder.build(
  ExtraTreesTask().output().path
)
get_top_features(extra_trees_classifier)
show_scores(extra_trees_classifier)
extra_trees_classifier

#### Boosting classifiers

##### Adaptive boosting classifier

This section displays the top features and various metrics for `ada_boost_classifier`, which is the output of `tasks.ada_boost_task.AdaBoostTask`.

In [None]:
ada_boost_builder = AdaBoostClassifierBuilder()
ada_boost_classifier = ada_boost_builder.build(
  AdaBoostTask().output().path
)
get_top_features(ada_boost_classifier)
show_scores(ada_boost_classifier)
ada_boost_classifier

##### Gradient boosting classifier

This section displays the top features and various metrics for `gradient_boosting_classifier`, which is the output of `tasks.gradient_boosting_task.GradientBoostingTask`.

In [None]:
gradient_boosting_builder = GradientBoostingClassifierBuilder()
gradient_boosting_classifier = gradient_boosting_builder.build(
  GradientBoostingTask().output().path
)
get_top_features(gradient_boosting_classifier)
show_scores(gradient_boosting_classifier)
gradient_boosting_classifier

#### Voting classifier

This section displays various metrics for `voting_classifier`, which is the output of `tasks.voting_task.VotingTask`.

In [None]:
voting_builder = VotingClassifierBuilder()
voting_classifier = voting_builder.build(
  VotingTask().output().path
)
show_scores(voting_classifier)
voting_classifier

#### Stacking classifier

This section displays various metrics for `stacking_classifier`, which is the output of `tasks.stacking_task.StackingTask`.

In [None]:
stacking_builder = StackingClassifierBuilder()
stacking_classifier = stacking_builder.build(
  StackingTask().output().path
)
show_scores(stacking_classifier)
stacking_classifier

### Best BoW classifier

This section displays various metrics for `best_bow_classifier`, which is the output of `tasks.best_bow_task.BestBowTask`.

Another `tasks.best_bow_task.BestBowTask` output is also displayed: the scores of all `BoW` classifiers that competed with `best_bow_classifier`.

In [None]:
bow_classifier_scores = plt.imread(
  BestBowTask().output()["bow_classifier_scores"].path
)
plt.figure(figsize=(12, 10))
plt.xticks([])
plt.yticks([])
plt.imshow(bow_classifier_scores)
plt.show()

best_bow_builder = TextClassifierBuilder()
best_bow_classifier = best_bow_builder.build(
  BestBowTask().output()["best_bow_classifier"].path
)
show_scores(best_bow_classifier)
best_bow_classifier

## BERT approach

This section, together with its subsections, contains research that is specific to the `BERT` approach.

### Data preprocessing

To compare the best `BoW` model with a `BERT` model, `X_test` is converted via `bert_task.get_bow_vocabulary` and `bert_task.get_bow_dataset` to `bert_test_df`, a `pandas.DataFrame` based on `best_bow_classifier`'s vocabulary.

In [None]:
if has_clusterable_features(best_bow_classifier):
  raise NotImplementedError(
    "No TF-IDF features after dimensionality reduction for clustering."
  )

bow_vocabulary = bert_task.get_bow_vocabulary(best_bow_classifier)
bert_test_df = bert_task.get_bow_dataset(
  best_bow_classifier, bow_vocabulary,
  X_test, y_test,
)

### Evaluation

In order to support comparisons between `BoW` and `BERT` classifiers, in this section:
- the already trained `bert_classifier` is loaded via `transformers.BertForSequenceClassification.from_pretrained`;
- to generate predictions from `bert_classifier`, `bert_test_df` is converted to `datasets.Dataset` format and further processed by `bert_tokenizer`.

Various metrics for `bert_classifier` are then displayed based on the output of `get_bert_predictions`: classification report, confusion matrix and precision-recall curve.
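
`get_bert_predictions` turns the two logits that the classifier returns per message into a spam probability with a softmax and then thresholds that probability at 0.5. The plain-Python sketch below shows the same arithmetic without `PyTorch`; it assumes, as the code cell does, that index 1 holds the spam logit, and the logit values are invented.

```python
import math


def spam_probability(logits: tuple[float, float]) -> float:
  """Softmax over (ham, spam) logits; returns the spam probability."""
  peak = max(logits)  # Subtracting the max keeps exp() numerically stable.
  exps = [math.exp(value - peak) for value in logits]
  return exps[1] / sum(exps)


batch_logits = [(2.0, -1.0), (0.0, 0.0), (-0.5, 3.0)]
probabilities = [spam_probability(logits) for logits in batch_logits]
predictions = [int(p >= 0.5) for p in probabilities]
print(predictions)  # [0, 1, 1]
```

The probabilities, rather than the thresholded labels, are what the precision-recall curve needs, since the curve is traced by varying the threshold.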

In [None]:
def get_bert_predictions(
  bert_classifier: BertForSequenceClassification,
  bert_tokenizer: BertTokenizer,
  X: datasets.Dataset,
) -> tuple[torch.Tensor, torch.Tensor]:
  inputs = bert_tokenizer(list(X["message"]),
                          padding="max_length",
                          truncation=True,
                          return_tensors="pt")
  with torch.no_grad():
    bert_prob_predictions = torch.softmax(
      bert_classifier(**inputs).logits, dim=1
    )[:, 1]
    bert_predictions = (bert_prob_predictions >= 0.5).type(torch.int32)
  return bert_prob_predictions, bert_predictions

bert_model_path = bert_task.BertTask().output()["bert_classifier_model"].path
bert_classifier = BertForSequenceClassification.from_pretrained(
  bert_model_path
)
bert_tokenizer = BertTokenizer.from_pretrained(bert_model_path)
X_bert_test = datasets.Dataset.from_pandas(bert_test_df)
bert_prob_predictions, bert_predictions = get_bert_predictions(
  bert_classifier, bert_tokenizer, X_bert_test
)
classification_report_str = classification_report(
  y_test, bert_predictions,
  labels=[0, 1], target_names=["Ham", "Spam"],
  digits=3, zero_division=np.nan,  # type: ignore
)
print("\n\nBERT:")
print("-" * 80)
print(classification_report_str)
_, (ax_cm, ax_prc) = plt.subplots(2, 1, figsize=(8, 12))
show_confusion_matrix(y_test, bert_predictions,
                      "BERT", ax_cm)
show_precision_recall_curve(y_test, bert_prob_predictions,
                            "BERT", ax_prc)
plt.show()

## Comparison of BoW and BERT models

The best `BoW` classifier and the `BERT` classifier are compared based on various criteria such as explainability or kinds of prediction errors they make.

### Explainability

`bert_classifier` does not have an equivalent to the `get_top_features` function to indicate predictive word-tokens for either the ham or the spam category. However, the `get_last_hidden_state` function outputs word-token `hidden states`, which should have `cosine similarity` close to 1 for words that are used together within a particular context, for example in ham messages or in spam messages.

In the example below, applying `sklearn.metrics.pairwise.cosine_similarity` to the `hidden states` of one of `linear_svm_classifier`'s top ham features and one of its top spam features should yield a value *far* from 1, while two top ham features or two top spam features should have `cosine similarity` *close* to 1. Of course, the quality of the results depends on how well `linear_svm_classifier` and `bert_classifier` have learned from the data, and on the data itself.

How can this technique support explainability in `BERT`? Even without a clear indication of spammy words, messages can, for example, be clustered by `cosine similarity`, with similar values indicating the same category (ham or spam). Such clustering can help with predictions or label spreading on new message data.
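
Separately from the demo cell that follows, the label-spreading idea can be sketched on toy data: each unlabeled vector takes the label of its most cosine-similar labeled vector. The 2-D vectors stand in for message representations, and the names `nearest_label`, `labeled_vectors`, and `new_vectors` are hypothetical; this is a simplified nearest-neighbour scheme, not the method used elsewhere in this notebook.

```python
import math


def cosine_sim(a, b):
  """Cosine similarity of two equal-length, non-zero vectors."""
  dot = sum(x * y for x, y in zip(a, b))
  return dot / (math.sqrt(sum(x * x for x in a))
                * math.sqrt(sum(y * y for y in b)))


def nearest_label(vector, labeled):
  """Return the label of the most cosine-similar labeled vector."""
  best_label, _ = max(
    ((label, cosine_sim(vector, ref)) for ref, label in labeled),
    key=lambda pair: pair[1],
  )
  return best_label


# Hypothetical toy values standing in for message representations.
labeled_vectors = [
  ([0.9, 0.1], "ham"),
  ([0.8, 0.3], "ham"),
  ([0.1, 0.9], "spam"),
  ([0.2, 0.7], "spam"),
]
new_vectors = [[0.7, 0.2], [0.15, 0.8]]

spread_labels = [nearest_label(v, labeled_vectors) for v in new_vectors]
print(spread_labels)  # ['ham', 'spam']
```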

In [None]:
def cosine_similarity_demo(
  bert_classifier: BertForSequenceClassification,
  bert_tokenizer: BertTokenizer,
  bow_classifier_top_features: HamSpamFeatureImportances,
) -> None:
  top_ham_features = [name.split("TF-IDF__")[1]  # N-grams (n > 1) skipped.
                      for name, _ in bow_classifier_top_features[0]
                      if name.startswith("TF-IDF__")
                         and all(not ch.isspace() for ch in name)]
  top_ham_feature_reprs = [get_last_hidden_state(
                             bert_classifier,
                             bert_tokenizer,
                             feature,
                           )
                           for feature in top_ham_features]

  top_spam_features = [name.split("TF-IDF__")[1]  # N-grams (n > 1) skipped.
                       for name, _ in bow_classifier_top_features[1]
                       if name.startswith("TF-IDF__")
                          and all(not ch.isspace() for ch in name)]
  top_spam_feature_reprs = [get_last_hidden_state(
                              bert_classifier,
                              bert_tokenizer,
                              feature,
                            )
                            for feature in top_spam_features]

  ham_feature_data = list(zip(top_ham_features, top_ham_feature_reprs,
                              ["ham"] * len(top_ham_features)))
  spam_feature_data = list(zip(top_spam_features, top_spam_feature_reprs,
                               ["spam"] * len(top_spam_features)))
  feature_data = ham_feature_data + spam_feature_data

  output = None
  for f_data_a, f_data_b in itertools.combinations(feature_data, r=2):
    f_a, f_repr_a, f_tag_a = f_data_a
    f_b, f_repr_b, f_tag_b = f_data_b
    try:
      similarity = cosine_similarity(f_repr_a, f_repr_b)
      output = f"{similarity.ravel()[0]:.3f}"
    except ValueError as e:
      output = f"'{e}'"
    print(f"cosine_similarity( "
            f"{f_tag_a}({f_a}), "
            f"{f_tag_b}({f_b}) ) => {output}")


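def get_last_hidden_state(
  bert_classifier: BertForSequenceClassification,
  bert_tokenizer: BertTokenizer,
  feature: str,
) -> np.ndarray:
  """Return the flattened last-layer hidden states for `feature`.

  The input is padded to the tokenizer's maximum length, so the result is a
  single row that also includes the hidden states of the padding positions.
  """
  inputs = bert_tokenizer(feature,
                          padding="max_length",
                          truncation=True,
                          return_tensors="pt")
  with torch.no_grad():
    last_hidden_state = bert_classifier(
      **inputs,
      output_hidden_states=True
    ).hidden_states[-1].numpy().reshape(1, -1)
  return last_hidden_state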


cosine_similarity_demo(bert_classifier, bert_tokenizer,
                       linear_svm_top_features)  # type: ignore

### Prediction errors

[RNN](#recurrent-neural-network) models suffer from the [vanishing gradients](#vanishing-gradients) problem, which can in principle affect any deep architecture and makes long sequences hard to learn. [BERT](#bert) does not process text recurrently, but it has its own limitation with long texts: the input is limited to a fixed maximum number of tokens, so longer texts are truncated and shorter ones are padded. Long texts are generally not an issue for `BoW` models (leaving aside high-dimensionality concerns).

The example below demonstrates how `bert_classifier` mispredicts labels and is outperformed by `best_bow_classifier` on many test set messages, mostly ones that are roughly longer than `bert_preprocessor.tokenizer.model_max_length`.
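
To illustrate why truncation can hurt, here is a minimal sketch of padding and truncating to a fixed length. It assumes whitespace tokenization and a hypothetical `pad_or_truncate` helper with a toy `max_length` of 6; the real `BertTokenizer` splits text into WordPiece tokens and adds special tokens such as `[CLS]` and `[SEP]`, so actual lengths differ.

```python
def pad_or_truncate(tokens, max_length, pad_token="[PAD]"):
  """Cut `tokens` to `max_length` items, or right-pad them up to it."""
  kept = tokens[:max_length]
  return kept + [pad_token] * (max_length - len(kept))


max_length = 6  # Toy value; real models use a much larger limit.
short_message = "free prize".split()
long_message = (
  "hello just checking in about lunch tomorrow "
  "claim your free prize now"
).split()

short_encoded = pad_or_truncate(short_message, max_length)
long_encoded = pad_or_truncate(long_message, max_length)
dropped = long_message[max_length:]

print(short_encoded)  # The two words followed by four pad tokens.
print(long_encoded)   # Only the first six words survive.
print(dropped)        # The spam-like ending never reaches the model.
```

In this toy case the words that would suggest spam sit entirely in the truncated tail, whereas a `BoW` representation would still count them.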

In [None]:
def prediction_errors_demo(
  X_bow: pd.Series = X_test,
  X_bert: datasets.Dataset = X_bert_test,
  y: pd.Series = y_test,
) -> None:
  bow_predictions = best_bow_classifier.predict(X_bow)
  _, bert_predictions = get_bert_predictions(
    bert_classifier, bert_tokenizer, X_bert
  )
  bow_scored, bert_scored = 0, 0
  bow_scored_ids, bert_scored_ids = [], []
  bow_scored_sizes, bert_scored_sizes = [], []
  for i, (bow_pred, bert_pred) in enumerate(zip(bow_predictions,
                                                bert_predictions)):
    message, is_spam = X_bow.iloc[i], y.iloc[i]
    if bow_pred != bert_pred:
      size = len(message.lower().split())
      if is_spam == bow_pred:
        bow_scored += 1
        bow_scored_ids += [i]
        bow_scored_sizes += [size]
      else:
        bert_scored += 1
        bert_scored_ids += [i]
        bert_scored_sizes += [size]

  plt.scatter(bow_scored_ids, bow_scored_sizes,
              label=f"BoW({bow_scored})")
  plt.scatter(bert_scored_ids, bert_scored_sizes,
              label=f"BERT({bert_scored})")
  plt.xlabel("Message ID")
  plt.ylabel("Message size")
  plt.title("Correctly predicted test message labels: BoW vs. BERT")
  plt.legend()
  plt.show()


prediction_errors_demo()

## Further development and testing

Based on the `BoW`-`BERT` comparative analysis so far, `BoW` text classification models raise fewer concerns about their applicability to ham/spam classification. These `BoW` models inherently support very long texts and offer clear means of explainability, such as the granularity of the features and the feature importances.

Nevertheless, `BERT` models have unique strengths of their own, rooted in context-based interpretation, which enables the use of similarity measures such as `cosine similarity` and can also provide better prediction `accuracy` by "naturally" covering complex language use cases.

Overall, as the examples in this work also show, `BoW` and `BERT` models complement each other in many ways, which suggests applying some *voting* scheme to leverage predictions or generated data from both kinds of classifiers.
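
One possible form of such a voting scheme is soft voting, sketched below under the assumption that both classifiers can output a spam probability (some `BoW` predictors, such as a linear SVM, may not expose `predict_proba`). The function name `soft_vote`, the probabilities, and the weights are hypothetical and are not results from this notebook.

```python
def soft_vote(spam_probas, weights=None):
  """Weighted average of per-classifier spam probabilities."""
  if weights is None:
    weights = [1.0] * len(spam_probas)
  total = sum(weights)
  return sum(w * p for w, p in zip(weights, spam_probas)) / total


# Hypothetical probabilities for a single message.
bow_spam_proba = 0.30
bert_spam_proba = 0.90

# Equal weights: the confident BERT vote tips the result to spam.
combined = soft_vote([bow_spam_proba, bert_spam_proba])
is_spam = combined >= 0.5

# Trusting the BoW model three times as much flips the decision.
combined_bow_heavy = soft_vote([bow_spam_proba, bert_spam_proba], [3.0, 1.0])
is_spam_bow_heavy = combined_bow_heavy >= 0.5

print(combined, is_spam, combined_bow_heavy, is_spam_bow_heavy)
```

The weights could, for instance, reflect each model's validation performance or depend on message length, given the length-related errors discussed above.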

The sections below demonstrate how `best_bow_classifier` and `bert_classifier` can be tested with custom messages, both inline in this notebook and via a user interface (UI) built with [Gradio](#gradio).

### Custom message test - inline

In [None]:
custom_message_data = [
  ("this is a test", 0),
  ("top offer", 1),
]
custom_messages, custom_labels = zip(*custom_message_data)
print("best_bow_classifier.score = "
      f"{best_bow_classifier.score(custom_messages, custom_labels)}")
_, bert_custom_predictions = get_bert_predictions(
  bert_classifier, bert_tokenizer,
  datasets.Dataset.from_dict({
    "message": custom_messages
  })
)
# Fraction of custom messages whose predicted label matches the expected one.
bert_custom_score = np.mean(
  bert_custom_predictions.numpy().ravel() == np.array(custom_labels)
)
print(f"bert_classifier.score = {bert_custom_score}")

### Custom message test - UI

In [None]:
with ui.Blocks() as demo:
  with ui.Row():
    txt_message = ui.Textbox(label="message")
    txt_spam_proba = ui.Textbox(label="spam probability")

  with ui.Row():
    btn_check = ui.Button("check")
    if hasattr(best_bow_classifier, "predict_proba"):
      btn_check.click(
        fn=lambda m: best_bow_classifier.predict_proba([m]).ravel()[1].round(3),
        inputs=txt_message,
        outputs=txt_spam_proba,
      )
    else:
      btn_check.click(
        fn=lambda m: best_bow_classifier.predict([m]).ravel()[0].round(3),
        inputs=txt_message,
        outputs=txt_spam_proba,
      )
demo.launch(share=False)

## References

<br><br>

### APA style for references
American Psychological Association. (2022). Creating an APA Style reference list guide. https://apastyle.apa.org/instructional-aids/creating-reference-list.pdf

American Psychological Association. (2024). APA Style common reference examples guide. https://apastyle.apa.org/instructional-aids/reference-examples.pdf

<br><br>

### Feature extraction methods
##### TF-IDF
- [TF–IDF - Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
##### Feature engineering
- [Feature engineering - Wikipedia](https://en.wikipedia.org/wiki/Feature_engineering)
##### Word embedding
- [Word embedding - Wikipedia](https://en.wikipedia.org/wiki/Word_embedding)

<br><br>

### Machine learning methods
##### Naive Bayes
- [Naive Bayes classifier - Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
##### Logistic regression
- [Logistic regression - Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression)
##### Decision tree
- [Decision tree learning - Wikipedia](https://en.wikipedia.org/wiki/Decision_tree_learning)
##### Support vector machine
- [Support vector machine - Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine)
##### Ensemble learning
- [Ensemble learning - Wikipedia](https://en.wikipedia.org/wiki/Ensemble_learning)
##### Recurrent neural network
- [Recurrent neural network - Wikipedia](https://en.wikipedia.org/wiki/Recurrent_neural_network)
##### Attention
- [Attention (machine learning) - Wikipedia](https://en.wikipedia.org/wiki/Attention_(machine_learning))

<br><br>

### Machine learning models
##### Bag-of-words
- [Bag-of-words model - Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)
##### BERT
Turc, I., Chang, M. W., Lee, K., & Toutanova, K. (2019). Well-read students learn better: On the importance of pre-training compact models. *arXiv preprint arXiv:1908.08962v2*. https://doi.org/10.48550/arXiv.1908.08962
- [BERT (language model) - Wikipedia](https://en.wikipedia.org/wiki/BERT_(language_model))

<br><br>

### Machine learning problems
##### Vanishing gradients
- [Vanishing gradient problem - Wikipedia](https://en.wikipedia.org/wiki/Vanishing_gradient_problem)

<br><br>

### Datasets
##### SpamAssassin public mail corpus
- [Index of /old/publiccorpus](https://spamassassin.apache.org/old/publiccorpus/)
##### SMS Spam Collection - UCI Machine Learning Repository
Almeida, T., & Hidalgo, J. (2011). SMS Spam Collection [Dataset]. *UCI Machine Learning Repository*. https://doi.org/10.24432/C5CC84
- [SMS Spam Collection - UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/)

<br><br>

### Guides and tutorials
- [Classification of text documents using sparse features — scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#analysis-of-a-bag-of-words-document-classifier)
- [Clustering text documents using k-means — scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#k-means-clustering-on-text-features)
- [Text classification with an RNN  |  TensorFlow](https://www.tensorflow.org/text/tutorials/text_classification_rnn)
- [Classify text with BERT  |  Text  |  TensorFlow](https://www.tensorflow.org/text/tutorials/classify_text_with_bert)
- [google/bert_uncased_L-2_H-128_A-2 · Hugging Face](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2)
- [Text classification · Hugging Face](https://huggingface.co/docs/transformers/tasks/sequence_classification)
- [Trainer · Hugging Face](https://huggingface.co/docs/transformers/trainer)

<br><br>

### Libraries
##### Gradio
Abid, A., Abdalla, A., Abid, A., Khan, D., Alfozan, A., & Zou, J. (2019). Gradio: Hassle-free sharing and testing of ML models in the wild [Computer software]. https://doi.org/10.48550/arXiv.1906.02569
- [Quickstart](https://www.gradio.app/guides/quickstart)
##### Hugging Face Transformers
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., & Rush, A. M. (2020). Transformers: State-of-the-Art Natural Language Processing [Conference paper]. 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
- [Transformers](https://huggingface.co/docs/transformers/index)
##### Keras
Chollet, F., & others. (2015). Keras. https://keras.io
- [Getting started with Keras](https://keras.io/getting_started/#tensorflow--keras-2-backwards-compatibility)
##### Matplotlib
Hunter, J. D. (May-June 2007). Matplotlib: A 2D Graphics Environment. *Computing in Science & Engineering*, *9*(3), 90-95. https://doi.org/10.1109/MCSE.2007.55
- [Quick start guide](https://matplotlib.org/stable/users/explain/quick_start.html)
##### NLTK
Bird, S., Klein, E., & Loper, E. (2009). *Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit*. O'Reilly Media, Inc. https://www.nltk.org/book/
- [NLTK :: Natural Language Toolkit](https://www.nltk.org/)
##### Numpy
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., ... Oliphant, T. E. (2020). Array programming with NumPy. *Nature*, *585*, 357–362. https://doi.org/10.1038/s41586-020-2649-2
- [What is NumPy?](https://numpy.org/doc/2.2/user/whatisnumpy.html)
##### Pandas
The pandas development team. pandas-dev/pandas: Pandas [Computer software]. https://doi.org/10.5281/zenodo.3509134
- [Getting started — pandas](https://pandas.pydata.org/docs/getting_started/index.html#intro-to-pandas)
##### PyTorch
Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A., Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., Hirsh, B., Huang, S., Kalambarkar, K., Kirsch, L., Lazos, M., Lezcano, M., Liang, Y., Liang, J., Lu, Y., Luk, C., Maher, B., Pan, Y., Puhrsch, C., Reso, M., Saroufim, M., Siraichi, M. Y., Suk, H., Suo, M., Tillet, P., Wang, E., Wang, X., Wen, W., Zhang, S., Zhao, X., Zhou, K., Zou, R., Mathews, A., Chanan, G., Wu, P., & Chintala, S. (2024). PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation [Conference paper]. 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24). https://doi.org/10.1145/3620665.3640366
- [Start Locally | PyTorch](https://pytorch.org/get-started/locally)
##### Scikit-learn
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. *Journal of Machine Learning Research*, *12*, 2825–2830. https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
- [Getting Started — scikit-learn](https://scikit-learn.org/stable/getting_started.html)
##### TensorFlow
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jozefowicz, R., Jia, Y., Kaiser, L., Kudlur, M., ... Zheng, X. (2015). TensorFlow, Large-scale machine learning on heterogeneous systems [Computer software]. https://doi.org/10.5281/zenodo.4724125
- [Introduction to TensorFlow](https://www.tensorflow.org/learn)

<br><br>

### Tools
##### Luigi
- [Getting Started — Luigi](https://luigi.readthedocs.io/en/stable/)
##### Optuna
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework [Conference paper]. *Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining*, 2623–2631. https://doi.org/10.1145/3292500.3330701
- [Optuna: A hyperparameter optimization framework](https://optuna.readthedocs.io/en/stable/)