## **Imports**

In [None]:
!pip uninstall -y -q datasets
!pip install -q --no-deps "transformers==4.52.4"
!pip install -q --no-deps "datasets>=2.19.2"
!pip install -q evaluate accelerate sklearn-compat pyarrow sentencepiece tqdm seaborn matplotlib

In [25]:
import pandas as pd
import numpy as np
import torch
from torch.utils.data import DataLoader
from torch import nn
import torch.nn.functional as F
import random
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from transformers import (
    BertTokenizer, BertForSequenceClassification, BertModel,
    Trainer, TrainingArguments, EarlyStoppingCallback, DataCollatorWithPadding,
)
from datasets import Dataset
import re
import paths_ml_task_1 as PATHS
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from sklearn.preprocessing import LabelEncoder
import os
from pathlib import Path
from tqdm.notebook import tqdm  # notebook-friendly progress bars
from transformers.modeling_outputs import SequenceClassifierOutput
#from google.colab import files

# **Experiment: Evaluating the Impact of Preprocessing Strategies on Classification Performance**

**TL;DR:** In this experiment, we systematically assess how varying levels of text normalization affect classification performance for both SVMs and BERT models. To ensure fairness and comparability, all models are trained and evaluated on the *same set of 1,000 speeches*, each identified via a unique speech ID across preprocessing variants. We find that preprocessing plays a pivotal — yet model-dependent — role in downstream accuracy and class balance. The unprocessed dataset (`data_set_7`) serves as a neutral baseline for all comparisons.


---


## **Context and Objective**

- This experiment complements earlier findings on [vectorization](ML-Task-1_Vectorization-Experiment.ipynb) and [speech length](ML-Task-1_Speech-Length-Experiment.ipynb), offering a structured evaluation of how various levels of text normalization (e.g., lowercasing, stopword removal) may influence downstream classification models' performance.

- Again, our objective at this stage is **not primarily to maximize model performance**, but to perform an **evidence-based trend and impact analysis** that quantifies the influence of linguistic normalization decisions on text classification performance by isolating the effect of the preprocessing pipeline.


---


## **Generate DataSets using the [Preprocessing Pipeline](../dataPreprocessingHelpers/preprocessing_pipeline.py)**


We apply a structured and modular preprocessing pipeline that allows for fine-grained control over normalization levels. Each stage introduces a specific linguistic transformation:

- **Level 1 – Minimal Processing:** Lowercasing, punctuation removal, and Unicode normalization
- **Level 2 – Token and Word Filtering:** Stopword removal
- **Level 3 – Linguistic Normalization:** Lemmatization and/or stemming via spaCy and NLTK
- **Level 4 – Extended Cleaning:** Removal of custom domain-specific formulaic phrases and stopwords

Each resulting dataset is labeled `data_set_<n>` (`data_set_1` through `data_set_9`, see the table below). `data_set_7` corresponds to the **original, unprocessed speeches** and serves as a **neutral baseline** for model performance comparisons.


- The datasets for the comparison are generated from the Bundestag speech data (`speech_content` column) with the **[Preprocessing Pipeline](../dataPreprocessingHelpers/preprocessing_pipeline.py)** at the normalization levels listed above.
- The pipeline is heavily parallelized: it uses all available CPU cores and can process multiple independent dataset configurations at a time via Python’s multiprocessing backend. For this experiment, several dataset configurations of varying levels are prepared in [Preprocessing_ML-Task-1_Classification.py](Preprocessing_ML-Task-1_Classification.py).


### **Data Set Configurations:**

| Name       | Model    | Lowercase | Remove digits | Remove punctuation | Remove stopwords | Remove domain-specific stopwords | Phrase pattern removal | Contribution Mode | Merge factions      | Remove faction  | Remove group                               |
|------------|----------|-----------|---------------|--------------------|------------------|----------------------------------|------------------------|-------------------|---------------------|-----------------|--------------------------------------------|
| data_set_1 | SVM      | Yes       | Yes           | Yes                | Yes              | Yes                              | Yes                    | REMOVE            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |
| data_set_2 | SVM      | Yes       | Yes           | Yes                | Yes              | No                               | Yes                    | REMOVE            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |
| data_set_3 | SVM      | Yes       | No            | No                 | No               | No                               | No                     | INSERT            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |
| data_set_6 | SVM      | Yes       | No            | No                 | Yes              | No                               | No                     | INSERT            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |
| data_set_4 | BERT     | No        | Yes           | Yes                | Yes              | Yes                              | Yes                    | REMOVE            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |
| data_set_5 | BERT     | No        | No            | No                 | Yes              | No                               | No                     | INSERT            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |
| data_set_7 | baseline | No        | No            | No                 | No               | No                               | No                     | INSERT            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |
| data_set_8 | none     | Yes       | No            | No                 | No               | No                               | Yes                    | REMOVE            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |
| data_set_9 | BERT     | No        | No            | No                 | No               | No                               | Yes                    | REMOVE            | BSW into DIE LINKE. | `faction_id=-1` | `position_short="Presidium of Parliament"` |

> - **Note:** All configuration dictionaries and the processing initialization are kept [separately](Preprocessing_ML-Task-1_Classification.py), because Python’s multiprocessing backend caused problems when run from within Jupyter.
> - Configuration details shared across all datasets:
>   `only_valid_faction_id=True`,
>   `position_short=["Presidium of Parliament", "Guest"]`,
>   `tokenization=True`, `lemmatization=True`, `stemming=True`,
>   `add_char_count=True`, `add_token_count=True`, `add_lemma_count=True`


### **Input:**
```
data/
├── dataFinalStage/
│   ├── speechContentFinalStage/
│   │   ├── speech_content_19.pkl
│   │   └── speech_content_20.pkl
```


### **Output:**
```
data/
├── dataPreprocessedStage/
│   ├── dataClassification/
│   │   ├── dataSets/
│   │   │   ├── data_set_<1-9>_19_20.xlsx
│   │   │   ├── data_set_<1-9>_19.xlsx
│   │   │   ├── data_set_<1-9>_20.xlsx
│   │   │   ├── data_set_<1-9>_19_20.pkl
│   │   │   ├── data_set_<1-9>_19.pkl
│   │   │   └── data_set_<1-9>_20.pkl
```


---


## **Methodology**
- **Sampling Strategy:**
    - Initial length-based filtering keeps only the central 65 % of the speech length distribution: speeches below the 25th percentile (shortest) and above the 90th percentile (longest) are excluded.
    - To ensure comparability, all models are trained and evaluated on the exact same subset of speeches: we randomly select 1,000 speeches by their unique id and reuse these ids for every dataset variant.
- **Train-Test Split:** A single stratified 80/20 split (also based on the speech ids) with a fixed random seed for reproducibility is used for **all models**. The trained models are evaluated on the 20 % hold-out set. For BERT, this hold-out set is further divided into a validation part (20 %) for per-epoch evaluation and a test part (80 %).
- **Model Configurations:**
    - **Linear Support Vector Machine** (`LinearSVC` classifier from scikit-learn with a fixed `random_state=42`):
        - **TF-IDF Vectorizer Settings:**
            - `ngram_range=(1, 3)` → unigrams to trigrams
            - `max_features=50000`
            - `min_df=3` → removes very rare words
            - `max_df=0.5` → removes overly frequent, generic words
    - **BERT model:** The pretrained `bert-base-german-cased` model from HuggingFace Transformers is used as the encoder. Its outputs are passed to a custom linear classification head via mean pooling: the token embeddings are averaged, weighted by the attention mask, to obtain one embedding per speech.
        - **Tokenizer:** `BertTokenizer` from HuggingFace, configured with `truncation=True` and `max_length=512` to enforce a hard input limit of 512 tokens. Padding is applied dynamically per batch by `DataCollatorWithPadding`.
        - **Training Hyperparameters:**
          - Epochs: 3
          - Batch size: 8
          - Learning rate: 1e-5
          - Evaluation strategy: `epoch`
          - Evaluation and prediction steps use HuggingFace's Trainer API
          - Optimizer: AdamW (implicitly used by the Trainer)
          - Fixed `random_seed` (`SEED = 42`)
    - All training procedures are conducted with **fixed random seeds** and **constant hyperparameters** across runs to ensure experimental control.
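
The sampling and split steps described above can be sketched as follows. This is a minimal illustration on synthetic data and not the code in `preprocessing_experiment_sampling_strategy.py`: the helper name `sample_and_split` and the toy DataFrame are assumptions, while the column names and thresholds mirror the constants used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def sample_and_split(df, n_samples=1000, lower=0.25, upper=0.90,
                     test_size=0.2, seed=42):
    """Hypothetical helper: length filter, id sampling, stratified split."""
    # Keep only speeches whose length lies between the two percentiles
    lo, hi = df["speech_length_chars"].quantile([lower, upper])
    kept = df[df["speech_length_chars"].between(lo, hi)]

    # Draw a fixed-size random sample of speeches
    sampled = kept.sample(n=min(n_samples, len(kept)), random_state=seed)

    # Stratify on the party label so both splits share the class distribution
    train_ids, test_ids = train_test_split(
        sampled["id"], test_size=test_size,
        stratify=sampled["faction_id"], random_state=seed,
    )
    return set(train_ids), set(test_ids)


# Synthetic stand-in for one preprocessed dataset (not real speech data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "id": np.arange(5000),
    "faction_id": rng.integers(0, 6, size=5000),
    "speech_length_chars": rng.integers(50, 10000, size=5000),
})
train_ids, test_ids = sample_and_split(toy)
print(len(train_ids), len(test_ids))
```

Because the function returns id sets rather than rows, the same split can be applied to every preprocessing variant by filtering each dataset on these ids.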


---


## **Evaluation**


To assess model behavior under varying preprocessing conditions, we apply the following metrics for each model-preprocessing combination:
- **Accuracy:** Overall proportion of correctly classified speeches.
- **Macro F1-Score:** Harmonic mean of precision and recall, averaged equally across classes (mitigates class imbalance effects).
- **Confusion Matrix:** Visualizes inter-class confusion patterns and highlights which parties are frequently misclassified.
- **Qualitative Sample Inspection:** Review preprocessed text of correctly and incorrectly classified samples from each variant to evaluate if they are still interpretable or result in “dead”, empty texts.

In line with our evidence-based philosophy, we use model performance on `data_set_7` (no normalization) as a **reference baseline**. All other results are interpreted as **relative gains or losses** in comparison to this unprocessed variant to ultimately identify **which preprocessing steps support — or hinder — learning for each model type**.
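
A comparison against the baseline could be computed along these lines. The helper name `relative_to_baseline` is hypothetical, and the numbers in `example_reports` are invented placeholders, not results of this experiment. The dictionaries are assumed to have the shape returned by scikit-learn's `classification_report(..., output_dict=True)`, including the `accuracy` key.

```python
import pandas as pd


def relative_to_baseline(reports, baseline_key):
    """Hypothetical helper: accuracy and macro-F1 deltas against a baseline run."""
    base = reports[baseline_key]
    rows = []
    for name, rep in reports.items():
        macro_f1 = rep["macro avg"]["f1-score"]
        rows.append({
            "model": name,
            "accuracy": rep["accuracy"],
            "macro_f1": macro_f1,
            # Positive values mean the variant beats the baseline
            "delta_accuracy": rep["accuracy"] - base["accuracy"],
            "delta_macro_f1": macro_f1 - base["macro avg"]["f1-score"],
        })
    return pd.DataFrame(rows).set_index("model")


# Placeholder numbers purely for illustration (NOT measured results)
example_reports = {
    "baseline":  {"accuracy": 0.50, "macro avg": {"f1-score": 0.48}},
    "variant_a": {"accuracy": 0.55, "macro avg": {"f1-score": 0.51}},
}
summary = relative_to_baseline(example_reports, "baseline")
print(summary.round(3))
```

In this notebook, the global `classification_reports` dictionary could be passed in directly, with the key of the `data_set_7` run as `baseline_key`.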


### **Expectations** (our dataset configurations are chosen based on these assumptions)
- TF-IDF models benefit from **aggressive preprocessing** (removing noise, reducing vocabulary size)
- Deep models are **sensitive to aggressive preprocessing**
  → Removing stopwords/punctuation may degrade performance due to **loss of syntactic/semantic structure**


### **Model-Specific Expectations and Interpretive Framework**

It is critical to emphasize that **preprocessing cannot be considered a universally beneficial or detrimental operation**: its effectiveness is strongly model-dependent, owing to fundamental differences in how classical and deep learning architectures encode language.

**Traditional models like SVMs** based on surface-level statistical regularities are highly **sensitive to sparsity, vocabulary size, and noise**. Accordingly, we expect performance gains through aggressive normalization:
- **Strong improvements with heavier preprocessing**, particularly through:
  - Stopword removal → reduces dimensionality without meaningful information loss
  - Lemmatization/stemming → increases token consistency and term frequency
  - Punctuation and casing normalization → reduces spurious feature variance
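
The effect on the sparse feature space can be made concrete with a toy example. The three sentences and the tiny stopword list below are invented for illustration and are not taken from the speech data. The sketch only counts vocabulary entries and says nothing about classification quality.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only)
docs = [
    "Die Regierung hat den Haushalt beschlossen.",
    "die regierung und der haushalt",
    "Der Bundestag debattiert den Haushalt.",
]
stopwords = ["die", "der", "den", "und", "hat"]  # tiny illustrative list


def vocab_size(**kwargs):
    """Number of distinct TF-IDF features learned from the toy documents."""
    return len(TfidfVectorizer(**kwargs).fit(docs).vocabulary_)


raw = vocab_size(lowercase=False)                            # case-sensitive tokens
lowered = vocab_size(lowercase=True)                         # casing merged
filtered = vocab_size(lowercase=True, stop_words=stopwords)  # stopwords dropped
print(raw, lowered, filtered)
```

Each normalization step merges or removes features, which is the mechanism behind the expected gains for TF-IDF-based models.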

**Deep transformer-based models like BERT**, in contrast, are **pretrained on raw, naturalistic language** and exploit **contextual dependencies**, token position, and syntax. Hence:
- Even **minor normalization** like lowercasing could negatively impact a cased model's performance
- **Overly aggressive preprocessing** (e.g., removing stopwords or punctuation) may disrupt syntactic structure, which can negatively affect attention mechanisms and downstream representations.
- Thus, we expect BERT models to perform best on **lightly normalized or even raw inputs**, where the full richness of linguistic cues is preserved.


### **Conclusion**

By systematically varying preprocessing strategies and evaluating their downstream effects, this experiment provides nuanced insight into the compatibility of linguistic normalization with different model architectures. In the future these findings can help us design preprocessing pipelines not by intuition, but by **empirical alignment with model-specific requirements**.


---


## **Idea for future extensions:**

**Named Entity Removal** as an additional option provided by the Preprocessing Pipeline could help reduce personalization or named references and would be interesting to evaluate.

In [5]:
"""
Generating Datasets via the Preprocessing pipeline and splitting data
"""
#!python preprocessing_ML-Task-1_classification.py
!python preprocessing_experiment_sampling_strategy.py


In [None]:
"""
Setup constants
"""
# Set seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
# Models & Tokenizer
MODEL_NAME = "bert-base-german-cased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
LEARNING_RATE = 1e-5
NUM_LABELS = 6  # will later also be adjusted based on the number of unique parties in the dataset
BATCH_SIZE = 8
EPOCHS = 3
MAX_SAMPLE_SIZE_PARTY=1000 # speech counts of parties need to be double checked after filtering
TEST_SPLIT_SIZE=0.2
MAX_FEATURES = 50000
NGRAM_RANGE = (1, 3)
MIN_DF = 3
MAX_DF = 0.5
UPPER_PERCENTILE = 0.9
LOWER_PERCENTILE = 0.25
MAX_LENGTH_BERT = 512
SPEECH_ID_COL: str = "id"
LENGTH_COL: str = "speech_length_chars"
LABEL_COL: str = "faction_id"
ABBR_COL: str = "faction_abbreviation"
# Datasets
SVM_DATASETS  = [7, 1, 2, 3]
BERT_DATASETS = [7, 4, 5, 9]



"""
Global variables
"""
# Global dictionary to store all confusion matrices by model name
confusion_matrices = {}
# Global dictionary to collect classification reports per model
classification_reports = {}
# Load faction map
faction_map = pd.read_pickle(PATHS.FINAL_FACTIONS_ABBREVIATIONS).drop_duplicates(subset="id").set_index("id")["abbreviation"]
# Select the device explicitly: CUDA if available, else MPS (Apple Silicon), else CPU
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)



class BertMeanPoolingClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = output.last_hidden_state # (B, L, H)
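        # Expand the attention mask to the hidden size so padding tokens contribute zero
        mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()  # (B, L, H)
        summed_vectors = (hidden_states * mask).sum(dim=1)   # (B, H)
        valid_tokens   = mask.sum(dim=1)  # (B, H): non-padding token count per sequence
        mean_pooled    = summed_vectors / torch.clamp(valid_tokens, min=1e-9)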
        logits = self.classifier(mean_pooled) # classification
        # loss:
        if labels is not None:
            loss = self.loss_fn(logits, labels)
            return {"loss": loss, "logits": logits}
        else:
            return {"logits": logits}



"""
Tokenization and dataset conversion
"""
def tokenize_function(examples, max_length): # here without padding -> need data collator later
    """
    Tokenizes input text using BertTokenizer with specific max token length.

    :param: examples (dict): Dictionary containing the key "speech_content" with raw text.
    :param: max_length (int): Maximum number of tokens after truncation (padding is added later by the data collator).

    :return: tokenized_output (dict): Dictionary with input_ids and attention masks.
    """
    return tokenizer(examples["speech_content"], truncation=True, max_length=max_length)


def prepare_dataset_BERT(df_subset, max_len):
    """
    Prepares and tokenizes dataset from dataframe for BERT model training.

    :param: df_subset (pd.DataFrame): DataFrame containing speeches and encoded labels.
    :param: max_len (int): Maximum token length.

    :return: df_tokenized (Dataset): Tokenized and formatted dataset compatible with PyTorch.
    """
    df_dataset = Dataset.from_pandas(df_subset[["speech_content", "label_encoded"]])
    df_tokenized = df_dataset.map(lambda x: tokenize_function(x, max_len), batched=True)
    df_tokenized = df_tokenized.rename_column("label_encoded", "labels")

    cols = ["input_ids", "attention_mask", "labels"]
    if "token_type_ids" in df_tokenized.column_names:
        cols.append("token_type_ids")
    df_tokenized.set_format(type="torch", columns=cols)
    return df_tokenized


def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro")
    }



"""
Training Loop for BERT on mean pooling embeddings
"""
def train_bert(train_df : pd.DataFrame, val_df : pd.DataFrame,  test_df : pd.DataFrame, model_name : str):
    """
    Fine-tunes the mean-pooling BERT classifier with per-epoch evaluation and stores
    the confusion matrix and classification report in the global dictionaries.

    :param: train_df (pd.DataFrame): training data.
    :param: val_df (pd.DataFrame): validation data.
    :param: test_df (pd.DataFrame): test data.
    :param: model_name (str): Identifier for logging and visualizations.

    :return: None
    """
    # prep datasets
    train_ds = prepare_dataset_BERT(train_df, MAX_LENGTH_BERT)
    val_ds = prepare_dataset_BERT(val_df, MAX_LENGTH_BERT)
    test_ds = prepare_dataset_BERT(test_df, MAX_LENGTH_BERT)

    # Initialize BERT model and trainer and move to device
    model = BertMeanPoolingClassifier(MODEL_NAME, num_labels=NUM_LABELS).to(device)

    # Set training arguments for the HuggingFace trainer
    training_args = TrainingArguments(
        output_dir                 = PATHS.PREPROCESSING_EXPERIMENT_BERT,
        num_train_epochs           = EPOCHS,
        per_device_train_batch_size= BATCH_SIZE,
        per_device_eval_batch_size = BATCH_SIZE,
        learning_rate              = LEARNING_RATE,
        eval_strategy              = "epoch",
        save_strategy              = "epoch",
        logging_strategy           = "epoch",
        load_best_model_at_end     = True,
        metric_for_best_model      = "macro_f1",
        greater_is_better          = True,
        seed                       = SEED
    )

    # Collator (Dynamic Padding)
    collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)

    # Trainer object initialization
    trainer = Trainer(
        model          = model,
        args           = training_args,
        train_dataset  = train_ds,
        eval_dataset   = val_ds,
        tokenizer      = tokenizer,
        data_collator  = collator,
        compute_metrics= compute_metrics,
        callbacks      = [EarlyStoppingCallback(
                              early_stopping_patience=2,
                              early_stopping_threshold=0.001)]
    )

    # train model
    trainer.train()

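    # Predicting and decoding labels
    preds = trainer.predict(test_ds)
    # A LabelEncoder must be fitted before inverse_transform can be used.
    # Assumption: "label_encoded" was created by a LabelEncoder fitted on the same
    # set of faction ids, so fitting on the training labels restores that mapping.
    encoder = LabelEncoder().fit(train_df[LABEL_COL])
    y_true = encoder.inverse_transform(preds.label_ids)
    y_pred = encoder.inverse_transform(np.argmax(preds.predictions, axis=1))

    # Map to faction abbreviations
    y_true_abbr = [faction_map.get(label, str(label)) for label in y_true]
    y_pred_abbr = [faction_map.get(label, str(label)) for label in y_pred]

    # Unique labels and their abbreviations, in the same order as the confusion matrix rows
    unique_labels = sorted(np.unique(y_true))
    unique_abbr = [faction_map.get(fid, str(fid)) for fid in unique_labels]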


    # Calculate and store confusion matrix globally
    cm_local = confusion_matrix(y_true, y_pred, labels=unique_labels)
    # Store for later plotting
    confusion_matrices[model_name] = {
        "matrix": cm_local,
        "labels": unique_abbr
    }


    # Classification Report
    report = classification_report(
        y_true,
        y_pred,
        labels=unique_labels,  # keep the row order aligned with target_names
        target_names=[faction_map.get(fid, str(fid)) for fid in unique_labels],
        digits=3,
        output_dict=True
    )
    # Save classification report for later use
    classification_reports[model_name] = report



"""
Training Loop for SVM with TF-IDF
"""
def train_svm(svm_train_df : pd.DataFrame, svm_test_df : pd.DataFrame, model_name):
    """
    Trains an SVM classifier using a TF-IDF pipeline and stores the confusion matrix
    and classification report in the global dictionaries.

    :param: svm_train_df (pd.DataFrame): Training DataFrame.
    :param: svm_test_df (pd.DataFrame): Testing DataFrame.
    :param: model_name (str): Identifier for logging and result visualization.

    :return: None

    :raises: ValueError: If data preprocessing fails or model training encounters inconsistent labels.
    """
    # Define TF-IDF & SVM pipeline and train the classifier
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=MAX_FEATURES, ngram_range=NGRAM_RANGE, min_df=MIN_DF, max_df=MAX_DF)),
        ("clf", LinearSVC(random_state=SEED))
    ])
    pipeline.fit(svm_train_df["speech_content"], svm_train_df["faction_id"])

    # predict
    y_pred = pipeline.predict(svm_test_df["speech_content"])
    y_true = svm_test_df["faction_id"]

    # Mapping for Abbreviations
    label_ids = sorted(np.unique(y_true))
    label_names = [faction_map.get(fid, str(fid)) for fid in label_ids]
    y_pred_abbr = [faction_map.get(fid, str(fid)) for fid in y_pred]
    y_true_abbr = [faction_map.get(fid, str(fid)) for fid in y_true]


    # Calculate and store confusion matrix globally
    cm_local = confusion_matrix(y_true_abbr, y_pred_abbr, labels=label_names)
    # Store for later plotting
    confusion_matrices[model_name] = {
        "matrix": cm_local,
        "labels": label_names
    }


    # Classification Report
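    # Pass labels explicitly: otherwise the classes are sorted alphabetically
    # and no longer line up with target_names (which is ordered by faction id)
    report = classification_report( y_true_abbr,
                                    y_pred_abbr,
                                    labels=label_names,
                                    target_names=label_names,
                                    digits=3,
                                    output_dict=True
    )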
    # Save classification report for later use
    classification_reports[model_name] = report



def evaluate_model_visual(model_names: list[str]):
    """
    Compares classification results for 4 models in a 2x2 plot grid.
    Shows Confusion Matrices and Precision/Recall per class.

    :param model_names (list): List of 4 model names (keys from global confusion_matrices / classification_reports).
    """
    assert len(model_names) == 4, "You must provide exactly 4 model names!"
    # Define party order and colors
    party_order = ["CDU/CSU", "SPD", "Die Grünen", "AfD", "FDP", "DIE LINKE."]
    party_colors = {
        "CDU/CSU": "#000000",
        "SPD": "#E3000F",
        "Die Grünen": "#46962B",
        "FDP": "#FFED00",
        "AfD": "#009EE0",
        "DIE LINKE.": "#BE3075",
    }
    # Normalize label names (e.g., replace "Bündnis 90 / Die Grünen" with "Die Grünen")
    label_mapping = {
        "Bündnis 90/Die Grünen": "Die Grünen"
    }


    # --- Confusion Matrices ---
    fig_cm, axes_cm = plt.subplots(2, 2, figsize=(16, 12))
    for i, model in enumerate(model_names):
        row, col = divmod(i, 2)
        cm = confusion_matrices[model]["matrix"]
        labels = [label_mapping.get(l, l) for l in confusion_matrices[model]["labels"]]
        label_order = [l for l in party_order if l in labels]
        cm_df = pd.DataFrame(cm, index=labels, columns=labels).loc[label_order, label_order]

        sns.heatmap(cm_df, annot=True, fmt="d", cmap="Reds",
                    xticklabels=label_order, yticklabels=label_order,
                    ax=axes_cm[row, col])

        axes_cm[row, col].set_title(f"{model}:", fontsize=16)
        axes_cm[row, col].set_xlabel("Predicted", fontsize=14)
        axes_cm[row, col].set_ylabel("True", fontsize=14)
        axes_cm[row, col].tick_params(axis='x', rotation=90, labelsize=12)
        axes_cm[row, col].tick_params(axis='y', rotation=0, labelsize=12)

    fig_cm.suptitle("Confusion Matrices:", fontsize=18)
    fig_cm.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()


    # --- Precision & Recall ---
    fig_pr, axes_pr = plt.subplots(2, 2, figsize=(16, 12))

    for i, model in enumerate(model_names):
        row, col = divmod(i, 2)
        raw_report = classification_reports[model]
        labels = [label_mapping.get(l, l) for l in confusion_matrices[model]["labels"]]
        report = {label_mapping.get(l, l): raw_report[l] for l in raw_report if l in labels}
        label_order = [l for l in party_order if l in labels]

        df_plot = pd.DataFrame({
            "Party": label_order * 2,
            "Metric": ["Precision"] * len(label_order) + ["Recall"] * len(label_order),
            "Value": [report[l]["precision"] for l in label_order] +
                     [report[l]["recall"] for l in label_order],
            "Color": [party_colors[l] for l in label_order] * 2
        })

        sns.barplot(data=df_plot, x="Party", y="Value", hue="Metric",
                    ax=axes_pr[row, col], dodge=True,
                    palette={"Precision": "#4c72b0", "Recall": "#dd8452"})

        axes_pr[row, col].set_ylim(0, 1.05)
        axes_pr[row, col].set_title(f"{model}:", fontsize=16)
        axes_pr[row, col].tick_params(axis='x', labelsize=14, rotation=90)
        axes_pr[row, col].tick_params(axis='y', labelsize=12)

        # Remove the x-label; set the y-label only in the left column
        axes_pr[row, col].set_xlabel("")
        if col == 0:
            axes_pr[row, col].set_ylabel("Precision & Recall", fontsize=14)
        else:
            axes_pr[row, col].set_ylabel("")

        # Remove the per-subplot legends
        axes_pr[row, col].get_legend().remove()

        # Annotate bars
        for p in axes_pr[row, col].patches:
            height = p.get_height()
            if height > 0:
                axes_pr[row, col].annotate(f"{height:.2f}",
                                           (p.get_x() + p.get_width() / 2., height + 0.01),
                                           ha='center', va='bottom', rotation=90, fontsize=12)

    # Shared legend at the top left (outside the subplots)
    handles, legend_labels = axes_pr[0, 0].get_legend_handles_labels()
    fig_pr.legend(
        handles, legend_labels,
        title="Metric",
        loc="upper left",
        bbox_to_anchor=(0.01, 1.02),
        fontsize=12,
        title_fontsize=12
    )

    fig_pr.suptitle("Per-Class Precision & Recall:", fontsize=18)
    fig_pr.tight_layout(rect=[0, 0.05, 1, 0.92])
    plt.show()



def evaluate_model_textual(model_names: list[str]):
    """
    Prints textual evaluation summary (confusion matrix & classification report)
    for multiple models.

    :param model_names (list[str]): List of model identifiers.
    """
    assert len(model_names) > 0, "At least one model name must be provided."

    party_order = ["CDU/CSU", "SPD", "Die Grünen", "AfD", "FDP", "DIE LINKE."]
    label_mapping = {
        "Bündnis 90/Die Grünen": "Die Grünen"
    }

    for model_name in model_names:
        print(f"\n{'='*80}")
        print(f"### Evaluation Results for Model: **{model_name}**")
        print(f"{'='*80}")

        # Load evaluation data
        raw_cm = confusion_matrices[model_name]["matrix"]
        raw_labels = confusion_matrices[model_name]["labels"]
        raw_report = classification_reports[model_name]

        # Normalize label names
        labels = [label_mapping.get(label, label) for label in raw_labels]
        cm_df = pd.DataFrame(raw_cm, index=labels, columns=labels)

        # Remap classification report keys
        report = {
            label_mapping.get(label, label): raw_report[label]
            for label in raw_labels
        }

        # Label order
        label_order = [l for l in party_order if l in labels]

        # Print confusion matrix
        print("\n**Confusion Matrix (absolute counts)**\n")
        print(cm_df.loc[label_order, label_order].to_markdown())

        # Print classification report
        report_df = pd.DataFrame(report).transpose()
        print("\n**Classification Report**\n")
        print(report_df.loc[label_order, ["precision", "recall", "f1-score", "support"]].round(3))

        print("\n\n")



"""
Load data
"""
# Helperfunctions
def load_ids(path: Path) -> set[int]:
    """Reads a pickled collection of speech ids and returns it as a set of ints."""
    return set(pd.read_pickle(path))

def load_dataset(idx: int) -> pd.DataFrame:
    """Loads the preprocessed dataset «data_set_<idx>_20.pkl» as a DataFrame."""
    ds_path = PATHS.BASE_CLASSIFICATION_DATASET_DIR / f"data_set_{idx}_20.pkl"
    if not ds_path.exists():
        raise FileNotFoundError(ds_path)
    return pd.read_pickle(ds_path)

def subset(df: pd.DataFrame, ids: set[int]) -> pd.DataFrame:
    """Filters a DataFrame by speech ids and resets the index."""
    return df[df["id"].isin(ids)].reset_index(drop=True)


# Load ID-lists
ids_train      = load_ids(PATHS.TRAIN_IDS)
ids_test       = load_ids(PATHS.TEST_IDS)
ids_test_bert  = load_ids(PATHS.TEST_BERT_IDS)
ids_eval_bert  = load_ids(PATHS.EVAL_BERT_IDS)



"""
Start SVM training with shared splits
"""
for idx in SVM_DATASETS:
    df = load_dataset(idx)

    train_df = subset(df, ids_train)
    test_df  = subset(df, ids_test)

    model_name = f"SVM_DS_{idx}_20"
    print(f"[SVM]  Training → {model_name:14} "
          f"({len(train_df)} Train / {len(test_df)} Test)")
    train_svm(train_df, test_df, model_name)



"""
Start BERT training
"""
for idx in BERT_DATASETS:
    df = load_dataset(idx)

    train_df = subset(df, ids_train)
    val_df   = subset(df, ids_eval_bert)   # «eval» = 20 % of global Test-Set
    test_df  = subset(df, ids_test_bert)   # «test» = 80 % of global Test-Set

    model_name = f"BERT_DS_{idx}_20"
    print(f"[BERT] Training → {model_name:14} "
          f"({len(train_df)} Train / {len(val_df)} Val / {len(test_df)} Test)")
    train_bert(train_df, val_df, test_df, model_name)



"""
Evaluate Models
"""
evaluate_model_textual([
    "SVM_ds7_20",
    "SVM_ds1_20",
    "SVM_ds2_20",
    "SVM_ds3_20",
    "BERT_ds7_20",
    "BERT_ds4_20"
    "BERT_ds5_20",
    "BERT_ds9_20"
])

evaluate_model_visual([
    "SVM_ds7_20", "SVM_ds1_20",
    "SVM_ds2_20", "SVM_ds3_20",
])

evaluate_model_visual([
    "BERT_ds7_20", "BERT_ds4_20",
    "BERT_ds5_20", "BERT_ds9_20"
])

# **Outlier Detection Strategy Based on Speech Length – Guided by Experimental Findings**


## **Context**

- The experiment on speech length and party classification performance shows that:
    - Short speeches often lack party-distinctive semantics and confuse the classifiers.
    - Long speeches provide richer context and yield better predictions.
- However, the preprocessing configurations differ (e.g., stopword or pattern removal), so absolute speech lengths (in characters, tokens, or lemmas) vary significantly across datasets. Static thresholds (e.g., `speech_length_chars < 150`) are therefore insufficient and inconsistent.
- Problem:
    - A globally fixed cutoff removes too much data from some datasets and nothing from others.
    - Preprocessing may remove structure (phrases, tokens), drastically altering speech length distributions.
    - Outlier detection must therefore adapt to each dataset’s internal distribution.
- Each datapoint has three length indicators available:
    - `speech_length_chars`
    - `speech_length_tokens`
    - `speech_length_lemmas`
- Solution strategy (dynamic outlier scoring and filtering): We define a combined outlier score that flags a datapoint as extreme if at least one of the three length dimensions deviates strongly from the norm. Because the score is computed relative to each dataset’s own distribution, the approach is robust across preprocessing setups.
    - Step 1: Normalize all three features with sklearn's `RobustScaler`, which centers on the median and scales by the interquartile range and is therefore insensitive to outliers already present in the data.
    - Step 2: Compute a composite outlier score as the maximum absolute scaled value (a robust analogue of the z-score) across the three normalized features. An extreme deviation in any single dimension is then sufficient to mark a point as an outlier.
    - Step 3: Filter datapoints based on a dynamic threshold.
- Advantages of this method:
    - Works across all preprocessing variants without manually adjusting thresholds.
    - Combines all three length measures (characters, tokens, and lemmas) in a single outlier score.
    - Improves consistency and comparability of the filtered datasets.
    - Helps ensure semantic sufficiency of inputs for both classic and deep models.
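
The three steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the function names `add_outlier_score` and `filter_outliers`, the `outlier_score` column, and the `cutoff` value of 3.0 are assumptions made for this example. The cutoff is "dynamic" only in the sense that it is applied to scores scaled by each dataset's own median and interquartile range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Length columns named in the text above.
LENGTH_COLS = ["speech_length_chars", "speech_length_tokens", "speech_length_lemmas"]


def add_outlier_score(df: pd.DataFrame, cols=LENGTH_COLS) -> pd.DataFrame:
    """Return a copy of df with a composite `outlier_score` column (hypothetical name)."""
    # Step 1: center each feature on its median and scale by its IQR.
    scaled = RobustScaler().fit_transform(df[list(cols)].to_numpy(dtype=float))
    # Step 2: the score is the largest absolute scaled deviation in any dimension.
    out = df.copy()
    out["outlier_score"] = np.abs(scaled).max(axis=1)
    return out


def filter_outliers(df: pd.DataFrame, cutoff: float = 3.0) -> pd.DataFrame:
    """Step 3: keep rows whose score does not exceed the (assumed) cutoff."""
    scored = add_outlier_score(df)
    return scored[scored["outlier_score"] <= cutoff].reset_index(drop=True)
```

Because the scaler is fitted separately on each preprocessed dataset, the same `cutoff` corresponds to different absolute lengths in each variant, which is the behavior the static threshold could not provide.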



## **How the Experiment Justifies This Outlier Removal Strategy**

- The experimental results show that short speeches lead to significant drops in classification performance for both BERT and SVM models. These short speeches often contain procedural phrases, greetings, or fragmentary content that lacks the semantic signals necessary for reliable party prediction. Extremely long speeches are rare, but their length and verbosity can give them a disproportionate influence on training. It is therefore justified to remove both extremes automatically.
- This thresholding strategy is grounded in statistical normalization and backed by the observed performance gaps. It ensures that the training data used by the classifiers contains semantically meaningful and well-structured political content, and it aligns preprocessing with empirical model behavior rather than arbitrary assumptions, which should improve overall model robustness and generalization.
- Note: To make the experiment more efficient, we used the BERT tokens directly to determine speech length. The objective of the experiment was to learn whether, and to what extent, speech length influences model performance. It was not to determine specific length boundaries, which, as argued above, would not be suitable for this application anyway. Because the following steps train various models with different vector representations, we filter speeches by the character count of the preprocessed text.
