# Robust News Classification: Main Experiments

This notebook runs the project's current end-to-end flow for robust news classification: train on the full ISOT dataset, then test on held-out files.

## Overview

This notebook demonstrates:
1. **Data Loading & Preprocessing**: Load ISOT training data and clean text
2. **Training**: Train on the full ISOT training set (Fake + True)
3. **Baseline Models**: TF-IDF + LogReg/SVM
4. **Advanced Models**: Sentence embeddings (MiniLM)
5. **Evaluation**:
   - Fake-only test (`data/test/fake.csv`): false negatives / fake recall
   - Mixed labeled test (`data/test/WELFake_Dataset_sample_10000.csv`): Macro-F1, ROC-AUC, PR-AUC, confusion matrix
6. **Cross-Dataset Transfer**: The WELFake evaluation doubles as the cross-dataset test; models are trained on ISOT only, with no fine-tuning on WELFake

## Project Goals

- Train on full labeled ISOT data (text only)
- Check false negatives on fake-only held-out data
- Measure balanced metrics on a mixed external set (WELFake)
- Compare TF-IDF baselines with embedding models

## 1. Setup and Imports

Import all necessary modules from the `src/` directory and configure settings.

In [1]:
import torch
print('cuda:', torch.cuda.is_available())
print('mps:', torch.backends.mps.is_available())

cuda: False
mps: True


In [2]:
# Initialize result holders to avoid NameError if optional cells are skipped
# Base (text-only) variants
results_lr_welfake = None
results_svm_welfake = None
results_emb_welfake = None
fake_only_lr = None
fake_only_svm = None
fake_only_emb = None

# Multi-feature / embedding+length variants
results_lr_welfake_multi = None
results_svm_welfake_multi = None
results_emb_welfake_multi = None
fake_only_lr_multi = None
fake_only_svm_multi = None
fake_only_emb_multi = None

In [3]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import warnings
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings('ignore')

# Add src directory to path for imports
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root / 'src'))

# Import using importlib for files with numeric prefixes
import importlib.util

# Import preprocessing utilities
spec = importlib.util.spec_from_file_location("preprocessing", project_root / "src" / "01_preprocessing.py")
preprocessing = importlib.util.module_from_spec(spec)
spec.loader.exec_module(preprocessing)
sys.modules["preprocessing"] = preprocessing
from preprocessing import load_isot, apply_cleaning, clean_text

# Import baseline models
spec = importlib.util.spec_from_file_location("baseline_models", project_root / "src" / "03_baseline_models.py")
baseline_models = importlib.util.module_from_spec(spec)
spec.loader.exec_module(baseline_models)
sys.modules["baseline_models"] = baseline_models
from baseline_models import build_tfidf, train_logreg, train_svm

# Import evaluation
spec = importlib.util.spec_from_file_location("baseline_eval", project_root / "src" / "04_baseline_eval.py")
baseline_eval = importlib.util.module_from_spec(spec)
spec.loader.exec_module(baseline_eval)
sys.modules["baseline_eval"] = baseline_eval
from baseline_eval import evaluate

print("All imports successful!")
print(f"Project root: {project_root}")

All imports successful!
Project root: /Users/reuben/robust-news-classification/robust-news-classification
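
The import stanzas in the cell above repeat the same `importlib` steps for each numerically prefixed file. As a sketch, they could be folded into one helper. `load_numbered_module` is a hypothetical name, not something defined in `src/`, and unlike the cell above it registers the module in `sys.modules` before executing it.

```python
import importlib.util
import sys
from pathlib import Path


def load_numbered_module(alias, path):
    """Load a .py file whose name is not a valid identifier (e.g. '01_x.py')
    and register it in sys.modules under `alias`."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(alias, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[alias] = module      # register before executing the module body
    spec.loader.exec_module(module)  # run the file's top-level code
    return module
```

With this in place, a call such as `load_numbered_module("preprocessing", project_root / "src" / "01_preprocessing.py")` would replace one four-line stanza.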


## 2. Data Loading and Preprocessing

Load the ISOT dataset (Fake and True news) and apply text cleaning to remove noise and standardize formatting.

In [4]:
# Load ISOT dataset
print("Loading ISOT dataset...")
df = load_isot(
    fake_path='../data/training/Fake.csv',
    real_path='../data/training/True.csv'
)

print(f"\nDataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print(f"\nSubject distribution:")
print(df['subject'].value_counts())

# Apply text cleaning
print("\n" + "="*60)
print("Applying text cleaning...")
df = apply_cleaning(df, text_column='text')
df = apply_cleaning(df, text_column='title')

print(f"\nSample cleaned text (first 200 chars):")
print(df['text_cleaned'].iloc[0][:200])

Loading ISOT dataset...
Loading fake news from ../data/training/Fake.csv...
Loading real news from ../data/training/True.csv...
Loaded 23481 fake articles and 21417 real articles
Total: 44898 articles

Dataset shape: (44898, 6)
Columns: ['title', 'text', 'subject', 'date', 'label', 'source_file']

Label distribution:
label
1    23481
0    21417
Name: count, dtype: int64

Subject distribution:
subject
politicsNews       11272
worldnews          10145
News                9050
politics            6841
left-news           4459
Government News     1570
US_News              783
Middle-east          778
Name: count, dtype: int64

Applying text cleaning...
Applied text cleaning to column 'text'
Created new column 'text_cleaned' with cleaned text
Applied text cleaning to column 'title'
Created new column 'title_cleaned' with cleaned text

Sample cleaned text (first 200 chars):
Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout ou

## 3. Train/Test Setup

Train on the full ISOT training set (Fake + True). Evaluate on two held-out files:
- `data/test/fake.csv` (all fake) to measure false negatives (fake recall)
- `data/test/WELFake_Dataset_sample_10000.csv` (mixed, labeled) for full metrics

In [5]:
print("="*60)
print("Preparing train and held-out test sets")
print("="*60)

# Train on full ISOT training set
X_train_full = df['text_cleaned'].tolist()
y_train_full = df['label'].values

# Fake-only test set: measure false negatives / fake recall
print("\nLoading fake-only test set (all fake)...")
df_fake_test = pd.read_csv('../data/test/fake.csv')
df_fake_test['text_cleaned'] = df_fake_test['text'].apply(clean_text)
X_test_fake = df_fake_test['text_cleaned'].tolist()
# Fake class = 1 (original convention)
y_test_fake = np.ones(len(df_fake_test), dtype=int)

# WELFake mixed test set: full metrics
print("\nLoading WELFake test set (mixed labeled)...")
df_welfake = pd.read_csv('../data/test/WELFake_Dataset_sample_10000.csv')
# WELFake labels are actually 0=real, 1=fake; this matches our convention, so no mapping needed
# If you find labels are reversed, flip with {0:1, 1:0}
df_welfake['text_cleaned'] = df_welfake['text'].apply(clean_text)
X_test_welfake = df_welfake['text_cleaned'].tolist()
y_test_welfake = df_welfake['label'].values

print(f"Train size: {len(X_train_full)}")
print(f"Fake-only test size: {len(X_test_fake)}")
print(f"WELFake test size: {len(X_test_welfake)}")

Preparing train and held-out test sets

Loading fake-only test set (all fake)...

Loading WELFake test set (mixed labeled)...
Train size: 44898
Fake-only test size: 12999
WELFake test size: 10000
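
Both test files are described as held-out, but the notebook does not verify that none of their articles also appear in the ISOT training files. A minimal check, assuming exact duplicates after case and whitespace normalization are the concern, might look like this. `count_overlap` and the normalization are illustrative choices, not part of `src/`.

```python
import hashlib


def _fingerprint(text):
    """Hash a lowercased, whitespace-collapsed copy of the text."""
    normalized = " ".join(str(text).lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def count_overlap(train_texts, test_texts):
    """Return how many test texts also occur (after normalization) in train."""
    train_hashes = {_fingerprint(t) for t in train_texts}
    return sum(_fingerprint(t) in train_hashes for t in test_texts)
```

Calling `count_overlap(X_train_full, X_test_fake)` and `count_overlap(X_train_full, X_test_welfake)` would give the number of test articles that duplicate a training article; a nonzero count would mean the reported test metrics are partly measured on seen data.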


## 4. Baseline Models: TF-IDF + Linear Classifiers

Train interpretable baseline models using TF-IDF features with Logistic Regression and Linear SVM.

In [6]:
# Build TF-IDF vectorizer
print("="*60)
print("BASELINE MODELS: TF-IDF + Linear Classifiers")
print("="*60)
vectorizer = build_tfidf(max_features=5000, ngram_range=(1, 2))

# Transform text data (fit once on full training set)
X_train_tfidf = vectorizer.fit_transform(X_train_full)
X_test_fake_tfidf = vectorizer.transform(X_test_fake)
X_test_welfake_tfidf = vectorizer.transform(X_test_welfake)

print(f"\nTF-IDF feature matrix shape: {X_train_tfidf.shape}")

BASELINE MODELS: TF-IDF + Linear Classifiers
Created TF-IDF vectorizer with:
  max_features: 5000
  min_df: 2
  max_df: 0.95
  ngram_range: (1, 2)
  stop_words: english

TF-IDF feature matrix shape: (44898, 5000)
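
The printed settings `min_df: 2` and `max_df: 0.95` prune the vocabulary by document frequency before the `max_features` cap is applied. Assuming `build_tfidf` forwards them to scikit-learn's `TfidfVectorizer` (the log suggests this, but the source of `build_tfidf` is not shown here), the pruning rule can be sketched in plain Python. `prune_by_document_frequency` is an illustrative helper and uses whitespace tokenization for brevity.

```python
from collections import Counter


def prune_by_document_frequency(docs, min_df=2, max_df=0.95):
    """Keep terms found in at least `min_df` documents (absolute count)
    and in at most `max_df` of all documents (proportion)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


docs = ["the senate vote", "the senate bill", "the bill", "the rumor"]
print(prune_by_document_frequency(docs))  # ['bill', 'senate']
```

Here "the" is dropped for appearing in every document, and "vote" and "rumor" are dropped for appearing in only one.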


In [7]:
# Train and evaluate Logistic Regression on full train -> WELFake
print("\n" + "="*60)
print("Logistic Regression - Full Train -> WELFake")
print("="*60)
model_lr = train_logreg(X_train_tfidf, y_train_full)
results_lr_welfake = evaluate(model_lr, X_test_welfake_tfidf, y_test_welfake, 
                             model_name="Logistic Regression (WELFake)")


Logistic Regression - Full Train -> WELFake
Training Logistic Regression classifier...
  Training samples: 44898
  C (regularization): 1.0
  max_iter: 1000
  solver: lbfgs
Training completed.
  Training accuracy: 0.9925

Evaluating Logistic Regression (WELFake)
Test set size: 10000 samples
Class distribution: {np.int64(0): np.int64(4844), np.int64(1): np.int64(5156)}

Metrics                   Value          
----------------------------------------
Accuracy                  0.8358
Precision (macro)         0.8656
Recall (macro)            0.8313
F1-score (macro)          0.8309  <-- PRIMARY METRIC
F1-score (weighted)       0.8318
ROC-AUC                   0.9048  <-- Secondary metric
PR-AUC (Avg Precision)    0.8781  <-- Secondary metric

Confusion Matrix:
                Predicted Real  Predicted Fake 
Actual Real     3332            1512           
Actual Fake     130             5026           

Detailed Classification Report:
              precision    recall  f1-score   support


In [8]:
# Train and evaluate Linear SVM on full train -> WELFake
print("\n" + "="*60)
print("Linear SVM - Full Train -> WELFake")
print("="*60)
model_svm = train_svm(X_train_tfidf, y_train_full)
results_svm_welfake = evaluate(model_svm, X_test_welfake_tfidf, y_test_welfake,
                              model_name="Linear SVM (WELFake)")


Linear SVM - Full Train -> WELFake
Training Linear SVM classifier...
  Training samples: 44898
  C (regularization): 1.0
  max_iter: 1000
  dual: False
Training completed.
  Training accuracy: 0.9990

Evaluating Linear SVM (WELFake)
Test set size: 10000 samples
Class distribution: {np.int64(0): np.int64(4844), np.int64(1): np.int64(5156)}

Metrics                   Value          
----------------------------------------
Accuracy                  0.8343
Precision (macro)         0.8688
Recall (macro)            0.8295
F1-score (macro)          0.8288  <-- PRIMARY METRIC
F1-score (weighted)       0.8297
ROC-AUC                   0.9033  <-- Secondary metric
PR-AUC (Avg Precision)    0.8803  <-- Secondary metric

Confusion Matrix:
                Predicted Real  Predicted Fake 
Actual Real     3274            1570           
Actual Fake     87              5069           

Detailed Classification Report:
              precision    recall  f1-score   support

        Real       0.97     

In [9]:
# Fake-only test: false negatives / fake recall (TF-IDF models)
print("\n" + "="*60)
print("Fake-only test set (all fake) - TF-IDF models")
print("="*60)

def fake_only_report(model, X_fake, model_name):
    preds = model.predict(X_fake)
    total = len(preds)
    # Fake class = 1; false negatives are predictions of 0 (real)
    fn = np.sum(preds == 0)
    recall_fake = 1 - fn / total if total else float('nan')
    print(f"{model_name}: total={total}, false_negatives={fn}, fake_recall={recall_fake:.4f}")
    return {"total": int(total), "false_negatives": int(fn), "fake_recall": float(recall_fake)}

fake_only_lr = fake_only_report(model_lr, X_test_fake_tfidf, "LogReg (fake-only)")
fake_only_svm = fake_only_report(model_svm, X_test_fake_tfidf, "Linear SVM (fake-only)")


Fake-only test set (all fake) - TF-IDF models
LogReg (fake-only): total=12999, false_negatives=826, fake_recall=0.9365
Linear SVM (fake-only): total=12999, false_negatives=685, fake_recall=0.9473


## 4b. Multi-feature TF-IDF + Length Features

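We augment the text-only baseline with title text (when available) and simple length features (text and title character counts) to see whether the extra features improve performance. Models are trained on the concatenated title + body plus the two length features and evaluated on the same test sets. This variant also raises the TF-IDF `max_features` from 5000 to 7000, so it differs from the baseline in more than the added features.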

In [10]:
print("="*60)
print("MULTI-FEATURE MODELS: TF-IDF (title + text) + length features")
print("="*60)

# Helper to build concatenated text and length features

def build_concat_and_lengths(df, text_col="text_cleaned", title_col="title_cleaned"):
    # Title may be missing (e.g., fake-only test). Fall back to empty strings.
    if title_col in df.columns:
        titles = df[title_col].fillna("")
    else:
        titles = pd.Series("", index=df.index)  # share df's index so the concatenation below aligns row by row
    texts = df[text_col].fillna("")
    concat = (titles + " " + texts).str.strip()
    text_len = texts.str.len()
    title_len = titles.str.len()
    return concat, text_len, title_len

# Build train features
train_concat, train_text_len, train_title_len = build_concat_and_lengths(df)
vectorizer_multi = build_tfidf(max_features=7000, ngram_range=(1, 2))
X_train_concat = vectorizer_multi.fit_transform(train_concat)

scaler_multi = StandardScaler(with_mean=False)
num_train = np.vstack([train_text_len.values, train_title_len.values]).T
num_train_scaled = csr_matrix(scaler_multi.fit_transform(num_train))
X_train_multi = hstack([X_train_concat, num_train_scaled])

y_train_multi = y_train_full

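# Build test features (fake-only and WELFake)
# NOTE: the test frames are not given a 'title_cleaned' column in this notebook, so the raw
# 'title' is used here while the training titles were cleaned; this mismatch may affect results.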
fake_concat, fake_text_len, fake_title_len = build_concat_and_lengths(df_fake_test, text_col="text_cleaned", title_col="title" if "title" in df_fake_test.columns else "title_cleaned")
welfake_concat, welfake_text_len, welfake_title_len = build_concat_and_lengths(df_welfake, text_col="text_cleaned", title_col="title" if "title" in df_welfake.columns else "title_cleaned")

X_test_fake_concat = vectorizer_multi.transform(fake_concat)
num_fake = np.vstack([fake_text_len.values, fake_title_len.values]).T
num_fake_scaled = csr_matrix(scaler_multi.transform(num_fake))
X_test_fake_multi = hstack([X_test_fake_concat, num_fake_scaled])

X_test_welfake_concat = vectorizer_multi.transform(welfake_concat)
num_welfake = np.vstack([welfake_text_len.values, welfake_title_len.values]).T
num_welfake_scaled = csr_matrix(scaler_multi.transform(num_welfake))
X_test_welfake_multi = hstack([X_test_welfake_concat, num_welfake_scaled])

# Train & evaluate Logistic Regression (multi-feature)
print("\n" + "="*60)
print("Logistic Regression - Multi-feature (title+text+lengths) -> WELFake")
print("="*60)
model_lr_multi = train_logreg(X_train_multi, y_train_multi)
results_lr_welfake_multi = evaluate(model_lr_multi, X_test_welfake_multi, y_test_welfake, model_name="LogReg (multi-feature)")

# Train & evaluate Linear SVM (multi-feature)
print("\n" + "="*60)
print("Linear SVM - Multi-feature (title+text+lengths) -> WELFake")
print("="*60)
model_svm_multi = train_svm(X_train_multi, y_train_multi)
results_svm_welfake_multi = evaluate(model_svm_multi, X_test_welfake_multi, y_test_welfake, model_name="Linear SVM (multi-feature)")

# Fake-only test with multi-feature models
print("\n" + "="*60)
print("Fake-only test set (all fake) - Multi-feature models")
print("="*60)

fake_only_lr_multi = fake_only_report(model_lr_multi, X_test_fake_multi, "LogReg (multi-feature, fake-only)")
fake_only_svm_multi = fake_only_report(model_svm_multi, X_test_fake_multi, "Linear SVM (multi-feature, fake-only)")

MULTI-FEATURE MODELS: TF-IDF (title + text) + length features
Created TF-IDF vectorizer with:
  max_features: 7000
  min_df: 2
  max_df: 0.95
  ngram_range: (1, 2)
  stop_words: english

Logistic Regression - Multi-feature (title+text+lengths) -> WELFake
Training Logistic Regression classifier...
  Training samples: 44898
  C (regularization): 1.0
  max_iter: 1000
  solver: lbfgs
Training completed.
  Training accuracy: 0.9927

Evaluating LogReg (multi-feature)
Test set size: 10000 samples
Class distribution: {np.int64(0): np.int64(4844), np.int64(1): np.int64(5156)}

Metrics                   Value          
----------------------------------------
Accuracy                  0.8109
Precision (macro)         0.8196
Recall (macro)            0.8082
F1-score (macro)          0.8085  <-- PRIMARY METRIC
F1-score (weighted)       0.8092
ROC-AUC                   0.8997  <-- Secondary metric
PR-AUC (Avg Precision)    0.8972  <-- Secondary metric

Confusion Matrix:
                Predicted Re
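
The length columns in this section are scaled with `StandardScaler(with_mean=False)`, which divides each column by its standard deviation without centering it. A dependency-free sketch of that arithmetic, assuming the population standard deviation (the estimator scikit-learn's scaler uses), is shown below. `scale_without_centering` is an illustrative name and the character counts are made up.

```python
from statistics import pstdev


def scale_without_centering(values):
    """Divide by the population standard deviation; do not subtract the mean."""
    std = pstdev(values)
    if std == 0:
        return list(values)  # constant column: leave unchanged
    return [v / std for v in values]


lengths = [120, 480, 960, 2400]  # made-up character counts
scaled = scale_without_centering(lengths)
print([round(v, 3) for v in scaled])
```

Because nothing is subtracted, the scaled lengths stay non-negative and zero stays zero, which keeps them on the same footing as the non-negative TF-IDF columns they are stacked with.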

## 5. Advanced Models: Sentence Embeddings

Train models using sentence embeddings as features, providing richer semantic representations than TF-IDF.

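**Note**: This section requires the `sentence-transformers` library. Run the cells below only if it is installed; the result holders initialized in Section 1 let the summary cell run even when these cells are skipped.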

In [None]:
# Import embedding utilities
spec = importlib.util.spec_from_file_location("embeddings_model", project_root / "src" / "05_embeddings_model.py")
embeddings_model = importlib.util.module_from_spec(spec)
spec.loader.exec_module(embeddings_model)
sys.modules["embeddings_model"] = embeddings_model
from embeddings_model import build_embeddings, embed_text, train_embedding_classifier

# Build sentence embedding model
print("="*60)
print("ADVANCED MODELS: Sentence Embeddings")
print("="*60)
embedder = build_embeddings(model_name="all-MiniLM-L6-v2")

# Compute embeddings for full train and tests
print("\nComputing embeddings for full train and test sets...")
emb_train_full = embed_text(embedder, X_train_full)
emb_test_fake_emb = embed_text(embedder, X_test_fake)
emb_test_welfake_emb = embed_text(embedder, X_test_welfake)

print(f"Embedding dimension: {emb_train_full.shape[1]}")

ADVANCED MODELS: Sentence Embeddings
Loading sentence-embedding model: all-MiniLM-L6-v2
Using device for embeddings: mps
Embedding model loaded successfully.

Computing embeddings for full train and test sets...
Encoding 44898 texts into embeddings...


Batches:   0%|          | 0/1404 [00:00<?, ?it/s]

In [None]:
# Train and evaluate embedding-based classifier -> WELFake
print("\n" + "="*60)
print("Embedding-based Classifier - Full Train -> WELFake")
print("="*60)
model_emb = train_embedding_classifier(emb_train_full, y_train_full)
results_emb_welfake = evaluate(model_emb, emb_test_welfake_emb, y_test_welfake,
                              model_name="Embedding Classifier (WELFake)")


Embedding-based Classifier - Full Train -> WELFake
Training Logistic Regression classifier on embeddings...
  Training samples: 44898
  Embedding dimension: 384
  C (regularization): 1.0
  max_iter: 5000


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Training completed.
  Training accuracy on embeddings: 0.9617

Evaluating Embedding Classifier (WELFake)
Test set size: 10000 samples
Class distribution: {np.int64(0): np.int64(4844), np.int64(1): np.int64(5156)}

Metrics                   Value          
----------------------------------------
Accuracy                  0.8042
Precision (macro)         0.8110
Recall (macro)            0.8018
F1-score (macro)          0.8021  <-- PRIMARY METRIC
F1-score (weighted)       0.8027
ROC-AUC                   0.8843  <-- Secondary metric
PR-AUC (Avg Precision)    0.8752  <-- Secondary metric

Confusion Matrix:
                Predicted Real  Predicted Fake 
Actual Real     3506            1338           
Actual Fake     620             4536           

Detailed Classification Report:
              precision    recall  f1-score   support

        Real       0.85      0.72      0.78      4844
        Fake       0.77      0.88      0.82      5156

    accuracy                           0.80     

In [None]:
# Fake-only test: false negatives / fake recall (embeddings)
print("\n" + "="*60)
print("Fake-only test set (all fake) - Embedding model")
print("="*60)
# Reuse fake_only_report (defined in the TF-IDF fake-only cell).
# Fake class = 1; false negatives are predictions of 0 (real)
fake_only_emb = fake_only_report(model_emb, emb_test_fake_emb, "Embedding model")


Fake-only test set (all fake) - Embedding model
Embedding model: total=12999, false_negatives=3235, fake_recall=0.7511


## 5b. Sentence Embeddings + Length Features

Augment the sentence-embedding baseline by concatenating title+text and appending simple length features (text/title character counts) to see if these non-text signals help.

In [None]:
print("="*60)
print("ADVANCED MODELS: Sentence Embeddings + Length Features")
print("="*60)

# Reuse helper from TF-IDF multi-feature section to build concatenated texts and lengths
# (build_concat_and_lengths must be defined in the earlier multi-feature cell)

# Build train concatenated text and numeric features
train_concat_emb, train_text_len_emb, train_title_len_emb = build_concat_and_lengths(df)
train_concat_list = train_concat_emb.tolist()
emb_train_concat = embed_text(embedder, train_concat_list)
num_train_emb = np.vstack([train_text_len_emb.values, train_title_len_emb.values]).T
num_scaler_emb = StandardScaler()
num_train_emb_scaled = num_scaler_emb.fit_transform(num_train_emb)
emb_train_multi = np.hstack([emb_train_concat, num_train_emb_scaled])

y_train_emb_multi = y_train_full

# Build test sets
fake_concat_emb, fake_text_len_emb, fake_title_len_emb = build_concat_and_lengths(
    df_fake_test, text_col="text_cleaned", title_col="title" if "title" in df_fake_test.columns else "title_cleaned"
)
fake_concat_list = fake_concat_emb.tolist()

welfake_concat_emb, welfake_text_len_emb, welfake_title_len_emb = build_concat_and_lengths(
    df_welfake, text_col="text_cleaned", title_col="title" if "title" in df_welfake.columns else "title_cleaned"
)
welfake_concat_list = welfake_concat_emb.tolist()

emb_test_fake_concat = embed_text(embedder, fake_concat_list)
num_fake_emb = np.vstack([fake_text_len_emb.values, fake_title_len_emb.values]).T
num_fake_emb_scaled = num_scaler_emb.transform(num_fake_emb)
emb_test_fake_multi = np.hstack([emb_test_fake_concat, num_fake_emb_scaled])

emb_test_welfake_concat = embed_text(embedder, welfake_concat_list)
num_welfake_emb = np.vstack([welfake_text_len_emb.values, welfake_title_len_emb.values]).T
num_welfake_emb_scaled = num_scaler_emb.transform(num_welfake_emb)
emb_test_welfake_multi = np.hstack([emb_test_welfake_concat, num_welfake_emb_scaled])

ADVANCED MODELS: Sentence Embeddings + Length Features
Encoding 44898 texts into embeddings...


Batches:   0%|          | 0/1404 [00:00<?, ?it/s]

Generated embeddings with shape: (44898, 384)
Encoding 12999 texts into embeddings...


Batches:   0%|          | 0/407 [00:00<?, ?it/s]

Generated embeddings with shape: (12999, 384)
Encoding 10000 texts into embeddings...


Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Generated embeddings with shape: (10000, 384)
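
The progress-bar totals above (1404, 407, and 313 batches) are consistent with encoding in batches of 32 texts. That batch size is inferred from the counts rather than stated anywhere in the notebook; the check is plain ceiling division.

```python
import math


def num_batches(n_texts, batch_size=32):
    """Number of batches needed to encode n_texts (the last batch may be partial)."""
    return math.ceil(n_texts / batch_size)


for n in (44898, 12999, 10000):
    print(n, num_batches(n))
```

This is mainly useful for estimating runtime: each embedding variant encodes all three sets, so the two embedding sections together process roughly twice that many batches.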


In [None]:
# Train and evaluate Logistic Regression (embeddings + lengths)
print("\n" + "="*60)
print("Embedding + Lengths Classifier - Full Train -> WELFake")
print("="*60)
model_emb_multi = train_embedding_classifier(emb_train_multi, y_train_emb_multi)
results_emb_welfake_multi = evaluate(model_emb_multi, emb_test_welfake_multi, y_test_welfake,
                                    model_name="Embedding + Lengths (WELFake)")


Embedding + Lengths Classifier - Full Train -> WELFake
Training Logistic Regression classifier on embeddings...
  Training samples: 44898
  Embedding dimension: 386
  C (regularization): 1.0
  max_iter: 5000


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Training completed.
  Training accuracy on embeddings: 0.9674

Evaluating Embedding + Lengths (WELFake)
Test set size: 10000 samples
Class distribution: {np.int64(0): np.int64(4844), np.int64(1): np.int64(5156)}

Metrics                   Value          
----------------------------------------
Accuracy                  0.7932
Precision (macro)         0.7943
Recall (macro)            0.7920
F1-score (macro)          0.7924  <-- PRIMARY METRIC
F1-score (weighted)       0.7928
ROC-AUC                   0.8665  <-- Secondary metric
PR-AUC (Avg Precision)    0.8791  <-- Secondary metric

Confusion Matrix:
                Predicted Real  Predicted Fake 
Actual Real     3658            1186           
Actual Fake     882             4274           

Detailed Classification Report:
              precision    recall  f1-score   support

        Real       0.81      0.76      0.78      4844
        Fake       0.78      0.83      0.81      5156

    accuracy                           0.79     1

In [None]:
# Fake-only test: false negatives / fake recall (embeddings + lengths)
print("\n" + "="*60)
print("Fake-only test set (all fake) - Embedding + lengths model")
print("="*60)
# Reuse fake_only_report (defined in the TF-IDF fake-only cell).
# Fake class = 1; false negatives are predictions of 0 (real)
fake_only_emb_multi = fake_only_report(model_emb_multi, emb_test_fake_multi, "Embedding+lengths model")


Fake-only test set (all fake) - Embedding + lengths model
Embedding+lengths model: total=12999, false_negatives=5379, fake_recall=0.5862


## 6. Results Summary

Compare models on WELFake (mixed labeled) and report fake-only recall on `data/test/fake.csv`.

In [None]:
print("="*60)
print("RESULTS SUMMARY")
print("="*60)

# Safe helper in case some runs are skipped

def safe(d, key):
    return None if d is None else d.get(key)

results_summary = pd.DataFrame({
    "Model": [
        "LogReg (TF-IDF)",
        "Linear SVM (TF-IDF)",
        "LogReg (TF-IDF + lengths)",
        "Linear SVM (TF-IDF + lengths)",
        "Embedding (text only)",
        "Embedding + lengths",
    ],
    "WELFake - Macro-F1": [
        safe(results_lr_welfake, "f1_macro"),
        safe(results_svm_welfake, "f1_macro"),
        safe(results_lr_welfake_multi, "f1_macro"),
        safe(results_svm_welfake_multi, "f1_macro"),
        safe(results_emb_welfake, "f1_macro"),
        safe(results_emb_welfake_multi, "f1_macro"),
    ],
    "WELFake - ROC-AUC": [
        safe(results_lr_welfake, "roc_auc"),
        safe(results_svm_welfake, "roc_auc"),
        safe(results_lr_welfake_multi, "roc_auc"),
        safe(results_svm_welfake_multi, "roc_auc"),
        safe(results_emb_welfake, "roc_auc"),
        safe(results_emb_welfake_multi, "roc_auc"),
    ],
    "WELFake - PR-AUC": [
        safe(results_lr_welfake, "pr_auc"),
        safe(results_svm_welfake, "pr_auc"),
        safe(results_lr_welfake_multi, "pr_auc"),
        safe(results_svm_welfake_multi, "pr_auc"),
        safe(results_emb_welfake, "pr_auc"),
        safe(results_emb_welfake_multi, "pr_auc"),
    ],
    "Fake-only recall": [
        safe(fake_only_lr, "fake_recall"),
        safe(fake_only_svm, "fake_recall"),
        safe(fake_only_lr_multi, "fake_recall"),
        safe(fake_only_svm_multi, "fake_recall"),
        safe(fake_only_emb, "fake_recall"),
        safe(fake_only_emb_multi, "fake_recall"),
    ],
})

print("\nModel Performance Comparison:")
print(results_summary.to_string(index=False))

RESULTS SUMMARY

Model Performance Comparison:
                        Model  WELFake - Macro-F1  WELFake - ROC-AUC  WELFake - PR-AUC  Fake-only recall
              LogReg (TF-IDF)            0.830949           0.904819          0.878144          0.936457
          Linear SVM (TF-IDF)            0.828783           0.903293          0.880313          0.947304
    LogReg (TF-IDF + lengths)            0.808537           0.899661          0.897192          0.724594
Linear SVM (TF-IDF + lengths)            0.819348           0.914173          0.909507          0.835141
        Embedding (text only)            0.802100           0.884306          0.875243          0.751135
          Embedding + lengths            0.792412           0.866485          0.879129          0.586199


## 7. Conclusions and Discussion

### Key Findings
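- **Text-only TF-IDF has the best macro-F1** on the WELFake 10k sample: LogReg 0.831 (ROC-AUC 0.905); Linear SVM 0.829 (ROC-AUC 0.903).
- **Title and length features did not help at the default threshold**: the multi-feature TF-IDF models scored lower on macro-F1 (0.809 for LogReg, 0.819 for SVM) and on fake-only recall (0.725 and 0.835), although the SVM variant has the highest ROC-AUC (0.914) and PR-AUC (0.910) in the table. These models changed several things at once (title text, length features, and `max_features` raised from 5000 to 7000), so the drop cannot be attributed to the length features alone.
- **Embeddings trail TF-IDF**: text-only embeddings reach macro-F1 0.802 (ROC-AUC 0.884); adding title text and lengths lowers it further (macro-F1 0.792, ROC-AUC 0.866).
- **Fake-only recall is best for the text-only TF-IDF SVM** (0.947, 685 false negatives out of 12,999); the embedding + lengths model is worst (0.586, 5,379 false negatives).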

### Interpretation
- Simple length cues are weak signals here and can harm recall; more discriminative metadata/structure is needed.
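- Cross-dataset robustness is modest: macro-F1 on WELFake is about 0.79–0.83, against training accuracy of 0.96–0.99 on ISOT, despite ROC-AUC of 0.87–0.91. For the text-only TF-IDF models the errors are dominated by real articles predicted as fake (1,512 false positives versus 130 false negatives for LogReg), which is consistent with a domain/style mismatch that shifts scores relative to the default decision threshold.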
- TF-IDF still outperforms sentence embeddings under this setup; embeddings + naive numeric features are not closing the gap.
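
The gap between ROC-AUC and macro-F1 noted above has a mechanical side: ROC-AUC depends only on how well the scores rank fake above real, while macro-F1 is computed at one fixed decision threshold. The toy sketch below (made-up scores, hypothetical helper names) shows the same scores giving different macro-F1 at two thresholds while the AUC stays fixed.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_f1(y_true, scores, threshold):
    """Macro-F1 after labelling score >= threshold as fake (1)."""
    preds = [int(s >= threshold) for s in scores]
    f1s = []
    for cls in (0, 1):
        tp = sum(p == cls and y == cls for p, y in zip(preds, y_true))
        fp = sum(p == cls and y != cls for p, y in zip(preds, y_true))
        fn = sum(p != cls and y == cls for p, y in zip(preds, y_true))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / 2


# Toy data: real articles (0) receive inflated "fake" scores, as might happen under domain shift
y = [0, 0, 0, 0, 1, 1, 1, 1]
s = [0.30, 0.55, 0.60, 0.70, 0.65, 0.80, 0.90, 0.95]

print(roc_auc(y, s))                                                # 0.9375
print(round(macro_f1(y, s, 0.50), 3), round(macro_f1(y, s, 0.62), 3))  # 0.564 0.873
```

On real data the threshold would have to be chosen on a validation split rather than on the test set, which is what the threshold-calibration item under Next Steps refers to.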

### Next Steps
- Explore richer features: URL/domain tokens, punctuation/ratio features, and explicit title/body weighting instead of raw lengths.
- Consider domain adaptation or light fine-tuning on a WELFake-like validation split; calibrate decision thresholds for recall/precision trade-offs.
- End-to-end fine-tuned transformer classifiers remain excluded for runtime reasons; revisit with tighter configs (shorter max_length, partial epochs) if compute allows.
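
The punctuation/ratio features proposed above could start from something like the following sketch. The specific ratios and the name `style_ratios` are illustrative choices, and the sketch assumes the features are computed on raw text before `clean_text`, since cleaning may strip the characters being counted.

```python
def style_ratios(text):
    """Length-normalized style cues, each in the range [0, 1]."""
    text = text if isinstance(text, str) else ""  # treat missing values as empty text
    n_chars = max(len(text), 1)                   # avoid division by zero
    letters = [c for c in text if c.isalpha()]
    n_letters = max(len(letters), 1)
    return {
        "exclaim_ratio": text.count("!") / n_chars,
        "question_ratio": text.count("?") / n_chars,
        "upper_ratio": sum(c.isupper() for c in letters) / n_letters,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
    }


print(style_ratios("BREAKING: You won't BELIEVE this!!!"))
```

Unlike raw character counts, these ratios are bounded and do not grow with article length, which may make them less sensitive to length differences between ISOT and WELFake.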
