# Robust News Classification: Main Experiments

This notebook ties together all components of the robust news classification project, implementing the complete experimental pipeline as specified in the project proposal.

## Overview

This notebook demonstrates:
1. **Data Loading & Preprocessing**: Loading the ISOT dataset and applying text cleaning
2. **Data Splitting**: Both random splits and topic-holdout splits for robustness evaluation
3. **Baseline Models**: TF-IDF + Logistic Regression and TF-IDF + Linear SVM
4. **Advanced Models**: Sentence-embedding classifiers (transformer-based classifiers are not yet included; see Next Steps)
5. **Evaluation**: Macro-F1, PR-AUC, and ROC-AUC under both split strategies
6. **Cross-Dataset Transfer**: Zero-shot evaluation on an external dataset (WELFake)

## Project Goals

- Evaluate model robustness under topic shifts (topic-holdout splits)
- Compare interpretable baselines with advanced embedding/transformer models
- Test cross-dataset generalization to assess real-world applicability
- Use Macro-F1 as the primary metric so that both classes count equally, regardless of class balance
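
To make the metric choice concrete, here is a minimal sketch of how Macro-F1, PR-AUC, and ROC-AUC can be computed with scikit-learn. The toy labels and scores are made up for illustration, and the helper name `summarize_metrics` is hypothetical; the project's own `evaluate` function in `src/` may work differently.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score


def summarize_metrics(y_true, scores, threshold=0.5):
    """Return Macro-F1, PR-AUC and ROC-AUC for binary labels and positive-class scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    y_pred = (scores >= threshold).astype(int)  # hard labels are needed for F1
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "pr_auc": average_precision_score(y_true, scores),  # average precision
        "roc_auc": roc_auc_score(y_true, scores),
    }


# Toy example (made-up values): 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.2, 0.9, 0.4]
metrics = summarize_metrics(y_true, scores)
print(metrics)
```

In this toy case 9 of 10 predictions are correct, yet Macro-F1 is noticeably lower than the accuracy because the minority class is only half recovered. That sensitivity is the reason for preferring Macro-F1 when classes are uneven.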

## 1. Setup and Imports

Import all necessary modules from the `src/` directory and configure settings.
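
Files such as `01_preprocessing.py` start with a digit, so a plain `import` statement cannot name them. The sketch below shows one way to load such a file by path. The helper name `load_numbered_module` and the throwaway demo file are hypothetical; the step that matters is registering the module in `sys.modules`, without which a later `from <name> import ...` cannot find it.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_numbered_module(name, path):
    """Load the file at `path` as a module called `name` and register it."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # register before executing the module body
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway file whose name starts with a digit
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "01_demo.py"
    demo_path.write_text("def greet():\n    return 'hello'\n")
    demo = load_numbered_module("demo_module", demo_path)
    from demo_module import greet  # resolves because the module is registered

print(greet())
```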

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

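# Add src directory to path for imports.
# Assumes the notebook runs from a directory one level below the project root,
# which also matches the '../data/...' paths used later.
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root / 'src'))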

# Modules in src/ have numeric prefixes (e.g. 01_preprocessing.py), which a plain
# `import` statement cannot name, so load them by file path with importlib.
import importlib.util


def load_src_module(name, filename):
    """Load src/<filename> as a module called `name` and register it in sys.modules."""
    spec = importlib.util.spec_from_file_location(name, project_root / "src" / filename)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # required for the `from <name> import ...` lines below
    spec.loader.exec_module(module)
    return module


# Import preprocessing utilities
preprocessing = load_src_module("preprocessing", "01_preprocessing.py")
from preprocessing import load_isot, apply_cleaning, clean_text

# Import data splitting
data_splitting = load_src_module("data_splitting", "02_data_splitting.py")
from data_splitting import random_split, topic_holdout_split

# Import baseline models
baseline_models = load_src_module("baseline_models", "03_baseline_models.py")
from baseline_models import build_tfidf, train_logreg, train_svm

# Import evaluation
baseline_eval = load_src_module("baseline_eval", "04_baseline_eval.py")
from baseline_eval import evaluate

print("All imports successful!")
print(f"Project root: {project_root}")

## 2. Data Loading and Preprocessing

Load the ISOT dataset (Fake and True news) and apply text cleaning to remove noise and standardize formatting.
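
The actual cleaning rules live in `src/01_preprocessing.py` and are not shown in this notebook. As a rough illustration of what removing noise and standardizing formatting can involve, here is a standard-library sketch. The function name `basic_clean` and every rule in it are assumptions, not the project's `clean_text`.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def basic_clean(text):
    """Lowercase, drop URLs and HTML tags, and collapse whitespace."""
    if not isinstance(text, str):
        return ""  # treat missing values (None, NaN) as empty text
    text = html.unescape(text)      # decode entities such as "&amp;"
    text = TAG_RE.sub(" ", text)    # strip HTML tags
    text = URL_RE.sub(" ", text)    # strip URLs
    text = SPACE_RE.sub(" ", text)  # collapse runs of whitespace
    return text.strip().lower()


print(basic_clean("Breaking:  <b>Markets</b> rally &amp; more at https://example.com/x  "))
```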

In [None]:
# Load ISOT dataset
print("Loading ISOT dataset...")
df = load_isot(
    fake_path='../data/training/Fake.csv',
    real_path='../data/training/True.csv'
)

print(f"\nDataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print(f"\nSubject distribution:")
print(df['subject'].value_counts())

# Apply text cleaning
print("\n" + "="*60)
print("Applying text cleaning...")
df = apply_cleaning(df, text_column='text')
df = apply_cleaning(df, text_column='title')

print(f"\nSample cleaned text (first 200 chars):")
print(df['text_cleaned'].iloc[0][:200])

## 3. Data Splitting Strategies

Create both random and topic-holdout splits to evaluate model robustness under different scenarios.
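
The two strategies differ in what the test set shares with training: a random split mixes every topic into both sides, while a topic-holdout split keeps one topic entirely unseen. A minimal pandas sketch of the latter follows. The function name `holdout_by_topic` and the toy frame are illustrative assumptions, and the project's `topic_holdout_split` may differ in its details.

```python
import pandas as pd


def holdout_by_topic(df, topic_column, heldout_topic):
    """Put every row of `heldout_topic` in the test set and the rest in train."""
    is_heldout = df[topic_column] == heldout_topic
    if not is_heldout.any():
        raise ValueError(f"No rows with {topic_column} == {heldout_topic!r}")
    train = df[~is_heldout].reset_index(drop=True)
    test = df[is_heldout].reset_index(drop=True)
    return train, test


# Toy data (made up) with three topics and both classes in each topic
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "subject": ["sports", "sports", "tech", "tech", "health", "health"],
    "label": [0, 1, 0, 1, 0, 1],
})
train_df, test_df = holdout_by_topic(toy, "subject", "tech")

# Sanity checks: no topic overlap, and a look at the test-set class balance
assert set(train_df["subject"]).isdisjoint(test_df["subject"])
print(test_df["label"].value_counts().to_dict())
```

The class-balance check is worth keeping in real runs: if the held-out topic contains only one class, ROC-AUC is undefined and Macro-F1 becomes hard to interpret.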

In [None]:
# Random split (for baseline comparison)
print("="*60)
print("RANDOM SPLIT")
print("="*60)
df_train_random, df_test_random = random_split(df, test_size=0.2, random_state=42)

# Topic-holdout split (for robustness evaluation)
print("\n" + "="*60)
print("TOPIC HOLDOUT SPLIT")
print("="*60)
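# Hold out the 'politicsNews' topic.
# Caution: in the standard ISOT release, 'politicsNews' appears only in True.csv,
# so the held-out test set may contain a single class. Check the label counts below
# before interpreting Macro-F1 or ROC-AUC on this split.
df_train_topic, df_test_topic = topic_holdout_split(
    df,
    topic_column='subject',
    heldout_topic='politicsNews'
)
print("Held-out test label distribution:")
print(df_test_topic['label'].value_counts())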

# Prepare text and labels for both splits
print("\n" + "="*60)
print("Preparing data for modeling...")
print("="*60)

# Random split
X_train_random = df_train_random['text_cleaned'].tolist()
X_test_random = df_test_random['text_cleaned'].tolist()
y_train_random = df_train_random['label'].values
y_test_random = df_test_random['label'].values

# Topic-holdout split
X_train_topic = df_train_topic['text_cleaned'].tolist()
X_test_topic = df_test_topic['text_cleaned'].tolist()
y_train_topic = df_train_topic['label'].values
y_test_topic = df_test_topic['label'].values

print(f"Random split - Train: {len(X_train_random)}, Test: {len(X_test_random)}")
print(f"Topic split - Train: {len(X_train_topic)}, Test: {len(X_test_topic)}")

## 4. Baseline Models: TF-IDF + Linear Classifiers

Train interpretable baseline models using TF-IDF features with Logistic Regression and Linear SVM.
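
The `build_tfidf`, `train_logreg`, and `train_svm` helpers are defined in `src/`. The sketch below shows the underlying pattern, assuming they wrap scikit-learn's `TfidfVectorizer`, `LogisticRegression`, and `LinearSVC` (an assumption, since their source is not shown here). The corpus is made up. The point to note is that the vectorizer is fitted on training text only and then reused, unchanged, for the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Tiny made-up corpus, for illustration only
train_texts = [
    "shocking secret they do not want you to know",
    "you will not believe this shocking claim",
    "senate passes budget bill after long debate",
    "central bank holds interest rates steady",
]
train_labels = [1, 1, 0, 0]  # arbitrary encoding for this toy example
test_texts = ["shocking claim about secret bill", "bank debate on interest rates"]

tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train = tfidf.fit_transform(train_texts)  # learn vocabulary and IDF on train only
X_test = tfidf.transform(test_texts)        # reuse the fitted vocabulary

logreg = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
svm = LinearSVC().fit(X_train, train_labels)

print(logreg.predict(X_test), svm.predict(X_test))
```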

In [None]:
# Build TF-IDF vectorizer
print("="*60)
print("BASELINE MODELS: TF-IDF + Linear Classifiers")
print("="*60)
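# Use a separate vectorizer per split: calling fit_transform twice on one
# vectorizer would overwrite the vocabulary that the random-split models rely on.
vectorizer = build_tfidf(max_features=5000, ngram_range=(1, 2))        # random split
vectorizer_topic = build_tfidf(max_features=5000, ngram_range=(1, 2))  # topic-holdout split

# Fit on training text only, then transform the matching test text
X_train_random_tfidf = vectorizer.fit_transform(X_train_random)
X_test_random_tfidf = vectorizer.transform(X_test_random)
X_train_topic_tfidf = vectorizer_topic.fit_transform(X_train_topic)
X_test_topic_tfidf = vectorizer_topic.transform(X_test_topic)

print(f"\nTF-IDF feature matrix shape (random split): {X_train_random_tfidf.shape}")
print(f"TF-IDF feature matrix shape (topic holdout): {X_train_topic_tfidf.shape}")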

In [None]:
# Train and evaluate Logistic Regression on random split
print("\n" + "="*60)
print("Logistic Regression - Random Split")
print("="*60)
model_lr_random = train_logreg(X_train_random_tfidf, y_train_random)
results_lr_random = evaluate(model_lr_random, X_test_random_tfidf, y_test_random, 
                             model_name="Logistic Regression (Random Split)")

In [None]:
# Train and evaluate Logistic Regression on topic-holdout split
print("\n" + "="*60)
print("Logistic Regression - Topic Holdout Split")
print("="*60)
model_lr_topic = train_logreg(X_train_topic_tfidf, y_train_topic)
results_lr_topic = evaluate(model_lr_topic, X_test_topic_tfidf, y_test_topic,
                            model_name="Logistic Regression (Topic Holdout)")

In [None]:
# Train and evaluate Linear SVM on random split
print("\n" + "="*60)
print("Linear SVM - Random Split")
print("="*60)
model_svm_random = train_svm(X_train_random_tfidf, y_train_random)
results_svm_random = evaluate(model_svm_random, X_test_random_tfidf, y_test_random,
                              model_name="Linear SVM (Random Split)")

In [None]:
# Train and evaluate Linear SVM on topic-holdout split
print("\n" + "="*60)
print("Linear SVM - Topic Holdout Split")
print("="*60)
model_svm_topic = train_svm(X_train_topic_tfidf, y_train_topic)
results_svm_topic = evaluate(model_svm_topic, X_test_topic_tfidf, y_test_topic,
                             model_name="Linear SVM (Topic Holdout)")

## 5. Advanced Models: Sentence Embeddings

Train models using sentence embeddings as features, providing richer semantic representations than TF-IDF.

**Note**: This section requires the `sentence-transformers` library. Skip these cells if it is not installed; the results summary in Section 6 then reports no values for the embedding classifier.

In [None]:
# Import embedding utilities
spec = importlib.util.spec_from_file_location("embeddings_model", project_root / "src" / "05_embeddings_model.py")
embeddings_model = importlib.util.module_from_spec(spec)
sys.modules["embeddings_model"] = embeddings_model  # register so the `from ... import` below resolves
spec.loader.exec_module(embeddings_model)
from embeddings_model import build_embeddings, embed_text, train_embedding_classifier

# Build sentence embedding model
print("="*60)
print("ADVANCED MODELS: Sentence Embeddings")
print("="*60)
embedder = build_embeddings(model_name="all-MiniLM-L6-v2")

# Compute embeddings for random split
print("\nComputing embeddings for random split...")
emb_train_random = embed_text(embedder, X_train_random)
emb_test_random = embed_text(embedder, X_test_random)

# Compute embeddings for topic-holdout split
print("Computing embeddings for topic-holdout split...")
emb_train_topic = embed_text(embedder, X_train_topic)
emb_test_topic = embed_text(embedder, X_test_topic)

print(f"Embedding dimension: {emb_train_random.shape[1]}")

In [None]:
# Train and evaluate embedding-based classifier on random split
print("\n" + "="*60)
print("Embedding-based Classifier - Random Split")
print("="*60)
model_emb_random = train_embedding_classifier(emb_train_random, y_train_random)
results_emb_random = evaluate(model_emb_random, emb_test_random, y_test_random,
                              model_name="Embedding Classifier (Random Split)")

In [None]:
# Train and evaluate embedding-based classifier on topic-holdout split
print("\n" + "="*60)
print("Embedding-based Classifier - Topic Holdout Split")
print("="*60)
model_emb_topic = train_embedding_classifier(emb_train_topic, y_train_topic)
results_emb_topic = evaluate(model_emb_topic, emb_test_topic, y_test_topic,
                             model_name="Embedding Classifier (Topic Holdout)")

## 6. Results Summary

Compare all models across both split strategies to assess robustness.

In [None]:
# Compile results summary
print("="*60)
print("RESULTS SUMMARY")
print("="*60)

results_summary = pd.DataFrame({
    'Model': [
        'Logistic Regression (TF-IDF)',
        'Linear SVM (TF-IDF)',
        'Embedding Classifier',
    ],
    'Random Split - Macro-F1': [
        results_lr_random['f1_macro'],
        results_svm_random['f1_macro'],
        results_emb_random['f1_macro'] if 'results_emb_random' in locals() else None,
    ],
    'Random Split - ROC-AUC': [
        results_lr_random['roc_auc'],
        results_svm_random['roc_auc'],
        results_emb_random['roc_auc'] if 'results_emb_random' in locals() else None,
    ],
    'Topic Holdout - Macro-F1': [
        results_lr_topic['f1_macro'],
        results_svm_topic['f1_macro'],
        results_emb_topic['f1_macro'] if 'results_emb_topic' in locals() else None,
    ],
    'Topic Holdout - ROC-AUC': [
        results_lr_topic['roc_auc'],
        results_svm_topic['roc_auc'],
        results_emb_topic['roc_auc'] if 'results_emb_topic' in locals() else None,
    ],
})

print("\nModel Performance Comparison:")
print(results_summary.to_string(index=False))

# Calculate robustness gap (performance drop from random to topic-holdout)
results_summary['Macro-F1 Drop'] = (
    results_summary['Random Split - Macro-F1'] - 
    results_summary['Topic Holdout - Macro-F1']
)
print("\n" + "="*60)
print("Robustness Analysis (Macro-F1 drop from random to topic-holdout):")
print("="*60)
print(results_summary[['Model', 'Macro-F1 Drop']].to_string(index=False))

## 7. Cross-Dataset Transfer Evaluation

Evaluate model generalization by testing on external datasets (WELFake) without fine-tuning.
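
In a zero-shot transfer test nothing is refitted on the external data: both the vectorizer and the classifier come from the source dataset. The sketch below shows that pattern on made-up text. The helper name `zero_shot_macro_f1` is hypothetical and is not the project's `zero_shot_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def zero_shot_macro_f1(model, vectorizer, texts, labels):
    """Score an already-fitted model on external text without refitting anything."""
    X_ext = vectorizer.transform(texts)  # transform only; never fit on external data
    return f1_score(labels, model.predict(X_ext), average="macro")


# Source dataset (made up)
src_texts = [
    "miracle cure doctors hate",
    "aliens built the pyramids miracle",
    "parliament votes on trade deal",
    "court rules on trade dispute",
]
src_labels = [1, 1, 0, 0]
vec = TfidfVectorizer().fit(src_texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(src_texts), src_labels)

# External dataset (made up), using the same label encoding as the source
ext_texts = ["miracle diet doctors recommend", "trade deal reached in parliament"]
ext_labels = [1, 0]
score = zero_shot_macro_f1(clf, vec, ext_texts, ext_labels)
print(f"Macro-F1 on external data: {score:.4f}")
```

Two things commonly go wrong in this setup: the external labels use the opposite encoding from the source dataset, or the vectorizer has been refitted since the model was trained, so the feature columns no longer line up.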

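**Note**: This section requires the WELFake dataset. The cell below reads a sample file at `../data/test/WELFake_Dataset_sample_1000.csv`; adjust the path as needed. If the file is missing, the evaluation is skipped.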

In [None]:
# Import cross-dataset utilities
spec = importlib.util.spec_from_file_location("cross_dataset_transfer", project_root / "src" / "07_cross_dataset_transfer.py")
cross_dataset_transfer = importlib.util.module_from_spec(spec)
sys.modules["cross_dataset_transfer"] = cross_dataset_transfer  # register so the `from ... import` below resolves
spec.loader.exec_module(cross_dataset_transfer)
from cross_dataset_transfer import load_kaggle_dataset, zero_shot_test

# Load WELFake dataset for cross-dataset evaluation
print("="*60)
print("CROSS-DATASET TRANSFER: WELFake Dataset")
print("="*60)

try:
    df_welfake = load_kaggle_dataset(
        path='../data/test/WELFake_Dataset_sample_1000.csv',
        text_column='text',
        label_column='label',
        text_cleaner=clean_text
    )

    print(f"WELFake dataset shape: {df_welfake.shape}")
    print(f"Label distribution:")
    print(df_welfake['label'].value_counts())

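    # Zero-shot scores are only meaningful if WELFake's label encoding matches the
    # one produced by load_isot; verify this and flip the labels if they differ.
    X_welfake = df_welfake['text_cleaned'].tolist()
    y_welfake = df_welfake['label'].values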

    # Evaluate baseline models on WELFake (zero-shot)
    print("\n" + "="*60)
    print("Zero-Shot Evaluation on WELFake Dataset")
    print("="*60)

    # Transform WELFake text with TF-IDF vectorizer (trained on ISOT)
    X_welfake_tfidf = vectorizer.transform(X_welfake)

    print("\nLogistic Regression (TF-IDF) - WELFake:")
    f1_lr_welfake = zero_shot_test(model_lr_random, X_welfake_tfidf, y_welfake)
    print(f"Macro-F1: {f1_lr_welfake:.4f}")

    print("\nLinear SVM (TF-IDF) - WELFake:")
    f1_svm_welfake = zero_shot_test(model_svm_random, X_welfake_tfidf, y_welfake)
    print(f"Macro-F1: {f1_svm_welfake:.4f}")

    # Evaluate the embedding classifier only if Section 5 was run
    if 'model_emb_random' in locals() and 'embedder' in locals():
        print("\nEmbedding Classifier - WELFake:")
        emb_welfake = embed_text(embedder, X_welfake)
        f1_emb_welfake = zero_shot_test(model_emb_random, emb_welfake, y_welfake)
        print(f"Macro-F1: {f1_emb_welfake:.4f}")
except FileNotFoundError as e:
    print(f"WELFake dataset not found: {e}")
    print("Skipping cross-dataset evaluation.")

## 8. Conclusions and Discussion

### Key Findings:

1. **Model Performance**: Compare the TF-IDF baselines with the embedding classifier using the Section 6 table
2. **Robustness**: Read the `Macro-F1 Drop` column; a larger drop from the random split to the topic-holdout split indicates heavier reliance on topic-specific cues
3. **Generalization**: Compare the WELFake Macro-F1 scores from Section 7 with the random-split scores on ISOT

### Interpretation:

- **Macro-F1** serves as the primary metric to balance performance across classes
- **Topic-holdout splits** reveal whether models rely on topic-specific shortcuts
- **Cross-dataset evaluation** tests real-world applicability beyond ISOT

### Next Steps:

- Tune hyperparameters on a validation split held out from the training data, not on the test sets
- Analyze failure cases and model interpretability
- Expand cross-dataset evaluation to additional external datasets
- Add transformer model evaluation (requires additional setup)