# Robust News Classification: Main Experiments

This notebook ties together the components of the robust news classification project for the current workflow: train on the full ISOT dataset, then test on held-out files.

## Overview

This notebook demonstrates:
1. **Data Loading & Preprocessing**: Load ISOT training data and clean text
2. **Training**: Train on the full ISOT training set (Fake + True)
3. **Baseline & Advanced Models**: TF-IDF + LogReg/SVM; optional sentence-embedding classifier
4. **Evaluation**:
   - Fake-only test (`data/test/fake.csv`): false negatives / fake recall
   - Mixed labeled test (`data/test/WELFake_Dataset_sample_1000.csv`): Macro-F1, ROC-AUC, PR-AUC, confusion matrix
5. **Cross-Dataset Transfer**: The WELFake evaluation doubles as the cross-dataset test, since the ISOT-trained models are applied to it without fine-tuning

## Project Goals

- Train on full labeled ISOT data (text only)
- Check false negatives on fake-only held-out data
- Measure balanced metrics on a mixed external set (WELFake)
- Compare TF-IDF baselines with optional embedding model

## 1. Setup and Imports

Import all necessary modules from the `src/` directory and configure settings.
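
The files in `src/` start with digits (for example `01_preprocessing.py`), so a plain `import 01_preprocessing` is a syntax error and the modules have to be loaded by file path. The following is a minimal sketch of a helper that wraps the standard `importlib` recipe used in the next cell. The name `load_numbered_module` is hypothetical and not part of the project.

```python
import importlib.util
import sys
from pathlib import Path


def load_numbered_module(alias: str, path: Path):
    """Load a Python file by path and register it in sys.modules under `alias`.

    Hypothetical helper: it only wraps the standard importlib recipe.
    """
    spec = importlib.util.spec_from_file_location(alias, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so that imports inside the module can resolve it.
    sys.modules[alias] = module
    spec.loader.exec_module(module)
    return module
```

With such a helper, each module would be loaded in one line, for example `load_numbered_module("preprocessing", project_root / "src" / "01_preprocessing.py")`, after which `from preprocessing import clean_text` works as usual.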

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

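# Add src directory to path for imports.
# Assumes the notebook runs from a direct subdirectory of the project root,
# which is also what the relative '../data/...' paths below rely on.
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root / 'src'))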

# Import using importlib for files with numeric prefixes
import importlib.util

# Import preprocessing utilities
spec = importlib.util.spec_from_file_location("preprocessing", project_root / "src" / "01_preprocessing.py")
preprocessing = importlib.util.module_from_spec(spec)
spec.loader.exec_module(preprocessing)
sys.modules["preprocessing"] = preprocessing
from preprocessing import load_isot, apply_cleaning, clean_text

# Import baseline models
spec = importlib.util.spec_from_file_location("baseline_models", project_root / "src" / "03_baseline_models.py")
baseline_models = importlib.util.module_from_spec(spec)
spec.loader.exec_module(baseline_models)
sys.modules["baseline_models"] = baseline_models
from baseline_models import build_tfidf, train_logreg, train_svm

# Import evaluation
spec = importlib.util.spec_from_file_location("baseline_eval", project_root / "src" / "04_baseline_eval.py")
baseline_eval = importlib.util.module_from_spec(spec)
spec.loader.exec_module(baseline_eval)
sys.modules["baseline_eval"] = baseline_eval
from baseline_eval import evaluate

print("All imports successful!")
print(f"Project root: {project_root}")

## 2. Data Loading and Preprocessing

Load the ISOT dataset (Fake and True news) and apply text cleaning to remove noise and standardize formatting.
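
To make "remove noise and standardize formatting" concrete, here is a minimal sketch of what a text cleaner of this kind might do. It is not the project's `clean_text` (the contents of `01_preprocessing.py` are not shown in this notebook); the function name `clean_text_sketch` and the specific steps (HTML unescaping, URL removal, whitespace collapsing, lowercasing) are assumptions chosen for illustration.

```python
import html
import re

_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_WHITESPACE_RE = re.compile(r"\s+")


def clean_text_sketch(text) -> str:
    """Illustrative cleaner; NOT the project's clean_text implementation."""
    if not isinstance(text, str):
        return ""  # treat missing values (None, NaN) as empty text
    text = html.unescape(text)            # decode entities such as "&amp;"
    text = _URL_RE.sub(" ", text)         # drop URLs
    text = _WHITESPACE_RE.sub(" ", text)  # collapse runs of whitespace
    return text.strip().lower()


print(clean_text_sketch("Breaking&amp;News:  see https://example.com/x  NOW\n"))
```

Returning an empty string for non-string input is a design choice of this sketch: it keeps `Series.apply` from failing on missing values, at the cost of silently producing empty documents.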

In [None]:
# Load ISOT dataset
print("Loading ISOT dataset...")
df = load_isot(
    fake_path='../data/training/Fake.csv',
    real_path='../data/training/True.csv'
)

print(f"\nDataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print(f"\nSubject distribution:")
print(df['subject'].value_counts())

# Apply text cleaning
print("\n" + "="*60)
print("Applying text cleaning...")
df = apply_cleaning(df, text_column='text')
df = apply_cleaning(df, text_column='title')

print(f"\nSample cleaned text (first 200 chars):")
print(df['text_cleaned'].iloc[0][:200])

## 3. Train/Test Setup

Train on the full ISOT training set (Fake + True). Evaluate on two held-out files:
- `data/test/fake.csv` (all fake) to measure false negatives (fake recall)
- `data/test/WELFake_Dataset_sample_1000.csv` (mixed, labeled) for full metrics
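
Because every article in `data/test/fake.csv` is fake (label 1 in the project convention), that file yields no true negatives or false positives: each prediction of 0 is a false negative. With $N$ test articles, of which $\mathrm{FN}$ are predicted as real, fake recall reduces to:

$$
\text{recall}_{\text{fake}} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{N - \mathrm{FN}}{N} = 1 - \frac{\mathrm{FN}}{N}
$$

On this file, accuracy and fake recall are therefore the same number, and metrics that need both classes (ROC-AUC, PR-AUC, Macro-F1) are only meaningful on the mixed WELFake set.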

In [None]:
print("="*60)
print("Preparing train and held-out test sets")
print("="*60)

# Train on full ISOT training set
X_train_full = df['text_cleaned'].tolist()
y_train_full = df['label'].values

# Fake-only test set: measure false negatives / fake recall
print("\nLoading fake-only test set (all fake)...")
df_fake_test = pd.read_csv('../data/test/fake.csv')
df_fake_test['text_cleaned'] = df_fake_test['text'].apply(clean_text)
X_test_fake = df_fake_test['text_cleaned'].tolist()
y_test_fake = np.ones(len(df_fake_test), dtype=int)

# WELFake mixed test set: full metrics
print("\nLoading WELFake test set (mixed labeled)...")
df_welfake = pd.read_csv('../data/test/WELFake_Dataset_sample_1000.csv')
# Map WELFake labels (source: 0=fake, 1=real) into project convention (1=fake, 0=real)
df_welfake['label'] = df_welfake['label'].map({0: 1, 1: 0})
df_welfake['text_cleaned'] = df_welfake['text'].apply(clean_text)
X_test_welfake = df_welfake['text_cleaned'].tolist()
y_test_welfake = df_welfake['label'].values

print(f"Train size: {len(X_train_full)}")
print(f"Fake-only test size: {len(X_test_fake)}")
print(f"WELFake test size: {len(X_test_welfake)}")

## 4. Baseline Models: TF-IDF + Linear Classifiers

Train interpretable baseline models using TF-IDF features with Logistic Regression and Linear SVM.
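
To illustrate what the TF-IDF features represent, the sketch below computes unigram and bigram TF-IDF weights in plain Python. It assumes `build_tfidf` wraps something like scikit-learn's `TfidfVectorizer`, which is an assumption since `03_baseline_models.py` is not shown here. The smoothed idf formula and L2 row normalization follow what I understand to be that vectorizer's defaults. The sketch omits the `max_features` vocabulary cap, and the two example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, n_max=2):
    """TF-IDF with smoothed idf and L2-normalised rows (one dict per document)."""
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, so no term gets weight zero.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for c in counts:
        row = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return rows


rows = tfidf(["senate passes budget bill", "shocking budget secret exposed"])
# "budget" occurs in both documents, so it is weighted below document-specific terms.
print(rows[0]["budget"] < rows[0]["senate passes"])
```

A linear classifier trained on such features learns one weight per term, which is what makes these baselines interpretable: the largest positive and negative weights point to the words and bigrams that push a prediction toward fake or real.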

In [None]:
# Build TF-IDF vectorizer
print("="*60)
print("BASELINE MODELS: TF-IDF + Linear Classifiers")
print("="*60)
vectorizer = build_tfidf(max_features=5000, ngram_range=(1, 2))

# Transform text data (fit once on full training set)
X_train_tfidf = vectorizer.fit_transform(X_train_full)
X_test_fake_tfidf = vectorizer.transform(X_test_fake)
X_test_welfake_tfidf = vectorizer.transform(X_test_welfake)

print(f"\nTF-IDF feature matrix shape: {X_train_tfidf.shape}")

In [None]:
# Train and evaluate Logistic Regression on full train -> WELFake
print("\n" + "="*60)
print("Logistic Regression - Full Train -> WELFake")
print("="*60)
model_lr = train_logreg(X_train_tfidf, y_train_full)
results_lr_welfake = evaluate(model_lr, X_test_welfake_tfidf, y_test_welfake, 
                             model_name="Logistic Regression (WELFake)")

In [None]:
# Train and evaluate Linear SVM on full train -> WELFake
print("\n" + "="*60)
print("Linear SVM - Full Train -> WELFake")
print("="*60)
model_svm = train_svm(X_train_tfidf, y_train_full)
results_svm_welfake = evaluate(model_svm, X_test_welfake_tfidf, y_test_welfake,
                              model_name="Linear SVM (WELFake)")

In [None]:
# Fake-only test: false negatives / fake recall (TF-IDF models)
print("\n" + "="*60)
print("Fake-only test set (all fake) - TF-IDF models")
print("="*60)

def fake_only_report(model, X_fake, model_name):
    preds = model.predict(X_fake)
    total = len(preds)
    fn = np.sum(preds == 0)
    recall_fake = 1 - fn / total if total else float('nan')
    print(f"{model_name}: total={total}, false_negatives={fn}, fake_recall={recall_fake:.4f}")
    return {"total": int(total), "false_negatives": int(fn), "fake_recall": float(recall_fake)}

fake_only_lr = fake_only_report(model_lr, X_test_fake_tfidf, "LogReg (fake-only)")
fake_only_svm = fake_only_report(model_svm, X_test_fake_tfidf, "Linear SVM (fake-only)")

## 5. Advanced Models: Sentence Embeddings

Train models using sentence embeddings as features, providing richer semantic representations than TF-IDF.

**Note**: This section requires the `sentence-transformers` library. Skip these cells if it is not installed; the results summary then leaves the embedding row empty.

In [None]:
# Import embedding utilities
spec = importlib.util.spec_from_file_location("embeddings_model", project_root / "src" / "05_embeddings_model.py")
embeddings_model = importlib.util.module_from_spec(spec)
spec.loader.exec_module(embeddings_model)
sys.modules["embeddings_model"] = embeddings_model
from embeddings_model import build_embeddings, embed_text, train_embedding_classifier

# Build sentence embedding model
print("="*60)
print("ADVANCED MODELS: Sentence Embeddings")
print("="*60)
embedder = build_embeddings(model_name="all-MiniLM-L6-v2")

# Compute embeddings for full train and tests
print("\nComputing embeddings for full train and test sets...")
emb_train_full = embed_text(embedder, X_train_full)
emb_test_fake_emb = embed_text(embedder, X_test_fake)
emb_test_welfake_emb = embed_text(embedder, X_test_welfake)

print(f"Embedding dimension: {emb_train_full.shape[1]}")

In [None]:
# Train and evaluate embedding-based classifier -> WELFake
print("\n" + "="*60)
print("Embedding-based Classifier - Full Train -> WELFake")
print("="*60)
model_emb = train_embedding_classifier(emb_train_full, y_train_full)
results_emb_welfake = evaluate(model_emb, emb_test_welfake_emb, y_test_welfake,
                              model_name="Embedding Classifier (WELFake)")

In [None]:
# Fake-only test: false negatives / fake recall (embeddings)
print("\n" + "="*60)
print("Fake-only test set (all fake) - Embedding model")
print("="*60)
# Reuse the helper defined for the TF-IDF models instead of repeating its logic
fake_only_emb = fake_only_report(model_emb, emb_test_fake_emb, "Embedding model (fake-only)")

## 6. Results Summary

Compare models on WELFake (mixed labeled) and report fake-only recall on `data/test/fake.csv`.

In [None]:
# Compile results summary
print("="*60)
print("RESULTS SUMMARY")
print("="*60)

results_summary = pd.DataFrame({
    'Model': [
        'Logistic Regression (TF-IDF)',
        'Linear SVM (TF-IDF)',
        'Embedding Classifier',
    ],
    'WELFake - Macro-F1': [
        results_lr_welfake['f1_macro'],
        results_svm_welfake['f1_macro'],
        results_emb_welfake['f1_macro'] if 'results_emb_welfake' in locals() else None,
    ],
    'WELFake - ROC-AUC': [
        results_lr_welfake['roc_auc'],
        results_svm_welfake['roc_auc'],
        results_emb_welfake['roc_auc'] if 'results_emb_welfake' in locals() else None,
    ],
    'WELFake - PR-AUC': [
        results_lr_welfake['pr_auc'],
        results_svm_welfake['pr_auc'],
        results_emb_welfake['pr_auc'] if 'results_emb_welfake' in locals() else None,
    ],
    'Fake-only recall': [
        fake_only_lr['fake_recall'],
        fake_only_svm['fake_recall'],
        fake_only_emb['fake_recall'] if 'fake_only_emb' in locals() else None,
    ],
})

print("\nModel Performance Comparison:")
print(results_summary.to_string(index=False))

## 7. Cross-Dataset Transfer Evaluation

Handled above by running the ISOT-trained models on WELFake (zero-shot, no fine-tuning). No additional code needed here.

## 8. Conclusions and Discussion

### Key Findings:

1. **Model Performance**: Compare baseline (TF-IDF) vs. advanced (sentence-embedding) approaches
2. **Robustness**: Assess how many fake articles each model misses on the fake-only held-out set
3. **Generalization**: Evaluate cross-dataset transfer performance on WELFake

### Interpretation:

- **Macro-F1** serves as the primary metric to balance performance across classes
- **Fake-only recall** shows how many fake articles slip through as false negatives
- **Cross-dataset evaluation** tests real-world applicability beyond ISOT
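
As a small illustration of why Macro-F1 balances performance across classes where accuracy does not, the sketch below computes it in plain Python. The helper names and the toy labels are made up for this example; the project's `evaluate` function may compute the metric differently.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class (0.0 if undefined)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy data (made up): 8 real (0) and 2 fake (1); the model predicts "real" for everything.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case the always-real model reaches 80% accuracy, but its F1 on the fake class is zero, so Macro-F1 drops to about 0.44. Each class contributes equally to the average regardless of how many examples it has.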

### Next Steps:

- Fine-tune hyperparameters for optimal performance
- Analyze failure cases and model interpretability
- Expand cross-dataset evaluation to additional external datasets
- Add transformer model evaluation (requires additional setup)