# Relevance model comparison (imbalanced binary classification)

This notebook compares **three relevance models** on the same split with the same metrics:

1. **TF-IDF + Logistic Regression** (linear baseline, usually hard to beat on sparse text features)
2. **TF-IDF + LinearSVC** (linear, sometimes better recall; calibrated to get probabilities)
3. **TF-IDF + LightGBM** (non-linear, gradient-boosted trees)

### Why these models
- Data is **highly imbalanced** (relevant=1 is rare)
- Task is **binary filtering**
- Metrics focus on **PR-AUC** and **recall at high precision**
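
To see why PR-AUC suits rare positives: a scorer with no signal gets an average precision equal to the positive rate, while its ROC-AUC is 0.5 whatever the imbalance. A minimal sketch on synthetic labels (the 10% positive rate and the names `y_demo` and `no_signal` are assumptions for illustration, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic labels: 10 positives out of 100 (illustrative only).
y_demo = np.array([1] * 10 + [0] * 90)

# A constant score carries no ranking information at all.
no_signal = np.zeros(len(y_demo))

ap_baseline = average_precision_score(y_demo, no_signal)
roc_baseline = roc_auc_score(y_demo, no_signal)

print("positive rate:", y_demo.mean())
print("PR-AUC of a no-signal scorer:", ap_baseline)
print("ROC-AUC of a no-signal scorer:", roc_baseline)
```

When reading the comparison table, it helps to judge each PR-AUC against the positive rate of the validation split rather than against 0.5.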

### What this notebook does
- Loads a CSV with columns: `text` and `relevant` (or `relevance`)
- Creates a **single stratified train/val split**
- Runs a **small randomized hyperparameter search** for the two linear models (LightGBM uses fixed hyperparameters)
- Evaluates with:
  - PR-AUC
  - Recall @ Precision ≥ 0.90
  - F1 (positive class, threshold 0.5)
- Prints a **final comparison table**


In [None]:
# =========================
# Imports & config
# =========================
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    f1_score,
)

# LightGBM
import lightgbm as lgb

RANDOM_STATE = 42
TEST_SIZE = 0.30
TARGET_PRECISION = 0.90

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)


## 1) Load data

Edit the path below. Required columns:
- `text`
- `relevant` **or** `relevance`


In [None]:
# TODO: update path
DATASET_CSV = Path("dataset.csv")

df = pd.read_csv(DATASET_CSV)

# detect relevance column
if "relevant" in df.columns:
    REL_COL = "relevant"
elif "relevance" in df.columns:
    REL_COL = "relevance"
else:
    raise ValueError("Expected column `relevant` or `relevance`")

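TEXT_COL = "text"

# NOTE: rows with a missing label are treated as not relevant (0).
df[REL_COL] = df[REL_COL].fillna(0).astype(int)
# Rows without text cannot be vectorized, so drop them.
df = df.dropna(subset=[TEXT_COL]).copy()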

print("Shape:", df.shape)
print("Class distribution:")
print(df[REL_COL].value_counts())


## 2) Train / validation split (stratified)

We stratify by the relevance label so that both splits keep roughly the same positive rate.


In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    df[TEXT_COL].astype(str),
    df[REL_COL].values,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=df[REL_COL] if df[REL_COL].nunique() > 1 else None,
)

print("Train:", len(X_train), "Val:", len(X_val))
print("Train positives:", y_train.sum(), "Val positives:", y_val.sum())
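
The effect of `stratify` can be checked on a toy label vector. This is a sketch under assumed numbers (10 positives in 100 rows, names `X_toy` and `y_toy` are hypothetical), using the same 30% validation size as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 positives out of 100 rows (illustrative only).
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Stratifying on the label keeps the 10% positive rate in both parts.
_, _, y_tr_toy, y_va_toy = train_test_split(
    X_toy,
    y_toy,
    test_size=0.30,
    random_state=0,
    stratify=y_toy,
)

print("train positives:", y_tr_toy.sum(), "of", len(y_tr_toy))
print("val positives:", y_va_toy.sum(), "of", len(y_va_toy))
```

Without `stratify`, the number of validation positives varies with the seed, which makes PR-AUC and recall estimates noisier when positives are rare.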


## 3) Helper metrics


In [None]:
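def recall_at_precision(y_true, y_score, target_precision=0.90):
    """Highest recall among thresholds whose precision is >= target_precision."""
    p, r, _ = precision_recall_curve(y_true, y_score)
    # Drop the final (precision=1, recall=0) point, which has no threshold.
    p, r = p[:-1], r[:-1]
    ok = np.where(p >= target_precision)[0]
    if len(ok) == 0:
        # The target precision is never reached.
        return 0.0
    return float(np.max(r[ok]))


def evaluate_binary(y_true, y_score, threshold=0.5):
    """Return PR-AUC, F1 at `threshold`, and recall at TARGET_PRECISION."""
    # PR-AUC and recall@precision depend only on the ranking of the scores.
    pr_auc = average_precision_score(y_true, y_score)
    # F1 depends on the chosen threshold.
    y_pred = (y_score >= threshold).astype(int)
    f1 = f1_score(y_true, y_pred, pos_label=1)
    r_at_p = recall_at_precision(y_true, y_score, TARGET_PRECISION)
    return {
        "pr_auc": pr_auc,
        f"f1@{threshold}": f1,
        f"recall@P>={TARGET_PRECISION}": r_at_p,
    }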
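
A small worked example makes the recall-at-precision logic concrete. The sketch below uses a hypothetical helper, `threshold_at_precision`, that also returns the threshold at which the recall is reached; it is an illustration on four toy points, not part of the evaluation above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_at_precision(y_true, y_score, target_precision=0.90):
    """Return (threshold, recall) at the best-recall point with precision >= target.

    Hypothetical helper for illustration. Returns (None, 0.0) when the
    target precision is never reached.
    """
    p, r, t = precision_recall_curve(y_true, y_score)
    # Drop the final (precision=1, recall=0) point so p and r align with t.
    p, r = p[:-1], r[:-1]
    ok = np.where(p >= target_precision)[0]
    if len(ok) == 0:
        return None, 0.0
    best = ok[np.argmax(r[ok])]
    return float(t[best]), float(r[best])


# Toy scores: one positive (0.35) is ranked below one negative (0.4).
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8])

thr_90, rec_90 = threshold_at_precision(y_toy, s_toy, 0.90)
thr_60, rec_60 = threshold_at_precision(y_toy, s_toy, 0.60)

print("P>=0.90:", thr_90, rec_90)  # only the 0.8 example is kept
print("P>=0.60:", thr_60, rec_60)  # both positives kept, one negative let in
```

At precision ≥ 0.90 only the top-scored positive survives (recall 0.5). Relaxing the target to 0.60 lowers the threshold to 0.35 and recovers both positives.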


## 4) Model 1 — TF-IDF + Logistic Regression

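Why this search:
- `ngram_range`: (1,2) vs (1,3) → richer phrase features vs a larger, sparser vocabulary
- `C`: inverse regularization strength (smaller = stronger regularization); usually the most influential parameter
- `class_weight="balanced"`: weights classes inversely to their frequency, which matters when positives are rare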
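
What `class_weight="balanced"` computes can be reproduced by hand: each class gets the weight `n_samples / (n_classes * class_count)`. A minimal sketch on toy labels (the 90/10 split and the name `y_toy` are assumed for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 90 negatives, 10 positives (illustrative only).
y_toy = np.array([0] * 90 + [1] * 10)

# Weights scikit-learn derives for class_weight="balanced".
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)

# The same formula written out: n_samples / (n_classes * count per class).
manual = len(y_toy) / (2 * np.bincount(y_toy))

print("sklearn:", weights)
print("manual: ", manual)
```

Here each positive counts nine times as much as each negative in the loss, so the two classes contribute equally in total.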


In [None]:
logreg_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        min_df=2,
        max_df=0.95,
        strip_accents=None,
    )),
    ("clf", LogisticRegression(
        max_iter=3000,
        class_weight="balanced",
        solver="liblinear",
    )),
])

# 2 x 4 = 8 combinations; the search below samples n_iter=6 of them.
logreg_params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    "clf__C": [0.1, 0.5, 1.0, 2.0],
}

logreg_search = RandomizedSearchCV(
    logreg_pipe,
    logreg_params,
    n_iter=6,
    scoring="average_precision",
    cv=3,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=1,
)

logreg_search.fit(X_train, y_train)
logreg_best = logreg_search.best_estimator_

logreg_scores = evaluate_binary(
    y_val,
    logreg_best.predict_proba(X_val)[:,1],
)

print("Best params:", logreg_search.best_params_)
print("Scores:", logreg_scores)


## 5) Model 2 — TF-IDF + LinearSVC (calibrated)

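Why:
- LinearSVC optimizes a hinge loss and can rank borderline cases differently from LogReg, which sometimes gives **better recall**
- It has no `predict_proba`, so it needs **calibration** to produce probabilities (required for the F1 at threshold 0.5)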
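
As a standalone illustration of the calibration step, the sketch below wraps a `LinearSVC` in `CalibratedClassifierCV` with sigmoid (Platt) scaling. The data comes from `make_classification` and is synthetic; the names `X_syn`, `y_syn`, and the sample sizes are assumptions, not the notebook's data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic, imbalanced toy problem (about 20% positives).
X_syn, y_syn = make_classification(
    n_samples=300,
    n_features=10,
    weights=[0.8, 0.2],
    random_state=0,
)

svc = LinearSVC(class_weight="balanced", max_iter=5000)

# LinearSVC exposes decision_function but not predict_proba.
has_proba_before = hasattr(svc, "predict_proba")

# Sigmoid calibration maps SVC margins to probabilities using 3-fold CV.
calibrated = CalibratedClassifierCV(svc, method="sigmoid", cv=3)
calibrated.fit(X_syn, y_syn)
proba = calibrated.predict_proba(X_syn)[:, 1]

print("predict_proba before calibration:", has_proba_before)
print("probability range:", proba.min(), proba.max())
```

Ranking metrics such as PR-AUC could also be computed from `decision_function` directly; calibration matters mainly when a fixed probability threshold such as 0.5 is applied.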


In [None]:
svc_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        min_df=2,
        max_df=0.95,
        strip_accents=None,
    )),
    ("clf", LinearSVC(
        class_weight="balanced",
        max_iter=5000,
    )),
])

# Same 8-combination search space as for LogReg; n_iter=6 samples 6 of them.
svc_params = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],
    "clf__C": [0.1, 0.5, 1.0, 2.0],
}

svc_search = RandomizedSearchCV(
    svc_pipe,
    svc_params,
    n_iter=6,
    scoring="average_precision",
    cv=3,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=1,
)

svc_search.fit(X_train, y_train)
svc_best = svc_search.best_estimator_

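# Calibrate to get probabilities. CalibratedClassifierCV clones the tuned
# pipeline and refits it inside 3-fold CV on the training data, so the
# sigmoid is fitted on held-out folds rather than on data the SVC has seen.
svc_cal = CalibratedClassifierCV(svc_best, method="sigmoid", cv=3)
svc_cal.fit(X_train, y_train)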

svc_scores = evaluate_binary(
    y_val,
    svc_cal.predict_proba(X_val)[:,1],
)

print("Best params:", svc_search.best_params_)
print("Scores:", svc_scores)


## 6) Model 3 — TF-IDF + LightGBM

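Why:
- Trees can model non-linear interactions between n-grams
- This can improve recall on cases a linear model misses

Notes:
- We **cap the vocabulary** (`max_features`) and use stricter `min_df` / `max_df` to reduce overfitting
- `scale_pos_weight` (negatives / positives) handles imbalance
- Hyperparameters are fixed here, not searched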
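
The weight passed as `scale_pos_weight` is simply the ratio of negatives to positives in the training labels. A short worked example with assumed counts (950 negatives, 50 positives; `neg_pos_ratio` is a hypothetical helper name):

```python
import numpy as np


def neg_pos_ratio(y):
    """Negatives divided by positives, guarding against zero positives."""
    y = np.asarray(y)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return n_neg / max(1, n_pos)


# Toy labels: 50 positives and 950 negatives (illustrative only).
y_toy = np.array([1] * 50 + [0] * 950)
ratio = neg_pos_ratio(y_toy)

print("scale_pos_weight for the toy labels:", ratio)
```

With this weight the positive class contributes about as much to the loss as the negative class. Very large ratios can distort the predicted probabilities, so a smaller value is sometimes tried as well.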


In [None]:
# Vectorize separately for LightGBM (sparse matrix)
tfidf_lgbm = TfidfVectorizer(
    min_df=3,
    max_df=0.9,
    ngram_range=(1,2),
    max_features=200_000,
)

Xtr_lgbm = tfidf_lgbm.fit_transform(X_train)
Xva_lgbm = tfidf_lgbm.transform(X_val)

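# Ratio of negatives to positives; max(1, ...) avoids division by zero.
pos_weight = (len(y_train) - y_train.sum()) / max(1, y_train.sum())

lgbm = lgb.LGBMClassifier(
    objective="binary",
    n_estimators=400,
    learning_rate=0.05,
    num_leaves=63,
    subsample=0.8,
    subsample_freq=1,  # row subsampling is only active when this is > 0
    colsample_bytree=0.8,
    scale_pos_weight=pos_weight,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)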

lgbm.fit(Xtr_lgbm, y_train)

lgbm_scores = evaluate_binary(
    y_val,
    lgbm.predict_proba(Xva_lgbm)[:,1],
)

print("Scores:", lgbm_scores)


## 7) Final comparison


In [None]:
results = pd.DataFrame.from_dict({
    "TF-IDF + LogReg": logreg_scores,
    "TF-IDF + LinearSVC": svc_scores,
    "TF-IDF + LightGBM": lgbm_scores,
}, orient="index")

results


## How to interpret results

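- **PR-AUC** → main metric (higher is better); compare it against the positive rate, which is what a no-signal scorer achieves
- **Recall@P≥0.90** → the share of relevant sentences you keep while at least 90% of the kept sentences are relevant
- If two models are close:
  - prefer the **simpler one (LogReg / SVC)**
  - choose LightGBM only if it gives a **clear** win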
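
One way to judge whether a win is "clear" is a bootstrap interval for the PR-AUC difference on the validation set. The sketch below is an assumption about how this could be done, not something the notebook runs; `bootstrap_pr_auc_diff` and the synthetic scores are hypothetical:

```python
import numpy as np
from sklearn.metrics import average_precision_score


def bootstrap_pr_auc_diff(y_true, score_a, score_b, n_boot=1000, seed=42):
    """Percentile 95% interval for PR-AUC(a) - PR-AUC(b) over resampled rows."""
    y_true = np.asarray(y_true)
    score_a = np.asarray(score_a)
    score_b = np.asarray(score_b)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        if y_true[idx].sum() == 0:
            continue  # PR-AUC is undefined without positives
        diffs.append(
            average_precision_score(y_true[idx], score_a[idx])
            - average_precision_score(y_true[idx], score_b[idx])
        )
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(lo), float(hi)


# Synthetic check: a perfect ranker against a fully inverted one.
y_demo = np.array([1] * 20 + [0] * 180)
perfect = y_demo.astype(float)
inverted = 1.0 - perfect

same_lo, same_hi = bootstrap_pr_auc_diff(y_demo, perfect, perfect, n_boot=200)
diff_lo, diff_hi = bootstrap_pr_auc_diff(y_demo, perfect, inverted, n_boot=200)

print("identical scores:", (same_lo, same_hi))
print("perfect vs inverted:", (diff_lo, diff_hi))
```

In this notebook the inputs would be `y_val` and two of the validation score vectors, for example the LightGBM and LogReg probabilities. An interval that excludes zero suggests the difference is not just split noise.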

What to expect on data like this (a rule of thumb, not a result from this notebook):
- LogReg: a strong baseline
- LinearSVC: often slightly better recall
- LightGBM: can reach the best recall, but is the most prone to overfitting
