## 1. Load cleaned data and prepare `X_train`, `X_val`, `y_train`, `y_val`

**Purpose:**  
Read the cleaned train/validation splits, validate columns, and extract features (`clean_text`) and labels for model input.


In [3]:
# Minimal, clear setup for Step 1

import pandas as pd
from pathlib import Path

# ---- Config ----
DATA_DIR = Path("data")
TRAIN_PATH = DATA_DIR / "clean_train.csv"
VAL_PATH   = DATA_DIR / "clean_val.csv"

LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
REQ_COLS = ["clean_text"] + LABEL_COLS

# ---- Load ----
train_df = pd.read_csv(TRAIN_PATH)
val_df   = pd.read_csv(VAL_PATH)

# ---- Validate required columns ----
missing_train = [c for c in REQ_COLS if c not in train_df.columns]
missing_val   = [c for c in REQ_COLS if c not in val_df.columns]
if missing_train or missing_val:
    raise ValueError(
        f"Missing columns -> train:{missing_train}  val:{missing_val}. "
        "Ensure files have 'clean_text' and six label columns."
    )

# ---- Clean types/sanity ----
for df in (train_df, val_df):
    # Text column: fill NaN with "", cast to string, strip surrounding whitespace
    df["clean_text"] = df["clean_text"].fillna("").astype(str).str.strip()
    # Labels: fill NaN with 0 and cast to int (values are assumed to already be 0/1)
    df[LABEL_COLS] = df[LABEL_COLS].fillna(0).astype(int)

# ---- Prepare features (X) and targets (y) ----
X_train = train_df["clean_text"].values
X_val   = val_df["clean_text"].values
y_train = train_df[LABEL_COLS].values
y_val   = val_df[LABEL_COLS].values

# ---- Quick visibility ----
print("X_train shape:", X_train.shape, "| y_train shape:", y_train.shape)
print("X_val   shape:", X_val.shape,   "| y_val   shape:", y_val.shape)

print("\nTrain label prevalence (%)")
print((train_df[LABEL_COLS].mean() * 100).round(2).to_string())

print("\nVal label prevalence (%)")
print((val_df[LABEL_COLS].mean() * 100).round(2).to_string())


X_train shape: (127397,) | y_train shape: (127397, 6)
X_val   shape: (31850,) | y_val   shape: (31850, 6)

Train label prevalence (%)
toxic            9.57
severe_toxic     0.98
obscene          5.27
threat           0.30
insult           4.90
identity_hate    0.89

Val label prevalence (%)
toxic            9.59
severe_toxic     1.06
obscene          5.34
threat           0.29
insult           5.06
identity_hate    0.84
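
The prevalence figures above are easier to reason about as rough row counts. The sketch below is a back-of-the-envelope calculation, not part of the original notebook. It assumes the validation size and the rounded percentages printed above, so the counts are approximate, and the helper name `approx_positive_count` is hypothetical.

```python
# Approximate number of positive validation rows per label,
# reconstructed from the rounded prevalence percentages printed above.
N_VAL = 31_850  # validation rows, from the printed X_val shape

VAL_PREVALENCE_PCT = {
    "toxic": 9.59,
    "severe_toxic": 1.06,
    "obscene": 5.34,
    "threat": 0.29,
    "insult": 5.06,
    "identity_hate": 0.84,
}


def approx_positive_count(prevalence_pct: float, n_rows: int) -> int:
    """Convert a percentage prevalence into an approximate row count."""
    return round(prevalence_pct / 100 * n_rows)


approx_counts = {
    label: approx_positive_count(pct, N_VAL)
    for label, pct in VAL_PREVALENCE_PCT.items()
}

for label, count in approx_counts.items():
    print(f"{label:<14} ~{count} positives")
```

With only on the order of a hundred positive `threat` rows in validation, thresholded metrics for the rarest labels can shift noticeably when a handful of predictions change.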


## 2. TF-IDF + Logistic Regression Baseline

**Purpose:**  
Train a linear baseline: TF-IDF features fed to a one-vs-rest Logistic Regression with balanced class weights. Evaluation with micro/macro F1 and ROC-AUC on the validation set follows in Section 3.
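
For context on the class balancing: scikit-learn's `class_weight="balanced"` weights each class by `n_samples / (n_classes * class_count)`. For a binary label with positive prevalence `p`, that reduces to `1 / (2p)` for positives and `1 / (2(1 - p))` for negatives. The sketch below works this out for two of the rounded training prevalences from Section 1. The function name `balanced_weights` is hypothetical and the results are approximate because the inputs are rounded.

```python
def balanced_weights(prevalence: float) -> tuple[float, float]:
    """Return (negative_weight, positive_weight) for a binary label.

    Mirrors the 'balanced' heuristic n_samples / (n_classes * class_count)
    with n_classes = 2, expressed in terms of the positive prevalence.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return 1.0 / (2.0 * (1.0 - prevalence)), 1.0 / (2.0 * prevalence)


# Rounded training prevalences (in percent) taken from the Section 1 output.
for label, pct in [("toxic", 9.57), ("threat", 0.30)]:
    w_neg, w_pos = balanced_weights(pct / 100)
    print(f"{label:<7} negative weight {w_neg:.2f} | positive weight {w_pos:.2f}")
```

Positive `threat` rows end up weighted far more heavily than negatives, which tends to favor recall over precision at a fixed 0.5 threshold.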


In [4]:
# Minimal baseline: TF-IDF + One-vs-Rest Logistic Regression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Reuse LABEL_COLS, X_train, X_val, y_train, y_val from Step 1

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,2),
        min_df=2,
        max_features=200_000,
        sublinear_tf=True
    )),
    ("clf", OneVsRestClassifier(
        LogisticRegression(
            max_iter=1000,
            class_weight="balanced",
            solver="liblinear"  # stable for small/medium sparse problems
        ),
        n_jobs=1
    ))
])

# Fit
baseline.fit(X_train, y_train)


Pipeline
  steps: [('tfidf', ...), ('clf', ...)]
  transform_input: None
  memory: None
  verbose: False

TfidfVectorizer
  input: 'content'
  encoding: 'utf-8'
  decode_error: 'strict'
  strip_accents: None
  lowercase: True
  preprocessor: None
  tokenizer: None
  analyzer: 'word'
  stop_words: None
  token_pattern: '(?u)\\b\\w\\w+\\b'

OneVsRestClassifier
  estimator: LogisticRegre...r='liblinear')
  n_jobs: 1
  verbose: 0

LogisticRegression
  penalty: 'l2'
  dual: False
  tol: 0.0001
  C: 1.0
  fit_intercept: True
  intercept_scaling: 1
  class_weight: 'balanced'
  random_state: None
  solver: 'liblinear'
  max_iter: 1000
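
Two of the TF-IDF settings are easier to see with numbers. With `sublinear_tf=True`, scikit-learn replaces a raw term count `tf` with `1 + ln(tf)`. With the default `smooth_idf=True`, the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The sketch below is plain Python with made-up counts, not the vectorizer's internals; it skips the final L2 row normalization, and the function names are hypothetical.

```python
import math


def sublinear_tf(raw_count: int) -> float:
    """Dampened term frequency: 1 + ln(tf) for tf > 0, else 0."""
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0


def smooth_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


# Hypothetical counts: a term repeated 10 times in one comment,
# appearing in 50 of 100,000 documents.
tf_weight = sublinear_tf(10)
idf_weight = smooth_idf(100_000, 50)
print(f"tf weight {tf_weight:.3f}, idf weight {idf_weight:.3f}, "
      f"product {tf_weight * idf_weight:.3f}")
```

The dampening means a word repeated ten times contributes a few times the weight of a single occurrence rather than ten times, which can help with comments that repeat one term many times.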


## 3. Baseline Summary
**Purpose:**  
Evaluate the fitted baseline on the validation set and print the results in a compact, comparable format.

**Notes:**  
- Reports micro/macro F1 at a 0.5 decision threshold, micro/macro ROC-AUC, and per-label ROC-AUC.
- Serves as the final reference point before introducing transformer-based models.


In [5]:
from sklearn.metrics import f1_score, roc_auc_score

# Predict
proba_val = baseline.predict_proba(X_val)
y_pred = (proba_val >= 0.5).astype(int)

# Compute metrics
micro_f1 = f1_score(y_val, y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(y_val, y_pred, average="macro", zero_division=0)
micro_auc = roc_auc_score(y_val, proba_val, average="micro")
macro_auc = roc_auc_score(y_val, proba_val, average="macro")

print("\n" + "="*55)
print(" BASELINE MODEL: TF-IDF + Logistic Regression (OvR) ")
print("="*55)

print("\nOverall Metrics:")
print(f"  • Micro F1   : {micro_f1:.4f}")
print(f"  • Macro F1   : {macro_f1:.4f}")
print(f"  • Micro AUC  : {micro_auc:.4f}")
print(f"  • Macro AUC  : {macro_auc:.4f}")

print("\nPer-label AUCs:")
for lbl, auc in zip(LABEL_COLS, roc_auc_score(y_val, proba_val, average=None)):
    print(f"  - {lbl:<14} : {auc:.4f}")

print("\n" + "-"*75)
print("Summary: Baseline establishes a strong linear reference model.")
print("Future models should aim to improve macro-F1 and handle rare labels better.")
print("-"*75)



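=======================================================
 BASELINE MODEL: TF-IDF + Logistic Regression (OvR) 
=======================================================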

Overall Metrics:
  • Micro F1   : 0.7061
  • Macro F1   : 0.5957
  • Micro AUC  : 0.9856
  • Macro AUC  : 0.9821

Per-label AUCs:
  - toxic          : 0.9742
  - severe_toxic   : 0.9886
  - obscene        : 0.9881
  - threat         : 0.9849
  - insult         : 0.9802
  - identity_hate  : 0.9769

---------------------------------------------------------------------------
Summary: Baseline establishes a strong linear reference model.
Future models should aim to improve macro-F1 and handle rare labels better.
---------------------------------------------------------------------------
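
The gap between micro F1 and macro F1 in the output above follows from how the two averages are built. The toy calculation below uses made-up confusion counts for one common and one rare label (not the notebook's data) to show the mechanism; `f1_from_counts` is a hypothetical helper.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical confusion counts (tp, fp, fn) for one common and one rare label.
counts = {
    "common_label": (80, 20, 20),
    "rare_label": (2, 6, 2),
}

# Macro: average the per-label F1 scores, so every label counts equally.
per_label_f1 = {lbl: f1_from_counts(*c) for lbl, c in counts.items()}
toy_macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

# Micro: pool the counts across labels first, so frequent labels dominate.
total_tp, total_fp, total_fn = (sum(col) for col in zip(*counts.values()))
toy_micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(f"per-label F1: {per_label_f1}")
print(f"macro F1 = {toy_macro_f1:.3f}, micro F1 = {toy_micro_f1:.3f}")
```

Because the rare label's weak F1 counts as much as the common label's in the macro average, macro F1 sits below micro F1 whenever the rare labels are the harder ones, which is consistent with the pattern in the baseline results.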


## 4. Persist Baseline for Inference

**Purpose:**  
Save the trained TF-IDF+LR pipeline together with its metadata (label order, decision threshold, model type) so the UI can load it and predict.


In [15]:
# Save artifacts: pipeline + metadata
from pathlib import Path
import json

import joblib

MODELS_DIR = Path("models")
MODELS_DIR.mkdir(parents=True, exist_ok=True)

PIPE_PATH = MODELS_DIR / "baseline_pipeline.joblib"
META_PATH = MODELS_DIR / "baseline_meta.json"

joblib.dump(baseline, PIPE_PATH)

meta = {
    "label_cols": LABEL_COLS,
    "threshold": 0.5,
    "model_type": "tfidf+logreg_ovr",
}
with open(META_PATH, "w", encoding="utf-8") as f:
    json.dump(meta, f, indent=2)

print("Saved:", PIPE_PATH, "and", META_PATH)


Saved: models\baseline_pipeline.joblib and models\baseline_meta.json
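
On the consuming side, loading could look roughly like the sketch below. It assumes the file names written by the save cell above and the `label_cols` / `threshold` keys stored in the metadata. The function names `load_baseline` and `labels_above_threshold` are hypothetical, and the probabilities in the illustration are made up rather than produced by the model.

```python
import json
from pathlib import Path


def load_baseline(models_dir="models"):
    """Load the saved pipeline and its metadata (file names as in the save cell)."""
    import joblib  # third-party; imported lazily so the helper below works without it

    models_dir = Path(models_dir)
    pipeline = joblib.load(models_dir / "baseline_pipeline.joblib")
    with open(models_dir / "baseline_meta.json", encoding="utf-8") as f:
        meta = json.load(f)
    return pipeline, meta


def labels_above_threshold(proba_row, label_cols, threshold):
    """Return the labels whose predicted probability meets the threshold."""
    return [lbl for lbl, p in zip(label_cols, proba_row) if p >= threshold]


# Illustration with made-up probabilities (no model is loaded here).
example_meta = {
    "label_cols": ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"],
    "threshold": 0.5,
}
example_proba = [0.91, 0.07, 0.64, 0.02, 0.48, 0.01]
flagged = labels_above_threshold(
    example_proba, example_meta["label_cols"], example_meta["threshold"]
)
print(flagged)
```

In the UI, the same helper would be applied to each row of `pipeline.predict_proba(texts)`, with the texts passed through the same cleaning that produced `clean_text`. Since joblib files are pickle-based, they should only be loaded from a trusted source.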
