# NLP Text Classification Template – TF-IDF → Models

This notebook is a template for **text classification** tasks, such as:

- Sentiment classification (positive / neutral / negative)  
- Topic labels (sports / finance / tech)  
- Toxic vs non-toxic comments  

It uses classic, robust components:

- Text cleaning (light)  
- **TF-IDF** features  
- Linear models (LogisticRegression) + RF baseline  

You can later swap TF-IDF for transformer embeddings, but TF-IDF with a linear model is a strong baseline to beat first.


In [None]:
# ========== 1. Imports & Config (NLP Text Classification) ==========

from pathlib import Path

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an older alias
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100

# ---- Config ----
DATA_DIR = Path("../input")
TRAIN_FILE = "train_text.csv"

TEXT_COL = "text"
TARGET_COL = "label"

RANDOM_STATE = 42


In [None]:
# ========== 2. Load Data ==========

def load_data(data_dir: Path = DATA_DIR, train_file: str = TRAIN_FILE) -> pd.DataFrame:
    path = data_dir / train_file
    if not path.exists():
        raise FileNotFoundError(f"Train file not found: {path}")
    df = pd.read_csv(path)
    print("Data shape:", df.shape)
    display(df.head())
    return df


df = load_data()


### 3️⃣ Label Distribution & Text Lengths

We first look at:

- Class balance  
- Text length distribution (short vs long texts)  
- Potential data issues (empty text, etc.)


In [None]:
print("Label distribution:")
display(df[TARGET_COL].value_counts(dropna=False))
display(df[TARGET_COL].value_counts(normalize=True))

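# Fill missing text with "" first; astype(str) alone would turn NaN into the string "nan".
df["text_len"] = df[TEXT_COL].fillna("").astype(str).str.len()
print("Empty or missing texts:", int((df["text_len"] == 0).sum()))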
sns.histplot(df["text_len"], bins=50)
plt.title("Text length distribution")
plt.xlabel("Number of characters")
plt.show()
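
A slightly fuller check for the data issues mentioned above could look like the sketch below. The `toy` frame and the `text_quality_report` helper are made up for illustration and are not part of the template; in the notebook you would pass `df` and `TEXT_COL` instead.

```python
import pandas as pd


def text_quality_report(frame: pd.DataFrame, text_col: str) -> dict:
    """Count common problems in a text column (hypothetical helper)."""
    raw = frame[text_col]
    # Treat missing values as empty strings so they do not become "nan".
    cleaned = raw.fillna("").astype(str).str.strip()
    n_missing = int(raw.isna().sum())
    return {
        "missing": n_missing,
        # Present but blank after stripping whitespace.
        "empty_or_whitespace": int((cleaned == "").sum()) - n_missing,
        # Exact repeats among the non-empty texts.
        "duplicates": int(cleaned[cleaned != ""].duplicated().sum()),
    }


# Toy data for illustration only.
toy = pd.DataFrame({"text": ["good movie", None, "   ", "good movie", "bad plot"]})
report = text_quality_report(toy, "text")
print(report)
```

Whether to drop, keep, or deduplicate such rows depends on the task; the report only makes the counts visible before you decide.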


### 4️⃣ Train/Validation Split

We use a stratified 80/20 split so that each class appears in roughly the same proportion in the training and validation sets.  
In Kaggle-style setups that ship a separate test set, this split serves only as local validation.
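
As a quick illustration of what `stratify` does, the sketch below splits a made-up 80/20 imbalanced label set and compares class shares in both parts. The `texts` and `labels` series are toy data, not the notebook's `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 "neg" and 20 "pos" (illustrative only).
labels = pd.Series(["neg"] * 80 + ["pos"] * 20)
texts = pd.Series([f"doc {i}" for i in range(100)])

_, _, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Class shares should match the overall 80/20 ratio in both parts.
train_share = y_tr.value_counts(normalize=True)
valid_share = y_va.value_counts(normalize=True)
print(train_share)
print(valid_share)
```

Without `stratify`, a rare class can end up over- or under-represented in the validation set purely by chance, which makes the metrics noisier.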


In [None]:
# Replace missing text with "" so NaN is not vectorized as the token "nan".
X = df[TEXT_COL].fillna("").astype(str)
y = df[TARGET_COL]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print("Train size:", X_train.shape[0])
print("Valid size:", X_valid.shape[0])


### 5️⃣ Text Vectorization with TF-IDF

We convert raw text into numeric features using **TF-IDF**.

Guidelines:

- For short texts: `ngram_range=(1, 2)` is usually good.  
- For long texts: limit `max_features` and tune `min_df`.  
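
To get a feel for how these settings change the vocabulary, here is a small sketch on a three-sentence toy corpus (the corpus is invented for illustration). It fits three vectorizers and prints the vocabulary size of each.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only.
corpus = [
    "the match was great",
    "the match was boring",
    "great stock rally",
]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(corpus)
bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(corpus)
# min_df=2 keeps only terms that occur in at least two documents.
pruned = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(corpus)

for name, vec in [("unigrams", unigrams), ("uni+bigrams", bigrams), ("min_df=2", pruned)]:
    print(name, len(vec.vocabulary_), sorted(vec.vocabulary_))
```

On this corpus, adding bigrams grows the vocabulary from 7 to 13 terms, and `min_df=2` prunes it back to 6. On real data the same trade-off applies at a much larger scale, which is why `max_features` and `min_df` matter for long texts.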


In [None]:
tfidf = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1, 2),
    min_df=2,
)

X_train_vec = tfidf.fit_transform(X_train)
X_valid_vec = tfidf.transform(X_valid)

print("Vectorized shapes:", X_train_vec.shape, X_valid_vec.shape)


### 6️⃣ Baseline Models: Logistic Regression & Random Forest

We evaluate:

- **LogisticRegression** (very strong baseline for text)  
- **RandomForest** (tree-based baseline; usually weaker and slower on high-dimensional sparse TF-IDF features, but included for completeness)

Metrics:

- Accuracy  
- F1 (weighted)  
- Confusion matrix  
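
One caveat with these metrics: on imbalanced labels, accuracy and weighted F1 can both look fine while a minority class is ignored entirely. The toy example below (invented labels, a "model" that always predicts the majority class) compares them with macro F1, which is not used in the template but is a common complement.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced case: 8 "neg", 2 "pos"; every prediction is "neg".
y_true = ["neg"] * 8 + ["pos"] * 2
y_pred = ["neg"] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning for the class that is never predicted.
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.3f}  f1_weighted={f1_weighted:.3f}  f1_macro={f1_macro:.3f}")
```

Here accuracy is 0.800 and weighted F1 is about 0.711, while macro F1 drops to about 0.444 because the "pos" class scores zero. The per-class rows of `classification_report` show the same problem.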


In [None]:
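def evaluate_model(clf, name: str):
    """Fit clf on the TF-IDF training matrix and report validation metrics.

    Relies on the notebook-level X_train_vec, X_valid_vec, y_train and y_valid.
    """
    clf.fit(X_train_vec, y_train)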
    y_pred = clf.predict(X_valid_vec)
    acc = accuracy_score(y_valid, y_pred)
    f1 = f1_score(y_valid, y_pred, average="weighted")
    print(f"\n=== {name} ===")
    print(f"Accuracy: {acc:.4f} | F1 (weighted): {f1:.4f}")
    print(classification_report(y_valid, y_pred, digits=4))
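    # Pass the class labels so the plot axes show label names instead of 0..n-1.
    cm = confusion_matrix(y_valid, y_pred, labels=clf.classes_)
    ConfusionMatrixDisplay(cm, display_labels=clf.classes_).plot()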
    plt.title(f"Confusion Matrix – {name}")
    plt.show()


# n_jobs is omitted: it generally has no effect with the default lbfgs solver.
logreg = LogisticRegression(max_iter=2000)
evaluate_model(logreg, "Logistic Regression")

rf = RandomForestClassifier(
    n_estimators=300, max_depth=None, n_jobs=-1, random_state=RANDOM_STATE
)
evaluate_model(rf, "Random Forest")


### 7️⃣ Next Steps

- Tune TF-IDF hyperparameters (`min_df`, `max_features`, `ngram_range`).  
- Try other linear models (LinearSVC, SGDClassifier).  
- Add class weights (e.g. `class_weight="balanced"`) if labels are imbalanced.  
- Switch to transformer embeddings if the TF-IDF baseline plateaus.
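
As a rough sketch of two of these steps combined, the snippet below wraps TF-IDF and a `LinearSVC` with balanced class weights in a scikit-learn pipeline. The training texts and labels are toy data, and the hyperparameters are placeholders rather than tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, slightly imbalanced training data (illustrative only).
train_texts = [
    "great game last night",
    "the team won the match",
    "a thrilling final match",
    "stocks fell sharply today",
    "the market rallied",
]
train_labels = ["sports", "sports", "sports", "finance", "finance"]

# The pipeline fits the vectorizer and the classifier together, so the
# vocabulary is always learned from the training texts only.
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(class_weight="balanced", random_state=42),
)
pipe.fit(train_texts, train_labels)

preds = pipe.predict(["the match was great", "stocks rallied today"])
print(list(preds))
```

A pipeline also makes hyperparameter search simpler, since TF-IDF and model settings can be tuned in one object with cross-validation.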

You can save the fitted TF-IDF vectorizer and model with joblib/pickle.
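
A minimal sketch of that with joblib, using a tiny toy model and a temporary directory. The file name `tfidf_logreg.joblib` and the dictionary layout are arbitrary choices for this example; in the notebook you would dump `tfidf` and the fitted classifier instead.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data and model for illustration only.
texts = ["good film", "great film", "bad film", "awful film"]
labels = ["pos", "pos", "neg", "neg"]

vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tfidf_logreg.joblib"
    # Save the vectorizer and model together so they cannot get out of sync.
    joblib.dump({"vectorizer": vec, "model": clf}, path)
    bundle = joblib.load(path)

new_texts = ["great film"]
original = clf.predict(vec.transform(new_texts))
restored = bundle["model"].predict(bundle["vectorizer"].transform(new_texts))
print(list(original), list(restored))
```

Only load joblib or pickle files from sources you trust, and load them with the same scikit-learn version that wrote them where possible.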
