# Naive Bayes and Linear Models for Text
## Objective

Train and evaluate **interpretable classical machine learning models** for text classification using sparse NLP features.

This notebook establishes **performance baselines** before moving to embeddings or transformers.

## Why These Models Matter

Despite modern NLP advances, Naive Bayes and linear models remain:

- Fast

- Robust on small datasets

- Highly interpretable

> They often match or outperform more complex models when labeled data is scarce.

They are also ideal for:

- Model debugging

- Feature validation

- Explainability requirements

## Imports and Setup

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)


# Example Dataset

A small **binary text classification** example.

In [5]:
data = {
    "text": [
        "this model works well",
        "terrible results and poor model",
        "excellent performance and stability",
        "bad predictions and weak accuracy",
        "robust and interpretable model",
        "awful behavior and unreliable output"
    ],
    "label": [1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)
df


| | text | label |
|---|---|---|
| 0 | this model works well | 1 |
| 1 | terrible results and poor model | 0 |
| 2 | excellent performance and stability | 1 |
| 3 | bad predictions and weak accuracy | 0 |
| 4 | robust and interpretable model | 1 |
| 5 | awful behavior and unreliable output | 0 |


# Train / Test Split (Leakage-Safe)

In [31]:
X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.3,
    random_state=2010,
    stratify=df["label"]
)
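
As a quick sanity check on the stratified split, the sketch below rebuilds the same six-document dataset and counts how many documents of each class land in each partition. The names `toy_df` and `split_summary` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same six-document toy dataset as above, repeated so the sketch runs on its own.
toy_df = pd.DataFrame({
    "text": [
        "this model works well",
        "terrible results and poor model",
        "excellent performance and stability",
        "bad predictions and weak accuracy",
        "robust and interpretable model",
        "awful behavior and unreliable output",
    ],
    "label": [1, 0, 1, 0, 1, 0],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["text"],
    toy_df["label"],
    test_size=0.3,
    random_state=2010,
    stratify=toy_df["label"],
)

# Count documents per class in each partition (rows are class labels).
split_summary = pd.DataFrame({
    "train": y_tr.value_counts().sort_index(),
    "test": y_te.value_counts().sort_index(),
})
print(split_summary)
```

With six documents and `test_size=0.3`, the test partition holds two documents, one per class, so any metric computed on it moves in steps of 0.5.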

# Multinomial Naive Bayes
## Why Naive Bayes?

- Designed for count-based features (TF-IDF weights, as used below, also work well in practice)

- Handles high-dimensional sparse data well

- Extremely fast
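
To make the "count-based features" point concrete, here is a minimal sketch that pairs `MultinomialNB` with raw token counts from `CountVectorizer` instead of TF-IDF. The pipeline name `count_nb` and the document in `new_docs` are hypothetical and only serve as an illustration; the sketch fits on all six toy documents, so it is not an evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy dataset repeated so the sketch is self-contained.
texts = [
    "this model works well",
    "terrible results and poor model",
    "excellent performance and stability",
    "bad predictions and weak accuracy",
    "robust and interpretable model",
    "awful behavior and unreliable output",
]
labels = [1, 0, 1, 0, 1, 0]

# Raw token counts feed the multinomial likelihood directly.
# alpha=1.0 is Laplace smoothing, which keeps unseen tokens from
# producing zero probabilities.
count_nb = Pipeline([
    ("counts", CountVectorizer()),
    ("model", MultinomialNB(alpha=1.0)),
])
count_nb.fit(texts, labels)

# Hypothetical unseen document, used only to show the prediction API.
new_docs = ["excellent and robust results"]
pred = count_nb.predict(new_docs)
proba = count_nb.predict_proba(new_docs)
print(pred, proba)
```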

## Pipeline Definition

In [11]:
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", MultinomialNB())
])


## Train and Evaluate

In [14]:
nb_pipeline.fit(X_train, y_train)

y_pred_nb = nb_pipeline.predict(X_test)

print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
# zero_division=0 reports undefined precision as 0 instead of emitting a warning
print(classification_report(y_test, y_pred_nb, zero_division=0))


Naive Bayes Accuracy: 0.5
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



Both test documents were predicted as class 1, so precision for class 0 is undefined and is reported as 0.00. With only two test samples, these scores illustrate the workflow rather than model quality.



# Logistic Regression
## Why Logistic Regression?

- Linear decision boundary

- Probability outputs that are often reasonably well calibrated

- Strong interpretability
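
The link between the linear decision boundary and the predicted probabilities can be sketched as follows: for a binary problem with default settings, the class-1 probability is the sigmoid of the linear score returned by `decision_function`. The names `lr_sketch`, `linear_score`, and `sigmoid_of_score` are illustrative, and the model is fit on all six toy documents purely for demonstration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy dataset repeated so the sketch is self-contained.
texts = [
    "this model works well",
    "terrible results and poor model",
    "excellent performance and stability",
    "bad predictions and weak accuracy",
    "robust and interpretable model",
    "awful behavior and unreliable output",
]
labels = [1, 0, 1, 0, 1, 0]

lr_sketch = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])
lr_sketch.fit(texts, labels)

# Linear score w.x + b for each document; its sign decides the class.
linear_score = lr_sketch.decision_function(texts)

# Passing the score through the sigmoid gives P(label == 1).
sigmoid_of_score = 1.0 / (1.0 + np.exp(-linear_score))
probs = lr_sketch.predict_proba(texts)

for text, p in zip(texts, probs[:, 1]):
    print(f"{p:.3f}  {text}")
```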

## Pipeline Definition

In [19]:
logreg_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(
        max_iter=1000,
        solver="liblinear"
    ))
])


## Train and Evaluate

In [22]:
logreg_pipeline.fit(X_train, y_train)

y_pred_lr = logreg_pipeline.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
# zero_division=0 reports undefined precision as 0 instead of emitting a warning
print(classification_report(y_test, y_pred_lr, zero_division=0))


Logistic Regression Accuracy: 0.5
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



As with Naive Bayes, both test documents were predicted as class 1, so precision for class 0 is undefined and is reported as 0.00.


# Model Comparison

In [25]:
results = pd.DataFrame({
    "model": ["Naive Bayes", "Logistic Regression"],
    "accuracy": [
        accuracy_score(y_test, y_pred_nb),
        accuracy_score(y_test, y_pred_lr)
    ]
})

results


| | model | accuracy |
|---|---|---|
| 0 | Naive Bayes | 0.5 |
| 1 | Logistic Regression | 0.5 |
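
A single two-document test set gives a very coarse comparison. One possible complement, sketched here under the assumption that stratified 3-fold cross-validation is acceptable for such a tiny dataset, is to score each pipeline across several folds. The names `candidates` and `cv_scores` are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy dataset repeated so the sketch is self-contained.
texts = [
    "this model works well",
    "terrible results and poor model",
    "excellent performance and stability",
    "bad predictions and weak accuracy",
    "robust and interpretable model",
    "awful behavior and unreliable output",
]
labels = [1, 0, 1, 0, 1, 0]

candidates = {
    "Naive Bayes": Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("model", MultinomialNB()),
    ]),
    "Logistic Regression": Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
}

# Three folds, each holding out one document per class.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# The vectorizer is refit inside every fold because it lives in the pipeline.
cv_scores = {
    name: cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
    for name, pipe in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.2f}, folds={scores}")
```

Even with cross-validation, six documents are far too few for a reliable ranking; the sketch shows the mechanics only.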


# Feature Interpretability (Linear Models)
## Inspect Most Influential Tokens

In [28]:
vectorizer = logreg_pipeline.named_steps["tfidf"]
model = logreg_pipeline.named_steps["model"]

feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

top_positive = np.argsort(coefficients)[-5:]
top_negative = np.argsort(coefficients)[:5]

print("Positive class indicators:")
for idx in reversed(top_positive):
    print(feature_names[idx], coefficients[idx])

print("\nNegative class indicators:")
for idx in top_negative:
    print(feature_names[idx], coefficients[idx])


Positive class indicators:
robust 0.24839105996178054
interpretable 0.24839105996178054
performance 0.22454698343474958
excellent 0.22454698343474958
stability 0.22454698343474958

Negative class indicators:
poor -0.21454762968951593
results -0.21454762968951593
terrible -0.21454762968951593
accuracy -0.1970505588661281
predictions -0.1970505588661281


# Common Pitfalls

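- ❌ Fitting vectorizers outside pipelines, which lets test-set vocabulary and IDF statistics leak into training
- ❌ Using GaussianNB with sparse text features (it expects dense, continuous inputs)
- ❌ Ignoring class imbalance
- ❌ Over-tuning before baseline validation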
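
The first pitfall can be illustrated with a small sketch. It compares a vectorizer fit on every document against one fit on the training split only, and lists the tokens that only the first one has seen. The names `leaky_vectorizer`, `safe_vectorizer`, and `leaked_tokens` are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy dataset repeated so the sketch is self-contained.
texts = [
    "this model works well",
    "terrible results and poor model",
    "excellent performance and stability",
    "bad predictions and weak accuracy",
    "robust and interpretable model",
    "awful behavior and unreliable output",
]
labels = [1, 0, 1, 0, 1, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=2010, stratify=labels
)

# Pitfall: fitting on all documents exposes test-set tokens to the vectorizer.
leaky_vectorizer = TfidfVectorizer().fit(texts)

# Safer: fit on the training split only, which a Pipeline does automatically.
safe_vectorizer = TfidfVectorizer().fit(X_tr)

# Tokens known to the leaky vectorizer but absent from the training split.
leaked_tokens = sorted(
    set(leaky_vectorizer.vocabulary_) - set(safe_vectorizer.vocabulary_)
)
print("Vocabulary size (all data):  ", len(leaky_vectorizer.vocabulary_))
print("Vocabulary size (train only):", len(safe_vectorizer.vocabulary_))
print("Tokens seen only via the test set:", leaked_tokens)
```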

# When These Models Shine

Use Naive Bayes or linear models when:

- Dataset is small or medium

- Interpretability is required

- Latency constraints exist

- You need strong baselines

# Key Takeaways

- Always start with classical baselines

- Pipelines guard against leakage by fitting the vectorizer on training data only

- Linear models provide transparent feature effects

- Poor baseline performance often signals preprocessing issues