# Naive Bayes Classifiers

## Overview

**Naive Bayes** is a family of probabilistic classifiers based on Bayes' theorem with the "naive" assumption of conditional independence between features.

### Core Concept

*"Predict the class with highest posterior probability"*

### Key Ideas

1. **Probabilistic**: Models the class prior and the class-conditional distribution of the features
2. **Naive Independence**: Assumes features are conditionally independent given class
3. **Fast**: Training and prediction are very efficient
4. **Simple**: Minimal hyperparameters

## Mathematical Foundation

### Bayes' Theorem

\[
P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)}
\]

where:
- \(P(y|x)\) = **Posterior**: Probability of class \(y\) given features \(x\)
- \(P(x|y)\) = **Likelihood**: Probability of features \(x\) given class \(y\)
- \(P(y)\) = **Prior**: Probability of class \(y\)
- \(P(x)\) = **Evidence**: Probability of features \(x\) (constant for all classes)

### Naive Bayes Classifier

**Naive assumption**: Features are conditionally independent given class

\[
P(x|y) = P(x_1, x_2, ..., x_d | y) = \prod_{i=1}^{d} P(x_i | y)
\]

**Classification**: Choose the class with the highest posterior probability. The evidence \(P(x)\) is the same for every class, so it drops out of the arg max:

\[
\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{d} P(x_i | y)
\]

or in log-space (numerical stability):

\[
\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{d} \log P(x_i | y) \right]
\]
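
The need for log-space can be seen in a small sketch. The likelihood values below are made up for illustration (500 features, each with \(P(x_i|y) = 0.01\)); the point is only that the direct product underflows while the sum of logs does not.

```python
import math

# Hypothetical per-feature likelihoods P(x_i | y) for one class:
# 500 features, each with probability 0.01 (illustrative values only).
likelihoods = [0.01] * 500
prior = 0.5

# Direct product: 0.01**500 = 1e-1000 is far below the smallest positive
# float (about 5e-324), so the result underflows to exactly 0.0.
direct_score = prior * math.prod(likelihoods)

# Log-space: the same quantity as a sum of logs stays within float range.
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)

print(direct_score)  # 0.0
print(log_score)     # a large negative but finite number
```

With the direct product every class would score 0.0 and the arg max would be meaningless; the log scores remain comparable.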

## Types of Naive Bayes

### 1. Gaussian Naive Bayes

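**Assumption**: Within each class \(y\), each feature \(x_i\) follows a Gaussian (normal) distribution with its own mean \(\mu_{y,i}\) and variance \(\sigma_{y,i}^2\)

\[
P(x_i | y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)
\]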

**Use for**: Continuous features
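
As a rough illustration of the formula above, here is a minimal from-scratch sketch in plain Python. The function names and the toy data are hypothetical, and the fixed `1e-9` variance floor is a simplification (scikit-learn's `var_smoothing` scales with the largest feature variance instead), so treat this as a sketch rather than a reimplementation of `GaussianNB`.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Maximum-likelihood variance plus a tiny floor to avoid division by zero
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        params[label] = (n / len(X), means, variances)
    return params

def log_posterior_scores(params, x):
    """Unnormalized log posterior: log P(y) + sum_i log N(x_i; mean, variance)."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        score = math.log(prior)
        for xi, mu, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(params, x):
    """Return the class with the highest log posterior score."""
    scores = log_posterior_scores(params, x)
    return max(scores, key=scores.get)

# Toy data (made up): two features, two well-separated classes
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X, y)
print(predict(params, [1.1, 2.0]))  # 0: the point sits next to the class-0 means
print(predict(params, [3.1, 4.0]))  # 1
```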

### 2. Multinomial Naive Bayes

**Assumption**: Features represent counts or frequencies

\[
P(x|y) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i p_{yi}^{x_i}
\]

**Use for**: Text classification (word counts), document categorization
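
The multinomial coefficient depends only on the counts \(x\), not on the class, so it cancels when classes are compared. A short derivation of the resulting decision rule, using the symbols above together with the class prior \(P(y)\):

\[
\hat{y} = \arg\max_{y} \left[ \log P(y) + \log \frac{(\sum_i x_i)!}{\prod_i x_i!} + \sum_i x_i \log p_{yi} \right] = \arg\max_{y} \left[ \log P(y) + \sum_i x_i \log p_{yi} \right]
\]

The class score is therefore linear in the count vector, which is one reason Multinomial NB is cheap to evaluate on sparse text features.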

### 3. Bernoulli Naive Bayes

**Assumption**: Features are binary (0/1)

\[
P(x_i | y) = P(i|y) \cdot x_i + (1 - P(i|y)) \cdot (1 - x_i)
\]

**Use for**: Binary features, presence/absence of features
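
One consequence of this formula is that an absent feature also contributes to the likelihood, through the \(1 - P(i|y)\) term. The sketch below illustrates this with made-up probabilities; `bernoulli_log_likelihood` is a hypothetical helper, not a library function.

```python
import math

def bernoulli_log_likelihood(x, p):
    """log P(x | y) for binary features x, where p[i] is P(feature i present | y)."""
    return sum(
        math.log(p_i) if x_i == 1 else math.log(1.0 - p_i)
        for x_i, p_i in zip(x, p)
    )

# Hypothetical probabilities that each of three words appears in a class-y document
p_word_given_y = [0.9, 0.5, 0.1]

all_expected = bernoulli_log_likelihood([1, 0, 0], p_word_given_y)
missing_common_word = bernoulli_log_likelihood([0, 0, 0], p_word_given_y)

# Dropping the word that is usually present (p = 0.9) lowers the likelihood,
# because its absence contributes log(1 - 0.9) instead of log(0.9).
print(all_expected > missing_common_word)  # True
```

Multinomial NB, by contrast, only accumulates terms for words that actually occur (those with \(x_i > 0\)).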

## Topics Covered

1. Bayes' theorem intuition
2. Gaussian Naive Bayes for continuous data
3. Multinomial Naive Bayes for text classification
4. Bernoulli Naive Bayes for binary features
5. Comparing all three variants
6. Laplace smoothing (alpha parameter)
7. Naive Bayes vs other classifiers
8. Strengths, limitations, and best practices

## Setup and Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

# Naive Bayes models
from sklearn.naive_bayes import (
    GaussianNB, MultinomialNB, BernoulliNB,
    ComplementNB, CategoricalNB
)

# Text processing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Other models for comparison
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Utilities
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV
)
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, log_loss
)
from sklearn.datasets import (
    load_iris, load_wine, load_breast_cancer,
    make_classification, fetch_20newsgroups
)

np.random.seed(42)
sns.set_style('whitegrid')
print("✓ Libraries imported successfully")

## 1. Bayes' Theorem Intuition

### 1.1 Simple Example: Medical Test

In [None]:
print("Bayes' Theorem: Medical Test Example")
print("="*70)
print("Scenario: Testing for a rare disease\n")

# Define probabilities
P_disease = 0.01  # 1% of population has disease (Prior)
P_positive_given_disease = 0.95  # 95% sensitivity (True Positive Rate)
P_positive_given_healthy = 0.05  # 5% false positive rate

# Calculate P(positive) using law of total probability
P_positive = (P_positive_given_disease * P_disease + 
              P_positive_given_healthy * (1 - P_disease))

# Apply Bayes' Theorem: P(disease | positive)
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print(f"Prior Probability:")
print(f"  P(disease) = {P_disease:.3f} = {P_disease*100:.1f}%\n")

print(f"Likelihood:")
print(f"  P(positive | disease) = {P_positive_given_disease:.3f} = {P_positive_given_disease*100:.0f}%")
print(f"  P(positive | healthy) = {P_positive_given_healthy:.3f} = {P_positive_given_healthy*100:.0f}%\n")

print(f"Evidence:")
print(f"  P(positive) = {P_positive:.4f} = {P_positive*100:.2f}%\n")

print(f"Posterior Probability (using Bayes' Theorem):")
print(f"  P(disease | positive) = {P_disease_given_positive:.4f} = {P_disease_given_positive*100:.1f}%\n")

print("💡 Key Insight:")
print(f"   Even with 95% sensitivity and a 5% false positive rate, only {P_disease_given_positive*100:.1f}% chance of having disease!")
print("   Why? The disease is rare (low prior probability)")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before test (Prior)
axes[0].bar(['Healthy', 'Disease'], [1-P_disease, P_disease], color=['green', 'red'], alpha=0.7)
axes[0].set_ylabel('Probability')
axes[0].set_title('Before Test (Prior)\nP(disease) = 1%')
axes[0].set_ylim([0, 1])

# After positive test (Posterior)
axes[1].bar(['Healthy', 'Disease'], 
           [1-P_disease_given_positive, P_disease_given_positive], 
           color=['green', 'red'], alpha=0.7)
axes[1].set_ylabel('Probability')
axes[1].set_title(f'After Positive Test (Posterior)\nP(disease|positive) = {P_disease_given_positive*100:.1f}%')
axes[1].set_ylim([0, 1])

plt.tight_layout()
plt.show()

print("\nThis is the foundation of Naive Bayes classification!")

## 2. Gaussian Naive Bayes

### 2.1 For Continuous Features

In [None]:
# Load iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print("Gaussian Naive Bayes - Iris Dataset")
print("="*70)
print(f"Samples: {X_iris.shape[0]}")
print(f"Features: {X_iris.shape[1]} (continuous)")
print(f"Classes: {iris.target_names}\n")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Train Gaussian NB
gnb = GaussianNB()

start_time = time()
gnb.fit(X_train, y_train)
train_time = time() - start_time

# Predictions
y_pred = gnb.predict(X_test)
y_pred_proba = gnb.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)

print(f"Training time: {train_time:.6f}s")
print(f"Accuracy: {accuracy:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Show learned parameters
print("\nLearned Parameters (Mean and Variance per class):")
print("="*70)
for idx, class_name in enumerate(iris.target_names):
    print(f"\n{class_name}:")
    print(f"  Prior: P({class_name}) = {gnb.class_prior_[idx]:.3f}")
    print(f"  Means: {gnb.theta_[idx]}")
    print(f"  Variances: {gnb.var_[idx]}")

In [None]:
# Visualize the fitted Gaussian distributions for all four features
print("\nVisualizing Feature Distributions")
print("="*70)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for feature_idx in range(4):
    ax = axes[feature_idx]
    
    # Plot histograms for each class
    for class_idx, class_name in enumerate(iris.target_names):
        class_data = X_train[y_train == class_idx, feature_idx]
        # Keep the bin edges so the density can be scaled to histogram counts
        _, bin_edges, _ = ax.hist(class_data, bins=15, alpha=0.5, label=class_name)
        
        # Overlay Gaussian
        mu = gnb.theta_[class_idx, feature_idx]
        sigma = np.sqrt(gnb.var_[class_idx, feature_idx])
        x_range = np.linspace(X_train[:, feature_idx].min(), 
                             X_train[:, feature_idx].max(), 100)
        gaussian = (1/(sigma * np.sqrt(2*np.pi))) * np.exp(-0.5*((x_range-mu)/sigma)**2)
        # Scale the density to expected counts per bin: n_samples * bin_width
        bin_width = bin_edges[1] - bin_edges[0]
        gaussian_scaled = gaussian * len(class_data) * bin_width
        ax.plot(x_range, gaussian_scaled, linewidth=2)
    
    ax.set_xlabel(iris.feature_names[feature_idx])
    ax.set_ylabel('Frequency')
    ax.set_title(f'Feature {feature_idx+1}: {iris.feature_names[feature_idx]}')
    ax.legend()
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Gaussian NB fits a Gaussian distribution for each feature per class")

## 3. Multinomial Naive Bayes

### 3.1 For Count Data (Text Classification)

In [None]:
print("Multinomial Naive Bayes - Text Classification")
print("="*70)

# Sample text data
texts = [
    "I love this movie",
    "This film is great",
    "Best movie ever",
    "Amazing film",
    "I hate this movie",
    "Worst film ever",
    "Terrible movie",
    "Bad film"
]

labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1=positive, 0=negative

print("Training Data:")
for text, label in zip(texts, labels):
    sentiment = "Positive" if label == 1 else "Negative"
    print(f"  [{sentiment}] {text}")

# Convert to count vectors
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(texts)

print(f"\nVocabulary: {vectorizer.get_feature_names_out()}")
print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")

# Show count matrix
print("\nCount Matrix:")
count_df = pd.DataFrame(
    X_counts.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"Doc{i+1}" for i in range(len(texts))]
)
print(count_df)

# Train Multinomial NB
mnb = MultinomialNB()
mnb.fit(X_counts, labels)

print(f"\nTrained Multinomial Naive Bayes")
print(f"Class priors: {mnb.class_prior_}")

# Test predictions
test_texts = [
    "This movie is great",
    "I hate this film",
    "Amazing and best"
]

X_test_counts = vectorizer.transform(test_texts)
predictions = mnb.predict(X_test_counts)
probabilities = mnb.predict_proba(X_test_counts)

print("\nTest Predictions:")
for text, pred, proba in zip(test_texts, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"  '{text}'")
    print(f"    → {sentiment} (confidence: {proba[pred]:.3f})")
    print(f"    Probabilities: [Neg={proba[0]:.3f}, Pos={proba[1]:.3f}]")

### 3.2 Real-World Text Classification: 20 Newsgroups

In [None]:
print("20 Newsgroups Dataset - Multinomial Naive Bayes")
print("="*70)

# Load subset of categories
categories = ['sci.space', 'rec.sport.baseball', 'comp.graphics']

print(f"Loading {len(categories)} categories: {categories}\n")

# Load train and test data
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
                                      remove=('headers', 'footers', 'quotes'),
                                      random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                     remove=('headers', 'footers', 'quotes'),
                                     random_state=42)

print(f"Training samples: {len(newsgroups_train.data)}")
print(f"Test samples: {len(newsgroups_test.data)}\n")

# Show sample documents
print("Sample Documents:")
print("="*70)
for i in range(3):
    print(f"\nDocument {i+1} - Category: {newsgroups_train.target_names[newsgroups_train.target[i]]}")
    print(newsgroups_train.data[i][:200] + "...\n")

# Convert to count vectors
print("Converting to count vectors...")
vectorizer_news = CountVectorizer(max_features=5000, stop_words='english')
X_train_counts = vectorizer_news.fit_transform(newsgroups_train.data)
X_test_counts = vectorizer_news.transform(newsgroups_test.data)

print(f"Vocabulary size: {len(vectorizer_news.get_feature_names_out())}")
print(f"Train matrix shape: {X_train_counts.shape}")
print(f"Test matrix shape: {X_test_counts.shape}")

# Train Multinomial NB
print("\nTraining Multinomial Naive Bayes...")
start = time()
mnb_news = MultinomialNB(alpha=1.0)
mnb_news.fit(X_train_counts, newsgroups_train.target)
train_time = time() - start

# Predict
start = time()
y_pred = mnb_news.predict(X_test_counts)
predict_time = time() - start

# Evaluate
accuracy = accuracy_score(newsgroups_test.target, y_pred)

print(f"\nTraining time: {train_time:.3f}s")
print(f"Prediction time: {predict_time:.3f}s")
print(f"Accuracy: {accuracy:.4f}")

print(f"\nClassification Report:")
print(classification_report(newsgroups_test.target, y_pred, 
                           target_names=newsgroups_train.target_names))

# Confusion matrix
cm = confusion_matrix(newsgroups_test.target, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=newsgroups_train.target_names,
           yticklabels=newsgroups_train.target_names)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Multinomial NB - Confusion Matrix')
plt.tight_layout()
plt.show()

print("\n💡 Multinomial NB is very fast and effective for text classification!")

## 4. Bernoulli Naive Bayes

### 4.1 For Binary Features

In [None]:
print("Bernoulli Naive Bayes - Binary Features")
print("="*70)

# Use same text data but convert to binary (word presence/absence)
texts_binary = [
    "I love this movie",
    "This film is great",
    "Best movie ever",
    "Amazing film",
    "I hate this movie",
    "Worst film ever",
    "Terrible movie",
    "Bad film"
]

labels_binary = [1, 1, 1, 1, 0, 0, 0, 0]

# Convert to binary features (presence/absence)
vectorizer_binary = CountVectorizer(binary=True)  # binary=True!
X_binary = vectorizer_binary.fit_transform(texts_binary)

print("Binary Feature Matrix (1 = word present, 0 = word absent):")
binary_df = pd.DataFrame(
    X_binary.toarray(),
    columns=vectorizer_binary.get_feature_names_out(),
    index=[f"Doc{i+1}" for i in range(len(texts_binary))]
)
print(binary_df)

# Train Bernoulli NB
bnb = BernoulliNB()
bnb.fit(X_binary, labels_binary)

print(f"\nTrained Bernoulli Naive Bayes")
print(f"Class priors: {bnb.class_prior_}")

# Feature log probabilities
print("\nFeature probabilities for each class:")
feature_probs = np.exp(bnb.feature_log_prob_)
prob_df = pd.DataFrame(
    feature_probs.T,
    columns=['P(word|Negative)', 'P(word|Positive)'],
    index=vectorizer_binary.get_feature_names_out()
).round(3)
print(prob_df)

# Test
test_binary = ["This is an amazing movie"]
X_test_binary = vectorizer_binary.transform(test_binary)
pred = bnb.predict(X_test_binary)
proba = bnb.predict_proba(X_test_binary)

print(f"\nTest: '{test_binary[0]}'")
print(f"Prediction: {'Positive' if pred[0] == 1 else 'Negative'}")
print(f"Probabilities: [Neg={proba[0][0]:.3f}, Pos={proba[0][1]:.3f}]")

print("\n💡 Bernoulli NB: Each feature is binary (present/absent)")
print("   vs Multinomial NB: Uses word counts")

## 5. Comparing All Three Variants

### 5.1 Same Dataset, Different NB Models

In [None]:
print("Comparing Gaussian, Multinomial, and Bernoulli Naive Bayes")
print("="*70)

# Use breast cancer dataset (binary classification)
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

print(f"Dataset: {cancer.DESCR.split(chr(10))[0]}")
print(f"Samples: {X_cancer.shape[0]}")
print(f"Features: {X_cancer.shape[1]} (continuous)\n")

# Split
X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

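# MultinomialNB requires non-negative features, so scale to the [0, 1] range.
# BernoulliNB binarizes its inputs instead (default threshold: binarize=0.0).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_cancer)
X_test_scaled = scaler.transform(X_test_cancer)

# Train all three
# Note: after min-max scaling almost every value is > 0, so the default
# binarize=0.0 would turn nearly all features into 1. A midpoint threshold
# of 0.5 is used here as a simple (arbitrary) alternative.
models = {
    'Gaussian NB': GaussianNB(),
    'Multinomial NB': MultinomialNB(),
    'Bernoulli NB': BernoulliNB(binarize=0.5)
}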

results = []

for name, model in models.items():
    # Use scaled data for Multinomial and Bernoulli
    if 'Gaussian' in name:
        X_tr, X_te = X_train_cancer, X_test_cancer
    else:
        X_tr, X_te = X_train_scaled, X_test_scaled
    
    # Train
    start = time()
    model.fit(X_tr, y_train_cancer)
    train_time = time() - start
    
    # Predict
    start = time()
    y_pred = model.predict(X_te)
    predict_time = time() - start
    
    # Evaluate
    accuracy = accuracy_score(y_test_cancer, y_pred)
    
    # Cross-validation
    cv_score = cross_val_score(model, X_tr, y_train_cancer, cv=5).mean()
    
    results.append({
        'Model': name,
        'Train Time (s)': train_time,
        'Predict Time (s)': predict_time,
        'CV Score': cv_score,
        'Test Accuracy': accuracy
    })
    
    print(f"{name:20} - Test Acc: {accuracy:.4f}, CV: {cv_score:.4f}, "
          f"Train: {train_time:.5f}s")

results_df = pd.DataFrame(results)

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
x = np.arange(len(results_df))
width = 0.35
axes[0].bar(x - width/2, results_df['CV Score'], width, label='CV Score', alpha=0.8)
axes[0].bar(x + width/2, results_df['Test Accuracy'], width, label='Test Accuracy', alpha=0.8)
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Naive Bayes Variants - Accuracy Comparison')
axes[0].set_xticks(x)
axes[0].set_xticklabels(results_df['Model'])
axes[0].legend()
axes[0].grid(alpha=0.3, axis='y')

# Training time comparison
axes[1].bar(results_df['Model'], results_df['Train Time (s)'], alpha=0.8, color='orange')
axes[1].set_ylabel('Training Time (seconds)')
axes[1].set_title('Training Time Comparison')
axes[1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n" + results_df.to_string(index=False))

## 6. Laplace Smoothing (Alpha Parameter)

### 6.1 Handling Zero Probabilities

In [None]:
print("Laplace Smoothing (Alpha Parameter)")
print("="*70)
print("Problem: What if a word never appears with a given class in the training data?")
print("  → P(word|class) = 0 → The whole product for that class becomes 0!\n")
print("Solution: Add-alpha (Laplace) smoothing")
print("  → Add small constant α to all counts\n")

# Test different alpha values
alpha_values = [0.1, 0.5, 1.0, 2.0, 5.0]

# Use newsgroups data
alpha_results = []

for alpha in alpha_values:
    mnb_alpha = MultinomialNB(alpha=alpha)
    mnb_alpha.fit(X_train_counts, newsgroups_train.target)
    
    y_pred_alpha = mnb_alpha.predict(X_test_counts)
    accuracy_alpha = accuracy_score(newsgroups_test.target, y_pred_alpha)
    
    alpha_results.append({
        'Alpha': alpha,
        'Accuracy': accuracy_alpha
    })
    
    print(f"Alpha = {alpha:4.1f} → Accuracy: {accuracy_alpha:.4f}")

alpha_df = pd.DataFrame(alpha_results)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(alpha_df['Alpha'], alpha_df['Accuracy'], 'o-', linewidth=2, markersize=8)
plt.xlabel('Alpha (Smoothing Parameter)')
plt.ylabel('Test Accuracy')
plt.title('Effect of Laplace Smoothing on Multinomial NB')
plt.grid(alpha=0.3)
plt.axvline(x=1.0, color='red', linestyle='--', alpha=0.5, label='α=1 (standard Laplace)')
plt.legend()
plt.tight_layout()
plt.show()

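# Note: alpha is picked on the test set here for illustration only;
# in practice, tune it with cross-validation on the training data.
best_alpha = alpha_df.loc[alpha_df['Accuracy'].idxmax(), 'Alpha']
print(f"\nBest alpha: {best_alpha}")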

print("\n💡 Alpha Parameter:")
print("   α = 0: No smoothing (can cause zero probabilities)")
print("   α = 1: Laplace smoothing (standard choice)")
print("   α > 1: More smoothing (more uniform distribution)")
print("   \n   Tip: α=1 is usually a good default")

## 7. Naive Bayes vs Other Classifiers

### 7.1 Performance and Speed Comparison

In [None]:
print("Naive Bayes vs Other Classifiers - Text Classification")
print("="*70)

# Models to compare
classifiers = {
    'Multinomial NB': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (Linear)': SVC(kernel='linear', random_state=42)
}

comparison_results = []

for name, clf in classifiers.items():
    print(f"\nTraining {name}...")
    
    # Train
    start = time()
    clf.fit(X_train_counts, newsgroups_train.target)
    train_time = time() - start
    
    # Predict
    start = time()
    y_pred_comp = clf.predict(X_test_counts)
    predict_time = time() - start
    
    # Evaluate
    accuracy_comp = accuracy_score(newsgroups_test.target, y_pred_comp)
    
    comparison_results.append({
        'Model': name,
        'Train Time (s)': train_time,
        'Predict Time (s)': predict_time,
        'Accuracy': accuracy_comp
    })
    
    print(f"  Accuracy: {accuracy_comp:.4f}, Train: {train_time:.3f}s, Predict: {predict_time:.4f}s")

comp_df = pd.DataFrame(comparison_results)

print("\n" + comp_df.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Training time
axes[0].barh(comp_df['Model'], comp_df['Train Time (s)'], alpha=0.8)
axes[0].set_xlabel('Training Time (seconds)')
axes[0].set_title('Training Time Comparison')
axes[0].grid(alpha=0.3, axis='x')

# Prediction time
axes[1].barh(comp_df['Model'], comp_df['Predict Time (s)'], alpha=0.8, color='orange')
axes[1].set_xlabel('Prediction Time (seconds)')
axes[1].set_title('Prediction Time Comparison')
axes[1].grid(alpha=0.3, axis='x')

# Accuracy
axes[2].barh(comp_df['Model'], comp_df['Accuracy'], alpha=0.8, color='green')
axes[2].set_xlabel('Accuracy')
axes[2].set_title('Accuracy Comparison')
axes[2].set_xlim([0.8, 1.0])
axes[2].grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Naive Bayes is extremely fast and competitive in accuracy!")
print("   Especially good for:")
print("   - Text classification")
print("   - Real-time prediction")
print("   - Large datasets")
print("   - Baseline models")

## 8. Decision Guide and Best Practices

### 8.1 Which Naive Bayes to Use?

In [None]:
print("Naive Bayes Decision Guide")
print("="*70)

guide = [
    {
        'Data Type': 'Continuous features (real-valued)',
        'Use': 'GaussianNB',
        'Example': 'Medical measurements, sensor data'
    },
    {
        'Data Type': 'Count data (word counts, frequencies)',
        'Use': 'MultinomialNB',
        'Example': 'Text classification, document categorization'
    },
    {
        'Data Type': 'Binary features (yes/no, present/absent)',
        'Use': 'BernoulliNB',
        'Example': 'Feature presence, binary attributes'
    },
    {
        'Data Type': 'Imbalanced text data',
        'Use': 'ComplementNB',
        'Example': 'Skewed text categories'
    },
    {
        'Data Type': 'Categorical features',
        'Use': 'CategoricalNB',
        'Example': 'Nominal categories (color, size, etc.)'
    },
]

guide_df = pd.DataFrame(guide)
print(guide_df.to_string(index=False))

### 8.2 Strengths and Limitations

In [None]:
print("\nNaive Bayes Strengths and Limitations")
print("="*70)

print("\n✓ STRENGTHS:")
strengths = [
    "Very fast training and prediction",
    "Works well with small training data",
    "Handles high-dimensional data well",
    "Not sensitive to irrelevant features",
    "Provides probability estimates",
    "Simple to understand and implement",
    "Performs well on text classification",
    "Good baseline model",
    "Minimal hyperparameter tuning needed",
    "Missing features can be skipped in principle (scikit-learn's estimators require imputation)"
]
for i, s in enumerate(strengths, 1):
    print(f"  {i:2}. {s}")

print("\n✗ LIMITATIONS:")
limitations = [
    "Strong independence assumption (often violated)",
    "Cannot capture feature interactions",
    "Probability estimates can be poorly calibrated",
    "Zero-frequency problem (needs smoothing)",
    "Assumes specific distribution (Gaussian for continuous)",
    "Less accurate than discriminative models on complex tasks",
    "Sensitive to outliers, which distort the estimated means and variances (Gaussian)",
    "Not designed for continuous features (Multinomial/Bernoulli)"
]
for i, l in enumerate(limitations, 1):
    print(f"  {i}. {l}")

print("\n\n💡 WHEN TO USE NAIVE BAYES:")
use_cases = [
    "Text classification (spam detection, sentiment analysis)",
    "Document categorization",
    "Need for real-time prediction",
    "Small training dataset",
    "High-dimensional data",
    "Need for simple baseline",
    "Probabilistic predictions required",
    "Multi-class classification"
]
for i, u in enumerate(use_cases, 1):
    print(f"  {i}. {u}")

print("\n\n⚠️ WHEN TO AVOID NAIVE BAYES:")
avoid_cases = [
    "Features are highly correlated",
    "Need to capture feature interactions",
    "Require well-calibrated probabilities",
    "Maximum accuracy is critical",
    "Data doesn't fit distribution assumptions",
    "Complex non-linear relationships"
]
for i, a in enumerate(avoid_cases, 1):
    print(f"  {i}. {a}")

## Summary and Quick Reference

### Quick Reference Code

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# ===== GAUSSIAN NB (Continuous Features) =====
gnb = GaussianNB(
    priors=None,      # Class priors (None = use training data)
    var_smoothing=1e-9  # Portion of largest variance added to all
)

# ===== MULTINOMIAL NB (Count Data) =====
mnb = MultinomialNB(
    alpha=1.0,        # Laplace smoothing (0=none, 1=standard)
    fit_prior=True    # Learn class priors from data
)

# ===== BERNOULLI NB (Binary Features) =====
bnb = BernoulliNB(
    alpha=1.0,        # Laplace smoothing
    binarize=0.0,     # Threshold for binarizing (None = already binary)
    fit_prior=True
)

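# Train and predict (same API for every variant;
# X_train, y_train, X_test are placeholders for your own data)
model = gnb
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
log_probs = model.predict_log_proba(X_test)  # Log-probabilities (more numerically stable)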
```

### Text Classification Pipeline

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Option 1: Count Vectors
text_clf = Pipeline([
    ('vect', CountVectorizer(max_features=5000, stop_words='english')),
    ('clf', MultinomialNB(alpha=1.0))
])

# Option 2: TF-IDF
text_clf_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('clf', MultinomialNB(alpha=1.0))
])

# Train (texts_train, labels_train, texts_test are placeholders for your own data)
text_clf.fit(texts_train, labels_train)
predictions = text_clf.predict(texts_test)
```

### Key Hyperparameters

**alpha (Multinomial/Bernoulli)**:
- Laplace/Lidstone smoothing parameter
- α = 0: No smoothing
- α = 1: Laplace smoothing (standard)
- α > 1: More smoothing
- Typical range: [0.1, 0.5, 1.0, 2.0]
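
The effect of α can be sketched with the standard add-alpha estimate, (count + α) / (total + α × vocabulary size). The helper name and the word counts below are made up for illustration.

```python
def smoothed_word_probs(word_counts, alpha):
    """Add-alpha estimate: (count_i + alpha) / (total_count + alpha * vocab_size)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in word_counts.items()
    }

# Made-up word counts for one class; "terrible" never occurs in this class
counts_positive = {"great": 3, "movie": 2, "terrible": 0}

for alpha in (0.0, 1.0, 5.0):
    probs = smoothed_word_probs(counts_positive, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
# 0.0 {'great': 0.6, 'movie': 0.4, 'terrible': 0.0}
# 1.0 {'great': 0.5, 'movie': 0.375, 'terrible': 0.125}
# 5.0 {'great': 0.4, 'movie': 0.35, 'terrible': 0.25}
```

With α = 0 the unseen word keeps probability 0, which would zero out the whole class score; larger α pulls all three estimates toward the uniform value 1/3.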

**var_smoothing (Gaussian)**:
- Portion of largest variance added to all for stability
- Default: 1e-9
- Increase if a feature has near-zero variance within a class (unstable likelihoods)

**binarize (Bernoulli)**:
- Threshold for binarizing features
- None: Assume features already binary
- float: Binarize at this threshold

### Comparison Table

| Variant | Feature Type | Distribution | Use Case |
|---------|--------------|--------------|----------|
| GaussianNB | Continuous | Gaussian | Sensor data, measurements |
| MultinomialNB | Counts/Frequencies | Multinomial | Text classification, word counts |
| BernoulliNB | Binary (0/1) | Bernoulli | Feature presence/absence |
| ComplementNB | Counts (imbalanced) | Multinomial (complement-class estimates) | Imbalanced text data |
| CategoricalNB | Categorical | Categorical | Nominal categories |

### Computational Complexity

| Phase | Complexity | Notes |
|-------|------------|-------|
| Training | O(n × d) | n = samples, d = features; single pass over the data |
| Prediction | O(c × d) per sample | c = number of classes |
| Memory | O(c × d) | Stores parameters only |

### Best Practices

1. **Choose right variant**: Match to your data type
2. **Use smoothing**: α=1 is good default for Multinomial/Bernoulli
3. **Text preprocessing**: Remove stopwords, use max_features
4. **Compare vectorizers**: Try both CountVectorizer and TfidfVectorizer and keep whichever validates better
5. **Check distribution**: Verify Gaussian assumption for GaussianNB
6. **Consider calibration**: Use CalibratedClassifierCV if probabilities matter
7. **Use as baseline**: Always try NB first for text
8. **Log probabilities**: Use predict_log_proba for numerical stability

### Common Pitfalls

| Pitfall | Solution |
|---------|----------|
| Negative values in Multinomial | Rescale with MinMaxScaler, or use GaussianNB for continuous data |
| Zero probabilities | Use alpha smoothing (any α > 0; α = 1 is the usual default) |
| Poor probability calibration | Use CalibratedClassifierCV |
| Wrong variant for data type | Match variant to feature type |
| Correlated features | Consider feature selection or other models |
| Underflow in probabilities | Use predict_log_proba |

### Typical Performance

**Text Classification** (rough figures; they vary with hardware, vocabulary size, and task):
- Training: <1s for 10k documents
- Prediction: <0.1s for 1k documents
- Accuracy: 80-95% depending on task

**Continuous Data** (rough, dataset-dependent figures):
- Often 70-85% accuracy
- Often outperformed by SVM/Random Forest on complex tasks
- Excellent baseline

### Real-World Applications

1. **Spam Detection**: Multinomial NB on email text
2. **Sentiment Analysis**: Bernoulli or Multinomial NB on reviews
3. **Document Classification**: Multinomial NB on article text
4. **Medical Diagnosis**: Gaussian NB on symptoms/measurements
5. **Recommender Systems**: Collaborative filtering
6. **Real-time Classification**: Fast prediction needed

### Further Reading

- **Paper**: "Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval" - Lewis (1998)
- **Book**: "Machine Learning: A Probabilistic Perspective" - Murphy
- **sklearn Docs**: https://scikit-learn.org/stable/modules/naive_bayes.html
- **Text Classification**: "Text Classification from Labeled and Unlabeled Documents using EM" - Nigam et al. (2000)

### Next Steps

- Probability calibration techniques
- Semi-supervised Naive Bayes
- Online learning with partial_fit()
- Feature engineering for NB
- Ensemble methods with NB