# Naïve Bayes for Email Spam Detection
Naïve Bayes applies Bayes' theorem with the **naïve** assumption that features are conditionally independent given the class. For text classification, we convert each message into word-count features. The model estimates how likely each word is to appear in spam versus normal mail and combines these per-word probabilities with the class prior to choose the most probable class. Because multiplying many small probabilities can underflow floating-point precision, the computation is done in log space, where products become sums. Accuracy, precision, recall, and F1-score each summarize performance from a different angle, while the confusion matrix shows which kinds of errors occur: normal mail flagged as spam (false positives) and spam that slips through (false negatives).
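
The decision rule and the underflow problem described above can be sketched in a few lines of plain Python. Everything in this block is hypothetical: the `priors` and `likelihoods` values are made up for illustration, and `log_score` is a helper name introduced here, not part of scikit-learn.

```python
import math

# Hypothetical, made-up probabilities used purely for illustration.
priors = {"spam": 0.5, "normal": 0.5}
likelihoods = {
    "spam":   {"free": 0.05,  "prize": 0.04,  "meeting": 0.001},
    "normal": {"free": 0.005, "prize": 0.001, "meeting": 0.03},
}

def log_score(words, label):
    """Return log P(label) + sum of log P(word | label) over the words."""
    return math.log(priors[label]) + sum(
        math.log(likelihoods[label][w]) for w in words
    )

# Pick the class with the highest log score.
message = ["free", "prize"]
scores = {label: log_score(message, label) for label in priors}
predicted = max(scores, key=scores.get)
print(predicted, scores)

# Underflow demo: the product of 100 small probabilities collapses to 0.0,
# while the sum of their logarithms stays finite.
tiny = [1e-4] * 100
product = math.prod(tiny)                 # requires Python 3.8+
log_sum = sum(math.log(p) for p in tiny)
print(product, log_sum)
```

Comparing log scores gives the same winner as comparing the raw products, because the logarithm is monotonic.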

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# Tiny illustrative dataset (label: 1 = spam, 0 = normal)
records = pd.DataFrame({
    "text": [
        "limited time offer claim your prize now",
        "meeting reminder for project update",
        "win cash by entering free lottery",
        "family dinner plans for saturday",
        "exclusive deal just for you click",
        "invoice attached for last month",
        "cheap meds available order today",
        "team outing scheduled at 5pm",
        "congratulations you have won a voucher",
        "please review the attached report"
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
})

print(records.head())

In [None]:
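# Split the raw text first so the vocabulary is learned from training data only
text_train, text_test, y_train, y_test = train_test_split(
    records["text"], records["label"],
    test_size=0.3, random_state=42, stratify=records["label"]
)

# Fit the vectorizer on the training messages, then reuse it on the test messages
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

# Multinomial Naïve Bayes with the default Laplace smoothing (alpha=1.0)
model = MultinomialNB()
model.fit(X_train, y_train)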

In [None]:
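y_pred = model.predict(X_test)

# Rows are true labels and columns are predicted labels, in the order [normal, spam]
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))

# The held-out set is tiny, so treat these numbers as illustrative only
print("\nDetailed report:")
print(classification_report(
    y_test, y_pred, labels=[0, 1], target_names=["normal", "spam"], zero_division=0
))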
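
The figures in the report can be derived directly from the four cells of the confusion matrix. The sketch below works through that arithmetic on hypothetical labels (`y_true` and `y_hat` are made up, not output from the model above) and checks the results against scikit-learn's own metric functions.

```python
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score
)

# Hypothetical true and predicted labels (1 = spam, 0 = normal).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in
                  confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# Cross-check the hand-computed values against scikit-learn.
assert abs(precision - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall - recall_score(y_true, y_hat)) < 1e-12
assert abs(f1 - f1_score(y_true, y_hat)) < 1e-12
```

For a spam filter, a false positive (normal mail sent to the spam folder) is often costlier than a false negative, which is one reason to look at precision and recall separately rather than at accuracy alone.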