# Naive Bayes Spam Email Classifier

A machine learning project that implements a Naive Bayes classifier from scratch for email spam detection. It demonstrates why numerical stability matters in probabilistic models: the log-space version reaches 99.21% accuracy on the 1,146-email test set.

## Table of Contents
1. [Project Overview](#project-overview)
2. [Data Loading and Exploration](#data-loading-and-exploration)
3. [Data Preprocessing](#data-preprocessing)
4. [Model Implementation](#model-implementation)
5. [Model Training](#model-training)
6. [Evaluation and Results](#evaluation-and-results)
7. [Demo and Conclusion](#demo-and-conclusion)

---

## Project Overview

This project implements a Naive Bayes classifier for spam email detection with the following key features:

- **From-scratch implementation** of the Naive Bayes algorithm
- **Numerical stability** through log-space computations
- **Text preprocessing** with stopword removal and tokenization
- **Comprehensive evaluation** with multiple metrics
- **Performance comparison** between standard and log-space implementations

**Key Achievement**: Improved model accuracy from 84.82% to 99.21% by implementing log-space computations to handle numerical underflow.
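
The underflow behind this result can be reproduced in isolation. The sketch below is illustrative only: the 300 words and the per-word probability of 0.01 are made-up values, not figures from the dataset. It multiplies the probabilities directly and then sums their logarithms.

```python
import math

# Hypothetical input: 300 words, each with P(word|class) = 0.01.
word_probs = [0.01] * 300

# Direct product: mathematically 1e-600, far below the smallest positive
# float64 (about 4.9e-324), so the running product collapses to 0.0.
product = 1.0
for p in word_probs:
    product *= p

# Log-space: the same quantity as a sum of logs stays finite.
log_sum = sum(math.log(p) for p in word_probs)

print(product)   # 0.0
print(log_sum)   # roughly -1381.55
```

Once both class products have collapsed to 0.0 they can no longer be compared, while the log sums still can.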

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk import word_tokenize
import string
import nltk

# Download required NLTK data
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kaisarimtiyaz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/kaisarimtiyaz/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

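## Data Loading and Exploration

Loading the email dataset and checking the class balance before preprocessing.

In [2]: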
# Load the email dataset
dataframe_emails = pd.read_csv('emails.csv')

# Display dataset overview
print("Dataset Overview:")
print("=" * 40)
print(f"Total emails: {len(dataframe_emails):,}")
print(f"Spam emails: {dataframe_emails.spam.sum():,} ({dataframe_emails.spam.sum()/len(dataframe_emails):.2%})")
print(f"Ham emails: {len(dataframe_emails) - dataframe_emails.spam.sum():,} ({1-dataframe_emails.spam.sum()/len(dataframe_emails):.2%})")

# Display first few rows
dataframe_emails.head()

Dataset Overview:
Total emails: 5,728
Spam emails: 1,368 (23.88%)
Ham emails: 4,360 (76.12%)


Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## Data Preprocessing

The functions below clean and prepare the email data for training:

- **Email shuffling** so that the sequential train/test split does not inherit the file's ordering
- **"Subject: " prefix removal** (the first 9 characters of each email; the rest of the subject text is kept)
- **Tokenization** and **stopword removal**
- **Punctuation filtering**
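
A dependency-free sketch of these steps on a single string may help make them concrete. It is a simplification under stated assumptions: `TOY_STOPWORDS`, `strip_subject_prefix`, and `toy_preprocess` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and the regular expression only approximates `word_tokenize`.

```python
import re
import string

# Stand-in stopword list (assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "to", "a", "is", "for", "your"}

def strip_subject_prefix(text, prefix="Subject: "):
    """Drop a leading 'Subject: ' if present; otherwise return text unchanged."""
    return text[len(prefix):] if text.startswith(prefix) else text

def toy_preprocess(text):
    """Lowercase, split into word and punctuation tokens, drop stopwords and punctuation."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    drop = TOY_STOPWORDS | set(string.punctuation)
    return [t for t in tokens if t not in drop]

raw = "Subject: Claim your FREE prize, today!"  # made-up example email
tokens = toy_preprocess(strip_subject_prefix(raw))
print(len("Subject: "))  # 9
print(tokens)            # ['claim', 'free', 'prize', 'today']
```

Unlike a fixed `x[9:]` slice, the prefix check leaves text untouched when it does not start with "Subject: ", which may be a safer choice for emails from other sources.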

In [3]:
def preprocess_emails(df):
    """
    Preprocesses email data by shuffling and extracting content.
    
    Args:
        df: DataFrame containing email data with 'text' and 'spam' columns
        
    Returns:
        tuple: (email_content, labels) as numpy arrays
    """
    # Shuffle dataset to avoid bias
    df = df.sample(frac=1, ignore_index=True, random_state=42)
    
    # Remove "Subject:" prefix (first 9 characters)
    X = df.text.apply(lambda x: x[9:]).to_numpy()
    Y = df.spam.to_numpy()
    
    return X, Y

def preprocess_text(X):
    """
    Preprocesses text data by removing stopwords and punctuation.
    
    Args:
        X: Text data (string or array of strings)
        
    Returns:
        list: Preprocessed text with stopwords and punctuation removed
    """
    # Create stopword and punctuation set
    stop = set(stopwords.words('english') + list(string.punctuation))
    
    # Handle single string input
    if isinstance(X, str):
        X = np.array([X])
    
    X_preprocessed = []
    for email in X:
        # Tokenize, lowercase, and filter stopwords/punctuation
        tokens = np.array([word.lower() for word in word_tokenize(email) 
                          if word.lower() not in stop]).astype(X.dtype)
        X_preprocessed.append(tokens)
    
    return X_preprocessed[0] if len(X) == 1 else X_preprocessed

# Apply preprocessing
X, Y = preprocess_emails(dataframe_emails)
X_treated = preprocess_text(X)

print("Preprocessing completed!")
print(f"Sample preprocessed email: {X_treated[0][:10]}")  # Show first 10 words

Preprocessing completed!
Sample preprocessed email: ['energy' 'derivatives' 'conference' 'may' '29' 'toronto' 'good' 'morning'
 'amy' 'vince']


In [4]:
# Train-test split (80-20)
TRAIN_SIZE = int(0.80 * len(X_treated))

X_train = X_treated[:TRAIN_SIZE]
Y_train = Y[:TRAIN_SIZE]
X_test = X_treated[TRAIN_SIZE:]
Y_test = Y[TRAIN_SIZE:]

print("Dataset Split Summary:")
print("=" * 30)
print(f"Training set: {len(X_train):,} emails")
print(f"  - Spam: {sum(Y_train == 1):,} ({sum(Y_train == 1)/len(Y_train):.2%})")
print(f"  - Ham: {sum(Y_train == 0):,} ({sum(Y_train == 0)/len(Y_train):.2%})")
print(f"\nTest set: {len(X_test):,} emails")
print(f"  - Spam: {sum(Y_test == 1):,} ({sum(Y_test == 1)/len(Y_test):.2%})")
print(f"  - Ham: {sum(Y_test == 0):,} ({sum(Y_test == 0)/len(Y_test):.2%})")

Dataset Split Summary:
Training set: 4,582 emails
  - Spam: 1,114 (24.31%)
  - Ham: 3,468 (75.69%)

Test set: 1,146 emails
  - Spam: 254 (22.16%)
  - Ham: 892 (77.84%)


## Model Implementation

### Core Naive Bayes Functions

Implementing the mathematical foundation of the Naive Bayes classifier:

1. **Word Frequency Calculation**: Count how many spam and ham emails contain each word (repeats within one email count once)
2. **Probability Estimation**: Estimate P(word|class) as the smoothed count divided by the number of training emails in that class
3. **Email Classification**: Apply Bayes' theorem for classification
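
The first two steps can be traced by hand on a toy corpus. The sketch below mirrors the counting scheme used in this notebook (every word starts at 1 per class, and each email contributes at most once per word). The five emails are invented for illustration, and `counts`, `class_totals`, and `p_word_given_class` are hypothetical names.

```python
# Toy corpus (hypothetical): each email is a list of preprocessed tokens.
emails = [
    ["win", "prize", "win"],   # spam
    ["prize", "claim"],        # spam
    ["meeting", "schedule"],   # ham
    ["meeting", "prize"],      # ham
    ["schedule", "report"],    # ham
]
labels = [1, 1, 0, 0, 0]

counts = {}
for tokens, label in zip(emails, labels):
    cls = "spam" if label == 1 else "ham"
    for word in set(tokens):  # set(): one count per email, not per occurrence
        counts.setdefault(word, {"spam": 1, "ham": 1})  # add-one start
        counts[word][cls] += 1

class_totals = {"spam": labels.count(1), "ham": labels.count(0)}

def p_word_given_class(word, cls):
    """Smoothed document count divided by the number of emails in the class."""
    return counts[word][cls] / class_totals[cls]

print(counts["prize"])                         # {'spam': 3, 'ham': 2}
print(p_word_given_class("meeting", "spam"))   # 0.5
print(p_word_given_class("prize", "spam"))     # 1.5
```

The last value exceeds 1 because the +1 is added to the numerator only. On a corpus of thousands of emails the effect is negligible, but a common variant divides by the class count plus 2 so that the estimate always stays below 1.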

In [5]:
def get_word_frequency(X, Y):
    """
    Count, for each word, how many spam and ham emails contain it (add-one smoothed).
    
    Args:
        X: Array of preprocessed emails
        Y: Array of labels (1=spam, 0=ham)
        
    Returns:
        dict: Word frequency dictionary with spam/ham counts
    """
    word_dict = {}
    
    for i in range(len(X)):
        email = set(X[i])  # Remove duplicates within email
        cls = Y[i]
        
        for word in email:
            if word not in word_dict:
                # Initialize with Laplace smoothing (add 1)
                word_dict[word] = {"spam": 1, "ham": 1}
                
            # Increment count for appropriate class
            if cls == 0:
                word_dict[word]["ham"] += 1
            else:
                word_dict[word]["spam"] += 1
                
    return word_dict

def prob_word_given_class(word, cls, word_frequency, class_frequency):
    """Calculate conditional probability P(word|class)"""
    return word_frequency[word][cls] / class_frequency[cls]

def prob_email_given_class(treated_email, cls, word_frequency, class_frequency):
    """Calculate P(email|class) using independence assumption"""
    prob = 1
    for word in treated_email:
        if word in word_frequency:
            prob *= word_frequency[word][cls] / class_frequency[cls]
    return prob

def log_prob_email_given_class(treated_email, cls, word_frequency, class_frequency):
    """Calculate log P(email|class) to prevent numerical underflow"""
    log_prob = 0
    for word in treated_email:
        if word in word_frequency:
            log_prob += np.log(word_frequency[word][cls] / class_frequency[cls])
    return log_prob

### Classifier Implementations

Two versions of the Naive Bayes classifier:

1. **Standard Implementation**: Direct probability calculation
2. **Log-space Implementation**: Sums log-probabilities instead of multiplying probabilities, which avoids underflow on long emails

In [6]:
def naive_bayes(treated_email, word_frequency, class_frequency, return_likelihood=False):
    """
    Standard Naive Bayes classifier implementation.
    
    Args:
        treated_email: Preprocessed email content
        word_frequency: Word frequency dictionary
        class_frequency: Class frequency dictionary
        return_likelihood: Whether to return likelihood values
        
    Returns:
        int: 1 for spam, 0 for ham (or likelihood tuple if requested)
    """
    # Calculate P(email|spam) and P(email|ham)
    prob_email_given_spam = prob_email_given_class(treated_email, 'spam', word_frequency, class_frequency)
    prob_email_given_ham = prob_email_given_class(treated_email, 'ham', word_frequency, class_frequency)
    
    # Calculate prior probabilities
    total_emails = class_frequency['spam'] + class_frequency['ham']
    p_spam = class_frequency['spam'] / total_emails
    p_ham = class_frequency['ham'] / total_emails
    
    # Unnormalized posteriors: prior × likelihood (the shared P(email) term is omitted)
    spam_likelihood = p_spam * prob_email_given_spam
    ham_likelihood = p_ham * prob_email_given_ham
    
    if return_likelihood:
        return (spam_likelihood, ham_likelihood)
    
    # Ties go to spam: if both products underflow to 0.0, this returns 1
    return 1 if spam_likelihood >= ham_likelihood else 0

def log_naive_bayes(treated_email, word_frequency, class_frequency, return_likelihood=False):
    """
    Log-space Naive Bayes classifier for numerical stability.
    
    Args:
        treated_email: Preprocessed email content
        word_frequency: Word frequency dictionary
        class_frequency: Class frequency dictionary
        return_likelihood: Whether to return log-likelihood values
        
    Returns:
        int: 1 for spam, 0 for ham (or log-likelihood tuple if requested)
    """
    # Calculate log P(email|spam) and log P(email|ham)
    log_prob_email_given_spam = log_prob_email_given_class(treated_email, 'spam', word_frequency, class_frequency)
    log_prob_email_given_ham = log_prob_email_given_class(treated_email, 'ham', word_frequency, class_frequency)
    
    # Calculate log prior probabilities
    total_emails = class_frequency['spam'] + class_frequency['ham']
    log_p_spam = np.log(class_frequency['spam'] / total_emails)
    log_p_ham = np.log(class_frequency['ham'] / total_emails)
    
    # Unnormalized log posteriors: log prior + log likelihood
    log_spam_likelihood = log_p_spam + log_prob_email_given_spam
    log_ham_likelihood = log_p_ham + log_prob_email_given_ham
    
    if return_likelihood:
        return (log_spam_likelihood, log_ham_likelihood)
    
    return 1 if log_spam_likelihood >= log_ham_likelihood else 0

## Model Training

Training the Naive Bayes model by calculating word frequencies and class distributions from the training data.

In [7]:
# Train the model
print("Training Naive Bayes model...")
word_frequency = get_word_frequency(X_train, Y_train)
class_frequency = {'ham': sum(Y_train == 0), 'spam': sum(Y_train == 1)}

print("\nModel Training Summary:")
print("=" * 30)
print(f"Vocabulary size: {len(word_frequency):,} unique words")
print("Training emails:")
print(f"  - Ham: {class_frequency['ham']:,}")
print(f"  - Spam: {class_frequency['spam']:,}")

# Example: Show probability of key words
sample_words = ['lottery', 'meeting', 'free', 'schedule']
print("\nSample Word Probabilities:")
print("-" * 40)
for word in sample_words:
    if word in word_frequency:
        p_spam = prob_word_given_class(word, 'spam', word_frequency, class_frequency)
        p_ham = prob_word_given_class(word, 'ham', word_frequency, class_frequency)
        print(f"{word:10} | P(word|spam)={p_spam:.4f} | P(word|ham)={p_ham:.4f}")

Training Naive Bayes model...

Model Training Summary:
Vocabulary size: 33,812 unique words
Training emails:
  - Ham: 3,468
  - Spam: 1,114

Sample Word Probabilities:
----------------------------------------
lottery    | P(word|spam)=0.0081 | P(word|ham)=0.0003
meeting    | P(word|spam)=0.0081 | P(word|ham)=0.1886
free       | P(word|spam)=0.1768 | P(word|ham)=0.0995
schedule   | P(word|spam)=0.0090 | P(word|ham)=0.1029


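## Evaluation and Results

Helper functions compute the confusion-matrix counts and the derived metrics (accuracy, precision, recall, F1). Both classifiers are then scored on the test set.

In [8]: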
def get_true_positives(Y_true, Y_pred):
    """Count true positives (correctly identified spam)"""
    return sum(1 for i in range(len(Y_true)) if Y_true[i] == 1 and Y_pred[i] == 1)

def get_true_negatives(Y_true, Y_pred):
    """Count true negatives (correctly identified ham)"""
    return sum(1 for i in range(len(Y_true)) if Y_true[i] == 0 and Y_pred[i] == 0)

def get_false_positives(Y_true, Y_pred):
    """Count false positives (ham classified as spam)"""
    return sum(1 for i in range(len(Y_true)) if Y_true[i] == 0 and Y_pred[i] == 1)

def get_false_negatives(Y_true, Y_pred):
    """Count false negatives (spam classified as ham)"""
    return sum(1 for i in range(len(Y_true)) if Y_true[i] == 1 and Y_pred[i] == 0)

def calculate_metrics(Y_true, Y_pred):
    """Calculate comprehensive evaluation metrics"""
    tp = get_true_positives(Y_true, Y_pred)
    tn = get_true_negatives(Y_true, Y_pred)
    fp = get_false_positives(Y_true, Y_pred)
    fn = get_false_negatives(Y_true, Y_pred)
    
    accuracy = (tp + tn) / len(Y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score,
        'true_positives': tp,
        'true_negatives': tn,
        'false_positives': fp,
        'false_negatives': fn
    }

In [9]:
# Generate predictions using both models
print("Generating predictions...")
Y_pred_standard = [naive_bayes(email, word_frequency, class_frequency) for email in X_test]
Y_pred_log = [log_naive_bayes(email, word_frequency, class_frequency) for email in X_test]

# Calculate metrics for both models
metrics_standard = calculate_metrics(Y_test, Y_pred_standard)
metrics_log = calculate_metrics(Y_test, Y_pred_log)

def display_results(metrics, model_name):
    """Display formatted evaluation results"""
    print(f"\n{model_name}")
    print("=" * len(model_name))
    print(f"Accuracy:  {metrics['accuracy']:.4f} ({metrics['accuracy']:.2%})")
    print(f"Precision: {metrics['precision']:.4f}")
    print(f"Recall:    {metrics['recall']:.4f}")
    print(f"F1-Score:  {metrics['f1_score']:.4f}")
    print("\nConfusion Matrix:")
    print(f"True Positives:  {metrics['true_positives']:4d}")
    print(f"True Negatives:  {metrics['true_negatives']:4d}")
    print(f"False Positives: {metrics['false_positives']:4d}")
    print(f"False Negatives: {metrics['false_negatives']:4d}")

display_results(metrics_standard, "Standard Naive Bayes Results")
display_results(metrics_log, "Log-space Naive Bayes Results")

# Highlight the improvement
improvement = metrics_log['accuracy'] - metrics_standard['accuracy']
print(f"Log-space implementation improved accuracy by {improvement:.4f} ({improvement:.2%})")
print("This demonstrates the critical importance of numerical stability in ML!")

Generating predictions...

Standard Naive Bayes Results
Accuracy:  0.8482 (84.82%)
Precision: 0.5957
Recall:    0.9803
F1-Score:  0.7411

Confusion Matrix:
True Positives:   249
True Negatives:   723
False Positives:  169
False Negatives:    5

Log-space Naive Bayes Results
Accuracy:  0.9921 (99.21%)
Precision: 0.9842
Recall:    0.9803
F1-Score:  0.9822

Confusion Matrix:
True Positives:   249
True Negatives:   888
False Positives:    4
False Negatives:    5
Log-space implementation improved accuracy by 0.1440 (14.40%)
This demonstrates the critical importance of numerical stability in ML!
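
As a cross-check, the reported metrics can be recomputed from the confusion-matrix counts alone. The sketch below takes the counts from the output above; `metrics_from_counts` is a hypothetical helper, separate from `calculate_metrics`.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts copied from the two result blocks above.
standard = metrics_from_counts(tp=249, tn=723, fp=169, fn=5)
log_space = metrics_from_counts(tp=249, tn=888, fp=4, fn=5)

for name, m in [("standard", standard), ("log-space", log_space)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

The spam counts (249 true positives, 5 false negatives) are identical for both models. The difference is in the ham counts: 165 emails move from false positive to true negative, which accounts for the gain of about 14.4 percentage points in accuracy.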


In [10]:
# Find an email where the models disagree
disagreement_indices = [i for i in range(len(Y_pred_standard)) 
                       if Y_pred_standard[i] != Y_pred_log[i]]

if disagreement_indices:
    idx = disagreement_indices[0]
    sample_email = X_test[idx]
    
    print("Numerical Stability Analysis")
    print("=" * 40)
    print(f"Email length: {len(sample_email)} words")
    print(f"True label: {'Spam' if Y_test[idx] == 1 else 'Ham'}")
    print(f"Standard NB prediction: {'Spam' if Y_pred_standard[idx] == 1 else 'Ham'}")
    print(f"Log-space NB prediction: {'Spam' if Y_pred_log[idx] == 1 else 'Ham'}")
    
    # Show likelihood values
    spam_like, ham_like = naive_bayes(sample_email, word_frequency, class_frequency, return_likelihood=True)
    log_spam_like, log_ham_like = log_naive_bayes(sample_email, word_frequency, class_frequency, return_likelihood=True)
    
    print("\nLikelihood Analysis:")
    print(f"Standard - Spam: {spam_like:.2e}, Ham: {ham_like:.2e}")
    print(f"Log-space - Spam: {log_spam_like:.2f}, Ham: {log_ham_like:.2f}")
    
    if spam_like == 0 and ham_like == 0:
        print("\n⚠️  NUMERICAL UNDERFLOW DETECTED!")
        print("Standard implementation suffers from floating-point precision limits.")
        print("Log-space implementation maintains numerical stability.")

Numerical Stability Analysis
Email length: 262 words
True label: Ham
Standard NB prediction: Spam
Log-space NB prediction: Ham

Likelihood Analysis:
Standard - Spam: 0.00e+00, Ham: 0.00e+00
Log-space - Spam: -1161.60, Ham: -854.10

⚠️  NUMERICAL UNDERFLOW DETECTED!
Standard implementation suffers from floating-point precision limits.
Log-space implementation maintains numerical stability.
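
The wrong "Spam" prediction follows from the tie-breaking rule rather than from any evidence in the email. The sketch below starts from the two log scores printed above, converts them back to linear space, and applies the same `>=` comparison that both classifiers use. The variable names are illustrative.

```python
import math

# Log scores copied from the "Likelihood Analysis" output above.
log_spam, log_ham = -1161.60, -854.10

# Back in linear space both values are far below the smallest positive
# float64 (about 4.9e-324), so both become exactly 0.0.
spam_like, ham_like = math.exp(log_spam), math.exp(log_ham)

# Same decision rule as the classifiers: ">=" favours spam on a tie.
standard_pred = 1 if spam_like >= ham_like else 0
log_pred = 1 if log_spam >= log_ham else 0

print(spam_like, ham_like)       # 0.0 0.0
print(standard_pred, log_pred)   # 1 0
```

Because `0.0 >= 0.0` is true, every email long enough to underflow both products is labelled spam by the standard implementation. That is consistent with its errors being concentrated in false positives.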


## Demo and Conclusion

### Interactive Email Classification

Test the trained model with sample emails to demonstrate its capabilities.

In [11]:
def classify_email(email_text, show_details=False):
    """
    Classify a single email and optionally show detailed analysis.
    
    Args:
        email_text: Raw email text
        show_details: Whether to show preprocessing steps and probabilities
        
    Returns:
        tuple: (label, (log_spam_score, log_ham_score)), where label is "Spam" or "Ham"
    """
    # Preprocess the email
    treated = preprocess_text(email_text)
    
    # Score once in log space, then derive the label with the classifier's own rule
    log_spam_prob, log_ham_prob = log_naive_bayes(treated, word_frequency, class_frequency, return_likelihood=True)
    prediction = 1 if log_spam_prob >= log_ham_prob else 0
    
    result = "Spam" if prediction == 1 else "Ham"
    
    if show_details:
        print(f"Preprocessed: {treated[:10]}..." if len(treated) > 10 else f"Preprocessed: {treated}")
        print(f"Log probabilities - Spam: {log_spam_prob:.2f}, Ham: {log_ham_prob:.2f}")
        print(f"Prediction: {result}")
    
    return result, (log_spam_prob, log_ham_prob)

# Test with sample emails
test_emails = [
    "🎯 SPAM: Click here to win a lottery ticket and claim your prize NOW!",
    "📧 HAM: Our meeting will happen in the main office. Please be there on time.",
    "🎯 SPAM: FREE MONEY! Act now to claim your $1000 reward! Limited time offer!",
    "📧 HAM: Please review the quarterly report attached to this email. Thanks!",
    "🎯 SPAM: You have won $10,000! Click here to claim your prize immediately!",
    "📧 HAM: The project deadline has been extended to next Friday. Let me know if you have questions."
]

print("EMAIL CLASSIFICATION DEMO")
print("=" * 50)

correct_predictions = 0
for i, email in enumerate(test_emails):
    # Extract expected label from emoji
    expected = "Spam" if "🎯" in email else "Ham"
    clean_email = email.split(": ", 1)[1]  # Remove emoji prefix
    
    prediction, _ = classify_email(clean_email)
    is_correct = "✅" if prediction == expected else "❌"
    
    print(f"\n{i+1}. Email: '{clean_email[:60]}{'...' if len(clean_email) > 60 else ''}'")
    print(f"   Expected: {expected} | Predicted: {prediction} {is_correct}")
    
    if prediction == expected:
        correct_predictions += 1

print(f"\nDemo Accuracy: {correct_predictions}/{len(test_emails)} ({correct_predictions/len(test_emails):.1%})")

EMAIL CLASSIFICATION DEMO

1. Email: 'Click here to win a lottery ticket and claim your prize NOW!'
   Expected: Spam | Predicted: Spam ✅

2. Email: 'Our meeting will happen in the main office. Please be there ...'
   Expected: Ham | Predicted: Ham ✅

3. Email: 'FREE MONEY! Act now to claim your $1000 reward! Limited time...'
   Expected: Spam | Predicted: Spam ✅

4. Email: 'Please review the quarterly report attached to this email. T...'
   Expected: Ham | Predicted: Ham ✅

5. Email: 'You have won $10,000! Click here to claim your prize immedia...'
   Expected: Spam | Predicted: Spam ✅

6. Email: 'The project deadline has been extended to next Friday. Let m...'
   Expected: Ham | Predicted: Ham ✅

Demo Accuracy: 6/6 (100.0%)
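
The pair of log scores returned by `classify_email` is not a probability. If a single confidence value is wanted, one option is to normalize the two scores with the log-sum-exp trick, sketched below. `spam_posterior` is a hypothetical helper that is not part of the notebook, and Naive Bayes posteriors tend to be overconfident, so the value is best treated as a rough indicator.

```python
import math

def spam_posterior(log_spam, log_ham):
    """P(spam | email) from two unnormalized log scores, computed stably."""
    # Subtract the larger score first so that at least one exponent is 0
    # and the denominator can never underflow to 0.0.
    m = max(log_spam, log_ham)
    exp_spam = math.exp(log_spam - m)
    exp_ham = math.exp(log_ham - m)
    return exp_spam / (exp_spam + exp_ham)

# Scores from the numerical stability analysis earlier in this notebook.
print(spam_posterior(-1161.60, -854.10))   # a value close to 0: confidently ham

# Equal scores give an even split.
print(spam_posterior(-10.0, -10.0))        # 0.5
```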
