### Case Study: Classifying Emails as Spam or Ham for an Email Service Provider

#### Description
An email service provider aims to enhance its spam filtering system to reduce the number of spam emails reaching user inboxes. The objective is to build a predictive model that accurately classifies emails as spam or ham (non-spam).

#### Steps

* Data Generation: Created a synthetic dataset of spam and ham emails, using templates and random word selections to vary the content.

* Data Preprocessing: Cleaned the email text by lowercasing it and removing special characters, digits, and stopwords, then lemmatized the remaining words to normalize them.

* Feature Extraction: Used TF-IDF vectorization to convert the text into numerical features.

* Model Training: Split the dataset into training and test sets and trained a Multinomial Naive Bayes classifier.

* Evaluation: Assessed model performance using accuracy, precision, recall, F1-score, and ROC-AUC.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import re
import random
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score


In [2]:
# Step 1: Data Generation

# Define components for spam emails
spam_subjects = [
    "Congratulations", "Urgent", "Limited Time Offer", "You have won", "Exclusive Deal"
]

spam_bodies = [
    "You have been selected to win a {prize}. Click {link} to claim now.",
    "Your {account} has been compromised. Send your {details} to {email} immediately.",
    "Earn {amount} working from home. No experience required.",
    "Limited time offer on {product}. Buy now and get {discount}% off.",
    "Dear {name}, you have won a {prize}. Reply to this email to claim."
]

spam_placeholders = {
    'prize': ["$1000 gift card", "free vacation", "brand new car"],
    'link': ["here", "this link", "the following URL"],
    'account': ["bank account", "email account", "social media account"],
    'details': ["password", "login details", "credentials"],
    'email': ["security@fakebank.com", "support@phishingsite.com"],
    'amount': ["$5000 per week", "$1000 daily", "$2000 monthly"],
    'product': ["electronics", "fashion items", "home appliances"],
    'discount': ["50", "70", "80"],
    'name': ["user", "customer", "friend"]
}

# Define components for ham emails
ham_subjects = [
    "Meeting Reminder", "Project Update", "Invitation", "Question", "Follow-up"
]

ham_bodies = [
    "Hi {name}, are we still on for the {event} tomorrow?",
    "Please find attached the {document} for your review.",
    "It was great meeting you at the {event}. Let's catch up soon.",
    "Can you provide an update on the {project} status?",
    "Thank you for your {action}. It was very helpful."
]

ham_placeholders = {
    'name': ["John", "Sarah", "Michael", "Jessica"],
    'event': ["meeting", "lunch", "conference call", "seminar"],
    'document': ["report", "proposal", "presentation", "invoice"],
    'project': ["marketing", "development", "design", "research"],
    'action': ["feedback", "assistance", "support", "response"]
}

# Function to generate emails with random variations
def generate_emails(subjects, bodies, placeholders, num_emails):
    emails = []
    for _ in range(num_emails):
        subject = random.choice(subjects)
        body_template = random.choice(bodies)
        body = body_template.format(
            **{key: random.choice(values) for key, values in placeholders.items()}
        )
        email = f"Subject: {subject}\n\n{body}"
        emails.append(email)
    return emails

# Generate synthetic spam and ham emails
random.seed(42)  # fix the seed so the generated dataset is reproducible
num_emails = 1000
spam_emails = generate_emails(spam_subjects, spam_bodies, spam_placeholders, num_emails)
ham_emails = generate_emails(ham_subjects, ham_bodies, ham_placeholders, num_emails)

# Create the dataset
emails = pd.DataFrame({
    'text': spam_emails + ham_emails,
    'label': ['spam'] * num_emails + ['ham'] * num_emails
})
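
Each email combines one of five subjects with one of five body templates, so the generator above can produce at most 240 distinct spam emails and 160 distinct ham emails. Drawing 1,000 of each therefore yields many exact duplicates. The check below is a minimal sketch of how to make that visible, assuming a DataFrame with `text` and `label` columns like `emails`. The helper name `summarize_dataset` and the small demo frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def summarize_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Per-class row count, number of distinct texts, and share of duplicate rows."""
    summary = df.groupby("label")["text"].agg(rows="count", unique_texts="nunique")
    summary["duplicate_share"] = 1 - summary["unique_texts"] / summary["rows"]
    return summary

# Tiny illustrative frame; in the notebook, pass the `emails` DataFrame instead.
demo = pd.DataFrame({
    "text": ["win a prize", "win a prize", "meeting tomorrow", "project update"],
    "label": ["spam", "spam", "ham", "ham"],
})
summary = summarize_dataset(demo)
print(summary)
```

A high `duplicate_share` suggests that a random train/test split will place identical emails on both sides, which inflates test scores.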

In [3]:
# Step 2: Data Preprocessing
# Build the stopword set and lemmatizer once, rather than on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Remove 'Subject:' and newlines
    text = re.sub(r'Subject: ', '', text)
    text = text.replace('\n', ' ')
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Lowercase and split into words
    words = text.lower().split()
    # Remove stopwords (set membership avoids rebuilding the list for every word)
    words = [w for w in words if w not in stop_words]
    # Lemmatization
    words = [lemmatizer.lemmatize(w) for w in words]
    return ' '.join(words)

emails['cleaned_text'] = emails['text'].apply(preprocess_text)

# Step 3: Feature Extraction
vectorizer = TfidfVectorizer(max_features=500)
X = vectorizer.fit_transform(emails['cleaned_text'])
y = emails['label'].map({'ham': 0, 'spam': 1})
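
The cell above fits `TfidfVectorizer` on the full dataset, so the vocabulary and IDF weights are computed from emails that later end up in the test set. One common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that fitting touches only the training portion. The following is a minimal sketch under that assumption; the toy `texts` and `labels` stand in for `emails['cleaned_text']` and the 0/1 labels and are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the cleaned emails (1 = spam, 0 = ham); illustrative only.
texts = [
    "win free vacation claim", "urgent account compromised password",
    "earn money working home", "exclusive deal buy discount",
    "meeting reminder lunch tomorrow", "attached report review please",
    "project update development status", "thank feedback helpful response",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text before any fitting happens.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=500)),
    ("nb", MultinomialNB()),
])
# fit() learns the vocabulary and IDF weights from the training texts only.
pipeline.fit(texts_train, y_train)

# The test texts are transformed with the training vocabulary, then scored.
test_probs = pipeline.predict_proba(texts_test)[:, 1]
print(test_probs)
```

With this arrangement the same `pipeline` object can also be passed to cross-validation utilities, which refit the vectorizer inside each fold.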

In [4]:
# Step 4: Model Training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = MultinomialNB()
model.fit(X_train, y_train)

# Step 5: Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# ROC-AUC Score
y_pred_prob = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_prob)
print(f"ROC-AUC Score: {roc_auc:.2f}")

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       1.00      1.00      1.00       200

    accuracy                           1.00       400
   macro avg       1.00      1.00      1.00       400
weighted avg       1.00      1.00      1.00       400

ROC-AUC Score: 1.00
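
For a spam filter, the two error types often carry different costs: a legitimate email flagged as spam (false positive) may be more damaging than a spam email that slips through (false negative). A confusion matrix separates the two. The sketch below uses made-up labels to show the mechanics; in the notebook, `y_test` and `y_pred` would take the place of the hypothetical `y_true_demo` and `y_pred_demo`.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only (0 = ham, 1 = spam).
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Share of ham emails that were wrongly flagged as spam.
false_positive_rate = fp / (fp + tn)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} FPR={false_positive_rate:.2f}")
```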


### Interpretation

#### Results
* Accuracy: 100%
* Precision: 100%
* Recall: 100%
* F1-score: 100%
* ROC-AUC Score: 1.00

The Multinomial Naive Bayes model classified all 400 test emails correctly. This is expected on this synthetic dataset: spam and ham are generated from separate templates that share almost no vocabulary, and the templates can produce only a few hundred distinct emails, so the test set largely repeats emails seen in training. The scores therefore confirm that the pipeline works end to end, but they do not estimate performance on real email. The model should be validated on real data before it is integrated into the email service provider's spam filter.

### Recommended Next Steps
* Real-World Data Integration: Incorporate real email data (ensuring privacy and compliance) to validate the model's performance in a practical setting.

* Advanced Feature Engineering: Explore n-grams, word embeddings, or deep learning models like RNNs for potentially improved performance.

* Handling Imbalanced Data: In real scenarios, spam emails are often less frequent. Implement techniques to handle imbalanced datasets, such as SMOTE or class weighting.

* Continuous Learning: Develop mechanisms for the model to adapt to new spam tactics by retraining with recent data periodically.

* Deployment and Monitoring: Integrate the model into the email system and set up monitoring to track its performance and user feedback.