# Spam Email Classifier with Neural Networks

A capstone project I built after completing Andrew Ng's Machine Learning Specialization. The goal: get hands-on experience with the full ML workflow.

**Dataset:** [Spambase](https://archive.ics.uci.edu/ml/datasets/spambase) from UCI - 4601 emails, 57 features (word frequencies, character frequencies, capital letter stats), binary target (spam/not spam).

**Workflow:**
1. Load and explore data
2. Split into 60% train / 20% validation / 20% test
3. Scale features
4. Build the neural network
5. Train with early stopping
6. Evaluate on test set

## Setup

First, let's import our dependencies.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print(f"TensorFlow version: {tf.__version__}")

## Step 1: Load the Dataset

The Spambase dataset has 57 features:
- 48 word frequency features (how often words like "free", "money", "credit" appear)
- 6 character frequency features (!, $, etc.)
- 3 capital letter statistics

Target is binary: 1 = spam, 0 = not spam.

In [None]:
# Download dataset directly from UCI (works in Colab)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"

# Define column names based on spambase.names documentation
feature_names = [
    # 48 word frequency features
    'word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
    'word_freq_our', 'word_freq_over', 'word_freq_remove', 'word_freq_internet',
    'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will',
    'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free',
    'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit',
    'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money',
    'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650',
    'word_freq_lab', 'word_freq_labs', 'word_freq_telnet', 'word_freq_857',
    'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology',
    'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct',
    'word_freq_cs', 'word_freq_meeting', 'word_freq_original', 'word_freq_project',
    'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference',
    # 6 character frequency features
    'char_freq_semicolon', 'char_freq_paren', 'char_freq_bracket',
    'char_freq_exclamation', 'char_freq_dollar', 'char_freq_hash',
    # 3 capital letter features
    'capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total',
    # Target
    'is_spam'
]

df = pd.read_csv(url, header=None, names=feature_names)
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Check class distribution
spam_ratio = df['is_spam'].mean()
print(f"Spam ratio: {spam_ratio:.1%} spam, {1-spam_ratio:.1%} not spam")
print(f"Missing values: {df.isnull().sum().sum()}")
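
Because the two classes are not balanced, it helps to know what accuracy a trivial "always predict the majority class" model would reach before judging the network. A minimal sketch, assuming the class counts listed in the UCI documentation (1813 spam, 2788 non-spam); `majority_baseline_accuracy` is a hypothetical helper, not part of the notebook above:

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of a classifier that always predicts the more common class."""
    y = np.asarray(y)
    spam_ratio = y.mean()  # fraction of labels equal to 1
    return float(max(spam_ratio, 1 - spam_ratio))

# Assumed class counts from the UCI documentation (illustration only;
# in the notebook you would pass df['is_spam'].values instead)
y_demo = np.array([1] * 1813 + [0] * 2788)
print(f"Majority-class baseline: {majority_baseline_accuracy(y_demo):.3f}")  # about 0.606
```

Any model worth keeping should beat this number by a clear margin.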

## Step 2: Split the Dataset

We split into three sets:
- **Training (60%):** Used to train the model
- **Validation (20%):** Used to monitor for overfitting during training
- **Test (20%):** Final evaluation - only used once at the end

We use `stratify` to maintain the same spam ratio in all splits.

In [None]:
X = df.drop('is_spam', axis=1).values
y = df['is_spam'].values

# First split: 60% train, 40% temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Second split: halve temp into validation and test (20% of the full dataset each)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print(f"Training set: {len(X_train)} samples")
print(f"Validation set: {len(X_val)} samples")
print(f"Test set: {len(X_test)} samples")
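
To make the effect of `stratify` concrete, here is a rough NumPy sketch of a stratified 60/20/20 split done by hand: each class is shuffled and sliced separately, so every split ends up with the same class ratio. This is an illustration under simplifying assumptions, not how scikit-learn implements it; `stratified_three_way_split` and the synthetic labels are hypothetical.

```python
import numpy as np

def stratified_three_way_split(y, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return index arrays (train, val, test) that preserve the class ratio of y."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx, test_idx = [], [], []
    for label in np.unique(y):
        # Shuffle the indices of this class only, then slice them 60/20/20
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:n_train + n_val])
        test_idx.extend(idx[n_train + n_val:])
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)

# Synthetic labels with 39% positives (made up for illustration)
y_demo = np.array([1] * 390 + [0] * 610)
train_idx, val_idx, test_idx = stratified_three_way_split(y_demo)
for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(f"{name}: {len(idx)} samples, positive ratio {y_demo[idx].mean():.2f}")
```

The same check (printing `y_train.mean()`, `y_val.mean()`, `y_test.mean()`) can be run on the real splits to confirm that the ratios match.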

## Step 3: Scale Features

Neural networks train better when features are on similar scales. We use StandardScaler to standardize each feature to mean 0 and standard deviation 1.

**Important:** We fit the scaler only on training data, then transform all sets. This prevents data leakage - the model shouldn't "see" statistics from validation/test data.

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on training
X_val_scaled = scaler.transform(X_val)          # only transform
X_test_scaled = scaler.transform(X_test)        # only transform

print(f"Before scaling - Feature 0 range: [{X_train[:, 0].min():.2f}, {X_train[:, 0].max():.2f}]")
print(f"After scaling - Feature 0 range: [{X_train_scaled[:, 0].min():.2f}, {X_train_scaled[:, 0].max():.2f}]")
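
What "fit on train, transform everywhere" means can be sketched in plain NumPy: the mean and standard deviation come from the training data only and are then reused unchanged on the other sets. The helper names below are hypothetical, and the zero-variance handling is a simplification of what `StandardScaler` does.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and std from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # population std (ddof=0)
    std[std == 0] = 1.0        # avoid dividing by zero for constant features
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any split using statistics learned from the training split."""
    return (X - mean) / std

# Random stand-in data (not Spambase) just to show the mechanics
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_val_demo = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

mean, std = fit_standardizer(X_train_demo)
train_scaled = apply_standardizer(X_train_demo, mean, std)
val_scaled = apply_standardizer(X_val_demo, mean, std)

print("train mean per feature:", train_scaled.mean(axis=0).round(3))
print("val mean per feature:  ", val_scaled.mean(axis=0).round(3))
```

The validation means come out close to zero but not exactly zero, which is expected: the validation set is scaled with training statistics, not its own.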

## Step 4: Build the Neural Network

Architecture:
- **Input:** 57 features
- **Hidden layer 1:** 32 neurons, ReLU activation
- **Hidden layer 2:** 16 neurons, ReLU activation  
- **Output:** 1 neuron, sigmoid activation (outputs probability of spam)

In [None]:
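model = keras.Sequential([
    # Declare the input explicitly; recent Keras versions warn when
    # input_shape is passed to a layer instead
    keras.Input(shape=(X_train_scaled.shape[1],)),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])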

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()
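
To see what this architecture computes, here is a small NumPy sketch of the same 57-32-16-1 forward pass together with the binary cross-entropy loss. It is an illustration only: the weights are random (drawn from a normal distribution, which is an assumption and not Keras's default initializer), and all function names are hypothetical.

```python
import numpy as np

LAYER_SIZES = [57, 32, 16, 1]  # input, hidden 1, hidden 2, output

def init_params(layer_sizes, seed=0):
    """Create one (weights, bias) pair per Dense layer."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def count_params(params):
    """Total trainable parameters: weights plus biases of every layer."""
    return sum(W.size + b.size for W, b in params)

def forward(X, params):
    """ReLU hidden layers followed by a sigmoid output; returns shape (n,)."""
    a = X
    for W, b in params[:-1]:
        a = np.maximum(0.0, a @ W + b)   # ReLU
    W, b = params[-1]
    z = (a @ W + b)[:, 0]
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid -> probability of spam

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

params = init_params(LAYER_SIZES)
print("Parameters:", count_params(params))  # (57*32+32) + (32*16+16) + (16+1) = 2401

X_demo = np.random.default_rng(1).normal(size=(4, 57))  # stand-in for scaled emails
probs = forward(X_demo, params)
print("Loss on random data:", binary_cross_entropy(np.array([0, 1, 1, 0]), probs))
```

The parameter count should match the total reported by `model.summary()`.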

## Step 5: Train with Early Stopping

Without early stopping, this model overfits quickly: training accuracy keeps climbing while validation accuracy plateaus. The clearest signal is in the loss curves: **validation loss starts rising while training loss keeps dropping.**

Early stopping monitors validation loss and stops training when it stops improving. `patience=5` means we wait 5 epochs without improvement before stopping. `restore_best_weights=True` keeps the best model, not the last one.

In [None]:
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

history = model.fit(
    X_train_scaled, y_train,
    epochs=100,
    batch_size=32,
    validation_data=(X_val_scaled, y_val),
    callbacks=[early_stopping]
)
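
The patience logic can be sketched in a few lines of plain Python. This is a simplified model of the behaviour described above (no `min_delta`, no weight handling), and `early_stopping_trace` and the loss values are hypothetical.

```python
def early_stopping_trace(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs pass without the
    validation loss beating the best value seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement until epoch 2, then a slow rise
demo_losses = [0.50, 0.40, 0.30, 0.31, 0.32, 0.33, 0.34, 0.35, 0.20]
stop_epoch, best_epoch = early_stopping_trace(demo_losses, patience=5)
print(stop_epoch, best_epoch)  # 7 2
```

Note that the final value of 0.20 is never reached: training has already stopped at epoch 7, and with `restore_best_weights=True` the weights from epoch 2 would be the ones kept. A larger `patience` trades extra training time for robustness against temporary plateaus.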

## Visualize Training

Let's plot the training curves to see how training and validation metrics evolved.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Accuracy
ax1.plot(history.history['accuracy'], label='Training')
ax1.plot(history.history['val_accuracy'], label='Validation')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.set_title('Model Accuracy')
ax1.legend()

# Loss
ax2.plot(history.history['loss'], label='Training')
ax2.plot(history.history['val_loss'], label='Validation')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.set_title('Model Loss')
ax2.legend()

plt.tight_layout()
plt.show()
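
Alongside the plots, the same information can be summarized numerically. A small sketch, assuming a dictionary shaped like `history.history` with `loss` and `val_loss` lists; `summarize_history` is a hypothetical helper and the numbers below are made up purely to show the output format.

```python
def summarize_history(history_dict):
    """Find the epoch with the lowest validation loss and the train/val gaps."""
    train_loss = history_dict["loss"]
    val_loss = history_dict["val_loss"]
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best + 1,  # 1-indexed, like the Keras progress output
        "best_val_loss": val_loss[best],
        "gap_at_best": val_loss[best] - train_loss[best],
        "gap_at_end": val_loss[-1] - train_loss[-1],
    }

# Made-up curves (not results from this notebook)
demo_history = {
    "loss": [0.60, 0.40, 0.30, 0.22, 0.18],
    "val_loss": [0.62, 0.45, 0.38, 0.40, 0.43],
}
print(summarize_history(demo_history))
```

In the notebook this would be called as `summarize_history(history.history)`. A gap that keeps widening after the best epoch is the numerical counterpart of the diverging curves in the loss plot.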

## Step 6: Final Evaluation

Now we test on the held-out test set - data the model has never seen during training or hyperparameter tuning.

In [None]:
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test)
print(f"\nTest Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
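
Accuracy alone hides which kind of mistake the model makes, and for a spam filter a legitimate email flagged as spam is usually worse than a missed spam. As a possible follow-up, here is a minimal NumPy sketch of precision and recall computed from predicted probabilities. `binary_metrics` is a hypothetical helper, the 0.5 threshold is an assumption, and the example values are made up.

```python
import numpy as np

def binary_metrics(y_true, probs, threshold=0.5):
    """Confusion-matrix counts plus accuracy, precision and recall."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(probs) > threshold
    tp = int(np.sum(y_pred & y_true))    # spam correctly flagged
    fp = int(np.sum(y_pred & ~y_true))   # legitimate mail flagged as spam
    fn = int(np.sum(~y_pred & y_true))   # spam that slipped through
    tn = int(np.sum(~y_pred & ~y_true))  # legitimate mail left alone
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels and probabilities for illustration
y_demo = [1, 0, 1, 1, 0, 0, 1, 0]
probs_demo = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]
print(binary_metrics(y_demo, probs_demo))
```

In the notebook, the probabilities would come from `model.predict(X_test_scaled).ravel()` and the labels from `y_test`.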

## Learnings

The first training run showed clear overfitting. Training accuracy kept climbing to 98% while validation accuracy plateaued around 93%. More telling was the validation loss - it dropped to ~0.17 around epoch 13, then started rising again while training loss kept decreasing. Classic overfitting.

**What I tried:**

1. **Smaller model (32-16 -> 16-8 neurons):** Helped a bit - the overfitting was less aggressive, but still there. Validation loss still crept up after epoch 17.

2. **L2 regularization:** Added weight penalties to discourage the model from relying too heavily on specific features. Didn't really improve things, and the overall validation loss was actually slightly worse.

3. **Early stopping:** Instead of trying to prevent the model from overfitting, just stop training when it starts. Monitor validation loss, and if it doesn't improve for 5 epochs, stop and keep the best weights. Simple and effective.
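
For reference, the weight penalty from item 2 can be sketched as follows: L2 regularization adds the sum of squared weights, scaled by a strength factor, to the training loss. The strength `lam` and the tiny weight matrices are made-up illustration values, not the settings used in the experiment, and `l2_penalty` is a hypothetical helper.

```python
import numpy as np

def l2_penalty(weight_matrices, lam):
    """L2 penalty: lam times the sum of squared weights (biases excluded)."""
    return lam * sum(float(np.sum(W ** 2)) for W in weight_matrices)

# Two tiny weight matrices standing in for the layers of a network
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])  # squared sum = 5.25
W2 = np.array([[3.0], [-1.0]])            # squared sum = 10.0
data_loss = 0.20                          # hypothetical cross-entropy value

penalty = l2_penalty([W1, W2], lam=0.01)
print(f"penalty = {penalty:.4f}, total loss = {data_loss + penalty:.4f}")
# penalty = 0.1525, total loss = 0.3525
```

Because the penalty is part of the reported loss, a regularized run can show a higher validation loss than an unregularized one even when its predictions are no worse, so loss values from the two runs are not directly comparable.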

**The takeaway:** The model without any tricks achieved its best validation performance around epoch 12-15. The regularization attempts were indirect ways of limiting the damage from training past that point. Early stopping addresses it directly, without needing to tune regularization strength or layer sizes.

## Possible Improvements

- Build a text preprocessor that takes raw email content and extracts the 57 features, so you can classify actual emails instead of pre-processed data
- Try different architectures (more/fewer layers, different neuron counts)
- Experiment with dropout regularization
- Use a more modern approach with word embeddings instead of manual feature engineering
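
As a starting point for the first idea, here is a rough sketch of how a few of the Spambase-style features could be computed from raw text. It follows the definitions in the dataset documentation (frequencies as percentages, capital-letter run lengths), but the tokenization and case handling are assumptions, and all function names are hypothetical.

```python
import re

def word_freq(text, word):
    """Percentage of words in the text that match `word`."""
    # Assumption: words are runs of alphanumeric characters, compared case-insensitively
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word.lower()) / len(tokens)

def char_freq(text, char):
    """Percentage of characters in the text that match `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_run_stats(text):
    """Average, longest and total length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

email = "FREE money! Get your free MONEY now!!"
print(f"word_freq_free: {word_freq(email, 'free'):.2f}")
print(f"char_freq_exclamation: {char_freq(email, '!'):.2f}")
print("capital runs (avg, longest, total):", capital_run_stats(email))
```

A full preprocessor would loop over all 48 words and 6 characters in the same order as `feature_names`, then apply the scaler that was fitted on the training data before calling the model.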