# CA02: Email Spam Classifier using Naive Bayes

## Assignment Overview
This notebook implements a **supervised machine learning spam email classifier** using the **Naive Bayes algorithm**.

### Objectives:
1. **Train** a Naive Bayes model on 702 labeled emails (351 spam, 351 non-spam)
2. **Test** the model on 260 new emails
3. **Evaluate** the model's accuracy by comparing predictions with actual labels

### Dataset Structure:
- **Training Data**: `./train-mails/` - 702 emails for model training
- **Test Data**: `./test-mails/` - 260 emails for model evaluation
- **File Naming Convention**:
  - Non-spam: `number-numbermsg[number].txt` (e.g., `3-1msg1.txt`)
  - Spam: `spmsg[Number].txt` (e.g., `spmsga162.txt`)
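
The naming convention maps directly to a label. A minimal sketch, assuming the `spmsg` prefix is the only signal needed; `label_from_filename` is a hypothetical helper used for illustration, not part of the assignment code:

```python
import os

def label_from_filename(path):
    """Return 1 for spam, 0 for non-spam, based on the file name prefix."""
    # basename strips any leading directories using the platform's separator
    return 1 if os.path.basename(path).startswith("spmsg") else 0

print(label_from_filename(os.path.join("train-mails", "spmsga162.txt")))
print(label_from_filename(os.path.join("train-mails", "3-1msg1.txt")))
```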

### Machine Learning Workflow:
1. **Data Preparation**: Extract and clean email text
2. **Feature Engineering**: Create word frequency matrix using top 3000 words
3. **Model Training**: Train Gaussian Naive Bayes classifier
4. **Prediction**: Classify test emails
5. **Evaluation**: Calculate accuracy score

---

**IMPORTANT NOTE:**
- The data folders must be located at `./train-mails` and `./test-mails`
- This ensures code portability across different systems
- Do not modify the original data files or folder names

## Step 1: Import Required Libraries

We import the following libraries:
- **os**: For file system operations (reading directories, file paths)
- **numpy**: For numerical operations and matrix handling
- **Counter**: From collections, to count word frequencies efficiently
- **GaussianNB**: The Naive Bayes classifier from scikit-learn
- **accuracy_score**: To evaluate model performance

In [None]:
# Standard library imports
import os
import numpy as np
from collections import Counter

# Machine Learning imports from scikit-learn
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

## Step 2: Create Dictionary Function

### Purpose:
This function creates a **dictionary of the 3000 most frequent words** from all training emails.

### Why we need this:
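- Reduces dimensionality (instead of one feature per unique word in the corpus, we keep only 3000)
- Removes noise by filtering out rare words, non-alphabetic tokens, and single-character tokens (stop words are not removed; because they are frequent, they usually stay in the dictionary)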
- Creates a consistent feature space for the machine learning model

### Process:
1. Read all email files from the training directory
2. Extract all words from all emails
3. Count word frequencies using Counter
4. Remove non-alphabetic words (numbers, punctuation, symbols)
5. Remove single-character words (typically not meaningful)
6. Return the 3000 most common words
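
The counting and filtering steps can be sketched on a toy word list. This is an illustration only: the word list is made up, and the cutoff is 2 instead of 3000 so the result is easy to read.

```python
from collections import Counter

# Toy stand-in for the words read from the training emails
words = ["free", "free", "money", "2024", "a", "hello", "hello", "hello", "win!"]

counts = Counter(words)

# Keep only alphabetic tokens longer than one character
filtered = Counter({w: c for w, c in counts.items() if w.isalpha() and len(w) > 1})

top_two = filtered.most_common(2)  # the notebook uses 3000 here
print(top_two)  # [('hello', 3), ('free', 2)]
```

Note that `most_common` returns `(word, count)` tuples, which is why the later code reads the word itself from position 0 of each entry.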

In [None]:
def make_Dictionary(root_dir):
    """
    Creates a dictionary of the 3000 most frequent valid words from training emails.

    Parameters:
    -----------
    root_dir : str
        Path to the directory containing training email files

    Returns:
    --------
    list of tuples
        List of (word, frequency) tuples for the 3000 most common words
    """

    all_words = []  # List to store all words from all emails

    # Create list of full file paths for all files in the directory
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]

    # Iterate through each email file
    for mail in emails:
        with open(mail) as m:  # Open each email file
            for line in m:  # Read line by line
                words = line.split()  # Split line into words (whitespace delimiter)
                all_words += words  # Add words to our master list

    # Count frequency of each word using Counter (dictionary subclass)
    dictionary = Counter(all_words)

    # Snapshot the keys so entries can be deleted while iterating
    list_to_remove = list(dictionary)

    # Clean the dictionary by removing unwanted items
    for item in list_to_remove:
        # Drop tokens that contain non-alphabetic characters (digits, punctuation)
        # or that are a single character long (usually not meaningful)
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]

    # Get the 3000 most common words as a list of (word, count) tuples
    dictionary = dictionary.most_common(3000)

    return dictionary

## Step 3: Feature Extraction Function

### Purpose:
This function converts emails into a **numerical feature matrix** that can be used by machine learning algorithms.

### How it works:
- Creates a matrix where:
  - Each **row** represents one email
  - Each **column** represents one word from our 3000-word dictionary
  - Each **cell value** is the frequency of that word in that email

### Process:
1. Read all email files from the specified directory
2. For each email:
   - Extract words from line 3 (actual content, skipping subject and blank line)
   - Count how many times each dictionary word appears
   - Store counts in the feature matrix
   - Label the email as spam (1) or non-spam (0) based on filename
3. Return the feature matrix and labels

### Example:
If the word "free" is at index 42 in the dictionary and appears 3 times in the email stored at row 5:
- `features_matrix[5, 42] = 3`
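
The same idea on a toy scale, as a sketch rather than the notebook's implementation. `toy_dictionary` and `toy_emails` are made-up data, and the word-to-column map is an assumed alternative to scanning the dictionary list for every word:

```python
import numpy as np
from collections import Counter

# Hypothetical 4-word dictionary in the same (word, count) format
toy_dictionary = [("free", 9), ("money", 7), ("meeting", 5), ("report", 4)]
toy_emails = ["free money free free", "meeting report meeting"]

# Map each word to its column once, instead of searching the list per word
word_index = {word: col for col, (word, _) in enumerate(toy_dictionary)}

features = np.zeros((len(toy_emails), len(toy_dictionary)))
for row, text in enumerate(toy_emails):
    for word, n in Counter(text.split()).items():
        col = word_index.get(word)
        if col is not None:  # words outside the dictionary are skipped
            features[row, col] = n

print(features)
```

Row 0 ends up with a 3 in the "free" column and a 1 in the "money" column; row 1 has a 2 for "meeting" and a 1 for "report".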

In [None]:
def extract_features(mail_dir):
    """
    Extracts features from emails and creates a numerical feature matrix.

    Parameters:
    -----------
    mail_dir : str
        Path to directory containing email files

    Returns:
    --------
    tuple
        (features_matrix, labels)
        - features_matrix: numpy array of shape (n_emails, 3000) with word frequencies
        - labels: numpy array of shape (n_emails,) with 0 for non-spam, 1 for spam
    """

    # Get list of all file paths in the directory
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]

    # Initialize feature matrix: rows = emails, columns = 3000 dictionary words
    # All values start at 0 (word not present)
    features_matrix = np.zeros((len(files), 3000))

    # Initialize labels array: 0 = non-spam, 1 = spam
    train_labels = np.zeros(len(files))

    docID = 0  # Document ID (row index in feature matrix)

    # Process each email file
    for fil in files:
        with open(fil) as fi:
            # Enumerate through lines to identify line 3 (line_no == 2, zero-indexed)
            for line_no, line in enumerate(fi):
                if line_no == 2:  # Line 3 contains the actual email content
                    words = line.split()  # Split content into words

                    # For each word in this email
                    for word in words:
                        # Search for this word in the global `dictionary`
                        # (must be built with make_Dictionary before this call)
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:  # d[0] is the word, d[1] is frequency
                                # Count occurrences and store in feature matrix
                                features_matrix[docID, wordID] = words.count(word)
                    break  # Nothing after line 3 is used

        # Parse filename to determine if email is spam.
        # os.path.basename uses the platform's path separator, so this also
        # works on Windows, where splitting on '/' would fail.
        lastToken = os.path.basename(fil)

        # Filenames starting with "spmsg" are spam (1); all others keep the default 0
        if lastToken.startswith("spmsg"):
            train_labels[docID] = 1  # Label as spam

        docID = docID + 1  # Move to next email (next row)

    return features_matrix, train_labels

## Step 4: Define Data Paths

### Directory Structure:
Set the paths to training and test data folders using **relative paths**.

### Why relative paths?
- Ensures code portability across different computers
- Peer reviewers and instructors can run the code without modifications
- Works regardless of the absolute location on the file system

### Expected folder structure:
```
.
├── CA02_NB_assignment.ipynb
├── train-mails/
│   ├── 1-1msg1.txt
│   ├── spmsga1.txt
│   └── ...
└── test-mails/
    ├── 1-1msg1.txt
    ├── spmsga1.txt
    └── ...
```
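
Before running the later cells, it can help to confirm that both folders are where the notebook expects them. A minimal sketch; `missing_dirs` is a hypothetical helper, not part of the assignment code:

```python
import os

def missing_dirs(paths):
    """Return the subset of `paths` that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]

missing = missing_dirs(["./train-mails", "./test-mails"])
if missing:
    print("Missing data folders:", ", ".join(missing))
```

If a folder is listed here, the feature extraction step will fail with a `FileNotFoundError` when it tries to list that directory.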

In [None]:
# Define relative paths to data directories
# './' is the current working directory, normally the folder containing this notebook
TRAIN_DIR = './train-mails'  # Training data: 702 emails
TEST_DIR = './test-mails'    # Test data: 260 emails

## Step 5: Data Preparation and Feature Extraction

### What happens in this step:

1. **Create Dictionary**:
   - Analyze all training emails
   - Extract the 3000 most common valid words
   - This becomes our feature set (vocabulary)

2. **Extract Training Features**:
   - Convert 702 training emails into a 702×3000 matrix
   - Each cell contains word frequency
   - Create corresponding labels (spam vs. non-spam)

3. **Extract Test Features**:
   - Convert 260 test emails into a 260×3000 matrix
   - Use the SAME dictionary created from training data
   - Create corresponding labels for evaluation

### Why use the same dictionary for test data?
- Machine learning models need consistent features
- Test data must have the same structure as training data
- The model was trained on these 3000 words, so any word outside the dictionary is ignored at prediction time
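
A small sketch of what a shared dictionary means in practice. The three-word `vocabulary` and the helper `to_vector` are assumptions for illustration, not the notebook's code:

```python
# Vocabulary fixed at training time (hypothetical, 3 words instead of 3000)
vocabulary = ["free", "money", "meeting"]

def to_vector(text, vocabulary):
    """Count each vocabulary word in `text`; other words are ignored."""
    tokens = text.split()
    return [tokens.count(word) for word in vocabulary]

train_vec = to_vector("free money free", vocabulary)
# "lottery" never appeared in training, so it has no column and is dropped
test_vec = to_vector("free lottery lottery", vocabulary)

print(train_vec)  # [2, 1, 0]
print(test_vec)   # [1, 0, 0]
```

Both vectors have the same length and column meaning, which is what lets a model fitted on the training matrix score the test matrix.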

In [None]:
# Step 1: Build the dictionary from training emails
# This analyzes all training data and selects the 3000 most frequent valid words
dictionary = make_Dictionary(TRAIN_DIR)

print("reading and processing emails from TRAIN and TEST folders")

# Step 2: Extract features from training data
# Converts training emails to numerical matrix and extracts labels
features_matrix, labels = extract_features(TRAIN_DIR)

# Step 3: Extract features from test data
# Uses the SAME dictionary to ensure consistency
test_features_matrix, test_labels = extract_features(TEST_DIR)

FileNotFoundError: [Errno 2] No such file or directory: './train-mails'

## Step 6: Model Training, Prediction, and Evaluation

### Machine Learning Pipeline:

#### 1. **Model Training** (Gaussian Naive Bayes)
   - **Algorithm**: Gaussian Naive Bayes assumes that, within each class, every feature follows a normal distribution
   - **Training data**: 702 emails with their word frequency features
   - **Learning process**: The model estimates the mean and variance of each word's count separately for spam and non-spam emails, along with the prior probability of each class
   - **Why Naive Bayes?**
     - Works well with text classification
     - Fast training and prediction
     - Handles high-dimensional data efficiently
     - "Naive" assumption: treats word counts as conditionally independent given the class (a simplification that works well in practice)

#### 2. **Prediction**
   - Apply trained model to test data (260 emails)
   - Model calculates probability that each email is spam
   - Assigns label: 1 (spam) or 0 (non-spam)

#### 3. **Evaluation**
   - Compare predicted labels with actual labels
   - Calculate accuracy: (correct predictions) / (total predictions)
   - Expected accuracy: ~96-97%
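
The accuracy formula can be checked by hand on a few labels. The sketch below also counts the two kinds of error, which accuracy alone does not separate; the label lists are made up and `accuracy_and_errors` is a hypothetical helper:

```python
def accuracy_and_errors(actual, predicted):
    """Return (accuracy, false_positives, false_negatives) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    false_positives = sum(a == 0 and p == 1 for a, p in pairs)  # non-spam flagged as spam
    false_negatives = sum(a == 1 and p == 0 for a, p in pairs)  # spam that slipped through
    return correct / len(pairs), false_positives, false_negatives

# Made-up labels for illustration only
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

acc, fp, fn = accuracy_and_errors(actual, predicted)
print(acc, fp, fn)  # 0.75 1 1
```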

### Mathematical Intuition:
Naive Bayes calculates: P(Spam | Email) using Bayes' Theorem:
- P(Spam | Email) = P(Email | Spam) × P(Spam) / P(Email)
- Classifies as spam if P(Spam | Email) > P(Not Spam | Email); because P(Email) is the same for both classes, only the numerators need to be compared
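
To make the formula concrete, here is a sketch for a single feature (the count of one word) with Gaussian likelihoods. All numbers are invented for illustration, and the helper names are hypothetical; a real model multiplies the likelihoods of all 3000 features.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_spam(x, prior_spam, spam_stats, ham_stats):
    """P(Spam | x) for one feature; *_stats are (mean, variance) pairs."""
    joint_spam = prior_spam * gaussian_pdf(x, *spam_stats)
    joint_ham = (1 - prior_spam) * gaussian_pdf(x, *ham_stats)
    # The denominator P(x) is the sum of both joint probabilities
    return joint_spam / (joint_spam + joint_ham)

# Invented statistics: the word averages 3 occurrences in spam, 0.5 in non-spam
p = posterior_spam(x=3, prior_spam=0.5, spam_stats=(3.0, 1.0), ham_stats=(0.5, 1.0))
print(round(p, 3))
print("spam" if p > 0.5 else "non-spam")
```

With these made-up statistics, an email containing the word three times is far more probable under the spam distribution, so the posterior lands well above 0.5.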

In [None]:
# TRAINING PHASE
print("Training Model using Gaussian Naive Bayes algorithm .....")

# Initialize the Gaussian Naive Bayes classifier
model = GaussianNB()

# Train the model on training features and labels
# features_matrix: 702 emails × 3000 word frequencies
# labels: 702 labels (0 or 1)
model.fit(features_matrix, labels)

print("Training completed")

# PREDICTION PHASE
print("testing trained model to predict Test Data labels")

# Use the trained model to predict labels for test emails
# Input: test_features_matrix (260 emails × 3000 word frequencies)
# Output: predicted_labels (260 predictions: 0 or 1)
predicted_labels = model.predict(test_features_matrix)

# EVALUATION PHASE
print("Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:")

# Calculate accuracy: (number of correct predictions) / (total predictions)
# Compares predicted_labels with actual test_labels
accuracy = accuracy_score(test_labels, predicted_labels)

# Print the accuracy score (expected: ~0.965 or 96.5%)
print(accuracy)

---

## Results Interpretation

### Expected Output:
- **Accuracy Score**: ~0.9654 (96.54%)

### What this means:
- The model correctly classified approximately **251 out of 260** test emails
- Only about **9 emails** were misclassified
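- This is strong performance for such a simple model, although accuracy alone does not show whether the errors are missed spam or legitimate emails flagged as spam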

### Model Performance Summary:
- **High accuracy** indicates the Naive Bayes algorithm works well for spam detection on this dataset
- The 3000-word feature space appears sufficient to capture spam characteristics here
- Word frequency is a strong indicator of whether an email is spam or legitimate

### Potential Improvements:
1. Use TF-IDF instead of raw word counts
2. Include bigrams or trigrams (word pairs/triplets)
3. Try other algorithms (SVM, Random Forest, Neural Networks)
4. Perform cross-validation for more robust evaluation
5. Add subject line as a separate feature
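
As a rough sketch of improvements 1 and 4 combined, the snippet below applies TF-IDF weighting and 5-fold cross-validation using scikit-learn. The count matrix is synthetic, and `MultinomialNB` is swapped in for `GaussianNB` as an assumption, since it is a common pairing with TF-IDF features; neither choice comes from the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for the count matrix: 40 "emails", 6 "words"
ham = rng.poisson(lam=[3, 3, 3, 0.2, 0.2, 0.2], size=(20, 6))
spam = rng.poisson(lam=[0.2, 0.2, 0.2, 3, 3, 3], size=(20, 6))
X = np.vstack([ham, spam])
y = np.array([0] * 20 + [1] * 20)

# TF-IDF reweights raw counts; the pipeline refits it inside each fold
pipeline = make_pipeline(TfidfTransformer(), MultinomialNB())
scores = cross_val_score(pipeline, X, y, cv=5)

print(scores.mean())
```

With the real data, `X` and `y` would be replaced by `features_matrix` and `labels`; the mean and spread of `scores` give a more robust estimate than a single train/test split.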

---

======================= END OF PROGRAM =========================