CA02: This is an email spam classifier that uses the Naive Bayes supervised machine learning algorithm.

Jerry and Nicholson

Assignment Summary:

# Imports

Purpose:

These imports provide file system access, word counting utilities, numerical arrays, and the Naive Bayes model with an accuracy metric.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!cp -r "/content/drive/MyDrive/CA02-Jerry and Nichlison" /content/


In [None]:
!ls /content


'CA02-Jerry and Nichlison'   drive   sample_data


In [None]:
# Standard library
import os
from collections import Counter

# Third-party libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Part 1: Building the Dictionary

Purpose:

This part of the code is responsible for looking at ALL training emails and deciding which words are important enough to become features for the Naive Bayes model.

The goal is to build a dictionary of words that represents the training data.

In [None]:
def make_dictionary(root_dir, vocab_size=3000):
    # Collect all words from all training emails
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]

    for mail in emails:
        if not os.path.isfile(mail):
            continue
        # Use latin-1 to safely read the legacy email text files
        with open(mail, "r", encoding="latin-1") as m:
            for line in m:
                words = line.lower().split()
                all_words.extend(words)

    # Count frequency of every word
    dictionary = Counter(all_words)

    # Remove tokens that are not useful features
    for item in list(dictionary):
        if (not item.isalpha()) or len(item) == 1:
            del dictionary[item]

    # Keep the top N most common words
    return dictionary.most_common(vocab_size)

## Breakdown of the code logic:

#1 Collect words from every training email.

We list all files in the training folder, skip non-files, open each email, lowercase the text, and append every word into a single list. Lowercasing ensures "Free" and "free" are treated as the same token.

#2 Count word frequency across the corpus.

`Counter(all_words)` computes how many times each word appears in the training set so we can rank words by importance.

#3 Clean the vocabulary.

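We remove any token that is not purely alphabetic and any one-letter token. Because `isalpha()` is applied to whole whitespace-separated tokens, this drops numbers, stand-alone punctuation, and words with punctuation attached (for example `free!`).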

#4 Limit the vocabulary size.

We keep only the top `vocab_size` most frequent words, which reduces noise and keeps the feature matrix manageable.
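
The four steps above can be traced on a tiny in-memory corpus. This is a minimal sketch rather than the notebook's function: the `toy_emails` list and the `top_n` value are made up for illustration, and the emails are plain strings instead of files on disk.

```python
from collections import Counter

# Hypothetical stand-in for the training folder: three short "emails"
toy_emails = [
    "Free money now!!! Call 555 today",
    "Meeting notes: free lunch at noon",
    "Win money , win a prize",
]

# Step 1: lowercase and split every email into whitespace-separated tokens
all_words = []
for text in toy_emails:
    all_words.extend(text.lower().split())

# Step 2: count how often each token occurs across the corpus
counts = Counter(all_words)

# Step 3: drop tokens that are not purely alphabetic or are a single character
for token in list(counts):
    if not token.isalpha() or len(token) == 1:
        del counts[token]

# Step 4: keep the top_n most frequent surviving tokens
top_n = 3  # illustrative value; the notebook uses vocab_size=3000
vocab = counts.most_common(top_n)
print(vocab)
```

Here `free`, `money`, and `win` each occur twice and make up the vocabulary. Tokens such as `now!!!`, `555`, and `notes:` are discarded in step 3 rather than being normalized to `now` or `notes`.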

# Part 2: Extracting Features

Purpose:

This part transforms each email from a text file into a numerical feature vector of word frequencies and assigns a spam or non-spam label using the file naming rule, so the data becomes usable for Naive Bayes classification.

The goal is to produce two outputs, a feature matrix and a label array, which are the inputs required to train and test the Naive Bayes model.

In [None]:
def extract_features(mail_dir, dictionary):
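    # Keep only regular files so every matrix row corresponds to an email,
    # then allocate the output arrays
    files = [
        path
        for path in (os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir))
        if os.path.isfile(path)
    ]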
    vocab_size = len(dictionary)
    features_matrix = np.zeros((len(files), vocab_size), dtype=np.int32)
    train_labels = np.zeros(len(files), dtype=np.int32)

    # Map each dictionary word to its column index
    word_index = {word: i for i, (word, _) in enumerate(dictionary)}

    for doc_id, fil in enumerate(files):
        if not os.path.isfile(fil):
            continue

        # Count word occurrences in the whole email
        with open(fil, "r", encoding="latin-1") as fi:
            word_counts = Counter()
            for line in fi:
                words = line.lower().split()
                word_counts.update(words)

        # Fill the feature vector for this email
        for word, count in word_counts.items():
            idx = word_index.get(word)
            if idx is not None:
                features_matrix[doc_id, idx] = count

        # Label spam based on filename convention
        filename = os.path.basename(fil)
        train_labels[doc_id] = 1 if filename.startswith("spmsg") else 0

    return features_matrix, train_labels

## Breakdown of the code logic:

#1 Initialize arrays for features and labels.

We create a feature matrix with one row per email and one column per dictionary word, plus a label vector to store spam (1) or non-spam (0).

#2 Build a fast lookup table.

`word_index` maps each dictionary word to its column index so we can update the feature matrix in O(1) time per word.

#3 Read each email and count words.

For each file, we read all lines, lowercase the text, and use `Counter` to count how many times each word appears in that email.

#4 Fill the feature matrix.

Only words that exist in the dictionary are written into the feature matrix, keeping the columns consistent across train and test.

#5 Assign labels based on filenames.

Emails whose filenames start with `spmsg` are labeled as spam (1); all others are labeled non-spam (0).
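
The same logic can be followed on two made-up emails. This is a sketch under stated assumptions: `toy_dictionary` and `toy_files` are hypothetical, and the emails are held as `(filename, text)` pairs instead of being read from disk.

```python
from collections import Counter

import numpy as np

# Hypothetical dictionary in the same (word, count) format that make_dictionary returns
toy_dictionary = [("free", 5), ("money", 4), ("meeting", 2)]
word_index = {word: i for i, (word, _) in enumerate(toy_dictionary)}

# Hypothetical (filename, text) pairs standing in for files on disk
toy_files = [
    ("spmsga1.txt", "Free money free prize"),
    ("3-1msg1.txt", "Meeting moved to Monday"),
]

features = np.zeros((len(toy_files), len(toy_dictionary)), dtype=np.int32)
labels = np.zeros(len(toy_files), dtype=np.int32)

for row, (filename, text) in enumerate(toy_files):
    # Count every lowercase token in this email
    word_counts = Counter(text.lower().split())
    for word, count in word_counts.items():
        col = word_index.get(word)
        if col is not None:  # words outside the dictionary are ignored
            features[row, col] = count
    # Same filename rule as the notebook: "spmsg" prefix means spam
    labels[row] = 1 if filename.startswith("spmsg") else 0

print(features)
print(labels)
```

The first row becomes `[2, 1, 0]` with label 1, and the second becomes `[0, 0, 1]` with label 0. Words such as `prize` and `monday` have no column, so they leave the matrix unchanged.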

# Data Paths

Purpose:

This cell sets the directory paths for the training and testing email folders. Adjust these paths if your folder structure is different.
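
Before running the rest of the notebook, it can help to confirm that a folder exists and contains files. The helper below is a hypothetical addition (`count_email_files` is not part of the assignment code), demonstrated on a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def count_email_files(folder):
    """Return the number of regular files in `folder`; raise if it is missing."""
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"Email folder not found: {path.resolve()}")
    return sum(1 for entry in path.iterdir() if entry.is_file())


# Demo on a throwaway folder holding two files and one subfolder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "spmsga1.txt").write_text("free money")
    (Path(tmp) / "3-1msg1.txt").write_text("meeting notes")
    (Path(tmp) / "subfolder").mkdir()
    demo_count = count_email_files(tmp)

print(demo_count)
```

In the notebook, the same check could be applied to each configured path, for example `count_email_files("./train-mails")`, once the working directory has been set.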

In [None]:
# Enter the paths of the training and testing folders
# Update these if your data is stored elsewhere
TRAIN_DIR = "./train-mails"
TEST_DIR = "./test-mails"

In [None]:
print("TRAIN_DIR =", TRAIN_DIR)
print("TEST_DIR  =", TEST_DIR)


TRAIN_DIR = ./train-mails
TEST_DIR  = ./test-mails


In [None]:
!ls "/content/CA02-Jerry and Nichlison"


CA02_NB_assignment_Jerry_Nicholson.ipynb  __MACOSX  test-mails	train-mails


In [None]:
%cd "/content/CA02-Jerry and Nichlison"


/content/CA02-Jerry and Nichlison


In [None]:
TRAIN_DIR = "./train-mails"
TEST_DIR  = "./test-mails"

print("TRAIN_DIR =", TRAIN_DIR)
print("TEST_DIR  =", TEST_DIR)


TRAIN_DIR = ./train-mails
TEST_DIR  = ./test-mails


In [None]:
!ls "./train-mails"
!ls "./test-mails"


3-1msg1.txt	6-21msg2.txt   6-832msg1.txt  spmsga166.txt  spmsgb143.txt
3-1msg2.txt	6-21msg3.txt   6-833msg1.txt  spmsga16.txt   spmsgb144.txt
3-1msg3.txt	6-22msg1.txt   6-89msg1.txt   spmsga17.txt   spmsgb145.txt
3-375msg1.txt	6-241msg2.txt  6-93msg1.txt   spmsga18.txt   spmsgb146.txt
3-378msg1.txt	6-241msg3.txt  6-96msg1.txt   spmsga19.txt   spmsgb147.txt
3-378msg2.txt	6-243msg1.txt  6-97msg1.txt   spmsga1.txt    spmsgb148.txt
3-378msg3.txt	6-245msg1.txt  8-809msg1.txt  spmsga20.txt   spmsgb149.txt
3-378msg4.txt	6-245msg2.txt  8-811msg1.txt  spmsga21.txt   spmsgb14.txt
3-378msg5.txt	6-245msg3.txt  8-811msg2.txt  spmsga22.txt   spmsgb150.txt
3-379msg1.txt	6-248msg1.txt  8-814msg1.txt  spmsga23.txt   spmsgb151.txt
3-379msg2.txt	6-248msg2.txt  8-815msg1.txt  spmsga24.txt   spmsgb152.txt
3-379msg3.txt	6-248msg3.txt  8-817msg1.txt  spmsga25.txt   spmsgb153.txt
3-380msg1.txt	6-249msg1.txt  8-817msg2.txt  spmsga26.txt   spmsgb154.txt
3-380msg2.txt	6-250msg1.txt  8-817msg3.txt  spmsga27.txt  

# Build Feature Matrices

Purpose:

This cell builds the vocabulary using only the training data, then converts both the training and test emails into numeric feature matrices that share the same column order.

Key idea:

The dictionary must be created from training data only to avoid data leakage, and the same dictionary is used for both train and test so the model sees consistent features.
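
The fit-on-train, reuse-on-test pattern can also be illustrated with scikit-learn's `CountVectorizer`. This is a swapped-in technique for illustration only: the notebook uses its own `make_dictionary` and `extract_features`, and the texts below are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test texts
train_texts = ["free money now", "meeting at noon"]
test_texts = ["free lottery money", "noon meeting"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary is learned here only
X_test = vectorizer.transform(test_texts)        # the same vocabulary is reused

# Columns are the training vocabulary in alphabetical order
print(sorted(vectorizer.vocabulary_))
print(X_test.toarray())
```

Both matrices have six columns, one per training word. The test-only word `lottery` has no column and is ignored, so nothing about the test set influences the feature definition.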

In [None]:
# Build the dictionary from training emails only
dictionary = make_dictionary(TRAIN_DIR)

# Convert emails to feature vectors using the same dictionary
print("reading and processing emails from TRAIN and TEST folders")
features_matrix, labels = extract_features(TRAIN_DIR, dictionary)
test_features_matrix, test_labels = extract_features(TEST_DIR, dictionary)

reading and processing emails from TRAIN and TEST folders


# Train and Evaluate the Model

Purpose:

This cell trains a Multinomial Naive Bayes classifier on the training feature matrix, predicts labels for the test set, and prints the accuracy score.

Why Multinomial Naive Bayes:

The features are word counts, so MultinomialNB is a natural choice because it models counts directly.
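
To see what modeling counts means in practice, the sketch below fits `MultinomialNB` on a made-up 4x3 count matrix and prints the per-class word probabilities it learns. The column meanings and all numbers are hypothetical, and the default `alpha=1.0` (Laplace smoothing) is assumed.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts; columns stand for "free", "money", "meeting"
X = np.array([
    [3, 2, 0],  # spam
    [2, 1, 0],  # spam
    [0, 0, 2],  # non-spam
    [0, 1, 3],  # non-spam
])
y = np.array([1, 1, 0, 0])

model = MultinomialNB()  # default alpha=1.0 adds one to every word count
model.fit(X, y)

# exp(feature_log_prob_) is the smoothed P(word | class), one row per class
word_probs = np.exp(model.feature_log_prob_)
print(model.classes_)
print(word_probs.round(3))

# A new email with two "free" and one "money"
new_email = np.array([[2, 1, 0]])
print(model.predict(new_email))
```

For the spam class the summed counts are 5, 3, and 0 (total 8), so with smoothing the probabilities are 6/11, 4/11, and 1/11. The new email is dominated by words that are likely under spam, so it is predicted as class 1.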

In [None]:
print("Training Model using Multinomial Naive Bayes algorithm .....")
# Create and fit the model on training data
model = MultinomialNB()
model.fit(features_matrix, labels)
print("Training completed")

# Predict test labels and report accuracy
print("testing trained model to predict Test Data labels")
predicted_labels = model.predict(test_features_matrix)
print(
    "Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:"
)
print(accuracy_score(test_labels, predicted_labels))

Training Model using Multinomial Naive Bayes algorithm .....
Training completed
testing trained model to predict Test Data labels
Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
0.9615384615384616
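
Accuracy is the fraction of test emails whose predicted label matches the true label. The sketch below shows that calculation alongside a confusion matrix, which separates the two kinds of error. The labels are made up for illustration and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = spam, 0 = non-spam
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])

# Accuracy by hand and via scikit-learn give the same number
manual_accuracy = float(np.mean(y_true == y_pred))
sk_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sk_accuracy)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In this toy case six of eight predictions are correct, and the matrix shows one non-spam email flagged as spam and one spam email missed. Applying `confusion_matrix(test_labels, predicted_labels)` in the notebook would give the same breakdown for the real test set.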


======================= END OF PROGRAM =========================