# Email Spam Detection – Assignment (Set 5)

**Course**: DSECLZG530  
**Dataset**: Email Spam Detection Dataset (classification) – Kaggle  

**Group No**: 45  
**Member Name(s)**:  
- Lakshmi Sahithi Uppu : 2024da04343
- Jitendra Kumar : 2024da04067
- Jyotirvasu Sharma : 2024da04068
- Shiwam Kumar Suman : 2024da04069
---

## Problem Statement

The goal of Part I is to use raw textual data to build a language model for a recommendation-based application.

The goal of Part II is to implement comprehensive preprocessing steps for the given dataset to improve the quality and relevance of the textual information. The preprocessed text is then transformed into a feature-rich representation using a chosen vectorization method, which the application uses for similarity analysis.


**Part I**:  
Use all email texts as a training corpus to build a **Trigram probabilistic language model** and compare the following two test sentences. Recommend which sentence is more relevant to the training corpus.

- Test Sentence 1: *"Please review the attached document for our meeting agenda."*  
- Test Sentence 2: *"Congratulations! You've won a million dollars, click here now!"*

**Part II**:  
Perform sequential tasks:
1. Text Preprocessing (tokenization, lowercasing, stopword removal, stemming, lemmatization).  
2. Feature Extraction using **TF–IDF word embeddings**.  
3. Similarity Analysis: find top two most similar words based on TF–IDF word vectors, justify similarity metric and feature design, and visualize a subset of word embeddings in 2D using **PCA**.


In [0]:
# 
# Imports
# 

import pandas as pd
import numpy as np
import re
import math
from collections import Counter

# ML / Vectorization / Similarity / PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA

# Plotting
import matplotlib.pyplot as plt

In [0]:
# 
# Load dataset from GitHub repo
# Note: The Kaggle dataset was copied from the Kaggle site to a public GitHub repository for easy access.
# 

# Using the raw CSV file link from GitHub
df = pd.read_csv(
    "https://raw.githubusercontent.com/jyotirvasu/Email_Spam_Detection/main/spam.csv",
    encoding="ISO-8859-1"
)
print(df.columns)

# Use the correct column name for the email text
# For this dataset, it is usually 'v2'
df = df[["v2"]].rename(
    columns={"v2": "emails"}
)
df = df.dropna(subset=["emails"])

display(df)

In [0]:
# 
# dataset overview
# 

print("Number of emails in dataset:", len(df))
print("\nSample rows:")
display(df.head())

# Show basic length statistics of emails
email_lengths = df["emails"].astype(str).apply(len)
print("\nEmail length statistics (characters):")
print(email_lengths.describe())


## Part I – Trigram Probabilistic Language Model

We build a **Trigram language model** from the email corpus.

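For a sentence \( w_1, w_2, \ldots, w_n \), we pad the sentence with two start markers `<s>` (as \( w_{-1} \) and \( w_0 \)) and one end marker `</s>` (as \( w_{n+1} \)), so that every word has a two-word history. The probability under a trigram model is then approximated as:

\[
P(w_1^n) \approx \prod_{i=1}^{n+1} P(w_i \mid w_{i-2}, w_{i-1})
\]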

We use **Laplace (add-1) smoothing**:

\[
P(w_i \mid w_{i-2}, w_{i-1}) = 
\frac{\text{count}(w_{i-2}, w_{i-1}, w_i) + 1}
{\text{count}(w_{i-2}, w_{i-1}) + V}
\]

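where \( V \) is the vocabulary size, i.e. the number of distinct token types counted in the training corpus (in the code below this count includes the boundary markers).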

We then compute **log-probabilities** for the two test sentences and recommend the sentence with higher log-probability as **more relevant** to the training corpus.
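
As a small worked illustration of the smoothing formula, the sketch below builds counts from a two-sentence toy corpus and evaluates one seen and one unseen trigram. The toy sentences and the names `toy_corpus`, `toy_V` and `laplace_trigram_prob` are assumptions introduced for this example; they are not part of the email dataset or of the notebook's model.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
toy_corpus = ["please call me now", "please call me later"]

toy_context_counts = Counter()   # count(w1, w2) as a trigram context
toy_trigram_counts = Counter()   # count(w1, w2, w3)
toy_vocab = set()

for sentence in toy_corpus:
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    toy_vocab.update(tokens)
    for i in range(2, len(tokens)):
        toy_context_counts[(tokens[i - 2], tokens[i - 1])] += 1
        toy_trigram_counts[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1

toy_V = len(toy_vocab)  # 7 types: <s>, </s>, please, call, me, now, later

def laplace_trigram_prob(w1, w2, w3):
    """Add-1 smoothed P(w3 | w1, w2) on the toy counts."""
    return (toy_trigram_counts[(w1, w2, w3)] + 1) / (toy_context_counts[(w1, w2)] + toy_V)

p_seen = laplace_trigram_prob("call", "me", "now")       # (1 + 1) / (2 + 7) = 2/9
p_unseen = laplace_trigram_prob("call", "me", "please")  # (0 + 1) / (2 + 7) = 1/9
print(p_seen, p_unseen)
```

The unseen trigram still receives a non-zero probability, which is what keeps the log-probability of a test sentence finite.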


In [0]:
# 
# Part I: Trigram Language Model
# 

START = "<s>"
END = "</s>"

def clean_text_basic(text):
    """Lowercase + keep only alphanumeric & space."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["clean_emails"] = df["emails"].apply(clean_text_basic)

unigram_counts = Counter()
bigram_counts = Counter()
trigram_counts = Counter()

def tokenize_sent(text):
    return text.split()

# Build trigram counts from entire corpus
for email in df["clean_emails"]:
    tokens = tokenize_sent(email)
    tokens = [START, START] + tokens + [END]
    for i in range(len(tokens)):
        unigram_counts[(tokens[i],)] += 1
        if i >= 1:
            bigram_counts[(tokens[i-1], tokens[i])] += 1
        if i >= 2:
            trigram_counts[(tokens[i-2], tokens[i-1], tokens[i])] += 1

V = len(unigram_counts)  # vocabulary size
print("Vocabulary size (unigrams):", V)

def trigram_prob(w1, w2, w3):
    """Laplace-smoothed trigram probability P(w3 | w1, w2)."""
    trig = (w1, w2, w3)
    big = (w1, w2)
    num = trigram_counts[trig] + 1
    denom = bigram_counts[big] + V
    return num / denom

def sentence_log_prob(sentence):
    """Return log10 probability of a sentence under trigram LM."""
    s = clean_text_basic(sentence)
    tokens = s.split()
    tokens = [START, START] + tokens + [END]
    log_p = 0.0
    for i in range(2, len(tokens)):
        w1, w2, w3 = tokens[i-2], tokens[i-1], tokens[i]
        p = trigram_prob(w1, w2, w3)
        log_p += math.log10(p)
    return log_p

# Test sentences
test_sentence_1 = "Please review the attached document for our meeting agenda."
test_sentence_2 = "Congratulations! You've won a million dollars, click here now!"

log_p1 = sentence_log_prob(test_sentence_1)
log_p2 = sentence_log_prob(test_sentence_2)

print("Log P(Test Sentence 1):", log_p1)
print("Log P(Test Sentence 2):", log_p2)

if log_p1 > log_p2:
    print("\nRecommended: Test Sentence 1 is more probable under the trigram model.")
elif log_p2 > log_p1:
    print("\nRecommended: Test Sentence 2 is more probable under the trigram model.")
else:
    print("\nBoth test sentences are equally probable under the trigram model.")


### Interpretation (Part I)

- We trained a **trigram language model** using all emails as the training corpus.  
- We applied **Laplace smoothing** to handle unseen trigrams.  
- For each test sentence, we computed the **log-probability** under this model.  
- The **sentence with higher log-probability** is recommended as more relevant to the training corpus.  
- The printed result above clearly states which test sentence is recommended by the model.
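
Raw log-probability becomes more negative with every additional token, and the two test sentences differ in length (9 and 10 tokens after cleaning), so a per-token measure can serve as a sanity check on the recommendation. The sketch below converts a log10 probability into perplexity. The function name `perplexity_from_log10` and the numbers in the usage lines are hypothetical; they are not results from this notebook.

```python
def perplexity_from_log10(log10_prob: float, num_predictions: int) -> float:
    """Perplexity = 10 ** (-(1/N) * log10 P), where N is the number of predicted tokens."""
    if num_predictions <= 0:
        raise ValueError("num_predictions must be positive")
    return 10 ** (-log10_prob / num_predictions)

# Hypothetical values: the longer sentence has a lower total log-probability,
# but both sentences have the same average per-token probability.
ppl_short = perplexity_from_log10(-30.0, 10)
ppl_long = perplexity_from_log10(-33.0, 11)
print(ppl_short, ppl_long)
```

In the notebook, `num_predictions` would be the number of tokens in the cleaned sentence plus one for `</s>`, matching the loop in `sentence_log_prob`. Lower perplexity means the sentence fits the corpus better.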


## Part II – Text Preprocessing, TF–IDF Features & Similarity Analysis

We now:

1. Perform **text preprocessing** (tokenization, lowercasing, stopword removal, stemming, lemmatization).  
2. Use **TF–IDF** to obtain word embeddings.  
3. Compute **similarity between word vectors**, identify top two most similar words, and  
4. Visualize a subset of word embeddings in **2D using PCA**.


In [0]:
# 
# Part II (i): Text Preprocessing (NLTK-free)
# 

import re

# Simple English stopword list (enough to demonstrate stopword removal)
basic_stopwords = {
    "the","a","an","and","or","if","in","on","for","to","of","is","are","am","was","were",
    "this","that","these","those","it","as","at","by","with","from","be","have","has","had",
    "you","your","yours","me","my","we","our","they","their","them","he","she","his","her"
}

def simple_tokenize(text: str):
    """Tokenization + lowercasing using regex."""
    text = str(text).lower()
    tokens = re.findall(r"[a-z]+", text)
    return tokens

def simple_stem(word: str) -> str:
    """Very crude stemmer: strips common suffixes for demonstration."""
    for suf in ["ing", "edly", "ed", "ly", "es", "s"]:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def simple_lemmatize(word: str) -> str:
    """Very simple lemmatizer: handles some plural forms and irregulars."""
    irregular = {
        "mice": "mouse",
        "children": "child",
        "geese": "goose",
        "men": "man",
        "women": "woman",
        "teeth": "tooth",
        "feet": "foot"
    }
    if word in irregular:
        return irregular[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("ses") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def full_preprocess_tokens(text: str):
    """Return tokens at each preprocessing stage for demonstration."""
    tokens = simple_tokenize(text)
    no_stop = [t for t in tokens if t not in basic_stopwords]
    stemmed = [simple_stem(t) for t in no_stop]
    lemmatized = [simple_lemmatize(t) for t in no_stop]
    return {
        "tokens": tokens,
        "no_stop": no_stop,
        "stemmed": stemmed,
        "lemmatized": lemmatized
    }

def preprocess_for_model(text: str) -> str:
    """Final preprocessed text used for TF–IDF: tokenize, remove stopwords, lemmatize."""
    tokens = simple_tokenize(text)
    tokens = [t for t in tokens if t not in basic_stopwords]
    tokens = [simple_lemmatize(t) for t in tokens]
    return " ".join(tokens)

# Apply preprocessing to whole dataset
df["processed_text"] = df["emails"].apply(preprocess_for_model)
df[["emails", "processed_text"]].head()


In [0]:
# Demonstration of all preprocessing steps on a sample email
sample_text = df['emails'].iloc[0]
print("Original email:\n", sample_text)

processed = full_preprocess_tokens(sample_text)
print("\nTokens:", processed['tokens'])
print("\nAfter stopword removal:", processed['no_stop'])
print("\nAfter stemming:", processed['stemmed'])
print("\nAfter lemmatization:", processed['lemmatized'])

### Explanation (Text Preprocessing)

- **Tokenization** splits each email into words (tokens), enabling word-level analysis.  
- **Lowercasing** ensures that “Free” and “free” are treated as the same word.  
- **Stopword removal** removes highly frequent but non-informative words like “the”, “and”.  
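- **Stemming** crudely strips suffixes to approximate a root (e.g., *played* → *play*), merging inflected forms. The `simple_stem` function used here removes only a handful of suffixes, so it can produce non-words (e.g., *running* → *runn*).  
- **Lemmatization** maps words to their dictionary form (e.g., *children* → *child*), giving cleaner normalized tokens. A dictionary-based lemmatizer would also handle cases such as *better* → *good*; the `simple_lemmatize` function used here covers only regular plurals and a few irregular nouns.  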

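These steps **reduce noise**, handle **morphological variation**, and improve the quality of features for TF–IDF and similarity analysis. Stemming is shown for demonstration only: the text passed to TF–IDF (`preprocess_for_model`) is tokenized, stopword-filtered and lemmatized.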
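
To illustrate the noise-reduction claim, the sketch below tracks how the number of distinct tokens shrinks at each stage on two toy sentences. The sentences, the two-word stopword set and the helper `strip_plural` are hypothetical stand-ins for this example, not the notebook's own functions.

```python
import re

# Toy texts and a tiny stopword set (hypothetical, for illustration only)
texts = ["Free prizes! Claim the prize now", "The meetings and the meeting agenda"]
stopwords = {"the", "and"}

def strip_plural(word: str) -> str:
    """Drop a trailing 's' from longer words, leaving words ending in 'ss' intact."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

# Stage 0: naive whitespace split keeps case and punctuation
raw_vocab = {t for text in texts for t in text.split()}
# Stage 1: regex tokenization + lowercasing
token_vocab = {t for text in texts for t in re.findall(r"[a-z]+", text.lower())}
# Stage 2: stopword removal
content_vocab = {t for t in token_vocab if t not in stopwords}
# Stage 3: plural normalization
normalized_vocab = {strip_plural(t) for t in content_vocab}

print(len(raw_vocab), len(token_vocab), len(content_vocab), len(normalized_vocab))
```

Each stage merges or removes token types, so the TF–IDF matrix built afterwards has fewer and denser columns.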


In [0]:
# 
# Part II (ii): TF–IDF Feature Extraction
# 

corpus = df["processed_text"].tolist()

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

print("TF–IDF matrix shape (num_docs x num_features):", tfidf_matrix.shape)
print("Vocabulary size:", len(tfidf_vectorizer.vocabulary_))


### Explanation (TF–IDF)

- **TF–IDF (Term Frequency–Inverse Document Frequency)** gives higher weight to words that:  
  - appear frequently in a given document (high TF),  
  - but are rare across other documents (high IDF).  
- This highlights discriminative words (e.g., “free”, “win”, “click”) which are useful for spam detection.  
- Each email is represented as a **high-dimensional sparse vector**, where each dimension corresponds to a word in the vocabulary.
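
To make the weighting concrete, the sketch below computes TF–IDF by hand for a three-document toy corpus. It follows the formula that scikit-learn documents for the `TfidfVectorizer` defaults, namely `idf = ln((1 + n) / (1 + df)) + 1` with raw term counts, followed by L2 normalization of each document vector. The toy documents and the helpers `idf` and `tfidf_vector` are assumptions for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "free prize win free",
    "meeting agenda attached",
    "win free ticket",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term: str) -> float:
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """TF–IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, *free* gets the largest weight because it occurs twice, while *prize* outranks *win* because it appears in fewer documents.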


In [0]:
# 
# Part II (iii): Similarity Analysis – find most similar word pair
# 

# Get vocabulary mapping and index -> word mapping
vocab = tfidf_vectorizer.vocabulary_          # word -> column index
index_to_word = {idx: word for word, idx in vocab.items()}

# Word vectors = columns of TF–IDF matrix
# Shape: (vocab_size, num_docs)
word_tfidf_matrix = tfidf_matrix.T

print("Word embedding matrix shape (num_words x num_docs):", word_tfidf_matrix.shape)

# To avoid huge computation, limit to the first N vocabulary entries
# (TfidfVectorizer assigns column indices in alphabetical order of the terms)
max_words_for_similarity = min(800, word_tfidf_matrix.shape[0])
sub_word_matrix = word_tfidf_matrix[:max_words_for_similarity, :]

print(f"Using {max_words_for_similarity} words for similarity analysis.")

# Cosine similarity between word vectors
sim_matrix = cosine_similarity(sub_word_matrix)

max_sim = -1.0
best_pair = (None, None)

num_words = sim_matrix.shape[0]

for i in range(num_words):
    for j in range(i + 1, num_words):
        sim = sim_matrix[i, j]
        if sim > max_sim:
            max_sim = sim
            best_pair = (i, j)

word1 = index_to_word[best_pair[0]]
word2 = index_to_word[best_pair[1]]

print("Most similar word pair (within examined subset):")
print(f"Word 1: {word1}")
print(f"Word 2: {word2}")
print("Cosine similarity:", max_sim)


### Justification of Similarity Metric & Feature Design

- **Feature Design**:  
  - We represent each word by its **TF–IDF vector across all documents**.  
  - If two words appear in **similar sets of emails with similar TF–IDF weights**, they likely play a similar semantic role (e.g., spam-related words like *free*, *win*, *prize*).  

- **Similarity Metric – Cosine Similarity**:  
  - Measures the **angle between two vectors**, ignoring magnitude.  
  - Works very well with **high-dimensional sparse vectors** like TF–IDF.  
  - Standard metric in information retrieval and text similarity tasks.  
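  - A high cosine similarity indicates that two words tend to occur in the same documents with similar weights, which suggests related usage in this corpus.  

The printed pair above is the most similar pair under this representation, within the examined subset of up to 800 vocabulary entries. Note that two rare words that occur only in the same single email have a cosine similarity of exactly 1.0, so the top pair may reflect co-occurrence in one message rather than a general semantic relation.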
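
A minimal sketch of the metric on hypothetical word vectors (rows are words, columns are documents) shows that cosine similarity ignores vector length and can be used to rank word pairs. The vectors and the helper `cosine` are assumptions for illustration and do not come from the email corpus.

```python
import math
from itertools import combinations

# Hypothetical TF–IDF weights of four words across three documents
word_vectors = {
    "free":    [0.8, 0.0, 0.6],
    "win":     [0.4, 0.0, 0.3],   # same direction as "free", half the length
    "meeting": [0.0, 0.9, 0.0],
    "prize":   [0.7, 0.0, 0.1],
}

def cosine(u, v) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Rank all unordered word pairs by similarity, highest first
ranked = sorted(
    ((cosine(word_vectors[a], word_vectors[b]), a, b)
     for a, b in combinations(word_vectors, 2)),
    reverse=True,
)
best_sim, best_a, best_b = ranked[0]
print(best_a, best_b, round(best_sim, 4))
```

Here *free* and *win* come out as the closest pair even though their vectors differ in magnitude, while *meeting* shares no document with the other words and scores 0 against each of them.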


In [0]:
# 
# PCA-based 2D Visualization of Word Embeddings
# 

# Choose a small subset of words for visualization (the first 50 vocabulary entries at most)
max_words_to_plot = min(50, word_tfidf_matrix.shape[0])
vis_sub_matrix = word_tfidf_matrix[:max_words_to_plot, :].toarray()
vis_words = [index_to_word[i] for i in range(max_words_to_plot)]

pca = PCA(n_components=2)
word_vectors_2d = pca.fit_transform(vis_sub_matrix)

plt.figure(figsize=(12, 8))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1])

for i, w in enumerate(vis_words):
    plt.text(word_vectors_2d[i, 0] + 0.01,
             word_vectors_2d[i, 1] + 0.01,
             w, fontsize=9)

plt.title("2D PCA Projection of Word TF–IDF Embeddings")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)
plt.show()


### Interpretation of PCA-based 2D Visualization

- **PCA (Principal Component Analysis)** reduces the high-dimensional TF–IDF word embeddings to **2 dimensions** while preserving as much variance as possible.  
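- In the 2D plot:  
  - Words that appear in **similar sets of emails** (e.g., spam-related tokens) are expected to lie closer together.  
  - Words used in normal (ham) emails may form a different region.  
- This gives an approximate **map of word usage** in the email corpus that complements the similarity analysis. Because only a small subset of words is plotted and two components retain only part of the variance, the clusters should be read as indicative rather than conclusive.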
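
How faithful a projection is depends on the share of variance the retained components capture. The sketch below, written in plain Python on hypothetical vectors, centers the data, builds the covariance matrix, and uses power iteration to estimate the first principal component and its explained-variance ratio. The data and the helper `top_eigenpair` are assumptions for this example; with scikit-learn, a fitted `PCA` object exposes the corresponding values as `explained_variance_ratio_`.

```python
import math

# Hypothetical word vectors (rows = words, columns = documents)
X = [
    [0.9, 0.1, 0.8],
    [0.8, 0.0, 0.7],
    [0.1, 0.9, 0.0],
    [0.0, 0.8, 0.1],
]
n, d = len(X), len(X[0])

# Center each column, as PCA does before decomposing
means = [sum(row[j] for row in X) / n for j in range(d)]
centered = [[row[j] - means[j] for j in range(d)] for row in X]

# Sample covariance matrix (d x d)
cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
       for a in range(d)]

def top_eigenpair(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    size = len(matrix)
    v = [1.0] * size
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v
    eigenvalue = sum(v[i] * matrix[i][j] * v[j] for i in range(size) for j in range(size))
    return eigenvalue, v

eigenvalue, direction = top_eigenpair(cov)
total_variance = sum(cov[i][i] for i in range(d))
explained_ratio = eigenvalue / total_variance
print(f"First component explains {explained_ratio:.1%} of the variance")
```

In this toy data the first component captures almost all of the variance because the four words fall into two tight groups; on real TF–IDF vectors the ratio is typically much lower, which is why the 2D plot should be interpreted with care.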


## Conclusion

In this notebook we:

1. **Built a trigram probabilistic language model** on the entire email corpus and used it to compare two test sentences, recommending the one with higher log-probability as more relevant to the dataset.
2. Performed **complete text preprocessing** (tokenization, lowercasing, stopword removal, stemming, and lemmatization) with clear demonstration on a sample email.
3. Extracted **TF–IDF features** from the preprocessed emails to obtain a high-dimensional sparse representation suitable for text mining.
4. Conducted a **similarity analysis between word embeddings** (based on TF–IDF and cosine similarity) and identified the most similar pair of words in the vocabulary subset.
5. Used **PCA to project word embeddings to 2D**, visualizing the semantic structure where words with similar usage tend to cluster together.

Together, these steps demonstrate language modelling, text preprocessing, feature engineering, similarity analysis and visualization on the email spam detection dataset.
