Addressing hypothesis 1 - Words may carry different meanings in different contexts (spam or non-spam).

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize



In [2]:
train_df = pd.read_csv("./data/processed/training.csv")
test_df = pd.read_csv("./data/processed/testing.csv")
print(f"training dataset shape: {train_df.shape}")
print(f"testing dataset shape: {test_df.shape}")

training dataset shape: (100737, 8)
testing dataset shape: (25185, 8)


Benchmark: Logistic Regression with BoW

In [3]:
# Build the pipeline: BoW + Logistic Regression
pipeline = Pipeline([
    ("bow", CountVectorizer()), # Converts text to BoW feature vectors
    ("classifier", LogisticRegression(max_iter=1000))
])

# Train the model
pipeline.fit(train_df['cleaned_body'], train_df['label'])

# Predict and evaluate
y_pred_bow_only = pipeline.predict(test_df['cleaned_body'])
accuracy_bow_only = accuracy_score(test_df['label'], y_pred_bow_only)
precision_bow_only = precision_score(test_df['label'], y_pred_bow_only)
recall_bow_only = recall_score(test_df['label'], y_pred_bow_only)
f1_bow_only = f1_score(test_df['label'], y_pred_bow_only)

print(f"Accuracy:  {accuracy_bow_only:.6f}")
print(f"Precision: {precision_bow_only:.6f}")
print(f"Recall:    {recall_bow_only:.6f}")
print(f"F1 Score:  {f1_bow_only:.6f}")

Accuracy:  0.983284
Precision: 0.978607
Recall:    0.986242
F1 Score:  0.982409


Alternative, to test whether word2vec features are useful: Logistic Regression with BoW + word2vec

In [4]:
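# Tokenize the training text for Word2Vec (the test text is not used to train the embeddings)
# Note: this reads the 'tokens' column, while get_avg_w2v_features below tokenizes 'cleaned_body'
X_train_tokens = [word_tokenize(text) for text in train_df['tokens']]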

# Train Word2Vec model on training tokens only
w2v_model = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=1)

# Function to compute average Word2Vec vector for a sentence
def get_avg_w2v_features(texts, model, vector_size):
    features = []
    for text in texts:
        tokens = word_tokenize(text)
        valid_tokens = [t for t in tokens if t in model.wv]
        if not valid_tokens:
            features.append(np.zeros(vector_size))
        else:
            vec = np.mean([model.wv[t] for t in valid_tokens], axis=0)
            features.append(vec)
    return np.array(features)

# FunctionTransformer for Word2Vec averaging
w2v_transformer = FunctionTransformer(
    lambda x: get_avg_w2v_features(x, w2v_model, w2v_model.wv.vector_size),
    validate=False
)

# Combine BoW and Word2Vec features
combined_features = FeatureUnion([
    ("bow", CountVectorizer()),
    ("w2v", w2v_transformer)
])

# Pipeline: feature extraction + classifier
pipeline = Pipeline([
    ("features", combined_features),
    ("classifier", LogisticRegression(max_iter=1000))
])

# Fit on training set and evaluate on test set
pipeline.fit(train_df['cleaned_body'], train_df['label'])
y_pred = pipeline.predict(test_df['cleaned_body'])

accuracy = accuracy_score(test_df['label'], y_pred)
precision = precision_score(test_df['label'], y_pred)
recall = recall_score(test_df['label'], y_pred)
f1 = f1_score(test_df['label'], y_pred)

print(f"Accuracy:  {accuracy:.6f}")
print(f"Precision: {precision:.6f}")
print(f"Recall:    {recall:.6f}")
print(f"F1 Score:  {f1:.6f}")


Accuracy:  0.982728
Precision: 0.978422
Recall:    0.985235
F1 Score:  0.981817
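
To make the comparison between the two runs easier to read, the four metrics can be put side by side. The snippet below is a small sketch that only re-uses the numbers printed in the two outputs above; the dictionary names are illustrative and not part of the original notebook.

```python
# Test-set metrics copied from the two printed outputs above
benchmark = {"accuracy": 0.983284, "precision": 0.978607, "recall": 0.986242, "f1": 0.982409}
alternative = {"accuracy": 0.982728, "precision": 0.978422, "recall": 0.985235, "f1": 0.981817}

# Positive gap means the benchmark (BoW only) scored higher
gaps = {name: benchmark[name] - alternative[name] for name in benchmark}
for name, gap in gaps.items():
    print(f"{name:<10} benchmark - alternative = {gap:+.6f}")
```

Every gap is positive, and each is on the order of 0.001 or smaller.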


The benchmark model scored higher than the alternative model on all 4 performance metrics, although each gap is small (roughly 0.001 or less).
Let's take a closer look at why accounting for word semantics unexpectedly decreased model performance.

In [5]:
# Rows of the test set where the two models disagree
disagree = y_pred_bow_only != y_pred
# .copy() so that adding columns below does not trigger a SettingWithCopyWarning
results_analysis_df = test_df.loc[disagree, ["cleaned_body", "label"]].copy()
print(f"Benchmark & Alternative models disagreed on these rows, resulting in df with shape: {results_analysis_df.shape}")
results_analysis_df["baseline_predict"] = y_pred_bow_only[disagree]
results_analysis_df["alternative_predict"] = y_pred[disagree]
results_analysis_df.head()

Benchmark & Alternative models disagreed on these rows, resulting in df with shape: (44, 2)


      cleaned_body                                       label  baseline_predict  alternative_predict
621   calger pastoria q q fountain valley eas delta ...  0      0                 1
623   url a detector from an asteroidchasing nasa pr...  0      0                 1
1602  bill is this the david gray you are going to s...  0      0                 1
1622  download dispatch tools for software developer...  0      1                 0
2026  so i tired grahams method first and got an err...  0      1                 0


We will filter for the rows where the benchmark model predicted correctly but the alternative model predicted wrongly. There are 29 such rows.

In [6]:
results_analysis_df[(results_analysis_df["label"] == results_analysis_df["baseline_predict"])].shape

(29, 4)
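
Because the labels are binary, each of the 44 disagreements is won by exactly one model: 29 by the benchmark (the shape above) and therefore 15 by the alternative. One way to gauge whether such a split could plausibly arise by chance is an exact McNemar test on these two counts. This check is not part of the original analysis; the sketch below uses only the standard library, and the function name `mcnemar_exact_p` is made up for illustration.

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: rows only the first model got right
    c: rows only the second model got right
    Under the null hypothesis each disagreement is a fair coin flip.
    """
    n = b + c
    k = min(b, c)
    # Probability of the smaller count being k or fewer under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 29 disagreements won by the benchmark, 44 - 29 = 15 won by the alternative
p_value = mcnemar_exact_p(29, 15)
print(f"Exact McNemar p-value: {p_value:.4f}")
```

A small p-value would suggest that the benchmark's advantage is unlikely to be noise, while a large one would suggest that the two models are hard to tell apart on this test set.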

The following 3 emails are examples where the **baseline model correctly identified spam** but the alternative model did not.

We observe that these emails are spam mainly because of their repeated use of words, which the BoW-only model can exploit because it keeps a count of each word. As the words were not used in a way typical of spam messages, the added word2vec features in the alternative model may have led it away from the correct answer. This suggests that word semantics offer limited help in identifying content that is spam for reasons other than its meaning.
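
The role of repetition can be illustrated with a toy example. The sketch below uses invented 3-dimensional vectors (not the trained `w2v_model`) and hypothetical helper names to show that repeating a message changes its word counts but leaves its averaged word vector unchanged, so the averaged feature carries no information about repetition.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented purely for illustration
toy_vectors = {
    "free": np.array([1.0, 0.0, 0.0]),
    "offer": np.array([0.0, 1.0, 0.0]),
}
vocab = sorted(toy_vectors)

def bow_counts(tokens):
    # One count per vocabulary word, as a bag-of-words vector
    return np.array([tokens.count(word) for word in vocab])

def mean_vector(tokens):
    # Average of the word vectors, mirroring the mean pooling used for Word2Vec
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

once = ["free", "offer"]
repeated = once * 5  # the same two words repeated five times

print("BoW counts: ", bow_counts(once), "vs", bow_counts(repeated))
print("Mean vector:", mean_vector(once), "vs", mean_vector(repeated))
```

This is consistent with the interpretation above but does not prove it: in the combined model the BoW counts are still present, and the averaged vector is an extra input that may shift the decision.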

In [7]:
actually_spam = results_analysis_df[(results_analysis_df["label"] == results_analysis_df["baseline_predict"]) & (results_analysis_df["label"] == 1)]
print(f"Example content 1: '{actually_spam['cleaned_body'].iloc[3]}'")
print(f"Example content 2: '{actually_spam['cleaned_body'].iloc[7]}'")
print(f"Example content 3: '{actually_spam['cleaned_body'].iloc[8]}'")

Example content 1: 'see attachment iain didnt go into detail but the council didnt give their b they dont know about the meeti patrick nodded therell be tro'
Example content 2: 'java virtual machine instructions are represented in this chapter by entries of the form shown in figure also he limped badly with one leg i also omitted methods such as freenet napster hotmail gnutella mostly because i think they are inherently bad and would have lead to misinterpretation of my goals was there reproach to him in the quiet figure and the mild eyes in the last several years this kind of software has advanced rapidly ah for my husband for my dear lord edward the center window is an ultrawave map of the region directly behind us the password protects your web site from the possibility of someone else making changes to it receives the message that matches the given correlation identifier from a transactional queue and immediately raises an exception if no message with the specified correlation iden

The following 3 emails are examples where the **baseline model correctly identified non-spam** but the alternative model did not.

We observe a common pattern in these non-spam emails: they tend to be instructive and invite the reader to take actions such as clicking a link or opening a file, which is similar to typical spam content. However, these were innocent requests of the kind that may sometimes be shared among friends. Unfortunately, some harmless non-spam content coincidentally uses words in a manner similar to malicious spam, which swayed the alternative model in the wrong direction during prediction.
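
One way to check whether the word2vec features really swayed a given prediction is to split the logistic regression logit into the part contributed by the BoW columns and the part contributed by the word2vec columns. The sketch below shows the idea on made-up numbers; `split_logit` is a hypothetical helper, and it assumes the feature vector is ordered as BoW columns followed by word2vec columns, which is the order the transformers are listed in the `FeatureUnion`.

```python
import numpy as np

def split_logit(x, coef, intercept, n_bow):
    """Split a linear model's logit into BoW and Word2Vec contributions.

    Assumes x and coef are 1-D arrays ordered as [BoW features | Word2Vec features].
    """
    bow_part = float(np.dot(x[:n_bow], coef[:n_bow]))
    w2v_part = float(np.dot(x[n_bow:], coef[n_bow:]))
    return bow_part, w2v_part, bow_part + w2v_part + intercept

# Made-up example: 3 BoW counts followed by a 2-dimensional averaged embedding
x = np.array([2.0, 0.0, 1.0, 0.3, -0.1])
coef = np.array([-0.5, 0.8, -0.2, 4.0, 1.0])
bow_part, w2v_part, logit = split_logit(x, coef, intercept=0.3, n_bow=3)

print(f"BoW contribution:      {bow_part:+.2f}")
print(f"Word2Vec contribution: {w2v_part:+.2f}")
print(f"Logit (spam if > 0):   {logit:+.2f}")
```

In this toy case the BoW part alone points to non-spam, and the word2vec part tips the logit above zero. With the fitted pipeline, `n_bow` would be the size of the fitted `CountVectorizer` vocabulary and `coef` the first row of the classifier's `coef_`; applying this to the misclassified rows would show how often the word2vec block is what flips the decision.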

In [8]:
not_actually_spam = results_analysis_df[(results_analysis_df["label"] == results_analysis_df["baseline_predict"]) & (results_analysis_df["label"] == 0)]
print(f"Example content 1: '{not_actually_spam['cleaned_body'].iloc[3]}'")
print(f"Example content 2: '{not_actually_spam['cleaned_body'].iloc[6]}'")
print(f"Example content 3: '{not_actually_spam['cleaned_body'].iloc[11]}'")

Example content 1: 'bill just wanted to let you know i was thinking of you today it was a long time ago when returnig from a shakedown patrol in the caribbean dave staton came down to the engine room lower level when my submarine pulled into new london telling me to get topside and to the hospital as my wife was having a baby i missed you being born by about hour but i was able to get there and hold you soon after you looked like winston churchill with a little turtle head your mom was exhausted but happy mke sure you give her a call today i know she misses talking to you love dad'
Example content 2: 'in sigh here is esai s latest natural gas fundwatch edna o connell office manager esai edgewater place suite wakefield ma ednao esaibos com ngol pdf'
Example content 3: 'i am sure doerfer never said that only binary comparison should be allowed e g turkic and mongolic but not turkic mongolic and manchu tungusic all at once on the contary what he said over and over again was that binary co