In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("bayes2003/emails-for-spam-or-ham-classification-trec-2007")

Downloading from https://www.kaggle.com/api/v1/datasets/download/bayes2003/emails-for-spam-or-ham-classification-trec-2007?dataset_version_number=1...


100%|██████████| 483M/483M [00:07<00:00, 69.3MB/s]

Extracting files...





### Technical Explanation: Dataset Acquisition from Kaggle

**Why This Approach:**
Using Kaggle datasets provides access to pre-curated, well-labeled spam/ham email data. `kagglehub` handles the download, local caching, and dataset versioning, which helps reproducibility and makes dataset updates easy.

**Technical Details:**
- **kagglehub API**: Downloads public datasets with a single call; credentials are only needed for private or consent-gated resources
- **TREC 2007 Dataset**: Contains emails labeled as spam or ham from the Text REtrieval Conference (TREC) 2007 spam track
- **Automated Management**: The latest version is downloaded by default (the log above shows `dataset_version_number=1`), so reruns pick up new versions

**Use Cases:**
- Provides diverse email examples for robust model training
- Complements the Enron dataset for better generalization
- Industry-standard benchmark dataset for spam classification

**Data Characteristics:**
- Multiple email sources and spam types
- Realistic distribution of legitimate and spam messages
- Pre-processed labels (spam/ham)
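`kagglehub.dataset_download` returns a local directory path, so a natural next step is to look at what was downloaded. The sketch below is a minimal illustration, not part of the original notebook: `list_dataset_files` is a hypothetical helper, and a temporary directory stands in for the real download folder so the example runs without network access.

```python
import os
import tempfile

def list_dataset_files(root):
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(found)

# Demo on a temporary directory standing in for the kagglehub download folder.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "email_text.csv"), "w") as fh:
        fh.write("label,text\n1,hello\n")
    demo_listing = list_dataset_files(demo_root)

print(demo_listing)
```

In the notebook itself, the same helper could be called as `list_dataset_files(path)` with the `path` returned by the download cell.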

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Technical Explanation: Google Drive Integration for Data Persistence

**Why This Approach:**
Google Colab's ephemeral file system requires persistent storage. Mounting Google Drive enables:
- Saving trained models for later use
- Storing datasets across sessions
- Collaboration through shared drives

**Technical Details:**
- **OAuth 2.0 Authentication**: Secure, user-authorized access to Google Drive
- **FUSE Filesystem**: Creates a bridge between Colab VM and Google Drive API
- **Mount Point**: `/content/gdrive` maps Google Drive into Colab's file system
- **Persistent Storage**: Files written under the mount point are stored in Google Drive and persist after the Colab session ends, unlike files on the VM's local disk

**Advantages:**
- No manual file uploads/downloads needed
- Models can be accessed from production environment
- Collaborative development support
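Since the mount only exists inside Colab, it can help to choose the data directory based on where the code runs. The following is a small sketch under stated assumptions: `resolve_data_dir` is a hypothetical helper, the local `./data` folder is an invented fallback, and checking `sys.modules` for `google.colab` is a common convention rather than an official API.

```python
import os
import sys

def resolve_data_dir(colab_dir, local_dir):
    """Pick the Drive-backed directory inside Colab, otherwise a local one."""
    in_colab = "google.colab" in sys.modules
    return colab_dir if in_colab else local_dir

data_dir = resolve_data_dir(
    colab_dir="/content/gdrive/MyDrive/trec_spam_data_extracted",
    local_dir=os.path.join(".", "data"),  # hypothetical local folder
)
print(data_dir)
```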

In [23]:
import os
gdrive_extracted_data_path = '/content/gdrive/MyDrive/trec_spam_data_extracted'
data_1 = os.path.join(gdrive_extracted_data_path, 'email_text.csv')
data_2 = os.path.join(gdrive_extracted_data_path, 'enron_spam_data.csv')


### Technical Explanation: Data Path Configuration

**Why This Approach:**
Defining paths at the start enables:
- Centralized path management
- Easy migration between environments (local, Colab, production)
- Consistent file access across the notebook

**Technical Details:**
- **Path Variables**: Store full paths as variables for reusability
- **Google Drive Structure**: Access data stored in organized drive folders
- **Dual Dataset Strategy**: Support both TREC (kagglehub) and Enron datasets
- **OS-Independent**: `os.path.join()` handles path separators across platforms

**Data Organization:**
- `email_text.csv`: TREC 2007 dataset (Kaggle)
- `enron_spam_data.csv`: Enron corpus dataset
- Both will be combined for comprehensive training

In [25]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
df_1 = pd.read_csv(data_1)
df_2 = pd.read_csv(data_2)

print(df_1.head())
print(df_2.head())


   label                                               text
0      1  do you feel the pressure to perform and not ri...
1      0  hi i've just updated from the gulus and i chec...
2      1  mega authenticv i a g r a discount pricec i a ...
3      1  hey billy it was really fun going out the othe...
4      1  system of the home it will have the capabiliti...
   Message ID                       Subject  \
0           0  christmas tree farm pictures   
1           1      vastar resources , inc .   
2           2  calpine daily gas nomination   
3           3                    re : issue   
4           4     meter 7268 nov allocation   

                                             Message Spam/Ham        Date  
0                                                NaN      ham  1999-12-10  
1  gary , production from the high island larger ...      ham  1999-12-13  
2             - calpine daily gas nomination 1 . doc      ham  1999-12-14  
3  fyi - see note below - already done .\nstella\... 

### Technical Explanation: Multi-Source Data Loading and Exploration

**Why This Approach:**
Loading both datasets allows cross-validation of model performance and creates a more diverse training corpus. Exploratory output reveals data structure and potential issues.

**Technical Details:**
- **Pandas DataFrame**: Efficient columnar data structure for analysis
- **Dual Loading**: Two separate CSVs merged later for flexibility
- **Head Inspection**: `.head()` shows first 5 rows for quick validation
- **Column Discovery**: Reveals data schema before processing

**Mathematical Foundation:**
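At this stage each dataset is a table with $n$ rows and $m$ columns. It becomes a numeric matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ only after the text is vectorized later in the notebook:
- $n$ = number of emails (samples)
- $m$ = number of raw columns (such as text and labels)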

**Expected Output:**
- Column names reveal available features
- Data types indicate encoding (string vs numeric)
- Missing values visible in first rows

In [29]:
df_2 = df_2.drop(columns=["Message ID", "Date", "Subject"])
df_2["Spam/Ham"] = df_2["Spam/Ham"].map({"ham": 0, "spam": 1})

# Use the message body as the text column (Subject was dropped above);
# missing bodies become empty strings
df_2["text"] = (
    df_2["Message"].fillna("")
)

# Collapse runs of whitespace (including newlines) into single spaces
df_2["text"] = df_2["text"].str.replace(r"\s+", " ", regex=True).str.strip()


### Technical Explanation: Data Cleaning and Feature Engineering

**Why This Approach:**
Removes irrelevant columns, encodes labels, normalizes whitespace, and creates unified text representation for model consumption.

**Technical Details:**

1. **Column Dropping**: Remove columns that are not used as features
   - Message ID: Unique identifier with no pattern
   - Date: Temporal info not used by static model
   - Subject: Dropped in this notebook; only the message body feeds the text column

2. **Label Encoding**: Binary mapping for classification
   - Mathematical form: $y \in \{0, 1\}$ (ham=0, spam=1)
   - Enables loss computation: $L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$

3. **Text Normalization**: Whitespace standardization
   - Regex `\s+` matches runs of one or more whitespace characters (spaces, tabs, newlines); each run is replaced by a single space
   - Prevents tokenization errors from excessive spacing

**Data Quality Improvements:**
- Reduces feature dimensionality
- Standardizes format for consistent processing
- Removes noise from formatting variations

**Information Content:**
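- Dropped features: Message ID and Date are assumed to carry little information about the spam label; Subject lines can be predictive and are left out here only for simplicity
- Text feature: Carries most of the signal and is essential for classification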
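The cleaning steps can be checked on a tiny hand-made frame. This is an illustrative sketch only: the two rows in `toy` are invented, and the column names mirror the Enron CSV shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Message": ["win  a\nprize\t now ", None],
    "Spam/Ham": ["spam", "ham"],
})

# Encode labels: ham -> 0, spam -> 1
toy["label"] = toy["Spam/Ham"].map({"ham": 0, "spam": 1})

# Missing bodies become empty strings, then whitespace runs collapse to one space
toy["text"] = (
    toy["Message"].fillna("")
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(toy[["label", "text"]])
```

Note that the row with a missing message ends up with an empty string rather than a null value, which matters for any later null-based filtering.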

In [31]:
df_2 = df_2.rename(columns={"Spam/Ham": "label"})

### Technical Explanation: Schema Standardization

**Why This Approach:**
Standardizes column naming across datasets for consistent concatenation. The label column should have uniform name for both datasets before merging.

**Technical Details:**
- **Column Renaming**: Ensures both df_1 and df_2 have same label column name
- **Pre-Merge Preparation**: Critical before using `pd.concat()`
- **API Consistency**: Makes downstream processing uniform

**Impact:**
Enables `pd.concat()` to properly merge label columns from different sources

In [32]:
# Drop rows with null text, then combine df_1 and df_2 into one DataFrame

df_2 = df_2.dropna(subset=['text'])

df = pd.concat([df_1, df_2], ignore_index=True)


### Technical Explanation: Data Quality Assurance and Multi-Source Integration

**Why This Approach:**
Removes incomplete records and combines multiple datasets to create a robust, larger training corpus with diverse spam patterns.

**Technical Details:**

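1. **Null Handling**: Remove rows with missing text
   - `dropna(subset=['text'])`: Targets only the text column of `df_2`
   - Because the earlier `fillna("")` already replaced missing messages with empty strings, this call acts as a safeguard here; rows whose text is an empty string are not removed by `dropna`
   - Rows without content give the model no features to train on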

2. **Dataset Concatenation**: Merge two sources
   - `pd.concat()` vertically stacks DataFrames (union operation)
   - `ignore_index=True`: Re-indexes rows 0 to n-1
   - Creates unified dataset from multiple sources

**Mathematical Impact:**
- **Sample Size Effect**: $n_{\text{total}} = n_1 + n_2$
- **Larger n improves estimates**: $\text{Var}(\hat{\theta}) \propto \frac{1}{n}$
- **Dataset Diversity**: Spam patterns from different corpora
  - TREC 2007: Spam and ham collected for the TREC 2007 spam track
  - Enron: Email from the Enron corpus (the sample rows above are dated 1999)
  - Combined: Covers different sources and time periods

**Quality Metrics:**
- Removed NaN rows: Preserves data integrity
- Combined size: Larger training set for better generalization
- Balanced representation: Multiple spam sources
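The interaction between `fillna("")`, `dropna`, and `pd.concat` is easy to verify on toy data. In this sketch the frames `df_a` and `df_b` are invented stand-ins for `df_1` and `df_2`, and the empty-string filter is a suggested alternative, not something the notebook does.

```python
import pandas as pd

df_a = pd.DataFrame({"label": [1, 0], "text": ["cheap pills", "meeting at noon"]})
df_b = pd.DataFrame({
    "Message": ["see attached", None],
    "label": [0, 1],
})
df_b["text"] = df_b["Message"].fillna("")

# fillna("") left no NaN in `text`, so dropna keeps both rows
df_b_clean = df_b.dropna(subset=["text"])

# Filtering on empty strings is what actually removes the content-free row
df_b_nonempty = df_b[df_b["text"].str.len() > 0]

# Columns are unioned: rows from df_a get NaN in the extra `Message` column
combined = pd.concat([df_a, df_b_nonempty], ignore_index=True)
print(combined)
```

The extra `Message` column in `combined` is the reason the next cell drops it.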

In [34]:
df = df.drop(columns=["Message"])
df

Unnamed: 0,label,text
0,1,do you feel the pressure to perform and not ri...
1,0,hi i've just updated from the gulus and i chec...
2,1,mega authenticv i a g r a discount pricec i a ...
3,1,hey billy it was really fun going out the othe...
4,1,system of the home it will have the capabiliti...
...,...,...
87379,1,"hello , welcome to gigapharm onlinne shop . pr..."
87380,1,i got it earlier than expected and it was wrap...
87381,1,are you ready to rock on ? let the man in you ...
87382,1,learn how to last 5 - 10 times longer in bed ....


### Technical Explanation: Final Data Preparation and Inspection

**Why This Approach:**
Removes the original Message column (now merged into text) and displays final dataset structure for validation before modeling.

**Technical Details:**
- **Redundant Feature Removal**: Message column merged into text, now duplicate
- **Memory Optimization**: Reduces DataFrame size
- **Schema Validation**: Displaying `df` shows the first and last rows of the clean, model-ready structure

**Expected Structure:**
Final DataFrame should contain only:
- `text`: Email content (whitespace-normalized; full preprocessing follows later)
- `label`: Binary spam/ham label (0 or 1)

In [35]:
df.columns

Index(['label', 'text'], dtype='object')

### Technical Explanation: Schema Verification

**Why This Approach:**
Verifies final column names before text preprocessing, confirming data is properly structured.

**Technical Details:**
- **Schema Audit**: Lists all column names
- **Validation Step**: Ensures dropped/renamed columns are correct
- **Preprocessing Readiness**: Confirms 'text' and 'label' columns exist

In [39]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('punkt_tab') # Added this download

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove numbers
    text = re.sub(r'\d+', '', text) # Corrected digit removal to handle multiple digits
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words and lemmatize
    cleaned_tokens = []
    for token in tokens:
        if token not in stop_words:
            cleaned_tokens.append(lemmatizer.lemmatize(token))
    # Join back into string
    return ' '.join(cleaned_tokens)

# Apply the preprocessing function to the 'text' column
df['text'] = df['text'].apply(preprocess_text)

print("DataFrame after applying text preprocessing:")
print(df[['text', 'label']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


DataFrame after applying text preprocessing:
                                                text  label
0  feel pressure perform rising occasion try v ia...      1
1  hi ive updated gulu check mirror seems little ...      0
2  mega authenticv g r discount pricec l discount...      1
3  hey billy really fun going night talking said ...      1
4  system home capability linked far know within ...      1
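The regex steps in `preprocess_text` can be tried in isolation. The sketch below is a simplified stand-in: `basic_clean` is a hypothetical helper that applies only the lowercasing, digit removal, and punctuation removal, and it skips the NLTK tokenization, stop-word filtering, and lemmatization so that it runs without downloading any corpora.

```python
import re

def basic_clean(text):
    """Lowercase, drop digits and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)      # remove runs of digits
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation (keeps letters, digits, underscore)
    return " ".join(text.split())        # collapse leftover whitespace

cleaned = basic_clean("Claim your FREE prize: 100% guaranteed!!!")
print(cleaned)
```

An input made only of digits and punctuation cleans to an empty string, so some rows may end up with empty text after preprocessing.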


In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Build two pipelines, each pairing a TF-IDF vectorizer with a classifier

pipeline_nb = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10_000)),
    ('nb', MultinomialNB())
])

pipeline_lr = Pipeline([
    # max_df: Ignore words that appear in more than 90% of documents. These are boring words that don’t help discriminate classes. Think “the”, “and”, “is”.
    # min_df: Ignore words that appear in fewer than 5 documents. These are too rare to generalize. They add noise.
    # ngram_range: This means:  unigrams: "hello"  bigrams: "hello world"
    ('tfidf', TfidfVectorizer(max_features=10_000, ngram_range=(1,2), max_df=0.9,min_df=5)),
    ('lr', LogisticRegression(max_iter=1000,solver='saga', penalty='l1')) # lasso
])



### Technical Explanation: Model Pipeline Architecture with TF-IDF and Algorithm Selection

**Why This Approach:**
We implement two pipelines comparing Multinomial Naive Bayes (generative model) vs. Logistic Regression (discriminative model). Pipelines ensure proper transformation order and prevent data leakage.

---

## TF-IDF Vectorization

**Technical Details:**
- **max_features=10,000**: Limits the vocabulary to the 10k terms with the highest term frequency across the corpus
  - Caps the document-term matrix at 10k columns regardless of the raw vocabulary size
  - Keeps the most frequent features (Zipf's law: few words account for most occurrences)
  
- **ngram_range=(1,2)** (LR only): Captures unigrams and bigrams
  - Unigram: "free"
  - Bigram: "free money", "click here"
  - Bigrams capture local context and idiomatic spam phrases
  
- **max_df=0.9** (LR only): Ignore terms in >90% of documents
  - These are near-stopwords with low discriminative power
  
- **min_df=5** (LR only): Ignore terms in <5 documents
  - Rare terms are often typos/outliers causing overfitting

**Mathematical Foundation:**

TF-IDF creates a document-term matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ where:
- $n$ = number of documents
- $d$ = vocabulary size (max_features)
- Each entry: $x_{ij} = \text{TF-IDF}(t_j, d_i)$

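For term $t$ in document $d$ from corpus $D$, scikit-learn's default settings (`smooth_idf=True`, `sublinear_tf=False`) use the raw count as TF and a smoothed IDF:

$$\text{TF}(t,d) = f_{t,d}$$

$$\text{IDF}(t,D) = \ln\frac{1 + |D|}{1 + |\{d \in D : t \in d\}|} + 1$$

$$\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)$$

Textbook variants often divide TF by the document length and omit the added constants; the length effect is handled here by the normalization step below.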

**Normalization**: TF-IDF vectors are L2-normalized:

$$\mathbf{x}_i' = \frac{\mathbf{x}_i}{||\mathbf{x}_i||_2}$$

This makes documents comparable regardless of length.
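To make the weighting concrete, the sketch below recomputes TF-IDF by hand and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings, where TF is the raw count and the IDF is smoothed as $\ln\frac{1+n}{1+\text{df}} + 1$; the three documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["free money now", "meeting schedule now", "free free prize"]

# Raw term counts (TF) and document frequencies
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, then per-document L2 normalization
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Both vectorizers sort the vocabulary alphabetically, so the columns line up without any re-indexing.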

---

## Multinomial Naive Bayes

**Why This Algorithm:**
- **Generative Model**: Models $P(x|y)$ and $P(y)$, uses Bayes' theorem for classification
- **Computational Efficiency**: $O(nd)$ training time (linear in features and samples)
- **Works Well with Sparse Data**: Text data is 90-99% sparse; NB handles this naturally
- **Strong Baseline**: Often competitive with more complex models on text

**Mathematical Foundation:**

Bayes' Theorem:
$$P(y|x) = \frac{P(x|y)P(y)}{P(x)} \propto P(x|y)P(y)$$

For text classification:
$$P(\text{spam}|\text{email}) \propto P(\text{email}|\text{spam}) \times P(\text{spam})$$

**Naive Assumption**: Features (words) are conditionally independent given class:
$$P(\mathbf{x}|y) = \prod_{i=1}^{d} P(x_i|y)$$

For Multinomial variant (suitable for count/frequency data):
$$P(x_i | y) = \frac{N_{yi} + \alpha}{N_y + \alpha d}$$

Where:
- $N_{yi}$ = count of feature $i$ in class $y$
- $N_y$ = total count of all features in class $y$
- $\alpha$ = smoothing parameter (Laplace smoothing, default=1)

**Classification Rule:**
$$\hat{y} = \arg\max_y \left[ \log P(y) + \sum_{i=1}^{d} x_i \log P(x_i|y) \right]$$
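The smoothing formula and the classification rule can be written out directly in NumPy. This is a minimal sketch on an invented 4-document, 3-feature count matrix, not the scikit-learn implementation; `predict` is a hypothetical helper.

```python
import numpy as np

# Rows are documents, columns are feature counts; y holds the class labels
X = np.array([[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 1, 2]], dtype=float)
y = np.array([1, 1, 0, 0])
alpha = 1.0  # Laplace smoothing

classes = np.array([0, 1])
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# N_yi: per-class feature totals; N_y: per-class grand total
feature_totals = np.array([X[y == c].sum(axis=0) for c in classes])
log_likelihood = np.log(
    (feature_totals + alpha)
    / (feature_totals.sum(axis=1, keepdims=True) + alpha * X.shape[1])
)

def predict(x_new):
    """Return the class maximizing log P(y) + sum_i x_i * log P(x_i | y)."""
    scores = log_prior + log_likelihood @ x_new
    return classes[np.argmax(scores)]

pred_spam = predict(np.array([1.0, 0.0, 0.0]))
pred_ham = predict(np.array([0.0, 2.0, 1.0]))
print(pred_spam, pred_ham)
```

Working in log space avoids numerical underflow from multiplying many small probabilities.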

---

## Logistic Regression with L1 Regularization

**Why This Algorithm:**
- **Discriminative Model**: Directly models $P(y|x)$ without assumptions about $P(x|y)$
- **Handles Feature Correlation**: No independence assumption like Naive Bayes
- **L1 Regularization (Lasso)**: Induces sparsity, performing automatic feature selection
- **Interpretable**: Coefficients show feature importance and direction

**Technical Details:**
- **max_iter=1000**: Ensures convergence for high-dimensional sparse data
- **solver='saga'**: A variant of the Stochastic Average Gradient (SAG) method that also handles non-smooth penalties
  - Efficient for large datasets
  - Supports L1 penalty (unlike 'lbfgs')
- **penalty='l1'**: Lasso regularization for sparsity

**Mathematical Foundation:**

Logistic Regression models probability using sigmoid function:
$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$$

**Objective Function with L1 Penalty** (labels recoded as $y_i \in \{-1, +1\}$):
$$\min_{\mathbf{w}} \left[ \frac{1}{n}\sum_{i=1}^{n} \log(1 + e^{-y_i(\mathbf{w}^T\mathbf{x}_i + b)}) + \lambda||\mathbf{w}||_1 \right]$$

Where:
- First term: Negative log-likelihood (cross-entropy loss)
- Second term: L1 regularization $||\mathbf{w}||_1 = \sum_{j=1}^{d}|w_j|$
- $\lambda$: Regularization strength, inversely related to sklearn's `C` parameter (with the $\frac{1}{n}$ scaling used here, $\lambda = \frac{1}{nC}$)
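The objective uses the form $\log(1 + e^{-y z})$, which assumes labels in $\{-1, +1\}$, while the notebook encodes labels as $\{0, 1\}$. The short check below illustrates that this form gives the same per-sample loss as the cross-entropy written with 0/1 labels; the scores `z` are arbitrary example values and both functions are hypothetical helpers.

```python
import numpy as np

def log_loss_01(y01, z):
    """Cross-entropy with labels in {0, 1} and decision score z."""
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

def log_loss_pm1(ypm, z):
    """Logistic loss with labels in {-1, +1} and decision score z."""
    return np.log1p(np.exp(-ypm * z))

z = np.array([-2.0, -0.5, 0.3, 1.7])
y01 = np.array([0, 1, 0, 1])
ypm = 2 * y01 - 1  # map 0 -> -1 and 1 -> +1

loss_a = log_loss_01(y01, z)
loss_b = log_loss_pm1(ypm, z)
print(np.allclose(loss_a, loss_b))
```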

**L1 Regularization Effect:**
$$||\mathbf{w}||_1 = \sum_{j=1}^{d}|w_j|$$

The L1 norm penalty drives many weights to exactly zero:
- Creates sparse model (many $w_j = 0$)
- Automatic feature selection
- Improved interpretability
- Reduces overfitting in high dimensions

**Why L1 over L2?**
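- L1 (Lasso): Produces sparse solutions; the subgradient $\text{sign}(w)$ has constant magnitude, so small weights are pushed all the way to zero
- L2 (Ridge): Shrinks coefficients in proportion to their size ($\nabla w^2 = 2w$), so weights become small but rarely exactly zero
- For text with 10k features, L1 typically keeps only a fraction of the terms active; the exact count depends on `C` and the data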

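**Update for SAGA Solver (simplified):**
At each iteration, the update combines a sampled gradient step on the loss with a proximal step for the penalty:
$$\mathbf{w}_{t+1} = \text{prox}_{\eta_t \lambda ||\cdot||_1}\left(\mathbf{w}_t - \eta_t \mathbf{g}_t\right)$$

Where $\mathbf{g}_t$ is a variance-reduced estimate of the loss gradient; the L1 penalty is handled by the proximal operator rather than by a gradient.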

---

## Pipeline Benefits

1. **Prevents Data Leakage**: TF-IDF fit on training data, transform on test data
2. **Code Simplicity**: Single `.fit()` and `.predict()` call
3. **Cross-Validation Compatibility**: Proper fold-wise transformation
4. **Reproducibility**: Encapsulates preprocessing and modeling

**Comparison Expected:**
- **NB**: Faster training, good baseline, assumes feature independence
- **LR with L1**: Better with correlated features, feature selection, typically higher performance on text
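The leakage point can be seen in how a pipeline is passed to `cross_val_score`: the vectorizer is refit inside each training fold, so validation documents never influence the vocabulary or IDF weights. The sketch below uses an invented eight-document corpus and default vectorizer settings, so the scores themselves carry no meaning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "free prize money", "win free cash", "claim prize now", "cheap free offer",
    "meeting agenda attached", "lunch tomorrow noon", "project status report",
    "see attached notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Each fold fits TF-IDF on its own training split, then scores the held-out split
toy_scores = cross_val_score(toy_pipeline, texts, labels, cv=2, scoring="f1")
print(toy_scores)
```

Fitting the vectorizer on the full corpus before splitting would instead let held-out documents shape the features.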

In [41]:
# for multinomial naive Bayes
scores_nb = cross_val_score(
    pipeline_nb,
    df["text"],
    df["label"],
    cv=5,
    scoring="f1"
)

### Technical Explanation: Cross-Validation for Naive Bayes Model

**Why This Approach:**
Cross-validation provides robust performance estimates by testing on multiple data splits. Using F1-score as the metric balances precision and recall, crucial for imbalanced spam datasets.

**Technical Details:**
- **cv=5**: 5-fold stratified cross-validation
  - Dataset split into 5 equal parts
  - Each fold used once as validation, 4 times as training
  - Stratification preserves class distribution in each fold
  
- **scoring='f1'**: F1-score metric for binary classification

**Mathematical Foundation:**

**K-Fold Cross-Validation:**
For $k=5$ folds, dataset $D$ is partitioned into $\{D_1, D_2, D_3, D_4, D_5\}$

For fold $i$:
- Train on: $D \setminus D_i$ (80% of data)
- Test on: $D_i$ (20% of data)

Final score: $\text{CV-Score} = \frac{1}{k}\sum_{i=1}^{k} \text{F1}(D_i)$

**F1-Score Definition:**
$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$

**Why F1 over Accuracy?**
For imbalanced data (e.g., 70% ham, 30% spam):
- Accuracy can be misleading (70% by predicting all ham, while catching no spam)
- F1 balances false positives (legitimate emails marked as spam) and false negatives (spam getting through)

**Harmonic Mean Interpretation:**
F1 is the harmonic mean of precision and recall:
$$\text{F1} = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$$

This penalizes extreme values more than arithmetic mean:
- If Precision=1.0 and Recall=0.1: Arithmetic mean=0.55, F1=0.18
- Ensures balanced performance
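The two F1 expressions and the precision/recall example above can be verified with a few lines of arithmetic. The confusion counts below are invented to reproduce Precision=1.0 and Recall=0.1, and both functions are hypothetical helpers.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from confusion counts: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

def harmonic_mean(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

tp, fp, fn = 10, 0, 90          # 10 spam caught, 90 missed, no false alarms
precision = tp / (tp + fp)      # 1.0
recall = tp / (tp + fn)         # 0.1

f1_counts = f1_from_counts(tp, fp, fn)
f1_hm = harmonic_mean(precision, recall)
arithmetic = (precision + recall) / 2

print(round(f1_counts, 2), round(f1_hm, 2), round(arithmetic, 2))
```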

**Variance Estimation:**
With $k$ folds, we get score variance:
$$\sigma^2 = \frac{1}{k}\sum_{i=1}^{k}(\text{F1}_i - \bar{\text{F1}})^2$$

Lower variance indicates more stable model.

In [42]:
# for logistic regression
scores_lr = cross_val_score(
    pipeline_lr,
    df["text"],
    df["label"],
    cv=5,
    scoring="f1"
)

### Technical Explanation: Cross-Validation for Logistic Regression Model

**Why This Approach:**
Same cross-validation strategy as Naive Bayes, allowing fair comparison. The more complex LR model with bigrams and L1 regularization requires validation to detect potential overfitting.

**Technical Details:**
- Identical cv=5 and scoring='f1' parameters ensure apples-to-apples comparison
- LR's computational cost is higher due to:
  - Bigram features (more candidate terms to count before the `max_features` cap is applied)
  - Iterative optimization (SAGA solver with max_iter=1000)
  - L1 proximal gradient computations

**Mathematical Foundation:**

**Expected Performance Difference:**

As a rough guide for data with feature dimension $d$:
- **Naive Bayes**: Bias increases with violation of the independence assumption
  - Error tends to grow with $\text{Dependence}(\mathbf{x}_i, \mathbf{x}_j | y)$, the dependence between features within a class
  
- **Logistic Regression**: No independence assumption
  - With sufficient data, its error approaches that of the best linear classifier, which equals $\epsilon_{Bayes}$ (optimal) only when the true decision boundary is linear
  - Sample complexity: on the order of $O(d)$ samples needed for convergence

**Regularization Path:**
L1 penalty creates a solution path parameterized by $\lambda$:

$$\mathbf{w}(\lambda) = \arg\min_{\mathbf{w}} \left[ L(\mathbf{w}) + \lambda||\mathbf{w}||_1 \right]$$

As $\lambda$ increases:
- More coefficients become exactly zero
- Model becomes more sparse and simpler
- Bias increases, variance decreases

sklearn's default is `C=1.0`, a general-purpose starting point rather than a tuned value; it is usually worth tuning with cross-validation.

**Computational Complexity:**
- NB: $O(nd + d)$ - linear pass through data
- LR: $O(T \cdot nd)$ where $T$ is number of iterations
  - Typically $T \approx 100-1000$ for convergence
  - Each iteration: gradient computation over $n$ samples and $d$ features

**Cross-Validation Computational Cost:**
Total training operations: $k \times T \times n \times d$ where:
- $k=5$ folds
- $T \approx 100-1000$ iterations
- $n \approx 87,000$ emails in the combined DataFrame (87,383 rows), of which each fold trains on about 80%
- $d \le 10,000$ features (`max_features=10_000` caps the vocabulary, bigrams included)

This explains why LR takes longer than NB to train.

In [44]:
print("NB F1 per fold:", scores_nb)
print("NB mean F1:", scores_nb.mean())

print("LR F1 per fold:", scores_lr)
print("LR mean F1:", scores_lr.mean())

NB F1 per fold: [0.96095935 0.96917935 0.97307059 0.95547418 0.95613297]
NB mean F1: 0.9629632880262624
LR F1 per fold: [0.98195329 0.98095037 0.98379581 0.96354894 0.91397534]
LR mean F1: 0.9648447474183133


### Technical Explanation: Model Performance Comparison and Evaluation

**Why This Approach:**
Comparing mean F1 scores across both models reveals which approach better captures spam patterns. Per-fold scores show stability/variance of each model.

**Technical Details:**
- **Per-Fold Scores**: Reveals consistency across different data splits
  - High variance → model sensitive to training data composition
  - Low variance → robust, generalizable model
  
- **Mean F1**: Single metric for model selection
  - Higher mean indicates better average performance
  - Must consider variance: prefer high mean with low variance

**Mathematical Foundation:**

**Model Comparison Statistical Test:**

Given two sets of CV scores $\{s_1^{NB}, ..., s_5^{NB}\}$ and $\{s_1^{LR}, ..., s_5^{LR}\}$:

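**Paired t-test** determines if the difference is significant. With per-fold differences $d_i = s_i^{LR} - s_i^{NB}$:
$$t = \frac{\bar{d}}{s_d / \sqrt{k}}$$

Where $\bar{d}$ is the mean difference and $s_d$ is the sample standard deviation of the differences. If $|t| > t_{\alpha/2,k-1}$, the difference is statistically significant. Because CV folds share training data, the result should be treated as approximate.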
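A paired test works on the per-fold differences, $t = \bar{d} / (s_d / \sqrt{k})$. As a sketch, it can be applied to the per-fold scores printed above using only the standard library. The variable names `reported_nb` and `reported_lr` are introduced here for illustration, and 2.776 is the two-sided 5% critical value of Student's t with 4 degrees of freedom.

```python
import math
import statistics

# Per-fold F1 scores copied from the notebook output above
reported_nb = [0.96095935, 0.96917935, 0.97307059, 0.95547418, 0.95613297]
reported_lr = [0.98195329, 0.98095037, 0.98379581, 0.96354894, 0.91397534]

diffs = [lr - nb for lr, nb in zip(reported_lr, reported_nb)]
k = len(diffs)

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample standard deviation (k - 1 in the denominator)
t_stat = mean_diff / (sd_diff / math.sqrt(k))

T_CRIT = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
significant = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.3f}, significant at 5%: {significant}")
```

For these scores the statistic falls well below the critical value, so the small gap in mean F1 is not statistically distinguishable with five folds.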

**Bias-Variance Tradeoff:**

Total error decomposes as:
$$\mathbb{E}[(\hat{y} - y)^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

- **NB**: Higher bias (independence assumption), lower variance (fewer parameters)
- **LR with L1**: Lower bias (flexible model), higher variance (more parameters, regularization reduces this)

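**Observed Results:**
- NB mean F1: 0.963 (per-fold range 0.955 to 0.973)
- LR mean F1: 0.965 (per-fold range 0.914 to 0.984)

**Why LR Often Wins:**
1. **Bigrams**: Captures phrases like "free money", "click here"
2. **No Independence Assumption**: Words in spam are correlated ("free" often appears with "prize")
3. **Feature Selection**: L1 keeps only the more discriminative features active

**Interpretation:**
- LR's mean F1 is only about 0.002 higher than NB's, a negligible gap
- LR beats NB on four of the five folds but drops to 0.914 on the last one, so its per-fold standard deviation (about 0.026) is higher than NB's (about 0.007)
- NB is the more stable model here; the last-fold drop for LR is worth investigating before choosing a model for production. One plausible cause is that the folds are not shuffled and follow row order, so the last fold draws largely from the Enron rows appended at the end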

In [45]:
model_nb = pipeline_nb.fit(df["text"], df["label"])

### Technical Explanation: Training Final Naive Bayes Model on Full Dataset

**Why This Approach:**
After validation, we train on the complete dataset so the final model makes use of every labeled example. More training data generally improves generalization, including for generative models like Naive Bayes.

**Technical Details:**
- **Full Dataset Training**: Uses all samples (no holdout)
  - Cross-validation already provided unbiased performance estimate
  - Final model benefits from maximum available data
  
- **Pipeline Fitting**: Both TF-IDF vectorization and NB classifier trained in one call
  - TF-IDF learns vocabulary, document frequencies, IDF weights
  - NB learns class priors $P(y)$ and feature likelihoods $P(x_i|y)$

**Mathematical Foundation:**

**Parameter Learning:**

**Class Prior:**
$$P(y=1) = \frac{n_{\text{spam}}}{n_{\text{total}}}$$
$$P(y=0) = \frac{n_{\text{ham}}}{n_{\text{total}}}$$

**Feature Likelihood (Multinomial):**
For each feature $i$ and class $y$:
$$P(x_i|y) = \frac{N_{yi} + \alpha}{N_y + \alpha d}$$

Where:
- $N_{yi}$ = sum of TF-IDF values for feature $i$ in class $y$ documents
- $N_y$ = sum of all TF-IDF values in class $y$
- $\alpha = 1$ (Laplace smoothing)
- $d = 10,000$ (vocabulary size)

**Learning Complexity:**
- Time: $O(nd)$ - single pass through data
- Space: $O(d \cdot c)$ where $c=2$ classes
  - Store $P(x_i|y)$ for each feature-class pair
  - NB stores ~20,000 parameters (10k features × 2 classes)

**Sample Size Effect:**
More data improves estimates of $P(x_i|y)$:
$$\text{Var}(P(x_i|y)) \propto \frac{1}{n_y}$$

With larger $n_y$, variance decreases, estimates become more reliable.

**Why Full Dataset Training is Safe:**
- CV already provided an estimate of how the model generalizes
- NB has strong inductive bias (independence assumption) limiting overfitting
- With 10k features and ~87k samples, we have roughly 9 samples per feature - adequate for NB

In [46]:
model_lr = pipeline_lr.fit(df["text"], df["label"])

### Technical Explanation: Training Final Logistic Regression Model on Full Dataset

**Why This Approach:**
Training LR on the full dataset leverages all available information to learn optimal decision boundaries. The L1 regularization prevents overfitting even with complete data usage.

**Technical Details:**
- **Iterative Optimization**: SAGA solver runs for up to 1000 epochs (passes over the data)
  - Each update adjusts the weights using a variance-reduced gradient estimate
  - Stops early once the coefficient updates fall below the tolerance `tol`
  
- **L1 Regularization Effect**: Automatic feature selection during training
  - Many weights driven to exactly zero
  - Final model typically uses only a subset of the 10k features
  
- **Bigram Learning**: Model can pick up discriminative phrase patterns
  - Illustrative example: "limited time" (scikit-learn joins the tokens of an n-gram with a space)

**Mathematical Foundation:**

**Optimization Problem:**
$$\min_{\mathbf{w}, b} \left[ \frac{1}{n}\sum_{i=1}^{n}\log(1 + e^{-y_i(\mathbf{w}^T\mathbf{x}_i + b)}) + \lambda||\mathbf{w}||_1 \right]$$

**SAGA Update Rule:**
At iteration $t$, for randomly selected sample $i$, the gradient step on the loss is:
$$\mathbf{w}^{t+\frac{1}{2}} = \mathbf{w}^t - \eta \left[\nabla \ell_i(\mathbf{w}^t) + \text{gradient correction}\right]$$

Where:
- $\nabla \ell_i$ = gradient of log-loss for sample $i$
- Gradient correction = maintains an unbiased, lower-variance estimate using stored gradients
- The L1 penalty is not added to this gradient; it is applied in the proximal step that follows

**Proximal Operator for L1:**
After gradient step, soft-thresholding applied:
$$w_j^{\text{new}} = \text{sign}(w_j^{\text{old}})\max(|w_j^{\text{old}}| - \lambda\eta, 0)$$

This creates sparsity: weights with $|w_j| < \lambda\eta$ become exactly zero.
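The soft-thresholding operator is a one-liner in NumPy. The sketch below is illustrative only: `soft_threshold` is a hypothetical helper, the weights are invented, and `threshold` plays the role of $\lambda\eta$.

```python
import numpy as np

def soft_threshold(w, threshold):
    """Shrink each weight toward zero by `threshold`; clip to zero if it would cross."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

w_old = np.array([0.50, -0.20, 0.05, -0.01])
w_new = soft_threshold(w_old, threshold=0.1)

print(w_new)
print("non-zero weights:", np.count_nonzero(w_new))
```

The two weights whose magnitude is below the threshold become exactly zero, which is where the sparsity comes from.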

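**Convergence Criterion:**
Training stops when the coefficient updates become small relative to the coefficients themselves, as controlled by `tol` (default $10^{-4}$), or when `max_iter` is reached. If the iteration limit is hit first, scikit-learn emits a `ConvergenceWarning`.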

**Learned Parameters:**
- Weight vector: $\mathbf{w} \in \mathbb{R}^d$ where most entries are zero
- Bias term: $b \in \mathbb{R}$
- Non-zero weights indicate important features:
  - Large positive $w_j$ → feature $j$ strongly indicates spam
  - Large negative $w_j$ → feature $j$ strongly indicates ham

**Decision Boundary:**
Hyperplane in $d$-dimensional space:
$$\mathbf{w}^T\mathbf{x} + b = 0$$

Classification:
$$\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}^T\mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

**Model Capacity:**
With up to $d = 10,000$ unigram and bigram features (capped by `max_features`):
- Theoretical capacity: can fit $\approx d$ samples perfectly
- L1 regularization reduces effective capacity to the subset of active (non-zero) features
- Prevents overfitting despite high dimensionality
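The sparsity effect can be observed by counting non-zero coefficients. The sketch below compares L1 and L2 penalties on synthetic data from `make_classification`; the dataset sizes, `C=0.05`, and `random_state` are arbitrary illustrative choices, not values from the notebook. On the notebook's fitted pipeline, the equivalent count would be taken from `model_lr.named_steps['lr'].coef_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 40 features, only 5 of which carry signal
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=5, n_redundant=0, random_state=0
)

lr_l1 = LogisticRegression(penalty="l1", solver="saga", C=0.05,
                           max_iter=5000, random_state=0).fit(X, y)
lr_l2 = LogisticRegression(solver="saga", C=0.05,
                           max_iter=5000, random_state=0).fit(X, y)

nnz_l1 = int(np.count_nonzero(lr_l1.coef_))
nnz_l2 = int(np.count_nonzero(lr_l2.coef_))
print("active features with L1:", nnz_l1)
print("active features with L2:", nnz_l2)
```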

In [47]:
import os
import joblib

# save the models and test the results on the real world

model_path_nb = os.path.join(gdrive_extracted_data_path, 'model_nb.joblib')
model_path_lr = os.path.join(gdrive_extracted_data_path, 'model_lr.joblib')

joblib.dump(model_nb, model_path_nb)
joblib.dump(model_lr, model_path_lr)


['/content/gdrive/MyDrive/trec_spam_data_extracted/model_lr.joblib']

### Technical Explanation: Model Persistence with Joblib and Deployment Preparation

**Why This Approach:**
Trained models must be saved for deployment. Joblib is optimized for objects that hold large NumPy arrays (such as model parameters) and supports optional compression.

**Technical Details:**
- **Joblib vs Pickle**: 
  - Joblib: Built on pickle, with efficient handling of large numerical arrays
  - Pickle: General Python serialization, less efficient for large arrays
  - Both execute code on load, so only load model files from trusted sources
  
- **What Gets Saved**:
  - **TF-IDF Vectorizer**: Vocabulary dict, IDF weights, vectorizer parameters
  - **Classifier**: Model weights, hyperparameters, class labels
  - **Pipeline Structure**: Ensures correct transform → predict order
  - The `preprocess_text` function is not part of the pipeline, so new emails must be cleaned with it before prediction
  
- **File Format**: .joblib files are pickles
  - `joblib.dump` as called here writes them uncompressed; passing `compress` (for example `compress=3`) reduces file size at the cost of slower saving and loading
  - Preserves exact model state for reproducible predictions
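The effect of the `compress` argument can be checked with a quick round trip. In this sketch a dictionary holding a mostly-zero array stands in for a sparse weight vector; the object, file names, and compression level are illustrative assumptions rather than the notebook's models.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for model state: a mostly-zero weight vector plus a bias term
weights = {"coef": np.zeros(10_000), "intercept": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "weights_raw.joblib")
    packed_path = os.path.join(tmp, "weights_packed.joblib")

    joblib.dump(weights, raw_path)                 # default: no compression
    joblib.dump(weights, packed_path, compress=3)  # compressed, level 3

    raw_size = os.path.getsize(raw_path)
    packed_size = os.path.getsize(packed_path)
    restored = joblib.load(packed_path)

print(f"uncompressed: {raw_size} bytes, compressed: {packed_size} bytes")
```

The saving on a real model depends on how compressible its arrays and vocabulary are, so measuring with `os.path.getsize` is more reliable than a rule of thumb.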

**Mathematical Foundation:**

**Model State for Naive Bayes:**
Serialized objects include:
1. **Class priors**: $P(y=0), P(y=1) \in \mathbb{R}^2$
2. **Feature log-probabilities**: $\log P(x_i|y) \in \mathbb{R}^{d \times 2}$
   - Matrix size: 10,000 features × 2 classes = 20,000 floats
   - Storage: ~160 KB (8 bytes per float64)
3. **TF-IDF vocabulary**: Dict mapping words to indices
   - ~10,000 strings
4. **IDF weights**: $\text{IDF}(t) \in \mathbb{R}^d$

**Model State for Logistic Regression:**
1. **Weight vector**: $\mathbf{w} \in \mathbb{R}^d$ (up to 10,000 unigram and bigram features, capped by `max_features`)
   - Storage: ~80 KB as float64
   - Most entries are zero due to L1 regularization
2. **Bias term**: $b \in \mathbb{R}$
3. **TF-IDF vocabulary and IDF weights** (same structure as NB, with bigrams included)

**Total File Sizes (rough estimates, not measured):**
- NB model: ~1-2 MB (uncompressed), ~500 KB (compressed)
- LR model: ~2-4 MB (uncompressed), ~800 KB (compressed)

**Loading and Inference:**
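```python
import joblib

model = joblib.load('model_lr.joblib')
# predict() expects an iterable of documents, so wrap the cleaned email string in a list
prediction = model.predict([new_email_text])  # array containing 0 (ham) or 1 (spam)
probability = model.predict_proba([new_email_text])  # rows of [P(ham), P(spam)]
```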

**Inference Complexity:**
For single email:
- TF-IDF transform: $O(m)$ where $m$ = email length in words (one vocabulary lookup per token)
- NB prediction: $O(d)$ - dot product of feature vector with log-probs
- LR prediction: $O(d)$ - dot product $\mathbf{w}^T\mathbf{x}$, sigmoid computation

Inference is typically fast (on the order of milliseconds per email), though actual latency depends on hardware and email length.

**Production Deployment:**
Saved models enable:
1. **API Integration**: Load model in FastAPI/Flask server
2. **Batch Processing**: Process thousands of emails efficiently
3. **Version Control**: Track model versions over time
4. **A/B Testing**: Compare multiple model versions in production
5. **Reproducibility**: Exact same predictions across different systems

In [48]:
df.to_csv(os.path.join(gdrive_extracted_data_path, 'combined_data.csv'), index=False)

### Technical Explanation: Preprocessed Data Persistence for Future Analysis

**Why This Approach:**
Saving the preprocessed combined dataset enables:
- **Reproducibility**: Exact same data for retraining or analysis
- **Data Auditing**: Track preprocessing steps applied
- **Future Reference**: Analyze model predictions against original data
- **Compliance**: Document what data was used for training

**Technical Details:**
- **CSV Export**: Universal format compatible with all tools
- **Index Exclusion**: `index=False` omits row numbers, avoiding redundancy
- **Data Integrity**: Preserved in Google Drive for long-term storage

**Use Cases:**
- Retraining models with updated hyperparameters
- Analyzing misclassified examples
- Comparing different preprocessing approaches
- Creating training/validation dataset splits
- Model explainability analysis
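One detail to watch when reloading the saved CSV: text that preprocessing reduced to an empty string is read back as `NaN` by default. The sketch below shows the round trip on an invented two-row frame, using an in-memory buffer in place of the Drive file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"label": [1, 0], "text": ["free prize", ""]})

# Write without the index, as in the notebook, but to an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default parsing turns the empty text field into NaN
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps it as an empty string
kept_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["text"].isna().sum(), kept_read["text"].tolist())
```

When retraining from `combined_data.csv`, either option works as long as the missing values are handled before the text reaches the vectorizer.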