# SMS Spam Classification - Model Training

This notebook trains and evaluates models for SMS spam classification. We will:
1. Load the prepared data splits
2. Build a text feature extraction pipeline (TF-IDF)
3. Train and evaluate multiple models
4. Tune the hyperparameters of one model
5. Compare the benchmark models on the test set and select the best one

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

## 1. Load Data

Load the train, validation, and test splits created in `prepare.ipynb`.

In [2]:
# load data splits
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('validation.csv')
test_df = pd.read_csv('test.csv')

print(f"Train: {len(train_df)}, Validation: {len(val_df)}, Test: {len(test_df)}")

Train: 3900, Validation: 836, Test: 836


In [3]:
# separate features and labels
X_train = train_df['message']
y_train = train_df['label']

X_val = val_df['message']
y_val = val_df['label']

X_test = test_df['message']
y_test = test_df['label']
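
Before training, it can help to check how imbalanced the labels are, because that determines how informative accuracy is. The sketch below is illustrative only: it rebuilds a label column from the test-set support counts reported in the classification report of this notebook (724 ham, 112 spam) and assumes the encoding 0 = ham, 1 = spam. The names `y_demo`, `spam_share`, and `always_ham_accuracy` are hypothetical.

```python
import pandas as pd

# Label counts matching the test-set support reported in this notebook
# (724 ham, 112 spam); the encoding 0 = ham, 1 = spam is an assumption.
y_demo = pd.Series([0] * 724 + [1] * 112, name="label")

spam_share = y_demo.mean()            # fraction of rows labelled spam
always_ham_accuracy = 1 - spam_share  # accuracy of predicting "ham" for every message

print(f"spam share: {spam_share:.3f}")
print(f"always-ham accuracy: {always_ham_accuracy:.3f}")
```

A classifier that never predicts spam would already score about 0.87 accuracy on such a split, which is why precision, recall, and F1 are the more useful metrics here.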

## 2. Feature Extraction

We'll use TF-IDF (Term Frequency-Inverse Document Frequency) to convert text messages into numerical features. This is a standard approach for text classification.

In [4]:
# create TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# fit on training data and transform all sets
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)
X_test_tfidf = vectorizer.transform(X_test)

print(f"TF-IDF features shape: {X_train_tfidf.shape}")

TF-IDF features shape: (3900, 5000)


## 3. Model Functions

Define functions to fit, score, and evaluate models.

In [5]:
def fit_model(model, X_train, y_train):
    """
    Fit a model on training data.
    """
    model.fit(X_train, y_train)
    return model

In [6]:
def score_model(model, X, y):
    """
    Score a model on given data. Returns the predicted labels.
    """
    y_pred = model.predict(X)
    return y_pred
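
`score_model` returns hard labels only. If continuous scores are needed later (for example to move the decision threshold), one possible approach is sketched below. The helper `get_scores` and the synthetic data are hypothetical and not part of the notebook; the sketch relies on `LogisticRegression` and `MultinomialNB` exposing `predict_proba`, while `LinearSVC` exposes only `decision_function`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def get_scores(model, X):
    """Return one continuous score per row (hypothetical helper)."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]  # probability of the positive class
    return model.decision_function(X)        # signed distance to the decision boundary

# Tiny synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

lr_scores = get_scores(LogisticRegression().fit(X_demo, y_demo), X_demo)
svm_scores = get_scores(LinearSVC().fit(X_demo, y_demo), X_demo)
print(lr_scores[:3], svm_scores[:3])
```

The two kinds of scores are on different scales: probabilities lie in [0, 1], while SVM margins are unbounded, so thresholds are not interchangeable between models.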

In [7]:
def evaluate_predictions(y_true, y_pred):
    """
    Evaluate model predictions. Returns accuracy, precision, recall, and F1.
    """
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred)
    }
    return metrics

In [8]:
def validate_model(model, X_train, y_train, X_val, y_val):
    """
    Fit on train, score and evaluate on both train and validation.
    """
    # fit on training data
    model = fit_model(model, X_train, y_train)
    
    # score on train and validation
    y_train_pred = score_model(model, X_train, y_train)
    y_val_pred = score_model(model, X_val, y_val)
    
    # evaluate on both sets
    train_metrics = evaluate_predictions(y_train, y_train_pred)
    val_metrics = evaluate_predictions(y_val, y_val_pred)
    
    return model, train_metrics, val_metrics

In [9]:
def print_metrics(metrics, name=""):
    """
    Print metrics in a readable format.
    """
    print(f"{name}")
    print(f"  Accuracy:  {metrics['accuracy']:.4f}")
    print(f"  Precision: {metrics['precision']:.4f}")
    print(f"  Recall:    {metrics['recall']:.4f}")
    print(f"  F1 Score:  {metrics['f1']:.4f}")

## 4. Train and Validate Models

We'll train three benchmark models:
1. **Logistic Regression** - simple linear classifier
2. **Naive Bayes** - probabilistic classifier good for text
3. **Linear SVM** - effective for high-dimensional data

### Model 1: Logistic Regression

In [10]:
# train and validate logistic regression
lr_model = LogisticRegression(max_iter=1000)
lr_model, lr_train_metrics, lr_val_metrics = validate_model(
    lr_model, X_train_tfidf, y_train, X_val_tfidf, y_val
)

print("Logistic Regression Results:")
print("-" * 30)
print_metrics(lr_train_metrics, "Training:")
print()
print_metrics(lr_val_metrics, "Validation:")

Logistic Regression Results:
------------------------------
Training:
  Accuracy:  0.9695
  Precision: 0.9903
  Recall:    0.7801
  F1 Score:  0.8727

Validation:
  Accuracy:  0.9617
  Precision: 0.9878
  Recall:    0.7232
  F1 Score:  0.8351


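Precision is high on both splits, but recall is low: the model misses roughly 28% of spam on validation (recall 0.7232). The train-validation gap is small (F1 0.8727 vs. 0.8351), so there is no sign of significant overfitting.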

### Model 2: Naive Bayes

In [11]:
# train and validate naive bayes
nb_model = MultinomialNB()
nb_model, nb_train_metrics, nb_val_metrics = validate_model(
    nb_model, X_train_tfidf, y_train, X_val_tfidf, y_val
)

print("Naive Bayes Results:")
print("-" * 30)
print_metrics(nb_train_metrics, "Training:")
print()
print_metrics(nb_val_metrics, "Validation:")

Naive Bayes Results:
------------------------------
Training:
  Accuracy:  0.9851
  Precision: 1.0000
  Recall:    0.8891
  F1 Score:  0.9413

Validation:
  Accuracy:  0.9773
  Precision: 1.0000
  Recall:    0.8304
  F1 Score:  0.9073


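Naive Bayes reaches perfect precision on both splits (no ham flagged as spam) and higher recall than Logistic Regression (0.8304 on validation). It is also fast to train and a common baseline for text classification.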

### Model 3: Linear SVM

In [12]:
# train and validate linear SVM
svm_model = LinearSVC(max_iter=1000)
svm_model, svm_train_metrics, svm_val_metrics = validate_model(
    svm_model, X_train_tfidf, y_train, X_val_tfidf, y_val
)

print("Linear SVM Results:")
print("-" * 30)
print_metrics(svm_train_metrics, "Training:")
print()
print_metrics(svm_val_metrics, "Validation:")

Linear SVM Results:
------------------------------
Training:
  Accuracy:  0.9997
  Precision: 0.9981
  Recall:    1.0000
  F1 Score:  0.9990

Validation:
  Accuracy:  0.9856
  Precision: 0.9808
  Recall:    0.9107
  F1 Score:  0.9444


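Linear SVM has the best validation scores of the three models (F1 0.9444, recall 0.9107). It fits the training set almost perfectly (F1 0.9990), so its train-validation gap is larger than for the other two models.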

## 5. Hyperparameter Tuning

Let's tune Logistic Regression, which had the lowest validation F1 of the three models, using grid search with 3-fold cross-validation on the training data.

In [13]:
# grid search with 3-fold cross-validation on the training data
# (GridSearchCV was already imported in the first cell)

# define parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'solver': ['lbfgs', 'liblinear']
}

# grid search on training data
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,
    scoring='f1'
)

grid_search.fit(X_train_tfidf, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV F1 score: {grid_search.best_score_:.4f}")

Best parameters: {'C': 10, 'solver': 'lbfgs'}
Best CV F1 score: 0.8923


In [14]:
# validate the tuned model
best_lr = grid_search.best_estimator_

y_val_pred = score_model(best_lr, X_val_tfidf, y_val)
tuned_val_metrics = evaluate_predictions(y_val, y_val_pred)

print("Tuned Logistic Regression - Validation:")
print_metrics(tuned_val_metrics, "")

Tuned Logistic Regression - Validation:

  Accuracy:  0.9833
  Precision: 0.9804
  Recall:    0.8929
  F1 Score:  0.9346


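Tuning improves the validation results substantially: with `C=10`, F1 rises from 0.8351 to 0.9346 and recall from 0.7232 to 0.8929, while precision stays at about 0.98. This suggests that the default `C=1` regularizes too strongly for this dataset.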
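
The grid search above runs on features from a vectorizer that was already fitted on the whole training set, so the vocabulary and IDF weights used in each CV fold were computed with that fold's held-out rows included. The effect is often small, but one way to avoid it is to put the vectorizer and the classifier in a single `Pipeline` and search over that. The following is a minimal sketch on made-up toy data; `toy_messages`, `toy_labels`, `pipe`, `pipe_grid`, and `pipe_search` are hypothetical names, and the grid is reduced to `C` for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train (1 = spam, 0 = ham)
toy_messages = [
    "win a free prize now", "free cash waiting claim now", "urgent claim your free prize",
    "win cash now text back", "free entry win a prize", "claim your cash prize today",
    "see you at lunch", "running late be there soon", "can you pick up milk",
    "thanks for dinner last night", "meeting moved to friday", "happy birthday see you soon",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is refitted inside every CV fold, on that fold's training part only
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The "clf__" prefix routes the parameter to the classifier step
pipe_grid = {"clf__C": [0.1, 1, 10]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=3, scoring="f1")
pipe_search.fit(toy_messages, toy_labels)

print(pipe_search.best_params_)
print(pipe_search.predict(["free prize"]))
```

With this layout the fitted search object accepts raw text directly, so the same object can be applied to the validation and test messages without a separate `transform` call.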

## 6. Test Set Evaluation - Final Benchmark

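Now we evaluate the three benchmark models on the held-out test set to select the best one. Note that the tuned Logistic Regression (`best_lr`) is not part of this comparison, and that choosing a model by its test score makes that score a somewhat optimistic estimate.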

In [15]:
# score all models on test set
models = {
    'Logistic Regression': lr_model,
    'Naive Bayes': nb_model,
    'Linear SVM': svm_model
}

test_results = {}

print("Test Set Results:")
print("=" * 50)

for name, model in models.items():
    y_test_pred = score_model(model, X_test_tfidf, y_test)
    metrics = evaluate_predictions(y_test, y_test_pred)
    test_results[name] = metrics
    print(f"\n{name}:")
    print_metrics(metrics, "")

Test Set Results:
==================================================

Logistic Regression:

  Accuracy:  0.9593
  Precision: 0.9875
  Recall:    0.7054
  F1 Score:  0.8229

Naive Bayes:

  Accuracy:  0.9665
  Precision: 1.0000
  Recall:    0.7500
  F1 Score:  0.8571

Linear SVM:

  Accuracy:  0.9761
  Precision: 0.9792
  Recall:    0.8393
  F1 Score:  0.9038


In [16]:
# compare models in a table
results_df = pd.DataFrame(test_results).T
results_df = results_df.round(4)
print("\nModel Comparison on Test Set:")
print(results_df)


Model Comparison on Test Set:
                     accuracy  precision  recall      f1
Logistic Regression    0.9593     0.9875  0.7054  0.8229
Naive Bayes            0.9665     1.0000  0.7500  0.8571
Linear SVM             0.9761     0.9792  0.8393  0.9038


## 7. Select Best Model

For spam classification, we care about:
- **High Precision**: Minimize false positives (don't classify ham as spam)
- **Good Recall**: Catch most spam messages
- **F1 Score**: Balance between precision and recall

In [17]:
# select best model based on F1 score
best_model_name = max(test_results, key=lambda x: test_results[x]['f1'])
best_model = models[best_model_name]

print(f"Best Model: {best_model_name}")
print(f"F1 Score: {test_results[best_model_name]['f1']:.4f}")

Best Model: Linear SVM
F1 Score: 0.9038
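
An alternative that keeps the test set untouched until the end is to pick the model by validation F1 and report the test score only for the winner. The sketch below hard-codes the validation F1 values printed earlier in this notebook; in the notebook itself they could be read from `lr_val_metrics`, `nb_val_metrics`, `svm_val_metrics`, and `tuned_val_metrics`. The names `val_f1` and `best_by_val` are hypothetical.

```python
# Validation F1 scores copied from the outputs printed earlier in this notebook
val_f1 = {
    "Logistic Regression": 0.8351,
    "Naive Bayes": 0.9073,
    "Linear SVM": 0.9444,
    "Tuned Logistic Regression": 0.9346,
}

# Pick the model with the highest validation F1
best_by_val = max(val_f1, key=val_f1.get)
print(best_by_val)  # Linear SVM
```

Here the choice is the same as the one made on the test set, so the conclusion of the notebook does not change.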


In [18]:
# detailed evaluation of best model
y_test_pred = score_model(best_model, X_test_tfidf, y_test)

print(f"\n{best_model_name} - Classification Report:")
print(classification_report(y_test, y_test_pred, target_names=['ham', 'spam']))


Linear SVM - Classification Report:
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       724
        spam       0.98      0.84      0.90       112

    accuracy                           0.98       836
   macro avg       0.98      0.92      0.95       836
weighted avg       0.98      0.98      0.98       836



In [19]:
# confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:")
print("              Predicted")
print("              Ham   Spam")
print(f"Actual Ham    {cm[0][0]:4d}  {cm[0][1]:4d}")
print(f"Actual Spam   {cm[1][0]:4d}  {cm[1][1]:4d}")

Confusion Matrix:
              Predicted
              Ham   Spam
Actual Ham     722     2
Actual Spam     18    94
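
As a cross-check, the reported test metrics for Linear SVM can be recomputed by hand from the four counts in this confusion matrix, treating spam as the positive class. The variable names below are chosen for illustration.

```python
# Counts from the confusion matrix above (spam is the positive class)
tn, fp, fn, tp = 722, 2, 18, 94

precision = tp / (tp + fp)                          # share of predicted spam that is spam
recall = tp / (tp + fn)                             # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# precision=0.9792 recall=0.8393 f1=0.9038 accuracy=0.9761
```

These values match the Linear SVM row of the comparison table, and they show where the errors are: only 2 legitimate messages are blocked, while 18 spam messages get through.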


## Summary

We trained and evaluated three benchmark models for SMS spam classification:

1. **Logistic Regression** - Simple and effective linear model
2. **Naive Bayes** - Classic text classification model
3. **Linear SVM** - Good for high-dimensional sparse data

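All three models reach test accuracy above 0.95, and they differ mainly in recall: 0.7054 for Logistic Regression, 0.7500 for Naive Bayes, and 0.8393 for Linear SVM. Linear SVM has the best test F1 (0.9038) with a precision of 0.9792, so it blocks few legitimate messages (2 of 724) while catching 94 of 112 spam messages.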

This prototype demonstrates that ML-based spam classification is feasible and can achieve good results with simple TF-IDF features and standard classifiers.