# SVM Baseline Model Before Generated Data
**Authors:** Matías Arévalo, Pilar Guerrero, Moritz Goebbels, Tomás Lock, Allan Stalker  
**Date:** January – May 2025  

## Purpose
Build an SVM model that serves as the baseline for detecting scam/spam messages. This notebook uses the `preprocessed_spam_dataset.csv` file, which is the preprocessed dataset before any data generated by our model is added.

To run this notebook, place that file in the `data/` folder. Otherwise, update the file path used when loading the data so the notebook runs properly.
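
If the dataset may live in more than one place, the path can be resolved once before loading. The following is a minimal sketch and not part of the original notebook: `resolve_dataset_path` is a hypothetical helper, and the candidate folders in the usage comment are assumptions based on the `data/` location mentioned above.

```python
from pathlib import Path

DATASET_NAME = "preprocessed_spam_dataset.csv"

def resolve_dataset_path(candidate_dirs, filename=DATASET_NAME):
    """Return the first existing path to `filename` among `candidate_dirs`.

    Hypothetical helper for illustration; raises FileNotFoundError when the
    file is in none of the candidate folders.
    """
    for folder in candidate_dirs:
        path = Path(folder) / filename
        if path.is_file():
            return path
    raise FileNotFoundError(
        f"{filename} not found in any of: {[str(d) for d in candidate_dirs]}"
    )

# Example usage (folder names are assumptions; adjust them to your layout):
# csv_path = resolve_dataset_path(["data", "../../data"])
# df = pd.read_csv(csv_path)
```

Failing early with a clear error message is usually easier to debug than a `read_csv` traceback raised from a hard-coded relative path.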

## Import Libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)

## Import Data & Preprocessing

### Loading Data

In [None]:
df = pd.read_csv('../../data/preprocessed_spam_dataset.csv')

In [None]:
df.head()

### Ensuring Only Ham and Spam Messages

In [None]:
df = df[df['label'].isin(['spam', 'ham'])]
df['label'].unique()

### Data Splitting (Train, Val, Test sets)
Here we are using the following ratio to split the data:
- Train Set = 70% of original dataset.
- Validation Set = 15% of original dataset.
- Test Set = 15% of original dataset.

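The split sizes were worked out by hand. The test set (15%) is held out first, and the validation set is then drawn as 17.6% of the remaining 85% (0.15 / 0.85 ≈ 0.176), which is about 15% of the original dataset.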

In [None]:
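# First hold out 15% of the data as the test set.
df_temp, test = train_test_split(df, test_size=0.15, stratify=df['label'], random_state=42)
# Then take 17.6% of the remaining 85% (about 15% of the original) as the validation set.
train, val = train_test_split(df_temp, test_size=0.176, stratify=df_temp['label'], random_state=42)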
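
The two-step split arithmetic can be checked with a few lines of plain Python. This is only a sketch of the fractions involved: `split_fractions` is a hypothetical helper, and the actual row counts also depend on how `train_test_split` rounds.

```python
def split_fractions(test_size, val_size_of_remainder):
    """Return (train, val, test) fractions of the original dataset for a
    two-step split: test set first, then validation from what is left."""
    remainder = 1.0 - test_size
    val = remainder * val_size_of_remainder
    train = remainder - val
    return train, val, test_size

# Exact second-step fraction that would yield a 15% validation set.
exact_val_size = 0.15 / (1.0 - 0.15)

# Fractions produced by the values used in this notebook (0.15 and 0.176).
train_frac, val_frac, test_frac = split_fractions(0.15, 0.176)

print(f"exact second-step test_size: {exact_val_size:.4f}")
print(f"train={train_frac:.4f}, val={val_frac:.4f}, test={test_frac:.4f}")
```

With the rounded value `0.176`, the resulting shares are roughly 70.0% train, 15.0% validation, and 15.0% test.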

### Value Mapping
In the notebook we will assign the value `0` to `ham` messages and `1` to `spam` messages.

In [None]:
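label_map = {'ham': 0, 'spam': 1}
# assign() returns new DataFrames, which sidesteps a possible SettingWithCopyWarning.
train = train.assign(label=train['label'].map(label_map))
val = val.assign(label=val['label'].map(label_map))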

### X and y Values

In [None]:
X_train = train['clean_message'].dropna()
y_train = train['label'].loc[X_train.index]
X_val   = val['clean_message'].dropna()
y_val   = val['label'].loc[X_val.index]

## Detection Model

### Building Model

In [None]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(3, 4), max_features=100000)),
    ('svm', LinearSVC())
])
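
To make the `analyzer='char'` and `ngram_range=(3, 4)` settings concrete, the sketch below lists the character n-grams of a short string in plain Python. It is only an illustration: `char_ngrams` is a hypothetical helper that lowercases the input and collapses runs of whitespace as an approximation of the vectorizer's default preprocessing, and it does not compute TF-IDF weights or apply `max_features`.

```python
import re

def char_ngrams(text, min_n=3, max_n=4):
    """List character n-grams of `text` for every n in [min_n, max_n].

    Hypothetical helper approximating what a character-level vectorizer
    extracts before counting and weighting.
    """
    # Lowercase and collapse runs of whitespace into a single space.
    text = re.sub(r"\s\s+", " ", text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the whole string.
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

print(char_ngrams("Win now"))
# ['win', 'in ', 'n n', ' no', 'now', 'win ', 'in n', 'n no', ' now']
```

Note that the n-grams span the space between words, so features such as `'n n'` capture word transitions as well as word fragments.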

### Training Model

In [None]:
# X_train and y_train already exclude rows with a missing clean_message.
pipeline.fit(X_train, y_train)

## Test Model (Inference and Metrics)

### Make Predictions

In [None]:
# X_val already excludes rows with a missing clean_message.
preds = pipeline.predict(X_val)

### Print Metrics

In [None]:
# Compare against y_val, which is aligned with the rows that were actually predicted.
print("Accuracy:", accuracy_score(y_val, preds))
print("Precision:", precision_score(y_val, preds))
print("Recall:", recall_score(y_val, preds))
print("F1 Score:", f1_score(y_val, preds))

### Print Classification Report

In [None]:
print("\nClassification Report:\n")
print(classification_report(y_val, preds, target_names=["ham", "spam"]))

### Confusion Matrix

In [None]:
cm = confusion_matrix(y_val, preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["ham", "spam"], yticklabels=["ham", "spam"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
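
The scalar metrics printed earlier can be read directly off the confusion matrix. The sketch below does this in plain Python, assuming the layout scikit-learn uses for binary labels `0` and `1` (rows are actual classes, columns are predicted classes, so the matrix is `[[TN, FP], [FN, TP]]`). `metrics_from_confusion` is a hypothetical helper, and the counts are made-up numbers for illustration only, not results from this notebook.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, precision, recall, and F1 for the positive class
    (spam = 1) from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts: 90 ham kept, 10 ham flagged, 5 spam missed, 45 spam caught.
example_cm = [[90, 10], [5, 45]]
for name, value in metrics_from_confusion(example_cm).items():
    print(f"{name}: {value:.3f}")
```

For a spam filter, the top-right cell (ham flagged as spam) drives precision down, while the bottom-left cell (spam that slipped through) drives recall down.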