# Logistic Regression Model After Generated Data
**Authors:** Matías Arévalo, Pilar Guerrero, Moritz Goebbels, Tomás Lock, Allan Stalker  
**Date:** January – May 2025  

## Purpose
Create a Logistic Regression Model to detect scam/spam messages. Here we use the `train.csv` and `val.csv` files we created from the generated and original data.

To run this notebook, both files should be placed in the `generated_data/` folder. If they are stored elsewhere, update the file paths in the loading cells so the notebook runs properly.
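
As a quick sanity check before running the rest of the notebook, something like the following could confirm that both files are where the loading cells expect them. This is only a sketch: `DATA_DIR` and `missing` are illustrative names that do not appear elsewhere in the notebook.

```python
from pathlib import Path

# Hypothetical helper variable; change it if the CSV files are stored elsewhere.
DATA_DIR = Path("generated_data")

# Collect the names of any expected files that are not present.
expected = ("train.csv", "val.csv")
missing = [name for name in expected if not (DATA_DIR / name).exists()]

if missing:
    print("Missing files in", DATA_DIR, ":", missing)
else:
    print("Both data files found.")
```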

## Import Libraries

In [None]:
import numpy as np  # needed by the confidence plot at the end of the notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)

## Import Data & Preprocessing

### Loading Data

In [None]:
train = pd.read_csv('generated_data/train.csv')
train.head()

In [None]:
val = pd.read_csv('generated_data/val.csv')
val.head()

### X and y Values

In [None]:
X_train = train['clean_message'].dropna()
y_train = train['label'].loc[X_train.index]
X_val   = val['clean_message'].dropna()
y_val   = val['label'].loc[X_val.index]
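
The cell above drops rows with a missing `clean_message` and then selects labels by the surviving index, so features and labels stay aligned. A minimal sketch on a made-up frame (the rows are invented; only the column names mirror the notebook) shows the effect:

```python
import pandas as pd

# Hypothetical toy data; the second row has a missing message.
toy = pd.DataFrame({
    "clean_message": ["win a free prize now", None, "see you at lunch"],
    "label": [1, 0, 0],
})

X_toy = toy["clean_message"].dropna()    # row 1 is dropped, index labels 0 and 2 remain
y_toy = toy["label"].loc[X_toy.index]    # pick only the labels for the surviving rows

print(list(X_toy.index))
print(list(y_toy))
print(len(X_toy) == len(y_toy))
```

Because `y_toy` is shorter than `toy["label"]`, any later metric call should use the filtered labels rather than the original column.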

## Detection Model

### Building Model

In [None]:
baseline_lr = Pipeline([
    ('vect', CountVectorizer(
        analyzer='word',
        ngram_range=(1,1),
        lowercase=True
    )),
    ('clf', LogisticRegression(
        solver='liblinear',
        C=1.0,
        class_weight='balanced'
    ))
])

### Training Model

In [None]:
baseline_lr.fit(X_train, y_train)

## Test Model (Inference and Metrics)

### Make Predictions

In [None]:
preds_lr = baseline_lr.predict(X_val)

### Print Metrics

In [None]:
# Score against y_val (not val['label']) so the labels match the rows kept after dropna()
print("Accuracy:", accuracy_score(y_val, preds_lr))
print("Precision:", precision_score(y_val, preds_lr))
print("Recall:", recall_score(y_val, preds_lr))
print("F1 Score:", f1_score(y_val, preds_lr))
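
For reference, the sketch below shows how precision, recall, and F1 relate to the confusion-matrix counts, using invented labels. It assumes the positive class is encoded as `1` (spam) and the negative class as `0` (ham), which the notebook does not state explicitly.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration; 1 = spam, 0 = ham is an assumption.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```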

### Print Classification Report

In [None]:
print("\nClassification Report:\n")
print(classification_report(y_val, preds_lr, target_names=["ham", "spam"]))

### Confusion Matrix

In [None]:
cm = confusion_matrix(y_val, preds_lr)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["ham", "spam"], yticklabels=["ham", "spam"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

### Confidence Plot

In [None]:
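preds = baseline_lr.predict(X_val)
probs = baseline_lr.predict_proba(X_val)           # shape: (n_samples, n_classes)
classes = baseline_lr.named_steps['clf'].classes_  # sorted class labels, in column order of probs
# Map each predicted label to its column in probs, then read off that probability
idxs = np.searchsorted(classes, preds)
confidences = probs[np.arange(len(preds)), idxs]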
plt.figure(figsize=(10,5))
counts, bins, patches = plt.hist(confidences, bins=20, edgecolor='k')
plt.title("Predicted-Class Confidence Distribution")
plt.xlabel("Predicted-class probability")
plt.ylabel("Number of samples")
plt.xlim(0,1)

for cnt, patch in zip(counts, patches):
    x = patch.get_x() + patch.get_width()/2
    plt.text(x, cnt, int(cnt), ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()
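
The confidence lookup used in the plot can be checked on a tiny example. The sketch below fits a throwaway pipeline on four invented messages (not the notebook's data, and with default vectorizer settings) and confirms that the probability of the predicted class is the row-wise maximum of `predict_proba`, so for a binary model it never falls below 0.5.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data; 1 = spam, 0 = ham is an assumption.
toy_X = ["free prize now", "win cash prize", "lunch at noon", "see you tomorrow"]
toy_y = [1, 1, 0, 0]

toy_model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])
toy_model.fit(toy_X, toy_y)

toy_new = ["free cash", "lunch tomorrow"]
toy_preds = toy_model.predict(toy_new)
toy_probs = toy_model.predict_proba(toy_new)
toy_classes = toy_model.named_steps["clf"].classes_

# Same lookup as in the plotting cell: column of the predicted class for each row.
toy_idxs = np.searchsorted(toy_classes, toy_preds)
toy_conf = toy_probs[np.arange(len(toy_preds)), toy_idxs]

print(toy_conf)
print(np.allclose(toy_conf, toy_probs.max(axis=1)))
```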