## Spam Detector

- Task: **binary classification** → spam (1) or ham (0).  
- Input: raw text messages.  
- Preprocessing: tokenize → convert text to numeric features.  
- Model: classifier (baseline ML or neural net).  
- Training: minimize binary cross-entropy (BCE) loss.  
- Evaluation: accuracy, precision, recall, F1.  

In [288]:
# Import libraries
import pandas as pd
import torch
from torch import nn

In [289]:
# Read the df
df = pd.read_csv("./data/SMSSpamCollection", sep="\t", names=["type", "message"])

df

Unnamed: 0,type,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [290]:
# Creating column spam
df["spam"] = (df["type"] == 'spam').astype(int)

# Deleting type column
df.drop(columns="type", inplace=True)

df

Unnamed: 0,message,spam
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
...,...,...
5567,This is the 2nd time we have tried 2 contact u...,1
5568,Will ü b going to esplanade fr home?,0
5569,"Pity, * was in mood for that. So...any other s...",0
5570,The guy did some bitching but I acted like i'd...,0


In [291]:
# Creating df for training
df_train = df.sample(frac=0.8, random_state=42)

# Creating df for testing
df_test = df.drop(index=df_train.index)
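
The split above samples rows uniformly at random, so the share of spam in the train and test sets can differ slightly by chance. Since spam is the minority class, one possible alternative is to sample each class separately. The sketch below shows this on a small made-up frame; `toy`, `toy_train` and `toy_test` are hypothetical names used only for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df: 10 messages, 2 of them spam.
toy = pd.DataFrame({
    "message": [f"msg {i}" for i in range(10)],
    "spam": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Sample 50% of each class separately so both splits keep the same spam ratio.
toy_train = toy.groupby("spam").sample(frac=0.5, random_state=42)

# Everything that was not sampled goes to the test split.
toy_test = toy.drop(index=toy_train.index)

train_ratio = toy_train["spam"].mean()
test_ratio = toy_test["spam"].mean()
print(f"Spam ratio (train): {train_ratio:.2f}")
print(f"Spam ratio (test):  {test_ratio:.2f}")
```

With the real data the same idea would use `frac=0.8`, matching the split in the cell above.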

In [292]:
from sklearn.feature_extraction.text import CountVectorizer

# Define vectorizer
vectorizer = CountVectorizer(max_features=1000) # Keep only the 1000 most frequent tokens as features

In [293]:
# Define input
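messages_train = vectorizer.fit_transform(df_train["message"]).toarray() # Fit the vocabulary on the training messages and convert the counts to a dense array
messages_test = vectorizer.transform(df_test["message"]).toarray() # Reuse the training vocabulary for the test messages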

# Transform input into tensor
X_train = torch.tensor(messages_train, dtype=torch.float32)
X_test = torch.tensor(messages_test, dtype=torch.float32)

X_train

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
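
The tensor is mostly zeros because each row only counts the few vocabulary tokens that appear in that message. To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words count. It is a simplified stand-in, not `CountVectorizer` itself: `build_vocabulary`, `vectorize` and `toy_messages` are hypothetical names, and the tie-breaking between equally frequent tokens may differ from scikit-learn.

```python
import re
from collections import Counter

# Tokens are lowercase words with at least two characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(message):
    return TOKEN_PATTERN.findall(message.lower())

def build_vocabulary(messages, max_features):
    """Keep the max_features most frequent tokens, sorted alphabetically."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return sorted(token for token, _ in counts.most_common(max_features))

def vectorize(message, vocabulary):
    """Count how often each vocabulary token occurs in the message."""
    counts = Counter(tokenize(message))
    return [counts[token] for token in vocabulary]

toy_messages = [
    "free entry to win a free prize",
    "ok see you later",
    "win a free prize now",
]

vocabulary = build_vocabulary(toy_messages, max_features=3)
vectors = [vectorize(message, vocabulary) for message in toy_messages]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Tokens outside the vocabulary are simply ignored, which is why the second toy message maps to a row of zeros.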

In [294]:
# Define output
y_train = torch.tensor(df_train["spam"].values, dtype=torch.float32).reshape(-1, 1)
y_test = torch.tensor(df_test["spam"].values, dtype=torch.float32).reshape(-1, 1)

y_train

tensor([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])

In [295]:
# Define model
model = nn.Linear(1000, 1) # 1000 inputs (max_features) and 1 output (the spam logit)

# Define loss function
criterion = nn.BCEWithLogitsLoss() # Loss function that applies sigmoid internally

# Define optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)
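
Because the model outputs a raw logit, the loss has to apply the sigmoid itself. The sketch below illustrates, in plain Python, why combining the two steps is useful: a naive sigmoid followed by a log breaks down for large logits, while a rearranged formula stays finite. The function names `bce_naive` and `bce_with_logits` are hypothetical, and this is an illustration of the idea under those assumptions, not PyTorch's actual implementation.

```python
import math

def bce_naive(logit, target):
    """Sigmoid first, then binary cross-entropy on the probability."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target):
    """Same loss rewritten as max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits both versions agree.
for logit, target in [(0.0, 1.0), (2.0, 1.0), (-1.5, 0.0)]:
    print(logit, target, bce_naive(logit, target), bce_with_logits(logit, target))

# For an extreme logit the naive version fails: sigmoid rounds to 1.0 and log(0) is undefined.
try:
    bce_naive(1000.0, 0.0)
except ValueError as error:
    print("naive version failed:", error)

print("stable version:", bce_with_logits(1000.0, 0.0))
```

A logit of 0 gives a loss of ln 2 ≈ 0.693, which is the value to expect from a model that has not learned anything yet.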

In [None]:
# Training loop
epochs = 10000

for epoch in range(epochs):
    optimizer.zero_grad()             # Reset the gradients from the previous step
    y_pred = model(X_train)           # Forward pass: raw logits, not probabilities
    loss = criterion(y_pred, y_train) # Compute the BCE loss (sigmoid is applied inside the criterion)
    loss.backward()                   # Compute the gradients of the loss with respect to the parameters
    optimizer.step()                  # Update the model weights using the computed gradients

    if epoch % 500 == 0:
        print(f"Loss: {loss.item()}") # Print the loss every 500 epochs

Loss: 0.6755606532096863
Loss: 0.2943618893623352
Loss: 0.21991458535194397
Loss: 0.18235984444618225
Loss: 0.1594766527414322
Loss: 0.1438889354467392
Loss: 0.13247667253017426
Loss: 0.12368971854448318
Loss: 0.1166682243347168
Loss: 0.11089546233415604
Loss: 0.10604146867990494
Loss: 0.10188519209623337
Loss: 0.0982726663351059
Loss: 0.09509316831827164
Loss: 0.09226488322019577
Loss: 0.08972594141960144
Loss: 0.08742865920066833
Loss: 0.0853356122970581
Loss: 0.08341697603464127
Loss: 0.08164870738983154


## Evaluation in Binary Classification

- **Accuracy:** Fraction of all messages classified correctly.
- **Sensitivity/Recall:** Among the actual positives (spam), the fraction predicted as positive.
- **Specificity:** Among the actual negatives (ham), the fraction predicted as negative.
- **Precision:** Among the messages predicted as positive, the fraction that are actually positive.
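
The four definitions above can all be written in terms of the confusion-matrix counts (TP, FP, TN, FN). The following is a minimal sketch on made-up labels; `confusion_counts`, `binary_metrics` and the toy lists are hypothetical and only meant to show the arithmetic. It also includes F1, the harmonic mean of precision and recall.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = spam, 0 = ham."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def safe_div(numerator, denominator):
    """Avoid ZeroDivisionError when a class or prediction is absent."""
    return numerator / denominator if denominator else float("nan")

def binary_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Toy example: 3 spam and 5 ham messages, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f} %")
```

On imbalanced data, accuracy alone can look good even when many spam messages are missed, which is why recall and precision are reported separately.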

In [297]:
# Model evaluation
def evaluate_model(X, y):
    model.eval()
    with torch.no_grad():
        # Convert logits to probabilities and threshold at 0.5
        y_pred = torch.sigmoid(model(X)) > 0.5

        # Calculate accuracy
        accuracy = (y_pred == y).type(torch.float32).mean()
        print(f"Accuracy: {accuracy * 100:.2f} %")

        # Calculate sensitivity (recall)
        sensitivity = (y_pred[y == 1] == y[y == 1]).type(torch.float32).mean()
        print(f"Sensitivity: {sensitivity * 100:.2f} %")

        # Calculate specificity
        specificity = (y_pred[y == 0] == y[y == 0]).type(torch.float32).mean()
        print(f"Specificity: {specificity * 100:.2f} %")

        # Calculate precision
        precision = (y_pred[y_pred == 1] == y[y_pred == 1]).type(torch.float32).mean()
        print(f"Precision: {precision * 100:.2f} %")

In [298]:
# Evaluation on the training set
evaluate_model(X_train, y_train)

Accuracy: 98.03 %
Sensitivity: 86.68 %
Specificity: 99.77 %
Precision: 98.28 %


In [299]:
# Evaluation on the test set
evaluate_model(X_test, y_test)

Accuracy: 96.68 %
Sensitivity: 78.57 %
Specificity: 99.58 %
Precision: 96.80 %
