<a href="https://colab.research.google.com/github/loresiensis/nlp/blob/main/Algoritmo_deteccion_spam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Spam Detection Algorithm

For this exercise I used a [Kaggle](https://www.kaggle.com/datasets/venky73/spam-mails-dataset?resource=download&select=spam_ham_dataset.csv) dataset containing 5171 emails labeled as spam (1) or ham (0).

In [None]:
raw_dataset = 'content/spam_ham_dataset.csv'

In [None]:
import pandas as pd

df = pd.read_csv('spam_ham_dataset.csv')

df

| | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |
| ... | ... | ... | ... | ... |
| 5166 | 1518 | ham | Subject: put the 10 on the ft\r\nthe transport... | 0 |
| 5167 | 404 | ham | Subject: 3 / 4 / 2000 and following noms\r\nhp... | 0 |
| 5168 | 2933 | ham | Subject: calpine daily gas nomination\r\n>\r\n... | 0 |
| 5169 | 1409 | ham | Subject: industrial worksheets for august 2000... | 0 |


Because I want to compute metrics and see how well the algorithm performs, I split the dataset into `train` and `test` subsets, keeping only the column with the email text and the column with the `0`/`1` labels.






In [None]:
!pip install pandas scikit-learn
from sklearn.model_selection import train_test_split

emails = df['text']
labels = df['label_num']

train_emails, test_emails, train_labels, test_labels = train_test_split(emails, labels, test_size=0.2, random_state=42)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
train_emails

5132    Subject: april activity surveys\r\nwe are star...
2067    Subject: message subject\r\nhey i ' am julie ^...
4716    Subject: txu fuels / sds nomination for may 20...
4710    Subject: re : richardson volumes nov 99 and de...
2268    Subject: a new era of online medical care .\r\...
                              ...                        
4426    Subject: re : ena sales on hpl\r\nlast that i ...
466     Subject: tenaska iv\r\nbob :\r\ni understand f...
3092    Subject: broom , bristles up , flew\r\nbe diff...
3772    Subject: calpine daily gas nomination ( weeken...
860     Subject: re : meter 1459 , 6 / 00\r\nyep , you...
Name: text, Length: 4136, dtype: object

In [None]:
train_labels

5132    0
2067    1
4716    0
4710    0
2268    1
       ..
4426    0
466     0
3092    1
3772    0
860     0
Name: label_num, Length: 4136, dtype: int64
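The `Length: 4136` shown above follows from the split arithmetic. The sketch below is a standard-library illustration, not scikit-learn's implementation. `holdout_split` is a hypothetical helper. It assumes the frame has 5171 rows (the count consistent with 4136 training rows) and that a fractional `test_size` is rounded up, which is how I understand `train_test_split` to behave. The shuffle order will not match scikit-learn's.

```python
import math
import random


def holdout_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of `items` and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the test portion is rounded up to a whole number of rows
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# 5171 stands in for the number of rows in the dataset
train_rows, test_rows = holdout_split(range(5171))
print(len(train_rows), len(test_rows))
```

With these assumptions the split leaves 4136 rows for training and 1035 for testing.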

Next, I tokenize the emails in both `train` and `test` and strip out punctuation and special characters.

In [None]:
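import spacy

# Small English pipeline; it must be installed before it can be loaded
nlp = spacy.load("en_core_web_sm")

# nlp.pipe processes the texts in batches, which is faster than calling nlp() per email
train_tokenized = list(nlp.pipe(train_emails))
test_tokenized = list(nlp.pipe(test_emails))


def preprocess_text(doc):
    """Lowercase the tokens of a spaCy Doc, dropping punctuation, whitespace, and tokens containing '^'."""
    tokens = [
        token.text.lower()
        for token in doc
        if not token.is_punct and not token.is_space and "^" not in token.text
    ]

    return ' '.join(tokens)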



In [None]:
train_preprocessed = [preprocess_text(text) for text in train_tokenized]
test_preprocessed = [preprocess_text(text) for text in test_tokenized]
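To see the effect of this cleanup without loading a spaCy model, here is a rough regex-based approximation. It is not spaCy's tokenizer: `preprocess_simple` is a hypothetical helper, the sample string is made up (loosely modeled on the truncated rows above), and the regex would split contractions and drop non-ASCII letters differently from `preprocess_text`.

```python
import re


def preprocess_simple(raw_text):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    tokens = re.findall(r"[a-z0-9]+", raw_text.lower())
    return " ".join(tokens)


# Hypothetical sample, not an actual row of the dataset
sample = "Subject: message subject\r\nhey i ' am julie ^_^ ."
print(preprocess_simple(sample))  # subject message subject hey i am julie
```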

Here I simply put the preprocessed text into DataFrames so that the result is easy to inspect:

In [None]:
train_data = pd.DataFrame({'email': train_preprocessed, 'label': train_labels})
test_data = pd.DataFrame({'email': test_preprocessed, 'label': test_labels})

train_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

In [None]:
train_data

| | email | label |
|---|---|---|
| 0 | subject april activity surveys we are starting... | 0 |
| 1 | subject message subject hey i am julie i just ... | 1 |
| 2 | subject txu fuels sds nomination for may 2001 ... | 0 |
| 3 | subject re richardson volumes nov 99 and dec 9... | 0 |
| 4 | subject a new era of online medical care a new... | 1 |
| ... | ... | ... |
| 4131 | subject re ena sales on hpl last that i had wa... | 0 |
| 4132 | subject tenaska iv bob i understand from sandi... | 0 |
| 4133 | subject broom bristles up flew be differentiab... | 1 |
| 4134 | subject calpine daily gas nomination weekend >... | 0 |


To compute relative frequencies, I keep a count of each word along with the total number of words in the spam and ham emails. Dividing each count by the corresponding total gives the probability of each word within the emails labeled as spam and within those labeled as ham.

In [None]:
from collections import defaultdict
import numpy as np

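def count_words(emails, labels):
    """Count word occurrences per class and the total number of words in each class."""
    word_count_spam = defaultdict(int)
    word_count_ham = defaultdict(int)
    total_words_spam = 0
    total_words_ham = 0

    # Iterate over the arguments rather than the global training variables
    for email, label in zip(emails, labels):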
        words = email.split()
        if label == 1:  # spam
            for word in words:
                word_count_spam[word] += 1
                total_words_spam += 1
        else:  # ham
            for word in words:
                word_count_ham[word] += 1
                total_words_ham += 1

    return word_count_spam, word_count_ham, total_words_spam, total_words_ham

word_count_spam, word_count_ham, total_words_spam, total_words_ham = count_words(train_preprocessed, train_labels)

In [None]:
word_prob_spam = {word: count / total_words_spam for word, count in word_count_spam.items()}
word_prob_ham = {word: count / total_words_ham for word, count in word_count_ham.items()}
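As a small worked example of the same relative-frequency idea, here is a sketch on a hypothetical three-email toy corpus. The names `toy`, `relative_freq`, `toy_prob_spam`, and `toy_prob_ham` are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus of (preprocessed email, label) pairs; 1 = spam, 0 = ham
toy = [("free money free", 1), ("meeting at noon", 0), ("free meeting", 0)]

# One word counter per class
counts = {0: Counter(), 1: Counter()}
for email, label in toy:
    counts[label].update(email.split())


def relative_freq(counter):
    """Divide each word count by the total number of words in the class."""
    total = sum(counter.values())
    return {word: n / total for word, n in counter.items()}


toy_prob_spam = relative_freq(counts[1])
toy_prob_ham = relative_freq(counts[0])

print(toy_prob_spam)                 # "free" is 2 of the 3 spam words
print(toy_prob_ham.get("money", 0))  # "money" never appears in ham
```

A word that never occurs in one class has no entry there, so its probability is effectively zero.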

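I also apply Laplace smoothing so that zero probabilities do not cause problems or lead to misclassification, for example for words that appear only in spam and never in ham. Note that, as written, each smoothed dictionary only contains the words observed in its own class, so a word seen in only one class is still missing from the other dictionary.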

In [None]:
def laplace_sm(word, word_count, total_words, num_unique_words, alpha=1):
    # Add-alpha smoothing: (count + alpha) / (total + alpha * vocabulary size)
    word_frequency = word_count[word]
    return (word_frequency + alpha) / (total_words + alpha * num_unique_words)

# Vocabulary size: distinct words seen in either class
num_unique_words = len(set(word_count_spam) | set(word_count_ham))
# Each dictionary only covers the words observed in its own class
laplace_prob_spam = {word: laplace_sm(word, word_count_spam, total_words_spam, num_unique_words) for word in word_count_spam}
laplace_prob_ham = {word: laplace_sm(word, word_count_ham, total_words_ham, num_unique_words) for word in word_count_ham}

In [None]:
laplace_prob_spam

{'subject': 0.00484296715362968,
 'message': 0.0008762425555741388,
 'hey': 0.0001236192816992561,
 'i': 0.003599502614184222,
 'am': 0.0003635861226448709,
 'julie': 1.0907583679346127e-05,
 'just': 0.0008144329147245108,
 'turned': 5.4537918396730637e-05,
 '18': 0.00017815720009598673,
 'high': 0.00047266195943833215,
 'school': 5.4537918396730637e-05,
 'senior': 5.0902057170281924e-05,
 'in': 0.009304168878482247,
 'houston': 2.1815167358692253e-05,
 'tx': 3.635861226448709e-05,
 'was': 0.0011307528414255484,
 'waiting': 0.00011271169801990998,
 'for': 0.007475330681578546,
 'such': 0.0005453791839673064,
 'a': 0.011162093965197536,
 'long': 0.00034904267773907605,
 'time': 0.0010725790618023691,
 'until': 0.00013452686537860224,
 'this': 0.00695540252619638,
 'day': 0.0004326674859473964,
 'finally': 7.99889469818716e-05,
 'got': 0.00022542339603981996,
 'my': 0.0012507362618983558,
 'wish': 0.0002908688981158967,
 'simply': 0.0001272551429257048,
 'trying': 3.635861226448709e-05,
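For comparison, here is a sketch of a variant that smooths over the shared vocabulary, so that a word seen in only one class still receives a small nonzero probability in the other. This is not the notebook's code and would give different scores and metrics. `smooth_over_vocabulary` and the toy counts are hypothetical.

```python
def smooth_over_vocabulary(word_count, vocabulary, alpha=1):
    """Add-alpha probabilities for every word in the shared vocabulary."""
    total = sum(word_count.values())
    denom = total + alpha * len(vocabulary)
    return {word: (word_count.get(word, 0) + alpha) / denom for word in vocabulary}


# Hypothetical toy counts, not taken from the dataset
toy_count_spam = {"free": 3, "money": 2}
toy_count_ham = {"meeting": 4, "free": 1}
vocabulary = set(toy_count_spam) | set(toy_count_ham)

toy_laplace_spam = smooth_over_vocabulary(toy_count_spam, vocabulary)
toy_laplace_ham = smooth_over_vocabulary(toy_count_ham, vocabulary)

# "meeting" never occurs in spam: (0 + 1) / (5 + 3)
print(toy_laplace_spam["meeting"])  # 0.125
```

Because every vocabulary word gets `alpha` added, the probabilities within each class still sum to 1.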


The next step uses these relative frequencies to compute a spam risk score. For each word of an email in the `test` set, if the word has a probability in both dictionaries (the parameters `word_prob_spam` and `word_prob_ham`, which receive the smoothed dictionaries here), the difference between the two log-probabilities, that is, the Naive Bayes log-likelihood ratio, is added to `spam_risk`. Words missing from either dictionary are skipped. Once a threshold is defined, this score determines whether an email is classified as spam or ham.

In [None]:
def calculate_spam_risk(email, word_prob_spam, word_prob_ham):
    words = email.split()
    spam_risk = 0

    for word in words:
        spam_probability = word_prob_spam.get(word, 0)
        ham_probability = word_prob_ham.get(word, 0)
        if spam_probability > 0 and ham_probability > 0:
            spam_risk += np.log(spam_probability) - np.log(ham_probability)

    return spam_risk

spam_risks = [calculate_spam_risk(email, laplace_prob_spam, laplace_prob_ham) for email in test_preprocessed]
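To make the sign of the score concrete, here is a toy version of the same sum of log-ratios using only the standard library. The probabilities are hypothetical and `log_likelihood_ratio` is an illustrative stand-in for `calculate_spam_risk`.

```python
import math


def log_likelihood_ratio(email, prob_spam, prob_ham):
    """Sum log P(word | spam) - log P(word | ham) over words known to both classes."""
    score = 0.0
    for word in email.split():
        p_spam = prob_spam.get(word, 0)
        p_ham = prob_ham.get(word, 0)
        if p_spam > 0 and p_ham > 0:
            score += math.log(p_spam) - math.log(p_ham)
    return score


# Hypothetical per-class word probabilities
toy_spam = {"free": 0.5, "meeting": 0.125}
toy_ham = {"free": 0.25, "meeting": 0.5}

print(log_likelihood_ratio("free free free", toy_spam, toy_ham))  # positive: leans spam
print(log_likelihood_ratio("free meeting", toy_spam, toy_ham))    # negative: leans ham
print(log_likelihood_ratio("unknown", toy_spam, toy_ham))         # 0.0: word skipped
```

A positive score means the words are, on balance, more typical of spam; a negative score means they are more typical of ham.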


To set the threshold I use the ROC curve, which plots the true positive rate against the false positive rate across thresholds; the area under it (AUC) measures how well the model separates the two classes. Here I obtained an `AUC-ROC of 0.989`, which means the score is very good at distinguishing positive from negative cases. I also obtained an `optimal threshold`, taken as the point on the curve closest to the top-left corner (FPR = 0, TPR = 1). Because this threshold is chosen on the `test` set, metrics computed on that same set will tend to be optimistic.

In [None]:
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(test_labels, spam_risks)

roc_auc = auc(fpr, tpr)

# Pick the ROC point closest to the top-left corner (FPR = 0, TPR = 1)
optimal_idx = np.argmin(np.sqrt((1 - tpr) ** 2 + fpr ** 2))
optimal_threshold = thresholds[optimal_idx]

print("AUC-ROC:", roc_auc)
print("Optimal threshold:", optimal_threshold)


AUC-ROC: 0.9889331481191871
Optimal threshold: -4.713664417585661
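The closest-to-top-left criterion can be sketched in plain Python on a handful of hypothetical scores. This is a simplified stand-in for `roc_curve`, not its implementation. It evaluates every distinct score as a threshold and treats a score at or above the threshold as spam. `roc_points` and `closest_to_top_left` are illustrative names.

```python
import math


def roc_points(labels, scores):
    """Return (threshold, fpr, tpr) for each distinct score, with score >= threshold counted as spam."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points


def closest_to_top_left(points):
    """Pick the threshold whose ROC point is nearest to (fpr=0, tpr=1)."""
    return min(points, key=lambda p: math.hypot(p[1], 1 - p[2]))[0]


# Hypothetical toy labels and scores, not taken from the dataset
toy_labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
toy_scores = [-9.0, -8.0, -7.0, -6.5, -5.0, -4.0, -3.0, 1.5, 4.0, 8.0]

best = closest_to_top_left(roc_points(toy_labels, toy_scores))
print(best)  # -4.0
```

In this toy case the chosen threshold catches all four spam emails at the cost of one false positive.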


Now I set `threshold` to the optimal threshold obtained above and classify an email as `1` (spam) if its score exceeds the threshold.

In [None]:
threshold = optimal_threshold
predictions = [1 if spam_risk > threshold else 0 for spam_risk in spam_risks]



Finally, I used the `test` set to compute some metrics.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions)
recall = recall_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)


Accuracy: 0.9748792270531401
Precision: 0.9348534201954397
Recall: 0.9795221843003413
F1 score: 0.9566666666666667
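These four numbers are tied together by the confusion-matrix counts. The sketch below recomputes them from counts that I back-derived from the printed metrics, assuming a test set of 1035 emails. The counts are an inference and not notebook output, and `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Back-derived counts (assumption): 287 spam caught, 20 ham flagged, 6 spam missed, 722 ham passed
derived = metrics_from_counts(tp=287, fp=20, fn=6, tn=722)
for name, value in derived.items():
    print(name, value)
```

Read this way, the classifier misses 6 of 293 spam emails and wrongly flags 20 of 742 ham emails.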
