<a href="https://colab.research.google.com/github/joshcova/NLP_Workshop/blob/main/05_SpamClassification_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam classification exercise

A paradigmatic use case of supervised machine learning models for text classification is that of spam detection.

We are all very good at identifying whether an email is *spam* or not (a.k.a. *ham*). It would be helpful, however, to delegate this detection to a machine that automatically identifies whether a text is spam or not.

How do we do it?

The model needs to learn what is spam and what is not from a labeled sample of texts (the **training dataset**). Based on this *learning*, it can extend the classification to "unseen" emails on which it has not been trained (the **testing dataset**).

From the classifier's performance on the unseen testing dataset, we can gauge how well it is likely to perform at classifying emails as spam or not in general.

Let us see how this works in practice and develop our spam detection classifier!

First, let's load, as usual, the necessary libraries.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/joshcova/NLP_Workshop/refs/heads/main/data/spam_detection_dataset_1.csv")

In [None]:
df.head()

As you can see, the DataFrame contains a series of emails (in the "text" column) and a classification ("label") that indicates whether each one is spam or not.

The idea is to split the dataset into a training and testing dataset, but first we need to get an understanding of the class distribution between the two labels.


In [None]:
df["label"].value_counts()
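
Why the class distribution matters can be sketched on a small, made-up dataset. The example below is only an illustration: `toy_df` and its 8-to-2 ham/spam ratio are hypothetical and not taken from the workshop data. It shows how `value_counts(normalize=True)` reports shares instead of raw counts, and how the `stratify` argument of `train_test_split` keeps those shares the same in the training and testing sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham and 2 spam messages (not the workshop dataset)
toy_df = pd.DataFrame({
    "text": [f"message {i}" for i in range(10)],
    "label": ["ham"] * 8 + ["spam"] * 2,
})

# Share of each class rather than raw counts
class_shares = toy_df["label"].value_counts(normalize=True)
print(class_shares)

# stratify= asks the split to preserve the class proportions in both halves
train_df, test_df = train_test_split(
    toy_df, test_size=0.5, stratify=toy_df["label"], random_state=42
)
print(train_df["label"].value_counts())
print(test_df["label"].value_counts())
```

When one class is rare, a purely random split can leave very few examples of it in the testing set, which makes precision and recall unstable. Stratifying is one common way to guard against that.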

The next step is to vectorize the text, that is, to convert the strings into numeric vectors. As we have covered in the lecture, there are two main types of vectorization: a count vectorizer and a TF-IDF vectorizer.

Let us start with the count vectorizer first.

In [None]:
count_vectorizer_spam = CountVectorizer()

In [None]:
X_spam = count_vectorizer_spam.fit_transform(df["text"])  # document-term matrix of word counts
y_spam = df["label"]

In [None]:
# We split the dataset into the customary 80%-20% split. The X variable here stands for the vectorized text (the features) and
# y is the outcome that we are trying to predict (think dependent variable), that is, the label
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(X_spam, y_spam, test_size=0.2, random_state=42)

In [None]:
print(X_train_spam.shape)
print(y_train_spam.shape)
print(X_test_spam.shape)
print(y_test_spam.shape)

In [None]:
# You can inspect how the vectorization works in practice by converting the vectorized test set into a DataFrame
x_test_spam_df = pd.DataFrame(X_test_spam.toarray())

In [None]:
## Now you can see the power of the Document-Term Matrix. The documents are the rows and the terms are the columns.

x_test_spam_df.head()

Now that we have separated the dataset into a testing and training dataset, we can start fitting our classifier.

For this analysis we use the simple Naive Bayes (MultinomialNB) classifier. In essence, the classifier estimates from the training data how likely each word is to appear in each class, and combines these word likelihoods with the prior probability of each class to decide which class a new text most likely belongs to.

In other words, if the word "free" has been found to occur more often in training texts labeled as spam, the classifier assigns a greater probability that a text containing it is spam. It carries out this operation for all the words contained in an email and computes an aggregate probability for each class.

In [None]:
nb_classifier_spam = MultinomialNB()

In [None]:
# This is where the actual training takes place. If you think about the operation that is going on "under the hood", it is amazing how quickly Python computes it.
nb_classifier_spam.fit(X_train_spam, y_train_spam)

In [None]:
# Now we can use what the model has learned to make predictions on the dataset on which it has not been trained (i.e. the testing dataset)

y_pred_nb = nb_classifier_spam.predict(X_test_spam)

In [None]:
# check out the accuracy score

accuracy_score(y_test_spam, y_pred_nb)

In [None]:
# We can also check the precision and recall scores, bearing in mind that we have to define what a positive is. Think back to the formulas for accuracy, precision and recall to see
# the different ways in which they are computed

In [None]:
# Let us define the positive to be "spam"
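# print() is needed because a notebook cell only displays the value of its last expression
print(precision_score(y_test_spam, y_pred_nb, pos_label="spam"))
print(recall_score(y_test_spam, y_pred_nb, pos_label="spam"))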
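
As a reminder of how these metrics relate to each other, the following sketch computes them by hand on a small set of invented labels and compares the results with scikit-learn. The lists `y_true` and `y_hat` are hypothetical and unrelated to the classifier above. "spam" is treated as the positive class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions, for illustration only
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_hat = ["spam", "spam", "ham", "spam", "spam", "ham", "ham", "ham"]

pairs = list(zip(y_true, y_hat))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam correctly flagged
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham wrongly flagged as spam
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam that slipped through
tn = sum(t == "ham" and p == "ham" for t, p in pairs)    # ham correctly left alone

manual_accuracy = (tp + tn) / len(y_true)
manual_precision = tp / (tp + fp)  # of everything flagged as spam, how much really is spam
manual_recall = tp / (tp + fn)     # of all real spam, how much was caught
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(manual_accuracy, accuracy_score(y_true, y_hat))
print(manual_precision, precision_score(y_true, y_hat, pos_label="spam"))
print(manual_recall, recall_score(y_true, y_hat, pos_label="spam"))
print(manual_f1, f1_score(y_true, y_hat, pos_label="spam"))
```

The F1 score, imported at the top of the notebook, is the harmonic mean of precision and recall and is a convenient single number when the two pull in different directions.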