# Data Preprocessing

**Authors:** Matías Arévalo, Pilar Guerrero, Moritz Goebbels, Tomás Lock, Allan Stalker  
**Date:** January – May 2025  

## Purpose
Apply further transformations to the merged dataset to prepare it for model training.
This includes the following text cleaning and normalization steps:  
- Replacing URLs with a `[URL]` placeholder  
- Replacing email addresses with `[EMAIL]`  
- Replacing phone numbers with `[PHONE]`  
- Replacing monetary amounts with `[MONEY]`  
- Replacing remaining numbers of two or more digits with `[NUM]`  
- Replacing emojis and other special symbols (anything other than word characters, whitespace, and `, . ! ?`) with `[EMOJI]`  
- Normalizing whitespace and converting text to lowercase  

## Import Libraries

In [None]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split

## Load Dataset

- **File name:** `final_spam_dataset.csv`  
- **Location:** `data/`

In [None]:
df = pd.read_csv('../../data/final_spam_dataset.csv')

In [None]:
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
df.shape

(28479, 2)
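
Before cleaning, it can help to confirm that the loaded frame looks as expected. The sketch below is one possible check, not part of the original notebook: `summarize_dataset` is a hypothetical helper, and the toy frame stands in for `final_spam_dataset.csv` so the example runs on its own.

```python
import pandas as pd

def summarize_dataset(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: basic health checks on a label/message frame."""
    return {
        "rows": len(frame),
        "missing_messages": int(frame["message"].isna().sum()),
        "duplicate_rows": int(frame.duplicated(subset=["label", "message"]).sum()),
        "label_counts": frame["label"].value_counts().to_dict(),
    }

# Toy data invented for illustration; replace with the real `df` in the notebook.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "message": ["Ok lar...", "Free entry", "Ok lar...", None],
})

summary = summarize_dataset(toy)
print(summary)
```

Calling `summarize_dataset(df)` on the real frame would show whether missing or duplicated messages need handling before the cleaning step.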

## Creating Cleaning Function

This function replaces specific elements in the text with standardized placeholders.  
The result is a normalized version of each message that is cleaner and more consistent for model training.

In [None]:
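def clean_text_advanced(text):
    """Replace URLs, emails, phone numbers, amounts, numbers, and symbols with placeholders."""
    # URLs: explicit schemes, www. prefixes, and bare domains with common TLDs.
    # This step runs first, so an address such as user@site.com is also consumed here.
    text = re.sub(r'https?:\/\/\S+|www\.\S+|\S+\.(com|net|org|io|ly|me|co)(\/\S*)?', ' [URL] ', text, flags=re.IGNORECASE)
    # Email addresses not already consumed by the URL step
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', ' [EMAIL] ', text)
    # Phone numbers: optional country code, then three digit groups split by optional spaces or hyphens
    text = re.sub(r'(\+?\d{1,3})?[\s\-]?\(?\d{2,4}\)?[\s\-]?\d{3,4}[\s\-]?\d{3,4}', ' [PHONE] ', text)
    # Monetary amounts: currency symbol followed by digits and optional decimals
    text = re.sub(r'[$€£¥₹]\s?\d+([\.,]\d{1,2})?', ' [MONEY] ', text)
    # Any remaining standalone number with two or more digits
    text = re.sub(r'\b\d{2,}\b', ' [NUM] ', text)
    # Every character other than word characters, whitespace, and , . ! ?
    # (this also matches the brackets of the placeholders inserted above)
    text = re.sub(r'[^\w\s,.!?]', ' [EMOJI] ', text)
    # Collapse whitespace, trim, and lowercase everything (placeholders included)
    text = re.sub(r'\s+', ' ', text).strip()
    text = text.lower()

    return text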
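
The sample output later in this notebook shows `[emoji] num [emoj...` where a `[NUM]` placeholder would be expected, which suggests that the symbol step rewrites the brackets of placeholders inserted by earlier steps. The sketch below is one possible variant that avoids this, not the authors' method. The name `clean_text_protected` is hypothetical, and three choices are assumptions: lowercasing first so that placeholders stay uppercase, matching emails before URLs, and letting the symbol step skip known placeholders.

```python
import re

def clean_text_protected(text):
    """Hypothetical variant of clean_text_advanced that keeps placeholders intact."""
    # Lowercase first so the uppercase placeholders inserted below survive.
    text = text.lower()
    # Emails before URLs, otherwise the bare-domain URL pattern consumes them.
    text = re.sub(r'\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b', ' [EMAIL] ', text)
    text = re.sub(r'https?://\S+|www\.\S+|\S+\.(?:com|net|org|io|ly|me|co)(?:/\S*)?', ' [URL] ', text)
    text = re.sub(r'(\+?\d{1,3})?[\s\-]?\(?\d{2,4}\)?[\s\-]?\d{3,4}[\s\-]?\d{3,4}', ' [PHONE] ', text)
    text = re.sub(r'[$€£¥₹]\s?\d+([.,]\d{1,2})?', ' [MONEY] ', text)
    text = re.sub(r'\b\d{2,}\b', ' [NUM] ', text)
    # Symbol step: a known placeholder is matched first and returned unchanged;
    # any other disallowed character becomes [EMOJI].
    text = re.sub(
        r'(\[(?:URL|EMAIL|PHONE|MONEY|NUM)\])|[^\w\s,.!?]',
        lambda m: m.group(1) or ' [EMOJI] ',
        text,
    )
    return re.sub(r'\s+', ' ', text).strip()

example = clean_text_protected(
    "Win $100 now! Visit http://spam.example.com or mail bob@example.com"
)
print(example)
```

Apostrophes are still treated as symbols in this variant, so contractions such as "I'm" are split around an `[EMOJI]` token. Whether to add `'` to the allowed character set is a separate design decision.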

## Applying Function to Dataset

Here we apply the previously defined function to the `message` column of the DataFrame and store the cleaned result in a new column called `clean_message`.

In [None]:
df['clean_message'] = df['message'].apply(clean_text_advanced)
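
`re.sub` raises a `TypeError` when it receives a non-string value such as `NaN`, so the call above assumes every entry in `message` is a string. If the merged file contains missing messages (not verified here), a guard like the following could be applied first. `clean_stub` is a hypothetical stand-in for `clean_text_advanced` so the sketch is self-contained, and the toy values are invented.

```python
import pandas as pd

def clean_stub(text):
    # Stand-in for clean_text_advanced: lowercase and collapse whitespace only.
    return " ".join(text.lower().split())

# Toy frame with one missing message.
toy = pd.DataFrame({"message": ["Hello  WORLD", None, "Call 12345"]})

# Replace missing values with an empty string and force string type before cleaning.
messages = toy["message"].fillna("").astype(str)
toy["clean_message"] = messages.apply(clean_stub)

print(toy["clean_message"].tolist())
```

Dropping such rows with `dropna(subset=['message'])` is an alternative if empty messages carry no value for training.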

The following cell displays a random sample of 10 rows:

In [None]:
df[['label', 'message', 'clean_message']].sample(10)

| | label | message | clean_message |
|---|---|---|---|
| 4830 | ham | I uploaded mine to Facebook | i uploaded mine to facebook |
| 21212 | ham | fix tpyo | fix tpyo |
| 24405 | spam | tlo look ups full reportpros 700 cash app sauc... | tlo look ups full reportpros [emoji] num [emoj... |
| 10860 | ham | thomas knudsen hi vince i met with thomas this... | thomas knudsen hi vince i met with thomas this... |
| 17648 | ham | chinatown got porridge claypot rice yam cake f... | chinatown got porridge claypot rice yam cake f... |
| 23052 | ham | i knw wat i hv to do | i knw wat i hv to do |
| 4879 | ham | K I'm leaving soon, be there a little after 9 | k i [emoji] m leaving soon, be there a little ... |
| 14144 | ham | tim i gave it all the thought it deserved wink... | tim i gave it all the thought it deserved wink... |
| 9285 | ham | personnel announcement jordan h mintz has been... | personnel announcement jordan h mintz has been... |
| 16954 | spam | free 1st week entry 2 textpod 4 a chance 2 win... | free 1st week entry 2 textpod 4 a chance 2 win... |
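
In the sample, `[emoji]` tokens appear both around `num` and in place of an apostrophe. A quick count per label can show how much of the cleaned text is affected. The sketch below assumes the lowercase `[emoji]` form seen in the output and uses an invented toy frame in place of `df`.

```python
import pandas as pd

# Toy rows invented for illustration, shaped like the cleaned output.
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "clean_message": [
        "i uploaded mine to facebook",
        "k i [emoji] m leaving soon",
        "win [emoji] money [emoji] now",
        "call now",
    ],
})

# Count literal [emoji] tokens per message, then aggregate by label.
toy["emoji_tokens"] = toy["clean_message"].str.count(r"\[emoji\]")
per_label = toy.groupby("label")["emoji_tokens"].agg(["sum", "mean"])
print(per_label)
```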


## Save the Preprocessed Dataset

Save the preprocessed DataFrame to the `data/` folder as `preprocessed_spam_dataset.csv`.

In [None]:
df.to_csv('../../data/preprocessed_spam_dataset.csv', index=False)
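
One point to keep in mind when reloading the saved file: by default `pd.read_csv` parses empty fields and strings such as `null` as missing values, which can turn short text messages into `NaN`. The sketch below illustrates this with an in-memory buffer instead of the real file. The toy rows are invented, and passing `keep_default_na=False` is one possible way to keep such values as text.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "clean_message": ["see you soon", "", "null"],
})

# Write to an in-memory buffer (stands in for the CSV file on disk).
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing: the empty string and "null" come back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# With keep_default_na=False both values are preserved as strings.
buffer.seek(0)
text_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["clean_message"].isna().sum())
print(text_read["clean_message"].tolist())
```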