# 02 - Data Preprocessing

This notebook cleans and prepares the raw dataset for model training.

**Data Notes:**
1.  **User Mentions**: `@user` mentions were cleaned prior to corpus creation.
2.  **Special Tags**: Tags like `[GROSERIA]` or `[PERSONA]` were added during an obfuscation task before corpus creation. We strictly preserve their casing.

**Preprocessing Logic:**
1.  **Context Construction**: We concatenate `QuoteText` + `TweetText` to respect the stimulus-response order of the conversation.
2.  **Cleaning**:
    -   Remove URLs (cleaning artifacts).
    -   Convert emojis to their text aliases (demojize).
    -   Lowercase standard text, but **preserve uppercase Tags**.

In [1]:
import pandas as pd
import re
import emoji
from sklearn.model_selection import train_test_split
import os

# Ensure data directories exist
os.makedirs('../data/processed', exist_ok=True)

In [2]:
%load_ext watermark
%watermark -v -n -m -p numpy,pandas,sklearn,emoji

Python implementation: CPython
Python version       : 3.12.12
IPython version      : 9.10.0

numpy  : 1.26.4
pandas : 3.0.0
sklearn: 1.8.0
emoji  : 2.15.0

Compiler    : Clang 17.0.0 (clang-1700.6.3.2)
OS          : Darwin
Release     : 25.2.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit



## 1. Load Data
We construct the text field by placing `QuoteText` before `TweetText`.

In [3]:
df = pd.read_csv('../data/raw/corpus.csv')
print(f"Original Shape: {df.shape}")

# Build 'text' as QuoteText (when present) followed by TweetText.
# Missing values become empty strings; strip() drops the leading space left when there is no quote.
df['text'] = (df['QuoteText'].fillna('') + ' ' + df['TweetText'].fillna('')).str.strip()

df['label'] = df['Categorization']

# Keep only processed columns
df = df[['text', 'label']]
df.head()

Original Shape: (3000, 12)


|   | text | label |
|---|------|-------|
| 0 | "Que ganas de fumarme un porrito y tomarme una... | NEGATIVE |
| 1 | Escuchar lana del rey es como tomarme una líne... | NEGATIVE |
| 2 | A las 10 am ya tomé un latte, un expreso y est... | NEGATIVE |
| 3 | Todos miran la moneda menos Maradona, que mira... | NEGATIVE |
| 4 | Amo tomarme una línea de colectivos que no con... | NEGATIVE |
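
The quote-then-tweet ordering can be illustrated on plain Python values. This is a minimal sketch, not the notebook's own code: `build_context` is a hypothetical helper, and it assumes a missing quote arrives as `None` or as a float NaN (how pandas typically represents an empty CSV cell).

```python
import math

def build_context(quote, tweet):
    """Join quote and tweet in stimulus-response order, skipping missing parts."""
    def is_missing(value):
        # Assumption: missing cells are None or float NaN
        return value is None or (isinstance(value, float) and math.isnan(value))

    # Keep only the parts that are present, trimming stray whitespace
    parts = [p.strip() for p in (quote, tweet) if not is_missing(p)]
    return " ".join(p for p in parts if p)

print(build_context("Quoted tweet", "Reply"))  # Quoted tweet Reply
print(build_context(float("nan"), "Reply"))    # Reply
```

Joining only the non-empty parts avoids a leading space when there is no quote, which is the same effect the `strip()` call has in the cell above.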


## 2. Preprocessing Functions

We define `clean_text` to:
-   Remove URLs.
-   Convert emojis to text aliases (demojize).
-   Lowercase text *except* special tags matching `[TAG]`.
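
The tag-preserving lowercase step relies on `re.split` keeping the matched delimiters when the pattern contains a capturing group. The sketch below isolates that step; `TAG_PATTERN` and `lower_except_tags` are illustrative names, and it assumes tags consist only of uppercase letters (including Á, É, Í, Ó, Ú, Ñ) inside square brackets.

```python
import re

# Assumption: a tag is one or more uppercase letters wrapped in square brackets
TAG_PATTERN = re.compile(r"(\[[A-ZÁÉÍÓÚÑ]+\])")

def lower_except_tags(text):
    # With one capturing group, split() alternates text and tags:
    # ordinary text at even indices, matched tags at odd indices
    parts = TAG_PATTERN.split(text)
    return "".join(
        part if i % 2 == 1 else part.lower()
        for i, part in enumerate(parts)
    )

print(TAG_PATTERN.split("Hola [GROSERIA] Mundo"))
# ['Hola ', '[GROSERIA]', ' Mundo']
print(lower_except_tags("Hola [GROSERIA] Mundo [Persona]"))
# hola [GROSERIA] mundo [persona]
```

Note that a bracketed token containing lowercase letters, such as `[Persona]`, does not match the pattern and is lowercased like ordinary text.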

In [4]:
def clean_text(text):
    if not isinstance(text, str):
        return ""
    
    # 1. Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    
    # 2. Replace emojis with Spanish text aliases (e.g. 💊 -> :píldora:), padded with spaces
    text = emoji.demojize(text, language='es', delimiters=(" :", ": "))
    
    # 3. Handle Casing: Lowercase everything EXCEPT tags like [TAG]
    # Split by tags. Capturing group () keeps the delimiter (the tag).
    parts = re.split(r'(\[[A-ZÁÉÍÓÚÑ]+\])', text)
    processed_parts = []
    for part in parts:
        if re.match(r'^\[[A-ZÁÉÍÓÚÑ]+\]$', part):
            processed_parts.append(part) # Keep original case for Tags
        else:
            processed_parts.append(part.lower()) # Lowercase everything else
            
    text = "".join(processed_parts)
    
    # 4. Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Test the function
sample = "Hola world! 💊 Mira https://t.co/xyz [GROSERIA] y [ANATOMIA] #fiesta"
print(f"Original: {sample}")
print(f"Cleaned:  {clean_text(sample)}")

Original: Hola world! 💊 Mira https://t.co/xyz [GROSERIA] y [ANATOMIA] #fiesta
Cleaned:  hola world! :píldora: mira [GROSERIA] y [ANATOMIA] #fiesta
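
The URL pattern is deliberately simple, so it helps to see what it consumes. The sketch below applies the same regular expression on its own; `strip_urls` is a hypothetical helper that also collapses the whitespace left behind, as the whitespace-normalization step does.

```python
import re

URL_PATTERN = re.compile(r"http\S+|www\.\S+")

def strip_urls(text):
    # Remove each URL, then collapse the whitespace left behind
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

print(strip_urls("Mira https://t.co/xyz ahora"))   # Mira ahora
print(strip_urls("ver www.example.com, gracias"))  # ver gracias
print(strip_urls("(http://a.b/c) fin"))            # ( fin
```

Because `\S+` runs to the next whitespace, punctuation glued to the end of a URL (a comma or a closing parenthesis) is removed along with it. Any token that merely starts with `http` is removed too.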


## 3. Apply Preprocessing

In [5]:
df['text_clean'] = df['text'].apply(clean_text)
print("Sample of cleaned text:")
for txt in df['text_clean'].sample(5, random_state=42):
    print(f"- {txt}")

Sample of cleaned text:
- desde ayer estoy sin luz. es meás fácil dejar de tomar merca que de usar energía eléctrica. sufro abstinencia de tv, pc, play y heladera.
- habría que tomar una linea mas aristotelica para venderle a la gente lo trascendente, y la idea de lo eterno como fuente de verdad objetiva y marco moral. pero yo no descartaría a maquiavelo por la dinámica de poder que es necesario entender para poder interactuar
- y ojala que si llegas a tomar merca este finde, te [GROSERIA] y sea ibuprofeno pisado.
- fumaban un porro en un patio ajeno y fueron detenidos
- y cuando escucho lust for life de iggy pop me dan ganas de inyectarme heroína en el metro


## 4. Train / Validation / Test Split

Stratified split:
-   **Train**: 70%
-   **Validation**: 15%
-   **Test**: 15%

In [6]:
train_df, temp_df = train_test_split(df, test_size=0.3, stratify=df['label'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['label'], random_state=42)

print(f"Train size: {train_df.shape}")
print(f"Val size:   {val_df.shape}")
print(f"Test size:  {test_df.shape}")

Train size: (2100, 3)
Val size:   (450, 3)
Test size:  (450, 3)
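
The 70/15/15 proportions follow from two successive splits: 30% of all rows are held out first, and that held-out part is then halved. The arithmetic can be sketched as follows; `two_stage_sizes` is a hypothetical helper, and rounding the held-out count up is an assumption that does not matter for 3000 rows, where every count is a whole number.

```python
import math

def two_stage_sizes(n_rows, holdout_pct=30, test_pct_of_holdout=50):
    # Stage 1: hold out holdout_pct of all rows (rounded up, an assumption)
    n_holdout = math.ceil(n_rows * holdout_pct / 100)
    n_train = n_rows - n_holdout
    # Stage 2: divide the held-out rows between validation and test
    n_test = math.ceil(n_holdout * test_pct_of_holdout / 100)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

print(two_stage_sizes(3000))  # (2100, 450, 450)
```

The second split uses `test_size=0.5` rather than `0.15` because it is applied to the held-out 30%, not to the full dataset.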


## 5. Save Splits

In [7]:
train_df.to_csv('../data/processed/train.csv', index=False)
val_df.to_csv('../data/processed/val.csv', index=False)
test_df.to_csv('../data/processed/test.csv', index=False)

print("Datasets saved to ../data/processed/")

Datasets saved to ../data/processed/


## Summary of the results

Preprocessing is complete. The pipeline applied the following steps:
*   **Context**: `QuoteText` precedes `TweetText`.
*   **Cleaning**: URLs removed, emojis demojized, text lowercased.
*   **Tag Preservation**: `[TAG]` tokens retain their uppercase formatting.
*   **Output**: 2100 Train / 450 Val / 450 Test samples.