# 02 — Data Preprocessing (Dual Pipeline)

This notebook prepares two variants of the dataset:
1. **Standard**: Basic cleaning (lowercase, remove URLs, convert emojis to Spanish text aliases, collapse whitespace).
2. **Irony-Augmented**: Standard cleaning + `[IRONIA]` tagging for detected colloquialisms.

**Output Locations**:
- `../data/processed/standard/`
- `../data/processed/irony/`

In [1]:
import pandas as pd
import re
import emoji
from sklearn.model_selection import train_test_split
import os

# Ensure data directories exist
os.makedirs('../data/processed/standard', exist_ok=True)
os.makedirs('../data/processed/irony', exist_ok=True)

In [2]:
%load_ext watermark
%watermark -v -n -m -p numpy,pandas,sklearn,emoji

Python implementation: CPython
Python version       : 3.12.12
IPython version      : 9.10.0

numpy  : 1.26.4
pandas : 3.0.0
sklearn: 1.8.0
emoji  : 2.15.0

Compiler    : Clang 17.0.0 (clang-1700.6.3.2)
OS          : Darwin
Release     : 25.2.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit



## 1. Load Data & Helper Functions

In [3]:
df = pd.read_csv('../data/raw/corpus.csv')
df['quote_safe'] = df['QuoteText'].fillna('')
# Vectorized concatenation; fillna('') also guards against missing TweetText values
df['text'] = (df['quote_safe'] + " " + df['TweetText'].fillna('')).str.strip()
df['label'] = df['Categorization']
df = df[['text', 'label']]
print(f"Loaded {df.shape[0]} samples")

Loaded 3000 samples


In [4]:
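def tag_irony_logic(text):
    """Replace colloquial irony markers with the [IRONIA] tag."""
    if not isinstance(text, str):
        return text
    # Laughter: two or more repetitions of "ja" or "je" (jaja, jejeje)
    text = re.sub(r'(?i)\b(j+a+){2,}\b', ' [IRONIA] ', text)
    text = re.sub(r'(?i)\b(j+e+){2,}\b', ' [IRONIA] ', text)
    # Parenthesized question mark: "(?)", "(??)"; the closing ")" is optional
    text = re.sub(r'\(\?+\)?', ' [IRONIA] ', text)
    # "xd" and elongated variants (xD, xddd)
    text = re.sub(r'(?i)\bx+d+\b', ' [IRONIA] ', text)
    # "ah re", "ahre", "are" and elongated variants
    text = re.sub(r'(?i)\b(a+h? ?r+e+)\b', ' [IRONIA] ', text)
    # Plain "are" (already matched by the pattern above; kept for explicitness)
    text = re.sub(r'(?i)\bare\b', ' [IRONIA] ', text)
    # Short interjections
    text = re.sub(r'(?i)\bbue\b', ' [IRONIA] ', text)
    text = re.sub(r'(?i)\bwe\b', ' [IRONIA] ', text)
    # Two-word phrases
    text = re.sub(r'(?i)\bbueno no\b', ' [IRONIA] ', text)
    text = re.sub(r'(?i)\bno bueno\b', ' [IRONIA] ', text)
    return text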

def clean_base(text):
    if not isinstance(text, str): return ""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
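    # Replace each emoji with its Spanish text alias, padded with spaces
    text = emoji.demojize(text, language='es', delimiters=(" :", ": "))
    # Lowercase everything except bracketed uppercase tags such as [IRONIA]
    parts = re.split(r'(\[[A-ZÁÉÍÓÚÑ]+\])', text)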
    processed = []
    for part in parts:
        if re.match(r'^\[[A-ZÁÉÍÓÚÑ]+\]$', part):
            processed.append(part)
        else:
            processed.append(part.lower())
    text = "".join(processed)
    # Whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def process_standard(text):
    return clean_base(text)

def process_irony(text):
    # Tag irony FIRST, then clean (so [IRONIA] is preserved as uppercase tag)
    text = tag_irony_logic(text)
    return clean_base(text)
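
The tag-first ordering can be checked in isolation. The sketch below is a stdlib-only approximation of `process_irony`: it applies the same regular expressions as `tag_irony_logic`, then lowercases everything except bracketed tags. URL removal and emoji conversion are left out, and `IRONY_PATTERNS`, `TAG_RE`, `tag_then_lower`, and the example sentences are hypothetical names and inputs introduced only for this illustration.

```python
import re

# Same patterns as tag_irony_logic, collected in a list for experimentation
IRONY_PATTERNS = [
    r'(?i)\b(j+a+){2,}\b',
    r'(?i)\b(j+e+){2,}\b',
    r'\(\?+\)?',
    r'(?i)\bx+d+\b',
    r'(?i)\b(a+h? ?r+e+)\b',
    r'(?i)\bare\b',
    r'(?i)\bbue\b',
    r'(?i)\bwe\b',
    r'(?i)\bbueno no\b',
    r'(?i)\bno bueno\b',
]
TAG_RE = r'(\[[A-ZÁÉÍÓÚÑ]+\])'

def tag_then_lower(text):
    """Hypothetical helper: tag markers, then lowercase all non-tag text."""
    for pattern in IRONY_PATTERNS:
        text = re.sub(pattern, ' [IRONIA] ', text)
    # re.split with a capturing group keeps each tag as its own list item
    parts = re.split(TAG_RE, text)
    text = ''.join(p if re.fullmatch(TAG_RE, p) else p.lower() for p in parts)
    # Collapse the extra spaces introduced around the tags
    return re.sub(r'\s+', ' ', text).strip()

examples = [
    "Qué lindo día jajaja",     # laughter
    "Me encanta madrugar (?)",  # parenthesized question mark
    "Ah re que no",             # "ah re"
    "we are here",              # English text that still triggers two rules
]
for sentence in examples:
    print(repr(sentence), '->', repr(tag_then_lower(sentence)))
```

The last example suggests that the short patterns (`we`, `are`) can fire on text that is not ironic, so it may be worth spot-checking tagged rows before training on them.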

## 2. Generate Datasets

In [5]:
# Standard
df_standard = df.copy()
df_standard['text_clean'] = df_standard['text'].apply(process_standard)

# Irony
df_irony = df.copy()
df_irony['text_clean'] = df_irony['text'].apply(process_irony)

print("Sample Standard:", df_standard['text_clean'].iloc[10])
print("Sample Irony:   ", df_irony['text_clean'].iloc[10])

Sample Standard: que ganas de tomarme una línea. la del 59, que me lleva a mi casa.
Sample Irony:    que ganas de tomarme una línea. la del 59, que me lleva a mi casa.
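
The two samples above are identical because row 10 contains none of the tagged markers, so the irony pipeline leaves it unchanged. To see the difference between the variants, it helps to look only at rows where the two outputs disagree. The sketch below shows the idea on plain lists; the `standard` and `irony` lists are hypothetical stand-ins for the two `text_clean` columns, not rows from the corpus.

```python
# Hypothetical cleaned outputs standing in for the two text_clean columns
standard = [
    "que ganas de tomarme una línea.",
    "qué lindo día jajaja",
    "me encanta madrugar (?)",
]
irony = [
    "que ganas de tomarme una línea.",
    "qué lindo día [IRONIA]",
    "me encanta madrugar [IRONIA]",
]

# Positions where tagging changed the cleaned text
changed = [i for i, (s, t) in enumerate(zip(standard, irony)) if s != t]

print(f"{len(changed)} of {len(standard)} rows changed")
for i in changed:
    print("Standard:", standard[i])
    print("Irony:   ", irony[i])
```

On the notebook's frames the equivalent mask would be `df_standard['text_clean'] != df_irony['text_clean']`; calling `.sum()` on it gives the number of affected rows.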


## 3. Split and Save

In [6]:
def save_splits(dataframe, name, output_dir):
    """Write stratified 70/15/15 train/val/test splits as CSV files."""
    train, temp = train_test_split(dataframe, test_size=0.3, stratify=dataframe['label'], random_state=42)
    val, test = train_test_split(temp, test_size=0.5, stratify=temp['label'], random_state=42)

    train.to_csv(f'{output_dir}/train.csv', index=False)
    val.to_csv(f'{output_dir}/val.csv', index=False)
    test.to_csv(f'{output_dir}/test.csv', index=False)
    print(f"Saved {name} splits to {output_dir}")

save_splits(df_standard, "Standard", "../data/processed/standard")
save_splits(df_irony, "Irony", "../data/processed/irony")

Saved Standard splits to ../data/processed/standard
Saved Irony splits to ../data/processed/irony
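
The two chained calls in `save_splits` amount to a 70/15/15 split. The arithmetic can be sketched as follows, assuming scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the other side; `split_sizes` is a hypothetical helper written for this illustration, not part of the notebook.

```python
import math

def split_sizes(n_samples, holdout=0.3, test_share=0.5):
    """Hypothetical helper: expected train/val/test sizes for two chained splits."""
    # First split: hold out `holdout` of the rows, rounded up
    n_temp = math.ceil(holdout * n_samples)
    n_train = n_samples - n_temp
    # Second split: divide the held-out rows into validation and test
    n_test = math.ceil(test_share * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test

print(split_sizes(3000))
```

Because both variants are split with the same labels, row order, and `random_state=42`, the same rows should land in the same split for each variant, which keeps comparisons between the two pipelines like-for-like.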
