# Feature Engineering Pipeline

Este notebook aplica todas las transformaciones de feature engineering desarrolladas durante la exploración y exporta los datasets procesados para su uso en modelos de machine learning.

Postdata: Todo esto fue generado con inteligencia artificial a partir del notebook de exploración de datos

## Imports

In [2]:
import pandas as pd
import emoji
import spacy
import re
from pathlib import Path

## Cargar Modelo de Spacy

In [3]:
# Cargar modelo de spacy para lematización
model = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop_words = spacy.lang.en.stop_words.STOP_WORDS

print(f"Modelo cargado: {model.meta['name']} v{model.meta['version']}")

Modelo cargado: core_web_sm v3.8.0


## Funciones de Feature Engineering

In [4]:
def clean_tweet(text):
    """Limpia el texto del tweet removiendo menciones, URLs, emojis y caracteres especiales"""
    text = text.lower()
    text = re.sub(r'@\w+', '', text)                # menciones
    text = re.sub(r'http\S+|www.\S+', '', text)     # URLs
    text = emoji.replace_emoji(text, replace=' ')   # emojis
    text = re.sub(r'[^\w\s]', ' ', text)            # símbolos especiales
    text = re.sub(r'\s+', ' ', text).strip()        # espacios extras
    return text

def lemma_filter(token):
    """Filtra stopwords y tokens de longitud 1"""
    return token.lemma_ not in stop_words and len(token.lemma_) > 1

def lemmatize_text(text):
    """Lematiza el texto del tweet"""
    cleaned_text = clean_tweet(text)
    doc = model(cleaned_text)
    return ' '.join([token.lemma_ for token in doc if lemma_filter(token)])

def add_text_features(df):
    """Agrega features basados en el análisis del texto"""
    df = df.copy()
    
    # Features básicos de texto
    df["text_length"] = df["text"].str.len()
    df["word_count"] = df["text"].str.split().str.len()
    df["hashtag_count"] = df["text"].str.count("#")
    df["mention_count"] = df["text"].str.count("@")
    df["url_count"] = df["text"].str.count("http")
    df["uppercase_percentage"] = (
        df["text"].str.findall(r"[A-Z]").str.len() / df["text_length"]
    )
    df["punctuation_percentage"] = (
        df["text"].str.findall(r"[.,!?\"\'()]").str.len() / df["text_length"]
    )
    
    return df

def add_keyword_features(df):
    """Limpia y procesa las keywords"""
    df = df.copy()
    df['keyword_clean'] = df['keyword'].str.lower().str.replace('%20', ' ')
    return df

def add_lemmatization_features(df):
    """Agrega el texto lematizado"""
    df = df.copy()
    print("Lematizando textos...")
    df['text_lemmatized'] = df['text'].apply(lemmatize_text)
    print("Lematización completada!")
    return df

def apply_feature_engineering(df):
    """
    Aplica todo el pipeline de feature engineering al dataset.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Dataset original con columnas: id, keyword, location, text (y target si es train)
    
    Returns:
    --------
    pd.DataFrame
        Dataset con todas las features agregadas
    """
    print(f"Dataset original: {df.shape}")
    
    # Aplicar transformaciones
    df = add_text_features(df)
    print("✓ Features de texto agregados")
    
    df = add_keyword_features(df)
    print("✓ Keywords procesados")
    
    df = add_lemmatization_features(df)
    print("✓ Lematización completada")
    
    print(f"\nDataset final: {df.shape}")
    print(f"Nuevas columnas agregadas: {df.shape[1] - 5}")  # 5 columnas originales (id, keyword, location, text, target)
    
    return df

## Cargar Datasets

In [5]:
BASE_PATH = "../.data/raw/"

train_tweets = pd.read_csv(BASE_PATH + "train.csv")
test_tweets = pd.read_csv(BASE_PATH + "test.csv")

print(f"Train dataset: {train_tweets.shape}")
print(f"Test dataset: {test_tweets.shape}")

Train dataset: (7613, 5)
Test dataset: (3263, 4)


## Aplicar Feature Engineering

### Train Dataset

In [6]:
print("=" * 50)
print("PROCESANDO TRAIN DATASET")
print("=" * 50)

train_processed = apply_feature_engineering(train_tweets)
train_processed.sample(n=5)

PROCESANDO TRAIN DATASET
Dataset original: (7613, 5)
✓ Features de texto agregados
✓ Keywords procesados
Lematizando textos...
Lematización completada!
✓ Lematización completada

Dataset final: (7613, 14)
Nuevas columnas agregadas: 9


Unnamed: 0,id,keyword,location,text,target,text_length,word_count,hashtag_count,mention_count,url_count,uppercase_percentage,punctuation_percentage,keyword_clean,text_lemmatized
662,957,blaze,,looks like a year of writing and computers is ...,0,75,11,0,0,1,0.066667,0.026667,blaze,look like year writing computer ahead
4073,5791,hail,,All Hail Shadow (Hybrid Mix Feat. Mike Szuter)...,1,89,11,2,1,1,0.11236,0.044944,hail,hail shadow hybrid mix feat mike szuter youtube
1801,2588,crash,"Charleston, SC",'Fatal crash reported on Johns Island' http://...,1,61,7,0,0,1,0.081967,0.04918,crash,fatal crash report johns island
5042,7188,mudslide,Notts,#BakeOffFriends #GBBO 'The one with the mudsli...,0,74,13,2,0,0,0.108108,0.027027,mudslide,bakeofffriend gbbo mudslide guy hat
4666,6632,inundated,Maryland,Already expecting to be inundated w/ articles ...,0,138,25,0,0,0,0.007246,0.014493,inundated,expect inundate article trad author pay plumme...


### Test Dataset

In [7]:
print("\n" + "=" * 50)
print("PROCESANDO TEST DATASET")
print("=" * 50)

test_processed = apply_feature_engineering(test_tweets)
test_processed.sample(n=5)


PROCESANDO TEST DATASET
Dataset original: (3263, 4)
✓ Features de texto agregados
✓ Keywords procesados
Lematizando textos...
Lematización completada!
✓ Lematización completada

Dataset final: (3263, 13)
Nuevas columnas agregadas: 8


Unnamed: 0,id,keyword,location,text,text_length,word_count,hashtag_count,mention_count,url_count,uppercase_percentage,punctuation_percentage,keyword_clean,text_lemmatized
2668,8904,snowstorm,"Philadelphia, PA",It's creepy seeing 676 closed. It was closed d...,121,20,0,0,0,0.024793,0.024793,snowstorm,creepy 676 close close snowstorm balcony overl...
682,2218,chemical%20emergency,? In your head ?,THE CHEMICAL BROTHERS to play The Armory in SF...,138,20,0,0,1,0.463768,0.036232,chemical emergency,chemical brother play armory sf tomorrow night...
199,645,arsonist,Fresno,Arson suspect linked to 30 fires caught in Nor...,86,11,0,0,1,0.05814,0.011628,arsonist,arson suspect link 30 fire catch northern cali...
366,1176,blight,Sporting capital of the World,I'm all for renewable energy but re: windfarms...,139,23,0,0,1,0.057554,0.014388,blight,renewable energy windfarm agree abbott horribl...
2250,7487,obliteration,,@Jethro_Harrup How many Hangarback Walkers doe...,117,17,0,1,0,0.059829,0.008547,obliteration,hangarback walker opponent need board infinite...


## Exportar Datasets Procesados

In [8]:
# Crear directorio de salida si no existe
OUTPUT_PATH = Path("../.data/processed/")
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

# Exportar a pickle
train_output = OUTPUT_PATH / "train.pkl"
test_output = OUTPUT_PATH / "test.pkl"

print("\nGuardando datasets procesados...")
train_processed.to_pickle(train_output)
print(f"✓ Train guardado en: {train_output}")

test_processed.to_pickle(test_output)
print(f"✓ Test guardado en: {test_output}")

print("="*50)
print("EXPORTACIÓN COMPLETADA")
print("="*50)
print("\nArchivos generados:")
print(f"  - {train_output} ({train_output.stat().st_size / 1024 / 1024:.2f} MB)")
print(f"  - {test_output} ({test_output.stat().st_size / 1024 / 1024:.2f} MB)")


Guardando datasets procesados...
✓ Train guardado en: ..\.data\processed\train.pkl
✓ Test guardado en: ..\.data\processed\test.pkl
EXPORTACIÓN COMPLETADA

Archivos generados:
  - ..\.data\processed\train.pkl (1.87 MB)
  - ..\.data\processed\test.pkl (0.79 MB)
