# Feature Engineering Pipeline

Este notebook aplica todas las transformaciones de feature engineering desarrolladas durante la exploración y exporta los datasets procesados para su uso en modelos de machine learning.

Postdata: Todo esto fue generado con inteligencia artificial a partir del notebook de exploración de datos

## Imports

In [8]:
import pandas as pd
import emoji
import spacy
import re
from pathlib import Path

## Cargar Modelo de Spacy

In [9]:
# Cargar modelo de spacy para lematización
model = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop_words = spacy.lang.en.stop_words.STOP_WORDS

print(f"Modelo cargado: {model.meta['name']} v{model.meta['version']}")

Modelo cargado: core_web_sm v3.8.0


## Funciones de Feature Engineering

In [10]:
def clean_tweet(text):
    """Limpia el texto del tweet removiendo menciones, URLs, emojis y caracteres especiales"""
    text = text.lower()
    text = re.sub(r'@\w+', '', text)                # menciones
    text = re.sub(r'http\S+|www.\S+', '', text)     # URLs
    text = emoji.replace_emoji(text, replace=' ')   # emojis
    text = re.sub(r'[^\w\s]', ' ', text)            # símbolos especiales
    text = re.sub(r'\s+', ' ', text).strip()        # espacios extras
    return text

def lemma_filter(token):
    """Filtra stopwords y tokens de longitud 1"""
    return token.lemma_ not in stop_words and len(token.lemma_) > 1

def lemmatize_text(text):
    """Lematiza el texto del tweet"""
    cleaned_text = clean_tweet(text)
    doc = model(cleaned_text)
    return ' '.join([token.lemma_ for token in doc if lemma_filter(token)])

def add_text_features(df):
    """Agrega features basados en el análisis del texto"""
    df = df.copy()
    
    # Features básicos de texto
    df["text_length"] = df["text"].str.len()
    df["word_count"] = df["text"].str.split().str.len()
    df["hashtag_count"] = df["text"].str.count("#")
    df["mention_count"] = df["text"].str.count("@")
    df["url_count"] = df["text"].str.count("http")
    df["uppercase_percentage"] = (
        df["text"].str.findall(r"[A-Z]").str.len() / df["text_length"]
    )
    df["punctuation_percentage"] = (
        df["text"].str.findall(r"[.,!?\"\'()]").str.len() / df["text_length"]
    )
    
    return df

def add_keyword_features(df):
    """Limpia y procesa las keywords"""
    df = df.copy()
    df['keyword_clean'] = df['keyword'].str.lower().str.replace('%20', ' ')
    return df

def add_lemmatization_features(df):
    """Agrega el texto lematizado"""
    df = df.copy()
    print("Lematizando textos...")
    df['text_lemmatized'] = df['text'].apply(lemmatize_text)
    print("Lematización completada!")
    return df

def apply_feature_engineering(df):
    """
    Aplica todo el pipeline de feature engineering al dataset.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Dataset original con columnas: id, keyword, location, text (y target si es train)
    
    Returns:
    --------
    pd.DataFrame
        Dataset con todas las features agregadas
    """
    print(f"Dataset original: {df.shape}")
    
    # Aplicar transformaciones
    df = add_text_features(df)
    print("✓ Features de texto agregados")
    
    df = add_keyword_features(df)
    print("✓ Keywords procesados")
    
    df = add_lemmatization_features(df)
    print("✓ Lematización completada")
    
    print(f"\nDataset final: {df.shape}")
    print(f"Nuevas columnas agregadas: {df.shape[1] - 5}")  # 5 columnas originales (id, keyword, location, text, target)
    
    return df

## Cargar Datasets

In [11]:
BASE_PATH = "../.data/raw/"

train_tweets = pd.read_csv(BASE_PATH + "train.csv")
test_tweets = pd.read_csv(BASE_PATH + "test.csv")

print(f"Train dataset: {train_tweets.shape}")
print(f"Test dataset: {test_tweets.shape}")

Train dataset: (7613, 5)
Test dataset: (3263, 4)


## Aplicar Feature Engineering

### Train Dataset

In [12]:
print("=" * 50)
print("PROCESANDO TRAIN DATASET")
print("=" * 50)

train_processed = apply_feature_engineering(train_tweets)
train_processed.sample(n=5)

PROCESANDO TRAIN DATASET
Dataset original: (7613, 5)
✓ Features de texto agregados
✓ Keywords procesados
Lematizando textos...
Lematización completada!
✓ Lematización completada

Dataset final: (7613, 14)
Nuevas columnas agregadas: 9
Lematización completada!
✓ Lematización completada

Dataset final: (7613, 14)
Nuevas columnas agregadas: 9


Unnamed: 0,id,keyword,location,text,target,text_length,word_count,hashtag_count,mention_count,url_count,uppercase_percentage,punctuation_percentage,keyword_clean,text_lemmatized
2915,4184,drown,"Jonesboro, Arkansas USA",We are getting some reports of flooding near J...,1,137,23,0,0,0,0.043796,0.029197,drown,report flooding near jonesboro high school use...
503,728,attacked,#GDJB #ASOT,@eunice_njoki aiii she needs to chill and answ...,0,89,15,0,1,0,0.0,0.011236,attacked,aiii need chill answer calmly like attack
3361,4812,evacuation,,FAAN orders evacuation of abandoned aircraft a...,1,136,20,0,0,1,0.125,0.029412,evacuation,faan order evacuation abandon aircraft mma faa...
4410,6269,hijacking,,Vehicle Hijacking in Vosloorus Gauteng on 201...,1,117,14,0,0,1,0.128205,0.008547,hijacking,vehicle hijacking vosloorus gauteng 2015 08 05...
4495,6392,hurricane,,@Hurricane_Dolce happy birthday big Bruh,0,40,5,0,1,0,0.075,0.0,hurricane,happy birthday big bruh


### Test Dataset

In [13]:
print("\n" + "=" * 50)
print("PROCESANDO TEST DATASET")
print("=" * 50)

test_processed = apply_feature_engineering(test_tweets)
test_processed.sample(n=5)


PROCESANDO TEST DATASET
Dataset original: (3263, 4)
✓ Features de texto agregados
✓ Keywords procesados
Lematizando textos...
Lematización completada!
✓ Lematización completada

Dataset final: (3263, 13)
Nuevas columnas agregadas: 8
Lematización completada!
✓ Lematización completada

Dataset final: (3263, 13)
Nuevas columnas agregadas: 8


Unnamed: 0,id,keyword,location,text,text_length,word_count,hashtag_count,mention_count,url_count,uppercase_percentage,punctuation_percentage,keyword_clean,text_lemmatized
2223,7422,obliterated,USA,the 301+ feature on YouTube has been obliterat...,63,12,0,0,0,0.047619,0.015873,obliterated,301 feature youtube obliterate love
2121,7100,military,,New Boonie Hat USMC Airsoft Paintball Military...,116,13,0,0,2,0.206897,0.017241,military,new boonie hat usmc airsoft paintball military...
300,975,blaze,"Newburgh, NY",My big buzzy John BlaZe. Jus Kame home from a ...,97,15,1,0,1,0.113402,0.030928,blaze,big buzzy john blaze jus kame home 12 year bid...
1466,4863,explode,sam,happy Justin makes my heart explode,35,6,0,0,0,0.028571,0.0,explode,happy justin heart explode
2672,8915,snowstorm,Los Angeles,@BigBang_CBS ...wow...ok...um...that was like ...,87,11,0,1,0,0.057471,0.149425,snowstorm,wow ok um like ice water blizzard snowstorm face


## Exportar Datasets Procesados

In [14]:
# Crear directorio de salida si no existe
OUTPUT_PATH = Path("../.data/processed/")
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

# Exportar a pickle
train_output = OUTPUT_PATH / "train.pkl"
test_output = OUTPUT_PATH / "test.pkl"

print("\nGuardando datasets procesados...")
train_processed.to_pickle(train_output)
print(f"✓ Train guardado en: {train_output}")

test_processed.to_pickle(test_output)
print(f"✓ Test guardado en: {test_output}")

print("="*50)
print("EXPORTACIÓN COMPLETADA")
print("="*50)
print("\nArchivos generados:")
print(f"  - {train_output} ({train_output.stat().st_size / 1024 / 1024:.2f} MB)")
print(f"  - {test_output} ({test_output.stat().st_size / 1024 / 1024:.2f} MB)")


Guardando datasets procesados...
✓ Train guardado en: ../.data/processed/train.pkl
✓ Test guardado en: ../.data/processed/test.pkl
EXPORTACIÓN COMPLETADA

Archivos generados:
  - ../.data/processed/train.pkl (1.87 MB)
  - ../.data/processed/test.pkl (0.79 MB)
