# Advanced Feature Engineering

This notebook extends the existing features with:
1. **Sentiment analysis** using TextBlob
2. **Geocoding** of locations to coordinates (where possible)
3. **Additional linguistic features** (emojis, capitalization, etc.)
4. **Intensity features** (urgency and disaster words)

These features will be used in the advanced models (model1 and model2).

## Imports

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path
import re

# Sentiment analysis (requires the textblob package: pip install textblob)
from textblob import TextBlob

# Progress bar
from tqdm import tqdm
tqdm.pandas()

# Constants
COLOR_NO_DISASTER = '#3498db'
COLOR_DISASTER = '#e74c3c'
COLOR_GENERAL = '#95a5a6'

## Load Data

In [None]:
DATA_PATH = Path("../.data/processed/")
RAW_PATH = Path("../.data/raw/")

# Load processed data (basic features already included)
train_df = pd.read_pickle(DATA_PATH / "train.pkl")
test_df = pd.read_pickle(DATA_PATH / "test.pkl")

# Load the raw files to recover the location column
train_raw = pd.read_csv(RAW_PATH / "train.csv")
test_raw = pd.read_csv(RAW_PATH / "test.csv")

# Add location to the processed DataFrames.
# Note: this assignment aligns on the index, so it assumes the processed
# frames kept the row index of the raw files.
train_df['location'] = train_raw['location']
test_df['location'] = test_raw['location']

print(f"Train: {train_df.shape}")
print(f"Test: {test_df.shape}")
print(f"\nCurrent columns: {list(train_df.columns)}")

## 1. Sentiment Analysis with TextBlob

TextBlob provides:
- **Polarity**: from -1 (negative) to +1 (positive)
- **Subjectivity**: from 0 (objective) to 1 (subjective)

Working hypothesis: disaster tweets tend toward negative polarity and higher subjectivity. The exploratory analysis at the end of the notebook checks this against the target.
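
As a rough intuition for what the two scores mean, here is a toy lexicon-averaging scorer. The lexicon, its values, and the name `toy_sentiment` are invented for illustration; this is not TextBlob's algorithm, which is considerably more elaborate.

```python
# Toy lexicon: word -> (polarity, subjectivity). All values are made up.
TOY_LEXICON = {
    "terrible": (-1.0, 1.0),
    "great": (0.8, 0.75),
    "sad": (-0.5, 1.0),
}

def toy_sentiment(text):
    """Average the scores of known words; unknown words are ignored."""
    words = text.lower().split()
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not hits:
        # Nothing matched: report neutral polarity and zero subjectivity
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity

print(toy_sentiment("what a great day"))
print(toy_sentiment("earthquake reported near the coast"))
```

In this toy version, a factual sentence with no opinion words scores `(0.0, 0.0)`, so a neutral score can mean "nothing matched" rather than "balanced sentiment". That is worth keeping in mind when reading the polarity histograms later on.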

In [None]:
def get_sentiment(text):
    """
    Compute polarity and subjectivity using TextBlob.
    Falls back to (0.0, 0.0) if TextBlob raises on the input.
    """
    try:
        blob = TextBlob(str(text))
        return blob.sentiment.polarity, blob.sentiment.subjectivity
    except Exception:
        return 0.0, 0.0

# Compute sentiment
print("Computing sentiment for train...")
train_sentiment = train_df['text'].progress_apply(get_sentiment)
train_df['sentiment_polarity'] = [s[0] for s in train_sentiment]
train_df['sentiment_subjectivity'] = [s[1] for s in train_sentiment]

print("Computing sentiment for test...")
test_sentiment = test_df['text'].progress_apply(get_sentiment)
test_df['sentiment_polarity'] = [s[0] for s in test_sentiment]
test_df['sentiment_subjectivity'] = [s[1] for s in test_sentiment]

print("\n✅ Sentiment added")
print(f"Train sentiment stats:\n{train_df[['sentiment_polarity', 'sentiment_subjectivity']].describe()}")

## 2. Advanced Linguistic Features

Add further features that may help classification:
- Number of emojis
- Number of words written in all caps (shouting)
- Ratio of unique words (lexical diversity)
- Number of digit sequences
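
To make the definitions concrete, the sketch below computes three of these features by hand on a single made-up tweet. It assumes plain whitespace tokenization, so punctuation stays attached to words; the sample text is hypothetical.

```python
import re

sample = "BREAKING: 3 homes destroyed, 12 injured in the fire near the river"

words = sample.split()

# Words written entirely in uppercase, ignoring one-character tokens
uppercase_words = [w for w in words if w.isupper() and len(w) > 1]

# Lexical diversity: unique lowercase tokens divided by total tokens
tokens = sample.lower().split()
diversity = len(set(tokens)) / len(tokens)

# Digit sequences
numbers = re.findall(r"\d+", sample)

print(uppercase_words)
print(f"{len(set(tokens))} unique / {len(tokens)} total = {diversity:.2f}")
print(numbers)
```

Because tokens are split on whitespace only, `"BREAKING:"` keeps its colon and `"fire"` and `"fire,"` would count as different words. For short tweets this mostly adds a little noise to the diversity ratio.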

In [None]:
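# Compiled once; each match is a single emoji code point.
# Explicit emoji blocks only: a catch-all range such as U+24C2-U+1F251
# would also match CJK and other non-emoji text.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U00002702-\U000027B0"  # dingbats
    "]"
)

def count_emojis(text):
    """Count emoji code points in the text using Unicode ranges."""
    return len(EMOJI_PATTERN.findall(str(text)))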

def count_uppercase_words(text):
    """Count words written entirely in uppercase (ignoring one-character words)."""
    words = str(text).split()
    return sum(1 for word in words if word.isupper() and len(word) > 1)

def lexical_diversity(text):
    """Ratio of unique words to total words."""
    words = str(text).lower().split()
    if len(words) == 0:
        return 0.0
    return len(set(words)) / len(words)

def count_numbers(text):
    """Count digit sequences in the text."""
    return len(re.findall(r'\d+', str(text)))

# Apply the features
print("Computing linguistic features...")

for df, name in [(train_df, 'train'), (test_df, 'test')]:
    print(f"\nProcessing {name}...")
    df['emoji_count'] = df['text'].progress_apply(count_emojis)
    df['uppercase_word_count'] = df['text'].progress_apply(count_uppercase_words)
    df['lexical_diversity'] = df['text'].progress_apply(lexical_diversity)
    df['number_count'] = df['text'].progress_apply(count_numbers)

print("\n✅ Linguistic features added")

## 3. Geocoding Locations

Try to convert locations into coordinates. Because full geocoding can be expensive and many locations are invalid, we take a simplified approach:
- Identify major countries and cities
- Assign approximate coordinates to known locations
- Use NaN for unknown locations

This may help the model pick up geographic patterns.
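
One pitfall with a small lookup table is how free-text locations are matched against it. The sketch below compares plain substring matching with whole-word matching on a hypothetical two-entry table; `COORDS`, `match_substring`, and `match_whole_word` are illustrative names, not part of the notebook.

```python
import re

# Hypothetical two-entry table with approximate (lat, lon) values
COORDS = {
    "india": (20.5937, 78.9629),
    "us": (37.0902, -95.7129),
}

def match_substring(location):
    """Return the first key that occurs anywhere inside the string."""
    loc = location.lower()
    return next((k for k in COORDS if k in loc), None)

def match_whole_word(location):
    """Return the first key that occurs as a whole word or phrase."""
    loc = location.lower()
    return next(
        (k for k in COORDS if re.search(rf"\b{re.escape(k)}\b", loc)),
        None,
    )

for place in ["Indiana", "Austin, TX", "Mumbai, India"]:
    print(place, "->", match_substring(place), "|", match_whole_word(place))
```

Substring matching maps "Indiana" to India and "Austin, TX" to the US centroid by accident, while the word-boundary version leaves both unmatched and still resolves "Mumbai, India". Short keys such as `us` and `uk` are the ones most exposed to this.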

In [None]:
# Dictionary of known locations -> approximate coordinates (lat, lon)
LOCATION_COORDS = {
    # USA
    'usa': (37.0902, -95.7129),
    'united states': (37.0902, -95.7129),
    'us': (37.0902, -95.7129),
    'new york': (40.7128, -74.0060),
    'nyc': (40.7128, -74.0060),
    'los angeles': (34.0522, -118.2437),
    'california': (36.7783, -119.4179),
    'chicago': (41.8781, -87.6298),
    'miami': (25.7617, -80.1918),
    'washington': (38.9072, -77.0369),
    'boston': (42.3601, -71.0589),
    'san francisco': (37.7749, -122.4194),
    'seattle': (47.6062, -122.3321),
    
    # UK
    'uk': (55.3781, -3.4360),
    'united kingdom': (55.3781, -3.4360),
    'london': (51.5074, -0.1278),
    'england': (52.3555, -1.1743),
    
    # Canada
    'canada': (56.1304, -106.3468),
    'toronto': (43.6532, -79.3832),
    
    # Australia
    'australia': (-25.2744, 133.7751),
    'sydney': (-33.8688, 151.2093),
    
    # India
    'india': (20.5937, 78.9629),
    'mumbai': (19.0760, 72.8777),
    
    # Nigeria
    'nigeria': (9.0820, 8.6753),
    'lagos': (6.5244, 3.3792),
    
    # Kenya
    'kenya': (-0.0236, 37.9062),
    'nairobi': (-1.2864, 36.8172),
}

def geocode_location(location):
    """
    Convert a location string to approximate (lat, lon) coordinates.
    Returns (np.nan, np.nan) when the location cannot be geocoded.
    """
    if pd.isna(location):
        return np.nan, np.nan

    location_lower = str(location).lower().strip()

    # Exact match
    if location_lower in LOCATION_COORDS:
        return LOCATION_COORDS[location_lower]

    # Partial match on whole words, longest key first, so that 'us' does not
    # match 'Austin' and 'india' does not match 'Indiana'
    for key in sorted(LOCATION_COORDS, key=len, reverse=True):
        if re.search(rf'\b{re.escape(key)}\b', location_lower):
            return LOCATION_COORDS[key]

    return np.nan, np.nan

# Apply geocoding
print("Geocoding locations...")

for df, name in [(train_df, 'train'), (test_df, 'test')]:
    coords = df['location'].progress_apply(geocode_location)
    df['location_lat'] = [c[0] for c in coords]
    df['location_lon'] = [c[1] for c in coords]

    # Binary feature: the location was matched to known coordinates
    df['has_valid_location'] = df['location_lat'].notna().astype(int)

    geocoded_pct = df['has_valid_location'].mean() * 100
    print(f"{name}: {geocoded_pct:.1f}% of locations geocoded")

print("\n✅ Geocoding complete")

## 4. Emergency Intensity Features

Words that suggest urgency or severity:

In [None]:
# Urgency/emergency words
URGENCY_WORDS = ['urgent', 'emergency', 'help', 'sos', 'alert', 'warning', 'danger', 
                 'critical', 'severe', 'breaking', 'now', 'immediately', 'asap']

DISASTER_INTENSITY_WORDS = ['devastating', 'catastrophic', 'massive', 'huge', 'major',
                            'deadly', 'fatal', 'tragedy', 'victims', 'casualties',
                            'destroyed', 'collapsed', 'killed', 'injured']

def count_word_list(text, word_list):
    """Count how many words from the list appear in the text as whole words."""
    # Tokenize first so that 'now' does not match 'know' or 'snow'
    tokens = set(re.findall(r"[a-z]+", str(text).lower()))
    return sum(1 for word in word_list if word in tokens)

# Apply
for df in [train_df, test_df]:
    df['urgency_word_count'] = df['text'].apply(lambda x: count_word_list(x, URGENCY_WORDS))
    df['intensity_word_count'] = df['text'].apply(lambda x: count_word_list(x, DISASTER_INTENSITY_WORDS))

print("✅ Intensity features added")

## 5. Summary of New Features

In [None]:
# List the new features
new_features = [
    'sentiment_polarity',
    'sentiment_subjectivity',
    'emoji_count',
    'uppercase_word_count',
    'lexical_diversity',
    'number_count',
    'location_lat',
    'location_lon',
    'has_valid_location',
    'urgency_word_count',
    'intensity_word_count'
]

print("=" * 60)
print("NEW FEATURES ADDED".center(60))
print("=" * 60)
for i, feat in enumerate(new_features, 1):
    print(f"{i:2d}. {feat}")

# shape[1] counts every column, including non-feature ones such as text and target
print(f"\nTotal new features: {len(new_features)}")
print(f"Previous columns: {train_df.shape[1] - len(new_features)}")
print(f"Total columns now: {train_df.shape[1]}")

print("\nStatistics of new features (train):")
print(train_df[new_features].describe())

## 6. Export Enriched Data

In [None]:
# Save next to the input files, under new file names
OUTPUT_PATH = Path("../.data/processed/")
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

train_output = OUTPUT_PATH / "train_advanced.pkl"
test_output = OUTPUT_PATH / "test_advanced.pkl"

train_df.to_pickle(train_output)
test_df.to_pickle(test_output)

print("✅ Data saved:")
print(f"  - {train_output.absolute()}")
print(f"  - {test_output.absolute()}")
print("\nShapes:")
print(f"  Train: {train_df.shape}")
print(f"  Test: {test_df.shape}")

# Verify that the files were written and report their sizes
print()
for output in (train_output, test_output):
    if output.exists():
        size_mb = output.stat().st_size / (1024 * 1024)
        print(f"✓ {output.name}: {size_mb:.2f} MB")

## Exploratory Analysis of New Features

Let's see how they relate to the target:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Compare sentiment by target
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Polarity
axes[0, 0].hist([train_df[train_df['target']==0]['sentiment_polarity'],
                 train_df[train_df['target']==1]['sentiment_polarity']],
                label=['No Disaster', 'Disaster'], bins=30, alpha=0.7,
                color=[COLOR_NO_DISASTER, COLOR_DISASTER])
axes[0, 0].set_xlabel('Sentiment Polarity')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Polarity Distribution by Target', fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# Subjectivity
axes[0, 1].hist([train_df[train_df['target']==0]['sentiment_subjectivity'],
                 train_df[train_df['target']==1]['sentiment_subjectivity']],
                label=['No Disaster', 'Disaster'], bins=30, alpha=0.7,
                color=[COLOR_NO_DISASTER, COLOR_DISASTER])
axes[0, 1].set_xlabel('Sentiment Subjectivity')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Subjectivity Distribution by Target', fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# Urgency words
urgency_comparison = train_df.groupby('target')['urgency_word_count'].mean()
axes[1, 0].bar([0, 1], urgency_comparison.values, 
               color=[COLOR_NO_DISASTER, COLOR_DISASTER], alpha=0.8)
axes[1, 0].set_xlabel('Target')
axes[1, 0].set_ylabel('Mean Urgency Words')
axes[1, 0].set_title('Mean Urgency Word Count by Target', fontweight='bold')
axes[1, 0].set_xticks([0, 1])
axes[1, 0].set_xticklabels(['No Disaster', 'Disaster'])
axes[1, 0].grid(axis='y', alpha=0.3)

# Intensity words
intensity_comparison = train_df.groupby('target')['intensity_word_count'].mean()
axes[1, 1].bar([0, 1], intensity_comparison.values,
               color=[COLOR_NO_DISASTER, COLOR_DISASTER], alpha=0.8)
axes[1, 1].set_xlabel('Target')
axes[1, 1].set_ylabel('Mean Intensity Words')
axes[1, 1].set_title('Mean Intensity Word Count by Target', fontweight='bold')
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_xticklabels(['No Disaster', 'Disaster'])
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Stats by target
print("\nStatistics by Target:")
print("\nNo Disaster (0):")
print(train_df[train_df['target']==0][new_features].describe().loc[['mean', 'std']])
print("\nDisaster (1):")
print(train_df[train_df['target']==1][new_features].describe().loc[['mean', 'std']])
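
The per-class means and standard deviations above can be condensed into a single number per feature. One option, offered here only as a sketch, is a standardized mean difference (Cohen's d with a pooled standard deviation). The helper name and the toy values below are made up and are not results from this dataset.

```python
from statistics import mean, variance

def standardized_mean_difference(group_a, group_b):
    """Cohen's d: (mean_b - mean_a) divided by the pooled sample std."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        # Both groups are constant: no spread to standardize against
        return 0.0
    return (mean(group_b) - mean(group_a)) / pooled_var ** 0.5

# Made-up intensity_word_count values for a handful of tweets
no_disaster = [0, 0, 1, 0, 0, 1]
disaster = [1, 2, 0, 1, 2, 3]

d = standardized_mean_difference(no_disaster, disaster)
print(f"Standardized mean difference: {d:.2f}")
```

A value near zero suggests the feature barely separates the classes, while larger absolute values point to features worth keeping. On the real data the two groups would come from `train_df[train_df['target']==0][feature]` and `train_df[train_df['target']==1][feature]`.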