# XGBoost

XGBoost (eXtreme Gradient Boosting) is a supervised learning model based on decision trees that uses gradient boosting to improve predictive accuracy. Broadly, it consists of a series of decision trees built sequentially, where each new tree is fit to correct the errors of the ensemble built so far.
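
To make the "each tree corrects the previous errors" idea concrete, here is a minimal sketch of boosting with squared error and one-split trees (stumps). It is a simplification for illustration only: XGBoost itself adds regularization and second-order gradient information, and the names `fit_stump` and `boost` are hypothetical helpers, not part of any library.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single-threshold split on x that best fits the residual (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold].mean()
        right = residual[x > threshold].mean()
        error = np.sum((residual - np.where(x <= threshold, left, right)) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left, right)
    _, threshold, left, right = best
    return lambda x_new: np.where(x_new <= threshold, left, right)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Add stumps one at a time, each fit to the residuals of the current ensemble."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    losses = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of the squared error
        stump = fit_stump(x, residual)
        prediction += learning_rate * stump(x)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return prediction, losses

# Toy 1-D regression problem (synthetic data, for illustration only)
rng = np.random.default_rng(42)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

prediction, losses = boost(x, y)
print(f"MSE after round 1: {losses[0]:.4f}, after round {len(losses)}: {losses[-1]:.4f}")
```

The training error shrinks round after round because every stump targets what the ensemble still gets wrong; the learning rate controls how much of each correction is applied.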

## Imports

We will use the following libraries to implement the XGBoost model on our dataset.

In [8]:
# Standard
import pandas as pd
import numpy as np
import pathlib

# ML
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, FeatureUnion
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from emoji import demojize
import spacy

from spacy.lang.en.stop_words import STOP_WORDS
from spacy.cli import download

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

## Constants

In [9]:
COLOR_NO_DISASTER = '#3498db'
COLOR_DISASTER = '#e74c3c'
COLOR_GENERAL = '#95a5a6'

SEED = 42

## Data

We will load the raw data and preprocess it so we can train the XGBoost model.

In [10]:
data_path = pathlib.Path("../.data/raw")

# Load the raw training data
df = pd.read_csv(data_path / "train.csv")

In [11]:
target_mean = df['target'].mean()

print(f'Dataset shape: {df.shape}')
print(f'Share of disaster tweets in the target: {target_mean*100:.2f}%')

columns = ['id', 'keyword', 'location', 'text', 'target']
df = df[columns]
df.sample(5, random_state=SEED)

Dataset shape: (7613, 5)
Share of disaster tweets in the target: 42.97%


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


## Feature Engineering

We will add the following features to the dataset:

### Statistical text features

- Word count
- Character count
- Stopword count
- Hashtag count
- Mention count
- URL count
- Average word length
- Ratio of uppercase characters
- Punctuation mark count
- Count of numbers
- Emoji count

### Sentiment features

- Positive sentiment
- Negative sentiment
- Neutral sentiment

### TF-IDF features

We will use TF-IDF to convert the text into a numeric representation that the XGBoost model can consume.

### Keyword encoding

Since the keyword is a categorical feature, we will first lemmatize the keywords and then apply:

- One-Hot Encoding
- Mean Encoding
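
Mean encoding replaces each keyword with the average target of the rows that share it. A common variation, shown here only as a sketch and not necessarily what this notebook does, computes those averages out-of-fold so that no row is encoded with its own label. The function name `out_of_fold_mean_encode` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def out_of_fold_mean_encode(categories, target, n_splits=5, seed=42):
    """Encode each row with its category's target mean, computed on the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target).reset_index(drop=True)

    # Randomly assign every row to one of n_splits folds
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(categories)) % n_splits

    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        # Statistics come only from rows outside the held-out fold
        means = target[~held_out].groupby(categories[~held_out]).mean()
        fallback = target[~held_out].mean()  # used for categories unseen in the other folds
        encoded.loc[held_out] = categories[held_out].map(means).fillna(fallback).to_numpy()
    return encoded

# Toy data (illustrative only)
toy = pd.DataFrame({
    "keyword": ["fire", "fire", "fire", "fire", "flood", "flood", "flood", "song"],
    "target":  [1, 1, 0, 1, 1, 0, 1, 0],
})
toy["keyword_te"] = out_of_fold_mean_encode(toy["keyword"], toy["target"], n_splits=4)
print(toy)
```

The keyword "song" appears only once, so it is never seen in the other folds and falls back to the overall mean of the remaining rows.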

### Geographic location

As we already saw, the location feature contains many bogus values, so we will use a geographic check that determines whether the location is valid:

- spaCy's GPE (geopolitical entity) label

In [12]:
def extract_features(df):
    df = df.copy()

    # Clean, lemmatized text (stopwords are kept on purpose: they carry context).
    # nlp.pipe processes the texts in batches, which is faster than calling nlp() per row.
    df['clean_text'] = [
        ' '.join(token.lemma_ for token in doc)
        for doc in nlp.pipe(df['text'].str.lower())
    ]

    # Statistical features
    df['word_count'] = df['text'].str.split().str.len()
    df['char_count'] = df['text'].str.len()
    df['stopwords_count'] = df['text'].str.lower().str.split().apply(lambda x: sum(w in STOP_WORDS for w in x))
    df['hashtag_count'] = df['text'].str.count(r'#\w+')
    df['mention_count'] = df['text'].str.count(r'@\w+')
    df['url_count'] = df['text'].str.count(r'http\S+|www\S+')

    # Average word length (0 for empty texts)
    df['avg_word_length'] = df['text'].str.split().apply(lambda x: np.mean([len(w) for w in x]) if len(x) > 0 else 0)

    # Share of characters that are uppercase ASCII letters
    char_lengths = df['text'].str.len()
    uppercase_counts = df['text'].str.count(r'[A-Z]')
    df['uppercase_ratio'] = np.where(char_lengths > 0, uppercase_counts / char_lengths, 0)

    df['punct_count'] = df['text'].str.count(r'[^\w\s]')
    df['number_count'] = df['text'].str.count(r'\d+')
    # Count the characters that demojize() rewrites, i.e. single-code-point emojis.
    # (The previous expression counted the non-emoji characters instead.)
    df['emoji_count'] = df['text'].apply(lambda x: sum(demojize(c) != c for c in x))

    # Location: a value counts as valid when spaCy tags at least one entity in it as GPE
    df['has_location'] = df['location'].notna().astype(int)
    df['valid_location'] = df['location'].apply(
        lambda x: int(pd.notna(x) and any(ent.label_ == 'GPE' for ent in nlp(str(x)).ents))
    )

    # Lemmatized keywords (missing keywords become 'none')
    df['keyword_lemma'] = [
        ' '.join(token.lemma_ for token in doc)
        for doc in nlp.pipe(df['keyword'].fillna('none').astype(str))
    ]

    return df

df = extract_features(df)
df.sample(5, random_state=SEED)


In [None]:
# Text pipeline: word TF-IDF + char TF-IDF, followed by SVD to obtain dense features
word_tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    min_df=2  # Ignore terms that appear in fewer than 2 tweets
)

char_tfidf = TfidfVectorizer(
    analyzer='char',
    ngram_range=(3, 5),  # Sequences of 3 to 5 characters
    max_features=3000,
    min_df=2
)

text_pipeline = Pipeline([
    ('union', FeatureUnion([
        ('word', word_tfidf),
        ('char', char_tfidf),
    ])),
    ('svd', TruncatedSVD(n_components=400, random_state=SEED))
])

print("Generating dense features with SVD...")
X_text_dense = text_pipeline.fit_transform(df['clean_text'])

print(f'Dense text shape (SVD): {X_text_dense.shape}')

# Numeric features
feature_cols = ['word_count', 'char_count', 'stopwords_count', 'hashtag_count', 'mention_count',
                'url_count', 'avg_word_length', 'uppercase_ratio', 'punct_count', 'number_count',
                'emoji_count', 'has_location', 'valid_location']

print(f'Numeric features shape: {df[feature_cols].shape}')

Generating dense features with SVD...
Dense text shape (SVD): (7613, 400)
Numeric features shape: (7613, 13)
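
The cell above fits the TF-IDF vocabulary, the IDF weights and the SVD projection on every row, including rows that later end up in the test split. One way to keep the test set fully unseen is to fit the text pipeline on the training texts only and then transform the test texts. The sketch below shows that pattern on a tiny invented corpus; the texts, labels and the reduced `n_components=3` are assumptions chosen so the example runs quickly.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline

# Tiny invented corpus (illustrative only)
texts = [
    "forest fire near the town", "flood warning for the river",
    "earthquake hits the city", "storm damage across the coast",
    "wildfire spreads near the highway", "i love this new song",
    "great coffee this morning", "watching a movie tonight",
    "my cat is so funny", "best pizza in the city",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split first, so nothing is learned from the test texts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

text_pipeline = Pipeline([
    ("union", FeatureUnion([
        ("word", TfidfVectorizer(ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ])),
    ("svd", TruncatedSVD(n_components=3, random_state=42)),
])

X_train_text = text_pipeline.fit_transform(train_texts)  # learn vocabulary, IDF and SVD on train
X_test_text = text_pipeline.transform(test_texts)        # reuse them on test, no refitting

print(X_train_text.shape, X_test_text.shape)
```

Terms that appear only in the test texts are simply ignored at transform time, which mirrors what would happen with genuinely new tweets.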


## Entrenamiento del modelo

In [None]:
# Split BEFORE computing the keyword mean encoding to avoid target leakage.
# Note: X_text_dense was fit on all rows in the previous cell, before this split.
X_numeric = df[feature_cols].values

y = df['target'].values

# Train/test split on row indices
X_train_idx, X_test_idx = train_test_split(np.arange(len(df)), test_size=0.2, random_state=SEED, stratify=y)

# Mean encoding without test leakage (mean encoding only, no one-hot):
# the per-keyword means are computed on the training rows only
train_data = df.iloc[X_train_idx]
test_data = df.iloc[X_test_idx]

keyword_means = train_data.groupby('keyword_lemma')['target'].mean()
global_mean = train_data['target'].mean()

# Keywords unseen in training fall back to the global training mean
train_keyword_encoded = train_data['keyword_lemma'].map(keyword_means).fillna(global_mean).values.reshape(-1, 1)
test_keyword_encoded = test_data['keyword_lemma'].map(keyword_means).fillna(global_mean).values.reshape(-1, 1)

# Build the final (DENSE) matrices
X_train = np.hstack([
    X_numeric[X_train_idx],
    train_keyword_encoded,
    X_text_dense[X_train_idx]
])

X_test = np.hstack([
    X_numeric[X_test_idx],
    test_keyword_encoded,
    X_text_dense[X_test_idx]
])

y_train = y[X_train_idx]
y_test = y[X_test_idx]

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'Matrix type: {type(X_train)} (dense)')

# XGBoost hyperparameter search space, chosen for dense inputs
param_distributions = {
    'max_depth': [6, 8, 10, 12],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [300, 500, 800],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.5, 0.7],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.2]
}

# Randomized search: 50 sampled configurations, each scored by F1 with stratified 5-fold CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
xgb = XGBClassifier(random_state=SEED, eval_metric='logloss')
random_search = RandomizedSearchCV(
    xgb,
    param_distributions,
    n_iter=50,
    cv=skf,
    scoring=make_scorer(f1_score),
    n_jobs=-1,
    verbose=1,
    random_state=SEED
)
random_search.fit(X_train, y_train)

print(f'Best parameters: {random_search.best_params_}')
print(f'Best F1-Score (CV): {random_search.best_score_:.4f}')

X_train shape: (6090, 414)
X_test shape: (1523, 414)
Matrix type: <class 'numpy.ndarray'> (dense)
Fitting 5 folds for each of 50 candidates, totalling 250 fits



## Evaluation

In [None]:
# Threshold tuning to maximize the F1-Score
best_model = random_search.best_estimator_

# Predicted probabilities for the positive class
y_probs_train = best_model.predict_proba(X_train)[:, 1]
y_probs_test = best_model.predict_proba(X_test)[:, 1]

# Search for the best threshold.
# Caveat: the threshold is selected on the test set itself,
# so the tuned F1 reported below is an optimistic estimate.
best_f1 = 0
best_thresh = 0.5

for thresh in np.arange(0.3, 0.7, 0.01):
    y_pred_thresh = (y_probs_test > thresh).astype(int)
    score = f1_score(y_test, y_pred_thresh)
    if score > best_f1:
        best_f1 = score
        best_thresh = thresh

# F1 with the default threshold (0.5)
y_pred_default = best_model.predict(X_test)
f1_default = f1_score(y_test, y_pred_default)

# Predictions with the tuned threshold
y_pred_optimized = (y_probs_test > best_thresh).astype(int)

print(f'F1-Score (Test, threshold=0.5): {f1_default:.4f}')
print(f'F1-Score (Test, threshold={best_thresh:.2f}): {best_f1:.4f}')
print(f'Improvement: {best_f1 - f1_default:+.4f}')
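
Choosing the threshold on the same test set that is used to report the score tends to overstate the gain. A possible alternative, sketched below, picks the threshold from out-of-fold probabilities on the training data and only then evaluates on the test set. The example uses synthetic data and a `LogisticRegression` as a lightweight stand-in for the tuned XGBoost model; both are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

# Synthetic binary problem with a class balance similar to the tweets (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.57, 0.43], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)  # stand-in for the tuned XGBClassifier
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold probabilities: each training row is scored by a model that never saw it
oof_probs = cross_val_predict(model, X_train, y_train, cv=skf, method="predict_proba")[:, 1]

# Pick the threshold that maximizes F1 on the out-of-fold predictions
thresholds = np.arange(0.30, 0.70, 0.01)
scores = [f1_score(y_train, (oof_probs > t).astype(int)) for t in thresholds]
best_thresh = float(thresholds[int(np.argmax(scores))])

# Refit on all training data and evaluate once on the untouched test set
model.fit(X_train, y_train)
test_probs = model.predict_proba(X_test)[:, 1]
f1_default = f1_score(y_test, (test_probs > 0.5).astype(int))
f1_tuned = f1_score(y_test, (test_probs > best_thresh).astype(int))

print(f"Threshold chosen on out-of-fold predictions: {best_thresh:.2f}")
print(f"F1 (test, threshold=0.5): {f1_default:.4f}")
print(f"F1 (test, tuned threshold): {f1_tuned:.4f}")
```

With this setup the tuned threshold may or may not beat 0.5 on the test set, but whichever number comes out is an honest estimate because the test labels played no part in choosing it.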