
This notebook implements an end-to-end classification workflow with XGBoost. It loads the training and test data from CSV files, preprocesses them (cleaning, imputing missing values, and encoding variables), and trains an XGBoost model (XGBClassifier) on the result. The trained model then predicts the test set, and the predictions are saved to a CSV file ready to submit.

Preprocessing: missing values are imputed (mode for categorical columns, median for numeric ones), yes/no columns are mapped to 1/0, and categorical variables are encoded according to their cardinality (one-hot, frequency, or count encoding).

Model: XGBClassifier from the XGBoost library, with hyperparameters set explicitly in the notebook and evaluated with 5-fold cross-validation.

In [None]:
pip install xgboost

In [3]:
import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier

In [4]:
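# Custom transformers for encoding.
# FrequencyEncoder adds "<col>_FE" (share of training rows per category) and
# CountEncoder adds "<col>_CE" (raw training count per category); categories
# not seen during fit map to 0. DropColumns removes the listed columns.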
class FrequencyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        self.freq_maps_ = {col: X[col].value_counts(normalize=True).to_dict()
                           for col in self.columns}
        return self
    def transform(self, X):
        X = X.copy()
        for col, fmap in self.freq_maps_.items():
            X[f"{col}_FE"] = X[col].map(fmap).fillna(0)
        return X

class CountEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        self.count_maps_ = {col: X[col].value_counts().to_dict()
                            for col in self.columns}
        return self
    def transform(self, X):
        X = X.copy()
        for col, cmap in self.count_maps_.items():
            X[f"{col}_CE"] = X[col].map(cmap).fillna(0)
        return X

class DropColumns(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.drop(columns=self.columns)
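
To make the two encodings concrete, here is a minimal pandas sketch on a made-up `city` column (the column name and its values are hypothetical, not taken from the dataset). It mirrors what `FrequencyEncoder` and `CountEncoder` above do: learn the mapping from training rows only, then set categories never seen during fitting to 0.

```python
import pandas as pd

# Toy data; "city" and its values are illustrative only.
train_col = pd.Series(["A", "A", "B", "C"], name="city")
test_col = pd.Series(["A", "C", "Z"], name="city")  # "Z" never appears in training

# Learn both mappings from the training column only.
freq_map = train_col.value_counts(normalize=True).to_dict()  # share of rows per category
count_map = train_col.value_counts().to_dict()               # raw row count per category

# Apply to new data; unseen categories become NaN after map() and are set to 0.
test_freq = test_col.map(freq_map).fillna(0)
test_count = test_col.map(count_map).fillna(0)

print(test_freq.tolist())
print(test_count.tolist())
```

Because the maps come from the training column alone, the same category always receives the same encoded value at prediction time, regardless of how often it appears in the test set.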


In [5]:
# ============================================
# 1. LOADING AND BASIC CLEANING
# ============================================
# Load the original data
train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")

In [10]:
# Keep a random 50% of the training rows (fixed seed for reproducibility)
train = train.sample(frac=0.5, random_state=42).reset_index(drop=True)
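
`DataFrame.sample` draws rows uniformly, so the class proportions in the 50% subset can drift slightly from those in the full data. If that matters, one option (an alternative, not what this notebook does) is to sample within each class. A minimal sketch on toy data, where the `label` column is a stand-in for `RENDIMIENTO_GLOBAL`:

```python
import pandas as pd

# Toy frame; "x" and "label" are illustrative names only.
toy = pd.DataFrame({
    "x": range(12),
    "label": ["bajo"] * 8 + ["alto"] * 4,
})

# Sample 50% of the rows inside each class so proportions are preserved.
stratified = (
    toy.groupby("label", group_keys=False)
       .sample(frac=0.5, random_state=42)
       .reset_index(drop=True)
)

print(stratified["label"].value_counts().to_dict())
```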

In [11]:
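mapping = {'bajo':1, 'medio-bajo':2, 'medio-alto':3, 'alto':4}
def clean_data(df, is_train=True):
    """Impute nulls, map yes/no columns to 1/0 and, for train, add a numeric target."""
    df = df.copy()
    # Categorical columns: fill nulls with the most frequent value
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].fillna(df[col].mode()[0])
    # Numeric columns: fill nulls with the median
    for col in df.select_dtypes(include=[np.number]):
        df[col] = df[col].fillna(df[col].median())
    # Note: each frame is imputed with its own statistics (test is not filled from train)
    binary_cols = [
        'FAMI_TIENEINTERNET','FAMI_TIENELAVADORA','FAMI_TIENEAUTOMOVIL',
        'FAMI_TIENECOMPUTADOR','FAMI_TIENEINTERNET.1',
        'ESTU_PRIVADO_LIBERTAD','ESTU_PAGOMATRICULAPROPIO'
    ]
    for col in binary_cols:
        if col in df:
            # Values outside this dictionary become NaN after map()
            df[col] = df[col].map({'Si':1, 'No':0, 'S':1, 'N':0})
    if is_train:
        # Ordinal version of the target: bajo=1 ... alto=4
        df['RENDIMIENTO_GLOBAL_NUM'] = df['RENDIMIENTO_GLOBAL'].map(mapping)
    return df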

df_train = clean_data(train, is_train=True)
df_test  = clean_data(test,  is_train=False)

# 2. Prepare features and target
TARGET = 'RENDIMIENTO_GLOBAL_NUM'
ID_COL = 'ID'
X = df_train.drop([ID_COL, 'RENDIMIENTO_GLOBAL', TARGET], axis=1)
y = df_train[TARGET]
X_test = df_test.drop([ID_COL], axis=1)

# 3. Identify categorical cardinalities
cat_cols = X.select_dtypes(include='object').columns.tolist()
low_card  = [c for c in cat_cols if X[c].nunique() < 15]
mid_card  = [c for c in cat_cols if 15 <= X[c].nunique() <= 50]
high_card = [c for c in cat_cols if X[c].nunique() > 50]

# 4. Encoding pipeline
encoding = Pipeline([
    ('freq',  FrequencyEncoder(mid_card)),
    ('count', CountEncoder(high_card)),
    ('drop',  DropColumns(mid_card + high_card))
])

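# 5. Preprocessor for numeric + low-card one-hot
# Note: numeric_cols is taken from X before the encoding step runs, so the new
# *_FE / *_CE columns are not in it and remainder='drop' below discards them;
# remainder='passthrough' would keep them as model features.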
numeric_cols = X.select_dtypes(include=['int64','float64']).columns.tolist()
preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False), low_card)
], remainder='drop')

# 6. Full pipeline with XGBoost
pipeline = Pipeline([
    ('encode', encoding),
    ('prep',   preprocessor),
    ('clf',    XGBClassifier(
        objective='multi:softprob',
        num_class=4,
        learning_rate=0.1,
        n_estimators=200,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=1,
        reg_lambda=1,
        eval_metric='mlogloss',
        random_state=42,
        n_jobs=-1
    ))
])
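
One detail of step 5 is worth checking in isolation: the numeric column list is built from `X` before the encoders add their `_FE` and `_CE` columns, and `ColumnTransformer` only keeps columns it is told about unless `remainder='passthrough'` is set. The sketch below illustrates that behavior on a toy frame; the column names `age` and `city_FE` and the helper `build` are made up for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame as it might look after the encoding step: one original numeric
# column plus one engineered column. Both names are illustrative only.
encoded = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "city_FE": [0.5, 0.25, 0.25],
})
numeric_cols = ["age"]  # list built before "city_FE" existed

def build(remainder):
    # Same imputer in both cases; only the handling of unlisted columns differs.
    return ColumnTransformer(
        [("num", SimpleImputer(strategy="median"), numeric_cols)],
        remainder=remainder,
    )

dropped = build("drop").fit_transform(encoded)
kept = build("passthrough").fit_transform(encoded)

print(dropped.shape)  # engineered column is gone
print(kept.shape)     # engineered column is appended after the listed ones
```

An alternative with the same effect would be to list the engineered column names explicitly in the numeric branch, which also runs them through the imputer.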

In [12]:
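# 7. Cross-validation
# XGBoost expects class ids 0..n_classes-1; LabelEncoder turns the targets
# (as strings '1'..'4') into 0..3 and lets us invert the predictions later.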
le = LabelEncoder()
y_enc = le.fit_transform(y.astype(str))
scores = cross_val_score(pipeline, X, y_enc, cv=5, scoring='accuracy')
print(f"✅ CV Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")

✅ CV Accuracy: 0.3755 ± 0.0056
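
An accuracy of about 0.38 is easier to interpret next to a trivial baseline: with four perfectly balanced classes, always predicting the most frequent one scores 0.25. The following is a hedged sketch of how such a baseline could be cross-validated with scikit-learn's `DummyClassifier`. The data is synthetic (generated with `make_classification`), not the notebook's dataset, so the printed number says nothing about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-class problem standing in for the real features and target.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=4, random_state=42,
)

# Always predicts the most frequent class seen in each training fold.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=5, scoring="accuracy")

print(f"Baseline accuracy: {baseline_scores.mean():.4f} ± {baseline_scores.std():.4f}")
```

Running the same baseline on the notebook's `X` and `y_enc` would show how much of the reported score comes from the model rather than from the class distribution.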




In [14]:
# 8. Final training and prediction
pipeline.fit(X, y_enc)
preds_enc = pipeline.predict(X_test)
preds_lbl = le.inverse_transform(preds_enc)

# 9. Submission
submission = pd.DataFrame({ID_COL: test[ID_COL], 'RENDIMIENTO_GLOBAL': preds_lbl})
# Separate name so the label -> number `mapping` used by clean_data is not overwritten
num_to_label = {
    '1': 'bajo',
    '2': 'medio-bajo',
    '3': 'medio-alto',
    '4': 'alto'
}

submission['RENDIMIENTO_GLOBAL'] = submission['RENDIMIENTO_GLOBAL'].map(num_to_label)
submission.to_csv('submission_xgb_tuned2.csv', index=False)
print("📄 'submission_xgb_tuned2.csv' generated.")

📄 'submission_xgb_tuned2.csv' generated.
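
The label handling in this notebook goes through several steps (text label, then 1..4, then string, then `LabelEncoder` id) and back again. A simpler alternative, sketched here under the assumption that the four class names are the only target values, is to map labels straight to 0-based ids and invert the same dictionary for the submission. The names `label_to_id` and `id_to_label` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# 0-based ids, as XGBoost expects for multiclass targets.
label_to_id = {'bajo': 0, 'medio-bajo': 1, 'medio-alto': 2, 'alto': 3}
id_to_label = {i: label for label, i in label_to_id.items()}

# Toy target column standing in for RENDIMIENTO_GLOBAL.
labels = pd.Series(['alto', 'bajo', 'medio-alto', 'bajo'])

y_ids = labels.map(label_to_id)    # what the classifier would be trained on
restored = y_ids.map(id_to_label)  # what would be written to the submission

print(y_ids.tolist())
print(restored.tolist())
```

Keeping both directions in one pair of dictionaries avoids reusing a single variable name for two different mappings.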
