Loading and initial cleaning

Drop all identifier (ID) columns.

Convert binary “Si/No” variables to 1/0.

Drop columns with more than 80% missing values or with a single unique value.
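
The three cleaning rules can be sketched on a toy frame. This is a minimal illustration, not the notebook's exact function: the column names are hypothetical, and matching ID columns by exact name or an `_id` suffix is an assumption made here so that a name that merely contains "id" (such as `UNIVERSIDAD`) is not dropped by accident.

```python
import pandas as pd

def clean(df: pd.DataFrame, max_null_frac: float = 0.8) -> pd.DataFrame:
    """Apply the three cleaning rules to a copy of df."""
    df = df.copy()
    # Rule 1: drop ID columns (exact name 'id' or an '_id' suffix; assumed convention).
    id_cols = [c for c in df.columns if c.lower() == "id" or c.lower().endswith("_id")]
    df = df.drop(columns=id_cols)
    # Rule 2: map columns whose only non-null values are Si/No to 1/0.
    yes_no = {"Si": 1, "No": 0}
    for col in df.columns:
        values = set(df[col].dropna().unique())
        if values and values <= set(yes_no):
            df[col] = df[col].map(yes_no)
    # Rule 3: drop columns that are mostly missing or hold a single value.
    mostly_null = [c for c in df.columns if df[c].isna().mean() > max_null_frac]
    constant = [c for c in df.columns if df[c].nunique() <= 1]
    return df.drop(columns=sorted(set(mostly_null) | set(constant)))

# Hypothetical toy data: 6 rows, one column of each kind.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "UNIVERSIDAD": ["A", "B", "A", "C", "B", "A"],
    "TIENE_INTERNET": ["Si", "No", "Si", None, "No", "Si"],
    "MOSTLY_EMPTY": [None, None, None, None, None, 1.0],  # 5 of 6 values missing
    "CONSTANT": ["x", "x", "x", "x", "x", "x"],
})
cleaned = clean(toy)
print(list(cleaned.columns))
```

Only `UNIVERSIDAD` and the mapped `TIENE_INTERNET` column survive; the missing value in `TIENE_INTERNET` stays missing and is left for the imputation step.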

Imputation

Fill missing numeric values with the median.

Fill missing categorical values with the mode (the most frequent category).
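
One way to apply these two rules is to learn the fill values once and reuse them. The sketch below assumes the statistics are computed on the training frame only and then applied to both frames; the helper name `fit_fill_values` and the columns `age` and `stratum` are hypothetical.

```python
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Median for numeric columns, mode for all other columns, learned on train."""
    fills = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fills[col] = train[col].median()
        else:
            # mode() ignores missing values; [0] takes the first of any ties.
            fills[col] = train[col].mode()[0]
    return fills

train = pd.DataFrame({"age": [20.0, None, 30.0, 40.0],
                      "stratum": ["2", "3", None, "3"]})
test = pd.DataFrame({"age": [None, 25.0],
                     "stratum": [None, "1"]})

fills = fit_fill_values(train)
train_filled = train.fillna(fills)  # fillna with a dict fills column by column
test_filled = test.fillna(fills)    # test reuses the training statistics
print(fills)
```

Assigning the result of `fillna` back to a name, rather than calling it with `inplace=True` on a selected column, also avoids pandas' chained-assignment warning.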

Feature selection

Numeric: keep only features whose absolute correlation with the label-encoded target (RENDIMIENTO_GLOBAL) is ≥ 0.1; then, within each pair of features whose absolute correlation exceeds 0.9, drop one to limit multicollinearity.

Categorical: apply one-hot encoding only to low-cardinality features (≤ 10 categories).

Transformations

Numeric: pipeline of imputation → power transform (Yeo–Johnson) → robust scaling.

Categorical: constant imputation → one-hot encoding.

Remove zero-variance features before PCA.
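
To make the numeric branch concrete, here is a minimal pure-Python sketch of the two transforms. It assumes a fixed exponent `lam`, whereas scikit-learn's `PowerTransformer` estimates it from the data and, by default, also standardizes the result; the helper names are hypothetical.

```python
import math
import statistics

def yeo_johnson(y: float, lam: float) -> float:
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if y >= 0:
        return math.log1p(y) if lam == 0 else ((y + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-y)
    return -((1 - y) ** (2 - lam) - 1) / (2 - lam)

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

skewed = [0.0, 1.0, 3.0, 7.0, 100.0]                 # one large outlier
transformed = [yeo_johnson(v, 0.0) for v in skewed]  # lam = 0 is log(1 + y) for y >= 0
scaled = robust_scale(transformed)
print([round(v, 3) for v in scaled])
```

Because the scaling uses the median and the interquartile range rather than the mean and standard deviation, the outlier at 100 has little influence on how the other values are scaled.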

Modeling

Reduce dimensionality with PCA, searching over explained-variance thresholds from 50% to 99% in the grid search.

Fit a multinomial logistic regression classifier with L1, L2, or ElasticNet penalties, tuning the regularization strength (C), the L1/L2 mixing ratio, and the convergence tolerance.

Evaluate performance with 5-fold cross-validation, optimizing accuracy.
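
When `n_components` is a fraction between 0 and 1, PCA keeps the smallest number of leading components whose cumulative explained variance exceeds that fraction. The sketch below shows the selection rule in isolation; the variance ratios are made up for illustration and the helper name is hypothetical.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio exceeds threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative > threshold:
            return k
    return len(explained_variance_ratio)

# Hypothetical per-component ratios, sorted in decreasing order as PCA reports them.
ratios = [0.6, 0.25, 0.1, 0.05]
for threshold in (0.5, 0.8, 0.9, 0.99):
    print(threshold, components_for_variance(ratios, threshold))
```

With these ratios, the thresholds 0.5, 0.8, 0.9 and 0.99 keep 1, 2, 3 and 4 components respectively, so higher thresholds in the grid correspond to less aggressive dimensionality reduction.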

In [4]:
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, PowerTransformer, OneHotEncoder, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold

# 1) Load the data
train_df = pd.read_csv('train.csv')
test_df  = pd.read_csv('test.csv')

# 2) Split X/y and encode the target
TARGET   = 'RENDIMIENTO_GLOBAL'
y_raw    = train_df[TARGET].values
X        = train_df.drop(columns=[TARGET])
test_ids = test_df['ID']
test_X   = test_df.drop(columns=[TARGET], errors='ignore')

le = LabelEncoder()
y  = le.fit_transform(y_raw)

# 3) Cleaning and binary-mapping function
def clean_and_map(df):
    df = df.copy()
    # Drop ID columns. NOTE: this is a substring match, so it also drops any
    # column whose name merely contains 'id'
    id_cols = [c for c in df.columns if 'id' in c.lower()]
    df.drop(columns=id_cols, inplace=True, errors='ignore')
    # Map Si/No (and S/N) to 1/0
    binary_map = {'Si':1,'No':0,'S':1,'N':0}
    for col in df.columns:
        if df[col].dtype == object and set(df[col].dropna().unique()).issubset(binary_map):
            df[col] = df[col].map(binary_map)
    return df

X      = clean_and_map(X)
test_X = clean_and_map(test_X)[X.columns]  # keep the same columns as train

# 4) Drop columns with >80% nulls or a single value (decided on train only)
null_pct   = X.isna().mean()
drop_nulls = null_pct[null_pct > 0.8].index.tolist()
const_cols = [c for c in X.columns if X[c].nunique() <= 1]
X.drop(columns=drop_nulls + const_cols, inplace=True)
test_X = test_X[X.columns]

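# 5) Impute missing values using statistics learned on the training set
#    (fillna returns new frames, which avoids pandas' chained-assignment warning)
cat_cols    = X.select_dtypes(include=['object','category']).columns
num_cols    = X.select_dtypes(include=[np.number]).columns
fill_values = {c: X[c].mode()[0] for c in cat_cols}        # categorical → mode
fill_values.update({c: X[c].median() for c in num_cols})   # numeric → median
X      = X.fillna(fill_values)
test_X = test_X.fillna(fill_values)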

# 6) Numeric selection by target correlation and multicollinearity
df_corr     = pd.concat([X.select_dtypes(include=[np.number]), pd.Series(y, name='_t', index=X.index)], axis=1)
corr_matrix = df_corr.corr().abs()
# candidates with |corr| >= 0.1
num_cands   = corr_matrix['_t'][corr_matrix['_t'] >= 0.1].drop('_t').index.tolist()
sub         = X[num_cands].corr().abs()
upper       = np.triu(np.ones(sub.shape), k=1).astype(bool)
# drop a column when any earlier candidate correlates with it above 0.9
final_num   = [col for i, col in enumerate(sub.columns)
               if not any(upper[j, i] and sub.iloc[j, i] > 0.9 for j in range(sub.shape[0]))]

# 7) Select low-cardinality categorical columns (<= 10 categories)
cat_all     = X.select_dtypes(include=['object','category']).columns.tolist()
onehot_cols = [c for c in cat_all if X[c].nunique() <= 10]

# 8) Build the pipelines
num_pipe = Pipeline([
    ('imp',   SimpleImputer(strategy='median')),
    ('pow',   PowerTransformer(method='yeo-johnson')),
    ('rob',   RobustScaler())
])
cat_pipe = Pipeline([
    ('imp',   SimpleImputer(strategy='constant', fill_value='Missing')),
    ('oh',    OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', num_pipe,  final_num),
    ('cat', cat_pipe, onehot_cols)
], remainder='drop')

pipe = Pipeline([
    ('pre',      preprocessor),
    ('vt',       VarianceThreshold()),
    ('pca',      PCA(svd_solver='auto', random_state=42)),  
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42))
])

In [5]:
# 9) GridSearchCV with an expanded grid (corrected parameter prefixes)
param_grid = {
    'pca__n_components':    [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99],
    'classifier__C':        [1e-3, 1e-2, 1e-1, 1, 10, 100],
    'classifier__penalty':  ['l1', 'l2', 'elasticnet'],
    'classifier__l1_ratio': [0.2, 0.5, 0.8],        # only used when penalty='elasticnet'
    'classifier__solver':   ['saga'],               # saga supports all three penalties
    'classifier__multi_class': ['multinomial'],     # deprecated in recent scikit-learn releases
    'classifier__tol':      [1e-3, 1e-4, 1e-5],
}

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2
)
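
Because `l1_ratio` only has an effect when the penalty is `'elasticnet'`, the flat grid above fits the `'l1'` and `'l2'` candidates three times each with identical settings. `GridSearchCV` also accepts a list of dictionaries, so one possible alternative is to split the grid by penalty. The sketch below only counts candidates and does not run a search; `count_candidates` is a hypothetical helper.

```python
from math import prod

def count_candidates(param_grid):
    """Count parameter combinations for a dict grid or a list of dict grids."""
    grids = [param_grid] if isinstance(param_grid, dict) else param_grid
    return sum(prod(len(values) for values in grid.values()) for grid in grids)

shared = {
    'pca__n_components':       [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99],
    'classifier__C':           [1e-3, 1e-2, 1e-1, 1, 10, 100],
    'classifier__solver':      ['saga'],
    'classifier__multi_class': ['multinomial'],
    'classifier__tol':         [1e-3, 1e-4, 1e-5],
}

# Flat grid, equivalent to the one in the cell above.
flat_grid = {**shared,
             'classifier__penalty':  ['l1', 'l2', 'elasticnet'],
             'classifier__l1_ratio': [0.2, 0.5, 0.8]}

# Split grid: l1_ratio is searched only together with the elasticnet penalty.
split_grid = [
    {**shared, 'classifier__penalty': ['l1', 'l2']},
    {**shared, 'classifier__penalty': ['elasticnet'],
     'classifier__l1_ratio': [0.2, 0.5, 0.8]},
]

print(count_candidates(flat_grid))   # 1134
print(count_candidates(split_grid))  # 630
```

Passing `split_grid` as `param_grid` would cover the same distinct models with 630 candidates instead of 1134, which with 5 folds is 3150 fits instead of 5670.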

In [6]:
# 10) Run the search
grid.fit(X, y)

Fitting 5 folds for each of 1134 candidates, totalling 5670 fits

Fitted GridSearchCV (parameters recovered from the estimator diagram; unset values omitted)

GridSearchCV
  estimator: Pipeline(step...m_state=42))])
  param_grid: {'classifier__C': [0.001, 0.01, ...], 'classifier__l1_ratio': [0.2, 0.5, ...], 'classifier__multi_class': ['multinomial'], 'classifier__penalty': ['l1', 'l2', ...], ...}
  scoring='accuracy', n_jobs=-1, refit=True, cv=5, verbose=2, pre_dispatch='2*n_jobs', return_train_score=False

pre: ColumnTransformer
  transformers=[('num', ...), ('cat', ...)], remainder='drop', sparse_threshold=0.3, verbose=False, verbose_feature_names_out=True, force_int_remainder_cols='deprecated'

num / imp: SimpleImputer
  strategy='median', copy=True, add_indicator=False, keep_empty_features=False

num / pow: PowerTransformer
  method='yeo-johnson', standardize=True, copy=True

num / rob: RobustScaler
  with_centering=True, with_scaling=True, quantile_range=(25.0, ...), copy=True, unit_variance=False

cat / imp: SimpleImputer
  strategy='constant', fill_value='Missing', copy=True, add_indicator=False, keep_empty_features=False

cat / oh: OneHotEncoder
  categories='auto', sparse_output=False, dtype=<class 'numpy.float64'>, handle_unknown='ignore', feature_name_combiner='concat'

vt: VarianceThreshold
  threshold=0.0

pca: PCA
  n_components=0.99, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=42

classifier: LogisticRegression
  penalty='elasticnet', dual=False, tol=0.001, C=0.001, fit_intercept=True, intercept_scaling=1, class_weight='balanced', random_state=42, solver='saga', max_iter=1000

In [7]:
# 11) Results
print("Best parameters found:")
print(grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")

Best parameters found:
{'classifier__C': 0.001, 'classifier__l1_ratio': 0.2, 'classifier__multi_class': 'multinomial', 'classifier__penalty': 'elasticnet', 'classifier__solver': 'saga', 'classifier__tol': 0.001, 'pca__n_components': 0.99}
Best CV accuracy: 0.3489


In [None]:
# 12) Final prediction and generation of the submission file
best_pipe = grid.best_estimator_
preds_enc = best_pipe.predict(test_X)
preds     = le.inverse_transform(preds_enc)  # back to the original class labels
submission = pd.DataFrame({
    'ID':                 test_ids,
    'RENDIMIENTO_GLOBAL': preds
})
submission.to_csv('submission_PCA_LOG.csv', index=False)
print("✔️ submission_PCA_LOG.csv ready for Kaggle.")

✔️ submission_PCA_LOG.csv ready for Kaggle.
