# LightGBM Model Training

This notebook trains a LightGBM model, tuned with GridSearchCV, to predict medical appointment no-shows.

**Project structure:**
- Dataset: `data/KaggleV2-May-2016.csv`
- Generated model: `models/Classification_medical_no_show-LGBM.joblib`

**Note:** Run this notebook from the `notebooks/` folder so that the relative paths resolve correctly.
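
The relative paths depend on the working directory at launch time. As a minimal sketch of a more defensive lookup, assuming a hypothetical helper `find_project_root` (not part of this notebook) that walks upward until it finds a `data/` folder:

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Hypothetical helper: walk up from `start` until a folder named `marker` exists."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' folder found above {start}")


# Demo on a throwaway layout that mimics <root>/data and <root>/notebooks
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data").mkdir()
    (root / "notebooks").mkdir()
    found_from_notebooks = find_project_root(root / "notebooks")
    found_from_root = find_project_root(root)
    same_root = found_from_notebooks == found_from_root == root
```

With such a helper the notebook could be started from either the project root or `notebooks/`; the marker folder name is an assumption and would need to match the real layout.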

In [1]:
import pprint
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from flaml import AutoML
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from xgboost import XGBClassifier


In [2]:
# Relative path from notebooks/ to the project root
DATA_DIR = Path.cwd().parent / "data"
datos = pd.read_csv(DATA_DIR / "KaggleV2-May-2016.csv")

## Column selection

In [3]:
columnas_seleccionadas = [
"Gender",
"Age",
"ScheduledDay",
"AppointmentDay",
"Neighbourhood",
"Scholarship",
"Hipertension",
"Diabetes",
"Alcoholism",
"Handcap",
"SMS_received",
"No-show",
]

In [4]:
df_noshow = datos[columnas_seleccionadas].copy()
df_noshow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Gender          110527 non-null  object
 1   Age             110527 non-null  int64 
 2   ScheduledDay    110527 non-null  object
 3   AppointmentDay  110527 non-null  object
 4   Neighbourhood   110527 non-null  object
 5   Scholarship     110527 non-null  int64 
 6   Hipertension    110527 non-null  int64 
 7   Diabetes        110527 non-null  int64 
 8   Alcoholism      110527 non-null  int64 
 9   Handcap         110527 non-null  int64 
 10  SMS_received    110527 non-null  int64 
 11  No-show         110527 non-null  object
dtypes: int64(7), object(5)
memory usage: 10.1+ MB


## Data adjustments

In [5]:
# Cast 'Age' to float before introducing NaN values.
# This avoids the FutureWarning raised when NaN is written into an integer column.
df_noshow['Age'] = pd.to_numeric(df_noshow['Age'], errors='coerce').astype('float64')
df_noshow['Age'] = df_noshow['Age'].mask(df_noshow['Age'] < 0, np.nan)

# Map the target variable explicitly to 0/1
df_noshow.loc[:, 'No-show'] = df_noshow['No-show'].map({'No': 0, 'Yes': 1})

# Robust date conversion and component extraction.
# The intermediate series 's' guarantees access to the .dt accessor.
fechas = ['AppointmentDay', 'ScheduledDay']
for col in fechas:
    s = pd.to_datetime(df_noshow[col], errors='coerce')  # datetime, or NaT if parsing fails
    df_noshow.loc[:, col] = s
    df_noshow.loc[:, f"{col}_year"] = s.dt.year
    df_noshow.loc[:, f"{col}_month"] = s.dt.month
    df_noshow.loc[:, f"{col}_day"] = s.dt.day

# Drop the original date columns once the derived ones exist
df_noshow.drop(columns=fechas, inplace=True)



In [6]:
df_noshow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Gender                110527 non-null  object 
 1   Age                   110526 non-null  float64
 2   Neighbourhood         110527 non-null  object 
 3   Scholarship           110527 non-null  int64  
 4   Hipertension          110527 non-null  int64  
 5   Diabetes              110527 non-null  int64  
 6   Alcoholism            110527 non-null  int64  
 7   Handcap               110527 non-null  int64  
 8   SMS_received          110527 non-null  int64  
 9   No-show               110527 non-null  object 
 10  AppointmentDay_year   110527 non-null  int32  
 11  AppointmentDay_month  110527 non-null  int32  
 12  AppointmentDay_day    110527 non-null  int32  
 13  ScheduledDay_year     110527 non-null  int32  
 14  ScheduledDay_month    110527 non-null  int32  
 15  
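
The cleaning steps from cell 5 can be checked on a toy frame. This is only a sketch: the three rows below are invented for illustration, and the invalid timestamp is there to show what `errors="coerce"` does.

```python
import numpy as np
import pandas as pd

# Invented rows: one negative age and one unparseable timestamp
toy = pd.DataFrame({
    "Age": [34, -1, 60],
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z", "not a date"],
})

# Negative ages become NaN; the column is cast to float64 first
toy["Age"] = toy["Age"].astype("float64").mask(toy["Age"] < 0, np.nan)

# Unparseable timestamps become NaT instead of raising
s = pd.to_datetime(toy["ScheduledDay"], errors="coerce")
toy["ScheduledDay_year"] = s.dt.year
toy["ScheduledDay_month"] = s.dt.month
toy["ScheduledDay_day"] = s.dt.day
toy = toy.drop(columns=["ScheduledDay"])
```

In the real dataset every timestamp parses, which is why the derived columns in the `info()` output have no missing values.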

## Data preparation

In [7]:
# Separate features and target
X = df_noshow.drop(columns=["No-show"])
y = df_noshow["No-show"].astype(int)

# Train-test split (80-20), stratified on the target
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set", x_train.shape, y_train.shape)
print("Test set", x_test.shape, y_test.shape)

Training set (88421, 15) (88421,)
Test set (22106, 15) (22106,)
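
The `stratify=y` argument keeps the class ratio the same in both splits. A small sketch on synthetic labels (the 80/20 balance is chosen to resemble this dataset, and the arrays are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an 80/20 class balance (illustrative only)
y_toy = np.array([0] * 800 + [1] * 200)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_rate = y_tr.mean()  # share of positives in the training split
test_rate = y_te.mean()   # share of positives in the test split
```

Without stratification the two rates can drift apart by chance, which makes precision on the test set harder to compare with cross-validation results.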


## Pipelines

In [8]:
# Columns by type
numericas_f = ["Age", "Scholarship", "Hipertension", "Diabetes", "Alcoholism", "SMS_received"]
ordinales_f = ["Handcap"]
categoricas_f_solo_genero = ["Gender"]
categoricas_f_genero_neigubourhood = ["Gender", "Neighbourhood"]

# Numeric variables: mean imputation + standardization.
# Only one 'Age' value is missing in the current DataFrame, but other test sets
# (or synthetic data) may contain more missing values.
t_numerico = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('Scaler', StandardScaler())
])

# Ordinal variable: most-frequent imputation + ordinal encoding.
# The imputer guards against missing values in future data.
t_ordinal = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(categories=[[0, 1, 2, 3, 4]]))
])

# Categorical variables: most-frequent imputation + one-hot encoding.
# The imputer guards against missing values in future data.
t_categoricas = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])
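
To see what the imputation step buys, here is a minimal sketch of the numeric pipeline applied to a toy `Age` column with one missing value (the four ages are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

toy_age = pd.DataFrame({"Age": [20.0, 40.0, np.nan, 60.0]})
scaled = numeric_pipe.fit_transform(toy_age)

# The missing age is replaced by the column mean (40.0), which scales to 0
imputed_value = scaled[2, 0]
has_nan = np.isnan(scaled).any()
```

Because the imputer runs before the scaler, downstream estimators never see a NaN, whichever model is plugged in.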


## Preprocessors

In [9]:
preprocessor_1 = ColumnTransformer(
    transformers=[
        ('num', t_numerico, numericas_f),
        ('ord', t_ordinal, ordinales_f),
        ('cat', t_categoricas, categoricas_f_solo_genero)
    ])

preprocessor_1



In [10]:
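# Fit the preprocessor to preview the transformed columns. It is fitted on x_test here
# only for inspection; the full pipelines below refit it on x_train.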
preprocessor_1.fit(x_test)
feature_names = preprocessor_1.get_feature_names_out()

x_test_transformed = preprocessor_1.transform(x_test)
x_test_transformed = pd.DataFrame(x_test_transformed, columns=feature_names)
x_test_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22106 entries, 0 to 22105
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   num__Age           22106 non-null  float64
 1   num__Scholarship   22106 non-null  float64
 2   num__Hipertension  22106 non-null  float64
 3   num__Diabetes      22106 non-null  float64
 4   num__Alcoholism    22106 non-null  float64
 5   num__SMS_received  22106 non-null  float64
 6   ord__Handcap       22106 non-null  float64
 7   cat__Gender_M      22106 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB


In [11]:
preprocessor_2 = ColumnTransformer(
    transformers=[
        ('num', t_numerico, numericas_f),
        ('ord', t_ordinal, ordinales_f),
        ('cat', t_categoricas, categoricas_f_genero_neigubourhood)
    ])

preprocessor_2



In [12]:
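# Fit the preprocessor to preview the transformed columns. It is fitted on x_test here
# only for inspection; the full pipelines below refit it on x_train.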
preprocessor_2.fit(x_test)
feature_names = preprocessor_2.get_feature_names_out()

x_test_transformed = preprocessor_2.transform(x_test)
x_test_transformed = pd.DataFrame(x_test_transformed, columns=feature_names)
x_test_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22106 entries, 0 to 22105
Data columns (total 85 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   num__Age                                22106 non-null  float64
 1   num__Scholarship                        22106 non-null  float64
 2   num__Hipertension                       22106 non-null  float64
 3   num__Diabetes                           22106 non-null  float64
 4   num__Alcoholism                         22106 non-null  float64
 5   num__SMS_received                       22106 non-null  float64
 6   ord__Handcap                            22106 non-null  float64
 7   cat__Gender_M                           22106 non-null  float64
 8   cat__Neighbourhood_ANDORINHAS           22106 non-null  float64
 9   cat__Neighbourhood_ANTÔNIO HONÓRIO      22106 non-null  float64
 10  cat__Neighbourhood_ARIOVALDO FAVALESSA  22106 non-null  fl

In [13]:
def resumen_clasificación(y_test, y_pred):
    """Return the main classification metrics as a dictionary."""
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    # y_pred holds hard class labels, so this ROC AUC equals the balanced accuracy
    roc = roc_auc_score(y_test, y_pred)

    return {"accurancy": acc,
            "precision": prec,
            "recall": recall,
            "f1": f1,
            "roc": roc}

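The `resumen_clasificación` helper passes hard predictions to `roc_auc_score`. A short sketch of what that implies, using made-up scores: computed from 0/1 labels, the ROC AUC collapses to the balanced accuracy, while computing it from probabilities (for example `predict_proba(...)[:, 1]`) measures ranking quality across all thresholds.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])  # made-up probabilities
y_hard = (scores >= 0.5).astype(int)                         # labels at a 0.5 threshold

auc_from_labels = roc_auc_score(y_true, y_hard)   # what the helper computes
auc_from_scores = roc_auc_score(y_true, scores)   # threshold-free ranking quality
bal_acc = balanced_accuracy_score(y_true, y_hard)
```

This is one reason the `roc` values reported further down sit close to 0.5: with very low recall, the balanced accuracy is only slightly above chance.
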
In [14]:
# Models to compare, using their default hyperparameters
modelos = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "lightgbm": LGBMClassifier(random_state=42, objective='binary'),
    "xgboost": XGBClassifier(random_state=42),
    "Extra_tree": ExtraTreesClassifier(random_state=42)
}

In [15]:
# Build a full pipeline for every combination of model and preprocessor
pipelines = {}

for modelo_nombre, modelo in modelos.items():
    # Pipeline 1: one-hot encoding for Gender only
    pipelines[f"{modelo_nombre}_SG"] = Pipeline([
        ("preprocessing", preprocessor_1),
        ("classifier", modelo)
    ])
    # Pipeline 2: one-hot encoding for Gender and Neighbourhood
    pipelines[f"{modelo_nombre}_GN"] = Pipeline([
        ("preprocessing", preprocessor_2),
        ("classifier", modelo)
    ])
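
In the loop above, the `_SG` and `_GN` pipelines of each model wrap the same estimator object, so fitting one replaces the fitted state seen by the other. That is harmless here because each pipeline is evaluated right after it is fitted. If independent fitted pipelines are needed, one option is `sklearn.base.clone`, sketched below with a stand-in scaler instead of the real preprocessors:

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

base_model = RandomForestClassifier(n_estimators=10, random_state=42)

# Each pipeline receives its own unfitted copy with identical hyperparameters
pipe_a = Pipeline([("scale", StandardScaler()), ("classifier", clone(base_model))])
pipe_b = Pipeline([("scale", StandardScaler()), ("classifier", clone(base_model))])

shares_estimator = pipe_a.named_steps["classifier"] is pipe_b.named_steps["classifier"]
same_params = (
    pipe_a.named_steps["classifier"].get_params()
    == pipe_b.named_steps["classifier"].get_params()
)
```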

In [16]:
pipelines

{'RandomForest_SG': Pipeline(steps=[('preprocessing',
                  ColumnTransformer(transformers=[('num',
                                                   Pipeline(steps=[('imputer',
                                                                    SimpleImputer()),
                                                                   ('Scaler',
                                                                    StandardScaler())]),
                                                   ['Age', 'Scholarship',
                                                    'Hipertension', 'Diabetes',
                                                    'Alcoholism',
                                                    'SMS_received']),
                                                  ('ord',
                                                   Pipeline(steps=[('imputer',
                                                                    SimpleImputer(strategy='most_frequent')),
                   

In [17]:
# Train and evaluate each pipeline
resultados = {}

for nombre_pipeline, pipeline in pipelines.items():
    pipeline.fit(x_train, y_train)
    y_pred = pipeline.predict(x_test)
    resultados[nombre_pipeline] = resumen_clasificación(y_test, y_pred)

[LightGBM] [Info] Number of positive: 17855, number of negative: 70566
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004682 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 124
[LightGBM] [Info] Number of data points in the train set: 88421, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201932 -> initscore=-1.374265
[LightGBM] [Info] Start training from score -1.374265




[LightGBM] [Info] Number of positive: 17855, number of negative: 70566
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006006 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 278
[LightGBM] [Info] Number of data points in the train set: 88421, number of used features: 85
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201932 -> initscore=-1.374265
[LightGBM] [Info] Start training from score -1.374265




In [18]:
print("Metric summary for each pipeline:")
pprint.pprint(resultados)

Metric summary for each pipeline:
{'Extra_tree_GN': {'accurancy': 0.7623269700533791,
                   'f1': 0.17648902821316614,
                   'precision': 0.2938413361169102,
                   'recall': 0.1261200716845878,
                   'roc': 0.5247140433244388},
 'Extra_tree_SG': {'accurancy': 0.7951687324708224,
                   'f1': 0.027073485174043833,
                   'precision': 0.33157894736842103,
                   'recall': 0.014112903225806451,
                   'roc': 0.5034570864615598},
 'RandomForest_GN': {'accurancy': 0.7570795259205646,
                     'f1': 0.18980084490042246,
                     'precision': 0.29066543438077636,
                     'recall': 0.14090501792114696,
                     'roc': 0.5269483711077223},
 'RandomForest_SG': {'accurancy': 0.794671130009952,
                     'f1': 0.03074951953875721,
                     'precision': 0.3287671232876712,
                     'recall': 0.016129032258064516

In [19]:
df_resultados = pd.DataFrame(resultados).T
df_resultados_sorted = df_resultados.sort_values(by="precision", ascending=False)
df_resultados_sorted

                 accurancy  precision    recall        f1       roc
lightgbm_SG       0.798381    0.62069  0.004032  0.008012  0.501704
xgboost_SG        0.798245   0.543478    0.0056  0.011086  0.502205
lightgbm_GN       0.798154    0.53125  0.003808  0.007562  0.501479
xgboost_GN        0.798154   0.507143  0.015905  0.030843  0.505997
Extra_tree_SG     0.795169   0.331579  0.014113  0.027073  0.503457
RandomForest_SG   0.794671   0.328767  0.016129   0.03075  0.503898
Extra_tree_GN     0.762327   0.293841   0.12612  0.176489  0.524714
RandomForest_GN    0.75708   0.290665  0.140905  0.189801  0.526948


In this problem the classes of the target variable are imbalanced: more patients attended (about 79.8%) than missed their appointment (about 20.2%). Accuracy is therefore not a suitable metric here.
Instead, we chose to optimize precision, since knowing which predicted no-shows really fail to attend helps to manage costs and plan appointments. By this criterion the best model in the table above is LightGBM (`lightgbm_SG`, precision 0.62). Its recall is only 0.004, however, so it flags very few of the actual no-shows.
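
The point about accuracy can be made concrete with a baseline that always predicts the majority class. This sketch uses synthetic labels with an 80/20 balance, chosen to resemble the dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic labels with an 80/20 split (illustrative only)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_pred)
baseline_recall = recall_score(y_toy, y_pred)
baseline_precision = precision_score(y_toy, y_pred, zero_division=0)
```

The baseline reaches 80% accuracy without identifying a single no-show, so an accuracy near 0.80 says little about how useful a model is for this task.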

## Cross-validation

In [20]:
# Collect one DataFrame of per-fold results per pipeline
cv_frames = []

# Number of folds
cv_folds = 5

for pipeline_name, pipeline_obj in pipelines.items():
    # cross_val_score trains and evaluates on 5 folds
    scores = cross_val_score(pipeline_obj, x_train, y_train, cv=cv_folds, scoring="precision")

    # Temporary DataFrame with the information for each fold
    cv_frames.append(pd.DataFrame({
        "pipeline": [pipeline_name] * cv_folds,
        "fold": list(range(1, cv_folds + 1)),
        "precision": scores
    }))

# Concatenate once; this avoids the FutureWarning raised when concatenating onto an empty DataFrame
df_cv_results = pd.concat(cv_frames, ignore_index=True)




[LightGBM] [Info] Number of positive: 14284, number of negative: 56452
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003397 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 124
[LightGBM] [Info] Number of data points in the train set: 70736, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201934 -> initscore=-1.374251
[LightGBM] [Info] Start training from score -1.374251




[LightGBM] [Info] Number of positive: 14284, number of negative: 56453
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004615 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 124
[LightGBM] [Info] Number of data points in the train set: 70737, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201931 -> initscore=-1.374268
[LightGBM] [Info] Start training from score -1.374268




[LightGBM] [Info] Number of positive: 14284, number of negative: 56453
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004288 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 124
[LightGBM] [Info] Number of data points in the train set: 70737, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201931 -> initscore=-1.374268
[LightGBM] [Info] Start training from score -1.374268




[LightGBM] [Info] Number of positive: 14284, number of negative: 56453
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004082 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 124
[LightGBM] [Info] Number of data points in the train set: 70737, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201931 -> initscore=-1.374268
[LightGBM] [Info] Start training from score -1.374268




[LightGBM] [Info] Number of positive: 14284, number of negative: 56453
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006748 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 124
[LightGBM] [Info] Number of data points in the train set: 70737, number of used features: 8
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201931 -> initscore=-1.374268
[LightGBM] [Info] Start training from score -1.374268




[LightGBM] [Info] Number of positive: 14284, number of negative: 56452
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005904 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 278
[LightGBM] [Info] Number of data points in the train set: 70736, number of used features: 85
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201934 -> initscore=-1.374251
[LightGBM] [Info] Start training from score -1.374251




[LightGBM] [Info] Number of positive: 14284, number of negative: 56453
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004949 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 278
[LightGBM] [Info] Number of data points in the train set: 70737, number of used features: 85
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201931 -> initscore=-1.374268
[LightGBM] [Info] Start training from score -1.374268




[LightGBM] [Info] Number of positive: 14284, number of negative: 56453
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005167 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 276
[LightGBM] [Info] Number of data points in the train set: 70737, number of used features: 84
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201931 -> initscore=-1.374268
[LightGBM] [Info] Start training from score -1.374268




[LightGBM] [Info] Number of positive: 14284, number of negative: 56453
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004647 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 278
[LightGBM] [Info] Number of data points in the train set: 70737, number of used features: 85
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201931 -> initscore=-1.374268
[LightGBM] [Info] Start training from score -1.374268




[LightGBM] [Info] Number of positive: 14284, number of negative: 56453
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006786 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 278
[LightGBM] [Info] Number of data points in the train set: 70737, number of used features: 85
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201931 -> initscore=-1.374268
[LightGBM] [Info] Start training from score -1.374268




In [21]:
df_cv_results.head(40)

          pipeline  fold  precision
0  RandomForest_SG     1   0.337449
1  RandomForest_SG     2   0.351598
2  RandomForest_SG     3   0.356808
3  RandomForest_SG     4   0.328947
4  RandomForest_SG     5   0.298343
5  RandomForest_GN     1   0.308036
6  RandomForest_GN     2   0.306818
7  RandomForest_GN     3   0.300752
8  RandomForest_GN     4    0.29234
9  RandomForest_GN     5   0.313856
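
Per-fold values are easier to compare once aggregated. A possible summary, sketched here on the ten rows shown above (in the notebook the same `groupby` could be applied to `df_cv_results` directly):

```python
import pandas as pd

# Per-fold precision values copied from the rows displayed above
cv_rows = pd.DataFrame({
    "pipeline": ["RandomForest_SG"] * 5 + ["RandomForest_GN"] * 5,
    "fold": list(range(1, 6)) * 2,
    "precision": [0.337449, 0.351598, 0.356808, 0.328947, 0.298343,
                  0.308036, 0.306818, 0.300752, 0.29234, 0.313856],
})

# Mean and standard deviation of precision per pipeline, best mean first
summary = (
    cv_rows.groupby("pipeline")["precision"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)
```

The standard deviation matters as much as the mean here: a pipeline whose precision swings widely across folds is a weaker candidate than its average suggests.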


In [22]:
# Build the pipeline for the best-performing classifier (LGBM)
pipeline_GNM = Pipeline([
    ("preprocessing", preprocessor_2),
    ("classifier", LGBMClassifier(random_state=42))
])

# Hyperparameter grid adapted to LightGBM
param_grid = {
    "classifier__n_estimators": [100, 300, 500],     # number of trees
    "classifier__learning_rate": [0.01, 0.05, 0.1],  # learning rate
    "classifier__num_leaves": [10, 15, 40],          # maximum number of leaves per tree
    "classifier__max_depth": [5, 10, 20],            # maximum depth
}

In [23]:
# Configure GridSearchCV with 4 folds and "precision" as the scoring metric
grid_search = GridSearchCV(
    pipeline_GNM,
    param_grid,
    cv=4,
    scoring="precision",
    n_jobs=-1 
)
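
Before launching the search it helps to know how many models it will train. A small sketch that counts the candidates with `ParameterGrid`, assuming the default `refit=True` (one extra fit of the best candidate on the full training set):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "classifier__n_estimators": [100, 300, 500],
    "classifier__learning_rate": [0.01, 0.05, 0.1],
    "classifier__num_leaves": [10, 15, 40],
    "classifier__max_depth": [5, 10, 20],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 combinations
n_folds = 4
n_fits = n_candidates * n_folds + 1            # plus one final refit on all training data
```

With 81 candidates and 4 folds the search runs 324 cross-validation fits, which is why `n_jobs=-1` is worth setting.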

In [24]:
# Run the grid search on the training data
grid_search.fit(x_train, y_train)

# Show the best parameters and the best precision obtained in cross-validation
print("Best parameters:", grid_search.best_params_)
print("Best precision:", grid_search.best_score_)

[LightGBM] [Info] Number of positive: 17855, number of negative: 70566
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008203 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 278
[LightGBM] [Info] Number of data points in the train set: 88421, number of used features: 85
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.201932 -> initscore=-1.374265
[LightGBM] [Info] Start training from score -1.374265
Best parameters: {'classifier__learning_rate': 0.01, 'classifier__max_depth': 5, 'classifier__n_estimators': 500, 'classifier__num_leaves': 40}
Best precision: 0.6145833333333333


## Saving the model

In [25]:
# Relative path from notebooks/ to the models/ folder
MODELS_DIR = Path.cwd().parent / "models"
MODELS_DIR.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist yet

mejor_modelo = grid_search.best_estimator_
joblib.dump(mejor_modelo, MODELS_DIR / "Classification_medical_no_show-LGBM.joblib")

['C:\\Users\\dicastaneda\\OneDrive - Grupo-exito.com\\Proyectos\\ProyectosDesarrollo\\medical-noshow-prediction\\models\\Classification_medical_no_show-LGBM.joblib']
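
Since the saved object is the whole pipeline, it can later be reloaded and applied to raw feature rows without repeating the preprocessing by hand. A minimal round-trip sketch, using a small stand-in pipeline and a temporary file (both are assumptions for illustration; the notebook saves the tuned LightGBM pipeline instead):

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on toy data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
stand_in = Pipeline([("scale", StandardScaler()), ("classifier", LogisticRegression())])
stand_in.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "stand_in.joblib"  # hypothetical file name
    joblib.dump(stand_in, model_path)
    restored = joblib.load(model_path)

# The restored pipeline reproduces the original predictions
same_predictions = np.array_equal(stand_in.predict(X_toy), restored.predict(X_toy))
```

A joblib file should be loaded with the same library versions used to create it, and only from a trusted source, since loading executes pickled code.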