# Classification with LightGBM (LGBM) (Core)


**Objective**

Implement a complete machine learning pipeline for a classification problem using LightGBM (LGBM). The emphasis is on Exploratory Data Analysis (EDA), preprocessing, model training, and hyperparameter optimization.

**Dataset:** Loan Prediction Dataset

**Dataset description:** The loan prediction dataset contains information about loan applicants, such as their income, credit history, and other personal characteristics. The goal is to predict whether an applicant's loan will be approved based on these factors.

**Instructions:**

**Part 1: Data Loading and Initial Exploration**

**Loading the dataset:**

* Load the dataset from Kaggle.

**Initial exploration:**

* Review the structure of the dataset.

* Describe the variables and their distributions.

* Identify and document missing values and outliers.

**Part 2: Exploratory Data Analysis (EDA)**

**Descriptive statistical analysis:**

* Compute basic descriptive statistics (mean, median, standard deviation, etc.).

* Analyze the distribution of the categorical variables.
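As a minimal sketch of the two bullets above, the snippet below summarizes one numeric and one categorical column. The toy values are made up for illustration; only the column names echo the loan dataset.

```python
import pandas as pd

# Toy data standing in for the loan dataset (values are invented).
toy = pd.DataFrame({
    "ApplicantIncome": [2500, 4000, 3200, 81000, 5100, 2900],
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban", "Urban", "Rural"],
})

# Numeric summary: the median resists the extreme income, the mean does not.
income_mean = toy["ApplicantIncome"].mean()
income_median = toy["ApplicantIncome"].median()
print(toy["ApplicantIncome"].describe())

# Categorical distribution as proportions rather than raw counts.
area_share = toy["Property_Area"].value_counts(normalize=True)
print(area_share)
```

A large gap between mean and median, as in this toy column, is a quick hint that the distribution is skewed and worth plotting.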

**Visualizations:**

* Create histograms and bar charts to understand the distribution of the variables.

* Create a heatmap to visualize the correlations between variables.

* Use scatter plots to identify possible relationships between variables.

**Missing values and outliers:**

* Detect and handle missing values.

* Identify and handle outliers.
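One common way to flag outliers is the interquartile-range (IQR) rule. The sketch below is an illustration under that assumption, not necessarily the method intended by the assignment; the helper name `iqr_outlier_mask` and the sample incomes are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Invented incomes with one extreme value.
incomes = pd.Series([2500, 2900, 3200, 4000, 5100, 81000], name="ApplicantIncome")
mask = iqr_outlier_mask(incomes)
print(incomes[mask])  # only the 81000 entry is flagged
```

Flagging is separate from handling: whether to cap, transform, or keep such rows depends on the model. Tree-based models such as LightGBM split on thresholds, so they tend to be less sensitive to extreme feature values than linear models.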

**Part 3: Data Preprocessing**

**Column transformation:**

* Encode categorical variables using One-Hot Encoding.

* Scale numeric features using StandardScaler.

**Splitting the dataset:**

* Split the dataset into training and test sets.

**Part 4: LightGBM (LGBM) Implementation**

**Model training:**

* Train an LGBM model with basic hyperparameters.

* Evaluate the model using performance metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

**Hyperparameter optimization:**

* Use GridSearchCV to optimize the hyperparameters of the LGBM model.

**Evaluating the optimized model:**

* Evaluate the performance of the optimized model and compare it with the initial model.
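Before running a grid search it helps to estimate its cost, since the number of model fits grows multiplicatively with each hyperparameter. The sketch below counts the fits for an illustrative grid; the specific values are assumptions chosen for the example.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: four hyperparameters, two candidate values each.
param_grid = {
    "num_leaves": [31, 50],
    "max_depth": [5, 10],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [50, 100],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 2 * 2 combinations
cv_folds = 5
# Each candidate is fit once per fold; GridSearchCV then refits the best
# candidate on the full training set (refit=True is the default).
total_fits = n_candidates * cv_folds + 1
print(n_candidates, total_fits)
```

Each of those fits emits its own LightGBM log block unless logging is silenced, which is why grid-search output can become very long.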

In [8]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
# Load the dataset and review its basic structure
path = "../data/train.csv"
df = pd.read_csv(path, sep=',')

# Review the basic structure of the dataset
print("First 5 rows of the dataset:")
print(df.head())

print("\nDataset information:")
df.info()  # Prints columns, dtypes and non-null counts itself; it returns None, so it is not wrapped in print().

print("\nBasic descriptive statistics:")
print(df.describe())  # Summary statistics for the numeric columns.

First 5 rows of the dataset:
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural     

In [5]:
df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [10]:
# Check for null values
print(df.isnull().sum())

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64


In [11]:
# Impute missing values by reassigning each column (no inplace=True)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())  # Mean imputation for LoanAmount
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])

# Check whether any null values remain
print(df.isnull().sum())



Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
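The imputation above computes the mode and mean on the full dataset, so statistics from rows that later land in the test set influence the training data. A leakage-free alternative, sketched below on invented toy data, is to learn the fill values from the training split only. The use of `SimpleImputer` here is a swapped-in technique, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with gaps in one numeric and one categorical column (invented values).
toy = pd.DataFrame({
    "LoanAmount": [128.0, 66.0, np.nan, 141.0, 120.0, np.nan, 95.0, 200.0],
    "Gender": pd.Series(
        ["Male", "Male", np.nan, "Female", "Male", "Female", np.nan, "Male"],
        dtype=object,
    ),
})
train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Mean for the numeric column, most frequent value for the categorical one.
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["LoanAmount"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["Gender"]),
])
imputer.fit(train)                       # fill values come from the training rows only
test_filled = imputer.transform(test)    # the same values are reused on the test rows

learned_mean = imputer.named_transformers_["num"].statistics_[0]
print(learned_mean, train["LoanAmount"].mean())
```

Placing such an imputer as the first step of a scikit-learn `Pipeline` also keeps cross-validation folds free of this leakage, because the imputer is refit inside each fold.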


In [14]:
# Detect duplicated rows
duplicados = df.duplicated()
# Print the number of duplicated rows
print(f"Number of duplicated rows: {duplicados.sum()}")
df.head()

Number of duplicated rows: 0


| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 146.412162 | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [16]:
# Check whether any NaN values remain
print(df.isnull().sum())


Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64


In [17]:
df.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [18]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Define the categorical and numeric columns
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']

# Build the preprocessor: scaling for numeric columns, one-hot encoding for categorical ones
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_columns),  # Scale numeric variables
        # One-hot encode categorical variables; categories unseen during fit are encoded as all zeros
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ])


In [19]:
# Separate the features from the target variable (Loan_Status)
X = df.drop(columns=['Loan_ID', 'Loan_Status'])  # Drop 'Loan_ID': it is an identifier with no predictive value
y = df['Loan_Status'].map({'Y': 1, 'N': 0})  # Map 'Y' to 1 and 'N' to 0

# Split the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")


Training set: (491, 11)
Test set: (123, 11)
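The split above is not stratified, so the class ratio can differ between the two sets. In this notebook's outputs the training set has 342 positives out of 491 (about 69.7%), while the test set has 80 out of 123 (about 65.0%). A minimal sketch of a stratified split on synthetic labels follows; passing `stratify=y` is a suggested variation, not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 class balance (not the real loan data).
y = np.array([1] * 70 + [0] * 30)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions equal in both splits.
_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```

With a few hundred rows and an imbalanced target, stratifying makes test metrics less dependent on the luck of the split. Note that changing the split would also change every metric reported in this notebook.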


In [20]:
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Build the pipeline: preprocessing followed by LightGBM
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LGBMClassifier(random_state=42))
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
# Note: y_pred holds hard 0/1 labels, so this value equals balanced accuracy
# (the mean of the per-class recalls). For a threshold-free ROC-AUC, pass
# pipeline.predict_proba(X_test)[:, 1] instead.
roc_auc = roc_auc_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"ROC-AUC: {roc_auc}")
print("\nClassification report:")
print(classification_report(y_test, y_pred))


[LightGBM] [Info] Number of positive: 342, number of negative: 149
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001276 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 388
[LightGBM] [Info] Number of data points in the train set: 491, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.696538 -> initscore=0.830864
[LightGBM] [Info] Start training from score 0.830864
Accuracy: 0.7723577235772358
ROC-AUC: 0.7120639534883721

Classification report:
              precision    recall  f1-score   support

           0       0.76      0.51      0.61        43
           1       0.78      0.91      0.84        80

    accuracy                           0.77       123
   macro avg       0.77      0.71      0.73       123
weighted avg       0.77      0.77      0.76       123
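The ROC-AUC reported above was computed from hard class predictions, which is why it matches the macro-average recall (0.71) in the report. ROC-AUC is designed to rank continuous scores, so it is usually computed from predicted probabilities. The sketch below shows the difference on a small hand-made example; the labels and scores are invented.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.6, 0.55, 0.8, 0.9, 0.3, 0.7, 0.4])  # predicted P(class 1)
y_label = (y_score >= 0.5).astype(int)                          # hard predictions

auc_from_scores = roc_auc_score(y_true, y_score)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_label)   # collapses to one threshold
bal_acc = balanced_accuracy_score(y_true, y_label)

print(auc_from_scores, auc_from_labels, bal_acc)
```

With a fitted scikit-learn pipeline, the scores would come from `pipeline.predict_proba(X_test)[:, 1]`. In this toy example the label-based value equals balanced accuracy and is lower than the score-based value, since thresholding discards the ranking information.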



In [21]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid to search
param_grid = {
    'classifier__num_leaves': [31, 50],
    'classifier__max_depth': [5, 10],
    'classifier__learning_rate': [0.01, 0.1],
    'classifier__n_estimators': [50, 100]
}

# GridSearchCV to find the best hyperparameters (5-fold CV, selected by accuracy)
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')

# Fit the grid search on the training set
grid_search.fit(X_train, y_train)

# Show the best parameters
print(f"Best parameters: {grid_search.best_params_}")

# Evaluate the optimized model
best_model = grid_search.best_estimator_
y_pred_optimized = best_model.predict(X_test)

accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
# Computed from hard labels, like the baseline value, so the two are comparable
roc_auc_optimized = roc_auc_score(y_test, y_pred_optimized)

print(f"Accuracy (optimized): {accuracy_optimized}")
print(f"ROC-AUC (optimized): {roc_auc_optimized}")
print("\nClassification report (optimized):")
print(classification_report(y_test, y_pred_optimized))


[LightGBM] [Info] Number of positive: 273, number of negative: 119
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000462 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 328
[LightGBM] [Info] Number of data points in the train set: 392, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.696429 -> initscore=0.830348
[LightGBM] [Info] Start training from score 0.830348
[LightGBM] [Info] Number of positive: 274, number of negative: 119
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000093 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 330
[LightGBM] [Info] Number of data points in the train set: 393, number of used features: 20
[LightGBM] [Info] [binary:BoostFro