![](https://www.dii.uchile.cl/wp-content/uploads/2021/06/Magi%CC%81ster-en-Ciencia-de-Datos.png)

# Project: Risk at Banco Giturra

**MDS7202: Scientific Programming Laboratory for Data Science**

### Teaching Staff:

- Instructors: Gabriel Iturra, Ignacio Meza De La Jara
- Teaching assistant: Sebastián Tinoco
- Assistants: Arturo Lazcano, Angelo Muñoz

_Please read the assignment instructions carefully before you start writing._

---

## Rules

- Due date: 19/12/2023
- **Groups of 2 people.**
- Any questions outside class hours go to the forum. Messages to the teaching staff will be answered there.
- Copying is strictly prohibited.
- You may use any course material you find useful.

---

### Team members:
- Nicolás Acevedo
- Fabiola Pizarro

### **GitHub repository link:** `https://github.com/nicoacevedor/MDS7202`


# Problem Statement


![](https://www.diarioeldia.cl/u/fotografias/fotosnoticias/2019/11/8/67218.jpg)


**Giturra**, a shrewd and ambitious banker, founded his own bank with the goal of making enormous profits. However, his reputation was tarnished by the usurious interest rates he imposed on his customers. As the bank grew, Giturra faced a rising number of unpaid loans, which threatened both his business and his prestige.

To address this challenge, Giturra recognized the need to reduce lending risk and improve the quality of the loans granted. He decided to rely on data science and credit risk analysis, and hired a team of experts to develop a predictive credit risk model.

Note that the models requested by the banker must be interpretable, since this allows the team to understand and explain how credit decisions are made. Clear visualizations and detailed explanations should make it possible to identify the most relevant features, analyze the distribution of variable importance, and assess whether the models are consistent with the business.

To this end, Giturra asks you to build a risk model and makes available a wide range of variables about his users, such as credit histories, income, and other relevant financial factors, to assess the probability that a customer defaults. With this information, Giturra will be able to make better-informed lending decisions and offer more favorable terms to customers with a lower risk of default.


### Introduction

blablabla

### Library imports

In [71]:
from lightgbm import LGBMClassifier
import optuna
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

### General project settings

In [3]:
random_state = 42
train_size = 0.7
val_size = 0.2

### Exploratory data analysis

In [4]:
df_raw = pd.read_parquet("dataset.pq")
print(df_raw.info())
df_raw.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12500 entries, 0 to 12499
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               12500 non-null  object 
 1   age                       12500 non-null  float64
 2   occupation                12500 non-null  object 
 3   annual_income             12500 non-null  float64
 4   monthly_inhand_salary     10584 non-null  float64
 5   num_bank_accounts         12500 non-null  int64  
 6   num_credit_card           12500 non-null  int64  
 7   interest_rate             12500 non-null  int64  
 8   num_of_loan               12500 non-null  float64
 9   delay_from_due_date       12500 non-null  int64  
 10  num_of_delayed_payment    11660 non-null  float64
 11  changed_credit_limit      12246 non-null  float64
 12  num_credit_inquiries      12243 non-null  float64
 13  outstanding_debt          12500 non-null  float64
 14  credit

| | customer_id | age | occupation | annual_income | monthly_inhand_salary | num_bank_accounts | num_credit_card | interest_rate | num_of_loan | delay_from_due_date | ... | num_credit_inquiries | outstanding_debt | credit_utilization_ratio | credit_history_age | payment_of_min_amount | total_emi_per_month | amount_invested_monthly | payment_behaviour | monthly_balance | credit_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CUS_0xd40 | 23.0 | Scientist | 19114.12 | 1824.843333 | 3 | 4 | 3 | 4.0 | 3 | ... | 4.0 | 809.98 | 23.933795 |  | No | 49.574949 | 24.785217 | High_spent_Medium_value_payments | 358.124168 | 0 |
| 1 | CUS_0x21b1 | 28.0 | Teacher | 34847.84 | 3037.986667 | 2 | 4 | 6 | 1.0 | 3 | ... | 2.0 | 605.03 | 32.933856 | 27.0 | No | 18.816215 | 218.904344 | Low_spent_Small_value_payments | 356.078109 | 0 |
| 2 | CUS_0x2dbc | 34.0 | Engineer | 143162.64 | 12187.22 | 1 | 5 | 8 | 3.0 | 8 | ... | 3.0 | 1303.01 | 38.374753 | 18.0 | No | 246.992319 | 10000.0 | High_spent_Small_value_payments | 895.494583 | 0 |
| 3 | CUS_0xb891 | 55.0 | Entrepreneur | 30689.89 | 2612.490833 | 2 | 5 | 4 | -100.0 | 4 | ... | 4.0 | 632.46 | 27.332515 | 17.0 | No | 16.415452 | 125.617251 | High_spent_Small_value_payments | 379.216381 | 0 |
| 4 | CUS_0x1cdb | 21.0 | Developer | 35547.71 | 2853.309167 | 7 | 5 | 5 | -100.0 | 1 | ... | 4.0 | 943.86 | 25.862922 | 31.0 | Yes | 0.0 | 181.330901 | High_spent_Small_value_payments | 364.000016 | 0 |


In [48]:
df_raw['credit_score'].value_counts(normalize=True)

credit_score
0    0.71184
1    0.28816
Name: proportion, dtype: float64

### 3. Data Preparation

In [5]:
# Cleaning
df = df_raw.copy()
# Keep ages between 14 and 100
df = df[(df['age'] >= 14) & (df['age'] <= 100)]
# Cap the interest rate at 100%
df = df[df['interest_rate'] <= 100]
# Drop rows with negative values
df = df[df['num_of_loan'] >= 0]
df = df[df['num_bank_accounts'] >= 0]
df = df[df['delay_from_due_date'] >= 0]
# Comparisons with NaN are False, so rows where this column is missing are dropped too
df = df[df['num_of_delayed_payment'] >= 0]
# Drop the identifier column, which is presumably not useful for the project
df = df.drop(columns=['customer_id'])

In [6]:
df.shape

(10528, 21)

#### 3.1 Preprocessing with `ColumnTransformer`

In [7]:
# Cast columns to the intended dtypes
df['age'] = df['age'].astype('int64')
df['num_of_loan'] = df['num_of_loan'].astype('int64')
df['delay_from_due_date'] = df['delay_from_due_date'].astype('float64')

In [31]:
# ONLY FOR TRYING OUT THE MODELS
# REMOVE BEFORE MOVING ON
df = df.dropna()
df

| | age | occupation | annual_income | monthly_inhand_salary | num_bank_accounts | num_credit_card | interest_rate | num_of_loan | delay_from_due_date | num_of_delayed_payment | ... | num_credit_inquiries | outstanding_debt | credit_utilization_ratio | credit_history_age | payment_of_min_amount | total_emi_per_month | amount_invested_monthly | payment_behaviour | monthly_balance | credit_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 28 | Teacher | 34847.84 | 3037.986667 | 2 | 4 | 6 | 1 | 3.0 | 4.0 | ... | 2.0 | 605.03 | 32.933856 | 27.0 | No | 18.816215 | 218.904344 | Low_spent_Small_value_payments | 356.078109 | 0 |
| 2 | 34 | Engineer | 143162.64 | 12187.220000 | 1 | 5 | 8 | 3 | 8.0 | 6.0 | ... | 3.0 | 1303.01 | 38.374753 | 18.0 | No | 246.992319 | 10000.000000 | High_spent_Small_value_payments | 895.494583 | 0 |
| 6 | 34 | Lawyer | 131313.40 | 10469.207759 | 0 | 1 | 8 | 2 | 0.0 | 2.0 | ... | 4.0 | 352.16 | 29.187913 | 31.0 | No | 911.220179 | 870.522382 | Low_spent_Medium_value_payments | 396.111346 | 0 |
| 7 | 30 | Media_Manager | 34081.38 | 2611.115000 | 8 | 7 | 15 | 3 | 30.0 | 14.0 | ... | 9.0 | 1704.18 | 33.823488 | 15.0 | Yes | 70.478333 | 29.326364 | High_spent_Medium_value_payments | 411.306804 | 1 |
| 8 | 24 | Doctor | 114838.41 | 9843.867500 | 2 | 5 | 7 | 3 | 11.0 | 11.0 | ... | 8.0 | 1377.74 | 27.813354 | 21.0 | No | 226.892792 | 254.571767 | High_spent_Large_value_payments | 742.922191 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12282 | 32 | Architect | 175572.44 | 14422.036667 | 5 | 4 | 1 | 3 | 10.0 | 2.0 | ... | 1.0 | 388.99 | 43.306156 | 32.0 | No | 290.106973 | 97.860362 | High_spent_Large_value_payments | 1294.236331 | 0 |
| 12283 | 40 | Scientist | 120009.32 | 10242.776667 | 5 | 1 | 7 | 4 | 14.0 | 1.0 | ... | 0.0 | 446.51 | 36.839608 | 29.0 | No | 230.528352 | 197.626771 | High_spent_Large_value_payments | 836.122544 | 0 |
| 12284 | 24 | Lawyer | 59868.93 | 5111.077500 | 4 | 7 | 9 | 4 | 8.0 | 11.0 | ... | 4.0 | 417.72 | 24.803887 | 29.0 | No | 104.622038 | 10000.000000 | Low_spent_Small_value_payments | 80.190256 | 0 |
| 12286 | 29 | Entrepreneur | 34599.94 | 2942.328333 | 7 | 6 | 3 | 0 | 22.0 | 8.0 | ... | 1.0 | 942.59 | 33.128609 | 27.0 | No | 0.000000 | 165.397552 | High_spent_Small_value_payments | 388.835282 | 0 |


In [32]:
# Split the DataFrame into features (X) and target (y)
X = df.drop(["credit_score"], axis=1)
y = df["credit_score"]

# Define the categorical and numeric columns
categorical_cols = ['occupation', 'payment_of_min_amount', 'payment_behaviour']
# A list comprehension keeps the original column order (a set difference does not)
numeric_cols = [col for col in X.columns if col not in categorical_cols]

# ColumnTransformer
scaler = ColumnTransformer([
    ("NumericScaler", MinMaxScaler(), numeric_cols),
    ("CategoricalEncoder", OneHotEncoder(sparse_output=False), categorical_cols)
], remainder="passthrough")
scaler.set_output(transform='pandas')

# Try out the transformations
X_preprocessed = scaler.fit_transform(X)

#### 3.2 Holdout 

In [67]:
def split_data(X, y, train_size, val_size):
    X_train, X_med, y_train, y_med = train_test_split(
        X, y, train_size=train_size, random_state=random_state
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_med, y_med, train_size=val_size / (1 - train_size), random_state=random_state
    )
    return (
        X_train,
        X_val,
        X_test,
        y_train,
        y_val,
        y_test
    )

X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y, train_size=train_size, val_size=val_size)
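
The second call to `train_test_split` receives `val_size / (1 - train_size)` because it only sees the remainder left by the first split. With the settings above, 0.2 / 0.3 ≈ 0.667 of the remaining 30% goes to validation, which gives roughly a 70/20/10 split overall. A minimal sketch to check this on synthetic data (the arrays `X_demo` and `y_demo` and their size of 1,000 rows are assumptions made only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

random_state = 42
train_size = 0.7
val_size = 0.2

# Hypothetical data: 1,000 rows, used only to check the split sizes
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

# First split: 70% train, 30% remainder
X_train_demo, X_rest_demo, y_train_demo, y_rest_demo = train_test_split(
    X_demo, y_demo, train_size=train_size, random_state=random_state
)
# Second split: validation takes val_size / (1 - train_size) of the remainder
X_val_demo, X_test_demo, y_val_demo, y_test_demo = train_test_split(
    X_rest_demo,
    y_rest_demo,
    train_size=val_size / (1 - train_size),
    random_state=random_state,
)

split_sizes = (len(X_train_demo), len(X_val_demo), len(X_test_demo))
print(split_sizes)
```

Because the split sizes are obtained by rounding, the validation and test counts may differ by one row from the nominal 20% and 10%.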

#### 3.3 Missing values

In [10]:
porcentajes_nulos = (df.isnull().sum() / len(df))
porcentajes_nulos = porcentajes_nulos[porcentajes_nulos > 0].sort_values(ascending=False)
# The values are fractions of rows (0 to 1), not percentages
print("Columns with null values [fraction of rows]")
print("-------------------------------------------")
porcentajes_nulos

Columns with null values [fraction of rows]
-------------------------------------------


monthly_inhand_salary      0.152451
credit_history_age         0.088526
amount_invested_monthly    0.046922
monthly_balance            0.028210
num_credit_inquiries       0.020802
changed_credit_limit       0.020327
dtype: float64

### 4 Baseline

In [78]:
def create_pipeline(scaler, model):
    pipeline = Pipeline([
        ("scaler", scaler),
        ("model", model)
    ])
    return pipeline

all_models = {
    "dummy": DummyClassifier(strategy="stratified", random_state=random_state),
    "logistic_reg": LogisticRegression(random_state=random_state),
    "k_neighbors": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=random_state),
    "svc": SVC(random_state=random_state),
    "random_forest": RandomForestClassifier(random_state=random_state),
    "lgbm": LGBMClassifier(random_state=random_state, verbose=-1),
    "xgb": XGBClassifier(random_state=random_state),
}

### Model training

In [79]:
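for name, model in list(all_models.items()):
    print(name)
    pipeline = create_pipeline(scaler, model)
    # Keep the fitted pipeline under the same key so that the evaluation loop
    # calls predict() on preprocessing + model, not on a bare estimator
    all_models[name] = pipeline.fit(X_train, y_train)
    display(all_models[name])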

dummy


logistic_reg


k_neighbors


decision_tree


svc


random_forest


lgbm


xgb


### Classifier evaluation

In [55]:
recall_by_model = {}

for name, pipeline in all_models.items():
    y_pred = pipeline.predict(X_test)
    cr = classification_report(y_test, y_pred, output_dict=True)
    recall_by_model[name] = cr['1']['recall']
    print(f"Classification Report '{name}'")
    print(classification_report(y_test, y_pred), '\n')

Classification Report 'dummy'
              precision    recall  f1-score   support

           0       0.71      0.72      0.71       517
           1       0.28      0.28      0.28       206

    accuracy                           0.59       723
   macro avg       0.50      0.50      0.50       723
weighted avg       0.59      0.59      0.59       723
 

Classification Report 'logistic_reg'
              precision    recall  f1-score   support

           0       0.79      0.92      0.85       517
           1       0.66      0.40      0.50       206

    accuracy                           0.77       723
   macro avg       0.72      0.66      0.67       723
weighted avg       0.75      0.77      0.75       723
 

Classification Report 'k_neighbors'
              precision    recall  f1-score   support

           0       0.78      0.85      0.82       517
           1       0.52      0.41      0.46       206

    accuracy                           0.72       723
   macro avg       0.

In [65]:
recall_df = pd.DataFrame(recall_by_model.items(), columns=['Model', 'Recall'])
recall_df.sort_values('Recall', ascending=False, inplace=True)
recall_df

| | Model | Recall |
| --- | --- | --- |
| 7 | xgb | 0.558252 |
| 6 | lgbm | 0.553398 |
| 3 | decision_tree | 0.543689 |
| 5 | random_forest | 0.524272 |
| 2 | k_neighbors | 0.412621 |
| 1 | logistic_reg | 0.398058 |
| 4 | svc | 0.378641 |
| 0 | dummy | 0.276699 |
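
The models are ranked by the recall of class `1`, which is the fraction of actual positives that the model labels as positive: TP / (TP + FN). A small sketch on made-up labels (the lists `y_true_demo` and `y_pred_demo` are illustrative, not project data) shows that the value read from `classification_report` matches this ratio:

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Made-up labels: 4 actual positives, of which the predictions recover 3
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
manual_recall = tp / (tp + fn)

# Same quantity, read from the report dictionary and from recall_score
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)
report_recall = report["1"]["recall"]
sklearn_recall = recall_score(y_true_demo, y_pred_demo, average="binary")

print(manual_recall, report_recall, sklearn_recall)
```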


### Model optimization

In [89]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(
        trial: optuna.Trial, 
        model,
        data, 
        target, 
        scaler, 
        random_seed=random_state
):
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1),
        'n_estimators': trial.suggest_int('n_estimators', 50, 100),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'max_leaves': trial.suggest_int('max_leaves', 0, 100),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 1),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 1),
        'random_state': random_seed,
    }
    # `model` is a class, not an instance, so the check must use issubclass
    if issubclass(model, LGBMClassifier):
        params['verbose'] = -1
    X_train, X_val, _, y_train, y_val, _ = split_data(data, target, train_size=0.7, val_size=0.2)
    instantiated_model = model(**params)
    pipeline = create_pipeline(scaler, instantiated_model)
    pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_val)
    return recall_score(y_val, y_pred, average="binary")


#### Optimizing `LGBMClassifier`

In [86]:
study_lgbm = optuna.create_study(
    sampler=optuna.samplers.TPESampler(seed=random_state),
    direction='maximize',
    study_name="lgbm_opt"
)  

study_lgbm.optimize(
    lambda trial: objective(trial, LGBMClassifier, X, y, scaler, random_state),
    timeout=5*60,
    show_progress_bar=True
) 

Best trial: 296. Best value: 0.555556:  100%|██████████| 05:00/05:00


In [92]:
print(f"Best recall: {study_lgbm.best_value:.3f}")

print("\nBest parameters: \n--------------------")
for param, value in study_lgbm.best_params.items():
    print(f"{param}: {value:.3f}")

Best recall: 0.556

Best parameters: 
--------------------
learning_rate: 0.082
n_estimators: 94.000
max_depth: 9.000
max_leaves: 50.000
min_child_weight: 1.000
reg_alpha: 0.145
reg_lambda: 0.002


#### Optimizing `XGBClassifier`

In [90]:
study_xgb = optuna.create_study(
    sampler=optuna.samplers.TPESampler(seed=random_state),
    direction='maximize',
    study_name="xgb_opt"
)  

study_xgb.optimize(
    lambda trial: objective(trial, XGBClassifier, X, y, scaler, random_state),
    timeout=5*60,
    show_progress_bar=True
) 

Best trial: 1148. Best value: 0.562358:  100%|██████████| 05:00/05:00


In [93]:
print(f"Best recall: {study_xgb.best_value:.3f}")

print("\nBest parameters: \n--------------------")
for param, value in study_xgb.best_params.items():
    print(f"{param}: {value:.3f}")

Best recall: 0.562

Best parameters: 
--------------------
learning_rate: 0.081
n_estimators: 98.000
max_depth: 9.000
max_leaves: 70.000
min_child_weight: 1.000
reg_alpha: 0.832
reg_lambda: 0.262
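
Once a study finishes, a common next step is to refit a model with the best parameters on the training and validation data together and report recall on the untouched test split, since the validation split has already been used to pick hyperparameters. The sketch below is only an outline under stated assumptions: it uses synthetic data from `make_classification`, substitutes scikit-learn's `GradientBoostingClassifier` for the boosted models tuned above, and the `best_params_demo` dictionary is a hypothetical stand-in for `study_xgb.best_params`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

random_state = 42

# Synthetic, imbalanced data standing in for the project's features and target
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=random_state
)

# Hold out 10% as the final test set; the rest plays the role of train + validation
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=random_state
)

# Hypothetical stand-in for study.best_params (values are illustrative)
best_params_demo = {"learning_rate": 0.08, "n_estimators": 98, "max_depth": 9}

# Refit once on train + validation with the chosen hyperparameters
final_model = GradientBoostingClassifier(random_state=random_state, **best_params_demo)
final_model.fit(X_trainval, y_trainval)

# Report recall of the positive class on data never used for tuning
test_recall = recall_score(y_test_demo, final_model.predict(X_test_demo), average="binary")
print(f"Test recall: {test_recall:.3f}")
```

In the project itself, the same pattern would wrap the tuned classifier with `create_pipeline(scaler, ...)`, so that preprocessing is fitted on the same rows as the model.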
