# DmX Challenge: Predictive Credit Risk Modeling Using Customer Credit Scores and Phone Footprints


In [1]:
# ──────────────────────────────────────────────────────────────────────────
# Script Name : training_etl_prev.py
# Author      : Dilan Castañeda, Paulo Ibarra, Bruno Díaz, Fatima Quintana
# Created On  : October 03, 2024
# Last Update : October 03, 2024
# Version     : 1.0.0
# Description : Credit risk modeling using bureau reports and phone data to predict client default probability.
#──────────────────────────────────────────────────────────────────────────

## Overview

---
Instituto Tecnológico y de Estudios Superiores de Monterrey

Analítica de datos y herramientas de inteligencia artificial TI3001C.103

Professor: Enrique Ricardo García Hernández

Team 2:
*   Dilan González Castañeda A00831905
*   Fátima Pamela Ramón Quintana A00833076
*   Paulo Ibarra A01632632
*   Bruno Díaz Flores A0082455


---

**Overview**

Credit risk modeling using bureau reports and phone usage data to predict client default probability for loan approval decisions.

**Database Source**

The database for this project was provided by DMX, containing comprehensive credit bureau reports and detailed phone usage data for each client.

**Key Components**

1. Credit Bureau Data:

* Credit history
* Loan inquiries
* Payment behaviors
* Current debt levels


2. Phone Usage Data:

* Subscription type
* Usage patterns
* Payment history
* Device information


3.  Target Variable:

* Client default status (binary: defaulted or not)



**Project Goals**

* Develop a predictive model to assess the likelihood of client loan default
* Optimize loan approval decisions based on calculated risk
* Enhance the overall loan portfolio quality by minimizing potential defaults

**Methodology**

* Utilize machine learning techniques to analyze historical data
* Incorporate both traditional credit metrics and alternative data (phone usage)
* Create a robust model that can handle various data types and complex relationships

**Expected Outcome**

A reliable credit risk assessment tool that can:

* Accurately predict client default probability
* Assist in making informed loan approval decisions
* Potentially increase approval rates for creditworthy clients while minimizing risk

This project aims to leverage the unique combination of traditional credit data and alternative phone usage data provided by DMX to create a more comprehensive and accurate credit risk assessment model.

### 1. ETL for Model Training

#### Purpose
- Prepare historical data for model development and training.

#### Process
1. **Extract**:
   - Pull historical data from DMX database (credit bureau reports and phone usage data).
   - Include all available features and the target variable (default status).

2. **Transform**:
   - Handle missing values, outliers, and data quality issues.
   - Perform feature engineering (e.g., creating interaction terms, deriving new features).
   - Encode categorical variables.
   - Normalize or standardize numerical features.

3. **Load**:
   - Store the processed data in a format suitable for model training (e.g., parquet files, a data warehouse).
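
The three steps above can be sketched on a toy table. This is only an illustration under stated assumptions: the column names `ingreso` and `plan`, the values, and the choice of median imputation, z-score standardization and one-hot encoding are all hypothetical and are not taken from the DMX data or from the pipeline used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the historical extract; names and values are invented
raw = pd.DataFrame({
    "ingreso": [1000.0, None, 3000.0, 2000.0],
    "plan": ["PREPAGO", "POSPAGO", None, "PREPAGO"],
    "malo": [0, 1, 0, 1],
})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Missing values: median for the numeric column, explicit level for the categorical one
    out["ingreso"] = out["ingreso"].fillna(out["ingreso"].median())
    out["plan"] = out["plan"].fillna("MISSING")
    # Standardize the numeric feature (zero mean, unit sample variance)
    out["ingreso"] = (out["ingreso"] - out["ingreso"].mean()) / out["ingreso"].std()
    # Encode the categorical feature as one-hot indicator columns
    return pd.get_dummies(out, columns=["plan"], dtype=int)

processed = transform(raw)

# Load step: serialize to CSV text here; writing parquet would need an extra engine
csv_text = processed.to_csv(index=False)
print(csv_text)
```

In a real training/scoring split, the imputation and scaling statistics would be computed on the training data only and reused when scoring new records.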

#### Benefits
- Can perform extensive data cleaning and feature engineering.
- Allows for complex transformations that might be computationally expensive.
- Can use the full historical dataset for better feature creation and selection.

By implementing separate ETL processes for training and scoring, we can optimize our credit risk model for both comprehensive learning from historical data and efficient, consistent scoring of new records.

## ETL

### Extract

In [2]:
%pip install unidecode

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
from pandas import DataFrame
import numpy as np
from unidecode import unidecode
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from typing import Dict, List

In [4]:
# Path to a local copy of the file inside the repository
file_name = r'C:\Users\dilan\Documents\Github\DataAnalysis_and_AI\period_2\Resources\Challenge\base_Reto.csv'
# URL of the same file in the GitHub repository
url = 'https://raw.githubusercontent.com/magotronico/DataAnalysis_and_AI/main/period_2/Resources/Challenge/base_Reto.csv'

# Load the original database (csv); pass file_name instead of url to read the local copy
df = pd.read_csv(url, encoding='latin-1')

# Display the first 3 rows of the dataframe
df.head(3)
# Display a summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25101 entries, 0 to 25100
Data columns (total 41 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Solicitud_id                             25101 non-null  float64
 1   Aprobado                                 25101 non-null  int64  
 2   Hit_Buro_Huella                          25101 non-null  int64  
 3   Malo                                     25101 non-null  int64  
 4   Num_IQ_U3M_PL_Financieras                25101 non-null  int64  
 5   Edad_cliente                             25101 non-null  int64  
 6   Porcentaje_cuentas_abiertas              25101 non-null  float64
 7   Num_IQ_U3M                               25101 non-null  int64  
 8   Num_IQ_U3M_TDC_Banco                     25101 non-null  int64  
 9   MaxMOP_U3M                               25101 non-null  int64  
 10  Saldo_actual_prest_personales            25101

In [5]:
# List of unique values per column
temp_df = pd.DataFrame(columns=['Column', 'dtype', 'Unique Values', 'nan', 'size'])

for column in df.columns:
    unique_values = [df[column].unique()]  # Ensure unique values are in a list
    temp_df = pd.concat([temp_df, pd.DataFrame({'Column': [column], 'dtype': [df[column].dtype], 'Unique Values': unique_values})], ignore_index=True)
    temp_df.loc[temp_df['Column'] == column, 'nan'] = df[column].isnull().sum()
    temp_df.loc[temp_df['Column'] == column, 'size'] = df[column].count()

# Display the resulting dataframe
temp_df

Unnamed: 0,Column,dtype,Unique Values,nan,size
0,Solicitud_id,float64,"[1993059.0, 1993154.0, 1993230.0, 1993287.0, 1...",0,25101
1,Aprobado,int64,"[1, 0]",0,25101
2,Hit_Buro_Huella,int64,"[11, 10, 1, 0]",0,25101
3,Malo,int64,"[0, 1, -1]",0,25101
4,Num_IQ_U3M_PL_Financieras,int64,"[0, 1, 4, 2, -1, 3, 6, -2, 5, 7, 8, 10, 9, 18]",0,25101
5,Edad_cliente,int64,"[61, 67, 68, 64, 54, 63, 77, 40, 47, 72, 73, 4...",0,25101
6,Porcentaje_cuentas_abiertas,float64,"[0.3, 0.333333333, 0.073170732, 0.75, 0.147286...",0,25101
7,Num_IQ_U3M,int64,"[1, 3, 7, 12, 6, 5, 8, 4, 10, 18, 11, -1, 2, 0...",0,25101
8,Num_IQ_U3M_TDC_Banco,int64,"[0, 1, 2, -1, 5, 4, 3, -2, 6, 7]",0,25101
9,MaxMOP_U3M,int64,"[2, 9, 1, 7, -1, 0, 6, 3, 4, 5]",0,25101
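
The same kind of per-column summary can be built without concatenating inside a loop. The sketch below is an alternative, not the cell used above; it runs on an invented toy frame, and it reports the number of unique values (`n_unique`, a name introduced here) instead of listing them.

```python
import pandas as pd

# Toy frame standing in for `df`; columns and values are invented
toy = pd.DataFrame({"a": [1, 1, 2, None], "b": ["x", "y", "x", "x"]})

summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),   # NaN is not counted as a value
    "nan": toy.isna().sum(),     # missing entries per column
    "size": toy.count(),         # non-null entries, as in the loop above
}).rename_axis("Column").reset_index()

print(summary)
```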


### Transform


#### Standardize column names

In [6]:
# Create a copy of the original dataframe
df_copy = df.copy(deep=True)
# Standardize column names (lowercase and without accents)
df_copy.columns = [unidecode(str(col)).lower() for col in df.columns]
# Replace 'contabilidad' with 'contactabilidad' in column names
df_copy.columns = df_copy.columns.str.replace('contabilidad', 'contactabilidad')

#### Cleaning columns (variables)

In [7]:
# Drop 'solicitud_id' column
df_copy.drop(columns=['solicitud_id'], inplace=True)

In [8]:
# Clean entidad_federativa column
print(len(df_copy['entidad_federativa'].unique()))
print(df_copy['entidad_federativa'].value_counts().sort_index())

36
entidad_federativa
AGS      368
BCN      820
BCS      268
CAM      425
CDM      149
CDMX    1596
CHI     1663
CHS      491
COA     1593
COL      195
DGO      537
EM      2185
GRO      540
GTO      710
HGO      215
JAL     1070
MIC       37
MICH     550
MOR      445
NAY      269
NL      2303
OAX      458
PUE      796
QR       205
QRO      446
SIN     1061
SLP      399
SON     1419
TAB      229
TAM      884
TLA       14
TLAX     210
VER     1650
YUC      549
ZAC      216
Name: count, dtype: int64


In [9]:
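# Replace specific state codes with their standardized names
# NOTE: 'QR' usually abbreviates Quintana Roo while 'QRO' is Querétaro, so this
# mapping may merge two different states; verify it against the data dictionary
df_copy['entidad_federativa'] = df_copy['entidad_federativa'].replace({
    'CDM': 'CDMX',
    'MIC': 'MICH',
    'QR': 'QRO',
    'TLA': 'TLAX'
})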

# Verify the replacements
print(len(df_copy['entidad_federativa'].unique()))
print(df_copy['entidad_federativa'].value_counts().sort_index())



32
entidad_federativa
AGS      368
BCN      820
BCS      268
CAM      425
CDMX    1745
CHI     1663
CHS      491
COA     1593
COL      195
DGO      537
EM      2185
GRO      540
GTO      710
HGO      215
JAL     1070
MICH     587
MOR      445
NAY      269
NL      2303
OAX      458
PUE      796
QRO      651
SIN     1061
SLP      399
SON     1419
TAB      229
TAM      884
TLAX     224
VER     1650
YUC      549
ZAC      216
Name: count, dtype: int64


In [10]:
# Encoding of the 'tipo_suscripcion' column: inspect its raw values first
print(df_copy['tipo_suscripcion'].value_counts())

tipo_suscripcion
PREPAGO    13744
MIXTO       1625
POSPAGO     1575
      .      163
Name: count, dtype: int64


In [11]:
# Replace non-valid values with np.nan
df_copy['tipo_suscripcion'] = df_copy['tipo_suscripcion'].apply(lambda x: x if x in ['PREPAGO', 'MIXTO', 'POSPAGO'] else np.nan)

# Encode the categories as ordinal codes (NaN values are passed through)
ordinal_encoder = OrdinalEncoder()
df_copy['tipo_suscripcion'] = ordinal_encoder.fit_transform(df_copy[['tipo_suscripcion']])

# Final verification of the column
print(df_copy['tipo_suscripcion'].value_counts()) # 0: MIXTO, 1: POSPAGO, 2: PREPAGO (alphabetical order)

tipo_suscripcion
2.0    13744
0.0     1625
1.0     1575
Name: count, dtype: int64
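
With its default settings, `OrdinalEncoder` assigns codes in sorted category order, which is why MIXTO maps to 0 above. If a specific order is wanted, it can be stated explicitly. The sketch below uses `pd.Categorical` instead of `OrdinalEncoder` to show the idea; the order PREPAGO < MIXTO < POSPAGO is an assumption made only for illustration, since the notebook does not define a business ordering.

```python
import numpy as np
import pandas as pd

# Toy values standing in for the cleaned 'tipo_suscripcion' column
plans = pd.Series(["PREPAGO", "MIXTO", np.nan, "POSPAGO", "PREPAGO"])

# Hypothetical ordering, chosen only for this example
order = ["PREPAGO", "MIXTO", "POSPAGO"]
cat = pd.Categorical(plans, categories=order, ordered=True)

# Categorical codes use -1 for missing values; turn those back into NaN
codes = pd.Series(cat.codes, index=plans.index).astype("float64")
codes = codes.where(codes >= 0)

# Code-to-label mapping, handy for documenting the encoding
mapping = dict(enumerate(cat.categories))
print(codes.tolist(), mapping)
```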


#### Categorize DB based on hit_buro_huella (ScoreCards)

In [12]:
#  Split hit_buro_huella into hit_buro and hit_huella
df_copy['hit_buro_huella'] = df_copy['hit_buro_huella'].apply(lambda x: f'{x:02d}')
df_copy['hit_buro'] = df_copy['hit_buro_huella'].str[0].astype(int)
df_copy['hit_huella'] = df_copy['hit_buro_huella'].str[1].astype(int)
df_copy.insert(3, 'hit_buro', df_copy.pop('hit_buro'))
df_copy.insert(4, 'hit_huella', df_copy.pop('hit_huella'))
df_copy.drop('hit_buro_huella', axis=1, inplace=True)

# Grouping info by hit_buro and hit_huella into 4 categories: {1: 00, 2: 01, 3: 10, 4: 11}
df_copy['hit_group'] = df_copy['hit_buro'].astype(str) + df_copy['hit_huella'].astype(str)
df_copy['hit_group'] = df_copy['hit_group'].map({'00': 1, '01': 2, '10': 3, '11': 4})
df_copy.insert(4, 'hit_group', df_copy.pop('hit_group'))

# Drop 'hit_buro' and 'hit_huella' columns
df_copy.drop(columns=['hit_buro', 'hit_huella'], inplace=True)
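
The mapping from the two-digit flag to the four groups can also be expressed with integer arithmetic, without the string round trip. This is a minimal sketch on plain integers; `split_hit` and `hit_group` are helper names introduced here, not functions from the notebook.

```python
from typing import Tuple

def split_hit(code: int) -> Tuple[int, int]:
    """Split a flag such as 0, 1, 10 or 11 into (hit_buro, hit_huella)."""
    return divmod(code, 10)

def hit_group(code: int) -> int:
    """Map 00 -> 1, 01 -> 2, 10 -> 3, 11 -> 4."""
    buro, huella = split_hit(code)
    return 1 + 2 * buro + huella

groups = {code: hit_group(code) for code in (0, 1, 10, 11)}
print(groups)
```

On a DataFrame column the same arithmetic could be applied element-wise, for example with `Series.map(hit_group)`.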

<b>Create 4 DataFrame (1 per ScoreCard)</b>

In [13]:
# Creating separate dataframes for each 'hit_group' value
scorecard_1 = df_copy[df_copy['hit_group'] == 1].copy(deep=True)
scorecard_2 = df_copy[df_copy['hit_group'] == 2].copy(deep=True)
scorecard_3 = df_copy[df_copy['hit_group'] == 3].copy(deep=True)
scorecard_4 = df_copy[df_copy['hit_group'] == 4].copy(deep=True)

<b>Leave only the variables that can be used in each score card:</b>
1. Demographic
2. Demographic and footprint
3. Demographic and credit
4. Demographic, credit and footprint

In [14]:
extra = [
    'aprobado', 
    'malo', 
    'hit_group'
]

demografica = [
    'edad_cliente', 
    'entidad_federativa', 
    'ingreso_bruto'
]

buro = [
    'num_iq_u3m_pl_financieras',
    'porcentaje_cuentas_abiertas',
    'num_iq_u3m',
    'num_iq_u3m_tdc_banco',
    'maxmop_u3m',
    'saldo_actual_prest_personales',
    'num_iq_u3m_prest_personales',
    'numero_ctas_atraso_prest_personales_u3m',
    'numero_ctas_atraso_tdc_u3m',
    'ctas_ab_u18m_tdc_banco',
    'num_ctas_prest_pers',
    'flag_prest_nomina',
    'meses_cta_mas_antig',
    'flag_fraude_prest_nomina',
    'flag_fraude_hipotecario',
    'flag_quebranto_prest_personal'
]

huella = [
    'tipo_suscripcion',
    'status',
    'antiguedad_uso_linea_celular',
    'gasto_mensual_telefonia',
    'actividad_usuario',
    'cambio_sim_u3m',
    'variable_37',
    'gasto_ultimos_60_dias',
    'score_fpd',
    'rango_dispositivo',
    'adopcion_tecno',
    'score_incumplimiento',
    'tasa_contactacion',
    'calidad_telefonica',
    'score_contactabilidad_entrante',
    'score_contactabilidad_saliente',
    'disciplina_tech',
    'cluster_sucursales'
]

# Print len of list
print(f'Extra: {len(extra)} columns')
print(f'Demografica: {len(demografica)} columns')
print(f'Buro: {len(buro)} columns')
print(f'Huella: {len(huella)} columns')

Extra: 3 columns
Demografica: 3 columns
Buro: 16 columns
Huella: 18 columns


In [15]:
# Define the columns for each scorecard
scorecard_1_cols = extra + demografica
scorecard_2_cols = extra + demografica + huella
scorecard_3_cols = extra + demografica + buro
scorecard_4_cols = extra + demografica + buro + huella

# Function to keep only specified columns in a dataframe
def keep_columns(df, columns_to_keep):
    return df[df.columns.intersection(columns_to_keep)]

# Modify each scorecard dataframe
scorecard_1 = keep_columns(scorecard_1, scorecard_1_cols)
scorecard_2 = keep_columns(scorecard_2, scorecard_2_cols)
scorecard_3 = keep_columns(scorecard_3, scorecard_3_cols)
scorecard_4 = keep_columns(scorecard_4, scorecard_4_cols)

# Print the number of columns in each scorecard for verification
print(f"Scorecard 1 (Demographic): {scorecard_1.shape[1]} columns")
print(f"Scorecard 2 (Demographic and footprint): {scorecard_2.shape[1]} columns")
print(f"Scorecard 3 (Demographic and credit): {scorecard_3.shape[1]} columns")
print(f"Scorecard 4 (Demographic, credit and footprint): {scorecard_4.shape[1]} columns")

Scorecard 1 (Demographic): 6 columns
Scorecard 2 (Demographic and footprint): 24 columns
Scorecard 3 (Demographic and credit): 22 columns
Scorecard 4 (Demographic, credit and footprint): 40 columns


## Training Models

### Split raw database

* 75% train - 25% test 

In [16]:
apr_4 = df_copy[(df_copy['aprobado'] == 1) & (df_copy['hit_group'] == 4)]

y = apr_4['malo']
x = apr_4.drop(columns=['aprobado', 'hit_group']) # We do not drop 'malo' because it is needed for the IV and KS calculations

In [17]:
# Split dataset into training set and test set
# 75% training and 25% test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=123)
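
Because defaults are usually the minority class, it can help to keep the same share of `malo` in both parts of the split. `train_test_split` offers a `stratify` parameter for that purpose. The dependency-free sketch below shows what stratification does on invented labels; `stratified_split` is a hypothetical helper, not part of this notebook.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def stratified_split(labels: Sequence[int], test_size: float = 0.25,
                     seed: int = 123) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) with roughly the same class ratio in both."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx: List[int] = []
    test_idx: List[int] = []
    for idx in by_class.values():
        rng.shuffle(idx)                        # shuffle within each class
        n_test = round(len(idx) * test_size)    # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Invented labels: 80 good clients (0) and 20 bad clients (1)
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```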

In [18]:
# List of unique values per column
temp_df = pd.DataFrame(columns=['Column', 'dtype', 'Unique Values', 'amnt_unique','nan', 'size'])

for column in x_train.columns:
    unique_values = [x_train[column].unique()]  # Ensure unique values are in a list
    amnt_unique = x_train[column].nunique()
    temp_df = pd.concat([temp_df, pd.DataFrame({'Column': [column], 'dtype': [x_train[column].dtype], 'Unique Values': unique_values, 'amnt_unique': amnt_unique})], ignore_index=True)
    temp_df.loc[temp_df['Column'] == column, 'nan'] = x_train[column].isnull().sum()
    temp_df.loc[temp_df['Column'] == column, 'size'] = x_train[column].count()

# Display the resulting dataframe
temp_df

Unnamed: 0,Column,dtype,Unique Values,amnt_unique,nan,size
0,malo,int64,"[0, 1]",2,0,9539
1,num_iq_u3m_pl_financieras,int64,"[-2, 3, 0, 1, 2, 4, 6, 5, 7, 8]",10,0,9539
2,edad_cliente,int64,"[65, 63, 42, 62, 76, 67, 60, 53, 41, 70, 74, 6...",59,0,9539
3,porcentaje_cuentas_abiertas,float64,"[0.066666667, 0.555555556, 0.260869565, 0.6, 0...",886,0,9539
4,num_iq_u3m,int64,"[8, 4, 3, 1, 5, 12, 6, 2, 19, 7, 10, 14, 15, 9...",36,0,9539
5,num_iq_u3m_tdc_banco,int64,"[-2, 1, 2, 3, 0, 4, 5, 6]",8,0,9539
6,maxmop_u3m,int64,"[9, 2, 5, 6, 1, 3, 0, 7, 4, -1]",10,0,9539
7,saldo_actual_prest_personales,int64,"[249305, -2, 2097, 31524, 114711, 45700, 52249...",6564,0,9539
8,num_iq_u3m_prest_personales,int64,"[2, 1, -2, 4, 3, 5, 7, 6, 8, 0, 9, 10]",12,0,9539
9,ingreso_bruto,float64,"[16154.77, 3799.82, 3506.81, 5668.89, 3700.0, ...",7299,0,9539


### x_train treatment

#### Outliers correction

In [24]:
def outlier_correction(x: DataFrame, val_threshold: int = 40, tail_threshold: float = 0.05) -> DataFrame:
    """
    Cap upper-tail outliers in non-categorical numeric variables.

    **Parameters**

    x: *DataFrame* Dataframe to be corrected.
    val_threshold: *int* [default = 40] Number of unique values up to which a variable is treated as categorical.
    tail_threshold: *float* [default = 0.05] Tail quantile used to delimit the bulk of the distribution.

    **Returns**
    x_copy: *DataFrame* Copy of the dataframe with outliers capped.

    """

    # Copy the dataframe to avoid modifying the original
    x_copy = x.copy(deep=True)

    # Loop through all columns to clean outliers
    for column in x_copy.columns:
        # Non-numeric columns are ignored
        if not pd.api.types.is_numeric_dtype(x_copy[column]):
            continue
        # Columns with few unique values are treated as categorical and skipped
        if x_copy[column].nunique() <= val_threshold:
            continue

        # Tail quantiles of the strictly positive values, so that non-positive
        # entries (e.g. the -1/-2 codes seen above) do not distort them
        positive = x_copy.loc[x_copy[column] > 0, column]
        q_low = positive.quantile(tail_threshold)
        q_high = positive.quantile(1 - tail_threshold)
        spread = q_high - q_low

        # Cap values above the upper fence at the fence itself
        # (upper bound only, for naturally right-skewed distributions)
        x_copy[column] = x_copy[column].clip(upper=q_high + 1.5 * spread)

    return x_copy

def bin_creation(x: DataFrame, max: float = 0.35, min: float = 0.05) -> DataFrame:
    # Copy the dataframe to avoid modifying the original
    x_copy = x.copy(deep=True)

    # Loop through all columns to create bins
    for column in x_copy.columns:
        # Check if the column is numeric
        if x_copy[column].dtype in ['int64', 'float64']:
            bin_0 = x_copy[x_copy[column] < 0]  # Rows with negative values
            bin_1 = x_copy[x_copy[column] >= 0].copy()  # Explicit copy avoids SettingWithCopyWarning

            # Bin the non-negative values only when the negative bin holds more than `min` of all rows
            if bin_0.shape[0] > min * x_copy.shape[0]:
                # Quintile edges for the non-negative values (duplicated edges are dropped)
                _, bin_edges = pd.qcut(bin_1[column], q=5, labels=False, retbins=True, duplicates="drop")

                # Create a label for each interval
                interval_labels = [f"({bin_edges[i]:.2f} - {bin_edges[i+1]:.2f})" for i in range(len(bin_edges)-1)]

                # Assign each value to its interval; include_lowest keeps the minimum in the first bin
                bin_1[column] = pd.cut(bin_1[column], bins=bin_edges, labels=interval_labels, include_lowest=True)

                # tabulate expects a 2-D table, so convert the counts Series to a DataFrame first
                counts = bin_1[column].value_counts().sort_index().reset_index()
                print(tabulate(counts, headers='keys', tablefmt='pretty', showindex=False))
    return x_copy

# Apply the outlier_correction function to the training set
x_train_clean = outlier_correction(x_train)
x_train_bins = bin_creation(x_train_clean)
x_train_bins

#### IV and KS indicators

In [87]:
def get_iv_ks(df: DataFrame) -> DataFrame:
    """
    This function gets the IV and KS for each variable in the dataframe. 

    **Parameters**

    df: *DataFrame* Dataframe to get the IV and KS.

    **Returns**
    
    res: *DataFrame* Dataframe with summary of the IV and KS for each variable.
    """

    res = df
    
    return res

# 'malo' was intentionally kept in x, so the target is already part of x_train_clean;
# the assignment below only makes that explicit before the IV and KS calculations
x_train_clean['malo'] = y_train.values

# Get the IV and KS for the training set
iv_ks_results = get_iv_ks(x_train_clean)


TypeError: list indices must be integers or slices, not str
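
The body of `get_iv_ks` is still a placeholder. As a reference for what it is meant to compute, the sketch below applies the usual definitions to one already-binned variable: Weight of Evidence per bin is ln(%good / %bad), IV is the sum of (%good - %bad) * WoE, and KS is the largest gap between the cumulative good and bad distributions. The function name `iv_ks`, the `eps` smoothing and the example counts are assumptions made for illustration.

```python
import math
from typing import List, Tuple

def iv_ks(bins: List[Tuple[int, int]], eps: float = 0.5) -> Tuple[float, float]:
    """bins: ordered (n_good, n_bad) counts per bin. Returns (IV, KS)."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)

    iv = 0.0
    ks = 0.0
    cum_good = 0.0
    cum_bad = 0.0
    for good, bad in bins:
        # eps replaces empty cells so that log(0) and division by zero are avoided
        p_good = max(good, eps) / total_good
        p_bad = max(bad, eps) / total_bad
        woe = math.log(p_good / p_bad)
        iv += (p_good - p_bad) * woe

        # KS: largest gap between the two cumulative distributions
        cum_good += good / total_good
        cum_bad += bad / total_bad
        ks = max(ks, abs(cum_good - cum_bad))
    return iv, ks

# Invented counts for a variable split into four ordered bins
iv, ks = iv_ks([(400, 20), (300, 30), (200, 50), (100, 100)])
print(round(iv, 4), round(ks, 2))
```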

#### Features selection

In [36]:
def feature_selection(iv_ks_df: DataFrame, corr_threshold: float = 0.6) -> Dict[str, List]:
    """
    This function gets the features that are going to be used in the model. In case there are 2 or more correlated variables (where the abs(corr) >= 0.6), only the one with the highest IV and KS is going to be selected (priority on KS).

    **Parameters**

    iv_ks_df: *DataFrame* Dataframe with the IV and KS for each variable.
    corr_threshold: *float* [default = 0.6] Threshold for the correlation to be considered high.

    **Returns**
    
    features: *Dict[str, List]* Dictionary with the features selected for the model.
    """
    
    features = {}

    return features

features = feature_selection(iv_ks_results)
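
The selection rule described in the docstring can be sketched as a greedy pass: rank the variables by KS (IV as tie-breaker) and keep a variable only if its absolute correlation with every variable already kept stays below the threshold. This is one possible reading of the rule, not the notebook's implementation. The helper names, the toy data and the choice of Pearson correlation are assumptions, and the sketch returns a plain list instead of the `Dict[str, List]` declared above.

```python
import math
from typing import Dict, List, Sequence

def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def select_features(data: Dict[str, List[float]], ks: Dict[str, float],
                    iv: Dict[str, float], corr_threshold: float = 0.6) -> List[str]:
    # Strongest variables first: KS has priority, IV breaks ties
    ranked = sorted(data, key=lambda name: (ks[name], iv[name]), reverse=True)
    kept: List[str] = []
    for name in ranked:
        # Keep the variable only if it is not highly correlated with a kept one
        if all(abs(pearson(data[name], data[k])) < corr_threshold for k in kept):
            kept.append(name)
    return kept

# Invented data: 'a' and 'b' are almost collinear, 'c' is weakly related to both
data = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [5, 1, 4, 2, 3]}
ks = {"a": 0.30, "b": 0.35, "c": 0.10}
iv = {"a": 0.20, "b": 0.25, "c": 0.05}
selected = select_features(data, ks, iv)
print(selected)
```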

### Decision Tree

#### Training algorithm

### x_test treatment

#### Outliers correction

### Confusion matrix
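
This section has no code yet. As a reminder of what the matrix contains, here is a dependency-free sketch that counts the four cells for a binary target, with 1 meaning default (`malo`). The labels and predictions are invented; in the notebook itself, `sklearn.metrics.confusion_matrix` from the already imported `metrics` module would be the natural choice.

```python
from typing import Dict, Sequence

def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false negatives and positives for binary labels (1 = default)."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            counts["tp" if pred == 1 else "fn"] += 1
        else:
            counts["fp" if pred == 1 else "tn"] += 1
    return counts

# Invented labels and predictions, for illustration only
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

cm = confusion_counts(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
recall = cm["tp"] / (cm["tp"] + cm["fn"])  # share of actual defaults detected
print(cm, accuracy, round(recall, 3))
```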