# 7.2.9 Identifying risky loans using decision trees

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from ucimlrepo import fetch_ucirepo 
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

import warnings
warnings.filterwarnings("ignore")

link: https://youtu.be/1dLQmi6CyzA?si=jDxi0vXQ8d5ZC-Wc

data: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

The previous tutorial covered the fundamentals of classification trees. This tutorial applies them to identifying potentially risky loans at a lending institution.

## 7.2.9.1 Problem description

Financial institutions want to improve their loan approval procedures in order to reduce the risk of non-payment, which causes losses for the institution. The practical problem is deciding whether or not to approve a given loan based on information that can be easily collected by phone or on the web.

The sample contains 1000 observations. Each record has 20 attributes that describe both the loan and the financial health of the applicant. The data was collected by a German firm and can be downloaded from **UCI**.

The attributes and their values are as follows:

```

Attribute 1:  (qualitative)
         Status of existing checking account
         A11 :      ... <    0 DM
         A12 : 0 <= ... <  200 DM
         A13 :      ... >= 200 DM /
               salary assignments for at least 1 year
         A14 : no checking account

Attribute 2:  (numerical)
         Duration in month

Attribute 3:  (qualitative)
         Credit history
         A30 : no credits taken/
               all credits paid back duly
         A31 : all credits at this bank paid back duly
         A32 : existing credits paid back duly till now
         A33 : delay in paying off in the past
         A34 : critical account/
               other credits existing (not at this bank)

Attribute 4:  (qualitative)
         Purpose
         A40 : car (new)
         A41 : car (used)
         A42 : furniture/equipment
         A43 : radio/television
         A44 : domestic appliances
         A45 : repairs
         A46 : education
         A47 : (vacation - does not exist?)
         A48 : retraining
         A49 : business
         A410 : others

Attribute 5:  (numerical)
         Credit amount

Attribute 6:  (qualitative)
         Savings account/bonds
         A61 :          ... <  100 DM
         A62 :   100 <= ... <  500 DM
         A63 :   500 <= ... < 1000 DM
         A64 :          .. >= 1000 DM
         A65 :   unknown/ no savings account

Attribute 7:  (qualitative)
         Present employment since
         A71 : unemployed
         A72 :       ... < 1 year
         A73 : 1  <= ... < 4 years
         A74 : 4  <= ... < 7 years
         A75 :       .. >= 7 years

Attribute 8:  (numerical)
         Installment rate in percentage of disposable income

Attribute 9:  (qualitative)
         Personal status and sex
         A91 : male   : divorced/separated
         A92 : female : divorced/separated/married
         A93 : male   : single
         A94 : male   : married/widowed
         A95 : female : single

Attribute 10: (qualitative)
         Other debtors / guarantors
         A101 : none
         A102 : co-applicant
         A103 : guarantor

Attribute 11: (numerical)
         Present residence since

Attribute 12: (qualitative)
         Property
         A121 : real estate
         A122 : if not A121 : building society savings agreement/
                  life insurance
         A123 : if not A121/A122 : car or other, not in attribute 6
         A124 : unknown / no property

Attribute 13: (numerical)
         Age in years

Attribute 14: (qualitative)
         Other installment plans
         A141 : bank
         A142 : stores
         A143 : none

Attribute 15: (qualitative)
         Housing
         A151 : rent
         A152 : own
         A153 : for free

Attribute 16: (numerical)
         Number of existing credits at this bank

Attribute 17: (qualitative)
         Job
         A171 : unemployed/ unskilled  - non-resident
         A172 : unskilled - resident
         A173 : skilled employee / official
         A174 : management/ self-employed/
                highly qualified employee/ officer

Attribute 18: (numerical)
         Number of people being liable to provide maintenance for

Attribute 19: (qualitative)
         Telephone
         A191 : none
         A192 : yes, registered under the customers name

Attribute 20: (qualitative)
         foreign worker
         A201 : yes
         A202 : no

```
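The codes in the code book are compact but hard to read in summaries. A minimal sketch of decoding one attribute, with the labels copied from Attribute 1 above; the dictionary name `CHECKING_BALANCE_LABELS`, the helper `decode`, and the sample codes are illustrative and not part of the original notebook:

```python
# Labels copied from Attribute 1 of the code book above.
CHECKING_BALANCE_LABELS = {
    "A11": "< 0 DM",
    "A12": "0 <= ... < 200 DM",
    "A13": ">= 200 DM / salary assignments for at least 1 year",
    "A14": "no checking account",
}


def decode(codes, labels):
    """Replace each code by its label; unknown codes are kept unchanged."""
    return [labels.get(code, code) for code in codes]


# Hypothetical sample of raw codes, for illustration only.
sample_codes = ["A11", "A14", "A12", "A99"]
decoded = decode(sample_codes, CHECKING_BALANCE_LABELS)
print(decoded)
```

Once the data is in a pandas DataFrame, the same dictionary could be passed to `Series.map` to decode a whole column.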

## 7.2.9.2 Loading the data

In [4]:
# Column names in the order of Attributes 1-20 of the code book
# (Attribute 17 is job, 18 dependents, 19 telephone, 20 foreign worker)
columns = ['checking_balance','months_loan_duration','credit_history','purpose','amount','savings_balance',
'employment_length','installment_rate','personal_status','other_debtors','residence_history',
'property','age','installment_plan','housing','existing_credits','job','dependents',
'telephone','foreign_worker']

# fetch dataset 
statlog_german_credit_data = fetch_ucirepo(id=144) 

# data (as pandas dataframes) 
X = statlog_german_credit_data.data.features 
y = statlog_german_credit_data.data.targets

X.columns = columns
y.columns = ['default']

# # metadata 
# print(statlog_german_credit_data.metadata) 
  
# variable information 
print(statlog_german_credit_data.variables) 

           name     role         type     demographic  \
0    Attribute1  Feature  Categorical            None   
1    Attribute2  Feature      Integer            None   
2    Attribute3  Feature  Categorical            None   
3    Attribute4  Feature  Categorical            None   
4    Attribute5  Feature      Integer            None   
5    Attribute6  Feature  Categorical            None   
6    Attribute7  Feature  Categorical           Other   
7    Attribute8  Feature      Integer            None   
8    Attribute9  Feature  Categorical  Marital Status   
9   Attribute10  Feature  Categorical            None   
10  Attribute11  Feature      Integer            None   
11  Attribute12  Feature  Categorical            None   
12  Attribute13  Feature      Integer             Age   
13  Attribute14  Feature  Categorical            None   
14  Attribute15  Feature  Categorical           Other   
15  Attribute16  Feature      Integer            None   
16  Attribute17  Feature  Categ

In [5]:
#
# File contents
#
X.head()

,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,residence_history,property,age,installment_plan,housing,existing_credits,job,dependents,telephone,foreign_worker
0,A11,6,A34,A43,1169,A65,A75,4,A93,A101,4,A121,67,A143,A152,2,A173,1,A192,A201
1,A12,48,A32,A43,5951,A61,A73,2,A92,A101,2,A121,22,A143,A152,1,A173,1,A191,A201
2,A14,12,A34,A46,2096,A61,A74,2,A93,A101,3,A121,49,A143,A152,1,A172,2,A191,A201
3,A11,42,A32,A42,7882,A61,A74,2,A93,A103,4,A122,45,A143,A153,1,A173,2,A191,A201
4,A11,24,A33,A40,4870,A61,A73,3,A93,A101,4,A124,53,A143,A153,2,A173,2,A191,A201


In [6]:
#
# Check the data types of the columns
#
X.dtypes

checking_balance        object
months_loan_duration     int64
credit_history          object
purpose                 object
amount                   int64
savings_balance         object
employment_length       object
installment_rate         int64
personal_status         object
other_debtors           object
residence_history        int64
property                object
age                      int64
installment_plan        object
housing                 object
existing_credits         int64
job                     object
dependents               int64
telephone               object
foreign_worker          object
dtype: object
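The `dtypes` listing separates the coded columns from the integer ones. A sketch of deriving both lists programmatically with `DataFrame.select_dtypes`, shown on a small hypothetical frame (`toy`) that copies a few values from the `X.head()` output rather than on the real data:

```python
import pandas as pd

# Small frame with a few values copied from the X.head() output.
toy = pd.DataFrame(
    {
        "checking_balance": ["A11", "A12", "A14"],
        "months_loan_duration": [6, 48, 12],
        "purpose": ["A43", "A43", "A46"],
        "amount": [1169, 5951, 2096],
    }
)

# Columns that are already numeric.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

# Everything else holds code-book strings that need encoding later.
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(categorical_columns)
```

Deriving the lists this way avoids maintaining a hand-typed list of categorical columns.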

## 7.2.9.3 Exploratory analysis

In [7]:
#
# Some columns are numeric and the rest are
# categorical (factors).
# DM stands for Deutsche Mark.
# Some values are checked against the code book.
#
X.checking_balance.value_counts()

checking_balance
A14    394
A11    274
A12    269
A13     63
Name: count, dtype: int64

In [8]:
X.savings_balance.value_counts()

savings_balance
A61    603
A65    183
A62    103
A63     63
A64     48
Name: count, dtype: int64

In [9]:
#
# The loan amount ranges from 250 DM to 18,424 DM
#
X.amount.describe()

count     1000.000000
mean      3271.258000
std       2822.736876
min        250.000000
25%       1365.500000
50%       2319.500000
75%       3972.250000
max      18424.000000
Name: amount, dtype: float64

In [10]:
#
# The loan duration ranges from 4 to 72 months
#
X.months_loan_duration.describe()

count    1000.000000
mean       20.903000
std        12.058814
min         4.000000
25%        12.000000
50%        18.000000
75%        24.000000
max        72.000000
Name: months_loan_duration, dtype: float64

In [11]:
#
# The default column records whether the loan
# was repaid (1 = repaid, 2 = not repaid).
# This is the column to be predicted.
#
y.default.value_counts()

default
1    700
2    300
Name: count, dtype: int64
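The class counts give a reference point for any later model. A minimal sketch, using only the counts printed above, of the accuracy obtained by always predicting the most frequent class (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 700, 2: 300}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = proportions[majority_class]

print(proportions)
print(majority_class, baseline_accuracy)
```

With a 70/30 split between the classes, an accuracy near 0.70 can be reached without learning anything, so a model should be judged against that figure rather than against 0.50.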

## 7.2.9.4 Preprocessing

In [12]:
#
# Build an encoder that converts
# strings to integers (similar to factors in R)
#
enc = LabelEncoder()

#
# Apply the encoder to the categorical columns
# of the dataset
#
columns = [
    "checking_balance",
    "credit_history",
    "purpose",
    "savings_balance",
    "employment_length",
    "personal_status",
    "other_debtors",
    "property",
    "installment_plan",
    "housing",
    "dependents",
    "telephone",
    "foreign_worker",
    "job",
]

for column in columns:
    X[column] = enc.fit_transform(X[column])
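The loop above refits a single `LabelEncoder` for every column, so once it finishes only the mapping of the last column is still stored in `enc`. If the original codes need to be recovered later, one option is to keep one fitted encoder per column. A sketch on a hypothetical two-column frame (`toy`, `encoders`, `encoded`, and `restored` are illustrative names, not part of the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with two coded columns.
toy = pd.DataFrame(
    {
        "checking_balance": ["A11", "A12", "A14", "A11"],
        "housing": ["A152", "A152", "A153", "A151"],
    }
)

# One fitted encoder per column, so every mapping stays available.
encoders = {}
encoded = toy.copy()
for column in toy.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(toy[column])

# LabelEncoder assigns integers following the sorted order of the codes.
print(encoded)

# Recover the original codes of one column from the integers.
restored = encoders["housing"].inverse_transform(encoded["housing"])
print(list(restored))
```

The integers carry no meaning beyond the alphabetical order of the codes, which is worth keeping in mind when reading the splits of a fitted tree.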

## 7.2.9.5 Training the model

In [13]:
#
# 90% of the data is used for training
# and the remaining 10% for testing
#
# Split the data into training and test sets
X_train, X_test, y_train_true, y_test_true = train_test_split(X, y, test_size=0.1, random_state=42)

#
# Build the tree
#
clf = DecisionTreeClassifier()

#
# Fit the tree on the training data
#
clf.fit(X_train, y_train_true)

#
# Predict on the test sample
#
y_test_pred = clf.predict(X_test)

#
# Performance metrics
#
confusion_matrix(y_test_true, y_test_pred)

array([[53, 18],
       [15, 14]], dtype=int64)
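In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes, both in sorted label order (here 1, then 2). A sketch of reading the matrix above by hand, treating class 2 (not repaid) as the class of interest; the variable names are illustrative:

```python
# Confusion matrix copied from the output above:
# rows = true class (1, 2), columns = predicted class (1, 2).
matrix = [[53, 18],
          [15, 14]]

true_1_pred_1, true_1_pred_2 = matrix[0]
true_2_pred_1, true_2_pred_2 = matrix[1]

total = sum(sum(row) for row in matrix)

# Share of all test loans that were classified correctly.
accuracy = (true_1_pred_1 + true_2_pred_2) / total

# Of the loans that were not repaid, how many did the tree flag?
recall_class_2 = true_2_pred_2 / (true_2_pred_1 + true_2_pred_2)

# Of the loans the tree flagged, how many were really not repaid?
precision_class_2 = true_2_pred_2 / (true_1_pred_2 + true_2_pred_2)

print(f"accuracy={accuracy:.2f}")
print(f"recall (class 2)={recall_class_2:.2f}")
print(f"precision (class 2)={precision_class_2:.2f}")
```

Read this way, 15 of the 29 unpaid loans in the test set were predicted as repaid, which is the costly kind of error for a lender.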

In [14]:
# Compute the accuracy
accuracy = accuracy_score(y_test_true, y_test_pred)
print(f'Accuracy: {accuracy:.2f}')

# Print the classification report
print('\nClassification Report:')
print(classification_report(y_test_true, y_test_pred))

Accuracy: 0.67

Classification Report:
              precision    recall  f1-score   support

           1       0.78      0.75      0.76        71
           2       0.44      0.48      0.46        29

    accuracy                           0.67       100
   macro avg       0.61      0.61      0.61       100
weighted avg       0.68      0.67      0.67       100
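An accuracy measured on only 100 test records is a noisy estimate. A rough sketch of its uncertainty, assuming the test predictions behave like independent binomial trials and using the normal approximation (both are simplifying assumptions, and the variable names are illustrative):

```python
import math

# Values copied from the report above.
accuracy = 0.67
n_test = 100

# Standard error of a proportion under a binomial model.
standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% interval using the normal approximation (z = 1.96).
z = 1.96
lower = accuracy - z * standard_error
upper = accuracy + z * standard_error

print(f"standard error: {standard_error:.3f}")
print(f"approximate 95% interval: [{lower:.2f}, {upper:.2f}]")
```

Under these assumptions the interval contains 0.71, the share of class 1 in the test set (71 of 100), so this single split does not show the tree doing better than always predicting class 1.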



