Identifying risky loans
===

* *60 min* | Last modified: June 22, 2019

The previous tutorial discussed the fundamentals of classification trees. This tutorial applies them to identifying potentially risky loans at a lending institution.

## Problem description

Financial institutions want to improve their loan approval procedures in order to reduce the risk of non-payment, which causes losses for the institution. The practical problem is to decide whether or not to approve a particular loan based on information that can be easily collected by phone or on the web.

The sample contains 1000 observations. Each record has 20 attributes describing both the loan and the applicant's financial health. The data was collected by a German firm and can be downloaded from https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data).

The attributes and their values are as follows:

     Attribute 1:  (qualitative)
     	      Status of existing checking account
     	      A11 :      ... <    0 DM
     	      A12 : 0 <= ... <  200 DM
     	      A13 :      ... >= 200 DM /
     	            salary assignments for at least 1 year
     	      A14 : no checking account

     Attribute 2:  (numerical)
     	      Duration in month

     Attribute 3:  (qualitative)
     	      Credit history
     	      A30 : no credits taken/
     	            all credits paid back duly
     	      A31 : all credits at this bank paid back duly
     	      A32 : existing credits paid back duly till now
     	      A33 : delay in paying off in the past
     	      A34 : critical account/
     	            other credits existing (not at this bank)

     Attribute 4:  (qualitative)
     	      Purpose
     	      A40 : car (new)
     	      A41 : car (used)
     	      A42 : furniture/equipment
     	      A43 : radio/television
     	      A44 : domestic appliances
     	      A45 : repairs
     	      A46 : education
     	      A47 : (vacation - does not exist?)
     	      A48 : retraining
     	      A49 : business
     	      A410 : others

     Attribute 5:  (numerical)
     	      Credit amount

     Attribute 6:  (qualitative)
     	      Savings account/bonds
     	      A61 :          ... <  100 DM
     	      A62 :   100 <= ... <  500 DM
     	      A63 :   500 <= ... < 1000 DM
     	      A64 :          .. >= 1000 DM
     	      A65 :   unknown/ no savings account

     Attribute 7:  (qualitative)
     	      Present employment since
     	      A71 : unemployed
     	      A72 :       ... < 1 year
     	      A73 : 1  <= ... < 4 years  
     	      A74 : 4  <= ... < 7 years
     	      A75 :       .. >= 7 years

     Attribute 8:  (numerical)
     	      Installment rate in percentage of disposable income

     Attribute 9:  (qualitative)
     	      Personal status and sex
     	      A91 : male   : divorced/separated
     	      A92 : female : divorced/separated/married
     	      A93 : male   : single
     	      A94 : male   : married/widowed
     	      A95 : female : single

     Attribute 10: (qualitative)
     	      Other debtors / guarantors
     	      A101 : none
     	      A102 : co-applicant
     	      A103 : guarantor

     Attribute 11: (numerical)
     	      Present residence since

     Attribute 12: (qualitative)
     	      Property
     	      A121 : real estate
     	      A122 : if not A121 : building society savings agreement/
     				   life insurance
     	      A123 : if not A121/A122 : car or other, not in attribute 6
     	      A124 : unknown / no property

     Attribute 13: (numerical)
     	      Age in years

     Attribute 14: (qualitative)
     	      Other installment plans 
     	      A141 : bank
     	      A142 : stores
     	      A143 : none

     Attribute 15: (qualitative)
     	      Housing
     	      A151 : rent
     	      A152 : own
     	      A153 : for free

     Attribute 16: (numerical)
              Number of existing credits at this bank

     Attribute 17: (qualitative)
     	      Job
     	      A171 : unemployed/ unskilled  - non-resident
     	      A172 : unskilled - resident
     	      A173 : skilled employee / official
     	      A174 : management/ self-employed/
     		         highly qualified employee/ officer

     Attribute 18: (numerical)
     	      Number of people being liable to provide maintenance for

     Attribute 19: (qualitative)
     	      Telephone
     	      A191 : none
     	      A192 : yes, registered under the customers name

     Attribute 20: (qualitative)
     	      foreign worker
     	      A201 : yes
     	      A202 : no


## Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext rpy2.ipython

## Loading the data

In [2]:
##
## Read the file. 
##
df = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/credit.csv",
    sep = ',',           # field separator
    thousands = None,    # thousands separator for numbers
    decimal = '.',       # decimal separator for numbers
    encoding='latin-1')  # file encoding

##
## Check that the data was read correctly
##
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
checking_balance        1000 non-null object
months_loan_duration    1000 non-null int64
credit_history          1000 non-null object
purpose                 1000 non-null object
amount                  1000 non-null int64
savings_balance         1000 non-null object
employment_length       1000 non-null object
installment_rate        1000 non-null int64
personal_status         1000 non-null object
other_debtors           1000 non-null object
residence_history       1000 non-null int64
property                1000 non-null object
age                     1000 non-null int64
installment_plan        1000 non-null object
housing                 1000 non-null object
existing_credits        1000 non-null int64
default                 1000 non-null int64
dependents              1000 non-null int64
telephone               1000 non-null object
foreign_worker          1000 non-null object
jo

In [3]:
##
## Contents of the file
##
df.head()

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,...,property,age,installment_plan,housing,existing_credits,default,dependents,telephone,foreign_worker,job
0,< 0 DM,6,critical,radio/tv,1169,unknown,> 7 yrs,4,single male,none,...,real estate,67,none,own,2,1,1,yes,yes,skilled employee
1,1 - 200 DM,48,repaid,radio/tv,5951,< 100 DM,1 - 4 yrs,2,female,none,...,real estate,22,none,own,1,2,1,none,yes,skilled employee
2,unknown,12,critical,education,2096,< 100 DM,4 - 7 yrs,2,single male,none,...,real estate,49,none,own,1,1,2,none,yes,unskilled resident
3,< 0 DM,42,repaid,furniture,7882,< 100 DM,4 - 7 yrs,2,single male,guarantor,...,building society savings,45,none,for free,1,1,2,none,yes,skilled employee
4,< 0 DM,24,delayed,car (new),4870,< 100 DM,1 - 4 yrs,3,single male,none,...,unknown/none,53,none,for free,2,2,2,none,yes,skilled employee


In [4]:
##
## Check the data types of the columns 
##
df.dtypes

checking_balance        object
months_loan_duration     int64
credit_history          object
purpose                 object
amount                   int64
savings_balance         object
employment_length       object
installment_rate         int64
personal_status         object
other_debtors           object
residence_history        int64
property                object
age                      int64
installment_plan        object
housing                 object
existing_credits         int64
default                  int64
dependents               int64
telephone               object
foreign_worker          object
job                     object
dtype: object

### Exploratory analysis

In [5]:
##
## Some of the columns are numeric and 
## the others are factors.
## DM stands for Deutsche Marks.
## Some values are checked against the code book.
##
df.checking_balance.value_counts()

unknown       394
< 0 DM        274
1 - 200 DM    269
> 200 DM       63
Name: checking_balance, dtype: int64

In [6]:
df.savings_balance.value_counts()

< 100 DM         603
unknown          183
101 - 500 DM     103
501 - 1000 DM     63
> 1000 DM         48
Name: savings_balance, dtype: int64

In [7]:
##
## The loan amount ranges from 250 DM to 18,424 DM
##
df.amount.describe()

count     1000.000000
mean      3271.258000
std       2822.736876
min        250.000000
25%       1365.500000
50%       2319.500000
75%       3972.250000
max      18424.000000
Name: amount, dtype: float64

In [8]:
##
## The loan duration ranges from 4 to 72 months
##
df.months_loan_duration.describe()

count    1000.000000
mean       20.903000
std        12.058814
min         4.000000
25%        12.000000
50%        18.000000
75%        24.000000
max        72.000000
Name: months_loan_duration, dtype: float64

In [9]:
##
## The default column indicates whether there were 
## problems repaying the loan (1 = repaid, 2 = not repaid).
## This is the column to be predicted.
##
df.default.value_counts()

1    700
2    300
Name: default, dtype: int64
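
The class counts above already fix a reference point for any classifier: a model that always predicts the majority class. The following minimal sketch recomputes the class proportions and that baseline accuracy from the counts shown in the output; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts reported by df.default.value_counts() above
# (1 = loan repaid, 2 = loan not repaid).
counts = pd.Series({1: 700, 2: 300}, name="default")

# Relative frequency of each class.
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = proportions.max()

print(proportions)
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

Any tree trained on these data should be judged against this 70% baseline rather than against 50%.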

### Preprocessing

In [10]:
from sklearn.preprocessing import LabelEncoder

##
## Build an encoder that transforms
## strings into integers (similar to factors in R)
##
enc = LabelEncoder()

##
## Apply the encoder to the columns
## of the dataset
##
df["checking_balance"] = enc.fit_transform(df["checking_balance"])
df["credit_history"] = enc.fit_transform(df["credit_history"])
df["purpose"] = enc.fit_transform(df["purpose"])
df["savings_balance"] = enc.fit_transform(df["savings_balance"])
df["employment_length"] = enc.fit_transform(df["employment_length"])
df["personal_status"] = enc.fit_transform(df["personal_status"])
df["other_debtors"] = enc.fit_transform(df["other_debtors"])
df["property"] = enc.fit_transform(df["property"])
df["installment_plan"] = enc.fit_transform(df["installment_plan"])
df["housing"] = enc.fit_transform(df["housing"])
df["telephone"] = enc.fit_transform(df["telephone"])
df["foreign_worker"] = enc.fit_transform(df["foreign_worker"])
df["job"] = enc.fit_transform(df["job"])
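
Because a single `LabelEncoder` is refitted for every column, only the mapping of the last column survives in `enc.classes_`. The sketch below, on a toy column with the four `checking_balance` categories, keeps one encoder per column so that each mapping can be inspected afterwards. The `encoders` dictionary and the `_code` suffix are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with the four categories seen in checking_balance.
toy = pd.DataFrame(
    {"checking_balance": ["< 0 DM", "1 - 200 DM", "unknown", "> 200 DM", "< 0 DM"]}
)

# One encoder per column (hypothetical dict) so that each
# mapping can be inspected or inverted later.
encoders = {}
for column in ["checking_balance"]:
    encoders[column] = LabelEncoder()
    toy[column + "_code"] = encoders[column].fit_transform(toy[column])

# classes_ lists the labels in the order of their integer codes.
# Labels are sorted as strings, so the codes carry no ordinal meaning.
mapping = {
    label: code
    for code, label in enumerate(encoders["checking_balance"].classes_)
}
print(mapping)
```

The integer codes follow the string sort order of the labels, not any financial ordering, which is worth remembering when reading the splits of a tree trained on them.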

### Training the model

In [11]:
##
## 90% of the data is used for training 
## and the remaining 10% for testing
##
train_sample = list(range(900))
test_sample  = list(range(900, 1000))

##
## Build the training and test sets
##
X_train = df.iloc[train_sample,:].copy()
X_test  = df.iloc[test_sample,:].copy()

##
## Drop the default column, which 
## is the output variable
##
X_train.drop('default', axis=1, inplace=True)
X_test.drop('default', axis=1, inplace=True)

##
## Build the dependent variable
##
y_train_true = df.default[train_sample]
y_test_true  = df.default[test_sample]
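
The split above takes the first 900 rows in file order. If the rows were sorted in some way, a shuffled, stratified split would be a safer alternative. The sketch below illustrates that alternative with scikit-learn's `train_test_split` on a synthetic stand-in for the credit data; the toy frame and the names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: 1000 rows, 70% class 1, 30% class 2.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "amount": rng.integers(250, 18425, size=1000),
        "months_loan_duration": rng.integers(4, 73, size=1000),
        "default": np.repeat([1, 2], [700, 300]),
    }
)

X = toy.drop(columns="default")
y = toy["default"]

# Shuffle before splitting and keep the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

With `stratify=y`, both parts keep the 70/30 class ratio of the full sample.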

In [12]:
##
## Building the classification tree
##
from sklearn.tree import DecisionTreeClassifier

##
## Create the tree
##
clf = DecisionTreeClassifier()

##
## Fit it on the training data
##
clf.fit(X_train, y_train_true)

##
## Predict for the test sample
##
y_test_pred = clf.predict(X_test)

##
## Performance metrics
##
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test_true, y_test_pred)

array([[48, 20],
       [20, 12]])
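
The confusion matrix can be summarized with a few standard metrics. The following sketch recomputes them directly from the four counts printed above, taking class 2 (not repaid) as the positive class; that choice of positive class is an assumption made here for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted
# class, in the order [1 (repaid), 2 (not repaid)].
cm = np.array([[48, 20],
               [20, 12]])

# Share of test cases classified correctly.
accuracy = np.trace(cm) / cm.sum()

# Treat class 2 (not repaid) as the positive class.
true_pos = cm[1, 1]
recall = true_pos / cm[1, :].sum()      # share of actual class-2 cases detected
precision = true_pos / cm[:, 1].sum()   # share of class-2 predictions that are correct

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

The accuracy of 0.60 is lower than the 0.68 that would be obtained on this test set by always predicting class 1, since 68 of the 100 test cases belong to that class.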

In [13]:
%%R -i y_test_true -i y_test_pred
##
## The CrossTable function from the gmodels package
## provides more detailed information.
## install.packages("gmodels")
##
library(gmodels)
CrossTable(x = y_test_true, 
           y = y_test_pred,
           prop.chisq=FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
             | y_test_pred 
 y_test_true |         1 |         2 | Row Total | 
-------------|-----------|-----------|-----------|
           1 |        48 |        20 |        68 | 
             |     0.706 |     0.294 |     0.680 | 
             |     0.706 |     0.625 |           | 
             |     0.480 |     0.200 |           | 
-------------|-----------|-----------|-----------|
           2 |        20 |        12 |        32 | 
             |     0.625 |     0.375 |     0.320 | 
             |     0.294 |     0.375 |           | 
             |     0.200 |     0.120 |           | 
-------------|-----------|-----------|-----------|
Column Total |        68 |        32 |       100 | 
             |     0.680 |     0.320 |           | 
-------------|----

## Solution using R

In [14]:
%%R -i df
str(df)

'data.frame':	1000 obs. of  21 variables:
 $ checking_balance    : int  1 0 3 1 1 3 3 0 3 0 ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : int  0 4 0 4 1 4 4 4 4 0 ...
 $ purpose             : int  7 7 4 5 1 4 5 2 7 1 ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : int  4 2 2 2 2 4 1 2 3 2 ...
 $ employment_length   : int  3 1 2 2 1 1 3 1 2 4 ...
 $ installment_rate    : int  4 2 2 2 3 2 3 2 2 4 ...
 $ personal_status     : int  3 1 3 3 3 3 3 3 0 2 ...
 $ other_debtors       : int  2 2 2 1 2 2 2 2 2 2 ...
 $ residence_history   : int  4 2 3 4 4 4 4 2 4 2 ...
 $ property            : int  2 2 2 0 3 3 0 1 2 1 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ installment_plan    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ housing             : int  1 1 1 0 0 0 1 2 1 1 ...
 $ existing_credits    : int  2 1 1 1 2 1 1 1 1 2 ...
 $ default             : int  1 2 1 1 2 1 1 1

### Exploratory analysis

In [15]:
%%R
##
## Some of the columns are numeric and 
## the others are factors.
## DM stands for Deutsche Marks.
## Some values are checked against the code book.
##
table(df$checking_balance)


  0   1   2   3 
269 274  63 394 


In [16]:
%%R
table(df$savings_balance)


  0   1   2   3   4 
103  63 603  48 183 


In [17]:
%%R
##
## The loan amount ranges from 250 DM to 18,424 DM
##
summary(df$amount)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18424 


In [18]:
%%R
##
## The loan duration ranges from 4 to 72 months
##
summary(df$months_loan_duration)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0    12.0    18.0    20.9    24.0    72.0 


In [19]:
%%R
##
## The default column indicates whether there were 
## problems repaying the loan (1 = repaid, 2 = not repaid).
## This is the column to be predicted.
##
table(df$default)


  1   2 
700 300 


In [20]:
%%R
##
## Convert the default column to a 
## factor so that R can handle it
##
df$default <- factor(df$default, labels=c("No", "Yes"))
table(df$default)


 No Yes 
700 300 


### Training the model

In [21]:
%%R
##
## 90% of the data is used for training 
## and the remaining 10% for testing
##
train_sample <- 1:900

##
## Build the training and test sets. 
## Column 17 is the dependent variable
##
df_train     <- df[train_sample,]
df_test      <- df[-train_sample,]

##
## Check the proportion of repaid and unpaid loans
## in both data sets, starting with the
## training set
##
prop.table(table(df_train$default))


       No       Yes 
0.7022222 0.2977778 


In [22]:
%%R
##
## Proportion of cases in the test set
##
prop.table(table(df_test$default))


  No  Yes 
0.68 0.32 


In [23]:
%%R
library(rpart)
clf <- rpart(default ~ ., data = df_train)
summary(clf)

Call:
rpart(formula = default ~ ., data = df_train)
  n= 900 

          CP nsplit rel error    xerror       xstd
1 0.03482587      0 1.0000000 1.0000000 0.05118820
2 0.02238806      5 0.7873134 0.9962687 0.05113302
3 0.01492537      7 0.7425373 0.8805970 0.04923614
4 0.01368159      9 0.7126866 0.8582090 0.04882534
5 0.01119403     12 0.6716418 0.8619403 0.04889484
6 0.01000000     13 0.6604478 0.8395522 0.04847157

Variable importance
    checking_balance months_loan_duration               amount 
                  30                   17                   15 
      credit_history      savings_balance        other_debtors 
                   8                    6                    5 
    installment_rate                  age     existing_credits 
                   4                    4                    4 
             purpose            telephone    employment_length 
                   4                    1                    1 
          dependents             property 
    

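The `rel error` column is expressed relative to the root node, which misclassifies the 268 "Yes" cases of the training set (0.2978 × 900). The last row of the table (13 splits, `rel error` 0.6604) therefore corresponds to about 0.6604 × 268 ≈ 177 errors on the 900 training examples.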

### Evaluating the model

In [24]:
%%R
##
## Evaluate the model on the test data
## install.packages("gmodels")
##
##
library(gmodels)
y_test_pred <- predict(clf, df_test, type='class')
CrossTable(df_test$default, 
           y_test_pred,
           prop.chisq = FALSE, 
           prop.c = FALSE, 
           prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |        No |       Yes | Row Total | 
---------------|-----------|-----------|-----------|
            No |        60 |         8 |        68 | 
               |     0.600 |     0.080 |           | 
---------------|-----------|-----------|-----------|
           Yes |        23 |         9 |        32 | 
               |     0.230 |     0.090 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        83 |        17 |       100 | 
---------------|-----------|-----------|-----------|

 


## Improving the model

### Adaptive Boosting

Since the accuracy of the model presented for this case is not sufficient, adaptive boosting is applied. In this technique, many trees are trained sequentially on the data, each one giving more weight to the examples misclassified by the previous trees; for a new example, each tree predicts the class and the final classification is obtained by a weighted vote.
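
The idea can also be tried from Python. The sketch below uses scikit-learn's `AdaBoostClassifier` on synthetic data generated with `make_classification`. It is a different implementation from the boosting built into C5.0, and the data, the number of rounds and the variable names are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem used only for illustration.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

# A single, fully grown tree for reference.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Ten boosting rounds; by default each round fits a one-split tree (a stump).
boosted = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)

print("single tree accuracy:", single_tree.score(X_te, y_te))
print("boosted accuracy:   ", boosted.score(X_te, y_te))
```

Whether boosting helps depends on the data, so the two scores should be compared rather than assumed.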

In [25]:
%%R
##
## C5.0 is provided by the C50 package.
## install.packages("C50")
## The R objects df_train / df_test are used here because
## X_train and y_train_true exist only on the Python side.
##
library(C50)
clf10 <- C5.0(df_train[, names(df_train) != "default"], 
              df_train$default,
              trials = 10)    # number of trees to consider

clf10


In [26]:
%%R
## prints all the trees
summary(clf10)


The preceding result indicates 29 errors on the 900 training examples.

In [27]:
%%R
## df_test is used because X_test and y_test_true exist only in Python
y_test_pred10 <- predict(clf10, df_test[, names(df_test) != "default"])

CrossTable(df_test$default, 
           y_test_pred10,
           prop.chisq = FALSE, 
           prop.c = FALSE, 
           prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


### Using weights by error type

Weights can be introduced to make one type of error more costly than another.
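
In scikit-learn, a comparable effect can be approximated with the `class_weight` parameter of `DecisionTreeClassifier`, which reweights the training samples of each class. It is not the same mechanism as a cost matrix. The sketch below uses synthetic data; the 1:4 weights and `max_depth=4` are arbitrary values chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem used only for illustration.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=0
)
y = y + 1  # relabel 0/1 as 1/2 to mimic the default column

# Mistakes on class 2 (not repaid) weigh four times as much
# as mistakes on class 1 during training.
weighted_tree = DecisionTreeClassifier(
    class_weight={1: 1, 2: 4}, max_depth=4, random_state=0
)
weighted_tree.fit(X[:900], y[:900])

# Rows = true class, columns = predicted class, in the order [1, 2].
cm_weighted = confusion_matrix(
    y[900:], weighted_tree.predict(X[900:]), labels=[1, 2]
)
print(cm_weighted)
```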

In [28]:
%%R
matrix_dimensions <- list(predicted=c("No", "Yes"), actual=c("No", "Yes"))
str(matrix_dimensions)

List of 2
 $ predicted: chr [1:2] "No" "Yes"
 $ actual   : chr [1:2] "No" "Yes"


In [29]:
%%R
error_cost <- matrix(c(0, 1, 4, 0), # weights by error type
                     nrow = 2,
                     dimnames = matrix_dimensions)
error_cost

         actual
predicted No Yes
      No   0   4
      Yes  1   0
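
To see what these weights imply, the cost matrix can be applied to the two confusion matrices reported earlier (the scikit-learn tree and the rpart tree), reading class 1 as "No" and class 2 as "Yes". This is a small worked example, not part of the original notebook; note that the costs are transposed here so that rows are the actual class, matching the layout of the confusion matrices.

```python
import numpy as np

# Costs indexed as [actual, predicted] in the order [No, Yes]
# (the transpose of the R matrix above, which is [predicted, actual]).
cost = np.array([[0, 1],
                 [4, 0]])

# Confusion matrices reported earlier: rows = actual, columns = predicted.
cm_sklearn = np.array([[48, 20],
                       [20, 12]])
cm_rpart = np.array([[60, 8],
                     [23, 9]])

# Total cost = sum over cells of (number of cases x cost of that cell).
total_cost_sklearn = int((cm_sklearn * cost).sum())
total_cost_rpart = int((cm_rpart * cost).sum())

print(total_cost_sklearn, total_cost_rpart)
```

Under these weights both trees incur the same total cost (100), even though the rpart tree has the higher accuracy, because it misses more of the actual defaults.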


In [30]:
%%R
## df_train / df_test are used because X_train, y_train_true,
## X_test and y_test_true exist only on the Python side
library(C50)
clf_cost <- C5.0(df_train[, names(df_train) != "default"],
                 df_train$default,
                 costs = error_cost)

y_test_pred_cost <- predict(clf_cost, df_test[, names(df_test) != "default"])

CrossTable(df_test$default,
           y_test_pred_cost,
           prop.chisq = FALSE, 
           prop.c = FALSE, 
           prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))


**Activity.---** Compared with the two previous cases, how should the table produced by the previous cell be interpreted?

**Activity.---** Compute error metrics and compare the results obtained.

**Activity.---** Obtain a robust estimate of the model's performance using cross-validation.
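
As a possible starting point for the last activity, the sketch below runs 10-fold stratified cross-validation with scikit-learn on synthetic data. The data, the number of folds and the accuracy metric are assumptions; the real exercise would use the credit data prepared above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem used only for illustration.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=0
)

# Each fold keeps the class proportions of the full sample.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="accuracy"
)

print(f"mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})")
```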