# ▶▶▶ **CREDIT CARD APPLICATIONS** ◀◀◀
<p>Commercial banks receive <em>a lot</em> of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository. The structure of this notebook is as follows:</p>
<ul>
<li>First, we will start off by loading and viewing the dataset.</li>
<li>We will see that the dataset has a mixture of numerical and non-numerical features, that it contains values from different ranges, and that it contains a number of missing entries.</li>
<li>We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.</li>
<li>After our data is in good shape, we will do some exploratory data analysis to build our intuitions.</li>
<li>Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.</li>
</ul>
<p>First, loading and viewing the dataset. We find that since this data is confidential, the contributor of the dataset has anonymized the feature names.</p>

##**Import the pandas library**

In [None]:
# Import the pandas library to read the data
import pandas as pd

##**Load and display data from a .data file on our Drive**

In [None]:
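import pandas as pd
from google.colab import drive

# Mount Google Drive so the notebook can read the data file
drive.mount('/content/drive')

# The file has no header row, so pandas numbers the columns 0 to 15
cc_apps = pd.read_csv("/content/drive/MyDrive/h201S5_04JuanCondori/cc_approvals.data", header=None)

# Show the first five rows
cc_apps.head()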

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


   0      1      2  3  4  5  6     7  8  9  10  11  12   13   14  15
0  b  30.83    0.0  u  g  w  v  1.25  t  t   1   f   g  202    0   +
1  a  58.67   4.46  u  g  q  h  3.04  t  t   6   f   g   43  560   +
2  a   24.5    0.5  u  g  q  h   1.5  t  f   0   f   g  280  824   +
3  b  27.83   1.54  u  g  w  v  3.75  t  t   5   t   g  100    3   +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0   f   s  120    0   +


##**Summary, characteristics, and a look at the last records**

In [None]:
# Statistical summary of the numeric columns
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print("\n")

# Dataset characteristics; info() prints its report directly and returns None
cc_apps.info()

print("\n")

# A look at the last records
print(cc_apps.tail(17))

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

##**Handle empty fields or values**

In [None]:
# Import numpy
import numpy as np

# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.nan)

# Look again at the value of feature 0 for row 673
print(cc_apps.tail(17))

      0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  NaN  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674    a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675    a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676    a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677    b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678    a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679    a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680    b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681    b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682    b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683    b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684    b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685    b  21
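
As a hedged alternative to replacing the `'?'` markers after loading, `pandas.read_csv` can treat them as missing values while reading through its `na_values` parameter. The sketch below uses a small in-memory sample (three rows copied from the output above, with the `'?'` restored in row 673) instead of the real file; the names `sample` and `sample_apps` are illustrative only.

```python
import io

import pandas as pd

# Three rows in the same comma-separated layout as cc_approvals.data.
# The first row mimics row 673, whose feature 0 is recorded as '?'.
sample = io.StringIO(
    "?,29.50,2.0,y,p,e,h,2.0,f,f,0,f,g,00256,17,-\n"
    "a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,00260,246,-\n"
    "a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,00240,237,-\n"
)

# na_values="?" turns every '?' field into NaN at read time
sample_apps = pd.read_csv(sample, header=None, na_values="?")

# Number of missing values per column
missing_per_column = sample_apps.isna().sum()
print(missing_per_column)
```

One side effect worth knowing: a column whose only non-numeric entries were `'?'` is then parsed as numeric rather than as strings, which changes how later dtype checks treat it.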

In [None]:
# Filter the rows that have NaN in column 0
cc_apps[cc_apps[0].isna()]

       0      1      2  3  4    5    6      7  8  9  10  11  12    13    14  15
248  NaN   24.5  12.75  u  g    c   bb   4.75  t  t   2   f   g    73   444   +
327  NaN  40.83    3.5  u  g    i   bb    0.5  f  f   0   f   s  1160     0   -
346  NaN  32.25    1.5  u  g    c    v   0.25  f  f   0   t   g   372   122   -
374  NaN  28.17  0.585  u  g   aa    v   0.04  f  f   0   f   g   260  1004   -
453  NaN  29.75  0.665  u  g    w    v   0.25  f  f   0   t   g   300     0   -
479  NaN   26.5   2.71  y  p  NaN  NaN  0.085  f  f   0   f   s    80     0   -
489  NaN  45.33    1.0  u  g    q    v  0.125  f  f   0   t   g   263     0   -
520  NaN  20.42    7.5  u  g    k    v    1.5  t  t   1   f   g   160   234   +
598  NaN  20.08  0.125  u  g    q    v    1.0  f  t   1   f   g   240   768   +
601  NaN  42.25   1.75  y  p  NaN  NaN    0.0  f  f   0   t   g   150     1   -


In [None]:
# Impute the missing values in the numeric columns with the column mean
cc_apps.fillna(cc_apps.mean(numeric_only=True), inplace=True)

# Count the NaNs per column; the non-numeric columns still have missing values
cc_apps.isnull().sum()

  


0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [None]:
# Loop over each column of cc_apps
for col in cc_apps.columns:
    # Check whether the column is of type 'object'
    if cc_apps[col].dtypes == 'object':
        # Impute this column with its own most frequent value
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().index[0])

# Count the NaNs in the dataset again to verify
cc_apps.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64
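
The two imputation steps can be checked on a toy table where the right answers are easy to see by eye. This is a minimal sketch with made-up data: the DataFrame `toy` and its column names are hypothetical and not part of the credit card dataset. Numeric columns get the column mean, and every other column gets its own most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "amount": [1.0, np.nan, 3.0, 2.0],
    "grade": ["a", "b", "a", np.nan],
    "flag": ["t", np.nan, "f", "t"],
})

# Step 1: fill numeric columns with their mean (non-numeric columns are skipped)
toy = toy.fillna(toy.mean(numeric_only=True))

# Step 2: fill each non-numeric column with that column's most frequent value
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        most_frequent = toy[col].value_counts().index[0]
        toy[col] = toy[col].fillna(most_frequent)

print(toy)
```

Assigning the result back to `toy[col]` keeps each fill value inside its own column, so one column's most frequent value never leaks into another column.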

##**Preprocess and split the dataset**

In [None]:
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Loop over the columns and check each one's data type
for col in cc_apps.columns:
    # Check whether the column is of type 'object'
    if cc_apps[col].dtypes == 'object':
        # Use LabelEncoder to convert the values to integer codes
        cc_apps[col] = le.fit_transform(cc_apps[col])
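
To see which integer each category receives, it helps to inspect the encoder's `classes_` attribute after fitting. The sketch below encodes a short made-up list of approval labels; `labels` and `encoder` are illustrative names. `LabelEncoder` sorts the distinct values, so `'+'` maps to 0 and `'-'` maps to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up approval labels in the same '+' / '-' notation as column 15
labels = ["+", "-", "-", "+"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original values; a value's position is its integer code
print(encoder.classes_.tolist())
print(encoded.tolist())
```

Because a single encoder is refitted for every column in the loop, only the mapping of the last encoded column remains available afterwards; keeping one encoder per column would be needed to decode the others.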

In [None]:
from sklearn.model_selection import train_test_split

# Drop features 11 and 13 and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop([11, 13], axis=1)
cc_apps = cc_apps.to_numpy()

# Separate the features (the 13 remaining columns) from the labels (the last column)
X, y = cc_apps[:, 0:13], cc_apps[:, 13]

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler, fit it on X_train only, and reuse that fit to scale X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
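
A scaler is usually fitted on the training split only and then applied to the test split with `transform`, so that no information from the test data shapes the preprocessing. The sketch below illustrates the effect with made-up one-column arrays (`train` and `test` are hypothetical): the test values are mapped with the training minimum and maximum, so they may fall outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: the training column spans 0 to 10
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler(feature_range=(0, 1))

# Learn the minimum and maximum from the training data
scaled_train = scaler.fit_transform(train)

# Reuse the training minimum and maximum on the test data
scaled_test = scaler.transform(test)

print(scaled_test)
```

Here 5 maps to 0.5 and 20 maps to 2.0. Values above 1 are expected whenever the test set contains something larger than anything seen in training.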

##**Train the model**

In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate the LogisticRegression classifier with its default parameters
logreg = LogisticRegression()

# Train logreg on the scaled data
logreg.fit(rescaledX_train, y_train)

LogisticRegression()

##**Evaluate performance**

In [None]:
from sklearn.metrics import confusion_matrix

# Use the logreg estimator to predict the test set instances and store the predictions
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score
print("Accuracy: ", logreg.score(rescaledX_test, y_test))

# Show the model's confusion matrix
print(confusion_matrix(y_test, y_pred))

Accuracy:  0.8377192982456141
[[93 10]
 [27 98]]
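
The confusion matrix printed above can be unpacked by hand. In scikit-learn's convention, rows are the true classes and columns are the predicted classes, in sorted label order. Assuming the label encoding assigned 0 to `'+'` (approved) and 1 to `'-'` (denied), the sketch below recomputes the accuracy and derives per-class recall and precision from the four counts; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the printed confusion matrix
cm = np.array([[93, 10],
               [27, 98]])

# Accuracy: correct predictions (the diagonal) over all 228 test instances
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions over each row total (true class counts)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions over each column total (predicted class counts)
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print("accuracy:", accuracy)
print("recall:", recall_per_class)
print("precision:", precision_per_class)
```

The accuracy works out to 191 / 228, which matches the reported score. Under the stated label assumption, the 27 in the bottom-left cell counts applications that were actually denied but predicted as approved.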


##**Tune the model**

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the grid of values for 'tol' and 'max_iter'
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary with 'tol' and 'max_iter' as keys and the lists above as their values
param_grid = dict(tol=tol, max_iter=max_iter)

In [None]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use 'scaler' again, this time to rescale the full X
rescaledX = scaler.fit_transform(X)

# Fit the grid search
grid_model_result = grid_model.fit(rescaledX, y)

# Get the hyperparameter values that give the best results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best score: %f, using %s" % (best_score, best_params))

Best score: 0.850725, using {'max_iter': 100, 'tol': 0.01}
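
One possible refinement, sketched here under assumptions rather than taken from the notebook, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` so that the scaler is refitted on the training folds inside each cross-validation split. The example uses synthetic data from `make_classification` in place of the credit card features, and the step names `"scaler"` and `"logreg"` are arbitrary labels chosen for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; not the credit card dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Scaling and classification chained into one estimator
pipeline = Pipeline([
    ("scaler", MinMaxScaler(feature_range=(0, 1))),
    ("logreg", LogisticRegression()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
demo_param_grid = {
    "logreg__tol": [0.01, 0.001, 0.0001],
    "logreg__max_iter": [100, 150, 200],
}

demo_grid = GridSearchCV(estimator=pipeline, param_grid=demo_param_grid, cv=5)
demo_grid.fit(X_demo, y_demo)

print("Best score: %f, using %s" % (demo_grid.best_score_, demo_grid.best_params_))
```

With this layout the raw, unscaled features are passed to `fit`, and each fold's validation data is scaled with statistics learned only from that fold's training data.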


# **References:**
* [Model published on AzureML](https://gallery.cortanaintelligence.com/Experiment/TarjetaCredito)
* [GitHub repository](https://github.com/juancondorijara/h201S5_04JuanCondori.git)