![Credit card being held in hand](../img/credit_card_b.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!).

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives.

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("../data/cc_approvals.data", header=None) 
cc_apps.head()

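|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |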


In [3]:
# Info and describe of cc_apps
print(cc_apps.info())
cc_apps.describe(include='all').T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    int64  
 13  13      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB
None


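|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 690.0 | 3.0 | b | 468.0 |  |  |  |  |  |  |  |
| 1 | 690.0 | 350.0 | ? | 12.0 |  |  |  |  |  |  |  |
| 2 | 690.0 |  |  |  | 4.758725 | 4.978163 | 0.0 | 1.0 | 2.75 | 7.2075 | 28.0 |
| 3 | 690.0 | 4.0 | u | 519.0 |  |  |  |  |  |  |  |
| 4 | 690.0 | 4.0 | g | 519.0 |  |  |  |  |  |  |  |
| 5 | 690.0 | 15.0 | c | 137.0 |  |  |  |  |  |  |  |
| 6 | 690.0 | 10.0 | v | 399.0 |  |  |  |  |  |  |  |
| 7 | 690.0 |  |  |  | 2.223406 | 3.346513 | 0.0 | 0.165 | 1.0 | 2.625 | 28.5 |
| 8 | 690.0 | 2.0 | t | 361.0 |  |  |  |  |  |  |  |
| 9 | 690.0 | 2.0 | f | 395.0 |  |  |  |  |  |  |  |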


- Column 1 is read as type 'object' even though it appears to hold numeric values, so it needs investigating (for example by filtering the table, or by defining a function that checks whether each value is a number).
- The cause is that missing values are recorded as '?'.
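
The check described above can be sketched as follows. This is a minimal illustration on a toy column rather than the actual dataset; the helper name `find_non_numeric` and the `sample` values are assumptions made for the example.

```python
import pandas as pd

def find_non_numeric(series: pd.Series) -> pd.Series:
    """Return the entries of `series` that cannot be parsed as numbers."""
    # errors="coerce" turns anything unparseable into NaN instead of raising
    parsed = pd.to_numeric(series, errors="coerce")
    # Entries that were present but became NaN after parsing are non-numeric
    mask = parsed.isna() & series.notna()
    return series[mask]

# Toy column (hypothetical values) mimicking numeric strings mixed with '?'
sample = pd.Series(["30.83", "58.67", "?", "24.5", "?"])
bad_values = find_non_numeric(sample)
print(bad_values.value_counts())
```

Running the same helper on `cc_apps[1]` should surface the '?' placeholder as the value that keeps the column from being parsed as float.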

In [4]:
# Replace all '?' with NaN
cc_apps_transf = cc_apps.replace('?', np.nan)
# Assign by column label so the column dtype actually becomes float
cc_apps_transf[1] = cc_apps_transf[1].astype(float)

In [5]:
# Encode the target as 1 for '+' and 0 otherwise
cc_apps_transf[13] = np.where(cc_apps_transf[13] == '+', 1, 0)

In [6]:
# Select features and target
features = cc_apps_transf.copy()
target = features.pop(13)
# Convert categorical features to type 'category' for get_dummies
features.iloc[:,[0,3,4,5,6,8,9,11]] = features.iloc[:,[0,3,4,5,6,8,9,11]].astype('category')

In [7]:
# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25)

def impute_data(X_train, X_test):
    # Copy X_train and X_test to avoid changing the original data
    X_train_imputed = X_train.copy()
    X_test_imputed = X_test.copy()
    
    # Impute categorical columns (object or category dtype) with the training mode
    for col in X_train.select_dtypes(include=['object', 'category']).columns:
        mode = X_train[col].mode()[0]
        # Assign the result back instead of using inplace=True on a column selection
        X_train_imputed[col] = X_train_imputed[col].fillna(mode)
        X_test_imputed[col] = X_test_imputed[col].fillna(mode)
    
    # Impute numeric columns with the training mean
    for col in X_train.select_dtypes(include=['number']).columns:
        mean = X_train[col].mean()
        X_train_imputed[col] = X_train_imputed[col].fillna(mean)
        X_test_imputed[col] = X_test_imputed[col].fillna(mean)
    
    return X_train_imputed, X_test_imputed

# Impute the data
X_train_imputed, X_test_imputed = impute_data(X_train, X_test)

In [8]:
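# One-hot encode the categorical columns of X_train_imputed and X_test_imputed
X_train_dummies = pd.get_dummies(X_train_imputed, drop_first=True)
X_test_dummies = pd.get_dummies(X_test_imputed, drop_first=True)
# Align the test columns with the training columns; a category missing from
# one split would otherwise give the two matrices different columns
X_test_dummies = X_test_dummies.reindex(columns=X_train_dummies.columns, fill_value=0)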
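
Encoding the train and test sets separately can produce different dummy columns when a category appears in only one split, which would then break the scaler and the model. The sketch below illustrates the issue and one possible fix with `DataFrame.reindex`; the `train` and `test` frames are made-up toy data, not the credit card dataset.

```python
import pandas as pd

# Toy data (hypothetical): category 'c' appears only in the training split
train = pd.DataFrame({"num": [1.0, 2.0, 3.0], "cat": ["a", "b", "c"]})
test = pd.DataFrame({"num": [4.0, 5.0], "cat": ["a", "b"]})

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)
print(list(train_d.columns))  # includes 'cat_c'
print(list(test_d.columns))   # 'cat_c' is missing here

# Reindex the test frame on the training columns: missing dummies are
# filled with 0 and any test-only dummies would be dropped
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```

An alternative design would be to fit a single encoder on the training data and reuse it on the test data, for example with scikit-learn's `OneHotEncoder`; the reindex approach is shown here only because it stays close to the `get_dummies` workflow used in this notebook.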

In [9]:
# Scale features
scaler = StandardScaler()
X_train_dummies.columns = X_train_dummies.columns.astype(str)
X_test_dummies.columns = X_test_dummies.columns.astype(str)
X_train_scaled = scaler.fit_transform(X_train_dummies)
X_test_scaled = scaler.transform(X_test_dummies)

In [10]:
# Define the logistic regression model
log_reg = LogisticRegression()

# Define the parameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500]
}

# Perform Grid Search CV
grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# Print the best model
print("Best model parameters:", grid_search.best_params_)

# Calculate the score of the best model
best_model = grid_search.best_estimator_
score = best_model.score(X_test_scaled, y_test)
print("Accuracy of the best model:", score)

# Print the confusion matrix
y_pred = best_model.predict(X_test_scaled)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

Best model parameters: {'C': 0.1, 'max_iter': 100, 'solver': 'saga'}
Accuracy of the best model: 0.8034682080924855
Confusion Matrix:
 [[72 24]
 [10 67]]
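
As a sanity check, the reported accuracy can be recomputed from the confusion matrix printed above, along with precision and recall for the positive class. This is a small worked example that assumes scikit-learn's default layout (rows are true labels, columns are predicted labels, ordered 0 then 1).

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
conf_matrix = np.array([[72, 24],
                        [10, 67]])
# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()   # (67 + 72) / 173
precision = tp / (tp + fp)                 # 67 / 91
recall = tp / (tp + fn)                    # 67 / 77

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
# prints: accuracy=0.803, precision=0.736, recall=0.870
```

The recomputed accuracy matches the printed score of about 0.803. The 24 false positives against 10 false negatives suggest the model errs more often by predicting the positive class for applications labelled negative than the other way round.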


## Conclusions

- A logistic regression model was used, and its hyperparameters were tuned with Grid Search CV.
- The metric used to select the best model was accuracy.
- The best model achieved an accuracy of about 0.80 on the test set.