<a href="https://colab.research.google.com/github/gregbruss/credit-approvals/blob/main/creditApprovals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Approvals

The task of approving people for credit in terms of their creditworthiness is universally done by every major bank and financial institution. These organizations typically receive hundreds of thousands of these types of applications per year. The decision given in any case is a variety of factors, most typically relying on someone's income and employment status, their level of debt, and their prior credit history. Due to the nature of this type of task, it is well suited to machine learning. In this notebook I build an ML model which makes use of data from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/credit+approval).

The data has been anonymized, and is interesting as it contains a very nice mix of attributes: continuous, categorical, nominal in small ranges and in large ranges. 

Advancements in credit approval technology have saved banks and other financial institutions literally tens of thousands of hours. This analysis uses a variety of widely-known preprocessing steps such as scaling, label encoding, and missing value imputation, as well as ML techniques such as Logistic Regression, and Hyperparameter Tuning using GridSearch.

The final model is able to predict whether a person was approved or not with an accuracy of 85%.


In [2]:
# Import pandas
import pandas as pd

# Load dataset
applications = pd.read_csv('/content/drive/MyDrive/projects/Credit Approvals/datasets/cc_approvals.data', header=None)
# Inspect data
applications.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


### Inspecting the Applications

In [3]:
cc_summary_statistics = applications.describe()
print(cc_summary_statistics)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


In [4]:
# Dataframe Information
applications_info = applications.info()
print(applications_info)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB
None


Dataset contains varying data types: float64, int64, and object types.

In [5]:
applications.tail(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-
683,b,36.42,0.75,y,p,d,v,0.585,f,f,0,f,g,240,3,-
684,b,40.58,3.29,u,g,m,v,3.5,f,f,0,t,s,400,0,-
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260,0,-
686,a,22.67,0.75,u,g,c,v,2.0,f,t,2,t,g,200,394,-
687,a,25.25,13.5,y,p,ff,ff,2.0,f,t,1,t,g,200,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280,750,-
689,b,35.0,3.375,u,g,c,h,8.29,f,f,0,t,g,0,0,-


Features 2, 7, 10 and 14 have numeric values (float64 and int64) -- all other features are non-numeric.

The dataset also has varying ranges:
Feaures 1 is in therange from 2-67

The dataset has a number of missing values, which are labelled with a ?.

In [6]:
# Replace the missing value ?'s with NaNs.
import numpy as np

# Inspect the missing values
print(applications.isnull().values.sum())

# Replace the ?'s with NaNs
applications = applications.replace('?', np.nan)

# Check the missing values again to make sure they are different now
applications.tail(20)


0


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-


### Numeric Columns: Mean Imputation

Missing values cannot be ignored -- they will affect the performance of our ML models. If we ignore them, the model will miss out on information in the dataset that is useful for training purposes.

In [7]:
# Mean imputation
applications.fillna(applications.mean(), inplace=True)

# Count the number of NaNs in the dataset to verify.
# The remaining NaNs will be in the non-numeric columns
print(applications.isnull().values.sum())


67


### Non-Numeric Columns: Highest Frequency Imputation


In [8]:
# Iterate over each column
for column in applications.columns:
  if applications[column].dtypes == 'object':
    applications = applications.fillna(applications[column].value_counts().index[0])

# Count NaNs to verify
print(applications.isnull().values.sum())

0


# Data Preproccesing
Missing values have been handled. 

There are three main tasks

## 1. Convert non-numeric data to numeric.
This results in faster computation -- sklearn and XGBoost typically require data to be in a numeric format

## 2. Split the data into train, test sets
This allows us to move into the training and testing phase of developing an ML model. We can do feature selection at the same time so that we train on only the most useful subset of features.

## 3. Scale the feature values to a uniform range
Features must be scaled before being fed into the ML model. Take CreditScore as an example, which is a measurement of someone's creditworthiness -- the higher the number, the more financially trustworthy a person is.

If we scale CreditScore, a score of 1 is the highest, and 0 would be the lowest in a 0-1 range.


In [9]:
# 1. Convert non-numeric data to numeric
# This can be done using label encoding

from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
label_encoder = LabelEncoder()

# Iterate over all the values of each column and extract their dtypes

for column in applications.columns:
  if applications[column].dtypes == 'object':
    applications[column] = label_encoder.fit_transform(applications[column])

# Check if all features are now of a numeric type
applications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    int64  
 1   1       690 non-null    int64  
 2   2       690 non-null    float64
 3   3       690 non-null    int64  
 4   4       690 non-null    int64  
 5   5       690 non-null    int64  
 6   6       690 non-null    int64  
 7   7       690 non-null    float64
 8   8       690 non-null    int64  
 9   9       690 non-null    int64  
 10  10      690 non-null    int64  
 11  11      690 non-null    int64  
 12  12      690 non-null    int64  
 13  13      690 non-null    int64  
 14  14      690 non-null    int64  
 15  15      690 non-null    int64  
dtypes: float64(2), int64(14)
memory usage: 86.4 KB


In [10]:
# 2. Split data into Train, Test sets

# import train_test_split
from sklearn.model_selection import train_test_split

# Drop features 11 and 13, and convert the DataFrame to a NumPy array
applications = applications.drop([11,13], axis=1)
applications = applications.values

# Segregate features and labels into separate variables
X, y = applications[:, 0:13], applications[:, 13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

In [11]:
# 3. Scale the feature values to a uniform range

# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler to rescale Train and Test Inputs
# Namely X_train and X_test
scaler = MinMaxScaler(feature_range = (0,1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)


## Fitting a Logistic Regression Model to the Train Set

If we inspect the data, we will find that there is a higher proportion of "Denied" status applications than "Approved" status. Thus we can use this as a benchmark -- the model should be able to improve over the benchmark better than simply always choosing the highest frequency classification.


In [12]:
# Import a GLM like Logistic Regression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogReg classifier with default param values
logreg = LogisticRegression()

# Fit logreg to the training set which has been preprocessed
logreg.fit(X_train_scaled, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [13]:
# We can use a confusion matrix to check where the model is making mistakes

from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(X_test_scaled)

# Get the accuracy score of logreg
print("Accuracy of logistic regression: ", logreg.score(X_test_scaled, y_test))

Accuracy of logistic regression:  0.8377192982456141


In [14]:
# Print the confusion matrix of the logreg model
confusion_matrix(y_pred, y_test)

array([[92, 26],
       [11, 99]])

## Evaluating and interpreting model accuracy

The model yielded an accuracy score of 84%

The confusion matrix is 

| 92 | 26 |
|----|----|
| 11 | 99 |

The first element of the first row denotes the true negatives: the number of denied applications predicted correctly

The last element of the 2nd row is the true positives -- the number of approved applications predicted correctly.


                      


### Using Grid Search to squeeze out some more model performance.

Can define a grid of hyperparameter values and convert them into a single dictionary format which can be used by the sklearn package GridSearchCV



In [15]:
# Grid Search param grid

# import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a param grid
param_grid = dict(tol=tol, max_iter = max_iter)
print(param_grid)

{'tol': [0.01, 0.001, 0.0001], 'max_iter': [100, 150, 200]}


In [16]:
# Instantiate GridSearchCV() with the logreg model.

grid_model = GridSearchCV(estimator=logreg, param_grid = param_grid, cv = 5)

# Rescale features again
X_scaled = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(X_scaled, y)

# Summarize results
best_score, best_params = (grid_model_result.best_score_,
                          grid_model_result.best_params_)

print("Best: %f using %s" % (best_score, best_params))

Best: 0.850725 using {'max_iter': 100, 'tol': 0.01}


### Evaluating Results and Conclusion

In this notebook, an ML classifier was built using a Logistic Regression model, that takes into account the most important features that could be used to predict someone's credit-worthiness. GridSearch was performed to find optimal hyperparameters given the model, and an accuracy of 85% was obtained.