This project is based off of the DataCamp Project "Predicting Credit Card Approvals" by Sayak Paul.

It has been designed to use machine learning techniques as a classification task to create an automated credit card approval predictor with the Credit Card Approval dataset from the UCI Machine Learning Repository.

Starting with importing pandas and looking at the first few rows of the data below. 


In [51]:
# Import pandas
import pandas as pd

# Load dataset
creditCards = pd.read_csv('cc_approvals.data')

# Inspect data
creditCards.head()

Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,+


We can see that there is a combination of both numerical and non-numerical details and some missing entries. The data has also been anonymized the feature names for confidentiality.

The probable features in a typical credit card application are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus. This gives us an indication to the columns in the output.

We will look at a summary of the dataset and the DataFrame information below.

In [52]:
# Print summary of dataset
creditCards_sum = creditCards.describe()
print(creditCards_sum)

# Print DataFrame info of dataset
creditCards_info = creditCards.info()
print(creditCards_info)

# look at last few lines of data for missing values in the dataset
creditCards.tail()

                0        1.25          01            0.1
count  689.000000  689.000000  689.000000     689.000000
mean     4.765631    2.224819    2.402032    1018.862119
std      4.978470    3.348739    4.866180    5213.743149
min      0.000000    0.000000    0.000000       0.000000
25%      1.000000    0.165000    0.000000       0.000000
50%      2.750000    1.000000    0.000000       5.000000
75%      7.250000    2.625000    3.000000     396.000000
max     28.000000   28.500000   67.000000  100000.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   b       689 non-null    object 
 1   30.83   689 non-null    object 
 2   0       689 non-null    float64
 3   u       689 non-null    object 
 4   g       689 non-null    object 
 5   w       689 non-null    object 
 6   v       689 non-null    object 
 7   1.25    689 non-null    float64
 8   t      

Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
684,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260,0,-
685,a,22.67,0.75,u,g,c,v,2.0,f,t,2,t,g,200,394,-
686,a,25.25,13.5,y,p,ff,ff,2.0,f,t,1,t,g,200,1,-
687,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280,750,-
688,b,35.0,3.375,u,g,c,h,8.29,f,f,0,t,g,0,0,-


To allow the model to perform correctly the missing values will be prepossessed. 

Some missing values are labeled with '?' and we will replace these missing value question marks with NaN.

In [53]:
# Replace the '?'s with NaN
creditCards = creditCards.replace('?', np.NaN)

# Inspect the missing values again
creditCards.tail()

Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
684,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,260,0,-
685,a,22.67,0.75,u,g,c,v,2.0,f,t,2,t,g,200,394,-
686,a,25.25,13.5,y,p,ff,ff,2.0,f,t,1,t,g,200,1,-
687,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,280,750,-
688,b,35.0,3.375,u,g,c,h,8.29,f,f,0,t,g,0,0,-


We will now use mean imputation to replaced missing values with the mean of the available cases pre-processing the missing values within the numeric columns.

Columns with non-numeric data will be input these missing values with the most frequent values as present in the respective columns. To verify the conversion of the missing values we will print their number in the dataset.

In [54]:
# Impute the missing values with mean imputation
creditCards.fillna(np.NaN, inplace=True)

# Count the number of NaNs in the dataset to verify
creditCards.fillna(creditCards.mean(), inplace = True)

# Iterate over each column of creditCards
for col in creditCards:
    # Check if the column is of object type
    if creditCards[col].dtypes == 'object':
        # Impute with the most frequent value
        creditCards = creditCards.fillna(creditCards[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
print(creditCards.isnull().sum())

b        0
30.83    0
0        0
u        0
g        0
w        0
v        0
1.25     0
t        0
t.1      0
01       0
f        0
g.1      0
00202    0
0.1      0
+        0
dtype: int64


To run the machine learning model further prepossessing is needed to convert all data to numeric for faster computation to be able to divide the data into train and test sets. This will be done using label encoding.


In [55]:
# Import LabelEncoder
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in creditCards:
    # Compare if the dtype is object
    if creditCards[col].dtypes=='object':
    # Use LabelEncoder to do the numeric transformation
        creditCards[col]=le.fit_transform(creditCards[col])


Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. With two thirds of the data used for a training model and a third used for testing. 


In [56]:

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                y,
                                test_size=0.33,
                                random_state=42)


We will pre-process the data through scaling before we can fit a machine learning model to the data.

Assuming the CreditScore column will represent their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a CreditScore of 1 is the highest since we're rescaling all the values to the range of 0-1 using MinMaxScaler.

In [57]:
# Import MinMaxScaler
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0,1))
scaledX_train = scaler.fit_transform(X_train)
scaledX_test = scaler.fit_transform(X_test)


According to UCI, the dataset has higher instances of "Denied" status than instances corresponding to "Approved" status. Out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.This will give a benchmark for our results.

We will create a genaralised linear model with by converting to a logistic regression model and then evaluate the confusion matrix to measure the classification accuracy to predict the approval status of the applications as denied that originally got denied. .

In [58]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(scaledX_train,y_train)

# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(scaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(scaledX_test,y_test))

# Print the confusion matrix of the logreg model
confusion_matrix(y_test,y_pred)

Accuracy of logistic regression classifier:  0.8377192982456141


array([[92, 11],
       [26, 99]])

We can see that the model has created an accuracy of over 83%. 

The confusion matrix shows the first element of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.

If we were to improve the model we can use a grid search of the parameters. Defining the different hyperparameters tol, max_iter and converting to a single dictionary format.

In [59]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001 ,0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)


Using GridSearchCV() on the first model to all data we will supply X (scaled version) and y to five folds of cross-validation.

To use the model we can find the best achieved score with the respective best parameters had a score of 0.85 accuracy.

In [60]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
scaledX = scaler.fit_transform(X)

# Fit grid_model to the data
grid_model_result = grid_model.fit(scaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.850725 using {'max_iter': 100, 'tol': 0.01}
