## Credit card applications
Commercial banks receive numerous credit card applications, and many of them are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Analyzing these applications manually is tedious, error-prone, and time-consuming, but the task can be automated with machine learning. This notebook builds an automatic credit card approval predictor using machine learning techniques.

The dataset selected for this project is the Credit Card Approval dataset from the UCI Machine Learning Repository, which can be accessed through this link: http://archive.ics.uci.edu/ml/datasets/credit+approval.

In [1]:
# Import pandas and NumPy
import pandas as pd
import numpy as np

# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header = None)

# Inspect data
cc_apps.head(5)


   0      1      2  3  4  5  6     7  8  9  10  11  12   13   14  15
0  b  30.83    0.0  u  g  w  v  1.25  t  t   1   f   g  202    0   +
1  a  58.67   4.46  u  g  q  h  3.04  t  t   6   f   g   43  560   +
2  a   24.5    0.5  u  g  q  h   1.5  t  f   0   f   g  280  824   +
3  b  27.83   1.54  u  g  w  v  3.75  t  t   5   t   g  100    3   +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0   f   s  120    0   +


Upon our initial inspection of the data, we can observe that the dataset contains both numerical and non-numerical features. While this can be addressed through preprocessing, it is important to explore the dataset further to identify any additional issues that may require attention.

In [2]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print('\n')

# Print DataFrame information (info() prints its report directly and returns None)
cc_apps.info()

print('\n')

# Inspect missing values in the dataset
cc_apps.tail(17)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

     0      1       2  3  4   5   6      7  8  9  10  11  12   13   14  15
673  ?   29.5     2.0  y  p   e   h    2.0  f  f   0   f   g  256   17   -
674  a  37.33     2.5  u  g   i   h   0.21  f  f   0   f   g  260  246   -
675  a  41.58    1.04  u  g  aa   v  0.665  f  f   0   f   g  240  237   -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12   t   g  129    3   -
677  b  19.42    7.25  u  g   m   v   0.04  f  t   1   f   g  100    1   -
678  a  17.92   10.21  u  g  ff  ff    0.0  f  f   0   f   g    0   50   -
679  a  20.08    1.25  u  g   c   v    0.0  f  f   0   f   g    0    0   -
680  b   19.5    0.29  u  g   k   v   0.29  f  f   0   f   g  280  364   -
681  b  27.83     1.0  y  p   d   h    3.0  f  f   0   f   g  176  537   -
682  b  17.08    3.29  u  g   i   v  0.335  f  f   0   t   g  140    2   -
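Scanning the tail by eye works for a small file, but a per-column count of the `?` marker gives a more systematic picture. The sketch below uses a few made-up rows in place of `cc_apps`; the frame name `sample` and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for cc_apps; the values are illustrative, not taken from the dataset
sample = pd.DataFrame({
    0: ["b", "?", "a", "a"],
    1: ["30.83", "29.5", "?", "41.58"],
    2: [0.0, 2.0, 2.5, 1.04],
})

# Compare every cell with the marker and count the matches per column
question_mark_counts = (sample == "?").sum()
print(question_mark_counts)
```

Running the same comparison on `cc_apps` would list how many `?` entries each column holds.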


Our next step is to split the data into a training set and a testing set, to prepare for the different stages of machine learning modeling: training and testing. It is important to keep the test data separate from the training data, to avoid any potential data leakage. We will preprocess the data after the split to ensure that no information from the test data is used in the training process.

Furthermore, features such as "DriversLicense" and "ZipCode" (columns 11 and 13) are less informative than the other features for predicting credit card approvals, so we drop them before splitting.

In [3]:
# Import train_test_split
from sklearn.model_selection import train_test_split


# Drop the features 11 and 13
cc_apps = cc_apps.drop([11,13], axis=1)

# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)
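
The split above is purely random, so the share of approved and denied applications can differ slightly between the two partitions. If that matters, `train_test_split` accepts a `stratify` argument. The notebook does not use it; the following is a minimal sketch on a made-up frame (`toy` and its columns are hypothetical).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 40/60 class balance in the label column (values are made up)
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["+"] * 40 + ["-"] * 60,
})

# stratify keeps the label proportions the same in both partitions
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["label"]
)

print(toy_train["label"].value_counts(normalize=True))
print(toy_test["label"].value_counts(normalize=True))
```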

The dataset contains both numeric and non-numeric data: features 2, 7, 10, and 14 hold numeric values (of types float64, float64, int64, and int64 respectively), and the remaining features hold non-numeric values. The value ranges also differ widely across the numeric features: feature 2 spans 0 to 28, feature 7 spans 0 to 28.5, feature 10 spans 0 to 67, and feature 14 spans 0 to 100,000. Summary statistics such as mean, min, and max are available only for the numeric features.
The dataset also contains missing values, denoted by '?', which are visible in the output of the `tail()` call in the second cell. We need to replace them with NaN so that pandas treats them as missing.

In [4]:
# Import numpy
import numpy as np 

# Replace the '?'s with NaN in the train and test sets
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
cc_apps_train = cc_apps_train.replace("?", np.nan)
cc_apps_test = cc_apps_test.replace("?", np.nan)
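
An alternative to replacing the markers after loading is to let `read_csv` mark them while parsing, through its `na_values` parameter. This is not what the notebook does; the sketch below parses two made-up rows from an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

# Two made-up rows in the same comma-separated layout as the data file
raw = io.StringIO("b,30.83,0.0,+\n?,29.5,2.0,-\n")

# na_values tells read_csv to treat '?' as missing while parsing
apps = pd.read_csv(raw, header=None, na_values="?")
print(apps.isnull().sum())
```

A side effect of this approach is that numeric columns containing `?` are parsed as floats rather than strings, which changes which columns later count as categorical.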



In [5]:
import warnings
warnings.simplefilter("ignore")

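# Impute missing values in the numeric columns with the training-set mean
# (numeric_only=True is needed because the frames also hold non-numeric columns)
train_means = cc_apps_train.mean(numeric_only=True)
cc_apps_train = cc_apps_train.fillna(train_means)
cc_apps_test = cc_apps_test.fillna(train_means)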

# Count the number of NaNs in the datasets and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())



0     8
1     5
2     0
3     6
4     6
5     7
6     7
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     4
1     7
2     0
3     0
4     0
5     2
6     2
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
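
The counts above still show missing values in column 1 even though its sample values (30.83, 58.67, and so on) look numeric: the column was read as `object`, so mean imputation does not touch it. Assuming the column really is numeric apart from the `?` markers, one option is to convert it before imputing. The notebook does not take this step; the following is a sketch on made-up values.

```python
import numpy as np
import pandas as pd

# Toy train/test frames; column 1 is stored as strings with NaN gaps (made-up values)
toy_train = pd.DataFrame({1: ["30.0", np.nan, "50.0"], 2: [0.0, 4.0, 0.5]})
toy_test = pd.DataFrame({1: [np.nan, "24.5"], 2: [1.5, 5.0]})

# Convert the string column to floats; unparseable entries become NaN
toy_train[1] = pd.to_numeric(toy_train[1], errors="coerce")
toy_test[1] = pd.to_numeric(toy_test[1], errors="coerce")

# Impute both sets with means computed on the training set only
train_means = toy_train.mean(numeric_only=True)
toy_train = toy_train.fillna(train_means)
toy_test = toy_test.fillna(train_means)
print(toy_test)
```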


The numeric columns contain no missing values, but columns 0, 1, 3, 4, 5, and 6 still do. Because these columns hold non-numeric data, mean imputation does not apply to them. Instead, we replace each missing value with the most frequent value in its column, computed on the training set.

In [6]:
# Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    # Check if the column is of object type
    if cc_apps_train[col].dtype == 'object':
        # Most frequent value of this column, taken from the training set only
        most_frequent = cc_apps_train[col].value_counts().index[0]
        # Fill only this column; a frame-wide fillna would also overwrite NaNs in other columns
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)

# Count the number of NaNs in the dataset and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
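
The same most-frequent imputation can also be expressed with scikit-learn's `SimpleImputer`, which learns the fill values from the training set and reuses them on the test set. This is an alternative to the loop above, not the notebook's own method; the frames below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical frames with gaps (made-up values, hypothetical column names)
toy_train = pd.DataFrame(
    {"0": ["a", "b", "b", np.nan], "3": ["u", "u", np.nan, "y"]}, dtype=object
)
toy_test = pd.DataFrame({"0": [np.nan, "a"], "3": ["y", np.nan]}, dtype=object)

# Learn the most frequent value per column on the training set only
imputer = SimpleImputer(strategy="most_frequent")
train_imputed = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)

# Apply the training-set fill values to the test set
test_imputed = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
print(test_imputed)
```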


In [7]:
# Convert the categorical features in the train and test sets independently
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

# Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)
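
The reindexing step is needed because `get_dummies` encodes each set on its own, so a category seen in only one set produces mismatched columns. scikit-learn's `OneHotEncoder` handles the same situation by learning the categories from the training set. The sketch below is an alternative under that assumption, using a made-up column in which one category appears only in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column; "s" appears only in the test set (made-up values)
toy_train = pd.DataFrame({"12": ["g", "p", "g"]})
toy_test = pd.DataFrame({"12": ["g", "s"]})

# Learn the categories from the training set; unseen test categories encode as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(toy_train).toarray()
test_encoded = encoder.transform(toy_test).toarray()

print(encoder.categories_)
print(test_encoded)
```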


The last preprocessing step remaining is scaling the data before fitting a machine learning model.


In [8]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Segregate features and labels into separate variables
X_train, y_train = cc_apps_train.iloc[:,:-1].values, cc_apps_train.iloc[:,[-1]].values
X_test, y_test = cc_apps_test.iloc[:,:-1].values, cc_apps_test.iloc[:,[-1]].values

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler()
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
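
For reference, `MinMaxScaler` with default settings maps each feature to (x - min) / (max - min), where min and max come from the data it was fitted on. The sketch below checks this on made-up arrays standing in for `X_train` and `X_test`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrices standing in for X_train and X_test
toy_train = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 500.0]])
toy_test = np.array([[2.5, 700.0]])

# Manual min-max scaling using statistics from the training set only
train_min = toy_train.min(axis=0)
train_max = toy_train.max(axis=0)
manual_test = (toy_test - train_min) / (train_max - train_min)

# The same transformation through scikit-learn
scaler = MinMaxScaler().fit(toy_train)
sklearn_test = scaler.transform(toy_test)

print(manual_test, sklearn_test)
```

Because the statistics come from the training set, test values outside the training range land outside [0, 1], as the second feature does here.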

In [9]:
# Import LogisticRegression

from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set

logreg.fit(rescaledX_train, y_train)

When predicting credit card approvals, the model needs to perform well on both classes, approved and denied, roughly in proportion to how often each occurs in the data. Accuracy alone can hide a model that, for example, approves applications that should have been denied. A confusion matrix breaks the predictions down by class and makes such errors visible.

In [10]:
# Import confusion_matrix

from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# Print the confusion matrix of the logreg model

confusion_matrix(y_test, y_pred)

Accuracy of logistic regression classifier:  1.0


array([[103,   0],
       [  0, 125]], dtype=int64)
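
For a binary problem, scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, both in sorted label order, so it can be unpacked into four counts. A small sketch with made-up labels (not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 is the positive class, 0 the negative class
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, both in sorted label order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision and recall for the positive class
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```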

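The model scores 100% on the test set, but a perfect score here should be read as a warning sign rather than as a result. `pd.get_dummies` was applied to the whole frame, so the label in column 15 was expanded into two indicator columns, `15_+` and `15_-`. The slice `iloc[:,:-1]` then leaves `15_+` in the feature matrix while `15_-` becomes the target, so the model can read the answer directly from its inputs. Because the target is the `15_-` indicator, the positive class is "denied": the first element of the first row of the confusion matrix (103) counts the true negatives, that is, approved applications correctly predicted as not denied, and the last element of the second row (125) counts the true positives, that is, denied applications correctly predicted as denied.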
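
One way to keep the label out of the feature matrix is to separate it before one-hot encoding. The following is a minimal sketch on made-up rows, assuming column 15 holds the label as in the notebook; the names `toy`, `X`, and `y` are illustrative.

```python
import pandas as pd

# Toy frame mimicking the notebook's layout: column 15 is the label (made-up rows)
toy = pd.DataFrame({
    0: ["b", "a", "a", "b"],
    2: [0.0, 4.46, 0.5, 1.54],
    15: ["+", "+", "-", "-"],
})

# Pull the label out first, encoding '+' as 1 and '-' as 0
y = (toy[15] == "+").astype(int)

# One-hot encode only the remaining feature columns
X = pd.get_dummies(toy.drop(columns=[15]))

print(X.columns.tolist())
print(y.tolist())
```

With this layout, no column derived from the label can end up among the features, and any accuracy reported afterwards reflects what the model learned from the applicant data alone.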