## To Approve, Or Not to Approve: Automating the Credit Card Approval Process

**Introduction**

Credit Card Applications can often be rejected by Commerical Banks for a variety of reasons, such as low income levels, excessive debt, or a poor credit score. To process every application manually would be far too time consuming and prone to errors to be feasable. With the Machine Learning techniques used in this notebook, this process can be automated to save the bank both time and rescources. 

**Description of Data**

The Data used in this notebook is an aggregation of Credit Card Applications taken from the UCI Machine Learning Repository, provided here: http://archive.ics.uci.edu/ml/datasets/credit+approval. As it contains sensitive informations, all attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data.

**Loading the Dataset into the Notebook in a readable format.**

In [46]:
# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv('archive (5).zip', header = None)

# Inspect data
cc_apps.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


**2. Cleaning and processing the data before proceeding with exploratory analysis.**

 * Missing Values will be imputed using Mean Imputation. 
 * Non-Numeric Data will be converted to numeric. 
 * Features will be scalled to a uniform range. 

In [47]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

print("\n")

# Inspect missing values in the dataset
cc_apps.tail(17)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


In [48]:
# Import numpy
import numpy as np

# Inspect missing values in the dataset
print(cc_apps.tail(17))

# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.NaN)

# Inspect the missing values again
print(cc_apps.tail(17))

    0      1       2  3  4   5   6      7  8  9   10 11 12   13   14 15
673  ?   29.5   2.000  y  p   e   h  2.000  f  f   0  f  g  256   17  -
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  260  246  -
675  a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  240  237  -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  129    3  -
677  b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  100    1  -
678  a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g    0   50  -
679  a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g    0    0  -
680  b   19.5   0.290  u  g   k   v  0.290  f  f   0  f  g  280  364  -
681  b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  176  537  -
682  b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  140    2  -
683  b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  240    3  -
684  b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  400    0  -
685  b  21.08  10.085  y  p   e   h  1.250  f  f   0  f  g  260 

In [49]:
# Impute the missing values with mean imputation
cc_apps.fillna(cc_apps.mean(), inplace=True)

# Count the number of NaNs in the dataset to verify
cc_apps.isnull().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [50]:
# Iterate over each column of cc_apps
for col in cc_apps.columns:
    # Check if the column is of object type
    if cc_apps[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_apps = cc_apps.fillna(cc_apps[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
print(cc_apps.isnull().sum())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64


In [51]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in cc_apps.columns:
    # Compare if the dtype is object
    if cc_apps[col].dtype =='object':
    # Use LabelEncoder to do the numeric transformation
        cc_apps[col]=le.fit_transform(cc_apps[col])

**3. Splitting the dataset into train and test sets.**

 * Data is plit into a training set and a testing set. 

 * Irrelevant Data (i.e. DriversLicense and ZipCode) are dropped from the Dataset to insure that only the best features are used in the model.

In [52]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop([11, 13], axis=1)
cc_apps = cc_apps.to_numpy()

# Segregate features and labels into separate variables
X,y = cc_apps[:,0:13] , cc_apps[:,13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                y,
                                test_size=0.33,
                                random_state=42)

In [53]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)

**9. Training Set is fitted to a Logistical Regression Model.**

In [54]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)

LogisticRegression()

**10. Evaluating the Model using a confusation matrix.**

In [55]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(X_test, y_test))

# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))

Accuracy of logistic regression classifier:  0.45614035087719296
[[92 11]
 [26 99]]


**11. Fine tuning the model to improve the accuracy score.**

 * A Gridsearch is performed with hyperparameter values and converted into dictionary format for use in the GridSearchCV() fucntion. This will determine which values perform best. 

 * Then, GridSearchCV() will be instantiated with our earlier logreg model that contains all the data. GridSearchCV() will then be used to perform a cross-validation of five folds.

 * Once the best achieved score is found with the corresponding best parameters, the notebook is ended as we have achieved maximum accuracy in predicting Credit Card Approval for Customer Applications. 


In [56]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter =[100, 150 ,200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)

In [57]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params_ = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params_))

Best: 0.850725 using {'max_iter': 100, 'tol': 0.01}


References¶

Confidential, UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets/credit+approval

Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong. Mathematics for Machine Learning

Rohan, Joseph, Grid Search for Model Tuning, https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e