# Credit Card Approval Prediction

Commercial banks receive a large number of credit card applications. Many of them are rejected for reasons such as low income or high loan balances. Manually reviewing each applicant's background to reach a decision is tedious and time-consuming. This project aims to automate the approval decision with machine learning, based on factors such as age, debt, marital status, years of employment, credit score, income, and prior default.

The data set used is the [Credit Card Approval Dataset](http://archive.ics.uci.edu/ml/datasets/credit+approval) from the UCI Machine Learning Repository.

The features of this dataset have been anonymized to protect privacy, but [this blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) gives a good overview of the probable features. In a typical credit card application these are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income, and finally ApprovalStatus, which map in that order to columns 0 through 15.
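As a minimal sketch of how the probable names could be attached to the unnamed columns, the snippet below reads three rows in the layout of `crx.data` and passes the names through the `names` argument of `pd.read_csv`. The mapping of names to column positions is an assumption taken from the blog's ordering, and the variable names `sample` and `cc_named` are hypothetical.

```python
import io
import pandas as pd

# Three rows in the layout of crx.data (values as displayed by cc_data.head()).
sample = io.StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
    "a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+\n"
)

# Probable feature names, in column order (an assumption, not confirmed by UCI).
column_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

cc_named = pd.read_csv(sample, header=None, names=column_names)
print(cc_named[["Age", "Debt", "Income", "ApprovalStatus"]])
```

The rest of the notebook keeps the integer column labels, so this naming is purely for readability.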

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [5]:
cc_data=pd.read_csv('crx.data',header=None)

In [6]:
cc_data.head()

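|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |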


In [7]:
cc_data.describe()

|   | 2 | 7 | 10 | 14 |
|---|---|---|---|---|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.165 | 0.0 | 0.0 |
| 50% | 2.75 | 1.0 | 0.0 | 5.0 |
| 75% | 7.2075 | 2.625 | 3.0 | 395.5 |
| max | 28.0 | 28.5 | 67.0 | 100000.0 |


In [8]:
cc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [9]:
# Drop DriversLicense (column 11) and ZipCode (column 13), assuming they have no bearing on credit card approval
cc_data = cc_data.drop([11,13], axis=1)

In [8]:
cc_data.head()

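|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |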


In [10]:
# Split the data set into training and testing sets (33% held out for testing)
cc_data_train, cc_data_test = train_test_split(cc_data, test_size = 0.33, random_state = 42)
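To make the split sizes concrete, here is a small sketch that applies the same `test_size` and `random_state` to a stand-in array of 690 row positions (the hypothetical `rows` array replaces the real data frame). With 690 rows, a 33% test share yields 228 test rows and 462 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 690 rows of crx.data; only the row count matters here.
rows = np.arange(690)

train_rows, test_rows = train_test_split(rows, test_size=0.33, random_state=42)

# No row ends up in both partitions.
overlap = set(train_rows) & set(test_rows)
print(len(train_rows), len(test_rows), len(overlap))
```

If class balance between the two partitions matters, `train_test_split` also accepts a `stratify` argument; the notebook does not use it.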

In [11]:
# Replace the '?' placeholders used for missing values with NaN
cc_data_train = cc_data_train.replace('?',np.nan)
cc_data_test = cc_data_test.replace('?',np.nan)

In [12]:
cc_data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       454 non-null    object 
 1   1       457 non-null    object 
 2   2       462 non-null    float64
 3   3       456 non-null    object 
 4   4       456 non-null    object 
 5   5       455 non-null    object 
 6   6       455 non-null    object 
 7   7       462 non-null    float64
 8   8       462 non-null    object 
 9   9       462 non-null    object 
 10  10      462 non-null    int64  
 11  12      462 non-null    object 
 12  14      462 non-null    int64  
 13  15      462 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 54.1+ KB
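The `info()` output shows that column 1 keeps dtype `object` even after `'?'` is replaced with NaN, because its remaining values are still strings. A minimal sketch of converting such a column with `pd.to_numeric` follows; the sample values and the variable names `ages` and `ages_numeric` are hypothetical, and the notebook itself does not perform this conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical string values in the style of column 1, with one '?' placeholder.
ages = pd.Series(["30.83", "?", "24.5"], dtype=object)

# Replacing '?' with NaN leaves the other entries as strings.
ages = ages.replace("?", np.nan)

# pd.to_numeric parses the strings and keeps NaN as a missing value.
ages_numeric = pd.to_numeric(ages)
print(ages_numeric.dtype)
print(ages_numeric.mean())  # mean ignores the missing entry
```

Once the column is numeric it can take part in mean imputation and scaling instead of being one-hot encoded value by value.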


In [14]:
# Mean imputation for the int and float columns.
# The means are computed on the training set only and reused for the test set,
# so no test-set statistics leak into training.
numeric_cols = [2, 7, 10, 14]
train_means = cc_data_train[numeric_cols].mean()
cc_data_train = cc_data_train.fillna(train_means)
cc_data_test = cc_data_test.fillna(train_means)
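The idea behind mean imputation is that the fill values are statistics of the training data, applied unchanged to the test data. A minimal sketch on toy frames (the column names `debt` and `income` and all values are hypothetical):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"debt": [1.0, np.nan, 3.0], "income": [10.0, 20.0, np.nan]})
test = pd.DataFrame({"debt": [np.nan, 5.0], "income": [np.nan, 40.0]})

# Column means from the training set only (NaN is skipped by default).
train_means = train.mean()

# fillna with a Series fills each column with the value under the matching label.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(test_filled)
```

Here the missing test values become 2.0 for `debt` and 15.0 for `income`, the training means, regardless of what the test set contains.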

In [16]:
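# For non-numeric columns, impute missing values with the most frequent value
# of that column in the training set. The fill is applied column by column;
# calling fillna on the whole frame would push one column's value into every column.
for col in cc_data_train.columns:
    if cc_data_train[col].dtype == 'object':
        most_frequent = cc_data_train[col].value_counts().index[0]
        cc_data_train[col] = cc_data_train[col].fillna(most_frequent)
        cc_data_test[col] = cc_data_test[col].fillna(most_frequent)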
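A short sketch of per-column mode imputation on toy data may help here. The frames, column names, and values are hypothetical; the point is that each column receives its own most frequent training value.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame(
    {"married": ["u", "u", np.nan, "y"], "citizen": ["g", np.nan, "s", "g"]},
    dtype=object,
)
test = pd.DataFrame({"married": [np.nan, "y"], "citizen": ["s", np.nan]}, dtype=object)

for col in train.columns:
    # mode() ignores NaN by default; take the first value if there are ties.
    most_frequent = train[col].mode().iloc[0]
    train[col] = train[col].fillna(most_frequent)
    test[col] = test[col].fillna(most_frequent)

print(test)
```

The missing `married` entries become `u` and the missing `citizen` entries become `g`, each taken from the training column of the same name.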

In [81]:
#Converting categorical data into numeric
cc_data_train = pd.get_dummies(cc_data_train)
cc_data_test = pd.get_dummies(cc_data_test)

cc_data_test = cc_data_test.reindex(columns=cc_data_train.columns, fill_value=0)
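The `reindex` step matters because the training and test sets can contain different category values, which gives `get_dummies` different output columns. A minimal sketch with hypothetical data shows both effects: a category missing from the test set is added as an all-zero column, and a category seen only in the test set is dropped.

```python
import pandas as pd

train = pd.DataFrame({"ethnicity": ["v", "h", "bb"], "debt": [0.0, 4.46, 0.5]})
test = pd.DataFrame({"ethnicity": ["v", "z"], "debt": [1.54, 5.625]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(list(test_enc.columns))   # columns before alignment

# Align the test columns to the training columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))   # columns after alignment
```

After alignment both frames have the same columns in the same order, which is what the scaler and the model expect.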

In [85]:
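# Segregate features and labels into separate variables.
# After get_dummies the last column is the dummy "15_-" (1 = denied, 0 = approved).
# Its complement "15_+" is still among the feature columns, so the label leaks into X.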
X_train, y_train = cc_data_train.iloc[:,:-1].values, cc_data_train.iloc[:,[-1]].values
X_test, y_test = cc_data_test.iloc[:,:-1].values, cc_data_test.iloc[:,[-1]].values

#Rescale X_train and X_test with MinMaxScaler
scaler = MinMaxScaler()
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
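`MinMaxScaler` maps each feature to `(x - min) / (max - min)`, where `min` and `max` come from the data passed to `fit_transform`. The sketch below checks this by hand on hypothetical one-column arrays and shows that test values outside the training range fall outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_demo = np.array([[0.0], [5.0], [10.0]])
X_test_demo = np.array([[2.5], [20.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns min=0, max=10
test_scaled = scaler_demo.transform(X_test_demo)        # reuses the training min/max

# The same transformation written out by hand.
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)
manual = (X_test_demo - col_min) / (col_max - col_min)

print(test_scaled.ravel(), manual.ravel())
```

The test value 20.0 scales to 2.0 because it is twice the training maximum; the scaler does not clip unless it is created with `clip=True`.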

In [126]:
# Instantiation and Fitting of Logistic Regression Model
logreg = LogisticRegression()

logreg.fit(rescaledX_train,np.ravel(y_train))

LogisticRegression()

In [127]:
# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of the logreg model on the held-out test set and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

Accuracy of logistic regression classifier:  1.0


In [113]:
# Print confusion matrix
confusion_matrix(y_test,y_pred)

array([[100,   0],
       [  0, 128]])
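As a sketch of how to read this matrix: scikit-learn puts true labels on the rows and predicted labels on the columns, with labels in sorted order, so for a binary problem `ravel()` yields true negatives, false positives, false negatives, and true positives. The snippet recomputes accuracy, precision, and recall from the counts shown above.

```python
import numpy as np

# Confusion matrix as printed above: rows are true labels, columns are predictions.
cm = np.array([[100, 0],
               [0, 128]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With no off-diagonal entries all three metrics equal 1.0 across the 228 test rows.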

## Conclusion

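The model reports an accuracy of 100%, and the confusion matrix shows no misclassifications on the 228 test applications: all 100 instances of class 0 and all 128 instances of class 1 are predicted correctly. Because the label is the `15_-` dummy column, class 1 corresponds to denied applications and class 0 to approved ones.

This perfect score should not be read as real predictive performance. `pd.get_dummies` was applied to the whole data frame, including the label column 15, so the feature matrix still contains `15_+`, which is the exact complement of the label `15_-`. The model can therefore read the answer directly from its inputs. A meaningful evaluation requires separating the label from the features before encoding.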
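One way to keep the label out of the features is to split it off first and let a scikit-learn `Pipeline` handle imputation, encoding, and scaling. The sketch below is not the notebook's method; it runs on synthetic stand-in data, and the column names, the label rule, and the 15% label noise are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (hypothetical names and values, not crx.data).
df = pd.DataFrame({
    "Debt": rng.uniform(0, 28, n),
    "Income": rng.integers(0, 1000, n).astype(float),
    "PriorDefault": rng.choice(["t", "f"], n).astype(object),
    "Employed": rng.choice(["t", "f"], n).astype(object),
})
# Label loosely tied to PriorDefault, with 15% of labels flipped at random.
approved = (df["PriorDefault"] == "t").to_numpy()
flip = rng.random(n) < 0.15
df["ApprovalStatus"] = np.where(approved ^ flip, "+", "-")
# Knock out a few values so the imputers have work to do.
df.loc[::25, "Debt"] = np.nan
df.loc[::30, "Employed"] = np.nan

# Separate the label BEFORE any encoding so it cannot end up in X.
X = df.drop(columns="ApprovalStatus")
y = (df["ApprovalStatus"] == "+").astype(int)

numeric_cols = ["Debt", "Income"]
categorical_cols = ["PriorDefault", "Employed"]
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", MinMaxScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])
model = Pipeline([("preprocess", preprocess), ("logreg", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model.fit(X_train, y_train)   # all statistics are learned from the training split only
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Because every preprocessing step lives inside the pipeline, the imputation values, category lists, and scaling ranges are fitted on the training split and merely applied to the test split.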