# Predicting Creadit Card Approval

<p>This notebook aims to utilize machine learning models to predict credit card approval</p>

## 1. Background
<p> Everyday, banks receive a lot of applications for credit cards. This huge amount of application can lead to slow pace of service and is very prone to human error. So now we will simulate if machine learning could take part to this process.</p>

<img src="picture/creditcard.jpg" alt = "Credict Card in Hand">

## 2. Inspecting Dataset

In [1]:
import pandas as pd

# Load dataset
credit_card_apps = pd.read_csv("datasets/cc_approvals.data", header=None)

# Inspect data
credit_card_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


<p>According to <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a>, the probable features in the dataset are <br><code>Gender</code>, <code>Age</code>,<code>Debt</code>,<code>Married</code>,<code>BankCustomer</code>,<code>EducationLevel</code>,<code>Ethnicity</code>,<br><code>YearsEmployed</code>,<code>PriorDefault</code>,<code>Employed</code>,<code>CredictScore</code>,<code>DriverLicense</code>,<br><code>Citizen</code>,<code>ZipCode</code>,<code>Income</code>, and <code>ApprovalStatus</code></p>

<p>The dataset sure has some categorical features and numerical features</p>

In [2]:
#Print summary statistics 
print(credit_card_apps.describe(),"\n")

#Print dataframe information
print(credit_card_apps.info())

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

<p>The statistics summary of numerical features in the dataset show that the dataset have different range (0 - 28, 0 - 67, and 0 - 100000).</p>

<p>On the other side, dataframe information shows that there are no nan data. But we still need to check each unique value of categorical features.</p>

### 2.1 Find the missing values of numerical features


In [3]:
missing_value = []
for col in credit_card_apps.columns:
    if ((col == 1) or (col == 13)) or (credit_card_apps[col].dtypes != 'object'):
        for i in range(len(credit_card_apps)):
            try :
                float(credit_card_apps[col].iloc[i])
            except :
                missing_value.append(credit_card_apps[col].iloc[i])
print(missing_value)

['?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?']


We found that there is no missing value in the numerical features.

### 2.2 Find the missing values of categorical features

In [4]:
for col in credit_card_apps.columns:
    if (credit_card_apps[col].dtypes == 'object') :
        print("Column : {}".format(col))
        print(credit_card_apps[col].unique())
        print("\n")

Column : 0
['b' 'a' '?']


Column : 1
['30.83' '58.67' '24.50' '27.83' '20.17' '32.08' '33.17' '22.92' '54.42'
 '42.50' '22.08' '29.92' '38.25' '48.08' '45.83' '36.67' '28.25' '23.25'
 '21.83' '19.17' '25.00' '47.75' '27.42' '41.17' '15.83' '47.00' '56.58'
 '57.42' '42.08' '29.25' '42.00' '49.50' '36.75' '22.58' '27.25' '23.00'
 '27.75' '54.58' '34.17' '28.92' '29.67' '39.58' '56.42' '54.33' '41.00'
 '31.92' '41.50' '23.92' '25.75' '26.00' '37.42' '34.92' '34.25' '23.33'
 '23.17' '44.33' '35.17' '43.25' '56.75' '31.67' '23.42' '20.42' '26.67'
 '36.00' '25.50' '19.42' '32.33' '34.83' '38.58' '44.25' '44.83' '20.67'
 '34.08' '21.67' '21.50' '49.58' '27.67' '39.83' '?' '37.17' '25.67'
 '34.00' '49.00' '62.50' '31.42' '52.33' '28.75' '28.58' '22.50' '28.50'
 '37.50' '35.25' '18.67' '54.83' '40.92' '19.75' '29.17' '24.58' '33.75'
 '25.42' '37.75' '52.50' '57.83' '20.75' '39.92' '24.75' '44.17' '23.50'
 '47.67' '22.75' '34.42' '28.42' '67.75' '47.42' '36.25' '32.67' '48.58'
 '33.58' '18.83' 

Again, the only missing value in this dataset is '?' value. <p>Note : columns 1 and 13 look like a numerical features, but remember that columns 1 and 13 are <code>Age</code> and <code>Zip Code</code>, that means these are categorical features.</p>


## 3. Handling Missing Value

We will now replace '?' with NaN

In [5]:
import numpy as np

#Replace '?' with NaN
credit_card_apps = credit_card_apps.replace('?',np.nan)

#Check for the missing value
credit_card_apps.isnull().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

After we replaced '?' with NaN, we know the number of missing value for each feature. There are no missing value in numerical features (column 2, 7, 10, 14).

### 3.1 Impute Missing Value

There are many ways to treat missing value, this time we will impute the categorical missing value with most frequent class.

In [6]:
for col in credit_card_apps.columns:
    if credit_card_apps[col].dtypes == 'object':
        credit_card_apps = credit_card_apps.fillna(credit_card_apps[col].value_counts().index[0])

In [7]:
#Check for the missing value
credit_card_apps.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

## 4. Data Preprocessing

First, we will convert the non numerical data to numerical data (non number to number) because machine learning only accepts numeric data. 

### 4.1 Label Encoding
<p> We use label encoding to convert non number data to number data.</p>

In [8]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le=LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in credit_card_apps.columns.values:
    if credit_card_apps[col].dtypes=='object':
        credit_card_apps[col]=le.fit_transform(credit_card_apps[col])

In [9]:
credit_card_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1,156,0.0,2,1,13,8,1.25,1,1,1,0,0,68,0,0
1,0,328,4.46,2,1,11,4,3.04,1,1,6,0,0,11,560,0
2,0,89,0.5,2,1,11,4,1.5,1,0,0,0,0,96,824,0
3,1,125,1.54,2,1,13,8,3.75,1,1,5,1,0,31,3,0
4,1,43,5.625,2,1,13,8,1.71,1,0,0,0,2,37,0,0


### 4.2 Feature Importance
<p> Before we proceed to data splitting, we need to know which feature act as the best predictor for our model. This time we will use <code>RandomForest</code>,<code>KNN</code>, and <code>ExtraTrees</code>to find the best features.</p>

#### 4.3.4 Choosing Feature
We will choose features that have high value in all of the feature selection methods.