## Exercise01 - Predicting Credit Card Approvals

Commercial banks receive _a lot_ of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.


### 💾 The data: `cc_approvals.csv`
<div class="alert alert-block alert-info">
 <b>Info:</b> Be careful, there are some missing data that is represented by '?'.</div>  

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. The last column in the dataset is the target value.

### Tasks :

<div class="alert alert-block alert-danger">
    🚫 <b> Restriction:</b> Please refrain from using <b>ChatGPT</b> and <b>jcopml</b> modules to complete this exercise, as doing so may hinder your learning experience.
</div>
    
1. Use supervised learning (KNN and SVM) techniques to automate the credit card approval process for banks.
    - Preproccess the data (uses sklearn modules) and apply supervised learning techniques to find the best model and parameters for the job. 
    - Perform EDA, decide whether you need to do feature engineering?
    - Perform small benchmark to check whether your model could be considerd as a "good model" comparing to random baseline.
    - Check whether the target data is balance or not?



In [1]:
import pandas as pd

# Read in the data
header_name = ['A', 'B', 'C', 'D', 'E', 'F', 'G','H','I','J','K','L','M','N']
cc_apps = pd.read_csv("cc_approvals.csv", names=header_name)

# Preview the data
cc_apps.head(100)

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K,L,M,N
0,b,30.83,0.000,u,g,w,v,1.250,t,t,1,g,0,+
1,a,58.67,4.460,u,g,q,h,3.040,t,t,6,g,560,+
2,a,24.50,0.500,u,g,q,h,1.500,t,f,0,g,824,+
3,b,27.83,1.540,u,g,w,v,3.750,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.710,t,f,0,s,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,a,28.58,3.540,u,g,i,bb,0.500,t,f,0,g,0,-
96,b,23.00,0.625,y,p,aa,v,0.125,t,f,0,g,1,-
97,b,?,0.500,u,g,c,bb,0.835,t,f,0,s,0,-
98,a,22.50,11.000,y,p,q,v,3.000,t,f,0,g,0,-


In [2]:
## check missing value
print(cc_apps.isnull().sum())

## check the value
print('------------')
for i in cc_apps.columns:
    print(cc_apps[i].unique())

## Replace the '?' with NaN
import numpy as np
df = cc_apps.replace("?", np.nan)

## check tha data
print("---------------")
print(df.isnull().sum())

display(df.head(100))


A    0
B    0
C    0
D    0
E    0
F    0
G    0
H    0
I    0
J    0
K    0
L    0
M    0
N    0
dtype: int64
------------
['b' 'a' '?']
['30.83' '58.67' '24.50' '27.83' '20.17' '32.08' '33.17' '22.92' '54.42'
 '42.50' '22.08' '29.92' '38.25' '48.08' '45.83' '36.67' '28.25' '23.25'
 '21.83' '19.17' '25.00' '47.75' '27.42' '41.17' '15.83' '47.00' '56.58'
 '57.42' '42.08' '29.25' '42.00' '49.50' '36.75' '22.58' '27.25' '23.00'
 '27.75' '54.58' '34.17' '28.92' '29.67' '39.58' '56.42' '54.33' '41.00'
 '31.92' '41.50' '23.92' '25.75' '26.00' '37.42' '34.92' '34.25' '23.33'
 '23.17' '44.33' '35.17' '43.25' '56.75' '31.67' '23.42' '20.42' '26.67'
 '36.00' '25.50' '19.42' '32.33' '34.83' '38.58' '44.25' '44.83' '20.67'
 '34.08' '21.67' '21.50' '49.58' '27.67' '39.83' '?' '37.17' '25.67'
 '34.00' '49.00' '62.50' '31.42' '52.33' '28.75' '28.58' '22.50' '28.50'
 '37.50' '35.25' '18.67' '54.83' '40.92' '19.75' '29.17' '24.58' '33.75'
 '25.42' '37.75' '52.50' '57.83' '20.75' '39.92' '24.75' '44.17

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K,L,M,N
0,b,30.83,0.000,u,g,w,v,1.250,t,t,1,g,0,+
1,a,58.67,4.460,u,g,q,h,3.040,t,t,6,g,560,+
2,a,24.50,0.500,u,g,q,h,1.500,t,f,0,g,824,+
3,b,27.83,1.540,u,g,w,v,3.750,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.710,t,f,0,s,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,a,28.58,3.540,u,g,i,bb,0.500,t,f,0,g,0,-
96,b,23.00,0.625,y,p,aa,v,0.125,t,f,0,g,1,-
97,b,,0.500,u,g,c,bb,0.835,t,f,0,s,0,-
98,a,22.50,11.000,y,p,q,v,3.000,t,f,0,g,0,-


### Data preprocess and find best model 

In [3]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='N')
y = df.N

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((552, 13), (138, 13), (552,), (138,))

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='median')),
    ("scaler", MinMaxScaler())
])

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy='most_frequent')),
    ("scaler", OneHotEncoder())
])

### KNN 

In [5]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ("numeric", numerical_pipeline, ["B", "C", "H","K", "M"]), 
    ("categoric", categorical_pipeline, ["A", "G", "I", "J", "L"]) #D", "E", "F" 
])

## pipeline
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ("prep", preprocessor),
    ("algo", KNeighborsClassifier())
])

In [6]:
## tuning and cross validation using GridSearchCV
from sklearn.model_selection import GridSearchCV

parameter = {
    "algo__n_neighbors": range(1,51,2),
    "algo__weights": ["uniform", "distance"],
    "algo__p": [1,2]
}

model = GridSearchCV(pipeline, parameter, cv=3, n_jobs=-1, verbose=1)
model.fit(X_train, y_train)
model.best_params_
print(model.score(X_train, y_train), model.best_score_, model.score(X_test, y_test))

Fitting 3 folds for each of 100 candidates, totalling 300 fits
0.8840579710144928 0.8768115942028986 0.8260869565217391


### SVM 

In [12]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipeline_svc = Pipeline ([
    ('prep', preprocessor),
    ('algo', SVC(max_iter=500))
])

svm_parameter = {
    'algo__gamma': [1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03],
    'algo__C': [1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]
}

model_svc = GridSearchCV(pipeline_svc, svm_parameter, cv=3, n_jobs=-1, verbose=1)
model_svc.fit(X_train, y_train)

print(model_svc.best_params_)
print(model_svc.score(X_train, y_train), model.best_score_, model.score(X_test, y_test))

Fitting 3 folds for each of 49 candidates, totalling 147 fits
{'algo__C': 0.1, 'algo__gamma': 0.1}
0.8713768115942029 0.8768115942028986 0.8260869565217391


### Exploratory Data Analysis 

In [11]:
## target
df.N.value_counts()

-    383
+    307
Name: N, dtype: int64