![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this notebook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [23]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
cc_apps.head()

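|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |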


### 2. Pre-processing

In this section, we review and clean the data. We will:
- check the structure and data types of the data
- check for missing values and impute them if necessary
- change data types where needed
- encode categorical data as dichotomous (0 or 1) variables



In [24]:
# first, let's check the structure, data types, and missing values
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    int64  
 13  13      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [25]:
# collect the column names to iterate through
col_names = cc_apps.columns

# iterate through and print unique values
for i in col_names:
    print(cc_apps[i].unique())

['b' 'a' '?']
['30.83' '58.67' '24.50' '27.83' '20.17' '32.08' '33.17' '22.92' '54.42'
 '42.50' '22.08' '29.92' '38.25' '48.08' '45.83' '36.67' '28.25' '23.25'
 '21.83' '19.17' '25.00' '47.75' '27.42' '41.17' '15.83' '47.00' '56.58'
 '57.42' '42.08' '29.25' '42.00' '49.50' '36.75' '22.58' '27.25' '23.00'
 '27.75' '54.58' '34.17' '28.92' '29.67' '39.58' '56.42' '54.33' '41.00'
 '31.92' '41.50' '23.92' '25.75' '26.00' '37.42' '34.92' '34.25' '23.33'
 '23.17' '44.33' '35.17' '43.25' '56.75' '31.67' '23.42' '20.42' '26.67'
 '36.00' '25.50' '19.42' '32.33' '34.83' '38.58' '44.25' '44.83' '20.67'
 '34.08' '21.67' '21.50' '49.58' '27.67' '39.83' '?' '37.17' '25.67'
 '34.00' '49.00' '62.50' '31.42' '52.33' '28.75' '28.58' '22.50' '28.50'
 '37.50' '35.25' '18.67' '54.83' '40.92' '19.75' '29.17' '24.58' '33.75'
 '25.42' '37.75' '52.50' '57.83' '20.75' '39.92' '24.75' '44.17' '23.50'
 '47.67' '22.75' '34.42' '28.42' '67.75' '47.42' '36.25' '32.67' '48.58'
 '33.58' '18.83' '26.92' '31.25' '56.50' 

As suspected, there is an odd value in columns 0, 1, 3, 4, 5, and 6: '?' is reported a few times in place of a missing entry. We will need to convert this value to `np.nan`. Column 1 is numeric but was read as text because of the '?' entries, so we will then convert it to float.

We also need to convert the final column. If the value is '+', the marker is 1; if the value is '-', the marker is 0.
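
The cleaning steps described above can be sketched on a tiny, made-up frame before running them on `cc_apps`. The `toy` DataFrame and its values are hypothetical and only mimic the layout of the real file; the meaning of '+' as an approval is taken from the notebook's own target encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the raw file: '?' marks a missing value
toy = pd.DataFrame({
    0: ["b", "a", "?"],
    1: ["30.83", "?", "24.50"],
    13: ["+", "-", "+"],
})

# Replace the '?' placeholder with a real missing value
toy = toy.replace("?", np.nan)

# Column 1 held strings only because of '?'; it can now be cast to float
toy[1] = toy[1].astype(float)

# Map the target to 1 for '+' and 0 for '-'
toy[13] = toy[13].map({"+": 1, "-": 0})

print(toy.dtypes)
print(toy)
```

After these three steps, missing entries are real `NaN` values that pandas can count and fill, and the target is numeric.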

In [26]:
# convert the ? into nan
val = np.nan
cc_apps.replace('?', val, inplace=True)

# print through again for unique values
for i in col_names:
    print(cc_apps[i].unique())

['b' 'a' nan]
['30.83' '58.67' '24.50' '27.83' '20.17' '32.08' '33.17' '22.92' '54.42'
 '42.50' '22.08' '29.92' '38.25' '48.08' '45.83' '36.67' '28.25' '23.25'
 '21.83' '19.17' '25.00' '47.75' '27.42' '41.17' '15.83' '47.00' '56.58'
 '57.42' '42.08' '29.25' '42.00' '49.50' '36.75' '22.58' '27.25' '23.00'
 '27.75' '54.58' '34.17' '28.92' '29.67' '39.58' '56.42' '54.33' '41.00'
 '31.92' '41.50' '23.92' '25.75' '26.00' '37.42' '34.92' '34.25' '23.33'
 '23.17' '44.33' '35.17' '43.25' '56.75' '31.67' '23.42' '20.42' '26.67'
 '36.00' '25.50' '19.42' '32.33' '34.83' '38.58' '44.25' '44.83' '20.67'
 '34.08' '21.67' '21.50' '49.58' '27.67' '39.83' nan '37.17' '25.67'
 '34.00' '49.00' '62.50' '31.42' '52.33' '28.75' '28.58' '22.50' '28.50'
 '37.50' '35.25' '18.67' '54.83' '40.92' '19.75' '29.17' '24.58' '33.75'
 '25.42' '37.75' '52.50' '57.83' '20.75' '39.92' '24.75' '44.17' '23.50'
 '47.67' '22.75' '34.42' '28.42' '67.75' '47.42' '36.25' '32.67' '48.58'
 '33.58' '18.83' '26.92' '31.25' '56.50' 

That worked well. Now we can convert column 1 to a float data type. 

In [27]:
# change to float data type
cc_apps[1] = cc_apps[1].astype('float')
cc_apps.head()

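|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |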


Now that we have replaced all the '?' values with `np.nan`, we can impute the missing values. If the column is an object type, we will fill it with the most frequent value. If the column is numeric, we will fill it with the mean.

In [28]:
# keep an un-imputed copy for reference
cc_impute = cc_apps.copy()

# impute
# for object type, fill with the most frequent value
# for numeric types, fill with the mean value
for i in col_names:
    if cc_apps[i].dtype == 'object':
        # value_counts() sorts by frequency, so the first index entry is the most common value
        most_frequent = cc_apps[i].value_counts().index[0]
        cc_apps[i] = cc_apps[i].fillna(most_frequent)
    else:
        cc_apps[i] = cc_apps[i].fillna(cc_apps[i].mean())
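
To make the imputation rule concrete, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its column names are hypothetical, and `pd.api.types.is_numeric_dtype` is used here in place of the notebook's `dtype == 'object'` check purely as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "category": ["u", "u", "y", np.nan],
    "amount": [1.0, 2.0, np.nan, 6.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # numeric column: fill with the mean of the observed values
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # non-numeric column: fill with the most frequent value
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

In a stricter workflow, the fill values would be computed on the training split only and then applied to the test split, so that no information from the test rows influences preprocessing.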

In [29]:
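# encode the target in column 13: '+' becomes 1 and '-' becomes 0,
# so that the positive class is an approval
# (assigning the result avoids chained in-place replacement, which newer pandas versions discourage)
cc_apps[13] = cc_apps[13].map({'+': 1, '-': 0})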
cc_apps[13].value_counts()

0    383
1    307
Name: 13, dtype: int64

In [30]:
# check the values of each column now
# print through again for unique values
for i in col_names:
    print(cc_apps[i].unique())

['b' 'a']
[30.83       58.67       24.5        27.83       20.17       32.08
 33.17       22.92       54.42       42.5        22.08       29.92
 38.25       48.08       45.83       36.67       28.25       23.25
 21.83       19.17       25.         47.75       27.42       41.17
 15.83       47.         56.58       57.42       42.08       29.25
 42.         49.5        36.75       22.58       27.25       23.
 27.75       54.58       34.17       28.92       29.67       39.58
 56.42       54.33       41.         31.92       41.5        23.92
 25.75       26.         37.42       34.92       34.25       23.33
 23.17       44.33       35.17       43.25       56.75       31.67
 23.42       20.42       26.67       36.         25.5        19.42
 32.33       34.83       38.58       44.25       44.83       20.67
 34.08       21.67       21.5        49.58       27.67       39.83
 31.56817109 37.17       25.67       34.         49.         62.5
 31.42       52.33       28.75       28.58       22.5  

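Our replacement approach has worked! Finally, we need to encode the categorical (object) columns as binary 0/1 indicator columns. This will be our one-hot encoding process; with `drop_first=True`, the first level of each column is dropped because it is implied by the remaining indicators.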
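
As a small illustration of this encoding, the sketch below applies `pd.get_dummies` to a made-up column. The `toy` frame and its values are hypothetical; `dtype=int` is passed only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical categorical column with three levels: l, u, y
toy = pd.DataFrame({"3": ["u", "y", "l", "u"]})

# Levels are sorted, and drop_first=True drops the first one ("l"),
# leaving k-1 indicator columns for k levels
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies)
```

A row with zeros in every indicator column belongs to the dropped level, so no information is lost.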

In [31]:
# retrieve only object columns
cc_cat = cc_apps.select_dtypes(include='object')
cc_cat.head()

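|   | 0 | 3 | 4 | 5 | 6 | 8 | 9 | 11 |
|---|---|---|---|---|---|---|---|---|
| 0 | b | u | g | w | v | t | t | g |
| 1 | a | u | g | q | h | t | t | g |
| 2 | a | u | g | q | h | t | f | g |
| 3 | b | u | g | w | v | t | t | g |
| 4 | b | u | g | w | v | t | f | s |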


In [32]:
# create the dummy variables
cc_cat_dummies = pd.get_dummies(cc_cat, drop_first=True)
cc_cat_dummies.head()

Unnamed: 0,0_b,3_u,3_y,4_gg,4_p,5_c,5_cc,5_d,5_e,5_ff,5_i,5_j,5_k,5_m,5_q,5_r,5_w,5_x,6_dd,6_ff,6_h,6_j,6_n,6_o,6_v,6_z,8_t,9_t,11_p,11_s
0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,1,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0
3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,1,0,0
4,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,1


In [33]:
# merge the datasets together
# create copy of the dataset
cc_dummies = cc_apps.copy()

# drop the current categorical variables
cc_apps = cc_apps.select_dtypes(exclude='object')

# concat with dummy variables
cc_apps = pd.concat([cc_apps, cc_cat_dummies], axis='columns')
cc_apps.head()

Unnamed: 0,1,2,7,10,12,13,0_b,3_u,3_y,4_gg,4_p,5_c,5_cc,5_d,5_e,5_ff,5_i,5_j,5_k,5_m,5_q,5_r,5_w,5_x,6_dd,6_ff,6_h,6_j,6_n,6_o,6_v,6_z,8_t,9_t,11_p,11_s
0,30.83,0.0,1.25,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,1,0,0
1,58.67,4.46,3.04,6,560,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0
2,24.5,0.5,1.5,0,824,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0
3,27.83,1.54,3.75,5,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,1,0,0
4,20.17,5.625,1.71,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,1


We are now prepped and ready for the next step of our modelling.

### 3. Prepare for data modelling

Here we are going to set up the data for the model. We are going to:
- create the target variable
- split the data
- scale numeric variables

In [34]:
# set up the X and y values
X = cc_apps.drop(columns=[13])
y = cc_apps[13]

# scikit-learn expects feature names to share one type; the columns mix ints and strings, so cast them all to str
X.columns = X.columns.astype(str)

In [35]:
# now split the data from X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [36]:
# scaling the data
scaler = StandardScaler()

# fit the scaler on the training data only, then apply the same transformation to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled

array([[-4.66966021e-01,  7.61171798e-01, -2.25596773e-01, ...,
        -8.86796350e-01, -1.13331534e-01, -3.12114573e-01],
       [-2.23908473e-01, -8.66411576e-01, -5.62094055e-01, ...,
        -8.86796350e-01, -1.13331534e-01, -3.12114573e-01],
       [-9.16280150e-01, -8.41870210e-01, -3.49726882e-01, ...,
        -8.86796350e-01, -1.13331534e-01, -3.12114573e-01],
       ...,
       [ 2.78521244e+00, -9.07641070e-01, -6.48835577e-01, ...,
         1.12765462e+00, -1.13331534e-01, -3.12114573e-01],
       [ 2.73165805e-03,  1.26868724e+00, -6.48835577e-01, ...,
        -8.86796350e-01, -1.13331534e-01, -3.12114573e-01],
       [ 1.42357667e+00, -2.52877433e-01,  3.98044857e-01, ...,
        -8.86796350e-01, -1.13331534e-01,  3.20395164e+00]])

That was simple and quick. The data is now scaled, and we can start on our model.
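
For intuition, the scaling step can be reproduced by hand with NumPy. This is a sketch on synthetic data (the `_toy` arrays are hypothetical), assuming the usual standardization of subtracting the training mean and dividing by the training standard deviation.

```python
import numpy as np

# Hypothetical synthetic features standing in for the real splits
rng = np.random.default_rng(42)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(80, 2))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

# Statistics come from the training rows only (population std, ddof=0)
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)

# The same training statistics are applied to both splits
X_train_toy_scaled = (X_train_toy - train_mean) / train_std
X_test_toy_scaled = (X_test_toy - train_mean) / train_std

print(X_train_toy_scaled.mean(axis=0))  # close to 0 by construction
print(X_test_toy_scaled.mean(axis=0))   # near 0, but not exactly
```

Reusing the training statistics on the test split keeps the test data unseen during preprocessing, which is why the notebook calls `fit_transform` on the training data and only `transform` on the test data.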

### 4. Train the model

Let's now fit a logistic regression model to our newly scaled training data. Afterwards, we will score its predictions against the test data.

In [37]:
# initiate model
log_reg = LogisticRegression()

model1 = log_reg.fit(X_train_scaled, y_train)

In [38]:
# generate predictions with X_test_scaled
# then compare them with y_test
y_pred = log_reg.predict(X_test_scaled)

# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
test_cm = confusion_matrix(y_test, y_pred)
test_cm

array([[67, 10],
       [10, 51]])

In [50]:
# check scoring
from sklearn.metrics import f1_score

base_model_f1 = f1_score(y_test, y_pred)
base_model_f1

0.8360655737704918

We have built our initial model, and we have its confusion matrix and F1 score on the test data.

Next, let's see whether tuning the model's hyperparameters can improve on that score.
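
As a quick check on the base model, the F1 score reported above can be recomputed from the confusion matrix counts. The sketch below hard-codes the four counts from the matrix printed earlier and also derives accuracy, which is a different metric from F1.

```python
# Counts copied from the base model's confusion matrix on the test data
tn, fp, fn, tp = 67, 10, 10, 51

precision = tp / (tp + fp)   # share of predicted approvals that were correct
recall = tp / (tp + fn)      # share of actual approvals that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# precision=0.8361 recall=0.8361 f1=0.8361 accuracy=0.8551
```

The recomputed F1 matches `base_model_f1`, while the accuracy of the same predictions is higher, so the two metrics should not be compared with each other directly.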

### 5. Scoring the model

In this section, we rebuild our model using a grid search over `param_grid`, looking for a better-scoring set of hyperparameters. To do this, we will:
- create a param grid
- cross-validate across the param grid
- find the model and its parameters with the best score

In [40]:
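# creating our parameter grid
# note: not every solver supports every penalty; unsupported combinations fail to fit and show up as NaN scores in the results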
param_grid = {
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'],
    'penalty': ['l1', 'l2', None, 'elasticnet'],
    'class_weight': ['balanced', None],
    'fit_intercept': [True, False]
}

In [41]:
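# create grid search: 10-fold cross-validation, scored by F1
grid_log_reg = GridSearchCV(
    estimator=model1,
    param_grid=param_grid,
    scoring='f1',
    cv=10,
    refit=True)  # with a single scoring metric, refit=True refits the best model on all training data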

In [42]:
# run the grid search on the scaled training data
grid_log_reg.fit(X_train_scaled, y_train)

In [43]:
# let's check the results of the grid search
grid_results = pd.DataFrame(grid_log_reg.cv_results_)
grid_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_class_weight,param_fit_intercept,param_penalty,param_solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.000407,0.000054,0.000000,0.000000,balanced,True,l1,lbfgs,"{'class_weight': 'balanced', 'fit_intercept': ...",,,,,,,,,,,,,45
1,0.003644,0.001227,0.001390,0.000115,balanced,True,l1,liblinear,"{'class_weight': 'balanced', 'fit_intercept': ...",0.867925,0.846154,0.872727,0.836364,0.851852,0.872727,0.833333,0.784314,0.816327,0.73913,0.832085,0.040457,4
2,0.000315,0.000027,0.000000,0.000000,balanced,True,l1,newton-cg,"{'class_weight': 'balanced', 'fit_intercept': ...",,,,,,,,,,,,,45
3,0.000301,0.000009,0.000000,0.000000,balanced,True,l1,sag,"{'class_weight': 'balanced', 'fit_intercept': ...",,,,,,,,,,,,,45
4,0.031511,0.000114,0.001332,0.000044,balanced,True,l1,saga,"{'class_weight': 'balanced', 'fit_intercept': ...",0.851852,0.846154,0.872727,0.836364,0.872727,0.888889,0.833333,0.769231,0.816327,0.73913,0.832673,0.044693,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,0.000325,0.000049,0.000000,0.000000,,False,elasticnet,lbfgs,"{'class_weight': None, 'fit_intercept': False,...",,,,,,,,,,,,,45
76,0.000294,0.000002,0.000000,0.000000,,False,elasticnet,liblinear,"{'class_weight': None, 'fit_intercept': False,...",,,,,,,,,,,,,45
77,0.000296,0.000009,0.000000,0.000000,,False,elasticnet,newton-cg,"{'class_weight': None, 'fit_intercept': False,...",,,,,,,,,,,,,45
78,0.000297,0.000012,0.000000,0.000000,,False,elasticnet,sag,"{'class_weight': None, 'fit_intercept': False,...",,,,,,,,,,,,,45
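
The rows with empty scores are consistent with solver and penalty combinations that `LogisticRegression` does not support (for example `lbfgs` with `l1`), which `GridSearchCV` scores as NaN by default. One way to avoid them, sketched below, is to pass `param_grid` as a list of dictionaries so that only compatible pairs are tried. The pairings listed are an assumption based on commonly documented behaviour and should be checked against the installed scikit-learn version; `elasticnet` is left out because it additionally needs an `l1_ratio` value. The counting helper is hypothetical and uses only the standard library.

```python
from itertools import product

# Options shared by every sub-grid, as in the grid above
shared = {"class_weight": ["balanced", None], "fit_intercept": [True, False]}

# The full grid from this notebook, for comparison
full_param_grid = {
    "solver": ["lbfgs", "liblinear", "newton-cg", "sag", "saga"],
    "penalty": ["l1", "l2", None, "elasticnet"],
    **shared,
}

# Assumed-compatible solver/penalty pairings (verify for your version)
valid_param_grid = [
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2", None], **shared},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], **shared},
    {"solver": ["saga"], "penalty": ["l1", "l2", None], **shared},
]


def count_candidates(grid):
    """Count the parameter combinations in a dict or a list of dicts."""
    grids = [grid] if isinstance(grid, dict) else grid
    return sum(len(list(product(*g.values()))) for g in grids)


print(count_candidates(full_param_grid))   # 80
print(count_candidates(valid_param_grid))  # 44
```

Under these assumptions, 44 of the 80 candidates can actually be fitted, which agrees with the failed rows sharing rank 45 in the results table. A list like `valid_param_grid` can be passed directly as the `param_grid` argument.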


In [44]:
# now we can find our best estimators based on the score we had
print(grid_log_reg.best_estimator_)
print(grid_log_reg.best_params_)

# best mean cross-validated F1 score on the training data
print(grid_log_reg.best_score_)

LogisticRegression(class_weight='balanced', fit_intercept=False, solver='saga')
{'class_weight': 'balanced', 'fit_intercept': False, 'penalty': 'l2', 'solver': 'saga'}
0.8339733535385709


In [45]:
# find the optimized model based on the grid search
optimized_model = grid_log_reg.best_estimator_

# score on the test data; note that score() returns mean accuracy for classifiers, not F1
best_score = optimized_model.score(X_test_scaled, y_test)
best_score

0.8768115942028986

In [51]:
# for reference: F1 score of the original base logistic model on the test data
base_model_f1

0.8360655737704918

### Final Output

In [53]:
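print(f"The tuned logistic model had a test accuracy of {best_score:.4f}. The base logistic regression model had a test F1 score of {base_model_f1:.4f}. These are different metrics, so they are not directly comparable.")

The tuned logistic model had a test accuracy of 0.8768. The base logistic regression model had a test F1 score of 0.8361. These are different metrics, so they are not directly comparable.

For a like-for-like comparison, the base model's confusion matrix gives a test accuracy of (67 + 51) / 138 ≈ 0.8551, so on accuracy the tuned model improves on the base model.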
