![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [57]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
cc_apps.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,g,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,g,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,g,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,+


# 1.  Preprocess the data

In [58]:
# Get basic dataframe (df) info and make copy of df for cleaning
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    int64  
 13  13      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [59]:
# Review the unique values and most frequent values the object type columns
for idx, col in enumerate(cc_apps.columns):
    if cc_apps[col].dtype =='object': 
        print({idx})
        print(cc_apps[col].value_counts().head(3))
        print(cc_apps[col].unique())
        

{0}
b    468
a    210
?     12
Name: 0, dtype: int64
['b' 'a' '?']
{1}
?        12
22.67     9
20.42     7
Name: 1, dtype: int64
['30.83' '58.67' '24.50' '27.83' '20.17' '32.08' '33.17' '22.92' '54.42'
 '42.50' '22.08' '29.92' '38.25' '48.08' '45.83' '36.67' '28.25' '23.25'
 '21.83' '19.17' '25.00' '47.75' '27.42' '41.17' '15.83' '47.00' '56.58'
 '57.42' '42.08' '29.25' '42.00' '49.50' '36.75' '22.58' '27.25' '23.00'
 '27.75' '54.58' '34.17' '28.92' '29.67' '39.58' '56.42' '54.33' '41.00'
 '31.92' '41.50' '23.92' '25.75' '26.00' '37.42' '34.92' '34.25' '23.33'
 '23.17' '44.33' '35.17' '43.25' '56.75' '31.67' '23.42' '20.42' '26.67'
 '36.00' '25.50' '19.42' '32.33' '34.83' '38.58' '44.25' '44.83' '20.67'
 '34.08' '21.67' '21.50' '49.58' '27.67' '39.83' '?' '37.17' '25.67'
 '34.00' '49.00' '62.50' '31.42' '52.33' '28.75' '28.58' '22.50' '28.50'
 '37.50' '35.25' '18.67' '54.83' '40.92' '19.75' '29.17' '24.58' '33.75'
 '25.42' '37.75' '52.50' '57.83' '20.75' '39.92' '24.75' '44.17' '23.50'

- Looks like most of the object type columns have '?' in them, so will replace w/ most frequent value for the given column. 
- Column 1 needs to be cast as float instead of string.
- Numeric columns w/ missing vals should be replaced w/ mean vals.  Binary type vars should be cast as bool.

In [60]:
# Make copy of df and  for missing in object type
cc_apps_clean = cc_apps.copy()

In [61]:
# replace "?" with NaN throughout
cc_apps_clean = cc_apps_clean.replace('?',np.nan)

In [62]:
# Change column 1 from string to numeric
cc_apps_clean[1] = pd.to_numeric(cc_apps_clean[1], errors='raise')

In [63]:
# Iterate through each column index
for col_index in cc_apps_clean.columns:
    col = cc_apps_clean[col_index]
    if col.dtypes == 'object':  # Handle object (string) columns
        # Replace missing values with the most frequent value
        most_frq_val = col.value_counts().idxmax()
        cc_apps_clean[col_index] = col.fillna(most_frq_val)
    elif np.issubdtype(col.dtypes, np.number):  # Handle numeric columns
        # Replace missing values with the mean of the column
        mean_value = col.mean()
        cc_apps_clean[col_index] = col.fillna(mean_value)

In [64]:
# check that counts went up for most frq value (compare to earlier output)
for idx, col in enumerate(cc_apps_clean.columns):
    if cc_apps_clean[col].dtype =='object': 
        print({idx})
        print(cc_apps_clean[col].value_counts().head(3))
        print(cc_apps_clean[col].unique())

{0}
b    480
a    210
Name: 0, dtype: int64
['b' 'a']
{3}
u    525
y    163
l      2
Name: 3, dtype: int64
['u' 'y' 'l']
{4}
g     525
p     163
gg      2
Name: 4, dtype: int64
['g' 'p' 'gg']
{5}
c    146
q     78
w     64
Name: 5, dtype: int64
['w' 'q' 'm' 'r' 'cc' 'k' 'c' 'd' 'x' 'i' 'e' 'aa' 'ff' 'j']
{6}
v     408
h     138
bb     59
Name: 6, dtype: int64
['v' 'h' 'bb' 'ff' 'j' 'z' 'o' 'dd' 'n']
{8}
t    361
f    329
Name: 8, dtype: int64
['t' 'f']
{9}
f    395
t    295
Name: 9, dtype: int64
['t' 'f']
{11}
g    625
s     57
p      8
Name: 11, dtype: int64
['g' 's' 'p']
{13}
-    383
+    307
Name: 13, dtype: int64
['+' '-']


Looks like only a small number of values are missing in the columns above so will fill w/ most frequent.  


In [65]:
# Coerce columns 0, 8, 9, 13 to type bool
cc_apps_clean.iloc[:, [0, 8, 9, 13]] = cc_apps_clean.iloc[:, [0,8, 9, 13]].astype(bool)

In [66]:
# Let's look at the numeric columns
cc_apps_clean.info()
#print(cc_apps_clean[cc_apps_clean[1].isna()])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    bool   
 1   1       690 non-null    float64
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    bool   
 9   9       690 non-null    bool   
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    int64  
 13  13      690 non-null    bool   
dtypes: bool(4), float64(3), int64(2), object(5)
memory usage: 56.7+ KB


In [67]:
cc_apps_clean

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,True,30.83,0.000,u,g,w,v,1.25,True,True,1,g,0,True
1,True,58.67,4.460,u,g,q,h,3.04,True,True,6,g,560,True
2,True,24.50,0.500,u,g,q,h,1.50,True,True,0,g,824,True
3,True,27.83,1.540,u,g,w,v,3.75,True,True,5,g,3,True
4,True,20.17,5.625,u,g,w,v,1.71,True,True,0,s,0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,True,21.08,10.085,y,p,e,h,1.25,True,True,0,g,0,True
686,True,22.67,0.750,u,g,c,v,2.00,True,True,2,g,394,True
687,True,25.25,13.500,y,p,ff,ff,2.00,True,True,1,g,1,True
688,True,17.92,0.205,u,g,aa,v,0.04,True,True,0,g,750,True


In [68]:
# Use one hot encoding to handle categorical data
cc_apps_clean_encoded = pd.get_dummies(cc_apps_clean, columns=[3,4,5,6,11], drop_first=True)
print(cc_apps_clean_encoded)

        0      1       2     7     8     9  ...  6_n  6_o  6_v  6_z  11_p  11_s
0    True  30.83   0.000  1.25  True  True  ...    0    0    1    0     0     0
1    True  58.67   4.460  3.04  True  True  ...    0    0    0    0     0     0
2    True  24.50   0.500  1.50  True  True  ...    0    0    0    0     0     0
3    True  27.83   1.540  3.75  True  True  ...    0    0    1    0     0     0
4    True  20.17   5.625  1.71  True  True  ...    0    0    1    0     0     1
..    ...    ...     ...   ...   ...   ...  ...  ...  ...  ...  ...   ...   ...
685  True  21.08  10.085  1.25  True  True  ...    0    0    0    0     0     0
686  True  22.67   0.750  2.00  True  True  ...    0    0    1    0     0     0
687  True  25.25  13.500  2.00  True  True  ...    0    0    0    0     0     0
688  True  17.92   0.205  0.04  True  True  ...    0    0    1    0     0     0
689  True  35.00   3.375  8.29  True  True  ...    0    0    0    0     0     0

[690 rows x 36 columns]


In [69]:
cc_apps_clean_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 36 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    bool   
 1   1       690 non-null    float64
 2   2       690 non-null    float64
 3   7       690 non-null    float64
 4   8       690 non-null    bool   
 5   9       690 non-null    bool   
 6   10      690 non-null    int64  
 7   12      690 non-null    int64  
 8   13      690 non-null    bool   
 9   3_u     690 non-null    uint8  
 10  3_y     690 non-null    uint8  
 11  4_gg    690 non-null    uint8  
 12  4_p     690 non-null    uint8  
 13  5_c     690 non-null    uint8  
 14  5_cc    690 non-null    uint8  
 15  5_d     690 non-null    uint8  
 16  5_e     690 non-null    uint8  
 17  5_ff    690 non-null    uint8  
 18  5_i     690 non-null    uint8  
 19  5_j     690 non-null    uint8  
 20  5_k     690 non-null    uint8  
 21  5_m     690 non-null    uint8  
 22  5_

In [70]:
# Make sure all column names are integers
cc_apps_clean_encoded.columns = [int(col) if str(col).isdigit() else hash(col) for col in cc_apps_clean_encoded.columns]

# 2. Prepare the data for modeling

In [71]:
# Define the target variable and the feature variables
X = cc_apps_clean_encoded.iloc[:, :-1]  # features (x) is all except the last
y = cc_apps_clean_encoded.iloc[:,-1]  # target (y) is the last variable (was +/-, now T/F)

In [72]:
# Split the data into train and test sets (80%/20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state=42)

In [73]:
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Train the model

In [74]:
# Instantiate the model 1
model_1 = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model 1
model_1.fit(X_train_scaled, y_train)

# Generate Predictions for model 1
y_pred_1 = model_1.predict(X_train_scaled)


In [75]:
# Instantiate the model 2
model_2 = LogisticRegression(random_state=42)

# Train the model 2
model_2.fit(X_train_scaled, y_train)

# Generate Predictions for model 2
y_pred_2 = model_2.predict(X_train_scaled)

# 4. Find the model with best performance (score)

In [76]:
# Accuracy Score for Model 1
accuracy_1 = accuracy_score(y_test, y_pred_1)
print(f'Model 1 Accuracy: {accuracy_1:.2f}')

# Classification Report (Precision, Recall, F1-score) for Model 1
print("Model 1 Classification Report:\n", classification_report(y_test, y_pred_1))

# Confusion Matrix for Model 1
print("Model 1 Confusion Matrix:\n", confusion_matrix(y_test, y_pred_1))

# Accuracy Score for Model 2
accuracy_2 = accuracy_score(y_test, y_pred_2)
print(f'Model 2 Accuracy: {accuracy_2:.2f}')

# Classification Report (Precision, Recall, F1-score) for Model 2
print("Model 2 Classification Report:\n", classification_report(y_test, y_pred_2))

# Confusion Matrix for Model 2
print("Model 2 Confusion Matrix:\n", confusion_matrix(y_test, y_pred_2))

# Select Model with best score
best_score = max(accuracy_1,accuracy_2)
print(f'Best Accuracy Score: {(best_score*100): .2f}%')

Model 1 Accuracy: 0.84
Model 1 Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.92      0.91       315
           1       0.04      0.03      0.04        30

    accuracy                           0.84       345
   macro avg       0.47      0.48      0.47       345
weighted avg       0.83      0.84      0.84       345

Model 1 Confusion Matrix:
 [[289  26]
 [ 29   1]]
Model 2 Accuracy: 0.90
Model 2 Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.99      0.95       315
           1       0.00      0.00      0.00        30

    accuracy                           0.90       345
   macro avg       0.46      0.50      0.47       345
weighted avg       0.83      0.90      0.87       345

Model 2 Confusion Matrix:
 [[312   3]
 [ 30   0]]
Best Accuracy Score:  90.43%


Was able to get accuracy of 90% from Model 2 (Logistic Regression).