---
# Predicting Credit Card Default
### Model Development with Cross-Validation

---

Sections:
- [Loading Preprocessed Data](#Loading-Data)
- [Validation Set Partitioning](#Set-Partitioning)
- [Model Development with Cross-Validation](#Model-Development)
    - [Decision Tree](#Model-Tree)
    - [Random Forest](#Model-Forest)    
    - [AdaBoosting](#Model-AdaBoosting)
    - [Neural Network](#Model-NN) 
    - [SVM with RBF kernel](#Model-SVM) 
    
---

<a id="Loading-Data"></a>
# Loading Preprocessed Data
---

Note: 
Open <a href="./data_preparation.ipynb">data_preparation.ipynb</a> to see how the data was preprocessed. 

In [6]:
import pandas as pd
import numpy as np
import imblearn  # library for imbalanced-data utilities, e.g. SMOTE variants
from sklearn import preprocessing

#from google.colab import drive
#drive.mount('/content/drive')
# filename = "drive/Shareddrives/DS-project/default_processed.csv"

filename = "default_processed.csv"
data = pd.read_csv(filename)
data.head(10)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,DEFAULT
0,20000.0,2.0,2.0,1.0,24,2,2.0,-1.0,-1.0,-2.0,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0
1,120000.0,2.0,2.0,2.0,26,-1,2.0,0.0,0.0,0.0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0
2,90000.0,2.0,2.0,2.0,34,0,0.0,0.0,0.0,0.0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0.0
3,50000.0,2.0,2.0,1.0,37,0,0.0,0.0,0.0,0.0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0.0
4,50000.0,1.0,2.0,1.0,57,-1,0.0,-1.0,0.0,0.0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0.0
5,50000.0,1.0,1.0,2.0,37,0,0.0,0.0,0.0,0.0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0.0
6,500000.0,1.0,1.0,2.0,29,0,0.0,0.0,0.0,0.0,...,542653.0,483003.0,473944.0,55000.0,40000.0,38000.0,20239.0,13750.0,13770.0,0.0
7,100000.0,2.0,2.0,2.0,23,0,-1.0,-1.0,0.0,0.0,...,221.0,-159.0,567.0,380.0,601.0,0.0,581.0,1687.0,1542.0,0.0
8,140000.0,2.0,3.0,1.0,28,0,0.0,2.0,0.0,0.0,...,12211.0,11793.0,3719.0,3329.0,0.0,432.0,1000.0,1000.0,1000.0,0.0
9,20000.0,1.0,3.0,2.0,35,-2,-2.0,-2.0,-2.0,-1.0,...,0.0,13007.0,13912.0,0.0,0.0,0.0,13007.0,1122.0,0.0,0.0


<a id="Set-Partitioning"></a>
# Validation Set Partitioning
---

To estimate the variability (standard deviation) of each model's scores and to compare hyperparameter choices, we use stratified k-fold cross-validation.

It splits the dataset into k consecutive folds in a stratified manner, so that each fold preserves approximately the same target class distribution as the full dataset.
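
The stratification property can be checked on toy data. The snippet below is a minimal sketch with made-up labels (`X_toy`, `y_toy` and `skf_toy` are hypothetical names, not part of this notebook): with 20% positives overall, every test fold should also contain 20% positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 80 negatives followed by 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # the features are irrelevant for the split itself

skf_toy = StratifiedKFold(n_splits=5)

# Share of positives in each test fold
fold_positive_rates = [y_toy[test_idx].mean()
                       for _, test_idx in skf_toy.split(X_toy, y_toy)]
print(fold_positive_rates)
```

A plain `KFold` without shuffling on the same sorted labels would put all positives into the last fold, which is why the stratified variant is used here.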

In [10]:
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold

# features: all rows, every column except the last one (the label)
features = data.iloc[:, :-1]

# label: all rows, last column only, selected as a 1-D Series
# (a 2-D frame here triggers a DataConversionWarning in LabelEncoder)
label = data.iloc[:, -1]
label = preprocessing.LabelEncoder().fit_transform(label)

skf5 = StratifiedKFold(n_splits = 5)
skf5.get_n_splits(features, label)

print(skf5)
for train_index, test_index in skf5.split(features, label):
    print("TRAIN:", train_index, "Test:", test_index)

StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
TRAIN: [ 5983  5984  5985 ... 29997 29998 29999] Test: [   0    1    2 ... 6044 6045 6047]
TRAIN: [    0     1     2 ... 29997 29998 29999] Test: [ 5983  5984  5985 ... 12048 12049 12050]
TRAIN: [    0     1     2 ... 29997 29998 29999] Test: [11793 11794 11796 ... 18220 18222 18223]
TRAIN: [    0     1     2 ... 29997 29998 29999] Test: [17312 17315 17316 ... 24072 24073 24074]
TRAIN: [    0     1     2 ... 24072 24073 24074] Test: [23688 23691 23696 ... 29997 29998 29999]




## Cross Validation Score function

The function below takes a scikit-learn estimator (decision tree, random forest, etc.), the features and labels, and a scikit-learn cross-validation splitter, which supplies the training and testing indices for each fold.

It returns a DataFrame with the mean and standard deviation, over all folds, of precision and recall (both weighted by class support), accuracy, and four F1 variants (positive class, micro, macro and weighted).

In [13]:
from sklearn.model_selection import cross_validate, cross_val_score, cross_val_predict
from numpy import mean, std

def skf(model, features, labels, cv):
    """Cross-validate `model` and return the mean and std of each metric over the folds."""
    # scikit-learn scorer names; 'f1' on its own scores the positive class (label 1)
    metrics = ['precision_weighted', 'recall_weighted', 'accuracy', 'f1', 'f1_micro', 'f1_macro', 'f1_weighted']
    scores = cross_validate(model, features, labels, scoring=metrics, cv=cv)

    # Row labels, in the same order as `metrics`
    metric_pretty_names = ['Precision', 'Recall', 'Accuracy', 'F1', 'F1 Micro', 'F1 Macro', 'F1 Weighted']
    # cross_validate stores each metric under the key 'test_<scorer name>'
    score_data = {'mean': [mean(scores['test_' + m]) for m in metrics],
                  'std': [std(scores['test_' + m]) for m in metrics]}

    return pd.DataFrame(score_data, index=metric_pretty_names)
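
For reference, `cross_validate` returns a dictionary of per-fold arrays, which is what the function above summarizes. The following sketch shows that structure on synthetic data; `X_demo`, `y_demo`, `cv_demo` and `scores_demo` are hypothetical names, and the small tree is only a stand-in model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
cv_demo = StratifiedKFold(n_splits=5)

scores_demo = cross_validate(DecisionTreeClassifier(max_depth=3, random_state=0),
                             X_demo, y_demo,
                             scoring=['accuracy', 'f1'], cv=cv_demo)

# Timing entries plus one 'test_<scorer>' array per requested metric
print(sorted(scores_demo.keys()))
# One score per fold
print(scores_demo['test_f1'])
```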

## Oversampling using SMOTE

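One way to counter the class imbalance in the *training set* is to generate new samples for the under-represented class.

The most naive strategy is random over-sampling: new samples are drawn at random, with replacement, from the existing minority-class samples.

**SMOTE**
Instead of duplicating existing rows, SMOTE synthesizes new minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. This notebook uses the `SMOTENC` variant, which also handles categorical features.


**NOTE:** In the following Model Development section, each model is trained both on the original data and on data oversampled with SMOTE. The oversampler is placed inside an imbalanced-learn pipeline, so it is applied only when fitting on the training folds; the test folds are left untouched.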
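
To make the difference from random over-sampling concrete, here is a simplified sketch of the interpolation idea behind SMOTE for purely numeric features. It is not imbalanced-learn's implementation and does not cover the categorical handling of `SMOTENC`; the function `smote_sketch` and the toy array `X_min` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbours = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_minority), size=n_new)        # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per starting sample
    lam = rng.random((n_new, 1))                               # interpolation factor in [0, 1)
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])

# Toy minority class (hypothetical): 20 points in 2-D
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class instead of repeating rows exactly.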

<a id="Model-Development"></a>
# Model Development

---

<a id="Model-Tree"></a>
## Decision Tree
---

The first model we develop is the decision tree classifier. 

We look at 5 different parameters to find the best estimator.

In [11]:
from sklearn.tree import DecisionTreeClassifier

tree_grid = {
    'max_depth': [1, 3, 5, 7],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 4, 6],
    'criterion': ['gini', 'entropy'],
    'max_features': ['auto', 'sqrt', 'log2']
}
y_pred_tree = {}
y_prob_tree = {}

"""
Chosen hyperparameters: 
max_depth = 7
min_samples_split = 8
min_samples_leaf = 4
criterion = entropy
max_features = sqrt
"""
#tree = get_best_model(DecisionTreeClassifier, tree_grid, X_train, y_train)
tree = DecisionTreeClassifier(max_depth=7, min_samples_split=8, min_samples_leaf=4, criterion='entropy', max_features='sqrt')
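
To get a sense of the size of this search space, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. The sketch below copies the grid into a hypothetical `tree_grid_demo` variable; nothing is fitted, only the combinations are enumerated.

```python
from sklearn.model_selection import ParameterGrid

# Same values as tree_grid above, copied so the snippet is self-contained
tree_grid_demo = {
    'max_depth': [1, 3, 5, 7],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 4, 6],
    'criterion': ['gini', 'entropy'],
    'max_features': ['auto', 'sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(tree_grid_demo))
print(n_candidates)  # 4 * 4 * 4 * 2 * 3 = 384
```

An exhaustive grid search would therefore fit 384 candidates per fold, whereas `RandomizedSearchCV` with its default `n_iter=10` samples only ten of them.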


### Decision Tree: original data
K-Fold Cross Validation Results with Original Data, where K=5

In [14]:
skf(tree, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.79021,0.015602
Recall,0.811033,0.010679
Accuracy,0.811033,0.010679
F1,0.415711,0.052394
F1 Micro,0.811033,0.010679
F1 Macro,0.651481,0.028863
F1 Weighted,0.782944,0.015835
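
Recall, Accuracy and F1 Micro are identical in the table above. This is expected: recall weighted by class support sums the correctly classified samples of every class and divides by the total sample count, which is the definition of accuracy, and micro-averaged F1 reduces to accuracy for single-label problems. The toy check below illustrates this with made-up labels (`y_true` and `y_pred` are hypothetical) and contrasts it with the recall of the positive class alone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels: 6 negatives, 4 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

acc = accuracy_score(y_true, y_pred)
rec_weighted = recall_score(y_true, y_pred, average='weighted')
f1_micro = f1_score(y_true, y_pred, average='micro')
rec_positive = recall_score(y_true, y_pred)  # recall of class 1 only

print(acc, rec_weighted, f1_micro, rec_positive)
```

For this reason the per-class recall in the classification reports below is the more informative number for the minority (default) class.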


In [15]:
from sklearn.metrics import classification_report, accuracy_score, make_scorer

originalclass = []
predictedclass = []

# Custom scorer: records each fold's true and predicted labels, then returns accuracy
def classification_report_with_accuracy_score(y_true, y_pred):
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred) # return accuracy score

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(tree, X=features, y=label, cv=skf5, scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.84      0.95      0.89     23364
           1       0.65      0.34      0.45      6636

    accuracy                           0.81     30000
   macro avg       0.74      0.65      0.67     30000
weighted avg       0.80      0.81      0.79     30000
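
A pooled report of this kind can also be produced with `cross_val_predict`, which returns one out-of-fold prediction per sample. The sketch below runs on synthetic data from `make_classification` with hypothetical names (`X_demo`, `y_demo`, `y_oof`); it is an alternative illustration, not the approach used in the cells of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
cv_demo = StratifiedKFold(n_splits=5)

# Each sample is predicted by the model trained on the folds that do not contain it
y_oof = cross_val_predict(DecisionTreeClassifier(max_depth=3, random_state=0),
                          X_demo, y_demo, cv=cv_demo)

print(classification_report(y_demo, y_oof))
```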



### Decision Tree: SMOTE data

K-Fold Cross Validation Results with SMOTE Data, where K=5

In [16]:
from imblearn.over_sampling import SMOTENC
from sklearn.metrics import recall_score
from imblearn.pipeline import Pipeline, make_pipeline

tree_pipeline = make_pipeline(SMOTENC([1,2,3,5,6,7,8,9,10],random_state=42), tree)
skf(tree_pipeline, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.759963,0.013444
Recall,0.710233,0.011404
Accuracy,0.710233,0.011404
F1,0.471188,0.026546
F1 Micro,0.710233,0.011404
F1 Macro,0.635778,0.016207
F1 Weighted,0.727552,0.011001


In [17]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(tree_pipeline, X=features, y=label, cv=skf5, scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.87      0.74      0.80     23364
           1       0.39      0.60      0.47      6636

    accuracy                           0.71     30000
   macro avg       0.63      0.67      0.63     30000
weighted avg       0.76      0.71      0.72     30000



<a id="Model-Forest"></a>
## Random Forest
---

Next, we develop a random forest classifier.

We look at 6 different parameters (5 of which are the same as the decision tree) to find the best estimator.

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from numpy import mean
from numpy import std

forest_params = {
    'n_estimators': [10, 50, 100, 500],
    'max_depth': [1, 3, 5, 7],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 4, 6],
    'criterion': ['gini', 'entropy'],
    'max_features': ['auto', 'sqrt', 'log2']
}

"""
Chosen hyperparameters: 
n_estimators = 500
max_depth = 7
min_samples_split = 8
min_samples_leaf = 2
criterion = gini
max_features = sqrt
"""
#forest = get_best_model(RandomForestClassifier, forest_params, X_train, y_train)
forest = RandomForestClassifier(n_estimators=500, max_depth=7, min_samples_split=8, min_samples_leaf=2, criterion='gini', max_features='sqrt')


### Random Forest: Original data
K-Fold Cross Validation Results with Original Data, where K=5

In [19]:
skf(forest, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.80389,0.014343
Recall,0.8201,0.009804
Accuracy,0.8201,0.009804
F1,0.462194,0.027377
F1 Micro,0.8201,0.009804
F1 Macro,0.67708,0.01632
F1 Weighted,0.796899,0.010381


In [20]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(forest, X=features, y=label, cv=skf5, 
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.84      0.95      0.89     23364
           1       0.68      0.35      0.46      6636

    accuracy                           0.82     30000
   macro avg       0.76      0.65      0.68     30000
weighted avg       0.80      0.82      0.80     30000



### Random Forest: SMOTE data
K-Fold Cross Validation Results with SMOTE Data, where K=5

In [21]:
forest_pipeline = make_pipeline(SMOTENC([1,2,3,5,6,7,8,9,10],random_state=42), forest)
skf(forest_pipeline, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.789642,0.012651
Recall,0.771733,0.021584
Accuracy,0.771733,0.021584
F1,0.531854,0.025583
F1 Micro,0.771733,0.021584
F1 Macro,0.690386,0.020495
F1 Weighted,0.778783,0.01805


In [22]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(forest_pipeline, X=features, y=label, cv=skf5, 
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.87      0.82      0.85     23364
           1       0.49      0.58      0.53      6636

    accuracy                           0.77     30000
   macro avg       0.68      0.70      0.69     30000
weighted avg       0.79      0.77      0.78     30000



<a id="Model-AdaBoosting"></a>
## AdaBoosting 
---
Next, we develop an AdaBoost classifier. 

We look at 3 different parameters to find the best estimator.

Next we use Randomized Search Cross Validation to select our hyperparameters with the goal of finding the best estimator.


In [30]:
from sklearn.model_selection import RandomizedSearchCV

def get_best_model(model, param_grid, X_train, y_train):
    """Run a randomized hyperparameter search and return the refitted best estimator."""
    # `model` is an estimator class (not an instance); candidates are ranked by positive-class F1
    search = RandomizedSearchCV(model(), param_grid, scoring='f1',
                                refit=True, n_jobs=-1)
    best_model = search.fit(X_train, y_train).best_estimator_
    params = best_model.get_params()
    print("Chosen hyperparameters: ")
    # Report only the parameters that were part of the search
    for param in param_grid:
        print(param + " = " + str(params[param]))
    return best_model
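
As a usage illustration, the sketch below runs the same kind of randomized search on synthetic data. The names `X_demo`, `y_demo`, `demo_grid` and `search_demo` are hypothetical, and `n_iter`, `cv` and `random_state` are set explicitly here (unlike in the function above) to keep the example small and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

# A reduced grid: 4 * 4 = 16 possible combinations
demo_grid = {'max_depth': [1, 3, 5, 7], 'min_samples_leaf': [1, 2, 4, 6]}

# Evaluate 5 randomly sampled combinations with 3-fold CV, ranked by positive-class F1
search_demo = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), demo_grid,
                                 n_iter=5, scoring='f1', cv=3, random_state=0)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
```

Without a fixed `random_state`, the sampled combinations, and therefore the chosen hyperparameters, can change from run to run.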

In [25]:
from sklearn.ensemble import AdaBoostClassifier


adaboost_params = {
    'n_estimators': [20, 50, 100, 150],
    'learning_rate': [0.01, 0.05, 1, 1.5],
    'algorithm': ['SAMME', 'SAMME.R']
}
y_pred_adaboost = {}
y_prob_adaboost = {}

"""
Chosen hyperparameters: 
n_estimators = 100
learning_rate = 1.5
algorithm = SAMME.R
"""
adaboost = get_best_model(AdaBoostClassifier, adaboost_params, X_train, y_train)

Chosen hyperparameters: 
n_estimators = 100
learning_rate = 1.5
algorithm = SAMME.R


### AdaBoosting: original data

K-Fold Cross Validation Results with Original Data, where K=5

In [26]:
skf(adaboost, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.79907,0.012491
Recall,0.816833,0.008704
Accuracy,0.816833,0.008704
F1,0.453305,0.029336
F1 Micro,0.816833,0.008704
F1 Macro,0.671634,0.016783
F1 Weighted,0.793373,0.010035


In [27]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(adaboost, X=features, y=label, cv=skf5, 
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.84      0.95      0.89     23364
           1       0.67      0.34      0.45      6636

    accuracy                           0.82     30000
   macro avg       0.75      0.65      0.67     30000
weighted avg       0.80      0.82      0.79     30000



### AdaBoosting: SMOTE data
K-Fold Cross Validation Results with SMOTE Data, where K=5

In [28]:
adaboost_pipeline = make_pipeline(SMOTENC([1,2,3,5,6,7,8,9,10],random_state=42), adaboost)
skf(adaboost_pipeline, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.769827,0.011368
Recall,0.730167,0.022983
Accuracy,0.730167,0.022983
F1,0.491314,0.023097
F1 Micro,0.730167,0.022983
F1 Macro,0.653714,0.019716
F1 Weighted,0.744268,0.018876


In [29]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(adaboost_pipeline, X=features, y=label, cv=skf5, 
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.87      0.77      0.82     23364
           1       0.42      0.59      0.49      6636

    accuracy                           0.73     30000
   macro avg       0.64      0.68      0.65     30000
weighted avg       0.77      0.73      0.74     30000



<a id="Model-NN"></a>
## Neural Network 
---

Next we develop a Neural Network model using the MLP Classifier. 

We look at 5 different parameters to find the best estimator.

In [31]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
nn_params = {
    'hidden_layer_sizes': [(15, 5), (20, 10), (25, 15), (30, 20)],
    'max_iter': [700],
    'learning_rate': ["constant", "invscaling", "adaptive"],
    'alpha': [0.0001, 0.001, 0.01, 0.05],
    'activation': [ 'logistic']
}

"""
Chosen hyperparameters: 
hidden_layer_sizes = (20, 10)
max_iter = 700
learning_rate = invscaling
alpha = 0.0001
activation = logistic
"""

mlp = MLPClassifier(hidden_layer_sizes=(20, 10), 
                    max_iter=700, 
                    learning_rate="invscaling", 
                    alpha=0.0001, activation='logistic')

### Neural Network: original data
K-Fold Cross Validation Results with Original Data, where K=5

In [32]:
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp_pipeline = make_pipeline(StandardScaler(), mlp)
skf(mlp_pipeline, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.802827,0.013861
Recall,0.819567,0.009866
Accuracy,0.819567,0.009866
F1,0.471307,0.02634
F1 Micro,0.819567,0.009866
F1 Macro,0.681256,0.015905
F1 Weighted,0.798322,0.010285


In [33]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(mlp_pipeline, X=features, y=label, cv=skf5, 
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.84      0.95      0.89     23364
           1       0.66      0.36      0.47      6636

    accuracy                           0.82     30000
   macro avg       0.75      0.65      0.68     30000
weighted avg       0.80      0.82      0.80     30000



### Neural Network: SMOTE data
K-Fold Cross Validation Results with SMOTE Data, where K=5

In [34]:
#from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import make_pipeline

mlp_smote_pipeline = make_pipeline(SMOTENC([1,2,3,5,6,7,8,9,10], random_state=42), StandardScaler(), mlp)
skf(mlp_smote_pipeline, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.772193,0.010267
Recall,0.717067,0.019605
Accuracy,0.717067,0.019605
F1,0.494697,0.020485
F1 Micro,0.717067,0.019605
F1 Macro,0.64897,0.016689
F1 Weighted,0.734993,0.016367


In [35]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(mlp_smote_pipeline, X=features, y=label, cv=skf5, 
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.88      0.74      0.81     23364
           1       0.41      0.63      0.50      6636

    accuracy                           0.72     30000
   macro avg       0.64      0.69      0.65     30000
weighted avg       0.77      0.72      0.74     30000



<a id="Model-SVM"></a>
## SVM with RBF kernel 
---

Next we develop an SVM model using the RBF kernel.

We look at 2 different parameters to find the best estimator.

In [36]:
from sklearn.svm import SVC

# probability is a boolean flag; recent scikit-learn versions reject the integer 1
svc_params = {'C': [0.1, 1, 10, 50], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf'], 'probability': [True]}
y_pred_svm_rbf = {}
y_prob_svm_rbf = {}

"""
Chosen hyperparameters: 
C = 1
gamma = 0.1
kernel = rbf
probability = True
"""

svm_rbf_kernel = SVC(C=1, gamma=0.1, kernel='rbf', probability=True)

### SVM with RBF kernel: original data

K-Fold Cross Validation Results with Original Data, where K=5

In [37]:
from sklearn.svm import SVC

svm_rbf_kernel = SVC(C=1, gamma=0.1, kernel='rbf', probability=True)
svm_rfb_pipeline = make_pipeline(StandardScaler(), svm_rbf_kernel)

skf(svm_rfb_pipeline, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.801658,0.012956
Recall,0.818433,0.008663
Accuracy,0.818433,0.008663
F1,0.450368,0.02888
F1 Micro,0.818433,0.008663
F1 Macro,0.6708,0.016561
F1 Weighted,0.793713,0.009936


In [38]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(svm_rfb_pipeline, X=features, y=label, cv=skf5, 
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

           0       0.84      0.96      0.89     23364
           1       0.68      0.34      0.45      6636

    accuracy                           0.82     30000
   macro avg       0.76      0.65      0.67     30000
weighted avg       0.80      0.82      0.79     30000



### SVM with RBF kernel: SMOTE data
K-Fold Cross Validation Results with SMOTE Data, where K=5

In [39]:
#from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTENC
from sklearn.metrics import recall_score
from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC

svm_rbf_kernel = SVC(C=1, gamma=0.1, kernel='rbf', probability=True)
svm_rfb_pipeline_smote = make_pipeline(SMOTENC([1,2,3,5,6,7,8,9,10], random_state=42), 
                                 StandardScaler(),
                                 svm_rbf_kernel)

skf(svm_rfb_pipeline_smote, features, label, skf5)

Unnamed: 0,mean,std
Precision,0.776264,0.010903
Recall,0.744033,0.023085
Accuracy,0.744033,0.023085
F1,0.50473,0.021994
F1 Micro,0.744033,0.023085
F1 Macro,0.665968,0.019782
F1 Weighted,0.755875,0.018875


In [None]:
originalclass = []
predictedclass = []

# Cross-validate with the custom scorer so that the predictions of every fold are collected
nested_score = cross_val_score(svm_rfb_pipeline_smote, X=features, y=label, cv=skf5, 
                               scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all 5 folds
print(classification_report(originalclass, predictedclass)) 