# SBA Loan Analysis

# Modeling - Part 4 - CatBoost

## Table of Contents

1. Imports
2. Previewing Data
3. Preprocessing Data
    1. Standard Scaler
    2. Robust Scaler
4. Evaluation Metrics
5. Simple Model
    1. Standard Scaler
    2. Robust Scaler
6. Grid Search Cross Validation
    1. Standard Scaler
    2. Robust Scaler
7. Bayesian Optimization
    1. Standard Scaler
    2. Robust Scaler
8. Save Results
    

## 1. Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json

from catboost import CatBoostClassifier

from library.preprocessing import processing_pipeline
from library.modeling import (createModel, createClassificationMetrics,
                             runGridSearchAnalysis, createConfusionMatrix, createFeatureImportanceChart,
                             appendModelingResults, drawRocCurve, obtain_best_bayes_model)

In [2]:
# Load previously saved best parameters; the context manager closes the file handle.
with open('./results/best_params.json') as f:
    best_model_params = dict(json.load(f))

In [None]:
model_results = []

## 2. Previewing Data

In [3]:
sba_loans = pd.read_csv('./../data/processed/sba_national_processed_final.csv')

pd.set_option('display.max_columns', None)

In [4]:
sba_loans.head()

Unnamed: 0,Term,NoEmp,CreateJob,RetainedJob,DisbursementGross,GrAppv,SBA_Appv,NAICS_sectors,unemployment_rate,gdp_growth,gdp_annual_change,inflation_rate,inf_rate_annual_chg,NewExist_existing_business,NewExist_new_business,UrbanRural_rural,UrbanRural_urban,isFranchise_not_franchise,RevLineCr_v2_N,RevLineCr_v2_Y,LowDoc_v2_N,LowDoc_v2_Y,MIS_Status_v2_default,state_top10
0,84,4,0,0,60000.0,60000.0,48000.0,45,3.5,4.4472,0.67,2.3377,-0.59,0,1,0,0,1,1,0,0,1,0,0
1,60,2,0,0,40000.0,40000.0,32000.0,72,3.5,4.4472,0.67,2.3377,-0.59,0,1,0,0,1,1,0,0,1,0,0
2,180,7,0,0,287000.0,287000.0,215250.0,62,3.5,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,1,0,0,0
3,60,2,0,0,35000.0,35000.0,28000.0,0,4.1,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,0,1,0,0
4,240,14,7,7,229000.0,229000.0,229000.0,0,4.8,4.4472,0.67,2.3377,-0.59,1,0,0,0,1,1,0,1,0,0,1


## 3. Preprocessing Data

In [5]:
target = 'MIS_Status_v2_default'
features = sba_loans.drop(columns=target).columns

### A. Standard Scaler

In [6]:
X_train_ss, X_test_ss, y_train_ss, y_test_ss = processing_pipeline(sba_loans, target)

### B. Robust Scaler

In [7]:
X_train_rs, X_test_rs, y_train_rs, y_test_rs = processing_pipeline(sba_loans, target, scaler='Robust')
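
The internals of `processing_pipeline` are not shown in this notebook, so the following is only a sketch of how the two scaling strategies differ, assuming they follow the usual definitions: standard scaling centers on the mean and divides by the standard deviation, while robust scaling centers on the median and divides by the interquartile range. The helper names and the loan amounts (including the extreme value) are hypothetical.

```python
import statistics

def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    # 'inclusive' uses linear interpolation between the observed data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical disbursement amounts; the last value is an artificial outlier.
amounts = [35000.0, 40000.0, 60000.0, 229000.0, 287000.0, 5000000.0]

print([round(v, 2) for v in standard_scale(amounts)])
print([round(v, 2) for v in robust_scale(amounts)])
```

With standard scaling the outlier inflates the mean and standard deviation, which compresses the typical loans into a narrow band. Robust scaling keeps the typical loans spread out and leaves the outlier far from the rest. Tree-based models such as CatBoost split on feature order rather than magnitude, so the two scalers may produce very similar results.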

## 4. Evaluation Metrics

The following evaluation metrics will be used to evaluate the effectiveness of the CatBoost models.

**Accuracy Score**

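The share of all test loans whose predicted status matches the actual status. Because paid loans make up roughly 83% of the test set, accuracy alone can look strong even when many defaults are missed.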

**Classification Report**

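Per-class precision, recall, and F1 score for the paid and default classes, along with the support (the number of test loans in each class).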

**Matthew's Correlation Coefficient**

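A correlation between predicted and actual labels that uses all four cells of the confusion matrix. It ranges from -1 to 1, where 1 is perfect prediction and 0 is no better than chance, and it stays informative when the classes are imbalanced.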

**F1 Score**

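The harmonic mean of precision and recall for the default class. It will be the main metric for evaluation because defaults are the minority class, and F1 rewards a model only when it both catches defaults (recall) and avoids flagging paid loans as defaults (precision).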
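
As a rough illustration of how these metrics relate to one another, the sketch below computes them from raw confusion-matrix counts. The function `binary_metrics` and the toy labels are hypothetical; `createClassificationMetrics` is not shown in this notebook and may be implemented differently. Only the dictionary keys `acc`, `f1`, and `mcc` mirror the ones used later.

```python
import math

def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1, and MCC for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    acc = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is defined as 0 when any row or column of the matrix is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {'acc': acc, 'precision': precision, 'recall': recall,
            'f1': f1, 'mcc': mcc}

# Hypothetical imbalanced sample: 1 = default, 0 = paid.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_paid = [0] * 10                      # never predicts a default
some_signal = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(binary_metrics(y_true, always_paid))
print(binary_metrics(y_true, some_signal))
```

Both prediction sets reach 0.8 accuracy, but the model that never predicts a default scores 0 on F1 and MCC. That gap is why accuracy alone can mislead on an imbalanced target.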

## 5. Simple Model

### A. Standard Scaler

In [8]:
cbc_ss_mod1 = CatBoostClassifier(random_state=42, verbose=0)
y_pred = createModel(cbc_ss_mod1, X_train_ss, y_train_ss, X_test_ss)

In [9]:
metrics = createClassificationMetrics(y_pred, y_test_ss)

In [10]:
print('Accuracy Score: ' + str(round(metrics['acc'], 4)))

Accuracy Score: 0.9499


In [11]:
print('Classification Report: \n' + metrics['cr'])

Classification Report: 
              precision    recall  f1-score   support

        paid       0.97      0.97      0.97    223642
     default       0.84      0.87      0.85     45212

    accuracy                           0.95    268854
   macro avg       0.91      0.92      0.91    268854
weighted avg       0.95      0.95      0.95    268854



In [12]:
print('Matthew\'s Correlation Coefficient: ' + str(round(metrics['mcc'],4)))

Matthew's Correlation Coefficient: 0.8239


In [13]:
print('F1 Score: ' + str(round(metrics['f1'], 4)))

F1 Score: 0.8539


### B. Robust Scaler

In [14]:
cbc_rs_mod1 = CatBoostClassifier(random_state=42, verbose=0)
y_pred = createModel(cbc_rs_mod1, X_train_rs, y_train_rs, X_test_rs)

In [15]:
metrics = createClassificationMetrics(y_pred, y_test_rs)

In [16]:
print('Accuracy Score: ' + str(round(metrics['acc'], 4)))

Accuracy Score: 0.9499


In [17]:
print('Classification Report: \n' + metrics['cr'])

Classification Report: 
              precision    recall  f1-score   support

        paid       0.97      0.97      0.97    223763
     default       0.84      0.87      0.85     45091

    accuracy                           0.95    268854
   macro avg       0.91      0.92      0.91    268854
weighted avg       0.95      0.95      0.95    268854



In [18]:
print('Matthew\'s Correlation Coefficient: ' + str(round(metrics['mcc'],4)))

Matthew's Correlation Coefficient: 0.8237


In [19]:
print('F1 Score: ' + str(round(metrics['f1'], 4)))

F1 Score: 0.8536


## 6. Grid Search Cross Validation

In [20]:
param_grid = {
    'learning_rate': [0.03, 0.1],
    'iterations': [500, 1000],
    'l2_leaf_reg': [1.0, 3.0],
    'depth': [3,6]   
}
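
To get a sense of the search cost, the sketch below enumerates every combination in this grid. It assumes `runGridSearchAnalysis` performs an exhaustive search over the grid with cross-validation; its internals are not shown here, and the fold count `n_folds` is a hypothetical value chosen for illustration.

```python
from itertools import product

param_grid = {
    'learning_rate': [0.03, 0.1],
    'iterations': [500, 1000],
    'l2_leaf_reg': [1.0, 3.0],
    'depth': [3, 6],
}

# Build one dict per combination of parameter values.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

print(len(candidates))  # 2 * 2 * 2 * 2 = 16 candidate settings

n_folds = 5  # assumed fold count, not taken from the notebook
total_fits = len(candidates) * n_folds
print(total_fits)  # number of model fits under that assumption
```

Each additional value for any parameter multiplies the number of fits, which is one reason the grid here is kept to two values per parameter.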

### A. Standard Scaler

In [21]:
mod_info = {
    'model': 'CatBoost',
    'method': 'Grid Search',
    'scaler': 'Standard'
}

In [None]:
cbc = CatBoostClassifier(random_state=42, verbose=0)
cbc_ss_best_params, y_pred = runGridSearchAnalysis(cbc, param_grid, X_train_ss, y_train_ss, X_test_ss)

**Evaluation Metrics**

In [None]:
metrics = createClassificationMetrics(y_pred, y_test_ss)
print('Accuracy Score: {}'.format(metrics['acc']))
print('Classification Report: \n{}'.format(metrics['cr']))
print('Matthew\'s Correlation Coefficient: {}'.format(metrics['mcc']))
print('F1 Score: {}'.format(metrics['f1']))

**Confusion Matrix**

In [None]:
matrix = createConfusionMatrix(y_test_ss, y_pred, mod_info)

**ROC Curve**

In [None]:
cbc_mod = CatBoostClassifier(**cbc_ss_best_params, random_state=42, verbose=0)
metrics['auc'] = drawRocCurve(cbc_mod, X_train_ss, X_test_ss, y_train_ss, y_test_ss, mod_info)
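
The value stored in `metrics['auc']` is presumably the area under the ROC curve; `drawRocCurve` is not shown in this notebook, so that is an assumption. One way to read AUC is as the probability that a randomly chosen default receives a higher score than a randomly chosen paid loan. The sketch below computes it that way on hypothetical labels and scores.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```

The pairwise loop is for illustration only. On a test set of this size, a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.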

**Feature Importance with Best Params**

In [None]:
createFeatureImportanceChart(cbc_mod, features, X_train_ss, y_train_ss)

**Append Results**

In [None]:
model_results, best_model_params = appendModelingResults(model_results, best_model_params, mod_info, 
                                                         cbc_ss_best_params, matrix, metrics)

### B. Robust Scaler

In [None]:
mod_info = {
    'model': 'CatBoost',
    'method': 'Grid Search',
    'scaler': 'Robust'
}

In [None]:
cbc = CatBoostClassifier(random_state=42, verbose=0)
cbc_rs_best_params, y_pred = runGridSearchAnalysis(cbc, param_grid, X_train_rs, y_train_rs, X_test_rs)

**Evaluation Metrics**

In [None]:
metrics = createClassificationMetrics(y_pred, y_test_rs)
print('Accuracy Score: {}'.format(metrics['acc']))
print('Classification Report: \n{}'.format(metrics['cr']))
print('Matthew\'s Correlation Coefficient: {}'.format(metrics['mcc']))
print('F1 Score: {}'.format(metrics['f1']))

**Confusion Matrix**

In [None]:
matrix = createConfusionMatrix(y_test_rs, y_pred, mod_info)

**ROC Curve**

In [None]:
cbc_mod = CatBoostClassifier(**cbc_rs_best_params, random_state=42, verbose=0)
metrics['auc'] = drawRocCurve(cbc_mod, X_train_rs, X_test_rs, y_train_rs, y_test_rs, mod_info)

**Feature Importance with Best Params**

In [None]:
createFeatureImportanceChart(cbc_mod, features, X_train_rs, y_train_rs)

**Append Results**

In [None]:
model_results, best_model_params = appendModelingResults(model_results, best_model_params, mod_info, 
                                                         cbc_rs_best_params, matrix, metrics)

## 7. Bayesian Optimization