<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;a:link{color: white}">
    <h1 style='color:GhostWhite;'>Should This Loan be Approved or Denied?</h1>

An XGBoost data model to predict whether a loan can be approved or denied.
    </div>

<div class="alert alert-block alert-success">  
    <b>Dataset Source</b><br><br>
    <a href="https://www.kaggle.com/mirbektoktogaraev/should-this-loan-be-approved-or-denied">U.S. Small Business Administration (SBA) Dataset</a>
<br><br>
    All information about the dataset can be found at the <b>above link</b><br><br>    
    *<i>Thanks to Hamza for his <a href="https://www.kaggle.com/code/hamzaghanmi/xgboost-hyperparameter-tuning-using-optuna/notebook">Notebook on Optuna</a> which was used as a guide.</i> 
<br><br>
    An exploratory data visualization in Tableau is also available at:<br>
    <a href= "https://public.tableau.com/app/profile/joseph8038/viz/SBADatasetVisualizationandAnalysis/SBADatasetVisualizationandAnalysis-StoryBoard">SBA Data Exploratory Visualization in Tableau</a>
</div>

<div class="alert alert-block alert-info" style="color:DarkSlateBlue">
This notebook is divided into 4 main parts:<br>
<ul>
<li><a style="color:DarkSlateGrey;" href="#part1"><b>Part 1: Pipeline</b></a> - this is the end result encapsulated into a pipeline</li><br>
<li><a style="color:DarkSlateGrey" href="#part2"><b>Part 2: Data Exploration (EDA) and Preparation, Modeling, Metrics</b></a> - from start to end, with some notes</li><br>
<li><a style="color:DarkSlateGrey;" href="#part3"><b>Part 3: XGBoost HyperParameter Tuning using Optuna - Full and Incremental</b></a></li><br>
<li><a style="color:DarkSlateGrey;" href="#part4"><b>Part 4: Miscellaneous</b></a>  - Early Stopping Rounds, Random Forest Classifier</li>
</ul><br>
"Our model results are way more dependent on how well feature engineering is performed than on the model itself. Machine Learning models are like very skilled linguists that can decipher any text in any language. However, it will not be helpful if they are handed a bunch of scribbles or blurred out text. EDA should not be skipped, as a thorough EDA and feature engineering process accounts for 90% of the results of a good model."<br><br>
One way to keep memory use under control is to do the processing inside functions. A function creates a new scope for its intermediate variables, and they are released automatically when the function returns; for this reason, most of the code below is encapsulated in functions.
</div>
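The scoping point above can be observed directly. The following is a minimal sketch, assuming CPython's reference counting; `Payload` and `summarize` are hypothetical names used only for illustration and do not appear elsewhere in this notebook.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large intermediate object (hypothetical)."""
    def __init__(self, n):
        self.data = list(range(n))


def summarize(n):
    tmp = Payload(n)            # intermediate object, local to this call
    probe = weakref.ref(tmp)    # lets us observe tmp without keeping it alive
    return sum(tmp.data), probe


total, probe = summarize(1000)
gc.collect()                    # not strictly needed in CPython, shown for completeness
print(total)                    # 499500
print(probe() is None)          # True: the local object was released on return
```

Only the returned values survive the call; the intermediate object is gone as soon as the function exits.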

<h2>Table Of Contents</h2>
<ul>
    <li><a style="color:DarkSlateGrey" href="#paths_and_flags">Paths and Flags</a></li>
    <li><a style="color:DarkSlateGrey" href="#libraries">Libraries</a></li>   
    <li><a style="color:DarkSlateGrey" href="#functions">Custom Functions And Classes</a></li>
    <li><a style="color:DarkSlateGrey" href="#metrics">Metrics Function</a></li>
    <li><a style="color:DarkSlateGrey" href="#xgboost_class">XGBoost Class</a></li>
    <li><a style="color:DarkSlateGrey" href="#other_models">Other Models Class</a></li>
    <li><a style="color:DarkSlateGrey" href="#optuna_class">Optuna Class</a></li>
    <li><a style="color:DarkSlateGrey" href="#optuna_class_batch">Optuna Class - tuning by batches</a></li><br>
    <li><a style="color:DarkSlateGrey" href="#part1">Part 1. PipeLine</a></li>
    <ul>
        <li><a style="color:DarkSlateGrey" href="#pl_classes">Pipeline Classes</a></li>
        <li><a style="color:DarkSlateGrey" href="#load_pl_df">Load Dataset for PipeLine</a></li>
        <li><a style="color:DarkSlateGrey" href="#pl_run">Run the pipeline</a></li>
    </ul>
    <br>
    <li><a style="color:DarkSlateGrey" href="#part2">Part 2. Data Exploration and Preparation, Modeling, Metrics</a></li>
    <ul>
        <li><a style="color:DarkSlateGrey" href="#de_load_df">Load Dataset</a></li>
        <li><a style="color:DarkSlateGrey" href="#dep">Data Exploration / Preparation</a></li>
        <li><a style="color:DarkSlateGrey" href="#build_model">Build Model Using XGBoost</a></li>
        <ul>
            <li><a style="color:DarkSlateGrey" href="#model1">Model v1</a></li>
            <li><a style="color:DarkSlateGrey" href="#oversample">Oversample</a></li>
            <ul>
                <li><a style="color:DarkSlateGrey" href="#model2">Model v2</a></li>
                <li><a style="color:DarkSlateGrey" href="#model3">Model v3</a></li>
            </ul>
        </ul>
        <li><a style="color:DarkSlateGrey" href="#test_model">Test Model</a></li>
        <ul>
            <li><a style="color:DarkSlateGrey" href="#test_test_dataset">Test Model With Test Dataset</a></li>
           <li><a style="color:DarkSlateGrey" href="#test_user_input">Test Model With User Input</a></li>
        </ul>
        <li><a style="color:DarkSlateGrey" href="#mutual_info">Mutual Information Scores</a></li>
        <li><a style="color:DarkSlateGrey" href="#trim_datasets">Trim Datasets</a></li>
        <li><a style="color:DarkSlateGrey" href="#results1">Full or Trimmed Dataset</a></li>
    </ul>  
    <br>
    <li><a style="color:DarkSlateGrey" href="#part3">Part 3. XGBoost HyperParameter Tuning using Optuna - Full or Incremental</a></li>
    <ul>
    <li><a style="color:DarkSlateGrey" href="#find_best_hp">Find The Best HyperParameter Combination</a></li>
    <li><a style="color:DarkSlateGrey" href="#try_best_hp">Model v4 : Try the Optuna Hyperparameters</a></li>
    <li><a style="color:DarkSlateGrey" href="#optuna_results">Optuna Tuning Results</a></li>
    <li><a style="color:DarkSlateGrey" href="#cross_validation">Cross Validation</a></li>
    </ul>
    <br>
    <li><a style="color:DarkSlateGrey" href="#part4">Part 4. Miscellaneous</a></li>
    <ul>
    <li><a style="color:DarkSlateGrey" href="#early_stopping_rounds">Early Stopping Rounds</a></li>
    <li><a style="color:DarkSlateGrey" href="#random_forest_classifier">Random Forest Classifier</a></li>
    </ul>
</ul>

<a id="paths_and_flags"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Paths and Flags</b></div>

In [2]:
'''
Change this kaggle_flag to :
   0 - if running outside Kaggle (e.g. Jupyter Notebook), change filepath & savepath to your 
       own path
   1 - if running as a Kaggle notebook
'''
kaggle_flag = 0

# switch to 1 if using GPU.  If 1, tree method for XGBoost will be 'gpu_hist'
gpu_flag = 0

# alert_flag - change to 0 for no sound alert, 1 for sound alert after long running cells
alert_flag = 0

'''
We have two options for running Optuna tuning on XGBoost:  
   OptunaStudy() - run Optuna on the full dataset
   OptunaStudyBatch() - run in batches, lighter on memory, but much slower

Change flag below as needed:
   1 to run OptunaStudy() only
   2 to run OptunaStudyBatch() only
   3 to run both
'''
optuna_flag = 2

#---------------------------------------------------------------------------------------#

if kaggle_flag == 1:
    filepath = "../input/should-this-loan-be-approved-or-denied/"  # Kaggle
    savepath = "./"   #Kaggle
    final_ds = '../input/sba-final-csv-feather-20220402/sba_final.csv.feather' 
else:
    filepath = "C:\\Python\\Python_Data_Science_Exercises\\datasets\\"
    savepath = "C:\\Python\\Python_Data_Science_Exercises\\datasets\\"
    final_ds = f'{savepath}sba_final.csv.feather'

audio_path="https://www.soundjay.com/misc/sounds/tablet-bottle-1.mp3" # for alert

if gpu_flag == 0:
    tree_method = 'hist'
else:
    tree_method = 'gpu_hist'

<a id="libraries"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Libraries</b></div>

In [3]:
def install_packages():
    piplist = !pip list

    # for text-to-speech
    if not piplist.grep('pyttsx3'):
        !pip3 install pyttsx3
    
    # for oversampling
    if not piplist.grep('imbalanced-learn'):
        !pip3 install imbalanced-learn

    if not piplist.grep('xgboost'):
        !pip3 install xgboost
    
    if not piplist.grep('optuna'):
        !pip3 install optuna

    # for saving file in feather format
    if not piplist.grep('pyarrow'):
        !pip3 install pyarrow
    
    # for EDA 
    if not piplist.grep('pandas-profiling'):
        !pip3 install pandas-profiling
    
    if not piplist.grep('sweetviz'):
        !pip3 install sweetviz
    
    if not piplist.grep('dataprep'):
        !pip3 install dataprep
    '''  
    if not piplist.grep('modin'):
        !pip3 install modin
        !pip3 install modin[ray]
    '''
install_packages()
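As an alternative to grepping the output of `pip list`, package availability can be checked from plain Python. The sketch below assumes the import names shown in the mapping (for example, `imbalanced-learn` is imported as `imblearn`); `is_installed` and `required` are illustrative names, not part of the notebook.

```python
import importlib.util


def is_installed(module_name):
    """True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# pip package name -> import name (the two can differ)
required = {
    'pyttsx3': 'pyttsx3',
    'imbalanced-learn': 'imblearn',
    'xgboost': 'xgboost',
    'optuna': 'optuna',
    'pyarrow': 'pyarrow',
    'pandas-profiling': 'pandas_profiling',
    'sweetviz': 'sweetviz',
}

missing = [pip_name for pip_name, mod in required.items() if not is_installed(mod)]
print('Packages still to install:', missing)
```

This runs outside IPython as well, since it does not rely on the `!pip` shell escape.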


In [4]:
import pandas as pd
#import modin.pandas as pd
#import ray      # only needed together with modin.pandas, whose install is commented out above
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
import pyttsx3
from IPython.display import Audio, display
from IPython.display import FileLink
from IPython.display import IFrame
from IPython.core.display import HTML
import hashlib
import copy      # for deepcopy()
import datetime as dt
import optuna
import gc
from pandas_profiling import ProfileReport
import sweetviz as sv
import shutil
import psutil
import sys
import pickle
import joblib
from pprint import pprint
%matplotlib inline  

In [5]:
# to use with modin.pandas
#ray.shutdown()
#ray.init()

<a id="functions"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Custom Functions and Classes</b></div>

In [6]:
class color:
    purple = '\033[95m'
    cyan = '\033[96m'
    darkcyan = '\033[36m'
    blue = '\033[94m'
    green = '\033[92m'
    yellow = '\033[93m'
    red = '\033[91m'
    bold = '\033[1m'
    underline = '\033[4m'
    end = '\033[0m'
    bdunl = '%s%s' % (bold, underline)
    bdblue = '%s%s' % (bold, blue)
    bdgreen = '%s%s' % (bold, green)
    bdred = '%s%s' % (bold, red)

In [7]:
''' 
Set up voice object.  Used in different areas of notebook to indicate completion of long processes.
'''
if kaggle_flag == 0:   # not Kaggle
    engine = pyttsx3.init()  # object creation

    """ RATE"""
    #rate = engine.getProperty('rate')   # getting details of current speaking rate
    #print (rate)                        #printing current voice rate
    engine.setProperty('rate', 175)     # setting up new voice rate

    """VOLUME"""
    #volume = engine.getProperty('volume')   #getting to know current volume level (min=0 and max=1)
    #print (volume)                         #printing current volume level
    engine.setProperty('volume',0.7)        # setting up volume level  between 0 and 1

    """VOICE"""
    voices = engine.getProperty('voices')       #getting details of current voice
    #engine.setProperty('voice', voices[0].id)  #changing index, changes voices. 0 for male
    engine.setProperty('voice', voices[1].id)   #changing index, changes voices. 1 for female

In [8]:
# copied from corochann's (Kaggle Grandmaster) notebook
# https://www.kaggle.com/code/corochann/ashrae-feather-format-for-fast-loading/notebook

from pandas.api.types import is_datetime64_any_dtype as is_datetime

def reduce_mem_usage(df, use_float16=False):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        if is_datetime(df[col]):
            # skip datetime type
            continue
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.uint8).min and c_max <= np.iinfo(np.uint8).max:
                    df[col] = df[col].astype(np.uint8)
                elif c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if use_float16 and c_min > np.finfo(np.float16).min and \
                            c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print()
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
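The same downcasting idea can be tried on a toy frame with pandas' built-in `pd.to_numeric(..., downcast=...)`. This is a sketch of a related technique, not a replacement for `reduce_mem_usage`; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'small_int': np.array([0, 10, 200], dtype=np.int64),
    'signed_int': np.array([-5, 0, 100], dtype=np.int64),
    'ratio': np.array([0.25, 0.5, 0.75], dtype=np.float64),
})

before = df.memory_usage(deep=True).sum()

# let pandas pick the smallest dtype that still holds every value
df['small_int'] = pd.to_numeric(df['small_int'], downcast='unsigned')
df['signed_int'] = pd.to_numeric(df['signed_int'], downcast='integer')
df['ratio'] = pd.to_numeric(df['ratio'], downcast='float')

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f'{before} bytes -> {after} bytes')
```

Note that `downcast='float'` stops at `float32`, whereas `reduce_mem_usage` can optionally go down to `float16`.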

In [9]:
def check_cols_with_nulls(df):
    cols_with_missing = [col for col in df.columns if df[col].isnull().any()]
    if len(cols_with_missing) == 0:
        print("No Missing Values")
    else:
        print(cols_with_missing)
    
    sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [10]:
def DfSplitIntoChunks(df, n):
    colnames = ['Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob',
           'FranchiseCode', 'UrbanRural', 'LowDoc', 'DisbursementGross',
           'MIS_Status', 'SBA_Appv', 'Industry', 'Recession', 'RealEstate',
           'SBA_Portion', 'State_hash', 'CityState_hash']

    # Split the dataframe row-wise into n chunks...
    chunks=[]
    i, sz = 0, 0
    for x in np.array_split(df, n, axis=0):
        print(f'Processing df number: {i+1}')
        chunks.append(pd.DataFrame(x, columns = colnames, index=None))
        sz += len(chunks[i])
        i += 1
        
    print()
    print(f'Size of df = {len(df)}, Size of {n} Chunks = {sz}')
    
    return chunks
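The chunking logic can be checked on a small frame. The sketch below splits row positions rather than the dataframe itself, which is one possible variation; `split_into_chunks` is a hypothetical helper and the toy data only borrows two column names from the list above.

```python
import numpy as np
import pandas as pd


def split_into_chunks(df, n):
    """Split df row-wise into n nearly equal chunks (hypothetical helper)."""
    positions = np.array_split(np.arange(len(df)), n)
    return [df.iloc[idx] for idx in positions]


df = pd.DataFrame({'Term': range(10), 'NoEmp': range(10, 20)})
chunks = split_into_chunks(df, 3)

print([len(c) for c in chunks])               # [4, 3, 3]
print(sum(len(c) for c in chunks) == len(df)) # True: no rows lost
```

When the row count is not divisible by `n`, the earlier chunks receive the extra rows.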

In [11]:
def check_infinity_nan(df,dfname):
    print("checking for infinity")
  
    #ds = sba.isin([np.inf, -np.inf])
    #print(ds)
  
    # printing the count of infinity values
    print()
    print("printing the count of infinity values")
  
    count = np.isinf(df).values.sum()
    print(f"{dfname} contains {str(count)} infinite values")
    print()
    
    has_nan = df.isnull().values.any()
    print(f"Does {dfname} have Nan or Null values ?  {has_nan}")

In [12]:
# used as a converter when loading csv
def fixvals(val):
    retval = val.replace('$','').replace(',','')
    return retval
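A converter like this is passed to `pd.read_csv` through the `converters` argument. A minimal sketch on in-memory data follows; the two-row CSV is invented for illustration and `strip_currency` simply mirrors `fixvals` so the block runs on its own.

```python
import io
import pandas as pd


def strip_currency(val):
    """Remove dollar signs and thousands separators from a raw CSV field."""
    return val.replace('$', '').replace(',', '')


csv_text = 'Term,DisbursementGross\n84,"$60,000.00"\n60,"$1,250.50"\n'

df = pd.read_csv(io.StringIO(csv_text),
                 converters={'DisbursementGross': strip_currency})

# the converter returns strings, so cast explicitly afterwards
df['DisbursementGross'] = df['DisbursementGross'].astype(float)
print(df)
```

The converter receives each field as a string, which is why the cast to `float` happens as a separate step.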

In [13]:
# The Jupyter cell magic %%time could be used instead
def runtime(rt1,rt2):
    tdiff=rt2 - rt1
    # get seconds and convert to h:m:s
    print()
    print(f'Runtime : {str(dt.timedelta(seconds=tdiff.total_seconds()))}')
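The function expects two `datetime` values taken before and after the work being timed. A small sketch with fixed timestamps shows the output format; `elapsed` is a hypothetical variant that returns the string instead of printing it.

```python
import datetime as dt


def elapsed(start, end):
    """Format the time between two datetimes as h:mm:ss."""
    return str(dt.timedelta(seconds=(end - start).total_seconds()))


rt1 = dt.datetime(2022, 4, 2, 10, 0, 0)
rt2 = dt.datetime(2022, 4, 2, 11, 30, 15)
print(f'Runtime : {elapsed(rt1, rt2)}')   # Runtime : 1:30:15
```

In the notebook, `rt1` and `rt2` would come from `dt.datetime.now()` calls placed around a long-running cell.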

In [14]:
def create_download_link(title = "Download ", filename = "data.csv"):  
    html = '<a href="{filename}">{title}</a>'
    html = html.format(title=title + filename,filename=filename)
    return HTML(html)

In [15]:
# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
mm = sorted([(x, sys.getsizeof(globals().get(x))) \
            for x in dir() if not x.startswith('_') and \
            x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)
pprint(mm)

del ipython_vars
gc.collect()

[('Audio', 1064),
 ('FileLink', 1064),
 ('HTML', 1064),
 ('IFrame', 1064),
 ('ProfileReport', 1064),
 ('color', 1064),
 ('DfSplitIntoChunks', 136),
 ('check_cols_with_nulls', 136),
 ('check_infinity_nan', 136),
 ('create_download_link', 136),
 ('display', 136),
 ('fixvals', 136),
 ('install_packages', 136),
 ('is_datetime', 136),
 ('reduce_mem_usage', 136),
 ('runtime', 136),
 ('final_ds', 119),
 ('audio_path', 105),
 ('filepath', 98),
 ('savepath', 98),
 ('voices', 88),
 ('dt', 72),
 ('np', 72),
 ('pd', 72),
 ('plt', 72),
 ('sns', 72),
 ('sv', 72),
 ('tree_method', 53),
 ('engine', 48),
 ('i', 28),
 ('optuna_flag', 28),
 ('x', 28),
 ('alert_flag', 24),
 ('gpu_flag', 24),
 ('kaggle_flag', 24)]


8

In [16]:
def GetRam():
    # Getting % usage of virtual_memory ( 3rd field)
    #print('RAM memory % used:', psutil.virtual_memory()[2])
    return psutil.virtual_memory()[2]

GetRam()

49.9

<a id="eda_tools"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>EDA Tools</b></div>

In [17]:
def GetSweetVizReport(df):
    print(f'{color.bold}Please wait, preparing SweetViz report{color.end}')
    try:
        my_report = sv.analyze(df)
    
        my_report.show_html(filepath=f'{savepath}SBA_sweetviz_report.html', 
                open_browser=True, 
                layout='vertical', 
                scale=None)
        print()
        if kaggle_flag == 0:
            print(f'SweetViz Report has been downloaded to path {savepath}')
        
    except Exception as e:
        print(f'Error: {e}')

<a id="metrics"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Metrics Function</b></div>

In [18]:
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def model_eval(y_valid,predictions, cmDisplay=False):
    print('MAE:', metrics.mean_absolute_error(y_valid, predictions))
    #print('MSE:', metrics.mean_squared_error(y_valid, predictions))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_valid, predictions)))
    print()
    
    ClassificationReport = classification_report(y_valid,predictions.round(),output_dict=True)

    print(f'{color.bold}Classification Report:{color.end}')
    print(classification_report(y_valid,predictions.round()))
    
    print()
    print(f"{color.bold}Confusion Matrix:{color.end}")

    if cmDisplay == True:
        cm = confusion_matrix(y_valid, predictions)
        disp = ConfusionMatrixDisplay(confusion_matrix=cm)
        fig, ax = plt.subplots(dpi=100,figsize=(5,5))
        disp.plot(ax=ax,colorbar=False,values_format='d')
    
    cmv = confusion_matrix(y_valid, predictions)
    
    TrueNeg = cmv[0][0]
    FalsePos = cmv[0][1]
    FalseNeg = cmv[1][0]
    TruePos = cmv[1][1]

    TotalNeg = TrueNeg + FalseNeg
    TotalPos = TruePos + FalsePos
    
    print()
    print(f'True Negative : CHGOFF (0) was predicted {TrueNeg} times correctly '
          f'({round((TrueNeg/TotalNeg)*100,2)} %)')
    print(f'False Negative : CHGOFF (0) was predicted {FalseNeg} times incorrectly '
          f'({round((FalseNeg/TotalNeg)*100,2)} %)')
    print(f'True Positive : P I F (1) was predicted {TruePos} times correctly '
          f'({round((TruePos/TotalPos)*100,2)} %)')
    print(f'False Positive : P I F (1) was predicted {FalsePos} times incorrectly '
          f'({round((FalsePos/TotalPos)*100,2)} %)')
    
    print()
    asm = (accuracy_score(y_valid, predictions.round()) * 100)
    print(f'{color.bdgreen}Accuracy for model: %.2f{color.end}' % asm)
    print(f'{color.bdblue}f1-score: {color.end}')
    print(f"   CHGOFF (0) : {round(ClassificationReport['0']['f1-score']*100,2)}")
    print(f"   P I F (1)  : {round(ClassificationReport['1']['f1-score']*100,2)}")
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_valid, predictions)))
    
    return {'cmv':cmv, 'ClassificationReport':ClassificationReport, 'AccuracyScore':asm}
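The percentages printed by `model_eval` are shares of the predictions made for each class, since the totals are built from the columns of the confusion matrix. A worked example on eight hand-made labels (invented for illustration) shows the indexing:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = (int(v) for v in cm.ravel())

total_pred_neg = tn + fn   # every time class 0 was predicted
total_pred_pos = tp + fp   # every time class 1 was predicted

print(tn, fp, fn, tp)                          # 2 1 1 4
print(round(tn / total_pred_neg * 100, 2))     # 66.67
print(round(tp / total_pred_pos * 100, 2))     # 80.0
```

Here class 0 was predicted three times and was right twice, while class 1 was predicted five times and was right four times.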

**Feature Importance**

In [19]:
from xgboost import plot_importance

# Plot feature importance
def plot_features(booster, figsize):    
    fig, ax = plt.subplots(1,1,figsize=figsize,dpi=600)
    return plot_importance(booster=booster, ax=ax)



**Mutual Information**

In [20]:
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y):
    print()
    print("Please wait, Mutual Information gathering can take time ...")
    X = X.copy()
    #for colname in X.select_dtypes(["object", "category"]):
    #    X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    #discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    #mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = mutual_info_regression(X, y, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    print("Mutual Information gathering done ...")
    return mi_scores

def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")
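Because `MIS_Status` is a binary target, scikit-learn's `mutual_info_classif` may be a closer fit than the regression variant used above. The following sketch runs it on synthetic data; the feature names and the data are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)            # synthetic binary target

X = pd.DataFrame({
    'informative': y + rng.normal(0, 0.1, size=500),  # tracks the target closely
    'noise': rng.normal(0, 1, size=500),              # unrelated to the target
})

scores = pd.Series(mutual_info_classif(X, y, random_state=0),
                   name='MI Scores', index=X.columns)
print(scores.sort_values(ascending=False))
```

The feature that follows the target should score well above the unrelated one, which is the pattern to look for when ranking real columns.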

<a id="xgboost_class"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>XGBoost Class</b></div>

In [21]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from sklearn.model_selection import train_test_split
#from xgboost import XGBRegressor
from xgboost import XGBClassifier

class process_model():  
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.X_train, self.y_train = None, None
        self.X_valid, self.X_test = None, None
        self.y_valid, self.y_test = None, None

        print(f'MIS_Status Count ->  1 : {Counter(y)[1]}, 0 : {Counter(y)[0]}')
    
    # oversampling method
    def osample(self):
        # define oversampling strategy
        oversample = RandomOverSampler(sampling_strategy='minority') 
        print('X size : ', len(self.X))
        print('y size : ', len(self.y))
        # fit and apply the transform
        X_over, y_over = oversample.fit_resample(self.X, self.y)

        # summarize class distribution
        print(f'Before Oversampling -> 1 : {Counter(self.y)[1]}, 0 : {Counter(self.y)[0]}')
        print(f'After Oversampling  -> 1 : {Counter(y_over)[1]}, 0 : {Counter(y_over)[0]}')
        
        # update X and y with the oversampled results 
        self.X = X_over
        self.y = y_over
        
        # return the oversampled results in case they are needed in another module
        return {'X_over':X_over, 'y_over':y_over}
    
    def split_data(self, X_size = 0.7):   
        # Split Data into Train:Validate:Test
        
        # train_size=X_size
        # In the first step, we will split the data in training and remaining dataset
        self.X_train, X_rem, self.y_train, y_rem = train_test_split(self.X, self.y, \
                                                        train_size = X_size, random_state=48) 

        # Now since we want the valid and test size to be equal,
        # we have to define valid_size=0.5 (that is 50% of remaining data)
        # test_size = 0.5

        self.X_valid, self.X_test, self.y_valid, self.y_test = train_test_split(X_rem,y_rem,\
                                                            test_size=0.5, random_state=48)
    
        print()
        print(f'{color.bdunl}Shapes Before And After Splitting Dataset :{color.end}')
        print('X',self.X.shape,end=''), print('   y', self.y.shape)
        print('X_train',self.X_train.shape,end=''), print('   y_train', self.y_train.shape)
        print('X_valid',self.X_valid.shape,end=''), print('   y_valid', self.y_valid.shape)
        print('X_test', self.X_test.shape, end=''), print('   y_test', self.y_test.shape)
        
        return {'X_train':self.X_train, 'y_train':self.y_train, \
                'X_valid':self.X_valid, 'y_valid':self.y_valid, \
                'X_test':self.X_test, 'y_test':self.y_test}
    
    # Method to run model 
    # desc - description of metrics report
    def prep_run_model(self, desc='Metrics', cmDisplay=False, PipeLine_flag = False,\
                hyperparams = {'n_estimators': 1000, 'learning_rate': 0.05, 'max_depth': 6, \
                               'tree_method':tree_method}):
        print()
        print(f"{color.bold}Please wait, Fitting model can take time ...{color.end}")
        
        '''
        XGBRegressor is for continuous target/outcome variables. These are often called 
        "regression problems."

        XGBClassifier is for categorical target/outcome variables. These are often called 
        "classification problems."
        
        xg_model = XGBRegressor(n_estimators = self.mn_estimators, \
                                learning_rate = self.mlearning_rate, \
                                max_depth = self.mmax_depth,\
                                n_jobs=4)
        
        xg_model = XGBClassifier(n_estimators = self.mn_estimators, \
                                learning_rate = self.mlearning_rate, \
                                max_depth = self.mmax_depth,\
                                use_label_encoder =False,\
                                n_jobs=4)
        '''
        
        if PipeLine_flag == True:
            # hyperparams is a result of Optuna hyperparameter tuning (Part 3 of this notebook)
             
            '''
            hyperparams = {'lambda': 0.0011260613527792323,
                           'alpha': 0.18307583898121738,
                           'colsample_bytree': 0.5,
                           'subsample': 0.8,
                           'learning_rate': 0.02,
                           'max_depth': 11,
                           'random_state': 48,
                           'min_child_weight': 1,
                           'n_estimators': 4000,
                           'tree_method': tree_method}
            '''
            hyperparams = { 'alpha': 0.0046540057600720115,
                            'colsample_bytree': 0.5,
                            'lambda': 0.10810295148897421,
                            'learning_rate': 0.05,
                            'max_depth': 15,
                            'min_child_weight': 1,
                            'random_state': 48,
                            'subsample': 0.8,
                            'n_estimators': 4000,
                            'tree_method': tree_method}

        xg_model = XGBClassifier(**hyperparams,use_label_encoder =False, n_jobs=4)
       
        eval_setparam = [(self.X_train, self.y_train), (self.X_valid, self.y_valid)]
        
        xg_model.fit(self.X_train, self.y_train, 
                     early_stopping_rounds=400,             # 10% of n_estimators when n_estimators=4000
                     eval_metric=['error','logloss'],
                     #eval_set=[(X_valid, y_valid)], 
                     eval_set = eval_setparam,
                     verbose=False)
 
        print("Fitting model completed.")
        print()
        print('Preparing Predictions')
    
        # Get predictions
        predictions = xg_model.predict(self.X_valid)
    
        print()
        print(f'{color.underline}{desc}{color.end}')

        eval_results = model_eval(self.y_valid, predictions, cmDisplay)
            
        # Return these values as they will be needed for further testing or metrics
        # in dictionary form to remember easier 
        return {'xg_model':xg_model,'predictions':predictions, \
                    'X_train':self.X_train, 'y_train':self.y_train, \
                    'X_valid':self.X_valid, 'y_valid':self.y_valid, \
                    'X_test':self.X_test, 'y_test':self.y_test, 'eval_results':eval_results}
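The two-step split used in `split_data` gives a 70:15:15 division by default. A minimal sketch on 100 synthetic rows, reusing the same `random_state` as the class, confirms the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 synthetic rows, 2 features
y = np.array([0, 1] * 50)

# step 1: carve out the training portion
X_train, X_rem, y_train, y_rem = train_test_split(
    X, y, train_size=0.7, random_state=48)

# step 2: halve the remainder into validation and test sets
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=48)

print(len(X_train), len(X_valid), len(X_test))   # 70 15 15
```

Changing `train_size` in the first step shifts the balance, while the second step always divides what is left evenly.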

<a id="other_models"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Other Models Class</b></div>

In [22]:
from sklearn.ensemble import RandomForestClassifier

# inherit from XGBoost class (process_model)
class other_models(process_model):  
    #def __init__(self, X, y):
    #    self.X = X
    #    self.y = y

    #    print(f'MIS_Status Count ->  1 : {Counter(y)[1]}, 0 : {Counter(y)[0]}')
    
    # Method to run model 
    # desc - description of metrics report
    def prep_run_model(self, desc='Metrics', modelname = 'rfc',\
                       hparams = {'n_estimators':600, 'random_state':48, 'max_depth':10},\
                       cmDisplay=False):
        print()
        print(f"{color.bold}Please wait, Fitting model can take time ...{color.end}")  

        if modelname == 'rfc':
            model = RandomForestClassifier(**hparams) 
            model.fit(self.X_train, self.y_train)
            
        print("Fitting model completed.")
        print()
        print('Preparing Predictions')
    
        # Get predictions
        predictions = model.predict(self.X_valid)
    
        print()
        print(f'{color.underline}{desc}{color.end}')

        eval_results = model_eval(self.y_valid, predictions, cmDisplay)
            
        # Return these values as they will be needed for further testing or metrics
        # in dictionary form to remember easier 
        return {'model':model,'predictions':predictions, \
                    'X_train':self.X_train, 'y_train':self.y_train, \
                    'X_valid':self.X_valid, 'y_valid':self.y_valid, \
                    'X_test':self.X_test, 'y_test':self.y_test, 'eval_results':eval_results}

<a id="optuna_class"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Optuna Class</b></div>

In [23]:
class optuna_tuning(process_model):  
    #def __init__(self, X_train, y_train, X_valid, y_valid):
    #    self.X_train, self.y_train = X_train, y_train
    #    self.X_valid, self.y_valid = X_valid, y_valid

    def objective(self, trial):
        param = {
            # tree_method would ideally be gpu_hist for faster speed
            'tree_method':tree_method, 
            # L2 regularization weight, Increasing this value will make model more conservative
            'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
            # L1 regularization weight, Increasing this value will make model more conservative
            'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
            # fraction of columns sampled when building each tree
            'colsample_bytree': trial.suggest_categorical('colsample_bytree',
                            [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
            # sampling ratio for training data
            'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
            'learning_rate': trial.suggest_categorical('learning_rate',
                            [0.008,0.009,0.01,0.012,0.014,0.016,0.018, 0.02,0.05]),
            'n_estimators': 4000,
            # maximum depth of the tree, signifies complexity of the tree
            'max_depth': trial.suggest_categorical('max_depth', [6,7,9,11,13,15,17,20]),
            'random_state': trial.suggest_categorical('random_state', [48]),
            # minimum child weight, larger the term more conservative the tree
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
        }
        
        if GetRam() >= 95:
            raise MemoryError('Short On Memory')

        print(f"Ram Used Before Trial : {GetRam()} %")
        
        # print(param)  # for debugging, comment out if desired
        model_xgbc = XGBClassifier(**param,use_label_encoder =False)  
    
        # The xgb_model parameter of fit() allows training to continue from a saved model.
        # A model can only be saved after it has been fitted, for example with
        # `model.get_booster().save_model(path)`:
        # model_xgbc.get_booster().save_model(f'{savepath}model_xgbc_saved')
        
        model_xgbc.fit(self.X_train, self.y_train, eval_set=[(self.X_valid, self.y_valid)],
                    verbose=False, eval_metric = ['logloss'],
                    early_stopping_rounds = 400)
        # Save only after fit(); saving an unfitted XGBClassifier raises an error
        model_xgbc.save_model(f'{savepath}model_xgbc.model')
    
        preds = model_xgbc.predict(self.X_valid)
    
        rmse = metrics.mean_squared_error(self.y_valid, preds,squared=False)
    
        print(f"Ram Used After Last Trial: {GetRam()} %")
        gc.collect()
        
        return rmse

    def run_optuna_trials(self,n_trials=None,timeout=None):
        print()
        print(f"{color.bold}Please wait, finding best trial ...{color.end}")
        
        study = optuna.create_study(direction='minimize')
        try:
            study.optimize(self.objective, n_trials, timeout,
                            #callbacks=[lambda study, trial: gc.collect()],\
                            catch=(RuntimeWarning,ArithmeticError,))
        except MemoryError as e:
            print()
            print(f'{color.bdblue}{e} : Memory was getting low, Trial ended early{color.end}')
        
        print('Number of completed trials:', len(study.trials))
        print('Best trial:', study.best_trial.params)
        
        joblib.dump(study, f"{savepath}optuna_study.pkl")   # save study
        # jl = joblib.load(f"{savepath}optuna_study.pkl")   # load study
        
        return study
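
The objective above scores each trial with RMSE computed on hard 0/1 class predictions. As a minimal sketch in plain Python (the toy labels below are made up for illustration), this shows that for binary labels the RMSE equals the square root of the misclassification rate, so minimizing it ranks trials the same way as maximizing accuracy on the validation set:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Toy labels and hard class predictions (illustrative values only)
y_valid = [1, 0, 1, 1, 0, 1, 0, 1]
preds = [1, 0, 0, 1, 0, 1, 1, 1]

score = rmse(y_valid, preds)
# For 0/1 values each squared error is 0 or 1, so RMSE == sqrt(error rate)
print(f"RMSE: {score:.4f}  error rate: {error_rate(y_valid, preds):.4f}")
```

Because this metric only sees hard labels, it ignores the predicted probabilities; scoring trials with a probability-based metric such as log loss would be a different design choice.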

<a id="optuna_class_batch"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Optuna Class - tuning by batches</b></div>

In [24]:
# Optuna Tuning by batches, much slower, but lighter on memory
class optuna_tuning_batch(process_model): 
    def objective(self, trial):
        param = {
            # tree_method would ideally be gpu_hist for faster speed
            'tree_method':tree_method, 
            # L2 regularization weight, Increasing this value will make model more conservative
            'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
            # L1 regularization weight, Increasing this value will make model more conservative
            'alpha': trial.suggest_loguniform('alpha', 1e-3, 10.0),
            # sampling according to each tree
            'colsample_bytree': trial.suggest_categorical('colsample_bytree',
                            [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
            # sampling ratio for training data
            'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
            'learning_rate': trial.suggest_categorical('learning_rate',
                            [0.008,0.009,0.01,0.012,0.014,0.016,0.018, 0.02,0.05]),
            'n_estimators': 4000,
            # maximum depth of the tree, signifies complexity of the tree
            'max_depth': trial.suggest_categorical('max_depth', [6,7,9,11,13,15,17,20]),
            'random_state': trial.suggest_categorical('random_state', [48]),
            # minimum child weight, larger the term more conservative the tree
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 300)
        }
        if GetRam() >= 95:
            raise MemoryError('Short On Memory')
        print(f"Ram Used Before Trial: {GetRam()} %")
        
        model_xgbc = XGBClassifier(**param,use_label_encoder =False)  
    
        rt1=dt.datetime.now()
        
        '''
        For batch, use xgb_model parameter in fit().  There are two ways :
           1. save the model to a file, after 1st trial, then give the name to the next trials
           2. just give the name of the model object, in this case model_xgbc
        '''
        # Fit Model
        for i, (X_batch, y_batch) in enumerate(zip(self.X_train_batched, self.y_train_batched)):
            print(f'Step: {i}',end = ' ')
            if i == 0:
                model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400)
            else:
                model_xgbc.fit(X_batch, y_batch, eval_set=[(self.X_valid, self.y_valid)],
                        verbose=False, eval_metric = ['logloss'],
                        early_stopping_rounds = 400, 
                        xgb_model = model_xgbc
                        # uncomment below if you want to use a saved file
                        #xgb_model = f'{savepath}model_xgbc.json'
                        )
            
            # uncomment below if using a saved file
            #model_xgbc.get_booster().save_model(f'{savepath}model_xgbc.json')
            
            preds = model_xgbc.predict(self.X_valid)
    
            rmse = metrics.mean_squared_error(self.y_valid, preds,squared=False)
            trial.report(rmse, i+1)
            
            gc.collect()
            
            if trial.should_prune():
                raise optuna.TrialPruned()
        
        rt2=dt.datetime.now()
        runtime(rt1,rt2)
            
        print(f"Ram Used After Last Trial: {GetRam()} %")
        gc.collect()
        
        return rmse

    def run_optuna_trials(self,batch_size=1,n_trials=None,timeout=None):
        self.X_train_batched, self.y_train_batched =    self.X_train.reshape(-1,batch_size,16), \
                                                        self.y_train.reshape(-1,batch_size) 
        print()
        print(f'X_train_batched Shape: {self.X_train_batched.shape}')
        print(f'y_train_batched Shape: {self.y_train_batched.shape}')
        print()
    
        print(f"{color.bold}Please wait, finding best trial ...{color.end}")
        
        study = optuna.create_study(direction='minimize')
        try:
            study.optimize(self.objective, n_trials, timeout,
                            #callbacks=[lambda study, trial: gc.collect()],\
                            catch=(RuntimeWarning,ArithmeticError,))
        except MemoryError as e:
            print(f'{color.bdblue}{e} : Memory was getting low, Trial ended early{color.end}')
            
        print('Number of completed trials:', len(study.trials))
        print('Best trial:', study.best_trial.params)
        
        joblib.dump(study, f"{savepath}optuna_study.pkl")   # save study
        # jl = joblib.load(f"{savepath}optuna_study.pkl")   # load study
        
        return study
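
Assuming `X_train` is a NumPy array, `reshape(-1, batch_size, 16)` in `run_optuna_trials` only works when the number of training rows is an exact multiple of `batch_size`. A possible alternative, sketched here with a hypothetical helper `batch_slices` that is not part of the notebook, is to compute row slices so that the last batch may be shorter:

```python
def batch_slices(n_rows, batch_size):
    """Return consecutive slices that cover n_rows in steps of batch_size.

    The final slice is shorter when n_rows is not a multiple of batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [slice(start, min(start + batch_size, n_rows))
            for start in range(0, n_rows, batch_size)]


# Toy data: 10 rows standing in for X_train / y_train (hypothetical values)
rows = list(range(10))
labels = [r % 2 for r in rows]

batches = [(rows[s], labels[s]) for s in batch_slices(len(rows), 4)]
for i, (x_batch, y_batch) in enumerate(batches):
    print(f"Step {i}: {len(x_batch)} rows")
```

The same slices can index a NumPy array directly or be passed to `DataFrame.iloc`, so no row has to be discarded to make the shapes fit.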

<a id="part1"></a>
<div style="font-family: Trebuchet MS;background-color:DarkRed;color:AliceBlue;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
<h1 id="part1" style='color:GhostWhite;'>Part 1. Pipeline</h1>
This pipeline handles both X and y
</div>

<a id="pl_classes"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>PipeLine Classes</b></div>

In [25]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
  
class PL_Object():
    def __init__(self,X,y):
        #store X and Y
        self.X=X
        self.y=y

class PreProcessor(BaseEstimator, TransformerMixin):
    def __init__(self,operation= 'X'):
        self.operation=operation
    @staticmethod
    def enabled(**kwargs):
        return True

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # check the parameters and return X and y inside the object
        X_data=X.X
        y_data=X.y
        
        print()
        print(f'{color.bdunl}PreProcessor initiated for {self.operation}{color.end}')
        
        #  do some work and assign it back to the X object which contains both X and y data
        if self.operation=='X':
            '''
            # NOTE: 'MIS_Status' is the target (y), but still in X, as we need to drop rows
                    with NaNs. We cannot do it separately, as there will be a mismatch in count 
                    of rows.  At the end of this procedure, we separate the new target data from X 
                    and update y.
            '''
            
            # Drop Na from rows
            #---------------------
            print(f'{color.bdblue}Drop Na{color.end}')
            X_data.dropna(subset=['DisbursementDate', 'NewExist', 'City', 'State',\
                        'LowDoc', 'Name', 'NAICS', 'CreateJob', 'RetainedJob', 'FranchiseCode',\
                        'UrbanRural', 'NoEmp', 'Term', 'MIS_Status'], how='any', inplace=True)
            
            # drop invalid classifications
            print('   Drop invalid classifications')
            X_data = X_data[(X_data['LowDoc'] == 'Y') | (X_data['LowDoc'] == 'N')]
            
            X_data = X_data[(X_data['NewExist'] == 1) | (X_data['NewExist'] == 2)]   
            
            # Trim leading and trailing spaces
            #---------------------------------
            print('   Trim leading and trailing spaces, if any')
            X_data['City'] = X_data['City'].str.strip()
            
            # Change dtype for columns needed for calculation or string extraction 
            #------------------------------------------------------------------------
            print(f'{color.bdgreen}Change dtype for columns needed for calculation or string extraction{color.end}')
            X_data = X_data.astype({'DisbursementGross':np.float64,'SBA_Appv':np.float64,\
                              'GrAppv':np.float64, 'ChgOffPrinGr':np.float64,\
                              'NAICS':np.str_, 'NewExist':np.int8})
            
            # Drop Duplicate Rows
            #------------------------------------------------------------------------
            print(f'{color.bdblue}Drop Duplicate Rows{color.end}')
            dupl_series = X_data.duplicated()
            num_of_dupl = len(X_data[dupl_series == True])
            if num_of_dupl > 0:
                X_data.drop_duplicates(inplace=True)
        
            # Create New Features
            #-----------------------
            print(f'{color.bdblue}Create New Features{color.end}')
            X_data['Industry'] = X_data['NAICS'].str[0:2]
            X_data = X_data.astype({'Industry':np.int8})
            
            X_data['Recession'] = np.where((X_data['DisbursementDate'] >= '2007-12-01')\
                     & (X_data['DisbursementDate'] <= '2009-06-30'), 1, 0)
            
            X_data['RealEstate'] = np.where(X_data['Term'] >= 240, 1, 0)
            
            X_data['SBA_Portion']=(X_data['SBA_Appv']/X_data['GrAppv']) * 100
            
            X_data["CityState"] = X_data["City"] + "_" + X_data["State"]
            
            print()
            print(f"X length = {len(X_data)}")
            print(f"Y length = {len(X_data['MIS_Status'])}")
            
            # Update X object
            X.X = X_data                      # type DataFrame
            X.y = X_data.pop('MIS_Status')    # type series
            
        elif self.operation=='y':
            pass                      
        else:
            pass
        
        #return modified X object
        return X
    

class EncodeCategorical(BaseEstimator, TransformerMixin):
    def __init__(self,operation= 'X'):
        self.operation=operation
    @staticmethod
    def enabled(**kwargs):
        return True

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # encode categorical features and return X and y inside the object
        X_data=X.X
        y_data=X.y
        
        print()
        print(f'{color.bdunl}Encode Categorical Features initiated for {self.operation}{color.end}')
        
        #  do some work and assign it back to the X object
        if self.operation=='X':         
            X_data['LowDoc'] = np.where((X_data['LowDoc'] == 'Y'), 1, 0)
            
            len_data=len(X_data)
            #cols_to_drop = []
            hash_constant = 900000   # fixed value so we can programmatically reproduce the hash
            #for col in X_data.columns:
            for col in X_data[['State','CityState']]:
                if X_data[col].dtype == 'object':
                    print(f'Column {col} has {X_data[col].nunique()} values among {len_data}')

                    if X_data[col].nunique() < 25:
                        print(f'One-hot encoding of {col}')
                        one_hot_cols = pd.get_dummies(X_data[col])
                        for ohc in one_hot_cols.columns:
                            X_data[col + '_' + ohc] = one_hot_cols[ohc]
                    else:
                      print(f'Hashing of {col}')
                      X_data[col + '_hash'] = X_data[col].apply(lambda row: int(hashlib.sha1((col +\
                                "_" + str(row)).encode('utf-8')).hexdigest(), 16) % hash_constant)

            X.X = X_data
            
        elif self.operation=='y':
            y_data = np.where(y_data == 'P I F', 1, 0)
            
            y_data = y_data.astype(np.int8)
            
            # convert back to series
            y_data = pd.Series(y_data)

            X.y = y_data                      
        else:
            pass
        #return modified X
        return X    

class DropColumns(BaseEstimator, TransformerMixin):
    def __init__(self,operation= 'X'):
        self.operation=operation
    @staticmethod
    def enabled(**kwargs):
        return True

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_data=X.X
        
        print()
        print(f'{color.bdunl}Drop Columns initiated for {self.operation}{color.end}')
        
        #  do some work and assign it back to the X object
 
        # Dropping 'City' as 'CityState_hash' is preferable
        # Zip code has invalid values like 1 and 2. Zero-padding 1 to 00001 would not fix it:
        # that would imply a single state (Alaska), but Zip code 1 appears under several
        # different states in the dataset
        cols_to_drop = ['LoanNr_ChkDgt', 'Zip', 'Bank', 'BankState', 'ApprovalDate', \
                        'ApprovalFY', 'ChgOffDate', 'BalanceGross', 'NAICS', 'ChgOffPrinGr', \
                        'Name', 'RevLineCr', 'DisbursementDate', 'City', 'State', 'CityState',\
                        'GrAppv']

        X_data.drop(columns=cols_to_drop, inplace=True)
            
        print()
        print('Unneeded Columns Dropped')
        
        # reduce mem usage of X_data as final step
        X_data = reduce_mem_usage(X_data)
        
        print()
        print(X_data.info())

        X.X = X_data
            
        #return modified X
        return X    

class XGBoost(BaseEstimator, TransformerMixin):
    def __init__(self,operation= 'X'):
        self.operation=operation
    @staticmethod
    def enabled(**kwargs):
        return True

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_data=X.X
        y_data=X.y
        
        print()
        print(f'{color.bdunl}XGBoost initiated{color.end}')
        #print(len(X_data))
        #print(len(y_data))
        
        # Get predictions using training and validation data
        xg_model_run = process_model(X_data, y_data)
        xg_model_run.osample()
        xg_model_run.split_data(0.7)
        xg_model_run_results = xg_model_run.prep_run_model("Train/Valid Data Metrics",\
                                                          cmDisplay=True, PipeLine_flag = True)   
        
        #Test with unseen data
        print()
        print(f'{color.bdunl}Test With Unseen Data X_test and y_test{color.end}')
        
        xg_model = xg_model_run_results['xg_model']
        x_test = xg_model_run_results['X_test']
        y_test = xg_model_run_results['y_test']
        
        # Get predictions
        predictions = xg_model.predict(x_test)
        cmv = model_eval(y_test, predictions)

        X.X = X_data
            
        '''
        A dictionary is returned, and its values can be used outside the pipeline if needed
        
        {'xg_model':xg_model,'predictions':predictions, \
                    'X_train':X_train, 'y_train':y_train, \
                    'X_valid':X_valid, 'y_valid':y_valid, \
                    'X_test':X_test, 'y_test':y_test, 'cmv':cmv}
        '''
        return xg_model_run_results
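
`EncodeCategorical` hashes high-cardinality columns such as `CityState` into a fixed number of buckets. The same expression can be isolated in a small helper (the name `hash_bucket` and the sample values are illustrative only) to show that the result is reproducible across runs, unlike Python's built-in `hash()`, which is salted per process for strings by default:

```python
import hashlib

HASH_CONSTANT = 900000  # same bucket count as in EncodeCategorical

def hash_bucket(column, value, n_buckets=HASH_CONSTANT):
    """Map a column name and value to a stable integer in [0, n_buckets)."""
    key = f"{column}_{value}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest(), 16) % n_buckets

# The same input always lands in the same bucket
first = hash_bucket("CityState", "EVANSVILLE_IN")
second = hash_bucket("CityState", "EVANSVILLE_IN")
# The column name is part of the key, so it influences the bucket
other = hash_bucket("State", "EVANSVILLE_IN")
print(first, second, other)
```

Distinct values can still collide in the same bucket, so the hashed column should be read as a compact encoding and not as a unique identifier.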

<a id="load_pl_df"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Load Dataset for PipeLine</b></div>

In [None]:
X = pd.read_csv(f'{filepath}SBAnational.csv',\
                 converters = {'DisbursementGross':fixvals,'SBA_Appv':fixvals,\
                              'GrAppv':fixvals, 'ChgOffPrinGr':fixvals}, \
                              parse_dates=['DisbursementDate'], low_memory=False)
print("Shape of original SBA dataset : ", X.shape)
print()
print(X[['DisbursementGross','SBA_Appv','GrAppv','ChgOffPrinGr','DisbursementDate']].head(2))

# Filter data to before 2011
X = X[X['DisbursementDate'] <= '2010-12-31']

print()
# Sanity check: this should be 0 after the filter above
print(f"Size of data after 2010-12-31 : {len(X[X['DisbursementDate'] > '2010-12-31'])}")
print()
print(f"Size of data before 2011 : {len(X[X['DisbursementDate'] < '2011-01-01'])}")

'''
X still contains the target 'MIS_Status', as we have to drop rows 
with NaNs in the pipeline. "MIS_Status" will be separated from X later in the pipeline

Select target - y is initialized as it goes into the pipeline, but will be updated in the pipeline 
after preprocessing.  Others preprocess y outside the pipeline; here, y will be preprocessed in
the pipeline.
'''
y = X['MIS_Status']

In [None]:
#Assign X and y to the object
My_Object=PL_Object(X,y)
My_Object.X.head(2)

<a id="pl_run"></a>
<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Run the pipeline</b></div>

In [None]:
%%time

def RunPipeLine():
    rt1=dt.datetime.now()
    #Assign X and y to the object
    My_Object=PL_Object(X,y)

    #Build a simple pipeline

    My_Pipeline=Pipeline([('X Prep',PreProcessor('X')),
                          ('X EnCat',EncodeCategorical('X')),
                          ('y EnCat',EncodeCategorical('y')),
                          ('DropCols',DropColumns()),
                          ('XGBoost',XGBoost())
                         ])

    My_Object=My_Pipeline.transform(My_Object)

    print()
    print(f'{color.bdred}These results were obtained using Optuna tuning{color.end}')
    
    print()
    print(f'{color.bold}Pipeline Process Completed.{color.end}')

    rt2=dt.datetime.now()
    runtime(rt1,rt2)
    print()
    
    return My_Object        # for further usage below
    
MyObject = RunPipeLine()
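
The pipeline passes a single object carrying both `X` and `y` through every step, so that a step which drops rows can keep the two aligned. A stripped-down sketch of that idea in plain Python, without scikit-learn, follows; the class names and toy rows are hypothetical and only mirror the roles of `PL_Object`, `PreProcessor`, and `EncodeCategorical`:

```python
class Carrier:
    """Holds features and target together, like PL_Object."""
    def __init__(self, X, y):
        self.X = X
        self.y = y


class DropMissingTarget:
    """Drops rows whose target is missing, from X and y at the same time."""
    def transform(self, obj):
        keep = [i for i, target in enumerate(obj.y) if target is not None]
        obj.X = [obj.X[i] for i in keep]
        obj.y = [obj.y[i] for i in keep]
        return obj


class EncodeTarget:
    """Encodes 'P I F' as 1 and anything else as 0."""
    def transform(self, obj):
        obj.y = [1 if target == "P I F" else 0 for target in obj.y]
        return obj


def run_steps(obj, steps):
    """Apply each step in order, passing the same carrier along."""
    for step in steps:
        obj = step.transform(obj)
    return obj


# Toy rows (made-up values): (Term, NoEmp) pairs and a loan status
X_rows = [(84, 4), (60, 2), (180, 7), (240, 14)]
y_rows = ["P I F", None, "CHGOFF", "P I F"]

result = run_steps(Carrier(X_rows, y_rows), [DropMissingTarget(), EncodeTarget()])
print(result.X)
print(result.y)
```

Since every step receives and returns the carrier, the row counts of `X` and `y` cannot drift apart between steps.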

<div class="alert alert-block alert-info" style="color:DarkSlateBlue">
    <b>For information only:</b> the cell below shows how the data (a dictionary) returned by the pipeline and stored in MyObject can be used
    </div>      

In [None]:
def obj_sample_usage():
    print(MyObject.keys())
    pl_model = MyObject['xg_model']
    x=plot_features(pl_model, (10,14))
    print()
    MyObject['X_train'].info()
    
obj_sample_usage()

In [None]:
# clear some variables from memory
del X, y, MyObject
gc.collect()

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("SBA Machine Learning PipeLine completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

<a id="part2"></a>
<div style="font-family: Trebuchet MS;background-color:DarkRed;color:AliceBlue;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
<h1 style='color:GhostWhite;'>Part 2 : Data Exploration and Preparation, Modeling, Metrics</h1></div>

<a id="de_load_df"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
<h3 style='color:GhostWhite;'>1. Load Dataset</h3></div>

In [None]:
sba = pd.read_csv(f'{filepath}SBAnational.csv', low_memory=False)

print("Shape of SBA : ", sba.shape)
print(sba.info(memory_usage = 'deep'))

<a id="dep"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>2. Data Exploration / Preparation</h2><br>
    </div>

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Reload dataset with some conversion</b><br>
    After review, we reload the dataset and convert some features that may be needed for calculations while loading. The conversion could also be done after loading; it is done here to show how converters are used.
    </div>

In [None]:
del sba
gc.collect()
sba = pd.read_csv(filepath + 'SBAnational.csv',\
                 converters = {'DisbursementGross':fixvals,'SBA_Appv':fixvals,\
                              'GrAppv':fixvals, 'ChgOffPrinGr':fixvals},\
                              parse_dates=['DisbursementDate'], \
                              low_memory=False)

# Convert dtype of some columns that will be used in calculation or string extraction
sba = sba.astype({'DisbursementGross':np.float64,'SBA_Appv':np.float64,\
                              'GrAppv':np.float64, 'ChgOffPrinGr':np.float64, 'NAICS':np.str_})

print("Shape of SBA : ", sba.shape)
print(sba.info(memory_usage = 'deep'))
sba.head(2)
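
`fixvals` is defined elsewhere in the notebook. As a rough illustration of what a `read_csv` converter of this kind might look like, here is a hypothetical `clean_currency` function; it assumes the raw column values are strings such as `$60,000.00 ` and is not the notebook's actual implementation:

```python
import math

def clean_currency(text):
    """Hypothetical converter: strip '$', ',' and spaces, then parse as float.

    Empty strings become NaN so they can be handled later with dropna().
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned) if cleaned else math.nan

for raw in ["$60,000.00 ", "$0.00", ""]:
    print(repr(raw), "->", clean_currency(raw))
```

A function like this is supplied per column through the `converters` argument of `pd.read_csv`, which calls it on each raw string value of that column.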

<a id="conv_dtype"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>2.1 EDA Tools</h2>
    </div>

## SweetViz

In [None]:
GetSweetVizReport(sba)

print()
(kaggle_flag == 1) and create_download_link('Open SweetViz Report in browser ---> ',\
                                            f'{savepath}SBA_sweetviz_report_before.html')

## Pandas Profiler
- this seems to be buggy as of April 2022

In [None]:
'''
For a better experience, the report is created as an html file that can be opened in a browser,
and downloaded from there  (Save As ..., html)
'''
def GetPandasProfiling():
    print(f'{color.bdblue}Please wait ... Profiling Report will take some time.{color.end}')

    df = sba.copy()
    # uncomment if one wants to see the report in a cell below
    # df.profile_report(title='SBA Pandas Profiling Report')

    try:
        profile = df.profile_report(title='SBA Pandas Profiling Report', progress_bar=False,\
                                    correlations={
                                        "pearson": {"calculate": True},
                                        "spearman": {"calculate": True},
                                        "kendall": {"calculate": False},
                                        "phi_k": {"calculate": True}
                                        })
        profile.to_file(output_file = f'{savepath}SBA_Profiling_Report.html')
        print(f'{color.bdblue}Profiling Report completed.{color.end}')
        print()
        (kaggle_flag == 0) and print(f'SBA Profiling Report has been downloaded to path {savepath}')
        
    except Exception as e:
        print(f'Error: {e}')

'''
GetPandasProfiling()

gc.collect()
print()
(kaggle_flag == 1) and create_download_link('Open SBA Profiling Report in browser ---> ', \
                           f'{savepath}SBA_Profiling_Report.html')
'''

## DataPrep
- this also seems buggy

In [None]:
from dataprep.datasets import load_dataset
from dataprep.eda import create_report

def GetDataPrepReport():
    print('Please wait, generating DataPrep report')
    try:
        df = sba.copy()
        report = create_report(df, title='SBA DataPrep Report', progress = False)
        report.save(f'{savepath}sba_dataprep_report')
    
        (kaggle_flag == 0) and report.show_browser()
        
    except Exception as e:
        print(f'Error: {e}')
        
'''
GetDataPrepReport()
gc.collect()

# open html in browser, and from there, one can download using Save As, html   
(kaggle_flag == 1) and create_download_link('SBA DataPrep Report in browser ---> ',\
                                             f'{savepath}sba_dataprep_report.html');
'''

In [None]:
del conv_dict
gc.collect()

<a id="drop_rows_cols"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>2.2 Drop rows or columns if needed</h2>
    </div>

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Check for na's in all columns, as well as invalid categories</b></div>

In [None]:
check_cols_with_nulls(sba)

In [None]:
print(f'{color.bdunl}Features with NA values{color.end}')
sba.isna().sum()

**Relative to the size of the dataset, few rows have NA values in the following features, so those rows can be dropped.**

In [None]:
sba.dropna(subset=['DisbursementDate', 'NewExist', 'City', 'State',\
                        'LowDoc', 'Name', 'NAICS', 'CreateJob', 'RetainedJob', 'FranchiseCode',\
                        'UrbanRural', 'NoEmp', 'Term', 'MIS_Status'], how='any', inplace=True)      
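
With `subset` and `how='any'`, `dropna` removes a row only if at least one of the listed columns is missing; NA values in other columns are left alone. A small sketch on a made-up frame (the column values are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "City": ["EVANSVILLE", None, "NEW PARIS", "BLOOMINGTON"],
    "LowDoc": ["Y", "N", None, "N"],
    # Not listed in subset, so the missing value here does not drop the row
    "RevLineCr": ["N", "0", "Y", None],
})

before = len(demo)
demo = demo.dropna(subset=["City", "LowDoc"], how="any")
print(f"Dropped {before - len(demo)} of {before} rows")
print(demo)
```

Comparing the row count before and after, as done here, is a quick way to confirm that only a small share of the data is lost.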

In [None]:
sba.isna().sum()

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>RevLineCr</b></div>

In [None]:
len(sba[(sba['RevLineCr'] != 'Y') & (sba['RevLineCr'] != 'N')])
# too many unknowns, we will drop 'RevLineCr' later

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>LowDoc</b></div>

In [None]:
len(sba[(sba['LowDoc'] != 'Y') & (sba['LowDoc'] != 'N')])

In [None]:
sns.countplot(x='LowDoc',data=sba)

* **LowDoc seems to have a bearing**

In [None]:
# we can drop rows that are not 'Y' or 'N'
sba = sba[(sba['LowDoc'] == 'Y') | (sba['LowDoc'] == 'N')]
len(sba[(sba['LowDoc'] != 'Y') & (sba['LowDoc'] != 'N')])

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>NewExist</b></div>

In [None]:
len(sba[(sba['NewExist'] != 1) & (sba['NewExist'] != 2)])

In [None]:
sns.countplot(x='NewExist',data=sba)

In [None]:
# records that are not 1 or 2, we can drop these rows as NewExist seems to have a bearing
sba = sba[(sba['NewExist'] == 1) | (sba['NewExist'] == 2)]
len(sba[(sba['NewExist'] != 1) & (sba['NewExist'] != 2)])

In [None]:
sba = sba.astype({'NewExist':np.int8})

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>FranchiseCode</b></div>

In [None]:
sba['FranchiseCode'].unique()

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>UrbanRural</b></div>

In [None]:
sba['UrbanRural'].unique()

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Term</b></div>

In [None]:
print(len(sba[sba['Term'].isna()]))
print(len(sba[sba['Term']==0]))
print(len(sba[sba['Term']<0]))

In [None]:
sba.head(2)

In [None]:
# Trim leading and trailing spaces
sba['City'] = sba['City'].str.strip()

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Check for na's in all columns</b></div>

In [None]:
check_cols_with_nulls(sba)

# We can ignore these, as the affected features will be dropped later

In [None]:
len(sba)

In [None]:
# Save 2
def Save2():
    # for feather format, reset_index(drop=True) to prevent "Unnamed column" being created
    sdf = sba.copy().reset_index(drop=True)
    sdf.to_feather(f'{savepath}sba_save2.csv.feather')

    # index=False to prevent "Unnamed Column" being created
    #sba.to_csv(f'{savepath}sba_save2.csv', index=False)
    
    print('Saved to sba_save2.csv.feather')

Save2()

# Short circuiting
(kaggle_flag == 1) and FileLink(r'sba_save2.csv.feather')  # Kaggle only

<a id="drop_duplicates"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>2.3 Drop Duplicate Rows</h2>
    </div>

In [None]:
def DropDuplicates():
    dupl_series = sba.duplicated()
    num_of_dupl = len(sba[dupl_series == True])
    if num_of_dupl > 0:
        print(f'Number of Duplicates : {color.bold}{num_of_dupl}{color.end}')
        print()
        print(sba[dupl_series].head(5))
        sba.drop_duplicates(inplace=True)
        print()
        print(f'{color.bold}{num_of_dupl}{color.end} duplicate rows were dropped.')
    else:
        print(f'Duplicate rows found: {color.bold}None{color.end}')

#DropDuplicates()

<a id="create_new_features"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>2.4 Create New Features</h2>
    </div>

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Industry</b> - The industry sector is the 1st 2 digits of NAICS
    </div>

In [None]:
sba['Industry'] = sba['NAICS'].str[0:2]
sba = sba.astype({'Industry':np.int32})

In [None]:
sba['Industry'].head(2)

In [None]:
sba['Industry'].unique()
# An invalid industry code, 0, appears here; it is caused by blank NAICS values

In [None]:
len(sba[sba['Industry'] == 0])
# This is a bummer, as industry sector has a big effect on a business, speaking as a business 
# domain expert.  Do we drop those with NAICS = 0 ?

In [None]:
# At this stage, we leave it as is and treat it as unknown industry
sba.head(2)

In [None]:
# Check if we can impute from the name.  For example, a bar (or similar) business
sba[(sba['Name'].str.contains('bar',case=False)) & (sba['Industry'] == 0)]\
    [['Name','Industry']].head(10)

**It's not feasible to impute missing Industry codes efficiently, so we abandon the idea.**
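The Industry derivation above can also be written defensively, so that blank or non-numeric NAICS values map to 0 (unknown) instead of raising during the integer conversion. This is only a sketch: the helper name `naics_sector` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd

def naics_sector(naics: pd.Series) -> pd.Series:
    """Return the 2-digit sector; blank or non-numeric codes become 0 (unknown)."""
    first_two = naics.astype(str).str.strip().str[:2]
    # errors="coerce" turns anything that is not a number into NaN, which we map to 0
    return pd.to_numeric(first_two, errors="coerce").fillna(0).astype("int32")

# Hypothetical NAICS values, including the '0' and blank cases discussed above
example = pd.Series(["722410", "0", "", "451120"])
sectors = naics_sector(example)
print(sectors.tolist())
```

Treating unknown sectors as their own category (0) keeps those rows in the dataset, which matches the decision above to leave them as is.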

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Recession</b><br>
We want to account for variation due to the Great Recession (December 2007 to June 2009). Should we separate the dataset into different time periods (before, during, and after)? Let's check how large the sets are later. In the meantime, we create a new feature, Recession, set to 1 if the DisbursementDate falls within the recession and 0 otherwise.
<br><br>
</div>

In [None]:
# Convert "DisbursementDate" to datetime

# sba['DisbursementDate'] = pd.to_datetime(sba['DisbursementDate'], format='%d-%b-%y')

# sba.head(2)

In [None]:
# Flag loans disbursed during the Great Recession (December 2007 to June 2009)
sba['Recession'] = np.where((sba['DisbursementDate'] >= '2007-12-01')\
                     & (sba['DisbursementDate'] <= '2009-06-30'), 1, 0)

In [None]:
print(f'Total - {len(sba)}')
y = len(sba[sba['Recession'] == 1])
n = len(sba[sba['Recession'] == 0])
print(f'Recession - {y}')
print(f'Not Recession - {n}')
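As a small self-contained illustration of the Recession flag, the sketch below applies the stated window (December 2007 to June 2009, both ends inclusive) to a few boundary dates. The helper name `recession_flag` and the sample dates are hypothetical.

```python
import pandas as pd

RECESSION_START = pd.Timestamp("2007-12-01")
RECESSION_END = pd.Timestamp("2009-06-30")

def recession_flag(dates: pd.Series) -> pd.Series:
    """Return 1 for dates inside the recession window (inclusive), else 0."""
    dates = pd.to_datetime(dates)
    # Series.between is inclusive on both ends by default; missing dates yield 0
    return dates.between(RECESSION_START, RECESSION_END).astype(int)

# Hypothetical disbursement dates on either side of each boundary
sample = pd.Series(["2007-11-30", "2007-12-01", "2009-06-30", "2009-07-01"])
flags = recession_flag(sample)
print(flags.tolist())
```

Checking the boundary dates explicitly is a cheap way to confirm that the window matches the period described in the text.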

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Real Estate</b><br>
Loans backed by real estate will have terms 20 years or greater (≥240 months) and are the only loans granted for such a long term, whereas loans not backed by real estate will have terms less than 20 years ( < 240 months).<br><br>
1 - Backed By Real Estate<br>
0 - Not Backed By Real Estate<br><br></div>

In [None]:
# Create new column based on condition
sba['RealEstate'] = np.where(sba['Term'] >= 240, 1, 0)

In [None]:
print(f'Total - {len(sba)}')
y = len(sba[sba['RealEstate'] == 1])
n = len(sba[sba['RealEstate'] == 0])
print(f'Yes - {y}')
print(f'No - {n}')
print(f'Yes and No - {y+n}')

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>SBA_Portion</b><br>
SBA_Portion is the percentage of the loan that is guaranteed by the SBA. It is the ratio of the amount the SBA guarantees to the gross amount approved by the bank, expressed as a percentage: (SBA_Appv / GrAppv) * 100.<br><br></div>

In [None]:
sba['SBA_Portion']=(sba['SBA_Appv']/sba['GrAppv']) * 100
sba.head(2)

**CityState**

In [None]:
sba["CityState"] = sba["City"] + "_" + sba["State"]
sba[["CityState", "City", "State"]].head()

In [None]:
sba.head(2)

In [None]:
# Save 3
def Save3():
    sdf = sba.copy().reset_index(drop=True)
    sdf.to_feather(f'{savepath}sba_save3.csv.feather')

    print('Saved to sba_save3.csv.feather')
    
Save3()

(kaggle_flag == 1) and FileLink(r'sba_save3.csv.feather')  # Kaggle only

<a id="encode_cat"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>2.5 Encode Categorical Features</h2>
    </div>

In [None]:
sba.select_dtypes(["object"]).nunique()

<div style="font-family: Trebuchet MS;background-color:Chocolate;color:AliceBlue;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>MIS_Status</b><br>
    This will be the <b>target</b> variable</div>

In [None]:
sns.set_style('whitegrid')
# Target variable is MIS Status, a categorical variable

print(sba['MIS_Status'].value_counts())
sns.countplot(x='MIS_Status',data=sba)

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    The target distribution is skewed. This imbalance can bias many machine learning algorithms, leading some to ignore the minority class entirely (in this case, CHGOFF). Before oversampling the data, we first try the data as is.<br><br></div>

In [None]:
# Update column based on condition
sba['MIS_Status'] = np.where((sba['MIS_Status'] == 'P I F'), 1, 0)

In [None]:
print(sba['MIS_Status'].dtype)
sba.head(2)[['City','MIS_Status']]

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>LowDoc</b><br>
'Y' = 1<br>
'N' = 0</div>

In [None]:
# Update column based on condition
sba['LowDoc'] = np.where((sba['LowDoc'] == 'Y'), 1, 0)

sba.head(2)

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Others</b></div>

In [None]:
# will not hash 'City' as it is already covered by 'CityState'

def HashCol():
    cols_to_drop = []
    hash_constant = 900000   # fixed value so we can programmatically reproduce the hash when needed
    len_data=len(sba)
    for col in sba[['State','CityState']]:
        if sba[col].dtype == 'object':
            print(f'Column {col} has {sba[col].nunique()} values among {len_data}')

        if sba[col].nunique() < 25:
            print(f'One-hot encoding of {col}')
            one_hot_cols = pd.get_dummies(sba[col])
            for ohc in one_hot_cols.columns:
                sba[col + '_' + ohc] = one_hot_cols[ohc]
        else:
            print(f'Hashing of {col}')
            sba[col + '_hash'] = sba[col].apply(lambda row: int(hashlib.sha1((col + "_" + \
                                    str(row)).encode('utf-8')).hexdigest(), 16) % hash_constant)

        cols_to_drop.append(col)
    print(cols_to_drop)

HashCol()

In [None]:
sba.head(2)[['State','CityState','State_hash','CityState_hash']]
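The hashing step works because SHA-1 is deterministic: the same column name and value always produce the same bucket, which is what lets us reproduce the hash later for user input. Below is a minimal standalone sketch of the same formula; the function name `stable_hash` is hypothetical, while the prefix scheme and the 900000 modulus mirror the code above.

```python
import hashlib

HASH_BUCKETS = 900_000  # same fixed modulus as hash_constant above

def stable_hash(column: str, value: str, buckets: int = HASH_BUCKETS) -> int:
    """Map a (column, value) pair to a reproducible integer in [0, buckets)."""
    key = f"{column}_{value}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest(), 16) % buckets

# The same input always lands in the same bucket, across runs and machines
state_bucket = stable_hash("State", "IN")
city_state_bucket = stable_hash("CityState", "EVANSVILLE_IN")
print(state_bucket, city_state_bucket)
```

Note that hashing into a fixed number of buckets can map two different values to the same integer, and the resulting numbers carry no ordering meaning; the model treats them as numeric features regardless.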

In [None]:
sba.info()

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>TimeFrame</b><br>
Create a dataset for later use that excludes loans disbursed after 2010. Loan terms are frequently 5 or more years, so the final status of more recently disbursed loans may not yet be known.
    <br><br></div>

In [None]:
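sba_bef_2011 = sba[sba['DisbursementDate'] <= '2010-12-31'].copy()
# Sanity checks: expect 0 rows after 2010, and all remaining rows on or before 2010-12-31
print(len(sba_bef_2011[sba_bef_2011['DisbursementDate'] > '2010-12-31']))
print(len(sba_bef_2011[sba_bef_2011['DisbursementDate'] <= '2010-12-31']))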

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Drop columns that are no longer needed</b></div>

In [None]:
# Save 4
def Save4():
    sdf = sba.copy().reset_index(drop=True)
    sdf.to_feather(f'{savepath}sba_save4.csv.feather')

    print('saved to sba_save4.csv.feather')

Save4()

(kaggle_flag == 1) and FileLink(r'sba_save4.csv.feather')  # Kaggle only

In [None]:
cols_to_drop = ['LoanNr_ChkDgt', 'Bank', 'BankState', 'ApprovalDate', \
                        'ApprovalFY', 'ChgOffDate', 'BalanceGross', 'NAICS', 'ChgOffPrinGr', \
                        'Name', 'RevLineCr', 'DisbursementDate', 'City', 'State', 'CityState',\
                         'GrAppv','Zip']

sba_bef_2011.drop(columns=cols_to_drop, inplace=True)

sba.drop(columns=cols_to_drop, inplace=True)

sba_bef_2011 = reduce_mem_usage(sba_bef_2011)

print()
sba = reduce_mem_usage(sba)

print()
print('Unneeded Columns Dropped')
print(sba.info())

In [None]:
def SaveBef2011():
    # Save sba_bef_2011
    ## save this dataset to working dir
    sdf = sba_bef_2011.copy().reset_index(drop=True)
    sdf.to_feather(f'{savepath}sba_bef_2011.csv.feather')

    print("saved to sba_bef_2011.csv.feather")
    
SaveBef2011()

(kaggle_flag == 1) and FileLink(r'sba_bef_2011.csv.feather')  # Kaggle only

In [None]:
del sba_bef_2011
gc.collect()

In [None]:
# Save 5
def Save5():
    sdf = sba.copy().reset_index(drop=True)
    sdf.to_feather(f'{savepath}sba_save5.csv.feather')

    print('saved to sba_save5.csv.feather')

Save5()

(kaggle_flag == 1) and FileLink(r'sba_save5.csv.feather')  # Kaggle only

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Check for Infinite Values</b></div>

In [None]:
check_infinity_nan(sba,'sba')

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Check Correlations</b></div>

In [None]:
fig, ax = plt.subplots(figsize=(20,20))

g = sns.heatmap(
    sba.corr(),
    annot=True,
    ax=ax,
    cmap='OrRd',
    cbar=False,
    linewidth=1
)

g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
g.set_yticklabels(g.get_yticklabels(), rotation=45, horizontalalignment='right')

<a id="eda_check"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>2.6 EDA Check</h2><br>
    Here we generate a SweetViz report for another EDA review
    </div>

In [None]:
GetSweetVizReport(sba)

print()
(kaggle_flag == 1) and create_download_link('Open SweetViz Report in browser ---> ',\
                                            f'{savepath}SBA_sweetviz_report_after.html')

In [None]:
del sba
gc.collect()

<a id="build_model"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>3. Build Model Using XGBoost</h2>
    </div>

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 5px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Early Stopping Rounds</b></div>

"<b>Overfitting</b> is a problem with sophisticated non-linear learning algorithms like gradient boosting.  Early stopping is an approach to training complex machine learning models to avoid overfitting.
<br><br>
<b>XGBoost supports early stopping after a fixed number of iterations.</b>  In addition to specifying a metric and test dataset for evaluation in each epoch, one must specify a window of the number of epochs over which no improvement is observed. This is specified in the early_stopping_rounds parameter.
<br><br>
It is generally a good idea to select the early_stopping_rounds as a reasonable function of the total number of training epochs (10% in this case) or attempt to correspond to the period of inflection points as might be observed on plots of learning curves."
<br><br> - <a href = "https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/">Avoid Overfitting By Early Stopping With XGBoost In Python</a>
<br><br>
In the <a href="#xgboost_class">XGBoost class</a> created earlier, n_estimators = 4000, so early_stopping_rounds is set to 400 (10% of 4000) during fitting.
<br><br>
We can also check by plotting, shown in the <a href = "#early_stopping_rounds">Early Stopping Rounds section</a> near the end of this notebook, for reference.
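To make the early stopping rule concrete, here is a small pure-Python simulation of the idea: training stops once the validation loss has not improved for a given number of consecutive rounds. This is a sketch of the general technique, not XGBoost's internal code; the function names and the loss values are hypothetical.

```python
def rounds_from_percent(n_estimators, percent=10):
    """Early stopping window as a percentage of the total boosting rounds."""
    return max(1, n_estimators * percent // 100)

def simulate_early_stopping(val_losses, patience):
    """Return (best_iteration, stopped_at) for a sequence of validation losses."""
    best_loss, best_iter, stopped_at = float("inf"), -1, -1
    for i, loss in enumerate(val_losses):
        stopped_at = i
        if loss < best_loss:
            best_loss, best_iter = loss, i      # new best: reset the window
        elif i - best_iter >= patience:
            break                               # no improvement for `patience` rounds
    return best_iter, stopped_at

# 10% of 4000 estimators gives the window of 400 used in this notebook
window = rounds_from_percent(4000)

# Hypothetical validation losses with a tiny window so the effect is visible
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48, 0.49, 0.50, 0.51]
best_iter, stopped_at = simulate_early_stopping(losses, patience=3)
print(window, best_iter, stopped_at)
```

A larger window tolerates longer plateaus before stopping, at the cost of extra training rounds after the best iteration.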

<a id="model1"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>3.1 Model v1</h2>
    </div>

In [None]:
%%time

def RunModelv1():
    # Select subset of predictors
    X = pd.read_feather(f'{savepath}sba_save5.csv.feather')

    y = X.pop('MIS_Status')

    model1 = process_model(X, y)
    model1.split_data(0.7)
    model1.prep_run_model( "Metrics : Full SBA Not Oversampled")
    print()
    
RunModelv1()

<div class="alert alert-block alert-danger">  
<b>Accuracy for the model is good, but ...</b> precision, recall, and f1-score for the minority class 0 (CHGOFF) are <b>much lower</b> than for class 1 (P I F). This is because MIS_Status is heavily skewed towards 1 (P I F).</div>

<div class="alert alert-block alert-info">  
In such a scenario, <b>accuracy is not a good metric</b>, as it favors the majority class.  <b>The f1-score is a better metric</b>, as it correctly shows the poorer score for the minority class.<br><br>
    To address this, we try <b>oversampling the data</b> in the next section.</div>
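A small worked example shows why accuracy can look good while the minority class is predicted poorly. The counts below are hypothetical (they are not the notebook's results): 1000 loans, of which 900 are class 1 (P I F) and 100 are class 0 (CHGOFF).

```python
def f1_score_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical confusion counts
correct_class0 = 40    # CHGOFF loans predicted as CHGOFF
missed_class0 = 60     # CHGOFF loans predicted as P I F
correct_class1 = 880   # P I F loans predicted as P I F
missed_class1 = 20     # P I F loans predicted as CHGOFF

total = correct_class0 + missed_class0 + correct_class1 + missed_class1
accuracy = (correct_class0 + correct_class1) / total

# For class 0: false positives are class 1 loans predicted as 0, and vice versa
f1_class0 = f1_score_from_counts(tp=correct_class0, fp=missed_class1, fn=missed_class0)
f1_class1 = f1_score_from_counts(tp=correct_class1, fp=missed_class0, fn=missed_class1)

print(f"accuracy={accuracy:.2f}  f1(class 0)={f1_class0:.2f}  f1(class 1)={f1_class1:.2f}")
```

With these counts, accuracy is 92% even though more than half of the charged-off loans are missed, which is exactly the pattern the per-class f1-score exposes.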

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("SBA Machine Learning Model 1 completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

<a id="oversample"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;"><h2 style='color:GhostWhite;'>3.2 OverSample</h2>
    </div>

<a id="model2"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;"><h2 style='color:GhostWhite;'>3.2.1 Model v2</h2>
    </div>

In [None]:
%%time

def RunModelv2():
    # Select subset of predictors
    X = pd.read_feather(f'{savepath}sba_save5.csv.feather')

    # Select target
    y = X.pop('MIS_Status')

    model2 = process_model(X, y)
    model2.osample()             # oversampling method
    model2.split_data(0.7)
    model2_results = model2.prep_run_model("Metrics : Full SBA Oversampled")

    return model2_results['xg_model']

modelv2 = RunModelv2()

<div class="alert alert-block alert-info">
    After oversampling the minority class, class 0 (CHGOFF) <b>now has similar</b> precision, recall, and f1-score to class 1 (P I F).
<br><br>     
    The <b>accuracy score</b> is slightly lower than without oversampling, but the f1-scores are much better.  Accuracy should be a reasonable metric now, as the target classes are no longer imbalanced.  <b>The model now predicts the target more reliably, as shown by the improved f1-scores.</b></div>
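For readers who have not seen the oversampling method defined earlier in this notebook, the sketch below shows the basic idea of random oversampling using only the standard library: rows of the minority class are resampled with replacement until every class has as many rows as the largest one. It is an independent illustration, not the notebook's `osample` implementation; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equally sized."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        extra = rng.choices(pool, k=target - count)   # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Toy data: four class-1 rows and one class-0 row
rows = [[1], [2], [3], [4], [5]]
labels = [1, 1, 1, 1, 0]
rows_over, labels_over = random_oversample(rows, labels)
print(Counter(labels_over))
```

Because the added rows are exact copies, the same record can end up in both the training and the test split when oversampling is applied before splitting, which is worth keeping in mind when reading the test metrics.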

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("SBA Machine Learning Model 2 completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

In [None]:
# Plot feature importance

plot_features(modelv2, (10,12))

<div class="alert alert-block alert-info">
    <b>Observation</b><br>
    I was hoping to see <b>Industry</b> at a much higher position here, but apparently the incomplete data on industry had an effect.<br><br>
I also expected <b>Recession</b> to rank very high, but it is at the bottom instead.  This could be due to the <b>Recession</b> data being highly skewed towards 0 (Not Recession).<br><br>
<b>Real Estate</b> should have good importance too, but it may be highly skewed as well.</div>

In [None]:
del modelv2
gc.collect()

<a id="model3"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>3.2.2 Model v3</h2>
    <b>Build a Model Dataset Excluding Year 2011 and Above</b>

We restrict the time frame by excluding loans disbursed after 2010, because the term of a loan is frequently 5 or more years.
       </div>

In [None]:
%%time

def RunModelv3():
    X = pd.read_feather(f'{savepath}sba_bef_2011.csv.feather')

    # Select target
    y = X.pop('MIS_Status')

    model3 = process_model(X, y)
    model3.osample()
    model3.split_data(0.7)
    model3_results = model3.prep_run_model("Metrics : SBA Before 2011 Oversampled")

    return model3_results
    
model3_results = RunModelv3()

<div class="alert alert-block alert-info">
    <b>We get a similar score to Model 2.</b>  We will use this dataset as the final dataset for now.</div>

In [None]:
# Save 6 - final cleaned dataset (feather format)
def SaveFinalCsv():
    #sdf = pd.read_csv(f'{savepath}sba_bef_2011.csv')
    #sdf.to_csv(f'{savepath}sba_final.csv', index=False)
    
    src_file=f'{savepath}sba_bef_2011.csv.feather'
    dst_file=f'{savepath}sba_final.csv.feather'
    shutil.copy2(src_file, dst_file)

    print('saved to sba_final.csv.feather')

SaveFinalCsv()
(kaggle_flag == 1) and FileLink(r'sba_final.csv.feather')  # Kaggle only

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("SBA Machine Learning Model 3 completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

In [None]:
# Plot feature importance
modelv3 = model3_results['xg_model']
plot_features(modelv3, (10,14))

<a id="test_model"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>4. Test Model</h2>
    </div>
    

<a id="test_test_dataset"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>4.1 Test Model with Test Dataset</h2>
    Test Dataset was previously unseen by the model.
    </div>

In [None]:
def Modelv3WithTestData():
    X_test = model3_results['X_test']
    y_test = model3_results['y_test']
    
    # Get predictions
    predictions = modelv3.predict(X_test)
    model_eval(y_test, predictions);
    
Modelv3WithTestData()

<a id="test_user_input"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>4.2 Test Model with User Input</h2>
    </div>

<div class="alert alert-block alert-info">So let's assume the following are <b>the entries of a user</b>, through a user interface, looking for a prediction from our model.</div>

In [None]:
def UserInputTest():
    # 16 entries
    user_input =   {'Term':50, 
                    'NoEmp':0,
                    'NewExist':1,
                    'CreateJob':0 ,          
                    'RetainedJob':0,         
                    'FranchiseCode':1,       
                    'UrbanRural':0,           
                    'LowDoc':0,               
                    'DisbursementGross':50000,                 
                    'SBA_Appv':25000,          
                    'Industry':71, 
                    'Recession':0,
                    'RealEstate':0,           
                    'SBA_Portion':50,
                    'City':'EVANSVILLE',
                    'State':'IN'
                   }

    city = user_input['City']
    state = user_input['State']
    city_state = f'{city}_{state}'

    state_hash = int(hashlib.sha1(('State' + "_" + \
                              str(state)).encode('utf-8')).hexdigest(), 16) % 900000
    city_state_hash = int(hashlib.sha1(('CityState' + "_" + \
                              str(city_state)).encode('utf-8')).hexdigest(), 16) % 900000

    print(f'State_hash = {state_hash}')
    print(f'CityState_hash = {city_state_hash}')

    user_input.pop('City')
    user_input.pop('State')
    user_input['State_hash'] = state_hash
    user_input['CityState_hash'] = city_state_hash

    user_input_list = list(user_input.values())
    
    return {'user_input':user_input, 'user_input_list':user_input_list}

user_input_param = UserInputTest()

print()
print(f"{color.bold}User Entry:{color.end}")
user_input_param['user_input']

In [None]:
# User Input test 1
def UserInputTest1():
    features = np.array([user_input_param['user_input_list']])   

    # using inputs to predict the output
    pred = modelv3.predict(features)
    if pred[0] == 1:
        print(f'{color.bdblue}Prediction: Approve The Loan{color.end}')
    else:
        print(f'{color.bdred}Prediction: Do Not Approve The Loan{color.end}')
        
UserInputTest1()

In [None]:
# User Input test 2
def UserInputTest2():
    '''
    # if one wants to edit the list from the previous cell
    user_input2_list = user_input_list[:]   # make a copy
    user_input2_list[0] = 500          # change term 
    '''

    user_input2 = copy.deepcopy(user_input_param['user_input'])
    user_input2['Term'] = 500     # change term
    user_input2_list = list(user_input2.values())

    features = np.array([user_input2_list]) 

    # using inputs to predict the output
    pred = modelv3.predict(features)
    if pred[0] == 1:
        print(f'{color.bdblue}Prediction: Approve The Loan{color.end}')
    else:
        print(f'{color.bdred}Prediction: Do Not Approve The Loan{color.end}')
        
UserInputTest2()

<div style="font-family: Trebuchet MS;background-color:HoneyDew;color:Black;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;border: 5px solid CadetBlue;"><b>Predictions:</b><br>
    
- 1 -> can approve<br>
- 0 -> do not approve<br>

Of course, in real life, we would need to check further using other data (e.g. financial statements, kind of real estate, etc.) or models built on other data, if available.</div>

In [None]:
del user_input_param
gc.collect()

<a id="mutual_info"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>5. Mutual Information Scores</h2>
 "A general-purpose metric, normally used before selecting and building a model, but used here in the end, for comparison.  Mutual information is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, while correlation only detects linear relationships."
    </div>
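The difference between correlation and mutual information can be seen on a tiny example where y depends entirely on x, but not linearly. The sketch below computes both by hand for discrete values; the helper names are hypothetical, and this simple count-based estimate is only meant for illustration (it is not the estimator used by `make_mi_scores`).

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mutual_information(xs, ys):
    """Mutual information (in nats) of two discrete sequences, from raw counts."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# y = x^2 on a symmetric range: fully determined by x, yet not linear in x
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

corr = pearson(xs, ys)
mi = mutual_information(xs, ys)
print(f"correlation={corr:.3f}  mutual information={mi:.3f} nats")
```

Here the correlation is zero because the relationship is symmetric around x = 0, while the mutual information is clearly positive, since knowing x removes all uncertainty about y.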

In [None]:
%%time

def GetMIScores():
    X = pd.read_feather(f'{savepath}sba_final.csv.feather')

    # Select target
    y = X.pop('MIS_Status')

    model_mi = process_model(X, y)
    osample_xy = model_mi.osample()
    mi_scores = make_mi_scores(osample_xy['X_over'], osample_xy['y_over'])

    print()
    return mi_scores

mi_scores = GetMIScores()

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("SBA Mutual Information completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

In [None]:
plt.figure(dpi=1200, figsize=(8, 5))
plot_mi_scores(mi_scores)

In [None]:
# Plot feature importance
plot_features(modelv3, (10,14))

In [None]:
del mi_scores
gc.collect()

<div class="alert alert-block alert-info">
The importance rankings produced by the <b>Mutual Information</b> and <b>XGBoost Feature Importance</b> metrics differ.  Which ranking do you think is more reasonable?</div>

<a id="trim_dataset"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>6. Trim Dataset</h2><br>
After the preprocessing and encoding steps, not all of the features may be useful in forecasting loan default. Alternatively, we can select the <b>top 5 or top 8 features</b>, based on the rankings above, that contribute most to forecasting loan defaults.<br><br>

If the model performance is similar in both cases (using all the features versus using only 5 to 8 features), then we should use only the top 8 features, in order to keep the model simpler and more efficient.

The idea is to have a less complex model without compromising on the overall model performance.
</div>
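Selecting the top features from a ranking can be done programmatically rather than by reading them off a plot. The sketch below assumes the scores are available as a name-to-score mapping; the function name `top_k_features` and the score values are hypothetical and are not the notebook's actual results.

```python
def top_k_features(scores, k):
    """Return the names of the k highest-scoring features, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical importance scores for a handful of features
example_scores = {
    "Term": 0.40,
    "SBA_Appv": 0.20,
    "DisbursementGross": 0.15,
    "Recession": 0.01,
}

selected = top_k_features(example_scores, k=2)
print(selected)
```

The resulting list can then be used to subset the feature frame, in the same way the hand-written feature lists are used in the next cell.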

In [None]:
X = pd.read_feather(f'{savepath}sba_final.csv.feather')
print(X.shape)

# Select target
y = X.pop('MIS_Status')

#Let's retain the top 8 from Mutual Information metric 
mi_features = ['Term', 'DisbursementGross', 'SBA_Appv', 'SBA_Portion',\
                'CityState_hash', 'FranchiseCode', 'RealEstate', 'UrbanRural']

Xmi = X[mi_features]

#Let's retain the top 8 from Feature Importance metric 
fi_features = ['Term', 'SBA_Appv', 'DisbursementGross', 'CityState_hash', 'State_hash',\
                'SBA_Portion', 'Industry', 'NoEmp']

Xfi = X[fi_features]

In [None]:
%%time

def ModelMI():
    model_mi = process_model(Xmi, y)
    model_mi.osample()
    model_mi.split_data(0.7)
    model_mi_results = model_mi.prep_run_model("Mutual Information Metrics")

    print()
    return model_mi_results

model_mi_results = ModelMI()

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("Trimmed Dataset by Mutual Information completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

In [None]:
# Plot feature importance of the model trained on the Mutual Information feature set
my_model_mi = model_mi_results['xg_model']
plot_features(my_model_mi, (10,5))

In [None]:
# Test with Unseen test data
def MI_Model_On_Test_Data():
    X_test = model_mi_results['X_test']
    X_test_mi = X_test[mi_features]

    y_test = model_mi_results['y_test']

    predictions_mi = my_model_mi.predict(X_test_mi)
    model_eval(y_test, predictions_mi)
    print()
    
MI_Model_On_Test_Data()

In [None]:
%%time

def ModelFI():
    model_fi = process_model(Xfi, y)
    model_fi.osample()
    model_fi.split_data(0.7)
    model_fi_results = model_fi.prep_run_model("Feature Importance Metrics")

    print()
    return model_fi_results
    
model_fi_results = ModelFI()

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("Trimmed Dataset by Feature Importance completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

In [None]:
# Plot feature importance
my_model_fi = model_fi_results['xg_model']
plot_features(my_model_fi, (10,5))

In [None]:
# Test with Unseen test data
def FI_Model_On_Test_Data():
    X_test = model_fi_results['X_test']
    X_test_fi = X_test[fi_features]

    y_test = model_fi_results['y_test']

    predictions_fi = my_model_fi.predict(X_test_fi)
    model_eval(y_test, predictions_fi)
    print()
    
FI_Model_On_Test_Data()

In [None]:
del X, y, Xmi, Xfi, mi_features, fi_features, my_model_mi, my_model_fi
del model_mi_results, model_fi_results
gc.collect()

<a id="results1"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>7. Full or Trimmed Dataset</h2>
</div>

<div class="alert alert-block alert-info">
    <b>Do we select the full dataset, or the trimmed dataset ?</b><br><br>
    <b>Observation:</b><br>
    <ul>
        <li><b>Accuracy</b> - The trimmed datasets score approximately 2 points lower in accuracy than the full features dataset.<br><br></li>
        <li><b>f1-score</b> - The Mutual Information trimmed dataset also scores approximately 2 points lower in f1-score than the full features dataset.  The Feature Importance trimmed dataset scores approximately 1 point lower than the full features dataset.<br><br></li>
    </ul>
    We can <b>stick with the full features</b> for now; but the trimmed features are also good, with the <b>Mutual Information trimmed dataset</b> very slightly favored.</div>

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:
        engine.say("SBA Machine Learning completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

<a id="part3"></a>
<div style="font-family: Trebuchet MS;background-color:DarkRed;color:AliceBlue;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
<h1 style='color:GhostWhite;'>Part 3. XGBoost HyperParameter Tuning using Optuna</h1>
</div>

<a id="find_best_hp"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>3.1 Find The Best HyperParameter Combination</h2>
</div>

In [None]:
%%time

# For running Optuna study on full dataset
def OptunaStudy():

    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')
    
    # instantiate the optuna_tuning class
    ot = optuna_tuning(X, y)
    ot.osample()
    ot.split_data(0.7)
    
    del X, y
    gc.collect()

    ''' 
    Pass the number of trials or timeout in seconds to the run_optuna_trials method. 
    Example : 
        run_optuna_trials(n_trials = 50)          # number of trials
      or
        time_to_run = 60 * 60                     # 1 hour in seconds 
        run_optuna_trials(timeout = time_to_run)  # timeout, in seconds
    '''

    study_results = ot.run_optuna_trials(n_trials=2, batch_size=batch_size)

    print()
    return study_results

if optuna_flag == 1 or optuna_flag == 3:
    study_results = OptunaStudy()

In [None]:
# this cell is to get the batch size if running optuna in batches 
def GetAllowedBatchSizes():
    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')
    
    # instantiate the optuna_tuning class
    ot = optuna_tuning(X, y)
    ot.osample()
    ot_df = ot.split_data(0.7)
    
    size = len(ot_df['X_train'])
    print()
    print(f'Size of X_train: {size}')
    
    print()
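    batch_size = 1
    # Keep the largest divisor of size found below the upper bound of the range
    #for i in range(1, int(size/2)):        # stops below size/2 (split into 4 for this dataset; varies by dataset)
    for i in range(1, size):                # allows size/2 (split into 2 for this dataset; varies by dataset)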
        if (size % i) == 0:
            batch_size = i
            print(f'{color.bdblue}{i}', end=' ')
    print()
    print(f'Suggested Batch Size = {batch_size}')
    return batch_size

if optuna_flag == 2 or optuna_flag == 3:
    sugg_batch_size = GetAllowedBatchSizes()
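The batch size search above tests every integer below the training set size. For large training sets, the same divisors can be collected by only scanning up to the square root, since divisors come in pairs. This is an optional sketch with a hypothetical helper name, `proper_divisors`; the example size is made up.

```python
def proper_divisors(n):
    """Return all divisors of n smaller than n, in ascending order."""
    small, large = [], []
    i = 1
    while i * i <= n:
        if n % i == 0:
            small.append(i)
            if i != n // i:
                large.append(n // i)   # the paired divisor above sqrt(n)
        i += 1
    divisors = small + large[::-1]
    return [d for d in divisors if d != n]

# Hypothetical training set size
allowed = proper_divisors(12)
suggested = max(allowed, default=1)    # largest batch size that divides the data evenly
print(allowed, suggested)
```

Using `max(..., default=1)` keeps the fallback of 1 that the original loop has when no proper divisor is found.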

In [None]:
%%time

# For running Optuna tuning incrementally in batches, much slower, but lighter on memory
def OptunaStudyBatch():
    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')

    X = X.values   # convert to numpy array
    y = y.values   # convert to numpy array
    
    # instantiate the optuna_tuning class
    otb = optuna_tuning_batch(X, y)
    otb.osample()
    otb.split_data(0.7)
    
    del X, y
    gc.collect()

    ''' 
    Pass the number of trials or timeout in seconds to the run_optuna_trials method. 
    Example : 
        run_optuna_trials(n_trials = 50)          # number of trials
      or
        time_to_run = 60 * 60                     # 1 hour in seconds 
        run_optuna_trials(timeout = time_to_run)  # timeout, in seconds
    '''

    study_results = otb.run_optuna_trials(n_trials=50, batch_size = sugg_batch_size)

    print()
    return study_results

if optuna_flag == 2 or optuna_flag == 3:
    study_results = OptunaStudyBatch()

In [None]:
best_trial = study_results.best_trial.params
best_trial.update({'n_estimators': 4000, 'tree_method':tree_method})
best_trial

In [None]:
# Trial results dataframe sorted from best value (RMSE) ascending
def ViewResultsAsDf():
    stdf = study_results.trials_dataframe()
    stdf = stdf.sort_values('value',ascending=True)

    return stdf.head(2)    # return here is only used for printing output

ViewResultsAsDf()

In [None]:
#Visualize parameter importance
optuna.visualization.plot_param_importances(study_results)

In [None]:
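# sugg_batch_size only exists when the batch run was enabled (optuna_flag 2 or 3)
if 'sugg_batch_size' in globals():
    del sugg_batch_size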
gc.collect()

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("Optuna run completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

<a id="try_best_hp"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>3.2 Model v4 : Try the Optuna Hyperparameters</h2>
</div>

In [None]:
%%time

def RunModelv4():
    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')

    model4 = process_model(X, y)
    model4.osample()
    model4.split_data(0.7)
    model4_results = model4.prep_run_model("Metrics : After Optuna Tuning", \
                                       hyperparams = best_trial)
    return model4_results

model4_results = RunModelv4()

In [None]:
if alert_flag == 1:
    if kaggle_flag == 0:   # not Kaggle
        engine.say("Model Test with Optuna completed.")
        engine.runAndWait()
    else:
        display(Audio(url=audio_path, autoplay=True))

<a id="optuna_results"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>3.3 Optuna Tuning Results</h2>
</div>

In [None]:
def CompareResults():
    m3_clf_report = model3_results['eval_results']['ClassificationReport']

    m3_0_f1_score = round(m3_clf_report['0']['f1-score'] * 100, 2)
    m3_1_f1_score = round(m3_clf_report['1']['f1-score'] * 100, 2)
    m3_accuracy   = round(m3_clf_report['accuracy'] * 100, 2)


    m4_clf_report = model4_results['eval_results']['ClassificationReport']

    m4_0_f1_score = round(m4_clf_report['0']['f1-score'] * 100, 2)
    m4_1_f1_score = round(m4_clf_report['1']['f1-score'] * 100, 2)
    m4_accuracy   = round(m4_clf_report['accuracy'] * 100, 2)


    data = {'Model v3 : No Optuna':[m3_0_f1_score, m3_1_f1_score, m3_accuracy],
            'Model v4 : With Optuna':[m4_0_f1_score, m4_1_f1_score, m4_accuracy]}
 
    # Creates pandas DataFrame.
    df = pd.DataFrame(data, index =['0 : f1_score',
                                    '1 : f1_score',
                                    'Accuracy'])
    print(f'{color.bdgreen}\
    Accuracy Improvement Using Optuna Suggested Parameters: {round(m4_accuracy - m3_accuracy,2)}\
    {color.end}')
    return df

CompareResults()

In [None]:
# do not run this if you still want to do more work with model 3 and model 4 information
del modelv3, model3_results, model4_results, best_trial, study_results
gc.collect()

<div class="alert alert-block alert-info">
    <b>Observation:</b><br><br>
    <b>The accuracy and F1 scores improve after Optuna tuning,</b> but the outcome depends on which hyperparameters and value ranges are supplied to the study, so a few tuning sessions may be needed.<br><br>
    The <a style="color:DarkSlateGrey" href="#pl_run">Pipeline</a> scores higher because it uses a hyperparameter set obtained from a different Optuna run.
  </div>
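
The dependence on the supplied value ranges can be illustrated with a toy random search. This is only a sketch under stated assumptions: `toy_loss` is a hypothetical stand-in for a validation loss (it is not the notebook's objective), and both search spaces are invented for illustration. No search can find a value that its search space excludes.

```python
import random

def toy_loss(learning_rate, max_depth):
    # Hypothetical loss surface with its minimum at learning_rate=0.05, max_depth=15
    return (learning_rate - 0.05) ** 2 * 100 + (max_depth - 15) ** 2 * 0.001

def random_search(space, n_trials, seed=48):
    """Sample n_trials combinations from `space` and return (best_loss, best_params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        loss = toy_loss(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

# Search space that excludes the good region
narrow_space = {"learning_rate": [0.3, 0.5], "max_depth": [3, 5]}
# Search space that covers the good region
wide_space = {"learning_rate": [0.01, 0.05, 0.1], "max_depth": [5, 10, 15]}

narrow_best = random_search(narrow_space, n_trials=20)
wide_best = random_search(wide_space, n_trials=20)
print("Narrow space best:", narrow_best)
print("Wide space best:  ", wide_best)
```

Every combination in the wider space scores better than the best combination in the narrow one, so the second search wins regardless of how many trials are run.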

<a id="cross_validation"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>Cross Validation</h2><br>
Measures the model's quality as RMSE using 5-fold cross-validation.  Cross-validation is best suited to small datasets, but is included here for reference.
</div>
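
When RMSE is computed from a classifier's predicted 0/1 labels, each squared error is either 0 or 1, so the mean squared error equals the misclassification rate and RMSE equals the square root of (1 - accuracy). The sketch below checks this on a small set of made-up labels; the data is illustrative and does not come from the SBA dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative labels only: 2 of the 10 predictions are wrong
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

print(f"Accuracy           : {accuracy(y_true, y_pred):.4f}")
print(f"RMSE               : {rmse(y_true, y_pred):.4f}")
print(f"sqrt(1 - accuracy) : {math.sqrt(1 - accuracy(y_true, y_pred)):.4f}")
```

Under this relationship, a lower RMSE on hard labels simply means a higher accuracy on the same fold.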

In [None]:
%%time

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

def CrossVal():
    print(f'{color.bold}Please wait, this will take some time{color.end}')
    print()
    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')

    model_cv = process_model(X, y)   # create object from XGBoost class
    model_cv.osample()  # oversample
    model_cv_df = model_cv.split_data(0.7)

    del X, y
    gc.collect()
    
    print()
    X_train, y_train = model_cv_df['X_train'], model_cv_df['y_train']
    X_valid, X_test = model_cv_df['X_valid'], model_cv_df['X_test']
    y_valid, y_test = model_cv_df['y_valid'], model_cv_df['y_test']
    
    # these hyperparameters came from the PipeLine, a result from Optuna tuning
    hyperparams = { 'alpha': 0.0046540057600720115,
                    'colsample_bytree': 0.5,
                    'lambda': 0.10810295148897421,
                    'learning_rate': 0.05,
                    'max_depth': 15,
                    'min_child_weight': 1,
                    'random_state': 48,
                    'subsample': 0.8,
                    'n_estimators': 4000,
                    'tree_method': tree_method}

    xgb_model = XGBClassifier(**hyperparams, use_label_encoder = False)

    # If we pass a pipeline instead of a model to cross_val_score, fit_params won't be 
    # recognized
    fit_params={'early_stopping_rounds': 400, 
                'eval_metric': ['logloss'],
                'verbose': False,
                'eval_set': [(X_valid, y_valid)]
               }

    # Multiply by -1 since sklearn calculates *negative* RMSE
    print()
    scores = -1 * cross_val_score(xgb_model, X_train, y_train,
                                  cv=5,
                                  scoring='neg_root_mean_squared_error',
                                  fit_params = fit_params,
                                  verbose=15)
    print()
    print(f"{color.bdblue}Scores: {scores}{color.end}")
    print()
    print(f"{color.bdgreen}Root Mean Squared Error (Mean): {scores.mean()}{color.end}")
    print()
    
CrossVal()

<a id="part4"></a>
<div style="font-family: Trebuchet MS;background-color:DarkRed;color:AliceBlue;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
<h1 id="part1" style='color:GhostWhite;'>Part 4. Miscellaneous</h1>
</div>

<a id="early_stopping_rounds"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>4.1 Early Stopping Rounds</h2>
</div>

<div class="alert alert-block alert-info">
    The plots below serve as a reference for choosing a value for XGBoost's early_stopping_rounds during fitting.</div>

In [None]:
%%time

from matplotlib import pyplot

def PlotEarlyStoppingRounds():
    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')
    
    model_esr = process_model(X, y)
    model_esr.osample()     # oversample
    model_esr_df = model_esr.split_data(0.7)
    
    del X, y
    gc.collect()
    
    print(f"{color.bold}Please wait, Fitting model can take time ...{color.end}")
            
    hyperparams = { 'alpha': 0.0046540057600720115,
                    'colsample_bytree': 0.5,
                    'lambda': 0.10810295148897421,
                    'learning_rate': 0.05,
                    'max_depth': 15,
                    'min_child_weight': 1,
                    'random_state': 48,
                    'subsample': 0.8,
                    'n_estimators': 4000,
                    'tree_method': tree_method}

    xg_model = XGBClassifier(**hyperparams,use_label_encoder =False)
       
    eval_setparam = [(model_esr_df['X_train'], model_esr_df['y_train']),\
                     (model_esr_df['X_valid'], model_esr_df['y_valid'])]
       
    # fit the model without specifying early_stopping_rounds
    xg_model.fit(model_esr_df['X_train'], model_esr_df['y_train'], 
                eval_metric=['error','logloss'],
                eval_set = eval_setparam,
                verbose=False)
 
    print("Fitting model completed.")
    print()
    print('Preparing Predictions')
    
    # Get predictions
    predictions = xg_model.predict(model_esr_df['X_valid'])
    
    print()
    print(f'{color.underline}Metrics:{color.end}')

    eval_results = model_eval(model_esr_df['y_valid'], predictions)

    # retrieve performance metrics
    results = xg_model.evals_result()
    epochs = len(results['validation_0']['error'])
    x_axis = range(0, epochs)

    # look for the point where the validation curve bottoms out
    
    # plot log loss
    fig, ax = pyplot.subplots()
    ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
    # validation_1 is the second eval_set entry, i.e. the validation split
    ax.plot(x_axis, results['validation_1']['logloss'], label='Validation')
    ax.legend()
    pyplot.xlabel('Epoch')
    pyplot.ylabel('Log Loss')
    pyplot.title('XGBoost Log Loss')
    pyplot.show()

    # plot classification error
    fig, ax = pyplot.subplots()
    ax.plot(x_axis, results['validation_0']['error'], label='Train')
    ax.plot(x_axis, results['validation_1']['error'], label='Validation')
    ax.legend()
    pyplot.xlabel('Epoch')
    pyplot.ylabel('Classification Error')
    pyplot.title('XGBoost Classification Error')
    pyplot.show()
    
PlotEarlyStoppingRounds()

<div class="alert alert-block alert-info">
    Both plots suggest that 400, which is 10% of the 4000 n_estimators, is a good candidate for the early_stopping_rounds parameter.
</div>
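
The idea behind early stopping can be sketched in plain Python: training stops once the validation metric has gone a set number of rounds without improving. The function `early_stop_round` and the loss curve below are hypothetical illustrations, not XGBoost's internal code and not values from this notebook.

```python
def early_stop_round(valid_losses, patience):
    """Return (best_round, stop_round) for a validation loss curve.

    Training stops at the first round where the loss has not improved
    for `patience` consecutive rounds; otherwise it runs to the end.
    """
    best_loss = float("inf")
    best_round = 0
    for round_idx, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(valid_losses) - 1

# Illustrative curve: the loss bottoms out at round 3, then rises again
valid_losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.49]

print(early_stop_round(valid_losses, patience=3))    # stops 3 rounds after the best round
print(early_stop_round(valid_losses, patience=10))   # patience too large, runs to the end
```

A small patience stops soon after the curve bottoms out, while a patience larger than the remaining rounds never triggers, which is why the value is chosen relative to n_estimators.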

<a id="random_forest_classifier"></a>
<div style="font-family: Trebuchet MS;background-color:DarkCyan;color:Azure;text-align: left;padding-top: 5px;padding-bottom: 20px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <h2 style='color:GhostWhite;'>4.2 Random Forest Classifier</h2><br>
</div>

<div class="alert alert-block alert-info">
    This is just a reference on using a Random Forest Classifier.
  </div>

In [26]:
%%time

# run before tuning
def RunModelrf():
    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')

    modelrf = other_models(X, y)
    modelrf.osample()
    modelrf.split_data(0.7)
    modelrf.prep_run_model("Metrics : Random Forest Classifier", modelname='rfc')

RunModelrf()

MIS_Status Count ->  1 : 714212, 0 : 154451
X size :  868663
y size :  868663
Before Oversampling -> 1 : 714212, 0 : 154451
After Oversampling  -> 1 : 714212, 0 : 714212

[1m[4mShapes Before And After Splitting Dataset :[0m
X (1428424, 16)   y (1428424,)
X_train (999896, 16)   y_train (999896,)
X_valid (214264, 16)   y_valid (214264,)
X_test (214264, 16)   y_test (214264,)

Please wait, Fitting model can take time ...


KeyboardInterrupt: 

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Optuna Tuning for Random Forest</b>
</div>

<div class="alert alert-block alert-info">  
    This is just a sample implementation, for reference.</div>

In [None]:
%%time

# Define an objective function to be maximized.
def objective_rf(trial, X_train, y_train, X_valid, y_valid):
    nn_max_depth = trial.suggest_categorical("max_depth", [5,10,15])
    nn_estimators = trial.suggest_categorical('n_estimators', [100,250,500,750,1000])
    
    rf_obj = RandomForestClassifier(max_depth = nn_max_depth,
                                    n_estimators = nn_estimators,
                                    warm_start = True)
    
    if GetRam() >= 95:
        raise MemoryError('Short On Memory')
    print(f"Ram Used Before Trial : {GetRam()} %")
            
    ## Fit Model
    rf_obj.fit(X_train, y_train)

    # Report the validation accuracy as the intermediate objective value.
    # The forest is fitted once per trial, so there is a single step to report.
    intermediate_value = rf_obj.score(X_valid, y_valid)
    trial.report(intermediate_value, step=1)
    
    gc.collect()
        
    # Handle pruning based on the intermediate value.
    if trial.should_prune():
        raise optuna.TrialPruned()
    
    print(f"Ram Used After Trial : {GetRam()} %")
    return intermediate_value

def RandomForestOptunaTuning():
    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')
    print(X.info())

    rfo = process_model(X, y)
    rfo.osample()  # oversample
    rfo_df = rfo.split_data(0.7)
    
    X_train, y_train = rfo_df['X_train'], rfo_df['y_train']
    X_valid, y_valid = rfo_df['X_valid'], rfo_df['y_valid']
    
    print()
    
    del X, y, rfo_df
    gc.collect()
    
    print(f'{color.bold}Please wait, this will take time{color.end}')
    
    study = optuna.create_study(direction="maximize")
    
    try:
        study.optimize(lambda trial: objective_rf(trial,
                        X_train, y_train,
                        X_valid, y_valid),
                        n_trials = 20,
                        catch = (RuntimeWarning,ArithmeticError,))
    except MemoryError as e:
        print(f'{color.bdblue}{e} : Memory was getting low, Trial ended early{color.end}')
            
    # Calculating the pruned and completed trials
    pruned_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED]
    complete_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]

    print("  Number of finished trials: ", len(study.trials))
    print("  Number of pruned trials: ", len(pruned_trials))
    print("  Number of complete trials: ", len(complete_trials))
    
    return study

study_results = RandomForestOptunaTuning()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 868663 entries, 0 to 868662
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Term               868663 non-null  int16  
 1   NoEmp              868663 non-null  int16  
 2   NewExist           868663 non-null  int8   
 3   CreateJob          868663 non-null  int16  
 4   RetainedJob        868663 non-null  int16  
 5   FranchiseCode      868663 non-null  int32  
 6   UrbanRural         868663 non-null  int8   
 7   LowDoc             868663 non-null  int8   
 8   DisbursementGross  868663 non-null  float32
 9   SBA_Appv           868663 non-null  float32
 10  Industry           868663 non-null  int8   
 11  Recession          868663 non-null  int8   
 12  RealEstate         868663 non-null  int8   
 13  SBA_Portion        868663 non-null  float32
 14  State_hash         868663 non-null  int32  
 15  CityState_hash     868663 non-null  int32  
dtypes:

[I 2022-04-06 22:27:42,651] A new study created in memory with name: no-name-5660a0dd-a8ee-49ac-bfbb-b510a8a5f256



[1m[4mShapes Before And After Splitting Dataset :[0m
X (1428424, 16)   y (1428424,)
X_train (999896, 16)   y_train (999896,)
X_valid (214264, 16)   y_valid (214264,)
X_test (214264, 16)   y_test (214264,)

Please wait, this will take time
Ram Used Before Trial : 47.7 %


[I 2022-04-06 22:36:43,703] Trial 0 finished with value: 0.8877832953739312 and parameters: {'max_depth': 10, 'n_estimators': 500}. Best is trial 0 with value: 0.8877832953739312.


Ram Used After Trial : 45.9 %
Ram Used Before Trial : 45.7 %


In [39]:
pprint(study_results.best_params) # Get best parameters for the objective function.
print()
pprint(study_results.best_value)  # Get best objective value.
print()
pprint(study_results.best_trial)  # Get best trial's information.

{'max_depth': 15}

0.9121877683605272

FrozenTrial(number=1, values=[0.9121877683605272], datetime_start=datetime.datetime(2022, 4, 6, 20, 44, 38, 756559), datetime_complete=datetime.datetime(2022, 4, 6, 20, 47, 37, 815340), params={'max_depth': 15}, distributions={'max_depth': CategoricalDistribution(choices=(5, 10, 15))}, user_attrs={}, system_attrs={}, intermediate_values={1: 0.9117817272150245, 2: 0.9121877683605272}, trial_id=1, state=TrialState.COMPLETE, value=None)


In [None]:
best_trial = study_results.best_trial.params
best_trial

In [None]:
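# Trial results dataframe sorted from best value (accuracy) descending.
# ViewResultsAsDf() sorts ascending, which suits a minimized objective; this study maximizes accuracy.
study_results.trials_dataframe().sort_values('value', ascending=False).head(2)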

In [None]:
#Visualize parameter importance
optuna.visualization.plot_param_importances(study_results)

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>Random Forest Score With Optuna Hyperparameters</b>
</div>

In [None]:
%%time

def RunModelrf2():
    X = pd.read_feather(final_ds)
    y = X.pop('MIS_Status')

    modelrf = other_models(X, y)
    modelrf.osample()
    modelrf.split_data(0.7)
    modelrf.prep_run_model("Metrics : Random Forest Classifier", \
                            modelname='rfc', hparams = best_trial)

RunModelrf2()

In [None]:
del study_results, best_trial
gc.collect()

<div style="font-family: Trebuchet MS;background-color:LightSteelBlue;color:Black;text-align: left;padding-top: 5px;padding-bottom: 15px;padding-left: 20px;padding-right: 10px;border-radius: 15px 50px;letter-spacing: 2px;">
    <b>RandomizedSearchCV</b>
</div>

<div class="alert alert-block alert-info">
    Below is a reference on using <b>RandomizedSearchCV</b> as a first pass for Random Forest hyperparameter tuning.<br><br>
  The random search narrows down promising parameter values, which can then serve as the inputs for a full <b>GridSearchCV</b> (not shown here).
    <br><br>
    Both approaches take an <b>extremely long time to run</b> on our SBA dataset, so the line that runs the task is commented out.  Uncomment it if you want to try.  Otherwise, <b>Optuna</b> is a much faster method.</div>
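
To give a sense of the cost difference, the sketch below counts model fits for the random grid and search settings used later in this section (3 folds, 5 sampled combinations). It assumes the candidate counts listed in the comments and excludes the final refit on the best parameters; it is plain arithmetic, not a call into scikit-learn.

```python
import math

# Number of candidate values per hyperparameter in this section's random grid
candidate_counts = {
    "n_estimators": 3,        # 500, 1250, 2000
    "max_features": 2,
    "max_depth": 5,           # 6, 9, 12, 15 and None
    "min_samples_split": 3,   # 2, 5, 10
    "min_samples_leaf": 3,    # 1, 2, 4
    "bootstrap": 2,           # True, False
}
cv_folds = 3   # cv = 3 in the RandomizedSearchCV call
n_iter = 5     # n_iter = 5 in the RandomizedSearchCV call

n_combinations = math.prod(candidate_counts.values())
grid_search_fits = n_combinations * cv_folds    # exhaustive search over every combination
random_search_fits = n_iter * cv_folds          # only the sampled combinations

print(f"Combinations in the grid : {n_combinations}")
print(f"GridSearchCV fits        : {grid_search_fits}")
print(f"RandomizedSearchCV fits  : {random_search_fits}")
```

With these settings the exhaustive grid needs 1620 fits against 15 for the random search, each on roughly one million training rows, which is why the grid search is not run here.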

In [None]:
def ViewDefaultRFCParams():
    rf = RandomForestClassifier(random_state = 48)
    # Look at parameters used by our current forest
    print('Default parameters in use:\n')
    pprint(rf.get_params())

ViewDefaultRFCParams()

In [None]:
from sklearn.model_selection import RandomizedSearchCV

def SuggestRFCParams():
    # Number of trees in random forest
    n_estimators = [int(x) for x in np.linspace(start = 500, stop = 2000, num = 3)]
    # Number of features to consider at every split
    # ('auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3)
    max_features = ['sqrt', 'log2']
    # Maximum number of levels in tree
    max_depth = [int(x) for x in np.linspace(6, 15, num = 4)]
    max_depth.append(None)
    # Minimum number of samples required to split a node
    min_samples_split = [2, 5, 10]
    # Minimum number of samples required at each leaf node
    min_samples_leaf = [1, 2, 4]
    # Method of selecting samples for training each tree
    bootstrap = [True, False]
    
    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}
    pprint(random_grid)
    return random_grid

random_grid = SuggestRFCParams()

In [None]:
%%time

def RandomSearchCV(random_grid):
    # Use the random grid to search for best hyperparameters
    # First create the base model to tune
    rf = RandomForestClassifier()

    # Random search of parameters, using 3 fold cross validation, 
    # search across 5 different combinations (n_iter); pass n_jobs = -1 to use all available cores
    rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, \
                                   n_iter = 5, cv = 3, verbose=10, random_state=48)

    # Fit the random search model
    rf_random.fit(modelrf_results['X_train'], modelrf_results['y_train'])
    
    return rf_random.best_params_

#rf_best_params = RandomSearchCV(random_grid)
#rf_best_params

In [None]:
del random_grid #,rf_best_params 
gc.collect()