# Intro to the Dataset and the Aim
<img src="Designer.jpeg" alt="logo banner" style="width: 800px;"/>

**Problem Statement**: An online platform offering customized loan products is facing challenges in efficiently assessing the creditworthiness of new loan applicants. By predicting the likelihood of default, the company aims to minimize risk and improve its decision-making process for loan approvals.

**Objective**: The goal is to develop a machine learning model that can predict whether an applicant will default on a personal loan, based on their financial and credit history attributes. The model should help make data-driven decisions, reducing the overall risk of default.

**Dataset Overview**: See the [credit risk dataset on Kaggle](https://www.kaggle.com/datasets/ranadeep/credit-risk-dataset) or the data dictionary in `data/LCDataDictionary.xlsx`.

**Aim**

1. To analyze which factors are critical in determining whether a borrower will default on a personal loan.
2. To develop a predictive model that estimates the likelihood of loan default based on borrower attributes.
3. To ensure the model is interpretable so that we can understand the key drivers of defaults.

**Methods and Techniques used:** EDA, feature engineering, modeling with sklearn pipelines, hyperparameter tuning with Optuna

**Measure of Performance and Minimum Threshold to reach the business objective**: Since both recall and precision matter, we maximize the F1 score, targeting an F1 score above 75% and a recall above 80%.
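
As a minimal sketch of how these two targets could be checked, the snippet below computes precision, recall and F1 from scratch in plain Python. The labels are toy values, the helper names are hypothetical, and it assumes the defaulter class is encoded as `1`.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def meets_business_target(y_true, y_pred, min_f1=0.75, min_recall=0.80):
    """True when both the F1 and the recall targets are reached."""
    _, recall, f1 = precision_recall_f1(y_true, y_pred)
    return f1 >= min_f1 and recall >= min_recall


# Toy labels (1 = defaulter), made up for illustration only
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("meets target:", meets_business_target(y_true, y_pred))
```

Checking recall separately matters because a model can reach a high F1 through precision alone while still missing too many defaulters.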

**Assumptions**
* The dataset is assumed to be representative of the entire customer base.
* The data remains stable over time, and thus, the model is assumed not to decay rapidly.
* External factors (e.g., economic downturns) are not considered, though they could influence loan repayment behavior.

## Library Setup

In [1]:
# Scientific libraries
import numpy as np
import pandas as pd


# Visual libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Helper libraries
from tqdm.notebook import tqdm, trange # Progress bar
import warnings 
#warnings.filterwarnings('ignore') # ignore all warnings

# Reload edited modules automatically (otherwise .py changes are not reflected until the kernel restarts)
#%load_ext autoreload
#%autoreload 2

# Visual setup
%config InlineBackend.figure_format = 'retina' # sets the figure format to 'retina' for high-resolution displays.

# Pandas options
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all' # display all interaction 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 15)

# Table styles
table_styles = {
    'cerulean_palette': [
        dict(selector="th", props=[("color", "#FFFFFF"), ("background", "#004D80")]),
        dict(selector="td", props=[("color", "#333333")]),
        dict(selector="table", props=[("font-family", 'Arial'), ("border-collapse", "collapse")]),
        dict(selector='tr:nth-child(even)', props=[('background', '#D3EEFF')]),
        dict(selector='tr:nth-child(odd)', props=[('background', '#FFFFFF')]),
        dict(selector="th", props=[("border", "1px solid #0070BA")]),
        dict(selector="td", props=[("border", "1px solid #0070BA")]),
        dict(selector="tr:hover", props=[("background", "#80D0FF")]),
        dict(selector="tr", props=[("transition", "background 0.5s ease")]),
        dict(selector="th:hover", props=[("font-size", "1.07rem")]),
        dict(selector="th", props=[("transition", "font-size 0.5s ease-in-out")]),
        dict(selector="td:hover", props=[('font-size', '1.07rem'),('font-weight', 'bold')]),
        dict(selector="td", props=[("transition", "font-size 0.5s ease-in-out")])
    ]
}


import sys
import os
# Get the path to the parent directory
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..')) # root dir of project

# Add the parent directory to the system path (can import from anywhere)
if parent_dir not in sys.path:
    sys.path.append(parent_dir)
    
from Prediction_Model import config # custom config data for this project
    


# Seed value for numpy.random => makes notebooks stable across runs
np.random.seed(config.RANDOM_SEED)

## Data Ingestion and Preparation

* `int_rate`, `issue_d`, `installment` and similar fields are only known after a loan is approved, so they are removed from the dataset to avoid data leakage
* `earliest_cr_line` is not used directly because absolute date values are not useful and can mislead the model; instead, a relative feature called `age_of_credit` is created
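
To make the idea of a relative date concrete, here is a small sketch that assumes `age_of_credit` is the number of whole months between `earliest_cr_line` and `issue_d`. The actual definition lives in the feature engineering pipeline and may differ; the function name is hypothetical.

```python
from datetime import date


def age_of_credit_months(issue_d: date, earliest_cr_line: date) -> int:
    """Whole months between the earliest credit line and the loan issue date."""
    return (issue_d.year - earliest_cr_line.year) * 12 + (issue_d.month - earliest_cr_line.month)


# Dates of the first loan record shown in this notebook
age = age_of_credit_months(date(2011, 12, 1), date(1985, 1, 1))
print(age)  # 323
```

A borrower with the same credit history length gets the same value regardless of the calendar year of the application, which is what makes the feature more stable than the raw date.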

In [2]:
from Prediction_Model import data_handling as dh
df = dh.load_data_and_sanitize('loan.csv')
# Drop features that are not available at the time of loan application
df = df[['loan_amnt', 'term', 'grade', 'sub_grade',
       'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'verification_status', 'issue_d', 'loan_status', 'purpose', 'title',
       'dti', 'earliest_cr_line', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'application_type',
       'zip_code','addr_state','delinq_2yrs', 'inq_last_6mths', 'last_pymnt_amnt',
       'collections_12_mths_ex_med',
       'mths_since_last_delinq','mths_since_last_major_derog', 'mths_since_last_record',
        'open_acc_6m', 'open_il_6m', 'open_il_12m',
        'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m',
         'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',
         'tot_coll_amt', 'tot_cur_bal']]

# Map defaults and non defaults
df = df[df['loan_status'].isin(['Charged Off', 'Fully Paid','Does not meet the credit policy. Status:Fully Paid','Default','Does not meet the credit policy. Status:Charged Off'])]
df['loan_status'] = df['loan_status'].map({'Charged Off':'defaulter',
                         'Fully Paid':'non defaulter',
                         'Default':'defaulter',
                         'Does not meet the credit policy. Status:Charged Off':'defaulter',
                         'Does not meet the credit policy. Status:Fully Paid':'non defaulter'})
# Date format conversion
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')
display(df.head(10).style.set_table_styles(table_styles['cerulean_palette']).set_caption("DF"))
df.info()
df.describe()



Unnamed: 0,loan_amnt,term,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,purpose,title,dti,earliest_cr_line,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,zip_code,addr_state,delinq_2yrs,inq_last_6mths,last_pymnt_amnt,collections_12_mths_ex_med,mths_since_last_delinq,mths_since_last_major_derog,mths_since_last_record,open_acc_6m,open_il_6m,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,tot_coll_amt,tot_cur_bal
0,5000.0,36 months,B,B2,,10+ years,RENT,24000.0,Verified,2011-12-01 00:00:00,non defaulter,credit_card,Computer,27.65,1985-01-01 00:00:00,3.0,0.0,13648.0,83.7,9.0,f,INDIVIDUAL,860xx,AZ,0.0,1.0,171.62,0.0,,,,,,,,,,,,,,,,,,,,
1,2500.0,60 months,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,2011-12-01 00:00:00,defaulter,car,bike,1.0,1999-04-01 00:00:00,3.0,0.0,1687.0,9.4,4.0,f,INDIVIDUAL,309xx,GA,0.0,5.0,119.66,0.0,,,,,,,,,,,,,,,,,,,,
2,2400.0,36 months,C,C5,,10+ years,RENT,12252.0,Not Verified,2011-12-01 00:00:00,non defaulter,small_business,real estate business,8.72,2001-11-01 00:00:00,2.0,0.0,2956.0,98.5,10.0,f,INDIVIDUAL,606xx,IL,0.0,2.0,649.91,0.0,,,,,,,,,,,,,,,,,,,,
3,10000.0,36 months,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,2011-12-01 00:00:00,non defaulter,other,personel,20.0,1996-02-01 00:00:00,10.0,0.0,5598.0,21.0,37.0,f,INDIVIDUAL,917xx,CA,0.0,1.0,357.48,0.0,35.0,,,,,,,,,,,,,,,,,,,
5,5000.0,36 months,A,A4,Veolia Transportaton,3 years,RENT,36000.0,Source Verified,2011-12-01 00:00:00,non defaulter,wedding,My wedding loan I promise to pay back,11.2,2004-11-01 00:00:00,9.0,0.0,7963.0,28.3,12.0,f,INDIVIDUAL,852xx,AZ,0.0,3.0,161.03,0.0,,,,,,,,,,,,,,,,,,,,
7,3000.0,36 months,E,E1,MKC Accounting,9 years,RENT,48000.0,Source Verified,2011-12-01 00:00:00,non defaulter,car,Car Downpayment,5.35,2007-01-01 00:00:00,4.0,0.0,8221.0,87.5,4.0,f,INDIVIDUAL,900xx,CA,0.0,2.0,111.34,0.0,,,,,,,,,,,,,,,,,,,,
8,5600.0,60 months,F,F2,,4 years,OWN,40000.0,Source Verified,2011-12-01 00:00:00,defaulter,small_business,Expand Business & Buy Debt Portfolio,5.55,2004-04-01 00:00:00,11.0,0.0,5210.0,32.6,13.0,f,INDIVIDUAL,958xx,CA,0.0,2.0,152.39,0.0,,,,,,,,,,,,,,,,,,,,
9,5375.0,60 months,B,B5,Starbucks,< 1 year,RENT,15000.0,Verified,2011-12-01 00:00:00,defaulter,other,Building my credit history.,18.08,2004-09-01 00:00:00,2.0,0.0,9279.0,36.5,3.0,f,INDIVIDUAL,774xx,TX,0.0,0.0,121.45,0.0,,,,,,,,,,,,,,,,,,,,
10,6500.0,60 months,C,C3,Southwest Rural metro,5 years,OWN,72000.0,Not Verified,2011-12-01 00:00:00,non defaulter,debt_consolidation,High intrest Consolidation,16.12,1998-01-01 00:00:00,14.0,0.0,4032.0,20.6,23.0,f,INDIVIDUAL,853xx,AZ,0.0,2.0,1655.54,0.0,,,,,,,,,,,,,,,,,,,,
11,12000.0,36 months,B,B5,UCLA,10+ years,OWN,75000.0,Source Verified,2011-12-01 00:00:00,non defaulter,debt_consolidation,Consolidation,10.78,1989-10-01 00:00:00,12.0,0.0,23336.0,67.1,34.0,f,INDIVIDUAL,913xx,CA,0.0,0.0,6315.3,0.0,,,,,,,,,,,,,,,,,,,,


<class 'pandas.core.frame.DataFrame'>
Index: 256939 entries, 0 to 887371
Data columns (total 48 columns):
 #   Column                       Non-Null Count   Dtype         
---  ------                       --------------   -----         
 0   loan_amnt                    256939 non-null  float64       
 1   term                         256939 non-null  object        
 2   grade                        256939 non-null  object        
 3   sub_grade                    256939 non-null  object        
 4   emp_title                    242770 non-null  object        
 5   emp_length                   246937 non-null  object        
 6   home_ownership               256939 non-null  object        
 7   annual_inc                   256935 non-null  float64       
 8   verification_status          256939 non-null  object        
 9   issue_d                      256939 non-null  datetime64[ns]
 10  loan_status                  256939 non-null  object        
 11  purpose                      25

Unnamed: 0,loan_amnt,annual_inc,issue_d,dti,earliest_cr_line,open_acc,pub_rec,revol_bal,revol_util,total_acc,delinq_2yrs,inq_last_6mths,last_pymnt_amnt,collections_12_mths_ex_med,mths_since_last_delinq,mths_since_last_major_derog,mths_since_last_record,open_acc_6m,open_il_6m,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,tot_coll_amt,tot_cur_bal
count,256939.0,256935.0,256939,256939.0,256910,256910.0,256910.0,256939.0,256699.0,256910.0,256910.0,256910.0,256939.0,256794.0,114294.0,47876.0,32674.0,144.0,144.0,144.0,144.0,140.0,144.0,126.0,144.0,144.0,144.0,144.0,190464.0,144.0,144.0,144.0,190464.0,190464.0
mean,13522.11595,72498.85,2013-04-19 22:26:39.480032256,16.534986,1998-02-01 03:42:42.388385024,10.935016,0.143354,15301.2,54.315684,25.011732,0.250411,0.887821,6381.922019,0.006702,35.059259,43.395501,74.381588,1.395833,3.076389,0.881944,1.868056,20.714286,36462.3125,73.657937,1.673611,3.555556,5517.340278,58.945139,29694.85,1.256944,2.097222,2.625,203.3825,138160.5
min,500.0,1896.0,2007-06-01 00:00:00,0.0,1946-01-01 00:00:00,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.9,0.0,0.0,0.0,-4.0,0.0,0.0
25%,7200.0,45000.0,2012-08-01 00:00:00,10.74,1994-09-01 00:00:00,7.0,0.0,5833.0,36.2,16.0,0.0,0.0,476.07,0.0,17.0,26.0,54.0,0.0,1.0,0.0,1.0,4.0,10763.75,63.35,0.0,2.0,2091.25,46.825,13300.0,0.0,0.0,1.0,0.0,28355.75
50%,12000.0,62000.0,2013-07-01 00:00:00,16.2,1999-06-01 00:00:00,10.0,0.0,10918.0,55.8,23.0,0.0,1.0,3818.99,0.0,32.0,43.0,78.0,1.0,2.0,0.0,1.0,12.5,23605.0,77.05,1.0,3.0,4511.5,61.95,22300.0,1.0,0.0,2.0,0.0,80760.5
75%,18200.0,87000.0,2014-05-01 00:00:00,21.99,2002-10-01 00:00:00,14.0,0.0,19083.5,73.9,32.0,0.0,1.0,9931.39,0.0,51.0,60.0,101.0,2.0,4.0,1.0,3.0,21.0,50835.75,88.25,2.0,5.0,8212.75,74.225,36800.0,1.0,2.0,4.0,0.0,207990.5
max,35000.0,8706582.0,2015-12-01 00:00:00,57.14,2012-10-01 00:00:00,76.0,15.0,1746716.0,892.3,150.0,29.0,33.0,36475.59,6.0,152.0,159.0,129.0,6.0,18.0,6.0,14.0,141.0,249212.0,129.2,8.0,21.0,22279.0,102.8,2013133.0,9.0,21.0,19.0,9152545.0,8000078.0
std,8128.811481,58900.43,,7.793541,,4.902947,0.436027,19708.8,24.827559,11.7792,0.742431,1.158745,7342.716238,0.088812,21.861345,21.625591,31.054394,1.349534,3.182293,1.214692,2.056021,27.6708,38137.563875,23.074812,1.629406,2.932284,4507.604776,20.356957,29499.8,1.792551,3.903234,4.057067,21035.5,152328.4


Add a new location feature (`lat`, `lng`) from an additional dataset, joined on the 3-digit zip code prefix
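
The prefix logic can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's pandas code: the coordinates are made up and the helper names are hypothetical. It shows why zero-padding is needed when zip codes were read as integers.

```python
from collections import defaultdict
from statistics import mean


def zip3(zip_code) -> str:
    """3-digit prefix of a 5-digit ZIP, restoring leading zeros lost when read as int."""
    return str(zip_code).zfill(5)[:3]


def mean_coords_by_prefix(rows):
    """rows: iterable of (zip, lat, lng). Returns {prefix: (mean_lat, mean_lng)}."""
    grouped = defaultdict(list)
    for zip_code, lat, lng in rows:
        grouped[zip3(zip_code)].append((lat, lng))
    return {
        prefix: (mean(lat for lat, _ in pts), mean(lng for _, lng in pts))
        for prefix, pts in grouped.items()
    }


# Hypothetical coordinates, for illustration only
rows = [(1001, 42.0, -72.0), (1002, 44.0, -74.0), (85001, 33.0, -112.0)]
coords = mean_coords_by_prefix(rows)

# Loan records carry masked ZIPs such as '010xx'; the first three characters are the join key
loan_zip = "010xx"
lat, lng = coords[loan_zip[:3]]
print(lat, lng)  # 43.0 -73.0
```

Because the join in the notebook is an inner merge, loans whose prefix has no match in the coordinate table are dropped.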

In [3]:
loc = dh.load_data_and_sanitize('US_zip_to_cord.csv')
loc['zip']=loc['zip'].astype(str)
loc['zip']=loc['zip'].str[:-2]
# loc['zip']= (3-loc['zip'].str.len())*'0'+str(loc['zip'])
loc['zip'] = loc['zip'].map(lambda x: (3 - len(x)) * '0' + x) # add zeros prefix
loc_grouped = loc.groupby('zip').agg({'lat':'mean','lng':'mean'}).reset_index()
loc_grouped
df['zip_code'] = df['zip_code'].str[:3]
df = df.merge(loc_grouped, left_on='zip_code', right_on='zip', how='inner')

df.drop(columns=['zip'], inplace=True)
df
dh.save_data(df,'loan_reduced.csv') # final data saved

Unnamed: 0,zip,lat,lng
0,005,40.815400,-73.045100
1,010,42.264533,-72.571779
2,011,42.124445,-72.570877
3,012,42.347011,-73.226954
4,013,42.594145,-72.576003
...,...,...,...
921,995,60.099946,-155.202000
922,996,60.231311,-156.693148
923,997,65.719588,-153.315713
924,998,58.020365,-134.862400


Unnamed: 0,loan_amnt,term,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,purpose,title,dti,earliest_cr_line,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,zip_code,addr_state,delinq_2yrs,inq_last_6mths,last_pymnt_amnt,collections_12_mths_ex_med,mths_since_last_delinq,mths_since_last_major_derog,mths_since_last_record,open_acc_6m,open_il_6m,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,tot_coll_amt,tot_cur_bal,lat,lng
0,5000.0,36 months,B,B2,,10+ years,RENT,24000.0,Verified,2011-12-01,non defaulter,credit_card,Computer,27.65,1985-01-01,3.0,0.0,13648.0,83.7,9.0,f,INDIVIDUAL,860,AZ,0.0,1.0,171.62,0.0,,,,,,,,,,,,,,,,,,,,,35.697570,-111.216738
1,2500.0,60 months,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,2011-12-01,defaulter,car,bike,1.00,1999-04-01,3.0,0.0,1687.0,9.4,4.0,f,INDIVIDUAL,309,GA,0.0,5.0,119.66,0.0,,,,,,,,,,,,,,,,,,,,,33.432962,-82.075146
2,2400.0,36 months,C,C5,,10+ years,RENT,12252.0,Not Verified,2011-12-01,non defaulter,small_business,real estate business,8.72,2001-11-01,2.0,0.0,2956.0,98.5,10.0,f,INDIVIDUAL,606,IL,0.0,2.0,649.91,0.0,,,,,,,,,,,,,,,,,,,,,41.854723,-87.675914
3,10000.0,36 months,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,2011-12-01,non defaulter,other,personel,20.00,1996-02-01,10.0,0.0,5598.0,21.0,37.0,f,INDIVIDUAL,917,CA,0.0,1.0,357.48,0.0,35.0,,,,,,,,,,,,,,,,,,,,34.070175,-117.849067
4,5000.0,36 months,A,A4,Veolia Transportaton,3 years,RENT,36000.0,Source Verified,2011-12-01,non defaulter,wedding,My wedding loan I promise to pay back,11.20,2004-11-01,9.0,0.0,7963.0,28.3,12.0,f,INDIVIDUAL,852,AZ,0.0,3.0,161.03,0.0,,,,,,,,,,,,,,,,,,,,,33.438611,-111.822242
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256923,4200.0,36 months,D,D2,supervisor,10+ years,MORTGAGE,48000.0,Verified,2015-01-01,defaulter,medical,Medical expenses,36.93,1990-08-01,13.0,0.0,12943.0,63.4,45.0,f,INDIVIDUAL,810,CO,0.0,0.0,147.64,0.0,38.0,38.0,,,,,,,,,,,,,20400.0,,,,0.0,207975.0,37.913517,-103.808178
256924,10775.0,36 months,A,A1,Coordinator of RSVP,< 1 year,RENT,54000.0,Not Verified,2015-01-01,non defaulter,debt_consolidation,Debt consolidation,13.22,1975-11-01,9.0,0.0,10776.0,25.8,21.0,w,INDIVIDUAL,325,FL,1.0,0.0,9439.34,0.0,16.0,28.0,,,,,,,,,,,,,41700.0,,,,0.0,24696.0,30.535767,-86.992024
256925,6225.0,36 months,D,D3,Painter,2 years,RENT,27000.0,Source Verified,2015-01-01,non defaulter,debt_consolidation,Debt consolidation,18.58,2011-02-01,3.0,0.0,1756.0,97.6,4.0,f,INDIVIDUAL,330,FL,0.0,1.0,4858.17,0.0,,,,,,,,,,,,,,,1800.0,,,,0.0,8357.0,25.774810,-80.409199
256926,4000.0,36 months,B,B1,Lead Custodian,10+ years,MORTGAGE,50000.0,Verified,2015-01-01,non defaulter,car,Car financing,12.63,2002-09-01,11.0,1.0,1700.0,5.6,30.0,f,INDIVIDUAL,956,CA,0.0,0.0,3655.51,0.0,,,84.0,,,,,,,,,,,,30100.0,,,,0.0,18979.0,38.635924,-121.275410


# EDA

In [4]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
# profile.to_notebook_iframe() # uncomment to render the report inside the notebook
profile.to_file("EDA_report.html") # writes the HTML report to the current directory


## Test data 

In [11]:
from sklearn.model_selection import train_test_split
X = df.drop('loan_status', axis=1)
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40, stratify=y) # shuffles by default; stratify keeps the class ratio equal in both splits
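
To illustrate what a stratified split guarantees, the sketch below splits each class separately in plain Python. It is a simplified stand-in written for this explanation, not scikit-learn's algorithm, and the class counts are made up.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=40):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class before cutting
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced target: 20% defaulters
labels = ["defaulter"] * 20 + ["non defaulter"] * 80
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Both splits keep the 20/80 class ratio, so metrics computed on the test set are not distorted by an unlucky draw of the minority class.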

# Feature Engineering
* The pipeline below is the result of repeated iterations of feature improvement, feature construction and finally feature selection
* All candidate pipelines were evaluated using the confusion matrix, and the best one for the business objective was chosen
* `ExtraTreesClassifier` is used for feature evaluation because it is the least computationally expensive

In [12]:
from Prediction_Model  import config
from Prediction_Model.FE_pipeline import target_pipeline,selected_FE_with_FS
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

from Prediction_Model.evaluation import  tune_model_threshold_adjustment # custom helper function

In [13]:
from Prediction_Model.FE_pipeline import selected_FE_with_FS
from Prediction_Model import get_features

# get_features.perform_feature_engineering(n_trials=50)

fe_pipe = dh.load_pipeline('fe_pipeline_fitted')
eval_model = dh.load_pipeline('fe_eval_model')
X_train_transformed = fe_pipe.transform(X_train)
config.POST_FE_FEATURES=X_train_transformed.columns
feature_importance_df = pd.DataFrame({
    'feature': config.POST_FE_FEATURES,
    'importance': eval_model.feature_importances_
}).sort_values('importance', ascending=False, key=abs)

feature_importance_df.style.set_table_styles(table_styles['cerulean_palette']).set_caption("ExtraTrees Feature Importance")

FileNotFoundError: [Errno 2] No such file or directory: '/home/jyothisable/Resources/Coding/Data Science/Scalar Projects/LoanTap-Credit-Default-Risk-Model/Prediction_Model/trained_models/fe_pipeline_fitted.pkl'

# Model Selection and Training

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

LR_with_FE = LogisticRegression()


LR_with_FE_CV = GridSearchCV(
    estimator=LR_with_FE,
    param_grid={
        'C': [0.01],
        'solver': ['saga'],
        'penalty': ['elasticnet'],
        'class_weight': ['balanced'],
        'l1_ratio': [0.6],
        'max_iter': [1000],
        'warm_start': [True]
    },
    scoring='f1',
    cv=3,
    n_jobs=config.N_JOBS,
    verbose=True
)
y_train_transformed = target_pipeline.transform(y_train)
y_test_transformed = target_pipeline.transform(y_test)
X_test_transformed = fe_pipe.transform(X_test)
LR_with_FE_CV.fit(X_train_transformed, y_train_transformed)
LR_best_model = LR_with_FE_CV.best_estimator_

In [None]:
# Finding the performance of best model
y_pred=LR_best_model.predict(X_test_transformed) # do FE and then predict (X_test)
print(classification_report(y_test_transformed, y_pred))

### Threshold Adjustment for Logistic Regression

In [None]:
# Post tuning of selected best model (threshold adjustment as per business requirements)
lr_tuned, report = tune_model_threshold_adjustment(LR_best_model, X_train_transformed, y_train, X_test_transformed, y_test,scoring='f1',target_pipeline=target_pipeline)
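
The idea behind threshold adjustment can be sketched as a simple sweep over candidate cut-offs. This is a hypothetical illustration and does not reproduce the internals of `tune_model_threshold_adjustment`; the probabilities and labels are toy values, and `1` is assumed to mean defaulter.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, proba, thresholds=None):
    """Return (threshold, f1) for the cut-off that maximizes F1."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(5, 100, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        y_pred = [1 if p >= t else 0 for p in proba]
        score = f1_score_binary(y_true, y_pred)
        if score > best_f1:  # keep the first threshold reaching the best score
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy predicted probabilities of default, for illustration only
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]

threshold, f1 = best_threshold(y_true, proba)
print(f"best threshold={threshold:.2f}, f1={f1:.3f}")
```

In this toy data, lowering the cut-off from the default 0.5 recovers one missed defaulter at the cost of one false alarm, which raises F1. In practice the threshold should be chosen on validation data rather than on the final test set.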

In [None]:
# Feature Importance
feature_importance_df = pd.DataFrame({
    'feature': config.POST_FE_FEATURES,
    'importance': LR_best_model.coef_[0]
}).sort_values('importance', ascending=False, key=abs)

feature_importance_df.style.set_table_styles(table_styles['cerulean_palette']).set_caption("LR Feature Importance")

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

RF_with_FE = RandomForestClassifier()


RF_with_FE_CV = GridSearchCV(
    estimator=RF_with_FE,
    param_grid={
        'n_estimators': [120],
        'max_depth': [5],
        'min_samples_split': [9],
        'min_samples_leaf': [2],
        'criterion': ['entropy'],
        'warm_start': [True],
    },
    scoring='f1',
    cv=3,
    n_jobs=config.N_JOBS,
    verbose=True
)

RF_with_FE_CV.fit(X_train_transformed, y_train_transformed)
RF_best_model = RF_with_FE_CV.best_estimator_

In [None]:
# Post tuning of selected best model (threshold adjustment as per business requirements)
tune_model_threshold_adjustment(RF_best_model, X_train_transformed, y_train, X_test_transformed, y_test,scoring='f1',target_pipeline=target_pipeline)

In [None]:
# Feature Importance
feature_importance_df = pd.DataFrame({
    'feature': config.POST_FE_FEATURES,
    'importance': RF_best_model.feature_importances_
}).sort_values('importance', ascending=False, key=abs)

feature_importance_df.style.set_table_styles(table_styles['cerulean_palette']).set_caption("RF Feature Importance")

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

GBDT_with_FE = GradientBoostingClassifier()

GBDT_with_FE_CV = GridSearchCV(
    estimator=GBDT_with_FE,
    param_grid={
        'n_estimators': [100],
        'learning_rate': [0.3],
        'subsample': [0.8],
        'max_depth': [5],
        'min_samples_split': [10],
        'min_samples_leaf': [4],
        'warm_start': [True],
    },
    scoring='f1',
    cv=3,
    n_jobs=config.N_JOBS,
    verbose=True
)

GBDT_with_FE_CV.fit(X_train_transformed, y_train_transformed)
GBDT_best_model = GBDT_with_FE_CV.best_estimator_

In [None]:
# Post tuning of selected best model (threshold adjustment as per business requirements)
tune_model_threshold_adjustment(GBDT_best_model, X_train_transformed, y_train, X_test_transformed, y_test,scoring='f1',target_pipeline=target_pipeline)

In [None]:
# Feature Importance DF
feature_importance_df = pd.DataFrame({
    'feature': config.POST_FE_FEATURES,
    'importance': GBDT_best_model.feature_importances_
}).sort_values('importance', ascending=False, key=abs)

feature_importance_df.style.set_table_styles(table_styles['cerulean_palette']).set_caption("GBDT Feature Importance")

## XGBoost

In [None]:
from xgboost import XGBClassifier

XGB_with_FE = XGBClassifier()

XGB_with_FE_CV = GridSearchCV(
    estimator=XGB_with_FE,
    param_grid={
        'max_depth': [3], 
        'learning_rate': [0.15],
        'n_estimators': [300], 
        'gamma': [0], 
        'subsample': [0.95], 
        'colsample_bytree': [0.95], 
        'lambda': [0.1],
        'tree_method': ["hist"],
        'eval_metric': ["aucpr"]
    },
    scoring='f1',
    cv=3,
    n_jobs=8,
    verbose=True
)

XGB_with_FE_CV.fit(X_train_transformed, y_train_transformed)
XGB_best_model = XGB_with_FE_CV.best_estimator_

In [None]:
# Post tuning of selected best model (threshold adjustment as per business requirements)
tune_model_threshold_adjustment(XGB_best_model, X_train_transformed, y_train, X_test_transformed, y_test,scoring='f1',target_pipeline=target_pipeline)

In [None]:
# Feature Importance DF
feature_importance_df = pd.DataFrame({
    'feature': config.POST_FE_FEATURES,
    'importance': XGB_best_model.feature_importances_
}).sort_values('importance', ascending=False, key=abs)

feature_importance_df.style.set_table_styles(table_styles['cerulean_palette']).set_caption("XGB Feature Importance")

# Conclusion
* The best models are the boosting models (GBDT and XGB), with an F1 score above 0.79 and the lowest training and test times.
* The most important feature is `zip_code` (from the applicant's address), followed by `grade` and `term`.

In [None]:
# throw error to stop run all in notebook
raise SystemExit

# Other scripts

Run MLFlow UI in browser

In [None]:
# Run MLFlow UI in browser - default serve port is 5000
!poetry run mlflow ui # localhost:5000

Run experiment from MLFlow project file

In [None]:
# Run MLFlow from MLProject file
!poetry run mlflow run . --experiment-name 'Model Optuna Optimization'

Serve APIs with MLFlow

In [None]:
# serve any model
!poetry run mlflow models serve -m ./mlruns/962701371541841446/ff8f948e9838413f9ea1d5c956fda683/artifacts/model --port 5002 --no-conda # localhost:5002

# serve models from registry
!poetry run mlflow models serve -m "models:/XGB prediction model@best" --port 5002 --no-conda # localhost:5002
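
Once a model is served, it can be queried over HTTP. The sketch below only builds the request, assuming an MLflow 2.x scoring server that accepts the `dataframe_split` JSON format on `/invocations`. The field names and values are placeholders; the columns the served model actually expects are not shown here and are an assumption.

```python
import json
import urllib.request


def build_invocation_request(records, host="http://localhost:5002"):
    """Build a POST request for an MLflow scoring server from a list of row dicts."""
    columns = list(records[0].keys())
    payload = {
        "dataframe_split": {
            "columns": columns,
            "data": [[row[col] for col in columns] for row in records],
        }
    }
    return urllib.request.Request(
        url=f"{host}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder applicant record; real requests need every column the model was trained on
request = build_invocation_request([{"loan_amnt": 5000.0, "term": "36 months", "grade": "B"}])
print(request.full_url)

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```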

Serve API via FastAPI

In [None]:
!poetry run python fastapi_app.py # POST to localhost:8000/predict; open /docs in a browser for documentation

Run streamlit app

Live: https://loantap-loan-prediction.streamlit.app/

In [None]:
!poetry run streamlit run streamlit_app.py # local

Run Flask app

In [None]:
!poetry run python flask_app.py # localhost:8080

Docker script

In [None]:
!docker buildx build --tag loantap_api2 . # build from root of repo
!docker run -p 8000:8000 loantap_api2 # run locally