# Intro to the Dataset and the Aim
<img src="loantap_logo.png" alt="LoanTap logo banner" style="width: 800px;"/>

**Problem Statement**: LoanTap, an online platform offering customized loan products, is facing challenges in efficiently assessing the creditworthiness of loan applicants. By predicting the likelihood of default, the company aims to minimize risks and improve the decision-making process for loan approvals.

**Objective**: The goal is to develop a machine learning model that can predict whether an applicant will default on a personal loan, based on their financial and credit history attributes. The model should help LoanTap make data-driven decisions, reducing the overall risk of default.

**Dataset Overview**: LoanTap has provided a dataset containing various financial and credit-related features for loan applicants. Below is a summary of the dataset:

| Column               | Description                                                        |
|----------------------|--------------------------------------------------------------------|
| loan_amnt            | The loan amount applied for by the borrower                        |
| term                 | Loan term in months (36 or 60)                                     |
| int_rate             | Interest rate on the loan                                          |
| installment          | Monthly payment owed if the loan originates                        |
| grade                | LoanTap assigned grade                                             |
| sub_grade            | LoanTap assigned subgrade                                          |
| emp_title            | Job title supplied by the borrower                                 |
| emp_length           | Employment length in years (0-10)                                  |
| home_ownership       | Home ownership status                                              |
| annual_inc           | Self-reported annual income                                        |
| verification_status  | Income verification status (verified/not verified)                 |
| issue_d              | Date the loan was funded                                           |
| loan_status          | Target variable: current loan status (Fully Paid or Charged Off)   |
| purpose              | Purpose of the loan                                                |
| dti                  | Debt-to-income ratio                                               |
| earliest_cr_line     | Month the borrower’s earliest credit line was opened               |
| open_acc             | Number of open credit lines                                        |
| pub_rec              | Number of derogatory public records                                |
| revol_bal            | Total revolving credit balance                                     |
| revol_util           | Revolving line utilization rate                                    |
| total_acc            | Total number of credit lines                                       |
| initial_list_status  | The initial listing status of the loan. Possible values are – W, F |
| title                | Loan title supplied by the borrower                                |
| application_type     | Individual or joint application                                    |
| mort_acc             | Number of mortgage accounts                                        |
| pub_rec_bankruptcies | Number of public record bankruptcies                               |
| address              | Address of the individual                                          |

**Aim**

1. To analyze which factors are critical in determining whether a borrower will default on a personal loan.
2. To develop a predictive model that estimates the likelihood of loan default based on borrower attributes.
3. To ensure the model is interpretable so that LoanTap can understand the key drivers of default.

**Methods and Techniques used:** EDA, feature engineering, modeling using sklearn pipelines, hyperparameter tuning

**Measure of Performance and Minimum Threshold to reach the business objective**: Recall > 90% and precision > 70%
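
The thresholds above can be checked programmatically once predictions are available. The following is a minimal sketch, assuming the positive class (default) is encoded as 1; the helper name `meets_business_thresholds` and the toy labels are hypothetical and not part of the project code.

```python
from sklearn.metrics import precision_score, recall_score


def meets_business_thresholds(y_true, y_pred, min_recall=0.90, min_precision=0.70):
    """Return (recall, precision, ok) for the positive (default) class."""
    recall = recall_score(y_true, y_pred, pos_label=1)
    precision = precision_score(y_true, y_pred, pos_label=1)
    ok = bool(recall > min_recall and precision > min_precision)
    return recall, precision, ok


# Toy labels for illustration only (not taken from the LoanTap data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

recall, precision, ok = meets_business_thresholds(y_true, y_pred)
print(f"recall={recall:.2f}, precision={precision:.2f}, meets target={ok}")
```

Both metrics are computed for the default class only, since that is the class the business objective is about.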

**Assumptions**
* The dataset is assumed to be representative of LoanTap’s entire customer base.
* The data distribution is assumed to remain stable over time, so the model is not expected to decay rapidly.
* External factors (e.g., economic downturns) are not considered, though they could influence loan repayment behavior.

## Library Setup

In [79]:
# Scientific libraries
import numpy as np
import pandas as pd

# Logging
import logging

# Visual libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Helper libraries
from tqdm.notebook import tqdm, trange # Progress bar
import warnings
# warnings.filterwarnings('ignore')  # ignore all warnings

# Reload imported modules automatically (otherwise edits to .py files are not reflected until the kernel restarts)
# %load_ext autoreload
# %autoreload 2

# Visual setup
%config InlineBackend.figure_format = 'retina' # sets the figure format to 'retina' for high-resolution displays.

# Pandas options
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'  # display the result of every expression in a cell, not only the last one
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 15)

# Table styles
table_styles = {
    'cerulean_palette': [
        dict(selector="th", props=[("color", "#FFFFFF"), ("background", "#004D80")]),
        dict(selector="td", props=[("color", "#333333")]),
        dict(selector="table", props=[("font-family", 'Arial'), ("border-collapse", "collapse")]),
        dict(selector='tr:nth-child(even)', props=[('background', '#D3EEFF')]),
        dict(selector='tr:nth-child(odd)', props=[('background', '#FFFFFF')]),
        dict(selector="th", props=[("border", "1px solid #0070BA")]),
        dict(selector="td", props=[("border", "1px solid #0070BA")]),
        dict(selector="tr:hover", props=[("background", "#80D0FF")]),
        dict(selector="tr", props=[("transition", "background 0.5s ease")]),
        dict(selector="th:hover", props=[("font-size", "1.07rem")]),
        dict(selector="th", props=[("transition", "font-size 0.5s ease-in-out")]),
        dict(selector="td:hover", props=[('font-size', '1.07rem'),('font-weight', 'bold')]),
        dict(selector="td", props=[("transition", "font-size 0.5s ease-in-out")])
    ]
}

# Seed value for numpy.random => makes notebooks stable across runs
np.random.seed(42)

## Data Ingestion

In [80]:
from loantap_credit_default_risk_model.data_processing import DataHandler

data_import = DataHandler(file_path='data/raw/logistic_regression.csv')

df = data_import.load_data()
df = data_import.sanitize(df)

df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')
display(df.head(10).style.set_table_styles(table_styles['cerulean_palette']).set_caption("DF"))
df.info()
df.describe()


Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,purpose,title,dti,earliest_cr_line,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,mort_acc,pub_rec_bankruptcies,address
0,10000.0,36 months,11.44,329.48,B,B4,Marketing,10+ years,RENT,117000.0,Not Verified,2015-01-01 00:00:00,Fully Paid,vacation,Vacation,26.24,1990-06-01 00:00:00,16.0,0.0,36369.0,41.8,25.0,w,INDIVIDUAL,0.0,0.0,"0174 Michelle Gateway Mendozaberg, OK 22690"
1,8000.0,36 months,11.99,265.68,B,B5,Credit analyst,4 years,MORTGAGE,65000.0,Not Verified,2015-01-01 00:00:00,Fully Paid,debt_consolidation,Debt consolidation,22.05,2004-07-01 00:00:00,17.0,0.0,20131.0,53.3,27.0,f,INDIVIDUAL,3.0,0.0,"1076 Carney Fort Apt. 347 Loganmouth, SD 05113"
2,15600.0,36 months,10.49,506.97,B,B3,Statistician,< 1 year,RENT,43057.0,Source Verified,2015-01-01 00:00:00,Fully Paid,credit_card,Credit card refinancing,12.79,2007-08-01 00:00:00,13.0,0.0,11987.0,92.2,26.0,f,INDIVIDUAL,0.0,0.0,"87025 Mark Dale Apt. 269 New Sabrina, WV 05113"
3,7200.0,36 months,6.49,220.65,A,A2,Client Advocate,6 years,RENT,54000.0,Not Verified,2014-11-01 00:00:00,Fully Paid,credit_card,Credit card refinancing,2.6,2006-09-01 00:00:00,6.0,0.0,5472.0,21.5,13.0,f,INDIVIDUAL,0.0,0.0,"823 Reid Ford Delacruzside, MA 00813"
4,24375.0,60 months,17.27,609.33,C,C5,Destiny Management Inc.,9 years,MORTGAGE,55000.0,Verified,2013-04-01 00:00:00,Charged Off,credit_card,Credit Card Refinance,33.95,1999-03-01 00:00:00,13.0,0.0,24584.0,69.8,43.0,f,INDIVIDUAL,1.0,0.0,"679 Luna Roads Greggshire, VA 11650"
5,20000.0,36 months,13.33,677.07,C,C3,HR Specialist,10+ years,MORTGAGE,86788.0,Verified,2015-09-01 00:00:00,Fully Paid,debt_consolidation,Debt consolidation,16.31,2005-01-01 00:00:00,8.0,0.0,25757.0,100.6,23.0,f,INDIVIDUAL,4.0,0.0,"1726 Cooper Passage Suite 129 North Deniseberg, DE 30723"
6,18000.0,36 months,5.32,542.07,A,A1,Software Development Engineer,2 years,MORTGAGE,125000.0,Source Verified,2015-09-01 00:00:00,Fully Paid,home_improvement,Home improvement,1.36,2005-08-01 00:00:00,8.0,0.0,4178.0,4.9,25.0,f,INDIVIDUAL,3.0,0.0,"1008 Erika Vista Suite 748 East Stephanie, TX 22690"
7,13000.0,36 months,11.14,426.47,B,B2,Office Depot,10+ years,RENT,46000.0,Not Verified,2012-09-01 00:00:00,Fully Paid,credit_card,No More Credit Cards,26.87,1994-09-01 00:00:00,11.0,0.0,13425.0,64.5,15.0,f,INDIVIDUAL,0.0,0.0,USCGC Nunez FPO AE 30723
8,18900.0,60 months,10.99,410.84,B,B3,Application Architect,10+ years,RENT,103000.0,Verified,2014-10-01 00:00:00,Fully Paid,debt_consolidation,Debt consolidation,12.52,1994-06-01 00:00:00,13.0,0.0,18637.0,32.9,40.0,w,INDIVIDUAL,3.0,0.0,USCGC Tran FPO AP 22690
9,26300.0,36 months,16.29,928.4,C,C5,Regado Biosciences,3 years,MORTGAGE,115000.0,Verified,2012-04-01 00:00:00,Fully Paid,debt_consolidation,Debt Consolidation,23.69,1997-12-01 00:00:00,13.0,0.0,22171.0,82.4,37.0,f,INDIVIDUAL,1.0,0.0,"3390 Luis Rue Mauricestad, VA 00813"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   loan_amnt             396030 non-null  float64       
 1   term                  396030 non-null  object        
 2   int_rate              396030 non-null  float64       
 3   installment           396030 non-null  float64       
 4   grade                 396030 non-null  object        
 5   sub_grade             396030 non-null  object        
 6   emp_title             373103 non-null  object        
 7   emp_length            377729 non-null  object        
 8   home_ownership        396030 non-null  object        
 9   annual_inc            396030 non-null  float64       
 10  verification_status   396030 non-null  object        
 11  issue_d               396030 non-null  datetime64[ns]
 12  loan_status           396030 non-null  object        
 13 

Unnamed: 0,loan_amnt,int_rate,installment,annual_inc,issue_d,dti,earliest_cr_line,open_acc,pub_rec,revol_bal,revol_util,total_acc,mort_acc,pub_rec_bankruptcies
count,396030.0,396030.0,396030.0,396030.0,396030,396030.0,396030,396030.0,396030.0,396030.0,395754.0,396030.0,358235.0,395495.0
mean,14113.888089,13.6394,431.849698,74203.18,2014-02-02 15:57:58.045602560,17.379514,1998-05-03 09:34:15.062495488,11.311153,0.178191,15844.54,53.791749,25.414744,1.813991,0.121648
min,500.0,5.32,16.08,0.0,2007-06-01 00:00:00,0.0,1944-01-01 00:00:00,0.0,0.0,0.0,0.0,2.0,0.0,0.0
25%,8000.0,10.49,250.33,45000.0,2013-05-01 00:00:00,11.28,1994-10-01 00:00:00,8.0,0.0,6025.0,35.8,17.0,0.0,0.0
50%,12000.0,13.33,375.43,64000.0,2014-04-01 00:00:00,16.91,1999-09-01 00:00:00,10.0,0.0,11181.0,54.8,24.0,1.0,0.0
75%,20000.0,16.49,567.3,90000.0,2015-03-01 00:00:00,22.98,2003-04-01 00:00:00,14.0,0.0,19620.0,72.9,32.0,3.0,0.0
max,40000.0,30.99,1533.81,8706582.0,2016-12-01 00:00:00,9999.0,2013-10-01 00:00:00,90.0,86.0,1743266.0,892.3,151.0,34.0,8.0
std,8357.441341,4.472157,250.72779,61637.62,,18.019092,,5.137649,0.530671,20591.84,24.452193,11.886991,2.14793,0.356174
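
The `issue_d` column is parsed above with an explicit `format='%b-%Y'`. The sketch below shows what that format does on a few hand-written strings. It assumes the raw date columns use a `Mon-YYYY` layout such as `Jun-1990`; the raw values are not shown in this notebook, so the sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical raw values in the assumed 'Mon-YYYY' layout
raw = pd.Series(['Jun-1990', 'Jul-2004', 'Aug-2007'])

# An explicit format skips per-element format inference and raises on malformed entries
parsed = pd.to_datetime(raw, format='%b-%Y')

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

Because the strings carry no day component, every parsed timestamp falls on the first day of the month, which matches the `00:00:00` first-of-month values in the preview.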


# EDA

In [81]:
from ydata_profiling import ProfileReport
# profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
# profile.to_notebook_iframe()

## Validations and Checks
These tests check newly ingested data for obvious errors before it is used downstream.

In [82]:
!pytest tests/data_tests.py

platform linux -- Python 3.12.6, pytest-8.3.3, pluggy-1.5.0
rootdir: /home/jyothisable/Resources/Coding/Data Science/Scalar Projects/LoanTap-Credit-Default-Risk-Model
configfile: pyproject.toml
plugins: typeguard-4.3.0
collected 4 items

tests/data_tests.py ....                                                 [100%]
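
The contents of `tests/data_tests.py` are not shown in this notebook. As a rough illustration of the kind of checks such a file might hold, here is a minimal sketch; the function names, the chosen rules, and the sample frame are all hypothetical.

```python
import pandas as pd

EXPECTED_STATUSES = {'Fully Paid', 'Charged Off'}
EXPECTED_TERMS = ['36 months', '60 months']


def has_valid_target(df: pd.DataFrame) -> bool:
    """The target column contains only the two known loan statuses."""
    return set(df['loan_status'].unique()) <= EXPECTED_STATUSES


def has_positive_loan_amounts(df: pd.DataFrame) -> bool:
    """Every loan amount is strictly positive."""
    return bool((df['loan_amnt'] > 0).all())


def has_valid_terms(df: pd.DataFrame) -> bool:
    """Loan terms are 36 or 60 months, ignoring surrounding whitespace."""
    return bool(df['term'].str.strip().isin(EXPECTED_TERMS).all())


# Tiny hand-made frame standing in for freshly ingested data
sample = pd.DataFrame({
    'loan_amnt': [10000.0, 8000.0],
    'term': [' 36 months', ' 60 months'],
    'loan_status': ['Fully Paid', 'Charged Off'],
})

for check in (has_valid_target, has_positive_loan_amounts, has_valid_terms):
    assert check(sample), f"{check.__name__} failed"
print("all checks passed")
```

Under pytest, each check would typically become a `test_*` function that receives the loaded DataFrame from a fixture.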



## Test Data
Separate the test data before visualisation to avoid data snooping bias.

In [83]:
from sklearn.model_selection import train_test_split
X = df.drop('loan_status', axis=1)
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40, stratify=y)  # train_test_split shuffles the data by default before splitting

- Validate that the train and test splits are each representative of the entire dataset.

In [84]:
# Create a profile report for the train dataset
# train_profile = ProfileReport(X_train, title="Train Dataset Profile", explorative=True)
# train_profile.to_notebook_iframe()
# Create a profile report for the test dataset
# test_profile = ProfileReport(X_test, title="Test Dataset Profile", explorative=True)
# test_profile.to_notebook_iframe()

# comparison_report = train_profile.compare(test_profile)
# comparison_report.to_notebook_iframe()
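
A lighter-weight check than a full profile comparison is to compare the class shares of the target across the splits. The sketch below uses a synthetic 80/20 target rather than the LoanTap data; the `*_demo` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target for illustration (80% / 20%)
y_demo = pd.Series(['Fully Paid'] * 800 + ['Charged Off'] * 200, name='loan_status')
X_demo = pd.DataFrame({'row_id': range(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=40, stratify=y_demo
)

# With stratification, class shares should be nearly identical in every split
shares = pd.DataFrame({
    'full': y_demo.value_counts(normalize=True),
    'train': y_tr.value_counts(normalize=True),
    'test': y_te.value_counts(normalize=True),
})
print(shares.round(3))

max_gap = (shares['train'] - shares['test']).abs().max()
print(f"largest train/test gap in class share: {max_gap:.4f}")
```

The same comparison can be applied to `y_train` and `y_test` from the split above; feature distributions would still need the profile reports or a separate check.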

# Feature Engineering
* `int_rate`, `issue_d`, and `installment` are only known after a loan is approved, so they are not used as model inputs, to avoid data leakage (`issue_d` is used only to derive `age_of_credit`).
* `emp_title` and `address` have very many distinct values, so the raw columns are dropped to avoid the curse of dimensionality; only the zip code and state are extracted from `address` (#TODO use NLP to extract features from these).
* `earliest_cr_line` is not used directly because absolute date values are not useful and can mislead the model; instead, a relative feature called `age_of_credit` (years between `earliest_cr_line` and `issue_d`) is created #TODO
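
The derived features described above can be illustrated on two rows taken from the data preview. This is a minimal sketch that mirrors the string slicing and year arithmetic used in the pipelines; the `demo` and `derived` names are hypothetical.

```python
import pandas as pd

# Two rows copied from the data preview, reduced to the relevant columns
demo = pd.DataFrame({
    'address': ['0174 Michelle Gateway Mendozaberg, OK 22690', 'USCGC Nunez FPO AE 30723'],
    'issue_d': pd.to_datetime(['2015-01-01', '2012-09-01']),
    'earliest_cr_line': pd.to_datetime(['1990-06-01', '1994-09-01']),
})

address = demo['address'].str.strip()
derived = pd.DataFrame({
    # The last five characters hold the zip code
    'zipcode': address.str.slice(-5),
    # The two characters before the zip code (and a space) hold the state code
    'state': address.str.slice(-8, -6),
    # Credit age in whole calendar years at the time the loan was issued
    'age_of_credit': demo['issue_d'].dt.year - demo['earliest_cr_line'].dt.year,
})
print(derived)
```

The slicing relies on every address ending in `SS ZZZZZ` (state code, space, five-digit zip), which holds for the rows shown in the preview but is an assumption for the rest of the data.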

In [85]:
categorical_ordinal_features = ['term', 'grade','sub_grade','emp_length', 'verification_status']

term_order = [' 36 months', ' 60 months']
grade_order = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
sub_grade_order = [grade + str(i) for grade in grade_order for i in range(1,6)]
emp_length_order = ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years']
verification_status_order = ['Not Verified', 'Verified', 'Source Verified']

order_matrix = [term_order, grade_order, sub_grade_order, emp_length_order, verification_status_order]

categorical_nominal_features = ['home_ownership','purpose','title','initial_list_status','application_type'] # one-hot encoded with a 1% minimum-frequency threshold in the pipeline below

numerical_features = ['loan_amnt','revol_util']

numerical_skewed_features = ['annual_inc','dti','open_acc', 'pub_rec','revol_bal', 'total_acc', 'mort_acc','pub_rec_bankruptcies']
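
To show how the explicit category orders are used, here is a minimal sketch of `OrdinalEncoder` applied to `grade` and `sub_grade` only. The three-row `demo` frame is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

grade_order = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
# A1..A5, B1..B5, ..., G1..G5 (35 sub-grades in ranked order)
sub_grade_order = [grade + str(i) for grade in grade_order for i in range(1, 6)]

demo = pd.DataFrame({'grade': ['A', 'C', 'G'], 'sub_grade': ['A1', 'C5', 'G5']})

# One category list per column, in the same order as the columns
encoder = OrdinalEncoder(categories=[grade_order, sub_grade_order])
codes = encoder.fit_transform(demo)
print(codes)
```

Each value is replaced by its position in the supplied list, so the encoded integers preserve the intended ranking instead of alphabetical order. A value missing from the list raises an error at fit time, which is why the lists must match the cleaned data exactly.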

## Feature Engineering Pipelines

In [86]:
from sklearn.preprocessing import (
    FunctionTransformer, PowerTransformer, MinMaxScaler, StandardScaler,
    OneHotEncoder, OrdinalEncoder, PolynomialFeatures, KBinsDiscretizer,
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectKBest, chi2
from feature_engine.imputation import EndTailImputer

numerical_skewed_pipeline = Pipeline([
    ('select_numerical_skewed_features', FunctionTransformer(lambda X: X[numerical_skewed_features])),
    ('FE_improvement_impute', SimpleImputer(strategy='median'))
])

numerical_features_pipeline = Pipeline([
    ('select_numerical_features', FunctionTransformer(lambda X: X[numerical_features])),
    ('FE_improvement_impute', SimpleImputer(strategy='mean'))
])

numerical_features_combined_pipeline = Pipeline([
    ('all_numerical',FeatureUnion([
        ('numerical_skewed_pipeline', numerical_skewed_pipeline),
        ('numerical_features_pipeline', numerical_features_pipeline)
        ])),
    ('FE_construction_binning', KBinsDiscretizer(n_bins=5,encode='ordinal',strategy='kmeans')),
    ('FE_improvement_scaling',MinMaxScaler())
])


categorical_ordinal_pipeline = Pipeline([
    ('select_categorical_ordinal_features', FunctionTransformer(lambda X: X[categorical_ordinal_features])),
    ('FE_improvement_impute', SimpleImputer(strategy='most_frequent')),
    ('FE_construction_ODE', OrdinalEncoder(categories=order_matrix))
])


all_nominal_cat = FeatureUnion([
            ('select_categorical_nominal_features', FunctionTransformer(lambda X: X[categorical_nominal_features].map(lambda x: str(x).strip().lower()))),  # DataFrame.map replaces the deprecated applymap (pandas >= 2.1)
            ('FE_construction_zipcode', FunctionTransformer(lambda X: X['address'].str.strip().str.slice(-5).to_frame('zipcode'))),
            ('FE_construction_state', FunctionTransformer(lambda X: X['address'].str.strip().str.slice(-8,-6).to_frame('state'))),
            ('FE_construction_age_of_credit', FunctionTransformer(lambda X: (X['issue_d'].dt.year - X['earliest_cr_line'].dt.year).to_frame('age_of_credit')))
        ])

categorical_nominal_pipeline = Pipeline([
    ('all_nominal_cat',all_nominal_cat),
    ('FE_improvement_impute', SimpleImputer(strategy='most_frequent')),
    ('FE_construction_OHE', OneHotEncoder(handle_unknown='infrequent_if_exist',min_frequency=0.01,sparse_output=False))
])

selected_FE = FeatureUnion([
        ('numerical_combined_pipeline',numerical_features_combined_pipeline),
        ('categorical_ordinal_pipeline', categorical_ordinal_pipeline),
        ('categorical_nominal_pipeline', categorical_nominal_pipeline)
    ])

target_pipeline = Pipeline([
    ('target_ohe',FunctionTransformer(lambda x : x.map({'Fully Paid':0,'Charged Off':1})))
])

from sklearn.feature_selection import VarianceThreshold
selected_FE_with_FS = Pipeline([
    ('feature_engineering_pipeline', selected_FE),
    ('feature_selection_pipeline',SelectKBest(k=30,score_func=chi2))
])


## Feature Evaluation

In [87]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,RocCurveDisplay,PrecisionRecallDisplay
import time

def simple_grid_search(x_train, y_train, x_test, y_test, feature_engineering_pipeline,target_pipeline):
    '''
    Simple helper that grid searches an ExtraTreesClassifier model and
    prints a classification report for the best parameter set.
    "Best" here means the highest cross-validated recall on the training set
    (see scoring='recall' below).
    '''
    
    params = {  # some simple parameters to grid search
        'base_model__max_depth': [None],
        'base_model__n_estimators': [50],
        'base_model__criterion': ['gini'],
        # 'feature_engineering_pipeline__feature_selection_pipeline__k': [30,40],
        # 'feature_engineering_pipeline__feature_selection_pipeline__score_func': [chi2]
        }
    # params = {}

    base_model = ExtraTreesClassifier(n_jobs=6,random_state=42)
    # base_model = LogisticRegression(max_iter=1000, n_jobs=-1, random_state=1, solver='liblinear')
    
    model_with_fe = Pipeline([
        ('feature_engineering_pipeline', feature_engineering_pipeline),
        ('base_model', base_model)
    ])
    

    model_grid_search = GridSearchCV(model_with_fe, param_grid=params, cv=3,n_jobs=6,verbose=True,scoring='recall')
    start_time = time.time()  # capture the start time

    parse_time = time.time()
    print(f"Parsing took {(parse_time - start_time):.2f} seconds")

    model_grid_search.fit(x_train,target_pipeline.fit_transform(y_train))
    fit_time = time.time()
    print(f"Training took {(fit_time - start_time):.2f} seconds")

    best_model = model_grid_search.best_estimator_

    y_pred=best_model.predict(x_test)
    print(classification_report(target_pipeline.transform(y_test), y_pred))
    
    end_time = time.time()
    print(f"Overall took {(end_time - start_time):.2f} seconds")
    
    return best_model

best_model = simple_grid_search(X_train, y_train, X_test, y_test, selected_FE_with_FS,target_pipeline)


Parsing took 0.00 seconds
Fitting 3 folds for each of 1 candidates, totalling 3 fits


Training took 18.80 seconds


              precision    recall  f1-score   support

           0       0.89      0.95      0.92     95507
           1       0.73      0.53      0.61     23302

    accuracy                           0.87    118809
   macro avg       0.81      0.74      0.77    118809
weighted avg       0.86      0.87      0.86    118809

Overall took 20.57 seconds
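
The `base_model__...` keys in the parameter grid follow scikit-learn's `<step name>__<parameter>` convention for tuning steps inside a `Pipeline`. The following is a minimal sketch on synthetic data, not the LoanTap pipeline; the `*_demo` names and grid values are chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small synthetic, imbalanced problem standing in for the loan data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

demo_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('base_model', ExtraTreesClassifier(random_state=42)),
])

# '<step name>__<parameter>' routes each value to the matching pipeline step
param_grid = {
    'base_model__n_estimators': [10, 20],
    'base_model__max_depth': [None, 5],
}

# scoring='recall' ranks candidates by recall on the positive class (label 1)
search = GridSearchCV(demo_pipeline, param_grid=param_grid, cv=3, scoring='recall')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated recall: {search.best_score_:.2f}")
```

Because the preprocessing lives inside the pipeline, it is refit on each training fold, so the cross-validated score is not contaminated by the held-out fold.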


In [88]:

# Class 1 precision / recall / f1-score under different feature-selection settings
# 0.82      0.50      0.62 variance 0.03
# 0.73      0.52      0.61 variance 0.07
# 0.73      0.53      0.61 kbest chi2 k=30

In [108]:
best_model