In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import *
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn import set_config
set_config(transform_output="pandas")

# Frame the Problem and Look at the Big Picture

1. Define the objective in business terms.
- Enhance risk management by implementing a predictive model that accurately determines the likelihood of a loan applicant defaulting. This model aims to reduce financial losses by enabling proactive decision-making on loan applications.

2. How will your solution be used?
- The predictive model will be integrated into the existing loan approval system to augment decision-making. When an application is received, the system will use the model to assess risk and suggest actions (approve, reject, or suggest modified loan terms). This assists loan officers by providing a data-driven risk assessment, which complements their expertise.

3. What are the current solutions/workarounds (if any)?
- Current solutions include manual assessment based on credit scores and financial ratios. These methods rely heavily on loan officers' subjective judgment and might not consistently capture all risk factors, especially subtle patterns that are identifiable only through data analysis.

4. How should you frame this problem (supervised/unsupervised, online/offline, ...)?
- This problem is best framed as a supervised binary classification task where each application is labeled as 'default' or 'no default.' The model will operate offline, with periodic retraining to incorporate new data and trends so that it remains current and effective.

5. How should performance be measured? Is the performance measure aligned with the business objective?
- The key performance metrics will be precision and recall, particularly focusing on minimizing false negatives (approving high-risk applications). The business objective is to reduce defaults, so prioritizing recall (sensitivity) might be essential to capture as many potential defaults as possible, even at the expense of some false positives.

6. What would be the minimum performance needed to reach the business objective?
- The model should achieve at least a 20% improvement in identifying potential defaults over current methods, which would substantively decrease financial losses due to defaults.

7. What are comparable problems? Can you reuse experience or tools?
- This problem shares similarities with credit card fraud detection where anomaly detection and classification methods are used. Techniques and insights from churn prediction models, which also categorize customers based on behavior, can be applied, especially in feature engineering and threshold tuning for classifications.

8. Is human expertise available?
- Yes, loan officers and financial analysts are available.

9. How would you solve the problem manually?
- Manually, loan officers review applications based on credit reports, repayment history, and financial statements, making decisions based on guidelines that might not dynamically adjust to changing economic conditions or new patterns of default.

10. List the assumptions you (or others) have made so far. Verify assumptions if possible.
- We assume that the historical data on defaults is indicative of future trends and that the selected features sufficiently capture the risk factors associated with defaults. These assumptions will be validated through continuous performance monitoring and feedback from loan officers on the ground.

# Get the Data

1. List the data you need and how much you need
- Anonymised Loan Default data for predicting if a loan will default or not

2. Find and document where you can get that data
- https://www.kaggle.com/datasets/joebeachcapital/loan-default/data

3. Get access authorizations

4. Create a workspace (with enough storage space)

5. Get the data

6. Convert the data to a format you can easily manipulate (without changing the data itself)

7. Ensure sensitive information is deleted or protected (e.g. anonymized)

8. Check the size and type of data (time series, geographical, ...). 

    Size of Data
    - 38,480 rows x 37 columns

    Type of Data
    - float64(19), int64(4), object(14)

    Is it a time series?
    - No

    Are any of the features unusable for the business problem?
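    - Probably the identifier columns (`Unnamed: 0`, `id`, `member_id`), which should carry no predictive signal. No other column has been ruled out at this stage.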

    Which feature(s) will be used as the target/label for the business problem? (including which are required to derive the correct label)
    - repay_fail

    Should any of the features be stratified during the train/test split to avoid sampling biases?
    - Based on the distribution, we should stratify to maintain the target variable's distribution.

9. Sample a test set, put it aside, and never look at it (no data snooping!)
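
The stratification point in step 8 can be checked with a small sketch. The data below is synthetic (a made-up `loan_amnt` and a `repay_fail` flag drawn at roughly a 15% rate, mirroring the imbalance reported for the real target), so the numbers are illustrative only. It uses `train_test_split` with its `stratify` argument and compares the default rate in each split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan table (assumption: ~15% positive class).
rng = np.random.default_rng(42)
n_rows = 10_000
toy = pd.DataFrame({
    "loan_amnt": rng.uniform(500, 35_000, size=n_rows),
    "repay_fail": (rng.random(n_rows) < 0.15).astype(int),
})

# Stratify on the target so both splits keep the same default rate.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, stratify=toy["repay_fail"], random_state=42
)

overall_rate = toy["repay_fail"].mean()
train_rate = train_toy["repay_fail"].mean()
test_rate = test_toy["repay_fail"].mean()
print(f"overall={overall_rate:.4f} train={train_rate:.4f} test={test_rate:.4f}")
```

With stratification the three rates should agree to within rounding; a purely random split gives no such guarantee, which matters more as the positive class gets rarer.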

In [4]:
# Load the dataset
file_path = 'Anonymize_Loan_Default_data.csv'  # Replace with your actual file path
try:
    # Attempt to load with default encoding
    Loan_Data = pd.read_csv(file_path)
except UnicodeDecodeError:
    # Fallback to other common encodings if there's an issue
    Loan_Data = pd.read_csv(file_path, encoding='latin1')


In [5]:
Loan_Data.columns

Index(['Unnamed: 0', 'id', 'member_id', 'loan_amnt', 'funded_amnt',
       'funded_amnt_inv', 'term', 'int_rate', 'installment', 'emp_length',
       'home_ownership', 'annual_inc', 'verification_status', 'issue_d',
       'loan_status', 'purpose', 'zip_code', 'addr_state', 'dti',
       'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths',
       'mths_since_last_delinq', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'last_pymnt_d', 'last_pymnt_amnt',
       'next_pymnt_d', 'last_credit_pull_d', 'repay_fail'],
      dtype='object')

In [6]:
Loan_Data.size

1423760
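
Note that `.size` returns the total number of cells (rows times columns), not the row count. A minimal sketch on a made-up two-column frame shows the relationship to `.shape`, and the last line checks the arithmetic for the figures reported in this notebook.

```python
import pandas as pd

# Tiny illustrative frame; the column names are arbitrary.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = frame.shape   # (3, 2)
total_cells = frame.size       # rows * columns

print(n_rows, n_cols, total_cells)

# Same arithmetic for the loan data: 38,480 rows x 37 columns.
loan_cells = 38480 * 37
print(loan_cells)
```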

In [7]:
Loan_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38480 entries, 0 to 38479
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              38480 non-null  int64  
 1   id                      38480 non-null  int64  
 2   member_id               38480 non-null  int64  
 3   loan_amnt               38479 non-null  float64
 4   funded_amnt             38479 non-null  float64
 5   funded_amnt_inv         38479 non-null  float64
 6   term                    38480 non-null  object 
 7   int_rate                38480 non-null  float64
 8   installment             38479 non-null  float64
 9   emp_length              37487 non-null  object 
 10  home_ownership          38480 non-null  object 
 11  annual_inc              38478 non-null  float64
 12  verification_status     38480 non-null  object 
 13  issue_d                 38480 non-null  object 
 14  loan_status             38480 non-null

In [8]:
Loan_Data.head()

   Unnamed: 0      id  member_id  loan_amnt  funded_amnt  ...
0           2       2          2        0.0          0.0  ...
1           3  545583     703644     2500.0       2500.0  ...
2           4  532101     687836     5000.0       5000.0  ...
3           5  877788    1092507     7000.0       7000.0  ...
4           6  875406    1089981     2000.0       2000.0  ...

In [9]:
Loan_Data.describe()

| | Unnamed: 0 | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | dti | ... | open_acc | pub_rec | revol_bal | total_acc | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | last_pymnt_amnt | repay_fail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 38480.0 | 38480.0 | 38480.0 | 38479.0 | 38479.0 | 38479.0 | 38480.0 | 38479.0 | 38478.0 | 38480.0 | ... | 38479.0 | 38479.0 | 38476.0 | 38479.0 | 38479.0 | 38479.0 | 38479.0 | 38479.0 | 38479.0 | 38480.0 |
| mean | 19240.5 | 664997.9 | 826189.9 | 11094.727644 | 10831.856337 | 10150.141518 | 12.1643 | 323.163255 | 68995.31 | 13.378119 | ... | 9.342966 | 0.057902 | 14289.87 | 22.108501 | 11980.696892 | 11274.519569 | 9646.412705 | 2232.768235 | 2614.441757 | 0.151481 |
| std | 11108.363516 | 219232.2 | 279353.1 | 7405.416042 | 7146.853682 | 7128.026828 | 3.73744 | 209.089097 | 64476.39 | 6.744356 | ... | 4.498075 | 0.245707 | 21941.38 | 11.588602 | 9006.505205 | 8946.229941 | 7051.828302 | 2570.177312 | 4391.969583 | 0.358522 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 9620.75 | 498364.5 | 638462.0 | 5200.0 | 5100.0 | 4950.0 | 9.62 | 165.74 | 40000.0 | 8.2 | ... | 6.0 | 0.0 | 3639.75 | 13.0 | 5463.099238 | 4811.735 | 4400.0 | 657.7 | 212.01 | 0.0 |
| 50% | 19240.5 | 644319.5 | 824254.5 | 9750.0 | 9600.0 | 8495.792749 | 11.99 | 277.98 | 58650.0 | 13.485 | ... | 9.0 | 0.0 | 8839.5 | 20.0 | 9673.221341 | 8953.24 | 8000.0 | 1335.09 | 526.0 | 0.0 |
| 75% | 28860.25 | 826560.8 | 1034706.0 | 15000.0 | 15000.0 | 14000.0 | 14.72 | 429.35 | 82000.0 | 18.69 | ... | 12.0 | 0.0 | 17265.5 | 29.0 | 16402.394995 | 15486.925 | 13315.1 | 2795.02 | 3169.815 | 0.0 |
| max | 38480.0 | 1077430.0 | 1314167.0 | 35000.0 | 35000.0 | 35000.0 | 100.99 | 1305.19 | 6000000.0 | 100.0 | ... | 47.0 | 5.0 | 1207359.0 | 90.0 | 58563.67993 | 58563.68 | 35000.02 | 23611.1 | 36115.2 | 1.0 |


In [10]:
Loan_Data.isnull().sum()

Unnamed: 0                    0
id                            0
member_id                     0
loan_amnt                     1
funded_amnt                   1
funded_amnt_inv               1
term                          0
int_rate                      0
installment                   1
emp_length                  993
home_ownership                0
annual_inc                    2
verification_status           0
issue_d                       0
loan_status                   0
purpose                       0
zip_code                      0
addr_state                    0
dti                           0
delinq_2yrs                   1
earliest_cr_line              0
inq_last_6mths                1
mths_since_last_delinq    24363
open_acc                      1
pub_rec                       1
revol_bal                     4
revol_util                   59
total_acc                     1
total_pymnt                   1
total_pymnt_inv               1
total_rec_prncp               1
total_re

In [11]:
Loan_Data.duplicated().sum()

np.int64(0)

In [12]:
target_column = 'repay_fail' 
Loan_Data[target_column].value_counts(normalize=True)

repay_fail
0    0.848519
1    0.151481
Name: proportion, dtype: float64

In [13]:
# Split into training (80%) and test (20%) sets, stratifying on the target
# so both sets keep the same default rate
train_set, test_set = train_test_split(
    Loan_Data, test_size=0.2, stratify=Loan_Data[target_column], random_state=42
)

# Explore the Data


1. Copy the data for exploration, downsampling to a manageable size if necessary.

2. Study each attribute and its characteristics: Name; Type (categorical, numerical, bounded, text, structured, ...); % of missing values; Noisiness and type of noise (stochastic, outliers, rounding errors, ...); Usefulness for the task; Type of distribution (Gaussian, uniform, logarithmic, ...)

3. For supervised learning tasks, identify the target attribute(s)

4. Visualize the data

5. Study the correlations between attributes

6. Study how you would solve the problem manually


7. Identify the promising transformations you may want to apply

8. Identify extra data that would be useful (go back to “Get the Data”)
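
Step 2 above (type, percentage of missing values, and so on for each attribute) can be collected into a single table. The following is a minimal sketch; `summarize_attributes` is a hypothetical helper, and the toy frame only borrows a few column names from the loan data (the `"60 months"` value is an assumption, since only `"36 months"` appears in the output shown earlier).

```python
import numpy as np
import pandas as pd


def summarize_attributes(df):
    """Return dtype, % missing and distinct-value count for every column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })
    # Put the columns with the most missing data first.
    return summary.sort_values("pct_missing", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "loan_amnt": [2500.0, 5000.0, np.nan, 7000.0],
    "term": ["36 months", "36 months", "60 months", "36 months"],
    "mths_since_last_delinq": [np.nan, np.nan, 12.0, np.nan],
})

summary = summarize_attributes(toy)
print(summary)
```

Applied to the exploration copy of the training set, a table like this makes columns with heavy missingness (such as `mths_since_last_delinq` in the null counts above) easy to spot.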



In [None]:
# Explore a copy of the training set only, so the test set stays unseen
Loan_Data = train_set.copy()
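
For step 7 (promising transformations), two object columns look like numbers stored as text: `term` (for example `36 months`) and `emp_length` (for example `< 1 year`). The sketch below shows one possible way to convert them. The helper names are hypothetical, and values such as `"60 months"`, `"3 years"` and `"10+ years"` are assumptions about the format that have not been verified against the dataset.

```python
import pandas as pd


def parse_term(series):
    """Extract the number of months from strings like '36 months'."""
    digits = series.str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits, errors="coerce")


def parse_emp_length(series):
    """Map '< 1 year' to 0 and 'N years' / 'N+ years' to N; keep missing as NaN."""
    digits = series.str.extract(r"(\d+)", expand=False)
    years = pd.to_numeric(digits, errors="coerce")
    is_less_than_one = series.str.strip().str.startswith("<", na=False)
    return years.where(~is_less_than_one, 0.0)


# Illustrative values only; the real columns may contain other formats.
term = parse_term(pd.Series(["36 months", "60 months"]))
emp_length = parse_emp_length(pd.Series(["< 1 year", "3 years", "10+ years", None]))

print(term.tolist())
print(emp_length.tolist())
```

Missing `emp_length` values are left as NaN here so that an imputer such as `SimpleImputer` can handle them later in a pipeline.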