# LGD - Loss Given Default
This notebook trains a model to predict Loss Given Default (LGD). Let's start with feature selection.

**LGD** indicates the proportion of the total exposure that remains unrecovered after a borrower has defaulted.

Hence, it is better practice to build the model only on loans with a `charged off` status, i.e. loans on which a loss has actually been realized.

From the data provided, we know that:

* `funded_amnt`: the funded amount, used here as the total exposure at the moment the borrower defaulted;

* `recoveries`: amount that has been recovered.

Hence, with `recovery_rate = recoveries / funded_amnt`, LGD can be defined as:

$LGD = \frac{funded\_amnt - recoveries}{funded\_amnt} = 1 - recovery\_rate$
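
As a quick worked check of the formula, the sketch below computes LGD for a few made-up loans. The helper name `loss_given_default` and the toy numbers are assumptions for illustration only; they are not part of the dataset.

```python
import pandas as pd


def loss_given_default(funded_amnt: pd.Series, recoveries: pd.Series) -> pd.Series:
    """LGD = 1 - recoveries / funded_amnt (hypothetical helper, for illustration)."""
    return 1 - recoveries / funded_amnt


# Made-up loans: nothing recovered, partially recovered, fully recovered.
toy = pd.DataFrame({
    "funded_amnt": [10_000.0, 10_000.0, 5_000.0],
    "recoveries": [0.0, 1_500.0, 5_000.0],
})
toy["LGD"] = loss_given_default(toy["funded_amnt"], toy["recoveries"])
print(toy)
```

A loan with no recoveries has an LGD of 1, and a loan whose recoveries equal the funded amount has an LGD of 0.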

In [1]:
import os
import sys

# Data Science
import numpy as np
import pandas as pd

# Data Visualization
import seaborn as sns
import plotly.io as pio # plotly is only used if you have a powerful machine
import matplotlib.pyplot as plt
import plotly.figure_factory as ff

# ignore all warnings
import warnings
warnings.filterwarnings('ignore')


In [2]:
train_df = pd.read_csv('../data/train/train.csv')

Keep only the records that have a `charged off` status.

In [3]:
DEFAULT_CATEGORIES = [
    "Charged Off",
    "Does not meet the credit policy. Status:Charged Off"
]

# .copy() gives an independent frame, so adding the LGD column later is safe
train_df = train_df[train_df["loan_status"].isin(DEFAULT_CATEGORIES)].copy()
train_df.shape

(67068, 24)
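
The status filter can be tried on a small frame first. This is a minimal sketch with made-up rows; only `DEFAULT_CATEGORIES` and the `loan_status` column name come from the notebook, everything else is assumed.

```python
import pandas as pd

DEFAULT_CATEGORIES = [
    "Charged Off",
    "Does not meet the credit policy. Status:Charged Off",
]

# Toy frame with made-up rows; the real data has many more rows and columns.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "loan_status": [
        "Fully Paid",
        "Charged Off",
        "Current",
        "Does not meet the credit policy. Status:Charged Off",
    ],
})

# Boolean mask: True where the status is one of the default categories.
is_default = toy["loan_status"].isin(DEFAULT_CATEGORIES)

# .copy() makes the subset an independent frame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
defaults = toy[is_default].copy()
print(defaults["id"].tolist())
```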

In [4]:
TARGET_VARIABLE = "LGD"

train_df[TARGET_VARIABLE] = 1 - (train_df["recoveries"] / train_df["funded_amnt"])
train_df[TARGET_VARIABLE].describe()

count    67068.000000
mean         0.922581
std          0.097508
min         -0.294209
25%          0.896070
50%          0.937282
75%          1.000000
max          1.000000
Name: LGD, dtype: float64

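We remove every record whose LGD is not positive, because there is no loss on those borrowers: recoveries were at least as large as the funded amount. The filter below keeps only rows with LGD > 0.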

In [5]:
train_df = train_df[train_df[TARGET_VARIABLE] > 0]
train_df.shape

(67043, 25)
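
To see why LGD can fall to zero or below, and what the strict `> 0` filter removes, here is a small sketch on made-up numbers. The toy values are assumptions; only the formula and the filter condition mirror the cells above.

```python
import pandas as pd

# Made-up loans: the third recovers more than was funded, the fourth exactly as much.
toy = pd.DataFrame({
    "funded_amnt": [10_000.0, 8_000.0, 5_000.0, 4_000.0],
    "recoveries": [0.0, 2_000.0, 6_000.0, 4_000.0],
})
toy["LGD"] = 1 - toy["recoveries"] / toy["funded_amnt"]

n_negative = int((toy["LGD"] < 0).sum())  # recoveries > funded_amnt
n_zero = int((toy["LGD"] == 0).sum())     # recoveries == funded_amnt

# Same strict filter as in the notebook: keeps only rows with LGD > 0.
kept = toy[toy["LGD"] > 0]
print(f"negative: {n_negative}, zero: {n_zero}, kept: {len(kept)} of {len(toy)}")
```

Counting the negative and zero cases separately before filtering makes it clear how many rows each condition accounts for.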

Let's check the distribution of LGD.

In [6]:
# fig = ff.create_distplot([train_df[TARGET_VARIABLE]], [TARGET_VARIABLE], bin_size=0.1)
# fig.show()

The target variable is unevenly distributed: a large share of values sits at or near 1 (the 75th percentile above is already 1.0), which means that little or nothing was recovered on most charged-off loans.
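
When the plotly figure is too heavy to render, a numeric histogram gives a rough view of the same concentration. The sketch below uses made-up LGD values in place of `train_df["LGD"]`; the bin count of 10 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up LGD values standing in for train_df["LGD"].
lgd = pd.Series([1.0, 1.0, 1.0, 0.97, 0.94, 0.93, 0.91, 0.85, 0.75, 0.45])

# Ten equal-width bins on [0, 1]; np.histogram returns counts and bin edges.
# The last bin includes its right edge, so LGD == 1.0 is counted there.
counts, edges = np.histogram(lgd, bins=10, range=(0.0, 1.0))
share = counts / counts.sum()

for left, right, s in zip(edges[:-1], edges[1:], share):
    print(f"{left:.1f}-{right:.1f}: {s:.0%}")
```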

## Feature Analysis
We are going to use the same features as the PD model for alignment. However, we should analyze each feature before settling on an appropriate regression model. Let's start by transforming the features.

### Feature Transformation

In [7]:
SELECTED_FEATURES = [
    "id", "purpose", "initial_list_status", "emp_length",
    "pub_rec", "grade", "addr_state", "term",
    "int_rate", "LGD"
]
train_df = train_df[SELECTED_FEATURES]

encoded_df = pd.get_dummies(train_df['grade'], prefix='grade')
train_df = train_df.drop('grade', axis=1)
train_df = pd.concat([train_df, encoded_df], axis=1)

ADDR_STATE_MAP = {
    "ID": ["ID"],
    "NE_ME": ["NE","ME"],
    "ND": ["ND"],
    "MS_AR_OK": ["MS", "OK", "AR"],
    "SD_LA_AL_AK_IN": ["SD", "LA", "AL", "AK", "IN"],
    "WV_MO_MD_OH": ["WV", "MO", "MD", "OH"],
    "NY_TN_KY_NJ_SC_NC_CT": ["NY", "TN", "KY", "NJ", "SC", "NC", "CT"],
    "IL": ["IL"],
    "PA_NM": ["PA", "NM"],
    "TX_VT": ["TX", "VT"],
    "FL_NV_GA": ["FL", "NV", "GA"],
    "VA_MN": ["MN", "VA"],
    "KS_DE_MA_MI_MT": ["KS", "DE", "MA", "MI", "MT"],
    "AZ_HI_WY_WI": ["AZ", "HI", "WY", "WI"],
    "RI_CA_NH": ["RI", "CA", "NH"],
    "WA_CO_OR_UT": ["WA", "CO", "OR", "UT"],
    "DC": ["DC"]
}

reverse_map = {state: group for group, states in ADDR_STATE_MAP.items() for state in states}

train_df["addr_state"] = train_df["addr_state"].map(reverse_map)

encoded_df = pd.get_dummies(train_df["addr_state"], prefix="addr_state")
train_df = train_df.drop("addr_state", axis=1)
train_df = pd.concat([train_df, encoded_df], axis=1)

PURPOSES_MAP = {
    "cred_card": ["credit_card"],
    "vacation": ["vacation"],
    "car": ["car"],
    "home_improv_debt_consol": ["home_improvement", "debt_consolidation"],
    "moving": ["moving"],
    "renewable_energy": ["renewable_energy"],
    "major_purchase_medical": ["major_purchase", "medical"],
    "other": ["other"],
    "wedding": ["wedding"],
    "small_business": ["small_business"],
    "house": ["house"]
}

reverse_map = {purpose: group for group, purposes in PURPOSES_MAP.items() for purpose in purposes}
train_df["purpose"] = train_df["purpose"].map(reverse_map)

encoded_df = pd.get_dummies(train_df["purpose"], prefix="purpose")
train_df = train_df.drop("purpose", axis=1)
train_df = pd.concat([train_df, encoded_df], axis=1)

encoded_df = pd.get_dummies(train_df["term"], prefix="term")
train_df = train_df.drop("term", axis=1)
train_df = pd.concat([train_df, encoded_df], axis=1)

encoded_df = pd.get_dummies(train_df["initial_list_status"], prefix="initial_list_status")
train_df = train_df.drop("initial_list_status", axis=1)
train_df = pd.concat([train_df, encoded_df], axis=1)


EMP_LENGTH_MAP = {
    "0": 0,
    "< 1 year": 1,
    "1 year": 1,
    "2 years": 2,
    "3 years": 3,
    "4 years": 4,
    "5 years": 5,
    "6 years": 6,
    "7 years": 7,
    "8 years": 8,
    "9 years": 9,
    "10+ years": 10
}
train_df["emp_length"] = train_df["emp_length"].map(EMP_LENGTH_MAP)
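
The group-then-encode pattern used above can be checked in isolation. The following is a minimal sketch on a hypothetical two-group mapping (`GROUP_MAP` and the value `"ZZ"` are made up); it shows that `Series.map` yields `NaN` for any value missing from the mapping, which is worth counting before encoding.

```python
import pandas as pd

# Hypothetical two-group mapping, same shape as ADDR_STATE_MAP / PURPOSES_MAP.
GROUP_MAP = {
    "NE_ME": ["NE", "ME"],
    "IL": ["IL"],
}

# Invert {group: [members]} into {member: group} for use with Series.map.
reverse_map = {member: group for group, members in GROUP_MAP.items() for member in members}

states = pd.Series(["NE", "IL", "ME", "ZZ"])  # "ZZ" is deliberately unmapped
grouped = states.map(reverse_map)

# Series.map returns NaN for keys missing from the dict, so count them explicitly.
n_unmapped = int(grouped.isna().sum())

# get_dummies skips NaN by default, so the unmapped row gets all-zero dummies.
dummies = pd.get_dummies(grouped, prefix="addr_state")
print(n_unmapped, list(dummies.columns))
```

The same check applies to `emp_length`: any label absent from `EMP_LENGTH_MAP` becomes `NaN` after mapping.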


### Categorical Features
Since we use one-hot encoding for the categorical features, we should check how the LGD distribution differs between the two values (0 and 1) of each dummy feature.

In [None]:
DUMMIES_PREFIX = ["grade", "addr_state", "term", "purpose"]

In [None]:
for prefix in DUMMIES_PREFIX:
    dummy_cols = [col for col in train_df.columns if col.startswith(prefix)]
    
    for col in dummy_cols:

        group0 = train_df[train_df[col] == 0][TARGET_VARIABLE]
        group1 = train_df[train_df[col] == 1][TARGET_VARIABLE]

        plt.figure(figsize=(16, 9)) 
        sns.kdeplot(group0, label=f'{col}=0')
        sns.kdeplot(group1, label=f'{col}=1')
        plt.legend()
        plt.title(f'Density plot of {TARGET_VARIABLE} for {col}')
        plt.show()
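
As a numeric complement to the density plots, the two groups of a dummy column can also be compared with summary statistics. This is a sketch on made-up data; the column name `term_36` is hypothetical and stands for any dummy column in `train_df`.

```python
import pandas as pd

# Made-up data: one dummy column (hypothetical name) and an LGD target.
toy = pd.DataFrame({
    "term_36": [1, 1, 0, 0, 1, 0],
    "LGD": [0.9, 1.0, 0.8, 0.7, 0.95, 0.75],
})

# Count, mean and median of LGD for dummy == 0 and dummy == 1.
summary = toy.groupby("term_36")["LGD"].agg(["count", "mean", "median"])
print(summary)
```

A large gap between the two rows suggests that the dummy carries information about LGD; similar rows suggest it does not.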
