# **Loan Default Prediction (Feature Scaling and Engineering)**

## Import Modules

In [1]:
import pandas as pd
import numpy as np

data_descriptions = pd.read_csv('data/data_descriptions.csv')
train_df = pd.read_csv("data/train.csv")
pd.set_option('display.max_colwidth', None)

## Feature Engineering 101

In [2]:
columns = ['Age','Income','LoanAmount','CreditScore','MonthsEmployed','NumCreditLines','InterestRate','LoanTerm','DTIRatio','Education','EmploymentType','MaritalStatus','HasMortgage','HasDependents','LoanPurpose','HasCoSigner']

# Show the name, type, and description of each feature used below
data_descriptions[data_descriptions['Column_name'].isin(columns)][['Column_name', 'Data_type', 'Description']]

Unnamed: 0,Column_name,Data_type,Description
1,Age,integer,The age of the borrower.
2,Income,integer,The annual income of the borrower.
3,LoanAmount,integer,The amount of money being borrowed.
4,CreditScore,integer,"The credit score of the borrower, indicating their creditworthiness."
5,MonthsEmployed,integer,The number of months the borrower has been employed.
6,NumCreditLines,integer,The number of credit lines the borrower has open.
7,InterestRate,float,The interest rate for the loan.
8,LoanTerm,integer,The term length of the loan in months.
9,DTIRatio,float,"The Debt-to-Income ratio, indicating the borrower's debt compared to their income."
10,Education,string,"The highest level of education attained by the borrower (PhD, Master's, Bachelor's, High School)."


## Apply Feature Engineering 

In [3]:
train_df_feat_engr = train_df.copy()
train_df_feat_engr

Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255342,8C6S86ESGC,19,37979,210682,541,109,4,14.11,12,0.85,Bachelor's,Full-time,Married,No,No,Other,No,0
255343,98R4KDHNND,32,51953,189899,511,14,2,11.55,24,0.21,High School,Part-time,Divorced,No,No,Home,No,1
255344,XQK1UUUNGP,56,84820,208294,597,70,3,5.29,60,0.50,High School,Self-employed,Married,Yes,Yes,Auto,Yes,0
255345,JAO28CPL4H,42,85109,60575,809,40,1,20.90,48,0.44,High School,Part-time,Single,Yes,Yes,Other,No,0


### Steps for Applying Feature Engineering for Data Model Scaling

### **1. The age of the borrower**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Binning**: Create age groups (e.g., Young (18-30), Middle-aged (31-50), Senior (51+)) to capture broader age trends.

- **Age squared**: Create a new feature to capture non-linear relationships between age and the target variable.

In [4]:
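# include_lowest=True keeps borrowers aged exactly 18 in the 'Young' bin (pd.cut bins are right-closed by default)
train_df_feat_engr['age_bin'] = pd.cut(train_df_feat_engr['Age'], bins=[18, 30, 50, 100], labels=['Young', 'Middle-aged', 'Senior'], include_lowest=True)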
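
Only the binning step is computed in the cell above. The age-squared feature could be sketched as follows; the toy frame and the column name `age_squared` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for train_df_feat_engr (illustrative values only)
demo = pd.DataFrame({"Age": [18, 25, 46, 69]})

# A squared term lets a linear model fit a curved relationship with age
demo["age_squared"] = demo["Age"] ** 2
print(demo)
```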

### **2. The annual income of the borrower**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Log Transformation**: Apply a log transformation to reduce the effect of extreme values in annual income.

- **Income Binning**: Categorize into income ranges (e.g., Low, Medium, High) based on domain knowledge or percentiles.

- **Income per Dependent**: If the borrower has dependents, create a feature that divides income by the number of dependents (from feature 14). This needs a dependent count, which the dataset records only as Yes/No.

In [5]:
train_df_feat_engr['log_income'] = np.log1p(train_df_feat_engr['Income'])
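
The percentile-based income ranges mentioned above are not computed in the cell. A minimal sketch using `pd.qcut` with three equal-frequency bins is shown here; the column name `income_bin` and the choice of three quantiles are assumptions, and the incomes are taken from the sample rows displayed earlier.

```python
import pandas as pd

# Incomes from the sample rows shown earlier, used as a toy frame
demo = pd.DataFrame({"Income": [20437, 31713, 50432, 84208, 85994, 126802]})

# qcut places roughly the same number of borrowers in each bin
demo["income_bin"] = pd.qcut(demo["Income"], q=3, labels=["Low", "Medium", "High"])
print(demo)
```

Quantile edges depend on the data they are computed from, so in a train/test setting the edges would normally be derived from the training set and reused.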

### **3. The amount of money being borrowed**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Log Transformation**: Apply a log transformation to reduce the effect of extreme values in the loan amount.

- **Loan Amount Binning**: Categorize into loan amount ranges (e.g., Low, Medium, High) based on domain knowledge or percentiles.

- **Loan-to-Income Ratio**: Divide the loan amount by the borrower's annual income to express the loan size relative to earnings.

In [6]:
train_df_feat_engr['loan_to_income'] = train_df_feat_engr['LoanAmount'] / train_df_feat_engr['Income']
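
The ratio above becomes infinite if `Income` is ever zero. The sketch below guards against that case and also adds a log-transformed loan amount; the zero-income row and the column name `log_loan_amount` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# First two rows come from the sample output; the third row (Income == 0) is hypothetical
demo = pd.DataFrame({"LoanAmount": [50587, 124440, 9139],
                     "Income": [85994, 50432, 0]})

# Replace zero income with NaN so the ratio is missing rather than infinite
demo["loan_to_income"] = demo["LoanAmount"] / demo["Income"].replace(0, np.nan)

# log1p compresses the long right tail of loan amounts
demo["log_loan_amount"] = np.log1p(demo["LoanAmount"])
print(demo)
```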

### **4. The credit score of the borrower**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Binning**:  Create credit score bands (e.g., Poor (300-579), Fair (580-669), Good (670-739), Excellent (740+)).

- **Credit Score Squared**: To capture non-linear effects.

In [7]:
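# include_lowest=True keeps a score of exactly 300 in the 'Poor' band (pd.cut bins are right-closed by default)
train_df_feat_engr['credit_score_bin'] = pd.cut(train_df_feat_engr['CreditScore'], bins=[300, 579, 669, 739, 850], labels=['Poor', 'Fair', 'Good', 'Excellent'], include_lowest=True)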

### **5. The number of months the borrower has been employed**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Employment Tenure Category**: Group the borrower into categories (e.g., Less than 1 year, 1-5 years, 5+ years).

- **Employment Stability**:  Create a feature that checks if months employed is greater than a threshold (e.g., 24 months) to define a stable employment status.

In [8]:
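# Note: bins are right-closed, so MonthsEmployed == 0 falls outside (0, 12] and becomes NaN
# unless include_lowest=True is passed to pd.cut.
train_df_feat_engr['employment_tenure'] = pd.cut(train_df_feat_engr['MonthsEmployed'], bins=[0, 12, 60, 1000], 
                                     labels=['<1 year', '1-5 years', '5+ years'])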
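
The edge behaviour of `pd.cut` matters here because a borrower with 0 months employed sits exactly on the lowest bin edge. The sketch below compares the default call with `include_lowest=True` on a toy series, and adds the employment-stability flag described above (threshold of 24 months); the variable names are assumptions.

```python
import pandas as pd

months = pd.Series([0, 8, 12, 26, 80])
bins = [0, 12, 60, 1000]
labels = ["<1 year", "1-5 years", "5+ years"]

# Default: intervals are (0, 12], (12, 60], (60, 1000], so 0 is not covered
default_cut = pd.cut(months, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 lands in '<1 year'
inclusive_cut = pd.cut(months, bins=bins, labels=labels, include_lowest=True)

# Stability flag: 1 when employed for more than 24 months
is_stable = (months > 24).astype(int)

print(pd.DataFrame({"months": months, "default": default_cut,
                    "inclusive": inclusive_cut, "is_stable": is_stable}))
```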

### **6. The number of credit lines the borrower has open**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Credit Line Utilization Ratio**: Divide the number of open credit lines by the total available credit. This requires a total-credit field, which the dataset does not include.

- **Credit Line Category**:  Categorize into Low, Medium, and High based on the number of open credit lines.

In [9]:
train_df_feat_engr['credit_line_category'] = pd.cut(train_df_feat_engr['NumCreditLines'], 
                                        bins=[0, 3, 6, 100], 
                                        labels=['Low', 'Medium', 'High'])

### **7. The interest rate for the loan**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Interest Rate Binning**: Create ranges of interest rates (e.g., Low (0-5%), Medium (5-10%), High (>10%)).

- **Interaction with Credit Score**:  Create an interaction term between the credit score and the interest rate to explore how different credit scores affect the interest rate.

In [10]:
train_df_feat_engr['interest_rate_bin'] = pd.cut(train_df_feat_engr['InterestRate'], 
                                     bins=[0, 5, 10, 100], 
                                     labels=['Low', 'Medium', 'High'])

### **8. The term length of the loan in months**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Short vs Long Term**: Categorize loans as Short Term (<= 36 months) or Long Term (> 36 months).

- **Term-to-Loan Amount Ratio**:  Create a ratio between the loan term and the loan amount to capture loan payment burdens.

In [11]:
train_df_feat_engr['short_long_term'] = np.where(train_df_feat_engr['LoanTerm'] <= 36, 'Short Term', 'Long Term')
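
The term-to-loan-amount ratio is not computed in the cell above. One possible reading, assumed here, is the principal owed per month of the term (loan amount divided by loan term), which ignores interest; the column name `loan_amount_per_month` is hypothetical.

```python
import pandas as pd

# Values from the first two sample rows shown earlier
demo = pd.DataFrame({"LoanAmount": [50587, 124440], "LoanTerm": [36, 60]})

# Principal per month of the term: larger values suggest a heavier repayment burden
demo["loan_amount_per_month"] = demo["LoanAmount"] / demo["LoanTerm"]
print(demo)
```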

### **9. The Debt-to-Income ratio**

##### **Current Representation**: *Numeric feature*

**Feature Engineering**

- **Binning**: Convert the ratio into categories (e.g., Low (0-0.35), Medium (0.35-0.5), High (>0.5)).

- **Interaction with Loan Amount**:  Create an interaction term between the loan amount and the debt-to-income ratio to explore combined effects on creditworthiness.

In [12]:
train_df_feat_engr['dti_bin'] = pd.cut(train_df_feat_engr['DTIRatio'], 
                           bins=[0, 0.35, 0.5, 1], 
                           labels=['Low', 'Medium', 'High'])

### **10. The highest level of education attained by the borrower**

##### **Current Representation**: *Categorical feature*

**Feature Engineering**

- **One-Hot Encoding**: Convert each education level (PhD, Master's, Bachelor's, High School) into a binary column.

- **Ordinal Encoding**:  Assign numeric values based on education level (e.g., PhD=4, Master's=3, Bachelor's=2, High School=1).

In [13]:
education_mapping = {'High School': 1, "Bachelor's": 2, "Master's": 3, 'PhD': 4}
train_df_feat_engr['education_level'] = train_df_feat_engr['Education'].map(education_mapping)
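
`Series.map` returns NaN for any label missing from the dictionary, so it can be worth checking that the mapping covers every category. A minimal sketch follows; the misspelled label `"Masters"` is a hypothetical value added only to trigger the check.

```python
import pandas as pd

education_mapping = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}

# "Masters" (no apostrophe) is a deliberately misspelled, hypothetical label
demo = pd.DataFrame({"Education": ["Bachelor's", "PhD", "Masters"]})
demo["education_level"] = demo["Education"].map(education_mapping)

# Labels present in the data but absent from the mapping
unmapped = sorted(set(demo["Education"]) - set(education_mapping))
print(unmapped, demo["education_level"].isna().sum())
```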

### **11. The type of employment status of the borrower**

##### **Current Representation**: *Categorical feature*

**Feature Engineering**

- **One-Hot Encoding**: Convert Full-time, Part-time, Self-employed, and Unemployed into binary columns.

- **Ordinal Encoding**: Assign numeric values only if an ordering of employment types is assumed. The categories have no inherent order, so the cell below uses one-hot encoding.

In [14]:
train_df_feat_engr = pd.get_dummies(train_df_feat_engr, columns=['EmploymentType'], drop_first=True)
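
To make the effect of `drop_first=True` concrete, the sketch below encodes a toy column with and without it. With k categories, `drop_first=True` keeps k-1 columns and the dropped (alphabetically first) category becomes the implicit baseline.

```python
import pandas as pd

demo = pd.DataFrame({"EmploymentType": ["Full-time", "Unemployed", "Part-time", "Self-employed"]})

# One column per category
full = pd.get_dummies(demo, columns=["EmploymentType"])

# First category (Full-time) is dropped and is represented by all remaining columns being False
reduced = pd.get_dummies(demo, columns=["EmploymentType"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```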

### **12. The marital status of the borrower**

##### **Current Representation**: *Categorical feature*

**Feature Engineering**

- **One-Hot Encoding**: Convert Single, Married, Divorced into binary columns.

- **Interaction with Dependents**: Create an interaction feature between marital status and whether the borrower has dependents.

In [15]:
train_df_feat_engr = pd.get_dummies(train_df_feat_engr, columns=['MaritalStatus'], drop_first=True)
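
The interaction with dependents is not computed above. Because `pd.get_dummies` replaces the original `MaritalStatus` column, such a feature would have to be built before the encoding cell runs. A minimal sketch, with the hypothetical column name `marital_dependents`:

```python
import pandas as pd

demo = pd.DataFrame({"MaritalStatus": ["Divorced", "Married", "Single"],
                     "HasDependents": ["Yes", "No", "Yes"]})

# Combine both categorical columns into one label per borrower
dependents_label = demo["HasDependents"].map({"Yes": "with_dependents", "No": "no_dependents"})
demo["marital_dependents"] = demo["MaritalStatus"] + "_" + dependents_label
print(demo)
```

The combined column can then be one-hot encoded like any other categorical feature.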

### **13. Whether the borrower has a mortgage**

##### **Current Representation**: *Binary feature (Yes/No)*

**Feature Engineering**

- **One-Hot Encoding**:  Convert to binary columns (Has Mortgage = 1, No Mortgage = 0).

- **Mortgage-to-Income Ratio**: If the mortgage amount is available, create a ratio between mortgage and income.

In [16]:
train_df_feat_engr['has_mortgage'] = train_df_feat_engr['HasMortgage'].map({'Yes': 1, 'No': 0})

### **14. Whether the borrower has dependents**

##### **Current Representation**: *Binary feature (Yes/No)*

**Feature Engineering**

- **One-Hot Encoding**:  Convert to binary columns.

- **Number of Dependents**: If more information is available, convert Yes/No to the actual number of dependents for more granularity.

In [17]:
train_df_feat_engr['has_dependents'] = train_df_feat_engr['HasDependents'].map({'Yes': 1, 'No': 0})

### **15. The purpose of the loan**

##### **Current Representation**: *Categorical feature*

**Feature Engineering**

- **One-Hot Encoding**:  Convert Home, Auto, Education, Business, Other into binary columns.

- **Loan Purpose Categories**: Group purposes into categories such as Necessity (Home, Auto), Growth (Education, Business), and Miscellaneous (Other).

In [18]:
train_df_feat_engr = pd.get_dummies(train_df_feat_engr, columns=['LoanPurpose'], drop_first=True)
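
The grouped loan-purpose categories described above could be sketched as a dictionary lookup. The mapping follows the grouping given in the text; the column name `loan_purpose_group` is an assumption, and the step would need to run before `LoanPurpose` is one-hot encoded.

```python
import pandas as pd

purpose_groups = {
    "Home": "Necessity",
    "Auto": "Necessity",
    "Education": "Growth",
    "Business": "Growth",
    "Other": "Miscellaneous",
}

demo = pd.DataFrame({"LoanPurpose": ["Other", "Auto", "Business", "Home", "Education"]})
demo["loan_purpose_group"] = demo["LoanPurpose"].map(purpose_groups)
print(demo)
```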

### **16. Whether the loan has a co-signer**

##### **Current Representation**: *Binary feature (Yes/No)*

**Feature Engineering**

- **One-Hot Encoding**:  Convert to binary columns.

- **Interaction with Loan Amount**: Create an interaction term between co-signer and loan amount to capture how co-signers might influence the loan size.


In [19]:
train_df_feat_engr['has_cosigner'] = train_df_feat_engr['HasCoSigner'].map({'Yes': 1, 'No': 0})

### Additional Potential Features

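**1. Loan Affordability Index**: Combine loan amount, income, debt-to-income ratio, and interest rate to create a comprehensive loan affordability score. The cell below computes a simpler proxy, the loan-to-income ratio multiplied by the debt-to-income ratio; interest rate is not included.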

In [20]:
train_df_feat_engr['loan_income_dti_interaction'] = train_df_feat_engr['loan_to_income'] * train_df_feat_engr['DTIRatio']
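
An affordability measure that also uses interest rate and loan term could be sketched with the standard fixed-rate amortization formula, payment = P·r·(1+r)^n / ((1+r)^n − 1). This assumes `InterestRate` is an annual percentage, `LoanTerm` is in months, `Income` is annual, and every rate is non-zero; the column names `est_monthly_payment` and `payment_to_income` are hypothetical.

```python
import pandas as pd

# Values from the first two sample rows shown earlier
demo = pd.DataFrame({
    "LoanAmount": [50587, 124440],
    "Income": [85994, 50432],
    "InterestRate": [15.23, 4.81],   # assumed to be annual percent
    "LoanTerm": [36, 60],            # months
})

# Convert the annual percentage rate to a monthly decimal rate
monthly_rate = demo["InterestRate"] / 100 / 12
growth = (1 + monthly_rate) ** demo["LoanTerm"]

# Fixed monthly payment under the amortization formula (requires monthly_rate != 0)
demo["est_monthly_payment"] = demo["LoanAmount"] * monthly_rate * growth / (growth - 1)

# Share of monthly income consumed by the estimated payment
demo["payment_to_income"] = demo["est_monthly_payment"] / (demo["Income"] / 12)
print(demo)
```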

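**2. Financial Stability Score**: Create a composite feature using credit score, annual income, debt-to-income ratio, and employment months. The cell below does not build this composite; it computes the credit score and interest rate interaction proposed under feature 7.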

In [21]:
train_df_feat_engr['credit_interest_interaction'] = train_df_feat_engr['CreditScore'] * train_df_feat_engr['InterestRate']

**3. Risk Index**: Use a combination of credit score, debt-to-income ratio, loan amount, and loan term to generate a risk score for predicting loan default risk. This index is not computed in the notebook.
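
One way such a risk index might be assembled is an equal-weight average of min-max scaled components, with credit score inverted so that higher values always mean more risk. The equal weights, the direction assigned to each component, and the column name `risk_index` are all assumptions for illustration, not a validated scoring method.

```python
import pandas as pd

# Values from the first four sample rows shown earlier
demo = pd.DataFrame({
    "CreditScore": [520, 458, 451, 743],
    "DTIRatio": [0.44, 0.68, 0.31, 0.23],
    "LoanAmount": [50587, 124440, 129188, 44799],
    "LoanTerm": [36, 60, 24, 24],
})

def min_max(series: pd.Series) -> pd.Series:
    """Rescale a series to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())

# Higher credit score is treated as lower risk, so it is inverted
demo["risk_index"] = (
    (1 - min_max(demo["CreditScore"]))
    + min_max(demo["DTIRatio"])
    + min_max(demo["LoanAmount"])
    + min_max(demo["LoanTerm"])
) / 4
print(demo)
```

In a train/test workflow the minimum and maximum would normally be taken from the training data only.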

### Export: Feature Engineering Data

In [22]:
train_df_feat_engr.to_csv("data/Feature Engineering Data.csv", index=False)

### Output: Feature Engineering Data

In [23]:
train_df_feat_engr.head(10)

Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,...,MaritalStatus_Single,has_mortgage,has_dependents,LoanPurpose_Business,LoanPurpose_Education,LoanPurpose_Home,LoanPurpose_Other,has_cosigner,loan_income_dti_interaction,credit_interest_interaction
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,...,False,1,1,False,False,False,True,1,0.258835,7919.6
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,...,False,0,0,False,False,False,True,1,1.677887,2202.98
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,...,False,1,1,False,False,False,False,0,0.475588,9547.67
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,...,False,0,0,True,False,False,False,0,0.324907,5253.01
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,...,False,0,1,False,False,False,False,0,0.326441,4120.83
5,A9S62RQ7US,25,90298,90448,720,18,2,22.72,24,0.1,...,True,1,0,True,False,False,False,1,0.100166,16358.4
6,H8GXPAOS71,38,111188,177025,429,80,1,19.11,12,0.16,...,True,1,0,False,False,True,False,1,0.25474,8198.19
7,0HGZQKJ36W,56,126802,155511,531,67,4,8.15,60,0.43,...,False,0,0,False,False,True,False,1,0.527355,4327.65
8,1R0N3LGNRJ,36,42053,92357,827,83,1,23.94,48,0.2,...,False,1,0,False,True,False,False,0,0.439241,19798.38
9,CM9L1GTT2P,40,132784,228510,480,114,4,9.09,48,0.33,...,False,1,0,False,False,False,True,1,0.567902,4363.2


## Dropping Redundant Raw Columns for Data Model Scaling

In [24]:
# Drop the raw columns that now have engineered counterparts
df_fe = train_df_feat_engr.drop(columns=['Age', 'Income', 'LoanAmount', 'CreditScore', 'MonthsEmployed', 'NumCreditLines', 
                                         'InterestRate', 'LoanTerm', 'DTIRatio', 'HasMortgage', 'HasDependents', 'HasCoSigner'])

In [25]:
df_fe.head(10)

Unnamed: 0,LoanID,Education,Default,age_bin,log_income,loan_to_income,credit_score_bin,employment_tenure,credit_line_category,interest_rate_bin,...,MaritalStatus_Single,has_mortgage,has_dependents,LoanPurpose_Business,LoanPurpose_Education,LoanPurpose_Home,LoanPurpose_Other,has_cosigner,loan_income_dti_interaction,credit_interest_interaction
0,I38PQUQS96,Bachelor's,0,Senior,11.362044,0.588262,Poor,5+ years,Medium,High,...,False,1,1,False,False,False,True,1,0.258835,7919.6
1,HPSK72WA7R,Master's,0,Senior,10.828401,2.467481,Poor,1-5 years,Low,Low,...,False,0,0,False,False,False,True,1,1.677887,2202.98
2,C1OZ6DPJ8Y,Master's,1,Middle-aged,11.341057,1.534154,Poor,1-5 years,Low,High,...,False,1,1,False,False,False,False,0,0.475588,9547.67
3,V2KKSFM3UN,High School,0,Middle-aged,10.364514,1.412638,Excellent,,Low,Medium,...,False,0,0,True,False,False,False,0,0.324907,5253.01
4,EY08JDHTZP,Bachelor's,0,Senior,9.925151,0.447179,Fair,<1 year,Medium,Medium,...,False,0,1,False,False,False,False,0,0.326441,4120.83
5,A9S62RQ7US,High School,1,Young,11.410882,1.001661,Good,1-5 years,Low,High,...,True,1,0,True,False,False,False,1,0.100166,16358.4
6,H8GXPAOS71,Bachelor's,0,Middle-aged,11.618987,1.592123,Poor,5+ years,Low,High,...,True,1,0,False,False,True,False,1,0.25474,8198.19
7,0HGZQKJ36W,PhD,0,Senior,11.75039,1.226408,Poor,5+ years,Medium,Medium,...,False,0,0,False,False,True,False,1,0.527355,4327.65
8,1R0N3LGNRJ,Bachelor's,1,Middle-aged,10.64671,2.196205,Excellent,5+ years,Low,High,...,False,1,0,False,True,False,False,0,0.439241,19798.38
9,CM9L1GTT2P,High School,0,Middle-aged,11.796487,1.720915,Poor,5+ years,Medium,Medium,...,False,1,0,False,False,False,True,1,0.567902,4363.2
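
The remaining numeric columns sit on very different scales (for example `log_income` near 11 and `credit_interest_interaction` in the thousands). A minimal sketch of z-score standardization in plain pandas is shown below, assuming these three columns are the ones to scale; the values are copied from the first three output rows above.

```python
import pandas as pd

demo = pd.DataFrame({
    "log_income": [11.362044, 10.828401, 11.341057],
    "loan_to_income": [0.588262, 2.467481, 1.534154],
    "credit_interest_interaction": [7919.6, 2202.98, 9547.67],
})
numeric_cols = ["log_income", "loan_to_income", "credit_interest_interaction"]

# Standardize each column to mean 0 and (population) standard deviation 1
means = demo[numeric_cols].mean()
stds = demo[numeric_cols].std(ddof=0)
scaled = (demo[numeric_cols] - means) / stds
print(scaled)
```

When a separate test set exists, `means` and `stds` from the training data would be reused to transform it.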
