
This notebook covers the **Exploratory Data Analysis (EDA)** phase. EDA of the Lending Club dataset will be carried out in 3 main phases: **Data Understanding, Data Cleaning, and Feature Engineering**. 

![Steps-for-Performing-Exploratory-Data-Analysis.png](../images/Steps-for-Performing-Exploratory-Data-Analysis.png "Steps-for-Performing-Exploratory-Data-Analysis.png")

The main tasks of each phase are as follows: 
- **Data Understanding**
  - Understand data dictionary terminologies 
  - Identify more critical columns 

- **Data Cleaning**
  - Deal with Missing Values, Duplicates, Outliers 

- **Feature Engineering**
  - Deal with Multicollinearity 
  - Feature Transformation, Standardisation, Dimensionality Reduction 
  - Dealing with Dataset Imbalance


By the end of this notebook, I should be able to: 
- Output a **thoroughly cleansed target dataset** for efficient credit risk model building 

- Define **feature and target variables** from the target table clearly 





## 1. Data Understanding 

The dataset is from a [Kaggle Dataset](https://www.kaggle.com/datasets/wordsforthewise/lending-club), which has records of all loans issued to borrowers between 2007 and 2018 Q4. I will now conduct Exploratory Data Analysis to get a sense of the dataset before proceeding further. 



In [0]:
# Import Libraries 
from pyspark.sql.functions import col

In [0]:
# I will need the following code to initialise Spark Session in standalone Python notebooks
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("Data Cleaning").getOrCreate()


# File Endpoint and Format 
file_location = "/FileStore/tables/accepted_2007_to_2018Q4.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# Reading a CSV file 
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

#### 1.1 Data Dictionary 

The following is the explanation of what some features in the dataset mean. There will be additional explanations in some features to facilitate my understanding of each column. The original data dictionary is obtained from [Figshare](https://figshare.com/articles/dataset/Lending_club_dataset_description/20016077). 

##### Loan Details 

- **id**: Unique loan listing / request identifier (not yet funded / issued)
- **member_id**: Unique borrower ID 
- **loan_amnt**: Amount requested by borrower
- **issue_d**: Month-Year the loan is funded/issued to borrower 

- **funded_amnt**: Total amount of money funded to the borrower (may be partially / fully funded)
- **funded_amnt_inv**: Total amount of money funded to borrower by **investors** (should be the same as `funded_amnt`)
  - This column exists, because in **investors' dataset**, investors can see individual contributions to the borrower's requested amount 
- **term**: Loan term (months)
- **int_rate**: Interest rate
- **installment**: Monthly payment upon approved loan 

- **grade vs sub_grade**: 
  - Grade: Broad Risk Category (A,B)
  - Sub_Grade: A1,B2 (Specific)

- **purpose**: Reason for borrowing money

- **title**: Similar to `purpose`

- **desc**: Loan description (provided by borrower, like a Carousel Listing)

- **url**: Link to loan listing 

- **initial_list_status**: Lending Club determines how loans will be funded by investors based on risk level, investors then see these listings on the platform
  - `F`: Fractional loans (Many investors fund small parts)
  - `w`: Whole loans (Offered to 1 investor who funds everything)

- **disbursement_method**: How loan funding was delivered to borrower
  - **Cash**: Funds directly deposited to borrower's bank account 
  - **Direct_Pay**: Lending Club pays debts for you (Use new personal loan to pay off old debts, at lower interest rates)

- **policy_code**:
  - `1`: Publicly Available Product (Loan is public in Lending Club Platform)
  - `2`: Private Loan Product (Loan is a special private offering, e.g. for certain investors only)

- **pymnt_plan**: Checks if payment plan in place for borrower to catch up on missed payments 
  - 'n': refers to 'No Payment Plan'

- **application_type**: Single / Joint loan application 
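
As a rough check on how `installment` relates to `loan_amnt`, `int_rate` and `term`, the standard fixed-rate amortisation formula can be sketched as below. This assumes `int_rate` is an annual percentage and that payments are monthly. The function name `monthly_installment` is my own (not a dataset field), and Lending Club's exact rounding may differ.

```python
def monthly_installment(loan_amnt: float, int_rate_pct: float, term_months: int) -> float:
    """Fixed monthly payment for a fully amortising loan.

    Assumption: int_rate_pct is an annual rate in percent (e.g. 12.0 for 12%).
    """
    monthly_rate = int_rate_pct / 100 / 12
    if monthly_rate == 0:
        # No interest: the principal is simply spread evenly over the term
        return loan_amnt / term_months
    # Standard annuity formula: P * r / (1 - (1 + r)^-n)
    return loan_amnt * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)


# Illustrative values only, not taken from the dataset
payment = monthly_installment(10_000, 12.0, 36)
print(round(payment, 2))
```

Comparing this value against the `installment` column for a few rows is a quick way to confirm that the three inputs are being read in the expected units.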


##### Borrower Demographics 
- **emp_title**: Borrower’s job title
- **emp_length**: Years in current job
- **home_ownership**: Home ownership status (rented / owned)
- **annual_inc**: Borrower’s annual income
- **verification_status**: If **individual borrower's income** is verified by Lending Club platform (via Singpass in SG context)
- **zip_code** / **addr_state**: Borrower's zip code and U.S. state of residence (`CA`: California)

##### Borrower Credit History & Scores
- 📍 **dti**: Debt-Income Ratio (20% DTI means 20% of borrower's income go to debt payments) (Higher DTI is a red flag 🚩)

- **delinq_2yrs**: Number of times borrower was 30+days late on payments (delinquency: missed payments) for the past 2 years

- **earliest_cr_line**: Earliest Credit Line (Date borrower opened his earliest credit account) 
  - Open credit account usually means you start borrowing money / gain access to credit, e.g. buy now, pay later with credit cards 

- **last_fico_range_high & last_fico_range_low**: FICO scores upon last credit report pull by Lending Club (to decide on loan approval)

- 📍 **fico_range_low & fico_range_high**: Credit Rating of Borrower when they are approved for loan (used to assess credit risk)
  - 300–579	Poor
  - 580–669	Fair
  - 670–739	Good
  - 740–799	Very Good
  - 800–850	Excellent

- 📍 **inq_last_6mths**: Number of credit inquiries past 6 months 
  - Every time a borrower applies for a loan, the lender checks their credit report 
  - High number of credit inquiries = borrower has asked for money several times (possibly out of financial stress = higher risk of defaulting / delinquency)
  



- **open_acc**: How many credit accounts borrower has currently open (e.g. credit cards, loans, mortgages). Many credit accounts may mean over-reliance on credit and higher credit risk 


- 📍 **revol_bal**: Total amount borrower owes on revolving credit accounts (credit cards, lines of credit)
  - **Revolving Accounts**: Borrower can reborrow money up to the maximum credit limit with no end date (recurring borrowing)
  - **Installment Loan**: Borrower cannot reborrow unless they apply for a new loan (one-time borrowing) 
  - **Line of Credit**: Type of revolving credit. Types are as shown: 
    - **Credit Card**
    - **Personal Line of Credit**: Used to borrow from bank (for any loan)
    - **Home Equity Line of Credit (HELOC)**: Used for home improvements / bulk purchases (home as collateral)
    - **Business Line of Credit**: To help businesses with cash flow 

- 📍 **revol_util**: Revolving Line Utilisation Rate (How much credit is borrower currently using, with respect to total credit limit)

- **total_acc**: Total number of credit lines / accounts borrower has in credit file (opened/closed)

- **last_credit_pull_d**: The last date Lending Club checked borrower's credit report (monitor borrower, assess risk)


- 📍 **collections_12_mths_ex_med**: Number of debt collections in past 12 months, excluding medical bills 
  - **Collection**: Borrower fails to repay debt -> Lending Club sells debt to collection agency -> Collection agency acts as loan shark to get back money 
  - Collections can be a **strong signal for financial distress**, since it signals trouble of repaying future loans 
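
To make the ratios and score bands above concrete, here is a minimal sketch in plain Python. The band boundaries are the ones listed under `fico_range_low & fico_range_high`. The helper names (`fico_band`, `dti_pct`, `revol_util_pct`) and the example numbers are my own assumptions for illustration, not part of the dataset or of Lending Club's definitions.

```python
# Lower bound of each FICO band, in ascending order (bands as listed above)
FICO_BANDS = [
    (300, "Poor"),
    (580, "Fair"),
    (670, "Good"),
    (740, "Very Good"),
    (800, "Excellent"),
]


def fico_band(score: float) -> str:
    """Return the band label for a FICO score, or 'Out of range'."""
    if not 300 <= score <= 850:
        return "Out of range"
    label = FICO_BANDS[0][1]
    for lower_bound, name in FICO_BANDS:
        if score >= lower_bound:
            label = name  # keep the highest band whose lower bound is reached
    return label


def dti_pct(monthly_debt_payments: float, monthly_income: float) -> float:
    """Debt-to-income ratio in percent."""
    return 100 * monthly_debt_payments / monthly_income


def revol_util_pct(revol_bal: float, total_revolving_limit: float) -> float:
    """Share of the revolving credit limit currently in use, in percent."""
    return 100 * revol_bal / total_revolving_limit


# Hypothetical borrower, values chosen only for illustration
print(fico_band((660 + 664) / 2))      # midpoint of a fico_range_low/high pair
print(dti_pct(1_000, 5_000))           # 1,000 of debt payments on 5,000 income
print(revol_util_pct(3_000, 10_000))   # 3,000 owed against a 10,000 limit
```

Using the midpoint of `fico_range_low` and `fico_range_high` is one possible modelling choice; either bound on its own would work as well.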

##### Borrower Public Records & Delinquencies 
- **pub_rec_bankruptcies**: Number of bankruptcies listed as public records 

- **tax_liens**: Number of tax liens 

- **delinq_amt**: Total money currently owed on all delinquent accounts (It is time to pay money!)

- **acc_now_delinq**: Number of accounts borrower is currently delinquent (late on payments)

- **chargeoff_within_12_mths**: Number of accounts charged off (considered as loss/defaulted) by creditors in past 12 months 

- **num_accts_ever_120_pd**: Number of times borrower is 120+ days late on payment in their life (Very Late)

- **num_tl_30dpd**: Number of accounts currently 30 or more days past due (Updated past 2 months)

- **num_tl_90g_dpd_24m**: Number of accounts that have been 90 or more days past due in the last 24 months


- **num_tl_120dpd_2m**: Number of accounts that have been 120 or more days past due in the last 2 months.


- **mths_since_last_delinq**: Months since borrower's last delinquency (late payment)

- 📍 **mths_since_last_record**: Months since borrowers' last **public record**
  - **Public Record** refers to negative financial events officially recorded in legal documents (serious credit / legal issues). Types are shown below: 
    - **Bankruptcy**
    - **Tax Lien**: Government claim on property due to unpaid tax
    - **Civil Judgement**: Court ruling (lawsuits) that borrower must pay debt 

  - Empty records considered to be more stable 

- 📍 **pub_rec**: Number of derogatory public records (bankruptcies ...) the borrower has

- **mths_since_last_major_derog**


##### Loan Performance & Balance Changes of Lending Club 

The following features track how loan behaves after issuing. These features are crucial since they impact Lending Club profitability. 

- **out_prncp**: Remaining Outstanding Principal Sum (How much of original loan, excluding interest, is left to be paid back) 

- **out_prncp_inv**: (Same as Above - Investors portion)

- **total_pymnt**: Sum of all repayments made by Borrower so far 


- **total_pymnt_inv**: (Same as Above)

- **total_rec_int**: Total Interest Received (made by borrower) by Lending Club

- **total_rec_prncp**: Total Received Principal (excluding interest) by Lending Club

- **last_pymnt_d**: Date of most recent payment made by borrower 

- **last_pymnt_amnt**: Amount of most recent payment received by Lending Club from borrower 

- **next_pymnt_date**: Scheduled date for next payment for the borrower 

- **total_rec_late_fee**: Total late penalty fees borrower paid due to late payments 

- **recoveries**: Post Charge-Off Gross Recovery Payments 
  - Loan is charged off (marked as loss to investors due to borrower defaulting)
  - **Post Charge Off Gross Recovery Payments**: Total recovered money after loan was charged off 

- **collection_recovery_fee**: Cost Lending Club has to incur to recover money **after loan has defaulted**


##### Revolving Account Details 
- **open_rv_12m**
- **open_rv_24m**

- **max_bal_bc**
- **all_util**
- **total_rev_hi_lim**

- **bc_open_to_buy**
- **bc_util**
- **il_util**
- **num_actv_rev_tl**
- **num_rev_accts**
- **num_rev_tl_bal_gt_0**



##### Account Activity 

The following columns track borrower behaviour with credit accounts (opening/closing accounts, inquiries, delinquencies)

- **num_tl_op_past_12m**





##### Joint Applications

A joint application for a loan means shared responsibility for repaying the full amount, e.g. between couples, family members, friends or business partners. Joint applicants have better chances of loan approval if one applicant has a stronger credit history. Missed payments (delinquency) also affect the credit scores of everyone involved. 

The following features are related to joint applications. 

- **verification_status_joint**: Whether the combined income of the applicants is verified by the Lending Club platform (joint applications only)


##### Mortgage & Installment Accounts 


##### Hardship & Settlement Programs 
Hardship programs (loan-specific) temporarily help borrowers in trouble, e.g. via reduced payments or a lower interest rate. 

Settlement programs are permanent agreements where lender accepts less than full repayments. 

Borrowers in these programmes have higher risk and it is important to track them for credit risk modeling in banks. 

- **hardship_flag**: Borrower requested payment relief (Y / N)

- **hardship_type**: Reason of hardship (e.g., job loss, medical)

- **hardship_reason**: Detailed description of why the borrower needed help

- **hardship_status**: Current state (active/completed/approved)

- **hardship_amount**: New payment the borrower is allowed to make on this loan 

- **hardship_start_date/end_date**: Period of the hardship programme

- **payment_plan_start_date**: First reduced payment date 

- **hardship_length**: How long the hardship program lasts (in months)

- **hardship_dpd**: How late the borrower was in paying (in days) before getting hardship help 

- **hardship_loan_status**: Status of loan at the time borrower entered hardship plan (Loan is current [on-time]/ in grace / late)

- **orig_projected_additional_accrued_interest**: Estimated extra interest borrower will have to pay if relief is not granted

- **hardship_payoff_balance_amount**: Remaining balance of loan to be paid at relief start

- **hardship_last_payment_amount**: Amount of last payment the borrower made before relief starts 

- **debt_settlement_flag**: Borrower negotiated to pay less than owed

- **debt_settlement_flag_date**: Date settlement started

- **settlement_status**: Settlement progress

- **settlement_date**: Date settled

- **settlement_amount**: Final payment amount

- **settlement_percentage**: % of original debt paid

- **settlement_term**: Duration of settlement agreement

##### Aggregate Balances & Metrics 
The following columns are summary statistics of borrowers' overall credit health at the time of the loan listing. Such features are important for predicting repayment capacity of the borrower. 
- **tot_coll_amt**: Total debt in collections (extreme financial distress signaller!)

- **tot_cur_bal**: Total current debt across all accounts

- **avg_cur_bal**: Average balance per account

- **total_bal_ex_mort**: Total non-mortgage debt

- **tot_hi_cred_lim**: Total high credit / credit limit 

- **num_sats**: Number of satisfactory accounts

- **num_bc_sats**: Number of satisfactory bank card accounts 

- 📍 **pct_tl_nvr_dlq**: % trade lines (borrower credit accounts) never delinquent (never late on payment)

- **percent_bc_gt_75**: % of bank cards with > 75% limit used

- **total_cu_tl**: Number of credit union accounts / trade lines
  - **Credit Union**: Members pool their money to provide loans to each other (offering lower interest rates on loans)
  - Members of credit unions have lower risks due to higher financial stability 



#### 1.2 Examine Dataset 

In [0]:
df.columns

#### 1.3 Sorting Dataset
For credit risk modeling, banks use past loan data to predict future defaults / metrics. As such, we want our dataset to be sorted in **chronological order**, so that models are trained on older data and tested on newer data **(out-of-time split)**. 

The data should not be split randomly **(out-of-sample split)**, e.g. with `train_test_split` from `sklearn`, since credit risk modeling is a **time-series problem**.

Hence, I will be sorting the dataset right from the start. 
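
The out-of-time idea can be sketched in plain Python as below. This is only an illustration on a few hypothetical rows, assuming `issue_d` arrives as a `Mon-YYYY` string such as `Dec-2015`. The helper names `parse_issue_d` and `out_of_time_split` and the cutoff month are my own choices. Note that sorting the raw strings would be alphabetical (`Apr-2008` before `Dec-2007`), so the month has to be parsed into a date first.

```python
from datetime import datetime


def parse_issue_d(issue_d: str) -> datetime:
    """Parse a 'Mon-YYYY' string such as 'Dec-2015' (assumed format of issue_d).

    Note: '%b' relies on English month abbreviations in the active locale.
    """
    return datetime.strptime(issue_d, "%b-%Y")


def out_of_time_split(rows, cutoff: str):
    """Split rows so that training data is strictly older than the cutoff month."""
    cutoff_date = parse_issue_d(cutoff)
    ordered = sorted(rows, key=lambda r: parse_issue_d(r["issue_d"]))
    train_rows = [r for r in ordered if parse_issue_d(r["issue_d"]) < cutoff_date]
    test_rows = [r for r in ordered if parse_issue_d(r["issue_d"]) >= cutoff_date]
    return train_rows, test_rows


# Hypothetical loans, not taken from the dataset
loans = [
    {"id": "3", "issue_d": "Jan-2018"},
    {"id": "1", "issue_d": "Dec-2015"},
    {"id": "2", "issue_d": "Mar-2017"},
]
train_rows, test_rows = out_of_time_split(loans, cutoff="Jan-2018")
print([r["id"] for r in train_rows])
print([r["id"] for r in test_rows])
```

On the Spark DataFrame itself, the same split would amount to filtering on the parsed issue date, with every training row older than every test row.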



In [0]:
unique_issue_d = df.select("issue_d").distinct()
display(unique_issue_d)

In [0]:
# df.sort(col('id').asc()).display() # or .desc() 

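# issue_d is loaded as a string such as "Dec-2015" (inferSchema is off), so sorting on it
# directly would be alphabetical. Parse it to a date (assumed format: MMM-yyyy) to sort chronologically.
from pyspark.sql.functions import to_date
df.sort(to_date(col("issue_d"), "MMM-yyyy").asc(), col("id").asc()).display()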
