
This notebook is part of the **Exploratory Data Analysis (EDA)** phase. EDA of the Lending Club dataset is carried out in 3 main phases: **Data Understanding, Data Cleaning and Feature Engineering**. 

This notebook is dedicated to **Data Understanding**. By the end of this notebook, I should be able to:
  - Understand the **data dictionary terminology**
  - Identify the **more critical columns**








The dataset is from a [Kaggle Dataset](https://www.kaggle.com/datasets/wordsforthewise/lending-club), which has records of all loans issued to borrowers between 2007 and 2018 Q4. I will now conduct Exploratory Data Analysis to get a sense of the dataset before proceeding further. 



In [1]:
# Import function to start Spark
from init_spark import start_spark
spark = start_spark()


25/06/23 12:00:35 WARN Utils: Your hostname, Chengs-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.5 instead (on interface en0)
25/06/23 12:00:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


KeyboardInterrupt: 

:: loading settings :: url = jar:file:/Users/lunlun/Downloads/Github/Credit-Risk-Modeling-PySpark/venv/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/lunlun/.ivy2/cache
The jars for the packages stored in: /Users/lunlun/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b0d79e62-55de-488a-9b47-39b10b8a76f6;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.1.0 in central
	found io.delta#delta-storage;3.1.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 73ms :: artifacts dl 2ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.1.0 from central in [default]
	io.delta#delta-storage;3.1.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0

In [None]:
# I will need the following code to initialise Spark Session in standalone Python notebooks
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("Data Cleaning").getOrCreate()


# File Endpoint and Format 
file_location = "../data/accepted_2007_to_2018Q4.csv"
file_type = "csv"

# CSV options
infer_schema = "true" # infer column types instead of reading every column as string 
first_row_is_header = "true"
delimiter = ","

# Reading a CSV file 
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

df.limit(10).toPandas()

25/06/23 11:37:31 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,68407277,,3600.0,3600.0,3600.0,36 months,13.99,123.03,C,C4,...,,,Cash,N,,,,,,
1,68355089,,24700.0,24700.0,24700.0,36 months,11.99,820.28,C,C1,...,,,Cash,N,,,,,,
2,68341763,,20000.0,20000.0,20000.0,60 months,10.78,432.66,B,B4,...,,,Cash,N,,,,,,
3,66310712,,35000.0,35000.0,35000.0,60 months,14.85,829.9,C,C5,...,,,Cash,N,,,,,,
4,68476807,,10400.0,10400.0,10400.0,60 months,22.45,289.91,F,F1,...,,,Cash,N,,,,,,
5,68426831,,11950.0,11950.0,11950.0,36 months,13.44,405.18,C,C3,...,,,Cash,N,,,,,,
6,68476668,,20000.0,20000.0,20000.0,36 months,9.17,637.58,B,B2,...,,,Cash,N,,,,,,
7,67275481,,20000.0,20000.0,20000.0,36 months,8.49,631.26,B,B1,...,,,Cash,N,,,,,,
8,68466926,,10000.0,10000.0,10000.0,36 months,6.49,306.45,A,A2,...,,,Cash,N,,,,,,
9,68616873,,8000.0,8000.0,8000.0,36 months,11.48,263.74,B,B5,...,,,Cash,N,,,,,,


## 1. Data Dictionary 

The following explains what the features in the dataset mean. Some features carry additional notes to facilitate my understanding of the column. The original data dictionary is obtained from [Figshare](https://figshare.com/articles/dataset/Lending_club_dataset_description/20016077). 

Below is the data dictionary I have compiled, which is more beginner-friendly for aspiring credit risk modellers. I have grouped similar columns together to understand the dataset better, instead of just staring at seemingly random columns. 📍 Red pin emojis flag the columns that seem more critical for credit risk modeling. 

### 1.1 Loan Details 

- **id**: Unique loan listing / request identifier (not yet funded / issued)
- **member_id**: Unique borrower ID (empty for every row in this dataset)
- **loan_amnt**: Amount requested by borrower
- **issue_d**: Month-Year the loan is funded/issued to borrower 

- **funded_amnt**: Total amount of money funded to the borrower (may be partially / fully funded)
- **funded_amnt_inv**: Total amount of money funded to borrower by **investors** (usually equal or close to `funded_amnt`)
  - This column exists, because in **investors' dataset**, investors can see individual contributions to the borrower's requested amount 
- **term**: Loan term (months)
- **int_rate**: Interest rate
- **installment**: Monthly payment upon approved loan 

- **grade vs sub_grade**: 
  - Grade: Broad Risk Category (A,B)
  - Sub_Grade: A1,B2 (Specific)

- **purpose**: Reason for borrowing money

- **title**: Similar to `purpose`

- **desc**: Loan description (provided by borrower, like a Carousel Listing)

- **url**: Link to loan listing 

- **initial_list_status**: Lending Club determines how a loan will be funded by investors based on risk level; investors then see these listings on the platform
  - `f`: Fractional loans (Many investors fund small parts)
  - `w`: Whole loans (Offered to 1 investor who funds everything)

- **disbursement_method**: How loan funding was delivered to borrower
  - **Cash**: Funds directly deposited to borrower's bank account 
  - **Direct_Pay**: Lending Club pays debts for you (Use new personal loan to pay off old debts, at lower interest rates)

- **policy_code**:
  - `1`: Publicly Available Product (Loan is public in Lending Club Platform)
  - `2`: Private Loan Product (Loan is a special private offering, e.g. for certain investors only)

- **pymnt_plan**: Whether a payment plan is in place for the borrower to catch up on missed payments 
  - `n`: No payment plan

- **application_type**: Single / Joint loan application 

- 📍 **loan_status**: Actual status of the loan (e.g., Current, Fully Paid, Charged Off) -> Target Variable for most credit risk models 
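
As a sanity check on how `loan_amnt`, `int_rate`, `term` and `installment` relate, the sketch below applies the standard annuity formula for a fully amortising loan. That Lending Club computes `installment` this way is an assumption on my part, not something the data dictionary states, and the helper names `monthly_installment` and `parse_term` are hypothetical.

```python
def monthly_installment(principal: float, annual_rate_pct: float, term_months: int) -> float:
    """Level monthly payment of a fully amortising loan (standard annuity formula)."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate as a fraction
    if r == 0:
        return principal / term_months
    return principal * r / (1 - (1 + r) ** -term_months)


def parse_term(term: str) -> int:
    """Turn a raw `term` string such as '36 months' into the integer 36."""
    return int(term.strip().split()[0])


# First row of the preview: loan_amnt=3600, int_rate=13.99, term='36 months'
payment = monthly_installment(3600.0, 13.99, parse_term("36 months"))
print(round(payment, 2))
```

For the first preview row the result lands within a few cents of the listed `installment` of 123.03, which suggests the assumption is reasonable.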


### 1.2 Borrower Demographics
- **emp_title**: Borrower’s job title
- **emp_length**: Years in current job
- **home_ownership**: Home ownership status (rented / owned)
- **annual_inc**: Borrower’s annual income
- **verification_status**: Whether the **individual borrower's income** is verified by the Lending Club platform (comparable to verification via Singpass in the Singapore context)
- **zip_code**/**addr_state**: U.S. ZIP code / state of residence (`CA`: California)

### 1.3 Borrower Credit History & Scores
- 📍 **dti**: Debt-to-Income Ratio (20% DTI means 20% of the borrower's income goes to debt payments) (Higher DTI is a red flag 🚩)

- **delinq_2yrs**: Number of times borrower was 30+ days late on payments (delinquency: missed payments) in the past 2 years

- **earliest_cr_line**: Earliest Credit Line (Date borrower opened his earliest credit account) 
  - Open credit account usually means you start borrowing money / gain access to credit, e.g. buy now, pay later with credit cards 

- **last_fico_range_high & last_fico_range_low**: Upper and lower bounds of the borrower's FICO range at Lending Club's most recent credit report pull

- 📍 **fico_range_low & fico_range_high**: Lower and upper bounds of the borrower's FICO range at loan origination (used to assess credit risk)
  - 300–579: Poor
  - 580–669: Fair
  - 670–739: Good
  - 740–799: Very Good
  - 800–850: Excellent

- 📍 **inq_last_6mths**: Number of credit inquiries past 6 months 
  - Every time a borrower applies for a loan, the lender (e.g. Lending Club) checks the borrower's credit report  
  - A high number of credit inquiries = borrower attempted to borrow money multiple times, possibly out of financial stress = higher risk of default / delinquency

- **inq_last_12m**: Number of credit inquiries on the borrower in the past 12 months (to assess borrower risk level)

- **inq_fi**: Number of personal finance inquiries on the borrower's credit report

- **mths_since_recent_inq**: Months since the most recent credit inquiry on the borrower 


- **open_acc**: How many credit accounts borrower has currently open (e.g. credit cards, loans, mortgages). Many credit accounts may mean over-reliance on credit and higher credit risk 


- 📍 **revol_bal**: Total amount borrower owes on revolving credit accounts (credit cards, lines of credit)
  - **Revolving Accounts**: Borrower can reborrow money up to the max credit limit with no end-date (repeatable borrowing)
  - **Installment Loan**: Borrower cannot reborrow unless he applies for a new loan (one-off borrowing)
  - **Line of Credit**: Type of revolving credit. Types are as shown: 
    - **Credit Card**
    - **Personal Line of Credit**: Used to borrow from bank (for any loan)
    - **Home Equity Line of Credit (HELOC)**: Used for home improvements / bulk purchases (home as collateral)
    - **Business Line of Credit**: To help businesses with cash flow 


- **all_util**: Utilisation rate (% of credit used) across all credit lines (revolving and installment)


- 📍 **revol_util**: Revolving Line Utilisation Rate (How much credit is borrower currently using, with respect to total credit limit in revolving credit accounts)

- **total_acc**: Total number of credit lines / accounts borrower has in credit file (opened/closed)

- **last_credit_pull_d**: The last date Lending Club checked the borrower's credit report (monitor borrower, assess risk)


- 📍 **collections_12_mths_ex_med**: Number of debt collections in past 12 months, excluding medical bills 
  - **Collection**: Borrower fails to repay debt -> creditor hands or sells the debt to a collection agency -> collection agency pursues the borrower to get the money back 
  - Collections can be a **strong signal of financial distress**, since they point to trouble repaying future loans 
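
The ratios and score bands in this section can be made concrete with a small sketch. The input figures are made up for illustration, the function names are hypothetical, and the simplified DTI definition (monthly debt payments over monthly income) is an assumption, since Lending Club's exact calculation may exclude some obligations.

```python
# FICO bands as listed in the data dictionary above
FICO_BANDS = [
    (300, 579, "Poor"),
    (580, 669, "Fair"),
    (670, 739, "Good"),
    (740, 799, "Very Good"),
    (800, 850, "Excellent"),
]


def dti_pct(monthly_debt_payments: float, annual_income: float) -> float:
    """Debt-to-income ratio in percent (simplified definition, see assumptions)."""
    return 100 * monthly_debt_payments / (annual_income / 12)


def revol_util_pct(revol_bal: float, revolving_limit: float) -> float:
    """Share of the total revolving credit limit currently in use, in percent."""
    return 100 * revol_bal / revolving_limit


def fico_band(fico_range_low: float, fico_range_high: float) -> str:
    """Label a borrower by the midpoint of the reported FICO range."""
    midpoint = (fico_range_low + fico_range_high) / 2
    for lower, upper, label in FICO_BANDS:
        # upper + 1 keeps fractional midpoints such as 579.5 inside a band
        if lower <= midpoint < upper + 1:
            return label
    return "Out of range"


print(dti_pct(1000.0, 60000.0))           # 1,000 of 5,000 monthly income
print(revol_util_pct(2500.0, 10000.0))    # 2,500 used out of a 10,000 limit
print(fico_band(675, 679))
```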

### 1.4 Borrower Public Records & Delinquencies 

**Public Records** refer to negative financial events officially recorded in legal documents (serious credit / legal issues) ⚠️.

Types are shown below: 
  - **Bankruptcy**
  - **Tax Lien**: Government claim on property due to unpaid tax
  - **Civil Judgement**: Court ruling (lawsuits) that borrower must pay debt 

- Borrowers with no public records are considered more stable 

The following columns relate to public records & delinquencies, and may reveal higher credit risk: 
- **pub_rec_bankruptcies**: Number of bankruptcies listed as public records 

- **tax_liens**: Number of tax liens 

- **delinq_amnt**: Total money currently owed on all delinquent accounts (It is time to pay money!)

- **acc_now_delinq**: Number of accounts on which the borrower is currently delinquent (late on payments)

- **chargeoff_within_12_mths**: Number of accounts charged off (written off as a loss by creditors after default) in the past 12 months 

- **num_accts_ever_120_pd**: Number of accounts that were ever 120+ days past due (Very Late)

- **num_tl_30dpd**: Number of accounts currently 30 or more days past due (Updated in the past 2 months)

- **num_tl_90g_dpd_24m**: Number of accounts that have been 90 or more days past due in the last 24 months


- **num_tl_120dpd_2m**: Number of accounts that have been 120 or more days past due in the last 2 months.


- **mths_since_last_delinq**: Months since borrower's last delinquency (late payment)

- **mths_since_recent_revol_delinq**: Months since most recent revolving delinquency (late payment)


- 📍 **mths_since_last_record**: Months since the borrower's last **public record**

- 📍 **pub_rec**: Number of derogatory public records (bankruptcies ...) the borrower has

- 📍 **mths_since_last_major_derog**: The number of months since the borrower's most recent major derogatory event on their credit report, specifically a delinquency of 90 days or worse

- **mths_since_recent_bc_dlq**: Months since most recent bank card delinquency (late payment)
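
One way these columns could later feed a model is as simple yes/no flags. The sketch below is only an illustration: the function name and input values are hypothetical, and reading a missing `mths_since_last_delinq` as "never delinquent" is an assumption that should be checked against the data before it is relied on.

```python
from typing import Optional


def derogatory_flags(
    pub_rec: int,
    pub_rec_bankruptcies: int,
    tax_liens: int,
    mths_since_last_delinq: Optional[float],
) -> dict:
    """Collapse public-record and delinquency counts into boolean risk flags."""
    return {
        "has_public_record": pub_rec > 0,
        "has_bankruptcy": pub_rec_bankruptcies > 0,
        "has_tax_lien": tax_liens > 0,
        # Assumption: a missing value means no delinquency was ever recorded
        "ever_delinquent": mths_since_last_delinq is not None,
    }


clean_borrower = derogatory_flags(0, 0, 0, None)
risky_borrower = derogatory_flags(2, 1, 0, 7.0)
print(clean_borrower)
print(risky_borrower)
```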



### 1.5 Loan Performance & Balance Changes of Lending Club 

The following features track how a loan behaves after it is issued. **These features are crucial since they impact Lending Club's profitability.**

- **out_prncp**: Remaining Outstanding Principal Sum (How much of original loan, excluding interest, is left to be paid back) 

- **out_prncp_inv**: (Same as above, investors' portion)

- **total_pymnt**: Sum of all repayments made by borrower so far (includes interest)

- **total_pymnt_inv**: (Same as above, investors' portion)

- **total_rec_int**: Total interest Lending Club has received from the borrower to date

- **total_rec_prncp**: Total Received Principal (excluding interest) by Lending Club

- **last_pymnt_d**: Date of most recent payment made by borrower 

- **last_pymnt_amnt**: Amount of most recent payment received by Lending Club from borrower 

- **next_pymnt_d**: Scheduled date of the borrower's next payment 

- **total_rec_late_fee**: Total late penalty fees borrower paid till date

- 📍 **recoveries**: Post Charge-Off Gross Recovery Payments 
  - Loan is charged off (marked as loss to investors due to borrower defaulting)
  - **Post Charge Off Gross Recovery Payments**: Total recovered money after loan was charged off 

- 📍 **collection_recovery_fee**: Cost Lending Club has to incur to recover money **after loan has defaulted**
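
To see how `recoveries` and `collection_recovery_fee` could combine into a loss figure for a charged-off loan, here is a minimal sketch. The definitions used (exposure as unpaid principal, loss as one minus the net recovery rate) are simplifying assumptions for illustration rather than Lending Club's or any regulator's method, and the loan figures are made up.

```python
def exposure_at_default(funded_amnt: float, total_rec_prncp: float) -> float:
    """Principal still unpaid when the loan was charged off (simplified)."""
    return max(funded_amnt - total_rec_prncp, 0.0)


def net_recovery(recoveries: float, collection_recovery_fee: float) -> float:
    """Money recovered after charge-off, net of the cost of collecting it."""
    return recoveries - collection_recovery_fee


def loss_given_default(
    funded_amnt: float,
    total_rec_prncp: float,
    recoveries: float,
    collection_recovery_fee: float,
) -> float:
    """Fraction of the exposure that is ultimately lost, clipped to [0, 1]."""
    ead = exposure_at_default(funded_amnt, total_rec_prncp)
    if ead == 0:
        return 0.0
    lgd = 1 - net_recovery(recoveries, collection_recovery_fee) / ead
    return min(max(lgd, 0.0), 1.0)


# Hypothetical charged-off loan: 10,000 funded, 4,000 principal repaid,
# 1,500 recovered afterwards at a collection cost of 300
lgd = loss_given_default(10000.0, 4000.0, 1500.0, 300.0)
print(round(lgd, 4))
```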



### 1.6 Account Activity 

The following columns track borrower behaviour with credit accounts (opening/closing accounts, inquiries, delinquencies)

- **num_tl_op_past_12m**: Number of credit accounts (revolving and long-term accounts) opened in past 12 months 

- **open_acc_6m**: Number of the borrower's open trade lines (credit accounts) in the last 6 months

- **open_act_il**: Number of currently active installment trades (tradelines / accounts)

- **open_il_12m**: Number of installment accounts opened in past 12 months

- **open_il_24m**: Number of installment accounts opened in past 24 months

- **mths_since_rcnt_il**: Months since the most recent installment account was opened

- **total_bal_il**: Total current balance of all installment accounts


- **acc_open_past_24mths**: Number of trades/accounts opened in past 24 months

- **mo_sin_old_il_acct**: Months since oldest bank installment account opened

- **mo_sin_old_rev_tl_op**: Months since oldest revolving account opened

- **mo_sin_rcnt_rev_tl_op**: Months since most recent revolving account opened

- **mo_sin_rcnt_tl**: Months since most recent credit account opened



### 1.7 Joint Applications

A joint application for a loan means shared responsibility for repaying the full amount, e.g. between couples, family members, friends or business partners. Joint applicants have better chances of loan approval if one applicant has a stronger credit history. Missed payments (delinquency) also affect the credit scores of everyone involved. 

The following features are related to joint applications. 

- **verification_status_joint**: Whether the applicants' combined income is verified by the Lending Club platform (joint applications only)

- **annual_inc_joint**: The combined annual income provided by the co-borrowers during registration

- **dti_joint**: Borrowers' combined debt-income ratio

- **revol_bal_joint** : Sum of revolving credit balance of the borrowers

- **sec_app_earliest_cr_line** : For joint loans, this shows the co-borrower's oldest credit account date.

- **sec_app_fico_range_low/high**: Secondary-borrower’s FICO score range

- **sec_app_open_acc**: Number of open credit accounts of co-borrower

- **sec_app_mort_acc**: Number of co-borrower's mortgage accounts



### 1.8 Revolving Account Details 
To recap, revolving credit accounts are accounts from which borrowers can reborrow, up to the max credit limit, with no end-date. Examples include credit cards and lines of credit (funds you can draw on and repay repeatedly, with interest).

- **open_rv_12m**: Number of revolving credit accounts borrower opened in the last 12 months

- **open_rv_24m**: (Same as above, but in the past 24 months)

- **max_bal_bc**: Maximum current balance owed on all revolving credit accounts 

- **total_rev_hi_lim**: Total credit limit on all revolving accounts 

- **bc_open_to_buy**: Amount of available credit left to spend on all bank cards (revolving)

- **bc_util**: Ratio of total current balance to credit limit for all bank card accounts 

- **num_actv_rev_tl**: Number of active revolving trade lines (revolving credit accounts) the borrower currently has 

- **num_rev_accts**: Number of revolving credit accounts borrower ever had 

- **num_rev_tl_bal_gt_0**: Number of revolving trade lines (credit accounts) with balance > 0 

- **total_bc_limit**: Total credit limit across all bank cards

- **mths_since_recent_bc**: Months since newest bank card account opened 
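
The bank card columns above are closely related, and the sketch below shows one plausible way they could be derived from per-card balances and limits. The card figures are invented, the function name is hypothetical, and the exact formulas used by the credit bureau are an assumption.

```python
from typing import Dict, List, Tuple


def bank_card_metrics(cards: List[Tuple[float, float]]) -> Dict[str, float]:
    """Summarise a list of (balance, limit) pairs, one pair per bank card.

    Assumes at least one card and a positive limit on every card.
    """
    total_balance = sum(balance for balance, _ in cards)
    total_limit = sum(limit for _, limit in cards)
    heavily_used = sum(1 for balance, limit in cards if balance / limit > 0.75)
    return {
        "total_bc_limit": total_limit,
        "bc_open_to_buy": total_limit - total_balance,      # credit left to spend
        "bc_util": 100 * total_balance / total_limit,       # % of limit in use
        "percent_bc_gt_75": 100 * heavily_used / len(cards),
    }


# Three hypothetical cards: (current balance, credit limit)
cards = [(900.0, 1000.0), (500.0, 4000.0), (2600.0, 5000.0)]
metrics = bank_card_metrics(cards)
print(metrics)
```

Only the first card is above 75% of its limit, so a borrower can show a moderate overall `bc_util` while still having one nearly maxed-out card.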


### 1.9 Mortgage & Installment Accounts 
The following columns are related to **long-term loans** such as mortgages and auto loans. A mortgage is a type of loan used to purchase real estate (house or land), in which the property serves as collateral (lender seizes the property if the borrower fails to repay). An auto loan is a loan taken to purchase a car. 

These accounts differ from revolving accounts (credit cards / lines of credit) since they are meant for the long term and are generally considered lower risk. 

- **mort_acc**: Number of mortgage accounts

- **num_il_tl**: Number of installment loans trade lines (credit accounts), e.g. mortgage, auto-loans, student loans

- **num_bc_tl**: Number of bankcard credit accounts 

- **total_il_high_credit_limit**: Sum of the highest amounts ever borrowed (high credit / credit limit) across all installment loans 


- 📍 **tot_hi_cred_lim**: Sum of ...
  - Highest Credit Limit (Revolving Accounts)
  - Original Loan Amounts (Installment Loans) across all of borrower's credit accounts 
  - Represents **total maximum amount of credit ever made available to borrower by all lenders = Potential borrowing capacity**


- **il_util**: Ratio of total current balance to high credit/credit limit on all installment accounts 


### 1.10 Hardship & Settlement Programs 
Hardship programs (Loan-Specific) are to temporarily help borrowers in trouble, via payment reduction / lower interest rate etc. 

Settlement programs are permanent agreements where lender accepts less than full repayments. 

**Borrowers in these programmes have higher risk** ⚠️ and it may be important to track them for credit risk modeling in banks. 

- **hardship_flag**: Borrower requested payment relief (Y / N)

- **hardship_type**: Type of hardship plan offered to the borrower

- **hardship_reason**: Detailed description of why the borrower needed help

- **hardship_status**: Current state (active/completed/approved)

- **hardship_amount**: New payment the borrower is allowed to make on this loan 

- **hardship_start_date/end_date**: Start and end dates of the hardship programme

- **payment_plan_start_date**: First reduced payment date 

- **hardship_length**: How long the hardship program lasts (in months)

- **hardship_dpd**: How late the borrower was in paying (in days) before getting hardship help 

- **hardship_loan_status**: Status of loan at the time borrower entered hardship plan (Loan is current [on-time]/ in grace / late)

- **orig_projected_additional_accrued_interest**: Projected additional interest that accrues under the hardship payment plan, estimated as of the hardship start date

- **hardship_payoff_balance_amount**: Remaining balance of loan to be paid upon start of relief 

- **hardship_last_payment_amount**: Amount of last payment the borrower made before relief starts 

- **deferral_term**: Allowed to skip payments for n mths for an approved hardship plan 

- **debt_settlement_flag**: Borrower negotiated to pay less than owed (Y/N)

- **debt_settlement_flag_date**: Date debt settlement status is set

- **settlement_status**: Settlement progress (In Progress / Completed)

- **settlement_date**: Date debt settlement is finalised (agreed on borrower and lender side)

- **settlement_amount**: Final payment amount borrower agrees to pay for the debt settlement (less than what is originally owed)

- **settlement_percentage**: % of original debt paid

- **settlement_term**: Duration of settlement agreement (months)

### 1.11 Aggregate Balances & Metrics 
The following columns are summary statistics of borrowers' overall credit health at the time of the loan listing. A high credit balance is not inherently bad; it depends on the borrowing behaviour of the borrower. 

- **tot_coll_amt**: Total debt in collections (extreme financial distress signaller!)

- **tot_cur_bal**: Total current balance across all accounts (Spend money using credit = Balance increases)

- **avg_cur_bal**: Average balance per account

- **total_bal_ex_mort**: Total non-mortgage debt

- **tot_hi_cred_lim**: Total high credit and/or credit limit (revolving + original loan amount)

- **num_sats**: Number of satisfactory credit accounts

- **num_bc_sats**: Number of satisfactory bank card accounts 

- 📍 **pct_tl_nvr_dlq**: % trade lines (borrower credit accounts) never delinquent (never late on payment)

- **percent_bc_gt_75**: % of bank cards with > 75% limit used

- **total_cu_tl**: Number of Credit Union accounts / trade lines 
  - **Credit Union**: Members pool their money to provide loans to each other (offering lower interest rates on loans)
  - Members of credit unions may carry lower credit risk, on the assumption of higher financial stability 

After sorting out the data dictionary, I have identified some variables that can possibly aid us in predicting Loss Given Default (LGD), Exposure at Default (EAD) and Probability of Default (PD) for credit risk modeling. They include: 

- **Borrower Demographic Data** 
- **Public Record & Delinquency Data** 
- **Loan Performance Data**
- **FICO Data** 
- **Hardship & Settlement Data**
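
For context on why these three quantities matter together: in standard credit risk practice the expected loss on a loan is the product of the probability of default (PD), the loss given default (LGD) and the exposure at default (EAD). The sketch below applies that formula to a made-up two-loan portfolio; the `LoanRisk` class and all figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoanRisk:
    pd: float   # probability of default, between 0 and 1
    lgd: float  # share of the exposure lost if the borrower defaults
    ead: float  # amount outstanding at the time of default

    def expected_loss(self) -> float:
        """Expected loss = PD x LGD x EAD."""
        return self.pd * self.lgd * self.ead


def portfolio_expected_loss(loans: List[LoanRisk]) -> float:
    """Expected losses simply add up across loans."""
    return sum(loan.expected_loss() for loan in loans)


portfolio = [
    LoanRisk(pd=0.05, lgd=0.8, ead=10000.0),  # low-risk, larger loan
    LoanRisk(pd=0.20, lgd=0.6, ead=5000.0),   # high-risk, smaller loan
]
total_el = portfolio_expected_loss(portfolio)
print(round(total_el, 2))
```

In this example the smaller but riskier loan contributes more expected loss than the larger one, which is why all three components need to be modelled.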


## 2. Examine Dataset 

In this section, I will get a general sense of the dataset, e.g. its dimensions, columns and the data type of each column. We will then proceed to the **Medallion Architecture Data Cleaning Pipeline** step in the next notebook. 


In [None]:
df.limit(10).toPandas()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,68407277,,3600.0,3600.0,3600.0,36 months,13.99,123.03,C,C4,...,,,Cash,N,,,,,,
1,68355089,,24700.0,24700.0,24700.0,36 months,11.99,820.28,C,C1,...,,,Cash,N,,,,,,
2,68341763,,20000.0,20000.0,20000.0,60 months,10.78,432.66,B,B4,...,,,Cash,N,,,,,,
3,66310712,,35000.0,35000.0,35000.0,60 months,14.85,829.9,C,C5,...,,,Cash,N,,,,,,
4,68476807,,10400.0,10400.0,10400.0,60 months,22.45,289.91,F,F1,...,,,Cash,N,,,,,,
5,68426831,,11950.0,11950.0,11950.0,36 months,13.44,405.18,C,C3,...,,,Cash,N,,,,,,
6,68476668,,20000.0,20000.0,20000.0,36 months,9.17,637.58,B,B2,...,,,Cash,N,,,,,,
7,67275481,,20000.0,20000.0,20000.0,36 months,8.49,631.26,B,B1,...,,,Cash,N,,,,,,
8,68466926,,10000.0,10000.0,10000.0,36 months,6.49,306.45,A,A2,...,,,Cash,N,,,,,,
9,68616873,,8000.0,8000.0,8000.0,36 months,11.48,263.74,B,B5,...,,,Cash,N,,,,,,


In [None]:
# Inspect the columns in the Lending Club dataset and their respective data types 
# (PySpark counterpart of df.columns and df.dtypes in pandas)

df.printSchema() # double is a 64-bit float type

root
 |-- id: string (nullable = true)
 |-- member_id: string (nullable = true)
 |-- loan_amnt: double (nullable = true)
 |-- funded_amnt: double (nullable = true)
 |-- funded_amnt_inv: double (nullable = true)
 |-- term: string (nullable = true)
 |-- int_rate: double (nullable = true)
 |-- installment: double (nullable = true)
 |-- grade: string (nullable = true)
 |-- sub_grade: string (nullable = true)
 |-- emp_title: string (nullable = true)
 |-- emp_length: string (nullable = true)
 |-- home_ownership: string (nullable = true)
 |-- annual_inc: string (nullable = true)
 |-- verification_status: string (nullable = true)
 |-- issue_d: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- pymnt_plan: string (nullable = true)
 |-- url: string (nullable = true)
 |-- desc: string (nullable = true)
 |-- purpose: string (nullable = true)
 |-- title: string (nullable = true)
 |-- zip_code: string (nullable = true)
 |-- addr_state: string (nullable = true)
 |-- dti: string 


From the output above, several columns have unexpected `dtypes`. At a glance, `id` is not `int`, and `term`, `annual_inc` and `dti` are in `string` form. These shall be cast to the correct data types after missing value imputation, since casting values that contain special characters may lead to unexpected missing values, affecting **data integrity**. 
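
To illustrate how a cast can silently create missing values, the plain-Python sketch below mimics a lenient cast that returns `None` for anything it cannot parse. Spark's default (non-ANSI) `cast` is understood to behave similarly by returning null, though that should be confirmed for the Spark version in use. The helper name is hypothetical; the non-numeric string is the value that shows up as the `max` of `id` in the summary statistics at the end of this notebook.

```python
from typing import List, Optional


def to_int_or_none(value: Optional[str]) -> Optional[int]:
    """Lenient cast: unparseable strings become None instead of raising."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


raw_ids: List[Optional[str]] = [
    "68407277",
    "68355089",
    None,                                               # genuinely missing
    "Total amount funded in policy code 2: 873652739",  # footer text, not an id
]

cast_ids = [to_int_or_none(v) for v in raw_ids]

# Count how many missing values the cast itself introduced
nulls_before = sum(v is None for v in raw_ids)
nulls_after = sum(v is None for v in cast_ids)
new_nulls = nulls_after - nulls_before
print(cast_ids, new_nulls)
```

Comparing null counts before and after a cast is a cheap way to spot rows like the footer line before they are mistaken for ordinary missing data.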

In [None]:
# Check df.shape of Lending Club dataset 

# Get number of rows
num_rows = df.count()

# Get number of columns
num_cols = len(df.columns)

# Print shape (rows, columns)
print(f"Shape: ({num_rows} rows, {num_cols} columns)")



Shape: (2260701 rows, 151 columns)


                                                                                

In [None]:
# Summary Statistics of Lending Club Datasets 
df.summary().toPandas()


25/06/23 11:39:23 WARN DAGScheduler: Broadcasting large task binary with size 1742.7 KiB
                                                                                

Unnamed: 0,summary,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,count,2260701,0.0,2260668.0,2260668.0,2260668.0,2260668,2260668.0,2260668.0,2260668,...,10918,10918,2260413,2260475,34348,34331,34308,34284,34270,34266
1,mean,8.032205972323003E7,,15046.931227849467,15041.664056818605,15023.437745306326,,13.092829115115324,445.80682288154975,,...,11636.883941559037,193.9943207840982,,,,,,5010.623941597316,47.78073820995767,13.191706818513651
2,stddev,4.4985611312901564E7,,9190.245488232757,9188.413022381976,9192.331678793576,,4.832138364571108,267.1735346084259,,...,7625.98828115293,198.6294958183679,,,,,,3693.1689736660164,7.311602282195859,8.159787347869285
3,min,1000007,,500.0,500.0,0.0,36 months,5.31,4.93,A,...,10008.88,0.01,Cash,Cash,Apr-2013,ACTIVE,Apr-2013,1000.0,0.2,0.0
4,25%,4.4935532E7,,8000.0,8000.0,8000.0,,9.49,251.65,,...,5626.82,44.44,,,,,,2208.0,45.0,6.0
5,50%,8.451392E7,,12900.0,12875.0,12800.0,,12.62,377.95,,...,10028.34,133.16,,,,,,4146.0,45.0,14.0
6,75%,1.2235089E8,,20000.0,20000.0,20000.0,,15.99,593.31,,...,16151.89,284.18,,,,,,6850.0,50.0,18.0
7,max,Total amount funded in policy code 2: 873652739,,40000.0,40000.0,40000.0,60 months,30.99,1719.83,G,...,N,N,N,Y,Sep-2018,N,Sep-2018,N,Y,N
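
One caveat when reading the summary above: for `string` columns, `min` and `max` appear to be compared as text, so month-year values such as `Apr-2013` and `Sep-2018` are ordered alphabetically rather than chronologically. The sketch below contrasts the two orderings on made-up dates; `parse_month_year` is a hypothetical helper.

```python
from typing import Tuple

MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_month_year(value: str) -> Tuple[int, int]:
    """Turn 'Sep-2018' into (2018, 9) so that tuples sort chronologically."""
    month, year = value.split("-")
    return int(year), MONTHS[month]


dates = ["Jan-2018", "Sep-2015", "Apr-2016"]

lexicographic_max = max(dates)                        # plain string comparison
chronological_max = max(dates, key=parse_month_year)  # compares (year, month)

print(lexicographic_max)   # Sep-2015
print(chronological_max)   # Jan-2018
```

Date columns therefore need to be parsed into a proper date type before their minimum and maximum can be trusted.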
