# Data Acquisition and Preprocessing

This notebook handles the acquisition, cleaning, and preprocessing of the Lending Club loan dataset. The dataset contains loan information from 2007 to 2020 Q1, including borrower characteristics, loan terms, and payment status.

## Objectives
- Download and extract the raw Lending Club dataset
- Convert string columns to appropriate numeric, datetime, and categorical types
- Rename columns and document them in a data catalog
- Save the processed dataset in Parquet format

## 1. Data Download

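Download the Lending Club dataset from Kaggle and extract the raw CSV file. The commands below are commented out; uncomment them for the first run. Conversion to the more efficient Parquet format happens in Section 5.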

In [None]:
# !mkdir -p ../data
# !curl -L -o ../data/lending_club.zip "https://www.kaggle.com/api/v1/datasets/download/ethon0426/lending-club-20072020q1"

# !unzip -p ../data/lending_club.zip Loan_status_2007-2020Q3.gzip > ../data/raw_lending_club.csv
# !rm ../data/lending_club.zip

## 2. Initial Data Inspection

Load and examine the raw dataset to understand its structure and identify data quality issues.

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv("../data/raw_lending_club.csv", index_col=0)

print(data.shape)
data.head()

(2925493, 141)


| Unnamed: 0 | id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | ... | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | debt_settlement_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 |  | ... |  |  |  |  |  |  |  |  |  | N |
| 1 | 1077430 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | Ryder | ... |  |  |  |  |  |  |  |  |  | N |
| 2 | 1077175 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 |  | ... |  |  |  |  |  |  |  |  |  | N |
| 3 | 1076863 | 10000.0 | 10000.0 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | AIR RESOURCES BOARD | ... |  |  |  |  |  |  |  |  |  | N |
| 4 | 1075358 | 3000.0 | 3000.0 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | University Medical Group | ... |  |  |  |  |  |  |  |  |  | N |
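
One way to surface the data quality issues mentioned above is to summarize each column's dtype, share of missing values, and number of distinct values. The sketch below runs on a small stand-in frame; `summarize_columns` and the toy values are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing share, and distinct-value count per column."""
    summary = pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "missing_share": frame.isna().mean(),
            "n_unique": frame.nunique(dropna=True),
        }
    )
    # Columns with the largest share of missing values come first.
    return summary.sort_values("missing_share", ascending=False)


# Toy stand-in for the raw data (hypothetical values).
toy = pd.DataFrame(
    {
        "int_rate": ["10.65%", "15.27%", None, "13.49%"],
        "term": ["36 months", "60 months", "36 months", "36 months"],
        "hardship_length": [None, None, None, 3.0],
    }
)

column_summary = summarize_columns(toy)
print(column_summary)
```

Applied to the full dataset, the same call would be `summarize_columns(data)`; columns that are almost entirely missing or that hold numbers stored as strings stand out at a glance.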


## 3. Data Cleaning and Type Conversion

This section focuses on cleaning the dataset and converting columns to appropriate data types to reduce memory usage and improve processing efficiency.

### 3.1. Convert Percentage Strings to Numeric

Convert percentage columns (interest rate, revolving utilization) from string format (e.g., "10.5%") to float values.

In [4]:
def display_before_after(columns, data, raw_data):
    """Show the first three non-null values of each column before and after cleaning."""
    before_after_dictionary = {}

    for col in columns:
        raw_non_na = raw_data[col].dropna().head(3)
        data_non_na = data[col].dropna().head(3)
        before_after_dictionary[f"{col}_before"] = raw_non_na.reset_index(drop=True)
        before_after_dictionary[f"{col}_after"] = data_non_na.reset_index(drop=True)

    return pd.DataFrame(before_after_dictionary)

In [5]:
data["int_rate"] = data["int_rate"].str.rstrip("%").astype(float) / 100
data["revol_util"] = data["revol_util"].str.rstrip("%").astype(float) / 100
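
A quick way to sanity-check this conversion is to run it on a few hand-written values and compare them side by side with `display_before_after`. The toy frame below is an illustrative assumption, and the helper is repeated only so that the sketch is self-contained.

```python
import pandas as pd


def display_before_after(columns, data, raw_data):
    """Show the first three non-null values of each column before and after cleaning."""
    before_after_dictionary = {}
    for col in columns:
        raw_non_na = raw_data[col].dropna().head(3)
        data_non_na = data[col].dropna().head(3)
        before_after_dictionary[f"{col}_before"] = raw_non_na.reset_index(drop=True)
        before_after_dictionary[f"{col}_after"] = data_non_na.reset_index(drop=True)
    return pd.DataFrame(before_after_dictionary)


# Hypothetical raw values, including missing entries.
raw_toy = pd.DataFrame(
    {
        "int_rate": ["10.65%", None, "15.27%"],
        "revol_util": ["83.7%", "9.4%", None],
    }
)

# Keep the raw frame untouched and convert a copy.
toy = raw_toy.copy()
for col in ["int_rate", "revol_util"]:
    toy[col] = toy[col].str.rstrip("%").astype(float) / 100

comparison = display_before_after(["int_rate", "revol_util"], toy, raw_toy)
print(comparison)
```

Note that the helper needs an untouched copy of the raw data, so in the notebook a copy would have to be taken before the conversion cells run.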

In [6]:
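# Drop rows whose `id` holds this text label instead of a loan number, then cast to int.
# .copy() avoids a SettingWithCopyWarning on the assignment that follows.
data = data[data["id"] != "Loans that do not meet the credit policy"].copy()
data["id"] = data["id"].astype(int)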

In [7]:
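# str.rstrip(" months") strips a set of characters rather than a suffix; extract the digits instead.
data["term"] = data["term"].str.extract(r"(\d+)", expand=False).astype(int)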

### 3.2. Convert Date Strings to DateTime

Convert date columns from string format (e.g., "Jan-2015") to proper datetime objects for time-based analysis.

In [8]:
datetime_columns = [
    "issue_d",
    "earliest_cr_line",
    "last_pymnt_d",
    "next_pymnt_d",
    "last_credit_pull_d",
    "sec_app_earliest_cr_line",
    "hardship_end_date",
    "hardship_start_date",
    "payment_plan_start_date",
]

In [9]:
for column in datetime_columns:
    data[column] = pd.to_datetime(data[column].astype(str), format="%b-%Y")
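
To make the effect of the format string concrete, here is a small sketch on hand-written values (the toy series are assumptions, not rows from the dataset). It also shows `errors="coerce"`, an option the cell above does not use but which may help if some values turn out not to follow the expected pattern.

```python
import pandas as pd

toy_dates = pd.Series(["Jan-2015", "Dec-2011", None])

# "%b-%Y" reads an abbreviated month name (English locale assumed) and a
# four-digit year; the day defaults to the first of the month.
parsed = pd.to_datetime(toy_dates, format="%b-%Y")
print(parsed)

# errors="coerce" turns strings that do not match the format into NaT
# instead of raising an exception.
messy = pd.Series(["Jan-2015", "not a date"])
coerced = pd.to_datetime(messy, format="%b-%Y", errors="coerce")
print(coerced)
```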

### 3.3. Convert String Columns to Categories
 
Convert categorical string columns (like `grade`, `sub_grade`, `home_ownership`, `verification_status`, `purpose`, etc.) to `category` dtype for efficiency.
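
The conversion described above can be sketched as follows. The column list is an assumption drawn from the names mentioned in the text, `to_category` is a hypothetical helper, and the toy frame stands in for `data`.

```python
import pandas as pd

# Assumed list, taken from the columns named in the text above.
categorical_columns = [
    "grade",
    "sub_grade",
    "home_ownership",
    "verification_status",
    "purpose",
]


def to_category(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with the listed columns cast to `category`, skipping absent ones."""
    present = [col for col in columns if col in frame.columns]
    return frame.astype({col: "category" for col in present})


# Toy stand-in with repeated string values (hypothetical data).
toy = pd.DataFrame(
    {
        "grade": ["B", "C", "C", "C", "B"] * 200,
        "home_ownership": ["RENT", "RENT", "OWN", "MORTGAGE", "RENT"] * 200,
        "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0, 3000.0] * 200,
    }
)
converted = to_category(toy, categorical_columns)

# Compare the memory footprint of the string and category representations.
before = toy.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()
print(converted.dtypes)
print(f"memory: {before} bytes -> {after} bytes")
```

In the notebook this would be applied to `data` before the rename in Section 4, since the list uses the original column names. The saving comes from storing each distinct string once and keeping small integer codes per row, so it is largest for columns with few distinct values.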



## 4. Data Catalog Generation

Rename the abbreviated Lending Club column names to descriptive ones, then document each column with a description based on the Lending Club data dictionary. The catalog below also records whether each column is eligible as a model feature.

In [10]:
column_rename_dictionary = {
    # Loan Identification
    "id": "loan_id",
    # Loan Amount Information
    "loan_amnt": "loan_amount_requested",
    "funded_amnt": "loan_amount_funded",
    "funded_amnt_inv": "loan_amount_funded_investors",
    "out_prncp": "outstanding_principal",
    "out_prncp_inv": "outstanding_principal_investors",
    # Loan Terms
    "term": "loan_term_months",
    "int_rate": "interest_rate",
    "installment": "monthly_payment",
    "grade": "loan_grade",
    "sub_grade": "loan_subgrade",
    "purpose": "loan_purpose",
    "title": "loan_title",
    # Borrower Employment
    "emp_title": "employment_title",
    "emp_length": "employment_length_years",
    # Borrower Demographics & Location
    "home_ownership": "home_ownership_status",
    "annual_inc": "annual_income",
    "annual_inc_joint": "annual_income_joint",
    "zip_code": "zip_code_first3",
    "addr_state": "state",
    # Income Verification
    "verification_status": "income_verification_status",
    # Loan Dates
    "issue_d": "loan_issue_date",
    "pymnt_plan": "payment_plan_flag",
    # Loan Status
    "loan_status": "loan_status",
    "initial_list_status": "initial_listing_status",
    # URLs
    "url": "loan_listing_url",
    # Debt Information
    "dti": "debt_to_income_ratio",
    "dti_joint": "debt_to_income_ratio_joint",
    # Credit History - Delinquency
    "delinq_2yrs": "delinquencies_past_2years",
    "mths_since_last_delinq": "months_since_last_delinquency",
    "acc_now_delinq": "accounts_currently_delinquent",
    "delinq_amnt": "delinquent_amount",
    "mths_since_last_major_derog": "months_since_major_derogatory",
    # Credit History - Credit Lines
    "earliest_cr_line": "earliest_credit_line_date",
    "open_acc": "open_credit_lines",
    "total_acc": "total_credit_lines",
    "pub_rec": "public_records_count",
    "pub_rec_bankruptcies": "public_records_bankruptcies",
    # FICO Scores
    "fico_range_low": "fico_score_low",
    "fico_range_high": "fico_score_high",
    "last_fico_range_low": "fico_score_low_last",
    "last_fico_range_high": "fico_score_high_last",
    # Credit Inquiries
    "inq_last_6mths": "inquiries_last_6months",
    "inq_last_12m": "inquiries_last_12months",
    "inq_fi": "finance_inquiries",
    # Revolving Credit
    "revol_bal": "revolving_balance",
    "revol_util": "revolving_utilization_rate",
    "open_rv_12m": "revolving_trades_opened_12m",
    "open_rv_24m": "revolving_trades_opened_24m",
    "num_rev_accts": "revolving_accounts_count",
    "num_rev_tl_bal_gt_0": "revolving_trades_with_balance",
    "num_op_rev_tl": "open_revolving_trades",
    "mo_sin_old_rev_tl_op": "months_since_oldest_revolving",
    "mo_sin_rcnt_rev_tl_op": "months_since_recent_revolving",
    "mths_since_recent_revol_delinq": "months_since_recent_revolving_delinquency",
    # Installment Accounts
    "open_act_il": "active_installment_trades",
    "open_il_12m": "installment_accounts_opened_12m",
    "open_il_24m": "installment_accounts_opened_24m",
    "num_il_tl": "installment_accounts_count",
    "total_bal_il": "total_installment_balance",
    "il_util": "installment_utilization",
    "mths_since_rcnt_il": "months_since_recent_installment",
    "mo_sin_old_il_acct": "months_since_oldest_installment",
    "total_il_high_credit_limit": "total_installment_credit_limit",
    # Bankcard Accounts
    "num_bc_tl": "bankcard_accounts_count",
    "num_actv_bc_tl": "active_bankcard_accounts",
    "num_bc_sats": "satisfactory_bankcard_accounts",
    "bc_util": "bankcard_utilization",
    "bc_open_to_buy": "bankcard_open_to_buy",
    "max_bal_bc": "max_balance_bankcard",
    "mths_since_recent_bc": "months_since_recent_bankcard",
    "mths_since_recent_bc_dlq": "months_since_recent_bankcard_delinquency",
    "percent_bc_gt_75": "percent_bankcard_over_75pct_limit",
    # Payment Information
    "total_pymnt": "total_payments_received",
    "total_pymnt_inv": "total_payments_received_investors",
    "total_rec_prncp": "total_principal_received",
    "total_rec_int": "total_interest_received",
    "total_rec_late_fee": "total_late_fees_received",
    "last_pymnt_d": "last_payment_date",
    "last_pymnt_amnt": "last_payment_amount",
    "next_pymnt_d": "next_payment_date",
    # Credit Pulls
    "last_credit_pull_d": "last_credit_pull_date",
    "mths_since_recent_inq": "months_since_recent_inquiry",
    # Collections & Charge-offs
    "collections_12_mths_ex_med": "collections_12months_excluding_medical",
    "chargeoff_within_12_mths": "chargeoffs_within_12months",
    "tot_coll_amt": "total_collection_amount",
    "recoveries": "recoveries_post_chargeoff",
    "collection_recovery_fee": "collection_recovery_fee",
    # Account History
    "mths_since_last_record": "months_since_last_public_record",
    "open_acc_6m": "open_trades_last_6months",
    "acc_open_past_24mths": "accounts_opened_past_24months",
    "num_tl_op_past_12m": "accounts_opened_past_12months",
    "num_tl_30dpd": "accounts_30days_past_due",
    "num_tl_120dpd_2m": "accounts_120days_past_due",
    "num_tl_90g_dpd_24m": "accounts_90plus_days_past_due_24m",
    "num_accts_ever_120_pd": "accounts_ever_120days_past_due",
    "pct_tl_nvr_dlq": "percent_trades_never_delinquent",
    # Account Balances
    "tot_cur_bal": "total_current_balance",
    "tot_hi_cred_lim": "total_high_credit_limit",
    "total_bal_ex_mort": "total_balance_excluding_mortgage",
    "total_bc_limit": "total_bankcard_limit",
    "avg_cur_bal": "average_current_balance",
    # Account Counts
    "num_actv_rev_tl": "active_revolving_trades",
    "num_sats": "satisfactory_accounts_count",
    "total_cu_tl": "finance_trades_count",
    # Months Since
    "mo_sin_rcnt_tl": "months_since_recent_account",
    # Mortgage
    "mort_acc": "mortgage_accounts_count",
    # Tax Liens
    "tax_liens": "tax_liens_count",
    # Application Type
    "application_type": "application_type",
    "policy_code": "policy_code",
    # Secondary Applicant (Joint Applications)
    "sec_app_earliest_cr_line": "secondary_app_earliest_credit_line",
    "sec_app_open_act_il": "secondary_app_active_installment_trades",
    # Hardship Plan
    "hardship_flag": "hardship_plan_flag",
    "hardship_type": "hardship_plan_type",
    "hardship_reason": "hardship_plan_reason",
    "hardship_status": "hardship_plan_status",
    "hardship_start_date": "hardship_plan_start_date",
    "hardship_end_date": "hardship_plan_end_date",
    "hardship_length": "hardship_plan_length_months",
    "hardship_amount": "hardship_plan_monthly_payment",
    "hardship_dpd": "hardship_plan_days_past_due",
    "hardship_loan_status": "hardship_plan_loan_status",
    "deferral_term": "hardship_deferral_months",
    "payment_plan_start_date": "hardship_payment_plan_start_date",
    "orig_projected_additional_accrued_interest": "hardship_projected_interest",
    "hardship_payoff_balance_amount": "hardship_payoff_balance",
    "hardship_last_payment_amount": "hardship_last_payment_amount",
    # Debt Settlement
    "debt_settlement_flag": "debt_settlement_flag",
}

data = data.rename(columns=column_rename_dictionary)
data.head()

| Unnamed: 0 | loan_id | loan_amount_requested | loan_amount_funded | loan_amount_funded_investors | loan_term_months | interest_rate | monthly_payment | loan_grade | loan_subgrade | employment_title | ... | hardship_plan_start_date | hardship_plan_end_date | hardship_payment_plan_start_date | hardship_plan_length_months | hardship_plan_days_past_due | hardship_plan_loan_status | hardship_projected_interest | hardship_payoff_balance | hardship_last_payment_amount | debt_settlement_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 5000.0 | 5000.0 | 4975.0 | 36 | 0.1065 | 162.87 | B | B2 |  | ... | NaT | NaT | NaT |  |  |  |  |  |  | N |
| 1 | 1077430 | 2500.0 | 2500.0 | 2500.0 | 60 | 0.1527 | 59.83 | C | C4 | Ryder | ... | NaT | NaT | NaT |  |  |  |  |  |  | N |
| 2 | 1077175 | 2400.0 | 2400.0 | 2400.0 | 36 | 0.1596 | 84.33 | C | C5 |  | ... | NaT | NaT | NaT |  |  |  |  |  |  | N |
| 3 | 1076863 | 10000.0 | 10000.0 | 10000.0 | 36 | 0.1349 | 339.31 | C | C1 | AIR RESOURCES BOARD | ... | NaT | NaT | NaT |  |  |  |  |  |  | N |
| 4 | 1075358 | 3000.0 | 3000.0 | 3000.0 | 60 | 0.1269 | 67.79 | B | B5 | University Medical Group | ... | NaT | NaT | NaT |  |  |  |  |  |  | N |
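
Because `DataFrame.rename` by default ignores mapping keys that match no column, a typo in a large mapping like the one above goes unnoticed. The sketch below shows one way to check a mapping before applying it; `check_rename_mapping` and the toy inputs are illustrative assumptions.

```python
import pandas as pd


def check_rename_mapping(columns, mapping):
    """Report mapping keys missing from the frame, columns left unmapped,
    and new names that would collide after renaming."""
    columns = list(columns)
    missing_keys = sorted(set(mapping) - set(columns))
    unmapped = [col for col in columns if col not in mapping]
    new_names = pd.Series([mapping.get(col, col) for col in columns], dtype=object)
    collisions = sorted(new_names[new_names.duplicated()].unique())
    return {
        "missing_keys": missing_keys,
        "unmapped": unmapped,
        "collisions": collisions,
    }


# Empty toy frame: only the column names matter for this check.
toy = pd.DataFrame(columns=["id", "loan_amnt", "int_rate", "emp_title"])
toy_mapping = {
    "id": "loan_id",
    "loan_amnt": "loan_amount_requested",
    "int_rte": "interest_rate",  # deliberate typo: matches no column
}

report = check_rename_mapping(toy.columns, toy_mapping)
print(report)
```

Passing `errors="raise"` to `rename` is a stricter built-in alternative for the first of these checks; the helper additionally lists columns that keep their raw names.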


This dataset contains 141 columns originating from the full Lending Club dataset.  
For a credit granting model, variables must be carefully classified according to when and how they are generated, in order to prevent data leakage and incorrect model inputs.

Below, variables are grouped by their role in the credit decision process, with explicit labels indicating their eligibility as model features.

---

## Identification & Metadata (NOT Model Features)

These variables uniquely identify loans or provide metadata. They must **never** be used as predictive features.

- **loan_id**: Unique identifier assigned by Lending Club for each loan listing
- **loan_listing_url**: URL to the Lending Club page with detailed listing information

---

## Applicant-Provided Information (Eligible Model Features)

These variables are directly provided by the borrower at the time of application and are fully available ex-ante. They represent core inputs for a granting decision.

### Borrower Demographics & Employment

- **employment_title**: Job title supplied by the borrower when applying for the loan
- **employment_length_years**: Employment length in years (0=less than 1 year, 10=10+ years)
- **home_ownership_status**: Home ownership status provided by borrower (RENT, OWN, MORTGAGE, OTHER)
- **annual_income**: Self-reported annual income provided by borrower during registration
- **annual_income_joint**: Combined self-reported annual income for joint applications
- **state**: State provided by borrower in loan application
- **zip_code_first3**: First 3 digits of zip code provided by borrower in loan application
- **application_type**: Indicates whether loan is individual application or joint application with co-borrowers
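
The 0 to 10 encoding described for **employment_length_years** is not produced by the cleaning cells above. A minimal sketch of one way to derive it follows, assuming the raw values are strings such as "< 1 year", "3 years", and "10+ years"; `parse_employment_length` is a hypothetical helper.

```python
import pandas as pd


def parse_employment_length(values: pd.Series) -> pd.Series:
    """Map strings such as "< 1 year" or "10+ years" to numbers from 0 to 10."""
    # Pull out the first run of digits; missing values stay missing.
    years = values.str.extract(r"(\d+)", expand=False).astype(float)
    # "< 1 year" would extract as 1, so it is overridden with 0.
    years[values.str.strip() == "< 1 year"] = 0.0
    return years


# Hypothetical raw values, including a missing entry.
toy = pd.Series(["10+ years", "< 1 year", "3 years", "1 year", None])
employment_years = parse_employment_length(toy)
print(employment_years.tolist())
```

The result is kept as float so that missing values can be represented as NaN.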

### Loan Request Characteristics

- **loan_amount_requested**: The listed amount of the loan applied for by the borrower (may be reduced by credit department)
- **loan_term_months**: Number of payments on the loan in months (36 or 60)
- **loan_purpose**: Category provided by borrower for the loan request (e.g., debt_consolidation, credit_card)
- **loan_title**: The loan title provided by the borrower

### Income Verification

- **income_verification_status**: Indicates if income was verified by LC, not verified, or if income source was verified

---

## Credit Bureau Information (Eligible Model Features)

These variables summarize the borrower's credit history as observed at the time of application. They are valid predictors for default risk in a granting scenario.

### Credit Scores

- **fico_score_low**: Lower boundary of FICO score range at loan origination
- **fico_score_high**: Upper boundary of FICO score range at loan origination

### Debt & Affordability Ratios

- **debt_to_income_ratio**: Ratio calculated using borrower's total monthly debt payments (excluding mortgage and LC loan) divided by self-reported monthly income
- **debt_to_income_ratio_joint**: Ratio calculated using co-borrowers' total monthly payments (excluding mortgages and LC loan) divided by combined self-reported monthly income

### Credit History - Delinquency & Defaults

- **delinquencies_past_2years**: Number of 30+ days past-due incidences of delinquency in borrower's credit file for past 2 years
- **months_since_last_delinquency**: Number of months since borrower's last delinquency
- **accounts_currently_delinquent**: Number of accounts on which borrower is now delinquent
- **delinquent_amount**: Past-due amount owed for accounts on which borrower is now delinquent
- **months_since_major_derogatory**: Months since most recent 90-day or worse rating
- **accounts_30days_past_due**: Number of accounts currently 30 days past due (updated in past 2 months)
- **accounts_120days_past_due**: Number of accounts currently 120 days past due (updated in past 2 months)
- **accounts_90plus_days_past_due_24m**: Number of accounts 90 or more days past due in last 24 months
- **accounts_ever_120days_past_due**: Number of accounts ever 120 or more days past due
- **public_records_count**: Number of derogatory public records
- **public_records_bankruptcies**: Number of public record bankruptcies
- **tax_liens_count**: Number of tax liens

### Credit History - Credit Lines & Accounts

- **earliest_credit_line_date**: Month when borrower's earliest reported credit line was opened
- **open_credit_lines**: Number of open credit lines in borrower's credit file
- **total_credit_lines**: Total number of credit lines currently in borrower's credit file
- **months_since_last_public_record**: Number of months since last public record

### Revolving Credit

- **revolving_balance**: Total credit revolving balance
- **revolving_utilization_rate**: Revolving line utilization rate - amount of credit borrower is using relative to all available revolving credit (as decimal)
- **revolving_trades_opened_12m**: Number of revolving trades opened in past 12 months
- **revolving_trades_opened_24m**: Number of revolving trades opened in past 24 months
- **revolving_accounts_count**: Number of revolving accounts
- **revolving_trades_with_balance**: Number of revolving trades with balance > 0
- **open_revolving_trades**: Number of open revolving accounts
- **active_revolving_trades**: Number of currently active revolving trades
- **months_since_oldest_revolving**: Months since oldest revolving account opened
- **months_since_recent_revolving**: Months since most recent revolving account opened
- **months_since_recent_revolving_delinquency**: Months since most recent revolving delinquency

### Installment Accounts

- **active_installment_trades**: Number of currently active installment trades
- **installment_accounts_opened_12m**: Number of installment accounts opened in past 12 months
- **installment_accounts_opened_24m**: Number of installment accounts opened in past 24 months
- **installment_accounts_count**: Number of installment accounts
- **total_installment_balance**: Total current balance of all installment accounts
- **installment_utilization**: Ratio of total current balance to high credit/credit limit on all installment accounts
- **months_since_recent_installment**: Months since most recent installment accounts opened
- **months_since_oldest_installment**: Months since oldest bank installment account opened
- **total_installment_credit_limit**: Total installment high credit/credit limit

### Bankcard Accounts

- **bankcard_accounts_count**: Number of bankcard accounts
- **active_bankcard_accounts**: Number of currently active bankcard accounts
- **satisfactory_bankcard_accounts**: Number of satisfactory bankcard accounts
- **bankcard_utilization**: Ratio of total current balance to high credit/credit limit for all bankcard accounts
- **bankcard_open_to_buy**: Total open to buy on revolving bankcards
- **max_balance_bankcard**: Maximum current balance owed on all revolving accounts
- **months_since_recent_bankcard**: Months since most recent bankcard account opened
- **months_since_recent_bankcard_delinquency**: Months since most recent bankcard delinquency
- **percent_bankcard_over_75pct_limit**: Percentage of all bankcard accounts > 75% of limit

### Credit Inquiries & Recent Activity

- **inquiries_last_6months**: Number of inquiries in past 6 months (excluding auto and mortgage inquiries)
- **inquiries_last_12months**: Number of credit inquiries in past 12 months
- **finance_inquiries**: Number of personal finance inquiries
- **months_since_recent_inquiry**: Months since most recent inquiry
- **open_trades_last_6months**: Number of open trades in last 6 months
- **accounts_opened_past_12months**: Number of accounts opened in past 12 months
- **accounts_opened_past_24months**: Number of trades opened in past 24 months

### Account Balances & Limits

- **total_current_balance**: Total current balance of all accounts
- **total_high_credit_limit**: Total high credit/credit limit
- **total_balance_excluding_mortgage**: Total credit balance excluding mortgage
- **total_bankcard_limit**: Total bankcard high credit/credit limit
- **average_current_balance**: Average current balance of all accounts

### Additional Account Metrics

- **satisfactory_accounts_count**: Number of satisfactory accounts
- **finance_trades_count**: Number of finance trades
- **months_since_recent_account**: Months since most recent account opened
- **mortgage_accounts_count**: Number of mortgage accounts
- **percent_trades_never_delinquent**: Percent of trades never delinquent

### Secondary Applicant (Joint Applications)

- **secondary_app_earliest_credit_line**: Month earliest credit line opened for secondary applicant
- **secondary_app_active_installment_trades**: Number of currently active installment trades at time of application for secondary applicant

---

## Internal Policy / Pricing Outputs (NOT Model Features)

These variables are generated by Lending Club as part of their **internal risk assessment, pricing, or allocation process**. Using them as features would introduce target leakage or circular logic.

- **interest_rate**: Interest rate on the loan (determined by LC based on risk assessment)
- **loan_grade**: LC assigned loan grade (A-G, determined by LC's internal model)
- **loan_subgrade**: LC assigned loan subgrade (e.g., A1, A2, determined by LC's internal model)
- **monthly_payment**: Monthly payment owed by borrower if loan originates (calculated from interest rate and loan terms)
- **loan_amount_funded**: Total amount committed to loan at origination (may differ from requested amount based on LC decision)
- **loan_amount_funded_investors**: Total amount committed by investors for loan (allocation decision)
- **initial_listing_status**: Initial listing status of loan (W=whole, F=fractional, allocation decision)
- **policy_code**: Policy code (1=publicly available, 2=new products not publicly available)
- **loan_issue_date**: Month which loan was funded (post-decision timestamp)
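
The dependence of **monthly_payment** on the interest rate and term can be illustrated with the standard level-payment amortization formula. That Lending Club uses exactly this formula is an assumption, not something stated in the data dictionary quoted here; `monthly_installment` is a hypothetical helper.

```python
def monthly_installment(principal: float, annual_rate: float, n_months: int) -> float:
    """Level payment of a fully amortizing fixed-rate loan with monthly compounding."""
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # Without interest the principal is simply spread over the term.
        return principal / n_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_months)


# First sample row shown earlier: 5000.0 at an interest rate of 0.1065 over 36 months.
payment = monthly_installment(5000.0, 0.1065, 36)
print(round(payment, 2))
```

For this row the result rounds to 162.87, the `monthly_payment` value listed in the sample output. Since the payment is a deterministic function of the rate, it carries the same pricing information as **interest_rate** and is excluded for the same reason.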

---

## Post-Decision / Performance Variables (Leakage - Strictly Excluded)

These variables are only observed **after the loan has been granted** and repayments have started. They must not be used in any granting or risk prediction model.

### Payments & Outstanding Balances

- **outstanding_principal**: Remaining outstanding principal for total amount funded
- **outstanding_principal_investors**: Remaining outstanding principal for portion funded by investors
- **total_payments_received**: Payments received to date for total amount funded
- **total_payments_received_investors**: Payments received to date for portion funded by investors
- **total_principal_received**: Principal received to date
- **total_interest_received**: Interest received to date
- **total_late_fees_received**: Late fees received to date
- **last_payment_date**: Last month payment was received
- **last_payment_amount**: Last total payment amount received
- **next_payment_date**: Next scheduled payment date

### Collections, Charge-offs & Recoveries

- **collections_12months_excluding_medical**: Number of collections in 12 months excluding medical collections
- **chargeoffs_within_12months**: Number of charge-offs within 12 months
- **total_collection_amount**: Total collection amounts ever owed
- **recoveries_post_chargeoff**: Post charge-off gross recovery
- **collection_recovery_fee**: Post charge-off collection fee

### Credit Monitoring Post-Origination

- **last_credit_pull_date**: Most recent month LC pulled credit for this loan (post-origination monitoring)
- **fico_score_low_last**: Lower boundary of FICO score range from most recent pull (post-origination)
- **fico_score_high_last**: Upper boundary of FICO score range from most recent pull (post-origination)

---

## Loan Status & Hardship Information (Target Leakage)

These variables describe loan outcomes or post-default interventions and are directly or indirectly related to the target. They must be excluded from any predictive modeling.

### Loan Status

- **loan_status**: Current status of the loan (e.g., Fully Paid, Charged Off, Current) - **This is the target variable**
- **payment_plan_flag**: Indicates if a payment plan has been put in place for the loan

### Hardship Plan

- **hardship_plan_flag**: Flags whether borrower is on a hardship plan
- **hardship_plan_type**: Describes the hardship plan offering
- **hardship_plan_reason**: Describes the reason hardship plan was offered
- **hardship_plan_status**: Describes if hardship plan is active, pending, canceled, completed, or broken
- **hardship_plan_start_date**: Start date of hardship plan period
- **hardship_plan_end_date**: End date of hardship plan period
- **hardship_plan_length_months**: Number of months borrower will make smaller payments than normally obligated
- **hardship_plan_monthly_payment**: Interest payment borrower has committed to make each month while on hardship plan
- **hardship_plan_days_past_due**: Account days past due as of hardship plan start date
- **hardship_plan_loan_status**: Loan status as of hardship plan start date
- **hardship_deferral_months**: Number of months borrower is expected to pay less than the contractual monthly payment
- **hardship_payment_plan_start_date**: Day first hardship plan payment is due
- **hardship_projected_interest**: Original projected additional interest amount for hardship payment plan
- **hardship_payoff_balance**: Payoff balance amount as of hardship plan start date
- **hardship_last_payment_amount**: Last payment amount as of hardship plan start date

### Debt Settlement

- **debt_settlement_flag**: Flags whether borrower with charge-off is working with debt-settlement company

---

This classification is intended to keep the granting model free of leakage: it should rely only on information available at decision time, so that it can realistically be deployed in production.
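
The grouping above can also be encoded as data, so that feature selection is explicit and repeatable. The sketch below labels only a handful of columns; the role names, `ELIGIBLE_ROLES`, `select_model_features`, and the toy rows are illustrative assumptions.

```python
import pandas as pd

# Role labels mirror the catalog headings; only a few columns are listed here.
column_roles = {
    "loan_id": "identifier",
    "annual_income": "applicant",
    "loan_amount_requested": "applicant",
    "fico_score_low": "bureau",
    "interest_rate": "policy_output",
    "total_payments_received": "post_decision",
    "loan_status": "target",
}
ELIGIBLE_ROLES = {"applicant", "bureau"}


def select_model_features(frame: pd.DataFrame, roles: dict[str, str]) -> pd.DataFrame:
    """Keep only columns whose role is eligible; unlabelled columns raise an error."""
    unlabelled = [col for col in frame.columns if col not in roles]
    if unlabelled:
        raise KeyError(f"columns without a role label: {unlabelled}")
    keep = [col for col in frame.columns if roles[col] in ELIGIBLE_ROLES]
    return frame[keep]


# Toy rows with made-up values.
toy = pd.DataFrame(
    {
        "loan_id": [1, 2],
        "annual_income": [40000.0, 55000.0],
        "fico_score_low": [700.0, 680.0],
        "interest_rate": [0.10, 0.15],
        "loan_status": ["Fully Paid", "Charged Off"],
    }
)
features = select_model_features(toy, column_roles)
print(list(features.columns))
```

Raising on unlabelled columns is a deliberate choice in this sketch: a column added later cannot enter the model until someone has decided which group it belongs to.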

## 5. Export Processed Data

Save the cleaned and preprocessed dataset as a Parquet file for downstream analysis.

In [11]:
data.reset_index(drop=True).to_parquet("../data/lending_club.parquet")