In [3]:
from config import load_accepted, load_rejected
import utils

# SR 11-7 Model Validation Automation

> This notebook is part of a series demonstrating model validation practices based on the Federal Reserve’s SR 11-7 guidance. Using LendingClub loan data from 2007 to 2018, the series simulates a full model lifecycle (data integrity -> documentation -> performance testing -> governance). The dataset includes both accepted and rejected loan applications, which allows us to evaluate model behavior across funded and unfunded populations, benchmark risk segmentation, and simulate governance controls.

## Part 1 - Data Overview

### Dataset: Lending Club Loan Data

This dataset contains detailed loan application records submitted to LendingClub between 2007 and 2018. It is divided into two subsets:
- *Accepted Loans*: Applications that were approved and funded. Includes borrower details, loan terms, and performance outcomes.
- *Rejected Loans*: Applications that were denied. Contains fewer fields (origination data without performance data).

### Key Objectives

- Inspect the general scope and structure of the dataset by:
  - Loading the accepted and rejected datasets from LendingClub.
  - Generating high-level summary statistics for each dataset, including the number of records and the number of columns per data type.
  - Generating simple data dictionaries that capture data type, missingness, uniqueness, and a sample value for each field.
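
The data dictionary described in the last objective can be sketched with plain pandas. The notebook's own `utils.generate_data_dictionary` is not shown here, so `build_data_dictionary` below is a hypothetical stand-in rather than the actual helper; it takes no second argument because the meaning of that argument is not documented in this notebook. The toy frame is made up for illustration.

```python
import pandas as pd


def build_data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.generate_data_dictionary.

    Returns one row per column with its dtype, missing count,
    distinct non-null count, and the first non-null value as a sample.
    """
    rows = []
    for col in df.columns:
        series = df[col]
        non_null = series.dropna()
        rows.append(
            {
                "Column Name": col,
                "Data Type": str(series.dtype),
                "Missing Values": int(series.isna().sum()),
                "Unique Values": int(series.nunique(dropna=True)),
                # A fully empty column has no sample to show.
                "Sample Value": non_null.iloc[0] if not non_null.empty else None,
            }
        )
    return pd.DataFrame(rows).set_index("Column Name")


# Toy data (illustrative only, not taken from the LendingClub files).
toy = pd.DataFrame(
    {
        "loan_amnt": [11950.0, 5000.0, None],
        "grade": ["C", "A", "C"],
        "member_id": [None, None, None],
    }
)
dd = build_data_dictionary(toy)
print(dd)
```

A fully empty column such as `member_id` in the toy frame shows up with a missing count equal to the row count and zero unique values, which is the pattern to look for when scanning a dictionary for unusable fields.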

#### Lending Club Accepted (LCA)

In [4]:
lca_df = load_accepted()

lca_data_dict = utils.generate_data_dictionary(lca_df, 5)
lca_data_dict_head = lca_data_dict.head(10)

# Pass lca_data_dict instead of lca_data_dict_head to style the full data dictionary rather than the 10-row slice.
lca_dd_styled = utils.style_data_dictionary(lca_data_dict_head)

lca_dd_styled

Column Name,Data Type,Missing Values,Unique Values,Sample Value
id,object,0,2260701,68426831
member_id,float64,2260701,0,
loan_amnt,float64,33,1572,11950.000000
funded_amnt,float64,33,1572,11950.000000
funded_amnt_inv,float64,33,10057,11950.000000
term,object,33,2,36 months
int_rate,float64,33,673,13.440000
installment,float64,33,93301,405.180000
grade,object,33,7,C
sub_grade,object,33,35,C3


In [5]:
lca_summary = utils.summarize_dataset(lca_df)
lca_summ_styled = utils.style_data_dictionary(lca_summary)

lca_summ_styled

Metric,Value
Total Rows,2260701
Total Columns,151
Numeric Columns,113
Categorical Columns,38
Datetime Columns,0
Boolean Columns,0
Columns with Missing Values,150
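
A summary of this shape could be computed roughly as follows. `utils.summarize_dataset` is not shown in this notebook, so `summarize_by_dtype` is a hypothetical stand-in, and counting every column that is not numeric, datetime, or boolean as categorical is an assumption (it is at least consistent with the totals above, where numeric and categorical counts add up to the column count). The toy frame is made up for illustration.

```python
import pandas as pd


def summarize_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.summarize_dataset."""
    n_numeric = df.select_dtypes(include="number").shape[1]
    n_datetime = df.select_dtypes(include="datetime").shape[1]
    n_boolean = df.select_dtypes(include="bool").shape[1]
    # Assumption: whatever is left over is treated as categorical.
    n_categorical = df.shape[1] - n_numeric - n_datetime - n_boolean
    summary = pd.Series(
        {
            "Total Rows": len(df),
            "Total Columns": df.shape[1],
            "Numeric Columns": n_numeric,
            "Categorical Columns": n_categorical,
            "Datetime Columns": n_datetime,
            "Boolean Columns": n_boolean,
            "Columns with Missing Values": int(df.isna().any().sum()),
        },
        name="Value",
    )
    return summary.to_frame()


# Toy data (illustrative only).
toy = pd.DataFrame(
    {
        "loan_amnt": [1000.0, None],
        "policy_code": [1, 1],
        "term": ["36 months", "60 months"],
        "grade": ["C", None],
    }
)
summary = summarize_by_dtype(toy)
print(summary)
```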


#### Lending Club Rejected (LCR)

In [6]:
lcr_df = load_rejected()

lcr_data_dict = utils.generate_data_dictionary(lcr_df, 5)
lcr_data_dict_head = lcr_data_dict.head(10)

# Pass lcr_data_dict instead of lcr_data_dict_head to style the full data dictionary rather than the 10-row slice.
lcr_dd_styled = utils.style_data_dictionary(lcr_data_dict_head)

lcr_dd_styled

Column Name,Data Type,Missing Values,Unique Values,Sample Value
Amount Requested,float64,0,3640,15000.000000
Application Date,object,0,4238,2007-05-27
Loan Title,object,1305,73927,Trinfiniti
Risk_Score,float64,18497630,692,645.000000
Debt-To-Income Ratio,object,0,126145,0%
Zip Code,object,293,1001,105xx
State,object,22,51,NY
Employment Length,object,951355,11,3 years
Policy Code,float64,918,2,0.000000


In [7]:
lcr_summary = utils.summarize_dataset(lcr_df)
lcr_summ_styled = utils.style_data_dictionary(lcr_summary)

lcr_summ_styled

Metric,Value
Total Rows,27648741
Total Columns,9
Numeric Columns,3
Categorical Columns,6
Datetime Columns,0
Boolean Columns,0
Columns with Missing Values,6


## Next Steps

The next notebook, `02_DataQuality.ipynb`, will:
- Quantify missingness and cardinality
- Identify data quality issues
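
As a rough preview of what quantifying missingness and cardinality can look like, the sketch below computes a per-column missing percentage and a unique-value ratio. The contents of `02_DataQuality.ipynb` are not shown here, so the function name `profile_columns`, its output column names, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column missingness and cardinality."""
    profile = pd.DataFrame(
        {
            # Share of rows where the column is null, as a percentage.
            "missing_pct": df.isna().mean() * 100,
            # Number of distinct non-null values.
            "n_unique": df.nunique(dropna=True),
        }
    )
    # Ratio near 1.0 suggests an identifier; near 0.0 a constant or empty column.
    profile["unique_ratio"] = profile["n_unique"] / len(df)
    return profile.sort_values("missing_pct", ascending=False)


# Toy data (illustrative only).
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "grade": ["C", "C", None, "A"],
        "member_id": [None, None, None, None],
    }
)
profile = profile_columns(toy)
print(profile)
```

Sorting by `missing_pct` puts fully empty columns first, so they are easy to spot before deciding whether to drop or impute them.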