This project analyzes U.S. Small Business Administration (SBA) loan data to predict whether a loan will be paid in full or charged off. The goal is to support lenders in balancing profitability and risk, ensuring more sustainable lending decisions for small businesses.
Project Overview
Small businesses rely on SBA loans to grow, but defaults can cause significant losses. Traditional models often maximize accuracy but ignore the unequal costs of misclassification:
Granting a loan to a high-risk borrower → high financial loss
Denying a loan to a creditworthy borrower → lost opportunity
This project applies cost-sensitive classification to adjust decision thresholds and maximize net profitability, not just predictive accuracy.
Dataset
Source: SBA National Loan Data (public dataset)
Size: 800K+ loan records
Features:
Borrower & business information (industry/NAICS, jobs created)
Loan details (amount disbursed, term, program type: LowDoc, RevLineCr)
Dates (approval, disbursement, maturity)
Target: MIS_Status (PIF = Paid in Full, CHGOFF = Charged Off)
Methods
Data Preparation
Dummy variable encoding for categorical features
Standardization for distance-based models (e.g., KNN)
Handling class imbalance via adjusted priors
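The preparation steps above can be sketched on a few toy rows. This is an illustration under stated assumptions, not the project's actual pipeline: the rows are invented, and the column names Term and DisbursementGross are assumed for the example (NAICS, LowDoc, RevLineCr, and MIS_Status are the fields named in this README). Standardization statistics are taken from the training rows only, so no validation information leaks into the scaling.

```python
import pandas as pd

# Toy rows for illustration only; Term and DisbursementGross are assumed column names.
loans = pd.DataFrame({
    "NAICS": ["44", "72", "44", "23", "72", "23"],
    "LowDoc": ["Y", "N", "N", "Y", "N", "N"],
    "RevLineCr": ["N", "Y", "N", "N", "Y", "N"],
    "Term": [84.0, 60.0, 120.0, 36.0, 240.0, 60.0],
    "DisbursementGross": [50_000.0, 120_000.0, 35_000.0, 80_000.0, 400_000.0, 20_000.0],
    "MIS_Status": ["PIF", "CHGOFF", "PIF", "PIF", "PIF", "CHGOFF"],
})

# Binary target: 1 = charged off, 0 = paid in full.
y = (loans["MIS_Status"] == "CHGOFF").astype(int)

# Dummy-encode the categorical columns; drop_first removes one redundant level each.
X = pd.get_dummies(
    loans.drop(columns="MIS_Status"),
    columns=["NAICS", "LowDoc", "RevLineCr"],
    drop_first=True,
    dtype=float,
)

# Split first, then standardize numeric columns with training statistics only.
numeric_cols = ["Term", "DisbursementGross"]
X_train, X_valid = X.iloc[:4].copy(), X.iloc[4:].copy()
mean = X_train[numeric_cols].mean()
std = X_train[numeric_cols].std(ddof=0)
X_train[numeric_cols] = (X_train[numeric_cols] - mean) / std
X_valid[numeric_cols] = (X_valid[numeric_cols] - mean) / std

print(X_train.round(2))
```

Scaling matters most for KNN, where an unscaled dollar amount would otherwise dominate every distance calculation.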
Models Implemented
Logistic Regression (L1, L2, ElasticNet regularization)
K-Nearest Neighbors (KNN)
Linear Discriminant Analysis (LDA)
Ensemble methods (Bagging, Boosting, Random Forests)
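A minimal sketch of how several of the listed model families could be compared with scikit-learn. The data is synthetic (make_classification) rather than the SBA records, and the hyperparameters are placeholders chosen for the example, so the printed scores say nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded loan features; about 20% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Scaling lives inside the pipeline so it is refit on each training fold.
models = {
    "logistic_l2": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "lda": LinearDiscriminantAnalysis(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated ROC-AUC per model.
auc_by_model = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in sorted(auc_by_model.items(), key=lambda kv: -kv[1]):
    print(f"{name:>14}: AUC = {auc:.3f}")
```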
Cost-Sensitive Adjustments
Prior probability re-weighting
Misclassification cost ratio (e.g., 5:1, 10:1)
Profit simulation using loan disbursement amounts
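A misclassification cost ratio translates directly into an approval cutoff. Approving has lower expected cost than denying when (1 - p) × cost_bad_approval < p × cost_good_denial, where p is the predicted probability of payment in full, which rearranges to p > cost_bad_approval / (cost_bad_approval + cost_good_denial). The sketch below encodes that rule; the function and parameter names are hypothetical and not taken from the project code.

```python
def approval_threshold(cost_bad_approval: float, cost_good_denial: float) -> float:
    """Minimum P(paid in full) above which approving has lower expected cost than denying."""
    if cost_bad_approval <= 0 or cost_good_denial <= 0:
        raise ValueError("costs must be positive")
    return cost_bad_approval / (cost_bad_approval + cost_good_denial)


def approve(p_paid_in_full: float, cost_bad_approval: float, cost_good_denial: float) -> bool:
    """Approve only when the predicted repayment probability clears the cost-based cutoff."""
    return p_paid_in_full > approval_threshold(cost_bad_approval, cost_good_denial)


# Cutoffs implied by a few cost ratios (bad approval : good denial).
for ratio in (1, 5, 10):
    cutoff = approval_threshold(ratio, 1)
    print(f"{ratio}:1 -> approve when P(PIF) > {cutoff:.3f}")
```

Equal costs give the familiar 0.5 cutoff, a 5:1 ratio gives about 0.833, and a 10:1 ratio gives about 0.909. The 5:1 value is close to the 0.83 cutoff reported under Results, although this README does not state that the cutoff was derived this way.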
Evaluation
Confusion Matrix & Classification Report
ROC Curve & AUC
Decile lift charts for marketing/lending cutoffs
Net Profit Analysis using expected gain/loss per loan
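One way to build the table behind a decile lift chart, shown on simulated scores. The helper name decile_lift is hypothetical, and the simulated outcomes are drawn so that the event probability equals the score. Rows are binned by rank rather than by score value, so tied scores cannot produce uneven deciles.

```python
import numpy as np
import pandas as pd


def decile_lift(y_true, scores) -> pd.DataFrame:
    """Event rate per score decile divided by the overall event rate.

    Decile 1 holds the highest-scoring 10% of rows.
    """
    df = pd.DataFrame({"y": np.asarray(y_true), "score": np.asarray(scores)})
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    # Assign equal-sized bins from the sorted position.
    df["decile"] = (np.arange(len(df)) * 10 // len(df)) + 1
    table = df.groupby("decile")["y"].agg(n="count", event_rate="mean")
    table["lift"] = table["event_rate"] / df["y"].mean()
    return table


# Simulated scores and outcomes, for illustration only.
rng = np.random.default_rng(0)
scores = rng.uniform(size=2000)
y_true = rng.binomial(1, scores)

lift_table = decile_lift(y_true, scores)
print(lift_table.round(2))
```

A lift above 1 in the top deciles means the model concentrates the event of interest among its highest-scoring loans, and the decile where lift drops below 1 is a natural candidate cutoff.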
Results
Best model: XGBoost
ROC-AUC: ≈ 0.89 on the validation set
Business Insight: Adjusting the cutoff threshold increased estimated net profitability by steering approvals toward high-value, lower-risk loans. Estimated profit was highest at an approval threshold of 0.83: a loan is granted only when the applicant's predicted probability of paying in full exceeds 83%.
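A profit-maximizing cutoff of this kind can be found by sweeping candidate thresholds and totalling the gain and loss on the loans each one would approve. The sketch below runs on simulated scores and loan amounts. The margin (gain on a repaid loan) and loss_rate (loss on a charged-off loan), both expressed as fractions of the disbursed amount, are hypothetical values chosen for the example and are not the figures used in this project.

```python
import numpy as np


def net_profit(p_pif, paid_in_full, amount, threshold, margin=0.05, loss_rate=0.25):
    """Net profit from approving every loan whose P(paid in full) exceeds threshold.

    margin and loss_rate are hypothetical fractions of the disbursed amount.
    """
    approved = p_pif > threshold
    gain = margin * amount[approved & (paid_in_full == 1)].sum()
    loss = loss_rate * amount[approved & (paid_in_full == 0)].sum()
    return gain - loss


def best_threshold(p_pif, paid_in_full, amount, grid=None, **kwargs):
    """Return the (threshold, profit) pair with the highest net profit on the grid."""
    grid = np.linspace(0.0, 0.99, 100) if grid is None else np.asarray(grid)
    profits = np.array(
        [net_profit(p_pif, paid_in_full, amount, t, **kwargs) for t in grid]
    )
    i = int(profits.argmax())
    return float(grid[i]), float(profits[i])


# Simulated validation set: scores, outcomes consistent with them, loan sizes.
rng = np.random.default_rng(42)
n = 5000
p_pif = rng.beta(5, 1.5, size=n)
paid_in_full = rng.binomial(1, p_pif)
amount = rng.lognormal(mean=11, sigma=1, size=n)

threshold, profit = best_threshold(p_pif, paid_in_full, amount)
approve_all = net_profit(p_pif, paid_in_full, amount, threshold=0.0)
print(f"best threshold: {threshold:.2f}")
print(f"net profit at best threshold: {profit:,.0f}")
print(f"net profit when approving everything: {approve_all:,.0f}")
```

Because the sweep weights each loan by its disbursed amount, the chosen cutoff reflects dollar exposure rather than loan counts, which is why it can differ from the cutoff that maximizes accuracy.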
Key Skills Demonstrated
Python (pandas, scikit-learn, matplotlib, seaborn)
Machine Learning: classification, regularization, ensemble learning
Cost-sensitive modeling & business decision optimization
Data visualization & storytelling for stakeholders