leonhart1917/Machine-learning

This project analyzes U.S. Small Business Administration (SBA) loan data to predict whether a loan will be paid in full or charged off. The goal is to support lenders in balancing profitability and risk, ensuring more sustainable lending decisions for small businesses.

Project Overview

Small businesses rely on SBA loans to grow, but defaults can cause lenders significant losses. Traditional models often maximize accuracy but ignore the unequal costs of the two kinds of misclassification:

Granting a loan to a high-risk borrower → high financial loss

Denying a loan to a creditworthy borrower → lost opportunity

This project applies cost-sensitive classification to adjust decision thresholds and maximize net profitability, not just predictive accuracy.

Dataset

Source: SBA National Loan Data (public dataset)

Size: 800K+ loan records

Features:

Borrower & business information (industry/NAICS, jobs created)

Loan details (amount disbursed, term, program type: LowDoc, RevLineCr)

Dates (approval, disbursement, maturity)

Target: MIS_Status (PIF = Paid in Full, CHGOFF = Charged Off)

Methods

Data Preparation

Dummy variable encoding for categorical features

Standardization for distance-based models (e.g., KNN)

Handling class imbalance via adjusted priors
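As a rough illustration of the first two preparation steps, the sketch below one-hot encodes a categorical flag and z-scores the numeric columns with pandas. The column names (`LowDoc`, `Term`, `DisbursementGross`) and all values are assumptions chosen for the example, not taken from the project's notebooks.

```python
import pandas as pd

# Toy rows; column names and values are hypothetical, for illustration only.
loans = pd.DataFrame({
    "LowDoc": ["Y", "N", "N", "Y"],
    "Term": [84, 60, 120, 36],
    "DisbursementGross": [50000.0, 120000.0, 300000.0, 25000.0],
})

# Dummy-encode the categorical flag; drop_first removes the redundant level.
encoded = pd.get_dummies(loans, columns=["LowDoc"], drop_first=True, dtype=float)

# Z-score the numeric columns so no single feature dominates KNN distances.
numeric_cols = ["Term", "DisbursementGross"]
means = encoded[numeric_cols].mean()
stds = encoded[numeric_cols].std(ddof=0)
encoded[numeric_cols] = (encoded[numeric_cols] - means) / stds

print(encoded)
```

In a real pipeline the means and standard deviations would be computed on the training split only and then reused for validation data, so that no information leaks across the split.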

Models Implemented

Logistic Regression (L1, L2, ElasticNet regularization)

K-Nearest Neighbors (KNN)

Linear Discriminant Analysis (LDA)

Ensemble methods (Bagging, Boosting, Random Forests)

Cost-Sensitive Adjustments

Prior probability re-weighting

Misclassification cost ratio (e.g., 5:1, 10:1)

Profit simulation using loan disbursement amounts
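One way to read a misclassification cost ratio is as an approval cutoff: approve a loan only when the expected cost of approving is lower than the expected cost of denying. The sketch below works this out and adds a per-loan expected profit. The function names and the `gain_rate` and `loss_rate` defaults are hypothetical placeholders, not figures from this project.

```python
def approval_threshold(cost_false_approval: float, cost_false_denial: float) -> float:
    """Minimum P(paid in full) at which approving beats denying.

    Approving costs (1 - p) * cost_false_approval in expectation,
    denying costs p * cost_false_denial, so approve when
    p > cost_false_approval / (cost_false_approval + cost_false_denial).
    """
    return cost_false_approval / (cost_false_approval + cost_false_denial)


def expected_profit(p_paid: float, amount: float,
                    gain_rate: float = 0.05, loss_rate: float = 0.25) -> float:
    """Expected profit on one loan; both rates are assumed, illustrative values."""
    return p_paid * gain_rate * amount - (1.0 - p_paid) * loss_rate * amount


print(approval_threshold(5, 1))    # 5:1 ratio, about 0.833
print(approval_threshold(10, 1))   # 10:1 ratio, about 0.909
print(expected_profit(0.9, 100_000))
```

A higher cost ratio pushes the cutoff toward 1, so fewer but safer loans are approved.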

Evaluation

Confusion Matrix & Classification Report

ROC Curve & AUC

Decile lift charts for marketing/lending cutoffs

Net Profit Analysis using expected gain/loss per loan
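A decile lift chart ranks loans by model score, splits them into ten equal groups, and compares each group's event rate with the overall rate. The following is a minimal pure-Python sketch of that calculation; `decile_lift` is a hypothetical helper and the data is synthetic. It assumes at least as many records as bins and at least one positive outcome.

```python
def decile_lift(scores, outcomes, n_bins=10):
    """Lift per bin (bin event rate / overall event rate), highest scores first."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall_rate = sum(outcomes) / n
    lifts = []
    for i in range(n_bins):
        # Integer arithmetic keeps bin sizes as equal as possible.
        chunk = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        bin_rate = sum(y for _, y in chunk) / len(chunk)
        lifts.append(bin_rate / overall_rate)
    return lifts


# Synthetic example: a perfectly ranked sample of 20 loans.
scores = [i / 20 for i in range(20)]
outcomes = [1 if s >= 0.5 else 0 for s in scores]
print(decile_lift(scores, outcomes))
```

A lift above 1 in the top deciles means the model concentrates the event of interest among its highest-scored loans, which is what makes a score-based cutoff useful.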

Results

Best model: XGBoost

ROC-AUC: ≈ 0.89 on validation set

Business Insight: Adjusting the cutoff threshold increased estimated net profitability by steering approvals toward high-value, lower-risk loans. Estimated profit was highest at an approval threshold of 0.83: a loan is granted only if its predicted probability of being paid in full exceeds 83%.
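A profit-maximizing cutoff like this can be found by sweeping candidate thresholds over a validation set and summing the realized gain or loss on the loans each threshold would approve. The sketch below illustrates the idea on made-up data. The helper names and the `gain_rate` and `loss_rate` values are assumptions, not the project's actual figures.

```python
def net_profit(p_paid, paid, amounts, threshold, gain_rate=0.05, loss_rate=0.25):
    """Total profit if every loan with P(paid in full) > threshold is approved."""
    total = 0.0
    for p, was_paid, amount in zip(p_paid, paid, amounts):
        if p > threshold:
            # Repaid loans earn a gain; charged-off loans lose part of the amount.
            total += gain_rate * amount if was_paid else -loss_rate * amount
    return total


def best_threshold(p_paid, paid, amounts, grid=None):
    """Lowest grid value that maximizes net profit on the given loans."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: net_profit(p_paid, paid, amounts, t))


# Made-up validation sample: predicted probabilities, outcomes (1 = paid in full), amounts.
p_paid = [0.95, 0.90, 0.85, 0.80, 0.60]
paid = [1, 1, 1, 0, 0]
amounts = [100_000.0] * 5

cutoff = best_threshold(p_paid, paid, amounts)
print(cutoff, net_profit(p_paid, paid, amounts, cutoff))
```

Because the cutoff is tuned on the same data it is scored on, a held-out set would be needed to confirm the estimated profit.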

Key Skills Demonstrated

Python (pandas, scikit-learn, matplotlib, seaborn)

Machine Learning: classification, regularization, ensemble learning

Cost-sensitive modeling & business decision optimization

Data visualization & storytelling for stakeholders

About

This repository demonstrates an end-to-end data analysis workflow using machine learning models.
