guydev42/fraud-detection

Overview

A fraud detection pipeline that identifies fraudulent credit card transactions using gradient boosting with SMOTE oversampling and SHAP-based explanations.

Credit card fraud accounts for billions in losses annually, yet fraudulent transactions make up less than 2% of all activity. This extreme class imbalance makes standard classifiers unreliable. This project builds a detection system that trains four models (Logistic Regression, Random Forest, XGBoost, LightGBM) on synthetic transaction data, applies SMOTE to handle class imbalance, and uses SHAP to explain individual predictions. A threshold optimization module balances false positive costs against missed fraud losses.

Problem   →  Detecting rare fraud events in highly imbalanced transaction data
Solution  →  XGBoost with SMOTE oversampling, SHAP explanations, and cost-based threshold tuning
Impact    →  AUC 0.97, catches 94% of fraud with only 3% false positive rate

Key results

| Metric | Value |
|---|---|
| AUC-ROC | 0.97 |
| Recall (fraud caught) | 94% |
| False positive rate | 3% |
| PR-AUC | 0.82 |
| Best model | XGBoost |

Architecture

```text
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Synthetic data  │───▶│  SMOTE           │───▶│  Feature         │
│  generation      │    │  oversampling    │    │  scaling         │
└──────────────────┘    └──────────────────┘    └────────┬─────────┘
                                                         │
                          ┌──────────────────────────────┘
                          ▼
              ┌──────────────────────┐    ┌──────────────────────┐
              │  Model training      │───▶│  Threshold           │
              │  (4 classifiers)     │    │  optimization        │
              └──────────────────────┘    └──────────┬───────────┘
                                                     │
                          ┌──────────────────────────┘
                          ▼
              ┌──────────────────────┐    ┌──────────────────────┐
              │  SHAP                │───▶│  Fraud scoring       │
              │  explanations        │    │  dashboard           │
              └──────────────────────┘    └──────────────────────┘
```

Project structure

```text
project_17_fraud_detection/
├── data/
│   ├── fraud_transactions.csv         # Transaction dataset
│   └── generate_data.py               # Synthetic data generator
├── src/
│   ├── __init__.py
│   ├── data_loader.py                 # Data generation and loading
│   └── model.py                       # Training, evaluation, SHAP
├── notebooks/
│   ├── 01_eda.ipynb                   # Exploratory data analysis
│   ├── 02_feature_engineering.ipynb   # SMOTE, scaling, interactions
│   ├── 03_modeling.ipynb              # Model training and CV
│   └── 04_evaluation.ipynb            # ROC, SHAP, cost analysis
├── app.py                             # Streamlit dashboard
├── requirements.txt
└── README.md
```

Quickstart

```bash
# Clone and navigate
git clone https://github.com/guydev42/calgary-data-portfolio.git
cd calgary-data-portfolio/project_17_fraud_detection

# Install dependencies
pip install -r requirements.txt

# Generate transaction data
python data/generate_data.py

# Launch dashboard
streamlit run app.py
```

Dataset

| Property | Details |
|---|---|
| Source | Synthetic transaction data modeled on real-world fraud patterns |
| Transactions | 10,000 |
| Fraud rate | ~2% (200 fraudulent transactions) |
| Features | 10 (amount, time, distance, velocity, merchant category) |
| Target | `is_fraud` (binary) |
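The repo's actual `data/generate_data.py` is not reproduced here. As a rough illustration of how a ~2%-fraud dataset of this shape can be produced, here is a minimal NumPy sketch; the function name and all distribution parameters are hypothetical, chosen only to give fraud cases heavier-tailed amounts and distances:

```python
import numpy as np

def generate_transactions(n=10_000, fraud_rate=0.02, rng=None):
    """Toy stand-in for a synthetic transaction generator.
    Fraud cases draw from heavier-tailed distributions."""
    rng = np.random.default_rng(rng)
    is_fraud = rng.random(n) < fraud_rate
    # Fraudulent amounts skew larger (hypothetical lognormal parameters)
    amount = np.where(is_fraud,
                      rng.lognormal(6.0, 1.0, n),
                      rng.lognormal(3.5, 1.0, n))
    # Fraud tends to occur far from the cardholder's usual location
    distance_km = np.where(is_fraud,
                           rng.exponential(500, n),
                           rng.exponential(20, n))
    return amount, distance_km, is_fraud.astype(int)

amount, dist, y = generate_transactions(rng=0)
print(len(y), round(float(y.mean()), 3))  # ~2% of labels are fraud
```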

Tech stack

Python, scikit-learn, XGBoost, LightGBM, imbalanced-learn (SMOTE), SHAP, Streamlit

Methodology

Class imbalance handling

- SMOTE (Synthetic Minority Over-sampling Technique) to balance the training data
- `class_weight="balanced"` for Logistic Regression and Random Forest
- `scale_pos_weight` for XGBoost, `is_unbalance` for LightGBM
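In practice the project would call a library implementation (e.g. `imblearn.over_sampling.SMOTE`), but the core idea, interpolating between a minority point and one of its k nearest minority neighbours, can be sketched in plain NumPy. `smote_sketch` and its parameters are illustrative, not the project's code:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Generate synthetic minority samples by interpolating between
    each minority point and one of its k nearest minority neighbours
    (the core idea behind SMOTE)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from point i to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Oversample a toy 5-point minority class with 10 synthetic rows
X_min = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
X_new = smote_sketch(X_min, n_new=10, rng=0)
print(X_new.shape)  # (10, 2)
```

Because every synthetic row lies on a segment between two real minority points, SMOTE densifies the minority region rather than duplicating rows, which is what lets the classifiers see a balanced training set without exact copies.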
Model training

- Four classifiers: Logistic Regression, Random Forest, XGBoost, LightGBM
- 5-fold `StratifiedKFold` cross-validation
- Metrics: AUC-ROC, precision, recall, F1, PR-AUC
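Stratification matters at a 2% fraud rate: a plain random 5-fold split can leave a fold with almost no fraud cases. A hand-rolled sketch of the index partitioning that `sklearn.model_selection.StratifiedKFold` performs (function name and toy labels are made up for the example):

```python
import numpy as np

def stratified_folds(y, n_splits=5, rng=None):
    """Partition indices into folds that each preserve the overall
    class ratio (the guarantee StratifiedKFold provides)."""
    rng = np.random.default_rng(rng)
    folds = [[] for _ in range(n_splits)]
    for cls in np.unique(y):
        # Shuffle this class's indices, then deal them across the folds
        idx = rng.permutation(np.flatnonzero(y == cls))
        for f, chunk in enumerate(np.array_split(idx, n_splits)):
            folds[f].extend(chunk.tolist())
    return [np.array(f) for f in folds]

# 1,000 toy labels at a 2% positive rate, as in the dataset above
y = np.zeros(1000, dtype=int)
y[:20] = 1
folds = stratified_folds(y, n_splits=5, rng=0)
for f in folds:
    print(len(f), int(y[f].sum()))  # each fold: 200 samples, 4 fraud cases
```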
SHAP explainability

- `TreeExplainer` for the gradient boosting models
- Global feature importance via mean absolute SHAP values
- Waterfall plots for individual transaction explanations
Threshold optimization

- Business cost model: FN cost ($500 fraud loss) vs. FP cost ($25 friction)
- Thresholds swept from 0.05 to 0.95 to minimize total expected cost
- Achieves 94% recall at a 3% false positive rate
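The cost sweep described above can be sketched directly: score every threshold by the dollar cost of its confusion matrix and keep the cheapest. `best_threshold` is a hypothetical helper and the toy scores are illustrative; only the $500/$25 costs come from the project's cost model:

```python
import numpy as np

FN_COST, FP_COST = 500, 25  # missed fraud loss vs. customer friction

def best_threshold(y_true, scores, thresholds=np.arange(0.05, 0.96, 0.01)):
    """Pick the decision threshold that minimises total expected cost."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    costs = []
    for t in thresholds:
        pred = scores >= t
        fn = np.sum((y_true == 1) & ~pred)  # fraud we failed to flag
        fp = np.sum((y_true == 0) & pred)   # legitimate transactions flagged
        costs.append(fn * FN_COST + fp * FP_COST)
    return thresholds[int(np.argmin(costs))], min(costs)

# Toy example with well-separated scores
y = np.array([0, 0, 0, 0, 1, 1])
s = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
t, cost = best_threshold(y, s)
print(round(float(t), 2), int(cost))
```

Because a missed fraud costs 20x a false alarm, the cheapest threshold typically sits well below 0.5, trading some extra friction for higher recall.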

Acknowledgements

Built as part of the Calgary Data Portfolio.

