A fraud detection pipeline that identifies fraudulent credit card transactions using gradient boosting with SMOTE oversampling and SHAP-based explanations.
Credit card fraud accounts for billions in losses annually, yet fraudulent transactions make up less than 2% of all activity. This extreme class imbalance makes standard classifiers unreliable. This project builds a detection system that trains four models (Logistic Regression, Random Forest, XGBoost, LightGBM) on synthetic transaction data, applies SMOTE to handle class imbalance, and uses SHAP to explain individual predictions. A threshold optimization module balances false positive costs against missed fraud losses.
Problem → Detecting rare fraud events in highly imbalanced transaction data
Solution → XGBoost with SMOTE oversampling, SHAP explanations, and cost-based threshold tuning
Impact → AUC 0.97; catches 94% of fraud at a 3% false positive rate
| Metric | Value |
|---|---|
| AUC-ROC | 0.97 |
| Recall (fraud caught) | 94% |
| False positive rate | 3% |
| PR-AUC | 0.82 |
| Best model | XGBoost |
```
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Synthetic data  │───▶│      SMOTE       │───▶│     Feature      │
│    generation    │    │   oversampling   │    │     scaling      │
└──────────────────┘    └──────────────────┘    └────────┬─────────┘
                                                         │
                           ┌─────────────────────────────┘
                           ▼
              ┌──────────────────────┐    ┌──────────────────────┐
              │    Model training    │───▶│      Threshold       │
              │   (4 classifiers)    │    │     optimization     │
              └──────────────────────┘    └──────────┬───────────┘
                                                     │
                        ┌────────────────────────────┘
                        ▼
              ┌──────────────────────┐    ┌──────────────────────┐
              │        SHAP          │───▶│    Fraud scoring     │
              │    explanations      │    │      dashboard       │
              └──────────────────────┘    └──────────────────────┘
```
## Project structure
```
project_17_fraud_detection/
├── data/
│   ├── fraud_transactions.csv        # Transaction dataset
│   └── generate_data.py              # Synthetic data generator
├── src/
│   ├── __init__.py
│   ├── data_loader.py                # Data generation and loading
│   └── model.py                      # Training, evaluation, SHAP
├── notebooks/
│   ├── 01_eda.ipynb                  # Exploratory data analysis
│   ├── 02_feature_engineering.ipynb  # SMOTE, scaling, interactions
│   ├── 03_modeling.ipynb             # Model training and CV
│   └── 04_evaluation.ipynb           # ROC, SHAP, cost analysis
├── app.py                            # Streamlit dashboard
├── requirements.txt
└── README.md
```
```bash
# Clone and navigate
git clone https://github.com/guydev42/calgary-data-portfolio.git
cd calgary-data-portfolio/project_17_fraud_detection

# Install dependencies
pip install -r requirements.txt

# Generate transaction data
python data/generate_data.py

# Launch dashboard
streamlit run app.py
```

| Property | Details |
|---|---|
| Source | Synthetic transaction data modeled on real-world fraud patterns |
| Transactions | 10,000 |
| Fraud rate | ~2% (200 fraudulent transactions) |
| Features | 10 (including amount, time, distance, velocity, merchant category) |
| Target | is_fraud (binary) |
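The project's generator lives in `data/generate_data.py`; the sketch below shows how such a dataset can be produced. The feature names, distributions, and parameters here are illustrative assumptions, not the actual generator:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n, fraud_rate = 10_000, 0.02
is_fraud = rng.random(n) < fraud_rate  # ~2% positive class

# Fraudulent transactions skew toward larger amounts and higher velocity.
amount = np.where(is_fraud,
                  rng.lognormal(mean=5.5, sigma=1.0, size=n),
                  rng.lognormal(mean=3.5, sigma=1.0, size=n))
velocity = np.where(is_fraud,
                    rng.poisson(6, size=n),
                    rng.poisson(1, size=n))

df = pd.DataFrame({
    "amount": amount.round(2),
    "txn_velocity_24h": velocity,
    "is_fraud": is_fraud.astype(int),
})
print(df["is_fraud"].mean())  # roughly 0.02
```

Keeping the fraud signal partially overlapping with legitimate traffic (rather than cleanly separable) is what makes the downstream imbalance handling and threshold tuning meaningful.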
## Class imbalance handling
- SMOTE (Synthetic Minority Over-sampling Technique) to balance training data
- class_weight="balanced" for Logistic Regression and Random Forest
- scale_pos_weight for XGBoost, is_unbalance for LightGBM
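SMOTE creates synthetic minority samples by interpolating between a fraud example and one of its k nearest fraud neighbors. The project would use `imblearn.over_sampling.SMOTE`; the numpy sketch below only illustrates the core idea (the function name and parameters are made up for illustration):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    chosen point toward one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]     # k nearest per point

    idx = rng.integers(0, len(X_min), size=n_new)          # base points
    nn = neighbors[idx, rng.integers(0, k, size=n_new)]    # one random neighbor each
    gap = rng.random((n_new, 1))                           # interpolation factor in [0, 1)
    return X_min[idx] + gap * (X_min[nn] - X_min[idx])

X_min = np.random.default_rng(1).normal(size=(20, 3))  # stand-in fraud rows
X_syn = smote_sketch(X_min, n_new=50)
```

In practice, imbalanced-learn's `SMOTE().fit_resample(X_train, y_train)` should be applied to the training folds only, so synthetic points never leak into evaluation.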
## Model training
- Four classifiers: Logistic Regression, Random Forest, XGBoost, LightGBM
- 5-fold StratifiedKFold cross-validation
- Metrics: AUC-ROC, precision, recall, F1, PR-AUC
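The cross-validated training loop can be sketched with scikit-learn alone. Two of the four models are shown (XGBoost and LightGBM plug into the same `fit`/`predict_proba` loop), and the synthetic data merely stands in for the project's transaction features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the transaction features: ~2% positive class
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.98],
                           random_state=0)

models = {
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(class_weight="balanced",
                                               max_iter=1000)),
    "rf": RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                 random_state=0),
}

# Stratified folds keep the ~2% fraud rate in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    aucs = []
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], proba))
    scores[name] = np.mean(aucs)
```

Stratification matters here: with only ~2% positives, an unstratified split can easily produce a fold with no fraud cases at all, making AUC undefined.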
## SHAP explainability
- TreeExplainer for gradient boosting models
- Global feature importance via mean absolute SHAP values
- Waterfall plots for individual transaction explanations
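Global importance is just the mean absolute SHAP value per feature. Assuming an `(n_samples, n_features)` array like the one returned by `shap.TreeExplainer(model).shap_values(X)`, the aggregation reduces to the snippet below (the array is random stand-in data and the feature names are illustrative):

```python
import numpy as np

feature_names = ["amount", "hour", "distance_km", "txn_velocity_24h"]  # illustrative

# Stand-in for shap.TreeExplainer(model).shap_values(X): one row per
# transaction, one signed contribution per feature.
shap_values = np.random.default_rng(0).normal(scale=[2.0, 0.3, 0.8, 1.5],
                                              size=(500, 4))

# Global importance: mean absolute contribution per feature, ranked.
importance = np.abs(shap_values).mean(axis=0)
ranking = sorted(zip(feature_names, importance), key=lambda t: -t[1])
for name, val in ranking:
    print(f"{name:18s} {val:.3f}")
```

Taking the absolute value before averaging is the key step: positive and negative contributions would otherwise cancel, hiding features that push predictions strongly in both directions.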
## Threshold optimization
- Business cost model: FN cost ($500 fraud loss) vs FP cost ($25 friction)
- Sweep thresholds from 0.05 to 0.95 to minimize total cost
- Achieves 94% recall at 3% false positive rate
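The cost sweep above can be sketched directly: score each candidate threshold by total dollar cost and keep the argmin. The probabilities below are simulated stand-ins for a trained model's scores:

```python
import numpy as np

FN_COST, FP_COST = 500, 25  # missed fraud loss vs. customer friction

def total_cost(y_true, proba, threshold):
    pred = proba >= threshold
    fn = np.sum((y_true == 1) & ~pred)  # fraud we missed
    fp = np.sum((y_true == 0) & pred)   # legitimate transactions flagged
    return fn * FN_COST + fp * FP_COST

# Stand-in scores: fraud cases concentrated at higher probabilities
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.02).astype(int)
proba = np.clip(rng.normal(0.15 + 0.6 * y, 0.15), 0, 1)

thresholds = np.arange(0.05, 0.96, 0.01)
costs = [total_cost(y, proba, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
```

Because a missed fraud costs 20x a false alarm, the optimal threshold lands well below the default 0.5, trading extra friction for higher recall.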
Built as part of the Calgary Data Portfolio.