A fraud detection pipeline that identifies fraudulent credit card transactions using gradient boosting with SMOTE oversampling and SHAP-based explanations.
Credit card fraud accounts for billions in losses annually, yet fraudulent transactions make up less than 2% of all activity. This extreme class imbalance makes standard classifiers unreliable. This project builds a detection system that trains four models (Logistic Regression, Random Forest, XGBoost, LightGBM) on synthetic transaction data, applies SMOTE to handle class imbalance, and uses SHAP to explain individual predictions. A threshold optimization module balances false positive costs against missed fraud losses.
Problem → Detecting rare fraud events in highly imbalanced transaction data
Solution → XGBoost with SMOTE oversampling, SHAP explanations, and cost-based threshold tuning
Impact → AUC 0.97; catches 94% of fraud at a 3% false positive rate
| Metric | Value |
|---|---|
| AUC-ROC | 0.97 |
| Recall (fraud caught) | 94% |
| False positive rate | 3% |
| PR-AUC | 0.82 |
| Best model | XGBoost |
```
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Synthetic data  │───▶│      SMOTE       │───▶│     Feature      │
│    generation    │    │   oversampling   │    │     scaling      │
└──────────────────┘    └──────────────────┘    └────────┬─────────┘
                                                         │
                           ┌─────────────────────────────┘
                           ▼
              ┌──────────────────────┐    ┌──────────────────────┐
              │    Model training    │───▶│      Threshold       │
              │   (4 classifiers)    │    │     optimization     │
              └──────────────────────┘    └──────────┬───────────┘
                                                     │
                        ┌────────────────────────────┘
                        ▼
              ┌──────────────────────┐    ┌──────────────────────┐
              │        SHAP          │───▶│    Fraud scoring     │
              │    explanations      │    │      dashboard       │
              └──────────────────────┘    └──────────────────────┘
```
## Project structure
```
project_17_fraud_detection/
├── data/
│   ├── fraud_transactions.csv        # Transaction dataset
│   └── generate_data.py              # Synthetic data generator
├── src/
│   ├── __init__.py
│   ├── data_loader.py                # Data generation and loading
│   └── model.py                      # Training, evaluation, SHAP
├── notebooks/
│   ├── 01_eda.ipynb                  # Exploratory data analysis
│   ├── 02_feature_engineering.ipynb  # SMOTE, scaling, interactions
│   ├── 03_modeling.ipynb             # Model training and CV
│   └── 04_evaluation.ipynb           # ROC, SHAP, cost analysis
├── app.py                            # Streamlit dashboard
├── requirements.txt
└── README.md
```
```bash
# Clone and navigate
git clone https://github.com/guydev42/calgary-data-portfolio.git
cd calgary-data-portfolio/project_17_fraud_detection

# Install dependencies
pip install -r requirements.txt

# Generate transaction data
python data/generate_data.py

# Launch dashboard
streamlit run app.py
```

| Property | Details |
|---|---|
| Source | Synthetic transaction data modeled on real-world fraud patterns |
| Transactions | 10,000 |
| Fraud rate | ~2% (200 fraudulent transactions) |
| Features | 10 (including amount, time, distance, velocity, merchant category) |
| Target | is_fraud (binary) |
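The project's generator lives in `data/generate_data.py`; the sketch below shows how such a dataset can be produced. The feature names, distributions, and parameters here are illustrative assumptions, not the actual generator:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n, fraud_rate = 10_000, 0.02
is_fraud = rng.random(n) < fraud_rate  # ~2% positive class

# Fraudulent transactions skew toward larger amounts and higher velocity.
amount = np.where(is_fraud,
                  rng.lognormal(mean=5.5, sigma=1.0, size=n),
                  rng.lognormal(mean=3.5, sigma=1.0, size=n))
velocity = np.where(is_fraud,
                    rng.poisson(6, size=n),
                    rng.poisson(1, size=n))

df = pd.DataFrame({
    "amount": amount.round(2),
    "txn_velocity_24h": velocity,
    "is_fraud": is_fraud.astype(int),
})
print(df["is_fraud"].mean())  # roughly 0.02
```

Keeping the fraud signal partially overlapping with legitimate traffic (rather than cleanly separable) is what makes the downstream imbalance handling and threshold tuning meaningful.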
## Class imbalance handling
- SMOTE (Synthetic Minority Over-sampling Technique) to balance training data
- class_weight="balanced" for Logistic Regression and Random Forest
- scale_pos_weight for XGBoost, is_unbalance for LightGBM
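SMOTE creates synthetic minority samples by interpolating between a fraud example and one of its k nearest fraud neighbors. The project would use `imblearn.over_sampling.SMOTE`; the numpy sketch below only illustrates the core idea (the function name and parameters are made up for illustration):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    chosen point toward one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]     # k nearest per point

    idx = rng.integers(0, len(X_min), size=n_new)          # base points
    nn = neighbors[idx, rng.integers(0, k, size=n_new)]    # one random neighbor each
    gap = rng.random((n_new, 1))                           # interpolation factor in [0, 1)
    return X_min[idx] + gap * (X_min[nn] - X_min[idx])

X_min = np.random.default_rng(1).normal(size=(20, 3))  # stand-in fraud rows
X_syn = smote_sketch(X_min, n_new=50)
```

In practice, imbalanced-learn's `SMOTE().fit_resample(X_train, y_train)` should be applied to the training folds only, so synthetic points never leak into evaluation.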
## Model training
- Four classifiers: Logistic Regression, Random Forest, XGBoost, LightGBM
- 5-fold StratifiedKFold cross-validation
- Metrics: AUC-ROC, precision, recall, F1, PR-AUC
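The cross-validated training loop can be sketched with scikit-learn alone. Two of the four models are shown (XGBoost and LightGBM plug into the same `fit`/`predict_proba` loop), and the synthetic data merely stands in for the project's transaction features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the transaction features: ~2% positive class
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.98],
                           random_state=0)

models = {
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(class_weight="balanced",
                                               max_iter=1000)),
    "rf": RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                 random_state=0),
}

# Stratified folds keep the ~2% fraud rate in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    aucs = []
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], proba))
    scores[name] = np.mean(aucs)
```

Stratification matters here: with only ~2% positives, an unstratified split can easily produce a fold with no fraud cases at all, making AUC undefined.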
## SHAP explainability
- TreeExplainer for gradient boosting models
- Global feature importance via mean absolute SHAP values
- Waterfall plots for individual transaction explanations
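Global importance is just the mean absolute SHAP value per feature. Assuming an `(n_samples, n_features)` array like the one returned by `shap.TreeExplainer(model).shap_values(X)`, the aggregation reduces to the snippet below (the array is random stand-in data and the feature names are illustrative):

```python
import numpy as np

feature_names = ["amount", "hour", "distance_km", "txn_velocity_24h"]  # illustrative

# Stand-in for shap.TreeExplainer(model).shap_values(X): one row per
# transaction, one signed contribution per feature.
shap_values = np.random.default_rng(0).normal(scale=[2.0, 0.3, 0.8, 1.5],
                                              size=(500, 4))

# Global importance: mean absolute contribution per feature, ranked.
importance = np.abs(shap_values).mean(axis=0)
ranking = sorted(zip(feature_names, importance), key=lambda t: -t[1])
for name, val in ranking:
    print(f"{name:18s} {val:.3f}")
```

Taking the absolute value before averaging is the key step: positive and negative contributions would otherwise cancel, hiding features that push predictions strongly in both directions.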
## Threshold optimization
- Business cost model: FN cost ($500 fraud loss) vs FP cost ($25 friction)
- Sweep thresholds from 0.05 to 0.95 to minimize total cost
- Achieves 94% recall at 3% false positive rate
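The cost sweep above can be sketched directly: score each candidate threshold by total dollar cost and keep the argmin. The probabilities below are simulated stand-ins for a trained model's scores:

```python
import numpy as np

FN_COST, FP_COST = 500, 25  # missed fraud loss vs. customer friction

def total_cost(y_true, proba, threshold):
    pred = proba >= threshold
    fn = np.sum((y_true == 1) & ~pred)  # fraud we missed
    fp = np.sum((y_true == 0) & pred)   # legitimate transactions flagged
    return fn * FN_COST + fp * FP_COST

# Stand-in scores: fraud cases concentrated at higher probabilities
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.02).astype(int)
proba = np.clip(rng.normal(0.15 + 0.6 * y, 0.15), 0, 1)

thresholds = np.arange(0.05, 0.96, 0.01)
costs = [total_cost(y, proba, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
```

Because a missed fraud costs 20x a false alarm, the optimal threshold lands well below the default 0.5, trading extra friction for higher recall.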
Built as part of the Calgary Data Portfolio.