End-to-end tabular ML pipeline for credit card fraud detection with feature engineering, gradient boosting, FastAPI serving, MLflow tracking, and drift monitoring.
Academic-project note: this repository is a portfolio and academic demonstration built on the public Kaggle Credit Card Fraud Detection dataset. It is designed to showcase reproducible ML engineering patterns, not to represent a production banking system.
tabular-ml packages a full fraud-detection workflow around a heavily imbalanced binary classification problem:
- Exploratory data analysis in Jupyter.
- Reproducible feature engineering with a scikit-learn-compatible pipeline.
- Model training and tuning across XGBoost, LightGBM, and CatBoost.
- Ensemble evaluation with stacking and blending.
- FastAPI inference endpoints for single and batch predictions.
- MLflow experiment tracking and Evidently-based drift monitoring.
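The scikit-learn-compatible feature-engineering step can be sketched as below. The transformer choices and column handling are illustrative assumptions for the Kaggle schema (PCA components `V1`..`V28` plus `Time` and `Amount`), not the project's actual pipeline code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler

# The Kaggle dataset exposes anonymized PCA components V1..V28 plus Time and
# Amount. RobustScaler is a reasonable guess for the heavy-tailed Amount
# column; the project's real transformers may differ.
amount_time = ["Time", "Amount"]
pca_cols = [f"V{i}" for i in range(1, 29)]

feature_pipeline = ColumnTransformer(
    transformers=[
        ("scaled", RobustScaler(), amount_time),
        ("pca_passthrough", "passthrough", pca_cols),
    ]
)

# Toy frame with the same column layout, just to show fit/transform usage.
df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(5, 30)),
    columns=amount_time + pca_cols,
)
X = feature_pipeline.fit_transform(df)
```

Because the transformer is scikit-learn compatible, it can be fitted on the training split only and reused unchanged at serving time.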
```mermaid
flowchart LR
    A["Kaggle credit card dataset"] --> B["Data loading and stratified splits"]
    B --> C["Feature engineering pipeline"]
    C --> D["Model training and Optuna tuning"]
    D --> E["Artifacts and metrics"]
    E --> F["FastAPI inference service"]
    E --> G["MLflow tracking"]
    E --> H["Drift monitoring reports"]
```
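The stratified-split step matters because positives are extremely rare (the Kaggle dataset has roughly 0.17% fraud). A minimal sketch with scikit-learn, using a toy label vector rather than the project's loader:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for heavily imbalanced fraud labels: 10 positives in 1000 rows.
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the positive rate identical across train and test splits,
# so a 20% test split receives exactly 2 of the 10 positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Without stratification, a random split of this data could easily place zero positives in the test set, making the reported metrics meaningless.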
| Model | Test PR-AUC | Test ROC-AUC | F1 | Precision | Recall |
|---|---|---|---|---|---|
| XGBoost | 0.8672 | 0.9771 | 0.8817 | 0.9318 | 0.8367 |
| LightGBM | 0.8640 | 0.9708 | 0.8770 | 0.9213 | 0.8367 |
| CatBoost | 0.8376 | 0.9762 | 0.8265 | 0.8265 | 0.8265 |
| Stacking Ensemble | 0.8622 | 0.9782 | 0.8783 | 0.9121 | 0.8469 |
| Blending Ensemble | 0.8672 | 0.9771 | 0.8817 | 0.9318 | 0.8367 |
XGBoost is the strongest standalone model by PR-AUC (0.8672), while the stacking ensemble trades a little precision for the best recall (0.8469) on the held-out test split.
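PR-AUC leads the table because it is far more informative than ROC-AUC under heavy class imbalance: its baseline is the positive rate, not 0.5. A self-contained sketch of computing both with scikit-learn on synthetic rare-positive data:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.02, size=2000)        # ~2% positives, like fraud
y_score = 0.3 * y_true + rng.random(2000)        # imperfect, overlapping scores

# PR-AUC (average precision) punishes false positives among rare positives;
# ROC-AUC can look flattering on the same scores.
pr_auc = average_precision_score(y_true, y_score)
roc_auc = roc_auc_score(y_true, y_score)
```

On this toy data the ROC-AUC comes out noticeably higher than the PR-AUC, which is exactly why the fraud results above are ranked on PR-AUC first.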
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
```

```bash
kaggle datasets download -d mlg-ulb/creditcardfraud -p data/raw/ --unzip
```

The training pipeline expects `data/raw/creditcard.csv`.
```bash
python -m pytest
```

Integration tests that require optional native ML libraries are marked with `integration` and will skip automatically if the dependency stack is unavailable.
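Marker registration along these lines keeps pytest from warning about the unknown `integration` marker (shown here as a hypothetical `pytest.ini` fragment; the project may register it in `pyproject.toml` instead):

```ini
[pytest]
markers =
    integration: requires optional native ML libraries (xgboost, lightgbm, catboost)
```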
```bash
python -m tabular_ml.models.train_all
python -m tabular_ml.models.run_ensemble
```

Generated artifacts are written to `artifacts/`, including trained models, plots, and result summaries.
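The stacking step can be sketched with scikit-learn's `StackingClassifier`. The base estimators below are lightweight stand-ins for the tuned XGBoost/LightGBM/CatBoost models, but the pattern — out-of-fold base predictions feeding a meta-learner — is the same:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy problem (~5% positives) standing in for the fraud data.
X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)

# Base learners are stand-ins for the tuned boosting models; a logistic
# regression meta-learner combines their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=0)),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X, y)
proba = stack.predict_proba(X)[:, 1]
```

Blending follows the same idea but fits the meta-learner on a single held-out split instead of cross-validated out-of-fold predictions.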
```bash
uvicorn tabular_ml.api.app:app --reload --port 8000
```

Endpoints:

- `GET /health`
- `POST /predict`
- `POST /predict/batch`
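The request schema is defined by the app's Pydantic models; as a hypothetical illustration only, a single-prediction payload for the Kaggle feature set might be assembled like this (field names are assumptions, not the actual schema):

```python
import json

# Hypothetical /predict body: the Kaggle columns Time, Amount, and V1..V28.
# Check the live OpenAPI docs at /docs for the real field names.
payload = {"Time": 406.0, "Amount": 0.0,
           **{f"V{i}": 0.0 for i in range(1, 29)}}

body = json.dumps(payload)
```

The serialized `body` would then be POSTed with `Content-Type: application/json`, e.g. via `curl -X POST http://localhost:8000/predict -d "$body"`.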
Example:
```bash
curl http://localhost:8000/health
```

Run the drift-monitoring demo:

```bash
python scripts/monitoring_demo.py
```

Or bring the full stack up with Docker:

```bash
docker compose up --build
```

Services:

- API docs: http://127.0.0.1:8000/docs
- MLflow UI: http://127.0.0.1:5001
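Drift reports in this project come from Evidently; as a self-contained illustration of the underlying idea, here is a population stability index (PSI) check in plain NumPy. This is not the project's code, just the concept behind a per-feature drift score:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D samples."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip both samples so out-of-range values land in the edge bins.
    reference = np.clip(reference, edges[0], edges[-1])
    current = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    eps = 1e-6  # avoid log(0) on empty bins
    ref_frac, cur_frac = ref_frac + eps, cur_frac + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
stable = psi(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))   # no drift
shifted = psi(rng.normal(0, 1, 5000), rng.normal(1, 1, 5000))  # mean shift
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift; the one-standard-deviation mean shift above lands well past that threshold.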
The repository now uses a documented backend preference in configs/default.yaml:
```yaml
training:
  hardware:
    preference: auto  # auto | cpu | gpu
```

- `auto` is the safe default and currently resolves to CPU.
- Apple Silicon machines resolve to CPU for this model stack because XGBoost, LightGBM, and CatBoost do not expose a native MPS backend in this project setup.
- Explicit `gpu` mode preserves the library-specific GPU settings where upstream support exists:
  - XGBoost: CUDA
  - LightGBM: OpenCL or CUDA-enabled build
  - CatBoost: GPU mode
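A hypothetical sketch of how such a preference could resolve to per-library parameters; the project's actual resolution logic may differ, though the parameter names shown (`device` for XGBoost 2.x, `device_type` for LightGBM, `task_type` for CatBoost) are the upstream ones:

```python
import platform

def resolve_backend(preference: str = "auto") -> dict:
    """Map the documented preference to per-library device settings."""
    on_apple_silicon = (
        platform.system() == "Darwin" and platform.machine() == "arm64"
    )
    if preference == "gpu" and not on_apple_silicon:
        return {
            "xgboost": {"device": "cuda"},       # XGBoost >= 2.0 device flag
            "lightgbm": {"device_type": "gpu"},  # needs OpenCL/CUDA build
            "catboost": {"task_type": "GPU"},
        }
    # "auto" and "cpu" (and gpu requests on Apple Silicon) resolve to CPU.
    return {
        "xgboost": {"device": "cpu"},
        "lightgbm": {"device_type": "cpu"},
        "catboost": {"task_type": "CPU"},
    }
```

Centralizing the mapping in one function keeps the per-library flag differences out of the individual training scripts.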
On macOS, xgboost and lightgbm may require OpenMP:
```bash
brew install libomp
```

```text
src/tabular_ml/
  api/          FastAPI application and schemas
  data/         Data loading and splitting
  features/     Feature engineering pipeline
  models/       Training, tuning, evaluation, ensembles
  monitoring/   Evidently-based drift utilities
docs/
  images/       Exported figures used in project documentation
  project-overview.md
tests/          Unit and integration tests
configs/        YAML configuration
```
- Submission-ready project overview: docs/project-overview.md
- EDA notebook: notebooks/01_eda.ipynb
The project overview is intentionally formatted as PDF-ready Markdown with Mermaid diagrams, image references, a title page, and an appendix.
Released under the MIT License. See LICENSE.

