A complete Machine Learning fraud detection system featuring a modular pipeline, PostgreSQL persistence, an interactive CLI, and a real-time web dashboard. Developed as part of the FIAP Tech Challenge, Phase 3.
Want to see it in action? Run the dashboard in just 3 steps:
```bash
# 1. Start the Flask server
python run.py

# 2. Open in your browser
#    http://127.0.0.1:5000

# 3. Test the detector!
#    - Click "LEGÍTIMA" (legitimate) or "FRAUDULENTA" (fraudulent)
#    - Click "EXECUTAR SIMULAÇÃO" (run simulation)
#    - Watch real-time results:
#      - Model prediction (classified by XGBoost)
#      - Ground truth (actual transaction type)
#      - Fraud probability (0-100%)
#      - Confidence level (HIGH / MODERATE / LOW)
#      - Inference latency (measured in real time)
```
What you'll see:
- Simulation Panel: trigger legitimate or fraudulent transactions in one click
- Real-Time Stats: total transactions, detected frauds, recall rate, latency
- Full History: all classifications with predicted vs. actual label
Tip: Try simulating multiple fraudulent transactions. The model will realistically miss some (~15% error), demonstrating that it's not overfitted.
Develop a fraud detection system using an optimized XGBoost model with:
- Handling of highly imbalanced data (1:578 fraud ratio; see the sketch after this list)
- Modular MVC-ML pipeline (processing → training → inference)
- Robust validation with StratifiedKFold
- Automated Grid Search with hyperparameter versioning
- PostgreSQL (data) + Pickle (models) persistence
- Complete CLI with 4 operational modes
- Flask REST API backend for real-time simulation
- Interactive web dashboard for live monitoring and insights
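Two of these choices, the imbalance weighting and the StratifiedKFold validation, fit together as in this minimal sketch. It assumes the raw Kaggle CSV and XGBoost's sklearn API; the project's actual training code lives in `src/ml/` and may differ:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

df = pd.read_csv("data/creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]

# 1:578 imbalance -> weight the positive class by the negative/positive
# ratio (~577), which is where scale_pos_weight=577 comes from.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

model = XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    eval_metric="aucpr",   # PR-AUC suits heavily imbalanced targets
    n_estimators=100,
)

# StratifiedKFold preserves the ~0.17% fraud rate in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(f"PR-AUC per fold: {scores.round(4)}")
```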
The system follows a layered modular architecture for clear separation of concerns:
- Model (M): `src/models/` → SQLAlchemy ORM + ML configs
- View (V): `src/services/frontend/` → HTML templates + static assets
- Controller (C): `src/services/backend/` → Flask routes + logic
- ML Pipeline: `src/ml/` → data processing and training
- Services: `src/services/` → infrastructure components (DB, ML, Frontend, Backend)
```
ml-fraud-detector/
├── data/                              # Datasets and configs
│   ├── archive/                       # Older hyperparameter versions
│   ├── examples/                      # Sample transactions
│   ├── creditcard.csv                 # Original dataset (284,807 rows)
│   └── xgboost_hyperparameters.json
│
├── database/                          # PostgreSQL config
│   ├── migrations/                    # SQL migrations
│   ├── docker-compose.yml             # PostgreSQL 15 setup
│   └── schema.sql                     # Schema (7 pipeline + 2 webapp tables)
│
├── docs/                              # Documentation
│   ├── images/                        # Charts and screenshots
│   ├── API_ENDPOINTS.md               # REST API docs
│   ├── DATA_ARCHITECTURE.md           # PostgreSQL + Pickle architecture
│   ├── DECISOES_TECNICAS.md           # ML optimizations
│   ├── EDA_REPORT.md                  # Exploratory data analysis
│   ├── MODEL_SELECTION.md             # Model comparison
│   └── TRANSACTION_EXAMPLES.md        # Examples
│
├── models/                            # Trained ML models
│   ├── archive/                       # Older versions
│   ├── scalers.pkl                    # RobustScaler + StandardScaler
│   └── xgboost_v2.1.0.pkl             # Production model
│
├── reports/                           # Generated analysis
│   └── feature_selection_analysis.json
│
├── src/                               # Source code
│   ├── ml/                            # ML pipeline
│   ├── models/                        # SQLAlchemy ORM models
│   ├── services/                      # Service layer
│   └── __init__.py
│
├── main.py                            # CLI entry point
├── run.py                             # Flask app entry
├── requirements.txt                   # Dependencies
└── README.md
```
```
CSV (raw) → [01] Load → [02] Outlier Analysis → [03] Missing Values
          → [04] Normalize → [05] Feature Engineering → [06] Feature Selection
          → [07] Train/Test Split → Ready for training
```
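To illustrate the step-based design, here is a hypothetical sketch of how the steps chain DataFrame-to-DataFrame. The function names are invented for illustration; the real steps live in `src/ml/` and persist their intermediates to PostgreSQL:

```python
import pandas as pd

# Hypothetical step functions: each takes and returns a DataFrame, so the
# chain mirrors the [01]-[07] flow above.
def load_data(path: str) -> pd.DataFrame:               # [01]
    return pd.read_csv(path)

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:   # [03]
    return df.dropna()

def run_pipeline(path: str) -> pd.DataFrame:
    df = load_data(path)
    # The remaining steps ([02], [04]-[07]) slot into this list in order.
    for step in [handle_missing]:
        df = step(df)
    return df

features = run_pipeline("data/creditcard.csv")
```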
PostgreSQL Tables
- `raw_transactions`, `cleaned_transactions`, … `test_features`
- `classification_results`, `simulated_transactions`

Pickle Files
- `models/scalers.pkl` → RobustScaler + StandardScaler
- `models/xgboost_v2.1.0.pkl` → production model

JSON Configs
- `xgboost_hyperparameters.json` → active hyperparameters
- `archive/` → previous versions (auto-timestamped)
Hybrid Design Rationale
- PostgreSQL → traceability + analytics
- Pickle → fast model loading
- JSON → version control for reproducibility
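A minimal sketch of how the three stores come together at inference time, using the artifact paths from the repository layout above (the internal structure of `scalers.pkl` is an assumption):

```python
import json
import pickle

# Pickle: fast binary loading of the fitted model and scalers
with open("models/xgboost_v2.1.0.pkl", "rb") as f:
    model = pickle.load(f)
with open("models/scalers.pkl", "rb") as f:
    scalers = pickle.load(f)  # RobustScaler + StandardScaler (assumed container)

# JSON: human-readable, version-controlled hyperparameters
with open("data/xgboost_hyperparameters.json") as f:
    params = json.load(f)

print(type(model).__name__, params)
```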
- Python 3.13, Pandas, XGBoost 3.0.5
- Flask 3.0.3, SQLAlchemy, PostgreSQL 15
- Docker Compose, ThreadPoolExecutor, COPY optimization
- Source: Kaggle – Credit Card Fraud Detection
- Transactions: 284,807 (2 days, September 2013)
- Features: 30 (PCA-transformed + Time, Amount)
- Target: Class (0=legit, 1=fraud)
- Imbalance: 492 frauds (0.172%) vs 284,315 legitimate
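The imbalance is easy to verify straight from the CSV with a quick pandas check:

```python
import pandas as pd

df = pd.read_csv("data/creditcard.csv")
counts = df["Class"].value_counts()
print(counts)                                        # 0: 284315, 1: 492
print(f"fraud rate: {counts[1] / len(df):.3%}")      # ~0.17%
print(f"ratio: 1:{round(counts[0] / counts[1])}")    # 1:578
```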
- Python β₯3.13
- PostgreSQL β₯15
- 2GB+ RAM
```bash
git clone <repo-url>
cd ml-fraud-detector
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Or, faster:

```bash
uv pip install -e .
```
Option A: Docker (recommended)

```bash
cd database
docker-compose up -d
```

Option B: Local setup

```bash
createdb fraud_detection
psql -c "CREATE USER fraud_user WITH PASSWORD 'fraud_pass_dev';"
psql -U fraud_user -d fraud_detection -f database/schema.sql
```
```bash
python main.py pipeline
```

Supports 4 main modes:

- Pipeline: runs the full data processing pipeline (Steps 01-07)
- Train: trains XGBoost with the current JSON hyperparameters
- Tune: performs an automated Grid Search and updates the JSON config (see the sketch after this list)
- Inference: runs inference on CSV or JSON input
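Conceptually, the tuning mode is a Grid Search scored on PR-AUC whose winner is written back to the JSON config. A hedged sketch of that loop (the illustrative grid and the CSV stand-in are assumptions; the real mode reads its features from PostgreSQL):

```python
import json
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Stand-in for the pipeline's train split
df = pd.read_csv("data/creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]

# Illustrative grid; the project's actual search space may differ
param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.1, 0.3],
    "subsample": [0.7, 1.0],
}

search = GridSearchCV(
    XGBClassifier(scale_pos_weight=577, eval_metric="aucpr"),
    param_grid,
    scoring="average_precision",  # PR-AUC, matching the eval metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X, y)

# Persist the winner so the training mode picks it up on the next run
with open("data/xgboost_hyperparameters.json", "w") as f:
    json.dump(search.best_params_, f, indent=2)
```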
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/simulate` | Generate and classify a transaction |
| GET | `/api/stats` | Get aggregated stats |
| GET | `/api/history` | Retrieve prediction history |
| GET | `/health` | Health check |
Full API reference: `docs/API_ENDPOINTS.md`
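For example, exercising the simulation endpoint from Python with only the standard library (the payload shape is an assumption; `docs/API_ENDPOINTS.md` has the actual contract):

```python
import json
import urllib.request

# Assumed payload shape; see docs/API_ENDPOINTS.md for the real contract
body = json.dumps({"type": "fraudulent"}).encode()
req = urllib.request.Request(
    "http://127.0.0.1:5000/api/simulate",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))   # prediction, probability, latency, ...

with urllib.request.urlopen("http://127.0.0.1:5000/api/stats") as resp:
    print(json.load(resp))   # aggregated stats
```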
Interactive web dashboard for simulation and monitoring:
- Run real-time simulations
- Track recall, latency, and fraud rate
- Review full prediction history
- Technical Reports: EDA, model selection, and data architecture
- Plans: MVP roadmap, Kafka scalability, changelog
- Architecture Docs: `src/ml/README.md`

Current production hyperparameters (`src/ml/models/configs.py`):

```python
xgboost_params = {
    'colsample_bytree': 0.7,
    'learning_rate': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'n_estimators': 100,
    'scale_pos_weight': 577,
    'subsample': 0.7,
    'eval_metric': 'aucpr'
}
```
| Step | Before | After | Gain | Optimization |
|---|---|---|---|---|
| Normalize | 90s | 16.5s | +81.6% | PostgreSQL COPY (sketched below) |
| Pipeline Total | 130s | 62s | +52% | Parallelization |
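The Normalize speedup comes from replacing row-by-row INSERTs with PostgreSQL's bulk COPY protocol. A minimal sketch of that technique using `psycopg2` (the driver choice, helper name, and table are assumptions, not the project's actual code):

```python
import io
import pandas as pd
import psycopg2

def copy_dataframe(df: pd.DataFrame, table: str, dsn: str) -> None:
    """Stream a DataFrame into PostgreSQL via COPY instead of INSERTs."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)

# Hypothetical usage:
# copy_dataframe(df, "cleaned_transactions",
#                "dbname=fraud_detection user=fraud_user password=fraud_pass_dev")
```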
| Metric | v1.1.0 | v2.0.0 | v2.1.0 (production) |
|---|---|---|---|
| PR-AUC | 0.8719 | 0.8847 | 0.8772 |
| Precision | 72.27% | 85.42% | 86.60% |
| Recall | 87.76% | 83.67% | 81.63% |
| F1-Score | 79.26% | 84.54% | 84.04% |
Includes:
- Pipeline tables (raw → processed)
- Webapp tables (`classification_results`, `simulated_transactions`)
Status: Completed. Next step: optional Kafka streaming for scalability.
The modular ML pipeline architecture is designed for easy Apache Airflow integration:
- Step-based structure (01-07) maps directly to Airflow DAGs
- Service layer separation enables distributed task execution
- JSON configs + PostgreSQL provide stateful orchestration support
- CLI modes (`train`, `tune`, `pipeline`) can become Airflow operators
This allows seamless migration from local execution to scheduled, distributed ML workflows without code restructuring.
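A hedged sketch of that mapping, assuming Airflow 2.x, with each CLI mode wrapped in a BashOperator (DAG id and schedule are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="fraud_detection_ml",      # placeholder id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    pipeline = BashOperator(task_id="pipeline",
                            bash_command="python main.py pipeline")
    train = BashOperator(task_id="train",
                         bash_command="python main.py train")

    pipeline >> train   # steps 01-07, then training on the fresh features
```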
In large-scale scenarios, Apache Kafka can enable distributed ingestion, parallel consumers, and real-time reprocessing.
- Dataset: Kaggle – Credit Card Fraud Detection
- Docs: XGBoost, Flask, PostgreSQL
- Victor Lucas Santos de Oliveira – LinkedIn
- Adrianny Lelis da Silva – LinkedIn