🚨 Credit Card Fraud Detection System

A complete Machine Learning fraud detection system featuring a modular pipeline, PostgreSQL persistence, an interactive CLI, and a real-time web dashboard. Developed as part of the FIAP Tech Challenge – Phase 3.


🚀 Quick Demo

Want to see it in action? Run the dashboard in just 3 steps:

# 1. Start the Flask server
python run.py

# 2. Open in your browser
# http://127.0.0.1:5000

# 3. Test the detector!
#    → Click "LEGÍTIMA" (legitimate) or "FRAUDULENTA" (fraudulent)
#    → Click "EXECUTAR SIMULAÇÃO" (run simulation)
#    → Watch real-time results:
#       ✅ Model prediction (classified by XGBoost)
#       🎯 Ground truth (actual transaction type)
#       📊 Fraud probability (0–100%)
#       🔍 Confidence level (HIGH / MODERATE / LOW)
#       ⚡ Inference latency (measured in real time)

🎯 What you'll see:

  • Simulation Panel – trigger legitimate or fraudulent transactions in one click
  • Real-Time Stats – total transactions, detected frauds, recall rate, latency
  • Full History – all classifications with prediction vs actual label

💡 Tip: Try simulating multiple fraudulent transactions. The model will realistically miss some (~15% error), demonstrating that it's not overfitted.
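The HIGH / MODERATE / LOW confidence label can be derived from how far the predicted probability sits from the decision boundary. A minimal sketch (the 0.9 / 0.6 thresholds here are illustrative assumptions, not the dashboard's actual cut-offs):

```python
def confidence_level(fraud_probability: float) -> str:
    """Bucket a fraud probability into a confidence label.

    Thresholds are illustrative assumptions; the dashboard's real
    cut-offs may differ.
    """
    # Distance from the 0.5 decision boundary drives confidence:
    # probabilities near 0 or 1 mean the model is sure either way.
    certainty = max(fraud_probability, 1.0 - fraud_probability)
    if certainty >= 0.9:
        return "HIGH"
    if certainty >= 0.6:
        return "MODERATE"
    return "LOW"
```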


🎯 Objective

Develop a fraud detection system using an optimized XGBoost model with:

  • ✅ Handling of highly imbalanced data (1:578 fraud ratio)
  • ✅ Modular MVC-ML pipeline (processing → training → inference)
  • ✅ Robust validation with StratifiedKFold
  • ✅ Automated Grid Search with hyperparameter versioning
  • ✅ PostgreSQL (data) + Pickle (models) persistence
  • ✅ Complete CLI with 4 operational modes
  • ✅ Flask REST API backend for real-time simulation
  • ✅ Interactive web dashboard for live monitoring and insights

πŸ—οΈ Project Architecture

Modular MVC-ML + Service Layer

The system follows a layered modular architecture for clear separation of concerns:

  • Model (M): src/models/ – SQLAlchemy ORM + ML configs
  • View (V): src/services/frontend/ – HTML templates + static assets
  • Controller (C): src/services/backend/ – Flask routes + logic
  • ML Pipeline: src/ml/ – data processing and training
  • Services: src/services/ – infrastructure components (DB, ML, Frontend, Backend)

ml-fraud-detector/
├── data/                        # 📊 Datasets and configs
│   ├── archive/                 # Older hyperparameter versions
│   ├── examples/                # Sample transactions
│   ├── creditcard.csv           # Original dataset (284,807 rows)
│   └── xgboost_hyperparameters.json
│
├── database/                    # 🗄️ PostgreSQL config
│   ├── migrations/              # SQL migrations
│   ├── docker-compose.yml       # PostgreSQL 15 setup
│   └── schema.sql               # Schema (7 pipeline + 2 webapp tables)
│
├── docs/                        # 📚 Documentation
│   ├── images/                  # Charts and screenshots
│   ├── API_ENDPOINTS.md         # REST API docs
│   ├── DATA_ARCHITECTURE.md     # PostgreSQL + Pickle architecture
│   ├── DECISOES_TECNICAS.md     # ML optimizations
│   ├── EDA_REPORT.md            # Exploratory data analysis
│   ├── MODEL_SELECTION.md       # Model comparison
│   └── TRANSACTION_EXAMPLES.md  # Examples
│
├── models/                      # 🤖 Trained ML models
│   ├── archive/                 # Older versions
│   ├── scalers.pkl              # RobustScaler + StandardScaler
│   └── xgboost_v2.1.0.pkl       # ⭐ Production model
│
├── reports/                     # 📈 Generated analysis
│   └── feature_selection_analysis.json
│
├── src/                         # 💻 Source code
│   ├── ml/                      # 🧠 ML pipeline
│   ├── models/                  # 📦 SQLAlchemy ORM models
│   ├── services/                # 🔌 Service layer
│   └── __init__.py
│
├── main.py                      # 🎯 CLI entry point
├── run.py                       # 🚀 Flask app entry
├── requirements.txt             # Dependencies
└── README.md

Data Processing Pipeline (7 Steps)

CSV (raw) → [01] Load → [02] Outlier Analysis → [03] Missing Values
         → [04] Normalize → [05] Feature Engineering → [06] Feature Selection
         → [07] Train/Test Split → ✅ Ready for training
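The step-based design above can be sketched as a chain of small functions threaded through a shared state. The step functions here are hypothetical stand-ins; the real implementation operates on DataFrames and persists intermediate results to PostgreSQL:

```python
from typing import Callable

Step = Callable[[dict], dict]

def run_pipeline(state: dict, steps: list[Step]) -> dict:
    """Run each step in order, threading the state dict through."""
    for step in steps:
        state = step(state)
    return state

# Hypothetical stand-ins for the 7 stages (names only mirror the diagram).
def load(s):                 return {**s, "loaded": True}
def outlier_analysis(s):     return {**s, "outliers_checked": True}
def missing_values(s):       return {**s, "imputed": True}
def normalize(s):            return {**s, "scaled": True}
def feature_engineering(s):  return {**s, "features_added": True}
def feature_selection(s):    return {**s, "features_selected": True}
def train_test_split(s):     return {**s, "split": True}

STEPS = [load, outlier_analysis, missing_values, normalize,
         feature_engineering, feature_selection, train_test_split]

result = run_pipeline({}, STEPS)
```

Keeping each stage a pure function over shared state is what later makes the steps easy to map onto orchestrators such as Airflow.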

Persistence Architecture

PostgreSQL Tables

  • raw_transactions, cleaned_transactions, … test_features
  • classification_results, simulated_transactions

Pickle Files

  • models/scalers.pkl – RobustScaler + StandardScaler
  • models/xgboost_v2.1.0.pkl – production model

JSON Configs

  • xgboost_hyperparameters.json – active hyperparameters
  • archive/ – previous versions (auto-timestamped)

Hybrid Design Rationale

  • PostgreSQL β†’ traceability + analytics
  • Pickle β†’ fast model loading
  • JSON β†’ version control for reproducibility
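The hybrid layout can be sketched as a small save routine: the model goes to Pickle, the hyperparameters to JSON, and the previous JSON is moved into archive/ with a timestamp, mirroring the auto-archiving convention described above (file names here are assumptions):

```python
import json
import pickle
import time
from pathlib import Path

def save_artifacts(model, params: dict, out_dir: Path) -> None:
    """Persist a trained model as Pickle and its hyperparameters as JSON,
    archiving the previous JSON with a timestamp first."""
    out_dir.mkdir(parents=True, exist_ok=True)
    hp_path = out_dir / "xgboost_hyperparameters.json"
    # Archive the previous hyperparameter file before overwriting it.
    if hp_path.exists():
        archive = out_dir / "archive"
        archive.mkdir(exist_ok=True)
        stamp = time.strftime("%Y%m%d_%H%M%S")
        hp_path.rename(archive / f"xgboost_hyperparameters_{stamp}.json")
    hp_path.write_text(json.dumps(params, indent=2))
    with open(out_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
```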

Technology Stack

  • Python 3.13, Pandas, XGBoost 3.0.5
  • Flask 3.0.3, SQLAlchemy, PostgreSQL 15
  • Docker Compose, ThreadPoolExecutor, COPY optimization

📊 Dataset

  • Source: Kaggle – Credit Card Fraud Detection
  • Transactions: 284,807 (2 days, September 2013)
  • Features: 30 (PCA-transformed + Time, Amount)
  • Target: Class (0=legit, 1=fraud)
  • Imbalance: 492 frauds (0.172%) vs 284,315 legitimate
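The 1:578 ratio quoted above follows directly from these counts, and it is also where the scale_pos_weight ≈ 577 in the training config comes from (a common XGBoost heuristic: negatives divided by positives):

```python
# Counts from the dataset description above.
legit, fraud = 284_315, 492

ratio = legit / fraud   # negatives per positive
print(round(ratio))     # 578 -> the "1:578" imbalance; the config's
                        # scale_pos_weight of 577 is this ratio, floored
```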

🚀 Quick Start

1. Requirements

  • Python ≥3.13
  • PostgreSQL ≥15
  • 2GB+ RAM

2. Installation

git clone <repo-url>
cd ml-fraud-detector
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

or faster:

uv pip install -e .

3. Setup PostgreSQL

Option A: Docker (recommended)

cd database
docker-compose up -d

Option B: Local setup

createdb fraud_detection
psql -d fraud_detection -c "CREATE USER fraud_user WITH PASSWORD 'fraud_pass_dev';"
psql -U fraud_user -d fraud_detection -f database/schema.sql

4. Run the Pipeline

python main.py pipeline

🖥️ CLI Interface (main.py)

Supports 4 main modes:

1️⃣ pipeline

Runs the full data processing pipeline (Steps 01–07).

2️⃣ train

Trains XGBoost with the current JSON hyperparameters.

3️⃣ tune

Performs automated Grid Search and updates the JSON.

4️⃣ predict

Runs inference on CSV or JSON input.
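The four modes map naturally onto argparse sub-commands. A sketch of how such a CLI could be wired (the real main.py may declare its arguments differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a CLI with the four operational modes described above."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="mode", required=True)
    sub.add_parser("pipeline", help="Run processing steps 01-07")
    sub.add_parser("train", help="Train XGBoost with current JSON hyperparameters")
    sub.add_parser("tune", help="Grid Search, then update the JSON")
    predict = sub.add_parser("predict", help="Inference on CSV or JSON input")
    predict.add_argument("input", help="Path to a CSV or JSON file")
    return parser

# e.g. `python main.py predict transactions.csv`
args = build_parser().parse_args(["predict", "transactions.csv"])
```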


🌐 Flask REST API

Endpoints

Method   Endpoint        Description
POST     /api/simulate   Generate and classify a transaction
GET      /api/stats      Get aggregated stats
GET      /api/history    Retrieve prediction history
GET      /health         Health check
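Based on the dashboard description, /api/simulate plausibly returns the prediction, the ground truth, the fraud probability, and the measured latency. A framework-agnostic sketch of assembling such a payload (field names are assumptions, not the documented schema; see docs/API_ENDPOINTS.md for the real contract):

```python
import random
import time

def simulate_response(model_predict, is_fraud: bool) -> dict:
    """Generate a stand-in transaction, classify it, and time inference.

    `model_predict` is any callable mapping a feature vector to a
    fraud probability in [0, 1].
    """
    features = [random.gauss(0, 1) for _ in range(30)]   # stand-in transaction
    start = time.perf_counter()
    probability = model_predict(features)                 # model's fraud probability
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "prediction": int(probability >= 0.5),   # 1 = flagged as fraud
        "ground_truth": int(is_fraud),           # what was actually simulated
        "fraud_probability": round(probability, 4),
        "latency_ms": round(latency_ms, 2),
    }
```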

Full API reference: docs/API_ENDPOINTS.md


🎨 Web Dashboard

Interactive web dashboard for simulation and monitoring:

  • Run real-time simulations
  • Track recall, latency, and fraud rate
  • Review full prediction history

📚 Documentation

  • Technical Reports: EDA, model selection, and data architecture
  • Plans: MVP roadmap, Kafka scalability, changelog
  • Architecture Docs: src/ml/README.md

🔧 Key Configs

src/ml/models/configs.py:

xgboost_params = {
    'colsample_bytree': 0.7,
    'learning_rate': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'n_estimators': 100,
    'scale_pos_weight': 577,
    'subsample': 0.7,
    'eval_metric': 'aucpr'
}

📊 Performance Metrics

Pipeline Optimizations (52% faster ⚡)

Step            Before   After   Gain     Optimization
Normalize       90s      16.5s   +81.6%   PostgreSQL COPY
Pipeline Total  130s     62s     +52%     Parallelization
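COPY beats row-by-row INSERTs because the whole batch streams to the server in one round trip. A sketch of preparing an in-memory CSV buffer for it; the psycopg2 call in the final comment is the assumed integration point, not code from this repo:

```python
import csv
import io

def to_copy_buffer(rows: list[dict], columns: list[str]) -> io.StringIO:
    """Serialize rows into an in-memory CSV buffer suitable for
    PostgreSQL's COPY ... FROM STDIN bulk loading."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    buf.seek(0)  # rewind so COPY reads from the start
    return buf

# With psycopg2, the buffer would then be streamed in one round trip, e.g.:
#   cur.copy_expert("COPY cleaned_transactions (amount, class) FROM STDIN WITH CSV", buf)
```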

XGBoost Results

Metric     v1.1.0   v2.0.0   v2.1.0 ⭐
PR-AUC     0.8719   0.8847   0.8772
Precision  72.27%   85.42%   86.60%
Recall     87.76%   83.67%   81.63%
F1-Score   79.26%   84.54%   84.04%
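The F1 row is just the harmonic mean of precision and recall; a quick check against the v2.1.0 figures above:

```python
# v2.1.0 figures from the results table above.
precision, recall = 0.8660, 0.8163

# F1 = harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(100 * f1, 2))   # 84.04 -- matches the table
```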

πŸ—„οΈ PostgreSQL Schema

Includes:

  • Pipeline tables (raw → processed)
  • Webapp tables (classification_results, simulated_transactions)

🚀 Status

✅ Completed. 📋 Next step: optional Kafka streaming for scalability.

🔄 Future Scalability

The modular ML pipeline architecture is designed for easy Apache Airflow integration:

  • Step-based structure (01-07) maps directly to Airflow DAGs
  • Service layer separation enables distributed task execution
  • JSON configs + PostgreSQL provide stateful orchestration support
  • CLI modes (train, tune, pipeline) can become Airflow operators

This allows seamless migration from local execution to scheduled, distributed ML workflows without code restructuring.

In large-scale scenarios, Apache Kafka can enable distributed ingestion, parallel consumers, and real-time reprocessing.


👥 Authors

  • Victor Lucas Santos de Oliveira – LinkedIn
  • Adrianny Lelis da Silva – LinkedIn

