
🌐 Language: English | ภาษาไทย

DataProfiler

Automated Data Profiling tool for PostgreSQL, Microsoft SQL Server, MySQL, and Oracle with dbt-profiler style metrics. Supports storing results in ClickHouse or PostgreSQL.

🎯 Overview

DataProfiler provides:

  1. dbt-profiler Style Metrics calculation via SQL queries (opt-in with --data-profile)
  2. Schema Comparison: Compare table structures and database objects across different environments (e.g., UAT vs Prod)
  3. Schema Objects Profiling: Profiles Stored Procedures, Views, and Triggers for comprehensive database monitoring
  4. Auto-Increment Overflow Risk Analysis with growth prediction using Linear Regression
  5. Table Inventory Collection: Automatically discovers and records all tables per schema for drift detection
  6. Automatic Schema Discovery from source databases (information_schema)
  7. Multi-Database Support: PostgreSQL, Microsoft SQL Server (Azure SQL Edge), MySQL, and Oracle (21c XE+)
  8. Flexible Metrics Storage: Choose between ClickHouse or PostgreSQL for storing results
  9. Multiple Export Formats: Markdown, JSON, CSV, Console Table
  10. Web Dashboard for data visualization (React + TailwindCSS)
  11. Control-M Integration: Production-ready wrapper script for enterprise job scheduling
  12. Full Stack Docker: Complete containerized environment including Grafana, ClickHouse, and Postgres Metrics
  13. High Performance: Uses fast Catalog Statistics (O(1)) for row counts instead of full table scans
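The catalog-statistics approach in item 13 can be sketched as follows. These queries illustrate the general technique on PostgreSQL (`pg_class.reltuples`) and SQL Server (`sys.dm_db_partition_stats`); they are not necessarily the exact SQL DataProfiler runs.

```python
def estimated_rowcount_sql(db_type: str, schema: str, table: str) -> str:
    """Build a catalog-statistics query for an approximate row count.

    Both queries read planner/partition statistics, so they return in O(1)
    regardless of table size -- at the cost of being estimates rather than
    exact counts (PostgreSQL's reltuples is refreshed by ANALYZE/autovacuum).
    """
    if db_type == "postgresql":
        return (
            "SELECT c.reltuples::bigint AS approx_rows "
            "FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace "
            f"WHERE n.nspname = '{schema}' AND c.relname = '{table}'"
        )
    if db_type == "mssql":
        # index_id 0 = heap, 1 = clustered index
        return (
            "SELECT SUM(row_count) AS approx_rows "
            "FROM sys.dm_db_partition_stats "
            f"WHERE object_id = OBJECT_ID('{schema}.{table}') "
            "AND index_id IN (0, 1)"
        )
    raise ValueError(f"unsupported db_type: {db_type}")

print(estimated_rowcount_sql("postgresql", "prod", "users"))
```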

📊 Profiled Metrics

For each column, the system collects the following statistics (dbt-profiler compatible):

Metric Description Condition
column_name Column name All columns
data_type Data type All columns
not_null_proportion Proportion of non-NULL values (0.00 - 1.00) All columns
distinct_proportion Proportion of unique values (0.00 - 1.00) All columns
distinct_count Count of unique values All columns
is_unique Whether all values are unique (true/false) All columns
min / max Minimum / Maximum values numeric, date, time*
avg Average value numeric**
median Median value numeric**
std_dev_population Population standard deviation numeric**
std_dev_sample Sample standard deviation numeric**
histogram Numeric bucket distribution (JSON) numeric*
frequencies Top N Categorical frequencies (JSON) non-numeric
profiled_at Profile timestamp All columns

* min/max supported for: integer, numeric, float, date, timestamp, time
** avg, median, std_dev supported for: integer, numeric, float
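As a dependency-free reference for the proportion metrics above (DataProfiler computes them via SQL against the source database; this pure-Python sketch only illustrates the definitions):

```python
def profile_column(values):
    """Reference implementation of the proportion metrics.

    `values` is one column's data, with None standing in for SQL NULL.
    Illustrative only -- the real tool pushes these calculations into SQL.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "not_null_proportion": len(non_null) / total if total else 0.0,
        "distinct_proportion": len(distinct) / total if total else 0.0,
        "distinct_count": len(distinct),
        # All rows carry a distinct non-NULL value
        "is_unique": total > 0 and len(distinct) == total,
    }

print(profile_column(["a", "b", "b", None]))
```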

🔮 Auto-Increment Overflow Risk Analysis

DataProfiler includes Auto-Increment Column Overflow Risk Analysis to predict when primary key columns will reach their maximum capacity.

Features

  • Current Value Tracking: Monitors the current value of auto-increment columns
  • Maximum Value Calculation: Computes the max value based on data type (e.g., INT, BIGINT)
  • Usage Percentage: Calculates current capacity usage
  • Growth Rate Prediction: Uses Linear Regression on historical data from ClickHouse
  • Days Until Full: Predicts when the column will reach maximum capacity
  • Alert Status: CRITICAL (< 30 days / > 90%), WARNING (< 90 days / > 75%), NORMAL
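A minimal sketch of the Days Until Full prediction under stated assumptions: fit a least-squares line to historical (day, current value) snapshots and extrapolate to the type's maximum. The project uses scipy for the regression; this version computes the slope directly so it runs standalone, and all names are illustrative.

```python
def days_until_full(samples, max_value):
    """Extrapolate a least-squares growth line to the column's max value.

    samples: chronological (day_number, auto_increment_value) pairs, e.g.
    daily snapshots read back from the metrics store.
    Returns predicted days until overflow, or None if no growth is observed.
    """
    if len(samples) < 2:
        return None
    xs, ys = zip(*samples)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Least-squares slope: cov(x, y) / var(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # flat or shrinking -> no overflow at current trend
    last_day, last_value = samples[-1]
    return (max_value - last_value) / slope

# ~1,000 new rows per day against an INT column (max 2,147,483,647):
print(days_until_full([(0, 10_000), (1, 11_000), (2, 12_000)], 2_147_483_647))
```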

Supported Data Types

Data Type Max Value Range
smallint 32,767 -32,768 to 32,767
integer 2,147,483,647 -2.1B to 2.1B
bigint 9,223,372,036,854,775,807 -9.2 quintillion to 9.2 quintillion
serial 2,147,483,647 1 to 2.1B
bigserial 9,223,372,036,854,775,807 1 to 9.2 quintillion

MSSQL Data Types

Data Type Max Value Range
tinyint 255 0 to 255
smallint 32,767 -32,768 to 32,767
int 2,147,483,647 -2.1B to 2.1B
bigint 9,223,372,036,854,775,807 -9.2 quintillion to 9.2 quintillion

Note: Supports PostgreSQL SERIAL/BIGSERIAL/IDENTITY, MSSQL IDENTITY, and Oracle IDENTITY columns.
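Putting the type limits and alert thresholds together, the classification can be sketched as below. The function and dictionary names are illustrative; the thresholds mirror the Alert Status rules above.

```python
TYPE_MAX = {
    "tinyint": 255,  # MSSQL only
    "smallint": 32_767,
    "int": 2_147_483_647,
    "integer": 2_147_483_647,
    "serial": 2_147_483_647,
    "bigint": 9_223_372_036_854_775_807,
    "bigserial": 9_223_372_036_854_775_807,
}

def overflow_status(data_type, current_value, days_until_full=None):
    """CRITICAL above 90% used or under 30 days; WARNING above 75% or under 90 days."""
    usage_pct = 100 * current_value / TYPE_MAX[data_type.lower()]
    if usage_pct > 90 or (days_until_full is not None and days_until_full < 30):
        return "CRITICAL"
    if usage_pct > 75 or (days_until_full is not None and days_until_full < 90):
        return "WARNING"
    return "NORMAL"

print(overflow_status("int", 2_000_000_000))                     # deep into INT range
print(overflow_status("bigint", 1_000_000, days_until_full=45))  # fast-growing column
```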

🏗️ Schema Profiling & Comparison

DataProfiler can profile table schemas (columns, data types, indexes, foreign keys) and store them for comparison between different environments (e.g., UAT vs Production).

Features

  • Schema Snapshot: Captures detailed schema metadata
  • Comparison Dashboard: Grafana dashboard to visualize differences
  • Multi-Database Support: Profile schemas from both PostgreSQL and MSSQL
  • Strict Mode: Compares defaults, nullability, and index structure

Comparison Metrics

Category Checks
Columns Existence, Data Types, Nullability, Defaults
Indexes Primary Keys, Index Membership, Index Names
Constraints Foreign Keys, Check Constraints
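As an illustration of the column-level checks in the table (the data shapes here are hypothetical; the real logic lives in src/core/schema_comparator.py with its own structures):

```python
def compare_columns(env1, env2):
    """Diff two schema snapshots of form {column_name: {"data_type": ..., ...}}.

    Returns columns added in env2, removed from env1, and per-attribute
    (env1_value, env2_value) pairs for columns present in both but differing.
    """
    added = sorted(set(env2) - set(env1))
    removed = sorted(set(env1) - set(env2))
    modified = {}
    for name in set(env1) & set(env2):
        diffs = {k: (env1[name].get(k), env2[name].get(k))
                 for k in set(env1[name]) | set(env2[name])
                 if env1[name].get(k) != env2[name].get(k)}
        if diffs:
            modified[name] = diffs
    return {"added": added, "removed": removed, "modified": modified}

prod = {"id": {"data_type": "bigint", "is_nullable": False}}
uat = {"id": {"data_type": "integer", "is_nullable": False},
       "email": {"data_type": "varchar(100)", "is_nullable": True}}
print(compare_columns(prod, uat))
```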

Schema Objects Profiling

In addition to table schemas, DataProfiler automatically collects metadata for other database objects:

  • Stored Procedures
  • Views
  • Triggers

This allows you to detect added, removed, or modified objects between environments.

Note: Schema object profiling runs automatically when metrics storage is enabled (default), even without --profile-schema.

🛠️ Requirements

  • Python 3.10+
  • One or more source databases: PostgreSQL, Microsoft SQL Server (Azure SQL Edge for ARM64/M1), MySQL 8.0+, Oracle Database (19c/21c XE+)
  • ClickHouse or PostgreSQL (for metrics storage)
  • Dependencies:
    • psycopg2 - PostgreSQL adapter
    • pymssql - MSSQL adapter
    • clickhouse-connect - ClickHouse client
    • mysql-connector-python - MySQL adapter
    • oracledb - Oracle adapter
    • python-dotenv - Environment variable management
    • numpy - Numerical computing
    • scipy - Scientific computing (Linear Regression)

📦 Installation

  1. Clone repository:

git clone <repository-url>
cd DataProfiler

  2. Create and activate Virtual Environment:

# Create venv
python -m venv venv

# Activate (macOS/Linux)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate

  3. Install dependencies:

pip install -r requirements.txt

⚙️ Configuration

1. Create Environment Variables File

Copy .env.example to .env and edit values:

cp .env.example .env

Edit .env file:

# PostgreSQL Configuration
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_actual_password
POSTGRES_SCHEMA=public

# MySQL Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_DATABASE=prod
MYSQL_USER=user
MYSQL_PASSWORD=password123

# Oracle Configuration
ORACLE_HOST=localhost
ORACLE_PORT=1521
ORACLE_SERVICE_NAME=XEPDB1
ORACLE_USER=PROD
ORACLE_PASSWORD=password123

# ClickHouse Configuration
CLICKHOUSE_HOST=localhost
CLICKHOUSE_PORT=8123
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD=your_actual_password

# Metrics Backend Configuration
# Options: 'postgresql' (default) or 'clickhouse'
METRICS_BACKEND=postgresql

# Distribution/Profiling Limits
HISTOGRAM_MAX_BINS=20
FREQUENCY_TOP_N=100

# PostgreSQL Metrics Configuration (separate instance for storing profiling results)
# When using docker-compose, this points to the postgres-metrics service (port 5433)
PG_METRICS_HOST=localhost
PG_METRICS_PORT=5433
PG_METRICS_DATABASE=profiler_metrics
PG_METRICS_USER=postgres
PG_METRICS_PASSWORD=your_password

⚠️ Important: The .env file is git-ignored, so your credentials will not be committed by accident.

🤖 CI/CD Pipeline

The project uses GitHub Actions for continuous integration and continuous deployment:

  • Unit Testing: Automatically runs pytest on every push to the repository.
  • Docker Build & Publish: Automatically builds a multi-architecture (linux/amd64, linux/arm64) Docker container and publishes it to Docker Hub as nontster/data-profiler:<SHORT_SHA> upon any merge into the main branch.

Required Repository Secrets: To enable the Docker publish workflow, you must configure the following secrets in your GitHub repository (Settings > Secrets and variables > Actions):

  • DOCKERHUB_USERNAME: Your Docker Hub username.
  • DOCKERHUB_TOKEN: A Docker Hub Personal Access Token with Read & Write permissions.

🚀 Usage

Basic Usage

By default (without feature flags), DataProfiler collects table inventory only — no --table needed:

# Collect table inventory only (default behavior)
python main.py -d mssql --schema prod --app user-service --env production

# Collect table inventory from PostgreSQL (default database)
python main.py --schema uat --app user-service --env uat 

Data Profiling (--data-profile)

Use --data-profile to enable column-level statistics. Requires --table:

# Profile 'users' table from PostgreSQL
python main.py --data-profile --table users

# Profile multiple tables
python main.py --data-profile -t users,products

# Profile from specific schema
python main.py --data-profile --table users --schema prod

# Profile from MSSQL
python main.py --data-profile -d mssql --table users

# Profile from MySQL
python main.py --data-profile -d mysql --table users

# Profile from Oracle
python main.py --data-profile -d oracle --table users --schema PROD

# Profile with Application & Environment context
python main.py --data-profile -d mssql -t users,products --app user-service --env uat --metrics-backend postgresql

Auto-Increment Analysis

Requires --data-profile:

# Include auto-increment overflow analysis
python main.py --data-profile --auto-increment --table users
python main.py --data-profile --auto-increment -d mssql --table users   

# Custom lookback period for growth calculation
python main.py --data-profile --auto-increment --table users  --lookback-days 14

Schema Profiling

# Profile schema for User Service in Production
python main.py --profile-schema --table users --app user-service --env production

# Profile same table in UAT
python main.py --profile-schema --table users --app user-service --env uat

# Profile multiple table schemas
python main.py --profile-schema -t users,products  --app user-service --env production

# Comparison is done via Grafana Dashboard (Table Schema Comparison)

Schema Objects Profiling

Schema objects (Stored Procedures, Views, Triggers) are profiled automatically when connecting to a database with metrics storage enabled.

# Profile schema objects (and table inventory)
python main.py --app user-service --env production --schema prod

View the results in the Schema Objects Comparison Dashboard.

Output Formats

Output formats apply when --data-profile is used:

# Console table (default)
python main.py --table users --data-profile --format table

# Markdown (dbt-profiler style)
python main.py --table users --data-profile --format markdown

# JSON
python main.py --table users --data-profile --format json

# CSV
python main.py --table users --data-profile --format csv

Save to File

python main.py --table users --data-profile --format markdown --output profiles/users.md
python main.py --table users --data-profile --format json --output profiles/users.json
python main.py -t users,orders --data-profile --format csv --output profiles/report.csv

Note: When profiling multiple tables with --output, results are appended to the same file.

Additional Options

# Skip storing to Metrics DB
python main.py --table users --no-store

# Verbose/debug logging (include detailed SQL queries)
python main.py --table users -v

# Show help
python main.py --help

Database Type Selection (-d, --database-type)

Choose which source database to profile:

Option Description Required Environment Variables
postgresql PostgreSQL (default) POSTGRES_HOST, POSTGRES_PORT, etc.
mssql Microsoft SQL Server MSSQL_HOST, MSSQL_PORT, MSSQL_DATABASE, etc.
mysql MySQL MYSQL_HOST, MYSQL_PORT, MYSQL_DATABASE, etc.
oracle Oracle Database ORACLE_HOST, ORACLE_PORT, ORACLE_SERVICE_NAME
# Profile from PostgreSQL (default)
python main.py --table users --data-profile

# Profile from MSSQL
python main.py --data-profile -d mssql --table users
python main.py --data-profile --database-type mssql --table users

# MSSQL with auto-increment analysis
python main.py --data-profile --auto-increment -d mssql --table users

# Oracle with auto-increment analysis
python main.py --data-profile --auto-increment -d oracle --table users --schema PROD

Metrics Backend Selection (--metrics-backend)

Choose where to store profiling results:

Option Description Required Environment Variables
postgresql PostgreSQL (default) PG_METRICS_* or falls back to POSTGRES_* variables
clickhouse ClickHouse CLICKHOUSE_HOST, CLICKHOUSE_PORT
# Use PostgreSQL (default)
python main.py --table users

# Use ClickHouse for metrics storage
python main.py --table users --metrics-backend clickhouse

# Combine: Profile MSSQL, store in PostgreSQL
python main.py -d mssql --table orders --metrics-backend postgresql

Complete Example: Profile MSSQL with PostgreSQL Metrics Backend

# Set environment variables
export MSSQL_HOST=localhost
export MSSQL_PORT=1433
export MSSQL_DATABASE=testdb
export MSSQL_USER=sa
export MSSQL_PASSWORD='YourStrong@Password123'
export MSSQL_SCHEMA=dbo

export METRICS_BACKEND=postgresql
export PG_METRICS_HOST=localhost
export PG_METRICS_PORT=5433
export PG_METRICS_DATABASE=profiler_metrics
export PG_METRICS_USER=postgres
export PG_METRICS_PASSWORD='password123'

# Run profiler
python main.py \
  --data-profile \
  --auto-increment \
  -d mssql \
  --table users \
  --metrics-backend postgresql \
  --app user-service \
  --env production \
  --schema prod

📤 Exporting Dashboards

If you need to export the Grafana dashboards for importing into another Grafana instance (stripping backend-specific suffixes like (PostgreSQL) or (ClickHouse) from titles), you can use the provided export script:

python scripts/export_dashboards.py

This will create a grafana/dashboards_exported directory containing the sanitized JSON files.

Key Features:

  • Dynamic Data Source: Identifies the data source type from the filename (_ch -> ClickHouse, _pg -> PostgreSQL) and sets it as the default.
  • Import Flexibility: Injects a DS_PROFILER_METRICS template variable, allowing you to select the target data source during import.
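The filename suffix convention can be sketched as follows (a hypothetical helper; the actual export script may resolve datasources differently):

```python
import os

def default_datasource(filename: str) -> str:
    """Map a dashboard filename suffix to its default Grafana datasource type."""
    stem = os.path.splitext(filename)[0]
    if stem.endswith("_ch"):
        return "ClickHouse"
    if stem.endswith("_pg"):
        return "PostgreSQL"
    raise ValueError(f"no backend suffix in {filename!r}")

print(default_datasource("env_comparison_pg.json"))
```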

📁 Project Structure

DataProfiler/
├── .env.example              # Environment variables template
├── .env                      # Environment variables (git ignored)
├── .gitignore                # Git ignore rules
├── LICENSE                   # Apache 2.0 License
├── docker-compose.yml        # Docker full stack environment
├── main.py                   # Main entry point
├── pytest.ini                # Pytest configuration
├── README.md                 # Documentation (English)
├── README.th.md              # Documentation (Thai)
├── requirements.txt          # Python dependencies
│
├── scripts/                  # Automation scripts
│   └── run_profiler.sh       # Control-M wrapper script
│
├── dashboard/                # Web Dashboard
│   ├── backend/              # Flask API server
│   │   ├── app.py            # API endpoints
│   │   └── requirements.txt  # Backend dependencies
│   └── frontend/             # React + Vite frontend
│       ├── src/
│       │   └── App.jsx       # Main React component
│       ├── package.json
│       └── nginx.conf        # Production nginx config
│
├── docs/                     # Documentation assets
│   └── images/               # Dashboard screenshots
│
├── grafana/                  # Grafana dashboards & config
│   ├── dashboards/
│   │   ├── main_dashboard.json               # Single environment view
│   │   └── env_comparison_dashboard.json     # Environment comparison
│   └── provisioning/
│       ├── dashboards/dashboard.yml
│       └── datasources/datasource.yml
│
├── init-scripts/             # Database initialization scripts
│   ├── clickhouse/           # ClickHouse schema & test data
│   ├── mssql/                # MSSQL init & data generation
│   └── postgres/             # PostgreSQL init & data generation
│
├── src/                      # Source code modules
│   ├── __init__.py
│   ├── config.py             # Configuration management
│   ├── exceptions.py         # Custom exceptions
│   ├── core/                 # Core profiling logic
│   │   ├── __init__.py
│   │   ├── autoincrement_metrics.py  # Auto-increment analysis
│   │   ├── formatters.py     # Output formatters (MD, JSON, CSV)
│   │   ├── metrics.py        # dbt-profiler style metrics
│   │   └── schema_comparator.py  # Schema comparison logic
│   └── db/                   # Database connections
│       ├── __init__.py
│       ├── autoincrement.py  # Auto-increment detector (PostgreSQL & MSSQL)
│       ├── clickhouse.py     # ClickHouse metrics client
│       ├── postgres_metrics.py  # PostgreSQL metrics client
│       ├── connection_factory.py  # Multi-database factory
│       ├── mssql.py          # MSSQL client
│       └── postgres.py       # PostgreSQL client
│
├── tests/                    # Unit tests
│   ├── __init__.py
│   ├── test_autoincrement.py
│   ├── test_config.py
│   ├── test_connections.py
│   ├── test_metadata.py
│   └── test_table_inventory.py
│
└── venv/                     # Python virtual environment (git ignored)

🧪 Testing

# Activate virtual environment
source venv/bin/activate

# Run tests
pytest

# Run with verbose output
pytest -v

# Run with coverage report
pytest --cov=src --cov-report=term-missing

🔍 End-to-End Manual Testing

The following walks through a full MSSQL workflow to manually verify all functions:

1. Setup Environment

# Configure MSSQL Connection
export MSSQL_HOST=localhost
export MSSQL_PORT=1433
export MSSQL_DATABASE=testdb
export MSSQL_USER=sa
export MSSQL_PASSWORD=YourStrong@Password123
export MSSQL_SCHEMA=dbo

# Initialize MSSQL Database
python init-scripts/mssql/init-mssql.py
python init-scripts/mssql/init-mssql-schema-objects.py

2. Generate Sample Data

# Generate Sample Data for UAT and Prod schemas
python init-scripts/mssql/generate-mssql-data.py --schema uat
python init-scripts/mssql/generate-mssql-data.py --schema prod

3. Run Profiler (Full Cycle)

You can run the profiler using main.py directly:

# Profile MSSQL (UAT) -> Store in PostgreSQL
python main.py --data-profile --auto-increment --profile-schema -d mssql -t users,products --app user-service --env uat --schema uat  --metrics-backend postgresql

# Profile MSSQL (Prod) -> Store in PostgreSQL
python main.py --data-profile --auto-increment --profile-schema -d mssql -t users,products --app user-service --env prod --schema prod --metrics-backend postgresql

Or using the scripts/run_profiler.sh wrapper script:

# Profile MSSQL (UAT)
scripts/run_profiler.sh --data-profile --auto-increment --profile-schema --database-type mssql -t users,products --app user-service --env uat --schema uat --metrics-backend postgresql

# Profile MSSQL (Prod)
scripts/run_profiler.sh --data-profile --auto-increment --profile-schema --database-type mssql -t users,products --app user-service --env prod --schema prod --metrics-backend postgresql

This workflow verifies:

  • MSSQL connection and data retrieval
  • Sample data generation scripts
  • Schema profiling capabilities
  • Auto-increment analysis
  • Storing metrics in PostgreSQL

CLI Wrapper

DataProfiler includes a wrapper script for running as a scheduled job in Control-M or similar job schedulers.

Wrapper Script Location

scripts/run_profiler.sh

CLI Arguments Support

The wrapper script supports passing CLI arguments directly to the Python profiler. CLI arguments override environment variables.

# Show help
scripts/run_profiler.sh -h
scripts/run_profiler.sh --help

# Override database type via CLI (ignores PROFILER_DB_TYPE env var)
scripts/run_profiler.sh -d mssql
scripts/run_profiler.sh --database-type mssql

# Override metrics backend via CLI (ignores METRICS_BACKEND env var)
scripts/run_profiler.sh --metrics-backend postgresql

# Specify table name via CLI (ignores PROFILER_TABLE env var)
scripts/run_profiler.sh --table users

# Specify multiple tables
scripts/run_profiler.sh --table users,products
scripts/run_profiler.sh -t users,products

# Override schema (ignores PROFILER_SCHEMA env var)
scripts/run_profiler.sh --schema uat

# Combine CLI arguments
scripts/run_profiler.sh --table users -d mssql --metrics-backend postgresql --data-profile --auto-increment --schema prod

scripts/run_profiler.sh -t users,orders --data-profile --app user-service --env uat --database-type mssql --metrics-backend postgresql --auto-increment

# Skip storing metrics
scripts/run_profiler.sh --no-store

Argument Precedence

CLI arguments take precedence over environment variables:

Configuration Priority Example
CLI argument 1 (High) --metrics-backend postgresql
Environment 2 (Low) METRICS_BACKEND=clickhouse

Example: If METRICS_BACKEND=clickhouse is set but you run:

scripts/run_profiler.sh --metrics-backend postgresql

The script will use PostgreSQL for metrics storage.
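This precedence can be sketched in a few lines (illustrative only; the wrapper implements it in shell):

```python
import os

def resolve(cli_value, env_var, default):
    """CLI argument (priority 1) beats environment variable (priority 2) beats default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_var, default)

os.environ["METRICS_BACKEND"] = "clickhouse"
print(resolve("postgresql", "METRICS_BACKEND", "postgresql"))  # CLI argument wins
print(resolve(None, "METRICS_BACKEND", "postgresql"))          # falls back to the env var
```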

Environment Variables for Control-M

Configure these environment variables in your Control-M job definition:

Database Connection & Metrics Backend

Refer to the Configuration section for the required environment variables:

  • POSTGRES_*
  • MSSQL_*
  • CLICKHOUSE_*
  • PG_METRICS_*

Profiler Options (Optional)

Variable Default Description
PROFILER_TABLE - Comma-separated table names (required with --data-profile)
PROFILER_SCHEMA (default DB) Schema name (e.g., public, dbo, prod, uat)
PROFILER_FORMAT table Output format: table, markdown, json, csv
PROFILER_OUTPUT_FILE - File path to save output
PROFILER_APP default Application name
PROFILER_ENV production Environment name
PROFILER_DB_TYPE postgresql Database type: postgresql, mssql, mysql, oracle
METRICS_BACKEND postgresql Metrics backend: postgresql, clickhouse
PROFILER_DATA_PROFILE false Enable data profiling (column-level statistics)
PROFILER_AUTO_INCREMENT false Enable auto-increment analysis (requires PROFILER_DATA_PROFILE)
PROFILER_PROFILE_SCHEMA false Enable schema profiling
PROFILER_LOOKBACK_DAYS 7 Days for growth rate calculation
PROFILER_NO_STORE false Skip storing metrics
PROFILER_VERBOSE false Enable verbose logging
PYTHON_PATH python3 Path to Python executable
PROFILER_HOME (script dir) Path to DataProfiler installation

Exit Codes

Code Description
0 Success
1 Configuration error (missing required environment variables)
2 Execution error (profiler failed)
3 Python environment error

Examples

Example 1: PostgreSQL + ClickHouse Metrics

# Environment Variables (set in Control-M):
# Source Database (PostgreSQL)
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD='password123'

CLICKHOUSE_HOST=localhost
CLICKHOUSE_PORT=8123
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD='password123'

PROFILER_TABLE=users,products
PROFILER_APP=user-service
PROFILER_ENV=uat
PROFILER_DATA_PROFILE=true
PROFILER_AUTO_INCREMENT=true

# Command:
scripts/run_profiler.sh

Example 2: MSSQL + PostgreSQL Metrics

# Environment Variables (set in Control-M):
# Source Database (MSSQL)
MSSQL_HOST=localhost
MSSQL_PORT=1433
MSSQL_DATABASE=testdb
MSSQL_USER=sa
MSSQL_PASSWORD='YourStrong@Password123'
MSSQL_SCHEMA=dbo

# Metrics Storage (PostgreSQL)
METRICS_BACKEND=postgresql
PG_METRICS_HOST=localhost
PG_METRICS_PORT=5433
PG_METRICS_DATABASE=profiler_metrics
PG_METRICS_USER=postgres
PG_METRICS_PASSWORD='password123'

# Profiler Options
PROFILER_TABLE=users,orders
PROFILER_DB_TYPE=mssql
PROFILER_APP=user-service
PROFILER_ENV=uat
PROFILER_DATA_PROFILE=true
PROFILER_AUTO_INCREMENT=true

# Command:
scripts/run_profiler.sh

Using Docker (Alternative)

# Run profiler inside backend container
docker-compose exec backend python main.py --table users --data-profile --app user-service --env production --auto-increment

Logging

The wrapper script creates logs in $PROFILER_HOME/logs/ with the format:

profiler_<CTM_ORDERID>.log

Control-M variables CTM_JOBNAME and CTM_ORDERID are automatically used for job identification in logs.

🐳 Docker Standalone Image

You can run DataProfiler as a standalone Docker container, useful for CI/CD pipelines or running in environments where you don't need the full stack.

Pull Image

docker pull nontster/data-profiler:latest

Note: The image supports linux/amd64 and linux/arm64.

Run Container

You can pass environment variables directly to the container to configure the connection.

# Example: Profile MSSQL and store metrics in PostgreSQL
docker run --rm \
  -e MSSQL_HOST=host.docker.internal \
  -e MSSQL_PORT=1433 \
  -e MSSQL_DATABASE=testdb \
  -e MSSQL_USER=sa \
  -e MSSQL_PASSWORD='YourStrong@Password123' \
  -e MSSQL_SCHEMA=dbo \
  -e METRICS_BACKEND=postgresql \
  -e PG_METRICS_HOST=host.docker.internal \
  -e PG_METRICS_PORT=5433 \
  -e PG_METRICS_DATABASE=profiler_metrics \
  -e PG_METRICS_USER=postgres \
  -e PG_METRICS_PASSWORD='password123' \
  nontster/data-profiler \
  --data-profile -d mssql -t users,products --app user-service --env uat

🐳 Docker Full Stack Environment

This project is fully containerized. You can spin up the entire stack (DBs, Backend, Frontend, Grafana) with one command.

# Start all services
docker-compose up -d --build

Metrics Backend Selection

You can choose where to store profiling metrics:

# Use PostgreSQL (default)
docker-compose up -d

# Use ClickHouse for metrics storage
METRICS_BACKEND=clickhouse docker-compose up -d

The React dashboard will display which backend is active in the header.

Services Overview

Service URL / Port Description
Frontend http://localhost:8080 Main Data Profiler Dashboard (React)
Grafana http://localhost:3000 Advanced Visualization (Admin)
Backend Internal (5001) API Service (Flask)
ClickHouse localhost:8123 HTTP Interface
PostgreSQL localhost:5432 Sample Source Database
PostgreSQL-Metrics localhost:5433 Metrics Storage Database
MSSQL localhost:1433 Source Database (Azure SQL Edge)
MySQL localhost:3306 Source Database

Note: PostgreSQL is separated into two instances: the sample database (port 5432) for source data profiling, and the metrics database (port 5433) for storing profiling results. This prevents confusion between sample data and metrics data.

Credentials

  • Grafana: User: admin, Pass: admin (or set via GRAFANA_ADMIN_PASSWORD in .env)
  • PostgreSQL: User: postgres, Pass: password123
  • MSSQL: User: sa, Pass: YourStrong@Password123
  • MySQL: User: user, Pass: password123
  • Oracle: User: PROD / UAT, Pass: password123
  • ClickHouse: User: default, Pass: password123

Starting MSSQL (Azure SQL Edge)

MSSQL uses Azure SQL Edge for ARM64/M1 compatibility:

# Start MSSQL container
docker compose up -d mssql

# Wait ~30 seconds for startup, then initialize database
python init-scripts/mssql/init-mssql.py
python init-scripts/mssql/init-mssql-schema-objects.py

# Test profiler
python main.py --table users -d mssql --no-store

Note: Azure SQL Edge doesn't run init scripts automatically like PostgreSQL. Use the Python script to create test databases.

Starting Oracle (21c XE)

Oracle runs as a container with pre-configured PROD and UAT users.

# Start Oracle container
docker compose up -d oracle

# Wait for startup (can take 1-2 minutes)
# Test connection
python main.py -d oracle --schema PROD

Stop Services

docker-compose down -v  # Stop and remove volumes

Sample Data & Testing

Docker automatically initializes sample data with two distinct schemas: prod (Production) and uat (UAT) to simulate real-world scenarios.

Table Schema Records Description
users prod 99 Established user base active since 2023
products prod 111 Full product catalog (Electronics, Accessories, etc.)
users uat 80 New hires, different salary ranges, more NULLs (simulated drift)
products uat 90 New catalog items, missing categories (simulated drift)

Generate Additional Sample Data

To add more test data for auto-increment growth rate calculation:

For PostgreSQL:

# Add 100 new users to PostgreSQL (Prod)
python init-scripts/postgres/generate-postgres-data.py --users 100 --schema prod

# Add 50 new products to UAT
python init-scripts/postgres/generate-postgres-data.py --products 50 --schema uat --no-users

# Add both with specific counts
python init-scripts/postgres/generate-postgres-data.py --users 500 --products 200 --schema prod

For MSSQL:

# Add 100 new users to MSSQL (Prod)
python init-scripts/mssql/generate-mssql-data.py --users 100 --schema prod

# Add 50 new products to MSSQL (UAT)
python init-scripts/mssql/generate-mssql-data.py --products 50 --schema uat --no-users

# Add both users and products
python init-scripts/mssql/generate-mssql-data.py --users 500 --products 200 --schema prod

# Show current statistics only
python init-scripts/mssql/generate-mssql-data.py --stats-only

For MySQL:

# Add 100 new users to MySQL (specify schema: prod, uat, or public)
python init-scripts/mysql/generate-mysql-data.py --schema prod --users 100

For Oracle:

# Add 100 new users to Oracle (PROD schema)
python init-scripts/oracle/generate-oracle-data.py --schema PROD --users 100

# Add 50 new products to Oracle (UAT schema)
python init-scripts/oracle/generate-oracle-data.py --schema UAT --products 50

View Results

Tip: Run the profiler multiple times after inserting new data to enable Days Until Full calculation for auto-increment columns.

📊 Dashboard Development

If you want to run the dashboard manually (outside Docker) for development:

# 1. Start Backend API
cd dashboard/backend
python app.py

# 2. Start Frontend
cd dashboard/frontend
npm run dev
# Access at http://localhost:5173

Technology Stack

Component Technology
Backend Flask + Flask-CORS
Frontend React + Vite
Styling TailwindCSS
Charts Recharts

Environment Comparison Dashboard

The dashboard supports dual-environment comparison to compare data profiles between environments (e.g., UAT vs Production).

React Environment Comparison Dashboard Screenshot

Features

  • Dual Environment Selectors: Select two environments to compare side-by-side
  • Summary Panel: Shows row counts and profiling timestamps for both environments
  • Comparison Charts: Side-by-side bar charts for Not Null Proportion and Distinct Proportion
  • Difference Highlighting: Color-coded differences (green = improvement, red = degradation)
  • Min/Max Filtering: Shows min/max values only for numeric and date/time columns (matches backend logic)
  • Auto-Increment Comparison: Compare overflow risk metrics between environments

API Endpoints

Endpoint Description
GET /api/metadata List all applications and their environments
GET /api/profiles/compare/<table>?app=<app>&env1=<env1>&env2=<env2> Compare profiles between two environments
GET /api/autoincrement/compare/<table>?app=<app>&env1=<env1>&env2=<env2> Compare auto-increment metrics

Schema Comparison Dashboard

The dashboard supports dual-environment schema comparison to compare table schemas between environments (e.g., UAT vs Production).

React Schema Comparison Dashboard Screenshot

Features

  • Tab Navigation: Switch between Data Profile and Schema Comparison views
  • Summary Cards: Shows total columns, matching, different, and environment-specific columns
  • Comparison Table: Side-by-side view of column schemas with:
    • Data Type comparison
    • Nullable status (NULL / NOT NULL)
    • Primary Key indicators (🔑)
    • Index membership indicators (📇)
  • Difference Highlighting: Color-coded status badges:
    • ✓ Match (gray) - Column exists and matches in both environments
    • ⚠ Modified (yellow) - Column exists but has differences
    • Added (green) - Column only exists in Environment 2
    • Removed (red) - Column only exists in Environment 1

API Endpoints

Endpoint Description
GET /api/schema/compare/<table>?app=<app>&env1=<env1>&env2=<env2> Compare schema between two environments

Usage

To capture schema data, run the profiler with --profile-schema flag:

# Profile schema for UAT
python main.py --table users --profile-schema --app user-service --env uat

# Profile schema for Production
python main.py --table users --profile-schema --app user-service --env production

# Profile multiple table schemas
python main.py -t users,orders --profile-schema --app user-service --env production

# Combine data profiling + schema profiling
python main.py -t users --data-profile --profile-schema --app user-service --env production

📈 Grafana Dashboard (Alternative)

This project includes a Grafana instance connected to ClickHouse for advanced visualization.

Features

  • Direct ClickHouse Integration: No middleware required.
  • Customizable: Create complex dashboards with SQL queries.
  • Alerting: Native Grafana alerting support.
  • User Management: Role-based access control.

Pre-Provisioned Dashboards

Four dashboards are automatically provisioned:

Dashboard Description
Main Dashboard Single environment view with data profiles, column details, and auto-increment monitoring
Environment Comparison Dashboard Compare profiles between two environments side-by-side with difference highlighting
Schema Comparison Dashboard Compare schema between two environments side-by-side with difference highlighting
Table Inventory Comparison Dashboard Compare table inventories between environments to detect table drift

Grafana Environment Comparison

Grafana Schema Comparison

PostgreSQL Metrics Support

Grafana is pre-configured with both ClickHouse and PostgreSQL datasources:

Datasource Description
ClickHouse Default metrics storage
PostgreSQL-Metrics Alternative when using PostgreSQL backend

The comparison dashboards are provided in two variants, one per metrics backend:

  • Environment Comparison Dashboard - uses ClickHouse
  • Environment Comparison Dashboard (PostgreSQL) - uses PostgreSQL
  • Table Inventory Comparison Dashboard - uses ClickHouse
  • Table Inventory Comparison Dashboard (PostgreSQL) - uses PostgreSQL

Setup

The Grafana service is included in docker-compose.yml and pre-configured with the ClickHouse datasource.

  1. Start all services:

    docker-compose up -d
  2. Open Grafana at http://localhost:3000 (login: admin / admin):

  3. Select a Dashboard:

    • Go to Dashboards menu
    • Choose Main Dashboard, Environment Comparison Dashboard, or Table Inventory Comparison Dashboard
    • Select Application, Environment(s), and Table from dropdowns

📝 License

This project is licensed under the Apache License 2.0.

Copyright 2024-2026 DataProfiler Contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

🤝 Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss.
