
🌐 Language: English | ภาษาไทย

DataProfiler

Automated Data Profiling tool for PostgreSQL, Microsoft SQL Server, MySQL, and Oracle with dbt-profiler style metrics. Supports storing results in ClickHouse or PostgreSQL.

🎯 Overview

DataProfiler provides:

  1. dbt-profiler Style Metrics calculation via SQL queries (opt-in with --data-profile)
  2. Schema Comparison: Compare table structures and database objects across different environments (e.g., UAT vs Prod)
  3. Schema Objects Profiling: Profiles Stored Procedures, Views, and Triggers for comprehensive database monitoring
  4. Auto-Increment Overflow Risk Analysis with growth prediction using Linear Regression
  5. Table Inventory Collection: Automatically discovers and records all tables per schema for drift detection
  6. Automatic Schema Discovery from source databases (information_schema)
  7. Multi-Database Support: PostgreSQL, Microsoft SQL Server (Azure SQL Edge), MySQL, and Oracle (21c XE+)
  8. Flexible Metrics Storage: Choose between ClickHouse or PostgreSQL for storing results
  9. Multiple Export Formats: Markdown, JSON, CSV, Console Table
  10. Web Dashboard for data visualization (React + TailwindCSS)
  11. Control-M Integration: Production-ready wrapper script for enterprise job scheduling
  12. Full Stack Docker: Complete containerized environment including Grafana, ClickHouse, and Postgres Metrics
  13. High Performance: Uses fast Catalog Statistics (O(1)) for row counts instead of full table scans
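The catalog-statistics approach in item 13 can be sketched as follows. These queries illustrate the general technique on PostgreSQL (`pg_class.reltuples`) and SQL Server (`sys.dm_db_partition_stats`); they are not necessarily the exact SQL DataProfiler runs.

```python
def estimated_rowcount_sql(db_type: str, schema: str, table: str) -> str:
    """Build a catalog-statistics query for an approximate row count.

    Both queries read planner/partition statistics, so they return in O(1)
    regardless of table size -- at the cost of being estimates rather than
    exact counts (PostgreSQL's reltuples is refreshed by ANALYZE/autovacuum).
    """
    if db_type == "postgresql":
        return (
            "SELECT c.reltuples::bigint AS approx_rows "
            "FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace "
            f"WHERE n.nspname = '{schema}' AND c.relname = '{table}'"
        )
    if db_type == "mssql":
        # index_id 0 = heap, 1 = clustered index
        return (
            "SELECT SUM(row_count) AS approx_rows "
            "FROM sys.dm_db_partition_stats "
            f"WHERE object_id = OBJECT_ID('{schema}.{table}') "
            "AND index_id IN (0, 1)"
        )
    raise ValueError(f"unsupported db_type: {db_type}")

print(estimated_rowcount_sql("postgresql", "prod", "users"))
```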

📊 Profiled Metrics

For each column, the system collects the following statistics (dbt-profiler compatible):

Metric Description Condition
column_name Column name All columns
data_type Data type All columns
not_null_proportion Proportion of non-NULL values (0.00 - 1.00) All columns
distinct_proportion Proportion of unique values (0.00 - 1.00) All columns
distinct_count Count of unique values All columns
is_unique Whether all values are unique (true/false) All columns
min / max Minimum / Maximum values numeric, date, time*
avg Average value numeric**
median Median value numeric**
std_dev_population Population standard deviation numeric**
std_dev_sample Sample standard deviation numeric**
histogram Numeric bucket distribution (JSON) numeric*
frequencies Top N Categorical frequencies (JSON) non-numeric
profiled_at Profile timestamp All columns

* min/max supported for: integer, numeric, float, date, timestamp, time
** avg, median, std_dev supported for: integer, numeric, float
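As a dependency-free reference for the proportion metrics above (DataProfiler computes them via SQL against the source database; this pure-Python sketch only illustrates the definitions):

```python
def profile_column(values):
    """Reference implementation of the proportion metrics.

    `values` is one column's data, with None standing in for SQL NULL.
    Illustrative only -- the real tool pushes these calculations into SQL.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "not_null_proportion": len(non_null) / total if total else 0.0,
        "distinct_proportion": len(distinct) / total if total else 0.0,
        "distinct_count": len(distinct),
        # All rows carry a distinct non-NULL value
        "is_unique": total > 0 and len(distinct) == total,
    }

print(profile_column(["a", "b", "b", None]))
```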

🔮 Auto-Increment Overflow Risk Analysis

DataProfiler includes Auto-Increment Column Overflow Risk Analysis to predict when primary key columns will reach their maximum capacity.

Features

  • Current Value Tracking: Monitors the current value of auto-increment columns
  • Maximum Value Calculation: Computes the max value based on data type (e.g., INT, BIGINT)
  • Usage Percentage: Calculates current capacity usage
  • Growth Rate Prediction: Uses Linear Regression on historical data from ClickHouse
  • Days Until Full: Predicts when the column will reach maximum capacity
  • Alert Status: CRITICAL (< 30 days / > 90%), WARNING (< 90 days / > 75%), NORMAL
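A minimal sketch of the Days Until Full prediction under stated assumptions: fit a least-squares line to historical (day, current value) snapshots and extrapolate to the type's maximum. The project uses scipy for the regression; this version computes the slope directly so it runs standalone, and all names are illustrative.

```python
def days_until_full(samples, max_value):
    """Extrapolate a least-squares growth line to the column's max value.

    samples: chronological (day_number, auto_increment_value) pairs, e.g.
    daily snapshots read back from the metrics store.
    Returns predicted days until overflow, or None if no growth is observed.
    """
    if len(samples) < 2:
        return None
    xs, ys = zip(*samples)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Least-squares slope: cov(x, y) / var(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # flat or shrinking -> no overflow at current trend
    last_day, last_value = samples[-1]
    return (max_value - last_value) / slope

# ~1,000 new rows per day against an INT column (max 2,147,483,647):
print(days_until_full([(0, 10_000), (1, 11_000), (2, 12_000)], 2_147_483_647))
```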

Supported Data Types

Data Type Max Value Range
smallint 32,767 -32,768 to 32,767
integer 2,147,483,647 -2.1B to 2.1B
bigint 9,223,372,036,854,775,807 -9.2 quintillion to 9.2 quintillion
serial 2,147,483,647 1 to 2.1B
bigserial 9,223,372,036,854,775,807 1 to 9.2 quintillion

MSSQL Data Types

Data Type Max Value Range
tinyint 255 0 to 255
smallint 32,767 -32,768 to 32,767
int 2,147,483,647 -2.1B to 2.1B
bigint 9,223,372,036,854,775,807 -9.2 quintillion to 9.2 quintillion

Note: Supports PostgreSQL SERIAL/BIGSERIAL/IDENTITY, MSSQL IDENTITY, and Oracle IDENTITY columns.
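Putting the type limits and alert thresholds together, the classification can be sketched as below. The function and dictionary names are illustrative; the thresholds mirror the Alert Status rules above.

```python
TYPE_MAX = {
    "tinyint": 255,  # MSSQL only
    "smallint": 32_767,
    "int": 2_147_483_647,
    "integer": 2_147_483_647,
    "serial": 2_147_483_647,
    "bigint": 9_223_372_036_854_775_807,
    "bigserial": 9_223_372_036_854_775_807,
}

def overflow_status(data_type, current_value, days_until_full=None):
    """CRITICAL above 90% used or under 30 days; WARNING above 75% or under 90 days."""
    usage_pct = 100 * current_value / TYPE_MAX[data_type.lower()]
    if usage_pct > 90 or (days_until_full is not None and days_until_full < 30):
        return "CRITICAL"
    if usage_pct > 75 or (days_until_full is not None and days_until_full < 90):
        return "WARNING"
    return "NORMAL"

print(overflow_status("int", 2_000_000_000))                     # deep into INT range
print(overflow_status("bigint", 1_000_000, days_until_full=45))  # fast-growing column
```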

🏗️ Schema Profiling & Comparison

DataProfiler can profile table schemas (columns, data types, indexes, foreign keys) and store them for comparison between different environments (e.g., UAT vs Production).

Features

  • Schema Snapshot: Captures detailed schema metadata
  • Comparison Dashboard: Grafana dashboard to visualize differences
  • Multi-Database Support: Profile schemas from both PostgreSQL and MSSQL
  • Strict Mode: Compares defaults, nullability, and index structure

Comparison Metrics

Category Checks
Columns Existence, Data Types, Nullability, Defaults
Indexes Primary Keys, Index Membership, Index Names
Constraints Foreign Keys, Check Constraints
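As an illustration of the column-level checks in the table (the data shapes here are hypothetical; the real logic lives in src/core/schema_comparator.py with its own structures):

```python
def compare_columns(env1, env2):
    """Diff two schema snapshots of form {column_name: {"data_type": ..., ...}}.

    Returns columns added in env2, removed from env1, and per-attribute
    (env1_value, env2_value) pairs for columns present in both but differing.
    """
    added = sorted(set(env2) - set(env1))
    removed = sorted(set(env1) - set(env2))
    modified = {}
    for name in set(env1) & set(env2):
        diffs = {k: (env1[name].get(k), env2[name].get(k))
                 for k in set(env1[name]) | set(env2[name])
                 if env1[name].get(k) != env2[name].get(k)}
        if diffs:
            modified[name] = diffs
    return {"added": added, "removed": removed, "modified": modified}

prod = {"id": {"data_type": "bigint", "is_nullable": False}}
uat = {"id": {"data_type": "integer", "is_nullable": False},
       "email": {"data_type": "varchar(100)", "is_nullable": True}}
print(compare_columns(prod, uat))
```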

Schema Objects Profiling

In addition to table schemas, DataProfiler automatically collects metadata for other database objects:

  • Stored Procedures
  • Views
  • Triggers

This allows you to detect added, removed, or modified objects between environments.

Note: Schema object profiling runs automatically when metrics storage is enabled (default), even without --profile-schema.

🛠️ Requirements

  • Python 3.10+
  • One or more source databases: PostgreSQL, Microsoft SQL Server (Azure SQL Edge for ARM64/M1), MySQL 8.0+, Oracle Database (19c/21c XE+)
  • ClickHouse or PostgreSQL (for metrics storage)
  • Dependencies:
    • psycopg2 - PostgreSQL adapter
    • pymssql - MSSQL adapter
    • clickhouse-connect - ClickHouse client
    • mysql-connector-python - MySQL adapter
    • oracledb - Oracle adapter
    • python-dotenv - Environment variable management
    • numpy - Numerical computing
    • scipy - Scientific computing (Linear Regression)

📦 Installation

  1. Clone repository:

git clone <repository-url>
cd DataProfiler

  2. Create and activate Virtual Environment:

# Create venv
python -m venv venv

# Activate (macOS/Linux)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate

  3. Install dependencies:

pip install -r requirements.txt

⚙️ Configuration

1. Create Environment Variables File

Copy .env.example to .env and edit values:

cp .env.example .env

Edit .env file:

# PostgreSQL Configuration
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_actual_password
POSTGRES_SCHEMA=public

# MySQL Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_DATABASE=prod
MYSQL_USER=user
MYSQL_PASSWORD=password123

# Oracle Configuration
ORACLE_HOST=localhost
ORACLE_PORT=1521
ORACLE_SERVICE_NAME=XEPDB1
ORACLE_USER=PROD
ORACLE_PASSWORD=password123

# ClickHouse Configuration
CLICKHOUSE_HOST=localhost
CLICKHOUSE_PORT=8123
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD=your_actual_password

# Metrics Backend Configuration
# Options: 'postgresql' (default) or 'clickhouse'
METRICS_BACKEND=postgresql

# Distribution/Profiling Limits
HISTOGRAM_MAX_BINS=20
FREQUENCY_TOP_N=100

# PostgreSQL Metrics Configuration (separate instance for storing profiling results)
# When using docker-compose, this points to the postgres-metrics service (port 5433)
PG_METRICS_HOST=localhost
PG_METRICS_PORT=5433
PG_METRICS_DATABASE=profiler_metrics
PG_METRICS_USER=postgres
PG_METRICS_PASSWORD=your_password

⚠️ Important: The .env file is git-ignored, so your credentials will not be committed by accident.

🤖 CI/CD Pipeline

The project uses GitHub Actions for continuous integration and continuous deployment:

  • Unit Testing: Automatically runs pytest on every push to the repository.
  • Docker Build & Publish: Automatically builds a multi-architecture (linux/amd64, linux/arm64) Docker container and publishes it to Docker Hub as nontster/data-profiler:<SHORT_SHA> upon any merge into the main branch.

Required Repository Secrets: To enable the Docker publish workflow, you must configure the following secrets in your GitHub repository (Settings > Secrets and variables > Actions):

  • DOCKERHUB_USERNAME: Your Docker Hub username.
  • DOCKERHUB_TOKEN: A Docker Hub Personal Access Token with Read & Write permissions.

🚀 Usage

Basic Usage

By default (without feature flags), DataProfiler collects table inventory only — no --table needed:

# Collect table inventory only (default behavior)
python main.py -d mssql --schema prod --app user-service --env production

# Collect table inventory from PostgreSQL (default database)
python main.py --schema uat --app user-service --env uat 

Data Profiling (--data-profile)

Use --data-profile to enable column-level statistics. Requires --table:

# Profile 'users' table from PostgreSQL
python main.py --data-profile --table users

# Profile multiple tables
python main.py --data-profile -t users,products

# Profile from specific schema
python main.py --data-profile --table users --schema prod

# Profile from MSSQL
python main.py --data-profile -d mssql --table users

# Profile from MySQL
python main.py --data-profile -d mysql --table users

# Profile from Oracle
python main.py --data-profile -d oracle --table users --schema PROD

# Profile with Application & Environment context
python main.py --data-profile -d mssql -t users,products --app user-service --env uat --metrics-backend postgresql

Auto-Increment Analysis

Requires --data-profile:

# Include auto-increment overflow analysis
python main.py --data-profile --auto-increment --table users
python main.py --data-profile --auto-increment -d mssql --table users   

# Custom lookback period for growth calculation
python main.py --data-profile --auto-increment --table users  --lookback-days 14

Schema Profiling

# Profile schema for User Service in Production
python main.py --profile-schema --table users --app user-service --env production

# Profile same table in UAT
python main.py --profile-schema --table users --app user-service --env uat

# Profile multiple table schemas
python main.py --profile-schema -t users,products  --app user-service --env production

# Comparison is done via Grafana Dashboard (Table Schema Comparison)

Schema Objects Profiling

Schema objects (Stored Procedures, Views, Triggers) are profiled automatically when connecting to a database with metrics storage enabled.

# Profile schema objects (and table inventory)
python main.py --app user-service --env production --schema prod

View the results in the Schema Objects Comparison Dashboard.

Output Formats

Output formats apply when --data-profile is used:

# Console table (default)
python main.py --table users --data-profile --format table

# Markdown (dbt-profiler style)
python main.py --table users --data-profile --format markdown

# JSON
python main.py --table users --data-profile --format json

# CSV
python main.py --table users --data-profile --format csv

Save to File

python main.py --table users --data-profile --format markdown --output profiles/users.md
python main.py --table users --data-profile --format json --output profiles/users.json
python main.py -t users,orders --data-profile --format csv --output profiles/report.csv

Note: When profiling multiple tables with --output, results are appended to the same file.

Additional Options

# Skip storing to Metrics DB
python main.py --table users --no-store

# Verbose/debug logging (include detailed SQL queries)
python main.py --table users -v

# Show help
python main.py --help

Database Type Selection (-d, --database-type)

Choose which source database to profile:

Option Description Required Environment Variables
postgresql PostgreSQL (default) POSTGRES_HOST, POSTGRES_PORT, etc.
mssql Microsoft SQL Server MSSQL_HOST, MSSQL_PORT, MSSQL_DATABASE, etc.
mysql MySQL MYSQL_HOST, MYSQL_PORT, MYSQL_DATABASE, etc.
oracle Oracle Database ORACLE_HOST, ORACLE_PORT, ORACLE_SERVICE_NAME
# Profile from PostgreSQL (default)
python main.py --table users --data-profile

# Profile from MSSQL
python main.py --data-profile -d mssql --table users
python main.py --data-profile --database-type mssql --table users

# MSSQL with auto-increment analysis
python main.py --data-profile --auto-increment -d mssql --table users

# Oracle with auto-increment analysis
python main.py --data-profile --auto-increment -d oracle --table users --schema PROD

Metrics Backend Selection (--metrics-backend)

Choose where to store profiling results:

Option Description Required Environment Variables
postgresql PostgreSQL (default) PG_METRICS_* or falls back to POSTGRES_* variables
clickhouse ClickHouse CLICKHOUSE_HOST, CLICKHOUSE_PORT
# Use PostgreSQL (default)
python main.py --table users

# Use ClickHouse for metrics storage
python main.py --table users --metrics-backend clickhouse

# Combine: Profile MSSQL, store in PostgreSQL
python main.py -d mssql --table orders --metrics-backend postgresql

Complete Example: Profile MSSQL with PostgreSQL Metrics Backend

# Set environment variables
export MSSQL_HOST=localhost
export MSSQL_PORT=1433
export MSSQL_DATABASE=testdb
export MSSQL_USER=sa
export MSSQL_PASSWORD='YourStrong@Password123'
export MSSQL_SCHEMA=dbo

export METRICS_BACKEND=postgresql
export PG_METRICS_HOST=localhost
export PG_METRICS_PORT=5433
export PG_METRICS_DATABASE=profiler_metrics
export PG_METRICS_USER=postgres
export PG_METRICS_PASSWORD='password123'

# Run profiler
python main.py \
  --data-profile \
  --auto-increment \
  -d mssql \
  --table users \
  --metrics-backend postgresql \
  --app user-service \
  --env production \
  --schema prod

📤 Exporting Dashboards

If you need to export the Grafana dashboards for importing into another Grafana instance (stripping backend-specific suffixes like (PostgreSQL) or (ClickHouse) from titles), you can use the provided export script:

python scripts/export_dashboards.py

This will create a grafana/dashboards_exported directory containing the sanitized JSON files.

Key Features:

  • Dynamic Data Source: Identifies the data source type from the filename (_ch -> ClickHouse, _pg -> PostgreSQL) and sets it as the default.
  • Import Flexibility: Injects a DS_PROFILER_METRICS template variable, allowing you to select the target data source during import.
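The filename suffix convention can be sketched as follows (a hypothetical helper; the actual export script may resolve datasources differently):

```python
import os

def default_datasource(filename: str) -> str:
    """Map a dashboard filename suffix to its default Grafana datasource type."""
    stem = os.path.splitext(filename)[0]
    if stem.endswith("_ch"):
        return "ClickHouse"
    if stem.endswith("_pg"):
        return "PostgreSQL"
    raise ValueError(f"no backend suffix in {filename!r}")

print(default_datasource("env_comparison_pg.json"))
```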

📁 Project Structure

DataProfiler/
├── .env.example              # Environment variables template
├── .env                      # Environment variables (git ignored)
├── .gitignore                # Git ignore rules
├── LICENSE                   # Apache 2.0 License
├── docker-compose.yml        # Docker full stack environment
├── main.py                   # Main entry point
├── pytest.ini                # Pytest configuration
├── README.md                 # Documentation (English)
├── README.th.md              # Documentation (Thai)
├── requirements.txt          # Python dependencies
│
├── scripts/                  # Automation scripts
│   └── run_profiler.sh       # Control-M wrapper script
│
├── dashboard/                # Web Dashboard
│   ├── backend/              # Flask API server
│   │   ├── app.py            # API endpoints
│   │   └── requirements.txt  # Backend dependencies
│   └── frontend/             # React + Vite frontend
│       ├── src/
│       │   └── App.jsx       # Main React component
│       ├── package.json
│       └── nginx.conf        # Production nginx config
│
├── docs/                     # Documentation assets
│   └── images/               # Dashboard screenshots
│
├── grafana/                  # Grafana dashboards & config
│   ├── dashboards/
│   │   ├── main_dashboard.json               # Single environment view
│   │   └── env_comparison_dashboard.json     # Environment comparison
│   └── provisioning/
│       ├── dashboards/dashboard.yml
│       └── datasources/datasource.yml
│
├── init-scripts/             # Database initialization scripts
│   ├── clickhouse/           # ClickHouse schema & test data
│   ├── mssql/                # MSSQL init & data generation
│   └── postgres/             # PostgreSQL init & data generation
│
├── src/                      # Source code modules
│   ├── __init__.py
│   ├── config.py             # Configuration management
│   ├── exceptions.py         # Custom exceptions
│   ├── core/                 # Core profiling logic
│   │   ├── __init__.py
│   │   ├── autoincrement_metrics.py  # Auto-increment analysis
│   │   ├── formatters.py     # Output formatters (MD, JSON, CSV)
│   │   ├── metrics.py        # dbt-profiler style metrics
│   │   └── schema_comparator.py  # Schema comparison logic
│   └── db/                   # Database connections
│       ├── __init__.py
│       ├── autoincrement.py  # Auto-increment detector (PostgreSQL & MSSQL)
│       ├── clickhouse.py     # ClickHouse metrics client
│       ├── postgres_metrics.py  # PostgreSQL metrics client
│       ├── connection_factory.py  # Multi-database factory
│       ├── mssql.py          # MSSQL client
│       └── postgres.py       # PostgreSQL client
│
├── tests/                    # Unit tests
│   ├── __init__.py
│   ├── test_autoincrement.py
│   ├── test_config.py
│   ├── test_connections.py
│   ├── test_metadata.py
│   └── test_table_inventory.py
│
└── venv/                     # Python virtual environment (git ignored)

🧪 Testing

# Activate virtual environment
source venv/bin/activate

# Run tests
pytest

# Run with verbose output
pytest -v

# Run with coverage report
pytest --cov=src --cov-report=term-missing

🔍 End-to-End Manual Testing

The following walks through a full MSSQL workflow to manually verify all functions:

1. Setup Environment

# Configure MSSQL Connection
export MSSQL_HOST=localhost
export MSSQL_PORT=1433
export MSSQL_DATABASE=testdb
export MSSQL_USER=sa
export MSSQL_PASSWORD=YourStrong@Password123
export MSSQL_SCHEMA=dbo

# Initialize MSSQL Database
python init-scripts/mssql/init-mssql.py
python init-scripts/mssql/init-mssql-schema-objects.py

2. Generate Sample Data

# Generate Sample Data for UAT and Prod schemas
python init-scripts/mssql/generate-mssql-data.py --schema uat
python init-scripts/mssql/generate-mssql-data.py --schema prod

3. Run Profiler (Full Cycle)

You can run the profiler using main.py directly:

# Profile MSSQL (UAT) -> Store in PostgreSQL
python main.py --data-profile --auto-increment --profile-schema -d mssql -t users,products --app user-service --env uat --schema uat  --metrics-backend postgresql

# Profile MSSQL (Prod) -> Store in PostgreSQL
python main.py --data-profile --auto-increment --profile-schema -d mssql -t users,products --app user-service --env prod --schema prod --metrics-backend postgresql

Or using the scripts/run_profiler.sh wrapper script:

# Profile MSSQL (UAT)
scripts/run_profiler.sh --data-profile --auto-increment --profile-schema --database-type mssql -t users,products --app user-service --env uat --schema uat --metrics-backend postgresql

# Profile MSSQL (Prod)
scripts/run_profiler.sh --data-profile --auto-increment --profile-schema --database-type mssql -t users,products --app user-service --env prod --schema prod --metrics-backend postgresql

This workflow verifies:

  • MSSQL connection and data retrieval
  • Sample data generation scripts
  • Schema profiling capabilities
  • Auto-increment analysis
  • Storing metrics in PostgreSQL

CLI Wrapper

DataProfiler includes a wrapper script for running as a scheduled job in Control-M or similar job schedulers.

Wrapper Script Location

scripts/run_profiler.sh

CLI Arguments Support

The wrapper script supports passing CLI arguments directly to the Python profiler. CLI arguments override environment variables.

# Show help
scripts/run_profiler.sh -h
scripts/run_profiler.sh --help

# Override database type via CLI (ignores PROFILER_DB_TYPE env var)
scripts/run_profiler.sh -d mssql
scripts/run_profiler.sh --database-type mssql

# Override metrics backend via CLI (ignores METRICS_BACKEND env var)
scripts/run_profiler.sh --metrics-backend postgresql

# Specify table name via CLI (ignores PROFILER_TABLE env var)
scripts/run_profiler.sh --table users

# Specify multiple tables
scripts/run_profiler.sh --table users,products
scripts/run_profiler.sh -t users,products

# Override schema (ignores PROFILER_SCHEMA env var)
scripts/run_profiler.sh --schema uat

# Combine CLI arguments
scripts/run_profiler.sh --table users -d mssql --metrics-backend postgresql --data-profile --auto-increment --schema prod

scripts/run_profiler.sh -t users,orders --data-profile --app user-service --env uat --database-type mssql --metrics-backend postgresql --auto-increment

# Skip storing metrics
scripts/run_profiler.sh --no-store

Argument Precedence

CLI arguments take precedence over environment variables:

Configuration Priority Example
CLI argument 1 (High) --metrics-backend postgresql
Environment 2 (Low) METRICS_BACKEND=clickhouse

Example: If METRICS_BACKEND=clickhouse is set but you run:

scripts/run_profiler.sh --metrics-backend postgresql

The script will use PostgreSQL for metrics storage.
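This precedence can be sketched in a few lines (illustrative only; the wrapper implements it in shell):

```python
import os

def resolve(cli_value, env_var, default):
    """CLI argument (priority 1) beats environment variable (priority 2) beats default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_var, default)

os.environ["METRICS_BACKEND"] = "clickhouse"
print(resolve("postgresql", "METRICS_BACKEND", "postgresql"))  # CLI argument wins
print(resolve(None, "METRICS_BACKEND", "postgresql"))          # falls back to the env var
```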

Environment Variables for Control-M

Configure these environment variables in your Control-M job definition:

Database Connection & Metrics Backend

Refer to the Configuration section for the required environment variables:

  • POSTGRES_*
  • MSSQL_*
  • CLICKHOUSE_*
  • PG_METRICS_*

Profiler Options (Optional)

Variable Default Description
PROFILER_TABLE - Comma-separated table names (required with --data-profile)
PROFILER_SCHEMA (default DB) Schema name (e.g., public, dbo, prod, uat)
PROFILER_FORMAT table Output format: table, markdown, json, csv
PROFILER_OUTPUT_FILE - File path to save output
PROFILER_APP default Application name
PROFILER_ENV production Environment name
PROFILER_DB_TYPE postgresql Database type: postgresql, mssql, mysql, oracle
METRICS_BACKEND postgresql Metrics backend: postgresql, clickhouse
PROFILER_DATA_PROFILE false Enable data profiling (column-level statistics)
PROFILER_AUTO_INCREMENT false Enable auto-increment analysis (requires PROFILER_DATA_PROFILE)
PROFILER_PROFILE_SCHEMA false Enable schema profiling
PROFILER_LOOKBACK_DAYS 7 Days for growth rate calculation
PROFILER_NO_STORE false Skip storing metrics
PROFILER_VERBOSE false Enable verbose logging
PYTHON_PATH python3 Path to Python executable
PROFILER_HOME (script dir) Path to DataProfiler installation

Exit Codes

Code Description
0 Success
1 Configuration error (missing required environment variables)
2 Execution error (profiler failed)
3 Python environment error

Examples

Example 1: PostgreSQL + ClickHouse Metrics

# Environment Variables (set in Control-M):
# Source Database (PostgreSQL)
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD='password123'

CLICKHOUSE_HOST=localhost
CLICKHOUSE_PORT=8123
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD='password123'

PROFILER_TABLE=users,products
PROFILER_APP=user-service
PROFILER_ENV=uat
PROFILER_DATA_PROFILE=true
PROFILER_AUTO_INCREMENT=true

# Command:
scripts/run_profiler.sh

Example 2: MSSQL + PostgreSQL Metrics

# Environment Variables (set in Control-M):
# Source Database (MSSQL)
MSSQL_HOST=localhost
MSSQL_PORT=1433
MSSQL_DATABASE=testdb
MSSQL_USER=sa
MSSQL_PASSWORD='YourStrong@Password123'
MSSQL_SCHEMA=dbo

# Metrics Storage (PostgreSQL)
METRICS_BACKEND=postgresql
PG_METRICS_HOST=localhost
PG_METRICS_PORT=5433
PG_METRICS_DATABASE=profiler_metrics
PG_METRICS_USER=postgres
PG_METRICS_PASSWORD='password123'

# Profiler Options
PROFILER_TABLE=users,orders
PROFILER_DB_TYPE=mssql
PROFILER_APP=user-service
PROFILER_ENV=uat
PROFILER_DATA_PROFILE=true
PROFILER_AUTO_INCREMENT=true

# Command:
scripts/run_profiler.sh

Using Docker (Alternative)

# Run profiler inside backend container
docker-compose exec backend python main.py --table users --data-profile --app user-service --env production --auto-increment

Logging

The wrapper script creates logs in $PROFILER_HOME/logs/ with the format:

profiler_<CTM_ORDERID>.log

Control-M variables CTM_JOBNAME and CTM_ORDERID are automatically used for job identification in logs.

🐳 Docker Standalone Image

You can run DataProfiler as a standalone Docker container, useful for CI/CD pipelines or running in environments where you don't need the full stack.

Pull Image

docker pull nontster/data-profiler:latest

Note: The image supports linux/amd64 and linux/arm64.

Run Container

You can pass environment variables directly to the container to configure the connection.

# Example: Profile MSSQL and store metrics in PostgreSQL
docker run --rm \
  -e MSSQL_HOST=host.docker.internal \
  -e MSSQL_PORT=1433 \
  -e MSSQL_DATABASE=testdb \
  -e MSSQL_USER=sa \
  -e MSSQL_PASSWORD='YourStrong@Password123' \
  -e MSSQL_SCHEMA=dbo \
  -e METRICS_BACKEND=postgresql \
  -e PG_METRICS_HOST=host.docker.internal \
  -e PG_METRICS_PORT=5433 \
  -e PG_METRICS_DATABASE=profiler_metrics \
  -e PG_METRICS_USER=postgres \
  -e PG_METRICS_PASSWORD='password123' \
  nontster/data-profiler \
  --data-profile -d mssql -t users,products --app user-service --env uat

🐳 Docker Full Stack Environment

This project is fully containerized. You can spin up the entire stack (DBs, Backend, Frontend, Grafana) with one command.

# Start all services
docker-compose up -d --build

Metrics Backend Selection

You can choose where to store profiling metrics:

# Use PostgreSQL (default)
docker-compose up -d

# Use ClickHouse for metrics storage
METRICS_BACKEND=clickhouse docker-compose up -d

The React dashboard will display which backend is active in the header.

Services Overview

Service URL / Port Description
Frontend http://localhost:8080 Main Data Profiler Dashboard (React)
Grafana http://localhost:3000 Advanced Visualization (Admin)
Backend Internal (5001) API Service (Flask)
ClickHouse localhost:8123 HTTP Interface
PostgreSQL localhost:5432 Sample Source Database
PostgreSQL-Metrics localhost:5433 Metrics Storage Database
MSSQL localhost:1433 Source Database (Azure SQL Edge)
MySQL localhost:3306 Source Database

Note: PostgreSQL is separated into two instances: the sample database (port 5432) for source data profiling, and the metrics database (port 5433) for storing profiling results. This prevents confusion between sample data and metrics data.

Credentials

  • Grafana: User: admin, Pass: admin (or set via GRAFANA_ADMIN_PASSWORD in .env)
  • PostgreSQL: User: postgres, Pass: password123
  • MSSQL: User: sa, Pass: YourStrong@Password123
  • MySQL: User: user, Pass: password123
  • Oracle: User: PROD / UAT, Pass: password123
  • ClickHouse: User: default, Pass: password123

Starting MSSQL (Azure SQL Edge)

MSSQL uses Azure SQL Edge for ARM64/M1 compatibility:

# Start MSSQL container
docker compose up -d mssql

# Wait ~30 seconds for startup, then initialize database
python init-scripts/mssql/init-mssql.py
python init-scripts/mssql/init-mssql-schema-objects.py

# Test profiler
python main.py --table users -d mssql --no-store

Note: Azure SQL Edge doesn't run init scripts automatically like PostgreSQL. Use the Python script to create test databases.

Starting Oracle (21c XE)

Oracle runs as a container with pre-configured PROD and UAT users.

# Start Oracle container
docker compose up -d oracle

# Wait for startup (can take 1-2 minutes)
# Test connection
python main.py -d oracle --schema PROD

Stop Services

docker-compose down -v  # Stop and remove volumes

Sample Data & Testing

Docker automatically initializes sample data with two distinct schemas: prod (Production) and uat (UAT) to simulate real-world scenarios.

Table Schema Records Description
users prod 99 Established user base active since 2023
products prod 111 Full product catalog (Electronics, Accessories, etc.)
users uat 80 New hires, different salary ranges, more NULLs (simulated drift)
products uat 90 New catalog items, missing categories (simulated drift)

Generate Additional Sample Data

To add more test data for auto-increment growth rate calculation:

For PostgreSQL:

# Add 100 new users to PostgreSQL (Prod)
python init-scripts/postgres/generate-postgres-data.py --users 100 --schema prod

# Add 50 new products to UAT
python init-scripts/postgres/generate-postgres-data.py --products 50 --schema uat --no-users

# Add both with specific counts
python init-scripts/postgres/generate-postgres-data.py --users 500 --products 200 --schema prod

For MSSQL:

# Add 100 new users to MSSQL (Prod)
python init-scripts/mssql/generate-mssql-data.py --users 100 --schema prod

# Add 50 new products to MSSQL (UAT)
python init-scripts/mssql/generate-mssql-data.py --products 50 --schema uat --no-users

# Add both users and products
python init-scripts/mssql/generate-mssql-data.py --users 500 --products 200 --schema prod

# Show current statistics only
python init-scripts/mssql/generate-mssql-data.py --stats-only

For MySQL:

# Add 100 new users to MySQL (specify schema: prod, uat, or public)
python init-scripts/mysql/generate-mysql-data.py --schema prod --users 100

For Oracle:

# Add 100 new users to Oracle (PROD schema)
python init-scripts/oracle/generate-oracle-data.py --schema PROD --users 100

# Add 50 new products to Oracle (UAT schema)
python init-scripts/oracle/generate-oracle-data.py --schema UAT --products 50

View Results

Tip: Run the profiler multiple times after inserting new data to enable Days Until Full calculation for auto-increment columns.

📊 Dashboard Development

If you want to run the dashboard manually (outside Docker) for development:

# 1. Start Backend API
cd dashboard/backend
python app.py

# 2. Start Frontend
cd dashboard/frontend
npm run dev
# Access at http://localhost:5173

Technology Stack

Component Technology
Backend Flask + Flask-CORS
Frontend React + Vite
Styling TailwindCSS
Charts Recharts

Environment Comparison Dashboard

The dashboard supports dual-environment comparison to compare data profiles between environments (e.g., UAT vs Production).

React Environment Comparison Dashboard Screenshot

Features

  • Dual Environment Selectors: Select two environments to compare side-by-side
  • Summary Panel: Shows row counts and profiling timestamps for both environments
  • Comparison Charts: Side-by-side bar charts for Not Null Proportion and Distinct Proportion
  • Difference Highlighting: Color-coded differences (green = improvement, red = degradation)
  • Min/Max Filtering: Shows min/max values only for numeric and date/time columns (matches backend logic)
  • Auto-Increment Comparison: Compare overflow risk metrics between environments

API Endpoints

Endpoint Description
GET /api/metadata List all applications and their environments
GET /api/profiles/compare/<table>?app=<app>&env1=<env1>&env2=<env2> Compare profiles between two environments
GET /api/autoincrement/compare/<table>?app=<app>&env1=<env1>&env2=<env2> Compare auto-increment metrics

Schema Comparison Dashboard

The dashboard supports dual-environment schema comparison to compare table schemas between environments (e.g., UAT vs Production).

React Schema Comparison Dashboard Screenshot

Features

  • Tab Navigation: Switch between Data Profile and Schema Comparison views
  • Summary Cards: Shows total columns, matching, different, and environment-specific columns
  • Comparison Table: Side-by-side view of column schemas with:
    • Data Type comparison
    • Nullable status (NULL / NOT NULL)
    • Primary Key indicators (🔑)
    • Index membership indicators (📇)
  • Difference Highlighting: Color-coded status badges:
    • ✓ Match (gray) - Column exists and matches in both environments
    • ⚠ Modified (yellow) - Column exists but has differences
    • Added (green) - Column only exists in Environment 2
    • Removed (red) - Column only exists in Environment 1

API Endpoints

Endpoint Description
GET /api/schema/compare/<table>?app=<app>&env1=<env1>&env2=<env2> Compare schema between two environments

Usage

To capture schema data, run the profiler with --profile-schema flag:

# Profile schema for UAT
python main.py --table users --profile-schema --app user-service --env uat

# Profile schema for Production
python main.py --table users --profile-schema --app user-service --env production

# Profile multiple table schemas
python main.py -t users,orders --profile-schema --app user-service --env production

# Combine data profiling + schema profiling
python main.py -t users --data-profile --profile-schema --app user-service --env production

📈 Grafana Dashboard (Alternative)

This project includes a Grafana instance connected to ClickHouse for advanced visualization.

Features

  • Direct ClickHouse Integration: No middleware required.
  • Customizable: Create complex dashboards with SQL queries.
  • Alerting: Native Grafana alerting support.
  • User Management: Role-based access control.

Pre-Provisioned Dashboards

Four dashboards are automatically provisioned:

Dashboard Description
Main Dashboard Single environment view with data profiles, column details, and auto-increment monitoring
Environment Comparison Dashboard Compare profiles between two environments side-by-side with difference highlighting
Schema Comparison Dashboard Compare schema between two environments side-by-side with difference highlighting
Table Inventory Comparison Dashboard Compare table inventories between environments to detect table drift

Grafana Environment Comparison

Grafana Schema Comparison

PostgreSQL Metrics Support

Grafana is pre-configured with both ClickHouse and PostgreSQL datasources:

Datasource Description
ClickHouse Default metrics storage
PostgreSQL-Metrics Alternative when using PostgreSQL backend

The comparison dashboards are provided in two variants, one per metrics backend:

  • Environment Comparison Dashboard - uses ClickHouse
  • Environment Comparison Dashboard (PostgreSQL) - uses PostgreSQL
  • Table Inventory Comparison Dashboard - uses ClickHouse
  • Table Inventory Comparison Dashboard (PostgreSQL) - uses PostgreSQL

Setup

The Grafana service is included in docker-compose.yml and pre-configured with the ClickHouse datasource.

  1. Start all services:

    docker-compose up -d
  2. Open Grafana at http://localhost:3000 (login: admin / admin):

  3. Select a Dashboard:

    • Go to Dashboards menu
    • Choose Main Dashboard, Environment Comparison Dashboard, or Table Inventory Comparison Dashboard
    • Select Application, Environment(s), and Table from dropdowns

📝 License

This project is licensed under the Apache License 2.0.

Copyright 2024-2026 DataProfiler Contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

🤝 Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss.
