🌐 Language: English | ภาษาไทย
Automated Data Profiling tool for PostgreSQL and Microsoft SQL Server with dbt-profiler style metrics. Supports storing results in ClickHouse or PostgreSQL.
DataProfiler provides:
- dbt-profiler Style Metrics: calculated via SQL queries (opt-in with `--data-profile`)
- Schema Comparison: Compare table structures and database objects across different environments (e.g., UAT vs Prod)
- Schema Objects Profiling: Profiles Stored Procedures, Views, and Triggers for comprehensive database monitoring
- Auto-Increment Overflow Risk Analysis with growth prediction using Linear Regression
- Table Inventory Collection: Automatically discovers and records all tables per schema for drift detection
- Automatic Schema Discovery from source databases (information_schema)
- Multi-Database Support: PostgreSQL, Microsoft SQL Server (Azure SQL Edge), MySQL, and Oracle (21c XE+)
- Flexible Metrics Storage: Choose between ClickHouse or PostgreSQL for storing results
- Multiple Export Formats: Markdown, JSON, CSV, Console Table
- Web Dashboard for data visualization (React + TailwindCSS)
- Control-M Integration: Production-ready wrapper script for enterprise job scheduling
- Full Stack Docker: Complete containerized environment including Grafana, ClickHouse, and Postgres Metrics
- High Performance: Uses fast Catalog Statistics (O(1)) for row counts instead of full table scans
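
For illustration, on PostgreSQL such a catalog-based row estimate can be read from `pg_class.reltuples`, which is maintained by `VACUUM`/`ANALYZE` and costs O(1) to read regardless of table size. A minimal sketch of the idea (not necessarily the exact query DataProfiler issues), using the `psycopg2` dependency and the sample `public.users` table:

```python
# Sketch: O(1) row-count estimate from PostgreSQL catalog statistics.
# Illustrative only; DataProfiler's actual query may differ.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="your_actual_password")
with conn.cursor() as cur:
    # reltuples is an estimate refreshed by VACUUM/ANALYZE, so reading it
    # avoids the full scan that SELECT COUNT(*) would perform.
    cur.execute("""
        SELECT c.reltuples::bigint
        FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE n.nspname = %s AND c.relname = %s
    """, ("public", "users"))
    print("Estimated rows:", cur.fetchone()[0])
conn.close()
```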
For each column, the system collects the following statistics (dbt-profiler compatible):
| Metric | Description | Condition |
|---|---|---|
| `column_name` | Column name | All columns |
| `data_type` | Data type | All columns |
| `not_null_proportion` | Proportion of non-NULL values (0.00-1.00) | All columns |
| `distinct_proportion` | Proportion of unique values (0.00-1.00) | All columns |
| `distinct_count` | Count of unique values | All columns |
| `is_unique` | Whether all values are unique (true/false) | All columns |
| `min` / `max` | Minimum / Maximum values | numeric, date, time* |
| `avg` | Average value | numeric** |
| `median` | Median value | numeric** |
| `std_dev_population` | Population standard deviation | numeric** |
| `std_dev_sample` | Sample standard deviation | numeric** |
| `histogram` | Numeric bucket distribution (JSON) | numeric* |
| `frequencies` | Top-N categorical frequencies (JSON) | non-numeric |
| `profiled_at` | Profile timestamp | All columns |
\* `min`/`max` supported for: integer, numeric, float, date, timestamp, time
\*\* `avg`, `median`, `std_dev` supported for: integer, numeric, float
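
To make the metric definitions concrete, the proportions above reduce to simple aggregate SQL. A hand-written sketch for one column (the SQL DataProfiler generates may differ; the `email` column is a hypothetical example):

```python
# Sketch: dbt-profiler-style metrics for a single column via plain SQL.
# Illustrative only; not the tool's generated query.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="your_actual_password")
with conn.cursor() as cur:
    # COUNT(col) skips NULLs while COUNT(*) does not, hence the proportions.
    cur.execute("""
        SELECT
            COUNT(email)::numeric / NULLIF(COUNT(*), 0)          AS not_null_proportion,
            COUNT(DISTINCT email)::numeric / NULLIF(COUNT(*), 0) AS distinct_proportion,
            COUNT(DISTINCT email)                                AS distinct_count,
            COUNT(DISTINCT email) = COUNT(*)                     AS is_unique
        FROM public.users
    """)
    print(cur.fetchone())
conn.close()
```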
DataProfiler includes Auto-Increment Column Overflow Risk Analysis to predict when primary key columns will reach their maximum capacity.
- Current Value Tracking: Monitors the current value of auto-increment columns
- Maximum Value Calculation: Computes the max value based on data type (e.g., INT, BIGINT)
- Usage Percentage: Calculates current capacity usage
- Growth Rate Prediction: Uses Linear Regression on historical data from ClickHouse
- Days Until Full: Predicts when the column will reach maximum capacity
- Alert Status: CRITICAL (< 30 days / > 90%), WARNING (< 90 days / > 75%), NORMAL
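
The prediction step itself is straightforward once historical samples exist in metrics storage. A minimal sketch using `scipy` (the sample history and `INT_MAX` constant are illustrative; the real analysis lives in `src/core/autoincrement_metrics.py`):

```python
# Sketch: estimate days until an auto-increment column overflows using
# linear regression over historical samples. Illustrative only.
from datetime import datetime
from scipy import stats

INT_MAX = 2_147_483_647  # integer/serial maximum (see tables below)

# Hypothetical history: (profiled_at, current auto-increment value)
history = [
    (datetime(2024, 1, 1),  1_900_000_000),
    (datetime(2024, 1, 8),  1_905_000_000),
    (datetime(2024, 1, 15), 1_911_000_000),
]
days   = [(t - history[0][0]).days for t, _ in history]
values = [v for _, v in history]

slope, intercept, *_ = stats.linregress(days, values)  # slope = IDs per day

current = values[-1]
usage_pct = 100.0 * current / INT_MAX
days_until_full = (INT_MAX - current) / slope if slope > 0 else float("inf")

# Thresholds as described above
if days_until_full < 30 or usage_pct > 90:
    status = "CRITICAL"
elif days_until_full < 90 or usage_pct > 75:
    status = "WARNING"
else:
    status = "NORMAL"
print(f"{usage_pct:.1f}% used, ~{days_until_full:.0f} days until full -> {status}")
```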
PostgreSQL:

| Data Type | Max Value | Range |
|---|---|---|
| `smallint` | 32,767 | -32,768 to 32,767 |
| `integer` | 2,147,483,647 | -2.1B to 2.1B |
| `bigint` | 9,223,372,036,854,775,807 | -9.2 quintillion to 9.2 quintillion |
| `serial` | 2,147,483,647 | 1 to 2.1B |
| `bigserial` | 9,223,372,036,854,775,807 | 1 to 9.2 quintillion |

MSSQL:

| Data Type | Max Value | Range |
|---|---|---|
| `tinyint` | 255 | 0 to 255 |
| `smallint` | 32,767 | -32,768 to 32,767 |
| `int` | 2,147,483,647 | -2.1B to 2.1B |
| `bigint` | 9,223,372,036,854,775,807 | -9.2 quintillion to 9.2 quintillion |
Note: Supports PostgreSQL SERIAL/BIGSERIAL/IDENTITY, MSSQL IDENTITY, and Oracle IDENTITY columns.
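
On PostgreSQL, such columns can be detected through `information_schema`: SERIAL columns carry a `nextval(...)` default, while IDENTITY columns set `is_identity`. A minimal detection sketch assuming `psycopg2` (the project's actual detector lives in `src/db/autoincrement.py`):

```python
# Sketch: find auto-increment candidates on PostgreSQL via information_schema.
# Illustrative only; see src/db/autoincrement.py for the real detector.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="your_actual_password")
with conn.cursor() as cur:
    # SERIAL columns get a nextval(...) default; IDENTITY columns set is_identity.
    cur.execute("""
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = %s
          AND (is_identity = 'YES' OR column_default LIKE 'nextval(%%')
    """, ("public",))
    print(cur.fetchall())
conn.close()
```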
DataProfiler can profile table schemas (columns, data types, indexes, foreign keys) and store them for comparison between different environments (e.g., UAT vs Production).
- Schema Snapshot: Captures detailed schema metadata
- Comparison Dashboard: Grafana dashboard to visualize differences
- Multi-Database Support: Profile schemas from both PostgreSQL and MSSQL
- Strict Mode: Compares defaults, nullability, and index structure
| Category | Checks |
|---|---|
| Columns | Existence, Data Types, Nullability, Defaults |
| Indexes | Primary Keys, Index Membership, Index Names |
| Constraints | Foreign Keys, Check Constraints |
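
Conceptually, each column ends up in one of the four states the dashboards later display (match, modified, added, removed). A simplified sketch of that classification (the real logic in `src/core/schema_comparator.py` checks more attributes; the snapshot dict shape here is hypothetical):

```python
# Sketch: classify columns when diffing two schema snapshots.
# Simplified illustration; not the actual schema_comparator implementation.
def compare_columns(env1: dict, env2: dict) -> dict:
    """env1/env2 map column name -> attributes such as data_type/nullable/default."""
    result = {}
    for name in env1.keys() | env2.keys():
        if name not in env2:
            result[name] = "removed"    # only in environment 1
        elif name not in env1:
            result[name] = "added"      # only in environment 2
        elif env1[name] == env2[name]:
            result[name] = "match"
        else:
            result[name] = "modified"   # present in both, attributes differ
    return result

uat  = {"id": {"data_type": "integer", "nullable": False}}
prod = {"id": {"data_type": "bigint",  "nullable": False}}
print(compare_columns(uat, prod))  # {'id': 'modified'}
```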
In addition to table schemas, DataProfiler automatically collects metadata for other database objects:
- Stored Procedures
- Views
- Triggers
This allows you to detect added, removed, or modified objects between environments.
Note: Schema object profiling runs automatically when metrics storage is enabled (the default), even without `--profile-schema`.
- Python 3.10+
- PostgreSQL and/or Microsoft SQL Server (Azure SQL Edge for ARM64/M1)
- MySQL (v8.0+)
- Oracle Database (19c/21c XE+)
- ClickHouse
- Dependencies:
  - `psycopg2` - PostgreSQL adapter
  - `pymssql` - MSSQL adapter
  - `clickhouse-connect` - ClickHouse client
  - `mysql-connector-python` - MySQL adapter
  - `oracledb` - Oracle adapter
  - `python-dotenv` - Environment variable management
  - `numpy` - Numerical computing
  - `scipy` - Scientific computing (Linear Regression)
- Clone the repository:
  ```bash
  git clone <repository-url>
  cd DataProfiler
  ```
- Create and activate a virtual environment:
  ```bash
  # Create venv
  python -m venv venv

  # Activate (macOS/Linux)
  source venv/bin/activate

  # Activate (Windows)
  venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```

Copy `.env.example` to `.env`:
```bash
cp .env.example .env
```

Edit the `.env` file:
```bash
# PostgreSQL Configuration
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_actual_password
POSTGRES_SCHEMA=public
# MySQL Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_DATABASE=prod
MYSQL_USER=user
MYSQL_PASSWORD=password123
# Oracle Configuration
ORACLE_HOST=localhost
ORACLE_PORT=1521
ORACLE_SERVICE_NAME=XEPDB1
ORACLE_USER=PROD
ORACLE_PASSWORD=password123
# ClickHouse Configuration
CLICKHOUSE_HOST=localhost
CLICKHOUSE_PORT=8123
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD=your_actual_password
# Metrics Backend Configuration
# Options: 'postgresql' (default) or 'clickhouse'
METRICS_BACKEND=postgresql
# Distribution/Profiling Limits
HISTOGRAM_MAX_BINS=20
FREQUENCY_TOP_N=100
# PostgreSQL Metrics Configuration (separate instance for storing profiling results)
# When using docker-compose, this points to the postgres-metrics service (port 5433)
PG_METRICS_HOST=localhost
PG_METRICS_PORT=5433
PG_METRICS_DATABASE=profiler_metrics
PG_METRICS_USER=postgres
PG_METRICS_PASSWORD=your_password
```

⚠️ Important: The `.env` file is already git-ignored, so you don't need to worry about committing credentials.
The project uses GitHub Actions for continuous integration and continuous deployment:
- Unit Testing: Automatically runs `pytest` on every push to the repository.
- Docker Build & Publish: Automatically builds a multi-architecture (`linux/amd64`, `linux/arm64`) Docker image and publishes it to Docker Hub as `nontster/data-profiler:<SHORT_SHA>` on every merge into the `main` branch.
Required Repository Secrets: To enable the Docker publish workflow, you must configure the following secrets in your GitHub repository (Settings > Secrets and variables > Actions):
- `DOCKERHUB_USERNAME`: Your Docker Hub username.
- `DOCKERHUB_TOKEN`: A Docker Hub Personal Access Token with Read & Write permissions.
By default (without feature flags), DataProfiler collects table inventory only; no `--table` is needed:
```bash
# Collect table inventory only (default behavior)
python main.py -d mssql --schema prod --app user-service --env production
# Collect table inventory from PostgreSQL (default database)
python main.py --schema uat --app user-service --env uat
```

Use `--data-profile` to enable column-level statistics. It requires `--table`:
```bash
# Profile 'users' table from PostgreSQL
python main.py --data-profile --table users
# Profile multiple tables
python main.py --data-profile -t users,products
# Profile from specific schema
python main.py --data-profile --table users --schema prod
# Profile from MSSQL
python main.py --data-profile -d mssql --table users
# Profile from MySQL
python main.py --data-profile -d mysql --table users
# Profile from Oracle
python main.py --data-profile -d oracle --table users --schema PROD
# Profile with Application & Environment context
python main.py --data-profile -d mssql -t users,products --app user-service --env uat --metrics-backend postgresql
```

Auto-increment analysis requires `--data-profile`:
```bash
# Include auto-increment overflow analysis
python main.py --data-profile --auto-increment --table users
python main.py --data-profile --auto-increment -d mssql --table users
# Custom lookback period for growth calculation
python main.py --data-profile --auto-increment --table users --lookback-days 14
```

Schema profiling:
```bash
# Profile schema for User Service in Production
python main.py --profile-schema --table users --app user-service --env production
# Profile same table in UAT
python main.py --profile-schema --table users --app user-service --env uat
# Profile multiple table schemas
python main.py --profile-schema -t users,products --app user-service --env production
# Comparison is done via Grafana Dashboard (Table Schema Comparison)
```

Schema objects (Stored Procedures, Views, Triggers) are profiled automatically when connecting to a database with metrics storage enabled:
```bash
# Profile schema objects (and table inventory)
python main.py --app user-service --env production --schema prod
```

View the results in the Schema Objects Comparison Dashboard.
Output formats apply when `--data-profile` is used:
```bash
# Console table (default)
python main.py --table users --data-profile --format table
# Markdown (dbt-profiler style)
python main.py --table users --data-profile --format markdown
# JSON
python main.py --table users --data-profile --format json
# CSV
python main.py --table users --data-profile --format csv
```

Save output to a file:
```bash
python main.py --table users --data-profile --format markdown --output profiles/users.md
python main.py --table users --data-profile --format json --output profiles/users.json
python main.py -t users,orders --data-profile --format csv --output profiles/report.csv
```

Note: When profiling multiple tables with `--output`, results are appended to the same file.

Other options:
```bash
# Skip storing to Metrics DB
python main.py --table users --no-store
# Verbose/debug logging (include detailed SQL queries)
python main.py --table users -v
# Show help
python main.py --help
```

Choose which source database to profile:
| Option | Description | Required Environment Variables |
|---|---|---|
| `postgresql` | PostgreSQL (default) | `POSTGRES_HOST`, `POSTGRES_PORT`, etc. |
| `mssql` | Microsoft SQL Server | `MSSQL_HOST`, `MSSQL_PORT`, `MSSQL_DATABASE`, etc. |
| `mysql` | MySQL | `MYSQL_HOST`, `MYSQL_PORT`, `MYSQL_DATABASE`, etc. |
| `oracle` | Oracle Database | `ORACLE_HOST`, `ORACLE_PORT`, `ORACLE_SERVICE_NAME` |

```bash
# Profile from PostgreSQL (default)
python main.py --table users --data-profile
# Profile from MSSQL
python main.py --data-profile -d mssql --table users
python main.py --data-profile --database-type mssql --table users
# MSSQL with auto-increment analysis
python main.py --data-profile --auto-increment -d mssql --table users
# Oracle with auto-increment analysis
python main.py --data-profile --auto-increment -d oracle --table users --schema PROD
```

Choose where to store profiling results:
| Option | Description | Required Environment Variables |
|---|---|---|
| `postgresql` | PostgreSQL (default) | `PG_METRICS_*`, falling back to `POSTGRES_*` variables |
| `clickhouse` | ClickHouse | `CLICKHOUSE_HOST`, `CLICKHOUSE_PORT` |

```bash
# Use PostgreSQL (default)
python main.py --table users
# Use ClickHouse for metrics storage
python main.py --table users --metrics-backend clickhouse
# Combine: Profile MSSQL, store in PostgreSQL
python main.py -d mssql --table orders --metrics-backend postgresql
```

Full example (profile MSSQL, store metrics in PostgreSQL):
```bash
# Set environment variables
export MSSQL_HOST=localhost
export MSSQL_PORT=1433
export MSSQL_DATABASE=testdb
export MSSQL_USER=sa
export MSSQL_PASSWORD='YourStrong@Password123'
export MSSQL_SCHEMA=dbo
export METRICS_BACKEND=postgresql
export PG_METRICS_HOST=localhost
export PG_METRICS_PORT=5433
export PG_METRICS_DATABASE=profiler_metrics
export PG_METRICS_USER=postgres
export PG_METRICS_PASSWORD='password123'
# Run profiler
python main.py \
  --data-profile \
  --auto-increment \
  -d mssql \
  --table users \
  --metrics-backend postgresql \
  --app user-service \
  --env production \
  --schema prod
```
If you need to export the Grafana dashboards for import into another Grafana instance (stripping backend-specific suffixes such as "(PostgreSQL)" or "(ClickHouse)" from titles), use the provided export script:
```bash
python scripts/export_dashboards.py
```

This creates a `grafana/dashboards_exported` directory containing the sanitized JSON files.
Key Features:
- Dynamic Data Source: Identifies the data source type from the filename suffix (`_ch` -> ClickHouse, `_pg` -> PostgreSQL) and sets it as the default.
- Import Flexibility: Injects a `DS_PROFILER_METRICS` template variable, allowing you to select the target data source during import.
```
DataProfiler/
├── .env.example # Environment variables template
├── .env # Environment variables (git ignored)
├── .gitignore # Git ignore rules
├── LICENSE # Apache 2.0 License
├── docker-compose.yml # Docker full stack environment
├── main.py # Main entry point
├── pytest.ini # Pytest configuration
├── README.md # Documentation (English)
├── README.th.md # Documentation (Thai)
├── requirements.txt # Python dependencies
│
├── scripts/ # Automation scripts
│ └── run_profiler.sh # Control-M wrapper script
│
├── dashboard/ # Web Dashboard
│ ├── backend/ # Flask API server
│ │ ├── app.py # API endpoints
│ │ └── requirements.txt # Backend dependencies
│ └── frontend/ # React + Vite frontend
│ ├── src/
│ │ └── App.jsx # Main React component
│ ├── package.json
│ └── nginx.conf # Production nginx config
│
├── docs/ # Documentation assets
│ └── images/ # Dashboard screenshots
│
├── grafana/ # Grafana dashboards & config
│ ├── dashboards/
│ │ ├── main_dashboard.json # Single environment view
│ │ └── env_comparison_dashboard.json # Environment comparison
│ └── provisioning/
│ ├── dashboards/dashboard.yml
│ └── datasources/datasource.yml
│
├── init-scripts/ # Database initialization scripts
│ ├── clickhouse/ # ClickHouse schema & test data
│ ├── mssql/ # MSSQL init & data generation
│ └── postgres/ # PostgreSQL init & data generation
│
├── src/ # Source code modules
│ ├── __init__.py
│ ├── config.py # Configuration management
│ ├── exceptions.py # Custom exceptions
│ ├── core/ # Core profiling logic
│ │ ├── __init__.py
│ │ ├── autoincrement_metrics.py # Auto-increment analysis
│ │ ├── formatters.py # Output formatters (MD, JSON, CSV)
│ │ ├── metrics.py # dbt-profiler style metrics
│ │ └── schema_comparator.py # Schema comparison logic
│ └── db/ # Database connections
│ ├── __init__.py
│ ├── autoincrement.py # Auto-increment detector (PostgreSQL & MSSQL)
│ ├── clickhouse.py # ClickHouse metrics client
│ ├── postgres_metrics.py # PostgreSQL metrics client
│ ├── connection_factory.py # Multi-database factory
│ ├── mssql.py # MSSQL client
│ └── postgres.py # PostgreSQL client
│
├── tests/ # Unit tests
│ ├── __init__.py
│ ├── test_autoincrement.py
│ ├── test_config.py
│ ├── test_connections.py
│ ├── test_metadata.py
│ └── test_table_inventory.py
│
└── venv/                          # Python virtual environment (git ignored)
```

Run the unit tests:
```bash
# Activate virtual environment
source venv/bin/activate
# Run tests
pytest
# Run with verbose output
pytest -v
# Run with coverage report
pytest --cov=src --cov-report=term-missing
```

For example, to manually verify all functionality for MSSQL with a full workflow:
```bash
# Configure MSSQL Connection
export MSSQL_HOST=localhost
export MSSQL_PORT=1433
export MSSQL_DATABASE=testdb
export MSSQL_USER=sa
export MSSQL_PASSWORD=YourStrong@Password123
export MSSQL_SCHEMA=dbo
# Initialize MSSQL Database
python init-scripts/mssql/init-mssql.py
python init-scripts/mssql/init-mssql-schema-objects.py

# Generate Sample Data for UAT and Prod schemas
python init-scripts/mssql/generate-mssql-data.py --schema uat
python init-scripts/mssql/generate-mssql-data.py --schema prod
```

You can run the profiler using `main.py` directly:
```bash
# Profile MSSQL (UAT) -> Store in PostgreSQL
python main.py --data-profile --auto-increment --profile-schema -d mssql -t users,products --app user-service --env uat --schema uat --metrics-backend postgresql
# Profile MSSQL (Prod) -> Store in PostgreSQL
python main.py --data-profile --auto-increment --profile-schema -d mssql -t users,products --app user-service --env prod --schema prod --metrics-backend postgresql
```

Or use the `scripts/run_profiler.sh` wrapper script:
```bash
# Profile MSSQL (UAT)
scripts/run_profiler.sh --data-profile --auto-increment --profile-schema --database-type mssql -t users,products --app user-service --env uat --schema uat --metrics-backend postgresql
# Profile MSSQL (Prod)
scripts/run_profiler.sh --data-profile --auto-increment --profile-schema --database-type mssql -t users,products --app user-service --env prod --schema prod --metrics-backend postgresql
```

This workflow verifies:
- MSSQL connection and data retrieval
- Sample data generation scripts
- Schema profiling capabilities
- Auto-increment analysis
- Storing metrics in PostgreSQL
DataProfiler includes a wrapper script for running as a scheduled job in Control-M or similar job schedulers.
`scripts/run_profiler.sh`
The wrapper script supports passing CLI arguments directly to the Python profiler; CLI arguments override environment variables.
```bash
# Show help
scripts/run_profiler.sh -h
scripts/run_profiler.sh --help
# Override database type via CLI (ignores PROFILER_DB_TYPE env var)
scripts/run_profiler.sh -d mssql
scripts/run_profiler.sh --database-type mssql
# Override metrics backend via CLI (ignores METRICS_BACKEND env var)
scripts/run_profiler.sh --metrics-backend postgresql
# Specify table name via CLI (ignores PROFILER_TABLE env var)
scripts/run_profiler.sh --table users
# Specify multiple tables
scripts/run_profiler.sh --table users,products
scripts/run_profiler.sh -t users,products
# Override schema (ignores PROFILER_SCHEMA env var)
scripts/run_profiler.sh --schema uat
# Combine CLI arguments
scripts/run_profiler.sh --table users -d mssql --metrics-backend postgresql --data-profile --auto-increment --schema prod
scripts/run_profiler.sh -t users,orders --data-profile --app user-service --env uat --database-type mssql --metrics-backend postgresql --auto-increment
# Skip storing metrics
scripts/run_profiler.sh --no-store
```

CLI arguments take precedence over environment variables:
| Configuration | Priority | Example |
|---|---|---|
| CLI argument | 1 (High) | `--metrics-backend postgresql` |
| Environment | 2 (Low) | `METRICS_BACKEND=clickhouse` |
Example: if `METRICS_BACKEND=clickhouse` is set but you run:
```bash
scripts/run_profiler.sh --metrics-backend postgresql
```
the script will use PostgreSQL for metrics storage.
Configure these environment variables in your Control-M job definition:
Refer to the Configuration section for the required environment variables:
- `POSTGRES_*`
- `MSSQL_*`
- `CLICKHOUSE_*`
- `PG_METRICS_*`
| Variable | Default | Description |
|---|---|---|
| `PROFILER_TABLE` | - | Comma-separated table names (required with `--data-profile`) |
| `PROFILER_SCHEMA` | (default DB) | Schema name (e.g., public, dbo, prod, uat) |
| `PROFILER_FORMAT` | `table` | Output format: table, markdown, json, csv |
| `PROFILER_OUTPUT_FILE` | - | File path to save output |
| `PROFILER_APP` | `default` | Application name |
| `PROFILER_ENV` | `production` | Environment name |
| `PROFILER_DB_TYPE` | `postgresql` | Database type: postgresql, mssql, mysql |
| `METRICS_BACKEND` | `postgresql` | Metrics backend: postgresql, clickhouse |
| `PROFILER_DATA_PROFILE` | `false` | Enable data profiling (column-level statistics) |
| `PROFILER_AUTO_INCREMENT` | `false` | Enable auto-increment analysis (requires `PROFILER_DATA_PROFILE`) |
| `PROFILER_PROFILE_SCHEMA` | `false` | Enable schema profiling |
| `PROFILER_LOOKBACK_DAYS` | `7` | Days for growth rate calculation |
| `PROFILER_NO_STORE` | `false` | Skip storing metrics |
| `PROFILER_VERBOSE` | `false` | Enable verbose logging |
| `PYTHON_PATH` | `python3` | Path to Python executable |
| `PROFILER_HOME` | (script dir) | Path to DataProfiler installation |
| Code | Description |
|---|---|
| `0` | Success |
| `1` | Configuration error (missing required environment variables) |
| `2` | Execution error (profiler failed) |
| `3` | Python environment error |
Example 1 (PostgreSQL source):
```bash
# Environment Variables (set in Control-M):
# Source Database (PostgreSQL)
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=postgres
POSTGRES_USER=postgres
POSTGRES_PASSWORD='password123'
CLICKHOUSE_HOST=localhost
CLICKHOUSE_PORT=8123
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD='password123'
PROFILER_TABLE=users,products
PROFILER_APP=user-service
PROFILER_ENV=uat
PROFILER_DATA_PROFILE=true
PROFILER_AUTO_INCREMENT=true
# Command:
scripts/run_profiler.sh
```

Example 2 (MSSQL source, PostgreSQL metrics storage):
```bash
# Environment Variables (set in Control-M):
# Source Database (MSSQL)
MSSQL_HOST=localhost
MSSQL_PORT=1433
MSSQL_DATABASE=testdb
MSSQL_USER=sa
MSSQL_PASSWORD='YourStrong@Password123'
MSSQL_SCHEMA=dbo
# Metrics Storage (PostgreSQL)
METRICS_BACKEND=postgresql
PG_METRICS_HOST=localhost
PG_METRICS_PORT=5433
PG_METRICS_DATABASE=profiler_metrics
PG_METRICS_USER=postgres
PG_METRICS_PASSWORD='password123'
# Profiler Options
PROFILER_TABLE=users,orders
PROFILER_DB_TYPE=mssql
PROFILER_APP=user-service
PROFILER_ENV=uat
PROFILER_DATA_PROFILE=true
PROFILER_AUTO_INCREMENT=true
# Command:
scripts/run_profiler.sh
```

Run the profiler inside the backend container:
```bash
# Run profiler inside backend container
docker-compose exec backend python main.py --table users --data-profile --app user-service --env production --auto-increment
```

The wrapper script creates logs in `$PROFILER_HOME/logs/` with the format:
`profiler_<CTM_ORDERID>.log`
The Control-M variables `CTM_JOBNAME` and `CTM_ORDERID` are used automatically for job identification in logs.
You can run DataProfiler as a standalone Docker container, useful for CI/CD pipelines or running in environments where you don't need the full stack.
```bash
docker pull nontster/data-profiler:latest
```

Note: The image supports `linux/amd64` and `linux/arm64`.
You can pass environment variables directly to the container to configure the connection.
```bash
# Example: Profile MSSQL and store metrics in PostgreSQL
docker run --rm \
-e MSSQL_HOST=host.docker.internal \
-e MSSQL_PORT=1433 \
-e MSSQL_DATABASE=testdb \
-e MSSQL_USER=sa \
-e MSSQL_PASSWORD='YourStrong@Password123' \
-e MSSQL_SCHEMA=dbo \
-e METRICS_BACKEND=postgresql \
-e PG_METRICS_HOST=host.docker.internal \
-e PG_METRICS_PORT=5433 \
-e PG_METRICS_DATABASE=profiler_metrics \
-e PG_METRICS_USER=postgres \
-e PG_METRICS_PASSWORD='password123' \
nontster/data-profiler \
  --data-profile -d mssql -t users,products --app user-service --env uat
```

This project is fully containerized. You can spin up the entire stack (DBs, Backend, Frontend, Grafana) with one command:
```bash
# Start all services
docker-compose up -d --build
```

You can choose where to store profiling metrics:
```bash
# Use PostgreSQL (default)
docker-compose up -d
# Use ClickHouse for metrics storage
METRICS_BACKEND=clickhouse docker-compose up -d
```

The React dashboard displays which backend is active in the header.
| Service | URL / Port | Description |
|---|---|---|
| Frontend | http://localhost:8080 | Main Data Profiler Dashboard (React) |
| Grafana | http://localhost:3000 | Advanced Visualization (Admin) |
| Backend | Internal (5001) | API Service (Flask) |
| ClickHouse | localhost:8123 | HTTP Interface |
| PostgreSQL | localhost:5432 | Sample Source Database |
| PostgreSQL-Metrics | localhost:5433 | Metrics Storage Database |
| MSSQL | localhost:1433 | Source Database (Azure SQL Edge) |
| MySQL | localhost:3306 | Source Database |
Note: PostgreSQL is separated into two instances: the sample database (port 5432) for source data profiling, and the metrics database (port 5433) for storing profiling results. This prevents confusion between sample data and metrics data.
- Grafana: User: `admin`, Pass: `admin` (or set via `GRAFANA_ADMIN_PASSWORD` in `.env`)
- PostgreSQL: User: `postgres`, Pass: `password123`
- MSSQL: User: `sa`, Pass: `YourStrong@Password123`
- MySQL: User: `user`, Pass: `password123`
- Oracle: User: `PROD`/`UAT`, Pass: `password123`
- ClickHouse: User: `default`, Pass: `password123`
MSSQL uses Azure SQL Edge for ARM64/M1 compatibility:
```bash
# Start MSSQL container
docker compose up -d mssql
# Wait ~30 seconds for startup, then initialize database
python init-scripts/mssql/init-mssql.py
python init-scripts/mssql/init-mssql-schema-objects.py
# Test profiler
python main.py --table users -d mssql --no-store
```

Note: Azure SQL Edge doesn't run init scripts automatically the way PostgreSQL does; use the Python scripts to create the test databases.
Oracle runs as a container with pre-configured `PROD` and `UAT` users:
```bash
# Start Oracle container
docker compose up -d oracle
# Wait for startup (can take 1-2 minutes)
# Test connection
python main.py -d oracle --schema PROD
```

To stop the stack:
```bash
docker-compose down -v   # Stop and remove volumes
```

Docker automatically initializes sample data with two distinct schemas, `prod` (Production) and `uat` (UAT), to simulate real-world scenarios.
| Table | Schema | Records | Description |
|---|---|---|---|
| `users` | `prod` | 99 | Established user base active since 2023 |
| `products` | `prod` | 111 | Full product catalog (Electronics, Accessories, etc.) |
| `users` | `uat` | 80 | New hires, different salary ranges, more NULLs (simulated drift) |
| `products` | `uat` | 90 | New catalog items, missing categories (simulated drift) |
To add more test data for auto-increment growth rate calculation:
For PostgreSQL:
```bash
# Add 100 new users to PostgreSQL (Prod)
python init-scripts/postgres/generate-postgres-data.py --users 100 --schema prod
# Add 50 new products to UAT
python init-scripts/postgres/generate-postgres-data.py --products 50 --schema uat --no-users
# Add both with specific counts
python init-scripts/postgres/generate-postgres-data.py --users 500 --products 200 --schema prod
```

For MSSQL:
```bash
# Add 100 new users to MSSQL (Prod)
python init-scripts/mssql/generate-mssql-data.py --users 100 --schema prod
# Add 50 new products to MSSQL (UAT)
python init-scripts/mssql/generate-mssql-data.py --products 50 --schema uat --no-users
# Add both users and products
python init-scripts/mssql/generate-mssql-data.py --users 500 --products 200 --schema prod
# Show current statistics only
python init-scripts/mssql/generate-mssql-data.py --stats-only
```

For MySQL:
```bash
# Add 100 new users to MySQL (specify schema: prod, uat, or public)
python init-scripts/mysql/generate-mysql-data.py --schema prod --users 100
```

For Oracle:
```bash
# Add 100 new users to Oracle (PROD schema)
python init-scripts/oracle/generate-oracle-data.py --schema PROD --users 100
# Add 50 new products to Oracle (UAT schema)
python init-scripts/oracle/generate-oracle-data.py --schema UAT --products 50
```

- React Dashboard: http://localhost:8080
- Grafana Dashboard: http://localhost:3000
Tip: Run the profiler multiple times after inserting new data to enable the Days Until Full calculation for auto-increment columns.
If you want to run the dashboard manually (outside Docker) for development:
```bash
# 1. Start Backend API
cd dashboard/backend
python app.py
# 2. Start Frontend
cd dashboard/frontend
npm run dev
# Access at http://localhost:5173
```

| Component | Technology |
|---|---|
| Backend | Flask + Flask-CORS |
| Frontend | React + Vite |
| Styling | TailwindCSS |
| Charts | Recharts |
The dashboard supports dual-environment comparison of data profiles (e.g., UAT vs Production).
- Dual Environment Selectors: Select two environments to compare side-by-side
- Summary Panel: Shows row counts and profiling timestamps for both environments
- Comparison Charts: Side-by-side bar charts for Not Null Proportion and Distinct Proportion
- Difference Highlighting: Color-coded differences (green = improvement, red = degradation)
- Min/Max Filtering: Shows min/max values only for numeric and date/time columns (matches backend logic)
- Auto-Increment Comparison: Compare overflow risk metrics between environments
| Endpoint | Description |
|---|---|
| `GET /api/metadata` | List all applications and their environments |
| `GET /api/profiles/compare/<table>?app=<app>&env1=<env1>&env2=<env2>` | Compare profiles between two environments |
| `GET /api/autoincrement/compare/<table>?app=<app>&env1=<env1>&env2=<env2>` | Compare auto-increment metrics |
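
For example, a profile comparison for the `users` table can be fetched directly from the API. A sketch using only the Python standard library (host and port are assumptions; in the Docker stack the backend's port 5001 is internal, so you may need to expose it or go through the frontend proxy):

```python
# Sketch: call the profile comparison endpoint. Host/port are assumptions.
import json
from urllib.request import urlopen

url = ("http://localhost:5001/api/profiles/compare/users"
       "?app=user-service&env1=uat&env2=production")
with urlopen(url) as resp:
    payload = json.load(resp)
print(json.dumps(payload, indent=2)[:500])  # preview the comparison result
```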
The dashboard supports dual-environment comparison of table schemas (e.g., UAT vs Production).
- Tab Navigation: Switch between Data Profile and Schema Comparison views
- Summary Cards: Shows total columns, matching, different, and environment-specific columns
- Comparison Table: Side-by-side view of column schemas with:
- Data Type comparison
- Nullable status (NULL / NOT NULL)
- Primary Key indicators (🔑)
- Index membership indicators (📇)
- Difference Highlighting: Color-coded status badges:
- ✓ Match (gray) - Column exists and matches in both environments
- ⚠ Modified (yellow) - Column exists but has differences
- + Added (green) - Column only exists in Environment 2
- − Removed (red) - Column only exists in Environment 1
| Endpoint | Description |
|---|---|
| `GET /api/schema/compare/<table>?app=<app>&env1=<env1>&env2=<env2>` | Compare schema between two environments |
To capture schema data, run the profiler with the `--profile-schema` flag:
```bash
# Profile schema for UAT
python main.py --table users --profile-schema --app user-service --env uat
# Profile schema for Production
python main.py --table users --profile-schema --app user-service --env production
# Profile multiple table schemas
python main.py -t users,orders --profile-schema --app user-service --env production
# Combine data profiling + schema profiling
python main.py -t users --data-profile --profile-schema --app user-service --env production
```

This project includes a Grafana instance connected to ClickHouse for advanced visualization.
- Direct ClickHouse Integration: No middleware required.
- Customizable: Create complex dashboards with SQL queries.
- Alerting: Native Grafana alerting support.
- User Management: Role-based access control.
Four dashboards are automatically provisioned:
| Dashboard | Description |
|---|---|
| Main Dashboard | Single environment view with data profiles, column details, and auto-increment monitoring |
| Environment Comparison Dashboard | Compare profiles between two environments side-by-side with difference highlighting |
| Schema Comparison Dashboard | Compare schema between two environments side-by-side with difference highlighting |
| Table Inventory Comparison Dashboard | Compare table inventories between environments to detect table drift |
Grafana is pre-configured with both ClickHouse and PostgreSQL datasources:
| Datasource | Description |
|---|---|
| ClickHouse | Default metrics storage |
| PostgreSQL-Metrics | Alternative when using PostgreSQL backend |
Two versions of dashboards are provided:
- Environment Comparison Dashboard - uses ClickHouse
- Environment Comparison Dashboard (PostgreSQL) - uses PostgreSQL
- Table Inventory Comparison Dashboard - uses ClickHouse
- Table Inventory Comparison Dashboard (PostgreSQL) - uses PostgreSQL
The Grafana service is included in docker-compose.yml and pre-configured with the ClickHouse datasource.
- Start all services:
  ```bash
  docker-compose up -d
  ```
- Open Grafana:
  - URL: http://localhost:3000
  - User: `admin`
  - Password: `admin`
- Select a Dashboard:
  - Go to the Dashboards menu
  - Choose Main Dashboard, Environment Comparison Dashboard, or Table Inventory Comparison Dashboard
  - Select Application, Environment(s), and Table from the dropdowns
This project is licensed under the Apache License 2.0.
Copyright 2024-2026 DataProfiler Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Pull requests are welcome! For major changes, please open an issue first to discuss.



