data-profiling

a set of scripts to pull meta data and data profiling metrics from relational database systems

The collection of scripts and SQL-code which can be tailored to collect specific information about tables and columns within databases. It facilitates the bulk and rapid collection of high level and common metadata and provides a great starting point for identifiying and inventory of your database objects.

The scripts are intended to be used within your own particular client user interface and assumes you are able to connect to the database from which you want to collect data. It is also up to you to dump the results out to whatever output meduim you want. Generally by saving the results as a spreadsheet, csv, or cutting and pasting the results from your user interface. This allows the greatest degree of flexibility.

This repository is not meant to provide very deep data profiling capabilities, other data commercial and open source analytic and data management tools scan do that much better.

We get this most of the meta data from the internal data dictionary of the database system. Some meta data can only be obtained by querying the the tables themselves. Since performance can often be a problem when profiling large tables, we don't try to do this in bulk for an entire database. This is where the more expensive and sophisticated tools are most useful.

Data profiling is a process

Data Profiling generally consists of a series of steps that dig deeper and deeper into the details of the data sets. The high level steps addressed by this repository include:

Inventory of databases (schemas)
Inventory and metadata of tables within the databases
Inventory and metadata of columns with the tables ( count distinct values, num nulls, min, max)
For each column we can then dig into the details ( i.e. frequency distribution or list of values for columns) and relationship to other tables/columns

Databases Platforms

Since each database platform has it's own specific data dictionary, each platform has it's owns set of scripts. The following databases have scripts in this repository:

* Oracle
* SQLServer
* MySQL
* Sybase IQ
* Netezza
* Hive
* AWS Redshift
* Snowflake
* AWS Synapse

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
DB2		DB2
Hive		Hive
MySQL		MySQL
Netezza		Netezza
Redshift		Redshift
SQLServer		SQLServer
Sybase		Sybase
oracle		oracle
snowflake		snowflake
synapse		synapse
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DB2

DB2

Hive

Hive

MySQL

MySQL

Netezza

Netezza

Redshift

Redshift

SQLServer

SQLServer

Sybase

Sybase

oracle

oracle

snowflake

snowflake

synapse

synapse

README.md

README.md

Repository files navigation

data-profiling

Data profiling is a process

Databases Platforms

About

Releases

Packages

Languages

psebenick/data-profiling

Folders and files

Latest commit

History

Repository files navigation

data-profiling

Data profiling is a process

Databases Platforms

About

Topics

Resources

Stars

Watchers

Forks

Languages