Whistler is an open source data quality and profiling tool. Along with profiling it also supports running custom constraints on the data. With underlining support of Apache Spark execution engine Whistler can be extended to data in magnitudes of GB's, TB's.
pip install dq-whistler
# Sample Data
Age,Description
1,"abc"
2,"abc1"
3,
4,"abc4"
10,"xyz"
12,"null"
17,"abc"
20,"abc3"
23,
# You can read data from all the supported sources as per Apache Spark module
df = spark.read.option("header", "true").csv("<your path>")
config = [
{
"name": "Age",
"datatype": "number",
"constraints":[
{
"name": "gt_eq",
"values": 5
},
{
"name": "is_in",
"values": [1, 23]
}
]
},
{
"name": "Description",
"datatype": "string",
"constraints":[
{
"name": "regex",
"values": "([A-Za-z]+)"
},
{
"name": "contains",
"values": "abc"
}
]
}
]
from dq_whistler import DataQualityAnalyzer
output = DataQualityAnalyzer(df, config).analyze()
print(output)
[
{
"col_name": "Age",
"total_count": 9,
"null_count": 0,
"unique_count": 9,
"topn_values": {
"1": 1,
"2": 1,
"3": 1,
"4": 1,
"10": 1,
"12": 1,
"17": 1,
"20": 1,
"23": 1
},
"min": 1,
"max": 23,
"mean": 10.222222222222221,
"stddev": 8.303279138054101,
"quality_score": 0,
"constraints": [
{
"name": "gt_eq",
"values": 5,
"constraint_status": "failed",
"invalid_count": 4,
"invalid_values": [
"1",
"2",
"3",
"4"
]
},
{
"name": "is_in",
"values": [
1,
23
],
"constraint_status": "failed",
"invalid_count": 7,
"invalid_values": [
"2",
"3",
"4",
"10",
"12",
"17",
"20"
]
}
]
},
{
"col_name": "Description",
"total_count": 9,
"null_count": 2,
"unique_count": 7,
"topn_values": {
"abc": 2,
"abc1": 1,
"xyz": 1,
"abc4": 1,
"abc3": 1
},
"quality_score": 0,
"constraints": [
{
"name": "regex",
"values": "([A-Za-z]+)",
"constraint_status": "success",
"invalid_count": 0,
"invalid_values": []
},
{
"name": "contains",
"values": "abc",
"constraint_status": "failed",
"invalid_count": 2,
"invalid_values": [
"xyz",
"null"
]
}
]
}
]
The list below contains the functionality that contributors are planning to develop for this module
- Visualization
- Visualization of profiling output