Python implementation of AVH and the baselines reported in the paper "Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines". Follow the steps below to reproduce the results.
Please contact Dezhan Tu (dztu AT g.ucla.edu) and Yeye He (yeyehe AT microsoft.com) for questions or feedback.
- Ubuntu 18.04, Anaconda 3.5+
- Tested on Python 3.8.0
- Download and install arrayfire, e.g. pip install arrayfire-3.8.0-cp38-cp38-linux_x86_64.whl
- All other required Python packages can be installed from the provided requirements.txt (run pip install -r requirements.txt)
Jupyter Notebook that shows and reproduces results reported in the paper
- The notebook visualization.ipynb shows the main comparison results in our paper
- The notebook sensitivity.ipynb shows all sensitivity and ablation results in our paper
Run AVH from the beginning to reproduce results in the paper
- Run python avh_with_stationary.py
- Run python avh_no_stationary.py
- AVH results will be stored in the ./result folder (consumed by the Jupyter notebooks above)
- dists.py: computes the statistical distance between two samples; the main entry point is comp_dist(sample_p, sample_q, dtype='numeric')
- preprocessing.py: loads and preprocesses raw data
- utils.py: provides common utility tools used by most models
- gene_sample.py: generates synthetic data based on our proposed synthesis rules; the main entry point is GenSample.gen_save_sample(dir_name)
- stationary_checking: processes time series, as part of running AVH with stationary checking
- global_efficiency.py: measures the runtime of single-distribution and single+two-distribution AVH
- avh_no_stationary.py: Auto-Validate-by-History, version without stationary checking
- avh_with_stationary.py: Auto-Validate-by-History, version with stationary checking
- azure_drift_detector.py: Azure Drift Detector
- azureAD.py: Azure Anomaly Detection
- deequ.py: Amazon Deequ Test
- hypothesis_test.py: Hypothesis test
- mad.py: MAD Test
- ml_baseline_single_var.py: extracts single-variable features and evaluates supervised ML baselines and unsupervised anomaly detection baselines
- tfdv.py: Google TFDV Test
- DQ_test.py: Avg-KNN, using the original features from the paper "Automating Data Quality Validation for Dynamic Data Ingestion"
- DQ_test_all.py: Avg-KNN, tested on the AVH features (baseline from the paper "Automating Data Quality Validation for Dynamic Data Ingestion")
- fast_rule.py: Fast rule mining in ontological knowledge bases with AMIE+
- robust_discovery.py: Robust discovery of positive and negative rules in knowledge bases
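
To illustrate what "statistical distance between two samples" can mean for numeric data, here is a minimal sketch of one such distance, the two-sample Kolmogorov-Smirnov statistic. It is not the code in dists.py and does not reproduce comp_dist; the function name ks_distance and the example values are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right
from typing import Sequence


def ks_distance(sample_p: Sequence[float], sample_q: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic (hypothetical helper, not from the repo).

    Returns the largest absolute gap between the two empirical CDFs,
    a value in [0, 1] where 0 means the empirical distributions coincide.
    """
    if not sample_p or not sample_q:
        raise ValueError("both samples must be non-empty")
    p_sorted = sorted(sample_p)
    q_sorted = sorted(sample_q)
    n_p, n_q = len(p_sorted), len(q_sorted)
    max_gap = 0.0
    # Empirical CDFs only change at observed values, so checking those is enough.
    for x in set(p_sorted) | set(q_sorted):
        cdf_p = bisect_right(p_sorted, x) / n_p  # fraction of sample_p <= x
        cdf_q = bisect_right(q_sorted, x) / n_q  # fraction of sample_q <= x
        max_gap = max(max_gap, abs(cdf_p - cdf_q))
    return max_gap


# Made-up column values from two runs of a recurring pipeline.
yesterday = [10.0, 11.0, 12.0, 13.0]
today_ok = [10.0, 11.0, 12.0, 13.0]
today_shifted = [110.0, 111.0, 112.0, 113.0]

same = ks_distance(yesterday, today_ok)
shifted = ks_distance(yesterday, today_shifted)
print(same, shifted)
```

A distance near 0 suggests the new batch resembles the historical one, while a distance near 1 suggests a shift in the column's distribution. How AVH turns such distances into validation constraints is described in the paper, not in this sketch.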