philo-vanguard/BCleanplus

 
 


bayesclean

Introduction

Source code of the BClean+ data cleaning system.

Module Introduction

  1. BClean file: The main file of the cleaning system. It receives user-defined parameters, defines the core functions (structure generation, parameter estimation, and inference), and calls the core classes of the other modules.

  2. Analysis file: Evaluates the precision, recall, and running time of the cleaning results.

  3. src folder: Includes the User Constraint (UC) class, the Bayesian Network Structure class, Compensation Classification, the Inference Strategy class, etc.

  4. example folder: Contains BClean+ workflow code for each dataset.

  5. dataset folder: Stores the test datasets, along with the py files used to add noise to the real datasets.

  6. PClean-master folder: Stores the source code of the PClean method and the related papers.
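
To illustrate the kind of metrics the Analysis file reports, here is a minimal sketch of cell-level precision and recall. It assumes a common definition from the data cleaning literature (precision = correct repairs / cells changed, recall = correct repairs / erroneous cells); the function name `evaluate_repairs` and the toy tables are hypothetical, and the repository's own evaluation code may differ.

```python
def evaluate_repairs(dirty, cleaned, truth):
    """Cell-level precision and recall for a repaired table.

    Each argument is a list of rows (lists of cell values) with identical shape.
    This is an assumed, commonly used definition, not the repository's code.
    """
    changed = errors = correct = 0
    for d_row, c_row, t_row in zip(dirty, cleaned, truth):
        for d, c, t in zip(d_row, c_row, t_row):
            if c != d:            # the cleaner modified this cell
                changed += 1
                if c == t:        # and the new value matches the ground truth
                    correct += 1
            if d != t:            # the cell was erroneous to begin with
                errors += 1
    precision = correct / changed if changed else 0.0
    recall = correct / errors if errors else 0.0
    return precision, recall


# Toy data: two injected errors, one repaired correctly, one clean cell damaged.
dirty = [["a", "x"], ["b", "y"], ["c", "z"]]
truth = [["a", "1"], ["b", "2"], ["c", "z"]]
cleaned = [["a", "1"], ["b", "y"], ["q", "z"]]

precision, recall = evaluate_repairs(dirty, cleaned, truth)
print(precision, recall)
```

Running time can be measured separately, for example by wrapping the cleaning call with `time.perf_counter()`.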

User Constraint (UC) Automatic Generation Method

  1. Configure the LLM required for semantic type identification, such as flan-t5-xxl.

  2. Run semantic type recognition and UC code generation:

python generate_UC.py --model_name="flan-t5-xxl-zs" --input_files="./dataset/hospital/hospital_dirty.csv" \
--save_path="./UC_json/hospital_schema.json" --input_labels="skip-eval-return" --label_set="custom" \
--custom-labels Identifier Code PersonName OrganizationName LocationName ObjectName Address DateTime Age Count Measurement Money Percentage Score Category Text LanguageName Email PhoneNumber URL Boolean Year \
--response --generate_schema --summary_csv_path="summary_sample.csv"

summary_sample.csv contains a few patterns extracted from our pattern library for demonstration.

  3. Users can manually change the pattern in the generated JSON file to better match the data.

  4. Get the user constraints for the data:

uc = UC(dirty_data)
uc.get_uc()
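
As a rough illustration of what a pattern-based user constraint does, the sketch below checks cell values against per-column regular expressions loaded from JSON. The JSON layout, the `allow_null` key, the column names, and the `find_violations` helper are all assumptions made for this example; the schema actually written by generate_UC.py and the behaviour of the UC class may differ.

```python
import json
import re

# Hypothetical schema: column name -> constraint. The real JSON layout
# produced by generate_UC.py is not shown here and may be different.
schema_json = """
{
  "ZipCode": {"pattern": "^[0-9]{5}$", "allow_null": false},
  "State": {"pattern": "^[a-z]{2}$", "allow_null": false}
}
"""


def find_violations(rows, schema):
    """Return (row_index, column, value) for every cell that breaks its constraint."""
    violations = []
    for i, row in enumerate(rows):
        for column, rule in schema.items():
            value = row.get(column)
            if value is None or value == "":
                # Empty cells only violate the constraint when nulls are disallowed.
                if not rule.get("allow_null", True):
                    violations.append((i, column, value))
                continue
            if not re.fullmatch(rule["pattern"], value):
                violations.append((i, column, value))
    return violations


schema = json.loads(schema_json)
rows = [
    {"ZipCode": "35233", "State": "al"},
    {"ZipCode": "3523x", "State": "al"},
    {"ZipCode": "", "State": "alabama"},
]
violations = find_violations(rows, schema)
for v in violations:
    print(v)
```

Tightening or loosening a pattern in the JSON file changes which cells are flagged, which is why manual adjustment of the generated patterns can matter.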

BClean+ Data Cleaning System User Guide

  1. Install the dependencies (optionally inside a conda virtual environment):

pip install -r requirements.txt

  2. Run the cleaning (optionally inside a conda virtual environment):

Taking hospital as an example, run the command: python ./example/hospital.py

You can modify parameters such as infer_strategy and model_choice in ./example/hospital.py.

To save the model's pkl file, set model_save_path in ./example/hospital.py to the desired output path.
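
For readers unfamiliar with pkl files, the following sketch shows a generic save and reload round trip with Python's standard `pickle` module. The `model` dictionary is only a stand-in (the object BClean+ actually serializes is not described here), and the temporary path is chosen purely for the example.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; the real object saved by BClean+ is not shown here.
model = {"structure": [("City", "ZipCode")], "note": "placeholder model"}

# Example location only; in practice this would be the value given to model_save_path.
model_save_path = os.path.join(tempfile.mkdtemp(), "hospital_model.pkl")

# Serialize the object to disk.
with open(model_save_path, "wb") as f:
    pickle.dump(model, f)

# Reload it later, for example before generating the PClean code.
with open(model_save_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.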

Automatic Generation of the PPL Code Required by PClean

  1. Run BClean+ and obtain the pkl file of the corresponding model.

  2. Run the code generation script:

python generate_julia_run+loaddata.py

  3. Generate the PPL code interactively, drawing on the user's understanding of the dataset.

  4. Integrate the generated "run.jl" and "load.jl" files into the PClean framework and run the PClean data cleaning system.

About

source code for BClean+ [TKDE'26]

