Source code of the BClean+ data cleaning system.
-
BClean file: The main file of the cleaning system. It receives user-defined parameters, defines the core functions (structure generation, parameter estimation, and inference), and calls the core classes of the other modules.
-
Analysis file: Evaluates the precision, recall, and running time of the cleaning results.
-
src folder: Includes the User Constraint (UC) class, the Bayesian Network Structure class, Compensation Classification, the Inference Strategy class, etc.
-
example folder: Contains BClean+ workflow code for each dataset.
-
dataset folder: Stores the test datasets, including the py files that add noise to the real datasets.
-
PClean-master folder: Stores the source code of the PClean methods and the corresponding papers.
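The precision and recall reported by the Analysis file can be illustrated with a minimal sketch. It assumes the definitions common in data cleaning work (precision = correct repairs / cells changed, recall = correct repairs / erroneous cells); the Analysis file itself may compute them differently. The function name evaluate_repairs and the toy tables are hypothetical and not part of the repository.

```python
import time


def evaluate_repairs(dirty, cleaned, truth):
    """Cell-level precision and recall for a repaired table.

    Each table is a list of rows; each row is a list of cell values.
    Assumed definitions (not taken from the BClean+ code):
      precision = correct repairs / cells changed by the cleaner
      recall    = correct repairs / cells that were erroneous
    """
    changed = correct = errors = 0
    for d_row, c_row, t_row in zip(dirty, cleaned, truth):
        for d, c, t in zip(d_row, c_row, t_row):
            if d != t:
                errors += 1       # cell is wrong in the dirty data
            if c != d:
                changed += 1      # cleaner modified this cell
                if c == t:
                    correct += 1  # modification matches the ground truth
    precision = correct / changed if changed else 0.0
    recall = correct / errors if errors else 0.0
    return precision, recall


# Toy data: two erroneous cells, one repaired correctly, one repaired wrongly.
dirty = [["alabama", "35233"], ["alabma", "35233"], ["alabama", "3523x"]]
truth = [["alabama", "35233"], ["alabama", "35233"], ["alabama", "35233"]]
cleaned = [["alabama", "35233"], ["alabama", "35233"], ["alabama", "35230"]]

start = time.perf_counter()
precision, recall = evaluate_repairs(dirty, cleaned, truth)
elapsed = time.perf_counter() - start
print(f"precision={precision:.2f} recall={recall:.2f} time={elapsed:.4f}s")
```

On the toy data both precision and recall come out as 0.50; the running time is measured the same way around the real cleaning call.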
-
Configure the LLM required for semantic type identification, such as flan-t5-xxl.
-
Run semantic type recognition and UC code generation:
python generate_UC.py --model_name="flan-t5-xxl-zs" --input_files="./dataset/hospital/hospital_dirty.csv" \
    --save_path="./UC_json/hospital_schema.json" --input_labels="skip-eval-return" --label_set="custom" \
    --custom-labels Identifier Code PersonName OrganizationName LocationName ObjectName Address DateTime Age Count Measurement Money Percentage Score Category Text LanguageName Email PhoneNumber URL Boolean Year \
    --response --generate_schema --summary_csv_path="summary_sample.csv"
summary_sample.csv contains a few patterns extracted from our pattern library, for demonstration.
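Because the command is long, it can be convenient to assemble it from Python when running several datasets. The sketch below only rebuilds the argument list shown in the command; the helper name build_generate_uc_command is hypothetical, and the flags are copied verbatim from the command rather than checked against generate_UC.py.

```python
import shlex

# Label set copied from the --custom-labels argument of the command.
LABELS = (
    "Identifier Code PersonName OrganizationName LocationName ObjectName "
    "Address DateTime Age Count Measurement Money Percentage Score Category "
    "Text LanguageName Email PhoneNumber URL Boolean Year"
).split()


def build_generate_uc_command(input_file, save_path,
                              model_name="flan-t5-xxl-zs",
                              summary_csv_path="summary_sample.csv"):
    """Return the generate_UC.py invocation as an argument list."""
    return [
        "python", "generate_UC.py",
        f"--model_name={model_name}",
        f"--input_files={input_file}",
        f"--save_path={save_path}",
        "--input_labels=skip-eval-return",
        "--label_set=custom",
        "--custom-labels", *LABELS,
        "--response",
        "--generate_schema",
        f"--summary_csv_path={summary_csv_path}",
    ]


cmd = build_generate_uc_command(
    "./dataset/hospital/hospital_dirty.csv",
    "./UC_json/hospital_schema.json",
)
print(shlex.join(cmd))
# To execute it, call subprocess.run(cmd, check=True) from the directory
# that contains generate_UC.py.
```

Passing the command as a list avoids shell quoting problems with the file paths.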
-
Users can manually edit the patterns in the generated JSON file to better match the data.
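When editing a pattern, it helps to check the new regular expression against a few known-good values before saving. The sketch below is illustrative only: the JSON layout (one entry per attribute with "type" and "pattern" keys), the attribute name ZipCode, and the helper set_pattern are assumptions, so adapt the key names to the file that generate_UC.py actually writes.

```python
import json
import re
import tempfile
from pathlib import Path


def set_pattern(schema_path, attribute, new_pattern, samples):
    """Replace one attribute's pattern after checking it on sample values."""
    regex = re.compile(new_pattern)  # raises re.error if the pattern is invalid
    rejected = [s for s in samples if not regex.fullmatch(s)]
    if rejected:
        raise ValueError(f"pattern rejects sample values: {rejected}")
    path = Path(schema_path)
    schema = json.loads(path.read_text(encoding="utf-8"))
    schema[attribute]["pattern"] = new_pattern  # hypothetical key layout
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema


# Demo on a throwaway file; the layout below is assumed, not the real schema.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "hospital_schema.json"
    demo_path.write_text(
        json.dumps({"ZipCode": {"type": "Code", "pattern": ".*"}}),
        encoding="utf-8",
    )
    updated = set_pattern(demo_path, "ZipCode", r"\d{5}", ["35233", "36302"])
    print(updated["ZipCode"]["pattern"])
```

The check runs before the file is touched, so a pattern that rejects valid values never overwrites the existing one.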
-
Get the user constraints for the data:
uc = UC(dirty_data)
uc.get_uc()
- Install dependencies (optionally inside a conda virtual environment):
pip install -r requirements.txt
- Run the cleaning (optionally inside a conda virtual environment):
Taking hospital as an example, run the command: python ./example/hospital.py
You can modify parameters such as infer_strategy and model_choice in ./example/hospital.py.
To save the model's pkl file, set model_save_path in ./example/hospital.py to the desired path.
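The saved pkl file can be reloaded later for inspection. The sketch below assumes the model is serialized with Python's standard pickle module, which the pkl extension suggests but which is not confirmed here; the helpers save_model and load_model and the dictionary standing in for a fitted model are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, model_save_path):
    """Serialize a model object to model_save_path."""
    with open(model_save_path, "wb") as f:
        pickle.dump(model, f)


def load_model(model_save_path):
    """Load a model object back from disk.

    Only unpickle files you created yourself: pickle can execute
    arbitrary code while loading.
    """
    with open(model_save_path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted BClean+ model (hypothetical contents).
model = {"structure": [("City", "ZipCode")], "infer_strategy": "placeholder"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "hospital_model.pkl"
    save_model(model, path)
    restored = load_model(path)

print(restored == model)
```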
-
Run BClean+ and obtain the pkl file of the corresponding model.
-
Run the code generation script:
python generate_julia_run+loaddata.py
-
Generate the PPL code interactively, drawing on the user's understanding of the dataset.
-
Integrate the generated "run.jl" and "load.jl" files into the PClean framework and run the PClean data cleaning system.