Minun

Minun is a model-agnostic framework that provides local explanations for entity matching applications. It consists of two main components: (1) the explainer, which generates counterfactual explanations for a given pair of entity entries and a black-box entity matching model; and (2) the evaluation framework, which trains a student model to evaluate the quality of the generated explanations.

Overall Process

In this work, we support explaining the entity matching model Ditto. The first step is therefore to train a Ditto model as the teacher model on the training set of the dataset. Minun then takes the test set of the dataset and makes a prediction on each record with the teacher model to generate the ground truth for the student model.
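The labeling step described above can be illustrated with a minimal sketch. The names `label_with_teacher` and `toy_teacher` are hypothetical and are not part of the Minun or Ditto code; the toy teacher only stands in for a trained Ditto model.

```python
from typing import Callable, Dict, List, Tuple

Entity = Dict[str, str]
Pair = Tuple[Entity, Entity]


def label_with_teacher(
    pairs: List[Pair],
    teacher_predict: Callable[[Entity, Entity], int],
) -> List[Tuple[Entity, Entity, int]]:
    """Attach the teacher's prediction to each pair as the student's label."""
    return [(left, right, teacher_predict(left, right)) for left, right in pairs]


def toy_teacher(left: Entity, right: Entity) -> int:
    """Hypothetical stand-in for a trained teacher: match on case-insensitive name."""
    return int(left.get("name", "").lower() == right.get("name", "").lower())


pairs = [
    ({"name": "Acme"}, {"name": "acme"}),
    ({"name": "Acme"}, {"name": "Zen"}),
]
labeled = label_with_teacher(pairs, toy_teacher)
```

The student model is then trained on these teacher-provided labels rather than on the original human annotations.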

How to use the code

Preparation

In this work, we use the datasets from the DeepMatcher project. Each dataset is a directory with the files tableA.csv, tableB.csv, train.csv, valid.csv and test.csv. We then train a teacher model using the Ditto framework as the black-box model to be explained. Unfortunately, the current version does not support other models as the teacher model. If you want to use one, please try modifying the _eval_model function in ditto_explainer.py and the related functions in ditto_helper.py to accommodate the teacher model you want to explain.
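Before training the teacher, it can help to check that a dataset directory has the expected layout. The following is a small sketch using only the file names listed above; the helper `missing_dataset_files` is a hypothetical name and is not part of the repository.

```python
from pathlib import Path
from typing import List

# File names taken from the dataset description above.
EXPECTED_FILES = ("tableA.csv", "tableB.csv", "train.csv", "valid.csv", "test.csv")


def missing_dataset_files(datadir: str) -> List[str]:
    """Return the expected files that are absent from the dataset directory."""
    root = Path(datadir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the directory contains all five files.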

Generate the explanations

The first step of the Minun pipeline is to run the run_explain.py file. It takes the teacher model and the test_student.csv, tableA.csv and tableB.csv files as input, generates the explanations, and formulates the training instances for the student model. As introduced in the paper, the training instances are built by adding extra instances derived from the explanations to the training set of the student model. This file generates the following datasets to be used by the student model:

  • explains.json (only if the dumpexp option is set): the explanation for each pair of entities in the original test set.
  • train.txt: the training set for the student model without explanations, in Ditto format.
  • train.txt.explain_inj / train.txt.explain_inj_bs: the training set for the student model with explanations generated by the greedy / binary search algorithm, respectively, in Ditto format.
  • test.txt: the test set for the student model, in Ditto format.
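For orientation, here is a rough sketch of what a line in Ditto format looks like. It assumes the serialization convention used by Ditto, in which each attribute is written as `COL <attribute> VAL <value>` and the two entities and the label are separated by tabs. The helper names are hypothetical, and the exact output of run_explain.py may differ in details.

```python
from typing import Dict


def serialize_entity(entity: Dict[str, str]) -> str:
    """Serialize one entity as a sequence of 'COL <attr> VAL <value>' tokens."""
    return " ".join(f"COL {attr} VAL {value}" for attr, value in entity.items())


def to_ditto_line(left: Dict[str, str], right: Dict[str, str], label: int) -> str:
    """Build one tab-separated line: left entity, right entity, label."""
    return f"{serialize_entity(left)}\t{serialize_entity(right)}\t{label}"


line = to_ditto_line({"name": "a b"}, {"name": "c"}, 1)
```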

The arguments of this file are as follows:

  • config: the path of the config file, config.json by default.
  • datadir: the path of the dataset directory.
  • modeldir: the path of the teacher model file.
  • gpu: whether to use the GPU for evaluation. If you are running on the server, you can ignore it because the default value is True.
  • dumpexp: whether to dump the generated explanations as a JSON file; set it to True to do so.
  • expmethod: the method used to enumerate the candidate explanations; the value can be greedy or binary.

The explanations can optionally be dumped into the JSON file explains.json. An explanation means replacing the content of attribute "key" with "value" in the left entity while keeping the right entity unchanged. For example, given the following explanation:

  {
    "phone": "404/876",
    "class": ""
  }

It means that replacing the content of the phone attribute with 404/876 and deleting all content of the class attribute in the left entity will flip the prediction.
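Applying such an explanation to an entity can be sketched as follows. The function `apply_explanation` and the attribute values of the left entity are made up for illustration; only the explanation itself comes from the example above.

```python
from typing import Dict


def apply_explanation(left: Dict[str, str], explanation: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the left entity with the explained attributes replaced."""
    changed = dict(left)          # leave the original entity untouched
    changed.update(explanation)   # an empty string deletes the attribute content
    return changed


# Hypothetical left entity; the values are illustrative only.
left_entity = {"name": "some restaurant", "phone": "000/000", "class": "12"}
explanation = {"phone": "404/876", "class": ""}

counterfactual = apply_explanation(left_entity, explanation)
```

The right entity stays as it is; the teacher model is expected to give the opposite prediction for the pair formed by `counterfactual` and the right entity.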

Train the student model

Finally, run the run_train_student.py file to train the student model on the training sets generated in the previous step, once with explanations and once without. The delta F1 score between the evaluation results of the two trained models is regarded as the evaluation result of the explanations.

The arguments of this file are as follows:

  • task: the task to run, as specified in the configuration file. Note that you need to set the paths based on the output of the previous step.
  • explain: whether to train the student model with explanations. True means with explanations, False means without.
  • run_id: the ID used to distinguish the results of multiple runs. It can be omitted when running only once.
  • batch_size: hyper-parameter, the batch size for training.
  • max_len: hyper-parameter, the maximum input length of the language model.
  • lr: hyper-parameter, the learning rate.
  • n_epoches: hyper-parameter, the number of epochs for training.
  • finetuning: whether to use fine-tuning in the training process.
  • save_model: whether to save the trained model to disk.
  • logdir: the path where the training log is saved.
  • lm: the language model to use; distilbert, bert and roberta are currently supported.

Note that we use a separate configuration file, student_configs.json, for training the student model. The tasks for training the student model with explanations generated by the greedy and binary search algorithms have the suffixes GR and BS in the configuration file, respectively. The delta F1 is calculated between the results of the original task, e.g. AB, and those of the task with explanations, e.g. AB_GR or AB_BS.
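The delta F1 computation can be sketched in plain Python as below. This is an illustration, not the repository's evaluation code: the function names are hypothetical, and the sign convention (score with explanations minus score without) is an assumption, since the text above does not fix the direction.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive (match) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def delta_f1(
    y_true: Sequence[int],
    pred_without: Sequence[int],
    pred_with: Sequence[int],
) -> float:
    """Assumed convention: F1 of the student trained with explanations
    (e.g. task AB_GR) minus F1 of the student trained without (e.g. task AB)."""
    return f1_score(y_true, pred_with) - f1_score(y_true, pred_without)
```

Under this convention, a positive value means the student trained with explanations scored higher on the test set.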

Reproducing Results

Due to randomness in some of the libraries involved, the results obtained by following the instructions above might not be exactly the same as those reported in the paper. If you want to reproduce the reported results, please refer to this instruction.

Citation

If you are using the dataset, please cite the following in your work:

@inproceedings{deem22minun,
  author    = {Jin Wang and
               Yuliang Li},
  title     = {Minun: Evaluating Counterfactual Explanations for Entity Matching},
  booktitle = {DEEM@SIGMOD},
  year      = {2022}
}

Disclosure

Embedded in, or bundled with, this product are open source software (OSS) components, datasets and other third party components identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms, which, when applicable, specifically limit any distribution. You may receive a copy of, distribute and/or modify any open source code for the OSS component under the terms of their respective licenses. In the event of conflicts between Megagon Labs, Inc. Recruit Co., Ltd., license conditions and the Open Source Software license conditions, the Open Source Software conditions shall prevail with respect to the Open Source Software portions of the software. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You are permitted to distribute derived datasets of data sets from known sources by including links to original dataset source in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon Labs, Inc. are governed by the respective third party’s license conditions. All OSS components and datasets are distributed WITHOUT ANY WARRANTY, without even implied warranty such as for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, and without any liability to or claim against any Megagon Labs, Inc. entity other than as explicitly documented in this README document. You agree to cease using any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon Labs, Inc., makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. 
If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.
