utterance-rewriting

This repository releases the code and data for utterance rewriting in open-domain dialogues. More details on the data and code can be found in this paper:

@article{jin2022improving,
  title={Improving Bot Response Contradiction Detection via Utterance Rewriting},
  author={Jin, Di and Liu, Sijia and Liu, Yang and Hakkani-Tur, Dilek},
  journal={arXiv preprint arXiv:2207.11862},
  year={2022}
}

Where to use the code?

This code implements an utterance rewriting model that can rewrite an utterance into a complete form by resolving co-references and ellipsis. For example, if we have a dialogue like:

User: Hello! I heard you liked to go for walks with your Corgi. I don't know much about them. What can you tell me?
System: Well Corgis are also called Welsh Corgis, for one thing.
User: Why is that?

Here we would like to rewrite the last utterance to resolve the anaphor "that". Using this rewriting model, the resulting rewritten utterance would be:

Why are Corgis also called Welsh Corgis?
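
If you prefer to call a checkpoint directly instead of going through the provided scripts, the released models can presumably be loaded with HuggingFace Transformers, since they are T5 checkpoints. The snippet below is a minimal sketch, not the repository's official API: the checkpoint path "models/redo-t5-large" is an assumption based on the inference command further down, and the input string follows the special-token format described under "How to perform inference?".

# Minimal sketch (assumptions noted above): load a released checkpoint as a
# standard T5 seq2seq model and rewrite the last utterance of a dialogue.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_dir = "models/redo-t5-large"  # assumed local checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

# The dialogue is flattened into one string: <USR> before user turns,
# <SYS> before system turns, <CUR> before the utterance to rewrite.
dialogue = (
    "<USR> Hello! I heard you liked to go for walks with your Corgi. "
    "I don't know much about them. What can you tell me? "
    "<SYS> Well Corgis are also called Welsh Corgis, for one thing. "
    "<CUR> Why is that?"
)

inputs = tokenizer(dialogue, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Expected, per the example above: "Why are Corgis also called Welsh Corgis?"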

How to perform inference?

  1. Install the required packages by running:
pip install -r requirements.txt
  2. Create a folder for the model parameters by running:
mkdir -p models

Then download the trained model parameters from this Google Drive and put them in the "models" folder. We have released two model versions: t5-large and t5-base.

  3. Run the following command to perform rewriting; the result file will be written to the same folder:
sh ./eval.sh --eval_set data-samples --model redo-t5-large --gpu 0

Here the argument "eval_set" refers to the data you would like to rewrite. A sample data file "data-samples.json" has been provided. In this file, each sample to be rewritten is a dictionary on its own line, with two important keys: "utterances" and "reference". "utterances" is the concatenation of all utterances, including the context history and the last utterance to be rewritten. Three special tokens mark the speakers: "<USR>" is placed before each user utterance, "<SYS>" before each system utterance, and "<CUR>" before the last utterance, which will be rewritten. "reference" holds the human-written reference sentence used for automatic evaluation; it can be set to any placeholder if you do not have the ground truth.
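
For example, a small script along the following lines can build an input file in this format. The helper name and output file name are illustrative; only the "utterances" and "reference" keys and the three special tokens come from the description above.

import json

# Build one JSON dictionary per line in the format described above:
# "utterances" concatenates the dialogue history with the special tokens,
# "reference" holds the gold rewrite (or a placeholder if none exists).
def make_sample(history, last_utterance, reference="none"):
    """history: list of (speaker, text) pairs, speaker in {"user", "system"}."""
    marker = {"user": "<USR>", "system": "<SYS>"}
    parts = [f"{marker[speaker]} {text}" for speaker, text in history]
    parts.append(f"<CUR> {last_utterance}")
    return {"utterances": " ".join(parts), "reference": reference}

samples = [
    make_sample(
        history=[
            ("user", "Hello! I heard you liked to go for walks with your "
                     "Corgi. I don't know much about them. What can you tell me?"),
            ("system", "Well Corgis are also called Welsh Corgis, for one thing."),
        ],
        last_utterance="Why is that?",
        reference="Why are Corgis also called Welsh Corgis?",
    )
]

with open("my-data.json", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")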

How to train the model?

  1. First download the data from this Google Drive and put it in the "data" folder.

  2. Run the following command to perform training:

sh ./train.sh --model_dir MODEL_DIR --model_name MODEL_NAME --gpu 0

Here "MODEL_DIR" refers to the directory that contains the model parameters used for model initialization. "MODEL_NAME" is the model name we would like to fine-tune, e.g. t5-large.

Rewritten DECODE Dataset

As mentioned in the paper, we have shown a use case of this rewriting model on the contradiction detection dataset, DECODE, where it improves detection performance by several points. Here we provide the rewritten DECODE dataset in this Google Drive. Note that we have also included the updated test sets mentioned in the paper.
