Code and created datasets for our ACL 2022 paper: Contextual Fine-to-Coarse Distillation for Coarse-grained Response Selection in Open-Domain Conversations.
- Reddit and Twitter datasets released! (2022.06.20)
- The code will be released in the near future.
We created two datasets for contextual matching in the paper, Reddit and Twitter. Their statistics and the contents of each file are shown in the tables below.

Dataset | Train set | Dev set | Test set (MC) | Test set (SC) | Database |
---|---|---|---|---|---|
Reddit | 300K | 20K | 20K | 20K | 10M |
Twitter | 20K | 2K | 2K | - | 1M |

File | Role | Explanation |
---|---|---|
database.json | Database | Each instance contains three fields, where `ctx` represents the context, `rsp` represents the response, and `rid` represents the ID of the response. |
train.json | Trainset | Each instance contains a response and a context list corresponding to the response. |
dev.json | Devset | Same as training set. |
test_mc.json | MC testset | Same format as the database. Each response in the MC test set has multiple contexts, which ensures that other contexts corresponding to this response exist in the database. |
test_sc.json | SC testset | Same format as the database. Each response in the SC test set has only one context, i.e., no context in the database corresponds exactly to the response. |
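
As a rough illustration of the `ctx` / `rsp` / `rid` fields and of the MC versus SC distinction, the sketch below loads instances and counts how many contexts share each `rid`. It rests on assumptions not confirmed by this README: the on-disk layout (a single JSON array or JSON Lines, so the loader tries both), the type of `ctx`, and that instances with the same `rid` refer to the same response. The function names `load_instances`, `contexts_per_response`, and `split_mc_sc` are hypothetical and are not part of the released scripts.

```python
import json
from collections import defaultdict


def load_instances(path):
    """Load instances from a file holding one JSON array or JSON Lines.

    The actual file layout is an assumption; both common layouts are tried.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if not text:
        return []
    try:
        data = json.loads(text)
        # A lone JSON object is treated as a one-instance file.
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to one JSON object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


def contexts_per_response(instances):
    """Group contexts by response ID (`rid`)."""
    grouped = defaultdict(list)
    for inst in instances:
        grouped[inst["rid"]].append(inst["ctx"])
    return grouped


def split_mc_sc(instances):
    """Return (multi-context rids, single-context rids) for the given instances."""
    grouped = contexts_per_response(instances)
    mc = {rid for rid, ctxs in grouped.items() if len(ctxs) > 1}
    sc = {rid for rid, ctxs in grouped.items() if len(ctxs) == 1}
    return mc, sc
```

Calling `split_mc_sc(load_instances("database.json"))` would then give a quick sanity check on how many responses have several contexts, provided the layout assumptions above hold for the generated files.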
Because of copyright considerations, we provide scripts to build the data instead of distributing it directly.
To download the raw Reddit dataset `train.tsv`, follow the instructions at https://github.com/microsoft/DialoGPT and run `python demo.py --data full`. The raw Twitter dataset is available at https://github.com/Marsan-Ma-zz/chat_corpus.
- Build context-response pairs: `python build_data.py`
- Build the training set: `python build_trainset.py`
- Build the test set: `python build_testset.py`
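
The three build steps can also be chained from a small driver, sketched below. It assumes the scripts take no command-line arguments and are run from the repository root with the raw data already in place, none of which is confirmed here; `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
import sys

# Script names as listed in this README, in the order they are meant to run.
BUILD_STEPS = ["build_data.py", "build_trainset.py", "build_testset.py"]


def build_commands(python=sys.executable):
    """Return one argv list per build step (no extra arguments assumed)."""
    return [[python, script] for script in BUILD_STEPS]


def run_pipeline(dry_run=False):
    """Run each step in order; stop at the first step that fails."""
    for cmd in build_commands():
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(dry_run=True)` only prints the commands, which is a cheap way to confirm the order before starting the full build.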
If you extend or use this work, please cite the paper where it was introduced:
@inproceedings{chen-etal-2022-contextual,
title = "Contextual Fine-to-Coarse Distillation for Coarse-grained Response Selection in Open-Domain Conversations",
author = "Chen, Wei and Gong, Yeyun and Xu, Can and Hu, Huang and Yao, Bolun and Wei, Zhongyu and Fan, Zhihao and Hu, Xiaowu and Zhou, Bartuer and Cheng, Biao and Jiang, Daxin and Duan, Nan",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.334",
doi = "10.18653/v1/2022.acl-long.334",
pages = "4865--4877",
abstract = "We study the problem of coarse-grained response selection in retrieval-based dialogue systems. The problem is equally important with fine-grained response selection, but is less explored in existing literature. In this paper, we propose a Contextual Fine-to-Coarse (CFC) distilled model for coarse-grained response selection in open-domain conversations. In our CFC model, dense representations of query, candidate contexts and responses is learned based on the multi-tower architecture using contextual matching, and richer knowledge learned from the one-tower architecture (fine-grained) is distilled into the multi-tower architecture (coarse-grained) to enhance the performance of the retriever. To evaluate the performance of the proposed model, we construct two new datasets based on the Reddit comments dump and Twitter corpus. Extensive experimental results on the two datasets show that the proposed method achieves huge improvement over all evaluation metrics compared with traditional baseline methods.",
}