CFC

Code and datasets for our ACL 2022 paper: Contextual Fine-to-Coarse Distillation for Coarse-grained Response Selection in Open-Domain Conversations.

News

Reddit and Twitter datasets released! (2022.06.20)

TODO

The code will be released in the near future.

Dataset

We created two datasets for contextual matching in the paper, Reddit and Twitter. Their statistics and a description of each file are shown in the tables below.

Dataset	Train set	Dev set	Test set (MC)	Test set (SC)	Database
Reddit	300K	20K	20K	20K	10M
Twitter	20K	2K	2K	-	1M

File	Role	Explanation
database.json	Database	Each instance contains three fields, where `ctx` represents the context, `rsp` represents the response, and `rid` represents the ID of the response.
train.json	Training set	Each instance contains a response and a list of contexts corresponding to that response.
dev.json	Dev set	Same format as the training set.
test_mc.json	MC test set	Same format as the database. Each response in the MC test set has multiple contexts, which ensures that other contexts corresponding to this response exist in the database.
test_sc.json	SC test set	Same format as the database. Each response in the SC test set has only one context, i.e., no context in the database corresponds exactly to the response.
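
To make the MC/SC distinction concrete, here is a minimal sketch that groups database-style records by `rid` and separates responses with several contexts from those with exactly one. Only the field names `ctx`, `rsp`, and `rid` come from the table above. The helper names (`load_instances`, `contexts_per_response`, `split_mc_sc`), the toy records, and the guess that a file is either a JSON array or JSON Lines are assumptions for illustration, and this is not necessarily how build_testset.py selects test instances. Whether `ctx` is a single string or a list of turns in the real files is also not stated; the toy data uses plain strings.

```python
import json
from collections import defaultdict


def load_instances(path):
    """Load instances from a file that is either a JSON array or JSON Lines.

    The on-disk layout is an assumption; adjust this if the real files differ.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def contexts_per_response(instances):
    """Group contexts by response ID using the documented `rid` and `ctx` fields."""
    grouped = defaultdict(list)
    for inst in instances:
        grouped[inst["rid"]].append(inst["ctx"])
    return dict(grouped)


def split_mc_sc(instances):
    """Return (multi-context rids, single-context rids), each sorted."""
    grouped = contexts_per_response(instances)
    mc = sorted(rid for rid, ctxs in grouped.items() if len(ctxs) > 1)
    sc = sorted(rid for rid, ctxs in grouped.items() if len(ctxs) == 1)
    return mc, sc


# Toy records with invented text; only the field names come from the table above.
toy = [
    {"ctx": "how are you?", "rsp": "fine, thanks", "rid": 0},
    {"ctx": "you doing ok?", "rsp": "fine, thanks", "rid": 0},
    {"ctx": "what time is it?", "rsp": "almost noon", "rid": 1},
]
mc_ids, sc_ids = split_mc_sc(toy)
print(mc_ids, sc_ids)  # prints: [0] [1]
```

On real data, `split_mc_sc(load_instances("database.json"))` would give a rough count of how many responses fall into each group, provided the file layout matches one of the two assumed formats.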

Build Dataset

Because of copyright considerations, we do not distribute the data directly. Instead, we provide scripts that build it from the raw sources.

To download the raw Reddit dataset (train.tsv), follow the instructions at https://github.com/microsoft/DialoGPT and run python demo.py --data full. The raw Twitter dataset is available at https://github.com/Marsan-Ma-zz/chat_corpus.

Build context-response pairs:

python build_data.py

Build the training set:

python build_trainset.py

Build the test set:

python build_testset.py
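
The three steps above can also be chained from a small driver, sketched below. It assumes the scripts take no command-line arguments, that the raw data is already where each script expects it, and that the driver is run from the repository root. The names `BUILD_STEPS` and `run_build` are hypothetical and not part of this repository.

```python
import subprocess
import sys

# Order taken from the steps above; the scripts are assumed to take no arguments.
BUILD_STEPS = ["build_data.py", "build_trainset.py", "build_testset.py"]


def run_build(steps=BUILD_STEPS, runner=subprocess.run):
    """Run each build script in order, stopping at the first failure."""
    for script in steps:
        print(f"running {script}")
        # check=True makes subprocess.run raise CalledProcessError
        # when a script exits with a non-zero status.
        runner([sys.executable, script], check=True)
```

Calling `run_build()` runs the scripts with the same interpreter that runs the driver. The `runner` parameter exists only so that the sequence can be exercised without launching the real scripts.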

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@inproceedings{chen-etal-2022-contextual,
    title = "Contextual Fine-to-Coarse Distillation for Coarse-grained Response Selection in Open-Domain Conversations",
    author = "Chen, Wei and Gong, Yeyun and Xu, Can and Hu, Huang and Yao, Bolun and Wei, Zhongyu and Fan, Zhihao and Hu, Xiaowu and Zhou, Bartuer and Cheng, Biao and Jiang, Daxin and Duan, Nan",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.334",
    doi = "10.18653/v1/2022.acl-long.334",
    pages = "4865--4877",
    abstract = "We study the problem of coarse-grained response selection in retrieval-based dialogue systems. The problem is equally important with fine-grained response selection, but is less explored in existing literature. In this paper, we propose a Contextual Fine-to-Coarse (CFC) distilled model for coarse-grained response selection in open-domain conversations. In our CFC model, dense representations of query, candidate contexts and responses is learned based on the multi-tower architecture using contextual matching, and richer knowledge learned from the one-tower architecture (fine-grained) is distilled into the multi-tower architecture (coarse-grained) to enhance the performance of the retriever. To evaluate the performance of the proposed model, we construct two new datasets based on the Reddit comments dump and Twitter corpus. Extensive experimental results on the two datasets show that the proposed method achieves huge improvement over all evaluation metrics compared with traditional baseline methods.",
}
