This repository contains the MultiTurnCleanup dataset, introduced in our paper accepted at the EMNLP 2023 main conference.
MultiTurnCleanup is a targeted dataset for cleaning up multi-turn spoken conversational transcripts. Detailed experiments and analyses can be found in our paper. This work was done while the first author was a research intern at Google AI Research.
If you find this repo helpful to your research, please cite the paper:
```bibtex
@article{shen2023multiturncleanup,
  title={MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup},
  author={Shen, Hua and Zayats, Vicky and Rocholl, Johann C and Walker, Daniel D and Padfield, Dirk},
  journal={arXiv preprint arXiv:2305.12029},
  year={2023}
}
```
MultiTurnCleanup consists of ~143k multi-turn cleanup labels with the following train/dev/test splits:
| Split | #Conv | #Turns | #Tokens | #Cleanup |
|---|---|---|---|---|
| Train | 932 | 74k | 1M | 132k |
| Dev | 86 | 3.7k | 60k | 6.1k |
| Test | 64 | 2.9k | 43k | 5k |
| Total | 1082 | 81k | 1.1M | 143k |
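As a quick sanity check, the per-split counts above can be tallied programmatically. This is a minimal sketch using the rounded values from the table, so the computed totals match the Total row only up to rounding:

```python
# Rounded per-split statistics from the table above:
# (#Conv, #Turns, #Tokens, #Cleanup)
splits = {
    "Train": (932, 74_000, 1_000_000, 132_000),
    "Dev":   (86,  3_700,  60_000,    6_100),
    "Test":  (64,  2_900,  43_000,    5_000),
}

# Sum each column across the three splits.
totals = [sum(col) for col in zip(*splits.values())]
print(totals)  # conversations, turns, tokens, cleanup labels
```

Running this yields 1082 conversations, ~81k turns, ~1.1M tokens, and ~143k cleanup labels, consistent with the Total row.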
The MultiTurnCleanup dataset is licensed under LDC.
For questions or feedback, please create an issue in this repository or contact the authors directly.