This repository contains the MultiTurnCleanup dataset, introduced in our paper accepted at the EMNLP 2023 main conference.
MultiTurnCleanup is a targeted dataset for cleaning up multi-turn spoken conversational transcripts. Detailed experiments and analyses can be found in our paper. This work was done while the first author was a research intern at Google AI Research.
If you find this repo helpful to your research, please cite the paper:
```bibtex
@article{shen2023multiturncleanup,
  title={MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup},
  author={Shen, Hua and Zayats, Vicky and Rocholl, Johann C and Walker, Daniel D and Padfield, Dirk},
  journal={arXiv preprint arXiv:2305.12029},
  year={2023}
}
```
MultiTurnCleanup consists of ~143k multi-turn cleanup labels with the following train/dev/test splits:
| Split | #Conv | #Turns | #Tokens | #Cleanup |
|---|---|---|---|---|
| Train | 932 | 74k | 1M | 132k |
| Dev | 86 | 3.7k | 60k | 6.1k |
| Test | 64 | 2.9k | 43k | 5k |
| Total | 1082 | 81k | 1.1M | 143k |
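As a quick sanity check, the per-split counts above can be tallied programmatically. This is a minimal sketch using the rounded values from the table, so the computed totals match the Total row only up to rounding:

```python
# Rounded per-split statistics from the table above:
# (#Conv, #Turns, #Tokens, #Cleanup)
splits = {
    "Train": (932, 74_000, 1_000_000, 132_000),
    "Dev":   (86,  3_700,  60_000,    6_100),
    "Test":  (64,  2_900,  43_000,    5_000),
}

# Sum each column across the three splits.
totals = [sum(col) for col in zip(*splits.values())]
print(totals)  # conversations, turns, tokens, cleanup labels
```

Running this yields 1082 conversations, ~81k turns, ~1.1M tokens, and ~143k cleanup labels, consistent with the Total row.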
The MultiTurnCleanup dataset is licensed under LDC.
For questions or feedback, please create an issue in this repository or contact the authors directly.