huashen218/MultiTurnCleanup

MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup

This repository contains the MultiTurnCleanup dataset and the accompanying paper, which was accepted to the EMNLP 2023 main conference.

MultiTurnCleanup is a targeted dataset for cleaning up multi-turn spoken conversational transcripts. Detailed experiments and analyses can be found in our paper. This work was done while the first author was a research intern at Google AI Research.

Citation

If you find this repository helpful for your research, please cite the paper:

@article{shen2023multiturncleanup,
  title={MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup},
  author={Shen, Hua and Zayats, Vicky and Rocholl, Johann C and Walker, Daniel D and Padfield, Dirk},
  journal={arXiv preprint arXiv:2305.12029},
  year={2023}
}

Dataset Description

MultiTurnCleanup consists of ~143k multi-turn cleanup labels with the following train/dev/test splits:

| Split | #Conv | #Turns | #Tokens | #Cleanup |
|-------|-------|--------|---------|----------|
| Train | 932   | 74k    | 1M      | 132k     |
| Dev   | 86    | 3.7k   | 60k     | 6.1k     |
| Test  | 64    | 2.9k   | 43k     | 5k       |
| Sum   | 1082  | 81k    | 1.1M    | 143k     |
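
The rounded figures in the table above can be checked against the Sum row, and turned into per-split proportions, with a few lines of Python. This is a minimal sketch: the dictionary layout and the helper names `totals` and `share` are illustrative assumptions, not part of the dataset release, and the values are the rounded ones from the table rather than exact counts.

```python
# Rounded split statistics copied from the table above.
# Only the conversation counts are exact; the rest are approximations.
SPLITS = {
    "train": {"conversations": 932, "turns": 74_000, "tokens": 1_000_000, "cleanup": 132_000},
    "dev": {"conversations": 86, "turns": 3_700, "tokens": 60_000, "cleanup": 6_100},
    "test": {"conversations": 64, "turns": 2_900, "tokens": 43_000, "cleanup": 5_000},
}


def totals(splits):
    """Sum every column across all splits."""
    columns = next(iter(splits.values())).keys()
    return {col: sum(stats[col] for stats in splits.values()) for col in columns}


def share(splits, column):
    """Return each split's fraction of the total for one column."""
    total = totals(splits)[column]
    return {name: stats[column] / total for name, stats in splits.items()}


if __name__ == "__main__":
    print(totals(SPLITS))
    for name, fraction in share(SPLITS, "cleanup").items():
        print(f"{name}: {fraction:.1%} of cleanup labels")
```

With these rounded inputs the column sums agree with the Sum row after rounding (1082 conversations, about 81k turns, about 1.1M tokens, about 143k cleanup labels), and the training split accounts for roughly 92% of the cleanup labels.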

License

The MultiTurnCleanup dataset is licensed under LDC (Linguistic Data Consortium).

Contact

Please create an issue in this repository or contact the authors directly.
