lxj5957/CLTS-plus-Dataset

CLTS-plus-Dataset

CLTS+: A Chinese Long Text Summarization Dataset with Abstractive Summaries

Introduction

We previously proposed CLTS, a Chinese Long Text Summarization dataset. However, CLTS is an extractive dataset: its reference summaries frequently borrow words and phrases from the source text. As a result, models trained on CLTS tend to copy whole sentences from the article when generating a summary.
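
As a rough illustration of what "extractive" means here, the sketch below measures the share of character n-grams in a summary that never occur in its source text. A value near 0 means the summary is mostly copied, and a higher value means it is more abstractive. This is not the metric or tooling used to build CLTS+; the function name `novel_ngram_ratio`, the choice of character bigrams, and the toy strings are assumptions made for illustration only.

```python
def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Return the fraction of character n-grams in `summary` absent from `source`.

    Hypothetical helper for illustration; character n-grams are used so that
    no Chinese word segmenter is required.
    """
    def ngrams(text: str) -> list[str]:
        # All contiguous character windows of length n.
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    summary_grams = ngrams(summary)
    if not summary_grams:
        # Summary shorter than n characters: nothing to measure.
        return 0.0

    source_grams = set(ngrams(source))
    novel = sum(1 for gram in summary_grams if gram not in source_grams)
    return novel / len(summary_grams)


# Toy strings (invented for this example, not taken from CLTS or CLTS+).
source_text = "今天北京天气晴朗，气温回升。"
copied_summary = "今天北京天气晴朗"   # lifted verbatim from the source
paraphrased_summary = "北京今日放晴"  # reworded

print(novel_ngram_ratio(source_text, copied_summary))
print(novel_ngram_ratio(source_text, paraphrased_summary))
```

The verbatim summary scores lower than the reworded one, which is the kind of gap that separates extractive reference summaries from paraphrased ones.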

To address this problem, we propose the CLTS+ dataset. The ground-truth summaries in CLTS+ are the CLTS reference summaries after paraphrasing. Paraphrasing inevitably introduces some inconsistencies; for example, names of people and places in a paraphrased summary may no longer match those in the CLTS reference summary. We therefore correct these factual inconsistencies to reduce noise in the dataset and improve the prediction accuracy of models.

This work has been accepted by ICANN 2022. We will update the paper link as soon as the paper is published.

Samples

We have selected some samples from CLTS+; you can find them in samples.txt.

Download

CLTS+ is available from the link. The password is 7yvn.
