CLTS-Dataset

CLTS: A Chinese Long Text Summarization Dataset

Introduction

Currently, there are two issues in automatic summarization dataset for Chinese: (1) There are fewer Chinese datasets. (2) There is a lack of Chinese long text datasets, and most existing datasets contain short texts, which can only be used to generate a title from short text.

CLTS is a new Chinese long text summarization dataset, extracted from the Chinese news website ThePaper.cn. The resulting version of the dataset contains more than 180,000 long-sequence pairs, where each article consists of multiple paragraphs and each summary consists of multiple sentences.

The dataset is introduced in CLTS: A New Chinese Long Text Summarization Dataset.

Properties

The comparison result of CLTS to other datasets is shown in the table below. For Chinese dataset, length is the number of Chinese characters, while for English, it is the number of words.

Dataset	CLTS	LCSTS	CNN/DM
Dataset Size	185,397	2,412,363	312,085
Training Set Size	148,317	2,400,591	287,227
Max Article Length	6130	135	2173
Mean Article Length	1363.69	88.65	687.09
Mean Summary Length	58.12	16.09	48.49

Samples

We select some samples from the dataset and you can see them in samples.txt

Download

We split the corpus into three parts, including training, validation and test set. The data can be downloaded from the link here. And the password is iucp.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
README.md		README.md
samples.txt		samples.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CLTS-Dataset

Introduction

Properties

Samples

Download

About

Releases

Packages

lxj5957/CLTS-Dataset

Folders and files

Latest commit

History

Repository files navigation

CLTS-Dataset

Introduction

Properties

Samples

Download

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages