Liputan6: Summarization Corpus for Indonesian

About

Liputan6 is the first large-scale Indonesian corpus for abstractive and extractive summarization. The data covers the years 2000 to 2010 and comes in two sets:

Data	Train	Dev	Test
Canonical	193,883	10,972	10,972
Xtreme	193,883	4,948	3,862

Liputan6 is registered as a new dataset in IndoLEM (Indonesian resource collection encompassing morpho-syntax, semantics, and discourse).

Paper

Fajri Koto, Jey Han Lau, and Timothy Baldwin. Liputan6: A Large-scale Indonesian Dataset for Text Summarization. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2020).

Obtaining Liputan6 Data

Disclaimer

Although Liputan6 is a publicly available online news portal, under Indonesian Copyright Law Number 28 Year 2014 this corpus may only be used for non-commercial purposes such as academic research. It is STRONGLY FORBIDDEN to use this corpus, or any summarization model built with it, for commercial purposes. We strongly encourage other researchers not to redistribute the dataset.

Way1 - By filling the form

Please fill in this form. A URL for downloading the Liputan6 corpus will be sent to your email address.

Way2 - By running the codes

First, download the JSON file containing the Liputan6 article URLs here, and put the file url.json in this repository.
Then run the following commands (tested with Python 3.7). To increase the number of download threads, adjust the code manually.

pip install -r requirements.txt
python 0_download.py
python 1_preprocessing.py
python 2_create_extractive_label.py
python 3_get_xtreme.py
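
The four Python steps above have to run in order, and a failed download should stop the later steps. A minimal sketch of a wrapper that does this is shown below. The helper names `build_commands` and `run_pipeline` are hypothetical and not part of this repository; only the script names and `url.json` come from the instructions above.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the commands above, in the order given there.
PIPELINE = [
    "0_download.py",
    "1_preprocessing.py",
    "2_create_extractive_label.py",
    "3_get_xtreme.py",
]


def build_commands(scripts, python=sys.executable):
    """Return one argv list per script, using the given Python interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE, url_file="url.json"):
    """Run each step in order and stop at the first failure (hypothetical helper)."""
    if not Path(url_file).is_file():
        raise FileNotFoundError(f"{url_file} not found; download it before running the pipeline")
    for cmd in build_commands(scripts):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError when a script exits with a non-zero status
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the repository directory would then be equivalent to typing the four `python` commands one after another.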

If you want to run the pointer generator network and the BERT-based summarization models, prepare the data as follows:

python 4_make_data_files_pg.py
python 5_make_data_files_presumm_mbert.py

Training Neural Models

Pointer Generator Network: PG.
BERT-based summarization model: PreSumm. The IndoBERT model used in the paper can be found here.

Test Set Output

We also provide the test set outputs reported in our paper. You can download them here.

Model	R1	R2	RL
Lead-2	36.68	20.23	33.71
PTGen	36.10	19.19	33.56
BertExt (mBERT)	37.51	20.15	34.57
BertAbs (mBERT)	39.48	21.59	36.72
BertExtAbs (mBERT)	39.81	21.84	37.02
BertExt (indoBERT)	38.03	20.72	35.07
BertAbs (indoBERT)	40.94	23.01	37.89
BertExtAbs (indoBERT)	41.08	22.85	38.01
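
For orientation, the Lead-2 row is the usual lead baseline: the summary is simply the first two sentences of the article. A minimal sketch is given below, assuming the article is already available as a list of sentence strings; the function name `lead_n` and the sample sentences are illustrative and not taken from this repository.

```python
def lead_n(sentences, n=2):
    """Return the first n sentences of an article as its summary."""
    return sentences[:n]


# Illustrative input: an article already split into sentences (assumed format).
article = [
    "Sentence one of the article.",
    "Sentence two of the article.",
    "Sentence three of the article.",
]

# Lead-2 keeps only the first two sentences.
summary = " ".join(lead_n(article))
```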

Evaluation

Please install pyrouge to evaluate the summaries.
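
To make the R1 and R2 columns concrete, the sketch below computes a simplified ROUGE-N F1 score from n-gram overlap in pure Python. It is not pyrouge and is not expected to reproduce the numbers in the table: it assumes whitespace tokenization and lowercasing, and it omits the other options of the official scorer. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between two whitespace-tokenized strings."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Setting `n=1` corresponds to the idea behind R1 and `n=2` to R2; RL is based on the longest common subsequence and is not covered by this sketch.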
