Liputan6 is the first large-scale Indonesian corpus for abstractive and extractive summarization. The data covers the years 2000 to 2010 and comes in two sets:
Data | Train | Dev | Test |
---|---|---|---|
Canonical | 193,883 | 10,972 | 10,972 |
Xtreme | 193,883 | 4,948 | 3,862 |
Liputan6 is registered as a new dataset in IndoLEM, an Indonesian resource collection encompassing morpho-syntax, semantics, and discourse.
Fajri Koto, Jey Han Lau, and Timothy Baldwin. Liputan6: A Large-scale Indonesian Dataset for Text Summarization. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2020)
Although Liputan6 is a publicly available online news portal, under Indonesian Copyright Law Number 28 Year 2014 this corpus can only be used for non-commercial activities such as academic research. It is STRONGLY FORBIDDEN to use this corpus, or any summarization model created using this corpus, for commercial activities. We strongly encourage other researchers not to redistribute the dataset.
Please fill out this form. A URL to download the Liputan6 corpus will be sent to your email address.
- First, please download the JSON file containing the URLs of Liputan6 here. Put the file `url.json` in this repository.
- Please run the following code (tested in Python 3.7). If you want to increase the number of threads, please adjust the code manually.
pip install -r requirements.txt
python 0_download.py
python 1_preprocessing.py
python 2_create_extractive_label.py
python 3_get_xtreme.py
- If you want to run the pointer-generator network or the BERT-based summarization model, prepare the data as follows:
python 4_make_data_files_pg.py
python 5_make_data_files_presumm_mbert.py
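The download step above says the number of threads has to be adjusted in the code by hand. As a rough illustration of what such a setting controls, here is a minimal sketch of a thread-pooled downloader. It is not the repository's `0_download.py`: the function names, the `num_threads` parameter, and the assumed layout of `url.json` (a flat list of URLs, or a dict of URL lists) are all hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def load_urls(path):
    """Read URLs from a JSON file.

    Assumption: the file holds either a flat list of URLs or a dict whose
    values are lists of URLs (for example one list per split).
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return [url for urls in data.values() for url in urls]
    return list(data)


def fetch(url, timeout=30):
    """Download one page and return (url, html), or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # Network errors, HTTP errors, timeouts, and malformed URLs
        return url, None


def download_all(urls, fetch_fn=fetch, num_threads=8):
    """Fetch every URL with a pool of num_threads worker threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as the input URLs
        return list(pool.map(fetch_fn, urls))
```

With this layout, raising the degree of parallelism is a matter of calling `download_all(load_urls("url.json"), num_threads=16)`. A larger pool speeds up I/O-bound downloads but also puts more load on the news portal, so a moderate value is the safer choice.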
- Pointer-generator network: PG.
- BERT-based summarization model: PreSumm. The IndoBERT model used in the paper can be found here.
We also provide the test set outputs reported in our paper. You can download them here.
Model | R1 | R2 | RL |
---|---|---|---|
Lead-2 | 36.68 | 20.23 | 33.71 |
PTGen | 36.10 | 19.19 | 33.56 |
BertExt (mBERT) | 37.51 | 20.15 | 34.57 |
BertAbs (mBERT) | 39.48 | 21.59 | 36.72 |
BertExtAbs (mBERT) | 39.81 | 21.84 | 37.02 |
BertExt (indoBERT) | 38.03 | 20.72 | 35.07 |
BertAbs (indoBERT) | 40.94 | 23.01 | 37.89 |
BertExtAbs (indoBERT) | 41.08 | 22.85 | 38.01 |
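
The Lead-2 row in the table above is a non-learned baseline: by the usual convention, a Lead-N summary is simply the first N sentences of the article. A minimal sketch, assuming the article is already split into sentences; the function name `lead_n` is hypothetical and not taken from this repository.

```python
def lead_n(sentences, n=2):
    """Return the first n sentences of an article as its summary."""
    return sentences[:n]


# Toy article, already split into sentences (placeholder text)
article = [
    "Sentence one of the article.",
    "Sentence two of the article.",
    "Sentence three of the article.",
]
summary = lead_n(article)  # the first two sentences
```

Because news articles tend to put the key facts first, this baseline is a useful reference point for the learned models.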
Please install pyrouge to evaluate the summaries.
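
To make the R1, R2, and RL columns concrete, the sketch below computes simplified ROUGE-1, ROUGE-2, and ROUGE-L F1 scores in plain Python. It is not pyrouge and will not reproduce the numbers in the table: it assumes pre-tokenized input, applies no stemming, and treats each summary as a single token sequence when computing ROUGE-L. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """F1 from an overlap count and the candidate/reference totals."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Simplified ROUGE-L F1 over the whole summary as one sequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(round(rouge_n(candidate, reference, 1), 4))  # 0.8333
print(round(rouge_n(candidate, reference, 2), 4))  # 0.6
print(round(rouge_l(candidate, reference), 4))     # 0.8333
```

For results comparable to the table, use pyrouge on the provided test set outputs rather than this sketch.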