
Truncation vs Summarization

This repository contains the implementation of our paper:
M. A. Mutasodirin and R. E. Prasojo, "Investigating Text Shortening Strategy in BERT: Truncation vs Summarization," 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2021, pp. 1-5, doi: 10.1109/ICACSIS53237.2021.9631364.

Abstract

The parallelism of Transformer-based models comes at the cost of a maximum input length. Some studies have proposed methods to overcome this limitation, but none reported the effectiveness of summarization as an alternative. In this study, we investigate the performance of document truncation and summarization in text classification tasks, each with several variations. We also investigate how close their performance comes to that of the full text. We used IndoSum, a summarization dataset based on Indonesian news articles, for the classification tests. The summaries outperform all but one of the truncation variants: the best strategy in this study is taking the head of the document, and the second best is extractive summarization. We explain these results and point to further research into exploiting the potential of document summarization as a shortening alternative. The code and data used in this work are publicly available at https://github.com/mirzaalimm/TruncationVsSummarization.
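
For illustration, here is a minimal sketch of the two shortening strategies compared above: keeping the head of the document versus encoding a pre-computed summary before classification. This is not the paper's exact pipeline; the IndoBERT checkpoint name is an assumption chosen for illustration.

```python
# Sketch only: head truncation vs. summary encoding as BERT input.
from transformers import AutoTokenizer

MAX_LEN = 512  # BERT's hard limit on input tokens (incl. [CLS]/[SEP])
# Assumed checkpoint for Indonesian text; the paper's exact model may differ.
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

def head_truncate(document: str):
    # "Head" strategy: keep only the first MAX_LEN tokens of the document.
    return tokenizer(document, truncation=True, max_length=MAX_LEN)

def encode_summary(summary: str):
    # Summarization strategy: shorten the text first, then encode the summary.
    return tokenizer(summary, truncation=True, max_length=MAX_LEN)
```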

Keywords

long text, document classification, summarization, shortening strategy, BERT.

Data

Original IndoSum: Link 1, Link 2, Alt. Link.
Ready-to-Use Filtered-IndoSum: Link 1.
Ready-to-Use Automated Abstractive Summaries: Link 1, Alt. Link.

Code

IndoSum Statistics: Link 1, Alt. Link.
Filtering IndoSum: Link 1, Alt. Link.
Automatic Abstractive Summarization: Link 1, Alt. Link.
Filtered-IndoSum Statistics: Link 1, Alt. Link.
Document Classification: Link 1, Alt. Link.
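
For reference, a minimal sketch of the document-classification step listed above, using the same assumed IndoBERT checkpoint; the label count is illustrative, not taken from the paper.

```python
# Sketch only: classify a shortened document (head or summary) with BERT.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "indobenchmark/indobert-base-p1"  # assumed, as above
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=6  # label count is illustrative
)
model.eval()

text = "..."  # a shortened document: its head or its summary
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
```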

License

MIT

Citation

If you use any component of our work, please cite:

@inproceedings{Mutasodirin2021Shortening,
	author = {Mirza Alim Mutasodirin and Radityo Eko Prasojo},
	title = {Investigating Text Shortening Strategy in {BERT}: Truncation vs Summarization},
	booktitle = {2021 International Conference on Advanced Computer Science and Information Systems ({ICACSIS})},
	publisher = {{IEEE}},
	year = {2021},
	month = {oct},
	pages = {1--5},
	doi = {10.1109/icacsis53237.2021.9631364},
	url = {https://doi.org/10.1109/icacsis53237.2021.9631364}
}

Author's Email Address

Feel free to contact me via email at mirza.alim.m@gmail.com.
