In the era of big data, text is being produced much faster than any human can consume it. According to Marr, as of May 2018, 16 million text messages, 156 million emails and 456,000 tweets were being sent every minute. In addition, modern technologies (e.g. the Internet) deliver news in many formats, such as text messages, social media and online subscriptions. Clearly, we cannot read every news article in its original form. For readers who just want to grasp the main ideas, it is much more efficient to summarize these articles into shorter texts. However, manual text summarization is tedious and laborious.
With the power of computers, we hope to perform text summarization automatically. Automatic text summarization has several advantages over the manual approach, for instance fewer biases and more personalized recommendations. Since news articles are relatively well organized, it is easier to summarize them in a meaningful way. Our team explored several recent methods for automatic text summarization and applied them to a large-scale dataset, namely the DeepMind Q&A (DMQA) dataset shared by Google. This dataset contains around 90,000 documents from CNN news. Each document is composed of the body text and human-written “highlights”. Our goal is to construct, given a news body, a summary comparable to its “highlights”.
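To make the document layout concrete, here is a minimal sketch of splitting one story into its body and its highlights. It assumes the plain-text layout in which each highlight follows an `@highlight` marker line after the body; that marker, the function name `split_story` and the sample text are assumptions for illustration, not taken from this project's code.

```python
from typing import List, Tuple


def split_story(raw: str) -> Tuple[str, List[str]]:
    """Split raw story text into (body, highlights).

    Assumption: highlights are introduced by "@highlight" marker lines
    that come after the body text.
    """
    # Everything before the first marker is treated as the body;
    # each later chunk is treated as one highlight.
    parts = raw.split("@highlight")
    # Collapse runs of whitespace and newlines into single spaces.
    body = " ".join(parts[0].split())
    highlights = [" ".join(p.split()) for p in parts[1:] if p.strip()]
    return body, highlights


# Made-up sample text, only to show the expected shape of the input.
sample = """(CNN) -- A storm hit the coast on Monday.

Thousands of homes lost power.

@highlight

Storm hits coast

@highlight

Thousands lose power
"""

body, highlights = split_story(sample)
print(body)
print(highlights)
```

The returned `highlights` list can serve as the reference summary against which a generated summary of `body` is compared.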
First, download the DMQA dataset and put the files under the cnn_stories folder.
The following 4 methods are implemented. Please refer to the code for how to use them.
- TF-IDF tag based method
- Modified LexRank
- Latent Semantic Analysis