Sumy - module for automatic summarization of text documents and HTML pages.

## The story

Sumy was created as my diploma thesis, driven by the need to shorten articles written in the Czech and Slovak languages. Although its source code has always been publicly available on GitHub, I didn't expect it to be adopted by so many people. Don't get me wrong, I am happy about it, but that is also why you may find a lack of documentation and the occasional hardcoded feature for the Czech/Slovak languages in the codebase. Because the thesis is written in Slovak, I will try to describe the practical parts here for people using the library.

Sumy creates extractive summaries. That means it tries to find the most significant sentences in the document(s) and composes them into a shortened text. There is another approach, called abstractive summarization, but to create such a summary one needs to understand the topic and write a new, shorter text from it. This is out of the scope of Sumy's current capabilities.

## Basic architecture

Even though I focused on the Czech and Slovak languages in my work, I wanted Sumy to be extensible to other languages from the start. That's why I created it as a set of independent objects that can be replaced by the user of the library to add new or better capabilities to it.

### Document

The central object is the `Document`, which represents the whole document ready to be summarized. It consists of a collection of `Paragraph` objects, and each paragraph consists of a collection of `Sentence` objects. Every sentence has a boolean flag `is_heading` indicating whether it is a normal sentence or a heading. It also has a tokenizer attached, so you can get the list of its words. A `Word`, however, is represented as a simple string.
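
Here is a quick sketch of walking the document model. The `Document` is produced by the `PlaintextParser` described below, so the example text and its parsing are only illustrative.

```python
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser

TEXT = "MY HEADING\n\nFirst sentence of the paragraph. Second sentence of the paragraph."
document = PlaintextParser.from_string(TEXT, Tokenizer("english")).document

for paragraph in document.paragraphs:
    for sentence in paragraph.sentences:
        # is_heading is a boolean flag; words is a sequence of plain strings.
        print(sentence.is_heading, list(sentence.words))
```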

### Tokenizer

To create a `Document` (or rather a `Parser`) you will need a `Tokenizer`. The `Tokenizer` is one of the language-specific parts of the puzzle. I use the nltk library for this, so there is a great chance your language is already covered. Simply pass your language name to it and see whether it works :) If it raises an exception, you have two choices. The first is to send a pull request to Sumy with a new `Tokenizer` for your language. The second is to create your own `Tokenizer` and pass it to Sumy. And once you have it, it should be easy to send the pull request with your code anyway. A tokenizer is any object with two methods: `to_sentences(paragraph: str)` and `to_words(sentence: str)`.
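
Below is a minimal custom tokenizer sketch. Sumy only cares about the two methods; the regular expressions here are simplistic placeholders rather than real sentence-splitting rules, and `SimpleTokenizer` is just a hypothetical name.

```python
import re

from sumy.nlp.tokenizers import Tokenizer  # the built-in, nltk-backed tokenizer


class SimpleTokenizer:
    """A naive tokenizer usable anywhere Sumy expects a Tokenizer."""

    def to_sentences(self, paragraph):
        # Split on sentence-ending punctuation followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

    def to_words(self, sentence):
        # Treat every alphanumeric run as a word.
        return re.findall(r"\w+", sentence, re.UNICODE)


built_in = Tokenizer("english")   # raises an exception if the language is not supported
custom = SimpleTokenizer()        # drop-in replacement for unsupported languages
print(custom.to_words("Sumy makes summaries."))
```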

### Parser

You could create the `Document` by hand, but that would not be very convenient. That's why there is the `DocumentParser` for the job. It's a base class you can inherit from and extend to create your own transformation from the input document format to the `Document` object. Sumy provides two implementations. The first one is the `PlaintextParser`. The name is not entirely accurate because some very simple formatting is expected: paragraphs are separated by a single empty line, and a paragraph heading can be created by writing the whole sentence in UPPER CASE letters. But that's all. The more interesting implementation is the `HtmlParser`. It is able to extract the main article from an HTML page with the help of the breadability library and returns a `Document` with useful meta-information extracted from the HTML markup. Many other summarizers use the XML format for input documents, and it should not be hard to implement a parser for it if you want to. All you need to do is inherit from `DocumentParser` and define the `document` property returning a `Document` object.
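
Here is a short sketch of the two bundled parsers, following Sumy's documented usage; the Wikipedia URL is just an example.

```python
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.parsers.html import HtmlParser

tokenizer = Tokenizer("english")

# Plain text: paragraphs are separated by an empty line, UPPER CASE lines become headings.
TEXT = "FIRST TOPIC\n\nSumy works on plain text.\n\nSECOND TOPIC\n\nIt also works on HTML pages."
plain_document = PlaintextParser.from_string(TEXT, tokenizer).document

# HTML: the main article is extracted with the help of the breadability library.
html_document = HtmlParser.from_url(
    "https://en.wikipedia.org/wiki/Automatic_summarization", tokenizer
).document
```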

### Preprocessing (optional)

OK, now you know how to create the `Document` from your text. Next, you probably want to summarize it. Before we do that, you should know that the `Document` can be preprocessed in any way you like. You can transform or enrich it with information that is important to you, and you can add or remove parts of it. Whatever you need. In some edge cases you can even create a new `Document`, as long as you adhere to the API.
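
As an example, here is a hypothetical preprocessing step that drops very short sentences before summarization. It assumes `Paragraph` and `ObjectDocumentModel` can be constructed from sequences of sentences and paragraphs, which may differ between Sumy versions, so treat it as a sketch only.

```python
from sumy.models.dom import ObjectDocumentModel, Paragraph


def drop_short_sentences(document, min_words=4):
    """Return a new document keeping only sentences with at least min_words words."""
    paragraphs = []
    for paragraph in document.paragraphs:
        kept = [s for s in paragraph.sentences if len(s.words) >= min_words]
        if kept:
            paragraphs.append(Paragraph(kept))
    return ObjectDocumentModel(paragraphs)
```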

### Stemmer

Then you need a `Stemmer`. The `Stemmer` is just a fancy word for an algorithm that tries to normalize different forms of a word into a single one. The simplest stemmer implementation in Sumy is the so-called `null_stemmer`, which is handy for languages such as Chinese, Japanese or Korean where words do not need to be unified. The Czech and Slovak languages have a custom `Stemmer` in Sumy, and all other languages use nltk for this. So again, there is a good chance your language is covered. But a stemmer is simply any callable that takes a word and returns a word, which is good news: you can implement your own by writing a function with a custom implementation.
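
A small sketch of the built-in stemmers next to a custom one. The suffix list in `naive_stemmer` is purely illustrative.

```python
from sumy.nlp.stemmers import Stemmer, null_stemmer

english_stemmer = Stemmer("english")  # nltk-backed stemmer for a supported language
print(english_stemmer("summaries"))

print(null_stemmer("summaries"))      # performs no real stemming


def naive_stemmer(word):
    """Illustrative custom stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```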

### Summarizer

And we are reaching the finish line here. You have the `Document` created and you are not afraid to use your `Stemmer`. Now you are ready to choose one of the summarizers, probably with the exception of the `RandomSummarizer`, which serves only as a lower bound when evaluating the quality of summaries. A summarizer needs a `Stemmer` as its dependency and, optionally, a list of stop-words. Although the stop-words are optional, I really recommend using them to get better results. You can use `sumy.utils.get_stop_words(language: str)` or simply provide your own list of words. After all of this your summarizer is ready to serve you: simply give it the `Document` and the number of sentences you want back, and you are done.
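
A minimal end-to-end sketch, following Sumy's documented usage. The choice of `LsaSummarizer` and the Wikipedia URL are just examples.

```python
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

LANGUAGE = "english"
SENTENCES_COUNT = 3

parser = HtmlParser.from_url(
    "https://en.wikipedia.org/wiki/Automatic_summarization", Tokenizer(LANGUAGE)
)

summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)

# Ask for the most significant sentences of the document.
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)
```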

You can find the specifics of the individual summarizers on a separate page.