Shuca is automatic summarization software. It can summarize either a single document (single-document summarization) or multiple documents (multi-document summarization). For summarizing documents written in Japanese, see README.ja.md.
- Shuca extracts important sentences from the input so as to maximize their total importance within the given length limit.
Single Document Summarization
- Shuca treats single-document summarization as an instance of the knapsack problem; it therefore searches for the best combination of input sentences using the dynamic programming knapsack algorithm.
- Shuca treats multi-document summarization as an instance of the maximum coverage knapsack problem; it therefore searches for an approximate solution for the best combination of input sentences using a greedy algorithm.
- Each line of the input must be one sentence; Shuca treats each line as a unit of summarization.
- As preprocessing, words in the input should be stemmed beforehand.
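
The two selection strategies described above can be illustrated with a minimal Python sketch. This is not Shuca's actual code: the function names, the per-sentence `scores`, the per-word `weights`, and the gain-per-word greedy criterion are all assumptions made for illustration. Sentence length is counted in words, and each sentence is assumed to be already split and stemmed.

```python
from typing import Dict, List, Sequence, Set


def knapsack_select(scores: Sequence[float], lengths: Sequence[int],
                    limit: int) -> List[int]:
    """0/1 knapsack by dynamic programming (hypothetical helper).

    Returns the indices of the sentences whose total score is maximal
    while their total length stays within `limit` words.
    """
    n = len(scores)
    # best[i][c]: best total score using the first i sentences within c words
    best = [[0.0] * (limit + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(limit + 1):
            best[i][c] = best[i - 1][c]  # skip sentence i-1
            if length <= c and best[i - 1][c - length] + score > best[i][c]:
                best[i][c] = best[i - 1][c - length] + score  # take it
    # Trace back through the table to recover the chosen sentences.
    chosen, c = [], limit
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


def greedy_coverage_select(sentences: Sequence[Sequence[str]],
                           weights: Dict[str, float],
                           limit: int) -> List[int]:
    """Greedy approximation for maximum coverage under a length budget.

    Assumed criterion: repeatedly pick the sentence with the highest
    weight of not-yet-covered words per word of length.
    """
    covered: Set[str] = set()
    chosen: List[int] = []
    remaining = limit
    candidates = set(range(len(sentences)))
    while candidates:
        best_i, best_ratio = None, 0.0
        for i in sorted(candidates):
            length = len(sentences[i])
            if length == 0 or length > remaining:
                continue  # empty or does not fit in the remaining budget
            gain = sum(weights.get(w, 0.0) for w in set(sentences[i]) - covered)
            if gain / length > best_ratio:
                best_i, best_ratio = i, gain / length
        if best_i is None:
            break  # nothing fits or nothing adds new coverage
        chosen.append(best_i)
        covered |= set(sentences[best_i])
        remaining -= len(sentences[best_i])
        candidates.remove(best_i)
    return sorted(chosen)
```

In this sketch, a word that is already covered contributes nothing to later sentences, so the greedy variant avoids picking redundant sentences from different documents, whereas the knapsack variant scores each sentence independently.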
Download the files and put them in a directory of your choice; no installation is needed.
The `dat` directory contains some sample files.
`automatic_summarization.txt` is a text file containing the first two paragraphs of the Wikipedia article "Automatic Summarization".
`automatic_summarization.sent.txt` and `automatic_summarization.sent.stem.txt` are the sentence-split and stemmed versions of that file, respectively.
Please try the following command:
$ ./lib/Shuca.py -e < ./dat/automatic_summarization.sent.stem.txt
- `-d`: Specify an external dictionary. By default, Shuca uses a default dictionary.
- `-e`: Must be specified to summarize texts written in English.
- `-l`: Specify the summary length in words. For instance, with `-l 200`, Shuca outputs a summary of at most 200 words. The default summary length is 200 words.
- `-m`: Perform multi-document summarization. Without this option, single-document summarization is performed.
- `-s`: Specify the summary length in sentences. For instance, with `-s 3`, three sentences are output as the summary.
- The parameters shipped with the software are currently rough values set by the developer; they will be replaced with ones estimated through machine learning.