# Extractive Sentence Summarization

In this notebook I provide a Python class for creating a full-sentence summary of a document. Sentence summaries are useful in applications where users need a quick sense of what a document contains so they can decide whether to read further. Text summarization techniques fall into two categories: extractive and abstractive. Extractive techniques "extract" sentences from the document itself; they are generally unsupervised and require less data. Conversely, abstractive techniques are supervised, require labeled training data, and produce summaries made up of generated rather than extracted sentences. The methods used here are all extractive and are implemented in the open source package `sumy`: KL-sum, Edmundson, LexRank, LSA, and random.

## Techniques

**LSA**  
An unsupervised technique that relies on singular value decomposition (SVD) of a term-sentence matrix with TF-IDF weights. See [Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis](http://www.cs.bham.ac.uk/~pxt/IDA/text_summary.pdf) for additional details.
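
To make the idea concrete, here is a minimal sketch under simplifying assumptions; it is not `sumy`'s implementation. It uses raw term counts rather than TF-IDF weights, and it recovers only the first right singular vector (via power iteration on the sentence-by-sentence Gram matrix) instead of running a full SVD. The names `tokenize` and `leading_topic_weights` are hypothetical helpers for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def leading_topic_weights(sentences, iterations=100):
    """Absolute components of the first right singular vector of the
    term-sentence count matrix A (one column per sentence)."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    n = len(vectors)
    # Gram matrix G = A^T A; its dominant eigenvector is the first
    # right singular vector of A.
    gram = [
        [sum(vectors[i][w] * vectors[j][w] for w in vectors[i]) for j in range(n)]
        for i in range(n)
    ]
    v = [1.0] * n
    for _ in range(iterations):
        v = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:  # no words at all
            return [0.0] * n
        v = [x / norm for x in v]
    return [abs(x) for x in v]


weights = leading_topic_weights(
    ["salt salt diapir", "salt diapir weld", "sunny weather"]
)
# The sentence with the largest weight is the best match for the dominant topic.
best = max(range(len(weights)), key=weights.__getitem__)
```

A sentence with a large component in this vector is strongly associated with the dominant "topic" of the document; the approach in the paper repeats the selection for further singular vectors.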

**Edmundson**  
A heuristic technique that considers word frequency, cue words (e.g., significant, impossible, hardly), title words, and word location. See [New Methods in Automatic Extracting, 1969](http://courses.ischool.berkeley.edu/i256/f06/papers/edmonson69.pdf) for additional details.
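
The weighted combination of the four signals can be sketched roughly as follows. This is a simplified illustration, not `sumy`'s `EdmundsonSummarizer`: the scoring rules for each signal, the function name `edmundson_scores`, and its parameters are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def edmundson_scores(sentences, title, bonus_words, stigma_words,
                     cue_weight=1.0, key_weight=1.0,
                     title_weight=1.0, location_weight=1.0):
    """Score each sentence as a weighted sum of four simple heuristics."""
    tokenized = [tokenize(s) for s in sentences]
    frequencies = Counter(w for words in tokenized for w in words)
    title_words = set(tokenize(title))
    last = len(sentences) - 1
    scores = []
    for index, words in enumerate(tokenized):
        # Cue: reward bonus words, penalize stigma words.
        cue = (sum(w in bonus_words for w in words)
               - sum(w in stigma_words for w in words))
        # Key: total document frequency of the sentence's repeated words.
        key = sum(frequencies[w] for w in set(words) if frequencies[w] > 1)
        # Title: number of words shared with the title.
        in_title = sum(w in title_words for w in words)
        # Location: first and last sentences get credit.
        location = 1 if index in (0, last) else 0
        scores.append(cue_weight * cue + key_weight * key
                      + title_weight * in_title + location_weight * location)
    return scores


# Location-only weighting, analogous to the weights used later in this notebook.
location_only = edmundson_scores(
    ["Megaflaps are steep panels.", "They are wide.", "They matter for traps."],
    title="Megaflaps",
    bonus_words={"significant"},
    stigma_words={"hardly"},
    cue_weight=0.0, key_weight=0.0, title_weight=0.0, location_weight=1.0,
)
```

With only the location weight switched on, the score depends purely on where a sentence sits in the document.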

**Kullback–Leibler**  
An unsupervised technique that selects sentences by minimizing the divergence between the word distribution of the document as a whole and that of the sentences in the summary. See [Exploring Content Models for Multi-Document Summarization, 2009](http://www.aclweb.org/anthology/N09-1041) for additional details.
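
As a rough illustration of the greedy selection idea, the sketch below adds, one at a time, the sentence that leaves the summary's word distribution closest to the document's. It is not `sumy`'s implementation: the regex tokenizer, the additive smoothing constant, and the names `distribution`, `kl_divergence`, and `kl_sum` are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def distribution(words):
    """Relative frequency of each word."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_divergence(doc_dist, summary_words, smoothing=1e-3):
    """KL(document || summary), smoothing the summary so that words
    missing from it do not produce a division by zero."""
    counts = Counter(summary_words)
    total = len(summary_words) + smoothing * len(doc_dist)
    divergence = 0.0
    for word, p in doc_dist.items():
        q = (counts[word] + smoothing) / total
        divergence += p * math.log(p / q)
    return divergence


def kl_sum(sentences, sentences_count):
    """Greedily pick the sentences that minimize the divergence."""
    tokenized = [tokenize(s) for s in sentences]
    doc_dist = distribution([w for words in tokenized for w in words])
    chosen, summary_words = [], []
    while len(chosen) < min(sentences_count, len(sentences)):
        best = min(
            (i for i in range(len(sentences)) if i not in chosen),
            key=lambda i: kl_divergence(doc_dist, summary_words + tokenized[i]),
        )
        chosen.append(best)
        summary_words += tokenized[best]
    # Return the selected sentences in document order.
    return [sentences[i] for i in sorted(chosen)]
```

Because each step conditions on what is already in the summary, a sentence that merely repeats covered words tends to lose out to one that adds missing vocabulary.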
 
**LexRank**  
An extension of Google's PageRank algorithm to sentence selection, where sentences are nodes and similarity scores define the edges. See [LexRank: Graph-based Lexical Centrality as Salience in Text Summarization, 2004](https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html) for additional details.
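
The graph construction and ranking step might look something like the sketch below. It is a simplified illustration rather than `sumy`'s `LexRankSummarizer`: it uses cosine similarity over raw term counts instead of TF-IDF vectors, and the function names, the threshold, and the damping value are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """PageRank-style centrality over a thresholded similarity graph."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    n = len(vectors)
    if n == 0:
        return []
    # Connect two sentences when their similarity exceeds the threshold.
    adjacency = [
        [1.0 if cosine(vectors[i], vectors[j]) > threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize so each row is a probability distribution.
    for row in adjacency:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    # Power iteration with damping, starting from a uniform distribution.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * adjacency[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


# The middle sentence shares a word with both neighbors, so it is the most central.
scores = lexrank_scores(["salt diapir", "diapir weld", "weld minibasin"])
```

Sentences that are similar to many other sentences accumulate the highest scores and are the ones selected for the summary.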


## Packages

Package versions are:

```
sumy==0.7.0
nltk==3.2.5
```
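
If it is unclear whether the local environment matches these versions, a small check along the following lines can help. It is only a sketch: `PINNED` and `installed_version` are hypothetical names introduced here, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this notebook.
PINNED = {"sumy": "0.7.0", "nltk": "3.2.5"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name, expected in PINNED.items():
    found = installed_version(name)
    status = "ok" if found == expected else "differs"
    print(f"{name}: expected {expected}, found {found} ({status})")
```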

## Define the Sentence Extractor Class

In [4]:
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.summarizers.kl import KLSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.random import RandomSummarizer
import nltk

class SentencesExtractor:
    """Extract human readable sentences from text.

    Arguments
    ----------

    summarization_method : str, {'edmundson', 'lsa', 'lexrank', 'kl', 'random'}
        Method for extracting the sentence.
        - 'edmundson': Heuristic method based on sentence location.
        - 'lsa': Latent Semantic Analysis, based on SVD of a term-sentence matrix.
        - 'lexrank': LexRank method, inspired by Google's PageRank algorithm.
        - 'kl': Kullback-Leibler divergence method.
        - 'random': Random sentences, for evaluating performance.

    stop_words : list
        Words to not consider when evaluating sentence importance.

    sentences_count : int
        Number of sentences to return, between 1 and 5
    """

    _MIN_SENTENCES = 1
    _MAX_SENTENCES = 5
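    # Map each method name to a factory so that every extractor gets its own
    # summarizer instance; sharing one instance would let a later extractor
    # overwrite the stop words of an earlier one.
    _NAME_TO_SUMMARIZER_CLS = {
        'edmundson': lambda: EdmundsonSummarizer(
            cue_weight=0.0, key_weight=0.0, location_weight=1.0, title_weight=0.0),
        'lsa': LsaSummarizer,
        'lexrank': LexRankSummarizer,
        'kl': KLSummarizer,
        'random': RandomSummarizer,
    }

    def __init__(self, summarization_method, sentences_count, stop_words):
        assert self._MIN_SENTENCES <= sentences_count <= self._MAX_SENTENCES
        assert summarization_method in self._NAME_TO_SUMMARIZER_CLS
        self.sentences_count = sentences_count
        self.summarizer = self._NAME_TO_SUMMARIZER_CLS[summarization_method]()
        # `null_words` is the attribute read by EdmundsonSummarizer. sumy's LSA,
        # LexRank, and KL summarizers read `stop_words` instead, so this
        # assignment leaves them unfiltered.
        self.summarizer.null_words = frozenset(stop_words)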
        self.summarization_method = summarization_method
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            nltk.download('punkt')
        self.tokenizer = Tokenizer('english')

    def extract(self, text):
        text = PlaintextParser.from_string(text, self.tokenizer).document
        summary = self.summarizer(text, sentences_count=self.sentences_count)
        return {
            'summary': ' '.join(map(str, summary)),
            'sentences_count': self.sentences_count,
            'summarization_method': self.summarization_method
        }

## Example Usage

In [2]:
# Requires the NLTK stopwords corpus: run nltk.download('stopwords') once if it is missing.
from nltk.corpus import stopwords
sw = stopwords.words('english')

se = SentencesExtractor(
    summarization_method='lexrank',
    sentences_count=2,
    stop_words=sw,
)

In [3]:
txt = '''
Megaflaps are steep stratal panels that extend far up the sides of
diapirs or their equivalent welds. They have multiple-kilometer fold
widths and structural relief and  are thus distinct from smaller- scale
composite halokinetic sequences. Maximum dips range from near-vertical
to completely overturned. Although overturned megaflaps are associated with
flaring salt, there is no direct link between megaflap formation and the
initiation of salt sheets. Strata within a megaflap are usually convergent,
and the lower boundary is typically concordant with the top salt. The
upper boundary ranges between a prominent onlap surface and a more diffuse
zone of gradual rotation and thinning, and growth strata likewise display
both onlap and stacked wedge geometries. We use quantitative cross-section
restoration to elucidate the origin and development of megaflaps. Megaflaps
typically represent the relatively thin roofs of early salt structures that
include single- flap active diapirs, passive diapirs, salt pillows,
and salt sheets. They develop during halokinetic drape folding as the
minibasin sinks, during contractional squeezing of the diapir and its
roof, or during some combination of the two. The kinematics are dominated
by either limb rotation or kink-band migration, in which roof strata
move through a fold hinge into a lengthening steep megaflap. Both restoration
results and direct field evidence suggest that internal strain is minor,
with little bed lengthening and thinning. Recognition and understanding of
megaflaps are critical to successful petroleum exploration of three-way
truncation traps against salt. Megaflaps also have implications for the
lateral seal of stratigraphic traps and fluid pressures in minibasins
'''

result = se.extract(txt)

print('Document in {} sentences using {}:\n\n{}'.format(
    result['sentences_count'], 
    result['summarization_method'], 
    result['summary']))

Document in 2 sentences using lexrank:

Although overturned megaflaps are associated with flaring salt, there is no direct link between megaflap formation and the initiation of salt sheets. Strata within a megaflap are usually convergent, and the lower boundary is typically concordant with the top salt.
