## Extractive Summarization

Useful links:

* [Automatic summarising: factors and directions (1999)](https://www.cl.cam.ac.uk/archive/ksj21/ksjdigipapers/summbook99.pdf) - newcomers to summarization should start here. Contains definitions and reviews different approaches and goals of summarization.
* [Text Summarization in Python: Extractive vs. Abstractive techniques revisited (2017)](https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/) - overview of summarization techniques in Python
* [Lectures on Summarization Techniques (2015?)](https://www.youtube.com/watch?v=N5N-HCUE3G4) from an old Coursera NLP course - covers many summarization techniques with an emphasis on research; most of them don't have implementations in Python.
* [Centroid-based Text Summarization through Compositionality of Word Embeddings](http://www.aclweb.org/anthology/W17-1003) - interesting article that replaces the Bag of Words representation used in an older centroid-based method with word embeddings. Has a remarkably [good implementation](https://github.com/gaetangate/text-summarizer) (it worked out of the box, which is uncommon for academic implementations; I only added a setup.py to make it pip-installable).
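
The centroid idea from the last link can be sketched in a few lines. This is a simplified illustration, not the paper's method or the linked implementation: it assumes raw term counts instead of tf-idf or embeddings, does no redundancy filtering, and the names `simple_tokens`, `cosine` and `centroid_sketch` are made up for this example.

```python
import math
import re
from collections import Counter


def simple_tokens(sentence):
    """Lowercase a sentence and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", sentence.lower())


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid_sketch(sentences, limit=2):
    """Return the `limit` sentences closest to the document centroid, in document order."""
    vectors = [Counter(simple_tokens(s)) for s in sentences]

    # Assumption: the centroid is the plain sum of all sentence count vectors.
    centroid = Counter()
    for vector in vectors:
        centroid.update(vector)

    scores = [cosine(vector, centroid) for vector in vectors]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:limit]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "Stocks fell sharply today.",
]
print(centroid_sketch(sentences, limit=2))
```

The off-topic sentence shares no words with the rest of the text, so it sits far from the centroid and is dropped. Swapping the count vectors for summed word embeddings is the change the article proposes.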

## TextRank
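
Both summa and gensim's summarizer are TextRank implementations: sentences become nodes in a graph, edges are weighted by sentence similarity, and PageRank picks the most central sentences. The sketch below is a minimal pure-Python version under stated assumptions: it uses the word-overlap similarity from the original TextRank paper (the libraries use their own variants), a fixed number of power iterations, and hypothetical names (`word_set`, `overlap_similarity`, `textrank_scores`, `textrank_sketch`).

```python
import math
import re


def word_set(sentence):
    """Set of lowercase word tokens in a sentence (hypothetical helper)."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def overlap_similarity(words_a, words_b):
    """Shared words divided by log|A| + log|B|; 0.0 when undefined."""
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0.0:  # both sentences consist of a single word
        return 0.0
    return len(words_a & words_b) / denominator


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    words = [word_set(s) for s in sentences]
    weights = [
        [0.0 if i == j else overlap_similarity(words[i], words[j]) for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def textrank_sketch(sentences, limit=2):
    """Return the `limit` best-ranked sentences, in document order."""
    scores = textrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:limit]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "A mouse ran under the mat.",
    "Stocks fell sharply today.",
]
print(textrank_sketch(sentences, limit=2))
```

A sentence with no shared words has no incoming edges, so it only ever receives the `(1 - damping) / n` baseline and ranks last.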

## Notes

* **PyTextRank** - arcane API; doesn't expose a simple function call the way gensim and summa do
* **sumy** - requires building a pipeline (doesn't work directly on raw strings)
* **pyteaser** - supports only Python 2

In [30]:
import numpy as np
import pandas as pd
import requests

import gensim
import summa
import text_summarizer
import nltk

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

from termcolor import colored 

import bokeh.plotting
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, HoverTool, LabelSet
from bokeh.plotting import figure, output_file, show
from bokeh.sampledata.periodic_table import elements

output_notebook()

from IPython.display import display, HTML

In [53]:
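def summary_as_set(summary):
    """Split a newline-separated summary into a set of its lines."""
    return set(summary.split('\n'))


def jaccard_index(set1, set2):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = set1.union(set2)
    if not union:
        return 0.0
    return len(set1.intersection(set2)) / len(union)


def summary_jaccard_index(summary1, summary2):
    """Jaccard index of two summaries, compared line by line."""
    return jaccard_index(summary_as_set(summary1), summary_as_set(summary2))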


def colorize_summary(text, summaries, colors=('red', 'green')):
    """Return `text` as HTML, coloring every sentence that appears in a summary."""
    def colorize_sentence(sentence):
        # The first summary that contains the sentence determines its color.
        for color, summary in zip(colors, summaries):
            if sentence in summary:
                return '<font color="{}">{}</font>'.format(color, sentence)
        return sentence

    # Mark paragraph breaks so they survive sentence tokenization.
    text = text.replace('\r\n\r', 'SPECIAL')
    sentences = nltk.sent_tokenize(text)

    # sent_tokenize keeps the terminal punctuation, so join with a space only.
    colored_text = ' '.join(colorize_sentence(sentence) for sentence in sentences)
    return colored_text.replace('SPECIAL', '<br>')

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/kuba/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/kuba/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Let's try a couple of summarization methods on The Matrix synopsis, which is conveniently available on gensim's page.

In [4]:
text = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text

In [5]:
%%time

summa_summary = summa.summarizer.summarize(text, words=500)

CPU times: user 2.04 s, sys: 44.4 ms, total: 2.08 s
Wall time: 1.95 s


In [6]:
%%time

# Note: gensim.summarization was removed in gensim 4.0, so this cell needs gensim 3.x.
gensim_summary = gensim.summarization.summarize(text, word_count=500)

CPU times: user 1.77 s, sys: 51.8 ms, total: 1.82 s
Wall time: 1.66 s


In [7]:
summary_jaccard_index(summa_summary, gensim_summary)

0.3548387096774194
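
The score above treats each summary as a set of lines and divides the number of shared lines by the number of distinct lines overall, so about a third of the selected sentences are common to both summaries. A small worked example, with made-up summaries and a hypothetical standalone helper `safe_jaccard_index`, shows the arithmetic:

```python
def safe_jaccard_index(set1, set2):
    """Jaccard index of two sets; returns 0.0 when both are empty (an added guard)."""
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)


# Two invented three-line summaries that share two lines.
summary_a = "Neo meets Morpheus.\nNeo takes the red pill.\nNeo fights Agent Smith."
summary_b = "Neo takes the red pill.\nTrinity rescues Neo.\nNeo fights Agent Smith."

lines_a = set(summary_a.split("\n"))
lines_b = set(summary_b.split("\n"))

# 2 shared lines out of 4 distinct lines.
print(safe_jaccard_index(lines_a, lines_b))  # 0.5
```

Because the comparison is an exact match on whole lines, two summaries that pick the same sentence but differ in whitespace or punctuation count as disagreeing.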

## Centroid-based summarization

In [8]:
centroid_bow_summarizer = text_summarizer.CentroidBOWSummarizer(preprocess_type='nltk')

In [9]:
centroid_bow_summary = centroid_bow_summarizer.summarize(text, limit=500)

In [10]:
%%time

embedding_model = text_summarizer.centroid_word_embeddings.load_gensim_embedding_model('glove-wiki-gigaword-50')

CPU times: user 19.3 s, sys: 200 ms, total: 19.5 s
Wall time: 20.4 s


In [11]:
centroid_word_embedding_summarizer = text_summarizer.CentroidWordEmbeddingsSummarizer(embedding_model, preprocess_type='nltk')

In [12]:
centroid_word_embedding_summary = centroid_word_embedding_summarizer.summarize(text, limit=500)


In [13]:
summary_jaccard_index(centroid_bow_summary, centroid_word_embedding_summary)

0.15384615384615385

In [54]:
display(
    HTML(colorize_summary(text, [centroid_word_embedding_summary, centroid_bow_summary]))
)