## Extractive Summarization

Useful links:

* [Automatic summarising: factors and directions (1999)](https://www.cl.cam.ac.uk/archive/ksj21/ksjdigipapers/summbook99.pdf) - newcomers to summarization should start here. It defines the key terms and reviews the different approaches to, and goals of, summarization.
* [Text Summarization in Python: Extractive vs. Abstractive techniques revisited (2017)](https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/) - overview of summarization techniques in Python
* [Lectures on Summarization Techniques (2015?)](https://www.youtube.com/watch?v=N5N-HCUE3G4) from an old Coursera NLP course - describes many summarization techniques with an emphasis on research; most of them have no Python implementation.
* [Centroid-based Text Summarization through Compositionality of Word Embeddings](http://www.aclweb.org/anthology/W17-1003) - interesting article that replaces the Bag of Words representation used in an older article with word embeddings. Has a remarkably [good implementation](https://github.com/gaetangate/text-summarizer) (it worked out of the box, which is uncommon for academic implementations; I just added setup.py to make it pip-installable).

## TextRank

## Notes

* **PyTextRank** - arcane API; doesn't expose a simple function call the way gensim and summa do
* **sumy** - requires building a pipeline (doesn't work directly on raw strings)
* **pyteaser** - Python 2 only
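
For orientation, TextRank scores sentences by running weighted PageRank over a graph whose edges are sentence similarities. The sketch below is a minimal, dependency-free illustration that uses the word-overlap similarity from the original TextRank paper. The helper names (`tokenize`, `overlap_similarity`, `textrank_scores`, `textrank_summary`), the regex tokenizer, the `max_sentences` parameter, and the example sentences are all assumptions made for this example; they do not mirror the gensim or summa APIs, and those libraries may compute similarity differently.

```python
import math
import re


def tokenize(sentence):
    # Naive lowercase word tokenizer (an assumption; real libraries do more).
    return re.findall(r'[a-z0-9]+', sentence.lower())


def overlap_similarity(tokens_a, tokens_b):
    # Shared words divided by log(len(a)) + log(len(b)).
    if not tokens_a or not tokens_b:
        return 0.0
    denominator = math.log(len(tokens_a)) + math.log(len(tokens_b))
    if denominator <= 0:  # both sentences contain a single token
        return 0.0
    return len(set(tokens_a) & set(tokens_b)) / denominator


def textrank_scores(sentences, damping=0.85, iterations=50):
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Symmetric edge weights; no self-loops.
    weights = [[overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update: each neighbour j passes on a share of
        # its score proportional to the weight of the edge j -> i.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def textrank_summary(sentences, max_sentences=2):
    scores = textrank_scores(sentences)
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:max_sentences]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(top)]


# Made-up sentences for illustration only.
example_sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "The mouse hid from the cat.",
    "Stocks fell sharply today.",
]
print(textrank_summary(example_sentences, max_sentences=2))
```

A sentence that shares no words with any other sentence receives no incoming score, so it stays at the baseline value `1 - damping` and is ranked last.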

In [1]:
import numpy as np
import pandas as pd
import requests

import gensim
import summa
import text_summarizer
import nltk

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

import bokeh.plotting
from bokeh.plotting import show
from bokeh.io import output_notebook

output_notebook()

In [2]:
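def summary_as_set(summary):
    """Split a newline-separated summary into a set of sentences."""
    return set(summary.split('\n'))


def jaccard_index(set1, set2):
    """Return |set1 ∩ set2| / |set1 ∪ set2|."""
    return len(set1.intersection(set2)) / len(set1.union(set2))


def summary_jaccard_index(summary1, summary2):
    """Jaccard index of the sentence sets of two summaries."""
    return jaccard_index(summary_as_set(summary1), summary_as_set(summary2))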
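
To make the metric concrete, here is a small worked example of the Jaccard index on two summaries. The two summaries are made up for illustration, and the function is redefined so the snippet runs on its own.

```python
def jaccard_index(set1, set2):
    # Size of the intersection divided by the size of the union.
    return len(set1 & set2) / len(set1 | set2)


# Two made-up summaries, one sentence per line (illustrative only).
summary_a = "A phone rings.\nTrinity ends the call.\nPolice enter the hotel."
summary_b = "Trinity ends the call.\nPolice enter the hotel.\nThe agents arrive."

lines_a = set(summary_a.split('\n'))
lines_b = set(summary_b.split('\n'))

shared = lines_a & lines_b      # 2 sentences appear in both summaries
combined = lines_a | lines_b    # 4 distinct sentences overall
score = jaccard_index(lines_a, lines_b)  # 2 / 4 = 0.5
print(score)
```

Because sentences are compared as exact strings, a score of 1 means both methods picked exactly the same sentences and 0 means they share none; any difference in whitespace or punctuation between two otherwise identical sentences counts as a mismatch.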

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/kuba/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/kuba/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Let's try a couple of summarization methods on the synopsis of The Matrix, which is conveniently available on gensim's page.

In [4]:
text = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text

In [5]:
print(text)

The screen is filled with green, cascading code which gives way to the title, The Matrix.

A phone rings and text appears on the screen: "Call trans opt: received. 2-19-98 13:24:18 REC: Log>" As a conversation takes place between Trinity (Carrie-Anne Moss) and Cypher (Joe Pantoliano), two free humans, a table of random green numbers are being scanned and individual numbers selected, creating a series of digits not unlike an ordinary phone number, as if a code is being deciphered or a call is being traced.

Trinity discusses some unknown person. Cypher taunts Trinity, suggesting she enjoys watching him. Trinity counters that "Morpheus (Laurence Fishburne) says he may be 'the One'," just as the sound of a number being selected alerts Trinity that someone may be tracing their call. She ends the call.

Armed policemen move down a darkened, decrepit hallway in the Heart O' the City Hotel, their flashlight beam bouncing just ahead of them. They come to room 303, kick down the door and 

In [6]:
%%time

summa_summary = summa.summarizer.summarize(text, words=500)

CPU times: user 2.26 s, sys: 36.1 ms, total: 2.3 s
Wall time: 2.17 s


In [7]:
%%time

# Note: gensim.summarization was removed in gensim 4.0, so this cell needs gensim < 4.
gensim_summary = gensim.summarization.summarize(text, word_count=500)

CPU times: user 1.88 s, sys: 39.6 ms, total: 1.92 s
Wall time: 1.75 s


In [8]:
summary_jaccard_index(summa_summary, gensim_summary)

0.3548387096774194

## Centroid-based summarization
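
The idea behind centroid-based summarization is to represent the whole document as a single "centroid" vector and keep the sentences that lie closest to it. The sketch below is a simplified bag-of-words illustration, not the `text_summarizer` implementation: the regex sentence splitter, the smoothed TF-IDF weighting, the use of the summed sentence vectors as the centroid, and the `max_sentences` parameter are all assumptions made for this example. It also omits refinements a full implementation would likely include, such as stop-word removal and redundancy filtering.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (assumption).
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def tokenize(sentence):
    return re.findall(r'[a-z0-9]+', sentence.lower())


def tfidf_vectors(tokenized_sentences):
    # Treat each sentence as a "document" and build sparse TF-IDF vectors.
    n = len(tokenized_sentences)
    doc_freq = Counter(w for tokens in tokenized_sentences for w in set(tokens))
    idf = {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}
    return [
        {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        for tokens in tokenized_sentences
    ]


def cosine(u, v):
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid_summary(text, max_sentences=2):
    sentences = split_sentences(text)
    vectors = tfidf_vectors([tokenize(s) for s in sentences])
    # Centroid: element-wise sum of all sentence vectors.
    centroid = {}
    for vector in vectors:
        for word, weight in vector.items():
            centroid[word] = centroid.get(word, 0.0) + weight
    # Rank sentences by cosine similarity to the centroid.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vectors[i], centroid), reverse=True)
    # Emit the selected sentences in document order, one per line.
    return '\n'.join(sentences[i] for i in sorted(ranked[:max_sentences]))


# Made-up text for illustration only.
example_text = ("The cat sat on the mat. The cat chased the mouse. "
                "Stocks fell sharply today.")
example_summary = centroid_summary(example_text, max_sentences=2)
print(example_summary)
```

The word-embedding variant used later in this notebook follows the same select-by-similarity pattern but composes dense word vectors instead of sparse TF-IDF counts.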

In [9]:
centroid_bow_summarizer = text_summarizer.CentroidBOWSummarizer(preprocess_type='nltk')

In [10]:
centroid_bow_summary = centroid_bow_summarizer.summarize(text, limit=500)

In [11]:
%%time

embedding_model = text_summarizer.centroid_word_embeddings.load_gensim_embedding_model('glove-wiki-gigaword-50')

CPU times: user 20.5 s, sys: 303 ms, total: 20.9 s
Wall time: 21.9 s


In [12]:
centroid_word_embedding_summarizer = text_summarizer.CentroidWordEmbeddingsSummarizer(embedding_model, preprocess_type='nltk')

In [13]:
centroid_word_embedding_summary = centroid_word_embedding_summarizer.summarize(text, limit=500)

In [14]:
summary_jaccard_index(centroid_bow_summary, centroid_word_embedding_summary)

0.15384615384615385

In [15]:
preprocessed_text = centroid_word_embedding_summarizer.preprocess_text(text)

In [16]:
print(text)

In [17]:
# Pass a list: np.stack expects a sequence of arrays, not a generator.
sentence_embeddings = np.stack([
    centroid_word_embedding_summarizer.compose_vectors(sentence.split())
    for sentence in preprocessed_text
])

In [18]:
sentence_embeddings.shape

(461, 50)

In [19]:
reducer = PCA(n_components=2)
scaler = StandardScaler()

embeddings = reducer.fit_transform(scaler.fit_transform(sentence_embeddings))

In [20]:
centroid_words = centroid_word_embedding_summarizer.get_topic_idf(preprocessed_text)

In [21]:
datasource = pd.DataFrame({'x': embeddings[:, 0], 'y': embeddings[:, 1], 'text': preprocessed_text})

In [22]:
# Apply the same scaler and PCA that were fitted on the sentence embeddings.
centroid_embedding = reducer.transform(scaler.transform(centroid_word_embedding_summarizer.compose_vectors(centroid_words).reshape(1, -1)))[0]

In [23]:


from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show

source = ColumnDataSource(datasource)

TITLE = 'Sentence embeddings (PCA projection) with the centroid in red'
p = figure(toolbar_location="above", title=TITLE)

# One blue marker per preprocessed sentence.
p.circle("x", "y", size=12, source=source,
         color='blue', line_color="black", fill_alpha=0.8)

# Label each marker with its sentence text.
labels = LabelSet(x="x", y="y", text="text", y_offset=8, level='glyph',
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')

# Centroid, projected into the same 2-D space.
p.circle(centroid_embedding[0], centroid_embedding[1], color='red', size=15)
p.add_layout(labels)

show(p)

