## Text Summarization

### There are broadly two types of summarization: Extractive and Abstractive

1. Extractive: These approaches select the sentences from the source text that best represent it and arrange them to form a summary.
2. Abstractive: These approaches use natural language generation techniques to summarize a text using novel sentences.
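
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top ones. The function name `extractive_summary`, the regex-based sentence splitting, and the scoring rule are all assumptions chosen for illustration; none of the libraries used later in this notebook is claimed to work this way.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace (illustrative only).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return [sentences[i] for i in keep]

toy_text = "Cats sleep a lot. Cats chase mice and cats purr. Dogs bark."
print(extractive_summary(toy_text, num_sentences=1))
```

Note that every sentence in the result is copied verbatim from the input, which is exactly what distinguishes an extractive summary from an abstractive one.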

In this notebook, let us look at a few examples of existing summarization approaches.
The first one comes from the Python library sumy, which implements several popular summarization approaches from the literature. The second example uses gensim's summarizer implementation. We then move on to summa, and finally we wrap up with extractive summarization using BERT.



## Summarization with Sumy

### Sumy offers several algorithms and methods for summarization such as:


1. Luhn: a heuristic method
2. Latent Semantic Analysis (LSA)
3. LexRank: an unsupervised approach inspired by the PageRank and HITS algorithms
4. TextRank: a graph-based summarization technique that can also extract keywords from a document

There are many more, which you can find in the GitHub repo of [sumy](https://github.com/miso-belica/sumy).
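
The graph-based methods in this list share one idea: treat sentences as nodes, connect them by similarity, and rank them with a PageRank-style iteration. The sketch below illustrates that idea only. The names `jaccard` and `rank_sentences`, the Jaccard word overlap used as the edge weight, and the damping value are assumptions for illustration, not sumy's implementation.

```python
import re

def jaccard(a, b):
    """Word-overlap similarity between two token sets (an illustrative choice)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank-style power iteration over a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    # Edge weights: similarity between every pair of distinct sentences.
    weights = [[jaccard(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight.
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

toy_sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock markets fell sharply today",
]
for sentence, score in zip(toy_sentences, rank_sentences(toy_sentences)):
    print(round(score, 3), sentence)
```

A sentence that shares words with many others collects more score, so the unrelated third sentence ends up ranked lowest in this toy example.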

In [3]:
!pip install sumy #install sumy



In [5]:
import nltk
nltk.download('punkt')  # tokenizer models used by sumy's Tokenizer

[nltk_data] Downloading package punkt to
[nltk_data]     /home/etherealenvy/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [20]:
# Code to summarize a given webpage using four of sumy's summarizers: TextRank, LexRank, Luhn and LSA.
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

num_sentences_in_summary = 2
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer("english"))

# Each summarizer is callable as summarizer(document, sentence_count).
summarizers = [TextRankSummarizer(), LexRankSummarizer(), LuhnSummarizer(), LsaSummarizer()]

for summarizer in summarizers:
    # Label each summary with the summarizer's class name.
    print(type(summarizer).__name__ + ":")
    for sentence in summarizer(parser.document, num_sentences_in_summary):
        print(sentence)
    print("-"*30)

TextRankSummarizer:
For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
------------------------------
LexRankSummarizer:
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training documen

Clearly there are other summarizers and options in sumy. We leave their exploration as an exercise to you!

## Summarization example with Gensim

In [21]:
!pip install "gensim<4.0" # the gensim.summarization module was removed in gensim 4.0, so install a 3.x release



Gensim does not have an HTML parser like sumy. So, let us use the example text from Chapter 5 (nlphistory.txt) to see what its summarized version looks like!


In [42]:
from gensim.summarization import summarize,summarize_corpus
from gensim.summarization.textcleaner import split_sentences
from gensim import corpora

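with open("nlphistory.txt") as f:
    text = f.read()

# summarize() extracts the most relevant sentences in a text.
# When both word_count and ratio are given, word_count takes precedence and ratio is ignored.
print("Summarize:\n", summarize(text, word_count=200, ratio=0.1))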


# summarize_corpus() selects the most important documents in a corpus.
# Here we build a corpus in which each document is a single sentence.
sentences = split_sentences(text)
tokens = [sentence.split() for sentence in sentences]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(sentence_tokens) for sentence_tokens in tokens]

# Extracts the most important documents (shown here in BoW representation)
print("-"*30,"\nSummarize Corpus\n",summarize_corpus(corpus,ratio=0.1))




Summarize:
 This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.
Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data.
In the 2010s, representation learning and deep n

The two parameters **word_count** and **ratio** let us adjust how much text the summarizer outputs:
1. word_count: the maximum number of words we want in the summary
2. ratio: the fraction of sentences in the original text that should be returned as output
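
One way to picture these two knobs is as two different cut-off rules applied to a list of sentences that has already been ranked from most to least important. The sketch below is a simplified illustration under that assumption. `select_sentences` is a hypothetical helper, not gensim's code, and it simply lets the word budget win when both values are supplied.

```python
def select_sentences(ranked, ratio=0.2, word_count=None):
    """Cut a best-first list of sentences by ratio, or by a word budget if given."""
    if word_count is None:
        # Keep a fixed fraction of the sentences.
        return ranked[: int(len(ranked) * ratio)]
    selected, total = [], 0
    for sentence in ranked:
        n_words = len(sentence.split())
        if total + n_words > word_count:
            break  # adding this sentence would exceed the word budget
        selected.append(sentence)
        total += n_words
    return selected

ranked = ["a b c", "d e", "f g h i", "j"]  # toy sentences, best first
print(select_sentences(ranked, ratio=0.5))
print(select_sentences(ranked, ratio=0.5, word_count=4))
```

With a ratio the summary length scales with the document, whereas a word budget gives a predictable output size regardless of how long the input is.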

### Shortcomings of gensim's summarizer
1. gensim's summarizer uses TextRank by default, an algorithm built on PageRank. In gensim it is unfortunately implemented using a Python list of PageRank graph nodes, so it may fail if your graph is too big.
2. It is sensitive to the input's format: in the summary above, you can see leftover citation markers such as [3] that make little sense outside the original text.
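
One way to soften the format sensitivity is to clean the text before summarizing it. The sketch below strips bracketed numeric citation markers of the kind visible in the output above. The helper name `strip_citation_markers` and the regular expression are assumptions for illustration, and other inputs may need different cleaning rules.

```python
import re

# Matches bracketed numeric markers such as [3] or [12].
CITATION = re.compile(r"\[\d+\]")

def strip_citation_markers(text):
    """Remove markers like [3] and collapse any doubled spaces left behind."""
    cleaned = CITATION.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", cleaned)

sample = "language modeling,[6] parsing,[7][8] and many others."
print(strip_citation_markers(sample))
```

The cleaned string can then be passed to the summarizer in place of the raw text.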



## Summa Summarizer
The summa summarizer also uses TextRank, but with optimizations to the similarity function. More information about the optimizations can be found in the following [paper](https://arxiv.org/pdf/1602.03606.pdf).
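
To see what "similarity function" means here, the sketch below implements the sentence similarity from the original TextRank formulation (shared words, normalized by the log of each sentence's length) next to a plain cosine similarity over term counts. The cosine variant is only an easy-to-read stand-in for an alternative and is not claimed to be what summa ships. Both function names and the whitespace tokenization are assumptions for illustration.

```python
import math
from collections import Counter

def textrank_similarity(s1, s2):
    """Shared words, normalized by the log of each sentence's length."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    denominator = math.log(len(w1)) + math.log(len(w2))
    if denominator == 0.0:  # both sentences have exactly one word
        return 0.0
    return len(set(w1) & set(w2)) / denominator

def cosine_similarity(s1, s2):
    """Cosine of the angle between raw term-count vectors."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

a, b = "the cat sat", "the cat ran"
print(textrank_similarity(a, b), cosine_similarity(a, b))
```

Swapping the function that produces the edge weights changes which sentences look central in the graph, and therefore which sentences end up in the summary.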

In [43]:
!pip install summa



In [44]:
from summa import summarizer
from summa import keywords
with open("nlphistory.txt") as f:
    text = f.read()

print("Summary:")
print (summarizer.summarize(text,ratio=0.1))

Summary:
However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.
In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others.


## BERT for Extractive Summarization
Let's see how we can use BERT for extractive summarization.
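
Before running the library, it may help to picture the general recipe. As far as we understand the approach behind the bert-extractive-summarizer package, each sentence is embedded, the embeddings are clustered, and the sentence closest to each cluster centroid is kept. The sketch below mimics only the clustering and selection steps on hand-made 2-D vectors that stand in for BERT embeddings. The function `closest_to_centroids`, the toy vectors, and the simple first-k initialization are all assumptions for illustration.

```python
import math

def closest_to_centroids(vectors, k, iterations=10):
    """Run a tiny k-means and return, in document order, the index of the
    vector nearest to each final centroid."""
    centroids = [list(v) for v in vectors[:k]]  # deterministic init: first k vectors
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    chosen = {
        min(range(len(vectors)), key=lambda i: math.dist(vectors[i], centroids[c]))
        for c in range(k)
    }
    return sorted(chosen)

# Toy "sentence embeddings": two obvious groups of three sentences each.
toy_embeddings = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1),
                  (5.0, 5.0), (5.2, 5.0), (5.1, 5.1)]
print(closest_to_centroids(toy_embeddings, k=2))
```

The returned indices point at one representative sentence per cluster, so the number of clusters controls the length of the summary.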

In [12]:
# !pip install bert-extractive-summarizer
# !pip install spacy==2.1.3
# !pip install transformers==2.2.2
# !pip install neuralcoref
# !pip install torch
# !pip install numpy
!pip install neuralcoref --no-binary neuralcoref
# !python -m spacy download en_core_web_sm



In [13]:
from summarizer import Summarizer

# The pretrained model name goes to the Summarizer constructor;
# the call itself takes the text plus options such as min_length.
model = Summarizer(model='bert-base-uncased')
result = model(text, min_length=60)
full = ''.join(result)
print(full)