<a href="https://colab.research.google.com/github/onlyabhilash/NLP-Code/blob/main/part-6/03_TextSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Summarization

### There are broadly two types of summarization: Extractive and Abstractive

    1. Extractive: These approaches select the sentences from the source text that best represent it and arrange them to form a summary.
    2. Abstractive: These approaches use natural language generation techniques to summarize a text using novel sentences.
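
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by how frequent its words are in the whole text and keeps the top-scoring sentences in their original order. It is a toy illustration, not the method used by any of the libraries in this notebook; the function names `split_sentences` and `extractive_summary`, the regex-based sentence splitter, and the lack of stop-word removal are all simplifying assumptions.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order.

    A sentence's score is the average frequency (over the whole text)
    of the words it contains.
    """
    sentences = split_sentences(text)
    words_per_sentence = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(w for words in words_per_sentence for w in words)
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in words_per_sentence
    ]
    # Indices of the best sentences, then restored to document order
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:num_sentences])]
```

Calling `extractive_summary(text, num_sentences=2)` on any plain-text string returns a list of two sentences copied verbatim from the input, which is exactly what distinguishes an extractive summary from an abstractive one.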

In this notebook, we look at a few examples of existing summarization approaches.
The first comes from the Python library sumy, which implements several popular summarization approaches from the literature. The second uses gensim's summarizer implementation. Then we move on to Summa, wrap up extractive summarization with BERT, and finish with an abstractive example using T5.

## Summarization with Sumy

### Sumy offers several algorithms and methods for summarization such as:



    1. Luhn: Heuristic method
    2. Latent Semantic Analysis (LSA)
    3. LexRank: Unsupervised approach inspired by the PageRank and HITS algorithms
    4. TextRank: Graph-based summarization technique that can also extract keywords from the document

There are many more, which you can find in the GitHub repo of [sumy](https://github.com/miso-belica/sumy).
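
The graph-based entries in the list above (LexRank and TextRank) share one core idea: treat sentences as nodes, connect them by similarity, and rank them with a PageRank-style random walk. The sketch below illustrates that idea in pure Python. It is not sumy's implementation; the Jaccard word-overlap measure, the names `jaccard` and `rank_sentences`, and the default damping factor and iteration count are assumptions chosen for illustration.

```python
import re


def jaccard(a, b):
    """Word-overlap similarity between two sentences (a stand-in measure)."""
    wa = set(re.findall(r"[a-z']+", a.lower()))
    wb = set(re.findall(r"[a-z']+", b.lower()))
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over a sentence-similarity graph, by power iteration."""
    n = len(sentences)
    if n == 0:
        return []
    # Edge weights: similarity between every pair of distinct sentences
    weights = [
        [jaccard(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into sentence i from every connected sentence j
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores
```

A summary is then the few sentences with the highest scores, for example `sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:2]`. Sentences that share vocabulary with many others rise to the top, while an unrelated sentence keeps only the small baseline score.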

In [None]:
# Install sumy

!pip install sumy



In [None]:
import nltk
# A virtual environment is highly recommended for NLTK; on Windows it requires a Python version higher than 3.5
nltk.download('punkt')  # tokenizer models used by sumy's Tokenizer

[nltk_data] Downloading package punkt to
[nltk_data]     /home/etherealenvy/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Summarize a web page with four of sumy's summarizers: TextRank, LexRank, Luhn and LSA.
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

num_sentences_in_summary = 2
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer("english"))

# Pair each summarizer with a label so the printed output is easy to read
summarizers = {
    "TextRankSummarizer": TextRankSummarizer(),
    "LexRankSummarizer": LexRankSummarizer(),
    "LuhnSummarizer": LuhnSummarizer(),
    "LsaSummarizer": LsaSummarizer(),
}

for name, summarizer in summarizers.items():
    print(name + ":")
    for sentence in summarizer(parser.document, num_sentences_in_summary):
        print(sentence)
    print("-" * 30)

TextRankSummarizer:
For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
------------------------------
LexRankSummarizer:
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training documen

Clearly there are other summarizers and options in sumy. We leave their exploration as an exercise to you!

## Summarization example with Gensim

In [None]:
# gensim.summarization was removed in gensim 4.0, so pin an older release
!pip install "gensim<4.0"





Gensim does not have an HTML parser like sumy. So, let us use the example text from Chapter 5 (nlphistory.txt) to see what its summarized version looks like!


In [None]:
from gensim.summarization import summarize,summarize_corpus
from gensim.summarization.textcleaner import split_sentences
from gensim import corpora

with open("nlphistory.txt") as f:
    text = f.read()

# summarize() extracts the most relevant sentences in a text.
# Note: when word_count is given, gensim ignores ratio.
print("Summarize:\n", summarize(text, word_count=200, ratio=0.1))

# summarize_corpus() selects the most important documents in a corpus.
# Here we build a corpus in which each sentence is one document.
sentences = split_sentences(text)
tokens = [sentence.split() for sentence in sentences]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(sentence_tokens) for sentence_tokens in tokens]

# Extract the most important documents (shown here in BoW representation)
print("-" * 30, "\nSummarize Corpus\n", summarize_corpus(corpus, ratio=0.1))




Summarize:
 Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.
This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, proba

The two parameters **word_count** and **ratio** let us adjust how much text the summarizer outputs. If both are given, gensim uses word_count and ignores ratio.
1. word_count: maximum number of words we want in the summary
2. ratio: fraction of sentences in the original text that should be returned as output
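
The two length controls above can be pictured as two ways of cutting off a list of sentences that has already been ranked by importance. The sketch below shows one way to apply them; `select_sentences` is a hypothetical helper written for this illustration, and gensim's own selection logic may differ in detail.

```python
def select_sentences(ranked, ratio=0.2, word_count=None):
    """Trim a ranked sentence list to a length budget.

    ranked: sentences sorted from most to least important.
    ratio: fraction of sentences to keep (used only if word_count is None).
    word_count: maximum total number of words to keep.
    """
    if word_count is None:
        return ranked[: int(len(ranked) * ratio)]

    selected, total = [], 0
    for sentence in ranked:
        n_words = len(sentence.split())
        if total + n_words > word_count:
            break  # adding this sentence would exceed the word budget
        selected.append(sentence)
        total += n_words
    return selected
```

With `word_count=None`, a `ratio` of 0.5 keeps the top half of the sentences. As soon as `word_count` is set, the ratio plays no part and whole sentences are added until the word budget would be exceeded.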

### Todo: Explore other options in the gensim summarizer. What are its possible shortcomings (e.g., sensitivity to the input's format)?
Shortcomings:
1. gensim's summarizer uses TextRank by default, an algorithm built on PageRank. In gensim it is unfortunately implemented using a Python list of PageRank graph nodes, so it may fail if your graph is too big.



## Summa Summarizer
The summa summarizer also uses TextRank, but with optimizations to the similarity function. More information about the optimizations can be found in the following [paper](https://arxiv.org/pdf/1602.03606.pdf).
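
For context, the sentence similarity used in the original TextRank formulation is the number of words two sentences share, divided by the sum of the logarithms of their lengths. The sketch below implements that baseline so it is clear what a "similarity function" means here. The function name `textrank_similarity` and the regex tokenizer are assumptions for illustration, and this is not summa's code.

```python
import math
import re


def textrank_similarity(sentence_a, sentence_b):
    """Shared words divided by log(len(a)) + log(len(b))."""
    words_a = re.findall(r"[a-z']+", sentence_a.lower())
    words_b = re.findall(r"[a-z']+", sentence_b.lower())
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator <= 0:
        return 0.0  # both sentences have a single word, so log(1) + log(1) = 0
    shared = len(set(words_a) & set(words_b))
    return shared / denominator
```

Swapping this function for a different measure changes the edge weights of the sentence graph, and therefore the ranking, without touching the rest of the algorithm.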

In [None]:
!pip install summa



In [None]:
from summa import summarizer

with open("nlphistory.txt") as f:
    text = f.read()

print("Summary:")
print(summarizer.summarize(text, ratio=0.1))

Summary:
However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.
In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others.


### BERT for Extractive Summarization
Let's see how we can use BERT for extractive summarization.

In [None]:
#Install the required libraries
!pip install bert-extractive-summarizer
!pip install spacy==2.1.3
!pip install transformers==2.2.2
!pip install neuralcoref
# You can comment out the next line if you already have TensorFlow 2.0 installed
!pip install torch
!pip install neuralcoref --no-binary neuralcoref
!python -m spacy download en_core_web_sm



In [None]:
# Compare BERT extractive summaries with and without coreference resolution.
# The input is the same text as above (nlphistory.txt), already loaded into `text`.

from summarizer import Summarizer
from summarizer.coreference_handler import CoreferenceHandler

model = Summarizer()

print("Without Coreference:")
result = model(text, min_length=200,ratio=0.01)
full = ''.join(result)
print(full)


print("With Coreference:")
handler = CoreferenceHandler(greedyness=.35)

model = Summarizer(sentence_handler=handler)
result = model(text, min_length=200,ratio=0.01)
full = ''.join(result)
print(full)

Without Coreference:
The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
With Coreference:
However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the ex

We are done discussing different extractive summarization techniques and examples. Let's move on to abstractive summarization.

## Abstractive Summarization
There have even been efforts to use reinforcement learning (**RL**) for summarization.<br>
Over the past few years, **RNN**s using encoder-decoder models have become popular for abstractive summarization. <br>
Recently, **Transformers**, which use attention mechanisms, have become popular for abstractive summarization.

As mentioned in Ch7, abstractive summarization is more of a research topic than a practical application.

We will demo simple abstractive text summarization with a pretrained T5 (Text-To-Text Transfer Transformer) model.
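
T5 treats every task as text in, text out, and the task itself is selected by a plain-text prefix at the start of the input, which is why the input in the demo starts with `summarize: `. The sketch below shows that input preparation on its own. `build_t5_input` is a hypothetical helper written for this illustration and is not part of the transformers library.

```python
def build_t5_input(text, task_prefix="summarize: "):
    """Collapse all runs of whitespace to single spaces and prepend the task prefix."""
    return task_prefix + " ".join(text.split())
```

For example, `build_t5_input("First line.\nSecond   line.")` gives `"summarize: First line. Second line."`. Joining on whitespace rather than deleting newline characters avoids gluing two words together when a line break is not preceded by a space.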

In [None]:
!pip install transformers
!pip install torch








In [None]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

device = torch.device('cpu')

text ="""
don’t build your own MT system if you don’t have to. It is more practical to make use of the translation APIs. When we use such APIs, it is important to pay closer attention to pricing policies. It would perhaps make sense to store the translations of frequently used text (called a translation memory or a translation cache). 

If you’re working with a entirely new language, or say a new domain where existing translation APIs do poorly, it would make sense to start with a domain knowledge based rule based translation system addressing the restricted scenario you deal with. Another approach to address such data scarce scenarios is to augment your training data by doing “back translation”. Let us say we want to translate from English to Navajo language. English is a popular language for MT, but Navajo is not. We do have a few examples of English-Navajo translation. In such a case, one can build a first MT model between Navajo-English, and use this system to translate a few Navajo sentences into English. At this point, these machine translated Navajo-English pairs can be added as additional training data to English-Navajo MT system. This results in a translation system with more examples to train on (even though some of these examples are synthetic). In general, though, if accuracy of translation is paramount, it would perhaps make sense to form a hybrid MT system which combines the neural models with rules and some form of post-processing, though. 

"""


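# Collapse newlines and repeated spaces so the model sees one clean paragraph
preprocess_text = " ".join(text.split())
# T5 selects the task from a text prefix
t5_prepared_Text = "summarize: " + preprocess_text
print("original text preprocessed: \n", preprocess_text)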

tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)


# Generate the summary with beam search
summary_ids = model.generate(tokenized_text,
                             num_beams=4,
                             no_repeat_ngram_size=2,
                             min_length=30,
                             max_length=100,
                             early_stopping=True)
# More parameters are documented at https://huggingface.co/transformers/model_doc/t5.html

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print ("\n\nSummarized text: \n",output)

original text preprocessed: 
 don’t build your own MT system if you don’t have to. It is more practical to make use of the translation APIs. When we use such APIs, it is important to pay closer attention to pricing policies. It would perhaps make sense to store the translations of frequently used text (called a translation memory or a translation cache). If you’re working with a entirely new language, or say a new domain where existing translation APIs do poorly, it would make sense to start with a domain knowledge based rule based translation system addressing the restricted scenario you deal with. Another approach to address such data scarce scenarios is to augment your training data by doing “back translation”. Let us say we want to translate from English to Navajo language. English is a popular language for MT, but Navajo is not. We do have a few examples of English-Navajo translation. In such a case, one can build a first MT model between Navajo-English, and use this system to tra