## Text Summarization

### There are broadly two types of summarization: extractive and abstractive

1. Extractive: These approaches select the sentences from the corpus that best represent it and arrange them to form a summary.
2. Abstractive: These approaches use natural language generation techniques to summarize a text using novel sentences.
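
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring ones in document order. The function name `extractive_summary`, the regex-based sentence splitter, and the frequency score are illustrative assumptions; none of the libraries used later in this notebook is claimed to work this way.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences and return them in document order."""
    # Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    word_freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(word_freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore the original order
    return " ".join(sentences[i] for i in chosen)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive summary; an abstractive system would instead generate new sentences.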

In this notebook, let us look at a few examples of existing summarization approaches.
The first comes from the Python library sumy, which implements several popular summarization approaches from the literature. The second uses gensim's summarizer implementation. We then move on to Summa and wrap up extractive summarization with BERT.

## Summarization with Sumy

### Sumy offers several algorithms and methods for summarization, such as:

1. Luhn: a heuristic method
2. Latent Semantic Analysis (LSA)
3. LexRank: an unsupervised approach inspired by the PageRank and HITS algorithms
4. TextRank: a graph-based summarization technique that also extracts keywords from the document

There are many more, which you can find in the GitHub repo of [sumy](https://github.com/miso-belica/sumy).
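
Since TextRank shows up repeatedly in this notebook, a rough sketch of the idea may help: treat sentences as graph nodes, weight the edges by a sentence similarity, and run a PageRank-style iteration over the graph. The code below is a simplified illustration and not sumy's implementation. The helper names (`tokenize`, `similarity`, `textrank`), the damping factor of 0.85, and the fixed iteration count are assumptions; the similarity is the word-overlap measure from the original TextRank paper, simplified here to sets of unique lowercase words.

```python
import math
import re

def tokenize(sentence):
    """Unique lowercase words of a sentence (a simplification)."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def similarity(a, b):
    """Word overlap normalised by the log of the sentence lengths."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # skip very short sentences; also avoids a zero denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank(sentences, damping=0.85, iterations=50):
    """Return one importance score per sentence."""
    n = len(sentences)
    if n == 0:
        return []
    # Weighted adjacency matrix of the sentence graph (no self-loops).
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion to edge weight.
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores
```

A summary is then formed by taking the sentences with the highest scores. A sentence that shares no words with the rest of the document only receives the baseline `(1 - damping) / n`.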

In [24]:
# Install sumy

!pip install sumy



DEPRECATION: textract 1.6.5 has a non-standard dependency specifier extract-msg<=0.29.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of textract or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063


In [25]:
import nltk
# Virtual environments are highly recommended for NLTK; on Windows it requires a Python version higher than 3.5.

In [26]:
# Summarize a given webpage with four of sumy's summarizers: TextRank, LexRank, Luhn and LSA.
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

num_sentences_in_summary = 2
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer("english"))

# Labels printed above each summary, in the same order as the summarizers below.
summarizer_names = ["TextRankSummarizer:", "LexRankSummarizer:", "LuhnSummarizer:", "LsaSummarizer:"]
summarizers = [TextRankSummarizer(), LexRankSummarizer(), LuhnSummarizer(), LsaSummarizer()]

for name, summarizer in zip(summarizer_names, summarizers):
    print(name)
    # Each summarizer returns the requested number of sentences from the parsed document.
    for sentence in summarizer(parser.document, num_sentences_in_summary):
        print(sentence)
    print("-" * 30)

TextRankSummarizer:
For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
A Class of Submodular Functions for Document Summarization", The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization, In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014.^ Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan and Jeff Bilmes, Summarizing Multi-Document Topic Hierarchies using Submodular Mixtures, To Appear In the Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July - 2015

Clearly there are other summarizers and options in sumy. We leave their exploration as an exercise to you!

## Summarization example with Gensim

In [27]:
!pip install gensim





Gensim does not have an HTML parser like sumy's. So, let us use the example text from Chapter 5 (nlphistory.txt) to see what its summarized version looks like!


In [28]:
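# NOTE: the gensim.summarization module was removed in gensim 4.0, so the code below
# is kept commented out; it runs only with gensim 3.x.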
# from gensim.summarization import summarize,summarize_corpus
# from gensim.summarization.textcleaner import split_sentences
# from gensim import corpora

# text = open("../data/nlphistory.txt").read()

# #summarize method extracts the most relevant sentences in a text
# print("Summarize:\n",summarize(text, word_count=200, ratio = 0.1))


# #the summarize_corpus selects the most important documents in a corpus:
# sentences = split_sentences(text)# Creates a corpus where each document is a sentence.
# tokens = [sentence.split() for sentence in sentences]
# dictionary = corpora.Dictionary(tokens)
# corpus = [dictionary.doc2bow(sentence_tokens) for sentence_tokens in tokens]

# # Extracts the most important documents (shown here in BoW representation)
# print("-"*30,"\nSummarize Corpus\n",summarize_corpus(corpus,ratio=0.1))




The two parameters **word_count** and **ratio** let us adjust how much text the summarizer outputs:
1. word_count: the maximum number of words we want in the summary
2. ratio: the fraction of sentences in the original text that should be returned as output
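
To illustrate how two such parameters can interact, here is a small sketch that trims an already ranked list of sentences either by `ratio` or by a word budget. It is a hypothetical helper (`select_sentences` is not a gensim function) and does not reproduce gensim's internals. The one behaviour it borrows is from the gensim 3.x documentation: when both parameters are given, `word_count` takes precedence and `ratio` is ignored.

```python
def select_sentences(ranked_sentences, ratio=0.2, word_count=None):
    """Trim a list of sentences sorted from most to least important.

    Hypothetical helper for illustration only.
    """
    if word_count is None:
        # Keep a fixed fraction of the sentences.
        keep = int(len(ranked_sentences) * ratio)
        return ranked_sentences[:keep]
    # A word budget was given, so ratio is ignored:
    # add sentences in rank order while the budget allows.
    selected, used = [], 0
    for sentence in ranked_sentences:
        n_words = len(sentence.split())
        if used + n_words > word_count:
            break
        selected.append(sentence)
        used += n_words
    return selected
```

With four ranked sentences, `ratio=0.5` keeps the top two regardless of their length, whereas `word_count` keeps as many top-ranked sentences as fit within the budget.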

### Todo: Explore other options in the gensim summarizer. What are its possible shortcomings (e.g., sensitivity to the input's format)?
Shortcomings:
1. gensim's summarizer uses TextRank by default, an algorithm built on PageRank. In gensim it is unfortunately implemented using a Python list of PageRank graph nodes, so it may fail if your graph is too big.
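
A sentence graph over n sentences can have up to n(n-1)/2 edges, so its size grows quadratically with the length of the input. One possible workaround, offered here as an assumption rather than anything gensim provides, is to split a long text into chunks of bounded size and summarize each chunk separately. In the sketch below, `chunk_sentences` and `summarize_in_chunks` are hypothetical helpers, `summarize_fn` stands for any function that maps a string to its summary, and the default of 200 sentences per chunk is arbitrary.

```python
import re

def chunk_sentences(text, max_sentences=200):
    """Yield consecutive chunks of at most max_sentences sentences."""
    # Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for start in range(0, len(sentences), max_sentences):
        yield " ".join(sentences[start:start + max_sentences])

def summarize_in_chunks(text, summarize_fn, max_sentences=200):
    """Summarize each chunk independently and join the partial summaries."""
    return "\n".join(summarize_fn(chunk) for chunk in chunk_sentences(text, max_sentences))
```

The trade-off is that sentences are only compared within a chunk, so a sentence that is central to the whole document but not to its own chunk may be missed.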



## Summa Summarizer
The summa summarizer also uses TextRank, but with optimizations to the similarity function. More information about the optimizations can be found in the following [paper](https://arxiv.org/pdf/1602.03606.pdf).
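
To show what "changing the similarity function" means in practice, the sketch below implements one simple alternative to plain word overlap: cosine similarity over term counts. It is an illustration only. Which function summa actually uses should be checked in its source code, and the name `cosine_similarity` and the regex tokenizer are assumptions made for this example.

```python
import math
import re
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between the term-count vectors of two sentences."""
    ca = Counter(re.findall(r"[a-z]+", a.lower()))
    cb = Counter(re.findall(r"[a-z]+", b.lower()))
    # Only words present in both sentences contribute to the dot product.
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0
```

Any function of this shape (two sentences in, one non-negative number out) can supply the edge weights of a TextRank-style sentence graph, which is why the similarity function is a natural place to experiment.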

In [29]:
!pip install summa





In [31]:
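from summa import summarizer
from summa import keywords

# Read the sample text and close the file afterwards.
with open("../data/nlphistory.txt") as f:
    text = f.read()

print("Summary:")
# ratio=0.1 asks for roughly 10% of the sentences in the original text.
print(summarizer.summarize(text, ratio=0.1))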

Summary:
However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.
In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others.


## BERT for Extractive Summarization
Let's see how we can use BERT for extractive summarization.

In [32]:
#Install the required libraries
!pip install bert-extractive-summarizer
!pip install spacy==2.1.3
!pip install transformers==2.2.2
!pip install neuralcoref
!pip install torch # you can comment this line out if you already have TensorFlow 2.0 installed
!pip install neuralcoref --no-binary neuralcoref
!python -m spacy download en_core_web_sm

Collecting typing-extensions>=3.7.4.3 (from huggingface-hub<1.0,>=0.14.1->transformers->bert-extractive-summarizer)
  Obtaining dependency information for typing-extensions>=3.7.4.3 from https://files.pythonhosted.org/packages/ec/6b/63cc3df74987c36fe26157ee12e09e8f9db4de771e0f3404263117e75b95/typing_extensions-4.7.1-py3-none-any.whl.metadata
  Using cached typing_extensions-4.7.1-py3-none-any.whl.metadata (3.1 kB)
Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Installing collected packages: typing-extensions
Successfully installed typing-extensions-4.7.1


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyterlab 4.0.4 requires tomli; python_version < "3.11", which is not installed.
tensorflow-intel 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.7.1 which is incompatible.


Collecting spacy==2.1.3
  Using cached spacy-2.1.3.tar.gz (27.7 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'


  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [337 lines of output]
      Collecting setuptools
        Obtaining dependency information for setuptools from https://files.pythonhosted.org/packages/c7/42/be1c7bbdd83e1bfb160c94b9cafd8e25efc7400346cf7ccdbdb452c467fa/setuptools-68.0.0-py3-none-any.whl.metadata
        Using cached setuptools-68.0.0-py3-none-any.whl.metadata (6.4 kB)
      Collecting wheel>0.32.0.<0.33.0
        Obtaining dependency information for wheel>0.32.0.<0.33.0 from https://files.pythonhosted.org/packages/28/f5/6955d7b3a5d71ce6bac104f9cf98c1b0513ad656cdaca8ea7d579196f771/wheel-0.41.1-py3-none-any.whl.metadata
        Using cached wheel-0.41.1-py3-none-any.whl.metadata (2.2 kB)
      Collecting Cython
        Obtaining dependency information for Cython from https://files.pythonhosted.org/packages/6d/0b/889b9b839ea7237eb6048191fe653c17ce93e298495eaf8f893cff748951/Cython-3.0.0-

Collecting transformers==2.2.2
  Using cached transformers-2.2.2-py3-none-any.whl (387 kB)
Installing collected packages: transformers
Successfully installed transformers-2.2.2



Collecting neuralcoref
  Using cached neuralcoref-4.0.tar.gz (368 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting typing-extensions>=4.6.1 (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy>=2.1.0->neuralcoref)
  Obtaining dependency information for typing-extensions>=4.6.1 from https://files.pythonhosted.org/packages/ec/6b/63cc3df74987c36fe26157ee12e09e8f9db4de771e0f3404263117e75b95/typing_extensions-4.7.1-py3-none-any.whl.metadata
  Using cached typing_extensions-4.7.1-py3-none-any.whl.metadata (3.1 kB)
Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Building wheels for collected packages: neuralcoref
  Building wheel for neuralcoref (setup.py): started
  Building wheel for neuralcoref (setup.py): finished with status 'error'
  Running setup.py clean for neuralcoref
Failed to build neuralcoref


  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [116 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-310
      creating build\lib.win-amd64-cpython-310\neuralcoref
      copying neuralcoref\file_utils.py -> build\lib.win-amd64-cpython-310\neuralcoref
      copying neuralcoref\__init__.py -> build\lib.win-amd64-cpython-310\neuralcoref
      creating build\lib.win-amd64-cpython-310\neuralcoref\tests
      copying neuralcoref\tests\test_neuralcoref.py -> build\lib.win-amd64-cpython-310\neuralcoref\tests
      copying neuralcoref\tests\__init__.py -> build\lib.win-amd64-cpython-310\neuralcoref\tests
      creating build\lib.win-amd64-cpython-310\neuralcoref\train
      copying neuralcoref\train\algorithm.py -> build\lib.win-amd64-cpython-310\neuralcoref\train
      copying neuralcoref\train\compat.py -> build\lib.win


In [33]:
# Summarize the same input as above (nlphistory.txt) with BERT, first without coreference resolution.
# A better input might show the difference between the two modes more clearly.

from summarizer import Summarizer

model = Summarizer()

print("Without Coreference:")
result = model(text, min_length=200, ratio=0.01)
full = ''.join(result)
print(full)


# The coreference variant needs neuralcoref, which failed to build above, so it stays commented out.
# Importing CoreferenceHandler from summarizer.coreference_handler raised
# "ModuleNotFoundError: No module named 'summarizer.coreference_handler'" in this environment.
# from summarizer.coreference_handler import CoreferenceHandler
# print("With Coreference:")
# handler = CoreferenceHandler(greedyness=.35)

# model = Summarizer(sentence_handler=handler)
# result = model(text, min_length=200, ratio=0.01)
# full = ''.join(result)
# print(full)

We are done discussing different extractive summarization techniques and examples. Let's move on to abstractive summarization.

## Abstractive Summarization
There have even been efforts to use reinforcement learning (**RL**) for summarization.<br>
Over the past few years, **RNN**s using encoder-decoder models have become popular for abstractive summarization.<br>
More recently, **Transformers**, which use the attention mechanism, have become popular for abstractive summarization.
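
For readers who have not met attention before, here is a minimal sketch of scaled dot-product attention, the core operation in Transformers: each query is compared with every key, the scaled scores are turned into weights with a softmax, and the output is the weighted average of the values. It uses plain Python lists for readability. The function names `softmax` and `attention` are chosen for this example; real models use batched tensor operations, multiple heads, and learned projections.

```python
import math

def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs
```

In a summarization model, this lets each generated word draw on the most relevant parts of the input text rather than on a single fixed-size encoding of it.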

As mentioned in Ch7, abstractive summarization is more of a research topic than a practical application.

We will demo simple abstractive text summarization with the pretrained T5 (Text-To-Text Transfer Transformer) model.

In [5]:
!pip install transformers
!pip install torch








In [17]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cpu')

# Load the pretrained t5-small checkpoint and its tokenizer, and place the model on the chosen device.
model = T5ForConditionalGeneration.from_pretrained('t5-small').to(device)
tokenizer = T5Tokenizer.from_pretrained('t5-small')

text ="""
don’t build your own MT system if you don’t have to. It is more practical to make use of the translation APIs. When we use such APIs, it is important to pay closer attention to pricing policies. It would perhaps make sense to store the translations of frequently used text (called a translation memory or a translation cache). 

If you’re working with a entirely new language, or say a new domain where existing translation APIs do poorly, it would make sense to start with a domain knowledge based rule based translation system addressing the restricted scenario you deal with. Another approach to address such data scarce scenarios is to augment your training data by doing “back translation”. Let us say we want to translate from English to Navajo language. English is a popular language for MT, but Navajo is not. We do have a few examples of English-Navajo translation. In such a case, one can build a first MT model between Navajo-English, and use this system to translate a few Navajo sentences into English. At this point, these machine translated Navajo-English pairs can be added as additional training data to English-Navajo MT system. This results in a translation system with more examples to train on (even though some of these examples are synthetic). In general, though, if accuracy of translation is paramount, it would perhaps make sense to form a hybrid MT system which combines the neural models with rules and some form of post-processing, though. 

"""


# Strip surrounding whitespace, drop newlines, and add the "summarize: " task prefix for T5.
preprocess_text = text.strip().replace("\n", "")
t5_prepared_text = "summarize: " + preprocess_text
print("original text preprocessed: \n", preprocess_text)

# Encode the prompt as a tensor of token ids on the same device as the model.
tokenized_text = tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)


# Summarize using beam search
summary_ids = model.generate(tokenized_text,
                             num_beams=4,
                             no_repeat_ngram_size=2,
                             min_length=30,
                             max_length=100,
                             early_stopping=True)
# There are more parameters, which can be found at https://huggingface.co/transformers/model_doc/t5.html

# Convert the generated token ids back into text.
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n\nSummarized text: \n", output)

original text preprocessed: 
 don’t build your own MT system if you don’t have to. It is more practical to make use of the translation APIs. When we use such APIs, it is important to pay closer attention to pricing policies. It would perhaps make sense to store the translations of frequently used text (called a translation memory or a translation cache). If you’re working with a entirely new language, or say a new domain where existing translation APIs do poorly, it would make sense to start with a domain knowledge based rule based translation system addressing the restricted scenario you deal with. Another approach to address such data scarce scenarios is to augment your training data by doing “back translation”. Let us say we want to translate from English to Navajo language. English is a popular language for MT, but Navajo is not. We do have a few examples of English-Navajo translation. In such a case, one can build a first MT model between Navajo-English, and use this system to tra