<a href="https://colab.research.google.com/github/plaban1981/Huggingface_transformers_course/blob/main/Text_Summarization_using_pipeline_API_and_T5_transformer_model_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## How to Perform Text Summarization using Transformers in Python

https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb


https://www.thepythoncode.com/article/text-summarization-using-huggingface-transformers-python

* Text summarization is the task of shortening long pieces of text into a concise summary that preserves key information content and overall meaning.

* Two approaches are widely used for text summarization, along with mixed strategies that combine them:

    * **Extractive Summarization:** The model identifies the important sentences and phrases in the original text and outputs only those. Extractive strategies are set up as binary classification problems where the goal is to identify the article sentences that belong in the summary.

    * **Abstractive Summarization:** The model generates new text that is shorter than the original, writing new sentences in a new form, much as a human would. Abstractive summaries need to identify the key points and then add a generative element. In this tutorial, we will use transformers for this approach.

    * **Mixed:** Mixed strategies either produce an abstractive summary after identifying an extractive intermediate state, or choose which approach to use (e.g. pointer models) based on the particulars of the text. They need to combine these elements and provide a mechanism to decide when each mode should be used.
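
To make the extractive idea concrete, here is a minimal sketch that selects existing sentences instead of generating new ones. It swaps the binary classifier described above for a simple word-frequency heuristic, and the function name `extractive_summary` and the naive sentence splitter are assumptions made for illustration only.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # real text needs a proper sentence splitter).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, keep the best ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("Bob Barker returned to host the show. "
        "Barker hosted the show for 35 years. "
        "Fans cheered loudly.")
print(extractive_summary(text, num_sentences=2))
```

Every sentence in the output is copied verbatim from the input, which is the defining property of the extractive approach.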

## Here we will use HuggingFace's transformers library in Python to perform abstractive text summarization on any text we want.

## Required installation

In [1]:
pip install transformers 

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully 

* **SentencePiece** is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training.
*  SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
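
The following toy sketch illustrates why SentencePiece output can be detokenized without language-specific rules: whitespace is kept as an ordinary symbol ("▁", U+2581), so joining the pieces recovers the text. This is not the real SentencePiece algorithm, which learns subword pieces from data; `to_pieces` and `from_pieces` are hypothetical helpers that split only at word boundaries.

```python
# Toy illustration only: real SentencePiece learns its pieces from data.
WS = "\u2581"  # the "▁" symbol SentencePiece uses to mark a space

def to_pieces(text):
    """Mark spaces with ▁ and split into word-level pieces (a simplification)."""
    marked = WS + text.replace(" ", WS)
    pieces = []
    for ch in marked:
        if ch == WS or not pieces:
            pieces.append(ch)   # every ▁ starts a new piece
        else:
            pieces[-1] += ch    # other characters extend the current piece
    return pieces

def from_pieces(pieces):
    """Detokenize: concatenate the pieces, turn ▁ back into spaces, drop the leading one."""
    return "".join(pieces).replace(WS, " ")[1:]

pieces = to_pieces("Hello world")
print(pieces)
print(from_pieces(pieces))
```

The same "▁" marker shows up in the token list printed by `tokenizer.convert_ids_to_tokens` later in this notebook.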

In [3]:
pip install SentencePiece 

Collecting SentencePiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

In [24]:
import transformers

print(transformers.__version__)

4.9.2


## The most straightforward way to use models in transformers is using the pipeline API:

In [4]:
from transformers import pipeline
# using pipeline API for summarization task
summarization = pipeline("summarization")
original_text = """
Paul Walker is hardly the first actor to die during a production. 
But Walker's death in November 2013 at the age of 40 after a car crash was especially eerie given his rise to fame in the "Fast and Furious" film franchise. 
The release of "Furious 7" on Friday offers the opportunity for fans to remember -- and possibly grieve again -- the man that so many have praised as one of the nicest guys in Hollywood. 
"He was a person of humility, integrity, and compassion," military veteran Kyle Upham said in an email to CNN. 
Walker secretly paid for the engagement ring Upham shopped for with his bride. 
"We didn't know him personally but this was apparent in the short time we spent with him. 
I know that we will never forget him and he will always be someone very special to us," said Upham. 
The actor was on break from filming "Furious 7" at the time of the fiery accident, which also claimed the life of the car's driver, Roger Rodas. 
Producers said early on that they would not kill off Walker's character, Brian O'Connor, a former cop turned road racer. Instead, the script was rewritten and special effects were used to finish scenes, with Walker's brothers, Cody and Caleb, serving as body doubles. 
There are scenes that will resonate with the audience -- including the ending, in which the filmmakers figured out a touching way to pay tribute to Walker while "retiring" his character. At the premiere Wednesday night in Hollywood, Walker's co-star and close friend Vin Diesel gave a tearful speech before the screening, saying "This movie is more than a movie." "You'll feel it when you see it," Diesel said. "There's something emotional that happens to you, where you walk out of this movie and you appreciate everyone you love because you just never know when the last day is you're gonna see them." There have been multiple tributes to Walker leading up to the release. Diesel revealed in an interview with the "Today" show that he had named his newborn daughter after Walker. 
Social media has also been paying homage to the late actor. A week after Walker's death, about 5,000 people attended an outdoor memorial to him in Los Angeles. Most had never met him. Marcus Coleman told CNN he spent almost $1,000 to truck in a banner from Bakersfield for people to sign at the memorial. "It's like losing a friend or a really close family member ... even though he is an actor and we never really met face to face," Coleman said. "Sitting there, bringing his movies into your house or watching on TV, it's like getting to know somebody. It really, really hurts." Walker's younger brother Cody told People magazine that he was initially nervous about how "Furious 7" would turn out, but he is happy with the film. "It's bittersweet, but I think Paul would be proud," he said. CNN's Paul Vercammen contributed to this report.
"""
summary_text = summarization(original_text)[0]['summary_text']
print("Summary:", summary_text)


Summary:  Paul Walker died in November 2013 after a car crash in Los Angeles . The late actor was one of the nicest guys in Hollywood . The release of "Furious 7" on Friday offers a chance to grieve again . There have been multiple tributes to Walker leading up to the film's release .


In [5]:
summarization(original_text)[0]

{'summary_text': ' Paul Walker died in November 2013 after a car crash in Los Angeles . The late actor was one of the nicest guys in Hollywood . The release of "Furious 7" on Friday offers a chance to grieve again . There have been multiple tributes to Walker leading up to the film\'s release .'}

## Another example

In [6]:
original_text = """
For the first time in eight years, a TV legend returned to doing what he does best. 
Contestants told to "come on down!" on the April 1 edition of "The Price Is Right" encountered not host Drew Carey but another familiar face in charge of the proceedings. 
Instead, there was Bob Barker, who hosted the TV game show for 35 years before stepping down in 2007. 
Looking spry at 91, Barker handled the first price-guessing game of the show, the classic "Lucky Seven," before turning hosting duties over to Carey, who finished up. 
Despite being away from the show for most of the past eight years, Barker didn't seem to miss a beat.
"""
summary_text = summarization(original_text)[0]['summary_text']
print("Summary:", summary_text)

Summary:  Bob Barker returns to "The Price Is Right" for the first time in eight years . The 91-year-old hosted the show for 35 years before stepping down in 2007 . Drew Carey finished up hosting duties on the April 1 edition of the game show . Barker handled the first price-guessing game of the show .


## Using T5 Model

* T5 is a text-to-text model that can perform abstractive summarization: it rewrites sentences where necessary rather than just copying sentences directly from the original text.

* T5 is available in several sizes in this library, including:
    * t5-small, a smaller version of t5-base,
    * t5-base, and
    * t5-large, which is larger and generally more accurate than the other two.

In [7]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# initialize the model architecture and weights
model = T5ForConditionalGeneration.from_pretrained("t5-base")
# initialize the model tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")

In [8]:
article = """
Justin Timberlake and Jessica Biel, welcome to parenthood. 
The celebrity couple announced the arrival of their son, Silas Randall Timberlake, in statements to People. 
"Silas was the middle name of Timberlake's maternal grandfather Bill Bomar, who died in 2012, while Randall is the musician's own middle name, as well as his father's first," People reports. 
The couple announced the pregnancy in January, with an Instagram post. It is the first baby for both.
"""

## Encoding Text

* We use the tokenizer.encode() method to convert the string text to a list of integers, where each integer is the vocabulary ID of a token.
* We set max_length to 512 together with truncation=True, so that inputs longer than 512 tokens are truncated.
* We set return_tensors to "pt" to get PyTorch tensors as output.
* We prepend the text with "summarize: " because T5 isn't just for text summarization; you can use it for other text-to-text tasks, such as machine translation or question answering.

For example, to use T5 for machine translation, you can set the prefix to "translate English to German: " instead of "summarize: " and you'll get German output. Its length is still bounded by the length settings passed to model.generate(), so a long input may come back as a shortened translation.
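
As a rough sketch of the two ideas above, the snippet below builds a prefixed input string and truncates a list of token IDs to a maximum length. The helpers `build_t5_input` and `truncate_with_eos` are hypothetical and only mimic the behavior in spirit; the end-of-sequence ID of 1 is an assumption taken from the encoded tensor shown in this notebook, which ends in 1.

```python
def build_t5_input(task_prefix, text):
    """Prepend the task prefix that tells T5 which text-to-text task to run."""
    return task_prefix + text.strip()

def truncate_with_eos(token_ids, max_length, eos_id=1):
    """Keep at most max_length IDs in total, reserving the last slot for EOS."""
    body = token_ids[: max_length - 1]
    return body + [eos_id]

print(build_t5_input("summarize: ", "  Justin Timberlake and Jessica Biel, welcome to parenthood. "))
print(build_t5_input("translate English to German: ", "It is the first baby for both."))
print(truncate_with_eos(list(range(10, 20)), max_length=5))
```

In the real tokenizer, both steps happen inside `tokenizer.encode(...)` when `max_length` and `truncation=True` are passed.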

In [9]:
# encode the text into tensor of integers using the appropriate tokenizer
inputs = tokenizer.encode("summarize: " + article, return_tensors="pt", max_length=512, truncation=True)

In [10]:
inputs

tensor([[21603,    10, 12446, 25045, 16948,    11, 16908,  2106,    15,    40,
             6,  2222,    12,  4208,  4500,     5,    37, 17086,  1158,  2162,
             8,  6870,    13,    70,   520,     6, 10221,     9,     7, 11377,
          1748, 25045, 16948,     6,    16,  6643,    12,  2449,     5,    96,
           134,   173,     9,     7,    47,     8,  2214,   564,    13, 25045,
         16948,    31,     7, 28574, 18573,  3259,  1491,  1635,     6,   113,
          3977,    16,  1673,     6,   298, 11377,  1748,    19,     8, 16244,
            31,     7,   293,  2214,   564,     6,    38,   168,    38,   112,
          2353,    31,     7,   166,   976,  2449,  2279,     5,    37,  1158,
          2162,     8,  8999,    16,  1762,     6,    28,    46,  4601,   442,
             5,    94,    19,     8,   166,  1871,    21,   321,     5,     1]])

In [14]:
tokenizer.convert_ids_to_tokens(inputs[0])

['▁summarize',
 ':',
 '▁Justin',
 '▁Timber',
 'lake',
 '▁and',
 '▁Jessica',
 '▁Bi',
 'e',
 'l',
 ',',
 '▁welcome',
 '▁to',
 '▁parent',
 'hood',
 '.',
 '▁The',
 '▁celebrity',
 '▁couple',
 '▁announced',
 '▁the',
 '▁arrival',
 '▁of',
 '▁their',
 '▁son',
 ',',
 '▁Sil',
 'a',
 's',
 '▁Rand',
 'all',
 '▁Timber',
 'lake',
 ',',
 '▁in',
 '▁statements',
 '▁to',
 '▁People',
 '.',
 '▁"',
 'S',
 'il',
 'a',
 's',
 '▁was',
 '▁the',
 '▁middle',
 '▁name',
 '▁of',
 '▁Timber',
 'lake',
 "'",
 's',
 '▁maternal',
 '▁grandfather',
 '▁Bill',
 '▁Bo',
 'mar',
 ',',
 '▁who',
 '▁died',
 '▁in',
 '▁2012',
 ',',
 '▁while',
 '▁Rand',
 'all',
 '▁is',
 '▁the',
 '▁musician',
 "'",
 's',
 '▁own',
 '▁middle',
 '▁name',
 ',',
 '▁as',
 '▁well',
 '▁as',
 '▁his',
 '▁father',
 "'",
 's',
 '▁first',
 ',"',
 '▁People',
 '▁reports',
 '.',
 '▁The',
 '▁couple',
 '▁announced',
 '▁the',
 '▁pregnancy',
 '▁in',
 '▁January',
 ',',
 '▁with',
 '▁an',
 '▁Instagram',
 '▁post',
 '.',
 '▁It',
 '▁is',
 '▁the',
 '▁first',
 '▁baby',
 '▁for',


##  Generate the summarized text and print it:

The parameters passed to the model.generate() method are:

* **max_length:** The maximum number of tokens to generate. We set it to 150 here; you can change that if you want.
* **min_length:** The minimum number of tokens to generate. We set it to 40 here. These length limits also apply if you use another task prefix, such as English to German translation.
* **length_penalty:** Exponential penalty on the length, applied when beam search ranks its hypotheses. 1.0 means no penalty; values above 1.0 encourage longer outputs and values below 1.0 encourage shorter ones.

* **num_beams:** The number of beams for **beam search**. With num_beams=1 (the value used in the cell below) the model does **greedy search**, keeping only the single most likely token at each step. With a larger value such as 4, the model keeps the 4 most likely hypotheses at each time step and finally chooses the one with the highest overall probability.
* **top_k, top_p, temperature:** Sampling settings. They only take effect when do_sample=True is passed, which is not the case here.
* **early_stopping:** We set it to True so that beam search stops as soon as num_beams complete hypotheses (ending in the end-of-sequence token, EOS) have been found. Like length_penalty, it only matters when num_beams is greater than 1.
* We then use the **decode()** method of the tokenizer to convert the tensor back to human-readable text.
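
To illustrate the difference between greedy search and beam search, here is a small self-contained sketch over a made-up next-token table. The probabilities, the `NEXT` table, and the `beam_search` function are all hypothetical; this is not how `model.generate()` is implemented, only the same idea at toy scale.

```python
import math

# Hypothetical toy "model": next-token probabilities given the tokens so far.
NEXT = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"cat": 0.5, "dog": 0.5},
    ("the",): {"couple": 0.9, "baby": 0.1},
}
EOS = "<eos>"

def next_probs(prefix):
    # Any prefix not in the table ends the sequence with probability 1.
    return NEXT.get(tuple(prefix), {EOS: 1.0})

def beam_search(num_beams, max_length=5):
    beams = [([], 0.0)]  # (tokens, log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, logp in beams:
            for tok, p in next_probs(tokens).items():
                cand = (tokens + [tok], logp + math.log(p))
                (finished if tok == EOS else candidates).append(cand)
        if not candidates:
            break
        # Keep only the num_beams most likely unfinished hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_tokens, best_logp = max(finished or beams, key=lambda c: c[1])
    return [t for t in best_tokens if t != EOS], math.exp(best_logp)

print(beam_search(num_beams=1))  # greedy search: commits to the best first token
print(beam_search(num_beams=2))  # beam search: also explores the second-best start
```

With one beam, the search commits to "a" because it is the most likely first token, and ends with a less probable sequence overall than the one beam search finds by also keeping "the" alive.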

In [23]:
# generate the summarization output
outputs = model.generate(
    inputs, 
    max_length=150, 
    min_length=40, 
    length_penalty=2.0, 
    num_beams=1, 
    top_p=0.92, 
    top_k=0,
    temperature=0.7,
    early_stopping=True)
# just for debugging
print(outputs)
print(tokenizer.decode(outputs[0],skip_special_tokens=True))

tensor([[    0,     8,  1158,  2162,     8,  8999,    16,  1762,     3,     5,
            34,    19,     8,   166,  1871,    21,   321,     3,     5,     8,
          1158,  2162,     8,  8999,    16,     3,     9,   875,   442,     3,
             5,     8,  1871,    19,     8,  2214,   564,    13, 25045, 16948,
            31,     7, 28574, 18573,     3,     5,     1]])
the couple announced the pregnancy in January. it is the first baby for both. the couple announced the pregnancy in a blog post. the baby is the middle name of Timberlake's maternal grandfather.


## Metrics used for Text Summarization 

* One of the challenges of summarization evaluation is that it requires a set of reference, or gold, summaries.
* These are not naturally available for most topics, which explains why news corpora and scientific journals dominate research.
* Automatic evaluation of summarization and translation tasks is a fascinating but controversial topic.
* For current purposes, it is enough to know that the **Rouge-N** family of metrics has emerged as the standard metric.

#### The Rouge-N metric has the following properties:

* **Rouge-N** measures the overlap of n-grams between the predicted and gold summaries.
* **Rouge recall** normalizes the overlap by the length of the gold summary.
* **Rouge precision** normalizes the overlap by the length of the predicted summary, compensating for the failure of recall to account for concision. For example, a very long predicted summary could score perfect recall despite having many superfluous or misleading words.
* **Rouge-1 F1** (harmonic mean of recall and precision) is the primary evaluation metric.
* **Rouge-L**, a stricter measure based on the longest common subsequence and therefore sensitive to word order, is usually reported alongside Rouge-1.
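
The definitions above can be sketched in a few lines of plain Python. This is a simplified illustration, assuming lowercase whitespace tokenization and no stemming, so its scores can differ from those of the `rouge-score` package; `rouge_n` and `rouge_l` are names chosen here, not library functions.

```python
from collections import Counter

def _prf(overlap, n_pred, n_ref):
    """Precision, recall and F1 (harmonic mean) from an overlap count."""
    precision = overlap / max(n_pred, 1)
    recall = overlap / max(n_ref, 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def rouge_n(prediction, reference, n=1):
    """N-gram overlap between a predicted summary and a gold summary."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    return _prf(overlap, sum(pred.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(prediction, reference):
    """Longest-common-subsequence score, which rewards matching word order."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return _prf(lcs_length(pred, ref), len(pred), len(ref))

print(rouge_n("general kenobi", "general Amanda"))
print(rouge_l("the cat sat", "the sat cat"))
```

In the second call, all three words match as unigrams but only two of them appear in the same order, which is why Rouge-L is the stricter of the two measures.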

In [26]:
!pip install datasets rouge-score nltk




## Loading Datasets

In [30]:
from datasets import load_dataset, load_metric

# load the XSum summarization dataset and the ROUGE metric
raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Using custom data configuration default


Downloading and preparing dataset xsum/default (download: 245.38 MiB, generated: 507.60 MiB, post-processed: Unknown size, total: 752.98 MiB) to /root/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499...


Dataset xsum downloaded and prepared to /root/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499. Subsequent calls will reuse this data.



In [31]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each predictions
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_agregator: Return aggregates if this is set to True
Retu

## Using T5 Transformer model

In [None]:
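# checkpoint name; not referenced by the cells shown below, which reuse the model and tokenizer objects created earlier
model_checkpoint = "t5-small"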

In [42]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general Amanda"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.5, recall=0.5, fmeasure=0.5), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=0.5, recall=0.5, fmeasure=0.5), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}

In [32]:
# inspect the available splits and their columns
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

* To access an actual element, you need to select a split first, then give an index:

In [37]:
# index the row first so only one example is read, not the whole column
raw_datasets['train'][0]['document']

'Recent reports have linked some France-based players with returns to Wales.\n"I\'ve always felt - and this is with my rugby hat on now; this is not region or WRU - I\'d rather spend that money on keeping players in Wales," said Davies.\nThe WRU provides £2m to the fund and £1.3m comes from the regions.\nFormer Wales and British and Irish Lions fly-half Davies became WRU chairman on Tuesday 21 October, succeeding deposed David Pickering following governing body elections.\nHe is now serving a notice period to leave his role as Newport Gwent Dragons chief executive after being voted on to the WRU board in September.\nDavies was among the leading figures among Dragons, Ospreys, Scarlets and Cardiff Blues officials who were embroiled in a protracted dispute with the WRU that ended in a £60m deal in August this year.\nIn the wake of that deal being done, Davies said the £3.3m should be spent on ensuring current Wales-based stars remain there.\nIn recent weeks, Racing Metro flanker Dan Lydi

In [38]:
# encode the first training document, using the same prefix and truncation settings as before
inputs = tokenizer.encode("summarize: " + raw_datasets['train'][0]['document'], return_tensors="pt", max_length=512, truncation=True)

In [39]:
# generate the summarization output
outputs = model.generate(
    inputs, 
    max_length=150, 
    min_length=40, 
    length_penalty=2.0, 
    num_beams=1, 
    top_p=0.92, 
    top_k=0,
    temperature=0.7,
    early_stopping=True)
# just for debugging
print(outputs)
print(tokenizer.decode(outputs[0],skip_special_tokens=True))

tensor([[    0,  1798, 22209,  7021, 13404,   845,     3,    88,   133,  1066,
           453,  1508,    16, 10256,     3,     5,     3,    88,    19,  3122,
             3,     9,  2103,  1059,    12,  1175,   112,  1075,    38, 24260,
           350, 16103, 14580,     7,  5752,  4297,     3,     5,     8,  1798,
             3,   115, 10694,    77,    11,     3, 24804,   107,  3971,    18,
         17114,    65,   118,  5229,    28,     3,     9,  1205,    12,     8,
             3,  1598,     3,     5,     1]])
former rugby union chairman says he would rather keep players in Wales. he is serving a notice period to leave his role as Newport Gwent dragons chief executive. the former britain and irish fly-half has been linked with a return to the uk.
