<a href="https://colab.research.google.com/github/loudly-soft/experiments/blob/main/Text_Summarisation_HuggingFace_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T5 Text Summarisation with HuggingFace

This notebook compares several HuggingFace T5 models for text summarisation. Background on the task:

* https://huggingface.co/transformers/task_summary.html


#### The models are:

- Unfinetuned baseline (`t5-base`)

- Fine-tuned on a news dataset
  * https://huggingface.co/mrm8488/t5-base-finetuned-summarize-news

- Podcast summariser
  * https://huggingface.co/paulowoicho/t5-podcast-summarisation

- One-line summariser
  * https://huggingface.co/snrspeaks/t5-one-line-summary

- Extreme summariser (multilingual mT5)
  * https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum

#### Disable the horizontal scroll bar so that long output is wrapped


In [None]:
from IPython.display import HTML, display

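def set_css():
  """Inject CSS that wraps long lines in cell output."""
  display(HTML('''<style>pre { white-space: pre-wrap; }</style>'''))

# Re-apply the style before each cell is run
get_ipython().events.register('pre_run_cell', set_css)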

## Load T5 models

I don't know which fine-tuned model is the most reliable, so I picked this one as a starting point:
* https://huggingface.co/mrm8488/t5-base-finetuned-summarize-news

In [None]:
!pip install transformers[sentencepiece]
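
The `from_pretrained` calls used later in this notebook load the tokenizer and model again on every call. A minimal sketch of loading each pair once and reusing it is given here. `load_t5` is a hypothetical helper name, not part of the notebook or of `transformers`, and caching with `functools.lru_cache` is only one possible choice.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_t5(model_id):
    """Return a (tokenizer, model) pair for a Hub model id, cached per id.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining the helper does not need transformers yet
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_t5(model_id)` a second time with the same id returns the cached pair instead of loading the weights again.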


## Summarise news article from web using different T5 models

In [None]:
!pip install newspaper3k
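
The comparison cell that follows dispatches on `model_id` with five near-identical branches, which differ mainly in the arguments passed to `model.generate`. One possible way to keep those differences in one place is a lookup table. This is a minimal sketch rather than the notebook's own code: the names `GENERATION_KWARGS` and `generation_kwargs` are hypothetical, and the values are copied from the branches of `summarize`.

```python
# Hypothetical table of per-model generation settings,
# copied from the branches of summarize() in this notebook.
GENERATION_KWARGS = {
    't5-base': dict(
        num_beams=4, length_penalty=2.0, early_stopping=True),
    'mrm8488/t5-base-finetuned-summarize-news': dict(
        num_beams=2, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True),
    'paulowoicho/t5-podcast-summarisation': dict(
        num_beams=5, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True),
    'snrspeaks/t5-one-line-summary': dict(
        num_beams=5, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True),
    'csebuetnlp/mT5_multilingual_XLSum': dict(
        num_beams=4, no_repeat_ngram_size=2),
}


def generation_kwargs(model_id, max_length=512):
    """Return the keyword arguments to pass to model.generate for model_id."""
    try:
        settings = GENERATION_KWARGS[model_id]
    except KeyError:
        # Fail loudly for an unknown model id
        raise ValueError(f'No generation settings for {model_id!r}') from None
    return {'max_length': max_length, **settings}
```

With such a table, a single code path could call `model.generate(inputs, **generation_kwargs(model_id))`. An unknown `model_id` then raises an error, whereas `summarize` as written returns `None` when no branch matches.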

In [None]:
from newspaper import Article
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


# Unfinetuned T5
T5_BASE = 't5-base'

# Finetuned on news
# https://huggingface.co/mrm8488/t5-base-finetuned-summarize-news
T5_MRM8488 = 'mrm8488/t5-base-finetuned-summarize-news'

# Podcast summariser
# https://huggingface.co/paulowoicho/t5-podcast-summarisation
T5_PAULOWOICHO = 'paulowoicho/t5-podcast-summarisation'

# One-line summariser
# https://huggingface.co/snrspeaks/t5-one-line-summary
T5_SNRSPEAKS = 'snrspeaks/t5-one-line-summary'

# Extreme summariser
# https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum
T5_CSEBUETNLP = 'csebuetnlp/mT5_multilingual_XLSum'


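def summarize(model_id, text, max_length=512):
  """Summarise `text` with the model `model_id`.

  `max_length` caps both the tokenized input (by truncation) and the generated summary.
  """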

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

  if model_id == T5_BASE:
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=max_length, truncation=True, add_special_tokens=True)
    outputs = model.generate(inputs, max_length=max_length, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

  if model_id == T5_MRM8488:
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=max_length, truncation=True, add_special_tokens=True)
    outputs = model.generate(inputs, num_beams=2, max_length=max_length, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

  if model_id == T5_SNRSPEAKS:
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=max_length, truncation=True, add_special_tokens=True)
    outputs = model.generate(input_ids=inputs, num_beams=5, max_length=max_length, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

  if model_id == T5_PAULOWOICHO:
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=max_length, truncation=True, add_special_tokens=True)
    outputs = model.generate(input_ids=inputs, num_beams=5, max_length=max_length, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

  if model_id == T5_CSEBUETNLP:
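    import re
    # Collapse newlines and runs of whitespace into single spaces (raw strings avoid invalid-escape warnings)
    process_whitespace = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))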
    inputs = tokenizer.encode("summarize: " + process_whitespace(text), return_tensors="pt", padding="max_length", max_length=max_length, truncation=True)
    outputs = model.generate(input_ids=inputs, num_beams=4, max_length=max_length, no_repeat_ngram_size=2)
    return tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)


# news articles
urls = [
  'https://www.reuters.com/business/aerospace-defense/iata-sees-sharp-fall-airline-losses-2022-2021-10-04/',
  'https://www.fox13now.com/news/local-news/search-for-missing-utah-man-in-yellowstone-moves-from-rescue-to-recovery',
  'https://www.sciencenews.org/article/rice-agriculture-feeds-world-climate-change-drought-flood-risk',
  'https://www.sciencenews.org/article/dna-genetics-how-polynesia-settled-migration-islands-pacific-ocean',
  'https://www.sciencenews.org/article/satellite-mega-constellations-night-sky-stars-simulations',
  'https://www.sciencenews.org/article/planet-habitable-new-type-hycean-search-extraterrestrial-life-aliens',
  'https://www.sciencenews.org/article/black-holes-mass-measure-new-technique-accretion-disk',
  'https://www.sciencenews.org/article/moon-lunar-magnetic-field-short-time-space',
  'https://www.sciencenews.org/article/covid-colds-common-respiratory-diseases-kids-return-school',
  'https://www.sciencenews.org/article/covid-coronavirus-who-gets-booster-shots-vaccines-pfizer-fda'
]

# for each news article
for url in urls:
  # download and extract text from article
  article = Article(url)
  article.download()
  article.parse()

  # print news title
  print()
  header = article.title.upper() + f'  ({len(article.text)} chars)'
  print(header)
  print('=' * len(header))  # underline spans the full printed header
  print()

  # compare summaries from different models
  for model_id in [T5_BASE, T5_MRM8488, T5_PAULOWOICHO, T5_SNRSPEAKS, T5_CSEBUETNLP]:
    print(f'{model_id}:\n')
    print(summarize(model_id, article.text))
    print('\n')



AIRLINES SEE SHARPLY LOWER LOSSES IN 2022, RECOVERY IN SIGHT  (2354 chars)

t5-base:

global airlines on Monday projected a sharp reduction in industry losses next year. but revised up the financial toll inflicted by the coronavirus pandemic in 2020 and 2021. IATA urged governments to keep wage support measures and slot wavers in place.


mrm8488/t5-base-finetuned-summarize-news:

International Air Transport Association (IATA) predicted net losses at airlines would narrow to $11.6 billion in 2022 from $51.8 billion this year. However, IATA revised up losses for 2020 to $137.7 billion from $126.4 billion estimated earlier. Domestic travel demand is estimated to reach 93% of the pre-pandemic level in 2022. However, passenger numbers are expected to increase to 3.4 billion next year from 2.3 billion in 2021. Demand for air cargo is forecast to rise 13.2% above the 2019 levels, IATA said.


paulowoicho/t5-podcast-summarisation:

the International Air Transport Association (IATA) forecasts