<a href="https://colab.research.google.com/github/loudly-soft/nlp-experiments/blob/main/Text_Summarisation_HuggingFace_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T5 Text Summarisation with HuggingFace

An attempt to use HuggingFace T5 for text summarisation:

* https://huggingface.co/transformers/task_summary.html

**Synopsis**

1. Load T5 models
2. Grab a news article from the web and summarise it with T5



#### Disable horizontal scroll bar so long output is wrapped


In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''<style>pre { white-space: pre-wrap; }</style>'''))
get_ipython().events.register('pre_run_cell', set_css)

## Load T5 models

I don't know which fine-tuned model is the most reliable, so I just picked this one:
* https://huggingface.co/mrm8488/t5-base-finetuned-summarize-news

In [None]:
!pip install transformers[sentencepiece]

Collecting transformers[sentencepiece]
  Downloading transformers-4.11.2-py3-none-any.whl (2.9 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


T5_BASE = 't5-base'
T5_FINETUNED = 'mrm8488/t5-base-finetuned-summarize-news'

def get_tokenizer_and_model(model_id):
  """Factory for model and tokenizer"""

  # both checkpoints load through the same Auto* classes, so no branching is needed
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

  return tokenizer, model


## Summarise a news article from the web using different T5 models

In [None]:
!pip install newspaper3k

Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl (211 kB)

In [None]:
from newspaper import Article


#url = 'https://www.fox13now.com/news/local-news/search-for-missing-utah-man-in-yellowstone-moves-from-rescue-to-recovery'
#url = 'https://www.sciencenews.org/article/rice-agriculture-feeds-world-climate-change-drought-flood-risk'
#url = 'https://www.sciencenews.org/article/dna-genetics-how-polynesia-settled-migration-islands-pacific-ocean'
#url = 'https://www.sciencenews.org/article/satellite-mega-constellations-night-sky-stars-simulations'
#url = 'https://www.sciencenews.org/article/planet-habitable-new-type-hycean-search-extraterrestrial-life-aliens'
url = 'https://www.sciencenews.org/article/black-holes-mass-measure-new-technique-accretion-disk'
#url = 'https://www.sciencenews.org/article/moon-lunar-magnetic-field-short-time-space'
#url = 'https://www.sciencenews.org/article/covid-colds-common-respiratory-diseases-kids-return-school'
#url = 'https://www.sciencenews.org/article/covid-coronavirus-who-gets-booster-shots-vaccines-pfizer-fda'

# download and extract text from article
article = Article(url)
article.download()
article.parse()


def summarize(text, tokenizer, model):
  """Generate summary"""

  # T5 was trained on inputs of up to 512 tokens, so we truncate the prefixed article to 512 tokens
  inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True, return_length=True)
  outputs = model.generate(inputs["input_ids"], max_length=300, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
  return tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True), inputs.length


for model_id in [T5_BASE, T5_FINETUNED]:
  summary, processed_tokens = summarize(article.text, *get_tokenizer_and_model(model_id))

  print('model: %s' % model_id.upper())
  print('processed tokens: %d' % processed_tokens)
  print('input chars: %d' % len(article.text))
  print('summary chars: %d\n' % len(summary))
  print()

  # print title and summary
  print(article.title)
  print('-' * len(article.title))
  print(summary)
  print('\n')


model: T5-BASE
processed tokens: 512
input chars: 4445
summary chars: 246


Measuring a black hole’s mass isn’t easy. A new technique could change that
---------------------------------------------------------------------------
“It’s a new way to weigh black holes,” says astronomer Colin Burke of the University of Illinois at Urbana-Champaign. the method could be used on any astrophysical object with an accretion disk, and may even help find elusive midsize black holes.


model: MRM8488/T5-BASE-FINETUNED-SUMMARIZE-NEWS
processed tokens: 512
input chars: 4445
summary chars: 492


Measuring a black hole’s mass isn’t easy. A new technique could change that
---------------------------------------------------------------------------
67 actively feeding black holes were weighed using accretion disks to measure their mass. As gas and dust falls into a black hole, the material organizes into a disk that is heated to white-hot temperatures and can, in some cases, outshine all the stars in the g
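
Both runs report `processed tokens: 512`, so the 4445-character article was cut off and only its beginning was summarised. One possible workaround, which I have not tried in this notebook, is to split the article into chunks that each fit the token budget and summarise them one at a time. Below is a minimal sketch of the splitting step only. The function name `split_into_chunks` and its `count_tokens` parameter are hypothetical, and by default it counts whitespace-separated words as a rough stand-in for real T5 tokens.

```python
import re


def split_into_chunks(text, max_tokens=512, count_tokens=None):
    """Group whole sentences into chunks that stay within max_tokens.

    A single sentence longer than max_tokens is kept whole in its own chunk.
    """
    if count_tokens is None:
        # Assumption: whitespace-separated words approximate tokens.
        count_tokens = lambda s: len(s.split())

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(' '.join(current))
    return chunks
```

To count real tokens, something like `count_tokens=lambda s: len(tokenizer(s)["input_ids"])` could be passed in, with `max_tokens` set a bit below 512 to leave room for the `summarize: ` prefix and special tokens. Each chunk could then go through `summarize` separately; whether the joined summaries read well is something I would still need to check.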