# T5 Text Summarizer

This Jupyter Notebook demonstrates how to use Google's pre-trained T5 model to generate a summary.

Please ensure both Python packages below have been installed with `pip`.

```pip install torch```

```pip install transformers```

## Import Libraries / Modules

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

## Load Model and Tokenizer

In [4]:
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')

tokenizer = AutoTokenizer.from_pretrained('t5-base')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


## Input Text

Load CNN [news](https://edition.cnn.com/2021/04/21/media/netflix-earnings-analysis/index.html) in text form.

In [5]:
text = """
One challenge that Kingdom preachers face is opposition. Apostates, religious leaders, and politicians have given many the wrong impression about our work. If our relatives, acquaintances, and workmates are misled by this propaganda, they may pressure us to stop serving Jehovah and to stop preaching. In some countries, the opposition takes the form of intimidation, threats, arrests, and even imprisonment. We are not surprised at this reaction. Jesus foretold: “You will be hated by all the nations on account of my name.” (Matt. 24:9) The very fact that we are experiencing such hatred is proof that we have Jehovah’s approval. (Matt. 5:11, 12) The Devil is behind this opposition. But he is no match for Jesus! With Jesus’ support, the good news is reaching people of all nations. Consider the evidence.
"""

## Tokenize Text

In [6]:
# Prepend the T5 task prefix, then truncate to the model's maximum input length
tokens_input = tokenizer.encode("summarize: " + text, return_tensors='pt',
                                max_length=tokenizer.model_max_length,
                                truncation=True)

## Generate Summary

In [7]:
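# Beam search with 2 beams; the generated summary is constrained to 80-150 tokens
summary_ids = model.generate(tokens_input, min_length=80,
                             max_length=150,
                             length_penalty=20,
                             num_beams=2)

# Convert the generated token IDs back to text
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)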

In [8]:
print(summary)

apostates, religious leaders, and politicians have given many the wrong impression about our work. if relatives, acquaintances, and workmates are misled by this propaganda, they may pressure us to stop preaching. with Jesus’ support, the good news is reaching people of all nations. consider the evidence. apostasy, christianity, and apostasy are just a few of the challenges that kingdom preachers face - apostasy, christi
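
The tokenize, generate, and decode steps above can be wrapped in one reusable helper. The sketch below is one possible way to do that, not part of the original notebook: the names `summarize` and `GENERATION_KWARGS` are hypothetical, and `no_repeat_ngram_size=3` is an added assumption (it is a standard `generate` argument that blocks repeated 3-grams) that may reduce the repetition visible at the end of the output above.

```python
# Decoding settings: the notebook's values plus no_repeat_ngram_size,
# which is an assumption added here and was not used in the original run.
GENERATION_KWARGS = {
    "min_length": 80,
    "max_length": 150,
    "length_penalty": 20,
    "num_beams": 2,
    "no_repeat_ngram_size": 3,
}


def summarize(model, tokenizer, text, **overrides):
    """Summarize `text` with a T5-style model and tokenizer supplied by the caller."""
    kwargs = {**GENERATION_KWARGS, **overrides}
    # Prepend the T5 task prefix and truncate to the model's input limit.
    input_ids = tokenizer.encode(
        "summarize: " + text,
        return_tensors="pt",
        max_length=tokenizer.model_max_length,
        truncation=True,
    )
    summary_ids = model.generate(input_ids, **kwargs)
    # Decode the first (best) sequence back to plain text.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

With the `model` and `tokenizer` loaded earlier, `summarize(model, tokenizer, text)` would run the same steps in one call, and any setting can be overridden per call, for example `summarize(model, tokenizer, text, num_beams=4)`.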


In [9]:
# number of tokens generated from the text using T5 Tokenizer
len(tokenizer(text)['input_ids'])

192

In [10]:
# maximum input length (in tokens) accepted by the model
tokenizer.model_max_length

512
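
The two numbers above can also be compared programmatically. The following is a minimal sketch with a hypothetical helper name, `truncation_report`; it assumes the tokenizer is callable and returns a mapping with an `input_ids` list, as the T5 tokenizer does in the cell above. Unlike that cell, it counts the `summarize: ` prefix too, since the prefix is part of what the model actually receives.

```python
def truncation_report(tokenizer, text, prefix="summarize: "):
    """Report whether `prefix + text` exceeds the tokenizer's model_max_length."""
    # Count tokens without truncation so any overflow is visible.
    n_tokens = len(tokenizer(prefix + text)["input_ids"])
    max_len = tokenizer.model_max_length
    dropped = max(0, n_tokens - max_len)
    return {
        "n_tokens": n_tokens,
        "max_len": max_len,
        "truncated": dropped > 0,
        "dropped_tokens": dropped,
    }
```

Calling `truncation_report(tokenizer, text)` before encoding makes it explicit whether `truncation=True` will actually discard anything.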

# BERT Extractive Summary

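When the number of input tokens exceeds the model's maximum input length, the end of the text is truncated and never reaches the T5 summarizer. The sample text here is only 192 tokens, well under the 512-token limit, so nothing is truncated in this run, but longer inputs would be.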

To reduce the risk of losing important details from the end of a long text, let's perform extractive summarization first, followed by abstractive summarization.
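
The two-stage idea can be sketched as a small function that takes each stage as a callable. This is only an illustration of the control flow: the name `extractive_then_abstractive` and the fallback to the original text are assumptions, not part of the notebook.

```python
def extractive_then_abstractive(text, extract, abstract):
    """Condense `text` with `extract`, then rewrite the result with `abstract`.

    `extract` and `abstract` are any callables that map a string to a string.
    """
    condensed = extract(text)
    # Assumed fallback: if the extractive stage returns nothing usable,
    # pass the original text to the abstractive stage instead.
    if not condensed.strip():
        condensed = text
    return abstract(condensed)
```

In this notebook, `extract` would be a wrapper around the BERT summarizer and `abstract` a wrapper around the T5 tokenize, generate, and decode steps, once both models are loaded.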

Before we proceed, make sure the BERT extractive summarizer has been installed with `pip`.

```pip install bert-extractive-summarizer```

In [11]:
# import BERT summarizer module
from summarizer import Summarizer

Use the BERT summarizer to keep only the top 50% of sentences, those it considers most important.

In [12]:
bert_model = Summarizer()
ext_summary = bert_model(text, ratio=0.5)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
print(ext_summary)

One challenge that Kingdom preachers face is opposition. Apostates, religious leaders, and politicians have given many the wrong impression about our work. 24:9) The very fact that we are experiencing such hatred is proof that we have Jehovah’s approval. ( 5:11, 12) The Devil is behind this opposition.


## Tokenize BERT Summary

In [14]:
tokens_input_2 = tokenizer.encode("summarize: " + ext_summary, return_tensors='pt',
                                  max_length=tokenizer.model_max_length,
                                  truncation=True)

In [15]:
len(tokenizer(ext_summary)['input_ids'])

69

The extractive summary is only 69 tokens, well below the 512-token limit, so nothing is truncated when it is passed to T5.

## Extractive-Abstractive Summary

In [16]:
summary_ids_2 = model.generate(tokens_input_2, min_length=80,
                               max_length=150,
                               length_penalty=20,
                               num_beams=2)

summary_2 = tokenizer.decode(summary_ids_2[0], skip_special_tokens=True)

In [17]:
print(summary_2)

apostates, religious leaders, and politicians have given many the wrong impression about our work. the very fact that we are experiencing such hatred is proof that we have Jehovah’s approval. the very fact that we are experiencing such hatred is proof that we have Jehovah’s approval. apostates, religious leaders, and politicians have given many the wrong impression about our work.
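
The output above repeats two sentences verbatim, most likely because `min_length=80` forces the model to keep generating after a 69-token input has been covered. One option is to lower `min_length`; another is a small post-processing step. The sketch below takes the second route. It is an assumption about how one might clean the output, not something the notebook does: the helper name `drop_repeated_sentences` is hypothetical, and the regex sentence splitter is deliberately naive.

```python
import re


def drop_repeated_sentences(summary):
    """Keep the first occurrence of each sentence, preserving the original order."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.lower()
        if sentence and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)
```

Applied as `drop_repeated_sentences(summary_2)`, this would leave one copy of each of the two distinct sentences in the output above.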
