<a href="https://colab.research.google.com/github/mspatke/Abstractive_Summarization_BART_Transformer/blob/main/Text_summarization_BART_T5_Pegasus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

In [None]:
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from datasets import load_dataset, load_metric
from nltk.tokenize import sent_tokenize
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# sent_tokenize relies on the Punkt sentence tokenizer data
nltk.download("punkt")

In [None]:
from transformers import pipeline, set_seed

In [None]:
from datasets import load_dataset

# "3.0.0" is the dataset configuration name (the non-anonymized version)
dataset = load_dataset("cnn_dailymail", "3.0.0")

print(f"Features in cnn_dailymail : {dataset['train'].column_names}")

**Text Summarization Pipelines**

In [None]:
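# Truncate the article so the prompt fits the models' context windows
# (the 2,000-character cutoff is an assumed, arbitrary choice)
sample_text = dataset["train"][1]["article"][:2000]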

# We'll collect the generated summaries of each model in a dictionary
summaries = {}

In [None]:
sample_text

**Summarization Baseline**

In [None]:
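def baseline_summary(text, num_sentences=3):
    # Lead-n baseline: keep only the first few sentences of the article
    return "\n".join(sent_tokenize(text)[:num_sentences])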

In [None]:
summaries['baseline'] = baseline_summary(sample_text)

summaries['baseline']

**Hugging Face pipeline: GPT-2**

In [None]:
set_seed(42)

# GPT-2 is not trained for summarization; appending "TL;DR" prompts it to continue with a summary
pipe = pipeline("text-generation", model="gpt2-medium")

gpt2_query = sample_text + "\nTL;DR:\n"

pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)

In [None]:
pipe_out

In [None]:
pipe_out[0]

In [None]:
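# The generated text starts with the prompt, so slice it off and keep only the continuation
summaries['gpt2'] = "\n".join(sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query):]))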

**T5** <br>
T5 (Text-To-Text Transfer Transformer) is a transformer model trained end to end with text as input and modified text as output, in contrast to BERT-style models, which can only output a class label or a span of the input. This text-to-text format makes T5 suitable for many NLP tasks, such as summarization, question answering, machine translation, and classification.

How is T5 different from BERT? Both T5 and BERT are pre-trained with a masked language modeling (MLM) objective.

What is MLM?

MLM is a fill-in-the-blank task: part of the input text is masked, and the model tries to predict the masked words.

Example:

“I like to eat peanut butter and

The main difference in the pre-training objective is that T5 replaces each run of consecutive masked tokens with a single mask token, whereas BERT uses one mask token for each masked token. This is illustrated below.
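
The span-masking difference can be sketched in plain Python. This is a minimal illustration, not the actual T5 preprocessing code: the spans are chosen by hand rather than sampled, the helper names `span_corrupt` and `bert_style_mask` are hypothetical, and the `<extra_id_N>` strings follow the sentinel naming used by the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) token span with one sentinel, T5-style.

    `spans` must be sorted and non-overlapping; `end` is exclusive.
    Returns (corrupted_input, target) as lists of tokens.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # one sentinel for the whole span
        target.append(sentinel)                 # the target repeats the sentinel...
        target.extend(tokens[start:end])        # ...followed by the dropped tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep the text after the last span
    target.append(f"<extra_id_{len(spans)}>")   # a final sentinel closes the target
    return corrupted, target


def bert_style_mask(tokens, spans):
    """Contrast: one [MASK] per masked token, as in BERT-style MLM."""
    masked = list(tokens)
    for start, end in spans:
        for i in range(start, end):
            masked[i] = "[MASK]"
    return masked


tokens = "Thank you for inviting me to your party last week".split()
spans = [(2, 4), (8, 9)]  # hand-picked spans: "for inviting" and "last"

t5_input, t5_target = span_corrupt(tokens, spans)
print(" ".join(t5_input))
print(" ".join(t5_target))
print(" ".join(bert_style_mask(tokens, spans)))
```

The two-token span "for inviting" costs T5 a single sentinel in the input, while the BERT-style version spends one `[MASK]` per token.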

T5 expects a prefix before the input text that tells it which task to perform. For example: <br>
“summarize:” for summarization, <br>
“cola sentence:” for classification (linguistic acceptability), <br>
“translate English to German:” for machine translation, and so on.
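
As a rough sketch of what "adding a prefix" means in practice, the snippet below builds the prefixed input string by hand. The `TASK_PREFIXES` table and the `add_task_prefix` helper are hypothetical names used only for illustration. When the `summarization` pipeline is used with a T5 checkpoint, the prefix is normally taken from the model configuration, so the manual step should only matter when calling the tokenizer and model directly.

```python
# Hypothetical lookup table for illustration; the prefix strings follow the T5 convention.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "acceptability": "cola sentence: ",
}


def add_task_prefix(task, text):
    """Prepend the task prefix so the model knows which task to perform."""
    return TASK_PREFIXES[task] + text


prompt = add_task_prefix("summarization", "The tower is 324 metres tall.")
print(prompt)
```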

In [None]:
pipe = pipeline("summarization", model="t5-small")

pipe_out = pipe(sample_text)

In [None]:
pipe_out

In [None]:
# Join sentences with a newline (the original joined them with the letter "n")
summaries['t5'] = "\n".join(sent_tokenize(pipe_out[0]['summary_text']))

# **BART**


BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture.
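
To make the corrupt-then-reconstruct idea concrete, here is a minimal sketch of two simple noising functions, sentence permutation and token deletion. The function names, the example sentences, and the fixed seed are assumptions for illustration; this is not the BART training code, which works on subword tokens and combines several noising schemes.

```python
import random


def permute_sentences(sentences, rng):
    """Noise 1: shuffle the order of the sentences in a document."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def delete_tokens(tokens, drop_prob, rng):
    """Noise 2: drop each token independently with probability `drop_prob`."""
    return [t for t in tokens if rng.random() >= drop_prob]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sentences = [
    "BART corrupts the text.",
    "Then it reconstructs it.",
    "The target is the original.",
]

noisy = permute_sentences(sentences, rng)
# A training pair: the corrupted text is the input, the original text is the target
pair = (" ".join(noisy), " ".join(sentences))
print(pair)
print(delete_tokens(pair[1].split(), drop_prob=0.3, rng=rng))
```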

In other words, it pairs a bidirectional encoder (like BERT) with a left-to-right decoder (like GPT): the encoder's attention mask is fully visible, as in BERT, and the decoder's attention mask is causal, as in GPT-2.
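
The two kinds of attention mask can be shown with small matrices, where row `i` lists the positions that token `i` may attend to (1 = visible, 0 = hidden). This is a plain-Python sketch with hypothetical helper names; it ignores padding and the decoder's cross-attention to the encoder.

```python
def full_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for name, mask in [("encoder (bidirectional)", full_mask(4)),
                   ("decoder (causal)", causal_mask(4))]:
    print(name)
    for row in mask:
        print(" ".join(map(str, row)))
```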

This means that a fine-tuned BART model can take a text sequence (for example, English) as input and produce a different text sequence at the output (for example, French).

This type of model is relevant for machine translation, question answering, text summarization, and sequence classification (categorizing input sentences or tokens).

Another task is sentence entailment: given two or more sentences, the model evaluates whether they logically follow from, or are otherwise logically related to, a given statement.

In [None]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)

In [None]:
pipe_out

In [None]:
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

summaries["bart"]

# **PEGASUS**
The PEGASUS pre-training task is deliberately close to summarization: several whole, important sentences are removed (masked) from an input document, and the model must generate them together as one output sequence from the remaining sentences. The input for pre-training is therefore a document with missing sentences, and the output is the missing sentences concatenated together, which is fairly similar to a summary. The advantage of this self-supervision is that you can create as many examples as there are documents without any human intervention; the need for human labeling is often the bottleneck in purely supervised systems.
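
A minimal sketch of how one such pre-training example could be built is shown below. It makes several assumptions: the sentences to remove are passed in by hand (PEGASUS selects them with an importance score), the `<mask_1>` placeholder string is assumed to match the sentence-mask token, and `gap_sentence_example` is a hypothetical helper name.

```python
MASK_SENT = "<mask_1>"  # placeholder for a removed sentence (assumed token string)


def gap_sentence_example(sentences, gap_indices):
    """Build one (source, target) pair by masking the chosen sentences.

    source: the document with each chosen sentence replaced by MASK_SENT
    target: the removed sentences concatenated in document order
    """
    gaps = set(gap_indices)
    source = [MASK_SENT if i in gaps else s for i, s in enumerate(sentences)]
    target = [s for i, s in enumerate(sentences) if i in gaps]
    return " ".join(source), " ".join(target)


doc = [
    "Pegasus is mythical.",
    "It is pure white.",
    "It names the model.",
]
source, target = gap_sentence_example(doc, [1])
print(source)
print(target)
```

No labels are needed to build the pair: the target comes from the document itself, which is why any unlabeled document can become a training example.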

In [None]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")

pipe_out = pipe(sample_text)

In [None]:
pipe_out

In [None]:
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")

