#Text Summarization


Data Mining, Text Mining and Big Data Analytics - 2022/2023

Pietro Epis (mat. 0001030354)


##Introduction
This notebook explores some well-known transformer-based models (T5, BART and PEGASUS) by applying them to the task of Text Summarization. Specifically, *abstractive* summarization is implemented: the generated summaries may contain new phrases and sentences that do not appear in the source text (in contrast to *extractive* summarization, which only selects and returns parts of the input).

The application domain is dialogue, so the chosen datasets are composed of conversations: SAMSum, DialogSum and TWEETSUMM.

**[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (Text-To-Text Transfer Transformer) is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, that aims to unify every language problem under a text-to-text approach. T5 was pre-trained on Colossal Clean Crawled Corpus (C4) dataset, on which strong preprocessing (deduplication, discarding incomplete sentences, and removing offensive or noisy content) was applied, leading to better results on downstream tasks.

**[BART](https://huggingface.co/docs/transformers/model_doc/bart)** combines a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT). It is reported to be particularly effective for text generation, but it also works well on comprehension tasks. It was pre-trained on several denoising tasks, in which the model must restore a corrupted document (the corruptions include token masking, token deletion, text infilling, sentence permutation and document rotation).

**[PEGASUS](https://huggingface.co/docs/transformers/model_doc/pegasus)** is a transformer-based model whose pre-training task closely resembles summarization: sentences are removed from an input document, and the model must generate them together as one output sequence from the remaining sentences. It was presented in 2019 and achieved state-of-the-art performance at the time of publication.

These models accept inputs of limited length (up to 1024 tokens for BART), therefore longer sequences need to be truncated.

The **[SAMSum](https://arxiv.org/pdf/1911.12237v2.pdf)** dataset contains messenger-like conversations paired with summaries. Style and register vary: conversations can be informal, semi-formal or formal, and may contain slang, emoticons and typos. The dataset contains 16369 conversations, split into 14732 for training, 818 for validation and 819 for test.

The **[DialogSum](https://arxiv.org/pdf/2105.06762v4.pdf)** dataset is made up of 13460 dialogues. The version downloaded here from HuggingFace has 12460 rows for training, 500 for validation and 1500 for test (the test rows exceed what the stated total would imply, presumably because each test dialogue comes with more than one reference summary). It is characterized by a variety of topics and face-to-face spoken conversation contexts, including school, work, medication, shopping, leisure and travel. Most conversations take place between friends, between colleagues, or between service providers and customers.

The **[TWEETSUMM](https://arxiv.org/pdf/2111.11894.pdf)** dataset is based on the [Customer Support on Twitter](https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter) dataset, available on Kaggle. It comprises 1100 dialogues built from that dataset, each annotated with 3 extractive and 3 abstractive summaries (created by human annotators). Unlike the previous datasets, this one required heavier preprocessing to bring it into a suitable format, as explained below.

##Import Libraries

In [8]:
!pip install datasets
!pip install py7zr
!pip install transformers
!pip install rouge_score
!pip install evaluate
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [9]:
import pandas as pd

import numpy as np

import datasets
from datasets import load_dataset, load_metric, Dataset, DatasetDict

from transformers import pipeline, AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, logging

import torch

import evaluate

import random

import nltk
nltk.download("punkt")
from nltk.tokenize import sent_tokenize

import re

import json

import os

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [27]:
# Set the random seed to 42 for the reproducibility of experiments that involve randomness
def set_reproducibility(seed):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  os.environ["TF_DETERMINISTIC_OPS"] = "1"

set_reproducibility(42)

In [46]:
# Avoid excessive verbosity, print only errors

logging.set_verbosity(logging.ERROR)

##TweetSumm Dataset Reconstruction

The procedure implemented in this block constructs the TweetSumm dataset: it fetches the dialogues from the Kaggle dataset, matches them with the related summaries, and concatenates the turns of each conversation to reach a format suitable for feeding the model.

Note: there is no need to run this part at every execution, nor to mount Google Drive. This routine was run just once to generate the dataset in a proper format, which was then exported to *csv* files and stored in the repository for reuse.

In [7]:
from google.colab import drive
drive.mount("/content/drive")
!cp "/content/drive/MyDrive/TextMining/tweetsumm/twcs.csv" "twcs.csv" # Customer Support on Twitter Dataset
!cp "/content/drive/MyDrive/TextMining/tweetsumm/final_train_tweetsum.jsonl" "final_train_tweetsum.jsonl"
!cp "/content/drive/MyDrive/TextMining/tweetsumm/final_valid_tweetsum.jsonl" "final_valid_tweetsum.jsonl"
!cp "/content/drive/MyDrive/TextMining/tweetsumm/final_test_tweetsum.jsonl" "final_test_tweetsum.jsonl"

Mounted at /content/drive


In [None]:
# Download the TweetSumProcessor class from the TweetSumm GitHub repository, which provides
# out-of-the-box functions to ease the conversion

# -f avoids an error when the file does not exist yet
!rm -f tweet_sum_processor.py
!wget https://raw.githubusercontent.com/guyfe/Tweetsumm/main/tweet_sum_processor.py
from tweet_sum_processor import TweetSumProcessor

--2023-02-06 13:33:51--  https://raw.githubusercontent.com/guyfe/Tweetsumm/main/tweet_sum_processor.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7132 (7.0K) [text/plain]
Saving to: ‘tweet_sum_processor.py’


2023-02-06 13:33:51 (78.1 MB/s) - ‘tweet_sum_processor.py’ saved [7132/7132]



In [None]:
# Concatenate the parts of the dialog and the parts of the summary

def preprocess_inputs(json_data):
  dialogue = json_data["dialog"]["turns"]
  full_text = []
  # Concatenate the sentences (turns) of the dialog into a single string
  for i in dialogue:
    string = " ".join(i["sentences"])
    full_text.append(string + " <BR>") # <BR> for transition between speakers
  conversation = " ".join(full_text)
  # Loop over single words and replace links with <LINK>
  words = conversation.split(" ")
  for i in range(0, len(words)):
    if "http" in words[i]:
      words[i] = "<LINK>"
  text = " ".join(words)
  # Remove all special characters
  text = re.sub(r"[^a-zA-Z0-9,!.?<> ]", "", text)
  return text

In [None]:
# Take the first annotation and concatenate the sentences it is composed of

def get_summary(json_data): 
  return " ".join(json_data["summaries"]["abstractive_summaries"][0])

In [None]:
# Return two aligned lists containing the input texts (the dialogues) and the related summaries

def build_dataset(file_name, processor):
  inputs = []
  summaries = []
  with open(file_name) as f:
    # read the file (train/validation/test) and pass the content to the library functions provided by TweetSumm
    dialog_with_summaries = processor.get_dialog_with_summaries(f.readlines())
    for dialog_with_summary in dialog_with_summaries:
      try:
        json_format = json.loads(dialog_with_summary.get_json())
        inputs.append(preprocess_inputs(json_format))
        summaries.append(get_summary(json_format))
      except TypeError:
        pass
  return inputs, summaries

In [None]:
# Exploit the functions above to get, for each split of the dataset, two lists
# (one containing the dialogues and one the related summary), that are suitable
# to create a DataFrame

processor = TweetSumProcessor("twcs.csv")
train_inputs, train_summaries = build_dataset("final_train_tweetsum.jsonl", processor)
valid_inputs, valid_summaries = build_dataset("final_valid_tweetsum.jsonl", processor)
test_inputs, test_summaries = build_dataset("final_test_tweetsum.jsonl", processor)

In [None]:
# Instantiate the DataFrames and write them to file

train = pd.DataFrame({"inputs": train_inputs, "summaries": train_summaries})
train.to_csv("/content/drive/MyDrive/TextMining/tweetsumm/tweetsum_train.csv", index=False)

# Each split is written to its own file (the same names are used when loading the dataset later)
valid = pd.DataFrame({"inputs": valid_inputs, "summaries": valid_summaries})
valid.to_csv("/content/drive/MyDrive/TextMining/tweetsumm/tweetsum_valid.csv", index=False)

test = pd.DataFrame({"inputs": test_inputs, "summaries": test_summaries})
test.to_csv("/content/drive/MyDrive/TextMining/tweetsumm/tweetsum_test.csv", index=False)

##Download Datasets

In [14]:
# Remove all characters but the standard ones, such as letters, digits and basic
# punctuation, along with '<', '>', and '#' that are special semantic characters
# of the datasets (used as tags or delimiters)

def preprocessing(sample):
  sample["text"] = re.sub(r"[^a-zA-Z0-9,!\.?<>#' ]", "", sample["text"])
  return sample
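
To make the effect of this filter concrete, here is a minimal sketch that applies the same regular expression to a made-up chat snippet (the input string and the helper name `clean` are assumptions for illustration, not taken from the datasets). It shows that colons and line breaks are dropped as well, which is why speaker names appear fused with the previous turn in the samples printed in this notebook.

```python
import re

def clean(text):
    # Same character whitelist as in the preprocessing function
    return re.sub(r"[^a-zA-Z0-9,!\.?<>#' ]", "", text)

# Hypothetical messenger-style input with "\r\n" between turns
raw = "Molly: Any big plans for 2019?\r\nIsaac: hmm, I'm considering it :)"
cleaned = clean(raw)
print(cleaned)
```

If keeping the turn boundaries mattered, one option would be to replace line breaks with a space before filtering; the notebook does not do this.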

In [10]:
# Print five random samples from the dataset passed as parameter (full text
# and summary)

def show_samples(dataset):
  for i in random.sample(range(dataset.shape[0]), 5):
    print("### DIALOGUE", str(i), " ###")
    print("TEXT")
    print(dataset[i]["text"], "\n")
    print("SUMMARY")
    print(dataset[i]["summary"], "\n\n")

The datasets (except for TweetSumm, which is handled separately) are downloaded through the HuggingFace APIs, which structure them as a DatasetDict with three keys: *train*, *validation* and *test*.

Since the rest of the notebook expects two columns holding the full dialogue and the summary, respectively named *text* and *summary*, the columns have to be renamed to reach the required format.

###SAMSum

In [15]:
dataset_samsum = load_dataset("samsum")
dataset_samsum = dataset_samsum.rename_column("dialogue", "text")
dataset_samsum = dataset_samsum.map(preprocessing)



  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/14732 [00:00<?, ?ex/s]

  0%|          | 0/819 [00:00<?, ?ex/s]

  0%|          | 0/818 [00:00<?, ?ex/s]

In [16]:
show_samples(dataset_samsum["train"])

### DIALOGUE 13453  ###
TEXT
Molly Any big plans for 2019?Isaac hmm, I'm considering going to the conference in San Fransisco in 2019Isaac and then maybe flying to Tahiti, I've always dreamt about it and I think it's an opportunityJose wow, that sounds really nice, maybe it's a good ideaIsaac I think so, it's not even too expensive when one flies from San FranciscoIsaac maybe anybody would like to join me?Jose I'd love to! But it's still very expensive I suppose?Isaac about 1000 for the flightsMolly really? I thought they are much more expensiveIsaac no, they are actually not that expensiveIsaac I mean still a lot, but not undoableMolly True, I may consider itIsaac Think about it guys and let me knowJose I will! 

SUMMARY
Isaac wants to go to Tahiti in 2019. Jose and Molly consider going with him. 


### DIALOGUE 13724  ###
TEXT
Ron Hi there guys! Got any plans for the weekend?Taylor Hi, Ron! Actually, I do!Harry Hi, Ron. Me too Ron Got plans together? Harry Nope  Taylor Nah. So what'r

###DialogSum

In [17]:
dataset_dialogsum = load_dataset("knkarthick/dialogsum")
dataset_dialogsum = dataset_dialogsum.rename_column("dialogue", "text")
dataset_dialogsum = dataset_dialogsum.map(preprocessing)

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/12460 [00:00<?, ?ex/s]

  0%|          | 0/1500 [00:00<?, ?ex/s]

  0%|          | 0/500 [00:00<?, ?ex/s]

In [19]:
show_samples(dataset_dialogsum["train"])

### DIALOGUE 9407  ###
TEXT
#Person1# Morning, Zina. Just wanted to say thanks again! #Person2# Hi, Vince. Thanks for stopping by. How's the work coming along for the online auction? #Person1# Oh, yeah. I'm glad you mentioned that. I think we need to hire somebody new to manage it. #Person2# Can't Elvin handle it? #Person1# I think he's got too much on his plate.  

SUMMARY
Vince tells Zina Elvin cannot handle the online auction so they need to hire someone else. 


### DIALOGUE 851  ###
TEXT
#Person1# Good morning, Mr. Carson, please?#Person2# I'm afraid Mr. Carson is at a very important meeting at the moment and cannot be disturbed. May I know who's calling?#Person1# Yes, this is Mr. Prince. I would like to talk to Mr. Carson today, if possible.#Person2# Well, I'm afraid the meeting won't finish until one o'clock and then he has a lunch appointment. If he has time, I can ask him to ring you before he leaves.#Person1# OK. I'd be grateful if you would.#Person2# Not at all. Mr. Prince. 

###TweetSumm

The restructured dataset is imported from the generated *csv* files, and a DatasetDict object is created, thus achieving the same format as the other datasets imported from HuggingFace.

In [None]:
df_train = pd.read_csv("https://raw.githubusercontent.com/pietroepis/text-summarization/main/tweetsum_train.csv", names = ["text", "summary"], header = 0)
df_valid = pd.read_csv("https://raw.githubusercontent.com/pietroepis/text-summarization/main/tweetsum_valid.csv", names = ["text", "summary"], header = 0)
df_test = pd.read_csv("https://raw.githubusercontent.com/pietroepis/text-summarization/main/tweetsum_test.csv", names = ["text", "summary"], header = 0)

dataset_tweetsumm = DatasetDict({
    "train": Dataset.from_pandas(df_train),
    "validation": Dataset.from_pandas(df_valid),
    "test": Dataset.from_pandas(df_test)
})

In [26]:
show_samples(dataset_tweetsumm["train"])

### DIALOGUE 534  ###
TEXT
VerizonSupport a number of recordings on a new, almost empty DVR are unplayable Unable to play the selected program. Please try again. I also get an error on the FIOS Mobile app. Other recordings work fine. Help? <LINK> <BR> Help as arrived! Are the recordings that will not play also from 1116? <BR> VerizonSupport Looks like 1116 and 1117 at least. Have had the STB only a few days. <BR> Thanks for confirming. Can you please reboot the cable box and try to access the recordings again? If you have not already tried this. HSB <BR> VerizonSupport Just did! I unplugged and let it sit for 30 seconds. <BR> As we are having an issue with DVR viewing could you reboot the router and then your cable box. RMD <BR> VerizonSupport Ok! Done! <BR> VerizonSupport Er, done still jacked up. <BR> Please follow and DM us. RMD <BR> 

SUMMARY
Customer complaining that unable to play the selected program. Agent is replying that reboot the cable box and try to access the recording ag

##Models Performance Overview

As a first approach it's worth testing the pre-trained models on the summarization task without fine-tuning them on our specific datasets. To drive the choice of the most promising model (which will then be fine-tuned), one hundred instances have been sampled from the validation split (in order to leave the test set unseen for the final evaluation, as best practices suggest).

To assess the performance, the **[ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)** metric has been chosen. In particular, the function provided by HuggingFace computes several variants:

*   ROUGE-1: overlap of unigrams (single words) between the generated and the reference summary
*   ROUGE-2: overlap of bigrams between the generated and the reference summary
*   ROUGE-L: based on the length of the longest common subsequence
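
To give an intuition of what these scores measure, the following is a simplified sketch that computes the three F-measures by hand on a made-up pair of sentences. It is not the HuggingFace implementation (which also normalizes and optionally stems the text); the function names and the example strings are assumptions for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of the n-grams contained in a list of tokens
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(matches, pred_total, ref_total):
    # Harmonic mean of precision and recall
    if matches == 0:
        return 0.0
    precision, recall = matches / pred_total, matches / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(prediction, reference, n):
    pred, ref = ngrams(prediction.split(), n), ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(pred.values()), sum(ref.values()))

def rouge_l(prediction, reference):
    a, b = prediction.split(), reference.split()
    # Dynamic programming table for the longest common subsequence
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return f1(table[len(a)][len(b)], len(a), len(b))

prediction = "isaac wants to fly to tahiti"
reference = "isaac wants to go to tahiti in 2019"

r1 = rouge_n(prediction, reference, 1)
r2 = rouge_n(prediction, reference, 2)
rl = rouge_l(prediction, reference)
print(round(r1, 4), round(r2, 4), round(rl, 4))
```

Here 5 of the 6 predicted unigrams and 3 of the 5 predicted bigrams also occur in the reference, and the longest common subsequence has 5 words.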

In [None]:
rouge_metric = load_metric("rouge")

In [45]:
# Create the pipelines for inference with the chosen models (inputs longer than
# each model's maximum length are truncated)

pipe_t5 = pipeline("summarization", model="t5-base", truncation=True)
pipe_bart = pipeline("summarization", model="facebook/bart-base", truncation=True)
pipe_pegasus = pipeline("summarization", model="google/pegasus-xsum", truncation=True)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
# T5 models expect a prefix that depends on the task being performed ("summarize: " for summarization)

def t5_prefix(item):
  item["text"] = "summarize: " + item["text"]
  return item

In [None]:
# Given the dataset (with the ground truths), the pipeline that generates the
# summaries and the desired metric (ROUGE in our case), compute the scores on all the samples

def compute_metrics(dataset, pipe, metric):
  # The parameter is named `pipe` to avoid shadowing the `pipeline` function imported from transformers
  summaries = [pipe(text)[0]["summary_text"] for text in dataset["text"]]
  metric.add_batch(predictions = summaries, references = dataset["summary"])
  return metric.compute()

# Define the metrics that we care about and that will be shown for each model

rouge_names = ["rouge1", "rouge2", "rougeL"]

###Baseline

A common baseline for text summarization is to simply take the first three sentences of an article, often called the *lead-3* baseline. We could use full stops to detect the sentence boundaries, but this will fail on acronyms (dot separated letters), so instead we’ll take advantage of the **nltk** library, which includes a better algorithm to handle these cases and exceptions.
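
As a small illustration of the problem mentioned above, this sketch splits a made-up sentence on full stops only (the example text is an assumption, and no nltk call is involved): the acronym and the time abbreviation are broken into meaningless fragments, so taking "the first three sentences" would return garbage.

```python
text = "The U.S. team arrived at 9 a.m. in Rome. They won."

# Naive approach: every full stop is treated as a sentence boundary
naive = [part.strip() for part in text.split(".") if part.strip()]
print(naive)

# A lead-3 summary built on this split would not even cover the first real sentence
print("\n".join(naive[:3]))
```

The text contains two sentences, but the naive split produces six fragments.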

In [None]:
# Extract the first three sentences from the text

def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

In [None]:
# As before, compute the desired metric, taking into account the dataset annotations
# and the predicted summary obtained with lead-3 baseline

def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["text"]]
    metric.add_batch(predictions = summaries, references = dataset["summary"])
    return metric.compute()

Now, for each of the three datasets we run all the three models plus the baseline (lead-3) and aggregate the computed metrics into a dictionary, whose keys are the names of the models. Then, the results are shown.

###SAMSum

In [None]:
metrics_samsum = {
    "baseline": evaluate_baseline(dataset_samsum["validation"].shuffle(seed=42).select(range(100)), rouge_metric),
    "t5": compute_metrics(dataset_samsum["validation"].map(t5_prefix).shuffle(seed=42).select(range(100)), pipe_t5, rouge_metric),
    "bart": compute_metrics(dataset_samsum["validation"].shuffle(seed=42).select(range(100)), pipe_bart, rouge_metric),
    "pegasus": compute_metrics(dataset_samsum["validation"].shuffle(seed=42).select(range(100)), pipe_pegasus, rouge_metric)
}

In [None]:
for model in metrics_samsum:
    print(model.upper())
    print(dict((rn, metrics_samsum[model][rn].mid.fmeasure) for rn in rouge_names), "\n")

BASELINE
{'rouge1': 0.26834923291513324, 'rouge2': 0.08021047956473354, 'rougeL': 0.20276822713629422} 

T5
{'rouge1': 0.24948036772829302, 'rouge2': 0.0772794008374155, 'rougeL': 0.19677859773786072} 

BART
{'rouge1': 0.2662318996280884, 'rouge2': 0.0825453326791383, 'rougeL': 0.20034077267279599} 

PEGASUS
{'rouge1': 0.14710353097578352, 'rouge2': 0.034901201961626035, 'rougeL': 0.1217849982896291} 



###DialogSum

In [None]:
metrics_dialogsum = {
    "baseline": evaluate_baseline(dataset_dialogsum["validation"].shuffle(seed=42).select(range(100)), rouge_metric),
    "t5": compute_metrics(dataset_dialogsum["validation"].map(t5_prefix).shuffle(seed=42).select(range(100)), pipe_t5, rouge_metric),
    "bart": compute_metrics(dataset_dialogsum["validation"].shuffle(seed=42).select(range(100)), pipe_bart, rouge_metric),
    "pegasus": compute_metrics(dataset_dialogsum["validation"].shuffle(seed=42).select(range(100)), pipe_pegasus, rouge_metric)
}

In [None]:
for model in metrics_dialogsum:
    print(model.upper())
    print(dict((rn, metrics_dialogsum[model][rn].mid.fmeasure) for rn in rouge_names), "\n")

BASELINE
{'rouge1': 0.24765155448630344, 'rouge2': 0.0673591957285137, 'rougeL': 0.19237169264838924} 

T5
{'rouge1': 0.23453990512209315, 'rouge2': 0.06534545773956851, 'rougeL': 0.15226425468558383} 

BART
{'rouge1': 0.2592665257828205, 'rouge2': 0.06701633482085703, 'rougeL': 0.16682382914973007} 

PEGASUS
{'rouge1': 0.1365315420186302, 'rouge2': 0.02539039334530887, 'rougeL': 0.11139094306478743} 



###TweetSumm

In [None]:
metrics_tweetsumm = {
    "baseline": evaluate_baseline(dataset_tweetsumm["validation"].shuffle(seed=42).select(range(100)), rouge_metric),
    "t5": compute_metrics(dataset_tweetsumm["validation"].map(t5_prefix).shuffle(seed=42).select(range(100)), pipe_t5, rouge_metric),
    "bart": compute_metrics(dataset_tweetsumm["validation"].shuffle(seed=42).select(range(100)), pipe_bart, rouge_metric),
    "pegasus": compute_metrics(dataset_tweetsumm["validation"].shuffle(seed=42).select(range(100)), pipe_pegasus, rouge_metric)
}

In [None]:
for model in metrics_tweetsumm:
    print(model.upper())
    print(dict((rn, metrics_tweetsumm[model][rn].mid.fmeasure) for rn in rouge_names), "\n")

BASELINE
{'rouge1': 0.23897688922059618, 'rouge2': 0.09322063429201882, 'rougeL': 0.19243642430803826} 

T5
{'rouge1': 0.21532898276082823, 'rouge2': 0.07032641832570306, 'rougeL': 0.16609965044552368} 

BART
{'rouge1': 0.22506958872941535, 'rouge2': 0.08009745584755937, 'rougeL': 0.1883570852339118} 

PEGASUS
{'rouge1': 0.11531217518049672, 'rouge2': 0.02091225021990102, 'rougeL': 0.08681143786539551} 



From this preliminary analysis we can note that T5 and BART have comparable performance, with fairly similar metrics, while PEGASUS scores clearly lower on all three datasets.

The most surprising fact is that the baseline often scores higher than the models that have not been fine-tuned.

Since fine-tuning all three models on all three datasets turned out to be prohibitive (in terms of computational resources and time), only the most promising model for each dataset was selected for fine-tuning. Surprisingly, among the three models BART reached the highest metrics on all three dialogue datasets, even though PEGASUS is considered state-of-the-art for abstractive summarization (maybe because of the particular structure of the texts, which are dialogues rather than plain texts).

Therefore, in the following, BART model is fine-tuned on the three datasets, and an analysis on the improved performance is carried out.
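
The selection can be reproduced with a few lines. This is only a sketch: the values are the ROUGE-1 F-measures printed above, rounded to four decimals, and the dictionary layout is an assumption made for this example.

```python
# ROUGE-1 F-measures of the three models (without fine-tuning), rounded to four decimals
rouge1 = {
    "SAMSum":    {"t5": 0.2495, "bart": 0.2662, "pegasus": 0.1471},
    "DialogSum": {"t5": 0.2345, "bart": 0.2593, "pegasus": 0.1365},
    "TweetSumm": {"t5": 0.2153, "bart": 0.2251, "pegasus": 0.1153},
}

# For each dataset, pick the model with the highest score
best = {name: max(scores, key=scores.get) for name, scores in rouge1.items()}
print(best)
```

The same ranking holds for ROUGE-2 and ROUGE-L in the outputs above.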

##Fine Tuning

In [30]:
model_name = "facebook/bart-base"

In [31]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Data collators are objects that form a batch from a list of dataset elements.
# The optional `model` argument expects a model instance (not a checkpoint name), so it is omitted here
data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer)

In [32]:
# Convert the inputs to a suitable format for the model: tokenize the texts,
# setting the maximum length to 1024 tokens for the input and to 128 tokens
# for the output summary. In both cases longer sequences are truncated

def preprocess(examples):
  model_inputs = tokenizer(examples["text"], max_length = 1024, truncation = True)
  labels = tokenizer(text_target = examples["summary"], max_length = 128, truncation = True)
  model_inputs["labels"] = labels["input_ids"]

  return model_inputs

In [33]:
rouge = evaluate.load("rouge")

In [34]:
# Decode the predictions and the labels (from token ids) and compute the rouge 
# metric rounded to the fourth decimal

def compute_metrics_training(eval_pred):
  predictions, labels = eval_pred
  decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

  prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
  result["gen_len"] = np.mean(prediction_lens)
  
  return {k: round(v, 4) for k, v in result.items()}

Because of the limited memory (RAM and GPU RAM) and computational power available, some restrictions and simplifications were necessary to make fine-tuning feasible. In particular, the batch size was set to 8 (it would have been better to experiment with bigger sizes, but those immediately ran out of memory). For the same reasons, the number of training epochs was limited to 4, except for TweetSumm, where more epochs were feasible thanks to the significantly smaller size of the dataset.

Since it was not possible to fine-tune all the models in the same Google Colab session, the memory had to be emptied and the runtime restarted every time. Therefore, once each model had been trained, it was exported as a *pth* file so that it can be reused in later sessions (for example to perform inference and assess performance) without retraining.

For each model and dataset, a table highlights the noteworthy improvement in terms of performance achieved by fine-tuning the model. The metrics to carry out this comparison have been computed on the validation set.
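
As a quick sanity check on these settings, the number of optimization steps can be worked out by hand. The sketch below uses the SAMSum training split and assumes a single device, no gradient accumulation and a last, smaller batch that is kept rather than dropped.

```python
import math

train_examples = 14732  # size of the SAMSum training split
batch_size = 8          # per-device training batch size
epochs = 4

# The last batch of each epoch is incomplete, hence the ceiling
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

The result, 7368, matches the "Total optimization steps" value reported in the SAMSum training log.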

###BART with SAMSum

In [None]:
# Apply tokenization to the SAMSum dataset
tokenized_samsum = dataset_samsum.map(preprocess, batched=True)

# Initialize the model with the available weights from pre-training
model_samsum = AutoModelForSeq2SeqLM.from_pretrained(model_name)

  0%|          | 0/15 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [None]:
training_args_samsum = Seq2SeqTrainingArguments(
    output_dir = "samsum_model", # where the model checkpoints will be written
    evaluation_strategy = "epoch", # do evaluation at the end of each epoch
    learning_rate = 2e-5, # initial learning rate
    per_device_train_batch_size = 8, # batch size per GPU core during training
    per_device_eval_batch_size = 4, # batch size per GPU core during evaluation
    weight_decay = 0.01, # weight decay to apply for AdamW
    save_total_limit = 4, # number of checkpoints
    num_train_epochs = 4, # number of epochs
    predict_with_generate = True, # predict using the generate method
    push_to_hub = False # don't upload on HuggingFace Hub
)

trainer_samsum = Seq2SeqTrainer(
    model = model_samsum,
    args = training_args_samsum,
    train_dataset = tokenized_samsum["train"],
    eval_dataset = tokenized_samsum["validation"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics_training,
)

trainer_samsum.train()

The following columns in the training set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: summary, text, id. If summary, text, id are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 14732
  Num Epochs = 4
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7368
  Number of trainable parameters = 139420416
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.8673 | 1.578068 | 0.4576 | 0.2258 | 0.3876 | 0.387 | 17.4169 |
| 2 | 1.649 | 1.519774 | 0.4765 | 0.2458 | 0.4029 | 0.4023 | 18.1797 |
| 3 | 1.5146 | 1.50649 | 0.4812 | 0.2481 | 0.4061 | 0.4061 | 18.0465 |
| 4 | 1.4319 | 1.501813 | 0.4847 | 0.2522 | 0.4099 | 0.41 | 18.1626 |


Output streaming truncated to the last 5000 lines.

Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.0"
}

TrainOutput(global_step=7368, training_loss=1.631315622733548, metrics={'train_runtime': 3028.7458, 'train_samples_per_second': 19.456, 'train_steps_per_second': 2.433, 'total_flos': 9672775811112960.0, 'train_loss': 1.631315622733548, 'epoch': 4.0})
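The `Total optimization steps` value in the log can be checked by hand: with a single device and no gradient accumulation (as logged above), each epoch takes one step per batch, the last partial batch included. A minimal sketch of that arithmetic follows; the function name is illustrative and not part of the `transformers` API.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch counts
    return steps_per_epoch * num_epochs

# SAMSum run logged above: 14732 examples, batch size 8, 4 epochs
samsum_steps = total_optimization_steps(14732, 8, 4)
print(samsum_steps)  # 7368
```

The same helper can be applied to the other two runs in this notebook by plugging in the `Num examples` and `Num Epochs` values from their logs.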

In [None]:
# Export the fine-tuned model (torch.save pickles the whole model object, not only the weights)
# No need to execute this if running the following cells in the same session

temp_file = "/content/drive/MyDrive/TextMining/bart_samsum.pth"
torch.save(model_samsum, temp_file)

Improvement achieved with fine-tuning (on validation set):

|  | **ROUGE-1** | **ROUGE-2** | **ROUGE-L** |
|---|---|---|---|
| **BART (non fine-tuned)** | 0.2662 | 0.0825 | 0.2003 |
| **BART (fine-tuned)** | 0.4847 | 0.2522 | 0.4099 |

###BART with DialogSum

In [None]:
tokenized_dialogsum = dataset_dialogsum.map(preprocess, batched=True)
model_dialogsum = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [None]:
training_args_dialogsum = Seq2SeqTrainingArguments(
    output_dir = "dialogsum_model", # where the model checkpoints will be written
    evaluation_strategy = "epoch", # do evaluation at the end of each epoch
    learning_rate = 2e-5, # initial learning rate
    per_device_train_batch_size = 8, # batch size per GPU core during training
    per_device_eval_batch_size = 4, # batch size per GPU core during evaluation
    weight_decay = 0.01, # weight decay to apply for AdamW
    save_total_limit = 4, # maximum number of checkpoints kept on disk
    num_train_epochs = 4, # number of epochs
    predict_with_generate = True, # predict using the generate method
    push_to_hub = False # don't upload on HuggingFace Hub
)

trainer_dialogsum = Seq2SeqTrainer(
    model = model_dialogsum,
    args = training_args_dialogsum,
    train_dataset = tokenized_dialogsum["train"],
    eval_dataset = tokenized_dialogsum["validation"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics_training,
)

trainer_dialogsum.train()

The following columns in the training set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: topic, text, id, summary. If topic, text, id, summary are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 12460
  Num Epochs = 4
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6232
  Number of trainable parameters = 139420416
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.3742 | 1.198758 | 0.4024 | 0.1794 | 0.3437 | 0.344 | 19.824 |
| 2 | 1.2229 | 1.160584 | 0.4062 | 0.1857 | 0.3513 | 0.3512 | 19.87 |
| 3 | 1.128 | 1.146699 | 0.4134 | 0.1936 | 0.3572 | 0.3573 | 19.858 |
| 4 | 1.0665 | 1.138816 | 0.4154 | 0.1992 | 0.3603 | 0.3605 | 19.892 |


Output streaming truncated to the last 5000 lines.

Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.0"
}


TrainOutput(global_step=6232, training_loss=1.2275835583237538, metrics={'train_runtime': 2864.9179, 'train_samples_per_second': 17.397, 'train_steps_per_second': 2.175, 'total_flos': 9911420944588800.0, 'train_loss': 1.2275835583237538, 'epoch': 4.0})

In [None]:
# Export the fine-tuned model (torch.save pickles the whole model object, not only the weights)
# No need to execute this if running the following cells in the same session

temp_file = "/content/drive/MyDrive/TextMining/bart_dialogsum.pth"
torch.save(model_dialogsum, temp_file)

Improvement achieved with fine-tuning (on validation set):

|  | **ROUGE-1** | **ROUGE-2** | **ROUGE-L** |
|---|---|---|---|
| **BART (non fine-tuned)** | 0.2592 | 0.0670 | 0.1668 |
| **BART (fine-tuned)** | 0.4154 | 0.1992 | 0.3603 |

###BART with TweetSumm

In [35]:
tokenized_tweetsumm = dataset_tweetsumm.map(preprocess, batched=True)
model_tweetsumm = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [None]:
training_args_tweetsumm = Seq2SeqTrainingArguments(
    output_dir = "tweetsumm_model", # where the model checkpoints will be written
    evaluation_strategy = "epoch", # do evaluation at the end of each epoch
    learning_rate = 2e-5, # initial learning rate
    per_device_train_batch_size = 8, # batch size per GPU core during training
    per_device_eval_batch_size = 4, # batch size per GPU core during evaluation
    weight_decay = 0.01, # weight decay to apply for AdamW
    save_total_limit = 4, # maximum number of checkpoints kept on disk
    num_train_epochs = 8, # number of epochs
    predict_with_generate = True, # predict using the generate method
    push_to_hub = False # don't upload on HuggingFace Hub
)

trainer_tweetsumm = Seq2SeqTrainer(
    model = model_tweetsumm,
    args = training_args_tweetsumm,
    train_dataset = tokenized_tweetsumm["train"],
    eval_dataset = tokenized_tweetsumm["validation"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics_training,
)

trainer_tweetsumm.train()

The following columns in the training set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: summary, text. If summary, text are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 869
  Num Epochs = 8
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 872
  Number of trainable parameters = 139420416
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.199017 | 0.3492 | 0.1552 | 0.3033 | 0.3036 | 20.0 |
| 2 | No log | 2.113911 | 0.3602 | 0.1663 | 0.3183 | 0.3183 | 20.0 |
| 3 | No log | 2.069575 | 0.348 | 0.1679 | 0.307 | 0.3074 | 20.0 |
| 4 | No log | 2.03603 | 0.3503 | 0.1635 | 0.3075 | 0.3078 | 20.0 |
| 5 | 2.093100 | 2.009148 | 0.3551 | 0.1654 | 0.3109 | 0.3114 | 20.0 |
| 6 | 2.093100 | 2.013633 | 0.3525 | 0.1662 | 0.3079 | 0.3081 | 20.0 |
| 7 | 2.093100 | 2.013915 | 0.3505 | 0.1655 | 0.3075 | 0.3079 | 20.0 |
| 8 | 2.093100 | 2.028465 | 0.3478 | 0.1647 | 0.3056 | 0.306 | 20.0 |
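The `No log` entries are most likely an effect of the logging interval rather than a training problem. Assuming the `Trainer` default of `logging_steps = 500` (the training arguments above do not override it), the first training loss is logged at step 500, which falls in epoch 5 of a run with 109 steps per epoch; the run ends at step 872, before a second log, so the same value is shown until the end. The sketch below reproduces that reasoning; `first_logged_epoch` is an illustrative helper, not a library function.

```python
import math

def first_logged_epoch(num_examples, batch_size, logging_steps=500):
    """Epoch (1-based) during which the first training-loss log is emitted."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(logging_steps / steps_per_epoch)

# TweetSumm: 869 examples with batch size 8 give 109 steps per epoch
tweetsumm_first_log = first_logged_epoch(869, 8)
print(tweetsumm_first_log)  # 5
```

Passing a smaller `logging_steps` to the training arguments would give a training loss for every epoch on a dataset of this size.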


The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: summary, text. If summary, text are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 108
  Batch size = 4
Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.0"
}


TrainOutput(global_step=872, training_loss=1.9018745772335508, metrics={'train_runtime': 554.4927, 'train_samples_per_second': 12.538, 'train_steps_per_second': 1.573, 'total_flos': 1798075302266880.0, 'train_loss': 1.9018745772335508, 'epoch': 8.0})
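In this run the validation loss bottoms out at epoch 5 and ROUGE-1 peaks at epoch 2, so the final epoch is not the best checkpoint under either criterion. A small sketch that picks the best epoch from the figures logged above (values copied from the training log; the tuple layout is only for illustration):

```python
# (epoch, validation loss, ROUGE-1) copied from the TweetSumm training log
tweetsumm_history = [
    (1, 2.199017, 0.3492),
    (2, 2.113911, 0.3602),
    (3, 2.069575, 0.3480),
    (4, 2.036030, 0.3503),
    (5, 2.009148, 0.3551),
    (6, 2.013633, 0.3525),
    (7, 2.013915, 0.3505),
    (8, 2.028465, 0.3478),
]

# Lowest validation loss and highest ROUGE-1 point to different epochs
best_by_loss = min(tweetsumm_history, key=lambda row: row[1])[0]
best_by_rouge1 = max(tweetsumm_history, key=lambda row: row[2])[0]
print(best_by_loss, best_by_rouge1)  # 5 2
```

With `transformers`, setting `load_best_model_at_end = True` together with `metric_for_best_model` in the training arguments (and a `save_strategy` that matches the evaluation strategy) is a common way to keep the best checkpoint instead of the last one; it was not used in this notebook.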

In [None]:
# Export the fine-tuned model (torch.save pickles the whole model object, not only the weights)
# No need to execute this if running the following cells in the same session

temp_file = "/content/drive/MyDrive/TextMining/bart_tweetsumm.pth"
torch.save(model_tweetsumm, temp_file)

Improvement achieved with fine-tuning (on validation set):

|  | **ROUGE-1** | **ROUGE-2** | **ROUGE-L** |
|---|---|---|---|
| **BART (non fine-tuned)** | 0.2250 | 0.0800 | 0.1883 |
| **BART (fine-tuned)** | 0.3478 | 0.1647 | 0.3056 |
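To compare the three validation tables on the same scale, the gain can be expressed relative to the non-fine-tuned score. The snippet below is a plain arithmetic sketch over the ROUGE-1 values reported above; `relative_gain` is an illustrative helper.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before

# Validation ROUGE-1 before and after fine-tuning, from the three tables above
validation_rouge1 = {
    "SAMSum": (0.2662, 0.4847),
    "DialogSum": (0.2592, 0.4154),
    "TweetSumm": (0.2250, 0.3478),
}

gains = {name: relative_gain(b, a) for name, (b, a) in validation_rouge1.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")
```

On the validation sets the relative ROUGE-1 gain is roughly 82% for SAMSum, 60% for DialogSum and 55% for TweetSumm.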

##Evaluation on Test Set

After fine-tuning the most promising model on each dataset (in our case always BART), it is time to assess its performance on completely unseen data by running it on the *test* split of each dataset.

Besides the quantitative results, expressed in terms of ROUGE and its variants, several qualitative examples are shown, comparing the summaries generated by the model before and after fine-tuning on the specific dataset.
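As a reminder of what the metric measures, ROUGE-N is based on the n-grams shared by the generated summary and the reference. The block below is a deliberately simplified sketch (lowercasing and whitespace tokenization, no stemming), so its values will not match the scores computed by the evaluation library used in this notebook; the two example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams contained in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference, candidate, n=1):
    """Simplified ROUGE-N F1: lowercase, whitespace tokens, no stemming."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "nathan is planning on buying a bike in spring"
candidate = "nathan wants to buy a bike in spring"
print(round(rouge_n_f1(reference, candidate, n=1), 3))  # 0.588
print(round(rouge_n_f1(reference, candidate, n=2), 3))  # 0.4
```

ROUGE-L replaces the n-gram overlap with the longest common subsequence between the two texts, which rewards words appearing in the same order even when they are not adjacent.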

For the quantitative evaluation, the ROUGE scores of the fine-tuned model on the test set are reported and compared with the state-of-the-art results published for each dataset.

Note that, if the whole notebook is run in a single Google Colab session, there is no need to run the cells that import the saved models (which would require the professor to connect Google Drive and carry out additional tasks) and restore the trainer objects.

In [36]:
# Load the models saved after fine-tuning (torch.load restores the whole pickled model object)
# Needed only when the notebook is executed across several sessions

model_samsum = torch.load("/content/drive/MyDrive/TextMining/bart_samsum.pth")
model_dialogsum = torch.load("/content/drive/MyDrive/TextMining/bart_dialogsum.pth")
model_tweetsumm = torch.load("/content/drive/MyDrive/TextMining/bart_tweetsumm.pth")

In [38]:
# Initialize a placeholder Seq2SeqTrainer around an already fine-tuned model:
# the trainer is only needed to compute predictions and metrics, not to train again

def init_trainer(model, dataset):
  training_args = Seq2SeqTrainingArguments(
    output_dir = "evaluation",
    evaluation_strategy = "epoch",
    learning_rate = 2e-5,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 4,
    weight_decay = 0.01,
    save_total_limit = 4,
    num_train_epochs = 4,
    predict_with_generate = True,
    push_to_hub = False
  )

  trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = dataset["train"],
    eval_dataset = dataset["validation"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics_training
  )

  return trainer

In [39]:
# Use the trainer to run inference on every sample and decode the generated
# token ids into the corresponding summaries

def get_summaries(trainer, dataset):
  out = trainer.predict(dataset)
  # batch_decode turns each row of generated token ids into a string
  return tokenizer.batch_decode(out.predictions, skip_special_tokens = True)

In [40]:
# Select 5 random samples, and for each of them print the full text, the summary
# from the dataset, the summary generated by the model before fine-tuning and the
# summary generated by the fine-tuned model

def show_sample_summaries(dataset, summaries, pipeline):
  for i in random.sample(range(len(summaries)), 5):
    print("### DIALOGUE", str(i), " ###")
    print("TEXT")
    print(dataset[i]["text"])
    print("SUMMARY - GROUNDTRUTH")
    print(dataset[i]["summary"])
    print("SUMMARY - NON FINE-TUNED")
    print(pipeline(dataset[i]["text"])[0]["summary_text"])
    print("SUMMARY - FINE-TUNED")
    print(summaries[i])
    print()
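`random.sample` picks different dialogues on every run, so the printed examples change between executions. If reproducible examples are preferred, one option is a dedicated, seeded generator; the sketch below shows the idea (`pick_indices` and the seed value are illustrative and not part of the notebook).

```python
import random

def pick_indices(num_items, k=5, seed=42):
    """Return k distinct indices, identical across runs for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.sample(range(num_items), k)

indices = pick_indices(819)  # 819 = size of the SAMSum test split
print(indices)
```

Using a local `random.Random` instance avoids changing the global random state that other cells may rely on.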

For each dataset, a table shows the three metrics (ROUGE-1, ROUGE-2 and ROUGE-L) reached by our fine-tuned BART model on the test set, next to those of the state-of-the-art model reported in the paper associated with the dataset.

###SAMSum

In [None]:
trainer_samsum = init_trainer(model_samsum, tokenized_samsum);

In [None]:
samsum_summaries = get_summaries(trainer_samsum, tokenized_samsum["test"])

In [None]:
show_sample_summaries(tokenized_samsum["test"], samsum_summaries, pipe_bart)

### DIALOGUE 726  ###
TEXT
Nathan i want to buy myself a bike in springAubrey thats great but where are you gonna keep it? Your apartment is so smallNathan i was thinking of hanging it on the wall, there are some special hooksAubrey you can always keep it in the hallwayNathan i dont want to, people who do that annoy me, its hard to walk around with all these bikes striped to the handrailsAubrey i agree... didnt think about thatNathan yeah, well I also got a stationary bike so I can be in shape during winter DAubrey really? I am so proud of you!!Nathan ye, I do like 25 kilometers everydayAubrey thats a lot!Nathan my goal for the summer is 100 kilometersAubrey fingers crossed!
SUMMARY - GROUNDTRUTH
Nathan is planning on buying a bike in spring. He will probably store the bike on some special hooks because his apartment is small. Nathan has also bought a stationary bike to keep fit.
SUMMARY - NON FINE-TUNED
Nathan i want to buy myself a bike in springAubrey thats great but where are you g

In [None]:
test_metrics_samsum = trainer_samsum.evaluate(tokenized_samsum["test"])
test_metrics_samsum

The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: summary, text, id. If summary, text, id are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 819
  Batch size = 4
Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.0"
}




{'eval_loss': 4.303366184234619,
 'eval_rouge1': 0.2629,
 'eval_rouge2': 0.0637,
 'eval_rougeL': 0.2152,
 'eval_rougeLsum': 0.2151,
 'eval_gen_len': 19.9634,
 'eval_runtime': 72.7131,
 'eval_samples_per_second': 11.263,
 'eval_steps_per_second': 2.819}

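On the SAMSum test set the model reaches a ROUGE-1 of 0.2629, far below the 0.4847 obtained on the validation set after fine-tuning and close to the 0.2662 of the non-fine-tuned model on validation; the evaluation loss (4.30, against a final validation loss of about 1.50) points the same way. These figures therefore do not confirm the improvement observed during training, and they suggest that the model evaluated in this run may not have carried the fine-tuned weights, which is worth checking before drawing conclusions. The table compares the test scores with the results reported in the paper associated with the dataset, obtained with the Fast Abs RL model.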

|  | **ROUGE-1** | **ROUGE-2** | **ROUGE-L** |
|---|---|---|---|
| **BART (fine-tuned)** | 0.2629 | 0.0637 | 0.2152 |
| **Fast Abs RL (sota)** | 0.4099 | 0.1772 | 0.3830 |

###DialogSum

In [None]:
trainer_dialogsum = init_trainer(model_dialogsum, tokenized_dialogsum);

In [None]:
dialogsum_summaries = get_summaries(trainer_dialogsum, tokenized_dialogsum["test"])

The following columns in the test set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: topic, text, id, summary. If topic, text, id, summary are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1500
  Batch size = 4
Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.0"
}


In [None]:
show_sample_summaries(tokenized_dialogsum["test"], dialogsum_summaries, pipe_bart)

### DIALOGUE 1029  ###
TEXT
#Person1# Is your city a historical place? #Person2# Not rally. 200 years ago, it was just a small insignificant village. #Person1# How did it grow into such a large place? #Person2# Large deposits of coal were found nearly and so many industries located themselves here. The village quickly grew into a key industrial centre. #Person1# As the city grew, it must have absorbed many village nearby. #Person2# Yes, it did. The names of those village survive as the names of parts of the city. #Person1# I see. Are there any building more than 200 years old in your city? #Person2# Oh, yes. Several of the buildings from the villages still survive. Many of them were inns for travelers and today survive as pubs. There was a castle near one village, so our city has a castle too. #Person1# Really? So your city does have some old history after all. 
SUMMARY - GROUNDTRUTH
#Person2#'s city was just a small insignificant village 200 years ago. It then grew into a key industri

In [None]:
test_metrics_dialogsum = trainer_dialogsum.evaluate(tokenized_dialogsum["test"])
test_metrics_dialogsum

The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: summary, text, topic, id. If summary, text, topic, id are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1500
  Batch size = 4
Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.0"
}




{'eval_loss': 4.182870864868164,
 'eval_rouge1': 0.2323,
 'eval_rouge2': 0.042,
 'eval_rougeL': 0.1951,
 'eval_rougeLsum': 0.195,
 'eval_gen_len': 20.0,
 'eval_runtime': 135.0429,
 'eval_samples_per_second': 11.108,
 'eval_steps_per_second': 2.777}

The evaluation performed above does not confirm on the test set the gain observed during training: ROUGE-1 is 0.2323, below both the 0.4154 reached on the validation set after fine-tuning and the 0.2592 of the non-fine-tuned model on validation, and the evaluation loss (4.18) is far above the final validation loss (1.14). As for SAMSum, this suggests that the model evaluated in this run may not have carried the fine-tuned weights, which is worth checking. The table compares our test scores with the state-of-the-art results presented in the paper associated with the dataset, which were also obtained with BART.

|  | **ROUGE-1** | **ROUGE-2** | **ROUGE-L** |
|---|---|---|---|
| **BART (fine-tuned)** | 0.2323 | 0.0420 | 0.1951 |
| **BART (sota)** | 0.4728 | 0.2118 | 0.4483 |

###TweetSumm

In [41]:
trainer_tweetsumm = init_trainer(model_tweetsumm, tokenized_tweetsumm);

In [42]:
tweetsumm_summaries = get_summaries(trainer_tweetsumm, tokenized_tweetsumm["test"])

The following columns in the test set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: text, summary. If text, summary are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 110
  Batch size = 4
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.0"
}


In [47]:
show_sample_summaries(tokenized_tweetsumm["test"], tweetsumm_summaries, pipe_bart)

### DIALOGUE 23  ###
TEXT
HPSupport Problems with Envy 7640 printer, set as default. HP doctor fixes problems, and then it doesnt work again. Updated drivers. <BR> Hey, Thanks for reaching out. I would love to help! Is there any error message that shows up on your printer or on your computer? .12 <BR> Send us a direct message by clicking on the link below. Cheers Mat 22 <LINK> <BR> HPSupport Cant get it to print at all from one computer. Frustrating ready to take it back. <LINK> <BR> Hi Josey, thanks for tweeting, I see from your tweet that you get User intervention required message while printing from your printer, 12 <BR> follow steps from this document <LINK> reply in direct message <LINK> <BR> HPSupport I have followed most of the steps in the document I currently have at least 6 hours troubleshooting the HP Envy printer. It should be plug and play way too complicated. <BR> HPSupport I have followed most of the steps in the document I currently have at least 6 hours troubleshooting



HPSupport Problems with Envy 7640 printer, set as default. HP doctor fixes problems, and then it doesnt work again. Updated drivers. <BR> Hey, Thanks for reaching out. I would love to help! Is there any error message that shows up on your printer or on your computer? .12 <BR></BR> Send us a direct message by clicking on the link below. Cheers Mat 22 <LINK> < BR> HPSUpport Cant get it to print at all from one computer. Frustrating ready to take it back. <Link> <br> Hi
SUMMARY - FINE-TUNED
Customer is complaining about the HP Envy 7640 printer which is set as default.

### DIALOGUE 66  ###
TEXT
Delta Sky Club Guest for Platinum Medallion any flight or SkyTeamoperated? has to be international as well? mine is Delta international <BR> I am happy to look into this, Edouard. HCA <BR> Hi there, in order to receive the 29 rate, your companion would have to be traveling in the same reservation as you. HCA <BR> Delta wait what rate? <LINK> states that Platinum get one guest. I want to know what 

In [None]:
test_metrics_tweetsumm = trainer_tweetsumm.evaluate(tokenized_tweetsumm["test"])
test_metrics_tweetsumm

The following columns in the evaluation set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: summary, text. If summary, text are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 110
  Batch size = 4
Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.26.0"
}




{'eval_loss': 2.1274261474609375,
 'eval_rouge1': 0.3311,
 'eval_rouge2': 0.143,
 'eval_rougeL': 0.2897,
 'eval_rougeLsum': 0.2897,
 'eval_gen_len': 20.0,
 'eval_runtime': 18.5346,
 'eval_samples_per_second': 5.935,
 'eval_steps_per_second': 1.511,
 'epoch': 8.0}

As the output of the cell above shows, fine-tuning pays off on TweetSumm on the test set as well: ROUGE-1 is 0.3311, against 0.2250 for the non-fine-tuned model on the validation set. The table below compares our fine-tuned BART with the state-of-the-art results, which were also obtained with BART.

|  | **ROUGE-1** | **ROUGE-2** | **ROUGE-L** |
|---|---|---|---|
| **BART (fine-tuned)** | 0.3311 | 0.1430 | 0.2897 |
| **BART (sota)** | 0.3793 | 0.1926 | 0.3350 |
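To put the three test tables side by side, the test ROUGE-1 of our model can be expressed as a share of the reported state-of-the-art score. This is plain arithmetic over the figures in the tables of this section; the dictionary layout is only for illustration.

```python
# Test-set ROUGE-1: (fine-tuned BART in this notebook, reported state of the art)
test_rouge1 = {
    "SAMSum": (0.2629, 0.4099),
    "DialogSum": (0.2323, 0.4728),
    "TweetSumm": (0.3311, 0.3793),
}

share_of_sota = {name: ours / sota for name, (ours, sota) in test_rouge1.items()}
closest = max(share_of_sota, key=share_of_sota.get)

for name, share in share_of_sota.items():
    print(f"{name}: {share:.1%} of the reported ROUGE-1")
print("closest to the state of the art:", closest)
```

TweetSumm comes out closest, at roughly 87% of the reported ROUGE-1, compared with about 64% for SAMSum and 49% for DialogSum.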

##Conclusion

According to the metrics, comparisons and observations presented above, fine-tuning was worthwhile in almost all cases, with a significant gain in the metrics. On the validation sets, ROUGE-1 rose from 0.2662 to 0.4847 on SAMSum, from 0.2592 to 0.4154 on DialogSum and from 0.2250 to 0.3478 on TweetSumm. On the test sets the gain is confirmed for TweetSumm, which is also the dataset on which the model gets closest to the state of the art (0.3311 against 0.3793 ROUGE-1) and therefore gives the most satisfactory result; the SAMSum and DialogSum test scores reported above remain near the level of the non-fine-tuned model.

From a qualitative point of view, the generated summaries are not always consistent: the model often returns just the initial part of the text as the summary.
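One plausible contributor is the generation length: `Gen Len` stays at or just below 20 in every evaluation, and the logged `GenerationConfig` sets no `max_length`, so generation probably falls back to the library default of 20 tokens and cuts the summaries short. As a hedged sketch, the cap can be raised through the `generation_max_length` and `generation_num_beams` fields of `Seq2SeqTrainingArguments`; the value 64 is an untested assumption, not a tuned setting. The block only builds the keyword arguments, so it runs without the library.

```python
# Keyword arguments meant to be passed to Seq2SeqTrainingArguments(...)
# together with the other settings used in this notebook.
# The numeric values are illustrative assumptions, not validated settings.
generation_kwargs = {
    "predict_with_generate": True,  # already used in this notebook
    "generation_max_length": 64,    # raise the cap on generated tokens
    "generation_num_beams": 4,      # same beam count as in the logged config
}

for name, value in generation_kwargs.items():
    print(f"{name} = {value}")
```

A longer cap increases evaluation time, since every summary is generated token by token, so the value is a trade-off between summary completeness and runtime.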

As a further improvement of this project, it would be interesting to fine-tune the models with more computational power, which would allow training for more epochs and tuning the hyperparameters, such as the batch size, more exhaustively.