# Thesis Experiment: BART Model
## Michael LeVine, April 13, 2024

The purpose of this notebook is to test the summarization capabilities of the BART model.

Attribution: This approach is based in part on a training course by Janani Ravi, Certified Google Cloud Architect and Data Engineer, from the LinkedIn Learning course "AI Text Summarization with Hugging Face", released 10/30/2.  In addition, some of the code was derived from the Hugging Face transformers summarization page: https://huggingface.co/docs/transformers/en/tasks/summarization

## Overview: Using a Transformer Model from Hugging Face: BART

### The BART model
The pre-trained model that we will use is the "BART" model from Hugging Face, which can be found here: https://huggingface.co/docs/transformers/model_doc/bart

The "model card," which describes the model, notes that BART is a "sequence to sequence" transformer model.  The architecture includes a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
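The difference between a bidirectional encoder and a left-to-right decoder can be pictured as two attention masks. The following is a minimal sketch in plain Python, not code taken from the BART implementation; the function names are hypothetical and chosen only for illustration. A `1` means "position in this row may attend to position in this column."

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Print both masks for a 4-token sequence.
for name, mask in [("encoder (bidirectional)", bidirectional_mask(4)),
                   ("decoder (left-to-right)", causal_mask(4))]:
    print(name)
    for row in mask:
        print(" ", row)
```

The encoder mask is all ones, while the decoder mask is lower-triangular, so each generated token sees only the tokens before it.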

## Verifying the Compute Environment

### Graphics Processing Unit (GPU)
Running inference on transformer models can be done without a GPU.  However, for training, a GPU is recommended.  The following block of code checks whether a GPU is available for use in a PyTorch environment.

In [1]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
    
print("using", device, "device")

using cuda device


## Installing and importing required libraries and dependencies

In [2]:
#install the required libraries and dependencies from the command line with pip
#the transformers library allows us to access the pre-trained BART model
#the datasets library provides access to the Hugging Face datasets
#the evaluate library enables us to evaluate the summaries the model produces
#the rouge_score package implements ROUGE, a standard evaluation metric used in text summarization tasks
#the accelerate library allows for distributed training on GPUs
#pip install transformers datasets evaluate rouge_score accelerate
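As an optional check after installation, the sketch below reports which of the required packages are present and at what version. It uses only the standard library (`importlib.metadata`); the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the names passed to `pip install` above.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ["transformers", "datasets", "evaluate", "rouge_score", "accelerate"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report one line per package.
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```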

### Import the transformers library

In [2]:
#this code imports needed libraries

import transformers
import datasets
import evaluate
import rouge_score
import accelerate

print(transformers.__version__) # verifies the transformers version

4.32.1


## Importing, Reducing, and Exploring the dataset

The experiment will use the CNN/Daily Mail dataset.  Two datasets will be created:
* Training Dataset
* Holdout Dataset (for running inference on the model)


### Instantiating a training dataset

In [4]:
#loading the dataset which was previously saved
from datasets import load_dataset
cnn_news_summary_ds = load_dataset("arrow", data_files={'train': 'data/cnn_news_summary_ds/train/data-00000-of-00001.arrow', 'test': 'data/cnn_news_summary_ds/test/data-00000-of-00001.arrow'})

In [5]:
cnn_news_summary_ds

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 2296
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 575
    })
})

The full cnn_dailymail dataset contains the following three components:


* training data (approx. 287,000 records)
* validation data (approx. 13,400 records)
* test data (approx. 11,500 records)



As the above output shows, the reduced dataset loaded here is broken down into two components:
* a `train` (training) dataset of 2296 articles
* a `test` dataset of 575 articles
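The notebook does not say how the reduced dataset was produced. As a sanity check only, the sketch below shows that the two sizes are consistent with an 80/20 split of 2,871 records when the test share is rounded up. Both the split ratio and the rounding rule are assumptions, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_records, test_fraction):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_records * test_fraction)
    return n_records - n_test, n_test


total = 2296 + 575  # 2871 records in the reduced dataset
print(split_sizes(total, 0.2))  # (2296, 575)
```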

### Instantiating a holdout dataset (200 records)

The holdout dataset is used to run inference on both the "off-the-shelf" model and the fine-tuned model.  Holding these records out means the model is evaluated on data it was not trained on, which gives a fairer test of its performance.

In [32]:
#load holdout set for inference from a local csv (to ensure same order)
cnn_holdout_ds = load_dataset ("csv", data_files='data/cnn_holdout_ds.csv', split = "train[0:200]")
cnn_holdout_ds

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 200
})
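Because the holdout set is only useful if it does not overlap the training data, a quick check on the `id` column can confirm this. The sketch below uses toy ids so that it runs on its own; `overlapping_ids` is a hypothetical helper. With the real data, the ids would come from `cnn_news_summary_ds["train"]["id"]` and `cnn_holdout_ds["id"]`.

```python
def overlapping_ids(train_ids, holdout_ids):
    """Return the sorted ids that appear in both collections."""
    return sorted(set(train_ids) & set(holdout_ids))


# Toy ids for illustration only.
train_ids = ["a1", "b2", "c3"]
holdout_ids = ["d4", "e5"]

leaked = overlapping_ids(train_ids, holdout_ids)
print("overlapping ids:", leaked)
```

An empty list means no article appears in both sets.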

### Exploring the Dataset

In [33]:
#dataset shape
cnn_news_summary_ds.shape

{'train': (2296, 3), 'test': (575, 3)}

The above output shows that the `train` set of the reduced cnn_dailymail dataset is 2296 rows x 3 columns and the `test` set is 575 rows x 3 columns.

In [34]:
#dataset object type
type(cnn_news_summary_ds)

datasets.dataset_dict.DatasetDict

The above output shows that the cnn_dailymail dataset object is of type `DatasetDict`, from the datasets library.

In [35]:
#dataset structure
cnn_news_summary_ds

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 2296
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 575
    })
})

The Dataset has three features:
* `article`: the full text of the news article
* `highlights`: the target summary, also known as the reference summary
* `id`: the unique id for each article/highlights pair


In [36]:
#looking at the features of the dataset
cnn_news_summary_ds['train'].features

{'article': Value(dtype='string', id=None),
 'highlights': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [37]:
#examining the first record of the dataset.  
cnn_news_summary_ds['train'][0]

{'article': '(CNN) -- After almost 10 months, the FBI has zeroed in on a suspect in the case of missing Florida pilot Robert Wiles, who may have been kidnapped for ransom. Missing Florida pilot Robert Wiles is thought to have been kidnapped for ransom. "We\'re close to solving the case," said FBI special agent David Couvertier. He would not elaborate. Agents also would not identify the suspect, and they said the person is not in custody. Investigators would only reveal that the "key suspect" is in Florida, either in Orlando, Lakeland or Melbourne. "They\'re holding that back in hopes of getting additional information," said Couvertier. The FBI says it\'s also looking at several persons of interest in those same three Florida cities. Wiles, 27, was last seen in the family\'s aircraft maintenance business, National Flight Services, at Lakeland Linder Regional Airport on April 1, 2008. The day Wiles disappeared, he left behind his bags, his computer, and even his car. His father says the 

In [39]:
#examining the first record of the holdout dataset.  
cnn_holdout_ds[0]

{'article': "Manchester United have fallen off their perch. And they’re dropping like a stone towards mediocrity. That is the undeniable fact that has been hammered home relentlessly during the past six months. Whether we are talking about the events of Wednesday night at Olympiacos or before the startled eyes of the faithful at Old Trafford, the evidence is there for all to see. Can't stop the slump: David Moyes cannot believe it as he watches Manchester United lose at Olympiacos . Down and almost out: Robin van Persie lies on the floor during a defeat which sees United's Champions League campaign hanging by a thread . Disbelief: Wayne Rooney cries out in vain during another shambolic United display . Abject: The frustration shows on the Man United players' faces on taking the restart after conceding to Olympiacos . Coming to get you: Liverpool are looking to take United's place in the top four . Now is it time for Man United to sack Moyes? Out of the title race, out of the FA . Cup, 

In [40]:
cnn_holdout_ds[120]

{'article': "Every day Sportsmail takes a look at the European papers to see what are the biggest stories creating talking points on the continent. On Saturday, Italian newspapers Tuttosport and Corriere dello Sport both lead with reports that Juventus manager Antonio Conte could be set to leave the club this summer. Tuttosport claim that Conte, who has just led Juve to a third consecutive Serie A title but has failed to make progress in the Champions League, and club president Andrea Agnelli seem very distant and CDS says Juve and Conte are 'moving apart'. Uncertain future: Juventus manager Antonio Conte could leave the club this summer . Triple crown: Conte has led Juventus to three consecutive Serie A league titles . Conte has been linked with Monaco, who are also reportedly keen on Arsenal manager Arsene Wenger and Benfica's Jorge Jesus. Elsewhere in Italy, La Gazzetta dello Sport pays tribute to Inter Milan right-back Javier Zanetti who is set to retire at the end of the season, l

## Preprocessing and Cleaning the Data

In [41]:
#defining a text cleaning function.  This function iterates over the 'article'
# and 'highlights' section and replaces various text strings (like
#backslashes, new lines, etc.) with the empty string

def clean_txt(example):
  for txt in ['article', 'highlights']:
    example[txt] = example[txt].lower() #convert text to lowercase
    example[txt] = example[txt].replace('\\','')
    example[txt] = example[txt].replace('/','')
    example[txt] = example[txt].replace('\n','')
    example[txt] = example[txt].replace('``','')
    example[txt] = example[txt].replace('"','')
    example[txt] = example[txt].replace('--','')
  return example
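To make the effect of the cleaning steps concrete, the sketch below applies the same replacements, in the same order, to a small made-up record. The toy text is an assumption for illustration and does not come from the dataset.

```python
def clean_txt(example):
    # Same steps as the notebook's cleaning function, written as a loop.
    for txt in ["article", "highlights"]:
        example[txt] = example[txt].lower()
        for old in ["\\", "/", "\n", "``", '"', "--"]:
            example[txt] = example[txt].replace(old, "")
    return example


toy = {"article": '(CNN) -- "Hello"\nWorld', "highlights": "A/B test"}
cleaned = clean_txt(dict(toy))  # copy so the toy record is left unchanged

print(repr(cleaned["article"]))     # '(cnn)  helloworld'
print(repr(cleaned["highlights"]))  # 'ab test'
```

Note that replacing `\n` and `/` with the empty string joins the neighbouring words ("helloworld", "ab"). If that is not desired, replacing them with a single space would be one alternative.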

### Mapping the text cleaning function to the training dataset and the holdout dataset

In [42]:
#Hugging Face datasets provide a .map() method that
#applies a function to every record in a dataset and
#returns the updated dataset.  In effect, the .map() method
#maps the `clean_txt` function to
#all the records in the `cnn_news_summary_ds` dataset.
#The result is that the training and
#test data are now cleaned
cleaned_cnn_news_summary_ds = cnn_news_summary_ds.map(clean_txt)

cleaned_cnn_news_summary_ds

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 2296
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 575
    })
})

In [43]:
#verifying that we have a clean training dataset by comparing a record from the original
#dataset to the cleaned dataset
print('\n\n==== Original dataset ====\n\n')

print (cnn_news_summary_ds['train']['article'][0])

print('\n\n==== Cleaned dataset ====\n\n')

cleaned_cnn_news_summary_ds['train']['article'][0]



==== Original dataset ====


(CNN) -- After almost 10 months, the FBI has zeroed in on a suspect in the case of missing Florida pilot Robert Wiles, who may have been kidnapped for ransom. Missing Florida pilot Robert Wiles is thought to have been kidnapped for ransom. "We're close to solving the case," said FBI special agent David Couvertier. He would not elaborate. Agents also would not identify the suspect, and they said the person is not in custody. Investigators would only reveal that the "key suspect" is in Florida, either in Orlando, Lakeland or Melbourne. "They're holding that back in hopes of getting additional information," said Couvertier. The FBI says it's also looking at several persons of interest in those same three Florida cities. Wiles, 27, was last seen in the family's aircraft maintenance business, National Flight Services, at Lakeland Linder Regional Airport on April 1, 2008. The day Wiles disappeared, he left behind his bags, his computer, and even his car. His fa


==== Cleaned dataset ====


"(cnn)  after almost 10 months, the fbi has zeroed in on a suspect in the case of missing florida pilot robert wiles, who may have been kidnapped for ransom. missing florida pilot robert wiles is thought to have been kidnapped for ransom. we're close to solving the case, said fbi special agent david couvertier. he would not elaborate. agents also would not identify the suspect, and they said the person is not in custody. investigators would only reveal that the key suspect is in florida, either in orlando, lakeland or melbourne. they're holding that back in hopes of getting additional information, said couvertier. the fbi says it's also looking at several persons of interest in those same three florida cities. wiles, 27, was last seen in the family's aircraft maintenance business, national flight services, at lakeland linder regional airport on april 1, 2008. the day wiles disappeared, he left behind his bags, his computer, and even his car. his father says the next day, wiles was supp

In [44]:
cleaned_cnn_holdout_ds = cnn_holdout_ds.map(clean_txt)

cleaned_cnn_holdout_ds

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 200
})

In [45]:
#verifying that we have a clean holdout dataset by comparing a record from the original
#holdout dataset to the cleaned holdout dataset
print('\n\n==== Original holdout dataset ====\n\n')

print (cnn_holdout_ds['article'][0])

print('\n\n==== Cleaned holdout dataset ====\n\n')

cleaned_cnn_holdout_ds['article'][0]



==== Original holdout dataset ====


Manchester United have fallen off their perch. And they’re dropping like a stone towards mediocrity. That is the undeniable fact that has been hammered home relentlessly during the past six months. Whether we are talking about the events of Wednesday night at Olympiacos or before the startled eyes of the faithful at Old Trafford, the evidence is there for all to see. Can't stop the slump: David Moyes cannot believe it as he watches Manchester United lose at Olympiacos . Down and almost out: Robin van Persie lies on the floor during a defeat which sees United's Champions League campaign hanging by a thread . Disbelief: Wayne Rooney cries out in vain during another shambolic United display . Abject: The frustration shows on the Man United players' faces on taking the restart after conceding to Olympiacos . Coming to get you: Liverpool are looking to take United's place in the top four . Now is it time for Man United to sack Moyes? Out of the title r


==== Cleaned holdout dataset ====


"manchester united have fallen off their perch. and they’re dropping like a stone towards mediocrity. that is the undeniable fact that has been hammered home relentlessly during the past six months. whether we are talking about the events of wednesday night at olympiacos or before the startled eyes of the faithful at old trafford, the evidence is there for all to see. can't stop the slump: david moyes cannot believe it as he watches manchester united lose at olympiacos . down and almost out: robin van persie lies on the floor during a defeat which sees united's champions league campaign hanging by a thread . disbelief: wayne rooney cries out in vain during another shambolic united display . abject: the frustration shows on the man united players' faces on taking the restart after conceding to olympiacos . coming to get you: liverpool are looking to take united's place in the top four . now is it time for man united to sack moyes? out of the title race, out of the fa . cup, out of the l

## Instantiating an "off-the-shelf" BART model and tokenizer

In [46]:
from transformers import pipeline, BartForConditionalGeneration, TrainingArguments, Trainer, \
                         DataCollatorForSeq2Seq, BartTokenizer
import pandas as pd
from datasets import Dataset
import random

In [47]:
from transformers import AutoTokenizer

checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [48]:
#looking at model architecture
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=

## Preprocessing Datasets

In [49]:
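#define a preprocessing function that tokenizes the articles (model inputs)
#and the highlights (labels)

#a task prefix such as "summarize: " is a T5 convention and is not used with BART
#prefix = "summarize: "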


def preprocess_function(examples):
    #inputs = [prefix + doc for doc in examples["article"]]
    inputs = [doc for doc in examples["article"]]
    #model_inputs = tokenizer(inputs, max_length=1000, truncation=True)
    model_inputs = tokenizer(inputs, truncation=True)

    #labels = tokenizer(text_target=examples["highlights"], max_length=256, truncation=True)
    labels = tokenizer(text_target=examples["highlights"], truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

### Preprocessing the training dataset

In [50]:
tokenized_cnn_training_ds = cleaned_cnn_news_summary_ds.map(preprocess_function, batched=True)
tokenized_cnn_training_ds

Map:   0%|          | 0/575 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2296
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 575
    })
})

### Preprocessing the holdout dataset

In [51]:
tokenized_cnn_holdout_ds = cleaned_cnn_holdout_ds.map(preprocess_function, batched=True)
tokenized_cnn_holdout_ds

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Dataset({
    features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 200
})

## Instantiating a Data Collator

In [52]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
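The data collator pads every example in a batch to the length of the longest one at batching time. The following is a simplified sketch of that idea in plain Python, not the library's implementation. It assumes a padding id of 1 for inputs (matching `padding_idx=1` in the model printout above) and -100 for labels, the value the loss function ignores. The token ids are made up and `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_value):
    """Right-pad each list of ids to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


PAD_TOKEN_ID = 1     # assumed input padding id for BART
LABEL_PAD_ID = -100  # label positions with this value are skipped by the loss

input_ids = [[0, 11, 12, 13, 2], [0, 21, 2]]  # toy ids
labels = [[0, 31, 2], [0, 41, 42, 43, 2]]     # toy ids

print(pad_batch(input_ids, PAD_TOKEN_ID))
print(pad_batch(labels, LABEL_PAD_ID))
```

Padding per batch rather than to a global maximum keeps batches of short articles small.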

## Importing ROUGE Metric
*  ROUGE is a standard evaluation metric used in text summarization tasks
*  ROUGE provides an *objective metric* to compare a model-produced summary with the dataset's reference summary.  
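
To illustrate what ROUGE-N measures, the sketch below computes a simplified n-gram overlap F1 score between a prediction and a reference. It is an approximation for illustration only: it splits on whitespace and applies no stemming, so its numbers will not match the `rouge_score` package exactly. The function name `rouge_n_f1` is hypothetical.

```python
from collections import Counter


def rouge_n_f1(prediction, reference, n=1):
    """Simplified ROUGE-N F1: whitespace tokens, no stemming."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n_f1(prediction, reference, n=1))  # 5 of 6 unigrams match
print(rouge_n_f1(prediction, reference, n=2))  # 3 of 5 bigrams match
```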

### Importing the evaluate library

The 'evaluate' library from Hugging Face allows us to evaluate ML models.  The 'evaluate' library provides access to dozens of evaluation metrics across many ML domains (including NLP, computer vision, etc.).

In [53]:
import evaluate

rouge = evaluate.load("rouge")
rouge

EvaluationModule(name: "rouge", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id=None)}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLsum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/

## Running inference on holdout set with "off-the-shelf" BART model

In [54]:
#creating 'holdout_article_texts' variable to hold articles from the holdout set
holdout_article_texts = tokenized_cnn_holdout_ds['article']

#creating 'holdout_article_summaries' variable to hold summaries from the holdout set
holdout_article_summaries = tokenized_cnn_holdout_ds["highlights"]


In [55]:
tokenized_cnn_holdout_ds

Dataset({
    features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 200
})

In [56]:
tokenized_cnn_holdout_ds[0]['input_ids']

[0,
 397,
 17419,
 10409,
 33,
 4491,
 160,
 49,
 228,
 611,
 4,
 8,
 51,
 17,
 27,
 241,
 6614,
 101,
 10,
 7326,
 1567,
 18422,
 32508,
 1571,
 4,
 14,
 16,
 5,
 29941,
 754,
 14,
 34,
 57,
 22355,
 184,
 29836,
 148,
 5,
 375,
 411,
 377,
 4,
 549,
 52,
 32,
 1686,
 59,
 5,
 1061,
 9,
 18862,
 46836,
 363,
 23,
 1021,
 31434,
 9504,
 366,
 50,
 137,
 5,
 37747,
 2473,
 9,
 5,
 15828,
 23,
 793,
 30246,
 3109,
 6,
 5,
 1283,
 16,
 89,
 13,
 70,
 7,
 192,
 4,
 64,
 75,
 912,
 5,
 13254,
 35,
 44009,
 475,
 2160,
 293,
 1395,
 679,
 24,
 25,
 37,
 11966,
 313,
 17419,
 10409,
 2217,
 23,
 1021,
 31434,
 9504,
 366,
 479,
 159,
 8,
 818,
 66,
 35,
 16785,
 179,
 3538,
 20187,
 324,
 5738,
 15,
 5,
 1929,
 148,
 10,
 3002,
 61,
 3681,
 10409,
 18,
 4739,
 1267,
 637,
 7209,
 30,
 10,
 15019,
 479,
 26440,
 35,
 169,
 858,
 4533,
 6071,
 25355,
 66,
 11,
 25876,
 148,
 277,
 1481,
 3146,
 12589,
 10409,
 2332,
 479,
 4091,
 21517,
 35,
 5,
 8413,
 924,
 15,
 5,
 313,
 10409,
 472,
 108,
 

In [57]:
len(tokenized_cnn_holdout_ds[0]['input_ids'])

1024
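
The length of 1024 appears to reflect truncation to the maximum input length of `facebook/bart-base`, since `truncation=True` was passed without an explicit `max_length`. The sketch below illustrates the idea with toy ids. It assumes the start and end markers use ids 0 and 2 and count toward the limit; `truncate_with_special_tokens` is a hypothetical helper, not a tokenizer API.

```python
def truncate_with_special_tokens(token_ids, max_length, bos_id=0, eos_id=2):
    """Keep at most max_length ids in total, including the start and end markers."""
    body = token_ids[: max_length - 2]
    return [bos_id] + body + [eos_id]


MAX_LENGTH = 1024
long_article = list(range(100, 1600))  # 1,500 toy ids standing in for a long article
short_article = list(range(100, 110))  # 10 toy ids

print(len(truncate_with_special_tokens(long_article, MAX_LENGTH)))   # 1024
print(len(truncate_with_special_tokens(short_article, MAX_LENGTH)))  # 12
```

Any article text beyond the limit is simply not seen by the model, which matters for a 7,698-character article like the first holdout record.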

In [58]:
#instantiating a summarizer pipeline with an off-the-shelf BART model
summarizer = pipeline('summarization', model="facebook/bart-base", truncation=True) #added the truncation argument to the pipeline parameters

In [59]:
summarizer ("The researchers investigate the feasibility of using BART to enhance machine translation decoders for translating into English. Using pre-trained encoders has been proven to improve models, while the benefits of incorporating pre-trained language models into decoders have been more limited. Using a set of encoder parameters learned from bitext, they demonstrate that the entire BART model can be used as a single pretrained decoder for machine translation. More specifically, they swap out the embedding layer of BART's encoder with a brand new encoder using random initialization. When the model is trained from start to end, the new encoder is trained to map foreign words into an input BART can then translate into English. In both stages of training, the cross-entropy loss is backpropagated from the BART model's output to train the source encoder. In the first stage, they fix most of BART's parameters and only update the randomly initialized source encoder, the BART positional embeddings, and the self-attention input projection matrix of BART's encoder first layer. Second, they perform a limited number of training iterations on all model parameters.")

[{'summary_text': "The researchers investigate the feasibility of using BART to enhance machine translation decoders for translating into English. Using pre-trained encoders has been proven to improve models, while the benefits of incorporating pre-qualified language models into decoder have been more limited. Using a set of encoder parameters learned from bitext, they demonstrate that the entire BART model can be used as a single pretrained decoder for machine translation. More specifically, they swap out the embedding layer of BART's encoder with a brand new encoder using random initialization. When the model is trained from start to end, the new Encoder"}]

In [60]:
holdout_article_texts[0]

"manchester united have fallen off their perch. and they’re dropping like a stone towards mediocrity. that is the undeniable fact that has been hammered home relentlessly during the past six months. whether we are talking about the events of wednesday night at olympiacos or before the startled eyes of the faithful at old trafford, the evidence is there for all to see. can't stop the slump: david moyes cannot believe it as he watches manchester united lose at olympiacos . down and almost out: robin van persie lies on the floor during a defeat which sees united's champions league campaign hanging by a thread . disbelief: wayne rooney cries out in vain during another shambolic united display . abject: the frustration shows on the man united players' faces on taking the restart after conceding to olympiacos . coming to get you: liverpool are looking to take united's place in the top four . now is it time for man united to sack moyes? out of the title race, out of the fa . cup, out of the l

In [61]:
summarizer(holdout_article_texts[0])

[{'summary_text': "manchester united have fallen off their perch. and they’re dropping like a stone towards mediocrity. that is the undeniable fact that has been hammered home relentlessly during the past six months. whether we are talking about the events of wednesday night at olympiacos or before the startled eyes of the faithful at old trafford, the evidence is there for all to see. can't stop the slump: david moyes cannot believe it as he watches manchester united lose at o Olympiacos . down and almost out: robin van persie lies on the floor during a defeat which sees united"}]

In [62]:
type (cnn_holdout_ds[0])

dict

In [63]:
type(holdout_article_texts[0])

str

In [64]:
len(holdout_article_texts[0])

7698

In [73]:
#running inference on holdout set with off-the-shelf model
from tqdm import tqdm

#instantiating an empty list named 'holdout_off_the_shelf_summaries'
holdout_off_the_shelf_summaries = []

#prefix ='summarize: '

#the holdout set has 200 articles, so iterate over all of them
for text in tqdm(holdout_article_texts):
    #candidate = summarizer(prefix + text)
    candidate = summarizer(text)

    #keep only the generated summary text
    holdout_off_the_shelf_summaries.append(candidate[0]["summary_text"])

100%|█████████████████████████████████████████| 200/200 [08:00<00:00,  2.40s/it]


In [74]:
type(holdout_off_the_shelf_summaries)

list

In [75]:
holdout_off_the_shelf_summaries[0]

"manchester united have fallen off their perch. and they’re dropping like a stone towards mediocrity. that is the undeniable fact that has been hammered home relentlessly during the past six months. whether we are talking about the events of wednesday night at olympiacos or before the startled eyes of the faithful at old trafford, the evidence is there for all to see. can't stop the slump: david moyes cannot believe it as he watches manchester united lose at o Olympiacos . down and almost out: robin van persie lies on the floor during a defeat which sees united"

In [76]:
holdout_article_texts[0]

"manchester united have fallen off their perch. and they’re dropping like a stone towards mediocrity. that is the undeniable fact that has been hammered home relentlessly during the past six months. whether we are talking about the events of wednesday night at olympiacos or before the startled eyes of the faithful at old trafford, the evidence is there for all to see. can't stop the slump: david moyes cannot believe it as he watches manchester united lose at olympiacos . down and almost out: robin van persie lies on the floor during a defeat which sees united's champions league campaign hanging by a thread . disbelief: wayne rooney cries out in vain during another shambolic united display . abject: the frustration shows on the man united players' faces on taking the restart after conceding to olympiacos . coming to get you: liverpool are looking to take united's place in the top four . now is it time for man united to sack moyes? out of the title race, out of the fa . cup, out of the l

## Evaluating the off-the-shelf BART performance

In [77]:
#aggregated results of inference on holdout set
result_off_the_shelf_agg = rouge.compute(predictions = holdout_off_the_shelf_summaries,
                           references = holdout_article_summaries[:],
                           use_stemmer=True)

result_off_the_shelf_agg

{'rouge1': 0.37319685231845234,
 'rouge2': 0.16752622693870892,
 'rougeL': 0.23152644215843776,
 'rougeLsum': 0.23089810632891777}
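
The gap between `rouge1` and `rougeL` above reflects that ROUGE-L only rewards words that match in the same order, using the longest common subsequence (LCS). The sketch below is a simplified illustration with whitespace tokens and no stemming, so it will not reproduce the scores above; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 based on the LCS of whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))  # LCS of 5 tokens
print(rouge_l_f1("a b c", "c b a"))  # same words, reversed order: LCS of 1 token
```

In the second example every word matches, but the reversed order leaves an LCS of only one token, so the score drops to one third.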

In [78]:
#unaggregated results of inference on holdout set
result_off_the_shelf_unagg = rouge.compute(predictions = holdout_off_the_shelf_summaries,
                           references = holdout_article_summaries[:],
                           use_stemmer=True,
                             use_aggregator=False)

result_off_the_shelf_unagg

{'rouge1': [0.29665071770334933,
  0.3723404255319149,
  0.46280991735537197,
  0.5612244897959184,
  0.28125,
  0.2162162162162162,
  0.1565217391304348,
  0.4806201550387597,
  0.43076923076923074,
  0.4000000000000001,
  0.2948717948717949,
  0.5837837837837837,
  0.2753623188405797,
  0.5405405405405405,
  0.4965517241379311,
  0.3125,
  0.3728813559322034,
  0.3,
  0.33587786259541985,
  0.445859872611465,
  0.46153846153846156,
  0.4857142857142857,
  0.3384615384615385,
  0.4369747899159664,
  0.2658959537572254,
  0.28776978417266186,
  0.494949494949495,
  0.3542857142857143,
  0.1923076923076923,
  0.41860465116279066,
  0.29370629370629375,
  0.2105263157894737,
  0.45121951219512196,
  0.11475409836065575,
  0.26573426573426573,
  0.3567567567567568,
  0.40909090909090906,
  0.3057324840764331,
  0.33093525179856115,
  0.3896103896103896,
  0.9540229885057472,
  0.6666666666666666,
  0.3424657534246575,
  0.3865546218487395,
  0.38144329896907214,
  0.21138211382113822,
  0

## Creating DataFrames to hold new summaries

In [80]:
cleaned_cnn_holdout_df = pd.read_csv('data/data_frame_after_t5.csv')
cleaned_cnn_holdout_df

| Unnamed: 0.1 | Unnamed: 0 | article | highlights | id | t5_summaries | t5_fine_tuned_summaries |
|---|---|---|---|---|---|---|
| 0 | 0 | manchester united have fallen off their perch.... | manchester united were beaten 2-0 in their cha... | 5b3a626078390cb0e05327b4019753fd11cb8cea | manchester united lost 1-0 to olympiacos in th... | manchester united lost 1-0 to olympiacos in th... |
| 1 | 1 | a mother whose russian husband snatched their ... | rachael neustadt's sons - daniel, eight and jo... | 59d478d4a4299e2192997e56a9db9003fa2bac2d | rachael neustadt's sons daniel, eight, and jon... | rachael neustadt's sons daniel, eight, and jon... |
| 2 | 2 | claim: supporters of mayor lutfur rahman alleg... | islamic voters allegedly told to be 'good musl... | ec961b7d0912e7753dffe4360b77481eba96f2e1 | supporters of mayor lutfur rahman allegedly ha... | supporters of mayor lutfur rahman allegedly ha... |
| 3 | 3 | the 15-year-old cousin of a palestinian boy wh... | mohammed abu khder, 16, abducted and burned to... | 092d90d61eb105b3955820cc4894ac2c4995ad1b | mohammed abu khder, 16, was abducted from his ... | mohammed abu khder, 16, was burned to death in... |
| 4 | 4 | it may have made its way up the pole to become... | spearmint rhino records £2.1m loss in 2011 .lo... | d0d59018cdf48aaeb6e1838c0323f8555e800765 | spearmint rhino has recorded a loss of £2.1mil... | spearmint rhino has recorded a loss of £2.1m i... |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 195 | reality tv show the block has been accused of ... | 'the block' caught out faking a visit from the... | 985b1bf7fc710e4ffdd9dd02e72d889a7997e89d | reality tv show the block has been accused of ... | reality tv show the block has been accused of ... |
| 196 | 196 | the average cost of raising a child to seconda... | average cost of raising a child from birth up ... | e466296e19d7a14cf4916d70a2cbc296e4659c99 | average cost of raising a child to secondary s... | average cost of raising a child to secondary s... |
| 197 | 197 | thai police investigating the murder of two br... | pornprasit sukdam claims he was offered £13,30... | a07624a84fe59a3321e83f153d6fd615207a8545 | pornprasit sukdam claims he was offered 700,00... | pornprasit sukdam claims he was offered 700,00... |
| 198 | 198 | from clumpy flat shoes that seem to shorten a ... | clumpy flat shoes that seem to shorten a woman... | e16474b52bbf45f49434fc4a0b1d68e2d3fba3c3 | kim carillo says she feels surprisingly sexy i... | kim carillo, who usually favours a more alluri... |


In [81]:
cleaned_cnn_holdout_df['bart_summaries'] = holdout_off_the_shelf_summaries
cleaned_cnn_holdout_df

| Unnamed: 0.1 | Unnamed: 0 | article | highlights | id | t5_summaries | t5_fine_tuned_summaries | bart_summaries |
|---|---|---|---|---|---|---|---|
| 0 | 0 | manchester united have fallen off their perch.... | manchester united were beaten 2-0 in their cha... | 5b3a626078390cb0e05327b4019753fd11cb8cea | manchester united lost 1-0 to olympiacos in th... | manchester united lost 1-0 to olympiacos in th... | manchester united have fallen off their perch.... |
| 1 | 1 | a mother whose russian husband snatched their ... | rachael neustadt's sons - daniel, eight and jo... | 59d478d4a4299e2192997e56a9db9003fa2bac2d | rachael neustadt's sons daniel, eight, and jon... | rachael neustadt's sons daniel, eight, and jon... | a mother whose russian husband snatched their ... |
| 2 | 2 | claim: supporters of mayor lutfur rahman alleg... | islamic voters allegedly told to be 'good musl... | ec961b7d0912e7753dffe4360b77481eba96f2e1 | supporters of mayor lutfur rahman allegedly ha... | supporters of mayor lutfur rahman allegedly ha... | claim: supporters of mayor lutfur rahman alleg... |
| 3 | 3 | the 15-year-old cousin of a palestinian boy wh... | mohammed abu khder, 16, abducted and burned to... | 092d90d61eb105b3955820cc4894ac2c4995ad1b | mohammed abu khder, 16, was abducted from his ... | mohammed abu khder, 16, was burned to death in... | the 15-year-old cousin of a palestinian boy wh... |
| 4 | 4 | it may have made its way up the pole to become... | spearmint rhino records £2.1m loss in 2011 .lo... | d0d59018cdf48aaeb6e1838c0323f8555e800765 | spearmint rhino has recorded a loss of £2.1mil... | spearmint rhino has recorded a loss of £2.1m i... | it may have made its way up the pole to become... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 195 | reality tv show the block has been accused of ... | 'the block' caught out faking a visit from the... | 985b1bf7fc710e4ffdd9dd02e72d889a7997e89d | reality tv show the block has been accused of ... | reality tv show the block has been accused of ... | reality tv show the block has been accused of ... |
| 196 | 196 | the average cost of raising a child to seconda... | average cost of raising a child from birth up ... | e466296e19d7a14cf4916d70a2cbc296e4659c99 | average cost of raising a child to secondary s... | average cost of raising a child to secondary s... | the average cost of raising a child to seconda... |
| 197 | 197 | thai police investigating the murder of two br... | pornprasit sukdam claims he was offered £13,30... | a07624a84fe59a3321e83f153d6fd615207a8545 | pornprasit sukdam claims he was offered 700,00... | pornprasit sukdam claims he was offered 700,00... | thai police investigating the murder of two br... |
198,198,from clumpy flat shoes that seem to shorten a ...,clumpy flat shoes that seem to shorten a woman...,e16474b52bbf45f49434fc4a0b1d68e2d3fba3c3,kim carillo says she feels surprisingly sexy i...,"kim carillo, who usually favours a more alluri...",from clumpy flat shoes that seem to shorten a ...


## Fine-tuning the BART model

In [82]:
#defining a custom compute metrics function
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    #decoding the generated token ids back into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    #label positions marked -100 are ignored by the loss; swap in the pad token id so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    #average number of non-padding tokens in the generated summaries
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
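
The label handling in `compute_metrics` can be seen in isolation with a small sketch. The token ids below are made up for illustration, and the pad token id of 1 is an assumption standing in for `tokenizer.pad_token_id`; the notebook reads the real value from the tokenizer.

```python
import numpy as np

IGNORE_INDEX = -100  # value used to mask label positions that the loss skips
PAD_TOKEN_ID = 1     # assumption: stands in for tokenizer.pad_token_id

# Two illustrative label rows, padded to the same length with -100.
labels = np.array([
    [0, 713, 16, 2, -100, -100],
    [0, 250, 2, -100, -100, -100],
])

# Replace every -100 with the pad id so the rows can be passed to batch_decode.
restored = np.where(labels != IGNORE_INDEX, labels, PAD_TOKEN_ID)

# Same counting rule as gen_len: number of non-padding tokens per row.
token_counts = [int(np.count_nonzero(row != PAD_TOKEN_ID)) for row in restored]

print(restored)
print(token_counts)
```

Without the substitution, the tokenizer would be asked to decode the id -100, which is not in its vocabulary.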

In [83]:
training_args = Seq2SeqTrainingArguments(
    output_dir="bart_results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
    load_best_model_at_end=True, #even if we overtrain the model by accident, we will still
    #load the checkpoint that had the lowest evaluation loss

    #evaluation_strategy can be "steps" or "epoch" - it controls how often we pause training and evaluate the model
    eval_steps=50, #only used when evaluation_strategy="steps"; it has no effect with "epoch"
    save_strategy='epoch' #save the model after every epoch
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_cnn_training_ds["train"],
    eval_dataset=tokenized_cnn_training_ds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
    
)
trainer.evaluate() #baseline evaluation before fine-tuning (no max_new_tokens parameter is passed here)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 3.721768379211426,
 'eval_rouge1': 0.1283,
 'eval_rouge2': 0.0377,
 'eval_rougeL': 0.1025,
 'eval_rougeLsum': 0.1025,
 'eval_gen_len': 20.0,
 'eval_runtime': 25.7208,
 'eval_samples_per_second': 22.355,
 'eval_steps_per_second': 0.7}

In [84]:
training_args = Seq2SeqTrainingArguments(
    output_dir="bart_results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
    load_best_model_at_end=True, #even if we overtrain the model by accident, we will still
    #load the checkpoint that had the lowest evaluation loss

    #evaluation_strategy can be "steps" or "epoch" - it controls how often we pause training and evaluate the model
    eval_steps=50, #only used when evaluation_strategy="steps"; it has no effect with "epoch"
    save_strategy='epoch' #save the model after every epoch
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_cnn_training_ds["train"],
    eval_dataset=tokenized_cnn_training_ds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train() #fine-tuning the model; the metrics are computed at the end of every epoch

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,2.221561,0.2189,0.0858,0.1767,0.1763,20.0
2,No log,2.172081,0.2219,0.0873,0.1791,0.179,20.0
3,No log,2.13842,0.2251,0.0913,0.1827,0.1825,20.0
4,No log,2.135774,0.2238,0.0892,0.1814,0.1811,20.0




TrainOutput(global_step=288, training_loss=2.449014663696289, metrics={'train_runtime': 319.4605, 'train_samples_per_second': 28.748, 'train_steps_per_second': 0.902, 'total_flos': 5599819632476160.0, 'train_loss': 2.449014663696289, 'epoch': 4.0})
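
The numbers in the `TrainOutput` above can be cross-checked with a little arithmetic. The sketch below copies the values from that output and derives the steps per epoch, the approximate number of training samples, and the batch size they imply. The effective batch size of 32 is an inference from these numbers, not something the notebook states.

```python
import math

# Values copied from the TrainOutput printed above.
global_step = 288
num_train_epochs = 4
train_runtime = 319.4605           # seconds
train_samples_per_second = 28.748

# Optimizer steps taken in each epoch.
steps_per_epoch = global_step // num_train_epochs

# Throughput times runtime gives samples seen over all epochs.
samples_per_epoch = round(train_runtime * train_samples_per_second / num_train_epochs)

# Average number of samples consumed per optimizer step.
implied_batch = samples_per_epoch / steps_per_epoch

# Assumption being checked: an effective batch size of 32 reproduces the step count.
steps_if_batch_32 = math.ceil(samples_per_epoch / 32)

print(steps_per_epoch, samples_per_epoch, round(implied_batch, 1), steps_if_batch_32)
```

An implied batch of roughly 32 is twice `per_device_train_batch_size=16`, which would be consistent with training on two devices. The "No log" entries in the training-loss column are likely explained by the run having only 288 steps, fewer than the Trainer's default logging interval of 500 steps.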

In [85]:
trainer.save_model()

## Instantiating the fine-tuned model and running inference with that model

In [86]:
fine_tuned_checkpoint = "bart_results"

In [87]:
#instantiating the fine-tuned model
#fine_tuned_model = T5ForConditionalGeneration.from_pretrained(fine_tuned_checkpoint)
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained(fine_tuned_checkpoint)

In [88]:
#instantiating the fine-tuned tokenizer
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_checkpoint)

### Running inference on the holdout set with the fine-tuned BART model

In [89]:
#instantiating a summarizer pipeline with the fine-tuned BART model
fine_tuned_summarizer = pipeline('summarization', model=fine_tuned_checkpoint, truncation=True)

In [90]:
#running inference on the holdout set with the fine-tuned model
from tqdm import tqdm

#instantiating an empty list to collect the fine-tuned summaries
holdout_fine_tuned_summaries = []

#unlike T5, BART does not need a 'summarize: ' prefix on the input text

#the slice caps the run at 500 articles (the holdout set is smaller than that)
for text in tqdm(holdout_article_texts[:500]):

    candidate = fine_tuned_summarizer(text)

    holdout_fine_tuned_summaries.append(candidate[0]["summary_text"])

100%|█████████████████████████████████████████| 200/200 [04:07<00:00,  1.24s/it]


## Adding fine-tuned summaries to dataframe

In [91]:
cleaned_cnn_holdout_df['bart_fine_tuned_summaries'] = holdout_fine_tuned_summaries
cleaned_cnn_holdout_df

Unnamed: 0.1,Unnamed: 0,article,highlights,id,t5_summaries,t5_fine_tuned_summaries,bart_summaries,bart_fine_tuned_summaries
0,0,manchester united have fallen off their perch....,manchester united were beaten 2-0 in their cha...,5b3a626078390cb0e05327b4019753fd11cb8cea,manchester united lost 1-0 to olympiacos in th...,manchester united lost 1-0 to olympiacos in th...,manchester united have fallen off their perch....,manchester united have fallen off their perch....
1,1,a mother whose russian husband snatched their ...,"rachael neustadt's sons - daniel, eight and jo...",59d478d4a4299e2192997e56a9db9003fa2bac2d,"rachael neustadt's sons daniel, eight, and jon...","rachael neustadt's sons daniel, eight, and jon...",a mother whose russian husband snatched their ...,rachael neustadt and her two sons were freed i...
2,2,claim: supporters of mayor lutfur rahman alleg...,islamic voters allegedly told to be 'good musl...,ec961b7d0912e7753dffe4360b77481eba96f2e1,supporters of mayor lutfur rahman allegedly ha...,supporters of mayor lutfur rahman allegedly ha...,claim: supporters of mayor lutfur rahman alleg...,claim: supporters of mayor lutfur rahman hande...
3,3,the 15-year-old cousin of a palestinian boy wh...,"mohammed abu khder, 16, abducted and burned to...",092d90d61eb105b3955820cc4894ac2c4995ad1b,"mohammed abu khder, 16, was abducted from his ...","mohammed abu khder, 16, was burned to death in...",the 15-year-old cousin of a palestinian boy wh...,cousin of palestinian boy burned to death in i...
4,4,it may have made its way up the pole to become...,spearmint rhino records £2.1m loss in 2011 .lo...,d0d59018cdf48aaeb6e1838c0323f8555e800765,spearmint rhino has recorded a loss of £2.1mil...,spearmint rhino has recorded a loss of £2.1m i...,it may have made its way up the pole to become...,spearmint rhino has filed accounts showing tha...
...,...,...,...,...,...,...,...,...
195,195,reality tv show the block has been accused of ...,'the block' caught out faking a visit from the...,985b1bf7fc710e4ffdd9dd02e72d889a7997e89d,reality tv show the block has been accused of ...,reality tv show the block has been accused of ...,reality tv show the block has been accused of ...,reality tv show the block has been accused of ...
196,196,the average cost of raising a child to seconda...,average cost of raising a child from birth up ...,e466296e19d7a14cf4916d70a2cbc296e4659c99,average cost of raising a child to secondary s...,average cost of raising a child to secondary s...,the average cost of raising a child to seconda...,the average cost of raising a child to seconda...
197,197,thai police investigating the murder of two br...,"pornprasit sukdam claims he was offered £13,30...",a07624a84fe59a3321e83f153d6fd615207a8545,"pornprasit sukdam claims he was offered 700,00...","pornprasit sukdam claims he was offered 700,00...",thai police investigating the murder of two br...,thai police investigating the murder of two br...
198,198,from clumpy flat shoes that seem to shorten a ...,clumpy flat shoes that seem to shorten a woman...,e16474b52bbf45f49434fc4a0b1d68e2d3fba3c3,kim carillo says she feels surprisingly sexy i...,"kim carillo, who usually favours a more alluri...",from clumpy flat shoes that seem to shorten a ...,kim carillo tests some of the latest man-repel...


## Evaluating the fine-tuned BART performance

In [92]:
#aggregated results of inference on the holdout set using the fine-tuned model
result_fine_tuned_agg = rouge.compute(predictions = holdout_fine_tuned_summaries,
                           references = holdout_article_summaries[:],
                           use_stemmer=True)

result_fine_tuned_agg

{'rouge1': 0.3460776741089032,
 'rouge2': 0.1402276911998109,
 'rougeL': 0.2267815770232174,
 'rougeLsum': 0.2263745975677425}
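
For intuition about what the `rouge1` number measures, here is a minimal sketch of a unigram-overlap F1 score. It is a simplification, not the implementation behind `rouge.compute`: it splits on whitespace and applies no stemming, so its scores will not match the values above exactly. The function name `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between two texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Each shared word counts as many times as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())  # share of predicted words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

In this toy pair, five of the six words in each sentence overlap, so precision, recall, and F1 all equal 5/6.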

In [93]:
#unaggregated (per-example) results of inference on the holdout set using the fine-tuned model
result_fine_tuned_unagg = rouge.compute(predictions = holdout_fine_tuned_summaries,
                           references = holdout_article_summaries[:],
                           use_stemmer=True,
                           use_aggregator=False)


result_fine_tuned_unagg

{'rouge1': [0.14705882352941177,
  0.17391304347826086,
  0.4050632911392405,
  0.32592592592592595,
  0.3076923076923077,
  0.27999999999999997,
  0.275,
  0.45238095238095233,
  0.5569620253164557,
  0.32608695652173914,
  0.37777777777777777,
  0.4657534246575342,
  0.45,
  0.4421052631578947,
  0.3953488372093023,
  0.45945945945945943,
  0.34375,
  0.2947368421052632,
  0.38202247191011235,
  0.49523809523809526,
  0.39999999999999997,
  0.5777777777777778,
  0.21621621621621623,
  0.5671641791044776,
  0.2456140350877193,
  0.27272727272727276,
  0.3783783783783784,
  0.40310077519379844,
  0.27450980392156865,
  0.45614035087719296,
  0.38834951456310685,
  0.34615384615384615,
  0.45714285714285713,
  0.17073170731707318,
  0.37209302325581395,
  0.2777777777777778,
  0.380952380952381,
  0.2947368421052632,
  0.3714285714285714,
  0.5714285714285714,
  0.35714285714285715,
  0.8,
  0.37362637362637363,
  0.27777777777777773,
  0.33898305084745767,
  0.27027027027027023,
  0.31

## Findings: Off-the-shelf BART vs. fine-tuned BART

In [94]:
print('\n\n==== Running inference with off-the-shelf BART ====\n\n')

print (f'Aggregated Rouge Scores {result_off_the_shelf_agg}\n\n')

print('\n\n==== Running inference with fine-tuned BART ====\n\n')

print (f'Aggregated Rouge Scores {result_fine_tuned_agg}\n\n')




==== Running inference with off-the-shelf BART ====


Aggregated Rouge Scores {'rouge1': 0.37319685231845234, 'rouge2': 0.16752622693870892, 'rougeL': 0.23152644215843776, 'rougeLsum': 0.23089810632891777}




==== Running inference with fine-tuned BART ====


Aggregated Rouge Scores {'rouge1': 0.3460776741089032, 'rouge2': 0.1402276911998109, 'rougeL': 0.2267815770232174, 'rougeLsum': 0.2263745975677425}
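
The two result dictionaries can also be compared metric by metric. The sketch below copies the aggregated scores from the printed output above into literal dictionaries (in the notebook they are held in `result_off_the_shelf_agg` and `result_fine_tuned_agg`) and prints the change after fine-tuning.

```python
# Aggregated scores copied from the printed output above.
off_the_shelf = {
    "rouge1": 0.37319685231845234,
    "rouge2": 0.16752622693870892,
    "rougeL": 0.23152644215843776,
    "rougeLsum": 0.23089810632891777,
}
fine_tuned = {
    "rouge1": 0.3460776741089032,
    "rouge2": 0.1402276911998109,
    "rougeL": 0.2267815770232174,
    "rougeLsum": 0.2263745975677425,
}

# Positive values mean fine-tuning helped; negative values mean it hurt.
deltas = {metric: fine_tuned[metric] - off_the_shelf[metric] for metric in off_the_shelf}

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.4f}")
```

On this holdout set the fine-tuned model scores lower on all four metrics: by about 0.027 on `rouge1` and `rouge2`, and by about 0.005 on `rougeL` and `rougeLsum`.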




## Examining some of the summaries

In [95]:
#comparing the off-the-shelf and fine-tuned BART summaries for one holdout article
print('\n\n==== Article ====\n\n')

print (cleaned_cnn_holdout_ds['article'][10])

print('\n\n==== Reference summary ====\n\n')

print (cleaned_cnn_holdout_ds['highlights'][10])

print('\n\n==== Generated summary: off-the-shelf (w/o finetuning) ====\n\n')

print (holdout_off_the_shelf_summaries[10])

print('\n\n==== Generated summary: fine-tuned model ====\n\n')

print (holdout_fine_tuned_summaries[10])



==== Article ====


filter: new bt customers will have to actively chose whether to have parental control filters when they subscribe to the company's broadband service (library image) bt has announced all new customers will have parental control filters switched on when they subscribe to its broadband service. the company has offered the free parental controls service for a number of years, but this is the first time new customers will have to actively choose whether to turn the filters off. during the set-up process, a box that turns on the controls - which block a number of sites including pornography, those containing self-harm and violence as well as hate sites - will be automatically ticked. subscribers will then have to actively turn the blocks off and instead decide what level of protection - if any - they require. the company will also extend its filter service to all internet devices including games consoles and tablets. previously parents could only block potentially harmf

In [96]:
#comparing the off-the-shelf and fine-tuned BART summaries for another holdout article
print('\n\n==== Article ====\n\n')

print (cleaned_cnn_holdout_ds['article'][20])

print('\n\n==== Reference summary ====\n\n')

print (cleaned_cnn_holdout_ds['highlights'][20])

print('\n\n==== Generated summary: off-the-shelf (w/o finetuning) ====\n\n')

print (holdout_off_the_shelf_summaries[20])

print('\n\n==== Generated summary: fine-tuned model ====\n\n')

print (holdout_fine_tuned_summaries[20])



==== Article ====


anderlecht manager besnik hasi believes arsenal were worried about entertaining their fans rather than seeing the game out after throwing away a three-goal lead against the belgian side. the gunners left the emirates with just a point when anderlecht fought back from 3-0 down in the second-half to draw 3-3, a result which means arsene wenger's side are not yet guaranteed to reach the knockout stages of the champions league. hasi, who did not shake hands with wenger at full-time, claims the premier league side went in search of more goals in a bid to please supporters and paid for it, . anderlecht striker aleksadar mitrovic (right) heads a dramatic late equaliser for his side against arsenal . anderlecht manager besik hasi shouts instructions to his players on the touchline at the emirates . mitrovic runs off to celebrate his goal in front of anderlecht's travelling supporters at the emirates . 'arsenal tried to play the same way,' he told the daily telegraph. 'the

In [97]:
holdout_off_the_shelf_summaries[0]

"manchester united have fallen off their perch. and they’re dropping like a stone towards mediocrity. that is the undeniable fact that has been hammered home relentlessly during the past six months. whether we are talking about the events of wednesday night at olympiacos or before the startled eyes of the faithful at old trafford, the evidence is there for all to see. can't stop the slump: david moyes cannot believe it as he watches manchester united lose at o Olympiacos . down and almost out: robin van persie lies on the floor during a defeat which sees united"

In [98]:
fine_tuned_summarizer('Weeks after undergoing heart surgery, Gail Lawson found herself back in an operating room. Her incision wasn’t healing, and an infection was spreading.At a hospital in Ridgewood, N.J., Dr. Sidney Rabinowitz performed a complex, hourslong procedure to repair tissue and close the wound. While recuperating, Ms. Lawson phoned the doctor’s office in a panic. He returned the call himself and squeezed her in for an appointment the next day.“He was just so good with me, so patient, so kind,” she said.But the doctor was not in her insurance plan’s network of providers, leaving his bill open to negotiation by her insurer. Once back on her feet, Ms. Lawson received a letter from the insurer, UnitedHealthcare, advising that Dr. Rabinowitz would be paid $5,449.27 — a small fraction of what he had billed the insurance company. That left Ms. Lawson with a bill of more than $100,000.“I’m thinking to myself, ‘But this is why I had insurance,’” said Ms.')

[{'summary_text': 'Dr. Sidney Rabinowitz performed a complex, hourslong procedure to repair tissue and close the wound .The doctor was not in her insurance plan’s network of providers, leaving his bill open to negotiation .'}]

In [99]:
summarizer('Weeks after undergoing heart surgery, Gail Lawson found herself back in an operating room. Her incision wasn’t healing, and an infection was spreading.At a hospital in Ridgewood, N.J., Dr. Sidney Rabinowitz performed a complex, hourslong procedure to repair tissue and close the wound. While recuperating, Ms. Lawson phoned the doctor’s office in a panic. He returned the call himself and squeezed her in for an appointment the next day.“He was just so good with me, so patient, so kind,” she said.But the doctor was not in her insurance plan’s network of providers, leaving his bill open to negotiation by her insurer. Once back on her feet, Ms. Lawson received a letter from the insurer, UnitedHealthcare, advising that Dr. Rabinowitz would be paid $5,449.27 — a small fraction of what he had billed the insurance company. That left Ms. Lawson with a bill of more than $100,000.“I’m thinking to myself, ‘But this is why I had insurance,’” said Ms.')

[{'summary_text': 'Weeks after undergoing heart surgery, Gail Lawson found herself back in an operating room. Her incision wasn’t healing, and an infection was spreading.At a hospital in Ridgewood, N.J., Dr. Sidney Rabinowitz performed a complex, hourslong procedure to repair tissue and close the wound. While recuperating, Ms. Lawson phoned the doctor’s office in a panic. He returned the call himself and squeezed her in for an appointment the next day.“He was just so good with me, so patient, so kind,” she said.But the'}]

## Exporting dataframes as .csv

In [102]:
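#note: to_csv writes the index by default, so each save/reload adds another 'Unnamed: 0' column; passing index=False would avoid this
cleaned_cnn_holdout_df.to_csv('data/data_frame_after_t5_bart.csv')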

In [103]:
test_df = pd.read_csv('data/data_frame_after_t5_bart.csv')
test_df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,article,highlights,id,t5_summaries,t5_fine_tuned_summaries,bart_summaries,bart_fine_tuned_summaries
0,0,0,manchester united have fallen off their perch....,manchester united were beaten 2-0 in their cha...,5b3a626078390cb0e05327b4019753fd11cb8cea,manchester united lost 1-0 to olympiacos in th...,manchester united lost 1-0 to olympiacos in th...,manchester united have fallen off their perch....,manchester united have fallen off their perch....
1,1,1,a mother whose russian husband snatched their ...,"rachael neustadt's sons - daniel, eight and jo...",59d478d4a4299e2192997e56a9db9003fa2bac2d,"rachael neustadt's sons daniel, eight, and jon...","rachael neustadt's sons daniel, eight, and jon...",a mother whose russian husband snatched their ...,rachael neustadt and her two sons were freed i...
2,2,2,claim: supporters of mayor lutfur rahman alleg...,islamic voters allegedly told to be 'good musl...,ec961b7d0912e7753dffe4360b77481eba96f2e1,supporters of mayor lutfur rahman allegedly ha...,supporters of mayor lutfur rahman allegedly ha...,claim: supporters of mayor lutfur rahman alleg...,claim: supporters of mayor lutfur rahman hande...
3,3,3,the 15-year-old cousin of a palestinian boy wh...,"mohammed abu khder, 16, abducted and burned to...",092d90d61eb105b3955820cc4894ac2c4995ad1b,"mohammed abu khder, 16, was abducted from his ...","mohammed abu khder, 16, was burned to death in...",the 15-year-old cousin of a palestinian boy wh...,cousin of palestinian boy burned to death in i...
4,4,4,it may have made its way up the pole to become...,spearmint rhino records £2.1m loss in 2011 .lo...,d0d59018cdf48aaeb6e1838c0323f8555e800765,spearmint rhino has recorded a loss of £2.1mil...,spearmint rhino has recorded a loss of £2.1m i...,it may have made its way up the pole to become...,spearmint rhino has filed accounts showing tha...
...,...,...,...,...,...,...,...,...,...
195,195,195,reality tv show the block has been accused of ...,'the block' caught out faking a visit from the...,985b1bf7fc710e4ffdd9dd02e72d889a7997e89d,reality tv show the block has been accused of ...,reality tv show the block has been accused of ...,reality tv show the block has been accused of ...,reality tv show the block has been accused of ...
196,196,196,the average cost of raising a child to seconda...,average cost of raising a child from birth up ...,e466296e19d7a14cf4916d70a2cbc296e4659c99,average cost of raising a child to secondary s...,average cost of raising a child to secondary s...,the average cost of raising a child to seconda...,the average cost of raising a child to seconda...
197,197,197,thai police investigating the murder of two br...,"pornprasit sukdam claims he was offered £13,30...",a07624a84fe59a3321e83f153d6fd615207a8545,"pornprasit sukdam claims he was offered 700,00...","pornprasit sukdam claims he was offered 700,00...",thai police investigating the murder of two br...,thai police investigating the murder of two br...
198,198,198,from clumpy flat shoes that seem to shorten a ...,clumpy flat shoes that seem to shorten a woman...,e16474b52bbf45f49434fc4a0b1d68e2d3fba3c3,kim carillo says she feels surprisingly sexy i...,"kim carillo, who usually favours a more alluri...",from clumpy flat shoes that seem to shorten a ...,kim carillo tests some of the latest man-repel...
