# Training a Summarization Model

Now let's see how we can use `HuggingFace` to train a summarization model on a new dataset. We'll use the SAMSum dataset.

In [2]:
from datasets import load_dataset


dataset_news = load_dataset("multi_news")

split_lengths = [len(dataset_news[split]) for split in dataset_news ]

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_news ['train'].column_names}")
print(f"\nDialogue:")
print(dataset_news["test"][0]["document"])
print("\nSummary")
print(dataset_news["test"][0]["summary"])

  0%|          | 0/3 [00:00<?, ?it/s]

Split lengths: [44972, 5622, 5622]
Features: ['document', 'summary']

Dialogue:
GOP Eyes Gains As Voters In 11 States Pick Governors 
  
 Enlarge this image toggle caption Jim Cole/AP Jim Cole/AP 
  
 Voters in 11 states will pick their governors tonight, and Republicans appear on track to increase their numbers by at least one, with the potential to extend their hold to more than two-thirds of the nation's top state offices. 
  
 Eight of the gubernatorial seats up for grabs are now held by Democrats; three are in Republican hands. Republicans currently hold 29 governorships, Democrats have 20, and Rhode Island's Gov. Lincoln Chafee is an Independent. 
  
 Polls and race analysts suggest that only three of tonight's contests are considered competitive, all in states where incumbent Democratic governors aren't running again: Montana, New Hampshire and Washington. 
  
 While those state races remain too close to call, Republicans are expected to wrest the North Carolina governorship fro

In [3]:
print(dataset_news.shape)

{'train': (44972, 2), 'validation': (5622, 2), 'test': (5622, 2)}


In [4]:
from transformers import pipeline

# Evaluate this using PEGASUS
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail", framework='pt')
pipe_out = pipe(dataset_news["test"][0]["summary"])
print("Summary:")
print(pipe_out[0]["summary_text"].replace(" .<n>", ".\n"))



Downloading (…)lve/main/config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Summary:
The GOP could end the night with control of more than two-thirds of the 50 states.
It's expected to keep the three Republican ones that are up for grabs.
 Races in Montana, New Hampshire, and Washington are still too close to call.
The results could have a big impact on health care, since a Supreme Court ruling grants states the ability to opt out of ObamaCare's Medicaid expansion .


# Evaluating the entire test set

We will need a way to compare the baseline PEGASUS model to the finetuned version. We'll create an evaluation loop for this.

In [5]:
from tqdm import tqdm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(list_of_elements, batch_size):
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def evaluate_summaries(dataset, metric, model, tokenizer,
                       batch_size=16, device=device,
                       column_text="article", column_summary="highlights"):
    article_batches = list(chunks(dataset[column_text], batch_size))
    target_batches = list(chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                        padding="max_length", return_tensors="pt")

        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                   attention_mask=inputs["attention_mask"].to(device),
                                   length_penalty=0.8, num_beams=8, max_length=128)

        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=True)
                             for s in summaries]

        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        
    return metric.compute(predictions=decoded_summaries, references=target_batch)

In [6]:
# Load the model directly
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_ckpt = "ainize/bart-base-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

Downloading (…)okenizer_config.json:   0%|          | 0.00/261 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/558M [00:00<?, ?B/s]

In [7]:
!pip install evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0
[0m

In [8]:
!pip install rouge_score

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25ldone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25ldone
[?25h  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24954 sha256=13560aed741026114ad4300dffd1e64f074bc6015684a508a96db4aa66f25085
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2
[0m

In [9]:
import evaluate

rouge_metric = evaluate.load("rouge")
score = evaluate_summaries(dataset_news["test"], rouge_metric, model,
                           tokenizer, column_text="document",
                           column_summary="summary", batch_size=8)

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

100%|██████████| 703/703 [40:57<00:00,  3.50s/it]


In [10]:
import pandas as pd

pd.DataFrame(score, index=["bart"])

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
bart,0.243496,0.106703,0.148415,0.177487


In order to fine tune this model, we need to be able to tokenize the data. We can also limit the lengths of each dialogue and summary to 1024 and 128, respectively.

In [11]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch["document"], truncation=True,
                                max_length=1024)

    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch["summary"], max_length=128,
                                     truncation=True)

    return {"input_ids": input_encodings["input_ids"],
            "attention_mask": input_encodings["attention_mask"],
            "labels": target_encodings["input_ids"]}

dataset_news_pt = dataset_news.map(convert_examples_to_features,
                                       batched=True)

columns = ["input_ids", "labels", "attention_mask"]
dataset_news_pt.set_format(type="torch", columns=columns)

  0%|          | 0/45 [00:00<?, ?ba/s]



  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

# Preparing a batch of data

When training `seq2seq` models, we need to apply "teacher forcing". The encoder will receive input tokens using the labels shifted by one as well as the encoder output. The prediction is then compared to the shifted labels to calculate the loss. To clarify, the decoder only sees the previous ground truth labels.

`HuggingFace` provides a `DataCollatorForSeq2Seq` class that handles this for us.

In [12]:
from transformers import DataCollatorForSeq2Seq

seq2seq_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [13]:
from transformers import TrainingArguments, Trainer

# Gradient accumulation saves memory by updating the model only every X batches
training_args = TrainingArguments(
    output_dir="bart-news", num_train_epochs=1, warmup_steps=200,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10, push_to_hub=False,
    evaluation_strategy="steps", eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16)

In [None]:
trainer = Trainer(model=model, args=training_args,
                  tokenizer=tokenizer, data_collator=seq2seq_collator,
                  train_dataset=dataset_news_pt["train"],
                  eval_dataset=dataset_news_pt["validation"])

trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
500,2.5999,2.434538
1000,2.5988,2.382715
1500,2.5474,2.353394
2000,2.4992,2.339948
2500,2.5042,2.326356


In [None]:
# Evaluate after finetuning
score = evaluate_summaries(
    dataset_news["test"], rouge_metric, trainer.model, tokenizer,
    batch_size=2, column_text="document", column_summary="summary")
pd.DataFrame(score, index=[f"bart_finetuned"])

In [None]:
sample_text = dataset_news["test"][0]["document"]
reference = dataset_news["test"][0]["summary"]

inputs = tokenizer(sample_text, max_length=1024, truncation=True,
                   padding="max_length", return_tensors="pt")

summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                           attention_mask=inputs["attention_mask"].to(
    device),
    length_penalty=0.8, num_beams=8, max_length=128)

decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                      clean_up_tokenization_spaces=True)
                     for s in summaries]

decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]


In [None]:
print(decoded_summaries)