# Dialogue Summarization using BART Transformer

## Introduction

This project implements abstractive dialogue summarization using a pre-trained **BART transformer** model. The baseline model (`facebook/bart-large-cnn`) is first evaluated, and then improved through **Supervised Fine-Tuning (SFT)** on the DialogSum dataset.

The model is trained on dialogue-summary pairs to adapt it specifically for dialogue summarization. Performance is evaluated using **ROUGE metrics** to compare baseline and fine-tuned results.


## What is BART?

**BART** (Bidirectional and Auto-Regressive Transformers) is a sequence-to-sequence transformer model developed by Facebook AI. It combines:

- A bidirectional encoder (like BERT)

- An auto-regressive decoder (like GPT)

This architecture makes BART highly effective for text generation tasks such as summarization, translation, and question answering.
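
To make the encoder/decoder distinction concrete, here is a minimal sketch of the two self-attention patterns. The helpers `bidirectional_mask` and `causal_mask` are hypothetical names used only for this illustration and are not part of BART or the `transformers` API. A real BART decoder also cross-attends to the encoder output, which is left out here.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Illustrative 4-token sequence: rows are query positions, columns are key positions.
print("Encoder (bidirectional):")
for row in bidirectional_mask(4):
    print(row)

print("Decoder (auto-regressive):")
for row in causal_mask(4):
    print(row)
```

The lower-triangular decoder mask is what lets the model generate a summary one token at a time without seeing future tokens.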

In this project, we use the `facebook/bart-large-cnn` checkpoint, a BART-large model that has already been fine-tuned for news summarization on the CNN/DailyMail dataset.

## What is Supervised Fine-Tuning (SFT)?

**Supervised Fine-Tuning (SFT)** is a training approach where a pre-trained model is further trained on labeled data for a specific task.

In this project:

- Input: Dialogue text

- Target: Human-written summary

- Loss Function: Cross-entropy loss

- Training: End-to-end update of all model parameters

SFT helps adapt the general-purpose pre-trained BART model to perform better on dialogue summarization.
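
As a rough illustration of the loss listed above, the sketch below computes token-level cross-entropy for a toy sequence in plain Python. The vocabulary size, logits, and target IDs are made-up values, and `softmax` and `sequence_cross_entropy` are hypothetical helper names; in the actual training run the loss is computed inside the model by PyTorch.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_cross_entropy(step_logits, target_ids):
    """Mean negative log-probability that the model assigns to the reference tokens."""
    losses = []
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example: a 3-token vocabulary and a 2-token reference summary.
step_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]  # one row of scores per decoding step
target_ids = [0, 2]                               # reference token at each step
loss = sequence_cross_entropy(step_logits, target_ids)
print(f"cross-entropy loss: {loss:.4f}")
```

The loss shrinks as the model puts more probability on the reference token at each step, which is the signal that drives the parameter updates during fine-tuning.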



### Loading the dataset

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

ds = load_dataset("knkarthick/dialogsum")



In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [None]:
ds['train'][1]

{'id': 'train_1',
 'dialogue': "#Person1#: Hello Mrs. Parker, how have you been?\n#Person2#: Hello Dr. Peters. Just fine thank you. Ricky and I are here for his vaccines.\n#Person1#: Very well. Let's see, according to his vaccination record, Ricky has received his Polio, Tetanus and Hepatitis B shots. He is 14 months old, so he is due for Hepatitis A, Chickenpox and Measles shots.\n#Person2#: What about Rubella and Mumps?\n#Person1#: Well, I can only give him these for now, and after a couple of weeks I can administer the rest.\n#Person2#: OK, great. Doctor, I think I also may need a Tetanus booster. Last time I got it was maybe fifteen years ago!\n#Person1#: We will check our records and I'll have the nurse administer and the booster as well. Now, please hold Ricky's arm tight, this may sting a little.",
 'summary': 'Mrs Parker takes Ricky for his vaccines. Dr. Peters checks the record and then gives Ricky a vaccine.',
 'topic': 'vaccines'}

In [None]:
ds['train'][1]['dialogue']

"#Person1#: Hello Mrs. Parker, how have you been?\n#Person2#: Hello Dr. Peters. Just fine thank you. Ricky and I are here for his vaccines.\n#Person1#: Very well. Let's see, according to his vaccination record, Ricky has received his Polio, Tetanus and Hepatitis B shots. He is 14 months old, so he is due for Hepatitis A, Chickenpox and Measles shots.\n#Person2#: What about Rubella and Mumps?\n#Person1#: Well, I can only give him these for now, and after a couple of weeks I can administer the rest.\n#Person2#: OK, great. Doctor, I think I also may need a Tetanus booster. Last time I got it was maybe fifteen years ago!\n#Person1#: We will check our records and I'll have the nurse administer and the booster as well. Now, please hold Ricky's arm tight, this may sting a little."

In [None]:
ds['train'][1]['summary']  # abstractive summary: the wording is newly generated rather than copied verbatim from the dialogue

'Mrs Parker takes Ricky for his vaccines. Dr. Peters checks the record and then gives Ricky a vaccine.'

# 1. Without Fine-Tuning

In [None]:
!pip install transformers



In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

Please make sure the generation config includes `forced_bos_token_id=0`. 



In [None]:
article_1 = ds['train'][1]['dialogue']

article_2 = """
As Yogi Berra famously said, it’s tough to make predictions, especially about the future. But had the baseball legend spent any time observing the UN climate negotiations, he could have safely predicted that climate finance will prove to be a key sticking point at COP29 in Baku at the end of this year.

‘Who will pay and how much?’ are perennial questions at the climate talks, but this year, the discussions about climate finance will be especially prominent. At COP29, Parties to the Paris Agreement must negotiate a new climate finance goal, to replace the existing commitment from 2009 for developed countries to provide US$100 billion climate finance annually from 2020 to 2025 - a commitment that only in 2022 was starting to be fulfilled, according to a recent OECD report.

It is vital that the forthcoming Bonn Climate Change Conference sends the right political signals, and lays the procedural and technical groundwork for an ambitious climate finance deal in Baku.

A pressing need

With global warming already destabilising the climate and devastating people’s lives and livelihoods, the need for finance to reduce greenhouse gas emissions and to adapt to a warming world has never been more pressing.

The sums involved are large. The Paris Agreement’s Global Stocktake process estimates that US$5.8-5.9 trillion is required to implement Nationally Determined Contributions (NDCs) in developing countries up to 2030. They will require US$215-387 billion annually over this period for adaptation. Investments of US$1.5 trillion in renewable energy are required worldwide every year up until 2030, according to IRENA.

But these sums are also affordable and beneficial for developed countries. They should be seen in the context of ongoing investments in energy and other infrastructure: around US$2.3 trillion was invested in energy infrastructure in 2023, of which US$1.74 trillion was in clean energy. These investments will generate strong returns for their investors and reduce the costs for energy consumers.

And, crucially, they should also be seen in the context of the alternative. The latest research estimates that the world economy is already set to face a 19% income reduction within the next 26 years based on the levels of warming we have already locked in. The more we delay and the more the planet heats, the greater the economic costs will be.

Laying the foundations for a new finance goal

While financial resources are beginning to flow, they are not flowing fast enough, and certainly not flowing to those developing countries where need is greatest and access to finance is most challenging.

The UN climate framework provides mechanisms that can enable those flows of climate finance. Back in 2015, parties at the climate talks agreed to establish a “new collective quantified goal” (NCQG) for climate finance. They agreed that the NCQG would be set prior to 2025.

The  ultimate size of the NCQG will be a product of the negotiations, but Parties have agreed it must be a significant increase from the floor of US$100 billion annually. For WWF, it must be needs-based and sufficiently ambitious to meet the scale of the challenge we face, and immediately accessible to help countries that are already facing the chaos of a destabilised climate system.

While developed countries are expected to provide financial and technical support, developing countries also have a role to play. Parties are due to submit revised NDCs in 2025, presenting how they plan to reduce emissions and adapt to climate change. Developing countries have the opportunity to use their NDCs to set out how international climate finance can support them and increase their ambition. To do this, they need to know the finance will be forthcoming.
"""

In [None]:
def summarize(article):
    # Tokenize the input
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")

    # Generate the summary
    summary_ids = model.generate(inputs["input_ids"], max_length=50, min_length=20, length_penalty=2.0, num_beams=4, early_stopping=True)

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
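
The `generate` call above combines beam search (`num_beams=4`) with `length_penalty=2.0`. The sketch below illustrates how such a penalty can change which beam wins, assuming the normalization described in the `transformers` documentation, where the summed log-probability is divided by the sequence length raised to `length_penalty`. The helper `beam_score` and the two candidate hypotheses are hypothetical values chosen for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: summed log-probability / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up finished hypotheses: (summed log-probability, number of tokens).
short_hyp = (-4.0, 5)
long_hyp = (-9.0, 15)

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(*short_hyp, penalty)
    l = beam_score(*long_hyp, penalty)
    winner = "short" if s > l else "long"
    print(f"length_penalty={penalty}: short={s:.3f}, long={l:.3f} -> {winner} wins")
```

Because log-probabilities are negative, a larger positive `length_penalty` shrinks the magnitude of long hypotheses' scores more strongly, so values above 0 favor longer summaries.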

In [None]:
article_1_summary = summarize(article_1)
article_2_summary = summarize(article_2)

print("Summary of Article 1: ", article_1_summary)
print("Summary of Article 2: ", article_2_summary)

Summary of Article 1:  Ricky has received his Polio, Tetanus and Hepatitis B shots. He is 14 months old, so he is due for Hep atitis A, Chickenpox and Measles shots.
Summary of Article 2:  The need for finance to reduce greenhouse gas emissions and to adapt to a warming world has never been more pressing. Parties to the Paris Agreement must negotiate a new climate finance goal.


In [None]:
ds['train'][1]['summary']

'Mrs Parker takes Ricky for his vaccines. Dr. Peters checks the record and then gives Ricky a vaccine.'

# 2. With Fine-Tuning

- `input_ids` is the tokenized form of the input text. Each token (a word or part of a word) is mapped to an integer ID from the model's vocabulary.

- `attention_mask` is a binary tensor that marks which tokens the model should attend to and which it should ignore (usually padding):

  - `1`: the token should be attended to.
  - `0`: the token is padding and should be ignored.

- In sequence-to-sequence models such as summarization models, there are two token sequences:

  - Input IDs: token IDs of the source text (here, the dialogue).
  - Target IDs: token IDs of the target text (here, the summary).

- During training, the model computes the loss between the predicted sequence and the target sequence. To keep padding from affecting the loss, padding token IDs in the labels are replaced with -100, the index that PyTorch's cross-entropy loss ignores by default.
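
The label-masking step described above can be sketched in plain Python. The token IDs and per-token losses are made-up values, `mask_padding` and `masked_mean_loss` are hypothetical helper names, and the pad ID of 1 is an assumption for this example; in practice the value should be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_padding(label_ids, pad_token_id):
    """Replace padding IDs in the labels with IGNORE_INDEX."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


def masked_mean_loss(per_token_losses, labels):
    """Average the per-token losses over positions that are not ignored."""
    kept = [loss for loss, lab in zip(per_token_losses, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


pad_token_id = 1                              # assumed pad ID for this toy example
raw_labels = [0, 713, 16, 2, 1, 1]            # summary tokens followed by two pads
labels = mask_padding(raw_labels, pad_token_id)
print(labels)

per_token_losses = [0.5, 1.0, 1.5, 1.0, 9.0, 9.0]  # the last two belong to padding
print(masked_mean_loss(per_token_losses, labels))
```

Note that the comparison must use the integer pad ID. Comparing token IDs against the pad token string would never match, and the padding would silently stay in the loss.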


### Evaluation metric: ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a standard metric used to evaluate text summarization models. It measures the overlap between the generated summary and the reference (ground-truth) summary.

In this project, we use:

- ROUGE-1 → Measures overlap of individual words (unigrams)

- ROUGE-2 → Measures overlap of word pairs (bigrams)

- ROUGE-L → Measures the longest common subsequence (sentence-level structure similarity)
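
To show what these three scores measure, here is a simplified pure-Python sketch that reports the F1 variant of each. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package exactly. All function names and both example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision (overlap/pred) and recall (overlap/ref)."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F1 based on clipped n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter intersection clips repeated n-grams
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the doctor gives ricky a vaccine"
reference = "the doctor gives ricky his vaccine"
print("ROUGE-1:", round(rouge_n(prediction, reference, 1), 3))
print("ROUGE-2:", round(rouge_n(prediction, reference, 2), 3))
print("ROUGE-L:", round(rouge_l(prediction, reference), 3))
```

In this toy pair a single substituted word costs one unigram but two bigrams, which is why ROUGE-2 is typically the lowest of the three scores.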

In [None]:
# Tokenization
def preprocess_function(batch):
  source = batch['dialogue']
  target = batch['summary']
  # Truncate and pad to a fixed length so every example has the same size
  source_ids = tokenizer(source, truncation=True, padding="max_length", max_length=128)
  target_ids = tokenizer(target, truncation=True, padding="max_length", max_length=128)

  # Replace the pad token ID with -100 in the labels so padding is ignored in the loss.
  # Compare against pad_token_id (an integer), not pad_token (a string).
  labels = target_ids["input_ids"]
  labels = [[label if label != tokenizer.pad_token_id else -100 for label in labels_example] for labels_example in labels]

  return {
      "input_ids": source_ids["input_ids"],
      "attention_mask": source_ids["attention_mask"],   # 1 = real token, 0 = padding
      "labels": labels
  }


In [None]:
df_source = ds.map(preprocess_function, batched=True)


In [None]:
from transformers import TrainingArguments, Trainer

#define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=2,
    remove_unused_columns=True
)

In [None]:
#create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=df_source["train"],
    eval_dataset=df_source["test"]
)

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.58095 |
| 1000 | 0.432407 |
| 1500 | 0.41907 |
| 2000 | 0.315183 |
| 2500 | 0.296318 |
| 3000 | 0.291804 |



TrainOutput(global_step=3116, training_loss=0.38625339243623197, metrics={'train_runtime': 3560.8827, 'train_samples_per_second': 6.998, 'train_steps_per_second': 0.875, 'total_flos': 6750530835578880.0, 'train_loss': 0.38625339243623197, 'epoch': 2.0})
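
The reported `global_step=3116` can be checked by hand from the dataset size and the training arguments. The short calculation below is a sanity check rather than part of the training code, and it assumes a single device with no gradient accumulation. The numbers are taken from the dataset summary and the `TrainOutput` above.

```python
import math

train_examples = 12460     # rows in the DialogSum training split
batch_size = 8             # per_device_train_batch_size
epochs = 2                 # num_train_epochs
train_runtime = 3560.8827  # seconds, from TrainOutput

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

steps_per_second = total_steps / train_runtime
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

Both throughput figures agree with the `train_steps_per_second` and `train_samples_per_second` values in the output.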

# Saving the model

In [None]:
# save the model and tokenizer after training
model.save_pretrained("./results/bart_finetuned_model")
tokenizer.save_pretrained("./results/bart_finetuned_model")


('./results/bart_finetuned_model/tokenizer_config.json',
 './results/bart_finetuned_model/tokenizer.json')

# Summary Generation with Fine-Tuned BART Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# load the trained model
model_finetuned = AutoModelForSeq2SeqLM.from_pretrained("./results/bart_finetuned_model")
tokenizer_finetuned = AutoTokenizer.from_pretrained("./results/bart_finetuned_model")

# Summarize a dialogue or article with the fine-tuned model
def summarize(article):
    # Tokenize the input text
    inputs = tokenizer_finetuned(article, max_length=1024, truncation=True, return_tensors="pt")

    # Generate the summary
    summary_ids = model_finetuned.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

    # Decode the summary
    summary = tokenizer_finetuned.decode(summary_ids[0], skip_special_tokens=True)
    return summary


# Get the summary
article_1_summary = summarize(article_1)
article_2_summary = summarize(article_2)
print("Summary of Article 1", article_1_summary)
print("Summary of Article 2", article_2_summary)


Summary of Article 1 Mrs. Parker comes to Dr. Peters to get Ricky's vaccines and a Tetanus booster. The doctor tells her Ricky has received his Polio, Tetanus, Hepatitis B, and Measles shots and will give her some more.
Summary of Article 2 #Climate finance will be a key sticking point at the COP29 in Baku at the end of this year. The Paris Agreement's Global Stocktake process estimates that US$5.9 trillion is required to implement Nationally Determined Contributions in developing countries up to 2030, and it needs to be seen in the context of ongoing investments in energy and other infrastructure.


# Evaluation using ROUGE

In [None]:
!pip install evaluate rouge_score



In [None]:
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

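def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # A plain Trainer passes model outputs (logits, possibly inside a tuple), not token IDs.
    # Take the logits and pick the most likely token at each position (teacher-forced predictions).
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)   # (batch, seq_len, vocab) -> (batch, seq_len)

    decoded_preds = tokenizer_finetuned.batch_decode(predictions, skip_special_tokens=True)
    # Restore the pad token ID where labels were masked with -100 so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer_finetuned.pad_token_id)
    decoded_labels = tokenizer_finetuned.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )

    return {
        "rouge1": result["rouge1"],
        "rouge2": result["rouge2"],
        "rougeL": result["rougeL"],
    }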

In [None]:
from transformers import Trainer, TrainingArguments
import torch

small_test = df_source["test"].select(range(100))

eval_args = TrainingArguments(
    output_dir="./eval_tmp",
    per_device_eval_batch_size=1,
    fp16=True,
    do_train=False,
    do_eval=True
)

eval_trainer = Trainer(
    model=model_finetuned,
    args=eval_args,
    eval_dataset=small_test,
    compute_metrics=compute_metrics
)

torch.cuda.empty_cache()

results = eval_trainer.evaluate()
print(results)