# Project 3: Large Language Models for Dialogue Summarization

This notebook presents an end-to-end transformer-based approach to abstractive dialogue summarization using the SAMSum dataset. We explore model training, evaluation, and downstream applications.

We focus on:
- building and fine-tuning a **BERT–GPT-2 encoder–decoder** model for abstractive summarization,
- comparing it against a simple **heuristic baseline**,
- computing both **custom and official ROUGE metrics**,
- performing **qualitative error analysis** with real examples,
- designing a **ChatGPT-style action-first prompting layer** on top of the summaries.

The notebook is structured as follows:

1. Setup and Reproducibility  
2. Dataset Loading and Quick EDA  
3. Tokenization and Preprocessing  
4. BERT–GPT-2 Encoder–Decoder Model  
5. Data Collator and Training Setup  
6. Training the Encoder–Decoder Model  
7. Baseline and Custom ROUGE Implementation  
8. Generating Model Summaries & Comparing with Baseline  
9. Official ROUGE via `evaluate`  
10. Action-First ChatGPT Prompting Layer  
11. Conclusions and Next Steps


In [1]:
!pip install -q evaluate rouge-score nltk

In [2]:
# Fix Colab environment issues (REQUIRED)
!pip install -q --upgrade datasets transformers huggingface_hub evaluate rouge-score

In [3]:
# 1. Setup and Reproducibility

import random
import numpy as np
import torch
import re

import pandas as pd
from datasets import load_dataset

from transformers import (
    AutoTokenizer,
    EncoderDecoderModel,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

from evaluate import load as load_metric


def set_seed(seed: int = 42):
    """
    Set random seeds for Python, NumPy and PyTorch to improve reproducibility.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


set_seed(42)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

Using device: cpu


## 2. Dataset Loading and Quick EDA

We use the **SAMSum** dataset, which contains short messenger-style dialogues and
human-written abstractive summaries.

In this section we:
- load the dataset using the `datasets` library,
- inspect a few raw examples,
- compute quick statistics on dialogue and summary lengths.


In [4]:
# 2.1 Load the SAMSum dataset

from datasets import load_dataset

# Try the official SAMSum dataset first
try:
    samsum = load_dataset("samsum")
    print("Loaded 'samsum' dataset from the Hub.")
except Exception as e:
    print("Failed to load 'samsum' directly. Error:", e)
    print("Trying the mirror dataset 'knkarthick/samsum' instead...")
    samsum = load_dataset("knkarthick/samsum")
    print("Loaded 'knkarthick/samsum' dataset successfully.")

print(samsum)
print("Train size:", len(samsum["train"]))
print("Validation size:", len(samsum["validation"]))
print("Test size:", len(samsum["test"]))




# 2.2 Inspect a few sample dialogues
for i in range(2):
    print("=" * 80)
    print("Dialogue:\n", samsum["train"][i]["dialogue"])
    print("\nSummary:\n", samsum["train"][i]["summary"])




Failed to load 'samsum' directly. Error: Dataset 'samsum' doesn't exist on the Hub or cannot be accessed.
Trying the mirror dataset 'knkarthick/samsum' instead...
Loaded 'knkarthick/samsum' dataset successfully.
DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14731
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
})
Train size: 14731
Validation size: 818
Test size: 819
Dialogue:
 Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)

Summary:
 Amanda baked cookies and will bring Jerry some tomorrow.
Dialogue:
 Olivia: Who are you voting for in this election? 
Oliver: Liberals as always.
Olivia: Me too!!
Oliver: Great

Summary:
 Olivia and Olivier are voting for liberals in this electi

### 2.3 Quick Length Statistics

We compute basic statistics (min, max, mean) for the character lengths of dialogues and summaries
on the first 2,000 training examples. Character counts are only a rough proxy for token counts, but they help us choose reasonable maximum sequence lengths.
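
One way to turn length statistics into a concrete cap is to check what share of examples a candidate maximum would truncate. The snippet below is a minimal sketch of that check; `truncated_fraction` and `toy_lengths` are hypothetical names, and the lengths are made-up values rather than measurements on SAMSum.

```python
def truncated_fraction(lengths, max_len):
    """Return the share of sequences longer than max_len (i.e. that would be truncated)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > max_len) / len(lengths)


# Hypothetical token lengths, for illustration only (not measured on SAMSum).
toy_lengths = [40, 85, 120, 130, 190, 240, 260, 310, 500, 900]

for cap in (128, 256, 512):
    share = truncated_fraction(toy_lengths, cap)
    print(f"max_len={cap}: {share:.0%} of examples truncated")
```

In practice the same function could be applied to token counts produced by the actual tokenizers, which would give a more direct basis for the limits than character counts.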


In [5]:
def quick_stats(dataset, n_samples: int = 2000):
    """
    Compute simple statistics (min, max, mean character length)
    for dialogues and summaries in a dataset split.
    """
    n_samples = min(n_samples, len(dataset))
    dlg_lengths = []
    sum_lengths = []

    for i in range(n_samples):
        dlg = dataset[i]["dialogue"]
        summ = dataset[i]["summary"]
        dlg_lengths.append(len(dlg))
        sum_lengths.append(len(summ))

    stats = {
        "dialogue_min_len": int(np.min(dlg_lengths)),
        "dialogue_max_len": int(np.max(dlg_lengths)),
        "dialogue_mean_len": float(np.mean(dlg_lengths)),
        "summary_min_len": int(np.min(sum_lengths)),
        "summary_max_len": int(np.max(sum_lengths)),
        "summary_mean_len": float(np.mean(sum_lengths)),
    }
    return stats


train_stats = quick_stats(samsum["train"], n_samples=2000)
train_stats


{'dialogue_min_len': 42,
 'dialogue_max_len': 2466,
 'dialogue_mean_len': 510.2215,
 'summary_min_len': 4,
 'summary_max_len': 300,
 'summary_mean_len': 111.501}

## 3. Tokenization and Preprocessing

Our summarization model uses:

- **BERT** as the encoder (for the input dialogue),
- **GPT-2** as the decoder (for the target summary).

We therefore use two tokenizers:

- `bert-base-uncased` for dialogues,
- `gpt2` for summaries.

Preprocessing steps:
- truncate/pad dialogues and summaries to fixed maximum lengths,
- tokenize with the corresponding tokenizers,
- replace padding tokens in labels with `-100` so that they are ignored by the loss function.
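
To illustrate why padded label positions are set to `-100`, the sketch below computes an average negative log-likelihood that skips those positions, mirroring the ignore-index convention of PyTorch's cross-entropy loss. It is a toy stand-in written in plain Python, not the loss code used by the model; `masked_nll` and the probability values are assumptions made for illustration.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] is a dict mapping token id -> predicted probability at position i.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(token_probs, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy predictions for a 4-position target; the last two positions are padding.
token_probs = [
    {7: 0.5, 9: 0.5},
    {7: 0.25, 9: 0.75},
    {7: 0.9, 9: 0.1},
    {7: 0.9, 9: 0.1},
]
labels_masked = [7, 9, IGNORE_INDEX, IGNORE_INDEX]
labels_unmasked = [7, 9, 9, 9]  # padding id 9 left in place

loss_masked = masked_nll(token_probs, labels_masked)
loss_unmasked = masked_nll(token_probs, labels_unmasked)
print(f"masked loss:   {loss_masked:.4f}")
print(f"unmasked loss: {loss_unmasked:.4f}")
```

With masking, only the two real target tokens contribute; without it, the model is also penalized for not predicting the padding token, which inflates the loss and pushes training toward emitting padding.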


In [6]:
# 3.1 Load tokenizers for encoder and decoder

encoder_name = "bert-base-uncased"
decoder_name = "gpt2"

enc_tok = AutoTokenizer.from_pretrained(encoder_name, use_fast=True)
dec_tok = AutoTokenizer.from_pretrained(decoder_name, use_fast=True)

# GPT-2 has no PAD token by default, so we reuse its EOS token as PAD
# (no new token is added, so the vocabulary size stays at 50257)
if dec_tok.pad_token is None:
    dec_tok.add_special_tokens({"pad_token": dec_tok.eos_token})

print("Encoder vocab size:", enc_tok.vocab_size)
print("Decoder vocab size:", len(dec_tok))
print("Decoder PAD token:", dec_tok.pad_token, dec_tok.pad_token_id)

# 3.2 Define maximum sequence lengths based on data stats
MAX_INPUT_LEN = 256   # maximum tokens for dialogues
MAX_TARGET_LEN = 64   # maximum tokens for summaries


Encoder vocab size: 30522
Decoder vocab size: 50257
Decoder PAD token: <|endoftext|> 50256


In [7]:
# 3.3 Preprocessing Function

def preprocess_batch(batch):
    """
    Tokenize dialogues and summaries for encoder-decoder training.

    - Encoder: BERT tokenizer on 'dialogue'
    - Decoder: GPT-2 tokenizer on 'summary'
    - PAD tokens in labels are replaced with -100 so they are ignored by the loss.

    Args:
        batch: A batch of examples with 'dialogue' and 'summary' fields.

    Returns:
        A dictionary with input_ids, attention_mask, and labels.
    """
    # Tokenize dialogues for the encoder
    enc = enc_tok(
        batch["dialogue"],
        truncation=True,
        padding="max_length",
        max_length=MAX_INPUT_LEN,
    )

    # Tokenize summaries for the decoder
    dec = dec_tok(
        batch["summary"],
        truncation=True,
        padding="max_length",
        max_length=MAX_TARGET_LEN,
    )

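    # Use decoder token IDs as labels and replace padded positions with -100 so
    # they are ignored by the loss. We mask via the attention mask rather than by
    # comparing against pad_token_id: PAD reuses the EOS id here, so an ID-based
    # check would also mask any genuine EOS token in the target.
    # Note: the GPT-2 tokenizer does not append EOS, so the targets contain no
    # explicit end-of-summary token.
    enc["labels"] = [
        [(t if m == 1 else -100) for t, m in zip(seq, mask)]
        for seq, mask in zip(dec["input_ids"], dec["attention_mask"])
    ]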

    return enc


# 3.4 (Optional) Subsample dataset for faster training (Colab-friendly)

TRAIN_N = 6000
VAL_N = 1500

train_raw = samsum["train"].select(range(min(TRAIN_N, len(samsum["train"]))))
val_raw = samsum["validation"].select(range(min(VAL_N, len(samsum["validation"]))))

# 3.5 Apply preprocessing

train_tokenized = train_raw.map(
    preprocess_batch,
    batched=True,
    remove_columns=["dialogue", "summary"],
)

val_tokenized = val_raw.map(
    preprocess_batch,
    batched=True,
    remove_columns=["dialogue", "summary"],
)

train_tokenized[0]


{'id': '13818513',
 'input_ids': [101, 8282, 1024, 1045, 17776, 16324, 1012, 2079, 2017, 2215,
  2070, 1029, 6128, 1024, 2469, 999, 8282, 1024, 1045, 1005, 2222, 3288,
  2017, 4826, 1024, 1011, 1007, 102,
  0, 0, 0, ... (zero padding continues up to MAX_INPUT_LEN; output truncated)


## 4. BERT–GPT-2 Encoder–Decoder Model

We build a **BERT–GPT-2 encoder–decoder model** initialized from pretrained weights:

- the encoder is `bert-base-uncased`,
- the decoder is `gpt2`,
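- cross-attention layers are added to the decoder automatically; they are newly initialized rather than pretrained, so they are learned during fine-tuning,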
- we configure the correct special tokens (PAD, EOS, decoder start).

This setup leverages strong contextual representations from BERT
and fluent generative capabilities from GPT-2, making it suitable for abstractive dialogue summarization.
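
Cross-attention is the mechanism that lets the GPT-2 decoder read BERT's output. As a rough illustration, the sketch below implements scaled dot-product attention of decoder queries over encoder states in plain Python. It is a simplification and not the model's implementation: the learned query, key, and value projections are omitted, and the vectors are toy values chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, enc_states):
    """Scaled dot-product attention of decoder queries over encoder states.

    Simplification: learned query/key/value projections are omitted, so the
    encoder states act as both keys and values.
    """
    dim = len(enc_states[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in enc_states]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, enc_states)) for j in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy 2-dimensional states for three encoder tokens and one decoder position.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[2.0, 0.0]]

outputs, weights = cross_attention(queries, enc_states)
print("attention weights:", [round(w, 3) for w in weights[0]])
print("context vector:", [round(x, 3) for x in outputs[0]])
```

Each decoder position receives a weighted mix of encoder states, with larger weights on the states most similar to its query. In the real model the projections that produce queries, keys, and values are trainable parameters.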


In [8]:
# 4.1 Build encoder–decoder model from pretrained components

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_name,
    decoder_name,
)

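# Resize decoder embeddings to match the tokenizer. Because PAD reuses EOS, the
# vocabulary size is unchanged and this is a no-op here; it is kept as a safeguard
# in case a genuinely new PAD token is added later.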
model.decoder.resize_token_embeddings(len(dec_tok))

# 4.2 Configure special tokens and generation settings
# Special tokens (these correctly stay in model.config)
model.config.decoder_start_token_id = dec_tok.pad_token_id
model.config.eos_token_id = dec_tok.eos_token_id
model.config.pad_token_id = dec_tok.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size

# >>> Generation parameters must live in model.generation_config <<<
gen_cfg = model.generation_config
gen_cfg.max_length = MAX_TARGET_LEN
gen_cfg.min_length = 8
gen_cfg.no_repeat_ngram_size = 2
#gen_cfg.early_stopping = True
model.generation_config = gen_cfg


model.to(device)
print("Model initialized and moved to device:", device)



BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



GPT2LMHeadModel LOAD REPORT from: gpt2
Key                                                 | Status     | 
----------------------------------------------------+------------+-
h.{0...11}.attn.bias                                | UNEXPECTED | 
transformer.h.{0...11}.ln_cross_attn.weight         | MISSING    | 
transformer.h.{0...11}.crossattention.c_proj.weight | MISSING    | 
transformer.h.{0...11}.ln_cross_attn.bias           | MISSING    | 
transformer.h.{0...11}.crossattention.c_proj.bias   | MISSING    | 
transformer.h.{0...11}.crossattention.c_attn.bias   | MISSING    | 
transformer.h.{0...11}.crossattention.c_attn.weight | MISSING    | 
transformer.h.{0...11}.crossattention.q_attn.weight | MISSING    | 
transformer.h.{0...11}.crossattention.q_attn.bias   | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider

Model initialized and moved to device: cpu


## 5. Data Collator and Training Setup

We use the Hugging Face **Seq2SeqTrainer** API:

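- `DataCollatorForSeq2Seq` dynamically pads examples to the longest sequence in the batch (since preprocessing already pads everything to the maximum length, it has nothing left to pad in this notebook),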
- `Seq2SeqTrainingArguments` define hyperparameters,
- `Seq2SeqTrainer` handles training, evaluation, and generation.

We keep training settings moderate to fit within typical Google Colab limits,
but they can be scaled up for more serious experiments.
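
Dynamic padding pays off when examples are left unpadded at tokenization time. The sketch below shows the idea in plain Python: each batch is padded only to its own longest input and longest label sequence, with labels padded by `-100`. It is an illustrative stand-in for what a seq2seq collator does, not the library code; `pad_to_longest` and the token IDs are hypothetical.

```python
LABEL_PAD_ID = -100  # ignored by the loss


def pad_to_longest(batch, input_pad_id=0):
    """Pad every example to the longest input and longest label in this batch."""
    max_in = max(len(ex["input_ids"]) for ex in batch)
    max_lab = max(len(ex["labels"]) for ex in batch)
    padded = []
    for ex in batch:
        n_in = len(ex["input_ids"])
        pad_in = max_in - n_in
        pad_lab = max_lab - len(ex["labels"])
        padded.append({
            "input_ids": ex["input_ids"] + [input_pad_id] * pad_in,
            "attention_mask": [1] * n_in + [0] * pad_in,
            "labels": ex["labels"] + [LABEL_PAD_ID] * pad_lab,
        })
    return padded


# Hypothetical unpadded examples (token ids are made up).
batch = [
    {"input_ids": [101, 8282, 102], "labels": [5, 6]},
    {"input_ids": [101, 8282, 1024, 1045, 102], "labels": [5, 6, 7, 8]},
]
padded_batch = pad_to_longest(batch)
for row in padded_batch:
    print(row)
```

Compared with padding every example to a fixed maximum, this keeps batches of short dialogues small, which can reduce compute per step noticeably on CPU or small GPUs.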


In [9]:
# 5.1 Data collator for seq2seq models
data_collator = DataCollatorForSeq2Seq(
    tokenizer=enc_tok,
    model=model,
    padding="longest",
)


# 5.2 Seq2Seq training arguments (version-safe)
training_args = Seq2SeqTrainingArguments(
    output_dir="outputs_samsum_bert2gpt2",
    num_train_epochs=1,                 # ignored here: max_steps takes precedence when set
    max_steps=250,                      # keep small for Colab; increase if you can
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=5e-5,
    weight_decay=0.01,
    predict_with_generate=True,
    fp16=False,                         # set True if your GPU supports it
    do_train=True,
    do_eval=True,
    logging_steps=50,                   # how often to log
    save_steps=1000,                    # effectively no checkpoints
    save_total_limit=1,
    report_to=[],                       # no external logging (W&B etc.)
)

# 5.3 Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    data_collator=data_collator,
)


## 6. Training the Encoder–Decoder Model

We now fine-tune the pretrained BERT–GPT-2 encoder–decoder model on SAMSum.

The `Seq2SeqTrainer`:
- performs forward and backward passes,
- handles optimizer and learning rate scheduling,
- evaluates on the validation set when `trainer.evaluate()` is called (no periodic evaluation is configured in this run).

For a stronger model, we would train for more steps and possibly with larger batch sizes.
Here we keep the run **Colab-friendly**: 250 steps at batch size 2, which covers only a small fraction of the 6,000-example training subset.
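
As a quick sanity check on how much data this run actually sees, the arithmetic below relates the step budget to the size of the training subset. It assumes a single device and no gradient accumulation, which are the defaults and are not stated explicitly in the notebook.

```python
max_steps = 250
per_device_train_batch_size = 2
train_examples = 6000  # TRAIN_N subset used in this notebook

# Assumes one device and gradient_accumulation_steps=1.
examples_seen = max_steps * per_device_train_batch_size
epochs_covered = examples_seen / train_examples

print(f"Examples seen: {examples_seen}")
print(f"Fraction of one epoch: {epochs_covered:.4f}")
```

The result (500 examples, about 0.0833 of an epoch) agrees with the `epoch` value that the Trainer reports after training.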


In [10]:
# 6.1 Train the model (this may take several minutes depending on hardware)

train_result = trainer.train()
print("Training finished. Global step:", train_result.global_step)

# 6.2 Evaluate the model using Trainer's built-in evaluation
eval_metrics = trainer.evaluate()
print("Validation metrics from Trainer:", eval_metrics)




| Step | Training Loss |
|------|---------------|
| 50   | 4.31988       |
| 100  | 4.016059      |
| 150  | 3.903568      |
| 200  | 3.967823      |
| 250  | 3.886154      |



Training finished. Global step: 250




Validation metrics from Trainer: {'eval_loss': 3.78239369392395, 'eval_runtime': 830.1496, 'eval_samples_per_second': 0.985, 'eval_steps_per_second': 0.493, 'epoch': 0.08333333333333333}


## 7. Baseline and Custom ROUGE Implementation

To better interpret model performance, we define:

1. A very simple **baseline summarizer**:
   - It takes the first and last sentence of the dialogue as the "summary".
2. A custom implementation of **ROUGE-1** and **ROUGE-L** F1:
   - This demonstrates understanding of how ROUGE works,
   - and allows us to compare baseline vs. model on the same metric.
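
For reference, the F1 variants implemented in this notebook can be written as follows. The notation is introduced here for illustration: $c_p(w)$ and $c_r(w)$ are the counts of token $w$ in the prediction $p$ and the reference $r$, $|p|$ and $|r|$ are their token lengths, and $\mathrm{LCS}(p, r)$ is the length of their longest common subsequence.

```latex
\begin{aligned}
P_1 &= \frac{\sum_{w} \min\bigl(c_p(w),\, c_r(w)\bigr)}{|p|}, &
R_1 &= \frac{\sum_{w} \min\bigl(c_p(w),\, c_r(w)\bigr)}{|r|}, \\
P_L &= \frac{\mathrm{LCS}(p, r)}{|p|}, &
R_L &= \frac{\mathrm{LCS}(p, r)}{|r|}, \\
F &= \frac{2\,P\,R}{P + R}
\quad \text{applied to } (P_1, R_1) \text{ and } (P_L, R_L) \text{ respectively.}
\end{aligned}
```

ROUGE-1 ignores word order entirely, while ROUGE-L rewards tokens that appear in the same relative order in both texts.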


In [11]:
def split_into_sentences(text: str):
    """
    Very simple sentence splitter based on punctuation.
    This is intentionally lightweight and heuristic.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s for s in sentences if s]
    return sentences


def baseline_head_tail(dialogue: str, max_sentences: int = 2) -> str:
    """
    Baseline summarizer: take the first and last sentence from the dialogue.
    If the dialogue is very short, return what is available.

    This provides a naive extractive baseline to compare against our model.
    """
    sentences = split_into_sentences(dialogue)
    if not sentences:
        return ""
    if len(sentences) <= max_sentences:
        return " ".join(sentences)
    # use the first and last sentence
    return sentences[0] + " " + sentences[-1]


In [12]:
from collections import Counter


def _tok(s: str):
    """
    Very simple whitespace-based tokenizer with lowercasing.
    """
    return s.lower().split()


def rouge1_f1_single(pred: str, ref: str) -> float:
    """
    Compute ROUGE-1 F1 score between a single prediction and reference.
    """
    pred_tokens = _tok(pred)
    ref_tokens = _tok(ref)
    if not pred_tokens or not ref_tokens:
        return 0.0

    pred_counts = Counter(pred_tokens)
    ref_counts = Counter(ref_tokens)

    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    if precision + recall == 0:
        return 0.0

    return 2 * precision * recall / (precision + recall)


def lcs_len(x: list, y: list) -> int:
    """
    Longest Common Subsequence (LCS) length for ROUGE-L.
    """
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]


def rougeL_f1_single(pred: str, ref: str) -> float:
    """
    Compute ROUGE-L F1 score between a single prediction and reference.
    """
    pred_tokens = _tok(pred)
    ref_tokens = _tok(ref)
    if not pred_tokens or not ref_tokens:
        return 0.0

    lcs = lcs_len(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0

    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    if precision + recall == 0:
        return 0.0

    return 2 * precision * recall / (precision + recall)


def avg_scores(pred_list, ref_list, metric_fn):
    """
    Compute the average of a given ROUGE metric over a list of predictions.
    """
    scores = []
    for p, r in zip(pred_list, ref_list):
        scores.append(metric_fn(p, r))
    return float(np.mean(scores)), scores
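
To make the difference between the two metrics concrete, here is a small self-contained worked example. It re-implements the same logic in compact form so that it runs on its own; the helper names (`tokens`, `f1`, `lcs_length`) and the toy sentences are chosen for illustration and are not part of the evaluation code above.

```python
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization, as in the custom metric."""
    return text.lower().split()


def f1(match, pred_len, ref_len):
    """F1 from a match count and the two sequence lengths."""
    if match == 0:
        return 0.0
    precision, recall = match / pred_len, match / ref_len
    return 2 * precision * recall / (precision + recall)


def lcs_length(x, y):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]


# Toy pair: same content words, different order.
pred = tokens("Jerry gets cookies Amanda baked")
ref = tokens("Amanda baked cookies and will bring Jerry some tomorrow")

unigram_overlap = sum((Counter(pred) & Counter(ref)).values())
rouge1 = f1(unigram_overlap, len(pred), len(ref))
rougeL = f1(lcs_length(pred, ref), len(pred), len(ref))

print(f"ROUGE-1 F1: {rouge1:.4f}")
print(f"ROUGE-L F1: {rougeL:.4f}")
```

Four unigrams match, giving a ROUGE-1 F1 of 0.5714, but the longest in-order match is only "amanda baked", so ROUGE-L F1 drops to 0.2857. A prediction can therefore share most of its vocabulary with the reference and still score poorly on ROUGE-L.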


## 8. Generating Model Summaries & Comparing with Baseline

In this section we:

1. Build a small evaluation subset from the validation split.
2. Generate summaries using:
   - the **baseline** heuristic,
   - our fine-tuned **encoder–decoder model**.
3. Compute **custom ROUGE-1 and ROUGE-L** scores for both.
4. Inspect a few **qualitative examples** to understand typical model behavior.


In [13]:
def generate_summary(dialogue: str, max_new_tokens: int = 48) -> str:
    """
    Generate an abstractive summary for a single dialogue using the
    encoder–decoder model.

    Args:
        dialogue: Original SAMSum-style conversation (multi-turn).
        max_new_tokens: Maximum number of tokens to generate.

    Returns:
        Decoded summary string.
    """
    model.eval()
    inputs = enc_tok(
        [dialogue],
        truncation=True,
        padding="max_length",
        max_length=MAX_INPUT_LEN,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
            num_beams=4,
            early_stopping=True,
        )

    summary = dec_tok.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return summary.strip()


In [14]:
# 8.2 Build a small evaluation subset for qualitative and quantitative analysis
EVAL_N = 150
eval_raw = val_raw.select(range(min(EVAL_N, len(val_raw))))

dialogues = [ex["dialogue"] for ex in eval_raw]
references = [ex["summary"] for ex in eval_raw]

# Baseline predictions
baseline_preds = [baseline_head_tail(dlg) for dlg in dialogues]

# Model predictions
model_preds = [generate_summary(dlg) for dlg in dialogues]

print("Example model prediction:")
print("=" * 80)
print("Dialogue:\n", dialogues[0])
print("\nReference summary:\n", references[0])
print("\nBaseline summary:\n", baseline_preds[0])
print("\nModel summary:\n", model_preds[0])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Example model prediction:
Dialogue:
 A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))

Reference summary:
 A will go to the animal shelter tomorrow to get a puppy for her son. They already visited the shelter last Monday and the son chose the puppy. 

Baseline 

In [15]:
# 8.3 Compute custom ROUGE-1 and ROUGE-L for baseline and model

base_r1_mean, base_r1_list = avg_scores(baseline_preds, references, rouge1_f1_single)
base_rL_mean, base_rL_list = avg_scores(baseline_preds, references, rougeL_f1_single)

mod_r1_mean, mod_r1_list = avg_scores(model_preds, references, rouge1_f1_single)
mod_rL_mean, mod_rL_list = avg_scores(model_preds, references, rougeL_f1_single)

print("=== Custom ROUGE (approximate) ===")
print(f"Baseline ROUGE-1 F1: {base_r1_mean:.4f}")
print(f"Baseline ROUGE-L F1: {base_rL_mean:.4f}")
print(f"Model    ROUGE-1 F1: {mod_r1_mean:.4f}")
print(f"Model    ROUGE-L F1: {mod_rL_mean:.4f}")


=== Custom ROUGE (approximate) ===
Baseline ROUGE-1 F1: 0.1464
Baseline ROUGE-L F1: 0.1281
Model    ROUGE-1 F1: 0.1376
Model    ROUGE-L F1: 0.1095


In [16]:
# 8.4 Show a few qualitative examples with scores

for i in range(3):
    print("=" * 80)
    print(f"Example {i+1}")
    print("--- Dialogue ---")
    print(dialogues[i])
    print("\n--- Reference summary ---")
    print(references[i])
    print("\n--- Baseline summary ---")
    print(baseline_preds[i])
    print("\n--- Model summary ---")
    print(model_preds[i])
    print(
        f"\nBaseline ROUGE-1 F1: {base_r1_list[i]:.4f} | ROUGE-L F1: {base_rL_list[i]:.4f}"
    )
    print(
        f"Model    ROUGE-1 F1: {mod_r1_list[i]:.4f} | ROUGE-L F1: {mod_rL_list[i]:.4f}"
    )


Example 1
--- Dialogue ---
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))

--- Reference summary ---
A will go to the animal shelter tomorrow to get a puppy for her son. They already visited the shelter last Monday and the son chose the puppy. 

--- Baseline 

## 9. Official ROUGE via `evaluate`

In addition to the custom implementation, we compute **standard ROUGE metrics**
using the `evaluate` library. With the call used here, it returns aggregated F1 scores
for ROUGE-1, ROUGE-2, and ROUGE-L, with stemming enabled via `use_stemmer=True`.

We report these for our encoder–decoder model on the evaluation subset.


In [19]:
from evaluate import load as load_metric

rouge_metric = load_metric("rouge")

rouge_result = rouge_metric.compute(
    predictions=model_preds,
    references=references,
    use_stemmer=True,
)

print("=== HuggingFace evaluate ROUGE (model) ===")
for key in ["rouge1", "rouge2", "rougeL"]:
    score = rouge_result[key]
    score_float = float(score)
    print(f"{key.upper()} F1: {score_float:.4f}")


=== HuggingFace evaluate ROUGE (model) ===
ROUGE1 F1: 0.1443
ROUGE2 F1: 0.0153
ROUGEL F1: 0.1127
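
The official ROUGE-1 score (0.1443) is slightly higher than the custom one (0.1376). One plausible contributor is tokenization: the custom metric splits on whitespace, so punctuation stays attached to words. The sketch below contrasts that with a punctuation-stripping tokenizer. The regex in `normalized_tokens` is an assumed approximation of how the official scorer normalizes text, and stemming is not reproduced here.

```python
import re
from collections import Counter


def whitespace_tokens(text):
    """Tokenization used by the custom metric: lowercase, split on whitespace."""
    return text.lower().split()


def normalized_tokens(text):
    """Assumed approximation of the official scorer: keep only lowercase alphanumerics."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def unigram_overlap(pred_tokens, ref_tokens):
    """Clipped unigram match count, the numerator of ROUGE-1."""
    return sum((Counter(pred_tokens) & Counter(ref_tokens)).values())


pred = "Amanda will bring cookies tomorrow"
ref = "Amanda baked cookies and will bring Jerry some tomorrow."

ws_overlap = unigram_overlap(whitespace_tokens(pred), whitespace_tokens(ref))
norm_overlap = unigram_overlap(normalized_tokens(pred), normalized_tokens(ref))
print("whitespace overlap:", ws_overlap)
print("normalized overlap:", norm_overlap)
```

With whitespace splitting, "tomorrow" and "tomorrow." count as different tokens, so one match is lost (4 instead of 5). Small effects like this accumulate across a test set, which is why the custom scores are labeled approximate.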


## 10. Action-First ChatGPT Prompting Layer

Beyond pure summarization, we add an **action-first prompting layer** designed for ChatGPT
(or any comparable auto-regressive LLM).

Workflow:

1. Our encoder–decoder model produces a concise summary of a conversation.
2. We wrap this summary into a structured prompt that asks the LLM to:
   - extract key decisions,
   - list action items with owners and deadlines,
   - identify open questions or unresolved issues.

This demonstrates how a custom summarization model can be integrated into a larger LLM-powered assistant for productivity and decision support.

Because the prompt is built directly from the model summary, any summarization error propagates into the downstream LLM output, which makes faithful and grounded summaries essential in real-world applications. Note that this notebook only constructs the prompt; it does not call an LLM.
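
If the LLM follows a fixed reply layout, its answer can be turned back into structured data. The sketch below assumes the reply uses three section headers (`Decisions:`, `Action Items:`, `Open Questions:`) followed by `- ` bullets; `parse_action_digest` is a hypothetical helper, and `sample_reply` is written by hand for illustration (no model was called).

```python
SECTION_HEADERS = ("Decisions", "Action Items", "Open Questions")


def parse_action_digest(response_text):
    """Split an LLM reply in the assumed format into a dict of bullet lists."""
    sections = {name: [] for name in SECTION_HEADERS}
    current = None
    for raw_line in response_text.splitlines():
        line = raw_line.strip()
        if line.endswith(":") and line[:-1] in SECTION_HEADERS:
            current = line[:-1]  # start collecting bullets for this section
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].strip())
    return sections


# Hypothetical LLM reply, written by hand for illustration.
sample_reply = """
Decisions:
- Get a puppy from the animal shelter.

Action Items:
- [A] Visit the animal shelter (deadline: tomorrow)

Open Questions:
- Can B come along?
"""

digest = parse_action_digest(sample_reply)
for name, items in digest.items():
    print(name, "->", items)
```

A parser this strict silently drops anything outside the expected layout, so a production version would likely need validation and a fallback for malformed replies.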


In [20]:
ACTION_DIGEST_PROMPT = """
You are an AI assistant that helps busy professionals keep track of decisions and next steps.

Below is a summary of a conversation. Your job is:
1. Extract the key decisions that were made.
2. Extract the action items with clear owners and deadlines if mentioned.
3. Highlight any open questions or unresolved issues.

Return your output in the following structured format:

Decisions:
- ...

Action Items:
- [Owner] ... (deadline: ...)

Open Questions:
- ...

Conversation summary:
\"\"\"{summary}\"\"\"
"""


def build_action_digest_prompt(summary: str) -> str:
    """
    Wrap a model-generated summary into an 'action-first' prompt that can be
    sent to ChatGPT (or another LLM) to extract decisions and next steps.
    """
    return ACTION_DIGEST_PROMPT.format(summary=summary)


# 10.2 Example: use a model-generated summary and build the LLM prompt
example_idx = 0
example_summary = model_preds[example_idx]
example_prompt = build_action_digest_prompt(example_summary)

print("=== Example action-first prompt for ChatGPT ===")
print(example_prompt)


=== Example action-first prompt for ChatGPT ===

You are an AI assistant that helps busy professionals keep track of decisions and next steps.

Below is a summary of a conversation. Your job is:
1. Extract the key decisions that were made.
2. Extract the action items with clear owners and deadlines if mentioned.
3. Highlight any open questions or unresolved issues.

Return your output in the following structured format:

Decisions:
- ...

Action Items:
- [Owner] ... (deadline: ...)

Open Questions:
- ...

Conversation summary:
"""Sylvia is going to go to the beach. She will be there for a few hours.  The beach is a bit crowded, so she will have to wait for the bus to take her to a beach in the middle of"""



## 11. Conclusions and Next Steps

In this project, we developed and evaluated an end-to-end transformer-based system for abstractive dialogue summarization. We implemented a BERT–GPT-2 encoder–decoder architecture and fine-tuned it on the SAMSum dataset using a Colab-compatible training setup. To establish a meaningful point of comparison, we defined a simple extractive baseline that concatenates the first and last sentences of each dialogue.

Model performance was evaluated using both custom-implemented ROUGE-1 and ROUGE-L metrics and the official ROUGE-1/2/L implementation provided by the Hugging Face `evaluate` library. In addition to quantitative evaluation, we conducted qualitative analysis by inspecting generated summaries and comparing them against reference summaries and baseline outputs. Finally, we sketched a realistic downstream application by designing an action-first ChatGPT prompting layer that wraps model-generated summaries in a prompt asking an LLM to extract decisions, action items, and open questions.

### Strengths

- End-to-end transformer pipeline specifically tailored to dialogue summarization.
- Comprehensive evaluation combining quantitative metrics with qualitative error analysis.
- Clear demonstration of abstractive summarization behavior and its trade-offs relative to extractive baselines.
- Practical system-level integration through an action-oriented LLM prompt, illustrating downstream usage and error propagation.

### Limitations

- Training time, model size, and number of fine-tuning steps were constrained by available hardware resources.
- ROUGE metrics, while widely adopted, primarily measure lexical overlap and do not fully capture semantic faithfulness or factual grounding.
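- The model exhibits semantic drift and hallucinations: the summary generated for the first evaluation dialogue (about getting a puppy from an animal shelter) describes a trip to the beach. Likely causes are the short fine-tuning run, the newly initialized cross-attention layers, and the absence of explicit grounding or copy mechanisms.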
- Hyperparameter tuning was kept minimal to maintain computational feasibility.

### Future Work

Future extensions of this work could explore fine-tuning larger and more specialized encoder–decoder architectures such as BART or T5 for improved summarization quality. Incorporating semantic evaluation metrics (e.g., BERTScore) alongside ROUGE could provide a more faithful assessment of model performance. Additionally, integrating human feedback or reinforcement learning–based fine-tuning approaches could help reduce hallucinations and improve grounding. Finally, deploying the model behind an API and integrating it into a real-time chat or productivity application would allow for real-world evaluation and iterative improvement.


### Final Remarks

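This project focuses on building and evaluating a realistic dialogue summarization pipeline under limited computational resources. The model does not achieve state-of-the-art performance: on the 150-example evaluation subset it scores slightly below the first-and-last-sentence baseline on the custom metrics (ROUGE-1 F1 of 0.1376 vs. 0.1464, ROUGE-L F1 of 0.1095 vs. 0.1281). Even so, it highlights key challenges of abstractive summarization and demonstrates how such models can be integrated into larger LLM-based systems.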
