# Fine-Tune a Generative AI Model for Dialogue Summarization

# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Set up the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

In [1]:
# %pip install --upgrade pip
# %pip install --disable-pip-version-check \
#     torch==1.13.1 \
#     torchdata==0.5.1 --quiet

# %pip install \
#     transformers==4.27.2 \
#     datasets==2.11.0 \
#     evaluate==0.4.0 \
#     rouge_score==0.1.2 \
#     loralib==0.1.1 \
#     peft==0.3.0 --quiet



Import the necessary components. Some of them are new this week and will be discussed later in the notebook.

In [1]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np




In [None]:
from datasets import load_dataset

# Load all three splits (train, validation, test) of the DialogSum dataset
dataset = load_dataset("knkarthick/dialogsum")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained model and its tokenizer directly from Hugging Face. The text of this notebook refers to the base version of [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) ([`google/flan-t5-base`](https://huggingface.co/google/flan-t5-base)), but the cell below loads `facebook/bart-large` instead. Passing `torch_dtype=torch.bfloat16` to `from_pretrained` sets the data type of the model weights; the cell below omits it and loads the weights in the default precision.

In [33]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Load BART-large and its tokenizer
model_name = "facebook/bart-large"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)



You can count the model parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [5]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))


trainable model parameters: 737668096
all model parameters: 737668096
percentage of trainable model parameters: 100.00%


In [6]:
dialogue1 = dataset['train'][0]['dialogue']
dialogue2 = dataset['train'][1]['dialogue']
dialogue3 = dataset['train'][2]['dialogue']
summary = dataset['train'][0:3]['summary']
print(dialogue1)
print(dialogue2,"\n")
print(dialogue3,"\n")
# print(summary)

#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.
#Person1#: Hello Mrs. Parker, how have you been?
#Person2#: Hello Dr

<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates that the model can be fine-tuned for the task at hand.

In [34]:
index = 93

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-' * 99  # separator line of 99 dashes
# print(dash_line)
# print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Francis and Monica are discussing when to work on the financial report.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
#Person2#: Hello, this is Monica. I was wondering when we can work on this financial report. I am available from 1 PM to 4 PM on Friday afternoon.#Person1#: Today, I am busy all day long. It'll have to be another time. I'll see you on Friday morning. Then see you at 4 PM. Then come back on Friday.�#Person3#: Thank you, Monica.ⓘ#Person4#: Thanks, Monica, I will see you tomorrow morning.�#Person5#: Good morning, Francis. I'm Francis.ⓘ #Person6#: I'm Monica.�#Person7#: Hi, Francis, I'm Jean.ⓘ⅕⅜⅔⅛⅚⅓⅗Ⅰ⅑⅐⅒⅖ⅎ⅙�


<a name='2'></a>
## 2 - Perform Full Fine-Tuning

In [14]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, max_length=512, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", max_length=512, truncation=True, return_tensors="pt").input_ids
    
    return example

# The dataset contains 3 different splits: train, validation, test.
# With batched=True, tokenize_function processes every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

To save some time in the lab, you will subsample the dataset by keeping every second example:

In [15]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 2 == 0, with_indices=True)

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

Inspect the tokenized labels of the first two training examples and check the shapes of the validation and test splits:

In [16]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train']['labels'][0]}")
print(f"Training: {tokenized_datasets['train']['labels'][1]}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: [1363, 5, 3931, 31, 7, 652, 3, 9, 691, 18, 413, 6, 11, 7582, 12833, 77, 7, 7786, 7, 376, 12, 43, 80, 334, 215, 5, 12833, 77, 7, 31, 195, 428, 128, 251, 81, 70, 2287, 11, 11208, 12, 199, 1363, 5, 3931, 10399, 10257, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

The output dataset is ready for fine-tuning.
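One detail worth noting about the tokenized labels printed above: they are padded with the tokenizer's pad token (id `0` in that output), while the cross-entropy loss used by these models ignores only positions labeled `-100`. The following is a minimal sketch of how padding positions could be masked, assuming a pad id of `0`. The helper name `mask_padding` is hypothetical and is not part of this notebook.

```python
def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids with ignore_index so they do not contribute to the loss."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in row]
        for row in label_ids
    ]


# Toy batch: two label sequences padded to length 6 with pad id 0 (assumed)
labels = [
    [1363, 5, 3931, 1, 0, 0],
    [7582, 1, 0, 0, 0, 0],
]
masked = mask_padding(labels)
print(masked)
```

In this notebook, such a step would most naturally go inside `tokenize_function`, right after the summaries are tokenized.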

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" of section 2. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). When someone says PEFT, they typically mean LoRA. At a very high level, LoRA lets you fine-tune a model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the original LLM remains unchanged and a newly trained "LoRA adapter" emerges. This LoRA adapter is much smaller than the original LLM: on the order of a single-digit percentage of the original LLM size (MBs vs. GBs).

That said, at inference time the LoRA adapter must be combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can reuse the same original LLM, which reduces overall memory requirements when serving multiple tasks and use cases.
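To make the idea of "combining" an adapter with the original weights concrete, here is a minimal pure-Python sketch of the LoRA update for a single weight matrix: the frozen weight `W` is left untouched and a low-rank product `B @ A`, scaled by `alpha / r`, is added to it. The matrix sizes and values are toy assumptions chosen for illustration, and the helper names are hypothetical. This is not how the `peft` library is implemented internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight used at inference time."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy sizes (assumed): d_out = 3, d_in = 4, rank r = 2
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]            # frozen pre-trained weight, d_out x d_in
A = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]            # trainable, r x d_in
B_init = [[0.0, 0.0] for _ in range(3)]   # trainable, d_out x r, starts at zero

# With B at zero, the adapter leaves the original behavior unchanged
unchanged = lora_effective_weight(W, A, B_init, alpha=2, r=2)
print(unchanged == W)

# After training, B is non-zero and the effective weight shifts
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
adapted = lora_effective_weight(W, A, B_trained, alpha=2, r=2)
print(adapted[0])
```

Only `A` and `B` are stored in the adapter, which is why one copy of `W` can be shared by many adapters.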

<a name='3.1'></a>
### 3.1 - Set up the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [40]:
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from sklearn.metrics import accuracy_score, f1_score
from transformers import TrainingArguments, Trainer
import numpy as np
import time

# Define LoRA configuration
lora_config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    # target_modules=["q", "v"],  # Specify target modules
    target_modules=[
        "model.encoder.layers.0.self_attn.q_proj", 
        "model.encoder.layers.0.self_attn.k_proj", 
        "model.encoder.layers.0.self_attn.v_proj", 
        "model.encoder.layers.0.self_attn.out_proj",
        "model.decoder.layers.0.self_attn.q_proj", 
        "model.decoder.layers.0.self_attn.k_proj", 
        "model.decoder.layers.0.self_attn.v_proj", 
        "model.decoder.layers.0.self_attn.out_proj",
        "model.decoder.layers.0.encoder_attn.q_proj", 
        "model.decoder.layers.0.encoder_attn.k_proj", 
        "model.decoder.layers.0.encoder_attn.v_proj", 
        "model.decoder.layers.0.encoder_attn.out_proj"
    ],  # Attention projections of the first encoder and decoder layers only (BART naming)
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # Assuming BART for sequence-to-sequence tasks
)

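# Wrap the original model with the LoRA adapter layers
peft_model = get_peft_model(model, lora_config)

# prepare_model_for_kbit_training is intended for models loaded in 8-bit or 4-bit.
# The model here is not quantized, and the call freezes every parameter,
# including the LoRA weights.
peft_model = prepare_model_for_kbit_training(peft_model)

# Re-enable gradients for the LoRA weights
for name, param in peft_model.named_parameters():
    if "lora" in name:
        param.requires_grad = True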


In [37]:
for name, param in model.named_parameters():
    print(name)

model.shared.weight
model.encoder.embed_positions.weight
model.encoder.layers.0.self_attn.k_proj.weight
model.encoder.layers.0.self_attn.k_proj.bias
model.encoder.layers.0.self_attn.v_proj.weight
model.encoder.layers.0.self_attn.v_proj.bias
model.encoder.layers.0.self_attn.q_proj.weight
model.encoder.layers.0.self_attn.q_proj.bias
model.encoder.layers.0.self_attn.out_proj.weight
model.encoder.layers.0.self_attn.out_proj.bias
model.encoder.layers.0.self_attn_layer_norm.weight
model.encoder.layers.0.self_attn_layer_norm.bias
model.encoder.layers.0.fc1.weight
model.encoder.layers.0.fc1.bias
model.encoder.layers.0.fc2.weight
model.encoder.layers.0.fc2.bias
model.encoder.layers.0.final_layer_norm.weight
model.encoder.layers.0.final_layer_norm.bias
model.encoder.layers.1.self_attn.k_proj.weight
model.encoder.layers.1.self_attn.k_proj.bias
model.encoder.layers.1.self_attn.v_proj.weight
model.encoder.layers.1.self_attn.v_proj.bias
model.encoder.layers.1.self_attn.q_proj.weight
model.encoder.la

In [41]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"
from accelerate import Accelerator

# Initialize Accelerator with mixed precision
accelerator = Accelerator(mixed_precision="fp16")  # Use "bf16" if supported and you prefer it

# Proceed with your model training code using accelerator


In [42]:
# Function to print the number of trainable parameters and their percentage
def print_number_of_trainable_model_parameters(model):
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    percentage = (trainable_params / total_params) * 100
    print(f"Trainable parameters: {trainable_params} out of {total_params} total parameters")
    print(f"Percentage of trainable parameters: {percentage:.2f}%")
    return trainable_params, total_params, percentage

print_number_of_trainable_model_parameters(peft_model)


Trainable parameters: 786432 out of 407077888 total parameters
Percentage of trainable parameters: 0.19%


(786432, 407077888, 0.1931895647449168)
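The trainable-parameter count above can be checked by hand. Each LoRA adapter adds two matrices, `A` of shape `r x d_in` and `B` of shape `d_out x r`, and with `bias="none"` nothing else is trained. The sketch below assumes that every targeted projection in `facebook/bart-large` is a 1024 x 1024 linear layer; the helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


hidden_size = 1024        # assumed size of the bart-large attention projections
rank = 32                 # r in the LoraConfig above
num_target_modules = 12   # number of entries in target_modules above

trainable = num_target_modules * lora_param_count(hidden_size, hidden_size, rank)
total = 407077888         # total parameter count reported above

print(trainable)
print(f"{100 * trainable / total:.2f}%")
```

Because `target_modules` lists only the first encoder and decoder layers, the count is far smaller than it would be if the same projections were adapted in every layer.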

<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from sklearn.metrics import accuracy_score, f1_score
from transformers import TrainingArguments, Trainer
import numpy as np
import time

# Define compute metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred.predictions, eval_pred.label_ids
    
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    print("Predictions shape:", predictions.shape)
    print("Labels shape:", labels.shape)
    
    preds = np.argmax(predictions, axis=-1)

    preds_flat = preds.flatten()
    labels_flat = labels.flatten()

    mask = labels_flat != -100
    preds_flat = preds_flat[mask]
    labels_flat = labels_flat[mask]

    accuracy = accuracy_score(labels_flat, preds_flat)
    f1 = f1_score(labels_flat, preds_flat, average='weighted')

    return {
        'accuracy': accuracy,
        'f1': f1
    }

# Define training arguments
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

# Adjust your training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,  # reduce batch size
    per_device_eval_batch_size=2,  # reduce eval batch size
    gradient_accumulation_steps=16,  # increase gradient accumulation steps
    learning_rate=2e-5,  # try tuning this
    num_train_epochs=5,  # try increasing epochs
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,  # mixed precision
    report_to=[]  # disable W&B
)

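# prepare_model_for_kbit_training freezes every parameter, so re-enable
# gradients for the LoRA weights afterwards (peft_model comes from section 3.1)
peft_model = prepare_model_for_kbit_training(peft_model)
for name, param in peft_model.named_parameters():
    if "lora" in name:
        param.requires_grad = True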

# Trainer without DeepSpeed
peft_trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
peft_trainer.train()

# Save the model and tokenizer
peft_model_path = "./peft-dialogue-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

In [None]:
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from sklearn.metrics import accuracy_score, f1_score
from transformers import TrainingArguments, Trainer
import numpy as np
import time
import torch

# Define compute metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred.predictions, eval_pred.label_ids
    
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    print("Predictions shape:", predictions.shape)
    print("Labels shape:", labels.shape)
    
    preds = np.argmax(predictions, axis=-1)

    preds_flat = preds.flatten()
    labels_flat = labels.flatten()

    mask = labels_flat != -100
    preds_flat = preds_flat[mask]
    labels_flat = labels_flat[mask]

    accuracy = accuracy_score(labels_flat, preds_flat)
    f1 = f1_score(labels_flat, preds_flat, average='weighted')

    return {
        'accuracy': accuracy,
        'f1': f1
    }

# Define training arguments
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

# Adjust your training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,  # reduce batch size
    per_device_eval_batch_size=1,  # reduce eval batch size
    gradient_accumulation_steps=32,  # increase gradient accumulation steps
    learning_rate=1e-6,  # further reduce learning rate
    num_train_epochs=5,  # try increasing epochs
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,  # mixed precision
    report_to=[],  # disable W&B
    max_steps=1000,  # limit the number of steps (overrides num_train_epochs)
    max_grad_norm=1.0  # gradient clipping
)

# Set the model to training mode. prepare_model_for_kbit_training freezes
# every parameter, so the LoRA weights are re-enabled in the loop below.
peft_model.train()
peft_model = prepare_model_for_kbit_training(peft_model)

# Ensure all parameters requiring gradients are set correctly
for name, param in peft_model.named_parameters():
    if "lora" in name:
        param.requires_grad = True
    else:
        param.requires_grad = False

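# Trainer subclass (no DeepSpeed) that clears the CUDA cache around each step
class CustomTrainer(Trainer):
    def training_step(self, model, inputs, *args, **kwargs):
        # Clear CUDA cache before each training step
        torch.cuda.empty_cache()
        return super().training_step(model, inputs, *args, **kwargs)

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        # Clear CUDA cache before each evaluation step
        # (Trainer runs evaluation through prediction_step)
        torch.cuda.empty_cache()
        return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)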

peft_trainer = CustomTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Clear CUDA cache before training
torch.cuda.empty_cache()

# Train the model
peft_trainer.train()

# Save the model and tokenizer
peft_model_path = "./peft-dialogue-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)
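The two training cells above trade per-device batch size against gradient accumulation. A small sketch of the arithmetic, assuming a single GPU and the subsampled training split (every second example of 12460); the helper name `effective_batch_size` is hypothetical:

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


first_cell = effective_batch_size(2, 16)    # settings of the first training cell
second_cell = effective_batch_size(1, 32)   # settings of the second training cell

train_examples = 12460 // 2                 # filter keeps every second example
approx_updates_per_epoch = train_examples / second_cell

print(first_cell, second_cell)
print(round(approx_updates_per_epoch, 1))
```

Both cells therefore update the weights on the same number of examples per step; the second one only lowers peak memory by processing one example at a time. With these numbers, `max_steps=1000` corresponds to roughly five passes over the subsampled training data.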

The cells above train the PEFT adapter and save it, along with the tokenizer, to `./peft-dialogue-summary-checkpoint-local`.

That training was performed on a subset of the data. To load a fully trained PEFT model, read a PEFT model checkpoint from S3.

Check that the size of this model is much less than the original LLM:
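A minimal way to make that check, sketched with the standard library only. The helper name `directory_size_mb` is hypothetical, and the path is the local checkpoint directory used by the training cells above; the exact files it contains depend on the installed `peft` version.

```python
import os


def directory_size_mb(path):
    """Sum the sizes of all files under path, in megabytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 ** 2)


checkpoint_dir = "./peft-dialogue-summary-checkpoint-local"
if os.path.isdir(checkpoint_dir):
    print(f"{checkpoint_dir}: {directory_size_mb(checkpoint_dir):.1f} MB")
else:
    print(f"{checkpoint_dir} not found; run the training cell first")
```

For comparison, the download log for the base model earlier in the notebook reports a weights file of about 1 GB.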

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       './peft-dialogue-summary-checkpoint-local/', 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

The number of trainable parameters will be `0` because of the `is_trainable=False` setting:

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

Trainable parameters: 0 out of 251116800 total parameters
Percentage of trainable parameters: 0.00%
(0, 251116800, 0.0)


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Run inference on a test example with the PEFT model and compare the output with the human baseline summary. The calls for the original and fully fine-tuned models are commented out in the cell below.

In [None]:
index = 1
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
human_baseline_summary = dataset['test'][index]['summary']


# original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

# instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

# Get the device of the model (this assumes your model is on the GPU)
device = peft_model.device

# Move input_ids to the same device as the model
input_ids = input_ids.to(device)

# Generate the summary using the PEFT model
peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

# Print the summaries
print("\n")
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print("\n")
print(f'PEFT MODEL: {peft_model_text_output}')




BASELINE HUMAN SUMMARY:
In order to prevent employees from wasting time on Instant Message programs, #Person1# decides to terminate the use of those programs and asks Ms. Dawson to send out a memo to all employees by the afternoon.


PEFT MODEL: #Person1#: I need to take a dictation for you.


In [None]:
dialogues = dataset['test']['dialogue'][0:2]  # Select the first two test dialogues

human_baseline_summaries = dataset['test']['summary'][0:2]
print(dialogues,"\n")
print(human_baseline_summaries)


['Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.', 'In order to prevent employees from wasting time on Instant Message programs, #Person1# decides to terminate the use of those programs and asks Ms. Dawson to send out a memo to all employees by the afternoon.']


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Run inference on a sample of the test dataset (every third dialogue and summary among the first 250, to save time).

In [None]:
# Select every 3rd dialogue and summary among the first 250 test examples
dialogues = dataset['test']['dialogue'][0:250:3]

human_baseline_summaries = dataset['test']['summary'][0:250:3]
original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    # instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    # Move input_ids to the same device as the PEFT model
    device = peft_model.device
    input_ids = input_ids.to(device)
    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=300))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    # instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)


zipped_summaries = list(zip(human_baseline_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'peft_model_summaries'])
df




| | human_baseline_summaries | peft_model_summaries |
|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: I need to take a dictation for you. |
| 1 | #Person2# arrives late because of traffic jam.... | The traffic jam at the Carrefour intersection ... |
| 2 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. |
| 3 | #Person1# and Brian are at the birthday party ... | #Person1#: Happy birthday, Brian. #Person2#: I... |
| 4 | #Person1# is surprised at the Olympic Stadium'... | The Olympic stadium is the center of the park. |
| ... | ... | ... |
| 79 | #Person1# comes to sign an agreement but it is... | The agreement hasn't been fully prepared. |
| 80 | #Person1# rent a car from ABC Rent-a-car Compa... | The car accident happened near the border. |
| 81 | #Person1# is lost on the way to the school cli... | The first turning on the right is the school c... |
| 82 | #Person2# wants to change her room because the... | #Person1#: Good morning. How may I help you? |


In [None]:
for i in range(len(df)):
    print(df['human_baseline_summaries'][i])
    print(df['peft_model_summaries'][i],"\n")


Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.
#Person1#: I need to take a dictation for you. 

#Person2# arrives late because of traffic jam. #Person1# persuades #Person2# to use public transportations to keep healthy and to protect the environment.
The traffic jam at the Carrefour intersection is a problem. 

#Person1# tells Kate that Masha and Hero get divorced. Kate is surprised because she thought they are perfect couple.
Masha and Hero are getting divorced. 

#Person1# and Brian are at the birthday party of Brian. Brian thinks #Person1# looks great and is popular.
#Person1#: Happy birthday, Brian. #Person2#: I'm so happy you're having a good time. #Person1#: Thank you, I'm sure you're having a good time. #Person2#: Thank you, I'm sure you're having a good time. #Person1#: Thank you, I'm sure you're having a good time. #Person2#: Thank you, I'm sure you're having a 

In [23]:
import torch
print(torch.cuda.is_available())  # Should return True
print(torch.cuda.device_count())  # Should return the number of GPUs
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # Should return the name of your GPU


True
1
NVIDIA GeForce RTX 3050 Laptop GPU


In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(df['human_baseline_summaries'])],
    use_aggregator=True,
    use_stemmer=True,
)

# instruct_model_results = rouge.compute(
#     predictions=instruct_model_summaries,
#     references=human_baseline_summaries[0:len(instruct_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# print('ORIGINAL MODEL:')
# print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

PEFT MODEL:
{'rouge1': 0.2568921170425704, 'rouge2': 0.05977093833631557, 'rougeL': 0.21934565894310598, 'rougeLsum': 0.2213824515230341}


In [None]:
print(f'Human Baseline Summaries Length: {len(human_baseline_summaries)}')
print(f'PEFT Model Summaries Length: {len(peft_model_summaries)}')


Human Baseline Summaries Length: 15
PEFT Model Summaries Length: 5


Notice that the PEFT model results are not too bad, while the training process was much easier!
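To build intuition for what the `rouge1` number measures, here is a simplified sketch of a unigram-overlap F1 score. It assumes whitespace tokenization and lowercasing only, with no stemming, so its values will not match the `evaluate` implementation used above exactly. The function name `rouge1_f1` and the example strings are hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps, per token, the smaller of the two counts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(f"{score:.4f}")
```

A short prediction that copies a few words of the reference gets high precision but low recall, which is one reason why very short generated summaries can still score moderately on this metric.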

You already computed ROUGE score on the full dataset, after loading the results from the `data/dialogue-summary-training-results.csv` file. Load the values for the PEFT model now and check its performance compared to other models.

In [None]:
human_baseline_summaries = df['human_baseline_summaries'].values
# original_model_summaries = df['original_model_summaries'].values
# instruct_model_summaries = df['instruct_model_summaries'].values
peft_model_summaries     = df['peft_model_summaries'].values

# original_model_results = rouge.compute(
#     predictions=original_model_summaries,
#     references=human_baseline_summaries[0:len(original_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

# instruct_model_results = rouge.compute(
#     predictions=instruct_model_summaries,
#     references=human_baseline_summaries[0:len(instruct_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# print('ORIGINAL MODEL:')
# print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

PEFT MODEL:
{'rouge1': 0.2568921170425704, 'rouge2': 0.05977093833631557, 'rougeL': 0.21934565894310598, 'rougeLsum': 0.2213824515230341}


The results show less of an improvement than full fine-tuning, but the benefits of PEFT typically outweigh the slightly lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")
print(np.array(list(original_model_results.values())))
print(np.array(list(peft_model_results.values())))
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
[0.10325378 0.01742063 0.08636895 0.08519822]
[0.25689212 0.05977094 0.21934566 0.22138245]
rouge1: 15.36%
rouge2: 4.24%
rougeL: 13.30%
rougeLsum: 13.62%
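The percentages above are absolute differences, that is, percentage points on the ROUGE scale. A short sketch that contrasts them with relative improvements, using the rounded scores printed above; the helper name `improvements` is hypothetical:

```python
# Rounded ROUGE scores copied from the output above
original = {"rouge1": 0.10325378, "rouge2": 0.01742063,
            "rougeL": 0.08636895, "rougeLsum": 0.08519822}
peft = {"rouge1": 0.25689212, "rouge2": 0.05977094,
        "rougeL": 0.21934566, "rougeLsum": 0.22138245}


def improvements(baseline, candidate):
    """Return {metric: (absolute points, relative percent)} of candidate over baseline."""
    result = {}
    for key in baseline:
        diff = candidate[key] - baseline[key]
        absolute = diff * 100                  # percentage points
        relative = diff / baseline[key] * 100  # percent of the baseline score
        result[key] = (absolute, relative)
    return result


gains = improvements(original, peft)
for key, (absolute, relative) in gains.items():
    print(f"{key}: +{absolute:.2f} points absolute, +{relative:.1f}% relative")
```

Stating which of the two is meant avoids ambiguity when results are compared across models.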


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [None]:
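# Note: instruct_model_results is only defined if the commented-out
# instruct-model ROUGE computation above is enabled
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")
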
improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Here you see a small percentage decrease in the ROUGE metrics compared with the fully fine-tuned model. However, the training requires far fewer compute and memory resources (often just a single GPU).