# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

# Table of Contents

- [ 1 - Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Set up the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)

<a name='1'></a>
## 1 - Load Required Dependencies, Dataset and LLM (5 points)

<a name='1.1'></a>
### 1.1 - Set up Required Dependencies (1 point)

Now install the required packages for the LLM and datasets.



In [27]:
# Installing required dependencies
!pip install datasets torch transformers evaluate rouge_score loralib peft wandb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)






Import the necessary components. Some of them are new this week and will be discussed later in the notebook.

In [82]:
# Importing necessary components
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import Trainer, TrainingArguments
import evaluate
import torch
import time
import wandb
import pandas as pd
import numpy as np

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM (2 points)

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [4]:
# Loading Dataset
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Note that you will be using the [small version](https://huggingface.co/google/flan-t5-small) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` loads the model weights in the 16-bit bfloat16 data type, which halves the memory footprint compared with 32-bit floats.

In [6]:
# Loading pre-trained FLAN-T5 small model and its tokenizer directly from HuggingFace
model_name = "google/flan-t5-small"
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)


You can count the model's parameters and find out how many of them are trainable. The following function does that; you do not need to go into its details at this stage.

In [7]:
# Function to print number of parameters in model and number of parameters in model that are trainable
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    all_model_params = sum(p.numel() for p in model.parameters())
    print("Total Number of Parameters: " + str(all_model_params))
    print("Total Number of Trainable Parameters: " + str(trainable_model_params))

print_number_of_trainable_model_parameters(original_model)

Total Number of Parameters: 76961152
Total Number of Trainable Parameters: 76961152


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing (2 Points)

Test the model with zero-shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates that the model can be fine-tuned to the task at hand.

In [9]:
# Get a fixed example dialogue and its summary from the test dataset
index = 200
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

# Create prompt for zero shot inferencing
prompt = "Summarize the following dialogue.\n\n" + dialogue + "\n\nSummary:"

# Tokenize prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Get model to generate a response to input prompt
response = original_model.generate(**inputs)

# Decode model's response
output = tokenizer.decode(
    response[0],
    skip_special_tokens=True
)

# Compare zero shot inferencing output of our model to baseline human summary
dash_line = '-' * 99
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:
Summarize the following dialogue.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:
-------------------------------------------------------------------------

1.3 Compare the generated summary with the human baseline using qualitative analysis

As the output above shows, the zero-shot summary captures information that appears towards the end of the conversation but misses the main purpose of the conversation as a whole, which the baseline human summary captures. The zero-shot summary also appears to treat Person 2 as the reader rather than as one of two speakers in a third-party conversation.

<a name='2'></a>
## 2 - Perform Full Fine-Tuning (10 points)

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset (2 points)

You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the dialogue and append `Summary:` after it, as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [57]:
def tokenize_function(examples):
    # Create prompt for every example in batch
    inputs = ["Summarize the following conversation.\n\n" + dialogue + "\n\nSummary:" 
              for dialogue in examples['dialogue']]

    # Tokenize the prompts
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    # Tokenize the labels(baseline human summaries)
    labels = tokenizer(examples['summary'], max_length=128, truncation=True, padding="max_length") 
    model_inputs["labels"] = labels["input_ids"]
    # Return tokenized model inputs
    return model_inputs

# The dataset contains three splits: train, validation, and test.
# dataset.map applies tokenize_function to every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=dataset['train'].column_names)
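
One detail worth knowing about the `labels` built above: they are padded to `max_length` with the tokenizer's pad token, and those padded positions count toward the training loss unless they are masked. Hugging Face seq2seq models skip label positions set to `-100` when computing the loss, so a common variation replaces pad ids with `-100` before training. The sketch below is not part of the original lab: `mask_pad_labels` is a hypothetical helper, the token ids are made up, and the pad id of `0` is assumed to match `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_pad_labels(batch_label_ids, pad_token_id=0):
    """Replace pad ids with IGNORE_INDEX in a batch of label id lists."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in label_ids]
        for label_ids in batch_label_ids
    ]

# Toy batch (made-up ids): two summaries padded to length 6 with pad id 0
padded = [[71, 1712, 1, 0, 0, 0], [37, 1, 0, 0, 0, 0]]
masked = mask_pad_labels(padded)
print(masked)
```

Inside `tokenize_function`, the equivalent change would be `model_inputs["labels"] = mask_pad_labels(labels["input_ids"], tokenizer.pad_token_id)`. The results reported in this notebook were produced without this masking.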


To save some time in the lab, you will subsample the dataset:

In [58]:
# Create a subsampled version of the dataset for efficient training
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 3 == 0, with_indices=True)


Check the shapes of all three parts of the dataset:

In [59]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (4154, 3)
Validation: (167, 3)
Test: (500, 3)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4154
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 167
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 500
    })
})


The output dataset is ready for fine-tuning.
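
The subsampled sizes follow directly from the filter `index % 3 == 0`, which keeps every third row starting at index 0. A small arithmetic check, using the split sizes printed when the dataset was loaded (`kept_rows` is a hypothetical helper name):

```python
def kept_rows(num_rows, step=3):
    """Count indices in range(num_rows) that are divisible by step."""
    return len(range(0, num_rows, step))

original_sizes = {"train": 12460, "validation": 500, "test": 1500}
subsampled_sizes = {split: kept_rows(n) for split, n in original_sizes.items()}
print(subsampled_sizes)
```

The counts agree with the shapes printed above (4154, 167 and 500 rows).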

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset (3 points)

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [60]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

# Configure TrainingArguments with appropriate learning rate, epochs, and other hyperparameters
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=5e-5,
    num_train_epochs=3,
    auto_find_batch_size=True,
    logging_steps=100
)

# Initialize the Hugging Face Trainer class with the model, training arguments, and datasets
trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

Start training process...



`trainer.train()` uses the Weights & Biases (wandb) library to track and visualize the training process. To proceed, sign up for a wandb account (for example with your Gmail account) and authenticate with your personal API key so that the training progress can be logged.

In [62]:
# Log in to wandb and start a run.
# wandb.login() reads the API key from the WANDB_API_KEY environment variable
# or prompts for it; never hard-code the key in the notebook.
wandb.login()
wandb.init(project="hw4_problem1", name="2.2")

# Execute the training process using trainer.train()
trainer.train()

# Save the fine-tuned model checkpoint
trainer.save_model("./flan-t5-finetuned")
tokenizer.save_pretrained("./flan-t5-finetuned")

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc




| Step | Training Loss |
|---|---|
| 100 | 11.7675 |
| 200 | 7.6397 |
| 300 | 5.7394 |
| 400 | 5.09 |
| 500 | 4.8697 |
| 600 | 4.7344 |
| 700 | 4.7306 |




('./flan-t5-finetuned/tokenizer_config.json',
 './flan-t5-finetuned/special_tokens_map.json',
 './flan-t5-finetuned/spiece.model',
 './flan-t5-finetuned/added_tokens.json',
 './flan-t5-finetuned/tokenizer.json')


Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [64]:
# Path of the checkpoint saved by trainer.save_model above
model_path = "./flan-t5-finetuned"

# Load the tokenizer that was saved next to the fine-tuned model
instruct_tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the fine-tuned model; device_map="auto" places the weights on the available device(s)
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Device used for the input tensors in the cells below
# (the explicit move is optional, since device_map="auto" should already handle it)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
instruct_model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation) (2 points)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [69]:
# Get random test dialogue and label from test dataset
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

# Construct prompt
prompt = "Summarize the following conversation.\n\n" + dialogue + "\n\nSummary:"

# Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move input_ids to the same device as the model
input_ids = input_ids.to(device)

# Get text output from original model
original_tokenizer = tokenizer
original_model_response = original_model.generate(input_ids)
original_model_outputs = original_model_response[0]
original_model_text_output = original_tokenizer.decode(
    original_model_outputs,
    skip_special_tokens=True
)

# Get text output from instruct finetuned model
instruct_model_response = instruct_model.generate(input_ids)
instruct_model_outputs = instruct_model_response[0]
instruct_model_text_output = instruct_tokenizer.decode(
    instruct_model_outputs,
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Share this with the others.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# You could consider adding a painting program to your software.


2.3 Compare outputs across models using the same test examples. Analyze improvements in summary quality, coherence, and relevance.

As the output above shows, the fine-tuned model captures the conversation better than the original model, which gave a different response to the test prompt this time. One possible reason for the changed response is that `original_model` is the same object that was passed to `Trainer` in section 2.2, so training updated it in place. The original model's output does not reflect the purpose of the conversation at all, while the fine-tuned model at least repeats one key point made in it.

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric) (3 points)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
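
To make the metric concrete, here is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score. It is an illustration only: it lowercases and splits on whitespace, whereas the `rouge` evaluator used below applies its own tokenization, so the numbers will not match exactly. The function name `rouge1_f1` and the example sentences are made up for this sketch.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped matches: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
score = rouge1_f1(prediction, reference)
print(round(score, 3))
```

Five of the six words match in each direction, so precision and recall are both 5/6. ROUGE-2 applies the same idea to word pairs, and ROUGE-L uses the longest common subsequence instead of n-gram counts.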

In [70]:
# Load rouge evaluator
rouge = evaluate.load("rouge")


Generate outputs for a sample of the test dataset (only 10 dialogues and summaries, to save time) and save the results.

In [77]:
# Get 10 test set dialogues and their labels
dialogues = dataset['test'][40:50]['dialogue']
human_baseline_summaries = dataset['test'][40:50]['summary']

original_model_summaries = []
instruct_model_summaries = []

for dialogue in dialogues:
    prompt = "Summarize the following conversation.\n\n" + dialogue + "\n\nSummary:"
    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Move input_ids to the same device as the model
    input_ids = input_ids.to(device)

    # Get text output from original model
    original_model_response = original_model.generate(input_ids)
    original_model_outputs = original_model_response[0]
    original_model_text_output = original_tokenizer.decode(
        original_model_outputs,
        skip_special_tokens=True
    )
    
    # Get text output from instruct finetuned model
    instruct_model_response = instruct_model.generate(input_ids)
    instruct_model_outputs = instruct_model_response[0]
    instruct_model_text_output = instruct_tokenizer.decode(
        instruct_model_outputs,
        skip_special_tokens=True
    )

    # Append model outputs to appropriate lists
    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)

# Display human baseline, original model, and finetuned model summaries in a dataframe
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | #Person1# is in a hurry to catch a train. Tom ... | @Person1#Person1#Person2#It's a minute | @Person, I'm not a fan of the ten to nine by m... |
| 1 | #Person1# is rushing to catch a train but Tom ... | @PresidentSony_Persons are not the same. | @Person, I'm not a fan of the ten to nine by m... |
| 2 | #Person1# wants to adjust #Person1#'s life and... | You should not do this. | #Person1# |
| 3 | #Person1# has a bad lifestyle. #Person2# kindl... | It's a good idea. | #Person1# |
| 4 | #Person2# hopes #Person1# will become healthy ... | #Person1# | #Person1# |
| 5 | #Person1# tells #Person2# that Ruojia is marri... | #Person, #Person. | #Person! #Person! #Person! #Person! #Person! |
| 6 | #Person2# is surprised to know from #Person1# ... | Share your ideas. | #Person! #Person! #Person! #Person! #Person! |
| 7 | #Person2# is surprised that Ruojia's married. ... | @Person_________________ | #Person! #Person! #Person! #Person! #Person! |
| 8 | #Person2# at first thinks #Person1#'s behaviou... | #Person1#Person2# | #Person1# You might make a few enemies. |
| 9 | #Person1# plans on playing a trick to others. ... | 'I'm not a fan of being rude to your friends. | #Person1# You might make a few enemies. |


Evaluate the models computing ROUGE metrics. Notice the improvement in the results!

In [79]:
# Get rouge score metric for original model generated summaries
original_model_results = rouge.compute(
    predictions=original_model_summaries, references=human_baseline_summaries
)

# Get rouge score metric for finetuned model generated summaries
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries, references=human_baseline_summaries
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.10210084033613445, 'rouge2': 0.0, 'rougeL': 0.08015390064071605, 'rougeLsum': 0.07840561511961107}
INSTRUCT MODEL:
{'rouge1': 0.11086793517633425, 'rouge2': 0.0, 'rougeL': 0.09812110581030985, 'rougeLsum': 0.09455990521235702}


2.4 Analyze and compare performance metrics between models

As the output above shows, the fine-tuned model slightly beats the original model on every ROUGE metric except ROUGE-2, which is 0.0 for both models. Judged qualitatively, however, the summaries from both models look poor. The original model's summaries use a more diverse vocabulary, perhaps because its output is less constrained than that of the fine-tuned model.

The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [80]:
# Update the path to the CSV file
results_path = "../input/pre-populated-list/dialogue-summary-training-results.csv"
results = pd.read_csv(results_path)

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

# Get rouge score metric for original model generated summaries
original_model_results = rouge.compute(
    predictions=original_model_summaries, references=human_baseline_summaries
)

# Get rouge score metric for finetuned model generated summaries
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries, references=human_baseline_summaries
)


print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2216686882994889, 'rouge2': 0.0707492488737373, 'rougeL': 0.19245630286595683, 'rougeLsum': 0.192409231638204}
INSTRUCT MODEL:
{'rouge1': 0.4041959932817219, 'rouge2': 0.17064828985299663, 'rougeL': 0.3267557101191949, 'rougeLsum': 0.3266766725171105}


The results show substantial improvement in all ROUGE metrics:

In [83]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 18.25%
rouge2: 9.99%
rougeL: 13.43%
rougeLsum: 13.43%
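
The figures above are absolute differences, that is, percentage points on the 0 to 100 ROUGE scale. The relative improvement over the original model is a different number. A short sketch that computes both from the scores printed earlier (the dictionary names are chosen for this example):

```python
original = {"rouge1": 0.2216686882994889, "rouge2": 0.0707492488737373,
            "rougeL": 0.19245630286595683, "rougeLsum": 0.192409231638204}
instruct = {"rouge1": 0.4041959932817219, "rouge2": 0.17064828985299663,
            "rougeL": 0.3267557101191949, "rougeLsum": 0.3266766725171105}

gains = {}
for key in original:
    difference = instruct[key] - original[key]
    absolute_points = difference * 100                  # percentage points
    relative_percent = difference / original[key] * 100  # relative to the baseline
    gains[key] = (absolute_points, relative_percent)
    print(f"{key}: +{absolute_points:.2f} points, +{relative_percent:.1f}% relative")
```

For ROUGE-1, a gain of about 18 points on a baseline of about 22 corresponds to a relative improvement of roughly 82%.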


2.4 Analyze and compare performance metrics between models

As the output above shows, when evaluated on a larger portion of the data, the fine-tuned model beats the original model on every ROUGE metric, and by a much wider margin than in the comparison on only 10 test samples.

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT) (10 points)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.

<a name='3.1'></a>
### 3.1 - Set up the PEFT/LoRA model for Fine-Tuning (2 points)

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [84]:
from peft import LoraConfig, get_peft_model, TaskType

# Configure LoRA parameters using LoraConfig with appropriate rank, alpha, and target modules
lora_config = LoraConfig(
    r=16,  # rank of the adapter matrices
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [86]:
# Initialize the PEFT model 
peft_model = get_peft_model(original_model, lora_config)

# Verify the reduction in trainable parameters compared to full fine tuning
print_number_of_trainable_model_parameters(peft_model)

Total Number of Parameters: 77649280
Total Number of Trainable Parameters: 688128
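
The trainable count can be reproduced by hand. LoRA adds two small matrices to each targeted linear layer: `A` with shape `r x in_features` and `B` with shape `out_features x r`. The sketch below uses the `q`/`v` shapes from the model printout earlier (512 inputs, 384 outputs) and assumes, as a layout not shown in this notebook, that FLAN-T5 small has 8 encoder blocks with self-attention and 8 decoder blocks with self-attention plus cross-attention.

```python
def lora_params_per_linear(in_features, out_features, rank):
    """Parameters LoRA adds to one linear layer: A (rank x in) plus B (out x rank)."""
    return rank * in_features + out_features * rank

rank = 16
per_module = lora_params_per_linear(in_features=512, out_features=384, rank=rank)

# Assumed layout: 8 encoder attention layers, 8 decoder self-attention layers
# and 8 decoder cross-attention layers, each with a q and a v projection.
attention_layers = 8 + 8 * 2
adapted_modules = attention_layers * 2  # q and v in every attention layer

trainable = per_module * adapted_modules
base_params = 76961152  # total printed for the original model
total = base_params + trainable
print(trainable, total, f"{100 * trainable / total:.2f}%")
```

Under that assumption the arithmetic matches the printed values: 688,128 trainable parameters out of 77,649,280, which is under 1% of the model.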


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter (3 points)

Define training arguments and create `Trainer` instance.

In [87]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

# Set up training arguments specific to PEFT, including higher learning rate
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-3,
    num_train_epochs=3,
    auto_find_batch_size=True,
    logging_steps=100
)

# Initialize training using the Hugging Face Trainer
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Now everything is ready to train the PEFT adapter and save the model.



In [88]:
# Start a new wandb run for the PEFT training
wandb.init(project="hw4_problem1", name="3.2")

peft_model_path = "./flan-t5-peft-finetuned"

# Execute the training process using peft_trainer.train()
peft_trainer.train()

# Save the fine-tuned PEFT checkpoint and the tokenizer
peft_trainer.save_model(peft_model_path)
peft_tokenizer = original_tokenizer
peft_tokenizer.save_pretrained(peft_model_path)

wandb run history:

| Metric | History |
|---|---|
| train/epoch | ▁▂▃▄▅▆▇█ |
| train/global_step | ▁▂▃▄▅▆▇█ |
| train/grad_norm | █▃▂▁▁▁▂ |
| train/learning_rate | █▇▆▅▃▂▁ |
| train/loss | █▄▂▁▁▁▁ |

wandb run summary:

| Metric | Value |
|---|---|
| total_flos | 2316567469621248.0 |
| train/epoch | 3.0 |
| train/global_step | 780.0 |
| train/grad_norm | 15.75 |
| train/learning_rate | 1e-05 |
| train/loss | 4.7306 |
| train_loss | 6.20044 |
| train_runtime | 427.1481 |
| train_samples_per_second | 29.175 |
| train_steps_per_second | 1.826 |




| Step | Training Loss |
|---|---|
| 100 | 3.1492 |
| 200 | 2.1645 |
| 300 | 1.9905 |
| 400 | 1.9066 |
| 500 | 1.8567 |
| 600 | 1.8269 |
| 700 | 1.8112 |




('./flan-t5-peft-finetuned/tokenizer_config.json',
 './flan-t5-peft-finetuned/special_tokens_map.json',
 './flan-t5-peft-finetuned/spiece.model',
 './flan-t5-peft-finetuned/added_tokens.json',
 './flan-t5-peft-finetuned/tokenizer.json')



That training was performed on a subset of the data. A fully trained PEFT checkpoint could be read from Google Drive in the same way; the cell below loads the adapter saved at `peft_model_path`.

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [94]:
from peft import PeftModel, PeftConfig

# Get PeftConfig from peft_model_path
peft_config = PeftConfig.from_pretrained(peft_model_path)

# Load the base model that the adapter was trained on
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(peft_config.base_model_name_or_path)
# Get tokenizer from peft_model_path
peft_tokenizer = AutoTokenizer.from_pretrained(peft_model_path)

# Attach the trained adapter to the base model
peft_model = PeftModel.from_pretrained(
    peft_model_base,
    peft_model_path,
    is_trainable=False  # For inference only
)

# Move the entire peft_model to the device
peft_model = peft_model.to(device)

The number of trainable parameters will be `0` because of the `is_trainable=False` setting:

In [95]:
print_number_of_trainable_model_parameters(peft_model)

Total Number of Parameters: 77649280
Total Number of Trainable Parameters: 0


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation) (2 points)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3) with the original model, the fully fine-tuned model, and the PEFT model.

In [96]:
# Get random dialogue and label from test dataset
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = "Summarize the following conversation.\n\n" + dialogue + "\n\nSummary:"

# Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move input_ids to the same device as the model
input_ids = input_ids.to(device)

# Get text output from original model
original_model_response = original_model.generate(input_ids)
original_model_outputs = original_model_response[0]
original_model_text_output = original_tokenizer.decode(
    original_model_outputs,
    skip_special_tokens=True
)

# Get text output from instruct finetuned model
instruct_model_response = instruct_model.generate(input_ids)
instruct_model_outputs = instruct_model_response[0]
instruct_model_text_output = tokenizer.decode(
    instruct_model_outputs,
    skip_special_tokens=True
)

# Get text output from peft model
peft_model_response = peft_model.generate(input_ids=input_ids)
peft_model_outputs = peft_model_response[0]
peft_model_text_output = peft_tokenizer.decode(
    peft_model_outputs,
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1# thinks #Person2# should upgrade to the system because it is
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# You could consider adding a painting program to your software.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person1# thinks adding a painting program to the software would allow #Person


3.3 Analyze the quality of summaries considering different aspects

As the output above shows, the fine-tuned model captures the conversation better than the original model, which gave a different response to the test prompt this time. The original model doesn't really capture the purpose of the conversation, while the fine-tuned model at least picks up a key point made in it. Surprisingly, the PEFT model captures the conversation better than both the fine-tuned model and the original model. Unlike the fine-tuned model, which appears to echo a line of dialogue, the PEFT model describes the conversation in the third person, as a summary should. Maybe the fully fine-tuned model was overfitting, and the PEFT model generalizes better because it trains far fewer parameters.
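
Two of the three outputs above stop mid-sentence, which makes the qualitative comparison harder. A likely cause, although this is an assumption about the setup rather than something the notebook states, is that `generate` is called without an explicit length setting and stops at its default limit. The sketch below uses a hypothetical helper, `looks_truncated`, to flag outputs that do not end in terminal punctuation. It is a crude heuristic, not a reliable detector.

```python
def looks_truncated(summary):
    """Heuristic: text that does not end in '.', '!' or '?' was probably cut off."""
    stripped = summary.strip()
    return not stripped.endswith((".", "!", "?"))


# Model outputs copied from the cell output above.
outputs = {
    "ORIGINAL MODEL": "#Person1# thinks #Person2# should upgrade to the system because it is",
    "INSTRUCT MODEL": "#Person1# You could consider adding a painting program to your software.",
    "PEFT MODEL": "#Person1# thinks adding a painting program to the software would allow #Person",
}

for name, text in outputs.items():
    status = "possibly truncated" if looks_truncated(text) else "ends cleanly"
    print(f"{name}: {status}")
```

If outputs are flagged this way, passing a larger `max_new_tokens` value to `generate` is the usual remedy, at the cost of slower inference.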

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric) (3 points)
Perform inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [104]:
# Get 10 test set dialogues and their labels
dialogues = dataset['test'][40:50]['dialogue']
human_baseline_summaries = dataset['test'][40:50]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = "Summarize the following conversation.\n\n" + dialogue + "\n\nSummary:"

    # Move input_ids to the same device as the model
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    # Get text output from original model
    original_model_response = original_model.generate(input_ids)
    original_model_outputs = original_model_response[0]
    original_model_text_output = original_tokenizer.decode(
        original_model_outputs,
        skip_special_tokens=True
    )
    
    # Get text output from instruct finetuned model
    instruct_model_response = instruct_model.generate(input_ids)
    instruct_model_outputs = instruct_model_response[0]
    instruct_model_text_output = instruct_tokenizer.decode(
        instruct_model_outputs,
        skip_special_tokens=True
    )

    # Get text output from peft model
    peft_model_response = peft_model.generate(input_ids=input_ids)
    peft_model_outputs = peft_model_response[0]
    peft_model_text_output = peft_tokenizer.decode(
        peft_model_outputs,
        skip_special_tokens=True
    )

    # append the model outputs to their respective lists
    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

# display the outputs of all the models as a pandas dataframe
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|
| 0 | #Person1# is in a hurry to catch a train. Tom ... | #P1# tells Tom Tom is ten to nine by his watch... | @Person, I'm not a fan of the ten to nine by m... | Tom is waiting for the train to arrive. He has... |
| 1 | #Person1# is rushing to catch a train but Tom ... | #Person2# is off now and will catch the train. | @Person, I'm not a fan of the ten to nine by m... | Tom is waiting for the train to arrive. He has... |
| 2 | #Person1# wants to adjust #Person1#'s life and... | #Person2# tells #Person1# #Person2# can't | #Person1# | #Person1# tells #Person2# that #Person2# can' |
| 3 | #Person1# has a bad lifestyle. #Person2# kindl... | #Person2# tells #Person2# #Person1# can't | #Person1# | #Person1# tells #Person2# that #Person2# can' |
| 4 | #Person2# hopes #Person1# will become healthy ... | #Person1# tells #Person1# #Person1# can't | #Person1# | #Person1# tells #Person2# that #Person2# can' |
| 5 | #Person1# tells #Person2# that Ruojia is marri... | #Person1# invites #Person2# to the party tonig... | #Person! #Person! #Person! #Person! #Person! | Ruojia's party is going to be a party tonight.... |
| 6 | #Person2# is surprised to know from #Person1# ... | Ruojia wants to go to the party tonight. #Pers... | #Person! #Person! #Person! #Person! #Person! | Ruojia's party is going to be a party tonight.... |
| 7 | #Person2# is surprised that Ruojia's married. ... | #Person1# wants to go to the party. #Person1# ... | #Person! #Person! #Person! #Person! #Person! | Ruojia's party is going to be a party tonight.... |
| 8 | #Person2# at first thinks #Person1#'s behaviou... | #Person1# tells #Person1# that #Person22# is | #Person1# You might make a few enemies. | #Person1# tells #Person2# that the two ugly ol... |
| 9 | #Person1# plans on playing a trick to others. ... | #Person1# tells #Person1# #Person1#'s friends | #Person1# You might make a few enemies. | #Person1# tells #Person2# that the two ugly ol... |


Compute ROUGE score for this subset of the data.

In [105]:
# Load the rouge evaluator
rouge = evaluate.load('rouge')

# Get rouge score metric for original model generated summaries
original_model_results = rouge.compute(
    predictions=original_model_summaries, references=human_baseline_summaries
)

# Get rouge score metric for finetuned model generated summaries
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries, references=human_baseline_summaries
)

# Get rouge score metric for peft model generated summaries
peft_model_results = rouge.compute(
    predictions=peft_model_summaries, references=human_baseline_summaries
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2297754317843419, 'rouge2': 0.016856060606060607, 'rougeL': 0.19614187176847026, 'rougeLsum': 0.19629431610291276}
INSTRUCT MODEL:
{'rouge1': 0.11086793517633425, 'rouge2': 0.0, 'rougeL': 0.09812110581030985, 'rougeLsum': 0.09455990521235702}
PEFT MODEL:
{'rouge1': 0.27515248972164164, 'rouge2': 0.05506175640250051, 'rougeL': 0.20027310328222514, 'rougeLsum': 0.19921123063254126}
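
To make the scores above more concrete, here is a simplified sketch of what `rouge1` measures: the F1 score of unigram overlap between a prediction and its reference. The hypothetical `rouge1_f1` function below splits on whitespace and lowercases, whereas the `rouge` metric loaded from `evaluate` applies its own tokenization, so the numbers will not match the library's output exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token,
    # so a word repeated in the prediction is not rewarded more than once
    # unless it is also repeated in the reference.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Reference and fine-tuned model output from the qualitative example in section 3.3.
reference = "#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system."
prediction = "#Person1# You could consider adding a painting program to your software."
print(f"rouge1 (simplified): {rouge1_f1(prediction, reference):.3f}")
```

`rouge2` applies the same idea to pairs of adjacent words, and `rougeL` is based on the longest common subsequence, which is why a degenerate output such as a lone `#Person1#` can still score above zero on `rouge1` while scoring `0.0` on `rouge2`.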


3.4 Analyze PEFT vs. original model metrics

Compare PEFT vs. full fine-tuning results

As the output above shows, on this sample the fine-tuned model performed the worst, followed by the original model, with the PEFT model performing best across all metrics. The fine-tuned model's poor score may indicate that fine-tuning the entire model caused overfitting, although ten examples are too few to be sure. Surprisingly, the PEFT model performed best, which suggests that updating only a small set of adapter parameters, rather than every weight in the model, can be enough to adapt it and may even help it generalize better to unseen data.

Notice that the PEFT model results are not too bad, while the training process was much easier!
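
One caveat when reading the sample scores: the ten rows in the summaries table above appear to cover only a few distinct dialogues, since the same model output repeats across consecutive rows. The sketch below counts distinct outputs per model using values copied from that table, with truncated cells kept exactly as displayed. The helper name `distinct_count` and the `*_sample` variables are introduced here for illustration only.

```python
def distinct_count(summaries):
    """Number of distinct strings in a list of model outputs."""
    return len(set(summaries))


# Values copied from the table above; "..." marks cells the display truncated.
peft_sample = (
    ["Tom is waiting for the train to arrive. He has..."] * 2
    + ["#Person1# tells #Person2# that #Person2# can'"] * 3
    + ["Ruojia's party is going to be a party tonight...."] * 3
    + ["#Person1# tells #Person2# that the two ugly ol..."] * 2
)
instruct_sample = (
    ["@Person, I'm not a fan of the ten to nine by m..."] * 2
    + ["#Person1#"] * 3
    + ["#Person! #Person! #Person! #Person! #Person!"] * 3
    + ["#Person1# You might make a few enemies."] * 2
)

print(f"rows: {len(peft_sample)}")
print(f"distinct PEFT outputs: {distinct_count(peft_sample)}")
print(f"distinct fine-tuned outputs: {distinct_count(instruct_sample)}")
```

With so few distinct inputs, a single poor generation shifts the average noticeably, so the ranking on this sample is best treated as a rough indication.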

You already computed the ROUGE scores on the full dataset after loading the results from the pre-populated `dialogue-summary-training-results.csv` file (read below from its Kaggle input path). Load the values for the PEFT model now and compare its performance with the other models.

In [107]:
# Update the path to the CSV file
results_path = "../input/pre-populated-list/dialogue-summary-training-results.csv"
results = pd.read_csv(results_path)

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

# Get rouge score metric for original model generated summaries
original_model_results = rouge.compute(
    predictions=original_model_summaries, references=human_baseline_summaries
)

# Get rouge score metric for finetuned model generated summaries
instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries, references=human_baseline_summaries
)

# Get rouge score metric for peft model generated summaries
peft_model_results = rouge.compute(
    predictions=peft_model_summaries, references=human_baseline_summaries
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2216686882994889, 'rouge2': 0.0707492488737373, 'rougeL': 0.19245630286595683, 'rougeLsum': 0.192409231638204}
INSTRUCT MODEL:
{'rouge1': 0.4041959932817219, 'rouge2': 0.17064828985299663, 'rougeL': 0.3267557101191949, 'rougeLsum': 0.3266766725171105}
PEFT MODEL:
{'rouge1': 0.39119098357131776, 'rouge2': 0.15459808342905274, 'rougeL': 0.31367299251500014, 'rougeLsum': 0.31360615168633016}


The full-dataset results show PEFT improving less than full fine-tuning does, but the benefits of PEFT typically outweigh the slightly lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [108]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 16.95%
rouge2: 8.38%
rougeL: 12.12%
rougeLsum: 12.12%
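
The figures above are differences between two scores, so they are best read as percentage points rather than as relative gains. As a small sketch, the hypothetical `improvement` helper below reports both views, using the full-dataset scores copied from the earlier output. The variable names are chosen here so that they do not overwrite the notebook's own results.

```python
def improvement(new_score, old_score):
    """Return (absolute difference, relative change) between two scores."""
    absolute = new_score - old_score
    relative = absolute / old_score
    return absolute, relative


# Full-dataset ROUGE scores copied from the output above.
original_scores = {
    "rouge1": 0.2216686882994889,
    "rouge2": 0.0707492488737373,
    "rougeL": 0.19245630286595683,
    "rougeLsum": 0.192409231638204,
}
peft_scores = {
    "rouge1": 0.39119098357131776,
    "rouge2": 0.15459808342905274,
    "rougeL": 0.31367299251500014,
    "rougeLsum": 0.31360615168633016,
}

for key in peft_scores:
    absolute, relative = improvement(peft_scores[key], original_scores[key])
    print(f"{key}: {absolute * 100:.2f} percentage points, {relative * 100:.1f}% relative")
```

The relative view is larger because the original model's scores are low to begin with, so it is worth stating which of the two is being quoted.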


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [109]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.30%
rouge2: -1.61%
rougeL: -1.31%
rougeLsum: -1.31%


3.4 Analyze and compare performance metrics between models

Evaluate the trade-off between performance and computational efficiency

As the output above shows, when evaluated on a larger portion of the data, the fine-tuned model beats both the original model and the PEFT model on every ROUGE metric. However, the PEFT model's ROUGE scores are very close to the fine-tuned model's, indicating that training only the small LoRA adapter still produces results comparable to full fine-tuning. Fine-tuning the PEFT model is also far more computationally efficient because it updates far fewer parameters, as shown in the earlier output that prints the number of trainable parameters for the PEFT model. As a result, if there is a computational budget to meet, it might be worth fine-tuning the PEFT model for more epochs rather than fine-tuning the full model. In that case the PEFT model might actually produce better results than full fine-tuning.

Here you see a small percentage decrease in the ROUGE metrics versus the fully fine-tuned model. However, the training requires far less compute and memory (often just a single GPU).
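
The efficiency argument can be made concrete with a little arithmetic. For a single weight matrix of shape `d_out x d_in`, full fine-tuning updates every entry, whereas LoRA freezes the matrix and trains two low-rank factors of shapes `d_out x rank` and `rank x d_in`. The sketch below compares the two counts. The layer size and rank are hypothetical values chosen for illustration, not the configuration used in this notebook.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix.

    Full fine-tuning updates the d_out x d_in matrix directly.
    LoRA trains B (d_out x rank) and A (rank x d_in) and leaves the matrix frozen.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical layer size and rank, for illustration only.
full, lora = lora_param_counts(d_out=512, d_in=512, rank=8)
print(f"full fine-tuning: {full} trainable parameters")
print(f"LoRA adapter:     {lora} trainable parameters")
print(f"LoRA / full:      {lora / full:.4f}")
```

The adapter adds parameters on top of the frozen base model rather than shrinking it, so the saving is in what has to be trained and stored per task, not in the size of the model used at inference time.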

Limitations Encountered During the Fine-Tuning Process: I was not able to get good performance when fine-tuning the full model here on Kaggle, no matter what reasonable values I tried for the number of epochs (1-5) and the learning rate. I settled on the current settings after around 2 hours of experimenting.

I've executed this entire notebook on Kaggle, where some of the necessary dependencies and libraries are already installed and don't need to be installed explicitly with `pip install`.