## Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the FLAN-T5 model, which provides a high quality instruction tuned model and can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly-lower performance metrics.

### Table of Contents
* Load Dataset and LLM
  * Load Dataset and LLM
  * Test the Model with Zero Shot Inferencing
* Perform Full Fine-Tuning
  * Preprocess the Dialog-Summary Dataset
  * Fine-Tune the Model with the Preprocessed Dataset
  * Evaluate the Model Qualitatively (Human Evaluation)
  * Evaluate the Model Quantitatively (with ROUGE Metric)
* Perform Parameter Efficient Fine-Tuning (PEFT)
  * Setup the PEFT/LoRA model for Fine-Tuning
  * Train PEFT Adapter
  * Evaluate the Model Qualitatively (Human Evaluation)
  * Evaluate the Model Quantitatively (with ROUGE Metric)

In [1]:
#%pip install --upgrade pip
#%pip install --disable-pip-version-check \
#    torch==1.13.1 \
#    torchdata==0.5.1 --quiet
#
#%pip install \
#    transformers==4.27.2 \
#    datasets==2.11.0 \
#    evaluate==0.4.0 \
#    rouge_score==0.1.2 \
#    loralib==0.1.1 \
#    peft==0.3.0 --quiet
!pip install evaluate rouge_score loralib peft
!pip install scikit-learn scipy bitsandbytes

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

#### Load Dataset and LLM
You are going to continue experimenting with the DialogSum Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [3]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

##### Load the pre-trained FLAN-T5 model and its tokenizer directly from HuggingFace. Notice that you will be using the [small version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting torch_dtype=torch.bfloat16 specifies the memory type to be used by this model. 

##### In the AutoModelForCausalLM class of the  Transformers library, the torch_dtype argument specifies the data type used for internal computations and storage in the model. It can take several values:

- None (default): This uses the default dtype for PyTorch Tensors, which is currently torch.float32.
- A valid torch dtype: You can explicitly specify a different dtype like torch.float16 or torch.bfloat16 for lower memory usage and faster computation.
- "auto": This option automatically tries to infer the appropriate dtype based on the loaded model weights. If the weights are already in a specific dtype, it will use that. Otherwise, it falls back to the default (torch.float32).


In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name='google/flan-t5-base'
#model_name='google/flan-t5-large'
#model_name='google/flan-t5-xl'
#model_name='google/flan-t5-xxl'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
#original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

##### It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function can be used to do that, at this stage, you do not need to go into details of it.

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

#### Test the Model with Zero Shot Inferencing
Test the model with the zero shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text which indicates the model can be fine-tuned to the task at hand.

In [5]:
original_model = original_model.to(device)
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt').to(device)
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

### Perform Full Fine-Tuning

#### Preprocess the Dialog-Summary Dataset
You need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Preprocess the prompt-response dataset into tokens and pull out their input_ids (1 per token).

In [6]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example


In [7]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [8]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 1500
    })
})

In [9]:
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

In [10]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
})

In [11]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
})


#### Fine-Tune the Model with the Preprocessed Dataset
Now utilize the built-in Hugging Face Trainer class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [12]:
#output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'
output_dir = "./dialogue-summary-training"

training_args = TrainingArguments(
    output_dir=output_dir,
    #learning_rate=1e-5,
    learning_rate=1e-4,
    num_train_epochs=1,
    #num_train_epochs=45,
    #num_train_epochs=15,
    weight_decay=0.01,
    logging_steps=100,
    #max_steps=1
    #max_steps=100
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [13]:
trainer.train() #It took ~14 minutes with fp32 and 1 epoch
#trainer.train(resume_from_checkpoint=True)

Step,Training Loss
100,9.8867
200,1.5768
300,0.6741
400,0.4244
500,0.3154
600,0.2749
700,0.2543
800,0.2265
900,0.2226
1000,0.2223


Checkpoint destination directory ./dialogue-summary-training/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./dialogue-summary-training/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./dialogue-summary-training/checkpoint-1500 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1558, training_loss=0.9814703847079589, metrics={'train_runtime': 245.4243, 'train_samples_per_second': 50.769, 'train_steps_per_second': 6.348, 'total_flos': 8532076611502080.0, 'train_loss': 0.9814703847079589, 'epoch': 1.0})

In [14]:
instruct_model_path="./dialogue-summary-checkpoint-local"

trainer.model.save_pretrained(instruct_model_path)
tokenizer.save_pretrained(instruct_model_path)

('./dialogue-summary-checkpoint-local/tokenizer_config.json',
 './dialogue-summary-checkpoint-local/special_tokens_map.json',
 './dialogue-summary-checkpoint-local/tokenizer.json')

In [15]:
#!ls -al ./dialogue-summary-checkpoint-local
#total 474179
#drwxr-xr-x  2 qualis in0183      8192 Dec 17 15:14 .
#drwxr-xr-x 15 qualis in0183      8192 Dec 17 15:14 ..
#-rw-r--r--  1 qualis in0183      1558 Dec 17 15:14 config.json
#-rw-r--r--  1 qualis in0183       142 Dec 17 15:14 generation_config.json
#-rw-r--r--  1 qualis in0183 495189552 Dec 17 15:14 model.safetensors
#-rw-r--r--  1 qualis in0183      2543 Dec 17 15:14 special_tokens_map.json
#-rw-r--r--  1 qualis in0183     20771 Dec 17 15:14 tokenizer_config.json
#-rw-r--r--  1 qualis in0183   2422256 Dec 17 15:14 tokenizer.json

In [16]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_path, torch_dtype=torch.bfloat16)
#instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_path)
#instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan-dialogue-summary-checkpoint", torch_dtype=torch.bfloat16)
#instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./fine-tune-test/dialogue-summary-checkpoint-local", torch_dtype=torch.bfloat16)

In [17]:
instruct_model = instruct_model.to(device)

In [18]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"


print(print_number_of_trainable_model_parameters(instruct_model)) #trainable model parameters: 247577856

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


#### Evaluate the Model Qualitatively (Human Evaluation)
As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [19]:
#index = 200
index = 40

dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1# is a minute away from the nine-thirty train. #Person2# is the only one off the train.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# is late.


#### Evaluate the Model Quantitatively (with ROUGE Metric)
The ROUGE metric helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.

In [20]:
rouge = evaluate.load('rouge')

##### Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [21]:
#dialogues = dataset['test'][0:10]['dialogue']
#human_baseline_summaries = dataset['test'][0:10]['summary']
dialogues = dataset['test'][0:20]['dialogue']
human_baseline_summaries = dataset['test'][0:20]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    #input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,#Person2# is asking #Person1# to take a dictat...,#Person1# asks Ms. Dawson to take a dictation ...
1,In order to prevent employees from wasting tim...,#Person2#'s memo is to be distributed to all e...,#Person1# asks Ms. Dawson to take a dictation ...
2,Ms. Dawson takes a dictation for #Person1# abo...,#Person2# will take a dictation for Ms. Dawson.,#Person1# asks Ms. Dawson to take a dictation ...
3,#Person2# arrives late because of traffic jam....,#Person1# was stuck in traffic.,#Person1# is stuck in traffic. #Person2# think...
4,#Person2# decides to follow #Person1#'s sugges...,#Person2# is a newcomer.,#Person1# is stuck in traffic. #Person2# think...
5,#Person2# complains to #Person1# about the tra...,#Person1# is a car owner and he is adamant abo...,#Person1# is stuck in traffic. #Person2# think...
6,#Person1# tells Kate that Masha and Hero get d...,"#Person1# is Kate, and #Person2# are discussin...",#Person1# tells #Person2# that Masha and Hero ...
7,#Person1# tells Kate that Masha and Hero are g...,#Person1# is Kate's ex-boyfriend. #Person2#'s ...,#Person1# tells #Person2# that Masha and Hero ...
8,#Person1# and Kate talk about the divorce betw...,Masha and Hero are getting divorced. #Person2#...,#Person1# tells #Person2# that Masha and Hero ...
9,#Person1# and Brian are at the birthday party ...,#Person1# asked Brian to have a party.,#Person1# asks Brian to have a dance with #Per...


In [22]:
#print(type(human_baseline_summaries))
#print(len(human_baseline_summaries))
#print(human_baseline_summaries)

In [23]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.3248444680688397, 'rouge2': 0.10953369133218546, 'rougeL': 0.26114823993129155, 'rougeLsum': 0.2617179300909278}
INSTRUCT MODEL:
{'rouge1': 0.32072643093910524, 'rouge2': 0.1187067123648013, 'rougeL': 0.27282005991978125, 'rougeLsum': 0.2713663905240295}


The file **Generative-AI-with-LLMs/data/dialogue-summary-training-results.csv** contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models. The length of results is 1500. 

In [24]:
#results = pd.read_csv("data/dialogue-summary-training-results.csv")
#results = pd.read_csv("git-projects/Generative-AI-with-LLMs/Week-2/data/dialogue-summary-training-results.csv")
results = pd.read_csv("/scratch/qualis/git-projects/Generative-AI-with-LLMs/data/dialogue-summary-training-results.csv")

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

print(len(instruct_model_summaries))

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)


instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

1500
ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}


In [25]:
# len(results) # 1500

##### The results show substantial improvement in all ROUGE metrics:

In [26]:
print("Absolute percentage improvement of INSTRUCT MODEL over HUMAN BASELINE")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over HUMAN BASELINE
rouge1: 18.82%
rouge2: 10.43%
rougeL: 13.70%
rougeLsum: 13.69%


### Perform Parameter Efficient Fine-Tuning (PEFT)
Now, let's perform Parameter Efficient Fine-Tuning (PEFT) fine-tuning as opposed to "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results as you will see soon.

PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.


#### Setup the PEFT/LoRA model for Fine-Tuning
You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (r) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [27]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

In [28]:
peft_model = get_peft_model(original_model, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%


#### Train PEFT Adapter
Define training arguments and create Trainer instance.

In [29]:
output_dir = "./peft-dialogue-summary-training"

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=2,
    logging_steps=500,
    #max_steps=1    
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)



In [30]:
peft_trainer.train() #It took ~21 minutes with fp32 and 2 epochs

Step,Training Loss
500,0.1585
1000,0.1077
1500,0.1046
2000,0.101
2500,0.0998
3000,0.0992


Checkpoint destination directory ./peft-dialogue-summary-training/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft-dialogue-summary-training/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft-dialogue-summary-training/checkpoint-1500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft-dialogue-summary-training/checkpoint-2000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft-dialogue-summary-training/checkpoint-2500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./peft-dialogue-summary-training/checkpoint-3000 already exists and is non-empty.Saving will proceed but saved results may be in

TrainOutput(global_step=3116, training_loss=0.11147048813081675, metrics={'train_runtime': 447.1361, 'train_samples_per_second': 55.732, 'train_steps_per_second': 6.969, 'total_flos': 1.733507439132672e+16, 'train_loss': 0.11147048813081675, 'epoch': 2.0})

In [31]:
peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

##### Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting is_trainable=False because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set is_trainable=True.

In [32]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
#peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large", torch_dtype=torch.bfloat16)
#tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       './peft-dialogue-summary-checkpoint-local/',  
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

##### The number of trainable parameters will be 0 due to is_trainable=False setting:

In [33]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.00%


#### Evaluate the Model Qualitatively (Human Evaluation)
Make inferences for the same example as in sections 1.3 and 2.3, with the original model, fully fine-tuned and PEFT model.

In [34]:
peft_model = peft_model.to(device)

index = 200
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person2# thinks adding a painting program to the software would be a definite bonus and adding a CD-ROM drive would be a great addition. #Person1# also recommends a CD-ROM drive and #Person2# suggests adding a CD-ROM drive.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# asks #Person2# what they would need to upgrade their system.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person1# recommends a painting program to #Person2#. #Person1# suggests adding a painting program to #Person2#'s software. #Person1# suggests adding

#### Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for the sample of the test dataset (only 10 dialogues and summaries to save time).

In [35]:
dialogues = dataset['test'][0:20]['dialogue']
human_baseline_summaries = dataset['test'][0:20]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to(device)

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,#Person1# tells #Person2# the code of communic...,#Person1# asks Ms. Dawson to take a dictation ...,Ms. Dawson takes a dictation for #Person1# to ...
1,In order to prevent employees from wasting tim...,Ms. Dawson asks #Person1# to take a dictation ...,#Person1# asks Ms. Dawson to take a dictation ...,Ms. Dawson takes a dictation for #Person1# to ...
2,Ms. Dawson takes a dictation for #Person1# abo...,Ms. Dawson tells #Person1# that all office com...,#Person1# asks Ms. Dawson to take a dictation ...,Ms. Dawson takes a dictation for #Person1# to ...
3,#Person2# arrives late because of traffic jam....,#Person2# got stuck in traffic. #Person1# sugg...,#Person1# is stuck in traffic. #Person2# think...,#Person2# got stuck in traffic again. #Person1...
4,#Person2# decides to follow #Person1#'s sugges...,#Person2# is waiting for his car to get stuck ...,#Person1# is stuck in traffic. #Person2# think...,#Person2# got stuck in traffic again. #Person1...
5,#Person2# complains to #Person1# about the tra...,#Person2# got stuck in traffic jams. #Person1#...,#Person1# is stuck in traffic. #Person2# think...,#Person2# got stuck in traffic again. #Person1...
6,#Person1# tells Kate that Masha and Hero get d...,Kate tells #Person1# that Masha and Hero are g...,#Person1# tells #Person2# that Masha and Hero ...,Kate is surprised that Masha and Hero are gett...
7,#Person1# tells Kate that Masha and Hero are g...,Kate and #Person1# are surprised by the divorc...,#Person1# tells #Person2# that Masha and Hero ...,Kate is surprised that Masha and Hero are gett...
8,#Person1# and Kate talk about the divorce betw...,Kate is surprised that Masha and Hero are gett...,#Person1# tells #Person2# that Masha and Hero ...,Kate is surprised that Masha and Hero are gett...
9,#Person1# and Brian are at the birthday party ...,Brian is happy to have a dance with #Person1#....,#Person1# asks Brian to have a dance with #Per...,Brian has a birthday party. He has a dance wit...


In [36]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.38378750294976843, 'rouge2': 0.11323811331662312, 'rougeL': 0.2809936706241831, 'rougeLsum': 0.2822712572313981}
INSTRUCT MODEL:
{'rouge1': 0.32072643093910524, 'rouge2': 0.1187067123648013, 'rougeL': 0.27282005991978125, 'rougeLsum': 0.2713663905240295}
PEFT MODEL:
{'rouge1': 0.36431761473333446, 'rouge2': 0.14071352657747338, 'rougeL': 0.2823463258134642, 'rougeLsum': 0.2836757210138682}


##### Notice, that PEFT model results are not too bad, while the training process was much easier!

##### You already computed ROUGE score on the full dataset, after loading the results from the data/dialogue-summary-training-results.csv file. Load the values for the PEFT model now and check its performance compared to other models.

In [37]:
#results:"Generative-AI-with-LLMs/data/dialogue-summary-training-results.csv")
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.2334158581572823, 'rouge2': 0.07603964187010573, 'rougeL': 0.20145520923859048, 'rougeLsum': 0.20145899339006135}
INSTRUCT MODEL:
{'rouge1': 0.42161291557556113, 'rouge2': 0.18035380596301792, 'rougeL': 0.3384439349963909, 'rougeLsum': 0.33835653595561666}
PEFT MODEL:
{'rouge1': 0.40810631575616746, 'rouge2': 0.1633255794568712, 'rougeL': 0.32507074586565354, 'rougeLsum': 0.3248950182867091}


##### The results show less of an improvement over full fine-tuning, but the benefits of PEFT typically outweigh the slightly-lower performance metrics.

##### Calculate the improvement of PEFT over the original model:

In [38]:
print("Absolute percentage improvement of PEFT MODEL over HUMAN BASELINE")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over HUMAN BASELINE
rouge1: 17.47%
rouge2: 8.73%
rougeL: 12.36%
rougeLsum: 12.34%


##### Now calculate the improvement of PEFT over a full fine-tuned model:

In [39]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -1.35%
rouge2: -1.70%
rougeL: -1.34%
rougeLsum: -1.35%


##### Here you see a small percentage decrease in the ROUGE metrics vs. full fine-tuned. However, the training requires much less computing and memory resources (often just a single GPU).