# Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, we fine-tune an existing LLM from Hugging Face for improved dialogue summarization. We use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve its inferences, we first perform full fine-tuning and evaluate the results with ROUGE metrics. We then perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh its slightly lower performance metrics. Finally, we load a quantized version of the model to reduce memory usage.

*   QLoRA: Quantized Low-Rank Adapters - a method for fine-tuning LLMs that keeps the base model frozen in quantized form and trains only a small number of adapter parameters, which limits the cost of training. These small sets of parameters can also be added efficiently into the model itself, which means you can fine-tune on many datasets and swap these "adapters" into your model when necessary.
*   bitsandbytes: A package by Tim Dettmers et al. that provides a lightweight wrapper around custom CUDA functions that make LLMs run faster - optimizers, matrix multiplications, and quantization. In this notebook we use the library to load our model as efficiently as possible.
*   PEFT: A Hugging Face library that enables a number of Parameter Efficient Fine-Tuning (PEFT) methods, which make it less expensive to fine-tune LLMs - especially on lightweight hardware such as that available in Kaggle notebooks.




# Table of Contents

- [ 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM](#1)
  - [ 1.1 - Set up Kernel and Required Dependencies](#1.1)
  - [ 1.2 - Load Dataset and LLM](#1.2)
  - [ 1.3 - Test the Model with Zero Shot Inferencing](#1.3)
- [ 2 - Perform Full Fine-Tuning](#2)
  - [ 2.1 - Preprocess the Dialog-Summary Dataset](#2.1)
  - [ 2.2 - Fine-Tune the Model with the Preprocessed Dataset](#2.2)
  - [ 2.3 - Evaluate the Model Qualitatively (Human Evaluation)](#2.3)
  - [ 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#2.4)
- [ 3 - Perform Parameter Efficient Fine-Tuning (PEFT)](#3)
  - [ 3.1 - Setup the PEFT/LoRA model for Fine-Tuning](#3.1)
  - [ 3.2 - Train PEFT Adapter](#3.2)
  - [ 3.3 - Evaluate the Model Qualitatively (Human Evaluation)](#3.3)
  - [ 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)](#3.4)
- [4 - Fine-Tune T5 with LoRA and bnb int-4](#4)
  - [4.1 - Set up BitsAndBytesConfig](#4.1)
  - [4.2 - Setup and Train the qLoRA model](#4.2)
  - [4.3 - Evaluate the Model Qualitatively](#4.3)
  - [4.4 - Setup LoftQ initialization](#4.4)



<a name='1'></a>
## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

<a name='1.1'></a>
### 1.1 - Set up Kernel and Required Dependencies

Now install the required packages for the LLM and datasets.



In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install -q evaluate==0.4.0 \
                rouge_score==0.1.2 \
                loralib==0.1.1

!pip install -q torch==2.2.1 \
                torchdata==0.7.1

In [None]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training, replace_lora_weights_loftq


import time
import os
import torch

import evaluate
import pandas as pd
import numpy as np



<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

We are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [None]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that we use the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type in which the model weights are loaded, which halves their memory footprint compared with the default 32-bit floats.

In [None]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)


It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function can be used to do that.

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
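
The parameter count also gives a rough estimate of how much memory the weights alone occupy. The sketch below is plain arithmetic on the count printed above, assuming 4 bytes per parameter for `float32` and 2 bytes for `bfloat16`. It ignores activations, gradients, and optimizer state, and the helper name `weight_memory_mb` is only illustrative.

```python
# Rough memory estimate for the model weights only (no activations,
# gradients, or optimizer state). The bytes-per-parameter values are
# the storage sizes of each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Return the approximate weight size in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


num_params = 247_577_856  # count printed for google/flan-t5-base above

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_mb(num_params, dtype):.0f} MB")
```

Training needs considerably more memory than this estimate, which is part of the motivation for the parameter-efficient approach later in the notebook.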


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. The model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates that the model can be fine-tuned for the task at hand.

In [None]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt').to("cuda")
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-' * 99  # same 99-character separator as '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Dialog-Summary Dataset

We need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the dialogue and append `Summary:` after it as the cue for the response, as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [None]:
def tokenize_function(example):
    # Wrap each dialogue with the instruction and the "Summary:" cue.
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    # Keep the token IDs on the CPU; the Trainer moves each batch to the GPU during training.
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset contains three different splits: train, validation, and test.
# With batched=True, tokenize_function processes every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])
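
To make the prompt construction easier to inspect without a tokenizer, here is a minimal sketch of just the string-building step. It assumes the batch arrives as a dictionary of lists, which is how `map(..., batched=True)` passes examples. The helper `build_prompts` and the two-example `toy_batch` are made up for illustration and are not part of the dataset.

```python
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary: "


def build_prompts(batch: dict) -> list:
    """Wrap every dialogue in the batch with the instruction and the 'Summary:' cue."""
    return [START_PROMPT + dialogue + END_PROMPT for dialogue in batch["dialogue"]]


# A made-up batch in the same column layout as DialogSum (dict of lists).
toy_batch = {
    "dialogue": ["#Person1#: Hi!\n#Person2#: Hello.", "#Person1#: Bye."],
    "summary": ["They greet each other.", "#Person1# says goodbye."],
}

prompts = build_prompts(toy_batch)
print(prompts[0])
```

Printing a prompt like this before tokenizing is a cheap way to confirm that the instruction and the cue end up where you expect.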

Create a smaller subset of the full dataset to fine-tune on to reduce the time it takes:

In [None]:
# small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
# small_eval_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(1000))


tokenized_datasets_subsample = tokenized_datasets.filter(lambda example, index: index % 5 == 0, with_indices=True)

Check the shapes of all three parts of the dataset:

In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets_subsample['train'].shape}")
print(f"Validation: {tokenized_datasets_subsample['validation'].shape}")
print(f"Test: {tokenized_datasets_subsample['test'].shape}")

print(tokenized_datasets_subsample)

Shapes of the datasets:
Training: (2492, 2)
Validation: (100, 2)
Test: (300, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 2492
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 300
    })
})
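
The row counts above follow directly from keeping every fifth example. The short check below reproduces them from the original split sizes; the helper name `every_fifth_count` is only illustrative.

```python
def every_fifth_count(num_rows: int) -> int:
    """Count the indices in range(num_rows) that satisfy index % 5 == 0."""
    return sum(1 for index in range(num_rows) if index % 5 == 0)


# Split sizes reported when the dataset was loaded.
split_sizes = {"train": 12460, "validation": 500, "test": 1500}

subsampled = {name: every_fifth_count(n) for name, n in split_sizes.items()}
print(subsampled)
```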


In [None]:

#device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
#model.to(device)

The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now use the Hugging Face `Seq2SeqTrainer` class, a subclass of the built-in `Trainer` (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass it a fresh copy of the original model together with the preprocessed dataset.

In [None]:

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)


output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Define training args
batch_size = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets_subsample["train"]) // batch_size

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    weight_decay=0.01,
   # logging & evaluation strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=logging_steps,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    # report_to="tensorboard",
    # push_to_hub=False,
    # hub_strategy="every_save",
    # hub_model_id=repository_id,
    # hub_token=HfFolder.get_token(),
)

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)


# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets_subsample['train'],
    eval_dataset=tokenized_datasets_subsample['validation'],
    tokenizer=tokenizer,
    # compute_metrics=compute_metrics,
)





Start training process...



In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.105,0.091872


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=623, training_loss=0.2886780666884029, metrics={'train_runtime': 912.7233, 'train_samples_per_second': 2.73, 'train_steps_per_second': 0.683, 'total_flos': 1711893381120000.0, 'train_loss': 0.2886780666884029, 'epoch': 1.0})
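
The reported `global_step=623` is worth a quick sanity check against the configuration. The arithmetic below uses only numbers from the cells above. With a batch size of 8, one epoch over 2,492 examples would take 312 steps, so 623 steps implies an effective batch size of 4. One possible explanation, which the log does not confirm, is that `auto_find_batch_size=True` lowered the batch size after an out-of-memory retry.

```python
import math

num_examples = 2492          # rows in the subsampled training split
configured_batch_size = 8    # per_device_train_batch_size
reported_steps = 623         # global_step from the TrainOutput above

# Value passed as logging_steps in the training arguments.
logging_steps = num_examples // configured_batch_size

# Steps one epoch would take if the configured batch size were used.
expected_steps = math.ceil(num_examples / configured_batch_size)

# Batch size implied by the number of steps that actually ran.
implied_batch_size = num_examples / reported_steps

print(f"logging_steps:       {logging_steps}")
print(f"expected steps:      {expected_steps}")
print(f"implied batch size:  {implied_batch_size}")
```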



In [None]:
trainer.evaluate()

Save the fine-tuned model locally for later use.

In [None]:
instruct_model_path="./flan-dialogue-summary-checkpoint-local"

trainer.model.save_pretrained(instruct_model_path)
tokenizer.save_pretrained(instruct_model_path)


('./flan-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './flan-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './flan-dialogue-summary-checkpoint-local/spiece.model',
 './flan-dialogue-summary-checkpoint-local/added_tokens.json',
 './flan-dialogue-summary-checkpoint-local/tokenizer.json')

In [None]:
!ls -al ./flan-dialogue-summary-checkpoint-local/model.safetensors

-rw-r--r-- 1 root root 495189552 May  8 15:36 ./flan-dialogue-summary-checkpoint-local/model.safetensors


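The saved checkpoint is approximately 472 MiB (about 495 MB), or roughly 2 bytes for each of the model's 247,577,856 parameters, because the weights were loaded in `bfloat16`.

Fully fine-tuning the model on the complete dataset would take a few hours on a GPU. To save time, you can download a checkpoint of a fully fine-tuned model instead. For now, let's use the model trained above as the instruct model in this lab.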

Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [None]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan-dialogue-summary-checkpoint-local",
                                                       device_map="auto",
                                                       torch_dtype=torch.bfloat16)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of upgrading my computer.one#: I'm not sure what exactly I would need. #Person1#: I'd probably need a painting program. #Person2#: I'd probably need a faster processor. #Person1#: I'd probably need a faster hard disc. #Person2#: I'd probably need a CD-ROM drive.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# w

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
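
To give a feel for what ROUGE measures, here is a simplified sketch of ROUGE-1 F1, computed from unigram overlap between a prediction and a reference. It assumes lowercase whitespace tokenization and no stemming, so its scores will not match the `evaluate` implementation used below exactly. The function name `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat on the mat", "the cat sat on the mat")
print(f"ROUGE-1 F1: {score:.3f}")
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L scores the longest common subsequence instead of fixed-size n-grams.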

In [None]:
rouge = evaluate.load('rouge')

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | instruct_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: I need to take a dictation for you.... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | #Person1#: I need to take a dictation for you.... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1#: I need to take a dictation for you.... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | The traffic jam at the Carrefour intersection ... | #Person1# got stuck in traffic again. #Person2... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The traffic jam at the Carrefour intersection ... | #Person1# got stuck in traffic again. #Person2... |
| 5 | #Person2# complains to #Person1# about the tra... | The traffic jam at the Carrefour intersection ... | #Person1# got stuck in traffic again. #Person2... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced.............. | Kate and Masha are getting divorced. Kate tell... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced.............. | Kate and Masha are getting divorced. Kate tell... |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced.............. | Kate and Masha are getting divorced. Kate tell... |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Happy birthday, Brian. #Person2#: I... | Brian invites #Person1# to a party. #Person2# ... |


Evaluate the models by computing ROUGE metrics. Notice the improvement in the results!

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.18606273374529952, 'rouge2': 0.08843095860344613, 'rougeL': 0.16575461240757294, 'rougeLsum': 0.16513940250782355}
INSTRUCT MODEL:
{'rouge1': 0.38809702728501955, 'rouge2': 0.11075361749496665, 'rougeL': 0.26789102145909593, 'rougeLsum': 0.26853477247420615}


The results show an improvement in all ROUGE metrics, with the largest gain in ROUGE-1. The differences below are absolute, in percentage points:

In [None]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 20.20%
rouge2: 2.23%
rougeL: 10.21%
rougeLsum: 10.34%
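
Because the figures above are absolute differences, it can help to look at the relative change as well. The sketch below recomputes both from the scores printed earlier. The dictionary names are illustrative, and the relative figures are simple arithmetic on those scores rather than an additional evaluation.

```python
# ROUGE scores copied from the evaluation output above.
original = {
    "rouge1": 0.18606273374529952,
    "rouge2": 0.08843095860344613,
    "rougeL": 0.16575461240757294,
    "rougeLsum": 0.16513940250782355,
}
instruct = {
    "rouge1": 0.38809702728501955,
    "rouge2": 0.11075361749496665,
    "rougeL": 0.26789102145909593,
    "rougeLsum": 0.26853477247420615,
}

# Absolute difference in percentage points, and change relative to the original score.
absolute_points = {k: (instruct[k] - original[k]) * 100 for k in original}
relative_percent = {k: (instruct[k] / original[k] - 1) * 100 for k in original}

for key in original:
    print(f"{key}: +{absolute_points[key]:.2f} points, +{relative_percent[key]:.1f}% relative")
```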


<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** instead of the full fine-tuning performed above. PEFT updates only a small set of parameters, which makes it much more efficient than full fine-tuning.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
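
The recombination step can be illustrated with a toy example. In LoRA, the adapter consists of two small matrices, `A` and `B`, and the effective weight is `W + (alpha / r) * B @ A`. The sketch below uses plain Python lists and made-up 3x3 values to show that merging leaves the frozen weight `W` untouched. The helper names `matmul` and `merge_lora` are illustrative and are not part of the PEFT API.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A as a new matrix, leaving W unchanged."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: a frozen 3x3 weight and a rank-1 adapter (values are made up).
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_a = [[1.0, 2.0, 3.0]]        # rank x in_features
lora_b = [[0.5], [0.0], [-0.5]]   # out_features x rank

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

Because `weight` is never modified, several adapters can be applied to the same base matrix one after another, which is what makes sharing one base LLM across tasks possible.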

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

We need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [None]:
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)


# Add the LoRA adapter
peft_model = get_peft_model(base_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 1769472
all model parameters: 249347328
percentage of trainable model parameters: 0.71%
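
The trainable count can be reproduced by hand. The sketch below assumes the published `google/flan-t5-base` architecture (hidden size 768, 12 heads of dimension 64, 12 encoder and 12 decoder layers); these values are not shown in this notebook. Each targeted `q` or `v` projection gets an `A` matrix of shape `r x 768` and a `B` matrix of shape `768 x r`.

```python
# Architecture values assumed from the published flan-t5-base configuration.
d_model = 768
inner_dim = 12 * 64              # number of heads * per-head dimension
num_encoder_layers = 12
num_decoder_layers = 12

rank = 16                        # r in the LoraConfig above
targets_per_attention = 2        # the "q" and "v" projections

# Encoder layers have self-attention; decoder layers have self- and cross-attention.
attention_blocks = num_encoder_layers + 2 * num_decoder_layers

# Each adapted projection adds A (rank x d_model) and B (inner_dim x rank).
params_per_adapter = rank * d_model + inner_dim * rank

trainable = attention_blocks * targets_per_attention * params_per_adapter
total = 247_577_856 + trainable  # frozen base parameters plus adapter parameters

print(f"trainable: {trainable}")
print(f"total:     {total}")
print(f"share:     {100 * trainable / total:.2f}%")
```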


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [None]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'


# Define training args
batch_size = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets_subsample["train"]) // batch_size

peft_training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    auto_find_batch_size=True,
    learning_rate=1e-3, # Same value as in the full fine-tuning run above.
    num_train_epochs=1,
    weight_decay=0.01,
   # logging & evaluation strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=logging_steps,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    # report_to="tensorboard",
    # push_to_hub=False,
    # hub_strategy="every_save",
    # hub_model_id=repository_id,
    # hub_token=HfFolder.get_token(),
)



# Create Trainer instance
peft_trainer = Seq2SeqTrainer(
    model=peft_model,
    args=peft_training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets_subsample['train'],
    eval_dataset=tokenized_datasets_subsample['validation'],
    # compute_metrics=compute_metrics,
)



Note that training is performed on a subset of the data.
Everything is now ready to train the PEFT adapter and save the model.



In [None]:
peft_trainer.train()

Epoch,Training Loss,Validation Loss
1,0.1453,0.104718




TrainOutput(global_step=623, training_loss=0.9907814381019262, metrics={'train_runtime': 746.3779, 'train_samples_per_second': 3.339, 'train_steps_per_second': 0.835, 'total_flos': 1719961380716544.0, 'train_loss': 0.9907814381019262, 'epoch': 1.0})



In [None]:
peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)




('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

Check that the size of this model is much less than the original LLM:

In [None]:
!ls -al ./peft-dialogue-summary-checkpoint-local/adapter_model.safetensors

-rw-r--r-- 1 root root 3559176 May  8 15:53 ./peft-dialogue-summary-checkpoint-local/adapter_model.safetensors
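
To put the two file sizes side by side, the snippet below compares the adapter file with the fully fine-tuned checkpoint saved earlier, using the byte counts from the two `ls` listings in this notebook.

```python
# Byte counts copied from the `ls -al` outputs in this notebook.
full_checkpoint_bytes = 495_189_552   # model.safetensors (full fine-tuning)
adapter_bytes = 3_559_176             # adapter_model.safetensors (LoRA)

ratio = adapter_bytes / full_checkpoint_bytes

print(f"full checkpoint: {full_checkpoint_bytes / 2**20:.1f} MiB")
print(f"LoRA adapter:    {adapter_bytes / 2**20:.1f} MiB")
print(f"adapter is {ratio:.2%} of the full checkpoint")
```

The ratio is close to the share of trainable parameters reported above, since only the adapter weights are written to this file.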


Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       './peft-dialogue-summary-checkpoint-local/',
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=False)

The number of trainable parameters is `0` because of the `is_trainable=False` setting:

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 249347328
percentage of trainable model parameters: 0.00%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Run inference on the same example as in sections [1.3](#1.3) and [2.3](#2.3) with the original model, the fully fine-tuned model, and the PEFT model.

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: \n{peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of upgrading my computer.one#: I'm not sure what exactly I would need. #Person1#: I'd probably need a painting program. #Person2#: I'd probably need a faster processor. #Person1#: I'd probably need a faster hard disc. #Person2#: I'd probably need a CD-ROM drive.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# w

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for a sample of the test dataset (only 10 dialogues and summaries to save time).

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,#Person1#: I need to take a dictation for you....,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks #Person2# to take a dictation f...
1,In order to prevent employees from wasting tim...,#Person1#: I need to take a dictation for you....,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks #Person2# to take a dictation f...
2,Ms. Dawson takes a dictation for #Person1# abo...,#Person1#: I need to take a dictation for you....,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks #Person2# to take a dictation f...
3,#Person2# arrives late because of traffic jam....,The traffic jam at the Carrefour intersection ...,#Person1# got stuck in traffic again. #Person2...,#Person2# got stuck in traffic again. #Person2...
4,#Person2# decides to follow #Person1#'s sugges...,The traffic jam at the Carrefour intersection ...,#Person1# got stuck in traffic again. #Person2...,#Person2# got stuck in traffic again. #Person2...
5,#Person2# complains to #Person1# about the tra...,The traffic jam at the Carrefour intersection ...,#Person1# got stuck in traffic again. #Person2...,#Person2# got stuck in traffic again. #Person2...
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are getting divorced..............,Kate and Masha are getting divorced. Kate tell...,#Person1# tells #Person2# that Masha and Hero ...
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are getting divorced..............,Kate and Masha are getting divorced. Kate tell...,#Person1# tells #Person2# that Masha and Hero ...
8,#Person1# and Kate talk about the divorce betw...,Masha and Hero are getting divorced..............,Kate and Masha are getting divorced. Kate tell...,#Person1# tells #Person2# that Masha and Hero ...
9,#Person1# and Brian are at the birthday party ...,"#Person1#: Happy birthday, Brian. #Person2#: I...",Brian invites #Person1# to a party. #Person2# ...,Brian's birthday is coming soon. Brian's party...


In [None]:
# rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.18606273374529952, 'rouge2': 0.08843095860344613, 'rougeL': 0.16575461240757294, 'rougeLsum': 0.16513940250782355}
INSTRUCT MODEL:
{'rouge1': 0.38809702728501955, 'rouge2': 0.11075361749496665, 'rougeL': 0.26789102145909593, 'rougeLsum': 0.26853477247420615}
PEFT MODEL:
{'rouge1': 0.37942512132582745, 'rouge2': 0.14971354480409457, 'rougeL': 0.29900013790040364, 'rougeLsum': 0.30198484859385333}


Notice that the PEFT model scores close to the fully fine-tuned model (slightly lower on ROUGE-1, higher on the other metrics in this 10-example sample), while the training process was much lighter.

PEFT does not always match full fine-tuning, but its benefits typically outweigh slightly lower performance metrics.
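To make the ROUGE-1 numbers above easier to read, here is a minimal sketch of what the metric measures: the F1 score of unigram overlap between a prediction and a reference. It is a simplification under stated assumptions: whitespace tokenization and no stemming, so it will not reproduce the exact values returned by `evaluate.load('rouge')`. The function name `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the traffic jam made him late"
good = "he was late because of the traffic jam"
bad = "she wants to upgrade her computer"

print(f"related summary:   {rouge1_f1(good, reference):.2f}")
print(f"unrelated summary: {rouge1_f1(bad, reference):.2f}")
```

A summary that shares no words with the reference scores 0, which is why a model that produces the same summary for every dialogue ends up with very low ROUGE.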

Calculate the improvement of PEFT over the original model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 19.34%
rouge2: 6.13%
rougeL: 13.32%
rougeLsum: 13.68%
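The figures printed above are differences in percentage points, not relative improvements. The sketch below shows both views for two of the metrics, using the ROUGE scores copied from the earlier output; the helper names `point_gain` and `relative_gain` are illustrative and not part of the notebook.

```python
# ROUGE scores copied from the outputs above (two metrics shown for brevity).
original = {"rouge1": 0.18606273374529952, "rouge2": 0.08843095860344613}
peft = {"rouge1": 0.37942512132582745, "rouge2": 0.14971354480409457}


def point_gain(new: float, old: float) -> float:
    """Difference in percentage points (what the notebook prints)."""
    return (new - old) * 100


def relative_gain(new: float, old: float) -> float:
    """Improvement relative to the old score, in percent."""
    return (new - old) / old * 100


for key in original:
    points = point_gain(peft[key], original[key])
    relative = relative_gain(peft[key], original[key])
    print(f"{key}: {points:+.2f} points, {relative:+.1f}% relative")
```

Reporting both avoids ambiguity: a gain of about 19 points on ROUGE-1 corresponds to roughly doubling the original score.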


Now calculate the improvement of PEFT over the fully fine-tuned model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: -0.87%
rouge2: 3.90%
rougeL: 3.11%
rougeLsum: 3.35%


Here you see a small decrease in ROUGE-1 compared with the fully fine-tuned model, while the other metrics are higher. With only 10 test examples, treat differences of this size with caution.
PEFT approaches have been shown to perform better than full fine-tuning in low-data regimes.
Additionally, PEFT training requires much less compute and memory (often just a single GPU).


<a name='4'></a>
## 4 - Fine-Tune T5 with LoRA and bnb int-4


Quantization represents data with fewer bits, which makes it a useful technique for reducing memory usage and accelerating inference, especially for large language models (LLMs).

However, after a model is quantized it isn’t typically trained further for downstream tasks, because training can be unstable due to the lower precision of the weights and activations. But since PEFT methods only add extra trainable parameters and leave the base weights frozen, you can train a PEFT adapter on top of a quantized model. Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU. For example, QLoRA is a method that quantizes a model to 4 bits and then trains it with LoRA.
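As a rough illustration of what "representing data with fewer bits" means, the sketch below quantizes a small block of weights to signed 4-bit integers with a single scale per block and then restores them. This is plain uniform absmax quantization, chosen for simplicity; it is not the NF4 scheme used later in this notebook, and the function names and weight values are made up.

```python
def quantize_absmax_int4(block):
    """Map floats to integer codes in [-7, 7] using one scale for the whole block."""
    absmax = max(abs(x) for x in block)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
codes, scale = quantize_absmax_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(r, 3) for r in restored])
print(f"max error: {max_error:.4f} (bounded by scale / 2 = {scale / 2:.4f})")
```

Each weight is now stored as a 4-bit code plus a shared scale, and the rounding error is what makes further training of the quantized weights themselves unstable.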

We'll be using Bits and Bytes (`bitsandbytes`) to load the model in 4-bit format, which should reduce memory consumption considerably at the cost of some accuracy.

Note the parameters in `BitsAndBytesConfig`: this is a fairly standard 4-bit quantization configuration. It loads the weights in 4-bit format using the NormalFloat4 (`nf4`) data type, with double quantization, which also quantizes the quantization constants to save additional memory. The weights are dequantized to bfloat16 for computation in the forward and backward passes, while the frozen base weights stay stored in 4-bit.
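A back-of-the-envelope estimate helps show where the savings come from. The sketch below is idealized: it assumes every parameter is quantized (in practice some modules stay in higher precision), a block size of 64 weights per scale, and groups of 256 scales for double quantization. Those two sizes are assumptions based on the defaults described in the QLoRA paper. The parameter count is the one reported earlier for FLAN-T5-base with its adapter.

```python
n_params = 249_347_328  # count reported earlier for FLAN-T5-base plus adapter
block_size = 64         # assumed number of weights sharing one scale
dq_group = 256          # assumed number of scales sharing one second-level constant


def mib(bits: float) -> float:
    """Convert a number of bits to mebibytes."""
    return bits / 8 / 2**20


# Overhead per weight for the scales, in bits.
plain_overhead = 32 / block_size                                # one fp32 scale per block
dq_overhead = 8 / block_size + 32 / (block_size * dq_group)     # 8-bit scales + fp32 constants

bf16_bits = 16 * n_params
nf4_bits = (4 + plain_overhead) * n_params
nf4_dq_bits = (4 + dq_overhead) * n_params

print(f"bf16:               {mib(bf16_bits):7.1f} MiB")
print(f"4-bit:              {mib(nf4_bits):7.1f} MiB ({plain_overhead:.3f} extra bits per weight)")
print(f"4-bit + double qnt: {mib(nf4_dq_bits):7.1f} MiB ({dq_overhead:.3f} extra bits per weight)")
```

Under these assumptions the weights shrink to roughly a quarter of their bfloat16 size, and double quantization trims the per-weight overhead of the scales further.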


<a name='4.1'></a>
### 4.1 - Set up BitsAndBytesConfig

bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the `BitsAndBytesConfig` class.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig


config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

qmodel = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                               quantization_config=config,
                                               device_map="auto",
                                               use_cache=False,  # We will be using gradient checkpointing
                                               trust_remote_code=True,
                                               )


Call the `prepare_model_for_kbit_training` function to prepare the quantized model for training.

In [None]:
from peft import prepare_model_for_kbit_training

qmodel = prepare_model_for_kbit_training(qmodel)

<a name='4.2'></a>
### 4.2 - Set up and Train the QLoRA Model

Now that the quantized model is ready, let’s set up the LoRA configuration.


In [None]:
from peft import LoraConfig, get_peft_model, TaskType


qlora_config = LoraConfig(
    r=16, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)
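To show what `r` and `lora_alpha` control, here is a minimal pure-Python sketch of a LoRA-adapted linear layer: the frozen weight `W` is left untouched and a low-rank update `B A`, scaled by `lora_alpha / r`, is added to its output. The toy matrices, the helper names, and the "trained" values of `B` are hypothetical; real adapters are tensors managed by the PEFT library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Return W x + (lora_alpha / r) * B (A x) for a single input vector x."""
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank path: d_in -> r -> d_out
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d_in = d_out = 3 and rank 1. The scale is 2.0, the same value
# that r=16 and lora_alpha=32 give in the configuration above.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 1.0]]                  # shape (r, d_in)
B_init = [[0.0], [0.0], [0.0]]          # shape (d_out, r), zero at initialization
B_trained = [[0.1], [0.0], [-0.2]]      # hypothetical values after training
x = [1.0, 2.0, 3.0]

print("at initialization:", lora_forward(W, A, B_init, x, r=1, lora_alpha=2))
print("after training:   ", lora_forward(W, A, B_trained, x, r=1, lora_alpha=2))
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen base layer; only `A` and `B` receive gradient updates.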


Add LoRA adapter layers/parameters to the quantized LLM to be trained.

In [None]:
qlora_model = get_peft_model(qmodel, qlora_config)

print(print_number_of_trainable_model_parameters(qlora_model))

trainable model parameters: 1769472
all model parameters: 169131264
percentage of trainable model parameters: 1.05%
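The trainable-parameter count printed above can be reproduced by hand. The sketch assumes the usual FLAN-T5-base shape: hidden size 768, 12 encoder and 12 decoder layers, and square 768 x 768 `q` and `v` projections. These architecture figures are assumptions not stated in this notebook; the rank and the total parameter count come from the cells above.

```python
d_model = 768          # assumed hidden size of FLAN-T5-base
rank = 16              # r in qlora_config
encoder_layers = 12    # assumed
decoder_layers = 12    # assumed

# Encoder blocks have self-attention; decoder blocks have self- and cross-attention.
attention_blocks = encoder_layers + 2 * decoder_layers
targets_per_block = 2  # the "q" and "v" projections

# Each adapted projection gains A (rank x d_model) and B (d_model x rank).
params_per_target = rank * d_model + d_model * rank
trainable = attention_blocks * targets_per_block * params_per_target

all_params = 169_131_264  # "all model parameters" reported above
print(f"trainable LoRA parameters: {trainable}")
print(f"share of all parameters:   {100 * trainable / all_params:.2f}%")
```

Doubling `r` doubles the number of trainable parameters, so the rank is the main knob for trading adapter capacity against memory.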


In [None]:
output_dir = f'./qlora-dialogue-summary-training-{str(int(time.time()))}'


# Define training args
batch_size = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets_subsample["train"]) // batch_size

qlora_training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    weight_decay=0.01,
    # logging & evaluation strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=logging_steps,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    # report_to="tensorboard",
    # push_to_hub=False,
    # hub_strategy="every_save",
    # hub_model_id=repository_id,
    # hub_token=HfFolder.get_token(),
)


# Create Trainer instance
qlora_trainer = Seq2SeqTrainer(
    model=qlora_model,
    args=qlora_training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets_subsample['train'],
    eval_dataset=tokenized_datasets_subsample['validation'],
    # compute_metrics=compute_metrics,
)




In [None]:
qlora_trainer.train()



Epoch,Training Loss,Validation Loss
1,2.3634,0.122565




TrainOutput(global_step=312, training_loss=2.3562517058199797, metrics={'train_runtime': 865.7529, 'train_samples_per_second': 2.878, 'train_steps_per_second': 0.36, 'total_flos': 1719961380716544.0, 'train_loss': 2.3562517058199797, 'epoch': 1.0})
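The `global_step=312` in the output is the number of optimizer steps in one epoch. The sketch below relates it to the batch size and to the `logging_steps` value computed earlier. The training-set size `n_train` is hypothetical, picked only to be consistent with 312 steps at batch size 8, and the sketch assumes `auto_find_batch_size` did not reduce the batch size.

```python
import math

batch_size = 8
n_train = 2492  # hypothetical training-set size, consistent with global_step=312

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be smaller
logging_steps = n_train // batch_size              # same formula as in the training arguments

print(f"steps per epoch: {steps_per_epoch}")
print(f"logging_steps:   {logging_steps}")
```

With integer division, the loss is logged once, just before the end of the epoch, which is why the training table shows a single row.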

In [None]:
qlora_model_path="./qlora-dialogue-summary-checkpoint-local"

qlora_trainer.model.save_pretrained(qlora_model_path)
tokenizer.save_pretrained(qlora_model_path)



('./qlora-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './qlora-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './qlora-dialogue-summary-checkpoint-local/spiece.model',
 './qlora-dialogue-summary-checkpoint-local/added_tokens.json',
 './qlora-dialogue-summary-checkpoint-local/tokenizer.json')

Load the fine-tuned LoRA adapter on top of the quantized base model:

In [None]:
from peft import PeftModel, PeftConfig

quant_peft_model = PeftModel.from_pretrained(qmodel,
                                             './qlora-dialogue-summary-checkpoint-local/',
                                             torch_dtype=torch.bfloat16,
                                             is_trainable=False)


<a name='4.3'></a>
### 4.3 - Evaluate the Model Qualitatively



In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

quant_peft_model_outputs = quant_peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
quant_peft_model_text_output = tokenizer.decode(quant_peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: \n{peft_model_text_output}')
print(dash_line)
print(f'QUANTIZED PEFT MODEL: \n{quant_peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of upgrading my computer.one#: I'm not sure what exactly I would need. #Person1#: I'd probably need a painting program. #Person2#: I'd probably need a faster processor. #Person1#: I'd probably need a faster hard disc. #Person2#: I'd probably need a CD-ROM drive.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.one#: I'd probably need a CD-ROM drive too.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
#Person1# w

In [None]:
# dialogues = dataset['test'][0:10]['dialogue']
# human_baseline_summaries = dataset['test'][0:10]['summary']

# original_model_summaries = []
# instruct_model_summaries = []
# peft_model_summaries = []
quant_peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

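    # Tokenize the current dialogue; without this line the loop reuses the stale input_ids from the previous cell
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")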

    # human_baseline_text_output = human_baseline_summaries[idx]

    # original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    # instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    # peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    quant_peft_model_outputs = quant_peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    quant_peft_model_text_output = tokenizer.decode(quant_peft_model_outputs[0], skip_special_tokens=True)

    # original_model_summaries.append(original_model_text_output)
    # instruct_model_summaries.append(instruct_model_text_output)
    # peft_model_summaries.append(peft_model_text_output)
    quant_peft_model_summaries.append(quant_peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries, quant_peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries', 'quant_peft_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries,quant_peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,#Person1#: I need to take a dictation for you....,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks #Person2# to take a dictation f...,#Person2# wants to upgrade his system. #Person...
1,In order to prevent employees from wasting tim...,#Person1#: I need to take a dictation for you....,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks #Person2# to take a dictation f...,#Person2# wants to upgrade his system. #Person...
2,Ms. Dawson takes a dictation for #Person1# abo...,#Person1#: I need to take a dictation for you....,#Person1# asks Ms. Dawson to take a dictation ...,#Person1# asks #Person2# to take a dictation f...,#Person2# wants to upgrade his system. #Person...
3,#Person2# arrives late because of traffic jam....,The traffic jam at the Carrefour intersection ...,#Person1# got stuck in traffic again. #Person2...,#Person2# got stuck in traffic again. #Person2...,#Person2# wants to upgrade his system. #Person...
4,#Person2# decides to follow #Person1#'s sugges...,The traffic jam at the Carrefour intersection ...,#Person1# got stuck in traffic again. #Person2...,#Person2# got stuck in traffic again. #Person2...,#Person2# wants to upgrade his system. #Person...
5,#Person2# complains to #Person1# about the tra...,The traffic jam at the Carrefour intersection ...,#Person1# got stuck in traffic again. #Person2...,#Person2# got stuck in traffic again. #Person2...,#Person2# wants to upgrade his system. #Person...
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are getting divorced..............,Kate and Masha are getting divorced. Kate tell...,#Person1# tells #Person2# that Masha and Hero ...,#Person2# wants to upgrade his system. #Person...
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are getting divorced..............,Kate and Masha are getting divorced. Kate tell...,#Person1# tells #Person2# that Masha and Hero ...,#Person2# wants to upgrade his system. #Person...
8,#Person1# and Kate talk about the divorce betw...,Masha and Hero are getting divorced..............,Kate and Masha are getting divorced. Kate tell...,#Person1# tells #Person2# that Masha and Hero ...,#Person2# wants to upgrade his system. #Person...
9,#Person1# and Brian are at the birthday party ...,"#Person1#: Happy birthday, Brian. #Person2#: I...",Brian invites #Person1# to a party. #Person2# ...,Brian's birthday is coming soon. Brian's party...,#Person2# wants to upgrade his system. #Person...


In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

quant_peft_model_results = rouge.compute(
    predictions=quant_peft_model_summaries,
    references=human_baseline_summaries[0:len(quant_peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)
print('QUANTIZED PEFT MODEL:')
print(quant_peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.18606273374529952, 'rouge2': 0.08843095860344613, 'rougeL': 0.16575461240757294, 'rougeLsum': 0.16513940250782355}
INSTRUCT MODEL:
{'rouge1': 0.38809702728501955, 'rouge2': 0.11075361749496665, 'rougeL': 0.26789102145909593, 'rougeLsum': 0.26853477247420615}
PEFT MODEL:
{'rouge1': 0.37942512132582745, 'rouge2': 0.14971354480409457, 'rougeL': 0.29900013790040364, 'rougeLsum': 0.30198484859385333}
QUANTIZED PEFT MODEL:
{'rouge1': 0.10034886015902476, 'rouge2': 0.0, 'rougeL': 0.09345729763848232, 'rougeLsum': 0.09208129269104878}


In [None]:



print("Absolute percentage improvement of QUANTIZED MODEL over PEFT MODEL")

improvement = (np.array(list(quant_peft_model_results.values())) - np.array(list(peft_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of QUANTIZED MODEL over PEFT MODEL
rouge1: -27.91%
rouge2: -14.97%
rougeL: -20.55%
rougeLsum: -20.99%


<a name='4.4'></a>
### 4.4 - Set up LoftQ Initialization

LoftQ initializes LoRA weights such that the quantization error is minimized, and it can improve performance when training quantized models.
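The idea can be sketched in a few lines: instead of starting the adapter at zero, choose the low-rank factors so that the quantized weight plus the adapter is as close as possible to the original weight. The example below is a toy version under loud assumptions: a single step, rank 1, uniform absmax quantization as a stand-in for NF4, and power iteration in place of an SVD. All function names and matrix values are made up and do not reflect the PEFT implementation.

```python
def quantize(matrix, levels=7):
    """Uniform absmax quantization of a whole matrix (stand-in for NF4)."""
    absmax = max(abs(x) for row in matrix for x in row)
    scale = absmax / levels if absmax > 0 else 1.0
    return [[round(x / scale) * scale for x in row] for row in matrix]


def subtract(P, Q):
    """Element-wise difference of two matrices."""
    return [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def frobenius(matrix):
    """Frobenius norm of a matrix."""
    return sum(x * x for row in matrix for x in row) ** 0.5


def rank1_approximation(R, iterations=50):
    """Power iteration; returns (b, a) such that b a^T approximates R."""
    n_rows, n_cols = len(R), len(R[0])
    v = [1.0] * n_cols
    u = [0.0] * n_rows
    for _ in range(iterations):
        u = [sum(R[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm_u = sum(x * x for x in u) ** 0.5
        if norm_u == 0:
            return [0.0] * n_rows, [0.0] * n_cols
        u = [x / norm_u for x in u]
        # v = R^T u carries the scale of the approximation.
        v = [sum(R[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
    return u, v


W = [[0.31, -0.12, 0.45], [-0.27, 0.08, 0.19], [0.05, -0.41, 0.22]]
Q = quantize(W)
residual = subtract(W, Q)                 # what quantization lost
b, a = rank1_approximation(residual)      # adapter factors fitted to the residual
low_rank = [[bi * aj for aj in a] for bi in b]

error_zero_init = frobenius(residual)                     # adapter starts at zero
error_fitted = frobenius(subtract(residual, low_rank))    # adapter fitted to W - Q

print(f"error with zero-initialized adapter: {error_zero_init:.4f}")
print(f"error with fitted rank-1 adapter:    {error_fitted:.4f}")
```

The fitted adapter never increases the reconstruction error, so training starts from a model that is closer to the unquantized one.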

In [None]:
from peft import replace_lora_weights_loftq

# replace_lora_weights_loftq updates the LoRA weights of qlora_model in place
replace_lora_weights_loftq(qlora_model)


In [None]:
output_dir = f'./loftq-dialogue-summary-training-{str(int(time.time()))}'


# Define training args
loftq_initi_training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=False, # Overflows with fp16
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    weight_decay=0.01,
    # logging & evaluation strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    # metric_for_best_model="overall_f1",
    # push to hub parameters
    # report_to="tensorboard",
    # push_to_hub=False,
    # hub_strategy="every_save",
    # hub_model_id=repository_id,
    # hub_token=HfFolder.get_token(),
)




# Create Trainer instance
loftq_initi_trainer = Seq2SeqTrainer(
    model=qlora_model,
    args=loftq_initi_training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets_subsample['train'],
    eval_dataset=tokenized_datasets_subsample['validation'],
    # compute_metrics=compute_metrics,
)




In [None]:
loftq_initi_trainer.train()



Epoch,Training Loss,Validation Loss


In [None]:
loftq_initi_trainer.evaluate()

NameError: name 'loftq_initi_trainer' is not defined

In [None]:
loftq_initi_model_path="./loftq-dialogue-summary-checkpoint-local"

loftq_initi_trainer.model.save_pretrained(loftq_initi_model_path)
tokenizer.save_pretrained(loftq_initi_model_path)

NameError: name 'loftq_initi_trainer' is not defined

In [None]:
loftq_initi_peft_model = PeftModel.from_pretrained(qmodel,
                                                 './loftq-dialogue-summary-checkpoint-local/',
                                                  device_map="auto",
                                                  torch_dtype=torch.bfloat16,
                                                  is_trainable=False)

In [None]:
# dialogues = dataset['test'][0:10]['dialogue']
# human_baseline_summaries = dataset['test'][0:10]['summary']

# original_model_summaries = []
# instruct_model_summaries = []
# peft_model_summaries = []
# quant_peft_model_summaries = []
loftq_initi_peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    # Tokenize the current dialogue so that input_ids matches the prompt built above
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    # human_baseline_text_output = human_baseline_summaries[idx]

    # original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    # instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    # peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    # quant_peft_model_outputs = quant_peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # quant_peft_model_text_output = tokenizer.decode(quant_peft_model_outputs[0], skip_special_tokens=True)

    loftq_initi_peft_model_outputs = loftq_initi_peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    loftq_initi_peft_model_text_output = tokenizer.decode(loftq_initi_peft_model_outputs[0], skip_special_tokens=True)

    # original_model_summaries.append(original_model_text_output)
    # instruct_model_summaries.append(instruct_model_text_output)
    # peft_model_summaries.append(peft_model_text_output)
    # quant_peft_model_summaries.append(quant_peft_model_text_output)
    loftq_initi_peft_model_summaries.append(loftq_initi_peft_model_text_output)




In [None]:
# rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

quant_peft_model_results = rouge.compute(
    predictions=quant_peft_model_summaries,
    references=human_baseline_summaries[0:len(quant_peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

loftq_initi_peft_model_results = rouge.compute(
    predictions=loftq_initi_peft_model_summaries,
    references=human_baseline_summaries[0:len(loftq_initi_peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)
print('QUANTIZED PEFT MODEL:')
print(quant_peft_model_results)
print('LOFTQ PEFT MODEL:')
print(loftq_initi_peft_model_results)