<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/Prompt_engineering/Prompt_tuning_flan_t5_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, which provides a high quality instruction tuned model and can summarize text out of the box. To improve the inferences, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics.

https://www.linkedin.com/pulse/small-overview-demo-o-google-flan-t5-model-balayogi-g/


Flan Dataset https://arxiv.org/pdf/2301.13688.pdf

<a name='1'></a>
## 1 - Set up Kernel and Required Dependencies

In [1]:
%pip install -q --disable-pip-version-check \
    evaluate==0.4.0 \
    py7zr==0.20.4 \
    sentencepiece==0.1.99 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.4.0 \
    trl==0.7.2
%pip install -q    wandb bitsandbytes accelerate

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/81.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/66.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m66.3/66.3 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 kB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m124.0/124.0 kB[0m [31m17.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m507.1/507.1 kB[0m [31m48.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m17.2 MB/s[0

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!wget https://github.com/wandb/edu/raw/main/llm-training-course/colab/utils.py

--2024-01-06 12:04:33--  https://github.com/wandb/edu/raw/main/llm-training-course/colab/utils.py
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/wandb/edu/main/llm-training-course/colab/utils.py [following]
--2024-01-06 12:04:33--  https://raw.githubusercontent.com/wandb/edu/main/llm-training-course/colab/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8155 (8.0K) [text/plain]
Saving to: ‘utils.py.1’


2024-01-06 12:04:34 (88.1 MB/s) - ‘utils.py.1’ saved [8155/8155]



In [3]:
from google.colab import output
output.enable_custom_widget_manager()

In [4]:
PROJECT = "FlanT5-Lora"
MODEL_NAME = 'google/flan-t5-base'
DATASET = "knkarthick/dialogsum"

In [5]:
import wandb
wandb.init(project=PROJECT, # the project I am working on
           tags=[MODEL_NAME, DATASET],
           notes ="Fine tuning FlanT5 with Dialogsum Dataset. Prompt Instruction and Lora") # the Hyperparameters I want to keep track of

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [6]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [7]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [8]:
with wandb.init(project=PROJECT, job_type="dataset"):
   wbtrain = wandb.Table(data=dataset['train'].to_pandas())
   wbvalidation = wandb.Table(data=dataset['validation'].to_pandas())
   wbtest = wandb.Table(data=dataset['test'].to_pandas())
   wandb.log({"dialogsum_train": wbtrain})
   wandb.log({"dialogsum_validation": wbvalidation})
   wandb.log({"dialogsum_test": wbtest})


VBox(children=(Label(value='0.002 MB of 0.011 MB uploaded\r'), FloatProgress(value=0.19567077344284736, max=1.…

VBox(children=(Label(value='11.152 MB of 23.427 MB uploaded\r'), FloatProgress(value=0.4760480748250359, max=1…

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from HuggingFace. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the memory type to be used by this model.

In [9]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function can be used to do that, at this stage, you do not need to go into details of it.

In [10]:
def print_number_of_trainable_model_parameters(model, tag="original_model"):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    with wandb.init(project=PROJECT, job_type="log_parameters"):
      wandb.log({f'{tag}': {"trainable_model_params":trainable_model_params}})
      wandb.log({f'{tag}': {"all_model_params":all_model_params}})
      wandb.log({f'{tag}': {"percentage_of_trainable_model_parameters": 100 * trainable_model_params}} )

    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params}%"

print(print_number_of_trainable_model_parameters(original_model))

[34m[1mwandb[0m: Currently logged in as: [33molonok[0m ([33molonok69[0m). Use [1m`wandb login --relogin`[0m to force relogin


VBox(children=(Label(value='0.002 MB of 0.011 MB uploaded\r'), FloatProgress(value=0.19002908184461986, max=1.…

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.0%


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with the zero shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text which indicates the model can be fine-tuned to the task at hand.

In [11]:
# Define W&B Table to store generations
columns = ["index", "dialoge", "prompt", "human_sumary", "zero_shot_output"]
table = wandb.Table(columns=columns)

In [12]:
lindex = [100,200,300]

for index in lindex:
  dialogue = dataset['test'][index]['dialogue']
  summary = dataset['test'][index]['summary']

  prompt = f"""
  Summarize the following conversation.

  {dialogue}

  Summary:
  """

  inputs = tokenizer(prompt, return_tensors='pt')
  output = tokenizer.decode(
      original_model.generate(
          inputs["input_ids"],
          max_new_tokens=200,
      )[0],
      skip_special_tokens=True
  )

  dash_line = ('-'.join('' for x in range(100)))
  print(dash_line)
  print(f'INPUT PROMPT:\n{prompt}')
  print(dash_line)
  print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
  print(dash_line)
  print(f'MODEL GENERATION - ZERO SHOT:\n{output}')
  table.add_data(index,dialogue,prompt,summary,output)

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

  Summarize the following conversation.

  #Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.
#Person2#: I'm not so sure about that.
#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.

  Summary:
  
----------

In [13]:
with wandb.init(project=PROJECT, job_type="examples"):

   wandb.log({"zero_shot_inference": table})

VBox(children=(Label(value='0.008 MB of 0.014 MB uploaded\r'), FloatProgress(value=0.5557614439640381, max=1.0…

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

In [14]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [17]:
tokenized_datasets['train'].to_pandas().head()

Unnamed: 0,input_ids,labels
0,"[12198, 1635, 1737, 8, 826, 3634, 5, 1713, 345...","[1363, 5, 3931, 31, 7, 652, 3, 9, 691, 18, 413..."
1,"[12198, 1635, 1737, 8, 826, 3634, 5, 1713, 345...","[8667, 13156, 1217, 11066, 63, 21, 112, 12956,..."
2,"[12198, 1635, 1737, 8, 826, 3634, 5, 1713, 345...","[1713, 345, 13515, 536, 4663, 31, 7, 479, 21, ..."
3,"[12198, 1635, 1737, 8, 826, 3634, 5, 1713, 345...","[1713, 345, 13515, 536, 4663, 31, 7, 12603, 25..."
4,"[12198, 1635, 1737, 8, 826, 3634, 5, 1713, 345...","[1534, 8654, 5484, 7, 2504, 8511, 23, 12, 2595..."


To save some time in the lab, you will subsample the dataset:

In [15]:
with wandb.init(project=PROJECT, job_type="dataset"):
   wbtrain_tokenized = wandb.Table(data=tokenized_datasets['train'].to_pandas())
   wbvalidation_tokenized = wandb.Table(data=tokenized_datasets['validation'].to_pandas())
   wbtest_tokenized = wandb.Table(data=tokenized_datasets['test'].to_pandas())
   wandb.log({"dialogsum_train_tokenized": wbtrain_tokenized})
   wandb.log({"dialogsum_validation_tokenized": wbvalidation_tokenized})
   wandb.log({"dialogsum_test_tokenized": wbtest_tokenized})

VBox(children=(Label(value='43.676 MB of 45.490 MB uploaded\r'), FloatProgress(value=0.960126222566225, max=1.…

Check the shapes of all three parts of the dataset:

In [18]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
})


In [19]:
from types import SimpleNamespace
from pathlib import Path
from tqdm.notebook import tqdm
from datetime import datetime

The output dataset is ready for fine-tuning.

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

Now utilize the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset with reference to the original model. Other training parameters are found experimentally and there is no need to go into details about those at the moment.

In [20]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

config = SimpleNamespace(
    # hyperparameters
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
   num_train_epochs=20,
    save_steps=1000,
    save_strategy='steps', # we cannot set it to "no". Otherwise, the model cannot guess the best checkpoint.
    eval_steps=1000,
    logging_steps=1000,
    evaluation_strategy="steps",
    warmup_steps=500,
    save_total_limit=3,
    load_best_model_at_end = True,
    output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'
)

In [21]:


training_args = TrainingArguments(
    output_dir=config.output_dir,
    learning_rate=config.learning_rate,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
   num_train_epochs=config.num_train_epochs,
    save_steps=config.save_steps,
    save_strategy=config.save_strategy, # we cannot set it to "no". Otherwise, the model cannot guess the best checkpoint.
    eval_steps=config.eval_steps,
    logging_steps=config.logging_steps,
    evaluation_strategy=config.evaluation_strategy,
    warmup_steps=config.warmup_steps,
    save_total_limit=config.save_total_limit,
    load_best_model_at_end = config.load_best_model_at_end,
    report_to="wandb",
    run_name=f"Prompt_tuning_original_model-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

Start training process...

In [None]:
with wandb.init(project=PROJECT, job_type="train"):
  trainer.train()

Step,Training Loss,Validation Loss
1000,40.0811,26.982
2000,16.1498,8.171
3000,11.2468,7.90325
4000,11.098,7.878
5000,11.1036,7.8735
6000,11.0764,7.87125
7000,11.0543,7.87
8000,11.0738,7.8705
9000,11.0836,7.868
10000,11.1133,7.8685


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


VBox(children=(Label(value='0.002 MB of 0.002 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/loss,█▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eval/runtime,▅▅▁▂█▅▆▄▃▄▃▆▁▄▃
eval/samples_per_second,▄▄█▇▁▄▃▅▆▅▆▃█▅▆
eval/steps_per_second,▄▄█▇▁▄▃▅▆▅▆▃█▅▆
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
train/learning_rate,██▇▆▆▅▅▄▄▃▃▂▂▂▁
train/loss,█▂▁▁▁▁▁▁▁▁▁▁▁▁▁
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,7.8705
eval/runtime,3.6082
eval/samples_per_second,138.574
eval/steps_per_second,17.46
train/epoch,20.0
train/global_step,15580.0
train/learning_rate,0.0
train/loss,11.1335
train/total_flos,1.706415322300416e+17
train/train_loss,13.27944


Training a fully fine-tuned version of the model would take a few hours on a GPU. To save time, download a checkpoint of the fully fine-tuned model to use in the rest of this notebook.

In [None]:
trainer.state.best_model_checkpoint

'./dialogue-summary-training-1704544440/checkpoint-9000'

In [None]:
trainer.state.best_model_checkpoint
!mkdir flan-dialogue-summary-checkpoint
custom_path = "./flan-dialogue-summary-checkpoint/"
trainer.save_model(output_dir=custom_path)

In [None]:
#!aws s3 cp --recursive s3://dsoaws/models/flan-dialogue-summary-checkpoint/ ./flan-dialogue-summary-checkpoint/

The size of the downloaded instruct model is approximately 1GB.

In [None]:
!ls -al ./flan-dialogue-summary-checkpoint/

total 483608
drwxr-xr-x 2 root root      4096 Jan  6 14:14 .
drwxr-xr-x 1 root root      4096 Jan  6 14:13 ..
-rw-r--r-- 1 root root      1558 Jan  6 14:13 config.json
-rw-r--r-- 1 root root       142 Jan  6 14:13 generation_config.json
-rw-r--r-- 1 root root 495189552 Jan  6 14:14 model.safetensors
-rw-r--r-- 1 root root      4664 Jan  6 14:14 training_args.bin


In [None]:
!ls -al ./dialogue-summary-training-1704544440/checkpoint-9000

total 1450976
drwxr-xr-x 2 root root      4096 Jan  6 13:33 .
drwxr-xr-x 5 root root      4096 Jan  6 14:08 ..
-rw-r--r-- 1 root root      1558 Jan  6 13:33 config.json
-rw-r--r-- 1 root root       142 Jan  6 13:33 generation_config.json
-rw-r--r-- 1 root root 495189552 Jan  6 13:33 model.safetensors
-rw-r--r-- 1 root root 990548986 Jan  6 13:33 optimizer.pt
-rw-r--r-- 1 root root     14244 Jan  6 13:33 rng_state.pth
-rw-r--r-- 1 root root      1064 Jan  6 13:33 scheduler.pt
-rw-r--r-- 1 root root      3412 Jan  6 13:33 trainer_state.json
-rw-r--r-- 1 root root      4664 Jan  6 13:33 training_args.bin


Create an instance of the `AutoModelForSeq2SeqLM` class for the instruct model:

In [None]:
with wandb.init(project=PROJECT, job_type="models"):
  artifact = wandb.Artifact("instruct_model", type="model")
  artifact.add_dir(custom_path)
  wandb.save(custom_path)
  wandb.log_artifact(artifact)

[34m[1mwandb[0m: Adding directory to artifact (./flan-dialogue-summary-checkpoint)... Done. 1.7s


VBox(children=(Label(value='0.002 MB of 0.002 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

In [None]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan-dialogue-summary-checkpoint", torch_dtype=torch.bfloat16)

In [None]:
instruct_model = instruct_model.to("cuda")

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model is able to create a reasonable summary of the dialogue compared to the original inability to understand what is being asked of the model.

In [None]:
# Define W&B Table to store generations
columns = ["index", "dialoge", "prompt", "human_sumary", "zero_shot_output", "instruct_model_output"]
table = wandb.Table(columns=columns)

In [None]:
lindex = [100,200,300]
for index in lindex:
  dialogue = dataset['test'][index]['dialogue']
  human_baseline_summary = dataset['test'][index]['summary']

  prompt = f"""
  Summarize the following conversation.

  {dialogue}

  Summary:
  """

  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  input_ids = input_ids.to("cuda")
  original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
  original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

  instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
  instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

  print(dash_line)
  print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
  print(dash_line)
  print(f'ORIGINAL MODEL:\n{original_model_text_output}')
  print(dash_line)
  print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
  table.add_data(index,dialogue,prompt,human_baseline_summary,original_model_text_output,instruct_model_text_output )

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# and Mike have a disagreement on how to act out a scene. #Person1# proposes that Mike can try to act in #Person1#'s way.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
The first person is the same as the first person. The second person is the same as the first person. The third person is the same as the first person. The fourth person is the same as the first person. The final answer is "I'm not sure".
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Let's try it.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
--------------------------------------------------

In [None]:
with wandb.init(project=PROJECT, job_type="examples"):

   wandb.log({"instruct_model": table})

VBox(children=(Label(value='0.014 MB of 0.014 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) helps quantify the validity of summarizations produced by models. It compares summarizations to a "baseline" summary which is usually created by a human. While not perfect, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.

In [None]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.

In [None]:
original_model = original_model.to("cuda")

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(tqdm(dialogues)):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to("cuda")

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

  0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,This memo will be distributed to all employees...,This memo is to be distributed to all employee...
1,In order to prevent employees from wasting tim...,Memo to all employees.,This memo is to be distributed to all employee...
2,Ms. Dawson takes a dictation for #Person1# abo...,This memo should go out as an intra-office mem...,This memo is to be distributed to all employee...
3,#Person2# arrives late because of traffic jam....,You're finally here.,Take public transport to work.
4,#Person2# decides to follow #Person1#'s sugges...,I'm sorry to hear that.,Take public transport to work.
5,#Person2# complains to #Person1# about the tra...,You're finally here!,Take public transport to work.
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are getting married.,Masha and Hero are getting divorced.
8,#Person1# and Kate talk about the divorce betw...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
9,#Person1# and Brian are at the birthday party ...,Brian's birthday is today.,Brian's birthday is coming up.


Evaluate the models computing ROUGE metrics. Notice the improvement in the results!

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.20180703364836144, 'rouge2': 0.09008695652173913, 'rougeL': 0.18366638667760765, 'rougeLsum': 0.18742341224247064}
INSTRUCT MODEL:
{'rouge1': 0.29970979020979016, 'rouge2': 0.14344664031620552, 'rougeL': 0.24626456876456876, 'rougeLsum': 0.24932465682465677}


The file `data/dialogue-summary-training-results.csv` contains a pre-populated list of all model results which you can use to evaluate on a larger section of data. Let's do that for each of the models:

In [None]:
#results = pd.read_csv("data-peft/dialogue-summary-training-results-peft.csv")

human_baseline_summaries = df['human_baseline_summaries'].values
original_model_summaries = df['original_model_summaries'].values
instruct_model_summaries = df['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.20180703364836144, 'rouge2': 0.09008695652173913, 'rougeL': 0.18366638667760765, 'rougeLsum': 0.18742341224247064}
INSTRUCT MODEL:
{'rouge1': 0.29970979020979016, 'rouge2': 0.14344664031620552, 'rougeL': 0.24626456876456876, 'rougeLsum': 0.24932465682465677}


The results show substantial improvement in all ROUGE metrics:

In [None]:
# Define W&B Table to store generations
columns = ["metric", "original_model", "instruct_model", "improvement"]
table1 = wandb.Table(columns=columns)

In [None]:
print("Absolute percentage improvement of INSTRUCT MODEL over HUMAN BASELINE")
with wandb.init(project=PROJECT, job_type="metrics"):
  improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
  for key, value, inst, ori in zip(instruct_model_results.keys(), improvement, instruct_model_results.values(),original_model_results.values() ):
      print(f'{key}: {value*100:.2f}% original = {ori} instruct = {inst}' )
      table1.add_data(key, ori, inst, f"{value*100:.2f}%")


  wandb.log({"Rouge Metrics": table1})

Absolute percentage improvement of INSTRUCT MODEL over HUMAN BASELINE


rouge1: 9.79% original = 0.20180703364836144 instruct = 0.29970979020979016
rouge2: 5.34% original = 0.09008695652173913 instruct = 0.14344664031620552
rougeL: 6.26% original = 0.18366638667760765 instruct = 0.24626456876456876
rougeLsum: 6.19% original = 0.18742341224247064 instruct = 0.24932465682465677


VBox(children=(Label(value='0.003 MB of 0.003 MB uploaded\r'), FloatProgress(value=0.9035836177474402, max=1.0…

In [None]:
with wandb.init(project=PROJECT, job_type="examples"):
  table2= wandb.Table(data=df)
  wandb.log({"outputs_original_instruct_model": table2})

VBox(children=(Label(value='0.007 MB of 0.007 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform Parameter Efficient Fine-Tuning (PEFT) fine-tuning as opposed to "full fine-tuning" as you did above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning - with comparable evaluation results as you will see soon.

PEFT is a generic term that includes LoRA and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks, use cases, or tenants from a single SageMaker Endpoint.

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with new a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configurations below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

In [None]:
config2 = SimpleNamespace(
    # hyperparameters
    learning_rate=1e-3,
    gradient_accumulation_steps=2,
   num_train_epochs=20,
    save_steps=1000,
    save_strategy='steps', # we cannot set it to "no". Otherwise, the model cannot guess the best checkpoint.
    eval_steps=1000,
    logging_steps=1000,
    evaluation_strategy="steps",
    warmup_steps=500,
    save_total_limit=3,
    load_best_model_at_end = True,
    output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}',
    r=32,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM, # FLAN-T5
    auto_find_batch_size=True,
)

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=config2.r,
    lora_alpha=config2.lora_alpha,
    target_modules=config2.target_modules,
    lora_dropout=config2.lora_dropout,
    bias=config2.bias,
    task_type=config2.task_type # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [None]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model,"prft_model"))

VBox(children=(Label(value='0.002 MB of 0.012 MB uploaded\r'), FloatProgress(value=0.1827879371190247, max=1.0…

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.4092820552029972%


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [None]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    auto_find_batch_size=config2.auto_find_batch_size,
   output_dir=config2.output_dir,
    learning_rate=config2.learning_rate,
    gradient_accumulation_steps=config2.gradient_accumulation_steps,
   num_train_epochs=config2.num_train_epochs,
    save_steps=config2.save_steps,
    save_strategy=config2.save_strategy, # we cannot set it to "no". Otherwise, the model cannot guess the best checkpoint.
    eval_steps=config2.eval_steps,
    logging_steps=config2.logging_steps,
    evaluation_strategy=config2.evaluation_strategy,
    warmup_steps=config2.warmup_steps,
    save_total_limit=config2.save_total_limit,
    load_best_model_at_end = config2.load_best_model_at_end,
    report_to="wandb",
    run_name=f"PEFT_tuning_original_model-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"


)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['validation']
)

Now everything is ready to train PEFT adapter and save the model.

In [None]:
with wandb.init(project=PROJECT, job_type="train"):
  peft_trainer.train()



Step,Training Loss,Validation Loss
1000,0.9913,0.093648
2000,0.1118,0.088063
3000,0.1036,0.085344
4000,0.1539,0.223523
5000,0.2227,0.169648
6000,0.1921,0.163125
7000,0.1893,0.167383
8000,0.1868,0.160555
9000,0.1847,0.221602
10000,0.1829,0.163359


VBox(children=(Label(value='0.012 MB of 0.022 MB uploaded\r'), FloatProgress(value=0.5685057270263181, max=1.0…

0,1
eval/loss,▁▁▁█▅▅▅▅█▅▅▅▅▅▅
eval/runtime,▆▃▁▁▃▃▄▁▃▃▄▃▄▂█
eval/samples_per_second,▃▆██▆▆▅█▆▆▅▆▅▇▁
eval/steps_per_second,▃▆██▆▆▅█▆▆▅▆▅▇▁
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
train/learning_rate,█▇▇▆▆▆▅▅▄▄▃▂▂▂▁
train/loss,█▁▁▁▂▂▂▂▂▂▂▂▂▂▂
train/total_flos,▁
train/train_loss,▁

0,1
eval/loss,0.15909
eval/runtime,4.1376
eval/samples_per_second,120.843
eval/steps_per_second,15.226
train/epoch,20.0
train/global_step,15580.0
train/learning_rate,4e-05
train/loss,0.1749
train/total_flos,1.733507439132672e+17
train/train_loss,0.22494


In [None]:
peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

That training was performed on a subset of data. To load a fully trained PEFT model, read a checkpoint of a PEFT model from S3.

In [None]:
peft_trainer.state.best_model_checkpoint

'./peft-dialogue-summary-training-1704554309/checkpoint-3000'

In [None]:

custom_path = "./peft-dialogue-summary-checkpoint/"
peft_trainer.save_model(output_dir=custom_path)

In [None]:
with wandb.init(project=PROJECT, job_type="models"):
  artifact = wandb.Artifact("peft_model", type="model")
  artifact.add_dir(custom_path)
  wandb.save(custom_path)
  wandb.log_artifact(artifact)

[34m[1mwandb[0m: Adding directory to artifact (./peft-dialogue-summary-checkpoint)... Done. 0.1s


VBox(children=(Label(value='13.526 MB of 13.526 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

Check that the size of this model is much less than the original LLM:

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       peft_model_path,
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

Check the number of trainable parameters:

In [None]:
#print(print_number_of_trainable_model_parameters(peft_model))

<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), with the original model, fully fine-tuned and PEFT model.

In [None]:
peft_model = peft_model.to("cuda")

In [None]:
# Define W&B Table to store generations
columns = ["index", "dialoge", "prompt", "human_sumary", "zero_shot_output", "instruct_model_output", "peft_moddel_output"]
table3 = wandb.Table(columns=columns)

In [None]:
lindex = [100,200,300]
for index in lindex:
  dialogue = dataset['test'][index]['dialogue']
  baseline_human_summary = dataset['test'][index]['summary']

  prompt = f"""
  Summarize the following conversation.

  {dialogue}

  Summary: """

  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  input_ids = input_ids.to("cuda")
  original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
  original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

  instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
  instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

  peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
  peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

  print(dash_line)
  print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
  print(dash_line)
  print(f'ORIGINAL MODEL:\n{original_model_text_output}')
  print(dash_line)
  print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
  print(dash_line)
  print(f'PEFT MODEL: {peft_model_text_output}')
  table3.add_data(index,dialogue,prompt,human_baseline_summary,original_model_text_output,instruct_model_text_output, peft_model_text_output )

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is crazy for Trump and voted for him. #Person2# doesn't agree with #Person1# on Trump and will vote for Biden.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
Mike and Laura are talking about Jason and Laura's relationship. Mike thinks Jason would react the way most guys would and then they can try something else.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
Let's try it.
---------------------------------------------------------------------------------------------------
PEFT MODEL: Mike wants to get more anger from Laura because he's acting hurt and sad. Mike thinks Jason would react the way most guys would, and they can try it.
------------------------------------------------------------------------------------------

In [None]:
with wandb.init(project=PROJECT, job_type="examples"):

   wandb.log({"peft_model": table3})

VBox(children=(Label(value='0.009 MB of 0.016 MB uploaded\r'), FloatProgress(value=0.5970112839280268, max=1.0…

<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for the sample of the test dataset (only 10 dialogues and summaries to save time).

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(tqdm(dialogues)):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to("cuda")

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

  0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,Ms. Dawson takes a dictation for #Person1#. Ms...,This memo is to be distributed to all employee...,Ms. Dawson takes a dictation for #Person1# to ...
1,In order to prevent employees from wasting tim...,Ms. Dawson takes a dictation for #Person1#. Ms...,This memo is to be distributed to all employee...,Ms. Dawson takes a dictation for #Person1# to ...
2,Ms. Dawson takes a dictation for #Person1# abo...,Ms. Dawson takes a dictation for #Person1#. Ms...,This memo is to be distributed to all employee...,Ms. Dawson takes a dictation for #Person1# to ...
3,#Person2# arrives late because of traffic jam....,#Person2# got stuck in traffic again. #Person1...,Take public transport to work.,#Person2# got stuck in traffic again. #Person1...
4,#Person2# decides to follow #Person1#'s sugges...,#Person2# got stuck in traffic again. #Person1...,Take public transport to work.,#Person2# got stuck in traffic again. #Person1...
5,#Person2# complains to #Person1# about the tra...,#Person2# got stuck in traffic again. #Person1...,Take public transport to work.,#Person2# got stuck in traffic again. #Person1...
6,#Person1# tells Kate that Masha and Hero get d...,Kate tells Kate Masha and Hero are getting div...,Masha and Hero are getting divorced.,Kate tells Kate Masha and Hero are getting div...
7,#Person1# tells Kate that Masha and Hero are g...,Kate tells Kate Masha and Hero are getting div...,Masha and Hero are getting divorced.,Kate tells Kate Masha and Hero are getting div...
8,#Person1# and Kate talk about the divorce betw...,Kate tells Kate Masha and Hero are getting div...,Masha and Hero are getting divorced.,Kate tells Kate Masha and Hero are getting div...
9,#Person1# and Brian are at the birthday party ...,Brian invites #Person1# to the party. Brian is...,Brian's birthday is coming up.,Brian invites #Person1# to the party. Brian is...




In [None]:
with wandb.init(project=PROJECT, job_type="examples"):
  table2= wandb.Table(data=df)
  wandb.log({"outputs_original_instruct_peft_model": table2})

VBox(children=(Label(value='0.017 MB of 0.017 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

Compute ROUGE score for this subset of the data.

In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.37751677960494434, 'rouge2': 0.1291639190582505, 'rougeL': 0.28449063806342356, 'rougeLsum': 0.2860785998361939}
INSTRUCT MODEL:
{'rouge1': 0.29970979020979016, 'rouge2': 0.14344664031620552, 'rougeL': 0.24626456876456876, 'rougeLsum': 0.24932465682465677}
PEFT MODEL:
{'rouge1': 0.3918212559607333, 'rouge2': 0.14390889856709577, 'rougeL': 0.2951729817073787, 'rougeLsum': 0.2945373198749543}


Notice, that PEFT model results are not too bad, while the training process was much easier!

You already computed ROUGE score on the full dataset, after loading the results from the `data-peft/dialogue-summary-training-results-peft.csv` file. Load the values for the PEFT model now and check its performance compared to other models.

In [None]:
human_baseline_summaries = df['human_baseline_summaries'].values
original_model_summaries = df['original_model_summaries'].values
instruct_model_summaries = df['instruct_model_summaries'].values
peft_model_summaries     = df['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.37751677960494434, 'rouge2': 0.1291639190582505, 'rougeL': 0.28449063806342356, 'rougeLsum': 0.2860785998361939}
INSTRUCT MODEL:
{'rouge1': 0.29970979020979016, 'rouge2': 0.14344664031620552, 'rougeL': 0.24626456876456876, 'rougeLsum': 0.24932465682465677}
PEFT MODEL:
{'rouge1': 0.3918212559607333, 'rouge2': 0.14390889856709577, 'rougeL': 0.2951729817073787, 'rougeLsum': 0.2945373198749543}


The results show less of an improvement over full fine-tuning, but the benefits of PEFT typically outweigh the slightly-lower performance metrics.

Calculate the improvement of PEFT over the original model:

In [None]:
# Define W&B Table to store generations
columns = ["metric", "original_model", "instruct_model", "improvement"]
table10 = wandb.Table(columns=columns)

In [None]:
print("Absolute percentage improvement of PEFT MODEL over HUMAN BASELINE")
with wandb.init(project=PROJECT, job_type="metrics"):
  # Instruct model
  improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
  for key, value, inst, ori in zip(instruct_model_results.keys(), improvement, instruct_model_results.values(),original_model_results.values() ):
      print(f'{key}: {value*100:.2f}% original = {ori} instruct = {inst}' )
      table10.add_data(key, ori, inst, f"{value*100:.2f}%")
  #peft model
  improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
  for key, value, inst, ori in zip(peft_model_results.keys(), improvement, peft_model_results.values(),original_model_results.values() ):
      print(f'{key}: {value*100:.2f}% original = {ori} instruct = {inst}' )
      table10.add_data(key, ori, inst, f"{value*100:.2f}%")
  # Peft Model over Instruct Modxel
  improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
  for key, value, inst, ori in zip(peft_model_results.keys(), improvement, peft_model_results.values(),instruct_model_results.values() ):
      print(f'{key}: {value*100:.2f}% original = {ori} instruct = {inst}' )
      table10.add_data(key, ori, inst, f"{value*100:.2f}%")

  wandb.log({"Rouge Metrics": table10})

Absolute percentage improvement of PEFT MODEL over HUMAN BASELINE


rouge1: -7.78% original = 0.37751677960494434 instruct = 0.29970979020979016
rouge2: 1.43% original = 0.1291639190582505 instruct = 0.14344664031620552
rougeL: -3.82% original = 0.28449063806342356 instruct = 0.24626456876456876
rougeLsum: -3.68% original = 0.2860785998361939 instruct = 0.24932465682465677
rouge1: 1.43% original = 0.37751677960494434 instruct = 0.3918212559607333
rouge2: 1.47% original = 0.1291639190582505 instruct = 0.14390889856709577
rougeL: 1.07% original = 0.28449063806342356 instruct = 0.2951729817073787
rougeLsum: 0.85% original = 0.2860785998361939 instruct = 0.2945373198749543
rouge1: 9.21% original = 0.29970979020979016 instruct = 0.3918212559607333
rouge2: 0.05% original = 0.14344664031620552 instruct = 0.14390889856709577
rougeL: 4.89% original = 0.24626456876456876 instruct = 0.2951729817073787
rougeLsum: 4.52% original = 0.24932465682465677 instruct = 0.2945373198749543


VBox(children=(Label(value='0.004 MB of 0.004 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

Now calculate the improvement of PEFT over a full fine-tuned model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: 9.21%
rouge2: 0.05%
rougeL: 4.89%
rougeLsum: 4.52%


Here you see a small percentage decrease in the ROUGE metrics vs. full fine-tuned. However, the training requires much less computing and memory resources (often just a single GPU).

In [None]:
results_path = "/content/drive/MyDrive/llms/results_flat_t5.csv"
df.to_csv(results_path, index=False)

In [None]:
wandb.finish()