<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/RLHF/1_DIALOGSUM_PEFT_flan_t5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PEFT Fine-Tune a Generative AI Model for Dialogue Summarization

In this notebook, you will fine-tune an existing LLM from Hugging Face for enhanced dialogue summarization. You will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality instruction-tuned model that can summarize text out of the box. To improve its summaries, you will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then you will perform PEFT fine-tuning, evaluate the resulting model, and see that the benefits of PEFT outweigh the slightly lower performance metrics.

I use this notebook as my project for the Weights & Biases (W&B) course.

<a name='1'></a>
## 1 - Set up Kernel and Required Dependencies

In [1]:
%pip install -q --disable-pip-version-check \
    evaluate==0.4.0 \
    py7zr==0.20.4 \
    sentencepiece==0.1.99 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.4.0 \
    trl==0.7.2
%pip install -q wandb bitsandbytes accelerate

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!wget https://github.com/wandb/edu/raw/main/llm-training-course/colab/utils.py

--2024-01-08 20:27:03--  https://github.com/wandb/edu/raw/main/llm-training-course/colab/utils.py
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/wandb/edu/main/llm-training-course/colab/utils.py [following]
--2024-01-08 20:27:03--  https://raw.githubusercontent.com/wandb/edu/main/llm-training-course/colab/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8155 (8.0K) [text/plain]
Saving to: ‘utils.py’


2024-01-08 20:27:03 (112 MB/s) - ‘utils.py’ saved [8155/8155]



In [4]:
from google.colab import output
output.enable_custom_widget_manager()

In [6]:
PROJECT = "FlanT5-Lora-RLHF"
MODEL_NAME = 'google/flan-t5-base'
DATASET = "knkarthick/dialogsum"

In [7]:
import wandb
wandb.init(project=PROJECT, # the project I am working on
           tags=[MODEL_NAME, DATASET],
           notes ="Fine tuning FlanT5 with Dialogsum Dataset. Fine Tune Lora. Apply then RLHF for toxicity") # the Hyperparameters I want to keep track of

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [8]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

<a name='1.2'></a>
### 1.2 - Load Dataset and LLM

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [9]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [10]:
import os
data_path = "/content/drive/MyDrive/data/dialogsum"
wbtrain = dataset['train'].to_pandas()
wbvalidation = dataset['validation'].to_pandas()
wbtest =dataset['test'].to_pandas()

wbtrain.to_csv(os.path.join(data_path, "wbtrain.csv"), index=False)
wbvalidation.to_csv(os.path.join(data_path, "wbvalidation.csv"), index=False)
wbtest.to_csv(os.path.join(data_path, "wbtest.csv"), index=False)

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` loads the weights in bfloat16 (16-bit) precision, which roughly halves the memory they need compared with the default float32.

In [11]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

You can count the model's parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [12]:
def print_number_of_trainable_model_parameters(model, tag="original_model"):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()


    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.0%
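
To make the effect of `torch_dtype=torch.bfloat16` more concrete, the parameter count printed above can be turned into a rough memory estimate. This is only a back-of-the-envelope sketch: it counts weight storage alone (no activations, gradients, or optimizer state), and the helper name `weight_memory_mib` is made up for this illustration.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Approximate weight storage in MiB for a model with `num_params` parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20

# Parameter count reported for google/flan-t5-base in the output above.
FLAN_T5_BASE_PARAMS = 247_577_856

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_mib(FLAN_T5_BASE_PARAMS, dtype):.0f} MiB")
```

Under these assumptions, the bfloat16 copy of the weights needs half the space of a float32 copy.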


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates that the model can be fine-tuned for the task at hand.
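
A zero-shot test could look roughly like the sketch below. It reuses the instruction template that `tokenize_function` applies in section 2; the helper names `build_prompt` and `summarize`, the `max_new_tokens=200` limit, and the sample dialogue are assumptions made for illustration, not part of the original notebook.

```python
def build_prompt(dialogue: str) -> str:
    """Wrap a dialogue in the instruction template used for tokenization in this notebook."""
    return f"Summarize the following conversation.\n\n{dialogue}\n\nSummary: "

def summarize(model, tokenizer, dialogue: str, max_new_tokens: int = 200) -> str:
    """Generate a summary with a Hugging Face seq2seq model and its tokenizer."""
    inputs = tokenizer(build_prompt(dialogue), return_tensors="pt")
    output_ids = model.generate(input_ids=inputs["input_ids"], max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# A made-up dialogue, used only to show what the prompt looks like.
example_dialogue = "#Person1#: Is the report ready?\n#Person2#: I will send it tonight."
print(build_prompt(example_dialogue))
```

With the objects loaded earlier, a call such as `summarize(original_model, tokenizer, dataset['test'][0]['dialogue'])` would return the model's zero-shot summary, which you can compare with `dataset['test'][0]['summary']`.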

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

In [13]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset contains three splits: train, validation, and test.
# With batched=True, map() passes batches of examples from every split to tokenize_function.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

To save time, you can subsample the dataset by uncommenting the line below. It is left commented out here, so training uses the full dataset:

In [None]:
#tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)
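
If the filter above is uncommented, it keeps every row whose index is a multiple of 100. The following sketch estimates how many rows would survive in each split; the helper `rows_kept` is hypothetical, and the split sizes are the ones reported when the dataset was loaded.

```python
def rows_kept(num_rows: int, every: int = 100) -> int:
    """Number of indices in range(num_rows) that satisfy index % every == 0."""
    return len(range(0, num_rows, every))

# Split sizes reported by load_dataset earlier in the notebook.
split_sizes = {"train": 12460, "validation": 500, "test": 1500}

for split, n in split_sizes.items():
    print(f"{split}: {rows_kept(n)} of {n} rows")
```

A subsample this small is useful for a quick smoke test of the pipeline, but it is unlikely to give representative evaluation numbers.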

Check the shapes of all three parts of the dataset:

In [14]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
})


In [15]:
from types import SimpleNamespace
from pathlib import Path
from tqdm.notebook import tqdm
from datetime import datetime

The output dataset is ready for fine-tuning.

In [16]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [17]:
rouge = evaluate.load('rouge')

Generate the outputs for the sample of the test dataset (only 10 dialogues and summaries to save time), and save the results.
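
To give an intuition for what the ROUGE scores measure, here is a deliberately simplified ROUGE-1 F1 computed from unigram overlap. It is a sketch, not the implementation behind `evaluate.load('rouge')`: that metric wraps the `rouge_score` package, which has its own tokenization and optional stemming, so its numbers will generally differ. The function name and the two sample strings are made up for illustration.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller of the two sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up strings, for illustration only.
reference = "Person1 asks Person2 to send the report tonight"
prediction = "Person2 will send the report tonight"
print(round(rouge1_f1(prediction, reference), 3))
```

With the metric loaded above, the corresponding call is `rouge.compute(predictions=[...], references=[...])`, which returns ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores.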

<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform Parameter-Efficient Fine-Tuning (PEFT) as opposed to full fine-tuning. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes LoRA and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks, use cases, or tenants from a single SageMaker Endpoint.
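
What "combining" the adapter with the original LLM means can be written down with the standard LoRA formulation. The symbols below are introduced only for this illustration and do not appear elsewhere in the notebook: $W_0$ is a frozen weight matrix of the original model, $A$ and $B$ are the trainable low-rank factors that make up the adapter, $r$ is the rank, and $\alpha$ is the scaling hyper-parameter (`lora_alpha`).

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\quad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

Only $A$ and $B$ are trained, which is $r(d + k)$ parameters per adapted matrix instead of $dk$. For serving, the adapter can either be applied next to $W_0$ as above or merged once into a single matrix $W = W_0 + \frac{\alpha}{r} B A$.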

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [18]:
from peft import LoraConfig, get_peft_model, TaskType

In [19]:
config2 = SimpleNamespace(
    # hyperparameters
    learning_rate=1e-3,
    gradient_accumulation_steps=2,
    num_train_epochs=20,
    save_steps=1000,
    save_strategy='steps', # we cannot set it to "no". Otherwise, the model cannot guess the best checkpoint.
    eval_steps=1000,
    logging_steps=1000,
    evaluation_strategy="steps",
    warmup_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True,
    output_dir=f'./peft-dialogue-summary-training-{str(int(time.time()))}',
    r=32,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM, # FLAN-T5
    auto_find_batch_size=True,
)

In [20]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=config2.r,
    lora_alpha=config2.lora_alpha,
    target_modules=config2.target_modules,
    lora_dropout=config2.lora_dropout,
    bias=config2.bias,
    task_type=config2.task_type # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [21]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model, "peft_model"))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.4092820552029972%
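
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the usual `flan-t5-base` architecture, which this notebook does not print: a hidden size of 768 with `q` and `v` projections of shape 768 x 768, 12 encoder layers with one self-attention block each, and 12 decoder layers with one self-attention and one cross-attention block each. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumed flan-t5-base architecture (not shown in this notebook).
D_MODEL = 768               # hidden size; q and v project 768 -> 768
ENCODER_LAYERS = 12         # one self-attention block per layer
DECODER_LAYERS = 12         # one self-attention and one cross-attention block per layer
RANK = 32                   # r from the LoRA config
TARGETS_PER_ATTENTION = 2   # target_modules=["q", "v"]

attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_matrices = attention_blocks * TARGETS_PER_ATTENTION
trainable = adapted_matrices * lora_param_count(D_MODEL, D_MODEL, RANK)
total = 247_577_856 + trainable  # base parameters + adapter parameters

print(trainable, total, f"{100 * trainable / total:.4f}%")
```

Under these assumptions the arithmetic matches the reported numbers: 72 adapted matrices with 49,152 adapter parameters each.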


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [22]:
# The output directory comes from config2.output_dir, defined with the other hyperparameters.
peft_training_args = TrainingArguments(
    auto_find_batch_size=config2.auto_find_batch_size,
    output_dir=config2.output_dir,
    learning_rate=config2.learning_rate,
    gradient_accumulation_steps=config2.gradient_accumulation_steps,
    num_train_epochs=config2.num_train_epochs,
    save_steps=config2.save_steps,
    save_strategy=config2.save_strategy, # we cannot set it to "no". Otherwise, the model cannot guess the best checkpoint.
    eval_steps=config2.eval_steps,
    logging_steps=config2.logging_steps,
    evaluation_strategy=config2.evaluation_strategy,
    warmup_steps=config2.warmup_steps,
    save_total_limit=config2.save_total_limit,
    load_best_model_at_end=config2.load_best_model_at_end,
    report_to="wandb",
    run_name=f"PEFT_tuning_original_model-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['validation']
)

Now everything is ready to train the PEFT adapter and save the model.

In [23]:
with wandb.init(project=PROJECT, job_type="train"):
  peft_trainer.train()



| Step | Training Loss | Validation Loss |
|---|---|---|
| 1000 | 3.3979 | 0.093316 |
| 2000 | 0.1125 | 0.087789 |
| 3000 | 0.1036 | 0.084746 |
| 4000 | 0.168 | 0.169922 |
| 5000 | 0.1928 | 0.160156 |
| 6000 | 0.1721 | 0.098203 |
| 7000 | 0.1095 | 0.087305 |
| 8000 | 0.0987 | 0.084043 |
| 9000 | 0.0996 | 0.087266 |
| 10000 | 0.0976 | 0.085121 |

Run history:

| Metric | Trend |
|---|---|
| eval/loss | ▂▁▁█▇▂▁▁▁▁▁▁▁▁▁ |
| eval/runtime | ▃▅▄▆▆▇▅█▇▇▅▆█▆▁ |
| eval/samples_per_second | ▆▄▅▃▃▂▄▁▂▂▄▃▁▃█ |
| eval/steps_per_second | ▆▄▅▃▃▂▄▁▂▂▄▃▁▃█ |
| train/epoch | ▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███ |
| train/global_step | ▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███ |
| train/learning_rate | █▇▇▆▆▆▅▅▄▄▃▂▂▂▁ |
| train/loss | █▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| train/total_flos | ▁ |
| train/train_loss | ▁ |

Run summary:

| Metric | Value |
|---|---|
| eval/loss | 0.082 |
| eval/runtime | 4.0976 |
| eval/samples_per_second | 122.023 |
| eval/steps_per_second | 15.375 |
| train/epoch | 20.0 |
| train/global_step | 15580.0 |
| train/learning_rate | 4e-05 |
| train/loss | 0.0873 |
| train/total_flos | 1.733507439132672e+17 |
| train/train_loss | 0.3243 |
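
The `train/global_step` value in the summary can be sanity-checked with a little arithmetic. The sketch below assumes a per-device batch size of 8 on a single GPU (the `TrainingArguments` default, assuming `auto_find_batch_size` did not have to lower it), and the rounding reflects my understanding of how the `Trainer` counts optimizer updates rather than a guaranteed formula. The helper `total_optimizer_steps` is hypothetical.

```python
import math

def total_optimizer_steps(num_examples: int, batch_size: int, grad_accum_steps: int, epochs: int) -> int:
    """Optimizer updates in a run: batches per epoch, grouped by accumulation, times epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(epochs * updates_per_epoch)

# 12,460 training examples, gradient_accumulation_steps=2, num_train_epochs=20.
steps = total_optimizer_steps(num_examples=12460, batch_size=8, grad_accum_steps=2, epochs=20)
print(steps)

# With eval_steps=1000, this is how many evaluations fit into the run.
print(steps // 1000)
```

Under these assumptions the result agrees with the reported 15,580 steps, and the 15 evaluations agree with the number of points in the `eval/loss` history.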


In [24]:
peft_model_path="/content/drive/MyDrive/models/peft-t5"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

('/content/drive/MyDrive/models/peft-t5/tokenizer_config.json',
 '/content/drive/MyDrive/models/peft-t5/special_tokens_map.json',
 '/content/drive/MyDrive/models/peft-t5/spiece.model',
 '/content/drive/MyDrive/models/peft-t5/added_tokens.json',
 '/content/drive/MyDrive/models/peft-t5/tokenizer.json')
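
Because only the adapter is saved, using it later means loading the base model first and attaching the adapter to it. A minimal sketch of that step, assuming the `peft` and `transformers` versions installed at the top of this notebook; the helper name `load_peft_summarizer` is made up for illustration.

```python
def load_peft_summarizer(base_model_name: str, adapter_path: str):
    """Rebuild the fine-tuned model: frozen base weights plus the saved LoRA adapter."""
    # Imported lazily so the helper can be defined before the heavy libraries are needed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)
    # The tokenizer was saved to the same directory as the adapter.
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    # is_trainable=False loads the adapter for inference only.
    model = PeftModel.from_pretrained(base_model, adapter_path, is_trainable=False)
    return model, tokenizer
```

For example, `model, tokenizer = load_peft_summarizer('google/flan-t5-base', peft_model_path)` would reload the adapter saved in the cell above.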

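Training here ran on the full training split (12,460 examples) for 20 epochs. Because `load_best_model_at_end=True`, the `Trainer` tracks the checkpoint with the lowest validation loss and reloads it when training finishes. You can check which checkpoint was selected: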

In [25]:
peft_trainer.state.best_model_checkpoint

'./peft-dialogue-summary-training-1704746271/checkpoint-15000'

In [26]:

custom_path = "/content/drive/MyDrive/models/peft-t5-best"
peft_trainer.save_model(output_dir=custom_path)

In [27]:
with wandb.init(project=PROJECT, job_type="models"):
  artifact = wandb.Artifact("peft_model", type="model")
  artifact.add_dir(custom_path)
  wandb.save(custom_path)
  wandb.log_artifact(artifact)

wandb: Currently logged in as: olonok (olonok69). Use `wandb login --relogin` to force relogin
wandb: Adding directory to artifact (/content/drive/MyDrive/models/peft-t5-best)... Done. 0.1s

In [28]:
wandb.finish()