##Agenda
1. Library installation
2. Loading the necessary libraries
3. Loading dataset
4. Accessing the model and tokenizer
5. Number trainable parameters of the flan-t5-base model
6. Fine-tuning using the PEFT (LoRA) technique
7. Training PEFT Adapter
8. Evaluating the model
9. Evaluating the Model using ROUGE performance metric

##Installing library

In [2]:
%pip install -U datasets huggingface_hub fsspec
%pip install transformers
%pip install loralib
%pip install peft
%pip install evaluate
%pip install rouge_score
%pip install torch
%pip install torchdata

Collecting fsspec
  Using cached fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting peft
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.13.0->peft)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.13.0->peft)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.13.0->peft)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.13.0->peft)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.13.0->peft)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cu

##Loading the necessary libraries

In [3]:
import time
import pandas as pd
import numpy as np
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, GenerationConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from peft import PeftModel, PeftConfig

##Loading dataset

We are using the 'dialogsum' datset (https://huggingface.co/datasets/knkarthick/dialogsum) from the huggingface platform.

In [4]:
#accessing dataset from HuggingFace (https://huggingface.co/datasets/knkarthick/dialogsum)
huggingface_dataset_name = "knkarthick/dialogsum"
dataset_dialogue = load_dataset(huggingface_dataset_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv:   0%|          | 0.00/442k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [5]:
dataset_dialogue

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

##Accessing the model and tokenizer

We are accessing an open-sourced flan-t5-base model from the huggingface and we will use autoTokenizer.

In [6]:
#accessing an open-sourced flan-t5-base model from hugging face
huggingface_model='google/flan-t5-base'
model_flan_t5 = AutoModelForSeq2SeqLM.from_pretrained(huggingface_model, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(huggingface_model)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

## Number trainable parameters of the flan-t5-base model

In [7]:
def flan_t5_trainable_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"flan-t5 trainable parameters: {trainable_model_params}\nall flan-t5 parameters: {all_model_params}\npercentage of trainable flan-t5 parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(flan_t5_trainable_parameters(model_flan_t5))

flan-t5 trainable parameters: 247577856
all flan-t5 parameters: 247577856
percentage of trainable flan-t5 parameters: 100.00%


##Zero-shot prompting

In [8]:
index_dialogue_example = 15
dialogue = dataset_dialogue['test'][index_dialogue_example]['dialogue']
summary = dataset_dialogue['test'][index_dialogue_example]['summary']

#designing the prompt with no example (zero-shot)
prompt = f"What was the conversation?:\n\n{dialogue} Please summerize it in three lines:\n\n"

#passing the prompt to the model
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model_flan_t5.generate(inputs['input_ids'], max_length=150)
zero_shot_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

#printing the generated summary for zero-shot
print ('Dialogue')
print ('----------------------')
print(dialogue)
print ('----------------------')
print('Baseline summary:')
print ('----------------------')
print(summary)
print ('----------------------')
print(f'One shot summary:')
print ('----------------------')
print(zero_shot_summary)
print ('----------------------')
print('\n')

Dialogue
----------------------
#Person1#: I've had it! I am done working for a company that is taking me nowhere!
#Person2#: So what are you gonna do? Just quit?
#Person1#: That's exactly what I am going to do! I have decided to create my own company! I am going to write up a business plan, get some investors and start working for myself!
#Person2#: Have you ever written up a business plan before?
#Person1#: Well, no, it can't be that hard! I mean, all you have to do is explain your business, how you are going to do things and that's it, right?
#Person2#: You couldn't be more wrong! A well written business plan will include an executive summary which highlights the idea of the business in two pages or less. Then you need to describe your company with information such as what type of legal structure it has, history, etc.
#Person1#: Well, that seems easy enough.
#Person2#: Wait, there is more! Then you need to introduce and describe your goods or services. What they are and how they are

In [9]:
def tokenize_function(example):
    start_prompt = 'What was the conversation?\n\n'
    end_prompt = '\n\Please summerize it: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

#the dataset contains train, validation, test datasets; the tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset_dialogue.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [10]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [11]:
print(f"Datasets sizes:")
print(f"Train: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

Datasets sizes:
Train: (125, 2)
Validation: (5, 2)
Test: (15, 2)


In [12]:
#printing tokenized datasets
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})


In [13]:
#setting output directory
out_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

In [14]:
#passing training arguments
args_train = TrainingArguments(
    output_dir=out_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)
#training the model
model_flan_t5_training = Trainer(
    model=model_flan_t5,
    args=args_train,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

max_steps is given, it will override any value given in num_train_epochs


In [15]:
model_flan_t5_training.train()

Step,Training Loss
1,47.25


TrainOutput(global_step=1, training_loss=47.25, metrics={'train_runtime': 11.786, 'train_samples_per_second': 0.679, 'train_steps_per_second': 0.085, 'total_flos': 5478058819584.0, 'train_loss': 47.25, 'epoch': 0.0625})

###Fine-tuning using the PEFT (LoRA) technique

In [16]:
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

In [17]:
peft_model_flan_t5_model = get_peft_model(model_flan_t5, lora_config)
print(flan_t5_trainable_parameters(peft_model_flan_t5_model))

flan-t5 trainable parameters: 3538944
all flan-t5 parameters: 251116800
percentage of trainable flan-t5 parameters: 1.41%


###Training PEFT adapter

In [18]:
out_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_train_args = TrainingArguments(
    output_dir=out_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)

peft_flan_t5_trainer = Trainer(
    model=peft_model_flan_t5_model,
    args=peft_train_args,
    train_dataset=tokenized_datasets["train"],
)

max_steps is given, it will override any value given in num_train_epochs


In [19]:
peft_flan_t5_trainer.train()
peft_flan_t5_path="./peft-dialogue-summary-checkpoint-local"
peft_flan_t5_trainer.model.save_pretrained(peft_flan_t5_path)
tokenizer.save_pretrained(peft_flan_t5_path)

Step,Training Loss
1,48.75


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

In [20]:
model_id='google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(peft_model_base, 'intotheverse/peft-dialogue-summary-checkpoint', torch_dtype=torch.bfloat16, is_trainable=False)

adapter_config.json:   0%|          | 0.00/334 [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/14.2M [00:00<?, ?B/s]

###Evaluating the model

In [21]:
index = 200
dialogue = dataset_dialogue['test'][index]['dialogue']
baseline_human_summary = dataset_dialogue['test'][index]['summary']

prompt = f"What was the conversation?:\n\n{dialogue} Please summerize it:\n\n"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = model_flan_t5.generate(input_ids=input_ids.cuda(), generation_config= GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print('-------------')
print(f'Output from the original model:\n{original_model_text_output}')
print('-------------')
print(f'Output from the PEFT-tuned model: {peft_model_text_output}')

-------------
Output from the original model:
#Person1: I'd like to upgrade my system, but I'm not sure what I can do. #Person2: I'm not sure what I'm looking for. #Person1: I'm not sure what I'm looking for. #Person1: I'm not sure what I'm looking for. #Person2: I'm not sure what I'm looking for. #Person1: I'm looking for something that I can make up my own flyers and banners. #Person1: I'm looking for something that I can make up my own flyers and banners. #Person2: I'm looking for something that I can make up my own flyers and banners. #Person1: I'm looking for something that I can make up my own flyers and banners. #Person2: I'm looking for something
-------------
Output from the PEFT-tuned model: #Person1# suggests upgrading #Person2#'s system and #Person2#'s hardware. #Person2# suggests adding a painting program to #Person2#'s software and adding a CD-ROM drive. #Person1# suggests adding a CD-ROM drive too.


###Evaluating the model using ROUGE performance metric

In [22]:
dialogues = dataset_dialogue['test'][0:10]['dialogue']
baseline_summaries = dataset_dialogue['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"What was the conversation?:\n\n{dialogue} Please summerize it:\n\n"

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = baseline_summaries[idx]

    original_model_outputs = model_flan_t5.generate(input_ids=input_ids.cuda(), generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['baseline_summaries', 'original_model_summaries', 'PEFT_tuned_model_summaries'])
df

Unnamed: 0,baseline_summaries,original_model_summaries,PEFT_tuned_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,"#Person1#: I'm sorry, but I'm not sure what to...",#Person1# asks Ms. Dawson to take a dictation ...
1,In order to prevent employees from wasting tim...,#Person1: I need to take a dictation for Ms. D...,#Person1# asks Ms. Dawson to take a dictation ...
2,Ms. Dawson takes a dictation for #Person1# abo...,#Person1: I need you to take a dictation for m...,#Person1# asks Ms. Dawson to take a dictation ...
3,#Person2# arrives late because of traffic jam....,#Person2#: I'm here!,#Person2# got stuck in traffic again. #Person1...
4,#Person2# decides to follow #Person1#'s sugges...,#Person1: I'm sorry to hear that. I'm sorry to...,#Person2# got stuck in traffic again. #Person1...
5,#Person2# complains to #Person1# about the tra...,#Person1#: You're here!,#Person2# got stuck in traffic again. #Person1...
6,#Person1# tells Kate that Masha and Hero get d...,#Person1#: Masha and Hero are getting divorced...,Kate tells #Person2# Masha and Hero are gettin...
7,#Person1# tells Kate that Masha and Hero are g...,#Person1#: Masha and Hero are getting divorced...,Kate tells #Person2# Masha and Hero are gettin...
8,#Person1# and Kate talk about the divorce betw...,#Person1: Masha and Hero are getting divorced....,Kate tells #Person2# Masha and Hero are gettin...
9,#Person1# and Brian are at the birthday party ...,Brian's birthday party is coming to an end.,Brian remembers his birthday and invites #Pers...


In [24]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('Result of the original Model:')
print(original_model_results)
print('Result of the PEFT-tuned Model:')
print(peft_model_results)

Result of the original Model:
{'rouge1': 0.1692428269645396, 'rouge2': 0.05323769745576412, 'rougeL': 0.15736100611146567, 'rougeLsum': 0.15801248356459033}
Result of the PEFT-tuned Model:
{'rouge1': 0.34108280630980237, 'rouge2': 0.09925903847447465, 'rougeL': 0.2529410964836085, 'rougeLsum': 0.25289434591818816}
