## Install dependencies

---



This first section will install required dependencies.

In [53]:
!pip install transformers datasets peft torch



In [54]:
!pip install evaluate



In [55]:
!pip install rouge_score



## Loading Dataset

---



We will work with the famous "dialogsum" dataset. It is made of dialogues that could happend between humans. All conversations have corresponding summaries.

You can obtain more information [here](https://huggingface.co/datasets/knkarthick/dialogsum).

In [56]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

In [57]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

As we can see, the dataset is coming with everything we need: train, validation and test sets. With more than 12k training conversations, we have everything we need for fine-tuning our LLM.

Below is an example of the first dialogue and its corresponding summary:

In [58]:
print(f"This is the dialogue:\n{dataset['train'][0]['dialogue']}")
print("---"*20)
print(f"This is the summary:\n{dataset['train'][0]['summary']}")

This is the dialogue:
#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.
----------------------------------------------

It is important to note that all summaries are human generated and will be our base for performance review of our model.
Also, we should keep in mind that language is a complex topic, and of course we can never assure summaries are one hundred percent perfect.

## Importing base Flan-T5

To start, we will load the Flan-T5 model designed by Google's team and first released in the paper *Scaling Instruction-Finetuned Language Models* (by HW Chung et al., 2022). Paper is available [here](https://arxiv.org/pdf/2210.11416) and you can consult module's documentation at HuggingFace dedicated [page](https://huggingface.co/docs/transformers/en/model_doc/flan-t5).

In [59]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Out of curiosity, we will print below the number of parameters Flan-T5 is including. As you will see, numbers are quite outstanding with more than 200M parameters.

In [60]:
def original_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0

    for layer, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()

    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(original_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


Even though all parameters are trainable, it is obvious that training from scratch such a model is impossible for simple individuals like us. However, we will see later how to still effectively "retrain" the model to suit our needs.

For now, let's test Flan-T5 on our dataset and ask for summaries of dialogues.

In [61]:
index = np.random.randint(100)

dialogue = dataset["train"][index]["dialogue"]
summary = dataset["train"][index]["summary"]

print(f"This is the dialogue at index {index}:\n{dialogue}")
print("----"*15)
print(f"This is the corresponding summary:\n{summary}")

This is the dialogue at index 51:
#Person1#: What kind of job do you intend to do?
#Person2#: I want to do some management job since I have three-year's work history.
#Person1#: What are your plans if you were hired?
#Person2#: I would apply my specialty and experience to my job and gradually move up to the management level in this company.
------------------------------------------------------------
This is the corresponding summary:
#Person2# tells #Person1# #Person2#'s ideal job and the job plan if hired.


First, we need to design a prompt template. We will use zero-shot inference, i.e. we will not provide any example of what we expect to the model. Feel free to compare performances by using one-shot or even few-shot inferences.

In [62]:
prompt = f"""

Summarize the following conversation:
{dialogue}

Summary:
"""

Now that we have our prompt, we must:

1.   Tokenize the input using a transformer architecture so the model can understand what we are asking (see [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762) (by A Vaswani et al., 2017) for more details on how transformers work.
2.   Decode the output so **we** can read it.



In [63]:
input = tokenizer(prompt, return_tensors="pt")
output = tokenizer.decode(original_model.generate(input["input_ids"],
                                                max_new_tokens=200)[0],
                                                skip_special_tokens=True)

Ok now that we have our prediction, let's compare it to the actual summary:

In [64]:
print(f"Input Prompt:\n{prompt}")
print("-----"*20)
print(f"Baseline Human Summary:\n{summary}")
print("-----"*20)
print(f"LLM Summary - Zero Shot:\n{output}")

Input Prompt:


Summarize the following conversation:
#Person1#: What kind of job do you intend to do?
#Person2#: I want to do some management job since I have three-year's work history.
#Person1#: What are your plans if you were hired?
#Person2#: I would apply my specialty and experience to my job and gradually move up to the management level in this company. 

Summary:

----------------------------------------------------------------------------------------------------
Baseline Human Summary:
#Person2# tells #Person1# #Person2#'s ideal job and the job plan if hired.
----------------------------------------------------------------------------------------------------
LLM Summary - Zero Shot:
Ask the person to describe their career goals.


What are your thoughts on this? It looks like the model is very straight to the point, and lacking details in its summary. It is like the summary is too short and missing some valuable informations.

To get an idea of the model performance, we will compute its ROUGE score. This score is calculated as below (higher the better):

$$ \text{ROUGE-1} = \frac{\text{Unigram Matches}}{\text{Total number of unigram in reference}} $$



$$ \text{ROUGE-2} = \frac{\text{Unigram Matches}}{\text{Unigrams in Output}} $$

$$ \text{ROUGE-L} = \frac{\text{Length of longest common subsequence}}{\text{Total number of words in reference summaries}} $$


$$ \text{ROUGE-Lsum} = \frac{\text{Total length of overlapping summary}}{\text{Total length of reference summaries}} $$

In [65]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=output,
    references=summary[0:len(output)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)

ORIGINAL MODEL:
{'rouge1': 0.021739130434782608, 'rouge2': 0.0, 'rougeL': 0.021739130434782608, 'rougeLsum': 0.021739130434782608}


Clearly, our model is not performing especially well: even though Flan-T5 was designed to solve various tasks, it is not necessarly especially well designed to produced summaries on the go. Let's see how we can improve its performance.

## Perform Parameter Efficient Fine-Tuning (PEFT)

In this section we will implement Parameter Efficient Fine-Tuning (or PEFT).

Full fine-tuning of large LLMs are challenging and very costly, which makes it hard to retrain a whole model. With PEFT, the idea is to freeze most of the layers of our model, and update only some of the layers. Various methods exist to do this such as (i) "Selective" (only fine-tune some parameters), (ii) "Reparameterization" (implement a low rank representation of the model), or (iii) "Additive" (add trainable layers or parameters to the model).

In this section we will go with the second option and use the [LoRA method](https://huggingface.co/docs/diffusers/main/en/training/lora).

In [66]:
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForSeq2Seq
from peft import LoraConfig, get_peft_model, PeftModel, TaskType

lora_config = LoraConfig(
    r=32,  # Rank of the submatrices
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

model_name = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# we create the PEFT model
peft_model = get_peft_model(original_model, lora_config)

# we define an output directory for the trained model - this is key as we will save the model to be able to reuse it later
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'


In [67]:
# we need to preprocess the dataset to correspond to LoRA expecations.

def preprocess_function(examples):
    inputs = examples['dialogue']
    targets = examples['summary']
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [68]:
# Data collator for padding
data_collator = DataCollatorForSeq2Seq(tokenizer, model=original_model)

Now, we can design our training. Note that we minimized most of the options here in the sake of computation time. Feel free to play with these parameters to see how performance of the model changes.

In [69]:
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=10,
    logging_steps=10,
    max_steps=100,
    save_steps=10,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=10,
    logging_dir='./logs',
    report_to="none"
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

max_steps is given, it will override any value given in num_train_epochs


In [70]:
# we train our new model
peft_trainer.train()

Step,Training Loss,Validation Loss
10,1.6866,1.302095
20,1.4303,1.228671
30,1.5146,1.197252
40,1.4069,1.194839
50,1.3015,1.178031
60,1.4134,1.172116
70,1.3144,1.165537
80,1.1773,1.159838
90,1.2982,1.159395
100,1.3199,1.159831




TrainOutput(global_step=100, training_loss=1.3863263034820557, metrics={'train_runtime': 270.7186, 'train_samples_per_second': 2.955, 'train_steps_per_second': 0.369, 'total_flos': 409095060406272.0, 'train_loss': 1.3863263034820557, 'epoch': 0.06418485237483953})

In [71]:
# we save the fine-tuned model and its tokenizer for later use
peft_model_path = "./peft-dialogue-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

In [72]:
# Set the device to GPU (or CPU if GPU is not available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Clearing any existing cache (if applicable) in case we run the code a multiple times and RAM is full
if device.type == "cuda":
    torch.cuda.empty_cache()

tokenizer = AutoTokenizer.from_pretrained(peft_model_path)

original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base').to(device)
fine_tuned_model = PeftModel.from_pretrained(original_model, peft_model_path).to(device)


In [73]:
input_ids = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).input_ids.to(device)

output_PEFT = tokenizer.decode(fine_tuned_model.model.generate(input_ids, max_new_tokens=100)[0], skip_special_tokens=True)

Ok now, let's see what are our outputs:

In [74]:
print(f"Input Prompt:\n{prompt}")
print("-----"*20)
print(f"Baseline Human Summary:\n{summary}")
print("-----"*20)
print(f"Base Model Summary - Zero Shot:\n{output}")
print("-----"*20)
print(f"PEFT Model Summary - Zero Shot:\n{output_PEFT}")

Input Prompt:


Summarize the following conversation:
#Person1#: What kind of job do you intend to do?
#Person2#: I want to do some management job since I have three-year's work history.
#Person1#: What are your plans if you were hired?
#Person2#: I would apply my specialty and experience to my job and gradually move up to the management level in this company. 

Summary:

----------------------------------------------------------------------------------------------------
Baseline Human Summary:
#Person2# tells #Person1# #Person2#'s ideal job and the job plan if hired.
----------------------------------------------------------------------------------------------------
Base Model Summary - Zero Shot:
Ask the person to describe their career goals.
----------------------------------------------------------------------------------------------------
PEFT Model Summary - Zero Shot:
#Person2# wants to do some management job because he has three years' work history.


It seems our PEFT model is performing slightly better than the base model thanks to our training. However, compared to the human summary, we could argue that the difference is still obvious.

Let's score our model and compare performances with the base:

In [75]:
rouge = evaluate.load('rouge')

if len(output_PEFT) > len(summary):
    output_PEFT = output_PEFT[:len(summary)]
else:
    summary = summary[:len(output_PEFT)]

PEFT_model_results = rouge.compute(
    predictions=output_PEFT,
    references=summary,
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print("----"*20)
print("PEFT MODEL:")
print(PEFT_model_results)

ORIGINAL MODEL:
{'rouge1': 0.021739130434782608, 'rouge2': 0.0, 'rougeL': 0.021739130434782608, 'rougeLsum': 0.021739130434782608}
--------------------------------------------------------------------------------
PEFT MODEL:
{'rouge1': 0.12162162162162163, 'rouge2': 0.0, 'rougeL': 0.12162162162162163, 'rougeLsum': 0.12162162162162163}


Numbers do the talking here: even if not perfect, our PEFT model is surclassing the base by far. This proves that with good parameters and enough training, PEFT can provide real good improvements and help us to design an LLM better suited to our use case.