# flan-T5 LLM model

Steps included
1. Load flan-t5 model & dialogue-summarization dataset.
2. Full fine-tune flan-T5 model on GPU
3. Test inference of Base model and Fine-tuned model
4. Evaluate the fine-tuned model with ROUGE and BLEU scores
5. Track the experiment with wandb (Weights & Biases)

In [1]:
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    wandb \
    peft==0.3.0 --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.6 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 15.0.2 which is incompatible.
kaggle-environments 1.14.3 requires transformers>=4.33.1, but you have transformers 4.27.2 which is incompatible.
pathos 0.3.2 requires dill>=0.3.8, but you have dill 0.3.6 which is incompatible.
pathos 0.3.2 requires multiprocess>=0.70.16, but you have multiprocess 0.70.14 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip install evaluate



In [3]:
!pip install git+https://github.com/huggingface/datasets#egg=datasets

Collecting datasets
  Cloning https://github.com/huggingface/datasets to /tmp/pip-install-70ld035t/datasets_0d860a46be254ecfbb6c2c840181ffa3
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/datasets /tmp/pip-install-70ld035t/datasets_0d860a46be254ecfbb6c2c840181ffa3
  Resolved https://github.com/huggingface/datasets to commit ceb25e118f21f54b5b5c5e9c223713f14a798eb5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: datasets
  Building wheel for datasets (pyproject.toml) ... done
  Created wheel for datasets: filename=datasets-2.19.1.dev0-py3-none-any.whl size=517668 sha256=a0bdb8531eaf004d784624ddc47ff37900157eeff2b1e47b9fbafbf21f3031e9
  Stored in directory: /tmp/pip-ephem-wheel-cache-ndsnwe7e/wheels/7f/ba/ce/8f6a52388a9966c7d9afa987113a763f7c105f568f369adbc6
Successfully built datase

In [4]:
!pip install rouge_score



In [5]:
import datasets 
print(datasets.__version__)

2.19.1.dev0


In [6]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, EarlyStoppingCallback
import torch
import time
import pandas as pd
import numpy as np
import wandb

# wandb.login() 


## 1. Load the flan-t5 model and dialogue summarization dataset


In [7]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

# Load the model and tokenizer
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

dtype = next(original_model.parameters()).dtype
print(f"Tensor's dataType -->{dtype}")

#check where the model is loaded (should print either cpu or cuda)
print(f"Model is loaded on -->{next(original_model.parameters()).device}")

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Tensor's dataType -->torch.float32
Model is loaded on -->cpu
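
The download log above lists `model.safetensors` at 990M and the weights are float32, which gives a rough handle on how much GPU memory full fine-tuning needs. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the checkpoint size is close to 4 bytes per parameter, that gradients are kept in float32, and that the optimizer is AdamW with two float32 state tensors per parameter. Activation memory is ignored, and the function name is hypothetical.

```python
def full_finetune_memory_gb(checkpoint_bytes, bytes_per_param=4):
    """Rough GPU memory (GB) for weights + gradients + AdamW states, excluding activations."""
    n_params = checkpoint_bytes / bytes_per_param
    weights = n_params * bytes_per_param      # float32 model weights
    gradients = n_params * bytes_per_param    # one float32 gradient per weight
    optimizer_states = n_params * 2 * 4       # AdamW: first and second moment, float32 each
    return n_params, (weights + gradients + optimizer_states) / 1e9

# 990 MB checkpoint size taken from the download log (assumed to be ~990e6 bytes)
n_params, memory_gb = full_finetune_memory_gb(990e6)
print(f"~{n_params / 1e6:.0f}M parameters, ~{memory_gb:.1f} GB before activations")
```

Under these assumptions the fixed cost is about four times the checkpoint size, which is one reason the training cell later relies on `auto_find_batch_size` and gradient accumulation.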


### 1.3 Redoing the data splits for a balanced train/validation/test split

In [8]:
from datasets import load_dataset, concatenate_datasets, DatasetDict

# Combine the splits (train, test, validation)
combined_dataset = concatenate_datasets([dataset["train"], dataset["test"], dataset["validation"]])

# Shuffle the combined dataset
combined_dataset = combined_dataset.shuffle(seed=42)

# Split the dataset into 80% train, 10% test, 10% validation
train_test_split = combined_dataset.train_test_split(test_size=0.20)  # Splitting 20% for test+validation
test_validation_split = train_test_split['test'].train_test_split(test_size=0.5)  # Splitting the 20% into two equal halves

# Creating the final DatasetDict
final_dataset = DatasetDict({
    'train': train_test_split['train'],
    'test': test_validation_split['test'],
    'validation': test_validation_split['train']
})

test_summaries = final_dataset['test']['summary']
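
To make the 80/10/10 arithmetic concrete, here is a minimal pure-Python sketch of the same idea on indices only. It is an illustration, not what `train_test_split` does internally: the helper name is hypothetical, and the rounding rule (round the held-out size up) is an assumption chosen because it reproduces the split sizes that the original dataset sizes (12460 + 500 + 1500 examples) imply.

```python
import random

def split_indices(n, holdout_pct=20, seed=42):
    """Shuffle range(n) and cut it into train / validation / test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_holdout = (n * holdout_pct + 99) // 100   # integer ceil(n * pct / 100)
    n_test = n_holdout // 2                     # half of the held-out part goes to test
    holdout, train = indices[:n_holdout], indices[n_holdout:]
    return {"train": train, "validation": holdout[n_test:], "test": holdout[:n_test]}

splits = split_indices(12460 + 500 + 1500)      # sizes of the original train/validation/test splits
sizes = {name: len(idx) for name, idx in splits.items()}
print(sizes)  # {'train': 11568, 'validation': 1446, 'test': 1446}
```

Passing a fixed seed to every shuffling step, as this sketch does, keeps the three splits identical across notebook restarts.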

### 1.4 Tokenizing the dataset for training

In [9]:
# Compress the given text to short expressions, and such that you can reconstruct it 
# as close as possible to the original. Unlike the usual text compression, 
# I need you to comply with the 5 conditions below:
    
# 1. You can ONLY remove unimportant words. 
# 2. Do not reorder the original words.
# 3. Do not change the original words.
# 4. Do not use abbreviations or emojis.
# 5. Do not add new words or symbols.

# Compress the origin aggressively by removing words only. 
# Compress the origin as short as you can, while retaining as much information as possible. 
# If you understand, please compress the following text: {text to compress} 
# The compressed text is:

def tokenize_function(examples):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompts = [start_prompt + dialogue + end_prompt for dialogue in examples["dialogue"]]
    model_max_input_length = tokenizer.model_max_length

    # Tokenize the input dialogue text
    tokenized_inputs = tokenizer(prompts, max_length=model_max_input_length, padding="max_length", truncation=True)
    
    # Tokenize the labels for the dialogues
    tokenized_labels = tokenizer(examples["summary"], max_length=model_max_input_length, padding="max_length", truncation=True)

    # We need to replace the labels token ids of padding with -100 so they are not taken into account in the loss computation
    tokenized_labels["input_ids"] = [
        [(label if label != tokenizer.pad_token_id else -100) for label in labels] for labels in tokenized_labels["input_ids"]
    ]

    return {"input_ids": tokenized_inputs["input_ids"], "labels": tokenized_labels["input_ids"]}

tokenized_datasets = final_dataset.map(tokenize_function, batched=True)

# Remove columns which are not necessary for training
columns_to_remove = ['id', 'topic', 'dialogue', 'summary']
tokenized_datasets = tokenized_datasets.remove_columns(columns_to_remove)

Map:   0%|          | 0/11568 [00:00<?, ? examples/s]

Map:   0%|          | 0/1446 [00:00<?, ? examples/s]

Map:   0%|          | 0/1446 [00:00<?, ? examples/s]
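
The label-masking step in `tokenize_function` can be seen in isolation on a toy batch. This is a minimal sketch with made-up token ids. It assumes a pad token id of 0 (the value T5 tokenizers use) instead of reading `tokenizer.pad_token_id`, and relies on -100 being the default index that PyTorch's cross-entropy loss ignores.

```python
PAD_TOKEN_ID = 0      # assumed pad id; the notebook reads it from tokenizer.pad_token_id
IGNORE_INDEX = -100   # positions with this label are skipped by the loss

def mask_padding(batch_label_ids, pad_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX so padded positions add nothing to the loss."""
    return [[tok if tok != pad_id else IGNORE_INDEX for tok in row] for row in batch_label_ids]

# Two hypothetical label sequences, padded to length 5 (ids are illustrative only)
toy_labels = [[71, 1712, 1, 0, 0], [37, 1, 0, 0, 0]]
masked = mask_padding(toy_labels)
print(masked)  # [[71, 1712, 1, -100, -100], [37, 1, -100, -100, -100]]

# Number of positions that still contribute to the loss in each row
print([sum(tok != IGNORE_INDEX for tok in row) for row in masked])  # [3, 2]
```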

## 2. Fully fine-tune the flan-t5 model on the above dataset and track the experiment with wandb

In [10]:
lr_rate = 3e-5
wt_decay = 0.01
early_st_th = 0.009 
early_st_ptnce = 3
steps = 250

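# wandb configuration for experiment tracking
# Note: the two batch-size entries below are only logged to wandb; they are not passed to TrainingArguments.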
config={
    'learning_rate': lr_rate,
    'weight_decay': wt_decay,
    'early_stopping_threshold' : early_st_th,
    'early_stopping_patience':early_st_ptnce,
    'steps':steps,
    'per_device_train_batch_size':32,
    'per_device_eval_batch_size':16,
}

timestamp = str(int(time.time()))

output_dir = f'/notebooks/models/flant5-fullfinetuned-{timestamp}'

# The early stopping callback stops training when the evaluation loss no longer improves significantly.
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=early_st_ptnce, early_stopping_threshold=early_st_th)

training_args = TrainingArguments(
    report_to= "wandb",
    output_dir=output_dir,
    learning_rate=lr_rate,
    auto_find_batch_size=True,
    weight_decay=wt_decay,
    logging_steps=steps,
    eval_steps=steps,
    max_steps=1000,
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end = True,
    gradient_accumulation_steps=2,   
    max_grad_norm=1.0,
    warmup_steps=500, 
)

trainer = Trainer(
    model=original_model.to("cuda:0"),
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[early_stopping_callback]
)
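
With `warmup_steps=500` and `max_steps=1000`, half of the run is spent warming up. Assuming the `Trainer`'s default linear scheduler (the notebook does not set `lr_scheduler_type`), the learning rate ramps from 0 to `lr_rate` over the first 500 steps and then decays linearly back to 0 at step 1000. A minimal sketch of that shape, with a hypothetical helper name:

```python
def linear_schedule_lr(step, base_lr=3e-5, warmup_steps=500, total_steps=1000):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

# Learning rate at the four logging/evaluation points used in this notebook
for step in (250, 500, 750, 1000):
    print(step, linear_schedule_lr(step))
```

Under this assumption the model trains at the full learning rate only around step 500, so a shorter warmup is one setting worth experimenting with.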

In [11]:
run = wandb.init(project='genai-llm', config=config, name=f'flant5-fullfinetune-{timestamp}')
start_time = time.time()
trainer.train()
training_time = time.time() - start_time
run.log({"Training time (seconds)":training_time})
run.log({"Training configuration":config})

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

Step,Training Loss,Validation Loss
250,1.5592,1.167345
500,1.2878,1.111439
750,1.2333,1.090827
1000,1.2327,1.084353
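
The validation losses above can be replayed against the early stopping settings (`early_st_th = 0.009`, `early_st_ptnce = 3`) to see why training ran to `max_steps`. The function below is a simplified re-implementation of a patience rule written for illustration. It is not the `transformers` source, and its name is hypothetical.

```python
def patience_counter(losses, threshold=0.009, patience=3):
    """Count consecutive evaluations whose improvement over the best loss is <= threshold."""
    best = None
    bad_evals = 0
    for loss in losses:
        if best is None or (loss < best and best - loss > threshold):
            bad_evals = 0            # significant improvement: reset the counter
        else:
            bad_evals += 1           # no significant improvement at this evaluation
        if best is None or loss < best:
            best = loss              # track the best loss seen so far
        if bad_evals >= patience:
            return bad_evals, True   # patience exhausted: stop
    return bad_evals, False

val_losses = [1.167345, 1.111439, 1.090827, 1.084353]  # from the table above
bad_evals, stopped = patience_counter(val_losses)
print(bad_evals, stopped)  # 1 False
```

Only the last evaluation improves by less than the threshold (about 0.0065), so under this rule the counter reaches 1 of 3 and early stopping never triggers.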




In [12]:
# save the best model and tokenizer
trainer.save_model(f"{output_dir}/final")
tokenizer.save_pretrained(f"{output_dir}/final")

model_artifact = wandb.Artifact('model_artifact', type='model')
model_artifact.add_dir(f"{output_dir}/final")
run.log_artifact(model_artifact)

wandb: Adding directory to artifact (/notebooks/models/flant5-fullfinetuned-1714568112/final)... Done. 10.7s


<Artifact model_artifact>

## 3. Compare inference from the original and fine-tuned models with a zero-shot prompt

In [13]:
## load the new model and tokenizer
finetuned_model = AutoModelForSeq2SeqLM.from_pretrained(f"{output_dir}/final")
tokenizer2 = AutoTokenizer.from_pretrained(f"{output_dir}/final")

# Reload the base checkpoint: the Trainer updated original_model in place during fine-tuning
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

In [14]:
#let's get inference from original model
example_record = 200
dialogue = dataset['test'][example_record]['dialogue']

print(dialogue)

start_prompt = 'Summarize the following conversation.\n\n'
end_prompt = '\n\nSummary: '
prompt = start_prompt + dialogue + end_prompt


# Named `inputs` to avoid shadowing the built-in input()
inputs = tokenizer(prompt, return_tensors='pt')
output_tokens = original_model.generate(inputs["input_ids"], max_new_tokens=50)
original_model_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Summary-->")
print(original_model_output)

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.
Summary-->
#Person1#: I'm thinking of upgrading my computer.


In [15]:
#lets get inference from finetuned model

# Named `inputs` to avoid shadowing the built-in input()
inputs = tokenizer2(prompt, return_tensors='pt')
output_tokens = finetuned_model.generate(inputs["input_ids"], max_new_tokens=50)
finetuned_model_output = tokenizer2.decode(output_tokens[0], skip_special_tokens=True)

print("#### Human Baseline Summary -->")
print(dataset['test'][example_record]['summary'])
print("#### Summary Generated by original model->")
print(original_model_output)
print("#### Summary Generated by finetuned model->")
print(finetuned_model_output)

#### Human Baseline Summary -->
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
#### Summary Generated by original model->
#Person1#: I'm thinking of upgrading my computer.
#### Summary Generated by finetuned model->
#Person2# wants to upgrade #Person2#'s system and hardware. #Person1# suggests adding a painting program to #Person2#'s software and adding a CD-ROM drive.
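
The zero-shot prompt used for this comparison follows the same template as `tokenize_function`. One way to keep the training and inference prompts from drifting apart is to build them in a single helper. A minimal sketch follows; the name `build_prompt` is hypothetical and not part of the notebook.

```python
START_PROMPT = 'Summarize the following conversation.\n\n'
END_PROMPT = '\n\nSummary: '

def build_prompt(dialogue):
    """Wrap a dialogue in the instruction template used for fine-tuning."""
    return START_PROMPT + dialogue + END_PROMPT

example = build_prompt("#Person1#: Hi.\n#Person2#: Hello.")
print(example)
```

Calling the same helper in the tokenization step and in every inference loop means the model always sees the template it was trained on, including whitespace.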


## 4. Evaluate the fine-tuned model with ROUGE and BLEU scores and compare it with the original model

In [16]:
from tqdm import tqdm

# to save time we will only use 150 items from test split for evaluation
dialogues = final_dataset['test']['dialogue'][:150]
print(len(dialogues))

# Reference summaries (not dialogues) for the same 150 test items
human_baseline_summaries = final_dataset['test']['summary'][:150]
original_model_summaries = []
finetuned_model_summaries = []

# moving both models to gpu for faster inference
original_model.to("cuda:0")
finetuned_model.to("cuda:0")

for dialogue in tqdm(dialogues, desc="Generating summaries from original & finetuned models..."):
    prompt = f"""
    Summarize the following conversation.

    {dialogue}

    Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    finetuned_model_outputs = finetuned_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    finetuned_model_text_output = tokenizer.decode(finetuned_model_outputs[0], skip_special_tokens=True)
    finetuned_model_summaries.append(finetuned_model_text_output)


150


Generating summaries from original & finetuned models...:  10%|█         | 15/150 [00:20<02:17,  1.02s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (1161 > 512). Running this sequence through the model will result in indexing errors
Generating summaries from original & finetuned models...: 100%|██████████| 150/150 [03:24<00:00,  1.36s/it]


### ROUGE Score

In [17]:
import evaluate
rouge = evaluate.load('rouge')
human_baseline_summaries = test_summaries[:150]

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

finetuned_model_results = rouge.compute(
    predictions=finetuned_model_summaries,
    references=human_baseline_summaries[0:len(finetuned_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('Finetuned MODEL:')
print(finetuned_model_results)

run.log({"rouge_score": finetuned_model_results})

ORIGINAL MODEL:
{'rouge1': 0.23329031718657797, 'rouge2': 0.07841616534194039, 'rougeL': 0.1997259600136288, 'rougeLsum': 0.1989589635684337}
Finetuned MODEL:
{'rouge1': 0.45716979195215135, 'rouge2': 0.19513860068362954, 'rougeL': 0.36593641715477704, 'rougeLsum': 0.3658375336340019}
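
For intuition about what `rouge1` measures, the sketch below computes a unigram-overlap F1 between one prediction and one reference. It is a toy version under stated assumptions: whitespace tokenization, lowercasing, no stemming and no aggregation, so its numbers will not match the `evaluate` output above. The function name is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (toy ROUGE-1, whitespace tokens)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score, 3))  # 0.667
```

Here every predicted word appears in the reference (precision 1.0) but only half of the reference is covered (recall 0.5), which is the kind of trade-off a short summary produces.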


### BLEU Score

In [18]:
bleu = evaluate.load("bleu")
    
original_model_results = bleu.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries
)

finetuned_model_results = bleu.compute(
    predictions=finetuned_model_summaries,
    references=human_baseline_summaries,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('Finetuned MODEL:')
print(finetuned_model_results)

run.log({"bleu_score": finetuned_model_results})

run.finish()

ORIGINAL MODEL:
{'bleu': 0.05750065583903749, 'precisions': [0.277600695450594, 0.11780738946093276, 0.060894386298763085, 0.02130492676431425], 'brevity_penalty': 0.7124595327519023, 'length_ratio': 0.7468080502055832, 'translation_length': 3451, 'reference_length': 4621}
Finetuned MODEL:
{'bleu': 0.21238684758776064, 'precisions': [0.4762611275964392, 0.2727966425028615, 0.17085624509033778, 0.09166329421286928], 'brevity_penalty': 1.0, 'length_ratio': 1.1668470028132438, 'translation_length': 5392, 'reference_length': 4621}
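
The fields in the BLEU output fit together in a simple way: the score is the geometric mean of the four n-gram precisions multiplied by a brevity penalty that punishes outputs shorter than the references. The sketch below recombines the reported numbers for the original model under the standard BLEU definition. The helper name is hypothetical, and it recomputes nothing from the text itself.

```python
import math

def bleu_from_parts(precisions, translation_length, reference_length):
    """Combine n-gram precisions and corpus lengths into a BLEU score and brevity penalty."""
    if translation_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_length / translation_length)
    if min(precisions) <= 0:
        return 0.0, brevity_penalty
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty * geo_mean, brevity_penalty

# Values copied from the ORIGINAL MODEL output above
original_bleu, original_bp = bleu_from_parts(
    [0.277600695450594, 0.11780738946093276, 0.060894386298763085, 0.02130492676431425],
    translation_length=3451,
    reference_length=4621,
)
print(original_bleu, original_bp)
```

The original model's summaries are about 25% shorter than the references, so its brevity penalty is roughly 0.71. The fine-tuned model's output is longer than the references (length ratio about 1.17), so its penalty is 1.0.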


Run history:
Training time (seconds),▁
eval/loss,█▃▂▁
eval/runtime,█▅▁▅
eval/samples_per_second,▁▄█▄
eval/steps_per_second,▁▅█▅
train/epoch,▁▁▃▃▆▆███
train/global_step,▁▁▃▃▆▆███████
train/learning_rate,▅█▄▁
train/loss,█▂▁▁
train/total_flos,▁

Run summary:
Training time (seconds),2206.05359
eval/loss,1.08435
eval/runtime,96.9091
eval/samples_per_second,14.921
eval/steps_per_second,0.939
train/epoch,0.69
train/global_step,1000.0
train/learning_rate,0.0
train/loss,1.2327
train/total_flos,5489014937223168.0


### Conclusion
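Full fine-tuning produced markedly better zero-shot summaries. On 150 test dialogues, ROUGE-1 rose from 0.233 to 0.457, ROUGE-L from 0.200 to 0.366, and BLEU from 0.058 to 0.212 compared with the original model. Because the fine-tuned model no longer needs few-shot examples, prompts can stay short and do not have to carry any prior context.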