# Finetune Flan-UL2 for Summarization

This notebook fine-tunes Flan-UL2 with LoRA and int8 quantization. It needs a GPU with 40GB of memory.
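As a rough sanity check on the 40GB figure, the sketch below estimates the memory needed just to hold the weights at different precisions. It uses the total parameter count that `model.print_trainable_parameters()` reports later in this notebook (which includes the small LoRA adapter). The helper name `weight_memory_gb` is made up for illustration, and the estimate ignores activations, gradients, optimizer state, and the layers that are kept in `float32`.

```python
# Back-of-the-envelope weight memory for Flan-UL2 at different precisions.
# TOTAL_PARAMS is the "all params" value printed later in this notebook.
TOTAL_PARAMS = 19_484_779_520

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory needed to store `num_params` weights in `dtype`, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(TOTAL_PARAMS, dtype):6.1f} GB")
```

Under these assumptions only the int8 weights leave meaningful headroom on a 40GB card, which is why the model is loaded with `load_in_8bit=True` below.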

In [1]:
# On Lambda Labs, bitsandbytes could not find the CUDA runtime, so point LD_LIBRARY_PATH at it
import os
os.environ['LD_LIBRARY_PATH'] = '/usr/lib/x86_64-linux-gnu/'
os.getenv('LD_LIBRARY_PATH')

'/usr/lib/x86_64-linux-gnu/'

In [2]:
# install Hugging Face libraries
!pip install -Uqq git+https://github.com/huggingface/peft.git
!pip install -Uqq "transformers==4.27.1" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" "bitsandbytes==0.37.1" loralib
# install additional dependencies needed for training and evaluation
!pip install -Uqq wandb
!pip install -Uqq rouge-score py7zr

In [3]:
!pip install -Uqq protobuf==3.20

__If you only want to run inference, jump down to the ["Evaluate"](#evaluate) section and start running from there; it downloads my adapter weights from the Hugging Face Hub so you can try the model out.__

## Training

In [2]:
import wandb
wandb.login()

wandb: Currently logged in as: johnrobinsn. Use `wandb login --relogin` to force relogin


True

In [3]:
wandb.init(project="flanul2-lora-int8-summarization",config={"dataset":"samsum"})

## Dataset

The [samsum](https://huggingface.co/datasets/samsum) dataset is a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English.

```python
{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}
```
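To get a feel for the record format, here is a small sketch that splits the `dialogue` field of the sample above into speaker turns. The helper `split_turns` is hypothetical and assumes every line has the form `Speaker: text`, which holds for this sample but is not guaranteed for every record in the dataset.

```python
from typing import List, Tuple

# The sample record shown above.
sample = {
    "id": "13818513",
    "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
    "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
}


def split_turns(dialogue: str) -> List[Tuple[str, str]]:
    """Split a dialogue string into (speaker, utterance) pairs, one per line."""
    turns = []
    for line in dialogue.splitlines():  # handles the "\r\n" separators
        if not line.strip():
            continue
        # Split on the first ": " only, so later colons (e.g. ":-)") stay in the text.
        speaker, _, utterance = line.partition(": ")
        turns.append((speaker, utterance))
    return turns


for speaker, utterance in split_turns(sample["dialogue"]):
    print(f"{speaker!r} says {utterance!r}")
```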

To load the `samsum` dataset, we use the **`load_dataset()`** function from the Hugging Face Datasets library.

In [4]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("samsum")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")


Found cached dataset samsum (/home/ubuntu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)


  0%|          | 0/3 [00:00<?, ?it/s]

Train dataset size: 14732
Test dataset size: 819


To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out **[chapter 6](https://huggingface.co/course/chapter6/1?fw=tf)** of the Hugging Face Course.

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-ul2"
print(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)

google/flan-ul2


Before we can start training, we need to preprocess our data. Abstractive summarization is a text-generation task: the model takes a text as input and generates a summary as output. To batch the data efficiently, we first need to know how long the tokenized inputs and outputs are.

In [6]:
from datasets import concatenate_datasets
import numpy as np
# Tokenize every dialogue (train + test) to measure the input lengths.
# Sequences longer than the chosen maximum will be truncated, shorter ones will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# use the 85th percentile of the input lengths as the maximum, for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# Tokenize every summary (train + test) to measure the target lengths.
# Sequences longer than the chosen maximum will be truncated, shorter ones will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use the 90th percentile of the target lengths as the maximum, for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-05bc169e6ed2593a.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-b668a39b31304fdc.arrow


Max source length: 255
Max target length: 50
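The trade-off behind the percentile cutoff can be seen on a toy example. The sketch below uses synthetic lengths (1 to 100, purely hypothetical, not the real token counts) to show that an 85th-percentile maximum truncates roughly the longest 15% of sequences while shortening the padded length of every example.

```python
import numpy as np

# Synthetic token lengths standing in for the tokenized dialogues (hypothetical data).
lengths = np.arange(1, 101)  # 1, 2, ..., 100

# Same recipe as above: use a percentile of the lengths instead of the maximum.
max_length = int(np.percentile(lengths, 85))

# Fraction of sequences that would be truncated at this cutoff.
truncated_fraction = float(np.mean(lengths > max_length))

# Relative reduction in padded length compared with padding to the longest sequence.
padding_saved = 1 - max_length / lengths.max()

print(f"max_length={max_length}, truncated={truncated_fraction:.0%}, shorter by {padding_saved:.0%}")
```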


We preprocess our dataset before training and save it to disk. You could also run this step on your local machine or a CPU-only instance and upload the result to the [Hugging Face Hub](https://huggingface.co/docs/hub/datasets-overview).

In [7]:
def preprocess_function(sample,padding="max_length"):
    # add the "summarize: " task prefix to each input (T5-style prompt)
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # When padding to max_length, replace every tokenizer.pad_token_id in the labels with -100
    # so that padding positions are ignored by the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-b037665e873647b2.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-e515606fde108167.arrow


Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


Saving the dataset (0/1 shards):   0%|          | 0/14732 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/819 [00:00<?, ? examples/s]
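The label masking inside `preprocess_function` is easy to check in isolation. Below is a minimal sketch with made-up token ids; it assumes a pad token id of 0 (the T5-style convention) and uses -100, the default `ignore_index` of PyTorch's cross-entropy loss. The helper names `mask_padding` and `unmask_padding` are hypothetical. The reverse step is what the evaluation code does later before decoding the labels.

```python
PAD_TOKEN_ID = 0     # assumption: T5-style tokenizers use 0 as the pad id
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace pad ids with ignore_index so padding does not contribute to the loss."""
    return [[tok if tok != pad_token_id else ignore_index for tok in seq] for seq in label_ids]


def unmask_padding(label_ids, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Reverse step, needed before decoding because -100 is not a valid token id."""
    return [[tok if tok != ignore_index else pad_token_id for tok in seq] for seq in label_ids]


# Two made-up label sequences, already padded to length 5.
labels = [[37, 52, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_padding(labels)
print(masked)
print(unmask_padding(masked) == labels)
```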

## Finetune Flan-UL2 with LoRA and int8

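We load Flan-UL2 in 8-bit with `load_in_8bit=True` and let `device_map="auto"` place the weights on the available devices. The checkpoint is stored in shards, which helps conserve memory while the model is loading.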

In [8]:
from transformers import AutoModelForSeq2SeqLM

model_id = "google/flan-ul2"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")




Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /usr/lib/x86_64-linux-gnu/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

Now, we can prepare our model for the LoRA int-8 training using `peft`.

In [9]:
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# Define LoRA Config 
lora_config = LoraConfig(
 r=16, 
 lora_alpha=32,
 target_modules=["q", "v"],
 lora_dropout=0.05,
 bias="none",
 task_type=TaskType.SEQ_2_SEQ_LM
)
# prepare int-8 model for training
model = prepare_model_for_int8_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 25165824 || all params: 19484779520 || trainable%: 0.12915631903439675


As you can see, we are only training about 0.13% of the model's parameters! This large memory saving lets us fine-tune the model without running out of memory.
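The trainable parameter count can be reproduced by hand. Each adapted linear layer gets two low-rank matrices, `A` with shape `r x d_in` and `B` with shape `d_out x r`, so it adds `r * (d_in + d_out)` parameters. The architecture values below (hidden size, head count, layer counts) are my assumptions about the Flan-UL2 config and are not stated anywhere in this notebook; treat the calculation as a sketch.

```python
# Assumed Flan-UL2 architecture values (not taken from this notebook).
D_MODEL = 4096           # hidden size
INNER_DIM = 16 * 256     # num_heads * d_kv, the output size of the q/v projections
ENCODER_LAYERS = 32
DECODER_LAYERS = 32

R = 16                   # LoRA rank, from lora_config above

# Modules named "q" and "v" appear in every attention block:
# one self-attention per encoder layer, self-attention plus cross-attention per decoder layer.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * 2  # q and v in each block

# LoRA adds A (r x d_in) and B (d_out x r) to each adapted module.
params_per_module = R * (D_MODEL + INNER_DIM)
lora_params = adapted_modules * params_per_module

ALL_PARAMS = 19_484_779_520  # "all params" reported above
trainable_pct = 100 * lora_params / ALL_PARAMS

print(f"LoRA params: {lora_params}, trainable: {trainable_pct:.4f}%")
```

With these assumed values the count comes out to the same 25165824 trainable parameters that `print_trainable_parameters()` reports.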

The next step is to create a `DataCollator` that will take care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.

In [10]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
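To illustrate what `pad_to_multiple_of=8` asks for, here is a simplified sketch of padding a batch up to the next multiple of 8. This is not the library's implementation; `round_up` and `pad_batch` are hypothetical helpers, and the token ids are made up.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_batch(sequences, pad_value, multiple=8):
    """Right-pad every sequence to the batch maximum, rounded up to `multiple`."""
    target = round_up(max(len(seq) for seq in sequences), multiple)
    return [seq + [pad_value] * (target - len(seq)) for seq in sequences]


# Labels are padded with -100 so the padding is ignored by the loss.
label_batch = [[37, 52, 1], [8, 14, 9, 27, 3, 66, 12, 5, 1]]
padded = pad_batch(label_batch, pad_value=-100)
print([len(seq) for seq in padded])
```

With the lengths used in this notebook, `round_up(255, 8)` gives 256 and `round_up(50, 8)` gives 56. Tensor dimensions that are multiples of 8 tend to map better onto GPU tensor cores.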

The last step is to define the hyperparameters (`TrainingArguments`) we want to use for our training.

In [11]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir="lora-flan-ul2"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="wandb",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # disable the KV cache during training to silence warnings; re-enable it for inference

Let's now train our model. Note that for T5 and UL2, some layers are kept in `float32` for stability purposes.

In [12]:
# train model
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 2.8547 |
| 1000 | 1.78   |
| 1500 | 1.1425 |
| 2000 | 1.1135 |
| 2500 | 1.0927 |
| 3000 | 1.108  |
| 3500 | 1.134  |
| 4000 | 1.1072 |
| 4500 | 1.0751 |
| 5000 | 1.0954 |


TrainOutput(global_step=9210, training_loss=1.1604953231563009, metrics={'train_runtime': 51923.1321, 'train_samples_per_second': 1.419, 'train_steps_per_second': 0.177, 'total_flos': 2.189653096613806e+18, 'train_loss': 1.1604953231563009, 'epoch': 5.0})
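The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below infers the batch size that `auto_find_batch_size=True` settled on from the step count, under the assumption of a single GPU, no gradient accumulation, and a constant batch size for the whole run; it also recomputes the throughput figures.

```python
import math

TRAIN_SAMPLES = 14_732      # train split size printed earlier
EPOCHS = 5
GLOBAL_STEPS = 9_210        # from TrainOutput
RUNTIME_S = 51_923.1321     # train_runtime from TrainOutput

steps_per_epoch = GLOBAL_STEPS // EPOCHS

# Batch sizes b for which ceil(N / b) matches the observed steps per epoch
# (assumes one GPU, no gradient accumulation, constant batch size).
candidates = [b for b in range(1, 65) if math.ceil(TRAIN_SAMPLES / b) == steps_per_epoch]

samples_per_second = TRAIN_SAMPLES * EPOCHS / RUNTIME_S
steps_per_second = GLOBAL_STEPS / RUNTIME_S
runtime_hours = RUNTIME_S / 3600

print(f"steps/epoch: {steps_per_epoch}, batch size candidates: {candidates}")
print(f"{samples_per_second:.3f} samples/s, {steps_per_second:.3f} steps/s, {runtime_hours:.1f} h")
```

Under those assumptions the run used a batch size of 8 and took a bit over 14 hours.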

In [14]:
# Save the LoRA adapter weights and the tokenizer
peft_model_id="flanul2-lora-samsum"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# to also save the base model, call:
# trainer.model.base_model.save_pretrained(peft_model_id)

('flanul2-lora-samsum/tokenizer_config.json',
 'flanul2-lora-samsum/special_tokens_map.json',
 'flanul2-lora-samsum/tokenizer.json')

In [16]:
!pip install huggingface_hub

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [21]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

Detected operating system as Ubuntu/focal.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Detected apt version as 2.0.9
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... Packagecloud gpg key imported to /etc/apt/keyrings/github_git-lfs-archive-keyring.gpg
done.
Running apt-get update... done.

The repository is setup! You can now install packages.


In [25]:
!git-lfs install

/usr/bin/sh: 1: git-lfs: not found


In [17]:
import huggingface_hub

In [18]:
huggingface_hub.login()


In [32]:
repo_id = 'johnrobinsn/flanul2-lora-samsum'
trainer.model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)

CommitInfo(commit_url='https://huggingface.co/johnrobinsn/flanul2-lora-samsum/commit/4d27fa962a2079303107d7a36177cd61c09bb794', commit_message='Upload tokenizer', commit_description='', oid='4d27fa962a2079303107d7a36177cd61c09bb794', pr_url=None, pr_revision=None, pr_num=None)

In [30]:
trainer.model.push_to_hub()

TypeError: push_to_hub() missing 1 required positional argument: 'repo_id'

## Evaluate <a id="evaluate"></a>

In [4]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc. 
peft_model_id = "johnrobinsn/flanul2-lora-samsum"
config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,  load_in_8bit=True,  device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model loaded")




Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /usr/lib/x86_64-linux-gnu/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

Peft model loaded


### Try it out
Let's load the dataset again and pick a random test sample to try out the summarization.

In [6]:
from datasets import load_dataset 
from random import randrange


# Load dataset from the hub and get a sample
dataset = load_dataset("samsum")
sample = dataset['test'][randrange(len(dataset["test"]))]

# use the same "summarize: " prefix that was added to the inputs during training
input_ids = tokenizer("summarize: " + sample["dialogue"], return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=30, do_sample=True, top_p=0.9)
print(f"input sentence: {sample['dialogue']}\n{'---'* 20}")

print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")

Found cached dataset samsum (/home/ubuntu/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)


  0%|          | 0/3 [00:00<?, ?it/s]

input sentence: Sawyer: ball game 2nite?
Sutter: injured
Wheeler: what time?
Sawyer: 5.30? 6?
Wheeler: anytime l8r?
Sawyer: pitch booked at 8 so we need to start earlier
Wheeler: so im out sry
Yardley: i guess next time be better
------------------------------------------------------------
summary:
Wheeler won't play ball with Sawyer and Yardley tonight as the pitch is booked at 8 so it has to start earlier.
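The `do_sample=True, top_p=0.9` arguments mean the summary is sampled with nucleus (top-p) sampling, so repeated runs can give different summaries. The sketch below shows the idea on a toy next-token distribution; it is a simplified illustration in NumPy, not the code that `generate` runs, and `top_p_filter` is a hypothetical helper.

```python
import numpy as np


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose total probability
    reaches top_p, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)                 # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    keep = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:keep]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept] / probs[kept].sum()
    return filtered


# Toy distribution over a 4-token vocabulary (made-up numbers).
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, top_p=0.9)

rng = np.random.default_rng(0)
next_token = int(rng.choice(len(probs), p=filtered))
print(filtered, next_token)
```

Here the least likely token is dropped because the first three already cover at least 90% of the probability mass.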


### Evaluate against ROUGE score

In [5]:
import evaluate
import numpy as np
from datasets import load_from_disk
from tqdm import tqdm

# Metric
metric = evaluate.load("rouge")

def evaluate_peft_model(sample,max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)    
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # return the decoded prediction and the decoded reference summary
    return prediction, labels

# load test dataset from disk
test_dataset = load_from_disk("data/eval/").with_format("torch")

# run predictions
# this can take ~45 minutes
predictions, references = [] , []
for sample in tqdm(test_dataset):
    p,l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)

# compute metric 
rouge = metric.compute(predictions=predictions, references=references, use_stemmer=True)

# print results 
print(f"rouge1: {rouge['rouge1']* 100:2f}%")
print(f"rouge2: {rouge['rouge2']* 100:2f}%")
print(f"rougeL: {rouge['rougeL']* 100:2f}%")
print(f"rougeLsum: {rouge['rougeLsum']* 100:2f}%")

# rouge1: 50.386161%
# rouge2: 24.842412%
# rougeL: 41.370130%
# rougeLsum: 41.394230%

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

100%|██████████| 819/819 [1:06:38<00:00,  4.88s/it]


rouge1: 49.848357%
rouge2: 24.266460%
rougeL: 40.711705%
rougeLsum: 40.706680%
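For intuition about what these scores measure, here is a stripped-down sketch of ROUGE-1 as a unigram-overlap F1 score. It is a simplification and will not match the `evaluate` metric exactly: it lowercases and splits on word characters only, with no stemming. The function name `rouge1_f1` and the example prediction are made up; the reference is the sample summary shown in the Dataset section.

```python
import re
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Amanda baked cookies and will bring Jerry some tomorrow."
prediction = "Amanda baked cookies for Jerry."  # made-up model output
score = rouge1_f1(prediction, reference)
print(f"simplified rouge1 F1: {score:.4f}")
```

Because generation uses sampling, the scores will vary slightly from run to run.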
