# Instruction-Tuning Demo
In this demonstration, we instruction-tune the [Flan-T5 Small model](https://huggingface.co/google/flan-t5-small) on the [SamSum dataset](https://huggingface.co/datasets/Samsung/samsum). Flan-T5, developed by Google, is an instruction-finetuned variant of T5 that handles a range of NLP tasks. The SamSum dataset, provided by Samsung, contains messenger-style conversations paired with human-written summaries, which makes it well suited to training summarization models.

## Importing Libraries

In [1]:
from typing import List, Union

import evaluate
import nltk
import torch
from datasets import DatasetDict, concatenate_datasets, load_dataset
from tqdm import tqdm
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

nltk.download("punkt")


The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[nltk_data] Downloading package punkt to /h/ws_cpun/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Configuring Paths

In [2]:
OUTPUT_DIR = "../../scratch/instruct/"  # main directory for the demo output

In [3]:
MODEL_NAME = "google/flan-t5-small"
DATASET_NAME = "samsum"

## Preprocessing Data

In [4]:
def get_samsum_data(tokenizer: AutoTokenizer) -> tuple[DatasetDict, int, int]:
    """Load SamSum and return it with the max tokenized source and target lengths."""
    # Load dataset from the hub
    dataset = load_dataset(DATASET_NAME)

    print(f"Train dataset size: {len(dataset['train'])}")
    print(f"Test dataset size: {len(dataset['test'])}")

    tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(
        lambda x: tokenizer(x["dialogue"], truncation=True),
        batched=True,
        remove_columns=["dialogue", "summary"],
    )

    max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
    print(f"Max source length: {max_source_length}")

    tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(
        lambda x: tokenizer(x["summary"], truncation=True),
        batched=True,
        remove_columns=["dialogue", "summary"],
    )

    max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
    print(f"Max target length: {max_target_length}")

    return dataset, max_source_length, max_target_length

## Tokenizing Data

In [5]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)



#### Define what the instruction should look like:

In [6]:
INSTRUCTION = "Summarize the following conversation in at most 3 sentences: "

In [7]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = [INSTRUCTION + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(
        inputs, max_length=max_source_length, padding=padding, truncation=True,
    )

    labels = tokenizer(
        text_target=sample["summary"],
        max_length=max_target_length,
        padding=padding,
        truncation=True,
    )

    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset, max_source_length, max_target_length = get_samsum_data(tokenizer)

tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["dialogue", "summary", "id"],
)
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Train dataset size: 14732
Test dataset size: 819
Max source length: 512


Map:   0%|          | 0/15551 [00:00<?, ? examples/s]

Max target length: 95


Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
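
The padding replacement inside `preprocess_function` can be illustrated without a tokenizer. The following is a minimal stand-alone sketch, not the notebook's own code: the helper name `mask_padding` and the toy token ids are hypothetical, and the pad id of 0 is an assumption (it is the value T5 tokenizers use). PyTorch's cross-entropy loss skips targets equal to `-100` by default, so masked positions do not contribute to the training loss.

```python
from typing import List

PAD_TOKEN_ID = 0     # assumption for this sketch; T5 tokenizers use 0 as the pad id
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_batch: List[List[int]], pad_id: int = PAD_TOKEN_ID) -> List[List[int]]:
    """Replace every padding id with IGNORE_INDEX so the loss skips it."""
    return [
        [(tok if tok != pad_id else IGNORE_INDEX) for tok in labels]
        for labels in label_batch
    ]


# Toy, made-up token ids: two label sequences padded to length 5.
padded = [[71, 1712, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_padding(padded)
print(masked)
```

Only the labels are masked this way. The padded `input_ids` keep their pad tokens, and the `attention_mask` tells the encoder which positions to ignore.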


In [11]:
dataset['train']['dialogue'][:3]

["Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great',
 "Tim: Hi, what's up?\r\nKim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating\r\nTim: What did you plan on doing?\r\nKim: Oh you know, uni stuff and unfucking my room\r\nKim: Maybe tomorrow I'll move my ass and do everything\r\nKim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies\r\nTim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores\r\nTim: It really helps\r\nKim: thanks, maybe I'll do that\r\nTim: I also like using post-its in kaban style"]

In [12]:
dataset['train']['summary'][:3]

['Amanda baked cookies and will bring Jerry some tomorrow.',
 'Olivia and Olivier are voting for liberals in this election. ',
 'Kim may try the pomodoro technique recommended by Tim to get more stuff done.']

## Training the Model

In [11]:
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Define training args
training_args = Seq2SeqTrainingArguments(
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="no",
    per_device_eval_batch_size=8,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    output_dir=OUTPUT_DIR,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=0.1,
    logging_dir=f"{OUTPUT_DIR}/logs",
    report_to="none",
)

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8,
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
)
model.config.use_cache = False

trainer_stats = trainer.train()

train_loss = trainer_stats.training_loss
eval_stats = trainer.evaluate()
eval_loss = eval_stats["eval_loss"]
print(f"Training loss:{train_loss}|Val loss:{eval_loss}")

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss
0,2.0201,1.785159


Training loss:2.0200536779455236|Val loss:1.7851594686508179
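
With `num_train_epochs=0.1`, only a small part of the training set is seen. The arithmetic below is a rough sketch of how many optimizer updates that corresponds to, assuming a single device and that `auto_find_batch_size` did not lower the batch size. The rounding mirrors my understanding of how `Trainer` derives its step count, so treat the exact number as approximate; the variable names are illustrative.

```python
import math

train_examples = 14732      # "Train dataset size" printed earlier
per_device_batch_size = 8   # per_device_train_batch_size
grad_accum_steps = 1        # gradient_accumulation_steps
num_train_epochs = 0.1
num_devices = 1             # assumption: training runs on a single device

# Batches needed to pass over the training set once.
batches_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
# Optimizer updates per epoch after gradient accumulation.
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
# Updates performed for a fractional number of epochs.
total_updates = math.ceil(num_train_epochs * updates_per_epoch)
# Upper bound on the training examples consumed during the run.
examples_seen = total_updates * per_device_batch_size * num_devices * grad_accum_steps

print(f"updates per epoch: {updates_per_epoch}")
print(f"total updates: {total_updates}")
print(f"examples seen: {examples_seen} of {train_examples}")
```

Under these assumptions the run covers roughly a tenth of the training data, which is why the notebook finishes quickly and why the reported losses should be read as a demo result rather than a converged model.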


## Inference and Evaluation

#### Define what the instruction should look like (keep it identical to the one used for training):

In [12]:
INSTRUCTION = "Summarize the following conversation in at most 3 sentences: "

In [13]:
def infer_one_sample(model, tokenizer, prompt, max_target_length=50):
    # tokenize the instruction plus dialogue and move it to the model's device
    input_ids = tokenizer(
        INSTRUCTION + prompt, return_tensors="pt", truncation=True,
    ).input_ids.to(model.device)

    with torch.inference_mode():
        outputs = model.generate(
            input_ids=input_ids,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=max_target_length,
        )
        prediction = tokenizer.batch_decode(
            outputs.detach().cpu().numpy(),
            skip_special_tokens=True,
        )[0]

    return prediction
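
The `do_sample=True, top_p=0.9` settings above select nucleus (top-p) sampling: at each step the next token is drawn only from the smallest set of most-probable tokens whose cumulative probability reaches 0.9. The sketch below illustrates that idea on a made-up four-token distribution in pure Python. It is not how `generate` is implemented internally (the library works on logits over the whole vocabulary), and `top_p_filter` and the toy probabilities are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def top_p_filter(probs: Dict[str, float], top_p: float = 0.9) -> Dict[str, float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution, for illustration only.
toy_probs = {"cookies": 0.5, "cake": 0.3, "bread": 0.15, "soup": 0.05}
nucleus = top_p_filter(toy_probs, top_p=0.9)
print(nucleus)

# Draw one token from the renormalized nucleus; a seeded RNG keeps the demo repeatable.
rng = random.Random(0)
sampled = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(sampled)
```

Because sampling is random, the generated summaries, and therefore the ROUGE scores computed from them, can vary from run to run unless a seed is fixed or `do_sample=False` is used.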

In [14]:
metric = evaluate.load("rouge")

dataset = load_dataset(DATASET_NAME)
test_dataset = dataset["test"]

model.eval()

predictions, references = [], []
ctr = 0
for sample in tqdm(test_dataset):
    # For a quick demo on 10 samples, uncomment the following block:
    # if ctr == 10:
    #     break
    # ctr += 1

    prediction = infer_one_sample(model, tokenizer, sample["dialogue"])
    predictions.append(prediction)
    references.append(sample["summary"])

# compute metric
rouge = metric.compute(
    predictions=predictions, references=references, use_stemmer=True,
)

# print results
print(f"rouge1: {rouge['rouge1']* 100:2f}%")
print(f"rouge2: {rouge['rouge2']* 100:2f}%")
print(f"rougeL: {rouge['rougeL']* 100:2f}%")
print(f"rougeLsum: {rouge['rougeLsum']* 100:2f}%")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
100%|██████████| 819/819 [04:38<00:00,  2.94it/s]


rouge1: 39.178121%
rouge2: 14.354108%
rougeL: 30.986775%
rougeLsum: 30.970417%
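
To make the scores above easier to interpret, here is a simplified sketch of what ROUGE-1 measures: the F1 score of unigram overlap between a prediction and its reference. It is a hypothetical helper, not the `evaluate` implementation, which additionally applies its own tokenization and, with `use_stemmer=True`, stemming. The reference is the first SamSum summary shown earlier (lowercased, punctuation removed); the prediction is made up.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """F1 of clipped unigram overlap, using lowercase whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller of the two sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "amanda baked cookies and will bring jerry some tomorrow"
prediction = "amanda baked cookies"  # made-up model output for illustration
score = rouge1_f1(prediction, reference)
print(f"ROUGE-1 F1: {score:.2f}")
```

Here every predicted word appears in the reference (precision 1.0), but only three of the nine reference words are covered (recall 1/3), so a short but correct summary is still penalized. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence.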
