source: https://www.philschmid.de/fine-tune-flan-t5

In [1]:
import os

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)
from datasets import load_dataset
import evaluate
import nltk
import numpy as np
import pandas as pd
import wandb

nltk.download("punkt", quiet=True)



True

In [2]:
checkpoint = "google/flan-t5-base"
dataset_name = "samsum"

ft_output_dir = os.getenv("HF_FINETUNE_OUTPUT_DIR")
model_name = checkpoint.split("/")[-1]
hub_model_id = f"{model_name}-{dataset_name}"
model_output_dir = os.path.join(ft_output_dir, hub_model_id)

os.environ["WANDB_PROJECT"] = hub_model_id

In [3]:
ds = load_dataset(dataset_name)
ds



DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [4]:
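model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
# Naive model parallelism: spreads the T5 blocks across the visible GPUs.
# Note: parallelize() is deprecated in later transformers releases in favor of device_map.
model.parallelize()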

tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [5]:
example = ds["train"][0]
example


{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

## Max_length analysis
After the miserable OOM experience with `xsum`, this time I investigate [truncation and padding](https://huggingface.co/docs/transformers/main/en/pad_truncation#padding-and-truncation) first.  
Start with statistics on dialogue and summary token lengths.

In [6]:
tk_dialogue = tokenizer(ds["train"]["dialogue"])["input_ids"]
tk_summary = tokenizer(ds["train"]["summary"])["input_ids"]
pd.set_option('display.float_format', lambda x: '%.1f' % x)

df = pd.DataFrame(
    {"dialogue": [len(d) for d in tk_dialogue], "summary": [len(s) for s in tk_summary]}
)
print(df.describe())

       dialogue  summary
count   14732.0  14732.0
mean      149.0     28.9
std       110.7     15.1
min         1.0      2.0
25%        66.0     17.0
50%       120.0     26.0
75%       202.0     37.0
max      1153.0     94.0


My hunch is that I shouldn't truncate the input and only need to pad to the longest sequence in each batch.  
The setting would be `tokenizer(batch_sentences, padding=True)`.  

However, it seems that [truncation is inevitable in production](https://twitter.com/RamaswmySridhar/status/1621870502766858241). How to truncate wisely?

### Padding experiments

In [7]:
tk_dialogue = tokenizer(ds["train"]["dialogue"], padding=True)["input_ids"]
tk_summary = tokenizer(ds["train"]["summary"], padding=True)["input_ids"]
pd.set_option('display.float_format', lambda x: '%.1f' % x)

df = pd.DataFrame(
    {"dialogue": [len(d) for d in tk_dialogue], "summary": [len(s) for s in tk_summary]}
)
print(df.describe())

       dialogue  summary
count   14732.0  14732.0
mean     1153.0     94.0
std         0.0      0.0
min      1153.0     94.0
25%      1153.0     94.0
50%      1153.0     94.0
75%      1153.0     94.0
max      1153.0     94.0


As expected: the tokenizer saw the whole corpus as a single batch, so every sequence is padded to the corpus-wide maximum length.  
Let's try this idea with `batch_size = 8`. 
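Before running that, the cost of corpus-wide padding can be estimated from the `describe()` output of the first statistics cell. This is only a back-of-the-envelope sketch: `pad_fraction` is a hypothetical helper, and the mean of 149.0 is the rounded value printed above, so the result is approximate.

```python
def pad_fraction(mean_len: float, padded_len: int) -> float:
    """Share of token positions that are padding when every row has width padded_len."""
    return 1.0 - mean_len / padded_len


# Dialogue statistics copied from the describe() output (mean is rounded there).
mean_dialogue = 149.0
max_dialogue = 1153

corpus_wide = pad_fraction(mean_dialogue, max_dialogue)
print(f"padding share at width {max_dialogue}: {corpus_wide:.1%}")
```

Under these numbers roughly 87% of the dialogue positions would be padding, which is the motivation for padding per batch instead.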

In [8]:
from torch.utils.data import DataLoader

collator = DataCollatorForSeq2Seq(tokenizer, padding=True)
dl = DataLoader(ds['train'].with_transform(lambda x: tokenizer(x['dialogue'])), batch_size=8, collate_fn=collator)


tk_batched = np.array([batch['input_ids'].shape[-1] for batch in dl])

print(len(tk_batched), len(dl))
print(len(np.unique(tk_batched)))

np.unique(tk_batched).max(), np.unique(tk_batched).mean(), np.unique(tk_batched).min()

# 1842 batches with 482 unique lengths: would be brutal for JAX jit
# (each new input shape triggers a recompilation) LoL
# try pad_to_multiple_of=8


1842 1842
482


(1153, 389.02904564315355, 92)
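What the collator does to the batch width can be mimicked without any tokenizer. The sketch below is a toy model of dynamic padding, not the collator's actual code: `batch_widths` is a hypothetical helper and `toy_lengths` are made-up token counts, not taken from samsum.

```python
def batch_widths(lengths, batch_size):
    """Padded width of each batch when padding to the longest row in that batch."""
    return [
        max(lengths[i : i + batch_size])
        for i in range(0, len(lengths), batch_size)
    ]


# Hypothetical token lengths, in dataset order (no shuffling).
toy_lengths = [12, 40, 33, 7, 150, 9, 21, 18, 64, 5]

widths = batch_widths(toy_lengths, batch_size=4)
print(widths)            # one padded width per batch
print(len(set(widths)))  # number of distinct input shapes
```

Each distinct width is a distinct input shape, which is what the `np.unique` count in the cell above measures on the real data.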

In [9]:
collator = DataCollatorForSeq2Seq(tokenizer, padding=True, pad_to_multiple_of=8)
dl = DataLoader(ds['train'].with_transform(lambda x: tokenizer(x['dialogue'])), batch_size=8, collate_fn=collator)


tk_batched = np.array([batch['input_ids'].shape[-1] for batch in dl])

print(len(tk_batched), len(dl))
print(len(np.unique(tk_batched)))

np.unique(tk_batched).max(), np.unique(tk_batched).mean(), np.unique(tk_batched).min()

# 1842 batches with 91 unique lengths, much better.
# Does truncation=True change anything here?
# According to the docs, tokenizer(batch_sentences, padding=True, truncation=True)
# pads the same way as tokenizer(batch_sentences, padding=True):
# both pad to the longest sequence in the batch. Only the truncation differs.

1842 1842
91


(1160, 485.27472527472526, 96)
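The shift from `(1153, ..., 92)` to `(1160, ..., 96)` is what rounding up to the next multiple of 8 would produce. A minimal sketch of that arithmetic follows; `round_up` is a hypothetical helper, and the range of widths is only an upper bound on what could occur, not the observed distribution.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


# The observed min and max widths before and after pad_to_multiple_of=8.
assert round_up(92, 8) == 96
assert round_up(1153, 8) == 1160

# Upper bound on distinct shapes: every width between the observed min and max.
raw_widths = range(92, 1154)
bucketed = {round_up(w, 8) for w in raw_widths}
print(len(raw_widths), len(bucketed))
```

Bucketing caps the number of possible shapes at about one eighth of the raw range, at the cost of up to 7 extra pad tokens per row.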

In [10]:
collator = DataCollatorForSeq2Seq(tokenizer, padding=True, pad_to_multiple_of=8)
dl = DataLoader(ds['train'].with_transform(lambda x: tokenizer(x['dialogue'], truncation=True)), batch_size=8, collate_fn=collator)


tk_batched = np.array([batch['input_ids'].shape[-1] for batch in dl])

print(len(tk_batched), len(dl))
print(len(np.unique(tk_batched)))

np.unique(tk_batched).max(), np.unique(tk_batched).mean(), np.unique(tk_batched).min()

1842 1842
51


(512, 311.52941176470586, 96)

- `truncation=True` truncates the dialogue to 512 tokens, which is the tokenizer's default max length for T5. 
- But as discussed in the README, T5 by default should not have a fixed maximum length.
- This is an artificial limitation imposed by the transformers library. 
- Truncation loses input here. Use with caution.
- In the xsum ipynb, I set `truncation=True` in the tokenizer, which cuts the input to 512 if no other `max_length` is set. That's why it solved the OOM problem for me, at the cost of losing information.
- Should experiment with truncation settings to observe CUDA memory vs. performance.
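One way to make "use with caution" measurable is to count how much input a given `max_length` would drop. The following is a sketch under assumptions: `truncation_report` is a hypothetical helper and `toy_lengths` are invented token counts; on the real data the lengths would come from the unpadded tokenization in the first statistics cell.

```python
def truncation_report(lengths, max_length):
    """Summarize how many rows and tokens a hard cut at max_length would drop."""
    cut = [n for n in lengths if n > max_length]
    tokens_lost = sum(n - max_length for n in cut)
    return {
        "rows_truncated": len(cut),
        "share_truncated": len(cut) / len(lengths),
        "tokens_lost": tokens_lost,
        "share_tokens_lost": tokens_lost / sum(lengths),
    }


# Hypothetical token lengths, not taken from samsum.
toy_lengths = [120, 66, 202, 700, 1153, 90]

report = truncation_report(toy_lengths, max_length=512)
for key, value in report.items():
    print(f"{key}: {value}")
```

Sweeping `max_length` over a few candidate values gives a quick picture of the information lost, to set against the CUDA memory saved.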

### Compare to source ipynb
In [source ipynb](https://www.philschmid.de/fine-tune-flan-t5): 
```python
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])

def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
    pass
```
1. It pads every input to the absolute corpus max length, which would waste tons of memory and computation.  
I would definitely experiment with `pad_to_multiple_of` under JAX jit to find a better balance. 
2. I use `flan-t5`, which is the heir of the LM-adapted T5, so prepending `summarize:` to the input is neither useful nor necessary. 

## Move on to training

In [11]:
# No truncation: the longest dialogue in the training set is only 1153 tokens, which should be fine.
def preprocess(examples):
    output = tokenizer(examples["dialogue"])
    output["labels"] = tokenizer(examples["summary"])["input_ids"]
    return output

In [12]:
tk_ds = ds.map(preprocess, batched=True).remove_columns(ds['train'].column_names)
tk_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 818
    })
})

In [13]:
rouge = evaluate.load('rouge')


In [14]:
def compute_metrics(eval_preds):
    preds, labels = eval_preds

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [
        "\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds
    ]
    decoded_labels = [
        "\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels
    ]

    result = rouge.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    return result
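The `np.where` line in `compute_metrics` exists because the seq2seq collator pads labels with `-100` by default so the loss ignores them, and `-100` is not a valid token id to decode. Below is a plain-Python equivalent of that one step, as a sketch: `unmask` is a hypothetical helper and the token ids are made up, apart from `0`, which is T5's pad token id.

```python
IGNORE_INDEX = -100  # default label padding value of the seq2seq collator


def unmask(label_rows, pad_token_id):
    """Replace ignored label positions with the pad id so rows can be decoded."""
    return [
        [tok if tok != IGNORE_INDEX else pad_token_id for tok in row]
        for row in label_rows
    ]


# Two hypothetical label rows; the first was padded by the collator.
toy_labels = [[71, 5, 1, -100, -100], [9, 8, 7, 6, 1]]

restored = unmask(toy_labels, pad_token_id=0)
print(restored)
```

After this step, `skip_special_tokens=True` in `batch_decode` drops the pad tokens again, so the decoded references contain only the summary text.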

In [15]:
collator = DataCollatorForSeq2Seq(tokenizer, padding=True, pad_to_multiple_of=8)

# Truncation can only be done in the tokenizer; the padding settings go back to the collator,
# which makes sense since the collator is where batching happens.
# In xsum I tried to push all these settings into the tokenizer; this is a better balance.

In [16]:
args = Seq2SeqTrainingArguments(
    output_dir=model_output_dir,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=1,
    bf16=True,
    gradient_accumulation_steps=4,
    predict_with_generate=True,
    save_strategy="epoch",
    load_best_model_at_end=True,
    hub_model_id=hub_model_id,
    report_to="wandb",
)
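With these arguments, the effective batch size and the number of optimizer steps can be worked out ahead of time. The sketch below follows my understanding of how the Trainer counts steps (batches per epoch, integer-divided by the accumulation steps). It assumes a single data-parallel process, since `model.parallelize()` splits layers across GPUs rather than replicating the model; `data_parallel_processes` is an illustrative name, not a Trainer argument.

```python
import math

num_examples = 14732              # size of the train split
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 1
data_parallel_processes = 1       # assumption: no data parallelism

# Samples consumed per optimizer step.
effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * data_parallel_processes
)

# Batches per epoch, then optimizer steps per epoch after accumulation.
batches_per_epoch = math.ceil(
    num_examples / (per_device_train_batch_size * data_parallel_processes)
)
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch, batches_per_epoch, total_steps)
```

These values agree with the 1842 batches counted in the padding experiments and with the batch size of 32 and 460 optimization steps reported in the training log.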

In [17]:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tk_ds["train"],
    eval_dataset=tk_ds["validation"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [18]:
trainer.train()

***** Running training *****
  Num examples = 14732
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 460
  Number of trainable parameters = 247577856
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: [34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc




In [None]:
wandb.finish()


| metric | value |
| --- | --- |
| eval/loss | 1.39589 |
| eval/rouge1 | 0.47813 |
| eval/rouge2 | 0.24603 |
| eval/rougeL | 0.40496 |
| eval/rougeLsum | 0.44394 |
| eval/runtime | 30.6889 |
| eval/samples_per_second | 26.655 |
| eval/steps_per_second | 3.356 |
| train/epoch | 1.0 |
| train/global_step | 460.0 |


In [None]:
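total_flos = trainer.state.total_flos  # Trainer's estimate of total floating-point operations
# train_runtime (seconds) lives in the final training summary entry of the log history
runtime = next(e["train_runtime"] for e in trainer.state.log_history if "train_runtime" in e)

# achieved throughput, in units of 1e12 floating-point operations per second
print(f"GPU utilization: {total_flos / 1e12 / runtime:.2f} TFLOPS")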

GPU utilization: 20.38 TFLOPS


## Observation
- `rouge-1` matches the source ipynb.
- Skipping truncation works on this dataset.  
- Maybe `xsum` has such long outlier sequences that truncation becomes necessary. Otherwise, an outlier batch would cause OOM, padding or not. 

### About TFLOPS
- `model.parallelize()`
  - `20.43` TFLOPS.
    - My current profiled max on the workstation is ~35 TFLOPS per GPU, ~70 TFLOPS total.
    - Achieved with `megatron` + `nvlink`. Without `nvlink` the number is around 62. 
  - GPU1: 16.6G, GPU2: 14.9G
- No `model.parallelize()`
  - `16.66` TFLOPS.
  - GPU1: 22.27G, GPU2: 21.93G
  - Why...?
- `pad_to_multiple_of=64` -> `19.72` TFLOPS
  - I'm not ready to innovate on [dark magic](https://twitter.com/karpathy/status/1621578354024677377) yet LoL. 
- No `pad_to_multiple_of=8` -> `20.38` TFLOPS
  - I don't need to do this religiously. It makes no difference in this case.
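To put the runs in this list on one scale, the achieved throughput can be expressed as a share of the profiled peak. This is a rough sketch: `achieved_tflops` and `share_of_peak` are hypothetical helpers, the ~70 TFLOPS peak is the two-GPU figure quoted above, and whether that peak is the right baseline for every run is an assumption.

```python
def achieved_tflops(total_flos: float, runtime_s: float) -> float:
    """Throughput in units of 1e12 floating-point operations per second."""
    return total_flos / 1e12 / runtime_s


def share_of_peak(achieved: float, peak: float) -> float:
    """Achieved throughput as a fraction of a profiled peak."""
    return achieved / peak


# TFLOPS figures copied from the list above.
runs = {
    "parallelize": 20.43,
    "no parallelize": 16.66,
    "pad_to_multiple_of=64": 19.72,
    "no pad_to_multiple_of": 20.38,
}
PEAK_TFLOPS = 70.0  # assumption: the ~70 TFLOPS two-GPU profile applies to all runs

shares = {name: share_of_peak(value, PEAK_TFLOPS) for name, value in runs.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of profiled peak")
```

Read this way, all four runs sit well below the profiled peak, and the differences between the padding settings are small next to that gap.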

ps: previous GPU profile: 
![](asset/tflop.png)