Model Fine-Tuning: LongT5 Local
----



# Introduction

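The conventional attention mechanism of transformer-style language models is poorly suited to long input or output sequences: every token attends to every other token, so time and memory requirements scale quadratically with the number of tokens. LongT5 (Guo et al., 2022) replaces full attention in the encoder with local or transient-global attention to reduce that cost. This notebook fine-tunes a LongT5 checkpoint to generate abstracts from full texts.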
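To make the scaling concrete, the sketch below counts query-key scores for a single attention head in a single layer. It is only back-of-the-envelope arithmetic, not LongT5's implementation; the function names are made up for this note and the radius of 127 is an assumed value chosen for illustration.

```python
def full_attention_scores(n_tokens: int) -> int:
    """Query-key scores when every token attends to every token."""
    return n_tokens * n_tokens


def local_attention_scores(n_tokens: int, radius: int = 127) -> int:
    """Upper bound on scores when each token attends only to neighbours
    within `radius` positions on either side (radius=127 is an assumption)."""
    window = 2 * radius + 1
    return n_tokens * min(window, n_tokens)


for n in (512, 4096, 16384):
    full = full_attention_scores(n)
    local = local_attention_scores(n)
    print(f"{n:>6} tokens: full={full:>12,}  local<={local:>10,}  ratio={full / local:.1f}")
```

With a fixed window the local count grows linearly in the number of tokens, which is why the gap widens as inputs get longer.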

## Reference

Guo, M., Ainslie, J., Uthus, D., Ontañón, S., Ni, J., Sung, Y.-H., & Yang, Y. (2022). LongT5: Efficient Text-To-Text Transformer for Long Sequences. *Findings of the Association for Computational Linguistics: NAACL 2022*, 724–736. https://aclanthology.org/2022.findings-naacl.55


# Setup

In [1]:
!pip install transformers accelerate sentencepiece evaluate datasets
# !pip install rouge_score



In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq, AutoConfig
from datasets import Dataset
import evaluate
import platform
import pandas as pd
import numpy as np
from os import mkdir, listdir, environ
from time import strftime
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

## Definitions

In [3]:
working_dir = '.'
data_dir = working_dir + '/data'
models_dir = working_dir + '/models'

tmp_dir = 'model_temp'
# create the scratch directory for Trainer checkpoints if it does not exist yet
if tmp_dir not in listdir():
    mkdir(tmp_dir)

latest_model = sorted(listdir(models_dir)).pop()
print("Latest model:", latest_model)

###### Model to be fine-tuned #####
### longt5: alternative versions, in increasing order of memory use
# checkpoint = 'google/long-t5-local-base'
# checkpoint = 'google/long-t5-tglobal-base'
# checkpoint = 'google/long-t5-local-large'
# checkpoint = 'google/long-t5-tglobal-large'
#### continue training last model
checkpoint = f'{models_dir}/{latest_model}'
##################################

num_epochs = 1
subset_size = 4096
test_subset_size = None if not subset_size else subset_size // 16

# df = pd.read_csv(f'{google_drive}/My Drive/Learning/Data Science/Springboard/Capstone Projects/3/data/test.csv.gz')
# baseline_cases = df.fulltext.iloc[:10].copy()
# baseline_cases_prefixed = baseline_cases.apply(lambda t: 'Summarize: '+t)

Latest model: t5longtuned202309060527
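The `latest_model` lookup works because the save names end in a fixed-width `%Y%m%d%H%M` timestamp, so lexicographic order coincides with chronological order. A minimal sketch of that behaviour follows; the first entry is the name printed above, the other two are hypothetical.

```python
# Directory names: the first appears in this notebook's output,
# the other two are hypothetical examples.
model_dirs = [
    "t5longtuned202309060527",
    "t5longtuned202308311845",
    "t5longtuned202309011210",
]

# Same selection as in the cell above: sort lexicographically, take the last.
latest = sorted(model_dirs).pop()
print("Latest model:", latest)
```

Any other entry in the models directory whose name sorts after `t5longtuned...` would be picked instead, so the directory is assumed to contain only saved models.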


## Hyperparameters

With these settings the training loop barely fits in 16 GB of GPU memory. Disabling gradient checkpointing causes an out-of-memory error, and replacing Adafactor with another optimizer probably would too.

In [4]:
environ['PJRT_DEVICE'] = 'GPU'
training_args = Seq2SeqTrainingArguments(
    tmp_dir,
    # evaluation_strategy = 'epoch',
    # per_device_eval_batch_size = 1,
    # eval_delay = 5,
    predict_with_generate = True,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 2,
    gradient_checkpointing = True,
    num_train_epochs = num_epochs,
    optim = 'adafactor'
)
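As a rough check on what the batch arguments above imply, the following sketch works out the effective batch size and the number of optimizer steps for one pass over the training subset. It assumes a single GPU, which the notebook does not state explicitly.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1       # assumption: a single GPU
subset_size = 4096    # training examples used per epoch
num_epochs = 1

# Examples contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(subset_size / (per_device_train_batch_size * num_devices))
optimizer_steps = (batches_per_epoch // gradient_accumulation_steps) * num_epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {optimizer_steps}")
```

Gradient accumulation keeps only two examples in memory at a time while still updating on four, which is the point of combining it with the small per-device batch.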

## Preprocessing

In [5]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
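# The `model` argument expects a model instance, not a checkpoint path, so it is omitted here;
# T5-style models derive decoder_input_ids from the labels themselves.
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)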

def process_df(df, n):
    processed = df.iloc[:n, :].copy()
    # processed.fulltext = processed.fulltext.apply(lambda s: 'summarize: ' + s)
    return processed

def prepare_data(data):
    return tokenizer(text=data['fulltext'], max_length=16384, truncation=True, text_target=data['abstract'])

## Evaluation function

In [6]:
# rouge = evaluate.load('rouge')

# def compute_metrics(eval_pred):
#     predictions, labels = eval_pred
#     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
#     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
#     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

#     result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

#     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
#     result["gen_len"] = np.mean(prediction_lens)

#     return {k: round(v, 4) for k, v in result.items()}
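The commented-out metric function swaps the `-100` label padding for the tokenizer's pad id before decoding, since `-100` is not a valid token id. The sketch below shows that step on made-up token ids with NumPy alone; `PAD_ID = 0` stands in for `tokenizer.pad_token_id` and is an assumption here.

```python
import numpy as np

PAD_ID = 0  # assumption: stand-in for tokenizer.pad_token_id

# Made-up label batch; -100 marks positions ignored by the loss.
labels = np.array([
    [71, 1078, 1, -100, -100],
    [71, 1, -100, -100, -100],
])

# Replace the ignore index so every entry is a decodable token id.
decodable = np.where(labels != -100, labels, PAD_ID)

# Count non-pad tokens per row, the same counting that `gen_len`
# applies to the predictions in the commented-out function.
lengths = [int(np.count_nonzero(row != PAD_ID)) for row in decodable]

print(decodable)
print(lengths)
```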

# Loading data

In [7]:
train_df = pd.read_csv(data_dir+'/train.csv.gz')
test_df = pd.read_csv(data_dir+'/test.csv.gz')
# renaming = {
#     'abstract':'label',
#     'fulltext':'text'
# }
# train_df.rename(columns=renaming, inplace=True)
print(train_df.info())
print(test_df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7573 entries, 0 to 7572
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   abstract  7573 non-null   object
 1   fulltext  7573 non-null   object
dtypes: object(2)
memory usage: 118.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1892 entries, 0 to 1891
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   abstract  1892 non-null   object
 1   fulltext  1892 non-null   object
dtypes: object(2)
memory usage: 29.7+ KB
None


In [8]:
# randomize and split off validation data
n = train_df.shape[0]
n_val = n//10
n_train = n - n_val
rng = np.random.default_rng()
shuffled_idx = rng.permutation(n)
train_idx = shuffled_idx[:n_train]
val_idx = shuffled_idx[n_train:]

# convert to HF Dataset
train_data = Dataset.from_pandas(process_df(train_df.iloc[train_idx], subset_size)).map(prepare_data, batched=True, batch_size=4)
val_data = Dataset.from_pandas(process_df(train_df.iloc[val_idx], test_subset_size)).map(prepare_data, batched=True, batch_size=4)

Map:   0%|          | 0/4096 [00:00<?, ? examples/s]

Map:   0%|          | 0/256 [00:00<?, ? examples/s]
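The split above uses an unseeded generator, so each run draws a different validation set. When training continues from an earlier checkpoint, rows held out now may have been trained on before. A minimal sketch of a reproducible variant follows, using only index arithmetic; the seed value is an arbitrary assumption and the row count is taken from the `info()` output.

```python
import numpy as np

n = 7573                 # rows in train.csv.gz, per the info() output
n_val = n // 10
n_train = n - n_val

rng = np.random.default_rng(seed=0)   # seed value is an arbitrary assumption
shuffled_idx = rng.permutation(n)
train_idx = shuffled_idx[:n_train]
val_idx = shuffled_idx[n_train:]

# Same subsetting as process_df: keep the first rows of each part.
subset_size = 4096
test_subset_size = subset_size // 16
train_subset = train_idx[:subset_size]
val_subset = val_idx[:test_subset_size]

print(len(train_subset), len(val_subset))
```

Fixing the seed keeps the held-out rows identical across sessions, at the cost of always training on the same 4096-row subset unless the subset is chosen separately.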

# Training

In [9]:
# config = AutoConfig.from_pretrained(checkpoint, max_length=400)
model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint,
    max_length=512,
    do_sample=True,
    num_beams=2,
    device_map='auto'
)

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = train_data,
    eval_dataset = val_data,
    tokenizer = tokenizer,
    data_collator = collator,
    # compute_metrics = compute_metrics
)
# print(model.config_class.to_dict())

In [10]:
trainer.train()

timestamp = strftime('%Y%m%d%H%M')
model_save_name = 't5longtuned' + timestamp
model_save_dir = f'{models_dir}/{model_save_name}'
mkdir(model_save_dir)
trainer.save_model(model_save_dir)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 21.99 GiB total capacity; 15.40 GiB already allocated; 1.36 GiB free; 19.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
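The error message itself suggests `max_split_size_mb` when reserved memory is much larger than allocated memory. The sketch below shows one way that setting could be applied, together with the arithmetic behind the message. The value 128 is an arbitrary starting point, not a tuned one, and the variable would need to be set before the first CUDA allocation, most safely before `import torch`. Whether it resolves this particular failure is untested.

```python
import os

# Assumption: 128 MB is an arbitrary starting value, not a tuned one.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Figures copied from the error message above, in GiB.
requested, free, reserved, allocated = 2.00, 1.36, 19.37, 15.40

# Memory held by PyTorch's caching allocator but not backing any tensor.
cached_but_unused = reserved - allocated

print(f"requested {requested:.2f} GiB, free {free:.2f} GiB, "
      f"reserved but unallocated {cached_but_unused:.2f} GiB")
```

The request exceeds the free memory while the allocator holds more than that in unused cache, which is the fragmentation pattern the message describes.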