## Training Seq2Seq models in Colab without running out of RAM

This notebook extends [Training NLP models in Colab without running out of RAM](https://github.com/datasci-w266/2022-fall-main/blob/master/materials/walkthrough_notebooks/keras_with_limited_ram/keras_training_with_limited_ram.ipynb). It focuses on sequence-to-sequence (encoder-decoder, text generation) models like T5, because fine-tuning the Huggingface pretrained versions of those models works a bit differently from fine-tuning BERT. With T5, you use the full pre-trained model end-to-end without adding any layers on top.

There are a few ways to set up training for these models, and they differ a little depending on whether you're using tensorflow or pytorch. Huggingface provides trainer classes for fine-tuning their pre-trained models, though these seem to be better supported for pytorch. For tensorflow, the TFTrainer class seems to be deprecated in favor of the model's keras functionality (calling .fit). The notebook below shows both how to use a Seq2SeqTrainer for a pytorch model and how to use keras .fit effectively on a pretrained seq2seq tensorflow model.

In [None]:
!pip install transformers

In [None]:
!pip install sentencepiece

In [3]:
import os
import re
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import torch  # Only if you use a pytorch model, both options are shown below
from transformers import T5Tokenizer, T5ForConditionalGeneration, TFT5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

### Data

To fine-tune T5, we'll use the dataset from the [week 6 lesson notebook](https://github.com/datasci-w266/2022-fall-private/blob/master/materials/lesson_notebooks/lesson_6_Machine_Translation.ipynb) for translating Shakespeare to modern English. You can [download the dataset here](https://github.com/cocoxu/Shakespeare) and upload it to your Drive folder.

In [4]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Modify this path to where you saved the Shakespeare data in your Drive
text_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/train_plays-org-mod.txt'

In [6]:
with open(text_file) as f:
    lines = f.read().split('\n')[:-1]

prefix = 'translate old to modern: '
text_pairs = []
for line in lines:
    orig, target = line.split('\t')
    text_pairs.append({'orig': prefix + orig, 'target': target})

In [7]:
# Look at some examples
for _ in range(5):
    print(np.random.choice(text_pairs))

{'orig': 'translate old to modern: What dost thou say?', 'target': 'What did you say?'}
{'orig': 'translate old to modern: I may sit in a corner and cry, “Heigh-ho for a husband!” Lady Beatrice, I will get you one.', 'target': 'I should sit in the corner and sing that song, “Heigh-Ho for a Husband!” Lady Beatrice, I’ll get you a husband.'}
{'orig': 'translate old to modern: To England will I steal, and there I’ll steal.', 'target': 'I’ll steal away to England, and I’ll steal some more when I get there.'}
{'orig': 'translate old to modern: She’s a beagle, true-bred, and one that adores me.', 'target': 'She’s a good little woman, and she adores me.'}
{'orig': "translate old to modern: I did enact Julius Caesar; I was killed i' the Capitol; Brutus killed me.", 'target': 'I did enact Julius Caesar, I was killed in the Capitol, Brutus killed me.'}


In [8]:
# Let's create some splits
np.random.shuffle(text_pairs)
num_valid_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_valid_samples
train_pairs = text_pairs[:num_train_samples]
valid_pairs = text_pairs[num_train_samples : num_train_samples + num_valid_samples]
test_pairs = text_pairs[num_train_samples + num_valid_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(valid_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

19088 total pairs
13362 training pairs
2863 validation pairs
2863 test pairs


In [9]:
# Save splits to separate csv files, to load only part at a time later
train_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/train_pairs.csv'
valid_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/valid_pairs.csv'
test_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/test_pairs.csv'

pd.DataFrame(train_pairs).to_csv(train_file)
pd.DataFrame(valid_pairs).to_csv(valid_file)
pd.DataFrame(test_pairs).to_csv(test_file)

## Option 1: Pytorch

We'll start with pytorch, showing how you can use a Seq2SeqTrainer with a data generator, to control when, how much and how to load your data as you train. The preprocessor and data generator need to be defined slightly differently for the trainer to use.

Unlike the previous notebook, the generator won't load a batch at a time. Instead, the trainer will call our generator (and preprocessing function) for one example at a time. So the preprocessing function needs to return a one-dimensional vector of input_ids for each example, not a two-dimensional batch.
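
To make the one-example-at-a-time contract concrete, here is a minimal sketch of how per-example dictionaries can be grouped into a batch. The `collate_examples` helper and the toy token ids are hypothetical and use plain Python lists; in practice the trainer does this step for you with its own data collator, which stacks tensors rather than building lists.

```python
def collate_examples(examples):
    """Group a list of per-example dicts into one dict of batched lists.

    Hypothetical helper for illustration only; the trainer's real collator
    stacks tensors instead of building nested lists.
    """
    keys = examples[0].keys()
    return {key: [example[key] for example in examples] for key in keys}


# Two toy "preprocessed" examples, each holding 1-D sequences of length 4
example_a = {'input_ids': [13, 7, 1, 0], 'attention_mask': [1, 1, 1, 0], 'labels': [5, 1, 0, 0]}
example_b = {'input_ids': [9, 4, 2, 1], 'attention_mask': [1, 1, 1, 1], 'labels': [6, 3, 1, 0]}

batch = collate_examples([example_a, example_b])

# Each value is now two-dimensional: (batch size, sequence length)
print(len(batch['input_ids']), len(batch['input_ids'][0]))  # 2 4
```

This is why every example must come back padded to the same length: sequences of different lengths could not be stacked into one rectangular batch.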

For the preprocessor, the pytorch models want the inputs in a dictionary with keys for 'input_ids', 'attention_mask', and 'labels'. The first two are the inputs to the encoder (the original text), and the labels are the translated text vocab ids.

Since we're passing this all into a trainer, we don't need to separate out the decoder input_ids. When only the labels are passed, the decoder input_ids are inferred from them (the labels shifted right by one position).
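
The offset-by-one relationship between labels and decoder inputs can be sketched in a few lines. This is an illustration on plain lists, not the library's implementation: `shift_right` is a hypothetical helper, the token ids are made up, and the defaults assume that T5 uses id 0 both as the pad token and as the decoder start token.

```python
def shift_right(label_ids, decoder_start_token_id=0, pad_token_id=0):
    """Build decoder inputs from labels by shifting them one position right."""
    # Prepend the start token and drop the last label to keep the same length
    shifted = [decoder_start_token_id] + list(label_ids[:-1])
    # -100 marks label positions the loss ignores; it is not a real token id
    return [pad_token_id if tok == -100 else tok for tok in shifted]


labels = [363, 410, 25, 1, 0, 0]      # made-up ids: three tokens, end token, padding
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)              # [0, 363, 410, 25, 1, 0]
```

At each position the decoder sees the tokens up to the previous step and is trained to predict the label at the current step.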

In [10]:
def preprocess_data_pt(text_pair, tokenizer, max_length=128):
    orig_text, target_text = text_pair
    orig_encoded = tokenizer.batch_encode_plus(
        [orig_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

    orig_input_ids = orig_encoded['input_ids'][0]
    orig_attention_mask = orig_encoded['attention_mask'][0]
    
    target_encoded = tokenizer.batch_encode_plus(
        [target_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    
    label_ids = target_encoded['input_ids'][0]
    
    return {'input_ids': orig_input_ids,
            'attention_mask': orig_attention_mask,
            'labels': label_ids}
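
One optional refinement, not used in this notebook: the labels returned above keep the pad token id in padded positions, so those positions count toward the training loss. A common pattern is to replace them with -100, the label value that the loss in Huggingface's pytorch models ignores. The `mask_pad_labels` helper below is a hypothetical sketch on plain lists, and it assumes a pad token id of 0 as in T5; with tensors you would apply the same replacement to `label_ids` before returning it.

```python
def mask_pad_labels(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padded label positions so the loss skips them (illustrative)."""
    return [ignore_index if tok == pad_token_id else tok for tok in label_ids]


padded_labels = [363, 410, 25, 1, 0, 0]   # made-up ids followed by padding
print(mask_pad_labels(padded_labels))     # [363, 410, 25, 1, -100, -100]
```

If you try this, expect the reported loss values to differ from the ones shown in this notebook, since padded positions would no longer be averaged in.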

In [11]:
class TranslationDataGeneratorPT(tf.keras.utils.Sequence):
    
    def __init__(self,
                 tokenizer,
                 n_examples,
                 data_filename,
                 max_length=128,
                 shuffle=True):
        
        self.tokenizer = tokenizer
        self.n_examples = n_examples
        self.data_filename = data_filename
        self.max_length = max_length
        self.shuffle = shuffle
        
        # Initialize row order, call on_epoch_end to shuffle row indices
        self.row_order = np.arange(1, self.n_examples+1)
        self.on_epoch_end()
    
    def __len__(self):
        return self.n_examples
    
    def __getitem__(self, idx):
        row_to_load = self.row_order[idx]
        df = pd.read_csv(self.data_filename,
                         skiprows=range(1, row_to_load),
                         nrows=1)
        
        text_pairs = df[['orig', 'target']].values.astype(str)[0]
        
        batch_data = preprocess_data_pt(
            text_pairs,
            self.tokenizer,
            self.max_length
        )

        return batch_data
    
    def __call__(self):
        for i in range(self.__len__()):
            yield self.__getitem__(i)
            
            if i == self.__len__()-1:
                self.on_epoch_end()
    
    def on_epoch_end(self):
        if self.shuffle:
            self.row_order = list(np.random.permutation(self.row_order))

In [None]:
# Download tokenizer and model

model_name = 't5-base'
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model_pt = T5ForConditionalGeneration.from_pretrained(model_name)

In [50]:
# Create the data generators for train and validation data, pytorch version

max_length = 32

train_data_generator = TranslationDataGeneratorPT(
    tokenizer=t5_tokenizer,
    n_examples=len(train_pairs),
    data_filename=train_file,
    max_length=max_length
)

valid_data_generator = TranslationDataGeneratorPT(
    tokenizer=t5_tokenizer,
    n_examples=len(valid_pairs),
    data_filename=valid_file,
    max_length=max_length
)

In [15]:
# Specify batch size and other training arguments

batch_size = 16

# Modify this filepath to where you want to save the model after fine-tuning
dir_path = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/'
file_path = dir_path + 't5base-finetuned-shakespeare-to-modern'

args = Seq2SeqTrainingArguments(
    file_path,
    evaluation_strategy='epoch',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
)

In [17]:
# Define the trainer, passing in the model, training args, and data generators

trainer = Seq2SeqTrainer(
    t5_model_pt,
    args,
    train_dataset=train_data_generator,
    eval_dataset=valid_data_generator
)

In [18]:
# Call train

trainer.train()

***** Running training *****
  Num examples = 13362
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 836


Epoch,Training Loss,Validation Loss
1,0.995,0.695925


Saving model checkpoint to drive/MyDrive/ISchool/MIDS/W266/2022_Summer/notebooks/t5base-finetuned-old-to-modern/checkpoint-500
Configuration saved in drive/MyDrive/ISchool/MIDS/W266/2022_Summer/notebooks/t5base-finetuned-old-to-modern/checkpoint-500/config.json
Model weights saved in drive/MyDrive/ISchool/MIDS/W266/2022_Summer/notebooks/t5base-finetuned-old-to-modern/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2863
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=836, training_loss=0.9100153799832723, metrics={'train_runtime': 497.2823, 'train_samples_per_second': 26.87, 'train_steps_per_second': 1.681, 'total_flos': 508555958353920.0, 'train_loss': 0.9100153799832723, 'epoch': 1.0})
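
The step count in the log can be checked by hand, which is a useful sanity check that the trainer sees the dataset size you expect. The sketch below reproduces it from the logged numbers. The checkpoint interval of 500 steps is an assumption (the training arguments above do not set one, and 500 matches the `checkpoint-500` folder in the log).

```python
import math

n_train_examples = 13362   # "Num examples" in the training log
per_device_batch = 16      # batch size on a single device
n_epochs = 1
save_every = 500           # assumed checkpoint interval, not set explicitly above

# The last, partial batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(n_train_examples / per_device_batch)
total_steps = steps_per_epoch * n_epochs
checkpoint_steps = list(range(save_every, total_steps + 1, save_every))

print(total_steps)         # 836
print(checkpoint_steps)    # [500]
```

Under that assumption, one epoch produces a single checkpoint at step 500, so the saved weights reflect the model partway through the epoch rather than at the end.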

### Does it seem to have worked?

Depending on your task, you'll add your own model evaluation after training. Here's a simple check to make sure it does seem to have fine-tuned T5 for this new task we defined.

In [19]:
for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='pt')
    # Move the inputs to whichever device the model is on (GPU or CPU)
    test_output_ids = t5_model_pt.generate(test_inputs['input_ids'].to(t5_model_pt.device))

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True, 
                              clean_up_tokenization_spaces=False) for out_ids in test_output_ids])



['You’ll not vex me again.']
['Do you foresake me?']
['Make your own dinner.']


You can load the model you trained using the .from_pretrained function you use for pretrained models. If you look in your drive folder, at the filepath you used in the trainer arguments, you'll see a checkpoint folder. Use the full path to that checkpoint folder as the argument to .from_pretrained, to load the model you saved again later.
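
If a longer run leaves several checkpoint folders behind, a small helper can pick the most recent one instead of hard-coding the step number. This is a sketch under the assumption that the folders are named `checkpoint-<step>`, as in the training log above; `latest_checkpoint` is a hypothetical helper, not part of the transformers library.

```python
import os
import re


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered 'checkpoint-N' folder, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r'checkpoint-(\d+)', name)
        path = os.path.join(output_dir, name)
        # Only consider directories whose name matches the expected pattern
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Calling `latest_checkpoint(file_path)` with the output directory from the training arguments would return a path you can pass to .from_pretrained, or None if no checkpoint has been saved yet.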

In [None]:
t5_model_saved = T5ForConditionalGeneration.from_pretrained(file_path + '/checkpoint-500')

In [26]:
# Still works?
for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='pt')
    test_output_ids = t5_model_saved.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True, 
                              clean_up_tokenization_spaces=False) for out_ids in test_output_ids])



['You’ll not vex me again.']
['Do you want to leave me?']
['Make your own dinner.']


## Option 2: Tensorflow

For tensorflow, Huggingface seems to have deprecated their TFTrainer in favor of using keras .fit. You can call .compile() and .fit() directly on the pre-trained T5 model, but it can be tricky to make sure the right inputs are going into the right part of the model, since tensorflow models take a separate list of inputs and labels, usually not in a dictionary with keys like the pytorch version.

Even though we aren't adding any other layers, we can still create a keras model wrapper around the pretrained T5 model. That way, we can pass the right inputs into the model using keyword arguments. In this case, we'll not only pass in the input_ids and attention_mask for the encoder (original text), we'll also need to pass in the decoder_input_ids. The T5 model has a handy function to shift the labels over by one, so that the decoder inputs start with the decoder's start token.

We'll just use the first output of the T5 model (the logits) as the output of the overall model, and compile with crossentropy loss. Then we can call .fit on the wrapper model like we did in the last notebook, using a data generator that loads a batch of data each time.
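
To see what the crossentropy loss does with the logits, here is a minimal sketch for a single token position, written with only the standard library. It is an illustration of the formula (negative log-softmax probability of the true vocab id), not the keras implementation; the function name and the toy logits are made up.

```python
import math


def sparse_crossentropy_from_logits(logits, label_id):
    """Loss for one token: negative log-softmax probability of the true id."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label_id]


toy_logits = [2.0, 0.5, -1.0, 0.0]   # scores over a 4-word toy vocabulary
loss_right = sparse_crossentropy_from_logits(toy_logits, label_id=0)
loss_wrong = sparse_crossentropy_from_logits(toy_logits, label_id=2)
print(loss_right < loss_wrong)       # True
```

The label is an integer vocab id rather than a one-hot vector, which is what "sparse" refers to, and the softmax is applied inside the loss, which is why the model can output raw logits.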

In [11]:
def preprocess_data(text_pairs, tokenizer, model, max_length=128):
    orig_text = [orig for orig, target in text_pairs]
    orig_encoded = tokenizer.batch_encode_plus(
        orig_text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors="tf"
    )

    orig_input_ids = np.array(orig_encoded["input_ids"], dtype="int32")
    orig_attention_masks = np.array(orig_encoded["attention_mask"], dtype="int32")
    
    target_text = [target for orig, target in text_pairs]
    target_encoded = tokenizer.batch_encode_plus(
        target_text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors="tf"
    )

    label_ids = np.array(target_encoded['input_ids'])
    decoder_input_ids = model._shift_right(label_ids)
    
    return [orig_input_ids, orig_attention_masks, decoder_input_ids], label_ids

In [12]:
class TranslationDataGenerator(tf.keras.utils.Sequence):
    
    def __init__(self,
                 tokenizer,
                 model,
                 n_examples,
                 data_filename,
                 max_length=128,
                 batch_size=16,
                 shuffle=True):
        
        self.tokenizer = tokenizer
        self.model = model
        self.n_examples = n_examples
        self.data_filename = data_filename
        self.max_length = max_length
        self.batch_size = batch_size
        self.shuffle = shuffle
        
        # Initialize row order as a list (so the slices in __getitem__ can be
        # concatenated with + even when shuffle=False), then shuffle if requested
        self.row_order = list(range(1, self.n_examples + 1))
        self.on_epoch_end()
    
    def __len__(self):
        return self.n_examples // self.batch_size
    
    def __getitem__(self, idx):
        batch_start = idx * self.batch_size
        batch_end = (idx + 1) * self.batch_size

        # Indices to skip are the ones in the shuffled row_order before and
        # after the chunk we'll use for this batch
        batch_idx_skip = self.row_order[:batch_start] + self.row_order[batch_end:]
        df = pd.read_csv(self.data_filename, skiprows=batch_idx_skip)
        
        text_pairs = df[['orig', 'target']].values.astype(str).tolist()
        
        batch_data = preprocess_data(
            text_pairs,
            self.tokenizer,
            self.model,
            self.max_length
        )

        return batch_data
    
    def __call__(self):
        for i in range(self.__len__()):
            yield self.__getitem__(i)
            
            if i == self.__len__()-1:
                self.on_epoch_end()
    
    def on_epoch_end(self):
        if self.shuffle:
            self.row_order = list(np.random.permutation(self.row_order))
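
The row selection inside `__getitem__` is easier to follow on a tiny example. The sketch below repeats the same slicing on a plain list; `split_rows_for_batch` is a hypothetical helper written for illustration, and the row numbers are made up. Row 0 of the file is the header, so it never appears in the row order.

```python
def split_rows_for_batch(row_order, idx, batch_size):
    """Return (rows to load, rows to skip) for batch number idx."""
    batch_start = idx * batch_size
    batch_end = (idx + 1) * batch_size
    rows_to_keep = row_order[batch_start:batch_end]
    # List concatenation: everything before and after the batch's slice
    rows_to_skip = row_order[:batch_start] + row_order[batch_end:]
    return rows_to_keep, rows_to_skip


toy_row_order = [4, 1, 6, 3, 2, 5]    # shuffled file row numbers for 6 examples
keep, skip = split_rows_for_batch(toy_row_order, idx=1, batch_size=2)
print(keep)   # [6, 3]
print(skip)   # [4, 1, 2, 5]
```

Note that the file is still read top to bottom, so the rows within a batch arrive in file order (3 before 6 here) rather than in shuffled order. That does not matter for training, since the batch contents are the same.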

In [None]:
# Load the pretrained tensorflow model

model_name = 't5-base'
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)

In [13]:
# Create the data generators for train and validation data, tensorflow version

max_length = 32
batch_size = 16

train_data_generator = TranslationDataGenerator(
    tokenizer=t5_tokenizer,
    model=t5_model_tf,
    n_examples=len(train_pairs),
    data_filename=train_file,
    max_length=max_length,
    batch_size=batch_size
)

valid_data_generator = TranslationDataGenerator(
    tokenizer=t5_tokenizer,
    model=t5_model_tf,
    n_examples=len(valid_pairs),
    data_filename=valid_file,
    max_length=max_length,
    batch_size=batch_size
)

In [14]:
def build_t5_training_wrapper_model(t5_model, max_length):
    # shape must be a tuple, hence the trailing comma; names match the T5 keyword arguments
    input_ids = layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids')
    attention_mask = layers.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask')
    decoder_input_ids = layers.Input(shape=(max_length,), dtype=tf.int32, name='decoder_input_ids')
    
    t5_logits = t5_model(input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids)[0]

    model = tf.keras.models.Model(inputs=[input_ids, attention_mask, decoder_input_ids],
                                  outputs=[t5_logits])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    
    return model

In [15]:
model_wrapper = build_t5_training_wrapper_model(t5_model_tf, max_length)

In [None]:
# As in the first notebook, we should add a model checkpoint callback to save
# the trained model weights after each epoch. Edit the filepath to where
# you want to save the weights in your own Drive

checkpoint_dir = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/model_checkpoints/'
checkpoint_filepath = checkpoint_dir + 't5_shakespeare_weights.{epoch:02d}-{val_accuracy:.2f}.hdf5'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True)
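
The placeholders in the checkpoint filename are filled in with the epoch number and the logged metrics each time weights are saved. The same substitution can be previewed with ordinary string formatting; the validation accuracy used here is a made-up value for illustration.

```python
template = 't5_shakespeare_weights.{epoch:02d}-{val_accuracy:.2f}.hdf5'

# epoch is zero-padded to two digits, val_accuracy is rounded to two decimals
example_filename = template.format(epoch=1, val_accuracy=0.8512)
print(example_filename)   # t5_shakespeare_weights.01-0.85.hdf5
```

Knowing the resulting name is useful later, because loading the weights requires the exact filename that was written to your Drive.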

In [16]:
# Now call .fit on the model_wrapper, passing in the data generators and the
# model checkpoint callback

model_wrapper.fit(train_data_generator,
                  validation_data=valid_data_generator,
                  epochs=1,
                  callbacks=[model_checkpoint_callback])



<keras.callbacks.History at 0x7f9e102eb290>

### Does it work?

Again, depending on your task, you'll add your own model evaluation after training. Here's a simple check to make sure it does seem to have fine-tuned T5 for this new task we defined.

In [26]:
for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    # Use tensorflow tensors, since this is the tensorflow model
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='tf')
    test_output_ids = t5_model_tf.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True, 
                              clean_up_tokenization_spaces=False) for out_ids in test_output_ids])

['You’re not going to bother me again.']
['Do you leave me?']
['Make your own dinner.']


In [None]:
# To pick back up where you left off, load the saved model weights
# (Edit the filename to the last saved one that you want to load)

checkpoint_filepath = checkpoint_dir + 't5_shakespeare_weights.01-0.85.hdf5'
model_wrapper.load_weights(checkpoint_filepath)

In [25]:
# Still works?
for test_input_text in ['Hence forth thou shalt not vex me e\'er again.',
                        'Dost thou foresake me?',
                        'Makest thine own dinner.']:
    # Use tensorflow tensors, since this is the tensorflow model
    test_inputs = t5_tokenizer([prefix + test_input_text], return_tensors='tf')
    test_output_ids = t5_model_tf.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True, 
                              clean_up_tokenization_spaces=False) for out_ids in test_output_ids])



['You’re not going to bother me again.']
['Do you leave me?']
['Make your own dinner.']
