<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/RR/question_generation_t5_with_data_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Question Generation using T5 in Colab without running out of RAM

This notebook is based on Natalie Ahn's notebook showing how to fine-tune T5 in Colab without running out of RAM.

It includes PyTorch and TensorFlow examples.

We use the SQuAD dataset.

In [None]:
!pip install -q transformers

In [None]:
!pip install -q sentencepiece

In [None]:
!pip install -q datasets

In [None]:
import os
import re
import numpy as np
import pandas as pd
import json

from pprint import pprint

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import torch  # Only if you use a pytorch model, both options are shown below
from transformers import T5Tokenizer, T5ForConditionalGeneration, TFT5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

from datasets import list_datasets, load_dataset_builder, get_dataset_config_names, load_dataset, load_from_disk

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
def summarize_dataset(dataset, config=None):
    """Print the description and feature schema of a Hugging Face dataset."""
    builder = load_dataset_builder(dataset, config)
    pprint(f"Description:\n {builder.info.description}")
    print("Features:")
    pprint(builder.info.features)

### Data

We use SQuAD.

In [None]:
dataset_root = "/content/drive/MyDrive/w266 NLP Final Project/Data/"
dataset_name = "squad"
dataset_folder = dataset_root+dataset_name+".hf"

In [None]:
# Begin with a dataset summary.
summarize_dataset(dataset_name)

#### Load the data

In [None]:
# SQuAD is quick to download from Hugging Face
# Use the code below if you aren't accessing the data from the shared
# Google Drive folder.

# dataset = load_dataset(dataset_name)

# The following code assumes you have added a link to the shared
# w266 NLP Final Project folder in your Google Drive folder.
# Loading the data from there is faster.

dataset = load_from_disk(dataset_folder)

In [None]:
# Uncomment to save a freshly downloaded copy to the shared folder.
# dataset.save_to_disk(dataset_folder)

In [None]:
dataset

In [None]:
training_data = dataset['train'].shuffle(seed=1962)

In [None]:
validation_data = dataset['validation'].shuffle(seed=1962)

In [None]:
training_answers = [answer['text'][0] for answer in training_data['answers']]
training_context = training_data['context']
training_questions = training_data['question']

In [None]:
validation_answers = [answer['text'][0] for answer in validation_data['answers']]
validation_context = validation_data['context']
validation_questions = validation_data['question']

#### Assemble input and output pairs

The input format is:
generate question: answer: XXXXXXX context: XXXXXXXX
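
As a minimal sketch of this format, the snippet below builds one input/target pair from a SQuAD-style record. The record is invented for illustration, and `build_input` is a hypothetical helper that is not part of this notebook.

```python
# Hypothetical SQuAD-style record, invented for illustration only.
record = {
    "context": "The Eiffel Tower is located in Paris.",
    "question": "Where is the Eiffel Tower located?",
    "answers": {"text": ["Paris"], "answer_start": [31]},
}


def build_input(record):
    """Format one record as the model input; only the first listed answer is used."""
    answer = record["answers"]["text"][0]
    return f"generate question: answer: {answer} context: {record['context']}"


model_input = build_input(record)   # what the encoder sees
model_target = record["question"]   # what the decoder should produce
print(model_input)
print(model_target)
```

The question is the training target, so it never appears in the input string.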

In [None]:
training_orig = [f"generate question: answer: {answer} context: {context}" for answer, context in zip(training_answers, training_context)]
training_target = training_questions
validation_orig = [f"generate question: answer: {answer} context: {context}" for answer, context in zip(validation_answers, validation_context)]
validation_target = validation_questions

In [None]:
training_df = pd.DataFrame()
training_df['orig'] = training_orig
training_df['target'] = training_target
training_df

In [None]:
validation_df = pd.DataFrame()
validation_df['orig'] = validation_orig
validation_df['target'] = validation_target
validation_df

In [None]:
### This cell is a work in progress; for now we rely on the splits created above.

# Let's create some splits
#np.random.shuffle(text_pairs)
#num_valid_samples = int(0.15 * len(text_pairs))
#num_train_samples = len(text_pairs) - 2 * num_valid_samples
train_pairs = training_df.shape[0]
valid_pairs = validation_df.shape[0]
#test_pairs = text_pairs[num_train_samples + num_valid_samples :]

#print(f"{len(text_pairs)} total pairs")
print(f"{train_pairs} training pairs")
print(f"{valid_pairs} validation pairs")
#print(f"{len(test_pairs)} test pairs")

In [None]:
# Save splits to separate csv files, to load only part at a time later
training_file = dataset_folder + '/train_pairs.csv'
validation_file = dataset_folder + '/valid_pairs.csv'
# test_file = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/test_pairs.csv'

training_df.to_csv(training_file)
validation_df.to_csv(validation_file)
# pd.DataFrame(test_pairs).to_csv(test_file)

In [None]:
df = pd.read_csv(training_file)
df

## Option 1: Pytorch

We'll start with PyTorch, showing how to use a Seq2SeqTrainer with a data generator to control when and how much data is loaded during training. The preprocessor and data generator need to be defined slightly differently for the trainer to use them.

Unlike the previous notebook, the generator doesn't load a batch at a time. Instead, the trainer calls our generator (and preprocessing function) for one example at a time, so the preprocessing function needs to return a one-dimensional vector of input_ids for each example, not a two-dimensional batch.
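
To illustrate the one-example-at-a-time idea, here is a rough sketch that reads a single row from a CSV without loading the rest into memory. It uses the standard-library `csv` module rather than the `pandas.read_csv` call with `skiprows` and `nrows` used in this notebook. The file contents and the helper name `read_one_pair` are made up for the example.

```python
import csv
import io
from itertools import islice

# Tiny in-memory stand-in for train_pairs.csv (contents invented for illustration).
# The unnamed first column mimics the index that DataFrame.to_csv writes by default.
csv_text = (
    ",orig,target\n"
    "0,input zero,question zero\n"
    "1,input one,question one\n"
    "2,input two,question two\n"
)


def read_one_pair(handle, row_number):
    """Return (orig, target) for the 1-based data row, streaming past earlier rows."""
    reader = csv.DictReader(handle)
    row = next(islice(reader, row_number - 1, row_number))
    return row["orig"], row["target"]


pair = read_one_pair(io.StringIO(csv_text), 2)
print(pair)
```

Either way, each lookup still scans the file from the top, so memory use stays small at the cost of extra read time per example.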

For the preprocessor, the PyTorch models expect the inputs in a dictionary with the keys 'input_ids', 'attention_mask', and 'labels'. The first two are the inputs to the encoder (the answer and context text), and the labels are the vocabulary ids of the target question.

Since we're passing all of this into a trainer, we don't need to build the decoder input_ids ourselves. When labels are provided, the model derives them by shifting the labels one position to the right.
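
The shift can be sketched in plain Python. This is an illustration of the idea, not the library's code: it assumes T5's conventions that the pad token id is 0, that decoding starts from the pad token, and that label positions set to -100 are ignored by the loss. The label ids are arbitrary, and `shift_right` here is a stand-alone toy function.

```python
PAD_TOKEN_ID = 0            # T5's pad token id
DECODER_START_TOKEN_ID = 0  # T5 starts decoding from the pad token
IGNORE_INDEX = -100         # label value that the loss skips


def shift_right(labels):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [DECODER_START_TOKEN_ID] + list(labels[:-1])
    # Ignored label positions must become real token ids before the embedding lookup.
    return [PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in shifted]


# Arbitrary token ids ending in 1 (</s>), followed by two ignored padding positions.
labels = [363, 19, 8, 1, IGNORE_INDEX, IGNORE_INDEX]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
```

At each position the decoder sees the tokens up to the previous step and is trained to predict the label at the current step.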

In [None]:
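def preprocess_data_pt(text_pair, tokenizer, max_length=1024):
    """Tokenize one (input, target) pair into 1-D tensors for the trainer."""
    orig_text, target_text = text_pair

    # Encoder side: the "generate question: answer: ... context: ..." string
    orig_encoded = tokenizer.batch_encode_plus(
        [orig_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

    # Take row 0 so each example is a 1-D vector rather than a batch of one
    orig_input_ids = orig_encoded['input_ids'][0]
    orig_attention_mask = orig_encoded['attention_mask'][0]

    # Decoder side: the target question
    target_encoded = tokenizer.batch_encode_plus(
        [target_text],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

    label_ids = target_encoded['input_ids'][0].clone()
    # Padding positions are set to -100 so the loss ignores them
    label_ids[label_ids == tokenizer.pad_token_id] = -100

    return {'input_ids': orig_input_ids,
            'attention_mask': orig_attention_mask,
            'labels': label_ids}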

In [None]:
# A map-style PyTorch dataset: the trainer only needs __len__ and __getitem__
class TranslationDataGeneratorPT(torch.utils.data.Dataset):
    
    def __init__(self,
                 tokenizer,
                 n_examples,
                 data_filename,
                 max_length=1024,
                 shuffle=True):
        
        self.tokenizer = tokenizer
        self.n_examples = n_examples
        self.data_filename = data_filename
        self.max_length = max_length
        self.shuffle = shuffle
        
        # Initialize row order, call on_epoch_end to shuffle row indices
        self.row_order = np.arange(1, self.n_examples+1)
        self.on_epoch_end()
    
    def __len__(self):
        return self.n_examples
    
    def __getitem__(self, idx):
        row_to_load = self.row_order[idx]
        df = pd.read_csv(self.data_filename,
                         skiprows=range(1, row_to_load),
                         nrows=1)
        
        text_pairs = df[['orig', 'target']].values.astype(str)[0]
        
        batch_data = preprocess_data_pt(
            text_pairs,
            self.tokenizer,
            self.max_length
        )

        return batch_data
    
    def __call__(self):
        for i in range(self.__len__()):
            yield self.__getitem__(i)
            
            if i == self.__len__()-1:
                self.on_epoch_end()
    
    def on_epoch_end(self):
        if self.shuffle:
            self.row_order = list(np.random.permutation(self.row_order))

In [None]:
# Download tokenizer and model

model_name = "google/t5-v1_1-base"
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model_pt = T5ForConditionalGeneration.from_pretrained(model_name)

In [None]:
# Create the data generators for train and validation data, pytorch version

max_length = 1024

train_data_generator = TranslationDataGeneratorPT(
    tokenizer=t5_tokenizer,
    n_examples=training_df.shape[0],
    data_filename=training_file,
    max_length=max_length
)

valid_data_generator = TranslationDataGeneratorPT(
    tokenizer=t5_tokenizer,
    n_examples=validation_df.shape[0],
    data_filename=validation_file,
    max_length=max_length
)

In [None]:
# Specify batch size and other training arguments

batch_size = 16

# Modify this filepath to where you want to save the model after fine-tuning
dir_path = "/content/drive/MyDrive/w266 NLP Final Project/Models/RR Squad One/"
file_path = dir_path + 't5base-finetuned-squad'

args = Seq2SeqTrainingArguments(
    file_path,
    evaluation_strategy='epoch',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
)

In [None]:
# Define the trainer, passing in the model, training args, and data generators

trainer = Seq2SeqTrainer(
    t5_model_pt,
    args,
    train_dataset=train_data_generator,
    eval_dataset=valid_data_generator
)

In [None]:
# Call train

trainer.train()

### Does it seem to have worked?

Depending on your task, you'll add your own model evaluation after training. Here's a simple check to make sure it does seem to have fine-tuned T5 for this new task we defined.

In [None]:
# Spot-check a few validation inputs, which are already in the
# "generate question: answer: ... context: ..." format
for test_input_text in validation_df['orig'][:3]:
    test_inputs = t5_tokenizer([test_input_text], return_tensors='pt')
    # Move the inputs to whichever device the trained model is on
    test_output_ids = t5_model_pt.generate(test_inputs['input_ids'].to(t5_model_pt.device))

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])

You can load the model you trained using the .from_pretrained function you use for pretrained models. If you look in your drive folder, at the filepath you used in the trainer arguments, you'll see a checkpoint folder. Use the full path to that checkpoint folder as the argument to .from_pretrained, to load the model you saved again later.
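
The checkpoint folder name depends on how many training steps were run, so it can help to look it up instead of typing it by hand. Below is a small sketch that picks the `checkpoint-<step>` folder with the highest step number. `latest_checkpoint` is a hypothetical helper written for this example, and the demonstration runs on a throwaway directory with made-up folders rather than on the real output path.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None if absent."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstration on a temporary directory with made-up checkpoint folders.
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (500, 1000, 1500):
        os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(demo_dir))
    print(found)
```

In this notebook the helper would be called on the output directory passed to the training arguments, and the returned path passed to .from_pretrained.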

In [None]:
t5_model_saved = T5ForConditionalGeneration.from_pretrained(file_path + '/checkpoint-500')

In [None]:
# Still works?
for test_input_text in validation_df['orig'][:3]:
    test_inputs = t5_tokenizer([test_input_text], return_tensors='pt')
    test_output_ids = t5_model_saved.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])

## Option 2: Tensorflow

For TensorFlow, Hugging Face has deprecated its TFTrainer in favor of Keras .fit. You can call .compile() and .fit() directly on the pre-trained T5 model, but it can be tricky to make sure the right inputs go into the right part of the model, since TensorFlow models take a separate list of inputs and labels, usually not a dictionary with keys like the PyTorch version.

Even though we aren't adding any other layers, we can still create a Keras model wrapper around the pretrained T5 model. That way, we can pass the right inputs into the model using keyword arguments. In this case, we not only pass in the input_ids and attention_mask for the encoder (the answer and context text), we also need to pass in the decoder_input_ids. The T5 model has a handy function that shifts the labels right by one position, so that the decoder inputs begin with the decoder start token.

We'll just use the first output of the T5 model (the logits) as the output of the overall model, and compile with cross-entropy loss. Then we can call .fit on the wrapper model like we did in the last notebook, using a data generator that loads a batch of data each time.
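
To make "a batch of data each time" concrete, the sketch below shows one way to decide which CSV rows belong to a given batch when the row order has been shuffled. Every row outside the batch's slice is skipped, and only the remaining rows are read from disk. The row numbers and the helper name `rows_to_skip` are invented for this illustration.

```python
def rows_to_skip(row_order, batch_index, batch_size):
    """Return the CSV row numbers that are NOT in the requested batch."""
    start = batch_index * batch_size
    end = start + batch_size
    return sorted(row_order[:start] + row_order[end:])


row_order = [4, 1, 6, 3, 2, 5]   # shuffled 1-based data rows; row 0 is the CSV header
batch_size = 2
n_batches = len(row_order) // batch_size  # any remainder would be dropped

skipped = rows_to_skip(row_order, batch_index=1, batch_size=batch_size)
kept = sorted(set(row_order) - set(skipped))
print(n_batches, skipped, kept)
```

Note that `row_order` must be a Python list here: with NumPy arrays the `+` operator adds element-wise instead of concatenating.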

In [None]:
def preprocess_data(text_pairs, tokenizer, model, max_length=128):
    orig_text = [orig for orig, target in text_pairs]
    orig_encoded = tokenizer.batch_encode_plus(
        orig_text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors="tf"
    )

    orig_input_ids = np.array(orig_encoded["input_ids"], dtype="int32")
    orig_attention_masks = np.array(orig_encoded["attention_mask"], dtype="int32")
    
    target_text = [target for orig, target in text_pairs]
    target_encoded = tokenizer.batch_encode_plus(
        target_text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors="tf"
    )

    label_ids = np.array(target_encoded['input_ids'])
    decoder_input_ids = model._shift_right(label_ids)
    
    return [orig_input_ids, orig_attention_masks, decoder_input_ids], label_ids

In [None]:
class TranslationDataGenerator(tf.keras.utils.Sequence):
    
    def __init__(self,
                 tokenizer,
                 model,
                 n_examples,
                 data_filename,
                 max_length=128,
                 batch_size=16,
                 shuffle=True):
        
        self.tokenizer = tokenizer
        self.model = model
        self.n_examples = n_examples
        self.data_filename = data_filename
        self.max_length = max_length
        self.batch_size = batch_size
        self.shuffle = shuffle
        
        # Initialize row order as a list (so slices concatenate with + in
        # __getitem__ even when shuffle=False), then shuffle via on_epoch_end
        self.row_order = list(range(1, self.n_examples + 1))
        self.on_epoch_end()
    
    def __len__(self):
        return self.n_examples // self.batch_size
    
    def __getitem__(self, idx):
        batch_start = idx * self.batch_size
        batch_end = (idx + 1) * self.batch_size

        # Indices to skip are the ones in the shuffled row_order before and
        # after the chunk we'll use for this batch
        batch_idx_skip = self.row_order[:batch_start] + self.row_order[batch_end:]
        df = pd.read_csv(self.data_filename, skiprows=batch_idx_skip)
        
        text_pairs = df[['orig', 'target']].values.astype(str).tolist()
        
        batch_data = preprocess_data(
            text_pairs,
            self.tokenizer,
            self.model,
            self.max_length
        )

        return batch_data
    
    def __call__(self):
        for i in range(self.__len__()):
            yield self.__getitem__(i)
            
            if i == self.__len__()-1:
                self.on_epoch_end()
    
    def on_epoch_end(self):
        if self.shuffle:
            self.row_order = list(np.random.permutation(self.row_order))

In [None]:
# Load the pretrained tensorflow model

model_name = 't5-base'
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)

In [None]:
# Create the data generators for train and validation data, tensorflow version

max_length = 32
batch_size = 16

train_data_generator = TranslationDataGenerator(
    tokenizer=t5_tokenizer,
    model=t5_model_tf,
    n_examples=training_df.shape[0],
    data_filename=training_file,
    max_length=max_length,
    batch_size=batch_size
)

valid_data_generator = TranslationDataGenerator(
    tokenizer=t5_tokenizer,
    model=t5_model_tf,
    n_examples=validation_df.shape[0],
    data_filename=validation_file,
    max_length=max_length,
    batch_size=batch_size
)

In [None]:
def build_t5_training_wrapper_model(t5_model, max_length):
    # shape must be a tuple, hence the trailing comma
    input_ids = layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids')
    attention_mask = layers.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask')
    decoder_input_ids = layers.Input(shape=(max_length,), dtype=tf.int32, name='decoder_input_ids')
    
    t5_logits = t5_model(input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids)[0]

    model = tf.keras.models.Model(inputs=[input_ids, attention_mask, decoder_input_ids],
                                  outputs=[t5_logits])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    
    return model

In [None]:
model_wrapper = build_t5_training_wrapper_model(t5_model_tf, max_length)

In [None]:
# As in the first notebook, we should add a model checkpoint callback to save
# the trained model weights after each epoch. Edit the filepath to where
# you want to save the weights in your own Drive

checkpoint_dir = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/notebooks/model_checkpoints/'
checkpoint_filepath = checkpoint_dir + 't5_shakespeare_weights.{epoch:02d}-{val_accuracy:.2f}.hdf5'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True)

In [None]:
# Now call .fit on the model_wrapper, passing in the data generators and the
# model checkpoint callback

model_wrapper.fit(train_data_generator,
                  validation_data=valid_data_generator,
                  epochs=1,
                  callbacks=[model_checkpoint_callback])

### Does it work?

Again, depending on your task, you'll add your own model evaluation after training. Here's a simple check to make sure it does seem to have fine-tuned T5 for this new task we defined.

In [None]:
# Spot-check a few validation inputs, which are already in the
# "generate question: answer: ... context: ..." format
for test_input_text in validation_df['orig'][:3]:
    # The TensorFlow model needs TensorFlow tensors
    test_inputs = t5_tokenizer([test_input_text], return_tensors='tf')
    test_output_ids = t5_model_tf.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])

In [None]:
# To pick back up where you left off, load the saved model weights
# (Edit the filename to the last saved one that you want to load)

checkpoint_filepath = checkpoint_dir + 't5_shakespeare_weights.01-0.85.hdf5'
model_wrapper.load_weights(checkpoint_filepath)

In [None]:
# Still works?
for test_input_text in validation_df['orig'][:3]:
    test_inputs = t5_tokenizer([test_input_text], return_tensors='tf')
    test_output_ids = t5_model_tf.generate(test_inputs['input_ids'])

    print([t5_tokenizer.decode(out_ids, skip_special_tokens=True,
                               clean_up_tokenization_spaces=False) for out_ids in test_output_ids])