In this notebook, we fine-tune a T5 model for paraphrasing and run inference with it, using the dataset prepared in the `1.1-data-preparation.ipynb` notebook.

In [1]:
import gc
import os
import torch
import pandas as pd

from sklearn.model_selection import train_test_split
from transformers import T5ForConditionalGeneration, T5TokenizerFast, Trainer, TrainingArguments
from transformers.file_utils import cached_property
from torch.utils.data import DataLoader
from typing import List, Dict, Union


os.environ['CUDA_VISIBLE_DEVICES'] = '1'
torch.cuda.is_available()

True

# Data Preprocessing

Load the dataset and tokenizer:

In [2]:
df = pd.read_csv('../data/interim/distinguished.tsv', delimiter='\t')
df

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox,source,target
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983,"if Alkar floods her with her mental waste, it ...","If Alkar is flooding her with psychic waste, t..."
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039,you're becoming disgusting.,Now you're getting nasty.
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068,"well, we can spare your life.","Well, we could spare your life, for one."
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215,"monkey, you have to wake up.","Ah! Monkey, you've got to snap out of it."
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348,I have orders to kill her.,I've got orders to put her down.
...,...,...,...,...,...,...,...,...
577772,You didn't know that Estelle had stolen some f...,you didn't know that Estelle stole your fish f...,0.870322,0.030769,0.000121,0.949143,you didn't know that Estelle stole your fish f...,You didn't know that Estelle had stolen some f...
577773,It'il suck the life out of you!,you'd be sucked out of your life!,0.722897,0.058824,0.996124,0.215794,It'il suck the life out of you!,you'd be sucked out of your life!
577774,"I can't fuckin' take that, bruv.",I really can't take this.,0.617511,0.212121,0.984538,0.000049,"I can't fuckin' take that, bruv.",I really can't take this.
577775,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care.",0.679613,0.358209,0.991945,0.000124,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care."


In [3]:
tokenizer = T5TokenizerFast.from_pretrained("ceshine/t5-paraphrase-paws-msrp-opinosis")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Split the data into training, test, and validation sets (1,000 rows each for test and validation):

In [4]:
df_train, df_test_sets = train_test_split(df, test_size=2000)
df_test, df_valid = train_test_split(df_test_sets, test_size=1000)
print(df_train.shape, df_test.shape, df_valid.shape)

(575777, 8) (1000, 8) (1000, 8)
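
The split above is not seeded, so rerunning the cell produces a different `validation.tsv`. With scikit-learn, passing `random_state` to `train_test_split` would fix this. The sketch below illustrates the same two-stage split with a fixed seed using only the standard library; the helper name `three_way_split` and the seed value `42` are illustrative assumptions, not part of the original pipeline.

```python
import random
from typing import List, Sequence, Tuple


def three_way_split(
    rows: Sequence, holdout: int = 2000, seed: int = 42
) -> Tuple[List, List, List]:
    """Shuffle with a fixed seed, carve off `holdout` rows, and halve them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    held = indices[:holdout]
    train = [rows[i] for i in indices[holdout:]]
    test = [rows[i] for i in held[: holdout // 2]]
    valid = [rows[i] for i in held[holdout // 2:]]
    return train, test, valid


# Toy data standing in for the DataFrame rows.
train, test, valid = three_way_split(list(range(10_000)))
print(len(train), len(test), len(valid))
```

Because the permutation depends only on the seed, calling the helper again on the same rows returns the same three subsets.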


Save the validation set to a file. We will use it to evaluate the models at the end of the assignment.

In [5]:
df_valid.to_csv('../data/interim/validation.tsv', sep='\t')

Apply the tokenization to the data:

In [6]:
x_train = tokenizer(df_train['source'].tolist(), truncation=True)
y_train = tokenizer(df_train['target'].tolist(), truncation=True)
x_test = tokenizer(df_test['source'].tolist(), truncation=True)
y_test = tokenizer(df_test['target'].tolist(), truncation=True)

Prepare a Dataset class and data loaders:

In [7]:
class ParaphraseDataset(torch.utils.data.Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x['input_ids'])

    def __getitem__(self, idx):
        assert idx < len(self)
        item = {
            key: val[idx] for key, val in self.x.items()
        }
        item['decoder_attention_mask'] = self.y['attention_mask'][idx]
        item['labels'] = self.y['input_ids'][idx]
        return item

In [8]:
train_dataset = ParaphraseDataset(x_train, y_train)
test_dataset = ParaphraseDataset(x_test, y_test)
len(train_dataset), len(test_dataset)

(575777, 1000)

In [9]:
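# Note: the Trainer used later receives the datasets directly and builds its own loaders.
train_dataloader = DataLoader(train_dataset, batch_size=4, drop_last=True, shuffle=True, num_workers=8)
# Evaluation does not need shuffling.
test_dataloader = DataLoader(test_dataset, batch_size=4, drop_last=True, shuffle=False, num_workers=8)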

# T5 model fine-tuning for paraphrasing

Let us take the T5 paraphrasing model fine-tuned by the Skolkovo Institute ([link](https://huggingface.co/s-nlp/t5-paraphrase-paws-msrp-opinosis-paranmt)).

- This model is fine-tuned for paraphrasing on the PAWS, MSRP, Opinosis, and ParaNMT datasets, which together form a fairly large corpus.
- We will fine-tune it further on the ParaNMT toxicity subset provided for this assignment (and further processed in the previous notebooks).

In [9]:
model = T5ForConditionalGeneration.from_pretrained('SkolkovoInstitute/t5-paraphrase-paws-msrp-opinosis-paranmt')

In [None]:
# With CUDA_VISIBLE_DEVICES='1' set above, the only visible GPU is exposed as cuda:0.
device = torch.device('cuda:0')
model.to(device)

In [22]:
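class TrainingArgumentsWrapper(TrainingArguments):
    """TrainingArguments that pins training to the `device` selected above."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Workaround: set explicitly, since the overridden `_setup_devices`
        # below does not create this attribute.
        self.distributed_state = None

    @cached_property
    def _setup_devices(self):
        # Bypass the library's device auto-detection and use our device.
        return device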
    

class DataCollatorWithPadding:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(
            self,
            features: List[Dict[str, Union[List[int], torch.Tensor]]]
        ) -> Dict[str, torch.Tensor]:
        
        batch = self.tokenizer.pad(
            features,
            padding=True,
        )
        y_batch = self.tokenizer.pad(
            {
                'input_ids': batch['labels'],
                'attention_mask': batch['decoder_attention_mask']
            },
            padding=True,
        ) 
        batch['labels'] = y_batch['input_ids']
        batch['decoder_attention_mask'] = y_batch['attention_mask']
        
        return {k: torch.tensor(v) for k, v in batch.items()}
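
The collator above pads `labels` with the tokenizer's pad token, so padded positions still contribute to the loss. A common alternative is to pad labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, which T5's loss computation relies on. The sketch below shows only that padding step on plain lists; `pad_labels` and the toy token ids are illustrative assumptions, and the run logged in this notebook did not use it.

```python
from typing import List

# Label positions with this value are skipped by PyTorch's cross-entropy loss by default.
IGNORE_INDEX = -100


def pad_labels(
    label_ids: List[List[int]], ignore_index: int = IGNORE_INDEX
) -> List[List[int]]:
    """Right-pad every label sequence to the longest one with `ignore_index`."""
    max_len = max(len(ids) for ids in label_ids)
    return [ids + [ignore_index] * (max_len - len(ids)) for ids in label_ids]


# Two toy target sequences of different lengths.
batch_labels = [[71, 5, 1], [8, 1]]
padded = pad_labels(batch_labels)
print(padded)  # [[71, 5, 1], [8, 1, -100]]
```

Inside a collator, the padded lists would be converted to a tensor and assigned to `batch['labels']` in place of the pad-token version.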

In [23]:
dir_name = 't5-paraphrase'

training_args = TrainingArgumentsWrapper(
    output_dir=f'../models/{dir_name}',
    overwrite_output_dir=False,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=8,
    warmup_steps=300,
    weight_decay=0,
    learning_rate=3e-5,
    logging_dir=f'../models/logs/{dir_name}',
    logging_steps=100,
    eval_steps=100,
    evaluation_strategy='steps',
    save_total_limit=1,
    save_steps=5000,
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
gc.collect()
torch.cuda.empty_cache()

trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
100,0.6389,0.615286
200,0.6272,0.611436
300,0.629,0.608691
400,0.6295,0.605242
500,0.6206,0.603965
600,0.6148,0.602196
700,0.6152,0.601779
800,0.6211,0.599413
900,0.6226,0.598821
1000,0.6202,0.59806


TrainOutput(global_step=9012, training_loss=0.5987315572636212, metrics={'train_runtime': 6325.9929, 'train_samples_per_second': 182.351, 'train_steps_per_second': 1.425, 'total_flos': 6.582223103510016e+16, 'train_loss': 0.5987315572636212, 'epoch': 2.0})
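
The logged `global_step` can be sanity-checked against the training arguments. The sketch below is a rough estimate, assuming a single device and that the last partial batch of each epoch is kept; `optimizer_steps` is a hypothetical helper, not a `Trainer` API.

```python
import math


def optimizer_steps(
    num_examples: int,
    per_device_batch: int,
    grad_accum: int,
    epochs: int,
    num_devices: int = 1,
) -> int:
    """Rough number of optimizer updates over a full training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = batches_per_epoch // grad_accum
    return updates_per_epoch * epochs


# Values from the TrainingArguments cell and the training set size.
expected = optimizer_steps(575_777, per_device_batch=4, grad_accum=4, epochs=2)
print(expected)  # 71972

# Examples consumed per optimizer step according to the logged run.
examples_per_step = 575_777 * 2 / 9012
print(round(examples_per_step, 1))  # 127.8
```

The logged run averages about 128 examples per optimizer step, rather than the 16 (4 × 4) that the listed arguments give on one device, so it was presumably launched with a larger effective batch size than the cell above shows.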

In [None]:
trainer.evaluate()

{'eval_loss': 0.5775772929191589,
 'eval_runtime': 1.5977,
 'eval_samples_per_second': 625.918,
 'eval_steps_per_second': 20.029,
 'epoch': 2.0}

Let us run inference with the resulting model on an example:

In [None]:
model.eval()

toxic_sample = "I'm going to hit you in all directions, civil and criminal, on all counts."

inputs = tokenizer(toxic_sample, return_tensors='pt')
inputs = {
    k: v.to(device) for k, v in inputs.items()
}

for t in model.generate(**inputs, num_return_sequences=10, do_sample=False, num_beams=10):
    print(tokenizer.decode(t, skip_special_tokens=True))



I'm going to hit you in all directions, civil and criminal, on all counts.
I'll hit you in all directions, civil and criminal, on all counts.
I'm gonna hit you in all directions, civil and criminal, on all counts.
I'm going to take you in all directions, civil and criminal, on all counts.
I'm going to strike you in all directions, civil and criminal, on all counts.
I'll take you in all directions, civil and criminal, on all counts.
I'll strike you in all directions, civil and criminal, on all counts.
I'm going to hit you in every direction, civil and criminal, on all counts.
I'll be hitting you in all directions, civil and criminal, on all counts.
I'm going to hit you in all directions, civil and criminal.
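
The top beam above is an exact copy of the input. If such copies are unwanted, a small post-processing step can drop them. The sketch below is one possible filter, not part of the original pipeline; `drop_copies` is a hypothetical helper, and it compares strings case-insensitively after trimming whitespace.

```python
from typing import List


def drop_copies(source: str, candidates: List[str]) -> List[str]:
    """Drop candidates equal to the source and repeated candidates, keeping order."""
    seen = {source.strip().lower()}
    kept = []
    for text in candidates:
        key = text.strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


source = "I'm going to hit you in all directions, civil and criminal, on all counts."
# First three beams from the output above.
beams = [
    "I'm going to hit you in all directions, civil and criminal, on all counts.",
    "I'll hit you in all directions, civil and criminal, on all counts.",
    "I'm gonna hit you in all directions, civil and criminal, on all counts.",
]
for text in drop_copies(source, beams):
    print(text)
```

Only the second and third beams are printed, since the first one matches the source.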


Finally, let's save the model:

In [None]:
model.save_pretrained(f'../models/{dir_name}')