# Experiments with Transformer-based models

Let us reproduce the T5 fine-tuning that the authors of the reference paper describe in [their work](https://huggingface.co/s-nlp/t5-paranmt-detox). I plan to try several T5 model sizes. The setup is inspired by [the fine-tuning Translation example from HuggingFace](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb).

Let's start with the `t5-base` model, as the authors of the paper did (they actually started from a `t5-base` checkpoint that had already been fine-tuned on paraphrasing).

In [1]:
import warnings
import numpy as np
import pandas as pd
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'
import torch


# Set seeds
torch.manual_seed(42)
np.random.seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic = True
warnings.filterwarnings('ignore')

## Data preprocessing

In [15]:
from datasets import Dataset, DatasetDict
from transformers import T5Tokenizer


tokenizer_checkpoint_path = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(tokenizer_checkpoint_path)

train_df = pd.read_csv('../data/interim/train.csv', index_col=0)
val_df = pd.read_csv('../data/interim/val.csv', index_col=0)
test_df = pd.read_csv('../data/interim/test.csv', index_col=0)

In [16]:
prefix = "detoxify:"
max_input_length = 256
max_target_length = 256


def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["reference"]]
    targets = [ex for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
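
To make the shape of the preprocessing output easier to see, here is a minimal sketch that mirrors `preprocess_function` with a toy whitespace tokenizer. `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins introduced only for illustration; the real T5 tokenizer uses SentencePiece subwords and also appends an EOS token. Note that the prefix `"detoxify:"` has no trailing space, so it is glued to the first word of the sentence.

```python
# Toy stand-in for the real tokenizer: one "token id" per whitespace-separated word.
# This is an illustrative assumption, not how T5's SentencePiece tokenizer works.
def toy_tokenize(texts, max_length):
    vocab = {}
    batch = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
        batch.append(ids[:max_length])  # like truncation=True: keep at most max_length ids
    return {"input_ids": batch}


def toy_preprocess(examples, prefix="detoxify:", max_input_length=4, max_target_length=4):
    # Same structure as preprocess_function: prefixed sources become inputs,
    # tokenized targets become labels.
    inputs = [prefix + ex for ex in examples["reference"]]
    model_inputs = toy_tokenize(inputs, max_input_length)
    labels = toy_tokenize(examples["translation"], max_target_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


toy_batch = {
    "reference": ["call out your butchers and I'll tell you."],
    "translation": ["Listen, call off the butchers, and I'll tell you."],
}
toy_features = toy_preprocess(toy_batch)
print(toy_features)
```

Each example ends up with `input_ids` for the prefixed toxic sentence and `labels` for the neutral target, both cut to the configured maximum length.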

### Build HuggingFace dataset

In [17]:
train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)
test_ds = Dataset.from_pandas(test_df)
ds = DatasetDict()
ds['train'] = train_ds
ds['val'] = val_ds
ds['test'] = test_ds
ds = ds.map(preprocess_function, batched=True)

Map:   0%|          | 0/506998 [00:00<?, ? examples/s]

Map:   0%|          | 0/56334 [00:00<?, ? examples/s]

Map:   0%|          | 0/14445 [00:00<?, ? examples/s]
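
As a quick sanity check, the split sizes shown in the progress bars above can be turned into proportions. This is only arithmetic on the reported counts; the variable names are introduced here for illustration.

```python
# Split sizes taken from the Map progress bars above.
split_sizes = {"train": 506_998, "val": 56_334, "test": 14_445}

total = sum(split_sizes.values())
# Share of each split in percent, rounded to two decimals.
split_shares = {name: round(100 * size / total, 2) for name, size in split_sizes.items()}
print(total, split_shares)
```
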

In [18]:
hf_ds_dir = '../data/interim/dshf'
ds.save_to_disk(hf_ds_dir)

Saving the dataset (0/1 shards):   0%|          | 0/506998 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/56334 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/14445 [00:00<?, ? examples/s]

## Model training

In [2]:
from datasets import load_from_disk


hf_ds_dir = '../data/interim/dshf/'
ds = load_from_disk(hf_ds_dir)

In [3]:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments
from transformers import T5Tokenizer, T5ForConditionalGeneration


model_name = 't5-base'
batch_size = 64
training_model_name = f"{model_name}-finetuned-paranmt500k-detox"

args = Seq2SeqTrainingArguments(
    output_dir=f'../models/{training_model_name}',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    report_to='wandb',
)

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

[2023-11-06 00:09:06,167] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [4]:
from datasets import load_metric


metric = load_metric("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
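
Two steps in `compute_metrics` are easy to miss: label positions padded with `-100` are mapped back to the pad token before decoding, and `gen_len` counts only non-pad tokens in each prediction. The following pure-Python sketch reproduces both steps on toy ids. The helper names and the toy ids are hypothetical; the pad id of 0 matches T5, whose decoder also starts every generated sequence with the pad token.

```python
IGNORE_INDEX = -100  # value the seq2seq data collator uses to pad label sequences
PAD_TOKEN_ID = 0     # T5's pad token id


def unmask_labels(label_rows, pad_token_id=PAD_TOKEN_ID):
    # Pure-Python equivalent of np.where(labels != -100, labels, pad_token_id).
    return [[tok if tok != IGNORE_INDEX else pad_token_id for tok in row] for row in label_rows]


def generated_lengths(pred_rows, pad_token_id=PAD_TOKEN_ID):
    # Pure-Python equivalent of np.count_nonzero(pred != pad_token_id) per row.
    return [sum(tok != pad_token_id for tok in row) for row in pred_rows]


toy_labels = [[71, 12, 1, -100, -100], [5, 9, 33, 8, 1]]
toy_preds = [[0, 71, 12, 1, 0, 0], [0, 5, 9, 33, 8, 1]]

restored = unmask_labels(toy_labels)
lengths = generated_lengths(toy_preds)
mean_len = sum(lengths) / len(lengths)
print(restored, lengths, mean_len)
```

Without the first step the tokenizer would be asked to decode `-100`, which is not a valid token id.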

In [5]:
from transformers import Seq2SeqTrainer


trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=ds['train'],
    eval_dataset=ds['val'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [6]:
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: makharev (lora_yandex_sota). Use `wandb login --relogin` to force relogin


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

Unfortunately, training was interrupted, so I only have an intermediate checkpoint (its step count is very close to that of the final one). I will use it for inference.
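
A rough estimate of how close that checkpoint is to the end of the epoch can be sketched from the numbers in this notebook. The calculation assumes that the trainer used all three devices listed in `CUDA_VISIBLE_DEVICES`, so that the effective batch is three times the per-device batch, and that no gradient accumulation was configured; neither assumption is confirmed by the logs. The checkpoint step comes from the `checkpoint-2500` directory loaded in the inference section.

```python
import math

train_examples = 506_998      # size of the train split
per_device_batch_size = 64    # batch_size passed to Seq2SeqTrainingArguments
num_gpus = 3                  # assumption: all devices in CUDA_VISIBLE_DEVICES='1,2,3' are used
num_epochs = 1
checkpoint_step = 2500        # from the checkpoint-2500 directory name

effective_batch = per_device_batch_size * num_gpus
total_steps = math.ceil(train_examples / effective_batch) * num_epochs
progress = checkpoint_step / total_steps
print(f"{total_steps} steps in total, checkpoint covers {progress:.1%}")
# prints: 2641 steps in total, checkpoint covers 94.7%
```

Under these assumptions the checkpoint has seen roughly 95% of the epoch. With a single GPU the same checkpoint would cover a much smaller share of the data.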

## Model inference

In [11]:
from transformers import T5ForConditionalGeneration, T5Tokenizer


tokenizer_name = 't5-base'
trained_model_name = f"../models/{tokenizer_name}-finetuned-paranmt500k-detox/checkpoint-2500"

tokenizer = T5Tokenizer.from_pretrained(tokenizer_name)
model = T5ForConditionalGeneration.from_pretrained(trained_model_name)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

In [75]:
from tqdm import tqdm


def detoxify_inference(sentences, prefix="detoxify:", top_k=120, max_length=256):
    """Generate one detoxified paraphrase per sentence with top-k / nucleus sampling."""
    outputs = []
    for sentence in tqdm(sentences):
        # The tokenizer appends the EOS token (</s>) itself, so it is not added by hand.
        text = prefix + sentence

        # A single sentence needs no padding, so the deprecated pad_to_max_length flag is dropped.
        encoding = tokenizer(text, return_tensors="pt")
        input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

        # early_stopping only affects beam search, so it is omitted for pure sampling.
        model_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_masks,
            do_sample=True,
            max_length=max_length,
            top_k=top_k,
            top_p=0.98,
            num_return_sequences=1,
        )
        for output in model_output:
            generated_sent = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            outputs.append(generated_sent)
    return outputs
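
The `top_k=120` and `top_p=0.98` arguments restrict which tokens can be sampled at each decoding step. The sketch below illustrates the idea on a toy distribution: keep the `top_k` most likely tokens, renormalise, then keep the smallest set of tokens whose cumulative probability reaches `top_p`. The function `top_k_top_p_filter` is a hypothetical, simplified illustration and not the library's implementation, which works on logits.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]  # renormalise after the top-k cut

    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:  # the token that crosses the threshold is still kept
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalise after the top-p cut


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)
```

With values as permissive as 120 and 0.98, only the very unlikely tail of the vocabulary is removed, so the generated paraphrases remain fairly diverse and differ between runs.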

In [20]:
import pandas as pd


# The data should be downloaded and preprocessed, use 1.0-download-raw-data.ipynb and 1.2-data-preprocessing.ipynb notebooks
test_data_path = '../data/interim/test.csv'
test_df = pd.read_csv(test_data_path, index_col=0)
test_df.head(1)

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
57809,call out your butchers and I'll tell you.,"Listen, call off the butchers, and I'll tell you.",0.761781,0.16,0.968063,0.005592


### Manual testing

In [29]:
n_samples = 25
sampled_test = test_df.sample(n=n_samples, random_state=111)
sampled_test.head()

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
49901,One way to friggin' Mexico.,straight to Mexico.,0.621713,0.285714,0.973398,0.002661
491891,"You know, it pains me to find out shit about y...","you know, it also hurts me to learn things abo...",0.797427,0.018182,0.998989,0.000465
259555,This kid is like a boar.,this guy is like a stud.,0.652906,0.0,0.991821,0.001623
187825,"James. What are you doing to me with your big,...","what are you doing to me!... your big, big, bi...",0.838453,0.163934,0.978909,0.002112
330314,"deep down, you want someone to suck your beaut...","Deep down, you long to have your perfect feath...",0.688693,0.032787,0.998196,0.000141


In [32]:
predictions = detoxify_inference(sampled_test['reference'])

100%|██████████| 25/25 [00:04<00:00,  5.42it/s]


In [34]:
for ref, trn, pred in zip(sampled_test['reference'], sampled_test['translation'], predictions):
    print(f'INIT: {ref}')
    print(f'GOLD: {trn}')
    print(f'PRED: {pred}')
    print('------------------------')

INIT: One way to friggin' Mexico.
GOLD: straight to Mexico.
PRED: that's the only way to Mexico.
------------------------
INIT: You know, it pains me to find out shit about you, too.
GOLD: you know, it also hurts me to learn things about you.
PRED: you know, I have to find out things about you, as well.
------------------------
INIT: This kid is like a boar.
GOLD: this guy is like a stud.
PRED: this baby is like a horn.
------------------------
INIT: James. What are you doing to me with your big, fat, hard...?
GOLD: what are you doing to me!... your big, big, big...
PRED: James, what are you doing to me, big, deep,...?
------------------------
INIT: deep down, you want someone to suck your beautiful feathers.
GOLD: Deep down, you long to have your perfect feathers ruffled.
PRED: Deep down, do you want another nice friend to suck on your feathers.
------------------------
INIT: yeah, but if it rains, we're fucked.
GOLD: Yes, but if it rains, we're buggered.
PRED: Actually, but if it rai

### Generate results on test data

In [79]:
refs = test_df['reference']
preds = detoxify_inference(refs)

100%|██████████| 14445/14445 [40:49<00:00,  5.90it/s] 


In [80]:
len(preds)

14445

In [81]:
import os


# Note: the 'inputs' column stores the gold detoxified translations, not the toxic source sentences.
result_df = pd.DataFrame({'inputs': test_df['translation'][:len(preds)], 'preds': preds})

# Create the output directory if it does not exist yet.
os.makedirs('../data/interim/model-outputs', exist_ok=True)

save_result_path = '../data/interim/model-outputs/t5-base-detox.csv'
result_df.to_csv(save_result_path)