### 3.0 Notebook for training the T5 model
**Hypothesis:** The T5 model, applied as in a translation task, can be used for text detoxification.
All cells were run on Google Colab, so the paths are set up for Colab. To run the notebook locally, change the paths accordingly.
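
One way to avoid editing every path by hand is to resolve the repository root once and derive the other paths from it. The sketch below is only an illustration: the helper `resolve_repo_root` and the environment variable `DETOX_REPO_ROOT` are hypothetical names, not part of the repository.

```python
import os
from pathlib import Path

# Colab location used throughout this notebook.
COLAB_ROOT = Path("/content/Text-de-toxification-PMLDL-IU")

def resolve_repo_root(env_var: str = "DETOX_REPO_ROOT") -> Path:
    """Pick the repository root: env override first, then the Colab path, else cwd.

    The environment variable name is an assumption made for this sketch.
    """
    override = os.environ.get(env_var)
    if override:
        return Path(override)
    if COLAB_ROOT.exists():
        return COLAB_ROOT
    return Path.cwd()

repo_root = resolve_repo_root()
dataset_path = repo_root / "data" / "interim" / "dataset"
model_path = repo_root / "models" / "t5_small_tuned"
print(dataset_path)
print(model_path)
```

The later cells could then use `str(dataset_path)` and `str(model_path)` instead of the hard-coded `/content/...` strings.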

In [1]:
!git clone https://github.com/ivancheroleg/Text-de-toxification-PMLDL-IU

!pip install -r /content/Text-de-toxification-PMLDL-IU/requirements.txt

Cloning into 'Text-de-toxification-PMLDL-IU'...
remote: Enumerating objects: 100, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 100 (delta 4), reused 10 (delta 3), pack-reused 88
Receiving objects: 100% (100/100), 43.20 MiB | 21.84 MiB/s, done.
Resolving deltas: 100% (28/28), done.
Updating files: 100% (23/23), done.
Collecting wget (from -r /content/Text-de-toxification-PMLDL-IU/requirements.txt (line 1))
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers (from -r /content/Text-de-toxification-PMLDL-IU/requirements.txt (line 5))
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
     7.9/7.9 MB 15.3 MB/s eta 0:00:00
Collecting datasets (from -r /content/Text-de-toxification-PMLDL-IU/requirements.txt (line 7))
  Downloading datasets-2.14.6-py3-none-any.whl (49

Load the dataset from the local file

In [2]:
from datasets import load_from_disk

# load dataset from local file
dataset = load_from_disk("/content/Text-de-toxification-PMLDL-IU/data/interim/dataset")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 502214
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 27900
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 27900
    })
})

In [4]:
model_checkpoint = "t5-small"

from transformers import AutoTokenizer

# we will use autotokenizer for this purpose
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

### Preprocessing

In [5]:
prefix = "Translate toxic to non-toxic:"

max_input_length = 128
max_target_length = 128
source_sentence = "toxic"
target_sentence = "non-toxic"

def preprocess_function(examples):
    # Inputs
    inputs = [prefix + example[source_sentence] for example in examples["translation"]]
    targets = [example[target_sentence] for example in examples["translation"]]

    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

preprocess_function(dataset["train"][:2])

{'input_ids': [[30355, 15, 12068, 12, 529, 18, 14367, 10, 99, 491, 4031, 8347, 7, 160, 28, 160, 2550, 2670, 6, 34, 133, 3209, 8, 306, 1425, 13, 6567, 7031, 1538, 449, 5, 1], [30355, 15, 12068, 12, 529, 18, 14367, 10, 4188, 31, 60, 2852, 27635, 53, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[3, 99, 491, 4031, 19, 18368, 160, 28, 26829, 2670, 6, 24, 3, 9453, 8, 306, 593, 13, 6567, 7031, 1538, 4849, 5, 1], [230, 25, 31, 60, 652, 23147, 5, 1]]}
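
In batched mode, `examples["translation"]` is a list of dicts with the keys `toxic` and `non-toxic`. The sketch below mimics the structure of `preprocess_function` with a whitespace split standing in for the real tokenizer (an assumption made so the example runs without downloading anything; the sentences are made up). It also shows that, because the prefix has no trailing space, it is joined directly to the first word of the input.

```python
prefix = "Translate toxic to non-toxic:"
max_length = 8  # deliberately tiny so that truncation is visible

def toy_tokenize(text, max_length):
    """Whitespace stand-in for the real tokenizer: split, then truncate."""
    return text.split()[:max_length]

def toy_preprocess(examples):
    # Same list comprehensions as in preprocess_function above
    inputs = [prefix + ex["toxic"] for ex in examples["translation"]]
    targets = [ex["non-toxic"] for ex in examples["translation"]]
    return {
        "input_tokens": [toy_tokenize(t, max_length) for t in inputs],
        "label_tokens": [toy_tokenize(t, max_length) for t in targets],
    }

# Made-up batch in the same layout as dataset["train"][:2]
batch = {"translation": [
    {"toxic": "this is a rude sentence with quite a few extra words",
     "non-toxic": "this is a polite sentence"},
    {"toxic": "shut up", "non-toxic": "be quiet"},
]}

out = toy_preprocess(batch)
print(out["input_tokens"][0])  # cut to 8 tokens; 4th token is 'non-toxic:this'
print(out["label_tokens"][1])
```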

### Metrics

In [6]:
from datasets import load_metric

# Note: load_metric is deprecated in datasets 2.x and emits a FutureWarning;
# the warning points to evaluate.load("sacrebleu") from the evaluate library.
metric = load_metric("sacrebleu")

In [7]:
from datasets import DatasetDict

# For demonstration purposes, crop the dataset: the first 25,000 examples for
# training and the first 2,500 each for validation and test.
# A new DatasetDict is built so that the full `dataset` object stays untouched.
cropped_datasets = DatasetDict({
    'train': dataset['train'].select(range(25000)),
    'validation': dataset['validation'].select(range(2500)),
    'test': dataset['test'].select(range(2500)),
})
tokenized_datasets = cropped_datasets.map(preprocess_function, batched=True)

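
A point worth keeping in mind when cropping splits: plain assignment such as `cropped = full` binds a second name to the same mapping rather than copying it, so assigning a split through either name changes both. A plain-dict sketch of the difference (the names and toy data are made up; a `DatasetDict` is a dict subclass and behaves the same way in this respect):

```python
full = {"train": list(range(10)), "test": list(range(4))}

alias = full                        # same object, no copy is made
alias["train"] = full["train"][:3]  # crops the split seen through both names
print(len(full["train"]))           # 3

full = {"train": list(range(10)), "test": list(range(4))}
cropped = {name: rows[:3] for name, rows in full.items()}  # new mapping
print(len(full["train"]), len(cropped["train"]))           # 10 3
```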

Declare the model

In [8]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Create a model for the pretrained model
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


Declare the training arguments
To visualize the training process, we use the wandb (Weights & Biases) API, which logs the metrics and plots them during training.

In [9]:
# Defining the parameters for training
batch_size = 32
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-{source_sentence}-to-{target_sentence}",
    evaluation_strategy = "epoch",
    learning_rate=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    report_to="wandb",
)
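
With these settings the number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation (neither is set in the arguments above); under those assumptions the result agrees with the `global_step=7820` reported by `trainer.train()` further down.

```python
import math

train_examples = 25_000  # size of the cropped training split
batch_size = 32          # per_device_train_batch_size
epochs = 10              # num_train_epochs

# Assumption: one device, no gradient accumulation.
# The last batch of an epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 782 7820
```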

In [10]:
# Instead of writing a custom collate_fn, use DataCollatorForSeq2Seq,
# which builds the batches for training.

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
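
The core idea of the collator is dynamic padding: each batch is padded to its own longest sequence, with inputs padded by the pad token and labels padded by `-100` so that the loss ignores those positions. The following is a simplified pure-Python sketch of that idea, not the library's implementation (the real collator also returns tensors and handles more cases); `pad_batch` and the toy token ids are made up for illustration.

```python
PAD_TOKEN_ID = 0     # T5's pad token id
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss

def pad_batch(features):
    """Pad every sequence in the batch to the longest one in that batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * in_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * in_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * lab_pad)
    return batch

# Toy features in the same layout as the tokenized dataset
features = [
    {"input_ids": [30355, 15, 12068, 1], "labels": [230, 25, 1]},
    {"input_ids": [4188, 1], "labels": [1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

The `-100` label padding is the reason `compute_metrics` has to replace `-100` before decoding the labels.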

In [11]:
import numpy as np

# simple postprocessing for text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

# compute metrics function to pass to trainer
def compute_metrics(eval_preds):
    preds, labels = eval_preds

    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)

    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
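
The two NumPy steps in `compute_metrics` can be checked on a toy example. The arrays below are made up; the pad id `0` is T5's pad token id. Generated T5 sequences start with the decoder start token, which is the pad token, so that leading token is not counted in `gen_len`.

```python
import numpy as np

pad_token_id = 0  # T5's pad token id

# Toy labels: -100 marks padded positions
labels = np.array([[230, 25, 1, -100, -100],
                   [3, 99, 491, 4031, 1]])
# Toy predictions: leading 0 is the decoder start token, trailing 0 is padding
preds = np.array([[0, 230, 25, 1, 0],
                  [0, 3, 99, 491, 1]])

# Swap -100 for a real token id so the labels can be decoded
decodable = np.where(labels != -100, labels, pad_token_id)

# Generated length = number of non-pad tokens per prediction
gen_lens = [int(np.count_nonzero(p != pad_token_id)) for p in preds]

print(decodable.tolist())
print(gen_lens, float(np.mean(gen_lens)))  # [3, 4] 3.5
```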

Run garbage collection and empty the CUDA cache

In [12]:
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

Seq2SeqTrainer

In [13]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [14]:
trainer.train()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 2.6434 | 2.338291 | 17.8522 | 14.0844 |
| 2 | 2.3653 | 2.291163 | 18.4731 | 13.66 |
| 3 | 2.2549 | 2.274145 | 18.7538 | 13.8148 |
| 4 | 2.1586 | 2.24241 | 18.9396 | 13.9384 |
| 5 | 2.031 | 2.176152 | 19.8517 | 13.6668 |
| 6 | 1.8446 | 2.159206 | 20.2504 | 13.7808 |
| 7 | 1.7047 | 2.157639 | 19.9139 | 13.7948 |
| 8 | 1.4622 | 2.19029 | 20.3589 | 13.8188 |
| 9 | 1.2741 | 2.245464 | 20.2025 | 13.834 |
| 10 | 1.1245 | 2.35466 | 20.0229 | 13.722 |




TrainOutput(global_step=7820, training_loss=1.8684141856630134, metrics={'train_runtime': 1431.4294, 'train_samples_per_second': 174.651, 'train_steps_per_second': 5.463, 'total_flos': 3749809557602304.0, 'train_loss': 1.8684141856630134, 'epoch': 10.0})
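
In the training log above, the training loss keeps falling while the validation loss stops improving after epoch 7 and then rises, which usually indicates overfitting in the later epochs. A small sketch that picks the best epoch from the logged numbers (values copied from the log; the selection criteria are a suggestion, not something the notebook does):

```python
# (epoch, training loss, validation loss, BLEU), copied from the training log
history = [
    (1, 2.6434, 2.338291, 17.8522),
    (2, 2.3653, 2.291163, 18.4731),
    (3, 2.2549, 2.274145, 18.7538),
    (4, 2.1586, 2.24241, 18.9396),
    (5, 2.031, 2.176152, 19.8517),
    (6, 1.8446, 2.159206, 20.2504),
    (7, 1.7047, 2.157639, 19.9139),
    (8, 1.4622, 2.19029, 20.3589),
    (9, 1.2741, 2.245464, 20.2025),
    (10, 1.1245, 2.35466, 20.0229),
]

best_by_loss = min(history, key=lambda row: row[2])  # lowest validation loss
best_by_bleu = max(history, key=lambda row: row[3])  # highest BLEU

print("best epoch by validation loss:", best_by_loss[0])  # 7
print("best epoch by BLEU:", best_by_bleu[0])              # 8
```

The notebook saves the model after the final epoch. If the best checkpoint is wanted instead, `Seq2SeqTrainingArguments` offers `load_best_model_at_end` together with `metric_for_best_model`; using them would also require a matching save strategy, which is not configured here.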

### Save the model and run inference

In [15]:
# saving model
trainer.save_model('/content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned')

In [16]:
# load the saved model and prepare it for inference
model = AutoModelForSeq2SeqLM.from_pretrained('/content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned')
model.eval()
model.config.use_cache = False

### Evaluation
The cell below evaluates the model on the test split.

In [17]:
trainer.evaluate(tokenized_datasets["test"])



{'eval_loss': 2.3751399517059326,
 'eval_bleu': 19.7379,
 'eval_gen_len': 13.7448,
 'eval_runtime': 65.8251,
 'eval_samples_per_second': 37.979,
 'eval_steps_per_second': 1.2,
 'epoch': 10.0}

In [19]:
def translate(model, inference_request, tokenizer=tokenizer):
    """
    Translate (detoxify) a text with the given model.
    :param model: fine-tuned seq2seq model
    :param inference_request: text to translate
    :param tokenizer: tokenizer for the model
    :return: translated text
    """

    # Note: the text is tokenized as-is, without the training `prefix`.
    input_ids = tokenizer(inference_request, return_tensors="pt").input_ids
    # generate() runs with its default settings, including the default
    # length limit, so long outputs may be cut off.
    outputs = model.generate(input_ids=input_ids)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

### Inference

In [20]:
for i in range(5, 8):
    text = cropped_datasets['train']['translation'][i]['toxic']
    print('-------------------')
    print(text)
    print(translate(model, text, tokenizer))

-------------------
i'm not gonna have a child... ...with the same genetic disorder as me who's gonna die. l...




i'm not going to have a baby with a genetic disorder as i
-------------------
they're all laughing at us, so we'll kick your ass.
they're laughing at us, so we'll cut your neck.
-------------------
maine was very short on black people back then.
the maine was very short black on black people.


In [26]:
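import locale
print(locale.getpreferredencoding())

# Override the preferred encoding with UTF-8 (the reported one is
# ANSI_X3.4-1968), presumably so that the shell command in the next cell runs on Colab.
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

ANSI_X3.4-1968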


In [29]:
!zip -r t5_small_tuned.zip /content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned

  adding: content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned/ (stored 0%)
  adding: content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned/tokenizer_config.json (deflated 95%)
  adding: content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned/model.safetensors (deflated 7%)
  adding: content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned/special_tokens_map.json (deflated 86%)
  adding: content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned/training_args.bin (deflated 51%)
  adding: content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned/generation_config.json (deflated 29%)
  adding: content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned/config.json (deflated 62%)
  adding: content/Text-de-toxification-PMLDL-IU/models/t5_small_tuned/tokenizer.json (deflated 74%)
