# Training an MT5 model for Slovene paraphrasing

## Main configuration

In [1]:
initial_finetuning = True  # True only when fine-tuning starts from the Hugging Face checkpoint. Set to False to continue training from a checkpoint saved on Google Drive.
hf_checkpoint = 'google/mt5-small'
drive_checkpoint = ''  # e.g. '/content/drive/MyDrive/models/old-checkpoint-234/'
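
The three settings above jointly decide where the model is loaded from later on. As a minimal sketch, the selection could be wrapped in a small helper that fails early when `drive_checkpoint` is empty. The function name `resolve_checkpoint` is hypothetical and not part of the notebook or of any library.

```python
def resolve_checkpoint(initial_finetuning, hf_checkpoint, drive_checkpoint):
    """Return the name or path the model should be loaded from (hypothetical helper)."""
    if initial_finetuning:
        # First run: start from the pretrained Hugging Face checkpoint.
        return hf_checkpoint
    if not drive_checkpoint:
        # Continuing training only makes sense if a saved checkpoint is given.
        raise ValueError("drive_checkpoint must be set when initial_finetuning is False")
    return drive_checkpoint


print(resolve_checkpoint(True, 'google/mt5-small', ''))
```

Failing at configuration time avoids discovering an empty path only after the environment setup and data preparation have already run.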

## Environment Setup

We need a GPU, so we first check that one is available:

In [2]:
!nvidia-smi

Mon May 22 10:57:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Then we install all the libraries we need.

In [3]:
!pip install datasets==2.11.0 transformers==4.28.0 nltk==3.8.1 parascore==1.0.5 sentencepiece==0.1.98

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets==2.11.0
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting parascore==1.0.5
  Downloading parascore-1.0.5-py3-none-any.whl (15 kB)
Collecting sentencepiece==0.1.98
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets==2.11.0)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)

We store checkpoints on Google Drive. After we have mounted our Google Drive, the root folder of our Drive is at `/content/drive/MyDrive/`.

In [4]:
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


## Data Download and Preparation

In [5]:
from datasets import load_dataset

We use a dataset that we created ourselves for Slovene-Slovene paraphrases.

In [6]:
raw_dataset = load_dataset('yawnick/para_crawl_slsl')
raw_dataset

Downloading and preparing dataset csv/yawnick--para_crawl_slsl to /root/.cache/huggingface/datasets/yawnick___csv/yawnick--para_crawl_slsl-e1050e0a2fc9f827/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/yawnick___csv/yawnick--para_crawl_slsl-e1050e0a2fc9f827/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.



DatasetDict({
    train: Dataset({
        features: ['Original', 'Paraphrase'],
        num_rows: 55544
    })
    test: Dataset({
        features: ['Original', 'Paraphrase'],
        num_rows: 11532
    })
    validation: Dataset({
        features: ['Original', 'Paraphrase'],
        num_rows: 9803
    })
})

Let's store the splits separately and look at one example.

In [7]:
raw_dataset_train = raw_dataset['train']
raw_dataset_val = raw_dataset['validation']
raw_dataset_test = raw_dataset['test']
raw_dataset_train[5]

{'Original': 'In seveda po igranje enkrat, boste želeli, da pridejo nazaj in naprej v igri skozi čas.',
 'Paraphrase': 'In zagotovo po igranju enkrat, boste želeli priti nazaj in nadaljevati tekmo skozi čas.'}

Now, let's prepare the data for training.

In [8]:
from transformers import AutoTokenizer

In [9]:
tokenizer = AutoTokenizer.from_pretrained(hf_checkpoint)




Let's see how the tokenizer works:

In [11]:
s1 = raw_dataset_train[5]['Original']
s2 = raw_dataset_train[5]['Paraphrase']
print(s1)
print(s2)
inputs = tokenizer(s1, text_target=s2)
print([tokenizer.decode(token_id) for token_id in inputs['input_ids']])
inputs

In seveda po igranje enkrat, boste želeli, da pridejo nazaj in naprej v igri skozi čas.
In zagotovo po igranju enkrat, boste želeli priti nazaj in nadaljevati tekmo skozi čas.
['In', 'se', 'veda', 'po', 'igra', 'nje', 'en', 'krat', ',', 'bost', 'e', 'žele', 'li', ',', 'da', 'pride', 'jo', 'na', 'zaj', 'in', 'na', 'prej', 'v', '', 'igri', '', 's', 'kozi', 'čas', '.', '</s>']


{'input_ids': [563, 303, 42298, 485, 55279, 8757, 289, 14972, 261, 40586, 265, 141159, 494, 261, 350, 42612, 1113, 294, 32801, 281, 294, 38191, 300, 259, 160648, 259, 263, 64615, 4043, 260, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [563, 466, 82429, 268, 485, 55279, 36106, 289, 14972, 261, 40586, 265, 141159, 494, 979, 524, 294, 32801, 281, 294, 67077, 141658, 8673, 1233, 259, 263, 64615, 4043, 260, 1]}

Now we create a preprocess function that turns a dataset item into a form that the model can use for training.

In [12]:
max_length = 128

# The prefix may need to be adjusted (possibly dynamically) depending on the language or when training multilingually.
prefix = 'paraphrase: '

def preprocess_function(examples):
    inputs = [prefix+s1 for s1 in examples['Original']]
    targets = examples['Paraphrase']
    # most likely there will be nothing to truncate, but we still add it
    model_inputs = tokenizer(inputs, text_target=targets, max_length=max_length, truncation=True)
    return model_inputs
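
To make the batched interface concrete, here is a dependency-free sketch of the same idea. With `batched=True`, the function receives a dict of lists, prepends the prefix to every source sentence, and truncates both sides to `max_length`. The whitespace tokenizer `toy_tokenize` and the two example sentence pairs are made-up stand-ins for illustration only. The real tokenizer works on subword ids, not on words.

```python
def toy_tokenize(text, max_length):
    # Hypothetical stand-in for the real tokenizer: split on whitespace, then truncate.
    return text.split()[:max_length]


def toy_preprocess(examples, prefix='paraphrase: ', max_length=128):
    # `examples` is a dict of lists, as passed by Dataset.map(..., batched=True).
    inputs = [prefix + s for s in examples['Original']]
    targets = examples['Paraphrase']
    return {
        'input_tokens': [toy_tokenize(s, max_length) for s in inputs],
        'label_tokens': [toy_tokenize(t, max_length) for t in targets],
    }


# Made-up mini batch, only for illustration.
batch = {
    'Original': ['Dober dan vsem.', 'Hvala lepa.'],
    'Paraphrase': ['Lep pozdrav vsem.', 'Najlepša hvala.'],
}
out = toy_preprocess(batch, max_length=3)
print(out['input_tokens'][0])
print(out['label_tokens'][1])
```

Note that the prefix counts towards `max_length`, so it slightly reduces the room left for the actual sentence.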

Now we apply the preprocessing function to the datasets.

In [13]:
tokenized_ds_train = raw_dataset_train.map(
    preprocess_function,
    batched=True,
    remove_columns=raw_dataset_train.column_names
)
tokenized_ds_val = raw_dataset_val.map(
    preprocess_function,
    batched=True,
    remove_columns=raw_dataset_val.column_names
)
tokenized_ds_test = raw_dataset_test.map(
    preprocess_function,
    batched=True,
    remove_columns=raw_dataset_test.column_names
)


Now the data is ready.

## Model and Training Preparation

Next, we set up the model and a data collator.

In [14]:
from transformers import MT5ForConditionalGeneration

We either load the pretrained model from Hugging Face when fine-tuning starts, or load the model from a previous fine-tuning checkpoint on Google Drive.

In [15]:
if initial_finetuning:
  model = MT5ForConditionalGeneration.from_pretrained(hf_checkpoint)
else:
  model = MT5ForConditionalGeneration.from_pretrained(drive_checkpoint)


Next, we instantiate a DataCollator.

In [16]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

Here, I'll skip the example usage of the data collator; you can check it out [here](https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt#data-collation).
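
For readers who do not want to follow the link, the core idea can be sketched in plain Python. Each batch is padded to its own longest sequence, and labels are padded with `-100` so that padded positions are ignored by the loss. This is a simplified illustration, not the library implementation. The function name `pad_batch` is hypothetical, and a pad token id of `0` is an assumption here. The real collator additionally returns tensors and, when given the model, prepares decoder inputs.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    # Pad every sequence up to the longest one in this batch (dynamic padding).
    max_in = max(len(f['input_ids']) for f in features)
    max_lab = max(len(f['labels']) for f in features)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for f in features:
        n_pad = max_in - len(f['input_ids'])
        batch['input_ids'].append(f['input_ids'] + [pad_token_id] * n_pad)
        # Padded positions get attention mask 0 so the model does not attend to them.
        batch['attention_mask'].append(f['attention_mask'] + [0] * n_pad)
        # Labels are padded with -100, which the loss function skips.
        n_lab_pad = max_lab - len(f['labels'])
        batch['labels'].append(f['labels'] + [label_pad_token_id] * n_lab_pad)
    return batch


features = [
    {'input_ids': [563, 303, 1], 'attention_mask': [1, 1, 1], 'labels': [563, 1]},
    {'input_ids': [563, 1], 'attention_mask': [1, 1], 'labels': [563, 466, 4043, 1]},
]
padded = pad_batch(features)
print(padded['input_ids'])
print(padded['labels'])
```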

Now, let's continue with the metric. We will use ParaScore.

In [17]:
from parascore import ParaScorer

scorer = ParaScorer(lang='sl')


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's quickly go over how ParaScore is used (this example is in English, so it's not ideal):

In [18]:
cands = ["A young person is skating.", "I like sports.", "He catches the ball.", "That's very interesting!"]
sources = ["There's a child on a skateboard.", "I like to relax.", "good morning, everyone!", "I find this interesting."]
score = scorer.free_score(cands, sources)
float(score[-1].mean())

0.7657963037490845

Now, here's the `compute_metrics` function (mostly copied from [here](https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt#metrics)):

In [19]:
import numpy as np

In [20]:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]
    print(decoded_preds[:5])
    print(decoded_labels[:5])
    
    parascore = scorer.free_score(decoded_preds, decoded_labels)
    return {'parascore': float(parascore[-1].mean())}
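
The `-100` replacement step deserves a closer look. Label positions filled with `-100` do not correspond to any vocabulary entry, so they have to be mapped back to the pad token before decoding. Below is a plain-Python sketch of what the `np.where` call does. The helper name `restore_padding` is hypothetical, and a pad token id of `0` is an assumption.

```python
def restore_padding(labels, pad_token_id=0, ignore_index=-100):
    # Swap the loss-masking value back to the pad token id so the rows can be decoded.
    return [
        [pad_token_id if token == ignore_index else token for token in row]
        for row in labels
    ]


labels = [[563, 466, 1, -100, -100], [563, 4043, 260, 1, -100]]
restored = restore_padding(labels)
print(restored)
```

Because decoding uses `skip_special_tokens=True`, the restored pad tokens are then dropped from the decoded text.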
    

In [21]:
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer

In [23]:


args = Seq2SeqTrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy='steps',
    logging_steps=500,
    output_dir='/content/drive/MyDrive/models/mono-slsl',  # this is where the checkpoint will be saved
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=5,
    predict_with_generate=True,
)
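
With `save_strategy="epoch"`, the trainer writes checkpoint folders named `checkpoint-<global_step>` into `output_dir`, and `save_total_limit=2` keeps only the most recent two. To continue training later, `drive_checkpoint` has to point at one of these folders. The following sketch picks the folder with the highest step number. The helper `latest_checkpoint` and the step numbers in the demo are illustrative assumptions. The `transformers.trainer_utils` module also offers `get_last_checkpoint` for this purpose.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint folder with the highest step, or None."""
    pattern = re.compile(r'^checkpoint-(\d+)$')
    best = None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            # Compare numerically: as strings, 'checkpoint-3472' sorts after 'checkpoint-17360'.
            if best is None or step > best[0]:
                best = (step, path)
    return best[1] if best else None


# Demo on a temporary directory with made-up checkpoint folders.
with tempfile.TemporaryDirectory() as tmp:
    for step in (3472, 13888, 17360):
        os.makedirs(os.path.join(tmp, f'checkpoint-{step}'))
    found = latest_checkpoint(tmp)
    print(os.path.basename(found))
```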

In [24]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_ds_train,
    eval_dataset=tokenized_ds_val,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [25]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


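| Epoch | Training Loss | Validation Loss | Parascore |
|-------|---------------|-----------------|-----------|
| 1 | 2.946 | 2.044483 | 0.842204 |
| 2 | 2.5428 | 1.891694 | 0.845955 |
| 3 | 2.4093 | 1.828252 | 0.847033 |
| 4 | 2.349 | 1.788897 | 0.847304 |
| 5 | 2.2935 | 1.777003 | 0.847301 |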


['Razumeti različne stopnje varstva lahko pričakovali od vsakega objekt', 'Ponudbo za delo je priložnost povedati ljudem, kaj lahko ponudi', 'Vsak obiskovalec bo vprašal o namenu zaposlitve bivanje', 'Ustvarite čudovite pokrajine, značilnosti terena, vrtove in', 'Varno plača Hoy storitev za nizko ceno!']
['Razumete, koliko različnih stopenj oskrbe lahko pričakujete od vsakega objekta.', 'Delovna priložnost je priložnost, da ljudem poveš, kaj lahko ponudiš, kot tudi kaj zahtevaš.', 'Vsak obiskovalec bo vprašan o namenu bivanja, nastanitve in morebitne zaposlitve.', 'Ustvarite čudovite pokrajine, terenske značilnosti, vrtove in palube za vaš idealen zunanji bivalni prostor!', 'Varno plačano hoy storitev za nizko ceno!']
['Razumeti različne stopnje varstva lahko pričakovali od vsakega objekt', 'Ponudbo za delo je priložnost povedati ljudem, kaj lahko ponudi', 'Vsak obiskovalec bo vprašal o namenu zaposlitve bivanje', 'Ustvarite čudovite pokrajine, značilnosti terena, vrtove in', 'Varno pl

TrainOutput(global_step=17360, training_loss=2.897806151781214, metrics={'train_runtime': 8306.7562, 'train_samples_per_second': 33.433, 'train_steps_per_second': 2.09, 'total_flos': 1.433933636296704e+16, 'train_loss': 2.897806151781214, 'epoch': 5.0})
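
As a sanity check, the numbers in `TrainOutput` can be reproduced from the dataset size and the training arguments. The sketch below assumes a single GPU, no gradient accumulation, and that the last, smaller batch of each epoch is kept. These are assumptions about the setup, not something the output states.

```python
import math

num_train_examples = 55544   # size of the train split
batch_size = 16              # per_device_train_batch_size
num_epochs = 5               # num_train_epochs
train_runtime = 8306.7562    # seconds, taken from TrainOutput

# The last batch of an epoch is smaller than 16, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 2))
```

The computed total matches `global_step=17360`, and the two throughput values agree with `train_samples_per_second` and `train_steps_per_second` after rounding.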