# Evaluating a multilingual mT5 model on all languages (English, German, Czech, Slovene) using the full dataset

In [2]:
# Identifiers of the fine-tuned model and the evaluation dataset
model = 'yawnick/mt5-small-paracrawl-multi-all'
dataset = 'yawnick/para_crawl_multi_all'

## Environment Setup

We need a GPU, so we first check that one is attached to the runtime.

In [3]:
!nvidia-smi

Wed May 24 14:53:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

Install the required libraries at pinned versions.

In [4]:
!pip install datasets==2.11.0 transformers==4.28.0 nltk==3.8.1 parascore==1.0.5 sentencepiece==0.1.98

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Connect to Google Drive to save the results in the root folder of our Drive at `/content/drive/MyDrive/`.

In [5]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## Data Download and Preparation

In [6]:
from datasets import load_dataset

We use our own paraphrase dataset and load its test split.

In [7]:
raw_dataset = load_dataset(dataset, split='test')
raw_dataset[5]



{'Original': 'To še ni vse muslimani – to je žep fundamentalistov, katerih cilj je oslabiti krščansko vero.',
 'Paraphrase': 'To niso vsi muslimani – je žep fundamentalistov, katerih cilj je oslabiti krščansko vero.'}

## Generating paraphrases

First, we initialize the generation pipeline.

In [8]:
from transformers import pipeline
import tensorflow as tf
from tqdm import tqdm

In [9]:
device_name = tf.test.gpu_device_name()
if len(device_name) > 0:
    print("Found GPU at: {}".format(device_name))
else:
    device_name = "/device:CPU:0"
    print("No GPU, using {}.".format(device_name))

Found GPU at: /device:GPU:0
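
The check above goes through TensorFlow, while the `device` argument of `pipeline` expects a plain device index. As a minimal sketch, assuming the pipeline ends up running on PyTorch (the notebook does not say which backend is used), the index could be derived as follows. The helper name `pick_pipeline_device` is hypothetical and not part of the notebook.

```python
def pick_pipeline_device():
    """Return 0 (first CUDA GPU) if PyTorch sees one, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the check still works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


# Hypothetical usage: pass the result as pipeline(..., device=device_index)
device_index = pick_pipeline_device()
print("pipeline device index:", device_index)
```

With this, the same notebook could fall back to CPU instead of failing when no GPU is attached, at the cost of much slower generation.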


In [10]:
pipe = pipeline('text2text-generation', model=model, device=0)  # device=0 tells it to use the GPU



In [11]:
def data():
  # Yield the source sentences one at a time so the pipeline can stream and batch them
  for row in raw_dataset:
    yield row['Original']

In [33]:
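ds_length = raw_dataset.num_rows
ps = []  # generated paraphrases, in dataset order

with tf.device(device_name):
  # Each result is a list of candidates for one input; keep the first generated text
  for res in tqdm(pipe(data(), batch_size=48), total=ds_length):
    ps.append(res[0]['generated_text'])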

100%|██████████| 46128/46128 [06:23<00:00, 120.31it/s]


## Evaluating paraphrases

In [13]:
from parascore import ParaScorer

In [14]:
scorer = ParaScorer(lang='multi')

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

The test set is scored in slices, and each slice's scores are written to a separate file on Drive. The cell below handles the last slice (rows 40000 onward, 6128 pairs), saved as `multi-para-all-test5.txt`; the runs for the other slices are not shown.


In [29]:
scores1 = scorer.base_score(raw_dataset['Original'][40000:], ps[40000:], raw_dataset['Paraphrase'][40000:], batch_size=16)
len(scores1)

6128

In [30]:
with open('/content/drive/MyDrive/results/multi-para-all-test5.txt', 'w') as f:
  for score in scores1:
    f.write(str(score)+'\n')

In [22]:
del scores1

In [None]:
scores = scorer.base_score(raw_dataset['Original'], ps, raw_dataset['Paraphrase'], batch_size=16)
scores
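
The slice-and-save pattern used for `multi-para-all-test5.txt` can be wrapped in a small helper. The following is only a sketch under stated assumptions: `score_in_chunks`, `dummy_score`, the chunk size, and the path template are all hypothetical, and the notebook does not state the slice boundaries used for the first four files. The scorer is passed in as a function so the example runs without loading any model.

```python
import os
import tempfile


def score_in_chunks(originals, candidates, references, score_fn, chunk_size, path_template):
    """Score aligned lists slice by slice, writing each slice to its own file.

    Returns the list of file paths written, in slice order.
    """
    paths = []
    starts = range(0, len(originals), chunk_size)
    for chunk_no, start in enumerate(starts, start=1):
        end = start + chunk_size
        chunk_scores = score_fn(originals[start:end], candidates[start:end], references[start:end])
        path = path_template.format(chunk_no)
        with open(path, "w") as f:
            for s in chunk_scores:
                f.write(f"{s}\n")  # one score per line, as in the notebook
        paths.append(path)
    return paths


# Stand-in scorer for demonstration only: 1.0 if the candidate equals the reference.
def dummy_score(origs, cands, refs):
    return [1.0 if c == r else 0.0 for c, r in zip(cands, refs)]


demo_dir = tempfile.mkdtemp()
demo_paths = score_in_chunks(
    ["a", "b", "c", "d", "e"],   # originals
    ["x", "b", "c", "y", "e"],   # generated candidates
    ["a", "b", "c", "d", "e"],   # references
    dummy_score,
    chunk_size=2,
    path_template=os.path.join(demo_dir, "demo-test{}.txt"),
)
print(len(demo_paths), "files written")
```

In the notebook, `score_fn` could be something like `lambda o, c, r: scorer.base_score(o, c, r, batch_size=16)`, mirroring the call in the cell above. Writing each slice to disk means a runtime disconnect loses at most one slice of work.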

In [34]:
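# Reload the per-slice score files from Drive and concatenate them in order
scores = []
for i in [1,2,3,4,5]:
  with open(f'/content/drive/MyDrive/results/multi-para-all-test{i}.txt') as f:
    lines = list(map(str.strip, f.readlines()))
  for line in lines:
    scores.append(float(line))
print(scores[:5])  # first few scores as a quick sanity check
print(len(scores))  # should equal the number of test rows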

[0.95, 0.9499998807907104, 0.904100341456277, 0.9133619236946106, 0.8697424225341108]
46128


Compute and print the average ParaScore over the whole test set.

In [35]:
score = sum(scores) / len(scores)
print('Average Parascore:', score)

Average Parascore: 0.9251025098316062
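
A single average hides how the scores are spread. As a minimal sketch, the standard-library `statistics` module can add the median, standard deviation, and range. The helper `summarize` is hypothetical, and the sample below reuses only the five scores printed earlier, so its numbers do not describe the full test set.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of scores."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation, needs n >= 2
        "min": min(values),
        "max": max(values),
    }


# The first five scores shown in the notebook output, used here as sample data.
sample = [0.95, 0.9499998807907104, 0.904100341456277, 0.9133619236946106, 0.8697424225341108]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Calling `summarize(scores)` on the full list would give the same statistics for all 46128 pairs.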


Generate and export the evaluation table

In [36]:
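# Keep the dataset's paraphrase as 'Reference' and store the generated one as 'Paraphrase'
raw_dataset = raw_dataset.rename_column('Paraphrase', 'Reference')
raw_dataset = raw_dataset.add_column(name='Paraphrase', column=ps)
raw_dataset = raw_dataset.add_column(name='Parascore', column=scores)
# to_csv returns the number of characters or bytes written, so it must not overwrite raw_dataset
num_bytes = raw_dataset.to_csv('/content/drive/MyDrive/data/eval_table_multi_all.csv')
num_bytes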

Creating CSV from Arrow format:   0%|          | 0/47 [00:00<?, ?ba/s]

12124959
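
Once the table is exported, the weakest outputs are usually the most informative rows to read. The sketch below lists the lowest-scoring rows of such a CSV using only the standard library. It assumes the column names created above (`Original`, `Reference`, `Paraphrase`, `Parascore`) appear as the header row; the helper `lowest_scoring` and the tiny demo table are hypothetical and are not taken from the real results.

```python
import csv
import os
import tempfile


def lowest_scoring(csv_path, k=3):
    """Return the k rows with the smallest Parascore, lowest first."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda row: float(row["Parascore"]))
    return rows[:k]


# Made-up demo table so the example runs without access to Drive.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_eval_table.csv")
fieldnames = ["Original", "Reference", "Paraphrase", "Parascore"]
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({"Original": "s1", "Reference": "r1", "Paraphrase": "p1", "Parascore": "0.93"})
    writer.writerow({"Original": "s2", "Reference": "r2", "Paraphrase": "p2", "Parascore": "0.41"})
    writer.writerow({"Original": "s3", "Reference": "r3", "Paraphrase": "p3", "Parascore": "0.77"})

worst = lowest_scoring(demo_path, k=2)
for row in worst:
    print(row["Parascore"], row["Paraphrase"])
```

For the real table, the path would be the one passed to `to_csv` above.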