# Evaluating an MT5 model for Slovene paraphrasing

In [18]:
model = 'yawnick/mt5-small-paracrawl-slsl' 
dataset = 'yawnick/para_crawl_slsl'

## Environment Setup

We need a GPU

In [2]:
!nvidia-smi

Tue May 23 23:32:46 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We install all needed libraries

In [3]:
!pip install datasets==2.11.0 transformers==4.28.0 nltk==3.8.1 parascore==1.0.5 sentencepiece==0.1.98

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets==2.11.0
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.7/468.7 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m72.8 MB/s[0m eta [36m0:00:00[0m
Collecting parascore==1.0.5
  Downloading parascore-1.0.5-py3-none-any.whl (15 kB)
Collecting sentencepiece==0.1.98
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m37.3 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.7,>=0.3.0 (from datasets==2.11.0)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━

Connect to Google Drive to save the results in the root folder of our Drive at `/content/drive/MyDrive/`.

In [4]:
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


## Data Download and Preparation

In [5]:
from datasets import load_dataset

We use our own created datasets with paraphrases

In [19]:
raw_dataset = load_dataset(dataset, split='test')
raw_dataset[5]



{'Original': 'c) doloèile ustrezne kazni in druge sankcije, s katerimi bo zagotovljeno uèinkovito uveljavljanje tega èlena.',
 'Paraphrase': '(c) določiti ustrezne kazni ali druge sankcije za zagotovitev učinkovitega izvrševanja tega člena.'}

## Generating paraphrases

First, we will initialize the pipeline

In [7]:
from transformers import pipeline
import tensorflow as tf
from tqdm import tqdm

In [8]:
device_name = tf.test.gpu_device_name()
if len(device_name) > 0:
    print("Found GPU at: {}".format(device_name))
else:
    device_name = "/device:CPU:0"
    print("No GPU, using {}.".format(device_name))

Found GPU at: /device:GPU:0


In [9]:
pipe = pipeline('text2text-generation', model=model)  # device=0 tells it to use the GPU

Downloading (…)lve/main/config.json:   0%|          | 0.00/773 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.20G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/74.0 [00:00<?, ?B/s]



In [10]:
def data():
  for row in raw_dataset:
    yield row['Original']

In [11]:
ds_length = raw_dataset.num_rows
ps = []

with tf.device(device_name):
  for res in tqdm(pipe(data(), batch_size=48), total=ds_length):
    ps.append(res[0]['generated_text'])

100%|██████████| 11532/11532 [39:15<00:00,  4.90it/s]


## Evaluating paraphrases

In [13]:
from parascore import ParaScorer

In [14]:
scorer = ParaScorer(lang='sl')

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/714M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
scores = scorer.base_score(raw_dataset['Original'], ps, raw_dataset['Paraphrase'], batch_size=16)
scores

[0.8824673403393138,
 0.9499998211860656,
 0.8239512610435485,
 0.95,
 0.9135177284255064,
 0.785396831035614,
 0.9318889854480694,
 0.857150809764862,
 0.95,
 0.8797531296031267,
 0.8607514667510986,
 0.8184679913520813,
 0.813526349067688,
 0.9148691425646158,
 0.9499998807907104,
 0.9499998211860656,
 0.9089767555667929,
 0.8989313126291547,
 0.95,
 0.9023373005916546,
 0.8455892968177795,
 0.95,
 0.9318613400826088,
 0.901647626786005,
 0.9175284892895967,
 0.8373984265327453,
 0.8588426876068115,
 0.8802691987582616,
 0.874003198828016,
 0.8829956221580505,
 0.8672222276123203,
 0.8126958179473877,
 0.9500002384185791,
 0.9500001192092895,
 0.923463693686894,
 0.9278783853925746,
 0.8781379630314173,
 0.8757826038769313,
 0.8813549659897129,
 0.95,
 0.9490416030981103,
 0.95,
 0.95,
 0.95,
 0.8600729408718291,
 0.8640452790260315,
 0.9500002384185791,
 0.9456807985597727,
 0.9157953558649335,
 0.8302684593200683,
 0.8442694115638733,
 0.7346871662139892,
 0.8649562168121337,
 0.85

Print the average Parascore

In [16]:
score = sum(scores) / len(scores)
print('Average Parascore:', score)

Average Parascore: 0.890035144290663


Generate and export the evaluation table

In [20]:
raw_dataset = raw_dataset.rename_column('Paraphrase', 'Reference')
raw_dataset = raw_dataset.add_column(name='Paraphrase', column=ps)
raw_dataset = raw_dataset.add_column(name='Parascore', column=scores)
raw_dataset = raw_dataset.to_csv('/content/drive/MyDrive/data/eval_table_mono_slsl.csv')
raw_dataset

Creating CSV from Arrow format:   0%|          | 0/12 [00:00<?, ?ba/s]

2897936