# Evaluating an mT5 model for English paraphrasing

In [1]:
model = 'yawnick/mt5-small-paracrawl-enen' 
dataset = 'yawnick/para_crawl_enen'

## Environment Setup

This notebook needs a GPU runtime. We check which GPU is attached:

In [2]:
!nvidia-smi

Tue May 23 22:10:53 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P8    12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
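
The same check can be done programmatically so that later cells fail fast on a CPU-only runtime. This is a minimal sketch, assuming `nvidia-smi` is on the `PATH` when a GPU is attached; the helper name `gpu_visible` is ours and not part of any library.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools found, so treat the runtime as CPU-only.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one "GPU <index>: <name> ..." line per device.
    return result.returncode == 0 and "GPU" in result.stdout


has_gpu = gpu_visible()
print("GPU visible:", has_gpu)
```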

We install the required libraries, pinned to the versions used for this evaluation.

In [3]:
!pip install datasets==2.11.0 transformers==4.28.0 nltk==3.8.1 parascore==1.0.5 sentencepiece==0.1.98

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets==2.11.0
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting parascore==1.0.5
  Downloading parascore-1.0.5-py3-none-any.whl (15 kB)
Collecting sentencepiece==0.1.98
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets==2.11.0)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
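
After installation, it can be worth confirming that the runtime really has the pinned versions, since Colab ships its own preinstalled packages. The sketch below uses only the standard library; the `PINNED` mapping repeats the versions from the install command, and `version_mismatches` is a hypothetical helper name.

```python
from importlib import metadata

# Versions copied from the pip command above.
PINNED = {
    "datasets": "2.11.0",
    "transformers": "4.28.0",
    "nltk": "3.8.1",
    "parascore": "1.0.5",
    "sentencepiece": "0.1.98",
}


def version_mismatches(pinned):
    """Map each package whose installed version differs to what was found.

    A value of None means the package is not installed at all.
    """
    problems = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = found
    return problems


mismatches = version_mismatches(PINNED)
print("Mismatched packages:", mismatches)
```

An empty dictionary means every pinned package is present at the requested version.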

Mount Google Drive so that the results can be saved under `/content/drive/MyDrive/`.

In [4]:
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


## Data Download and Preparation

In [5]:
from datasets import load_dataset

We use our own paraphrase dataset and load its test split.

In [45]:
raw_dataset = load_dataset(dataset, split='test')
raw_dataset[5]



{'Original': '(c) Provide for appropriate penalties or other sanctions to ensure the effective enforcement of the present article.',
 'Paraphrase': '(c) provide for appropriate penalties or other penalties for the effective enforcement of this Article.'}

## Generating paraphrases

First, we initialize the pipeline.

In [14]:
from transformers import pipeline
import tensorflow as tf
from tqdm import tqdm

In [15]:
device_name = tf.test.gpu_device_name()
if len(device_name) > 0:
    print("Found GPU at: {}".format(device_name))
else:
    device_name = "/device:CPU:0"
    print("No GPU, using {}.".format(device_name))

Found GPU at: /device:GPU:0


In [8]:
pipe = pipeline('text2text-generation', model=model, device=0)  # device=0 places the model on the first GPU

Downloading (…)lve/main/config.json:   0%|          | 0.00/773 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.20G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/74.0 [00:00<?, ?B/s]
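
The pipeline downloads PyTorch weights (`pytorch_model.bin`), so GPU placement is governed by its `device` argument rather than by TensorFlow. A minimal sketch of choosing that index at run time follows; `pick_pipeline_device` is a hypothetical helper, and the sketch assumes the integer convention of `transformers` pipelines, where `-1` means CPU and `0` is the first CUDA device.

```python
def pick_pipeline_device() -> int:
    """Return 0 if PyTorch sees a CUDA GPU, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_pipeline_device()
print("pipeline device index:", device)

# Assumed usage, mirroring the cell above:
# pipe = pipeline('text2text-generation', model=model, device=device)
```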



In [9]:
def data():
  for row in raw_dataset:
    yield row['Original']

In [16]:
ds_length = raw_dataset.num_rows
ps = []

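# Note: tf.device() only affects TensorFlow ops. The pipeline loads PyTorch weights,
# so GPU placement is controlled by the pipeline's `device` argument instead.
with tf.device(device_name):
  for res in tqdm(pipe(data(), batch_size=48), total=ds_length):
    # Each result is a list with one dict per generated sequence.
    ps.append(res[0]['generated_text'])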

100%|██████████| 11532/11532 [32:35<00:00,  5.90it/s]
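
The loop above relies on a streaming pattern: inputs come from a generator, are processed in batches, and results come back one per input in the original order. The toy sketch below illustrates only that pattern. `toy_pipe` is a hypothetical stand-in that upper-cases text; it is not the `transformers` API and does no paraphrasing.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield consecutive lists of up to `batch_size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def toy_pipe(texts, batch_size=2):
    """Hypothetical stand-in for the pipeline: one result list per input."""
    for batch in batched(texts, batch_size):
        for text in batch:
            yield [{"generated_text": text.upper()}]


rows = [{"Original": "a b"}, {"Original": "c d"}, {"Original": "e f"}]


def data():
    for row in rows:
        yield row["Original"]


# Same access pattern as the real loop: first candidate of each result.
ps_demo = [res[0]["generated_text"] for res in toy_pipe(data(), batch_size=2)]
print(ps_demo)
```

Because the order is preserved, the collected list can later be attached to the dataset as a column without any re-alignment.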


## Evaluating paraphrases

In [20]:
from parascore import ParaScorer

In [22]:
scorer = ParaScorer(lang='en')

Downloading (…)lve/main/config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.43G [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [40]:
scores = scorer.base_score(raw_dataset['Original'], ps, raw_dataset['Paraphrase'], batch_size=16)
scores

[0.9502637922556143,
 0.9500001192092895,
 0.9477343150118247,
 0.95,
 0.95,
 0.9815458822250366,
 0.9500001192092895,
 0.9573045531097724,
 0.95,
 0.9423022672789437,
 0.9498172053353284,
 0.9630753135681153,
 0.9703712630271911,
 0.9499998807907104,
 0.9499998807907104,
 0.9500001192092895,
 0.9562303136292118,
 0.9499998807907104,
 0.9577079330171857,
 0.9556256012963544,
 1.0044849681854249,
 0.95,
 0.9434137983862403,
 0.9690202641487121,
 0.9499998807907104,
 0.9944803285598754,
 0.9547945223845444,
 0.9724914625712804,
 0.95,
 0.9481945477963118,
 0.9590561022563856,
 0.9644313025474548,
 0.9500001192092895,
 0.9500001192092895,
 0.95,
 0.95,
 0.9655991326059614,
 0.9594076059005817,
 0.9528061530806802,
 0.95,
 0.9499998807907104,
 0.9484397009608339,
 0.9499999403953552,
 0.9812485198398213,
 0.95,
 0.9890734720230102,
 0.9548227733495284,
 0.9378909991711987,
 0.9499999403953552,
 0.9831645059585571,
 0.9369943904876709,
 0.9632472606919567,
 1.0106124833165382,
 0.9779307229

Print the average ParaScore.

In [41]:
score = sum(scores) / len(scores)
print('Average Parascore:', score)

Average Parascore: 0.961083719110544
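
The mean alone says little about how the scores are spread, and the listing above shows many values at or very near 0.95. A few extra summary statistics make that easier to see. The sketch below runs on the first five scores from the output above as a sample; in the notebook one would pass the full `scores` list instead.

```python
import statistics

# First five values copied from the score listing above (a sample only).
sample_scores = [
    0.9502637922556143,
    0.9500001192092895,
    0.9477343150118247,
    0.95,
    0.95,
]

summary = {
    "mean": statistics.fmean(sample_scores),
    "median": statistics.median(sample_scores),
    "min": min(sample_scores),
    "max": max(sample_scores),
    "stdev": statistics.stdev(sample_scores),
}

for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```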


Build the evaluation table (original, reference, generated paraphrase, ParaScore) and export it as a CSV file to Drive.

In [46]:
raw_dataset = raw_dataset.rename_column('Paraphrase', 'Reference')
raw_dataset = raw_dataset.add_column(name='Paraphrase', column=ps)
raw_dataset = raw_dataset.add_column(name='Parascore', column=scores)
# to_csv() returns the number of bytes written, so keep it separate from the dataset.
num_bytes = raw_dataset.to_csv('/content/drive/MyDrive/data/eval_table_mono_enen.csv')
num_bytes

Creating CSV from Arrow format:   0%|          | 0/12 [00:00<?, ?ba/s]

2970828
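
To sanity-check the export, the file can be read back and its columns verified. The sketch below uses only the standard library and runs on a temporary demo file. It assumes the CSV header contains the four column names created above; `load_eval_table` is a hypothetical helper, and the `Paraphrase` value in the demo row is a placeholder, not real model output.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ["Original", "Reference", "Paraphrase", "Parascore"]


def load_eval_table(path):
    """Read the evaluation CSV, check its columns, and parse the scores."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in EXPECTED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        rows = []
        for row in reader:
            row["Parascore"] = float(row["Parascore"])
            rows.append(row)
    return rows


# Demo on a temporary file; in the notebook, pass the Drive path instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "eval_table_demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=EXPECTED_COLUMNS)
        writer.writeheader()
        writer.writerow({
            "Original": "(c) Provide for appropriate penalties or other sanctions to ensure the effective enforcement of the present article.",
            "Reference": "(c) provide for appropriate penalties or other penalties for the effective enforcement of this Article.",
            "Paraphrase": "<model output placeholder>",
            "Parascore": 0.9815458822250366,
        })
    table = load_eval_table(demo_path)

print(len(table), table[0]["Parascore"])
```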