# Evaluating an MT5 model for German paraphrasing

In [1]:
model = 'yawnick/mt5-small-paracrawl-dede' 
dataset = 'yawnick/para_crawl_dede'

## Environment Setup

We need a GPU

In [2]:
!nvidia-smi

Wed May 24 00:21:30 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We install all needed libraries

In [3]:
!pip install datasets==2.11.0 transformers==4.28.0 nltk==3.8.1 parascore==1.0.5 sentencepiece==0.1.98

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets==2.11.0
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.7/468.7 kB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m46.6 MB/s[0m eta [36m0:00:00[0m
Collecting parascore==1.0.5
  Downloading parascore-1.0.5-py3-none-any.whl (15 kB)
Collecting sentencepiece==0.1.98
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m34.9 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.7,>=0.3.0 (from datasets==2.11.0)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━

Connect to Google Drive to save the results in the root folder of our Drive at `/content/drive/MyDrive/`.

In [4]:
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


## Data Download and Preparation

In [5]:
from datasets import load_dataset

We use our own created datasets with paraphrases

In [6]:
raw_dataset = load_dataset(dataset, split='test')
raw_dataset[5]

Downloading and preparing dataset csv/yawnick--para_crawl_dede to /root/.cache/huggingface/datasets/yawnick___csv/yawnick--para_crawl_dede-33f67bdf41063882/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/10.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.25M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.92M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/yawnick___csv/yawnick--para_crawl_dede-33f67bdf41063882/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


{'Original': 'c) angemessene Strafen oder andere Sanktionen zur wirksamen Durchsetzung dieses Artikels vorsehen.',
 'Paraphrase': 'c) angemessene Sanktionen oder andere Sanktionen vorsehen, um die wirksame Durchsetzung dieses Artikels zu gewährleisten.'}

## Generating paraphrases

First, we will initialize the pipeline

In [7]:
from transformers import pipeline
import tensorflow as tf
from tqdm import tqdm

In [8]:
device_name = tf.test.gpu_device_name()
if len(device_name) > 0:
    print("Found GPU at: {}".format(device_name))
else:
    device_name = "/device:CPU:0"
    print("No GPU, using {}.".format(device_name))

Found GPU at: /device:GPU:0


In [9]:
pipe = pipeline('text2text-generation', model=model)  # device=0 tells it to use the GPU

Downloading (…)lve/main/config.json:   0%|          | 0.00/773 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.20G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/74.0 [00:00<?, ?B/s]



In [10]:
def data():
  for row in raw_dataset:
    yield row['Original']

In [11]:
ds_length = raw_dataset.num_rows
ps = []

with tf.device(device_name):
  for res in tqdm(pipe(data(), batch_size=48), total=ds_length):
    ps.append(res[0]['generated_text'])

100%|██████████| 11532/11532 [40:59<00:00,  4.69it/s]


## Evaluating paraphrases

In [12]:
from parascore import ParaScorer

In [13]:
scorer = ParaScorer(lang='de')

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/714M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
scores = scorer.base_score(raw_dataset['Original'], ps, raw_dataset['Paraphrase'], batch_size=16)
scores

[1.0060532464141465,
 0.9500002384185791,
 0.8836195635795593,
 0.9499998807907104,
 0.95,
 0.8957882690429687,
 0.9499999403953552,
 0.9762482810020446,
 0.9499998211860656,
 0.9660128462329677,
 0.937794803124722,
 0.878344447672868,
 0.9139211344718933,
 0.95,
 0.95,
 0.9500001192092895,
 0.9064648409942528,
 0.9514920422009059,
 0.9859179910566253,
 0.8715092364668525,
 0.9742819237709045,
 0.9499998211860656,
 0.9265425304004125,
 0.8594724360119882,
 0.947656502042498,
 0.9366753268241882,
 0.9382009673118591,
 0.9268536150007319,
 0.8977819922048336,
 0.9266439247131347,
 0.9513559029584471,
 0.9255647826194763,
 0.9499999403953552,
 0.9456002455401561,
 0.9463044546396349,
 0.953061462908375,
 0.8684207963943481,
 0.9242485426780873,
 0.8867051775851305,
 0.9500001192092895,
 0.9499998211860656,
 0.9499998211860656,
 0.912790014579973,
 0.9500002384185791,
 0.9582943620000567,
 0.9084085392951965,
 0.95,
 0.95,
 0.9499998807907104,
 0.8825242686271667,
 0.8116968441009521,
 0.8

Print the average Parascore

In [15]:
score = sum(scores) / len(scores)
print('Average Parascore:', score)

Average Parascore: 0.9251448582978138


Generate and export the evaluation table

In [16]:
raw_dataset = raw_dataset.rename_column('Paraphrase', 'Reference')
raw_dataset = raw_dataset.add_column(name='Paraphrase', column=ps)
raw_dataset = raw_dataset.add_column(name='Parascore', column=scores)
raw_dataset = raw_dataset.to_csv('/content/drive/MyDrive/data/eval_table_mono_dede.csv')
raw_dataset

Creating CSV from Arrow format:   0%|          | 0/12 [00:00<?, ?ba/s]

3253691