We will now load the reference dataset, which contains pre-translated texts. The file must be uploaded manually and must be in the Parquet format. Each record in its translation column should contain:

- The reference translation in the target language, stored under the key given by language_code.
- The translation generated by the model under evaluation, stored under the key given by model_code.

Set model_code in the loading cell below to the name of your model ("cabra0" in this run); the file name is built from it.

In [1]:
!pip install pandas
!pip install pyarrow



In [2]:
import pandas as pd

model_code = "cabra0"
file_path = './opus_100_pt_validation_' + model_code + '.parquet'
parquet_data = pd.read_parquet(file_path)

language_code = "pt"

paraphrase_code = model_code + "_pp"
spellcheck_code = model_code + "_sc"

# Preliminary test
print(parquet_data.iloc[200])

translation    {'cabra0': 'Foster a ética em negócios, a Cama...
Name: 200, dtype: object
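
Before running the assessments, it can help to confirm that every record carries the expected keys. The following is a minimal sketch that uses plain dictionaries in place of the translation records; missing_keys is a hypothetical helper, not part of the notebook, and the sample rows are invented for illustration.

```python
def missing_keys(rows, required):
    """Return (row_index, missing_key_list) pairs for rows lacking a required key."""
    problems = []
    for index, row in enumerate(rows):
        # Treat absent keys, None values, and empty strings as missing
        missing = [key for key in required if row.get(key) in (None, "")]
        if missing:
            problems.append((index, missing))
    return problems


# Invented sample rows standing in for parquet_data.translation
sample_rows = [
    {"pt": "Ele nunca fez nada.", "cabra0": "Ele nunca fez nada."},
    {"pt": "Temos dois suspeitos."},  # model translation is missing
]

problems = missing_keys(sample_rows, required=["pt", "cabra0"])
print(problems)
```

With the real data, the equivalent call would presumably be missing_keys(parquet_data.translation, [language_code, model_code]), assuming each record behaves like a dict, as the later loops also assume.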


Grammatical Assessment:

- This phase evaluates the grammatical accuracy of translations.

We will install the necessary library:

In [3]:
# LanguageTool requires Java to be installed on this machine
!pip install language-tool-python



Next, initialize LanguageTool for the desired target language. To find the language code for your target language, please refer to the project documentation.

In [4]:
import language_tool_python

language_tool_language_code = 'pt-PT'

tool = language_tool_python.LanguageTool(language_tool_language_code)

# Preliminary test

text = "This is not portuguese."
matches = tool.check(text)

if len(matches) > 0:
    print(f'The text "{text}" has {len(matches)} grammatical errors')
else:
    print(f'The text "{text}" doesn\'t have grammatical errors.')

Downloading LanguageTool 5.7: 100%|██████████| 225M/225M [00:54<00:00, 4.16MB/s] 
Unzipping C:\Users\gustr\AppData\Local\Temp\tmpw5ktuoqb.zip to C:\Users\gustr\.cache\language_tool_python.
Downloaded https://www.languagetool.org/download/LanguageTool-5.7.zip to C:\Users\gustr\.cache\language_tool_python.


The text "This is not portuguese." has 3 grammatical errors


We will iterate through the dataset. For each item that has not been checked yet, we run LanguageTool on the model translation, store 1 (no issues found) or 0 (at least one issue) under spellcheck_code, and save the dataset. Items that already have a result are skipped.

In [5]:
dataset = parquet_data.translation

row_number = len(dataset)

for current_row, item in enumerate(dataset):
    # A stored result of 0 is falsy, so test for presence rather than truthiness;
    # otherwise every failed text would be checked again on each run.
    if item.get(model_code) and item.get(spellcheck_code) is not None:
        print(f"{current_row}/{row_number} - Existing check: {item[model_code]} -> {item[spellcheck_code]}")
    else:
        text = item[model_code]

        # Remove common artifacts that would otherwise skew the evaluation:
        text = text.replace("- ", "")
        text = text.replace("\"", "")

        # 1 = no issues reported by LanguageTool, 0 = at least one issue
        matches = tool.check(text)
        item[spellcheck_code] = 0 if len(matches) > 0 else 1

        print(f"{current_row}/{row_number} - New check: {text} -> {item[spellcheck_code]}")

        # Save after every new check so an interrupted run can resume
        parquet_data.to_parquet(file_path)

0/2000 - New check: Ele nunca fez nada. -> 1
1/2000 - New check: 50/50 shot? -> 0
2/2000 - New check: Sim, eu vou ser bom. -> 1
3/2000 - New check: Não, não foi minha contribuição. -> 1
4/2000 - New check: Eles então seguem! -> 1
5/2000 - New check: Porque você não lavou. -> 0
6/2000 - New check: Parker não tinha um telefone, e não tinha nenhuma outra maneira de comunicar-se. -> 1
7/2000 - New check: Vou te contar o que o Detective Bird aqui pense que aconteceu. -> 0
8/2000 - New check: Se você amei-me, você teria paddleado. -> 0
9/2000 - New check: Temos dois suspeitos. -> 1
10/2000 - New check: Abriu a porta... -> 1
11/2000 - New check: Os prêstimos podem atrair uma subsídia de interesse para a qual fornece-se uma conta operacional. -> 0
12/2000 - New check: Não um centavo. Então, Mikey Molloy pretendeu ter um de seus ataques... para que eu pudesse entrar quando ninguém estava olhando. -> 0
13/2000 - New check: Juro-vos que este desafio será superado -> 1
14/2000 - New check: Mike es
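
The skip-or-check logic of this loop can be tried without LanguageTool. The sketch below uses a stand-in checker (fake_check, hypothetical) and plain dicts. It illustrates that a stored score of 0 is falsy in Python, so an explicit "is not None" test is a safer way to decide whether an item was already checked than a truthiness test.

```python
def clean_text(text):
    """Strip the dialogue dashes and double quotes that are removed before checking."""
    return text.replace("- ", "").replace('"', "")


def fake_check(text):
    """Hypothetical stand-in for tool.check: flags any text containing 'xx'."""
    return ["error"] if "xx" in text else []


def run_checks(items, text_key, score_key, check):
    """Score unchecked items in place; return how many new checks were run."""
    new_checks = 0
    for item in items:
        if item.get(score_key) is not None:
            continue  # already scored, including a stored 0
        matches = check(clean_text(item[text_key]))
        item[score_key] = 0 if matches else 1
        new_checks += 1
    return new_checks


items = [{"m": "- Temos dois suspeitos."}, {"m": "Texto xx errado."}]
first_pass = run_checks(items, "m", "m_sc", fake_check)
second_pass = run_checks(items, "m", "m_sc", fake_check)
print(first_pass, second_pass, [item["m_sc"] for item in items])
```

On the second pass nothing is re-checked, even though one item holds a score of 0.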

Compute the overall metrics: the number of texts checked, how many had no detected errors, and the resulting success rate.

In [6]:
spellcheck_total = 0
spellcheck_success = 0
spellcheck_failure = 0

for item in dataset:
    # Count only items that have a model translation and a stored grammar result
    if item.get(model_code) and item.get(spellcheck_code) is not None:
        spellcheck_total += 1
        if item[spellcheck_code] == 1:
            spellcheck_success += 1
        else:
            spellcheck_failure += 1

if spellcheck_total > 0:
    success_rate = spellcheck_success / spellcheck_total
else:
    success_rate = 0

print(f"From {spellcheck_total} texts, {spellcheck_success} had no identified error ({success_rate * 100:.2f}% success rate)")

From 2000 texts, 715 had no identified error (35.75% success rate)
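
The success rate is the share of checked texts with no reported issue. As a quick arithmetic check against the figures above, here is a small sketch; success_rate is a hypothetical helper written for this illustration.

```python
def success_rate(successes, total):
    """Fraction of successful checks; 0.0 when nothing was checked."""
    return successes / total if total > 0 else 0.0


rate = success_rate(715, 2000)
print(f"{rate * 100:.2f}%")
```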


Semantic Assessment:

- This phase evaluates the semantic accuracy of translations.

We will install the necessary libraries:

In [7]:
!pip install sentence_transformers
!pip install scipy

Collecting sentence_transformers
  Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting scikit-learn (from sentence_transformers)
  Obtaining dependency information for scikit-learn from https://files.pythonhosted.org/packages/77/85/bff3a1e818ec6aa3dd466ff4f4b0a727db9fdb41f2e849747ad902ddbe95/scikit_learn-1.3.0-cp311-cp311-win_amd64.whl.metadata
  Downloading scikit_learn-1.3.0-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting scipy (from sentence_transformers)
  Obtaining dependency information for scipy from https://files.pythonhosted.org/packages/04/b8/947f40706ee2e316fd1a191688f690c4c2b351c2d043fe9deb9b7940e36e/scipy-1.11.1-cp311-cp311-win_

ERROR: Could not install packages due to an OSError: [WinError 2] The system cannot find the file specified: 'C:\\Python311\\Scripts\\nltk.exe' -> 'C:\\Python311\\Scripts\\nltk.exe.deleteme'





Next, we load the all-MiniLM-L6-v2 SentenceTransformer model and define a function that returns the cosine similarity between two sentence embeddings (1 minus the cosine distance). The value is a similarity score, not a calibrated probability.

In [10]:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer('all-MiniLM-L6-v2')

def paraphrase_chance(sentence1, sentence2):
    embedding1 = model.encode(sentence1, convert_to_tensor=True).cpu()
    embedding2 = model.encode(sentence2, convert_to_tensor=True).cpu()

    cos_distance = cosine(embedding1, embedding2)
    return 1 - cos_distance

# Preliminary test

sentence1 = "O gato está na caixa."
sentence2 = "A caixa contém o gato."
print(paraphrase_chance(sentence1, sentence2))

0.8525395393371582
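
paraphrase_chance returns one minus the cosine distance, which is the cosine similarity of the two embeddings. The following dependency-free sketch shows that relationship on small hand-made vectors; the vectors are invented and are not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance in the sense of scipy.spatial.distance.cosine: 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0. The similarity can range from -1 to 1.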


We will iterate through the dataset again. For each item without a similarity score, we compare the reference translation with the model translation, store the score under paraphrase_code, and save the dataset. Items that already have a score are skipped.

In [11]:
dataset = parquet_data.translation

row_number = len(dataset)

for current_row, item in enumerate(dataset):
    # Test for presence rather than truthiness so that a stored score of 0 is not recomputed
    if item.get(model_code) and item.get(paraphrase_code) is not None:
        print(f"{current_row}/{row_number} - Existing check: {item[model_code]} -> {item[paraphrase_code]}")
    else:
        text0 = item[language_code]  # reference translation
        text1 = item[model_code]     # model translation

        item[paraphrase_code] = paraphrase_chance(text0, text1)

        print(f"{current_row}/{row_number} - New check: {text1} -> {item[paraphrase_code]}")

        # Save after every new check so an interrupted run can resume
        parquet_data.to_parquet(file_path)

0/2000 - New check: Ele nunca fez nada. -> 0.9109705686569214
1/2000 - New check: 50/50 shot? -> 0.4503629505634308
2/2000 - New check: Sim, eu vou ser bom. -> 0.47876816987991333
3/2000 - New check: Não, não foi minha contribuição. -> 0.6462664604187012
4/2000 - New check: Eles então seguem! -> 0.6872657537460327
5/2000 - New check: Porque você não lavou. -> 0.6454185843467712
6/2000 - New check: Parker não tinha um telefone, e não tinha nenhuma outra maneira de comunicar-se. -> 0.5566431879997253
7/2000 - New check: Vou te contar o que o Detective Bird aqui pense que aconteceu. -> 0.6872642636299133
8/2000 - New check: Se você amei-me, você teria paddleado. -> 0.6432827711105347
9/2000 - New check: - Temos dois suspeitos. -> 1
10/2000 - New check: - Abriu a porta... -> 0.8206663131713867
11/2000 - New check: Os prêstimos podem atrair uma subsídia de interesse para a qual fornece-se uma conta operacional. -> 0.6251880526542664
12/2000 - New check: Não um centavo. Então, Mikey Molloy p
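
Writing the whole Parquet file after every row keeps the run resumable, but it repeats a full write per item. One possible alternative is to save after every N new results and once more at the end. This is sketched below with a hypothetical save callback standing in for parquet_data.to_parquet(file_path); the "text" and "score" keys are also invented for the example.

```python
def process_with_checkpoints(items, score, save, every=100):
    """Score unscored items, calling save() after every `every` new results and at the end."""
    pending = 0
    for item in items:
        if item.get("score") is not None:
            continue  # already scored in an earlier run
        item["score"] = score(item["text"])
        pending += 1
        if pending >= every:
            save()
            pending = 0
    if pending:
        save()  # flush results left over after the last full batch


saves = []
items = [{"text": f"t{i}"} for i in range(250)]
process_with_checkpoints(items, score=len, save=lambda: saves.append(1), every=100)
print(len(saves))
```

The trade-off is that an interruption can lose up to every - 1 unsaved results, which then have to be recomputed on the next run.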

Compute the overall semantic metrics: the share of texts whose similarity reaches the threshold (0.75) and the average similarity. Then combine both phases by counting only the texts that pass the threshold and also have no detected grammatical errors.

In [12]:
threshold = 0.75

pp_total = 0
pp_success = 0
pp_failure = 0
pp_sum = 0
pp_success_cleaned = 0
pp_sum_cleaned = 0

for item in dataset:
    # Count only items that have both a model translation and a similarity score
    if item.get(model_code) and item.get(paraphrase_code) is not None:
        pp_total += 1
        pp_sum += item[paraphrase_code]

        if item[paraphrase_code] >= threshold:
            pp_success += 1
            # "Cleaned": passes the threshold and has no detected grammatical error
            if item.get(spellcheck_code) == 1:
                pp_success_cleaned += 1
                pp_sum_cleaned += item[paraphrase_code]
        else:
            pp_failure += 1

# Every rate and average uses pp_total as its denominator, so the cleaned average
# counts each text that is not a clean paraphrase as contributing 0.
if pp_total > 0:
    success_rate = pp_success / pp_total
    success_rate_cleaned = pp_success_cleaned / pp_total
    average_pp_chance = pp_sum / pp_total
    average_pp_chance_cleaned = pp_sum_cleaned / pp_total
else:
    success_rate = 0
    success_rate_cleaned = 0
    average_pp_chance = 0
    average_pp_chance_cleaned = 0


print(f"From {pp_total} texts, {pp_success} are possible paraphrases ({success_rate * 100:.2f}% success rate)")
print(f"The average paraphrase chance is {average_pp_chance * 100:.2f}%.")
print(f"From {pp_total} texts, {pp_success_cleaned} are possible paraphrases with no grammatical errors ({success_rate_cleaned * 100:.2f}% cleaned success rate)")
print(f"The average paraphrase chance cleaned of grammar errors is {average_pp_chance_cleaned * 100:.2f}%.")


From 2000 texts, 755 are possible paraphrases (37.75% success rate)
The average paraphrase chance is 65.45%.
From 2000 texts, 248 are possible paraphrases with no grammatical errors (12.40% cleaned success rate)
The average paraphrase chance cleaned of grammar errors is 10.78%.
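
The cleaned average divides the summed scores of the clean paraphrases by the number of all texts, which is why it is much lower than the overall average. The sketch below reproduces that counting on invented scores. It also shows the mean over only the clean paraphrases, which is an assumption about what a reader might want to compare and is not something the notebook computes. The summarize helper is hypothetical.

```python
def summarize(rows, threshold=0.75):
    """rows: non-empty list of (similarity, grammar_ok) pairs; returns aggregate metrics."""
    total = len(rows)
    passed = [s for s, _ in rows if s >= threshold]
    clean = [s for s, ok in rows if s >= threshold and ok == 1]
    return {
        "success_rate": len(passed) / total,
        "cleaned_success_rate": len(clean) / total,
        "average": sum(s for s, _ in rows) / total,
        # Denominator is all texts, as in the cell above
        "cleaned_average_over_all": sum(clean) / total,
        # Alternative denominator: only the clean paraphrases
        "cleaned_average_over_clean": sum(clean) / len(clean) if clean else 0.0,
    }


# Invented scores: (cosine similarity, 1 if no grammar issue else 0)
rows = [(0.9, 1), (0.8, 0), (0.5, 1), (0.4, 0)]
summary = summarize(rows)
print(summary)
```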
