# Review of Arhuaco-to-Spanish translation results
This notebook computes test-set metrics for the models obtained by fine-tuning a Finnish-Spanish translator to translate from Arhuaco to Spanish.


## Libraries

In [1]:
!pip install datasets
!pip install sacremoses
!pip install sacrebleu
!pip install evaluate
!pip install transformers[sentencepiece]
!pip install transformers[torch]


In [2]:
from glob import glob
import pandas as pd
import numpy as np
from tqdm.auto import tqdm, trange
import sys
import os

In [3]:
from datasets import load_dataset, DatasetDict, Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
import pandas as pd

In [4]:
from transformers import Seq2SeqTrainingArguments
from transformers import DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM
from transformers import EarlyStoppingCallback
from transformers import Seq2SeqTrainer

import torch

import numpy as np
import pickle
import evaluate

In [5]:
import sacrebleu

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Functions

In [7]:
def preprocess_dataset(path_dataset: str, lang_output: str):
  """
  Reads the data and preprocesses it: converts it to the DatasetDict format
  and splits it into train, test and validation sets.
  Used for translation from an indigenous language to Spanish.

  input:
  - path_dataset: path to the CSV file to process
  - lang_output: key under which the indigenous-language column
    ('wayuu' or 'arh') is stored, e.g. 'fi'

  output:
  - dataset_dict: DatasetDict with train, test and validation splits
  """
  # Read the data and convert it to a list of dictionaries
  dataset = pd.read_csv(path_dataset)
  conv = {'esp': 'es', 'wayuu': lang_output, 'arh': lang_output}
  dataset.rename(columns = conv, inplace = True)

  dataset = [{'es': row['es'], lang_output: row[lang_output]} for _, row in dataset.iterrows()]

  # Train, test and validation split (80% / 10% / 10%)
  train, test = train_test_split(dataset, test_size = 0.2, random_state = 42)
  val, test = train_test_split(test, test_size = 0.5, random_state = 42)

  # Build the datasets
  train = Dataset.from_dict({"id": list(range(len(train))), "translation": train})
  test = Dataset.from_dict({"id": list(range(len(test))), "translation": test})
  validation = Dataset.from_dict({"id": list(range(len(val))), "translation": val})

  # Build the DatasetDict
  dataset_dict = DatasetDict({"train": train, "test": test, "validation": validation})

  return dataset_dict
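
The two chained `train_test_split` calls above hold out 20% of the rows and then halve that hold-out, which gives roughly an 80/10/10 train/validation/test split. Here is a minimal sketch of the resulting sizes. It assumes scikit-learn rounds a fractional test size up to the next integer; the helper `split_sizes` is hypothetical and does not call scikit-learn itself.

```python
import math

def split_sizes(n_rows, holdout_frac=0.2, test_frac_of_holdout=0.5):
    """Approximate split sizes for the two-step split (hypothetical helper)."""
    # First split: hold out a fraction of all rows (assumed to be rounded up).
    n_holdout = math.ceil(holdout_frac * n_rows)
    n_train = n_rows - n_holdout
    # Second split: part of the hold-out becomes test, the rest validation.
    n_test = math.ceil(test_frac_of_holdout * n_holdout)
    n_val = n_holdout - n_test
    return {"train": n_train, "validation": n_val, "test": n_test}

sizes = split_sizes(5000)
print(sizes)
```

Because both calls use `random_state = 42`, re-running `preprocess_dataset` on the same CSV should reproduce the same test rows that the models were held out from during training.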

In [8]:
def tokenizar(dataset_dict, tokenizer, max_length = 150):
  """
  Tokenizes the data in a DatasetDict. The result depends on the tokenizer,
  and therefore on the specific model being used.

  input:
  - dataset_dict: train, test and validation data
  - tokenizer: tokenizer of the model
  - max_length: maximum number of tokens kept per sentence

  output:
  - tokenized_datasets: the tokenized DatasetDict
  - tokenizer: the tokenizer that was passed in
  """

  def preprocess_function(examples):
      inputs = [ex["fi"] for ex in examples["translation"]]
      targets = [ex["es"] for ex in examples["translation"]]
      model_inputs = tokenizer(
          inputs, text_target=targets, max_length=max_length, truncation=True
      )
      return model_inputs

  # Tokenize the data
  tokenized_datasets = dataset_dict.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset_dict["train"].column_names,
  )

  return tokenized_datasets, tokenizer

## Reading the data

In [9]:
model_path = "/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results"
eval_blues = {}

# Collect the 'eval_bleu' value stored in each pickle of the 'REVES' runs
for res in glob(model_path + '/*'):
  if 'pickle' in res and 'REVES' in res:
    with open(res, 'rb') as file:
      blue_score = pickle.load(file)['eval_bleu']
      eval_blues[res] = blue_score
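
The loop above assumes that each results file is a pickled dictionary with an `eval_bleu` entry. Here is a self-contained sketch of that pattern. It uses a temporary directory with made-up file names and scores, all hypothetical and not the project's data.

```python
import os
import pickle
import tempfile
from glob import glob

def collect_eval_bleu(results_dir, tag="REVES"):
    """Map each tagged pickle path in results_dir to its 'eval_bleu' value."""
    scores = {}
    for path in sorted(glob(os.path.join(results_dir, "*"))):
        name = os.path.basename(path)
        if name.endswith(".pickle") and tag in name:
            with open(path, "rb") as fh:
                scores[path] = pickle.load(fh)["eval_bleu"]
    return scores

# Hypothetical files: two tagged runs and one file that should be skipped.
with tempfile.TemporaryDirectory() as tmp:
    fake = {"run_a_REVES.pickle": 12.5, "run_b_REVES.pickle": 7.0, "run_c.pickle": 3.0}
    for fname, bleu in fake.items():
        with open(os.path.join(tmp, fname), "wb") as fh:
            pickle.dump({"eval_bleu": bleu}, fh)
    collected = {os.path.basename(p): s for p, s in collect_eval_bleu(tmp).items()}

print(collected)
```

This sketch checks the file name rather than the whole path, so a parent folder whose name happens to contain `pickle` or `REVES` cannot cause a match.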

## Metrics

In [10]:
bleu_calc = sacrebleu.BLEU()  # corpus-level BLEU with sacreBLEU defaults
chrf_calc = sacrebleu.CHRF(word_order=2)  # word_order=2 gives chrF++ (reported as chrF2++)

## Prediction functions

In [11]:
def translate(model, tokenizer, text, src_lang='fi', tgt_lang='es', a=32, b=3, max_input_length=128, num_beams=4, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Output budget grows linearly with the input length: a + b * (input tokens)
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

def batched_translate(model, tokenizer, texts, batch_size=16, **kwargs):
    """Translate texts in batches of similar length"""
    idxs, texts2 = zip(*sorted(enumerate(texts), key=lambda p: len(p[1]), reverse=True))
    results = []
    for i in trange(0, len(texts2), batch_size):
        results.extend(translate(model, tokenizer, texts2[i: i+batch_size], **kwargs))
    return [p for i, p in sorted(zip(idxs, results))]
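
`batched_translate` sorts the texts by length so that each batch needs little padding, then restores the original order at the end. The sketch below isolates that pattern with a hypothetical `fake_translate` standing in for the model call. It also includes the `a + b * input_length` output budget used in `translate`, with the same default values.

```python
def fake_translate(batch):
    """Hypothetical stand-in for the model call: upper-cases each text."""
    return [t.upper() for t in batch]

def batched_apply(texts, fn, batch_size=2):
    """Apply fn to length-sorted batches, then restore the input order."""
    # Pair every text with its original position and sort longest first.
    idxs, ordered = zip(*sorted(enumerate(texts), key=lambda p: len(p[1]), reverse=True))
    results = []
    for i in range(0, len(ordered), batch_size):
        results.extend(fn(list(ordered[i:i + batch_size])))
    # Sorting (original index, result) pairs puts results back in input order.
    return [r for _, r in sorted(zip(idxs, results))]

def max_new_tokens(input_len, a=32, b=3):
    """Generation budget as a linear function of the input length in tokens."""
    return int(a + b * input_len)

texts = ["bb", "a", "dddd", "ccc"]
out = batched_apply(texts, fake_translate)
print(out)
print(max_new_tokens(128))
```

As in the original function, an empty input list would make the `zip(*...)` unpacking fail, so callers are assumed to pass at least one text.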

## Metrics for the complete dataset

In [12]:
path_data = '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/data_clean/arhuaco'
d = 'COMP'

In [13]:
dataset_dict = preprocess_dataset(path_data + '/' + d + '.csv', lang_output = 'fi')

df_test = pd.DataFrame(dataset_dict['test']['translation'])
df_train = pd.DataFrame(dataset_dict['train']['translation'])
df_validation = pd.DataFrame(dataset_dict['validation']['translation'])

In [14]:
df_test.head()

|   | es | fi |
|---|----|----|
| 0 | por tanto, no teman. yo cuidare de ustedes y d... | ey uweri yari chowchu unkunasundi nukuko, nund... |
| 1 | unos hombres piadosos enterraron el cuerpo de ... | ey awi niwipaw chow wina'chwuya jinari estewun... |
| 2 | de pronto, se levanto una gran tormenta de vie... | jesuri yow iku unchusi barku aninikwuyase' kin... |
| 3 | pues de la misma manera, cuando vean todo esto... | ey uweri kun igera, jomu zanikunpunnige niga k... |
| 4 | enos tenia noventa anos cuando engendro a cainan. | ey awi enori ikawa uga kugi' izare'ri agumusin... |


In [15]:
resultados_completos = [c for c in eval_blues.keys() if 'COMP' in c and 'COMP_NC' not in c]
resultados_completos

['/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_3_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_3_0.0002_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_5_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_5_0.0002_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_10_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_10_0.0002_REVES.pickle']
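
Each result path listed above encodes the dataset name and two run settings, and the evaluation loop derives the model directory from the same name. The sketch below parses one such path. The helper `parse_result_path` is hypothetical, the short `results/` prefix is only for illustration, and reading the two numbers as epochs and learning rate is an assumption based on the file names.

```python
import os

def parse_result_path(path):
    """Split a metrics path into model path, dataset name and the two run numbers."""
    # Same mapping as the evaluation loop: metrics file -> model directory.
    model_path = path.split(".pickle")[0].replace("metrica_despues", "modelo")
    stem = os.path.basename(path).split(".pickle")[0]
    # Keep the fields after 'metrica_despues_' and drop the trailing 'REVES'.
    fields = stem.split("metrica_despues_")[1].split("_")[:-1]
    # The last two fields are assumed to be epochs and learning rate.
    dataset = "_".join(fields[:-2])
    return model_path, dataset, int(fields[-2]), float(fields[-1])

example = "results/arhuaco_metrica_despues_COMP_10_2e-05_REVES.pickle"
print(parse_result_path(example))
print(parse_result_path("results/metrica_despues_COMP_NC_3_0.0002_REVES.pickle"))
```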

In [16]:
for path in resultados_completos:

  print('\nResultados --- ', path, '----')
  # The model directory shares the metrics file name, with 'modelo' in place of 'metrica_despues'
  name = path.split('.pickle')[0].replace('metrica_despues', 'modelo')

  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForSeq2SeqLM.from_pretrained(name).to('cuda')

  # NOTE: df_test.es (Spanish) is passed as the source text; the Arhuaco sentences are in df_test.fi
  arh_translated_test = batched_translate(model, tokenizer, df_test.es, src_lang='fi', tgt_lang='es')
  print(bleu_calc.corpus_score(arh_translated_test, [df_test['es'].tolist()]))
  print(chrf_calc.corpus_score(arh_translated_test, [df_test['es'].tolist()]))


Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_3_2e-05_REVES.pickle ----
BLEU = 9.47 24.9/11.3/6.8/4.2 (BP = 1.000 ratio = 1.395 hyp_len = 20542 ref_len = 14728)
chrF2++ = 28.22

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_3_0.0002_REVES.pickle ----
BLEU = 2.68 14.2/3.6/1.4/0.7 (BP = 1.000 ratio = 2.054 hyp_len = 30255 ref_len = 14728)
chrF2++ = 21.53

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_5_2e-05_REVES.pickle ----
BLEU = 13.99 35.3/17.0/10.1/6.3 (BP = 1.000 ratio = 1.118 hyp_len = 16464 ref_len = 14728)
chrF2++ = 34.12

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_5_0.0002_REVES.pickle ----
BLEU = 2.66 15.7/3.7/1.4/0.6 (BP = 1.000 ratio = 1.842 hyp_len = 27131 ref_len = 14728)
chrF2++ = 21.89

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_10_2e-05_REVES.pickle ----
BLEU = 19.30 44.8/23.4/14.5/9.2 (BP = 1.000 ratio = 1.013 hyp_len = 14923 ref_len = 14728)
chrF2++ = 41.00

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/arhuaco_metrica_despues_COMP_10_0.0002_REVES.pickle ----
BLEU = 1.71 15.0/2.5/0.7/0.3 (BP = 1.000 ratio = 1.563 hyp_len = 23015 ref_len = 14728)
chrF2++ = 18.19
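
Each sacreBLEU summary line above reports the score, the four n-gram precisions (1-gram to 4-gram, in percent), the brevity penalty `BP`, and the hypothesis/reference length ratio. The sketch below recombines those parts using the standard BLEU definition: the brevity penalty times the geometric mean of the precisions. It uses the figures printed for the `COMP_10_2e-05` run. Because the printed precisions are rounded, the result only approximates the reported 19.30. The helper name `bleu_from_summary` is hypothetical.

```python
import math

def bleu_from_summary(precisions, hyp_len, ref_len):
    """Recombine n-gram precisions (in percent) and lengths into a BLEU score."""
    # Brevity penalty: 1 when the hypothesis is longer than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    # Geometric mean of the n-gram precisions, computed in log space.
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean), bp

score, bp = bleu_from_summary([44.8, 23.4, 14.5, 9.2], hyp_len=14923, ref_len=14728)
print(f"BLEU ~ {score:.2f}, BP = {bp:.3f}, ratio = {14923 / 14728:.3f}")
```

In every run above the hypotheses are longer than the references (`ratio` greater than 1), so the brevity penalty never applies. The runs with ratios near 2 lose points through low n-gram precision instead.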


## Metrics for the data without the Constitution


In [17]:
path_data = '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/data_clean/arhuaco'
d = 'COMP_NC'

In [18]:
dataset_dict = preprocess_dataset(path_data + '/' + d + '.csv', lang_output = 'fi')

df_test = pd.DataFrame(dataset_dict['test']['translation'])
df_train = pd.DataFrame(dataset_dict['train']['translation'])
df_validation = pd.DataFrame(dataset_dict['validation']['translation'])

In [21]:
resultados_completos_nc = [c for c in eval_blues.keys() if 'COMP_NC' in c]
resultados_completos_nc

['/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_3_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_3_0.0002_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_5_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_5_0.0002_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_10_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_10_0.0002_REVES.pickle']

In [22]:
for path in resultados_completos_nc:

  print('\nResultados --- ', path, '----')
  name = path.split('.pickle')[0].replace('metrica_despues', 'modelo')

  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForSeq2SeqLM.from_pretrained(name).to('cuda')

  # NOTE: df_test.es (Spanish) is passed as the source text; the Arhuaco sentences are in df_test.fi
  arh_translated_test = batched_translate(model, tokenizer, df_test.es, src_lang='fi', tgt_lang='es')
  print(bleu_calc.corpus_score(arh_translated_test, [df_test['es'].tolist()]))
  print(chrf_calc.corpus_score(arh_translated_test, [df_test['es'].tolist()]))


Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_3_2e-05_REVES.pickle ----
BLEU = 10.33 27.2/12.4/7.4/4.6 (BP = 1.000 ratio = 1.392 hyp_len = 19490 ref_len = 14004)
chrF2++ = 30.61

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_3_0.0002_REVES.pickle ----
BLEU = 1.81 10.8/2.7/1.0/0.4 (BP = 1.000 ratio = 2.870 hyp_len = 40188 ref_len = 14004)
chrF2++ = 20.64

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_5_2e-05_REVES.pickle ----
BLEU = 18.19 43.3/21.7/13.5/8.7 (BP = 1.000 ratio = 1.038 hyp_len = 14537 ref_len = 14004)
chrF2++ = 38.84

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_5_0.0002_REVES.pickle ----
BLEU = 2.36 15.0/3.6/1.3/0.5 (BP = 1.000 ratio = 2.054 hyp_len = 28767 ref_len = 14004)
chrF2++ = 22.91

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_10_2e-05_REVES.pickle ----
BLEU = 16.96 42.0/20.8/12.4/7.7 (BP = 1.000 ratio = 1.128 hyp_len = 15796 ref_len = 14004)
chrF2++ = 40.58

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_COMP_NC_10_0.0002_REVES.pickle ----
BLEU = 2.34 18.8/3.5/1.1/0.4 (BP = 1.000 ratio = 1.410 hyp_len = 19744 ref_len = 14004)
chrF2++ = 20.14


## Metrics for the Bible-only data


In [23]:
path_data = '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/data_clean/arhuaco'
d = 'BIBLIA'

In [24]:
dataset_dict = preprocess_dataset(path_data + '/' + d + '.csv', lang_output = 'fi')

df_test = pd.DataFrame(dataset_dict['test']['translation'])
df_train = pd.DataFrame(dataset_dict['train']['translation'])
df_validation = pd.DataFrame(dataset_dict['validation']['translation'])

In [25]:
resultados_completos_biblia = [c for c in eval_blues.keys() if 'BIBLIA' in c]
resultados_completos_biblia

['/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_3_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_3_0.0002_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_5_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_5_0.0002_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_10_2e-05_REVES.pickle',
 '/content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_10_0.0002_REVES.pickle']

In [26]:
for path in resultados_completos_biblia:

  print('\nResultados --- ', path, '----')
  name = path.split('.pickle')[0].replace('metrica_despues', 'modelo')

  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForSeq2SeqLM.from_pretrained(name).to('cuda')

  # NOTE: df_test.es (Spanish) is passed as the source text; the Arhuaco sentences are in df_test.fi
  arh_translated_test = batched_translate(model, tokenizer, df_test.es, src_lang='fi', tgt_lang='es')
  print(bleu_calc.corpus_score(arh_translated_test, [df_test['es'].tolist()]))
  print(chrf_calc.corpus_score(arh_translated_test, [df_test['es'].tolist()]))


Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_3_2e-05_REVES.pickle ----
BLEU = 12.86 29.9/15.1/9.6/6.3 (BP = 1.000 ratio = 1.374 hyp_len = 18941 ref_len = 13785)
chrF2++ = 33.66

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_3_0.0002_REVES.pickle ----
BLEU = 3.41 16.3/4.6/2.0/0.9 (BP = 1.000 ratio = 2.075 hyp_len = 28606 ref_len = 13785)
chrF2++ = 24.38

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_5_2e-05_REVES.pickle ----
BLEU = 18.11 41.1/21.5/13.6/9.0 (BP = 1.000 ratio = 1.099 hyp_len = 15152 ref_len = 13785)
chrF2++ = 39.23

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_5_0.0002_REVES.pickle ----
BLEU = 3.87 21.3/5.6/2.2/0.9 (BP = 1.000 ratio = 1.526 hyp_len = 21030 ref_len = 13785)
chrF2++ = 25.19

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_10_2e-05_REVES.pickle ----
BLEU = 19.99 45.8/24.3/15.0/9.5 (BP = 1.000 ratio = 1.090 hyp_len = 15028 ref_len = 13785)
chrF2++ = 43.70

Resultados ---  /content/drive/MyDrive/Colab Notebooks/Talleres NLP/Proyecto/results/metrica_despues_BIBLIA_10_0.0002_REVES.pickle ----
BLEU = 2.82 22.3/4.2/1.4/0.5 (BP = 1.000 ratio = 1.292 hyp_len = 17815 ref_len = 13785)
chrF2++ = 22.26
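
To compare the runs side by side, the scores printed in the three sections above can be gathered into one structure. The sketch below copies the reported BLEU and chrF2++ values by hand and picks the best run per dataset. The keys are the two numbers in each file name, read here as epochs and learning rate, which is an assumption. The helper `best_run` is hypothetical.

```python
# Scores copied from the outputs above: (BLEU, chrF2++) per run.
# Keys: the two numbers in each file name (assumed epochs, learning rate).
reported = {
    "COMP": {
        (3, 2e-05): (9.47, 28.22), (3, 0.0002): (2.68, 21.53),
        (5, 2e-05): (13.99, 34.12), (5, 0.0002): (2.66, 21.89),
        (10, 2e-05): (19.30, 41.00), (10, 0.0002): (1.71, 18.19),
    },
    "COMP_NC": {
        (3, 2e-05): (10.33, 30.61), (3, 0.0002): (1.81, 20.64),
        (5, 2e-05): (18.19, 38.84), (5, 0.0002): (2.36, 22.91),
        (10, 2e-05): (16.96, 40.58), (10, 0.0002): (2.34, 20.14),
    },
    "BIBLIA": {
        (3, 2e-05): (12.86, 33.66), (3, 0.0002): (3.41, 24.38),
        (5, 2e-05): (18.11, 39.23), (5, 0.0002): (3.87, 25.19),
        (10, 2e-05): (19.99, 43.70), (10, 0.0002): (2.82, 22.26),
    },
}

def best_run(runs, metric=0):
    """Return the (key, scores) pair with the highest chosen metric (0=BLEU, 1=chrF2++)."""
    return max(runs.items(), key=lambda item: item[1][metric])

for dataset, runs in reported.items():
    (first, second), (bleu, chrf) = best_run(runs)
    print(f"{dataset}: best BLEU {bleu} (chrF2++ {chrf}) for run {first}_{second}")
```

In all three datasets the `2e-05` runs score well above the `0.0002` runs. For `COMP_NC`, BLEU and chrF2++ disagree on the best run, which `best_run(reported["COMP_NC"], metric=1)` makes visible.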
