## Loading dataset

We load the metrics used for evaluation (SacreBLEU, ROUGE and METEOR) and then load the dataset with the `datasets` library. The test set consists of the Vigo sentences.

In [1]:
from datasets import load_dataset, load_metric
bleu_metric = load_metric("sacrebleu")
rouge_metric = load_metric("rouge")
meteor_metric = load_metric("meteor")

[nltk_data] Downloading package wordnet to /home/marina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/marina/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/marina/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
dataset = load_dataset('json', data_files={'train': 'PHOENIX/dataset_train.json','test': 'PHOENIX/dataset_VIGO_test.json'}, field="data")

Using custom data configuration default-8889dd37a417b922
Reusing dataset json (/home/marina/.cache/huggingface/datasets/json/default-8889dd37a417b922/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50)


  0%|          | 0/2 [00:00<?, ?it/s]
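
The `field="data"` argument tells the JSON loader to read the records from the top-level `data` key of each file. Below is a minimal sketch of the file layout this implies. It assumes each record holds a `translation` dictionary with `ls` and `le` keys (which appear to be the sign-language gloss and the Spanish sentence), as in the examples shown in this notebook. The file name `dataset_sketch.json` is hypothetical.

```python
import json
import os
import tempfile

# Hypothetical records mirroring the structure of the examples in this notebook.
records = {
    "data": [
        {"translation": {"ls": "TÚ PENSAR CUÁL", "le": "¿Tú qué crees?"}},
        {"translation": {"ls": "CÁNCER TERAPIA AFECTAR(s)", "le": "La terapia del cáncer me afecta."}},
    ]
}

# Write a file with the rows nested under the top-level "data" key.
path = os.path.join(tempfile.mkdtemp(), "dataset_sketch.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

# Reading it back: the rows live under "data", one "translation" dict per row.
with open(path, encoding="utf-8") as f:
    rows = json.load(f)["data"]

print(len(rows), rows[0]["translation"]["ls"])
```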

The following function displays a few random examples so we can see what the data looks like.

In [3]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
show_random_elements(dataset["train"])
show_random_elements(dataset["test"])

Unnamed: 0,translation
0,"{'le': 'mañana por la noche tendremos aire más frío del oeste y cada vez se espera más nieve', 'ls': 'MAÑANA NOCHE LLEGA FRÍO REVISIÓN NIEVE VUELVE'}"
1,"{'le': 'las profundidades que se extienden desde los alpes hasta polonia traen lluvias a veces fuertes y tormentosas en el sur', 'ls': 'LOS ALPES LLEGAN BAJOS POLONIA LLEGAN LAS TORMENTAS DE LLUVIA'}"
2,"{'le': 'mañana luego mejora lenta en la zona costera todavía nubes bastante espesas dirección del viento tormentoso noreste amigable y seco', 'ls': 'MAÑANA LENTAMENTE MEJOR COSTA ACTUALMENTE NUBLADA TORMENTA NORESTE AMIGABLE IX'}"
3,"{'le': 'allí mañana hasta treinta y seis grados junto al mar, no tan inquietantemente caliente de veintisiete a veintinueve grados', 'ls': 'MAÑANA SUR HASTA SEIS TREINTA GRADOS NORTE GAY NEG-TIENE SIETE VEINTE A NUEVE GRADOS'}"
4,"{'le': 'el miércoles sigue lloviendo, especialmente en el medio y también en el sur', 'ls': 'MIERCOLES LLUVIA MAYORMENTE EN MEDIO DE LA REGION DE SUED'}"


Unnamed: 0,translation
0,"{'le': 'Soy intolerante a la lactosa.', 'ls': 'YO LECHE L+A+C+T+O+S+A INTOLERANCIA'}"
1,"{'le': '¿Tú qué crees?', 'ls': 'TÚ PENSAR CUÁL'}"
2,"{'le': 'Voy a revisión cada tres meses.', 'ls': 'REVISAR MES(TRES) MES(TRES)'}"
3,"{'le': 'La terapia del cáncer me afecta.', 'ls': 'CÁNCER TERAPIA AFECTAR(s)'}"
4,"{'le': 'No tengo apetito y me dan mareos.', 'ls': 'YO COMER NO-ME-APETECE MAREOS'}"
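
The loop in `show_random_elements` draws distinct indices one at a time and retries on duplicates. As a sketch of an equivalent approach, `random.sample` from the standard library returns distinct indices in a single call. The helper name `pick_indices` is hypothetical and not part of the notebook.

```python
import random

def pick_indices(dataset_len, num_examples=5):
    """Return num_examples distinct row indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no duplicate check is needed.
    return random.sample(range(dataset_len), num_examples)

picks = pick_indices(150, num_examples=5)
print(picks)
```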


## Preprocessing data

We load the pre-trained tokenizer, which tokenizes the inputs, converts them to the format the model expects, and generates the other inputs the model needs. Instantiating the tokenizer with the `from_pretrained` method ensures that:
- We get a tokenizer that corresponds to the architecture of the model we want to train.
- We load the vocabulary used when pretraining this specific checkpoint.

We load the tokenizer of the model that was previously pre-trained on our PHOENIX dataset translated into Spanish.

In [4]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("experimentos/full_dataset/checkpoint-14000")

Before feeding the data to our model, we write the function that preprocesses our samples. With `truncation=True` and `max_length` set, any input longer than `max_input_length` (and any target longer than `max_target_length`) is truncated to that length. Padding is handled later by a data collator, so examples are padded to the longest length in each batch rather than in the whole dataset.

In [5]:
max_input_length = 128
max_target_length = 128
source_lang = "ls"
target_lang = "le"

def preprocess_function(examples):
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
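
To illustrate why padding is deferred to the collator, here is a minimal sketch of per-batch padding in plain Python. It is not the `DataCollatorForSeq2Seq` implementation. The `pad_batch` helper and the token ids are made up for illustration, and `0` is assumed as the pad id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

# Two batches of already-tokenized (hypothetical) inputs with different lengths.
batch_a = [[11, 12, 13], [21, 22]]
batch_b = [[31, 32, 33, 34, 35, 36], [41]]

padded_a = pad_batch(batch_a)
padded_b = pad_batch(batch_b)

# batch_a is padded to length 3 and batch_b to length 6; padding the whole
# dataset at once would force every example to length 6.
print(padded_a)
print(padded_b)
```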

We use the `map` method of the dataset object created earlier to apply this function to all pairs in the dataset.

In [6]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)
tokenized_dataset

Loading cached processed dataset at /home/marina/.cache/huggingface/datasets/json/default-8889dd37a417b922/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50/cache-10b1cefa763b2fe4.arrow
Loading cached processed dataset at /home/marina/.cache/huggingface/datasets/json/default-8889dd37a417b922/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50/cache-25ec2bc5a9328840.arrow


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'translation'],
        num_rows: 7096
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'translation'],
        num_rows: 150
    })
})

## Fine-tuning the model

We load the pretrained model with `AutoModelForSeq2SeqLM`, since ours is a sequence-to-sequence task.

In [7]:
from transformers import AutoModelForSeq2SeqLM

modelo = AutoModelForSeq2SeqLM.from_pretrained("experimentos/full_dataset/checkpoint-14000")

In [8]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments("experimentos/vigo_preentrenado",
    evaluation_strategy = "epoch",
    fp16=True,
    eval_accumulation_steps=1,
    predict_with_generate=True,
)

The `Seq2SeqTrainingArguments` class holds the attributes that customize training. It requires a folder name, which is used to save the model checkpoints.

In [9]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=modelo)

We define one function per metric to compute it from the predictions. Each function uses a metric loaded earlier and decodes the predictions into text. In addition, the BLEU function writes the decoded predictions to a file in the indicated directory.

In [10]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    
    return preds, labels
    
def compute_metrics_bleu(eval_preds):
    preds, labels = eval_preds
    
    if isinstance(preds, tuple):
        preds = preds[0]
        
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
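    # Replace -100 (ignored label positions) with the pad token id so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)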
    
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    
    result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    file = 'results-' + str(result['score']) + '.txt'
    
    result = {"bleu": result["score"]}
    
    with open('experimentos/vigo_preentrenado/' + file, 'w') as f:
        f.write('\n'.join(decoded_preds))
        
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v,4) for k, v in result.items()}
    return result
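
The data collator pads label sequences with `-100` so that those positions are ignored by the loss, and that value does not correspond to any vocabulary entry. Here is a small sketch of the replacement step and of the `gen_len` computation on made-up ids, assuming a pad token id of `0` in place of `tokenizer.pad_token_id`.

```python
import numpy as np

pad_token_id = 0  # assumption for this sketch; the notebook uses tokenizer.pad_token_id

# Hypothetical label ids as they arrive in compute_metrics: -100 marks padding.
labels = np.array([[5, 8, 2, -100, -100],
                   [7, 3, 9, 4, 2]])

# Swap -100 for the pad id so the tokenizer can decode (and then skip) those positions.
clean_labels = np.where(labels != -100, labels, pad_token_id)

# Hypothetical generated ids, padded with the pad id to a common length.
preds = np.array([[6, 2, 0, 0],
                  [7, 3, 9, 2]])

# Mean number of non-pad tokens per prediction, as reported under "gen_len".
prediction_lens = [np.count_nonzero(pred != pad_token_id) for pred in preds]
gen_len = np.mean(prediction_lens)

print(clean_labels.tolist(), gen_len)
```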

In [11]:
import nltk
import numpy as np

def compute_metrics_rouge(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
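
For intuition about the `rouge1` score reported by this function, here is a minimal sketch of a unigram-overlap F-measure. It assumes lowercased whitespace tokenization and no stemming, so it will not reproduce the values from `rouge_metric`, which applies its own tokenization, stemming and aggregation. The helper `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between two sentences (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("me dan mareos", "no tengo apetito y me dan mareos")
print(round(score * 100, 4))  # scaled to 0-100 like the notebook's results
```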

In [12]:
import numpy as np
    
def compute_metrics_meteor(eval_preds):
    preds, labels = eval_preds
    
    if isinstance(preds, tuple):
        preds = preds[0]
        
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
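    # Replace -100 (ignored label positions) with the pad token id so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)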
        
    result = meteor_metric.compute(predictions=decoded_preds, references=decoded_labels)

    return result

## BLEU metric

In [13]:
from transformers import Seq2SeqTrainer

trainer_bleu = Seq2SeqTrainer(
    modelo,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics_bleu,
)

Using amp fp16 backend


In [14]:
bleu_predictions = trainer_bleu.predict(tokenized_dataset["test"])
print(bleu_predictions.metrics)

The following columns in the test set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation.
***** Running Prediction *****
  Num examples = 150
  Batch size = 8
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


{'eval_loss': 8.625372886657715, 'eval_bleu': 0.1771, 'eval_gen_len': 13.4133, 'eval_runtime': 4.6987, 'eval_samples_per_second': 31.924, 'eval_steps_per_second': 4.044}


## ROUGE metric

In [15]:
from transformers import Seq2SeqTrainer

trainer_rouge = Seq2SeqTrainer(
    modelo,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics_rouge,
)

Using amp fp16 backend


In [16]:
rouge_predictions = trainer_rouge.predict(tokenized_dataset["test"])
print(rouge_predictions.metrics)

The following columns in the test set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation.
***** Running Prediction *****
  Num examples = 150
  Batch size = 8


{'eval_loss': 8.625372886657715, 'eval_rouge1': 8.1796, 'eval_rouge2': 0.6847, 'eval_rougeL': 7.3327, 'eval_rougeLsum': 7.3565, 'eval_gen_len': 13.4133, 'eval_runtime': 4.8403, 'eval_samples_per_second': 30.99, 'eval_steps_per_second': 3.925}


## METEOR metric

In [17]:
from transformers import Seq2SeqTrainer

trainer_meteor = Seq2SeqTrainer(
    modelo,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics_meteor,
)

Using amp fp16 backend


In [18]:
meteor_predictions = trainer_meteor.predict(tokenized_dataset["test"])
print(meteor_predictions.metrics)

The following columns in the test set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation.
***** Running Prediction *****
  Num examples = 150
  Batch size = 8


{'eval_loss': 8.625372886657715, 'eval_meteor': 0.03173193607731098, 'eval_runtime': 5.598, 'eval_samples_per_second': 26.796, 'eval_steps_per_second': 3.394}
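
The three trainers above differ only in their `compute_metrics` argument, so each call to `predict` repeats the same generation pass. One possible alternative, sketched here with stand-in metric functions rather than the real ones, is to wrap several metric functions into a single callable and merge their result dictionaries. `combine_metrics` is a hypothetical helper, not a `transformers` API.

```python
def combine_metrics(*metric_fns):
    """Build one compute_metrics-style function that merges several results."""
    def compute_all(eval_preds):
        merged = {}
        for fn in metric_fns:
            # Later functions overwrite earlier ones on key clashes (e.g. "gen_len").
            merged.update(fn(eval_preds))
        return merged
    return compute_all

# Stand-in metric functions returning fixed, made-up values for illustration.
def fake_bleu(eval_preds):
    return {"bleu": 1.0, "gen_len": 13.0}

def fake_meteor(eval_preds):
    return {"meteor": 0.5}

compute_all = combine_metrics(fake_bleu, fake_meteor)
combined = compute_all(([], []))
print(combined)
```

In this notebook, that would correspond to passing `combine_metrics(compute_metrics_bleu, compute_metrics_rouge, compute_metrics_meteor)` as `compute_metrics` to a single `Seq2SeqTrainer`, under the assumption that all three functions accept the same `eval_preds` input.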
