# Loading the dataset

We load the metric used for evaluation, in this case SacreBLEU, and then load the dataset with the `datasets` library.

In [2]:
import warnings 
warnings.filterwarnings('ignore')

from datasets import load_dataset, load_metric
bleu_metric = load_metric("sacrebleu")

In [3]:
dataset = load_dataset('json', data_files={'train': 'PHOENIX/dataset_train.json','test': 'PHOENIX/dataset_test.json'}, field="data")

Using custom data configuration default-6d84514352d255a9
Reusing dataset json (/home/marina/.cache/huggingface/datasets/json/default-6d84514352d255a9/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50)


  0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
dataset["test"][0]

{'translation': {'le': 'lluvia y nieve en los Alpes en la noche después en el norte y noreste, caen aguaceros aquí y allá, de lo contrario, eso está despejado',
  'ls': 'LLUVIA REGION DE NIEVE DESAPARECIENDO LA LLUVIA DEL NORTE PUEDE VER ESTRELLAS DE REGION'}}
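
The `field="data"` argument suggests that each JSON file keeps its records under a top-level `data` key. The sketch below writes and reads back a miniature file with that layout. The file name `dataset_sample.json` is hypothetical, and the layout is an assumption inferred from the call and the record shown above, not a description of the actual PHOENIX files.

```python
import json
import os
import tempfile

# Hypothetical miniature of a PHOENIX JSON file: the records sit under a
# top-level "data" key, which is what field="data" would select.
sample = {
    "data": [
        {
            "translation": {
                "ls": "NOCHE DIECISIETE NORTE CINCO MONTAÑA",
                "le": "esta noche diecisiete grados en el mar báltico y cinco grados en los alpes",
            }
        }
    ]
}

# Write the sample to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "dataset_sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    records = json.load(f)["data"]

# Each record holds one pair: "ls" is the model input, "le" the target.
pair = records[0]["translation"]
print(pair["ls"], "->", pair["le"])
```
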

The following function displays a few random examples from the dataset so we can see what the data looks like.

In [5]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
show_random_elements(dataset["train"])

Unnamed: 0,translation
0,"{'le': 'esta noche diecisiete grados en el mar báltico y cinco grados en los alpes', 'ls': 'NOCHE DIECISIETE NORTE CINCO MONTAÑA'}"
1,"{'le': 'y de la noche al viernes, las tormentas eléctricas probablemente también pueden convertirse en tormentas', 'ls': 'LOS VIERNES LAS TORMENTAS PUEDEN CONOCER EL VIENTO'}"
2,"{'le': 'En el este y sureste, aparte de los campos brumosos, todavía es en su mayoría amigable', 'ls': 'ESTE SUDESTE LUEGO LENTAMENTE SOL POSS-SER TENGA UN POCO DE NIEBLA'}"
3,"{'le': 'Es posible que haya ráfagas de viento cerca de las tormentas; de lo contrario, el viento sopla de débil a moderado en el mar del norte y también fresco en el oeste.', 'ls': 'TORMENTA POSIBLE IX TORMENTA IX DE OTRO MODO PESO PESO VIENTO IX'}"
4,"{'le': 'esta noche en todas partes menos grados donde despeja hay heladas severas', 'ls': 'HOY INCLUSO TODO MENOS DONDE EL CIELO CLARO ANTERIORMENTE HELADA PRIORIDADES'}"
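
The retry loop in `show_random_elements` draws indices until it has enough distinct ones. As a possible simplification, `random.sample` draws distinct indices in one call. The helper name `pick_unique_indices` and its `seed` parameter are hypothetical and not part of the notebook; 642 is the size of the test split reported later in the logs.

```python
import random

def pick_unique_indices(n_items, num_examples=5, seed=None):
    """Return num_examples distinct indices drawn from range(n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # the optional seed only makes the draw repeatable
    return rng.sample(range(n_items), num_examples)

picks = pick_unique_indices(642, num_examples=5, seed=0)
print(picks)
```

The returned list could be used in place of `picks` when indexing the dataset, for example `dataset[picks]`.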


# Preprocessing the data

We load the pre-trained tokenizer, which tokenizes the inputs, puts them in the format the model expects, and generates the other inputs the model needs. Instantiating the tokenizer with the `from_pretrained` function ensures that:
- We get a tokenizer that corresponds to the architecture of the model we want to train.
- We download the vocabulary used when pretraining this specific checkpoint.

We use a [pre-trained model](https://huggingface.co/transformers/v3.3.1/pretrained_models.html) for translation, in this case the [MarianMT model](https://huggingface.co/models?language=es&pipeline_tag=translation&sort=downloads).

In [6]:
from transformers import AutoTokenizer, MarianTokenizer

model_marian = "Helsinki-NLP/opus-mt-es-es"
tokenizer = MarianTokenizer.from_pretrained(model_marian)

Before feeding the data to our model, we write the function that preprocesses our samples. The argument `truncation=True` ensures that any input longer than the selected maximum length (`max_input_length` for sources, `max_target_length` for targets) is truncated to that length. Padding is handled later by a data collator, so examples are padded to the longest length in each batch rather than the longest in the whole dataset.

In [7]:
max_input_length = 128
max_target_length = 128
source_lang = "ls"
target_lang = "le"

def preprocess_function(examples):
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [8]:
preprocess_function(dataset['train'][:2])

{'input_ids': [[58, 25801, 22021, 30206, 28989, 13780, 8641, 500, 714, 748, 7855, 10717, 0], [1516, 10314, 533, 8436, 1627, 11467, 6178, 128, 65, 1627, 11467, 6178, 128, 29970, 3467, 415, 533, 2535, 5094, 24774, 25884, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[18, 100, 10, 1496, 12890, 2817, 2081, 9239, 29, 223, 3829, 1204, 7, 8123, 0], [14, 897, 3199, 6430, 44, 18, 44, 22370, 365, 3, 33, 30268, 8789, 123, 9285, 6330, 0]]}
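
The two encoded examples above have different lengths (13 and 22 tokens), which is why padding is left to the collator. The sketch below illustrates the idea in plain Python under simplifying assumptions: `truncate` and `pad_to_longest` are hypothetical helpers, the padding id is a placeholder, and real tokenizer truncation also preserves the end-of-sequence token.

```python
PLACEHOLDER_PAD_ID = -1  # hypothetical value, not the tokenizer's real pad id

def truncate(ids, max_length):
    """Keep at most max_length token ids (simplified: ignores special tokens)."""
    return ids[:max_length]

def pad_to_longest(batch, pad_id):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Two dummy sequences with the same lengths as the examples above.
batch = [list(range(13)), list(range(22))]
batch = [truncate(ids, 128) for ids in batch]
input_ids, attention_mask = pad_to_longest(batch, PLACEHOLDER_PAD_ID)

print([len(ids) for ids in input_ids])  # [22, 22]
print(sum(attention_mask[0]))           # 13
```

Padding per batch means a batch of short sentences stays short, instead of every example being stretched to the dataset-wide maximum.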

We use the `map` method of the dataset object created earlier to apply this function to every pair in the dataset.

In [9]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at /home/marina/.cache/huggingface/datasets/json/default-6d84514352d255a9/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50/cache-7a1c3109455d186a.arrow
Loading cached processed dataset at /home/marina/.cache/huggingface/datasets/json/default-6d84514352d255a9/0.0.0/d75ead8d5cfcbe67495df0f89bd262f0023257fbbbd94a730313295f3d756d50/cache-0da66b7009c774cf.arrow
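
With `batched=True`, the function receives a dictionary of columns (lists) rather than a single record, which is why `preprocess_function` iterates over `examples["translation"]`. The following is a rough plain-Python imitation of that calling convention, not the library's implementation. `map_batched` and `toy_preprocess` are hypothetical names, and the whitespace token count stands in for real tokenization.

```python
def map_batched(records, fn, batch_size=1000):
    """Call fn on column-wise batches and merge the returned columns into each record."""
    out = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        # Turn a list of records into a dict of lists, one list per column.
        batch = {key: [rec[key] for rec in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, rec in enumerate(chunk):
            merged = dict(rec)
            merged.update({name: values[i] for name, values in new_columns.items()})
            out.append(merged)
    return out

def toy_preprocess(examples):
    # Stand-in for tokenization: count whitespace-separated source tokens.
    return {"n_source_tokens": [len(ex["ls"].split()) for ex in examples["translation"]]}

records = [
    {"translation": {"ls": "NOCHE DIECISIETE NORTE CINCO MONTAÑA",
                     "le": "esta noche diecisiete grados en el mar báltico y cinco grados en los alpes"}},
    {"translation": {"ls": "LOS VIERNES LAS TORMENTAS PUEDEN CONOCER EL VIENTO",
                     "le": "y de la noche al viernes, las tormentas eléctricas probablemente también pueden convertirse en tormentas"}},
]

processed = map_batched(records, toy_preprocess, batch_size=1)
print([rec["n_source_tokens"] for rec in processed])  # [5, 8]
```
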


# Fine-tuning the model

In [10]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

def model_init():
    return AutoModelForSeq2SeqLM.from_pretrained(model_marian)

In [11]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_marian)

We define a function that computes the metrics from the predictions. It uses the metric we loaded earlier and decodes the predictions into text. In addition, the decoded predictions are written to a file stored in the indicated directory.

In [12]:
import numpy as np

def postprocess_text(preds, labels):
    # Strip whitespace; the metric expects a list of references per prediction.
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics_bleu(eval_preds):
    preds, labels = eval_preds

    # The predictions may arrive as a tuple; the token ids come first.
    if isinstance(preds, tuple):
        preds = preds[0]

    # Convert token ids back into text.
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    # The output file is named after the BLEU score of this evaluation.
    file = 'results-' + str(result['score']) + '.txt'

    result = {"bleu": result["score"]}

    # Write the decoded predictions, one per line (the directory must already exist).
    with open('prueba_hiperparametros/' + file, 'w') as f:
        f.write('\n'.join(decoded_preds))

    # Average number of non-padding tokens per prediction.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    print(result)
    return result
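
The `gen_len` value is the mean number of non-padding tokens per prediction. The sketch below reproduces that calculation in plain Python on made-up token ids. `mean_generated_length` is a hypothetical helper and `PAD` is a placeholder id, not the tokenizer's actual padding id.

```python
PAD = 0  # placeholder padding id, chosen only for this illustration

def mean_generated_length(preds, pad_token_id):
    """Average count of tokens that are not padding, one count per prediction."""
    lengths = [sum(1 for tok in pred if tok != pad_token_id) for pred in preds]
    return sum(lengths) / len(lengths)

# Two made-up predictions, padded to the same length.
preds = [
    [5, 8, 13, PAD, PAD],
    [7, 9, PAD, PAD, PAD],
]
print(mean_generated_length(preds, PAD))  # 2.5
```

Read this way, a `gen_len` close to the generation limit likely means the model kept producing tokens without ever ending the sentence, which is worth keeping in mind when comparing runs.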

We define the training arguments that customize the training. The only required argument is the name of the folder where the checkpoints are saved.

In [13]:
args = Seq2SeqTrainingArguments(
    "prueba_hiperparametros",
    save_total_limit=3,
    predict_with_generate=True,
    eval_accumulation_steps=1,
    fp16=True,
)
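
The `save_total_limit=3` setting keeps only the most recent checkpoints on disk. The sketch below imitates that rotation in plain Python, assuming checkpoints are ordered by step and the oldest ones are removed first. `rotate_checkpoints` is a hypothetical helper, and any special handling the trainer may apply to a best checkpoint is ignored.

```python
def rotate_checkpoints(saved_steps, new_step, save_total_limit):
    """Return (kept, deleted) step lists after saving a checkpoint at new_step."""
    saved = sorted(saved_steps + [new_step])
    n_delete = max(0, len(saved) - save_total_limit)
    return saved[n_delete:], saved[:n_delete]

kept = []
deleted_when_saving = {}
for step in range(500, 7001, 500):  # a checkpoint every 500 steps
    kept, deleted = rotate_checkpoints(kept, step, save_total_limit=3)
    deleted_when_saving[step] = deleted

print(deleted_when_saving[2000])  # [500]
print(deleted_when_saving[6500])  # [5000]
print(kept)                       # [6000, 6500, 7000]
```
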

In [14]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Before training, we run a hyperparameter search. The `hyperparameter_search` method returns a `BestRun` object, which contains the value of the objective that was maximized (by default, the sum of all metrics) and the hyperparameters used for that run.

In [15]:
def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 10),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8]),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-1),
    }
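
The learning rate is sampled with `log=True`, so candidates are spread evenly across orders of magnitude rather than evenly across the raw interval. The sketch below imitates log-uniform sampling with the standard library to show the effect. `sample_log_uniform` is a hypothetical helper and not the sampler the search backend actually uses.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_log_uniform(rng, 1e-4, 1e-2) for _ in range(10_000)]

# About half the samples fall below 1e-3, the geometric midpoint of the range.
# A plain uniform draw over [1e-4, 1e-2] would put only about 9% of them there.
fraction_below = sum(s < 1e-3 for s in samples) / len(samples)
print(round(fraction_below, 2))
```
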

In [25]:
trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics_bleu,
)


best_trial = trainer.hyperparameter_search(direction="maximize",n_trials=5, hp_space=my_hp_space)


loading configuration file https://huggingface.co/Helsinki-NLP/opus-mt-es-es/resolve/main/config.json from cache at /home/marina/.cache/huggingface/transformers/5f8704e1d92551880c1978068cafdf6503c7a82b65e4494be411d26cb7e86ae5.54793584623797a1e55aa65ed13d20fb17e92502019985cae46974750ad85593
Model config MarianConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      33252
    ]
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 512,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 2048,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 33252,
  "dropout": 0.1,
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 2048,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 0,
  "forced_eos_token_id": 

Step,Training Loss
500,4.7258
1000,4.718
1500,4.9142
2000,5.3663
2500,5.359
3000,5.3807
3500,5.4338
4000,5.3042
4500,5.3672
5000,5.4243


  nn.utils.clip_grad_norm_(
Saving model checkpoint to prueba_hiperparametros/run-0/checkpoint-500
Configuration saved in prueba_hiperparametros/run-0/checkpoint-500/config.json
Model weights saved in prueba_hiperparametros/run-0/checkpoint-500/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-0/checkpoint-500/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-0/checkpoint-500/special_tokens_map.json
Saving model checkpoint to prueba_hiperparametros/run-0/checkpoint-1000
Configuration saved in prueba_hiperparametros/run-0/checkpoint-1000/config.json
Model weights saved in prueba_hiperparametros/run-0/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-0/checkpoint-1000/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-0/checkpoint-1000/special_tokens_map.json
  nn.utils.clip_grad_norm_(
Saving model checkpoint to prueba_hiperparametros/run-0/checkpoint-1500
Configura

Saving model checkpoint to prueba_hiperparametros/run-0/checkpoint-6500
Configuration saved in prueba_hiperparametros/run-0/checkpoint-6500/config.json
Model weights saved in prueba_hiperparametros/run-0/checkpoint-6500/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-0/checkpoint-6500/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-0/checkpoint-6500/special_tokens_map.json
Deleting older checkpoint [prueba_hiperparametros/run-0/checkpoint-5000] due to args.save_total_limit
Saving model checkpoint to prueba_hiperparametros/run-0/checkpoint-7000
Configuration saved in prueba_hiperparametros/run-0/checkpoint-7000/config.json
Model weights saved in prueba_hiperparametros/run-0/checkpoint-7000/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-0/checkpoint-7000/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-0/checkpoint-7000/special_tokens_map.json
Deleting older checkpoint [pru

Model weights saved in prueba_hiperparametros/run-0/checkpoint-14000/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-0/checkpoint-14000/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-0/checkpoint-14000/special_tokens_map.json
Deleting older checkpoint [prueba_hiperparametros/run-0/checkpoint-12500] due to args.save_total_limit


Training completed. Do not forget to share your model on huggingface.co/models =)


The following columns in the evaluation set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation.
***** Running Evaluation *****
  Num examples = 642
  Batch size = 8
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


[I 2022-02-17 13:39:56,101] Trial 0 finished with value: 511.0005 and parameters: {'learning_rate': 0.0007469165006449348, 'num_train_epochs': 8, 'per_device_train_batch_size': 4, 'weight_decay': 0.08783717437402438}. Best is trial 0 with value: 511.0005.
Trial:


{'bleu': 0.0005, 'gen_len': 511.0}


Step,Training Loss
500,3.9292
1000,3.1296
1500,2.7109
2000,2.4976
2500,2.2656
3000,2.2725
3500,2.1848


Saving model checkpoint to prueba_hiperparametros/run-1/checkpoint-500
Configuration saved in prueba_hiperparametros/run-1/checkpoint-500/config.json
Model weights saved in prueba_hiperparametros/run-1/checkpoint-500/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-1/checkpoint-500/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-1/checkpoint-500/special_tokens_map.json
Saving model checkpoint to prueba_hiperparametros/run-1/checkpoint-1000
Configuration saved in prueba_hiperparametros/run-1/checkpoint-1000/config.json
Model weights saved in prueba_hiperparametros/run-1/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-1/checkpoint-1000/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-1/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to prueba_hiperparametros/run-1/checkpoint-1500
Configuration saved in prueba_hiperparametros/run-1/checkpoint-15

{'bleu': 8.4392, 'gen_len': 17.0421}


Step,Training Loss
500,6.0035
1000,5.6336
1500,5.5446
2000,5.5208
2500,5.4935
3000,5.5177
3500,5.536
4000,5.5076
4500,5.4802
5000,5.5094


Saving model checkpoint to prueba_hiperparametros/run-2/checkpoint-500
Configuration saved in prueba_hiperparametros/run-2/checkpoint-500/config.json
Model weights saved in prueba_hiperparametros/run-2/checkpoint-500/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-2/checkpoint-500/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-2/checkpoint-500/special_tokens_map.json
Saving model checkpoint to prueba_hiperparametros/run-2/checkpoint-1000
Configuration saved in prueba_hiperparametros/run-2/checkpoint-1000/config.json
Model weights saved in prueba_hiperparametros/run-2/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-2/checkpoint-1000/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-2/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to prueba_hiperparametros/run-2/checkpoint-1500
Configuration saved in prueba_hiperparametros/run-2/checkpoint-15

{'bleu': 0.0005, 'gen_len': 511.0}


Step,Training Loss
500,3.5947
1000,2.6711
1500,2.3655
2000,2.1619
2500,1.9736
3000,1.7773
3500,1.682


Saving model checkpoint to prueba_hiperparametros/run-3/checkpoint-500
Configuration saved in prueba_hiperparametros/run-3/checkpoint-500/config.json
Model weights saved in prueba_hiperparametros/run-3/checkpoint-500/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-3/checkpoint-500/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-3/checkpoint-500/special_tokens_map.json
Saving model checkpoint to prueba_hiperparametros/run-3/checkpoint-1000
Configuration saved in prueba_hiperparametros/run-3/checkpoint-1000/config.json
Model weights saved in prueba_hiperparametros/run-3/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-3/checkpoint-1000/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-3/checkpoint-1000/special_tokens_map.json
  nn.utils.clip_grad_norm_(
Saving model checkpoint to prueba_hiperparametros/run-3/checkpoint-1500
Configuration saved in prueba_hiperpa

{'bleu': 9.0129, 'gen_len': 16.4283}


Step,Training Loss
500,5.9058


Saving model checkpoint to prueba_hiperparametros/run-4/checkpoint-500
Configuration saved in prueba_hiperparametros/run-4/checkpoint-500/config.json
Model weights saved in prueba_hiperparametros/run-4/checkpoint-500/pytorch_model.bin
tokenizer config file saved in prueba_hiperparametros/run-4/checkpoint-500/tokenizer_config.json
Special tokens file saved in prueba_hiperparametros/run-4/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


The following columns in the evaluation set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation.
***** Running Evaluation *****
  Num examples = 642
  Batch size = 8
[I 2022-02-17 14:10:10,921] Trial 4 finished with value: 511.0005 and parameters: {'learning_rate': 0.002587833506997122, 'num_train_epochs': 1, 'per_device_train_batch_size': 8, 'weight_decay': 0.029129747516881056}. Best is trial 0 with value: 511.0005.


{'bleu': 0.0005, 'gen_len': 511.0}


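The best BLEU score (9.0129) is obtained in the 4th execution of the `hyperparameter_search` method (run 3). Note, however, that the `BestRun` returned below is run 0: the default objective sums all metrics, so its value of 511.0005 is `bleu` (0.0005) plus `gen_len` (511.0) and is dominated by the generation length rather than by translation quality.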

In [26]:
best_trial

BestRun(run_id='0', objective=511.0005, hyperparameters={'learning_rate': 0.0007469165006449348, 'num_train_epochs': 8, 'per_device_train_batch_size': 4, 'weight_decay': 0.08783717437402438})
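
The objective of 511.0005 equals the `bleu` and `gen_len` values of run 0 added together, which is what a sum over all metrics would produce. The sketch below replays the five trial results from the logs above under two objectives: that sum, and BLEU alone. The function names are hypothetical, and the `eval_` key prefix is an assumption based on the metric names printed elsewhere in this notebook.

```python
# Metrics reported by each trial in the logs above (run 0 to run 4).
trial_metrics = [
    {"eval_bleu": 0.0005, "eval_gen_len": 511.0},
    {"eval_bleu": 8.4392, "eval_gen_len": 17.0421},
    {"eval_bleu": 0.0005, "eval_gen_len": 511.0},
    {"eval_bleu": 9.0129, "eval_gen_len": 16.4283},
    {"eval_bleu": 0.0005, "eval_gen_len": 511.0},
]

def sum_objective(metrics):
    """Mimic an objective that adds up every reported metric."""
    return sum(metrics.values())

def bleu_objective(metrics):
    """Score a run by translation quality only."""
    return metrics["eval_bleu"]

def best_run(objective):
    """Index of the first run with the highest objective value."""
    scores = [objective(m) for m in trial_metrics]
    return max(range(len(scores)), key=scores.__getitem__)

print(best_run(sum_objective))   # 0
print(best_run(bleu_objective))  # 3
```

A function shaped like `bleu_objective` could be passed to `hyperparameter_search` through its `compute_objective` argument, so that the search maximizes BLEU instead of the sum.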

## Training

We define the arguments the model uses during training.

In [19]:
batch_size = 8
args = Seq2SeqTrainingArguments(
    "experimentos/full_dataset",
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.03,
    save_total_limit=3,
    num_train_epochs=8,
    predict_with_generate=True,
    eval_accumulation_steps=1,
    fp16=True,

)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [19]:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics_bleu,
)

Using amp fp16 backend


In [21]:
trained = trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 2.7614 | 2.426999 | 10.5473 | 22.8255 |
| 2 | 2.3221 | 2.167676 | 11.8444 | 20.7181 |
| 3 | 2.0346 | 2.048408 | 10.8746 | 16.0498 |
| 4 | 1.7675 | 2.014832 | 12.6676 | 18.9408 |
| 5 | 1.5255 | 2.023945 | 12.5768 | 17.4206 |
| 6 | 1.3208 | 2.048484 | 12.0464 | 16.852 |
| 7 | 1.1499 | 2.077922 | 11.7131 | 16.5234 |
| 8 | 1.0095 | 2.10383 | 11.8573 | 16.4579 |
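
The per-epoch results can be scanned programmatically to see where the model peaks. The sketch below copies the rows of the table above into a list and looks up the epoch with the highest BLEU and the one with the lowest validation loss. The variable names are illustrative only.

```python
# (epoch, training loss, validation loss, BLEU, generation length)
history = [
    (1, 2.7614, 2.426999, 10.5473, 22.8255),
    (2, 2.3221, 2.167676, 11.8444, 20.7181),
    (3, 2.0346, 2.048408, 10.8746, 16.0498),
    (4, 1.7675, 2.014832, 12.6676, 18.9408),
    (5, 1.5255, 2.023945, 12.5768, 17.4206),
    (6, 1.3208, 2.048484, 12.0464, 16.852),
    (7, 1.1499, 2.077922, 11.7131, 16.5234),
    (8, 1.0095, 2.10383, 11.8573, 16.4579),
]

best_by_bleu = max(history, key=lambda row: row[3])
best_by_val_loss = min(history, key=lambda row: row[2])

print(best_by_bleu[0], best_by_bleu[3])          # 4 12.6676
print(best_by_val_loss[0], best_by_val_loss[2])  # 4 2.014832
```

Both criteria point to epoch 4. After that the training loss keeps falling while the validation loss rises, which is a typical sign of overfitting. One option, not used in this notebook, is the `load_best_model_at_end` training argument together with `metric_for_best_model`, which would restore the best checkpoint instead of the last one.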


Saving model checkpoint to experimentos/full_dataset/checkpoint-500
Configuration saved in experimentos/full_dataset/checkpoint-500/config.json
Model weights saved in experimentos/full_dataset/checkpoint-500/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-500/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-500/special_tokens_map.json
Saving model checkpoint to experimentos/full_dataset/checkpoint-1000
Configuration saved in experimentos/full_dataset/checkpoint-1000/config.json
Model weights saved in experimentos/full_dataset/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-1000/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to experimentos/full_dataset/checkpoint-1500
Configuration saved in experimentos/full_dataset/checkpoint-1500/config.json
Model weights saved i

{'bleu': 10.5473, 'gen_len': 22.8255}


Saving model checkpoint to experimentos/full_dataset/checkpoint-2000
Configuration saved in experimentos/full_dataset/checkpoint-2000/config.json
Model weights saved in experimentos/full_dataset/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-2000/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-2000/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint-500] due to args.save_total_limit
  nn.utils.clip_grad_norm_(
Saving model checkpoint to experimentos/full_dataset/checkpoint-2500
Configuration saved in experimentos/full_dataset/checkpoint-2500/config.json
Model weights saved in experimentos/full_dataset/checkpoint-2500/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-2500/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-2500/special_tokens_map.json
Deleting older checkpoint [experimen

{'bleu': 11.8444, 'gen_len': 20.7181}


Saving model checkpoint to experimentos/full_dataset/checkpoint-4000
Configuration saved in experimentos/full_dataset/checkpoint-4000/config.json
Model weights saved in experimentos/full_dataset/checkpoint-4000/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-4000/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-4000/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint-2500] due to args.save_total_limit
Saving model checkpoint to experimentos/full_dataset/checkpoint-4500
Configuration saved in experimentos/full_dataset/checkpoint-4500/config.json
Model weights saved in experimentos/full_dataset/checkpoint-4500/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-4500/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-4500/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint

{'bleu': 10.8746, 'gen_len': 16.0498}


Saving model checkpoint to experimentos/full_dataset/checkpoint-5500
Configuration saved in experimentos/full_dataset/checkpoint-5500/config.json
Model weights saved in experimentos/full_dataset/checkpoint-5500/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-5500/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-5500/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint-4000] due to args.save_total_limit
Saving model checkpoint to experimentos/full_dataset/checkpoint-6000
Configuration saved in experimentos/full_dataset/checkpoint-6000/config.json
Model weights saved in experimentos/full_dataset/checkpoint-6000/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-6000/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-6000/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint

{'bleu': 12.6676, 'gen_len': 18.9408}


Saving model checkpoint to experimentos/full_dataset/checkpoint-7500
Configuration saved in experimentos/full_dataset/checkpoint-7500/config.json
Model weights saved in experimentos/full_dataset/checkpoint-7500/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-7500/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-7500/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint-6000] due to args.save_total_limit
Saving model checkpoint to experimentos/full_dataset/checkpoint-8000
Configuration saved in experimentos/full_dataset/checkpoint-8000/config.json
Model weights saved in experimentos/full_dataset/checkpoint-8000/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-8000/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-8000/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint

{'bleu': 12.5768, 'gen_len': 17.4206}


Saving model checkpoint to experimentos/full_dataset/checkpoint-9000
Configuration saved in experimentos/full_dataset/checkpoint-9000/config.json
Model weights saved in experimentos/full_dataset/checkpoint-9000/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-9000/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-9000/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint-7500] due to args.save_total_limit
Saving model checkpoint to experimentos/full_dataset/checkpoint-9500
Configuration saved in experimentos/full_dataset/checkpoint-9500/config.json
Model weights saved in experimentos/full_dataset/checkpoint-9500/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-9500/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-9500/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint

{'bleu': 12.0464, 'gen_len': 16.852}


Saving model checkpoint to experimentos/full_dataset/checkpoint-11000
Configuration saved in experimentos/full_dataset/checkpoint-11000/config.json
Model weights saved in experimentos/full_dataset/checkpoint-11000/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-11000/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-11000/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint-9500] due to args.save_total_limit
Saving model checkpoint to experimentos/full_dataset/checkpoint-11500
Configuration saved in experimentos/full_dataset/checkpoint-11500/config.json
Model weights saved in experimentos/full_dataset/checkpoint-11500/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-11500/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-11500/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/

{'bleu': 11.7131, 'gen_len': 16.5234}


Saving model checkpoint to experimentos/full_dataset/checkpoint-12500
Configuration saved in experimentos/full_dataset/checkpoint-12500/config.json
Model weights saved in experimentos/full_dataset/checkpoint-12500/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-12500/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-12500/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset/checkpoint-11000] due to args.save_total_limit
Saving model checkpoint to experimentos/full_dataset/checkpoint-13000
Configuration saved in experimentos/full_dataset/checkpoint-13000/config.json
Model weights saved in experimentos/full_dataset/checkpoint-13000/pytorch_model.bin
tokenizer config file saved in experimentos/full_dataset/checkpoint-13000/tokenizer_config.json
Special tokens file saved in experimentos/full_dataset/checkpoint-13000/special_tokens_map.json
Deleting older checkpoint [experimentos/full_dataset

{'bleu': 11.8573, 'gen_len': 16.4579}




Training completed. Do not forget to share your model on huggingface.co/models =)




We compute the predictions on the test set.

In [31]:
predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.metrics)


The following columns in the test set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation.
***** Running Prediction *****
  Num examples = 642
  Batch size = 4


{'bleu': 11.8573, 'gen_len': 16.4579}
{'eval_loss': 2.103830099105835, 'eval_bleu': 11.8573, 'eval_gen_len': 16.4579, 'eval_runtime': 33.9896, 'eval_samples_per_second': 18.888, 'eval_steps_per_second': 4.737}
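
The throughput figures in the metrics can be checked against the example count, batch size, and runtime reported in the same log. This is a small arithmetic sketch that assumes the last, partial batch counts as a full step.

```python
import math

num_examples = 642  # "Num examples" in the prediction log
batch_size = 4      # "Batch size" in the prediction log
runtime = 33.9896   # eval_runtime, in seconds

# 160 full batches plus one partial batch.
steps = math.ceil(num_examples / batch_size)

print(steps)                             # 161
print(round(num_examples / runtime, 3))  # 18.888
print(round(steps / runtime, 3))         # 4.737
```

Both rates match the reported `eval_samples_per_second` and `eval_steps_per_second`.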
