# Fine-tuning the T5 base language model on a question answering (QA) task with the Portuguese SQuAD 1.1 dataset

- **Credit**: this notebook is copied/pasted with small changes from [PyTorch Examples](https://huggingface.co/docs/transformers/notebooks#pytorch-examples) of Hugging Face (notebook [summarization.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb) and scripts [run_seq2seq_qa.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_seq2seq_qa.py), [trainer_seq2seq_qa.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/trainer_seq2seq_qa.py)).
- **Author**: [Pierre GUILLOU](https://www.linkedin.com/in/pierreguillou/)
- **Date**: 27/01/2022
- **Blog post**: [NLP nas empresas | Como eu treinei um modelo T5 em português na tarefa QA no Google Colab](https://medium.com/@pierre_guillou/nlp-nas-empresas-como-eu-treinei-um-modelo-t5-em-portugu%C3%AAs-na-tarefa-qa-no-google-colab-e8eb0dc38894)

# Overview

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a question answering (QA) task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for this kind of task and use the Trainer API to fine-tune a model on it.

![Widget inference on a QA task](https://github.com/huggingface/notebooks/raw/6204dcb906c6bc6d98168064b6aa27d15885f2fb/examples/images/question_answering.png)

## Configuration

If you're opening this notebook on Colab, you need to mount your Google Drive, since the checkpoints and logs are saved there.

In [1]:
from google.colab import drive 
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
!nvidia-smi

Wed Jan 26 16:13:35 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`unicamp-dl/ptt5-base-portuguese-vocab`](https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab) checkpoint.

In [3]:
model_checkpoint = "unicamp-dl/ptt5-base-portuguese-vocab"

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies. Run the following cell.

In [4]:
%%capture
!pip install datasets transformers[sentencepiece] 

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture above via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then execute the following cell and input your username and password:

In [27]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


Then you need to install Git-LFS. Run the following cell:

In [28]:
%%capture
!apt install git-lfs

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [5]:
import transformers
transformers.logging.set_verbosity_info()

print(transformers.__version__)
# 4.15.0

4.15.0
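The version requirement above can also be checked programmatically. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `is_at_least` are hypothetical and not part of 🤗 Transformers. It compares the numeric components rather than the raw strings, because a plain string comparison would rank `"4.9.2"` above `"4.11.0"`.

```python
def version_tuple(version: str) -> tuple:
    """Turn '4.15.0' into (4, 15, 0); stop at the first non-numeric piece (e.g. 'dev0')."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, required: str) -> bool:
    """Return True when the installed version is greater than or equal to the required one."""
    return version_tuple(installed) >= version_tuple(required)


print(is_at_least("4.15.0", "4.11.0"))  # True
print(is_at_least("4.9.2", "4.11.0"))   # False, although "4.9.2" > "4.11.0" as plain strings
```

In the notebook, `transformers.__version__` would be passed as the first argument.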


In [6]:
# get QA classes
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/question-answering/trainer_seq2seq_qa.py

--2022-01-26 16:13:53--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/question-answering/trainer_seq2seq_qa.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5337 (5.2K) [text/plain]
Saving to: ‘trainer_seq2seq_qa.py’


2022-01-26 16:13:53 (54.7 MB/s) - ‘trainer_seq2seq_qa.py’ saved [5337/5337]



In [7]:
from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    set_seed,
)

from trainer_seq2seq_qa import QuestionAnsweringSeq2SeqTrainer

from transformers.trainer_utils import EvalLoopOutput, EvalPrediction

from datasets import load_dataset, load_metric

import numpy as np
import json 
import pathlib
from pathlib import Path

By default, the call below will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

In [8]:
# get tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Could not locate the tokenizer configuration file, will try to use the model config instead.
https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpztpqj5il


Downloading:   0%|          | 0.00/456 [00:00<?, ?B/s]

storing https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/4d7ac1469dc85fbf7a41e825eafd1794015ec2ed6d5e45016abcbaea5bd890b4.7523e41da181cf87711fbd0047f1275f5e0d9e7aebd27808188316c2df30b5c6
creating metadata file for /root/.cache/huggingface/transformers/4d7ac1469dc85fbf7a41e825eafd1794015ec2ed6d5e45016abcbaea5bd890b4.7523e41da181cf87711fbd0047f1275f5e0d9e7aebd27808188316c2df30b5c6
loading configuration file https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/4d7ac1469dc85fbf7a41e825eafd1794015ec2ed6d5e45016abcbaea5bd890b4.7523e41da181cf87711fbd0047f1275f5e0d9e7aebd27808188316c2df30b5c6
Model config T5Config {
  "_name_or_path": "unicamp-dl/ptt5-base-portuguese-vocab",
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,


Downloading:   0%|          | 0.00/738k [00:00<?, ?B/s]

storing https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/spiece.model in cache at /root/.cache/huggingface/transformers/48b7f844d7d808a93d6cf3043712e386b1ea5f5c44e899b3cf5a0658e756b5a9.5a8add86126151c5e87be7c1b2461092d73beefc4856c14636a73a8fc08b5fbf
creating metadata file for /root/.cache/huggingface/transformers/48b7f844d7d808a93d6cf3043712e386b1ea5f5c44e899b3cf5a0658e756b5a9.5a8add86126151c5e87be7c1b2461092d73beefc4856c14636a73a8fc08b5fbf
loading file https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/spiece.model from cache at /root/.cache/huggingface/transformers/48b7f844d7d808a93d6cf3043712e386b1ea5f5c44e899b3cf5a0658e756b5a9.5a8add86126151c5e87be7c1b2461092d73beefc4856c14636a73a8fc08b5fbf
loading file https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/added_tokens.json from cache at None
loa

In [9]:
max_input_length = 384 # 512
max_target_length = 32 # 32
val_max_answer_length = max_target_length

pad_to_max_length = True
padding = "max_length" if pad_to_max_length else False
ignore_pad_token_for_loss = True

max_seq_length = min(max_input_length, tokenizer.model_max_length)
generation_max_length = None
max_eval_samples = None

version_2_with_negative = False # squad 1.1

answer_column = "answers"

We can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

In [10]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/4d7ac1469dc85fbf7a41e825eafd1794015ec2ed6d5e45016abcbaea5bd890b4.7523e41da181cf87711fbd0047f1275f5e0d9e7aebd27808188316c2df30b5c6
Model config T5Config {
  "_name_or_path": "unicamp-dl/ptt5-base-portuguese-vocab",
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "transformers_version": "4.15.0",
  "use_cache": true,
  "vocab_size": 32128
}

https://huggingface.co/unicamp-dl/ptt5-base-portug

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

storing https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/38f5781ec218bdbc504de08941b3a198638870f5f96219e072fd58e5f3336200.545d54bb250b4fa1fd0f387f1c73173a0d43dec3185eae10b94de48b6132eef6
creating metadata file for /root/.cache/huggingface/transformers/38f5781ec218bdbc504de08941b3a198638870f5f96219e072fd58e5f3336200.545d54bb250b4fa1fd0f387f1c73173a0d43dec3185eae10b94de48b6132eef6
loading weights file https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/38f5781ec218bdbc504de08941b3a198638870f5f96219e072fd58e5f3336200.545d54bb250b4fa1fd0f387f1c73173a0d43dec3185eae10b94de48b6132eef6
All model checkpoint weights were used when initializing T5ForConditionalGeneration.

All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at unicamp-dl/ptt5-base-portuguese-vocab.
If your task 

Let's define all hyperparameters of our training job.

In [29]:
# do training and evaluation
do_train = True
do_eval = True

# batch
batch_size = 4
gradient_accumulation_steps = 3
per_device_train_batch_size = batch_size
per_device_eval_batch_size = per_device_train_batch_size*16

# LR, wd, epochs
learning_rate = 1e-4
weight_decay = 0.01
num_train_epochs = 10
fp16 = True

# logs
logging_strategy = "steps"
logging_first_step = True 
logging_steps = 3000     # if logging_strategy = "steps"
eval_steps = logging_steps 

# checkpoints
evaluation_strategy = logging_strategy
save_strategy = logging_strategy
save_steps = logging_steps
save_total_limit = 3

# best model
load_best_model_at_end = True
metric_for_best_model = "f1" #"loss"
if metric_for_best_model == "loss":
  greater_is_better = False
else:
  greater_is_better = True  

# evaluation
num_beams = 1

# folders
model_name = model_checkpoint.split("/")[-1]
folder_model = 'e' + str(num_train_epochs) + '_lr' + str(learning_rate)
output_dir = '/content/drive/MyDrive/' + str(model_name) + '/checkpoints/' + folder_model
Path(output_dir).mkdir(parents=True, exist_ok=True)    #python 3.5 above
logging_dir = '/content/drive/MyDrive/' + str(model_name) + '/logs/' + folder_model
Path(logging_dir).mkdir(parents=True, exist_ok=True)    #python 3.5 above
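To make the folder naming concrete, here is a small sketch that rebuilds the two paths from the values used above without touching the file system. It uses `PurePosixPath` instead of string concatenation, which is a stylistic choice of this sketch and not what the cell above does.

```python
from pathlib import PurePosixPath

model_checkpoint = "unicamp-dl/ptt5-base-portuguese-vocab"
num_train_epochs = 10
learning_rate = 1e-4

# Keep only the part after the organization name
model_name = model_checkpoint.split("/")[-1]
# One sub-folder per (epochs, learning rate) combination
folder_model = f"e{num_train_epochs}_lr{learning_rate}"

root = PurePosixPath("/content/drive/MyDrive")
output_dir = root / model_name / "checkpoints" / folder_model
logging_dir = root / model_name / "logs" / folder_model

print(output_dir)   # /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001
print(logging_dir)  # /content/drive/MyDrive/ptt5-base-portuguese-vocab/logs/e10_lr0.0001
```

Encoding the hyperparameters in the folder name keeps the checkpoints and logs of different runs separate on Google Drive.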

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [30]:
# Training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    do_train=do_train,
    do_eval=do_eval,
    evaluation_strategy=evaluation_strategy,
    learning_rate=learning_rate,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    weight_decay=weight_decay,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    save_steps=save_steps,
    save_total_limit=save_total_limit,  
    save_strategy=save_strategy,
    load_best_model_at_end=load_best_model_at_end,
    metric_for_best_model=metric_for_best_model,
    greater_is_better=greater_is_better,
    logging_dir=logging_dir,         # directory for storing logs
    logging_strategy=logging_strategy,
    logging_steps=logging_steps,     # if logging_strategy = "steps" 
    fp16=fp16,
    push_to_hub=False, 
)

using `logging_steps` to initialize `eval_steps` to 3000
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Here we set the evaluation to run every `eval_steps` steps (3000, the same value as `logging_steps`), set the learning rate, use the `batch_size` defined in the hyperparameters cell above together with gradient accumulation, and customize the weight decay. Since the `Seq2SeqTrainer` saves the model regularly and our dataset is quite large, we tell it to keep at most three checkpoints. Lastly, we use the `predict_with_generate` option (so that answers are generated during evaluation) and activate mixed precision training (to go a bit faster).
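As a rough back-of-the-envelope check of what these settings imply, the sketch below estimates the effective batch size and the number of optimizer updates. It assumes a single GPU and the 87,510 training rows reported in the dataset section of this notebook; the rounding is an approximation of what the Trainer does, so the exact counts in the training logs may differ slightly.

```python
import math

train_rows = 87510                # assumption: size of the training split shown later
per_device_train_batch_size = 4
gradient_accumulation_steps = 3
num_train_epochs = 10
eval_steps = 3000
n_gpus = 1                        # assumption: a single GPU

# Number of examples that contribute to one optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_gpus

# Forward/backward passes per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(train_rows / (per_device_train_batch_size * n_gpus))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

# Evaluations (and checkpoints) triggered every `eval_steps` updates
n_evaluations = total_updates // eval_steps

print(effective_batch_size)  # 12
print(updates_per_epoch)     # 7292
print(total_updates)         # 72920
print(n_evaluations)         # 24
```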

The last argument, `push_to_hub`, controls whether the model is pushed to the [Hub](https://huggingface.co/models) regularly during training. It is set to `False` here; set it to `True` only if you followed the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

## Loading the dataset

A version of SQuAD 1.1 pt exists on the Hugging Face datasets hub, but it comes without any information. We therefore preferred to use the [version](https://forum.ailab.unb.br/t/datasets-em-portugues/251/4) from the Deep Learning Brasil group.

As this version is not on the Hugging Face datasets hub, we cannot use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data directly, so we used the following script.

Later, in the section "Push to the datasets hub of Hugging Face", we push the SQuAD 1.1 pt dataset, converted to the `DatasetDict()` format, to the Hugging Face datasets hub. From then on, we can load it with the [🤗 Datasets](https://github.com/huggingface/datasets) library.

### Get SQuAD 1.1 pt from the Web and convert it to a DatasetDict()

This section only needs to be run once, in order to get the SQuAD 1.1 pt dataset in the `DatasetDict()` format.

In [None]:
# Get dataset SQUAD in Portuguese
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn" -O squad-pt.tar.gz && rm -rf /tmp/cookies.txt

--2022-01-09 09:13:06--  https://docs.google.com/uc?export=download&confirm=zx7I&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn
Resolving docs.google.com (docs.google.com)... 108.177.121.139, 108.177.121.100, 108.177.121.101, ...
Connecting to docs.google.com (docs.google.com)|108.177.121.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0k-20-docs.googleusercontent.com/docs/securesc/0sc7su53ge01mr2icn0lbk82uvcbb4hn/83tu27nf111jnrci083g896mthhfphrf/1641719550000/03445611175480770093/18276664787243845082Z/1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn?e=download [following]
--2022-01-09 09:13:06--  https://doc-0k-20-docs.googleusercontent.com/docs/securesc/0sc7su53ge01mr2icn0lbk82uvcbb4hn/83tu27nf111jnrci083g896mthhfphrf/1641719550000/03445611175480770093/18276664787243845082Z/1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn?e=download
Resolving doc-0k-20-docs.googleusercontent.com (doc-0k-20-docs.googleusercontent.com)... 209.85.146.132, 2607:f8b0:4001:c1f::84
Connec

In [None]:
!tar -xvf squad-pt.tar.gz

squad-train-v1.1.json
squad-dev-v1.1.json


In [None]:
%%time
# new

# Get the train and validation json file in the HF script format 
# inspiration: file squad.py at https://github.com/huggingface/datasets/tree/master/datasets/squad

files = ['squad-train-v1.1.json','squad-dev-v1.1.json']

for file in files:
    
    # Opening JSON file & returns JSON object as a dictionary 
    f = open(file, encoding="utf-8") 
    data = json.load(f) 
    
    # Iterating through the json list 
    entry_list = list()
    id_list = list()

    for row in data['data']: 
        title = row['title']
        
        for paragraph in row['paragraphs']:
            context = paragraph['context']

            for qa in paragraph['qas']:
                entry = {}

                qa_id = qa['id']
                question = qa['question']
                answers = qa['answers']
                
                entry['id'] = qa_id
                # entry['title'] = title.strip()
                # entry['context'] = context.strip()
                # entry['question'] = question.strip()
                
                entry['input_ids'] = 'question: %s  context: %s' % (question.strip(), context.strip())
                
                answer_starts = [answer["answer_start"] for answer in answers]

                # keep unique texts
                answer_texts = [answer["text"].strip() for answer in answers]
                sorted_values, index_values = np.unique(answer_texts, return_index=True)
                answer_texts = (np.array(answer_texts)[index_values]).tolist()
                answer_starts = (np.array(answer_starts)[index_values]).tolist()

                # if len(answer_starts) > 1:
                #   print(qa_id)

                entry['answers'] = {}
                entry['answers']['answer_start'] = answer_starts
                entry['answers']['text'] = answer_texts

                #entry['labels'] = '%s' % answer_texts

                entry_list.append(entry)
                
    reverse_entry_list = entry_list[::-1]
    
    # for entries with the same id, keep only the last one (texts corrected by the Deep Learning Brasil group)
    unique_ids_list = list()
    unique_entry_list = list()
    for entry in reverse_entry_list:
        qa_id = entry['id']
        if qa_id not in unique_ids_list:
            unique_ids_list.append(qa_id)
            unique_entry_list.append(entry)
        
    # Closing file 
    f.close() 

    new_dict = {}
    new_dict['data'] = unique_entry_list

    file_name = 'pt_' + str(file)
    with open(file_name, 'w') as json_file:
        json.dump(new_dict, json_file)

CPU times: user 2min 7s, sys: 1.27 s, total: 2min 8s
Wall time: 2min 8s
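The conversion above can be tried on a tiny in-memory example. The sketch below is not the cell used in this notebook: `toy_squad` and `flatten_squad` are hypothetical names, and it uses dictionaries instead of NumPy and list scans. Two differences are worth noting: duplicate answer texts are dropped in order of first appearance (NumPy's `unique` returns its values sorted, so the cell above may reorder the answers alphabetically), and a repeated `id` is resolved by letting the later entry overwrite the earlier one, so the output keeps the original order instead of the reversed one.

```python
toy_squad = {
    "data": [{
        "title": "Exemplo",
        "paragraphs": [{
            "context": "Lisboa é a capital de Portugal e fica junto ao rio Tejo.",
            "qas": [
                {"id": "q1", "question": "Qual e a capital de Portugal?",
                 "answers": [{"text": "Lisboa", "answer_start": 0}]},
                {"id": "q2", "question": "Junto a que rio fica Lisboa?",
                 "answers": [{"text": "rio Tejo", "answer_start": 47},
                             {"text": "Tejo", "answer_start": 51},
                             {"text": "rio Tejo", "answer_start": 47}]},
                # Same id as the first entry: stands in for a corrected duplicate
                {"id": "q1", "question": "Qual é a capital de Portugal?",
                 "answers": [{"text": "Lisboa", "answer_start": 0}]},
            ],
        }],
    }]
}


def flatten_squad(squad):
    """Flatten SQuAD-style JSON into one entry per question id."""
    entries = {}
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"].strip()
            for qa in paragraph["qas"]:
                # Keep each distinct answer text once, with its first start offset
                first_start = {}
                for answer in qa["answers"]:
                    first_start.setdefault(answer["text"].strip(), answer["answer_start"])
                # A later entry with the same id overwrites the earlier one
                entries[qa["id"]] = {
                    "id": qa["id"],
                    "input_ids": "question: %s  context: %s" % (qa["question"].strip(), context),
                    "answers": {
                        "answer_start": list(first_start.values()),
                        "text": list(first_start.keys()),
                    },
                }
    return list(entries.values())


flat_entries = flatten_squad(toy_squad)
for entry in flat_entries:
    print(entry["id"], entry["answers"])
```

Because membership tests on a dictionary do not scan the whole collection, this variant should also scale better than checking `qa_id not in unique_ids_list` on a growing list.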


In [None]:
raw_datasets = load_dataset('json', 
                        data_files={'train': 'pt_squad-train-v1.1.json', 'validation': 'pt_squad-dev-v1.1.json'}, 
                        field='data')

Using custom data configuration default-d107bce3378a8358


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-d107bce3378a8358/0.0.0/c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-d107bce3378a8358/0.0.0/c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'input_ids', 'answers'],
        num_rows: 87510
    })
    validation: Dataset({
        features: ['id', 'input_ids', 'answers'],
        num_rows: 10570
    })
})

### Push to the datasets hub of Hugging Face

In order to save our `DatasetDict()`, we push it to the [datasets hub of Hugging Face](https://huggingface.co/datasets). 

However, as we are not the owners of this dataset, we push it in private mode.

In [None]:
raw_datasets.push_to_hub("pierreguillou/squad11pt", private=True)

Pushing split train to the Hub.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split validation to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

### Get the dataset SQuAD 1.1 pt from the datasets hub of Hugging Face

In [13]:
API_TOKEN = "xxxx" # use an API token from your HF profile
raw_datasets = load_dataset("pierreguillou/squad11pt", use_auth_token=API_TOKEN)

Downloading:   0%|          | 0.00/1.07k [00:00<?, ?B/s]

Using custom data configuration pierreguillou--squad11pt-85bf948064a7a578


Downloading and preparing dataset json/default (download: 25.37 MiB, generated: 91.00 MiB, post-processed: Unknown size, total: 116.38 MiB) to /root/.cache/huggingface/datasets/parquet/pierreguillou--squad11pt-85bf948064a7a578/0.0.0/1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121...


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/23.6M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.02M [00:00<?, ?B/s]

  0%|          | 0/2 [00:00<?, ?it/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/pierreguillou--squad11pt-85bf948064a7a578/0.0.0/1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

The `raw_datasets` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training and validation sets:

In [14]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'input_ids', 'answers'],
        num_rows: 87510
    })
    validation: Dataset({
        features: ['id', 'input_ids', 'answers'],
        num_rows: 10570
    })
})

To access an actual element, you need to select a split first, then give an index:

In [None]:
raw_datasets["train"][0]

{'answers': {'answer_start': [2],
  'text': ['Cidade Metropolitana de Catmandu']},
 'id': '5735d259012e2f140011a0a1',
 'input_ids': 'question: De que KMC é um inicialismo?  context: A Cidade Metropolitana de Catmandu (KMC), a fim de promover as relações internacionais, criou uma Secretaria de Relações Internacionais (IRC). O primeiro relacionamento internacional da KMC foi estabelecido em 1975 com a cidade de Eugene, Oregon, Estados Unidos. Essa atividade foi aprimorada ainda mais com o estabelecimento de relações formais com outras 8 cidades: Cidade de Motsumoto, Japão, Rochester, EUA, Yangon (antiga Rangum) de Mianmar, Xian da República Popular da China, Minsk da Bielorrússia e Pyongyang de República Democrática da Coréia. O esforço constante da KMC é aprimorar sua interação com os países da SAARC, outras agências internacionais e muitas outras grandes cidades do mundo para alcançar melhores programas de gestão urbana e desenvolvimento para Katmandu.'}
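Since the question and the context are stored together in the single `input_ids` string, it can be useful to split them again, for instance to check that `answer_start` really points at the answer inside the context. The following is a minimal sketch on the element shown above, with the context shortened for readability. It relies on the `question: ...  context: ...` layout (two spaces before `context:`) produced by the conversion cell.

```python
example = {
    "id": "5735d259012e2f140011a0a1",
    # Context shortened to its first sentence for this sketch
    "input_ids": (
        "question: De que KMC é um inicialismo?  "
        "context: A Cidade Metropolitana de Catmandu (KMC), a fim de promover as "
        "relações internacionais, criou uma Secretaria de Relações Internacionais (IRC)."
    ),
    "answers": {"answer_start": [2], "text": ["Cidade Metropolitana de Catmandu"]},
}

# Split on the first occurrence of the separator only
question_part, context = example["input_ids"].split("  context: ", 1)
question = question_part[len("question: "):]

start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
span = context[start:start + len(text)]

print(question)      # De que KMC é um inicialismo?
print(span == text)  # True
```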

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,id,input_ids,answers
0,5727bb374b864d1900163bcd,"question: Em que via navegável está localizada Detroit? context: Durante o final do século 19, várias mansões da Era Dourada, refletindo a riqueza da indústria e dos magnatas da navegação, foram construídas a leste e oeste do atual centro da cidade, ao longo das principais avenidas do plano Woodward. O mais notável deles foi a David Whitney House, localizada na 4421 Woodward Avenue, que se tornou um local privilegiado para mansões. Durante esse período, alguns se referiram a Detroit como a Paris do Ocidente por sua arquitetura, grandes avenidas no estilo de Paris e para o Washington Boulevard, recentemente eletrificado por Thomas Edison. A cidade havia crescido constantemente a partir da década de 1830, com o aumento das indústrias de transporte, construção naval e manufatura. Estrategicamente localizado ao longo da hidrovia dos Grandes Lagos, Detroit emergiu como um importante centro de portos e transportes.","{'answer_start': [776], 'text': ['Grandes Lagos']}"
1,56d8e547dc89441400fdb3a2,"question: Onde era o único lugar em que a bandeira tibetana poderia ser realizada? context: A polícia francesa foi criticada por lidar com os eventos e, principalmente, por confiscar bandeiras tibetanas dos manifestantes. O jornal Libération comentou: ""A polícia fez tanto que apenas os chineses tiveram liberdade de expressão. A bandeira tibetana era proibida em todos os lugares, exceto no Trocadéro"". A ministra do Interior, Michèle Alliot-Marie, afirmou mais tarde que a polícia não havia sido ordenada a fazê-lo e que agira por iniciativa própria. Um cinegrafista da France 2 foi atingido no rosto por um policial, ficou inconsciente e teve que ser enviado ao hospital.","{'answer_start': [298], 'text': ['o Trocadéro']}"
2,5732c3e8cc179a14009dac4a,"question: No que diz respeito aos crimes violentos contra grupos-alvo, qual é a motivação final nas ações do genocídio? context: Genocídio tornou-se um termo oficial usado nas relações internacionais. A palavra genocídio não era usada antes de 1944. Antes disso, em 1941, Winston Churchill descreveu o assassinato em massa de prisioneiros de guerra e civis russos como ""um crime sem nome"". Naquele ano, um advogado judeu polonês chamado Raphael Lemkin, descreveu as políticas de assassinato sistemático fundadas pelos nazistas como genocídio. A palavra genocídio é a combinação do prefixo grego geno- (que significa tribo ou raça) e caedere (a palavra latina para matar). A palavra é definida como um conjunto específico de crimes violentos que são cometidos contra um determinado grupo com a tentativa de remover todo o grupo da existência ou destruí-lo.","{'answer_start': [677], 'text': ['remover todo o grupo da existência ou destruí-lo']}"
3,572798b8708984140094e1cb,"question: Quem matou os dois azerbaijanos? context: Em 20 de fevereiro de 1988, após uma semana de crescentes manifestações em Stepanakert, capital do Oblast Autônomo de Nagorno-Karabakh (a área de maioria armênia da República Socialista Soviética do Azerbaijão), o Soviete Regional votou em se separar e se juntar à República Socialista Soviética da Armênia . Essa votação local em uma parte pequena e remota da União Soviética foi manchete em todo o mundo; foi um desafio sem precedentes da república e das autoridades nacionais. Em 22 de fevereiro de 1988, no que ficou conhecido como ""choque de Askeran"", dois azerbaijanos foram mortos pela Polícia de Karabakh. Essas mortes, anunciadas nas rádios estaduais, levaram ao Sumgait Pogrom. Entre 26 de fevereiro e 1º de março, a cidade de Sumgait (Azerbaijão) sofreu violentos tumultos anti-armênios, durante os quais 32 pessoas foram mortas. As autoridades perderam totalmente o controle e ocuparam a cidade com paraquedistas e tanques; quase todos os 14.000 residentes armênios de Sumgait fugiram.","{'answer_start': [593], 'text': ['Polícia de Karabakh.']}"
4,56f8be559b226e1400dd0f04,"question: Que país era um estado separado bem estabelecido no século XVI? context: Grande parte do período medieval foi uma época de lutas pelo poder entre dinastias concorrentes, como a Casa da Sabóia, os Visconti no norte da Itália e a Casa de Habsburgo na Áustria e na Eslovênia. Em 1291, para se protegerem das incursões dos Habsburgos, quatro cantões no meio da Suíça elaboraram uma carta que é considerada uma declaração de independência dos reinos vizinhos. Após uma série de batalhas travadas nos séculos XIII, XIV e XV, mais cantões aderiram à confederação e, no século XVI, a Suíça estava bem estabelecida como um estado separado.","{'answer_start': [503], 'text': ['Suíça']}"
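The index-picking loop in `show_random_elements` draws distinct random positions one at a time. As a side note, the standard library can do the same in one call; the sketch below uses `random.sample`, with a hypothetical helper name `pick_random_indices` and an optional seed for reproducibility, neither of which appears in the notebook.

```python
import random


def pick_random_indices(dataset_length, num_examples=5, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)
    # sample() draws without replacement, so no index appears twice
    return rng.sample(range(dataset_length), num_examples)


picks = pick_random_indices(87510, num_examples=5, seed=0)
print(picks)
```

The resulting list can be used in the same way as `picks` in the function above, for example `dataset[picks]`.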


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiated our tokenizer earlier with the `AutoTokenizer.from_pretrained` method, which ensures that:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Qual o gosto de espumante?")

{'input_ids': [15715, 9, 6618, 4, 8, 6, 2104, 9178, 1854, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we instantiated above); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [None]:
tokenizer(["Olá, é uma frase!", "é uma segunda frase!"])

{'input_ids': [[28, 2647, 3, 21, 17, 5477, 1310, 1], [21, 17, 363, 5477, 1310, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Olá, é uma frase!", "é uma segunda frase!"]))

{'input_ids': [[28, 2647, 3, 21, 17, 5477, 1310, 1], [21, 17, 363, 5477, 1310, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


If you are using one of the five original T5 checkpoints for summarization, you have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform). In our case, however, we fine-tune a PTT5 model on a single new downstream task (QA), so a prefix is not mandatory.

In [15]:
# if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
#     prefix = "summarize: "
# else:
#     prefix = ""

prefix = ""

We can then write the functions that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle (here, the `max_length` we pass) is truncated. When `padding` is not set to `"max_length"`, the padding is dealt with later on (in a data collator), so examples are padded to the longest length in the batch and not in the whole dataset.

In [None]:
def preprocess_squad_batch(examples):
  targets = [answer["text"][0] if len(answer["text"]) > 0 else "" for answer in examples['answers']]
  return examples['input_ids'], targets
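To make the shape of a batch concrete, here is a small sketch that applies the same target-selection rule to a hand-written batch. The two records are invented for illustration, and the helper name `first_answer_or_empty` is hypothetical. The second record has no answer only to show the empty-string fallback.

```python
# Toy batch in the "dict of lists" layout that a batched preprocessing
# function receives. Ids and texts are made up for illustration.
toy_batch = {
    "id": ["ex-0", "ex-1"],
    "input_ids": [
        "question: Quem? context: Ana chegou.",
        "question: Onde? context: Texto sem resposta.",
    ],
    "answers": [
        {"answer_start": [0], "text": ["Ana"]},  # answer_start is not used in this step
        {"answer_start": [], "text": []},
    ],
}

def first_answer_or_empty(examples):
    # Same rule as preprocess_squad_batch: keep the first answer text, or "".
    return [a["text"][0] if len(a["text"]) > 0 else "" for a in examples["answers"]]

toy_targets = first_answer_or_empty(toy_batch)
print(toy_targets)
```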

In [None]:
# train preprocessing
def preprocess_train_function(examples):

    inputs, targets = preprocess_squad_batch(examples)

    # inputs = [prefix + doc for doc in inputs]
    model_inputs = tokenizer(inputs, max_length=max_input_length, padding=padding, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length" and ignore_pad_token_for_loss:
      labels["input_ids"] = [
                             [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
                             ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
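The replacement of padding ids by -100 can be checked on plain lists. The sketch below assumes a pad token id of 0 (the value that pads the tokenized outputs in this notebook). The helper name `mask_padding` and the sample ids are hypothetical. -100 is the default index ignored by PyTorch's cross-entropy loss.

```python
PAD_ID = 0           # assumed pad token id
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_padding(label_batch, pad_id=PAD_ID):
    # Replace every padding id by IGNORE_INDEX, leave real tokens untouched.
    return [
        [(tok if tok != pad_id else IGNORE_INDEX) for tok in label]
        for label in label_batch
    ]

padded_labels = [[28, 2647, 1, 0, 0], [21, 1, 0, 0, 0]]
masked_labels = mask_padding(padded_labels)
print(masked_labels)
```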

In [None]:
# Validation preprocessing
def preprocess_validation_function(examples):
  inputs, targets = preprocess_squad_batch(examples)

  # inputs = [prefix + doc for doc in inputs]
  model_inputs = tokenizer(inputs, max_length=max_seq_length, padding=padding, truncation=True,
                           return_overflowing_tokens=True,
                           return_offsets_mapping=True,)
  
  # Setup the tokenizer for targets
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length=max_target_length, padding=padding, truncation=True)

  # Since one example might give us several features if it has a long context, we need a map from a feature to
  # its corresponding example. This key gives us just that.
  sample_mapping = model_inputs.pop("overflow_to_sample_mapping")

  # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the
  # corresponding example_id and we will store the offset mappings.
  model_inputs["example_id"] = []
  labels_mapping = {}
  labels_mapping['input_ids'] = []
  labels_mapping['attention_mask'] = []

  for i in range(len(model_inputs["input_ids"])):
    # One example can give several spans, this is the index of the example containing this span of text.
    sample_index = sample_mapping[i]
    model_inputs["example_id"].append(examples["id"][sample_index])
    labels_mapping['input_ids'].append(labels['input_ids'][sample_index])
    labels_mapping['attention_mask'].append(labels['attention_mask'][sample_index])

  # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
  # padding in the loss. Always start from labels_mapping so that the labels stay aligned with the features,
  # including the extra features created by overflowing contexts.
  label_ids = labels_mapping["input_ids"]
  if padding == "max_length" and ignore_pad_token_for_loss:
    label_ids = [
                 [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in label_ids
                 ]

  model_inputs["labels"] = label_ids
  return model_inputs
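The role of `overflow_to_sample_mapping` is easier to see on a tiny example. In the sketch below, all values are hypothetical: two examples, the first of which is too long and is split into two features. Each feature inherits the id and the labels of the example it comes from. This is consistent with the 10570 validation examples becoming 10884 features in this notebook.

```python
# Hypothetical mapping: features 0 and 1 come from example 0, feature 2 from example 1.
overflow_to_sample_mapping = [0, 0, 1]
example_ids = ["ex-0", "ex-1"]
label_ids = [[11, 1], [22, 1]]  # one label sequence per example

# One entry per feature, copied from the parent example.
feature_example_ids = [example_ids[s] for s in overflow_to_sample_mapping]
feature_labels = [label_ids[s] for s in overflow_to_sample_mapping]

print(feature_example_ids)
print(feature_labels)
```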

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
# preprocess_train_function(raw_datasets['train'][:2])

{'input_ids': [[13, 6, 1329, 46, 118, 13, 464, 9906, 21, 16, 1897, 458, 1854, 602, 13283, 46, 25, 1654, 6790, 4, 7608, 305, 635, 24, 529, 9906, 63, 7, 383, 4, 3072, 42, 1626, 2050, 3, 1682, 17, 4224, 4, 8494, 18110, 24, 12610, 165, 73, 28, 90, 2656, 891, 11, 464, 9906, 23, 4929, 12, 5412, 18, 7, 72, 4, 18626, 3, 15498, 3, 217, 225, 5, 1371, 1550, 23, 7, 2665, 6209, 58, 136, 39, 18, 9, 3413, 4, 1626, 16007, 18, 227, 290, 914, 46, 1654, 4, 455, 2676, 5261, 3, 961, 3, 655, 17623, 3, 1471, 3, 14465, 350, 24, 12223, 5632, 690, 41, 36, 4, 813, 550, 637, 3, 665, 1635, 11, 651, 3149, 11, 1105, 3, 3437, 2508, 11, 17595, 8, 11321, 3240, 17616, 4, 651, 7251, 11, 20436, 5, 28, 4337, 3219, 11, 464, 9906, 21, 7, 2665, 6209, 33, 38, 6876, 18, 30, 497, 11, 9359, 4695, 165, 3, 227, 10595, 2050, 8, 512, 227, 480, 914, 10, 265, 20, 3861, 1025, 1425, 4, 2723, 4140, 8, 543, 20, 5275, 305, 635, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

To apply these functions to every example, we use the `map` method of the `Dataset` objects loaded earlier. In this notebook, the training and validation splits are mapped separately, because each one has its own preprocessing function.

In [16]:
train_dataset = raw_datasets["train"]
eval_examples = raw_datasets["validation"]

In [None]:
column_names = raw_datasets["train"].column_names

# Create train feature from dataset
with training_args.main_process_first(desc="train dataset map pre-processing"):
  train_dataset = train_dataset.map(
      preprocess_train_function,
      batched=True,
      num_proc=None,
      remove_columns=column_names,
      load_from_cache_file=True,
      desc="Running tokenizer on train dataset",
      )

Running tokenizer on train dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

In [None]:
column_names = raw_datasets["validation"].column_names

with training_args.main_process_first(desc="validation dataset map pre-processing"):
  eval_dataset = eval_examples.map(
      preprocess_validation_function,
      batched=True,
      num_proc=None,
      remove_columns=column_names,
      load_from_cache_file=True,
      desc="Running tokenizer on validation dataset",
      )

Loading cached processed dataset at /root/.cache/huggingface/datasets/parquet/pierreguillou--squad11pt-85bf948064a7a578/0.0.0/1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121/cache-81f682682a0f1da5.arrow


In [None]:
# set format for pytorch
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
eval_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels', 'example_id', 'offset_mapping'])

In [17]:
from datasets import load_from_disk

model_name = model_checkpoint.split("/")[-1]

# save
tokenized_datasets_dir = '/content/drive/MyDrive/' + str(model_name) + '/tokenized_datasets/train/'
train_dataset.save_to_disk(tokenized_datasets_dir)
tokenized_datasets_dir = '/content/drive/MyDrive/' + str(model_name) + '/tokenized_datasets/validation/'
eval_dataset.save_to_disk(tokenized_datasets_dir)

# load
tokenized_datasets_dir = '/content/drive/MyDrive/' + str(model_name) + '/tokenized_datasets/train/'
train_dataset = load_from_disk(tokenized_datasets_dir)
tokenized_datasets_dir = '/content/drive/MyDrive/' + str(model_name) + '/tokenized_datasets/validation/'
eval_dataset = load_from_disk(tokenized_datasets_dir)

In [18]:
train_dataset, eval_dataset

(Dataset({
     features: ['attention_mask', 'input_ids', 'labels'],
     num_rows: 87510
 }), Dataset({
     features: ['attention_mask', 'example_id', 'input_ids', 'labels', 'offset_mapping'],
     num_rows: 10884
 }))

In [19]:
eval_examples

Dataset({
    features: ['id', 'input_ids', 'answers'],
    num_rows: 10570
})

Even better, the results are automatically cached by the 🤗 Datasets library, so this step does not have to be repeated the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data must therefore not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; to force the preprocessing to be applied again, pass `load_from_cache_file=False` in the call to `map`.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
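As a rough mental model of what `batched=True` changes, the sketch below groups rows into chunks and hands each chunk to the function as a dict of lists. `toy_batched_map` is a hypothetical stand-in written for this illustration. It is not the 🤗 Datasets implementation and ignores caching, multiprocessing and column removal.

```python
def toy_batched_map(rows, fn, batch_size=2):
    """Toy stand-in for a batched map; not the real implementation."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Rows -> dict of lists, which is what a batched function receives.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Dict of lists -> rows again.
        n = len(next(iter(result.values())))
        out.extend({key: result[key][i] for key in result} for i in range(n))
    return out

rows = [{"text": "a b"}, {"text": "c"}, {"text": "d e f"}]
mapped = toy_batched_map(
    rows, lambda b: {"n_words": [len(t.split()) for t in b["text"]]}
)
print(mapped)
```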

## Fine-tuning the model

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [31]:
# Data collator
label_pad_token_id = -100 if ignore_pad_token_for_loss else tokenizer.pad_token_id
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8 if training_args.fp16 else None,
    )
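The padding performed by the collator can be pictured with plain lists. The following is a simplified sketch, not the `DataCollatorForSeq2Seq` code: the helper `pad_batch` and the token ids are hypothetical, and it only shows right-padding to the longest sequence, optionally rounded up to a multiple (as `pad_to_multiple_of=8` requests when fp16 is enabled).

```python
import math

def pad_batch(sequences, pad_value, pad_to_multiple_of=None):
    # Target length: longest sequence, optionally rounded up to a multiple.
    target = max(len(s) for s in sequences)
    if pad_to_multiple_of:
        target = math.ceil(target / pad_to_multiple_of) * pad_to_multiple_of
    return [s + [pad_value] * (target - len(s)) for s in sequences]

# Inputs are padded with the pad token id (assumed to be 0 here),
# labels with -100 so that padding is ignored by the loss.
input_batch = pad_batch([[28, 2647, 3, 1], [21, 1]], pad_value=0)
label_batch = pad_batch([[5477, 1], [363, 1310, 1]], pad_value=-100, pad_to_multiple_of=8)

print(input_batch)
print(label_batch)
```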

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which simply calls the `metric` loaded in the same cell. The predictions must first be decoded into text, which the post-processing function further down takes care of:

In [21]:
metric = load_metric("squad_v2" if version_2_with_negative else "squad")

def compute_metrics(p):
  return metric.compute(predictions=p.predictions, references=p.label_ids)

Downloading:   0%|          | 0.00/1.73k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.12k [00:00<?, ?B/s]
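To give an intuition for the two scores, here is a minimal sketch of Exact Match and token-level F1 in the spirit of the SQuAD v1.1 evaluation (lowercasing, punctuation and English article removal, then token overlap). It is an illustration written for this notebook, not the code of the `squad` metric, and the function names are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, truth):
    return float(normalize(pred) == normalize(truth))

def f1(pred, truth):
    p, t = normalize(pred).split(), normalize(truth).split()
    common = sum((Counter(p) & Counter(t)).values())  # overlapping tokens
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

truth = "1 de dezembro de 2019"
print(exact_match("1 de dezembro de 2019", truth), f1("1 de dezembro de 2019", truth))
print(exact_match("dezembro de 2019", truth), f1("dezembro de 2019", truth))
```

A partially correct answer gets no Exact Match credit but a non-zero F1, which is why F1 is consistently higher than Exact Match in the training logs.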

In [22]:
metric

Metric(name: "squad", features: {'predictions': {'id': Value(dtype='string', id=None), 'prediction_text': Value(dtype='string', id=None)}, 'references': {'id': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}}, usage: """
Computes SQuAD scores (F1 and EM).
Args:
    predictions: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair (see above),
        - 'answers': a Dict in the SQuAD dataset format
            {
                'text': list of possible texts for the answer, as a list of strings
                'answer_start': list of start positions for the answer, as a list of ints
   

In [23]:
# Post-processing:
def post_processing_function(examples, features, outputs, stage="eval"):
  # Decode the predicted tokens.
  preds = outputs.predictions
  if isinstance(preds, tuple):
    preds = preds[0]
  
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

  # Build a map example to its corresponding features.
  example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
  # print('example_id_to_index:',example_id_to_index)
  # print('features:',features)
  feature_per_example = {example_id_to_index[feature["example_id"]]: i for i, feature in enumerate(features)}
  predictions = {}
  # Let's loop over all the examples!
  for example_index, example in enumerate(examples):
    # This is the index of the feature associated to the current example.
    feature_index = feature_per_example[example_index]
    predictions[example["id"]] = decoded_preds[feature_index]

  # Format the result to the format the metric expects.
  if version_2_with_negative:
    formatted_predictions = [
                             {"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in predictions.items()
                             ]
  else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()]
  
  references = [{"id": ex["id"], "answers": ex[answer_column]} for ex in examples]

  return EvalPrediction(predictions=formatted_predictions, label_ids=references)
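One detail of the function above is worth spelling out: `feature_per_example` is a dict, so when an example has several features, a later feature overwrites an earlier one and only its prediction is kept. The sketch below reproduces just that mapping logic on hypothetical ids and decoded strings.

```python
# Hypothetical data: "ex-0" was split into two features, "ex-1" into one.
example_ids = ["ex-0", "ex-1"]
feature_example_ids = ["ex-0", "ex-0", "ex-1"]
decoded_preds = ["resposta A1", "resposta A2", "resposta B"]

example_id_to_index = {k: i for i, k in enumerate(example_ids)}
# A later feature of the same example overwrites the earlier entry.
feature_per_example = {
    example_id_to_index[fid]: i for i, fid in enumerate(feature_example_ids)
}
predictions = {
    ex_id: decoded_preds[feature_per_example[idx]]
    for idx, ex_id in enumerate(example_ids)
}
print(feature_per_example)
print(predictions)
```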

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [36]:
from transformers.trainer_callback import EarlyStoppingCallback

early_stopping_patience = save_total_limit

# Initialize our Trainer
trainer = QuestionAnsweringSeq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if do_train else None, #.shard(num_shards=400, index=0)
    eval_dataset=eval_dataset if do_eval else None, #.shard(num_shards=400, index=0)
    eval_examples=eval_examples if do_eval else None, #.shard(num_shards=400, index=0)
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    post_process_function=post_processing_function,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)],
    )

Using amp half precision backend
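The effect of the early-stopping callback can be sketched as a simple counter of evaluations without improvement. The code below is an illustration, not the `EarlyStoppingCallback` implementation, and it rests on two assumptions that are not shown in this section: that the monitored metric is F1, and that `save_total_limit` (and therefore the patience) is 3. The F1 values are taken from the evaluation table logged during training in this notebook.

```python
def early_stop_step(evals, patience, best=None):
    """Return the step at which training would stop, or None (illustrative)."""
    misses = 0
    for step, value in evals:
        if best is None or value > best:
            best, misses = value, 0   # improvement: reset the counter
        else:
            misses += 1               # no improvement at this evaluation
            if misses >= patience:
                return step
    return None

# F1 after resuming from checkpoint-21000; best F1 so far was 79.287899 (step 18000).
f1_log = [
    (24000, 78.929923),
    (27000, 79.333612),
    (30000, 79.236574),
    (33000, 78.964212),
    (36000, 79.086125),
]
stop = early_stop_step(f1_log, patience=3, best=79.287899)
print(stop)
```

Under these assumptions the sketch stops at step 36000, which is where the training runs in this notebook end.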


We can now fine-tune our model by just calling the `train` method:

In [None]:
trainer.train()

***** Running training *****
  Num examples = 87510
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 12
  Gradient Accumulation steps = 3
  Total optimization steps = 72920


Step,Training Loss,Validation Loss,Exact Match,F1
3000,0.7761,No log,61.807001,75.114517
6000,0.5459,No log,65.26017,77.46893
9000,0.4605,No log,66.556291,78.491938
12000,0.3934,No log,66.821192,78.745397


The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: offset_mapping, example_id.
***** Running Evaluation *****
  Num examples = 10884
  Batch size = 64
Saving model checkpoint to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-3000
Configuration saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-3000/config.json
Model weights saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-3000/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-3000/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-3000/special_tokens_map.json
Copy vocab file to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.000
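The figure `Total optimization steps = 72920` in the log above can be recovered from the other logged values. The arithmetic below is a sketch that assumes a single device and that an incomplete group of accumulated batches at the end of an epoch is not counted as a step. The variable names are chosen for this illustration.

```python
import math

num_examples = 87510
per_device_batch = 4
grad_accum = 3
num_epochs = 10

# Mini-batches seen in one epoch (the last one may be smaller).
batches_per_epoch = math.ceil(num_examples / per_device_batch)
# One optimizer update every `grad_accum` mini-batches.
updates_per_epoch = batches_per_epoch // grad_accum
total_steps = updates_per_epoch * num_epochs

print(batches_per_epoch, updates_per_epoch, total_steps)
```

With these assumptions, the effective batch size is 4 × 3 = 12 and the total matches the log.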

In [None]:
dir_checkpoint = str('/content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-12000')
trainer.train(dir_checkpoint)

Loading model from /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-12000).
***** Running training *****
  Num examples = 87510
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 12
  Gradient Accumulation steps = 3
  Total optimization steps = 72920
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 1
  Continuing training from global step 12000
  Will skip the first 1 epochs then the first 14124 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.


  0%|          | 0/14124 [00:00<?, ?it/s]

Step,Training Loss,Validation Loss,Exact Match,F1
15000,0.3798,No log,66.603595,78.815515
18000,0.2981,No log,67.578051,79.287899


The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: example_id, offset_mapping.
***** Running Evaluation *****
  Num examples = 10884
  Batch size = 64
Saving model checkpoint to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-15000
Configuration saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-15000/config.json
Model weights saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-15000/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-15000/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-15000/special_tokens_map.json
Copy vocab file to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr

Step,Training Loss,Validation Loss,Exact Match,F1
15000,0.3798,No log,66.603595,78.815515
18000,0.2981,No log,67.578051,79.287899
21000,0.3031,No log,66.991485,78.979669


The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: example_id, offset_mapping.
***** Running Evaluation *****
  Num examples = 10884
  Batch size = 64
Saving model checkpoint to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-21000
Configuration saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-21000/config.json
Model weights saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-21000/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-21000/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-21000/special_tokens_map.json
Copy vocab file to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr

In [None]:
dir_checkpoint = str('/content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-21000')
trainer.train(dir_checkpoint)

Loading model from /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-21000).
***** Running training *****
  Num examples = 87510
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 12
  Gradient Accumulation steps = 3
  Total optimization steps = 72920
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 2
  Continuing training from global step 21000
  Will skip the first 2 epochs then the first 19248 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.


  0%|          | 0/19248 [00:00<?, ?it/s]

Step,Training Loss,Validation Loss,Exact Match,F1
24000,0.2516,No log,67.275307,78.929923
27000,0.2375,No log,66.972564,79.333612
30000,0.2205,No log,66.915799,79.236574
33000,0.1826,No log,67.029328,78.964212
36000,0.1906,No log,66.982025,79.086125


The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: example_id, offset_mapping.
***** Running Evaluation *****
  Num examples = 10884
  Batch size = 64
Saving model checkpoint to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-24000
Configuration saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-24000/config.json
Model weights saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-24000/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-24000/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-24000/special_tokens_map.json
Copy vocab file to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr

TrainOutput(global_step=36000, training_loss=0.09023599921332465, metrics={'train_runtime': 12905.7852, 'train_samples_per_second': 67.807, 'train_steps_per_second': 5.65, 'total_flos': 1.9731331778347008e+17, 'train_loss': 0.09023599921332465, 'epoch': 4.94})

In [None]:
# save_steps = 3
# steps = 3000
dir_checkpoint = str('/content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-27000')
trainer.train(dir_checkpoint)

Loading model from /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-27000).
***** Running training *****
  Num examples = 87510
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 12
  Gradient Accumulation steps = 3
  Total optimization steps = 72920
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 3
  Continuing training from global step 27000
  Will skip the first 3 epochs then the first 15372 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.


  0%|          | 0/15372 [00:00<?, ?it/s]

Step,Training Loss,Validation Loss,Exact Match,F1
30000,0.2205,No log,66.915799,79.236574
33000,0.1826,No log,67.029328,78.964212
36000,0.1906,No log,66.982025,79.086125


The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: example_id, offset_mapping.
***** Running Evaluation *****
  Num examples = 10884
  Batch size = 64
Saving model checkpoint to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-30000
Configuration saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-30000/config.json
Model weights saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-30000/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-30000/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr0.0001/checkpoint-30000/special_tokens_map.json
Copy vocab file to /content/drive/MyDrive/ptt5-base-portuguese-vocab/checkpoints/e10_lr

TrainOutput(global_step=36000, training_loss=0.04948082817925347, metrics={'train_runtime': 8389.8446, 'train_samples_per_second': 104.305, 'train_steps_per_second': 8.691, 'total_flos': 1.9731331778347008e+17, 'train_loss': 0.04948082817925347, 'epoch': 4.94})
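The "Will skip the first N epochs then the first M batches" messages in the resumed runs above follow from the checkpoint step. The sketch below assumes 7292 optimizer updates per epoch (87510 examples, batch size 4, 3 accumulation steps, single device). The helper `resume_position` is hypothetical.

```python
import math

grad_accum = 3
updates_per_epoch = math.ceil(87510 / 4) // grad_accum  # 7292 under the stated assumptions

def resume_position(global_step):
    # Completed epochs, then mini-batches to skip inside the current epoch.
    epochs_done = global_step // updates_per_epoch
    batches_to_skip = (global_step % updates_per_epoch) * grad_accum
    return epochs_done, batches_to_skip

for step in (12000, 21000, 27000):
    print(step, resume_position(step))

# Fraction of epochs completed when training stopped at step 36000.
print(round(36000 / updates_per_epoch, 2))
```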

## Evaluation of the model

In [37]:
max_length=32
num_beams=1
early_stopping=True

### Just one QA

In [39]:
input_text  = 'question: Quando foi descoberta a Covid-19? context: A pandemia de COVID-19, também conhecida como pandemia de coronavírus, é uma pandemia em curso de COVID-19, uma doença respiratória aguda causada pelo coronavírus da síndrome respiratória aguda grave 2 (SARS-CoV-2). A doença foi identificada pela primeira vez em Wuhan, na província de Hubei, República Popular da China, em 1 de dezembro de 2019, mas o primeiro caso foi reportado em 31 de dezembro do mesmo ano.'
label = '1 de dezembro de 2019'

inputs = trainer.tokenizer(input_text, return_tensors="pt").to('cuda') 

outputs = trainer.model.generate(inputs["input_ids"], 
                             max_length=max_target_length, 
                             num_beams=num_beams, 
                             early_stopping=early_stopping
                            )
print('true answer |',label)
print('pred        |',tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))

true answer | 1 de dezembro de 2019
pred        | 1 de dezembro de 2019
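The input string used above follows the `question: ... context: ...` layout of the dataset rows. A small helper can build it from separate fields. `build_qa_input` is a hypothetical name introduced for this sketch, and stripping surrounding whitespace is a choice made here, not something the notebook prescribes.

```python
def build_qa_input(question, context):
    # Same layout as the dataset rows: "question: <q> context: <c>".
    return f"question: {question.strip()} context: {context.strip()}"

example_input = build_qa_input(
    "Quando foi descoberta a Covid-19?",
    "A doença foi identificada pela primeira vez em Wuhan, em 1 de dezembro de 2019.",
)
print(example_input)
```

The resulting string can then be passed to the tokenizer in place of `input_text`.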


### Whole validation set

In [40]:
results = {}
max_length = (generation_max_length if generation_max_length is not None else val_max_answer_length)
num_beams = num_beams if num_beams is not None else generation_num_beams

if do_eval:
  print("*** Evaluate ***")
  metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval")
  max_eval_samples = max_eval_samples if max_eval_samples is not None else len(eval_dataset)
  metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
  
  trainer.log_metrics("eval", metrics)

  eval_dir = '/content/drive/MyDrive/' + str(model_name) + '/eval_metrics/' + folder_model
  Path(eval_dir).mkdir(parents=True, exist_ok=True)    # exist_ok requires Python 3.5+
  trainer.save_metrics(eval_dir, metrics)

The following columns in the evaluation set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: offset_mapping, example_id.
***** Running Evaluation *****
  Num examples = 10884
  Batch size = 64


*** Evaluate ***


***** eval metrics *****
  eval_exact_match = 67.3983
  eval_f1          = 79.4177
  eval_samples     =   10884


In [41]:
metrics

{'eval_exact_match': 67.39829706717124,
 'eval_f1': 79.41769695100494,
 'eval_samples': 10884}

## Save the model locally

In [49]:
model_name = model_checkpoint.split("/")[-1]
model_dir = '/content/drive/MyDrive/' + str(model_name) + '/models/' + folder_model
trainer.save_model(model_dir)

Saving model checkpoint to /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001
Configuration saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/config.json
Model weights saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/special_tokens_map.json
Copy vocab file to /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/spiece.model


## Push the model to the Hugging Face model hub

In [None]:
model_name_hf = 'pierreguillou/t5-base-qa-squad-v1.1-portuguese'

### Method 1

In [None]:
# trainer.push_to_hub(model_name_hf)

### Method 2

In [53]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

Didn't find file /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/added_tokens.json. We won't load it.
loading file /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/spiece.model
loading file /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/tokenizer.json
loading file None
loading file /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/special_tokens_map.json
loading file /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/tokenizer_config.json
loading configuration file /content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001/config.json
Model config T5Config {
  "_name_or_path": "/content/drive/MyDrive/ptt5-base-portuguese-vocab/models/e10_lr0.0001",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer

1. [Creating a repository in the HF model hub](https://huggingface.co/docs/hub/adding-a-model#creating-a-repository)

2. [Clone your model repository](https://huggingface.co/docs/hub/adding-a-model#uploading-your-files)

In [None]:
# source: https://github.com/huggingface/transformers/issues/12572
from huggingface_hub import HfFolder
import os
os.environ['HF_AUTH'] = HfFolder().get_token()

In [73]:
%cd /content

/content


In [74]:
# Clone the repo with authentication
!git clone https://user:$HF_AUTH@huggingface.co/{model_name_hf}

Cloning into 't5-base-qa-squad-v1.1-portuguese'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.


3. Add your files to the repository

In [75]:
!cp {model_dir}/* /content/t5-base-qa-squad-v1.1-portuguese

4. Commit and push your files

In [76]:
%cd /content/t5-base-qa-squad-v1.1-portuguese

/content/t5-base-qa-squad-v1.1-portuguese


In [77]:
!git add .
!git commit -m "First model version"
!git push

[main b691002] First model version
 7 files changed, 40 insertions(+)
 create mode 100644 config.json
 create mode 100644 pytorch_model.bin
 create mode 100644 special_tokens_map.json
 create mode 100644 spiece.model
 create mode 100644 tokenizer.json
 create mode 100644 tokenizer_config.json
 create mode 100644 training_args.bin
Git LFS: (3 of 3 files) 851.14 MB / 851.14 MB
Counting objects: 9, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 476.88 KiB | 6.04 MiB/s, done.
Total 9 (delta 1), reused 0 (delta 0)
To https://huggingface.co/pierreguillou/t5-base-qa-squad-v1.1-portuguese
   07aaf23..b691002  main -> main


# END