# Fine-tuning an MLM (Masked Language Model) like BERT (base or large) with the library adapter-transformers (notebook version)

- **Credit**: [Hugging Face](https://huggingface.co/) and [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)
- **Author**: [Pierre GUILLOU](https://www.linkedin.com/in/pierreguillou/)
- **Date**: edited on 07/14/2021 (version 1.0: 07/05/2021)
- **Blog post**: [NLP nas empresas | Como ajustar um modelo de linguagem natural como BERT a um novo domínio linguístico com um Adapter?](https://medium.com/@pierre_guillou/nlp-nas-empresas-como-ajustar-um-modelo-de-linguagem-natural-como-bert-a-um-novo-dom%C3%ADnio-23752b73b185)
- **Link to the folder in github with this notebook and all necessary scripts**: [language-modeling with adapters](https://github.com/piegu/language-models/tree/master/adapters/language-modeling/)

## 1. Context

### Objective

The objective here is to **fine-tune a Masked Language Model (MLM) like BERT (base or large) by training adapters (library [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)), not the embeddings and transformer layers of the MLM model**, and to compare the results with a fully fine-tuned BERT model on the same task.

The interest is obvious: if you need models for different NLP tasks, instead of fine-tuning and storing one model per NLP task, **you store only one MLM model and the trained task adapters, whose sizes are between 6% and 13% of the size of the MLM model** (depending on the chosen adapter configuration). Moreover, loading these adapters in production is very easy.

### Content

In this notebook, we'll see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a language modeling task. We will cover one type of language modeling task:

- Masked language modeling: the model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the tokens masked to predict their value.

![Widget inference representing the masked language modeling task](images/masked_language_modeling_adapter.png)
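The masking step can be pictured with a toy, library-free sketch. It assumes the standard BERT recipe (a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time) and uses `-100` as the label of non-selected positions, which is the value ignored by default by PyTorch's cross-entropy loss. The function name `mask_tokens` and the whitespace tokenization are illustrative only. The masking probability is raised to 0.5 in the call so that the effect is visible on a short sentence.

```python
import random

MASK_TOKEN = "[MASK]"
IGNORE_INDEX = -100  # label value ignored by default by PyTorch's cross-entropy loss

def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=None):
    """Return (masked_tokens, labels) following the 80/10/10 BERT recipe (toy version)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mlm_probability:
            labels.append(token)  # the model has to recover the original token here
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace by [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace by a random token
            else:
                masked.append(token)              # 10%: keep the token unchanged
        else:
            masked.append(token)
            labels.append(IGNORE_INDEX)           # not selected: no loss on this position
    return masked, labels

tokens = "the model predicts the tokens that are masked".split()
masked, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), mlm_probability=0.5, seed=0)
for original, seen, label in zip(tokens, masked, labels):
    print(f"{original:10} -> {seen:10} label={label}")
```

Only the positions whose label differs from `-100` contribute to the loss, which is why the model never has to predict the non-masked tokens.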

We will see how to easily load and preprocess the dataset for this task, and how to use the `Trainer` API to fine-tune a model on it.

### History and Credit

This notebook is an adaptation of the following notebooks and scripts for **fine-tuning a (transformer) Masked Language Model (MLM) like BERT (base or large) with any dataset** (we use here the texts of the [Portuguese Squad 1.1 dataset](https://forum.ailab.unb.br/t/datasets-em-portugues/251/4)):
- **from [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)** | notebook [01_Adapter_Training.ipynb](https://github.com/Adapter-Hub/adapter-transformers/blob/master/notebooks/01_Adapter_Training.ipynb) and script [run_mlm.py](https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/language-modeling/run_mlm.py) (this script was adapted from the script [run_mlm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) of HF)
- **from [transformers](https://github.com/huggingface/transformers) of Hugging Face** | notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) and script [run_mlm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) 

To speed up the fine-tuning of the model on a single GPU, the library [DeepSpeed](https://www.deepspeed.ai/) could be used with the configuration provided by HF in the notebook [transformers + deepspeed CLI](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb). However, as the library adapter-transformers is not synchronized with the latest version of the HF transformers library, we keep that option for the future.

*Note: the paragraph about Causal Language Modeling (CLM) is not included in this notebook, and all the code not needed for Masked Language Modeling (MLM) has been deleted from the original notebook.*

### Major changes from original notebooks and scripts

The notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) and script [run_mlm.py](https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/language-modeling/run_mlm.py) evaluate the model performance through the validation loss at the end of each epoch, not through the accuracy metric.

As a metric is a better criterion than the loss for selecting a model, we introduced in this notebook the accuracy metric for model evaluation (see the method `compute_metrics()`). However, as its computation needs many GB of memory (and time!) depending on the evaluation dataset size, we do not use it during the training (we keep the loss as evaluation metric) but only before and after.
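As an illustration of what such an accuracy measures, here is a minimal, library-free sketch. It is not the notebook's `compute_metrics()` implementation: it assumes that the predicted token ids were already obtained (with the `Trainer`, they would typically come from an argmax over the vocabulary logits, which is what makes the evaluation memory-hungry) and that non-masked positions carry the label `-100`. The name `masked_accuracy` is hypothetical.

```python
IGNORE_INDEX = -100  # label of the positions that were not masked

def masked_accuracy(predicted_ids, label_ids):
    """Share of masked positions where the predicted token id equals the label."""
    correct = total = 0
    for pred_row, label_row in zip(predicted_ids, label_ids):
        for pred, label in zip(pred_row, label_row):
            if label == IGNORE_INDEX:
                continue  # only masked positions are scored
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0

# Two toy sequences with three masked positions in total, two of them predicted correctly.
predicted_ids = [[5, 8, 2, 9], [7, 7, 3, 1]]
label_ids = [[-100, 8, -100, 4], [-100, -100, 3, -100]]
print(masked_accuracy(predicted_ids, label_ids))
```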

Thus, we updated the notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)  to [language_modeling_adapter.ipynb](https://github.com/piegu/language-models/blob/master/adapters/language_modeling/language_modeling_adapter.ipynb) with the following changes:
- **Accuracy**: model evaluation through the eval accuracy
- **EarlyStopping**: the training stops when the metric used to select the best model (`metric_for_best_model`: the eval loss in this notebook, the eval accuracy is possible too) has not improved for `early_stopping_patience` evaluations (10 in this notebook)
- **MAD-X 2.0**: option that leaves the last transformer layer without adapter for the Pfeiffer configuration (read page 6 of [UNKs Everywhere: Adapting Multilingual Language Models to New Scripts](https://arxiv.org/pdf/2012.15562.pdf))
- **Houlsby MHA last layer**: option that, in the last layer of the Houlsby configuration, keeps the adapter after the MHA (Multi-Head Attention) but not the one after the Feed Forward
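The last two options can be pictured with a small sketch that lists, for each transformer layer, the sub-layers followed by an adapter. It relies on the usual description of the two configurations (Pfeiffer: one adapter after the Feed Forward; Houlsby: one after the MHA and one after the Feed Forward). The helper `adapter_layout` and the labels `"mha"` and `"ff"` are illustrative and not part of adapter-transformers. As in this notebook's settings, `madx2` takes precedence over the Houlsby option.

```python
def adapter_layout(num_layers, config_name, madx2=False, houlsby_mha_lastlayer=False):
    """Map each layer index to the sub-layers ("mha", "ff") that are followed by an adapter."""
    base = config_name.replace("+inv", "")
    per_layer = ["ff"] if base == "pfeiffer" else ["mha", "ff"]  # houlsby: both sub-layers
    layout = {i: list(per_layer) for i in range(num_layers)}
    last = num_layers - 1
    if madx2:
        layout[last] = []        # MAD-X 2.0: no adapter in the last layer
    elif houlsby_mha_lastlayer and base == "houlsby":
        layout[last] = ["mha"]   # keep only the adapter after the MHA
    return layout

print(adapter_layout(12, "pfeiffer+inv", madx2=True)[11])
print(adapter_layout(12, "houlsby", houlsby_mha_lastlayer=True)[11])
```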

## 2. Installation

In [1]:
import pathlib
from pathlib import Path

#root path
root = Path.cwd()

In [2]:
import pickle
import pandas as pd
import numpy as np
import random
import math

In [4]:
import sys; print('python:',sys.version)

import torch; print('Pytorch:',torch.__version__)

import transformers; print('adapter-transformers:',transformers.__version__)
import transformers; print('HF transformers:',transformers.__hf_version__)
import tokenizers; print('tokenizers:',tokenizers.__version__)
import datasets; print('datasets:',datasets.__version__)

# import deepspeed; print('deepspeed:',deepspeed.__version__)

# Versions used in the virtual environment of this notebook:

# python: 3.8.10 (default, Jun  4 2021, 15:09:15) 
# [GCC 7.5.0]
# Pytorch: 1.9.0
# adapter-transformers: 2.1.1
# HF transformers: 4.8.2
# tokenizers: 0.10.3
# datasets: 1.9.0

python: 3.8.10 (default, Jun  4 2021, 15:09:15) 
[GCC 7.5.0]
Pytorch: 1.9.0
adapter-transformers: 2.1.1
HF transformers: 4.8.2
tokenizers: 0.10.3
datasets: 1.9.0


## 3. Model & dataset

In [5]:
# Select a MLM BERT base or large in the dataset language
model_checkpoint = "neuralmind/bert-base-portuguese-cased"
# model_checkpoint = "neuralmind/bert-large-portuguese-cased"

# SQuAD 1.1 in Portuguese
dataset_name = "squad11pt" # SQuAD v1.1 in Portuguese

## 4. Main hyperparameters

In [6]:
task = "mlm"

In [7]:
# training arguments
batch_size = 16
gradient_accumulation_steps = 1

learning_rate = 1e-4
num_train_epochs = 100.
early_stopping_patience = 10

adam_epsilon = 1e-6

fp16 = True
ds = False # DeepSpeed

# best model
load_best_model_at_end = True
# defined unconditionally because the name of the outputs folder uses it
metric_for_best_model = "loss" # could be "accuracy", too
greater_is_better = metric_for_best_model != "loss" # False for the loss, True for accuracy

# (option) number of evaluation steps accumulated on the GPU before moving the results to the CPU
eval_accumulation_steps = 5 # min = 1 (the value that uses the least GPU RAM during evaluation)
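To make the role of `early_stopping_patience` concrete, here is a simplified sketch of patience-based early stopping on the evaluation loss. It is an illustration of the principle, not the code of the HF early stopping callback. The function name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(eval_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:  # lower is better for the loss (greater_is_better = False)
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical evaluation losses, one per epoch.
losses = [2.10, 1.95, 1.90, 1.91, 1.92, 1.90, 1.93]
print(early_stopping_epoch(losses, patience=3))   # stops at epoch 6
print(early_stopping_epoch(losses, patience=10))  # None: the patience is never exhausted
```

With `load_best_model_at_end = True`, the model kept at the end is the one of the best evaluation (epoch 3 in this toy series), not the one of the last epoch.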

In [8]:
# train adapter
train_adapter = True # we want to train an adapter
load_adapter = None # we do not load an existing adapter
load_lang_adapter = None # we do not load an existing lang adapter

# if True, do not put adapter in the last transformer layer (Pfeiffer configuration)
madx2 = True

# if True, put only an adapter after the MHA but not after the FF in the last layer (Houlsby configuration)
houlsby_MHA_lastlayer = True
if madx2:
    houlsby_MHA_lastlayer = False

## 5. Configuration

### GPU

In [9]:
# gpu
n_gpu = 1 # train on just one GPU
gpu = 0 # select the GPU

In [10]:
# Run this notebook in GPU 0
# As we do not launch a python script in this notebook, this cell is not mandatory
import os
os.environ['MASTER_ADDR'] = 'localhost'
if gpu == 0:
    os.environ['MASTER_PORT'] = '9996' # modify if RuntimeError: Address already in use # GPU 0
elif gpu == 1:
    os.environ['MASTER_PORT'] = '9995'
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = str(gpu)
os.environ['WORLD_SIZE'] = "1"

### Lang adapter config

In [11]:
# lang adapter config
adapter_config_name = "pfeiffer+inv" # houlsby+inv is possible, too
if adapter_config_name == "pfeiffer" or adapter_config_name == "pfeiffer+inv":
    adapter_non_linearity = 'gelu' # relu is possible, too
elif adapter_config_name == "houlsby" or adapter_config_name == "houlsby+inv":
    adapter_non_linearity = 'swish'
adapter_reduction_factor = 2
language = 'pt' # pt = Portuguese

### Training arguments of the HF trainer

In [12]:
# setup the training argument
do_train = True 
do_eval = True 

# epochs, bs, GA
evaluation_strategy = "epoch" # no

# fp16
fp16_opt_level = 'O1'
fp16_backend = "auto"
fp16_full_eval = False

# optimizer (AdamW)
weight_decay = 0.01 # 0.0
adam_beta1 = 0.9
adam_beta2 = 0.999

# scheduler
lr_scheduler_type = 'linear'
warmup_ratio = 0.0
warmup_steps = 0

# logs
logging_strategy = "steps"
logging_first_step = True # False
logging_steps = 500     # if strategy = "steps"
eval_steps = logging_steps # logging_steps

# checkpoints
save_strategy = "epoch" # steps
save_steps = 500 # if save_strategy = "steps"
save_total_limit = 1 # None

# no cuda, seed
no_cuda = False
seed = 42

# bar
disable_tqdm = False # True
remove_unused_columns = True
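With `lr_scheduler_type = 'linear'` and no warmup, the learning rate decreases linearly from `learning_rate` to 0 over the training steps. The sketch below reproduces this rule in plain Python. The function name `linear_schedule_lr` and the value of `total_steps` are assumptions for the example, since the real number of steps depends on the dataset size, the batch size and the number of epochs.

```python
def linear_schedule_lr(step, total_steps, base_lr=1e-4, warmup_steps=0):
    """Learning rate at `step`: linear warmup (if any), then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000  # hypothetical value
for step in (0, 250, 500, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```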

In [13]:
# folder for training outputs

outputs = model_checkpoint.replace('/','-') + '_' + dataset_name + '/' + str(task) + '/'
outputs = outputs \
+ 'lr' + str(learning_rate) \
+ '_bs' + str(batch_size) \
+ '_GAS' + str(gradient_accumulation_steps) \
+ '_eps' + str(adam_epsilon) \
+ '_epochs' + str(num_train_epochs) \
+ '_patience' + str(early_stopping_patience) \
+ '_madx2' + str(madx2) \
+ '_houlsby_MHA_lastlayer' + str(houlsby_MHA_lastlayer) \
+ '_ds' + str(ds) \
+ '_fp16' + str(fp16) \
+ '_best' + str(load_best_model_at_end) \
+ '_metric' + str(metric_for_best_model) \
+ '_adapterconfig' + str(adapter_config_name)

# path to outputs
path_to_outputs = root/'outputs'/outputs

# subfolder for model outputs
output_dir = path_to_outputs/'output_dir' 
overwrite_output_dir = True # False

# logs
logging_dir = path_to_outputs/'logging_dir'

## 6. Preparing the dataset

### Preparation

In [14]:
# if dataset_name == "squad11pt":
    
#     # create dataset folder 
#     path_to_dataset = root/'data'/dataset_name
#     path_to_dataset.mkdir(parents=True, exist_ok=True) 

#     # Get dataset SQUAD in Portuguese
#     %cd {path_to_dataset}
#     !wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn" -O squad-pt.tar.gz && rm -rf /tmp/cookies.txt

#     # unzip 
#     !tar -xvf squad-pt.tar.gz

#     # Get the train and validation json file in the HF script format 
#     # inspiration: file squad.py at https://github.com/huggingface/datasets/tree/master/datasets/squad
    
#     import json 
#     files = ['squad-train-v1.1.json','squad-dev-v1.1.json']

#     for file in files:

#         # Opening JSON file & returns JSON object as a dictionary 
#         f = open(file, encoding="utf-8") 
#         data = json.load(f) 

#         # Iterating through the json list 
#         context_list = list()
#         id_list = list()

#         for row in data['data']: 

#             for paragraph in row['paragraphs']:
#                 context = (paragraph['context']).strip()
#                 context_list.append(context)

#         # Get unique context
#         unique_context_list = list(set(context_list))

#         # Closing file 
#         f.close() 

#         file_name = 'pt_' + str(file).replace('json','txt')
#         with open(file_name, 'wb') as list_file:
#             pickle.dump(unique_context_list, list_file)
         
#     %cd ../..

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. Just uncomment the following cell and replace the paths with values that will lead to your files:

### Loading

You can also load datasets from a csv or a JSON file, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

In [15]:
# if dataset_name == "squad11pt":
    
#     path_to_data = root/'data'/dataset_name
#     files = ['pt_squad-train-v1.1.txt','pt_squad-dev-v1.1.txt']
    
#     for i,file in enumerate(files):
#         path_to_file = path_to_data/file
#         with open(path_to_file, "rb") as f:   # Unpickling
#             text_list = pickle.load(f)

#             with open(file, "w") as output:
#                 output.write(str(text_list))
        
#         df = pd.DataFrame(text_list,columns=['text'])
#         if i == 0:
#             df_train = df.copy()
#         else:
#             df_validation = df.copy()
            
#     from datasets import Dataset, DatasetDict
#     dataset_train = Dataset.from_pandas(df_train)
#     dataset_validation = Dataset.from_pandas(df_validation)

#     datasets = DatasetDict()
#     datasets['train'] = dataset_train
#     datasets['validation'] = dataset_validation

### Save/Load 

It is useful to save the datasets in order to reload them for each new training.

In [16]:
# save 
# path_to_datasets = root/'data'/dataset_name
# datasets.save_to_disk(path_to_datasets)

In [17]:
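# load
# (explicit import: the variable `datasets` below takes the same name as the library module)
from datasets import load_from_disk

path_to_datasets = root/'data'/dataset_name
datasets = load_from_disk(str(path_to_datasets))
datasets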

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 36815
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 5644
    })
})

To access an actual element, you need to select a split first, then give an index:

In [18]:
datasets["train"][10]

{'text': 'O panteísmo sustenta que Deus é o universo e o universo é Deus, enquanto o panenteísmo sustenta que Deus contém, mas não é idêntico ao universo. É também a visão da Igreja Católica Liberal; Teosofia; algumas visões do hinduísmo, exceto o vaisnavismo, que acredita no panenteísmo; Sikhismo; algumas divisões do neopaganismo e taoísmo, juntamente com muitas denominações e indivíduos variados dentro das denominações. A Cabala, Misticismo judaico, pinta uma visão panteísta / panenteísta de Deus - que tem ampla aceitação no judaísmo hassídico, particularmente de seu fundador The Baal Shem Tov - mas apenas como um complemento à visão judaica de um deus pessoal, não no panteísta original sensação que nega ou limita a persona a Deus. [citação necessário]'}

### Display

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [19]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [20]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"De acordo com as regras da FIG, apenas as mulheres competem na ginástica rítmica. É um esporte que combina elementos de manipulação de balé, ginástica, dança e aparelhos. O esporte envolve a realização de cinco rotinas separadas com o uso de cinco aparelhos; bola, fita, aro, tacos, corda - na área do piso, com uma ênfase muito maior na estética do que na acrobática. Existem também rotinas de grupo que consistem em 5 ginastas e 5 aparelhos de sua escolha. Rotinas rítmicas são pontuadas em 30 pontos possíveis; a pontuação da arte (coreografia e música) é calculada com a pontuação da dificuldade dos movimentos e depois adicionada à pontuação da execução."
1,"Tensões logo se desenvolveram entre diferentes facções gregas, levando a duas guerras civis consecutivas. Enquanto isso, o sultão otomano negociou com Mehmet Ali, do Egito, que concordou em enviar seu filho Ibrahim Pasha para a Grécia com um exército para reprimir a revolta em troca de ganhos territoriais. Ibrahim desembarcou no Peloponeso em fevereiro de 1825 e teve sucesso imediato: no final de 1825, a maior parte do Peloponeso estava sob controle egípcio, e a cidade de Missolonghi - sitiada pelos turcos desde abril de 1825 - caiu em abril de 1826. Embora Ibrahim foi derrotado em Mani, ele conseguiu reprimir a maior parte da revolta no Peloponeso e Atenas foi retomada."
2,"Para os machos, o sistema reprodutivo é o testículo, suspenso na cavidade do corpo pelas traquéias e pelo corpo gordo. A maioria dos insetos masculinos tem um par de testículos, dentro dos quais existem tubos de esperma ou folículos que estão dentro de um saco membranoso. Os folículos se conectam ao ducto deferente pelo ducto efferens, e os dois vasos tubulares deferentes se conectam a um ducto ejaculatório mediano que leva ao exterior. Uma porção do ducto deferente é frequentemente aumentada para formar a vesícula seminal, que armazena o esperma antes de serem descarregados na fêmea. As vesículas seminais têm revestimentos glandulares que secretam nutrientes para nutrição e manutenção do esperma. O ducto ejaculatório é derivado de uma invaginação das células epidérmicas durante o desenvolvimento e, como resultado, possui um revestimento cuticular. A porção terminal do ducto ejaculatório pode ser esclerotizada para formar o órgão intromitente, o edema. O restante do sistema reprodutor masculino é derivado do mesoderma embrionário, exceto as células germinativas, ou espermatogônias, que descem das células primordiais do pólo muito cedo durante a embriogênese."
3,"Papa Paulo VI se tornou o primeiro pontífice reinante a visitar as Américas quando voou para Nova York em outubro de 1965 para se dirigir às Nações Unidas. Como gesto de boa vontade, o papa deu à ONU duas peças de joalharia papal, uma cruz e um anel de diamantes, na esperança de que o produto da venda em leilão contribua para os esforços da ONU para acabar com o sofrimento humano. Durante a visita do papa, à medida que o envolvimento dos EUA na Guerra do Vietnã se intensificou sob o presidente Johnson, Paulo VI implorou por paz perante a ONU:"
4,"O termo ""Início da era moderna"" foi introduzido no idioma inglês na década de 1930. distinguir o tempo entre o que chamamos Idade Média e o tempo do Iluminismo tardio (1800) (quando o significado do termo Idade Moderna estava desenvolvendo sua forma contemporânea). É importante notar que esses termos derivam da história da Europa. Em uso em outras partes do mundo, como na Ásia e em países muçulmanos, os termos são aplicados de uma maneira muito diferente, mas geralmente no contexto do contato com a cultura européia na Era dos Descobrimentos."
5,"A Galiza tem uma área de 29.574 quilômetros quadrados (11.419 milhas quadradas). Seu ponto mais ao norte, a 43 ° 47 ′ N, é a Estaca de Bares (também o ponto mais ao norte da Espanha); o extremo sul, a 41 ° 49′N, fica na fronteira portuguesa no Parque Natural da Baixa Limia - Serra do Xurés. A longitude mais oriental fica a 6 ° 42′W na fronteira entre a província de Ourense e a província castelhana e leonesa de Zamora) e a oeste a 9 ° 18′W, alcançada em dois lugares: o cabo A Nave em Fisterra (também conhecido Finisterra) e Cabo Touriñán, ambos na província da Corunha."
6,"Sob uma bandeira de ""reduzir a embriaguez pública"", a Lei da Cerveja de 1830 introduziu um novo nível inferior de instalações autorizadas a vender álcool, as Cervejarias. Na época, a cerveja era vista como inofensiva, nutritiva e até saudável. As crianças pequenas recebiam frequentemente o que era descrito como cerveja pequena, fabricada com baixo teor alcoólico, pois a água local geralmente era insegura. Até a igreja evangélica e os movimentos de temperança do dia viam o consumo de cerveja muito como um mal secundário e um acompanhamento normal para uma refeição. A cerveja disponível gratuitamente pretendia, assim, afastar os bebedores dos males do gin, ou era o que pensava."
7,"Quando um circuito indutivo é aberto, a corrente através da indutância entra em colapso rapidamente, criando uma grande tensão no circuito aberto do comutador ou relé. Se a indutância for grande o suficiente, a energia gerará uma faísca, fazendo com que os pontos de contato se oxidem, se deteriorem ou, às vezes, se fundam ou destruam uma chave de estado sólido. Um capacitor de amortecedor no circuito recém-aberto cria um caminho para esse impulso desviar dos pontos de contato, preservando sua vida útil; estes foram comumente encontrados em sistemas de ignição por disjuntor, por exemplo. Da mesma forma, em circuitos de menor escala, a faísca pode não ser suficiente para danificar o comutador, mas ainda irradia interferência de radiofrequência indesejável (RFI), absorvida por um capacitor de filtro. Os capacitores amortecedores são geralmente empregados com um resistor de baixo valor em série, para dissipar energia e minimizar o RFI. Essas combinações resistor-capacitor estão disponíveis em um único pacote."
8,"Diferentes campos da ciência usam o termo matéria de maneiras diferentes e, às vezes, incompatíveis. Algumas dessas maneiras são baseadas em significados históricos frouxos, de uma época em que não havia razão para distinguir massa e matéria. Como tal, não existe um significado científico universalmente aceito da palavra ""matéria"". Cientificamente, o termo ""massa"" é bem definido, mas ""matéria"" não é. Às vezes, no campo da física, ""matéria"" é simplesmente equiparada a partículas que exibem massa em repouso (isto é, que não podem viajar na velocidade da luz), como quarks e leptons. No entanto, tanto na física quanto na química, a matéria exibe propriedades semelhantes a ondas e partículas, a chamada dualidade onda-partícula."
9,"Alguns criticaram a decisão de Paulo VI; o recém-criado Sínodo dos Bispos teve apenas um papel consultivo e não pôde tomar decisões por conta própria, embora o Conselho tenha decidido exatamente isso. Durante o pontificado de Paulo VI, cinco desses sínodos ocorreram, e ele registra a implementação de todas as suas decisões. Questões relacionadas foram levantadas sobre as novas Conferências Nacionais de Bispos, que se tornaram obrigatórias após o Vaticano II. Outros questionaram seu Ostpolitik e contatos com O comunismo e os acordos que ele realizou para os fiéis."


As we can see, each text is a full paragraph (a unique context of the Portuguese SQuAD 1.1 dataset).

## 7. Masked language modeling

For masked language modeling (MLM), we tokenize the texts and group them into blocks of fixed length, with one additional step: we will randomly mask some tokens (by replacing them with `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens).

In [21]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [22]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

We then apply this tokenization function to all the splits of the dataset (the tokenizer was already loaded from `model_checkpoint` above):

In [None]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

In [24]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [25]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead of this drop if the model
    # supported it; you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that `group_texts` duplicates the inputs for our labels. This step is inherited from the causal language modeling part of the original notebook, where the model of the 🤗 Transformers library shifts the labels itself. No shifting applies to MLM: the masking step keeps a label only at the masked positions.

Also note that by default, the `map` method sends batches of 1,000 examples to the preprocessing function. So here, the remainder is dropped once every 1,000 examples to make the concatenated tokenized texts a multiple of `block_size`. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing (argument `num_proc`).

We now group the texts together and chunk them into samples of length `block_size`. You can skip this step if your dataset is composed of individual sentences.

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
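To see what the grouping does, here is a self-contained variant of `group_texts` applied to a toy batch. The token ids are made up, `block_size` is passed as an argument and set to 4 instead of 128 for readability, and the copy to `labels` is left out.

```python
def group_texts(examples, block_size):
    """Concatenate the tokenized texts and split them into blocks of `block_size`."""
    concatenated = {k: [tok for seq in examples[k] for tok in seq] for k in examples}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size  # drop the small remainder
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Three tokenized texts of lengths 4, 3 and 6: 13 tokens in total.
batch = {"input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 3, 102]]}
grouped = group_texts(batch, block_size=4)
print(grouped["input_ids"])
```

The 13 tokens give three blocks of 4 tokens and the last token is dropped. Note that a block can span two consecutive texts.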

The rest follows the usual `Trainer` workflow, with two specifics for MLM. First we use a model suitable for masked LM:

In [27]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
# number of model parameters
model_num_param=0
for p in model.parameters():
    model_num_param+=p.numel()
model_num_param

108954466

## 8. Lang adapter

In [29]:
# Setup adapters
if train_adapter:
        
    # new
    if madx2:
        # do not add an adapter in the last transformer layer
        leave_out = [len(model.bert.encoder.layer)-1]
    else:
        leave_out = []
        
    # new
    # task_name = data_args.dataset_name or "mlm"
    task_name = "mlm"
        
    # check if adapter already exists, otherwise add it
    if task_name not in model.config.adapters:
            
#             # resolve the adapter config
#             adapter_config = AdapterConfig.load(
#                 adapter_args.adapter_config,
#                 non_linearity=adapter_args.adapter_non_linearity,
#                 reduction_factor=adapter_args.adapter_reduction_factor,
#             )

        # new
        # resolve adapter config with (optionally) the MAD-X 2.0 option
        if adapter_config_name == "pfeiffer":
            from transformers.adapters.configuration import PfeifferConfig
            adapter_config = PfeifferConfig(non_linearity=adapter_non_linearity,
                                            reduction_factor=adapter_reduction_factor,
                                            leave_out=leave_out)           
        elif adapter_config_name == "pfeiffer+inv":
            from transformers.adapters.configuration import PfeifferInvConfig
            adapter_config = PfeifferInvConfig(non_linearity=adapter_non_linearity,
                                               reduction_factor=adapter_reduction_factor,
                                               leave_out=leave_out)          
        elif adapter_config_name == "houlsby":
            from transformers.adapters.configuration import HoulsbyConfig
            adapter_config = HoulsbyConfig(non_linearity=adapter_non_linearity,
                                           reduction_factor=adapter_reduction_factor,
                                           leave_out=leave_out)
        elif adapter_config_name == "houlsby+inv":
            from transformers.adapters.configuration import HoulsbyInvConfig
            adapter_config = HoulsbyInvConfig(non_linearity=adapter_non_linearity,
                                              reduction_factor=adapter_reduction_factor,
                                              leave_out=leave_out)              
            
        # load a pre-trained adapter from the Hub if specified
        if load_adapter:
            model.load_adapter(
                    load_adapter,
                    config=adapter_config,
                    load_as=task_name,
                    with_head = False
                )
        # otherwise, add a fresh adapter
        else:
            model.add_adapter(task_name, config=adapter_config)
                
    # optionally load another pre-trained language adapter
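    if load_lang_adapter:
        # NOTE: this branch needs lang_adapter_config (name of the config), lang_adapter_non_linearity
        # and lang_adapter_reduction_factor to be defined beforehand (they are not set in this notebook)
        from transformers import AdapterConfig
        # resolve the language adapter config
        lang_adapter_config = AdapterConfig.load(
                lang_adapter_config,
                non_linearity=lang_adapter_non_linearity,
                reduction_factor=lang_adapter_reduction_factor,
                leave_out=leave_out,
            )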
        # load the language adapter from Hub
        lang_adapter_name = model.load_adapter(
                load_lang_adapter,
                config=lang_adapter_config,
                load_as=language,
                with_head = False
            )
    else:
        lang_adapter_name = None
    # Freeze all model weights except of those of this adapter
    model.train_adapter([task_name])
    # Set the adapters to be used in every forward pass
    if lang_adapter_name:
        model.set_active_adapters([lang_adapter_name, task_name])
    else:
        model.set_active_adapters([task_name])
else:
    if load_adapter or load_lang_adapter:
        raise ValueError(
                "Adapters can only be loaded in adapters training mode. "
                "Set train_adapter = True to enable adapter training"
            )

In [30]:
# Put only the adapter after the MHA but not after the FF in the last layer (Houlsby configuration)
if houlsby_MHA_lastlayer \
and train_adapter \
and not madx2 \
and task_name in model.config.adapters \
and (adapter_config_name == "houlsby" or adapter_config_name == "houlsby+inv"):
    from torch.nn import ModuleDict
    model.bert.encoder.layer[-1].output.adapters = ModuleDict()

In [31]:
model

BertForMaskedLM(
  (bert): BertModel(
    (invertible_adapters): ModuleDict(
      (mlm): NICECouplingBlock(
        (F): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class()
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
        (G): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class()
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
      )
    )
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            

In [32]:
model_adapter_num_param=0
for p in model.parameters():
    model_adapter_num_param+=p.numel()
model_adapter_num_param

115751266

## 9. Training

In [33]:
from transformers import TrainingArguments

if ds:
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        do_train=do_train,
        do_eval=do_eval,
        evaluation_strategy=evaluation_strategy,
        eval_accumulation_steps=eval_accumulation_steps,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        adam_beta1=adam_beta1,
        adam_beta2=adam_beta2,
        adam_epsilon=adam_epsilon,
        num_train_epochs=num_train_epochs,
        lr_scheduler_type=lr_scheduler_type,
        warmup_ratio=warmup_ratio,
        warmup_steps=warmup_steps,
        logging_dir=logging_dir,         # directory for storing logs
        logging_strategy=evaluation_strategy,
        logging_steps=logging_steps,     # if strategy = "steps"
        save_strategy=evaluation_strategy,          # model checkpoint saving strategy
        save_steps=logging_steps,        # if strategy = "steps"
        save_total_limit=save_total_limit,
        fp16=fp16,
        eval_steps=logging_steps,        # if strategy = "steps"
        load_best_model_at_end=load_best_model_at_end,
        metric_for_best_model=metric_for_best_model,
        greater_is_better=greater_is_better,
        disable_tqdm=disable_tqdm,
        local_rank=gpu,
        deepspeed=ds_config
        )
else:
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        do_train=do_train,
        do_eval=do_eval,
        evaluation_strategy=evaluation_strategy,
        eval_accumulation_steps=eval_accumulation_steps,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        adam_beta1=adam_beta1,
        adam_beta2=adam_beta2,
        adam_epsilon=adam_epsilon,
        num_train_epochs=num_train_epochs,
        lr_scheduler_type=lr_scheduler_type,
        warmup_ratio=warmup_ratio,
        warmup_steps=warmup_steps,
        logging_dir=logging_dir,         # directory for storing logs
        logging_strategy=evaluation_strategy,
        logging_steps=logging_steps,     # if strategy = "steps"
        save_strategy=evaluation_strategy,          # model checkpoint saving strategy
        save_steps=logging_steps,        # if strategy = "steps"
        save_total_limit=save_total_limit,
        fp16=fp16,
        eval_steps=logging_steps,        # if strategy = "steps"
        load_best_model_at_end=load_best_model_at_end,
        metric_for_best_model=metric_for_best_model,
        greater_is_better=greater_is_better,
        disable_tqdm=disable_tqdm,
        local_rank=gpu,
        )

We also need a special `data_collator`. The `data_collator` is the function responsible for taking the samples and batching them into tensors. When there is nothing special to do, the default for this argument is enough, but here we want to apply random masking. We could do it as a pre-processing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure that the random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [34]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
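
To make the role of the collator more concrete, here is a minimal pure-Python sketch of BERT-style dynamic masking: roughly 15% of the positions are selected, and of those about 80% become `[MASK]`, 10% become a random token and 10% stay unchanged, while every non-selected position gets the label `-100` so that the loss ignores it. The function name `mask_tokens_sketch`, the toy token ids and the value `mask_id=103` are assumptions made for illustration. This is not the library's implementation, which works on tensors and also avoids masking special tokens.

```python
import random

IGNORE_INDEX = -100  # label value that the loss ignores


def mask_tokens_sketch(input_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    if rng is None:
        rng = random.Random()
    masked_ids = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected: input unchanged, label ignored
        labels[pos] = token  # the model has to recover the original token here
        draw = rng.random()
        if draw < 0.8:
            masked_ids[pos] = mask_id                    # ~80%: [MASK]
        elif draw < 0.9:
            masked_ids[pos] = rng.randrange(vocab_size)  # ~10%: random token
        # remaining ~10%: the original token is kept as input
    return masked_ids, labels


# Toy sequence with hypothetical ids. A new mask is drawn on every call,
# which mimics what happens at each epoch when masking lives in the collator.
ids = list(range(1000, 1040))
rng = random.Random(0)
epoch1 = mask_tokens_sketch(ids, mask_id=103, vocab_size=29794, rng=rng)
epoch2 = mask_tokens_sketch(ids, mask_id=103, vocab_size=29794, rng=rng)
```

Because the mask is drawn at batching time, the same sentence can contribute different prediction targets from one epoch to the next.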

Let's define a `compute_metrics` function (accuracy). Even though it is always better to evaluate a model against a metric, we will not use it to select the best model during training, as it can cause a CUDA out-of-memory error. Instead, we will use the validation loss (a common procedure when fine-tuning an MLM on a new dataset). At the end of training, we will use `compute_metrics` (accuracy) to measure the performance of our model.

### Evaluation (loss and accuracy) before the training

In [35]:
# metric accuracy
import numpy as np  # needed by compute_metrics (np.argmax)
from datasets import load_metric
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    indices = [[i for i, x in enumerate(labels[row]) if x != -100] for row in range(len(labels))]

    labels = [labels[row][indices[row]] for row in range(len(labels))]
    labels = [item for sublist in labels for item in sublist]

    predictions = [predictions[row][indices[row]] for row in range(len(predictions))]
    predictions = [item for sublist in predictions for item in sublist]
    
    results = metric.compute(predictions=predictions, references=labels)
    results["eval_accuracy"] = results["accuracy"]
    results.pop("accuracy")

    return results
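
The filtering done in `compute_metrics` can be summarized on a toy example: only the positions whose label is not `-100` (the masked tokens) count towards the accuracy. The sketch below uses plain Python lists and a hypothetical helper name, `masked_accuracy`. It illustrates the idea and is not a drop-in replacement for the function above.

```python
def masked_accuracy(predictions, labels, ignore_index=-100):
    """Accuracy over the positions whose label differs from `ignore_index`."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue  # not a masked position: skipped, as in compute_metrics
            total += 1
            correct += int(pred == label)
    return correct / total if total else float("nan")


# Two toy rows of predicted token ids and labels (hypothetical values).
preds = [[5, 7, 9, 2], [4, 4, 8, 1]]
labels = [[-100, 7, -100, 3], [4, -100, -100, -100]]
acc = masked_accuracy(preds, labels)  # 3 scored positions, 2 of them correct
```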

In [36]:
%%time
num_rows = lm_datasets["validation"].num_rows
num_rows_10pct = int(num_rows/10)

eval_acc_sum = 0.
eval_loss_sum = 0.

# validation dataset evaluation
for i in range(10):

    # indices of the current 10% chunk (the last chunk takes the remainder)
    start = i*num_rows_10pct
    end = (i+1)*num_rows_10pct if i != 9 else num_rows
    indices = list(range(start,end))

    # sub dataset eval
    dset_eval = lm_datasets["validation"].select(indices)

    from transformers import Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"],#.shard(index=1, num_shards=90), #to be used to reduce train to 1/90
    #     eval_dataset=lm_datasets["validation"],
        eval_dataset=dset_eval,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        )    

    # calculation of the performance on the validation set
    eval_results = trainer.evaluate()
    eval_acc_sum += eval_results['eval_accuracy']*len(indices)
    eval_loss_sum += eval_results['eval_loss']*len(indices)

eval_acc_mean = eval_acc_sum / num_rows
eval_loss_mean = eval_loss_sum / num_rows

print('eval_accuracy:',eval_acc_mean)
print('eval_loss:',eval_loss_mean)
print(f"perplexity: {math.exp(eval_loss_mean):.2f}")

Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 799
  Batch size = 16


eval_accuracy: 0.567036068465056
eval_loss: 2.532642419832549
perplexity: 12.59
CPU times: user 4min 43s, sys: 4min 29s, total: 9min 12s
Wall time: 9min 15s
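
The bookkeeping of the loop above can be checked in isolation. The sketch below assumes a validation set of 7963 rows, a number inferred from the logs (9 chunks of 796 rows plus one of 799). The helper names `chunk_bounds` and `weighted_mean` are hypothetical. It rebuilds the chunk sizes, shows the row-weighted mean used to aggregate the per-chunk results, and recomputes the perplexity as the exponential of the mean loss.

```python
import math


def chunk_bounds(num_rows, num_chunks=10):
    """(start, end) pairs; the last chunk absorbs the remainder."""
    size = num_rows // num_chunks
    bounds = []
    for i in range(num_chunks):
        start = i * size
        end = (i + 1) * size if i < num_chunks - 1 else num_rows
        bounds.append((start, end))
    return bounds


def weighted_mean(values, weights):
    """Mean of `values` where each value counts `weights[i]` times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


bounds = chunk_bounds(7963)                     # row count inferred from the logs
sizes = [end - start for start, end in bounds]  # nine chunks of 796, one of 799
example_mean = weighted_mean([1.0, 3.0], [1, 3])
perplexity = math.exp(2.532642419832549)        # eval_loss printed above
```

Note that weighting by the number of rows is only an approximation of a per-token average, since the number of masked tokens is not the same in every row.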


### Training

Then we just have to pass everything to `Trainer` and begin training:

In [37]:
from transformers import Trainer
from transformers.trainer_callback import EarlyStoppingCallback

if metric_for_best_model == "accuracy":
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"], # .shard(index=1, num_shards=90), to be used to reduce train to 1/90
        eval_dataset=lm_datasets["validation"], #.shard(index=1, num_shards=90), to be used to reduce validation to 1/90
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        do_save_full_model=not train_adapter, 
        do_save_adapters=train_adapter,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)],
        )    
else:
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"], # .shard(index=1, num_shards=90), to be used to reduce train to 1/90
        eval_dataset=lm_datasets["validation"], #.shard(index=1, num_shards=90), to be used to reduce validation to 1/90
        tokenizer=tokenizer,
        data_collator=data_collator,
#         compute_metrics=compute_metrics,
        do_save_full_model=not train_adapter, 
        do_save_adapters=train_adapter,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)],
        ) 

Using amp fp16 backend


In [39]:
# trainer.args._n_gpu = n_gpu # train on one GPU but as we use local_rank in training_args, it is not needed
trainer.train()

***** Running training *****
  Num examples = 49822
  Num Epochs = 100
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 311400


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.9677 | 1.883594 |
| 2 | 1.8813 | 1.848203 |
| 3 | 1.839 | 1.826846 |
| 4 | 1.8109 | 1.806322 |
| 5 | 1.7884 | 1.806623 |
| 6 | 1.7649 | 1.799932 |
| 7 | 1.7461 | 1.808581 |
| 8 | 1.7373 | 1.797586 |
| 9 | 1.716 | 1.801826 |
| 10 | 1.7027 | 1.783277 |




TrainOutput(global_step=93420, training_loss=1.6664949842194525, metrics={'train_runtime': 13280.3879, 'train_samples_per_second': 375.155, 'train_steps_per_second': 23.448, 'total_flos': 1.3287075109404672e+17, 'train_loss': 1.6664949842194525, 'epoch': 30.0})
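
The numbers in the training log can be cross-checked with a little arithmetic. The sketch below assumes a single device and a gradient accumulation of 1, as reported in the log. It recovers the number of optimization steps per epoch, the planned total of 311400 steps, and the epoch at which the run actually stopped.

```python
import math

num_examples = 49822   # "Num examples" in the training log
batch_size = 16        # "Total train batch size"
max_epochs = 100       # "Num Epochs"
global_step = 93420    # from TrainOutput

# One optimization step per batch (gradient accumulation = 1);
# the last, incomplete batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
planned_steps = steps_per_epoch * max_epochs
epochs_run = global_step / steps_per_epoch
```

With these values, training ended after 30 of the 100 planned epochs, which is consistent with the `EarlyStoppingCallback` passed to the `Trainer` having stopped the run early.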

In [40]:
# check dtype
trainer.model.bert.embeddings.word_embeddings.weight.dtype

torch.float32

If the weights dtype is float16, use the script `zero_to_fp32.py` to get them in float32 as explained in [Getting The Model Weights Out](https://huggingface.co/transformers/main_classes/deepspeed.html?highlight=deepspeed#getting-the-model-weights-out).

In [41]:
print(f"Number of parameters of the model with adapter: {model_adapter_num_param:.0f}")
print(f"Number of parameters of the model without adapter: {model_num_param:.0f}")
print(f"Number of parameters of the adapter: {model_adapter_num_param - model_num_param:.0f}")
print("Percentage of additional parameters through adapter:",round(((model_adapter_num_param - model_num_param)/model_num_param)*100,2),'%')

Number of parameters of the model with adapter: 115751266
Number of parameters of the model without adapter: 108954466
Number of parameters of the adapter: 6796800
Percentage of additional parameters through adapter: 6.24 %
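
As a sanity check on these numbers, the short sketch below recomputes the adapter overhead and the share taken by the invertible adapter, using the layer shapes shown in the printed model (`F` and `G` are each `Linear(384, 192)` followed by `Linear(192, 384)`). The helper `linear_params` is a hypothetical name. Attributing the remainder to the bottleneck adapters of the encoder layers is an assumption, since the printed model is truncated.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Shapes copied from the printed NICECouplingBlock.
coupling_net = linear_params(384, 192) + linear_params(192, 384)
invertible_adapter = 2 * coupling_net  # the F and G networks

with_adapter = 115751266     # model with adapter
without_adapter = 108954466  # model without adapter
adapter_total = with_adapter - without_adapter
overhead_pct = round(100 * adapter_total / without_adapter, 2)

# Remainder: presumably the bottleneck adapters inside the encoder layers.
remaining_adapter_params = adapter_total - invertible_adapter
```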


In [None]:
# save adapter + head
adapters_folder = 'adapters-' + task_name
path_to_save_adapter = path_to_outputs/adapters_folder
trainer.model.save_adapter(str(path_to_save_adapter), adapter_name=task_name, with_head=True)

!ls -lh {path_to_save_adapter}

In [None]:
path_to_save_adapter

Now, you can push the saved adapter + head to the [AdapterHub](https://adapterhub.ml/) (follow instructions at [Contributing to Adapter Hub](https://docs.adapterhub.ml/contributing.html)).

## 10. TensorBoard

In [44]:
#!pip install tensorboard

In [None]:
import os
PATH = os.getenv('PATH')
# replace xxxx by your username on your server (ex: paulo)
# replace yyyy by the name of the virtual environment of this notebook (ex: adapter-transformers)
%env PATH=/mnt/home/xxxx/anaconda3/envs/yyyy/bin:$PATH

In [46]:
%load_ext tensorboard
# %reload_ext tensorboard
%tensorboard --logdir {logging_dir} --bind_all



## 11. Evaluation (loss and accuracy) after the training

In [61]:
%%time
num_rows = lm_datasets["validation"].num_rows
num_rows_10pct = int(num_rows/10)

eval_acc_sum = 0.
eval_loss_sum = 0.

# validation dataset evaluation
for i in range(10):

    # indices of the current 10% chunk (the last chunk takes the remainder)
    start = i*num_rows_10pct
    end = (i+1)*num_rows_10pct if i != 9 else num_rows
    indices = list(range(start,end))

    # sub dataset eval
    dset_eval = lm_datasets["validation"].select(indices)

    from transformers import Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"],#.shard(index=1, num_shards=90), #to be used to reduce train to 1/90
    #     eval_dataset=lm_datasets["validation"],
        eval_dataset=dset_eval,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        )    

    # calculation of the performance on the validation set
    eval_results = trainer.evaluate()
    eval_acc_sum += eval_results['eval_accuracy']*len(indices)
    eval_loss_sum += eval_results['eval_loss']*len(indices)

eval_acc_mean = eval_acc_sum / num_rows
eval_loss_mean = eval_loss_sum / num_rows

print('eval_accuracy:',eval_acc_mean)
print('eval_loss:',eval_loss_mean)
print(f"perplexity: {math.exp(eval_loss_mean):.2f}")

Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 796
  Batch size = 16


Using amp fp16 backend
***** Running Evaluation *****
  Num examples = 799
  Batch size = 16


eval_accuracy: 0.6432477194214359
eval_loss: 1.7709132447159501
perplexity: 5.88
CPU times: user 3min 58s, sys: 4min 43s, total: 8min 41s
Wall time: 8min 46s


## 12. Application MLM

In [48]:
### import transformers
import pathlib
from pathlib import Path

In [49]:
if dataset_name == "squad11pt":
    
    # sentence from the training dataset
    text_dataset = "O panteísmo sustenta que Deus é o universo e o universo é Deus."
    dataset_mask = "Deus"
    text_dataset_mask = "O panteísmo sustenta que [MASK] é o universo e o universo é Deus."
    
    # sentence from wikipedia
    text_wiki = "O primeiro caso da COVID-19 foi descoberto em Wuhan, na China."
    wiki_mask = "China"
    text_wiki_mask = "O primeiro caso da COVID-19 foi descoberto em Wuhan, na [MASK]."

### Original model (without lang adapter)

We use the model `neuralmind/bert-base-portuguese-cased` and its trained lang adapter in the following examples.

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_mlm = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer_mlm = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [51]:
from transformers import pipeline
nlp = pipeline("fill-mask", model=model_mlm, tokenizer=tokenizer_mlm)

Let's take one sentence from the SQuAD 1.1 pt dataset and replace the word `Deus` by the token `[MASK]`.

In [52]:
print(f'({dataset_mask}) {text_dataset_mask}')
nlp(text_dataset_mask)

(Deus) O panteísmo sustenta que [MASK] é o universo e o universo é Deus.


[{'sequence': 'O panteísmo sustenta que Deus é o universo e o universo é Deus.',
  'score': 0.7392684817314148,
  'token': 2538,
  'token_str': 'Deus'},
 {'sequence': 'O panteísmo sustenta que deus é o universo e o universo é Deus.',
  'score': 0.042948465794324875,
  'token': 4023,
  'token_str': 'deus'},
 {'sequence': 'O panteísmo sustenta que ele é o universo e o universo é Deus.',
  'score': 0.029601380228996277,
  'token': 368,
  'token_str': 'ele'},
 {'sequence': 'O panteísmo sustenta que Cristo é o universo e o universo é Deus.',
  'score': 0.021081821992993355,
  'token': 4184,
  'token_str': 'Cristo'},
 {'sequence': 'O panteísmo sustenta que tudo é o universo e o universo é Deus.',
  'score': 0.018854131922125816,
  'token': 2745,
  'token_str': 'tudo'}]

Let's now test the original model with another sentence, with `China` as the masked word.

In [53]:
print(f'({wiki_mask}) {text_wiki_mask}')
nlp(text_wiki_mask)

(China) O primeiro caso da COVID-19 foi descoberto em Wuhan, na [MASK].


[{'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na China.',
  'score': 0.9124720096588135,
  'token': 3278,
  'token_str': 'China'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Índia.',
  'score': 0.034306950867176056,
  'token': 4340,
  'token_str': 'Índia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Malásia.',
  'score': 0.023240933194756508,
  'token': 17753,
  'token_str': 'Malásia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Tailândia.',
  'score': 0.013218147680163383,
  'token': 15582,
  'token_str': 'Tailândia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Inglaterra.',
  'score': 0.0027242223732173443,
  'token': 2785,
  'token_str': 'Inglaterra'}]

### Model with lang adapter

In [54]:
outputs = model_checkpoint.replace('/','-') + '_' + dataset_name + '/' + str(task) + '/'
outputs = outputs \
+ 'lr' + str(learning_rate) \
+ '_bs' + str(batch_size) \
+ '_GAS' + str(gradient_accumulation_steps) \
+ '_eps' + str(adam_epsilon) \
+ '_epochs' + str(num_train_epochs) \
+ '_patience' + str(early_stopping_patience) \
+ '_madx2' + str(madx2) \
+ '_houlsby_MHA_lastlayer' + str(houlsby_MHA_lastlayer) \
+ '_ds' + str(ds) \
+ '_fp16' + str(fp16) \
+ '_best' + str(load_best_model_at_end) \
+ '_metric' + str(metric_for_best_model) \
+ '_adapterconfig' + str(adapter_config_name)

path_to_outputs = root/'outputs'/outputs

# Config of the lang adapter
lang_adapter_path = path_to_outputs/'adapters-mlm/'

load_lang_adapter = lang_adapter_path
lang_adapter_config = str(lang_adapter_path) + "/adapter_config.json"

In [None]:
# load the language adapter
task_mlm_load_as = 'mlm'
lang_adapter_name = model_mlm.load_adapter(
    str(load_lang_adapter),
    config=lang_adapter_config,
    load_as=task_mlm_load_as,
    with_head=True
    )

# Set the adapters to be used in every forward pass
model_mlm.set_active_adapters([lang_adapter_name])

In [56]:
from transformers import pipeline
nlp = pipeline("fill-mask", model=model_mlm, tokenizer=tokenizer_mlm)

In [57]:
print(f'({dataset_mask}) {text_dataset_mask}')
nlp(text_dataset_mask)

(Deus) O panteísmo sustenta que [MASK] é o universo e o universo é Deus.


[{'sequence': 'O panteísmo sustenta que Deus é o universo e o universo é Deus.',
  'score': 0.865125834941864,
  'token': 2538,
  'token_str': 'Deus'},
 {'sequence': 'O panteísmo sustenta que Cristo é o universo e o universo é Deus.',
  'score': 0.03503378853201866,
  'token': 4184,
  'token_str': 'Cristo'},
 {'sequence': 'O panteísmo sustenta que Jesus é o universo e o universo é Deus.',
  'score': 0.016306772828102112,
  'token': 3125,
  'token_str': 'Jesus'},
 {'sequence': 'O panteísmo sustenta que tudo é o universo e o universo é Deus.',
  'score': 0.011215949431061745,
  'token': 2745,
  'token_str': 'tudo'},
 {'sequence': 'O panteísmo sustenta que ele é o universo e o universo é Deus.',
  'score': 0.007768187206238508,
  'token': 368,
  'token_str': 'ele'}]

Our fine-tuned model scored better (0.865 vs. 0.739) when finding the masked word `Deus`. It seems that our fine-tuning on the SQuAD 1.1 pt dataset with a lang adapter worked, as the model performs better than the original one on a sentence from its training corpus.

Let's now test our fine-tuned model with another sentence, with `China` as the masked word.

In [58]:
print(f'({wiki_mask}) {text_wiki_mask}')
nlp(text_wiki_mask)

(China) O primeiro caso da COVID-19 foi descoberto em Wuhan, na [MASK].


[{'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na China.',
  'score': 0.7749341726303101,
  'token': 3278,
  'token_str': 'China'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Índia.',
  'score': 0.14467883110046387,
  'token': 4340,
  'token_str': 'Índia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Malásia.',
  'score': 0.028336668387055397,
  'token': 17753,
  'token_str': 'Malásia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Tailândia.',
  'score': 0.01335228607058525,
  'token': 15582,
  'token_str': 'Tailândia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Indonésia.',
  'score': 0.007589380722492933,
  'token': 13985,
  'token_str': 'Indonésia'}]

The masked word `China` was found with a high score of 0.775, but one lower than the score of the original model (0.912). This was expected: by fine-tuning the original model, we specialized it to the "language" of the dataset used.
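
To compare the two models programmatically rather than by eye, one possible sketch is a small helper that looks up the score of the expected token in a `fill-mask` result list. The helper name `score_of` is hypothetical, and the lists below contain only the top candidate copied from the outputs above, with the other keys omitted.

```python
def score_of(results, token_str):
    """Score of `token_str` in a fill-mask result list, or None if absent."""
    for candidate in results:
        if candidate["token_str"] == token_str:
            return candidate["score"]
    return None


# Top candidates copied from the outputs above.
original_dataset = [{"token_str": "Deus", "score": 0.7392684817314148}]
adapter_dataset = [{"token_str": "Deus", "score": 0.865125834941864}]
original_wiki = [{"token_str": "China", "score": 0.9124720096588135}]
adapter_wiki = [{"token_str": "China", "score": 0.7749341726303101}]

# Positive delta: the adapter helped. Negative delta: the original model was better.
delta_deus = score_of(adapter_dataset, "Deus") - score_of(original_dataset, "Deus")
delta_china = score_of(adapter_wiki, "China") - score_of(original_wiki, "China")
```

In a real run, the lists would simply be the return values of `nlp(text_dataset_mask)` and `nlp(text_wiki_mask)` before and after loading the adapter.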

# END