# Fine-tuning an MLM (Masked Language Model) like BERT (base or large) with the adapter-transformers library (notebook version)

- **Credit**: [Hugging Face](https://huggingface.co/) and [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)
- **Author**: [Pierre GUILLOU](https://www.linkedin.com/in/pierreguillou/)
- **Date**: 05/07/2021
- **Blog post**: [NLP nas empresas | Como ajustar um modelo de linguagem natural como BERT a um novo domínio linguístico com um Adapter?](https://medium.com/@pierre_guillou/nlp-nas-empresas-como-ajustar-um-modelo-de-linguagem-natural-como-bert-a-um-novo-dom%C3%ADnio-23752b73b185)
- **Link to the folder in github with this notebook and all necessary scripts**: [language-modeling with adapters](https://github.com/piegu/language-models/tree/master/adapters/language-modeling/)

## 1. Context

### Objective

The objective here is to **fine-tune a Masked Language Model (MLM) like BERT (base or large) by training adapters (library [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)) instead of the embeddings and transformer layers of the MLM**, and to compare the results with a BERT model fully fine-tuned on the same task.

The benefit is clear: if you need models for different NLP tasks, instead of fine-tuning and storing one model per NLP task, **you store only one MLM and the trained task adapters, whose sizes are between 6% and 13% of the size of the MLM** (depending on the chosen adapter configuration). Moreover, loading these adapters in production is very easy.

### Content

In this notebook, we'll see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a language modeling task. We will cover one type of language modeling task:

- Masked language modeling: the model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the tokens masked to predict their value.

![Widget inference representing the masked language modeling task](images/masked_language_modeling_adapter.png)

We will see how to easily load and preprocess the dataset for this task, and how to use the `Trainer` API to fine-tune a model on it.

### History and Credit

This notebook is an adaptation of the following notebooks and scripts for **fine-tuning a (transformer) Masked Language Model (MLM) like BERT (base or large) with any dataset** (we use here the texts of the [Portuguese Squad 1.1 dataset](https://forum.ailab.unb.br/t/datasets-em-portugues/251/4)):
- **from [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)** | notebook [01_Adapter_Training.ipynb](https://github.com/Adapter-Hub/adapter-transformers/blob/master/notebooks/01_Adapter_Training.ipynb) and script [run_mlm.py](https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/language-modeling/run_mlm.py) (this script was adapted from the script [run_mlm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) of HF)
- **from [transformers](https://github.com/huggingface/transformers) of Hugging Face** | notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) and script [run_mlm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) 

In order to speed up the fine-tuning of the model on a single GPU, the [DeepSpeed](https://www.deepspeed.ai/) library could be used by applying the configuration provided by HF in the notebook [transformers + deepspeed CLI](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb). However, as adapter-transformers is not synchronized with the latest version of the HF transformers library, we keep that option for the future.

*Note: the section about Causal language modeling (CLM) is not included in this notebook, and all the code that is not needed for Masked Language Modeling (MLM) has been deleted from the original notebook.*

### Major changes from original notebooks and scripts

The notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) and script [run_mlm.py](https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/language-modeling/run_mlm.py) evaluate the model performance with the validation loss at the end of each epoch, not with the accuracy metric.

As a metric is a better criterion than the loss for selecting a model, we introduced in this notebook the accuracy metric for model evaluation (see the method `compute_metrics()`). However, as its calculation needs many GB of memory during evaluation, we do not use it here.
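To illustrate what such an accuracy metric measures, here is a minimal sketch in plain Python. It is not the notebook's own metric function, which is not reproduced here. It assumes the usual Hugging Face convention that positions to ignore carry the label `-100`; the name `masked_accuracy` is hypothetical. The last line estimates why the evaluation is memory hungry, under the assumption that fp32 logits are kept for every evaluation example: with blocks of 128 tokens and the 29,794-token vocabulary of the checkpoint used in this notebook, the logits of one example already take about 15 MB.

```python
def masked_accuracy(predictions, labels, ignore_index=-100):
    """Share of masked positions whose predicted token id equals the label."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:  # position was not masked: skip it
                continue
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Toy example: 3 masked positions, 2 of them predicted correctly.
toy_predictions = [[5, 7, 9, 2], [4, 4, 8, 1]]
toy_labels = [[-100, 7, -100, 3], [-100, -100, 8, -100]]
accuracy = masked_accuracy(toy_predictions, toy_labels)

# Bytes needed to keep the fp32 logits of one example (128 x 29794 values).
bytes_per_example = 128 * 29794 * 4
```

In practice the predictions would be the argmax of the logits over the vocabulary; reducing the logits to token ids as early as possible is one way to limit the memory cost.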

Thus, we updated the notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) to [language_modeling_adapter.ipynb](https://github.com/piegu/language-models/blob/master/adapters/language_modeling/language_modeling_adapter.ipynb) with the following changes:
- **Accuracy**: model evaluation through eval accuracy
- **EarlyStopping** by selecting the model with the highest eval accuracy (patience of 3 before ending the training)
- **MAD-X 2.0** that allows not to train adapters in the last transformer layer (read page 6 of [UNKs Everywhere: Adapting Multilingual Language Models to New Scripts](https://arxiv.org/pdf/2012.15562.pdf))
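The early stopping rule listed above can be sketched in a few lines of plain Python. This is only an illustration of the patience logic, assuming a metric where lower is better (such as the evaluation loss) and one evaluation per epoch; the function name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(eval_losses, patience):
    """Return (epoch at which training stops, epoch of the best model)."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            # new best model: remember it and reset the counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # patience never exhausted: training runs through all epochs
    return len(eval_losses), best_epoch


toy_losses = [2.0, 1.5, 1.4, 1.45, 1.41, 1.42, 1.3]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses, patience=3)
```

With these toy values the loss stops improving after epoch 3, so a patience of 3 ends the training at epoch 6 and the model of epoch 3 is the one to keep.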

## 2. Installation

In [1]:
import pathlib
from pathlib import Path

#root path
root = Path.cwd()

In [2]:
import pickle
import pandas as pd
import numpy as np
import random

In [3]:
import sys; print('python:',sys.version)

import torch; print('Pytorch:',torch.__version__)

import transformers; print('adapter-transformers:',transformers.__version__)
import transformers; print('HF transformers:',transformers.__hf_version__)
import tokenizers; print('tokenizers:',tokenizers.__version__)
import datasets; print('datasets:',datasets.__version__)

# import deepspeed; print('deepspeed:',deepspeed.__version__)

# Versions used in the virtual environment of this notebook:

# python: 3.8.10 (default, Jun  4 2021, 15:09:15) 
# [GCC 7.5.0]
# Pytorch: 1.9.0
# adapter-transformers: 2.0.1
# transformers: 4.5.1
# tokenizers: 0.10.3
# datasets: 1.8.0

python: 3.8.10 (default, Jun  4 2021, 15:09:15) 
[GCC 7.5.0]
Pytorch: 1.9.0
adapter-transformers: 2.0.1
HF transformers: 4.5.1
tokenizers: 0.10.3
datasets: 1.8.0


## 3. Model & dataset

In [4]:
# Select a MLM BERT base or large in the dataset language
model_checkpoint = "neuralmind/bert-base-portuguese-cased"
# model_checkpoint = "neuralmind/bert-large-portuguese-cased"

# SQuAD 1.1 in Portuguese
dataset_name = "squad11pt" # SQuAD v1.1 in Portuguese

## 4. Main hyperparameters

In [5]:
task = "mlm"

In [6]:
# training arguments
batch_size = 32
gradient_accumulation_steps = 1

learning_rate = 1e-4
num_train_epochs = 100.
early_stopping_patience = 10

adam_epsilon = 1e-6

fp16 = True
ds = False # DeepSpeed

# best model
load_best_model_at_end = True 
if load_best_model_at_end:
    metric_for_best_model = "loss" # could be accuracy, too
    if metric_for_best_model == "loss":
        greater_is_better = False
    else:
        greater_is_better = True # for accuracy 

In [7]:
# train adapter
train_adapter = True # we want to train an adapter
load_adapter = None # we do not load an existing adapter
load_lang_adapter = None # we do not load an existing lang adapter

# if True, do not put adapter in the last transformer layer
madx2 = True

## 5. Configuration

### GPU

In [8]:
# gpu
n_gpu = 1 # train on just one GPU
gpu = 1 # select the GPU

In [9]:
# Run this notebook on the GPU selected above
# As we do not launch a python script in this notebook, this cell is not mandatory
import os
os.environ['MASTER_ADDR'] = 'localhost'
if gpu == 0:
    os.environ['MASTER_PORT'] = '9996' # modify if RuntimeError: Address already in use # GPU 0
elif gpu == 1:
    os.environ['MASTER_PORT'] = '9997'
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = str(gpu)
os.environ['WORLD_SIZE'] = "1"

### Lang adapter config

In [54]:
# lang adapter config
adapter_config_name = "houlsby+inv" # pfeiffer+inv is possible, too
if adapter_config_name == "pfeiffer+inv":
    adapter_non_linearity = 'gelu' # relu is possible, too
elif adapter_config_name == "houlsby+inv":
    adapter_non_linearity = 'swish'
adapter_reduction_factor = 2
language = 'pt' # pt = Portuguese

### Training arguments of the HF trainer

In [11]:
# setup the training argument
do_train = True 
do_eval = True 

# epochs, bs, GA
evaluation_strategy = "epoch" # no

# fp16
fp16_opt_level = 'O1'
fp16_backend = "auto"
fp16_full_eval = False

# optimizer (AdamW)
weight_decay = 0.01 # 0.0
adam_beta1 = 0.9
adam_beta2 = 0.999

# scheduler
lr_scheduler_type = 'linear'
warmup_ratio = 0.0
warmup_steps = 0

# logs
logging_strategy = "steps"
logging_first_step = True # False
logging_steps = 500     # if strategy = "steps"
eval_steps = logging_steps # logging_steps

# checkpoints
save_strategy = "epoch" # steps
save_steps = 500 # if save_strategy = "steps"
save_total_limit = 1 # None

# no cuda, seed
no_cuda = False
seed = 42

# bar
disable_tqdm = False # True
remove_unused_columns = True
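With `lr_scheduler_type = 'linear'` and no warmup, the learning rate is expected to decay linearly from `learning_rate` to 0 over the training. The following plain-Python sketch shows the shape of such a schedule; it is an illustration and not the code used by the `Trainer`. The function name `linear_lr` and the value `total_steps=1000` are hypothetical (the real number of steps depends on the dataset size, the batch size and the number of epochs).

```python
def linear_lr(step, total_steps, base_lr=1e-4, warmup_steps=0):
    """Learning rate at a given step for a linear schedule with optional warmup."""
    if step < warmup_steps:
        # linear increase from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # linear decrease from base_lr to 0 after warmup
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, the middle and the end of a 1000-step training.
schedule = [linear_lr(step, total_steps=1000) for step in (0, 500, 1000)]
```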

In [12]:
# folder for training outputs

outputs = model_checkpoint.replace('/','-') + '_' + dataset_name + '/'  
outputs = outputs + str(task) \
+ '_lr' + str(learning_rate) \
+ '_bs' + str(batch_size) \
+ '_GAS' + str(gradient_accumulation_steps) \
+ '_eps' + str(adam_epsilon) \
+ '_epochs' + str(num_train_epochs) \
+ '_patience' + str(early_stopping_patience) \
+ '_madx2' + str(madx2) \
+ '_ds' + str(ds) \
+ '_fp16' + str(fp16) \
+ '_best' + str(load_best_model_at_end) \
+ '_metric' + str(metric_for_best_model) \
+ '_adapterconfig' + str(adapter_config_name)

# path to outputs
path_to_outputs = root/'models_outputs'/outputs

# subfolder for model outputs
output_dir = path_to_outputs/'output_dir' 
overwrite_output_dir = True # False

# logs
logging_dir = path_to_outputs/'logging_dir'

## 6. Preparing the dataset

In [13]:
# if dataset_name == "squad11pt":
    
#     # create dataset folder 
#     path_to_dataset = root/'data'/dataset_name
#     path_to_dataset.mkdir(parents=True, exist_ok=True) 

#     # Get dataset SQUAD in Portuguese
#     %cd {path_to_dataset}
#     !wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn" -O squad-pt.tar.gz && rm -rf /tmp/cookies.txt

#     # unzip 
#     !tar -xvf squad-pt.tar.gz

#     # Get the train and validation json file in the HF script format 
#     # inspiration: file squad.py at https://github.com/huggingface/datasets/tree/master/datasets/squad
    
#     import json 
#     files = ['squad-train-v1.1.json','squad-dev-v1.1.json']

#     for file in files:

#         # Opening JSON file & returns JSON object as a dictionary 
#         f = open(file, encoding="utf-8") 
#         data = json.load(f) 

#         # Iterating through the json list 
#         context_list = list()
#         id_list = list()

#         for row in data['data']: 

#             for paragraph in row['paragraphs']:
#                 context = (paragraph['context']).strip()
#                 context_list.append(context)

#         # Get unique context
#         unique_context_list = list(set(context_list))

#         # Closing file 
#         f.close() 

#         file_name = 'pt_' + str(file).replace('json','txt')
#         with open(file_name, 'wb') as list_file:
#             pickle.dump(unique_context_list, list_file)
         
#     %cd ../..
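The core of the commented cell above is the extraction of the unique paragraph contexts from a file in the SQuAD format. Here is a minimal, self-contained sketch of that step on a toy in-memory dictionary. The function name `unique_contexts` and the toy data are hypothetical; unlike the `set()` used above, this version keeps the contexts in their order of first appearance, which is a design choice of the sketch.

```python
def unique_contexts(squad_dict):
    """Return the distinct paragraph contexts of a SQuAD-format dict, in order."""
    contexts = [
        paragraph["context"].strip()
        for article in squad_dict["data"]
        for paragraph in article["paragraphs"]
    ]
    # dict.fromkeys removes duplicates while keeping the first-seen order
    return list(dict.fromkeys(contexts))


toy_squad = {
    "data": [
        {"paragraphs": [{"context": " Texto A "}, {"context": "Texto B"}]},
        {"paragraphs": [{"context": "Texto A"}]},
    ]
}
contexts = unique_contexts(toy_squad)
```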

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. Just uncomment the following cell and replace the paths with values that will lead to your files:

In [14]:
# from datasets import load_dataset
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

You can also load datasets from a csv or a JSON file, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

In [15]:
if dataset_name == "squad11pt":
    
    path_to_data = root/'data'/dataset_name
    files = ['pt_squad-train-v1.1.txt','pt_squad-dev-v1.1.txt']
    
    for i,file in enumerate(files):
        path_to_file = path_to_data/file
        with open(path_to_file, "rb") as f:   # Unpickling
            text_list = pickle.load(f)

            with open(file, "w") as output:
                output.write(str(text_list))
        
        df = pd.DataFrame(text_list,columns=['text'])
        if i == 0:
            df_train = df.copy()
        else:
            df_validation = df.copy()
            
    from datasets import Dataset, DatasetDict
    dataset_train = Dataset.from_pandas(df_train)
    dataset_validation = Dataset.from_pandas(df_validation)

    datasets = DatasetDict()
    datasets['train'] = dataset_train
    datasets['validation'] = dataset_validation

To access an actual element, you need to select a split first, then give an index:

In [16]:
datasets["train"][10]

{'text': 'O panteísmo sustenta que Deus é o universo e o universo é Deus, enquanto o panenteísmo sustenta que Deus contém, mas não é idêntico ao universo. É também a visão da Igreja Católica Liberal; Teosofia; algumas visões do hinduísmo, exceto o vaisnavismo, que acredita no panenteísmo; Sikhismo; algumas divisões do neopaganismo e taoísmo, juntamente com muitas denominações e indivíduos variados dentro das denominações. A Cabala, Misticismo judaico, pinta uma visão panteísta / panenteísta de Deus - que tem ampla aceitação no judaísmo hassídico, particularmente de seu fundador The Baal Shem Tov - mas apenas como um complemento à visão judaica de um deus pessoal, não no panteísta original sensação que nega ou limita a persona a Deus. [citação necessário]'}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [17]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [18]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"O campus principal em Provo, Utah, Estados Unidos, fica em aproximadamente 560 acres (2,3 km2) situado na base das montanhas Wasatch e inclui 295 edifícios. Os edifícios apresentam uma grande variedade de estilos arquitetônicos, cada edifício sendo construído no estilo de seu tempo. A grama, as árvores e os canteiros de flores do campus da BYU são mantidos impecavelmente. Além disso, vistas das montanhas Wasatch (incluindo o Monte Timpanogos) podem ser vistas no campus. A Biblioteca Harold B. Lee da BYU (também conhecida como ""HBLL""), que a Princeton Review classificou como a ""1ª Grande Biblioteca da Faculdade"" em 2004, possui aproximadamente 8,5 milhões de itens em suas coleções, contém 158 km de prateleiras e pode acomodar 4.600 pessoas. A Torre de Spencer W. Kimball, abreviada para SWKT e pronunciada Swicket por muitos estudantes, abriga vários departamentos e programas da universidade e é o edifício mais alto de Provo, Utah. Além disso, o Marriott Center da BYU, usado como uma arena de basquete, pode acomodar mais de 22.000 pessoas e é uma das maiores arenas no campus do país. Curiosamente ausente no campus desta universidade de propriedade da igreja é uma capela do campus. Não obstante, todos os cultos dominicais da Igreja SUD para os alunos são realizados no campus, mas devido ao grande número de estudantes que frequentam esses serviços, quase todos os prédios e possíveis espaços de reunião no campus são utilizados (além disso, muitos estudantes frequentam serviços fora do campus no campus). Capelas SUD nas comunidades do entorno)."
1,"Nos contextos literários, Apolo representa harmonia, ordem e razão - características contrastadas com as de Dionísio, deus do vinho, que representa êxtase e desordem. O contraste entre os papéis desses deuses se reflete nos adjetivos apolíneo e dionisíaco. No entanto, os gregos pensavam nas duas qualidades como complementares: os dois deuses são irmãos, e quando Apolo no inverno partisse para Hyperborea, deixaria o oráculo de Delfos para Dionísio. Este contraste parece ser mostrado nos dois lados do vaso Borghese."
2,"A liga realizou sua primeira temporada em 1992-1993 e era originalmente composta por 22 clubes. O primeiro gol de todos os tempos na Premier League foi marcado por Brian Deane, do Sheffield United, em uma vitória por 2 a 1 sobre o Manchester United. Os 22 membros inaugurais da nova Premier League foram Arsenal, Aston Villa, Blackburn Rovers, Chelsea, Coventry City, Crystal Palace, Everton, Ipswich Town, Leeds United, Liverpool, Manchester City, Manchester United, Middlesbrough, Norwich City, Nottingham Forest, Oldham Athletic, Queens Park Rangers, Sheffield United, Sheffield Wednesday, Southampton, Tottenham Hotspur e Wimbledon. Luton Town, Notts County e West Ham United foram as três equipes rebaixadas da antiga primeira divisão no final da temporada 1991-1992 e não participaram da temporada inaugural da Premier League."
3,"Não foi até o final do século XVIII que a migração como explicação para o desaparecimento no inverno de aves dos climas do norte foi aceita. A History of British Birds (Volume 1, 1797), de Thomas Bewick, menciona um relatório de ""um mestre muito inteligente de um navio"" que, ""entre as ilhas de Minorca e Maiorca, viu um grande número de andorinhas voando para o norte"" e afirma a situação na Grã-Bretanha da seguinte forma:"
4,"Na década de 1960, o centro de Houston consistia em uma coleção de estruturas de escritórios. O centro da cidade estava no limiar de um boom liderado pelo setor de energia em 1970. Uma sucessão de arranha-céus foi construída ao longo da década de 1970 - muitos pelo promotor imobiliário Gerald D. Hines - culminando com o arranha-céu mais alto de Houston, com 75 andares e 1.002 pés ( Torre JPMorgan Chase de 305 m) (anteriormente denominada Texas Commerce Tower), concluída em 1982. É a estrutura mais alta do Texas, 15º edifício mais alto dos Estados Unidos e o 85º arranha-céu mais alto do mundo, com base na maior característica arquitetônica . Em 1983, o Wells Fargo Plaza, com 71 andares e 992 pés (302 m) de altura, foi concluído, tornando-se o segundo edifício mais alto de Houston e Texas. Baseado na característica arquitetônica mais alta, é o 17º mais alto dos Estados Unidos e o 95º mais alto do mundo. Em 2007, o centro de Houston tinha mais de 4.000.000 m² de espaço para escritório."
5,"Cardigan montou sua unidade e atacou o comprimento do vale do Balaclava, sob fogo de baterias russas nas colinas. A acusação da Brigada Leve causou 278 baixas na unidade de 700 homens. A Brigada Leve foi comemorada no famoso poema de Alfred Lord Tennyson, ""A Carga da Brigada Leve"". Embora tradicionalmente a acusação da Brigada Leve fosse vista como um sacrifício glorioso, porém desperdiçado, de homens e cavalos bons, historiadores recentes dizem que a acusação da Brigada Leve teve êxito em pelo menos alguns de seus objetivos. O objetivo de qualquer carga de cavalaria é espalhar as linhas inimigas e assustar o inimigo fora do campo de batalha. A carga da Brigada Ligeira havia tão enervado a cavalaria russa, que havia sido derrotada anteriormente pela Brigada Pesada, que a Cavalaria Russa foi posta em vôo em grande escala pela carga subsequente da Brigada Ligeira.:252"
6,"Os intestinos dos animais contêm uma grande população de flora intestinal. Nos seres humanos, os quatro filos dominantes são Firmicutes, Bacteroidetes, Actinobacteria e Proteobacteria. Eles são essenciais para a digestão e também são afetados pelos alimentos consumidos. As bactérias no intestino desempenham muitas funções importantes para os seres humanos, incluindo a decomposição e auxiliando na absorção de alimentos indigestos; estimular o crescimento celular; reprimindo o crescimento de bactérias nocivas, treinando o sistema imunológico para responder apenas a patógenos; produção de vitamina B12; e defesa contra algumas doenças infecciosas."
7,"As espécies de plantas terrestres dominantes da época eram as gimnospermas, que são plantas vasculares, sem cone e sem flores, como coníferas que produzem sementes sem revestimento. Isso se opõe à flora atual da Terra, na qual as plantas terrestres dominantes em termos de número de espécies são angiospermas. Pensa-se que um gênero de planta em particular, o Ginkgo, tenha evoluído neste momento e seja representado hoje por uma única espécie, Ginkgo biloba. Além disso, acredita-se que o gênero existente Sequoia tenha evoluído no Mesozóico."
8,"Uma iniciativa de votação no Colorado, conhecida como Emenda 36, teria mudado a maneira pela qual o Estado distribui seus votos eleitorais. Em vez de atribuir todos os 9 eleitores do estado ao candidato com uma pluralidade de votos populares, sob a emenda, o Colorado teria designado eleitores presidenciais proporcionalmente à contagem de votos em todo o estado, o que seria um sistema único (Nebraska e Maine atribuem votos eleitorais com base em total de votos em cada distrito do congresso). Os depoentes alegaram que essa divisão diminuiria a influência do Colorado no Colégio Eleitoral, e a emenda acabou fracassando, recebendo apenas 34% dos votos."
9,"Embora os rios e a costa desta área estivessem entre os primeiros lugares colonizados pelos portugueses, que criaram postos comerciais no século XVI, eles não exploraram o interior até o século XIX. Os governantes africanos locais na Guiné, alguns dos quais prosperaram muito com o comércio de escravos, controlavam o comércio interior e não permitiam a entrada de europeus no interior. Eles os mantinham nos assentamentos costeiros fortificados onde o comércio acontecia. As comunidades africanas que lutaram contra os comerciantes de escravos também desconfiavam de aventureiros e pretensos colonos europeus. Os portugueses na Guiné estavam amplamente restritos ao porto de Bissau e Cacheu. Um pequeno número de colonos europeus estabeleceu fazendas isoladas ao longo dos rios do interior de Bissau."


As we can see, each text is a full paragraph taken from a Wikipedia article.

## 7. Masked language modeling

For masked language modeling (MLM), we tokenize our dataset and add one step: we randomly mask some tokens (by replacing them with `[MASK]`) and adjust the labels to only include the masked tokens (we don't have to predict the non-masked tokens).
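The masking step can be sketched in plain Python as follows. This is a simplified illustration, not the data collator of the library: every token is masked with the same probability, special tokens are not protected, and a masked token is always replaced by the mask id (the Hugging Face collator for language modeling additionally replaces part of the selected tokens with a random token or leaves them unchanged). The function name `mask_tokens`, the toy token ids and the mask id `103` are assumptions; in the notebook the mask id would come from `tokenizer.mask_token_id`.

```python
import random


def mask_tokens(input_ids, mask_token_id, mlm_probability=0.15, seed=0):
    """Randomly mask tokens; labels keep the original id only where masked."""
    rng = random.Random(seed)  # seeded generator so that the sketch is reproducible
    masked_inputs = []
    labels = []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            masked_inputs.append(mask_token_id)  # the model sees [MASK]
            labels.append(token_id)              # and has to predict the original token
        else:
            masked_inputs.append(token_id)       # token left untouched
            labels.append(-100)                  # -100: position ignored by the loss
    return masked_inputs, labels


toy_ids = list(range(1000, 1040))  # hypothetical token ids
masked_inputs, labels = mask_tokens(toy_ids, mask_token_id=103)
```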

In [19]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [20]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

We then apply this tokenization function to all the splits of the dataset and remove the `text` column, which is no longer needed:

In [21]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

In [22]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [23]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. For masked language modeling no shifting of the labels is involved: the masking step, applied later when the batches are built, decides which positions keep a label.

Also note that by default, the `map` method sends batches of 1,000 examples to the preprocessing function. So here, we drop the remainder once every 1,000 examples to make the concatenated tokenized texts a multiple of `block_size`. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing (argument `num_proc`).

We now group the texts together and chunk them into samples of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [24]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
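To see what the grouping does, here is the same `group_texts` logic run on a toy batch in plain Python, with a tiny `block_size` of 4 chosen only for readability (the notebook uses 128). The toy token ids are hypothetical.

```python
block_size = 4  # tiny value for readability; the notebook uses 128


def group_texts(examples):
    # Concatenate all the token lists of the batch.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so that the length is a multiple of block_size.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three tokenized texts of 3, 4 and 2 tokens (9 tokens in total).
toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]]}
grouped = group_texts(toy_batch)
```

The 9 tokens give two blocks of 4 tokens, `[1, 2, 3, 4]` and `[5, 6, 7, 8]`, and the last token is dropped. Text boundaries are not preserved: a block can start in one text and end in the next one.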

Compared with causal language modeling, two things change. First we use a model suitable for masked LM:

In [25]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [26]:
# number of model parameters
model_num_param=0
for p in model.parameters():
    model_num_param+=p.numel()
model_num_param

108954466

## 8. Lang adapter

In [27]:
# Setup adapters
if train_adapter:
        
    # new
    if madx2:
        # do not add an adapter in the last transformer layer
        leave_out = [len(model.bert.encoder.layer)-1]
    else:
        leave_out = []
        
    # new
    # task_name = data_args.dataset_name or "mlm"
    task_name = "mlm"
        
    # check if adapter already exists, otherwise add it
    if task_name not in model.config.adapters:
            
#             # resolve the adapter config
#             adapter_config = AdapterConfig.load(
#                 adapter_args.adapter_config,
#                 non_linearity=adapter_args.adapter_non_linearity,
#                 reduction_factor=adapter_args.adapter_reduction_factor,
#             )

        # new
        # resolve the adapter config with (optionally) the MAD-X 2.0 option
        if adapter_config_name == "pfeiffer":
            from transformers.adapters.configuration import PfeifferConfig
            adapter_config = PfeifferConfig(non_linearity=adapter_non_linearity,
                                            reduction_factor=adapter_reduction_factor,
                                            leave_out=leave_out)           
        elif adapter_config_name == "pfeiffer+inv":
            from transformers.adapters.configuration import PfeifferInvConfig
            adapter_config = PfeifferInvConfig(non_linearity=adapter_non_linearity,
                                               reduction_factor=adapter_reduction_factor,
                                               leave_out=leave_out)          
        elif adapter_config_name == "houlsby":
            from transformers.adapters.configuration import HoulsbyConfig
            adapter_config = HoulsbyConfig(non_linearity=adapter_non_linearity,
                                           reduction_factor=adapter_reduction_factor,
                                           leave_out=leave_out)
        elif adapter_config_name == "houlsby+inv":
            from transformers.adapters.configuration import HoulsbyInvConfig
            adapter_config = HoulsbyInvConfig(non_linearity=adapter_non_linearity,
                                              reduction_factor=adapter_reduction_factor,
                                              leave_out=leave_out)              
            
        # load a pre-trained adapter from the Hub if specified
        if load_adapter:
            model.load_adapter(
                    load_adapter,
                    config=adapter_config,
                    load_as=task_name,
                    with_head = False
                )
        # otherwise, add a fresh adapter
        else:
            model.add_adapter(task_name, config=adapter_config)
                
    # optionally load another pre-trained language adapter
    if load_lang_adapter:
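        # resolve the language adapter config
        # (assumption: it reuses the settings of the "Lang adapter config" cell)
        from transformers.adapters.configuration import AdapterConfig
        lang_adapter_config = AdapterConfig.load(
                adapter_config_name,
                non_linearity=adapter_non_linearity,
                reduction_factor=adapter_reduction_factor,
                leave_out=leave_out,
            )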
        # load the language adapter from Hub
        lang_adapter_name = model.load_adapter(
                load_lang_adapter,
                config=lang_adapter_config,
                load_as=language,
                with_head = False
            )
    else:
        lang_adapter_name = None
    # Freeze all model weights except those of this adapter
    model.train_adapter([task_name])
    # Set the adapters to be used in every forward pass
    if lang_adapter_name:
        model.set_active_adapters([lang_adapter_name, task_name])
    else:
        model.set_active_adapters([task_name])
else:
    if load_adapter or load_lang_adapter:
        raise ValueError(
                "Adapters can only be loaded in adapters training mode. "
                "Set train_adapter = True to enable adapter training"
            )

In [28]:
model

BertForMaskedLM(
  (bert): BertModel(
    (invertible_adapters): ModuleDict(
      (mlm): NICECouplingBlock(
        (F): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class()
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
        (G): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class()
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
      )
    )
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            

In [29]:
model_adapter_num_param=0
for p in model.parameters():
    model_adapter_num_param+=p.numel()
model_adapter_num_param

122252002

In [30]:
print(f"Number of parameters of the model with adapter: {model_adapter_num_param:.0f}")
print(f"Number of parameters of the model without adapter: {model_num_param:.0f}")
print(f"Number of parameters of the adapter: {model_adapter_num_param - model_num_param:.0f}")
print("Percentage of additional parameters through adapter:", round(((model_adapter_num_param - model_num_param)/model_num_param)*100, 2), '%')

Number of parameters of the model with adapter: 122252002
Number of parameters of the model without adapter: 108954466
Number of parameters of the adapter: 13297536
Percentage of additional parameters through adapter: 12.2 %
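
These counts can be checked by hand from the layer sizes in the model printout. The sketch below recomputes the size of the invertible adapter (two small MLPs, `F` and `G`) and the overall overhead. `linear_params` is a helper introduced here for illustration only, and the totals are copied from the output above.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of weights (and biases) in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# One NICE coupling block holds two MLPs (F and G), each 384 -> 192 -> 384.
# The activation function in between has no parameters.
mlp = linear_params(384, 192) + linear_params(192, 384)
invertible_adapter_params = 2 * mlp

# Totals reported by the notebook cells above.
with_adapter = 122_252_002
without_adapter = 108_954_466
adapter_params = with_adapter - without_adapter
overhead_pct = round(adapter_params / without_adapter * 100, 2)

print(invertible_adapter_params)     # invertible adapter alone
print(adapter_params, overhead_pct)  # all adapter weights, overhead in %
```

The invertible adapter accounts for only a small share of the added parameters; the rest comes from the adapter modules inside the transformer layers.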


## 9. Training

In [31]:
from transformers import Trainer, TrainingArguments

# arguments shared by both setups (with or without DeepSpeed)
training_kwargs = dict(
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    do_train=do_train,
    do_eval=do_eval,
    evaluation_strategy=evaluation_strategy,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    adam_beta1=adam_beta1,
    adam_beta2=adam_beta2,
    adam_epsilon=adam_epsilon,
    num_train_epochs=num_train_epochs,
    lr_scheduler_type=lr_scheduler_type,
    warmup_ratio=warmup_ratio,
    warmup_steps=warmup_steps,
    logging_dir=logging_dir,                 # directory for storing logs
    logging_strategy=evaluation_strategy,
    logging_steps=logging_steps,             # if strategy = "steps"
    save_strategy=evaluation_strategy,       # model checkpoint saving strategy
    save_steps=logging_steps,                # if strategy = "steps"
    save_total_limit=save_total_limit,
    fp16=fp16,
    eval_steps=logging_steps,                # if strategy = "steps"
    load_best_model_at_end=load_best_model_at_end,
    metric_for_best_model=metric_for_best_model,
    greater_is_better=greater_is_better,
    disable_tqdm=disable_tqdm,
    local_rank=gpu,
)

# pass the DeepSpeed config only when DeepSpeed is enabled
if ds:
    training_kwargs["deepspeed"] = ds_config

training_args = TrainingArguments(**training_kwargs)
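
Both `warmup_ratio` and `warmup_steps` are passed above. In Hugging Face Transformers, a positive `warmup_steps` overrides `warmup_ratio`. The sketch below mirrors that rule in plain Python; `resolve_warmup_steps` is a hypothetical helper written for this note, not a library function, and the numbers are illustrative.

```python
import math

def resolve_warmup_steps(total_steps: int, warmup_ratio: float, warmup_steps: int) -> int:
    """Return the number of warmup steps that would actually be used."""
    if warmup_steps > 0:
        # an explicit step count wins over the ratio
        return warmup_steps
    return math.ceil(total_steps * warmup_ratio)

# Illustrative values only (not the ones used in this notebook).
print(resolve_warmup_steps(total_steps=1000, warmup_ratio=0.5, warmup_steps=0))
print(resolve_warmup_steps(total_steps=1000, warmup_ratio=0.5, warmup_steps=50))
```

In practice this means it is enough to set one of the two and leave the other at 0.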

Next, we need a special `data_collator`. The `data_collator` is a function responsible for taking the samples and batching them into tensors. The default collator only batches the samples; here we also want random masking. We could do the masking as a pre-processing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure the random masking is redone each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [32]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
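
To make the idea of dynamic masking concrete, here is a minimal pure-Python sketch of the BERT-style rule (select about 15% of the tokens; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). It is not the library implementation. `mask_tokens`, `MASK_ID` and `SPECIAL_IDS` are names and values assumed for illustration; the real ids depend on the tokenizer.

```python
import random

MASK_ID = 103              # assumed id of [MASK]
SPECIAL_IDS = {101, 102}   # assumed ids of [CLS] and [SEP], never masked
VOCAB_SIZE = 29794         # vocabulary size shown in the model printout

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Return (masked_ids, labels) for one sequence."""
    masked, labels = [], []
    for tok in input_ids:
        if tok in SPECIAL_IDS or rng.random() >= mlm_probability:
            masked.append(tok)
            labels.append(-100)   # -100 is ignored by the loss
            continue
        labels.append(tok)        # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_ID)                     # replace by [MASK]
        elif roll < 0.9:
            masked.append(rng.randrange(VOCAB_SIZE))   # replace by a random token
        else:
            masked.append(tok)                         # keep the original token
    return masked, labels

ids = [101, 2538, 368, 4184, 2745, 3278, 102]   # illustrative token ids
for epoch in range(2):
    # a different random state per epoch gives a different masking pattern
    print(mask_tokens(ids, rng=random.Random(epoch)))
```

Because the masking is drawn again on every pass, the model sees different masked positions for the same sentence across epochs.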

Let's define a `compute_metrics` function (accuracy). Even though it is always better to evaluate a model against a metric, we will not use it to select the best model during training, as it can cause a CUDA out-of-memory error. Instead, we will use the validation loss (a common procedure when fine-tuning an MLM on a new dataset). At the end of training, `compute_metrics` (accuracy) can be used to measure the performance of our model.

In [33]:
# metric accuracy
import numpy as np
from datasets import load_metric
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)

    # keep only the masked positions (label -100 marks tokens ignored by the loss)
    mask = labels != -100
    labels = labels[mask].tolist()
    predictions = predictions[mask].tolist()

    results = metric.compute(predictions=predictions, references=labels)
    # rename the key with the "eval_" prefix
    results["eval_accuracy"] = results.pop("accuracy")

    return results
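
The filtering on `-100` is the key step: accuracy is computed only on the masked positions. A small library-free sketch of the same computation, with a hypothetical helper `masked_accuracy` and made-up toy values, looks like this:

```python
def masked_accuracy(predictions, labels, ignore_index=-100):
    """Accuracy over the positions whose label is not ignore_index."""
    correct = total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue          # position was not masked, skip it
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0

# Toy batch: 2 sequences of 3 tokens, 3 scored positions in total.
preds = [[5, 7, 9], [2, 4, 6]]
labels = [[-100, 7, 8], [2, -100, -100]]
print(masked_accuracy(preds, labels))   # 2 of the 3 scored positions match
```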

Then we just have to pass everything to `Trainer` and begin training:

In [34]:
from transformers.trainer_callback import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"], # .shard(index=1, num_shards=90), to be used to reduce train to 1/90
    eval_dataset=lm_datasets["validation"], #.shard(index=1, num_shards=90), to be used to reduce validation to 1/90
    tokenizer=tokenizer,
    data_collator=data_collator,
#     compute_metrics=compute_metrics,
    do_save_full_model=not train_adapter, 
    do_save_adapters=train_adapter,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)],
    )    

In [None]:
# trainer.args._n_gpu = n_gpu # train on one GPU but as we use local_rank in training_args, it is not needed
trainer.train()

````
 [ 90306/155700 9:36:09 < 6:57:13, 2.61 it/s, Epoch 58/100]
Epoch	Training Loss	Validation Loss	Runtime	Samples Per Second
1	1.964100	1.868401	48.642500	163.705000
2	1.884600	1.839256	46.086900	172.782000
3	1.849100	1.821544	44.457500	179.115000
4	1.820000	1.807454	44.421500	179.260000
5	1.802400	1.812738	70.805100	112.464000
6	1.785700	1.807045	71.513400	111.350000
7	1.765500	1.807914	70.185800	113.456000
8	1.748500	1.795039	58.680800	135.700000
9	1.735000	1.801956	71.341500	111.618000
10	1.731300	1.788569	71.223500	111.803000
11	1.722000	1.793057	71.501900	111.368000
12	1.707200	1.792752	79.366800	100.332000
13	1.697600	1.778756	69.999200	113.758000
14	1.690600	1.778784	70.264500	113.329000
15	1.685000	1.774370	71.556900	111.282000
16	1.678300	1.769456	74.719500	106.572000
17	1.664600	1.766170	73.120800	108.902000
18	1.653300	1.774301	78.627900	101.274000
19	1.648000	1.772815	77.317800	102.991000
20	1.643000	1.772505	72.899100	109.233000
21	1.635100	1.763878	71.642500	111.149000
22	1.629000	1.774589	71.053400	112.071000
23	1.619600	1.774103	70.501900	112.947000
24	1.618100	1.775866	71.579600	111.247000
25	1.608600	1.784808	71.177600	111.875000
26	1.605500	1.769531	71.822100	110.871000
27	1.599500	1.773246	71.530800	111.323000
28	1.597200	1.778355	70.789500	112.489000
29	1.591000	1.771813	70.624700	112.751000
30	1.590600	1.756565	70.701000	112.629000
31	1.587000	1.757311	71.865700	110.804000
32	1.578500	1.754790	71.301100	111.681000
33	1.574900	1.757490	70.887400	112.333000
34	1.568900	1.766495	71.213400	111.819000
35	1.562000	1.757397	70.766200	112.526000
36	1.565900	1.761957	71.060400	112.060000
37	1.557800	1.753144	71.445100	111.456000
38	1.551700	1.758267	73.499100	108.341000
39	1.547500	1.761052	78.165800	101.873000
40	1.546500	1.759224	77.168200	103.190000
41	1.537100	1.765854	48.721500	163.439000
42	1.539700	1.764972	49.488800	160.905000
43	1.531600	1.762541	48.618300	163.786000
44	1.527000	1.761401	49.214100	161.803000
45	1.524700	1.772806	48.609400	163.816000
46	1.523900	1.751668	48.683200	163.568000
47	1.519000	1.764286	49.266000	161.633000
48	1.519900	1.746866	49.417400	161.138000
49	1.516500	1.757262	49.317300	161.465000
50	1.511100	1.771098	49.419500	161.131000
51	1.514400	1.752675	49.327000	161.433000
52	1.506900	1.759398	48.620300	163.779000
53	1.506700	1.774234	49.339000	161.394000
54	1.500000	1.768661	49.261600	161.647000
55	1.500400	1.754474	48.522800	164.109000
56	1.492200	1.775785	48.383900	164.579000
57	1.492800	1.750307	48.982300	162.569000
58	1.488500	1.759718	49.337400	161.399000

TrainOutput(global_step=90306, training_loss=1.6183313755799726, metrics={'train_runtime': 34570.5396, 'train_samples_per_second': 4.504, 'total_flos': 2.7131033530925056e+17, 'epoch': 58.0, 'init_mem_cpu_alloc_delta': 2080903168, 'init_mem_gpu_alloc_delta': 504964608, 'init_mem_cpu_peaked_delta': 88915968, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 137281536, 'train_mem_gpu_alloc_delta': 249311744, 'train_mem_cpu_peaked_delta': 183021568, 'train_mem_gpu_peaked_delta': 4172747264})
````
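
Training stopped at epoch 58 of 100 because of the `EarlyStoppingCallback`. The sketch below reproduces the stopping rule in plain Python on the last validation losses of the log. The value `patience=10` is an assumption (the value of `early_stopping_patience` is not shown in this section), chosen because it is consistent with the log; `early_stopping_epoch` is a hypothetical helper, not the library code.

```python
def early_stopping_epoch(val_losses, patience, first_epoch=1):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=first_epoch):
        if loss < best_loss:
            # new best validation loss: reset the counter
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Validation losses of epochs 46 to 58, copied from the log above.
tail = [1.751668, 1.764286, 1.746866, 1.757262, 1.771098, 1.752675, 1.759398,
        1.774234, 1.768661, 1.754474, 1.775785, 1.750307, 1.759718]
print(early_stopping_epoch(tail, patience=10, first_epoch=46))  # -> (58, 48)
```

The best validation loss (1.746866) was reached at epoch 48, and the ten following epochs did not improve on it.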

In [36]:
# check dtype
trainer.model.bert.embeddings.word_embeddings.weight.dtype

torch.float32

If the weights dtype is float16, use the script `zero_to_fp32.py` to get them in float32 as explained in [Getting The Model Weights Out](https://huggingface.co/transformers/main_classes/deepspeed.html?highlight=deepspeed#getting-the-model-weights-out).

In [37]:
# add the metric accuracy
# trainer.compute_metrics=compute_metrics

# calculation of the performance on the validation set
eval_results = trainer.evaluate()
eval_results

{'eval_loss': 1.7610177993774414,
 'eval_runtime': 49.1643,
 'eval_samples_per_second': 161.967,
 'epoch': 58.0,
 'eval_mem_cpu_alloc_delta': 0,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_cpu_peaked_delta': 0,
 'eval_mem_gpu_peaked_delta': 990020608}

In [38]:
import math
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
# print(f"Accuracy: {eval_results['eval_accuracy']:.2f}")

Perplexity: 5.82
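
Perplexity is the exponential of the mean negative log-likelihood of the true tokens, which is what `eval_loss` measures on the masked positions. A short sketch with a hypothetical helper `perplexity` and toy probabilities:

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-likelihood) of the probabilities given to the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Toy example: the model gives 0.5 and 0.25 to the two correct tokens.
print(perplexity([0.5, 0.25]))

# Same formula applied to the eval_loss reported above.
print(f"{math.exp(1.7610177993774414):.2f}")   # -> 5.82
```

A perplexity of 5.82 can be read as the model hesitating, on average, between roughly six equally likely tokens at each masked position.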


In [39]:
# save adapter + head
adapters_folder = 'adapters-' + task_name
path_to_save_adapter = path_to_outputs/adapters_folder
trainer.model.save_adapter(str(path_to_save_adapter), adapter_name=task_name, with_head=True)

!ls -lh {path_to_save_adapter}

total 141M
-rw-rw-r-- 1 pierre pierre 598 Jul  8 12:27 adapter_config.json
-rw-rw-r-- 1 pierre pierre 231 Jul  8 12:27 head_config.json
-rw-rw-r-- 1 pierre pierre 51M Jul  8 12:27 pytorch_adapter.bin
-rw-rw-r-- 1 pierre pierre 90M Jul  8 12:27 pytorch_model_head.bin
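
The file sizes can be roughly cross-checked against the parameter counts. The sketch below assumes float32 weights (4 bytes each) and a standard BERT MLM head layout (dense layer, LayerNorm, and a decoder matrix stored in full in the head file); both are assumptions, not something the notebook states.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2

hidden, vocab = 768, 29794   # sizes shown in the model printout

# Adapter weights: difference between the two parameter counts computed earlier.
adapter_params = 122_252_002 - 108_954_466

# Assumed head layout: dense (768 -> 768) + LayerNorm + decoder (768 -> vocab).
head_params = (hidden * hidden + hidden) + 2 * hidden + (hidden * vocab + vocab)

adapter_mib = adapter_params * BYTES_PER_FLOAT32 / MIB
head_mib = head_params * BYTES_PER_FLOAT32 / MIB
print(f"adapter: {adapter_mib:.1f} MiB")
print(f"head:    {head_mib:.1f} MiB")
```

Under these assumptions the estimates land just below the 51M and 90M listed by `ls -lh`, and they show that most of the head file is the vocabulary projection, not task-specific weights.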


In [None]:
path_to_save_adapter

Now, you can push the saved adapter + head to the [AdapterHub](https://adapterhub.ml/) (follow instructions at [Contributing to Adapter Hub](https://docs.adapterhub.ml/contributing.html)).

## 10. TensorBoard

In [40]:
#!pip install tensorboard

In [None]:
import os
PATH = os.getenv('PATH')
# replace xxxx by your username on your server (ex: paulo)
# replace yyyy by the name of the virtual environment of this notebook (ex: adapter-transformers)
%env PATH=/mnt/home/xxxx/anaconda3/envs/yyyy/bin:$PATH

In [42]:
%load_ext tensorboard
# %reload_ext tensorboard
%tensorboard --logdir {logging_dir} --bind_all

## 11. Application MLM

In [43]:
### import transformers
import pathlib
from pathlib import Path

### Model original (without lang adapter)

We use the model `neuralmind/bert-base-portuguese-cased` and its trained lang adapter in the following examples.

In [44]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_mlm = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer_mlm = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [45]:
from transformers import pipeline
nlp = pipeline("fill-mask", model=model_mlm, tokenizer=tokenizer_mlm)

Let's take one sentence from the SQuAD 1.1 pt dataset and replace the word `Deus` by the token `[MASK]`.

In [46]:
nlp("O panteísmo sustenta que [MASK] é o universo e o universo é Deus.")

[{'sequence': 'O panteísmo sustenta que Deus é o universo e o universo é Deus.',
  'score': 0.7392684817314148,
  'token': 2538,
  'token_str': 'Deus'},
 {'sequence': 'O panteísmo sustenta que deus é o universo e o universo é Deus.',
  'score': 0.042948465794324875,
  'token': 4023,
  'token_str': 'deus'},
 {'sequence': 'O panteísmo sustenta que ele é o universo e o universo é Deus.',
  'score': 0.029601380228996277,
  'token': 368,
  'token_str': 'ele'},
 {'sequence': 'O panteísmo sustenta que Cristo é o universo e o universo é Deus.',
  'score': 0.021081821992993355,
  'token': 4184,
  'token_str': 'Cristo'},
 {'sequence': 'O panteísmo sustenta que tudo é o universo e o universo é Deus.',
  'score': 0.018854131922125816,
  'token': 2745,
  'token_str': 'tudo'}]
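
The `score` values are probabilities: the pipeline applies a softmax to the logits at the `[MASK]` position and returns the most likely tokens. The following sketch shows that step on a toy vocabulary. `top_k_fill_mask` is a hypothetical helper and the logits are made up; they are not the model's actual values.

```python
import math

def top_k_fill_mask(logits, vocab, k=2):
    """Softmax over the vocabulary at the masked position, then keep the k best."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max for stability
    total = sum(exps)
    scored = [(tok, e / total) for tok, e in zip(vocab, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

vocab = ["Deus", "deus", "ele", "Cristo", "tudo"]   # toy vocabulary
logits = [4.0, 1.2, 0.8, 0.5, 0.4]                  # made-up logits
for token, score in top_k_fill_mask(logits, vocab, k=3):
    print(f"{token}: {score:.3f}")
```

Since the scores over the whole vocabulary sum to 1, a higher score for one candidate necessarily lowers the scores of the others.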

Let's now test the original model with another sentence, with `China` as the masked word.

In [47]:
nlp("O primeiro caso da COVID-19 foi descoberto em Wuhan, na [MASK].")

[{'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na China.',
  'score': 0.9124720096588135,
  'token': 3278,
  'token_str': 'China'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Índia.',
  'score': 0.034306950867176056,
  'token': 4340,
  'token_str': 'Índia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Malásia.',
  'score': 0.023240933194756508,
  'token': 17753,
  'token_str': 'Malásia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Tailândia.',
  'score': 0.013218147680163383,
  'token': 15582,
  'token_str': 'Tailândia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Inglaterra.',
  'score': 0.0027242223732173443,
  'token': 2785,
  'token_str': 'Inglaterra'}]

### Model with lang adapter

In [56]:
outputs = model_checkpoint.replace('/','-') + '_' + dataset_name + '/'  
outputs = outputs + str(task) \
+ '_lr' + str(learning_rate) \
+ '_bs' + str(batch_size) \
+ '_GAS' + str(gradient_accumulation_steps) \
+ '_eps' + str(adam_epsilon) \
+ '_epochs' + str(num_train_epochs) \
+ '_patience' + str(early_stopping_patience) \
+ '_madx2' + str(madx2) \
+ '_ds' + str(ds) \
+ '_fp16' + str(fp16) \
+ '_best' + str(load_best_model_at_end) \
+ '_metric' + str(metric_for_best_model) \
+ '_adapterconfig' + str(adapter_config_name) 

path_to_outputs = root/'models_outputs'/outputs

# Config of the lang adapter
lang_adapter_path = path_to_outputs/'adapters-mlm/'

load_lang_adapter = lang_adapter_path
lang_adapter_config = str(lang_adapter_path) + "/adapter_config.json"

In [57]:
# load the language adapter
task_mlm_load_as = 'mlm'
lang_adapter_name = model_mlm.load_adapter(
    str(load_lang_adapter),
    config=lang_adapter_config,
    load_as=task_mlm_load_as,
    with_head=True
    )

# Set the adapters to be used in every forward pass
model_mlm.set_active_adapters([lang_adapter_name])

In [58]:
from transformers import pipeline
nlp = pipeline("fill-mask", model=model_mlm, tokenizer=tokenizer_mlm)

In [59]:
nlp("O panteísmo sustenta que [MASK] é o universo e o universo é Deus.")

[{'sequence': 'O panteísmo sustenta que Deus é o universo e o universo é Deus.',
  'score': 0.9113659858703613,
  'token': 2538,
  'token_str': 'Deus'},
 {'sequence': 'O panteísmo sustenta que Cristo é o universo e o universo é Deus.',
  'score': 0.01384457852691412,
  'token': 4184,
  'token_str': 'Cristo'},
 {'sequence': 'O panteísmo sustenta que Jesus é o universo e o universo é Deus.',
  'score': 0.013115585781633854,
  'token': 3125,
  'token_str': 'Jesus'},
 {'sequence': 'O panteísmo sustenta que deus é o universo e o universo é Deus.',
  'score': 0.008912090212106705,
  'token': 4023,
  'token_str': 'deus'},
 {'sequence': 'O panteísmo sustenta que tudo é o universo e o universo é Deus.',
  'score': 0.006696512456983328,
  'token': 2745,
  'token_str': 'tudo'}]

Our fine-tuned model scored better (0.911 vs. 0.739) when predicting the masked word `Deus`. This suggests that our fine-tuning on the SQuAD 1.1 pt dataset with a lang adapter worked.

Let's now test our fine-tuned model with another sentence, with `China` as the masked word.

In [60]:
nlp("O primeiro caso da COVID-19 foi descoberto em Wuhan, na [MASK].")

[{'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na China.',
  'score': 0.8868412971496582,
  'token': 3278,
  'token_str': 'China'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Malásia.',
  'score': 0.04414068162441254,
  'token': 17753,
  'token_str': 'Malásia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Índia.',
  'score': 0.034103650599718094,
  'token': 4340,
  'token_str': 'Índia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Tailândia.',
  'score': 0.009829722344875336,
  'token': 15582,
  'token_str': 'Tailândia'},
 {'sequence': 'O primeiro caso da COVID - 19 foi descoberto em Wuhan, na Alemanha.',
  'score': 0.003259493038058281,
  'token': 2423,
  'token_str': 'Alemanha'}]

The masked word `China` was found with a high score of 0.887, but lower than the score of the original model (0.912). This was expected: by fine-tuning the original model, we specialized it to the "language" of the dataset used.

# END