
bert-base-cased predicts tokens instead of whole words after fine-tuning on fill-mask task #9678

Closed

LeandraFichtel opened this issue Jan 19, 2021 · 3 comments

LeandraFichtel commented Jan 19, 2021

Environment info

  • transformers version: 4.2.1
  • Platform: Linux-4.15.0-126-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.1+cu92 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@mfuntowicz, @sgugger

Information

Model I am using (Bert, XLNet ...): bert-base-cased

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Extract the training_data.zip. The training data is structured as explained in the BertForMaskedLM documentation.
  2. Run the code for fine-tuning to get the fine-tuned bert-base-cased (first script below).
  3. Evaluate the fine-tuned bert-base-cased with the code for evaluation (second script below).
#code for fine-tuning bert-base-cased on the fill-mask task using the files train_queries.json and train_labels.json

from transformers import BertForMaskedLM, Trainer, TrainingArguments
import json
from transformers import BertTokenizer
import torch
import shutil
import os

class MaskedDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

if __name__ == "__main__":
    #used LM
    lm_name = 'bert-base-cased'
    model_path = "bert_base_cased_finetuned"
    if os.path.exists(model_path):
        print("remove dir of model")
        shutil.rmtree(model_path)
    os.mkdir(model_path)

    #prepare training dataset
    #read datasets from path
    train_queries = json.load(open("train_queries.json", "r"))
    train_labels = json.load(open("train_labels.json", "r"))
    
    #use tokenizer to get encodings
    tokenizer = BertTokenizer.from_pretrained(lm_name)
    train_question_encodings = tokenizer(train_queries, truncation=True, padding='max_length', max_length=256)
    train_label_encodings = tokenizer(train_labels, truncation=True, padding='max_length', max_length=256)["input_ids"]
    #get final datasets for training
    train_dataset = MaskedDataset(train_question_encodings, train_label_encodings)
    
    training_args = TrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=3,              # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for the learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir=model_path+'/logs',  # directory for storing logs
        logging_steps=10,
        save_total_limit=0
    )
    
    model = BertForMaskedLM.from_pretrained(lm_name)

    trainer = Trainer(
        model=model,                 # the instantiated 🤗 Transformers model to be trained
        args=training_args,          # training arguments, defined above
        train_dataset=train_dataset  # training dataset
    )

    trainer.train()
    trainer.save_model(model_path)
#code for evaluating the fine-tuned bert-base-cased

import json
from transformers import pipeline, BertForMaskedLM
from transformers import BertTokenizer

lm_name = "bert-base-cased"

test_queries = {"Rps26p56 is a subclass of [MASK] .": "pseudogene", "[MASK] is the capital of Hammerfest .": "Hammerfest", "Cubaedomus is a [MASK] .": "taxon", "[MASK] is named after Renfrew .": "Renfrew"}

#bert-base-cased with fine-tuning on train_queries.json and train_labels.json
unmasker_finetuned = pipeline('fill-mask', tokenizer=lm_name, model=BertForMaskedLM.from_pretrained("bert_base_cased_finetuned"), device=0, top_k=5)

#bert-base-cased tokenizer
tokenizer = BertTokenizer.from_pretrained(lm_name)

for query in test_queries:
    correct_answer = test_queries[query]
    #get the answer of the [MASK]-token of bert-base-cased-finetuned
    finetuned_result = unmasker_finetuned(query)
    finetuned_all_answers = []
    for result in finetuned_result:
        finetuned_all_answers.append(result["token_str"])
    
    correct_answer_ids = tokenizer(correct_answer)["input_ids"]
    correct_answer_tokens = tokenizer.convert_ids_to_tokens(correct_answer_ids)
    correct_answer_tokens.remove("[SEP]")
    correct_answer_tokens.remove("[CLS]")

    print("query:", query)
    print("correct answer:", correct_answer)
    print("correct answer tokens:", correct_answer_tokens)
    print("-----real behavior----------")
    print("finetuned all answers:", finetuned_all_answers)
    print("finetuned first answer:", finetuned_result[0]["token_str"])
    print("-----expected behavior------")
    print("finetuned first answer:", correct_answer, "\n")

Expected behavior

The language model should predict the whole word for the [MASK] token, not just a single sub-word token. In the following, four queries were evaluated with the code for evaluation. For the first two queries, the fine-tuned language model predicts the correct tokens among its first five answers but does not combine them into one word. For the last two queries, it predicts at least the correct first token, but not all of the tokens.

My guess is that something goes wrong during training when the word for the [MASK] token is not in the vocabulary and the tokenizer therefore splits it into more than one token (see the short tokenizer check after the example outputs below).

query: Rps26p56 is a subclass of [MASK] .
correct answer: pseudogene
correct answer tokens: ['pseudo', '##gene']
-----real behavior----------
finetuned all answers: ['pseudo', 'gene', 'protein', '##gene', 'sub']
finetuned first answer: pseudo
-----expected behavior------
finetuned first answer: pseudogene 

query: [MASK] is the capital of Hammerfest .
correct answer: Hammerfest
correct answer tokens: ['Hammer', '##fest']
-----real behavior----------
finetuned all answers: ['Hammer', 'Metal', 'Hell', 'Lock', '##fest']
finetuned first answer: Hammer
-----expected behavior------
finetuned first answer: Hammerfest 

query: Cubaedomus is a [MASK] .
correct answer: taxon
correct answer tokens: ['tax', '##on']
-----real behavior----------
finetuned all answers: ['tax', 'genus', 'pseudo', 'synonym', 'is']
finetuned first answer: tax
-----expected behavior------
finetuned first answer: taxon 

query: [MASK] is named after Renfrew .
correct answer: Renfrew
correct answer tokens: ['Ren', '##f', '##rew']
-----real behavior----------
finetuned all answers: ['Ren', 'Re', 'R', 'Fe', 'Bo']
finetuned first answer: Ren
-----expected behavior------
finetuned first answer: Renfrew
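
A minimal sketch that illustrates this guess, assuming only the stock bert-base-cased tokenizer: it checks how the gold answers are tokenized and whether they exist as single vocabulary entries (none of the four expected answers above do, so a single [MASK] position can never be filled with the whole word).

#sketch: check how the gold answers are split by the bert-base-cased tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
vocab = tokenizer.get_vocab()

for answer in ["pseudogene", "Hammerfest", "taxon", "Renfrew"]:
    tokens = tokenizer.tokenize(answer)
    print(answer, "->", tokens, "| single vocab entry:", answer in vocab)
    #e.g. pseudogene -> ['pseudo', '##gene'] | single vocab entry: False
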
LeandraFichtel changed the title from "bert-base-cased predicts syllables instead of whole words after fine-tuning on fill-mask task" to "bert-base-cased predicts tokens instead of whole words after fine-tuning on fill-mask task" on Jan 19, 2021

sgugger commented Jan 19, 2021

The pipeline for masked filling can only be used to fill one token, so you should be using different code for your evaluation if you want to be able to predict more than one masked token.
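
One possible shape for such evaluation code, as a rough sketch rather than code from this thread or from the library: expand the single [MASK] into one [MASK] per sub-word token of the expected answer and read the top prediction at each masked position directly from the model. The path "bert_base_cased_finetuned" is the output of the fine-tuning script above; whether the predictions are sensible still depends on the model having been trained with a matching masking scheme.

#sketch: predict one token per [MASK] position directly with BertForMaskedLM instead of the pipeline
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert_base_cased_finetuned")
model.eval()

query = "Rps26p56 is a subclass of [MASK] ."
answer_tokens = tokenizer.tokenize("pseudogene")                      # ['pseudo', '##gene']
#give every sub-word of the expected answer its own [MASK]
query_expanded = query.replace("[MASK]", " ".join(["[MASK]"] * len(answer_tokens)))

inputs = tokenizer(query_expanded, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                                   # (1, seq_len, vocab_size)

mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))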

LeandraFichtel (Author) commented

> The pipeline for masked filling can only be used to fill one token, so you should be using different code for your evaluation if you want to be able to predict more than one masked token.

Thank you for your reply. I am not sure whether I understand you correctly. Do you mean that it is not possible to predict multi-word answers like "Los Angeles", or that it is also not possible to predict single words like "pseudogene", which are one word but are not in the vocabulary, so the tokenizer splits them into ['pseudo', '##gene']? I would only like to predict words like "pseudogene".

sgugger commented Jan 19, 2021

The pipeline itself is only coded to return one token to replace the [MASK], so it won't be able to predict two tokens to replace one [MASK]. The model is also only trained to replace each [MASK] in its sentence with one token, so it won't predict two tokens for a single [MASK].

For this task, you need to either use a different model (coded yourself, as it's not present in the library) or have your training set contain one [MASK] per token you want to mask. For instance, if you want to mask all the tokens corresponding to one word (a technique called whole-word masking), what is typically done in training scripts is to replace all parts of one word with [MASK]. For pseudogene, tokenized as pseudo, ##gene, that would mean having [MASK] [MASK] (see the sketch below).
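
A minimal sketch of how one training example could look under that whole-word masking scheme, assuming the answer word is known for each query; this is not the library's whole-word-masking data collator. The -100 value is the index the MLM loss ignores, so only the [MASK] positions are supervised.

#sketch: build one whole-word-masked training example (one [MASK] and one label per sub-word of the answer)
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

query = "Rps26p56 is a subclass of [MASK] ."
answer = "pseudogene"

answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]   # ids of 'pseudo', '##gene'
masked_text = query.replace("[MASK]", " ".join(["[MASK]"] * len(answer_ids)))

encoding = tokenizer(masked_text, truncation=True, padding="max_length", max_length=256)
labels = [-100] * len(encoding["input_ids"])                             # -100 = ignored by the loss
mask_positions = [i for i, t in enumerate(encoding["input_ids"]) if t == tokenizer.mask_token_id]
for pos, answer_id in zip(mask_positions, answer_ids):
    labels[pos] = answer_id                                              # supervise only the [MASK]s

#encoding and labels could then be wrapped in MaskedDataset and passed to the Trainer as in the script above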

Also, this is not a bug in the library, so the discussion should continue on the forum.
