bert-base-cased predicts tokens instead of whole words after fine-tuning on fill-mask task #9678
Comments
The fill-mask pipeline can only fill one token per [MASK], so you should use different code for your evaluation if you want to predict more than one masked token.
Thank you for your reply. I am not sure whether I understand you correctly. Do you mean that it is not possible to predict multi-word expressions like "Los Angeles", or that it is also not possible to predict words like "pseudogene", which are a single word but are not in the vocabulary, so the tokenizer splits them into ['pseudo', '##gene']? I would only like to predict words like "pseudogene".
The pipeline in itself is only coded to return one token to replace the [MASK], so it won't be able to predict two tokens for a single [MASK]. The model is also only trained to replace each [MASK] in a sentence with one token, so it won't predict two tokens for one [MASK] either.

For this task, you need to either use a different model (coded yourself, as it's not present in the library) or have your training set contain one [MASK] per token you want to mask. For instance, if you want to mask all the tokens corresponding to one word (a technique called whole-word masking), what is typically done in training scripts is to replace all parts of one word with [MASK]. For "pseudogene", tokenized as pseudo, ##gene, that would mean having [MASK] [MASK].

Also, this is not a bug in the library, so the discussion should continue on the forum.
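A minimal sketch of what that data preparation could look like (not from the original thread; the helper name and sample sentence are illustrative, and the ['pseudo', '##gene'] split is as reported above):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def mask_whole_word(sentence, word):
    # Tokenize only the target word to see how many sub-tokens it spans,
    # e.g. "pseudogene" -> ['pseudo', '##gene'].
    n_pieces = len(tokenizer.tokenize(word))
    # Whole-word masking: replace the word with one [MASK] per sub-token.
    return sentence.replace(word, " ".join([tokenizer.mask_token] * n_pieces))

print(mask_whole_word("This gene is a pseudogene.", "pseudogene"))
# -> "This gene is a [MASK] [MASK]."
```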
Environment info
transformers version: 4.2.1
Who can help
@mfuntowicz, @sgugger
Information
Model I am using (Bert, XLNet ...): bert-base-cased
The problem arises when using:
The task I am working on is: fill-mask (fine-tuning bert-base-cased for masked language modeling)
To reproduce
Steps to reproduce the behavior:
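The original reproduction code was not preserved; a rough sketch of the evaluation described in this report might look like this (the checkpoint path is a placeholder for the fine-tuned model, and the sample sentence is illustrative):

```python
from transformers import pipeline

# "my-finetuned-bert-base-cased" stands in for the checkpoint produced
# by the fine-tuning run on the fill-mask task.
fill_mask = pipeline("fill-mask", model="my-finetuned-bert-base-cased")

# The target word ("pseudogene") is split by the tokenizer into
# ['pseudo', '##gene'], but the pipeline fills exactly one token per
# [MASK], so it can only ever return a single sub-token here.
for prediction in fill_mask("This gene is a known [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```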
Expected behavior
The language model should predict the whole word for the [MASK] token, not just individual sub-tokens. Four queries were evaluated with the evaluation code. For the first two queries, the fine-tuned language model predicts the correct sub-tokens among its top five answers but does not combine them into the whole word. For the last two queries, it predicts at least the correct first sub-token, but not all of them.
My guess is that something went wrong during training when the word behind the [MASK] token is not in the vocabulary and the tokenizer splits it into more than one token.
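As a workaround for evaluation (not part of the original report), one can bypass the pipeline and query the model directly with one [MASK] per sub-token, reading off the top prediction at each masked position; a sketch with an illustrative sentence, predicting each position independently:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

# One [MASK] per sub-token of the target word ('pseudo', '##gene').
inputs = tokenizer("This gene is a known [MASK] [MASK].", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Top prediction at every masked position, taken independently; joint
# decoding would need beam search or iterative refilling instead.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```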