
BertGenerationTokenizer provides an unexpected value for BertGenerationModel #10045

Closed
sadakmed opened this issue Feb 6, 2021 · 1 comment · Fixed by #10070
Comments

sadakmed (Contributor) commented Feb 6, 2021

  • transformers version: 4.2.2
  • PyTorch version (GPU?): 1.7.0+cu101
  • tokenizers: @n1t0, @LysandreJik

Information

Neither BertGenerationEncoder nor BertGenerationDecoder needs token_type_ids; however, BertGenerationTokenizer provides it. This raises an error if you pass the tokenizer output directly to the model with **.

If this is intended behaviour and the user is expected to handle it themselves, I think the documentation should say so.
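
A minimal sketch of the failure, assuming transformers 4.2.2 and the checkpoint used in the BertGeneration docs:

```python
from transformers import BertGenerationEncoder, BertGenerationTokenizer

tokenizer = BertGenerationTokenizer.from_pretrained(
    "google/bert_for_seq_generation_L-24_bbc_encoder"
)
model = BertGenerationEncoder.from_pretrained(
    "google/bert_for_seq_generation_L-24_bbc_encoder"
)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
print(list(inputs.keys()))  # includes 'token_type_ids' in v4.2.2

# fails: BertGenerationEncoder.forward() does not accept token_type_ids
outputs = model(**inputs)
```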

Note: another issue with BertGenerationTokenizer is that it requires the sentencepiece module. Do you prefer that the user install it separately, or should it be included in the transformers dependencies?

@sadakmed sadakmed changed the title BertGenerationTokenizer provides unexpected value for BertGenerationModel BertGenerationTokenizer provides an unexpected value for BertGenerationModel Feb 6, 2021
LysandreJik (Member) commented:

Hi @sadakmed!

You're right, there's no need for token type IDs in this tokenizer. The workaround is to remove token_type_ids from the model input names, as is done in the DistilBERT tokenizer:

```python
model_input_names = ["input_ids", "attention_mask"]
```
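
Until that fix lands, the attribute can also be overridden on a tokenizer instance at runtime; a sketch, assuming the same tokenizer and model as above (the tokenizer's return_token_type_ids defaults to whether "token_type_ids" appears in model_input_names):

```python
# override on the instance: the tokenizer only returns the keys listed here
tokenizer.model_input_names = ["input_ids", "attention_mask"]

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
print(list(inputs.keys()))  # ['input_ids', 'attention_mask']

outputs = model(**inputs)  # no longer raises
```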

Do you want to open a PR to fix this?

Regarding the sentencepiece module: yes, it is necessary. It was previously among the transformers dependencies, but we removed it because it was causing compilation issues on some hardware. The error message should be straightforward and mention that a sentencepiece installation is required in order to use that tokenizer, so no problem there.
