
BertGenerationTokenizer provides an unexpected value for BertGenerationModel #10045

Closed
sadakmed opened this issue Feb 6, 2021 · 1 comment · Fixed by #10070
Comments

sadakmed (Contributor) commented Feb 6, 2021

  • transformers version: 4.2.2
  • PyTorch version (GPU?): 1.7.0+cu101
  • tokenizers: @n1t0, @LysandreJik

Information

Neither BertGenerationEncoder nor BertGenerationDecoder needs token_type_ids; however, BertGenerationTokenizer provides it. This raises an error if you pass the tokenizer output directly to the model with **.

If this is intended behaviour and the user is expected to handle it themselves, I think the documentation should say so.
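
A minimal sketch of the failure, assuming transformers 4.2.2 and the checkpoint used in the BertGeneration docs:

```python
from transformers import BertGenerationEncoder, BertGenerationTokenizer

tokenizer = BertGenerationTokenizer.from_pretrained(
    "google/bert_for_seq_generation_L-24_bbc_encoder"
)
model = BertGenerationEncoder.from_pretrained(
    "google/bert_for_seq_generation_L-24_bbc_encoder"
)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
print(list(inputs.keys()))  # includes 'token_type_ids' in v4.2.2

# fails: BertGenerationEncoder.forward() does not accept token_type_ids
outputs = model(**inputs)
```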

Note: another issue with BertGenerationTokenizer is that it requires the sentencepiece module. Do you prefer that the user install it separately, or should it be included in the transformers dependencies?

@sadakmed sadakmed changed the title BertGenerationTokenizer provides unexpected value for BertGenerationModel BertGenerationTokenizer provides an unexpected value for BertGenerationModel Feb 6, 2021
LysandreJik (Member) commented:

Hi @sadakmed!

You're right, there's no need for token type IDs in this tokenizer. The workaround is to remove token_type_ids from the model input names, as is done in the DistilBERT tokenizer:

```python
model_input_names = ["input_ids", "attention_mask"]
```
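
Until that fix lands, the attribute can also be overridden on a tokenizer instance at runtime; a sketch, assuming the same tokenizer and model as above (the tokenizer's return_token_type_ids defaults to whether "token_type_ids" appears in model_input_names):

```python
# override on the instance: the tokenizer only returns the keys listed here
tokenizer.model_input_names = ["input_ids", "attention_mask"]

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
print(list(inputs.keys()))  # ['input_ids', 'attention_mask']

outputs = model(**inputs)  # no longer raises
```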

Do you want to open a PR to fix this?

Regarding the sentencepiece module: yes, it is necessary. It was previously among the transformers dependencies, but we removed it because it was causing compilation issues on some hardware. The error message should be straightforward and mention that a sentencepiece installation is required in order to use that tokenizer, so no problem there.
