BERT tokenizer - set special tokens #599

Closed
adigoryl opened this issue May 10, 2019 · 3 comments

@adigoryl

Hi,

I was wondering whether the team could extend BERT so that fine-tuning with newly defined special tokens is possible, just like GPT allows.
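
For reference, this is the GPT mechanism I'm referring to, roughly as in the repository's ROCStories fine-tuning example (the token strings below are just placeholders):

```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTDoubleHeadsModel

special_tokens = ["_start_", "_delimiter_", "_classify_"]

# The tokenizer appends the new tokens to the end of its vocabulary...
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt", special_tokens=special_tokens)
special_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in special_tokens]

# ...and the model grows its input embedding matrix by num_special_tokens rows,
# whose fresh embeddings are then learned during fine-tuning.
model = OpenAIGPTDoubleHeadsModel.from_pretrained("openai-gpt",
                                                  num_special_tokens=len(special_tokens))
```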

@thomwolf Could you share your thoughts on that?

Regards,
Adrian.

@thomwolf
Member

Hi Adrian, BERT already has a few unused tokens that can be used similarly to the special_tokens of GPT/GPT-2.
For more details see google-research/bert#9 (comment) and issue #405 for instance.
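
A minimal sketch of that approach with the pytorch_pretrained_bert API; which [unusedX] entries you pick, and what they mark, is up to your task:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# "[unused0]"/"[unused1]" already exist in the vocab, so no resizing is needed; their
# embeddings are essentially untrained and only become meaningful after fine-tuning.
# Build the token list by hand so the basic tokenizer never splits the markers.
tokens = ["[CLS]", "[unused0]"] + tokenizer.tokenize("some example text") + ["[unused1]", "[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, pooled_output = model(input_ids)
```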

@AlanHassen

AlanHassen commented May 22, 2019

In case we use one of the unused special tokens from the vocabulary, is it enough to fine-tune on a classification task, or does the embedding for that token need to be trained from scratch? Has anyone already done this?

Two different but somewhat related questions I had when looking into the implementation:

  1. The BERT paper mentions a (learned) positional embedding. How is this implemented here? examples/extract_features/convert_examples_to_features() defines tokens (the representation), input_type_ids (distinguishing the first from the second sequence), and an input_mask (distinguishing padding from real tokens), but no positional embedding. Is this done internally? (See my sketch of BertEmbeddings after this list.)

  2. Can I use a special token as input_type_ids for BERT? In the classification example, only values of [0, 1] are possible, and I'm wondering what would happen if I chose a special token instead. Is this possible with the pretrained embedding, or do I need to retrain the whole embedding as a consequence?
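
My current reading of the BertEmbeddings module, as a simplified sketch (layer norm and dropout omitted, sizes are those of bert-base); please correct me if this is wrong:

```python
import torch
import torch.nn as nn

class BertEmbeddingsSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768, max_position=512, type_vocab_size=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position, hidden_size)       # learned, not sinusoidal
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)  # only rows 0 and 1

    def forward(self, input_ids, token_type_ids=None):
        # Question 1: position ids seem to be generated internally from the sequence
        # length, so convert_examples_to_features never builds them itself.
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # Question 2: token_type_ids index a 2-row table in the pretrained checkpoint,
        # so values other than 0/1 would index out of range unless that table were
        # extended and (re)trained.
        return (self.word_embeddings(input_ids)
                + self.position_embeddings(position_ids)
                + self.token_type_embeddings(token_type_ids))
```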

@stale

stale bot commented Jul 21, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 21, 2019
@stale stale bot closed this as completed Jul 28, 2019