training a new BERT tokenizer model #2210
Follow the sentencepiece GitHub or the BERT TensorFlow GitHub; you will get some feedback there.

On Wed, Dec 18, 2019 at 07:52, Younggyun Hahm wrote:

> ❓ Questions & Help
> I would like to train a new BERT model.
> Is there some way to train the BERT tokenizer (a.k.a. the WordPiece tokenizer)?
If you want to see some examples of custom tokenizer implementations in the Transformers library, you can look at how the Japanese tokenizer is implemented. In general, you can read more about adding a new model to Transformers here.
There's an example of how to train a WordPiece tokenizer: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/train_bert_wordpiece.py
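For reference, a minimal sketch of what that example script does (the corpus path and output directory below are placeholders):

```python
from tokenizers import BertWordPieceTokenizer

# Instantiate an untrained BERT-style WordPiece tokenizer
tokenizer = BertWordPieceTokenizer(lowercase=True)

# Train it on one or more plain-text files
tokenizer.train(
    files=["my_corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Write out the learned vocab.txt
tokenizer.save_model("./my_tokenizer_dir")
```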
Hi @julien-c Assume I have this: I trained a tokenizer with the HF.tokenizers library and saved it to a single file. When I am trying to load that file with transformers, an error is thrown (a sketch of the situation is below). It seems the format used by tokenizers is a single JSON file, whereas when I save a transformers tokenizer it creates a directory with [config.json, special_tokens_map.json and vocab.txt]. transformers.__version__ = '4.15.0'. Can you please give me some hints on how to fix this? Thx in advance
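The original snippets were lost from the thread; based on the description above, the setup was presumably along these lines (paths are illustrative):

```python
from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer

# Train a tokenizer with the tokenizers library and save it as a single JSON file
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["my_corpus.txt"], vocab_size=30000)
tokenizer.save("./my_local_tokenizer")

# This fails: from_pretrained() expects a directory of tokenizer files
# (vocab.txt, special_tokens_map.json, ...), not the single-file format
# produced by tokenizers' save()
tokenizer = AutoTokenizer.from_pretrained("./my_local_tokenizer")
```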
Thanks for the ping julien-c!

@tkornuta Indeed, for this kind of question the forum is the best place to ask: it also allows other users who would ask the same question as you to benefit from the answer.

Your use case is indeed very interesting! With your current code, you have a problem because the file produced by the tokenizers library's `save()` is not in the format that `from_pretrained` expects. So in particular, to instantiate a new `BertTokenizerFast` from that file you can do:

```python
from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(
    tokenizer_file="./my_local_tokenizer",
    do_lower_case=FILL_ME,
    unk_token=FILL_ME,
    sep_token=FILL_ME,
    pad_token=FILL_ME,
    cls_token=FILL_ME,
    mask_token=FILL_ME,
    tokenize_chinese_chars=FILL_ME,
    strip_accents=FILL_ME,
)
```

Note: You will have to manually carry over the same parameters (`do_lower_case`, the special tokens, etc.) that you used when you built your tokenizer, as they are not read back from the tokenizer file.

I refer you to the section "Building a tokenizer, block by block" of the course, where we explain how you can build a tokenizer from scratch with the 🤗 Tokenizers library.

Moreover, if you just want to generate a new vocabulary for a BERT tokenizer by re-training it on a new dataset, the easiest way is probably to use the `train_new_from_iterator` method.

I hope this can help you!
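For that last suggestion, the retraining could look something like this (a minimal sketch; the corpus and vocabulary size are placeholders):

```python
from transformers import AutoTokenizer

# Start from an existing BERT tokenizer to reuse its pipeline and special tokens
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Any iterator yielding texts (or batches of texts) works; a list is enough here
corpus = ["first training sentence", "second training sentence"]

# Learn a brand-new vocabulary while keeping the rest of the configuration
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=30000)
new_tokenizer.save_pretrained("./my_new_bert_tokenizer")
```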
Hi @SaulLu thanks for the answer! I managed to find the solution here:

Now I also see that I somehow missed the information at the bottom of the section on building a tokenizer that you mention, which states this as well - sorry.
Hey @SaulLu sorry for bothering you, but I am struggling with yet another problem/question. When I load the tokenizer created in HF.tokenizers, my special tokens are "gone", i.e. they are not set on the resulting tokenizer. Without them, an error is raised when I use padding. I can add them to the tokenizer from HF.transformers, e.g. with `add_special_tokens` (see the sketch below). Is there a similar method for setting special tokens on the tokenizer in HF.tokenizers that would enable me to load the tokenizer in HF.transformers? I have all the tokens in my vocabulary and tried the following:
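The snippets themselves were lost from the thread; the loading problem and the transformers-side workaround described above were presumably along these lines (using `PreTrainedTokenizerFast` and illustrative paths):

```python
from transformers import PreTrainedTokenizerFast

# Load only the tokenizer file, without declaring any special tokens
tok = PreTrainedTokenizerFast(tokenizer_file="./my_local_tokenizer")

print(tok.pad_token)  # None: the special tokens were not carried over

# Padding therefore fails with an error like
# "Asking to pad but the tokenizer does not have a padding token":
# tok(["a sentence", "another one"], padding=True)

# On the transformers side, the special tokens can be added afterwards:
tok.add_special_tokens({"pad_token": "[PAD]", "unk_token": "[UNK]"})
```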
@tkornuta, I'm sorry I missed your second question! To not miss anything, I would like to comment on several of your remarks.

Remark 1

To carry your special tokens over into your transformers tokenizer, you have to pass them explicitly when you instantiate it (as in the `BertTokenizerFast` snippet above): the special tokens recorded in the tokenizers JSON file are not automatically mapped onto the `unk_token`, `pad_token`, etc. attributes of the transformers tokenizer.
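As an illustration, the earlier snippet with the standard BERT values filled in (assuming the tokenizer was trained lowercased; adjust these to match your training setup):

```python
from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(
    tokenizer_file="./my_local_tokenizer",
    do_lower_case=True,
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]",
    tokenize_chinese_chars=True,
    strip_accents=None,
)
```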
Remark 2

Nothing prevents you from overloading the special tokens on the transformers side after loading, for example with `add_special_tokens` as you did.

Does this answer your questions?
Hi @SaulLu yeah, I was asking about this "automatic porting of special tokens". Since I set them already when training the tokenizer in HF.tokenizers, I am really not sure why they couldn't be imported automatically when loading the tokenizer in HF.transformers... Anyway, thanks, your answer is super useful. Moreover, thanks for the hint pointing to the forum! I will use it next time for sure! :)