
training a new BERT tokenizer model #2210

Closed · hahmyg opened this issue Dec 17, 2019 · 10 comments

hahmyg commented Dec 17, 2019

❓ Questions & Help

I would like to train a new BERT model.
Is there a way to train the BERT tokenizer (a.k.a. WordPiece tokenizer)?

@pohanchi

pohanchi commented Dec 18, 2019 via email

@TheEdoardo93

TheEdoardo93 commented Dec 18, 2019

If you want to see some examples of custom tokenizer implementations in the Transformers library, you can look at how the Japanese tokenizer is implemented.

In general, you can read more about adding a new model to Transformers here.

@julien-c
Member

Check out the tokenizers repo.

There's an example of how to train a WordPiece tokenizer: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/train_bert_wordpiece.py
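
For reference, a minimal sketch of what such a training script does, assuming a plain-text corpus (the file path, vocabulary size and other hyperparameters below are placeholders):

from tokenizers import BertWordPieceTokenizer

# Build an untrained BERT-style WordPiece tokenizer.
tokenizer = BertWordPieceTokenizer(lowercase=True)

# Train it on one or more raw text files ("corpus.txt" is a placeholder path).
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Save the learned vocabulary (vocab.txt) and, optionally, the full tokenizer as a single JSON file.
tokenizer.save_model("./my_tokenizer_dir")
tokenizer.save("./my_tokenizer_dir/tokenizer.json")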

@tkornuta

Hi @julien-c, the tokenizers package is great, but I found an issue when using the resulting tokenizer later with transformers.

Assume I have this:

from tokenizers import BertWordPieceTokenizer

# vocab is a token-to-id dict built elsewhere
init_tokenizer = BertWordPieceTokenizer(vocab=vocab)
init_tokenizer.save("./my_local_tokenizer")  # writes a single JSON file

When I am trying to load the file:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./my_local_tokenizer")

an error is thrown:

ValueError: Unrecognized model in .... Should have a `model_type` key in its config.json, or contain one of the following strings in its name: ...

It seems the format used by tokenizers is a single JSON file, whereas saving a transformers tokenizer creates a directory with config.json, special_tokens_map.json and vocab.txt.

transformers.__version__ = '4.15.0'
tokenizers.__version__ = '0.10.3'

Can you please give me some hints on how to fix this? Thanks in advance!

@julien-c
Member

@tkornuta It's better to open a post on the forum for this, but tagging @SaulLu for visibility.

@SaulLu
Contributor

SaulLu commented Jan 20, 2022

Thanks for the ping julien-c!

@tkornuta Indeed, for this kind of question the forum is the best place to ask: it also lets other users who have the same question benefit from the answer. ☺️

Your use case is indeed very interesting! With your current code you have a problem because AutoTokenizer has no way of knowing which tokenizer class it should use to load your tokenizer, since you chose to create a new one from scratch.

So, in particular, to instantiate a new transformers tokenizer with the BERT-like tokenizer you created with the tokenizers library, you can do:

from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(
    tokenizer_file="./my_local_tokenizer",
    do_lower_case = FILL_ME,
    unk_token = FILL_ME,
    sep_token = FILL_ME,
    pad_token = FILL_ME,
    cls_token = FILL_ME,
    mask_token = FILL_ME,
    tokenize_chinese_chars = FILL_ME,
    strip_accents = FILL_ME
)

Note: You will have to manually carry over the same parameters (unk_token, strip_accents, etc) that you used to initialize BertWordPieceTokenizer in the initialization of BertTokenizerFast.

I refer you to the section "Building a tokenizer, block by block" of the course, where we explain how you can build a tokenizer from scratch with the tokenizers library and use it to instantiate a new tokenizer with the transformers library. We even cover the example of a BERT-type tokenizer in that chapter 😃.

Moreover, if you just want to generate a new vocabulary for a BERT tokenizer by re-training it on a new dataset, the easiest way is probably to use the train_new_from_iterator method of a fast transformers tokenizer, which is explained in the section "Training a new tokenizer from an old one" of our course. 😊
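
As a rough sketch of that second route (the starting checkpoint, corpus file and vocabulary size below are placeholders):

from transformers import AutoTokenizer

# Start from an existing fast BERT tokenizer; only the vocabulary is re-learned.
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def get_training_corpus(path="my_corpus.txt", batch_size=1000):
    # Yield batches of raw text lines from a local file (placeholder path).
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

new_tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=30_000)
new_tokenizer.save_pretrained("./my_new_bert_tokenizer")

The resulting tokenizer keeps BERT's special tokens and configuration, so it can be used directly with tokenizer("some text") or reloaded later with from_pretrained.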

I hope this can help you!

@tkornuta

Hi @SaulLu thanks for the answer! I managed to find the solution here:
https://huggingface.co/docs/transformers/fast_tokenizer

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# 1st solution: load the HF.tokenizers tokenizer...
loaded_tokenizer = Tokenizer.from_file(decoder_tokenizer_path)
# ...and "wrap" it with a HF.transformers tokenizer.
tokenizer = PreTrainedTokenizerFast(tokenizer_object=loaded_tokenizer)

# 2nd solution: load directly from the tokenizer file.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=decoder_tokenizer_path)

I also see now that I somehow missed the note at the bottom of the section on building a tokenizer that you mentioned, which states the same thing. Sorry!

@tkornuta

tkornuta commented Jan 21, 2022

Hey @SaulLu, sorry for bothering you, but I'm struggling with yet another problem/question.

When I load the tokenizer created with HF.tokenizers, my special tokens are "gone", i.e.

from transformers import PreTrainedTokenizerFast

# Load from the tokenizer file
tokenizer = PreTrainedTokenizerFast(tokenizer_file=decoder_tokenizer_path)
tokenizer.pad_token  # <- this is None

Because of this, when I use padding:

encoded = tokenizer.encode(input, padding=True) # <- raises error - lack of pad_token

I can add them to the HF.transformers tokenizer, e.g. like this:

tokenizer.add_special_tokens({'pad_token': '[PAD]'}) # <- this works!

Is there a similar method for setting special tokens on a tokenizer in HF.tokenizers that will let me load the tokenizer in HF.transformers?

I have all the tokens in my vocabulary and tried the following:

# Pass as arguments to the constructor:
init_tokenizer = BertWordPieceTokenizer(vocab=vocab)
    #special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])  # <- error: wrong keyword
    #bos_token = "[CLS]", eos_token = "[SEP]", unk_token = "[UNK]", sep_token = "[SEP]",  # <- wrong keywords
    #pad_token = "[PAD]", cls_token = "[CLS]", mask_token = "[MASK]",  # <- wrong keywords

# Use the tokenizer.add_special_tokens() method:
#init_tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # <- error: must be a list
#init_tokenizer.add_special_tokens(["[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]", "[BOS]", "[EOS]"])  # <- doesn't work (encode(padding=True) still raises the error)
#init_tokenizer.add_special_tokens(['[PAD]'])  # <- doesn't work

# Set manually:
#init_tokenizer.pad_token = "[PAD]"  # <- doesn't work
init_tokenizer.pad_token_id = vocab["[PAD]"]  # <- doesn't work

Am I missing something obvious? Thanks in advance!

@SaulLu
Contributor

SaulLu commented Jan 27, 2022

@tkornuta, I'm sorry I missed your second question!

The BertWordPieceTokenizer class is just a helper class to build a tokenizers.Tokenizer object with the architecture proposed by BERT's authors. The tokenizers library is used to build tokenizers, and the transformers library wraps these tokenizers, adding useful functionality for when we wish to use them with a particular model (like identifying the padding token, the separation token, etc.).

To not miss anything, I would like to comment on several of your remarks.

Remark 1

[@tkornuta] When I am loading the tokenizer created in HF.tokenizers my special tokens are "gone", i.e.

To carry your special tokens over to your HF.transformers tokenizer, I refer you to this section of my previous answer:

[@SaulLu] So, in particular, to instantiate a new transformers tokenizer with the BERT-like tokenizer you created with the tokenizers library, you can do:

from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(
    tokenizer_file="./my_local_tokenizer",
    do_lower_case = FILL_ME,
    unk_token = FILL_ME,
    sep_token = FILL_ME,
    pad_token = FILL_ME,
    cls_token = FILL_ME,
    mask_token = FILL_ME,
    tokenize_chinese_chars = FILL_ME,
    strip_accents = FILL_ME
)

Note: You will have to manually carry over the same parameters (unk_token, strip_accents, etc) that you used to initialize BertWordPieceTokenizer in the initialization of BertTokenizerFast.

Remark 2

[@tkornuta] Is there a similar method for setting special tokens to tokenizer in HF.tokenizers that will enable me to load the tokenizer in HF.transformers?

Nothing prevents you from overloading the BertWordPieceTokenizer class in order to define the properties that interest you. On the other hand, there will be no automatic porting of the values of these new properties to the HF.transformers tokenizer's properties (you have to use the method mentioned above, or call .add_special_tokens({'pad_token': '[PAD]'}) after having instantiated your HF.transformers tokenizer).
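
For example, a minimal sketch of that second option, reusing the file path from the earlier example and assuming the standard BERT special token strings:

from transformers import PreTrainedTokenizerFast

# Wrap the tokenizers-built tokenizer; special tokens are not carried over automatically.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="./my_local_tokenizer")

# Register the special tokens on the transformers side so that padding etc. work.
tokenizer.add_special_tokens({
    "pad_token": "[PAD]",
    "unk_token": "[UNK]",
    "cls_token": "[CLS]",
    "sep_token": "[SEP]",
    "mask_token": "[MASK]",
})

assert tokenizer.pad_token == "[PAD]"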

Does this answer your questions? ☺️

@tkornuta

tkornuta commented Jan 27, 2022

Hi @SaulLu, yeah, I was asking about this "automatic porting of special tokens". Since I already set them when training the tokenizer in HF.tokenizers, I'm really not sure why they couldn't be imported automatically when loading the tokenizer in HF.transformers...

Anyway, thanks, your answer is super useful.

Moreover, thanks for the hint pointing to the forum! I will use it next time for sure! :)
