
training a new BERT tokenizer model #2210

Closed · hahmyg opened this issue Dec 17, 2019 · 10 comments

hahmyg commented Dec 17, 2019

❓ Questions & Help

I would like to train a new BERT model.
Is there a way to train the BERT tokenizer (a.k.a. WordPiece tokenizer)?

@pohanchi

pohanchi commented Dec 18, 2019 via email

@TheEdoardo93

TheEdoardo93 commented Dec 18, 2019

If you want to see some examples of custom tokenizer implementations in the Transformers library, you can look at how the Japanese tokenizer is implemented.

In general, you can read more about adding a new model to Transformers here.

@julien-c
Member

Check out the tokenizers repo.

There's an example of how to train a WordPiece tokenizer: https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/train_bert_wordpiece.py
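
For reference, a minimal sketch of what such a training script does, assuming a plain-text corpus (the file path, vocabulary size and other hyperparameters below are placeholders):

from tokenizers import BertWordPieceTokenizer

# Build an untrained BERT-style WordPiece tokenizer.
tokenizer = BertWordPieceTokenizer(lowercase=True)

# Train it on one or more raw text files ("corpus.txt" is a placeholder path).
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Save the learned vocabulary (vocab.txt) and, optionally, the full tokenizer as a single JSON file.
tokenizer.save_model("./my_tokenizer_dir")
tokenizer.save("./my_tokenizer_dir/tokenizer.json")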

@tkornuta

Hi @julien-c, the tokenizers package is great, but I found an issue when using the resulting tokenizer later with transformers.

Assume I have this:

from tokenizers import BertWordPieceTokenizer

# vocab is a token-to-id dict built elsewhere
init_tokenizer = BertWordPieceTokenizer(vocab=vocab)
init_tokenizer.save("./my_local_tokenizer")  # writes a single JSON file

When I am trying to load the file:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./my_local_tokenizer")

an error is thrown:

ValueError: Unrecognized model in .... Should have a `model_type` key in its config.json, or contain one of the following strings in its name: ...

It seems the format used by tokenizers is a single JSON file, whereas saving a transformers tokenizer creates a directory with config.json, special_tokens_map.json and vocab.txt.

transformers.__version__ = '4.15.0'
tokenizers.__version__ = '0.10.3'

Can you please give me some hints on how to fix this? Thanks in advance!

@julien-c
Member

@tkornuta It's better to open a post on the forum for this, but tagging @SaulLu for visibility.

@SaulLu
Contributor

SaulLu commented Jan 20, 2022

Thanks for the ping julien-c!

@tkornuta Indeed, for this kind of question the forum is the best place to ask: it also lets other users who have the same question benefit from the answer. ☺️

Your use case is indeed very interesting! With your current code you have a problem because AutoTokenizer has no way of knowing which tokenizer class it should use to load your tokenizer, since you chose to create a new one from scratch.

So, in particular, to instantiate a new transformers tokenizer with the BERT-like tokenizer you created with the tokenizers library, you can do:

from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(
    tokenizer_file="./my_local_tokenizer",
    do_lower_case = FILL_ME,
    unk_token = FILL_ME,
    sep_token = FILL_ME,
    pad_token = FILL_ME,
    cls_token = FILL_ME,
    mask_token = FILL_ME,
    tokenize_chinese_chars = FILL_ME,
    strip_accents = FILL_ME
)

Note: You will have to manually carry over the same parameters (unk_token, strip_accents, etc) that you used to initialize BertWordPieceTokenizer in the initialization of BertTokenizerFast.

I refer you to the section "Building a tokenizer, block by block" of the course, where we explain how you can build a tokenizer from scratch with the tokenizers library and use it to instantiate a new tokenizer with the transformers library. We even cover the example of a BERT-type tokenizer in that chapter 😃.

Moreover, if you just want to generate a new vocabulary for a BERT tokenizer by re-training it on a new dataset, the easiest way is probably to use the train_new_from_iterator method of a fast transformers tokenizer, which is explained in the section "Training a new tokenizer from an old one" of our course. 😊
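
As a rough sketch of that second route (the starting checkpoint, corpus file and vocabulary size below are placeholders):

from transformers import AutoTokenizer

# Start from an existing fast BERT tokenizer; only the vocabulary is re-learned.
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def get_training_corpus(path="my_corpus.txt", batch_size=1000):
    # Yield batches of raw text lines from a local file (placeholder path).
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

new_tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=30_000)
new_tokenizer.save_pretrained("./my_new_bert_tokenizer")

The resulting tokenizer keeps BERT's special tokens and configuration, so it can be used directly with tokenizer("some text") or reloaded later with from_pretrained.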

I hope this can help you!

@tkornuta

Hi @SaulLu thanks for the answer! I managed to find the solution here:
https://huggingface.co/docs/transformers/fast_tokenizer

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# 1st solution: load the HF.tokenizers tokenizer...
loaded_tokenizer = Tokenizer.from_file(decoder_tokenizer_path)
# ...and "wrap" it with a HF.transformers tokenizer.
tokenizer = PreTrainedTokenizerFast(tokenizer_object=loaded_tokenizer)

# 2nd solution: load directly from the tokenizer file.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=decoder_tokenizer_path)

I also see now that I somehow missed the note at the bottom of the section on building a tokenizer that you mentioned, which states the same thing. Sorry!

@tkornuta

tkornuta commented Jan 21, 2022

Hey @SaulLu, sorry for bothering you, but I'm struggling with yet another problem/question.

When I load the tokenizer created with HF.tokenizers, my special tokens are "gone", i.e.

from transformers import PreTrainedTokenizerFast

# Load from the tokenizer file
tokenizer = PreTrainedTokenizerFast(tokenizer_file=decoder_tokenizer_path)
tokenizer.pad_token  # <- this is None

Because of this, when I use padding:

encoded = tokenizer.encode(input, padding=True) # <- raises error - lack of pad_token

I can add them to the HF.transformers tokenizer, e.g. like this:

tokenizer.add_special_tokens({'pad_token': '[PAD]'}) # <- this works!

Is there a similar method for setting special tokens on a tokenizer in HF.tokenizers that will let me load the tokenizer in HF.transformers?

I have all the tokens in my vocabulary and tried the following:

# Pass as arguments to the constructor:
init_tokenizer = BertWordPieceTokenizer(vocab=vocab)
    #special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])  # <- error: wrong keyword
    #bos_token = "[CLS]", eos_token = "[SEP]", unk_token = "[UNK]", sep_token = "[SEP]",  # <- wrong keywords
    #pad_token = "[PAD]", cls_token = "[CLS]", mask_token = "[MASK]",  # <- wrong keywords

# Use the tokenizer.add_special_tokens() method:
#init_tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # <- error: must be a list
#init_tokenizer.add_special_tokens(["[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]", "[BOS]", "[EOS]"])  # <- doesn't work (encode(padding=True) still raises the error)
#init_tokenizer.add_special_tokens(['[PAD]'])  # <- doesn't work

# Set manually:
#init_tokenizer.pad_token = "[PAD]"  # <- doesn't work
init_tokenizer.pad_token_id = vocab["[PAD]"]  # <- doesn't work

Am I missing something obvious? Thanks in advance!

@SaulLu
Contributor

SaulLu commented Jan 27, 2022

@tkornuta, I'm sorry I missed your second question!

The BertWordPieceTokenizer class is just a helper class to build a tokenizers.Tokenizer object with the architecture proposed by BERT's authors. The tokenizers library is used to build tokenizers, and the transformers library wraps these tokenizers, adding useful functionality for when we wish to use them with a particular model (like identifying the padding token, the separation token, etc.).

To not miss anything, I would like to comment on several of your remarks.

Remark 1

[@tkornuta] When I am loading the tokenizer created in HF.tokenizers my special tokens are "gone", i.e.

To carry your special tokens over to your HF.transformers tokenizer, I refer you to this section of my previous answer:

[@SaulLu] So, in particular, to instantiate a new transformers tokenizer with the BERT-like tokenizer you created with the tokenizers library, you can do:

from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(
    tokenizer_file="./my_local_tokenizer",
    do_lower_case = FILL_ME,
    unk_token = FILL_ME,
    sep_token = FILL_ME,
    pad_token = FILL_ME,
    cls_token = FILL_ME,
    mask_token = FILL_ME,
    tokenize_chinese_chars = FILL_ME,
    strip_accents = FILL_ME
)

Note: You will have to manually carry over the same parameters (unk_token, strip_accents, etc) that you used to initialize BertWordPieceTokenizer in the initialization of BertTokenizerFast.

Remark 2

[@tkornuta] Is there a similar method for setting special tokens to tokenizer in HF.tokenizers that will enable me to load the tokenizer in HF.transformers?

Nothing prevents you from overloading the BertWordPieceTokenizer class in order to define the properties that interest you. On the other hand, there will be no automatic porting of the values of these new properties to the HF.transformers tokenizer's properties (you have to use the method mentioned above, or call .add_special_tokens({'pad_token': '[PAD]'}) after having instantiated your HF.transformers tokenizer).
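
For example, a minimal sketch of that second option, reusing the file path from the earlier example and assuming the standard BERT special token strings:

from transformers import PreTrainedTokenizerFast

# Wrap the tokenizers-built tokenizer; special tokens are not carried over automatically.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="./my_local_tokenizer")

# Register the special tokens on the transformers side so that padding etc. work.
tokenizer.add_special_tokens({
    "pad_token": "[PAD]",
    "unk_token": "[UNK]",
    "cls_token": "[CLS]",
    "sep_token": "[SEP]",
    "mask_token": "[MASK]",
})

assert tokenizer.pad_token == "[PAD]"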

Does this answer your questions? ☺️

@tkornuta

tkornuta commented Jan 27, 2022

Hi @SaulLu, yeah, I was asking about this "automatic porting of special tokens". Since I already set them when training the tokenizer in HF.tokenizers, I'm really not sure why they couldn't be imported automatically when loading the tokenizer in HF.transformers...

Anyway, thanks, your answer is super useful.

Moreover, thanks for the hint pointing to the forum! I will use it next time for sure! :)
