[UNK] token in v2 models #81
Same here, running the ALBERT TF-Hub module with my own dataset; I get this error:

```
/content/Albert/classifier_utils.py in convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, task_name)
/content/Albert/tokenization.py in convert_tokens_to_ids(self, tokens)
/content/Albert/tokenization.py in convert_by_vocab(vocab, items)
KeyError: '[UNK]'
```
I had a similar issue, and my problem was that I wasn't setting the `spm_model_file` flag correctly, and therefore the tokeniser was falling back to the Basic & Wordpiece tokenisers, which use `[UNK]`.
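For anyone who wants to see the failure mode directly, here is a minimal sketch, assuming you run it from a checkout of this repo with the v2 30k-clean.vocab in the working directory (the sample string is arbitrary):

```python
import tokenization  # tokenization.py from this repo

# Passing only the vocab file makes FullTokenizer fall back to the
# Basic + Wordpiece tokenizers, whose unknown token defaults to "[UNK]".
# The v2 vocab files only contain "<unk>", so the id lookup fails.
tokenizer = tokenization.FullTokenizer("30k-clean.vocab")

tokens = tokenizer.tokenize("an out-of-vocabulary word: qzvxkjw")
tokenizer.convert_tokens_to_ids(tokens)  # raises KeyError: '[UNK]'
```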
I have the same issue. Does anyone have a solution? Please help.

```
KeyError                                  Traceback (most recent call last)
1 frames
KeyError: '[UNK]'
```
If you only pass the .vocab file, the init function will fall back on the Basic and Wordpiece tokenizers, which use `[UNK]`. You need to pass the spm model name as well:

```python
tokenizer = tokenization.FullTokenizer(
    "/content/30k-clean.vocab",
    spm_model_file="/content/30k-clean.model")
```
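As a quick sanity check that the SentencePiece model is actually being picked up (same paths as above; the sample sentence is just an illustration):

```python
import tokenization  # tokenization.py from this repo

tokenizer = tokenization.FullTokenizer(
    "/content/30k-clean.vocab",
    spm_model_file="/content/30k-clean.model")

# With the spm model loaded, tokenization and id lookup go through
# SentencePiece, which represents unknown input as "<unk>" rather than
# "[UNK]", so convert_tokens_to_ids no longer raises KeyError.
tokens = tokenizer.tokenize("a sentence with an unusualwordxyz in it")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```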
Thanks! @aarmstrong78
I downloaded albert_xxl v2; in the file assets/30k-clean.vocab, the entry for the unknown token looks like:

```
<unk> 0
```

while in tokenization.py it's:

```python
class WordpieceTokenizer(object):
  """Runs WordPiece tokenziation."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
```

So I'm getting the error below. Is it OK to modify tokenization.py, or am I doing something wrong?

```
File "J:\albert\tokenization.py", line 269, in convert_tokens_to_ids
  return convert_by_vocab(self.vocab, tokens)
File "J:\albert\tokenization.py", line 211, in convert_by_vocab
  output.append(vocab[item])
KeyError: '[UNK]'
```
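You shouldn't need to modify tokenization.py; as noted above, passing spm_model_file skips the WordpieceTokenizer path entirely. To confirm what the unknown token is in your download, you can inspect the SentencePiece model directly. A small sketch using the sentencepiece package (the model path is an assumption based on your traceback):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load(r"J:\albert\30k-clean.model")

# For the v2 models this should print 0 and "<unk>", matching the first
# line of 30k-clean.vocab rather than WordpieceTokenizer's "[UNK]" default.
print(sp.unk_id())
print(sp.IdToPiece(sp.unk_id()))
```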