
[UNK] token in v2 models #81

Open
kkkppp opened this issue Nov 16, 2019 · 5 comments

kkkppp commented Nov 16, 2019

I downloaded albert_xxl v2; in the file assets/30k-clean.vocab, the entry for [UNK] looks like:

<unk> 0

while in tokenization.py it is:

class WordpieceTokenizer(object):
  """Runs WordPiece tokenization."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):

So I'm getting an error like the one below. Is it OK to modify tokenization.py, or am I doing something wrong?

input_ids = tokenizer.convert_tokens_to_ids(ntokens)

File "J:\albert\tokenization.py", line 269, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)
File "J:\albert\tokenization.py", line 211, in convert_by_vocab
output.append(vocab[item])
KeyError: '[UNK]'
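
For reference, a quick check with the sentencepiece library (assuming it is installed and the 30k-clean.model file from the same albert_xxl v2 download is on hand) confirms that the SentencePiece vocabulary's unknown token is <unk>, not [UNK]:

# Minimal sketch, assuming sentencepiece is installed and the v2
# 30k-clean.model file sits next to the .vocab file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("30k-clean.model")
print(sp.IdToPiece(0))        # -> "<unk>"
print(sp.PieceToId("<unk>"))  # -> 0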


tvinith commented Dec 10, 2019

Same here, running the ALBERT TF Hub module with my own dataset; I'm getting this error:

/content/Albert/classifier_utils.py in convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, task_name)
623 segment_ids.append(1)
624
--> 625 input_ids = tokenizer.convert_tokens_to_ids(tokens)
626
627 # The mask has 1 for real tokens and 0 for padding tokens. Only real

/content/Albert/tokenization.py in convert_tokens_to_ids(self, tokens)
266 printable_text(token)) for token in tokens]
267 else:
--> 268 return convert_by_vocab(self.vocab, tokens)
269
270 def convert_ids_to_tokens(self, ids):

/content/Albert/tokenization.py in convert_by_vocab(vocab, items)
208 output = []
209 for item in items:
--> 210 output.append(vocab[item])
211 return output
212

KeyError: '[UNK]'

andrewluchen transferred this issue from google-research/google-research on Jan 6, 2020
@aarmstrong78

I had a similar issue. My problem was that I wasn't setting the spm_model_file flag correctly, so the tokenizer was falling back to the Basic and WordPiece tokenizers, which use [UNK].
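
For context, FullTokenizer.__init__ branches roughly like this (a paraphrase of the idea, not the exact source):

def __init__(self, vocab_file, do_lower_case=True, spm_model_file=None):
  if spm_model_file:
    # SentencePiece path: unknown pieces map to "<unk>" (id 0)
    self.sp_model = spm.SentencePieceProcessor()
    self.sp_model.Load(spm_model_file)
  else:
    # Fallback path: BERT-style tokenizers whose unknown token is
    # "[UNK]", a key the v2 .vocab file does not contain
    self.vocab = load_vocab(vocab_file)
    self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)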


JKP0 commented Feb 7, 2020

I have the same issue. Does anyone have a solution? Please help.

cd ALBERT   # -> /content/ALBERT

from ALBERT import tokenization
from ALBERT import tokenization_test
tokenizer = tokenization.FullTokenizer("/content/30k-clean.vocab")
tc = tokenizer.tokenize("Hello, my dog is cute")
ec = tokenizer.convert_tokens_to_ids(tc)

Logs:

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>()
1 tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab")
2 tc=tokenizer.tokenize("Hello, my dog is cute")
----> 3 ec=tokenizer.convert_tokens_to_ids(tc)

1 frames
/content/ALBERT/tokenization.py in convert_by_vocab(vocab, items)
209 output = []
210 for item in items:
--> 211 output.append(vocab[item])
212 return output
213

KeyError: '[UNK]'

@aarmstrong78

If you only pass the .vocab file, the __init__ function will fall back on the Basic and WordPiece tokenizers, which use [UNK]. You need to pass the spm model file as well:

tokenizer = tokenization.FullTokenizer("/content/30k-clean.vocab", spm_model_file="/content/30k-clean.model")
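
With the SentencePiece model loaded, unknown pieces map to <unk> (id 0) instead of [UNK], so the conversion goes through. For example (the exact pieces shown are illustrative):

tc = tokenizer.tokenize("Hello, my dog is cute")
ec = tokenizer.convert_tokens_to_ids(tc)
print(tc)  # SentencePiece pieces, e.g. ['▁hello', ',', '▁my', ...]
print(ec)  # the corresponding integer ids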


JKP0 commented Feb 10, 2020

Thanks! @aarmstrong78
