
[UNK] token in v2 models #81

Open
kkkppp opened this issue Nov 16, 2019 · 5 comments

kkkppp commented Nov 16, 2019

I downloaded albert_xxl v2; in the file assets/30k-clean.vocab, the entry for [UNK] looks like:

<unk> 0

while in tokenization.py it is:

class WordpieceTokenizer(object):
  """Runs WordPiece tokenization."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):

So I'm getting an error like the one below. Is it OK to modify tokenization.py, or am I doing something wrong?

input_ids = tokenizer.convert_tokens_to_ids(ntokens)

File "J:\albert\tokenization.py", line 269, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)
File "J:\albert\tokenization.py", line 211, in convert_by_vocab
output.append(vocab[item])
KeyError: '[UNK]'
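
For reference, a quick check with the sentencepiece library (assuming it is installed and the 30k-clean.model file from the same albert_xxl v2 download is on hand) confirms that the SentencePiece vocabulary's unknown token is <unk>, not [UNK]:

# Minimal sketch, assuming sentencepiece is installed and the v2
# 30k-clean.model file sits next to the .vocab file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("30k-clean.model")
print(sp.IdToPiece(0))        # -> "<unk>"
print(sp.PieceToId("<unk>"))  # -> 0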


tvinith commented Dec 10, 2019

Same here, running the ALBERT TF Hub module with my own dataset; I'm getting this error:

/content/Albert/classifier_utils.py in convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, task_name)
623 segment_ids.append(1)
624
--> 625 input_ids = tokenizer.convert_tokens_to_ids(tokens)
626
627 # The mask has 1 for real tokens and 0 for padding tokens. Only real

/content/Albert/tokenization.py in convert_tokens_to_ids(self, tokens)
266 printable_text(token)) for token in tokens]
267 else:
--> 268 return convert_by_vocab(self.vocab, tokens)
269
270 def convert_ids_to_tokens(self, ids):

/content/Albert/tokenization.py in convert_by_vocab(vocab, items)
208 output = []
209 for item in items:
--> 210 output.append(vocab[item])
211 return output
212

KeyError: '[UNK]'

andrewluchen transferred this issue from google-research/google-research on Jan 6, 2020
@aarmstrong78

I had a similar issue. My problem was that I wasn't setting the spm_model_file flag correctly, so the tokenizer was falling back to the Basic and WordPiece tokenizers, which use [UNK].
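
For context, FullTokenizer.__init__ branches roughly like this (a paraphrase of the idea, not the exact source):

def __init__(self, vocab_file, do_lower_case=True, spm_model_file=None):
  if spm_model_file:
    # SentencePiece path: unknown pieces map to "<unk>" (id 0)
    self.sp_model = spm.SentencePieceProcessor()
    self.sp_model.Load(spm_model_file)
  else:
    # Fallback path: BERT-style tokenizers whose unknown token is
    # "[UNK]", a key the v2 .vocab file does not contain
    self.vocab = load_vocab(vocab_file)
    self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)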


JKP0 commented Feb 7, 2020

I have the same issue. Does anyone have a solution? Please help.

cd ALBERT   # -> /content/ALBERT

from ALBERT import tokenization
from ALBERT import tokenization_test
tokenizer = tokenization.FullTokenizer("/content/30k-clean.vocab")
tc = tokenizer.tokenize("Hello, my dog is cute")
ec = tokenizer.convert_tokens_to_ids(tc)

Logs:

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>()
1 tokenizer=tokenization.FullTokenizer("/content/30k-clean.vocab")
2 tc=tokenizer.tokenize("Hello, my dog is cute")
----> 3 ec=tokenizer.convert_tokens_to_ids(tc)

1 frames
/content/ALBERT/tokenization.py in convert_by_vocab(vocab, items)
209 output = []
210 for item in items:
--> 211 output.append(vocab[item])
212 return output
213

KeyError: '[UNK]'

@aarmstrong78

If you only pass the .vocab file, the __init__ function will fall back on the Basic and WordPiece tokenizers, which use [UNK]. You need to pass the spm model file as well:

tokenizer = tokenization.FullTokenizer("/content/30k-clean.vocab", spm_model_file="/content/30k-clean.model")
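
With the SentencePiece model loaded, unknown pieces map to <unk> (id 0) instead of [UNK], so the conversion goes through. For example (the exact pieces shown are illustrative):

tc = tokenizer.tokenize("Hello, my dog is cute")
ec = tokenizer.convert_tokens_to_ids(tc)
print(tc)  # SentencePiece pieces, e.g. ['▁hello', ',', '▁my', ...]
print(ec)  # the corresponding integer ids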


JKP0 commented Feb 10, 2020

Thanks! @aarmstrong78
