tokenizer.convert_ids_to_tokens not generating special tokens with predefined position offset #7

Closed
YebowenHu opened this issue Oct 12, 2021 · 0 comments


self.tokenizer = self.tokenizer_class.from_pretrained(self.pretrained_tokenizer_path)
special_tokens_tuple_list = [("eos_token", 128), ("unk_token", 129), ("pad_token", 130), ("bos_token", 131)]
for special_token_name, special_token_id_offset in special_tokens_tuple_list:
    if getattr(self.tokenizer, special_token_name) is None:
        setattr(self.tokenizer, special_token_name, self.tokenizer.convert_ids_to_tokens(len(self.tokenizer) - special_token_id_offset))
        self.config[special_token_name] = self.tokenizer.convert_ids_to_tokens(len(self.tokenizer) - special_token_id_offset)
        self.config[special_token_name + '_id'] = len(self.tokenizer) - special_token_id_offset

This snippet assigns each special token a default value whose ID is computed as an offset from the end of the vocabulary. Special tokens that do not exist in the pretrained tokenizer (here pad_token and bos_token) then need to be added to it. I loaded the pretrained tokenizer from transfo-xl-wt103 under ExampleInitModel and generated tokens from IDs based on the predefined offsets:

tokenizer.convert_ids_to_tokens(len(self.tokenizer) - special_token_id_offset)

The returned tokens turn out to be ordinary vocabulary words, not '<pad>' or '<bos>' tokens.

When the token name is "pad_token" or "bos_token", with offsets 130 and 131, the call returns:
'Islahul' (ID 267605), 'McShan' (ID 267604)
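
To make the arithmetic concrete, here is a minimal sketch of how the offsets map to IDs. The vocabulary size 267735 is an assumption inferred from the IDs reported above (267605 + 130), and offset_ids is a hypothetical helper written for illustration, not part of the repository.

```python
# Offsets copied from the snippet in this issue.
SPECIAL_TOKEN_OFFSETS = [
    ("eos_token", 128),
    ("unk_token", 129),
    ("pad_token", 130),
    ("bos_token", 131),
]

# Assumed vocabulary size, inferred from the reported IDs (267605 + 130).
ASSUMED_VOCAB_SIZE = 267735


def offset_ids(vocab_size, offsets=SPECIAL_TOKEN_OFFSETS):
    """Return the ID each special token would get: vocab_size - offset."""
    return {name: vocab_size - offset for name, offset in offsets}


ids = offset_ids(ASSUMED_VOCAB_SIZE)
for name, token_id in ids.items():
    print(name, token_id)
```

Under that assumption the computed IDs fall inside the existing vocabulary, so convert_ids_to_tokens returns whatever ordinary word happens to sit at that position unless the vocabulary reserves those slots for special tokens.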

May I ask how you chose the offset values for these special tokens? Is it expected that transfo-xl-wt103 does not need pad_token and bos_token, or should these special tokens be set up somewhere else?
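
For comparison, one alternative I could imagine is adding the missing tokens explicitly instead of relying on offsets. The sketch below is only an illustration under assumptions: ensure_special_tokens and ToyTokenizer are hypothetical names, the toy class mimics just the pieces used here, and the '<pad>' and '<bos>' strings are my own choice. Hugging Face tokenizers do expose an add_special_tokens method that takes such a dict, but I do not know whether this is what the repository intends.

```python
class ToyTokenizer:
    """Hypothetical stand-in that mimics only the attributes used below."""

    def __init__(self):
        self.eos_token = "<eos>"
        self.unk_token = "<unk>"
        self.pad_token = None  # missing, as reported for transfo-xl-wt103
        self.bos_token = None  # missing, as reported for transfo-xl-wt103
        self.vocab = ["<eos>", "<unk>", "McShan", "Islahul"]

    def add_special_tokens(self, special_tokens_dict):
        """Set each attribute and append unseen tokens; return the number added."""
        added = 0
        for name, token in special_tokens_dict.items():
            setattr(self, name, token)
            if token not in self.vocab:
                self.vocab.append(token)
                added += 1
        return added

    def __len__(self):
        return len(self.vocab)


def ensure_special_tokens(tokenizer, wanted):
    """Add every special token in `wanted` that the tokenizer lacks.

    `wanted` maps attribute names (e.g. "pad_token") to token strings.
    Returns the dict of tokens that were actually added.
    """
    missing = {
        name: token
        for name, token in wanted.items()
        if getattr(tokenizer, name, None) is None
    }
    if missing:
        tokenizer.add_special_tokens(missing)
    return missing


toy = ToyTokenizer()
added = ensure_special_tokens(
    toy, {"eos_token": "<eos>", "pad_token": "<pad>", "bos_token": "<bos>"}
)
print(added, len(toy))
```

With a real tokenizer and model, the new tokens are appended to the end of the vocabulary, so the model's embedding matrix would presumably need to be resized to match.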
