
Special tokens not removed by ByteLevelBPETokenizer decoder #186

Closed

tomhosking opened this issue Mar 4, 2020 · 3 comments

Comments

tomhosking commented Mar 4, 2020

The following code snippet doesn't behave as I would have expected (compare to, e.g., BertWordPieceTokenizer):

from tokenizers import ByteLevelBPETokenizer

model_slug = 'bart-large'
tokenizer = ByteLevelBPETokenizer(
    "./data/bert-vocabs/{}-vocab.json".format(model_slug),
    "./data/bert-vocabs/{}-merges.txt".format(model_slug),
)
output = tokenizer.encode(' Here is a sentence with some special tokens <unk> <mask>.')
# 0 = <s>, 2 = </s>, 1 = <pad> in the bart-large vocab
tokenizer.decode([0] + output.ids + [2] + [1]*3, skip_special_tokens=True)
>>> '<s> Here is a sentence with some special tokens <unk> <mask>.</s><pad><pad><pad>'

I would have expected all of the special tokens to be stripped by the decoder. Is that not the intended behavior?
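For comparison, a minimal sketch of the BertWordPieceTokenizer behavior referenced above (the vocab path is hypothetical; any standard bert-base-uncased vocab file would do):

from tokenizers import BertWordPieceTokenizer

# Hypothetical local path to a standard BERT vocab file.
tokenizer = BertWordPieceTokenizer("./data/bert-vocabs/bert-base-uncased-vocab.txt")
output = tokenizer.encode("Here is a sentence.")
# encode() wraps the input with [CLS]/[SEP]; because those are registered as
# special tokens by default, skip_special_tokens=True strips them on decode.
tokenizer.decode(output.ids, skip_special_tokens=True)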

tomhosking (Author) commented Mar 4, 2020

Ah, my bad - I need to add these manually:
tokenizer.add_special_tokens(["<s>", "</s>", "<pad>", "<mask>", "<unk>"])  # plus any others your model uses
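For completeness, a minimal end-to-end sketch with the specials registered, reusing the local BART vocab/merges files from the snippet above:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer("./data/bert-vocabs/bart-large-vocab.json", "./data/bert-vocabs/bart-large-merges.txt")
# Register the BART specials so the encoder and decoder treat them atomically.
tokenizer.add_special_tokens(["<s>", "</s>", "<pad>", "<mask>", "<unk>"])
output = tokenizer.encode(' Here is a sentence with some special tokens <unk> <mask>.')
# With the specials registered, skip_special_tokens=True now drops
# <s>/</s>/<pad> (and the in-text <unk>/<mask>) from the decoded string.
tokenizer.decode([0] + output.ids + [2] + [1]*3, skip_special_tokens=True)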

Are there any plans to provide some kind of wrapper around common parameters like this, as transformers does? Something similar to ByteLevelBPETokenizer.from_pretrained()?

n1t0 (Member) commented Mar 4, 2020

Yes, we plan on having a way to save everything to a single file, and then be able to load it all back again (cf. #15). This will make sharing and using tokenizers much easier.
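For anyone reading this later: single-file serialization did ship in the library. With a recent tokenizers release the flow looks roughly like this (a sketch, assuming local vocab.json/merges.txt files):

from tokenizers import Tokenizer, ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
tokenizer.add_special_tokens(["<s>", "</s>", "<pad>", "<mask>", "<unk>"])
# Serialize the whole pipeline (model, specials, pre-tokenizer, decoder) to one JSON file...
tokenizer.save("tokenizer.json")
# ...and load it back in a single call.
reloaded = Tokenizer.from_file("tokenizer.json")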

tomhosking (Author) commented Mar 5, 2020

Nice, thanks!

tomhosking closed this Mar 5, 2020