Exporting BPE file to vocab/merges files used by Huggingface tokenizers #60
@hauntsaninja first of all, thanks! Unfortunately, I don't think that's going to work. Applying this conversion and comparing the output against the reference merges file, look for example at line 347: the token `b"ress"` is reconstructed as `merges[b"ress"] == (b'r', b'ess')`. There are a total of 3644 differences between the two files, and, from my understanding, it has to do with the merge order: all of the differing entries between the two versions of the merges file are of this kind. Any suggestions?
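For context, here is one way such a file-level comparison can be made (a minimal sketch; the file names are hypothetical, and both files are assumed to hold one merge pair per line in rank order):

```python
# Compare two merges.txt-style files entry by entry (hypothetical paths).
def load_merges(path):
    with open(path, encoding="utf-8") as f:
        # Skip the "#version: ..." header line that merges.txt files carry.
        return [line.rstrip("\n") for line in f if not line.startswith("#")]

ref = load_merges("merges_reference.txt")
conv = load_merges("merges_converted.txt")

diffs = [(i, a, b) for i, (a, b) in enumerate(zip(ref, conv), start=1) if a != b]
print(len(diffs))   # the comment above reports 3644 such differences
print(diffs[:3])    # first few (line_number, reference, converted) triples
```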
I have the same motivation: Huggingface tokenization is the standard I'd like to align with.
Hi @mikcnt, do you have any findings on this problem?
I have been poking around at this more. A question for someone more knowledgeable about encodings: if I take the raw bytes of a token, can I always decode them as valid UTF-8?
I answered my own question; it's also helpfully answered in this comment in the code. TL;DR: tiktoken does not assume tokens are valid UTF-8 byte sequences. I believe HF must impose the constraint that tokens are valid UTF-8 bytes, since they assume their vocab/merges objects are representable as Python strings (someone correct me if I'm wrong). This certainly complicates translating tiktoken to HF.
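That claim about tiktoken is easy to check directly (a minimal sketch using `cl100k_base`):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Count vocab tokens whose raw bytes are not a valid UTF-8 sequence,
# e.g. tokens that end in the middle of a multi-byte character.
invalid = []
for token_bytes, rank in enc._mergeable_ranks.items():
    try:
        token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        invalid.append((rank, token_bytes))

print(len(invalid))   # nonzero: tokens are arbitrary byte strings
print(invalid[:3])    # a few (rank, bytes) examples
```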
I believe this issue is resolved, and in general the answer is "no", due to tiktoken supporting tokens as arbitrary byte arrays, not UTF-8 byte sequences.
@mikcnt Sorry for the delay; I should have tested the snippet before posting it. Here's a version that should work. All you need to do is run a partial version of the BPE algorithm, as sketched below.
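A sketch of that partial-BPE recovery, assuming `cl100k_base` (reconstructed here; it may differ in detail from the snippet originally posted): for each multi-byte token, re-run BPE over its raw bytes while allowing only merges of strictly lower rank, which leaves exactly the two pieces whose merge produced the token.

```python
import tiktoken

def bpe(mergeable_ranks, token, max_rank=None):
    # Run BPE over the raw bytes of `token`, applying only merges whose
    # rank is strictly below `max_rank`. With max_rank = rank(token), the
    # loop stops one merge short, leaving the token's two parent pieces.
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts

enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks

merges = []
for token, rank in mergeable_ranks.items():
    if len(token) == 1:
        continue  # single bytes form the base vocabulary, not merges
    pair = bpe(mergeable_ranks, token, max_rank=rank)
    assert len(pair) == 2
    merges.append(tuple(pair))
```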
@colinclement that's not correct. There is no assumption that tokens are valid UTF-8, even in HuggingFace code; it's impossible for BPE to work that way. They're using an encoding scheme to represent bytes as str, the same one from GPT-2 (see tiktoken/load.py, line 57 at commit 46287bf). That's why you see the weird "Ġ" things; that character represents spaces. tiktoken represents the bytes as bytes instead of weirdly encoded text because that's the straightforward thing to do: simple, fast, avoids confusion, avoids mishandling.
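For reference, the GPT-2 byte-to-unicode table being described works roughly like this (a self-contained reimplementation, not copied from either repo):

```python
def bytes_to_unicode():
    # Printable byte values map to themselves; the remaining byte values
    # are shifted up past 255 so that every byte gets a printable character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()
print(table[ord(" ")])  # 'Ġ' (U+0120): the space byte 0x20, shifted past 255
```

Every byte maps to exactly one printable character, so any byte string — valid UTF-8 or not — round-trips through a Python `str`.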
@hauntsaninja I suppose I'm confused then about what Python strings are, because you can only make a Python `str` out of valid Unicode code points.
See the links in my previous message to either the GPT-2 repo or tiktoken/load.py |
Thanks! While the converted version can deal with normal tokens, runs of whitespace won't get merged, which is different from tiktoken. E.g. `" " * 2` will be tokenized as `[220, 220]` instead of `[256]`. Any idea how to fix this?
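For reference, the tiktoken side of that comparison (assuming `cl100k_base`, which the token ids above suggest):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode(" "))    # [220]: a single space
print(enc.encode("  "))   # [256]: two spaces merge into one token
```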
@Jason3900 I think I've got the conversion working (just had to adapt @hauntsaninja's code slightly). Here's a link to it on HF: https://huggingface.co/Xenova/gpt-4

Example usage:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
assert tokenizer.encode('hello world') == [15339, 1917]
assert tokenizer.encode('  ') == [256]
```

(It will print a few warnings, because …) And here's the conversion code if anyone is interested.
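The exported files themselves follow from the earlier snippets; a sketch of the final step, assuming the `mergeable_ranks` and `merges` built in the partial-BPE code above and the `bytes_to_unicode()` table from earlier (a sketch, not the linked conversion script):

```python
import json

# Render raw token bytes in the GPT-2 printable representation
# expected by vocab.json / merges.txt.
byte_encoder = bytes_to_unicode()

def token_to_str(token_bytes):
    return "".join(byte_encoder[b] for b in token_bytes)

vocab = {token_to_str(token): rank for token, rank in mergeable_ranks.items()}
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

# merges.txt must list pairs in rank order; iterating mergeable_ranks
# preserves that order, so `merges` is already sorted.
with open("merges.txt", "w", encoding="utf-8") as f:
    f.write("#version: 0.2\n")
    for left, right in merges:
        f.write(f"{token_to_str(left)} {token_to_str(right)}\n")
```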
@hauntsaninja @xenova, I used your code to convert a tiktoken tokenizer to HF, but I got an error.
Do you have a tool for exporting the tokenizer BPE files for use in other BPE tokenizer libraries like Huggingface tokenizers? For example, can we convert the `encoder._mergeable_ranks` object to the standard `vocab.json` and `merges.txt` files used by the `tokenizers` library?

Edit from hauntsaninja: see #60 (comment)