
Exporting BPE file to vocab/merges files used by Huggingface tokenizers #60

Closed
colinclement opened this issue Mar 13, 2023 · 14 comments

@colinclement
colinclement commented Mar 13, 2023

Do you have a tool for exporting the tokenizer BPE files for use in other BPE tokenizer libraries like Huggingface tokenizers? For example, can we convert the encoder._mergeable_ranks object to the standard vocab.json and merges.txt files used by the tokenizers library?
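For reference, a minimal sketch of the object in question (assuming tiktoken is installed; _mergeable_ranks is a private attribute and may change between versions):

import tiktoken

enc = tiktoken.get_encoding("r50k_base")
mergeable_ranks = enc._mergeable_ranks  # dict[bytes, int]: token bytes -> rank/id
print(len(mergeable_ranks))             # vocabulary size, excluding special tokens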

Edit from hauntsaninja: see #60 (comment)

@hauntsaninja

This comment was marked as outdated.

@mikcnt

mikcnt commented Mar 20, 2023

@hauntsaninja first of all, thanks! Unfortunately, I don't think that's going to work. Applying this to r50k_base (which should be the same tokenizer underlying the GPT-2 tokenizer on HF) produces results that are (slightly) different from the ones contained in the merges.txt file provided by HuggingFace.

Look for example at line 347 of the merges.txt file provided by HF:

res s

Now, the token ress is split differently by the function you provided:

merges[b"ress"]
# (b'r', b'ess')

There are a total of 3644 differences between the two, and, from my understanding, it has to do with the merge order: all of the differing examples between the two versions of merges occur when a token is split at a different position than the original merge.

Any suggestion?

@shenfe

shenfe commented Mar 21, 2023

Same motivation here. Huggingface tokenization is the standard I'd like to align with.
Looking forward to any update. @hauntsaninja

@shenfe

shenfe commented Mar 24, 2023

Hi @mikcnt, do you have any findings on this problem?

@colinclement
Author

I have been poking around at this more. A question for someone more knowledgeable about encodings: if I take tokens from the above code snippet and try [t.decode('utf-8') for t in tokens], this fails because the tokens are bytes that are not necessarily valid UTF-8; e.g. one token is b'\xa1', which yields an "invalid start byte" error. However, when it comes to translating to HF tokenizers, even the ByteLevelBPETokenizer takes Python strings as its vocab and merges inputs. What am I missing?
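A minimal repro of that failure (the byte is the one mentioned above; any token containing bytes that don't form valid UTF-8 behaves the same way):

b"\xa1".decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte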

@colinclement
Author

I answered my own question; it's also helpfully answered in this comment in the code. TL;DR: tiktoken does not assume tokens are valid UTF-8 byte sequences. I believe HF must impose the constraint that tokens are valid UTF-8, since they assume their vocab/merges objects are representable as Python strings (someone correct me if I'm wrong). This certainly complicates translating tiktoken to HF.

@colinclement
Author

I believe this issue is resolved, and in general the answer is "no" due to tiktoken supporting tokens as arbitrary byte arrays, not utf-8 byte sequences.

@hauntsaninja
Collaborator

hauntsaninja commented Apr 7, 2023

@mikcnt Sorry for the delay / I should have tested the snippet before posting it.

Here's a version that should work. All you need to do is run a partial version of the BPE algorithm.

from typing import Optional


def bpe(mergeable_ranks: dict[bytes, int], token: bytes, max_rank: Optional[int] = None) -> list[bytes]:
    # Run BPE over `token`, but only apply merges whose rank is strictly below `max_rank`.
    parts = [bytes([b]) for b in token]
    while True:
        # Find the lowest-rank mergeable pair of adjacent parts.
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank
        # Stop when no merge applies, or the best available merge isn't allowed yet.
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        assert min_idx is not None
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts


# mergeable_ranks maps token bytes -> rank, e.g. tiktoken.get_encoding("r50k_base")._mergeable_ranks
merges = {}
for token, rank in mergeable_ranks.items():
    if len(token) == 1:
        continue  # single bytes are the base alphabet, not merges
    # Stopping BPE just before this token's own rank recovers the pair that was merged to create it.
    merges[token] = tuple(bpe(mergeable_ranks, token, max_rank=rank))
    assert len(merges[token]) == 2
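With this version, the counter-example from earlier in the thread comes out matching line 347 of HF's merges.txt (a quick check, assuming mergeable_ranks above came from r50k_base):

print(merges[b"ress"])  # should print (b'res', b's'), matching the "res s" line quoted above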

@colinclement that's not correct. There is no assumption that tokens are valid UTF-8, even in HuggingFace code. It's impossible for BPE to work that way. They're using an encoding scheme to represent bytes as str, the same one from GPT-2:
https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L9

(See also data_gym_to_mergeable_bpe_ranks in tiktoken/load.py, which decodes that same scheme.)

That's why you see the weird "Ġ" things; that character represents spaces.

tiktoken represents the bytes as bytes instead of weirdly encoded text because that's the straightforward thing to do: simple, fast, avoids confusion, avoids mishandling.
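For concreteness, here is a sketch of that byte-to-str mapping, adapted from the bytes_to_unicode function in the linked GPT-2 encoder.py (reproduced from memory; check the original file for the authoritative version):

def bytes_to_unicode() -> dict[int, str]:
    # Map every byte 0-255 to a printable unicode character.
    # Printable latin-1 bytes map to themselves; the remaining bytes are
    # shifted up past 0xFF so tokens can round-trip as ordinary Python str.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])  # 'Ġ' -- the space byte is what the "weird Ġ" encodes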

@colinclement
Author

@hauntsaninja I suppose I'm confused then about what Python strings are, because you can only make a Python str object from bytes if there is a valid encoding. If you try iterating over the GPT-2 tokens in tiktoken, many cannot be cast as Python str objects, and the Huggingface implementation does assume that the tokens themselves are Python str objects. So how can you use Huggingface with tokens that are arbitrary byte sequences? Inserting the results of your code into Huggingface tokenizers fails because it only accepts str tokens and there's no consistent way to cast them as str (unless you roll your own, is that your point?). The funny G's are still valid Unicode; you just need the second decoding step of mapping them back down to standard whitespace.

@hauntsaninja
Collaborator

See the links in my previous message to either the GPT-2 repo or tiktoken/load.py

@Jason3900

Thanks! While the converted version can deal with normal tokens, whitespace doesn't get merged, which differs from tiktoken: e.g. " " * 2 will be tokenized as [220, 220] instead of [256]. Any idea how to fix this?
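For reference, the tiktoken side of that discrepancy is easy to check (assuming cl100k_base, the GPT-4 encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("  "))  # [256] -- tiktoken merges the two spaces into one token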

@xenova

xenova commented Aug 4, 2023

@Jason3900 I think I've got the conversion working (just had to adapt @hauntsaninja's code slightly). Here's a link to it on HF: https://huggingface.co/Xenova/gpt-4

Example usage:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
assert tokenizer.encode('hello world') == [15339, 1917]
assert tokenizer.encode('  ') == [256]

(It will print a few warnings because GPT4Tokenizer isn't officially supported, so I use GPT2TokenizerFast instead.)

And here's the conversion code if anyone is interested.
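For anyone reading along, a rough sketch of what such a conversion can look like (this is not xenova's actual script, whose link is in the original comment; it combines the bpe() snippet above with the GPT-2 byte-to-str mapping, the output file names are just placeholders, and special tokens plus the pre-tokenization regex still need to be configured separately):

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks
byte_encoder = bytes_to_unicode()  # from the byte-to-str sketch above

def token_bytes_to_string(b: bytes) -> str:
    return "".join(byte_encoder[x] for x in b)

# vocab.json: encoded token string -> id
vocab = {token_bytes_to_string(token): rank for token, rank in mergeable_ranks.items()}

# merges.txt: one "left right" pair per line, in rank order
merge_lines = []
for token, rank in sorted(mergeable_ranks.items(), key=lambda kv: kv[1]):
    if len(token) == 1:
        continue  # single bytes are the base alphabet, not merges
    left, right = bpe(mergeable_ranks, token, max_rank=rank)  # bpe() from above
    merge_lines.append(f"{token_bytes_to_string(left)} {token_bytes_to_string(right)}")

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)
with open("merges.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(merge_lines) + "\n")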

@leizhao1234

@hauntsaninja @xenova, I used your code to convert a tiktoken tokenizer to HF, but I got an error:
token, rank b'\xa0\xe9\x99\xa4' 23977
parts [b'\xa0', b'\xe9', b'\x99', b'\xa4']
parts [b'\xa0', b'\xe9\x99', b'\xa4']
merged (b'\xa0', b'\xe9\x99', b'\xa4')
Traceback (most recent call last):
  File "/share/home/zl/convert/convert.py", line 9, in <module>
    convert_tiktoken(tik_token.tokenizer, "./hf_model")
  File "/share/home/zl/convert/tiktoken_to_hf/convert_utils.py", line 65, in convert_tiktoken
    vocab, merges = generate_vocab_and_merges(encoder)
  File "/share/home/zl/convert/tiktoken_to_hf/convert_utils.py", line 49, in generate_vocab_and_merges
    assert len(merged) == 2
AssertionError

@leizhao1234

This comment was marked as duplicate.
