
Exporting BPE file to vocab/merges files used by Huggingface tokenizers #60

Closed
colinclement opened this issue Mar 13, 2023 · 14 comments

@colinclement
colinclement commented Mar 13, 2023

Do you have a tool for exporting the tokenizer BPE files for use in other BPE tokenizer libraries like Huggingface tokenizers? For example, can we convert the encoder._mergeable_ranks object to the standard vocab.json and merges.txt files used by the tokenizers library?
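For reference, a minimal sketch of the object in question (assuming tiktoken is installed; _mergeable_ranks is a private attribute and may change between versions):

import tiktoken

enc = tiktoken.get_encoding("r50k_base")
mergeable_ranks = enc._mergeable_ranks  # dict[bytes, int]: token bytes -> rank/id
print(len(mergeable_ranks))             # vocabulary size, excluding special tokens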

Edit from hauntsaninja: see #60 (comment)

@hauntsaninja

This comment was marked as outdated.

@mikcnt

mikcnt commented Mar 20, 2023

@hauntsaninja first of all, thanks! Unfortunately, I don't think that's going to work. Applying this to r50k_base (which should be the same tokenizer underlying the GPT-2 tokenizer on HF) produces results that are (slightly) different from the ones contained in the merges.txt file provided by HuggingFace.

Look for example at line 347 of the merges.txt file provided by HF:

res s

Now, the token ress is split differently by the function you provided:

merges[b"ress"]
# (b'r', b'ess')

There are a total of 3644 differences between the two, and, from my understanding, it has to do with the merge order: all of the differing examples between the two versions of merges occur when a token is split at a different position than the original merge.

Any suggestion?

@shenfe

shenfe commented Mar 21, 2023

Same motivation here. Huggingface tokenization is the standard I'd like to align with.
Looking forward to any update. @hauntsaninja

@shenfe

shenfe commented Mar 24, 2023

Hi @mikcnt, do you have any findings on this problem?

@colinclement
Author

I have been poking around at this more. A question for someone more knowledgeable about encodings: if I take tokens from the above code snippet and try [t.decode('utf-8') for t in tokens], this fails because the tokens are bytes that are not necessarily valid UTF-8; e.g. one token is b'\xa1', which yields an "invalid start byte" error. However, when it comes to translating to HF tokenizers, even the ByteLevelBPETokenizer takes Python strings as its vocab and merges inputs. What am I missing?
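A minimal repro of that failure (the byte is the one mentioned above; any token containing bytes that don't form valid UTF-8 behaves the same way):

b"\xa1".decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte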

@colinclement
Author

I answered my own question; it's also helpfully answered in this comment in the code. TL;DR: tiktoken does not assume tokens are valid UTF-8 byte sequences. I believe HF must impose the constraint that tokens are valid UTF-8, since they assume their vocab/merges objects are representable as Python strings (someone correct me if I'm wrong). This certainly complicates translating tiktoken to HF.

@colinclement
Author

I believe this issue is resolved, and in general the answer is "no" due to tiktoken supporting tokens as arbitrary byte arrays, not utf-8 byte sequences.

@hauntsaninja
Collaborator

hauntsaninja commented Apr 7, 2023

@mikcnt Sorry for the delay / I should have tested the snippet before posting it.

Here's a version that should work. All you need to do is run a partial version of the BPE algorithm.

from typing import Optional


def bpe(mergeable_ranks: dict[bytes, int], token: bytes, max_rank: Optional[int] = None) -> list[bytes]:
    # Run BPE over `token`, but only apply merges whose rank is strictly below `max_rank`.
    parts = [bytes([b]) for b in token]
    while True:
        # Find the lowest-rank mergeable pair of adjacent parts.
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank
        # Stop when no merge applies, or the best available merge isn't allowed yet.
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        assert min_idx is not None
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts


# mergeable_ranks maps token bytes -> rank, e.g. tiktoken.get_encoding("r50k_base")._mergeable_ranks
merges = {}
for token, rank in mergeable_ranks.items():
    if len(token) == 1:
        continue  # single bytes are the base alphabet, not merges
    # Stopping BPE just before this token's own rank recovers the pair that was merged to create it.
    merges[token] = tuple(bpe(mergeable_ranks, token, max_rank=rank))
    assert len(merges[token]) == 2
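With this version, the counter-example from earlier in the thread comes out matching line 347 of HF's merges.txt (a quick check, assuming mergeable_ranks above came from r50k_base):

print(merges[b"ress"])  # should print (b'res', b's'), matching the "res s" line quoted above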

@colinclement that's not correct. There is no assumption that tokens are valid UTF-8, even in HuggingFace code. It's impossible for BPE to work that way. They're using an encoding scheme to represent bytes as str, the same one from GPT-2:
https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L9

(See also data_gym_to_mergeable_bpe_ranks in tiktoken/load.py, which decodes that same scheme.)

That's why you see the weird "Ġ" things; that character represents spaces.

tiktoken represents the bytes as bytes instead of weirdly encoded text because that's the straightforward thing to do: simple, fast, avoids confusion, avoids mishandling.
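For concreteness, here is a sketch of that byte-to-str mapping, adapted from the bytes_to_unicode function in the linked GPT-2 encoder.py (reproduced from memory; check the original file for the authoritative version):

def bytes_to_unicode() -> dict[int, str]:
    # Map every byte 0-255 to a printable unicode character.
    # Printable latin-1 bytes map to themselves; the remaining bytes are
    # shifted up past 0xFF so tokens can round-trip as ordinary Python str.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])  # 'Ġ' -- the space byte is what the "weird Ġ" encodes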

@colinclement
Author

@hauntsaninja I suppose I'm confused then about what Python strings are, because you can only make a Python str object from bytes if there is a valid encoding. If you try iterating over the GPT-2 tokens in tiktoken, many cannot be cast as Python str objects, and the Huggingface implementation does assume that the tokens themselves are Python str objects. So how can you use Huggingface with tokens that are arbitrary byte sequences? Inserting the results of your code into Huggingface tokenizers fails because it only accepts str tokens and there's no consistent way to cast them as str (unless you roll your own, is that your point?). The funny G's are still valid Unicode; you just need the second decoding step of mapping them back down to standard whitespace.

@hauntsaninja
Collaborator

See the links in my previous message to either the GPT-2 repo or tiktoken/load.py

@Jason3900

Thanks! While the converted version can deal with normal tokens, whitespace doesn't get merged, which differs from tiktoken: e.g. " " * 2 will be tokenized as [220, 220] instead of [256]. Any idea how to fix this?
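For reference, the tiktoken side of that discrepancy is easy to check (assuming cl100k_base, the GPT-4 encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("  "))  # [256] -- tiktoken merges the two spaces into one token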

@xenova

xenova commented Aug 4, 2023

@Jason3900 I think I've got the conversion working (just had to adapt @hauntsaninja's code slightly). Here's a link to it on HF: https://huggingface.co/Xenova/gpt-4

Example usage:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
assert tokenizer.encode('hello world') == [15339, 1917]
assert tokenizer.encode('  ') == [256]

(It will print a few warnings because GPT4Tokenizer isn't officially supported, so I use GPT2TokenizerFast instead.)

And here's the conversion code if anyone is interested.
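For anyone reading along, a rough sketch of what such a conversion can look like (this is not xenova's actual script, whose link is in the original comment; it combines the bpe() snippet above with the GPT-2 byte-to-str mapping, the output file names are just placeholders, and special tokens plus the pre-tokenization regex still need to be configured separately):

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks
byte_encoder = bytes_to_unicode()  # from the byte-to-str sketch above

def token_bytes_to_string(b: bytes) -> str:
    return "".join(byte_encoder[x] for x in b)

# vocab.json: encoded token string -> id
vocab = {token_bytes_to_string(token): rank for token, rank in mergeable_ranks.items()}

# merges.txt: one "left right" pair per line, in rank order
merge_lines = []
for token, rank in sorted(mergeable_ranks.items(), key=lambda kv: kv[1]):
    if len(token) == 1:
        continue  # single bytes are the base alphabet, not merges
    left, right = bpe(mergeable_ranks, token, max_rank=rank)  # bpe() from above
    merge_lines.append(f"{token_bytes_to_string(left)} {token_bytes_to_string(right)}")

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)
with open("merges.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(merge_lines) + "\n")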

@leizhao1234

@hauntsaninja @xenova, I used your code to convert a tiktoken tokenizer to HF, but I got an error:
token, rank b'\xa0\xe9\x99\xa4' 23977
parts [b'\xa0', b'\xe9', b'\x99', b'\xa4']
parts [b'\xa0', b'\xe9\x99', b'\xa4']
merged (b'\xa0', b'\xe9\x99', b'\xa4')
Traceback (most recent call last):
  File "/share/home/zl/convert/convert.py", line 9, in <module>
    convert_tiktoken(tik_token.tokenizer, "./hf_model")
  File "/share/home/zl/convert/tiktoken_to_hf/convert_utils.py", line 65, in convert_tiktoken
    vocab, merges = generate_vocab_and_merges(encoder)
  File "/share/home/zl/convert/tiktoken_to_hf/convert_utils.py", line 49, in generate_vocab_and_merges
    assert len(merged) == 2
AssertionError

@leizhao1234

This comment was marked as duplicate.
