Fix tiktoken wrapper #761

dakinggg · 2023-11-23T06:44:46Z

Cleans up some rough edges in the original tiktoken wrapper so that the token/id conversion functions match the expected output, and fixes the signature of tokenize.
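For illustration only, here is a minimal sketch of the round-trip contract that token/id conversion functions are generally expected to satisfy. This is not the wrapper's code: ToyTokenizer and its two-word vocabulary are hypothetical, and only the method names (tokenize, convert_tokens_to_ids, convert_ids_to_tokens) mirror the Hugging Face tokenizer interface.

```python
from typing import Dict, List, Union


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; illustrates the contract only."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.vocab = vocab
        # Reverse mapping so ids can be converted back to token strings.
        self.inverse = {idx: token for token, idx in vocab.items()}

    def tokenize(self, text: str) -> List[str]:
        # Returns token strings, not ids.
        return text.split()

    def convert_tokens_to_ids(
        self, tokens: Union[str, List[str]]
    ) -> Union[int, List[int]]:
        # A single token maps to a single id; a list maps to a list.
        if isinstance(tokens, str):
            return self.vocab[tokens]
        return [self.vocab[token] for token in tokens]

    def convert_ids_to_tokens(
        self, ids: Union[int, List[int]]
    ) -> Union[str, List[str]]:
        # A single id maps to a single token; a list maps to a list.
        if isinstance(ids, int):
            return self.inverse[ids]
        return [self.inverse[idx] for idx in ids]


tok = ToyTokenizer({"hello": 0, "world": 1})
tokens = tok.tokenize("hello world")
ids = tok.convert_tokens_to_ids(tokens)

# Round trip: ids -> tokens should reproduce the tokenize() output.
assert tok.convert_ids_to_tokens(ids) == tokens
```

The same round-trip check can be pointed at a real tokenizer object, assuming it exposes these three methods.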

rajammanabrolu · 2023-11-28T01:37:59Z

Ok this has my seal of approval! Thank you, @dakinggg !!

Here's an example of a script I ran. It compares two tokenizers against actual tiktoken:
rajammanabrolu/gpt-4-chat-fastbpe uses GPT2TokenizerFast with the Xenova gist
rajammanabrolu/gpt-4-chat-debug uses TiktokenTokenizerWrapper
I looped through all of Ultrafeedback, but here's one example that fails for the Xenova version and passes for the Foundry version:

from transformers import AutoTokenizer

from compose_rl.utils import *
import tiktoken

prompt = """<|im_start|>system
A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.<|im_end|>
<|im_start|>user
What is S & P 500? <|im_end|>
<|im_start|>assistant
S & P 500 is a company that performs investment analysis for the U.S. stock market. It tracks and analyzes 500 of the biggest companies on the U.S. stock market. <|im_end|>
<|im_start|>user
Who invented it? <|im_end|>
<|im_start|>assistant
"""

generated = """S & P 500 was created in 1974 by Donaldson, Lufkin & Jenrette. That is a financial services company, which also runs a fund called the “Dow Jones Industrial Average” which is a famous U.S. stock market index. The “Dow Jones Industrial Average” tracks the stock prices of the 30 largest U.S. companies by market value.<|im_end|>"""

tokenizers = [AutoTokenizer.from_pretrained(
    'rajammanabrolu/gpt-4-chat-fastbpe', trust_remote_code=True
), AutoTokenizer.from_pretrained(
    'rajammanabrolu/gpt-4-chat-debug', trust_remote_code=True
)]

cl100k_base = tiktoken.get_encoding("cl100k_base")
special_tokens = {
    **cl100k_base._special_tokens,
    "<|im_start|>": 100278,
    "<|im_end|>": 100279,
}
tiktokenizer = tiktoken.Encoding(
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens=special_tokens
)

tiktoken_obs = tiktokenizer.encode(prompt+generated, allowed_special=set(special_tokens.keys()))
total_length = len(tiktoken_obs)
prompt_len = len(tiktokenizer.encode(prompt, allowed_special=set(special_tokens.keys())))
generated_len = len(tiktokenizer.encode(generated, allowed_special=set(special_tokens.keys())))
gold_output = {
    'tok': tiktokenizer,
    'obs': tiktoken_obs,
    'total_length': total_length,
    'prompt_len': prompt_len,
    'generated_len': generated_len
}

outputs = []

for tokenizer in tokenizers:

    original_obs = tokenizer(prompt + generated)['input_ids']

    prompt_len = len(tokenizer.tokenize(prompt))
    generated_len = len(tokenizer.tokenize(generated))

    output = {
        'tok': tokenizer,
        'obs': original_obs,
        'total_length': len(original_obs),
        'prompt_len': prompt_len,
        'generated_len': generated_len
    }
    outputs.append(output)

for output in outputs:
    for key, val in output.items():
        print(key, val, gold_output[key])
        if key != 'tok':
            print(val == gold_output[key])
        print('--------')
    print('===========')
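When one of the 'obs' comparisons above prints False, it helps to know where the two id sequences diverge. A small sketch of such a helper follows; first_mismatch is a hypothetical name and was not part of the script I ran.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple


def first_mismatch(
    candidate: List[int], gold: List[int]
) -> Optional[Tuple[int, Optional[int], Optional[int]]]:
    """Return (index, candidate_id, gold_id) at the first difference.

    Returns None when the sequences are identical. If one sequence is
    shorter, the missing side is reported as None.
    """
    for index, (cand_id, gold_id) in enumerate(zip_longest(candidate, gold)):
        if cand_id != gold_id:
            return index, cand_id, gold_id
    return None


# Toy id lists, made up for illustration.
print(first_mismatch([1, 2, 4], [1, 2, 3]))
```

In the loop above, this could be called as first_mismatch(output['obs'], gold_output['obs']), and the ids around the reported index decoded to see which piece of text was split differently.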

irenedea

LGTM, just some minor comments!

llmfoundry/tokenizers/tiktoken.py

tests/test_tiktoken.py

dakinggg added 6 commits November 22, 2023 22:26

maybe fix it

2d1defa

precommit

b112e5e

fix tests?

75f3aa7

precommit

f697aae

remove unused var

6872e38

add some comments

44a3bc4

dakinggg requested review from rajammanabrolu and irenedea November 23, 2023 07:29

dakinggg marked this pull request as ready for review November 23, 2023 07:29

rajammanabrolu approved these changes Nov 28, 2023

Merge branch 'main' into fix-tiktoken

72f571a

irenedea approved these changes Nov 28, 2023

dakinggg added 2 commits November 28, 2023 11:09

pr comments

cf3b82b

update docstring

7bf8bb1

dakinggg merged commit 4f399bf into mosaicml:main Nov 28, 2023
10 checks passed

dakinggg deleted the fix-tiktoken branch December 11, 2023 23:44
