Replace GPT2TokenizerFast with tiktoken #43

Closed
irgolic opened this issue Apr 11, 2023 · 1 comment

@irgolic
Owner

irgolic commented Apr 11, 2023

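The tokenizer is implemented in autopr.utils.tokenizer.get_tokenizer and called at autopr/utils/repo.py:124 and autopr/repos/completions_repo.py:28. It currently uses transformers' GPT2TokenizerFast, which does not give correct token counts for OpenAI's chat models: those models use tiktoken encodings such as cl100k_base rather than the GPT-2 vocabulary, and each chat message adds overhead tokens that a plain string tokenizer does not count.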

Here's an example from OpenAI's cookbook on how to calculate token length for messages:

import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo":
        print("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")
    elif model == "gpt-4":
        print("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")
        return num_tokens_from_messages(messages, model="gpt-4-0314")
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

Our implementation should support both messages for chat completions models and simple strings for ordinary completions models (the tokenizer currently supports only simple strings).
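One possible shape for such a helper is sketched below. It is only an illustration, not AutoPR's actual code: the names `count_tokens` and `tiktoken_encode_for` are hypothetical, and the default overhead values (3 tokens per message, 1 per name, 3 for reply priming) are simply the gpt-4-0314 numbers from the cookbook snippet above. The encoder is passed in as a callable so the counting logic does not depend on a particular tokenizer library.

```python
from typing import Callable, Dict, List, Sequence, Union

Message = Dict[str, str]
Prompt = Union[str, Sequence[Message]]
Encode = Callable[[str], List[int]]


def tiktoken_encode_for(model: str) -> Encode:
    """Return tiktoken's encode function for `model` (hypothetical helper)."""
    import tiktoken  # imported lazily so count_tokens has no hard dependency

    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Same fallback as the cookbook snippet for unknown model names.
        encoding = tiktoken.get_encoding("cl100k_base")
    return encoding.encode


def count_tokens(
    prompt: Prompt,
    encode: Encode,
    tokens_per_message: int = 3,  # assumption: gpt-4-0314 value from the cookbook
    tokens_per_name: int = 1,     # assumption: gpt-4-0314 value from the cookbook
    reply_priming: int = 3,       # every reply is primed with a few tokens
) -> int:
    """Count tokens for a plain string or for a list of chat messages."""
    if isinstance(prompt, str):
        # Ordinary completions models: just the encoded length of the string.
        return len(encode(prompt))
    total = reply_priming
    for message in prompt:
        total += tokens_per_message
        for key, value in message.items():
            total += len(encode(value))
            if key == "name":
                total += tokens_per_name
    return total
```

With tiktoken installed, a call might look like `count_tokens(messages, tiktoken_encode_for("gpt-4-0314"))`. For gpt-3.5-turbo-0301 the cookbook values would be passed explicitly as `tokens_per_message=4, tokens_per_name=-1`.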

@Noezor

Noezor commented Apr 12, 2023

Very cool! Good demo, I recommend putting it in the README.

@irgolic irgolic closed this as completed Nov 5, 2023