Standard Tokenizer #25
Hi @prabha-git, I would like to disagree with using a standard tokenizer. Tokenizers are coupled with models. Visit https://tiktokenizer.vercel.app/ and see that the tokens generated for GPT-2 vs GPT-4 are different. They have different vocabularies, which differ in merges as well as vocabulary size. I understand that it is not a level playing field, but that is not something we can solve. The whole training of a model is done using tokens from a specific tokenizer. Model inference will fail or behave randomly if we provide tokens from a tokenizer that is not compatible with it. I would like to close the issue, as this is not feasible.
@kedarchandrayan We can't choose a tokenizer for any pre-trained transformer model, so I am not proposing that :) Let me clarify in detail; I hope my explanation effectively conveys my points. In our project, we use the tokenizer corresponding to the model under test, as demonstrated in the following example for Anthropic models:
Why is this tokenizer necessary for our project? The insert_needle method evaluates the context length, and the depth at which the needle is inserted, on the encoded string (using a tokenizer). Effectively, the tokenizer is used to measure length. We decode the text after the needle is inserted and send it for inference. However, this approach introduces a bias, especially noticeable in longer contexts. For instance, if we set the context size to 1M tokens, Anthropic's tokenizer might yield approximately 3M characters, whereas OpenAI's might yield around 3.2M characters from the same document. Therefore, I propose using the same tokenizer for measuring the context length and determining the depth at which to insert the needle. Additionally, for some models, like Gemini Pro, accessing their specific tokenizer isn't possible, so we must rely on publicly available tokenizers anyway.
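The length-and-depth measurement described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `insert_needle` implementation: the `WhitespaceEncoder` stand-in and the function signature are assumptions, and in practice a real model tokenizer (e.g. tiktoken's `cl100k_base` encoding, which has the same `encode`/`decode` interface) would be passed in.

```python
class WhitespaceEncoder:
    """Toy stand-in tokenizer: one token per whitespace-separated word.
    A real run would use a model tokenizer with the same encode/decode shape."""

    def encode(self, text: str) -> list[str]:
        return text.split()

    def decode(self, tokens: list[str]) -> str:
        return " ".join(tokens)


def insert_needle(encoder, haystack: str, needle: str,
                  context_length: int, depth_percent: float) -> str:
    """Trim the haystack to `context_length` tokens, insert the needle at
    `depth_percent` of that length (measured in tokens), and decode back
    to text so it can be sent to the model for inference."""
    tokens = encoder.encode(haystack)[:context_length]  # length measured in tokens
    insert_at = int(len(tokens) * depth_percent / 100)  # depth measured in tokens
    combined = tokens[:insert_at] + encoder.encode(needle) + tokens[insert_at:]
    return encoder.decode(combined)


print(insert_needle(WhitespaceEncoder(), "a b c d", "NEEDLE", 4, 50))
# → a b NEEDLE c d
```

Because the trimming and depth arithmetic both happen in token space, swapping the encoder changes which character positions a given context length and depth correspond to — which is exactly the bias being discussed.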
@prabha-git The bias you are pointing to appears when you think from a human point of view (character length). For the model, the input is always tokenized. We are testing the ability of a model to find a needle in a haystack of tokens; that's why keeping everything in terms of native tokens is more intuitive from the model's perspective. When you say that 1M tokens mean 3M characters for Anthropic's model and 3.2M characters for OpenAI's model, you are comparing them from a human perspective. For the models, these are 1M tokens; they were trained to work with tokens like these. Following are some subtle points I want to make:
I agree with the point about Gemini Pro. We will have to think of something else; that is a problem to solve. Please let me know your opinion.
Thank you, @prabha-git, for initiating this discussion. It's crucial we address this issue before we proceed with integrating models like Gemini, for which we lack a tokenizer. If I understand you correctly:
The following chart I've created illustrates the percentage difference in the number of tokens produced by different tokenizers for the same text. [chart omitted] From the chart above, we can interpret that:
Let's explore whether a 15% discrepancy is tolerable, or whether there is an alternative solution that could reduce this error to 0%.
@pavelkraleu - that's cool that you compared across different models. I think I am making a rather simple point: considering this is a testing framework for various LLMs, do we want to measure all models with the same yardstick, or with the yardstick provided by each model provider? If we are just testing one model and the intention is NOT to compare the results with other models, then it is fine to use the tokenizer provided by the model. But if we are using tokenizers provided by the model providers, comparisons like this would be misleading in a larger context window (image is from this post). I think we don't need to worry about what tokenizer a specific model is using: we measure the context length and depth using a standard tokenizer and pass the context to the LLM for inference.
@prabha-git, this subject is one of my major interests, and I will be speaking about it at PyCon SK next week, so I have many charts like this lying around. 🙂 Don't you think that if we use […]
I am new to this discussion, but if I read @pavelkraleu correctly, an inherent problem would exist with using a general tokenizer: we could overshoot, or under-place, the pin within the context window by using a tokenizer different from the one the model uses. This seems to create a couple of problems, though. Suppose a great new model, BibbityBop, is released; if BibbityBop doesn't use a tokenizer that is already implemented in NIAH, then the above scenario would happen. Wouldn't this, by extension, mean that the pin placement in the context window could never be perfectly accurate unless the exact tokenizer the model uses were used?
Exactly, @douglasdlewis2012, that's what I think will happen 🙂 However, we are not completely lost.
I'm with the majority here - I would also recommend we do not use a standard tokenizer across models. In practice we like to say "oh, Claude has 200K tokens and GPT has 128K tokens" as if they are the same thing, but each model uses a different tokenizer, so it really should be "oh, Claude has 200K tokens using tokenizer X and GPT has 128K tokens using tokenizer Y." Because we are doing length evaluation, we'll need to get as close as possible to the tokenizer used by the model (ideally the same one). Because we won't have all the tokenizers, we'll need a backup or default and adjust as we get more information.
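The backup-or-default idea could look something like this minimal sketch. The function name and the model-to-encoding map are assumptions for illustration (only the two tiktoken encoding names, `cl100k_base` and `gpt2`, are real); the point is simply that unknown models fall through to a documented default rather than failing.

```python
# Illustrative sketch: prefer a tokenizer matched to the model under test,
# fall back to a known default when none is available (e.g. Gemini Pro,
# whose tokenizer is not public). The map below is a hypothetical example.
DEFAULT_ENCODING = "cl100k_base"

MODEL_ENCODINGS = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt2": "gpt2",
}

def encoding_for_model(model_name: str) -> str:
    """Return the encoding name for a model, falling back to the default."""
    return MODEL_ENCODINGS.get(model_name, DEFAULT_ENCODING)

print(encoding_for_model("gemini-pro"))  # → cl100k_base (fallback)
```

Recording which encoding was actually used alongside each result would also make it explicit when two models were measured with different yardsticks.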
Wow, cool! Do they upload the presentations to YouTube? I will look it up :)
Sounds good. I have seen people who don't realize that these models don't use the same tokenizer. But yeah, with a standardized tokenizer we may not get the same length on the x-axis as the model's max context window. Thanks for the discussion; I will close this issue.
Thanks @prabha-git, it was great discussing this topic. I liked the articulation in terms of graphs and visual images. Please keep suggesting issues and initiating discussions 👍
Before proceeding with the implementation, I would like to reach a consensus.
To ensure a fair and consistent evaluation of different language models, I propose standardizing our tokenizer.
Currently, we use different tokenizers: cl100k_base (tiktoken) for OpenAI models and an unspecified tokenizer from Anthropic. This lack of standardization introduces bias, as tokenizers vary in how they split text into units, which affects context-length calculations.
I recommend adopting cl100k_base as our standard tokenizer due to its open-source availability. This will create a level playing field for model comparisons. The difference between tokenizers is less significant for shorter contexts but becomes more pronounced as context length increases.
Using the same tokenizer would not affect the integrity of this project, since the tokenizer is only used to measure the context length and find the depth for the needle.
Results from my testing: Anthropic uses more tokens to represent the same length of text. Code is in this colab.
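The comparison can be reproduced in spirit with a small sketch like the one below. The callable-based interface and the toy word/character "tokenizers" are illustrative stand-ins only; in practice one would pass real encoders (e.g. tiktoken's cl100k_base `encode` and a provider's tokenizer) for the same input text.

```python
def token_count_diff(text, tokenize_a, tokenize_b):
    """Percentage difference in token count of tokenizer B relative to A."""
    count_a = len(tokenize_a(text))
    count_b = len(tokenize_b(text))
    return 100.0 * (count_b - count_a) / count_a

# Toy stand-ins: word-level vs character-level "tokenizers".
words = str.split   # 4 tokens for the sample text
chars = list        # 20 tokens (one per character, spaces included)

diff = token_count_diff("needle in a haystack", words, chars)
print(f"{diff:.1f}% more tokens")  # → 400.0% more tokens
```

Running the same function over two real model tokenizers on a long document would quantify exactly the character-vs-token gap debated in this thread.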