Tokenization Count based on Model #4

Closed
MentalGear opened this issue Mar 18, 2021 · 3 comments


@MentalGear

Thank you for this library and your blog posts; I really appreciate them for learning more about prompt programming with GPT-3.
I assume that if I want to count the number of tokens in a prompt, I would run:

const { encode } = require('gpt-3-encoder') // assuming the library is installed as the gpt-3-encoder npm package

const str = 'This is an example sentence to try encoding out on!'
const encoded = encode(str)
const tokenCount = encoded.length // number of tokens in the prompt

I assume this is the tokenization algorithm used for the davinci models. Is that correct?
Do you happen to know where to find a tokenization algorithm for the other models, or a general way to predict token usage before submitting a prompt?

Thanks for your feedback!

@nickwalton
Contributor

Sorry for not replying; I don't seem to be getting notifications for this repo, which I'll look into. But yep! That should work well for counting tokens before submitting a prompt.
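
For completeness, a minimal runnable sketch of this approach. It assumes the library is installed as the gpt-3-encoder npm package and uses an illustrative token budget; both are assumptions, not something stated in this thread.

const { encode } = require('gpt-3-encoder') // assumed package name

// Illustrative budget only; check the actual context window of your target model.
const MAX_PROMPT_TOKENS = 2048

// Returns true when the prompt fits within the assumed token budget.
function fitsBudget (prompt) {
  return encode(prompt).length <= MAX_PROMPT_TOKENS
}

console.log(fitsBudget('This is an example sentence to try encoding out on!')) // true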

@andreasciamanna

@nickwalton

I've found this closed issue while looking for the same thing.

I need help understanding the provided answer, though.

Does it work with other models? If so, how?
I would expect there to be a way to specify which model I want to tokenise for.

For instance, Python's tiktoken has tiktoken.encoding_for_model("gpt-4"), as well as a more generic tiktoken.get_encoding("cl100k_base").
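
For anyone landing here, a minimal JavaScript sketch of model-aware tokenisation, assuming the community js-tiktoken port of tiktoken; that package and its API are assumptions, not part of this library.

const { encodingForModel, getEncoding } = require('js-tiktoken')

// Model-specific encoder, mirroring Python's tiktoken.encoding_for_model("gpt-4")
const enc = encodingForModel('gpt-4')
console.log(enc.encode('This is an example sentence to try encoding out on!').length)

// Or select an encoding directly, mirroring tiktoken.get_encoding("cl100k_base")
const cl100k = getEncoding('cl100k_base')
console.log(cl100k.encode('hello world').length)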

@windmemory

+1
I would like to know how to encode for a specific model; it seems like the current algorithm is for r50k_base.

Shall we reopen this issue?
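
To illustrate why the encoding choice matters, here is a small sketch comparing token counts across encodings, again assuming the js-tiktoken package (an assumption, not part of this library):

const { getEncoding } = require('js-tiktoken')

const prompt = 'This is an example sentence to try encoding out on!'

// r50k_base matches the original GPT-3 (davinci) models, which this library
// appears to implement; cl100k_base is used by gpt-3.5-turbo and gpt-4.
// The same prompt can produce different token counts under each encoding.
for (const name of ['r50k_base', 'cl100k_base']) {
  const enc = getEncoding(name)
  console.log(name, enc.encode(prompt).length)
}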
