Tokenization Count based on Model #4

Closed
MentalGear opened this issue Mar 18, 2021 · 3 comments


@MentalGear

Thank you for this library and your blog posts; I really appreciate them for learning more about prompt programming with GPT-3.
I assume that if I want to count the number of tokens in a prompt, I would run:

const { encode } = require('gpt-3-encoder') // assuming the library is installed as the gpt-3-encoder npm package

const str = 'This is an example sentence to try encoding out on!'
const encoded = encode(str)
const tokenCount = encoded.length // number of tokens in the prompt

I assume this is the tokenization algorithm used for the davinci models. Is that correct?
Do you happen to know where to find a tokenization algorithm for the other models, or a general way to predict token usage before submitting a prompt?

Thanks for your feedback!

@nickwalton
Contributor

Sorry for not replying; I don't seem to be getting notifications for this repo, which I'll look into. But yep! That should work well for counting tokens before submitting a prompt.
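
For completeness, a minimal runnable sketch of this approach. It assumes the library is installed as the gpt-3-encoder npm package and uses an illustrative token budget; both are assumptions, not something stated in this thread.

const { encode } = require('gpt-3-encoder') // assumed package name

// Illustrative budget only; check the actual context window of your target model.
const MAX_PROMPT_TOKENS = 2048

// Returns true when the prompt fits within the assumed token budget.
function fitsBudget (prompt) {
  return encode(prompt).length <= MAX_PROMPT_TOKENS
}

console.log(fitsBudget('This is an example sentence to try encoding out on!')) // true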

@andreasciamanna

@nickwalton

I've found this closed issue while looking for the same thing.

I need help understanding the provided answer, though.

Does it work with other models? If so, how?
I would expect there to be a way to specify which model I want to tokenise for.

For instance, Python's tiktoken has tiktoken.encoding_for_model("gpt-4"), as well as a more generic tiktoken.get_encoding("cl100k_base").
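
For anyone landing here, a minimal JavaScript sketch of model-aware tokenisation, assuming the community js-tiktoken port of tiktoken; that package and its API are assumptions, not part of this library.

const { encodingForModel, getEncoding } = require('js-tiktoken')

// Model-specific encoder, mirroring Python's tiktoken.encoding_for_model("gpt-4")
const enc = encodingForModel('gpt-4')
console.log(enc.encode('This is an example sentence to try encoding out on!').length)

// Or select an encoding directly, mirroring tiktoken.get_encoding("cl100k_base")
const cl100k = getEncoding('cl100k_base')
console.log(cl100k.encode('hello world').length)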

@windmemory

+1
I would like to know how to encode for a specific model; it seems like the current algorithm is for r50k_base.

Shall we reopen this issue?
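
To illustrate why the encoding choice matters, here is a small sketch comparing token counts across encodings, again assuming the js-tiktoken package (an assumption, not part of this library):

const { getEncoding } = require('js-tiktoken')

const prompt = 'This is an example sentence to try encoding out on!'

// r50k_base matches the original GPT-3 (davinci) models, which this library
// appears to implement; cl100k_base is used by gpt-3.5-turbo and gpt-4.
// The same prompt can produce different token counts under each encoding.
for (const name of ['r50k_base', 'cl100k_base']) {
  const enc = getEncoding(name)
  console.log(name, enc.encode(prompt).length)
}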
