Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to add extra tokens in tiktoken? #9

Closed
Stanislas0 opened this issue Dec 17, 2022 · 5 comments
Closed

How to add extra tokens in tiktoken? #9

Stanislas0 opened this issue Dec 17, 2022 · 5 comments

Comments

@Stanislas0
Copy link

Great work! I want to know how can we add customizable extra tokens to tiktoken. Thank you!

@hauntsaninja
Copy link
Collaborator

You can create your own Encoding object by passing in a dict from token-bytes to the integer token value.
See this:

class Encoding:

You can see examples of the arguments passed to the constructor here:

@hauntsaninja
Copy link
Collaborator

If you want to have your own encoding registered so that tiktoken.get_encoding(...) works, you can create a new submodule under the tiktoken_ext namespace. This ability to extend tiktoken's registry is currently undocumented. See

# tiktoken_ext is a namespace package

@Stanislas0
Copy link
Author

Thank you so much for the swift reply! I'll have a try.

@Ontopic
Copy link

Ontopic commented Dec 17, 2022

Sorry for "hijacking", this isssue, but feel it is related enough, instead of opening a separate issue.

When looking at https://huggingface.co/docs/tokenizers/api/added-tokens, what would be the best way, to (as much as possible) stay in-sync with the original tokenizer configs), when adding special tokens in tiktoken?

As to match, rstrip, lstrip and normalized, for example. Understandable if not all is possible (yet) in tiktoken, but any further recommendations, as to what is possible currently and how, move current GPT2 based encoders, with added special tokens, to tiktoken, would be appreciated!

@hauntsaninja
Copy link
Collaborator

I've added more documentation for this over here: https://github.com/openai/tiktoken#extending-tiktoken

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants