
Add support for querying vocabulary from tokenizer #22

Merged: 2 commits, Dec 19, 2023

Conversation

@Ubospica (Contributor) commented Dec 13, 2023

This PR adds the following methods to the Tokenizer class to support querying the vocabulary from the tokenizer. This enables downstream uses such as stop-string checking, grammar checking, etc.

  /*!
   * \brief Returns the vocabulary size, including special tokens.
   */
  virtual size_t GetVocabSize() = 0;
  /*!
   * \brief Convert the given id to its corresponding token if it exists. If not, return an
   * empty string.
   */
  virtual std::string IdToToken(int32_t token_id) = 0;
  /*!
   * \brief Convert the given token to its corresponding id if it exists. If not, return -1.
   */
  virtual int32_t TokenToId(const std::string& token) = 0;

Tokenizer build (load) time:

SentencePiece: 5 ms
Huggingface:   30 ms
RWKVWorld:     113 ms

@@ -11,7 +11,7 @@ cd ..
 mkdir -p dist
 cd dist
 if [ ! -f "tokenizer.model" ]; then
-  wget https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer.model
+  wget https://huggingface.co/lmsys/vicuna-7b-v1.5/resolve/main/tokenizer.model
@Ubospica (Author) commented:

wget cannot download decapoda-research/llama-7b-hf without logging in, but vicuna-7b-v1.5 has no such restriction.

@tqchen (Contributor) commented Dec 13, 2023

Let us directly call id_to_token; see the related APIs.

This would avoid the post-processing done by the decode pipeline.

@tqchen commented Dec 13, 2023

For the rust binding, we can store the result string in the wrapper and reuse https://github.com/mlc-ai/tokenizers-cpp/blob/main/include/tokenizers_c.h#L31

@tqchen commented Dec 13, 2023

std::string IdToToken(int32_t token_id);

@Ubospica (Author) commented:

cc @tqchen

@Ubospica (Author) commented:

cc @tqchen

@tqchen tqchen merged commit 27dbe17 into mlc-ai:main Dec 19, 2023