
Add support for querying vocabulary from tokenizer #22

Merged: 2 commits, Dec 19, 2023

Conversation

@Ubospica (Contributor) commented Dec 13, 2023

This PR adds the following methods to the Tokenizer class to support querying the vocabulary from the tokenizer. This enables downstream uses such as stop-string checking, grammar checking, etc.

  /*!
   * \brief Returns the vocabulary size, including special tokens.
   */
  virtual size_t GetVocabSize() = 0;
  /*!
   * \brief Convert the given id to its corresponding token if it exists. If not, return an
   * empty string.
   */
  virtual std::string IdToToken(int32_t token_id) = 0;
  /*!
   * \brief Convert the given token to its corresponding id if it exists. If not, return -1.
   */
  virtual int32_t TokenToId(const std::string& token) = 0;

Tokenizer build (load) time:

SentencePiece: 5 ms
Huggingface:   30 ms
RWKVWorld:     113 ms

@@ -11,7 +11,7 @@ cd ..
 mkdir -p dist
 cd dist
 if [ ! -f "tokenizer.model" ]; then
-  wget https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer.model
+  wget https://huggingface.co/lmsys/vicuna-7b-v1.5/resolve/main/tokenizer.model
@Ubospica (Author) commented:

wget cannot download decapoda-research/llama-7b-hf without logging in, but vicuna-7b-v1.5 has no such restriction.

@tqchen (Contributor) commented Dec 13, 2023

Let us directly call id_to_token; see the related APIs.

This would avoid the post-processing done by the decode pipeline.

@tqchen commented Dec 13, 2023

For the rust binding, we can store the result string in the wrapper and reuse https://github.com/mlc-ai/tokenizers-cpp/blob/main/include/tokenizers_c.h#L31

@tqchen commented Dec 13, 2023

std::string IdToToken(int32_t token_id);

@Ubospica (Author) commented:

cc @tqchen

@Ubospica (Author) commented:

cc @tqchen

@tqchen tqchen merged commit 27dbe17 into mlc-ai:main Dec 19, 2023