Add support for querying vocabulary from tokenizer #22
Conversation
Force-pushed from 7abe788 to 89b15c2
@@ -11,7 +11,7 @@ cd ..
 mkdir -p dist
 cd dist
 if [ ! -f "tokenizer.model" ]; then
-wget https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer.model
+wget https://huggingface.co/lmsys/vicuna-7b-v1.5/resolve/main/tokenizer.model
wget cannot download decapoda-research/llama-7b-hf without logging in, but vicuna-7b-v1.5 does not have this restriction.
Let us call id_to_token directly (see the related APIs). This would avoid the post-processing done by the decode pipeline. For the Rust binding, we can store the result string in the wrapper and reuse https://github.com/mlc-ai/tokenizers-cpp/blob/main/include/tokenizers_c.h#L31:
std::string IdToToken(int32_t token_id);
Force-pushed from f7b2248 to e0901e6
cc @tqchen
Force-pushed from e0901e6 to 159e8e8
cc @tqchen
Force-pushed from 17a82d6 to 8d8a323
This PR adds these methods to the Tokenizer class to support querying the vocabulary from the tokenizer. This enables downstream uses such as stop-string checking, grammar checking, etc.
Tokenizer build time: