How to extend tokens dictionary? #426
Comments
SentencePiece supports user-defined symbols. I think you can use this feature. https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
Thank you, @taku910, for your reply! What I would like to do is take a sentencepiece model and extend it with new user-defined symbols (i.e. without changing the piece-to-id mapping for the existing pieces or the tokenization behavior). A use case would be fine-tuning some of the official pre-trained ALBERT models on Google's Natural Questions task (which includes some HTML tags), i.e. I would like to convert an existing sentencepiece model into one that can also handle those additional user-defined symbols (let's say …)
I also have the same problem. Anything new?
My question is: is there a way to add some custom tokens at the end of an existing model, all with score 0?
I have the same question as SchenbergZY and kpe.
You can rewrite the model file to add new user-defined symbols. It is also possible to rewrite the id<->piece mapping.
@SchenbergZY - Have you found a hack for this? If so, please do update, if you don't mind. Thanks. :-)
The proto is a DSL and can be found at https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto

protoc --python_out=. sentencepiece_model.proto

My bad. It took me some time to figure it out. :-)
Hello,

It would be nice if an existing sentencepiece model could be extended with additional special [MARKUP] tokens. While adding real language tokens might not be needed (or possible), adding additional atomic markup tokens to an existing model should be possible, right?

As a usage scenario, imagine you have a plain-text pre-trained model and you want to handle an HTML-like document, i.e. you need to add special markup/atomic tokens like [<P>], [<TABLE>], etc. This is currently not possible, right?