How to extend tokens dictionary? #426

kpe · 2019-11-19T16:00:43Z

It would be nice if an existing sentencepiece model could be extended with additional special/[MARKUP] tokens.

While adding real language tokens might not be needed (or possible), adding atomic markup tokens to an existing model should be possible, right?

As a usage scenario, imagine you have a plain-text pre-trained model and you want to handle some HTML-like document, i.e. you need to add special markup/atomic tokens like [], [<TABLE>], etc.

This is currently not possible, right?


taku910 · 2019-11-20T01:45:32Z

SentencePiece supports user-defined symbols. I think you can use this feature.

https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md

kpe · 2019-11-20T09:47:20Z

Thank you, @taku910 for your reply! What I would like to do is to take a sentencepiece model and extend it with new user-defined symbols (i.e. without changing the piece-to-id mapping for the existing pieces or the tokenization behavior.)

A use case would be to fine-tune one of the official pre-trained ALBERT models on Google's Natural Questions task (which includes some HTML tags). That is, I would like to convert an existing sentencepiece model into one that can also handle those additional user-defined symbols (let's say [] and []) without training a new model, since retraining could produce a different sentencepiece model (and the original plain-text data might not be available).

SchenbergZY · 2019-11-24T20:56:50Z

I also have the same problem, anything new?

SchenbergZY · 2019-11-24T21:36:19Z

My question is: is there a way to add some custom tokens at the end of an existing model, all with score 0?

gyc913 · 2019-12-06T05:50:15Z

I have the same question as SchenbergZY and kpe.

taku910 · 2019-12-06T09:05:38Z

You can rewrite the model file to add new user defined symbols. It is also possible to rewrite the id<->piece mapping.

#398

s4sarath · 2019-12-07T03:30:53Z

@SchenbergZY - Have you found a hack for this? If so, please post an update, if you don't mind. Thanks. :-)

taku910 · 2019-12-07T04:23:32Z

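The model file format is defined in a protobuf schema (proto is a DSL), which can be found at
https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto

You can compile it into a Python module with:

protoc --python_out=. sentencepiece_model.proto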
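For readers following along, here is a minimal sketch of what "rewriting the model file" could look like in Python. It is not code from this thread. It assumes the `sentencepiece_model_pb2` module is importable (recent pip releases of sentencepiece bundle it; otherwise the file generated by the protoc command can be imported directly), and the function names and file paths are hypothetical. New pieces are appended at the end with score 0 and type USER_DEFINED, so the ids of existing pieces should stay unchanged.

```python
from typing import Iterable, List


def symbols_to_append(existing: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Return the wanted symbols that are not already in the vocabulary.

    Order is preserved and duplicates are dropped. (Hypothetical helper.)
    """
    seen = set(existing)
    result = []
    for symbol in wanted:
        if symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result


def extend_model(src_path: str, dst_path: str, wanted: Iterable[str]) -> List[str]:
    """Write a copy of the model at src_path with extra user-defined symbols.

    Hypothetical function; returns the symbols that were actually appended.
    """
    # Imported lazily so the helper above works without sentencepiece installed.
    # Assumption: the pip package ships this module; if not, use the module
    # generated by protoc and `import sentencepiece_model_pb2 as sp_pb2`.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    with open(src_path, "rb") as f:
        proto.ParseFromString(f.read())

    added = symbols_to_append((p.piece for p in proto.pieces), wanted)
    for symbol in added:
        piece = proto.pieces.add()  # appended last, so existing ids are untouched
        piece.piece = symbol
        piece.score = 0.0
        piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED

    with open(dst_path, "wb") as f:
        f.write(proto.SerializeToString())
    return added
```

A call might look like `extend_model("original.model", "extended.model", ["<TABLE>", "<TR>"])` (paths are placeholders). It would be worth comparing the piece-to-id mapping of both models afterwards, and any network fine-tuned with the extended model would also need embedding rows for the new ids.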

s4sarath · 2019-12-07T09:22:20Z

My bad. It took me some time to figure it out. :-)


ilaouirine · 2023-06-13T08:52:01Z

Hello,
Is it possible to define a pattern with user-defined symbols that I don't want to tokenize? For example, I don't want to tokenize any words starting with a '$' symbol. I also want to understand how decoding works because currently, all the words I define with user-defined symbols are assigned the ID 0.
Thank you!

hardtoname1002 · 2023-06-26T03:25:41Z

Hello,
I'm confused about the sentence: "Control symbols must be inserted outside of the SentencePiece segmentation." Could you please explain this in detail? Thanks!

taku910 closed this as completed Nov 20, 2019

haven-jeon mentioned this issue Dec 17, 2019

Tokenizer issue with special tokens such as [SEP] and [CLS] SKTBrain/KoBERT#11

Closed

tannonk mentioned this issue Feb 17, 2021

adding special tokens to a BPEmb model bheinzerling/bpemb#53

Closed

airaria mentioned this issue Apr 12, 2023

Vocabulary merging issue ymcui/Chinese-LLaMA-Alpaca#128

Closed
