What is the difference between --user_defined_symbols and --control_symbols #215

Closed
thammegowda opened this Issue Oct 20, 2018 · 2 comments

@thammegowda

thammegowda commented Oct 20, 2018

Firstly, thanks for making this library! Very useful and easy to use.

I am wondering what the difference is between these two options:

--control_symbols (comma-separated list of control symbols)  type: string  default: 
--user_defined_symbols (comma separated list of user-defined symbols)  type: string  default: 

I guess user_defined_symbols is a way to keep certain tokens from being split (is that correct?).
What are control_symbols intended for?

Thanks in advance for your time taken to respond to this.

@taku910

Collaborator

taku910 commented Oct 22, 2018

SentencePiece manages vocab id <=> token mapping.

control_symbols only reserves ids for the specified token(s). Even if such a token appears in the input text, it is not handled as a single token. The user has to insert its id explicitly after segmentation, as follows:

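import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('model')
# Encode the plain text, then append the reserved id of the control symbol.
ids = sp.EncodeAsIds('this is a test') + [sp.PieceToId('<c>')]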

This code appends the id of '<c>' to the id sequence for 'this is a test'.

On the other hand, tokens given via --user_defined_symbols are always segmented as one symbol wherever they appear in the input. So we can simply call:

ids = sp.EncodeAsIds('this is a test<c>')
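To make the contrast concrete, here is a toy sketch in pure Python. It is not SentencePiece's implementation: `ToyTokenizer` is a hypothetical character-level tokenizer invented for illustration, and the symbols `<c>` and `<u>` are just example names. It only mimics the behaviour described above: both kinds of symbol get a reserved id, but only user-defined symbols are matched in the input text.

```python
class ToyTokenizer:
    """Character-level toy; NOT how SentencePiece actually segments text."""

    def __init__(self, alphabet, control_symbols=(), user_defined_symbols=()):
        # Both kinds of symbol receive a reserved id in the vocabulary.
        pieces = (list(control_symbols) + list(user_defined_symbols)
                  + sorted(set(alphabet)))
        self.piece_to_id = {piece: i for i, piece in enumerate(pieces)}
        # Only user-defined symbols are ever matched against the input.
        self.user_defined = tuple(user_defined_symbols)

    def encode(self, text):
        ids, pos = [], 0
        while pos < len(text):
            # Look for a user-defined symbol starting at this position.
            match = next(
                (s for s in self.user_defined if text.startswith(s, pos)), None
            )
            # Otherwise fall back to a single character.
            piece = match if match is not None else text[pos]
            ids.append(self.piece_to_id[piece])
            pos += len(piece)
        return ids


toy = ToyTokenizer("abcu<>", control_symbols=["<c>"], user_defined_symbols=["<u>"])

# User-defined symbol in the input: encoded as one id.
user_defined_ids = toy.encode("ab<u>")

# Control symbol in the input: split like ordinary text.
control_in_text_ids = toy.encode("ab<c>")

# Control symbol added by the caller after encoding: one reserved id.
control_appended_ids = toy.encode("ab") + [toy.piece_to_id["<c>"]]
```

In this toy, `"ab<u>"` yields three ids, `"ab<c>"` yields five (the control symbol is broken into `<`, `c`, `>`), and appending the reserved id by hand yields three again.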

For experimental purposes, user-defined symbols are the easier option, since you can control the behavior just by tweaking the input. However, when you deploy the system as a user-facing product, user-defined symbols may not be appropriate, because users can change the behavior by injecting these special symbols into their input.
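If user-defined symbols are used with untrusted input anyway, one possible mitigation (an assumption on my part, not something suggested in this thread) is to strip those symbols from the raw text before encoding. A minimal sketch, where `strip_symbols` is a hypothetical helper and `<c>` is just an example symbol:

```python
def strip_symbols(text, symbols):
    """Remove every occurrence of the given symbols from untrusted text.

    Repeats until nothing changes, because deleting one occurrence can
    splice together a new one (e.g. '<<c>c>' becomes '<c>').
    """
    previous = None
    while previous != text:
        previous = text
        for symbol in symbols:
            text = text.replace(symbol, "")
    return text


reserved = ["<c>"]
cleaned = strip_symbols("this is a test<c>", reserved)
nested = strip_symbols("<<c>c>hello", reserved)
```

The cleaned string can then be passed to the encoder, and any special ids the application needs can be added by the application itself, as in the control-symbol example.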

@thammegowda

thammegowda commented Oct 22, 2018

thanks 💯