What is the difference between --user_defined_symbols and --control_symbols #215
Firstly, thanks for making this library! Very useful and easy to use.
I am wondering what the difference is between these two options, --user_defined_symbols and --control_symbols.
Thanks in advance for taking the time to respond.
SentencePiece manages vocab id <=> token mapping.
--control_symbols only reserves ids for the specified token(s). Even if such a token appears in the input text, it is not segmented as a single symbol. The user has to insert its id explicitly after segmentation, as follows:
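A minimal sketch of that step. `EncodeAsIds` and `PieceToId` are real `SentencePieceProcessor` method names, but `ToyProcessor`, its whitespace splitting, the tiny vocabulary, and the `<sep>` symbol are hypothetical stand-ins so the snippet runs without a trained model. With the real library, `sp` would presumably be a processor loaded from a model trained with `--control_symbols=<sep>`.

```python
class ToyProcessor:
    """Hypothetical stand-in for a loaded SentencePieceProcessor (not the real class)."""

    def __init__(self, pieces, control_symbols):
        self.control = set(control_symbols)
        # Control symbols get reserved ids next to the ordinary pieces.
        vocab = ["<unk>"] + list(control_symbols) + list(pieces)
        self.ids = {piece: i for i, piece in enumerate(vocab)}

    def PieceToId(self, piece):
        return self.ids.get(piece, self.ids["<unk>"])

    def EncodeAsIds(self, text):
        # Toy segmentation on whitespace. A control symbol that appears in the
        # text is never mapped to its reserved id; as a simplification it
        # falls back to <unk> here.
        return [
            self.ids["<unk>"] if word in self.control else self.PieceToId(word)
            for word in text.split()
        ]


sp = ToyProcessor(pieces=["this", "is", "a", "test"], control_symbols=["<sep>"])

# Segment the text first, then append the reserved id explicitly.
ids = sp.EncodeAsIds("this is a test") + [sp.PieceToId("<sep>")]
print(ids)  # [2, 3, 4, 5, 1]
```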
This code appends the reserved id to the id sequence for 'this is a test'.
On the other hand, tokens specified with --user_defined_symbols are always segmented as one symbol when they appear in the input, so they can simply be written into the input text before encoding.
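A minimal sketch of that call, under the same caveats: `EncodeAsIds` is the real method name, while `ToyProcessor`, the whitespace splitting, and the `<sep>` symbol are hypothetical stand-ins for a processor loaded from a model trained with `--user_defined_symbols=<sep>`.

```python
class ToyProcessor:
    """Hypothetical stand-in for a processor with user-defined symbols."""

    def __init__(self, pieces, user_defined_symbols):
        vocab = ["<unk>"] + list(user_defined_symbols) + list(pieces)
        self.ids = {piece: i for i, piece in enumerate(vocab)}

    def EncodeAsIds(self, text):
        # Toy segmentation on whitespace. A user-defined symbol found in the
        # text is kept as one symbol and mapped to its own id.
        return [self.ids.get(word, self.ids["<unk>"]) for word in text.split()]


sp = ToyProcessor(pieces=["this", "is", "a", "test"], user_defined_symbols=["<sep>"])

# The symbol is written directly into the input; no id is appended by hand.
ids = sp.EncodeAsIds("this is a test <sep>")
print(ids)  # [2, 3, 4, 5, 1]
```

In this toy setup the result is the same id sequence as appending a control symbol's id by hand; the difference is that here the text itself decides where the symbol goes.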
For experimental purposes, user-defined symbols should be easier, since you can control the behavior just by tweaking the input. However, when you deploy the system as a user-facing product, user-defined symbols may not be appropriate, because users can change the behavior by injecting these special symbols into their input.
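If user-defined symbols are used with untrusted input anyway, one possible mitigation is to remove them from the text before encoding. This is only a sketch of that idea and not something SentencePiece does for you; `strip_symbols` and the `<sep>` symbol are hypothetical names.

```python
def strip_symbols(text, symbols):
    """Remove reserved symbols from untrusted text before it is encoded."""
    symbols = [s for s in symbols if s]  # ignore empty strings
    changed = True
    # Repeat until nothing changes: removing one occurrence can splice a new
    # one together, e.g. "<se<sep>p>" becomes "<sep>" after a single pass.
    while changed:
        changed = False
        for symbol in symbols:
            if symbol in text:
                text = text.replace(symbol, "")
                changed = True
    return text


user_input = "ignore this <sep> and <se<sep>p> too"
clean = strip_symbols(user_input, ["<sep>"])
print(repr(clean))  # 'ignore this  and  too'
```

With control symbols this filtering is unnecessary, because text alone can never produce the reserved id.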