user defined char set #649
Comments
I am also wondering if it is possible to specify a set of basic units to train the BPE/unigram model? Based on this issue, it seems that Thanks in advance!
Hi @wnhsu, I recently tried mapping each basic unit I wanted (i.e. two or more chars) to a single char, applying the mapping to all of my text data. So I basically translate my data with a predefined mapping rule before training. This method is not very elegant, but I would like to post it here. I am not sure how we can accomplish this with the built-in functionality of this toolkit.
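A minimal sketch of the mapping workaround described above, in plain Python. The unit inventory `UNITS`, the helper names, and the choice of the Unicode Private Use Area for placeholder characters are all assumptions made for illustration; none of this is part of the toolkit, and it assumes the corpus never contains those code points itself.

```python
import re

# Hypothetical multi-character base units; replace with your own inventory.
UNITS = ["ab", "abc", "ch"]

# Assumption: the corpus contains no Private Use Area code points (U+E000...),
# so each unit can safely borrow one of them as its single-char placeholder.
PUA_START = 0xE000
unit_to_char = {u: chr(PUA_START + i) for i, u in enumerate(UNITS)}
char_to_unit = {c: u for u, c in unit_to_char.items()}

# Try longer units first so that "abc" is preferred over its prefix "ab".
_pattern = re.compile(
    "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
)


def encode_units(text):
    """Replace every known unit with its single-character placeholder."""
    return _pattern.sub(lambda m: unit_to_char[m.group(0)], text)


def decode_units(text):
    """Map placeholders back to the original units; other chars pass through."""
    return "".join(char_to_unit.get(ch, ch) for ch in text)
```

The training data would be passed through `encode_units` before training, and `decode_units` would be applied to each output piece afterwards. The encoded files are hard to read, which is the main drawback of this approach.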
I would like to see this as well. I am dealing with a language where the basic units are instructions of the form string:number, for instance A:32 or IN:264. The set of instructions is known and finite. A typical "sentence" in my language looks like A:32;IN:264;H:7;W:3. Sentencepiece might tokenize this sentence as ['A:32;I', 'N:26', '4;H:7;W:', '3']. However, I would prefer that my basic units not be split. To work around this apparent limitation of sentencepiece, I have created a mapping where I replace each instruction with a unique Unicode character, but it feels quite hacky: it makes my source files difficult to read, and it is difficult to work with.
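To make the splitting problem concrete, here is a small sketch that reports which instructions are broken across piece boundaries. The helper name `split_units` is hypothetical, and the sketch assumes the pieces concatenate back to the original sentence exactly (no "▁" marker or other normalization), as in the example above.

```python
def split_units(pieces, sep=";"):
    """Return the instructions that are cut by a piece boundary."""
    text = "".join(pieces)

    # Character offsets at which one piece ends and the next begins.
    cuts = set()
    pos = 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)

    broken = []
    start = 0
    for unit in text.split(sep):
        end = start + len(unit)
        # A boundary strictly inside (start, end) splits this instruction.
        if any(start < c < end for c in cuts):
            broken.append(unit)
        start = end + len(sep)
    return broken


pieces = ["A:32;I", "N:26", "4;H:7;W:", "3"]
print(split_units(pieces))  # ['IN:264', 'W:3']
```

A tokenization that respects the instruction boundaries, such as `["A:32", ";", "IN:264", ";H:7", ";W:3"]`, yields an empty list.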
A related issue/suggestion: In our setup, we can't output word boundaries "▁" as complete tokens; they can only occur in combination with subsequent characters (e.g. "▁a" or "▁and"). An option to prohibit certain characters or character combinations from forming their own tokens would be appreciated. I tried circumventing this problem with Thanks
Hi,
Thanks for this wonderful toolkit you have built! If my understanding is right, this toolkit takes all letters and punctuation marks as the char set to merge, where each char in the set satisfies `len(char) == 1`. In my experiment, however, I need to treat some symbols consisting of two chars (e.g. `ab`) as basic units to merge. Is it possible for me to accomplish this? Thanks in advance!