
user defined char set #649

Open
wenjie-p opened this issue Apr 18, 2021 · 4 comments
Labels
feature request Add new feature

Comments

@wenjie-p

Hi,

Thanks for this wonderful toolkit you have built! If my understanding is right, the toolkit takes all letters and punctuation as the character set to merge, where each char in the set satisfies len(char) == 1. In my experiments, however, I need to use symbols consisting of two chars (i.e. ab ) as the basic units to merge. Is it possible for me to accomplish this?

Thanks in advance!

@wnhsu

wnhsu commented May 7, 2021

I am also wondering if it is possible to specify a set of basic units for training the BPE/unigram model. Based on this issue, it seems that --user_defined_symbols is designed to specify a set of pre-defined pieces that will not be merged with other pieces, so it does not seem suited to this purpose.

Thanks in advance!

@wenjie-p
Author

wenjie-p commented May 9, 2021

Hi @wnhsu , I recently tried mapping each of the basic units I wanted (i.e. two or more chars) to a single char, applying the mapping to all of my text data. So I basically translate my data with a predefined mapping rule. This method is not very elegant, but I would like to post it here. I'm not sure how we can accomplish this with the built-in functionality of this toolkit.
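A minimal sketch of such a mapping (the units and codepoints here are made up for illustration; private-use-area characters are used so the stand-ins cannot collide with real text):

```python
# Hypothetical multi-char units to treat as atomic, each mapped to a
# private-use-area codepoint that cannot occur in ordinary text.
units = ["ab", "ch"]
fold_map = {u: chr(0xE000 + i) for i, u in enumerate(units)}
unfold_map = {c: u for u, c in fold_map.items()}

def encode_units(text: str) -> str:
    # Replace each unit with its single-character stand-in before training.
    for unit, ch in fold_map.items():
        text = text.replace(unit, ch)
    return text

def decode_units(text: str) -> str:
    # Reverse the mapping after detokenization.
    for ch, unit in unfold_map.items():
        text = text.replace(ch, unit)
    return text

print(len(encode_units("abche")))  # two stand-in chars plus "e"
```

Training and tokenization then operate on the folded text, and decode_units is applied to the detokenized output.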

@m-malandro

m-malandro commented Feb 16, 2022

I would like to see this as well. I am dealing with a language where the basic units are instructions of the form string:number, for instance

A:32

or

IN:264

The set of instructions is known, and is finite.

A typical "sentence" in my language looks like

A:32;IN:264;H:7;W:3.

SentencePiece might tokenize this sentence as

['A:32;I', 'N:26', '4;H:7;W:', '3'].

However, I would prefer my basic units not be split.

To work around this apparent limitation of SentencePiece, I have created a mapping that replaces each instruction with a unique Unicode character, but it feels quite hacky: it makes my source files difficult to read and difficult to work with.
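A sketch of that substitution, taking advantage of the instruction set being known and finite (the instruction list and codepoints below are illustrative):

```python
import re

# Illustrative subset of the known, finite instruction set.
instructions = ["A:32", "IN:264", "H:7", "W:3"]
to_char = {ins: chr(0xE000 + i) for i, ins in enumerate(instructions)}
from_char = {c: ins for ins, c in to_char.items()}

# Longest-first alternation so no instruction is shadowed by a shorter prefix.
pattern = re.compile(
    "|".join(re.escape(i) for i in sorted(instructions, key=len, reverse=True))
)

def fold(sentence: str) -> str:
    # Collapse each instruction to its single-character stand-in.
    return pattern.sub(lambda m: to_char[m.group(0)], sentence)

def unfold(sentence: str) -> str:
    # Expand stand-ins back to instructions; other chars pass through.
    return "".join(from_char.get(c, c) for c in sentence)
```

After folding, the sentence "A:32;IN:264;H:7;W:3" becomes four stand-in characters separated by semicolons, so SentencePiece can never split an instruction in the middle.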

@taku910 taku910 added the feature request Add new feature label May 29, 2022
@mlmsft

mlmsft commented Sep 3, 2022

A related issue/suggestion:

In our setup, we can't output word boundaries "▁" as complete tokens; they can only occur in combination with subsequent characters (e.g. "▁a" or "▁and"). An option to prohibit certain characters or character combinations from forming tokens of their own would be appreciated.

I tried circumventing this problem with
https://github.com/google/sentencepiece#vocabulary-restriction
However, this option doesn't seem to be implemented in the Python API.

Thanks
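For reference, the vocabulary restriction linked above is exposed through the command-line tools; a sketch, where the model and file names are placeholders:

```shell
# Generate a vocabulary from the training data, then restrict encoding to
# pieces seen at least 50 times (m.model and the file names are placeholders).
spm_encode --generate_vocabulary --model=m.model < train.txt > vocab.txt
spm_encode --model=m.model --vocabulary=vocab.txt --vocabulary_threshold=50 \
  < input.txt > encoded.txt
```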
