
user defined char set #649

Open
wenjie-p opened this issue Apr 18, 2021 · 4 comments
Labels
feature request Add new feature

Comments

@wenjie-p

Hi,

Thanks for this wonderful toolkit you have built! If my understanding is right, the toolkit takes all letters and punctuation as the character set to merge, where each char in the set satisfies len(char) == 1. In my experiments, however, I need to use symbols consisting of two chars (i.e. ab ) as the basic units to merge. Is it possible for me to accomplish this?

Thanks in advance!

@wnhsu

wnhsu commented May 7, 2021

I am also wondering if it is possible to specify a set of basic units for training the BPE/unigram model. Based on this issue, it seems that --user_defined_symbols is designed to specify a set of pre-defined pieces that will not be merged with other pieces, so it does not seem suited to this purpose.

Thanks in advance!

@wenjie-p
Author

wenjie-p commented May 9, 2021

Hi @wnhsu , I recently tried mapping each of the basic units I wanted (i.e. two or more chars) to a single char, applying the mapping to all of my text data. So I basically translate my data with a predefined mapping rule. This method is not very elegant, but I would like to post it here. I'm not sure how we can accomplish this with the built-in functionality of this toolkit.
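A minimal sketch of such a mapping (the units and codepoints here are made up for illustration; private-use-area characters are used so the stand-ins cannot collide with real text):

```python
# Hypothetical multi-char units to treat as atomic, each mapped to a
# private-use-area codepoint that cannot occur in ordinary text.
units = ["ab", "ch"]
fold_map = {u: chr(0xE000 + i) for i, u in enumerate(units)}
unfold_map = {c: u for u, c in fold_map.items()}

def encode_units(text: str) -> str:
    # Replace each unit with its single-character stand-in before training.
    for unit, ch in fold_map.items():
        text = text.replace(unit, ch)
    return text

def decode_units(text: str) -> str:
    # Reverse the mapping after detokenization.
    for ch, unit in unfold_map.items():
        text = text.replace(ch, unit)
    return text

print(len(encode_units("abche")))  # two stand-in chars plus "e"
```

Training and tokenization then operate on the folded text, and decode_units is applied to the detokenized output.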

@m-malandro

m-malandro commented Feb 16, 2022

I would like to see this as well. I am dealing with a language where the basic units are instructions of the form string:number, for instance

A:32

or

IN:264

The set of instructions is known, and is finite.

A typical "sentence" in my language looks like

A:32;IN:264;H:7;W:3.

SentencePiece might tokenize this sentence as

['A:32;I', 'N:26', '4;H:7;W:', '3'].

However, I would prefer my basic units not be split.

To work around this apparent limitation of SentencePiece, I have created a mapping that replaces each instruction with a unique Unicode character, but it feels quite hacky: it makes my source files difficult to read and difficult to work with.
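A sketch of that substitution, taking advantage of the instruction set being known and finite (the instruction list and codepoints below are illustrative):

```python
import re

# Illustrative subset of the known, finite instruction set.
instructions = ["A:32", "IN:264", "H:7", "W:3"]
to_char = {ins: chr(0xE000 + i) for i, ins in enumerate(instructions)}
from_char = {c: ins for ins, c in to_char.items()}

# Longest-first alternation so no instruction is shadowed by a shorter prefix.
pattern = re.compile(
    "|".join(re.escape(i) for i in sorted(instructions, key=len, reverse=True))
)

def fold(sentence: str) -> str:
    # Collapse each instruction to its single-character stand-in.
    return pattern.sub(lambda m: to_char[m.group(0)], sentence)

def unfold(sentence: str) -> str:
    # Expand stand-ins back to instructions; other chars pass through.
    return "".join(from_char.get(c, c) for c in sentence)
```

After folding, the sentence "A:32;IN:264;H:7;W:3" becomes four stand-in characters separated by semicolons, so SentencePiece can never split an instruction in the middle.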

@taku910 taku910 added the feature request Add new feature label May 29, 2022
@mlmsft

mlmsft commented Sep 3, 2022

A related issue/suggestion:

In our setup, we can't output word boundaries "▁" as complete tokens; they can only occur in combination with subsequent characters (e.g. "▁a" or "▁and"). An option to prohibit certain characters or character combinations from forming tokens of their own would be appreciated.

I tried circumventing this problem with
https://github.com/google/sentencepiece#vocabulary-restriction
However, this option doesn't seem to be implemented in the Python API.

Thanks
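For reference, the vocabulary restriction linked above is exposed through the command-line tools; a sketch, where the model and file names are placeholders:

```shell
# Generate a vocabulary from the training data, then restrict encoding to
# pieces seen at least 50 times (m.model and the file names are placeholders).
spm_encode --generate_vocabulary --model=m.model < train.txt > vocab.txt
spm_encode --model=m.model --vocabulary=vocab.txt --vocabulary_threshold=50 \
  < input.txt > encoded.txt
```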
