Trouble with reversibility #185
Comments
Thank you for using sentencepiece. That looks strange. The tokenization is basically reversible, except when the input contains unknown symbols and is encoded into id sequences with the EncodeAsIds or DecodeIds APIs. However, that does not seem to be the case in your experiments. Could you share the model file you are using? I will take a look.
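To illustrate the distinction above, here is a minimal sketch; the model file name and the out-of-vocabulary character are assumptions, not values from this thread. Piece-level round trips preserve the surface form of unknown symbols, while id-level round trips collapse them to the <unk> id.

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("m.model")  # hypothetical trained model file

text = "hello \u2603 world"  # assume the snowman character is not in the vocabulary

# Piece-level round trip: unknown characters survive as raw pieces,
# so decoding reproduces the original string exactly.
pieces = sp.EncodeAsPieces(text)
print(sp.DecodePieces(pieces))

# Id-level round trip: the unknown character is mapped to the <unk> id,
# and its original surface form is lost on decoding.
ids = sp.EncodeAsIds(text)
print(sp.DecodeIds(ids))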
Given your comment, I looked further into the last input line I provided, copying it from the input file (opened in vim, copied with the Mac clipboard) into the Python interpreter:
>>> x = "Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays."
>>> x
"Mon emprisonnement m'a contraint à me pencher sur l'essentiel\xa0quant à moi-même, mon engagement politique et mon pays."
>>> x.encode()
b"Mon emprisonnement m'a contraint \xc3\xa0 me pencher sur l'essentiel\xc2\xa0quant \xc3\xa0 moi-m\xc3\xaame, mon engagement politique et mon pays." There is clearly something funky going on with the space token between "l'essentiel" and "quant". I'm not sure why they are portrayed differently across python2 and python 3 as "\xc2\xa0" and "\xa0", but these non-breaking spaces seem to get eaten up by sentencepiece. I don't know whether you want sentencepiece to behave differently, but for now I'm just going to replace non-breaking spaces with normal spaces. |
Hi, I've just remembered that I introduced a bug around unicode handling in the latest python wrapper. The master branch has already been fixed, but the pip package has not been updated yet.
Let me close this issue. Please reopen it if the problem still persists.
Thanks for this great tool. For context, I'm working with the python wrapper for BPE tokenization, and I would like to write my tokenized input to files line by line.
Using the default normalization settings, it looks like I can't get complete (character-by-character) reversibility for some special tokens. If I turn normalization off by setting --normalization_rule_name=identity, I get all sorts of odd tokenizations (a training sketch with this flag follows the examples below). This yields things like the following:
See how the space between "25. September" was removed? It looks like spaces are getting removed in these examples as well (this is happening to many, many sentences):
Here is one where a space is removed between words (rather than just around punctuation):
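As referenced above, training with normalization disabled might look like the following; the input corpus, model prefix, and vocabulary size are placeholder assumptions, not values from this thread.

import sentencepiece as spm

# Train a BPE model with normalization turned off via the identity rule.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt "           # placeholder corpus
    "--model_prefix=bpe_identity "  # placeholder output prefix
    "--vocab_size=8000 "            # placeholder vocabulary size
    "--model_type=bpe "
    "--normalization_rule_name=identity"
)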
I was hoping to get BPE subword tokenizations from sentencepiece that were completely reversible, so that I could get back to the exact original input string. But I'd also like to be able to cache files and write the BPE-encoded inputs to a file. Is this possible, either with a different sentencepiece model or with a different method of writing to the file?
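One way to cache encoded lines, sketched under the assumption that no piece contains a literal space (sentencepiece replaces spaces with the ▁ meta symbol): join pieces with spaces when writing, split on spaces when reading back, and decode. The file names and model name here are placeholders.

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("bpe.model")  # placeholder model file

# Write one space-joined piece sequence per input line.
with open("input.txt", encoding="utf-8") as fin, \
     open("encoded.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.EncodeAsPieces(line.rstrip("\n"))
        fout.write(" ".join(pieces) + "\n")

# Later, recover the original text line by line.
with open("encoded.txt", encoding="utf-8") as fin:
    for line in fin:
        print(sp.DecodePieces(line.rstrip("\n").split(" ")))

Note that id-based caching (EncodeAsIds) loses unknown symbols, as discussed above, so piece-based caching is the safer route for exact reversibility.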