Trouble with reversibility #185

Closed · bmccann opened this issue Aug 22, 2018 · 4 comments

bmccann commented Aug 22, 2018

Thanks for this great tool. For context, I'm working with the Python wrapper for BPE tokenization, and I would like to write my tokenized input to files line by line.

Using the default normalization settings, it looks like I can't get complete (character-by-character) reversibility for some special tokens. If I turn normalization off by setting --normalization_rule_name=identity, I get all sorts of odd tokenizations.
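For concreteness, training with normalization disabled looks roughly like this (a minimal sketch; the corpus path, model prefix, and vocab size are placeholders, not my actual settings):

import sentencepiece as spm

# train a BPE model with normalization turned off entirely
# (corpus.txt, bpe, and 32000 are placeholders)
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=bpe --vocab_size=32000 '
    '--model_type=bpe --normalization_rule_name=identity')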

### Excerpt from a tokenization script that tries to encode and write line by line to a file
# spp is a loaded sentencepiece.SentencePieceProcessor
input_line = input_line.strip()
# this wrapper version returns pieces as bytes, so convert them to str
tokenized_line = [x.decode('utf-8') for x in spp.EncodeAsPieces(input_line)]
encoded_output_line = ' '.join(tokenized_line) + '\n'
# round-trip check: decode the written pieces and compare with the original line
decoded_input_line = spp.DecodePieces([x.encode() for x in encoded_output_line.split()])
if input_line != decoded_input_line:
    print("input_line: ", input_line)
    print("decoded_input_line: ", decoded_input_line)
outfile.write(encoded_output_line)

This yields things like the following:

input_line: Ich erkläre die am Donnerstag, dem 25. September 2003, unterbrochene Sitzungsperiode des Europäischen Parlaments für wieder aufgenommen.(1)

decoded_input_line: Ich erkläre die am Donnerstag, dem 25.September 2003, unterbrochene Sitzungsperiode des Europäischen Parlaments für wieder aufgenommen.(1)

See how the space in "25. September" was removed? Spaces are disappearing in examples like the following as well (this is happening to many, many sentences):

input_line: Fünfhunderttausend russischsprachige Einwohner bzw. 40 % der Bevölkerung, die keine Staatsangehörigkeit besitzen, sind vom politischen Leben ausgeschlossen.

decoded_input_line: Fünfhunderttausend russischsprachige Einwohner bzw. 40% der Bevölkerung, die keine Staatsangehörigkeit besitzen, sind vom politischen Leben ausgeschlossen.

Here is one where the space is removed between two words (rather than around punctuation):

input_line: Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays.

decoded_input_line: Mon emprisonnement m'a contraint à me pencher sur l'essentielquant à moi-même, mon engagement politique et mon pays.

I was hoping to get BPE subword tokenizations from sentencepiece that are completely reversible, so that I can recover the exact original input string. At the same time, I'd like to be able to cache the BPE-encoded inputs by writing them to a file. Is this possible, either with a different sentencepiece model or with a different way of writing to the file?

taku910 (Collaborator) commented Aug 23, 2018

Thank you for using sentencepiece.

That looks strange. Tokenization is basically reversible, except when the input contains unknown symbols and is encoded into id sequences with the EncodeAsIds/DecodeIds APIs. That does not seem to be the case in your experiments, though.
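For illustration, both round trips could be checked like this (a minimal sketch; the model path and the test sentence are placeholders):

import sentencepiece as spm

spp = spm.SentencePieceProcessor()
spp.Load('bpe.model')  # placeholder model path

s = 'Hello world.'
# piece round trip: reconstructs the input exactly
print(spp.DecodePieces(spp.EncodeAsPieces(s)) == s)
# id round trip: unknown symbols are collapsed to <unk>, so this can be False
print(spp.DecodeIds(spp.EncodeAsIds(s)) == s)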

Could you share the model file you are using? I will take a look.

bmccann (Author) commented Aug 23, 2018

Given your comment, I looked further into that last input line by copying it from the input file (opened in vim, copied with the Mac clipboard) into Python interpreters:

>>> x = "Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays."
>>> x
"Mon emprisonnement m'a contraint \xc3\xa0 me pencher sur l'essentiel\xc2\xa0quant \xc3\xa0 moi-m\xc3\xaame, mon engagement politique et mon pays."
>>> x.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 33: ordinal not in range(128)
>>> x = "Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays."
>>> x
"Mon emprisonnement m'a contraint à me pencher sur l'essentiel\xa0quant à moi-même, mon engagement politique et mon pays."
>>> x.encode()
b"Mon emprisonnement m'a contraint \xc3\xa0 me pencher sur l'essentiel\xc2\xa0quant \xc3\xa0 moi-m\xc3\xaame, mon engagement politique et mon pays."

There is clearly something funky going on with the space between "l'essentiel" and "quant". It is displayed differently in Python 2 and Python 3 ("\xc2\xa0" vs "\xa0") because a Python 2 str is a byte string, so the character shows up as its two-byte UTF-8 encoding, while a Python 3 str is Unicode, so it shows up as the single code point U+00A0. Either way, these non-breaking spaces seem to get eaten by sentencepiece.

I don't know whether you want sentencepiece to behave differently, but for now I'm just going to replace non-breaking spaces with normal spaces.
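Concretely, the workaround is just one replace before encoding (U+00A0 is the non-breaking space):

# map non-breaking spaces to plain spaces before encoding
input_line = input_line.replace('\u00a0', ' ')

An NFKC pass (unicodedata.normalize('NFKC', input_line)) would also fold U+00A0 into a plain space, along with other compatibility characters.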

taku910 (Collaborator) commented Aug 24, 2018

Hi,

I've just remembered that I introduced a bug around Unicode handling in the latest Python wrapper:
453fd9a

The fix is already in the master branch, but the pip package has not been updated yet.
Could you try the attached whl package?
sentencepiece-0.1.4-cp35-cp35m-manylinux1_x86_64.whl.gz

% gzip -d sentencepiece-0.1.4-cp35-cp35m-manylinux1_x86_64.whl.gz
% pip3 install sentencepiece-0.1.4-cp35-cp35m-manylinux1_x86_64.whl --user
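After installing, a quick round-trip check could look like this (a minimal sketch; the model path and sentence are placeholders). With the fixed wrapper it should print True for input that is already NFKC-normalized:

import sentencepiece as spm

spp = spm.SentencePieceProcessor()
spp.Load('bpe.model')  # placeholder model path

s = "Mon emprisonnement m'a contraint à me pencher sur l'essentiel."
print(spp.DecodePieces(spp.EncodeAsPieces(s)) == s)  # expect True with the fix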

taku910 (Collaborator) commented Oct 30, 2018

Let me close this issue. Please reopen it if the problem persists.

taku910 closed this as completed Oct 30, 2018