Trouble with reversibility #185

Closed · bmccann opened this issue Aug 22, 2018 · 4 comments

bmccann commented Aug 22, 2018

Thanks for this great tool. For context, I'm working with the Python wrapper for BPE tokenization, and I would like to write my tokenized input to files line by line.

Using the default normalization settings, it looks like I can't get complete (character-by-character) reversibility for some special tokens. If I turn normalization off by setting --normalization_rule_name=identity, I get all sorts of odd tokenizations.
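For concreteness, training with normalization disabled looks roughly like this (a minimal sketch; the corpus path, model prefix, and vocab size are placeholders, not my actual settings):

import sentencepiece as spm

# train a BPE model with normalization turned off entirely
# (corpus.txt, bpe, and 32000 are placeholders)
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=bpe --vocab_size=32000 '
    '--model_type=bpe --normalization_rule_name=identity')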

### Excerpt from a tokenization script that tries to encode and write line by line to a file
# spp is a loaded sentencepiece.SentencePieceProcessor
input_line = input_line.strip()
# this wrapper version returns pieces as bytes, so convert them to str
tokenized_line = [x.decode('utf-8') for x in spp.EncodeAsPieces(input_line)]
encoded_output_line = ' '.join(tokenized_line) + '\n'
# round-trip check: decode the written pieces and compare with the original line
decoded_input_line = spp.DecodePieces([x.encode() for x in encoded_output_line.split()])
if input_line != decoded_input_line:
    print("input_line: ", input_line)
    print("decoded_input_line: ", decoded_input_line)
outfile.write(encoded_output_line)

This yields things like the following:

input_line: Ich erkläre die am Donnerstag, dem 25. September 2003, unterbrochene Sitzungsperiode des Europäischen Parlaments für wieder aufgenommen.(1)

decoded_input_line: Ich erkläre die am Donnerstag, dem 25.September 2003, unterbrochene Sitzungsperiode des Europäischen Parlaments für wieder aufgenommen.(1)

See how the space in "25. September" was removed? Spaces are disappearing in examples like the following as well (this is happening to many, many sentences):

input_line: Fünfhunderttausend russischsprachige Einwohner bzw. 40 % der Bevölkerung, die keine Staatsangehörigkeit besitzen, sind vom politischen Leben ausgeschlossen.

decoded_input_line: Fünfhunderttausend russischsprachige Einwohner bzw. 40% der Bevölkerung, die keine Staatsangehörigkeit besitzen, sind vom politischen Leben ausgeschlossen.

Here is one where the space is removed between two words (rather than around punctuation):

input_line: Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays.

decoded_input_line: Mon emprisonnement m'a contraint à me pencher sur l'essentielquant à moi-même, mon engagement politique et mon pays.

I was hoping to get BPE subword tokenizations from sentencepiece that are completely reversible, so that I can recover the exact original input string. At the same time, I'd like to be able to cache the BPE-encoded inputs by writing them to a file. Is this possible, either with a different sentencepiece model or with a different way of writing to the file?

taku910 (Collaborator) commented Aug 23, 2018

Thank you for using sentencepiece.

That looks strange. Tokenization is basically reversible, except when the input contains unknown symbols and is encoded into id sequences with the EncodeAsIds/DecodeIds APIs. That does not seem to be the case in your experiments, though.
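For illustration, both round trips could be checked like this (a minimal sketch; the model path and the test sentence are placeholders):

import sentencepiece as spm

spp = spm.SentencePieceProcessor()
spp.Load('bpe.model')  # placeholder model path

s = 'Hello world.'
# piece round trip: reconstructs the input exactly
print(spp.DecodePieces(spp.EncodeAsPieces(s)) == s)
# id round trip: unknown symbols are collapsed to <unk>, so this can be False
print(spp.DecodeIds(spp.EncodeAsIds(s)) == s)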

Could you share the model file you are using? I will take a look.

bmccann (Author) commented Aug 23, 2018

Given your comment, I looked further into that last input line by copying it from the input file (opened in vim, copied with the Mac clipboard) into Python interpreters:

>>> x = "Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays."
>>> x
"Mon emprisonnement m'a contraint \xc3\xa0 me pencher sur l'essentiel\xc2\xa0quant \xc3\xa0 moi-m\xc3\xaame, mon engagement politique et mon pays."
>>> x.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 33: ordinal not in range(128)
>>> x = "Mon emprisonnement m'a contraint à me pencher sur l'essentiel quant à moi-même, mon engagement politique et mon pays."
>>> x
"Mon emprisonnement m'a contraint à me pencher sur l'essentiel\xa0quant à moi-même, mon engagement politique et mon pays."
>>> x.encode()
b"Mon emprisonnement m'a contraint \xc3\xa0 me pencher sur l'essentiel\xc2\xa0quant \xc3\xa0 moi-m\xc3\xaame, mon engagement politique et mon pays."

There is clearly something funky going on with the space between "l'essentiel" and "quant". It is displayed differently in Python 2 and Python 3 ("\xc2\xa0" vs "\xa0") because a Python 2 str is a byte string, so the character shows up as its two-byte UTF-8 encoding, while a Python 3 str is Unicode, so it shows up as the single code point U+00A0. Either way, these non-breaking spaces seem to get eaten by sentencepiece.

I don't know whether you want sentencepiece to behave differently, but for now I'm just going to replace non-breaking spaces with normal spaces.
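Concretely, the workaround is just one replace before encoding (U+00A0 is the non-breaking space):

# map non-breaking spaces to plain spaces before encoding
input_line = input_line.replace('\u00a0', ' ')

An NFKC pass (unicodedata.normalize('NFKC', input_line)) would also fold U+00A0 into a plain space, along with other compatibility characters.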

taku910 (Collaborator) commented Aug 24, 2018

Hi,

I've just remembered that I introduced a bug around Unicode handling in the latest Python wrapper:
453fd9a

The fix is already in the master branch, but the pip package has not been updated yet.
Could you try the attached whl package?
sentencepiece-0.1.4-cp35-cp35m-manylinux1_x86_64.whl.gz

% gzip -d sentencepiece-0.1.4-cp35-cp35m-manylinux1_x86_64.whl.gz
% pip3 install sentencepiece-0.1.4-cp35-cp35m-manylinux1_x86_64.whl --user
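After installing, a quick round-trip check could look like this (a minimal sketch; the model path and sentence are placeholders). With the fixed wrapper it should print True for input that is already NFKC-normalized:

import sentencepiece as spm

spp = spm.SentencePieceProcessor()
spp.Load('bpe.model')  # placeholder model path

s = "Mon emprisonnement m'a contraint à me pencher sur l'essentiel."
print(spp.DecodePieces(spp.EncodeAsPieces(s)) == s)  # expect True with the fix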

taku910 (Collaborator) commented Oct 30, 2018

Let me close this issue. Please reopen it if the problem persists.

taku910 closed this as completed Oct 30, 2018