Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

polib mutilates escape sequences #31

Closed
izimobil opened this issue Sep 20, 2012 · 11 comments
Closed

polib mutilates escape sequences #31

izimobil opened this issue Sep 20, 2012 · 11 comments
Labels

Comments

@izimobil
Copy link
Owner

Originally reported by: qx0monster (Bitbucket: qx0monster, GitHub: Unknown)


I've just updated to polib 1.0.1, after sticking around with an 0.5.x version for a long time. Great job, D-J! Sorry to re-raise this old issue, today with a slightly different (and hopefully more convincing) phrasing.

polib mutilates valid escape sequences.

To wit, here is a simple test case:

#!python

bash> cat t.po
#
msgid ""
msgstr ""

msgid "unicode: \u00ae; octal: \141; hex: \x61; control: \b \f \v \a"
msgstr ""
bash> python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import polib
>>> polib.pofile("t.po").save()
>>> quit()
bash> cat t.po
# 
msgid ""
msgstr ""

msgid "unicode: \\u00ae; octal: \\141; hex: \\x61; control: \\b \\f \\v \\a"
msgstr ""
bash>

All escape sequences unknown to polib (ie. outside of \t, \r and \n) get an additional '' in front of them. This is particular problematic for us in the case of unicode escapes, as they are frequently used to enter hard-to-type characters into msgid's and msgstr's (like the "Registered" character in the sample).

The problem arises as polib unescapes strings on reads (which removes some '', but leaves them with unknown sequences like '\u...') and escapes on writes/stringify (which unconditionally prefixes unknown escape seq's with another '').

I thought a lot about it, but to keep a long story short my resolution is to have polib leave unknown escape sequences untouched. We've ran long with this patch in several projects with good results. I probably add more of my considerations as a separate comment.

Here is the pull request:
https://bitbucket.org/izi/polib/pull-request/8/removed-escaping-unescaping-of-unknown


@izimobil
Copy link
Owner Author

Original comment by qx0monster (Bitbucket: qx0monster, GitHub: Unknown):


To add a bit more rational to this issue:

  • Compatibility with gettext tools: The gettext parser supports (a) octal numbers of the form '\ooo', (b) hex numbers of the form '\x...' and (c) some additional control seq's like '\b', '\f', '\v' and '\a' (while most of the latter are warned about as "not suitable" for internationalization efforts).
  • The gettext parser is strict, in that it does not allow unknown escape sequences (it bombs). It also seems to replace escape sequences with their binary equivalent, e.g. '\x61' is replaced with 'a'. Leaving octal, hex and the other allowed control seq's untouched seems like a good interop measure, for people using both polib and gettext tools.
  • Unicode support: gettext tools don't support unicode escapes (Funnily, there is a code comment in po-lex.c, saying "FIXME: \u and \U are not handled"! http://git.savannah.gnu.org/cgit/gettext.git/tree/gettext-tools/src/po-lex.c?id=b75a50bc5f30666fb36469a90245e1d7856d672c#n851). But as I said support for unicode escapes is crucial for us, and I would appreciate if polib was getting ahead of gettext in this respect. Being just lenient with unknown escape seq's would do the job.

@izimobil
Copy link
Owner Author

Original comment by David Jean Louis (Bitbucket: izi, GitHub: izi):


Hi, thanks for the explanations, it makes perfect sense indeed. That said, doing this by default will totally break backwards compatibility, so I'm +1 on this if it's an explicit option.
Regards.

@izimobil
Copy link
Owner Author

Original comment by qx0monster (Bitbucket: qx0monster, GitHub: Unknown):


Fine with me. Where would this option be set? On the POFile/_BaseFile object? Will you add it to the params of the pofile() factory?

@izimobil
Copy link
Owner Author

Original comment by David Jean Louis (Bitbucket: izi, GitHub: izi):


I think this should be done in polib.pofile and polib.mofile, but also in polib.POFile and polib.MOFile classes, as it's done for the check_for_duplicates option.

@izimobil
Copy link
Owner Author

Original comment by qx0monster (Bitbucket: qx0monster, GitHub: Unknown):


Would you suggest I'll take a look at it? Or is this something you want to do yourself?

@izimobil
Copy link
Owner Author

Original comment by David Jean Louis (Bitbucket: izi, GitHub: izi):


It would be cool if you could send a pull request (with tests and docs), thanks in advance !

@izimobil
Copy link
Owner Author

Original comment by Jakub Wilk (Bitbucket: jwilk, GitHub: jwilk):


The correct solution is to fix the parser to decode all escape sequences (and possibly signal an error on unknown ones).

AFAICS it is true that with qx0monster's patch polib indeed round-trips properly. It's just the Python objects that are supposed to represent the PO file contents don't make sense. For example, for this file:

msgid ""
msgstr ""

msgid "\\n"
msgstr ""

I get:

>>> polib.pofile('t.po')[0].msgid[-1].isspace()
True

which is wrong: the msgid didn't contain any whitespace.

@izimobil
Copy link
Owner Author

izimobil commented Feb 9, 2013

Original comment by David Jean Louis (Bitbucket: izi, GitHub: izi):


Hi qx0monster, any progress on this ? Thanks.

@izimobil
Copy link
Owner Author

Original comment by David Jean Louis (Bitbucket: izi, GitHub: izi):


Issue #35 was marked as a duplicate of this issue.

@izimobil
Copy link
Owner Author

Original comment by qx0monster (Bitbucket: qx0monster, GitHub: Unknown):


I'm terribly lagging behind ...

@izimobil
Copy link
Owner Author

Original comment by qx0monster (Bitbucket: qx0monster, GitHub: Unknown):


Sorry, I'm out of this. I will not be able to provide a patch as discussed, so please don't wait for me. Close or proceed as you see fit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant