Unicode problem in po to JSON conversion #677

zaach · 2014-03-05T21:38:37Z

The weird characters are present in the generated JSON found in app/i18n.

Another example, from the Back/Zurück button on the /legal/terms page:

Expected:

#: app/scripts/templates/change_password.mustache:12
#: app/scripts/templates/delete_account.mustache:9
#: app/scripts/templates/pp.mustache:4
#: app/scripts/templates/reset_password.mustache:6
#: app/scripts/templates/reset_password.mustache:8
#: app/scripts/templates/tos.mustache:4
msgid "Back"
msgstr "Zurück"


pdehaan · 2014-03-06T02:09:41Z

No clue what's going on. I tried a different grunt-po-json module and it seemed to translate the Hebrew and German files fine. This may be some issue in the grunt-po2json module or maybe some additional flag we need to set.

It doesn't look like an issue w/ the output_transform method, but it looks like that may be unnecessary code with a recently added stringOnly flag: https://github.com/rkitamura/grunt-po2json#stringonly (ping @shane-tomlinson)

The closest I've gotten to fixing this is explicitly setting the "utf8" charset on the fs.readFileSync() call in ./node_modules/grunt-po2json/node_modules/po2json/lib/parseFileSync.js:12:

var data = fs.readFileSync(fs.realpathSync(fileName), 'utf8');

After hacking that, it seems that Hebrew started parsing fine. Not sure if there is some weird buffer bug somewhere in the po2json dependency, unless there is a weird flag I'm missing.
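For anyone following along, here is a minimal sketch of the general failure mode, assuming the bytes on disk are UTF-8 and something downstream decodes them one byte per character. It is not the exact output from this issue, just the same bytes read two ways using Node's built-in Buffer API:

```javascript
// "Zurück" written with an escape so the example does not depend on file encoding.
const original = 'Zur\u00FCck';

// In UTF-8, "ü" is stored as the two bytes 0xC3 0xBC.
const utf8Bytes = Buffer.from(original, 'utf8');

// Reading those bytes as latin1 (ISO-8859-1) turns each byte into its own character.
const misread = utf8Bytes.toString('latin1');

// Reading them as UTF-8 restores the original string.
const correct = utf8Bytes.toString('utf8');

console.log(misread); // ZurÃ¼ck
console.log(correct); // Zurück
```

Passing 'utf8' to fs.readFileSync() makes Node do the second decoding itself and return a string instead of a raw Buffer.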

Figure 1: We need to go deeper!

zaach · 2014-03-06T03:04:41Z

@pdehaan Looks like a bug in po2json for sure. When in doubt, use utf8.

shane-tomlinson · 2014-03-06T10:19:09Z

@zaach, @pdehaan - funny, I wrote a blog post about this a month ago - https://shanetomlinson.com/2014/l10n-gotcha-missing-charset-in-content-type-header/

shane-tomlinson · 2014-03-06T10:21:39Z

@zaach, @pdehaan - more background - If a charset is not specified, po2json expects the character encoding to be iso-8859-1. Since we are using utf8, we get garbage in the json.
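A rough sketch of that fallback behaviour, for illustration only. The function name `detectCharset` and the regular expression are assumptions made up for this example; this is not po2json's actual code, which may do the lookup differently:

```javascript
// Hypothetical helper: read the charset from a .po header, falling back to
// iso-8859-1 when the Content-Type line does not declare one.
function detectCharset(poHeader, fallback = 'iso-8859-1') {
  const match = /Content-Type:[^\n]*?charset=([\w-]+)/i.exec(poHeader);
  return match ? match[1].toLowerCase() : fallback;
}

// Header lines as they appear in a .po file (the \n is a literal backslash-n).
const withCharset = '"Content-Type: text/plain; charset=UTF-8\\n"\n';
const withoutCharset = '"Content-Type: text/plain;\\n"\n';

console.log(detectCharset(withCharset));    // utf-8
console.log(detectCharset(withoutCharset)); // iso-8859-1
```

With a fallback like this, a UTF-8 file that omits the charset is decoded as iso-8859-1, which is where the garbage comes from.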

pdehaan · 2014-03-06T17:14:02Z

Thanks for the tip, @shane-tomlinson!
I filed a downstream bug at mikeedwards/po2json#23 but we'd need the bug fixed there, and then grunt-po2json updated to use the fixed po2json version.

I did a bit more poking and maybe found a workaround. Shane mentioned "expects character encoding" but I couldn't find any params we could pass to set that, so I searched the po2json repo for 'charset' and noticed this in their .po file:

"Content-Type: text/plain; charset=UTF-8\n"

But if I look at our Hebrew locale I see the following:

"Language: he\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain;\n"
"Content-Transfer-Encoding: 8bit\n"

I did a few quick tests locally and it seems changing the Content-Type charset explicitly from "Content-Type: text/plain;\n" to "Content-Type: text/plain; charset=UTF-8\n" works.
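If we end up patching the headers in bulk, something along these lines might do it. This is only a sketch: `addUtf8Charset` is a hypothetical name, and it assumes the header line looks exactly like the ones quoted above (with or without the trailing semicolon) and that the file uses LF line endings:

```javascript
// Hypothetical helper: add charset=UTF-8 to a Content-Type header line
// that has no charset. Lines that already declare a charset are left alone.
function addUtf8Charset(poText) {
  return poText.replace(
    /^"Content-Type: text\/plain;?\s*\\n"$/m,
    '"Content-Type: text/plain; charset=UTF-8\\n"'
  );
}

// Header fragment modelled on the Hebrew locale quoted above.
const before = [
  '"Language: he\\n"',
  '"MIME-Version: 1.0\\n"',
  '"Content-Type: text/plain;\\n"',
  '"Content-Transfer-Encoding: 8bit\\n"',
  ''
].join('\n');

const after = addUtf8Charset(before);
console.log(after);
```

Running it a second time on its own output changes nothing, so it should be safe to apply across every locale, including files that already have the charset.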

Not sure how to fix this in our source. I can certainly submit a big PR in the mozilla/fxa-content-server-l10n repo if adding the charset=UTF-8 solves everything, but I'm not sure if the .po files are overwritten or generated or if that is the correct place. But if I/we can fix it, that'd be 1000x easier than trying to get the po2json or grunt-po2json modules patched.

mikeedwards · 2014-03-06T17:34:59Z

Thanks for finding this po2json bug! I'll definitely merge in any patches you guys come up with for this.

pdehaan · 2014-03-06T17:57:00Z

Thanks @mikeedwards. I'm not sure if the fix is as simple as adding .toString() to the parse() buffer in /lib/parse.js, or if that could break other things. It looks like adding an explicit charset to our .po files may work for us here.
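To make the .toString() idea concrete, here is a small sketch of normalizing the input before parsing. `toPoText` is a hypothetical name and this is not the actual parse() in po2json; it only shows that Buffer's toString() defaults to UTF-8 decoding, so a Buffer and a string end up as the same text:

```javascript
// Hypothetical normalizer: accept either a Buffer or a string and always
// hand a decoded string to the rest of the parser.
function toPoText(input, encoding = 'utf8') {
  return Buffer.isBuffer(input) ? input.toString(encoding) : String(input);
}

const fromBuffer = toPoText(Buffer.from('msgstr "Zur\u00FCck"', 'utf8'));
const fromString = toPoText('msgstr "Zur\u00FCck"');

console.log(fromBuffer === fromString); // true
```

The risk mentioned above still applies: this would decode a file that really is iso-8859-1 incorrectly unless the encoding is passed in.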

zaach · 2014-03-06T18:16:51Z

@mathjazz Does verbatim set or overwrite the charset in .po files? Or can we set those ourselves and trust they'll remain so?

mikeedwards · 2014-03-06T18:18:16Z

Ah, ok, good to know, @pdehaan . If that seems like the best route for you to take (vs. adding the .toString() fix upstream), I'll try and make note of that in the po2json docs so people are aware of their .po file encoding.

mathjazz · 2014-03-06T23:00:08Z

@zaach Verbatim does not change the charset. We should stick to UTF-8.

pdehaan · 2014-03-06T23:12:01Z

@mathjazz, So, should I add the "Content-Type: text/plain; charset=UTF-8\n" string in the https://github.com/mozilla/fxa-content-server-l10n .po files?
Currently it only says "Content-Type: text/plain\n".

pdehaan · 2014-03-06T23:48:51Z

Curious, it looks like the .pot files have the charset defined.

mathjazz · 2014-03-07T16:51:02Z

@pdehaan Yes, see the example of a working file here:
https://github.com/mozilla/zamboni/blob/master/locale/de/LC_MESSAGES/javascript.po

Please let me know when you're planning to update the files in the repo, so I'll also update them in Verbatim.

pdehaan · 2014-03-07T18:35:45Z

I have a PR that I can submit today. I'll just have to double-check whether I used "utf-8" or "UTF-8" (if we care).

I'll also need to ping @zaach on why I was seeing the charset defined in the .pot files but not the .po files. Not sure if I'm misunderstanding some part of the workflow, or if we need to rerun the scripts that extract the strings and then regenerate and merge the .po files.

pdehaan · 2014-03-07T19:27:21Z

PR submitted; mozilla/fxa-content-server-l10n#2

mathjazz · 2014-03-07T20:38:05Z

Verbatim updated.

pdehaan · 2014-03-10T21:49:49Z

Closing as fixed.

pdehaan added i18n labels Mar 5, 2014

pdehaan closed this as completed Mar 10, 2014

mikeedwards mentioned this issue Mar 26, 2014

Unexpected output when parsing buffers mikeedwards/po2json#23

Closed
