Skip to content
This repository has been archived by the owner on Apr 3, 2019. It is now read-only.

Unicode problem in po to JSON conversion #677

Closed
zaach opened this issue Mar 5, 2014 · 17 comments
Closed

Unicode problem in po to JSON conversion #677

zaach opened this issue Mar 5, 2014 · 17 comments

Comments

@zaach
Copy link
Contributor

zaach commented Mar 5, 2014

See @pdehaan's comment: #676 (comment)

The weird characters are present in the generated JSON found in app/i18n.


Another example, from the Back/Zurück button on the /legal/terms page:

firefox_accounts__nutzungsbedingungen

Expected:

#: app/scripts/templates/change_password.mustache:12
#: app/scripts/templates/delete_account.mustache:9
#: app/scripts/templates/pp.mustache:4
#: app/scripts/templates/reset_password.mustache:6
#: app/scripts/templates/reset_password.mustache:8
#: app/scripts/templates/tos.mustache:4
msgid "Back"
msgstr "Zurück"
@pdehaan
Copy link
Contributor

pdehaan commented Mar 6, 2014

No clue what's going on. I tried a different grunt-po-json module and it seemed to translate the Hebrew and German files fine. This may be some issue in the grunt-po2json module or maybe some additional flag we need to set.

It doesn't look like an issue w/ the output_transform method, but it looks like that may be unnecessary code with a recently added stringOnly flag: https://github.com/rkitamura/grunt-po2json#stringonly (ping @shane-tomlinson)

The closest I've gotten to fixing this is explicitly setting the "utf8" charset on the fs.readFileSync() call in ./node_modules/grunt-po2json/node_modules/po2json/lib/parseFileSync.js:12:

var data = fs.readFileSync(fs.realpathSync(fileName), 'utf8');

After hacking that, it seems that Hebrew started parsing fine. Not sure if there is some weird buffer bug somewhere in the po2json dependency, unless there is a weird flag I'm missing.

firefox_accounts__datenschutzerklrung

Figure 1: We need to go deeper!

@zaach
Copy link
Contributor Author

zaach commented Mar 6, 2014

@pdehaan Looks like a bug in po2json for sure. When there's debate, use utf8.

@shane-tomlinson
Copy link

@zaach, @pdehaan - funny, I wrote a blog post about this a month ago - https://shanetomlinson.com/2014/l10n-gotcha-missing-charset-in-content-type-header/

@shane-tomlinson
Copy link

@zaach, @pdehaan - more background - If a charset is not specified, po2json expects the character encoding to be iso-8859-1. Since we are using utf8, we get garbage in the json.

@pdehaan
Copy link
Contributor

pdehaan commented Mar 6, 2014

Thanks for the tip, @shane-tomlinson!
I filed a downstream bug at mikeedwards/po2json#23 but we'd need the bug fixed there, and then grunt-po2json updated to use the fixed po2json version.

I did a bit more poking and maybe found a workaround. Shane mentioned "expects character encoding" but I couldn't find any params we could pass to set that, so I searched the po2json repo for 'charset' and noticed this in their .po file:

"Content-Type: text/plain; charset=UTF-8\n"

But if I look at our Hebrew locale I see the following:

"Language: he\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain;\n"
"Content-Transfer-Encoding: 8bit\n"

I did a few quick tests locally and it seems changing the Content-Type charset explicitly from "Content-Type: text/plain;\n" to "Content-Type: text/plain; charset=UTF-8\n" works.

Not sure how to fix this in our source. I can certainly submit a big PR in the mozilla/fxa-content-server-l10n repo if adding the charset=UTF-8 solves everything, but I'm not sure if the .po files are overwritten or generated or if that is the correct place. But if I/we can fix it, that'd be 1000x easier than trying to get the po2json or grunt-po2json modules patched.

@mikeedwards
Copy link

Thanks for finding this po2json bug! I'll definitely merge in any patches you guys come up with for this.

@pdehaan
Copy link
Contributor

pdehaan commented Mar 6, 2014

Thanks @mikeedwards. I'm not sure if the fix is as simple as adding .toString() to the parse() buffer in /lib/parse.js, or if that could break other things. It looks like adding an explicit charset to our .po files may work for us here.

@zaach
Copy link
Contributor Author

zaach commented Mar 6, 2014

@mathjazz Does verbatim set or overwrite the charset in .po files? Or can we set those ourselves and trust they'll remain so?

@mikeedwards
Copy link

Ah, ok, good to know, @pdehaan . If that seems like the best route for you to take (vs. adding the .toString() fix upstream), I'll try and make note of that in the po2json docs so people are aware of their .po file encoding.

@mathjazz
Copy link

mathjazz commented Mar 6, 2014

@zaach Verbatim does not change the charset. We should stick to UTF-8.

@pdehaan
Copy link
Contributor

pdehaan commented Mar 6, 2014

@mathjazz, So, should I add the "Content-Type: text/plain; charset=UTF-8\n" string in the https://github.com/mozilla/fxa-content-server-l10n .po files?
Currently it only says "Content-Type: text/plain\n".

@pdehaan
Copy link
Contributor

pdehaan commented Mar 6, 2014

Curious, it looks like the .pot files have the charset defined.

@mathjazz
Copy link

mathjazz commented Mar 7, 2014

@pdehaan Yes, see the example of a working file here:
https://github.com/mozilla/zamboni/blob/master/locale/de/LC_MESSAGES/javascript.po

Please let me know when you're planning to update the files in the repo, so I'll also update them in Verbatim.

@pdehaan
Copy link
Contributor

pdehaan commented Mar 7, 2014

I have a PR that I can submit today, I'll just have to double check if i used "utf-8" or "UTF-8" (if we care).

I'll also need to ping @zaach on why I was seeing the charset defined in the .pot files but not the .po files. Not sure if I'm misunderstanding some part of the workflow, or if we need to rerun the extract strings and regenerate and merge .PO files scripts.

@pdehaan
Copy link
Contributor

pdehaan commented Mar 7, 2014

PR submitted; mozilla/fxa-content-server-l10n#2

@mathjazz
Copy link

mathjazz commented Mar 7, 2014

Verbatim updated.

@pdehaan
Copy link
Contributor

pdehaan commented Mar 10, 2014

Closing as fixed.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants