Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit locale file loading #6328

Closed
asmecher opened this issue Nov 10, 2020 · 44 comments
Closed

Revisit locale file loading #6328

asmecher opened this issue Nov 10, 2020 · 44 comments
Assignees
Labels
Housekeeping:1:Todo Any dependency management or refactor that would be nice to have some day.
Milestone

Comments

@asmecher
Copy link
Member

asmecher commented Nov 10, 2020

Revisit locale file loading.

Goals:

  • Reduce the amount of custom framework code in pkp-lib related to translations.
  • Harmonize our translation toolset with Laravel (we're using more and more of it). (See "Gotchas")
  • Improve performance (using a 3rd-party implementation may make interprocess sharing of locale data easy)
  • Reduce the burden on the current (to-be-deprecated) file caching toolset (currently .po files are cached as .php)

Bonus points:

  • It would be amazing if we could get rid of AppLocale::requireComponents(...) calls, which are currently used to load each .po file when required. This might require making performance good enough that we can simply load everything.
  • A grep through the codebase and locale files to identify unused locale keys would probably trim translations down considerably. (But please keep this in a separate commit if you do this, and make the changes with a script rather than manually, as it's likely to need re-running! This has the potential to be a major time-sink so please don't prioritize it.)
  • The ability to support more flexible locale codes than we currently do would be nice. (We currently require xx_YY with an optional @charset suffix, and the current best practice appears to be to include the _YY only when necessary to distinguish between two regions, e.g. pt_BR vs. pt_PT.)
  • Choosing an approach that improves our options for plurals would be nice. (The .po file format supports this.)

Gotchas:

  • The Laravel toolset doesn't use .po files by default (JSON files are recommended for large translations). There are additional tools that provide .po support for Laravel but they are not official and would need assessment.
  • The underlying gettext toolset may not respond well to .po/.mo files changing while the server runs. Need to assess how users will be affected by a change from the current behaviour, in which a .po file is re-cached when modified with no fuss.
  • If we use gettext, then .po files need to be compiled to .mo files in a normal workflow. This is a requirement that users haven't had before (and locale file modifications aren't an uncommon job).
  • The custom locale plugin needs consideration for changes here.
  • If massive changes are required to the locale files, this would need to be merged at the right moment in the dev lifecycle, so that translators don't lose their work.
  • Any change to the translation file format would need to be compatible with Weblate.

PRs
Applications

Modules

Plugins

Missing a PR for the https://github.com/pkp/defaultTranslation

@NateWr NateWr added the Housekeeping:1:Todo Any dependency management or refactor that would be nice to have some day. label Nov 11, 2020
@asmecher asmecher changed the title Use gettext toolset more fully Revisit locale file loading Feb 4, 2021
@asmecher
Copy link
Member Author

asmecher commented Feb 4, 2021

Assigning you, @jonasraoni; please have a review and do some research, but let's talk approaches before you dive into coding.

@jonasraoni
Copy link
Contributor

  1. I had some comments written here before you edited. In general the AppLocale::requireComponents(...) was my biggest concern :)

Knowing which locale files are already loaded at a given point is not feasible, so I'm normally using the "trial and error" approach, when using a given template/class in another place, the issue rearises. A not needed require might pass unperceived, etc.

I'd say that we're safe to load everything in memory (unless we have some keys with tons of text), but I'll check the impact.
Anyway, a simple indexer + loading on-demand is already doable with our current toolset.

  1. Is it interesting to keep the locale files separated? Does it make the life of the translators easier in weblate? If not, I think I'd leave everything together to decrease the cognitive task of thinking where to add a new key 🤔

  2. Octothorpes: Since my automatic translation idea was rejected, how bad would be to detect octothorpes in the interface, and display a kind of "Term missing translation" followed by an alternative text in a "sister" language (fr_FR/fr_CA) or English and, possibly a link to contribute with a translation? 👀

@asmecher
Copy link
Member Author

asmecher commented Feb 9, 2021

Anyway, a simple indexer + loading on-demand is already doable with our current toolset.

In principle yes, but we shouldn't be writing/maintaining that in our own software when there are widely accepted standards; part of the goal of this issue is to move considerably closer to just using existing frameworks/toolsets for managing translations. We don't have so many translation strings that standard stuff like Laravel and/or gettext aren't usable.

Is it interesting to keep the locale files separated?

Most packages using Weblate only have a few components (a.k.a. separate locale files) and longer term I think we should move towards that. (Again, we'll need to make sure we stage up any comprehensive locale file changes so that they suit our release plans, but that's not to say that we shouldn't do it.) I think the ideal structure would be to have one or two locale files each for OJS/OPS/OMP, plus one or two for pkp-lib, plus one for each plugin. Reducing the number of files considerably and possibly just loading them all would help resolve the loadComponents problem.

...how bad would be to detect octothorpes in the interface, and display a kind of "Term missing translation" followed by an alternative text in a "sister" language (fr_FR/fr_CA) or English and, possibly a link to contribute with a translation?

The default translation plugin is a stepping stone towards this. Translators have often complained that it's hard to know where in the interface a term appears in order to understand what they're translating; an OJS install to explore that shows locale keys would help. The other part that's harder to solve is having a locale key (e.g. something untranslated) and needing to answer the question: where does this appear? (Without randomly exploring, that is.) It's not necessary to solve this for this issue, but since you asked :)

@jonasraoni
Copy link
Contributor

The last problem is tough...

To have some numbers in mind... Excluding the emails, we have around 1489 keys in en_US...

The first part of the problem, to have a custom OJS, will probably involve updating the __()/Smarty/Vue to output locale keys in a special way, so we can highlight the terms, sounds ok.

The second part, to reach an interface from a key, will require a great idea haha! I've left one idea below to start thinking about it, maybe a light will appear at the end of the tunnel 😫

Run the Cypress tests with a special flag, which will monitor updates to the elements (perfectly possible with the current Web APIs), looking for translated elements... Whenever something unique is found, the item will be highlighted and a screenshot with its name will be taken. Even if it misses some things, I think it might catch a lot of things.

The remaining keys, which escaped the first filter, could be manually associated with an instruction (e.g. "open the submission menu"), URL or grouped based on the place where it was found (e.g. if it was found in "submissionReview.tpl, so it must be appearing somewhere at the submission review).

@asmecher
Copy link
Member Author

asmecher commented Feb 9, 2021

^ Good ideas, but definitely outside of the core scope of this issue! I did a proof of concept for automatic screenshot generation for documentation using Travis testing, and when we come to revisiting that, we should talk about this too.

@jonasraoni
Copy link
Contributor

Follow my considerations for some items after an initial evaluation:

  • __(...) translation function (perhaps we could use Laravel's instead, which means we could get rid of this helper removal patch)

The Laravel function has basically the same arguments and under the hood it gets a "translator" object from the resolver/container, so we could override/extend this key in the container, then the patch wouldn't be needed... Otherwise, I think it's better to just namespace our translation function (and probably also rename it).

  • Improve performance (using a 3rd-party implementation may make interprocess sharing of locale data easy)

I checked the Laravel's code to load the json files and didn't see anything special... If it's too slow for us, it's possible to provide a custom loader.

  • It would be amazing if we could get rid of AppLocale::requireComponents(...) calls, which are currently used to load each .po file when required. This might require making performance good enough that we can simply load everything.

I see this is a requirement =]

  • Choosing an approach that improves our options for plurals would be nice. (The .po file format supports this.)

Yeah, the gettext format supports it out of the box. The Laravel has a less flexible system to define plurals (just specific amounts + ranges), but I've inspected the code and it has an internal embedded support for pluralization based on the locale, so it should work fine...

  • The Laravel toolset doesn't use .po files by default (JSON files are recommended for large translations). There are additional tools that provide .po support for Laravel but they are not official and would need assessment.
  • The underlying gettext toolset may not respond well to .po/.mo files changing while the server runs. Need to assess how users will be affected by a change from the current behaviour, in which a .po file is re-cached when modified with no fuss.
  • If we use gettext, then .po files need to be compiled to .mo files in a normal workflow. This is a requirement that users haven't had before (and locale file modifications aren't an uncommon job).
  • The custom locale plugin needs consideration for changes here.

The Laravel Gettext (I didn't find anything else) code isn't huge, and the commits to make it work with the newest versions of Laravel basically just required bumps in the composer file (in case the company supporting it fades away)... But internally it's using the native gettext, which has drawbacks:

  • Sticky cache of components/domains (there are workarounds based on adding timestamps 👀)
  • If an external code changes the working domain, it will be as unpredictable as the strtok.
  • The gettext isn't thread-safe (in case someone is using PHP as a module)
  • Given it just can work with one domain at a time, I think it will be a little bit complex to add/overwrite locales from other places (e.g. plugins), I just had some cumbersome ideas to address this requirement, such as passing through the available domains trying to find a match, create a unique file with the merged locales, etc.
  • About generating the .mo format, the package we're using to parse the .po files has this feature.

Due to these drawbacks I'd vote against using the native gettext in favor of Laravel (it has a native way of adding extra json paths to search for locales) 🤔

  • Any change to the translation file format would need to be compatible with Weblate.

Here's a problem for adopting Laravel. I checked the Weblate integration and it doesn't support the Laravel format. As a fallback, they have an API, but tinkering with a new integration for many repositories will probably be more time-consuming than writing code.


Summary

I didn't find a way that fits perfectly, so any solution will require some patching here and there... Before writing code, it might be worth to check if Weblate is able to provide support for the Laravel format.

If not, I think we'll have to keep using the gettext parser (which I don't see as a bad option), address the goals/bonus points and try to replace/remove some local code by external packages.

Another idea I can try to explore is to convert the gettext format to the Laravel format on-demand.

@asmecher
Copy link
Member Author

asmecher commented Aug 9, 2021

Thanks, @jonasraoni -- so if I'm following, we can adopt the Laravel __ function in place of our own, regardless of the translation storage caching mechanism, by either

  • adopting the Laravel translation toolset directly, or
  • providing a translator object in the container.

If that's correct, then I propose that we decide: Let's adopt Laravel's __ translate function in place of our own. I would prefer this to e.g. namespacing the function. I don't think our function provides anything that Laravel doesn't already provide in a more standard way.

What do you think about performance? I think this is going to drive some of our toolset choices, maybe more so than philosophical json-vs-po considerations. Apparently json_decode is pretty fast (e.g. compared to unserialize) but currently we are using a PHP file based cache, which has the additional advantage of being compiled/cached in the process by PHP. I suspect there are probably blog posts around about the performance implications of each. I'm not sure if Laravel provides any of this in its own locale file handling e.g. using the Laravel caching toolset.

The more I think about it, the more the downsides you've pointed out with PHP-native gettext (e.g. interprocess problems) are prohibitive. We'll be working with a lot of people on shared hosts without much control of their servers. So I suggest we also decide to avoid PHP-native gettext in favour of either a PHP-based parser (as we currently use) or another file format.

@jonasraoni
Copy link
Contributor

Let's adopt Laravel's __ translate function in place of our own.

I think that's fine, I didn't see problems to adopt it.

What do you think about performance?

I didn't see such optimizations in the Laravel Translator, but AFAICS we can provide a custom FileLoader to it...
Given the locale is something we'll be using and reading from the disk in every request, I think it's better to cache it in a great way.

Last time I checked, the performance of a var_dump + OPcache was great. I doubt it has changed, as it's the closest to the metal we can get out of the box. The trade-off/issues of this method are memory usage and configuration limits.

About the configuration, in order to have the best performance I think we should try to fit under the default limits of:
max_accelerated_files (10000)
memory_consumption (128MB)

Instead of counting the amount of php files that we have on our system + the ones generated by the FileCache, Smarty cache, etc. we could get the status from a heavy installation... If we're far from the defaults, then we can't expect the opcache to make miracles, but we still can try to decrease the usage (e.g. merging several locale files into one; moving not so important, but large, caches out of this method; etc).

According to #7111, I assume we're going to use Stash. I checked fast and their code is using file locks + a handmade var_export, strange that performance dropped so much. We can try to debug the source of slowness and provide a report for them.

json vs po

I prefer the po, it's more generic/technology agnostic and has community tooling, but going against the flow has its price.

ps: As I didn't write anything (I was just going to check if the Laravel Gettext works fine, but stopped in the middle of the process), those are just my assumptions after reading documentation, comments and skimming through code.

@asmecher
Copy link
Member Author

We could either use Laravel Cache or Stash; both are mature and widely adopted. Both support PSR-6, so swapping one for the other shouldn't be too bad for A/B testing. Even though we have (old) proof-of-concept code for Stash, I'm currently leaning further towards Laravel, as we're already using so much more of that toolset and adopting as much as possible of the Laravel translation/loading/caching toolset will be sensible to coders rather than a mix of Laravel and Stash (and our third-party stuff).

@Vitaliy-1
Copy link
Collaborator

Vitaliy-1 commented Aug 10, 2021

I didn't see such optimizations in the Laravel Translator, but AFAICS we can provide a custom FileLoader to it...

Laravel uses Flysystem's Filesystem to load translations in Illuminate\Translation\FileLoader. During Filesystem initialization in Illuminate\Filesystem\FilesystemManager::createFlysystem(), I see that it uses its caching mechanisms: https://flysystem.thephpleague.com/v1/docs/advanced/caching/.
(Not sure about performance benefits though)

@asmecher
Copy link
Member Author

I see that it uses its caching mechanisms

That's filesystem metadata caching, not file contents; we'd be using Laravel Cache or Stash to parse .po (or JSON) localization content only when necessary, and if we continue using .php-based file caches, to allow PHP to opcode compile and cache using those mechanisms.

(I think filesystem metadata caching might be helpful on network filesystems, but I suspect file caches will perform badly there for other reasons.)

@jonasraoni
Copy link
Contributor

Just to confirm, the __() worked fine, but as we're not bootstrapping Laravel in a standard way, the default Translator isn't even being loaded at this moment (I think it's worth to leave it functional, since Laravel extensions might need it, so I'll leave this as a "todo").

We could either use Laravel Cache or Stash...

Ok! I'll try to give preference to Laravel instead of adding a new package.

@jonasraoni
Copy link
Contributor

jonasraoni commented Aug 19, 2021

@asmecher about the pluralization...

  • Laravel uses trans_choice() to translate with plural support, and the __() follows a contract (which means I can't change its signature and add a "mixed" quantity argument, e.g. __(key, quantity, params?, locale?)). So, we can use trans_choice() or create another shortcut (__p(), __n(), ...).
  • The gettext parser doesn't offer anything to parse the plural expression, so I've written some code to "convert" it to PHP:
private function _parseExpression(string $expression): callable
{
    if (preg_match('/[^<>|%&!?:=n()\d ]/', $expression)) {
        throw new InvalidArgumentException('Invalid expression');
    }
    $pieces = explode(':', str_replace('n', '$n', $expression));
    $last = array_pop($pieces);
    $expression = '';
    foreach ($pieces as $piece) {
        $expression .= "$piece:(";
    }
    $expression .= $last . str_repeat(')', count($pieces));
    return function ($n) use ($expression) {
        return eval('return ' . $expression . ';');
    };
}

It basically just replaces n by $n and add parenthesis to the else part of the ternary operator (it's just a draft, I'll review and add an extra check to ensure there's just a single "n", etc.) and evals the expression.

I don't like using eval, but I think it's safe here as the charset is limited and the execution code comes from our own text files.
Other ideas that came to my mind:

  • Add a parser package: looks like a bazooka.
  • Hard-code the plural rules for each language in PHP: it's also ok, but I prefer to respect the gettext format.

@asmecher
Copy link
Member Author

asmecher commented Aug 19, 2021

What about changing the locale file format (cached, that is) to store plural forms 1:1 with how they're supported in PO files? That feels like a more natural solution than generating and interpreting PHP.

@jonasraoni
Copy link
Contributor

jonasraoni commented Aug 22, 2021

Hmm, I think I didn't understand very well what you said. I just wrote this function to handle complex plural rules (e.g. "Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<= 4 && (n%100<10 || n%100>=20) ? 1 : 2;").

Anyway, we can revisit this later when I open the PR.


About the cache: I setup the Laravel cache and compiled dynamically all the locale files (from OJS, pkp/lib and plugins) into a single cached resource.
But in order to use the Laravel stuff without adding extra checks, I think we'll need to increase its availability here... If you have some insights why this was written this way, it will be helpful (without inspecting deeply I guess it's just a way to avoid initializing the database/queues) 😁


For now I have a working prototype and I've addressed several items of the issue, except:

  • Check what's possible to replace by a 3rd-party implementation (I've removed some things, but I'll check whether it's possible to replace some functions by a package, e.g. https://github.com/leodido/langcode-conv https://punic.github.io)
  • Identify unused locale keys
  • Flexible locale codes (include the region only when needed)
  • Update the custom locale plugin

Then I still need to:

  • Navigate a bit to check if everything is working fine
  • Fix remaining workarounds/add comments
  • Update the tests
  • Check if a fresh installation works fine
  • Redo the above steps in OPS/OMP

@asmecher
Copy link
Member Author

Are you sure we need to parse Plural-Forms at all? The 4.3 php-gettext library used to parse the rules using eval as you proposed, but the 5.x (master) branch no longer includes that. Instead, it assumes that plural forms are listed sequentially. I'm not sure where this assumption came from in php-gettext but I'll file an issue to see whether it was on purpose or a loss of functionality during the rewrite for 5.x.

@asmecher
Copy link
Member Author

I've filed the apparently missing functionality on the php gettext library here: php-gettext/Gettext#273

The maintainer of that package has typically been quite responsive, so we'll see what they say!

@jonasraoni
Copy link
Contributor

jonasraoni commented Aug 24, 2021

After taking a look at how weird some languages build their plurals, I'm sure haha 😅

Looks like they've dropped support for their Translator class, but yeah, the idea is the same...

Given it's a generic library, I think their version:

  • Should be more cautious with the input and limit the charset like I did
  • Ensure the input is executable, by properly parsing the expression (perhaps the token_get_all() might be useful), as it's not possible to recover from bad code in eval with a try/catch (I can address this later with code or tests).

Given that I already made my choice, I'll just leave other possibilities that I considered:

  • Reuse the indexes from Laravel internals... I doubt it's going to break, but as it doesn't implement an interface, there's no responsibilities... So I prefer to not rely on it.
  • Convert to the Laravel format (e.g. replace our variables from {$var} to :var, escape some things, etc.)

@asmecher
Copy link
Member Author

@jonasraoni, Oscar (the maintainer of that toolset) has confirmed that the Translator class is still available but has hived off to another package: https://github.com/php-gettext/Translator/blob/master/src/Translator.php#L186

We should use that library rather than reinventing the wheel; if you think there are improvements to be made there, I'd suggest filing them as issues.

@jonasraoni
Copy link
Contributor

The Translator package is ok, I can create some issues later.

What about the Laravel initialization? I believe I'll probably need to modify it, so if you know about any strong reason to keep it this way, it will be helpful.

@asmecher
Copy link
Member Author

Go ahead an modify it; it wasn't necessary to initialize the container before OJS/OMP/OPS was installed in the past, but now it will be for the sake of translations.

jonasraoni added a commit to jonasraoni/pkp-lib that referenced this issue Sep 21, 2021
jonasraoni added a commit to jonasraoni/pkp-lib that referenced this issue Sep 21, 2021
- Introduced a Laravel Facade
- Removed the need of loading specific components
- Added isInstalled and isUpgrading to PKPApplication
asmecher pushed a commit to pkp/reviewReport that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/tinymce that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/staticPages that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/staticPages that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/customLocale that referenced this issue Feb 18, 2022
asmecher added a commit to pkp/pdfJsViewer that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/omp that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/omp that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/omp that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/omp that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/omp that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/omp that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/omp that referenced this issue Feb 18, 2022
asmecher pushed a commit to pkp/omp that referenced this issue Feb 18, 2022
… backwards compatibility with external plugins
@asmecher
Copy link
Member Author

asmecher commented Feb 18, 2022

@jonasraoni, I've merged almost all the PRs listed at the top, but don't forget some unaddressed comments on the following (which can be addressed after merge):

I haven't merged the following:

The custom locale plugin is also not working and needs attention: https://github.com/pkp/customLocale

jonasraoni added a commit to jonasraoni/usageStats that referenced this issue Feb 19, 2022
@jonasraoni
Copy link
Contributor

Thanks!
I just updated the PR for the usageStats (I didn't flag anything really important in the formatting commit), just a small update remained, which isn't required, but I prefer to get it merged for the sake of standardization.

I'll check the plugins soon (customLocale + defaultTranslation) and probably this side-issue: #7673.

@NateWr
Copy link
Contributor

NateWr commented Feb 21, 2022

Exciting to see this go in! Congrats @jonasraoni. 🎉

As part of the work on this, can you please review and provide some documentation on these changes? I've filed an issue on our docs repo: pkp/pkp-docs#923

Don't worry about writing perfect English. Just get the substance of the changes, as well as some code samples, into a PR and I will do the copyediting.

@jonasraoni
Copy link
Contributor

I've filed an issue against the php-gettex/translator to fix a small issue (empty translations aren't kept), which was already merged, and also a security issue through email.

@jonasraoni
Copy link
Contributor

The security issue was merged (php-gettext/Translator#5), just a reminder a update the packages later.

@jonasraoni
Copy link
Contributor

I've updated the gettext package to catch my updates and the plugins were merged, so I'll close this issue.

I've created an extra issue to handle the unused keys (#7837), I'll proceed to the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Housekeeping:1:Todo Any dependency management or refactor that would be nice to have some day.
Projects
Status: No status
Development

No branches or pull requests

4 participants