Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate translations with different XLIFF unit IDs are not maintained #224

Open
asmecher opened this issue Sep 16, 2019 · 11 comments
Open

Comments

@asmecher
Copy link
Contributor

asmecher commented Sep 16, 2019

This is a knock-on issue from #220. (I hope you won't regret adding that already!)

An XLIFF file may contain several units with the same translation content... (silly example but you get the point)

<unit id="am.permanently">
  <segment>
    <source>am</source>
    <target>soy</target>
  </segment>
</unit>
<unit id="am.temporarily">
  <segment>
    <source>am</source>
    <target>estoy</target>
  </segment>
</unit>

However, when loading the translation from the XLIFF file, only one gets loaded into $translations. This is because the two will have the same ID, based only on the source text ("am"), without considering the unit ID.

Thus trying to save the loaded XLIFF again will drop one of the two entries.

@asmecher
Copy link
Contributor Author

@oscarotero, I am happy to code up a solution, but as with #220 I'd like a little guidance about the best option because I'm not familiar with how the translation IDs are used elsewhere in the library.

@asmecher asmecher changed the title Duplicate translations with the same XLIFF unit ID are not maintained Duplicate translations with different XLIFF unit IDs are not maintained Sep 16, 2019
@oscarotero
Copy link
Member

Mmm, I don't know how to solve this.

On the one hand, the unit id is not unique, on the other hand, the source is not unique either. I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?

@asmecher
Copy link
Contributor Author

I guess the uniqueness of a translation in Xliff is a sort of combination of unit id + source?

I don't think XLIFF has any requirements of uniqueness involving the source text. The XLIFF 2.0 spec does talk about uniqueness of IDs, but that's pretty much it. For example, by my read, each unit (=Translation class instance) can be uniquely identified by a combination of unit ID and file ID (see section 3.1).

I think it'll be necessary to inform the mechanism for ID generation (currently the static generateID function) based on the format of the file being sourced.

@oscarotero
Copy link
Member

In gettext, we have the following structure:

  • domain
  • context
  • original

Right now, the equivalent in Xliff would be:

  • domain (the id of the file)
  • context (the <node category="context"> element)
  • original (the <source> element)

A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:

# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"

@asmecher
Copy link
Contributor Author

@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards? My understanding of .po files is that msgid should always be the source language, likewise source elements in XLIFF.

@oscarotero
Copy link
Member

oscarotero commented Sep 18, 2019

I was thinking in how to use these translations in the html templates. Using the unit-id as the msgid, you can get the translations by that id:

<h1><?= __('unit-id') ?></h1>

But, at the same time, if you scan this template searching for gettext entries, this translation will have "unit-id" string as the original untranslated string. So, finally, we can assume that the unit id and the source are the same thing (or at least, share the same purpose: to identify a translation).

I don't know, it's hard to fit the Xliff format into gettext without lossing data. I'm open to other possible workaround.

@asmecher
Copy link
Contributor Author

@oscarotero, we already have large parts of an internationalization toolset in place, including a caching layer that'll go in front of Gettext and tools in our PHP code and Smarty templates for getting translations. What I'm hoping to use Gettext for is specifically its XLIFF load/save features -- i.e. just the XLIFF extractor and generator, but not the Translator class. My operations on the translations always deal with the full list of translations -- i.e. foreach($translations as $translation) {...}

We're introducing Gettext in an attempt to move from our own bespoke XML files to the XLIFF standard. We're using symbolic IDs instead of English-language text in our source code, hence the use of the unit ID in XLIFF. Our goal is to facilitate the use of commodity translation tools like POEdit and Weblate, thus we need the XLIFF content to be usable within those tools (and I think using msgid or <source> with the unit ID will break them).

I think the purpose of generateID etc. is to make it easy for the Translator class to fetch a translation quickly, correct? I can see two options:

  • Disentangle the loading of the translations (the fromString method in the Extractor) from the indexing/deduplication of the translations (currently in Translations::offsetSet). Currently there's no way to load or save translation files without going through the Translations class, which has built-in resolution of what it considers to be duplicates. (Maybe the indexing/deduplication tools belong in Translator rather than Translations?)
  • Alternatively, support a richer translation ID that doesn't just consider the segment and source text. Looking up translations by source text alone could continue to return the first match, but going through a list of all translations (foreach) would no longer be missing translations.

@oscarotero
Copy link
Member

Maybe the indexing/deduplication tools belong in Translator rather than Translations?

Translation has the getId() and the static generateId() methods, so the deduplication belongs to Translation. This value is important not only to search translations but also to merge different Translations instances.

The right solution could be add a setId() method, so you could choose another different way to define the ids/deduplication and use generateId internally as a fallback. The only problem is Translations::find that currently depends on this method to find the translations. Removing or changing this method is a breaking change, and I don't want to release a major version right now. And if you define a different id, you can't find a specific translation by context + original. Using a foreach is discarded for performance reasons.

Maybe we could add the Translation::setId() method with a warning that using this method makes imposible to search the translation in a Translations instance using the context + original.

And because Xliff generator use this method to generate ids, include the warning here too.

asmecher added a commit to asmecher/Gettext that referenced this issue Sep 19, 2019
asmecher added a commit to asmecher/Gettext that referenced this issue Sep 19, 2019
asmecher added a commit to asmecher/Gettext that referenced this issue Sep 19, 2019
@asmecher
Copy link
Contributor Author

Thanks, @oscarotero, I submitted this according to the suggestions you made: #225

I think there is still one issue to resolve, as noted in the PR comment.

asmecher added a commit to asmecher/Gettext that referenced this issue Sep 19, 2019
oscarotero added a commit that referenced this issue Sep 19, 2019
#224 Allow for setting custom ID to avoid translation clobbering
@asmecher
Copy link
Contributor Author

From your earlier comment:

A possible solution could be to get the id of the unit as the msgid, instead the source, and store the source as a comment. So in a exported .po file, we could get:

# XLIFF_SOURCE: source-element
msgid "unit-id"
msgstr "translation"

My reaction was...

@oscarotero, but wouldn't that be a radical departure from either/both the PO and XLIFF standards?

...but it turns out many projects actually do use PO files written this way (with symbolic IDs in the msgid), and it's supported in translation tools like Weblate (known there as "Monolingual gettext").

Just documenting something I learned in the hopes it'll be useful to someone else :)

@ShawnMilo
Copy link

I'm maintaining XLIFF tools for a free site: https://xliff.tools/

Does the "ID Replacement" tool do what's required? If not, I'm interested in learning more about the issue and possibly implementing it.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants