BCP47<->babel/polyglossia interface #961

plk · 2020-01-22T12:15:02Z

We need a way to go between BCP47 tags and babel/polyglossia language names/options in a reliable and cross-package way. The multiscript support in biblatex/biber 4.0 will rely on this heavily. Currently these mappings are hard-coded into the experimental biber/biblatex branches (with some user-facing support for redefining the BCP47->babel/polyglossia mapping).

Some data format that TeX and biber can parse and which is findable with kpsewhich would be ideal. Even a file of TeX data macros would be fine as biber could parse this. Even though multiscript support in biblatex/biber is currently experimental, the lack of such a package will be a blocker eventually.

CC'ing @jspitz, @moewew, @reutenauer, @bastien-roucaries, @jbezos

josephwright · 2020-01-22T12:50:06Z

I'll raise with the team: I'd like to see this handled at the expl3 level, really

plk · 2020-01-22T12:56:08Z

@josephwright - that would be ideal and if there is any way to help with this, I will make some time. The multiscript branch of biblatex currently uses xparse a lot, as it is the only sensible way of managing expandability and default values.

reutenauer · 2020-01-22T14:03:50Z

On Wed, Jan 22, 2020 at 04:50:06AM -0800, Joseph Wright wrote: I'll raise with the team: I'd like to see this handled at the `expl3` level, really

I won’t contradict you here :-) Arthur

jbezos · 2020-01-22T15:00:02Z

One of the purposes of the files under locale in babel is precisely this one. What do you need more precisely?

plk · 2020-01-22T15:16:42Z

biber/biblatex 4.0 will use BCP47 tags to specify multiscript alternates of fields and will do a lot of automatic language switching based on this for fields/parts of fields and for this, we need a reliable mapping of BCP47 <-> babel/polyglossia language names. Ideally, babel and polyglossia would allow language selection via BCP47 tags (I know polyglossia is looking at this but I'm not sure if this has been raised for babel?) as this would standardise language specification across packages but I realise that this is not trivial and so a mapping package would be the next best thing. This would in general help LaTeX integration to other internationalisation systems since BCP47 is an IETF standard.

The tricky part is that babel tends to have variants of languages as separate language names and polyglossia tends to use options on top of generic language names. We would need the mapping to be agnostic about babel/polyglossia and ensure that BCP47 tags map to generally "the same" language in both (modulo the details of language support differences, naturally).
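The difference described above (separate language names in babel, options on a generic name in polyglossia) can be sketched as a small lookup table. This is only an illustration in Python: the `Target` class, the `MAPPING` table and both helper functions are hypothetical names, and the four entries are examples rather than an authoritative or complete mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    babel: str           # babel language name
    polyglossia: str     # generic polyglossia language name
    options: tuple = ()  # polyglossia options as (key, value) pairs


# Illustrative entries only; a real table would come from a shared data file.
MAPPING = {
    "de-1901": Target("german", "german", (("spelling", "old"),)),
    "de-1996": Target("ngerman", "german", (("spelling", "new"),)),
    "en-GB": Target("british", "english", (("variant", "british"),)),
    "en-US": Target("american", "english", (("variant", "american"),)),
}


def babel_name(tag: str) -> str:
    """Return the babel language name for a BCP47 tag."""
    return MAPPING[tag].babel


def polyglossia_call(tag: str) -> str:
    """Return a polyglossia \\setotherlanguage call for a BCP47 tag."""
    target = MAPPING[tag]
    opts = ",".join(f"{key}={value}" for key, value in target.options)
    if opts:
        return f"\\setotherlanguage[{opts}]{{{target.polyglossia}}}"
    return f"\\setotherlanguage{{{target.polyglossia}}}"
```

Both helpers read from the same table, so one tag always resolves to "the same" language on either side, whatever the package-specific spelling of that language is.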

jbezos · 2020-01-22T15:42:05Z

Sure, although I had other priorities, and I left this possibility to a revamped set of selectors based more strictly on the concept of ‘locale’ (vs. the somewhat fuzzy concept of ‘language’), and including extension subtags and private use subtags. Now that babel has bidi, CJK line breaking, non-standard hyphenation and automatic font switching, which were the main priorities, I could tackle the task of an improved interface.

plk · 2020-01-22T15:53:04Z

Yes, babel has really improved recently, which is very nice. If we could all agree on BCP47 locale tags, that would be a huge step forward in interoperability.

jbezos · 2020-01-22T16:12:36Z

If I have a list from bcp47 to language/variant pairs in polyglossia, I could add this information to the ini files, to improve the interoperability (and in addition improve the data for babel, because I've seen some mistakes). Note the ini files are not meant exclusively for babel, but also as a data repository available to other packages.

pauloney · 2020-01-22T16:51:45Z

@javier, the list is in the main Polyglossia track for the implementation of BCP-47: reutenauer/polyglossia#226 right at the top and on my first reply.


jspitz · 2020-01-22T16:58:45Z

Ideally, babel and polyglossia would allow language selection via BCP47 tags (I know polyglossia is looking at this

Yes, polyglossia 1.47 (to be released in the next days) features this already.

jspitz · 2020-01-22T17:01:47Z

If I have a list from bcp47 to language/variant pairs in polyglossia

The list is here: https://github.com/reutenauer/polyglossia/blob/master/tools/bcp47.py and in the (master) manual, sec. 2.4: https://github.com/reutenauer/polyglossia/blob/master/doc/polyglossia.tex

pauloney · 2020-01-22T17:13:11Z

@plk: ... and for this, we need a reliable mapping of BCP47 <-> babel/polyglossia language names.

I hope this is needed only for backwards compatibility. BCP-47 says that the IETF language tags should be used in the programming every time one needs to refer to the language -- for example, "ngerman" should be replaced by "de".


jbezos · 2020-01-22T17:22:10Z

@jspitz Thank you. I’ll try to have the data in the ini for 3.39. By the way, now that polyglossia is again actively maintained, we can try to find points in common. See for example my preliminary thoughts on a user interface in https://tug.org/pipermail/kadingira/2018-February/thread.html .

jspitz · 2020-01-22T17:32:11Z

I hope this is needed only for backwards compatibility. BCP-47 says that the IETF language tags should be used in the programming every time one needs to refer to the language -- for example, "ngerman" should be replaced by "de".

No, we don't replace the old interface. We add the possibility to use BCP-47 tags alternatively. Personally, I prefer human readable language names a lot.

josephwright · 2020-01-22T17:33:33Z

@jspitz I guess it depends where you are looking: internally, and at a code level, BCP 47 is really what should be passed. At a document level 'friendly' names are fine, but I'd hope that these can be converted to BCP 47 before being used.

jspitz · 2020-01-22T17:34:24Z

@jspitz Thank you. I’ll try to have the data in the ini for 3.39. By the way, now that polyglossia is again actively maintained, we can try to find points in common. See for example my preliminary thoughts on a user interface in https://tug.org/pipermail/kadingira/2018-February/thread.html .

Arthur is still driving the boat. I'm just helping a bit in the engine room ;-)

jspitz · 2020-01-22T17:36:57Z

at a code level, BCP 47 is really what should be passed. At a document level 'friendly' names are fine, but I'd hope that these can be converted to BCP 47 before being used.

That's not planned, no (at least as far as my engagement with the matter is concerned)

josephwright · 2020-01-22T17:38:38Z

@jspitz I'm thinking expl3, of course: there we have a clear code/document level split (also I've gone with BCP 47 for \text_lowercase:nn, though at present I've not added support for splitting the language and locale).

jbezos · 2020-01-22T17:38:57Z

@plpauloney

BCP-47 says that the IETF language tags should be used in the programming every time one needs to refer to the language -- for example, "ngerman" should be replaced by "de".

This is debatable in the context of LaTeX, because it's a ‘public’ markup language. It could be true for the internals, at the programming level, which are not exposed to the user (and which most LaTeX users don't know at all), but definitely if a user must select a language in an HTML form, they will see the name, even if internally it's mapped to the bcp47 code. Furthermore, names provide an additional abstraction layer which can be useful.

jbezos · 2020-01-22T17:41:58Z

Arthur is still driving the boat. I'm just helping a bit in the engine room ;-)

I had to try it ;-).

reutenauer · 2020-01-22T18:48:50Z

On Wed, Jan 22, 2020 at 09:41:59AM -0800, Javier Bezos wrote: > Arthur is still driving the boat. I'm just helping a bit in the engine room ;-) I had to try it ;-).

I’m all for making Babel and Polyglossia converge, as you should be well aware.

pauloney · 2020-01-22T19:02:00Z

On Wed, Jan 22, 2020 at 9:36 AM Jürgen Spitzmüller ***@***.***> wrote: at a code level, BCP 47 is really what should be passed. At a document level 'friendly' names are fine, but I'd hope that these can be converted to BCP 47 before being used. That's not planned, no (at least as far as my engagement with the matter is concerned)

This is a pity, because it is not BCP-47 compliance.

reutenauer · 2020-01-22T19:06:59Z

On Wed, Jan 22, 2020 at 09:13:12AM -0800, Paulo Ney de Souza wrote: I hope this is needed only for backwards compatibility. BCP-47 says that the IETF language tags should be used in the programming every time one needs to refer to the language

I never heard of anything like that. Do you have a reference? Backward compatibility is of course essential in any case and we won’t break the old interface arbitrarily.

for example, "ngerman" should be replaced by "de".

You of course mean `[de-1901]`.

pauloney · 2020-01-22T19:08:27Z

On Wed, Jan 22, 2020 at 9:38 AM Javier Bezos ***@***.***> wrote: @plpauloney BCP-47 says that the IETF language tags should be used in the programming every time one needs to refer to the language -- for example, "ngerman" should be replaced by "de". This is debatable in the context of LaTeX, because it's a ‘public’ markup language. It could be true for the internals, at the programming level, which are not exposed to the user (and which most LaTeX users don't know at all), but definitely if a user must select a language in an HTML form, they will see the name, even if internally it's mapped to the bcp47 code. Furthermore, names provide an additional abstraction layer which can be useful.

@javier this is a misunderstanding about BCP-47. In fact, rather the opposite of what you affirm is true. The choice of language in HTML should be done in BCP-47 -- read here: https://www.w3.org/International/articles/language-tags/ Paulo Ney

jbezos · 2020-01-22T21:05:06Z

@plk Just a few questions. Which user interface do you have in mind? How will the tags be used? As alternative names or by means of specific macros?

plk · 2020-01-22T21:48:29Z

It will be possible to have multiple bibtex fields with a "form" and a "language" which will end up in the .bbl:

@BOOK{test1,
  TITLE                     = {Some Title},
  TITLE_translation_de-1996 = {Titel},
  TITLE_translation_fr      = {Titre}
}

biblatex reads the .bbl and so knows about these "alternates" of the field and will automatically switch babel/polyglossia language accordingly when such fields are printed. In order to do this, it needs to be able either to specify languages to babel/polyglossia by BCP47 tag or to map to babel/polyglossia language names with some TeX macro (which is what happens in the experimental biblatex/biber code, but with hard-coded mappings currently).
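As a rough illustration of the field naming in the example entry, a field name can be split into the base field, the "form" and the language tag. This is a Python sketch that assumes the `FIELD_form_tag` layout shown in the example is the only layout; `split_field` is a hypothetical helper and not part of biber or biblatex, whose actual parsing may differ.

```python
def split_field(name: str):
    """Split 'TITLE_translation_de-1996' into (field, form, langtag).

    Missing parts are returned as None, so a plain 'TITLE' yields
    ('title', None, None).
    """
    # Split at most twice so the language tag keeps its own hyphens intact.
    parts = name.split("_", 2)
    field = parts[0].lower()
    form = parts[1] if len(parts) > 1 else None
    langtag = parts[2] if len(parts) > 2 else None
    return field, form, langtag
```

The language tag returned here is what would then have to be turned into a babel or polyglossia language before the field is printed.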

Another aspect of this is that users specify languages in biblatex by babel/polyglossia names and biber needs to convert these (they appear in the .bcf) to BCP47 tags because the Unicode Collation library it uses specifies CLDR tailoring by BCP47 tag.

So, there is a need to be able to go both ways (and these two directions need to cohere so a complete round trip results in the same tag/language name).
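The round-trip requirement can be checked mechanically. The sketch below, in Python with hypothetical names (`invert`, `TAG_TO_BABEL`, `BABEL_TO_TAG`, `round_trips`) and three example entries, builds the reverse mapping from the forward one and refuses to do so when two tags point at the same language name, since that would make the return direction ambiguous.

```python
def invert(forward: dict) -> dict:
    """Build the name -> tag mapping, rejecting ambiguous forward mappings."""
    inverse = {}
    for tag, name in forward.items():
        if name in inverse:
            raise ValueError(
                f"{name!r} is the target of both {inverse[name]!r} and {tag!r}"
            )
        inverse[name] = tag
    return inverse


# Illustrative entries only.
TAG_TO_BABEL = {"de-1901": "german", "de-1996": "ngerman", "fr": "french"}
BABEL_TO_TAG = invert(TAG_TO_BABEL)


def round_trips(tag: str) -> bool:
    """True if tag -> babel name -> tag gives back the original tag."""
    return BABEL_TO_TAG[TAG_TO_BABEL[tag]] == tag
```

Deriving one direction from the other, rather than maintaining two separate tables, is one simple way to keep the two directions coherent.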

pauloney · 2020-01-22T21:56:18Z

@Arthur There are TWO main gains from introducing a standard like BCP-47 in packages used by LaTeX.

First is the interoperability between packages -- and this has to do with the internals and the use of BCP-47 in the guts of each package. There are great benefits that can be derived by cross-feeding between Babel and Polyglossia, but if they continue to name their files

german.ini <---> de-1901.ini
portugues.ini <---> pt-BR.ini
zh-cmn-Hans-CN.ini <---> Chinese_Mandarin_Simplified_script_as_used_in_China.ini

there is very little that can be accomplished. Literally, the Language Tag should be used wherever a Language Tag is needed. For example, a Babel manual written in Serbian using Cyrillic script should be appropriately named Babel-ver7.01-sr-Cyrl.pdf, and the same manual in German should be Babel-ver7.01-de.pdf (and that is an extreme example).

The other one is the User Interface. Users should be able to enter something as simple as

... the hero \text-ru{Светлана Евгеньевна Савицкая} known in Japan as \text-ja{スベトラーナ・サビツカヤ} was the ...

or \text{ru}{...}, whatever interface is decided on, so that the text is appropriately typeset, syllable-separated ... and then extracted into separate files by programs like "detex", so that each one can be sent out to spellers like Aspell using the respective language. Aspell already uses BCP-47, detex is being prepared, and the missing part is the source.

Paulo Ney


pauloney · 2020-01-22T22:21:33Z

@Arthur I think that "ngerman" refers to German since 2006 and not 1901 -- but I could be wrong, things are changing very fast with Babel. One more reason to use the BCP-47 tags and not random naming of languages. Paulo Ney


jspitz · 2020-01-23T07:06:49Z

@jspitz I'm thinking expl3, of course

Yes I agree wrt expl3.

jbezos · 2020-01-23T14:36:15Z

@reutenauer ngerman is modern German, as the description line says [German support for babel (post-1996 orthography)] and @jspitz can confirm, as the current maintainer.

@pauloney (1) The current babel interface won't change. Very likely there will be another one, but even so the old one will be preserved for compatibility.

(2) The idea of identifying the version in the file name came to me when I was building the locale files, but I thought (and still think) that could make things unnecessarily complicated.

(3) The latest version of babel can map the characters of a script block to a specific locale at the Lua level. It only makes sense, of course, if you don't mix French and Spanish, or Russian and Bulgarian, but it's a first step. I think a good spell checker should be able to do the same, so that no explicit markup is needed in these clear cases.

@plk Last but not least, there are no unique mappings from a set of rules to any identifier, either a language name or a bcp47 code. While en will often be the preferred tag, there are cases where en-Latn can be a better choice (the same applies to regions). Both are legitimate.

I think there are still many loose ends and I don't want to rush, but there is good news, because I based my work for the locales on the CLDR, so it is close to what you want.

moewew · 2020-03-07T11:42:17Z

@josephwright Was there any progress on the expl3 side here that you can share with us?

I'm asking because I want to look into selecting the correct <language> in \text_uppercase:nn {<language>}. Currently on the biblatex side we work with babel identifiers. So I need a mapping to l3text's identifiers (are they BCP47?). Currently the list of identifiers I have to worry about is not particularly long and I could hard-code it, but of course it would be nicer to use a proper interface.

josephwright · 2020-03-07T18:53:42Z

@moewew Nothing yet on expl3: you might want to talk to Javier Bezos from the babel side.

The string is BCP47 at least in principle, but at present I don't have code in place to split the language and locale. As there are only about 6 languages to cover and they have simple fixed strings, you can likely special-case them.

jbezos · 2020-03-08T06:55:11Z

The string is BCP47 at least in principle

Not quite -- de-alt is not a valid code. Once TeX Live 2020 is released (or frozen), we could discuss proper codes for LaTeX along the lines of the extensions in http://www.unicode.org/reports/tr35/#Locale_Extension_Key_and_Type_Data , i.e., with a set of well-defined private extensions (e.g., -x-l-c-grosses-eszett could mean ‘local eXtension-Latex-Case-grosses-eszett’).
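To make the private-use idea concrete, here is a Python sketch that splits a tag into language, script, region, variants and the private-use subtags after `-x-`. It is a deliberately simplified reading of the BCP47 syntax: extlang subtags, other extension singletons and grandfathered tags are not handled, and no registry lookup is done, so it checks shape only and not validity. `parse_tag` is a hypothetical helper.

```python
import re

# Simplified BCP47 shape: language[-script][-region][-variant...][-x-private...]
_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)"
    r"(?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?$",
    re.IGNORECASE,
)


def parse_tag(tag: str):
    """Return the parts of a tag as a dict, or None if the shape is not handled."""
    match = _TAG.match(tag)
    if match is None:
        return None
    return {
        "language": match["language"].lower(),
        "script": match["script"].title() if match["script"] else None,
        "region": match["region"].upper() if match["region"] else None,
        "variants": [v for v in match["variants"].split("-") if v],
        "private": [p for p in (match["private"] or "").split("-") if p],
    }
```

With this, `parse_tag("de-x-l-c-grosses-eszett")` separates the language `de` from the private subtags `l`, `c`, `grosses` and `eszett`, which a package could then interpret according to whatever convention is agreed on.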

I've opened an issue in babel.

moewew · 2020-03-08T11:33:00Z

OK, for now I'll look into a manual mapping for the few cases we have.

Long-term it would be great if we could have a translator between BCP, babel and polyglossia that lives outside of babel or polyglossia (ideally in the LaTeX (3?) kernel) so that packages like biblatex don't have to deal with different implementations of the translator (making it part of the LaTeX kernel would also make it more likely that the translator is maintained).

plk · 2020-03-08T12:26:10Z

I agree that we really need a separate package for BCP47<->babel/polyglossia mapping.

@bastien-roucaries was working on something, not latex3 though:

reutenauer/polyglossia#279

viktoriasee · 2020-06-12T17:50:54Z

While the BCP47 codes already work in polyglossia they do not in babel. This should be improved. It could be used by packages like hyperxmp to set the language in PDF metadata.

plk added the labels "enhancement" and "localisation" (concerns lbx files or localisation in general) on Jan 22, 2020

jbezos mentioned this issue Jan 22, 2020

Support for BCP47 codes latex3/babel#44

Closed

njbart mentioned this issue Sep 11, 2021

Language field in the metadata exported incorrectly retorquere/zotero-better-bibtex#1921

Closed

moewew mentioned this issue Jun 10, 2024

Biblatex fails to detect languages loaded with \babelprovide #1362

Open

moewew mentioned this issue Jul 8, 2024

Change from pidgin-TeX to UTF-8 #1364

Open
