BCP47<->babel/polyglossia interface #961

Open
plk opened this issue Jan 22, 2020 · 36 comments
Labels: enhancement, localisation (concerns lbx files or localisation in general)

Comments

@plk
Owner

plk commented Jan 22, 2020

We need a way to go between BCP47 tags and babel/polyglossia language names/options in a reliable and cross-package way. The multiscript support in biblatex/biber 4.0 will rely on this heavily. Currently these mappings are hard-coded into the experimental biber/biblatex branches (with some user-facing support for redefining the BCP47->babel/polyglossia mapping).

Some data format that TeX and biber can parse and which is findable with kpsewhich would be ideal. Even a file of TeX data macros would be fine as biber could parse this. Even though multiscript support in biblatex/biber is currently experimental, the lack of such a package will be a blocker eventually.
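
As a rough illustration of the "file of TeX data macros" idea, the sketch below shows how a non-TeX tool could read such a file with a regular expression. The macro name `\DeclareBCPMapping`, its four-argument layout and the sample entries are assumptions made up for this example; no such macro or file exists in babel, polyglossia or biblatex.

```python
import re

# Hypothetical data file content. Assumed layout:
#   \DeclareBCPMapping{<BCP47 tag>}{<babel name>}{<polyglossia name>}{<polyglossia options>}
SAMPLE = r"""
% BCP47 tag, babel name, polyglossia name, polyglossia options
\DeclareBCPMapping{de-1996}{ngerman}{german}{spelling=new}
\DeclareBCPMapping{fr}{french}{french}{}
"""

# Four brace groups without nested braces (enough for flat data like this).
ENTRY = re.compile(
    r"\\DeclareBCPMapping\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}"
)


def parse_mappings(text):
    """Return {tag: (babel_name, polyglossia_name, polyglossia_options)}."""
    mappings = {}
    for line in text.splitlines():
        # Drop TeX comments; this sketch assumes the data never contains \%.
        line = line.split("%", 1)[0]
        match = ENTRY.search(line)
        if match:
            tag, babel, poly, opts = match.groups()
            mappings[tag] = (babel, poly, opts)
    return mappings


MAPPINGS = parse_mappings(SAMPLE)
```

The same file could be `\input` by TeX once the (hypothetical) macro is defined there, which is the point of choosing a macro-based format.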

CC'ing @jspitz, @moewew, @reutenauer, @bastien-roucaries, @jbezos

@plk plk added enhancement localisation concerns lbx files or localisation in general labels Jan 22, 2020
@josephwright
Collaborator

I'll raise with the team: I'd like to see this handled at the expl3 level, really

@plk
Owner Author

plk commented Jan 22, 2020

@josephwright - that would be ideal and if there is any way to help with this, I will make some time. The multiscript branch of biblatex currently uses xparse a lot as it is the only sensible way of managing expandability and default values.

@reutenauer

reutenauer commented Jan 22, 2020 via email

@jbezos

jbezos commented Jan 22, 2020

One of the purposes of the files under locale in babel is precisely this one. What do you need more precisely?

@plk
Owner Author

plk commented Jan 22, 2020

biber/biblatex 4.0 will use BCP47 tags to specify multiscript alternates of fields and will do a lot of automatic language switching based on this for fields/parts of fields and for this, we need a reliable mapping of BCP47 <-> babel/polyglossia language names. Ideally, babel and polyglossia would allow language selection via BCP47 tags (I know polyglossia is looking at this but I'm not sure if this has been raised for babel?) as this would standardise language specification across packages but I realise that this is not trivial and so a mapping package would be the next best thing. This would in general help LaTeX integration to other internationalisation systems since BCP47 is an IETF standard.

The tricky part is that babel tends to have variants of languages as separate language names and polyglossia tends to use options on top of generic language names. We would need the mapping to be agnostic about babel/polyglossia and ensure that BCP47 tags map to generally "the same" language in both (modulo the details of language support differences, naturally).
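
A minimal sketch of what a package-agnostic mapping could look like: one BCP47 tag resolves to a babel name and, separately, to a polyglossia name plus options. The table layout, the `Target` type and the `select_command` helper are hypothetical; the handful of entries is illustrative and not a complete or authoritative list.

```python
from typing import NamedTuple


class Target(NamedTuple):
    babel: str          # babel language name (variants are separate names)
    polyglossia: str    # polyglossia language name (generic)
    options: str = ""   # polyglossia options carrying the variant, if any


# Illustrative entries only.
TAG_TO_TARGET = {
    "de-1996": Target("ngerman", "german", "spelling=new"),
    "de-1901": Target("german", "german", "spelling=old"),
    "en-GB": Target("british", "english", "variant=british"),
    "fr": Target("french", "french"),
}


def select_command(tag: str, package: str) -> str:
    """Return a language selection command for the given package."""
    target = TAG_TO_TARGET[tag]
    if package == "babel":
        return rf"\selectlanguage{{{target.babel}}}"
    if package == "polyglossia":
        opts = f"[{target.options}]" if target.options else ""
        return rf"\setdefaultlanguage{opts}{{{target.polyglossia}}}"
    raise ValueError(f"unknown package: {package}")
```

With this shape, the caller only ever handles the tag; whether the variant ends up in the language name (babel) or in an option (polyglossia) is decided by the table.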

@jbezos

jbezos commented Jan 22, 2020

Sure, although I had other priorities, and I left this possibility to a revamped set of selectors based more strictly on the concept of ‘locale’ (vs. the somewhat fuzzy concept of ‘language’), and including extension subtags and private use subtags. Now that babel has bidi, CJK line breaking, non-standard hyphenation and automatic font switching, which were the main priorities, I could tackle the task of an improved interface.

@plk
Owner Author

plk commented Jan 22, 2020

Yes, babel has really improved recently, which is very nice. If we could all agree on BCP47 locale tags, that would be a huge step forward in interoperability.

@jbezos

jbezos commented Jan 22, 2020

If I have a list from bcp47 to language/variant pairs in polyglossia, I could add this information to the ini files, to improve the interoperability (and in addition improve the data for babel, because I've seen some mistakes). Note the ini files are not meant exclusively for babel, but also as a data repository available to other packages.

@pauloney
Collaborator

pauloney commented Jan 22, 2020 via email

@jspitz
Contributor

jspitz commented Jan 22, 2020 via email

@jspitz
Contributor

jspitz commented Jan 22, 2020 via email

@pauloney
Collaborator

pauloney commented Jan 22, 2020 via email

@jbezos

jbezos commented Jan 22, 2020

@jspitz Thank you. I’ll try to have the data in the ini for 3.39. By the way, now that polyglossia is again actively maintained, we can try to find points in common. See for example my preliminary thoughts on a user interface in https://tug.org/pipermail/kadingira/2018-February/thread.html .

@jspitz
Contributor

jspitz commented Jan 22, 2020

I hope this is needed only for backwards compatibility. BCP-47 says that the IETF language tags should be used in the programming every time one needs to refer to the language -- for example, "ngerman" should be replaced by "de".

No, we don't replace the old interface. We add the possibility to use BCP-47 tags alternatively. Personally, I prefer human readable language names a lot.

@josephwright
Collaborator

@jspitz I guess it depends where you are looking: internally, and at a code level, BCP 47 is really what should be passed. At a document level 'friendly' names are fine, but I'd hope that these can be converted to BCP 47 before being used.

@jspitz
Contributor

jspitz commented Jan 22, 2020

@jspitz Thank you. I’ll try to have the data in the ini for 3.39. By the way, now that polyglossia is again actively maintained, we can try to find points in common. See for example my preliminary thoughts on a user interface in https://tug.org/pipermail/kadingira/2018-February/thread.html .

Arthur is still driving the boat. I'm just helping a bit in the engine room ;-)

@jspitz
Contributor

jspitz commented Jan 22, 2020

at a code level, BCP 47 is really what should be passed. At a document level 'friendly' names are fine, but I'd hope that these can be converted to BCP 47 before being used.

That's not planned, no (at least as far as my engagement with the matter is concerned)

@josephwright
Collaborator

@jspitz I'm thinking expl3, of course: there we have a clear code/document level split (also I've gone with BCP 47 for \text_lowercase:nn, though at present I've not added support for splitting the language and locale).

@jbezos

jbezos commented Jan 22, 2020

@pauloney

BCP-47 says that the IETF language tags should be used in the programming every time one needs to refer to the language -- for example, "ngerman" should be replaced by "de".

This is debatable in the context of LaTeX, because it's a ‘public’ markup language. It could be true for the internals, at the programming level, which are not exposed to the user (and which most LaTeX users don't know at all), but definitely if a user must select a language in an HTML form, they will see the name, even if internally it's mapped to the bcp47 code. Furthermore, names provide an additional abstraction layer which can be useful.

@jbezos

jbezos commented Jan 22, 2020

Arthur is still driving the boat. I'm just helping a bit in the engine room ;-)

I had to try it ;-).

@reutenauer

reutenauer commented Jan 22, 2020 via email

@pauloney
Collaborator

pauloney commented Jan 22, 2020 via email

@reutenauer

reutenauer commented Jan 22, 2020 via email

@pauloney
Collaborator

pauloney commented Jan 22, 2020 via email

@jbezos

jbezos commented Jan 22, 2020

@plk Just a few questions. Which user interface do you have in mind? How will the tags be used? As alternative names or by means of specific macros?

@plk
Owner Author

plk commented Jan 22, 2020

It will be possible to have multiple bibtex fields with a "form" and a "language" which will end up in the .bbl:

@BOOK{test1,
  TITLE                     = {Some Title},
  TITLE_translation_de-1996 = {Titel},
  TITLE_translation_fr      = {Titre}
}

biblatex reads the .bbl and so knows about these "alternates" of the field and will automatically switch babel/polyglossia language accordingly when such fields are printed. In order to do this, it needs to be able to either specify languages to babel/polyglossia by BCP47 tag or needs to be able to map to babel/polyglossia language names with some TeX macro (which is what happens in the experimental biblatex/biber code but with hard-coded mappings currently).

Another aspect of this is that users specify languages in biblatex by babel/polyglossia names and biber needs to convert these (they appear in the .bcf) to BCP47 tags because the Unicode Collation library it uses specifies CLDR tailoring by BCP47 tag.

So, there is a need to be able to go both ways (and these two directions need to cohere so a complete round trip results in the same tag/language name).
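
The round-trip requirement can be checked mechanically. The following is only a sketch under assumed data: the table `BCP47_TO_BABEL` is illustrative (including one deliberately ambiguous entry), and the helper names are hypothetical. It derives the reverse mapping from the forward one and reports every tag that does not survive tag -> name -> tag.

```python
# Illustrative forward table; two tags intentionally share one babel name.
BCP47_TO_BABEL = {
    "de-1996": "ngerman",
    "de-1901": "german",
    "fr": "french",
    "en": "english",
    "en-Latn": "english",  # deliberately ambiguous entry
}


def invert(forward):
    """Build name -> tag, keeping the first tag listed for each name."""
    reverse = {}
    for tag, name in forward.items():
        reverse.setdefault(name, tag)
    return reverse


def round_trip_failures(forward):
    """Return the tags for which tag -> name -> tag yields a different tag."""
    reverse = invert(forward)
    return [tag for tag, name in forward.items() if reverse[name] != tag]


FAILURES = round_trip_failures(BCP47_TO_BABEL)
```

Here `FAILURES` contains only `en-Latn`: as soon as two tags share a name, one of them has to be declared the canonical tag for the reverse direction, and a shared mapping file would have to record that choice explicitly.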

@pauloney
Collaborator

pauloney commented Jan 22, 2020 via email

@pauloney
Collaborator

pauloney commented Jan 22, 2020 via email

@jspitz
Contributor

jspitz commented Jan 23, 2020 via email

@jbezos

jbezos commented Jan 23, 2020

@reutenauer ngerman is modern German, as the description line says [German support for babel (post-1996 orthography)] and @jspitz can confirm, as the current maintainer.


@pauloney (1) The current babel interface won't change. Very likely there will be another one, but even so the old one will be preserved for compatibility.

(2) The idea of identifying the version in the file name came to me when I was building the locale files, but I thought (and still think) that could make things unnecessarily complicated.

(3) The latest version of babel can map the characters of a script block to a specific locale at the Lua level. It only makes sense, of course, if you don't mix French and Spanish, or Russian and Bulgarian, but it's a first step. I think a good spell checker should be able to do the same, so that no explicit markup is needed in these clear cases.


@plk Last but not least, there are no unique mappings from a set of rules to any identifier, either a language name or a bcp47 code. While en will often be the preferred tag, there are cases where en-Latn can be a better choice (the same applies to regions). Both are legitimate.

I think there are still many loose ends and I don't want to rush, but there is good news, because I based my work for the locales on the CLDR, so it is close to what you want.

@moewew
Collaborator

moewew commented Mar 7, 2020

@josephwright Was there any progress on the expl3 side here that you can share with us?

I'm asking because I want to look into selecting the correct <language> in \text_uppercase:nn {<language>}. Currently on the biblatex side we work with babel identifiers. So I need a mapping to l3text's identifiers (are they BCP47?). Currently the list of identifiers I have to worry about is not particularly long and I could hard-code it, but of course it would be nicer to use a proper interface.

@josephwright
Collaborator

@moewew Nothing yet on expl3: you might want to talk to Javier Bezos from the babel side.

The string is BCP47 at least in principle, but at present I don't have code in place to split the language and locale. As there are only about 6 languages to cover and they have simple fixed strings, you can likely special-case.

@jbezos

jbezos commented Mar 8, 2020

The string is BCP47 at least in principle

Not quite -- de-alt is not a valid code. Once TeX Live 2020 is released (or frozen), we could discuss proper codes for LaTeX along the lines of the extensions http://www.unicode.org/reports/tr35/#Locale_Extension_Key_and_Type_Data , i.e., with a set of well defined private extensions (e.g., -x-l-c-grosses-eszett could mean ‘local eXtension-Latex-Case-grosses-eszett’).

I've opened an issue in babel.
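
To make the private-extension idea concrete, here is a small sketch that separates the standard part of a tag from its private-use subtags (everything after the `x` singleton, each subtag 1 to 8 alphanumeric characters as BCP47 requires). The function name is hypothetical, and the `l-c-grosses-eszett` layout is just the example from the comment above, not an agreed scheme. The sketch does not validate the standard part against the subtag registry.

```python
import re

# BCP47 private-use subtags: 1 to 8 ASCII letters or digits.
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def split_private_use(tag):
    """Split a tag into (standard_part, list_of_private_use_subtags)."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, []
    # The first "x" singleton starts the private-use sequence.
    idx = lowered.index("x")
    private = subtags[idx + 1:]
    if not private or not all(SUBTAG.fullmatch(s) for s in private):
        raise ValueError(f"malformed private-use sequence in {tag!r}")
    return "-".join(subtags[:idx]), private
```

A consumer that does not understand the private subtags could then fall back to the standard part alone, for example when passing a tag to a collation library.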

@moewew
Collaborator

moewew commented Mar 8, 2020

OK, for now I'll look into a manual mapping for the few cases we have.

Long-term it would be great if we could have a translator between BCP, babel and polyglossia that lives outside of babel or polyglossia (ideally in the LaTeX (3?) kernel) so that packages like biblatex don't have to deal with different implementations of the translator (making it part of the LaTeX kernel would also make it more likely that the translator is maintained).

@plk
Owner Author

plk commented Mar 8, 2020

I agree that we really need a separate package for BCP47<->babel/polyglossia mapping.

@bastien-roucaries was working on something, not latex3 though:

reutenauer/polyglossia#279

@viktoriasee

While the BCP47 codes already work in polyglossia they do not in babel. This should be improved. It could be used by packages like hyperxmp to set the language in PDF metadata.


8 participants