Opentype lang tag #404

adipose · 2020-06-16T21:37:44Z

libass currently supports only a two-character language tag. HarfBuzz (a dependency of libass) uses tags of up to three characters. It seems sensible for libass to support the same limit.

see https://github.com/libass/libass/blob/master/libass/ass.c line 615

and https://github.com/harfbuzz/harfbuzz/blob/master/src/hb-ot-tag-table.hh

adipose · 2020-06-16T21:56:56Z

It also seems harfbuzz may support even longer strings with subtypes.

TheOneric · 2020-06-23T17:30:40Z

Currently libass only considers ISO 639-1 codes valid. Since HarfBuzz supports more than that, allowing users to specify languages not covered by ISO 639-1 seems sensible to me. E.g. ~~Ainu ain~~ (doesn't seem to be recognized by HarfBuzz; Lower Sorbian dsb is an actually supported example).
Afaik the Language tag is a libass extension, so there are no compatibility concerns.

Some things to consider:
Harfbuzz uses IETF BCP 47 codes to denote languages and their variants. Not all libraries can understand this. It's probably a good idea to advise users to use two-letter ISO 639-1 codes when possible, as those are more widely understood.

E.g. PR #372 needs ISO 639-1 to work with libunibreak and should probably also chop off everything from the first dash (-) onwards (the variants) before passing the code on to it.

adipose · 2020-06-25T17:43:30Z

The directwrite api also uses the "locale" parameter, which may have limitations. I believe libass will pass the Language tag to this field, but I'm not entirely sure they are BCP 47 codes.

TheOneric · 2020-06-28T23:56:06Z

> The directwrite api also uses the "locale" parameter, which may have limitations. I believe libass will pass the Language tag to this field, but I'm not entirely sure they are BCP 47 codes.

libass currently does not pass any language to DirectWrite.
I'm not familiar with DirectWrite, but I don't think there's any benefit in passing language info to it, as afaik libass only uses it to get the correct font/fallback and not for glyph rendering.

adipose · 2020-06-29T18:39:52Z

> > The directwrite api also uses the "locale" parameter, which may have limitations. I believe libass will pass the Language tag to this field, but I'm not entirely sure they are BCP 47 codes.

> libass currently does not pass any language to directwrite.
> I'm not familiar with directwrite, but I don't think there's any benefit in passing language info to it, as afaik – by libass – it is only used to get the correct font/fallback and not glyph rendering.

Harfbuzz, which is a dependency, does via GetGlyphs:

https://github.com/harfbuzz/harfbuzz/blob/master/src/hb-directwrite.cc

I think that locale would/could originate from libass Language tag, probably.

khaledhosny · 2020-06-29T20:44:51Z

HarfBuzz does not call DirectWrite under normal circumstances; the code you cite is for the DirectWrite shaper, which is a testing shaper and has to be called explicitly.

adipose · 2020-06-30T00:55:58Z

bows head in shame

OK, I guess it is not an issue :)

jnqnfe · 2021-06-16T02:26:35Z

Hi. I've landed here having just worked on getting VLC to make use of the language attribute to better detect the language of ASS/SSA subtitle files, for display in the subtitle menu and elsewhere. (See here). I've also been updating VLC's ISO-639 lookup table (see here), which includes adding many new entries that lack 2-char ISO-639-1 codes (the set has been obtained from glibc FYI, as VLC's older copy had been).

I ended up here having noticed that the language property of ASS/SSA files is currently limited to 2-char ISO-639-1 codes only, as pointed out in this issue. I agree that it would be sensible to expand this to allow for languages that only have 3-char ISO-639-2 codes. If creation tools can start using such 3-char codes for languages lacking 2-char codes, this will allow them to get identified in VLC through the work just mentioned, alongside any potential benefit gained from its actual purpose in libass of course.

BTW, I noticed that the strndup(p, 2) handling in process_info_line() is used without any length validation. Any mistaken attempt to use an ISO-639-2 code currently would result in truncation rather than rejection and thus potential misidentification where you pass it to something like harfbuzz. For instance "rup" for Aromanian would be truncated to "ru" which corresponds with Russian. Perhaps wrong lengths should be rejected?
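To illustrate the "reject rather than truncate" idea, here is a minimal sketch. `parse_lang_code` is a hypothetical helper, not an existing libass function, and it assumes the value is a bare 2- or 3-letter ISO 639 code with no BCP-47 subtags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only letter test, independent of the current locale. */
static bool is_ascii_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Hypothetical helper, not part of libass: copies a bare 2- or 3-letter
 * language code from `value` into `out` (lowercased) and returns true.
 * Any other input is rejected instead of being truncated. */
static bool parse_lang_code(const char *value, char out[4])
{
    assert(value != NULL && out != NULL);

    /* Skip leading blanks, since a header value may be padded. */
    while (*value == ' ' || *value == '\t')
        value++;

    /* Measure the run of ASCII letters. */
    size_t len = 0;
    while (is_ascii_alpha(value[len]))
        len++;

    /* Only trailing whitespace may follow the code. */
    const char *rest = value + len;
    while (*rest == ' ' || *rest == '\t' || *rest == '\r' || *rest == '\n')
        rest++;

    if ((len != 2 && len != 3) || *rest != '\0')
        return false;

    for (size_t i = 0; i < len; i++) {
        char c = value[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
    }
    out[len] = '\0';
    return true;
}
```

With this shape, "rup" is accepted as "rup" and never silently becomes "ru"; inputs such as "russ" or "ru-RU" are rejected outright.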

> Some things to consider:
> Harfbuzz uses IETF BCP 47 codes to denote languages and their variants. Not all libraries can understand this. It's probably a good idea to advise users to use two-letter ISO 639-1 codes when possible, as those are more widely understood.

What I understand from taking a look at BCP-47 (from reading RFC5646) is that the initial 2/3-char component of the language tags is an ISO-639 code. Though per §2.2.1 BCP-47 only recognises a single ISO-639 code per language, using the 2-char ISO-639-1 codes for languages that have them, the ISO-639-2T code otherwise. Thus users do not have any choice between ISO-639-1 and ISO-639-2 for those that have both, and cannot use ISO-639-2B codes for those few where it differs from ISO-639-2T.

If dealing with a library that only understands ISO-639 not BCP-47, then presumably you can just split the BCP-47 string on the first dash if there is one and give the library just that first part, as I believe you've hinted at. If you enhance the language attribute to take BCP-47 rather than just allowing for 3-char ISO-639-2 codes, then that's exactly how I plan to fix my VLC work to accommodate for the possibility of encountering values with BCP-47 sub-components in future.
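Splitting on the first dash could look roughly like the sketch below. `primary_subtag` is a hypothetical name used only for illustration; the caller supplies the output buffer, and the sketch does not validate that the result is a real ISO 639 code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies everything before the first '-' of a
 * BCP-47 tag into `out`. Returns the subtag length, or 0 if the subtag
 * is empty or does not fit (out is then set to ""). */
static size_t primary_subtag(const char *tag, char *out, size_t out_size)
{
    assert(tag != NULL && out != NULL && out_size > 0);

    /* strcspn gives the length of the prefix that contains no '-'. */
    size_t len = strcspn(tag, "-");
    if (len == 0 || len >= out_size) {
        out[0] = '\0';
        return 0;
    }
    memcpy(out, tag, len);
    out[len] = '\0';
    return len;
}
```

A tag without any dash, such as "ru", passes through unchanged, so the same call works for both plain ISO 639 codes and longer BCP-47 tags.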

> E.g. PR #372 needs ISO 639-1 to work with libunibreak and should probably also chop off everything from the first dash (-) onwards (the variants) before passing the code on to it.

I took a quick look at libunibreak. It seems to compare a given language string with a handful of codes that happen to be ISO-639-1. Perhaps I've taken your "needs ISO-639-1" comment the wrong way, but I did not see any restriction in passing it longer ISO-639-2 codes, and if taking BCP-47 for the language attribute and trimming off the initial ISO-639 component, it should work just fine to pass that along to libunibreak I think, considering that only the 2-char ISO-639-1 codes for those languages it has special handling for are valid in BCP-47 per above.

Considering that BCP-47 is just a superset of the currently accepted ISO-639-1 codes, and that we can easily cut it down to an ISO-639-1/2 code for a library if needed, surely there's no problem moving forwards with the idea of changing the attribute to taking BCP-47?

TheOneric · 2021-06-17T00:32:50Z

> I took a quick look at libunibreak. It seems to compare a given language string with a handful of codes that happen to be ISO-639-1.

libunibreak also only compares up to the length of the codes it knows, so if a 3-letter code were passed, as in the rup/ru example you gave, it could mistakenly match an unrelated 2-letter code.
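A small sketch of that pitfall, under the assumption that the comparison behaves like `strncmp` over the length of the known code (this models the behaviour described here and is not libunibreak's actual source). Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed model of the comparison described above: only the first
 * strlen(known) characters of `lang` are compared. */
static bool prefix_match(const char *lang, const char *known)
{
    assert(lang != NULL && known != NULL);
    return strncmp(lang, known, strlen(known)) == 0;
}

/* Hypothetical caller-side guard: the primary subtag (everything before
 * the first '-') must have exactly the length of the known code before
 * a match is accepted. */
static bool exact_primary_match(const char *lang, const char *known)
{
    assert(lang != NULL && known != NULL);
    size_t primary_len = strcspn(lang, "-");
    return primary_len == strlen(known)
        && strncmp(lang, known, primary_len) == 0;
}
```

Under this model `prefix_match("rup", "ru")` succeeds, which is the false match in question, while `exact_primary_match("rup", "ru")` fails and still accepts "ru" and "ru-RU".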

> Though per §2.2.1 BCP-47 only recognises a single ISO-639 code per language, using the 2-char ISO-639-1 codes for languages that have them, the ISO-639-2T code otherwise.

Neat! Still, advising to use 2-letter if available can't hurt even if it may technically be superfluous.

> […] alongside any potential benefit gained from its actual purpose in libass

Its current purpose in libass is only to allow language-specific font features to be used. In the future, thanks to libunibreak, it will also be used for the optional, ASS-incompatible Unicode line-breaking mode.

> If creation tools can start using such 3-char codes for languages lacking 2-char codes, this will allow them to get identified in VLC through the work just mentioned […]

The Language: header is a libass-specific extension to ASS. I'm not aware of any ASS editors natively supporting it. (And sub authors who don't need it for correct font rendering will likely leave it blank.)

As I wrote previously, I think it makes sense to allow more than ISO 639-1, with a safety mechanism for ISO 639-1-only libs. There's no technical blocker iinm.

jnqnfe · 2021-06-17T02:35:12Z

> > I took a quick look at libunibreak. It seems to compare a given language string with a handful of codes that happen to be ISO-639-1.

> libunibreak also only compares up to the length of the codes it knows, so if a 3-letter code were passed, as in the rup/ru example you gave, it could mistakenly match an unrelated 2-letter code.

Oh right, so it does. I've misread that bit of code. Caution will certainly be needed there then.

> > Though per §2.2.1 BCP-47 only recognises a single ISO-639 code per language, using the 2-char ISO-639-1 codes for languages that have them, the ISO-639-2T code otherwise.

> Neat! Still, advising to use 2-letter if available can't hurt even if it may technically be superfluous.

I'm just being a little cautious about the language used to guide users in order to avoid misleading them into making mistakes. "Advising to use 2-letter if available" suggests to me that using 3-letter codes for those that have 2-letter codes is perfectly fine, just not recommended, whereas specifically mentioning BCP-47 and stating that essentially only 2-letter codes are valid for those that have them, otherwise ISO-639-2T, is less problematic. :)

> > […] alongside any potential benefit gained from its actual purpose in libass

> Its current purpose in libass is only to allow language-specific font features to be used. In the future, thanks to libunibreak, it will also be used for the optional, ASS-incompatible Unicode line-breaking mode.

Yes, I understand. :)

> > If creation tools can start using such 3-char codes for languages lacking 2-char codes, this will allow them to get identified in VLC through the work just mentioned […]

> The Language: header is a libass-specific extension to ASS. I'm not aware of any ASS editors natively supporting it. (And sub authors who don't need it for correct font rendering will likely leave it blank.)

Sure. It just happens to work neatly as a means of identification, better than trying to just pick out language from a portion of filenames as otherwise is done.

> As I wrote previously, I think it makes sense to allow more than ISO 639-1, with a safety mechanism for ISO 639-1-only libs. There's no technical blocker iinm.

Sure, Great. 👍

... for language identification. This info property has been supported by libass since v0.10.0. It is currently a 2-char ISO-639-1 code. libass commit adding support: libass/libass@c979365. Discussion about enhancing the attribute to support 3-char ISO-639-2 codes, possibly BCP-47: libass/libass#404

TheOneric mentioned this issue Jun 23, 2020

Use the Unicode line breaking algorithm #372

Closed

TheOneric added the request label Feb 24, 2021

TheOneric mentioned this issue Apr 5, 2022

fontselect: use language from track for fallback selection #607

Open
