Opentype lang tag #404

adipose opened this issue Jun 16, 2020 · 10 comments

@adipose

adipose commented Jun 16, 2020

libass currently supports only a two-character language tag. HarfBuzz (a dependency of libass) uses tags of up to three characters. It seems advisable to support the same limit in libass.

see https://github.com/libass/libass/blob/master/libass/ass.c line 615

and https://github.com/harfbuzz/harfbuzz/blob/master/src/hb-ot-tag-table.hh

@adipose
Author

adipose commented Jun 16, 2020

It also seems HarfBuzz may support even longer strings with subtags.

@TheOneric
Member

TheOneric commented Jun 23, 2020

Currently libass only considers ISO 639-1 codes valid. Since HarfBuzz supports it, allowing users to specify languages not covered by ISO 639-1 seems sensible to me, e.g. Lower Sorbian "dsb", which HarfBuzz actually supports (Ainu "ain" doesn't seem to be recognized by HarfBuzz).
Afaik the Language tag is a libass extension – so no compatibility concerns.

Some things to consider:
Harfbuzz uses IETF BCP 47 codes to denote languages and their variants. Not all libraries can understand this. It's probably a good idea to advise users to use two-letter ISO 639-1 codes when possible, as those are more widely understood.

E.g. PR #372 needs ISO 639-1 to work with libunibreak and should probably also chop off everything from the first dash "-" onward (variants) before passing the tag to libunibreak.

@adipose
Author

adipose commented Jun 25, 2020

The DirectWrite API also takes a "locale" parameter, which may have limitations. I believe libass will pass the Language tag to this field, but I'm not entirely sure its values are BCP 47 codes.

@TheOneric
Member

> The DirectWrite API also takes a "locale" parameter, which may have limitations. I believe libass will pass the Language tag to this field, but I'm not entirely sure its values are BCP 47 codes.

libass currently does not pass any language to DirectWrite.
I'm not familiar with DirectWrite, but I don't think there's any benefit in passing language info to it, as afaik libass only uses it to get the correct font/fallback, not for glyph rendering.

@adipose
Author

adipose commented Jun 29, 2020

> > The DirectWrite API also takes a "locale" parameter, which may have limitations. I believe libass will pass the Language tag to this field, but I'm not entirely sure its values are BCP 47 codes.

> libass currently does not pass any language to DirectWrite.
> I'm not familiar with DirectWrite, but I don't think there's any benefit in passing language info to it, as afaik libass only uses it to get the correct font/fallback, not for glyph rendering.

HarfBuzz, which is a dependency, does so via GetGlyphs:

https://github.com/harfbuzz/harfbuzz/blob/master/src/hb-directwrite.cc

I think that locale would, or at least could, originate from the libass Language tag.

@khaledhosny
Contributor

HarfBuzz does not call DirectWrite under normal circumstances; the code you cite is for the DirectWrite shaper, which is a testing shaper and has to be requested explicitly.

@adipose
Author

adipose commented Jun 30, 2020

bows head in shame

OK, I guess it is not an issue :)

@jnqnfe

jnqnfe commented Jun 16, 2021

Hi. I've landed here having just worked on getting VLC to make use of the language attribute to better detect the language of ASS/SSA subtitle files, for display in the subtitle menu and elsewhere. (See here). I've also been updating VLC's ISO-639 lookup table (see here), which includes adding many new entries that lack 2-char ISO-639-1 codes (the set has been obtained from glibc FYI, as VLC's older copy had been).

I ended up here having noticed that the language property of ASS/SSA files is currently limited to 2-char ISO-639-1 codes only, as pointed out in this issue. I agree that it would be sensible to expand this to allow for languages that only have 3-char ISO-639-2 codes. If creation tools can start using such 3-char codes for languages lacking 2-char codes, this will allow them to get identified in VLC through the work just mentioned, alongside any potential benefit gained from its actual purpose in libass of course.

BTW, I noticed that the strndup(p, 2) handling in process_info_line() is used without any length validation. Any mistaken attempt to use an ISO-639-2 code currently would result in truncation rather than rejection and thus potential misidentification where you pass it to something like harfbuzz. For instance "rup" for Aromanian would be truncated to "ru" which corresponds with Russian. Perhaps wrong lengths should be rejected?
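
To illustrate the "reject rather than truncate" idea, here is a minimal sketch in C. It is not libass code: `parse_lang_code` and `is_ascii_alpha` are hypothetical names, and the choices to accept only 2 or 3 ASCII letters, to ignore surrounding whitespace and to lowercase the result are assumptions made for the example.

```c
#include <stdlib.h>
#include <string.h>

/* ASCII-only letter test; avoids the locale dependence of isalpha(). */
static int is_ascii_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/*
 * Hypothetical helper, not part of libass: accept a Language value only if
 * it is exactly 2 or 3 ASCII letters, ignoring surrounding whitespace.
 * Returns a newly allocated lowercase copy (caller frees), or NULL if the
 * value is rejected or allocation fails.
 */
static char *parse_lang_code(const char *value)
{
    size_t len = 0, i;
    const char *rest;
    char *code;

    if (!value)
        return NULL;

    /* Skip leading blanks. */
    while (*value == ' ' || *value == '\t')
        value++;

    /* Count the run of letters at the start of the value. */
    while (is_ascii_alpha(value[len]))
        len++;

    /* Anything other than trailing whitespace after the letters is an error. */
    for (rest = value + len; *rest; rest++)
        if (!strchr(" \t\r\n", *rest))
            return NULL;

    /* Reject instead of truncating: "rup" must never silently become "ru". */
    if (len != 2 && len != 3)
        return NULL;

    code = malloc(len + 1);
    if (!code)
        return NULL;
    for (i = 0; i < len; i++)
        code[i] = (value[i] >= 'A' && value[i] <= 'Z')
                      ? (char) (value[i] - 'A' + 'a')
                      : value[i];
    code[len] = '\0';
    return code;
}
```

With this sketch, "rup" is kept as "rup", while values such as "russian" or "ru1" are rejected outright instead of being cut down to "ru".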

> Some things to consider:
> Harfbuzz uses IETF BCP 47 codes to denote languages and their variants. Not all libraries can understand this. It's probably a good idea to advise users to use two-letter ISO 639-1 codes when possible, as those are more widely understood.

What I understand from taking a look at BCP-47 (from reading RFC5646) is that the initial 2/3-char component of the language tags is an ISO-639 code. Though per §2.2.1 BCP-47 only recognises a single ISO-639 code per language, using the 2-char ISO-639-1 codes for languages that have them, the ISO-639-2T code otherwise. Thus users do not have any choice between ISO-639-1 and ISO-639-2 for those that have both, and cannot use ISO-639-2B codes for those few where it differs from ISO-639-2T.

If dealing with a library that only understands ISO-639, not BCP-47, then presumably you can just split the BCP-47 string on the first dash, if there is one, and give the library just that first part, as I believe you've hinted at. If you enhance the language attribute to take BCP-47 rather than just allowing 3-char ISO-639-2 codes, then that's exactly how I plan to fix my VLC work to accommodate the possibility of encountering values with BCP-47 sub-components in future.
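
The split on the first dash could look roughly like the following C sketch. `primary_subtag` is a hypothetical name, not a libass or VLC function, and the sketch assumes the caller supplies the output buffer; private-use tags such as those beginning with "x-" are not treated specially.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the part of a BCP 47 tag that precedes the
 * first '-' into out, whose capacity out_size includes the terminating NUL.
 * Returns the length of the copied subtag, or 0 if the arguments are
 * invalid, the subtag is empty, or it does not fit into out.
 */
static size_t primary_subtag(const char *tag, char *out, size_t out_size)
{
    size_t len;

    if (!tag || !out || out_size == 0)
        return 0;

    /* Length of the initial segment that contains no '-'. */
    len = strcspn(tag, "-");
    if (len == 0 || len >= out_size) {
        out[0] = '\0';
        return 0;
    }

    memcpy(out, tag, len);
    out[len] = '\0';
    return len;
}
```

For "sr-Latn-RS" this yields "sr", and a tag without any dash, such as "dsb", is passed through unchanged.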

> E.g. PR #372 needs ISO 639-1 to work with libunibreak and should probably also chop off everything from the first dash "-" onward (variants) before passing the tag to libunibreak.

I took a quick look at libunibreak. It seems to compare a given language string with a handful of codes that happen to be ISO-639-1. Perhaps I've taken your "needs ISO-639-1" comment the wrong way, but I did not see any restriction on passing it longer ISO-639-2 codes. If the language attribute takes BCP-47 and we trim it down to the initial ISO-639 component, passing that along to libunibreak should work just fine, I think, since per the above only the 2-char ISO-639-1 codes are valid in BCP-47 for the languages it handles specially.

Considering that BCP-47 is just a superset of the currently accepted ISO-639-1 codes, and that we can easily cut it down to an ISO-639-1/2 code for a library if needed, surely there's no problem moving forwards with the idea of changing the attribute to taking BCP-47?

@TheOneric
Member

> I took a quick look at libunibreak. It seems to compare a given language string with a handful of codes that happen to be ISO-639-1.

libunibreak also only compares up to the length of the codes it knows, so if a 3-letter code were to be passed (as in the "rup" vs. "ru" example you gave), it has a chance to mistakenly match an unrelated 2-letter code.
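
A small C sketch of that pitfall, assuming a prefix comparison of the kind described above. Both function names are hypothetical and neither is libunibreak API; the second function shows one possible guard a caller could apply before handing a tag to such a library.

```c
#include <stdbool.h>
#include <string.h>

/* Prefix match: compares only as many characters as the known code has. */
static bool lang_matches_prefix(const char *lang, const char *known)
{
    return strncmp(lang, known, strlen(known)) == 0;
}

/*
 * Stricter match: the known code must be followed by the end of the string
 * or by '-' (the start of a further subtag), so "rup" no longer matches "ru".
 */
static bool lang_matches_exact(const char *lang, const char *known)
{
    size_t n = strlen(known);

    /* lang[n] is only read if the first n characters matched. */
    return strncmp(lang, known, n) == 0 &&
           (lang[n] == '\0' || lang[n] == '-');
}
```

Under the prefix rule, "rup" is treated as "ru"; under the stricter rule only "ru" itself or a tag such as "ru-RU" matches.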

> Though per §2.2.1 BCP-47 only recognises a single ISO-639 code per language, using the 2-char ISO-639-1 codes for languages that have them, the ISO-639-2T code otherwise.

Neat! Still, advising to use 2-letter if available can't hurt even if it may technically be superfluous.

> […] alongside any potential benefit gained from its actual purpose in libass

Its current purpose in libass is only to allow language-specific font features to be used. In the future, thanks to libunibreak, it will also be used for the optional, ASS-incompatible Unicode line-breaking mode.

> If creation tools can start using such 3-char codes for languages lacking 2-char codes, this will allow them to get identified in VLC through the work just mentioned […]

The Language: header is a libass-specific extension to ASS. I'm not aware of any ASS editors natively supporting it. (And sub authors who don't need it for correct font rendering will likely leave it blank.)

As I wrote previously, I think it makes sense to allow more than ISO-639-1, with a safety mechanism for ISO-639-1-only libs. There's no technical blocker iinm.

@jnqnfe

jnqnfe commented Jun 17, 2021

> > I took a quick look at libunibreak. It seems to compare a given language string with a handful of codes that happen to be ISO-639-1.

> libunibreak also only compares up to the length of the codes it knows, so if a 3-letter code were to be passed (as in the "rup" vs. "ru" example you gave), it has a chance to mistakenly match an unrelated 2-letter code.

Oh right, so it does. I've misread that bit of code. Caution will certainly be needed there then.

> > Though per §2.2.1 BCP-47 only recognises a single ISO-639 code per language, using the 2-char ISO-639-1 codes for languages that have them, the ISO-639-2T code otherwise.

> Neat! Still, advising to use 2-letter if available can't hurt even if it may technically be superfluous.

I'm just being a little cautious about the language used to guide users in order to avoid misleading them into making mistakes. "Advising to use 2-letter if available" suggests to me that using 3-letter codes for those that have 2-letter codes is perfectly fine, just not recommended, whereas specifically mentioning BCP-47 and stating that essentially only 2-letter codes are valid for those that have them, otherwise ISO-639-2T, is less problematic. :)

> > […] alongside any potential benefit gained from its actual purpose in libass

> Its current purpose in libass is only to allow language-specific font features to be used. In the future, thanks to libunibreak, it will also be used for the optional, ASS-incompatible Unicode line-breaking mode.

Yes, I understand. :)

> > If creation tools can start using such 3-char codes for languages lacking 2-char codes, this will allow them to get identified in VLC through the work just mentioned […]

> The Language: header is a libass-specific extension to ASS. I'm not aware of any ASS editors natively supporting it. (And sub authors who don't need it for correct font rendering will likely leave it blank.)

Sure. It just happens to work neatly as a means of identification, better than trying to just pick out language from a portion of filenames as otherwise is done.

> As I wrote previously, I think it makes sense to allow more than ISO-639-1, with a safety mechanism for ISO-639-1-only libs. There's no technical blocker iinm.

Sure, Great. 👍

vlc-mirrorer pushed a commit to videolan/vlc that referenced this issue Jun 18, 2021
... for language identification.

this info property has been supported by libass since v0.10.0. it is currently
a 2-char iso-639-1 code.

libass commit adding support:
libass/libass@c979365

discussion about enhancing the attribute to support 3-char iso-639-2 codes,
possibly bcp-47: libass/libass#404