Opentype lang tag #404
Comments
It also seems HarfBuzz may support even longer strings with subtypes.
Currently libass only considers ISO 639-1 codes valid. As it is supported by HarfBuzz, allowing users to specify languages not covered by ISO 639-1 seems sensible to me. Eg Some things to consider: E.g. PR #372 needs ISO 639-1 to work with libunibreak and probably should also chop off anything starting with the first dash.
The DirectWrite API also uses the "locale" parameter, which may have limitations. I believe libass will pass the Language tag to this field, but I'm not entirely sure whether it expects BCP 47 codes.
libass currently does not pass any language to DirectWrite.
HarfBuzz, which is a dependency, does via GetGlyphs: https://github.com/harfbuzz/harfbuzz/blob/master/src/hb-directwrite.cc I think that locale would/could originate from the libass Language tag, probably.
HarfBuzz does not call DirectWrite under normal circumstances; the code you cite is for
bows head in shame OK, I guess it is not an issue :)
Hi. I've landed here having just worked on getting VLC to make use of the language attribute to better detect the language of ASS/SSA subtitle files, for display in the subtitle menu and elsewhere. (See here). I've also been updating VLC's ISO-639 lookup table (see here), which includes adding many new entries that lack 2-char ISO-639-1 codes (the set has been obtained from glibc FYI, as VLC's older copy had been). I ended up here having noticed that the language property of ASS/SSA files is currently limited to 2-char ISO-639-1 codes only, as pointed out in this issue. I agree that it would be sensible to expand this to allow for languages that only have 3-char ISO-639-2 codes. If creation tools can start using such 3-char codes for languages lacking 2-char codes, this will allow them to get identified in VLC through the work just mentioned, alongside any potential benefit gained from its actual purpose in libass of course. BTW, I noticed that the
What I understand from taking a look at BCP-47 (from reading RFC5646) is that the initial 2/3-char component of the language tags is an ISO-639 code. Though per §2.2.1 BCP-47 only recognises a single ISO-639 code per language, using the 2-char ISO-639-1 codes for languages that have them, the ISO-639-2T code otherwise. Thus users do not have any choice between ISO-639-1 and ISO-639-2 for those that have both, and cannot use ISO-639-2B codes for those few where it differs from ISO-639-2T. If dealing with a library that only understands ISO-639 not BCP-47, then presumably you can just split the BCP-47 string on the first dash if there is one and give the library just that first part, as I believe you've hinted at. If you enhance the language attribute to take BCP-47 rather than just allowing for 3-char ISO-639-2 codes, then that's exactly how I plan to fix my VLC work to accommodate the possibility of encountering values with BCP-47 sub-components in future.
I took a quick look at libunibreak. It seems to compare a given language string with a handful of codes that happen to be ISO-639-1. Perhaps I've taken your "needs ISO-639-1" comment the wrong way, but I did not see any restriction in passing it longer ISO-639-2 codes, and if taking BCP-47 for the language attribute and trimming off the initial ISO-639 component, it should work just fine to pass that along to libunibreak I think, considering that only the 2-char ISO-639-1 codes for those languages it has special handling for are valid in BCP-47 per above. Considering that BCP-47 is just a superset of the currently accepted ISO-639-1 codes, and that we can easily cut it down to an ISO-639-1/2 code for a library if needed, surely there's no problem moving forwards with the idea of changing the attribute to taking BCP-47?
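To make the "split on the first dash" idea concrete, here is a minimal sketch in C. The helper name `primary_lang_subtag` and its signature are hypothetical, made up for illustration; this is not code from libass, VLC, or libunibreak. It only cuts a BCP-47 string down to its leading ISO-639 component and does not validate the result against any registry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libass): copy the primary language
 * subtag of a BCP-47 tag, i.e. everything before the first '-', into out.
 * Returns the subtag length, or 0 if it is empty or does not fit in out. */
static size_t primary_lang_subtag(const char *tag, char *out, size_t out_size)
{
    assert(tag != NULL && out != NULL && out_size > 0);

    /* Length of the initial run that contains no '-'. */
    size_t len = strcspn(tag, "-");
    if (len == 0 || len >= out_size) {
        out[0] = '\0';
        return 0;
    }
    memcpy(out, tag, len);
    out[len] = '\0';
    return len;
}
```

With an 8-byte buffer, a tag such as `zh-Hant-TW` would yield `zh`, and a bare 3-letter code such as `ast` would pass through unchanged, so the same call could serve a library that only understands plain ISO-639 codes.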
libunibreak also only compares up to the length of the codes it knows, so if — as eg in the
Neat! Still, advising to use 2-letter if available can't hurt even if it may technically be superfluous.
Its current purpose in libass is only to allow language-specific font features to be used. In the future, thanks to libunibreak, it will also be used for the optional, ASS-incompatible Unicode line-breaking mode.
The As I wrote previously, I think it makes sense to allow more than ISO-639-1, with a safety mechanism for ISO-639-1-only libs. There's no technical blocker iinm.
Oh right, so it does. I've misread that bit of code. Caution will certainly be needed there then.
I'm just being a little cautious about the language used to guide users in order to avoid misleading them into making mistakes. "Advising to use 2-letter if available" suggests to me that using 3-letter codes for those that have 2-letter codes is perfectly fine, just not recommended, whereas specifically mentioning BCP-47 and stating that essentially only 2-letter codes are valid for those that have them, otherwise ISO-639-2T, is less problematic. :)
Yes, I understand. :)
Sure. It just happens to work neatly as a means of identification, better than trying to just pick out language from a portion of filenames as otherwise is done.
Sure, Great. 👍
... for language identification. This info property has been supported by libass since v0.10.0. It is currently a 2-char ISO-639-1 code. libass commit adding support: libass/libass@c979365. Discussion about enhancing the attribute to support 3-char ISO-639-2 codes, possibly BCP-47: libass/libass#404
libass currently supports a two-character lang tag. HarfBuzz (a dependency of libass) uses up to 3 characters. It seems advisable to support the same limit in libass.
see https://github.com/libass/libass/blob/master/libass/ass.c line 615
and https://github.com/harfbuzz/harfbuzz/blob/master/src/hb-ot-tag-table.hh
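As a rough illustration of what a relaxed check could look like, the sketch below accepts any code made of 2 or 3 ASCII letters instead of exactly 2. The function name `is_iso639_shaped` is an assumption for this example and does not reflect the actual code in `ass.c`; it checks only the shape of the string, not whether the code is actually assigned in ISO 639.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check (not libass code): true if code consists of exactly
 * 2 or 3 ASCII letters, the shape of ISO 639-1 and ISO 639-2 codes. */
static bool is_iso639_shaped(const char *code)
{
    if (code == NULL)
        return false;

    size_t len = 0;
    while (code[len] != '\0') {
        char c = code[len];
        bool is_letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        /* Reject a fourth character or anything that is not a letter. */
        if (len >= 3 || !is_letter)
            return false;
        len++;
    }
    return len == 2 || len == 3;
}
```

Under this sketch `en` and `fil` would be accepted, while `e`, `engl`, and `zh-TW` would be rejected; a dashed tag would need to be cut down to its first component beforehand.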