Legacy fonts can hava a NameRecord not encoded in UTF-16BE #643

Open
moi15moi opened this issue Aug 22, 2022 · 6 comments

Comments

@moi15moi
Contributor

Currently, libass always decodes the family name as UTF-16BE:

ass_utf16be_to_utf8(buf, sizeof(buf), (uint8_t *)name.string,
name.string_len);

However, Microsoft-platform NameRecords are not always encoded in UTF-16BE.
For how libass should decode each NameRecord, see: MicrosoftDocs/typography-issues#956 (comment)

Something like this could be added in ass_utils.c:

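/* Returns the name of the encoding used by a NameRecord string. */
const char *get_name_encoding(FT_SfntName name) {
    if (name.platform_id == TT_PLATFORM_MICROSOFT) {
        switch (name.encoding_id) {
            case TT_MS_ID_PRC:
                return "windows-936";

            case TT_MS_ID_BIG_5:
                /* The subfamily name stays UTF-16BE even in Big5 fonts. */
                return (name.name_id == TT_NAME_ID_FONT_SUBFAMILY) ? "UTF-16BE" : "windows-950";

            case TT_MS_ID_WANSUNG:
                /* Same exception for Wansung fonts. */
                return (name.name_id == TT_NAME_ID_FONT_SUBFAMILY) ? "UTF-16BE" : "windows-949";

            default:
                return "UTF-16BE";
        }
    }
    /* Other platforms are not handled by this sketch; the caller must check for NULL. */
    return NULL;
}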

Finally, to convert the bytes to UTF-8, libass could use ICU: https://unicode-org.github.io/icu/userguide/conversion/converters.html#1-single-string

PS: To test whether a BIG_5 NameRecord is decoded properly, download the font attached to this issue: MicrosoftDocs/typography-issues#956 (comment)

@moi15moi moi15moi changed the title [Bug] Fail decode namerecord string [Bug] Wrong encoding is been used to decode NameRecord Aug 22, 2022
@TheOneric TheOneric changed the title [Bug] Wrong encoding is been used to decode NameRecord Legacy fonts can hava a NameRecord not encoded in UTF-16BE Aug 22, 2022
@astiob
Member

astiob commented Aug 22, 2022

ICU is waaaay too massive for libass to use, if I’m not mixing anything up.

But the issue is, of course, real; thanks for creating a dedicated ticket for it. I already have name decoding code in https://github.com/astiob/libass/tree/debug-fonts. It should probably be adapted into mainline libass.

@moi15moi
Contributor Author

> ICU is waaaay too massive for libass to use, if I’m not mixing anything up.

Ok. I don't have a good knowledge of C.

> I already have name decoding code in https://github.com/astiob/libass/tree/debug-fonts.

I don't think it is a good idea to use names with the Macintosh platform ID.

Here is what the Apple documentation mentions: Names with platformID 1 were required by earlier versions of macOS. Its use on modern platforms is discouraged.

@astiob
Member

astiob commented Aug 22, 2022

> Here is what the Apple documentation mentions: Names with platformID 1 were required by earlier versions of macOS. Its use on modern platforms is discouraged.

That means about as much as the Microsoft docs not mentioning Windows 95 quirks that GDI nevertheless emulates to this day. It’s “discouraged” to use non-Unicode fonts at all, but that’s exactly what we’re trying to do here. (And IIRC macOS itself still preferred Macintosh-platform names when I last checked.)

That branch is (as the name suggests) meant for debugging font issues, so it dumps all the information it can find in a font. What we actually want (ideally) is what VSFilter’s GDI calls use:

For TrueType fonts, uses names with the same platform and encoding as the first valid Microsoft-platform cmap (if any) or MacRoman cmap (otherwise). Never uses Unicode-platform names.

so all Microsoft-platform encodings, as well as Macintosh-platform MacRoman (whatever version of it is implemented in Windows).
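As a rough illustration of the selection rule quoted above, here is a minimal Python sketch. It is not VSFilter's or GDI's actual code: the function names, the tuple and dict layouts, and the assumption that the cmap list has already been filtered down to valid subtables are all hypothetical.

```python
# Platform and encoding IDs as defined for the OpenType 'cmap' and 'name' tables.
PLATFORM_MACINTOSH = 1
PLATFORM_MICROSOFT = 3
MAC_ENCODING_ROMAN = 0


def pick_name_platform(cmaps):
    """Return the (platform_id, encoding_id) pair whose names should be used.

    cmaps is a list of (platform_id, encoding_id) pairs in font order,
    assumed (hypothetically) to contain only valid subtables.
    Returns None if there is neither a Microsoft nor a MacRoman cmap.
    """
    # First choice: the first Microsoft-platform cmap.
    for platform_id, encoding_id in cmaps:
        if platform_id == PLATFORM_MICROSOFT:
            return (platform_id, encoding_id)
    # Fallback: a MacRoman cmap.
    for platform_id, encoding_id in cmaps:
        if (platform_id, encoding_id) == (PLATFORM_MACINTOSH, MAC_ENCODING_ROMAN):
            return (platform_id, encoding_id)
    # Unicode-platform cmaps never select any names.
    return None


def select_names(names, cmaps):
    """Keep only the name records matching the chosen platform and encoding."""
    target = pick_name_platform(cmaps)
    if target is None:
        return []
    return [n for n in names if (n["platform_id"], n["encoding_id"]) == target]


# A font with Unicode, MacRoman and Microsoft Big5 cmaps: only the
# Microsoft Big5 name record survives.
cmaps = [(0, 3), (1, 0), (3, 4)]
names = [
    {"platform_id": 0, "encoding_id": 3, "name_id": 1},
    {"platform_id": 1, "encoding_id": 0, "name_id": 1},
    {"platform_id": 3, "encoding_id": 4, "name_id": 1},
]
selected = select_names(names, cmaps)
```

With only Microsoft-platform names supported for now, the MacRoman fallback branch could simply be dropped.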

@astiob
Member

astiob commented Aug 22, 2022

Of course, we don’t currently support MacRoman cmaps, either, and I don’t remember if I’ve ever seen a font that lacked Microsoft-platform data and actually worked in VSFilter. (Zapfino lacks them, and it doesn’t work in VSFilter.) But anyway, just Microsoft-platform names for now would be plenty good, to match our support of Microsoft-platform cmaps.

@moi15moi
Contributor Author

Here are two fonts whose names should not be decoded as UTF-16BE.
fonts.zip

@moi15moi
Contributor Author

moi15moi commented Aug 4, 2023

I spoke with a Microsoft employee, and he told me that GDI performs this processing:

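# This NameRecord is from 文鼎中特廣告體 - PlatEncID 4.ttf
# Assumption: NameRecord is the fontTools class (the attribute names match).
from fontTools.ttLib.tables._n_a_m_e import NameRecord

name_record = NameRecord()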
name_record.nameID = 1
name_record.string = b"\x00\xa4\x00\xe5\x00\xb9\x00\xa9\x00\xa4\x00\xa4\x00\xaf\x00S\x00\xbc\x00s\x00\xa7\x00i\x00\xc5\x00\xe9"
name_record.platformID = 3
name_record.platEncID = 4
name_record.langID = 0

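# get_name_record_encoding() is not shown here; presumably it returns a Python codec name for the record.
encoding = get_name_record_encoding(name_record)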

if name_record.platformID == 3 and encoding != "utf_16_be":
    name_to_decode = name_record.string.replace(b"\x00", b"")
else:
    name_to_decode = name_record.string

decoded_name = name_to_decode.decode(encoding)
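For anyone who wants to try this without the missing helper, here is a self-contained sketch of the same processing using only the standard library. The mapping function is a hypothetical stand-in for `get_name_record_encoding`, mirroring the C proposal in the first comment; the Python codec names (`cp936`, `cp950`, `cp949`) are my assumption for the Windows code pages named there, and only Microsoft-platform records are handled.

```python
PLATFORM_MICROSOFT = 3
NAME_ID_SUBFAMILY = 2
# Microsoft-platform encoding IDs from the OpenType 'name' table.
MS_ENCODING_PRC = 3
MS_ENCODING_BIG5 = 4
MS_ENCODING_WANSUNG = 5


def get_name_record_encoding(platform_id, encoding_id, name_id):
    """Hypothetical helper: return a Python codec name for a name record."""
    if platform_id != PLATFORM_MICROSOFT:
        raise ValueError("only Microsoft-platform names are handled in this sketch")
    if encoding_id == MS_ENCODING_PRC:
        return "cp936"
    # As in the C proposal, the subfamily name stays UTF-16BE for Big5 and Wansung.
    if encoding_id == MS_ENCODING_BIG5 and name_id != NAME_ID_SUBFAMILY:
        return "cp950"
    if encoding_id == MS_ENCODING_WANSUNG and name_id != NAME_ID_SUBFAMILY:
        return "cp949"
    return "utf_16_be"


def decode_name(raw, platform_id, encoding_id, name_id):
    """Decode raw name bytes using the processing described above."""
    encoding = get_name_record_encoding(platform_id, encoding_id, name_id)
    if platform_id == PLATFORM_MICROSOFT and encoding != "utf_16_be":
        # Each legacy byte is stored in its own 16-bit unit; drop the zero bytes.
        raw = raw.replace(b"\x00", b"")
    return raw.decode(encoding)


# The family name record from the Big5 font quoted above.
raw = (b"\x00\xa4\x00\xe5\x00\xb9\x00\xa9\x00\xa4\x00\xa4\x00\xaf"
       b"\x00S\x00\xbc\x00s\x00\xa7\x00i\x00\xc5\x00\xe9")
print(decode_name(raw, platform_id=3, encoding_id=4, name_id=1))
```

If the mapping is right, this should print the family name that appears in the font's file name. Records with any other Microsoft encoding ID fall through to plain UTF-16BE decoding.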
