Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmap table that maps a code point to GID 0 #2031

Closed
khaledhosny opened this issue Oct 27, 2019 · 13 comments · Fixed by #2034
Closed

cmap table that maps a code point to GID 0 #2031

khaledhosny opened this issue Oct 27, 2019 · 13 comments · Fixed by #2034

Comments

@khaledhosny
Copy link
Collaborator

The font cmunrm.otf has a format 4 cmap subtable that map U+0000 to GID 0, and this seems to cause hb_face_collect_unicodes() to include U+0000, but the spec says:

Regardless of the encoding scheme, character codes that do not correspond to any glyph in the font should be mapped to glyph index 0. The glyph at this location must be a special glyph representing a missing character, commonly known as .notdef.

So shouldn’t HarfBuzz skip U+0000 then? Furthermore, hb_font_get_nominal_glyph() returns false for U+0000 in this font, which is rather inconsistent.

@khaledhosny
Copy link
Collaborator Author

khaledhosny commented Oct 27, 2019

For comparison, FT_Get_First_Char() gives 0x0001, skipping the code point mapped to GID 0.

@behdad
Copy link
Member

behdad commented Oct 28, 2019

I agree, we should just remove 0 from collected unicodes of OpenType fonts.

@behdad
Copy link
Member

behdad commented Oct 28, 2019

Simple patch please.

@behdad
Copy link
Member

behdad commented Oct 28, 2019

That was wrong. I'm fixing correctly now.

Existing code tried to do this correctly, but apparently doesn't in some places. Investigating.

@behdad
Copy link
Member

behdad commented Oct 28, 2019

Humm. I don't see where in the code this is happening.

Khaled, can you point out to what mapping in the font is causing this? I'm failing to find it. Can you please debug and submit a PR? Ideally with a test? Thanks.

behdad added a commit that referenced this issue Oct 28, 2019
@khaledhosny
Copy link
Collaborator Author

I’ll try to debug later, here is the test code I used to check this:

#include <assert.h>
#include <stdio.h>
#include <hb.h>

int
main(int argc, char** argv)
{
	if (argc < 2)
	{
		fprintf(stderr, "Usage: %s font\n", argv[0]);
		return 1;
	}

	hb_blob_t *blob = hb_blob_create_from_file(argv[1]);
	hb_face_t *face = hb_face_create(blob, 0);
	hb_font_t *font = hb_font_create(face);

	hb_set_t *codes = hb_set_create();
	hb_face_collect_unicodes(face, codes);

	size_t count = hb_set_get_population(codes);
	hb_codepoint_t code = HB_SET_VALUE_INVALID;
	hb_codepoint_t glyph = 0;
	while (hb_set_next(codes, &code))
	{
		hb_font_get_nominal_glyph(font, code, &glyph))
		fprintf(stdout, "U+%04X => %04d\n", code, glyph);
		assert(glyph);
	}

	hb_blob_destroy(blob);
	hb_face_destroy(face);
	hb_font_destroy(font);
	hb_set_destroy(codes);
}

ttx output did not show the mapping to GID 0, but here is an excepet from ttfdump -t cmap:

SubTable   0.	 Format 4 - Segment mapping to delta values
		 Length:    1424
		 Version:      0
		 segCount:	 176
		 searchRange:	 256
		 entrySelector:	 7
		 rangeShift:	 96
		 Seg     0 : St = 0000, En = 0000, D =      0, RO =      0, gId# = N/A

@khaledhosny
Copy link
Collaborator Author

I’m guessing it is this code:

unsigned int rangeOffset = this->idRangeOffset[i];
if (rangeOffset == 0)
out->add_range (this->startCount[i], this->endCount[i]);
else
{
for (hb_codepoint_t codepoint = this->startCount[i];
codepoint <= this->endCount[i];
codepoint++)
{
unsigned int index = rangeOffset / 2 + (codepoint - this->startCount[i]) + i - this->segCount;
if (unlikely (index >= this->glyphIdArrayLength))
break;
hb_codepoint_t gid = this->glyphIdArray[index];
if (unlikely (!gid))
continue;
out->add (codepoint);
}
}

When rangeOffset is zero, the range is added without checking the glyph ids.

@behdad
Copy link
Member

behdad commented Oct 28, 2019

When rangeOffset is zero, the range is added without checking the glyph ids.

Right. So the font declares a range mapping 0 to 0? That's pretty weird. Humm. Yeah I don't remember Format4 in enough detail to figure out what that one does...

[...hours pass...]

Yeah I'm also convinced it's that blind add_range causing the problem. Really weird that the font has this encoded, but we should handle it consistently.

@khaledhosny
Copy link
Collaborator Author

The font has a format 12 subtable which has the same issue. I don’t know how to make a font like this and can’t subset away the other subtable with FontTools since it will rather helpfully fix the subtable, so the test I’m adding will be testing only the subtable HarfBuzz end up selecting (format 12 in this case).

I’m not sure if other subtable formats need fixing as well.

@behdad
Copy link
Member

behdad commented Oct 28, 2019

Thanks. It's a bummer that we have to do this check for all ranges. The Format12/13 can be further optimized but I'm happy to not do that unless proven necessary. Thanks!

@ebraminio
Copy link
Collaborator

Fixed by dd28884 I believe

@khaledhosny
Copy link
Collaborator Author

The pull request is not yet merged.

@ebraminio ebraminio reopened this Oct 29, 2019
@ebraminio
Copy link
Collaborator

Oh ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants