Support dynamic generation of the OS2 table #55

Merged
merged 3 commits into from
Mar 14, 2018

Conversation

camertron
Member

This pull request is part of a larger effort to bring OTF support to TTFunk. See #53 for details.

Highlights of this PR:

  1. Currently the OS2 table is written back into font subsets as-is. These changes make it possible to encode the table from its various components.
  2. Replacement of the TTFunk::Encoding classes with the code-pages gem, which I think is much simpler. It also allows TTFunk to support any of the various code pages instead of just Mac Roman and Windows 1252.
  3. Introduction of several utility classes.

@pointlessone
Member

I will take a closer look later, but I already have a few questions regarding the code-pages gem.

  • It seems to provide quite a few code pages, many more than TTFunk used to support. Are they all needed? Maybe there are alternative ways to do the same without introducing a new dependency? Maybe Ruby's Encoding would suit?
  • In case we decide to keep it… It seems it would become a dependency of the Prawn gem only. According to RubyGems, no other gem depends on it. Would you be willing to join the Prawn org and transfer the gem to the org? Naturally, you would keep the authorship and maintainership of the gem. I just want to make sure the dependency can be easily picked up by a new maintainer in case you decide to move on.
  • I also have some concerns regarding the gem's efficiency. It seems to load all the pages even when only one is requested. What's the impact on startup time and memory use?


def code_pages_for(subset)
  field = BitField.new(0)
  return field if subset.unicode?
Member

According to the docs, 0 is CP1252. Granted, there's no value for Unicode, but what would happen if the subset actually includes glyphs outside of 1252?

Member Author

Hmm good question. After some additional research I was able to find several of the Noto fonts that have a code_page_range of 0 but don't include the majority of the 1252 code page (i.e. a-z, A-Z). So I guess there's precedent for it? Since TTFunk's subsets are very encoding-based, I don't know how we would do this any differently.

Member

The spec suggests setting a code page bit only if the code page is functional. However…

> The determination of “functional” is left up to the font designer

Also this bit:

> If the font file is encoding ID 0, then the Symbol Character Set bit should be set.

Is this handled?

Member Author

I believe an encoding ID of 0 only signifies a symbol character set on Microsoft platforms, i.e. platform ID 3. Currently TTFunk doesn't support that combination in the cmap logic, so we should be ok.
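For readers following the bit discussion above, here is a minimal sketch of how code page bits combine into a single integer. The bit positions (0 for 1252, 1 for 1250, 31 for Symbol) come from the OS/2 table's first code page range field; the constant and method names are hypothetical and are not TTFunk's API. It also shows that a field whose value is 0 has no bits set, which is different from bit 0 being set.

```ruby
# Hypothetical names for illustration only; not TTFunk's API.
# Bit positions follow the first code page range field of the OS/2 table.
CODE_PAGE_BITS = {
  1252 => 0, # Latin 1
  1250 => 1, # Latin 2: Eastern Europe
  1251 => 2, # Cyrillic
  1253 => 3, # Greek
  1254 => 4  # Turkish
}.freeze

SYMBOL_BIT = 31 # Symbol Character Set

# Builds the integer value of the field from a list of code page numbers.
def code_page_field(code_pages, symbol: false)
  field = code_pages.reduce(0) do |acc, code_page|
    acc | (1 << CODE_PAGE_BITS.fetch(code_page))
  end
  symbol ? field | (1 << SYMBOL_BIT) : field
end

code_page_field([])             # => 0 (no bits set at all)
code_page_field([1252])         # => 1 (bit 0 set)
code_page_field([1252, 1250])   # => 3 (bits 0 and 1 set)
code_page_field([], symbol: true) # => 2147483648 (only bit 31 set)
```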

end

protected
def new_cmap_table(_options = nil)
Member

I guess the optional argument can be dropped, as it seems to never be used in any of the Subset classes.

Member Author

Yes, good idea.

  def to_unicode_map
-   Encoding::Windows1252::TO_UNICODE
+   CodePages[code_page].unicode_mapping
Member

After all the changes, this class differs from MacRoman only in the code_page method. Moreover, this one still encodes into a mac_roman cmap table. Do we need it?

Member Author

Yeah, I noticed that as well. In order to not break backwards-compatibility, I'll keep these two classes but refactor into a base class that can accept any encoding/code page.

@camertron
Member Author

@pointlessone in response to your points:

  • Yes, the CodePages gem supports all the code pages available at unicode.org. After doing a bit more research it appears your idea of using Ruby's Encoding will probably work just fine. The only method in the subset classes that wouldn't work with Encoding is #to_unicode_map which appears to not be used anywhere. Since Encoding supports a bunch of code pages I think it's probably sufficient.
  • I am willing to transfer CodePages to the prawn org if necessary.
  • It doesn't load all the code pages, just a manifest of all the available ones. Individual code pages are loaded as they are requested.

@pointlessone
Member

Let's try getting by with Encoding instead of a new dependency. If that proves problematic, then let's consider your gem.

@camertron camertron mentioned this pull request Mar 5, 2018
@camertron
Member Author

camertron commented Mar 7, 2018

Ok @pointlessone I removed references to the CodePages gem and used Ruby's Encoding functionality instead. As you can see, it's a bit awkward but it works. There's definitely something to be said for the efficiency of the hashes we had before in the individual TTFunk::Encoding classes (also in the CodePages gem) instead of using a bunch of packs and unpacks. Let me know what you think.

@camertron camertron force-pushed the refactor_os2 branch 3 times, most recently from e595be1 to 9f7ce8a Compare March 7, 2018 21:53
new_char_range = unicode_blocks_for(os2, os2.char_range, subset)
result << BinUtils
  .unpack_int(new_char_range.value, 32)
  .pack('N*').ljust(16, "\0")
Member

I think it would be easier to pass unpack_int the desired target size instead of manually padding the result. It would also make the method itself a bit simpler.

end

# assumes big-endian
def unpack_int(value, bit_width)
Member

This is probably very subjective but I find it hard to understand what these methods do by their names.

I think slice_int and stitch_int are more intention revealing names.

What do you think?

Member Author

Yeah, I agree the names could be better (I was trying to follow the ruby pack/unpack terminology). What about int_to_bytes and bytes_to_int?

Member

Historically bytes are quite consistently 8 bits wide. For wider values "word" is the most common term. But otherwise those names work for me.

Member Author

Ah that makes sense. Ok, let's go with slice_int and stitch_int.

# assumes big-endian
def unpack_int(value, bit_width)
  return [0] if value == 0
  return [1] if value == 1
Member

Why is this a special case?

Member Author

The method needs to calculate how many bytes should be returned, so it takes the log base 2 of the value. Math.log2(1) == 0.0, which would mean 0 bytes in the return value.

Member

As I mentioned elsewhere, if this method took the size of the unpacked value, the calculation would be unnecessary.

Member Author

Acknowledged.
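A rough sketch of the suggestion in this thread: if the caller passes the number of slices, there is no need to derive it from Math.log2, so the special cases for 0 and 1 disappear and the result needs no manual padding. The module and method signatures below are hypothetical (only the names slice_int and stitch_int come from the discussion), and the most-significant-word-first order is an assumption.

```ruby
# Hypothetical sketch; not TTFunk's actual BinUtils implementation.
module IntSlicing
  module_function

  # Splits value into slice_count words of bit_width bits each,
  # most significant word first (an assumption for this sketch).
  def slice_int(value, bit_width:, slice_count:)
    mask = (1 << bit_width) - 1
    Array.new(slice_count) do |index|
      shift = bit_width * (slice_count - 1 - index)
      (value >> shift) & mask
    end
  end

  # Inverse of slice_int: joins words back into a single integer.
  def stitch_int(words, bit_width:)
    words.reduce(0) { |acc, word| (acc << bit_width) | word }
  end
end

Math.log2(1) # => 0.0, the reason the original needed a special case

words = IntSlicing.slice_int(1, bit_width: 32, slice_count: 4)
words                      # => [0, 0, 0, 1]
words.pack('N*').bytesize  # => 16, already the full width
IntSlicing.stitch_int(words, bit_width: 32) # => 1
```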

.codepoints
.first
rescue Encoding::UndefinedConversionError
# Ruby doesn't appear to think there is a strict 1:1 mapping
Member

This comment is not quite accurate. Ruby's opinion has nothing to do with convertibility. There are actually illegal conversions.

For example, 0x81 is undefined in CP1252. There's no way to convert it into Unicode.

So maybe just say that there's no 1:1 mapping between all encodings and Unicode.

Member Author

Ok, that's something I hadn't realized, thanks for the explanation.
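The point about undefined bytes can be illustrated with a small sketch that uses only Ruby's built-in Encoding support. The helper names are hypothetical and this is not the code from the PR; it converts each byte of a single-byte encoding to a Unicode code point and skips bytes that have no mapping, such as 0x81 in Windows-1252.

```ruby
# Hypothetical helpers for illustration; not the PR's implementation.

# Returns the Unicode code point for one byte in a single-byte encoding,
# or nil when the byte has no defined mapping.
def code_point_for(byte, encoding)
  [byte].pack('C')
        .force_encoding(encoding)
        .encode(Encoding::UTF_8)
        .codepoints
        .first
rescue Encoding::UndefinedConversionError
  nil
end

# Builds a byte => code point Hash once, so later lookups avoid
# repeated pack/encode calls.
def unicode_map_for(encoding)
  (0..255).each_with_object({}) do |byte, map|
    code_point = code_point_for(byte, encoding)
    map[byte] = code_point if code_point
  end
end

WINDOWS_1252_MAP = unicode_map_for(Encoding::Windows_1252).freeze

WINDOWS_1252_MAP[0x41]       # => 65 (LATIN CAPITAL LETTER A)
WINDOWS_1252_MAP[0x80]       # => 8364 (EURO SIGN, U+20AC)
WINDOWS_1252_MAP.key?(0x81)  # => false (undefined in CP1252)
```

Building the Hash once is one way to recover the lookup speed of the old constant tables while still relying on Ruby's Encoding for the data.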

@@ -33,13 +33,14 @@ def from_unicode(character)
character
end

protected
Member

Do we need to expose the next method as well?


# only flip the bit on if the subset includes all the characters
# that were present for this block in the original font
if code_points.all? { |cp| subset_code_points.include?(cp) }
Member

I'm not sure if this is a sound approach. Please correct me if I'm mistaken.

Consider a font that has a full Latin alphabet and, let's say, sets the 1250 bit. Suppose the font is used for only one title in the whole document, which says "Report". The subset would contain only those 6 letters out of at least 48 provided by the font. For the purposes of the document those 6 should be considered "usable" and warrant a bit in the table. But with the current logic the bit won't be set, as the subset doesn't fully cover the original set.

It looks like the bit would rarely, if ever, be set, since it's really hard to come up with a real-world document that fully covers a full code page of any decent font. Consider just covering the English alphabet in both lower and upper case, plus the punctuation characters and braces. Most fonts include even more characters that are hard to come by in most documents.

Member Author

Yes, I see what you're saying. Before submitting the PR I tried a different approach where bits were flipped on if any of the corresponding characters were present in the subset. I abandoned it because it was fairly slow and didn't seem correct (after all, the font's definition of a usable unicode range won't have been satisfied). I don't believe the font will be broken if no unicode ranges are indicated, so only flipping bits on if all characters are present seemed like the best approach. If you feel strongly about going back to the original technique, I'd be happy to revert.

Member

> it was fairly slow

I for some reason doubt that any? is slower than all?. They're the same in the worst-case scenario.

> the font's definition of a usable unicode range won't have been satisfied

It may not be for a regular font, since the font author has no idea how the font will be used. In the general case the flag should be set when the font has decent, if not full, coverage of a code page.

Subset fonts are used in very specific scenarios. In the case of Prawn (and PDF in general) the fonts are only usable with a specific text. There's no risk of encountering a situation where the subset font won't cover what the original could have. So it still satisfies the "usability" criteria to the same extent the original font would have.

I think the appropriate approach would be to unset bits only for ranges that are completely removed from the font.

Member Author

> I for some reason doubt that any? is slower than all?. They're the same in the worst-case scenario.

No, that's not what I meant. My original algorithm used Array#bsearch for every character.

> In the general case the flag should be set when the font has decent, if not full, coverage of a code page.

Ok, that makes sense to me.

Member Author

Looks like this is as simple as using any? instead of all?. Cool 😎
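To make the difference between the two strategies concrete, here is a small sketch using the "Report" example from earlier in this thread. The helper name and the choice of the printable ASCII range as the block are assumptions for illustration; the PR's real code works with the font's own block data.

```ruby
require 'set'

# Hypothetical helper: decides whether a block's bit stays on for a subset.
# strategy is :all (the original approach) or :any (the agreed approach).
def keep_bit?(block_code_points, subset_code_points, strategy)
  subset = subset_code_points.to_set # constant-time membership checks
  case strategy
  when :all then block_code_points.all? { |cp| subset.include?(cp) }
  when :any then block_code_points.any? { |cp| subset.include?(cp) }
  else raise ArgumentError, "unknown strategy: #{strategy}"
  end
end

# Assumed block for the example: printable ASCII.
basic_latin = (0x20..0x7E).to_a
# A subset containing only the letters of a single title.
report = 'Report'.codepoints.uniq

keep_bit?(basic_latin, report, :all) # => false, the bit would be cleared
keep_bit?(basic_latin, report, :any) # => true, the bit stays set
```

With :any, a bit is cleared only when every character of the block has been removed from the subset, which matches the approach agreed on above.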

# using is irrelevant. We choose MacRoman because it's a 256-character
# encoding that happens to be well-supported in both TTF and
# PDF formats.
TTFunk::Table::Cmap.encode(mapping, :mac_roman)
end

def original_glyph_ids
Member

This should not be public.

end
end

def original_glyph_ids
Member

This should not be public.

@pointlessone
Member

All right, it looks good. Sorry for torturing you. 😅

Make sure the history is good and it's good to merge.

* Remove *Encoding classes in favor of Ruby's built-in encoding capabilities.
* Create a CodePage base class and add Windows1252 and MacRoman as subclasses. Move shared functionality into CodePage.
@camertron
Member Author

@pointlessone not at all! I really appreciate how thorough you have been in reviewing these PRs :)

The commit history has been cleaned up, ready to merge when you are.

@pointlessone pointlessone merged commit ab795ff into prawnpdf:otf Mar 14, 2018