Support dynamic generation of the OS2 table #55

camertron · 2018-03-04T07:53:20Z

This pull request is part of a larger effort to bring OTF support to TTFunk. See #53 for details.

Highlights of this PR:

Currently the OS2 table is written back into font subsets as-is. These changes make it possible to encode the table from its various components.
Replacement of the TTFunk::Encoding classes with the code-pages gem, which I think is much simpler. It also allows TTFunk to support any of the various code pages instead of just Mac Roman and Windows 1252.
Introduction of several utility classes.

pointlessone · 2018-03-04T09:17:28Z

I will take a closer look later, but I already have a few questions regarding the code-pages gem.

It seems to provide quite a few code pages, many more than TTFunk used to support. Are they all needed? Maybe there are alternative ways to do the same without introducing a new dependency? Maybe Ruby's Encoding would suit?
In case we decide to keep it… It seems it would be a dependency of the Prawn gem only. According to RubyGems, no other gem depends on it. Would you be willing to join the Prawn org and transfer the gem to the org? Naturally, you would keep the authorship and maintainership of the gem. I just want to make sure the dependency can be easily picked up by a new maintainer in case you decide to move on.
I also have some concerns regarding the gem's efficiency. It seems to load all the pages even when only one is requested. What's the impact on startup time and memory use?

pointlessone · 2018-03-04T13:53:30Z

lib/ttfunk/table/os2.rb

+
+        def code_pages_for(subset)
+          field = BitField.new(0)
+          return field if subset.unicode?


According to the docs, 0 is CP1252. Granted, there's no value for Unicode, but what would happen if the subset actually includes glyphs outside of 1252?

Hmm good question. After some additional research I was able to find several of the Noto fonts that have a code_page_range of 0 but don't include the majority of the 1252 code page (i.e. a-z, A-Z). So I guess there's precedent for it? Since TTFunk's subsets are very encoding-based, I don't know how we would do this any differently.

Spec suggests only setting encoding bit if the code table is functional. However…

The determination of “functional” is left up to the font designer

Also this bit:

If the font file is encoding ID 0, then the Symbol Character Set bit should be set.

Is this handled?

I believe an encoding ID of 0 only signifies a symbol character set on Microsoft platforms, i.e. platform ID 3. Currently TTFunk doesn't support that combination in the cmap logic, so we should be ok.

pointlessone · 2018-03-04T14:01:21Z

lib/ttfunk/subset/mac_roman.rb

      end

-      protected
+      def new_cmap_table(_options = nil)


I guess the optional argument can be dropped, as it seems never to be used in any of the Subset classes.

Yes, good idea.

pointlessone · 2018-03-04T14:03:29Z

lib/ttfunk/subset/windows_1252.rb

      def to_unicode_map
-        Encoding::Windows1252::TO_UNICODE
+        CodePages[code_page].unicode_mapping


After all the changes, this class differs from MacRoman only in the code_page method. Moreover, this one still encodes into a mac_roman cmap table. Do we need it?

Yeah, I noticed that as well. In order to not break backwards-compatibility, I'll keep these two classes but refactor into a base class that can accept any encoding/code page.
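
For readers following along, the refactoring described here could look roughly like the sketch below: a base class that accepts any encoding name Ruby knows, with thin subclasses kept for backwards compatibility. The class names (`CodePageSketch`, `Windows1252Sketch`, `MacRomanSketch`) and the `from_unicode` signature are illustrative assumptions, not the classes that were merged.

```ruby
# Hypothetical sketch; class and method names are illustrative only.
class CodePageSketch
  attr_reader :encoding

  def initialize(encoding_name)
    # Encoding.find raises ArgumentError for names Ruby does not know.
    @encoding = Encoding.find(encoding_name)
  end

  # Returns the single-byte code for a Unicode code point, or nil when
  # the code page has no slot for that character.
  def from_unicode(code_point)
    [code_point].pack('U').encode(encoding).bytes.first
  rescue Encoding::UndefinedConversionError
    nil
  end
end

# Thin subclasses preserve the old class-per-encoding entry points.
class Windows1252Sketch < CodePageSketch
  def initialize
    super('Windows-1252')
  end
end

class MacRomanSketch < CodePageSketch
  def initialize
    super('macRoman')
  end
end

euro_byte = Windows1252Sketch.new.from_unicode(0x20AC) # euro sign
cjk_byte  = MacRomanSketch.new.from_unicode(0x4E2D)    # not in MacRoman
```

The only per-encoding difference left in the subclasses is the name passed to `super`, which matches the observation above that the two classes differed only by their code page.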

camertron · 2018-03-04T17:31:08Z

@pointlessone in response to your points:

Yes, the CodePages gem supports all the code pages available at unicode.org. After doing a bit more research it appears your idea of using Ruby's Encoding will probably work just fine. The only method in the subset classes that wouldn't work with Encoding is #to_unicode_map which appears to not be used anywhere. Since Encoding supports a bunch of code pages I think it's probably sufficient.
I am willing to transfer CodePages to the prawn org if necessary.
It doesn't load all the code pages, just a manifest of all the available ones. Individual code pages are loaded as they are requested.

pointlessone · 2018-03-04T18:45:16Z

Let's try getting by with Encoding instead of new dep. If that proves problematic then let's consider your gem.

camertron · 2018-03-07T07:05:13Z

Ok @pointlessone I removed references to the CodePages gem and used Ruby's Encoding functionality instead. As you can see, it's a bit awkward but it works. There's definitely something to be said for the efficiency of the hashes we had before in the individual TTFunk::Encoding classes (also in the CodePages gem) instead of using a bunch of packs and unpacks. Let me know what you think.

pointlessone · 2018-03-09T09:20:20Z

lib/ttfunk/table/os2.rb

+            new_char_range = unicode_blocks_for(os2, os2.char_range, subset)
+            result << BinUtils
+                      .unpack_int(new_char_range.value, 32)
+                      .pack('N*').ljust(16, "\0")


I think it would be easier to supply unpack_int with a desired target size instead of manually padding the result. It would also make the method itself a bit simpler.

pointlessone · 2018-03-09T09:22:45Z

lib/ttfunk/bin_utils.rb

+    end
+
+    # assumes big-endian
+    def unpack_int(value, bit_width)


This is probably very subjective but I find it hard to understand what these methods do by their names.

I think slice_int and stitch_int are more intention revealing names.

What do you think?

Yeah, I agree the names could be better (I was trying to follow the ruby pack/unpack terminology). What about int_to_bytes and bytes_to_int?

Historically bytes are quite consistently 8 bits wide. For wider values "word" is the most common term. But otherwise those names work for me.

Ah that makes sense. Ok, let's go with slice_int and stitch_int.

pointlessone · 2018-03-09T09:23:43Z

lib/ttfunk/bin_utils.rb

+    # assumes big-endian
+    def unpack_int(value, bit_width)
+      return [0] if value == 0
+      return [1] if value == 1


Why is this a special case?

The method needs to calculate how many bytes should be returned, so it takes the log base 2 of the value. Math.log2(1) == 0.0, which would mean 0 bytes in the return value.

As I mentioned elsewhere, if this method took the size of unpacked value the calculation would be unnecessary.

Acknowledged.
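
A minimal sketch of the suggestion, assuming the caller passes the number of slices it wants: with an explicit count there is no `Math.log2` calculation and therefore no special case for 0 or 1. The module name `BinSketch`, the keyword arguments, and the choice of least-significant-slice-first ordering are assumptions made for this illustration, not TTFunk's actual BinUtils.

```ruby
# Hypothetical helpers for illustration; not TTFunk's BinUtils.
module BinSketch
  module_function

  # Cut value into slice_count integers of bit_width bits each,
  # least significant slice first. No logarithm is needed because the
  # caller states how many slices it wants.
  def slice_int(value, bit_width:, slice_count:)
    mask = (1 << bit_width) - 1
    Array.new(slice_count) { |i| (value >> (bit_width * i)) & mask }
  end

  # Inverse of slice_int: reassemble the slices into one integer.
  def stitch_int(slices, bit_width:)
    slices.each_with_index.sum { |slice, i| slice << (bit_width * i) }
  end
end

# The value 1 needs no special handling, and four 32-bit slices always
# pack to 16 bytes, so no ljust padding is required afterwards.
words  = BinSketch.slice_int(1, bit_width: 32, slice_count: 4)
packed = words.pack('N*')
```

Fixing the slice count up front also makes the output width predictable for the caller, which is the property the manual `ljust(16, "\0")` padding was compensating for.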

pointlessone · 2018-03-09T09:40:10Z

lib/ttfunk/subset/code_page.rb

+                        .codepoints
+                        .first
+            rescue Encoding::UndefinedConversionError
+              # Ruby doesn't appear to think there is a strict 1:1 mapping


This comment is not quite accurate. Ruby's opinion has nothing to do with convertibility. There are actually illegal conversions.

For example, 0x81 is undefined in CP1252. There's no way to convert it into Unicode.

So maybe just say that there's no 1:1 mapping between all encodings and Unicode.

Ok, that's something I hadn't realized, thanks for the explanation.
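
The point about undefined bytes can be seen with a short sketch that builds a byte-to-Unicode map using only Ruby's built-in transcoding and skips bytes that have no mapping. The method name `unicode_map_for` is hypothetical and the code is an illustration, not the code under review.

```ruby
# Illustrative sketch; the method name is hypothetical.
def unicode_map_for(encoding)
  (0..255).each_with_object({}) do |byte, map|
    begin
      char = [byte].pack('C').force_encoding(encoding)
      map[byte] = char.encode(Encoding::UTF_8).codepoints.first
    rescue Encoding::UndefinedConversionError
      # This byte is unassigned in the code page, so it has no Unicode
      # equivalent and is left out of the map.
    end
  end
end

cp1252 = unicode_map_for(Encoding::Windows_1252)
cp1252[0x80]      # code point of the euro sign, U+20AC
cp1252.key?(0x81) # false, because 0x81 is unassigned in CP1252
```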

pointlessone · 2018-03-12T08:49:25Z

lib/ttfunk/subset/unicode.rb

@@ -33,13 +33,14 @@ def from_unicode(character)
        character
      end

-      protected


Do we need to expose the next method as well?

pointlessone · 2018-03-12T09:07:09Z

lib/ttfunk/table/os2.rb

+
+            # only flip the bit on if the subset includes all the characters
+            # that were present for this block in the original font
+            if code_points.all? { |cp| subset_code_points.include?(cp) }


I'm not sure if this is a sound approach. Please correct me if I'm mistaken.

Consider a font that has a full Latin alphabet and, let's say, sets the 1250 bit. For example, the font is used for only one title in the whole document that says "Report". The subset would only contain those 6 letters out of at least 48 provided by the font. For the purpose of the document those 6 should be considered "usable" and warrant a bit in the table. But with the current logic the bit won't be set, as the subset doesn't fully cover the original set.

It looks like the bit would rarely be set, if ever, since it's really hard to come up with a real-world document that fully covers a code page of any decent font. Consider just covering the English alphabet in both lower and upper case, plus the punctuation characters and braces. Most fonts include even more characters that are hard to come by in most documents.

Yes, I see what you're saying. Before submitting the PR I tried a different approach where bits were flipped on if any of the corresponding characters were present in the subset. I abandoned it because it was fairly slow and didn't seem correct (after all, the font's definition of a usable unicode range won't have been satisfied). I don't believe the font will be broken if no unicode ranges are indicated, so only flipping bits on if all characters are present seemed like the best approach. If you feel strongly about going back to the original technique, I'd be happy to revert.

it was fairly slow

I for some reason doubt that any? is slower than all?. They're the same in the worst case scenario.

the font's definition of a usable unicode range won't have been satisfied

It may not be for a regular font, since the font author has no idea how the font will be used. In general case the flag should be set when a code page has a decent if not full coverage of a code page.

Subset fonts are used in very specific scenarios. In the case of Prawn (and PDF in general), the fonts are only usable with a specific text. There's no risk of encountering a situation where the subset font won't cover what the original could have. So it still satisfies the "usability" criteria to the same extent the original font would have.

I think the appropriate approach would be to unset bits for ranges that are completely removed from the font.

I for some reason doubt that any? is slower than all?. They're the same in the worst case scenario.

No, that's not what I meant. My original algorithm used Array#bsearch for every character.

In general case the flag should be set when a code page has a decent if not full coverage of a code page.

Ok, that makes sense to me.

Looks like this is as simple as using any? instead of all?. Cool 😎
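
To make the difference concrete, here is a small sketch using the "Report" example from earlier in this thread. It covers only two OS/2 Unicode range bits (bit 0, Basic Latin, and bit 1, Latin-1 Supplement), and the helper name `surviving_bits` and its arguments are hypothetical, not the merged implementation.

```ruby
require 'set'

# Two OS/2 Unicode range bits and the code point ranges they describe.
BLOCKS = {
  0 => (0x0000..0x007F), # Basic Latin
  1 => (0x0080..0x00FF)  # Latin-1 Supplement
}.freeze

# Hypothetical helper: returns the bits that stay on for a subset.
# strategy is :any? or :all?, applied to the code points the original
# font provides within each block.
def surviving_bits(font_code_points, subset_code_points, strategy)
  subset = subset_code_points.to_set
  BLOCKS.select do |_bit, range|
    in_block = font_code_points.select { |cp| range.cover?(cp) }
    # Guard against empty blocks, since [].all? is true in Ruby.
    !in_block.empty? &&
      in_block.public_send(strategy) { |cp| subset.include?(cp) }
  end.keys
end

font   = (0x20..0x7E).to_a + [0xE9] # printable ASCII plus e-acute
subset = 'Report'.codepoints

with_any = surviving_bits(font, subset, :any?) # Basic Latin bit stays on
with_all = surviving_bits(font, subset, :all?) # no bit survives
```

With `any?`, a bit is cleared only when every character of its block has been removed from the subset, which is the behavior proposed above.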

pointlessone · 2018-03-13T07:45:00Z

lib/ttfunk/subset/unicode_8bit.rb

-        # using is irrelevant. We choose MacRoman because it's a 256-character
-        # encoding that happens to be well-supported in both TTF and
-        # PDF formats.
-        TTFunk::Table::Cmap.encode(mapping, :mac_roman)
      end

      def original_glyph_ids


This should not be public.

pointlessone · 2018-03-13T07:50:28Z

lib/ttfunk/subset/code_page.rb

+        end
+      end
+
+      def original_glyph_ids


This should not be public.

pointlessone · 2018-03-14T13:12:20Z

All right, it looks good. Sorry for torturing you. 😅

Make sure the history is good and it's good to merge.

camertron · 2018-03-14T14:52:31Z

@pointlessone not at all! I really appreciate how thorough you have been in reviewing these PRs :)

The commit history has been cleaned up, ready to merge when you are.

pointlessone reviewed Mar 4, 2018

camertron mentioned this pull request Mar 5, 2018

OTF Support #53

Closed

9 tasks

camertron force-pushed the refactor_os2 branch from 1a15e10 to 9bff838 Compare March 7, 2018 07:03

camertron force-pushed the refactor_os2 branch 3 times, most recently from e595be1 to 9f7ce8a Compare March 7, 2018 21:53

pointlessone reviewed Mar 9, 2018

pointlessone reviewed Mar 12, 2018

pointlessone reviewed Mar 13, 2018

camertron added 3 commits March 14, 2018 07:49

Refactor subset classes

1a9b28a

* Remove *Encoding classes in favor of Ruby's built-in encoding capabilities.
* Create a CodePage base class and add Windows1252 and MacRoman as subclasses. Move shared functionality into CodePage.

Support encoding the OS/2 table

af670ad

Add tests for OS/2 table and supporting classes.

fd48ed1

camertron force-pushed the refactor_os2 branch from 1f03ac9 to fd48ed1 Compare March 14, 2018 14:51

pointlessone approved these changes Mar 14, 2018

pointlessone merged commit ab795ff into prawnpdf:otf Mar 14, 2018
