
Add CFF core classes and data structures #59

Merged
merged 3 commits into otf from cff_core on May 11, 2018

Conversation

Member

@camertron camertron commented Apr 8, 2018

This pull request is part of a larger effort to bring OTF support to TTFunk. See #53 for details.

Highlights of this PR:

  1. The two CFF data structures added in this PR are Index and Dict. An index is essentially an array that can be accessed by index, and a dict is a dictionary of operator/operand (i.e. key/value) pairs. Operators have special meaning depending on the context. For example the CFF top and private dicts both use the dict structure but define different operators for the data they contain.
  2. The Dict class technically isn't used anywhere yet. Including it when it is actually used would make the corresponding PR too large in my opinion, so it is introduced here instead.
  3. The Sci class (couldn't come up with a better name, suggestions welcome) is supposed to represent a number in scientific notation (e.g. 5.2 × 10^4). The Dict class uses it for a special type of operand that starts with the byte 30. I could have stored it as a float, but it would then have been a lot more work to re-encode later.
  4. The TTFunk::SubTable class has been introduced to handle the case when table-like parsing behavior is desired but the table is not a top-level, directory-based font table.
  5. The beginnings of an incomplete CFF table are also present. Currently only the header and name index sub-tables are supported; more are coming soon. CFF is by far the most complicated table TTFunk supports to date and the centerpiece of the OpenType font format.
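For readers unfamiliar with the format, the INDEX layout described above can be sketched in a few lines. This is a simplified illustration only; parse_index and the byte layout below follow the CFF spec's description of an INDEX, not the PR's actual API:

```ruby
require 'stringio'

# Minimal sketch of CFF INDEX parsing (hypothetical helper, not the PR's
# code). Per the spec, an INDEX is: count (uint16), offSize (uint8),
# count + 1 offsets of offSize bytes each (1-based), then the
# concatenated object data.
def parse_index(io)
  count = io.read(2).unpack('n').first
  return [] if count.zero? # an empty INDEX is just the 2-byte count

  off_size = io.read(1).unpack('C').first
  offsets = Array.new(count + 1) do
    bytes = io.read(off_size)
    ("\x00" * (4 - off_size) + bytes).unpack('N').first
  end
  data = io.read(offsets.last - 1)
  offsets.each_cons(2).map { |start, finish| data[(start - 1)...(finish - 1)] }
end

# A tiny INDEX containing "Foo" and "Bar": count=2, offSize=1,
# offsets 1, 4, 7, then 6 bytes of object data.
raw = [2, 1, 1, 4, 7].pack('nC4') + 'FooBar'
p parse_index(StringIO.new(raw)) # => ["Foo", "Bar"]
```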

@pointlessone
Member

  1. Could you please provide a link to the spec? This is a rather logic-heavy PR, and I'd like to at least try to verify the logic, as well as the heaps of magic numbers.
  2. I'd rather leave Dict out until it's actually used. PRs are as big as they need to be. You can use commits within a PR to show logical progression and further split changes into smaller pieces.
  3. Sci is indeed a bit vague. Maybe SkiForm? It's not much better, but it at least resembles a noun and may be a bit more telling to people who are familiar with the proper term.
  4. Reasonable. Though I'd expect a more generically named Table to be useful for all table-like structures, with more specialized subclasses implementing the specifics of top-level tables. Maybe we should consider this as an intermediary refactoring step either before or after this PR. WDYT?
  5. Is the CFF table functional within this PR?

@camertron
Member Author

  1. This is the document I used to implement the CFF table.
  2. Ok, that's fine.
  3. Hmm, maybe SciForm? The math world has to have invented a word for this, but I don't know what it is. I'll keep thinking about it.
  4. I see your point. I think the original authors created the Table class to correspond to the TrueType concept of a table, which is a structure that has an entry in the font's directory. The structures within the CFF table aren't truly tables, or at least not according to the spec. So perhaps the right move here is to rename SubTable to something like CFFStructure.
  5. Well... that's a complicated question. I don't believe the CFF table is valid with only a header and name index. In addition, subsetting OTF files at the moment will break because the glyf table doesn't exist for CID-keyed fonts. The last branch, otf_encoder, handles encoding OTF fonts and takes the missing glyf table into consideration.

Just a reminder that we're merging everything into the otf branch, not master, so there's no danger of messing things up by merging incomplete features. I appreciate what you said about PRs being as large as they need to be, but I really don't see the need for the otf branch to be fully working after every one of my smaller PRs; it will be fully working when they have all been merged. I split things up to make the whole feature conceptually easier to understand. Just as I have submitted PRs for individual tables like VORG and DSIG, these upcoming CFF PRs introduce a few sub-tables (i.e. CFF structures) at a time. If you're OK with the entire CFF feature landing at once, I can create a single giant PR for the remaining work. A valid CFF table contains 5 top-level structures.

@pointlessone pointlessone left a comment

Sorry for the delay.

@@ -0,0 +1,14 @@
module TTFunk
class Sci
Member

Let's go with SkiForm. Scientific form is one of the proper terms for this and the one, I believe, that is most fitting here.

Member Author

Alright, that's fine. My main issue is that I don't understand where the letter k comes from... why not SciForm?

Member

Typo. 😅

end

def parse!
@count, @offset_size = read(3, 'nc')
Member

offset_size is supposed to be unsigned. That is, uppercase C.
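The two directives only diverge for byte values above 0x7F, but the unsigned one matches the spec's Card8 type. A quick illustration (not from the PR):

```ruby
# 'c' reads a signed 8-bit value; 'C' reads it unsigned. Any byte above
# 0x7F goes negative under 'c', which would corrupt arithmetic built on it.
byte = [0x84].pack('C')

p byte.unpack('c').first # => -124
p byte.unpack('C').first # => 132
```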

Member Author

Wow. Nice catch!

end

def parse!
@count, @offset_size = read(3, 'nc')
Member

I don't think this is correct.

Here's a relevant section from the spec:

An empty INDEX is represented by a count field with a 0 value and no additional fields. Thus, the total size of an empty INDEX is 2 bytes.
— 5 INDEX data, page 12

This and the next lines would read into the following indexes if the current index happens to be empty.

So we have to first read just the count field and then, if it's not 0, read the size of the offset array entries, the array itself, and the object data.
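A small demonstration of the misalignment (illustrative bytes only, not the PR's code):

```ruby
require 'stringio'

# Illustrative bytes: an empty INDEX (2-byte zero count) followed
# immediately by another INDEX whose count is 1. Reading count and
# offSize together (3 bytes) swallows the next INDEX's first byte.
io = StringIO.new([0].pack('n') + [1].pack('n'))

count, off_size = io.read(3).unpack('nC')
p count          # => 0
p off_size       # => 0 — but this byte belonged to the next INDEX
p io.read.bytes  # => [1] — only one byte left; the stream is now misaligned
```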

Member Author

Another excellent catch, thank you :)


if @raw_offset_array.empty?
@raw_data_array = ''
@length = 3
Member

This is incorrect as well. See previous comment on this method.

end

_, last_finish = relative_data_offsets_for(count - 1)
@raw_data_array = io.read(last_finish)
Member

I'm not quite happy with the structure of the code.

Parsing is done only partially here. The rest of it is done during access or encoding (by virtue of access). The whole time, we hold this weird offsets table in memory.

Wouldn't it be easier to completely parse the index (including at least slicing the object data into separate pieces) and get rid of offsets altogether? This would give us a nice array, with indexes that are much more familiar to Ruby users.

I understand that the offsets table is a private part of the implementation and never exposed in the public interface, but it is spread way too much over the whole class.

Member Author

@camertron camertron May 1, 2018

I agree it might be a bit easier to parse the whole thing into an array. However, I consciously decided to keep the offset and data strings because I didn't want to spend a lot of time and memory constructing a potentially large array that may not get used anywhere. The Index can be accessed via the familiar Ruby indexing strategy (i.e. #[]) and is Enumerable, meaning it can be turned into a regular Ruby array pretty easily. The calculation of offsets, etc. is done in internal, private methods which I had hoped would be abstract enough to be accessible to most Ruby programmers.
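The access pattern described here — offsets kept internally, items sliced out of the raw data on demand via #[] and Enumerable — might look roughly like this sketch (LazyIndex and its constructor are hypothetical names, not the PR's class):

```ruby
# Hypothetical sketch of an offset-backed, Enumerable index: items are
# sliced out of the raw data string lazily, on access.
class LazyIndex
  include Enumerable

  # offsets are 1-based positions into data, as in the CFF spec
  def initialize(offsets, data)
    @offsets = offsets
    @data = data
  end

  def [](i)
    start, finish = @offsets[i], @offsets[i + 1]
    return nil unless start && finish

    @data[(start - 1)...(finish - 1)]
  end

  def each
    0.upto(@offsets.length - 2) { |i| yield self[i] }
  end
end

idx = LazyIndex.new([1, 4, 7], 'FooBar')
p idx[1]   # => "Bar"
p idx.to_a # => ["Foo", "Bar"]
```

The trade-off debated above is visible here: nothing is sliced until someone asks, but every accessor has to know about the offsets representation.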

Member

Well, the whole thing is parsed anyway when the table is re-encoded, so no time/memory is saved. I don't think the memory overhead is significant anyway, but the code complexity is significantly higher compared to upfront parsing.

Member Author

Ah, that's a good point.

end

def unpack_offset(offset_data)
case offset_data.length
Member

This whole case statement can leverage padding and be much shorter:

padding = "\x00" * (4 - @offset_size)
(padding + offset_data).unpack('N').first
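The padding trick, written out as a runnable sketch (the method name and signature here are illustrative):

```ruby
# Zero-padding a 1-3 byte big-endian offset up to 4 bytes lets a single
# unpack('N') (32-bit big-endian unsigned) decode any offSize.
def unpack_offset(offset_data, offset_size)
  padding = "\x00" * (4 - offset_size)
  (padding + offset_data).unpack('N').first
end

p unpack_offset("\x07", 1)         # => 7
p unpack_offset("\x01\x00", 2)     # => 256
p unpack_offset("\x01\x00\x00", 3) # => 65536
```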

Member Author

Nice :)

(int >> 16) & 0xFF,
(int >> 8) & 0xFF,
int & 0xFF
]
Member

[29, int].pack("CV").bytes

Looks simpler and also, I guess, a bit faster.

Also, maybe all encode_ methods should return strings instead of byte arrays, to facilitate faster encoding methods than manually unpacked integers.

Member Author

Yeah, this is great. Didn't know about V.

Also maybe all encode_ methods should return string instead of byte arrays to facilitate faster encoding methods than manually unpacked integers.

This is already the case for everything except encode_integer, no?
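One thing worth double-checking in this exchange: 'V' and the manual shift-and-mask code don't produce the same bytes. A quick demonstration of the directives involved (not the PR's code):

```ruby
# pack directives differ in byte order: 'V' is a 32-bit little-endian
# unsigned int, 'N' is big-endian. The manual shift-and-mask version in
# the original code emits the most significant byte first, i.e. big-endian.
int = 0x01020304

p [29, int].pack('CV').bytes # => [29, 4, 3, 2, 1] (little-endian)
p [29, int].pack('CN').bytes # => [29, 1, 2, 3, 4] (big-endian)

manual = [(int >> 24) & 0xFF, (int >> 16) & 0xFF, (int >> 8) & 0xFF, int & 0xFF]
p manual # => [1, 2, 3, 4] — matches 'N', not 'V'
```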

end

def decode_two_byte_operator
1200 + read(1, 'C').first
Member

This is an arbitrary magic number. It doesn't really represent any special or useful value. Maybe it'd be easier to keep operators as arrays of numbers? Like [12, 3]. I think it's as good as 1203 and makes some code go away. WDYT?

Member Author

Yeah I like that. I'll see what I can do.

Member Author

Ah, but this does make Dict access a little more challenging. Now instead of doing dict[1203] you have to do dict[[12, 3]].
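For what it's worth, Ruby Arrays hash by their contents, so dict[[12, 3]] is wordier but behaves predictably as a Hash key (the operand value below is illustrative):

```ruby
# Ruby Arrays implement #hash and #eql? by content, so an array literal
# works as a stable Hash key — dict[[12, 3]] is wordy but well-defined.
dict = {}
dict[[12, 3]] = :some_operand # illustrative value for two-byte operator 12 3

p dict[[12, 3]] # => :some_operand
p dict[[12, 4]] # => nil
```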

Member Author

I'll pull it out into a constant for now.

@dict[operator] = operands
operands = []
when 0..21
@dict[b_zero] = operands
Member

It may prove prudent to validate the data type of operands for a specific operator.

Member Author

Would you like to see that added before merge? If so I can work something up.

Member

Yes, please. It seems like a logical place to add it if at all.

Member Author

Ok, I realized the problem is that dicts themselves don't really have the concept of valid or invalid data - they're just data structures. Individual types of dict (like the top dict, private dict, etc) do contain specific operators and operands, but I think it makes sense to add validation when those particular classes are introduced. Thoughts?

Member

I don't know. Operators are present in the parsing logic. Even though operators are not universal (each is used only in specific dicts), they have specific types of operands. In my mind, it makes sense to validate operands here. Maybe also add additional validations to specific dicts that restrict operators, but that doesn't seem to be connected with operand validation.
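As a rough illustration of what per-operator operand validation could look like at parse time (the operator number, type table, and method name here are all illustrative, not taken from the PR or the spec's operator tables):

```ruby
# Hypothetical sketch of operand validation per operator. Real CFF
# operators and their operand types come from the spec's DICT tables.
OPERAND_TYPES = {
  15 => :number # e.g. an operator whose operands must all be numeric
}.freeze

def validate!(operator, operands)
  case OPERAND_TYPES.fetch(operator, :any)
  when :number
    operands.each do |op|
      unless op.is_a?(Numeric)
        raise ArgumentError,
              "operator #{operator} expects numeric operands, got #{op.inspect}"
      end
    end
  end
  operands
end

validate!(15, [1, 2.5]) # passes
# validate!(15, ['oops']) would raise ArgumentError
```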

end

def encode_significand(sig)
sig.to_s.each_char.with_object([]) do |char, ret|
Member

This looks a lot like map

Member Author

It is except for the else case.

@pointlessone
Member

This PR is rather heavy on complex logic but specs are sparse. Maybe you could add a bit more coverage?

@camertron camertron force-pushed the cff_core branch 3 times, most recently from 2750933 to 484339d Compare May 9, 2018 15:51
@camertron
Member Author

Hey @pointlessone, I think I've addressed all your concerns. Specifically, I have:

  1. Added specs for Dict and Index (caught several bugs too hehe).
  2. Updated Index to parse everything immediately, which simplified the logic considerably.
  3. Added validation of Dict operands (specifically sci form operands since it's not really possible to validate integers).

alias each_pair each

def encode
''.tap do |result|
Member

Just a heads up: this pattern will have to be replaced. You may have seen that Prawn master has been updated to use Ruby 2.3. That entails frozen strings, as we're trying to stay close to the community style guide (and, coincidentally, RuboCop's default configuration). I haven't updated TTFunk yet to reflect the change, but I will before the release. You can keep using string mutation throughout these PRs and I will handle the update after the merge. Or you can help me a bit and opt for immutable strings in your PRs.

I'd suggest .map{}.join('') as a replacement pattern here.

Member Author

According to a couple of blog articles, we can also just do ''.dup.tap do .... What do you think about that?

Member

I mean, technically you can use +'' to get a mutable string, which is more concise and intention revealing. But why use mutation if there's an equally good alternative?
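For reference, String#+@ (the +'' syntax, available since Ruby 2.3) returns a mutable copy only when the receiver is frozen; on an already-mutable string it returns the receiver itself:

```ruby
# String#+@: returns a mutable duplicate when the receiver is frozen,
# and the receiver itself when it's already mutable.
frozen = 'abc'.freeze
s = +frozen
p s.frozen?      # => false
s << 'def'       # safe: s is a copy, frozen is untouched
p s              # => "abcdef"
p frozen         # => "abc"
p (+s).equal?(s) # => true (no copy made for a mutable string)
```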

Member Author

I was just thinking << is more efficient than allocating a temporary array.

Member

I thought that they should be comparable in performance since Array#join internally is very similar to s = ''; array.each {|x| s << x }. .map.join just looks a bit more natural to my eye. But I got curious and decided to check.

require 'benchmark/ips'

$a = (0..1_000).map(&:to_s)

Benchmark.ips do |x|
  x.report "String#<<" do
    s = +''
    $a.each do |n|
      s << n
    end
  end

  x.report "Array#join" do
    $a.join('')
  end
end
 String#<<     13.332k (±16.5%) i/s -     64.183k in   5.062950s
Array#join     17.965k (±17.9%) i/s -     85.050k in   5.069048s

Apparently, .join is about 30% faster.

Member Author

Ok, but the construction of the array should be part of the benchmark. The individual << operations vs a single join (which I'm sure performs a loop in C land) might be fast, but constructing the array is an operation that we are introducing to the algorithm, so it should be measured:

require 'benchmark/ips'

Benchmark.ips do |x|
  x.report "String#<<" do
    s = +''
    1_000.times do |i|
      s << i.to_s
    end
  end

  x.report "Array#join" do
    a = (0..1_000).map(&:to_s)
    a.join('')
  end

  x.compare!
end

As it turns out, the two techniques result in about the same i/s:

Warming up --------------------------------------
           String#<<   545.000  i/100ms
          Array#join   490.000  i/100ms
Calculating -------------------------------------
           String#<<      5.655k (± 4.3%) i/s -     28.340k in   5.020904s
          Array#join      5.197k (± 4.3%) i/s -     25.970k in   5.006597s

Comparison:
           String#<<:     5655.1 i/s
          Array#join:     5196.7 i/s - same-ish: difference falls within error


attr_reader :file, :table_offset

# set by parse! in derived classes
Member

This comment is a bit misleading. It's only true for Index; others have it either set in the constructor (Dict) or hardcoded (Header).

@pointlessone pointlessone merged commit b2f0001 into prawnpdf:otf May 11, 2018
@camertron camertron deleted the cff_core branch May 11, 2018 15:37