
Add CFF core classes and data structures #59

Merged
merged 3 commits into otf from cff_core on May 11, 2018

Conversation

Member

@camertron camertron commented Apr 8, 2018

This pull request is part of a larger effort to bring OTF support to TTFunk. See #53 for details.

Highlights of this PR:

  1. The two CFF data structures added in this PR are Index and Dict. An index is essentially an array that can be accessed by index, and a dict is a dictionary of operator/operand (i.e. key/value) pairs. Operators have special meaning depending on the context. For example the CFF top and private dicts both use the dict structure but define different operators for the data they contain.
  2. The Dict class technically isn't used anywhere yet. Including it when it is actually used would make the corresponding PR too large in my opinion, so it is introduced here instead.
  3. The Sci class (couldn't come up with a better name, suggestions welcome) is supposed to represent a number in scientific notation (e.g. 5.2 × 10^4). The Dict class uses it for a special type of operand that starts with the byte 30. I could have stored it as a float, but it would then have been a lot more work to re-encode later.
  4. The TTFunk::SubTable class has been introduced to handle the case when table-like parsing behavior is desired but the table is not a top-level, directory-based font table.
  5. The beginnings of an incomplete CFF table are also present. Currently only the header and name index sub-tables are supported; more are coming soon. CFF is by far the most complicated table TTFunk supports to date and the centerpiece of the OpenType font format.
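For readers unfamiliar with the format, the INDEX layout described above can be sketched in a few lines. This is a simplified illustration only; parse_index and the byte layout below follow the CFF spec's description of an INDEX, not the PR's actual API:

```ruby
require 'stringio'

# Minimal sketch of CFF INDEX parsing (hypothetical helper, not the PR's
# code). Per the spec, an INDEX is: count (uint16), offSize (uint8),
# count + 1 offsets of offSize bytes each (1-based), then the
# concatenated object data.
def parse_index(io)
  count = io.read(2).unpack('n').first
  return [] if count.zero? # an empty INDEX is just the 2-byte count

  off_size = io.read(1).unpack('C').first
  offsets = Array.new(count + 1) do
    bytes = io.read(off_size)
    ("\x00" * (4 - off_size) + bytes).unpack('N').first
  end
  data = io.read(offsets.last - 1)
  offsets.each_cons(2).map { |start, finish| data[(start - 1)...(finish - 1)] }
end

# A tiny INDEX containing "Foo" and "Bar": count=2, offSize=1,
# offsets 1, 4, 7, then 6 bytes of object data.
raw = [2, 1, 1, 4, 7].pack('nC4') + 'FooBar'
p parse_index(StringIO.new(raw)) # => ["Foo", "Bar"]
```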

@pointlessone
Member

  1. Could you please provide a link to the spec? This is a rather logic-heavy PR, and I'd like to at least try to verify the logic, as well as the heaps of magic numbers.
  2. I'd rather leave Dict out until it's actually used. PRs are as big as they need to be. You can use commits within a PR to show logical progression and further split changes into smaller pieces.
  3. Sci is indeed a bit vague. Maybe SkiForm? It's not much better, but it at least resembles a noun and may be a bit more telling to people who are familiar with the proper term.
  4. Reasonable. Though I'd expect a more generically named Table to be useful for all table-like structures, with more specialized subclasses implementing the specifics of top-level tables. Maybe we should consider this as an intermediary refactoring step either before or after this PR. WDYT?
  5. Is the CFF table functional within this PR?

@camertron
Member Author

  1. This is the document I used to implement the CFF table.
  2. Ok, that's fine.
  3. Hmm, maybe SciForm? The math world has to have invented a word for this, but I don't know what it is. I'll keep thinking about it.
  4. I see your point. I think the original authors created the Table class to correspond to the TrueType concept of a table, which is a structure that has an entry in the font's directory. The structures within the CFF table aren't truly tables, or at least not according to the spec. So perhaps the right move here is to rename SubTable to something like CFFStructure.
  5. Well... that's a complicated question. I don't believe the CFF table is valid with only a header and name index. In addition, subsetting OTF files at the moment will break because the glyf table doesn't exist for CID-keyed fonts. The last branch, otf_encoder, handles encoding OTF fonts and takes the missing glyf table into consideration.

Just a reminder that we're merging everything into the otf branch, not master, so there's no danger of messing things up by merging incomplete features. I appreciate what you said about PRs being as large as they need to be, but I really don't see the need for the otf branch to be fully working after every one of my smaller PRs; it will be fully working when they have all been merged. I split things up to make the whole feature conceptually easier to understand. Just as I have submitted PRs for individual tables like VORG and DSIG, these upcoming CFF PRs introduce a few sub-tables (i.e. CFF structures) at a time. If you're OK with the entire CFF feature landing at once, I can create a single giant PR for the remaining work. A valid CFF table contains 5 top-level structures.

@pointlessone pointlessone left a comment

Sorry for the delay.

@@ -0,0 +1,14 @@
module TTFunk
class Sci
Member

Let's go with SkiForm. Scientific form is one of the proper terms for this and the one, I believe, that is most fitting here.

Member Author

Alright, that's fine. My main issue is that I don't understand where the letter k comes from... why not SciForm?

Member

Typo. 😅

end

def parse!
@count, @offset_size = read(3, 'nc')
Member

offset_size is supposed to be unsigned. That is, uppercase C.
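The two directives only diverge for byte values above 0x7F, but the unsigned one matches the spec's Card8 type. A quick illustration (not from the PR):

```ruby
# 'c' reads a signed 8-bit value; 'C' reads it unsigned. Any byte above
# 0x7F goes negative under 'c', which would corrupt arithmetic built on it.
byte = [0x84].pack('C')

p byte.unpack('c').first # => -124
p byte.unpack('C').first # => 132
```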

Member Author

Wow. Nice catch!

end

def parse!
@count, @offset_size = read(3, 'nc')
Member

I don't think this is correct.

Here's a relevant section from the spec:

An empty INDEX is represented by a count field with a 0 value and no additional fields. Thus, the total size of an empty INDEX is 2 bytes.
— 5 INDEX data, page 12

This and the next lines would read into the following indexes if the current index happens to be empty.

So we have to first read just the count field and then, if it's not 0, read the size of the offset array entries, the array itself, and the object data.
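A small demonstration of the misalignment (illustrative bytes only, not the PR's code):

```ruby
require 'stringio'

# Illustrative bytes: an empty INDEX (2-byte zero count) followed
# immediately by another INDEX whose count is 1. Reading count and
# offSize together (3 bytes) swallows the next INDEX's first byte.
io = StringIO.new([0].pack('n') + [1].pack('n'))

count, off_size = io.read(3).unpack('nC')
p count          # => 0
p off_size       # => 0 — but this byte belonged to the next INDEX
p io.read.bytes  # => [1] — only one byte left; the stream is now misaligned
```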

Member Author

Another excellent catch, thank you :)


if @raw_offset_array.empty?
@raw_data_array = ''
@length = 3
Member

This is incorrect as well. See previous comment on this method.

end

_, last_finish = relative_data_offsets_for(count - 1)
@raw_data_array = io.read(last_finish)
Member

I'm not quite happy with the structure of the code.

Parsing is done only partially here. The rest of it is done during access or encoding (by virtue of access). The whole time, we hold this weird offsets table in memory.

Wouldn't it be easier to completely parse the index (including at least slicing the object data into separate pieces) and get rid of offsets altogether? This would give us a nice array, with indexes that are much more familiar to Ruby users.

I understand that the offsets table is a private part of the implementation and never exposed in the public interface, but it is spread way too much over the whole class.

Member Author

@camertron camertron May 1, 2018

I agree it might be a bit easier to parse the whole thing into an array. However, I consciously decided to keep the offset and data strings because I didn't want to spend a lot of time and memory constructing a potentially large array that may not get used anywhere. The Index can be accessed via the familiar Ruby indexing strategy (i.e. #[]) and is Enumerable, meaning it can be turned into a regular Ruby array pretty easily. The calculation of offsets, etc. is done in internal, private methods which I had hoped would be abstract enough to be accessible to most Ruby programmers.
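The access pattern described here — offsets kept internally, items sliced out of the raw data on demand via #[] and Enumerable — might look roughly like this sketch (LazyIndex and its constructor are hypothetical names, not the PR's class):

```ruby
# Hypothetical sketch of an offset-backed, Enumerable index: items are
# sliced out of the raw data string lazily, on access.
class LazyIndex
  include Enumerable

  # offsets are 1-based positions into data, as in the CFF spec
  def initialize(offsets, data)
    @offsets = offsets
    @data = data
  end

  def [](i)
    start, finish = @offsets[i], @offsets[i + 1]
    return nil unless start && finish

    @data[(start - 1)...(finish - 1)]
  end

  def each
    0.upto(@offsets.length - 2) { |i| yield self[i] }
  end
end

idx = LazyIndex.new([1, 4, 7], 'FooBar')
p idx[1]   # => "Bar"
p idx.to_a # => ["Foo", "Bar"]
```

The trade-off debated above is visible here: nothing is sliced until someone asks, but every accessor has to know about the offsets representation.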

Member

Well, the whole thing is parsed anyway when the table is re-encoded, so no time/memory is saved. I don't think the memory overhead is significant anyway, but the code complexity is significantly higher compared to upfront parsing.

Member Author

Ah, that's a good point.

end

def unpack_offset(offset_data)
case offset_data.length
Member

This whole case statement can leverage padding and be much shorter:

padding = "\x00" * (4 - @offset_size)
(padding + offset_data).unpack('N').first
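The padding trick, written out as a runnable sketch (the method name and signature here are illustrative):

```ruby
# Zero-padding a 1-3 byte big-endian offset up to 4 bytes lets a single
# unpack('N') (32-bit big-endian unsigned) decode any offSize.
def unpack_offset(offset_data, offset_size)
  padding = "\x00" * (4 - offset_size)
  (padding + offset_data).unpack('N').first
end

p unpack_offset("\x07", 1)         # => 7
p unpack_offset("\x01\x00", 2)     # => 256
p unpack_offset("\x01\x00\x00", 3) # => 65536
```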

Member Author

Nice :)

(int >> 16) & 0xFF,
(int >> 8) & 0xFF,
int & 0xFF
]
Member

[29, int].pack("CV").bytes

Looks simpler and also, I guess, a bit faster.

Also, maybe all encode_ methods should return strings instead of byte arrays, to facilitate faster encoding methods than manually unpacked integers.

Member Author

Yeah, this is great. Didn't know about V.

Also maybe all encode_ methods should return string instead of byte arrays to facilitate faster encoding methods than manually unpacked integers.

This is already the case for everything except encode_integer, no?
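One thing worth double-checking in this exchange: 'V' and the manual shift-and-mask code don't produce the same bytes. A quick demonstration of the directives involved (not the PR's code):

```ruby
# pack directives differ in byte order: 'V' is a 32-bit little-endian
# unsigned int, 'N' is big-endian. The manual shift-and-mask version in
# the original code emits the most significant byte first, i.e. big-endian.
int = 0x01020304

p [29, int].pack('CV').bytes # => [29, 4, 3, 2, 1] (little-endian)
p [29, int].pack('CN').bytes # => [29, 1, 2, 3, 4] (big-endian)

manual = [(int >> 24) & 0xFF, (int >> 16) & 0xFF, (int >> 8) & 0xFF, int & 0xFF]
p manual # => [1, 2, 3, 4] — matches 'N', not 'V'
```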

end

def decode_two_byte_operator
1200 + read(1, 'C').first
Member

This is an arbitrary magic number. It doesn't really represent any special or useful value. Maybe it'd be easier to keep operators as arrays of numbers? Like [12, 3]. I think it's as good as 1203 and makes some code go away. WDYT?

Member Author

Yeah I like that. I'll see what I can do.

Member Author

Ah, but this does make Dict access a little more challenging. Now instead of doing dict[1203] you have to do dict[[12, 3]].
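For what it's worth, Ruby Arrays hash by their contents, so dict[[12, 3]] is wordier but behaves predictably as a Hash key (the operand value below is illustrative):

```ruby
# Ruby Arrays implement #hash and #eql? by content, so an array literal
# works as a stable Hash key — dict[[12, 3]] is wordy but well-defined.
dict = {}
dict[[12, 3]] = :some_operand # illustrative value for two-byte operator 12 3

p dict[[12, 3]] # => :some_operand
p dict[[12, 4]] # => nil
```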

Member Author

I'll pull it out into a constant for now.

@dict[operator] = operands
operands = []
when 0..21
@dict[b_zero] = operands
Member

It may prove prudent to validate the data type of operands for a specific operator.

Member Author

Would you like to see that added before merge? If so I can work something up.

Member

Yes, please. It seems like a logical place to add it if at all.

Member Author

Ok, I realized the problem is that dicts themselves don't really have the concept of valid or invalid data - they're just data structures. Individual types of dict (like the top dict, private dict, etc) do contain specific operators and operands, but I think it makes sense to add validation when those particular classes are introduced. Thoughts?

Member

I don't know. Operators are present in the parsing logic. Even though operators are not universal (each is used only in specific dicts), they have specific types of operands. In my mind, it makes sense to validate operands here. Maybe also add additional validations to specific dicts that restrict operators, but that doesn't seem to be connected with operand validation.
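As a rough illustration of what per-operator operand validation could look like at parse time (the operator number, type table, and method name here are all illustrative, not taken from the PR or the spec's operator tables):

```ruby
# Hypothetical sketch of operand validation per operator. Real CFF
# operators and their operand types come from the spec's DICT tables.
OPERAND_TYPES = {
  15 => :number # e.g. an operator whose operands must all be numeric
}.freeze

def validate!(operator, operands)
  case OPERAND_TYPES.fetch(operator, :any)
  when :number
    operands.each do |op|
      unless op.is_a?(Numeric)
        raise ArgumentError,
              "operator #{operator} expects numeric operands, got #{op.inspect}"
      end
    end
  end
  operands
end

validate!(15, [1, 2.5]) # passes
# validate!(15, ['oops']) would raise ArgumentError
```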

end

def encode_significand(sig)
sig.to_s.each_char.with_object([]) do |char, ret|
Member

This looks a lot like map

Member Author

It is except for the else case.

@pointlessone
Member

This PR is rather heavy on complex logic but specs are sparse. Maybe you could add a bit more coverage?

@camertron camertron force-pushed the cff_core branch 3 times, most recently from 2750933 to 484339d Compare May 9, 2018 15:51
@camertron
Member Author

Hey @pointlessone, I think I've addressed all your concerns. Specifically, I have:

  1. Added specs for Dict and Index (caught several bugs too hehe).
  2. Updated Index to parse everything immediately, which simplified the logic considerably.
  3. Added validation of Dict operands (specifically sci form operands since it's not really possible to validate integers).

alias each_pair each

def encode
''.tap do |result|
Member

Just a heads up: this pattern will have to be replaced. You may have seen that Prawn master has been updated to use Ruby 2.3. That entails frozen strings, as we're trying to stay close to the community style guide (and, coincidentally, RuboCop's default configuration). I haven't updated TTFunk yet to reflect the change, but I will before the release. You can keep using string mutation throughout these PRs and I will handle the update after the merge. Or you can help me a bit and opt for immutable strings in your PRs.

I'd suggest .map{}.join('') as a replacement pattern here.

Member Author

According to a couple of blog articles, we can also just do ''.dup.tap do .... What do you think about that?

Member

I mean, technically you can use +'' to get a mutable string, which is more concise and intention revealing. But why use mutation if there's an equally good alternative?
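For reference, String#+@ (the +'' syntax, available since Ruby 2.3) returns a mutable copy only when the receiver is frozen; on an already-mutable string it returns the receiver itself:

```ruby
# String#+@: returns a mutable duplicate when the receiver is frozen,
# and the receiver itself when it's already mutable.
frozen = 'abc'.freeze
s = +frozen
p s.frozen?      # => false
s << 'def'       # safe: s is a copy, frozen is untouched
p s              # => "abcdef"
p frozen         # => "abc"
p (+s).equal?(s) # => true (no copy made for a mutable string)
```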

Member Author

I was just thinking << is more efficient than allocating a temporary array.

Member

I thought that they should be comparable in performance since Array#join internally is very similar to s = ''; array.each {|x| s << x }. .map.join just looks a bit more natural to my eye. But I got curious and decided to check.

require 'benchmark/ips'

$a = (0..1_000).map(&:to_s)

Benchmark.ips do |x|
  x.report "String#<<" do
    s = +''
    $a.each do |n|
      s << n
    end
  end

  x.report "Array#join" do
    $a.join('')
  end
end
 String#<<     13.332k (±16.5%) i/s -     64.183k in   5.062950s
Array#join     17.965k (±17.9%) i/s -     85.050k in   5.069048s

Apparently, .join is about 30% faster.

Member Author

Ok, but the construction of the array should be part of the benchmark. The individual << operations vs a single join (which I'm sure performs a loop in C land) might be fast, but constructing the array is an operation that we are introducing to the algorithm, so it should be measured:

require 'benchmark/ips'

Benchmark.ips do |x|
  x.report "String#<<" do
    s = +''
    1_000.times do |i|
      s << i.to_s
    end
  end

  x.report "Array#join" do
    a = (0..1_000).map(&:to_s)
    a.join('')
  end

  x.compare!
end

As it turns out, the two techniques result in about the same i/s:

Warming up --------------------------------------
           String#<<   545.000  i/100ms
          Array#join   490.000  i/100ms
Calculating -------------------------------------
           String#<<      5.655k (± 4.3%) i/s -     28.340k in   5.020904s
          Array#join      5.197k (± 4.3%) i/s -     25.970k in   5.006597s

Comparison:
           String#<<:     5655.1 i/s
          Array#join:     5196.7 i/s - same-ish: difference falls within error


attr_reader :file, :table_offset

# set by parse! in derived classes
Member

This comment is a bit misleading. It's only true for Index; others have it either set in the constructor (Dict) or hardcoded (Header).

@pointlessone pointlessone merged commit b2f0001 into prawnpdf:otf May 11, 2018
@camertron camertron deleted the cff_core branch May 11, 2018 15:37