New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CFF core classes and data structures #59
Conversation
|
Just a reminder that we're merging everything into the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay.
lib/ttfunk/sci.rb
Outdated
@@ -0,0 +1,14 @@ | |||
module TTFunk | |||
class Sci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with SkiForm
. Scientific form is one of the proper terms for this and the one, I believe, that is most fitting here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, that's fine. My main issue is that I don't understand where the letter k
comes from... why not SciForm
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo. 😅
lib/ttfunk/table/cff/index.rb
Outdated
end | ||
|
||
def parse! | ||
@count, @offset_size = read(3, 'nc') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offset_size
is supposed to be unsigned. That is uppercase C
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. Nice catch!
lib/ttfunk/table/cff/index.rb
Outdated
end | ||
|
||
def parse! | ||
@count, @offset_size = read(3, 'nc') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct.
Here's a relevant section from the spec:
An empty INDEX is represented by a count field with a 0 value and no additional fields. Thus, the total size of an empty INDEX is 2 bytes.
— 5 INDEX data, page 12
This and the next lines would read out of the following indexes if current index happens to be empty.
So, we have to first read just the count field and then if it's not 0
read the length of offset array, the array itself and the object data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another excellent catch, thank you :)
lib/ttfunk/table/cff/index.rb
Outdated
|
||
if @raw_offset_array.empty? | ||
@raw_data_array = '' | ||
@length = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect as well. See previous comment on this method.
lib/ttfunk/table/cff/index.rb
Outdated
end | ||
|
||
_, last_finish = relative_data_offsets_for(count - 1) | ||
@raw_data_array = io.read(last_finish) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite happy with the structure of the code.
Parsing is done only partially here. The rest of it is done during access or encoding (by virtue of access). The whole time we hold in memory this weird offsets table.
Wouldn't it be easier to completely parse the index (including at least slicing object data into separate pieces) and getting rid of offsets altogether. This would give us a nice array, with indexes that are much more familiar for Ruby users.
I understand that offsets is a private part of the implementation and never exposed in the public interface but they are spread way too much over the whole class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it might be a bit easier to parse the whole thing into an array. However I consciously decided to keep the offset and data strings because I didn't want to spend a lot of time and memory constructing a potentially large array that may not get used anywhere. The Index
can be accessed by the familiar Ruby indexing strategy (i.e. #[]
) and is Enumerable
, meaning it can be turned into a regular Ruby array pretty easily. The calculation of offsets etc is done in internal, private methods which I had hoped would be abstract enough to be accessible to most Ruby programmers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the whole thing is parsed anyway when the table is re-encoded so no time/memory is saved. I don't think memory overhead is significant anyway. But code complexity is significantly higher compared to upfront parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's a good point.
lib/ttfunk/table/cff/index.rb
Outdated
end | ||
|
||
def unpack_offset(offset_data) | ||
case offset_data.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole case
statement can leverage padding and be much shorter:
padding = "\x00" * (4 - @offset_size)
(padding + offset_data).unpack("N")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice :)
lib/ttfunk/table/cff/dict.rb
Outdated
(int >> 16) & 0xFF, | ||
(int >> 8) & 0xFF, | ||
int & 0xFF | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[29, int].pack("CV").bytes
Looks simpler also, I guess, a bit faster.
Also maybe all encode_
methods should return string instead of byte arrays to facilitate faster encoding methods than manually unpacked integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is great. Didn't know about V
.
Also maybe all
encode_
methods should return string instead of byte arrays to facilitate faster encoding methods than manually unpacked integers.
This is already the case for everything except encode_integer
no?
lib/ttfunk/table/cff/dict.rb
Outdated
end | ||
|
||
def decode_two_byte_operator | ||
1200 + read(1, 'C').first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an arbitrary magic number. It doesn't really represent any special or useful value. Maybe it'd be easier to keep operators arrays of numbers? Like, [12, 3]
. I think i's as good as 1203
and makes some code go away. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I like that. I'll see what I can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but this does make Dict
access a little more challenging. Now instead of doing dict[1203]
you have to do dict[[12, 3]]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll pull it out into a constant for now.
lib/ttfunk/table/cff/dict.rb
Outdated
@dict[operator] = operands | ||
operands = [] | ||
when 0..21 | ||
@dict[b_zero] = operands |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably may prove prudent to validate data type of operands for a specific operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you like to see that added before merge? If so I can work something up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please. It seems like a logical place to add it if at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I realized the problem is that dicts themselves don't really have the concept of valid or invalid data - they're just data structures. Individual types of dict (like the top dict, private dict, etc) do contain specific operators and operands, but I think it makes sense to add validation when those particular classes are introduced. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. Operators are present in the parsing logic. Even though operators are not universal (used only in specific dicts) they have specific types of operands. In my mind it makes sense to validate operands here. Maybe add additional validations to specific dicts that restrict operators. But that doesn't seem to be connected with operand validations.
lib/ttfunk/table/cff/dict.rb
Outdated
end | ||
|
||
def encode_significand(sig) | ||
sig.to_s.each_char.with_object([]) do |char, ret| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a lot like map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is except for the else
case.
This PR is rather heavy on complex logic but specs are sparse. Maybe you could add a bit more coverage? |
2750933
to
484339d
Compare
Hey @pointlessone, I think I've addressed all your concerns specifically I have:
|
lib/ttfunk/table/cff/dict.rb
Outdated
alias each_pair each | ||
|
||
def encode | ||
''.tap do |result| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just heads up: this pattern will have to be replaced. You may have seen that Prawan master is updated to use Ruby 2.3. That entails frozen strings as we're trying to stay close to community style guide (and coincidentally Rubocop's default configuration). I haven't updated TTFunk yet to reflect the change but I will before the release. You can keep using string mutation trough out these PRs. I will handle the update after the merge. Or you can help me a bit and opt for immutable strings in your PRs.
I'd suggest .map{}.join('')
as a replacement pattern here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to a couple blog articles, we can also just do ''.dup.tap do ...
also. What do you think about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, technically you can use +''
to get a mutable string, which is more concise and intention revealing. But why use mutation if there's an equally good alternative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just thinking <<
is more efficient than allocating a temporary array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that they should be comparable in performance since Array#join
internally is very similar to s = ''; array.each {|x| s << x }
. .map.join
just looks a bit more natural to my eye. But I got curious and decided to check.
require 'benchmark/ips'
$a = (0..1_000).map(&:to_s)
Benchmark.ips do |x|
x.report "String#<<" do
s = +''
$a.each do |n|
s << n
end
end
x.report "Array#join" do
$a.join('')
end
end
String#<< 13.332k (±16.5%) i/s - 64.183k in 5.062950s
Array#join 17.965k (±17.9%) i/s - 85.050k in 5.069048s
Apparently, .join
is about 30% faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but the construction of the array should be part of the benchmark. The individual <<
operations vs a single join
(which I'm sure performs a loop in C land) might be fast, but constructing the array is an operation that we are introducing to the algorithm, so it should be measured:
require 'benchmark/ips'
Benchmark.ips do |x|
x.report "String#<<" do
s = +''
1_000.times do |i|
s << i.to_s
end
end
x.report "Array#join" do
a = (0..1_000).map(&:to_s)
a.join('')
end
x.compare!
end
As it turns out, the two techniques result in about the same i/s:
Warming up --------------------------------------
String#<< 545.000 i/100ms
Array#join 490.000 i/100ms
Calculating -------------------------------------
String#<< 5.655k (± 4.3%) i/s - 28.340k in 5.020904s
Array#join 5.197k (± 4.3%) i/s - 25.970k in 5.006597s
Comparison:
String#<<: 5655.1 i/s
Array#join: 5196.7 i/s - same-ish: difference falls within error
lib/ttfunk/sub_table.rb
Outdated
|
||
attr_reader :file, :table_offset | ||
|
||
# set by parse! in derived classes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is a bit misleading. It's only true for Index
, others have it either set in constructor (Dict
) or hardcodec (Header
).
This pull request is part of a larger effort to bring OTF support to TTFunk. See #53 for details.
Highlights of this PR:
Index
andDict
. An index is essentially an array that can be accessed by index, and a dict is a dictionary of operator/operand (i.e. key/value) pairs. Operators have special meaning depending on the context. For example the CFF top and private dicts both use the dict structure but define different operators for the data they contain.Dict
class technically isn't used anywhere yet. Including it when it is actually used would make the corresponding PR too large in my opinion, so it is introduced here instead.Sci
class (couldn't come up with a better name, suggestions welcome) is supposed to represent a number in scientific notation (i.e. 5.2 x 104). TheDict
class uses it for a special type of operand that starts with a byte30
. I could have stored it as a float, but it would then have been a lot more work to re-encode later.TTFunk::SubTable
class has been introduced to handle the case when table-like parsing behavior is desired but the table is not a top-level, directory-based font table.