Support dynamic generation of the OS2 table #55
Conversation
I will take a closer look later, but I already have a few questions regarding the following:

    def code_pages_for(subset)
      field = BitField.new(0)
      return field if subset.unicode?
According to the docs, bit 0 is CP1252. Granted, there's no value for Unicode. What would happen if the subset actually includes glyphs outside of 1252?
Hmm, good question. After some additional research I was able to find several of the Noto fonts that have a code_page_range of 0 but don't include the majority of the 1252 code page (i.e. a-z, A-Z). So I guess there's precedent for it? Since TTFunk's subsets are very encoding-based, I don't know how we would do this any differently.
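To make the question concrete, here is a minimal sketch (a hypothetical helper, not part of TTFunk) that uses Ruby's built-in encodings to check whether a set of Unicode code points survives a round trip through Windows-1252:

```ruby
# Hypothetical helper: true if every Unicode code point in the subset
# has a representation in Windows-1252. A code point like U+4E2D (a CJK
# character) fails, which is the "glyphs outside of 1252" case above.
def fits_cp1252?(codepoints)
  codepoints.all? do |cp|
    cp.chr(Encoding::UTF_8).encode(Encoding::Windows_1252)
    true
  rescue Encoding::UndefinedConversionError
    false
  end
end
```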
The spec suggests only setting an encoding bit if the code table is functional. However…
The determination of “functional” is left up to the font designer
Also this bit:
If the font file is encoding ID 0, then the Symbol Character Set bit should be set.
Is this handled?
I believe an encoding ID of 0 only signifies a symbol character set on Microsoft platforms, i.e. platform ID 3. Currently TTFunk doesn't support that combination in the cmap logic, so we should be ok.
lib/ttfunk/subset/mac_roman.rb
Outdated
      end

    protected

      def new_cmap_table(_options = nil)
I guess the optional argument can be dropped, as it seems to never be used in any of the Subset classes.
Yes, good idea.
lib/ttfunk/subset/windows_1252.rb
Outdated
    def to_unicode_map
      Encoding::Windows1252::TO_UNICODE
      CodePages[code_page].unicode_mapping
After all the changes this class differs from MacRoman only by the code_page method. Moreover, this one still encodes into a mac_roman cmap table. Do we need it?
Yeah, I noticed that as well. In order to not break backwards-compatibility, I'll keep these two classes but refactor into a base class that can accept any encoding/code page.
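A rough sketch of that refactor idea, with illustrative names rather than the final API: a base class parameterized by encoding and code page, with the two existing classes reduced to subclasses that only pin values (code page 10000 is the Windows identifier for Mac Roman).

```ruby
# Illustrative sketch only: shared behavior lives in the base class,
# subclasses merely declare which encoding/code page they represent.
class CodePageSubset
  attr_reader :encoding, :code_page

  def initialize(encoding:, code_page:)
    @encoding = encoding
    @code_page = code_page
  end
end

class Windows1252Subset < CodePageSubset
  def initialize
    super(encoding: Encoding::Windows_1252, code_page: 1252)
  end
end

class MacRomanSubset < CodePageSubset
  def initialize
    super(encoding: Encoding::MACROMAN, code_page: 10_000)
  end
end
```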
@pointlessone in response to your points:
Let's try getting by with Ruby's built-in encodings.
Ok @pointlessone, I removed references to the CodePages gem and used Ruby's built-in encodings.
Force-pushed from e595be1 to 9f7ce8a.
lib/ttfunk/table/os2.rb
Outdated
    new_char_range = unicode_blocks_for(os2, os2.char_range, subset)
    result << BinUtils
      .unpack_int(new_char_range.value, 32)
      .pack('N*').ljust(16, "\0")
I think it would be easier to supply unpack_int with a desired target size instead of manually padding the result. It would also make the method itself a bit simpler.
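Something along these lines (a sketch, not the final signature): pass the number of slices you want, and padding falls out of the loop naturally.

```ruby
# Sketch: unpack `value` into `slices` big-endian chunks of `bit_width`
# bits each, most significant chunk first. Because the caller states the
# target length, no manual ljust-padding (and no log2 size calculation)
# is needed at the call site.
def slice_int(value, bit_width, slices)
  mask = (1 << bit_width) - 1
  (slices - 1).downto(0).map { |shift| (value >> (shift * bit_width)) & mask }
end
```

The OS/2 call site could then ask for exactly 16 bytes instead of padding the result after the fact.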
lib/ttfunk/bin_utils.rb
Outdated
    end

    # assumes big-endian
    def unpack_int(value, bit_width)
This is probably very subjective, but I find it hard to understand what these methods do from their names. I think slice_int and stitch_int are more intention-revealing names. What do you think?
Yeah, I agree the names could be better (I was trying to follow the Ruby pack/unpack terminology). What about int_to_bytes and bytes_to_int?
Historically bytes are quite consistently 8 bits wide. For wider values "word" is the most common term. But otherwise those names work for me.
Ah, that makes sense. Ok, let's go with slice_int and stitch_int.
lib/ttfunk/bin_utils.rb
Outdated
    # assumes big-endian
    def unpack_int(value, bit_width)
      return [0] if value == 0
      return [1] if value == 1
Why is this a special case?
The method needs to calculate how many bytes should be returned, so it takes the log base 2 of the value. Math.log2(1) == 0.0, which would mean 0 bytes in the return value.
As I mentioned elsewhere, if this method took the size of the unpacked value, the calculation would be unnecessary.
Acknowledged.
lib/ttfunk/subset/code_page.rb
Outdated
        .codepoints
        .first
    rescue Encoding::UndefinedConversionError
      # Ruby doesn't appear to think there is a strict 1:1 mapping
This comment is not quite accurate. Ruby's opinion has nothing to do with convertibility; there are actually illegal conversions. For example, 0x81 is undefined in CP1252, so there's no way to convert it into Unicode. So maybe just say that there's no 1:1 mapping between all encodings and Unicode.
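For the record, this is easy to demonstrate in plain Ruby:

```ruby
# 0x81 has no character assigned in Windows-1252, so transcoding it to
# UTF-8 raises instead of producing a replacement character.
begin
  "\x81".force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
  raised = false
rescue Encoding::UndefinedConversionError
  raised = true
end
```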
Ok, that's something I hadn't realized, thanks for the explanation.
    @@ -33,13 +33,14 @@ def from_unicode(character)
        character
      end

    protected
Do we need to expose the next method as well?
lib/ttfunk/table/os2.rb
Outdated
    # only flip the bit on if the subset includes all the characters
    # that were present for this block in the original font
    if code_points.all? { |cp| subset_code_points.include?(cp) }
I'm not sure this is a sound approach. Please correct me if I'm mistaken.
Consider a font that has a full Latin alphabet and, let's say, sets the 1250 bit. Say the font is used for only one title in the whole document that reads "Report". The subset would only contain those 6 letters out of the at least 48 provided by the font. For the purposes of the document those 6 should be considered "usable" and warrant a bit in the table. But with the current logic the bit won't be set, as the subset doesn't fully cover the original set.
It looks like the bit would rarely, if ever, be set, since it's really hard to come up with a real-world document that fully covers an entire code page of any decent font. Consider just covering the English alphabet in both lower and upper case, plus punctuation characters and braces. Most fonts include even more characters that are hard to come by in most documents.
Yes, I see what you're saying. Before submitting the PR I tried a different approach where bits were flipped on if any of the corresponding characters were present in the subset. I abandoned it because it was fairly slow and didn't seem correct (after all, the font's definition of a usable unicode range won't have been satisfied). I don't believe the font will be broken if no unicode ranges are indicated, so only flipping bits on if all characters are present seemed like the best approach. If you feel strongly about going back to the original technique, I'd be happy to revert.
it was fairly slow
For some reason I doubt that any? is slower than all?. They're the same in the worst-case scenario.
the font's definition of a usable unicode range won't have been satisfied
It may not be for a regular font, since the font author has no idea how the font will be used. In the general case the flag should be set when the font has decent, if not full, coverage of a code page.
Subset fonts are used in very specific scenarios. In the case of Prawn (and PDF in general) the fonts are only usable with specific text. There's no risk of encountering a situation where the subset font won't cover what the original could have. So it still satisfies the "usability" criteria to the same extent the original font would have.
I think the appropriate approach would be to unset bits for ranges that are completely removed from the font.
I for some reason doubt that any? is slower than all?. They're the same in the worst case scenario.
No, that's not what I meant. My original algorithm used Array#bsearch for every character.
In general case the flag should be set when a code page has a decent if not full coverage of a code page.
Ok, that makes sense to me.
Looks like this is as simple as using any? instead of all?. Cool 😎
lib/ttfunk/subset/unicode_8bit.rb
Outdated
      # using is irrelevant. We choose MacRoman because it's a 256-character
      # encoding that happens to be well-supported in both TTF and
      # PDF formats.
      TTFunk::Table::Cmap.encode(mapping, :mac_roman)
    end

    def original_glyph_ids
This should not be public.
lib/ttfunk/subset/code_page.rb
Outdated
      end
    end

    def original_glyph_ids
This should not be public.
All right, it looks good. Sorry for torturing you. 😅 Make sure the history is good and it's good to merge.
* Remove *Encoding classes in favor of Ruby's built-in encoding capabilities.
* Create a CodePage base class and add Windows1252 and MacRoman as subclasses. Move shared functionality into CodePage.
@pointlessone not at all! I really appreciate how thorough you have been in reviewing these PRs :) The commit history has been cleaned up, ready to merge when you are.
This pull request is part of a larger effort to bring OTF support to TTFunk. See #53 for details.
Highlights of this PR:
* Removes the TTFunk::Encoding classes in favor of the code-pages gem, which I think is much simpler. It also allows TTFunk to support any of the various code pages instead of just Mac Roman and Windows 1252.