Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewriting add_font() and _putfonts() using Fonttools library #477

Merged
merged 54 commits into from
Sep 7, 2022

Conversation

RedShy
Copy link

@RedShy RedShy commented Jul 28, 2022

I want to share what I'm doing, so you can suggest change in direction, see the progress, jump in and so on

If understood correctly, for now the aim is to rewrite .add_font() and ._putfonts() to use Fonttools and drop the home made ttfonts.py.

  • Now .add_font() get all the data from Fonttools except for the ttf.fullName. Still a cleanup and refactor is needed.

  • I'm working on .makeSubset() method and replacing it piece by piece with Fonttools calls. The tables head, hhea, maxp and cmap are currently extracted using Fonttools and currently I'm working on hmtx table

  • The GitHub pipeline is OK (green) meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.

  • A unit test is covering the code added / modified by this PR

  • This PR is ready to be merged

  • In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder

  • A mention of the change is present in CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

@codecov
Copy link

codecov bot commented Aug 21, 2022

Codecov Report

Merging #477 (289366a) into master (25b9ffb) will increase coverage by 1.54%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##           master     #477      +/-   ##
==========================================
+ Coverage   92.34%   93.89%   +1.54%     
==========================================
  Files          23       22       -1     
  Lines        6860     6093     -767     
  Branches     1405     1249     -156     
==========================================
- Hits         6335     5721     -614     
+ Misses        299      195     -104     
+ Partials      226      177      -49     
Impacted Files Coverage Δ
fpdf/util.py 86.00% <ø> (-1.04%) ⬇️
fpdf/fpdf.py 92.54% <92.95%> (+2.02%) ⬆️
fpdf/enums.py 100.00% <100.00%> (ø)
fpdf/line_break.py 99.01% <100.00%> (-0.04%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@RedShy
Copy link
Author

RedShy commented Aug 21, 2022

Hi! Sorry for the delay, but I think I finally got something.
I have seen that there is a fonttools' module that does the subsetting, so I used that instead of .makeSubset(), in this way we could drop the ttfonts.py file entirely!

For doing the subsetting with fonttools I do the following steps:

  1. I take all the glyphs used in the PDF. For doing this I
    • Take all the unicodes of characters in the subset dict, these are the characters used in the PDF.
    • I use the cmap table to get the glyph associated with every unicode
  2. I use the glyphs to make the subsetting using fonttools
  3. I produce the codeToGlyph dict, that from what I understood, it's a map between the new codes for the characters and the glyphs: in the PDF are not actually inserted the unicodes of characters but new codes generated by fpdf
  4. Then I transform the subsetted font in a sequence of bytes and reuse the existing code

In doing all of this, because of the difference of the generated subset font, several tests break. I visually inspected every PDF produced with the new code and they seem identical to me.
Even in the charmap_first_999_chars-cmss12.pdf created using the old code, there was a mistake because the letter 072 was not displayed, instead using fonttools it correctly displayed the letter r.

Another thing to mention is that test_add_font_otf() does not raise the Postscript outlines are not supported error anymore because it was raised by ttfonts.py, for now I simply removed the test, assuming that fonttools support those types of fonts.

A recap of the current situation:

  • I removed all the usage of ttfonts.py from fpdf.py. We can drop it and use only fonttools going forward
  • In the .add_font() I put some code from ttfonts.py for getting the exact same data as before: a reorganization of this part of the code I think is needed

@gmischler
Copy link
Collaborator

Cool to see this coming to fruition!

Another thing to mention is that test_add_font_otf() does not raise the Postscript outlines are not supported error anymore because it was raised by ttfonts.py, for now I simply removed the test, assuming that fonttools support those types of fonts.

Better to find a free example font to add to the tests then.
The various font file formats and variations thereof are rather confusing, but if bot the PDF format as well as fonttools can handle OpenType fonts, then we shouldn't have a problem with them either.

  • I removed all the usage of ttfonts.py from fpdf.py. We can drop it and use only fonttools going forward

That was the point after all! 👍

  • In the .add_font() I put some code from ttfonts.py for getting the exact same data as before: a reorganization of this part of the code I think is needed

This is apparently not about the changed glyph mapping, so what other data would be subject to change here?

Btw.: Codecov complains about reduced coverage in "util.py". I think this is because of the function substr() there which was only used from "ttfonts.py", and thus doesn't get tested anymore. You should probably remove that as well.

@Lucas-C
Copy link
Member

Lucas-C commented Aug 22, 2022

Wow, great job @RedShy in working towards using fonttools!
I'm goind to make a first review of your code.

Even in the charmap_first_999_chars-cmss12.pdf created using the old code, there was a mistake because the letter 072 was not displayed, instead using fonttools it correctly displayed the letter r.

That is nice!

Better to find a free example font to add to the tests then.

I agree with @gmischler that, if it's not too much to ask you @RedShy, a first basic unit test of using an .otf font would be nice.

Copy link
Member

@Lucas-C Lucas-C left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really promising, good job!
A few general comments:

  • Thank you for commenting the steps you implemented, that will make code maintaince a lot easier!
  • You should probably add fonttools to setup.py
  • Some reference PDF files sizes increased by 1/2 KB, I wonder why... That's not a blocker preventing this change, but would you know what could cause the is increase in file size? (while other reference PDF files have decreased in size)

fpdf/fpdf.py Outdated Show resolved Hide resolved
fpdf/fpdf.py Outdated Show resolved Hide resolved
fpdf/fpdf.py Show resolved Hide resolved
@RedShy
Copy link
Author

RedShy commented Aug 23, 2022

Cool to see this coming to fruition!

Wow, great job @RedShy in working towards using fonttools!

Yes it's very nice and cool! And without the help from both of you I would be still stuck in trying to understand what's going on there! 😁

Better to find a free example font to add to the tests then.
I agree with @gmischler that, if it's not too much to ask you @RedShy, a first basic unit test of using an .otf font would be nice.

Sure! I just added a test with the old .otf font and I added also the italic and bold version of it


This is apparently not about the changed glyph mapping, so what other data would be subject to change here?

Sorry @gmischler I don't get what do you mean here

Btw.: Codecov complains about reduced coverage in "util.py". I think this is because of the function substr() there which was only used from "ttfonts.py", and thus doesn't get tested anymore. You should probably remove that as well.

I observed that .substr() is used in the if "type" in info and info["type"] != "TTF": branch inside _putfonts(): the entire branch is not covered by tests and I have seen that when a new font is added in .add_font() the type is hard coded to be "TTF", so I guess that branch is dead code and could be removed?

Given that we are here, also the branch elif my_type in ("Type1", "TrueType"): is not covered by tests, can we remove it too?
Maybe both changes could be moved in another PR.

You should probably add fonttools to setup.py

I'm new to this, I have to just add fonttools to install_requirements=[...] right?

Some reference PDF files sizes increased by 1/2 KB, I wonder why...

Yes, I'm curious about it too, I could try to inspect the qpdf files and see differences with the previous ones. I think it could be as @khaledhosny pointed out, that the fonttools subsetter doesn't drop all the tables previous dropped

edit: I have updated the code keeping only the tables that the old code kept for the tests. Now every PDF file size is lesser than before, I think this is due to the optimization that fonttools does:

The tool also performs some size-reducing optimizations, aimed for using subset fonts as webfonts.

@gmischler
Copy link
Collaborator

Just for information: I just found in the 1.7 specs:

Beginning with PDF 1.6, font programms may be embedded using the OpenType format, which is an extension of the TrueType format....

Which essentially means that when we implement PDF/A-1 (based on PDF 1.4), we'll have to disallow OTF fonts.

@gmischler
Copy link
Collaborator

gmischler commented Sep 6, 2022

I observed that .substr() is used in the if "type" in info and info["type"] != "TTF": branch inside _putfonts(): the entire branch is not covered by tests and I have seen that when a new font is added in .add_font() the type is hard coded to be "TTF", so I guess that branch is dead code and could be removed?

So you can show conclusively that all imported fonts end up with info["type"] = "TTF"? (And that was already the case before your changes? 😉) Do the newly supported ".otf" fonts get assigned the same type as well? That would then mean that we only have two types of fonts to deal with in the rest of the code: core and TTF. Once we have proven this, that might simplify things quite a bit. We can then indeed remove all the code that deals with font types other than "core" and "TTF".

That brings me to a bit of a nitpick: I've always wondered about the naming of the FPDF.unifontsubset property (and my own derived Fragment.unicode_font). A quick glance at the TTF specs didn't turn up anything conclusive. Are TTF fonts Unicode by definition? Or can they also be encoded with one of the historical charsets (and we could handle that)? Found it: "OpenType, like TrueType, is based on Unicode".
Either way, since the distinction is really about whether we need to embed the font in the output, I think a better name for both those properties would be .is_ttf_font.

Given that we are here, also the branch elif my_type in ("Type1", "TrueType"): is not covered by tests, can we remove it too? Maybe both changes could be moved in another PR.

Generally speaking, the absence of tests would rather be a reason to add tests than to remove the code...
Unless, of course, it is provably dead code as explained above. 💀
I see no need to do a seperate PR for cleaning up the code you're already working on anyway.

edit: TTF is always Unicode

@RedShy
Copy link
Author

RedShy commented Sep 6, 2022

Hi @gmischler thank you for the feedback 😁

So you can show conclusively that all imported fonts end up with info["type"] = "TTF"? (And that was already the case before your changes? 😉)

Looking at the code at lines L1831, L1881 and L1893 in add_font(), every font that is not core, get the type "TTF" in the info dictionary. I'm not sure if I'm missing something. I don't know if there are other ways to add fonts other than using the method add_font()

Do the newly supported ".otf" fonts get assigned the same type as well?

Yes, because it goes through add_font() and the TTF type is just hardcoded in the info dictionary

We can then indeed remove all the code that deals with font types other than "core" and "TTF".

I'm not sure of the exact differences between OpenType and TrueType, currently they are treated the same as TTF type and the PDF generated seems good. We could differentiate explicitly in the code adding the OTF type.

I've always wondered about the naming of the FPDF.unifontsubset property (and my own derived Fragment.unicode_font). Found it: "OpenType, like TrueType, is based on Unicode".
Either way, since the distinction is really about whether we need to embed the font in the output, I think a better name for both those properties would be .is_ttf_font

Looking at the code I see return self.current_font.get("type") == "TTF", under def unifontsubset(self): so I agree that a better name would be .is_ttf_font 👍
If we add the OTF type, I think we should consider if and how to change the check == "TTF"

Generally speaking, the absence of tests would rather be a reason to add tests than to remove the code...

Yes I absolutely agree, I tried to add some tests back in #439 and we concluded that was not worth it to improve the testing and clarity of that part of the code.

@Lucas-C
Copy link
Member

Lucas-C commented Sep 6, 2022

(just a side note: thanks a lot to both of you for working on improving this part of the code! I don't have as much insight on the subject as you, but I fully trust the both of you on this. I'll be happy to merge your PR @RedShy whenever @gmischler gives his approval)

@gmischler
Copy link
Collaborator

Looking at the code at lines L1831, L1881 and L1893 in add_font(), every font that is not core, get the type "TTF" in the info dictionary.

That looks like conclusive proof: All imported fonts are marked as TTF.

We can then indeed remove all the code that deals with font types other than "core" and "TTF".

Let's get rid of the cruft!

I'm not sure of the exact differences between OpenType and TrueType, currently they are treated the same as TTF type and the PDF generated seems good. We could differentiate explicitly in the code adding the OTF type.

As far as I understand, they share largely the same structure, but some specific types of data may only be present in one or the other. I don't think we need to worry about the distinction right now, though it may possibly become relevant once we try to substitute ligatures etc. (cool follow-up project, btw., in case you're interested 😉).

Looking at the code I see return self.current_font.get("type") == "TTF", under def unifontsubset(self): so I agree that a better name would be .is_ttf_font +1 If we add the OTF type, I think we should consider if and how to change the check == "TTF"

I don't think we need to mark OTF fonts differently at this point. Feel free to rename those properties, though.

@RedShy
Copy link
Author

RedShy commented Sep 7, 2022

Great! I deleted the code and renamed those properties

(cool follow-up project, btw., in case you're interested 😉).

Yes why not? 😁 It's a pleasure to work with both of you and I'm starting to get used to the codebase, the various concepts inherent to fonts, glyphs etc...

@gmischler
Copy link
Collaborator

(cool follow-up project, btw., in case you're interested 😉).

Yes why not? 😁 It's a pleasure to work with both of you and I'm starting to get used to the codebase, the various concepts inherent to fonts, glyphs etc...

I was hoping for that. You're now the contributor here most familiar with the font handling code and especially the fonttools. I had only looked at these things very superficially myself, to get a general idea of the concepts.
Maybe there's even a module in fonttools that would do the substitutions for us? You already found one for the subsetting, that I hadn't previously noticed...

More urgently: In the context of #511 you mentioned that you had moved the width == 65535 check on the character width to an earlier point in the processing. So I guess now would be the time to adapt Fragment.get_width() to that change. Ideally you'd be able to get rid of the .char_width() sub-function there.

@RedShy
Copy link
Author

RedShy commented Sep 7, 2022

I had only looked at these things very superficially myself, to get a general idea of the concepts.

Actually your help was fundamental to understanding what the subsetting was about and how the management of the fonts worked

Maybe there's even a module in fonttools that would do the substitutions for us?

I'm not sure what do you mean by substitutions here

In the context of #511 #511 (review) that you had moved the width == 65535 check on the character width to an earlier point in the processing. So I guess now would be the time to adapt Fragment.get_width() to that change. Ideally you'd be able to get rid of the .char_width() sub-function there.

Yes I deleted the .char_width() inside Fragment.get_width() because now is just a call to the defaultdict self.font["cw"]

@gmischler
Copy link
Collaborator

I'm not sure what do you mean by substitutions here

Supporting ligatures means that a font may offer you the opportunity to writee eg. the two characters "fs", but have the single combined glyph "grafik" being displayed (note how the upper end of the "f" connects with the dot of the "i", which they otherwise wouldn't. See Google fonts: Ligature for examples related to HTML rendering.

This is just one of the simplest cases though, and in western languages it's usually just an esthetical nicety. In other writing systems, most prominently of the indic family, it is an actual necessity for being able to write correctly. See our recent issues #365, #459, and #474. In those cases, larger numbers of characters (up to 7 I think) may get combined into one or several glyphs, in an m * n relationship. There are several possible table types in TTF/OTF fonts that can be used to store and retreive such substitutions, such as "gsub", "liga", "dlig", etc. I hope that fonttools offer some support in searching for those, so that we don't have to figure out all the possible combinations ourselfes...

@gmischler
Copy link
Collaborator

@Lucas-C , I don't see any obstacles here anymore, so I'll vote for merging.

@Lucas-C
Copy link
Member

Lucas-C commented Sep 7, 2022

Thank you both for you work on this PR!
Merging now

@Lucas-C Lucas-C merged commit eb54535 into py-pdf:master Sep 7, 2022
@Lucas-C
Copy link
Member

Lucas-C commented Sep 7, 2022

As a side-effect, thanks to your work, code coverage has improved by 1.5 points!
2022-09-07 22_09_50-Window

@Lucas-C
Copy link
Member

Lucas-C commented Sep 8, 2022

This has been released in v2.5.7

@RedShy
Copy link
Author

RedShy commented Sep 15, 2022

I'm not sure what do you mean by substitutions here

Supporting ligatures means that a font may offer you the opportunity to writee eg. the two characters "fs", but have the single combined glyph "grafik" being displayed (note how the upper end of the "f" connects with the dot of the "i", which they otherwise wouldn't. See Google fonts: Ligature for examples related to HTML rendering.

This is just one of the simplest cases though, and in western languages it's usually just an esthetical nicety. In other writing systems, most prominently of the indic family, it is an actual necessity for being able to write correctly. See our recent issues #365, #459, and #474. In those cases, larger numbers of characters (up to 7 I think) may get combined into one or several glyphs, in an m * n relationship. There are several possible table types in TTF/OTF fonts that can be used to store and retreive such substitutions, such as "gsub", "liga", "dlig", etc. I hope that fonttools offer some support in searching for those, so that we don't have to figure out all the possible combinations ourselfes...

This is interesting and sparks me to challenge myself with this project, also it would have a direct impact and be helpful for people out there.
Unfortunately for now, for various reasons in my personal life, I don't think I have much time to dedicate, but in 1-2 weeks I think yes, if no one got up at that time, I would be glad to help 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants