Rewriting add_font() and _putfonts() using Fonttools library #477

RedShy · 2022-07-28T17:25:06Z

I want to share what I'm doing so that you can suggest changes in direction, follow the progress, jump in, and so on.

If I understood correctly, the aim for now is to rewrite .add_font() and ._putfonts() to use Fonttools and drop the homemade ttfonts.py.

.add_font() now gets all of its data from Fonttools except for ttf.fullName. A cleanup and refactor is still needed.
I'm working on the .makeSubset() method, replacing it piece by piece with Fonttools calls. The head, hhea, maxp and cmap tables are already extracted using Fonttools, and I'm currently working on the hmtx table.
The GitHub pipeline is OK (green) meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.
A unit test is covering the code added / modified by this PR
This PR is ready to be merged
In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder
A mention of the change is present in CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

codecov · 2022-08-21T10:44:36Z

Codecov Report

Merging #477 (289366a) into master (25b9ffb) will increase coverage by 1.54%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##           master     #477      +/-   ##
==========================================
+ Coverage   92.34%   93.89%   +1.54%     
==========================================
  Files          23       22       -1     
  Lines        6860     6093     -767     
  Branches     1405     1249     -156     
==========================================
- Hits         6335     5721     -614     
+ Misses        299      195     -104     
+ Partials      226      177      -49

Impacted Files	Coverage Δ
fpdf/util.py	`86.00% <ø> (-1.04%)`	⬇️
fpdf/fpdf.py	`92.54% <92.95%> (+2.02%)`	⬆️
fpdf/enums.py	`100.00% <100.00%> (ø)`
fpdf/line_break.py	`99.01% <100.00%> (-0.04%)`	⬇️

RedShy · 2022-08-21T16:39:44Z

Hi! Sorry for the delay, but I think I finally got something.
I saw that fonttools has a module that does the subsetting, so I used that instead of .makeSubset(). This way we can drop the ttfonts.py file entirely!

To do the subsetting with fonttools, I follow these steps:

1. Collect all the glyphs used in the PDF:
   - Take the Unicode code points of all characters in the subset dict; these are the characters used in the PDF.
   - Use the cmap table to look up the glyph associated with each code point.
2. Pass those glyphs to the fonttools subsetter.
3. Build the codeToGlyph dict which, as far as I understand, maps the new character codes to glyphs: the PDF does not contain the Unicode code points of the characters but new codes generated by fpdf.
4. Serialize the subsetted font to a sequence of bytes and reuse the existing code.
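
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names are hypothetical, the subset dict is assumed to map code points to fpdf-generated codes, and the cmap is assumed to be a plain dict of the shape returned by fontTools' TTFont.getBestCmap(). fontTools is imported lazily so that the two pure helpers run without it.

```python
from io import BytesIO


def glyphs_for_codepoints(cmap, used_codepoints):
    """Return the glyph names that `cmap` assigns to the used code points.

    `cmap` is a {code point: glyph name} dict; code points the font
    does not cover are skipped.
    """
    return [cmap[cp] for cp in sorted(used_codepoints) if cp in cmap]


def build_code_to_glyph(subset, cmap, glyph_order):
    """Map each generated code to the glyph ID in the subsetted font.

    `subset` is assumed to be {code point: generated code}; `glyph_order`
    is the list of glyph names of the subsetted font, in glyph ID order.
    """
    glyph_ids = {name: gid for gid, name in enumerate(glyph_order)}
    return {
        code: glyph_ids[cmap[cp]]
        for cp, code in subset.items()
        if cp in cmap and cmap[cp] in glyph_ids
    }


def subset_font_bytes(font_path, glyph_names):
    """Subset the font file to `glyph_names` and return the result as bytes."""
    # Lazy import: only this function needs fontTools.
    from fontTools import subset as ft_subset
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    subsetter = ft_subset.Subsetter(ft_subset.Options())
    subsetter.populate(glyphs=glyph_names)
    subsetter.subset(font)
    buffer = BytesIO()
    font.save(buffer)
    return buffer.getvalue()
```

With a real font, the cmap could come from TTFont(path).getBestCmap() and the glyph order from getGlyphOrder() on the subsetted font; how fpdf actually assigns its codes is not shown here.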

Because the generated subset font differs, several tests break. I visually inspected every PDF produced with the new code, and they look identical to me.
The charmap_first_999_chars-cmss12.pdf created with the old code even contained a mistake: the letter 072 was not displayed, whereas with fonttools the letter r is displayed correctly.

Another thing to mention is that test_add_font_otf() no longer raises the "Postscript outlines are not supported" error, because that error was raised by ttfonts.py. For now I simply removed the test, assuming that fonttools supports those types of fonts.

A recap of the current situation:

- I removed all usage of ttfonts.py from fpdf.py. We can drop it and use only fonttools going forward.
- In .add_font() I put some code from ttfonts.py to get exactly the same data as before. I think this part of the code needs a reorganization.

gmischler · 2022-08-22T07:04:11Z

Cool to see this coming to fruition!

Another thing to mention is that test_add_font_otf() no longer raises the "Postscript outlines are not supported" error, because that error was raised by ttfonts.py. For now I simply removed the test, assuming that fonttools supports those types of fonts.

Better to find a free example font to add to the tests then.
The various font file formats and variations thereof are rather confusing, but if both the PDF format and fonttools can handle OpenType fonts, then we shouldn't have a problem with them either.

I removed all usage of ttfonts.py from fpdf.py. We can drop it and use only fonttools going forward.

That was the point after all! 👍

In .add_font() I put some code from ttfonts.py to get exactly the same data as before. I think this part of the code needs a reorganization.

This is apparently not about the changed glyph mapping, so what other data would be subject to change here?

Btw.: Codecov complains about reduced coverage in "util.py". I think this is because of the function substr() there which was only used from "ttfonts.py", and thus doesn't get tested anymore. You should probably remove that as well.

Lucas-C · 2022-08-22T10:04:03Z

Wow, great job @RedShy in working towards using fonttools!
I'm going to make a first review of your code.

The charmap_first_999_chars-cmss12.pdf created with the old code even contained a mistake: the letter 072 was not displayed, whereas with fonttools the letter r is displayed correctly.

That is nice!

Better to find a free example font to add to the tests then.

I agree with @gmischler that, if it's not too much to ask you @RedShy, a first basic unit test of using an .otf font would be nice.

Lucas-C

This looks really promising, good job!
A few general comments:

- Thank you for commenting the steps you implemented, that will make code maintenance a lot easier!
- You should probably add fonttools to setup.py
- Some reference PDF file sizes increased by 1/2 KB, I wonder why... That's not a blocker for this change, but would you know what could cause this increase in file size? (while other reference PDF files have decreased in size)

RedShy · 2022-08-23T10:49:34Z

Cool to see this coming to fruition!

Wow, great job @RedShy in working towards using fonttools!

Yes, it's very nice and cool! And without the help from both of you I would still be stuck trying to understand what's going on there! 😁

Better to find a free example font to add to the tests then.
I agree with @gmischler that, if it's not too much to ask you @RedShy, a first basic unit test of using an .otf font would be nice.

Sure! I just added a test with the old .otf font, and I also added its italic and bold versions.

This is apparently not about the changed glyph mapping, so what other data would be subject to change here?

Sorry @gmischler, I don't get what you mean here.

Btw.: Codecov complains about reduced coverage in "util.py". I think this is because of the function substr() there which was only used from "ttfonts.py", and thus doesn't get tested anymore. You should probably remove that as well.

I observed that .substr() is used in the if "type" in info and info["type"] != "TTF": branch inside _putfonts(). The entire branch is not covered by tests, and when a new font is added in .add_font() the type is hardcoded to "TTF", so I guess that branch is dead code and could be removed?

Since we are at it, the branch elif my_type in ("Type1", "TrueType"): is not covered by tests either. Can we remove it too?
Maybe both changes could be moved to another PR.

You should probably add fonttools to setup.py

I'm new to this. I just have to add fonttools to install_requires=[...], right?

Some reference PDF files sizes increased by 1/2 KB, I wonder why...

Yes, I'm curious about it too. I could try to inspect the qpdf files and look for differences from the previous ones. I think it could be, as @khaledhosny pointed out, that the fonttools subsetter doesn't drop all the tables that were previously dropped.

edit: I have updated the code to keep only the tables that the old code kept for the tests. Now every PDF file is smaller than before; I think this is due to the optimizations that fonttools performs:

The tool also performs some size-reducing optimizations, aimed for using subset fonts as webfonts.
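
Keeping only a chosen set of tables amounts to telling the subsetter to drop everything else. A minimal sketch of that computation follows; the keep-list is a hypothetical example built from table tags mentioned in this thread plus glyf, not the list actually used in the PR, and the function name is made up for illustration.

```python
# Hypothetical keep-list for illustration only; the tables the old code
# kept are not listed in this thread.
EXAMPLE_TABLES_TO_KEEP = {"head", "hhea", "maxp", "cmap", "hmtx", "loca", "glyf"}


def tables_to_drop(font_tables, tables_to_keep):
    """Return the sorted table tags that are not in the keep-list.

    "GlyphOrder" is skipped because fontTools reports it as a pseudo-table
    among a font's keys; it is not a real table in the file.
    """
    keep = set(tables_to_keep)
    return sorted(
        tag for tag in font_tables if tag != "GlyphOrder" and tag not in keep
    )
```

With fontTools, the result could be assigned to the drop_tables attribute of a fontTools.subset.Options instance before creating the Subsetter, passing the font's keys() as font_tables.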

gmischler · 2022-08-24T08:17:20Z

Just for information: I just found in the 1.7 specs:

Beginning with PDF 1.6, font programs may be embedded using the OpenType format, which is an extension of the TrueType format....

Which essentially means that when we implement PDF/A-1 (based on PDF 1.4), we'll have to disallow OTF fonts.

gmischler · 2022-09-06T06:52:04Z

I observed that .substr() is used in the if "type" in info and info["type"] != "TTF": branch inside _putfonts(). The entire branch is not covered by tests, and when a new font is added in .add_font() the type is hardcoded to "TTF", so I guess that branch is dead code and could be removed?

So you can show conclusively that all imported fonts end up with info["type"] = "TTF"? (And that was already the case before your changes? 😉) Do the newly supported ".otf" fonts get assigned the same type as well? That would then mean that we only have two types of fonts to deal with in the rest of the code: core and TTF. Once we have proven this, that might simplify things quite a bit. We can then indeed remove all the code that deals with font types other than "core" and "TTF".

That brings me to a bit of a nitpick: I've always wondered about the naming of the FPDF.unifontsubset property (and my own derived Fragment.unicode_font). ~~A quick glance at the TTF specs didn't turn up anything conclusive. Are TTF fonts Unicode by definition? Or can they also be encoded with one of the historical charsets (and we could handle that)?~~ Found it: "OpenType, like TrueType, is based on Unicode".
Either way, since the distinction is really about whether we need to embed the font in the output, I think a better name for both those properties would be .is_ttf_font.

Since we are at it, the branch elif my_type in ("Type1", "TrueType"): is not covered by tests either. Can we remove it too? Maybe both changes could be moved to another PR.

Generally speaking, the absence of tests would rather be a reason to add tests than to remove the code...
Unless, of course, it is provably dead code as explained above. 💀
I see no need to do a separate PR for cleaning up the code you're already working on anyway.

edit: TTF is always Unicode

RedShy · 2022-09-06T09:19:33Z

Hi @gmischler thank you for the feedback 😁

So you can show conclusively that all imported fonts end up with info["type"] = "TTF"? (And that was already the case before your changes? 😉)

Looking at the code at lines L1831, L1881 and L1893 in add_font(), every font that is not a core font gets the type "TTF" in the info dictionary. I'm not sure whether I'm missing something: I don't know if there are ways to add fonts other than the add_font() method.

Do the newly supported ".otf" fonts get assigned the same type as well?

Yes, because they go through add_font() and the TTF type is just hardcoded in the info dictionary.

We can then indeed remove all the code that deals with font types other than "core" and "TTF".

I'm not sure of the exact differences between OpenType and TrueType. Currently they are both treated as the TTF type, and the generated PDFs look good. We could differentiate explicitly in the code by adding an OTF type.

I've always wondered about the naming of the FPDF.unifontsubset property (and my own derived Fragment.unicode_font). Found it: "OpenType, like TrueType, is based on Unicode".
Either way, since the distinction is really about whether we need to embed the font in the output, I think a better name for both those properties would be .is_ttf_font

Looking at the code, I see return self.current_font.get("type") == "TTF" under def unifontsubset(self):, so I agree that a better name would be .is_ttf_font 👍
If we add the OTF type, I think we should consider whether and how to change the == "TTF" check.
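
To make the renaming concrete, here is a minimal sketch of the property. The FontState class is a hypothetical stand-in for the relevant part of FPDF, and EMBEDDED_FONT_TYPES is an assumed name showing one way the check could later be widened if a separate "OTF" type were introduced; only the comparison against "TTF" comes from the existing code.

```python
# Assumed name: the set of font types that need embedding in the output.
# Adding "OTF" here later would extend the check without renaming anything.
EMBEDDED_FONT_TYPES = ("TTF",)


class FontState:
    """Hypothetical stand-in holding the currently selected font's info."""

    def __init__(self, current_font=None):
        self.current_font = current_font or {}

    @property
    def is_ttf_font(self):
        # Equivalent to `== "TTF"` as long as the tuple holds only "TTF".
        return self.current_font.get("type") in EMBEDDED_FONT_TYPES
```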

Generally speaking, the absence of tests would rather be a reason to add tests than to remove the code...

Yes, I absolutely agree. I tried to add some tests back in #439, and we concluded that it was not worth it to improve the testing and clarity of that part of the code.

Lucas-C · 2022-09-06T09:28:33Z

(just a side note: thanks a lot to both of you for working on improving this part of the code! I don't have as much insight on the subject as you, but I fully trust the both of you on this. I'll be happy to merge your PR @RedShy whenever @gmischler gives his approval)

gmischler · 2022-09-06T11:38:38Z

Looking at the code at lines L1831, L1881 and L1893 in add_font(), every font that is not a core font gets the type "TTF" in the info dictionary.

That looks like conclusive proof: All imported fonts are marked as TTF.

We can then indeed remove all the code that deals with font types other than "core" and "TTF".

Let's get rid of the cruft!

I'm not sure of the exact differences between OpenType and TrueType. Currently they are both treated as the TTF type, and the generated PDFs look good. We could differentiate explicitly in the code by adding an OTF type.

As far as I understand, they share largely the same structure, but some specific types of data may only be present in one or the other. I don't think we need to worry about the distinction right now, though it may possibly become relevant once we try to substitute ligatures etc. (cool follow-up project, btw., in case you're interested 😉).

Looking at the code, I see return self.current_font.get("type") == "TTF" under def unifontsubset(self):, so I agree that a better name would be .is_ttf_font 👍 If we add the OTF type, I think we should consider whether and how to change the == "TTF" check.

I don't think we need to mark OTF fonts differently at this point. Feel free to rename those properties, though.

RedShy · 2022-09-07T07:10:28Z

Great! I deleted the code and renamed those properties

(cool follow-up project, btw., in case you're interested 😉).

Yes why not? 😁 It's a pleasure to work with both of you and I'm starting to get used to the codebase, the various concepts inherent to fonts, glyphs etc...

gmischler · 2022-09-07T08:55:27Z

(cool follow-up project, btw., in case you're interested 😉).

Yes why not? 😁 It's a pleasure to work with both of you and I'm starting to get used to the codebase, the various concepts inherent to fonts, glyphs etc...

I was hoping for that. You're now the contributor here most familiar with the font handling code and especially the fonttools. I had only looked at these things very superficially myself, to get a general idea of the concepts.
Maybe there's even a module in fonttools that would do the substitutions for us? You already found one for the subsetting, that I hadn't previously noticed...

More urgently: In the context of #511 you mentioned that you had moved the width == 65535 check on the character width to an earlier point in the processing. So I guess now would be the time to adapt Fragment.get_width() to that change. Ideally you'd be able to get rid of the .char_width() sub-function there.

RedShy · 2022-09-07T16:10:31Z

I had only looked at these things very superficially myself, to get a general idea of the concepts.

Actually your help was fundamental to understanding what the subsetting was about and how the management of the fonts worked

Maybe there's even a module in fonttools that would do the substitutions for us?

I'm not sure what you mean by substitutions here.

In the context of #511 you mentioned that you had moved the width == 65535 check on the character width to an earlier point in the processing. So I guess now would be the time to adapt Fragment.get_width() to that change. Ideally you'd be able to get rid of the .char_width() sub-function there.

Yes, I deleted the .char_width() inside Fragment.get_width() because it is now just a lookup in the defaultdict self.font["cw"].
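
A rough sketch of that idea: once the 65535 check is applied when the width table is built, measuring text becomes a plain sum of dictionary lookups. The function names, the default width, and the assumption that widths are stored in thousandths of the font size are illustrative and not taken from the PR.

```python
from collections import defaultdict

# Width value that the thread says is checked for; treated here as
# "no usable width recorded for this character".
MISSING_WIDTH_SENTINEL = 65535


def build_char_widths(raw_widths, default_width):
    """Build the width table once, filtering the sentinel up front."""
    widths = defaultdict(lambda: default_width)
    for code, width in raw_widths.items():
        if width != MISSING_WIDTH_SENTINEL:
            widths[code] = width
    return widths


def text_width(text, char_widths, font_size_pt):
    """Sum per-character widths, assumed to be in 1/1000 of the font size."""
    return sum(char_widths[ord(ch)] for ch in text) * font_size_pt / 1000
```

One design note: looking up a missing key in a defaultdict also inserts it, so the table grows as unknown characters are measured.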

gmischler · 2022-09-07T18:17:18Z

I'm not sure what you mean by substitutions here.

Supporting ligatures means that a font may offer you the opportunity to write e.g. the two characters "fi", but have the single combined glyph "ﬁ" displayed (note how the upper end of the "f" connects with the dot of the "i", which it otherwise wouldn't). See Google fonts: Ligature for examples related to HTML rendering.

This is just one of the simplest cases though, and in western languages it's usually just an aesthetic nicety. In other writing systems, most prominently those of the Indic family, it is an actual necessity for being able to write correctly. See our recent issues #365, #459, and #474. In those cases, larger numbers of characters (up to 7, I think) may get combined into one or several glyphs, in an m * n relationship. TTF/OTF fonts store such substitutions in the "GSUB" table, under feature tags such as "liga", "dlig", etc. I hope that fonttools offers some support for searching those, so that we don't have to figure out all the possible combinations ourselves...
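
To illustrate the simplest many-to-one case only, here is a toy sketch that replaces glyph sequences using a lookup dict, trying the longest sequence first. The glyph names and the dict format are made up for the example; real OpenType substitution involves lookups, contexts and script-specific shaping rules that this does not model.

```python
def apply_ligatures(glyphs, ligatures):
    """Replace glyph sequences with ligature glyphs, longest match first.

    `ligatures` maps a tuple of glyph names to the single replacement
    glyph name (a toy stand-in for data read from a font).
    """
    longest = max((len(seq) for seq in ligatures), default=1)
    result = []
    i = 0
    while i < len(glyphs):
        # Try the longest possible sequence first, down to length 2.
        for length in range(min(longest, len(glyphs) - i), 1, -1):
            candidate = tuple(glyphs[i : i + length])
            if candidate in ligatures:
                result.append(ligatures[candidate])
                i += length
                break
        else:
            # No ligature starts here: keep the glyph unchanged.
            result.append(glyphs[i])
            i += 1
    return result
```

For example, with entries for ("f", "i") and ("f", "f", "i"), the word "office" would use the three-glyph ligature rather than the two-glyph one.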

gmischler · 2022-09-07T18:29:44Z

@Lucas-C , I don't see any obstacles here anymore, so I'll vote for merging.

Lucas-C · 2022-09-07T19:52:39Z

Thank you both for your work on this PR!
Merging now

Lucas-C · 2022-09-07T20:11:02Z

As a side-effect, thanks to your work, code coverage has improved by 1.5 points!

Lucas-C · 2022-09-08T12:08:22Z

This has been released in v2.5.7

RedShy · 2022-09-15T08:40:54Z

I'm not sure what you mean by substitutions here.

Supporting ligatures means that a font may offer you the opportunity to write e.g. the two characters "fi", but have the single combined glyph "ﬁ" displayed (note how the upper end of the "f" connects with the dot of the "i", which it otherwise wouldn't). See Google fonts: Ligature for examples related to HTML rendering.

This is just one of the simplest cases though, and in western languages it's usually just an aesthetic nicety. In other writing systems, most prominently those of the Indic family, it is an actual necessity for being able to write correctly. See our recent issues #365, #459, and #474. In those cases, larger numbers of characters (up to 7, I think) may get combined into one or several glyphs, in an m * n relationship. TTF/OTF fonts store such substitutions in the "GSUB" table, under feature tags such as "liga", "dlig", etc. I hope that fonttools offers some support for searching those, so that we don't have to figure out all the possible combinations ourselves...

This is interesting, and it motivates me to challenge myself with this project; it would also have a direct impact and be helpful for people out there.
Unfortunately, for various reasons in my personal life, I don't think I have much time to dedicate to it right now, but in 1-2 weeks I should. If no one has picked it up by then, I would be glad to help 😁

RedShy and others added 10 commits July 28, 2022 17:29

modified add_font

b20f393

added parameter ft to makeSubset

af45b27

using font tools for tables before HMTX

d5a51c4

refactor getHMTX

8775c5c

managed hmtx and loca tables

fddbf3e

Merge branch 'PyFPDF:master' into fonttools_lib

dca5fc6

Merge branch 'fonttools_lib' of https://github.com/RedShy/fpdf2 into …

ca19056

…fonttools_lib

fonttools subsetter instead of ttf.makeSubset

69d1771

Merge branch 'PyFPDF:master' into fonttools_lib

c7680c4

options, rewrote codeToGlyph, rem TTFontFile

8063478

gmischler mentioned this pull request Aug 12, 2022

Hindi text is rendered incorrectly #365

Closed

RedShy and others added 4 commits August 21, 2022 12:41

recalcTimestamp and comments

992ec5f

removed test add_font_otf

60a46a6

new pdfs

5baffe7

Merge branch 'PyFPDF:master' into fonttools_lib

ddc63aa

Lucas-C reviewed Aug 22, 2022

rewrote test using fonttools

65fa718

khaledhosny reviewed Aug 22, 2022

added otf font test

a4c0467

RedShy and others added 4 commits August 23, 2022 12:57

renamed variables

92b6a95

Merge branch 'PyFPDF:master' into fonttools_lib

1a623c3

Merge branch 'fonttools_lib' of https://github.com/RedShy/fpdf2 into …

99e09bc

…fonttools_lib

renamed font

2f4f06b

added fonttools to setup.py

9b41561

RedShy added 8 commits September 5, 2022 18:29

fixed docstrings

ef25097

embed pdfs

f2c77ec

removed unnecessary code

ea1ba21

fix

66e66da

fix fpdf

fe6026e

Merge https://github.com/PyFPDF/fpdf2 into fonttools_lib

b05102f

regenareted pdfs

55f1ec8

fixed pdfs

ce23c55

rename unifontsubset & unicode_font to is_ttf_font

8fff76e

fixed renaming

7a2d102

remove .char_width()

bdd5172

Merge branch 'PyFPDF:master' into fonttools_lib

289366a

Lucas-C merged commit eb54535 into py-pdf:master Sep 7, 2022

This was referenced Sep 7, 2022

Switch to using fonttools #418

Closed

CBDT/CBLC font support for color emojis #224

Open

RedShy deleted the fonttools_lib branch September 15, 2022 08:41

kreier mentioned this pull request May 30, 2024

Support for Khmer contains errors kreier/timeline#35

Closed
