Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type42 subsetting in PS/PDF #20391

Merged
merged 29 commits into from
Jul 26, 2021
Merged

Conversation

aitikgupta
Copy link
Contributor

@aitikgupta aitikgupta commented Jun 8, 2021

PR Summary

This PR is a fresh rebase from #18143.

  • Adds a dependency: fonttools to handle font subsetting for us
    (we already have an external ttconv dependency, which does not handle subsetting)
  • Interfaces a getSubset utility to get file-like objects containing subsetted font data

Possibly fixes #11303 (large file sizes)
Fixes #18191.

PR Checklist

  • Has pytest style unit tests (and pytest passes).
  • Is Flake 8 compliant (run flake8 on changed files to check).
  • New features are documented, with examples if plot related.
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).
  • Conforms to Matplotlib style conventions (install flake8-docstrings and run flake8 --docstring-convention=all).
  • New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).

@jklymak jklymak marked this pull request as draft June 8, 2021 21:31
@lumberbot-app
Copy link

lumberbot-app bot commented Jun 10, 2021

I'm Mr. Meeseek, @aitikgupta, Look at meee !

@aitikgupta aitikgupta changed the title Type42 subsetting in PDF Type42 subsetting in PS/PDF Jun 13, 2021
@aitikgupta
Copy link
Contributor Author

aitikgupta commented Jun 13, 2021

The last commit subsets and embeds Type 42 PS/EPS, here's the size difference in outputs:

nosub-dejavu.ps                ----> 1.2 MB
sub-dejavu.ps                  ----> 8.5 kB

(To test it out on your machine, apply this patch for nosub-dejavu.ps:)

diff --git a/lib/matplotlib/backends/backend_ps.py b/lib/matplotlib/backends/backend_ps.py
index 2a3ab64a1c..ed0af4ea85 100644
--- a/lib/matplotlib/backends/backend_ps.py
+++ b/lib/matplotlib/backends/backend_ps.py
@@ -997,12 +997,8 @@ class FigureCanvasPS(FigureCanvasBase):
                             ) as tmp:
                                 tmp.write(fontdata)
                                 tmp.seek(0, 0)
-                                font = FT2Font(tmp.name)
-                                glyph_ids = [
-                                    font.get_char_index(c) for c in chars
-                                ]
                                 convert_ttf_to_ps(
-                                    os.fsencode(tmp.name),
+                                    os.fsencode(font_path),
                                     fh,
                                     fonttype,
                                     glyph_ids,

The script:

import matplotlib.pyplot as plt

plt.rcParams["ps.fonttype"] = 42
plt.figtext(0.5, 0.5, "hello")
# plt.savefig('sub-dejavu.ps')                    # before applying the patch
# plt.savefig('nosub-dejavu.ps')                  # after applying the patch

@aitikgupta aitikgupta marked this pull request as ready for review June 15, 2021 16:57
lib/matplotlib/backends/_backend_pdf_ps.py Outdated Show resolved Hide resolved
lib/matplotlib/backends/_backend_pdf_ps.py Show resolved Hide resolved
lib/matplotlib/backends/_backend_pdf_ps.py Outdated Show resolved Hide resolved
lib/matplotlib/backends/backend_pdf.py Outdated Show resolved Hide resolved
lib/matplotlib/backends/backend_ps.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
lib/matplotlib/backends/_backend_pdf_ps.py Outdated Show resolved Hide resolved
@jklymak jklymak added this to the v3.5.0 milestone Jun 17, 2021
@jklymak jklymak marked this pull request as draft June 17, 2021 14:40
@aitikgupta
Copy link
Contributor Author

aitikgupta commented Jun 18, 2021

#20391 (comment): This appears to require testing.

I think it's difficult to test get_glyphs_subset, since it really depends on font files, and if we chose certain characters to test with a fixed font (lets say), there's no guarantee that the subset will be the same for different versions of the library.

Is there any other way to test this?

edit: @jklymak this isn't a draft anymore, just needed to address review comments 😄

@jklymak
Copy link
Member

jklymak commented Jun 18, 2021

You can mark as ready-to-review at any point. I just kick things to Draft so they don't show up in the queue if the PR requires action on the part of the author. (this one is still not passing the tests).

@aitikgupta
Copy link
Contributor Author

(this one is still not passing the tests).

Yeah, CI isn't installing fonttools, which is why NoModuleFound is all over the place (even after adding it to minvers.txt: #20391 (comment))

@QuLogic
Copy link
Member

QuLogic commented Jun 19, 2021

Install is done with --no-deps, so hard dependencies also need to be listed here: https://github.com/matplotlib/matplotlib/blob/master/.github/workflows/tests.yml#L148 minver.txt is only a constraint on versions.

@QuLogic
Copy link
Member

QuLogic commented Jun 19, 2021

I think it's difficult to test get_glyphs_subset, since it really depends on font files, and if we chose certain characters to test with a fixed font (lets say), there's no guarantee that the subset will be the same for different versions of the library.

We ship DejaVu ourselves, so that should always be available for testing.

Is there any other way to test this?

A file with just 'A' embedded should be smaller than one with the full alphabet, say?
Also, a subsetted PDF should appear the same as one without subsetting (assuming Ghostscript doesn't happen to substitute the same glyphs.)

@aitikgupta
Copy link
Contributor Author

A file with just 'A' embedded should be smaller than one with the full alphabet, say?

Oh, if we just test the subset being 'smaller' yeah we can do it, but not with the exact number glyphs, since that will definitely vary. (we also set recommeded_glyphs=True)

But even so, isn't that the 'full-time job' of the fonttools library itself? or in other words wouldn't testing the get_glyphs_subset function be the same as testing the library itself (that it does reduce the number of glyphs)?

@aitikgupta aitikgupta marked this pull request as ready for review June 19, 2021 19:13
@jkseppan
Copy link
Member

Oh, and one more thing (sorry for not remembering this earlier): the PDF specification has some extra requirements for font subsets in section 9.6.4. The PostScript name of subsetted fonts needs to have a prepended tag of six random uppercase letters and a plus sign, e.g. EOODIA+Poetica if the original font name is Poetica. This is relevant in the case that multiple Matplotlib plots are combined in the same document, e.g. in a LaTeX paper with several figures. The random tags prevent collisions between the different versions of the same font.

@aitikgupta
Copy link
Contributor Author

extra requirements for font subsets ... PostScript name of subsetted fonts needs to have a prepended tag of six random uppercase letters and a plus sign

I think we don't do this even for Type 3 subsetting?
Since we always try to subset those, I can probably just add a function to modify the name for Type 3 and Type 42 both.

@aitikgupta
Copy link
Contributor Author

This breaks the test_determinism_check:

def test_determinism_check(objects, fmt, usetex):
"""
Output three times the same graphs and checks that the outputs are exactly
the same.

This is for obvious reasons, since we have random string inside font postscript name. (due to the specification)

@jkseppan
Copy link
Member

Then the string shouldn't be random, but perhaps something derived from the subset of glyphs? Take hash(frozenset(glyphs)), convert it to base 26 and take the first six letters, or something like that.

The PDF specification only requires the tag for Type 1 and TrueType fonts. It probably doesn't hurt with Type 3 fonts either.

@aitikgupta
Copy link
Contributor Author

Then the string shouldn't be random, but perhaps something derived from the subset of glyphs?

That totally makes sense! Let me try this out 👍🏼

@jklymak jklymak marked this pull request as draft July 22, 2021 13:42
@jklymak
Copy link
Member

jklymak commented Jul 22, 2021

@sauerburger can you suggest a test for that, or is it already in #20633?

@aitikgupta
Copy link
Contributor Author

can you suggest a test for that, or is it already in #20633?

The test is already in, and that is exactly what is breaking the CI for this PR.
I'll rebase and push a commit, fixing the error..

@aitikgupta aitikgupta marked this pull request as ready for review July 22, 2021 16:34
@jkseppan
Copy link
Member

It seems to me that the requested changes have been made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants