[Bug]: Math fonts (Type 3) incorrectly embedded in PDF? #21797

Open
andreas-brugger opened this issue Nov 29, 2021 · 21 comments

@andreas-brugger

Bug summary

PDFs containing math fonts cause the following error message in Adobe Acrobat: Cannot extract the embedded font. Some characters may not display or print correctly.

Code for reproduction

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=[2, 1])

# Set to True to embed TrueType (Type 42) fonts instead of the default Type 3.
use_truetype = False
if use_truetype:
    plt.rcParams.update({'pdf.fonttype': 42})

ax.axis('Off')
ax.text(0.5, 0.5, '$\\alpha~\\beta~\\gamma$', ha='center')

fig.savefig('Test.pdf')

Actual outcome

Adobe Acrobat error message

Expected outcome

The font seems to be displayed correctly, but apparently it is not correctly embedded.

Additional information

To reproduce this issue, open the PDF created by Matplotlib in Adobe Acrobat and navigate to Document Properties > Fonts > OK. The error message is only displayed once you change the zoom, switch pages, etc. It appears in recent versions of Adobe Acrobat on different operating systems.

The error message arises only for PDFs containing mathematical expressions, while PDFs with regular text cause no error message. However, the issue seems to arise independently of the actual font type and arises for the default font dejavusans and others such as stix.

The issue furthermore arises for the default value of pdf.fonttype being Type 3. When switching to TrueType (causing much larger file sizes), the error message no longer arises.

Operating system

No response

Matplotlib Version

3.4.3

Matplotlib Backend

No response

Python version

No response

Jupyter version

No response

Installation

conda

@andreas-brugger
Author

Consider the file Test.pdf, which causes the aforementioned error message in Adobe Acrobat.

@QuLogic
Member

QuLogic commented Nov 30, 2021

Please try with 3.5.0

@QuLogic added the "status: needs clarification" label on Nov 30, 2021

@jklymak
Member

jklymak commented Nov 30, 2021

I tested on 3.5.0 and the error message still appears. OTOH, I'm not clear what it means, if anything.

@andreas-brugger
Author

I tried to get more details by using the Preflight tool from Adobe Acrobat, but unfortunately I could not get any. However, since the error message only arises for certain types of fonts, there seems to be an actual reason.

The error message is significant insofar as it also arises if such a figure is embedded in a larger PDF (e.g. via LaTeX) and therefore presumably affects a whole lot of documents.

@jklymak
Member

jklymak commented Nov 30, 2021

.. but does it have a practical effect, in that the document looks bad? Are the fonts actually missing? I wouldn't rule out an Adobe bug here...

@andreas-brugger
Author

andreas-brugger commented Nov 30, 2021

As far as I can tell, the figures look fine, but it is hard to tell since I usually use the font Stix 2 while Matplotlib uses the former version and they differ anyway.

According to the document properties shown in Adobe Acrobat (prior to displaying the error message) the font is embedded. Similarly, pdffonts provided by poppler-utils states that the font is embedded:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DejaVuSans-Oblique                   Type 3            Custom           yes no  no      15  0

It could be a bug in Adobe Acrobat, or e.g., maybe something remotely similar to pull request #1808? Does anybody know a way to truly check the PDF, e.g. via the Preflight tool?

@aitikgupta
Contributor

Consider the file Test.pdf

3.5 would instead produce a font-subsetted file (with a six-character prefix at the beginning of the font name).
Reproducing the snippet, the embedded fonts are:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
GCWXDV+DejaVuSans-Oblique            Type 3            Custom           yes yes no      15  0

^when pdf.fonttype is 3, and:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
GCWXDV+DejaVuSans-Oblique            CID TrueType      Identity-H       yes yes yes     15  0

^when pdf.fonttype is 42..
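The six-character prefix in the listings above follows the PDF convention for subset fonts: a tag of six uppercase letters, then a plus sign, then the base font name. A minimal sketch of telling subsetted names from full ones is shown below; the helper name `split_subset_tag` is made up for this illustration and is not part of Matplotlib or poppler.

```python
import re

# A subset tag is six uppercase letters followed by "+", e.g. "GCWXDV+".
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when the name has no subset tag.

    Hypothetical helper for illustration only.
    """
    match = SUBSET_TAG.match(font_name)
    if match is None:
        return None, font_name
    return match.group(1), match.group(2)


# The two font names reported by pdffonts in this thread.
for name in ("DejaVuSans-Oblique", "GCWXDV+DejaVuSans-Oblique"):
    tag, base = split_subset_tag(name)
    print(f"{name!r}: subset tag={tag!r}, base font={base!r}")
```

The tag only says that a subset was written; it does not say which glyphs the subset contains.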

Not sure about the actual bug here, but I'd first try to reproduce it with the latest Matplotlib build. Also, these PDFs do not cause problems in the other native PDF viewers I've tried, so an Adobe Acrobat bug seems possible.

@andreas-brugger
Author

3.5 would instead produce a font subsetted file (with 6 characters in the beginning of font name)..
Reproducing the snippet, the embedded fonts are:

Could you please add the two PDFs for Matplotlib 3.5 with font type 3 and TrueType, so I could check whether the error arises?

@aitikgupta
Contributor

@theBruegge here you go:
fonttype3.pdf
fonttype42.pdf

@andreas-brugger
Author

Thanks @aitikgupta!

The issue is still the same: for font type 3 the aforementioned error message arises, which states that the embedded font cannot be extracted (the font now has the name GCWXDV+DejaVuSans-Oblique indicating the subset). No error is shown for TrueType.

I already tried to get more details via the Preflight tool from Adobe Acrobat and from Adobe support, which unfortunately has not worked out so far, but I'll try again ...

@andreas-brugger
Author

I think I got some more details via the Preflight tool from Adobe Acrobat Pro: according to Preflight the font is not embedded if Font Type 3 is used, while TrueType fonts are shown to be embedded. In the former case, Preflight therefore tries to embed the font. This procedure is shown to be successful, if the DejaVu Fonts were previously installed on the system.

However, while this procedure truly works for non-math text in DejaVuSans (the font is denoted as Embedded Subset afterwards in the Document Properties), embedding apparently does not work for math text in DejaVuSans-Oblique and Preflight vectorizes the math text.

    

Unfortunately, this still does not reveal any details. However, in my experience printing companies check PDFs for exactly this, for instance using Preflight, which makes this issue a real problem, at least to some extent.

@jklymak
Member

jklymak commented Jan 6, 2022

@theBruegge It's not clear to me what action you are suggesting for Matplotlib.

@andreas-brugger
Author

@jklymak
Well, there seems to be a strong hint of some issue, probably caused by Matplotlib's embedding of Type 3 fonts. Unfortunately, since I have no profound knowledge of font embedding or the internal structure of PDFs, I do not know how to investigate or fix the issue further; I can only point it out.

@andreas-brugger
Author

Finally, I got some more details – I hope it helps @aitikgupta and @jklymak! I contacted Adobe Care on Twitter via Direct Message and after some discussion and mailing them the file, they came to the following conclusion:

The three characters on the page are actually not text. They are filled paths, i.e. moveto/lineto/curveto and then filled. There is actually no text drawn on the page at all. You can see this with Preflight->Options->Create Inventory. So the first question to the creators of the app that made the PDF is, “Why are you including a Font resource for a font that isn’t even used to draw anything on the page?”
The Type 3 font dictionary in the PDF has an empty CharProcs dictionary. That’s probably confusing to Acrobat, as without any CharProcs entries, there are no glyph descriptions to describe how to draw the glyphs. But the font dictionary isn’t needed because there is no text on the page. They should omit the font dictionary if there is no text drawn with it. Seems like their software leaves it in, and just doesn’t populate the CharProcs if there is no text drawn with the font.
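The condition described in the reply above (a Type 3 font dictionary whose CharProcs dictionary is empty) can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the PDF fragment is hand-written rather than taken from Test.pdf, the helper name `empty_charprocs_fonts` is made up, and the scan only handles uncompressed objects with generation number 0 that reference CharProcs indirectly. A real check would need a proper PDF parser.

```python
import re

# Hand-written fragment for illustration only; it is NOT taken from Test.pdf.
SAMPLE = b"""\
15 0 obj
<< /Type /Font /Subtype /Type3 /Name /F1 /CharProcs 16 0 R >>
endobj
16 0 obj
<< >>
endobj
"""


def empty_charprocs_fonts(pdf_bytes):
    """Return object numbers of Type 3 fonts whose CharProcs object is empty.

    Hypothetical helper: assumes uncompressed objects, generation 0,
    and an indirect /CharProcs reference.
    """
    # Map object number -> raw object body.
    objects = {
        int(num): body.strip()
        for num, body in re.findall(rb"(\d+) 0 obj(.*?)endobj", pdf_bytes, re.S)
    }
    hits = []
    for num, body in objects.items():
        if b"/Type3" not in body:
            continue
        ref = re.search(rb"/CharProcs\s+(\d+) 0 R", body)
        if ref is None:
            continue
        target = objects.get(int(ref.group(1)), b"")
        # An empty dictionary is "<<" and ">>" with only whitespace between.
        if re.fullmatch(rb"<<\s*>>", target):
            hits.append(num)
    return hits


print(empty_charprocs_fonts(SAMPLE))
```

Whether Test.pdf stores CharProcs as an indirect reference like this has not been verified here; the sketch only shows the shape of the check.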

 
So, allegedly the characters are not text, but rather filled paths. At first I was doubtful, but then I inspected the file with PDFQuery using the following code, which yields the export shown below:

from pdfquery import PDFQuery

file = 'Test.pdf'
pdf  = PDFQuery(file)

pdf.load()
pdf.tree.write(file.replace('.pdf', '.xml'), pretty_print=True, encoding='utf-8')

<pdfxml CreationDate="" Creator="Matplotlib v3.4.3, https://matplotlib.org" Producer="Matplotlib pdf backend v3.4.3">
  <LTPage y0="0" y1="72" x0="0" x1="144" width="144" height="72" bbox="[0, 0, 144, 72]" pageid="1" rotate="0" page_index="0" page_label="">
    <LTRect y0="0" y1="72" x0="0" x1="144" width="144" height="72" bbox="[0, 0, 144, 72]" linewidth="0" pts="[[0, 0], [144, 0], [144, 72], [0, 72]]">
      <LTFigure y0="32.474" y1="43.154" x0="50.64" x1="67.24" width="16.6" height="10.68" bbox="[50.64, 32.474, 67.24, 43.154]" name="F1-DejaVuSans-Oblique-alpha" matrix="[0.01, 0.0, 0.0, 0.01, 60.8, 35.984]"/>
      <LTFigure y0="32.474" y1="43.154" x0="60.479" x1="77.079" width="16.6" height="10.68" bbox="[60.479, 32.474, 77.079, 43.154]" name="F1-DejaVuSans-Oblique-beta" matrix="[0.01, 0.0, 0.0, 0.01, 70.639, 35.984]"/>
      <LTFigure y0="32.474" y1="43.154" x0="70.108" x1="86.708" width="16.6" height="10.68" bbox="[70.108, 32.474, 86.708, 43.154]" name="F1-DejaVuSans-Oblique-gamma" matrix="[0.01, 0.0, 0.0, 0.01, 80.268, 35.984]"/>
    </LTRect>
  </LTPage>
</pdfxml>

 
Indeed, the PDF seemingly contains three LTFigure objects, but no text.
In contrast, the expected result is achieved if I use non-Greek characters by replacing '$\\alpha~\\beta~\\gamma$' in the code from my initial post with '$a~b~c$' – PDFQuery then recognizes the text as LTTextLineHorizontal and LTTextBoxHorizontal.
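For anyone who wants to repeat this comparison on other files, the check can be automated by counting element types in the XML export. The sketch below uses only the standard library and a trimmed copy of the export above (attributes shortened for brevity); `count_layout_elements` is a name made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed copy of the PDFQuery export shown above (most attributes dropped).
EXPORT = """\
<pdfxml>
  <LTPage pageid="1">
    <LTRect>
      <LTFigure name="F1-DejaVuSans-Oblique-alpha"/>
      <LTFigure name="F1-DejaVuSans-Oblique-beta"/>
      <LTFigure name="F1-DejaVuSans-Oblique-gamma"/>
    </LTRect>
  </LTPage>
</pdfxml>
"""


def count_layout_elements(xml_text):
    """Count elements by tag name across the whole export (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


counts = count_layout_elements(EXPORT)
print("figures:", counts["LTFigure"])
print("text lines:", counts["LTTextLineHorizontal"])
```

A page whose glyphs were all drawn as paths shows only LTFigure entries and no LTTextLineHorizontal entries, which matches what the export above shows for the Greek symbols.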

 
This leads me to the following questions:

  • Is this the intended behavior, i.e., does Matplotlib create paths for Greek math symbols on purpose instead of using actual text? I would not have expected such behavior.
  • If so, could the CharProcs dictionary be cleaned up accordingly to prevent error messages?
  • If not, well ... something has to be wrong, since neither the Preflight tool from Adobe Acrobat Pro nor PDFQuery recognizes the actual text ...

@andreas-brugger
Author

Sorry for bothering you again @aitikgupta, @dstansby, and @jklymak, but consider the following crucial question for this issue: does Matplotlib intentionally create vector paths for Greek math symbols instead of actual text?
I would not have expected such behavior, but apparently, this is the current behavior (see the above post for further details). If so, there can't be an actual bug since no font is involved at all – however, the CharProcs dictionary could be cleaned up accordingly.

@QuLogic
Member

QuLogic commented Mar 22, 2022

There are various locations in the PDF backend that check (for Type 3) whether the character code is < 256. I think this may be a misinterpretation of the spec, or at least a simplification. AFAICT, there is no requirement that Character Encodings are restricted to codes < 256.

There is however a statement that 'With a simple font, each byte of the string shall be treated as a separate character code.' Since Greek characters have Unicode code points above 255, none of them pass this check, and they are output as XObjects instead.

However, there is this statement:

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or
more consecutive bytes of the string shall be treated as a single character code. The code lengths and the
mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite
Fonts".

And we do implement a Unicode CMap for Type 42. I think we should do the same for Type 3.
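To make the threshold concrete, here is a small sketch of which characters fall on each side of a `< 256` check. It is not Matplotlib's code; the function name `split_by_simple_font_range` is made up, and it only illustrates the rule described above using Python's `ord`.

```python
def split_by_simple_font_range(text, limit=256):
    """Split characters into those with code point < limit and the rest.

    Hypothetical helper: mirrors the "< 256" rule discussed above,
    not the actual backend implementation.
    """
    in_range = [ch for ch in text if ord(ch) < limit]
    out_of_range = [ch for ch in text if ord(ch) >= limit]
    return in_range, out_of_range


# Latin letters stay below the limit; Greek letters (U+03B1 and up) do not.
for sample in ("abc", "\u03b1\u03b2\u03b3"):
    inside, outside = split_by_simple_font_range(sample)
    print(
        sample,
        "in range:", [hex(ord(ch)) for ch in inside],
        "out of range:", [hex(ord(ch)) for ch in outside],
    )
```

This matches the observation earlier in the thread that `$a~b~c$` ends up as text while `$\alpha~\beta~\gamma$` does not.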

@QuLogic
Member

QuLogic commented Mar 22, 2022

We can definitely drop some of the <256 checks, though things are a bit under-optimized that way. However, I hesitate to do much more with the CMap just yet to avoid conflicting with some existing PRs.

@andreas-brugger
Author

andreas-brugger commented Dec 5, 2022

@QuLogic, thanks for your previous efforts! Are there any updates on this topic?

I tried the current Matplotlib version and tried to deactivate the <256 checks (more precisely, in backend_pdf.py in _font_supports_glyph() on line 335 and in embedTTFType3() on line 1167). But when I do so, these characters do not show up at all in the resulting PDF. I guess it is not that easy and the CMap you mentioned earlier has to be implemented?

@jklymak removed the "status: needs clarification" label on Apr 20, 2023

@jklymak
Member

jklymak commented Apr 20, 2023

Another example is from https://stackoverflow.com/questions/76057034/python-matplotlib-produces-larger-and-blurrier-pdf-than-r?noredirect=1#comment134141564_76057034

Note that if you do plt.rcParams["pdf.fonttype"] = 42 the file looks fine, but actually gets larger than if you use Type-3 fonts. If you use plt.rcParams["pdf.use14corefonts"] = True the file is smaller, but the minus sign doesn't work, I assume because we made it Unicode.

@tacaswell
Member

Not emitting an empty font embedding makes sense, but I would push back on Adobe and ask whether that is actually out of spec or not. If the spec does not forbid it, we have found a bug in their tool :)

@andreas-brugger
Author

Knowing what I know now, I would formulate my issue differently: it's actually not just about the error message in Adobe Acrobat (regardless of whether out of spec or not as mentioned by @tacaswell). The issue rather addresses the fact that characters with a code ≥ 256 are not embedded as text in the PDF but as a rendered path.
This has many disadvantages: the file size increases (as mentioned by @jklymak), you can't search for these characters, you can't edit them later, etc. – but as far as I know (and as mentioned by @QuLogic), the standard would allow embedding such characters as text.
