[BUG] PDFs Not Imported Properly Due to Missing Fonts #1135

Closed
nomad64 opened this issue Jun 14, 2022 · 13 comments
Labels: bug, deployment, help wanted

Comments

@nomad64

nomad64 commented Jun 14, 2022

Description

I have been running Paperless NGX for a while and have developed a (still manual) workflow for when I get an e-mail receipt (where the e-mail itself is a receipt and does not have an attached PDF). In the GMail app on my Android phone, I print the email and select "Save as PDF" as the printer. I save the PDF into a Syncthing folder which syncs the paperless-consume directory to my docker server. This has been working fine for some time.

This morning, I realized that starting on June 3, all of the PDFs generated in this manner failed to import properly. Paperless "successfully" imports it, but all of the text and most of the images are missing. Based on the message below, this appears to be due to missing embedded fonts.

To be clear, the issue is primarily with the GMail app. Using the same "Save as PDF" printer method with other apps seems to work just fine.

Notes:

  • Doing this same process via Chrome on my desktop works fine; it appears to be isolated to the GMail app
  • I am using the LSIO container, however, this doesn't appear to be related (if it is, I will gladly open a bug ticket on their end)

Steps to reproduce

  1. Using GMail app, print an e-mail and select "Save as PDF" as the printer
  2. Copy the PDF over to the paperless-consume dir

Webserver logs

[2022-06-14 06:38:42,342] [INFO] [paperless.management.consumer] Adding /data/consume/Gmail - Order #99994 confirmed.pdf to the task queue.
06:38:42 [Q] INFO Enqueued 1
06:38:42 [Q] INFO Process-1:659 processing [Gmail - Order #99994 confirmed.pdf]
[2022-06-14 06:38:42,589] [INFO] [paperless.consumer] Consuming Gmail - Order #99994 confirmed.pdf
[2022-06-14 06:38:43,539] [ERROR] [ocrmypdf._exec.ghostscript] GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 2.
Page 1
GPL Ghostscript 9.50: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

   
[2022-06-14 06:38:43,539] [ERROR] [ocrmypdf._exec.ghostscript]  Error: can't process embedded font stream,
        attempting to load the font using its name.
               Output may be incorrect.
Can't find CID font "RobotoStatic-Regular".
Attempting to substitute CID font /Adobe-Identity for /RobotoStatic-Regular, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
The fallback CID font "CIDFallBack" is not provided.  Finally attempting to use ArtifexBullet.
   
[2022-06-14 06:38:43,539] [ERROR] [ocrmypdf._exec.ghostscript]  Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   
[2022-06-14 06:38:43,540] [ERROR] [ocrmypdf._exec.ghostscript]  Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.
Page 2
   
[2022-06-14 06:38:43,540] [ERROR] [ocrmypdf._exec.ghostscript]  Error: can't process embedded font stream,
        attempting to load the font using its name.
               Output may be incorrect.
Can't find CID font "RobotoStatic-Regular".
Attempting to substitute CID font /Adobe-Identity for /RobotoStatic-Regular, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
The fallback CID font "CIDFallBack" is not provided.  Finally attempting to use ArtifexBullet.
   
[2022-06-14 06:38:43,540] [ERROR] [ocrmypdf._exec.ghostscript]  Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   
[2022-06-14 06:38:43,540] [ERROR] [ocrmypdf._exec.ghostscript]  Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

[2022-06-14 06:38:43,648] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2022-06-14 06:38:44,054] [ERROR] [ocrmypdf._exec.ghostscript]    **** Error: can't process embedded font stream,
        attempting to load the font using its name.
               Output may be incorrect.
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

[2022-06-14 06:38:45,735] [ERROR] [ocrmypdf._exec.tesseract] [tesseract] Error during processing.
[2022-06-14 06:38:46,560] [ERROR] [ocrmypdf._exec.ghostscript]    **** Error: can't process embedded font stream,
        attempting to load the font using its name.
               Output may be incorrect.
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

[2022-06-14 06:38:58,837] [ERROR] [ocrmypdf._exec.ghostscript]    **** Error: can't process embedded font stream,
        attempting to load the font using its name.
               Output may be incorrect.
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

[2022-06-14 06:39:00,406] [ERROR] [ocrmypdf._exec.tesseract] [tesseract] Error during processing.
[2022-06-14 06:39:01,267] [ERROR] [ocrmypdf._exec.ghostscript]    **** Error: can't process embedded font stream,
        attempting to load the font using its name.
               Output may be incorrect.
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

[2022-06-14 06:39:19,687] [INFO] [paperless.handlers] Assigning document type Receipt to 2022-03-01 Gmail - Order #99994 confirmed
[2022-06-14 06:39:19,804] [INFO] [paperless.consumer] Document 2022-03-01 Gmail - Order #99994 confirmed consumption finished
06:39:19 [Q] INFO Process-1:659 stopped doing work
06:39:19 [Q] INFO Processed [Gmail - Order #99994 confirmed.pdf]
06:39:19 [Q] INFO recycled worker Process-1:659
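
The font names Ghostscript could not resolve can be pulled out of a log like the one above with a short filter. This is a minimal sketch and not part of Paperless-ngx: the function name `missing_cid_fonts` is hypothetical, and it assumes the message wording that Ghostscript 9.50 prints in this log.

```shell
#!/usr/bin/env bash
# Print each distinct font name found in Ghostscript's
# 'Can't find CID font "<name>".' messages read from stdin.
# missing_cid_fonts is a hypothetical helper name.
missing_cid_fonts() {
    sed -n "s/.*Can't find CID font \"\([^\"]*\)\".*/\1/p" | sort -u
}

# Usage (the log path is an assumption):
#   missing_cid_fonts < paperless.log
```

For the log above this would report a single name, since every failure refers to the same font.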

Paperless-ngx version

1.7.1

Host OS

Ubuntu 20.04 / Docker

Installation method

Docker

Browser

No response

Configuration changes

No response

Other

No response

nomad64 added the bug and unconfirmed labels Jun 14, 2022
@stumpylog
Member

This appears to be a duplicate of #277. With the LSIO container, you can probably add a custom user start script to install the fonts-roboto package, which appears to provide the missing font in this particular case.

@stumpylog
Member

The details on a custom startup script are here: https://www.linuxserver.io/blog/2019-09-14-customizing-our-containers

#!/usr/bin/env bash
# Install the Roboto fonts at container start (-y: no interactive prompt)
apt-get -qq update
apt-get install -y fonts-roboto

If that does resolve the issue, let me know. Ghostscript's documentation isn't very clear about where it looks for fonts.

@nomad64
Author

nomad64 commented Jun 14, 2022

For testing, I ran the below commands in the currently-running Paperless-NGX container:

apt-get -qq update
apt-get install -y fonts-roboto

After that, I printed to PDF a random email from GMail and waited for it to import. Unfortunately, the imported PDF is still blank. Below is the log from the import.

[2022-06-14 16:45:52,083] [INFO] [paperless.management.consumer] Adding /data/consume/Gmail - Information about Your Automatic Renewal.pdf to the task queue.
16:45:52 [Q] INFO Enqueued 1
16:45:52 [Q] INFO Process-1:52 processing [Gmail - Information about Your Automatic Renewal.pdf]
[2022-06-14 16:45:52,258] [INFO] [paperless.consumer] Consuming Gmail - Information about Your Automatic Renewal.pdf
[2022-06-14 16:45:52,877] [ERROR] [ocrmypdf._exec.ghostscript] GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
GPL Ghostscript 9.50: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

   
[2022-06-14 16:45:52,877] [ERROR] [ocrmypdf._exec.ghostscript]  Error: can't process embedded font stream,
        attempting to load the font using its name.
               Output may be incorrect.
Can't find CID font "RobotoStatic-Regular".
Attempting to substitute CID font /Adobe-Identity for /RobotoStatic-Regular, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
The fallback CID font "CIDFallBack" is not provided.  Finally attempting to use ArtifexBullet.
   
[2022-06-14 16:45:52,877] [ERROR] [ocrmypdf._exec.ghostscript]  Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   
[2022-06-14 16:45:52,878] [ERROR] [ocrmypdf._exec.ghostscript]  Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

[2022-06-14 16:45:57,469] [INFO] [paperless.handlers] Assigning document type Receipt to 2022-06-14 Gmail - Information about Your Automatic Renewal
[2022-06-14 16:45:57,477] [INFO] [paperless.handlers] Tagging "2022-06-14 Gmail - Information about Your Automatic Renewal" with "Microcenter, Scholastic, ChowNow, Nintendo"
[2022-06-14 16:45:57,712] [INFO] [paperless.consumer] Document 2022-06-14 Gmail - Information about Your Automatic Renewal consumption finished

@stumpylog
Member

Ok, it seems like this is a font that Android includes. The only references I can find are in the AOSP. I believe a PDF should embed the font in some way so that it's self-contained. That's kind of the point of PDF: it looks the same for everyone. But this one isn't doing that.

So to test that theory, if you are able to, run the following in the container:

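# Fetch the Roboto fonts from AOSP and install the one Ghostscript asks for
apt-get update
apt-get install -y git
git clone --depth 1 https://android.googlesource.com/platform/external/roboto-fonts
cd roboto-fonts/
mkdir -p /usr/local/share/fonts/
cp RobotoStatic-Regular.ttf /usr/local/share/fonts/
# Rebuild the font cache so the new file is picked up
fc-cache /usr/local/share/fonts/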

(fc-cache is generating the font caches after the change: https://linux.die.net/man/1/fc-cache)

@nomad64
Author

nomad64 commented Jun 15, 2022

I totally agree that PDFs should contain any fonts they use so they are self-contained. The GMail app did do this up until about June 3, as I don't recall seeing any errors like this before (and the PDFs imported just fine into Paperless). Kinda frustrating that this was changed. :(

I did get a chance to troubleshoot this a bit this morning.

I ran the above commands as you suggested in the same container we have been using. The commands worked, but Paperless still threw errors when attempting to process the PDF.

I copied the TTF file into /config/custom-cont-init.d and created a startup script with the following lines:

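#!/bin/bash

# Note: this directory is created but not used; the font is copied to /usr/local/share/fonts/
mkdir -p /usr/share/fonts/truetype/roboto
cp /config/custom-cont-init.d/RobotoStatic-Regular.ttf /usr/local/share/fonts/
# Rebuild the font cache, listing the scanned directories
fc-cache -v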

I restarted the container and verified it ran the script. I logged into the container and verified the font cache saw the RobotoStatic font:

root@954e35a786fe:/# fc-list | grep Roboto
/usr/local/share/fonts/RobotoStatic-Regular.ttf: RobotoStatic:style=RegularStatic

I tried uploading the document again, but ghostscript still fails to find the font. The uploaded PDF is still mostly blank.

It appears someone created a similar issue for the LSIO container: linuxserver/docker-paperless-ngx#25. I tried installing the DroidSansFallbackFull.ttf font via the startup script:

#!/bin/bash

# -p: do not fail if the directory already exists after a restart
mkdir -p /usr/share/fonts/truetype/droid
cp /config/custom-cont-init.d/DroidSansFallback.ttf /usr/share/fonts/truetype/droid/DroidSansFallbackFull.ttf
fc-cache -v

This time, when uploading the PDF, ghostscript does find the fallback font and uses it. However, any part of the PDF that uses it returns gibberish characters. I verified the TTF file itself is OK, so I am not sure what happened. Here is the log output for reference:

08:33:52 [Q] INFO Process-1:1 processing [Gmail - Information about Your Automatic Renewal.pdf]
[2022-06-15 08:33:52,644] [INFO] [paperless.consumer] Consuming Gmail - Information about Your Automatic Renewal.pdf
[2022-06-15 08:33:53,437] [ERROR] [ocrmypdf._exec.ghostscript] GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
GPL Ghostscript 9.50: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

   
[2022-06-15 08:33:53,437] [ERROR] [ocrmypdf._exec.ghostscript]  Error: can't process embedded font stream,
        attempting to load the font using its name.
               Output may be incorrect.
Can't find CID font "RobotoStatic-Regular".
Attempting to substitute CID font /Adobe-Identity for /RobotoStatic-Regular, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.50/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.

[2022-06-15 08:34:01,638] [INFO] [paperless.handlers] Assigning document type Receipt to 2022-06-14 Gmail - Information about Your Automatic Renewal
[2022-06-15 08:34:01,771] [INFO] [paperless.consumer] Document 2022-06-14 Gmail - Information about Your Automatic Renewal consumption finished
08:34:01 [Q] INFO Process-1:1 stopped doing work
08:34:01 [Q] INFO Processed [Gmail - Information about Your Automatic Renewal.pdf]
08:34:02 [Q] INFO recycled worker Process-1:1

It appears that, although the container recognizes the font as installed, ghostscript doesn't use RobotoStatic-Regular for some reason.

stumpylog added the help wanted and deployment labels and removed the unconfirmed label Jun 15, 2022
@stumpylog
Member

Yep, that's about what I get as well. Very strange that Ghostscript finds the font during loadallfonts, but then somehow fails to use it correctly. I'm unsure how to fix this exactly.

For the future, my idea would be a directory (DATA_DIR/fonts maybe?) where a user can place .ttf files. At container startup, these would be copied or processed in whatever way is needed to make them available to Ghostscript. Because fonts can carry licensing restrictions, and there are a great many of them out there, I don't think packaging them up ourselves is a good idea.
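
The startup step proposed above could look roughly like the following. This is only a sketch of the idea, not something Paperless-ngx implements: the function name `install_user_fonts` and its arguments are assumptions, and the default destination is simply the directory used earlier in this thread. As the earlier comments show, making a font visible to fontconfig does not guarantee that Ghostscript will use it.

```shell
#!/usr/bin/env bash
# Copy user-supplied .ttf files into a font directory and refresh the
# fontconfig cache. Prints the number of files copied.
# install_user_fonts and its arguments are hypothetical.
install_user_fonts() {
    local src_dir="${1:?usage: install_user_fonts SRC_DIR [DEST_DIR]}"
    local dest_dir="${2:-/usr/local/share/fonts}"
    local font copied=0

    # Nothing to do if the user never created the source directory.
    [ -d "$src_dir" ] || { echo 0; return 0; }
    mkdir -p "$dest_dir"

    for font in "$src_dir"/*.ttf; do
        [ -e "$font" ] || continue   # the glob matched nothing
        cp -- "$font" "$dest_dir"/
        copied=$((copied + 1))
    done

    # Rebuild the cache only if something was copied and fc-cache exists;
    # a cache failure is ignored so that startup can continue.
    if [ "$copied" -gt 0 ] && command -v fc-cache >/dev/null 2>&1; then
        fc-cache "$dest_dir" >/dev/null 2>&1 || true
    fi
    echo "$copied"
}

# Usage (the source path is an assumption):
#   install_user_fonts /path/to/data/fonts
```

Copying only `*.ttf` and leaving everything else untouched keeps the step predictable; whether Ghostscript needs further configuration beyond the font cache is left open here, as it is in the thread.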

@nomad64
Author

nomad64 commented Jun 15, 2022

I have learned more about fonts in the last few hours than I would care to admit. :)

Using pdffonts, I can see that there are fonts embedded into the PDF:

nomad64@minty-beast:~/Desktop$ pdffonts Gmail\ -\ Information\ about\ Your\ Automatic\ Renewal.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  yes      6  0
BAAAAA+RobotoStatic-Regular          CID TrueType      Identity-H       yes yes yes      7  0

If I download an older PDF that imported into paperless just fine, I see a similar output.

nomad64@minty-beast:~/Desktop$ pdffonts 2022-05-27\ Gmail\ -\ Your\ Google\ Play\ Order\ Receipt\ from\ May\ 27\,\ 2022.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  yes     15  0
BAAAAA+RobotoStatic-Regular          CID TrueType      Identity-H       yes yes yes     16  0

I am not sure what to make of this, just pointing it out.
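
For reading the tables above: the `emb` column reports whether a font program is embedded, and both files show `yes` for RobotoStatic-Regular. A small filter can list any fonts reported as not embedded. This is a sketch: `unembedded_fonts` is a hypothetical helper, and it assumes the column layout shown above, counting fields from the right because type names such as `CID TrueType` contain spaces. Judging by the Ghostscript errors in this thread, a `yes` in that column does not by itself mean Ghostscript can process the embedded stream.

```shell
#!/usr/bin/env bash
# Read `pdffonts` output on stdin and print the name of every font whose
# "emb" column is "no". unembedded_fonts is a hypothetical helper name.
unembedded_fonts() {
    # Skip the two header lines. Counting from the right, the columns are:
    # emb, sub, uni, object number, generation number.
    # Font names containing spaces would be truncated to their first word.
    awk 'NR > 2 && NF >= 5 && $(NF-4) == "no" { print $1 }'
}

# Usage (file name is an assumption):
#   pdffonts document.pdf | unembedded_fonts
```

For both tables above the filter would print nothing, which matches the observation that the old and new files look alike to pdffonts.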

@stale

stale bot commented Jul 15, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 15, 2022
@stumpylog
Member

We should probably set onlyLabels for stalebot so it doesn't stale things which are issues. So only unconfirmed and cant-reproduce? @shamoon what do you think?

@shamoon
Member

shamoon commented Jul 15, 2022

> We should probably set onlyLabels for stalebot so it doesn't stale things which are issues. So only unconfirmed and cant-reproduce? @shamoon what do you think?

Yes agree, the different options seem to behave in weird ways, I thought only-of… did that 🤷‍♂️

@tooomm
Contributor

tooomm commented Jul 15, 2022

> Yes agree, the different options seem to behave in weird ways, I thought only-of… did that 🤷‍♂️

Oh, I guess I know what's going on...
You did actually look at https://github.com/actions/stale which is a GitHub Action to do similar stuff.
https://github.com/marketplace/stale should be the application that is actually installed for the project I think.

I took the liberty to apply a fix. :) #1237

@nomad64
Author

nomad64 commented Jul 23, 2022

I am circling back to this issue to report it is no longer an issue (sort of).

Emails that I print to PDF from GMail on my Android now properly import and show text (w/ OCR). I verified this using several recent e-mails.

I still had the "bad" PDF from earlier in this thread, and uploading it still causes the blank image. However, re-printing to PDF (from Android GMail), and importing it is now successful. As a note, when I try to view the "bad" PDF in Xreader, it also shows up as blank. When I open it via Brave Browser, the text is there but is definitely a substitute font. The new "good" version properly renders in both.

Below is the output from pdffonts, which looks fine to me (not that I know what I am looking at)?

$ pdffonts Gmail\ -\ Information\ about\ Your\ Automatic\ Renewal-bad.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  yes      6  0
BAAAAA+RobotoStatic-Regular          CID TrueType      Identity-H       yes yes yes      7  0
$ pdffonts Gmail\ -\ Information\ about\ Your\ Automatic\ Renewal-good.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  yes      8  0
GNOKIL+RobotoStatic-Regular          CID TrueType      Identity-H       yes yes yes      9  0

Either way, this was obviously a bug in the GMail app that is now resolved.

@nomad64 nomad64 closed this as completed Jul 23, 2022
@github-actions
Contributor

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023