qpdf fails unknown token - Visio/Distiller generated pure vector PDF with raster alternates fails #200

Closed
KEIJOT opened this issue Nov 17, 2017 · 15 comments

Comments

@KEIJOT

KEIJOT commented Nov 17, 2017

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 908): unknown token while reading object; treating as string

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 4499): unknown token while reading object; treating as string

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 6665): unknown token while reading object; treating as string

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 5263): unknown token while reading object; treating as string

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 6033): unknown token while reading object; treating as string

OCRMYPDF:qpdf: operation succeeded with warnings; resulting file may have some problems

OCRMYPDF: ERROR - Error occurred while running this command:

OCRMYPDF:(Command '['qpdf', '--min-version=1.6', '/tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf', '--pages', '/tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000002.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000003.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000004.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000005.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000006.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000007.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000008.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000009.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000010.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000011.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000012.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000013.done.pdf', '--', '/tmp/com.github.ocrmypdf.a_l45o7i/merged.pdf']' returned non-zero exit status 3)

@jbarlow83
Collaborator

jbarlow83 commented Nov 17, 2017 via email

@KEIJOT
Author

KEIJOT commented Nov 17, 2017

I ran the command ocrmypdf --pdf-renderer sandwich --skip-text -l eng --output-type pdf --tesseract-oem 1 PDF-FAILS.pdf final.pdf; the file I used is attached.
PDF-FAILS.pdf

@jbarlow83
Collaborator

Thanks for providing that information.

The file you provided is pure vector content. It's already machine readable so there is no reason to OCR it. However, it is a design goal that such PDFs should pass through when --skip-text is issued, mainly for convenience in batch processing. So the issue remains open from my point of view, but you probably don't have to worry about it. I'll look into it.

(Aside: If you really want to force OCR on this file, there is --force-ocr, which rasterizes each page as an image and converts back to PDF. The resulting file is an inferior approximation of the input. This is useful in rare cases, but not yours.)

(Note to self: Triage: Acrobat&qpdf check clean. Contains alternate resource streams in the form of invisible images, likely embedded Visio metadata, and probably some Form XObjects. qpdf seems to split and merge pieces of it okay on its own, making PyPDF2 the likely culprit.)

@jbarlow83 jbarlow83 changed the title from "qpdf fails unknown token" to "qpdf fails unknown token - Visio/Distiller generated pure vector PDF with raster alternates fails" on Nov 17, 2017
@KEIJOT
Author

KEIJOT commented Nov 17, 2017

Yeah, I know, but I have a batch process that feeds all PDFs to ocrmypdf, and I don't really know how to detect which ones I should skip. No, I don't want to force OCR when it isn't needed. If you have any hint or trick for detecting whether a PDF is pure vector content, let me know and I can possibly add that detection on my end, so that such PDFs never reach ocrmypdf. I also think some PDFs could be mixed mode, i.e. contain both text and vector content, though I'm not sure. Thank you.

@jbarlow83
Collaborator

Most of the time pure vector files should go through without trouble, although it's not something that is checked extensively in the test suite. Your command line is correct for what you want.

With --skip-text, no OCR will be done on any page in the file that contains any text. If there is vector art but no text, then the page is rasterized and OCRed, and the invisible OCR layer is grafted on to the original page (so vector art is preserved).
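
The per-page rule just described can be sketched as a tiny planner. This is only an illustration of the behaviour stated above, not ocrmypdf's actual code; the function name plan_pages and the action strings are hypothetical.

```python
def plan_pages(pages_have_text):
    """Map each page's 'has text' flag to the action described for --skip-text."""
    actions = []
    for has_text in pages_have_text:
        if has_text:
            # Page already contains text: leave it alone.
            actions.append("skip OCR")
        else:
            # No text (possibly vector art only): OCR a rasterized copy and
            # graft the invisible text layer onto the original page.
            actions.append("rasterize, OCR, graft text layer")
    return actions
```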

For detecting these files, here is what pdfinfo reports for your file:

Title:          Visio-152048_Figures_8-13-15_Approved.vsd
Author:         dwade
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 10.1.15 (Windows)
CreationDate:   Thu Aug 13 13:35:46 2015 PDT
ModDate:        Mon Jan  4 14:34:24 2016 PST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          13
Encrypted:      no
Page size:      595 x 842 pts (A4)
Page rot:       0
File size:      205931 bytes
Optimized:      no
PDF version:    1.4

where pdfinfo is from poppler. So you could have a script that searches for Creator: PScript5, Producer: Acrobat Distiller to trap any files produced by Visio. (The Title is suggestive, but users can change that.)
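
A batch script could apply that check roughly as follows. This is a minimal sketch, assuming pdfinfo from poppler is on PATH; the helper names (parse_pdfinfo, looks_like_visio_distiller, pdfinfo_for) and the prefix-matching rule are illustrative choices, not part of ocrmypdf.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key:   value' lines into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def looks_like_visio_distiller(info):
    """Heuristic from this thread: PScript5 creator plus Acrobat Distiller producer."""
    creator = info.get("Creator", "")
    producer = info.get("Producer", "")
    return creator.startswith("PScript5") and producer.startswith("Acrobat Distiller")


def pdfinfo_for(path):
    """Run pdfinfo on a file and return the parsed fields (requires poppler)."""
    result = subprocess.run(
        ["pdfinfo", path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_pdfinfo(result.stdout)
```

Calling looks_like_visio_distiller(pdfinfo_for("PDF-FAILS.pdf")) would then flag the file before it is handed to ocrmypdf. PScript5 is the Windows PostScript printer driver, so the same test may also match non-Visio documents that were printed through Distiller.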

@KEIJOT
Author

KEIJOT commented Nov 17, 2017

Excellent, I will do that. Thanks for the info.

@KEIJOT
Author

KEIJOT commented Nov 17, 2017

By the way, related to qpdf issues, see qpdf/qpdf#106. Is there a way to tell ocrmypdf to use qpdf 7.x, or any other version, if you have a newly built qpdf somewhere on your machine? I tested qpdf 7, and with it I got only warnings and it did produce the final output PDF file:

/usr/local/bin/qpdf7 --empty --pages *.pdf -- final.pdf
WARNING: 000001.done.metadata.pdf (file position 6953): unknown token while reading object; treating as string
WARNING: 000001.done.metadata.pdf (file position 8497): unknown token while reading object; treating as string
WARNING: 000001.done.metadata.pdf (file position 6335): unknown token while reading object; treating as string
WARNING: 000001.done.metadata.pdf (file position 7728): unknown token while reading object; treating as string
WARNING: 000001.done.metadata.pdf (file position 5574): unknown token while reading object; treating as string

@jbarlow83
Collaborator

I have qpdf 7.0 and I can reproduce the error with ocrmypdf + qpdf 7.0. I think the problem is inside PyPDF2.

@jbarlow83
Collaborator

See here for instructions about pointing ocrmypdf to a different qpdf binary
https://ocrmypdf.readthedocs.io/en/latest/advanced.html#overriding-other-support-programs

It's better to use qpdf 7 anyway since there are CVEs against earlier versions:
https://www.cvedetails.com/vulnerability-list/vendor_id-16505/product_id-38012/year-2017/Qpdf-Project-Qpdf.html

@KEIJOT
Author

KEIJOT commented Nov 22, 2017

Excellent. Any news on PyPDF2?

@jbarlow83
Collaborator

It's not PyPDF2.
When qpdf merges the file it generates the warnings, including this message

qpdf: operation succeeded with warnings; resulting file may have some problems

and returns with error code 3. ocrmypdf treats nonzero return from qpdf as an error.

You could change ocrmypdf/exec/qpdf.py::merge() to trap a possible CalledProcessError from run(.., check=True) and print but suppress the exception if returncode == 3 and the output file exists. The other option is to refry PDFs that produce the error with Ghostscript: gs -q -o out.pdf -sDEVICE=pdfwrite in.pdf.
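
The first option might look something like the sketch below. It is not the actual body of merge(); the helper name run_tolerating_warnings and the QPDF_EXIT_WARNINGS constant are hypothetical, and the only behaviour assumed about qpdf is the exit status 3 shown in the log above.

```python
import os
import subprocess
import sys

QPDF_EXIT_WARNINGS = 3  # status seen above: "operation succeeded with warnings"


def run_tolerating_warnings(args, output_file):
    """Run a command; accept exit status 3 if the output file was written.

    Returns True on a clean run and False when warnings were tolerated.
    Any other failure re-raises CalledProcessError.
    """
    try:
        subprocess.run(
            args,
            check=True,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except subprocess.CalledProcessError as e:
        if e.returncode == QPDF_EXIT_WARNINGS and os.path.exists(output_file):
            # Print the warnings, but suppress the exception.
            print(e.stderr, file=sys.stderr, end="")
            return False
        raise
    return True
```

Checking that the output file exists keeps a status 3 with no merged PDF from being silently accepted.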

I'm not sure I want to make that change yet. I'd like to see a larger sample of the problems that produce this warning in qpdf, to make sure the resulting files are still valid (a PDF might be valid but not visually identical to the input). Do you happen to have other files that cause this, or is it true of all Visio-produced PDFs? Judging by the file positions that triggered the warnings, it looks like qpdf's parser got lost in the file. Do you mind if I submit this file as a possible issue to qpdf?

@KEIJOT
Author

KEIJOT commented Nov 22, 2017

I had 10 Visio files and this was the only one which failed. Yes you can share it. Thank You

@jbarlow83
Collaborator

Wrote up the underlying issue at qpdf/qpdf#165

There appear to be no side effects, so I will change ocrmypdf to print the warning from qpdf when this type of warning occurs, instead of terminating. Thanks for the report.

@KEIJOT
Author

KEIJOT commented Nov 24, 2017

Excellent, you do a good job supporting your excellent software. Thanks a lot.

jbarlow83 pushed a commit that referenced this issue Nov 27, 2017
Also replace check_output() calls with run() in qpdf.py
jbarlow83 pushed a commit that referenced this issue Nov 27, 2017
@jbarlow83
Collaborator

Fixed in v5.4.4
