qpdf fails unknown token - Visio/Distiller generated pure vector PDF with raster alternates fails #200

KEIJOT · 2017-11-17T06:53:49Z

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 908): unknown token while reading object; treating as string

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 4499): unknown token while reading object; treating as string

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 6665): unknown token while reading object; treating as string

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 5263): unknown token while reading object; treating as string

OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf (file position 6033): unknown token while reading object; treating as string

OCRMYPDF:qpdf: operation succeeded with warnings; resulting file may have some problems

OCRMYPDF: ERROR - Error occurred while running this command:

OCRMYPDF:(Command '['qpdf', '--min-version=1.6', '/tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf', '--pages', '/tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000002.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000003.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000004.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000005.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000006.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000007.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000008.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000009.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000010.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000011.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000012.done.pdf', '/tmp/com.github.ocrmypdf.a_l45o7i/000013.done.pdf', '--', '/tmp/com.github.ocrmypdf.a_l45o7i/merged.pdf']' returned non-zero exit status 3)

jbarlow83 · 2017-11-17T07:07:42Z

If you can't provide the information requested in the issue template I won't be able to help. I would be guessing, which would waste your time and mine.


KEIJOT · 2017-11-17T07:15:28Z

I ran this command: ocrmypdf --pdf-renderer sandwich --skip-text -l eng --output-type pdf --tesseract-oem 1 PDF-FAILS.pdf final.pdf. The file I used is attached.
PDF-FAILS.pdf

jbarlow83 · 2017-11-17T09:16:22Z

Thanks for providing that information.

The file you provided is pure vector content. It's already machine readable so there is no reason to OCR it. However, it is a design goal that such PDFs should pass through when --skip-text is issued, mainly for convenience in batch processing. So the issue remains open from my point of view, but you probably don't have to worry about it. I'll look into it.

(Aside: If you really want to force OCR on this file, there is --force-ocr, which rasterizes each page as an image and converts back to PDF. The resulting file is an inferior approximation of the input. This is useful in rare cases, but not yours.)

(Note to self: Triage: Acrobat&qpdf check clean. Contains alternate resource streams in the form of invisible images, likely embedded Visio metadata, and probably some Form XObjects. qpdf seems to split and merge pieces of it okay on its own, making PyPDF2 the likely culprit.)

KEIJOT · 2017-11-17T09:25:04Z

Yes, I know, but I have a batch process that feeds all PDFs to ocrmypdf, and I don't really know how to detect when I should not. No, I don't want to force OCR when it isn't needed. If you have any hint or trick for detecting whether a PDF is pure vector content, let me know and I can possibly add that detection on my end, so such PDFs never reach ocrmypdf. Also, I think some PDFs could be mixed mode, i.e. contain both text and vector content, though I'm not sure. Thank you.

jbarlow83 · 2017-11-17T09:50:52Z

Most of the time pure vector files should go through without trouble, although it's not something that is checked extensively in the test suite. Your command line is correct for what you want.

With --skip-text, no OCR will be done on any page in the file that contains any text. If there is vector art but no text, then the page is rasterized and OCRed, and the invisible OCR layer is grafted on to the original page (so vector art is preserved).

For detecting these files, this is what pdfinfo reports for the one you attached:

Title:          Visio-152048_Figures_8-13-15_Approved.vsd
Author:         dwade
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 10.1.15 (Windows)
CreationDate:   Thu Aug 13 13:35:46 2015 PDT
ModDate:        Mon Jan  4 14:34:24 2016 PST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          13
Encrypted:      no
Page size:      595 x 842 pts (A4)
Page rot:       0
File size:      205931 bytes
Optimized:      no
PDF version:    1.4

where pdfinfo is from poppler. So you could have a script that searches for Creator: PScript5, Producer: Acrobat Distiller to trap any files produced by Visio. (The Title is suggestive, but users can change that.)
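
A minimal sketch of such a check, assuming pdfinfo from poppler is on the PATH. The function names (parse_pdfinfo, looks_like_visio_distiller, pdfinfo) are made up for illustration; the strings it matches are the Creator and Producer values shown above.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo output ("Key:   value" lines) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so dates like 13:35:46 stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def looks_like_visio_distiller(info):
    """Heuristic from this thread: PScript5 creator plus Acrobat Distiller producer."""
    creator = info.get("Creator", "")
    producer = info.get("Producer", "")
    return creator.startswith("PScript5") and producer.startswith("Acrobat Distiller")


def pdfinfo(path):
    """Run poppler's pdfinfo on a file and return the parsed fields."""
    result = subprocess.run(
        ["pdfinfo", path],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_pdfinfo(result.stdout)
```

A batch script could call looks_like_visio_distiller(pdfinfo(path)) and skip ocrmypdf when it returns True. The match is broad: PScript5.dll is the Windows PostScript printer driver, so this would likely also catch files from other applications that were printed to PostScript and distilled.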

KEIJOT · 2017-11-17T16:37:52Z

Excellent I will do that one, thanks for the info

KEIJOT · 2017-11-17T17:14:53Z

By the way, related to the qpdf issues, see qpdf/qpdf#106. Is there a way to tell ocrmypdf to use qpdf 7.x, or any other version, if you have a newly built qpdf somewhere on your machine? I tested qpdf 7, and with it I got only warnings and it did produce the final output PDF file:

/usr/local/bin/qpdf7 --empty --pages *.pdf -- final.pdf
WARNING: 000001.done.metadata.pdf (file position 6953): unknown token while reading object; treating as string
WARNING: 000001.done.metadata.pdf (file position 8497): unknown token while reading object; treating as string
WARNING: 000001.done.metadata.pdf (file position 6335): unknown token while reading object; treating as string
WARNING: 000001.done.metadata.pdf (file position 7728): unknown token while reading object; treating as string
WARNING: 000001.done.metadata.pdf (file position 5574): unknown token while reading object; treating as string

jbarlow83 · 2017-11-20T01:26:24Z

I have qpdf 7.0 and I can reproduce the error with ocrmypdf + qpdf 7.0. I think the problem is inside PyPDF2.

jbarlow83 · 2017-11-21T23:52:01Z

See here for instructions about pointing ocrmypdf to a different qpdf binary
https://ocrmypdf.readthedocs.io/en/latest/advanced.html#overriding-other-support-programs

It's better to use qpdf 7 anyway since there are CVEs against earlier versions:
https://www.cvedetails.com/vulnerability-list/vendor_id-16505/product_id-38012/year-2017/Qpdf-Project-Qpdf.html

KEIJOT · 2017-11-22T01:22:48Z

Excellent, any news on PyPDF2 ?

jbarlow83 · 2017-11-22T02:24:42Z

It's not PyPDF2.
When qpdf merges the file it generates the warnings, including this message

qpdf: operation succeeded with warnings; resulting file may have some problems

and returns with error code 3. ocrmypdf treats nonzero return from qpdf as an error.

You could change ocrmypdf/exec/qpdf.py::merge() to trap a possible CalledProcessError from run(.., check=True) and print but suppress the exception if returncode == 3 and the output file exists. The other option is to refry PDFs that produce the error with Ghostscript: gs -q -o out.pdf -sDEVICE=pdfwrite in.pdf.
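
A rough sketch of that first option, not the actual ocrmypdf code. The helper name run_tolerating_warnings and its True/False return convention are assumptions made for illustration; the only facts taken from this thread are that qpdf exits with status 3 after warnings and that the output file should exist before the error is suppressed.

```python
import os
import subprocess
import sys

# Exit status qpdf returned in this thread for "operation succeeded with warnings".
QPDF_EXIT_WARNINGS = 3


def run_tolerating_warnings(args, output_file):
    """Run a qpdf-style command; accept exit status 3 if the output file exists.

    Returns True on a clean run and False when the command finished with
    warnings. Any other failure re-raises CalledProcessError.
    """
    try:
        subprocess.run(
            args,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except subprocess.CalledProcessError as e:
        # Print whatever the tool reported, whether or not we continue.
        print(e.output, file=sys.stderr)
        if e.returncode == QPDF_EXIT_WARNINGS and os.path.exists(output_file):
            return False
        raise
    return True
```

The argument list would be the same qpdf command shown in the original error, with the path after '--' passed as output_file.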

I'm not sure I want to make that change yet. I'd like to see a larger sample of the problems that produce this warning in qpdf, to make sure the files are still valid (the PDF might be valid but not visually identical). Do you happen to have other files that cause this, or is it true of all Visio-produced PDFs? Judging by the file positions that triggered the problems, it looks like qpdf's parser got lost in the file. Do you mind if I submit this file as a possible issue to qpdf?

KEIJOT · 2017-11-22T03:22:38Z

I had 10 Visio files and this was the only one which failed. Yes you can share it. Thank You

jbarlow83 · 2017-11-24T08:26:01Z

Wrote up the underlying issue at qpdf/qpdf#165

There appear to be no side effects, so I will change ocrmypdf to print the warning from qpdf when this type of warning occurs, instead of terminating. Thanks for the report.

KEIJOT · 2017-11-24T08:31:09Z

Excellent, you do a good job supporting your excellent software, thanks a lot.


jbarlow83 · 2017-11-29T23:59:59Z

Fixed in v5.4.4

jbarlow83 changed the title ~~qpdf fails unknown token~~ qpdf fails unknown token - Visio/Distiller generated pure vector PDF with raster alternates fails Nov 17, 2017

jbarlow83 pushed a commit that referenced this issue Nov 27, 2017

Fix issue #200, uncommon but valid decimal syntax treated as error

2040ae4

Also replace check_output() calls with run() in qpdf.py

jbarlow83 pushed a commit that referenced this issue Nov 27, 2017

Test case for issue #200

965de3a

jbarlow83 closed this as completed Nov 29, 2017
