qpdf fails unknown token - Visio/Distiller generated pure vector PDF with raster alternates fails #200
Comments
If you can't provide the information requested in the issue template I won't
be able to help. I would be guessing, which would waste your time and mine.
On Nov 16, 2017 22:53, "KEIJOT" ***@***.***> wrote:
OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf
(file position 908): unknown token while reading object; treating as string
OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf
(file position 4499): unknown token while reading object; treating as string
OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf
(file position 6665): unknown token while reading object; treating as string
OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf
(file position 5263): unknown token while reading object; treating as string
OCRMYPDF:WARNING: /tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf
(file position 6033): unknown token while reading object; treating as string
OCRMYPDF:qpdf: operation succeeded with warnings; resulting file may have
some problems
OCRMYPDF: ERROR - Error occurred while running this command:
OCRMYPDF:(Command '['qpdf', '--min-version=1.6',
'/tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf', '--pages',
'/tmp/com.github.ocrmypdf.a_l45o7i/000001.done.metadata.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000002.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000003.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000004.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000005.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000006.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000007.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000008.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000009.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000010.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000011.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000012.done.pdf',
'/tmp/com.github.ocrmypdf.a_l45o7i/000013.done.pdf', '--',
'/tmp/com.github.ocrmypdf.a_l45o7i/merged.pdf']' returned non-zero exit
status 3)
|
I tried to run the command: ocrmypdf --pdf-renderer sandwich --skip-text -l eng --output-type pdf --tesseract-oem 1 PDF-FAILS.pdf final.pdf. Attached is the file I used. |
Thanks for providing that information. The file you provided is pure vector content. It's already machine readable so there is no reason to OCR it. However, it is a design goal that such PDFs should pass through when (Aside: If you really want to force OCR on this file, there is (Note to self: Triage: Acrobat&qpdf check clean. Contains alternate resource streams in the form of invisible images, likely embedded Visio metadata, and probably some Form XObjects. qpdf seems to split and merge pieces of it okay on its own, making PyPDF2 the likely culprit.) |
Yes, I know, but I have a batch process that feeds all PDFs to ocrmypdf, and I don't really know how to detect when I should not. No, I don't want to force OCR if it is not needed. If you have any hint or trick for detecting whether a PDF is pure vector content, let me know and I can possibly add that detection on my end, so that such PDFs never reach the ocrmypdf step. Also, I think some PDFs could be mixed mode, i.e. contain text plus vector content, though I'm not sure. Thank you |
Most of the time pure vector files should go through without trouble, although it's not something that is checked extensively in the test suite. Your command line is correct for what you want. With For detecting these files:
where |
Excellent I will do that one, thanks for the info |
By the way, related to the qpdf issues, see qpdf/qpdf#106. Is there a way to tell ocrmypdf to use qpdf 7.x, or any specific version, if you have a newly built qpdf somewhere on your machine? I tested qpdf 7, and with it I only got warnings and it did produce the final output PDF: /usr/local/bin/qpdf7 --empty --pages *.pdf -- final.pdf |
I have qpdf 7.0 and I can reproduce the error with ocrmypdf + qpdf 7.0. I think the problem is inside PyPDF2. |
See here for instructions about pointing ocrmypdf to a different qpdf binary. It's better to use qpdf 7 anyway since there are CVEs against earlier versions: |
Excellent, any news on PyPDF2 ? |
It's not PyPDF2.
and returns with error code 3. ocrmypdf treats nonzero return from qpdf as an error. You could change I'm not sure I want to make that change yet. I'd like to see a larger sample of the spectrum of problems that produce this warning in qpdf, to make sure that files are still valid (the PDF might be valid, but it might not be visually identical). Do you happen to have other files that cause this, or is it true of all Visio-produced PDFs? By inspecting the file positions that triggered the problems, it looks like qpdf's parser got lost in the file. Do you mind if I submit this file as a possible issue to qpdf? |
I had 10 Visio files and this was the only one which failed. Yes you can share it. Thank You |
Wrote up the underlying issue at qpdf/qpdf#165. There appear to be no side effects, so I will change ocrmypdf to print the warning from qpdf when this type of warning occurs, instead of terminating. Thanks for the report. |
Excellent, you do a good job supporting your excellent software, thanks a lot |
Also replace check_output() calls with run() in qpdf.py
Fixed in v5.4.4 |
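For readers who hit the same failure in their own wrappers, here is a minimal sketch of the behavior discussed in this thread: call the tool with `subprocess.run()` rather than `check_output()`, treat exit status 3 as "succeeded with warnings", and still fail on any other nonzero status. The helper name `run_tolerating_warnings` and its `warning_codes` parameter are hypothetical and only illustrate the idea. This is not the code that shipped in ocrmypdf v5.4.4.

```python
import subprocess
import sys

# qpdf uses exit status 3 for "succeeded with warnings" and 2 for errors.
QPDF_EXIT_WARNING = 3


def run_tolerating_warnings(args, warning_codes=(QPDF_EXIT_WARNING,)):
    """Run a command, treating the listed exit codes as warnings.

    Hypothetical helper for illustration only (an assumption, not
    ocrmypdf's real implementation). Returns the CompletedProcess on
    success or warning; raises CalledProcessError on other failures.
    """
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode in warning_codes:
        # Surface the tool's warnings, but let the pipeline continue.
        for line in proc.stderr.splitlines():
            print("WARNING: " + line, file=sys.stderr)
        return proc
    # Any other nonzero status is still a hard error.
    proc.check_returncode()
    return proc
```

With qpdf installed, a merge like the one quoted earlier in the thread could then be attempted as `run_tolerating_warnings(["qpdf", "--empty", "--pages", "a.pdf", "b.pdf", "--", "final.pdf"])`, where `a.pdf` and `b.pdf` are placeholder file names. Whether an output produced with warnings is acceptable still depends on the file, as noted above.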