A set of failing PDFs #325

gwern · 2018-12-15T02:56:45Z

I recently used ocrmypdf to mass-OCR my PDFs and a bunch of DjVu files I converted to PDF (which strips the original Tesseract OCR so I needed some way to restore it). Worked very nicely, and I like the better compression over the default ddjvu output.

Some files failed. I noticed the mention of a test corpus, so I thought you might like a list of failing files (these failed multiple times, so should be reliable test cases) and the errors.

The errors:

myocr-gwernnet-errors.txt

The files:


jbarlow83 · 2018-12-15T03:07:01Z

I will take a look.

Do you know what version of ocrmypdf you used? The stack traces appear to be from an older version.

gwern · 2018-12-15T03:13:59Z

Whatever Ubuntu 18.04.1 ships, which appears to be '6.1.2-1ubuntu1.1' or '6.1.2' from --version.

jbarlow83 · 2018-12-15T03:32:17Z

Please try the latest released version. There is an installation procedure in the documentation specifically for Ubuntu 18.04. I suspect that will fix many of these errors.

gwern · 2018-12-15T03:57:48Z

Upgrading to 7.3.1 does fix many of the errors. What's still left:

ocrmypdf-gwernnet-errors2.log

jbarlow83 · 2018-12-15T23:51:51Z

The problem is quite definitely how these files are formatted. In any case, the next release should be more tolerant of PDFs with these types of errors - it will issue warnings instead.

I went by the logs and concluded the errors are the same for the most part.

gwern · 2018-12-16T01:57:05Z

That's good to hear. I hope they'll be good test cases for the next release, then.

ivsanro1 · 2018-12-17T08:15:01Z

I found another error. Unfortunately, I cannot upload the pdf file, because it has personal data, and I do not know how to reproduce the error by creating a handcrafted pdf file. It seems to be a problem of the internal structure of the pdf file. This is the stacktrace of the error:

  File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/_pipeline.py", line 170, in repair_and_parse_pdf
    pdfinfo = PdfInfo(output_file, detailed_page_analysis=detailed_page_analysis, log=log)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 722, in __init__
    infile, detailed_page_analysis, log=log)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 604, in _pdf_get_all_pageinfo
    page = PageInfo(pdf, n, infile, page_xml, detailed_analysis)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 614, in __init__
    self._pageinfo = _pdf_get_pageinfo(pdf, pageno, infile, xmltext)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 571, in _pdf_get_pageinfo
    shorthand=userunit_shorthand)]
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 569, in <listcomp>
    contentsinfo = [ci for ci in
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 485, in _process_content_streams
    yield from _find_regular_images(container, contentsinfo)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 390, in _find_regular_images
    for pdfimage, xobj in _image_xobjects(container):
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 376, in _image_xobjects
    if candidate['/Subtype'] == '/Image':
TypeError: 'NoneType' object is not subscriptable

jbarlow83 · 2018-12-17T08:56:22Z

The wiki has instructions for encrypting a file for me only if you are comfortable with that.
https://github.com/jbarlow83/OCRmyPDF/wiki

ivsanro1 · 2018-12-17T11:08:55Z

I am afraid I cannot do that, sorry. The document itself pertains to a third-party organization, and the personal info is not mine. I can check why it fails with pdb if it helps.

Thanks
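
As a rough illustration of the kind of check offered above: when the PDF itself cannot be shared, the local variables of the failing frame can still be captured and reported. The sketch below uses only the standard library; `failing_step` and `innermost_locals` are hypothetical names, and the dictionary is a stand-in for the real PDF objects, not ocrmypdf's actual data structures.

```python
import traceback


def innermost_locals(exc):
    """Return (function name, line number, locals) for the frame that raised exc."""
    # walk_tb yields (frame, lineno) pairs from outermost to innermost
    frame, lineno = list(traceback.walk_tb(exc.__traceback__))[-1]
    return frame.f_code.co_name, lineno, dict(frame.f_locals)


def failing_step(container):
    # Hypothetical stand-in for the failing line in the trace:
    # the looked-up object is None, so subscripting it raises TypeError.
    candidate = container.get('/Im1')
    return candidate['/Subtype'] == '/Image'


try:
    failing_step({'/Im1': None})
except TypeError as exc:
    func, lineno, local_vars = innermost_locals(exc)
    print(func, lineno, sorted(local_vars))
```

Interactively, `pdb.post_mortem()` on the same exception would give the equivalent view; the non-interactive form is just easier to paste into an issue.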

jbarlow83 · 2019-01-09T00:46:52Z

Probably fixed this, or at least suppressed the immediate cause of the stack trace, in the next release.
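
For illustration only, and not the actual change made in the release: the TypeError in the trace above comes from subscripting a `candidate` that is `None`. A minimal sketch of a guard that warns and skips such entries, assuming a plain dict in place of the page's image resources (the function name `image_xobjects` and the sample data are hypothetical):

```python
import logging

log = logging.getLogger(__name__)


def image_xobjects(xobjects):
    """Yield (name, xobj) for entries whose /Subtype is /Image.

    `xobjects` is a plain dict standing in for a page's XObject
    resources; a value of None models an entry that failed to resolve.
    """
    for name, candidate in xobjects.items():
        if candidate is None:
            # Warn and skip instead of failing with
            # "'NoneType' object is not subscriptable"
            log.warning("skipping unresolvable XObject %s", name)
            continue
        if candidate.get('/Subtype') == '/Image':
            yield name, candidate


# Hypothetical sample data: one image, one broken entry, one non-image
page_xobjects = {
    '/Im0': {'/Subtype': '/Image', '/Width': 100},
    '/Im1': None,
    '/Fm0': {'/Subtype': '/Form'},
}
found = list(image_xobjects(page_xobjects))
```

This matches the behaviour described earlier in the thread (warnings instead of errors for malformed files), but the real code path may differ.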

jbarlow83 closed this as completed in e3a5821 Jan 17, 2019
