A set of failing PDFs #325

Closed
gwern opened this issue Dec 15, 2018 · 10 comments

@gwern

gwern commented Dec 15, 2018

I recently used ocrmypdf to mass-OCR my PDFs and a bunch of DjVu files I converted to PDF (which strips the original Tesseract OCR so I needed some way to restore it). Worked very nicely, and I like the better compression over the default ddjvu output.

Some files failed. I noticed the mention of a test corpus, so I thought you might like a list of failing files (these failed multiple times, so should be reliable test cases) and the errors.

The errors:

myocr-gwernnet-errors.txt

The files:

@jbarlow83
Collaborator

I will take a look.

Do you know which version of ocrmypdf you used? The stack traces appear to be from an older version.

@gwern
Author

gwern commented Dec 15, 2018

Whatever Ubuntu 18.04.1 ships, which appears to be '6.1.2-1ubuntu1.1' according to the package version, or '6.1.2' from --version.

@jbarlow83
Collaborator

Please try the latest released version. There is an installation procedure in the documentation specifically for Ubuntu 18.04. I suspect that will fix many of these errors.

@jbarlow83
Collaborator

The problem is quite definitely how these files are formatted. In any case, the next release should be more tolerant of PDFs with these types of errors: it will issue warnings instead of failing.

Going by the logs, I concluded that the errors are the same for the most part.

@gwern
Author

gwern commented Dec 16, 2018

That's good to hear. I hope they'll be good test cases for the next release, then.

@ivsanro1

I found another error. Unfortunately, I cannot upload the PDF file because it contains personal data, and I do not know how to reproduce the error with a handcrafted PDF file. It seems to be a problem with the internal structure of the PDF file. This is the stack trace of the error:

  File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/_pipeline.py", line 170, in repair_and_parse_pdf
    pdfinfo = PdfInfo(output_file, detailed_page_analysis=detailed_page_analysis, log=log)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 722, in __init__
    infile, detailed_page_analysis, log=log)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 604, in _pdf_get_all_pageinfo
    page = PageInfo(pdf, n, infile, page_xml, detailed_analysis)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 614, in __init__
    self._pageinfo = _pdf_get_pageinfo(pdf, pageno, infile, xmltext)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 571, in _pdf_get_pageinfo
    shorthand=userunit_shorthand)]
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 569, in <listcomp>
    contentsinfo = [ci for ci in
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 485, in _process_content_streams
    yield from _find_regular_images(container, contentsinfo)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 390, in _find_regular_images
    for pdfimage, xobj in _image_xobjects(container):
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 376, in _image_xobjects
    if candidate['/Subtype'] == '/Image':
TypeError: 'NoneType' object is not subscriptable

@jbarlow83
Collaborator

The wiki has instructions for encrypting a file for me only if you are comfortable with that.
https://github.com/jbarlow83/OCRmyPDF/wiki

@ivsanro1

> The wiki has instructions for encrypting a file for me only if you are comfortable with that.
> https://github.com/jbarlow83/OCRmyPDF/wiki

I am afraid I cannot do that, sorry. The document itself pertains to a third-party organization, and the personal info is not mine. I can check why it fails with pdb if it helps.

Thanks

@jbarlow83
Collaborator

I have probably fixed this, or at least suppressed the immediate cause of the stack trace, in the next release.
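
For readers hitting the same `TypeError`, here is a minimal sketch of the kind of guard that avoids it. This is not the actual ocrmypdf change, which is not shown in this thread. It assumes the failure happens because an XObject entry in a malformed PDF resolves to `None`, and it uses plain Python dicts as stand-ins for PDF dictionaries. The function name `image_xobjects` and the sample data are hypothetical.

```python
def image_xobjects(xobjects):
    """Yield (name, obj) pairs for entries that look like image XObjects.

    `xobjects` is a plain dict standing in for a page's /XObject
    dictionary (an assumption for illustration, not ocrmypdf's real type).
    """
    for name, candidate in xobjects.items():
        # In a malformed PDF an entry may resolve to nothing at all.
        # Indexing None is what raises "'NoneType' object is not
        # subscriptable", so skip such entries instead of failing.
        if candidate is None:
            continue
        # .get() also tolerates dictionaries that lack /Subtype.
        if candidate.get('/Subtype') == '/Image':
            yield name, candidate


# Hypothetical sample: one image, one form, one broken entry.
xobjects = {
    '/Im0': {'/Subtype': '/Image', '/Width': 100},
    '/Fm0': {'/Subtype': '/Form'},
    '/Bad': None,
}
found = list(image_xobjects(xobjects))
```

With this guard the broken entry is ignored and only `/Im0` is reported. A real fix would probably also log a warning for the skipped entry, in line with the more tolerant behaviour described earlier in the thread.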
