PDF cannot be OCRed: page has no images/exceptions #134

Closed
AEgit opened this issue Feb 10, 2017 · 10 comments

Comments

@AEgit

AEgit commented Feb 10, 2017

Version of ocrmypdf: 4.4.2

The following PDF cannot be OCRed:
https://app.box.com/s/4v1cmadxuqc5zjk0akezxvurrr8bxjo6

Without forcing the OCR, all pages are skipped: An output file is generated, but no pages have been OCRed. If OCR is forced, exceptions occur.

Without forcing OCR:

ocrmypdf -l eng Schoch2002.pdf Schoch2002_ocr.pdf
   INFO -    1: page has no images - skipping all processing on this page
   INFO -    2: page has no images - skipping all processing on this page
   INFO -    3: page has no images - skipping all processing on this page
   INFO -    4: page has no images - skipping all processing on this page
   INFO -    5: page has no images - skipping all processing on this page
   INFO -    6: page has no images - skipping all processing on this page
   INFO -    7: page has no images - skipping all processing on this page
   INFO -    8: page has no images - skipping all processing on this page
   INFO -    9: page has no images - skipping all processing on this page
   INFO -   10: page has no images - skipping all processing on this page
   INFO - Output file is a PDF/A-2B (as expected)

When forcing OCR:

ocrmypdf -l eng --force-ocr Schoch2002.pdf Schoch2002_ocr.pdf
WARNING -    1: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    2: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    3: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    4: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    5: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    6: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    7: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    8: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    9: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -   10: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
  ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 590, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'P'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pipeline.py", line 534, in select_visible_page_image
    im.save(output_file, format='JPEG', dpi=dpi)
  File "/usr/local/lib/python3.4/dist-packages/PIL/Image.py", line 1728, in save
    save_handler(self, fp, filename)
  File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 592, in _save
    raise IOError("cannot write mode %s as JPEG" % im.mode)
OSError: cannot write mode P as JPEG
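
The final `OSError` in the traceback above is raised because JPEG has no indexed-colour encoding, so Pillow refuses to write a palette-mode (`P`) image as JPEG. A minimal sketch of one common workaround follows: convert the image to RGB before saving. The helper name `save_as_jpeg` and the set `JPEG_SAFE_MODES` are hypothetical, introduced only for illustration, and this is not necessarily how ocrmypdf itself resolved the issue.

```python
import io

from PIL import Image

# Modes this sketch passes to the JPEG writer unchanged (an assumption made
# for illustration; everything else is converted to RGB first).
JPEG_SAFE_MODES = {"L", "RGB", "CMYK"}


def save_as_jpeg(im, fp, dpi=(400, 400)):
    """Save `im` as JPEG, converting palette or alpha images to RGB first."""
    if im.mode not in JPEG_SAFE_MODES:
        # Mode 'P' stores palette indices, which JPEG cannot represent.
        im = im.convert("RGB")
    im.save(fp, format="JPEG", dpi=dpi)


# Build a small palette-mode image, similar in mode to the failing page image.
palette_page = Image.new("RGB", (64, 64), "white").convert("P")

buf = io.BytesIO()
save_as_jpeg(palette_page, buf)
```

Calling `palette_page.save(buf, format="JPEG")` directly would fail with the same "cannot write mode P as JPEG" error shown in the report.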

@jbarlow83
Collaborator

jbarlow83 commented Feb 10, 2017 via email

@AEgit
Author

AEgit commented Feb 10, 2017

Thanks a lot for the quick help!

@jbarlow83
Collaborator

The link seems to be dead or down now. It was working when I first checked it. Can you attach the file directly to this ticket?

@AEgit
Author

AEgit commented Feb 10, 2017

Ah, sorry about that. I thought you wouldn't need the file anymore, so I took it down. Unfortunately my PDF files can't be uploaded to GitHub (I have no idea why). I've reuploaded it here:
https://app.box.com/s/141ryz3jfsu1wdfslnl8a150uyksxox0

@jbarlow83
Collaborator

jbarlow83 commented Feb 10, 2017 via email

@AEgit
Author

AEgit commented Feb 10, 2017

Ha, well spotted. The PDF comes from an Interlibrary Loan. Unfortunately they apply some copyright protection to the PDF, which means that you cannot OCR it (you can't even print it!). As I think this is an unnecessary hurdle, I've removed the DRM using the approach described here: http://tetrachroma.wordpress.com/
This might explain the strange format of the PDF, although it could also be caused by the way the protection was originally applied.

@jbarlow83
Collaborator

Fixed in 4.5

@AEgit
Author

AEgit commented Feb 14, 2017

Thank you so much! Great work!

@AEgit
Author

AEgit commented Feb 15, 2017

Some of the PDFs that can now be OCRed with version 4.5 appear a bit distorted, and the original "background" is shifted slightly out of place, e.g.:
NON-OCR version: https://app.box.com/s/8th0lsjqr8mqrq3pst2p967rdsglxo5r
OCR-version: https://app.box.com/s/x3s8bmbnj5anr6ulf4l3x7pk7u2fmwmb

Command used to OCR: ocrmypdf -l eng --force-ocr Walkden1993.pdf Walkden1993_ocr.pdf

Is this caused by ocrmypdf, or does it have to do with the way the original file was scanned?

@jbarlow83
Collaborator

Opened as a new issue in #137
