PDF cannot be OCRed: page has no images/exceptions #134

AEgit · 2017-02-10T10:48:54Z

Version of ocrmypdf: 4.4.2

The following PDF cannot be OCRed:
https://app.box.com/s/4v1cmadxuqc5zjk0akezxvurrr8bxjo6

Without forcing the OCR, all pages are skipped: An output file is generated, but no pages have been OCRed. If OCR is forced, exceptions occur.

Without forcing OCR:

ocrmypdf -l eng Schoch2002.pdf Schoch2002_ocr.pdf
   INFO -    1: page has no images - skipping all processing on this page
   INFO -    2: page has no images - skipping all processing on this page
   INFO -    3: page has no images - skipping all processing on this page
   INFO -    4: page has no images - skipping all processing on this page
   INFO -    5: page has no images - skipping all processing on this page
   INFO -    6: page has no images - skipping all processing on this page
   INFO -    7: page has no images - skipping all processing on this page
   INFO -    8: page has no images - skipping all processing on this page
   INFO -    9: page has no images - skipping all processing on this page
   INFO -   10: page has no images - skipping all processing on this page
   INFO - Output file is a PDF/A-2B (as expected)

When forcing OCR:

ocrmypdf -l eng --force-ocr Schoch2002.pdf Schoch2002_ocr.pdf
WARNING -    1: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    2: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    3: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    4: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    5: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    6: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    7: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    8: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -    9: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING -   10: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
  ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 590, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'P'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pipeline.py", line 534, in select_visible_page_image
    im.save(output_file, format='JPEG', dpi=dpi)
  File "/usr/local/lib/python3.4/dist-packages/PIL/Image.py", line 1728, in save
    save_handler(self, fp, filename)
  File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 592, in _save
    raise IOError("cannot write mode %s as JPEG" % im.mode)
OSError: cannot write mode P as JPEG

jbarlow83 · 2017-02-10T11:04:10Z

Yes, that's a bug. It appears to be related to an image in the PDF having an unusual palette. If you need a quick workaround, use Ghostscript to rasterize the file as JPEGs, then merge the JPEGs using img2pdf. For Ghostscript, use:

gs -sDEVICE=jpeg -r300 -o out%03d.jpg input.pdf

The ocrmypdf documentation describes using img2pdf. Otherwise I'll get to it in the next few days.
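
The workaround above can be scripted. The following is a minimal sketch, not part of ocrmypdf: the helper names (`ghostscript_command`, `img2pdf_command`, `rasterize_and_merge`) and the file name `merged.pdf` are hypothetical, and it assumes `gs` and `img2pdf` are both on the PATH. The Ghostscript flags are the ones quoted in the comment; the `-o` option of img2pdf names the output PDF.

```python
import glob
import subprocess


def ghostscript_command(input_pdf, pattern="out%03d.jpg", dpi=300):
    """Build the Ghostscript call that renders each page to a JPEG."""
    return ["gs", "-sDEVICE=jpeg", f"-r{dpi}", "-o", pattern, input_pdf]


def img2pdf_command(jpeg_files, merged_pdf):
    """Build the img2pdf call that merges the page JPEGs in page order."""
    # Zero-padded names (out001.jpg, out002.jpg, ...) sort correctly as strings.
    return ["img2pdf", *sorted(jpeg_files), "-o", merged_pdf]


def rasterize_and_merge(input_pdf, merged_pdf):
    """Run both steps; raises CalledProcessError if either tool fails."""
    subprocess.run(ghostscript_command(input_pdf), check=True)
    # Assumption: only this run's three-digit page files match the pattern.
    pages = glob.glob("out[0-9][0-9][0-9].jpg")
    subprocess.run(img2pdf_command(pages, merged_pdf), check=True)
```

Calling `rasterize_and_merge("input.pdf", "merged.pdf")` should leave an image-only PDF that can then be passed to ocrmypdf in the usual way.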

AEgit · 2017-02-10T11:24:44Z

Thanks a lot for the quick help!

jbarlow83 · 2017-02-10T16:53:43Z

The link seems to be dead or down now. Was working when I first checked it. Can you attach the file directly to this ticket?

AEgit · 2017-02-10T17:00:27Z

Ah, sorry about that. I thought you wouldn't need the file anymore, so I took it down. Unfortunately my PDF files can't be uploaded to GitHub (I've no idea why that's the case). I've reuploaded it here:
https://app.box.com/s/141ryz3jfsu1wdfslnl8a150uyksxox0

jbarlow83 · 2017-02-10T18:34:13Z

Thanks. That is a weird-ass PDF. All of the images are contained in the type of object normally used for fillable form fields. Legal, just weird. There's a small amount of vector text at the bottom of each page which is why that message appears. Do you know what software produced it?

AEgit · 2017-02-10T21:00:18Z

Ha, well spotted. The PDF comes from an Interlibrary Loan service. Unfortunately they apply some copyright protection to the PDF, which means that you cannot OCR it (you can't even print it!). As I think this is an unnecessary hurdle, I've removed the DRM using the approach described here: http://tetrachroma.wordpress.com/
This might explain the strange format of the PDF (?), although it could also be caused by the way the protection was originally implemented (?).

jbarlow83 · 2017-02-14T21:36:18Z

Fixed in 4.5.
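
The thread does not say how 4.5 resolves the error, so the following is only a sketch of one common way to avoid `OSError: cannot write mode P as JPEG` with Pillow, not the actual ocrmypdf change. It assumes Pillow is installed; `save_as_jpeg` is a hypothetical helper name. The idea is to convert palette ("P") images, and any other mode the JPEG writer does not accept, to RGB before saving.

```python
import io

from PIL import Image


def save_as_jpeg(im):
    """Return JPEG bytes, converting modes the JPEG writer cannot store."""
    # Greyscale, RGB and CMYK are written as they are; palette ("P") and
    # other modes are converted to RGB first.
    if im.mode not in ("L", "RGB", "CMYK"):
        im = im.convert("RGB")
    buf = io.BytesIO()
    im.save(buf, format="JPEG")
    return buf.getvalue()


# A small palette-mode image, the same mode named in the traceback.
palette_image = Image.new("P", (8, 8))

try:
    palette_image.save(io.BytesIO(), format="JPEG")
    direct_save_failed = False
except OSError:
    # Saving a mode "P" image directly is what the traceback shows failing.
    direct_save_failed = True

jpeg_bytes = save_as_jpeg(palette_image)
```

Converting to RGB discards the palette indirection but keeps the visible colours, which is all a rasterized page image needs.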

AEgit · 2017-02-14T22:50:22Z

Thank you so much! Great work!

AEgit · 2017-02-15T09:08:16Z

Some of the PDFs which can now be OCRed with version 4.5 appear a bit distorted and the original "background" is a bit shifted out of place, e.g.:
NON-OCR version: https://app.box.com/s/8th0lsjqr8mqrq3pst2p967rdsglxo5r
OCR-version: https://app.box.com/s/x3s8bmbnj5anr6ulf4l3x7pk7u2fmwmb

Command used to OCR: ocrmypdf -l eng --force-ocr Walkden1993.pdf Walkden1993_ocr.pdf

Is this related to ocrmypdf or does it have to do with the way the original file has been scanned?

jbarlow83 · 2017-02-15T23:19:45Z

Opened as a new issue in #137

jbarlow83 closed this as completed Feb 14, 2017

jbarlow83 mentioned this issue Feb 15, 2017

Proportions of non-square aspect ratio images distorted #137

Closed
