PDF cannot be OCRed: page has no images/exceptions #134
Yes, that's a bug. It appears to be related to an image in the PDF having
an unusual palette.
If you need a quick workaround, use Ghostscript to rasterize the file as
JPEGs, and merge the JPEGs using img2pdf. For Ghostscript use
gs -sDEVICE=jpeg -r300 -o out%03d.jpg input.pdf
The ocrmypdf documentation describes using img2pdf. Otherwise I'll get to it in the next few days.
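For anyone who wants to script that workaround, here is a rough Python sketch. It only wraps the Ghostscript command quoted above and then hands the resulting JPEGs to img2pdf. The helper names `build_gs_command` and `rasterize_and_merge` are made up for illustration, and the sketch assumes `gs` is on the PATH, img2pdf is installed, and the working directory holds no stale `out*.jpg` files.

```python
import glob
import subprocess


def build_gs_command(input_pdf, out_pattern="out%03d.jpg", dpi=300):
    """Return the Ghostscript argument list used in the workaround above."""
    return ["gs", "-sDEVICE=jpeg", f"-r{dpi}", "-o", out_pattern, input_pdf]


def rasterize_and_merge(input_pdf, output_pdf, dpi=300):
    """Hypothetical helper: rasterize each page to JPEG, then merge into one PDF."""
    import img2pdf  # third-party; imported lazily so the module loads without it

    # Writes out001.jpg, out002.jpg, ... into the current directory.
    subprocess.run(build_gs_command(input_pdf, dpi=dpi), check=True)

    # Zero-padded names sort into page order (assumes fewer than 1000 pages).
    pages = sorted(glob.glob("out*.jpg"))
    with open(output_pdf, "wb") as f:
        f.write(img2pdf.convert(pages))
```

The merged file can then be passed to ocrmypdf as usual, for example `rasterize_and_merge("input.pdf", "rasterized.pdf")` followed by `ocrmypdf -l eng rasterized.pdf output.pdf`.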
On Fri, Feb 10, 2017 at 02:48 AEgit ***@***.***> wrote:
The following PDF cannot be OCRed:
https://app.box.com/s/4v1cmadxuqc5zjk0akezxvurrr8bxjo6
Without forcing the OCR, all pages are skipped: An output file is
generated, but no pages have been OCRed. If OCR is forced, exceptions occur.
Without forcing OCR:
ocrmypdf -l eng Schoch2002.pdf Schoch2002_ocr.pdf
INFO - 1: page has no images - skipping all processing on this page
INFO - 2: page has no images - skipping all processing on this page
INFO - 3: page has no images - skipping all processing on this page
INFO - 4: page has no images - skipping all processing on this page
INFO - 5: page has no images - skipping all processing on this page
INFO - 6: page has no images - skipping all processing on this page
INFO - 7: page has no images - skipping all processing on this page
INFO - 8: page has no images - skipping all processing on this page
INFO - 9: page has no images - skipping all processing on this page
INFO - 10: page has no images - skipping all processing on this page
INFO - Output file is a PDF/A-2B (as expected)
When forcing OCR:
ocrmypdf -l eng --force-ocr Schoch2002.pdf Schoch2002_ocr.pdf
WARNING - 1: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 2: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 3: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 4: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 5: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 6: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 7: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 8: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 9: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
WARNING - 10: page has no images - all vector content will be rasterized at 400 DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.
ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 590, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'P'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pipeline.py", line 534, in select_visible_page_image
    im.save(output_file, format='JPEG', dpi=dpi)
  File "/usr/local/lib/python3.4/dist-packages/PIL/Image.py", line 1728, in save
    save_handler(self, fp, filename)
  File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 592, in _save
    raise IOError("cannot write mode %s as JPEG" % im.mode)
OSError: cannot write mode P as JPEG
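The traceback boils down to Pillow refusing to write a palette-mode ('P') image as JPEG. A minimal sketch of the usual way around that is to convert to RGB before saving. This is not necessarily how ocrmypdf fixed it; `save_as_jpeg` and `JPEG_MODES` are hypothetical names, and the mode set is assumed from the RAWMODE table in Pillow's JpegImagePlugin.py.

```python
# Modes assumed to be writable as JPEG by Pillow (keys of RAWMODE in
# JpegImagePlugin.py); 'P' is absent, hence the KeyError above.
JPEG_MODES = {"1", "L", "RGB", "RGBX", "CMYK", "YCbCr"}


def save_as_jpeg(im, output_file, dpi):
    """Save a PIL-style image as JPEG, converting unsupported modes first.

    `im` is any object with .mode, .convert() and .save(), such as a
    PIL.Image.Image. Returns the mode that was actually written.
    """
    if im.mode not in JPEG_MODES:
        # Palette ('P') and other unsupported modes fall back to RGB;
        # any transparency is dropped by this conversion.
        im = im.convert("RGB")
    im.save(output_file, format="JPEG", dpi=dpi)
    return im.mode
```

With Pillow installed, something like `save_as_jpeg(Image.open("page.png"), "page.jpg", (400, 400))` would go through the conversion branch whenever the source image is palette-based.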
Thanks a lot for the quick help!
The link seems to be dead or down now. Was working when I first checked it. Can you attach the file directly to this ticket?
Ah sorry about that, I thought you wouldn't need the file anymore and I took it down. Unfortunately my PDF files can't be uploaded to Github (I've no idea why that's the case). I've reuploaded it here: https://app.box.com/s/141ryz3jfsu1wdfslnl8a150uyksxox0
Thanks. That is a weird-ass PDF. All of the images are contained in the
type of object normally used for fillable form fields. Legal, just weird.
There's a small amount of vector text at the bottom of each page which is
why that message appears. Do you know what software produced it?
On Fri, Feb 10, 2017 at 09:00 AEgit ***@***.***> wrote:
Ah sorry about that, I thought you wouldn't need the file anymore and I
took it down. Unfortunately my PDF files can't be uploaded to Github (I've
no idea why that's the case). I've reuploaded it here:
https://app.box.com/s/141ryz3jfsu1wdfslnl8a150uyksxox0
Ha, well spotted. The PDF comes from an interlibrary loan service, which delivers the document as a PDF. Unfortunately they apply some copyright protection to the PDF as well, which means that you cannot OCR it (you can't even print it!). As I think this is an unnecessary hurdle, I've removed the DRM using the approach described here: http://tetrachroma.wordpress.com/
Fixed in 4.5.
Thank you so much! Great work!
Some of the PDFs which can now be OCRed with version 4.5 appear a bit distorted and the original "background" is a bit shifted out of place, e.g.: Command used to OCR: Is this related to ocrmypdf or does it have to do with the way the original file has been scanned?
Opened as a new issue in #137
Version of ocrmypdf: 4.4.2