Is there a quick way to change the code locally so that the images are not reconverted? For example, a 3 MB PDF now grows to 25 MB, and that is really big for all the PDFs we would like to convert.
Issue by alphablue52
Tue Feb 18 20:11:19 2014
Originally opened as fritz-hh/OCRmyPDF#70
Hello,
First, thanks for the great work on this script. It got me working with OCR again after 10 years of frustrated absence :-)
Only one negative thing: most of my PDFs come from a Canon ImageRunner scan and are very good in quality vs. size. OCR gives great results, but the output PDFs are 7-8x bigger than the input. As far as I can see, the embedded images get recompressed to JPEG, while the originals are /CCITTFaxDecode.
If this is because of PDF/A compatibility, I suggest adding an option for non-PDF/A output.
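One rough way to confirm the filter change is to count the `/Filter` names in the raw bytes of both files. The sketch below is only an illustration: the function name `count_stream_filters` is made up for this example, it is not part of OCRmyPDF, and it counts filters on all streams (not only images). It also cannot see objects stored inside compressed object streams, so treat the numbers as a hint rather than a precise inventory.

```python
import re
import sys
from collections import Counter

# Matches "/Filter /Name" and "/Filter [ /Name ... ]" (only the first name
# of an array is captured). This is a plain byte scan, not a PDF parser.
_FILTER_RE = re.compile(rb"/Filter\s*\[?\s*/([A-Za-z0-9]+)")


def count_stream_filters(pdf_bytes):
    """Count stream filter names that appear in raw PDF bytes."""
    names = (m.group(1).decode("ascii") for m in _FILTER_RE.finditer(pdf_bytes))
    return Counter(names)


if __name__ == "__main__":
    # Hypothetical usage: python count_filters.py input.pdf output.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, dict(count_stream_filters(f.read())))
```

If the observation above is right, input.pdf should show mostly CCITTFaxDecode entries and output.pdf mostly DCTDecode (the PDF filter name for JPEG).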
You can download input.pdf and output.pdf here:
https://www.dropbox.com/l/KYlpYRiSs6IjWVOmF1fX39
Here is the output of the script with the -g option.
~/bin/OCRmyPDF-2.0-stable$ sh OCRmyPDF.sh -g -l deu input.pdf output.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -l deu input.pdf output.pdf
Checking if all dependencies are installed
ImageMagick version:
Version: ImageMagick 6.7.7-10 2013-09-10 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
GNU Parallel version:
GNU parallel 20130622
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
Poppler-utils version:
pdfimages version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
unpaper version:
OCRmyPDF.sh: 190: OCRmyPDF.sh: unpaper: not found
tesseract version:
tesseract 3.02.01
leptonica-1.69
libgif 4.1.6 : libjpeg 8d : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.8
python2 version:
Python 2.7.5+
Ghostscript version:
9.10
Java version:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.13.10.1)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
Created temporary folder: "/tmp/tmp.X82OQourlI"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0014
Page 0001: Size 842x595 (h_w in pt)
Page 0001: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0001: Continuing anyway, assuming a default resolution of 300 dpi
Page 0001: Extracting image as ppm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Processing page 0002 / 0014
Page 0002: Size 842x595 (h_w in pt)
Page 0002: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0002: Continuing anyway, assuming a default resolution of 300 dpi
Page 0002: Extracting image as ppm file (300 dpi)
Page 0002: Performing OCR
Page 0002: Embedding text in PDF
Page 0002: Embedding text in PDF (debug page)
Processing page 0003 / 0014
Page 0003: Size 842x595 (h_w in pt)
Page 0003: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0003: Continuing anyway, assuming a default resolution of 300 dpi
Page 0003: Extracting image as ppm file (300 dpi)
Page 0003: Performing OCR
Page 0003: Embedding text in PDF
Page 0003: Embedding text in PDF (debug page)
Processing page 0004 / 0014
Page 0004: Size 842x595 (h_w in pt)
Page 0004: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0004: Continuing anyway, assuming a default resolution of 300 dpi
Page 0004: Extracting image as ppm file (300 dpi)
Page 0004: Performing OCR
Page 0004: Embedding text in PDF
Page 0004: Embedding text in PDF (debug page)
Processing page 0005 / 0014
Page 0005: Size 842x595 (h_w in pt)
Page 0005: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0005: Continuing anyway, assuming a default resolution of 300 dpi
Page 0005: Extracting image as ppm file (300 dpi)
Page 0005: Performing OCR
Page 0005: Embedding text in PDF
Page 0005: Embedding text in PDF (debug page)
Processing page 0006 / 0014
Page 0006: Size 842x595 (h_w in pt)
Page 0006: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0006: Continuing anyway, assuming a default resolution of 300 dpi
Page 0006: Extracting image as ppm file (300 dpi)
Page 0006: Performing OCR
Page 0006: Embedding text in PDF
Page 0006: Embedding text in PDF (debug page)
Processing page 0007 / 0014
Page 0007: Size 842x595 (h_w in pt)
Page 0007: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0007: Continuing anyway, assuming a default resolution of 300 dpi
Page 0007: Extracting image as ppm file (300 dpi)
Page 0007: Performing OCR
Page 0007: Embedding text in PDF
Page 0007: Embedding text in PDF (debug page)
Processing page 0008 / 0014
Page 0008: Size 842x595 (h_w in pt)
Page 0008: Expecting exactly 1 image covering the whole page (found 8). Cannot compute dpi value.
Page 0008: Continuing anyway, assuming a default resolution of 300 dpi
Page 0008: Extracting image as ppm file (300 dpi)
Page 0008: Performing OCR
Page 0008: Embedding text in PDF
Page 0008: Embedding text in PDF (debug page)
Processing page 0009 / 0014
Page 0009: Size 842x595 (h_w in pt)
Page 0009: Expecting exactly 1 image covering the whole page (found 5). Cannot compute dpi value.
Page 0009: Continuing anyway, assuming a default resolution of 300 dpi
Page 0009: Extracting image as ppm file (300 dpi)
Page 0009: Performing OCR
Page 0009: Embedding text in PDF
Page 0009: Embedding text in PDF (debug page)
Processing page 0010 / 0014
Page 0010: Size 842x595 (h_w in pt)
Page 0010: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0010: Continuing anyway, assuming a default resolution of 300 dpi
Page 0010: Extracting image as ppm file (300 dpi)
Page 0010: Performing OCR
Page 0010: Embedding text in PDF
Page 0010: Embedding text in PDF (debug page)
Processing page 0011 / 0014
Page 0011: Size 842x595 (h_w in pt)
Page 0011: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0011: Continuing anyway, assuming a default resolution of 300 dpi
Page 0011: Extracting image as ppm file (300 dpi)
Page 0011: Performing OCR
Page 0011: Embedding text in PDF
Page 0011: Embedding text in PDF (debug page)
Processing page 0012 / 0014
Page 0012: Size 842x595 (h_w in pt)
Page 0012: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0012: Continuing anyway, assuming a default resolution of 300 dpi
Page 0012: Extracting image as ppm file (300 dpi)
Page 0012: Performing OCR
Page 0012: Embedding text in PDF
Page 0012: Embedding text in PDF (debug page)
Processing page 0013 / 0014
Page 0013: Size 842x595 (h_w in pt)
Page 0013: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0013: Continuing anyway, assuming a default resolution of 300 dpi
Page 0013: Extracting image as ppm file (300 dpi)
Page 0013: Performing OCR
Page 0013: Embedding text in PDF
Page 0013: Embedding text in PDF (debug page)
Processing page 0014 / 0014
Page 0014: Size 842x595 (h_w in pt)
Page 0014: Size 1240x1753 (in pixel)
Page 0014: Low image resolution detected (150 dpi). If needed, please use the "-o" to try to get better OCR results.
Page 0014: Extracting image as pgm file (150 dpi)
Page 0014: Performing OCR
Page 0014: Embedding text in PDF
Page 0014: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.X82OQourlI/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 20 seconds
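For reference, the 150 dpi reported for page 0014 can be reproduced from the two sizes in the log, since 1 pt is 1/72 inch. This is a small sketch of that arithmetic only; the helper name `dpi_from_page` is hypothetical, and it assumes a single image covering the whole page, which is the case the script says it expects.

```python
def dpi_from_page(size_pt, size_px):
    """Return (horizontal, vertical) dpi for an image covering the whole page.

    size_pt: (width, height) of the page in points (1 pt = 1/72 inch)
    size_px: (width, height) of the image in pixels
    """
    (w_pt, h_pt), (w_px, h_px) = size_pt, size_px
    return (w_px / (w_pt / 72.0), h_px / (h_pt / 72.0))


# Page 0014 in the log: "842x595 (h_w in pt)" is height x width,
# and the image is 1240x1753 pixels.
x_dpi, y_dpi = dpi_from_page((595, 842), (1240, 1753))
print(x_dpi, y_dpi)
```

On pages 0001 to 0013 the script found more than one image per page, so it could not do this calculation and fell back to 300 dpi.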