Excessive file size growth with --force-ocr #237

jbarlow83 · 2018-03-23T03:49:35Z

Running --force-ocr on a born-digital PDF that has a small, losslessly compressed color logo on each page causes the whole page to be rasterized and saved as lossless color at high DPI. In the case I looked at, file size increased 26x.

Color segmentation of the whole page image would cover this case universally and would also optimize other cases. Another option may be to inspect vector and raster content separately for color requirements, and rasterize them separately.

This could also mean a born digital page with a monochrome image and color text might get forced to monochrome, if we're not testing for color in vector content.
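
To make the idea of testing a page for its color requirements concrete, here is a minimal sketch. It is not how ocrmypdf does it: `required_color_mode` and its `tolerance` parameter are hypothetical names, and a plain per-pixel test stands in for real color segmentation. It assumes the rasterized page is available as RGB tuples.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]


def required_color_mode(pixels: Iterable[Pixel], tolerance: int = 8) -> str:
    """Return the cheapest mode ("mono", "gray" or "color") that fits the pixels.

    Hypothetical helper for illustration only; `tolerance` is an assumed
    threshold that absorbs small channel differences and near-black/near-white.
    """
    mode = "mono"
    for r, g, b in pixels:
        # Channels that differ noticeably mean the pixel carries real color.
        if max(r, g, b) - min(r, g, b) > tolerance:
            return "color"
        # A neutral pixel that is neither near black nor near white needs gray.
        level = (r + g + b) // 3
        if tolerance < level < 255 - tolerance:
            mode = "gray"
    return mode
```

A page with one small color logo would still come back as "color" here, which is why a whole-page test alone does not solve the problem; segmenting the page into regions and classifying each region separately is what would let the logo stay in color while the rest is stored as monochrome.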

--skip-ocr also causes considerable inflation due to the presence of duplicate objects.


jbarlow83 · 2018-03-25T07:23:34Z

Partially fixed for v6 (duplicate object removal). Color coercion remains an issue.

MarjaE2 · 2018-04-10T18:45:46Z

File size is also an issue with scanned PDFs. These tend to expand to 1 to 2 MB per page when running ocrmypdf --force-ocr --output-type pdfa-1, so full-length books take a lot of disk space.

(I tried running k2pdfopt -mode copy -dev dx afterwards, but that scrambled the OCR'd text. I also tried running a Ghostscript conversion tool, but it blurred the image.)

I wonder if a built-in option to rasterize the image to certain dimensions, in pixels, after OCR would be practical.

jbarlow83 · 2018-04-10T21:40:52Z

@MarjaE2 Possibly --pdfa-image-compression jpeg if the input images are color or grayscale.

Please attach a sample page if you can.
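
For reference, the suggestion above combined with the options MarjaE2 was already using might look like the following sketch. `build_ocrmypdf_command` is a hypothetical helper and the file names are placeholders; only the flags themselves come from this thread.

```python
from typing import List


def build_ocrmypdf_command(input_pdf: str, output_pdf: str, jpeg: bool = True) -> List[str]:
    """Assemble an ocrmypdf command line (hypothetical helper, for illustration)."""
    cmd = ["ocrmypdf", "--force-ocr", "--output-type", "pdfa-1"]
    if jpeg:
        # Only likely to help when the page images are color or grayscale.
        cmd += ["--pdfa-image-compression", "jpeg"]
    cmd += [input_pdf, output_pdf]
    return cmd


# To actually run it (requires ocrmypdf on PATH):
# import subprocess
# subprocess.run(build_ocrmypdf_command("in.pdf", "out.pdf"), check=True)
```

JPEG is lossy, so this trades some image fidelity for size.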

MarjaE2 · 2018-04-10T23:49:43Z

Here, or an excerpt using cpdf:

https://archive.org/details/voliaukrainy3219unse

Volya32Excerpt.pdf

It looks like an improvement for the excerpt, but I haven't tested it for the longer file.

nemobis · 2018-09-20T07:14:09Z

> small color lossless compressed logo on each page, the presence of which causes the whole page to be rasterized and saved as color lossless at high DPI

Same problem here, I think:

INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 4.88× larger than the input file.
Possible reasons for this include:
The argument --force-ocr was issued, causing transcoding.

$ pdfimages -list in.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   217  icc     3   8  jpeg   no       114  0    86    87 10.1K 9.2%
   1     1 image     475   178  icc     3   8  jpeg   no       115  0   245   245 19.2K 7.8%
   1     2 image     110    14  index   1   8  image  no       109  0    72    72 1058B  69%
   2     3 image    1794  2828  gray    1   1  ccitt  no        11  0   301   301 79.4K  13%
   2     4 image     110    14  index   1   8  image  no         4  0    72    72 1058B  69%
   3     5 image    1794  2828  gray    1   1  ccitt  no        22  0   301   301 77.4K  12%
   3     6 image     110    14  index   1   8  image  no        15  0    72    72 1058B  69%
   4     7 image    1794  2828  gray    1   1  ccitt  no        33  0   301   301 42.1K 6.8%
   4     8 image     110    14  index   1   8  image  no        26  0    72    72 1058B  69%
   5     9 image    1824  2848  gray    1   1  ccitt  no        44  0   301   301 69.1K  11%
   5    10 image     110    14  index   1   8  image  no        37  0    72    72 1058B  69%
   6    11 image    1794  2828  gray    1   1  ccitt  no        55  0   301   301 79.0K  13%
   6    12 image     110    14  index   1   8  image  no        48  0    72    72 1058B  69%
   7    13 image    1831  2852  gray    1   1  ccitt  no        66  0   301   301 91.7K  14%
   7    14 image     110    14  index   1   8  image  no        59  0    72    72 1058B  69%
   8    15 image    1794  2828  gray    1   1  ccitt  no        77  0   301   301 65.8K  11%
   8    16 image     110    14  index   1   8  image  no        70  0    72    72 1058B  69%
   9    17 image    1824  2847  gray    1   1  ccitt  no        88  0   301   301 24.6K 3.9%
   9    18 image     110    14  index   1   8  image  no        81  0    72    72 1058B  69%
  10    19 image     816  1056  gray    1   1  ccitt  no        99  0   101   101 3342B 3.1%
  10    20 image     110    14  index   1   8  image  no        92  0    72    72 1058B  69%

$ pdfimages -list out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2082  2693  rgb     3   8  image  no        59  0   245   245  220K 1.3%
   2     1 image    2553  3304  rgb     3   8  image  no        60  0   301   301  266K 1.1%
   3     2 image    2553  3304  rgb     3   8  image  no        61  0   301   301  272K 1.1%
   4     3 image    2553  3304  rgb     3   8  image  no        62  0   301   301  176K 0.7%
   5     4 image    2555  3306  rgb     3   8  image  no        63  0   301   301  248K 1.0%
   6     5 image    2553  3304  rgb     3   8  image  no        64  0   301   301  266K 1.1%
   7     6 image    2553  3303  rgb     3   8  image  no        65  0   301   301  312K 1.3%
   8     7 image    2553  3304  rgb     3   8  image  no        66  0   301   301  232K 0.9%
   9     8 image    2555  3306  rgb     3   8  image  no        67  0   301   301  119K 0.5%
  10     9 image     851  1101  rgb     3   8  image  no        68  0   101   101 15.0K 0.5%

$ ocrmypdf --version
7.0.5
$ jbig2 --version
jbig2enc 0.28
$ unpaper --version
0.3
$ qpdf --version
qpdf version 7.1.1
$ pngquant --version
2.11.10 (January 2018)
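
To compare listings like the two above without adding up the size column by hand, a small sketch such as this could total it. The function names are hypothetical, and it assumes the `K`/`M`/`G` suffixes printed by `pdfimages -list` are 1024-based and that the size is always the second-to-last column.

```python
_UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> float:
    """Convert a size such as '1058B' or '10.1K' to bytes (assumes 1024-based units)."""
    return float(text[:-1]) * _UNITS[text[-1]]


def total_image_bytes(listing: str) -> float:
    """Sum the size column over the data rows of a `pdfimages -list` listing."""
    total = 0.0
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number and have 16 columns;
        # the header, the dashed rule and shell prompts are skipped.
        if len(fields) >= 16 and fields[0].isdigit():
            total += parse_size(fields[-2])
    return total
```

Applied to the listings above, this gives roughly 570K of image data in in.pdf against roughly 2.1M in out.pdf: the bilevel CCITT scans were replaced by full-page 8-bit RGB images.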

jbarlow83 changed the title ~~Excessive file size growth with --force-ocr~~ Excessive file size growth with --output-pdf pdf Mar 25, 2018

jbarlow83 changed the title ~~Excessive file size growth with --output-pdf pdf~~ Excessive file size growth with --output-type pdf Mar 25, 2018

jbarlow83 added this to the v6.0.0 milestone Mar 25, 2018

jbarlow83 changed the title ~~Excessive file size growth with --output-type pdf~~ Excessive file size growth with --force-ocr Mar 25, 2018

jbarlow83 removed this from the v6.0.0 milestone Mar 26, 2018

jbarlow83 closed this as completed Oct 10, 2023
