Excessive file size growth with --force-ocr #237

Closed
jbarlow83 opened this issue Mar 23, 2018 · 5 comments
Comments

@jbarlow83
Collaborator

jbarlow83 commented Mar 23, 2018

Running --force-ocr on a born-digital PDF with a small color lossless compressed logo on each page, the presence of which causes the whole page to be rasterized and saved as color lossless at high DPI, increased the file size 26x.

Color segmentation of the whole page image would cover this case universally and would also optimize other cases. Another option may be to inspect vector and raster content separately for color requirements, and rasterize them separately.

This could also mean that a born-digital page with a monochrome image and color text might get forced to monochrome, if we're not testing for color in vector content.
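As a rough illustration of the color segmentation idea, the sketch below classifies fixed-size tiles of an already rasterized page as bilevel, gray, or color. It is not how OCRmyPDF works; the names `classify_pixels` and `classify_tiles`, the tile size, and the tolerance are all assumptions made for this example. Because it looks at the rendered page rather than at the embedded images, colored vector text would be detected too.

```python
from typing import Dict, Iterable, List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels from a rasterized page


def classify_pixels(pixels: Iterable[Pixel], tolerance: int = 8) -> str:
    """Return 'bilevel', 'gray' or 'color' for a group of RGB pixels."""
    mode = "bilevel"
    for r, g, b in pixels:
        if max(r, g, b) - min(r, g, b) > tolerance:
            return "color"     # channels disagree, so this is real color
        level = (r + g + b) // 3
        if tolerance < level < 255 - tolerance:
            mode = "gray"      # mid-tone pixel, 1 bit per pixel is not enough
    return mode


def classify_tiles(image: Image, tile: int = 64,
                   tolerance: int = 8) -> Dict[Tuple[int, int], str]:
    """Map (tile_row, tile_col) to the color mode that tile needs."""
    result: Dict[Tuple[int, int], str] = {}
    height = len(image)
    width = len(image[0]) if height else 0
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            pixels = (
                image[y][x]
                for y in range(top, min(top + tile, height))
                for x in range(left, min(left + tile, width))
            )
            result[(top // tile, left // tile)] = classify_pixels(pixels, tolerance)
    return result


# Hypothetical 4x4 white page with a 2x2 red "logo" in the top-left corner.
page = [[(255, 255, 255)] * 4 for _ in range(4)]
for y in range(2):
    for x in range(2):
        page[y][x] = (255, 0, 0)

modes = classify_tiles(page, tile=2)
print(modes)
```

With a result like this, only the tile containing the logo would need a color encoding, and the rest of the page could be stored as a 1-bit image.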

--skip-ocr also causes considerable inflation, due to the presence of duplicate objects.

jbarlow83 changed the title from "Excessive file size growth with --force-ocr" to "Excessive file size growth with --output-pdf pdf" Mar 25, 2018
@jbarlow83
Collaborator Author

jbarlow83 commented Mar 25, 2018

Partially fixed for v6 (duplicate object removal). Color coercion remains an issue.

jbarlow83 changed the title from "Excessive file size growth with --output-pdf pdf" to "Excessive file size growth with --output-type pdf" Mar 25, 2018
jbarlow83 added this to the v6.0.0 milestone Mar 25, 2018
jbarlow83 changed the title from "Excessive file size growth with --output-type pdf" to "Excessive file size growth with --force-ocr" Mar 25, 2018
jbarlow83 removed this from the v6.0.0 milestone Mar 26, 2018
@MarjaE2

MarjaE2 commented Apr 10, 2018

File size is also an issue with scanned PDFs. These tend to expand to 1 to 2 MB per page when running ocrmypdf --force-ocr --output-type pdfa-1, so full-length books take a lot of disk space.

(I tried running k2pdfopt -mode copy -dev dx afterwards, but that scrambled the OCR text. I also tried running a Ghostscript conversion tool, but it blurred the image.)

I wonder whether a built-in option to rasterize the image to given dimensions, in pixels, after OCR would be practical.

@jbarlow83
Collaborator Author

@MarjaE2 Possibly --pdfa-image-compression jpeg if the input images are color or grayscale.

Please attach a sample page if you can.
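For anyone scripting this, here is a minimal sketch that assembles the command line discussed in this thread: the --force-ocr and --output-type pdfa-1 flags from the report, plus the suggested --pdfa-image-compression jpeg. The helper name `build_ocrmypdf_command` and the file names are hypothetical. JPEG is lossy, so it only makes sense for color or grayscale inputs.

```python
import shlex
from typing import List


def build_ocrmypdf_command(input_pdf: str, output_pdf: str,
                           jpeg: bool = True) -> List[str]:
    """Build an argv list for ocrmypdf using only flags quoted in this thread."""
    cmd = ["ocrmypdf", "--force-ocr", "--output-type", "pdfa-1"]
    if jpeg:
        # Ask the PDF/A conversion step to use JPEG for images.
        cmd += ["--pdfa-image-compression", "jpeg"]
    cmd += [input_pdf, output_pdf]
    return cmd


# Hypothetical file names, shown as a copy-pasteable shell line.
command = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf")
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` if ocrmypdf is installed. Comparing the output size with and without `jpeg=True` on a sample page is a quick way to see whether the option helps for a given scan.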

@MarjaE2

MarjaE2 commented Apr 10, 2018

Here is the full file, followed by an excerpt made with cpdf:

https://archive.org/details/voliaukrainy3219unse

Volya32Excerpt.pdf

It looks like an improvement for the excerpt, but I haven't tested it on the longer file.

@nemobis

nemobis commented Sep 20, 2018

"small color lossless compressed logo on each page, the presence of which causes the whole page to be rasterized and saved as color lossless at high DPI"

Same problem here, I think:

INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 4.88× larger than the input file.
Possible reasons for this include:
The argument --force-ocr was issued, causing transcoding.

$ pdfimages -list in.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   217  icc     3   8  jpeg   no       114  0    86    87 10.1K 9.2%
   1     1 image     475   178  icc     3   8  jpeg   no       115  0   245   245 19.2K 7.8%
   1     2 image     110    14  index   1   8  image  no       109  0    72    72 1058B  69%
   2     3 image    1794  2828  gray    1   1  ccitt  no        11  0   301   301 79.4K  13%
   2     4 image     110    14  index   1   8  image  no         4  0    72    72 1058B  69%
   3     5 image    1794  2828  gray    1   1  ccitt  no        22  0   301   301 77.4K  12%
   3     6 image     110    14  index   1   8  image  no        15  0    72    72 1058B  69%
   4     7 image    1794  2828  gray    1   1  ccitt  no        33  0   301   301 42.1K 6.8%
   4     8 image     110    14  index   1   8  image  no        26  0    72    72 1058B  69%
   5     9 image    1824  2848  gray    1   1  ccitt  no        44  0   301   301 69.1K  11%
   5    10 image     110    14  index   1   8  image  no        37  0    72    72 1058B  69%
   6    11 image    1794  2828  gray    1   1  ccitt  no        55  0   301   301 79.0K  13%
   6    12 image     110    14  index   1   8  image  no        48  0    72    72 1058B  69%
   7    13 image    1831  2852  gray    1   1  ccitt  no        66  0   301   301 91.7K  14%
   7    14 image     110    14  index   1   8  image  no        59  0    72    72 1058B  69%
   8    15 image    1794  2828  gray    1   1  ccitt  no        77  0   301   301 65.8K  11%
   8    16 image     110    14  index   1   8  image  no        70  0    72    72 1058B  69%
   9    17 image    1824  2847  gray    1   1  ccitt  no        88  0   301   301 24.6K 3.9%
   9    18 image     110    14  index   1   8  image  no        81  0    72    72 1058B  69%
  10    19 image     816  1056  gray    1   1  ccitt  no        99  0   101   101 3342B 3.1%
  10    20 image     110    14  index   1   8  image  no        92  0    72    72 1058B  69%

$ pdfimages -list out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2082  2693  rgb     3   8  image  no        59  0   245   245  220K 1.3%
   2     1 image    2553  3304  rgb     3   8  image  no        60  0   301   301  266K 1.1%
   3     2 image    2553  3304  rgb     3   8  image  no        61  0   301   301  272K 1.1%
   4     3 image    2553  3304  rgb     3   8  image  no        62  0   301   301  176K 0.7%
   5     4 image    2555  3306  rgb     3   8  image  no        63  0   301   301  248K 1.0%
   6     5 image    2553  3304  rgb     3   8  image  no        64  0   301   301  266K 1.1%
   7     6 image    2553  3303  rgb     3   8  image  no        65  0   301   301  312K 1.3%
   8     7 image    2553  3304  rgb     3   8  image  no        66  0   301   301  232K 0.9%
   9     8 image    2555  3306  rgb     3   8  image  no        67  0   301   301  119K 0.5%
  10     9 image     851  1101  rgb     3   8  image  no        68  0   101   101 15.0K 0.5%

$ ocrmypdf --version
7.0.5
$ jbig2 --version
jbig2enc 0.28
$ unpaper --version
0.3
$ qpdf --version
qpdf version 7.1.1
$ pngquant --version
2.11.10 (January 2018)
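To compare two listings like the ones above without reading them row by row, a small parser can total the image sizes per color space, bit depth, and encoding. This is only a sketch: `parse_size` and `summarize` are names made up here, the 16-column layout is taken from the listings in this comment, and the K/M/G suffixes are assumed to be 1024-based.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Assumption: pdfimages size suffixes are 1024-based.
UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> float:
    """Convert a pdfimages size such as '10.1K' or '1058B' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def summarize(listing: str) -> Dict[Tuple[str, str, str], float]:
    """Total approximate bytes per (color, bpc, enc) in `pdfimages -list` output."""
    totals: Dict[Tuple[str, str, str], float] = defaultdict(float)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have 16 columns and start with a page number;
        # this skips the header, the dashed rule, and shell prompts.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        color, bpc, enc, size = fields[5], fields[7], fields[8], fields[14]
        totals[(color, bpc, enc)] += parse_size(size)
    return dict(totals)


# Page 2 of in.pdf and page 2 of out.pdf, copied from the listings above.
sample = """\
   2     3 image    1794  2828  gray    1   1  ccitt  no        11  0   301   301 79.4K  13%
   2     1 image    2553  3304  rgb     3   8  image  no        60  0   301   301  266K 1.1%
"""
for key, total in summarize(sample).items():
    print(key, round(total))
```

On these listings the summary makes the conversion visible: the input pages are 1-bit CCITT images, while every output page is a single 8-bit RGB image.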
