[13.4.2] lossy compression of pngs into jpegs when it shouldn't #940

Closed
RamKromberg opened this issue Apr 3, 2022 · 21 comments

Comments

@RamKromberg

RamKromberg commented Apr 3, 2022

  1. It might be just the older version, but ocrmypdf 12.7.2 seems to compress uncompressed pngs into (lossy) jpegs:
$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
$ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
$ pdfimages -list ./Example-uncompress-compress.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%

I believe it should be running the image through pngquant instead at optimize level 1.

  2. Btw, it's probably not even worth mentioning since, looking at the changelog, I'm fairly certain you've already sorted it out in recent ocrmypdf versions, but small pdfs with small pngs grow instead of shrinking or staying the same size:
$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ img2pdf ./Example.png -o ./Example.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
$ stat -c "%n,%s" Example*.* | column -t -s,
Example-compress.pdf  7799
Example.pdf           3906
Example.png           2335

Though this might also be the pdf format changing to the archival specs...

  3. As a side note, if compute time isn't a factor, I personally found 'optipng -o7' to produce smaller pngs than pngquant, and 'jpegrescan -i -t -v' to produce the smallest jpegs, even compared to MozJPEG, oddly enough, despite the author saying otherwise.

p.s. forgot to mention the png-to-jpeg bug also happens with some compressed pngs but I haven't bothered trying to replicate this since I believe it should never try to convert bitmap images to jpegs to begin with.
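
The check from item 1 can be scripted. A minimal sketch, assuming the `pdfimages -list` layout shown above (two header lines, encoding in the ninth column); `jpeg_images` is a hypothetical helper name, not part of poppler or OCRmyPDF:

```sh
#!/usr/bin/env sh
# Hypothetical helper: read `pdfimages -list` output on stdin and print
# "page num enc" for every image whose encoding column is "jpeg".
# Assumes the layout shown in this thread: two header lines, enc in column 9.
jpeg_images() {
    awk 'NR > 2 && $9 == "jpeg" { print $1, $2, $9 }'
}
```

Used as `pdfimages -list ./Example-uncompress-compress.pdf | jpeg_images`, it prints one line per JPEG-encoded image and nothing when no image is listed as jpeg.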

@jbarlow83
Collaborator

Closing due to old version

@RamKromberg
Author

RamKromberg commented Apr 3, 2022

Sorry for the old version. I pulled a newer one (though still not the latest... might get to doing that later in the week/weekend) and both bugs were still there:

$ ocrmypdf --version
13.3.0

$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
--2022-04-03 22:31:36--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-03 22:31:37 (51.8 MB/s) - ‘Example.png’ saved [2335/2335]


$ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png

$ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf

$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
Scanning contents: 100%|██| 1/1 [00:00<00:00, 162.01page/s]
Image processing: 100%|█| 1.0/1.0 [00:00<00:00,  4.39page/s
Postprocessing...
PDF/A conversion: 100%|████| 1/1 [00:00<00:00, 26.34page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)

$ pdfimages -list ./Example-uncompress-compress.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%
  2. And the other one:
$ ocrmypdf --version
13.3.0

$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
--2022-04-03 22:33:56--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-03 22:33:57 (83.8 MB/s) - ‘Example.png’ saved [2335/2335]


$ img2pdf ./Example.png -o ./Example.pdf

$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
Scanning contents: 100%|██| 1/1 [00:00<00:00, 132.59page/s]
Image processing: 100%|█| 1.0/1.0 [00:00<00:00,  4.47page/s
Postprocessing...
PDF/A conversion: 100%|████| 1/1 [00:00<00:00, 47.82page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)

$ stat -c "%n,%s" Example*.* | column -t -s,
Example-compress.pdf  7789
Example.pdf           3906
Example.png           2335

I'm not sure when I'll get the chance to test 13.4 so I'll leave it at this for now.

thanks and sorry for your time

p.s.

scripts without the stdout/stderr noise:

1.sh

#!/usr/bin/env sh

ocrmypdf --version
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
pdfimages -list ./Example-uncompress-compress.pdf

2.sh

#!/usr/bin/env sh

ocrmypdf --version
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
img2pdf ./Example.png -o ./Example.pdf
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
stat -c "%n,%s" Example*.* | column -t -s,

@RamKromberg
Author

@jbarlow83 reproduced on 13.4.1:

$ ./1.sh
13.4.1
--2022-04-08 15:17:57--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png                   100%[================================================>]   2.28K  --.-KB/s    in 0s

2022-04-08 15:18:00 (691 MB/s) - ‘Example.png’ saved [2335/2335]

Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 168.50page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.30page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26.04page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 566.87image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.09 savings: 8.5%
Output file is a PDF/A-2B (as expected)
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 3373B 3.7%
$ ./2.sh
13.4.1
--2022-04-08 18:14:40--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png                   100%[================================================>]   2.28K  --.-KB/s    in 0s

2022-04-08 18:14:50 (537 MB/s) - ‘Example.png’ saved [2335/2335]

Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 134.76page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.34page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 48.33page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)
Example-compress.pdf  7787
Example.pdf           3906
Example.png           2335

@RamKromberg RamKromberg changed the title [12.7.2] lossy compression of pngs into jpegs when it shouldn't [13.4.1] lossy compression of pngs into jpegs when it shouldn't Apr 8, 2022
@RamKromberg
Author

FYI I took a look at 2.sh's Example-compress.pdf with a text editor and noticed an extra stream (or two?) so I concatenated it with "pdftk Example-compress.pdf cat output Example-compress-pdftk.pdf" and it helped quite a bit:

Example-compress.pdf        7787
Example-compress-pdftk.pdf  4059
Example.pdf                 3906
Example.png                 2335

Though I'm not sure if it's a bug (loose unreferenced objects?) or a feature (thumbnail? color profile? duplicate split stream for gradual web render? integrity redundancies? binary meta?...), it seems to be the cause for the size increase.
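
To put the sizes above in perspective, here is a small sketch that computes the relative growth with integer shell arithmetic. `growth_pct` is a hypothetical helper name; the byte counts are the ones listed above:

```sh
#!/usr/bin/env sh
# Hypothetical helper: integer percentage by which a size grew relative to
# the original size (negative if it shrank). Arguments are byte counts:
#   growth_pct ORIGINAL_SIZE NEW_SIZE
growth_pct() {
    echo $(( ($2 - $1) * 100 / $1 ))
}

# Sizes reported above: Example.pdf=3906, Example-compress.pdf=7787,
# Example-compress-pdftk.pdf=4059.
growth_pct 3906 7787   # OCRmyPDF output vs. input
growth_pct 3906 4059   # after rewriting the output with pdftk
```

This prints 99 and 3, i.e. the OCRmyPDF output is roughly twice the size of the input, while the pdftk rewrite of it is only about 3% larger.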

@RamKromberg
Author

@jbarlow83 reproduced in the current 13.4.2:

  1. png lossy compression into jpeg:
$ ./1.sh
13.4.2
--2022-04-12 15:20:39--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-12 15:20:43 (538 MB/s) - ‘Example.png’ saved [2335/2335]

Scanning contents: 100%|███| 1/1 [00:00<00:00, 59.54page/s]
Image processing: 100%|█| 1.0/1.0 [00:00<00:00,  5.28page/s
Postprocessing...
PDF/A conversion: 100%|████| 1/1 [00:00<00:00, 26.04page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|███| 1/1 [00:00<00:00, 619.91image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.09 savings: 8.5%
Output file is a PDF/A-2B (as expected)
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 3373B 3.7%
  2. unusual size increase, possibly due to unreferenced objects:
$ ./2.sh
13.4.2
--2022-04-12 15:21:23--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-12 15:21:23 (689 MB/s) - ‘Example.png’ saved [2335/2335]

Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 128.75page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.29page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 47.95page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)
Example-compress.pdf        7786
Example-compress-pdftk.pdf  4059
Example.pdf                 3906
Example.png                 2335

1.sh:

#!/usr/bin/env sh

ocrmypdf --version
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
pdfimages -list ./Example-uncompress-compress.pdf

2.sh:

#!/usr/bin/env sh

ocrmypdf --version
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
img2pdf ./Example.png -o ./Example.pdf
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
pdftk Example-compress.pdf cat output Example-compress-pdftk.pdf
stat -c "%n,%s" Example*.* | column -t -s,

@RamKromberg RamKromberg changed the title [13.4.1] lossy compression of pngs into jpegs when it shouldn't [13.4.2] lossy compression of pngs into jpegs when it shouldn't Apr 12, 2022
@jbarlow83
Collaborator

If you use --output-type pdf is this behavior still triggered? Or does it seem to happen during PDF/A conversion?

@jbarlow83 jbarlow83 reopened this Apr 13, 2022
@RamKromberg
Author

If you use --output-type pdf is this behavior still triggered? Or does it seem to happen during PDF/A conversion?

  1. The first issue (lossy jpegs instead of lossless images) doesn't happen when targeting non-archival:

$ ./1.sh
1 OCRmyPDF version:
13.4.2

2 Download Example.png sample:
--2022-04-14 11:58:52--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png                   100%[================================================>]   2.28K  --.-KB/s    in 0s

2022-04-14 11:58:52 (672 MB/s) - ‘Example.png’ saved [2335/2335]


3 Convert sample to uncompressed PNG:

4 Convert uncompressed sample to PDF:

5 Optimize with OCRmyPDF:
Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 156.31page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.36page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26.67page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 578.44image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.09 savings: 8.5%
Output file is a PDF/A-2B (as expected)

6 List images:
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 3373B 3.7%

7 Optimize with OCRmyPDF targetting non-archival format:
Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 157.21page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.32page/s]
Postprocessing...
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%

8 List images:
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  image  no         8  0    96    96 89.9K 100%
  2. The size issue is mostly alleviated with non-archival output, though it's still there:
$ ./2.sh
1 OCRmyPDF version:
13.4.2

2 Download Example.png sample:
--2022-04-14 11:59:43--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-14 11:59:43 (712 MB/s) - ‘Example.png’ saved [2335/2335]


3 Convert sample to pdf:

4 Optimize with OCRmyPDF:
Scanning contents: 100%|██| 1/1 [00:00<00:00, 129.98page/s]
Image processing: 100%|█| 1.0/1.0 [00:00<00:00,  5.34page/s
Postprocessing...
PDF/A conversion: 100%|████| 1/1 [00:00<00:00, 48.76page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

5 Concatenate with pdftk:

6 Optimize with OCRmyPDF targetting non-archival format:
Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 132.60page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.36page/s]
Postprocessing...
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%

7 Concatenate non-archival with pdftk:

8 Compare sizes:
Example-compress-nonA.pdf        4473
Example-compress-nonA-pdftk.pdf  3519
Example-compress.pdf             7787
Example-compress-pdftk.pdf       4059
Example.pdf                      3906
Example.png                      2335

1.sh:

echo 1 OCRmyPDF version:
ocrmypdf --version
echo
echo 2 Download Example.png sample:
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
echo
echo 3 Convert sample to uncompressed PNG:
convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
echo
echo 4 Convert uncompressed sample to PDF:
img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
echo
echo 5 Optimize with OCRmyPDF:
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
echo
echo 6 List images:
pdfimages -list ./Example-uncompress-compress.pdf
echo
echo 7 Optimize with OCRmyPDF targetting non-archival format:
ocrmypdf --tesseract-timeout=0 --output-type pdf --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress-nonA.pdf
echo
echo 8 List images:
pdfimages -list ./Example-uncompress-compress-nonA.pdf
echo

2.sh:

echo 1 OCRmyPDF version:
ocrmypdf --version
echo
echo 2 Download Example.png sample:
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
echo
echo 3 Convert sample to pdf:
img2pdf ./Example.png -o ./Example.pdf
echo
echo 4 Optimize with OCRmyPDF:
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
echo
echo 5 Concatenate with pdftk:
pdftk Example-compress.pdf cat output Example-compress-pdftk.pdf
echo
echo 6 Optimize with OCRmyPDF targetting non-archival format:
ocrmypdf --tesseract-timeout=0 --output-type pdf --optimize 1 --skip-text Example.pdf Example-compress-nonA.pdf
echo
echo 7 Concatenate non-archival with pdftk:
pdftk Example-compress-nonA.pdf cat output Example-compress-nonA-pdftk.pdf
echo
echo 8 Compare sizes:
stat -c "%n,%s" Example*.* | column -t -s,
echo

@rmast

rmast commented Jun 26, 2022

I can confirm --output-type pdf skips the lossy jpeg compression.

I removed jpeg-artifacts from my test-image
by using

and they reappear if I use -O0 or -O1 but don't use --output-type pdf

@rmast

rmast commented Jun 26, 2022

I have the open source PDF24 on my computer and it is able to flate a TIFF to PDF/A with no JPEG-compression artifacts, also appearing to use GhostScript. It's validated to be PDF/A by https://avepdf.com/pdfa-validation.
merge_from_ofoct quality 5pdf24.pdf

@RamKromberg
Author

@rmast Anything using ghostscript will run into lossy conversions in some cases since ghostscript doesn't support the same image formats and color profiles as pdf.

The specific issue here is a nuance of:

Ghostscript may transcode grayscale and color images, either lossy to lossless or lossless to lossy, based on an internal algorithm. This behavior can be suppressed by setting --pdfa-image-compression to jpeg or lossless to set all images to one type or the other. Ghostscript has no option to maintain the input image’s format. (Ghostscript 9.25+ can copy JPEG images without transcoding them; earlier versions will transcode.)

( https://ocrmypdf.readthedocs.io/en/latest/introduction.html#limitations )
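
Going by the quoted documentation, forcing lossless image compression during PDF/A conversion would look roughly like the sketch below. The wrapper name `ocr_pdfa_lossless` is hypothetical, the other flags are simply copied from the reproduction scripts in this issue, and whether this avoids the transcoding for the test case here has not been verified in this thread:

```sh
#!/usr/bin/env sh
# Hypothetical wrapper around the reproduction command used in this issue,
# adding --pdfa-image-compression lossless as described in the quoted docs.
# Usage: ocr_pdfa_lossless input.pdf output.pdf
ocr_pdfa_lossless() {
    ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text \
        --pdfa-image-compression lossless "$1" "$2"
}
```

For example, `ocr_pdfa_lossless Example-uncompress.pdf Example-uncompress-lossless.pdf` followed by `pdfimages -list` on the output would show whether the image still ends up as jpeg.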

I've raised a similar issue with pdfScale.sh where I've also made some test scripts to illustrate the issue: tavinus/pdfScale#27

I would have closed the issue since the behavior is known and documented, but because --output-type pdf isn't the default, the problem is technically still there, so I figured I'd leave it to the dev to decide whether to close it.

Anyhow, PDF24 is closed source freeware so I won't look too much into it but if it uses ghostscript, it will have to deal with similar issues.

Otherwise, FBCNN seems like a nice image restoration neural net model (I've personally used waifu2x for sheet music scaling before OCRing with Audiveris with good results) but it's still a lossy process so it's only appropriate as a mid-stage before running Tesseract.

@rmast

rmast commented Jun 27, 2022 via email

@jbarlow83
Collaborator

--output-type pdf is not default behavior for historical reasons. PDF/A output has been the default since ocrmypdf 1.0 - and back when it was released it was pretty much the only open source tool that produced PDF/A other than Ghostscript (and that only with a lot of coaxing).

Unfortunately Ghostscript does not announce when it transcodes.

@RamKromberg
Author

Unfortunately Ghostscript does not announce when it transcodes.

It wouldn't help since there are still a lot of feature mismatches between postscript, PDF and PDF/A.

Btw, what would it take to have OCRMyPDF preserve existing PDF/A documents? I don't actually need it myself but if it's a big enough deal to default on despite the lossy transcoding... Well, I mean, it's not like OCRMyPDF is adding any multimedia features so if it comes down to just setting a meta flag or something... Right?

@rmast

rmast commented Jun 28, 2022 via email

@RamKromberg
Author

I would expect all necessary tricks in https://github.com/oxplot/pdfrankenstein

Disturbingly enough, going through svg instead of postscript might actually be better since svg doesn't specify a limit on supported raster formats or color profiles and should be able to accommodate a transparent text layer... Frankenstein indeed.

Putting that aside, I meant to ask what it would take to modify OCRMyPDF itself so that it preserves a PDF/A as a PDF/A that still passes veraPDF. According to their docs, it's something pikepdf/qpdf are able to do and, without looking into the specs, I'm assuming adding a text layer to a pdf shouldn't break PDF/A. So, I'm guessing OCRMyPDF will only need to avoid stuff like the linearization to pass veraPDF?

p.s. @jbarlow83 Feel free to close the issue if you feel it's appropriate.

@jbarlow83
Collaborator

Btw, what would it take to have OCRMyPDF preserve existing PDF/A documents?

If an input document is already a valid PDF/A, and we're only adding the text layer, and we're not preprocessing images, we could probably keep it a PDF/A without passing through Ghostscript. It's a special case, but it seems like a worthwhile one....

I'm guessing OCRMyPDF will only need to avoid stuff like the linearization to pass veraPDF?

Linearization is allowed in PDF/A if the PDF is 1.5 or above, IIRC.

@jbarlow83
Collaborator

Actually, it turns out OCRmyPDF already will preserve PDF/A if the input is valid PDF/A and (counterintuitively) --output-type pdf is selected, because the edits to insert OCR are not that invasive. OCRmyPDF doesn't notice incoming PDF/A and doesn't remark on PDF/A being preserved in this way, and it probably should output a log message or two about this.

@RamKromberg
Author

I haven't had much luck trying to install veraPDF so I'll just take your word for it and say congrats :)

OCRmyPDF doesn't notice incoming PDF/A and doesn't remark on PDF/A being preserved in this way, and it probably should output a log message or two about this.

I think a note in the docs would be more than enough. Besides, unless you want to bring in veraPDF as a dependency, trusting the pdf/a meta tag is probably not a good idea.
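
Until something like that exists, the workflow discussed above could be approximated outside OCRmyPDF. The sketch below is an assumption-laden illustration, not OCRmyPDF behavior: `claims_pdfa` and `ocr_keep_pdfa` are hypothetical names, and the check only looks for the XMP `pdfaid:part` marker as plain bytes, which is exactly the kind of metadata claim that should not be mistaken for validation:

```sh
#!/usr/bin/env sh
# Hypothetical helper: succeed if the file's metadata *claims* PDF/A
# conformance. It only searches for the XMP "pdfaid:part" marker as raw
# bytes and validates nothing (that is what veraPDF is for). It can miss
# the marker if the metadata is stored in a different encoding.
claims_pdfa() {
    grep -a -q 'pdfaid:part' "$1"
}

# Hypothetical wrapper: when the input claims PDF/A, add the OCR layer with
# --output-type pdf so the PDF/A conversion step is skipped; otherwise use
# OCRmyPDF's default output type.
# Usage: ocr_keep_pdfa input.pdf output.pdf
ocr_keep_pdfa() {
    if claims_pdfa "$1"; then
        ocrmypdf --output-type pdf --skip-text "$1" "$2"
    else
        ocrmypdf --skip-text "$1" "$2"
    fi
}
```

The result would still need to be checked with a real validator such as veraPDF before being treated as PDF/A.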

@rmast

rmast commented Jul 1, 2022 via email

@rmast

rmast commented Jul 2, 2022 via email

@RamKromberg
Author

I don’t know whether that check exists in open source.

@rmast veraPDF is FOSS and optionally uses PDFbox: https://docs.verapdf.org/develop/#license

You can also just run the .jar off their installer like the Arch AUR package does: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=verapdf
