[13.4.2] lossy compression of pngs into jpegs when it shouldn't #940

RamKromberg · 2022-04-03T10:19:58Z

It might be just the older version, but ocrmypdf 12.7.2 seems to compress uncompressed pngs into (lossy) jpegs:

$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
$ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
$ pdfimages -list ./Example-uncompress-compress.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%

I believe it should be running the image through pngquant instead at optimize level 1.
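
The check above can be made scriptable by counting the images that `pdfimages -list` reports with `jpeg` in the `enc` column. A minimal sketch, assuming the column layout shown above (two header lines, `enc` as the ninth whitespace-separated field); `count_jpeg_images` is a hypothetical helper name, not part of any tool:

```shell
#!/usr/bin/env sh
# Hypothetical helper: reads `pdfimages -list` output on stdin and prints
# how many images are stored with JPEG encoding.
# Assumes the layout shown above: two header lines, "enc" in field 9.
count_jpeg_images() {
    awk 'NR > 2 && $9 == "jpeg" { n++ } END { print n + 0 }'
}

# Usage (not run here):
#   pdfimages -list Example-uncompress-compress.pdf | count_jpeg_images
```

A result above 0 for a PDF built only from PNG input would indicate the transcoding described here.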

Btw, it's probably not even worth mentioning since, looking at the changelog, I'm fairly certain you've already sorted it out in recent ocrmypdf versions, but small pdfs with small pngs grow instead of shrinking / remaining the same:

$ ocrmypdf --version
12.7.2
$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
$ img2pdf ./Example.png -o ./Example.pdf
$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
$ stat -c "%n,%s" Example*.* | column -t -s,
Example-compress.pdf  7799
Example.pdf           3906
Example.png           2335

Though this might also be the pdf format changing to the archival specs...
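
To put a number on the growth, the output size can be divided by the input size. A small sketch; `size_ratio` is a hypothetical helper name, and it only assumes a POSIX `wc` and `awk`:

```shell
#!/usr/bin/env sh
# Hypothetical helper: prints (size of $2) / (size of $1) with two decimals.
# A value above 1.00 means the second file is larger than the first.
size_ratio() {
    in_size=$(wc -c < "$1") || return 1
    out_size=$(wc -c < "$2") || return 1
    awk -v i="$in_size" -v o="$out_size" \
        'BEGIN { if (i == 0) exit 1; printf "%.2f\n", o / i }'
}

# Usage (not run here):
#   size_ratio Example.pdf Example-compress.pdf
```

With the sizes listed above (3906 and 7799 bytes) the ratio comes out at roughly 2, i.e. the "optimized" file is about twice the size of its input.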

As a side note, if compute time isn't a factor, I personally found 'optipng -o7' to produce smaller pngs than pngquant and 'jpegrescan -i -t -v' to produce the smallest jpeg, even compared to MozJPEG despite the author saying otherwise oddly enough.

p.s. forgot to mention the png-to-jpeg bug also happens with some compressed pngs but I haven't bothered trying to replicate this since I believe it should never try to convert bitmap images to jpegs to begin with.


jbarlow83 · 2022-04-03T17:51:49Z

Closing due to old version

RamKromberg · 2022-04-03T19:47:18Z

Sorry for the old version. I pulled a newer one (though still not the latest... might get to doing that later in the week/weekend) and both bugs were still there:

$ ocrmypdf --version
13.3.0

$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
--2022-04-03 22:31:36--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-03 22:31:37 (51.8 MB/s) - ‘Example.png’ saved [2335/2335]


$ convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png

$ img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf

$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
Scanning contents: 100%|██| 1/1 [00:00<00:00, 162.01page/s]
Image processing: 100%|█| 1.0/1.0 [00:00<00:00,  4.39page/s
Postprocessing...
PDF/A conversion: 100%|████| 1/1 [00:00<00:00, 26.34page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)

$ pdfimages -list ./Example-uncompress-compress.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 4157B 4.5%

And the other one:

$ ocrmypdf --version
13.3.0

$ wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
--2022-04-03 22:33:56--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-03 22:33:57 (83.8 MB/s) - ‘Example.png’ saved [2335/2335]


$ img2pdf ./Example.png -o ./Example.pdf

$ ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
Scanning contents: 100%|██| 1/1 [00:00<00:00, 132.59page/s]
Image processing: 100%|█| 1.0/1.0 [00:00<00:00,  4.47page/s
Postprocessing...
PDF/A conversion: 100%|████| 1/1 [00:00<00:00, 47.82page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)

$ stat -c "%n,%s" Example*.* | column -t -s,
Example-compress.pdf  7789
Example.pdf           3906
Example.png           2335

I'm not sure when I'll get the chance to test 13.4 so I'll leave it at this for now.

thanks and sorry for your time

p.s.

scripts without the stdout/stderr noise:

1.sh

#!/usr/bin/env sh

ocrmypdf --version
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
pdfimages -list ./Example-uncompress-compress.pdf

2.sh

#!/usr/bin/env sh

ocrmypdf --version
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
img2pdf ./Example.png -o ./Example.pdf
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
stat -c "%n,%s" Example*.* | column -t -s,

RamKromberg · 2022-04-08T15:16:39Z

@jbarlow83 reproduced on 13.4.1:

$ ./1.sh
13.4.1
--2022-04-08 15:17:57--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png                   100%[================================================>]   2.28K  --.-KB/s    in 0s

2022-04-08 15:18:00 (691 MB/s) - ‘Example.png’ saved [2335/2335]

Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 168.50page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.30page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26.04page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 566.87image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.09 savings: 8.5%
Output file is a PDF/A-2B (as expected)
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 3373B 3.7%

$ ./2.sh
13.4.1
--2022-04-08 18:14:40--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png                   100%[================================================>]   2.28K  --.-KB/s    in 0s

2022-04-08 18:14:50 (537 MB/s) - ‘Example.png’ saved [2335/2335]

Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 134.76page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.34page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 48.33page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is a PDF/A-2B (as expected)
Example-compress.pdf  7787
Example.pdf           3906
Example.png           2335

RamKromberg · 2022-04-08T18:39:44Z

FYI I took a look at 2.sh's Example-compress.pdf with a text editor and noticed an extra stream (or two?) so I concatenated it with "pdftk Example-compress.pdf cat output Example-compress-pdftk.pdf" and it helped quite a bit:

Example-compress.pdf        7787
Example-compress-pdftk.pdf  4059
Example.pdf                 3906
Example.png                 2335

Though I'm not sure if it's a bug (loose unreferenced objects?) or a feature (thumbnail? color profile? duplicate split stream for gradual web render? integrity redundancies? binary meta?...), it seems to be the cause for the size increase.
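
A crude first step to narrow down where the extra bytes come from is to count the indirect objects in each file. A rough sketch, assuming the objects are not packed into object streams (for instance after rewriting the file with `qpdf --qdf --object-streams=disable in.pdf out.pdf`); `count_pdf_objects` is a hypothetical helper, and binary stream data can in principle produce false matches:

```shell
#!/usr/bin/env sh
# Hypothetical helper: counts lines that start an indirect object,
# e.g. "12 0 obj". Objects stored inside object streams are not seen.
count_pdf_objects() {
    LC_ALL=C grep -a -c -E '^[0-9]+ [0-9]+ obj' "$1"
}

# Usage (not run here):
#   count_pdf_objects Example-compress.pdf
#   count_pdf_objects Example-compress-pdftk.pdf
```

If the two counts differ, the objects missing from the pdftk output are candidates for the unreferenced or auxiliary data suspected above.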

RamKromberg · 2022-04-12T12:29:53Z

@jbarlow83 reproduced in the current 13.4.2:

png lossy compression into jpeg:

$ ./1.sh
13.4.2
--2022-04-12 15:20:39--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-12 15:20:43 (538 MB/s) - ‘Example.png’ saved [2335/2335]

Scanning contents: 100%|███| 1/1 [00:00<00:00, 59.54page/s]
Image processing: 100%|█| 1.0/1.0 [00:00<00:00,  5.28page/s
Postprocessing...
PDF/A conversion: 100%|████| 1/1 [00:00<00:00, 26.04page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|███| 1/1 [00:00<00:00, 619.91image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.09 savings: 8.5%
Output file is a PDF/A-2B (as expected)
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 3373B 3.7%

unusual size increase possibly over unreferenced objects:

$ ./2.sh
13.4.2
--2022-04-12 15:21:23--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-12 15:21:23 (689 MB/s) - ‘Example.png’ saved [2335/2335]

Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 128.75page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.29page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 47.95page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)
Example-compress.pdf        7786
Example-compress-pdftk.pdf  4059
Example.pdf                 3906
Example.png                 2335

1.sh:

#!/usr/bin/env sh

ocrmypdf --version
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
pdfimages -list ./Example-uncompress-compress.pdf

2.sh:

#!/usr/bin/env sh

ocrmypdf --version
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
img2pdf ./Example.png -o ./Example.pdf
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
pdftk Example-compress.pdf cat output Example-compress-pdftk.pdf
stat -c "%n,%s" Example*.* | column -t -s,

jbarlow83 · 2022-04-13T21:44:54Z

If you use --output-type pdf is this behavior still triggered? Or does it seem to happen during PDF/A conversion?

RamKromberg · 2022-04-14T09:15:38Z

> If you use --output-type pdf is this behavior still triggered? Or does it seem to happen during PDF/A conversion?

The first issue (lossy jpegs instead of lossless images) doesn't happen when targeting non-archival:


$ ./1.sh
1 OCRmyPDF version:
13.4.2

2 Download Example.png sample:
--2022-04-14 11:58:52--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png                   100%[================================================>]   2.28K  --.-KB/s    in 0s

2022-04-14 11:58:52 (672 MB/s) - ‘Example.png’ saved [2335/2335]


3 Convert sample to uncompressed PNG:

4 Convert uncompressed sample to PDF:

5 Optimize with OCRmyPDF:
Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 156.31page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.36page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26.67page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 578.44image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.09 savings: 8.5%
Output file is a PDF/A-2B (as expected)

6 List images:
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  jpeg   no         9  0    96    96 3373B 3.7%

7 Optimize with OCRmyPDF targetting non-archival format:
Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 157.21page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.32page/s]
Postprocessing...
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%

8 List images:
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     172   178  rgb     3   8  image  no         8  0    96    96 89.9K 100%

The size issue is mostly alleviated with non-archival output, though it's still there:

$ ./2.sh
1 OCRmyPDF version:
13.4.2

2 Download Example.png sample:
--2022-04-14 11:59:43--  https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2335 (2.3K) [image/png]
Saving to: ‘Example.png’

Example.png    100%   2.28K  --.-KB/s    in 0s

2022-04-14 11:59:43 (712 MB/s) - ‘Example.png’ saved [2335/2335]


3 Convert sample to pdf:

4 Optimize with OCRmyPDF:
Scanning contents: 100%|██| 1/1 [00:00<00:00, 129.98page/s]
Image processing: 100%|█| 1.0/1.0 [00:00<00:00,  5.34page/s
Postprocessing...
PDF/A conversion: 100%|████| 1/1 [00:00<00:00, 48.76page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

5 Concatenate with pdftk:

6 Optimize with OCRmyPDF targetting non-archival format:
Scanning contents: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 132.60page/s]
Image processing: 100%|████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  5.36page/s]
Postprocessing...
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%

7 Concatenate non-archival with pdftk:

8 Compare sizes:
Example-compress-nonA.pdf        4473
Example-compress-nonA-pdftk.pdf  3519
Example-compress.pdf             7787
Example-compress-pdftk.pdf       4059
Example.pdf                      3906
Example.png                      2335

1.sh:

echo 1 OCRmyPDF version:
ocrmypdf --version
echo
echo 2 Download Example.png sample:
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
echo
echo 3 Convert sample to uncompressed PNG:
convert Example.png -define png:compression-level=0 -define png:compression-filter=0 -define png:color-type=2 Example-uncompress.png
echo
echo 4 Convert uncompressed sample to PDF:
img2pdf ./Example-uncompress.png -o ./Example-uncompress.pdf
echo
echo 5 Optimize with OCRmyPDF:
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress.pdf
echo
echo 6 List images:
pdfimages -list ./Example-uncompress-compress.pdf
echo
echo 7 Optimize with OCRmyPDF targetting non-archival format:
ocrmypdf --tesseract-timeout=0 --output-type pdf --optimize 1 --skip-text Example-uncompress.pdf Example-uncompress-compress-nonA.pdf
echo
echo 8 List images:
pdfimages -list ./Example-uncompress-compress-nonA.pdf
echo

2.sh:

echo 1 OCRmyPDF version:
ocrmypdf --version
echo
echo 2 Download Example.png sample:
wget https://upload.wikimedia.org/wikipedia/commons/7/70/Example.png
echo
echo 3 Convert sample to pdf:
img2pdf ./Example.png -o ./Example.pdf
echo
echo 4 Optimize with OCRmyPDF:
ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text Example.pdf Example-compress.pdf
echo
echo 5 Concatenate with pdftk:
pdftk Example-compress.pdf cat output Example-compress-pdftk.pdf
echo
echo 6 Optimize with OCRmyPDF targetting non-archival format:
ocrmypdf --tesseract-timeout=0 --output-type pdf --optimize 1 --skip-text Example.pdf Example-compress-nonA.pdf
echo
echo 7 Concatenate non-archival with pdftk:
pdftk Example-compress-nonA.pdf cat output Example-compress-nonA-pdftk.pdf
echo
echo 8 Compare sizes:
stat -c "%n,%s" Example*.* | column -t -s,
echo

rmast · 2022-06-26T18:01:24Z

I can confirm --output-type pdf skips the lossy jpeg compression.

I removed jpeg-artifacts from
by using

and they reappear if I use -O0 or -O1 but don't use --output-type pdf

rmast · 2022-06-26T18:49:38Z

I have the open source PDF24 on my computer and it is able to flate a TIFF to PDF/A with no JPEG-compression artifacts, also appearing to use GhostScript. It's validated to be PDF/A by https://avepdf.com/pdfa-validation.
merge_from_ofoct quality 5pdf24.pdf

RamKromberg · 2022-06-27T11:44:54Z

@rmast Anything using ghostscript will run into lossy conversions in some cases since ghostscript doesn't support the same image formats and color profiles as pdf.

The specific issue here is a nuance of:

Ghostscript may transcode grayscale and color images, either lossy to lossless or lossless to lossy, based on an internal algorithm. This behavior can be suppressed by setting --pdfa-image-compression to jpeg or lossless to set all images to one type or the other. Ghostscript has no option to maintain the input image’s format. (Ghostscript 9.25+ can copy JPEG images without transcoding them; earlier versions will transcode.)

( https://ocrmypdf.readthedocs.io/en/latest/introduction.html#limitations )
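
In practice the workaround from the quoted paragraph might look like the following. This is a sketch only: the `--pdfa-image-compression lossless` option is taken from the documentation quoted above, the remaining flags are the ones used in the scripts earlier in this thread, and `ocr_lossless` and `DRY_RUN` are hypothetical names introduced here. Whether it avoids the JPEG transcoding for the sample file has not been confirmed in this thread.

```shell
#!/usr/bin/env sh
# Hypothetical wrapper: runs ocrmypdf with PDF/A output while asking
# Ghostscript to keep all images lossless.
# $1 = input PDF, $2 = output PDF.
# Set DRY_RUN to any non-empty value to print the command instead of running it.
ocr_lossless() {
    set -- ocrmypdf --tesseract-timeout=0 --optimize 1 --skip-text \
        --pdfa-image-compression lossless "$1" "$2"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage (not run here):
#   ocr_lossless Example-uncompress.pdf Example-uncompress-lossless.pdf
```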

I've raised a similar issue with pdfScale.sh where I've also made some test scripts to illustrate the issue: tavinus/pdfScale#27

I should have closed the issue since it's known and documented but since --output-type pdf isn't the default behavior I figured I should leave it up to the dev to decide whether to close the issue or not since it's still technically there.

Anyhow, PDF24 is closed source freeware so I won't look too much into it but if it uses ghostscript, it will have to deal with similar issues.

Otherwise, FBCNN seems like a nice image restoration neural net model (I've personally used waifu2x for sheet music scaling before OCRing with Audiveris with good results) but it's still a lossy process so it's only appropriate as a mid-stage before running Tesseract.

rmast · 2022-06-27T13:56:20Z

You’re right, I can’t find the source online. I’ll try --pdfa-image-compression lossless to see whether it preserves my TIFF. I assumed that was already covered by -O0.

jbarlow83 · 2022-06-27T23:10:20Z

--output-type pdf is not default behavior for historical reasons. PDF/A output has been the default since ocrmypdf 1.0 - and back when it was released it was pretty much the only open source tool that produced PDF/A other than Ghostscript (and that only with a lot of coaxing).

Unfortunately Ghostscript does not announce when it transcodes.

RamKromberg · 2022-06-28T10:17:29Z

> Unfortunately Ghostscript does not announce when it transcodes.

It wouldn't help, since there are still a lot of feature mismatches between PostScript, PDF and PDF/A.

Btw, what would it take to have OCRMyPDF preserve existing PDF/A documents? I don't actually need it myself but if it's a big enough deal to default on despite the lossy transcoding... Well, I mean, it's not like OCRMyPDF is adding any multimedia features so if it comes down to just setting a meta flag or something... Right?

rmast · 2022-06-28T11:32:30Z

I would expect all necessary tricks in https://github.com/oxplot/pdfrankenstein

RamKromberg · 2022-06-28T12:40:26Z

> I would expect all necessary tricks in https://github.com/oxplot/pdfrankenstein

Disturbingly enough, going through svg instead of postscript might actually be better since svg doesn't specify a limit on supported raster formats or color profiles and should be able to accommodate a transparent text layer... Frankenstein indeed.

Putting that aside, I meant to ask what it would take to modify OCRMyPDF itself so that it preserves a PDF/A as a PDF/A that still passes veraPDF. According to their docs, it's something pikepdf/qpdf are able to do and, without looking into the specs, I'm assuming adding a text layer to a pdf shouldn't break PDF/A. So, I'm guessing OCRMyPDF will only need to avoid stuff like the linearization to pass veraPDF?

p.s. @jbarlow83 Feel free to close the issue if you feel it's appropriate.

jbarlow83 · 2022-06-30T21:19:32Z

> Btw, what would it take to have OCRMyPDF preserve existing PDF/A documents?

If an input document is already a valid PDF/A, and we're only adding the text layer, and we're not preprocessing images, we could probably keep it a PDF/A without passing through Ghostscript. It's a special case, but it seems like a worthwhile one....

> I'm guessing OCRMyPDF will only need to avoid stuff like the linearization to pass veraPDF?

Linearization is allowed in PDF/A if the PDF is 1.5 or above, IIRC.

jbarlow83 · 2022-07-01T08:26:00Z

Actually, it turns out OCRmyPDF will already preserve PDF/A if the input is valid PDF/A and (counterintuitively) --output-type pdf is selected, because the edits to insert OCR are not that invasive. OCRmyPDF doesn't notice incoming PDF/A and doesn't remark on PDF/A being preserved in this way, and it probably should output a log message or two about this.

RamKromberg · 2022-07-01T22:16:24Z

I haven't had much luck trying to install veraPDF so I'll just take your word for it and say congrats :)

> OCRmyPDF doesn't notice incoming PDF/A and doesn't remark on PDF/A being preserved in this way, and it probably should output a log message or two about this.

I think a note in the docs would be more than enough. Besides, unless you want to bring in veraPDF as a dependency, trusting the pdf/a meta tag is probably not a good idea.
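
For reference, the "pdf/a meta tag" is the `pdfaid:part` property in the file's XMP metadata. A minimal sketch of checking for it, assuming the XMP packet is stored uncompressed in the file; `claims_pdfa` is a hypothetical helper, and a match only shows that the file claims conformance, not that it would pass a validator such as veraPDF:

```shell
#!/usr/bin/env sh
# Hypothetical helper: succeeds if the file contains a pdfaid:part marker,
# i.e. it *claims* to be PDF/A. This is not a validation.
claims_pdfa() {
    LC_ALL=C grep -a -q 'pdfaid:part' "$1"
}

# Usage (not run here):
#   if claims_pdfa input.pdf; then echo "input claims PDF/A"; fi
```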

rmast · 2022-07-01T22:27:50Z

I didn’t install veraPDF. I just uploaded the PDF for an online check. I don’t know whether that check exists in open source.

rmast · 2022-07-02T07:27:04Z

There is a validator in pdfbox: https://pdfbox.apache.org/1.8/cookbook/pdfavalidation.html

RamKromberg · 2022-07-02T07:42:29Z

> I don’t know whether that check exists in open source.

@rmast veraPDF is FOSS and optionally uses PDFbox: https://docs.verapdf.org/develop/#license

You can also just run the .jar off their installer like the Arch AUR package does: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=verapdf

jbarlow83 closed this as completed Apr 3, 2022

RamKromberg changed the title ~~[12.7.2] lossy compression of pngs into jpegs when it shouldn't~~ [13.4.1] lossy compression of pngs into jpegs when it shouldn't Apr 8, 2022

RamKromberg changed the title ~~[13.4.1] lossy compression of pngs into jpegs when it shouldn't~~ [13.4.2] lossy compression of pngs into jpegs when it shouldn't Apr 12, 2022

jbarlow83 reopened this Apr 13, 2022

jbarlow83 closed this as completed Jun 30, 2022
