Improve optimization - add option to remove unreferenced images #807

xelan · 2024-02-21T12:14:18Z

In some cases, PDFs may contain image resources which are not referenced on pages anymore.

Example files:
pdf-optimization-original.pdf
pdf-optimization-page-removed.pdf

Here, the original file had a second page containing an otter image, which was removed by a third-party PDF editing tool. However, the image resource is still in the PDF.

If you compare the two PDFs, their sizes are almost the same: the image from the original page two is still taking up space in the file.

C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-original.pdf
pages: all

pdf-optimization-original.pdf:
2 images available(241 KB)
Page Obj# │ Id      │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   1   14 │ Image14 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 133 KB │ DCTDecode
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   2   17 │ Image17 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 108 KB │ DCTDecode

C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-page-removed.pdf
pages: all

pdf-optimization-page-removed.pdf:
1 images available(133 KB)
Page Obj# │ Id      │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   1   14 │ Image14 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 133 KB │ DCTDecode

Would it be possible to add an optimization option to remove such "orphan" image resources?
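To illustrate what "orphan" means here, a minimal Go sketch of a reachability check over a toy object graph follows. It is only an illustration of the idea, not how pdfcpu or ocrmypdf actually implement it. The `objectGraph` type, the `unreachable` function and object numbers 1 to 4 are hypothetical; 14 and 17 mirror the image objects from the listing above.

```go
package main

import (
	"fmt"
	"sort"
)

// objectGraph maps an object number to the object numbers it references.
// Toy stand-in for a PDF object table (hypothetical, not pdfcpu's data model).
type objectGraph map[int][]int

// unreachable returns, in ascending order, every object that cannot be
// reached from root by following references.
func unreachable(g objectGraph, root int) []int {
	seen := map[int]bool{}
	stack := []int{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[n] {
			continue // already visited; also guards against reference cycles
		}
		seen[n] = true
		stack = append(stack, g[n]...)
	}
	var orphans []int
	for n := range g {
		if !seen[n] {
			orphans = append(orphans, n)
		}
	}
	sort.Ints(orphans)
	return orphans
}

func main() {
	g := objectGraph{
		1:  {2},     // catalog -> page tree
		2:  {3},     // page tree: only page 1 is still listed
		3:  {2, 14}, // page 1 -> parent, Image14
		4:  {2, 17}, // removed page: points to its parent, but nothing points to it
		14: {},      // Image14
		17: {},      // Image17, only referenced by the removed page
	}
	fmt.Println(unreachable(g, 1)) // prints [4 17]
}
```

In this toy model the removed page (object 4) and its image (object 17) are the only objects that are not reachable from the catalog, so they are the candidates an optimizer could drop.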

The ocrmypdf tool does something similar during its PDF optimization (see optimize.py). Being able to perform this optimization via pdfcpu as well would allow us to simplify our toolchain and reduce the number of required dependencies.

EDIT: optimization output via ocrmypdf:

$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r--    1 root     root      279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf

$ ocrmypdf --skip-text  pdf-optimization-page-removed.pdf pdf-optimization-page-removed-optimized.pdf
Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.07page/s]
   INFO - Using Tesseract OpenMP thread limit 3
   INFO -    1: skipping all processing on this page                                                                                                                                  
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 179.26page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: 0.0%
WARNING - Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r--    1 root     root      146.8K Feb 21 14:26 pdf-optimization-page-removed-optimized.pdf
-rw-r--r--    1 root     root      279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf

PDF optimized with ocrmypdf:
pdf-optimization-page-removed-optimized.pdf

Thank you very much, best regards from Tyrol
Andreas


hhrutter · 2024-02-21T21:41:46Z

I think we can do something about this.

hhrutter · 2024-02-25T01:28:52Z

This is fixed with the latest commit.
Just run pdfcpu optimize to get rid of dangling pageDicts and the resources they reference.

xelan · 2024-02-26T07:10:28Z

Good morning, thank you very much! Just compiled the latest version and tested against some of our PDFs.

Here's the result with a representative file we have:

$ time ocrmypdf -q -s -O 0 deleted-pages.pdf out.pdf && ls -ahl *.pdf

real	0m0.557s
user	0m0.501s
sys	0m0.057s
-rw-r--r-- 1 root root  38M Feb 26 06:57 deleted-pages.pdf
-rw-r--r-- 1 root root 568K Feb 26 07:04 out.pdf

$ time pdfcpu optimize deleted-pages.pdf out.pdf && ls -ahl *.pdf
writing out.pdf...
optimizing...

real	0m0.013s
user	0m0.016s
sys	0m0.005s
-rw-r--r-- 1 root root  38M Feb 26 06:57 deleted-pages.pdf
-rw-r--r-- 1 root root 565K Feb 26 07:05 out.pdf

So pdfcpu produces roughly the same file size as ocrmypdf run with -O 0, i.e. without its lossless optimizations (565K versus 568K), but much faster: about 0.013s instead of 0.557s of wall-clock time!
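For a rough sense of scale, the figures quoted above can be turned into a size reduction and a wall-clock ratio. This is a back-of-the-envelope sketch only: the `ls -h` sizes are rounded, each command was timed once, and the helper names `reductionPercent` and `speedup` are made up for this example.

```go
package main

import "fmt"

// reductionPercent returns how much smaller 'after' is than 'before', in percent.
func reductionPercent(before, after float64) float64 {
	return (1 - after/before) * 100
}

// speedup returns how many times shorter the 'fast' duration is than 'slow'.
func speedup(slow, fast float64) float64 {
	return slow / fast
}

func main() {
	// 38M as printed by ls -h, converted to KB (rounded by ls, so approximate).
	const inputKB = 38 * 1024

	fmt.Printf("ocrmypdf: %.1f%% smaller\n", reductionPercent(inputKB, 568))
	fmt.Printf("pdfcpu:   %.1f%% smaller\n", reductionPercent(inputKB, 565))

	// "real" times from the two runs above, in seconds (single runs).
	fmt.Printf("wall-clock ratio: %.0fx\n", speedup(0.557, 0.013))
}
```

Both tools shrink the file by about 98.5%, and the single-run wall-clock ratio comes out at roughly 43x in favour of pdfcpu.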

hhrutter · 2024-02-26T07:41:55Z

Please consider becoming a pdfcpu sponsor.
Sponsorship is important for credibility and confidence in the project and ensures a clear path moving forward.
Thank you for using pdfcpu 💚

xelan · 2024-02-26T14:19:02Z

Sure, done 😃

I've stumbled across another test case in our PDF test suite where the optimization does not fully work. The first page of the source file optimization-test-2-pages.pdf, which contained the image, was deleted (see optimization-test-1-page.pdf), but the image data remains in the PDF even after optimization.

optimization-test-2-pages.pdf
optimization-test-1-page.pdf

hhrutter · 2024-02-27T09:53:13Z

Appreciated!
I will check this out.

hhrutter · 2024-02-28T00:42:06Z

Should be fixed with the latest commit.

xelan · 2024-02-28T07:50:55Z

Wow, thanks for the quick fix! Just compiled the latest commit, and now this test case also succeeds.

xelan added the feature request label Feb 21, 2024

xelan assigned hhrutter Feb 21, 2024

hhrutter closed this as completed in d3e607d Feb 25, 2024

hhrutter added a commit that referenced this issue Feb 28, 2024

Fix #807

d5fd063
