Improve optimization - add option to remove unreferenced images #807

Closed
xelan opened this issue Feb 21, 2024 · 8 comments
@xelan
Contributor

xelan commented Feb 21, 2024

In some cases, PDFs may contain image resources which are not referenced on pages anymore.

Example files:
pdf-optimization-original.pdf
pdf-optimization-page-removed.pdf

Here, the original had a second page containing an otter image, which was removed with a third-party PDF editing tool. However, the image resource is still in the PDF.

If you compare the two PDFs, the file size is almost the same, because the original image from page two is still taking up space in the file.

C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-original.pdf
pages: all

pdf-optimization-original.pdf:
2 images available(241 KB)
Page Obj# │ Id      │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   1   14 │ Image14 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 133 KB │ DCTDecode
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   2   17 │ Image17 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 108 KB │ DCTDecode

C:\Software\pdfcpu_0.6.0_Windows_x86_64>pdfcpu.exe image list pdf-optimization-page-removed.pdf
pages: all

pdf-optimization-page-removed.pdf:
1 images available(133 KB)
Page Obj# │ Id      │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━
   1   14 │ Image14 │ image                  │  1384 │    865 │  DeviceRGB    3   8    *   │ 133 KB │ DCTDecode

Would it be possible to add an optimization option to remove such "orphan" image resources?

The ocrmypdf tool does something similar during its PDF optimization (see optimize.py). Being able to perform this optimization via pdfcpu as well would allow us to simplify our toolchain and reduce the number of required dependencies.
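
To illustrate what "orphan" means here, the following is a minimal sketch of a reachability check over a toy object graph. It is not how pdfcpu or ocrmypdf implement this; the dictionary layout, the function names, and all object numbers other than 14 and 17 (taken from the image listing above) are assumptions made for illustration. The point is that an image can still have a referrer (a page dict that is no longer in the page tree) and yet be unreachable from the document root.

```python
from collections import deque


def reachable_objects(objects, root):
    """Return the set of object numbers reachable from `root` (breadth-first)."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for ref in objects.get(current, ()):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen


def find_orphans(objects, root):
    """Return the sorted object numbers that cannot be reached from `root`."""
    return sorted(set(objects) - reachable_objects(objects, root))


# Hypothetical model of pdf-optimization-page-removed.pdf.
# Only 14 and 17 are real object numbers from the listing; the rest are made up.
objects = {
    1: [2],    # catalog -> page tree
    2: [3],    # page tree -> the one remaining page
    3: [14],   # remaining page -> Image14
    14: [],    # Image14 (still in use)
    16: [17],  # removed page dict: no longer in the page tree, still points at Image17
    17: [],    # Image17 (the otter image)
}

print(find_orphans(objects, root=1))  # [16, 17]
```

Counting referrers alone would not flag object 17 in this model, since object 16 still points at it; walking from the root catches both.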

EDIT: optimization output via ocrmypdf:

$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r--    1 root     root      279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf

$ ocrmypdf --skip-text  pdf-optimization-page-removed.pdf pdf-optimization-page-removed-optimized.pdf
Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.07page/s]
   INFO - Using Tesseract OpenMP thread limit 3
   INFO -    1: skipping all processing on this page                                                                                                                                  
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 179.26page/s]
WARNING - Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
   INFO - Optimize ratio: 1.00 savings: 0.0%
WARNING - Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

$ ls -hal pdf-optimization-page-removed*.pdf
-rw-r--r--    1 root     root      146.8K Feb 21 14:26 pdf-optimization-page-removed-optimized.pdf
-rw-r--r--    1 root     root      279.0K Feb 21 14:20 pdf-optimization-page-removed.pdf

PDF optimized with ocrmypdf:
pdf-optimization-page-removed-optimized.pdf

Thank you very much, best regards from Tyrol
Andreas

@hhrutter
Collaborator

I think we can do something about this.

@hhrutter
Collaborator

This is fixed with the latest commit.
Just run pdfcpu optimize to get rid of dangling page dicts and the resources they reference.
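
For anyone who wants to script this check, here is a minimal sketch of a wrapper that runs the optimize command and reports the size change. It assumes a `pdfcpu` binary on the PATH and uses the `pdfcpu optimize <in> <out>` form that appears later in this thread; the function names are hypothetical, and nothing is executed until `optimize_and_report` is called.

```python
import os
import shutil
import subprocess


def build_optimize_cmd(src, dst, binary="pdfcpu"):
    """Build the argument list for `pdfcpu optimize <src> <dst>`."""
    return [binary, "optimize", src, dst]


def size_reduction_percent(before_bytes, after_bytes):
    """Return how much smaller the output is, as a percentage of the input."""
    if before_bytes <= 0:
        raise ValueError("before_bytes must be positive")
    return 100.0 * (1 - after_bytes / before_bytes)


def optimize_and_report(src, dst, binary="pdfcpu"):
    """Run the optimizer and return (before_bytes, after_bytes, percent_saved)."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(build_optimize_cmd(src, dst, binary), check=True)
    before = os.path.getsize(src)
    after = os.path.getsize(dst)
    return before, after, size_reduction_percent(before, after)


# Example call (requires pdfcpu and the attached test file):
# print(optimize_and_report("pdf-optimization-page-removed.pdf", "out.pdf"))
```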

@xelan
Contributor Author

xelan commented Feb 26, 2024

Good morning, thank you very much! Just compiled the latest version and tested against some of our PDFs.

Here's the result with a representative file we have:

$ time ocrmypdf -q -s -O 0 deleted-pages.pdf out.pdf && ls -ahl *.pdf

real	0m0.557s
user	0m0.501s
sys	0m0.057s
-rw-r--r-- 1 root root  38M Feb 26 06:57 deleted-pages.pdf
-rw-r--r-- 1 root root 568K Feb 26 07:04 out.pdf

$ time pdfcpu optimize deleted-pages.pdf out.pdf && ls -ahl *.pdf
writing out.pdf...
optimizing...

real	0m0.013s
user	0m0.016s
sys	0m0.005s
-rw-r--r-- 1 root root  38M Feb 26 06:57 deleted-pages.pdf
-rw-r--r-- 1 root root 565K Feb 26 07:05 out.pdf

So pdfcpu provides roughly the same optimization level as ocrmypdf (without lossless file-changing optimizations), but with way better performance!
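
As a rough reading of the numbers above, the short calculation below puts figures on "roughly the same" and "way better". It uses only the `real` times and `ls -h` sizes from this single run, treats `38M` as 38 × 1024 KiB, and inherits the rounding of `ls -h`, so the results are indicative rather than a benchmark. The function names are made up for this sketch.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate ran than the baseline."""
    return baseline_seconds / candidate_seconds


def reduction_percent(before_kib, after_kib):
    """Size reduction as a percentage of the original size."""
    return 100.0 * (1 - after_kib / before_kib)


# Values copied from the two runs above.
input_kib = 38 * 1024        # deleted-pages.pdf, shown as 38M
ocrmypdf_out_kib = 568       # out.pdf after ocrmypdf
pdfcpu_out_kib = 565         # out.pdf after pdfcpu optimize
ocrmypdf_real_s = 0.557
pdfcpu_real_s = 0.013

print(f"speedup: {speedup(ocrmypdf_real_s, pdfcpu_real_s):.1f}x")                    # speedup: 42.8x
print(f"ocrmypdf reduction: {reduction_percent(input_kib, ocrmypdf_out_kib):.1f}%")  # 98.5%
print(f"pdfcpu reduction: {reduction_percent(input_kib, pdfcpu_out_kib):.1f}%")      # 98.5%
```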

@hhrutter
Collaborator

Please consider becoming a pdfcpu sponsor.
Sponsorship is important for credibility and confidence in the project and ensures a clear path moving forward.
Thank you for using pdfcpu 💚

@xelan
Contributor Author

xelan commented Feb 26, 2024

Sure, done 😃

I've stumbled across another test case in our PDF test suite where the optimization does not fully work. The first page of the source file optimization-test-2-pages.pdf, which contained the image, was deleted (see optimization-test-1-page.pdf), but the image data remains in the PDF even after optimization.

optimization-test-2-pages.pdf
optimization-test-1-page.pdf

@hhrutter
Collaborator

Appreciated!
I will check this out.

hhrutter added a commit that referenced this issue Feb 28, 2024
@hhrutter
Collaborator

Should be fixed with the latest commit.

@xelan
Contributor Author

xelan commented Feb 28, 2024

Wow, thanks for the quick fix! Just compiled the latest commit, and this test case now succeeds as well.
