Benefits of pdfcpu optimize #135

OrganicChem · 2019-11-26T23:23:42Z

The file compression using optimize isn't too significant. Are there some settings I can use to get a better PDF file compression?

hhrutter · 2019-11-27T23:01:01Z

Hello!

A PDF file at heart is just a bunch of objects referenced by a cross reference table written to disk.
There are two ways for a PDF Writer to save a cross reference table:

1) Save cross reference sections made up of subsections containing the objects
2) Save cross reference streams utilizing object streams since PDF 1.5
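To make the first option more concrete, here is a minimal sketch of what a classic cross reference section looks like on disk. It is not pdfcpu code; the function name `xrefSection` and the sample offsets are made up for illustration. It only assumes the fixed 20-byte entry layout of the PDF format (10-digit byte offset, 5-digit generation number, `n` or `f`).

```go
package main

import (
	"fmt"
	"strings"
)

// xrefSection renders a classic cross reference section with a single
// subsection. offsets[i] is the byte offset of object i+1 in the file.
// Hypothetical helper for illustration only.
func xrefSection(offsets []int64) string {
	var b strings.Builder
	// Subsection header: first object number and number of entries.
	fmt.Fprintf(&b, "xref\n0 %d\n", len(offsets)+1)
	// Object 0 is always free and heads the free list (generation 65535).
	b.WriteString("0000000000 65535 f \n")
	for _, off := range offsets {
		// In-use entry: 10-digit offset, 5-digit generation, 'n'.
		// Each entry is exactly 20 bytes including the line ending.
		fmt.Fprintf(&b, "%010d %05d n \n", off, 0)
	}
	return b.String()
}

func main() {
	// Made-up offsets for three objects.
	fmt.Print(xrefSection([]int64{15, 74, 182}))
}
```

A cross reference stream stores the same information in binary form inside a compressed stream object, which is one reason files written that way tend to be smaller.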

The main job of pdfcpu optimize is getting rid of redundant resource objects.
What that means is, e.g., if your file embeds the same font multiple times, perhaps once per page, pdfcpu makes sure the font file is embedded only once (and fixes all references to the old objects). Likewise, if the same image appears multiple times, pdfcpu can get rid of the unneeded copies of this image.
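The idea of dropping redundant resource objects can be sketched roughly as follows. This is not how pdfcpu implements it (the thread does not say how objects are compared); it is a simplified illustration that assumes objects are plain byte slices and treats two objects as identical when their SHA-256 hashes match. The name `dedupe` is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedupe keeps the first object for each distinct content and returns the
// surviving objects plus a map from old object index to new object index.
// The map plays the role of "fixing all references to old objects".
func dedupe(objs [][]byte) (kept [][]byte, remap map[int]int) {
	seen := make(map[[32]byte]int) // content hash -> index in kept
	remap = make(map[int]int, len(objs))
	for i, o := range objs {
		h := sha256.Sum256(o)
		if j, ok := seen[h]; ok {
			// Redundant copy: redirect references to the first copy.
			remap[i] = j
			continue
		}
		seen[h] = len(kept)
		remap[i] = len(kept)
		kept = append(kept, o)
	}
	return kept, remap
}

func main() {
	font := []byte("font program bytes")
	img := []byte("image bytes")
	// The same font is embedded three times, e.g. once per page.
	kept, remap := dedupe([][]byte{font, img, font, font})
	fmt.Println(len(kept), remap[2], remap[3]) // prints "2 0 0"
}
```

The savings depend entirely on how many duplicates the original writer produced, which matches the "your mileage may vary" caveat in this thread.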

Once these (and other) optimization steps are done, pdfcpu writes back the cross reference table using 2), applying the Flate (FlateDecode) stream compression filter, which is zlib/deflate under the hood.
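For a feel of what zlib/deflate does to stream data, here is a small sketch using Go's standard `compress/zlib` package. It is not taken from pdfcpu; the helper names `deflate` and `inflate` and the sample content stream are assumptions for illustration.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"log"
)

// deflate compresses data in zlib format, the format PDF Flate streams use.
func deflate(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed bytes.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// inflate reverses deflate.
func inflate(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// A repetitive, made-up page content stream compresses well.
	raw := bytes.Repeat([]byte("BT /F1 12 Tf (Hello) Tj ET\n"), 200)
	packed, err := deflate(raw)
	if err != nil {
		log.Fatal(err)
	}
	back, err := inflate(packed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("raw=%d packed=%d roundtrip ok=%v\n", len(raw), len(packed), bytes.Equal(raw, back))
}
```

Data that is already compressed, such as embedded JPEG images, gains little or nothing from a second deflate pass.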

To answer your question:
Unless your PDF files are really old, chances are they are already saved using cross reference streams, and in this case you are not really gaining much out of pdfcpu optimize, so your mileage may vary. It really depends on how your file was written and to what extent it uses embedded fonts and images.

In my case I had to deal with a PDF with a filesize of a couple of gigabytes written by a popular Java PDF library and after optimization and removing embedded fonts it was a lightweight of a couple of hundred megabytes. This was actually my motivation for starting pdfcpu in the first place. So for me it worked out well.

Unfortunately pdfcpu does not provide any magic switches. There are some settings in pkg/pdfcpu/configuration.go, but they are already preconfigured to achieve the most compact results in the most efficient way.

Let me know if you have further questions.
Thank you for using pdfcpu 💚

OrganicChem · 2019-11-28T02:04:23Z

Thank you Horst for the quite elaborate and very informative response.

The process you describe is in fact an optimization if you will - since the quality of the PDF remains pretty much intact. It's basically a refinement of structure of a given file, which, after clean up, results in a PDF that is unaltered in appearance, with some reduction of file size.

Unfortunately, the refinements you mentioned do not always result in dramatic size shifts. The more common tools out there compress images by altering resolution and color, remove hyperlinks, bookmarks etc, and this sort of downsampling will in fact dramatically reduce the PDF file size. However, the caveat here is you will lose quality and the amount of quality loss is certainly proportional to file reduction.

I suppose if one is satisfied with content alteration with a large decrease in file size, it's all good! It would be nice to be able to have the ability to compress image resolution and color and I think these two things alone will certainly cut down the size.

On a happier note, merging files is certainly stellar in performance compared to other conventional tools. This really demonstrates the power of Go!

hhrutter · 2019-11-29T09:58:54Z

Compressing images on the fly is doable but I think automation is not practical.
We have different compression filters in PDF and then there is also the compression rate.
You may have imported a JPG into your PDF, but which PDF compression filter your PDF writer uses to store the image in the PDF is an unknown, and that complicates things.
Choosing an appropriate compression rate is a science in itself.

OrganicChem · 2020-08-26T23:39:40Z

Perhaps a shot in the dark in PDF file compression - since images mainly contribute to the PDF file size, could one extract the images, compress them with any well known library and replace them back into the PDF file? Is this plausible?
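The recompression step of this idea could look roughly like the sketch below, using only Go's standard `image/jpeg` package. It deliberately leaves out extracting images from and writing them back into a PDF, and it assumes the extracted image is a JPEG. The helper names `recompress` and `sampleJPEG`, the synthetic test image, and the quality values are all made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/jpeg"
	"log"
)

// recompress decodes JPEG data and re-encodes it at the given quality
// (1 to 100). Lower quality means smaller output and more visible loss.
func recompress(data []byte, quality int) ([]byte, error) {
	img, err := jpeg.Decode(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: quality}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// sampleJPEG builds a synthetic 256x256 image and encodes it at high
// quality. It stands in for an image extracted from a PDF.
func sampleJPEG() ([]byte, error) {
	img := image.NewRGBA(image.Rect(0, 0, 256, 256))
	for y := 0; y < 256; y++ {
		for x := 0; x < 256; x++ {
			img.Set(x, y, color.RGBA{uint8(x), uint8(y), uint8(x ^ y), 255})
		}
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: 95}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	orig, err := sampleJPEG()
	if err != nil {
		log.Fatal(err)
	}
	small, err := recompress(orig, 30)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("original=%d bytes, recompressed=%d bytes\n", len(orig), len(small))
}
```

Picking the quality value is the hard part: the right trade-off between size and visible loss differs from image to image.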

hhrutter · 2020-08-27T21:05:22Z

Yes this is conceivable.
If you want to follow up please do so in #6.
Thank you!

ihipop · 2024-03-08T14:06:31Z

Can I disable PDF optimization?
I have a PDF of about 30 pages that needs almost 28 seconds to optimize, which is very slow @hhrutter

hhrutter · 2024-03-08T14:13:52Z

Interesting. Please send me this file if you can; optimization should not slow you down.
This is something I would like to fix.

hhrutter · 2024-03-09T10:49:02Z

Your file has 1 page with an excessive number of images (>8000).
Some operations rely on calling optimize internally, but depending on your use case we may be able to disable image optimization.

What are you calling? Are you using the CLI or API?

For example we could add a -skipImages flag for the CLI Optimize command.

ihipop · 2024-03-09T12:41:55Z

> Your file has 1 page with an excessive amount of images (>8000). Some operations rely on calling optimize internally, but depending on your usecase we may be able to disable image optimization.
>
> What are you calling? Are you using the CLI or API?
>
> For example we could add a -skipImages flag for the CLI Optimize command.

I'm using both the API and the CLI, and I can't find an option to disable optimization.

hhrutter · 2024-03-09T16:38:35Z

@ALL - The latest commit features the Configuration.Optimize flag.
Right now this is for API users, CLI will follow up with next release.

🔥 Proceed with caution 🔥

hhrutter changed the title ~~PDF Compression~~ Benefits of pdfcpu optimize Nov 27, 2019

hhrutter added the question label Apr 2, 2021

hhrutter closed this as completed in 5ccea97 Mar 9, 2024

hhrutter reopened this Mar 9, 2024
