Benefits of pdfcpu optimize #135

Open
OrganicChem opened this issue Nov 26, 2019 · 10 comments
@OrganicChem

The file compression using optimize isn't too significant. Are there some settings I can use to get a better PDF file compression?

@hhrutter
Collaborator

Hello!

A PDF file is, at heart, just a bunch of objects referenced by a cross-reference table and written to disk.
There are two ways for a PDF writer to save a cross-reference table:

  1. Save cross-reference sections made up of subsections whose entries point to the objects
  2. Save cross-reference streams, usually together with object streams (both available since PDF 1.5)

The main job of pdfcpu optimize is getting rid of redundant resource objects.
For example, if your file embeds the same font multiple times, perhaps once per page, pdfcpu makes sure the font file is embedded only once and fixes all references to the old objects. Likewise, if the same image appears multiple times, pdfcpu can get rid of the unneeded copies.
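
The idea of removing redundant resources can be sketched in a few lines of Go. This is only an illustration of content-based deduplication, not pdfcpu's actual implementation: the `object` type, the `dedupe` function, and the object numbers are hypothetical, and real PDF objects carry far more structure than a byte slice.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// object is a simplified, hypothetical stand-in for a PDF stream object
// such as an embedded font file or an image.
type object struct {
	ID   int    // object number (made up for this example)
	Data []byte // raw stream bytes
}

// dedupe keeps the first object seen for each distinct payload. It returns
// the surviving objects and a map from every original ID to the ID that
// references should point to afterwards.
func dedupe(objs []object) ([]object, map[int]int) {
	seen := make(map[[sha256.Size]byte]int) // content hash -> surviving ID
	remap := make(map[int]int, len(objs))
	var kept []object
	for _, o := range objs {
		h := sha256.Sum256(o.Data)
		if id, ok := seen[h]; ok {
			remap[o.ID] = id // duplicate: redirect references to the first copy
			continue
		}
		seen[h] = o.ID
		remap[o.ID] = o.ID
		kept = append(kept, o)
	}
	return kept, remap
}

func main() {
	font := []byte("pretend these bytes are an embedded font file")
	objs := []object{
		{ID: 4, Data: font},
		{ID: 9, Data: font}, // the same font embedded again for another page
		{ID: 12, Data: []byte("a different image stream")},
	}
	kept, remap := dedupe(objs)
	fmt.Println(len(kept), remap[9]) // prints: 2 4
}
```

Hashing the payload means each object is inspected once, and the `remap` table is what a writer would consult to fix references to the removed copies.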

Once these (and other) optimization steps are done, pdfcpu writes back the cross-reference table using option 2, applying the Flate stream compression filter (FlateDecode), which is zlib/deflate under the hood.
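
To get a feel for what zlib/deflate does to repetitive stream data, here is a minimal sketch using Go's standard `compress/zlib` package. It does not use pdfcpu at all; the helper names `deflate` and `inflate` and the sample data (lines shaped like classic cross-reference entries) are made up for illustration.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// deflate compresses data into the zlib format, the encoding that PDF's
// FlateDecode filter expects.
func deflate(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := zlib.NewWriterLevel(&buf, zlib.BestCompression)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed bytes and the checksum.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// inflate reverses deflate.
func inflate(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// Highly repetitive sample data, shaped like classic xref entries.
	raw := bytes.Repeat([]byte("0000000017 00000 n \n"), 500)
	packed, err := deflate(raw)
	if err != nil {
		panic(err)
	}
	back, err := inflate(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw: %d bytes, compressed: %d bytes, round trip ok: %v\n",
		len(raw), len(packed), bytes.Equal(raw, back))
}
```

Structural data like this compresses very well, whereas image data that is already JPEG-compressed gains little from a second pass.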

To answer your question:
Unless your PDF files are really old, chances are they are already saved using cross-reference streams, and in that case you will not gain much from pdfcpu optimize. Your mileage may vary: it really depends on how your file was written and to what extent it uses embedded fonts and images.
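
A rough way to check which of the two variants a file uses is to follow the `startxref` offset near the end of the file: a classic table begins with the keyword `xref`, while a cross-reference stream is an ordinary indirect object. The sketch below is a heuristic under that assumption, not a PDF parser; `usesXRefStream` and `sample` are hypothetical helpers, the fragments built by `sample` are toy data rather than valid PDFs, and hybrid or damaged files are not handled.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
)

// usesXRefStream reports whether the most recent cross-reference section
// looks like a cross-reference stream rather than a classic table.
func usesXRefStream(pdf []byte) (bool, error) {
	i := bytes.LastIndex(pdf, []byte("startxref"))
	if i < 0 {
		return false, errors.New("no startxref keyword found")
	}
	fields := bytes.Fields(pdf[i+len("startxref"):])
	if len(fields) == 0 {
		return false, errors.New("missing offset after startxref")
	}
	off, err := strconv.Atoi(string(fields[0]))
	if err != nil || off < 0 || off >= len(pdf) {
		return false, errors.New("invalid cross-reference offset")
	}
	// A classic table starts with "xref"; otherwise the offset is assumed
	// to point at an "N G obj" header of a cross-reference stream.
	return !bytes.HasPrefix(pdf[off:], []byte("xref")), nil
}

// sample builds a toy fragment whose startxref offset points at section.
func sample(section string) []byte {
	head := "%PDF-1.7\n"
	return []byte(head + section + "\nstartxref\n" +
		strconv.Itoa(len(head)) + "\n%%EOF\n")
}

func main() {
	classic := sample("xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>")
	stream := sample("1 0 obj\n<< /Type /XRef >>\nstream\nendstream\nendobj")
	for name, data := range map[string][]byte{"classic": classic, "stream": stream} {
		isStream, err := usesXRefStream(data)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: uses xref stream = %v\n", name, isStream)
	}
}
```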

In my case I had to deal with a PDF of a couple of gigabytes written by a popular Java PDF library. After optimization and removing embedded fonts it was down to a lightweight couple of hundred megabytes. This was actually my motivation for starting pdfcpu in the first place. So for me it worked out well.

Unfortunately pdfcpu does not provide any magic switches. There are some settings in pkg/pdfcpu/configuration.go, but they are already preconfigured to produce the most compact result in the most efficient way.

Let me know if you have further questions.
Thank you for using pdfcpu 💚

@hhrutter changed the title from "PDF Compression" to "Benefits of pdfcpu optimize" on Nov 27, 2019
@OrganicChem
Author

Thank you Horst for the quite elaborate and very informative response.

The process you describe is in fact an optimization if you will - since the quality of the PDF remains pretty much intact. It's basically a refinement of structure of a given file, which, after clean up, results in a PDF that is unaltered in appearance, with some reduction of file size.

Unfortunately, the refinements you mentioned do not always result in dramatic size reductions. The more common tools out there compress images by altering resolution and color, and remove hyperlinks, bookmarks, etc. This sort of downsampling does dramatically reduce the PDF file size. The caveat is that you lose quality, and the quality loss generally grows with the size reduction.

I suppose if one is satisfied with content alteration in exchange for a large decrease in file size, it's all good! It would be nice to be able to reduce image resolution and color depth, and I think these two things alone would cut down the size considerably.

On a happier note, merging files is certainly stellar in performance compared to other conventional tools. This really demonstrates the power of Go!

@hhrutter
Collaborator

Compressing images on the fly is doable, but I don't think automating it is practical.
PDF offers several compression filters, and then there is also the compression rate to consider.
You may have imported a JPG into your PDF, but which compression filter your PDF writer used to store the image as a stream object in the file is an unknown, and that complicates things.
Choosing an appropriate compression rate is a science in itself.

@OrganicChem
Author

Perhaps a shot in the dark in PDF file compression - since images mainly contribute to the PDF file size, could one extract the images, compress them with any well known library and replace them back into the PDF file? Is this plausible?
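
The "compress them with any well known library" step could look roughly like the sketch below, which re-encodes JPEG data at a lower quality using only Go's standard `image/jpeg` package. Extracting the image from the PDF and writing it back are not shown, and nothing here is pdfcpu API: `encode`, `recompress`, and `testImage` are hypothetical helpers, and the synthetic test image stands in for an extracted image.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/jpeg"
)

// encode writes img as JPEG at the given quality (1 to 100).
func encode(img image.Image, quality int) ([]byte, error) {
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: quality}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// recompress decodes JPEG data and re-encodes it at a lower quality.
// This is lossy: detail removed here cannot be recovered later.
func recompress(jpegData []byte, quality int) ([]byte, error) {
	img, err := jpeg.Decode(bytes.NewReader(jpegData))
	if err != nil {
		return nil, err
	}
	return encode(img, quality)
}

// testImage builds a synthetic pattern so the example needs no input file.
func testImage(w, h int) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			img.Set(x, y, color.RGBA{uint8(x), uint8(y), uint8(x ^ y), 255})
		}
	}
	return img
}

func main() {
	original, err := encode(testImage(256, 256), 95)
	if err != nil {
		panic(err)
	}
	smaller, err := recompress(original, 40)
	if err != nil {
		panic(err)
	}
	fmt.Printf("quality 95: %d bytes, quality 40: %d bytes\n",
		len(original), len(smaller))
}
```

The quality value is exactly the "compression rate" trade-off discussed above: lower values shrink the file further at the cost of visible artifacts.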

@hhrutter
Collaborator

Yes this is conceivable.
If you want to follow up please do so in #6.
Thank you!

@ihipop

ihipop commented Mar 8, 2024

Can I disable PDF optimization?
I have a PDF of about 30 pages that needs almost 28 seconds to optimize, which is very slow. @hhrutter

@hhrutter
Collaborator

hhrutter commented Mar 8, 2024

Interesting. Please send me this file if you can; optimization should not slow you down.
This is something I would like to fix.

@hhrutter
Collaborator

hhrutter commented Mar 9, 2024

Your file has one page with an excessive number of images (>8000).
Some operations rely on calling optimize internally, but depending on your use case we may be able to disable image optimization.

What are you calling? Are you using the CLI or API?

For example we could add a -skipImages flag for the CLI Optimize command.

@ihipop

ihipop commented Mar 9, 2024

> Your file has 1 page with an excessive amount of images (>8000). Some operations rely on calling optimize internally, but depending on your usecase we may be able to disable image optimization.
>
> What are you calling? Are you using the CLI or API?
>
> For example we could add a -skipImages flag for the CLI Optimize command.

I’m using both the API and the CLI, and I can't find an option to disable optimization in either.

@hhrutter reopened this on Mar 9, 2024
@hhrutter
Collaborator

hhrutter commented Mar 9, 2024

@ALL - The latest commit features the Configuration.Optimize flag.
Right now this is for API users only; the CLI will follow with the next release.

🔥 Proceed with caution 🔥
