Huge file produced when merging (watermarking/stamping) #1951

Closed
ucomru opened this issue Jul 7, 2023 · 6 comments · Fixed by #1952
Labels
needs-pdf The issue needs a PDF file to show the problem nf-performance Non-functional change: Performance

ucomru commented Jul 7, 2023

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-13.4.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf.__version__)"
3.12.0

Explanation

Good day!

  1. Sometimes I need to add a background to a text PDF file.
    It's not simple, because pypdf.PdfReader(file).pages[0].merge_page(page) adds page after the content of PdfReader(file).pages[0], so it is drawn on top.
    That is fine for a "draft" stamp, but not for a background image.
    For a background I have to create a fresh background page each time, bkg = pypdf.PdfReader(f_bkg).pages[0], and then merge the next text page onto it.

I think it would be a good idea to add a method .bg, pypdf.PdfReader(file).pages[0].bg(background_page), that places the background beneath the text page PdfReader(file).pages[0].

  2. The problem with a multipage PDF file.
    When I merge the background file bg.pdf with the text file source.pdf into out_with_bg.pdf, I get a huge 132 MB file:
bg.pdf, 761 KB (contains an image, a simple page)
source.pdf, 680 KB (contains text, 171 pages)
out_with_bg.pdf, 132 MB

But when I use pdftk, I get a file of only 1.5 MB:

 pdftk out.pdf background bg.pdf output out-BG_pdftk.pdf
out-BG_pdftk.pdf, 1.5 MB

The pypdf process is too slow and the output file is huge.

I think it would be a good idea to add the methods .merge and .bg to the whole PDF object:

PdfReader(file).merge(page)
PdfReader(file).bg(page)

Code Example

How would your feature be used? (Remove this if it is not applicable.)

import pypdf

with open('source.pdf', 'rb') as f_pdf, \
        open('bg.pdf', 'rb') as f_bkg:
    pdf = pypdf.PdfReader(f_pdf) # source text pdf file
    out = pypdf.PdfWriter()
    for p, page in enumerate(pdf.pages):
        bkg = pypdf.PdfReader(f_bkg).pages[0] # background pdf file with image
        bkg.merge_page(page)
        out.add_page(bkg)
    with open('out_with_bg.pdf', 'wb') as f:
        out.write(f)
@MartinThoma MartinThoma removed their assignment Jul 7, 2023
@MartinThoma MartinThoma added the needs-pdf The issue needs a PDF file to show the problem label Jul 7, 2023
@py-pdf py-pdf deleted a comment from ucomru Jul 7, 2023
@MartinThoma MartinThoma added the nf-performance Non-functional change: Performance label Jul 7, 2023
@MartinThoma
Member

@ucomru Do you have example files which we could use? I guess your real files are private, but maybe you can re-create one that is not?

As a side-note: I've added this kind of use-case to https://github.com/py-pdf/benchmarks#watermarking-speed already and noticed that pdfrw is a lot faster + produces smaller results. See also py-pdf/benchmarks#7

It's an issue I am aware of, but I'm uncertain how to tackle it.

@MartinThoma MartinThoma changed the title from "pdf merge with background return is too huge file + NEED merge & bg to PdfReader(file)" to "Huge file produced when merging (watermarking/stamping)" Jul 7, 2023
@ucomru
Author

ucomru commented Jul 7, 2023

@ucomru Do you have example files which we could use? I guess your real files are private, but maybe you can re-create one that is not?

As a side-note: I've added this kind of use-case to https://github.com/py-pdf/benchmarks#watermarking-speed already and noticed that pdfrw is a lot faster + produces smaller results. See also py-pdf/benchmarks#7

It's an issue I am aware of, but I'm uncertain how to tackle it.

Thanks for the answer!

I ran two tests, pypdf_test.py and pdftk:

❯ pypdf_test.py
Total time: 1.640826940536499 sec.

❯ time pdftk source.pdf background bg.pdf output out-BG_pdftk.pdf
pdftk source.pdf background bg.pdf output out-BG_pdftk.pdf  0.37s user 0.07s system 139% cpu 0.317 total

> exa -l
.rw-r--r--@ 721k tm  7 Jul 22:57 bg.pdf
.rw-r--r--  1.0M tm  7 Jul 23:10 out-BG_pdftk.pdf
.rw-r--r--   87M tm  7 Jul 23:00 out_with_bg.pdf
.rw-r--r--@  527 tm  7 Jul 22:59 pypdf_test.py
.rw-r--r--@ 262k tm  7 Jul 22:52 source.pdf

I think the time is not bad, but I suspect that a new method such as .merge_page_bg would be faster.
I don't know the PDF structure in depth, so maybe I'm wrong. A Word .docx file contains a header and a footer.
Maybe PDF has something similar?

pypdf_test.py:

import time

import pypdf # pip install pypdf

t = time.time()

with open('source.pdf', 'rb') as f_pdf, \
        open('bg.pdf', 'rb') as f_bkg:
    pdf = pypdf.PdfReader(f_pdf) # source text pdf file
    out = pypdf.PdfWriter()
    for p, page in enumerate(pdf.pages):
        bkg = pypdf.PdfReader(f_bkg).pages[0] # background pdf file with image
        bkg.merge_page(page)
        out.add_page(bkg)
    with open('out_with_bg.pdf', 'wb') as f:
        out.write(f)

print(f'Total time: {time.time() - t} sec.')

source.pdf -- 262 k
bg.pdf -- 721 k
out-BG_pdftk.pdf -- 1.0 M
out_with_bg.pdf -- 87 M

@pubpub-zz
Collaborator

pubpub-zz commented Jul 8, 2023

In your process you add, every time, a new page based on a new copy of bg.pdf; the image is therefore duplicated many times, which explains the huge size. This is my proposal:

import pypdf

r = pypdf.PdfReader("bg.pdf")
w = pypdf.PdfWriter(clone_from="source.pdf")

for p in w.pages:
    p.merge_transformed_page(r.pages[0], pypdf.Transformation(), over=False)
    p[pypdf.generic.NameObject("/Contents")] = w._add_object(p.get_contents())

w.write("out.pdf")

The current merge_page does not offer the parameter over; this is why I've used merge_transformed_page. (@MartinThoma, I think it would be great to add the parameter to merge_page, but to be consistent it should precede the existing expand. Are you OK with this?)

With this approach, the image is added only once.

The contents are replaced because currently the Contents is an array of ContentStreams which are not IndirectObjects.
Note also that this leaves some unused objects in the file, increasing the size abnormally. I will propose a PR so that this is no longer necessary in the future.

Finally, I think we should update the documentation.
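
As a rough sanity check of the explanation above, the sizes from the original report can be plugged into a back-of-the-envelope estimate. This is only a sketch: the helper name estimate_output_kb is made up for illustration, "kb" in the report is assumed to mean kilobytes, and compression and per-page overhead are ignored.

```python
def estimate_output_kb(source_kb, background_kb, pages, shared_background):
    """Very rough output size: the source plus one or many background copies."""
    # With a fresh PdfReader per page, every page carries its own image copy.
    copies = 1 if shared_background else pages
    return source_kb + copies * background_kb


# Figures from the first comment: source.pdf 680 KB, bg.pdf 761 KB, 171 pages.
duplicated_kb = estimate_output_kb(680, 761, 171, shared_background=False)
shared_kb = estimate_output_kb(680, 761, 171, shared_background=True)

print(f"one image copy per page: about {duplicated_kb / 1000:.1f} MB")
print(f"one shared image copy:   about {shared_kb / 1000:.1f} MB")
```

The two estimates land in the same range as the 132 MB (pypdf, one copy per page) and 1.5 MB (pdftk) files reported in the issue, which is consistent with the image duplication being the main cause.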

@MartinThoma
Member

The current merge_page does not offer the parameter over; this is why I've used merge_transformed_page. (@MartinThoma, I think it would be great to add the parameter to merge_page, but to be consistent it should precede the existing expand. Are you OK with this?)

Do you mean something like this:

def merge_page(self, page2: "PageObject", expand: bool = False, background: PageObject) -> None:
    ...

?

@pubpub-zz
Collaborator

The current merge_page does not offer the parameter over; this is why I've used merge_transformed_page. (@MartinThoma, I think it would be great to add the parameter to merge_page, but to be consistent it should precede the existing expand. Are you OK with this?)

Do you mean something like this:

def merge_page(self, page2: "PageObject", expand: bool = False, background: PageObject) -> None:
    ...

?

No: page2 is already the background. It would be like:

def merge_page(self, page2: "PageObject", over: bool = True, expand: bool = False) -> None:
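
To illustrate what such an over flag would control, here is a toy model, not pypdf's implementation. It assumes only that a page's content is an ordered list of drawing operations and that later operations paint on top of earlier ones; the function name merge_streams and the string "operations" are invented for the example.

```python
def merge_streams(page_ops, page2_ops, over=True):
    """Return the drawing order after merging page2 into page (toy model)."""
    if over:
        # page2 is drawn last, so it ends up on top (stamp behaviour).
        return page_ops + page2_ops
    # page2 is drawn first, so it ends up underneath (background behaviour).
    return page2_ops + page_ops


text = ["draw text"]
image = ["draw background image"]

stamped = merge_streams(text, image, over=True)
backgrounded = merge_streams(text, image, over=False)
```

With over=False the image operations come first, so the text stays visible above the image, which is the background use case requested in this issue.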

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 8, 2023
add also over param in merge_page

closes py-pdf#1951
MartinThoma pushed a commit that referenced this issue Jul 9, 2023
…rmarking) (#1952)

ENH: Add the `over` parameter to `merge_page`

closes #1951
closes #1953
@ucomru
Author

ucomru commented Jul 23, 2023

It's great!

The method page.merge_page(bg, over=False) with the new over parameter is much better.

  • Going from pypdf 3.12.0 to 3.12.2, my sample PDF file shrank from 6.7 MB to 3.8 MB.

But pdftk still produces a smaller file, 1.5 MB ;-)

Many thanks!
