Copy and paste of pages with annotations #437

m-holger · 2021-03-19T19:13:36Z

When using PDFarranger to duplicate pages that contain annotations using copy and "paste after", or using "duplicate",
I observed the following:

Editing the annotation (using Okular) on one page would modify the annotation on the other page (after save & reload)
A new annotation created on one page would also appear on the other page (after save & reload)

Even deleting all annotations (and saving/reloading) does not clear the problem. Any subsequent annotations are still shared by all copies of the page, including new copies created when the pages had no annotations. Even copying such a page into a new file does not clear the problem.

Duplicating pages without annotations works as expected, i.e. any subsequent annotations only affect the page annotated, not any copies of the page.

kbengs · 2021-03-20T07:06:56Z

This sounds like #291 What version of PDF Arranger do you have?

m-holger · 2021-03-20T11:16:20Z

I run pdfarranger 1.7.0 installed from source on a Mint 20.1, with pikepdf 2.8.0.

To me, this does not sound like the same problem as # 291. Particularly, the problem does occur on files loaded from disk.

I know very little about pdf files. That said, my first impression is:

For each page, there is a container object that contains all the annotations for that page (i.e. all the highlights, boxes and comments) . This container appears to get created when the first annotation is placed on the page, and does not (at least on Okular) ever get deleted, even when all annotations on the page are deleted. When a page is duplicated, the cloned page receives a reference to annotations on the original page rather than a deep copy, which means that the original as well as all cloned pages share all current and future annotations.

I don't know whether my impression is plausible, but that's the best I can come up with at the moment. Unfortunately I do not have time to delve into the code right now.

To demonstrate the problem, I have created a little example, using the following steps

create single page pdf
Open in Okular and highlight the text
save and reload
delete the highlight and save
open in PDFarranger, duplicate page twice and save
open in OKular, add note "this is the first page" to first page only and save
reopened with Xreader - comment visible on all 3 pages

hi.pdf

kbengs · 2021-03-20T12:29:52Z

You are right it's not bug 291. I am able to reproduce it. Annotations seems to be referenced when a page is copied.

m-holger · 2021-03-20T14:28:19Z

The main problem (from a user perspective) is not that the annotations themself are referenced, but that the container holding all annotations on a page is referenced. The only work-around that I have at the moment is to print the affected page to a pdf file and then replace the problem page and its clones with the printed version.

What is troubling about this problem is that it only becomes apparent after a file is saved and reopened. You can happily edit a file with multiple clones of the same page, annotate each of the clones and everything looks perfectly correct, even if you go back to earlier pages - that is, until you reopen the file and realised that all the pages have been stuffed up. Users might (unfairly) perceive that PDFarranger corrupts files. If the issue cannot be resolved quickly, it probably would be good to see how difficult it would be to pop up a warming message if pages with annotations are cloned - I seem to recall that this is done for files with tables of content. (Sorry for telling volunteers who very kindly donate their time what I think they should do. I do appreciate the effort, and I do understand the limited time they have available)

dreua · 2021-03-20T23:49:22Z

I would rather fix stuff than add more warnings, but not sure if it is possible on this case. Have you tried those operations with the qpdf cli by any chance? Maybe it can be fixed down there...

m-holger · 2021-03-21T01:13:57Z

@dreua Sorry, I use an excellent little graphical Python program to manipulate Pdf files, so never had a need to use qpdf cli.

I had a quick look at the qpdf manual and spotted that release 10.3.0 lists "Reworked the code that handles copying annotations and form fields during page operations." I think I'll try upgrading pikepdf to 2.9.1 with embedded qpdf 10.3.1 and see what happens.

kbengs · 2021-03-21T04:44:35Z

This works:

--- a/pdfarranger/exporter.py
+++ b/pdfarranger/exporter.py
@@ -143,7 +143,9 @@ def export(input_files, pages, file_out, mode, mdata):
     pdf_output = pikepdf.Pdf.new()
     pdf_input = [pikepdf.open(p.copyname, password=p.password) for p in input_files]
     for row in pages:
-        current_page = pdf_input[row.nfile - 1].pages[row.npage - 1]
+        pdf_temp = pikepdf.Pdf.new()
+        pdf_temp.pages.append(pdf_input[row.nfile - 1].pages[row.npage - 1])
+        current_page = pdf_temp.pages[0]
         angle = row.angle
         angle0 = current_page.Rotate if '/Rotate' in current_page else 0
         new_page = pdf_output.copy_foreign(current_page)

But it is not ideal. It's about 40% slower to save for a test pdf. Hopefully a better solution can be found.

dreua · 2021-03-21T07:09:15Z

@kbengs do I understand correctly that you are forcing a deeper copy by moving the page to a temporary document first? Hopefully I'll find the time to have a closer look later today.

kbengs · 2021-03-21T07:32:09Z

@dreua Yes something like that. It breaks the reference.

m-holger · 2021-03-21T10:38:33Z

@dreua upgrading to pikepdf 2.9.1 did not fix the problem.

Another thought (keeping in mind that I don't really know what I am talking about), core.Pdf.duplicate returns a shallow copy of the page being duplicated. What would be the implications of returning a deep copy instead? (A short answer like "it wouldn't work" would be acceptable - I will try to understand the code when I get a bit more time)

dreua · 2021-03-21T10:44:37Z

I don't think we are calling core.Pdf.duplicate, where did you see that? I'm not that deep into PDF structure either, so take my assumptions with a grain of salt: I think an actual deep copy over everything would not be space efficient (duplicating all the images of a scan when splitting pages, resulting in doubled filesize).
For annotations I think a deep copy is what we want, but I'm not sure whether we can simply do this in pdfarranger. I'll have to look into it.

m-holger · 2021-03-21T12:56:35Z

@dreua Sorry, meant core.Page.duplicate .

You are probably right re space efficiency.

I've tried the following, which seems to work.

from pikepdf import *

temp=Pdf.new()
i=Pdf.open('./hi.pdf')
temp.pages.append(i.pages[0])
i.pages.append(i.pages[0])
i.pages[1]["/Annots"]=i.copy_foreign(temp.pages[0]["/Annots"])  
i.save('./hi2.pdf')

hi2.pdf

EDIT incorrect sections removed

Edit:

I also tried

from pikepdf import *
import copy

i=Pdf.open('./hi4.pdf')
i.pages.append(i.pages[0])
i.pages[1]["/Annots"]= copy.copy(i.pages[1]["/Annots"])
i.save('./hi5.pdf')

This solves some problems - annotations can be deleted independently on only 1 page, and new annotation only appear on the page on which they were created. However, editing an existing annotation affects all copies. Replacing copy.copy with copy.deepcopy does not work at all.

m-holger · 2021-03-25T18:48:27Z

I looked into this a little bit further, and I think there is an issue with the duplicate method in PDF Arranger

I loaded a single page pdf into PDF Arranger and then duplicated the page twice, producing the following output file

4 0 obj
<< /Count 3 /Kids [ 5 0 R 6 0 R 7 0 R ] /Type /Pages >>
endobj
5 0 obj
<< /Annots 8 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /I true /S /Transparency >> /MediaBox [ 0 0 595.303937007900 841.889763779500 ] /Parent 4 0 R /Resources << /Font << /F1 10 0 R >> /ProcSet [ /PDF /Text ] >> /Type /Page >>
endobj
6 0 obj
<< /Annots 8 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /I true /S /Transparency >> /MediaBox [ 0 0 595.303937007900 841.889763779500 ] /Parent 4 0 R /Resources << /Font << /F1 10 0 R >> /ProcSet [ /PDF /Text ] >> /Type /Page >>
endobj
7 0 obj
<< /Annots 8 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /I true /S /Transparency >> /MediaBox [ 0 0 595.303937007900 841.889763779500 ] /Parent 4 0 R /Resources << /Font << /F1 10 0 R >> /ProcSet [ /PDF /Text ] >> /Type /Page >>
endobj

All three pages point to the same annotation object as well as to the same contents object.

This works (and is space efficient) if we are duplicating the pages to provide different views of the same page. It does not work if we are duplicating the page to then annotate or edit the copied pages. In my case, I tend to only annotate without editing - so I quite like being able to duplicate a page without duplicating the content, especially when I annotate scanned pages.

I think we need different duplicate options for different purposes. We probably also need a way for the user to find out which parts of a page are independent from other pages, and which are not. Finally, there probably should be an option to make pages with shared content independent.

kbengs · 2021-03-26T02:46:13Z

The duplicate method does not copy a pdf page, it only copies information about what pdf document and what page number the copied page originate from (a file and page in self.pdfqueue).
@m-holger does the code I suggested work for for you in all cases?
Imo it should be fixed at export. Maybe the right place would be in pikepdf.Pdf.copy_foreign() ? (fixed in pikepdf)

m-holger · 2021-03-26T10:13:40Z

@kbengs I did not spot any problems with your fix when I tried it. The alternative below also works and does not duplicate the content, which is what I (usually) prefer

Imo it should be fixed at export. Maybe the right place would be in pikepdf.Pdf.copy_foreign() ? (fixed in pikepdf)

I agree. (I don't really understand pikepdf copy_foreign or genobj but neither works as I would expect, but that could be just me) (EDIT AS far as genobj is concerned, it is just me)

@@ -165,12 +165,17 @@ def export(input_files, pages, file_out, mode, mdata):
         new_page = _scale(pdf_output, new_page, row.scale)
 
         # Workraround for pikepdf < 2.7.0
         # https://github.com/pikepdf/pikepdf/issues/174
         new_page = pdf_output.make_indirect(new_page)
-
         pdf_output.pages.append(new_page)
+        if pikepdf.Name.Annots in ǹew_page:
+            pdf_temp = pikepdf.Pdf.new()
+            pdf_temp.pages.append(pdf_input[row.nfile - 1].pages[row.npage - 1])
+            pdf_output.pages[-1].Annots = pdf_output.copy_foreign(pdf_temp.pages[0].Annots)
+
+
 
     if exportmode in ['ALL_TO_MULTIPLE', 'SELECTED_TO_MULTIPLE']:```

<< /Annots 8 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /... 
<< /Annots 11 0 R /Contents 9 0 R /Group << /CS /DeviceRGB ...
<< /Annots 12 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /....

dreua · 2021-03-27T21:44:29Z

My personal opinion: I think PDF files are not really meant for editing, therefore doing shallow copies of page content is fine with me and I see not need to change that unless a good reason comes up. Software that allows editing of PDF page contents needs to take shallow copies into account and do a deep copy on write where necessary.
Annotations are a different story, because they are meant to be user-editable and that should work like users expect it to work, i.e. no references between pages or annotations. We need a deep copy down to the annotation's content here.

Have you investigated possibilities to create a deep copy of an annotation object yourself? Could that be a viable and better performing alternative to copying pages around to temporary PDF objects? (Not sure how complicated that might become, there seem to be at least two different annotation styles and I think some viewers allow highlight-marker style markings. Maybe the current approach is more robust and easier, then it's fine with me.)

dreua · 2021-03-27T21:47:05Z

I'd say PRs welcome, maybe we can have a look at those magic constants (I see three times -1 without explanation) while we're at it.

m-holger · 2021-03-28T13:33:53Z

I'll prepare a PR, probably tomorrow

m-holger · 2021-03-28T13:42:40Z

@dreua @kbengs

Should the pdf files in exporter.export be explicitly closed?

dreua · 2021-03-28T14:00:28Z

I have no idea, how is it done at the moment?

m-holger · 2021-03-28T14:07:26Z

@dreua

Files simply go out of scope and are presumably closed by the garbage collector (eventually?)

(in exporter.num_pages the pdf is closed explicitely)

m-holger · 2021-03-28T17:54:19Z

@dreua

I'd say PRs welcome, maybe we can have a look at those magic constants (I see three times -1 without explanation) while we're at it.

Given that nfile and npage are actually documented, I feel that [xxx-1] for conversion between 1-based and 0-based is a reasonably standard Python idiom. Similarly for [-1] for last list element.
But happy to add comments if you think it would be useful. Please confirm

    class Page:
        def __init__(self, nfile, npage, zoom, copyname, angle, scale, crop, size, basename):
            #: The ID (from 1 to n) of the PDF file owning the page
            self.nfile = nfile
            #: The ID (from 1 to n) of the page in its owner PDF document
            self.npage = npage

dreua · 2021-03-28T18:15:13Z

Those comments sound good to me!

Ensure duplicated pages receive a copy of the annotations, not a reference Fixes pdfarranger#437

Ensure duplicated pages receive a copy of the annotations, not a reference Fixes #437

kbengs · 2021-04-03T11:26:00Z

With this file:
https://readthedocs.org/projects/pikepdf/downloads/pdf/latest/
I encountered this error on save:

    pdf_output.pages[-1].Annots = pdf_output.copy_foreign(pdf_temp.pages[0].Annots)
pikepdf._qpdf.ForeignObjectError: pikepdf.copy_foreign called with direct object handle

I think a "make_indirect" is needed.

m-holger · 2021-04-03T17:21:49Z

@kbengs I am not convinced that there is a problem in the first place if the Annots are a direct object. If so, only change would be to make sure we don't do anything:

      if (pikepdf.Name.Annots in ǹew_page) and (not new_page.Annots.is_indirect())

I will look into that.

dreua · 2021-04-04T07:25:46Z

I think an error on save is a problem, thanks for spotting this, kbengs!

dreua · 2021-04-04T07:29:40Z

@m-holger you may be right but I think an additional make_indirect doesn't hurt and is easier to read and maintain. I like to keep it simple if possible. Correct me if you think this has a significant performance impact, I didn't analyse that!

Ensure duplicated pages receive a copy of the annotations, not a reference Fixes pdfarranger#437

Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)

m-holger · 2021-04-04T15:40:32Z

Even when new_page.Annots are direct, the actual annotations new_page.Annots[??] may be indirect and therefore modifying the if statement does not work, and the performance hit is therefore unavoidable.

The good news is that making the Annots indirect overcomes another issue I had, allowing me to make the code cleaner and more efficient, more than compensating for the performance hit from the call to make_indirect

@dreua I have submitted a PR

Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)

Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Maintain compatibility with pikepdf < 2.8.0 Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)

kbengs · 2021-05-29T04:30:53Z

I think this is not relevant any more since abb3f02. I can reopen if I was wrong.

dreua · 2023-07-13T08:42:13Z

In the qpdf release notes I read:

11.5.0: July 9, 2023

Bug Fixes
When copying the same page more than once, ensure that annotations are copied and not shared among multiple pages.

It looks like @m-holger went down into the machine room to properly take care of it, much appreciated! Once qpdf 11.5.0 is widespread enough we could have a dependency on this version and clean up the workaround here.

m-holger · 2023-07-13T10:02:33Z

This is a bug fix to the qpdf application (and the qpdf job interface).

For qpdflib, annotations need to be explicitly fixed using a method called 'fixCopiedAnnotations'. This method is not exposed in pikepdf and therefore the workaround remains necessary.

m-holger · 2023-07-13T10:15:06Z

(unless and until the job interface is used to generate the initial output file, which incidentally preserves outlines from the main input file. Incidentally, it looks like pikepdf 8.0 will expose Job.create_pdf)

m-holger added a commit to m-holger/pdfarranger that referenced this issue Mar 29, 2021

Fix error with annotations on duplicated pages

63bd3da

Ensure duplicated pages receive a copy of the annotations, not a reference Fixes pdfarranger#437

m-holger mentioned this issue Mar 29, 2021

Fix error with annotations on duplicated pages #453

Merged

m-holger added a commit to m-holger/pdfarranger that referenced this issue Mar 29, 2021

Fix error with annotations on duplicated pages

96f1b3a

Ensure duplicated pages receive a copy of the annotations, not a reference Fixes pdfarranger#437

dreua closed this as completed in #453 Mar 30, 2021

dreua pushed a commit that referenced this issue Mar 30, 2021

Fix error with annotations on duplicated pages

c8e58d9

Ensure duplicated pages receive a copy of the annotations, not a reference Fixes #437

dreua reopened this Apr 4, 2021

m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 4, 2021

Fix error with annotations on duplicated pages

4f44a6c

Ensure duplicated pages receive a copy of the annotations, not a reference Fixes pdfarranger#437

m-holger mentioned this issue Apr 4, 2021

Fix error with annotations on duplicated pages #459

Closed

kbengs closed this as completed May 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Copy and paste of pages with annotations #437

Copy and paste of pages with annotations #437

m-holger commented Mar 19, 2021

kbengs commented Mar 20, 2021

m-holger commented Mar 20, 2021 •

edited

kbengs commented Mar 20, 2021

m-holger commented Mar 20, 2021

dreua commented Mar 20, 2021 •

edited

m-holger commented Mar 21, 2021 •

edited

kbengs commented Mar 21, 2021 •

edited by dreua

dreua commented Mar 21, 2021

kbengs commented Mar 21, 2021

m-holger commented Mar 21, 2021

dreua commented Mar 21, 2021 •

edited

m-holger commented Mar 21, 2021 •

edited

m-holger commented Mar 25, 2021

kbengs commented Mar 26, 2021 •

edited

m-holger commented Mar 26, 2021 •

edited

dreua commented Mar 27, 2021 •

edited

dreua commented Mar 27, 2021 •

edited

m-holger commented Mar 28, 2021

m-holger commented Mar 28, 2021

dreua commented Mar 28, 2021

m-holger commented Mar 28, 2021

m-holger commented Mar 28, 2021

dreua commented Mar 28, 2021

kbengs commented Apr 3, 2021

m-holger commented Apr 3, 2021

dreua commented Apr 4, 2021

dreua commented Apr 4, 2021

m-holger commented Apr 4, 2021 •

edited

kbengs commented May 29, 2021

dreua commented Jul 13, 2023

m-holger commented Jul 13, 2023

m-holger commented Jul 13, 2023

Copy and paste of pages with annotations #437

Copy and paste of pages with annotations #437

Comments

m-holger commented Mar 19, 2021

kbengs commented Mar 20, 2021

m-holger commented Mar 20, 2021 • edited

kbengs commented Mar 20, 2021

m-holger commented Mar 20, 2021

dreua commented Mar 20, 2021 • edited

m-holger commented Mar 21, 2021 • edited

kbengs commented Mar 21, 2021 • edited by dreua

dreua commented Mar 21, 2021

kbengs commented Mar 21, 2021

m-holger commented Mar 21, 2021

dreua commented Mar 21, 2021 • edited

m-holger commented Mar 21, 2021 • edited

m-holger commented Mar 25, 2021

kbengs commented Mar 26, 2021 • edited

m-holger commented Mar 26, 2021 • edited

dreua commented Mar 27, 2021 • edited

dreua commented Mar 27, 2021 • edited

m-holger commented Mar 28, 2021

m-holger commented Mar 28, 2021

dreua commented Mar 28, 2021

m-holger commented Mar 28, 2021

m-holger commented Mar 28, 2021

dreua commented Mar 28, 2021

kbengs commented Apr 3, 2021

m-holger commented Apr 3, 2021

dreua commented Apr 4, 2021

dreua commented Apr 4, 2021

m-holger commented Apr 4, 2021 • edited

kbengs commented May 29, 2021

dreua commented Jul 13, 2023

m-holger commented Jul 13, 2023

m-holger commented Jul 13, 2023

m-holger commented Mar 20, 2021 •

edited

dreua commented Mar 20, 2021 •

edited

m-holger commented Mar 21, 2021 •

edited

kbengs commented Mar 21, 2021 •

edited by dreua

dreua commented Mar 21, 2021 •

edited

m-holger commented Mar 21, 2021 •

edited

kbengs commented Mar 26, 2021 •

edited

m-holger commented Mar 26, 2021 •

edited

dreua commented Mar 27, 2021 •

edited

dreua commented Mar 27, 2021 •

edited

m-holger commented Apr 4, 2021 •

edited