Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy and paste of pages with annotations #437

Closed
m-holger opened this issue Mar 19, 2021 · 32 comments · Fixed by #453
Closed

Copy and paste of pages with annotations #437

m-holger opened this issue Mar 19, 2021 · 32 comments · Fixed by #453

Comments

@m-holger
Copy link
Contributor

When using PDFarranger to duplicate pages that contain annotations using copy and "paste after", or using "duplicate",
I observed the following:

  • Editing the annotation (using Okular) on one page would modify the annotation on the other page (after save & reload)
  • A new annotation created on one page would also appear on the other page (after save & reload)

Even deleting all annotations (and saving/reloading) does not clear the problem. Any subsequent annotations are still shared by all copies of the page, including new copies created when the pages had no annotations. Even copying such a page into a new file does not clear the problem.

Duplicating pages without annotations works as expected, i.e. any subsequent annotations only affect the page annotated, not any copies of the page.

@kbengs
Copy link
Member

kbengs commented Mar 20, 2021

This sounds like #291 What version of PDF Arranger do you have?

@m-holger
Copy link
Contributor Author

m-holger commented Mar 20, 2021

I run pdfarranger 1.7.0 installed from source on a Mint 20.1, with pikepdf 2.8.0.

To me, this does not sound like the same problem as # 291. Particularly, the problem does occur on files loaded from disk.

I know very little about pdf files. That said, my first impression is:

For each page, there is a container object that contains all the annotations for that page (i.e. all the highlights, boxes and comments) . This container appears to get created when the first annotation is placed on the page, and does not (at least on Okular) ever get deleted, even when all annotations on the page are deleted. When a page is duplicated, the cloned page receives a reference to annotations on the original page rather than a deep copy, which means that the original as well as all cloned pages share all current and future annotations.

I don't know whether my impression is plausible, but that's the best I can come up with at the moment. Unfortunately I do not have time to delve into the code right now.

To demonstrate the problem, I have created a little example, using the following steps

  1. create single page pdf
  2. Open in Okular and highlight the text
  3. save and reload
  4. delete the highlight and save
  5. open in PDFarranger, duplicate page twice and save
  6. open in OKular, add note "this is the first page" to first page only and save
  7. reopened with Xreader - comment visible on all 3 pages

hi.pdf

@kbengs
Copy link
Member

kbengs commented Mar 20, 2021

You are right it's not bug 291. I am able to reproduce it. Annotations seems to be referenced when a page is copied.

@m-holger
Copy link
Contributor Author

The main problem (from a user perspective) is not that the annotations themself are referenced, but that the container holding all annotations on a page is referenced. The only work-around that I have at the moment is to print the affected page to a pdf file and then replace the problem page and its clones with the printed version.

What is troubling about this problem is that it only becomes apparent after a file is saved and reopened. You can happily edit a file with multiple clones of the same page, annotate each of the clones and everything looks perfectly correct, even if you go back to earlier pages - that is, until you reopen the file and realised that all the pages have been stuffed up. Users might (unfairly) perceive that PDFarranger corrupts files. If the issue cannot be resolved quickly, it probably would be good to see how difficult it would be to pop up a warming message if pages with annotations are cloned - I seem to recall that this is done for files with tables of content. (Sorry for telling volunteers who very kindly donate their time what I think they should do. I do appreciate the effort, and I do understand the limited time they have available)

@dreua
Copy link
Member

dreua commented Mar 20, 2021

I would rather fix stuff than add more warnings, but not sure if it is possible on this case. Have you tried those operations with the qpdf cli by any chance? Maybe it can be fixed down there...

@m-holger
Copy link
Contributor Author

m-holger commented Mar 21, 2021

@dreua Sorry, I use an excellent little graphical Python program to manipulate Pdf files, so never had a need to use qpdf cli.

I had a quick look at the qpdf manual and spotted that release 10.3.0 lists "Reworked the code that handles copying annotations and form fields during page operations." I think I'll try upgrading pikepdf to 2.9.1 with embedded qpdf 10.3.1 and see what happens.

@kbengs
Copy link
Member

kbengs commented Mar 21, 2021

This works:

--- a/pdfarranger/exporter.py
+++ b/pdfarranger/exporter.py
@@ -143,7 +143,9 @@ def export(input_files, pages, file_out, mode, mdata):
     pdf_output = pikepdf.Pdf.new()
     pdf_input = [pikepdf.open(p.copyname, password=p.password) for p in input_files]
     for row in pages:
-        current_page = pdf_input[row.nfile - 1].pages[row.npage - 1]
+        pdf_temp = pikepdf.Pdf.new()
+        pdf_temp.pages.append(pdf_input[row.nfile - 1].pages[row.npage - 1])
+        current_page = pdf_temp.pages[0]
         angle = row.angle
         angle0 = current_page.Rotate if '/Rotate' in current_page else 0
         new_page = pdf_output.copy_foreign(current_page)

But it is not ideal. It's about 40% slower to save for a test pdf. Hopefully a better solution can be found.

@dreua
Copy link
Member

dreua commented Mar 21, 2021

@kbengs do I understand correctly that you are forcing a deeper copy by moving the page to a temporary document first? Hopefully I'll find the time to have a closer look later today.

@kbengs
Copy link
Member

kbengs commented Mar 21, 2021

@dreua Yes something like that. It breaks the reference.

@m-holger
Copy link
Contributor Author

@dreua upgrading to pikepdf 2.9.1 did not fix the problem.

Another thought (keeping in mind that I don't really know what I am talking about), core.Pdf.duplicate returns a shallow copy of the page being duplicated. What would be the implications of returning a deep copy instead? (A short answer like "it wouldn't work" would be acceptable - I will try to understand the code when I get a bit more time)

@dreua
Copy link
Member

dreua commented Mar 21, 2021

I don't think we are calling core.Pdf.duplicate, where did you see that? I'm not that deep into PDF structure either, so take my assumptions with a grain of salt: I think an actual deep copy over everything would not be space efficient (duplicating all the images of a scan when splitting pages, resulting in doubled filesize).
For annotations I think a deep copy is what we want, but I'm not sure whether we can simply do this in pdfarranger. I'll have to look into it.

@m-holger
Copy link
Contributor Author

m-holger commented Mar 21, 2021

@dreua Sorry, meant core.Page.duplicate .

You are probably right re space efficiency.

I've tried the following, which seems to work.

from pikepdf import *

temp=Pdf.new()
i=Pdf.open('./hi.pdf')
temp.pages.append(i.pages[0])
i.pages.append(i.pages[0])
i.pages[1]["/Annots"]=i.copy_foreign(temp.pages[0]["/Annots"])  
i.save('./hi2.pdf')

hi2.pdf

EDIT incorrect sections removed

Edit:

I also tried

from pikepdf import *
import copy

i=Pdf.open('./hi4.pdf')
i.pages.append(i.pages[0])
i.pages[1]["/Annots"]= copy.copy(i.pages[1]["/Annots"])
i.save('./hi5.pdf')

This solves some problems - annotations can be deleted independently on only 1 page, and new annotation only appear on the page on which they were created. However, editing an existing annotation affects all copies. Replacing copy.copy with copy.deepcopy does not work at all.

@m-holger
Copy link
Contributor Author

I looked into this a little bit further, and I think there is an issue with the duplicate method in PDF Arranger

I loaded a single page pdf into PDF Arranger and then duplicated the page twice, producing the following output file

4 0 obj
<< /Count 3 /Kids [ 5 0 R 6 0 R 7 0 R ] /Type /Pages >>
endobj
5 0 obj
<< /Annots 8 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /I true /S /Transparency >> /MediaBox [ 0 0 595.303937007900 841.889763779500 ] /Parent 4 0 R /Resources << /Font << /F1 10 0 R >> /ProcSet [ /PDF /Text ] >> /Type /Page >>
endobj
6 0 obj
<< /Annots 8 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /I true /S /Transparency >> /MediaBox [ 0 0 595.303937007900 841.889763779500 ] /Parent 4 0 R /Resources << /Font << /F1 10 0 R >> /ProcSet [ /PDF /Text ] >> /Type /Page >>
endobj
7 0 obj
<< /Annots 8 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /I true /S /Transparency >> /MediaBox [ 0 0 595.303937007900 841.889763779500 ] /Parent 4 0 R /Resources << /Font << /F1 10 0 R >> /ProcSet [ /PDF /Text ] >> /Type /Page >>
endobj

All three pages point to the same annotation object as well as to the same contents object.

This works (and is space efficient) if we are duplicating the pages to provide different views of the same page. It does not work if we are duplicating the page to then annotate or edit the copied pages. In my case, I tend to only annotate without editing - so I quite like being able to duplicate a page without duplicating the content, especially when I annotate scanned pages.

I think we need different duplicate options for different purposes. We probably also need a way for the user to find out which parts of a page are independent from other pages, and which are not. Finally, there probably should be an option to make pages with shared content independent.

@kbengs
Copy link
Member

kbengs commented Mar 26, 2021

The duplicate method does not copy a pdf page, it only copies information about what pdf document and what page number the copied page originate from (a file and page in self.pdfqueue).
@m-holger does the code I suggested work for for you in all cases?
Imo it should be fixed at export. Maybe the right place would be in pikepdf.Pdf.copy_foreign() ? (fixed in pikepdf)

@m-holger
Copy link
Contributor Author

m-holger commented Mar 26, 2021

@kbengs I did not spot any problems with your fix when I tried it. The alternative below also works and does not duplicate the content, which is what I (usually) prefer

Imo it should be fixed at export. Maybe the right place would be in pikepdf.Pdf.copy_foreign() ? (fixed in pikepdf)

I agree. (I don't really understand pikepdf copy_foreign or genobj but neither works as I would expect, but that could be just me) (EDIT AS far as genobj is concerned, it is just me)

@@ -165,12 +165,17 @@ def export(input_files, pages, file_out, mode, mdata):
         new_page = _scale(pdf_output, new_page, row.scale)
 
         # Workraround for pikepdf < 2.7.0
         # https://github.com/pikepdf/pikepdf/issues/174
         new_page = pdf_output.make_indirect(new_page)
-
         pdf_output.pages.append(new_page)
+        if pikepdf.Name.Annots in ǹew_page:
+            pdf_temp = pikepdf.Pdf.new()
+            pdf_temp.pages.append(pdf_input[row.nfile - 1].pages[row.npage - 1])
+            pdf_output.pages[-1].Annots = pdf_output.copy_foreign(pdf_temp.pages[0].Annots)
+
+
 
     if exportmode in ['ALL_TO_MULTIPLE', 'SELECTED_TO_MULTIPLE']:```
<< /Annots 8 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /... 
<< /Annots 11 0 R /Contents 9 0 R /Group << /CS /DeviceRGB ...
<< /Annots 12 0 R /Contents 9 0 R /Group << /CS /DeviceRGB /....

@dreua
Copy link
Member

dreua commented Mar 27, 2021

My personal opinion: I think PDF files are not really meant for editing, therefore doing shallow copies of page content is fine with me and I see not need to change that unless a good reason comes up. Software that allows editing of PDF page contents needs to take shallow copies into account and do a deep copy on write where necessary.
Annotations are a different story, because they are meant to be user-editable and that should work like users expect it to work, i.e. no references between pages or annotations. We need a deep copy down to the annotation's content here.

Have you investigated possibilities to create a deep copy of an annotation object yourself? Could that be a viable and better performing alternative to copying pages around to temporary PDF objects? (Not sure how complicated that might become, there seem to be at least two different annotation styles and I think some viewers allow highlight-marker style markings. Maybe the current approach is more robust and easier, then it's fine with me.)

@dreua
Copy link
Member

dreua commented Mar 27, 2021

I'd say PRs welcome, maybe we can have a look at those magic constants (I see three times -1 without explanation) while we're at it.

@m-holger
Copy link
Contributor Author

I'll prepare a PR, probably tomorrow

@m-holger
Copy link
Contributor Author

@dreua @kbengs

Should the pdf files in exporter.export be explicitly closed?

@dreua
Copy link
Member

dreua commented Mar 28, 2021

I have no idea, how is it done at the moment?

@m-holger
Copy link
Contributor Author

@dreua

Files simply go out of scope and are presumably closed by the garbage collector (eventually?)

(in exporter.num_pages the pdf is closed explicitely)

@m-holger
Copy link
Contributor Author

@dreua

I'd say PRs welcome, maybe we can have a look at those magic constants (I see three times -1 without explanation) while we're at it.

Given that nfile and npage are actually documented, I feel that [xxx-1] for conversion between 1-based and 0-based is a reasonably standard Python idiom. Similarly for [-1] for last list element.
But happy to add comments if you think it would be useful. Please confirm

    class Page:
        def __init__(self, nfile, npage, zoom, copyname, angle, scale, crop, size, basename):
            #: The ID (from 1 to n) of the PDF file owning the page
            self.nfile = nfile
            #: The ID (from 1 to n) of the page in its owner PDF document
            self.npage = npage

@dreua
Copy link
Member

dreua commented Mar 28, 2021

Those comments sound good to me!

m-holger added a commit to m-holger/pdfarranger that referenced this issue Mar 29, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Fixes pdfarranger#437
m-holger added a commit to m-holger/pdfarranger that referenced this issue Mar 29, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Fixes pdfarranger#437
dreua pushed a commit that referenced this issue Mar 30, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Fixes #437
@kbengs
Copy link
Member

kbengs commented Apr 3, 2021

With this file:
https://readthedocs.org/projects/pikepdf/downloads/pdf/latest/
I encountered this error on save:

    pdf_output.pages[-1].Annots = pdf_output.copy_foreign(pdf_temp.pages[0].Annots)
pikepdf._qpdf.ForeignObjectError: pikepdf.copy_foreign called with direct object handle

I think a "make_indirect" is needed.

@m-holger
Copy link
Contributor Author

m-holger commented Apr 3, 2021

@kbengs I am not convinced that there is a problem in the first place if the Annots are a direct object. If so, only change would be to make sure we don't do anything:

      if (pikepdf.Name.Annots in ǹew_page) and (not new_page.Annots.is_indirect())

I will look into that.

@dreua dreua reopened this Apr 4, 2021
@dreua
Copy link
Member

dreua commented Apr 4, 2021

I think an error on save is a problem, thanks for spotting this, kbengs!

@dreua
Copy link
Member

dreua commented Apr 4, 2021

@m-holger you may be right but I think an additional make_indirect doesn't hurt and is easier to read and maintain. I like to keep it simple if possible. Correct me if you think this has a significant performance impact, I didn't analyse that!

m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 4, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Fixes pdfarranger#437
m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 4, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Ensure annotations are stored as indirect objects
Fixes pdfarranger#437
Fixes pdfarranger#437 (comment)
@m-holger
Copy link
Contributor Author

m-holger commented Apr 4, 2021

Even when new_page.Annots are direct, the actual annotations new_page.Annots[??] may be indirect and therefore modifying the if statement does not work, and the performance hit is therefore unavoidable.

The good news is that making the Annots indirect overcomes another issue I had, allowing me to make the code cleaner and more efficient, more than compensating for the performance hit from the call to make_indirect

@dreua I have submitted a PR

m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 4, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Ensure annotations are stored as indirect objects
Fixes pdfarranger#437
Fixes pdfarranger#437 (comment)
m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 4, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Ensure annotations are stored as indirect objects
Fixes pdfarranger#437
Fixes pdfarranger#437 (comment)
m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 5, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Ensure annotations are stored as indirect objects
Fixes pdfarranger#437
Fixes pdfarranger#437 (comment)
m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 5, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Ensure annotations are stored as indirect objects
Fixes pdfarranger#437
Fixes pdfarranger#437 (comment)
m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 5, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Ensure annotations are stored as indirect objects
Fixes pdfarranger#437
Fixes pdfarranger#437 (comment)
m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 8, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Ensure annotations are stored as indirect objects
Maintain compatibility with pikepdf < 2.8.0
Fixes pdfarranger#437
Fixes pdfarranger#437 (comment)
m-holger added a commit to m-holger/pdfarranger that referenced this issue Apr 8, 2021
Ensure duplicated pages receive a copy of the annotations, not a reference
Ensure annotations are stored as indirect objects
Maintain compatibility with pikepdf < 2.8.0
Fixes pdfarranger#437
Fixes pdfarranger#437 (comment)
@kbengs
Copy link
Member

kbengs commented May 29, 2021

I think this is not relevant any more since abb3f02. I can reopen if I was wrong.

@kbengs kbengs closed this as completed May 29, 2021
@dreua
Copy link
Member

dreua commented Jul 13, 2023

In the qpdf release notes I read:

11.5.0: July 9, 2023

  • Bug Fixes
    When copying the same page more than once, ensure that annotations are copied and not shared among multiple pages.

It looks like @m-holger went down into the machine room to properly take care of it, much appreciated! Once qpdf 11.5.0 is widespread enough we could have a dependency on this version and clean up the workaround here.

@m-holger
Copy link
Contributor Author

This is a bug fix to the qpdf application (and the qpdf job interface).

For qpdflib, annotations need to be explicitly fixed using a method called 'fixCopiedAnnotations'. This method is not exposed in pikepdf and therefore the workaround remains necessary.

@m-holger
Copy link
Contributor Author

(unless and until the job interface is used to generate the initial output file, which incidentally preserves outlines from the main input file. Incidentally, it looks like pikepdf 8.0 will expose Job.create_pdf)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants