New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copy and paste of pages with annotations #437
Comments
This sounds like #291 What version of PDF Arranger do you have? |
I run pdfarranger 1.7.0 installed from source on a Mint 20.1, with pikepdf 2.8.0. To me, this does not sound like the same problem as # 291. Particularly, the problem does occur on files loaded from disk. I know very little about pdf files. That said, my first impression is: For each page, there is a container object that contains all the annotations for that page (i.e. all the highlights, boxes and comments) . This container appears to get created when the first annotation is placed on the page, and does not (at least on Okular) ever get deleted, even when all annotations on the page are deleted. When a page is duplicated, the cloned page receives a reference to annotations on the original page rather than a deep copy, which means that the original as well as all cloned pages share all current and future annotations. I don't know whether my impression is plausible, but that's the best I can come up with at the moment. Unfortunately I do not have time to delve into the code right now. To demonstrate the problem, I have created a little example, using the following steps
|
You are right it's not bug 291. I am able to reproduce it. Annotations seems to be referenced when a page is copied. |
The main problem (from a user perspective) is not that the annotations themself are referenced, but that the container holding all annotations on a page is referenced. The only work-around that I have at the moment is to print the affected page to a pdf file and then replace the problem page and its clones with the printed version. What is troubling about this problem is that it only becomes apparent after a file is saved and reopened. You can happily edit a file with multiple clones of the same page, annotate each of the clones and everything looks perfectly correct, even if you go back to earlier pages - that is, until you reopen the file and realised that all the pages have been stuffed up. Users might (unfairly) perceive that PDFarranger corrupts files. If the issue cannot be resolved quickly, it probably would be good to see how difficult it would be to pop up a warming message if pages with annotations are cloned - I seem to recall that this is done for files with tables of content. (Sorry for telling volunteers who very kindly donate their time what I think they should do. I do appreciate the effort, and I do understand the limited time they have available) |
I would rather fix stuff than add more warnings, but not sure if it is possible on this case. Have you tried those operations with the qpdf cli by any chance? Maybe it can be fixed down there... |
@dreua Sorry, I use an excellent little graphical Python program to manipulate Pdf files, so never had a need to use qpdf cli. I had a quick look at the qpdf manual and spotted that release 10.3.0 lists "Reworked the code that handles copying annotations and form fields during page operations." I think I'll try upgrading pikepdf to 2.9.1 with embedded qpdf 10.3.1 and see what happens. |
This works: --- a/pdfarranger/exporter.py
+++ b/pdfarranger/exporter.py
@@ -143,7 +143,9 @@ def export(input_files, pages, file_out, mode, mdata):
pdf_output = pikepdf.Pdf.new()
pdf_input = [pikepdf.open(p.copyname, password=p.password) for p in input_files]
for row in pages:
- current_page = pdf_input[row.nfile - 1].pages[row.npage - 1]
+ pdf_temp = pikepdf.Pdf.new()
+ pdf_temp.pages.append(pdf_input[row.nfile - 1].pages[row.npage - 1])
+ current_page = pdf_temp.pages[0]
angle = row.angle
angle0 = current_page.Rotate if '/Rotate' in current_page else 0
new_page = pdf_output.copy_foreign(current_page) But it is not ideal. It's about 40% slower to save for a test pdf. Hopefully a better solution can be found. |
@kbengs do I understand correctly that you are forcing a deeper copy by moving the page to a temporary document first? Hopefully I'll find the time to have a closer look later today. |
@dreua Yes something like that. It breaks the reference. |
@dreua upgrading to pikepdf 2.9.1 did not fix the problem. Another thought (keeping in mind that I don't really know what I am talking about), core.Pdf.duplicate returns a shallow copy of the page being duplicated. What would be the implications of returning a deep copy instead? (A short answer like "it wouldn't work" would be acceptable - I will try to understand the code when I get a bit more time) |
I don't think we are calling |
@dreua Sorry, meant You are probably right re space efficiency. I've tried the following, which seems to work. from pikepdf import *
temp=Pdf.new()
i=Pdf.open('./hi.pdf')
temp.pages.append(i.pages[0])
i.pages.append(i.pages[0])
i.pages[1]["/Annots"]=i.copy_foreign(temp.pages[0]["/Annots"])
i.save('./hi2.pdf') EDIT incorrect sections removed Edit: I also tried from pikepdf import *
import copy
i=Pdf.open('./hi4.pdf')
i.pages.append(i.pages[0])
i.pages[1]["/Annots"]= copy.copy(i.pages[1]["/Annots"])
i.save('./hi5.pdf') This solves some problems - annotations can be deleted independently on only 1 page, and new annotation only appear on the page on which they were created. However, editing an existing annotation affects all copies. Replacing |
I looked into this a little bit further, and I think there is an issue with the duplicate method in PDF Arranger I loaded a single page pdf into PDF Arranger and then duplicated the page twice, producing the following output file
All three pages point to the same annotation object as well as to the same contents object. This works (and is space efficient) if we are duplicating the pages to provide different views of the same page. It does not work if we are duplicating the page to then annotate or edit the copied pages. In my case, I tend to only annotate without editing - so I quite like being able to duplicate a page without duplicating the content, especially when I annotate scanned pages. I think we need different duplicate options for different purposes. We probably also need a way for the user to find out which parts of a page are independent from other pages, and which are not. Finally, there probably should be an option to make pages with shared content independent. |
The duplicate method does not copy a pdf page, it only copies information about what pdf document and what page number the copied page originate from (a file and page in self.pdfqueue). |
@kbengs I did not spot any problems with your fix when I tried it. The alternative below also works and does not duplicate the content, which is what I (usually) prefer
I agree. (I don't really understand pikepdf copy_foreign or genobj but neither works as I would expect, but that could be just me) (EDIT AS far as genobj is concerned, it is just me) @@ -165,12 +165,17 @@ def export(input_files, pages, file_out, mode, mdata):
new_page = _scale(pdf_output, new_page, row.scale)
# Workraround for pikepdf < 2.7.0
# https://github.com/pikepdf/pikepdf/issues/174
new_page = pdf_output.make_indirect(new_page)
-
pdf_output.pages.append(new_page)
+ if pikepdf.Name.Annots in ǹew_page:
+ pdf_temp = pikepdf.Pdf.new()
+ pdf_temp.pages.append(pdf_input[row.nfile - 1].pages[row.npage - 1])
+ pdf_output.pages[-1].Annots = pdf_output.copy_foreign(pdf_temp.pages[0].Annots)
+
+
if exportmode in ['ALL_TO_MULTIPLE', 'SELECTED_TO_MULTIPLE']:```
|
My personal opinion: I think PDF files are not really meant for editing, therefore doing shallow copies of page content is fine with me and I see not need to change that unless a good reason comes up. Software that allows editing of PDF page contents needs to take shallow copies into account and do a deep copy on write where necessary. Have you investigated possibilities to create a deep copy of an annotation object yourself? Could that be a viable and better performing alternative to copying pages around to temporary PDF objects? (Not sure how complicated that might become, there seem to be at least two different annotation styles and I think some viewers allow highlight-marker style markings. Maybe the current approach is more robust and easier, then it's fine with me.) |
I'd say PRs welcome, maybe we can have a look at those magic constants (I see three times -1 without explanation) while we're at it. |
I'll prepare a PR, probably tomorrow |
I have no idea, how is it done at the moment? |
Files simply go out of scope and are presumably closed by the garbage collector (eventually?) (in |
Given that nfile and npage are actually documented, I feel that [xxx-1] for conversion between 1-based and 0-based is a reasonably standard Python idiom. Similarly for [-1] for last list element. class Page:
def __init__(self, nfile, npage, zoom, copyname, angle, scale, crop, size, basename):
#: The ID (from 1 to n) of the PDF file owning the page
self.nfile = nfile
#: The ID (from 1 to n) of the page in its owner PDF document
self.npage = npage |
Those comments sound good to me! |
Ensure duplicated pages receive a copy of the annotations, not a reference Fixes pdfarranger#437
Ensure duplicated pages receive a copy of the annotations, not a reference Fixes pdfarranger#437
Ensure duplicated pages receive a copy of the annotations, not a reference Fixes #437
With this file:
I think a "make_indirect" is needed. |
@kbengs I am not convinced that there is a problem in the first place if the Annots are a direct object. If so, only change would be to make sure we don't do anything: if (pikepdf.Name.Annots in ǹew_page) and (not new_page.Annots.is_indirect()) I will look into that. |
I think an error on save is a problem, thanks for spotting this, kbengs! |
@m-holger you may be right but I think an additional make_indirect doesn't hurt and is easier to read and maintain. I like to keep it simple if possible. Correct me if you think this has a significant performance impact, I didn't analyse that! |
Ensure duplicated pages receive a copy of the annotations, not a reference Fixes pdfarranger#437
Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)
Even when The good news is that making the Annots indirect overcomes another issue I had, allowing me to make the code cleaner and more efficient, more than compensating for the performance hit from the call to @dreua I have submitted a PR |
Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)
Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)
Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)
Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)
Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)
Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Maintain compatibility with pikepdf < 2.8.0 Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)
Ensure duplicated pages receive a copy of the annotations, not a reference Ensure annotations are stored as indirect objects Maintain compatibility with pikepdf < 2.8.0 Fixes pdfarranger#437 Fixes pdfarranger#437 (comment)
I think this is not relevant any more since abb3f02. I can reopen if I was wrong. |
In the qpdf release notes I read:
It looks like @m-holger went down into the machine room to properly take care of it, much appreciated! Once qpdf 11.5.0 is widespread enough we could have a dependency on this version and clean up the workaround here. |
This is a bug fix to the qpdf application (and the qpdf job interface). For qpdflib, annotations need to be explicitly fixed using a method called 'fixCopiedAnnotations'. This method is not exposed in pikepdf and therefore the workaround remains necessary. |
(unless and until the job interface is used to generate the initial output file, which incidentally preserves outlines from the main input file. Incidentally, it looks like pikepdf 8.0 will expose Job.create_pdf) |
When using PDFarranger to duplicate pages that contain annotations using copy and "paste after", or using "duplicate",
I observed the following:
Even deleting all annotations (and saving/reloading) does not clear the problem. Any subsequent annotations are still shared by all copies of the page, including new copies created when the pages had no annotations. Even copying such a page into a new file does not clear the problem.
Duplicating pages without annotations works as expected, i.e. any subsequent annotations only affect the page annotated, not any copies of the page.
The text was updated successfully, but these errors were encountered: