cloneDocumentFromReader produces a document with all blank pages #219
Comments
Example document @ http://www.wmmi.net/documents/Bugs/PyPDF-114xSizeAfterRotation-Example.pdf I cannot upload a PDF document to the issue.
For the time being, when you need to clone the document down to the page structure (any page trees are preserved), use I think that If you open the PDF output of your program in a text editor, you will notice that there are many references to indirect objects that are never defined within the document.
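The text-editor check described above can be roughly automated. The following is only a sketch: the function name `find_dangling_references` is made up for illustration, it works on the raw bytes without any PDF library, and it will miss objects stored inside compressed object streams (and may be confused by binary stream data), so treat its output as a hint rather than a verdict.

```python
import re

# "12 0 obj" introduces an object definition; "12 0 R" is a reference to one.
OBJ_DEF = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")
OBJ_REF = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+R\b")


def find_dangling_references(pdf_bytes):
    """Return sorted (id, generation) pairs that are referenced but never defined.

    Hypothetical helper: a plain regex scan of the raw file, not a real parser.
    """
    defined = {(int(n), int(g)) for n, g in OBJ_DEF.findall(pdf_bytes)}
    referenced = {(int(n), int(g)) for n, g in OBJ_REF.findall(pdf_bytes)}
    return sorted(referenced - defined)


# Tiny hand-written fragment: object 3 is referenced in /Kids but never defined.
sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
)
dangling = find_dangling_references(sample)
```

To try it on a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in.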
I have a potential solution for
'''
Create a copy (clone) of a document from a PDF file reader

:param reader: PDF file reader instance from which the clone
    should be created.
:callback after_page_append (function): Callback function that is invoked after
    each page is appended to the writer. Signature includes a reference to the
    appended page (delegates to appendPagesFromReader). Callback signature:

    :param writer_pageref (PDF page reference): Reference to the page just
        appended to the document.
'''
debug = False
if debug:
    print("Number of Objects: %d" % len(self._objects))
    for obj in self._objects:
        print("\tObject is %r" % obj)
        if hasattr(obj, "indirectRef") and obj.indirectRef is not None:
            print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

# Variables used after cloning the root to
# improve the pre- and post-cloning experience
mustAddTogether = False
newInfoRef = self._info
oldPagesRef = self._pages
oldPages = self.getObject(self._pages)

# If any pages have already been added
if oldPages[NameObject("/Count")] > 0:
    # Keep them
    mustAddTogether = True
else:
    # Throw the page object out
    if oldPages in self._objects:
        newInfoRef = self._pages
        self._objects.remove(oldPages)

# Clone the reader's root document
self.cloneReaderDocumentRoot(reader)
if not self._root:
    self._root = self._addObject(self._root_object)

# Sweep for all indirect references
externalReferenceMap = {}
self.stack = []
newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

# Delete the stack to reset
del self.stack

# Clean-up time
# Get the new root of the PDF
realRoot = self.getObject(newRootRef)

# Get the new page tree root and its ID number
tmpPages = realRoot[NameObject("/Pages")]
newIdNumForPages = 1 + self._objects.index(tmpPages)

# Make an IndirectObject just for the new Pages
self._pages = IndirectObject(newIdNumForPages, 0, self)

# If there are any pages to add back in
if mustAddTogether:
    # Set the new page tree root's parent to the old
    # page tree root's reference
    tmpPages[NameObject("/Parent")] = oldPagesRef

    # Add the reference to the new page tree root to
    # the old page tree root's kids array
    newPagesRef = self._pages
    oldPages[NameObject("/Kids")].append(newPagesRef)

    # Point all page tree root references at the
    # old page tree root
    self._pages = oldPagesRef
    realRoot[NameObject("/Pages")] = oldPagesRef

    # Update the count attribute of the page tree root
    oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])
else:
    # Bump up the info's reference because the old
    # page tree was bumped off
    self._info = newInfoRef
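The grafting step in the `mustAddTogether` branch above can be illustrated without any PDF library. This is only a toy model: plain dicts stand in for `/Pages` dictionaries, direct object references stand in for indirect references, and `graft_page_tree` is a hypothetical name, not part of PyPDF2.

```python
def graft_page_tree(old_root, new_root):
    """Attach new_root as the last kid of old_root and update the page count.

    Both arguments are plain dicts standing in for /Pages dictionaries.
    """
    # The cloned tree becomes a subtree of the tree that already had pages.
    new_root["/Parent"] = old_root
    old_root["/Kids"].append(new_root)
    # /Count on an intermediate node is the number of leaf pages below it.
    old_root["/Count"] += new_root["/Count"]
    return old_root


old = {"/Type": "/Pages", "/Kids": ["page-a"], "/Count": 1}
new = {"/Type": "/Pages", "/Kids": ["page-b", "page-c"], "/Count": 2}
merged = graft_page_tree(old, new)
```

After the call, the old root reports three pages and the cloned tree hangs off it as its last kid, which mirrors what the branch does with `oldPages` and `tmpPages`.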
I think dhudson1's code post was meant for cloneDocumentFromReader and not cloneReaderDocumentRoot. I tried it there and it seemed to work perfectly, but it alerted me to another problem that I am having.
Unfortunately, this solution only clones the presentation of the document. The document is rebuilt, losing its past versions. A much more useful clone would be a raw, byte-for-byte copy of the document, with subsequent modifications written as a new version. That would also give the library the power to preserve digital signatures for past versions, and problems of this kind would be avoided. It would also make the library more robust, because it could preserve even objects that it cannot fully parse. As far as I have seen, all libraries of this kind work by parsing the entire content with the reader into their own DOM, which the writer then knows how to serialize. What I mean is that the writer doesn't need to handle the semantics of every object in past versions; it can simply include references to them, even when the reader doesn't understand an object's semantics. The objects and past versions would be truly cloned. Am I missing something?
Anyway, now it works. Thanks. :-) Step by step... Any plan to review / merge this change?
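As a rough illustration of the "raw clone" idea in the comment above, the sketch below copies a file byte for byte and estimates how many saved revisions it contains. Both function names are hypothetical, and counting `%%EOF` markers is only a heuristic (the marker could also occur inside stream data). It does not write an incremental update; that would additionally need a new cross-reference section and trailer appended after the copied bytes.

```python
import shutil


def count_revisions(pdf_bytes):
    """Heuristic: each saved revision of a PDF ends with its own %%EOF marker."""
    return pdf_bytes.count(b"%%EOF")


def raw_clone(src_path, dst_path):
    """Copy the file byte for byte, so earlier revisions (and any signatures
    over them) are left untouched."""
    shutil.copyfile(src_path, dst_path)
```

A clone made this way keeps the same revision count as the source, whereas a document rebuilt through a reader and writer would normally come out as a single revision.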
I also see this problem with
so the same
@chickendiver I've never used that function. Could you give a minimal example on how you would like to use it?
Interesting: https://stackoverflow.com/search?q=cloneDocumentFromReader So we have cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot and all 3 don't work properly?
I've just re-tried with the dev version (2.10.5?):
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
writer = PdfWriter()
with open("e:/tt.pdf", "wb") as f:
    writer.write(f)
and the PDF file generated is good. The problem seems solved.
Amazing 🚀 Thank you for letting me know @pubpub-zz ❤️
The solution for me needed one extra line compared to the one given by @pubpub-zz:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
writer = PdfWriter()
writer.clone_document_from_reader(reader)
with open("e:/tt.pdf", "wb") as f:
    writer.write(f)
Oops! But still good, isn't it?
For me, it still just outputs a blank deck... until I used the
Perhaps I am misunderstanding what cloneDocumentFromReader does... but the following code produces a document of all blank pages.