Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloneDocumentFromReader produces a document with all blank pages #219

Closed
whitemice opened this issue Aug 7, 2015 · 15 comments
Closed

cloneDocumentFromReader produces a document with all blank pages #219

whitemice opened this issue Aug 7, 2015 · 15 comments
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfWriter The PdfWriter component is affected

Comments

@whitemice
Copy link

whitemice commented Aug 7, 2015

Perhaps I am misunderstanding what cloneDocumentFromReader does... but the following code produces a document of all blank pages.

    signal.signal(signal.SIGALRM, timeout_alarm_handler)
    signal.alarm(15)

    reader = PdfFileReader(self.rfile, strict=False, )
    writer = PdfFileWriter()

    #writer.cloneReaderDocumentRoot(reader)
    try:
        writer.cloneDocumentFromReader(
            reader
        )
    except TimeOutAlarm:
        raise BurstingTimeOutException
    else:
        writer.write(self.wfile)
    writer = None
@whitemice
Copy link
Author

Example document @ http://www.wmmi.net/documents/Bugs/PyPDF-114xSizeAfterRotation-Example.pdf

Cannot upload a document of PDF to issue

@dhudson1 dhudson1 added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Aug 13, 2015
@dhudson1
Copy link

For the time being, when you need to clone the document down to the page structure (any page trees are preserved), use cloneReaderDocumentRoot. If keeping the underlying page structure is not important, use appendPagesFromReader to compact the page structure to a single page list with many page references.

I think that cloneDocumentFromReader does not work like expected because cloneDocumentFromReader uses both of the above methods, which causes a conflict when trying to write the document. This conflict occurs because the writer, when looking for all of the external references gets caught up in two different versions of the same object with the same version history, tricking the writer into thinking that the content has already been loaded when it has not been.

If you open the PDF output of your program in a text editor, you will notice that there are many references to indirect objects that are never defined within the document.

@dhudson1
Copy link

I have a potential solution for cloneReaderDocumentRoot (type, or copy-and-paste, this in under the definition for cloneReaderDocumentRoot and see if it does what you want):

        '''
        Create a copy (clone) of a document from a PDF file reader

        :param reader: PDF file reader instance from which the clone
            should be created.
        :callback after_page_append (function): Callback function that is invoked after
            each page is appended to the writer. Signature includes a reference to the
            appended page (delegates to appendPagesFromReader). Callback signature:

            :param writer_pageref (PDF page reference): Reference to the page just
                appended to the document.
        '''
        debug = False
        if debug:
            print("Number of Objects: %d" % len(self._objects))
            for obj in self._objects:
                print("\tObject is %r" % obj)
                if hasattr(obj, "indirectRef") and obj.indirectRef != None:
                    print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

        # Variables used for after cloning the root to
        # improve pre- and post- cloning experience

        mustAddTogether = False
        newInfoRef = self._info
        oldPagesRef = self._pages
        oldPages = self.getObject(self._pages)

        # If there have already been any number of pages added

        if oldPages[NameObject("/Count")] > 0:

            # Keep them

            mustAddTogether = True
        else:

            # Through the page object out

            if oldPages in self._objects:
                newInfoRef = self._pages
                self._objects.remove(oldPages)

        # Clone the reader's root document

        self.cloneReaderDocumentRoot(reader)
        if not self._root:
            self._root = self._addObject(self._root_object)

        # Sweep for all indirect references

        externalReferenceMap = {}
        self.stack = []
        newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

        # Delete the stack to reset

        del self.stack

        #Clean-Up Time!!!

        # Get the new root of the PDF

        realRoot = self.getObject(newRootRef)

        # Get the new pages tree root and its ID Number

        tmpPages = realRoot[NameObject("/Pages")]
        newIdNumForPages = 1 + self._objects.index(tmpPages)

        # Make an IndirectObject just for the new Pages

        self._pages = IndirectObject(newIdNumForPages, 0, self)

        # If there are any pages to add back in

        if mustAddTogether:

            # Set the new page's root's parent to the old
            # page's root's reference

            tmpPages[NameObject("/Parent")] = oldPagesRef

            # Add the reference to the new page's root in
            # the old page's kids array

            newPagesRef = self._pages
            oldPages[NameObject("/Kids")].append(newPagesRef)

            # Set all references to the root of the old/new
            # page's root

            self._pages = oldPagesRef
            realRoot[NameObject("/Pages")] = oldPagesRef

            # Update the count attribute of the page's root

            oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])

        else:

            # Bump up the info's reference b/c the old
            # page's tree was bumped off

            self._info = newInfoRef

@LightningMan711
Copy link

I think dhudson1's code post was meant to be cloneDocumentFromReader and not cloneReaderDocumentRoot . I tried it there and it seemed to work perfectly, but it alerted me to another problem that I am having.

@yucer
Copy link

yucer commented Sep 22, 2016

Unfortunately this solution only clones the presentation of the document. The document is rebuilt, loosing the past document versions. A very useful clone will be a raw clone byte by byte of the document, and make that subsequent modifications generate a new version.

That will also bring the library the power to preserve the digital signatures for past versions.

Also this kind of problems would be avoided.

This would also make the library stronger, because it could preserve even objects that it can not parse totally. As long as I have seen, it seems that all this kind of libraries work parsing all the content with the reader to its own DOM and then the writer knows how to write them.

I mean the writer doesn't need to handle the semantic of all the existent objects in past versions, it can just include references to them, even when the reader doesn't know the semantic of the object. The object and past versions would be truly cloned.

Am I missing something ?

@yucer
Copy link

yucer commented Sep 22, 2016

Anyway, now it works. Thanks. :-)

Step by step...

Any plan to review / merge this change ?

@benwiggy
Copy link

I also see this problem with cloneDocumentFromReader. When I use cloneReaderDocumentRoot instead, I get a PDF with No pages.

@livizy
Copy link

livizy commented Jan 12, 2021

I also see this problem with cloneDocumentFromReader. When I use cloneReaderDocumentRoot instead, I get a PDF with No pages.

so the same

@MartinThoma MartinThoma added PdfWriter The PdfWriter component is affected Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Apr 16, 2022
@MartinThoma
Copy link
Member

@chickendiver I've never used that function. Could you give a minimal example on how you would like to use it?

@MartinThoma
Copy link
Member

Interesting: https://stackoverflow.com/search?q=cloneDocumentFromReader

So we have cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot and all 3 don't work properly?

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 3, 2022

I've just re-tried with the dev version (2.10.5?)

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
writer = PdfWriter()

with open("e:/tt.pdf", "wb") as f:
    writer.write(f)

and the PDF file generated is good. The problem seems solved.

@MartinThoma
Copy link
Member

Amazing 🚀 Thank you for letting me know @pubpub-zz ❤️

@johnf1004
Copy link

johnf1004 commented Sep 15, 2022

The solution for me needed one extra line compared to the one given by @pubpub-zz

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
writer = PdfWriter()

writer.clone_document_from_reader(reader)

with open("e:/tt.pdf", "wb") as f:
    writer.write(f)

@pubpub-zz
Copy link
Collaborator

oups! But still good, isn't-it ?

@johnf1004
Copy link

For me, it still just outputs a blank deck... until I used the clone_document_from_reader

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfWriter The PdfWriter component is affected
Projects
None yet
Development

No branches or pull requests

9 participants