cloneDocumentFromReader produces a document with all blank pages #219

whitemice · 2015-08-07T14:27:34Z

Perhaps I am misunderstanding what cloneDocumentFromReader does... but the following code produces a document of all blank pages.

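    # Module-level imports used by this excerpt
    import signal

    from PyPDF2 import PdfFileReader, PdfFileWriter

    # timeout_alarm_handler, TimeOutAlarm and BurstingTimeOutException are
    # application names that are not shown in this excerpt.
    signal.signal(signal.SIGALRM, timeout_alarm_handler)
    signal.alarm(15)  # deliver SIGALRM after 15 seconds

    reader = PdfFileReader(self.rfile, strict=False)
    writer = PdfFileWriter()

    #writer.cloneReaderDocumentRoot(reader)
    try:
        writer.cloneDocumentFromReader(reader)
    except TimeOutAlarm:
        raise BurstingTimeOutException
    else:
        writer.write(self.wfile)
    writer = None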

whitemice · 2015-08-07T14:28:14Z

Example document: http://www.wmmi.net/documents/Bugs/PyPDF-114xSizeAfterRotation-Example.pdf

I cannot attach a PDF to the issue.

dhudson1 · 2015-08-13T22:34:37Z

For the time being, when you need to clone the document down to the page structure (any page trees are preserved), use cloneReaderDocumentRoot. If keeping the underlying page structure is not important, use appendPagesFromReader to compact the page structure to a single page list with many page references.

I think cloneDocumentFromReader does not work as expected because it uses both of the above methods, which causes a conflict when the document is written. While looking for all of the external references, the writer gets caught between two different versions of the same object with the same version history, which tricks it into thinking that the content has already been loaded when it has not.
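
The failure mode described above can be illustrated with a toy model. This is not PyPDF2 code: `copy_graph`, `dangling`, and the `seen` map are hypothetical names, and the real writer tracks references differently. The sketch only shows how a copy routine that memoizes visited objects can emit references to objects it never wrote, if an object was registered as seen before its content was copied.

```python
# Toy model, not PyPDF2 code: all names here are hypothetical.
def copy_graph(source, root, seen=None):
    """Copy every object reachable from root; return (output, seen).

    source maps an object id to the list of ids it references.
    seen maps ids that are considered already copied.
    """
    seen = {} if seen is None else seen
    output = {}
    stack = [root]
    while stack:
        obj_id = stack.pop()
        if obj_id in seen:
            # Already registered: assume its content was copied earlier.
            continue
        seen[obj_id] = obj_id
        output[obj_id] = list(source[obj_id])
        stack.extend(source[obj_id])
    return output, seen


def dangling(output, root):
    """Return ids that are referenced in output but never defined there."""
    referenced = {root}
    for refs in output.values():
        referenced.update(refs)
    return sorted(referenced - set(output))


source = {
    "root": ["pages"],
    "pages": ["page1", "page2"],
    "page1": ["content1"],
    "page2": ["content2"],
    "content1": [],
    "content2": [],
}

clean, _ = copy_graph(source, "root")
print(dangling(clean, "root"))   # []

# Simulate an earlier, conflicting pass that registered the pages
# without copying their content.
stale = {"page1": "page1", "page2": "page2"}
broken, _ = copy_graph(source, "root", seen=stale)
print(dangling(broken, "root"))  # ['page1', 'page2']
```

In the second run the page tree still points at both pages, but neither page nor its content is written, which is one way a viewer could end up showing blank pages.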

If you open the PDF output of your program in a text editor, you will notice that there are many references to indirect objects that are never defined within the document.
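
The manual check in a text editor can be approximated with a short script. This is a rough sketch using only the standard library, not a PyPDF2 feature: `find_undefined_references` is a hypothetical helper, and it assumes the objects are stored uncompressed. Objects inside compressed object streams are invisible to it, and binary stream data may produce false matches, so treat the result as a hint rather than proof.

```python
import re

# "12 0 obj" starts the definition of object 12, generation 0.
OBJ_DEF = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")
# "12 0 R" is an indirect reference to that object.
OBJ_REF = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+R\b")


def find_undefined_references(pdf_bytes):
    """Return (object number, generation) pairs that are referenced but not defined."""
    defined = {(int(n), int(g)) for n, g in OBJ_DEF.findall(pdf_bytes)}
    referenced = {(int(n), int(g)) for n, g in OBJ_REF.findall(pdf_bytes)}
    return sorted(referenced - defined)


# Hand-written fragment that mimics the broken output: objects 4 and 5
# are referenced but never defined.
sample = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 5 0 R >>
endobj
%%EOF
"""

print(find_undefined_references(sample))  # [(4, 0), (5, 0)]
```

To check a real file, read it in binary mode and pass the bytes to the function.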

dhudson1 · 2015-08-14T22:14:13Z

I have a potential solution for cloneReaderDocumentRoot (type, or copy-and-paste, this in under the definition for cloneReaderDocumentRoot and see if it does what you want):

        '''
        Create a copy (clone) of a document from a PDF file reader

        :param reader: PDF file reader instance from which the clone
            should be created.
        :callback after_page_append (function): Callback function that is invoked after
            each page is appended to the writer. Signature includes a reference to the
            appended page (delegates to appendPagesFromReader). Callback signature:

            :param writer_pageref (PDF page reference): Reference to the page just
                appended to the document.
        '''
        debug = False
        if debug:
            print("Number of Objects: %d" % len(self._objects))
            for obj in self._objects:
                print("\tObject is %r" % obj)
                if hasattr(obj, "indirectRef") and obj.indirectRef is not None:
                    print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

        # Variables used for after cloning the root to
        # improve pre- and post- cloning experience

        mustAddTogether = False
        newInfoRef = self._info
        oldPagesRef = self._pages
        oldPages = self.getObject(self._pages)

        # If there have already been any number of pages added

        if oldPages[NameObject("/Count")] > 0:

            # Keep them

            mustAddTogether = True
        else:

            # Throw the page object out

            if oldPages in self._objects:
                newInfoRef = self._pages
                self._objects.remove(oldPages)

        # Clone the reader's root document

        self.cloneReaderDocumentRoot(reader)
        if not self._root:
            self._root = self._addObject(self._root_object)

        # Sweep for all indirect references

        externalReferenceMap = {}
        self.stack = []
        newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

        # Delete the stack to reset

        del self.stack

        #Clean-Up Time!!!

        # Get the new root of the PDF

        realRoot = self.getObject(newRootRef)

        # Get the new pages tree root and its ID Number

        tmpPages = realRoot[NameObject("/Pages")]
        newIdNumForPages = 1 + self._objects.index(tmpPages)

        # Make an IndirectObject just for the new Pages

        self._pages = IndirectObject(newIdNumForPages, 0, self)

        # If there are any pages to add back in

        if mustAddTogether:

            # Set the new page's root's parent to the old
            # page's root's reference

            tmpPages[NameObject("/Parent")] = oldPagesRef

            # Add the reference to the new page's root in
            # the old page's kids array

            newPagesRef = self._pages
            oldPages[NameObject("/Kids")].append(newPagesRef)

            # Set all references to the root of the old/new
            # page's root

            self._pages = oldPagesRef
            realRoot[NameObject("/Pages")] = oldPagesRef

            # Update the count attribute of the page's root

            oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])

        else:

            # Bump up the info's reference b/c the old
            # page's tree was bumped off

            self._info = newInfoRef
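
The merge branch at the end of the patch may be easier to follow on plain dictionaries. The sketch below is an illustration only, not PyPDF2 code: `graft_page_tree` and the string ids are hypothetical stand-ins for the writer's indirect references. It performs the same three steps as the `mustAddTogether` branch: set the new root's `/Parent`, append the new root to the old root's `/Kids`, and add the page counts.

```python
# Illustration on plain dicts, not PyPDF2 code: names are hypothetical.
def graft_page_tree(objects, old_root_id, new_root_id):
    """Attach the page tree rooted at new_root_id under old_root_id.

    objects maps an id to a dict that mimics a /Pages node.
    Returns the id that the document catalog should use as /Pages.
    """
    old_root = objects[old_root_id]
    new_root = objects[new_root_id]
    new_root["/Parent"] = old_root_id         # new tree becomes a child
    old_root["/Kids"].append(new_root_id)     # old tree lists it as a kid
    old_root["/Count"] += new_root["/Count"]  # /Count covers all leaf pages
    return old_root_id


objects = {
    "old": {"/Type": "/Pages", "/Kids": ["p1"], "/Count": 1},
    "new": {"/Type": "/Pages", "/Kids": ["p2", "p3"], "/Count": 2},
}
catalog = {"/Pages": "new"}
catalog["/Pages"] = graft_page_tree(objects, "old", "new")

print(objects["old"]["/Kids"])   # ['p1', 'new']
print(objects["old"]["/Count"])  # 3
```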

LightningMan711 · 2016-05-23T16:54:27Z

I think dhudson1's code post was meant for cloneDocumentFromReader and not cloneReaderDocumentRoot. I tried it there and it seemed to work perfectly, but it alerted me to another problem that I am having.

yucer · 2016-09-22T16:01:57Z

Unfortunately, this solution only clones the presentation of the document. The document is rebuilt, losing the past document versions. A more useful clone would be a raw, byte-for-byte copy of the document, with subsequent modifications generating a new version.

That would also give the library the ability to preserve digital signatures on past versions.

It would also avoid this kind of problem.

It would make the library more robust too, because it could preserve even objects that it cannot fully parse. As far as I have seen, libraries of this kind parse all the content with the reader into their own DOM, and the writer then knows how to write those objects out.

I mean that the writer doesn't need to handle the semantics of every object in past versions; it can just include references to them, even when the reader doesn't know the semantics of the object. The objects and past versions would be truly cloned.

Am I missing something?
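
The raw-clone idea can be sketched with plain bytes. PDF allows incremental updates, where changed objects, a new cross-reference section, and a new trailer are appended after the existing file content, so the original bytes stay untouched. The sketch below is not a PyPDF2 feature and does not produce a valid PDF: `append_revision` and `count_revisions` are hypothetical helpers, and the byte strings are placeholders for real file content.

```python
# Sketch only: hypothetical helpers, placeholder data, no valid xref is written.
def count_revisions(pdf_bytes):
    """Estimate the number of revisions by counting %%EOF markers (a heuristic)."""
    return pdf_bytes.count(b"%%EOF")


def append_revision(original, update_section):
    """Return the original bytes followed by an appended update section."""
    if not original.endswith(b"\n"):
        original += b"\n"
    return original + update_section


original = b"%PDF-1.4\n(objects, xref, trailer)\nstartxref\n123\n%%EOF\n"
update = b"(changed objects, new xref, trailer with /Prev)\nstartxref\n456\n%%EOF\n"
revised = append_revision(original, update)

print(revised.startswith(original))  # True
print(count_revisions(revised))      # 2
```

Because the first revision survives as an unchanged prefix, anything computed over those bytes, such as a signature's byte range, would still match.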

yucer · 2016-09-22T17:52:46Z

Anyway, now it works. Thanks. :-)

Step by step...

Any plan to review / merge this change?

benwiggy · 2019-04-21T15:48:30Z

I also see this problem with cloneDocumentFromReader. When I use cloneReaderDocumentRoot instead, I get a PDF with no pages.

livizy · 2021-01-12T01:25:41Z

I also see this problem with cloneDocumentFromReader. When I use cloneReaderDocumentRoot instead, I get a PDF with no pages.

Same here.

MartinThoma · 2022-07-05T19:27:14Z

@chickendiver I've never used that function. Could you give a minimal example of how you would like to use it?

MartinThoma · 2022-07-05T20:16:43Z

Interesting: https://stackoverflow.com/search?q=cloneDocumentFromReader

So we have cloneDocumentFromReader, appendPagesFromReader, and cloneReaderDocumentRoot, and none of the three works properly?

pubpub-zz · 2022-09-03T12:37:31Z

I've just re-tried with the dev version (2.10.5?)

    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
    writer = PdfWriter()

    with open("e:/tt.pdf", "wb") as f:
        writer.write(f)

and the PDF file generated is good. The problem seems solved.

MartinThoma · 2022-09-06T19:35:20Z

Amazing 🚀 Thank you for letting me know @pubpub-zz ❤️

johnf1004 · 2022-09-15T17:12:43Z

The solution for me needed one extra line compared to the one given by @pubpub-zz:

    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
    writer = PdfWriter()

    # The extra line: copy the reader's document into the writer
    writer.clone_document_from_reader(reader)

    with open("e:/tt.pdf", "wb") as f:
        writer.write(f)

pubpub-zz · 2022-09-15T18:54:37Z

Oops! But it's still good, isn't it?

johnf1004 · 2022-09-15T22:31:21Z

For me, it still just output a blank deck... until I used clone_document_from_reader.

dhudson1 added the is-bug label (From a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) on Aug 13, 2015

LightningMan711 mentioned this issue on May 23, 2016, in "PdfFileWriter addBookmark DictionaryObject issue" #264 (closed)

MartinThoma added the PdfWriter (The PdfWriter component is affected) and Has MCVE (A minimal, complete and verifiable example helps a lot to debug / understand feature requests) labels on Apr 16, 2022

MartinThoma closed this as completed on Sep 6, 2022
