PDF from Google Sheet doesn't merge with PdfMerger when import_bookmarks is True #1034
Comments
This might be related to #602 |
Thanks for the great bug ticket! |
As a side note, I would typically recommend using context managers:

# Not recommended
out_file = open(sys.argv[2], 'wb')
out_pdf.write(out_file)

# Recommended
with open(sys.argv[2], 'wb') as out_file:
    out_pdf.write(out_file)

This ensures that the file handle gets closed again. |
I found that the issue is in the PdfReader.outlines method.

from PyPDF2 import PdfReader
pf = PdfReader("test_google_sheet.pdf")
pf.trailer["/Root"]["/Outlines"] |
It also seems that the xref table in the PDF is not correct.
I have no idea what the third column means. Do you? |
That explains it. It tries to read something from a free entry. |
Another user wrote something similar: #521 (comment) Could you expand on that? Do you maybe even have an idea how to fix it? |
I tested this skipping method, but it isn't the correct way to do it, because the object numbering is no longer correct afterwards. Maybe the correct way is to check whether the target of an IndirectObject is a free entry and, if so, return NullObject. I haven't read the PDF specification yet, so I have no idea whether this case is defined there. |
I found this in the specification, https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf: Section 7.3.10 Indirect Objects
And Section 7.5.4 Cross-Reference Table
If you look at those entries, you can see that for a free entry the offset field holds the object number of the next free entry, so the free entries form a linked list. So, as I thought, the correct way to handle this is to resolve the indirect reference to NullObject. |
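To make the point about free entries concrete, here is a minimal, standalone sketch of how a classic cross-reference section can be read and how a reference to a free entry can be resolved to null. It does not use PyPDF2 and is not how PyPDF2 implements this. The names XrefEntry, parse_xref_section, resolve and free_list are hypothetical, and the sample table is made up for illustration (it is not taken from the Google Sheets PDF).

```python
from typing import Dict, List, NamedTuple, Optional


class XrefEntry(NamedTuple):
    field1: int       # byte offset if in use; next free object number if free
    generation: int   # generation number (second column)
    in_use: bool      # third column: True for 'n' (in use), False for 'f' (free)


def parse_xref_section(text: str) -> Dict[int, XrefEntry]:
    """Parse a classic (non-stream) cross-reference section into a dict."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")
    entries: Dict[int, XrefEntry] = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number and number of entries.
        first, count = (int(x) for x in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            a, b, kind = lines[i].split()
            entries[obj_num] = XrefEntry(int(a), int(b), kind == "n")
            i += 1
    return entries


def resolve(entries: Dict[int, XrefEntry], obj_num: int, generation: int) -> Optional[int]:
    """Return the byte offset of an object, or None to stand for the null object.

    A reference to a free or undefined object is treated as null here,
    instead of trying to read an object at the entry's first field.
    """
    entry = entries.get(obj_num)
    if entry is None or not entry.in_use or entry.generation != generation:
        return None
    return entry.field1


def free_list(entries: Dict[int, XrefEntry]) -> List[int]:
    """Follow the linked list of free entries, starting at object 0."""
    chain: List[int] = []
    current = 0
    while current not in chain:
        entry = entries.get(current)
        if entry is None or entry.in_use:
            break
        chain.append(current)
        current = entry.field1  # object number of the next free entry
    return chain


# Made-up sample table: objects 0 and 3 are free, the rest are in use.
SAMPLE = """
xref
0 6
0000000003 65535 f
0000000017 00000 n
0000000081 00000 n
0000000000 00007 f
0000000331 00000 n
0000000409 00000 n
"""

ENTRIES = parse_xref_section(SAMPLE)
print(resolve(ENTRIES, 1, 0))   # in-use object: its byte offset
print(resolve(ENTRIES, 3, 7))   # free entry: treated as null
print(free_list(ENTRIES))       # chain of free object numbers
```

In this sketch, skipping free entries while parsing would shift every later object number, which is why the table keeps them and only the lookup treats them as null.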
A PDF from Google Sheet doesn't merge with PdfMerger when import_bookmarks is True. If it is False, it works.
It seems that the stream is not in the correct state for reading the header of a PDF.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.17.12-200.fc35.x86_64-x86_64-with-glibc2.34
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.0
Code + PDF
This is a minimal, complete example that shows the issue:
Sample PDF file:
Traceback
This is the complete Traceback I see: