PDF from Google Sheet doesn't merge with PdfMerger when import_bookmarks is True #1034

Closed
Hatell opened this issue Jun 28, 2022 · 11 comments
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfMerger The PdfMerger component is affected

Comments

@Hatell
Contributor

Hatell commented Jun 28, 2022

A PDF exported from Google Sheets fails to merge with PdfMerger when import_bookmarks is True. With import_bookmarks=False it works.

It seems that the stream is not at the correct position when an object header is read from the PDF.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.17.12-200.fc35.x86_64-x86_64-with-glibc2.34

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

#!/usr/bin/env python
# vi: et sw=4 fileencoding=utf-8

from PyPDF2 import PdfReader, PdfMerger

import sys

out_pdf = PdfMerger()

print("This is OK")
out_pdf.append(PdfReader(sys.argv[1]), import_bookmarks=False)

print("This crashes")
out_pdf.append(PdfReader(sys.argv[1]), import_bookmarks=True)

out_file = open(sys.argv[2], 'wb')

out_pdf.write(out_file)

Sample PDF file:

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/home/hate/git/PyPDF2/sample-files/003-pdflatex-image/bug_report.py", line 18, in <module>
    out_pdf.append(PdfReader(sys.argv[1]), import_bookmarks=True)
  File "/home/hate/git/PyPDF2/PyPDF2/_merger.py", line 252, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "/home/hate/git/PyPDF2/PyPDF2/_merger.py", line 152, in merge
    outline = reader.outlines
  File "/home/hate/git/PyPDF2/PyPDF2/_reader.py", line 665, in outlines
    return self._get_outlines()
  File "/home/hate/git/PyPDF2/PyPDF2/_reader.py", line 677, in _get_outlines
    lines = cast(DictionaryObject, catalog[CO.OUTLINES])
  File "/home/hate/git/PyPDF2/PyPDF2/generic.py", line 666, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/home/hate/git/PyPDF2/PyPDF2/generic.py", line 237, in get_object
    obj = self.pdf.get_object(self)
  File "/home/hate/git/PyPDF2/PyPDF2/_reader.py", line 1051, in get_object
    idnum, generation = self.read_object_header(self.stream)
  File "/home/hate/git/PyPDF2/PyPDF2/_reader.py", line 1133, in read_object_header
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'F-1.4'
@MartinThoma
Member

This might be related to #602

@MartinThoma
Member

Thanks for the great bug ticket!

@MartinThoma MartinThoma added the Has MCVE, is-bug, and PdfMerger labels Jun 28, 2022
@MartinThoma
Member

MartinThoma commented Jun 28, 2022

As a side-note:

I would typically recommend using context managers:

# Not recommended
out_file = open(sys.argv[2], 'wb')
out_pdf.write(out_file)

# Recommended
with open(sys.argv[2], 'wb') as out_file:
    out_pdf.write(out_file)

This ensures that the file handle gets closed, even if an exception is raised while writing.

@Hatell
Contributor Author

Hatell commented Jun 29, 2022

I found that the issue is in the PdfReader.outlines property. It can be reproduced without PdfMerger:

from PyPDF2 import PdfReader

pf = PdfReader("test_google_sheet.pdf")

pf.trailer["/Root"]["/Outlines"]

@Hatell
Contributor Author

Hatell commented Jun 29, 2022

And it seems that the xref table in the PDF is not correct:

xref
0 19
0000000002 65535 f 
0000014832 00000 n 
0000000003 00000 f 
0000000000 00000 f 
0000000016 00000 n 
0000000160 00000 n 
0000000281 00000 n 
0000000447 00000 n 
0000000900 00000 n 
0000000809 00000 n 
0000000828 00000 n 
0000000848 00000 n 
0000001057 00000 n 
0000014292 00000 n 
0000001201 00000 n 
0000014593 00000 n 
0000001554 00000 n 
0000014790 00000 n 
0000014810 00000 n 
trailer

I have no idea what the third column means. Do you?

@MartinThoma
Member

See "3.4.3 Cross-Reference Table" in the PDF 1.7 standard:

[image: screenshot from the standard, not reproduced here]

@Hatell
Contributor Author

Hatell commented Jun 29, 2022

That explains it. It tries to read an object from a free entry.
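
To make the column meaning concrete, here is a minimal sketch (not PyPDF2 code) that parses the first few entries quoted above. The names `XrefEntry` and `parse_xref_entries` are made up for illustration. For an in-use entry (`n`) the first field is a byte offset; for a free entry (`f`) it is the object number of the next free object. The last two lines assume that the failing reference pointed at object 2 and that the file starts with `%PDF-1.4`; under those assumptions, treating the free entry's first field as an offset would produce the `b'F-1.4'` seen in the traceback.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    obj_num: int
    field1: int      # byte offset if in use, next free object number if free
    generation: int
    in_use: bool


# First five entries of the table quoted in the issue (subsection "0 19").
XREF_LINES = """\
0000000002 65535 f
0000014832 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
""".splitlines()


def parse_xref_entries(lines: List[str], first_obj: int = 0) -> List[XrefEntry]:
    """Split each 'nnnnnnnnnn ggggg n|f' line into its three fields."""
    entries = []
    for i, line in enumerate(lines):
        field1, generation, kind = line.split()
        entries.append(
            XrefEntry(first_obj + i, int(field1), int(generation), kind == "n")
        )
    return entries


entries = parse_xref_entries(XREF_LINES)
free = [e.obj_num for e in entries if not e.in_use]
print(free)  # [0, 2, 3]

# Assumption: the file begins with this header. Using free entry 2's first
# field (3) as a byte offset lands in the middle of the header.
header = b"%PDF-1.4\n"
misread = header[entries[2].field1:].split()[0]
print(misread)  # b'F-1.4'
```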

@MartinThoma
Member

Another user wrote something similar: #521 (comment)

Could you expand on that? Do you maybe even have an idea how to fix it?

@Hatell
Contributor Author

Hatell commented Jul 1, 2022

I tested that skipping approach, but it isn't the correct way to do this, because the object numbering is no longer correct afterwards.

Maybe the correct way is to check whether the target of an IndirectObject is a free entry and, if so, return a NullObject.

I haven't read the PDF specification yet, so I have no idea whether this case is defined there.

@Hatell
Contributor Author

Hatell commented Jul 2, 2022

I found this in the PDF specification (https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf):

Section 7.3.10 Indirect Objects

An indirect reference to an undefined object shall not be considered an error by a conforming reader; it shall be
treated as a reference to the null object.

And section 7.5.4 Cross-Reference Table

There are two ways an entry may be a member of the free entries list. Using the basic mechanism the free
entries in the cross-reference table may form a linked list, with each free entry containing the object number of
the next. The first entry in the table (object number 0) shall always be free and shall have a generation number
of 65,535; it is shall be the head of the linked list of free objects. The last free entry (the tail of the linked list)
links back to object number 0. Using the second mechanism, the table may contain other free entries that link
back to object number 0 and have a generation number of 65,535, even though these entries are not in the
linked list itself.

And if you look at those entries, you can see that for a free entry the offset field holds an object number, and together the free entries form a linked list.
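
As a rough illustration of that linked list, the sketch below walks the free entries of the table posted earlier in this thread. The helper name `walk_free_list` is invented for this example and the code is independent of PyPDF2.

```python
# The 19 entries from the xref table posted in this issue.
XREF_TABLE = """\
0000000002 65535 f
0000014832 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000281 00000 n
0000000447 00000 n
0000000900 00000 n
0000000809 00000 n
0000000828 00000 n
0000000848 00000 n
0000001057 00000 n
0000014292 00000 n
0000001201 00000 n
0000014593 00000 n
0000001554 00000 n
0000014790 00000 n
0000014810 00000 n
"""

# Index in the list == object number, since the subsection starts at 0.
table = [
    (int(field1), int(gen), kind)
    for field1, gen, kind in (line.split() for line in XREF_TABLE.splitlines())
]


def walk_free_list(table):
    """Follow 'next free object' links from object 0 until they return to 0."""
    chain = [0]
    next_free = table[0][0]
    while next_free != 0:
        if next_free in chain or table[next_free][2] != "f":
            raise ValueError(f"broken free list at object {next_free}")
        chain.append(next_free)
        next_free = table[next_free][0]
    return chain


free_chain = walk_free_list(table)
print(free_chain)  # [0, 2, 3]
```

Object 0 links to 2, 2 links to 3, and 3 links back to 0, which matches the description in section 7.5.4.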

So, as I thought, the correct way to handle this is to resolve the indirect reference to a NullObject.
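
A minimal sketch of that idea, under stated assumptions: this is a standalone toy, not the change that went into PyPDF2, and `resolve_offset` and the dictionary layout are hypothetical. It returns `None` (standing in for the null object) when the referenced object is missing from the table, marked free, or has a different generation number, and returns the byte offset otherwise.

```python
from typing import Dict, Optional, Tuple

# object number -> (first field, generation, in_use)
XrefTable = Dict[int, Tuple[int, int, bool]]

# A few entries from the table in this issue.
xref: XrefTable = {
    0: (2, 65535, False),
    1: (14832, 0, True),
    2: (3, 0, False),
    3: (0, 0, False),
    4: (16, 0, True),
}


def resolve_offset(xref: XrefTable, obj_num: int, generation: int) -> Optional[int]:
    """Return the byte offset of an in-use object, or None for the null object."""
    entry = xref.get(obj_num)
    if entry is None:
        return None  # object number not in the table: undefined object
    field1, entry_generation, in_use = entry
    if not in_use or entry_generation != generation:
        return None  # free entry or stale generation: treat as null
    return field1


print(resolve_offset(xref, 1, 0))   # 14832
print(resolve_offset(xref, 2, 0))   # None
print(resolve_offset(xref, 99, 0))  # None
```

A caller would only seek into the stream when an offset comes back, and would substitute a null object otherwise, which is the behavior section 7.3.10 asks of a conforming reader.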

@MartinThoma
Member

This issue was fixed by @Hatell via #1054. It will be part of PyPDF2>=2.4.2. I will make that release on PyPI probably this evening.
