PDF from Google Sheet doesn't merge with PdfMerger when import_bookmarks is True #1034

Hatell · 2022-06-28T12:45:58Z

A PDF exported from Google Sheets fails to merge with PdfMerger when import_bookmarks is True. With import_bookmarks=False it works.

It seems that the stream is not positioned correctly when the reader tries to read an object header from the PDF.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.17.12-200.fc35.x86_64-x86_64-with-glibc2.34

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

#!/usr/bin/env python
# vi: et sw=4 fileencoding=utf-8

from PyPDF2 import PdfReader, PdfMerger

import sys

out_pdf = PdfMerger()

print("This is OK")
out_pdf.append(PdfReader(sys.argv[1]), import_bookmarks=False)

print("This crashes")
out_pdf.append(PdfReader(sys.argv[1]), import_bookmarks=True)

out_file = open(sys.argv[2], 'wb')

out_pdf.write(out_file)

Sample PDF file:

test_google_sheet.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/home/hate/git/PyPDF2/sample-files/003-pdflatex-image/bug_report.py", line 18, in <module>
    out_pdf.append(PdfReader(sys.argv[1]), import_bookmarks=True)
  File "/home/hate/git/PyPDF2/PyPDF2/_merger.py", line 252, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "/home/hate/git/PyPDF2/PyPDF2/_merger.py", line 152, in merge
    outline = reader.outlines
  File "/home/hate/git/PyPDF2/PyPDF2/_reader.py", line 665, in outlines
    return self._get_outlines()
  File "/home/hate/git/PyPDF2/PyPDF2/_reader.py", line 677, in _get_outlines
    lines = cast(DictionaryObject, catalog[CO.OUTLINES])
  File "/home/hate/git/PyPDF2/PyPDF2/generic.py", line 666, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/home/hate/git/PyPDF2/PyPDF2/generic.py", line 237, in get_object
    obj = self.pdf.get_object(self)
  File "/home/hate/git/PyPDF2/PyPDF2/_reader.py", line 1051, in get_object
    idnum, generation = self.read_object_header(self.stream)
  File "/home/hate/git/PyPDF2/PyPDF2/_reader.py", line 1133, in read_object_header
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'F-1.4'


MartinThoma · 2022-06-28T12:53:11Z

This might be related to #602

MartinThoma · 2022-06-28T12:53:26Z

Thanks for the great bug ticket!

MartinThoma · 2022-06-28T13:03:19Z

As a side-note:

I would typically recommend using context managers:

# Not recommended
out_file = open(sys.argv[2], 'wb')
out_pdf.write(out_file)

# Recommended
with open(sys.argv[2], 'wb') as out_file:
    out_pdf.write(out_file)

This ensures that the file handles get closed again.

Hatell · 2022-06-29T05:55:48Z

I found that the issue is in the PdfReader.outlines method.

from PyPDF2 import PdfReader

pf = PdfReader("test_google_sheet.pdf")

pf.trailer["/Root"]["/Outlines"]

Hatell · 2022-06-29T06:17:57Z

And it seems that the xref table in the PDF is not correct.

xref
0 19
0000000002 65535 f 
0000014832 00000 n 
0000000003 00000 f 
0000000000 00000 f 
0000000016 00000 n 
0000000160 00000 n 
0000000281 00000 n 
0000000447 00000 n 
0000000900 00000 n 
0000000809 00000 n 
0000000828 00000 n 
0000000848 00000 n 
0000001057 00000 n 
0000014292 00000 n 
0000001201 00000 n 
0000014593 00000 n 
0000001554 00000 n 
0000014790 00000 n 
0000014810 00000 n 
trailer

I have no idea what the third column means. Do you?

MartinThoma · 2022-06-29T07:14:33Z

See "3.4.3 Cross-Reference Table" in the PDF 1.7 standard.

Hatell · 2022-06-29T07:39:05Z

That explains it. It tries to read something from a free entry.
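
For readers following along: in each xref entry the third column is the entry type. `n` marks an in-use object, whose first column is a byte offset; `f` marks a free entry, whose first column is the object number of the next free object. Below is a minimal sketch using the first four entries of the table above. `parse_xref_entry` and `XrefEntry` are hypothetical names for illustration, not PyPDF2 functions, and the `%PDF-1.4` header is an assumption inferred from the traceback. It shows one plausible reading of the error: if the `3` stored for free object 2 is treated as a byte offset, the read starts inside the file header.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    value: int        # byte offset if in use, next free object number if free
    generation: int
    in_use: bool


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one 'nnnnnnnnnn ggggg n|f' cross-reference entry (hypothetical helper)."""
    value, generation, kind = line.split()
    if kind not in ("n", "f"):
        raise ValueError(f"unknown xref entry type: {kind!r}")
    return XrefEntry(int(value), int(generation), kind == "n")


# First four entries of the table quoted above (object numbers 0 to 3).
lines = [
    "0000000002 65535 f",
    "0000014832 00000 n",
    "0000000003 00000 f",
    "0000000000 00000 f",
]
entries = [parse_xref_entry(line) for line in lines]

# Assumption: the file starts with the header hinted at by the traceback.
header = b"%PDF-1.4\n"

# Object 2 is free, so its first column (3) is NOT a byte offset.
# Treating it as one lands inside the header:
misread = header[entries[2].value:].split()[0]
print(entries[2].in_use, misread)
```

Under that assumption, the token read at byte 3 is the same one that `int()` rejects in the traceback.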

MartinThoma · 2022-06-30T13:09:53Z

Another user wrote something similar: #521 (comment)

Could you expand on that? Do you maybe even have an idea how to fix it?

Hatell · 2022-07-01T05:49:02Z

I tested the skipping method, but it isn't the correct way to do this, because the numbering is no longer correct afterwards.

Maybe the correct way is to check whether the IndirectObject's target is a free entry and, if so, return a NullObject.

I haven't read the PDF specification yet, so I have no idea whether this case is defined there.

Hatell · 2022-07-02T10:30:35Z

I found this in the specification, https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf:

Section 7.3.10 Indirect Objects

An indirect reference to an undefined object shall not be considered an error by a conforming reader; it shall be
treated as a reference to the null object.

And section 7.5.4 Cross-Reference Table

There are two ways an entry may be a member of the free entries list. Using the basic mechanism the free
entries in the cross-reference table may form a linked list, with each free entry containing the object number of
the next. The first entry in the table (object number 0) shall always be free and shall have a generation number
of 65,535; it is shall be the head of the linked list of free objects. The last free entry (the tail of the linked list)
links back to object number 0. Using the second mechanism, the table may contain other free entries that link
back to object number 0 and have a generation number of 65,535, even though these entries are not in the
linked list itself.

If you look at those entries, you can see that for a free entry the first column is not an offset but the object number of the next free entry, so together they form a linked list of free entries.
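
The linked list can be traced by hand from the xref table quoted earlier in this thread. The sketch below does the same in code; the dictionary holds only the three free entries from that table, and `walk_free_list` is a hypothetical helper written for this illustration.

```python
from typing import Dict, List

# Free entries from the xref table: object number -> first column
# (which, for a free entry, is the next free object number).
FREE_ENTRIES: Dict[int, int] = {0: 2, 2: 3, 3: 0}


def walk_free_list(free_entries: Dict[int, int]) -> List[int]:
    """Return the object numbers visited from object 0 until the list links back to 0."""
    chain = [0]
    current = free_entries[0]
    while current != 0 and current not in chain:
        if current not in free_entries:
            # The link points at something that is not a free entry: stop.
            break
        chain.append(current)
        current = free_entries[current]
    return chain


print(walk_free_list(FREE_ENTRIES))
```

Object 0 links to 2, 2 links to 3, and 3 links back to 0, which matches the "tail links back to object number 0" rule in section 7.5.4.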

So, as I thought, the correct way to handle this is to resolve an indirect reference to a free entry as a NullObject.
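
As a rough illustration of that idea, and not the actual change in PyPDF2, here is a toy lookup that only hands out a byte offset for in-use entries. The `XREF` dictionary and `resolve_offset` are hypothetical; `None` stands in for the null object, and generation numbers are ignored for brevity.

```python
from typing import Dict, Optional, Tuple

# Toy cross-reference table: object number -> (first column, generation, type).
# The values are the first five entries of the table quoted in this thread.
XREF: Dict[int, Tuple[int, int, str]] = {
    0: (2, 65535, "f"),
    1: (14832, 0, "n"),
    2: (3, 0, "f"),
    3: (0, 0, "f"),
    4: (16, 0, "n"),
}


def resolve_offset(idnum: int) -> Optional[int]:
    """Return the byte offset of an in-use object, or None for the null object.

    Free entries and object numbers missing from the table both resolve to
    None, following the reading of section 7.3.10 given above.
    """
    entry = XREF.get(idnum)
    if entry is None or entry[2] == "f":
        return None
    return entry[0]


for idnum in (1, 2, 99):
    print(idnum, resolve_offset(idnum))
```

With a check like this in place, a reader never seeks to the "offset" of a free entry, so a dangling /Outlines reference would come back as null instead of raising.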

MartinThoma · 2022-07-05T08:22:51Z

This issue was fixed by @Hatell via #1054. It will be part of PyPDF2>=2.4.2. I will make that release on PyPI probably this evening.

MartinThoma added the labels Has MCVE, is-bug, and PdfMerger on Jun 28, 2022

Hatell mentioned this issue Jul 4, 2022

Resolve IndirectObject when it refers to a free entry. #1054

Merged

MartinThoma closed this as completed in 02c601c Jul 5, 2022
