BUG: Resolve IndirectObject when it refers to a free entry (#1054)
From the PDF 1.7 docs https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf:

Section 7.3.10 Indirect Objects:
An indirect reference to an undefined object shall not be considered an error by a conforming reader;
it shall be treated as a reference to the null object.

And section 7.5.4 Cross-Reference Table:
There are two ways an entry may be a member of the free entries list. Using the basic mechanism the free
entries in the cross-reference table may form a linked list, with each free entry containing the object number of
the next. The first entry in the table (object number 0) shall always be free and shall have a generation number
of 65,535; it shall be the head of the linked list of free objects. The last free entry (the tail of the linked list)
links back to object number 0. Using the second mechanism, the table may contain other free entries that link
back to object number 0 and have a generation number of 65,535, even though these entries are not in the
linked list itself.
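
The two mechanisms quoted above can be illustrated with a small sketch. It assumes the table has already been parsed into a dict keyed by object number; `XrefEntry` and `walk_free_list` are hypothetical names used only for illustration and are not part of PyPDF2.

```python
from typing import Dict, List, NamedTuple


class XrefEntry(NamedTuple):
    """Hypothetical parsed xref entry (not a PyPDF2 type)."""

    field1: int  # byte offset if in use; next free object number if free
    generation: int
    in_use: bool  # True for 'n' entries, False for 'f' entries


def walk_free_list(table: Dict[int, XrefEntry]) -> List[int]:
    """Follow the free list from object 0 until it links back to a seen entry."""
    free: List[int] = []
    seen = set()
    num = 0
    while num not in seen:
        entry = table.get(num)
        if entry is None or entry.in_use:
            break  # malformed chain: stop rather than guess
        seen.add(num)
        free.append(num)
        num = entry.field1
    return free


table = {
    0: XrefEntry(3, 65535, False),  # head of the free list, links to 3
    1: XrefEntry(17, 0, True),
    2: XrefEntry(81, 0, True),
    3: XrefEntry(0, 7, False),  # tail, links back to object 0
    4: XrefEntry(0, 65535, False),  # second mechanism: free, outside the list
}
print(walk_free_list(table))
```

In this sample table, object 4 is free but is never reached by following the chain, which suggests that a reader should check each entry's own `f`/`n` flag rather than rely on walking the list.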

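Free entries form a linked list, so the first field of a free entry holds the object number of the next free entry rather than a byte offset. The correct way to handle an indirect reference to a free entry is to resolve it to the NullObject.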
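
A minimal sketch of that resolution rule, independent of PyPDF2: the `resolve` function, the `NULL` sentinel, and the `load_at` callback are hypothetical stand-ins, and the two dicts are assumed to be keyed by generation and then object number, as in the diff below.

```python
from typing import Any, Callable, Dict

NULL = object()  # stand-in for the PDF null object (PyPDF2 uses NullObject)


def resolve(
    idnum: int,
    generation: int,
    xref: Dict[int, Dict[int, int]],
    xref_free: Dict[int, Dict[int, bool]],
    load_at: Callable[[int], Any],
) -> Any:
    """Resolve an indirect reference, mapping undefined and free entries to NULL."""
    offsets = xref.get(generation, {})
    if idnum not in offsets:
        return NULL  # undefined object: section 7.3.10 says treat as null
    if xref_free.get(generation, {}).get(idnum, False):
        return NULL  # free entry: the stored value is a free-list link, not an offset
    return load_at(offsets[idnum])


# Sample data: object 1 is in use at offset 17, object 3 is a free entry.
xref = {0: {1: 17, 3: 0}}
xref_free = {0: {1: False, 3: True}}


def load_at(offset: int) -> str:
    """Hypothetical loader; a real reader would seek and parse the object."""
    return f"object@{offset}"


print(resolve(1, 0, xref, xref_free, load_at))
print(resolve(3, 0, xref, xref_free, load_at) is NULL)
```

Without the free-entry check, the reader would seek to the "offset" stored for object 3, which is really a link to another free entry.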

See "3.4.3 Cross-Reference Table" in the PDF 1.7 standard for free cross-reference entries in general.

Co-authored-by: Harry Karvonen <harry.karvonen@onebyte.fi>

Closes #521
Closes #1034
Hatell committed Jul 5, 2022
1 parent 70605ae commit 02c601c
Showing 1 changed file with 12 additions and 0 deletions.
12 changes: 12 additions & 0 deletions PyPDF2/_reader.py
@@ -691,6 +691,9 @@ def _get_outlines(
                    # so continue to load the file without the Bookmarks
                    return outlines

                if isinstance(lines, NullObject):
                    return outlines

                # TABLE 8.3 Entries in the outline dictionary
                if "/First" in lines:
                    node = cast(DictionaryObject, lines["/First"])
@@ -1052,6 +1055,10 @@ def get_object(self, indirect_reference: IndirectObject) -> Optional[PdfObject]:
            indirect_reference.generation in self.xref
            and indirect_reference.idnum in self.xref[indirect_reference.generation]
        ):
            if self.xref_free_entry.get(indirect_reference.generation, {}).get(
                indirect_reference.idnum, False
            ):
                return NullObject()
            start = self.xref[indirect_reference.generation][indirect_reference.idnum]
            self.stream.seek(start, 0)
            idnum, generation = self.read_object_header(self.stream)
@@ -1225,6 +1232,7 @@ def read(self, stream: StreamType) -> None:

        # read all cross reference tables and their trailers
        self.xref: Dict[int, Dict[Any, Any]] = {}
        self.xref_free_entry: Dict[int, Dict[Any, Any]] = {}
        self.xref_objStm: Dict[int, Tuple[Any, Any]] = {}
        self.trailer = DictionaryObject()
        while True:
@@ -1380,9 +1388,12 @@ def _read_standard_xref_table(self, stream: StreamType) -> None:
                    stream.seek(-1, 1)

                offset_b, generation_b = line[:16].split(b" ")
                entry_type_b = line[17:18]

                offset, generation = int(offset_b), int(generation_b)
                if generation not in self.xref:
                    self.xref[generation] = {}
                    self.xref_free_entry[generation] = {}
                if num in self.xref[generation]:
                    # It really seems like we should allow the last
                    # xref table in the file to override previous
@@ -1391,6 +1402,7 @@ def _read_standard_xref_table(self, stream: StreamType) -> None:
                    pass
                else:
                    self.xref[generation][num] = offset
                    self.xref_free_entry[generation][num] = entry_type_b == b"f"
                cnt += 1
                num += 1
            read_non_whitespace(stream)
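
The slicing in this hunk relies on the fixed 20-byte layout of a classic cross-reference entry: a 10-digit field, a space, a 5-digit generation number, a space, the keyword `n` or `f`, and a 2-byte end-of-line. A standalone sketch of that parsing follows; `parse_xref_entry` is a hypothetical helper written for illustration, not a PyPDF2 function.

```python
from typing import Tuple


def parse_xref_entry(line: bytes) -> Tuple[int, int, bool]:
    """Return (first_field, generation, is_free) for one 20-byte xref entry."""
    if len(line) < 18:
        raise ValueError("xref entry too short")
    # Bytes 0-15 hold "nnnnnnnnnn ggggg"; byte 17 holds the entry type.
    first_b, generation_b = line[:16].split(b" ")
    entry_type = line[17:18]
    if entry_type not in (b"n", b"f"):
        raise ValueError(f"unexpected xref entry type: {entry_type!r}")
    return int(first_b), int(generation_b), entry_type == b"f"


print(parse_xref_entry(b"0000000017 00000 n \n"))  # in-use entry at offset 17
print(parse_xref_entry(b"0000000003 65535 f \n"))  # free entry linking to object 3
```

For a free entry, the first field is the object number of the next free entry, so it should not be passed to `seek`.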