"PDFDocument is not initialized" when startxref is invalid #946

umaplehurst · 2024-02-14T18:09:44Z

Bug report

We came across a corrupted PDF whose startxref pointer is invalid: it points to an offset before the actual xref table. Loading the file then produces the following traceback:

Traceback (most recent call last):
    for page in extract_pages(pdf_file, laparams=LAParams(**la_params)):
  File "lib\site-packages\pdfminer\high_level.py", line 197, in extract_pages
    for page in PDFPage.get_pages(
  File "lib\site-packages\pdfminer\pdfpage.py", line 151, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 722, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 1000, in read_xref_from
    xref.load(parser)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 280, in load
    (_, stream) = parser.nextobject()
  File "lib\site-packages\pdfminer\psparser.py", line 654, in nextobject
    self.do_keyword(pos, token)
  File "lib\site-packages\pdfminer\pdfparser.py", line 92, in do_keyword
    objlen = int_value(dic["Length"])
  File "lib\site-packages\pdfminer\pdftypes.py", line 151, in int_value
    x = resolve1(x)
  File "lib\site-packages\pdfminer\pdftypes.py", line 118, in resolve1
    x = x.resolve(default=default)
  File "lib\site-packages\pdfminer\pdftypes.py", line 106, in resolve
    return self.doc.getobj(self.objid)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 851, in getobj
    raise PDFException("PDFDocument is not initialized")
pdfminer.pdftypes.PDFException: PDFDocument is not initialized

How to reproduce

Take a PDF with an existing startxref entry and set its offset to 0 instead of the real xref table offset. I modified zen_of_python_corrupted.pdf to reproduce the bug; the file is attached: zen_of_python_corrupted_xref.pdf
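For anyone who wants to produce a similar file themselves, the edit described above can be sketched as follows. This is only an illustration: `zero_startxref` is a hypothetical helper, not part of pdfminer, and it assumes the last `startxref` entry in the file is the one the reader uses.

```python
import re


def zero_startxref(pdf_bytes: bytes) -> bytes:
    """Return a copy of pdf_bytes with the last startxref offset set to 0.

    Hypothetical helper for building a test file; not a pdfminer API.
    """
    # The file normally ends with: startxref <offset> %%EOF
    matches = list(re.finditer(rb"startxref\s+(\d+)", pdf_bytes))
    if not matches:
        raise ValueError("no startxref entry found")
    # Only the digits of the last entry are replaced; everything else,
    # including the xref table itself, is left untouched.
    start, end = matches[-1].span(1)
    return pdf_bytes[:start] + b"0" + pdf_bytes[end:]
```

Because the startxref entry sits at the very end of the file, shortening the number does not shift any of the object offsets stored in the xref table. Writing the returned bytes to a new file should give an input comparable to the attached one.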

Thoughts

Adobe Reader is able to open the corrupted file and repair it. I guess one way to work around this issue would be to look for an "xref" string after seeking to the pointer, rather than trusting the offset value blindly.
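A rough sketch of that idea, working on the raw bytes rather than on pdfminer's parser, might look like this. The function names `looks_like_xref` and `find_xref_offset` are made up for illustration, and falling back to the last `xref` keyword in the file is an assumption about what a repair could do, not a description of how Adobe Reader or pdfminer behaves.

```python
import re
from typing import Optional


def looks_like_xref(data: bytes, offset: int) -> bool:
    """Check whether offset plausibly points at cross-reference data."""
    if offset < 0 or offset >= len(data):
        return False
    window = data[offset:offset + 32]
    # Classic cross-reference table: starts with the "xref" keyword.
    if window.startswith(b"xref"):
        return True
    # Cross-reference stream: starts with an object header "N G obj".
    # This is a weak check, since any indirect object would pass it.
    return re.match(rb"\d+\s+\d+\s+obj\b", window) is not None


def find_xref_offset(data: bytes) -> Optional[int]:
    """Return a usable xref offset, or None if nothing plausible is found."""
    entries = list(re.finditer(rb"startxref\s+(\d+)", data))
    if entries:
        claimed = int(entries[-1].group(1))
        if looks_like_xref(data, claimed):
            return claimed
    # Fallback (assumption): take the last "xref" keyword that is not
    # part of "startxref" as the start of the table.
    tables = list(re.finditer(rb"(?<!start)xref\b", data))
    return tables[-1].start() if tables else None
```

With the startxref offset set to 0, the claimed position holds the `%PDF-` header, so the check fails and the fallback search is used. The fallback only covers classic xref tables; a file that uses cross-reference streams exclusively would need a different recovery strategy.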
