"PDFDocument is not initialized" when startxref is invalid #946

Open
umaplehurst opened this issue Feb 14, 2024 · 0 comments

Bug report

We came across a corrupted PDF whose startxref pointer is invalid: it points to an offset before the actual xref table in the file. Loading the file then fails with the following backtrace:

Traceback (most recent call last):
    for page in extract_pages(pdf_file, laparams=LAParams(**la_params)):
  File "lib\site-packages\pdfminer\high_level.py", line 197, in extract_pages
    for page in PDFPage.get_pages(
  File "lib\site-packages\pdfminer\pdfpage.py", line 151, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 722, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 1000, in read_xref_from
    xref.load(parser)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 280, in load
    (_, stream) = parser.nextobject()
  File "lib\site-packages\pdfminer\psparser.py", line 654, in nextobject
    self.do_keyword(pos, token)
  File "lib\site-packages\pdfminer\pdfparser.py", line 92, in do_keyword
    objlen = int_value(dic["Length"])
  File "lib\site-packages\pdfminer\pdftypes.py", line 151, in int_value
    x = resolve1(x)
  File "lib\site-packages\pdfminer\pdftypes.py", line 118, in resolve1
    x = x.resolve(default=default)
  File "lib\site-packages\pdfminer\pdftypes.py", line 106, in resolve
    return self.doc.getobj(self.objid)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 851, in getobj
    raise PDFException("PDFDocument is not initialized")
pdfminer.pdftypes.PDFException: PDFDocument is not initialized

How to reproduce

Take a PDF with an existing startxref entry and set its offset to 0 instead of the real xref table offset. I modified zen_of_python_corrupted.pdf in this way to reproduce the bug. The resulting file is attached: zen_of_python_corrupted_xref.pdf
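
For anyone who wants to produce a similar file without a hex editor, here is a minimal sketch of that edit. It is not the script used to create the attachment; `corrupt_startxref` is a hypothetical helper name, and it assumes the file ends with a plain `startxref <number>` entry.

```python
import re


def corrupt_startxref(pdf_bytes: bytes, new_offset: int = 0) -> bytes:
    """Return a copy of pdf_bytes whose last startxref offset is replaced.

    Hypothetical helper for reproducing the bug; only the final
    startxref entry is touched, everything before it is left byte-for-byte.
    """
    matches = list(re.finditer(rb"startxref\s+(\d+)", pdf_bytes))
    if not matches:
        raise ValueError("no startxref entry found")
    # Span of the digits in the last startxref entry.
    start, end = matches[-1].span(1)
    return pdf_bytes[:start] + str(new_offset).encode("ascii") + pdf_bytes[end:]
```

Reading the original file in binary mode, passing its bytes through this function, and writing the result back out in binary mode should give a file with the same symptom.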

Thoughts

Adobe Reader is able to open the corrupted file and repair it. I guess one way to work around this issue would be to not trust the offset value blindly: after seeking to the pointer, look for an "xref" string and use its position instead.
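
A rough sketch of that idea, operating on the raw file bytes rather than on pdfminer's parser. `locate_xref` is a hypothetical name and not part of pdfminer.six. The sketch also assumes a classic xref table that starts with the `xref` keyword, so it would not cover files that use cross-reference streams.

```python
import re
from typing import Optional

# Match the "xref" keyword, but not the tail of "startxref".
_XREF_RE = re.compile(rb"(?<!start)xref\b")


def locate_xref(data: bytes, claimed_offset: int) -> Optional[int]:
    """Return the offset of the first xref keyword at or after claimed_offset.

    Hypothetical helper: if the startxref value is out of range, the scan
    starts from the beginning of the file. Returns None if no keyword is found.
    """
    if not 0 <= claimed_offset <= len(data):
        claimed_offset = 0
    match = _XREF_RE.search(data, claimed_offset)
    return match.start() if match else None
```

When the claimed offset is already correct, the function returns it unchanged. When it points too early, as in the attached file, the function returns the position of the next `xref` keyword. A loader could then seek to that position before reading the table.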
