We came across a corrupted .pdf where the startxref pointer is invalid and points to an offset before the actual xref table in the .pdf. The following backtrace is then observed during loading:
Traceback (most recent call last):
    for page in extract_pages(pdf_file, laparams=LAParams(**la_params)):
  File "lib\site-packages\pdfminer\high_level.py", line 197, in extract_pages
    for page in PDFPage.get_pages(
  File "lib\site-packages\pdfminer\pdfpage.py", line 151, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 722, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 1000, in read_xref_from
    xref.load(parser)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 280, in load
    (_, stream) = parser.nextobject()
  File "lib\site-packages\pdfminer\psparser.py", line 654, in nextobject
    self.do_keyword(pos, token)
  File "lib\site-packages\pdfminer\pdfparser.py", line 92, in do_keyword
    objlen = int_value(dic["Length"])
  File "lib\site-packages\pdfminer\pdftypes.py", line 151, in int_value
    x = resolve1(x)
  File "lib\site-packages\pdfminer\pdftypes.py", line 118, in resolve1
    x = x.resolve(default=default)
  File "lib\site-packages\pdfminer\pdftypes.py", line 106, in resolve
    return self.doc.getobj(self.objid)
  File "lib\site-packages\pdfminer\pdfdocument.py", line 851, in getobj
    raise PDFException("PDFDocument is not initialized")
pdfminer.pdftypes.PDFException: PDFDocument is not initialized
How to reproduce
Take a .pdf with an existing startxref and set its offset to 0 instead of the real xref table offset. I modified zen_of_python_corrupted.pdf to reproduce this bug; the file is attached: zen_of_python_corrupted_xref.pdf
Thoughts
Adobe Reader is able to open the corrupted file and repair it. I guess one way to work around this issue would be to look for an "xref" string after seeking to the pointer, rather than trusting the offset value blindly.
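To make the reproduction step and the suggested workaround concrete, here is a rough sketch that works on the raw bytes of a file using only the standard library. It is not pdfminer code: the helper names `corrupt_startxref`, `read_startxref`, and `locate_xref` are hypothetical, and the scan assumes a classic `xref` table (it would not find a cross-reference stream, and it could be fooled by the bytes "xref" appearing inside other content).

```python
import re

# Matches the "startxref <offset>" entry near the end of a PDF file.
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)")
# Matches the "xref" keyword, but not the tail of "startxref".
XREF_KEYWORD_RE = re.compile(rb"(?<!start)xref")


def _last_startxref(data: bytes) -> "re.Match[bytes]":
    """Return the match for the last startxref entry in the file."""
    matches = list(STARTXREF_RE.finditer(data))
    if not matches:
        raise ValueError("no startxref entry found")
    return matches[-1]


def corrupt_startxref(data: bytes, new_offset: int = 0) -> bytes:
    """Reproduction helper: overwrite the last startxref offset."""
    m = _last_startxref(data)
    return data[: m.start(1)] + str(new_offset).encode("ascii") + data[m.end(1):]


def read_startxref(data: bytes) -> int:
    """Return the offset stored in the last startxref entry."""
    return int(_last_startxref(data).group(1))


def locate_xref(data: bytes) -> int:
    """Return the xref table offset, scanning forward if the pointer is wrong."""
    pos = read_startxref(data)
    # Trust the pointer only if the keyword is really there.
    if data[pos : pos + 4] == b"xref":
        return pos
    # Otherwise look for the next "xref" keyword after the pointer.
    m = XREF_KEYWORD_RE.search(data, pos)
    if m is None:
        raise ValueError("no xref table found after offset %d" % pos)
    return m.start()
```

For a file whose startxref was set to 0, `locate_xref(corrupt_startxref(data))` would return the position of the first `xref` keyword in the file instead of 0. A real fix inside pdfminer would also need to decide what to do when several xref sections exist (incremental updates), which this sketch ignores.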