PyPDF2 hangs in NumberObject.NumberPattern.search as called by extractText() #180
Labels
is-bug
From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
needs-change
The PR/issue cannot be handled as submitted and needs to be improved
workflow-text-extraction
From a user's perspective, text extraction is the affected feature/workflow
Please download and try getting text from this doc using PyPDF2 version 1.23:
When I try, it simply hangs forever in the NumberObject.NumberPattern.search() call, which is reached from the extractText() function. The file contains searchable text (not images) and is not encrypted; it is about 5 MB and 15 pages. A partial traceback (omitting my code) is below.
Hope this is enough information, glad to provide more, thanks for your time.
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 2352, in extractText
    content = ContentStream(content, self.pdf)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 2425, in __init__
    data += s.getObject().getData()
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 174, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1378, in getObject
    retval = readObject(self.stream, self)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 63, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 599, in readFromStream
    length = pdf.getObject(length)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1360, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1333, in _getObjectFromStream
    obj = readObject(streamData, self)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 95, in readObject
    return NumberObject.readFromStream(stream)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 259, in readFromStream
    m = NumberObject.NumberPattern.search(tok)
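For anyone trying to reproduce this without leaving a stuck process behind, here is a rough sketch of a harness that runs the extraction in a separate interpreter and kills it after a timeout. It is only an illustration, not something taken from the report: the helper name run_isolated, the file name "problem.pdf", and the 60-second limit are all placeholders, and it assumes the harness itself runs on Python 3.5+ (for subprocess.run) with a PyPDF2 1.x release installed for the child interpreter.

```python
import subprocess
import sys

# Child script using the PyPDF2 1.x API named in the report
# (PdfFileReader, getNumPages, getPage, extractText).
# "problem.pdf" is a hypothetical placeholder for the attached document.
EXTRACT_SNIPPET = """
from PyPDF2 import PdfFileReader

with open("problem.pdf", "rb") as fh:
    reader = PdfFileReader(fh)
    for i in range(reader.getNumPages()):
        text = reader.getPage(i).extractText()
        print(i, len(text))
"""


def run_isolated(code, seconds):
    """Run `code` in a fresh Python interpreter.

    Returns the child's exit code, or None if it was still running after
    `seconds` (subprocess.run kills the child in that case).
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            timeout=seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

Calling run_isolated(EXTRACT_SNIPPET, 60) would then return None if extraction is still running after a minute, 0 if it completes, and a non-zero code if the child fails for some other reason (for example, PyPDF2 not being installed or the file being missing).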