PyPDF2 hangs in NumberObject.NumberPattern.search as called by extractText() #180

chrisinmtown · 2015-02-16T02:05:06Z

Please download and try getting text from this doc using PyPDF2 version 1.23:

http://emma.msrb.org/EP713514-EP554286-EP955449.pdf

When I try, it simply hangs forever in the NumberObject.NumberPattern.search() call made from the extractText() function. The file is searchable text (not images) and not encrypted; it is about 5MB and 15 pages. A partial traceback (omitting my code) is below.

I hope this is enough information, and I'm glad to provide more. Thanks for your time.

  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 2352, in extractText
    content = ContentStream(content, self.pdf)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 2425, in __init__
    data += s.getObject().getData()
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 174, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1378, in getObject
    retval = readObject(self.stream, self)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 63, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 599, in readFromStream
    length = pdf.getObject(length)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1360, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1333, in _getObjectFromStream
    obj = readObject(streamData, self)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 95, in readObject
    return NumberObject.readFromStream(stream)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 259, in readFromStream
    m = NumberObject.NumberPattern.search(tok)
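
For anyone trying to reproduce a hang like this without blocking their own session, one option is to run the extraction in a child process with a time limit. The following is only a sketch under stated assumptions: it targets Python 3.5+ (not the Python 2.7 in the traceback), `run_with_timeout` and `EXTRACT_SCRIPT` are hypothetical names, and the child script uses only the PyPDF2 1.x calls `PdfFileReader`, `getPage`, and `extractText`. It detects a hang; it does not diagnose or fix the cause.

```python
import subprocess
import sys

# Child script (hypothetical helper): extracts the text of one page using
# the PyPDF2 1.x API. argv[1] is the PDF path, argv[2] the page index.
EXTRACT_SCRIPT = """
import sys
from PyPDF2 import PdfFileReader
with open(sys.argv[1], "rb") as handle:
    reader = PdfFileReader(handle)
    sys.stdout.write(reader.getPage(int(sys.argv[2])).extractText())
"""


def run_with_timeout(script, script_args, timeout_seconds):
    """Run a Python script in a child process with a time limit.

    Returns (finished, stdout_text). finished is False when the child
    exceeded timeout_seconds and was killed; it is True whenever the
    child exited on its own, even with a non-zero exit code.
    """
    command = [sys.executable, "-c", script] + [str(arg) for arg in script_args]
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,  # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return True, completed.stdout.decode("utf-8", errors="replace")
```

Assuming the document has been saved locally as `EP713514-EP554286-EP955449.pdf`, a call such as `run_with_timeout(EXTRACT_SCRIPT, ["EP713514-EP554286-EP955449.pdf", 0], 60)` would return `(False, "")` if extracting page 0 takes longer than 60 seconds. A separate process is used rather than a thread so that the time limit still applies if the stuck code never yields control.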


MartinThoma · 2022-04-22T15:04:29Z

The file is no longer accessible. Do you have an example file that still is accessible?

MartinThoma · 2022-06-06T12:59:11Z

I'm closing this issue as I cannot reproduce it.

chrisinmtown changed the title from ~~PyPDF2 hangs when getting text from PDF - link to orig doc provided~~ to "PyPDF2 hangs in NumberObject.NumberPattern.search as called by extractText()" on Feb 23, 2015

MartinThoma added the is-bug label ("From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF") on Apr 8, 2022

MartinThoma added the workflow-text-extraction ("From a users perspective, text extraction is the affected feature/workflow") and Has MCVE ("A minimal, complete and verifiable example helps a lot to debug / understand feature requests") labels on Apr 16, 2022

MartinThoma added the needs-change label ("The PR/issue cannot be handled as issue and needs to be improved") and removed the Has MCVE label on Apr 22, 2022

MartinThoma closed this as completed Jun 6, 2022
