PyPDF2 hangs in NumberObject.NumberPattern.search as called by extractText() #180

Closed
chrisinmtown opened this issue Feb 16, 2015 · 2 comments
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
needs-change: The PR/issue cannot be handled as issue and needs to be improved
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

@chrisinmtown

Please download and try getting text from this doc using PyPDF2 version 1.23:

http://emma.msrb.org/EP713514-EP554286-EP955449.pdf

When I try, it simply hangs forever in the NumberObject.NumberPattern.search() call, which is reached from extractText(). The file contains searchable text (not images) and is not encrypted; it is about 5 MB and 15 pages. A partial traceback (omitting my code) is below.

I hope this is enough information; I'm glad to provide more. Thanks for your time.

  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 2352, in extractText
    content = ContentStream(content, self.pdf)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 2425, in __init__
    data += s.getObject().getData()
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 174, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1378, in getObject
    retval = readObject(self.stream, self)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 63, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 599, in readFromStream
    length = pdf.getObject(length)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1360, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/pdf.py", line 1333, in _getObjectFromStream
    obj = readObject(streamData, self)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 95, in readObject
    return NumberObject.readFromStream(stream)
  File "/usr/local/lib/python2.7/dist-packages/PyPDF2/generic.py", line 259, in readFromStream
    m = NumberObject.NumberPattern.search(tok)
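
Until the cause of a hang like this is found, one possible workaround is to run the extraction in a separate interpreter with a time limit, so that a stuck document cannot block the calling program. The following is only a sketch under stated assumptions: the helper names (`run_python_with_timeout`, `extract_text_or_give_up`, `EXTRACT_SCRIPT`) are hypothetical and not part of PyPDF2, the harness assumes Python 3, and the child script uses the PyPDF2 1.x calls `PdfFileReader`, `getNumPages`, `getPage` and `extractText`. It does not fix the hang itself; it only bounds how long the caller waits.

```python
import subprocess
import sys


def run_python_with_timeout(code, args=(), timeout_s=30.0):
    """Run `code` in a fresh Python interpreter and return (finished, stdout).

    `finished` is False when the child exceeded `timeout_s` and was killed.
    Hypothetical helper for illustration; not part of PyPDF2.
    """
    cmd = [sys.executable, "-c", code, *args]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return False, ""
    return True, result.stdout.decode("utf-8", errors="replace")


# Child script: assumes PyPDF2 1.x is installed for the same interpreter.
EXTRACT_SCRIPT = """
import sys
from PyPDF2 import PdfFileReader

with open(sys.argv[1], "rb") as fh:
    reader = PdfFileReader(fh)
    for i in range(reader.getNumPages()):
        print(reader.getPage(i).extractText())
"""


def extract_text_or_give_up(pdf_path, timeout_s=60.0):
    """Return the extracted text, or None if the child timed out."""
    finished, text = run_python_with_timeout(EXTRACT_SCRIPT, [pdf_path], timeout_s)
    return text if finished else None
```

Calling `extract_text_or_give_up("some.pdf")` (the path is a placeholder) would return `None` for a document that triggers the hang, rather than blocking forever. Note that a child that fails for another reason, such as a missing file, still counts as finished and yields whatever it printed, so a more careful version would also inspect the return code.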

@chrisinmtown chrisinmtown changed the title from "PyPDF2 hangs when getting text from PDF - link to orig doc provided" to "PyPDF2 hangs in NumberObject.NumberPattern.search as called by extractText()" on Feb 23, 2015
@MartinThoma MartinThoma added the is-bug label on Apr 8, 2022
@MartinThoma MartinThoma added the workflow-text-extraction and Has MCVE labels on Apr 16, 2022
@MartinThoma
Member

The file is no longer accessible. Do you have an example file that is still accessible?

@MartinThoma MartinThoma added the needs-change label and removed the Has MCVE label on Apr 22, 2022
@MartinThoma
Member

I'm closing this issue as I cannot reproduce it.
