Tolerance for extra whitespace in Indirect Objects
mstamy2 committed Feb 5, 2014
1 parent b92eef2 commit 4f3311e
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions PyPDF2/generic.py
@@ -85,7 +85,7 @@ def readObject(stream, pdf):
             return NumberObject.readFromStream(stream)
         peek = stream.read(20)
         stream.seek(-len(peek), 1) # reset to start
-        if re.match(b_(r"(\d+)\s(\d+)\sR[^a-zA-Z]"), peek) != None:
+        if re.match(b_(r"(\d+)\s+(\d+)\s+R[^a-zA-Z]"), peek) != None:
             return IndirectObject.readFromStream(stream, pdf)
         else:
             return NumberObject.readFromStream(stream)
@@ -204,9 +204,11 @@ def readFromStream(stream, pdf):
                 # stream has truncated prematurely
                 raise PdfStreamError("Stream has ended unexpectedly")
             if tok.isspace():
+                if not generation:
+                    continue
                 break
             generation += tok
-        r = stream.read(1)
+        r = utils.readNonWhitespace(stream)
         if r != b_("R"):
             raise utils.PdfReadError("Error reading indirect object reference at byte %s" % utils.hexStr(stream.tell()))
         return IndirectObject(int(idnum), int(generation), pdf)
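
The effect of the regex change in `readObject` can be checked in isolation. The sketch below copies the two patterns from the diff but uses plain bytes literals instead of PyPDF2's `b_()` wrapper; the helper name `looks_like_reference` and the sample inputs are made up for illustration and are not part of PyPDF2.

```python
import re

# Patterns copied from the diff; b_() is replaced by plain bytes literals here.
OLD = re.compile(rb"(\d+)\s(\d+)\sR[^a-zA-Z]")      # exactly one whitespace byte
NEW = re.compile(rb"(\d+)\s+(\d+)\s+R[^a-zA-Z]")    # one or more whitespace bytes


def looks_like_reference(pattern, peek):
    """Hypothetical helper: True if the peeked bytes start with 'N G R'."""
    return pattern.match(peek) is not None


single = b"12 0 R /Next"      # conventional single-space form
extra = b"12  0   R /Next"    # extra whitespace between the tokens
not_ref = b"12 0 RG "         # 'R' followed by a letter, so not a reference

for name, sample in (("single", single), ("extra", extra), ("not_ref", not_ref)):
    print(name,
          "old:", looks_like_reference(OLD, sample),
          "new:", looks_like_reference(NEW, sample))
```

Only the `extra` sample should behave differently under the two patterns: the old pattern rejects it, so `readObject` would fall through to `NumberObject.readFromStream`, while the new one accepts it.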

2 comments on commit 4f3311e

@ulion ulion commented on 4f3311e Feb 5, 2014

Wow, you have readNonWhitespace, that's great, I didn't know that.
This commit looks good and should resolve the multiple-space problem.

But I think there may still be places in the parsing of other tokens that only work with a single space; it would be best if you could check and resolve them all.

@mstamy2 mstamy2 (Collaborator, Author) commented on 4f3311e Feb 6, 2014

Thanks - I'll look into other areas that may not account for extra whitespace. Most parsing methods use readNonWhitespace; where it isn't used, the PDF should technically follow a specific format, but PyPDF2 should be able to handle PDFs that deviate from the PDF standard.
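
For readers who want to see the general idea, here is a minimal, self-contained sketch of whitespace-tolerant parsing of an indirect reference such as `12 0 R`. It is not PyPDF2's code: `read_non_whitespace`, `read_token`, and `read_reference` are hypothetical stand-ins written for this example, and the whitespace set is the six characters the PDF specification classifies as white space.

```python
import io

# PDF white-space characters: NUL, tab, line feed, form feed, carriage return, space.
WHITESPACE = b"\x00\t\n\x0c\r "


def read_non_whitespace(stream):
    """Hypothetical stand-in: skip whitespace and return the first other byte.

    Returns b"" if the stream ends first.
    """
    tok = stream.read(1)
    while tok and tok in WHITESPACE:
        tok = stream.read(1)
    return tok


def read_token(stream):
    """Skip leading whitespace, then collect bytes up to the next whitespace."""
    tok = read_non_whitespace(stream)
    if not tok:
        raise ValueError("stream ended unexpectedly")
    token = b""
    while tok and tok not in WHITESPACE:
        token += tok
        tok = stream.read(1)
    return token


def read_reference(stream):
    """Parse 'idnum generation R', tolerating any amount of whitespace between parts."""
    idnum = read_token(stream)
    generation = read_token(stream)
    marker = read_non_whitespace(stream)
    if marker != b"R":
        raise ValueError("expected 'R' before byte %d" % stream.tell())
    return int(idnum), int(generation)


print(read_reference(io.BytesIO(b"12 0 R")))
print(read_reference(io.BytesIO(b"12  0 \n R")))
```

Both calls should yield the same pair, because every step skips leading whitespace instead of assuming exactly one separator byte.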