Reading a pdf from file like object or data not working in python 3 with bytesIO #38

rdgraham · 2015-10-01T22:44:56Z

I have noticed that it is possible to make a PdfReader either by specifying a filename or file-like object, or by giving the data directly with the fdata argument. This is great; however, it doesn't work in Python 3 if I give it a BytesIO object, since the functions in the code that follows only work with strings. For example, fdata.startswith('%PDF-') is called rather than fdata.startswith(b'%PDF-').

I can't immediately see an elegant way to solve this. Directly converting the data with str() produces assertion errors such as 'File "/usr/lib/python3.4/site-packages/pdfrw/pdfreader.py", line 319, in findxref assert tok == 'startxref' # (We just checked this...)' with the files I have tried.
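A minimal sketch of the mismatch described above, using only the standard library and a made-up byte string rather than a real PDF. It shows that a bytes object rejects a str prefix in Python 3, and that str() on bytes returns the repr rather than decoded text, which is probably why the later token check fails:

```python
# Illustrative data only; not a valid PDF.
data = b"%PDF-1.4\n%%EOF\n"

# bytes.startswith() with a str prefix raises TypeError in Python 3.
try:
    data.startswith('%PDF-')
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

# str() on bytes gives the repr (it begins with b'), not the file contents.
as_repr = str(data)

# Decoding produces real text that string-based code can work with.
as_text = data.decode('latin-1')
```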

pmaupin · 2015-10-02T00:46:52Z

At a glance, I think my preferred solution would be to take the lines in PdfReader() that say:

                fdata = convert_load(fdata)
        assert fdata is not None

and change them to always execute convert_load():

        assert fdata is not None
        fdata = convert_load(fdata)

and then to modify the Python 3 (unicode not found) version of the convert_load() function to do nothing if passed a string:

def convert_load(s):
    return s if isinstance(s, str) else s.decode('Latin-1')

Since convert_load() is a no-op in Python 2, I cannot see any downside at the moment. Thoughts?
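For illustration, here is the proposed Python 3 convert_load() exercised on its own, outside pdfrw. The test values are made up. Latin-1 maps each byte value 0 to 255 onto the code point with the same number, so the decode loses nothing and can be reversed:

```python
def convert_load(s):
    # Python 3 variant proposed above: pass str through, decode bytes.
    return s if isinstance(s, str) else s.decode('Latin-1')

# Every possible byte value, to check that the decode is lossless.
raw = bytes(range(256))
text = convert_load(raw)

# A str argument is returned unchanged (same object).
same = convert_load(text)

# Encoding back with Latin-1 recovers the original bytes.
round_trip = text.encode('Latin-1')
```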

rdgraham · 2015-10-02T05:59:05Z

Thanks, I tried that and it seems to work fine; it looks like a good solution.

The only thing I would add is a call to fname.seek(0) before reading, in the case where we have been given a stream. I might be wrong, but I think most people would expect it to read the whole stream by default.

pmaupin · 2015-10-02T13:48:51Z

The seek would certainly be convenient for some code, but I would prefer that PdfReader not do it. Right now, code that needs the seek can do it before calling PdfReader. If PdfReader always seeks to zero, it will not be able to process a PDF embedded in a larger stream.
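A small sketch of that trade-off, using io.BytesIO and a hypothetical read_rest() helper as a stand-in for any reader that consumes from the current stream position. It is not pdfrw code, and the byte strings are made up:

```python
import io

def read_rest(stream):
    # Hypothetical stand-in: reads from the current position to the end,
    # without seeking first.
    return stream.read()

prefix = b"CONTAINER-HEADER\n"        # non-PDF bytes ahead of the PDF
pdf_bytes = b"%PDF-1.4\n%%EOF\n"      # illustrative only, not a valid PDF
stream = io.BytesIO(prefix + pdf_bytes)

# The caller positions the stream at the embedded PDF before reading.
stream.seek(len(prefix))
embedded = read_rest(stream)

# A caller that wants the whole stream rewinds it explicitly.
stream.seek(0)
whole = read_rest(stream)
```

If the reader always rewound to zero, the first case would return the container header as well, and the caller would have no way to prevent it.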

Thanks for the bug report and testing the solution!

skidzo · 2015-10-20T14:54:35Z

Very happy to find this little conversation here,
but will you change the code to address this issue in the future?

Cheers!

pmaupin · 2015-10-20T16:52:32Z

Yes, that's why the issue is still open :-)

skidzo · 2015-10-20T16:58:41Z

I have tested it and it seems to work. If I can contribute in any way, please let me know ;-)

pmaupin · 2015-10-21T04:15:28Z

Pull requests are good -- do you think you could add a test to the current test suite? It's not documented terribly well, but there is some getting-started stuff in the readme.

Thanks,
Pat

skidzo · 2015-10-21T07:49:38Z

OK! I'll start with the readme file!

Make read of in-memory PDF work with 3.x - This closes issue #38 with code discussed there plus a regression test. - Also add Python version 3.5 to regression tests - Also add OSX filetypes to .gitignore

pmaupin · 2015-12-13T20:20:54Z

Fixed with PR #43 (Thanks, @b4stien!)

pmaupin mentioned this issue Dec 12, 2015

Add tests for PdfReader with in-memory pdf #43

Merged

pmaupin closed this as completed Dec 13, 2015
