Reading a PDF from a file-like object or data not working in Python 3 with BytesIO #38

Closed
rdgraham opened this issue Oct 1, 2015 · 9 comments

Comments

@rdgraham

rdgraham commented Oct 1, 2015

I have noticed that it is possible to make a PdfReader either by specifying a filename or file-like object, or by giving the data directly with the fdata argument. This is great; however, it doesn't work if I give it a BytesIO object, since the various functions in the following code only work with strings. For example, fdata.startswith('%PDF-') is called rather than fdata.startswith(b'%PDF-').

I can't immediately see an elegant way to solve this. Directly converting the data with str() produces assertion errors such as 'File "/usr/lib/python3.4/site-packages/pdfrw/pdfreader.py", line 319, in findxref assert tok == 'startxref' # (We just checked this...)' with the files I have tried.
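The mismatch described above can be reproduced without pdfrw at all. The following is a minimal sketch using only the standard library; the sample bytes are made up for illustration and are not a valid PDF. It also shows what str() does to bytes in Python 3, which is a likely reason the parser's token assertions fail after a direct str() conversion.

```python
import io

# A made-up stand-in for real PDF bytes (assumption: not a valid PDF).
raw = io.BytesIO(b"%PDF-1.4\n...").read()

# In Python 3, bytes.startswith() rejects a str prefix with a TypeError.
try:
    raw.startswith("%PDF-")
    mixed_ok = True
except TypeError:
    mixed_ok = False

# A bytes prefix works as expected.
bytes_ok = raw.startswith(b"%PDF-")

# str() on bytes returns the repr (it begins with b'), not the decoded
# text, so code that looks for '%PDF-' at the start no longer finds it.
as_text = str(raw)
```

Decoding the bytes with an explicit codec, rather than calling str(), avoids the repr problem.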

@pmaupin
Owner

pmaupin commented Oct 2, 2015

At a glance, I think my preferred solution would be to take the lines in PdfReader() that say:

                fdata = convert_load(fdata)
        assert fdata is not None

and change them to always execute convert_load():

        assert fdata is not None
        fdata = convert_load(fdata)

and then to modify the Python 3 (unicode not found) version of the convert_load() function to do nothing if passed a string:

def convert_load(s):
    return s if isinstance(s, str) else s.decode('Latin-1')

Since convert_load() is a no-op in Python 2, I cannot see any downside at the moment. Thoughts?
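As a quick check of why Latin-1 is a safe choice for this conversion, here is a small sketch. The convert_load() body is the one proposed above; the round-trip check around it is an addition for illustration and is not part of pdfrw.

```python
def convert_load(s):
    """Python 3 variant from the comment above: pass str through, decode bytes."""
    return s if isinstance(s, str) else s.decode('Latin-1')

# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))
text = convert_load(all_bytes)

# Latin-1 maps each byte to the code point with the same number, so
# decoding never fails and encoding recovers the original bytes exactly.
round_trip = text.encode('Latin-1')

# A str argument is returned unchanged.
already_text = convert_load('%PDF-1.4')
```

Because the mapping is one-to-one, binary stream content survives the conversion to str and back without loss.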

@rdgraham
Author

rdgraham commented Oct 2, 2015

Thanks, I tried that solution; it seems to work fine and looks like a good fix.

The only thing I would add is a call to fname.seek(0) before reading when we have been given a stream. I might be wrong, but I think most people would expect it to read the whole stream by default.

@pmaupin
Owner

pmaupin commented Oct 2, 2015

The seek would certainly be convenient for some code, but I would prefer that PdfReader not do it. Right now, code that needs the seek can do it before calling PdfReader. If PdfReader always seeks to zero, it will not be able to process a PDF embedded in a larger stream.
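To illustrate the trade-off, here is a minimal sketch with a hypothetical read_rest() helper standing in for any reader that starts at the stream's current position. The container layout and sample bytes are invented for the example.

```python
import io

# Hypothetical container: 16 bytes of unrelated header, then PDF-like bytes.
header = b"X" * 16
pdf_bytes = b"%PDF-1.4\n...%%EOF\n"
stream = io.BytesIO(header + pdf_bytes)

def read_rest(fileobj):
    """Hypothetical stand-in for a reader that starts at the current position."""
    return fileobj.read()

# The caller positions the stream at the embedded PDF before handing it over.
stream.seek(len(header))
embedded = read_rest(stream)

# A caller that wants the whole stream rewinds it explicitly first.
stream.seek(0)
whole = read_rest(stream)
```

If the reader itself always rewound to zero, the first case would be impossible, while the second case costs the caller only one line.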

Thanks for the bug report and testing the solution!

@skidzo

skidzo commented Oct 20, 2015

Very happy to find this little conversation here, but will you change the code to address this issue in the future?

Cheers!

@pmaupin
Owner

pmaupin commented Oct 20, 2015

Yes, that's why the issue is still open :-)

@skidzo

skidzo commented Oct 20, 2015

I have tested it and it seems to work. If I can contribute in any way, please let me know ;-)


@pmaupin
Owner

pmaupin commented Oct 21, 2015

Pull requests are good -- do you think you could add a test to the current test suite? It's not documented terribly well, but there is some getting-started stuff in the readme.

Thanks,
Pat

@skidzo

skidzo commented Oct 21, 2015

OK! I'll start with the README file!


pmaupin added a commit that referenced this issue Dec 13, 2015
Make read of in-memory PDF work with 3.x

- This closes issue #38 with code discussed there plus a regression test.
- Also add Python version 3.5 to regression tests
- Also add OSX filetypes to .gitignore
@pmaupin
Owner

pmaupin commented Dec 13, 2015

Fixed with PR #43 (thanks, @b4stien!)

@pmaupin pmaupin closed this as completed Dec 13, 2015