
[discussion] messytables should *only* work with local files #59

Closed
rossjones opened this issue Jun 3, 2013 · 10 comments

@rossjones
Contributor

Messytables doesn't work well in a lot of situations when the provided fileobj is a socket.

The BufferedFile object attempts to resolve this, but in a lot of cases it will force a read(-1) and cause a complete download of the file (into RAM) anyway. This is particularly true of anything that wants to seek within the file (such as zip and xls), or of the buffer passed to magic.from_buffer (which is inadequate in some cases; from_file would be more accurate).

Downloading the content to temporary storage isn't an onerous task, and if the interface was modified to use filenames instead of file objects it could even transparently download the content when a URL is provided (which it is destined to do anyway at some point).
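
Roughly what I'm imagining, as a sketch (as_local_filename is a made-up name, not anything in messytables; it only needs requests, tempfile and shutil):

import shutil
import tempfile
import requests

def as_local_filename(path_or_url):
    # Local paths pass straight through
    if not path_or_url.startswith(('http://', 'https://')):
        return path_or_url
    # Remote content is spooled to a temporary file rather than held in RAM
    response = requests.get(path_or_url, stream=True)
    tmp = tempfile.NamedTemporaryFile(suffix='.download', delete=False)
    shutil.copyfileobj(response.raw, tmp)
    tmp.close()
    return tmp.name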

@rufuspollock
Member

This would break use of messytables on e.g. appengine, where it is used for jsondataproxy (at least I think it would). One of the main reasons for having streaming was to support that.

What are your thoughts? What are others' thoughts on the tradeoff?

@rossjones
Contributor Author

jsondataproxy downloads the entire file before processing, doesn't it? If it doesn't, it is probably currently doing so for xls files anyway, and zip files will be broken. As will anything that is an office doc encoded as CDF (could be .pub, .doc etc.).

CDF
http://jsonpdataproxy.appspot.com/?url=http://www.hm-treasury.gov.uk/d/hmt_businessplan_qds_april12.xls
This reports "Workbook is encrypted" (it isn't; it's just CDF) because magic.from_buffer doesn't read enough of the header to work out that it is actually XLS, and instead thinks it is corrupted CDF.

http://jsonpdataproxy.appspot.com/?url=http://www.connectingforhealth.nhs.uk/systemsandservices/data/ods/genmedpracs/datafiles/epraccur.zip
Doesn't currently handle ZIP (but it could)

I know that these seem like fixable bugs, but the only way to fix them is to download the entire content (to pass the full file to magic, and to be able to seek into the zip). It just seems that if you're going to download them anyway ...
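
For illustration, with the python-magic bindings (the exact buffer size messytables passes may differ; this is just to show the from_buffer / from_file gap, not code from messytables):

import magic

with open('hmt_businessplan_qds_april12.xls', 'rb') as f:
    prefix = f.read(1024)
# from_buffer only sees the bytes it is given, so a short prefix of a
# CDF container can be misread ...
print(magic.from_buffer(prefix))
# ... whereas from_file can examine the whole file
print(magic.from_file('hmt_businessplan_qds_april12.xls'))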

@rufuspollock
Member

No, jsondataproxy does not download the whole file (at least for CSV) - it uses the stream approach (note that for xls it of course does download the whole file ...)

@JoshData
Contributor

JoshData commented Jun 3, 2013

Allowing a complete download as an option --- or allowing streaming as an option --- would be very helpful. I use messytables to load data into the CKAN datastore, so I need to download the whole file anyway.

@domoritz
Contributor

domoritz commented Jun 7, 2013

I think support for streaming is a huge advantage and an initial design goal of messytables. If we don't want this any more, I suggest we rewrite large parts of it anyway; that would mean major simplifications, and messytables would become much simpler. Despite that advantage, I think streaming support is an awesome feature that makes many things much faster. With streaming support, I could make the dataproxy answer in just a few seconds instead of tens of seconds.

You are probably right that many applications do not need streaming support, but some definitely benefit from it.

@scraperdragon

I think there's a big difference between "should only work with local files" and "should stream".

I don't care about streaming; but I do care about it working with remote files - e.g. things grabbed from a webservice. Now, granted, I often just say:

import requests
import StringIO  # Python 2; use io.BytesIO on Python 3

data = requests.get("http://target.site/data.csv").content  # whole body lands in RAM
filehandle = StringIO.StringIO(data)
do_some_stuff_with(filehandle)

but I wouldn't want internals to assume that there's always a file on the disk that has a filename and is tolerably cheap to just keep reading all the time.

@rossjones
Contributor Author

Problem with that approach is that it means we often end up with big chunks of data in RAM. And whilst it attempts to 'stream' everything, in a lot of cases (non-CSV) it ends up doing exactly that anyway. I guess I'd just prefer that if it knew it couldn't stream something, there was an alternative to me keeping 20MB xls files in RAM. Doesn't seem a lot for a single user, but for dozens/hundreds ...

@frabcus
Contributor

frabcus commented Jun 19, 2013

It sounds to me like it should either take a filehandle (and then stream, buffering if necessary) or a filename (and then know it can seek without buffering).

Can you inspect a filehandle and find out if it refers to local data? (either already in memory, or already on local disk)
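
Something like this, perhaps? (refers_to_local_data is just a name I made up; a sketch, not messytables code):

import os
import stat

def refers_to_local_data(fileobj):
    try:
        fd = fileobj.fileno()
    except (AttributeError, IOError):
        # In-memory buffers (StringIO etc.) have no file descriptor
        return hasattr(fileobj, 'getvalue')
    # A regular file on disk can be seeked cheaply; sockets and pipes cannot
    return stat.S_ISREG(os.fstat(fd).st_mode)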

@rossjones
Contributor Author

You can check if it is seekable, and the BufferedFile is provided for that purpose. The problem is that the BufferedFile seek() implementation is broken, and the way BufferedFile is used means it all just gets read into a StringIO. It would be tempting, for non-seekable FDs, to just download the data to disk and then carry on with the new file FD. Which is sort of what I was after.
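
Roughly what I mean, as a sketch (ensure_seekable is a hypothetical helper; tempfile and shutil are stdlib):

import shutil
import tempfile

def ensure_seekable(fileobj):
    try:
        fileobj.seek(0, 1)  # no-op seek to probe seekability
        return fileobj
    except (AttributeError, IOError, ValueError):
        pass
    # Spool the non-seekable stream to disk and hand back a seekable file FD
    tmp = tempfile.TemporaryFile()
    shutil.copyfileobj(fileobj, tmp)
    tmp.seek(0)
    return tmp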

@rossjones
Contributor Author

I'm a bit clearer on expectations for how it should work now, so I think it's probably safe to close this ticket. I think my main concern is covered by #58.
