Remove all instances of seek() so that we can handle streaming data again #24

pudo · 2012-12-14T08:07:21Z

After some format detection fixes, we now have a few calls to seek() in the CSV module. Those cannot work on urllib-style HTTP response data. One of the main use cases for messytables is processing streaming web data. We should remove these calls, even if this results in a loss of functionality with respect to type detection.

rufuspollock · 2012-12-24T10:18:56Z

Some thoughts:

At present, URLs do not cause a problem because of a hack in any.py, make_stream_seekable, which reads the entire file into memory. I imagine that hack should go as part of this refactor (?)
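
To make the cost of that hack concrete, here is a minimal sketch of what a helper of this kind does. The body is an assumption for illustration, not the actual code in any.py, and `NonSeekableStream` is a hypothetical stand-in for a urllib-style response object.

```python
import io


class NonSeekableStream:
    """Hypothetical stand-in for an HTTP response: read() works, seek() does not."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)

    def seek(self, *args):
        raise io.UnsupportedOperation("seek")


def make_stream_seekable(stream):
    # Assumed behaviour: drain the whole stream into memory and hand back
    # an in-memory file object, which supports seek() at the price of
    # holding the entire payload in RAM.
    return io.BytesIO(stream.read())


seekable = make_stream_seekable(NonSeekableStream(b"a,b\n1,2\n"))
first = seekable.read(3)   # peek at the start, e.g. for format detection
seekable.seek(0)           # rewinding now works
```

This is fine for small files, but it defeats streaming: nothing can be processed until the full download has finished.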

Alternative solutions (to removing use of seek):

- Implement a urllib-style urlopen that supports seek. Behind the scenes it could just make multiple urlopen calls, and if full seek support is needed it could use HTTP Range headers (see this example).
- A simpler alternative, given that all we need is seek(0): add some kind of intermediate wrapper around the stream that buffers, say, the first 10k/100k bytes and allows seek within that.
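
The second alternative could look roughly like the sketch below. It is only an illustration of the idea, not the change that was eventually merged; the class name `RewindableStream` and the `limit` parameter are hypothetical. It keeps the bytes read so far while the total stays under `limit`, and allows seek(0) only during that window.

```python
import io


class RewindableStream:
    """Wrap a non-seekable byte stream and allow seek(0) near its start."""

    def __init__(self, stream, limit=100 * 1024):
        self._stream = stream
        self._limit = limit
        self._head = bytearray()   # everything read so far, while under limit
        self._pos = 0              # logical read position
        self._overflowed = False   # True once more than `limit` bytes were read

    def read(self, size=-1):
        if size is None or size < 0:
            size = None            # None means "read to EOF"
        out = b""
        # After a rewind, serve bytes from the buffered head first.
        if self._pos < len(self._head):
            end = len(self._head)
            if size is not None:
                end = min(end, self._pos + size)
            out = bytes(self._head[self._pos:end])
            self._pos = end
            if size is not None:
                size -= len(out)
        # Then continue from the underlying stream if more was requested.
        if size is None or size > 0:
            data = self._stream.read() if size is None else self._stream.read(size)
            if not self._overflowed:
                if len(self._head) + len(data) <= self._limit:
                    self._head.extend(data)
                else:
                    # Past the window: stop buffering and free the memory.
                    self._overflowed = True
                    self._head = bytearray()
            self._pos += len(data)
            out += data
        return out

    def seek(self, offset, whence=io.SEEK_SET):
        if offset != 0 or whence != io.SEEK_SET:
            raise io.UnsupportedOperation("only seek(0) is supported")
        if self._overflowed:
            raise io.UnsupportedOperation("read past the buffered head")
        self._pos = 0
        return 0


stream = RewindableStream(io.BytesIO(b"a,b,c\n1,2,3\n4,5,6\n"), limit=8)
sample = stream.read(5)    # sniff the first bytes
stream.seek(0)             # rewind, then parse from the beginning
```

Memory use is bounded by `limit`, and detection code that only samples the start of the file can keep calling seek(0) unchanged. The Range-header alternative is not sketched here.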

That said, seek only seems to be used in a few places: two in commas.py and one in any.py.

rufuspollock · 2012-12-31T18:54:43Z

@domoritz, is there a status update here? I know you were chatting with @pudo about this (and I see you've started some work in a branch).

It would be really nice to get this fixed (in particular it would allow us to upgrade dataproxy which would be super useful!)

/cc @nigelbabu

Enable support for non seekable resources - fixes #24.

domoritz mentioned this issue Jan 3, 2013

Enable support for non seekable resources #27

Merged

domoritz closed this as completed in deec902 Jan 5, 2013

rufuspollock added a commit that referenced this issue Jan 5, 2013

Merge pull request #27 from okfn/24-http-files

993e5eb

Enable support for non seekable resources - fixes #24.
