
Support BytesIO/Stream-like objects instead of just filepath #2

Closed · AnkurDedania opened this issue Aug 1, 2016 · 5 comments

jjhelmus (Owner) commented Aug 3, 2016

@AnkurDedania This is a great idea! There is no reason pyfive shouldn't be able to read from any object that correctly implements read, seek, and peek methods. I'll look into adding this to the library later this week.
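
Roughly, anything satisfying this duck-typed contract should do (a sketch, not pyfive code; the helper name is made up). Note that a plain io.BytesIO lacks peek, though io.BufferedReader can supply it:

import io

def looks_file_like(obj):
    # Sketch of the duck-typed contract: all pyfive would need is
    # working read, seek and peek methods on the object.
    return all(callable(getattr(obj, name, None))
               for name in ('read', 'seek', 'peek'))

looks_file_like(io.BufferedReader(io.BytesIO(b'')))  # True
looks_file_like(io.BytesIO(b''))  # False: BytesIO itself has no peek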

synaptic (Contributor) commented:

Jon, this is Benj; Ankur and I talked to you at PyDataChi. If I understand correctly, the BytesIO option might allow you to read an HDF5 file from an object store, right? For example, MS Azure has an API function called get_blob_to_stream. No disk needed. That could be powerful.

Another example of this is accessing memory images of HDF5 files. See the relevant conversations on the h5py, xarray, and pytables GitHub forums here, here, and here.
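
As a sketch of the memory-image case (untested, and assuming pyfive.File will take file-like objects as discussed above):

import io
import pyfive

# Bytes from anywhere: an object-store download, a socket, a cache.
data = open('./tests/latest.hdf5', 'rb').read()

# BufferedReader adds the peek() that a bare BytesIO lacks.
hfile = pyfive.File(io.BufferedReader(io.BytesIO(data)))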

jjhelmus (Owner) commented Sep 2, 2016

@synaptic Good to hear from you. Reading from BytesIO or other file-like objects should allow you to read an HDF5 file which exists entirely in memory. With the changes in 94d5c64, as long as you pass pyfive.File an object whose read, peek and seek methods behave like those of a Python file object, it should work just like an on-disk HDF5 file.

For example, the following works:

>>> import pyfive
>>> f = open('./tests/latest.hdf5', 'rb')  # binary mode
>>> hfile = pyfive.File(f)
>>> hfile.attrs['attr1']
-123
>>> f.close()

synaptic (Contributor) commented Sep 2, 2016

@jjhelmus I got pyfive to work with the Azure blob API! The only issue I had is that the object also has to have a tell method. I'm not yet sure this performs better than reading the whole blob and then attaching it as a file or memory image; more testing may tell. I wonder if moving some of the code to C would help performance (though avoiding C was a main goal of yours).
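
For reference, the shape of what I did (a rough sketch; the account, container, and blob names are placeholders, and the azure-storage API may differ by SDK version):

import io
import pyfive
from azure.storage.blob import BlockBlobService

# Pull the blob into an in-memory stream -- no disk involved.
service = BlockBlobService(account_name='myaccount', account_key='...')
stream = io.BytesIO()
service.get_blob_to_stream('mycontainer', 'latest.hdf5', stream)
stream.seek(0)  # rewind before handing the stream to pyfive

# BufferedReader provides the peek() and tell() methods pyfive expects.
hfile = pyfive.File(io.BufferedReader(stream))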

In order to work fully with my data I would need support for chunks, compression, and shuffle. I see you have a branch for compression. Maybe we can help.

jjhelmus (Owner) commented Sep 7, 2016

@synaptic Great to hear that pyfive works on Azure blobs. The file-like object does need a tell method although this requirement could be removed without too much effort as it is only used in one place.

I'm not particularly surprised to hear about the less than optimal performance as efficiency has not been a priority. When reading data from a Dataset, pyfive currently loads all chunks into memory before slicing the requested data. This behavior is very inefficient when only a small region of the data is requested which could be extracted from a small number of chunks. Improving this is on the pyfive roadmap.
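
The core of the fix is to compute which chunks a slice overlaps and read only those, along these lines (an illustrative 1-D sketch, not pyfive's code):

def chunks_for_slice(start, stop, chunk_size):
    # Chunk indices touched by the 1-D slice [start, stop).
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return range(first, last + 1)

# Ten elements from a million-element dataset in 1000-element chunks
# touch at most two chunks instead of all 1000.
list(chunks_for_slice(999995, 1000005, 1000))  # -> [999, 1000]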

The compression branch does have the start of the logic needed to support data compression and shuffling. I hope to clean this up soon and get it merged. Chunks should already be supported.
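
For reference, HDF5 applies shuffle then deflate on write, so a reader reverses the order; a sketch of the decode step (not the branch's actual code):

import zlib
import numpy as np

def decode_chunk(raw, itemsize):
    # Reverse the write-time filters: inflate first, then un-shuffle.
    data = zlib.decompress(raw)
    # Shuffle stores all first bytes of every element, then all second
    # bytes, and so on; reshape + transpose restores element order.
    planes = np.frombuffer(data, dtype=np.uint8)
    return planes.reshape(itemsize, len(data) // itemsize).T.tobytes()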
