
Support BytesIO/Stream-like objects instead of just filepath #2

Closed · AnkurDedania opened this issue Aug 1, 2016 · 5 comments

jjhelmus (Owner) commented Aug 3, 2016

@AnkurDedania This is a great idea! There is no reason pyfive shouldn't be able to read from any object that correctly implements read, seek, and peek methods. I'll look into adding this to the library later this week.
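
Roughly, anything satisfying this duck-typed contract should do (a sketch, not pyfive code; the helper name is made up). Note that a plain io.BytesIO lacks peek, though io.BufferedReader can supply it:

import io

def looks_file_like(obj):
    # Sketch of the duck-typed contract: all pyfive would need is
    # working read, seek and peek methods on the object.
    return all(callable(getattr(obj, name, None))
               for name in ('read', 'seek', 'peek'))

looks_file_like(io.BufferedReader(io.BytesIO(b'')))  # True
looks_file_like(io.BytesIO(b''))  # False: BytesIO itself has no peek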

synaptic (Contributor) commented:

Jon, this is Benj; Ankur and I talked to you at PyDataChi. If I understand correctly, the BytesIO option might allow you to read an HDF5 file from an object store, right? For example, MS Azure has an API function called get_blob_to_stream. No disk needed. That could be powerful.

Another example of this is accessing memory images of HDF5 files. See the relevant conversations on the h5py, xarray, and pytables GitHub forums here, here, and here.
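
As a sketch of the memory-image case (untested, and assuming pyfive.File will take file-like objects as discussed above):

import io
import pyfive

# Bytes from anywhere: an object-store download, a socket, a cache.
data = open('./tests/latest.hdf5', 'rb').read()

# BufferedReader adds the peek() that a bare BytesIO lacks.
hfile = pyfive.File(io.BufferedReader(io.BytesIO(data)))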

jjhelmus (Owner) commented Sep 2, 2016

@synaptic Good to hear from you. Reading from BytesIO or other file-like objects should allow you to read an HDF5 file which exists entirely in memory. With the changes in 94d5c64, as long as you pass pyfive.File an object whose read, peek and seek methods behave like those of a Python file object, it should work just like an on-disk HDF5 file.

For example, the following works:

>>> import pyfive
>>> f = open('./tests/latest.hdf5', 'rb')  # binary mode
>>> hfile = pyfive.File(f)
>>> hfile.attrs['attr1']
-123
>>> f.close()

synaptic (Contributor) commented Sep 2, 2016

@jjhelmus I got pyfive to work with the Azure blob API! The only issue I had is that the object also has to have a tell method. I'm not yet sure this performs better than reading the whole blob and then attaching it as a file or memory image; more testing may tell. I wonder if moving some of the code to C would help performance (though avoiding C was a main goal of yours).
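
For reference, the shape of what I did (a rough sketch; the account, container, and blob names are placeholders, and the azure-storage API may differ by SDK version):

import io
import pyfive
from azure.storage.blob import BlockBlobService

# Pull the blob into an in-memory stream -- no disk involved.
service = BlockBlobService(account_name='myaccount', account_key='...')
stream = io.BytesIO()
service.get_blob_to_stream('mycontainer', 'latest.hdf5', stream)
stream.seek(0)  # rewind before handing the stream to pyfive

# BufferedReader provides the peek() and tell() methods pyfive expects.
hfile = pyfive.File(io.BufferedReader(stream))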

In order to work fully with my data I would need support for chunks, compression, and shuffle. I see you have a branch for compression. Maybe we can help.

jjhelmus (Owner) commented Sep 7, 2016

@synaptic Great to hear that pyfive works on Azure blobs. The file-like object does need a tell method although this requirement could be removed without too much effort as it is only used in one place.

I'm not particularly surprised to hear about the less than optimal performance as efficiency has not been a priority. When reading data from a Dataset, pyfive currently loads all chunks into memory before slicing the requested data. This behavior is very inefficient when only a small region of the data is requested which could be extracted from a small number of chunks. Improving this is on the pyfive roadmap.
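
The core of the fix is to compute which chunks a slice overlaps and read only those, along these lines (an illustrative 1-D sketch, not pyfive's code):

def chunks_for_slice(start, stop, chunk_size):
    # Chunk indices touched by the 1-D slice [start, stop).
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return range(first, last + 1)

# Ten elements from a million-element dataset in 1000-element chunks
# touch at most two chunks instead of all 1000.
list(chunks_for_slice(999995, 1000005, 1000))  # -> [999, 1000]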

The compression branch does have the start of the logic needed to support data compression and shuffling. I hope to clean this up soon and get it merged. Chunks should already be supported.
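
For reference, HDF5 applies shuffle then deflate on write, so a reader reverses the order; a sketch of the decode step (not the branch's actual code):

import zlib
import numpy as np

def decode_chunk(raw, itemsize):
    # Reverse the write-time filters: inflate first, then un-shuffle.
    data = zlib.decompress(raw)
    # Shuffle stores all first bytes of every element, then all second
    # bytes, and so on; reshape + transpose restores element order.
    planes = np.frombuffer(data, dtype=np.uint8)
    return planes.reshape(itemsize, len(data) // itemsize).T.tobytes()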
