HSDS: another way to access HDF5/NETCDF4 "files" from S3 #75

Closed
rsignell-usgs opened this issue Jan 15, 2018 · 3 comments


rsignell-usgs commented Jan 15, 2018

I was able to run a demonstration notebook accessing data from HSDS (the Highly Scalable Data Service), which, like zarr, stores HDF5/NetCDF4 datasets as chunks, with each chunk stored as a separate S3 object.

In the sample notebook here, I'm accessing data on an HSDS instance at XSEDE from Pangeo running on Google Cloud, yet the access times are comparable to running the same notebook on XSEDE itself. I assume Google Cloud and XSEDE are connected via Internet2.

(Screenshots of the notebook timings, captured 2018-01-15.)
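
For context, the core access pattern in the notebook looks roughly like this. This is a sketch, not the notebook's actual code: the endpoint, domain path, variable name, and credentials are placeholders (the real XSEDE values come with the username/password mentioned below).

import h5pyd  # h5py-compatible client that talks to an HSDS server

# Placeholder endpoint, domain, and credentials -- not the real XSEDE values
f = h5pyd.File("/home/example/ocean_model.nc", "r",
               endpoint="https://hsds.example.org",
               username="myuser", password="mypass")

# h5pyd mirrors the h5py API, so datasets, attributes, and slicing work the
# same way, but each read fetches only the chunks it needs from S3
dset = f["salinity"]
print(dset.shape, dset.dtype)
subset = dset[0, :, :]  # fetches just the chunks covering this slice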

To run this notebook on pangeo as I did, you would need to:

  • get a username/password to access the HSDS XSEDE endpoint from @jreadey
  • install the h5pyd custom conda environment (see below)
  • add nb_conda_kernels to the root environment so that the custom kernel list appears.

Here's the procedure I used for creating the h5pyd environment:

# Create the environment from the spec file (conda env create does not
# prompt, so it does not take -y)
conda env create -f h5pyd_env.yml
source activate h5pyd
conda install -y xarray
# Swap the released h5netcdf for the h5pyd-aware fork
conda remove -y h5netcdf
pip install --no-deps --upgrade git+https://github.com/ajelenak-thg/h5netcdf.git@h5pyd
# Reinstall xarray without dependencies so conda does not pull the stock h5netcdf back in
conda install -y --no-deps xarray
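
With that environment active, the notebook side is plain xarray. A minimal sketch, assuming the h5pyd branch of the h5netcdf fork routes file opens through h5pyd rather than h5py (the domain path and variable name are placeholders):

import xarray as xr

# Open an HSDS domain as if it were a local netCDF4 file; the patched
# h5netcdf engine talks to the HSDS endpoint instead of the local filesystem
ds = xr.open_dataset("/home/example/model_output.nc", engine="h5netcdf")
print(ds)

# Lazy selection: only the chunks needed for this slice are fetched
sst = ds["temperature"].isel(time=0)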

For more info on HSDS, check out John Readey's SciPy 2017 talk on HSDS.

mrocklin commented

I'm very glad to see this.

Some things that would be interesting to try if anyone has time:

  1. Try XArray + Dask locally on the HSDS data to verify that it can be accessed concurrently from multiple threads (a sketch follows this list)
  2. Try XArray + Dask.distributed locally on the HSDS data to verify that the h5pyd objects can survive being serialized
  3. Try everything on a distributed cluster using KubeCluster and then look at the performance of scalable computing
  4. Try this all again on a cluster running in AWS alongside S3, where we would presumably expect 100-200 MB/s of network bandwidth from each node.
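
A minimal sketch of items 1 and 2, assuming the same xarray + patched-h5netcdf stack as above (domain path, variable name, and chunk sizes are placeholders):

import xarray as xr
from dask.distributed import Client

# Item 2: a local dask.distributed cluster; h5pyd-backed objects must
# survive being pickled as tasks move between worker processes
client = Client()

# Item 1: opening with chunks gives a dask-backed dataset, so reads are
# issued concurrently across chunks
ds = xr.open_dataset("/home/example/model_output.nc",
                     engine="h5netcdf",
                     chunks={"time": 10})

# A reduction that forces many concurrent chunk reads from HSDS
mean = ds["temperature"].mean(dim="time").compute()
print(mean)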


stale bot commented Jun 15, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 15, 2018

stale bot commented Jun 22, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed Jun 22, 2018