HSDS: another way to access HDF5/NETCDF4 "files" from S3 #75

Closed
rsignell-usgs opened this issue Jan 15, 2018 · 3 comments


rsignell-usgs commented Jan 15, 2018

I was able to run a demonstration notebook accessing data from HSDS (the Highly Scalable Data Service), which, like zarr, stores HDF5/NetCDF4 datasets as chunks, with each chunk stored as a separate S3 object.

In the sample notebook here, I'm accessing data on an HSDS instance at XSEDE from Pangeo running on Google Cloud, yet the access times are comparable to running the same notebook on XSEDE itself. I assume Google Cloud and XSEDE are connected via Internet2.

(Screenshots of the notebook timings, captured 2018-01-15.)
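
For context, the core access pattern in the notebook looks roughly like this. This is a sketch, not the notebook's actual code: the endpoint, domain path, variable name, and credentials are placeholders (the real XSEDE values come with the username/password mentioned below).

import h5pyd  # h5py-compatible client that talks to an HSDS server

# Placeholder endpoint, domain, and credentials -- not the real XSEDE values
f = h5pyd.File("/home/example/ocean_model.nc", "r",
               endpoint="https://hsds.example.org",
               username="myuser", password="mypass")

# h5pyd mirrors the h5py API, so datasets, attributes, and slicing work the
# same way, but each read fetches only the chunks it needs from S3
dset = f["salinity"]
print(dset.shape, dset.dtype)
subset = dset[0, :, :]  # fetches just the chunks covering this slice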

To run this notebook on pangeo as I did, you would need to:

  • get a username/password to access the HSDS XSEDE endpoint from @jreadey
  • install the h5pyd custom conda environment (see below)
  • add nb_conda_kernels to the root environment so that the custom kernel list appears.

Here's the procedure I used for creating the h5pyd environment:

# Create the environment from the spec file (conda env create does not
# prompt, so it does not take -y)
conda env create -f h5pyd_env.yml
source activate h5pyd
conda install -y xarray
# Swap the released h5netcdf for the h5pyd-aware fork
conda remove -y h5netcdf
pip install --no-deps --upgrade git+https://github.com/ajelenak-thg/h5netcdf.git@h5pyd
# Reinstall xarray without dependencies so conda does not pull the stock h5netcdf back in
conda install -y --no-deps xarray
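
With that environment active, the notebook side is plain xarray. A minimal sketch, assuming the h5pyd branch of the h5netcdf fork routes file opens through h5pyd rather than h5py (the domain path and variable name are placeholders):

import xarray as xr

# Open an HSDS domain as if it were a local netCDF4 file; the patched
# h5netcdf engine talks to the HSDS endpoint instead of the local filesystem
ds = xr.open_dataset("/home/example/model_output.nc", engine="h5netcdf")
print(ds)

# Lazy selection: only the chunks needed for this slice are fetched
sst = ds["temperature"].isel(time=0)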

For more info on HSDS, check out John Readey's SciPy 2017 talk on HSDS.

mrocklin commented

I'm very glad to see this.

Some things that would be interesting to try if anyone has time:

  1. Try XArray + Dask locally on the HSDS data to verify that it can be accessed concurrently from multiple threads (a sketch follows this list)
  2. Try XArray + Dask.distributed locally on the HSDS data to verify that the h5pyd objects can survive being serialized
  3. Try everything on a distributed cluster using KubeCluster and then look at the performance of scalable computing
  4. Try this all again on a cluster running in AWS alongside S3, where we would presumably expect 100-200 MB/s of network bandwidth from each node.
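
A minimal sketch of items 1 and 2, assuming the same xarray + patched-h5netcdf stack as above (domain path, variable name, and chunk sizes are placeholders):

import xarray as xr
from dask.distributed import Client

# Item 2: a local dask.distributed cluster; h5pyd-backed objects must
# survive being pickled as tasks move between worker processes
client = Client()

# Item 1: opening with chunks gives a dask-backed dataset, so reads are
# issued concurrently across chunks
ds = xr.open_dataset("/home/example/model_output.nc",
                     engine="h5netcdf",
                     chunks={"time": 10})

# A reduction that forces many concurrent chunk reads from HSDS
mean = ds["temperature"].mean(dim="time").compute()
print(mean)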


stale bot commented Jun 15, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 15, 2018

stale bot commented Jun 22, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed Jun 22, 2018