My experience with Pangeo thus far #152

Closed

rsignell-usgs opened this issue Mar 12, 2018 · 6 comments
Comments

rsignell-usgs (Member) commented Mar 12, 2018

@mrocklin asked me to comment on my experience with Pangeo thus far, so here goes...

Here's why I like using Pangeo:

It's a great environment for exploring scalable, data-proximate analysis and cloud-based storage issues.

Most of our modelers run simulations remotely on HPC and then drag the data back to their desktops for analysis. The researcher two doors down from me is simulating tides in an estuary, and so far has produced 90 runs. Each run is 150GB and takes 8 hours to download over the USGS secure network. We clearly need remote analysis capabilities!

pangeo.pydata.org provides a great environment in which to explore new workflows using Dask clusters to perform scalable analysis next to the data. I love that I can jump on there with my GitHub account and everything is set up and ready to go, while we continue to work on getting Dask and JupyterHub going with Kubernetes on XSEDE Jetstream. It's also been great for exploring different data storage approaches like Zarr and HSDS.
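
For example, reading a cloud-hosted Zarr dataset looks roughly like the minimal sketch below; the bucket path is a placeholder, not an actual dataset name:

```python
# Minimal sketch: lazily open a Zarr store on GCS with xarray.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token='anon')    # anonymous read, if the bucket allows it
store = gcsfs.mapping.GCSMap('example-bucket/example.zarr', gcs=fs)
ds = xr.open_zarr(store)                  # reads metadata only; data loads on demand
print(ds)
```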

Another useful feature is that pangeo.pydata.org is on a fat pipe, so pulling data from remote services is speedy. My colleague @ocefpaf in Brazil was trying to run a notebook that downloads a fair amount of data, and it wasn't working well with the bandwidth he had. He moved the notebook to pangeo.pydata.org and got it going in short order. Here's a snapshot:
[screenshot, 2018-03-06]

In addition to the cloud instance and cloud storage issues, the Pangeo guide to data-proximate, scalable analysis on HPC was fantastic. The description of setting up the Python environment, submitting a Dask cluster job, connecting with Jupyter, and then accessing the client via SSH tunneling from a local browser worked perfectly on the USGS HPC. The only change needed was the job control script, because we use Slurm instead of PBS. Here's a snapshot from a modified version of this notebook by @jhamman, which we used to compute the maximum wave height at each grid cell over the duration of a Hurricane Sandy simulation:
[screenshot, 2018-03-08]
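
For anyone curious, the pattern looks roughly like the sketch below. This is a minimal illustration using the dask-jobqueue package rather than our exact job script; the queue name, resources, file pattern, and the 'hs' variable name are all placeholders:

```python
# Minimal sketch: Dask cluster via Slurm, then a reduction over time.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import xarray as xr

# Request Dask workers as Slurm jobs (queue and resources are placeholders).
cluster = SLURMCluster(queue='normal', cores=16, memory='64GB',
                       walltime='02:00:00')
cluster.scale(4)          # submits 4 worker jobs via sbatch
client = Client(cluster)

# Open the simulation output lazily as Dask-backed arrays.
ds = xr.open_mfdataset('sandy_run_*.nc', chunks={'time': 24})

# Maximum wave height at each grid cell over the whole simulation.
hmax = ds['hs'].max(dim='time').compute()
```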

I don't need to do system administration. I like that someone else is taking care of keeping the components of the framework up-to-date, fixing bugs, etc. I can just be a science tester/user.

Pangeo is a great community. Through it I've found out about things I didn't know about, like Zarr and daskernetes, and met researchers with similar big-data issues.

mrocklin (Member) commented

This is a great report, thanks @rsignell-usgs. Now can I ask: what don't you like about this system?

If, say, a well-funded government agency were to come by, where would you ask them to devote their resources?

rsignell-usgs (Member, Author) commented

> Now can I ask: what don't you like about this system?

These are not really things I don't like, but things I haven't figured out yet:

  • How to write my own big data to Zarr so it can be accessed effectively from pangeo.pydata.org. I gather I can't just use Ryan's code because I don't have access to the project='pangeo-181919' Google Cloud project, right? I have a 2.4GB dataset I'd like to experiment with, so what is the suggested procedure for folks like me? (See the sketch after this list.)

  • Although I know how to get custom kernels to appear in Jupyter, I'm not sure how to customize the root environment (e.g. to add the "gist-it" extension).
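
To illustrate the first point, here's roughly what I imagine the write side looks like; a minimal sketch assuming gcsfs and a bucket I'd have write access to (the filenames, project, and bucket names are placeholders):

```python
# Minimal sketch: write a local dataset to a Zarr store on GCS.
import gcsfs
import xarray as xr

ds = xr.open_dataset('my_model_output.nc')      # the 2.4GB dataset (placeholder filename)
ds = ds.chunk({'time': 100})                    # chunk for parallel, cloud-friendly access

fs = gcsfs.GCSFileSystem(project='my-project')  # a GCP project I can write to
store = gcsfs.mapping.GCSMap('my-bucket/my_dataset.zarr', gcs=fs)
ds.to_zarr(store)
```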

I'm also wondering what it would take for folks to use this for their day-to-day compute environment. Certainly some way to track use and have folks pay, right? Perhaps with some subscription plan?

It would also be great to see additional language interfaces for storage solutions like Zarr and HSDS, so that they might have a chance as de facto standards for Cloud-optimized multidimensional data.

Perhaps https://github.com/constantinpape/z5 is a start?

> If, say, a well-funded government agency were to come by, where would you ask them to devote their resources?

I'm not sure what the priorities of DOD would be. 😸

jacobtomlinson (Member) commented

I'm particularly interested in some of these questions about how to allocate cost to users. Things like:

  • Do we artificially cap the amount of resource a user can get?
  • Do we provide credits that people "use up"?
  • Do we track individual usage and then carve the cloud provider bill up between teams?

I'm currently looking into much of the customization stuff. We are finding JupyterLab rather inflexible: if we want to allow users to add custom extensions, it basically requires them to reinstall JupyterLab.

rabernat (Member) commented

@rsignell-usgs -- I'm happy to give you full access to our GCS allocation. Just give me your Google username.


stale bot commented Jun 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jun 25, 2018

stale bot commented Jul 2, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed on Jul 2, 2018