My experience with Pangeo thus far #152

Closed

rsignell-usgs opened this issue Mar 12, 2018 · 6 comments
Comments

rsignell-usgs (Member) commented Mar 12, 2018

@mrocklin asked me to comment on my experience with Pangeo thus far, so here goes...

Here's why I like using Pangeo:

It's a great environment for exploring scalable, data-proximate analysis and cloud-based storage issues.

Most of our modelers run simulations remotely on HPC and then drag the data back to their desktops for analysis. The researcher two doors down from me is simulating tides in an estuary, and so far has produced 90 runs. Each run is 150GB and takes 8 hours to download over the USGS secure network. We clearly need remote analysis capabilities!

pangeo.pydata.org provides a great environment in which to explore new workflows using Dask clusters to perform scalable analysis next to the data. I love that I can jump on there with my GitHub account and everything is set up and ready to go, while we continue to work on getting Dask and JupyterHub going with Kubernetes on XSEDE Jetstream. It's also been great for exploring different data storage approaches like Zarr and HSDS.
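
For example, reading a cloud-hosted Zarr dataset looks roughly like the minimal sketch below; the bucket path is a placeholder, not an actual dataset name:

```python
# Minimal sketch: lazily open a Zarr store on GCS with xarray.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token='anon')    # anonymous read, if the bucket allows it
store = gcsfs.mapping.GCSMap('example-bucket/example.zarr', gcs=fs)
ds = xr.open_zarr(store)                  # reads metadata only; data loads on demand
print(ds)
```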

Another useful feature is that pangeo.pydata.org is on a fat pipe, so pulling data from remote services is speedy. My colleague @ocefpaf in Brazil was trying to run a notebook that downloads a fair amount of data, and it wasn't working well with the bandwidth he had. He moved the notebook to pangeo.pydata.org and got it going in short order. Here's a snapshot:
[screenshot, 2018-03-06]

In addition to the cloud instance and cloud storage issues, the Pangeo guide to data-proximate, scalable analysis on HPC was fantastic. The description of setting up the Python environment, submitting a Dask cluster job, connecting with Jupyter, and then accessing the client via SSH tunneling from a local browser worked perfectly on the USGS HPC. The only change needed was the job control script, because we use Slurm instead of PBS. Here's a snapshot from a modified version of this notebook by @jhamman, which we used to compute the maximum wave height at each grid cell over the duration of a Hurricane Sandy simulation:
[screenshot, 2018-03-08]
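
For anyone curious, the pattern looks roughly like the sketch below. This is a minimal illustration using the dask-jobqueue package rather than our exact job script; the queue name, resources, file pattern, and the 'hs' variable name are all placeholders:

```python
# Minimal sketch: Dask cluster via Slurm, then a reduction over time.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import xarray as xr

# Request Dask workers as Slurm jobs (queue and resources are placeholders).
cluster = SLURMCluster(queue='normal', cores=16, memory='64GB',
                       walltime='02:00:00')
cluster.scale(4)          # submits 4 worker jobs via sbatch
client = Client(cluster)

# Open the simulation output lazily as Dask-backed arrays.
ds = xr.open_mfdataset('sandy_run_*.nc', chunks={'time': 24})

# Maximum wave height at each grid cell over the whole simulation.
hmax = ds['hs'].max(dim='time').compute()
```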

I don't need to do system administration. I like that someone else is taking care of keeping the components of the framework up-to-date, fixing bugs, etc. I can just be a science tester/user.

Pangeo is a great community. Through it I've found out about things I didn't know about, like Zarr and daskernetes, and met researchers with similar big-data issues.

mrocklin (Member) commented

This is a great report, thanks @rsignell-usgs. Now can I ask: what don't you like about this system?

If, say, a well-funded government agency were to come by, where would you ask them to devote their resources?

rsignell-usgs (Member, Author) commented

> Now can I ask: what don't you like about this system?

These are not really things I don't like, but things I haven't figured out yet:

  • How to write my own big data to Zarr so it can be accessed effectively from pangeo.pydata.org. I gather I can't just use Ryan's code because I don't have access to the project='pangeo-181919' Google Cloud project, right? I have a 2.4GB dataset I'd like to experiment with, so what is the suggested procedure for folks like me? (See the sketch after this list.)

  • Although I know how to get custom kernels to appear in Jupyter, I'm not sure how to customize the root environment (e.g. to add the "gist-it" extension).
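
To illustrate the first point, here's roughly what I imagine the write side looks like; a minimal sketch assuming gcsfs and a bucket I'd have write access to (the filenames, project, and bucket names are placeholders):

```python
# Minimal sketch: write a local dataset to a Zarr store on GCS.
import gcsfs
import xarray as xr

ds = xr.open_dataset('my_model_output.nc')      # the 2.4GB dataset (placeholder filename)
ds = ds.chunk({'time': 100})                    # chunk for parallel, cloud-friendly access

fs = gcsfs.GCSFileSystem(project='my-project')  # a GCP project I can write to
store = gcsfs.mapping.GCSMap('my-bucket/my_dataset.zarr', gcs=fs)
ds.to_zarr(store)
```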

I'm also wondering what it would take for folks to use this for their day-to-day compute environment. Certainly some way to track use and have folks pay, right? Perhaps with some subscription plan?

It would also be great to see additional language interfaces for storage solutions like Zarr and HSDS, so that they might have a chance as de facto standards for Cloud-optimized multidimensional data.

Perhaps https://github.com/constantinpape/z5 is a start?

> If, say, a well-funded government agency were to come by, where would you ask them to devote their resources?

I'm not sure what the priorities of DOD would be. 😸

jacobtomlinson (Member) commented

I'm particularly interested in some of these questions about how to allocate cost to users. Things like:

  • Do we artificially cap the amount of resource a user can get?
  • Do we provide credits that people "use up"?
  • Do we track individual usage and then carve the cloud provider bill up between teams?

I'm currently looking into much of the customization stuff. We are finding JupyterLab rather inflexible: if we want to allow users to add custom extensions, it basically requires them to reinstall JupyterLab.

rabernat (Member) commented

@rsignell-usgs -- I'm happy to give you full access to our GCS allocation. Just give me your Google username.


stale bot commented Jun 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jun 25, 2018

stale bot commented Jul 2, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed on Jul 2, 2018