
Hero workload brainstorm #5

Open
jrbourbeau opened this issue Mar 16, 2021 · 19 comments

Comments

@jrbourbeau

Yesterday in the group meeting it was mentioned that we should come up with a so-called "hero workload" that both

  • Produces a useful result that folks will find interesting, and
  • Really shows off the power of what we can do with the PyData stack today

I'm opening this issue as a place for us to brainstorm about what such a workload might look like. @rabernat @jbusecke do you have any initial thoughts about huge datasets and/or computation-heavy workloads the geoscience community would be excited to see us work on?

@jrbourbeau
Author

@rabernat @jbusecke there's the Dask Distributed Summit coming up this May. From the summit's landing page, it's a virtual conference "where users, contributors, and newcomers can share experiences to learn from one another and grow together".

This would be a great space for Pangeo folks to share their experience using Dask. For example, we could share the results of a hero workload, or if we're not quite there yet, discuss some of the successes and pain points that we've encountered so far as part of such a workload.

The CFP closes in a couple of days, and the talks themselves (30 minutes each) run May 19-21. Do either of you, or other Pangeo folks, have an interest in submitting a talk?

@rabernat
Contributor

Hi @jrbourbeau! I'm a bit confused about the Dask Summit. There are two "workshop proposals" that I know about: Pangeo and Xarray. But I haven't seen an actual call for talk proposals. Did we miss that deadline?

@rabernat
Contributor

Regarding the hero workload, I have an idea coming together. The core of the method is described in this recent paper by @dhruvbalwada. We want to compute the vorticity / strain / divergence joint distribution from ocean surface velocities. However, we want to do it using the dataset described in https://medium.com/pangeo/petabytes-of-ocean-data-part-1-nasa-ecco-data-portal-81e3c5e077be, which is mirrored on Google Cloud at https://catalog.pangeo.io/browse/master/ocean/LLC4320/. We could also introduce some scale dependence into the calculation by filtering with the new software package coming together at https://github.com/ocean-eddy-cpt/gcm-filters/.
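
For concreteness, here is a minimal NumPy sketch of the three diagnostics on a uniform grid. The names `u`, `v`, `dx`, `dy` are placeholders; the real workload would compute these with xgcm on the LLC4320 C-grid geometry rather than with plain centered differences.

```python
import numpy as np

def vort_strain_div(u, v, dx, dy):
    """Relative vorticity, strain magnitude, and divergence from 2D velocities.

    Uniform-grid sketch only; the LLC4320 version must respect the
    model's curvilinear C-grid (hence xgcm).
    """
    # np.gradient returns derivatives along axis 0 (y) then axis 1 (x).
    dudy, dudx = np.gradient(u, dy, dx)
    dvdy, dvdx = np.gradient(v, dy, dx)
    vorticity = dvdx - dudy                                    # zeta
    divergence = dudx + dvdy                                   # delta
    strain = np.sqrt((dudx - dvdy) ** 2 + (dvdx + dudy) ** 2)  # sigma
    return vorticity, strain, divergence
```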

Our default way of implementing these calculations (using xgcm) would not be able to leverage GPUs. However, we could easily hand-code some of the routines using CuPy.
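
As a rough illustration of the CuPy option (a hedged sketch, not production code), the vorticity kernel above ports almost verbatim, since cupy.gradient mirrors numpy.gradient:

```python
import cupy as cp

def vorticity_gpu(u, v, dx, dy):
    # Move the arrays to the GPU; all subsequent math runs on the device.
    u, v = cp.asarray(u), cp.asarray(v)
    dudy, _ = cp.gradient(u, dy, dx)
    _, dvdx = cp.gradient(v, dy, dx)
    return dvdx - dudy  # stays on the GPU until explicitly copied back
```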

One big challenge is that the data live in Google Cloud US-Central region. Would it be possible to get a Coiled GPU dask cluster going there?

@jrbourbeau
Author

In addition to workshops, there are also talks and tutorials at the Dask Summit. Apologies, I probably should have posted something about the summit here earlier. Fortunately, I submitted an aspirational talk about processing a petabyte of data with Dask in the cloud. The hero workload sounds like it would be a good fit for that talk : )

Would it be possible to get a Coiled GPU dask cluster going there?

We're in the process of rolling out GCP support in US-Central at the moment -- I can let you know when things are up and running
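
Once that region is live, spinning up a cluster might look roughly like the sketch below. The parameter names (`worker_gpu`, the `backend_options` region key) are assumptions for illustration, not confirmed Coiled API; the actual GCP/GPU options are whatever the Coiled docs specify.

```python
import coiled
from dask.distributed import Client

# Hypothetical configuration sketch; kwarg names are assumptions.
cluster = coiled.Cluster(
    n_workers=10,
    worker_gpu=1,
    backend_options={"region": "us-central1"},
)
client = Client(cluster)
```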

Also cc'ing @gjoseph92 for visibility

@rabernat
Contributor

Following up on the conversation at today's meeting. The idea is to do the following

  • Split the domain into roughly 5x5-degree blocks. (Possibly use rechunker to write a copy of the data with an optimized chunk structure. Or try without rechunking to disk and see what happens...)
  • Calculate vorticity, strain, and divergence from the surface velocity fields using xgcm. (This will provide a chance to optimize these operations in terms of their dask graph.)
  • Calculate the 3D vorticity / strain / divergence histogram for each box (see the sketch after this list).
  • Persist the results for further analysis.
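
A hedged sketch of the per-box histogram step, for one block's worth of data. The bin count, bin edges, and NaN handling for land are placeholder choices:

```python
import numpy as np

def block_histogram(vort, strain, div, bins=64, vmax=5.0):
    """3D joint histogram of (vorticity, strain, divergence) for one box."""
    sample = np.stack([vort.ravel(), strain.ravel(), div.ravel()], axis=-1)
    sample = sample[~np.isnan(sample).any(axis=1)]  # drop land / missing points
    edges = [
        np.linspace(-vmax, vmax, bins + 1),  # vorticity (signed)
        np.linspace(0.0, vmax, bins + 1),    # strain (non-negative)
        np.linspace(-vmax, vmax, bins + 1),  # divergence (signed)
    ]
    hist, _ = np.histogramdd(sample, bins=edges)
    return hist  # shape (bins, bins, bins); persist e.g. to zarr
```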

We could also consider an interactive version of this, where a user can draw a bounding box on a map and then have the histogram generated on the fly! 😱 🤩
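
A sketch of what that interactive piece could look like with ipyleaflet (one possible widget library; `compute_histogram` is a hypothetical hook into the workload above):

```python
from ipyleaflet import Map, DrawControl

m = Map(center=(0, 0), zoom=2)
draw = DrawControl(rectangle={"shapeOptions": {"color": "#0000FF"}})

def handle_draw(control, action, geo_json):
    # geo_json holds the drawn rectangle as a GeoJSON Polygon.
    lons, lats = zip(*geo_json["geometry"]["coordinates"][0])
    print("Selected box:", min(lons), max(lons), min(lats), max(lats))
    # compute_histogram(min(lons), max(lons), min(lats), max(lats))  # hypothetical

draw.on_draw(handle_draw)
m.add_control(draw)
m  # display the map in a notebook
```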

I propose that anyone interested (cc @ocean-transport/members + @dhruvbalwada + @jrbourbeau) have a meeting dedicated to this topic.

I am available this Friday 12-1pm and 2:45-3:30pm (EST). If those times don't work, we can look at next week.

@dhruvbalwada
Member

I can make both of those times this week, with a slight preference for the second slot.

@jbusecke
Collaborator

I am very interested and can do both slots.

@jbusecke
Collaborator

I combed through the xgcm issue tracker and I think it might be helpful to fast-track this issue.

My thinking was that there might be opportunities to optimize these higher-level operators without calling the lower-level ones (diff/interp) in a chained manner. But for that to work it might be helpful to have this functionality implemented and properly tested, so that we have a "ground truth" before messing with the internals. Just a thought; we can chat about it during the meeting.
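
To make the idea concrete, here is a toy contrast under my assumptions: chained diff/interp calls each materialize an intermediate array, whereas a fused stencil computes the diagnostic in one pass. This is an illustrative kernel, not xgcm's actual implementation:

```python
import numpy as np

def fused_vorticity(u, v, dx, dy):
    # One fused centered-difference stencil over the interior points;
    # no intermediate diff/interp arrays are materialized.
    dvdx = (v[1:-1, 2:] - v[1:-1, :-2]) / (2 * dx)
    dudy = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2 * dy)
    return dvdx - dudy

# With dask, a kernel like this could be applied per chunk via
# dask.array.map_overlap with a one-point halo.
```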

@hscannell
Member

I would like to attend and both time slots work for me as well. I will read Dhruv's paper before then.

@TomNicholas

I am also interested, free both times, but need to read Dhruv's paper to understand!

@jrbourbeau
Author

Count me in too -- both time slots this Friday work for me

@rabernat
Contributor

OK, invite sent for 2:45.

@dhruvbalwada - perhaps you could walk us through the code you used in your paper? No need to prepare anything fancy though...

@cspencerjones

Seems like there might be an overlap with what Qiyu is already doing? At least maybe worth inviting him...

@dhruvbalwada
Member

dhruvbalwada commented Apr 15, 2021

Good idea. He just started looking at the vorticity-strain histograms in LLC4320 to compare against idealized simulations with seasonality.

@qyxiao are you available tomorrow at 2:45pm eastern time?

@qyxiao

qyxiao commented Apr 15, 2021 via email

@dhruvbalwada
Member

I have shared the meeting invite with Qiyu, so he can join in as well.

@rabernat
Contributor

Seems like there might be an overlap with what Qiyu is already doing? At least maybe worth inviting him...

Thanks so much @cspencerjones for making this connection. It's actually perfect, because we are looking for a scientific lead for this project, and it sounds like @qyxiao is a natural fit. We just have to make sure all our goals are aligned.

@rabernat
Contributor

rabernat commented May 5, 2021

So in #12, I provided a simple example of how xgcm's operations can be optimized. (This was my TODO item from our last meeting.) I'm trying to figure out the next step: how to actually get such optimizations to work via xgcm. I see a few different paths we could take:

  1. Modify xgcm to use some kind of manual inlining / compiling of consecutive operations
  2. Modify xgcm to use a "kernel" based approach, like gcm-filters, plus map_blocks
  3. Don't change xgcm (much) and instead rely on dask to make these optimizations

I'd be curious to hear the Coiled folks' views on the feasibility of option 3, using my example notebook as a reference problem.
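
To make option 3 concrete, here is a toy illustration of the mechanism we would be relying on: dask's blockwise fusion can already collapse chained elementwise layers into one task per chunk at graph-optimization time. Whether it does the same for xgcm's actual padded/concatenated graphs is exactly the open question:

```python
import dask
import dask.array as da

x = da.random.random((8, 8), chunks=(4, 4))
y = ((x + 1) * 2) - 3  # three chained blockwise layers on top of x

# Default array optimization fuses the blockwise layers together.
(optimized,) = dask.optimize(y)
print(len(dict(y.__dask_graph__())), "tasks before,",
      len(dict(optimized.__dask_graph__())), "after")
```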

@rabernat
Contributor

rabernat commented May 6, 2021

FYI, I also just shared an example of how to do an "interp with land" operation using the kernel + apply_ufunc approach here: xgcm/xgcm#324
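
For anyone who hasn't seen that pattern, here is a generic sketch (my own toy example, not the code in xgcm/xgcm#324): a NumPy stencil that weights by a wet mask so land contributes nothing, applied lazily over dask chunks with apply_ufunc:

```python
import numpy as np
import xarray as xr

def interp_with_land(field, wet):
    # Two-point average along x, weighted by the wet mask, so land-only
    # pairs come out as NaN. Assumes land is masked via `wet`, not NaNs
    # in `field`. np.roll wraps within each chunk, so a real chunked
    # version needs a halo (e.g. dask's map_overlap).
    num = field * wet + np.roll(field * wet, -1, axis=-1)
    den = wet + np.roll(wet, -1, axis=-1)
    return np.where(den > 0, num / den, np.nan)

field = xr.DataArray(np.random.rand(4, 8), dims=("y", "x"))
wet = xr.DataArray((np.random.rand(4, 8) > 0.2).astype(float), dims=("y", "x"))

result = xr.apply_ufunc(
    interp_with_land, field.chunk({"x": 4}), wet.chunk({"x": 4}),
    dask="parallelized", output_dtypes=[float],
)
```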
