
Means of quadratic quantities #2

Open
dcherian opened this issue Apr 13, 2023 · 6 comments

@dcherian

From pydata/xarray#6709

This example calculates (ds.anom_u ** 2).mean("time"), (ds.anom_v ** 2).mean("time"), and (ds.anom_u * ds.anom_v).mean("time") all at the same time:

import dask.array as da
import xarray as xr
from distributed import Client
from distributed.diagnostics import MemorySampler

client = Client()  # MemorySampler records memory use across the distributed cluster
ms = MemorySampler()

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.random.random((5000, 1, 987, 1920), chunks=(10, 1, -1, -1))),
        anom_v=(["time", "face", "j", "i"], da.random.random((5000, 1, 987, 1920), chunks=(10, 1, -1, -1))),
    )
)

quad = ds**2
quad["uv"] = ds.anom_u * ds.anom_v
mean = quad.mean("time")

with ms.sample():
    mean.compute()

With dask, we get not-so-great memory use. (Colors are for different values of the "worker-saturation" scheduler setting.)

[figure: cluster memory usage over time for this computation, one trace per worker-saturation value]
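For context, a minimal sketch of how that setting can be varied between runs (the values below are placeholders, not necessarily the ones used for the plot):

import dask
from distributed import Client

# "worker-saturation" controls how many extra root tasks the scheduler queues
# on each worker; smaller values reduce root-task overproduction.
with dask.config.set({"distributed.scheduler.worker-saturation": 1.1}):
    client = Client()
    with ms.sample("saturation=1.1"):
        mean.compute()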

@TomNicholas

Cubed completed this workload using only 1.5GB of RAM!

https://gist.github.com/TomNicholas/8366c917349b647d87860a20a257a3fb
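For anyone who wants to reproduce that number, a rough sketch of how the same workload can be expressed with cubed. This assumes the cubed-xarray package is installed so xarray dispatches to cubed, and the allowed_mem value is an assumption; the gist above is the authoritative version:

import cubed
import cubed.random
import xarray as xr

# Assumed memory budget per task; cubed plans the computation to stay under it.
spec = cubed.Spec(allowed_mem="2GB")

shape, chunks = (5000, 1, 987, 1920), (10, 1, 987, 1920)
ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], cubed.random.random(shape, chunks=chunks, spec=spec)),
        anom_v=(["time", "face", "j", "i"], cubed.random.random(shape, chunks=chunks, spec=spec)),
    )
)

quad = ds**2
quad["uv"] = ds.anom_u * ds.anom_v
quad.mean("time").compute()  # runs on cubed's bounded-memory executor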

@TomNicholas

I would like to try this problem with cubed using real data instead of random data. @dcherian (/anyone), if you know, can you explain a little more about the context of this issue, so that I understand if/how I might be able to use some publicly available zarr data to create a representative benchmark case that includes I/O? Something about anomalies of GCM data... 😅

@dcherian
Author

I would like to try this problem with cubed using real data instead of random data.

cc @robin-cls who opened the original xarray issue

@fjetter

fjetter commented Jun 29, 2023

FYI, I was able to track this problem down to the way dask performs the topological sort / prioritization of tasks; see dask/dask#10384
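A small sketch of one way to look at the ordering dask assigns (assuming graphviz is available); coloring the task graph by priority is what makes this kind of problem visible:

import dask

# Color each task by the priority assigned by dask.order; for the `mean`
# Dataset from the example above, poor ordering shows up as priorities that
# jump between the independent reductions instead of finishing one at a time.
dask.visualize(mean, color="order", filename="order.svg")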

This example should work trivially when either of the following is true:

  1. Only one of the arrays is calculated, e.g. mean['uv'].compute()
  2. The xarray dataset is transformed to a dask DataFrame using mean.to_dask_dataframe() (the DataFrame graph looks slightly different and is handled well by dask.order)
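A minimal sketch of those two workarounds, assuming `mean` is the Dataset from the example at the top of this issue:

# 1. Compute a single output variable; its sub-graph is ordered well.
uv_mean = mean["uv"].compute()

# 2. Go through a dask DataFrame instead; that graph is ordered more favourably.
df_mean = mean.to_dask_dataframe().compute()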

@TomNicholas

TomNicholas commented Jun 29, 2023

Only one of the arrays is calculated, e.g. mean['uv'].compute()

Anecdotally I think the performance is much better when you only compute one array, yes.

@fjetter

fjetter commented Sep 28, 2023

Just a heads up: I'm working on a fix for this in dask/dask; see dask/dask#10535

Preliminary results look very promising

[figure: cluster memory usage for runs of increasing size along the time dimension, roughly constant with the fix]

This graph shows the memory usage for a couple of runs with increasing size along the time dimension. This basically increases the number of tasks while keeping the individual chunks and the algorithm constant.

[figure: the same memory usage shown against the spilling threshold (yellow line)]
This was far away from the spilling threshold (yellow line), so the constant memory was indeed due to better scheduling, not spilling or anything like that.

I'm also looking at other workloads. If you are aware of other stuff that should be constant or near-constant in memory usage but isn't, please let me know!
