
Means of quadratic quantities #2

Open
dcherian opened this issue Apr 13, 2023 · 6 comments

@dcherian

From pydata/xarray#6709

This example calculates (ds.anom_u ** 2).mean("time"), (ds.anom_v ** 2).mean("time"), and (ds.anom_u * ds.anom_v).mean("time") all at the same time:

import dask.array as da
import xarray as xr
from distributed import Client
from distributed.diagnostics import MemorySampler

client = Client()  # MemorySampler records memory use across the distributed cluster
ms = MemorySampler()

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.random.random((5000, 1, 987, 1920), chunks=(10, 1, -1, -1))),
        anom_v=(["time", "face", "j", "i"], da.random.random((5000, 1, 987, 1920), chunks=(10, 1, -1, -1))),
    )
)

quad = ds**2
quad["uv"] = ds.anom_u * ds.anom_v
mean = quad.mean("time")

with ms.sample():
    mean.compute()

With dask, we get not-so-great memory use. (Colors are for different values of the "worker-saturation" scheduler setting.)

[figure: cluster memory usage over time for this computation, one trace per worker-saturation value]
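For context, a minimal sketch of how that setting can be varied between runs (the values below are placeholders, not necessarily the ones used for the plot):

import dask
from distributed import Client

# "worker-saturation" controls how many extra root tasks the scheduler queues
# on each worker; smaller values reduce root-task overproduction.
with dask.config.set({"distributed.scheduler.worker-saturation": 1.1}):
    client = Client()
    with ms.sample("saturation=1.1"):
        mean.compute()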

@TomNicholas

Cubed completed this workload using only 1.5GB of RAM!

https://gist.github.com/TomNicholas/8366c917349b647d87860a20a257a3fb
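For anyone who wants to reproduce that number, a rough sketch of how the same workload can be expressed with cubed. This assumes the cubed-xarray package is installed so xarray dispatches to cubed, and the allowed_mem value is an assumption; the gist above is the authoritative version:

import cubed
import cubed.random
import xarray as xr

# Assumed memory budget per task; cubed plans the computation to stay under it.
spec = cubed.Spec(allowed_mem="2GB")

shape, chunks = (5000, 1, 987, 1920), (10, 1, 987, 1920)
ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], cubed.random.random(shape, chunks=chunks, spec=spec)),
        anom_v=(["time", "face", "j", "i"], cubed.random.random(shape, chunks=chunks, spec=spec)),
    )
)

quad = ds**2
quad["uv"] = ds.anom_u * ds.anom_v
quad.mean("time").compute()  # runs on cubed's bounded-memory executor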

@TomNicholas

I would like to try this problem with cubed using real data instead of random data. @dcherian (/anyone), if you know, can you explain a little more about the context of this issue, so that I understand if/how I might be able to use some publicly available zarr data to create a representative benchmark case that includes I/O? Something about anomalies of GCM data... 😅

@dcherian
Author

I would like to try this problem with cubed using real data instead of random data.

cc @robin-cls who opened the original xarray issue

@fjetter

fjetter commented Jun 29, 2023

FYI, I was able to track this problem down to the way dask performs the topological sort / prioritization of tasks; see dask/dask#10384
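A small sketch of one way to look at the ordering dask assigns (assuming graphviz is available); coloring the task graph by priority is what makes this kind of problem visible:

import dask

# Color each task by the priority assigned by dask.order; for the `mean`
# Dataset from the example above, poor ordering shows up as priorities that
# jump between the independent reductions instead of finishing one at a time.
dask.visualize(mean, color="order", filename="order.svg")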

This example should work trivially when either of the following is true:

  1. Only one of the arrays is calculated, e.g. mean['uv'].compute()
  2. The xarray dataset is transformed to a dask DataFrame using mean.to_dask_dataframe() (the DataFrame graph looks slightly different and is handled well by dask.order)
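A minimal sketch of those two workarounds, assuming `mean` is the Dataset from the example at the top of this issue:

# 1. Compute a single output variable; its sub-graph is ordered well.
uv_mean = mean["uv"].compute()

# 2. Go through a dask DataFrame instead; that graph is ordered more favourably.
df_mean = mean.to_dask_dataframe().compute()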

@TomNicholas

TomNicholas commented Jun 29, 2023

Only one of the arrays is calculated, e.g. mean['uv'].compute()

Anecdotally I think the performance is much better when you only compute one array, yes.

@fjetter

fjetter commented Sep 28, 2023

Just a heads up: I'm working on a fix for this in dask/dask; see dask/dask#10535

Preliminary results look very promising

[figure: cluster memory usage for runs of increasing size along the time dimension, roughly constant with the fix]

This graph shows the memory usage for a couple of runs with increasing size along the time dimension. This basically increases the number of tasks while keeping the individual chunks and the algorithm constant.

[figure: the same memory usage shown against the spilling threshold (yellow line)]
This was far away from the spilling threshold (yellow line), so the constant memory was indeed due to better scheduling, not spilling or anything like that.

I'm also looking at other workloads. If you are aware of other stuff that should be constant or near-constant in memory usage but isn't, please let me know!
