Performance issue; preprocessing produces ungodly amount of dask tasks #58
Seems like this is related: pydata/xarray#4428. Should I deactivate the auto slicing for the whole [...]? I am also unclear why these chunks are split in the first place. They are smaller than [...]
Working on it. Do you think that is better posted over at xarray, or here?
I think I got a simpler example with the same issue:

```python
import xarray as xr
import numpy as np
import dask

dask.config.set(**{'array.slicing.split_large_chunks': True,
                   'array.chunk-size': '24 MiB'})

da = xr.DataArray(
    np.random.rand(10), dims=['x'],
    coords={'x': [3, 4, 5, 6, 7, 9, 8, 0, 2, 1]}
).chunk({'x': -1}).expand_dims(y=1000, time=2000).chunk({'y': -1, 'time': 200})
da
```

Note that I set the array chunk size to be larger than the chunks of the array. If I now sort it with `da.sortby('x')`, the number of tasks roughly triples. If I deactivate the auto slicing, I get the expected result (plus a few extra tasks, but not 3x as many):

```python
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    da_sorted = da.sortby('x')
da_sorted
```

I don't know enough about these internals, but that seems unintuitive to me.

EDIT: Here are the versions installed: [...]
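For reference, the context-manager setting above can also be made persistent via dask's YAML configuration files. This is a sketch mirroring the in-code key (dask treats `-` and `_` in config keys interchangeably):

```yaml
array:
  slicing:
    split_large_chunks: false
```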
I just tried it and it seems I need to go to [...]
I have changed the array size to [...]
Another option to circumvent this: use [...]
You mean with [...]?
Maybe. I usually don't use `aggregate`. I just meant to use the functions from your package directly on xarray objects, after intake-esm.
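A minimal sketch of that workflow, assuming an intake-esm catalog `cat` and the package's `combined_preprocessing` function (the helper name `apply_to_dict` is made up for illustration):

```python
def apply_to_dict(ddict, func):
    """Apply a preprocessing function to every dataset in a dict of datasets.

    Intended use (objects hypothetical; requires intake-esm and the package):
        ddict = cat.to_dataset_dict(aggregate=False)
        ddict = apply_to_dict(ddict, combined_preprocessing)
    """
    return {key: func(ds) for key, ds in ddict.items()}

# Demonstration with stand-in dicts in place of real xarray Datasets:
dummy = {"model_a": {"var": 1}, "model_b": {"var": 2}}
out = apply_to_dict(dummy, lambda ds: {**ds, "preprocessed": True})
```

The point of the design is that preprocessing becomes an ordinary function applied after loading, so it works the same whether or not the data came through intake-esm.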
I am experimenting with this and will report back. This is a good suggestion, and maybe it should become our default recommendation, so that people are not bound to intake-esm but can still use it if desired.
The functions work as expected on datasets. When used on extracted DataArrays, the missing attrs may cause some functions to fail.
I have updated the recommended workflow to use [...]
I just discovered a concerning behavior of `combined_preprocessing`, which seems to create a lot more dask tasks for each dataset when I enable the new automatic slicing for large arrays. Consider this example: [...]
Now let's load this single model into a dictionary, with and without using `preprocess`: [...]

The task count increased from ~30k to more than 9 million! This seems to hit the limit of what dask can handle.
I dug a little deeper, and it seems that the increase happens during this step, specifically the call to `.sortby('x')` here. If I deactivate the new automatic slicing for large arrays, I get this: [...]

This prevents the `x` dimension from being rechunked to single-value chunks. Unfortunately I don't understand enough about these xarray/dask internals. @dcherian, is this something that you know more about? I'll try to come up with a more simplified example and crosspost over at xarray. Just wanted to document this behavior here first.
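The blow-up can be modelled without dask at all. Sorting along `x` with a shuffled coordinate is essentially a `take` with shuffled indices, so each output chunk depends on every input chunk it pulls values from; if the output additionally gets split into single-value chunks, the number of dependencies scales with the array length rather than with the number of chunks. This is only an illustrative counting model, not dask's actual graph construction:

```python
import random

def count_take_deps(n, in_chunk_size, out_chunk_size, seed=0):
    """Count (output chunk, input chunk) dependencies for a shuffled take.

    This is a rough proxy for the number of graph tasks a chunked shuffle
    along one axis generates.
    """
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)  # indices to take, like sortby on a shuffled coord
    deps = set()
    for out_pos, src in enumerate(order):
        deps.add((out_pos // out_chunk_size, src // in_chunk_size))
    return len(deps)

# Output chunks the same size as input chunks: at most 10 x 10 dependencies.
few = count_take_deps(10_000, 1000, 1000)
# Output split into single-value chunks, the worst case hinted at above:
# one dependency per element, i.e. 10,000.
many = count_take_deps(10_000, 1000, 1)
```

With realistic CMIP6 array sizes this gap is what turns tens of thousands of tasks into millions.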