Computations fail with "could not find dependent" #11
This is tricky... There are various issues in dask/distributed around this topic (dask/distributed#5172 is the primary one). Can you provide more information on the behavior of the cluster as this is computing? In particular:
Thanks Tom. I would say this typically happens when the total computation may exceed the memory of the cluster. My assumption was that dask.distributed would spill to disk -- but the issue you reference does say this happens in cases of high task latency, often due to spill-to-disk, so you may be right on track. Let me do some more digging and observing during my runs and see if I can get better answers to your questions. Will post back when I have that, or better yet an isolated example.
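For context (this is not from the thread): whether and when distributed spills to disk is controlled by per-worker memory thresholds in the dask config. A minimal sketch follows, using what were the commonly documented defaults around this time; the exact keys and values may differ in your version, and they need to be in place before the workers start (e.g. in the worker config or environment), not just on the client.

```python
import dask

# Worker memory-management thresholds (fractions of the worker memory limit).
# Values shown are illustrative defaults only -- tune them for your own cluster.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling least-recently-used data to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively based on process memory
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks on this worker
    "distributed.worker.memory.terminate": 0.95,  # the nanny restarts the worker
})
```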
I just posted #12 -- still getting missing dependent errors intermittently. Per your question, the example in #12 was run on a fixed-size cluster that was fully stabilized before running the workload. The code there is a repeatable example. If you want to run it yourself, I'm sure you'd see many things I am missing.
@MikeBeller what version of |
I ran this myself and found something like this in the worker logs:
I also happened to notice these transient errors from rasterio while reading the data:
So I'd probably add this just to ignore them for now:

```python
import stackstac
from rasterio.errors import RasterioIOError

stackstac.stack(
    ...,
    errors_as_nodata=(
        RasterioIOError("HTTP response code: 404"),
        RuntimeError("Error opening"),
        RuntimeError("Error reading Window"),
    ),
)
```
Those are from some rate limits on Azure Blob Storage (account-wide limits, not specific to your requests necessarily). If you enable
So those would be good errors to retry after some backoff at some layer in the stack.
Interesting that it's a 503 instead of 429. Good to note for gjoseph92/stackstac#18. Retries are definitely a top priority for robustness; I just don't know whether to do them at the dask level (which feels more appropriate, but dask/dask#7036 is a blocker) or just within stackstac.
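For illustration (not from the thread): "retries at the dask level" can already be requested per computation through the distributed scheduler; the open question above is whether that plays well with this failure mode given dask/dask#7036. A minimal sketch, assuming a running distributed cluster:

```python
import dask.array as da
from distributed import Client

client = Client()  # assumes a local or already-running cluster

x = da.random.random((100_000, 10_000), chunks=(5_000, 5_000))

# retries= asks the scheduler to re-run a failed task up to N times
# before marking the whole computation as failed.
future = client.compute(x.mean(), retries=3)
print(future.result())
```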
GDAL has a couple of config options, GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY (https://gdal.org/user/configoptions.html), that we can set. IMO that's probably best done at the application layer, but I'm not sure. I'll likely set them on the Planetary Computer hub.
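For illustration (the values below are arbitrary examples, not recommendations): GDAL reads these config options from environment variables, so one way to set them is shown below. On a dask cluster they need to be present in each worker's environment (e.g. via the cluster manager or hub image), not just in the client process.

```python
import os

# GDAL HTTP retry behavior (https://gdal.org/user/configoptions.html)
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"    # retry failed HTTP requests up to 5 times
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"  # initial delay between retries, in seconds
```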
Nice, I didn't know about these! That would be very easy to set in stackstac's
I think this was fixed by gjoseph92/stackstac#95 and dask/distributed#5552. If those versions aren't already in the production hub, they will be in the next couple of weeks (they're in our staging hub, and the workflow that previously errored is now working; will update in #12).
Frequently, large dask computations on large clusters seem to fail with "could not find dependent" errors:
Checking the logs (looking in the master logs for the item ID and then tracking it down to a specific worker), the worker log looks like this:
The general situation is that I've built a large, multi-tile stack (25 times, 5 bands, x=60000, y=40000) -- so about 2.5 TB. The cluster sizes I've tried vary from 45 to 300 cores.

Can you give me general guidance on this sort of error? I can't reproduce it very reliably, so it's hard to provide a "minimal example" that reproduces the error. Perhaps someone might have some guidance about what is happening and what I could do to explore possible reasons / solutions?
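For reference, the ~2.5 TB figure quoted above is consistent with those dimensions if the pixels are float64 (8 bytes each), which is stackstac's usual default dtype; a quick back-of-the-envelope check:

```python
# Rough size of the stacked array described above, assuming float64 pixels.
times, bands, x, y = 25, 5, 60_000, 40_000
size_bytes = times * bands * x * y * 8
print(size_bytes / 1e12)  # ~2.4 TB, roughly the 2.5 TB quoted
```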