
Computations fail with "could not find dependent" #11

Closed
MikeBeller opened this issue Dec 1, 2021 · 11 comments

@MikeBeller

Frequently, large dask computations on large clusters seem to fail with "could not find dependent" errors:

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1811                             exc = CancelledError(key)
   1812                         else:
-> 1813                             raise exception.with_traceback(traceback)
   1814                         raise exc
   1815                     if errors == "skip":

ValueError: Could not find dependent ('broadcast_to-concatenate-where-getitem-4916603adba736a1cf7a63ef1786c8d8', 8, 1, 0, 0).  Check worker logs

To check the logs, I searched the scheduler (master) logs for the key and tracked it down to a specific worker; that worker's log looks like this:

distributed.worker - INFO - Start worker at: tls://10.244.47.9:36493
distributed.worker - INFO - Listening to: tls://10.244.47.9:36493
distributed.worker - INFO - dashboard at: 10.244.47.9:8787
distributed.worker - INFO - Waiting to connect to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.00 GiB
distributed.worker - INFO - Local Directory: /home/jovyan/dask-worker-space/worker-5bjfqpay
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 1 . Asking scheduler
... many repeats
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration

The general situation: I've built a large, multi-tile stack (25 time steps, 5 bands, x=60000, y=40000), so about 2.5 TB. I've tried cluster sizes ranging from 45 to 300 cores.
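
(For reference, that 2.5 TB figure is consistent with stackstac's default float64 dtype; a quick back-of-the-envelope check, with the shape above hard-coded:)

# Rough size check for the stack described above:
# 25 time steps x 5 bands x 60000 x 40000 pixels, assuming the default
# float64 dtype (8 bytes per element).
nbytes = 25 * 5 * 60000 * 40000 * 8
print(f"{nbytes / 1e12:.2f} TB")  # ~2.40 TB, i.e. roughly the 2.5 TB above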

Can you give me general guidance on this sort of error? I can't reproduce it reliably, so it's hard to provide a minimal example. Any pointers on what is happening, and on what I could do to explore possible causes and solutions, would be appreciated.

@TomAugspurger

This is tricky... There are various issues in dask/distributed around this topic (dask/distributed#5172 is the primary one).

Can you provide more information on the behavior of the cluster as this is computing? In particular (one way to check both is sketched below):

  1. Is the size of your cluster fixed at some number of workers prior to your computation starting?
  2. Are any workers under memory pressure and possibly being paused / killed?
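
For example (an untested sketch using the standard Client API; the metrics field names are from recent distributed releases):

# Untested sketch: inspect worker count, memory use, and pause status.
from distributed import Client

client = Client("tls://<scheduler-address>:8786")  # illustrative address

info = client.scheduler_info()
print("n_workers:", len(info["workers"]))
for addr, w in info["workers"].items():
    # current memory use (bytes) vs. the configured per-worker limit
    print(addr, w["metrics"].get("memory"), "of", w.get("memory_limit"), "bytes")

# Each worker's status ("running", "paused", ...), evaluated on the workers:
print(client.run(lambda dask_worker: str(dask_worker.status)))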

@MikeBeller
Author

Thanks Tom. This typically happens when the total size of the computation may exceed the memory of the cluster. My assumption was that dask.distributed would spill to disk, but the issue you reference does say this happens in cases of high task latency, often due to spilling to disk. So you may be right on track.

Let me do some more digging and observing during my runs and see if I can get better answers to your questions. I'll post back when I have that, or better yet, an isolated example.

@MikeBeller
Author

I just posted #12 -- still getting missing dependent errors intermittently. Per your question, the example in #12 was run on a fixed-size cluster that had fully stabilized before running the workload. The code there is a repeatable example. If you ran it yourself I'm sure you'd see many things I am missing.

@gjoseph92

@MikeBeller what version of distributed is this? I hear unconfirmed rumors that this "Could not find dependent" issue may be fixed in 2021.11.2 and later. @TomAugspurger maybe you should update dask as well?

@MikeBeller
Author

> @MikeBeller what version of distributed is this? I hear unconfirmed rumors that this "Could not find dependent" issue may be fixed in 2021.11.2 and later. @TomAugspurger maybe you should update dask as well?

import dask.distributed
dask.distributed.__version__
'2021.08.1'
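
(To also rule out a client/scheduler/worker version mismatch, something like this should flag any differences -- an untested sketch, assuming client is the connected distributed.Client:)

# Sketch: compare package versions across client, scheduler, and workers;
# check=True makes distributed complain if they disagree.
versions = client.get_versions(check=True)
print(versions["scheduler"]["packages"]["distributed"])
print({w: v["packages"]["distributed"] for w, v in versions["workers"].items()})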

@gjoseph92

I ran this myself and found something like this in the worker logs:
distributed.worker - INFO - Can't find dependencies {<Task "('asset-table-81cb4d5d94619ed532a9b3f16b4a132c', 72, 2)" memory>} for key ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 8, 10)
distributed.worker - INFO - Dependent not found: ('asset-table-81cb4d5d94619ed532a9b3f16b4a132c', 72, 2) 0 .  Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)" fetch>} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11) 0 .  Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)" fetch>} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11) 1 .  Asking scheduler
... many repeats of the "Can't find dependencies" / "Dependent not found ... Asking scheduler" lines for the same where-getitem keys, with the retry counts climbing, then:
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
... the same "Can't find dependencies" / "Dependent not found" lines and the "Handle missing dep failed, retrying" traceback then repeat several more times.

The RuntimeError: Set changed size during iteration is reminiscent of dask/distributed#4371. I don't think that was actually properly fixed, but the handle_missing_dep function in question was removed entirely in dask/distributed#5046, so upgrading to distributed>=2021.10.0 indeed might eliminate this problem.
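
After upgrading, a quick sanity check like this (a sketch; run it on the client, and via client.run on the workers too) would confirm the environment is actually past that cutoff:

# Sketch: confirm the running distributed release is new enough to have
# dropped handle_missing_dep (i.e. >= 2021.10.0).
from packaging.version import Version
import distributed

assert Version(distributed.__version__) >= Version("2021.10.0"), distributed.__version__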

I also happened to notice these transient errors from rasterio while reading the data:

RuntimeError('Error opening \'https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/PU/2020/01/30/S2A_MSIL2A_20200130T080151_N0212_R035_T35KPU_20201002T151436.SAFE/GRANULE/L2A_T35KPU_A024056_20200130T080800/IMG_DATA/R10m/T35KPU_20200130T080151_B03_10m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D\': RasterioIOError("\'/vsicurl/https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/PU/2020/01/30/S2A_MSIL2A_20200130T080151_N0212_R035_T35KPU_20201002T151436.SAFE/GRANULE/L2A_T35KPU_A024056_20200130T080800/IMG_DATA/R10m/T35KPU_20200130T080151_B03_10m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D\' not recognized as a supported file format.")')

RuntimeError("Error reading Window(col_off=24576, row_off=12288, width=2048, height=2048) from 'https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/QT/2020/02/01/S2B_MSIL2A_20200201T075039_N0212_R135_T35KQT_20200930T092316.SAFE/GRANULE/L2A_T35KQT_A015176_20200201T080648/IMG_DATA/R20m/T35KQT_20200201T075039_SCL_20m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D': RasterioIOError('Read or write failed. IReadBlock failed at X offset 48, Y offset 100: IReadBlock failed at X offset 5, Y offset 4: TIFFReadEncodedTile() failed.')")

So I'd probably add something like this to the stackstac.stack call, just to ignore them for now:

from rasterio.errors import RasterioIOError

stackstac.stack(
    ...,
    errors_as_nodata=(
        RasterioIOError("HTTP response code: 404"),
        RuntimeError("Error opening"),
        RuntimeError("Error reading Window"),
    ),
)

@TomAugspurger

> I also happened to notice these transient errors from rasterio while reading the data:

Those are from rate limits on Azure Blob Storage (account-wide limits, not necessarily specific to your requests). If you enable CPL_CURL_VERBOSE=1 you might see logs like:

< HTTP/1.1 503 Egress is over the account limit.
< CPLE_None in VSICURL: Got response_code=503
< Content-Length: 212
< Content-Type: application/xml
< Server: Microsoft-HTTPAPI/2.0
< x-ms-request-id: 4ca9c883-301e-0045-72f7-f13f4d000000
< x-ms-error-code: ServerBusy
< Date: Wed, 15 Dec 2021 21:04:00 GMT
< 

So those would be good errors to retry after some backoff at some layer in the stack.
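
For reference, enabling that logging for a one-off read looks roughly like this (a sketch; rasterio.Env passes keyword arguments through as GDAL config options, and on dask workers the equivalent is setting CPL_CURL_VERBOSE=1 in the environment before they start):

# Sketch: turn on verbose curl logging for GDAL/rasterio reads so that
# throttling responses (e.g. 503 ServerBusy) show up in the output.
import rasterio
from rasterio.windows import Window

with rasterio.Env(CPL_CURL_VERBOSE=True):
    with rasterio.open("https://<blob-url>/some.tif") as src:  # illustrative URL
        src.read(1, window=Window(0, 0, 512, 512))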

@gjoseph92

Interesting that it's a 503 instead of a 429. Good to note for gjoseph92/stackstac#18. Retries are definitely a top priority for robustness; I just don't know whether to do them at the dask level (which feels more appropriate, but dask/dask#7036 is a blocker) or just within stackstac.

@TomAugspurger

GDAL has a couple of config options, GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY (https://gdal.org/user/configoptions.html), that we can set. IMO that's probably best done at the application layer, but I'm not sure. I'll likely set them on the Planetary Computer hub.
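
For the application-layer version, roughly (a sketch; on a hub these would normally go in the worker image or environment spec rather than in user code):

# Sketch: have GDAL retry transient HTTP failures itself. These are plain
# GDAL config options, so environment variables work; they need to be set
# in the environment of every dask worker (e.g. in the worker image/spec).
import os

os.environ["GDAL_HTTP_MAX_RETRY"] = "5"    # retries on 429/502/503/504 responses
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"  # base delay (seconds) between retries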

@gjoseph92

Nice, I didn't know about these! That would be very easy to set in stackstac's LayeredEnv right now. Let's do that.

@TomAugspurger

I think this was fixed by gjoseph92/stackstac#95 and dask/distributed#5552. If those versions aren't already in the production hub, they will be in the next couple of weeks (they're in our staging hub, and the workflow that previously errored is now working; I'll update in #12).
