zombie clusters #914

Open

chiaral opened this issue Jan 14, 2021 · 14 comments

Comments

chiaral commented Jan 14, 2021

I have noticed for a while now that when my computations fail (e.g. a worker dies), I often keep seeing zombie clusters on my machine, regardless of my best efforts to kill them.

For example:

  1. I do cluster.close()
  2. I restart the kernel
  3. I run in another notebook:
from dask_gateway import Gateway
g = Gateway()
g.list_clusters()

and the cluster is still there.
Usually I try to scale it down to 0, so at least I am not using any resources, but the cluster stays there.

Today I kept having issues with my clusters - probably due to what I was trying to do - and I ended up with 4 zombie clusters (which I managed to scale down to 0 by connecting to each of them through cluster = g.connect(g.list_clusters()[i].name) ), so I decided to restart my server entirely.
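
For reference, this is roughly the cleanup loop I ran (a sketch using only the public Gateway API; the clusters still show up in g.list_clusters() afterwards):

from dask_gateway import Gateway

g = Gateway()

# Connect to every cluster still registered for this user and scale it to 0 workers.
# This frees the workers, but the scheduler pods stay behind.
for report in g.list_clusters():
    cluster = g.connect(report.name)
    cluster.scale(0)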

I went on my home page, pressed stop server, and restarted it.
And on the new server I could still list the zombie clusters with
g.list_clusters()
They all have 0 workers and cores, but they are still there, and I think they can still consume memory just by existing.

After a while - I guess after whatever timeout limit is in place - they disappear.

chiaral commented Jan 28, 2021

Can anyone help me with this or tell me how to completely kill these zombie clusters?

chiaral commented Jan 28, 2021

Small note: I was using cluster.adapt() a lot, and now that I use cluster.scale() things seem more stable. Maybe the adaptive deployment was particularly unstable and created all of these issues. I will monitor and see if these persistent zombie clusters become less frequent now.

chiaral commented Jan 30, 2021

Like - this cluster
[ClusterReport<name=prod.9b358575a5c24bf68fb4569dda44c14e, status=RUNNING>]
has been hanging around on my server for 2 days. I scaled it down to 0 workers - but is there a way to kill it completely?
This is a bit worrisome, because other people might not be as careful as I am in keeping clusters under control, and we might have an army of zombie clusters eating up space and money.

@TomAugspurger

There is a lingering scheduler pod, so these zombie clusters are consuming some resources:

distributed.core - INFO - Removing comms to tls://10.39.51.3:34977
distributed.scheduler - INFO - Closing worker tls://10.39.19.4:42565
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.39.19.4:42565', name: dask-worker-9b358575a5c24bf68fb4569dda44c14e-4dfvp, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.39.19.4:42565
distributed.scheduler - INFO - Lost all workers

Just to confirm, does doing this do anything?

cluster = g.connect(cluster_name)   # from g.list_clusters()
cluster.close()

There is an idle timeout of 1800 seconds, but that's apparently not kicking in.
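
For reference, the scheduler's idle timeout would normally be configured through dask config, something like the sketch below (how the 1800s value is actually injected on this deployment is an assumption on my part):

import dask

# Sketch: set the scheduler's idle timeout to 30 minutes via dask config.
dask.config.set({"distributed.scheduler.idle-timeout": "1800s"})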

chiaral commented Jan 31, 2021

It does not work - interestingly though, I can do

cluster = g.connect(cluster_name)   # from g.list_clusters()
cluster.scale(2)

and it will scale! Which is very confusing!

And that cluster has been around for way longer than 1800 seconds - more like 2 days!
Does the 1800 seconds start from when I log out / close my server, or is it independent of that?

thanks so much Tom - nice to see you here :)

@TomAugspurger

Weird... So you can connect to the cluster? What if you do...

cluster = g.connect(cluster_name)
client = cluster.get_client()
client.shutdown()

In theory that's supposed to tell the scheduler and workers to shut down too.

And that cluster has been around for way longer than 1800 seconds - more like 2 days! is 1800 seconds starting from when I log out/close my server?

So that's 1800 seconds of "inactivity". I was poking around to see what that is: https://github.com/dask/distributed/blob/08ea96890674d48b90f4e1f92959957e5e362a18/distributed/scheduler.py#L6362-L6379. Basically, it checks whether any of the workers have tasks they're working on. It'd be useful to surface this through the dashboard, but I don't think it currently is.
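
Roughly, that check works like the paraphrase below (a sketch of the linked code, not the actual source):

from time import time

# Paraphrase of the scheduler's idle check (sketch only, not the real implementation):
def check_idle(scheduler):
    if any(ws.processing for ws in scheduler.workers.values()) or scheduler.unrunnable:
        scheduler.idle_since = None       # something is running or queued: not idle
        return
    if scheduler.idle_since is None:
        scheduler.idle_since = time()     # start the idle clock
    elif time() > scheduler.idle_since + scheduler.idle_timeout:
        scheduler.loop.add_callback(scheduler.close)   # shut down after the timeout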

It is independent of when you log out / close your server. Normally, the lifetime of the Dask cluster is tied to the lifetime of your Python kernel. We register some functions to run when the kernel is shut down that tell the cluster to shut down too. But if the kernel doesn't exit cleanly, those functions may not run. That's what the idle timeout is for, but it's apparently not working in this case.

If you're able to, it'd be interesting to check the state of the scheduler. Maybe something like

client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)

and compare that to time.time(). I see that that code references "unrunnable" tasks. I wonder if that would be the culprit.
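
For the comparison, something like (a sketch):

import time

# How long the scheduler thinks it has been idle, in seconds.
idle_since = client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)
print(time.time() - idle_since)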

thanks so much Tom - nice to see you here :)

I'm always following along :)

chiaral commented Feb 1, 2021

Ok - lol - it is still there!

client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)

gave me
1611770554.0601828

which sounds about right 🤣

client = cluster.get_client()
client.shutdown()

It's hanging... I will report back if it gets anywhere.
I also have a tangential question.

What is the difference between

from dask.distributed import Client
client = Client(cluster)

and

client = cluster.get_client()

I think both connect the cluster to my notebook, but in a different way.
Also, as of today, when I do

from dask.distributed import Client

I get this warning - it did not happen on Friday:

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py:1136: VersionMismatchWarning: Mismatched versions found

+-------------+-----------+-----------+---------+
| Package     | client    | scheduler | workers |
+-------------+-----------+-----------+---------+
| blosc       | 1.10.2    | 1.9.2     | None    |
| dask        | 2021.01.1 | 2.30.0    | None    |
| distributed | 2021.01.1 | 2.30.1    | None    |
| lz4         | 3.1.3     | 3.1.1     | None    |
| msgpack     | 1.0.2     | 1.0.0     | None    |
+-------------+-----------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

@TomAugspurger

what is the difference between...

Those should be about the same. I think there's an issue somewhere about standardizing them (dask-gateway used to require cluster.get_client(), but now it's not required).

I think the warning is safe to ignore... It's only happening right now because the zombie cluster was created a while ago and we updated the image in the meantime. Your client is on the new image and your zombie cluster is on the old one.
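
If the warning gets noisy, you could silence it for the session, something like the sketch below (assuming the warning class lives in distributed.versions, as the traceback suggests):

import warnings
from distributed.versions import VersionMismatchWarning

# Hide the warning for this session; the underlying version skew is still there.
warnings.filterwarnings("ignore", category=VersionMismatchWarning)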

chiaral commented Feb 1, 2021

It's been 24 minutes and client.shutdown() is still hanging, so I am not sure that will work. I am happy to let it keep going and see if it eventually shuts the cluster down, but it doesn't look promising.

chiaral commented Feb 1, 2021

Still hanging after 2 hours; I would say that client.shutdown() doesn't work. Feel free to kill it!

TomAugspurger commented Feb 2, 2021 via email


scottyhq commented Mar 3, 2021

Not sure if this is still coming up, but I've noticed on the AWS hub that in a clean session I often see a pending cluster still around:

from dask_gateway import Gateway
gateway = Gateway()
gateway.list_clusters()

[ClusterReport<name=icesat2-staging.45dc9798954c48fca2b3580b6a6104ea, status=PENDING>]

The following command gets rid of it:
gateway.stop_cluster('icesat2-staging.45dc9798954c48fca2b3580b6a6104ea')
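
If several of them pile up, a loop like this should clear them all (a sketch using the same API):

from dask_gateway import Gateway

gateway = Gateway()

# Stop every cluster still registered for this user.
for report in gateway.list_clusters():
    gateway.stop_cluster(report.name)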

chiaral commented Mar 5, 2021

The following command gets rid of it:
gateway.stop_cluster('icesat2-staging.45dc9798954c48fca2b3580b6a6104ea')

Thanks @scottyhq !!! Next time I will try this. If it consistently works, we should add it to the notes. I am keeping the issue open because I plan to add some text about this issue to the documentation... at some point!

chiaral commented Mar 10, 2021

That command killed a zombie cluster right away! Great!
