zombie clusters #914

Open

chiaral opened this issue Jan 14, 2021 · 14 comments

Comments

chiaral commented Jan 14, 2021

I have noticed for a while now that when my computations fail (e.g. a worker dies), I often keep seeing zombie clusters on my machine, regardless of my best efforts to kill them.

For example:

  1. I do cluster.close()
  2. I restart the kernel
  3. I run in another notebook:
from dask_gateway import Gateway
g = Gateway()
g.list_clusters()

and the cluster is still there.
Usually I try to scale it down to 0, so at least I am not using any resources, but the cluster stays there.

Today I kept having issues with my clusters - probably due to what I was trying to do - and I ended up with 4 zombie clusters (which I managed to scale down to 0 by connecting to each of them through cluster = g.connect(g.list_clusters()[i].name) ), so I decided to restart my server entirely.
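
For reference, this is roughly the cleanup loop I ran (a sketch using only the public Gateway API; the clusters still show up in g.list_clusters() afterwards):

from dask_gateway import Gateway

g = Gateway()

# Connect to every cluster still registered for this user and scale it to 0 workers.
# This frees the workers, but the scheduler pods stay behind.
for report in g.list_clusters():
    cluster = g.connect(report.name)
    cluster.scale(0)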

I went on my home page, pressed stop server, and restarted it.
And on the new server I could still list the zombie clusters with
g.list_clusters()
They all have 0 workers and cores, but they are still there, and I think they can still consume memory just by existing.

After a while - I guess after whatever timeout limit is in place - they disappear.

chiaral commented Jan 28, 2021

Can anyone help me with this or tell me how to completely kill these zombie clusters?

chiaral commented Jan 28, 2021

Small note: I was using cluster.adapt() a lot, and now that I use cluster.scale() things seem more stable. Maybe the adaptive deployment was particularly unstable and created all of these issues. I will monitor and see if these persistent zombie clusters become less frequent now.

chiaral commented Jan 30, 2021

Like - this cluster
[ClusterReport<name=prod.9b358575a5c24bf68fb4569dda44c14e, status=RUNNING>]
has been hanging around on my server for 2 days. I scaled it down to 0 workers - but is there a way to kill it completely?
This is a bit worrisome, because other people might not be as careful as I am in keeping clusters under control, and we might have an army of zombie clusters eating up space and money.

@TomAugspurger

There is a lingering scheduler pod, so these zombie clusters are consuming some resources:

distributed.core - INFO - Removing comms to tls://10.39.51.3:34977
distributed.scheduler - INFO - Closing worker tls://10.39.19.4:42565
distributed.scheduler - INFO - Remove worker <Worker 'tls://10.39.19.4:42565', name: dask-worker-9b358575a5c24bf68fb4569dda44c14e-4dfvp, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tls://10.39.19.4:42565
distributed.scheduler - INFO - Lost all workers

Just to confirm, does doing this do anything?

cluster = g.connect(cluster_name)   # from g.list_clusters()
cluster.close()

There is an idle timeout of 1800 seconds, but that's apparently not kicking in.
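
For reference, the scheduler's idle timeout would normally be configured through dask config, something like the sketch below (how the 1800s value is actually injected on this deployment is an assumption on my part):

import dask

# Sketch: set the scheduler's idle timeout to 30 minutes via dask config.
dask.config.set({"distributed.scheduler.idle-timeout": "1800s"})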

chiaral commented Jan 31, 2021

It does not work - interestingly though, I can do

cluster = g.connect(cluster_name)   # from g.list_clusters()
cluster.scale(2)

and it will scale! Which is very confusing!

And that cluster has been around for way longer than 1800 seconds - more like 2 days!
Does the 1800 seconds start from when I log out / close my server, or is it independent of that?

thanks so much Tom - nice to see you here :)

@TomAugspurger

Weird... So you can connect to the cluster? What if you do...

cluster = g.connect(cluster_name)
client = cluster.get_client()
client.shutdown()

In theory that's supposed to tell the scheduler and workers to shut down too.

And that cluster has been around for way longer than 1800 seconds - more like 2 days! is 1800 seconds starting from when I log out/close my server?

So that's 1800 seconds of "inactivity". I was poking around to see what that is: https://github.com/dask/distributed/blob/08ea96890674d48b90f4e1f92959957e5e362a18/distributed/scheduler.py#L6362-L6379. Basically, it checks whether any of the workers have tasks they're working on. It'd be useful to surface this through the dashboard, but I don't think it currently is.
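
Roughly, that check works like the paraphrase below (a sketch of the linked code, not the actual source):

from time import time

# Paraphrase of the scheduler's idle check (sketch only, not the real implementation):
def check_idle(scheduler):
    if any(ws.processing for ws in scheduler.workers.values()) or scheduler.unrunnable:
        scheduler.idle_since = None       # something is running or queued: not idle
        return
    if scheduler.idle_since is None:
        scheduler.idle_since = time()     # start the idle clock
    elif time() > scheduler.idle_since + scheduler.idle_timeout:
        scheduler.loop.add_callback(scheduler.close)   # shut down after the timeout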

It is independent of when you log out / close your server. Normally, the lifetime of the Dask cluster is tied to the lifetime of your Python kernel. We register some functions to run when the kernel is shut down that tell the cluster to shut down too. But if the kernel doesn't exit cleanly, those functions may not run. That's what the idle timeout is for, but it's apparently not working in this case.

If you're able to, it'd be interesting to check the state of the scheduler. Maybe something like

client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)

and compare that to time.time(). I see that that code references "unrunnable" tasks. I wonder if that would be the culprit.
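
For the comparison, something like (a sketch):

import time

# How long the scheduler thinks it has been idle, in seconds.
idle_since = client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)
print(time.time() - idle_since)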

thanks so much Tom - nice to see you here :)

I'm always following along :)

chiaral commented Feb 1, 2021

Ok - lol - it is still there!

client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.idle_since)

gave me
1611770554.0601828

which sounds about right 🤣

client = cluster.get_client()
client.shutdown()

It's hanging... I will report back if it gets anywhere.
I also have a tangential question.

What is the difference between

from dask.distributed import Client
client = Client(cluster)

and

client = cluster.get_client()

I think both connect the cluster to my notebook, but in a different way.
Also, as of today, when I do

from dask.distributed import Client

I get this warning - it did not happen on Friday:

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py:1136: VersionMismatchWarning: Mismatched versions found

+-------------+-----------+-----------+---------+
| Package     | client    | scheduler | workers |
+-------------+-----------+-----------+---------+
| blosc       | 1.10.2    | 1.9.2     | None    |
| dask        | 2021.01.1 | 2.30.0    | None    |
| distributed | 2021.01.1 | 2.30.1    | None    |
| lz4         | 3.1.3     | 3.1.1     | None    |
| msgpack     | 1.0.2     | 1.0.0     | None    |
+-------------+-----------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

@TomAugspurger

what is the difference between...

Those should be about the same. I think there's an issue somewhere about standardizing them (dask-gateway used to require cluster.get_client(), but now it's not required).

I think the warning is safe to ignore... It's only happening right now because the zombie cluster was created a while ago and we updated the image in the meantime. Your client is on the new image and your zombie cluster is on the old one.
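
If the warning gets noisy, you could silence it for the session, something like the sketch below (assuming the warning class lives in distributed.versions, as the traceback suggests):

import warnings
from distributed.versions import VersionMismatchWarning

# Hide the warning for this session; the underlying version skew is still there.
warnings.filterwarnings("ignore", category=VersionMismatchWarning)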

chiaral commented Feb 1, 2021

It's been 24 minutes and client.shutdown() is still hanging, so I am not sure that will work. I am happy to let it keep going and see if it eventually shuts the cluster down, but it doesn't look promising.

chiaral commented Feb 1, 2021

Still hanging after 2 hours; I would say that client.shutdown() doesn't work. Feel free to kill it!

TomAugspurger commented Feb 2, 2021 via email


scottyhq commented Mar 3, 2021

Not sure if this is still coming up, but I've noticed on the AWS hub that in a clean session I often see a pending cluster still around:

from dask_gateway import Gateway
gateway = Gateway()
gateway.list_clusters()

[ClusterReport<name=icesat2-staging.45dc9798954c48fca2b3580b6a6104ea, status=PENDING>]

The following command gets rid of it:
gateway.stop_cluster('icesat2-staging.45dc9798954c48fca2b3580b6a6104ea')
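
If several of them pile up, a loop like this should clear them all (a sketch using the same API):

from dask_gateway import Gateway

gateway = Gateway()

# Stop every cluster still registered for this user.
for report in gateway.list_clusters():
    gateway.stop_cluster(report.name)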

chiaral commented Mar 5, 2021

The following command gets rid of it:
gateway.stop_cluster('icesat2-staging.45dc9798954c48fca2b3580b6a6104ea')

Thanks @scottyhq !!! Next time I will try this. If it consistently works, we should add it to the notes. I am keeping the issue open because I plan to add some text about this issue to the documentation... at some point!

chiaral commented Mar 10, 2021

That command killed a zombie cluster right away! Great!
