change: graceful shutdown drains node before k3d container stops #1119

Merged
merged 2 commits into main from patch-1 on Feb 8, 2023

Conversation

arikmaor
Contributor

@arikmaor arikmaor commented Aug 4, 2022

This PR drains the node before stopping the container (in case of docker stop, k3d cluster stop, k3d node stop, or a system shutdown).

It allows services to shut down gracefully in this case; this is especially useful for databases that need to clear their lock files.

Related to the bug I opened

Implications

  1. sh is now the main process running in the image and is used to intercept incoming signals (SIGTERM etc.) more info
  2. Before the k3s process is killed, kubectl drain is called, causing the pods to be evicted. When the node is ready again, kubectl uncordon is called to allow pods to be rescheduled. more info (a rough sketch of this flow follows below)
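
For illustration, a minimal sketch of the trap-and-drain flow described above (not the exact script from this PR; the node name, k3s invocation, and wait loop are placeholder assumptions):

```sh
#!/bin/sh
# Minimal sketch of the signal-handling entrypoint idea (not the exact PR script).
# NODE_NAME and the k3s arguments are placeholders.
NODE_NAME="${NODE_NAME:-$(hostname)}"

graceful_stop() {
    # Evict pods first so workloads (e.g. databases) can shut down cleanly.
    kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data --force || true
    kill -TERM "$K3S_PID"
    wait "$K3S_PID"
    exit 0
}

# sh is PID 1, so it receives SIGTERM from `docker stop` and forwards it.
trap graceful_stop TERM INT

/bin/k3s "$@" &
K3S_PID=$!

# Once the API is reachable again after a restart, allow pods back onto this node.
until kubectl uncordon "$NODE_NAME" >/dev/null 2>&1; do sleep 5; done

wait "$K3S_PID"
```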

arikmaor pushed a commit to tensorleap/helm-charts that referenced this pull request Aug 4, 2022
This commit should be reverted once
k3d-io/k3d#1119 is merged
yotamazriel approved these changes Aug 4, 2022
@iwilltry42
Member

Hi @arikmaor , thanks for your PR!
I like the simplicity of this solution 👍
However, I feel like we should make this configurable.
Without testing, I'd guess that it will significantly slow down the cluster stop process, right?
That may not be wanted in quite a few use-cases, e.g. CI or quick development cycles, where no data needs to be persisted.

E.g. one could introduce a flag --graceful to k3d cluster stop or a new command k3d cluster shutdown or just some environment variable, or something else.

What do you think?

@arikmaor
Contributor Author

arikmaor commented Aug 4, 2022

Without testing, I guess, that it will significantly slow down the cluster stop process, right?

I don't think so. I'm assuming you're calling docker stop internally, and according to the docs Docker will send a SIGTERM signal and wait 10s by default before force-killing.
You can set this timeout when you run docker stop with the --timeout parameter, which perhaps should be added to the k3d cluster/node stop commands.
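
For reference, a plain Docker CLI example (the container name here is just illustrative):

```sh
# Give the node's entrypoint up to 60 seconds to drain before Docker sends SIGKILL.
docker stop --timeout 60 k3d-mycluster-server-0
```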

That may not be wanted in quite a few use-cases, e.g. CI or quick development cycles, where no data needs to be persisted.

In the CI scenario, you can live with the graceful default of 10s or just do a kill.
In the dev scenario, I think you expect the pods to go down gracefully when running k3d cluster stop or when shutting down your workstation.

E.g. one could introduce a flag --graceful to k3d cluster stop or a new command k3d cluster shutdown or just some environment variable, or something else.

IMHO, graceful shutdown should be the default, but I guess that's debatable.
I think it should be the default because:

  1. To match docker and kubectl semantics, I would prefer a separate kill command to go along with stop.
  2. When doing a clean system shutdown, you expect all running processes to close properly and all data to be saved. This also applies to someone accidentally doing docker stop instead of k3d node stop. In those cases you must "speak the OS language" (handle signals) in order to respond properly.

arikmaor pushed a commit to tensorleap/helm-charts that referenced this pull request Aug 4, 2022
This commit should be reverted once
k3d-io/k3d#1119 is merged
@iwilltry42 iwilltry42 added this to the v5.5.0 milestone Aug 4, 2022
@iwilltry42
Member

Your arguments make total sense, thanks for taking the time.
I'll add it to the next milestone 👍

@arikmaor
Contributor Author

arikmaor commented Aug 4, 2022

10x cool :)

yotamazriel pushed a commit to tensorleap/helm-charts that referenced this pull request Aug 7, 2022
This commit should be reverted once
k3d-io/k3d#1119 is merged
@iwilltry42 iwilltry42 left a comment
Member

LGTM

@iwilltry42 iwilltry42 changed the title Drain node before container stops change: graceful shutdown drains node before k3d container stops Feb 8, 2023
@iwilltry42 iwilltry42 merged commit f354727 into k3d-io:main Feb 8, 2023
@iwilltry42
Member

Thank you @arikmaor ! 👍

@arikmaor
Contributor Author

arikmaor commented Mar 8, 2023

@all-contributors
please add @arikmaor for code

@allcontributors
Contributor

@arikmaor

I've put up a pull request to add @arikmaor! 🎉

@eshizhan

@iwilltry42 @arikmaor the agent does not have a kubeconfig, which causes an error on the agent: The connection to the server localhost:8080 was refused - did you specify the right host or port?

@arikmaor arikmaor deleted the patch-1 branch April 13, 2023 09:48
@arikmaor
Contributor Author

@iwilltry42 @arikmaor the agent does not have a kubeconfig, which causes an error on the agent: The connection to the server localhost:8080 was refused - did you specify the right host or port?

That can happen for all kinds of reasons...
How is this related to this PR?

@eshizhan

That PR uses the kubectl command, which makes the agent node emit an error log every 3 seconds, because k3s does not create a kubeconfig on agents by default. k3s-io/k3s#3862

@eshizhan

Reproduce:

k3d cluster create k3dtest -a 1
docker logs k3d-k3dtest-agent-0 -f
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
...

@arikmaor
Contributor Author

@iwilltry42 perhaps the script should be tweaked a bit for some use case I did not foresee.
PTAL

@iwilltry42
Member

@arikmaor I think we may even need to revert the change for now.
I see the following problems:

  1. As @eshizhan noticed, it won't work on agent nodes, since the only available node-local kubeconfigs there don't have the necessary permissions
  2. We always need access to the Kubernetes API to call the drain command, which will make it fail if the servers are already down, e.g. when you just want to delete some orphaned agent node

I'm not sure how "graceful" we could make the shutdown by e.g. using ctr or crictl to node-locally stop/remove pods/containers.
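
For illustration, a node-local variant could look roughly like this (a hedged sketch; it assumes crictl is available in the node image and pointed at k3s' containerd socket):

```sh
# Stop running containers via the CRI, without needing the Kubernetes API.
# Sketch only: the socket path and timeout are assumptions.
export CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock
for ctr in $(crictl ps -q); do
    # `crictl stop --timeout` sends SIGTERM and waits before force-killing.
    crictl stop --timeout 30 "$ctr"
done
```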

k3d's stop commands could easily be handled from the client side... but not docker commands, unfortunately.

Do you have any idea?

@arikmaor
Contributor Author

arikmaor commented Apr 26, 2023

k3d's stop commands could easily be handled from the client side... but not docker commands, unfortunately.

Let's not give up just yet.
It's not just docker commands, it's also a system shutdown.

If you just run k3s directly:
shutting down k3s doesn't kill the running containers;
shutting down the machine allows all containers to shut down gracefully.

Shutting down the machine is the critical case IMO; you don't expect your containers to just drop dead, without a chance to save their data, when you do a clean shutdown.

@arikmaor
Contributor Author

@iwilltry42 is there a way of knowing inside k3d-entrypoint.sh if the current container is a server? (perhaps by reading some environment variables)
If so, perhaps we should only run the kubectl drain/uncordon commands when we know it's a server, and skip them otherwise.
Does that sound like the correct behavior to you?
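
For example, something along these lines (purely hypothetical; K3D_ROLE is an assumed variable name, the real mechanism may differ):

```sh
# Hypothetical sketch: only drain when this container hosts a server node.
# K3D_ROLE is an assumed variable; k3d may expose the role differently.
if [ "${K3D_ROLE:-}" = "server" ]; then
    kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data --force || true
fi
```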

@iwilltry42
Member

@arikmaor yeah, we would know that, but that doesn't help with agent shutdown, right?
With a k3d-triggered stop we could make it work, but not with an externally induced one, since we don't know the order in which the nodes are stopped, so it may or may not work.

@arikmaor
Contributor Author

Mmm... I see what you mean (before, I didn't understand that an agent still runs pods, it just doesn't replicate the control plane).
But why exactly can't we drain an agent node? It's still a k3s node IIUC.

@iwilltry42
Member

Alright, so we have two major cases here:

k3d-induced shutdown of nodes

Everything's fine here, we can just run the drain command for all cluster nodes from a single server node.

Docker- or System-induced shutdown of nodes

We don't know in which order the nodes are being shut down, and whether it will be possible to drain the node before the container is killed.
E.g. we have one server and two agents. The System is rebooting. Agents are stopped already or in any shutdown state where the kubelet won't react to a drain task anymore, but in the shutdown sequence of the server we're still trying to drain them, which will fail.

Do you know of any way to drain nodes without access to the control-plane?

@arikmaor
Contributor Author

arikmaor commented May 7, 2023

Do you know of any way to drain nodes without access to the control-plane?

Draining is the recommended way.
I found this feature in k8s that tries to deal with this; I'm not sure if it's included/enabled in k3s.

I think that if someone's use case was broken by this PR, we should revert it.
In case of a revert, I'd appreciate a way of loading a custom entrypoint file without a warning during cluster creation 🙏

@eshizhan

@arikmaor @iwilltry42 any update? This causes the Docker log to grow very quickly and take up disk space.
Can we temporarily add 2>/dev/null to avoid this problem for now?

@iwilltry42 iwilltry42 modified the milestones: v5.5.0, v5.7.0, v5.4.6 Apr 3, 2024
@iwilltry42
Member

Moved #205 to v5.7.0 - with this, we can add the option for custom entrypoints and other lifecycle hooks that can execute scripts.
I'd then revert this PR but keep the script around, just disabled by default with an easy way to enable those "built-in" scripts.

@iwilltry42
Member

Ref #1420
