
kubelet does not start even when the agent node is unable to connect to the server node #1686

Closed
carlosrmendes opened this issue Apr 25, 2020 · 18 comments

@carlosrmendes

carlosrmendes commented Apr 25, 2020

I have an agent node with some static pods (I've set the --kubelet-arg=pod-manifest-path=... argument), but sometimes that node goes offline and may get rebooted. The problem is that when the node starts after a reboot while still offline, k3s doesn't start anything (i.e. containerd, kubelet, ...), and therefore the static pods don't start.

Is there any solution or workaround for this use case?
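
For illustration, a minimal sketch of the setup being described (the manifest directory, server address, and pod spec below are hypothetical stand-ins, since the actual path is elided above):

```bash
# Point the kubelet embedded in the k3s agent at a static pod directory.
# The directory, server address, and pod spec are illustrative only.
k3s agent --server https://my-server:6443 --token-file /etc/k3s/token \
  --kubelet-arg=pod-manifest-path=/etc/k3s/static-pods

# Any manifest dropped into that directory is started by the kubelet
# directly, without the API server scheduling it:
mkdir -p /etc/k3s/static-pods
cat <<'EOF' > /etc/k3s/static-pods/local-app.yaml
apiVersion: v1
kind: Pod
metadata:
  name: local-app
spec:
  containers:
  - name: local-app
    image: nginx:alpine
EOF
```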

@carlosrmendes carlosrmendes changed the title kubelet won't start even of the agent node could not connect to the server node kubelet does not start even when the agent node is unable to connect to the server node Apr 27, 2020
@brandond
Contributor

brandond commented Apr 27, 2020

I tried to do the same thing - I was hoping to run etcd as a static pod, but there's a chicken-and-egg problem where k3s won't start until etcd is up, so it won't start the static pod, etc.

@carlosrmendes
Author

carlosrmendes commented Apr 27, 2020

as described in https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/:
"Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them"

Given that, IMO the kubelet in the k3s process should be started before the agent node tries to connect to the server node.

@brandond
Contributor

brandond commented Apr 28, 2020

Yeah, having the agent run without server connectivity would be great. Bonus points if the kubelet on server instances would start up independently of the apiserver, so that we could do things like running etcd or MySQL as a static pod.

@ibuildthecloud
Contributor

The reason this is a bit difficult is that we download the kubelet's config from the api server. Doing this properly requires a bootstrap mode for the kubelet: basically, run with the last (or no) configuration, then download the configuration and restart. The kubelet runs embedded in the same process as the k3s agent, so a restart means we'd have to re-exec ourselves, as the kubelet can't just be restarted in memory. This all gets a bit messy.

@carlosrmendes What static pods are you running that can't be done with a daemonset? I personally haven't found any great use cases for static pods beyond bootstrapping k8s itself.
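
As a rough illustration of that bootstrap-and-reexec flow (not anything k3s implements; the config URL, paths, and the use of a standalone kubelet binary in place of the embedded one are all assumptions):

```bash
#!/bin/sh
# Hypothetical sketch of the bootstrap mode described above.
CONFIG=/var/lib/agent/kubelet-config.yaml

# 1. Start the kubelet with the last cached configuration, so static pods
#    can come up even while the server is unreachable. (A real bootstrap
#    mode would also need to handle the no-config-yet case.)
kubelet --config "$CONFIG" &
KUBELET_PID=$!

# 2. Keep retrying the server; once fresh configuration can be downloaded,
#    swap it in and re-exec this script so the kubelet restarts with it,
#    mirroring the re-exec the embedded kubelet would require.
until curl -ksf --max-time 5 \
      "https://my-server:6443/v1-k3s/config" -o "$CONFIG.new"; do
  sleep 10
done
if ! cmp -s "$CONFIG" "$CONFIG.new"; then
  mv "$CONFIG.new" "$CONFIG"
  kill "$KUBELET_PID"
  exec "$0" "$@"
fi
wait "$KUBELET_PID"
```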

@ibuildthecloud
Contributor

I do think "agent should start without server connectivity" is a reasonable request. If the server goes down and you then restart an agent, it shouldn't be blocked. Supporting static pods on agent-only nodes will be tricky, but supporting them on the server is probably feasible, as this is how rke2 works.

@carlosrmendes
Author

My use case is running specific pods (workloads) on some agent nodes, even when those nodes are offline and disconnected from the master. I don't want DaemonSets, because I want to create/schedule specific pods on specific agent nodes, and besides, to start the pods of a DaemonSet on an agent node that is offline or disconnected from the master, the agent node must have a connection to the api-server (or the node must be seen as Ready by the api-server). With static pods that is not necessary; a running kubelet is all that's needed to start static pods on an offline node.
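
For comparison, the DaemonSet approach being ruled out would look roughly like this (names are illustrative); it does pin the pod to one node, but the kubelet still only starts it once the node can reach the api-server:

```bash
# Illustrative only: pinning a DaemonSet to a single node via a label.
kubectl label node my-edge-node workload=local-app

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: local-app
spec:
  selector:
    matchLabels:
      app: local-app
  template:
    metadata:
      labels:
        app: local-app
    spec:
      nodeSelector:
        workload: local-app
      containers:
      - name: local-app
        image: nginx:alpine
EOF
```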

@tdbs

tdbs commented May 5, 2020

I have to agree, this is an issue. I would expect the "lightweight Kubernetes" to work like Kubernetes, but in this instance it does things in a way that keeps me from using the Kubernetes instructions on static pods.

@stangerm2

stangerm2 commented Jan 6, 2021

Want to +1 this.

K3s's homepage calls it a certified Kubernetes distribution built for IoT & Edge computing, but IoT and edge computing aren't a data center. Edge devices almost exclusively have intermittent connectivity. I want pods to run at the IoT and edge layer even if my thing isn't connected, because unlike a web app, my container/pod is still doing something productive even when it's not online.

Static pods without a control plane are a great way to get managed apps on top of firmware. The kubelet also supports this behavior without issue; it's just really hard to justify putting 200 MB of binaries on an embedded device. K3s would be an amazing solution there if this issue weren't present.

As a device engineer, I'd like to see K3s support offline static manifests so I can use it to run edge apps as pods and manage them as I see fit: via manifest updates when devices are online, and running the last configured manifest when they aren't.

@stale

stale bot commented Jul 30, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 30, 2021
@tdbs

tdbs commented Jul 30, 2021

Ignoring an issue doesn't solve it.

@stale stale bot removed the status/stale label Jul 30, 2021
@cwayne18
Collaborator

Hi @tdbs, the reason for marking things stale is not to ignore them, but rather to give us a better idea of what is still an issue. By commenting here, you've ensured that this issue is no longer marked as stale and remains open. Thank you!

@stale

stale bot commented Jan 26, 2022

(Same automated stale notice as above.)

@stale stale bot added the status/stale label Jan 26, 2022
@sloveridge

sloveridge commented Jan 30, 2022

Would anyone be able to confirm whether this functionality will be added to k3s, or whether I should look to RKE2 for this?

For us, being able to roll out k8s at the edge is essential.

@stale stale bot removed the status/stale label Jan 30, 2022
@brandond
Contributor

brandond commented Feb 2, 2022

We don't currently have any plans to support operating agents without a server. Note that (as far as I know) RKE2 would suffer a similar limitation.

Our usual edge use case involves multiple self-contained clusters, one in each potentially isolatable location, managed by a multi-cluster management product (Rancher/Fleet/etc), as opposed to worker nodes that attempt to operate while detached from the control plane. Kubernetes is not really designed for offline node operation.

@sloveridge

Hi @brandond

The use case we are looking at is similar ("multiple self-contained clusters in each potentially isolatable location"), except that some of the nodes are connected to physical systems or user interfaces. This requires static pods, as the pods providing the UI or the physical-system communication must run on the related node. The functionality I am looking for is just K3s behaving the way k8s is meant to behave with static pods, per the k8s docs:

"Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. Unlike Pods that are managed by the control plane (for example, a Deployment); instead, the kubelet watches each static Pod (and restarts it if it fails)."

Regarding RKE2 support, I am referring to this issue (rancher/rke2#251), which, although it has been pushed back a couple of times, seems to have been accepted as something that will be done. Since you also work on RKE2, do you know if that is likely to happen?

To summarise the use case: in edge deployments it is often required that a node run a specific service, because of physical input occurring at that node or something directly connected to it. Static pods are part of k8s to support this configuration, so the kubelet not starting static pods without communication to the control plane is not an acceptable scenario.

If I am missing something here please let me know.

Thank you for your time.

@brandond
Contributor

brandond commented Feb 3, 2022

The core issue that makes this difficult is that the agent generates the kubelet config using information pulled down from the server. This includes cluster configuration, certificates and keys, apiserver addresses, etc. While this is all written out to disk every startup and could potentially be used on subsequent startups if the server is unavailable, we don't have any logic to do that at the moment. We'd essentially need to start the kubelet using an untrusted existing configuration, and then restart it later once a server becomes available and the configuration has been updated. While this is all theoretically doable, it's not currently something that's prioritized on our product roadmap.

Edit: I just realized I retyped basically the exact same thing that was said at #1686 (comment)
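
Concretely, the fallback being described might look something like this sketch (the /ping check, file names, and use of a standalone kubelet as a stand-in for the embedded one are assumptions; k3s implements none of this today):

```bash
#!/bin/sh
# Hypothetical offline-fallback startup for a k3s agent node.
SERVER="https://my-server:6443"
AGENT_DIR=/var/lib/rancher/k3s/agent   # where the agent persists its state

if curl -ksf --max-time 5 "$SERVER/ping" >/dev/null; then
  # Online: normal path. k3s pulls certificates, keys, and apiserver
  # addresses from the server and writes them under $AGENT_DIR.
  exec k3s agent --server "$SERVER" --token-file /etc/k3s/token
elif [ -f "$AGENT_DIR/kubelet.kubeconfig" ]; then
  # Offline, but a previous run left configuration on disk: start the
  # kubelet from that untrusted existing configuration so static pods
  # come up, and restart it later once a server becomes available.
  exec kubelet --kubeconfig "$AGENT_DIR/kubelet.kubeconfig" \
       --pod-manifest-path /etc/k3s/static-pods
else
  echo "offline with no cached agent configuration; cannot start" >&2
  exit 1
fi
```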

@sloveridge

I guess that is the trade-off of everything running in one process with k3s.

It is helpful to know it is unlikely to come from the core team in the future. I will think through some workarounds at the application layer.

@stale

stale bot commented Aug 3, 2022

(Same automated stale notice as above.)
