kubeadm: move the "kubelet-start" phase after "kubeconfig" for "init" #90892
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Welcome @xphoniex!
Hi @xphoniex. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label.
cmd/kubeadm/app/cmd/init.go
Outdated
```diff
@@ -177,6 +177,7 @@ func NewCmdInit(out io.Writer, initOptions *initOptions) *cobra.Command {
 	initRunner.AppendPhase(phases.NewKubeConfigPhase())
 	initRunner.AppendPhase(phases.NewControlPlanePhase())
 	initRunner.AppendPhase(phases.NewEtcdPhase())
+	initRunner.AppendPhase(phases.NewKubeletRestartPhase())
```
why not patch the NewKubeletStartPhase phase to have a managed restart loop (in kubeadm, for something like 2 minutes) in the case of a detected openrc system? would that work?
adding a new phase does not seem needed for the systemd case, so we should avoid it.
also, did you try the change recommended here to see if it fixes the problem?
kubernetes/kubeadm#1986
(moving the kubelet start phase)
why not patch the NewKubeletStartPhase phase to have a managed restart loop (in kubeadm, for something like 2 minutes) in the case of a detected openrc system? would that work?
could work, but kubeadm should not become another init system, as this would cause sync issues and might break something in edge cases.
unless we expect the kubelet to fail randomly in other ways and need restarting to heal those failures, a one-time restart should suffice here.
adding a new phase does not seem needed for the systemd case, so we should avoid it.
as systemd already does the restart itself, I expect a one-time restart not to cause any harm and to actually improve start time by ~a second; also, we can easily turn this phase into a no-op for systemd.
also, did you try the change recommended here to see if it fixes the problem?
kubernetes/kubeadm#1986
(moving the kubelet start phase)
I didn't try it, but it'd have been my proposed solution had I not seen this comment of yours:
this is something that we discussed at some point, but the problem is that
at this point it could be a breaking change to all the users that separated
the init process into phases.
I wanted my solution to change as little as possible, so I didn't modify existing phases like NewKubeletStartPhase, and since moving phases could be a breaking change, I opted for creating a small, separate phase.
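For illustration, here is a minimal sketch of what such a one-time restart phase could look like. It assumes the kubeadm workflow.Phase and initsystem helpers roughly as the other init phases use them; the phase name, the wiring, and the type switch on OpenRCInitSystem are assumptions made for this sketch, not the final implementation.

```go
// Hypothetical sketch only: a one-time "kubelet-restart" phase that is a
// no-op under systemd and restarts the kubelet once under OpenRC, so that
// the service picks up the config files written by the earlier phases.
package phases

import (
	"fmt"

	"k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow"
	"k8s.io/kubernetes/cmd/kubeadm/app/util/initsystem"
)

// NewKubeletRestartPhase creates a workflow phase that restarts the kubelet
// on init systems that do not restart it automatically.
func NewKubeletRestartPhase() workflow.Phase {
	return workflow.Phase{
		Name:  "kubelet-restart",
		Short: "Restart the kubelet once on init systems that do not restart it automatically",
		Run:   runKubeletRestart,
	}
}

func runKubeletRestart(c workflow.RunData) error {
	is, err := initsystem.GetInitSystem()
	if err != nil {
		return fmt.Errorf("cannot detect the init system: %v", err)
	}
	switch is.(type) {
	// OpenRC does not supervise the crashed kubelet, so kick it once now
	// that its configuration is in place.
	case initsystem.OpenRCInitSystem, *initsystem.OpenRCInitSystem:
		return is.ServiceRestart("kubelet")
	default:
		// systemd (and anything else that restarts the kubelet itself)
		// needs no help here.
		return nil
	}
}
```

Whether this belongs as its own parent phase or as a sub-phase of kubelet-finalize is exactly what the rest of this thread debates.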
we also have the NewKubeletFinalizePhase phase and this could be potentially a sub-phase in there called "restart".
the "restart" phase can detect the init system and A) do nothing in the case of systemd B) perform a restart if openrc is detected as the init system.
adding it as a parent phase is far from ideal.
this is something that we discussed at some point, but the problem is that
at this point it could be a breaking change to all the users that separated
the init process into phases.
in any case i think i should bring this up as a discussion topic in next week's meeting and see what the wider group thinks. we might as well just move the phase.
kubeadm gets stuck on NewWaitControlPlanePhase which is at line 181: https://github.com/kubernetes/kubernetes/blob/a713c8f6fb21e3f171510a104b700aef5fd88555/cmd/kubeadm/app/cmd/init.go#L180-L186
NewKubeletFinalizePhase is way further down at line 186.
Let's discuss this in the office hours. In the long term we should try to reconcile systemd, OpenRC, and Windows' Service Manager:
- A service should be installed disabled if it is not installed with a default configuration that is ready to go. The same should apply to the kubelet.
- If a service is enabled, it should restart after it crashes (which is what it currently does constantly on systemd).
- kubeadm (and any other deployment tool) should enable the service and start it after it's configured.
This would require some syncing with SIG Release and SIG Arch, and a huge "action required" note somewhere.
so we had a discussion about this problem in the kubeadm meeting today.
the introduction of a new phase or reordering phases is not exactly desired to solve the openrc case for the time being.
so we discussed this:
why not patch the NewKubeletStartPhase phase to have a managed restart loop (in kubeadm, for something like 2 minutes) in the case of a detected openrc system? would that work?
we sort of agreed during the meeting that this is an OK option, but now that i've started thinking more about the actual implementation, it ends up hacky and abrupt for kubeadm to manage the restart. we may have to os.Exit(1) kubeadm from a goroutine and so on.
[1] so my latest proposal for a short term fix is to just move the kubelet-start phase before the wait-control-plane phase, and here is why it will not be that breaking:
- if users are using systemd, due to systemd supporting restarts, the location of the kubelet-start phase is not important as long as they are calling it before wait-control-plane.
- systemd users can optionally adjust their phase order as long as we have an action-required release note.
- this has been bugging systemd and windows people too, as the crashlooping of kubeadm's managed kubelet is not really needed.
this solves @rosti's comment above:
kubeadm (and any other deployment tool) should enable the service and start it after it's configured.
also solves this openrc issue kubernetes/kubeadm#1986.
long term
- openrc should support automatic restarts? i don't know what the state of this problem is there... but an init system not supporting programmable restarts is far from ideal.
- apply the changes @rosti suggests above to k8s packages.
alternative options (if we end up rejecting [1])
- as proposed by @rosti, Alpine users can try wrapping the kubelet in a shim that manages its restarts, given openrc cannot do that.
- Alpine users can break the kubeadm init process in phases and reorder the kubelet-start.
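To make proposal [1] above concrete, the phase wiring in cmd/kubeadm/app/cmd/init.go would change roughly as sketched below. This is only an illustration: the exact hunk was settled in this PR's later commits, and the assumption that kubelet-start previously sat right after preflight reflects the init.go of that time.

```diff
 initRunner.AppendPhase(phases.NewPreflightPhase())
-initRunner.AppendPhase(phases.NewKubeletStartPhase())
 initRunner.AppendPhase(phases.NewCertsPhase())
 initRunner.AppendPhase(phases.NewKubeConfigPhase())
+initRunner.AppendPhase(phases.NewKubeletStartPhase())
 initRunner.AppendPhase(phases.NewControlPlanePhase())
 initRunner.AppendPhase(phases.NewEtcdPhase())
 initRunner.AppendPhase(phases.NewWaitControlPlanePhase())
```

With this ordering the kubelet is only started once its config files exist, so a one-shot init system such as OpenRC never sees it crashloop, while systemd users are unaffected as long as kubelet-start still runs before wait-control-plane.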
we sort of agreed during the meeting that this is an OK option, but now that i've started thinking more about the actual implementation, it ends up hacky and abrupt for kubeadm to manage the restart. we may have to os.Exit(1) kubeadm from a goroutine and so on.
also, there are places where kubeadm stops the kubelet service, e.g. to write config files. it wouldn't break anything at the moment, but if there's ever a need to stop the kubelet service in the future, someone is going to be very confused about why their code isn't working, since the goroutine keeps restarting it.
[1] so my latest proposal for a short term fix is to just move the kubelet-start phase before the wait-control-plane phase
sounds good to me!
I thought the issue was some users running kubeadm phase by phase, which is why we wouldn't want to reorder it. otherwise yeah, this should not break anything, but if a user was running kubeadm phase by phase then they would need to modify their scripts.
openrc should support automatic restarts? i don't know what is the state of this problem there...but an init system not-supporting programmable restarts is far from ideal.
openrc has something called supervise-daemon, which is supposed to handle restarts for us, but it's experimental and basically misbehaves on my VM and some older alpine releases.
we need a simple fix for now to get this working; in the long run, supervise-daemon or whatever else the gentoo/alpine community comes up with would be the answer.
alternative options (if we end up rejecting [1])
see above ^
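For reference, here is a rough sketch of the managed-restart-loop option that was considered and dropped. The package, helper name, polling interval, and bounded window are made up for illustration; the comments mark the conflict described above.

```go
// Illustrative sketch only, not part of this PR: an in-kubeadm restart loop
// that keeps the kubelet alive for a bounded window on init systems that
// cannot do it themselves.
package kubeletsupervise

import (
	"time"

	"k8s.io/kubernetes/cmd/kubeadm/app/util/initsystem"
)

// superviseKubelet polls the kubelet service and restarts it if it is down,
// until the window expires.
func superviseKubelet(is initsystem.InitSystem, window, interval time.Duration) {
	go func() {
		deadline := time.Now().Add(window)
		for time.Now().Before(deadline) {
			if !is.ServiceIsActive("kubelet") {
				// This is the problem discussed above: the loop cannot tell
				// a crash apart from kubeadm intentionally stopping the
				// kubelet (e.g. while rewriting its config files), so it
				// would fight with such code paths and restart it anyway.
				_ = is.ServiceStart("kubelet")
			}
			time.Sleep(interval)
		}
	}()
}
```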
I am glad that we have an action plan here for the time being. Let's do what @neolit123 proposed and put an "Action Required" release note on this PR.
done, please check @neolit123 .
/priority backlog
Thanks for tackling this @xphoniex !
please change the box under "Does this PR introduce a user-facing change?"
/ok-to-test
/remove-priority backlog
@kubernetes/sig-cluster-lifecycle-pr-reviews
@neolit123 is this even related to kubeadm?
/retitle kubeadm: move the "kubelet-start" phase after "kubeconfig" for "init"
some test jobs are flaky.
/retest
This is really not fun, pull-kubernetes-verify ran for 2h0m22s only to fail due to a seemingly unrelated issue: Can someone help verify whether it's related to my commit or not? Because if not, the test jobs are not just flaky, they are broken. EDIT: I just checked, all PRs on the first page have
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: neolit123, xphoniex
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
will LGTM myself if nobody has comments by EOD today.
Thanks @xphoniex !
/lgtm
/retest
Review the full test history for this PR.
What type of PR is this?
/kind feature
What this PR does / why we need it:
running kubeadm init gets stuck on alpine linux because it uses a different init system that lacks a re-try mechanism, and for some reason kubeadm tries to start the kubelet service before the required conf files are copied, thus leaving the service in a crashed state.
Which issue(s) this PR fixes:
Fixes kubernetes/kubeadm#1986
Special notes for your reviewer:
We could run the kubelet-restart phase only in case we detect a non-systemd init system such as openrc, but this would barely make any difference in performance.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: