make components on control-plane nodes point to the local API server endpoint #2271
Comments
first PR is here: kubernetes/kubernetes#94398 |
we spoke about the kubelet.conf in the office hours today:
i'm going to experiment and see how it goes, but this cannot be backported to older releases as it is a breaking change to phase users. |
This breaks the rules, the |
can you clarify with examples? |
@jdef added a note that some comments were left invalid after the recent change: this should be fixed in master. |
someone else added a comment on kubernetes/kubernetes#94398
this validation should be turned into a warning instead of an error. then components would fail if they don't point to a valid API server, so the user would know. |
you could look at this doc: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#steps-for-the-first-control-plane-node
|
i do know about that doc. are you saying that using "DNS-name:port" is completely broken now for you? what error output are you seeing? |
yes, please. this just bit us when testing a workaround in a pre-1.19.1 cluster whereby we tried manually updating |
yes, if you want to deploy an HA cluster, it is best to set |
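For context, the HA doc linked above configures a stable controlPlaneEndpoint (e.g. a load balancer DNS name) in the kubeadm config. A minimal sketch, assuming illustrative names, a 1.19-era API version, and port 6443:

```shell
# Hypothetical sketch: write a minimal kubeadm ClusterConfiguration that
# points controlPlaneEndpoint at a stable "DNS-name:port" in front of all
# control-plane nodes. The hostname, port, and version are illustrative.
cat > kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.19.2
controlPlaneEndpoint: "lb.example.com:6443"
EOF

# Then the first control-plane node would be initialized with it, e.g.:
# kubeadm init --config kubeadm-config.yaml --upload-certs
```
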
what error are you getting? |
I added some code to print the log; this is the error
|
ok, so you have the same error as the user reporting above. we can fix this for 1.19.2. one workaround is:
|
Both kube-scheduler and kube-controller-manager can use localhost or the load balancer to connect to kube-apiserver, but users cannot be forced to use localhost; a warning can be used instead of an error |
@neolit123 I'm +1 to relax the checks on the address in the existing kubeconfig file. |
@neolit123 here is the example. i just edited it to add a log print.
use method
|
i will send the PR in the next couple of days. |
fix for 1.19.2 is here: |
to further summarize what is happening. after the changes above, kubeadm will no longer error out if the server URL in custom provided kubeconfig files does not match the expected one. it will only show a warning. example:
|
1.19.2 is already out. So this fix will target 1.19.3, yes? |
Indeed, they pushed it out 2 days ago. Should be out with 1.19.3 then.
|
i am not aware if it can still happen in the future. the original trigger for logging the issue was the CSR v1 graduation, which was nearly 3 years ago. the problem in kubeadm and CAPI, where the kubelets on CP nodes talk to the LB, remains, and i could not find a sane solution. |
yeah, sorry, I meant a 1.(n+1) KCM talking to a 1.n API server, not 1.19/1.18 specifically |
KCM and Scheduler always talk to the API server on the same machine, which is of the same version (as far as I remember this decision was a trade-off between HA and user experience for upgrades). Kubelet is the only component going through the load balancer; it is the last open point of this issue |
maybe https://github.com/kubernetes/kubernetes/pull/116570/files#r1179273639 was what I was thinking of, which was due to upgrade order rather than load-balancer communication |
I think this is a good idea to fix this problem. @neolit123 :) |
IMO, it is not an HA design for the kubelet to connect only to the local API server on control-plane nodes. And for bootstrap (I mean
As the kubelet must not be newer than kube-apiserver, we should upgrade all control planes first and then upgrade the kubelets on the control-plane nodes. This is enough for me. |
would it be possible to also do this for generated |
the point of admin.conf is to reach the lb sitting in front of the servers. in case of failure or during upgrade it is best to keep it that way IMO. you could sign a custom kubeconf that talks to localhost:port. |
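On the suggestion above: rather than signing a fresh kubeconfig, a rougher sketch is to reuse admin.conf's credentials and only rewrite the server URL so it talks to the local API server. The paths and port 6443 below are assumptions about a typical kubeadm host (newer kubeadm also has `kubeadm kubeconfig user` for generating kubeconfig files with proper signing):

```shell
# Hedged sketch: copy a kubeconfig and point its server URL at the local
# API server endpoint instead of the load balancer. Paths and the default
# endpoint are assumptions, not kubeadm behaviour.
localize_kubeconfig() {
  local src="$1" dst="$2" endpoint="${3:-https://127.0.0.1:6443}"
  cp "$src" "$dst"
  # Replace the server URL in-place; indentation before "server:" is kept
  # because sed only rewrites the matched portion of the line.
  sed -i "s|server: .*|server: ${endpoint}|" "$dst"
}

# Typical (hypothetical) usage on a control-plane node:
# localize_kubeconfig /etc/kubernetes/admin.conf /etc/kubernetes/admin-local.conf
```
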
To tackle the last point of this issue:
The TL/DR for this change is that we have to adjust the https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/join/kubelet.go#L123 to point to localhost. This change alone does not work, because it would create a chicken-egg problem:
Because of that, this requires a change to kubeadm and its phases to fix the order of some actions so it can succeed. Note: with that said, I think this change cannot simply be merged to the codebase; it may need to be activated over time by graduating it via a feature gate or similar, becoming the new default behaviour after some time, with well-written release notes to make users aware of it.

Proposed solution

To solve the above chicken-egg issue we have to reorder some subphases / add some extra phases to kubeadm. To summarise the change:
* Note: The

The addition to the phases changes:
More information

I have a dirty POC implementation here: kubernetes/kubernetes@master...chrischdi:kubernetes:pr-experiment-kubeadm-kubelet-localhost, which I used for testing the implementation. I also stress-tested this implementation using kinder:
#!/bin/bash
set -o errexit
set -o nounset
set -o pipefail
I=0
while true; do
if [[ $(($I % 10)) -eq 0 ]]; then
echo ""
echo "Starting iteration $I"
fi
echo -n '.'
kinder do kubeadm-init --name kinder-test >stdout.txt 2>stderr.txt
kinder do kubeadm-join --name kinder-test >stdout.txt 2>stderr.txt
kinder do kubeadm-reset --name kinder-test >stdout.txt 2>stderr.txt
I=$((I+1))
done

If this sounds good, I would be happy to help drive this forward. I don't know if this requires a KEP first. Happy to receive some feedback :-) |
thanks for all the work on this @chrischdi we should probably talk more about it in the kubeadm office hours this week.
given an FG was suggested, and given it's a complex change that 1) is breaking for users who anticipate a certain kubeadm phase order and 2) needs tests, i guess we need a KEP. @pacoxu @SataQiu WDYT about this overall? my vote is +1, but i hope we don't break users in ways that cannot be recovered. if we agree on a KEP and a way forward, you can omit the PRR (prod readiness review) as it's a non-target for kubeadm. |
During joining a new control-plane node, in the step of new It sounds doable. For the upgrade process, should we add logic for the kubelet config to point to localhost? |
I think I lack context on what "hairpin mode LB" is :-)
Yes, in the targeted implementation, kubelet starts already, but cannot yet join the cluster (because the referenced kube-apiserver will not get healthy unless etcd is started and joined the cluster). |
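Waiting out that window, where the kubelet is up but the local kube-apiserver is not yet healthy, could be sketched as a simple health poll. The /healthz URL, port, and retry counts below are assumptions for illustration, not kubeadm's actual implementation:

```shell
# Hedged sketch: poll a health endpoint until it responds or we run out of
# retries. In the scenario above this would be the local kube-apiserver,
# which only becomes healthy once etcd has started and joined the cluster.
wait_for_apiserver() {
  local url="$1" retries="${2:-30}"
  local i
  for ((i = 0; i < retries; i++)); do
    # -k: skip TLS verification, -s: silent, -f: fail on server errors
    if curl -ksf "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# e.g. wait_for_apiserver https://127.0.0.1:6443/healthz 60
```
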
i think the CAPZ and the Azure LB were affected: if we agree that this needs a KEP it can cover what problems we are trying to solve. |
Yup, Azure is the most affected here, where traffic outbound to a LB that points back to a host making the request will have the traffic dropped (looks like a hairpin). |
we spoke with @chrischdi about his proposal at the kubeadm meeting today (Wed 31st January 2024 - 9am PT) @SataQiu @pacoxu i proposed that we should have a new feature gate for this. also a KEP, so that we can decide on some of the implementation details. please, LMK if you think a KEP is not needed, or if you have other comments at this stage. |
A KEP would help to track it, and an FG is needed. |
|
we haven't been tracking kubeadm KEPs with release team for a number of releases. https://github.com/kubernetes/sig-release/tree/master/releases/release-1.30 |
in CAPI immutable upgrades we saw a problem where a 1.19 joining node cannot bootstrap, if a 1.19 KCM takes leadership and tries to send a CSR to a 1.18 API server on an existing Node. this happens because in 1.19 the CSR API graduated to v1 and a KCM is supposed to talk to a N or N+1 API server only.
a better explanation here:
https://kubernetes.slack.com/archives/C8TSNPY4T/p1598907959059100?thread_ts=1598899864.038100&cid=C8TSNPY4T
we should make the controller-manager.conf and scheduler.conf that kubeadm generates talk to the local API server and not to the controlPlaneEndpoint (CPE, e.g. LB).
PR for 1.20: kubeadm: make the scheduler and KCM connect to the local API endpoint kubernetes#94398
PR for 1.19: Automated cherry pick of #94398: kubeadm: make the scheduler and KCM connect to local endpoint kubernetes#94442
relax the server URL validation in kubeconfig files:
make components on control-plane nodes point to the local API server endpoint #2271 (comment)
kubeadm: relax the validation of kubeconfig server URLs kubernetes#94816
optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished.
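The "post-fix" idea above, rewriting kubelet.conf after bootstrap has finished, could look roughly like this. The path, the endpoint, and the systemd unit name are assumptions about a typical kubeadm host, and this is a sketch rather than the proposed implementation:

```shell
# Hedged sketch: after bootstrap, point kubelet.conf on a control-plane
# node at the local API server instead of the controlPlaneEndpoint.
repoint_kubelet_conf() {
  local conf="${1:-/etc/kubernetes/kubelet.conf}"
  local endpoint="${2:-https://127.0.0.1:6443}"
  sed -i "s|server: .*|server: ${endpoint}|" "$conf"
}

# On a real node the kubelet would then need a restart, e.g.:
# repoint_kubelet_conf && systemctl restart kubelet
```
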
see sig-cluster-lifecycle: best practices for immutable upgrades kubernetes#80774 for a related discussion
/assign