
make components on control-plane nodes point to the local API server endpoint #2271

Open
2 of 3 tasks
neolit123 opened this issue Aug 31, 2020 · 54 comments
Assignees
Labels
kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence.
Milestone

Comments

@neolit123
Member

neolit123 commented Aug 31, 2020

in CAPI immutable upgrades we saw a problem where a 1.19 joining node cannot bootstrap if a 1.19 KCM takes leadership and tries to send a CSR to a 1.18 API server on an existing Node. this happens because in 1.19 the CSR API graduated to v1 and a KCM is supposed to talk only to an N or N+1 API server.

a better explanation here:
https://kubernetes.slack.com/archives/C8TSNPY4T/p1598907959059100?thread_ts=1598899864.038100&cid=C8TSNPY4T

/assign

@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Aug 31, 2020
@neolit123 neolit123 added this to the v1.20 milestone Aug 31, 2020
@neolit123 neolit123 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Sep 1, 2020
@neolit123
Member Author

neolit123 commented Sep 1, 2020

first PR is here: kubernetes/kubernetes#94398

@neolit123
Member Author

we spoke about the kubelet.conf in the office hours today:

  • Pointing the kubelet to the local API server should work, but the kubelet-start phase has to happen after the control-plane manifests are written to disk on CP nodes.
  • Requires a phase reorder, and we are considering using a feature gate.
  • This avoids skew problems of a new kubelet trying to bootstrap against an old API server.
  • One less component pointing to the controlPlaneEndpoint (CPE).

i'm going to experiment and see how it goes, but this cannot be backported to older releases as it is a breaking change to phase users.
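for illustration, the change boils down to which server URL ends up in /etc/kubernetes/kubelet.conf on control-plane nodes; the endpoints below are placeholders, not values from this issue:

# current behaviour: the kubelet on a CP node talks to the controlPlaneEndpoint (CPE)
grep 'server:' /etc/kubernetes/kubelet.conf
#     server: https://LOAD_BALANCER_DNS:6443     <- CPE (placeholder)
# proposed behaviour: point it at the local API server endpoint instead
#     server: https://192.168.0.108:6443         <- this node's advertise address (placeholder)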

@zhangguanzhang

This breaks the rules: the controlPlaneEndpoint may be a domain name, and if it is a domain, it will not run OK after your change.

@neolit123
Member Author

This breaks the rules: the controlPlaneEndpoint may be a domain name, and if it is a domain, it will not run OK after your change.

can you clarify with examples?

@neolit123
Member Author

@jdef added a note that some comments were left invalid after the recent change:
https://github.com/kubernetes/kubernetes/pull/94398/files/d9441906c4155173ce1a75421d8fcd1d2f79c471#r486252360

this should be fixed in master.

@neolit123
Member Author

someone else added a comment on kubernetes/kubernetes#94398
but later deleted it:

when using the method CreateJoinControlPlaneKubeConfigFiles with a controlPlaneEndpoint like apiserver.cluster.local to generate the config files, and then running kubeadm init --config=/root/kubeadm-config.yaml --upload-certs -v 5,
the error occurs like:

I0910 15:15:54.436430   52511 kubeconfig.go:84] creating kubeconfig file for controller-manager.conf
currentConfig.Clusters[currentCluster].Server:  https://apiserver.cluster.local:6443 
config.Clusters[expectedCluster].Server:  https://192.168.160.243:6443
a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL

this validation should be turned into a warning instead of an error. then components would fail if they don't point to a valid API server, so the user would know.
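a minimal shell sketch of the relaxed check described above (endpoints are placeholders; this is not the actual kubeadm code):

# compare the server URL in an existing kubeconfig against the expected local endpoint
expected="https://192.168.160.243:6443"   # expected local API server endpoint (placeholder)
actual="$(kubectl --kubeconfig /etc/kubernetes/controller-manager.conf config view \
  -o jsonpath='{.clusters[0].cluster.server}')"
if [ "${actual}" != "${expected}" ]; then
  # old behaviour: hard error and abort; proposed behaviour: warn and continue
  echo "WARNING: controller-manager.conf points to ${actual}, expected ${expected}" >&2
fi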

@zhangguanzhang

This breaks the rules: the controlPlaneEndpoint may be a domain name, and if it is a domain, it will not run OK after your change.

can you clarify with examples?

you can see this doc: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#steps-for-the-first-control-plane-node

--control-plane-endpoint "LOAD_BALANCER_DNS:LOAD_BALANCER_PORT"

@neolit123
Member Author

neolit123 commented Sep 10, 2020

i do know about that doc. are you saying that using "DNS-name:port" is completely broken now for you? what error output are you seeing?
i did test this during my work on the changes and it worked fine.

@jdef

jdef commented Sep 10, 2020

someone else added a comment on kubernetes/kubernetes#94398
but later deleted it:

when using the method CreateJoinControlPlaneKubeConfigFiles with a controlPlaneEndpoint like apiserver.cluster.local to generate the config files, and then running kubeadm init --config=/root/kubeadm-config.yaml --upload-certs -v 5,
the error occurs like:

I0910 15:15:54.436430   52511 kubeconfig.go:84] creating kubeconfig file for controller-manager.conf
currentConfig.Clusters[currentCluster].Server:  https://apiserver.cluster.local:6443 
config.Clusters[expectedCluster].Server:  https://192.168.160.243:6443
a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL

this validation should be turned into a warning instead of an error. then components would fail if they don't point to a valid API server, so the user would know.

yes, please. this just bit us when testing a workaround in a pre-1.19.1 cluster, whereby we tried manually updating clusters[].cluster.server in scheduler.conf and controller-manager.conf to point to localhost instead of the official control-plane endpoint.

@zhangguanzhang

i do know about that doc. are you saying that using "DNS-name:port" is completely broken now for you?

yes; if you want to deploy an HA cluster, it is best to set controlPlaneEndpoint to the LOAD_BALANCER_DNS instead of the load balancer IP

@neolit123
Member Author

yes; if you want to deploy an HA cluster, it is best to set controlPlaneEndpoint to the LOAD_BALANCER_DNS instead of the load balancer IP

what error are you getting?

@zhangguanzhang

zhangguanzhang commented Sep 10, 2020

yes; if you want to deploy an HA cluster, it is best to set controlPlaneEndpoint to the LOAD_BALANCER_DNS instead of the load balancer IP

what error are you getting?

I added some code to print logs; this is the error:

I0910 13:14:53.017570   21006 kubeconfig.go:84] creating kubeconfig file for controller-manager.conf
currentConfig.Clusters https://apiserver.cluster.local:6443 
config.Clusters:  https://192.168.160.243:6443
error execution phase kubeconfig/controller-manager: a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL

@neolit123
Member Author

neolit123 commented Sep 10, 2020

ok, so you are getting the same error as the user reported above.

we can fix this for 1.19.2

one workaround is (a rough shell sketch follows after the list):

  • start kubeadm "init" with kubeconfig files using the local endpoint (instead of control-plane-endpoint)
  • wait for init to finish
  • modify the kubeconfig files again
  • restart the kube-scheduler and kube-controller-manager
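a rough shell sketch of steps 3 and 4; the endpoint, the cluster-name lookup and the sleep are assumptions, adjust them for your setup:

CPE="https://apiserver.cluster.local:6443"   # the endpoint you actually want (placeholder)
for conf in scheduler.conf controller-manager.conf; do
  cluster="$(kubectl --kubeconfig "/etc/kubernetes/${conf}" config view -o jsonpath='{.clusters[0].name}')"
  kubectl --kubeconfig "/etc/kubernetes/${conf}" config set-cluster "${cluster}" --server="${CPE}"
done
# restart the static pods by moving their manifests out and back
mv /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
sleep 20   # give the kubelet time to notice and stop the pods
mv /tmp/kube-scheduler.yaml /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/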

@zhangguanzhang

Both kube-scheduler and kube-controller-manager can connect to the kube-apiserver via localhost or via the load balancer, but users cannot be forced to use localhost; a warning can be used instead of an error.

@fabriziopandini
Member

@neolit123 I'm +1 to relax the checks on the address in the existing kubeconfig file.
We can either remove the check or make it more flexible by checking whether the address is either the CPE or the LAPI (local API endpoint).

@oldthreefeng

oldthreefeng commented Sep 11, 2020

@neolit123 here is the example. i just edited it to add a log print.
https://github.com/neolit123/kubernetes/blob/d9441906c4155173ce1a75421d8fcd1d2f79c471/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L225

fmt.Println("currentConfig.Clusters[currentCluster].Server:", currentConfig.Clusters[currentCluster].Server, "\nconfig.Clusters[expectedCluster].Server: ", config.Clusters[expectedCluster].Server)

using the method CreateJoinControlPlaneKubeConfigFiles with controlPlaneEndpoint to generate the kube-scheduler and kube-controller-manager kubeconfigs; in this situation controlPlaneEndpoint is set to LOAD_BALANCER_DNS:LOAD_BALANCER_PORT (it is best to set LOAD_BALANCER_DNS instead of an IP).
then run kubeadm init with LOAD_BALANCER_DNS:LOAD_BALANCER_PORT. the result is:

./kubeadm  init  --control-plane-endpoint  apiserver.cluster.local:6443
W0911 09:36:17.922135   63517 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[init] Using Kubernetes version: v1.19.1
[preflight] Running pre-flight checks
	[WARNING FileExisting-socat]: socat not found in system path
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Using existing ca certificate authority
[certs] Using existing apiserver certificate and key on disk
[certs] Using existing apiserver-kubelet-client certificate and key on disk
[certs] Using existing front-proxy-ca certificate authority
[certs] Using existing front-proxy-client certificate and key on disk
[certs] Using existing etcd/ca certificate authority
[certs] Using existing etcd/server certificate and key on disk
[certs] Using existing etcd/peer certificate and key on disk
[certs] Using existing etcd/healthcheck-client certificate and key on disk
[certs] Using existing apiserver-etcd-client certificate and key on disk
[certs] Using the existing "sa" key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/admin.conf"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/kubelet.conf"
currentConfig.Clusters[currentCluster].Server: https://apiserver.cluster.local:6443 
config.Clusters[expectedCluster].Server:  https://192.168.160.243:6443
error execution phase kubeconfig/controller-manager: a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL
To see the stack trace of this error execute with --v=5 or higher

@neolit123
Member Author

neolit123 commented Sep 14, 2020

i will send the PR in the next couple of days.
edit: kubernetes/kubernetes#94816

@neolit123
Member Author

fix for 1.19.2 is here:
kubernetes/kubernetes#94890

@neolit123
Member Author

neolit123 commented Sep 18, 2020

to further summarize what is happening: after the changes above, kubeadm will no longer error out if the server URL in custom-provided kubeconfig files does not match the expected one. it will only show a warning.

example (a shell sketch follows after the list):

  • you have something like foo:6443 in scheduler.conf
  • kubeadm expects scheduler.conf to point to e.g. 192.168.0.108:6443 (local api server endpoint)
  • kubeadm will show you a warning when reading the provided kubeconfig file.
  • this allows you to modify the topology of your control-plane components, but you need to make sure the components work after such a customization.
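a small shell sketch of that scenario (the endpoint, cluster name and phase invocation are illustrative, not from this issue):

# point the scheduler kubeconfig at a custom endpoint before (re)running the phase
kubectl --kubeconfig /etc/kubernetes/scheduler.conf config set-cluster kubernetes \
  --server="https://foo:6443"              # custom endpoint from the example above
kubeadm init phase kubeconfig scheduler    # now warns about the URL mismatch instead of failing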

@jdef

jdef commented Sep 18, 2020

fix for 1.19.2 is here:
kubernetes/kubernetes#94890

1.19.2 is already out. So this fix will target 1.19.3, yes?

@neolit123
Member Author

neolit123 commented Sep 18, 2020 via email

@neolit123
Member Author

if a 1.19 KCM takes leadership and tries to send a CSR to a 1.18 API server on an existing Node

can this still happen (a new KCM talk to an old API server)? if so, this is breaking https://kubernetes.io/releases/version-skew-policy/#kube-controller-manager-kube-scheduler-and-cloud-controller-manager constraints and needs fixing

i am not aware whether it can still happen. the original trigger for logging the issue was the CSR v1 graduation, which was nearly 3 years ago.

the problem in kubeadm and CAPI where the kubelets on CP nodes talk to the LB remains, and i could not find a sane solution.

@liggitt
Member

liggitt commented May 16, 2023

yeah, sorry, I meant a 1.(n+1) KCM talking to a 1.n API server, not 1.19/1.18 specifically

@neolit123 neolit123 added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 17, 2023
@fabriziopandini
Member

fabriziopandini commented May 17, 2023

KCM and the scheduler always talk to the API server on the same machine, which is of the same version (as far as I remember, this decision was a trade-off between HA and user experience for upgrades).

The kubelet is the only component going through the load balancer; it is the last open point of this issue.

@liggitt
Member

liggitt commented May 17, 2023

maybe https://github.com/kubernetes/kubernetes/pull/116570/files#r1179273639 was what I was thinking of, which was due to upgrade order rather than load-balancer communication

@oldthreefeng

optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished.

I think this is a good idea to fix this problem. @neolit123 :)

@pacoxu
Member

pacoxu commented May 24, 2023

IMO, there is no HA if the kubelet connects only to the local API server on control-plane nodes. And for bootstrap (I mean /etc/kubernetes/bootstrap-kubelet.conf), the local apiserver is not ready yet.

  • If the local apiserver is restarting or has crashed, the kubelet will fail and the node will then become NotReady.
  • For KCM and the scheduler, that is a different topic; the local connection is stable for them and the node status will not be affected.

As the kubelet must not be newer than the kube-apiserver, we should upgrade all control planes first and then upgrade the kubelets on the control-plane nodes. This is enough for me.
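a rough sketch of that ordering using the standard kubeadm upgrade flow (versions and package commands are placeholders for your distro):

# 1. upgrade the control-plane components on every control-plane node first
kubeadm upgrade apply v1.2X.Y          # on the first control-plane node (placeholder version)
kubeadm upgrade node                   # on each remaining control-plane node
# 2. only afterwards upgrade the kubelet on the control-plane nodes
apt-get install -y kubelet=1.2X.Y-00   # placeholder package/version
systemctl daemon-reload && systemctl restart kubelet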

@pavels

pavels commented Nov 26, 2023

would it be possible to also do this for generated admin.conf on CP?

@neolit123
Member Author

would it be possible to also do this for generated admin.conf on CP?

the point of admin.conf is to reach the LB sitting in front of the API servers. in case of failure or during upgrade it is best to keep it that way, IMO.

you could sign a custom kubeconfig that talks to localhost:port.
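for example, something like this could work on a control-plane node; kubeadm kubeconfig user signs a client cert with the cluster CA, and the client name, cluster name and port below are assumptions:

kubeadm kubeconfig user --client-name local-admin --org system:masters > /root/admin-local.conf
kubectl --kubeconfig /root/admin-local.conf config set-cluster kubernetes \
  --server="https://127.0.0.1:6443"    # local API server instead of the LB (placeholder port)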

@chrischdi
Member

chrischdi commented Jan 23, 2024

To tackle the last point of this issue:

  • optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished.
    see sig-cluster-lifecycle: best practices for immutable upgrades kubernetes#80774 for a related discussion

The TL;DR for this change is that we have to adjust the tlsBootstrapCfg which gets written to disk for the kubelet here:

https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/join/kubelet.go#L123

to point to localhost.
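in kubeconfig terms, the bootstrap kubeconfig written during join would point at the local endpoint rather than the CPE; the values below are placeholders:

grep 'server:' /etc/kubernetes/bootstrap-kubelet.conf
#     server: https://LOAD_BALANCER_DNS:6443     <- today: CPE (placeholder)
#     server: https://localhost:6443             <- proposed: local API server (placeholder)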

This change alone does not work, because it would create a chicken-and-egg problem:

  • The kubelet tries to bootstrap itself while trying to talk to the kube-apiserver, which is started by the kubelet itself as a static pod.
  • The kube-apiserver relies on the etcd pod, which also runs as a static pod.
  • The etcd pod only gets created in a later phase.

Because of that, this requires a change to kubeadm and its phases to fix the order of some actions so it can succeed.

Note: with that said, I think this change cannot simply be merged into the codebase; it may need to be activated over time by graduating it via a feature gate (or similar) so that it becomes the new default behaviour after some time, and it should come with well-written release notes to make users aware of it.

Proposed solution

To solve the above chicken-and-egg issue we have to reorder some subphases / add some extra phases to kubeadm:

[diagram: proposed reordering of the kubeadm join phases]

To summarise the change:

  • Split up the KubeletStart phase into KubeletStart and KubeletWaitBootstrap (orange). The KubeletStart phase ends after starting the kubelet, and the KubeletWaitBootstrap (orange) phase does the rest, which was previously embedded in KubeletStart.
  • Split up the ControlPlaneJoinPhase into ControlPlaneJoinPhase (cyan) and ControlplaneJoinEtcdPhase (green) by extracting the etcd-relevant subphase EtcdLocalSubphase, which then becomes the ControlplaneJoinEtcdPhase.
  • Reorder with the new phases:
    • From: ... -> KubeletStart -> (KubeletWaitBootstrap) -> (ControlplaneJoinEtcdPhase*) -> ControlPlaneJoinPhase -> ...
    • To: ... -> KubeletStart -> ControlplaneJoinEtcdPhase -> KubeletWaitBootstrap -> ControlPlaneJoinPhase -> ...

* Note: The EtcdLocalSubphase was directly at the beginning of ControlPlaneJoinPhase.


I have a dirty POC implementation here: kubernetes/kubernetes@master...chrischdi:kubernetes:pr-experiment-kubeadm-kubelet-localhost which I used for testing the implementation.

I also stress-tested this implementation by using kinder:

  • By creating a custom v1.28.0 kindest/node image (the only change is the updated kubeadm binary)
  • Then creating a cluster via kinder: kinder create cluster --name kinder-test --image kindest/node:v1.28.0-test --control-plane-nodes 3
  • Then running the following script for the stress test:
#!/bin/bash

set -o errexit
set -o nounset
set -o pipefail

I=0
while true; do 
  if [[ $(($I % 10)) -eq 0 ]]; then
    echo ""
    echo "Starting iteration $I"
  fi
  echo -n '.'
  kinder do kubeadm-init --name kinder-test >stdout.txt 2>stderr.txt
  kinder do kubeadm-join --name kinder-test >stdout.txt 2>stderr.txt
  kinder do kubeadm-reset --name kinder-test >stdout.txt 2>stderr.txt
  I=$((I+1))
done

If this sounds good, I would be happy to help drive this forward. I don't know if this requires a KEP first. Happy to receive some feedback :-)

@neolit123
Member Author

thanks for all the work on this @chrischdi
we could target 1.30 for it as it has been a long-standing task; however, it's still not clear to me how exactly users are affected.
it's the hairpin mode LB, correct?

we should probably talk more about it in the kubeadm office hours this week.

If this sounds good, I would be happy to help drive this forward. I don't know if this requires a KEP first. Happy to receive some feedback :-)

given a FG was suggested and given it's a complex change that is 1) breaking for users who anticipate a certain kubeadm phase order and 2) needs tests - i guess we need a KEP.

@pacoxu @SataQiu WDYT about this overall?
we need agreement on it, obviously.

my vote is +1, but i hope we don't break users in ways that cannot be recoverable.

if we agree on a KEP and a way forward you can omit the PRR (prod readiness review) as it's a non-target for kubeadm.
https://github.com/kubernetes/enhancements/blob/master/keps/README.md

@pacoxu
Member

pacoxu commented Jan 29, 2024

During the join of a new control-plane node, in the step of the new EtcdLocalSubphase, is the kubelet running in standalone mode at first?

It sounds doable.

For the upgrade process, should we add logic for the kubelet config to point to localhost?

@chrischdi
Member

it's the hairpin mode LB, correct?

I think I lack context on what "hairpin mode LB" is :-)

During joining a new control-plane node, in the step of new EtcdLocalSubphase, is kubelet running in standalone mode at first?

Yes, in the targeted implementation the kubelet already starts, but it cannot yet join the cluster (because the referenced kube-apiserver will not become healthy until etcd has started and joined the cluster).
During EtcdLocalSubphase we then place the etcd static pod manifest and join etcd to the cluster.
After it has joined, the kube-apiserver becomes healthy and the kubelet bootstraps itself, while kubeadm waits for the bootstrap to complete.

@neolit123
Member Author

neolit123 commented Jan 29, 2024

it's the hairpin mode LB, correct?

I think I lack context on what "hairpin mode LB" is :-)

i think the CAPZ and the Azure LB were affected:
https://github.com/microsoft/Azure-ILB-hairpin

if we agree that this needs a KEP it can cover what problems we are trying to solve.
it's a disruptive change, thus it needs to be warranted.

@randomvariable
Member

Yup, Azure is the most affected here: traffic outbound to an LB that points back to the host making the request will be dropped (it looks like a hairpin).

@neolit123
Member Author

we spoke with @chrischdi about his proposal at the kubeadm meeting today (Wed, 31 January 2024, 9am PT); there are some notes and also a recording.

@SataQiu @pacoxu i proposed that we should have a new feature gate for this. also a KEP, so that we can decide on some of the implementation details. please, LMK if you think a KEP is not needed, or if you have other comments at this stage.

@pacoxu
Member

pacoxu commented Feb 1, 2024

A KEP would help to track it, and a FG (feature gate) is needed.

  • Should this KEP be tracked by release team? I think not.

@neolit123
Member Author

neolit123 commented Feb 1, 2024

A KEP would help to track it, and a FG (feature gate) is needed.

  • Should this KEP be tracked by release team? I think not.

we haven't been tracking kubeadm KEPs with the release team for a number of releases.
they are more useful for SIGs with many KEPs, like node, api and auth. PRR does not apply to kubeadm.
the overall process is also optional for us.

https://github.com/kubernetes/sig-release/tree/master/releases/release-1.30
enhancement freeze we can respect, but it's not 100% mandatory.
code freeze we must respect, IMO, because if we break something it creates noise for the whole of k/k in unit tests, and also in the e2e tests for the release team.

@neolit123 neolit123 added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Feb 8, 2024
@neolit123 neolit123 modified the milestones: v1.30, v1.31 Apr 5, 2024