kube-apiserver: improve reliability of upgrades
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when there is only one controller node.

Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node where the pod is assigned, the old pod must be removed before a new one can be scheduled. This causes a short outage during a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster, then waits until the kube-apiserver pod is scheduled on the node. It then shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running on the node at the same time, the pod checkpointer cannot wait until the updated pod comes up, since it must shut down the temporary one first. This has a bad side effect: if the new pod is misconfigured (e.g. has a non-existent flag specified), kube-apiserver never recovers, which brings down the cluster and requires manual intervention. See PR #72 for more details.

If it were possible to run more than one instance of kube-apiserver on a single node, the upgrade process would be easier. When you try to do that at the moment, the 2nd instance does not run, as the secure port is already bound by the first instance. On Linux, there is a way to let multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create its listening socket with this option.

To mimic SO_REUSEPORT for kube-apiserver, this commit adds a HAProxy instance as a sidecar container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port, and the kernel then distributes traffic between the processes.

As kube-apiserver still runs on the host network, we need to randomize either the IP address or the port it listens on in order to run multiple instances on a single host. Randomizing the bind IP address is easier than randomizing the port: kube-apiserver advertises its own IP address and port to the 'kubernetes' service in the 'default' namespace, so with a random port, pods in the cluster would bypass HAProxy and connect to kube-apiserver directly, which would require opening such random ports in the firewall of the controller nodes, which is undesirable. If we randomize the bind IP address instead, we can use the loopback interface, which on Linux has a /8 prefix assigned by default, so we can pick a random address like 127.155.125.53, bind to it, and it will work. To avoid advertising a localhost address to the cluster, which obviously would not work, we use the --advertise-address kube-apiserver flag, which lets us override the IP address advertised to the cluster and always set it to the address where HAProxy listens, for example via the HOST_IP environment variable pulled from the pod status (the Kubernetes node IP).

HAProxy runs in TCP mode to minimize the required configuration and the possible impact of misconfiguration. In my testing I did not experience any breakage caused by the proxy; however, we may need to pay attention to parameters like session timeouts, to make sure they do not affect the connections.
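The resulting pod layout looks roughly like the sketch below. This is illustrative only, not the exact manifest added by this commit: the object names, image tags, the example loopback address and the haproxy.cfg contents are assumptions.

```yaml
# Sketch of a kube-apiserver pod with a HAProxy sidecar on the host network.
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.18.0      # illustrative tag
    env:
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP                # node IP from the pod status
    command:
    - kube-apiserver
    # Bind to a "random" loopback address so several instances can coexist
    # on one host without fighting over the node IP and port...
    - --bind-address=127.155.125.53
    - --secure-port=6443
    # ...but advertise the node IP, where HAProxy listens, so the
    # 'kubernetes' service in the 'default' namespace points at the proxy.
    - --advertise-address=$(HOST_IP)
  - name: haproxy
    image: haproxy:2.1                            # illustrative tag
    env:
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    volumeMounts:
    - name: haproxy-config
      mountPath: /usr/local/etc/haproxy
  volumes:
  - name: haproxy-config
    configMap:
      name: kube-apiserver-haproxy
---
# Illustrative HAProxy configuration: plain TCP mode, forwarding traffic
# received on the node IP to the loopback address where kube-apiserver
# listens. HAProxy binds with SO_REUSEPORT by default, so multiple sidecar
# instances can share the same node IP and port.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-apiserver-haproxy
  namespace: kube-system
data:
  haproxy.cfg: |
    defaults
      mode tcp
      timeout connect 5s
      timeout client  30m
      timeout server  30m

    frontend kube-apiserver
      bind "${HOST_IP}:6443"
      default_backend kube-apiserver

    backend kube-apiserver
      server local 127.155.125.53:6443
```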
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way the self-hosted kube-apiserver is deployed, from a DaemonSet to a Deployment, to allow running multiple instances on a single node.

As running multiple instances on a single node should only be temporary (a single kube-apiserver scales very well), we add podAntiAffinity to make sure that the Deployment replicas are spread evenly across controller nodes. This also makes sense because each kube-apiserver instance consumes at least 500 MB of RAM, so a controller node with 2 GB of RAM might not have enough to run 2 instances for a longer period; at least 4 GB of RAM is therefore recommended for controller nodes. It also makes sense from a stability point of view, as controller node resource usage grows with the number of workloads. By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node we actually need to allow multiple instances on the same node in order to perform graceful updates. See #90 for more details.

If just one replica of kube-apiserver is requested, we set maxUnavailable: 0 to make sure that there is always at least one instance running. We also add a PodDisruptionBudget, which likewise ensures that at least one instance is running. If more replicas are requested, the PodDisruptionBudget ensures that only one kube-apiserver can be shut down at a time, to avoid overloading the other running instances. A sketch of these constraints follows at the end of this message.

On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch the kube-apiserver in-cluster port to 7443 (on this port kube-apiserver listens on a random local IP address and HAProxy listens on the node IP); in addition, HAProxy also listens on 0.0.0.0:6443 to expose the API on all interfaces (including public ones). This is required, as you cannot have 2 processes with one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with only a private network, where kube-apiserver is accessed via a load balancer (e.g. AWS), the port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
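The scheduling and disruption constraints described above could look roughly like this. It is a sketch only: the labels, replica count, image tag and the policy API version are assumptions and may differ from the actual manifests in this commit.

```yaml
# Sketch of the Deployment scheduling and update constraints.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  replicas: 3                    # by default, one replica per controller node
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below the requested replica count
      maxSurge: 1                # bring the new instance up next to the old one
  selector:
    matchLabels:
      k8s-app: kube-apiserver
  template:
    metadata:
      labels:
        k8s-app: kube-apiserver
    spec:
      affinity:
        podAntiAffinity:
          # 'preferred' rather than 'required', so a single-controller cluster
          # can temporarily run two instances on one node during an update.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  k8s-app: kube-apiserver
      containers:
      - name: kube-apiserver
        image: k8s.gcr.io/kube-apiserver:v1.18.0   # illustrative tag
---
# PodDisruptionBudget: allow at most one instance to be disrupted at a time.
apiVersion: policy/v1beta1       # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-apiserver
```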