Add ability to perform controlplane upgrades: alternative approach #72
Conversation
Current issues:
Force-pushed from 9586fb1 to e188c9e
I pushed the code, which separates the pod checkpointer and kube-apiserver upgrades. However, it's still not working correctly, as the pod checkpointer currently does not have any health checks defined, and it seems that Helm relies on them when doing upgrades. Next, we should look at how to avoid the pod checkpointer caching crashing pods.
It looks like the pod checkpointer does not implement any kind of health check, so we would have to implement one to be able to provide reliable updates of it. Please note that if the pod checkpointer is not in a healthy state, … And if we try upgrading …
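For illustration, a health check for the pod checkpointer could look roughly like the probes below. The pod checkpointer exposes no health endpoint today, so the /healthz path, port and image reference are assumptions about functionality that would need to be implemented first, not existing behaviour.

```yaml
# Hypothetical probes on the pod-checkpointer DaemonSet container; the
# endpoint and port are assumed and would have to be added to the
# pod checkpointer itself before this could work.
containers:
- name: pod-checkpointer
  image: pod-checkpointer:latest   # illustrative image reference
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8081
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8081
    periodSeconds: 10
    failureThreshold: 3
```

With probes like these in place, Helm's wait behaviour (and the automatic rollback added later in this PR) would have a readiness signal to act on.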
I think that would again require some modification to the pod checkpointer. Perhaps the pod checkpointer could avoid removing pod manifests which are part of a DaemonSet and keep using them until the new DaemonSet pod is ready. As far as I know, there is no such functionality at the moment. As a side note, perhaps the pod checkpointer could be modified to use the kubelet's …
This commit adds a replacement function, chartFromComponent, which glues together the logic between chartFromManifests and the Component interface, so that chartFromManifests only handles generating the chart. chartFromComponent also validates that the generated chart is correct, which makes it safer for the caller. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This is required for controlplane upgrades. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To make sure we are always up to date. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
As it is now defined in a separate Helm chart. This is to make the update process more reliable, as kube-apiserver should be updated before other components. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…orms As this is now produced by the bootkube module. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To perform an automatic rollback if something fails and to make sure the cluster is upgraded after lokoctl finishes. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So if some controlplane release got manually uninstalled, just reinstall it with the values from Helm rather than giving an error to the user. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To save time on fresh installations. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
As currently, due to runc concurrency bugs, the kubelet does not shut down properly during the upgrade, which breaks the upgrade process. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
As 2GB of RAM is not enough for running a standalone controller node with the graceful upgrade strategy (2 API servers do not fit on a 2GB node). Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To make code more reusable. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To make the function a bit simpler. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So if the process does not accept connections anymore, it will be restarted. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
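As a sketch of what such a restart policy could look like: a liveness probe along the lines below would restart the container once it stops accepting connections. Whether the actual change uses a TCP check or an HTTPS /healthz check, and on which port, is an assumption here.

```yaml
# Hypothetical liveness probe for the kube-apiserver container; the check
# type and port are illustrative, not taken from this PR.
livenessProbe:
  httpGet:
    scheme: HTTPS
    path: /healthz
    port: 6443
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
```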
So they can be updated separately. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
For checking the status of DaemonSet objects. This function will be used when performing upgrades of kube-apiserver and can be used for testing whether components are running, etc. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Force-pushed from e188c9e to 14cabb1
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when there is only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node where the pod is assigned, the pod must be removed before a new one can be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster, and waits until the kube-apiserver pod is scheduled on the node. Then it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster, and then manual intervention is needed. See PR #72 for more details.

If it were possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the second instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port, and traffic between the processes will be equally distributed by the kernel.

As kube-apiserver still runs on the host network, we need to randomize either the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises the IP address and port it binds to via the 'kubernetes' service in the 'default' namespace. With a random port, pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which would require opening such random ports on the firewall for the controller nodes, which is undesirable. If we randomize the IP address to bind to, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, so we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising a localhost IP address to the cluster, which obviously wouldn't work, we use the --advertise-address kube-apiserver flag, which allows us to override the IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status.

HAProxy runs in TCP mode to minimize the required configuration and the possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy; however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections.
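To make the binding scheme more concrete, a hypothetical fragment of the kube-apiserver pod template could look like the sketch below. The image names, the concrete loopback address and the HAProxy wiring are assumptions for illustration only, not the PR's actual templates.

```yaml
# Sketch: kube-apiserver bound to a random loopback address, fronted by a
# HAProxy side-container that listens on the node IP with SO_REUSEPORT.
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.17.4   # illustrative image/version
    command:
    - kube-apiserver
    # Bind to a random address from the loopback /8 so that several
    # instances can coexist on one host during an upgrade.
    - --bind-address=127.155.125.53
    - --secure-port=6443
    # Advertise the address HAProxy listens on (the node IP), not the
    # loopback address, so in-cluster clients reach HAProxy.
    - --advertise-address=$(HOST_IP)
    env:
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
  - name: haproxy
    # HAProxy runs in TCP mode, binds the node IP (and 0.0.0.0:6443 where the
    # API must be public) with SO_REUSEPORT and proxies to the local
    # loopback address of this kube-apiserver instance.
    image: haproxy:2.1-alpine                  # illustrative image/version
    volumeMounts:
    - name: haproxy-config
      mountPath: /usr/local/etc/haproxy
  volumes:
  - name: haproxy-config
    configMap:
      name: kube-apiserver-haproxy             # assumed ConfigMap name
```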
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from a DaemonSet to a Deployment, to allow running multiple instances on a single node.

As running multiple instances on a single node should only be done temporarily (a single kube-apiserver is able to scale very well), we add podAntiAffinity to make sure that the replicas of the Deployment are spread equally across the controller nodes. This also makes sense because each kube-apiserver instance consumes at least 500MB of RAM, so if a controller node has 2GB of RAM, that might not be enough to run 2 instances for a longer period, meaning at least 4GB of RAM is recommended for the controller nodes. This also makes sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details.

If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0 to make sure that there is always at least one instance running. We also add a PodDisruptionBudget, which also makes sure that there is at least one instance running. If more replicas are requested, the PodDisruptionBudget ensures that only one kube-apiserver can be shut down at a time, to avoid overloading the other running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch the kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on a random local IP address and HAProxy will listen on the node IP), and in addition HAProxy will also listen on 0.0.0.0:6443 to expose the API on all interfaces (including the public ones). This is required because you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with a private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), the port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there would be no need to run HAProxy as a side-container, no need to listen on a random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and the pod anti-affinities would still be needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
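The Deployment, podAntiAffinity and PodDisruptionBudget described above could be sketched roughly as follows; names, labels and the replica count are illustrative assumptions rather than values copied from the PR's Helm chart.

```yaml
# Sketch of the scheduling and disruption settings for the self-hosted
# kube-apiserver Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  replicas: 1                # defaults to the number of controller nodes
  selector:
    matchLabels:
      k8s-app: kube-apiserver
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0      # with one replica, never drop to zero instances
  template:
    metadata:
      labels:
        k8s-app: kube-apiserver
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" rather than "required", so a single controller node
          # may temporarily run two instances during a graceful upgrade.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  k8s-app: kube-apiserver
      containers:
      - name: kube-apiserver
        image: k8s.gcr.io/kube-apiserver:v1.17.4   # illustrative image/version
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  minAvailable: 1            # never voluntarily evict the last instance
  selector:
    matchLabels:
      k8s-app: kube-apiserver
```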
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when having only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when having only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when having only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when having only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when having only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when having only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when having only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver, especially when having only one controller node. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
…lease

This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver when having only one controller node. If running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node where the pod is assigned, the pod must be removed before a new one can be scheduled. This causes a short outage during a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster, and waits until a kube-apiserver pod is scheduled on the node. Then it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one first. This has a bad side effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster, and manual intervention is then needed. See PR #72 for more details.

If it were possible to run more than one instance of kube-apiserver on a single node, the upgrade process would be easier. If you try to do that now, the second instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create its listening socket with that option.

To mimic the SO_REUSEPORT option for kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port, and the kernel then distributes traffic equally between the processes.

As kube-apiserver still runs on the host network, we need to randomize either the IP address or the port it listens on in order to run multiple instances on a single host. Randomizing the IP address is easier than randomizing the port, because kube-apiserver advertises the IP address and port it binds on via the 'kubernetes' service in the 'default' namespace. With a random port, pods in the cluster would bypass HAProxy and connect to kube-apiserver directly, which would require opening those random ports in the firewall of the controller nodes, which is undesired. If we randomize the IP address to bind on, we can use the loopback interface, which in Linux has a /8 address assigned by default, so we can pick a random address like 127.155.125.53 and bind to it. To avoid advertising a localhost IP address to the cluster, which obviously wouldn't work, we use the --advertise-address kube-apiserver flag, which lets us override the IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the node information in the pod status.

HAProxy runs in TCP mode to minimize the required configuration and the possible impact of misconfiguration. In my testing, I didn't experience any breakage from using a proxy; however, we may need to pay attention to parameters like session timeouts to make sure they don't affect connections.
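For reference, a TCP-mode HAProxy configuration like the one described could look roughly like the sketch below, assuming it is shipped to the sidecar via a ConfigMap (the actual chart may generate or mount it differently). Addresses, ports, timeouts and object names are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-apiserver-haproxy
  namespace: kube-system
data:
  haproxy.cfg: |
    defaults
      mode tcp                    # plain TCP passthrough; TLS stays end-to-end
      timeout connect 5s
      timeout client  5m          # generous timeouts so long-lived watches are not cut off
      timeout server  5m

    frontend api
      bind 0.0.0.0:6443           # HAProxy binds with SO_REUSEPORT by default on Linux
      default_backend kube-apiserver

    backend kube-apiserver
      server local 127.155.125.53:7443 check   # the instance's random loopback address
```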
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…lease This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver when having only one controller node. If running more than one kube-apiserver replica, the setup does not change. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…lease This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver when having only one controller node. If running more than one kube-apiserver replica, the setup does not change. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…lease This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver when having only one controller node. If running more than one kube-apiserver replica, the setup does not change. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…lease This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver when having only one controller node. If running more than one kube-apiserver replica, the setup does not change. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…lease This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver when having only one controller node. If running more than one kube-apiserver replica, the setup does not change. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…lease This commit attempts to improve the reliability of the upgrade process of the self-hosted kube-apiserver when having only one controller node. If running more than one kube-apiserver replica, the setup does not change. Currently, kube-apiserver runs as a DaemonSet, which means that if there is only one node, where the pod is assigned, it must be removed before a new one will be scheduled. This causes a short outage when doing a rolling update of kube-apiserver. During the outage, the pod checkpointer kicks in and brings up a temporary kube-apiserver as a static pod to recover the cluster and waits until kube-apiserver pod is scheduled on the node. Then, it shuts down the temporary kube-apiserver pod and removes its manifest. As there cannot be 2 instances of kube-apiserver running at the same time on the node, the pod checkpointer is not able to wait until the updated pod starts up, as it must shut down the temporary one. This has a bad side-effect: if the new pod is wrongly configured (e.g. has a non-existent flag specified), kube-apiserver will never recover, which brings down the cluster and then manual intervention is needed. See #72 PR for more details. If it would be possible to run more than one instance of kube-apiserver on a single node, that would make the upgrade process easier. When you try to do that now, the 2nd instance will not run, as the secure port is already bound by the first instance. In Linux, there is a way to have multiple processes bind the same address and port: the SO_REUSEPORT socket option. More details under this link: https://lwn.net/Articles/542629/. Unfortunately, kube-apiserver does not create a listening socket with that option. To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a HAProxy instance as a side-container to kube-apiserver. HAProxy does support SO_REUSEPORT, so multiple instances can bind to the same address and port and then traffic between the processes will be equally distributed by the kernel. As kube-apiserver still runs on the host network, we need to either randomize the IP address or the port it listens on, in order to be able to run multiple instances on a single host. In this case, randomizing the IP address for binding is easier than randomizing the port, as kube-apiserver advertises its own IP address and port where it binds to the 'kubernetes' service in 'default' namespace in the cluster, which means that pods on the cluster would bypass HAProxy and connect to kube-apiserver directly, which requires opening such random ports on the firewall for the controller nodes, which is undesired. If we randomize IP address to bind, we can use the loopback interface, which by default in Linux has a /8 IP address assigned, which means that we can select a random IP address like 127.155.125.53 and bind to it. To avoid advertising localhost IP address to the cluster, which obviously wouldn't work, we use --advertise-address kube-apiserver flag, which allows us to override IP address advertised to the cluster and always set it to the address where HAProxy is listening, for example using the HOST_IP environment variable pulled from the Kubernetes node information in the pod status. HAProxy runs in TCP mode to minimize the required configuration and possible impact of misconfiguration. In my testing, I didn't experience any breakage because of using a proxy, however, we may need to pay attention to parameters like session timeouts, to make sure they don't affect connections. 
Once we are able to run multiple instances of kube-apiserver on a single node, we need to change the way we deploy the self-hosted kube-apiserver from DaemonSet to Deployment to allow running multiple instances on a single node. As running multiple instances on a single node should only be done temporarily, as single kube-apiserver is able to scale very well, we add podAntiAffinity to make sure that replicas of Deployment are equally spread across controller nodes. This also makes sense, as each kube-apiserver instance consumes at least 500MB of RAM, which means that if a controller node has 2GB of RAM, it might be not enough to run 2 instances for a longer period, meaning at least 4GB of RAM are recommended for the controller nodes. This also make sense from a stability point of view, as with many workloads, controller node resource usage will grow. By default, the number of replicas equals the number of controller nodes. For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single controller node, we should actually allow multiple instances on a single node to perform graceful updates. See #90 for more details. If there is just one replica of kube-apiserver requested, we set maxUnavailable: 0, to make sure that there is always at least one instance running. We also add a PodDisruptionBudget to which also makes sure that there is at least one instance running. If there are more replicas requested, then PodDisruptionBudget controls that only one kube-apiserver can be shut down at a time, to avoid overloading other running instances. On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on random local IP address and HAProxy will listen on the Node IP), then in addition HAProxy will also listen on 0.0.0.0:6443 to export the API on all interfaces (including the public ones). This is required as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another on 0.0.0.0:6443. On platforms with private network only, where kube-apiserver is accessed via a load balancer (e.g. AWS), port setup remains the same. The whole setup would be much simpler, if kube-apiserver would support SO_REUSEPORT. I have opened an upstream issue about that and implemented a working PoC. More details here: kubernetes/kubernetes#88785. With SO_REUSEPORT support in kube-apiserver, there is no need to run HAProxy as a side-container, no need for listening on random IP address and no need to use multiple ports, which simplifies the whole solution. However, the change from DaemonSet to Deployment and pod anti affinities are still needed. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Should we close this? |
Yes, thanks @iaguis. |
Alternative approach to #32. Opening separate PR to keep comments focused.
Contains commits from #71.