This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Add ability to perform controlplane upgrades: alternative approach #72

Closed
wants to merge 21 commits

Conversation

invidian
Member

Alternative approach to #32. Opening a separate PR to keep comments focused.

Contains commits from #71.

@invidian
Member Author

Current issues:

  • Problem: Currently we upgrade the pod checkpointer together with kube-apiserver. If the pod checkpointer upgrade fails, kube-apiserver is lost and there is no way to recover from it, as the inactive-manifests directory gets cleaned up.

    Potential solution: Update the pod checkpointer separately.

  • Problem: Unfortunately, as mentioned, the pod checkpointer neither prevents nor provides a way to recover when a bad flag is specified for kube-apiserver, as there is no way to distinguish between a "regular" crash that is part of the upgrade process and a real crash caused by, e.g., a bad flag.

    Additionally, the bad flag is also persisted to inactive-manifests, so the recovery procedure is manual and difficult.

    Proposed solution: None so far.

@invidian invidian force-pushed the invidian/controlplane-upgrades-alternative branch from 9586fb1 to e188c9e Compare March 2, 2020 08:20
@invidian
Member Author

invidian commented Mar 2, 2020

I pushed the code that separates the pod checkpointer and kube-apiserver upgrades. However, it's still not working correctly, as the pod checkpointer currently does not have any health checks defined, and it seems that Helm relies on them when doing --atomic upgrades.

Next, we should look into how to avoid the pod checkpointer caching crashing pods.

@invidian
Member Author

invidian commented Mar 3, 2020

> I pushed the code that separates the pod checkpointer and kube-apiserver upgrades. However, it's still not working correctly, as the pod checkpointer currently does not have any health checks defined, and it seems that Helm relies on them when doing --atomic upgrades.

It looks like the pod checkpointer does not implement any kind of health check, so we would have to implement one to be able to provide reliable updates of it. Please note that if the pod checkpointer is not in a healthy state, lokoctl will not be able to restore it by upgrading the Helm release.

And if we try upgrading kube-apiserver without the pod checkpointer running, the cluster will break. Maybe we should look into https://github.com/kubernetes-sigs/bootkube/blob/master/Documentation/disaster-recovery.md and implement some of that functionality.

> Next, we should look into how to avoid the pod checkpointer caching crashing pods.

I think that would again require some modification to the pod checkpointer. Perhaps the pod checkpointer could keep the manifests of pods that are part of a DaemonSet and use them until the new DaemonSet pod is ready. As far as I know, there is no such functionality at the moment.

As a side note, perhaps the pod checkpointer could be modified to use the kubelet's kubeconfig rather than a service account token; with Node restrictions, that would limit it to reading only the secrets of pods actually assigned to that node.

This commit adds a replacement function, chartFromComponent, which glues together the logic between chartFromManifests and the Component interface, leaving chartFromManifests to only handle generating the chart.

chartFromComponent also validates that the generated chart is correct, which makes it safer for the caller.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
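
For illustration only, a minimal Go sketch of the described split; the Component interface, chart type, and the validation shown here are simplified assumptions, not the PR's actual code:

```go
package components

import "fmt"

// Chart is a simplified stand-in for a rendered Helm chart.
type Chart struct {
	Name      string
	Manifests map[string]string
}

// Component is a simplified stand-in for the component interface mentioned above.
type Component interface {
	Name() string
	RenderManifests() (map[string]string, error)
}

// chartFromManifests only handles generating the chart from rendered manifests.
func chartFromManifests(name string, manifests map[string]string) *Chart {
	return &Chart{Name: name, Manifests: manifests}
}

// chartFromComponent glues the Component interface to chartFromManifests and
// validates the generated chart, so callers get either a usable chart or an error.
func chartFromComponent(c Component) (*Chart, error) {
	manifests, err := c.RenderManifests()
	if err != nil {
		return nil, fmt.Errorf("rendering manifests: %w", err)
	}

	chart := chartFromManifests(c.Name(), manifests)

	if chart.Name == "" || len(chart.Manifests) == 0 {
		return nil, fmt.Errorf("generated chart is invalid")
	}

	return chart, nil
}
```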
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This is required for controlplane upgrades.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To make sure we are always up to date.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
As it is now defined in a separate Helm chart. This makes the update process more reliable, as kube-apiserver should be updated before other components.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…orms

As this is now produced by the bootkube module.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To perform an automatic rollback if something fails and to make sure the
cluster is upgraded after lokoctl finishes.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So if some controlplane release got manually uninstalled, it is simply reinstalled
with the values from Helm rather than returning an error to the user.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To save time on fresh installations.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
As currently, due to runc concurrency bugs, the kubelet does not shut down
properly during the upgrade, which breaks the upgrade process.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
As 2GB of RAM is not enough for running a standalone controller node with
the graceful upgrade strategy (two API servers do not fit on a 2GB node).

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To make code more reusable.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
To make the function a bit simpler.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So if the process no longer accepts connections, it will be restarted.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So they can be updated separately.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
For checking the status of DaemonSet objects. This function will be used
when performing upgrades of kube-apiserver and can also be used to test
whether components are running, etc.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
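
As a hedged sketch of what such a check could look like using client-go (the package and function names below are assumptions, not necessarily the code added here):

```go
package daemonset

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Ready reports whether the named DaemonSet has all desired pods scheduled,
// updated, and ready.
func Ready(ctx context.Context, c kubernetes.Interface, namespace, name string) (bool, error) {
	ds, err := c.AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, fmt.Errorf("getting DaemonSet %s/%s: %w", namespace, name, err)
	}

	desired := ds.Status.DesiredNumberScheduled

	return desired > 0 &&
		ds.Status.UpdatedNumberScheduled == desired &&
		ds.Status.NumberReady == desired, nil
}
```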
@invidian invidian force-pushed the invidian/controlplane-upgrades-alternative branch from e188c9e to 14cabb1 Compare March 4, 2020 10:01
invidian added a commit that referenced this pull request Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node where the pod is assigned, it must be removed
before a new one can be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster, and waits until the kube-apiserver pod is scheduled on
the node. Then it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster, and manual intervention is then needed.

See #72 PR for more details.

If it were possible to run more than one instance of kube-apiserver
on a single node, the upgrade process would be easier. If you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.
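
For context, a minimal Go sketch (not from this PR or from kube-apiserver itself) of what creating a listener with SO_REUSEPORT looks like on Linux:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort creates a TCP listener with SO_REUSEPORT set, so a second
// process can bind the same address and port and the kernel load-balances
// incoming connections between them.
func listenReusePort(ctx context.Context, addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(ctx, "tcp", addr)
}

func main() {
	l, err := listenReusePort(context.Background(), "0.0.0.0:6443")
	if err != nil {
		panic(err)
	}
	defer l.Close()
	fmt.Println("listening on", l.Addr())
}
```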

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
bind port to the 'kubernetes' service in the 'default' namespace in the
cluster. Randomizing the port would mean that pods in the cluster bypass
HAProxy and connect to kube-apiserver directly, which would require
opening such random ports on the firewall for the controller nodes, which is
undesirable.

If we randomize the IP address to bind to, we can use the loopback interface,
which by default in Linux has a /8 address assigned, meaning we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising a localhost IP address to the cluster, which obviously
wouldn't work, we use the kube-apiserver --advertise-address flag, which
allows us to override the IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.
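
A sketch of that wiring using Go and the Kubernetes API types (the container name, image, and any flags other than --advertise-address are placeholders, not the chart's actual values):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func apiserverContainer() corev1.Container {
	return corev1.Container{
		Name:  "kube-apiserver",
		Image: "k8s.gcr.io/kube-apiserver", // placeholder image reference
		Command: []string{
			"kube-apiserver",
			// kube-apiserver binds a random loopback address (selected elsewhere),
			// but advertises the node IP, where the HAProxy side-container listens.
			"--advertise-address=$(HOST_IP)",
		},
		Env: []corev1.EnvVar{{
			Name: "HOST_IP",
			// Downward API: pull the node IP from the pod status.
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.hostIP"},
			},
		}},
	}
}

func main() {
	fmt.Printf("%+v\n", apiserverContainer())
}
```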

HAProxy runs in TCP mode to minimize the required configuration
and the possible impact of misconfiguration. In my testing I didn't
experience any breakage caused by the proxy; however, we may need to
pay attention to parameters like session timeouts to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change how we deploy the self-hosted kube-apiserver
from a DaemonSet to a Deployment, which allows scheduling more than one
replica per node.

As running multiple instances on a single node should only be a temporary
state, since a single kube-apiserver scales very well, we add
podAntiAffinity to make sure that the Deployment's replicas are spread
evenly across controller nodes. This also makes sense because each
kube-apiserver instance consumes at least 500MB of RAM, which means that
a controller node with 2GB of RAM might not be enough to run 2
instances for a longer period, so at least 4GB of RAM is
recommended for the controller nodes. This also makes sense from a
stability point of view, as controller node resource usage grows with
the number of workloads.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution because,
with a single controller node, we must still allow multiple
instances on that node to perform graceful updates.

See #90 for more details.
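
As a sketch of the preferred anti-affinity described above (the label selector values and weight are assumptions, not the chart's actual ones):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func apiserverAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			// Preferred, not required: with one controller node the scheduler
			// may still co-locate two replicas, which a graceful update needs.
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 100,
				PodAffinityTerm: corev1.PodAffinityTerm{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"k8s-app": "kube-apiserver"},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			}},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", apiserverAffinity())
}
```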

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0 to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget, which also makes
sure that at least one instance is running. If more replicas are
requested, the PodDisruptionBudget ensures that only one
kube-apiserver can be shut down at a time, to avoid overloading the other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch the kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver listens on a random local IP address and HAProxy
listens on the node IP); in addition, HAProxy also listens on
0.0.0.0:6443 to expose the API on all interfaces (including the public
ones). This is required because you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC; more details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need to listen on a random IP address,
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and the pod
anti-affinities are still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 5, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 5, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 6, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 6, 2020
invidian added a commit that referenced this pull request Mar 6, 2020
invidian added a commit that referenced this pull request Mar 6, 2020
invidian added a commit that referenced this pull request Mar 9, 2020
…lease

invidian added a commit that referenced this pull request Mar 10, 2020
…lease

invidian added a commit that referenced this pull request Mar 10, 2020
…lease

invidian added a commit that referenced this pull request Mar 10, 2020
…lease

invidian added a commit that referenced this pull request Mar 10, 2020
…lease

invidian added a commit that referenced this pull request Mar 10, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported
SO_REUSEPORT. I have opened an upstream issue about that and implemented
a working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there would be no need to run
HAProxy as a side-container, no need to listen on a random IP address and
no need to use multiple ports, which would simplify the whole solution.
However, the change from DaemonSet to Deployment and the pod anti-affinity
would still be needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this pull request Mar 10, 2020
…lease

invidian added a commit that referenced this pull request Mar 11, 2020
…lease

invidian added a commit that referenced this pull request Mar 11, 2020
…lease

invidian added a commit that referenced this pull request Mar 11, 2020
…lease

invidian added a commit that referenced this pull request Mar 11, 2020
…lease

invidian added a commit that referenced this pull request Mar 11, 2020
…lease

invidian added a commit that referenced this pull request Mar 11, 2020
…lease

invidian added a commit that referenced this pull request Mar 11, 2020
…lease

@iaguis
Contributor

iaguis commented Mar 12, 2020

Should we close this?

@invidian
Member Author

Yes, thanks @iaguis.

@invidian invidian closed this Mar 12, 2020
@invidian invidian deleted the invidian/controlplane-upgrades-alternative branch March 12, 2020 10:10