This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Ensure that controlplane pods are spread equally across controller nodes #90

Closed
invidian opened this issue Mar 4, 2020 · 4 comments · Fixed by #1193
Labels
area/kubernetes (Core Kubernetes stuff), kind/enhancement (New feature or request), size/s (Issues which likely require up to a couple of work hours)

Comments

invidian (Member) commented Mar 4, 2020

kube-controller-manager and kube-scheduler use a Deployment object with preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity, which means that during scheduling, pods will be spread equally across nodes. However, if only one controller node is available during bootstrapping (which does happen, as nodes join in no particular order), all pods get scheduled on that single node. If this node fails, the entire controlplane goes down, which gives the user a false sense of controlplane redundancy.

We should ensure that those pods are always scheduled equally across nodes to maximize redundancy.

One way of achieving that would be to use requiredDuringSchedulingIgnoredDuringExecution instead of preferredDuringSchedulingIgnoredDuringExecution. However, that has a negative effect when there is just a single controlplane node: no more than one pod can then be scheduled on that node, and being able to run more than one pod per node is recommended for doing upgrades, so we would have to apply it conditionally.
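
For illustration, a minimal sketch of what such a hard anti-affinity rule could look like in the kube-controller-manager Deployment pod template (the k8s-app label is an assumption, not necessarily the label used by the actual manifests):

```yaml
# Hypothetical fragment of the kube-controller-manager pod template.
affinity:
  podAntiAffinity:
    # Hard rule: never co-locate two replicas on the same node.
    # With a single controller node, extra replicas would stay Pending,
    # which is why this cannot be applied unconditionally.
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          k8s-app: kube-controller-manager   # assumed label
      topologyKey: kubernetes.io/hostname
```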

The alternative would be to use the descheduler, which would periodically ensure that pods are spread.
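
For reference, a descheduler policy could look roughly like the sketch below; this assumes the kubernetes-sigs/descheduler v1alpha1 policy format and is only an illustration, not something shipped by this repository:

```yaml
# Sketch of a descheduler policy (v1alpha1 format assumed). The descheduler
# would run periodically (e.g. as a CronJob) and evict pods so that the
# default scheduler can spread them across nodes again.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
```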

Currently we use a DaemonSet for kube-apiserver, which makes sure that its pods are spread equally across controller nodes, but that might be changed to a Deployment as well if #32 gets merged.

surajssd (Member) commented Mar 4, 2020

A simple podAntiAffinity will help as well to broaden this spread.

invidian added a commit that referenced this issue Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when there is only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if there is
only one node where the pod is assigned, it must be removed before a new one
can be scheduled. This causes a short outage during a rolling update of
kube-apiserver. During the outage, the pod checkpointer kicks in and brings
up a temporary kube-apiserver as a static pod to recover the cluster, and
waits until a kube-apiserver pod is scheduled on the node. Then it shuts down
the temporary kube-apiserver pod and removes its manifest. As there cannot be
2 instances of kube-apiserver running at the same time on the node, the pod
checkpointer is not able to wait until the updated pod starts up, as it must
shut down the temporary one. This has a bad side effect: if the new pod is
wrongly configured (e.g. has a non-existent flag specified), kube-apiserver
will never recover, which brings down the cluster, and manual intervention is
needed.

See #72 PR for more details.

If it were possible to run more than one instance of kube-apiserver on a
single node, the upgrade process would be easier. When you try to do that at
the moment, the 2nd instance will not run, as the secure port is already
bound by the first instance.

In Linux, there is a way to have multiple processes bind the same address and
port: the SO_REUSEPORT socket option. More details:
https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create its listening socket with that
option.

To mimic the SO_REUSEPORT option for kube-apiserver, this commit adds an
HAProxy instance as a side-container to kube-apiserver. HAProxy does support
SO_REUSEPORT, so multiple instances can bind to the same address and port,
and the kernel then distributes traffic equally between the processes.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port: kube-apiserver advertises its own IP address and port
to the 'kubernetes' service in the 'default' namespace in the cluster, so
with a random port, pods on the cluster would bypass HAProxy and connect to
kube-apiserver directly, which would require opening such random ports on the
firewall for the controller nodes, which is undesired.

If we randomize the IP address to bind to, we can use the loopback interface,
which in Linux has a /8 network assigned by default, so we can select a
random address like 127.155.125.53 and bind to it.

To avoid advertising a localhost IP address to the cluster, which obviously
wouldn't work, we use the --advertise-address kube-apiserver flag, which
allows us to override the IP address advertised to the cluster and always set
it to the address where HAProxy is listening, for example using the HOST_IP
environment variable pulled from the Kubernetes node information in the pod
status.
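
As a rough sketch (field values and surrounding flags are illustrative, not the actual manifest), the downward API part could look like this:

```yaml
# Illustrative fragment of the kube-apiserver container spec;
# all other flags and fields are omitted.
containers:
- name: kube-apiserver
  env:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP        # the node IP, where HAProxy listens
  command:
  - kube-apiserver
  # Advertise HAProxy's address instead of the random loopback address
  # that kube-apiserver actually binds to.
  - --advertise-address=$(HOST_IP)
```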

HAProxy runs in TCP mode to minimize the required configuration and the
possible impact of misconfiguration. In my testing, I didn't experience any
breakage because of the proxy; however, we may need to pay attention to
parameters like session timeouts, to make sure they don't affect the
connections.
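
As an illustration only (ports, names and the backend address are assumptions rather than the actual manifests), the sidecar configuration could be shipped as a ConfigMap along these lines:

```yaml
# Hypothetical ConfigMap carrying the HAProxy sidecar configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-apiserver-haproxy
  namespace: kube-system
data:
  haproxy.cfg: |
    defaults
      mode tcp                  # plain TCP passthrough, TLS stays end-to-end
      timeout connect 5s
      timeout client  5m        # watch connections are long-lived
      timeout server  5m
    frontend kube-apiserver
      bind 0.0.0.0:6443         # HAProxy owns the well-known port
      default_backend local-apiserver
    backend local-apiserver
      # Stands in for the randomized loopback address described above;
      # the backend port is illustrative.
      server apiserver 127.155.125.53:7443
```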

Once we are able to run multiple instances of kube-apiserver on a single
node, we change the way the self-hosted kube-apiserver is deployed from a
DaemonSet to a Deployment, which allows multiple instances to run on a single
node.

As running multiple instances on a single node should only be done
temporarily (a single kube-apiserver is able to scale very well), we add
podAntiAffinity to make sure that replicas of the Deployment are spread
equally across controller nodes. This also makes sense because each
kube-apiserver instance consumes at least 500MB of RAM: a controller node
with 2GB of RAM might not be enough to run 2 instances for a longer period,
so at least 4GB of RAM is recommended for the controller nodes. It also makes
sense from a stability point of view, as controller node resource usage will
grow with many workloads.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is used
instead of requiredDuringSchedulingIgnoredDuringExecution, as with a single
controller node we should still allow multiple instances to run on one node
in order to perform graceful updates.
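
A sketch of the resulting soft anti-affinity in the kube-apiserver Deployment
pod template (label names are assumptions):

```yaml
# Illustrative pod template fragment for the kube-apiserver Deployment.
affinity:
  podAntiAffinity:
    # Soft rule: prefer spreading replicas across controller nodes,
    # but still allow co-location, e.g. during upgrades on a
    # single-controller cluster.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            k8s-app: kube-apiserver   # assumed label
        topologyKey: kubernetes.io/hostname
```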

See #90 for more details.

If just one replica of kube-apiserver is requested, we set maxUnavailable: 0
to make sure that there is always at least one instance running. We also add
a PodDisruptionBudget, which also makes sure that at least one instance is
running. If more replicas are requested, the PodDisruptionBudget ensures that
only one kube-apiserver can be shut down at a time, to avoid overloading the
other running instances.
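
Roughly, and assuming names and labels that may differ from the actual
manifests, this combination could be expressed as:

```yaml
# Fragment of the kube-apiserver Deployment spec for the single-replica case:
# never take the only instance down before the new one is ready.
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
---
# Illustrative PodDisruptionBudget; with more replicas it limits voluntary
# disruptions to one kube-apiserver at a time. Whether the real manifest uses
# maxUnavailable or minAvailable is an assumption here.
apiVersion: policy/v1beta1          # PDB API version current at the time
kind: PodDisruptionBudget
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-apiserver       # assumed label
```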

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch the kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver listens on a random local IP address and HAProxy listens
on the node IP), and in addition HAProxy also listens on 0.0.0.0:6443 to
expose the API on all interfaces (including public ones). This is required,
as you cannot have 2 processes, one listening on 127.0.0.1:6443 and another
on 0.0.0.0:6443.

On platforms with only a private network, where kube-apiserver is accessed
via a load balancer (e.g. AWS), the port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there would be no need to run
HAProxy as a side-container, no need to listen on a random IP address and no
need to use multiple ports, which would simplify the whole solution. However,
the change from DaemonSet to Deployment and the pod anti-affinities would
still be needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 4, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 5, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 5, 2020
This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver, especially when having only one controller
node.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0 to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget, which likewise
guarantees at least one running instance. If more replicas are
requested, the PodDisruptionBudget ensures that only one kube-apiserver
pod can be shut down at a time, to avoid overloading the remaining
instances.
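
A rough sketch of both pieces with the k8s.io/api Go types; the exact
values (maxSurge, the PDB's maxUnavailable, labels, namespace) are
assumptions for illustration, not necessarily what this commit uses:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// deploymentStrategy returns the rolling update settings: with a single
// replica, maxUnavailable is forced to 0, so the new pod has to come up
// next to the old one before the old one is removed.
func deploymentStrategy(replicas int32) appsv1.DeploymentStrategy {
	maxUnavailable := intstr.FromInt(1)
	if replicas == 1 {
		maxUnavailable = intstr.FromInt(0)
	}

	maxSurge := intstr.FromInt(1) // assumed value

	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxUnavailable: &maxUnavailable,
			MaxSurge:       &maxSurge,
		},
	}
}

// apiserverPDB returns a PodDisruptionBudget allowing at most one
// kube-apiserver pod to be voluntarily disrupted at a time.
func apiserverPDB() *policyv1beta1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)

	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "kube-apiserver",
			Namespace: "kube-system", // assumed namespace
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				// Assumed label on the kube-apiserver pods.
				MatchLabels: map[string]string{"k8s-app": "kube-apiserver"},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n%+v\n", deploymentStrategy(1), apiserverPDB())
}
```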

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch the kube-apiserver in-cluster port to 7443 (on
this port, kube-apiserver listens on the random local IP address and
HAProxy listens on the node IP); in addition, HAProxy also listens on
0.0.0.0:6443 to expose the API on all interfaces (including the public
ones). This is required because you cannot have 2 processes, one
listening on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported
SO_REUSEPORT. I have opened an upstream issue about that and implemented
a working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there would be no need to
run HAProxy as a side-container, no need to listen on a random IP
address and no need to use multiple ports, which would simplify the
whole solution. However, the change from DaemonSet to Deployment and the
pod anti-affinities would still be needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 6, 2020
invidian added a commit that referenced this issue Mar 6, 2020
invidian added a commit that referenced this issue Mar 6, 2020
invidian added a commit that referenced this issue Mar 6, 2020
invidian added a commit that referenced this issue Mar 9, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

invidian added a commit that referenced this issue Mar 10, 2020
invidian added a commit that referenced this issue Mar 10, 2020
invidian added a commit that referenced this issue Mar 10, 2020
invidian added a commit that referenced this issue Mar 10, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

Running multiple instances on a single node should only be a temporary
state, since a single kube-apiserver scales very well, so we add
podAntiAffinity to make sure that the Deployment's replicas are spread
equally across controller nodes. This also makes sense for resource
reasons: each kube-apiserver instance consumes at least 500 MB of RAM, so
a controller node with 2 GB of RAM might not be able to run two instances
for a longer period, meaning at least 4 GB of RAM is recommended for
controller nodes. It also makes sense from a stability point of view, as
controller node resource usage grows with the number of workloads.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, because
with a single controller node we must still allow multiple instances on
one node to perform graceful updates.

See #90 for more details.
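
A sketch of the corresponding affinity block on the Deployment's pod
template (the pod label is an assumption for illustration only):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            k8s-app: kube-apiserver          # assumed pod label
        topologyKey: kubernetes.io/hostname  # prefer spreading across nodes, but don't require it
```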

If just one replica of kube-apiserver is requested, we set
maxUnavailable: 0 to make sure that there is always at least one instance
running. We also add a PodDisruptionBudget, which likewise ensures that
at least one instance keeps running. If more replicas are requested, the
PodDisruptionBudget ensures that only one kube-apiserver can be shut down
at a time, to avoid overloading the remaining instances.
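
Sketched out, the single-replica rollout settings and the disruption
budget could look roughly like this (the policy/v1beta1 API version
reflects what was current at the time; names and labels are assumptions):

```yaml
# Deployment .spec fragment for the single-replica case:
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take the only instance down before the new one is Ready
      maxSurge: 1         # the extra pod can start next to the old one thanks to the randomized bind address
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  minAvailable: 1               # keep at least one instance during voluntary disruptions;
                                # with more replicas, maxUnavailable: 1 would express
                                # "only one kube-apiserver down at a time" instead
  selector:
    matchLabels:
      k8s-app: kube-apiserver   # assumed pod label
```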

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch the kube-apiserver in-cluster port to 7443 (on
this port, kube-apiserver listens on the random local IP address and
HAProxy listens on the node IP). In addition, HAProxy also listens on
0.0.0.0:6443 to expose the API on all interfaces (including the public
ones). This is required because you cannot have two processes, one
listening on 127.0.0.1:6443 and another on 0.0.0.0:6443.
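
For illustration, this Packet-style port layout could translate into
HAProxy frontends along these lines (defaults and timeouts as in the
earlier sketch; the node IP and loopback address are placeholders, and
the exact wiring may differ):

```yaml
haproxy.cfg: |
  frontend in-cluster
    bind 10.0.0.2:7443                # node IP (placeholder); port advertised to the 'kubernetes' Service
    default_backend kube-apiserver
  frontend public
    bind 0.0.0.0:6443                 # all interfaces, including the public ones
    default_backend kube-apiserver
  backend kube-apiserver
    server local 127.155.125.53:7443  # kube-apiserver moves to 7443 on its randomized loopback address
```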

On platforms with a private network only, where kube-apiserver is
accessed via a load balancer (e.g. AWS), the port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported
SO_REUSEPORT. I have opened an upstream issue about that and implemented
a working PoC. More details: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there would be no need to
run HAProxy as a sidecar container, no need to listen on a random IP
address, and no need to use multiple ports, which would simplify the
whole solution. However, the change from DaemonSet to Deployment and the
pod anti-affinities would still be needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 10, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 10, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 11, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 11, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 11, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 11, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 11, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 11, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, as single kube-apiserver is able to scale very well, we add
podAntiAffinity to make sure that replicas of Deployment are equally
spread across controller nodes. This also makes sense, as each
kube-apiserver instance consumes at least 500MB of RAM, which means that
if a controller node has 2GB of RAM, it might be not enough to run 2
instances for a longer period, meaning at least 4GB of RAM are
recommended for the controller nodes. This also make sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0, to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget to which also makes
sure that there is at least one instance running. If there are more
replicas requested, then PodDisruptionBudget controls that only one
kube-apiserver can be shut down at a time, to avoid overloading other
running instances.

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to export the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler, if kube-apiserver would support
SO_REUSEPORT. I have opened an upstream issue about that and implemented a
working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need for listening on random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and pod anti affinities are
still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Mar 11, 2020
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node, where the pod is assigned, it must be removed
before a new one will be scheduled. This causes a short outage when doing a
rolling update of kube-apiserver. During the outage, the pod checkpointer
kicks in and brings up a temporary kube-apiserver as a static pod to
recover the cluster and waits until kube-apiserver pod is scheduled on
the node. Then, it shuts down the temporary kube-apiserver pod and removes
its manifest. As there cannot be 2 instances of kube-apiserver running
at the same time on the node, the pod checkpointer is not able to wait until
the updated pod starts up, as it must shut down the temporary one. This has a
bad side-effect: if the new pod is wrongly configured (e.g. has a
non-existent flag specified), kube-apiserver will never recover, which
brings down the cluster and then manual intervention is needed.

See #72 PR for more details.

If it would be possible to run more than one instance of kube-apiserver
on a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address for binding is easier than
randomizing the port, as kube-apiserver advertises its own IP address and
port where it binds to the 'kubernetes' service in 'default' namespace
in the cluster, which means that pods on the cluster would bypass
HAProxy and connect to kube-apiserver directly, which requires opening
such random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising localhost IP address to the cluster, which obviously
wouldn't work, we use --advertise-address kube-apiserver flag, which
allows us to override IP address advertised to the cluster and always
set it to the address where HAProxy is listening, for example using
the HOST_IP environment variable pulled from the Kubernetes node information
in the pod status.

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage caused by the proxy; however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

Running multiple instances on a single node should only be a temporary
state, as a single kube-apiserver scales very well, so we add
podAntiAffinity to make sure that the replicas of the Deployment are
spread equally across the controller nodes. This also makes sense
resource-wise, as each kube-apiserver instance consumes at least 500MB of
RAM, which means that a controller node with 2GB of RAM might not be
enough to run 2 instances for a longer period, so at least 4GB of RAM is
recommended for the controller nodes. It also makes sense from a
stability point of view, as controller node resource usage grows with the
number of workloads.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.

See #90 for more details.
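
For reference, a soft anti-affinity of this kind could look roughly as
follows; the pod label used in the selector is an assumption, not
necessarily what the chart uses:

```yaml
# Sketch: prefer spreading kube-apiserver replicas across nodes, while
# still allowing co-location on a single-controller cluster.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname    # at most one replica per node, if possible
        labelSelector:
          matchLabels:
            k8s-app: kube-apiserver            # hypothetical pod label
```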

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0 to make sure that there is always at least one instance
running. We also add a PodDisruptionBudget, which likewise makes sure
that at least one instance is running. If more replicas are requested,
the PodDisruptionBudget ensures that only one kube-apiserver can be shut
down at a time, to avoid overloading the remaining instances.
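
A minimal sketch of these two safeguards (the API version and the pod
label are assumptions based on what was common at the time, not the
chart's exact manifests):

```yaml
# Fragment of the Deployment spec: with a single replica, never take the
# running instance down before its replacement is ready.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
---
# PodDisruptionBudget for the multi-replica case: allow at most one
# kube-apiserver to be disrupted at a time.
apiVersion: policy/v1beta1                     # policy/v1 in current Kubernetes versions
kind: PodDisruptionBudget
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-apiserver                  # hypothetical pod label
```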

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch the kube-apiserver in-cluster port to 7443 (on
this port, kube-apiserver listens on the random local IP address and
HAProxy listens on the node IP); in addition, HAProxy also listens on
0.0.0.0:6443 to expose the API on all interfaces (including the public
ones). This is required because you cannot have 2 processes, one
listening on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported
SO_REUSEPORT. I have opened an upstream issue about that and implemented
a working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need to listen on a random IP address and
no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and the pod
anti-affinities are still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@ipochi ipochi added the proposed/next-sprint Issues proposed for next sprint label Oct 29, 2020
@invidian
Member Author

invidian commented Nov 2, 2020

#1030 changed the controlplane components to be deployed as DaemonSets in HA setups, so this is ensured. However, we may need to roll it back because of #1097.

@surajssd
Member

surajssd commented Nov 3, 2020

I think #1097 points to code that should have been cleaned up in #1030 itself. Why should the fix be a revert rather than fixing this?

control_plane_replicas = max(2, length(var.etcd_servers))

@invidian
Member Author

invidian commented Nov 3, 2020

I think #1097 points to code that should have been cleaned up in #1030 itself. Why should the fix be a revert rather than fixing this?

If I understand you correctly, then yes, fixing this Terraform code should probably be sufficient to fix #1097. Then this issue can be closed as well, I think.

@invidian invidian added area/kubernetes Core Kubernetes stuff kind/enhancement New feature or request size/s Issues which likely require up to a couple of work hours labels Nov 3, 2020
@iaguis iaguis removed the proposed/next-sprint Issues proposed for next sprint label Nov 4, 2020
invidian added a commit that referenced this issue Nov 18, 2020
This commit fixes the 'control_plane_replicas' value passed to the Kubernetes
Helm chart, which caused kube-scheduler and kube-controller-manager to
run as a DaemonSet on single controlplane node clusters, breaking the
ability to update them gracefully.

It also adds tests verifying that the controlplane uses the right resource
type for different controlplane sizes and that both components can be
gracefully updated without breaking cluster functionality.

Closes #1097
Closes #90

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>