This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Add ability to perform controlplane upgrades #32

Merged
merged 13 commits into master from invidian/controlplane-upgrades on Mar 11, 2020

Conversation

invidian
Member

No description provided.

@invidian
Member Author

invidian commented Feb 21, 2020

Pushed some updates today. TODO:

  • add user feedback that we are now upgrading the controlplane
  • add user feedback about which controlplane component is being upgraded
  • add PodDisruptionBudget for CoreDNS
  • add PodDisruptionBudget
  • resolve issue with upgrading the self-hosted kubelet
  • add explanation of why the kube-apiserver manifests look the way they do
  • (optional) refactor pkg/component/util/install.go and cli/cmd/cluster.go to extract the common logic for installing when a release is not found (see the sketch below)
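
Regarding that last item, here is a minimal sketch of the "install only if the release is not found" logic using helm v3's action package. This is not the actual lokoctl code; the function name and parameters are illustrative.

```go
package util

import (
	"errors"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// installOrUpgrade installs the release if it does not exist yet and
// upgrades it otherwise. Illustrative only.
func installOrUpgrade(cfg *action.Configuration, name, namespace string, ch *chart.Chart, values map[string]interface{}) error {
	history := action.NewHistory(cfg)
	history.Max = 1

	// driver.ErrReleaseNotFound means the release was never installed.
	if _, err := history.Run(name); errors.Is(err, driver.ErrReleaseNotFound) {
		install := action.NewInstall(cfg)
		install.ReleaseName = name
		install.Namespace = namespace

		_, err := install.Run(ch, values)

		return err
	}

	upgrade := action.NewUpgrade(cfg)
	upgrade.Namespace = namespace
	upgrade.Atomic = true // roll back automatically if the upgrade fails

	_, err := upgrade.Run(name, ch, values)

	return err
}
```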

TL;DR for using HAProxy with kube-apiserver:

  • it allows making use of SO_REUSEPORT, so you can have multiple instances of kube-apiserver on a single node, which provides seamless updates (see the sketch below)
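
kube-apiserver itself does not set this socket option (hence the upstream issue mentioned later in this thread). Purely as an illustration of what SO_REUSEPORT gives us, here is a minimal Go sketch, assuming Linux and golang.org/x/sys/unix, of two listeners sharing one address and port:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortListen opens a TCP listener with SO_REUSEPORT set, so several
// processes (or listeners) can bind the same address:port and the kernel
// spreads incoming connections between them.
func reusePortListen(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}

	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	// Two listeners on the same port: this is what would allow running the
	// old and the new instance side by side during an upgrade.
	l1, err := reusePortListen("127.0.0.1:6443")
	if err != nil {
		panic(err)
	}

	l2, err := reusePortListen("127.0.0.1:6443")
	if err != nil {
		panic(err)
	}

	fmt.Println(l1.Addr(), l2.Addr())
}
```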

TL;DR for using a Deployment for `kube-apiserver`:

  • DaemonSet currently does not allow "overcommitting" pods while upgrading. Currently in K8s, the old pod needs to be shut down before the new one gets created (and scheduled). With a Deployment this is possible, so it is easier to survive the upgrades (see the sketch below). Additionally, with SO_REUSEPORT, you can have 2 kube-apiserver processes running in parallel, which allows doing atomic updates (helm upgrade --atomic).
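
For illustration only (the exact values are assumptions, not necessarily what the chart uses), the Deployment rolling-update behaviour described above expressed with client-go types:

```go
package controlplane

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// surgeStrategy shows the "overcommit during upgrade" behaviour a Deployment
// allows: the new pod is created before the old one is removed.
func surgeStrategy() appsv1.DeploymentStrategy {
	maxSurge := intstr.FromInt(1)       // allow one extra pod during the rollout
	maxUnavailable := intstr.FromInt(0) // never go below the desired replica count

	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```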

TL;DR for no-downtime upgrades on a single-controller setup:

  • currently, if kube-apiserver becomes unavailable, pod-checkpointer kicks in and creates a static instance of kube-apiserver. pod-checkpointer operates in 1-minute intervals, and even when the new version of kube-apiserver gets created, it will crash at the beginning, until the static pod gets removed (there is a 30-second timeout in pod-checkpointer for that).

    All those conditions make it difficult to determine when the actual upgrade is finished and would require us to have some complex retry/timeout logic in lokoctl to make the process nice for the user. Additionally, it takes more time to do such an upgrade. With pod-checkpointer, the upgrade might take a few minutes, while with this approach, if the image is already pulled, the upgrade takes ~10 seconds and there is no downtime during the process, nor a helm failure.

@invidian
Member Author

Ah, for now, I've only tested it on a single-controller AWS cluster. Will do more tests next week.

@iaguis iaguis force-pushed the invidian/controlplane-upgrades branch from b86451a to c45776f on February 21, 2020 17:33
@invidian invidian force-pushed the invidian/controlplane-upgrades branch 3 times, most recently from 584e349 to d79436e on February 24, 2020 13:06
@invidian
Member Author

Pushed some updates:

  • Provided user feedback about the upgrade process:
    Ensuring that cluster controlplane is up to date.
    Ensuring controlplane component 'kube-apiserver' is up to date... Done.
    Ensuring controlplane component 'kubernetes' is up to date... Done.
    Ensuring controlplane component 'calico' is up to date... Done.
    Installing component 'metrics-server'...
    Succesfully installed component 'metrics-server'!
    
  • Changed listening to randomize the IP address rather than the port. This is because kube-apiserver allows overriding which IP address will be advertised (i.e. added to the endpoints of the kubernetes.default.svc Service), but not the port. Because of firewalls, it's then easier to keep the port static and randomize the IP address we listen on (see the sketch below).
    On platforms where kube-apiserver needs to be exposed on all interfaces (e.g. Packet), we switch the kube-apiserver in-cluster port to 7443 (on this port, kube-apiserver will listen on a random local IP and HAProxy will listen on the node IP); in addition, HAProxy will also listen on 0.0.0.0:6443 to expose the API on all interfaces (including public ones).
    On platforms with only a private network (e.g. AWS), the port setup remains the same.
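
A rough sketch of the "random local IP" part, assuming the address is picked from 127.0.0.0/8; the function is illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
)

// randomLoopbackIP picks a random address from 127.0.0.0/8 for
// kube-apiserver to bind to, keeping octets in 1..254 to avoid the
// network address and 127.0.0.1 itself.
func randomLoopbackIP() net.IP {
	octet := func() byte { return byte(rand.Intn(254) + 1) }

	return net.IPv4(127, octet(), octet(), octet())
}

func main() {
	// e.g. 127.155.125.53; HAProxy then proxies the node IP to this address.
	fmt.Println(randomLoopbackIP())
}
```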

@invidian
Member Author

I've done some more testing based on @iaguis feedback.

It seems that kube-apiserver is consuming a hell of a lot of memory, so squeezing 2 instances onto a 2GB machine (the default controller type on AWS) is problematic.

Also, it seems this patch breaks metrics-server for some reason; I need to investigate.

@invidian
Member Author

Also, it seems this patch breaks metrics-server for some reason; I need to investigate.

Fixed that; an extra backslash broke the aggregation flags.

@invidian invidian force-pushed the invidian/controlplane-upgrades branch 2 times, most recently from 41c8ad9 to baf0105 on February 25, 2020 17:20
@invidian
Member Author

With the following cluster configuration (only the important bits shown):

cluster "aws" {
  os_channel       = "edge"
  os_version       = "2387.99.0"

I was able to perform a full cluster upgrade (including the kubelet) and it took 2 minutes with the images already pulled.

However, on the first attempt, I hit an anomaly where the new kubelet instance registered as localhost instead of using the hostname, even though the hostname command in the container was returning a valid hostname. Perhaps we could use --hostname-override and specify the hostname explicitly to avoid such issues.

@invidian invidian force-pushed the invidian/controlplane-upgrades branch from 49a4337 to 7a7b5e0 on February 26, 2020 14:29
@invidian
Member Author

An alternative, simpler and more naive approach: https://github.com/kinvolk/lokomotive/tree/invidian/controlplane-upgrades-alternative.

@iaguis
Contributor

iaguis commented Feb 27, 2020

An alternative, simpler and more naive approach: https://github.com/kinvolk/lokomotive/tree/invidian/controlplane-upgrades-alternative.

I've been testing this approach and it seems to work. However, the API server flips between available and unavailable while the other kubernetes CP components are updating. Since we also retry for those, by the time they're installed the API server becomes stable, but this is a side effect of us waiting for the other CP components. Can we do a bit better here?

Otherwise I think this works and is simple so I'm happy with it. Also, as expected, HA updates work well too.

@invidian
Member Author

I've been testing this approach and it seems to work. However, the API server flips between available and unavailable while the other kubernetes CP components are updating. Since we also retry for those, by the time they're installed the API server becomes stable, but this is a side effect of us waiting for the other CP components. Can we do a bit better here?

Yes, this is what I will be working on today. I think I'll implement a check which will make sure that the kube-apiserver DaemonSet is up to date and only then proceed.

I also still need to check against bad flags.

@invidian
Member Author

I also still need to check against bad flags.

Unfortunately, as I said, the pod checkpointer does not prevent, nor provide a way to recover from, a bad flag being specified for kube-apiserver, as there is no way to distinguish between a "regular" crash, which is part of the upgrade process, and a real crash caused by e.g. a bad flag.

Additionally, the bad flag is also persisted to inactive-manifests, so the recovery procedure is manual and difficult. It should look something like this:

  • copy the bad manifest from the inactive-manifests directory somewhere to edit
  • edit it to restore it to the good state
  • move it to the manifests directory
  • wait for the static API server to be created by the kubelet
  • delete the existing kube-apiserver Pod and DaemonSet to prevent it from being run, checkpointed etc.
  • resolve the bad flag issue
  • run the upgrade process again to ensure that the right manifests are applied to the cluster
  • remove the static kube-apiserver manifest from the manifests directory

Yes, this is what I will be working on today. I think I'll implement a check which will make sure that the kube-apiserver DaemonSet is up to date and only then proceed.

It seems that in order to do that, we just need to wait until all replicas of the DaemonSet are ready and then we can proceed. This also involves only kube-apiserver, so the retry logic can be limited in scope to just this upgrade.
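
A rough sketch (not the actual lokoctl code) of that check with client-go, assuming a recent client-go and that the kube-apiserver DaemonSet lives in kube-system:

```go
package controlplane

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSet blocks until all kube-apiserver DaemonSet replicas are
// updated and ready, retrying on API errors, as the API server may be
// flapping during the upgrade.
func waitForDaemonSet(client kubernetes.Interface) error {
	return wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
		ds, err := client.AppsV1().DaemonSets("kube-system").Get(context.TODO(), "kube-apiserver", metav1.GetOptions{})
		if err != nil {
			return false, nil // keep retrying while the API server is unavailable
		}

		done := ds.Status.UpdatedNumberScheduled == ds.Status.DesiredNumberScheduled &&
			ds.Status.NumberReady == ds.Status.DesiredNumberScheduled

		return done, nil
	})
}
```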

@iaguis
Contributor

iaguis commented Feb 28, 2020

It seems that in order to do that, we just need to wait until all replicas of the DaemonSet are ready and then we can proceed. This also involves only kube-apiserver, so the retry logic can be limited in scope to just this upgrade.

cool!

@invidian
Member Author

invidian commented Mar 3, 2020

I opened an upstream issue about SO_REUSEPORT to see if there is a chance of getting support for it upstream: kubernetes/kubernetes#88785

I also implemented this option for kube-apiserver and tested it to verify that it's working as expected. It is. See the issue for more details.

@invidian invidian force-pushed the invidian/controlplane-upgrades branch 2 times, most recently from be5432e to 1402f9a on March 4, 2020 11:09
@invidian invidian requested a review from iaguis March 4, 2020 11:10
@invidian invidian force-pushed the invidian/controlplane-upgrades branch 5 times, most recently from 555aed7 to 6d10d3a on March 11, 2020 14:41
@invidian
Member Author

I'm testing upgrades of kube-apiserver run as a DaemonSet on an HA controlplane. It seems that when the kube-apiserver pod gets descheduled, the pod checkpointer kicks in on every node, which triggers a crashloop for a minute on every node. During this time, I also often get connection refused when running kubectl get pods.

With the Deployment approach, there are 0 errors 🙈

@rata
Member

rata commented Mar 11, 2020

@invidian that is weird. If I run kubectl set image on an "old" cluster (pre helm controlplane charts), it works just fine during the upgrade (kubectl does retries). And kubectl doesn't even fail if one out of 3 controllers is down, it just takes more time (in the tests I did in the past).

How are you trying it?

@invidian
Member Author

How are you trying it?

I'm running lokoctl cluster install with a changed hyperkube image version, which triggers a controlplane upgrade. In the meantime, I just run kubectl get pods from time to time and I see that it sometimes fails.
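
For reference, the "kubectl get pods from time to time" observation can be reproduced with a trivial probe like the one below (not part of the PR; the kubeconfig path is assumed to be the default one):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Polls the API server once per second and prints every failure, to see how
// often it becomes unavailable during a controlplane upgrade.
func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	for {
		if _, err := client.Discovery().ServerVersion(); err != nil {
			fmt.Printf("%s API server unavailable: %v\n", time.Now().Format(time.RFC3339), err)
		}

		time.Sleep(time.Second)
	}
}
```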

@invidian invidian force-pushed the invidian/controlplane-upgrades branch from 6d10d3a to 2970bea on March 11, 2020 16:25
This commit adds a replacement function chartFromComponent, which glues
together the logic between chartFromManifests and the Component
interface, making chartFromManifests only handle generating the chart.

chartFromComponent also validates that the generated chart is correct,
which makes it safer for the caller.

This commit also adds some unit tests for chartFromManifests, as it is
now easier to test.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So the code for building the helm action config and checking whether a
given helm release exists can be reused.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So it can be upgraded independently. This is important for the
controlplane upgrade process, as we should first ensure that the pod
checkpointer is up to date and running, and only then upgrade
kube-apiserver, because if the kube-apiserver upgrade fails, the pod
checkpointer is needed to recover the cluster.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
pod-checkpointer and kube-apiserver were missing and they are required
for the controlplane upgrades feature.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
These outputs will be used when upgrading the controlplane.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
All chart values and the networking solution, so we can query which
chart is in use for the network.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
As 2GB of RAM is not enough for running a standalone controller node
with the graceful upgrade strategy (2 API servers do not fit into a
2GB node).

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
This commit adds controlplane upgrade functionality to the 'cluster
install' command. If the user runs this command on an existing cluster,
all controlplane helm releases will be upgraded using the charts and
values.yaml generated by Terraform.

If some controlplane release has been uninstalled by the user, it will
be reinstalled to ensure consistency.

The upgrade process is atomic, which means that if something goes wrong
(e.g. a new pod does not become ready), the upgrade will be rolled
back.

The upgrade process is done in the order recommended by upstream:
https://kubernetes.io/docs/setup/release/version-skew-policy/#supported-component-upgrade-order

Currently, 'kubelet' is not being upgraded because of a bug in runc on
Flatcar Linux. See #110 for more details. For testing,
--upgrade-kubelets can be used, but it's experimental and not enabled
by default at the moment.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
So if the Packet image used when installing the server is not up to
date (this happens), the node upgrade process won't disturb installing
the remaining components.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
…lease

This commit attempts to improve the reliability of the upgrade process of
the self-hosted kube-apiserver when having only one controller node. If
running more than one kube-apiserver replica, the setup does not change.

Currently, kube-apiserver runs as a DaemonSet, which means that if
there is only one node where the pod is assigned, it must be removed
before a new one can be scheduled. This causes a short outage when
doing a rolling update of kube-apiserver. During the outage, the pod
checkpointer kicks in and brings up a temporary kube-apiserver as a
static pod to recover the cluster, and waits until the kube-apiserver
pod is scheduled on the node again. Then, it shuts down the temporary
kube-apiserver pod and removes its manifest. As there cannot be 2
instances of kube-apiserver running at the same time on the node, the
pod checkpointer is not able to wait until the updated pod starts up,
as it must shut down the temporary one first. This has a bad side
effect: if the new pod is wrongly configured (e.g. has a non-existent
flag specified), kube-apiserver will never recover, which brings down
the cluster, and manual intervention is needed.

See PR #72 for more details.

If it were possible to run more than one instance of kube-apiserver on
a single node, that would make the upgrade process easier. When you try
to do that now, the 2nd instance will not run, as the secure port is
already bound by the first instance.

In Linux, there is a way to have multiple processes bind the same
address and port: the SO_REUSEPORT socket option. More details
under this link: https://lwn.net/Articles/542629/.

Unfortunately, kube-apiserver does not create a listening socket with
that option.

To mimic the SO_REUSEPORT option in kube-apiserver, this commit adds a
HAProxy instance as a side-container to kube-apiserver. HAProxy does
support SO_REUSEPORT, so multiple instances can bind to the same address
and port and then traffic between the processes will be equally
distributed by the kernel.

As kube-apiserver still runs on the host network, we need to either
randomize the IP address or the port it listens on, in order to be able
to run multiple instances on a single host.

In this case, randomizing the IP address to bind to is easier than
randomizing the port, as kube-apiserver advertises its own IP address
and port to the 'kubernetes' service in the 'default' namespace of the
cluster, which means that pods in the cluster would bypass HAProxy and
connect to kube-apiserver directly. That would require opening such
random ports on the firewall for the controller nodes, which is
undesired.

If we randomize IP address to bind, we can use the loopback interface, which
by default in Linux has a /8 IP address assigned, which means that we can
select a random IP address like 127.155.125.53 and bind to it.

To avoid advertising a localhost IP address to the cluster, which
obviously wouldn't work, we use the --advertise-address kube-apiserver
flag, which allows us to override the IP address advertised to the
cluster and always set it to the address where HAProxy is listening,
for example using the HOST_IP environment variable pulled from the
Kubernetes node information in the pod status.
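
Not the chart itself (which is a Helm template), but a sketch of the
equivalent structure with client-go types; the image, command and flag
set are trimmed and illustrative:

```go
package controlplane

import corev1 "k8s.io/api/core/v1"

// apiserverContainer sketches how the advertised address comes from the node
// IP via the downward API, while the process itself binds a random loopback
// address behind HAProxy.
func apiserverContainer(image, bindAddress string) corev1.Container {
	return corev1.Container{
		Name:  "kube-apiserver",
		Image: image,
		Env: []corev1.EnvVar{{
			Name: "HOST_IP",
			ValueFrom: &corev1.EnvVarSource{
				// The node IP from the pod status, i.e. where HAProxy listens.
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.hostIP"},
			},
		}},
		Command: []string{
			"kube-apiserver",
			"--bind-address=" + bindAddress,  // the random loopback address
			"--advertise-address=$(HOST_IP)", // what ends up in the kubernetes.default endpoints
		},
	}
}
```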

HAProxy runs in TCP mode to minimize the required configuration
and possible impact of misconfiguration. In my testing, I didn't
experience any breakage because of using a proxy, however, we may need to
pay attention to parameters like session timeouts, to make sure they
don't affect connections.

Once we are able to run multiple instances of kube-apiserver on a single
node, we need to change the way we deploy the self-hosted kube-apiserver
from DaemonSet to Deployment to allow running multiple instances on a
single node.

As running multiple instances on a single node should only be done
temporarily, since a single kube-apiserver is able to scale very well,
we add podAntiAffinity to make sure that replicas of the Deployment are
equally spread across controller nodes. This also makes sense because
each kube-apiserver instance consumes at least 500MB of RAM, which
means that if a controller node has 2GB of RAM, it might not be enough
to run 2 instances for a longer period, so at least 4GB of RAM are
recommended for the controller nodes. This also makes sense from a
stability point of view, as with many workloads, controller node
resource usage will grow.

By default, the number of replicas equals the number of controller nodes.

For podAntiAffinity, preferredDuringSchedulingIgnoredDuringExecution is
used instead of requiredDuringSchedulingIgnoredDuringExecution, as with
a single controller node, we should actually allow multiple
instances on a single node to perform graceful updates.
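
The preferred (soft) anti-affinity described above, sketched with
client-go types; the label selector is an assumption, not necessarily
the one used by the chart:

```go
package controlplane

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// softAntiAffinity spreads kube-apiserver replicas across controller nodes,
// but still allows co-locating them on a single-controller cluster.
func softAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 100,
				PodAffinityTerm: corev1.PodAffinityTerm{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"k8s-app": "kube-apiserver"},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			}},
		},
	}
}
```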

See #90 for more details.

If there is just one replica of kube-apiserver requested, we set
maxUnavailable: 0 to make sure that there is always at least one
instance running. We also add a PodDisruptionBudget, which also makes
sure that there is at least one instance running. If more replicas are
requested, then the PodDisruptionBudget ensures that only one
kube-apiserver can be shut down at a time, to avoid overloading the
other running instances.
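
One plausible encoding of that PodDisruptionBudget ("keep at least one
instance"); the real chart may tune this depending on the replica
count, and namespace/labels here are assumptions:

```go
package controlplane

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// apiserverPDB keeps at least one kube-apiserver running during voluntary
// disruptions such as a rolling update or a node drain.
func apiserverPDB() *policyv1beta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)

	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "kube-apiserver", Namespace: "kube-system"},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"k8s-app": "kube-apiserver"},
			},
		},
	}
}
```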

On platforms where kube-apiserver needs to be exposed on all interfaces
(e.g. Packet), we switch kube-apiserver in-cluster port to 7443 (on this
port, kube-apiserver will listen on random local IP address and HAProxy
will listen on the Node IP), then in addition HAProxy will also listen on
0.0.0.0:6443 to expose the API on all interfaces (including the public
ones). This is required as you cannot have 2 processes, one listening
on 127.0.0.1:6443 and another on 0.0.0.0:6443.

On platforms with private network only, where kube-apiserver is accessed
via a load balancer (e.g. AWS), port setup remains the same.

The whole setup would be much simpler if kube-apiserver supported
SO_REUSEPORT. I have opened an upstream issue about that and
implemented a working PoC. More details here: kubernetes/kubernetes#88785.

With SO_REUSEPORT support in kube-apiserver, there is no need to run
HAProxy as a side-container, no need to listen on a random IP address
and no need to use multiple ports, which simplifies the whole solution.
However, the change from DaemonSet to Deployment and the pod
anti-affinities are still needed.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Rather than pointing to a PR with a lot of resolved comments.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian invidian force-pushed the invidian/controlplane-upgrades branch from 2970bea to 729241c on March 11, 2020 17:17
Contributor

@iaguis iaguis left a comment

lgtm

@invidian invidian merged commit cfcf78e into master Mar 11, 2020
@invidian invidian deleted the invidian/controlplane-upgrades branch March 11, 2020 18:13
@invidian
Member Author

For future self, with SO_REUSEPORT support in kube-apiserver, this patch needs to be applied: https://gist.github.com/invidian/d77dca57bcf3dbe49fa9b18330b5366a.
