Setting the cluster up for the first time is more involved as there are certain
resources where terraform is unable to handle explicit dependencies. This
means that we have to set up the GKE cluster before we set up any of the
Kubernetes resources, as otherwise the Terraform Kubernetes provider will
error out.

## Upgrading/Resetting GitHub ARC

Upgrading and resetting the GitHub Actions Runner Controller (ARC) within the
cluster involve largely the same process, with a few special considerations
for how ARC interacts with Kubernetes. The process consists of uninstalling
the runner scale set charts, deleting their namespaces to ensure everything is
properly cleaned up, optionally bumping the version number if this is a
version upgrade, and then reinstalling the charts to get the cluster back to
accepting production jobs.

It is important not to blindly delete controller pods or namespaces, as this
(at least empirically) can disrupt the state and custom resources that ARC
manages, requiring a costly full uninstallation and reinstallation of at least
a runner scale set.

When upgrading/resetting the cluster, jobs will not be lost; they instead
remain queued on the GitHub side. Build jobs that are already running will
complete after the helm charts are uninstalled unless they are forcibly
killed. Note that best practice dictates the helm charts should just be
uninstalled rather than also setting `maxRunners` to zero beforehand, as the
latter can cause ARC to accept jobs but not actually execute them, which could
prevent failover in HA cluster configurations.
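
If you want to double-check that jobs are indeed waiting on the GitHub side
rather than being lost, the `gh` CLI can list them. A quick sketch, assuming
the jobs belong to the `llvm/llvm-project` repository:

```bash
# List workflow runs that are queued and waiting for runners to become available.
gh run list --repo llvm/llvm-project --status queued
```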

### Uninstalling the Helm Charts

To begin, uninstall the runner scale set helm charts by using resource
targeting with `terraform destroy`:

```bash
terraform destroy -target helm_release.github_actions_runner_set_linux
terraform destroy -target helm_release.github_actions_runner_set_windows
```

These commands should complete, but even if they do not, things can still be
cleaned up. If everything went smoothly, the commands will finish and leave
behind only runner pods that are still in the process of executing jobs. Wait
for those to complete before moving on. If a pod is stuck, you will need to
delete it manually with `kubectl delete`.
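
A minimal sketch of that manual cleanup, assuming the Linux runners live in a
namespace named `llvm-premerge-linux-runners` (the actual name comes from the
Terraform configuration):

```bash
# List any runner pods that are still around after the helm releases were destroyed.
kubectl get pods -n llvm-premerge-linux-runners

# Force-delete a pod that is stuck and will not finish on its own.
kubectl delete pod <pod name> -n llvm-premerge-linux-runners
```

Once the remaining runner pods are gone, follow up the previous terraform
commands by deleting the Kubernetes namespaces all of the resources live in: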

```bash
terraform destroy -target kubernetes_namespace.llvm_premerge_linux_runners
terraform destroy -target kubernetes_namespace.llvm_premerge_windows_runners
```

If things go smoothly, these should complete quickly. If they do not, there
are most likely dangling resources in the namespaces that need to have their
finalizers removed before they can be deleted. You can confirm this by running
`kubectl get namespaces`: if a namespace is listed as `Terminating`, you most
likely need to intervene manually. To find the dangling resources that did not
get cleaned up properly, run the following command, making sure to fill in
`<namespace>` with the actual namespace of interest:

```bash
kubectl api-resources --verbs=list --namespaced -o name \
| xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
```

This will list the stuck resources. You can then take each resource name and
edit the YAML configuration of the Kubernetes object to remove the finalizers:

```bash
kubectl edit <resource name> -n <namespace name>
```

Deleting the `finalizers` key along with any of its entries should be
sufficient. After rerunning the command to find dangling resources, you should
see the resource disappear from the list. Once this has been done for all
dangling resources, the namespace should then delete automatically, which can
be confirmed by running `kubectl get namespaces`.
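
If you would rather avoid the interactive editor, the same finalizer removal
can be done with `kubectl patch`. A minimal sketch (substitute the real
resource and namespace names):

```bash
# Clear the finalizers on a stuck resource so Kubernetes can finish deleting it.
kubectl patch <resource name> -n <namespace name> \
  --type merge -p '{"metadata":{"finalizers":null}}'
```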

If you are performing these steps as part of an incident response, you can
skip ahead to [Bringing the Cluster Back Up](#bringing-the-cluster-back-up).
If you are performing a version upgrade, you still need to uninstall the
controller and bump the version number beforehand.

### Uninstalling the Controller Helm Chart

Next, the controller helm chart needs to be uninstalled. If you are performing
these steps as part of responding to an incident, you most likely do not need
this step, as destroying and recreating the runner scale sets is usually
sufficient to resolve incidents. Uninstalling the controller is, however,
necessary for version upgrades.

Start by destroying the helm chart:

```bash
terraform destroy -target helm_release.github_actions_runner_controller
```

Then delete the namespace to ensure there are no dangling resources:

```bash
terraform destroy -target kubernetes_namespace.llvm_premerge_controller
```

### Bumping the Version Number

This step is only necessary if you are bumping the version of ARC. It involves
simply updating the version field for the `helm_release` objects in `main.tf`.
Make sure to commit the changes and push them to `llvm-zorg` so that others
working on the terraform configuration have an up to date state when they pull
the repository.
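
A minimal sketch of that workflow, assuming the terraform configuration lives
in `premerge/main.tf` within a checkout of `llvm-zorg` (the path and commit
message are illustrative):

```bash
# Find the version fields for the ARC helm_release resources and edit them by hand.
grep -n "version" premerge/main.tf

# Commit and push the change so others pick up the updated configuration.
git add premerge/main.tf
git commit -m "Bump ARC helm chart version"
git push
```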

### Bringing the Cluster Back Up

To get the cluster back up and accepting production jobs again, simply run
`terraform apply`. It will recreate all of the resources previously destroyed
and ensure they are in a state consistent with the terraform IaC definitions.
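
A quick sketch of bringing things back up and verifying the result; the
namespace names here are assumptions based on the Terraform resource names and
may differ in the actual configuration:

```bash
terraform apply

# Confirm the controller and runner scale set pods come back up.
kubectl get pods -n llvm-premerge-controller
kubectl get pods -n llvm-premerge-linux-runners
kubectl get pods -n llvm-premerge-windows-runners
```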

### External Resources

[Strategies for Upgrading ARC](https://www.kenmuse.com/blog/strategies-for-upgrading-arc/)
outlines how ARC should be upgraded and why.