From 4fd93a33bf61062f0de55e3f000ca1412347e320 Mon Sep 17 00:00:00 2001
From: Aiden Grossman
Date: Mon, 31 Mar 2025 22:08:38 +0000
Subject: [PATCH] =?UTF-8?q?[=F0=9D=98=80=F0=9D=97=BD=F0=9D=97=BF]=20initia?=
 =?UTF-8?q?l=20version?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Created using spr 1.3.4
---
 premerge/cluster-management.md | 104 +++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

diff --git a/premerge/cluster-management.md b/premerge/cluster-management.md
index dd157e82c..c3aed03b9 100644
--- a/premerge/cluster-management.md
+++ b/premerge/cluster-management.md
@@ -98,3 +98,107 @@ Setting the cluster up for the first time is more involved as there are certain
 resources where terraform is unable to handle explicit dependencies. This means
 that we have to set up the GKE cluster before we setup any of the Kubernetes
 resources as otherwise the Terraform Kubernetes provider will error out.
+
+## Upgrading/Resetting GitHub ARC
+
+Upgrading and resetting the GitHub Actions Runner Controller (ARC) within the
+cluster involve largely the same process, with some special considerations for
+how ARC interacts with Kubernetes. The process consists of uninstalling the
+runner scale set charts, deleting their namespaces to ensure everything is
+properly cleaned up, optionally bumping the version number if this is a
+version upgrade, and then reinstalling the charts to get the cluster back to
+accepting production jobs.
+
+It is important not to blindly delete controller pods or namespaces, as this
+(at least empirically) can corrupt the state and custom resources that ARC
+manages, which then requires a costly full uninstallation and reinstallation
+of at least a runner scale set.
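+
+Before touching anything, it can help to snapshot the custom resources that
+ARC currently owns so you can tell afterwards whether a reinstall recreated
+them cleanly. A minimal sketch, assuming the `actions.github.com` CRD group
+used by recent ARC releases (adjust the resource names if your installed
+version differs):
+
+```bash
+# List the ARC-managed custom resources across all namespaces. These are the
+# objects that can be left in a broken state by deleting pods by hand.
+kubectl get autoscalingrunnersets.actions.github.com -A
+kubectl get ephemeralrunnersets.actions.github.com -A
+kubectl get ephemeralrunners.actions.github.com -A
+```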
+
+### Uninstalling the Helm Charts
+
+Start by uninstalling the runner scale set Helm charts, using resource
+targeting on the terraform destroy command:
+
+```bash
+terraform destroy -target helm_release.github_actions_runner_set_linux
+terraform destroy -target helm_release.github_actions_runner_set_windows
+```
+
+These commands should complete, but even if they do not, things can still be
+cleaned up manually. If everything went smoothly, the only pods left behind
+will be runner pods that are still in the process of executing jobs. You will
+need to wait for them to complete before moving on. If they are stuck, you
+will need to delete them manually with `kubectl delete`. Follow up the
+previous terraform commands by deleting the Kubernetes namespaces that all
+the resources live in:
+
+```bash
+terraform destroy -target kubernetes_namespace.llvm_premerge_linux_runners
+terraform destroy -target kubernetes_namespace.llvm_premerge_windows_runners
+```
+
+If things go smoothly, these should complete quickly. If they do not, there
+are most likely dangling resources in the namespaces that need to have their
+finalizers removed before the namespaces can be deleted. You can confirm this
+by running `kubectl get namespaces`. If a namespace is listed as
+`Terminating`, you most likely need to intervene manually. To find the
+dangling resources that did not get cleaned up properly, run the following
+command, making sure to fill in `<namespace>` with the actual namespace of
+interest:
+
+```bash
+kubectl api-resources --verbs=list --namespaced -o name \
+  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
+```
+
+This will return the stuck resources. You can then edit the YAML
+configuration of each stuck object to remove its finalizers:
+
+```bash
+kubectl edit <resource> -n <namespace>
+```
+
+Deleting the `finalizers` key along with any entries under it should be
+sufficient. After rerunning the command to find dangling resources, you
+should see the resource disappear from the list.
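+
+As a non-interactive alternative to `kubectl edit`, the finalizers on a stuck
+object can be cleared with `kubectl patch`. A sketch, where `<resource>`
+(e.g. `ephemeralrunner/<name>`) and `<namespace>` are placeholders to fill
+in:
+
+```bash
+# Setting the finalizers list to null removes every entry, letting
+# Kubernetes finish deleting the object.
+kubectl patch <resource> -n <namespace> --type merge \
+  -p '{"metadata":{"finalizers":null}}'
+```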
+After doing this for all dangling resources, the namespace should then be
+deleted automatically. This can be confirmed by running
+`kubectl get namespaces`.
+
+If you are performing these steps as part of an incident response, you can
+skip to the section
+[Bringing the Cluster Back Up](#bringing-the-cluster-back-up). If you are
+bumping the version, you still need to uninstall the controller and bump the
+version number beforehand.
+
+### Uninstalling the Controller Helm Chart
+
+Next, the controller Helm chart needs to be uninstalled. If you are
+performing these steps as part of dealing with an incident, you most likely
+do not need to perform this step: it is usually sufficient to destroy and
+recreate the runner scale sets to resolve incidents. Uninstalling the
+controller is necessary for version upgrades, however.
+
+Start by destroying the Helm chart:
+
+```bash
+terraform destroy -target helm_release.github_actions_runner_controller
+```
+
+Then delete the namespace to ensure there are no dangling resources:
+
+```bash
+terraform destroy -target kubernetes_namespace.llvm_premerge_controller
+```
+
+### Bumping the Version Number
+
+This step is only necessary when bumping the version of ARC. It simply
+involves updating the version field of the `helm_release` objects in
+`main.tf`. Make sure to commit the changes and push them to `llvm-zorg` to
+ensure others working on the terraform configuration have an up-to-date state
+when they pull the repository.
+
+### Bringing the Cluster Back Up
+
+To get the cluster back up and accepting production jobs again, simply run
+`terraform apply`. It will recreate all the resources previously destroyed
+and ensure they are in a state consistent with the terraform IaC definitions.
+
+### External Resources
+
+[Strategies for Upgrading ARC](https://www.kenmuse.com/blog/strategies-for-upgrading-arc/)
+outlines how ARC should be upgraded and why.
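+
+As a final sanity check after running `terraform apply`, you can watch the
+runner pods come back up. The namespace names below are assumptions derived
+from the terraform resource names above; substitute the names your
+configuration actually uses:
+
+```bash
+# The listener and runner pods should reappear once the charts are
+# reinstalled.
+kubectl get pods -n llvm-premerge-linux-runners
+kubectl get pods -n llvm-premerge-windows-runners
+```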