
Upgrade to Kubernetes 1.26 #3683

Closed
7 tasks done
dduportal opened this issue Jul 26, 2023 · 12 comments

Comments

@dduportal
Contributor

dduportal commented Jul 26, 2023

Previous upgrade (1.25): #3582

Deprecation timelines for 1.25 (justifying the upgrade to 1.26):


Task list:

@github-actions

Take a look at these similar issues to see whether your problem has already been answered:

  1. 92% Upgrade to Kubernetes 1.24 #3387
  2. 92% Upgrade to Kubernetes 1.23 #3053
  3. 92% Upgrade to Kubernetes 1.22 #2930
  4. 92% Upgrade to Kubernetes 1.21 #2866
  5. 77% [INFRA-3118] Upgrade to Kubernetes 1.20 #2664

@dduportal dduportal added triage Incoming issues that need review kubernetes labels Aug 1, 2023
@dduportal dduportal added this to the infra-team-sync-2023-10-10 milestone Oct 3, 2023
@dduportal dduportal removed the triage Incoming issues that need review label Oct 3, 2023
@lemeurherve
Member

lemeurherve commented Oct 9, 2023

To keep in mind: https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke

To ensure the separation between the open source version of Kubernetes and those versions that are customized by services providers [...], the open source community is requiring that all provider-specific code that currently exists in the OSS code base be removed starting with v1.26.

@smerle33
Contributor

kubectl 1.26.9 is now used in our system (client side).

@dduportal
Contributor Author

Update:

TODO:

  • EKS
  • AKS

@dduportal
Contributor Author

dduportal commented Oct 26, 2023

Update on DigitalOcean upgrade:

Post Mortem

The upgrade itself went fine, but it had to be done through the DigitalOcean UI: the 2 Terraform PRs did not show any change in their plans because of jenkins-infra/digitalocean#148, which tells Terraform to ignore version changes.

Rolling back that change makes every terraform plan fail (as explained in the PR) due to how digitalocean_kubernetes_cluster and data.digitalocean_kubernetes_cluster are linked to the kubernetes providers in charge of managing CSI and admin service accounts.

As such, the upgrade procedure is amended to the following workflow:

  • Keep the Terraform PRs (one or two clusters at a time; both can be done)
  • Disable the Terraform job
  • Run the upgrades through the DigitalOcean web portal
  • Merge the PRs without waiting for the checks
  • Run a terraform plan from upstream/main as a sanity check to ensure no changes or cluster destruction are planned
  • Re-enable the Terraform job and run it once to ensure it is successful and up to date with the upgraded clusters
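For reference, the `ignore_changes` behavior introduced by jenkins-infra/digitalocean#148 presumably looks something like this sketch (resource name and surrounding attributes assumed, not taken from the actual repository):

```hcl
resource "digitalocean_kubernetes_cluster" "this" {
  # ... name, region, node pools, etc. ...

  lifecycle {
    # Cluster upgrades are performed outside Terraform (DigitalOcean UI),
    # so version drift must not show up in plans.
    ignore_changes = [version]
  }
}
```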

dduportal added a commit to dduportal/jenkins-infra that referenced this issue Oct 26, 2023
…ade - jenkins-infra/helpdesk#3683

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
dduportal added a commit to dduportal/kubernetes-management that referenced this issue Oct 26, 2023
…fra/helpdesk#3683

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
dduportal added a commit to dduportal/digitalocean that referenced this issue Oct 26, 2023
…nfra/helpdesk#3683

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
dduportal added a commit to dduportal/digitalocean that referenced this issue Oct 26, 2023
dduportal added a commit to jenkins-infra/jenkins-infra that referenced this issue Oct 26, 2023
…ade - jenkins-infra/helpdesk#3683 (#3140)

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

dduportal commented Nov 7, 2023

Update: AKS Upgrade plan

  • Check changelog:

Current changelog notable elements for Kubernetes 1.26:

  • Some AKS labels are being deprecated with the Kubernetes 1.26 release. Update your AKS labels to the recommended substitutions. See more information on label deprecations and how to update your labels in the Use labels in an AKS cluster documentation. beta.kubernetes.io/arch= and beta.kubernetes.io/os= are still applied by kubelet in Kubernetes code
    • ✅ We don't seem to have used these labels for at least a year (no warnings from helmfiles)
  • HostProcess Containers will be GA
    • ✅ We don't use this feature except when debugging
  • Two in-tree persistent volume drivers won't be supported in AKS: kubernetes.io/azure-disk and kubernetes.io/azure-file.
    • ✅ All of our PVs already use CSI
  • 📝 (for info) All AKS clusters on version 1.26+ will use the latest CoreDNS version, v1.10.1 (see below)
    • For all AKS clusters on version 1.26+, the CoreDNS health plugin will use lameduck 5s to minimize DNS resolution failures during CoreDNS pod restarts or deployment rollouts.
    • For all AKS clusters on version 1.26+, CoreDNS will use ttl 30 as the default TTL for DNS records.
  • During cluster upgrade to v1.26.0 or a later version, the disk PV node affinity check will cause the upgrade to fail if there are disk PVs still using the deprecated labels failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region
    • ✅ All of our PVs already use CSI
    • ✅ Temporary fix by Microsoft: failure-domain.beta.kubernetes.io labels are enabled by default on K8S 1.26+ nodes to resolve an issue with in-tree CSI drivers. Will be removed in K8S 1.28
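To illustrate the label deprecations above, here is a small helper (hypothetical, not an official AKS tool) that flags deprecated topology/beta label keys and their recommended replacements, given a list of label keys as they might appear on a node or in a PV's node affinity terms:

```python
# Deprecated label keys called out in the AKS / Kubernetes 1.26 notes,
# mapped to their GA replacements.
DEPRECATED = {
    "failure-domain.beta.kubernetes.io/zone": "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/region": "topology.kubernetes.io/region",
    "beta.kubernetes.io/arch": "kubernetes.io/arch",
    "beta.kubernetes.io/os": "kubernetes.io/os",
}

def deprecated_labels(label_keys):
    """Return {deprecated_key: replacement} for keys found in label_keys."""
    return {k: DEPRECATED[k] for k in label_keys if k in DEPRECATED}

# Example: one deprecated key, one already-migrated key.
found = deprecated_labels([
    "failure-domain.beta.kubernetes.io/zone",
    "topology.kubernetes.io/region",
])
```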

privatek8s

Post Mortem for privatek8s

@dduportal
Contributor Author

dduportal commented Nov 7, 2023

Update:

  • Migration tests of the Public IPs on privatek8s
    • Manual tests:
      • The resource group MC_prod-privatek8s_privatek8s-emerging-ram_eastus2 (managed by the privatek8s cluster) has 2 objects of type Public IP used by the cluster:
        • 1 is used for inbound requests, is Terraform-managed and named public-privatek8s
        • 1 is used for outbound requests, is AKS-managed and has a generated unique name
      • ✅ First manual test:
      • ✅ Second test (half manual / half automated)
        • ✅ With the jenkins-infra/azure and jenkins-infra/kubernetes-management jobs disabled (mandatory!)
        • ✅ Through the Azure Web UI, we moved the inbound IP from MC_prod-privatek8s_privatek8s-emerging-ram_eastus2 to this new RG
          • Took ~5 min
          • No inbound connectivity errors during the migration (tested by redelivering webhook payloads from GitHub to infra.ci)
        • ✅ We updated the SVC LB annotation (kubectl -n public-nginx-ingress edit svc public-nginx-ingress-ingress-nginx-controller) to:
        • ⚠️ The SVC LB used for the public ingress of the privatek8s cluster showed warning messages in its events (kubectl -n public-nginx-ingress describe svc public-nginx-ingress-ingress-nginx-controller) about missing .../join permissions on the Public IP
          • We need to add a new role assignment in Azure AD (Entra), as the moved Public IP needs the "Network Contributor" role's permissions now that it no longer depends on the AKS-managed RG
        • ✅ Put the Azure changes in IaC (Terraform) on jenkins-infra/azure:
          • Locally: imported the new RG prod-public-ips, updated the public IP public-privatek8s + its lock resource and added the new role assignment
          • PR merged without waiting for checks, applied manually and verified by enabling the jenkins-infra/azure jobs
          • See feat(privatek8s) change RG of public inbound IP azure#507
          • Note the checks failed due to the public IP changes mentioned earlier, which we fixed afterwards
        • ✅ Put the Kubernetes changes in IaC on jenkins-infra/kubernetes-management:
        • ⚠️ We had to wait ~10 min for the warning messages on the SVC LB to disappear (time for the permissions to propagate)
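The SVC LB annotation edit is not quoted in full above; the Azure cloud-provider annotations involved look roughly like this sketch (exact values are not in the comment; the resource group and IP names are taken from the surrounding text):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: public-nginx-ingress-ingress-nginx-controller
  namespace: public-nginx-ingress
  annotations:
    # Tell the Azure cloud provider to look up the public IP in the
    # new resource group instead of the AKS-managed (MC_...) one.
    service.beta.kubernetes.io/azure-load-balancer-resource-group: prod-public-ips
    service.beta.kubernetes.io/azure-pip-name: public-privatek8s
spec:
  type: LoadBalancer
```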

TODO:

  • Prepare the "Public IP migration" plan for publick8s
  • Announce the operation on publick8s, which will consist of:
    • Public IP migration
    • Falco upgrade on publick8s to ensure it works on arm64
    • NGINX ingress upgrade on publick8s to ensure we can validate the admission webhook bump
    • Kube 1.26 for publick8s:

@dduportal
Contributor Author

dduportal commented Nov 9, 2023

Operation on publick8s (wip)

Chart upgrades (falco and Nginx Ingress)

Public IP

  • Move Public IPs from MC_publick8s... to the prod-public-ips resource group (and manually delete their locks)
    • public-publick8s-ipv4
    • public-publick8s-ipv6
    • ldap-jenkins-io-ipv4
  • Update Terraform Azure resources (move the public IP objects + create new locks + add a role assignment for AKS management)
    • Disable Terraform Azure Job
    • Apply manually (partial apply):
      • public-publick8s-ipv4
      • public-publick8s-ipv6
      • ldap-jenkins-io-ipv4
    • Enable Terraform Azure Job
    • PR + (Sanity) checks should show no changes (or no major changes)
    • Merge and deploy PR (main build green)
  • Update Kubernetes SVC LB configurations
    • PR (and release) on LDAP jenkins-infra/helm-chart to support new Azure annotations
    • PR (and release) on IPv6-LB jenkins-infra/helm-chart to support new Azure annotations
    • Disable jenkins-infra/kubernetes-management job
    • PR on jenkins-infra/kubernetes-management to update annotations and configuration for the 3 (used) SVC LB using AKS LB and (moved above) public IPs - https://github.com/jenkins-infra/kubernetes-management/pull/4647/files
    • Enable jenkins-infra/kubernetes-management job
    • Merge PR and deploy
    • Sanity Checks
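The "move Public IPs" step above can be sketched with the Azure CLI (a sketch, not the exact commands used; the IP names come from the list above, and `MC_publick8s...` is left truncated as in the comment):

```shell
# Delete the management locks first, then move each public IP
# individually rather than as one batched move.
for ip in public-publick8s-ipv4 public-publick8s-ipv6 ldap-jenkins-io-ipv4; do
  id="$(az network public-ip show --resource-group "MC_publick8s..." \
        --name "$ip" --query id --output tsv)"
  az resource move --destination-group prod-public-ips --ids "$id"
done
```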

Kubernetes 1.26 upgrade

Post Mortem

  • 💡 When moving Azure resources from one resource group to another, do it sequentially. Yes, it is slow, but it avoids HTTP 500 errors from the API followed by 10-15 min of waiting anxiously before being able to retry ("eventually consistent" :trollface: )
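The sequential-move advice can be expressed as a small retry loop (illustrative only; `move_one` stands in for a hypothetical wrapper around `az resource move`):

```python
import time

def move_sequentially(resources, move_one, retries=3, backoff_seconds=30):
    """Move resources one at a time, retrying each transient failure
    (e.g. an HTTP 500 while the API catches up) instead of issuing
    the whole batch at once."""
    for res in resources:
        for attempt in range(retries):
            try:
                move_one(res)  # hypothetical wrapper around `az resource move`
                break
            except RuntimeError:
                if attempt == retries - 1:
                    raise
                time.sleep(backoff_seconds)

# Example with a fake mover that fails once per resource before succeeding:
moved = []
def flaky_move(res, _failed=set()):
    if res not in _failed:
        _failed.add(res)
        raise RuntimeError("HTTP 500")
    moved.append(res)

move_sequentially(["ip-a", "ip-b"], flaky_move, backoff_seconds=0)
```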

dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 9, 2023
…c_ips (#509)

as per
jenkins-infra/helpdesk#3683 (comment)

migrate the ips in another Ressource Group to be able to upgrade
kubernetes and avoid locks

---------

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
Co-authored-by: Damien Duportal <damien.duportal@gmail.com>
dduportal pushed a commit to jenkins-infra/azure that referenced this issue Nov 9, 2023
dduportal added a commit to jenkins-infra/azure that referenced this issue Nov 9, 2023
Ref.
jenkins-infra/helpdesk#3683 (comment)

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor Author

Mandatory logo

[Kubernetes 1.26 logo image]

@smerle33 I'll let you close this issue with a mandatory gif or image ;)

@smerle33
Contributor

smerle33 commented Nov 9, 2023

image
