Proposal for application in publick8s to migrate to arm64 #3619

Closed
smerle33 opened this issue Jun 7, 2023 · 20 comments

@smerle33
Contributor

smerle33 commented Jun 7, 2023

Service(s)

Azure

Summary

Work in progress: identifying candidates to migrate to the arm64 node pool.

Existing deployments on publick8s:

  • accountapp
  • cert-manager
  • cert-manager-cainjector
  • cert-manager-webhook
  • coredns (kube-system)
  • coredns-autoscaler (kube-system)
  • datadog-cluster-agent
  • incrementals-publisher
  • javadoc
  • jenkinsio
  • jenkinsio-zh
  • konnectivity-agent (kube-system)
  • metrics-server (kube-system)
  • mirrorbits
  • mirrorbits-files
  • plugin-health-scoring
  • plugin-site-backend
  • plugin-site-frontend
  • plugin-site-issues
  • private-nginx-ingress-ingress-nginx-controller
  • private-nginx-ingress-ingress-nginx-defaultbackend
  • public-nginx-ingress-ingress-nginx-controller
  • public-nginx-ingress-ingress-nginx-defaultbackend
  • rating
  • reports
  • uplink
  • wiki

Progress

#3619 (comment)

@smerle33 smerle33 added the triage Incoming issues that need review label Jun 7, 2023
@dduportal dduportal removed the triage Incoming issues that need review label Jun 13, 2023
@dduportal dduportal added this to the infra-team-sync-2023-06-20 milestone Jun 13, 2023
smerle33 added a commit to jenkins-infra/kubernetes-management that referenced this issue Aug 7, 2023
dduportal pushed a commit to jenkins-infra/kubernetes-management that referenced this issue Aug 7, 2023
@lemeurherve
Member

lemeurherve commented Nov 14, 2023

Update

plugin-site-api

Arm64 image published from infra.ci.jenkins.io, chart updated, ready to migrate.

plugin-site-issues

Arm64 image published, ready to migrate.

Next steps

We can now proceed with migrating the plugin-site components to arm64.

Then we'll migrate weekly.ci.jenkins.io to arm64; the corresponding arm64 image is already published.

@lemeurherve
Member

The latest plugin-site Helm chart version, whose image tag includes an arm64 variant, is now deployed on the publick8s cluster.
Only the helmfile release changes remain to migrate its components to arm64.

@smerle33
Contributor Author

The migration of plugin-site and plugin-site-issues to arm64 is done.

Post-mortem of the plugin-site and plugin-site-issues migration:

I forgot to update the chart versions in the PR jenkins-infra/kubernetes-management#4683.
I created the PR jenkins-infra/kubernetes-management#4684 to fix it, but Helm blocked the update with: Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

The solution was to roll back:

  • first, list the releases: helm ls --namespace plugin-site
  • then roll back both releases: helm rollback --namespace plugin-site plugin-site and helm rollback --namespace plugin-site plugin-site-issues
  • finally, launch a new kubernetes-management build on infra.ci.jenkins.io
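
The rollback steps above can be sketched as a small shell function. This is only a sketch, not the team's tooling: the rollback_releases name and the HELM variable are assumptions made for illustration, and HELM defaults to "echo helm" so the commands are printed rather than executed unless it is overridden.

```shell
#!/bin/sh
# Sketch: unblock Helm releases stuck with "another operation ... is in progress".
# HELM is a hypothetical override; by default the commands are only printed.
HELM="${HELM:-echo helm}"

rollback_releases() {
  namespace="$1"
  shift
  # List the releases of the namespace first, to see their current state.
  $HELM ls --namespace "$namespace"
  for release in "$@"; do
    # Without a revision argument, Helm rolls back to the previous revision.
    $HELM rollback --namespace "$namespace" "$release"
  done
}

rollback_releases plugin-site plugin-site plugin-site-issues
```

Setting HELM=helm before running it would execute the commands for real; a new kubernetes-management build is still needed afterwards.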

To avoid this kind of problem, I need to first check the open PRs related to the release (we had jenkins-infra/kubernetes-management#4671) and plan the migration better, so that more than one person can focus on and check the PR.

This had no impact on production: the upgrade did not succeed, so Kubernetes held it back.

@dduportal
Contributor

Before closing this issue, the following services have to be migrated to arm64:

  • httpd (in both mirrorbits and mirrorbits-parent releases)
  • rsyncd (in mirrorbits-parent release)
  • weekly.ci.jenkins.io

Next step after this issue (to be continued and detailed):

@smerle33
Contributor Author

smerle33 commented Nov 20, 2023

mirrorbits-parent : httpd and rsyncd are now on arm64:

[Screenshot 2023-11-20 at 17 34 05]

@smerle33
Contributor Author

The first attempt to move WEEKLY.CI.JENKINS.IO to arm64 failed, probably because of my impatience.
Before merging the PR, the next attempt will involve:

  • manually scaling the arm64 node pool up by one node
  • manually scaling the StatefulSet down to 0 replicas to help the volume migration

@smerle33
Contributor Author

smerle33 commented Nov 24, 2023

Further testing and investigation revealed a zone incompatibility: ARM VMs for the node pool are only available in zone 1 (eastus2-1), while our other node pools are located in zone 3 (eastus2-3).
The problem lies with the volumes, which cannot be mounted from one zone to another.
We decided on the following plan:

[Screenshot 2023-11-24 at 16 18 33] [Screenshot 2023-11-24 at 16 18 43]
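
For reference, a zone mismatch like the one described above can be inspected from the cluster side. The sketch below is an assumption about how one might check it, not the procedure we followed: the show_zones name and the KUBECTL variable are hypothetical, and KUBECTL defaults to "echo kubectl" so the commands are only printed.

```shell
#!/bin/sh
# Sketch: compare the zones of the nodes with the zones the volumes are pinned to.
# KUBECTL is a hypothetical override; by default the commands are only printed.
KUBECTL="${KUBECTL:-echo kubectl}"

show_zones() {
  # Nodes, with their architecture and availability zone as extra columns.
  $KUBECTL get nodes \
    --label-columns kubernetes.io/arch,topology.kubernetes.io/zone
  # Persistent volumes, with the zone values required by their node affinity.
  $KUBECTL get pv --output \
    "custom-columns=NAME:.metadata.name,ZONES:.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values[*]"
}

show_zones
```

A volume whose ZONES column lists a single zone can only be attached to nodes in that zone, which is the constraint that blocked the move to the arm64 node pool.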

@smerle33
Contributor Author

smerle33 commented Nov 27, 2023

We created a new storage class (jenkins-infra/azure#526) to use ZRS storage,
and used a temporary PV/PVC on this volume as the source for the migration.
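
As an illustration, a ZRS storage class and a PVC cloned from an existing PVC could look like the manifests printed below. This is a sketch under assumptions: the real storage class is defined in jenkins-infra/azure#526 and may differ, only its name is taken from the error log in this thread, and the SOURCE_PVC, NAMESPACE and SIZE values are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: print a ZRS storage class and a PVC cloned from an existing PVC.
# SOURCE_PVC, NAMESPACE and SIZE are hypothetical placeholders.
SOURCE_PVC="${SOURCE_PVC:-jenkins-weekly-snap}"
NAMESPACE="${NAMESPACE:-jenkins-weekly}"
SIZE="${SIZE:-8Gi}"

zrs_clone_manifests() {
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-premium-zrs-retain
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS   # zone-redundant premium SSD (assumed parameter)
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${SOURCE_PVC}-clone
  namespace: ${NAMESPACE}
spec:
  storageClassName: managed-csi-premium-zrs-retain
  accessModes: ["ReadWriteOnce"]
  dataSource:            # CSI volume clone of an existing, bound PVC
    kind: PersistentVolumeClaim
    name: ${SOURCE_PVC}
  resources:
    requests:
      storage: ${SIZE}   # must be at least the size of the source volume
EOF
}

zrs_clone_manifests
```

The output can be piped to kubectl apply --dry-run=client --filename - to validate it before applying anything.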

There appears to be a bug in the CSI volume clone: it failed with the following error on the new PVC created by the cloning system:

Warning  ProvisioningFailed    23s (x6 over 54s)  disk.csi.azure.com_csi-azuredisk-controller-68cfbf9cc6-vknhs_e4bc448f-ce5d-4e2c-9af5-2f47223a8443  failed to provision volume with StorageClass "managed-csi-premium-zrs-retain": rpc error: code = Internal desc = sourceResourceID(/subscriptions/redacted/resourceGroups/mc_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks//subscriptions/redacted/resourcegroups/MC_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks/jenkins-weekly-snap) is invalid, correct format: .*/subscriptions/(?:.*)/resourceGroups/(?:.*)/providers/Microsoft.Compute/disks/(.+)   

We changed the volumeHandle of the source PV from /subscriptions/redacted/resourcegroups/MC_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks/jenkins-weekly-snap to jenkins-weekly-snap (we exploited the CSI clone bug).
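
The error message suggests that the source ID is built by prefixing the disk path to the volumeHandle. The sketch below is not the driver's code; it only mimics that "prefix + volumeHandle" shape, using the redacted paths from the log above, to show why a bare disk name gives a well-formed ID. The DISK_PREFIX variable and source_resource_id function are names made up for this illustration.

```shell
#!/bin/sh
# Sketch: reproduce the shape of the sourceResourceID seen in the error message.
# This is NOT the driver's code; it only mimics "prefix + volumeHandle".
DISK_PREFIX="/subscriptions/redacted/resourceGroups/mc_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks"

source_resource_id() {
  # $1 is the volumeHandle of the source PV.
  printf '%s/%s\n' "$DISK_PREFIX" "$1"
}

# Full resource ID as volumeHandle: the path is doubled, as in the error.
source_resource_id "/subscriptions/redacted/resourcegroups/MC_publick8s_publick8s-endless-ghoul_eastus2/providers/Microsoft.Compute/disks/jenkins-weekly-snap"
# Bare disk name as volumeHandle: a single, well-formed resource ID.
source_resource_id "jenkins-weekly-snap"
```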

Everything went well: WEEKLY.CI.JENKINS.IO now runs on ARM64 🚀
