
[BUG] Fail to Upgrade harvester-cloud-provider on RKE2 cluster (rancher-v2.8.4) #5873

Open
albinsun opened this issue May 24, 2024 · 6 comments

albinsun commented May 24, 2024

Describe the bug
Upgrading harvester-cloud-provider on an RKE2 (v1.28) guest cluster managed by Rancher v2.8.4 fails; the replacement pod cannot be scheduled:

0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

To Reproduce
Steps to reproduce the behavior:

  1. Set up harvester-v1.2.2 and rancher-v2.8.4

  2. Import Harvester to Rancher

  3. Create RKE2 cluster (v1.28.9+rke2r1) with the harvester-cloud-provider and harvester-csi-driver charts installed

  4. Upgrade harvester-cloud-provider (0.2.300 -> 103.0.2+up0.2.4) ❌

    • Upgrade: trigger the chart upgrade in Rancher
    • After: the upgrade hangs and the new harvester-cloud-provider pod stays Pending with
      0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
      (a kubectl sketch for inspecting the Pending pod is below)
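
For reference, a minimal way to inspect the Pending pod from inside the guest cluster (the label selector is an assumption; adjust it to the actual pod labels):

    # List the cloud-provider pods; after the upgrade one stays Pending
    kubectl -n kube-system get pods -l app.kubernetes.io/name=harvester-cloud-provider

    # The Events section shows the FailedScheduling message quoted above
    kubectl -n kube-system describe pod <pending-harvester-cloud-provider-pod>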

Expected behavior
The harvester-cloud-provider app should upgrade successfully through Rancher.

Support bundle
supportbundle_284FailUpgradeCloudProvider-24T10-50-58Z.zip

Rancher Log
myrke2-128-pool1-ef6ffd22-7jtc8-2024-05-24_10_51_16.tar.gz

Environment

  • Harvester
    • Version: v1.2.2
    • Profile: QEMU/KVM, single node (16C/32G/500G)
    • ui-source: Auto
  • Rancher
    • Version: v2.8.4
    • Profile: Helm (v1.28.9+k3s1) in QEMU/KVM (2C/4G)
    • RKE2 Version: v1.28.9+rke2r1

Additional context

  1. helm-operation-cfns7_undefined.log (see the helm/kubectl sketch at the end of this list)

    helm upgrade --history-max=5 --install=true --namespace=kube-system --timeout=10m0s --values=/home/shell/helm/values-harvester-cloud-provider-103.0.2-up0.2.4.yaml --version=103.0.2+up0.2.4 --wait=true harvester-cloud-provider /home/shell/helm/harvester-cloud-provider-103.0.2-up0.2.4.tgz
    checking 8 resources for changes
    Looks like there are no changes for ServiceAccount "kube-vip"
    Patch ServiceAccount "harvester-cloud-provider" in namespace kube-system
    Looks like there are no changes for ClusterRole "kube-vip"
    Looks like there are no changes for ClusterRole "harvester-cloud-provider"
    Looks like there are no changes for ClusterRoleBinding "kube-vip"
    Patch ClusterRoleBinding "harvester-cloud-provider" in namespace 
    Patch DaemonSet "kube-vip" in namespace kube-system
    Patch Deployment "harvester-cloud-provider" in namespace kube-system
    beginning wait for 8 resources with timeout of 10m0s
    Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
    DaemonSet is not ready: kube-system/kube-vip. 0 out of 1 expected pods have been scheduled
    Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
    ...
    2024-05-24T10:45:07.219067445Z Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
    2024-05-24T10:45:09.135615116Z Error: UPGRADE FAILED: context deadline exceeded
    
  2. rke2-server.service

    # journalctl -u rke2-server.service --follow
    May 24 10:13:43 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:43Z" level=error msg="error syncing 'kube-system/rke2-coredns': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-coredns\" not found, requeuing"
    May 24 10:13:44 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:44Z" level=error msg="error syncing 'kube-system/rke2-ingress-nginx': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-ingress-nginx\" not found, requeuing"
    May 24 10:13:45 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:45Z" level=error msg="error syncing 'kube-system/rke2-metrics-server': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-metrics-server\" not found, requeuing"
    May 24 10:13:45 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:45Z" level=error msg="error syncing 'kube-system/rke2-snapshot-controller-crd': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-controller-crd\" not found, requeuing"
    May 24 10:13:46 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:46Z" level=error msg="error syncing 'kube-system/rke2-snapshot-controller': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-controller\" not found, requeuing"
    May 24 10:13:46 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:46Z" level=error msg="error syncing 'kube-system/rke2-snapshot-validation-webhook': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-validation-webhook\" not found, requeuing"
    May 24 10:13:52 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:52Z" level=info msg="Adding node myrke2-128-pool1-ef6ffd22-7jtc8-a99bff44 etcd status condition"
    May 24 10:15:45 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:15:45Z" level=info msg="Tunnel authorizer set Kubelet Port 10250"
    May 24 10:22:04 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:22:04Z" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 11): map[field.cattle.io/projectId:c-m-c79fjcww:p-gz6xh listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-192.168.0.52:192.168.0.52 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-myrke2-128-pool1-ef6ffd22-7jtc8:myrke2-128-pool1-ef6ffd22-7jtc8 listener.cattle.io/fingerprint:SHA1=83CDD81E2BC9612F9C7E7E06A70E367C36573FC7]"
    May 24 10:22:04 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:22:04Z" level=info msg="Active TLS secret kube-system/rke2-serving (ver=4210) (count 11): map[field.cattle.io/projectId:c-m-c79fjcww:p-gz6xh listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-192.168.0.52:192.168.0.52 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-myrke2-128-pool1-ef6ffd22-7jtc8:myrke2-128-pool1-ef6ffd22-7jtc8 listener.cattle.io/fingerprint:SHA1=83CDD81E2BC9612F9C7E7E06A70E367C36573FC7]"
    
  3. rancher-system-agent.service

    # journalctl -u rancher-system-agent.service --follow
    May 24 10:34:11 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:11Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240524-103411/1a601d63558d61145f67d1569292e3450c7597bc81499ae31bebbec678d6a964_0"
    May 24 10:34:11 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:11Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
    May 24 10:34:12 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:12Z" level=info msg="[1a601d63558d61145f67d1569292e3450c7597bc81499ae31bebbec678d6a964_0:stdout]: Name Location Size Created"
    May 24 10:34:12 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:12Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
    May 24 10:34:12 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:12Z" level=info msg="[K8s] updated plan secret fleet-default/myrke2-128-bootstrap-template-45xrz-machine-plan with feedback"
    May 24 10:44:13 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:13Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240524-104413/1a601d63558d61145f67d1569292e3450c7597bc81499ae31bebbec678d6a964_0"
    May 24 10:44:13 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:13Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
    May 24 10:44:13 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:13Z" level=info msg="[1a601d63558d61145f67d1569292e3450c7597bc81499ae31bebbec678d6a964_0:stdout]: Name Location Size Created"
    May 24 10:44:13 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:13Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
    May 24 10:44:14 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:14Z" level=info msg="[K8s] updated plan secret fleet-default/myrke2-128-bootstrap-template-45xrz-machine-plan with feedback"
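
As a follow-up to log 1 above, a hedged sketch for checking the stuck release and rollout after helm hits the 10m timeout (release name and namespace are taken from the helm command line in that log):

    # Inspect the failed release and its upgrade history
    helm -n kube-system status harvester-cloud-provider
    helm -n kube-system history harvester-cloud-provider --max 5

    # Confirm the deployment rollout helm was waiting on still hasn't completed
    kubectl -n kube-system rollout status deployment/harvester-cloud-provider --timeout=30s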
    
@albinsun albinsun added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/3 Function working but has a major issue w/ workaround area/rancher Rancher related issues reproduce/always Reproducible 100% of the time area/cloud-provider Harvester cloud provider for guest cluster labels May 24, 2024
@albinsun albinsun added this to the v1.2.3 milestone May 24, 2024

albinsun commented May 24, 2024

The same test with rancher-v2.8.3 works fine: upgrading 0.2.300 -> 103.0.1+up0.2.3 (latest) succeeds.

@albinsun

One more trial.

@albinsun albinsun modified the milestones: v1.2.3, v1.4.0 May 29, 2024

bk201 commented May 29, 2024

This looks like a duplicate of #5382. @starbops, can you help check?

@khushboo-rancher

Just leaving a workaround here: if we delete the old pod, the new pod will spin up successfully.
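
A minimal sketch of that workaround, assuming the pods are in kube-system and carry an app.kubernetes.io/name=harvester-cloud-provider label (adjust to the actual labels/pod names):

    # Find the old (Running) and new (Pending) cloud-provider pods
    kubectl -n kube-system get pods -l app.kubernetes.io/name=harvester-cloud-provider

    # Delete the old pod so the Pending replacement can be scheduled on the single node
    kubectl -n kube-system delete pod <old-harvester-cloud-provider-pod>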

@starbops

The root cause is the same as #5382's. The image repo/tag of harvester-cloud-provider was changed, which triggered the deployment rollout. Since it is a single-node guest cluster, the situation described in #5348 happened.
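
For context, a sketch of how to confirm this on the guest cluster: the scheduler message implies a required pod anti-affinity rule, so on a single node the replacement pod cannot be placed while the old replica is still running, and the rollout never completes. The field paths below are assumptions based on a stock Deployment layout:

    # Show the anti-affinity rules and rollout strategy of the stuck deployment
    kubectl -n kube-system get deployment harvester-cloud-provider \
      -o jsonpath='{.spec.template.spec.affinity}{"\n"}{.spec.strategy}{"\n"}'

    # The rollout stays stuck until the old pod is removed (see the workaround above)
    kubectl -n kube-system rollout status deployment/harvester-cloud-provider --timeout=30s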

@starbops starbops added the not-require/test-plan Skip to create a e2e automation test issue label Jun 20, 2024

harvesterhci-io-github-bot commented Jun 20, 2024

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR been submitted?
    The HEP PR is at:

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at:

  • If labeled: area/ui Has the UI issue been filed or is the PR ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at: doc: add note about how to address ccm upgrade stuck docs#584

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:
