
[BUG] Fail to Upgrade harvester-cloud-provider on RKE2 cluster (rancher-v2.8.4) #5873

Open
albinsun opened this issue May 24, 2024 · 6 comments

albinsun commented May 24, 2024

Describe the bug
Upgrading harvester-cloud-provider on an RKE2 (v1.28) guest cluster managed by Rancher v2.8.4 fails; the replacement pod cannot be scheduled:

0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

To Reproduce
Steps to reproduce the behavior:

  1. Set up harvester-v1.2.2 and rancher-v2.8.4

  2. Import Harvester to Rancher

  3. Create RKE2 cluster (v1.28.9+rke2r1) with the harvester-cloud-provider and harvester-csi-driver charts installed

  4. Upgrade harvester-cloud-provider (0.2.300 -> 103.0.2+up0.2.4) ❌

    • Upgrade: trigger the chart upgrade in Rancher
    • After: the upgrade hangs and the new harvester-cloud-provider pod stays Pending with
      0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
      (a kubectl sketch for inspecting the Pending pod is below)
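
For reference, a minimal way to inspect the Pending pod from inside the guest cluster (the label selector is an assumption; adjust it to the actual pod labels):

    # List the cloud-provider pods; after the upgrade one stays Pending
    kubectl -n kube-system get pods -l app.kubernetes.io/name=harvester-cloud-provider

    # The Events section shows the FailedScheduling message quoted above
    kubectl -n kube-system describe pod <pending-harvester-cloud-provider-pod>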

Expected behavior
The harvester-cloud-provider app should upgrade successfully through Rancher.

Support bundle
supportbundle_284FailUpgradeCloudProvider-24T10-50-58Z.zip

Rancher Log
myrke2-128-pool1-ef6ffd22-7jtc8-2024-05-24_10_51_16.tar.gz

Environment

  • Harvester
    • Version: v1.2.2
    • Profile: QEMU/KVM, single node (16C/32G/500G)
    • ui-source: Auto
  • Rancher
    • Version: v2.8.4
    • Profile: Helm (v1.28.9+k3s1) in QEMU/KVM (2C/4G)
    • RKE2 Version: v1.28.9+rke2r1

Additional context

  1. helm-operation-cfns7_undefined.log (see the helm/kubectl sketch at the end of this list)

    helm upgrade --history-max=5 --install=true --namespace=kube-system --timeout=10m0s --values=/home/shell/helm/values-harvester-cloud-provider-103.0.2-up0.2.4.yaml --version=103.0.2+up0.2.4 --wait=true harvester-cloud-provider /home/shell/helm/harvester-cloud-provider-103.0.2-up0.2.4.tgz
    checking 8 resources for changes
    Looks like there are no changes for ServiceAccount "kube-vip"
    Patch ServiceAccount "harvester-cloud-provider" in namespace kube-system
    Looks like there are no changes for ClusterRole "kube-vip"
    Looks like there are no changes for ClusterRole "harvester-cloud-provider"
    Looks like there are no changes for ClusterRoleBinding "kube-vip"
    Patch ClusterRoleBinding "harvester-cloud-provider" in namespace 
    Patch DaemonSet "kube-vip" in namespace kube-system
    Patch Deployment "harvester-cloud-provider" in namespace kube-system
    beginning wait for 8 resources with timeout of 10m0s
    Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
    DaemonSet is not ready: kube-system/kube-vip. 0 out of 1 expected pods have been scheduled
    Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
    ...
    2024-05-24T10:45:07.219067445Z Deployment is not ready: kube-system/harvester-cloud-provider. 0 out of 1 expected pods are ready
    2024-05-24T10:45:09.135615116Z Error: UPGRADE FAILED: context deadline exceeded
    
  2. rke2-server.service

    # journalctl -u rke2-server.service --follow
    May 24 10:13:43 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:43Z" level=error msg="error syncing 'kube-system/rke2-coredns': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-coredns\" not found, requeuing"
    May 24 10:13:44 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:44Z" level=error msg="error syncing 'kube-system/rke2-ingress-nginx': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-ingress-nginx\" not found, requeuing"
    May 24 10:13:45 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:45Z" level=error msg="error syncing 'kube-system/rke2-metrics-server': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-metrics-server\" not found, requeuing"
    May 24 10:13:45 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:45Z" level=error msg="error syncing 'kube-system/rke2-snapshot-controller-crd': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-controller-crd\" not found, requeuing"
    May 24 10:13:46 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:46Z" level=error msg="error syncing 'kube-system/rke2-snapshot-controller': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-controller\" not found, requeuing"
    May 24 10:13:46 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:46Z" level=error msg="error syncing 'kube-system/rke2-snapshot-validation-webhook': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-validation-webhook\" not found, requeuing"
    May 24 10:13:52 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:13:52Z" level=info msg="Adding node myrke2-128-pool1-ef6ffd22-7jtc8-a99bff44 etcd status condition"
    May 24 10:15:45 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:15:45Z" level=info msg="Tunnel authorizer set Kubelet Port 10250"
    May 24 10:22:04 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:22:04Z" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 11): map[field.cattle.io/projectId:c-m-c79fjcww:p-gz6xh listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-192.168.0.52:192.168.0.52 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-myrke2-128-pool1-ef6ffd22-7jtc8:myrke2-128-pool1-ef6ffd22-7jtc8 listener.cattle.io/fingerprint:SHA1=83CDD81E2BC9612F9C7E7E06A70E367C36573FC7]"
    May 24 10:22:04 myrke2-128-pool1-ef6ffd22-7jtc8 rke2[2322]: time="2024-05-24T10:22:04Z" level=info msg="Active TLS secret kube-system/rke2-serving (ver=4210) (count 11): map[field.cattle.io/projectId:c-m-c79fjcww:p-gz6xh listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-192.168.0.52:192.168.0.52 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-myrke2-128-pool1-ef6ffd22-7jtc8:myrke2-128-pool1-ef6ffd22-7jtc8 listener.cattle.io/fingerprint:SHA1=83CDD81E2BC9612F9C7E7E06A70E367C36573FC7]"
    
  3. rancher-system-agent.service

    # journalctl -u rancher-system-agent.service --follow
    May 24 10:34:11 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:11Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240524-103411/1a601d63558d61145f67d1569292e3450c7597bc81499ae31bebbec678d6a964_0"
    May 24 10:34:11 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:11Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
    May 24 10:34:12 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:12Z" level=info msg="[1a601d63558d61145f67d1569292e3450c7597bc81499ae31bebbec678d6a964_0:stdout]: Name Location Size Created"
    May 24 10:34:12 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:12Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
    May 24 10:34:12 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:34:12Z" level=info msg="[K8s] updated plan secret fleet-default/myrke2-128-bootstrap-template-45xrz-machine-plan with feedback"
    May 24 10:44:13 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:13Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240524-104413/1a601d63558d61145f67d1569292e3450c7597bc81499ae31bebbec678d6a964_0"
    May 24 10:44:13 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:13Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
    May 24 10:44:13 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:13Z" level=info msg="[1a601d63558d61145f67d1569292e3450c7597bc81499ae31bebbec678d6a964_0:stdout]: Name Location Size Created"
    May 24 10:44:13 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:13Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
    May 24 10:44:14 myrke2-128-pool1-ef6ffd22-7jtc8 rancher-system-agent[16919]: time="2024-05-24T10:44:14Z" level=info msg="[K8s] updated plan secret fleet-default/myrke2-128-bootstrap-template-45xrz-machine-plan with feedback"
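
As a follow-up to log 1 above, a hedged sketch for checking the stuck release and rollout after helm hits the 10m timeout (release name and namespace are taken from the helm command line in that log):

    # Inspect the failed release and its upgrade history
    helm -n kube-system status harvester-cloud-provider
    helm -n kube-system history harvester-cloud-provider --max 5

    # Confirm the deployment rollout helm was waiting on still hasn't completed
    kubectl -n kube-system rollout status deployment/harvester-cloud-provider --timeout=30s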
    
@albinsun albinsun added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/3 Function working but has a major issue w/ workaround area/rancher Rancher related issues reproduce/always Reproducible 100% of the time area/cloud-provider Harvester cloud provider for guest cluster labels May 24, 2024
@albinsun albinsun added this to the v1.2.3 milestone May 24, 2024

albinsun commented May 24, 2024

The same test with rancher-v2.8.3 works fine: upgrading 0.2.300 -> 103.0.1+up0.2.3 (latest) succeeds.

@albinsun

One more trial.

@albinsun albinsun modified the milestones: v1.2.3, v1.4.0 May 29, 2024

bk201 commented May 29, 2024

This looks like a duplicate of #5382. @starbops, can you help check?

@khushboo-rancher

Just leaving a workaround here: if we delete the old pod, the new pod will spin up successfully.
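
A minimal sketch of that workaround, assuming the pods are in kube-system and carry an app.kubernetes.io/name=harvester-cloud-provider label (adjust to the actual labels/pod names):

    # Find the old (Running) and new (Pending) cloud-provider pods
    kubectl -n kube-system get pods -l app.kubernetes.io/name=harvester-cloud-provider

    # Delete the old pod so the Pending replacement can be scheduled on the single node
    kubectl -n kube-system delete pod <old-harvester-cloud-provider-pod>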

@starbops

The root cause is the same as #5382's. The image repo/tag of harvester-cloud-provider was changed, which triggered the deployment rollout. Since it is a single-node guest cluster, the situation described in #5348 happened.
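
For context, a sketch of how to confirm this on the guest cluster: the scheduler message implies a required pod anti-affinity rule, so on a single node the replacement pod cannot be placed while the old replica is still running, and the rollout never completes. The field paths below are assumptions based on a stock Deployment layout:

    # Show the anti-affinity rules and rollout strategy of the stuck deployment
    kubectl -n kube-system get deployment harvester-cloud-provider \
      -o jsonpath='{.spec.template.spec.affinity}{"\n"}{.spec.strategy}{"\n"}'

    # The rollout stays stuck until the old pod is removed (see the workaround above)
    kubectl -n kube-system rollout status deployment/harvester-cloud-provider --timeout=30s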

@starbops starbops added the not-require/test-plan Skip to create a e2e automation test issue label Jun 20, 2024

harvesterhci-io-github-bot commented Jun 20, 2024

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR been submitted?
    The HEP PR is at:

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at:

  • If labeled: area/ui Has the UI issue been filed or is the PR ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at: doc: add note about how to address ccm upgrade stuck docs#584

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:
