
[BUG] Upgrade stuck on post-drain job with the unresponsive VM #4095

Closed · Vicente-Cheng opened this issue Jun 16, 2023 · 8 comments
Labels: area/rancher, area/upgrade, blocker, kind/bug, not-require/test-plan, priority/0, reproduce/always, severity/1
Milestone: v1.2.0

Vicente-Cheng (Contributor) commented Jun 16, 2023

Describe the bug
Upgrade is stuck on the post-drain job, and you can observe these two situations:

  1. On the post-drain job, you can see logs similar to the following:

hvst-upgrade-79jcs-post-drain-harvester-node-0-2m8zh + curl -sSfL http://upgrade-repo-hvst-upgrade-79jcs.harvester-system/harvester-iso/rootfs.squashfs -o /host/usr/local/upgrade_tmp/tmp.ZqMu5Yk811
hvst-upgrade-79jcs-post-drain-harvester-node-0-2m8zh curl: (7) Failed to connect to upgrade-repo-hvst-upgrade-79jcs.harvester-system port 80 after 3080 ms: Couldn't connect to server
hvst-upgrade-79jcs-post-drain-harvester-node-0-2m8zh + echo 'Failed to download the requested file from "http://upgrade-repo-hvst-upgrade-79jcs.harvester-system/harvester-iso/rootfs.squashfs" to "/host/usr/local/upgrade_tmp/tmp.ZqMu5Yk811" with error code: 7, retrying (6)...'

  2. The upgrade VM has high CPU/memory usage, which makes it unresponsive.
     (screenshot: 2023-06-16 12:56:05 AM, the unresponsive upgrade VM)

To Reproduce

  1. Install Harvester v1.1.2
  2. Upgrade to v1.2.0-rc2
  3. There is a chance you will hit this issue.

Expected behavior
The post-drain job should complete smoothly.

Support bundle
None

Environment

  • Harvester ISO version: 1.1.2
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 3-VM cluster

Additional context
We can force delete this VM. After the VM relaunches, the upgrade continues. A sketch of that workaround is shown below.
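
A minimal sketch of the workaround (the VMI name upgrade-repo-hvst-upgrade-79jcs is taken from the log above; verify the actual name and namespace in your cluster first):

# Find the upgrade repo VM instance
kubectl -n harvester-system get vmi

# Force delete the unresponsive VMI so it gets relaunched
kubectl -n harvester-system delete vmi upgrade-repo-hvst-upgrade-79jcs --force --grace-period=0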

@Vicente-Cheng Vicente-Cheng added kind/bug Issues that are defects reported by users or that we know have reached a real release area/upgrade reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Jun 16, 2023
@Vicente-Cheng Vicente-Cheng changed the title [BUG] Upgrade stcuk on post-drin job with the unresponsive VM [BUG] Upgrade stuck on post-drain job with the unresponsive VM Jun 16, 2023
@guangbochen guangbochen added this to the v1.2.0 milestone Jun 16, 2023
bk201 (Member) commented Jun 16, 2023

Seeing this in the upgrade-vm console:
(screenshot: 2023-06-16 at 11:25:55 AM, upgrade-vm console)

@guangbochen guangbochen added priority/0 Must be fixed in this release reproduce/always Reproducible 100% of the time severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) and removed reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Jun 20, 2023
bk201 (Member) commented Jun 20, 2023

Analysis of the issue:

We saw plan secret changes; attached are the old and new plans:
new-plan.json.txt
old-plan.json.txt
And the diff:

--- old-plan	2023-06-20 14:02:28.012183902 +0800
+++ new-plan	2023-06-20 14:02:24.608116748 +0800
@@ -1,7 +1,10 @@
 {
   "files": [
     {
-      "content": "ewogICJhZ2VudC10b2tlbiI6ICJhYSBiYiBjYyIsCiAgImNuaSI6ICJjYWxpY28iLAogICJrdWJlLWNvbnRyb2xsZXItbWFuYWdlci1hcmciOiBbCiAgICAiY2VydC1kaXI9L3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1jb250cm9sbGVyLW1hbmFnZXIiLAogICAgInNlY3VyZS1wb3J0PTEwMjU3IgogIF0sCiAgImt1YmUtY29udHJvbGxlci1tYW5hZ2VyLWV4dHJhLW1vdW50IjogWwogICAgIi92YXIvbGliL3JhbmNoZXIvcmtlMi9zZXJ2ZXIvdGxzL2t1YmUtY29udHJvbGxlci1tYW5hZ2VyOi92YXIvbGliL3JhbmNoZXIvcmtlMi9zZXJ2ZXIvdGxzL2t1YmUtY29udHJvbGxlci1tYW5hZ2VyIgogIF0sCiAgImt1YmUtc2NoZWR1bGVyLWFyZyI6IFsKICAgICJjZXJ0LWRpcj0vdmFyL2xpYi9yYW5jaGVyL3JrZTIvc2VydmVyL3Rscy9rdWJlLXNjaGVkdWxlciIsCiAgICAic2VjdXJlLXBvcnQ9MTAyNTkiCiAgXSwKICAia3ViZS1zY2hlZHVsZXItZXh0cmEtbW91bnQiOiBbCiAgICAiL3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1zY2hlZHVsZXI6L3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1zY2hlZHVsZXIiCiAgXSwKICAibm9kZS1sYWJlbCI6IFsKICAgICJya2UuY2F0dGxlLmlvL21hY2hpbmU9Zjg0MjJhYjktN2Q2NS00MDk5LWE1ZjYtMGFjOTQ0OWQ2OTc3IgogIF0sCiAgInRva2VuIjogImFhIGJiIGNjIgp9",
+      "path": "/etc/rancher/rke2/registries.yaml"
+    },
+    {
+      "content": "ewogICJhZ2VudC10b2tlbiI6ICJhYSBiYiBjYyIsCiAgImNuaSI6ICJjYWxpY28iLAogICJrdWJlLWNvbnRyb2xsZXItbWFuYWdlci1hcmciOiBbCiAgICAiY2VydC1kaXI9L3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1jb250cm9sbGVyLW1hbmFnZXIiLAogICAgInNlY3VyZS1wb3J0PTEwMjU3IgogIF0sCiAgImt1YmUtY29udHJvbGxlci1tYW5hZ2VyLWV4dHJhLW1vdW50IjogWwogICAgIi92YXIvbGliL3JhbmNoZXIvcmtlMi9zZXJ2ZXIvdGxzL2t1YmUtY29udHJvbGxlci1tYW5hZ2VyOi92YXIvbGliL3JhbmNoZXIvcmtlMi9zZXJ2ZXIvdGxzL2t1YmUtY29udHJvbGxlci1tYW5hZ2VyIgogIF0sCiAgImt1YmUtc2NoZWR1bGVyLWFyZyI6IFsKICAgICJjZXJ0LWRpcj0vdmFyL2xpYi9yYW5jaGVyL3JrZTIvc2VydmVyL3Rscy9rdWJlLXNjaGVkdWxlciIsCiAgICAic2VjdXJlLXBvcnQ9MTAyNTkiCiAgXSwKICAia3ViZS1zY2hlZHVsZXItZXh0cmEtbW91bnQiOiBbCiAgICAiL3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1zY2hlZHVsZXI6L3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1zY2hlZHVsZXIiCiAgXSwKICAibm9kZS1sYWJlbCI6IFsKICAgICJya2UuY2F0dGxlLmlvL21hY2hpbmU9Zjg0MjJhYjktN2Q2NS00MDk5LWE1ZjYtMGFjOTQ0OWQ2OTc3IgogIF0sCiAgInByaXZhdGUtcmVnaXN0cnkiOiAiL2V0Yy9yYW5jaGVyL3JrZTIvcmVnaXN0cmllcy55YW1sIiwKICAidG9rZW4iOiAiYWEgYmIgY2MiCn0=",
       "path": "/etc/rancher/rke2/config.yaml.d/50-rancher.yaml"
     },
     {
@@ -10,7 +13,7 @@
       "minor": true
     },
     {
-      "content": "CmFwaVZlcnNpb246IHYxCmtpbmQ6IENvbmZpZ01hcAptZXRhZGF0YToKICBuYW1lOiBya2UyLWV0Y2Qtc25hcHNob3QtZXh0cmEtbWV0YWRhdGEKICBuYW1lc3BhY2U6IGt1YmUtc3lzdGVtCmRhdGE6CiAgcHJvdmlzaW9uaW5nLWNsdXN0ZXItc3BlYzogSDRzSUFBQUFBQUFBLytTU3dXcmpRQXlHMzBYWE5XRWQ5dVRiRW51M3R3WlMwck15Vmh6aHNXUTBtcFFRL080bFR1SkNYNkczbWUvL2tmUUxYYUhQQnpJaHA3UW5TNndDRlp6TDFmclBxaXgvV1U5cks2RUE2Mm1qY3VRT3FpdmtzVE5zYWVlR1R0M2xob0tLbThadFJLSGFrT1YxZEZaSk40MEVENUZhcUk0WUV4VndWQXUwL0xnVE5hcVJCcFVkZVlKS2NveFAzcGlwcGNYY1VpU25aaGo5VXJQVjZQZ2xjYnExYWM0Y2ZFN3g0SjFob0MwWmF3dlY3d0tjQjlMczh6djFQTDRqK3orMWVxNzhkaGQzRkZUYU5IdEd1d2Q2VWUyWDRVWk4vcDFPQlh5bzlXUS9OZjlVUURpaCtSNWpwc1U2WURpeDBQK29CNHpQRzNyc0syckF1SWs1T2RuZjdLZEcybEZaSEtyck5IMENBQUQvL3dFQUFQLy9UdkNKTnBzQ0FBQT0K",
+      "content": "CmFwaVZlcnNpb246IHYxCmtpbmQ6IENvbmZpZ01hcAptZXRhZGF0YToKICBuYW1lOiBya2UyLWV0Y2Qtc25hcHNob3QtZXh0cmEtbWV0YWRhdGEKICBuYW1lc3BhY2U6IGt1YmUtc3lzdGVtCmRhdGE6CiAgcHJvdmlzaW9uaW5nLWNsdXN0ZXItc3BlYzogSDRzSUFBQUFBQUFBLytTU1FXdnpNQXlHLzR1dlh5aGZ5azY1alNiYmJpdDBkR2ZWVVZJVFJUS3kzRkZLLy90SXVtWnNmMkUzKzMxZVpEK2dpeHZ5QVpYUk1PMVJVeEIybFR1VnEvWERxaXovNllCckxWM2hkTUNOY0JkNlYxMWNqcjFDaXp0VE1PelBVK1NGVFlXMkJJeTFRdURYYUVFNFRRd1pEb1N0cXpxZ2hJWHJSRDB1dDlDektOYUFvL0FPTGJtS005RTliMVJGMDFKdWtkQ3dHYU9kNjZBMUdIeWprS1pubWxQd05sdDg1YjJDeHkxcWtOWlYvd3RuWVVUSk5wL1RFT0k3QkhzU3JlZkpiemU0UXkvY3Bya1Q5U2IwSWpJc240dVM3SGQ2TGR5SDZJRDZWLzJ2aGZOSFVOc0RaVnlxSS9oallId21PUURkZCtnSDJvcFFqUjFrbXV3djB4d1NEN1NobkF6MU1kdXg0VFpLWUp2eEp3QUFBUC8vQVFBQS8vK05YOS94dEFJQUFBPT0K",
       "path": "/var/lib/rancher/rke2/server/manifests/rancher/rke2-etcd-snapshot-extra-metadata.yaml",
       "dynamic": true,
       "minor": true
@@ -22,6 +25,12 @@
     {
       "path": "/var/lib/rancher/rke2/server/manifests/rancher/managed-chart-config.yaml",
       "dynamic": true
+    },
+    {
+      "content": "CiMhL2Jpbi9zaAoKY3VycmVudEhhc2g9IiIKa2V5PSQxCnRhcmdldEhhc2g9JDIKaGFzaGVkQ21kPSQzCmNtZD0kNApzaGlmdCA0CgpkYXRhUm9vdD0iL3Zhci9saWIvcmFuY2hlci9jYXByL2lkZW1wb3RlbmNlLyRrZXkvJGhhc2hlZENtZCIKaGFzaEZpbGU9IiRkYXRhUm9vdC9oYXNoIgphdHRlbXB0RmlsZT0iJGRhdGFSb290L2F0dGVtcHQiCgpjdXJyZW50SGFzaD0kKGNhdCAiJGhhc2hGaWxlIiB8fCBlY2hvICIiKQpjdXJyZW50QXR0ZW1wdD0kKGNhdCAiJGF0dGVtcHRGaWxlIiB8fCBlY2hvICItMSIpCgppZiBbICIkY3VycmVudEhhc2giICE9ICIkdGFyZ2V0SGFzaCIgXSAmJiBbICIkY3VycmVudEF0dGVtcHQiICE9ICIkQ0FUVExFX0FHRU5UX0FUVEVNUFRfTlVNQkVSIiBdOyB0aGVuCglta2RpciAtcCAiJGRhdGFSb290IgoJZWNobyAiJHRhcmdldEhhc2giID4gIiRoYXNoRmlsZSIKCWVjaG8gIiRDQVRUTEVfQUdFTlRfQVRURU1QVF9OVU1CRVIiID4gIiRhdHRlbXB0RmlsZSIKCWV4ZWMgIiRjbWQiICIkQCIKZWxzZQoJZWNobyAiYWN0aW9uIGhhcyBhbHJlYWR5IGJlZW4gcmVjb25jaWxlZCB0byB0aGUgY3VycmVudCBoYXNoICRjdXJyZW50SGFzaCBhdCBhdHRlbXB0ICRjdXJyZW50QXR0ZW1wdCIKZmkK",
+      "path": "/var/lib/rancher/capr/idempotence/idempotent.sh",
+      "dynamic": true,
+      "minor": true
     }
   ],
   "instructions": [
@@ -29,7 +38,7 @@
       "name": "install",
       "image": "rancher/system-agent-installer-rke2:v1.24.11-rke2r1",
       "env": [
-        "RESTART_STAMP=febf3b60fea17222e7649db8b1527380b3355d7a9d11e33d98bf575da0cf0600"
+        "RESTART_STAMP=35bbaf5a8ce0b602d5297fc226e966c847bfa7de8a278e1a52ec19612c81294a"
       ],
       "args": [
         "-c",

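Decoding the base64 content fields above shows the new plan adds /etc/rancher/rke2/registries.yaml, a "private-registry" entry in 50-rancher.yaml, an idempotent.sh helper script, and a new RESTART_STAMP. A minimal sketch for inspecting a plan yourself (assuming jq and base64 are available and the attached new-plan.json.txt is used):

# List the managed file paths in the plan
jq -r '.files[].path' new-plan.json.txt

# Decode a specific entry, e.g. the RKE2 config snippet
jq -r '.files[] | select(.path == "/etc/rancher/rke2/config.yaml.d/50-rancher.yaml").content' new-plan.json.txt | base64 -d
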
w13915984028 (Member) commented:

Also encountered a similar issue (more like #4096) when testing PR #4117; the upgrade takes longer, but at least #4117 runs OK.

Error from server: Get "https://192.168.122.131:10250/containerLogs/harvester-system/hvst-upgrade-scnqb-apply-manifests-qqbrf/apply": proxy error from 127.0.0.1:9345 while dialing 192.168.122.131:10250, code 503: 503 Service Unavailable
Tue Jun 20 20:41:14 UTC 2023

Error from server: Get "https://192.168.122.131:10250/containerLogs/harvester-system/hvst-upgrade-scnqb-apply-manifests-qqbrf/apply": proxy error from 127.0.0.1:9345 while dialing 192.168.122.131:10250, code 503: 503 Service Unavailable
Tue Jun 20 20:41:34 UTC 2023

harv31:~ # kubectl get pods -A | grep upgrade
cattle-system               apply-hvst-upgrade-scnqb-prepare-on-harv31-with-d23a54392-k64zc   0/1     Completed     0             11m
cattle-system               apply-system-agent-upgrader-on-harv31-with-efb0f06adc9836-jcp4r   0/1     Completed     0             3m42s
cattle-system               system-upgrade-controller-5685d568ff-96xk9                        1/1     Running       0             3m12s
harvester-system            hvst-upgrade-scnqb-apply-manifests-qqbrf                          1/1     Running       0             5m15s
harvester-system            hvst-upgrade-scnqb-upgradelog-downloader-6b4f866c4b-vpsfk         1/1     Running       0             13m
harvester-system            hvst-upgrade-scnqb-upgradelog-infra-fluentbit-wr7qz               1/1     Running       0             13m
harvester-system            hvst-upgrade-scnqb-upgradelog-infra-fluentd-0                     2/2     Running       0             13m
harvester-system            virt-launcher-upgrade-repo-hvst-upgrade-scnqb-5lt6w               1/1     Running       0             12m
longhorn-system             longhorn-post-upgrade-42l65                                       0/1     Pending       0             78s

harv31:~ # kubectl get pods -A | grep upgrade
cattle-system               apply-hvst-upgrade-scnqb-prepare-on-harv31-with-d23a54392-k64zc   0/1     Completed           0             13m
cattle-system               apply-system-agent-upgrader-on-harv31-with-efb0f06adc9836-jcp4r   0/1     Completed           0             5m45s
cattle-system               system-upgrade-controller-5685d568ff-96xk9                        1/1     Running             0             5m15s
harvester-system            hvst-upgrade-scnqb-apply-manifests-qqbrf                          1/1     Running             0             7m18s
harvester-system            hvst-upgrade-scnqb-upgradelog-downloader-6b4f866c4b-s7z9s         0/1     ContainerCreating   0             72s
harvester-system            hvst-upgrade-scnqb-upgradelog-downloader-6b4f866c4b-vpsfk         0/1     Terminating         0             15m
harvester-system            hvst-upgrade-scnqb-upgradelog-infra-fluentbit-wr7qz               1/1     Running             0             15m
harvester-system            hvst-upgrade-scnqb-upgradelog-infra-fluentd-0                     0/2     Terminating         0             15m
harvester-system            virt-launcher-upgrade-repo-hvst-upgrade-scnqb-5lt6w               1/1     Running             0             14m
harv31:~ # 

...
harv31:~ # kubectl get pods -A | grep upgrade
cattle-system              apply-system-agent-upgrader-on-harv31-with-efb0f06adc9836-jcp4r   0/1     Completed   0               15m
cattle-system              system-upgrade-controller-5685d568ff-96xk9                        1/1     Running     0               14m
harvester-system           hvst-upgrade-scnqb-apply-manifests-qqbrf                          0/1     Completed   0               16m
harvester-system           hvst-upgrade-scnqb-single-node-upgrade-harv31-zrt2t               1/1     Running     0               3m26s
harvester-system           hvst-upgrade-scnqb-upgradelog-downloader-6b4f866c4b-s7z9s         1/1     Running     0               10m
harvester-system           hvst-upgrade-scnqb-upgradelog-infra-fluentbit-22gww               1/1     Running     0               3m32s
harvester-system           hvst-upgrade-scnqb-upgradelog-infra-fluentd-0                     2/2     Running     0               3m33s
harv31:~ #

w13915984028 (Member) commented Jun 20, 2023

In my single-node cluster upgrade environment, it finally ran into #4119,
with this error in the fleet-controller:

time="2023-06-20T20:51:23Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-local-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

@bk201 bk201 added severity/1 Function broken (a critical incident with very high impact) and removed severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) labels Jun 21, 2023
starbops (Member) commented:

The upgrade will proceed after temporarily scaling the fleet-agent deployment back up to one replica. We applied the workaround (scale it down to zero) in #3643. Since the upstream issue rancher/fleet#637 was closed, it may be time to re-evaluate the behavior of the new fleet-agent. A sketch of the temporary scale-up follows.
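
A minimal sketch of that temporary scale-up (assuming the fleet-agent deployment lives in the cattle-fleet-local-system namespace of the local cluster; the namespace may differ per setup):

# Temporarily restore one replica so the upgrade can proceed
kubectl -n cattle-fleet-local-system scale deployment fleet-agent --replicas=1

# Afterwards, the #3643 workaround state can be restored
kubectl -n cattle-fleet-local-system scale deployment fleet-agent --replicas=0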

bk201 (Member) commented Jun 28, 2023

Created a Rancher issue to ask for help: rancher/rancher#41965

@starbops starbops added the not-require/test-plan Skip to create a e2e automation test issue label Jul 3, 2023
harvesterhci-io-github-bot commented Jul 3, 2023

Pre Ready-For-Testing Checklist

  • ~~If NOT labeled not-require/test-plan: Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only a test case skeleton w/o implementation, have you created an implementation issue?
    • The automation skeleton PR is at:
    • The automation test case PR is at:~~

  • If the fix introduces code for backward compatibility: Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@TachunLin TachunLin self-assigned this Jul 12, 2023
TachunLin commented:

Verified fixed on v1.2.0-rc3. Closing this issue.

Result

  • We can completely upgrade a three-node Harvester cluster from v1.1.2 to v1.2.0-rc3 with VMs created.

  • Checked that rke2-server.service on each node did not restart while the system services were being upgraded:

    node1:~ # sudo systemctl status rke2-server.service
    ● rke2-server.service - Rancher Kubernetes Engine v2 (server)
         Loaded: loaded (/etc/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/rke2-server.service.d
                 └─override.conf
         Active: active (running) since Wed 2023-07-12 09:40:17 UTC; 2h 3min ago
    
    node2:~ # sudo systemctl status rke2-server.service
    ● rke2-server.service - Rancher Kubernetes Engine v2 (server)
         Loaded: loaded (/etc/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/rke2-server.service.d
                 └─override.conf
         Active: active (running) since Wed 2023-07-12 10:37:26 UTC; 1h 8min ago
    
    node3:~ # sudo systemctl status rke2-server.service
    ● rke2-server.service - Rancher Kubernetes Engine v2 (server)
         Loaded: loaded (/etc/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/rke2-server.service.d
                 └─override.conf
         Active: active (running) since Wed 2023-07-12 10:38:21 UTC; 1h 8min ago
    
  • The post-drain pods complete correctly.

Test Information

  • Test Environment: 3-node Harvester cluster on bare-metal machines
  • Harvester version: v1.2.0-rc3

Verify Steps

  1. Prepare the three-node v1.1.2 Harvester cluster
  2. Create VMs on different nodes
  3. Upgrade to v1.2.0-rc3
  4. Monitor the upgrade process and access each Harvester node
  5. Check the Active time period of rke2-server.service on server nodes and rke2-agent.service on worker nodes (see the sketch after this list):
     sudo systemctl status rke2-server.service
  6. Check that the upgrade finishes completely