
[BUG] Upgrade stuck on post-drain job with the unresponsive VM #4095

Closed · Vicente-Cheng opened this issue Jun 16, 2023 · 8 comments
Labels: area/rancher, area/upgrade, blocker, kind/bug, not-require/test-plan, priority/0, reproduce/always, severity/1
Milestone: v1.2.0

Vicente-Cheng (Contributor) commented Jun 16, 2023

Describe the bug
Upgrade is stuck on the post-drain job, and you can observe these two situations:

  1. On the post-drain job, you can see logs similar to the following:

hvst-upgrade-79jcs-post-drain-harvester-node-0-2m8zh + curl -sSfL http://upgrade-repo-hvst-upgrade-79jcs.harvester-system/harvester-iso/rootfs.squashfs -o /host/usr/local/upgrade_tmp/tmp.ZqMu5Yk811
hvst-upgrade-79jcs-post-drain-harvester-node-0-2m8zh curl: (7) Failed to connect to upgrade-repo-hvst-upgrade-79jcs.harvester-system port 80 after 3080 ms: Couldn't connect to server
hvst-upgrade-79jcs-post-drain-harvester-node-0-2m8zh + echo 'Failed to download the requested file from "http://upgrade-repo-hvst-upgrade-79jcs.harvester-system/harvester-iso/rootfs.squashfs" to "/host/usr/local/upgrade_tmp/tmp.ZqMu5Yk811" with error code: 7, retrying (6)...'

  2. The upgrade VM has high CPU/memory usage, which makes it unresponsive.
     (screenshot: 2023-06-16 12:56:05 AM, the unresponsive upgrade VM)

To Reproduce

  1. Install Harvester v1.1.2
  2. Upgrade to v1.2.0-rc2
  3. There is a chance you will hit this issue.

Expected behavior
The post-drain job should complete smoothly.

Support bundle
None

Environment

  • Harvester ISO version: 1.1.2
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 3-VM cluster

Additional context
We can force delete this VM. After the VM relaunches, the upgrade continues. A sketch of that workaround is shown below.
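
A minimal sketch of the workaround (the VMI name upgrade-repo-hvst-upgrade-79jcs is taken from the log above; verify the actual name and namespace in your cluster first):

# Find the upgrade repo VM instance
kubectl -n harvester-system get vmi

# Force delete the unresponsive VMI so it gets relaunched
kubectl -n harvester-system delete vmi upgrade-repo-hvst-upgrade-79jcs --force --grace-period=0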

@Vicente-Cheng Vicente-Cheng added kind/bug Issues that are defects reported by users or that we know have reached a real release area/upgrade reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Jun 16, 2023
@Vicente-Cheng Vicente-Cheng changed the title [BUG] Upgrade stcuk on post-drin job with the unresponsive VM [BUG] Upgrade stuck on post-drain job with the unresponsive VM Jun 16, 2023
@guangbochen guangbochen added this to the v1.2.0 milestone Jun 16, 2023
bk201 (Member) commented Jun 16, 2023

Seeing this in the upgrade-vm console:
(screenshot: 2023-06-16 at 11:25:55 AM, upgrade-vm console)

@guangbochen guangbochen added priority/0 Must be fixed in this release reproduce/always Reproducible 100% of the time severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) and removed reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Jun 20, 2023
bk201 (Member) commented Jun 20, 2023

Analysis of the issue:

We saw plan secret changes; attached are the old and new plans:
new-plan.json.txt
old-plan.json.txt
And the diff:

--- old-plan	2023-06-20 14:02:28.012183902 +0800
+++ new-plan	2023-06-20 14:02:24.608116748 +0800
@@ -1,7 +1,10 @@
 {
   "files": [
     {
-      "content": "ewogICJhZ2VudC10b2tlbiI6ICJhYSBiYiBjYyIsCiAgImNuaSI6ICJjYWxpY28iLAogICJrdWJlLWNvbnRyb2xsZXItbWFuYWdlci1hcmciOiBbCiAgICAiY2VydC1kaXI9L3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1jb250cm9sbGVyLW1hbmFnZXIiLAogICAgInNlY3VyZS1wb3J0PTEwMjU3IgogIF0sCiAgImt1YmUtY29udHJvbGxlci1tYW5hZ2VyLWV4dHJhLW1vdW50IjogWwogICAgIi92YXIvbGliL3JhbmNoZXIvcmtlMi9zZXJ2ZXIvdGxzL2t1YmUtY29udHJvbGxlci1tYW5hZ2VyOi92YXIvbGliL3JhbmNoZXIvcmtlMi9zZXJ2ZXIvdGxzL2t1YmUtY29udHJvbGxlci1tYW5hZ2VyIgogIF0sCiAgImt1YmUtc2NoZWR1bGVyLWFyZyI6IFsKICAgICJjZXJ0LWRpcj0vdmFyL2xpYi9yYW5jaGVyL3JrZTIvc2VydmVyL3Rscy9rdWJlLXNjaGVkdWxlciIsCiAgICAic2VjdXJlLXBvcnQ9MTAyNTkiCiAgXSwKICAia3ViZS1zY2hlZHVsZXItZXh0cmEtbW91bnQiOiBbCiAgICAiL3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1zY2hlZHVsZXI6L3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1zY2hlZHVsZXIiCiAgXSwKICAibm9kZS1sYWJlbCI6IFsKICAgICJya2UuY2F0dGxlLmlvL21hY2hpbmU9Zjg0MjJhYjktN2Q2NS00MDk5LWE1ZjYtMGFjOTQ0OWQ2OTc3IgogIF0sCiAgInRva2VuIjogImFhIGJiIGNjIgp9",
+      "path": "/etc/rancher/rke2/registries.yaml"
+    },
+    {
+      "content": "ewogICJhZ2VudC10b2tlbiI6ICJhYSBiYiBjYyIsCiAgImNuaSI6ICJjYWxpY28iLAogICJrdWJlLWNvbnRyb2xsZXItbWFuYWdlci1hcmciOiBbCiAgICAiY2VydC1kaXI9L3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1jb250cm9sbGVyLW1hbmFnZXIiLAogICAgInNlY3VyZS1wb3J0PTEwMjU3IgogIF0sCiAgImt1YmUtY29udHJvbGxlci1tYW5hZ2VyLWV4dHJhLW1vdW50IjogWwogICAgIi92YXIvbGliL3JhbmNoZXIvcmtlMi9zZXJ2ZXIvdGxzL2t1YmUtY29udHJvbGxlci1tYW5hZ2VyOi92YXIvbGliL3JhbmNoZXIvcmtlMi9zZXJ2ZXIvdGxzL2t1YmUtY29udHJvbGxlci1tYW5hZ2VyIgogIF0sCiAgImt1YmUtc2NoZWR1bGVyLWFyZyI6IFsKICAgICJjZXJ0LWRpcj0vdmFyL2xpYi9yYW5jaGVyL3JrZTIvc2VydmVyL3Rscy9rdWJlLXNjaGVkdWxlciIsCiAgICAic2VjdXJlLXBvcnQ9MTAyNTkiCiAgXSwKICAia3ViZS1zY2hlZHVsZXItZXh0cmEtbW91bnQiOiBbCiAgICAiL3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1zY2hlZHVsZXI6L3Zhci9saWIvcmFuY2hlci9ya2UyL3NlcnZlci90bHMva3ViZS1zY2hlZHVsZXIiCiAgXSwKICAibm9kZS1sYWJlbCI6IFsKICAgICJya2UuY2F0dGxlLmlvL21hY2hpbmU9Zjg0MjJhYjktN2Q2NS00MDk5LWE1ZjYtMGFjOTQ0OWQ2OTc3IgogIF0sCiAgInByaXZhdGUtcmVnaXN0cnkiOiAiL2V0Yy9yYW5jaGVyL3JrZTIvcmVnaXN0cmllcy55YW1sIiwKICAidG9rZW4iOiAiYWEgYmIgY2MiCn0=",
       "path": "/etc/rancher/rke2/config.yaml.d/50-rancher.yaml"
     },
     {
@@ -10,7 +13,7 @@
       "minor": true
     },
     {
-      "content": "CmFwaVZlcnNpb246IHYxCmtpbmQ6IENvbmZpZ01hcAptZXRhZGF0YToKICBuYW1lOiBya2UyLWV0Y2Qtc25hcHNob3QtZXh0cmEtbWV0YWRhdGEKICBuYW1lc3BhY2U6IGt1YmUtc3lzdGVtCmRhdGE6CiAgcHJvdmlzaW9uaW5nLWNsdXN0ZXItc3BlYzogSDRzSUFBQUFBQUFBLytTU3dXcmpRQXlHMzBYWE5XRWQ5dVRiRW51M3R3WlMwck15Vmh6aHNXUTBtcFFRL080bFR1SkNYNkczbWUvL2tmUUxYYUhQQnpJaHA3UW5TNndDRlp6TDFmclBxaXgvV1U5cks2RUE2Mm1qY3VRT3FpdmtzVE5zYWVlR1R0M2xob0tLbThadFJLSGFrT1YxZEZaSk40MEVENUZhcUk0WUV4VndWQXUwL0xnVE5hcVJCcFVkZVlKS2NveFAzcGlwcGNYY1VpU25aaGo5VXJQVjZQZ2xjYnExYWM0Y2ZFN3g0SjFob0MwWmF3dlY3d0tjQjlMczh6djFQTDRqK3orMWVxNzhkaGQzRkZUYU5IdEd1d2Q2VWUyWDRVWk4vcDFPQlh5bzlXUS9OZjlVUURpaCtSNWpwc1U2WURpeDBQK29CNHpQRzNyc0syckF1SWs1T2RuZjdLZEcybEZaSEtyck5IMENBQUQvL3dFQUFQLy9UdkNKTnBzQ0FBQT0K",
+      "content": "CmFwaVZlcnNpb246IHYxCmtpbmQ6IENvbmZpZ01hcAptZXRhZGF0YToKICBuYW1lOiBya2UyLWV0Y2Qtc25hcHNob3QtZXh0cmEtbWV0YWRhdGEKICBuYW1lc3BhY2U6IGt1YmUtc3lzdGVtCmRhdGE6CiAgcHJvdmlzaW9uaW5nLWNsdXN0ZXItc3BlYzogSDRzSUFBQUFBQUFBLytTU1FXdnpNQXlHLzR1dlh5aGZ5azY1alNiYmJpdDBkR2ZWVVZJVFJUS3kzRkZLLy90SXVtWnNmMkUzKzMxZVpEK2dpeHZ5QVpYUk1PMVJVeEIybFR1VnEvWERxaXovNllCckxWM2hkTUNOY0JkNlYxMWNqcjFDaXp0VE1PelBVK1NGVFlXMkJJeTFRdURYYUVFNFRRd1pEb1N0cXpxZ2hJWHJSRDB1dDlDektOYUFvL0FPTGJtS005RTliMVJGMDFKdWtkQ3dHYU9kNjZBMUdIeWprS1pubWxQd05sdDg1YjJDeHkxcWtOWlYvd3RuWVVUSk5wL1RFT0k3QkhzU3JlZkpiemU0UXkvY3Bya1Q5U2IwSWpJc240dVM3SGQ2TGR5SDZJRDZWLzJ2aGZOSFVOc0RaVnlxSS9oallId21PUURkZCtnSDJvcFFqUjFrbXV3djB4d1NEN1NobkF6MU1kdXg0VFpLWUp2eEp3QUFBUC8vQVFBQS8vK05YOS94dEFJQUFBPT0K",
       "path": "/var/lib/rancher/rke2/server/manifests/rancher/rke2-etcd-snapshot-extra-metadata.yaml",
       "dynamic": true,
       "minor": true
@@ -22,6 +25,12 @@
     {
       "path": "/var/lib/rancher/rke2/server/manifests/rancher/managed-chart-config.yaml",
       "dynamic": true
+    },
+    {
+      "content": "CiMhL2Jpbi9zaAoKY3VycmVudEhhc2g9IiIKa2V5PSQxCnRhcmdldEhhc2g9JDIKaGFzaGVkQ21kPSQzCmNtZD0kNApzaGlmdCA0CgpkYXRhUm9vdD0iL3Zhci9saWIvcmFuY2hlci9jYXByL2lkZW1wb3RlbmNlLyRrZXkvJGhhc2hlZENtZCIKaGFzaEZpbGU9IiRkYXRhUm9vdC9oYXNoIgphdHRlbXB0RmlsZT0iJGRhdGFSb290L2F0dGVtcHQiCgpjdXJyZW50SGFzaD0kKGNhdCAiJGhhc2hGaWxlIiB8fCBlY2hvICIiKQpjdXJyZW50QXR0ZW1wdD0kKGNhdCAiJGF0dGVtcHRGaWxlIiB8fCBlY2hvICItMSIpCgppZiBbICIkY3VycmVudEhhc2giICE9ICIkdGFyZ2V0SGFzaCIgXSAmJiBbICIkY3VycmVudEF0dGVtcHQiICE9ICIkQ0FUVExFX0FHRU5UX0FUVEVNUFRfTlVNQkVSIiBdOyB0aGVuCglta2RpciAtcCAiJGRhdGFSb290IgoJZWNobyAiJHRhcmdldEhhc2giID4gIiRoYXNoRmlsZSIKCWVjaG8gIiRDQVRUTEVfQUdFTlRfQVRURU1QVF9OVU1CRVIiID4gIiRhdHRlbXB0RmlsZSIKCWV4ZWMgIiRjbWQiICIkQCIKZWxzZQoJZWNobyAiYWN0aW9uIGhhcyBhbHJlYWR5IGJlZW4gcmVjb25jaWxlZCB0byB0aGUgY3VycmVudCBoYXNoICRjdXJyZW50SGFzaCBhdCBhdHRlbXB0ICRjdXJyZW50QXR0ZW1wdCIKZmkK",
+      "path": "/var/lib/rancher/capr/idempotence/idempotent.sh",
+      "dynamic": true,
+      "minor": true
     }
   ],
   "instructions": [
@@ -29,7 +38,7 @@
       "name": "install",
       "image": "rancher/system-agent-installer-rke2:v1.24.11-rke2r1",
       "env": [
-        "RESTART_STAMP=febf3b60fea17222e7649db8b1527380b3355d7a9d11e33d98bf575da0cf0600"
+        "RESTART_STAMP=35bbaf5a8ce0b602d5297fc226e966c847bfa7de8a278e1a52ec19612c81294a"
       ],
       "args": [
         "-c",

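Decoding the base64 content fields above shows the new plan adds /etc/rancher/rke2/registries.yaml, a "private-registry" entry in 50-rancher.yaml, an idempotent.sh helper script, and a new RESTART_STAMP. A minimal sketch for inspecting a plan yourself (assuming jq and base64 are available and the attached new-plan.json.txt is used):

# List the managed file paths in the plan
jq -r '.files[].path' new-plan.json.txt

# Decode a specific entry, e.g. the RKE2 config snippet
jq -r '.files[] | select(.path == "/etc/rancher/rke2/config.yaml.d/50-rancher.yaml").content' new-plan.json.txt | base64 -d
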
w13915984028 (Member) commented:

Also encountered a similar issue (more like #4096) when testing PR #4117; the upgrade takes longer, but at least #4117 runs OK.

Error from server: Get "https://192.168.122.131:10250/containerLogs/harvester-system/hvst-upgrade-scnqb-apply-manifests-qqbrf/apply": proxy error from 127.0.0.1:9345 while dialing 192.168.122.131:10250, code 503: 503 Service Unavailable
Tue Jun 20 20:41:14 UTC 2023

Error from server: Get "https://192.168.122.131:10250/containerLogs/harvester-system/hvst-upgrade-scnqb-apply-manifests-qqbrf/apply": proxy error from 127.0.0.1:9345 while dialing 192.168.122.131:10250, code 503: 503 Service Unavailable
Tue Jun 20 20:41:34 UTC 2023

harv31:~ # kubectl get pods -A | grep upgrade
cattle-system               apply-hvst-upgrade-scnqb-prepare-on-harv31-with-d23a54392-k64zc   0/1     Completed     0             11m
cattle-system               apply-system-agent-upgrader-on-harv31-with-efb0f06adc9836-jcp4r   0/1     Completed     0             3m42s
cattle-system               system-upgrade-controller-5685d568ff-96xk9                        1/1     Running       0             3m12s
harvester-system            hvst-upgrade-scnqb-apply-manifests-qqbrf                          1/1     Running       0             5m15s
harvester-system            hvst-upgrade-scnqb-upgradelog-downloader-6b4f866c4b-vpsfk         1/1     Running       0             13m
harvester-system            hvst-upgrade-scnqb-upgradelog-infra-fluentbit-wr7qz               1/1     Running       0             13m
harvester-system            hvst-upgrade-scnqb-upgradelog-infra-fluentd-0                     2/2     Running       0             13m
harvester-system            virt-launcher-upgrade-repo-hvst-upgrade-scnqb-5lt6w               1/1     Running       0             12m
longhorn-system             longhorn-post-upgrade-42l65                                       0/1     Pending       0             78s

harv31:~ # kubectl get pods -A | grep upgrade
cattle-system               apply-hvst-upgrade-scnqb-prepare-on-harv31-with-d23a54392-k64zc   0/1     Completed           0             13m
cattle-system               apply-system-agent-upgrader-on-harv31-with-efb0f06adc9836-jcp4r   0/1     Completed           0             5m45s
cattle-system               system-upgrade-controller-5685d568ff-96xk9                        1/1     Running             0             5m15s
harvester-system            hvst-upgrade-scnqb-apply-manifests-qqbrf                          1/1     Running             0             7m18s
harvester-system            hvst-upgrade-scnqb-upgradelog-downloader-6b4f866c4b-s7z9s         0/1     ContainerCreating   0             72s
harvester-system            hvst-upgrade-scnqb-upgradelog-downloader-6b4f866c4b-vpsfk         0/1     Terminating         0             15m
harvester-system            hvst-upgrade-scnqb-upgradelog-infra-fluentbit-wr7qz               1/1     Running             0             15m
harvester-system            hvst-upgrade-scnqb-upgradelog-infra-fluentd-0                     0/2     Terminating         0             15m
harvester-system            virt-launcher-upgrade-repo-hvst-upgrade-scnqb-5lt6w               1/1     Running             0             14m
harv31:~ # 

...
harv31:~ # kubectl get pods -A | grep upgrade
cattle-system              apply-system-agent-upgrader-on-harv31-with-efb0f06adc9836-jcp4r   0/1     Completed   0               15m
cattle-system              system-upgrade-controller-5685d568ff-96xk9                        1/1     Running     0               14m
harvester-system           hvst-upgrade-scnqb-apply-manifests-qqbrf                          0/1     Completed   0               16m
harvester-system           hvst-upgrade-scnqb-single-node-upgrade-harv31-zrt2t               1/1     Running     0               3m26s
harvester-system           hvst-upgrade-scnqb-upgradelog-downloader-6b4f866c4b-s7z9s         1/1     Running     0               10m
harvester-system           hvst-upgrade-scnqb-upgradelog-infra-fluentbit-22gww               1/1     Running     0               3m32s
harvester-system           hvst-upgrade-scnqb-upgradelog-infra-fluentd-0                     2/2     Running     0               3m33s
harv31:~ #

w13915984028 (Member) commented Jun 20, 2023

In my single-node cluster upgrade environment, it finally ran into #4119,
with this error in the fleet-controller:

time="2023-06-20T20:51:23Z" level=info msg="While calculating status.ResourceKey, error running helm template for bundle mcc-local-managed-system-upgrade-controller with target options from : chart requires kubeVersion: >= 1.23.0-0 which is incompatible with Kubernetes v1.20.0"

@bk201 bk201 added severity/1 Function broken (a critical incident with very high impact) and removed severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) labels Jun 21, 2023
starbops (Member) commented:

The upgrade will proceed after temporarily scaling the fleet-agent deployment back up to one replica. We applied the workaround (scale it down to zero) in #3643. Since the upstream issue rancher/fleet#637 was closed, it may be time to re-evaluate the behavior of the new fleet-agent. A sketch of the temporary scale-up follows.
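
A minimal sketch of that temporary scale-up (assuming the fleet-agent deployment lives in the cattle-fleet-local-system namespace of the local cluster; the namespace may differ per setup):

# Temporarily restore one replica so the upgrade can proceed
kubectl -n cattle-fleet-local-system scale deployment fleet-agent --replicas=1

# Afterwards, the #3643 workaround state can be restored
kubectl -n cattle-fleet-local-system scale deployment fleet-agent --replicas=0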

bk201 (Member) commented Jun 28, 2023

Created a Rancher issue to ask for help: rancher/rancher#41965

@starbops starbops added the not-require/test-plan Skip to create a e2e automation test issue label Jul 3, 2023
harvesterhci-io-github-bot commented Jul 3, 2023

Pre Ready-For-Testing Checklist

  • ~~If NOT labeled not-require/test-plan: Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only a test case skeleton w/o implementation, have you created an implementation issue?
    • The automation skeleton PR is at:
    • The automation test case PR is at:~~

  • If the fix introduces code for backward compatibility: Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@TachunLin TachunLin self-assigned this Jul 12, 2023
TachunLin commented:

Verified fixed on v1.2.0-rc3. Closing this issue.

Result

  • We can completely upgrade a three-node Harvester cluster from v1.1.2 to v1.2.0-rc3 with VMs created.

  • Checked that rke2-server.service on each node did not restart while the system services were being upgraded:

    node1:~ # sudo systemctl status rke2-server.service
    ● rke2-server.service - Rancher Kubernetes Engine v2 (server)
         Loaded: loaded (/etc/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/rke2-server.service.d
                 └─override.conf
         Active: active (running) since Wed 2023-07-12 09:40:17 UTC; 2h 3min ago
    
    node2:~ # sudo systemctl status rke2-server.service
    ● rke2-server.service - Rancher Kubernetes Engine v2 (server)
         Loaded: loaded (/etc/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/rke2-server.service.d
                 └─override.conf
         Active: active (running) since Wed 2023-07-12 10:37:26 UTC; 1h 8min ago
    
    node3:~ # sudo systemctl status rke2-server.service
    ● rke2-server.service - Rancher Kubernetes Engine v2 (server)
         Loaded: loaded (/etc/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/rke2-server.service.d
                 └─override.conf
         Active: active (running) since Wed 2023-07-12 10:38:21 UTC; 1h 8min ago
    
  • The post-drain pods complete correctly.

Test Information

  • Test Environment: 3-node Harvester cluster on bare-metal machines
  • Harvester version: v1.2.0-rc3

Verify Steps

  1. Prepare the three-node v1.1.2 Harvester cluster
  2. Create VMs on different nodes
  3. Upgrade to v1.2.0-rc3
  4. Monitor the upgrade process and access each Harvester node
  5. Check the Active time period of rke2-server.service on server nodes and rke2-agent.service on worker nodes (see the sketch after this list):
     sudo systemctl status rke2-server.service
  6. Check that the upgrade finishes completely