[FEATURE] Optimize for Frequent Power-off/Power-On operating procedures #3261

rebeccazzzz · 2022-12-08T20:53:32Z

Context
Harvester runs in edge or remote environments where power is intermittent, or where devices must be turned off and moved to a new location frequently (sometimes daily).

Is your feature request related to a problem? Please describe.
Operators who turn the cluster off and on are not Kubernetes experts, so they can't be expected to troubleshoot stuck containers that fail to come back online after startup.

update 20230824:
This issue has been converted into an EPIC. It touches k8s, CSI, VMs, Linux, and more, so a HEP is required.

A Harvester cluster is deployed in the rough sequence of:

OS -> rancherd -> rke2 -> k8s -> fleet -> harvester -> longhorn | monitoring | logging..., -> virtual-machine ...

To shut down the cluster safely, it seems reasonable to follow roughly the reverse sequence.
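The reverse ordering can be sketched as follows. This only illustrates the idea and is not Harvester's actual shutdown procedure: `STARTUP_ORDER` and `shutdown_order` are hypothetical names, and the layer labels are copied from the sequence above.

```python
# Startup layers as listed in this issue, earliest first.
# The labels are descriptive only; they are not commands or service names.
STARTUP_ORDER = [
    "OS",
    "rancherd",
    "rke2",
    "k8s",
    "fleet",
    "harvester",
    "longhorn | monitoring | logging",
    "virtual-machine",
]


def shutdown_order(startup):
    """Return the layers in reverse startup order (last started, first stopped)."""
    return list(reversed(startup))


if __name__ == "__main__":
    for step, layer in enumerate(shutdown_order(STARTUP_ORDER), start=1):
        print(f"{step}. stop {layer}")
```

In this sketch the virtual machines are stopped first and the OS last, which matches the "roughly reverse" intent; a real procedure would still need to handle each layer's own dependencies.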

A number of sub-stories have been created to work on this.

Sub-stories in v1.3.0:

Bug fix and enhancement in v1.3.0:

Several bugs show different symptoms but share the same root cause: after a node reboots, kubelet fails to UnmountVolume.

[BUG] VM got IO error after host restart #4633 (validate after rke2 is bumped to v1.27.10; kubelet fails to UnmountVolume)
[BUG] After power off Harvester nodes then power on for days, all VMs display Stopping state and can't launch #4033
[BUG] VM stuck in stopping state. Then after delete stuck in terminating phase #3539
Is this a typical problem with this harvester that the rke2 clusters above the harvester no longer start (remains in pending state) after rebooting the harvester nodes following a power outage for example? version 1.2 #4789
Add workaroud for 5109 docs#519 (The issue [BUG] Single Node Cluster Fails To Have VMs Come Back Post Restart w/ Storage Network Configured #5109 was observed while validating 4633 and may not be fixed in Harvester v1.3.0, so a workaround is being added to the Harvester documentation. After the document PR is merged, #5109 will be moved to v1.4.0.)

Others:

[BUG] Do not activate the LVM device on the harvester node #4674
[BUG] Migrating (or simply stopping) VMs with LVM volumes causes Buffer I/O errors in kernel log of source host #3843 (possibly the same root cause as 4674; will validate)
[Doc] Add workaround for VM start button is not visible #4659

HEP PR:

HEP: System robustness enhancement [CI SKIP] #4464

Continuous enhancement in v1.4.0+:

EPIC: #5007 [ENHANCEMENT] Continuous enhancement on system robustness and resilience

These issues/PRs could be checked and aligned as well:
#3902
#3263
harvester/node-manager#15


harvesterhci-io-github-bot · 2022-12-19T07:15:04Z

harvesterhci-io-github-bot · 2022-12-19T07:15:06Z

Automation e2e test issue: harvester/tests#664

w13915984028 · 2023-11-27T20:28:23Z

New updated scenario:

Run 5+ VMs on the cluster (1~3 nodes), stop all nodes at the same time, then restart the cluster some time later. Some VMs may get stuck or take a long time to restart.

Possibly related issues:
[BUG] After power off Harvester nodes then power on for days, all VMs display Stopping state and can't launch #4033
[BUG] VM stuck in stopping state. Then after delete stuck in terminating phase #3539
[FEATURE] rebuild last healthy replica during node drain #3378
[BUG] VM got IO error after host restart #4633
[BUG] vm stuck in loop toggle between "pause"/"unpause" #4828
[BUG] VM is stuck in terminating after a reboot #994 (comment)

w13915984028 · 2024-03-07T07:49:06Z

The optimizations and bug fixes planned for v1.3.0 are done, and the continuous enhancement is tracked in the new epic #5007 for Harvester v1.4.0. Closing this issue now.

rebeccazzzz added kind/enhancement Issues that improve or augment existing functionality priority/0 Must be fixed in this release labels Dec 8, 2022

rebeccazzzz added this to the v1.2.0 milestone Dec 8, 2022

rebeccazzzz added priority/1 Highly recommended to fix in this release and removed priority/0 Must be fixed in this release labels Dec 8, 2022

guangbochen added the require/doc Improvements or additions to documentation label Dec 9, 2022

bk201 assigned FrankYang0529 Dec 19, 2022

FrankYang0529 mentioned this issue Dec 19, 2022

docs: add cluster shutdown and restart harvester/docs#268

Open

harvesterhci-io-github-bot mentioned this issue Dec 19, 2022

[e2e] [FEATURE] Optimize for Frequent Power-off/Power-On operating procedures harvester/tests#664

Open

1 task

guangbochen mentioned this issue Jan 5, 2023

[Question] How should I properly shut down Harvester cluster? #3080

Closed

bk201 mentioned this issue Mar 8, 2023

[BUG] After upgrade and cold start Harvester node, failed to recover RKE2 guest cluster with VM paused state #3614

Open

guangbochen assigned bk201 May 18, 2023

guangbochen unassigned FrankYang0529 May 25, 2023

guangbochen modified the milestones: v1.2.0, v1.2.1 May 25, 2023

rebeccazzzz modified the milestones: v1.2.1, v1.3.0 Aug 17, 2023

w13915984028 self-assigned this Aug 24, 2023

w13915984028 mentioned this issue Aug 24, 2023

After a power down of hosts in a Harvester cluster, the cluster fails to come back up #4234

Open

w13915984028 added Epic require/HEP Require Harvester Enhancement Proposal PR labels Aug 24, 2023

This was referenced Sep 7, 2023

[ENHANCEMENT] support elemental cloud-init via harvester-node-manager #3902

Closed

[FEATURE] Complete Cluster Backup #3263

Open

bk201 added the highlight Highlight issues/features label Nov 21, 2023

bk201 removed their assignment Dec 8, 2023

This was referenced Jan 22, 2024

[ENHANCEMENT] Continuous enhancement on system robustness and resilience #5007

Open

[BUG] VM got IO error after host restart #4633

Closed

w13915984028 closed this as completed Mar 7, 2024
