
[backport v1.1] [ENHANCEMENT] Maintenance mode doesn't drain the node #3363

Closed
harvesterhci-io-github-bot opened this issue Jan 13, 2023 · 6 comments

Labels: area/ui · blocker (blocker of major functionality) · highlight (Highlight issues/features) · kind/bug (Issues that are defects reported by users or that we know have reached a real release) · not-require/test-plan (Skip to create an e2e automation test issue) · priority/0 (Must be fixed in this release) · reproduce/always (Reproducible 100% of the time) · require-ui/small (estimate 1-2 working days)
Milestone: v1.1.2

Comments

@harvesterhci-io-github-bot

Backport of issue #2723.

@harvesterhci-io-github-bot added the highlight, kind/bug, not-require/test-plan, priority/0, reproduce/always, and require-ui/small labels on Jan 13, 2023
@harvesterhci-io-github-bot added this to the v1.1.2 milestone on Jan 13, 2023
@guangbochen added the blocker (blocker of major functionality) label on Feb 10, 2023
@harvesterhci-io-github-bot commented Feb 15, 2023

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR been submitted?
    The HEP PR is at:

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [ENHANCEMENT] Harvester supports draining the node when entering maintenance mode #2723 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include a deployment change (YAML/Chart)? If so, where are the PRs for both the YAML file and the Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue been filed or is it ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary documentation PR been submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If there is only a test case skeleton without implementation, have you created an implementation issue?
    The automation skeleton PR is at:
    The automation test case PR is at:

  • If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@lanfon72 self-assigned this Feb 16, 2023
@lanfon72 (Member) commented:

Moving back to Ready For Testing because we still have some enhancements in #2723 (comment).

@lanfon72 removed their assignment Feb 21, 2023
@noahgildersleeve self-assigned this Feb 27, 2023
@noahgildersleeve commented Feb 27, 2023

Ran the test plan from the linked comment on Harvester v1.1-68cb406a-head installed via ipxe-examples. I also made sure I was running against the latest ui-index and ui-plugin-index, not the v1.1 ones. In an HA controlplane cluster (3 nodes), when trying to force the second node into maintenance mode per controlplane drain fail scenario 2, it won't let you force it; you just get the same error described in that scenario. The worker drain fail scenario for the single controlplane cluster also failed because the drain could not be forced.

Screenshot_20230227_135212

Test plan

Single controlplane cluster:

Common Setup:

  • Create a 2-node cluster. In this scenario no node promotion will occur; the first node will stay as the master and the second will stay as a worker node.
  • Set up a non-default storage class with only 1 replica (a sketch of one way to create it follows this list).
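
A minimal sketch of the second setup step, assuming the official `kubernetes` Python client, a kubeconfig for the cluster, and Longhorn's usual CSI provisioner name and `numberOfReplicas` parameter (verify both against the Longhorn version shipped with your Harvester build):

```python
# Sketch: create a non-default StorageClass backed by Longhorn with one replica.
# Assumptions: provisioner "driver.longhorn.io" and the "numberOfReplicas"
# parameter; adjust if your Longhorn version differs.
from kubernetes import client, config

config.load_kube_config()  # use the cluster's kubeconfig

single_replica_sc = client.V1StorageClass(
    api_version="storage.k8s.io/v1",
    kind="StorageClass",
    metadata=client.V1ObjectMeta(name="longhorn-single-replica"),
    provisioner="driver.longhorn.io",
    parameters={
        "numberOfReplicas": "1",      # the point of this test setup
        "staleReplicaTimeout": "30",
    },
    reclaim_policy="Delete",
    volume_binding_mode="Immediate",
    allow_volume_expansion=True,
)

client.StorageV1Api().create_storage_class(body=single_replica_sc)
print("created StorageClass longhorn-single-replica")
```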

worker drain pass scenario:

  • create a VM using the default storage class
  • place the worker node in maintenance mode; it should enter maintenance mode because there are healthy replicas on the controlplane node
  • if the VM is scheduled on the worker node, it will be gracefully evicted to the controlplane node
  • remove the worker node from maintenance mode
  • place the controlplane node in maintenance mode; this will fail because the controller rejects the request since this is the only valid controlplane. The logs of the harvester pod should reflect the same (a sketch for pulling those logs follows this scenario).

PASS
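
For the steps that ask you to check the harvester pod logs, here is a minimal sketch assuming the official `kubernetes` Python client and that the Harvester pods run in the `harvester-system` namespace with names starting with `harvester-` (both are assumptions; adjust to your deployment):

```python
# Sketch: scan the Harvester pods' recent logs for the drain/maintenance
# rejection messages mentioned in the test plan.
# Assumptions: namespace "harvester-system", pod-name prefix "harvester-".
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "harvester-system"

for pod in core.list_namespaced_pod(NAMESPACE).items:
    name = pod.metadata.name
    if not name.startswith("harvester-"):
        continue
    logs = core.read_namespaced_pod_log(
        name,
        NAMESPACE,
        container=pod.spec.containers[0].name,
        tail_lines=200,
    )
    for line in logs.splitlines():
        if "maintenance" in line.lower() or "drain" in line.lower():
            print(f"{name}: {line}")
```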

worker drain fail scenario:

  • create a few VMs with the single-replica storage class
  • one of the replicas should be scheduled on the worker node
  • attempt to place the worker node in maintenance mode; a warning should be presented listing the VMs that will be impacted by the drain, and the Apply button will be disabled
  • enabling the Force checkbox should enable the UI and allow Apply to be clicked
  • the drain controller will try to shut down the impacted VMs from the warning pop-up
  • the drain will be unsuccessful, as Longhorn will be unable to shut down the last replica due to its PDB. The logs of the harvester pod should reflect the same (a sketch for inspecting the Longhorn PDBs follows this scenario).
  • disable maintenance mode on the worker node
  • place the controlplane node in maintenance mode; this will fail because the controller rejects the request since this is the only valid controlplane. The logs of the harvester pod should reflect the same.

FAIL
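
To confirm that a Longhorn PodDisruptionBudget is what blocks eviction of the last healthy replica, a minimal sketch with the `kubernetes` Python client, assuming Longhorn runs in the usual `longhorn-system` namespace:

```python
# Sketch: list the PodDisruptionBudgets Longhorn maintains for its
# instance-manager pods; a PDB with disruptions_allowed == 0 on the node
# being drained is what blocks eviction of the last healthy replica.
# Assumption: Longhorn is installed in the "longhorn-system" namespace.
from kubernetes import client, config

config.load_kube_config()
policy = client.PolicyV1Api()

for pdb in policy.list_namespaced_pod_disruption_budget("longhorn-system").items:
    status = pdb.status
    print(
        f"{pdb.metadata.name}: "
        f"disruptions_allowed={status.disruptions_allowed}, "
        f"current_healthy={status.current_healthy}, "
        f"desired_healthy={status.desired_healthy}"
    )
```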

HA controlplane cluster:

Common Setup:

  • Create at least a 3-node cluster. Ensure that node promotion has been performed and we have an HA setup.
  • Set up a non-default storage class with only 1 replica.

controlplane drain pass scenario:

  • create a VM using the default storage class
  • assuming a 3-node cluster, all nodes will be controlplanes. Place a node in maintenance mode; this should be successful, and VMs will be gracefully migrated to other nodes in the cluster (a sketch for checking which node each VM is running on follows this scenario).
  • attempt to place another controlplane node into maintenance mode; this should fail, with a message in the harvester pod logs that a controlplane is already in maintenance mode. At any point in time in an HA setup, only 1 controlplane node can be placed in maintenance mode.

PASS
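
To verify that the VMs actually migrated off the node in maintenance mode, a minimal sketch that reads the KubeVirt VirtualMachineInstance objects (`kubevirt.io/v1`) and prints the node each one is running on; the `default` namespace is an assumption, adjust to wherever your test VMs live:

```python
# Sketch: print which node each KubeVirt VirtualMachineInstance is running on,
# to confirm VMs migrated away from the node in maintenance mode.
# Assumption: the test VMs are in the "default" namespace.
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()

vmis = crd.list_namespaced_custom_object(
    group="kubevirt.io",
    version="v1",
    namespace="default",
    plural="virtualmachineinstances",
)

for vmi in vmis.get("items", []):
    name = vmi["metadata"]["name"]
    node = vmi.get("status", {}).get("nodeName", "<not running>")
    print(f"VMI {name} is running on {node}")
```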

controlplane drain fail scenario 1:

  • create a VM using the default storage class
  • assuming a 3-node cluster, all nodes will be controlplanes. Place a node in maintenance mode; this should be successful, and VMs will be gracefully migrated to other nodes in the cluster.
  • attempt to place another controlplane node into maintenance mode; this should fail, with a message in the harvester pod logs that a controlplane is already in maintenance mode. At any point in time in an HA setup, only 1 controlplane node can be placed in maintenance mode.

PASS

controlplane drain fail scenario 2:

  • create a VM using the single-replica storage class
  • assuming a 3-node cluster, all nodes will be controlplanes. Try to place each node into maintenance mode one node at a time (unless we can see which node the replica for the single-replica storage class is scheduled on; a sketch for locating that replica's node follows this scenario). Placing this node into maintenance mode will fail with the warning about unhealthy VMs, and the Apply button will be disabled
  • enabling the Force checkbox should enable the UI and allow Apply to be clicked
  • the drain controller will try to shut down the impacted VMs from the warning pop-up
  • the drain will be unsuccessful, as Longhorn will be unable to shut down the last replica due to its PDB. The logs of the harvester pod should reflect the same.

FAIL
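
To find which node hosts the replica of a volume from the single-replica storage class (instead of trying nodes one at a time), a minimal sketch that reads Longhorn's Replica custom resources via the `kubernetes` Python client. The CRD group/version (`longhorn.io/v1beta2`) and the `spec.nodeID`/`spec.volumeName` fields are assumptions based on recent Longhorn releases; verify against the Longhorn version shipped with your Harvester build:

```python
# Sketch: list Longhorn Replica custom resources and print which node each
# replica is scheduled on, so you can pick the node to drain in this scenario.
# Assumptions: namespace "longhorn-system", CRD group "longhorn.io",
# version "v1beta2", plural "replicas", node recorded in spec.nodeID.
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()

replicas = crd.list_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="replicas",
)

for r in replicas.get("items", []):
    name = r["metadata"]["name"]
    volume = r["spec"].get("volumeName", "<unknown>")
    node = r["spec"].get("nodeID", "<unscheduled>")
    print(f"replica {name} (volume {volume}) is on node {node}")
```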

@noahgildersleeve commented:

I retested this in a 4-node physical setup and it worked fine. However, it did move the VMs back to the node and turn them back on. Is that the desired functionality?

@lanfon72 (Member) commented Mar 6, 2023

> I retested this in a 4-node physical setup and it worked fine. However, it did move the VMs back to the node and turn them back on. Is that the desired functionality?

VMs should always be migrated to other nodes when their host enters maintenance mode, and if a migrated VM is restarted (not soft-rebooted), it might be hosted back on the original node (if that node is working). This is expected.

@noahgildersleeve commented Mar 7, 2023

Tested in v1.1-f3472cc4-head. Verified as fixed.

controlplane drain fail scenario 2:

  • create a VM using the single-replica storage class
  • assuming a 3-node cluster, all nodes will be controlplanes. Try to place each node into maintenance mode one node at a time (unless we can see which node the replica for the single-replica storage class is scheduled on). Placing this node into maintenance mode will fail with the warning about unhealthy VMs, and the Apply button will be disabled
  • enabling the Force checkbox should enable the UI and allow Apply to be clicked
  • the drain controller will try to shut down the impacted VMs from the warning pop-up
  • the drain will be unsuccessful, as Longhorn will be unable to shut down the last replica due to its PDB. The logs of the harvester pod should reflect the same.

PASS

This seems to be working. It gave the error, then it shut down the VMs. The VM could not be turned back on after the shutdown; it just detaches the volume in Longhorn. The host shows as cordoned. When I uncordon the host, it allows you to turn on the VM again and it comes up successfully.
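
For reference, uncordoning the host outside the UI just clears the node's `spec.unschedulable` flag; a minimal sketch with the `kubernetes` Python client (note this only lifts the cordon and does not run Harvester's full disable-maintenance flow, so the UI action remains the supported path):

```python
# Sketch: uncordon a node by clearing spec.unschedulable.
# "harvester-node-1" is a placeholder node name; substitute your own.
# This only removes the scheduling cordon; Harvester's "Disable Maintenance
# Mode" action in the UI is the supported way to fully exit maintenance.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NODE_NAME = "harvester-node-1"  # placeholder

core.patch_node(NODE_NAME, {"spec": {"unschedulable": False}})
print(f"uncordoned {NODE_NAME}")
```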

Screenshot_20230306_155718
Screenshot_20230306_155637
Screenshot_20230306_155353
Screenshot_20230306_155319
Screenshot_20230306_155251
Screenshot_20230306_155240
