
[backport v1.1] [ENHANCEMENT] Maintenance mode doesn't drain the node #3363

Closed
harvesterhci-io-github-bot opened this issue Jan 13, 2023 · 6 comments

Labels: area/ui · blocker (blocker of major functionality) · highlight (Highlight issues/features) · kind/bug (Issues that are defects reported by users or that we know have reached a real release) · not-require/test-plan (Skip to create an e2e automation test issue) · priority/0 (Must be fixed in this release) · reproduce/always (Reproducible 100% of the time) · require-ui/small (estimate 1-2 working days)
Milestone: v1.1.2

Comments

@harvesterhci-io-github-bot

Backport of issue #2723.

@harvesterhci-io-github-bot added the highlight, kind/bug, not-require/test-plan, priority/0, reproduce/always, and require-ui/small labels on Jan 13, 2023
@harvesterhci-io-github-bot added this to the v1.1.2 milestone on Jan 13, 2023
@guangbochen added the blocker (blocker of major functionality) label on Feb 10, 2023
@harvesterhci-io-github-bot commented Feb 15, 2023

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR been submitted?
    The HEP PR is at:

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [ENHANCEMENT] Harvester supports draining the node when entering maintenance mode #2723 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include a deployment change (YAML/Chart)? If so, where are the PRs for both the YAML file and the Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue been filed or is it ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary documentation PR been submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If there is only a test case skeleton without implementation, have you created an implementation issue?
    The automation skeleton PR is at:
    The automation test case PR is at:

  • If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@lanfon72 self-assigned this Feb 16, 2023
@lanfon72 (Member) commented:

Moving back to Ready For Testing because we still have some enhancements in #2723 (comment).

@lanfon72 removed their assignment Feb 21, 2023
@noahgildersleeve self-assigned this Feb 27, 2023
@noahgildersleeve commented Feb 27, 2023

Ran the test plan from the linked comment on Harvester v1.1-68cb406a-head installed via ipxe-examples. I also made sure I was running against the latest ui-index and ui-plugin-index, not the v1.1 ones. In an HA controlplane cluster (3 nodes), when trying to force the second node into maintenance mode per controlplane drain fail scenario 2, it won't let you force it; you just get the same error described in that scenario. The worker drain fail scenario for the single controlplane cluster also failed because the drain could not be forced.

Screenshot_20230227_135212

Test plan

Single controlplane cluster:

Common Setup:

  • Create a 2-node cluster. In this scenario no node promotion will occur; the first node will stay as the master and the second will stay as a worker node.
  • Set up a non-default storage class with only 1 replica (a sketch of one way to create it follows this list).
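
A minimal sketch of the second setup step, assuming the official `kubernetes` Python client, a kubeconfig for the cluster, and Longhorn's usual CSI provisioner name and `numberOfReplicas` parameter (verify both against the Longhorn version shipped with your Harvester build):

```python
# Sketch: create a non-default StorageClass backed by Longhorn with one replica.
# Assumptions: provisioner "driver.longhorn.io" and the "numberOfReplicas"
# parameter; adjust if your Longhorn version differs.
from kubernetes import client, config

config.load_kube_config()  # use the cluster's kubeconfig

single_replica_sc = client.V1StorageClass(
    api_version="storage.k8s.io/v1",
    kind="StorageClass",
    metadata=client.V1ObjectMeta(name="longhorn-single-replica"),
    provisioner="driver.longhorn.io",
    parameters={
        "numberOfReplicas": "1",      # the point of this test setup
        "staleReplicaTimeout": "30",
    },
    reclaim_policy="Delete",
    volume_binding_mode="Immediate",
    allow_volume_expansion=True,
)

client.StorageV1Api().create_storage_class(body=single_replica_sc)
print("created StorageClass longhorn-single-replica")
```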

worker drain pass scenario:

  • create a VM using the default storage class
  • place the worker node in maintenance mode; it should enter maintenance mode because there are healthy replicas on the controlplane node
  • if the VM is scheduled on the worker node, it will be gracefully evicted to the controlplane node
  • remove the worker node from maintenance mode
  • place the controlplane node in maintenance mode; this will fail because the controller rejects the request since this is the only valid controlplane. The logs of the harvester pod should reflect the same (a sketch for pulling those logs follows this scenario).

PASS
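
For the steps that ask you to check the harvester pod logs, here is a minimal sketch assuming the official `kubernetes` Python client and that the Harvester pods run in the `harvester-system` namespace with names starting with `harvester-` (both are assumptions; adjust to your deployment):

```python
# Sketch: scan the Harvester pods' recent logs for the drain/maintenance
# rejection messages mentioned in the test plan.
# Assumptions: namespace "harvester-system", pod-name prefix "harvester-".
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "harvester-system"

for pod in core.list_namespaced_pod(NAMESPACE).items:
    name = pod.metadata.name
    if not name.startswith("harvester-"):
        continue
    logs = core.read_namespaced_pod_log(
        name,
        NAMESPACE,
        container=pod.spec.containers[0].name,
        tail_lines=200,
    )
    for line in logs.splitlines():
        if "maintenance" in line.lower() or "drain" in line.lower():
            print(f"{name}: {line}")
```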

worker drain fail scenario:

  • create a few VMs with the single-replica storage class
  • one of the replicas should be scheduled on the worker node
  • attempt to place the worker node in maintenance mode; a warning should be presented listing the VMs that will be impacted by the drain, and the Apply button will be disabled
  • enabling the Force checkbox should enable the UI and allow Apply to be clicked
  • the drain controller will try to shut down the impacted VMs from the warning pop-up
  • the drain will be unsuccessful, as Longhorn will be unable to shut down the last replica due to its PDB. The logs of the harvester pod should reflect the same (a sketch for inspecting the Longhorn PDBs follows this scenario).
  • disable maintenance mode on the worker node
  • place the controlplane node in maintenance mode; this will fail because the controller rejects the request since this is the only valid controlplane. The logs of the harvester pod should reflect the same.

FAIL
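
To confirm that a Longhorn PodDisruptionBudget is what blocks eviction of the last healthy replica, a minimal sketch with the `kubernetes` Python client, assuming Longhorn runs in the usual `longhorn-system` namespace:

```python
# Sketch: list the PodDisruptionBudgets Longhorn maintains for its
# instance-manager pods; a PDB with disruptions_allowed == 0 on the node
# being drained is what blocks eviction of the last healthy replica.
# Assumption: Longhorn is installed in the "longhorn-system" namespace.
from kubernetes import client, config

config.load_kube_config()
policy = client.PolicyV1Api()

for pdb in policy.list_namespaced_pod_disruption_budget("longhorn-system").items:
    status = pdb.status
    print(
        f"{pdb.metadata.name}: "
        f"disruptions_allowed={status.disruptions_allowed}, "
        f"current_healthy={status.current_healthy}, "
        f"desired_healthy={status.desired_healthy}"
    )
```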

HA controlplane cluster:

Common Setup:

  • Create at least a 3-node cluster. Ensure that node promotion has been performed and we have an HA setup.
  • Set up a non-default storage class with only 1 replica.

controlplane drain pass scenario:

  • create a VM using the default storage class
  • assuming a 3-node cluster, all nodes will be controlplanes. Place a node in maintenance mode; this should be successful, and VMs will be gracefully migrated to other nodes in the cluster (a sketch for checking which node each VM is running on follows this scenario).
  • attempt to place another controlplane node into maintenance mode; this should fail, with a message in the harvester pod logs that a controlplane is already in maintenance mode. At any point in time in an HA setup, only 1 controlplane node can be placed in maintenance mode.

PASS
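
To verify that the VMs actually migrated off the node in maintenance mode, a minimal sketch that reads the KubeVirt VirtualMachineInstance objects (`kubevirt.io/v1`) and prints the node each one is running on; the `default` namespace is an assumption, adjust to wherever your test VMs live:

```python
# Sketch: print which node each KubeVirt VirtualMachineInstance is running on,
# to confirm VMs migrated away from the node in maintenance mode.
# Assumption: the test VMs are in the "default" namespace.
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()

vmis = crd.list_namespaced_custom_object(
    group="kubevirt.io",
    version="v1",
    namespace="default",
    plural="virtualmachineinstances",
)

for vmi in vmis.get("items", []):
    name = vmi["metadata"]["name"]
    node = vmi.get("status", {}).get("nodeName", "<not running>")
    print(f"VMI {name} is running on {node}")
```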

controlplane drain fail scenario 1:

  • create a VM using the default storage class
  • assuming a 3-node cluster, all nodes will be controlplanes. Place a node in maintenance mode; this should be successful, and VMs will be gracefully migrated to other nodes in the cluster.
  • attempt to place another controlplane node into maintenance mode; this should fail, with a message in the harvester pod logs that a controlplane is already in maintenance mode. At any point in time in an HA setup, only 1 controlplane node can be placed in maintenance mode.

PASS

controlplane drain fail scenario 2:

  • create a VM using the single-replica storage class
  • assuming a 3-node cluster, all nodes will be controlplanes. Try to place each node into maintenance mode one node at a time (unless we can see which node the replica for the single-replica storage class is scheduled on; a sketch for locating that replica's node follows this scenario). Placing this node into maintenance mode will fail with the warning about unhealthy VMs, and the Apply button will be disabled
  • enabling the Force checkbox should enable the UI and allow Apply to be clicked
  • the drain controller will try to shut down the impacted VMs from the warning pop-up
  • the drain will be unsuccessful, as Longhorn will be unable to shut down the last replica due to its PDB. The logs of the harvester pod should reflect the same.

FAIL
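
To find which node hosts the replica of a volume from the single-replica storage class (instead of trying nodes one at a time), a minimal sketch that reads Longhorn's Replica custom resources via the `kubernetes` Python client. The CRD group/version (`longhorn.io/v1beta2`) and the `spec.nodeID`/`spec.volumeName` fields are assumptions based on recent Longhorn releases; verify against the Longhorn version shipped with your Harvester build:

```python
# Sketch: list Longhorn Replica custom resources and print which node each
# replica is scheduled on, so you can pick the node to drain in this scenario.
# Assumptions: namespace "longhorn-system", CRD group "longhorn.io",
# version "v1beta2", plural "replicas", node recorded in spec.nodeID.
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()

replicas = crd.list_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="replicas",
)

for r in replicas.get("items", []):
    name = r["metadata"]["name"]
    volume = r["spec"].get("volumeName", "<unknown>")
    node = r["spec"].get("nodeID", "<unscheduled>")
    print(f"replica {name} (volume {volume}) is on node {node}")
```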

@noahgildersleeve commented:

I retested this in a 4-node physical setup and it worked fine. However, it did move the VMs back to the node and turn them back on. Is that the desired functionality?

@lanfon72 (Member) commented Mar 6, 2023

> I retested this in a 4-node physical setup and it worked fine. However, it did move the VMs back to the node and turn them back on. Is that the desired functionality?

VMs should always be migrated to other nodes when their host enters maintenance mode, and if a migrated VM is restarted (not soft-rebooted), it might be hosted back on the original node (if that node is working). This is expected.

@noahgildersleeve commented Mar 7, 2023

Tested in v1.1-f3472cc4-head. Verified as fixed.

controlplane drain fail scenario 2:

  • create a VM using the single-replica storage class
  • assuming a 3-node cluster, all nodes will be controlplanes. Try to place each node into maintenance mode one node at a time (unless we can see which node the replica for the single-replica storage class is scheduled on). Placing this node into maintenance mode will fail with the warning about unhealthy VMs, and the Apply button will be disabled
  • enabling the Force checkbox should enable the UI and allow Apply to be clicked
  • the drain controller will try to shut down the impacted VMs from the warning pop-up
  • the drain will be unsuccessful, as Longhorn will be unable to shut down the last replica due to its PDB. The logs of the harvester pod should reflect the same.

PASS

This seems to be working. It gave the error, then it shut down the VMs. The VM could not be turned back on after the shutdown; it just detaches the volume in Longhorn. The host shows as cordoned. When I uncordon the host, it allows you to turn on the VM again and it comes up successfully.
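
For reference, uncordoning the host outside the UI just clears the node's `spec.unschedulable` flag; a minimal sketch with the `kubernetes` Python client (note this only lifts the cordon and does not run Harvester's full disable-maintenance flow, so the UI action remains the supported path):

```python
# Sketch: uncordon a node by clearing spec.unschedulable.
# "harvester-node-1" is a placeholder node name; substitute your own.
# This only removes the scheduling cordon; Harvester's "Disable Maintenance
# Mode" action in the UI is the supported way to fully exit maintenance.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NODE_NAME = "harvester-node-1"  # placeholder

core.patch_node(NODE_NAME, {"spec": {"unschedulable": False}})
print(f"uncordoned {NODE_NAME}")
```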

Screenshot_20230306_155718
Screenshot_20230306_155637
Screenshot_20230306_155353
Screenshot_20230306_155319
Screenshot_20230306_155251
Screenshot_20230306_155240
