
[BUG] Disk eviction not doing anything #2995

Closed
vinid223 opened this issue Sep 11, 2021 · 9 comments
Assignees
Labels
  • backport/1.1.3 (Require to backport to 1.1.3 release branch)
  • backport/1.2.1 (Require to backport to 1.2.1 release branch)
  • kind/bug
  • priority/0 (Must be fixed in this release, managed by PO)
  • require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated)
Milestone

Comments

@vinid223

Describe the bug
I added a new disk to each of my nodes in the cluster to replace the existing disks as the main storage.

For each node, I went into the disk settings, disabled scheduling on the old disks, and enabled eviction. The new disks are enabled for scheduling.

It's been hours and not a single replica has been moved. I can't see any logs in the Longhorn UI. The new disks work: when I force delete a replica, it rebuilds fine on the other disks.

To Reproduce
Steps to reproduce the behavior:

  1. Add a new disk to the node
  2. Set up the new disk in Longhorn with scheduling enabled
  3. Set up the old disk in Longhorn with scheduling disabled and eviction enabled (see the kubectl sketch after these steps)
  4. Wait and observe that nothing happens
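
For anyone reproducing this without the UI, here is a minimal kubectl sketch of the same disk-level settings, assuming a node named node-2 and a disk entry keyed old-disk on the Longhorn Node CR (both names are placeholders; verify the field names against your own CR with -o yaml first):

```shell
# Inspect the disk entries on the Longhorn Node CR to find the key of the disk to evict
kubectl -n longhorn-system get nodes.longhorn.io node-2 -o yaml

# Disable scheduling and request eviction on the old disk
# ("node-2" and "old-disk" are placeholders for the real node name and disk key)
kubectl -n longhorn-system patch nodes.longhorn.io node-2 --type merge \
  -p '{"spec":{"disks":{"old-disk":{"allowScheduling":false,"evictionRequested":true}}}}'
```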

Expected behavior
Replicas are moved off the evicting disk. If not, logs or information should be shown on the volume page, node page, or main page.

Log
If needed, I can generate a support bundle

Environment:

  • Longhorn version: 1.1.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s 1.21.4
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3 (the management node is also a worker, so 4 nodes)
  • Node config
    • OS type and version: Raspbian 10 (Buster)
    • CPU per node: 4
    • Memory per node: 8 GB
    • Disk type (e.g. SSD/NVMe): SD card for main OS, external USB HDD 1 TB (nodes 2, 3 and 4)
    • Network bandwidth between the nodes: 1 Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 11

Additional context
N/A

@vinid223
Author

I tried something different this morning: I disabled scheduling on a node and enabled the eviction request, and the node started removing the replicas.

I could do my migration that way, but it would be nice for the disk-level feature to work as well.
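
For reference, a hedged sketch of the node-level change described above, done directly on the Longhorn Node CR instead of the UI (node-2 is a placeholder; the allowScheduling/evictionRequested field names are assumed from the nodes.longhorn.io spec, so check your CR with -o yaml first):

```shell
# Disable scheduling and request eviction for the whole node (not just a single disk)
kubectl -n longhorn-system patch nodes.longhorn.io node-2 --type merge \
  -p '{"spec":{"allowScheduling":false,"evictionRequested":true}}'
```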

@jenting jenting added this to New in Community Issue Review via automation Sep 13, 2021
@PhanLe1010 PhanLe1010 moved this from New to Resolved/Scheduled in Community Issue Review Sep 14, 2021
@PhanLe1010 PhanLe1010 added this to the v1.3.0 milestone Sep 14, 2021
@yasker yasker added priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated labels Sep 14, 2021
@innobead innobead added backport/1.2.1 Require to backport to 1.2.1 release branch backport/1.1.3 Require to backport to 1.1.3 release branch labels Sep 14, 2021
@longhorn-io-github-bot

longhorn-io-github-bot commented Sep 17, 2021

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: the ticket description

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at: [BUG] Disk eviction not doing anything #2995 (comment)

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at
    controller: Fix the replica eviction not working issue longhorn-manager#1053
    [Backport][v1.2.x]controller: Fix the replica eviction not working issue longhorn-manager#1052
    [Backport][v1.1.3]controller: Fix the replica eviction not working issue longhorn-manager#1054

  • Which areas/issues this PR might have potential impacts on?
    Area: Disk eviction
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@shuo-wu
Contributor

shuo-wu commented Sep 17, 2021

Workaround:

  1. Manually scale up the replica count for all volumes that have replicas on the evicting disk
  2. Wait for the rebuilding to finish
  3. Scale the replica count back down for those volumes
  4. Delete the replicas on the evicting disk (a kubectl sketch follows below)
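
A rough kubectl sketch of the same workaround, assuming a volume named pvc-example that normally runs 3 replicas (the volume name, replica name, and counts are placeholders; the same changes can also be made from the Longhorn UI):

```shell
# 1. Temporarily raise the replica count so a new replica is rebuilt on another disk
kubectl -n longhorn-system patch volumes.longhorn.io pvc-example \
  --type merge -p '{"spec":{"numberOfReplicas":4}}'

# 2. Wait until the new replica finishes rebuilding (replica names include the volume name)
kubectl -n longhorn-system get replicas.longhorn.io | grep pvc-example

# 3. Scale the replica count back down to the original value
kubectl -n longhorn-system patch volumes.longhorn.io pvc-example \
  --type merge -p '{"spec":{"numberOfReplicas":3}}'

# 4. Delete the replica that still lives on the evicting disk
#    (pvc-example-r-abc123 is a placeholder; deleting via the UI works as well)
kubectl -n longhorn-system delete replicas.longhorn.io pvc-example-r-abc123
```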

@shuo-wu
Contributor

shuo-wu commented Sep 17, 2021

After the fix, the nightly tests should pass, and the case mentioned in the reproduce steps should work as expected.

@kaxing

kaxing commented Sep 24, 2021

Validation: PASSED

Tested with v1.2.1-rc1 and master-head (20210924-3pm) on a local k3s cluster.
I can see the volume replicas start being removed when the node/disk Eviction Requested is set to True.
(Video attachment: Longhorn-v1.2.1-rc1-Node-Eviction.webm)

@kaxing kaxing closed this as completed Sep 24, 2021
@vinid223
Author

vinid223 commented Sep 24, 2021

> Validation: PASSED
>
> Tested with v1.2.1-rc1 and master-head (20210924-3pm) on a local k3s cluster.
> I can see the volume replicas start being removed when the node/disk Eviction Requested is set to True.
> (Video attachment: Longhorn-v1.2.1-rc1-Node-Eviction.webm)

The issue I had was not with node eviction, but with evicting only a disk on a node while another schedulable disk exists on the same node, e.g. migrating from a slower disk to a newer one while keeping the node itself schedulable.

@kaxing kaxing reopened this Sep 25, 2021
Community Issue Review automation moved this from Resolved/Scheduled to New Sep 25, 2021
@kaxing

kaxing commented Sep 27, 2021

Hey @vinid223, thanks for commenting back on my test result!

I've set up a new test environment and retested with v1.2.1-rc2; the result is in the following recording:
(Video attachment: longhorn-v1.2.1-rc2-node-disks-eviction.webm)

Steps:

  1. Create a cluster on EC2
  2. Install and configure Longhorn
  3. Create and mount two new disks, one slower (gp2) and one faster (gp3), on the worker node instance that is going to swap disks
  4. Set up the new disks in the Longhorn node management UI
  5. Set Scheduling: False and Eviction Requested: True on the disk to be retired

The replicas start migrating to the disk that is available for scheduling. To narrow down the issue, I also turned off the other two nodes in the cluster, so only one node was running during this test.
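
To double-check that the replicas really left the retired disk during a test like this, one option is the sketch below (the nodeID/diskID/diskPath field names are assumed from the replica CRD; fall back to -o yaml if the columns come back empty):

```shell
# List replicas with the node and disk they are scheduled on;
# the retired disk's path/UUID should no longer appear once eviction completes
kubectl -n longhorn-system get replicas.longhorn.io \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,DISK:.spec.diskID,PATH:.spec.diskPath
```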

@vinid223
Author

@kaxing This looks good to me. Thank you.

@kaxing kaxing closed this as completed Sep 27, 2021
@shuo-wu
Contributor

shuo-wu commented Sep 27, 2021

@vinid223 BTW, we did find a corner case (#3076), but it is not related to the eviction bug. It will be tracked in that new ticket.

@jenting jenting moved this from New to Resolved/Scheduled in Community Issue Review Sep 27, 2021
@innobead innobead added backport-needed/1.1.x and removed backport/1.1.3 Require to backport to 1.1.3 release branch labels Oct 11, 2021
@innobead innobead added the backport/1.1.3 Require to backport to 1.1.3 release branch label Dec 10, 2021