Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] After adding taint to a node, volume cannot be attached to any other node #2475

Closed
PhanLe1010 opened this issue Apr 12, 2021 · 4 comments
Assignees
Labels
component/longhorn-manager Longhorn manager (control plane) kind/bug priority/2 Nice to implement or fix in this release (managed by PO) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)
Milestone

Comments

@PhanLe1010
Copy link
Contributor

PhanLe1010 commented Apr 12, 2021

Describe the bug
Volume stuck in attaching after setting taint for a one of it replicas' node

To Reproduce

  1. Create a cluster of 3 Longhorn nodes.
  2. Create a volume of 3 replicas
  3. Set taint f:b=NoExecute for a node
  4. Wait for Kubernetes to evict workloads from tainted node
  5. Attach the volume. Observe that the volume stuck in attaching state

Expected behavior
Longhorn should skip the replica on the tainted node and finish attaching volume

Environment:

  • Longhorn version: Longhorn master version - 04/12/21

Additional context
The replica controller always set the state of the replica: status.CurrentState = types.InstanceStateStopped when the corresponding instance manager is in error state. On the other hand, the volume controller insists on waiting until all replicas to be in running state before finishing attaching the volume

@PhanLe1010 PhanLe1010 changed the title [BUG] After setting taint for a node, volume cannot be attached to any node [BUG] After adding taint to a node, volume cannot be attached to any other node Apr 12, 2021
@PhanLe1010
Copy link
Contributor Author

PhanLe1010 commented Apr 12, 2021

Proposal:

We need to differentiate 2 cases when the Instance Manager get into the error state:

  1. If the volume is detached and isn't attaching, the replica should stay in a stoped state
  2. If the volume is attaching, the replicas should be set to a different state so that the volume controller can ignore it

@innobead innobead added the severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) label Apr 13, 2021
@khushboo-rancher khushboo-rancher added the kind/regression Regression which has worked before label Apr 13, 2021
@innobead innobead added this to the v1.1.1 milestone Apr 13, 2021
@innobead innobead removed the kind/regression Regression which has worked before label Apr 13, 2021
@innobead
Copy link
Member

innobead commented Apr 13, 2021

This is not a regression and it's a day 1 issue.

@shuo-wu "There is no similar issue with the node down case. We can add the similar logic for this case:"
https://github.com/longhorn/longhorn-manager/blob/9da8dee282c552504b4e4aed8dd398bfdaf2993c/controller/volume_controller.go#L1263-L1271

@longhorn-io-github-bot
Copy link

longhorn-io-github-bot commented Apr 14, 2021

Pre Ready-For-Testing Checklist

  • Is the reproduce steps/test steps documented?

  • Is there a workaround for the issue? If so, is it documented?
    No workaround

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, have both YAML file and Chart been updated in the PR?

  • Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
    The PR is at Fix volume cannot be attached when one of the replicas has error IM and the node of IM is still running longhorn-manager#873

  • Which areas/issues this PR might have potential impacts on?
    Area HA
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged?
    The Doc issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
    The automation skeleton PR is at
    The automation test case PR is at

  • If labeled: require/automation-engine Has the engine integration test been merged?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has an separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@meldafrawi
Copy link
Contributor

Validation: PASSED

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/longhorn-manager Longhorn manager (control plane) kind/bug priority/2 Nice to implement or fix in this release (managed by PO) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)
Projects
Status: Closed
Development

No branches or pull requests

5 participants