
[BUG] Engine image and instance manager state is not correct on the node page of Longhorn UI #2377

Closed
khushboo-rancher opened this issue Mar 18, 2021 · 6 comments
Labels: area/ui, component/longhorn-manager, kind/bug, priority/1
Milestone: v1.1.1

khushboo-rancher commented Mar 18, 2021

Describe the bug
The component details for a node don't show the correct values if the engine image, instance manager, etc. are not running on the node.
(Screenshot: Longhorn UI node detail popup, 2021-03-18)

To Reproduce
Case 1:

  1. Deploy Longhorn on a K8s cluster.
  2. Add a taint to a node; the state of the node should become Down on the node page of the Longhorn UI.
  3. Go to the node page of the Longhorn UI and click the entry in the state column shown as notReady.
  4. The engine image shows the Deployed state and the instance-manager shows the Running state.

Case 2: #2081 (comment)

Expected behavior
The engine image and instance-manager should show an error state.

Environment:

  • Longhorn version: Longhorn-master 03/19/2021
@khushboo-rancher khushboo-rancher added kind/bug priority/1 Highly recommended to implement or fix in this release (managed by PO) labels Mar 18, 2021
@khushboo-rancher khushboo-rancher added this to the v1.1.1 milestone Mar 18, 2021

khushboo-rancher commented Mar 18, 2021

After discussing with @PhanLe1010, the points below should be taken into consideration.

  1. The UI should show not deployed when the engine image's NodeDeploymentMap[currentNode] == false. This can be fixed on the UI side.
  2. The issue of the replica CR remaining in the running state and this issue share the same root cause: when the Longhorn manager pod stops running on a node but the Kubernetes node object is still ready, Longhorn doesn't consider the node as down (code). Therefore, Longhorn doesn't switch the ownerID of the resource to a new node and the CRs are not updated. We need to reconsider this node-down handling logic; see the sketch after this list.
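
To make point 2 concrete, here is a minimal, self-contained sketch (illustrative names only, not the actual longhorn-manager code) of why deriving "node down" solely from the Kubernetes node condition leaves the ownerID stuck when only the manager pod stops:

```go
package main

import "fmt"

// nodeState captures the two signals relevant to this bug.
type nodeState struct {
	KubeNodeReady     bool // Kubernetes node object is Ready
	ManagerPodRunning bool // longhorn-manager pod is running on the node
}

// currentCheck mirrors the existing behavior (simplified): only the
// Kubernetes node condition is consulted.
func currentCheck(n nodeState) bool {
	return !n.KubeNodeReady
}

// proposedCheck also treats the node as unavailable for ownership purposes
// when its manager pod is not running.
func proposedCheck(n nodeState) bool {
	return !n.KubeNodeReady || !n.ManagerPodRunning
}

func main() {
	// The problematic case: node is Ready but the manager pod is gone.
	n := nodeState{KubeNodeReady: true, ManagerPodRunning: false}
	fmt.Println("down (current):", currentCheck(n))   // false: ownerID never transfers, CRs go stale
	fmt.Println("down (proposed):", proposedCheck(n)) // true: a new owner can take over
}
```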

@PhanLe1010 PhanLe1010 self-assigned this Mar 18, 2021
@PhanLe1010

In this issue, we should also apply the fix from @joshimoo's suggestion:

For each enqueue, remove these types of checks, since they prohibit ownership transfer: https://github.com/longhorn/longhorn-manager/blob/b0b3579609ac768d76271f68f662ce2243b6cb99/controller/engine_image_controller.go#L656
The controller should at least look at each resource to determine whether it's responsible for it or not; it's not the enqueue's responsibility to make that determination. This is a form of optimization we can add later if necessary, but for now remove these checks from the enqueues. The same applies to the other controllers.
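
A rough, self-contained sketch of the suggested shape, using illustrative types rather than the actual longhorn-manager API: the enqueue accepts every event, and the per-resource sync decides ownership, so responsibility can transfer when the current owner goes away.

```go
package main

import "fmt"

type engineImage struct {
	Name    string
	OwnerID string
}

type controller struct {
	nodeID string
	queue  []string // stand-in for the real workqueue
}

// enqueue adds every event; no ownership filtering here, because filtering
// at enqueue time prevents ownership transfer to this node.
func (c *controller) enqueue(name string) {
	c.queue = append(c.queue, name)
}

// isResponsibleFor is where the per-resource decision belongs, e.g. taking
// over when the current owner node is down (simplified to a direct flag).
func (c *controller) isResponsibleFor(ei *engineImage, ownerDown bool) bool {
	return ei.OwnerID == c.nodeID || ownerDown
}

func (c *controller) sync(ei *engineImage, ownerDown bool) {
	if !c.isResponsibleFor(ei, ownerDown) {
		return
	}
	ei.OwnerID = c.nodeID // take over ownership, then reconcile the resource
	fmt.Printf("%s is now owned by %s\n", ei.Name, ei.OwnerID)
}

func main() {
	c := &controller{nodeID: "node-2"}
	ei := &engineImage{Name: "ei-abc", OwnerID: "node-1"}
	c.enqueue(ei.Name)
	c.sync(ei, true) // node-1's manager is gone, so node-2 takes over
}
```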


longhorn-io-github-bot commented Mar 30, 2021

Pre-merged Checklist

  • Are the reproduce steps/test steps documented?

  • Is there a workaround for the issue? If so, is it documented?

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, have both YAML file and Chart been updated in the PR?

  • Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
    The PR is at Ownership transferring refactoring longhorn-manager#854

  • Which areas/issues this PR might have potential impacts on?
    Area HA, RWX
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is: This issue

  • If labeled: require/doc Has the necessary document PR submitted or merged?
    The Doc issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
    The automation skeleton PR is at
    The automation test case PR is at

  • If labeled: require/automation-engine Has the engine integration test been merged?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces code for backward compatibility: Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at


PhanLe1010 commented Mar 30, 2021

Manual test:

Case 1:
Test case 1 in the reproduce steps in the issue description

Case 2:
Test case 2 in the reproduce steps in the issue description

Case 3:

  1. Set up a 3-node cluster and install Longhorn.
  2. Create a Deployment/StatefulSet of 1 pod using a volume of 3 replicas.
  3. Taint a node that the volume is not attached to with the taint k=v:NoExecute.
  4. Scale down the Deployment/StatefulSet and verify that the volume is detached successfully.
  5. Scale up the Deployment/StatefulSet and verify that the volume is attached successfully.
  6. Verify that the .Status.OwnerID of the engine, replica, and volume CRs all move to a different node.
  7. Remove the taint and verify that the failed replica gets reused.

Note: If we taint the node that the volume is attached to with the taint k=v:NoExecute, we will not be able to detach/attach the volume. This problem should be fixed in issue #2329.
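
The taint steps used throughout these cases are the equivalent of `kubectl taint nodes <node> k=v:NoExecute` (or `:NoSchedule`). For reference, a minimal client-go sketch that applies such a taint programmatically; the kubeconfig path and node name are placeholders, and the Get/Update signatures assume client-go v0.18+:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholders: adjust the kubeconfig path and node name for your cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	node, err := clientset.CoreV1().Nodes().Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Equivalent to: kubectl taint nodes node-1 k=v:NoExecute
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "k",
		Value:  "v",
		Effect: corev1.TaintEffectNoExecute,
	})

	if _, err := clientset.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("tainted node-1 with k=v:NoExecute")
}
```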

Case 4: Test that the share manager changes ownerID

  1. Create a Deployment/StatefulSet of 2 pods that use 1 RWX volume of 3 replicas.
  2. Let's say the RWX volume is attached to node-1.
  3. Set the taint k=v:NoSchedule on node-1.
  4. Remove the longhorn-manager pod on node-1.
  5. Verify that the ownerID of the share manager moves to a different node, say node-2.
  6. Verify that Longhorn doesn't disrupt the workload and doesn't detach the RWX volume.
  7. Scale down the workload to 0; the RWX volume is detached.
  8. Scale up the workload and verify that the RWX volume is attached to node-2.

Note: If we set the taint k=v:NoExecute on node-1 in step 3, the RWX volume will be auto-detached but will never be able to reattach and recover. This problem is related to the auto-reattach feature; it needs to be fixed in a different issue where we revise that feature.

Case 5: Make sure IM pods are created/deleted on the correct nodes

  1. Add the taint k=v:NoExecute to node-1.
  2. Wait for the IM pods on node-1 to be deleted.
  3. Remove the taint.
  4. Verify that the IM pods are recreated correctly (i.e., each on the node matching the IM's Spec.NodeID).

Case 6: Engine CR transfers ownerID from node A to B and then back to A

  1. Create a Deployment of 1 pod that uses 1 Longhorn volume.
  2. Assume that the pod and the volume are on node-1; add the taint k=v:NoSchedule to node-1.
  3. Remove the engine image DaemonSet pod on node-1.
  4. Verify that the engine image CR's ownerID moves to a different node, say node-2.
  5. Remove the taint k=v:NoSchedule from node-1.
  6. Verify that the engine image CR's ownerID moves back to node-1.
  7. Delete a replica (not the one on node-1, because Longhorn will not be able to recreate a replica on node-1 since the engine image is missing there).
  8. Verify that the engine CR's status is updated with the new replica. This means the engine CR is refreshed and not stuck in a non-monitoring state, which confirms the bug is fixed.

Case 7: Test volume attaching when there is engine image missing on some replicas' nodes

  1. Create a volume of 1 replica
  2. Let's say the replica is on node-1
  3. Add the taint k=v:NoSchedule to node-1.
  4. Delete the engine image DaemonSet pod on node-1.
  5. Try to attach the volume to any node.
  6. Verify that you cannot attach the volume because the engine image is not deployed on any of the replicas' nodes.


PhanLe1010 commented Mar 30, 2021

@smallteeths
Besides the backend changes, this issue also contains a UI bug. Can you help fix it? Some more details below:

  • Each engine image object has a NodeDeploymentMap that shows which nodes have this engine image, e.g.:
    NodeDeploymentMap:
    - phan-cluster-v46-worker1: True
    - phan-cluster-v46-worker2: True
    - phan-cluster-v46-worker3: True
    
  • When the user clicks on this popup modal (screenshot omitted):
    The UI can use the engine image's NodeDeploymentMap to check whether the clicked node has the engine image. If the value is false or the entry does not exist, the UI should show not deployed; otherwise it should show deployed (see the sketch after this list).
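
A small sketch of that lookup, written in Go for consistency with the backend even though the actual fix lands in longhorn-ui: a missing or false entry in NodeDeploymentMap should be rendered as "not deployed".

```go
package main

import "fmt"

// deploymentState returns the label the node detail popup should display
// for the engine image on the given node.
func deploymentState(nodeDeploymentMap map[string]bool, node string) string {
	if deployed, ok := nodeDeploymentMap[node]; ok && deployed {
		return "deployed"
	}
	// Either the entry is missing or explicitly false.
	return "not deployed"
}

func main() {
	m := map[string]bool{
		"phan-cluster-v46-worker1": true,
		"phan-cluster-v46-worker2": true,
	}
	fmt.Println(deploymentState(m, "phan-cluster-v46-worker1")) // deployed
	fmt.Println(deploymentState(m, "phan-cluster-v46-worker3")) // not deployed (no entry)
}
```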

@meldafrawi

Validation: PASSED
