
[BUG] Engine image and instance manager state is not correct on the node page of Longhorn UI #2377

Closed
khushboo-rancher opened this issue Mar 18, 2021 · 6 comments
Labels: area/ui, component/longhorn-manager, kind/bug, priority/1
Milestone: v1.1.1

khushboo-rancher commented Mar 18, 2021

Describe the bug
The component details for a node don't show the correct values if the engine image, instance manager, etc. are not running on the node.
(Screenshot: Longhorn UI node detail popup, 2021-03-18)

To Reproduce
Case 1:

  1. Deploy Longhorn on a K8s cluster.
  2. Add a taint to a node; the state of the node should become Down on the node page of the Longhorn UI.
  3. Go to the node page of the Longhorn UI and click the entry in the state column shown as notReady.
  4. The engine image shows the Deployed state and the instance-manager shows the Running state.

Case 2: #2081 (comment)

Expected behavior
The engine image and instance-manager should show an error state.

Environment:

  • Longhorn version: Longhorn-master 03/19/2021
@khushboo-rancher khushboo-rancher added kind/bug priority/1 Highly recommended to implement or fix in this release (managed by PO) labels Mar 18, 2021
@khushboo-rancher khushboo-rancher added this to the v1.1.1 milestone Mar 18, 2021

khushboo-rancher commented Mar 18, 2021

After discussing with @PhanLe1010, the points below should be taken into consideration.

  1. The UI should show not deployed when the engine image's NodeDeploymentMap[currentNode] == false. This can be fixed on the UI side.
  2. The issue of the replica CR remaining in the running state and this issue share the same root cause: when the Longhorn manager pod stops running on a node but the Kubernetes node object is still ready, Longhorn doesn't consider the node as down (code). Therefore, Longhorn doesn't switch the ownerID of the resource to a new node and the CRs are not updated. We need to reconsider this node-down handling logic; see the sketch after this list.
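
To make point 2 concrete, here is a minimal, self-contained sketch (illustrative names only, not the actual longhorn-manager code) of why deriving "node down" solely from the Kubernetes node condition leaves the ownerID stuck when only the manager pod stops:

```go
package main

import "fmt"

// nodeState captures the two signals relevant to this bug.
type nodeState struct {
	KubeNodeReady     bool // Kubernetes node object is Ready
	ManagerPodRunning bool // longhorn-manager pod is running on the node
}

// currentCheck mirrors the existing behavior (simplified): only the
// Kubernetes node condition is consulted.
func currentCheck(n nodeState) bool {
	return !n.KubeNodeReady
}

// proposedCheck also treats the node as unavailable for ownership purposes
// when its manager pod is not running.
func proposedCheck(n nodeState) bool {
	return !n.KubeNodeReady || !n.ManagerPodRunning
}

func main() {
	// The problematic case: node is Ready but the manager pod is gone.
	n := nodeState{KubeNodeReady: true, ManagerPodRunning: false}
	fmt.Println("down (current):", currentCheck(n))   // false: ownerID never transfers, CRs go stale
	fmt.Println("down (proposed):", proposedCheck(n)) // true: a new owner can take over
}
```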

@PhanLe1010 PhanLe1010 self-assigned this Mar 18, 2021
@PhanLe1010

In this issue, we should also apply the fix from @joshimoo's suggestion:

For each enqueue, remove these types of checks, since they prohibit ownership transfer: https://github.com/longhorn/longhorn-manager/blob/b0b3579609ac768d76271f68f662ce2243b6cb99/controller/engine_image_controller.go#L656
The controller should at least look at each resource to determine whether it's responsible for it or not; it's not the enqueue's responsibility to make that determination. This is a form of optimization we can add later if necessary, but for now remove these checks from the enqueues. The same applies to the other controllers.
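
A rough, self-contained sketch of the suggested shape, using illustrative types rather than the actual longhorn-manager API: the enqueue accepts every event, and the per-resource sync decides ownership, so responsibility can transfer when the current owner goes away.

```go
package main

import "fmt"

type engineImage struct {
	Name    string
	OwnerID string
}

type controller struct {
	nodeID string
	queue  []string // stand-in for the real workqueue
}

// enqueue adds every event; no ownership filtering here, because filtering
// at enqueue time prevents ownership transfer to this node.
func (c *controller) enqueue(name string) {
	c.queue = append(c.queue, name)
}

// isResponsibleFor is where the per-resource decision belongs, e.g. taking
// over when the current owner node is down (simplified to a direct flag).
func (c *controller) isResponsibleFor(ei *engineImage, ownerDown bool) bool {
	return ei.OwnerID == c.nodeID || ownerDown
}

func (c *controller) sync(ei *engineImage, ownerDown bool) {
	if !c.isResponsibleFor(ei, ownerDown) {
		return
	}
	ei.OwnerID = c.nodeID // take over ownership, then reconcile the resource
	fmt.Printf("%s is now owned by %s\n", ei.Name, ei.OwnerID)
}

func main() {
	c := &controller{nodeID: "node-2"}
	ei := &engineImage{Name: "ei-abc", OwnerID: "node-1"}
	c.enqueue(ei.Name)
	c.sync(ei, true) // node-1's manager is gone, so node-2 takes over
}
```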


longhorn-io-github-bot commented Mar 30, 2021

Pre-merged Checklist

  • Are the reproduce steps/test steps documented?

  • Is there a workaround for the issue? If so, is it documented?

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, have both YAML file and Chart been updated in the PR?

  • Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
    The PR is at Ownership transferring refactoring longhorn-manager#854

  • Which areas/issues this PR might have potential impacts on?
    Area HA, RWX
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is: This issue

  • If labeled: require/doc Has the necessary document PR submitted or merged?
    The Doc issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
    The automation skeleton PR is at
    The automation test case PR is at

  • If labeled: require/automation-engine Has the engine integration test been merged?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces code for backward compatibility: Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at


PhanLe1010 commented Mar 30, 2021

Manual test:

Case 1:
Test case 1 in the reproduce steps in the issue description

Case 2:
Test case 2 in the reproduce steps in the issue description

Case 3:

  1. Set up a 3-node cluster and install Longhorn.
  2. Create a Deployment/StatefulSet of 1 pod using a volume of 3 replicas.
  3. Taint a node that the volume is not attached to with the taint k=v:NoExecute.
  4. Scale down the Deployment/StatefulSet and verify that the volume is detached successfully.
  5. Scale up the Deployment/StatefulSet and verify that the volume is attached successfully.
  6. Verify that the .Status.OwnerID of the engine, replica, and volume CRs all move to a different node.
  7. Remove the taint and verify that the failed replica gets reused.

Note: If we taint the node that the volume is attached to with the taint k=v:NoExecute, we will not be able to detach/attach the volume. This problem should be fixed in issue #2329.
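
The taint steps used throughout these cases are the equivalent of `kubectl taint nodes <node> k=v:NoExecute` (or `:NoSchedule`). For reference, a minimal client-go sketch that applies such a taint programmatically; the kubeconfig path and node name are placeholders, and the Get/Update signatures assume client-go v0.18+:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholders: adjust the kubeconfig path and node name for your cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	node, err := clientset.CoreV1().Nodes().Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Equivalent to: kubectl taint nodes node-1 k=v:NoExecute
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "k",
		Value:  "v",
		Effect: corev1.TaintEffectNoExecute,
	})

	if _, err := clientset.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("tainted node-1 with k=v:NoExecute")
}
```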

Case 4: Test that the share manager changes ownerID

  1. Create a Deployment/StatefulSet of 2 pods that use 1 RWX volume of 3 replicas.
  2. Let's say the RWX volume is attached to node-1.
  3. Set the taint k=v:NoSchedule on node-1.
  4. Remove the longhorn-manager pod on node-1.
  5. Verify that the ownerID of the share manager moves to a different node, say node-2.
  6. Verify that Longhorn doesn't disrupt the workload and doesn't detach the RWX volume.
  7. Scale down the workload to 0; the RWX volume is detached.
  8. Scale up the workload and verify that the RWX volume is attached to node-2.

Note: If we set the taint k=v:NoExecute on node-1 in step 3, the RWX volume will be auto-detached but will never be able to reattach and recover. This problem is related to the auto-reattach feature; it needs to be fixed in a different issue where we revise that feature.

Case 5: Make sure IM pods are created/deleted on the correct nodes

  1. Add the taint k=v:NoExecute to node-1.
  2. Wait for the IM pods on node-1 to be deleted.
  3. Remove the taint.
  4. Verify that the IM pods are recreated correctly (i.e., each on the node matching the IM's Spec.NodeID).

Case 6: Engine CR transfers ownerID from node A to B and then back to A

  1. Create a Deployment of 1 pod that uses 1 Longhorn volume.
  2. Assume that the pod and the volume are on node-1; add the taint k=v:NoSchedule to node-1.
  3. Remove the engine image DaemonSet pod on node-1.
  4. Verify that the engine image CR's ownerID moves to a different node, say node-2.
  5. Remove the taint k=v:NoSchedule from node-1.
  6. Verify that the engine image CR's ownerID moves back to node-1.
  7. Delete a replica (not the one on node-1, because Longhorn will not be able to recreate a replica on node-1 since the engine image is missing there).
  8. Verify that the engine CR's status is updated with the new replica. This means the engine CR is refreshed and not stuck in a non-monitoring state, which confirms the bug is fixed.

Case 7: Test volume attaching when there is engine image missing on some replicas' nodes

  1. Create a volume of 1 replica
  2. Let's say the replica is on node-1
  3. Add the taint k=v:NoSchedule to node-1.
  4. Delete the engine image DaemonSet pod on node-1.
  5. Try to attach the volume to any node.
  6. Verify that you cannot attach the volume because the engine image is not deployed on any of the replicas' nodes.


PhanLe1010 commented Mar 30, 2021

@smallteeths
Besides the backend changes, this issue also contains a UI bug. Can you help fix it? Some more details below:

  • Each engine image object has a NodeDeploymentMap that shows which nodes have this engine image, e.g.:
    NodeDeploymentMap:
    - phan-cluster-v46-worker1: True
    - phan-cluster-v46-worker2: True
    - phan-cluster-v46-worker3: True
    
  • When the user clicks on this popup modal (screenshot omitted):
    The UI can use the engine image's NodeDeploymentMap to check whether the clicked node has the engine image. If the value is false or the entry does not exist, the UI should show not deployed; otherwise it should show deployed (see the sketch after this list).
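
A small sketch of that lookup, written in Go for consistency with the backend even though the actual fix lands in longhorn-ui: a missing or false entry in NodeDeploymentMap should be rendered as "not deployed".

```go
package main

import "fmt"

// deploymentState returns the label the node detail popup should display
// for the engine image on the given node.
func deploymentState(nodeDeploymentMap map[string]bool, node string) string {
	if deployed, ok := nodeDeploymentMap[node]; ok && deployed {
		return "deployed"
	}
	// Either the entry is missing or explicitly false.
	return "not deployed"
}

func main() {
	m := map[string]bool{
		"phan-cluster-v46-worker1": true,
		"phan-cluster-v46-worker2": true,
	}
	fmt.Println(deploymentState(m, "phan-cluster-v46-worker1")) // deployed
	fmt.Println(deploymentState(m, "phan-cluster-v46-worker3")) // not deployed (no entry)
}
```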

@meldafrawi

Validation: PASSED
