[BUG] Cannot create backup in a cluster where the engine image is not fully deployed #5248

Closed
chriscchien opened this issue Jan 10, 2023 · 15 comments
Labels
area/volume-backup-restore backport/1.3.4 backport/1.4.2 kind/bug kind/regression priority/0 reproduce/always severity/1

Comments

@chriscchien
Contributor

Describe the bug (🐛 if you encounter this issue)

Cannot create a backup in a cluster where the engine image is not fully deployed.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Longhorn in a 3-node cluster
  2. Set up a backupstore on the cloud
  3. Create a volume with 3 replicas and attach it to a node
  4. Taint another node that the volume is not attached to
  5. Delete the engine image DaemonSet and wait until only 2 engine image pods remain
  6. Create a backup for the volume; the backup does not succeed

Expected behavior

A backup can be created even when the engine image is not fully deployed in the cluster.

Log or Support bundle

supportbundle_df3a5f73-9034-411e-846d-823caef11337_2023-01-10T13-26-05Z.zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3

Additional context

After step 6, creating another volume with 2 replicas, attaching it to a node, and then taking a backup also fails.

@chriscchien chriscchien added kind/bug severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) kind/regression Regression which has worked before labels Jan 10, 2023
@innobead innobead added this to the v1.5.0 milestone Jan 10, 2023
@innobead
Member

@chriscchien is this an issue in 1.4.0? Do we have an automated e2e case for this?

@innobead innobead added priority/0 Must be fixed in this release (managed by PO) area/volume-backup-restore Volume backup restore reproduce/always 100% reproducible labels Jan 10, 2023
@chriscchien
Contributor Author

We have the e2e case test_engine_image_not_fully_deployed_perform_volume_operations, which is being implemented now. For v1.4.0, I will update the test result later.

@innobead
Member

@chriscchien you are awesome. thank you.

@chriscchien
Contributor Author

This issue can be reproduced on a fresh cluster with v1.4.0 deployed. Here is the support bundle in case it is needed:
supportbundle_9b2a0851-4357-4a4b-9ce6-72323d4ec50f_2023-01-11T01-05-28Z.zip

@ChanYiLin
Contributor

ChanYiLin commented Apr 25, 2023

It seemed that the reason was that
the controller (longhorn-manager) with no engine image available (because of the taint) picked up the CR and tried to reconcile it every time.

The normal one, with an available engine image, never had the chance to reconcile the backup CR.

When I deleted the abnormal controller, the normal one would pick up the backup CR and the creation would succeed.

@innobead
Member

It seemed that the reason was that the controller (longhorn-manager) with no engine image available (because of the taint) picked up the CR and tried to reconcile it every time.

But why can the longhorn-manager pod run on a tainted node while the engine-image pod cannot? I thought we used the same taint tolerations for all Longhorn components.

@ChanYiLin
Contributor

ChanYiLin commented Apr 25, 2023

The reproduce steps are:

  • ...
  • give the node a taint
  • delete the engine image daemonset, wait for it to be recreated, and wait until only 2 engine image pods remain
  • ...

The steps don't delete the longhorn-manager daemonset to let it be recreated,
so the longhorn-manager pod will still be on the tainted node.

@ChanYiLin
Contributor

I found something else, sorry for the misleading information.

The root cause is not that the backup CR is picked up by the wrong controller,
but that none of the controllers succeeds in taking responsibility for the backup CR.

We may need to look at the isResponsibleFor() function code.
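
For context, here is a minimal, self-contained sketch of the ownership pattern being discussed. The types, names, and signatures are simplified and hypothetical (this is not the actual longhorn-manager code): when the ready-nodes-with-engine-image lookup returns an empty map, the responsibility check fails on every node, so no controller ever reconciles the backup CR.

```go
package main

import "fmt"

// Node is a minimal stand-in for a Longhorn node; hypothetical, for illustration only.
type Node struct {
	Name  string
	Ready bool
}

// listReadyNodesWithReadyEngineImage mimics the datastore helper discussed later in
// this thread: if the engine image CR is not in the "deployed" state, it returns an
// empty map, even though some nodes do have a ready engine image.
func listReadyNodesWithReadyEngineImage(engineImageFullyDeployed bool, nodes []Node) map[string]Node {
	if !engineImageFullyDeployed {
		return map[string]Node{}
	}
	ready := map[string]Node{}
	for _, n := range nodes {
		if n.Ready {
			ready[n.Name] = n
		}
	}
	return ready
}

// isResponsibleFor sketches the ownership decision: a controller only takes over the
// backup CR if its own node appears in the ready-nodes-with-engine-image map.
func isResponsibleFor(controllerID string, readyNodes map[string]Node) bool {
	_, ok := readyNodes[controllerID]
	return ok
}

func main() {
	nodes := []Node{{Name: "kworker1", Ready: true}, {Name: "kworker2", Ready: true}}

	// Engine image stuck in the "deploying" state because of the tainted node:
	readyNodes := listReadyNodesWithReadyEngineImage(false, nodes)

	// No controller considers itself responsible, so the backup CR is never reconciled.
	for _, n := range nodes {
		fmt.Printf("controller on %s responsible: %v\n", n.Name, isResponsibleFor(n.Name, readyNodes))
	}
}
```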

@innobead
Member

cc @PhanLe1010

@ChanYiLin
Contributor

ChanYiLin commented Apr 25, 2023

OK, I got the root cause.

It is because the engine image is not fully deployed.

  • From the engineimages.longhorn.io CR status, it showed: Engine image is not fully deployed on all nodes: 1 of 2
Status:
  Build Date:           2023-04-10T06:10:29+00:00
  Cli API Min Version:  3
  Cli API Version:      8
  Conditions:
    Last Probe Time:           
    Last Transition Time:      2023-04-25T09:59:34Z
    Message:                   Engine image is not fully deployed on all nodes: 1 of 2
    Reason:                    daemonSet
    Status:                    False
    Type:                      ready
  Controller API Min Version:  3
  Controller API Version:      4
  Data Format Min Version:     1
  Data Format Version:         1
  Git Commit:                  bfa9e8934722d60cd9c3de7fa170e0688d680d37
  No Ref Since:                
  Node Deployment Map:
    kworker1:  true
    kworker2:  false

The engineimages.longhorn.io status is decided by comparing the ready nodes.longhorn.io with engineImage.Status.NodeDeploymentMap:
https://github.com/longhorn/longhorn-manager/blob/master/controller/engine_image_controller.go#L334-L349
Since the longhorn-manager is still running on the tainted node,
readyNodes will always be greater than deployedNodeCount, which keeps the engine image in the EngineImageStateDeploying state.
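
To make that comparison concrete, here is a small self-contained sketch (hypothetical constants and helper names, not the actual controller code) of why the state never reaches deployed while the tainted node stays ready as a Longhorn node but cannot run an engine image pod:

```go
package main

import "fmt"

// Hypothetical stand-ins for the engine image states referenced above.
const (
	EngineImageStateDeployed  = "deployed"
	EngineImageStateDeploying = "deploying"
)

// decideEngineImageState compares the ready Longhorn nodes against the engine image's
// NodeDeploymentMap: if any ready node lacks the image, the CR stays in "deploying".
func decideEngineImageState(readyNodes []string, nodeDeploymentMap map[string]bool) string {
	deployedNodeCount := 0
	for _, node := range readyNodes {
		if nodeDeploymentMap[node] {
			deployedNodeCount++
		}
	}
	if deployedNodeCount == len(readyNodes) {
		return EngineImageStateDeployed
	}
	return EngineImageStateDeploying
}

func main() {
	// kworker2 is still a ready Longhorn node (its longhorn-manager pod survived the
	// NoSchedule taint), but the recreated engine image DaemonSet pod cannot be
	// scheduled there, so the deployment map reports false for it.
	readyNodes := []string{"kworker1", "kworker2"}
	deployment := map[string]bool{"kworker1": true, "kworker2": false}

	fmt.Println(decideEngineImageState(readyNodes, deployment)) // prints "deploying"
}
```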

@ChanYiLin
Contributor

We should still allow the backup_controller to select a responsible controller even when the engine image is not fully deployed.
This can be done by removing the condition in ListReadyNodesWithReadyEngineImage():

// ListReadyNodesWithReadyEngineImage returns list of ready nodes that have the corresponding engine image deployed
func (s *DataStore) ListReadyNodesWithReadyEngineImage(image string) (map[string]*longhorn.Node, error) {
	ei, err := s.GetEngineImage(types.GetEngineImageChecksumName(image))
	if err != nil {
		return nil, fmt.Errorf("unable to get engine image %v: %v", image, err)
	}

        // [REMOVE FOLLOWING CONDITION]
	// if ei.Status.State != longhorn.EngineImageStateDeployed {
	// 	return map[string]*longhorn.Node{}, nil
	// }

	nodes, err := s.ListNodesWithEngineImage(ei)
	if err != nil {
		return nil, err
	}
	readyNodes := filterReadyNodes(nodes)
	return readyNodes, nil
}

But maybe we added this condition for a reason?
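
One less invasive variant, sketched here as an assumption only (this thread does not confirm it is the eventual fix), would be to keep the guard but also accept the deploying state, so nodes that already have a ready engine image can still be selected:

```go
// Hypothetical narrower relaxation of the guard shown above (not necessarily the
// actual fix): only return an empty map for states in which no node can have a
// usable engine image, instead of requiring the image to be fully deployed.
if ei.Status.State != longhorn.EngineImageStateDeployed &&
	ei.Status.State != longhorn.EngineImageStateDeploying {
	return map[string]*longhorn.Node{}, nil
}
```

Even with the guard relaxed (or removed), the result still goes through ListNodesWithEngineImage() and filterReadyNodes(), so only ready nodes that actually have the image would be returned.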

@longhorn-io-github-bot

longhorn-io-github-bot commented Apr 26, 2023

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
  • Steps:
    1. Deploy Longhorn in a 3-node cluster
    2. Set up a backupstore on the cloud
    3. Create a volume with 3 replicas and attach it to a node
    4. Taint another node that the volume is not attached to
    5. Delete the engine image DaemonSet and wait until only 2 engine image pods remain
    6. Create a backup for the volume; the backup should succeed

* [ ] Is there a workaround for the issue? If so, where is it documented?
The workaround is at:

  • Does the PR include the explanation for the fix or the feature? YES

* [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
The PR for the YAML change is at:
The PR for the chart change is at:

* [ ] Which areas/issues this PR might have potential impacts on?
Area
Issues

* [ ] If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
The LEP PR is at

* [ ] If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
The UI issue/PR is at

* [ ] If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
The documentation issue/PR is at

* [ ] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
The automation skeleton PR is at
The automation test case PR is at
The issue of automation test case implementation is at (please create by the template)

* [ ] If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
The engine automation PR is at

* [ ] If labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan is at

* [ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at

@innobead
Member

Moved to review. @ChanYiLin, I assume this is ready for review; if so, please remember to move it to the right pipeline next time. Thanks.

@shuo-wu
Contributor

shuo-wu commented Apr 28, 2023

It seemed that the reason was that the controller (longhorn-manager) with no engine image available (because of the taint) picked up the CR and tried to reconcile it every time.

But why can the longhorn-manager pod run on a tainted node while the engine-image pod cannot? I thought we used the same taint tolerations for all Longhorn components.

When the taint effect is NoSchedule rather than NoExecute, the existing components (without the corresponding toleration) on that node won't be evicted. That's why there is still a longhorn-manager pod running on it. As for the engine-image pod, the taint blocks the new pod after the old one gets removed.

@roger-ryao

Verified on master-head 20230503

  • longhorn master-head (81adad7)
  • longhorn-manager master-head (2613d56)

The test steps

#5248 (comment)

  1. Deploy Longhorn in a 3-node cluster
  2. Set up a backupstore target on the cloud
  3. Create a volume with 3 replicas and attach it to a node
  4. Taint another node that the volume is not attached to
    kubectl taint nodes <node-name> key=value:NoSchedule
  5. Delete the engine-image-ei-xxxx DaemonSet and wait until only 2 engine image pods remain
  6. Create a backup for the volume

Result: Passed

  1. I created a backup for the volume, and the backup was successful.
