[BUG] Cannot create backup in a cluster where the engine image is not fully deployed #5248

Closed
chriscchien opened this issue Jan 10, 2023 · 15 comments
Labels
area/volume-backup-restore backport/1.3.4 backport/1.4.2 kind/bug kind/regression priority/0 reproduce/always severity/1

Comments

@chriscchien
Contributor

Describe the bug (🐛 if you encounter this issue)

Cannot create a backup in a cluster where the engine image is not fully deployed.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Longhorn in a 3-node cluster
  2. Set up a backupstore on the cloud
  3. Create a volume with 3 replicas and attach it to a node
  4. Taint another node that the volume is not attached to
  5. Delete the engine image DaemonSet and wait until only 2 engine image pods remain
  6. Create a backup for the volume; the backup does not succeed

Expected behavior

A backup can be created even when the engine image is not fully deployed in the cluster.

Log or Support bundle

supportbundle_df3a5f73-9034-411e-846d-823caef11337_2023-01-10T13-26-05Z.zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3

Additional context

After step 6, creating another volume with 2 replicas, attaching it to a node, and then taking a backup also fails.

@chriscchien chriscchien added kind/bug severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) kind/regression Regression which has worked before labels Jan 10, 2023
@innobead innobead added this to the v1.5.0 milestone Jan 10, 2023
@innobead
Member

@chriscchien is this an issue in 1.4.0? Do we have an automated e2e case for this?

@innobead innobead added priority/0 Must be fixed in this release (managed by PO) area/volume-backup-restore Volume backup restore reproduce/always 100% reproducible labels Jan 10, 2023
@chriscchien
Contributor Author

We have the e2e case test_engine_image_not_fully_deployed_perform_volume_operations, which is being implemented now. For v1.4.0, I will update the test result later.

@innobead
Member

@chriscchien you are awesome. thank you.

@chriscchien
Contributor Author

This issue can be reproduced on a fresh cluster with v1.4.0 deployed. Here is the support bundle in case it is needed:
supportbundle_9b2a0851-4357-4a4b-9ce6-72323d4ec50f_2023-01-11T01-05-28Z.zip

@ChanYiLin
Contributor

ChanYiLin commented Apr 25, 2023

It seemed that the reason was that
the controller (longhorn-manager) with no engine image available (because of the taint) picked up the CR and tried to reconcile it every time.

The normal one, with an available engine image, never had the chance to reconcile the backup CR.

When I deleted the abnormal controller, the normal one would pick up the backup CR and the creation would succeed.

@innobead
Member

It seemed that the reason was that the controller (longhorn-manager) with no engine image available (because of the taint) picked up the CR and tried to reconcile it every time.

But why can the longhorn-manager pod run on a tainted node while the engine-image pod cannot? I thought we used the same taint tolerations for all Longhorn components.

@ChanYiLin
Contributor

ChanYiLin commented Apr 25, 2023

The reproduce steps are:

  • ...
  • give the node a taint
  • delete the engine image daemonset, wait for it to be recreated, and wait until only 2 engine image pods remain
  • ...

The steps don't delete the longhorn-manager daemonset to let it be recreated,
so the longhorn-manager pod will still be on the tainted node.

@ChanYiLin
Contributor

I found something else, sorry for the misleading information.

The root cause is not that the backup CR is picked up by the wrong controller,
but that none of the controllers succeeds in taking responsibility for the backup CR.

We may need to look at the isResponsibleFor() function code.
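
For context, here is a minimal, self-contained sketch of the ownership pattern being discussed. The types, names, and signatures are simplified and hypothetical (this is not the actual longhorn-manager code): when the ready-nodes-with-engine-image lookup returns an empty map, the responsibility check fails on every node, so no controller ever reconciles the backup CR.

```go
package main

import "fmt"

// Node is a minimal stand-in for a Longhorn node; hypothetical, for illustration only.
type Node struct {
	Name  string
	Ready bool
}

// listReadyNodesWithReadyEngineImage mimics the datastore helper discussed later in
// this thread: if the engine image CR is not in the "deployed" state, it returns an
// empty map, even though some nodes do have a ready engine image.
func listReadyNodesWithReadyEngineImage(engineImageFullyDeployed bool, nodes []Node) map[string]Node {
	if !engineImageFullyDeployed {
		return map[string]Node{}
	}
	ready := map[string]Node{}
	for _, n := range nodes {
		if n.Ready {
			ready[n.Name] = n
		}
	}
	return ready
}

// isResponsibleFor sketches the ownership decision: a controller only takes over the
// backup CR if its own node appears in the ready-nodes-with-engine-image map.
func isResponsibleFor(controllerID string, readyNodes map[string]Node) bool {
	_, ok := readyNodes[controllerID]
	return ok
}

func main() {
	nodes := []Node{{Name: "kworker1", Ready: true}, {Name: "kworker2", Ready: true}}

	// Engine image stuck in the "deploying" state because of the tainted node:
	readyNodes := listReadyNodesWithReadyEngineImage(false, nodes)

	// No controller considers itself responsible, so the backup CR is never reconciled.
	for _, n := range nodes {
		fmt.Printf("controller on %s responsible: %v\n", n.Name, isResponsibleFor(n.Name, readyNodes))
	}
}
```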

@innobead
Member

cc @PhanLe1010

@ChanYiLin
Contributor

ChanYiLin commented Apr 25, 2023

OK, I got the root cause.

It is because the engine image is not fully deployed.

  • From the engineimages.longhorn.io CR status, it showed: Engine image is not fully deployed on all nodes: 1 of 2
Status:
  Build Date:           2023-04-10T06:10:29+00:00
  Cli API Min Version:  3
  Cli API Version:      8
  Conditions:
    Last Probe Time:           
    Last Transition Time:      2023-04-25T09:59:34Z
    Message:                   Engine image is not fully deployed on all nodes: 1 of 2
    Reason:                    daemonSet
    Status:                    False
    Type:                      ready
  Controller API Min Version:  3
  Controller API Version:      4
  Data Format Min Version:     1
  Data Format Version:         1
  Git Commit:                  bfa9e8934722d60cd9c3de7fa170e0688d680d37
  No Ref Since:                
  Node Deployment Map:
    kworker1:  true
    kworker2:  false

The engineimages.longhorn.io status is decided by comparing the ready nodes.longhorn.io with engineImage.Status.NodeDeploymentMap:
https://github.com/longhorn/longhorn-manager/blob/master/controller/engine_image_controller.go#L334-L349
Since the longhorn-manager is still running on the tainted node,
readyNodes will always be greater than deployedNodeCount, which keeps the engine image in the EngineImageStateDeploying state.
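
To make that comparison concrete, here is a small self-contained sketch (hypothetical constants and helper names, not the actual controller code) of why the state never reaches deployed while the tainted node stays ready as a Longhorn node but cannot run an engine image pod:

```go
package main

import "fmt"

// Hypothetical stand-ins for the engine image states referenced above.
const (
	EngineImageStateDeployed  = "deployed"
	EngineImageStateDeploying = "deploying"
)

// decideEngineImageState compares the ready Longhorn nodes against the engine image's
// NodeDeploymentMap: if any ready node lacks the image, the CR stays in "deploying".
func decideEngineImageState(readyNodes []string, nodeDeploymentMap map[string]bool) string {
	deployedNodeCount := 0
	for _, node := range readyNodes {
		if nodeDeploymentMap[node] {
			deployedNodeCount++
		}
	}
	if deployedNodeCount == len(readyNodes) {
		return EngineImageStateDeployed
	}
	return EngineImageStateDeploying
}

func main() {
	// kworker2 is still a ready Longhorn node (its longhorn-manager pod survived the
	// NoSchedule taint), but the recreated engine image DaemonSet pod cannot be
	// scheduled there, so the deployment map reports false for it.
	readyNodes := []string{"kworker1", "kworker2"}
	deployment := map[string]bool{"kworker1": true, "kworker2": false}

	fmt.Println(decideEngineImageState(readyNodes, deployment)) // prints "deploying"
}
```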

@ChanYiLin
Contributor

We should still allow the backup_controller to select a responsible controller even when the engine image is not fully deployed.
This can be done by removing the condition in ListReadyNodesWithReadyEngineImage():

// ListReadyNodesWithReadyEngineImage returns list of ready nodes that have the corresponding engine image deployed
func (s *DataStore) ListReadyNodesWithReadyEngineImage(image string) (map[string]*longhorn.Node, error) {
	ei, err := s.GetEngineImage(types.GetEngineImageChecksumName(image))
	if err != nil {
		return nil, fmt.Errorf("unable to get engine image %v: %v", image, err)
	}

        // [REMOVE FOLLOWING CONDITION]
	// if ei.Status.State != longhorn.EngineImageStateDeployed {
	// 	return map[string]*longhorn.Node{}, nil
	// }

	nodes, err := s.ListNodesWithEngineImage(ei)
	if err != nil {
		return nil, err
	}
	readyNodes := filterReadyNodes(nodes)
	return readyNodes, nil
}

But maybe we added this condition for a reason?
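
One less invasive variant, sketched here as an assumption only (this thread does not confirm it is the eventual fix), would be to keep the guard but also accept the deploying state, so nodes that already have a ready engine image can still be selected:

```go
// Hypothetical narrower relaxation of the guard shown above (not necessarily the
// actual fix): only return an empty map for states in which no node can have a
// usable engine image, instead of requiring the image to be fully deployed.
if ei.Status.State != longhorn.EngineImageStateDeployed &&
	ei.Status.State != longhorn.EngineImageStateDeploying {
	return map[string]*longhorn.Node{}, nil
}
```

Even with the guard relaxed (or removed), the result still goes through ListNodesWithEngineImage() and filterReadyNodes(), so only ready nodes that actually have the image would be returned.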

@longhorn-io-github-bot

longhorn-io-github-bot commented Apr 26, 2023

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
  • Steps:
    1. Deploy Longhorn in a 3-node cluster
    2. Set up a backupstore on the cloud
    3. Create a volume with 3 replicas and attach it to a node
    4. Taint another node that the volume is not attached to
    5. Delete the engine image DaemonSet and wait until only 2 engine image pods remain
    6. Create a backup for the volume; the backup should succeed

* [ ] Is there a workaround for the issue? If so, where is it documented?
The workaround is at:

  • Does the PR include the explanation for the fix or the feature? YES

* [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
The PR for the YAML change is at:
The PR for the chart change is at:

* [ ] Which areas/issues this PR might have potential impacts on?
Area
Issues

* [ ] If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
The LEP PR is at

* [ ] If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
The UI issue/PR is at

* [ ] If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
The documentation issue/PR is at

* [ ] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
The automation skeleton PR is at
The automation test case PR is at
The issue of automation test case implementation is at (please create by the template)

* [ ] If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
The engine automation PR is at

* [ ] If labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan is at

* [ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at

@innobead
Member

Moved to review. @ChanYiLin, I assume this is ready for review; if so, please remember to move it to the right pipeline next time. Thanks.

@shuo-wu
Contributor

shuo-wu commented Apr 28, 2023

It seemed that the reason was that the controller (longhorn-manager) with no engine image available (because of the taint) picked up the CR and tried to reconcile it every time.

But why can the longhorn-manager pod run on a tainted node while the engine-image pod cannot? I thought we used the same taint tolerations for all Longhorn components.

When the taint effect is NoSchedule rather than NoExecute, the existing components (without the corresponding toleration) on that node won't be evicted. That's why there is still a longhorn-manager pod running on it. As for the engine-image pod, the taint blocks the new pod after the old one gets removed.

@roger-ryao

Verified on master-head 20230503

  • longhorn master-head (81adad7)
  • longhorn-manager master-head (2613d56)

The test steps

#5248 (comment)

  1. Deploy Longhorn in a 3-node cluster
  2. Set up a backupstore target on the cloud
  3. Create a volume with 3 replicas and attach it to a node
  4. Taint another node that the volume is not attached to
    kubectl taint nodes <node-name> key=value:NoSchedule
  5. Delete the engine-image-ei-xxxx DaemonSet and wait until only 2 engine image pods remain
  6. Create a backup for the volume

Result: Passed

  1. I created a backup for the volume, and the backup was successful.
