
[BUG] Two active engine when volume migrating #6642

Closed
Vicente-Cheng opened this issue Sep 6, 2023 · 5 comments
Labels: area/stability, backport/1.4.4, backport/1.5.2, kind/bug, priority/0, require/qa-review-coverage, severity/3

Vicente-Cheng commented Sep 6, 2023

Describe the bug (🐛 if you encounter this issue)

In the Harvester upgrade scenario, a running VM is migrated to another node when the node hosting the VM is upgraded.
The VM migration triggers a volume migration.
Sometimes, we end up with two active engines after the volume migrates to another node.
We see the following log:

2023-09-05T21:23:53.251947253Z stderr F time="2023-09-05T21:23:53Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-7b120d60-1577-4716-be5a-62348271025a" controller=longhorn-volume error="failed to sync longhorn-system/pvc-7b120d60-1577-4716-be5a-62348271025a: failed to reconcile engine/replica state for pvc-7b120d60-1577-4716-be5a-62348271025a: BUG: found the second active engine pvc-7b120d60-1577-4716-be5a-62348271025a-e-1cd53c57 besides pvc-7b120d60-1577-4716-be5a-62348271025a-e-08220b62" node=harvester-q4vhd

To Reproduce

Copied from the Harvester issue: harvester/harvester#4477

  1. Install Harvester with at least 4 nodes
  2. Create an image for VM creation
  3. Create VLAN
  4. Create 2 VMs with 1 snapshot and 1 backup
  5. Import Harvester into Rancher 2.7.6
  6. Create RKE1 and RKE2 cluster (1 node)
  7. Install Harvester Cloud Provider and CSI Driver
  8. Create DHCP load balancer service (only RKE1 cluster)
  9. Create Nginx deployment with new Harvester PVC
  10. Upgrade Harvester

Expected behavior

We should not get two active engines when a volume is migrating.

Support bundle for troubleshooting

Refer to harvester/harvester#4477.

Environment

  • Longhorn version: 1.4.3
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: rke2
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 1
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

Related issues:
harvester/harvester#4477
harvester/harvester#4489
harvester/harvester#3228

PhanLe1010 (Contributor) commented:

Analysis from @Vicente-Cheng

For the migration flow, we can briefly summarize it as follows:

  1. Start the migration from node A to node B
  2. Create a new engine on node B (the engine on node A still holds e.Spec.Active: true)
  3. The new engine on node B becomes ready, but it still holds e.Spec.Active: false
  4. The volume migration from node A to node B completes
  5. The engines are checked after migration; refer to this section: https://github.com/longhorn/longhorn-manager/blob/v1.4.x/controller/volume_controller.go#L3475-L3507
  6. We set e.Spec.Active: true for the current engine when calling GetNewCurrentEngineAndExtras()
  7. Try to remove the extra (i.e. old) engine
  8. If that succeeds, everything is fine

The problem can happen in step 7.
When we call deleteEngine(), we first try to update the engine and then delete it.
If the engine update returns an error, we return that error, and the deferred function in the caller (syncVolume()) then updates all the engines again.
So, in this situation, we end up with two active engines: the current one, and the old one that we wanted to delete but failed to, as the sketch below illustrates.
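
Here is a minimal Go sketch of that failure mode. It uses simplified stand-in types, not the actual longhorn-manager structures; Engine, deleteEngine, and syncVolume below only model the update-then-delete pattern and the deferred re-update described above:

```go
package main

import (
	"errors"
	"fmt"
)

// Engine is a simplified stand-in for the Longhorn engine CR;
// only the field relevant to this bug is modeled.
type Engine struct {
	Name   string
	Active bool // corresponds to e.Spec.Active
}

// deleteEngine mimics the update-then-delete pattern from the analysis:
// the engine object is updated first, and the deletion is skipped
// entirely if that update fails.
func deleteEngine(e *Engine, updateFails bool) error {
	if updateFails {
		// The update error is returned before the delete happens,
		// so the old engine CR survives with Active still true.
		return errors.New("simulated conflict while updating engine before deletion")
	}
	// ... the engine CR would be deleted here ...
	return nil
}

// syncVolume mimics the caller: its deferred block persists all engines
// again, including the old one whose deletion just failed.
func syncVolume(current, old *Engine) (err error) {
	defer func() {
		fmt.Printf("defer: persisting %s (Active=%v) and %s (Active=%v)\n",
			current.Name, current.Active, old.Name, old.Active)
	}()

	current.Active = true // step 6: mark the current engine active
	if err = deleteEngine(old, true); err != nil { // step 7 fails
		return err // old engine still has Active == true
	}
	return nil
}

func main() {
	current := &Engine{Name: "pvc-xxx-e-08220b62", Active: false}
	old := &Engine{Name: "pvc-xxx-e-1cd53c57", Active: true}
	if err := syncVolume(current, old); err != nil {
		fmt.Println("sync error:", err)
	}
	// Both engines now report Active == true, which is exactly the
	// "found the second active engine" condition from the logs above.
}
```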

PhanLe1010 commented Sep 12, 2023

Reproducing Steps:

It is quite difficult to reproduce the case from the analysis (a race condition) organically. I modified an e2e test case to repeatedly run the volume migration 100 times (ref PhanLe1010/longhorn-tests@5c21e8b), but had no luck.

We are able to artificially trigger the bug (the first case) by:

  1. Deploy this modified longhorn-manager: PhanLe1010/longhorn-manager@39b5bbe. It manufactures an error on the first engine deletion to trigger the two-active-engines bug
  2. Create a migratable volume (spec.accessMode: rwx and spec.migratable: true); see the sketch after this list
  3. Attach it to node-a
  4. Attach it to node-b (start the migration)
  5. Wait for the volume to be available on node-b
  6. Detach the volume from node-a (migration confirmation)
  7. Observe that there are 2 active engines and the volume can never be reconciled:
    [longhorn-manager-v65nc longhorn-manager] time="2023-09-12T00:41:43Z" level=warning msg="Error syncing Longhorn volume longhorn-system/testvol" controller=longhorn-volume error="failed to sync longhorn-system/testvol: failed to reconcile engine/replica state for testvol: BUG: found the second active engine testvol-e-be829d4a besides testvol-e-6c787758" node=phan-v400-two-active-engines-pool2-de9b2523-jd6hd
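
For step 2, here is a minimal sketch of creating such a volume programmatically with the Kubernetes dynamic client (client-go). The spec.accessMode and spec.migratable fields are taken from the step above; the longhorn.io/v1beta2 group/version, the string-typed size, the replica count, and the kubeconfig path are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Longhorn volumes are served under the longhorn.io API group;
	// v1beta2 is an assumption matching Longhorn v1.4.x.
	volGVR := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "volumes"}

	vol := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "longhorn.io/v1beta2",
		"kind":       "Volume",
		"metadata": map[string]interface{}{
			"name":      "testvol", // name from the log above; any name works
			"namespace": "longhorn-system",
		},
		"spec": map[string]interface{}{
			"size":             "2147483648", // 2 GiB; the CRD stores size as a string
			"numberOfReplicas": int64(3),
			"accessMode":       "rwx", // required for live migration (step 2)
			"migratable":       true,  // required for live migration (step 2)
		},
	}}

	created, err := client.Resource(volGVR).Namespace("longhorn-system").
		Create(context.TODO(), vol, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created migratable volume:", created.GetName())
}
```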
    

longhorn-io-github-bot commented Sep 18, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] Two active engine when volume migrating #6642 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is: delete the engine CR that is not on the node given by volume.Status.CurrentNodeID (a sketch follows this checklist)

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: Fix bug two active engine when volume migrating (longhorn-manager#2155)

  • Which areas/issues might this PR have potential impacts on?
    Area: Migration/HA
    Issues:

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at
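
Here is a sketch of the workaround named in the checklist above (delete the engine CR that is not on volume.Status.CurrentNodeID), again via the dynamic client. The "longhornvolume" label selector and the spec.nodeID / status.currentNodeID field paths are assumptions about the Longhorn CRDs, and the volume name is hypothetical; deleting the engine with kubectl -n longhorn-system delete engines.longhorn.io <name> achieves the same thing:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.TODO()
	ns := "longhorn-system"
	volName := "testvol" // hypothetical: the stuck volume's name

	volGVR := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "volumes"}
	engGVR := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "engines"}

	// Read where the volume currently lives (volume.Status.CurrentNodeID).
	vol, err := client.Resource(volGVR).Namespace(ns).Get(ctx, volName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	currentNodeID, _, _ := unstructured.NestedString(vol.Object, "status", "currentNodeID")

	// Assumption: engine CRs carry a "longhornvolume" label naming their volume.
	engines, err := client.Resource(engGVR).Namespace(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "longhornvolume=" + volName,
	})
	if err != nil {
		panic(err)
	}

	// Delete every engine that is not on the volume's current node.
	for _, e := range engines.Items {
		nodeID, _, _ := unstructured.NestedString(e.Object, "spec", "nodeID")
		if nodeID != currentNodeID {
			fmt.Println("deleting stale engine:", e.GetName())
			if err := client.Resource(engGVR).Namespace(ns).Delete(ctx, e.GetName(), metav1.DeleteOptions{}); err != nil {
				panic(err)
			}
		}
	}
}
```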

PhanLe1010 commented Sep 18, 2023

Test steps:

It is quite difficult to reproduce this case organically. I think it is good enough to make sure the PR passes all the e2e tests.

@innobead innobead changed the title [BUG] two active engine when volume migrating [BUG] Two active engine when volume migrating Sep 21, 2023
@chriscchien chriscchien self-assigned this Sep 22, 2023
chriscchien (Contributor) commented:

Verified pass on longhorn master (longhorn-manager 22e1e1)

The e2e pipeline did not encounter any outstanding issues after the PR was merged (AMD64, ARM64).

@roger-ryao roger-ryao added the severity/3 Function working but has a major issue w/ workaround label Oct 13, 2023
@innobead innobead added the area/stability System or volume stability label Oct 27, 2023