
[BUG] Two active engine when volume migrating #6642

Closed
Vicente-Cheng opened this issue Sep 6, 2023 · 5 comments
Labels: area/stability, backport/1.4.4, backport/1.5.2, kind/bug, priority/0, require/qa-review-coverage, severity/3

Vicente-Cheng commented Sep 6, 2023

Describe the bug (🐛 if you encounter this issue)

In the Harvester upgrade scenario, a running VM is migrated to another node when the node hosting the VM is upgraded.
The VM migration triggers a volume migration.
Sometimes, we end up with two active engines after the volume migrates to another node.
We see the following log:

2023-09-05T21:23:53.251947253Z stderr F time="2023-09-05T21:23:53Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-7b120d60-1577-4716-be5a-62348271025a" controller=longhorn-volume error="failed to sync longhorn-system/pvc-7b120d60-1577-4716-be5a-62348271025a: failed to reconcile engine/replica state for pvc-7b120d60-1577-4716-be5a-62348271025a: BUG: found the second active engine pvc-7b120d60-1577-4716-be5a-62348271025a-e-1cd53c57 besides pvc-7b120d60-1577-4716-be5a-62348271025a-e-08220b62" node=harvester-q4vhd

To Reproduce

Copied from the Harvester issue: harvester/harvester#4477

  1. Install Harvester with at least 4 nodes
  2. Create an image for VM creation
  3. Create VLAN
  4. Create 2 VMs with 1 snapshot and 1 backup
  5. Import Harvester into Rancher 2.7.6
  6. Create RKE1 and RKE2 cluster (1 node)
  7. Install Harvester Cloud Provider and CSI Driver
  8. Create DHCP load balancer service (only RKE1 cluster)
  9. Create Nginx deployment with new Harvester PVC
  10. Upgrade Harvester

Expected behavior

We should not get two active engines when a volume is migrating.

Support bundle for troubleshooting

Refer to harvester/harvester#4477.

Environment

  • Longhorn version: 1.4.3
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: rke2
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 1
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

Related issues:
harvester/harvester#4477
harvester/harvester#4489
harvester/harvester#3228

PhanLe1010 (Contributor) commented:

Analysis from @Vicente-Cheng

For the migration flow, we can briefly summarize it as follows:

  1. Start the migration from node A to node B
  2. Create a new engine on node B (the engine on node A still holds e.Spec.Active: true)
  3. The new engine on node B becomes ready, but it still holds e.Spec.Active: false
  4. The volume migration from node A to node B completes
  5. The engines are checked after migration; refer to this section: https://github.com/longhorn/longhorn-manager/blob/v1.4.x/controller/volume_controller.go#L3475-L3507
  6. We set e.Spec.Active: true for the current engine when calling GetNewCurrentEngineAndExtras()
  7. Try to remove the extra (i.e. old) engine
  8. If that succeeds, everything is fine

The problem can happen in step 7.
When we call deleteEngine(), we first try to update the engine and then delete it.
If the engine update returns an error, we return that error, and the deferred function in the caller (syncVolume()) then updates all the engines again.
So, in this situation, we end up with two active engines: the current one, and the old one that we wanted to delete but failed to, as the sketch below illustrates.
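
Here is a minimal Go sketch of that failure mode. It uses simplified stand-in types, not the actual longhorn-manager structures; Engine, deleteEngine, and syncVolume below only model the update-then-delete pattern and the deferred re-update described above:

```go
package main

import (
	"errors"
	"fmt"
)

// Engine is a simplified stand-in for the Longhorn engine CR;
// only the field relevant to this bug is modeled.
type Engine struct {
	Name   string
	Active bool // corresponds to e.Spec.Active
}

// deleteEngine mimics the update-then-delete pattern from the analysis:
// the engine object is updated first, and the deletion is skipped
// entirely if that update fails.
func deleteEngine(e *Engine, updateFails bool) error {
	if updateFails {
		// The update error is returned before the delete happens,
		// so the old engine CR survives with Active still true.
		return errors.New("simulated conflict while updating engine before deletion")
	}
	// ... the engine CR would be deleted here ...
	return nil
}

// syncVolume mimics the caller: its deferred block persists all engines
// again, including the old one whose deletion just failed.
func syncVolume(current, old *Engine) (err error) {
	defer func() {
		fmt.Printf("defer: persisting %s (Active=%v) and %s (Active=%v)\n",
			current.Name, current.Active, old.Name, old.Active)
	}()

	current.Active = true // step 6: mark the current engine active
	if err = deleteEngine(old, true); err != nil { // step 7 fails
		return err // old engine still has Active == true
	}
	return nil
}

func main() {
	current := &Engine{Name: "pvc-xxx-e-08220b62", Active: false}
	old := &Engine{Name: "pvc-xxx-e-1cd53c57", Active: true}
	if err := syncVolume(current, old); err != nil {
		fmt.Println("sync error:", err)
	}
	// Both engines now report Active == true, which is exactly the
	// "found the second active engine" condition from the logs above.
}
```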

PhanLe1010 commented Sep 12, 2023

Reproducing Steps:

It is quite difficult to reproduce the case from the analysis (a race condition) organically. I modified an e2e test case to repeatedly run the volume migration 100 times (ref PhanLe1010/longhorn-tests@5c21e8b), but had no luck.

We are able to artificially trigger the bug (the first case) by:

  1. Deploy this modified longhorn-manager: PhanLe1010/longhorn-manager@39b5bbe. It manufactures an error on the first engine deletion to trigger the two-active-engines bug
  2. Create a migratable volume (spec.accessMode: rwx and spec.migratable: true); see the sketch after this list
  3. Attach it to node-a
  4. Attach it to node-b (start the migration)
  5. Wait for the volume to be available on node-b
  6. Detach the volume from node-a (migration confirmation)
  7. Observe that there are 2 active engines and the volume can never be reconciled:
    [longhorn-manager-v65nc longhorn-manager] time="2023-09-12T00:41:43Z" level=warning msg="Error syncing Longhorn volume longhorn-system/testvol" controller=longhorn-volume error="failed to sync longhorn-system/testvol: failed to reconcile engine/replica state for testvol: BUG: found the second active engine testvol-e-be829d4a besides testvol-e-6c787758" node=phan-v400-two-active-engines-pool2-de9b2523-jd6hd
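
For step 2, here is a minimal sketch of creating such a volume programmatically with the Kubernetes dynamic client (client-go). The spec.accessMode and spec.migratable fields are taken from the step above; the longhorn.io/v1beta2 group/version, the string-typed size, the replica count, and the kubeconfig path are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Longhorn volumes are served under the longhorn.io API group;
	// v1beta2 is an assumption matching Longhorn v1.4.x.
	volGVR := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "volumes"}

	vol := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "longhorn.io/v1beta2",
		"kind":       "Volume",
		"metadata": map[string]interface{}{
			"name":      "testvol", // name from the log above; any name works
			"namespace": "longhorn-system",
		},
		"spec": map[string]interface{}{
			"size":             "2147483648", // 2 GiB; the CRD stores size as a string
			"numberOfReplicas": int64(3),
			"accessMode":       "rwx", // required for live migration (step 2)
			"migratable":       true,  // required for live migration (step 2)
		},
	}}

	created, err := client.Resource(volGVR).Namespace("longhorn-system").
		Create(context.TODO(), vol, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created migratable volume:", created.GetName())
}
```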
    

longhorn-io-github-bot commented Sep 18, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] Two active engine when volume migrating #6642 (comment)

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is: delete the engine CR that is not on the node given by volume.Status.CurrentNodeID (a sketch follows this checklist)

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at: Fix bug two active engine when volume migrating (longhorn-manager#2155)

  • Which areas/issues might this PR have potential impacts on?
    Area: Migration/HA
    Issues:

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at
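
Here is a sketch of the workaround named in the checklist above (delete the engine CR that is not on volume.Status.CurrentNodeID), again via the dynamic client. The "longhornvolume" label selector and the spec.nodeID / status.currentNodeID field paths are assumptions about the Longhorn CRDs, and the volume name is hypothetical; deleting the engine with kubectl -n longhorn-system delete engines.longhorn.io <name> achieves the same thing:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.TODO()
	ns := "longhorn-system"
	volName := "testvol" // hypothetical: the stuck volume's name

	volGVR := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "volumes"}
	engGVR := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "engines"}

	// Read where the volume currently lives (volume.Status.CurrentNodeID).
	vol, err := client.Resource(volGVR).Namespace(ns).Get(ctx, volName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	currentNodeID, _, _ := unstructured.NestedString(vol.Object, "status", "currentNodeID")

	// Assumption: engine CRs carry a "longhornvolume" label naming their volume.
	engines, err := client.Resource(engGVR).Namespace(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "longhornvolume=" + volName,
	})
	if err != nil {
		panic(err)
	}

	// Delete every engine that is not on the volume's current node.
	for _, e := range engines.Items {
		nodeID, _, _ := unstructured.NestedString(e.Object, "spec", "nodeID")
		if nodeID != currentNodeID {
			fmt.Println("deleting stale engine:", e.GetName())
			if err := client.Resource(engGVR).Namespace(ns).Delete(ctx, e.GetName(), metav1.DeleteOptions{}); err != nil {
				panic(err)
			}
		}
	}
}
```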

PhanLe1010 commented Sep 18, 2023

Test steps:

It is quite difficult to reproduce this case organically. I think it is good enough to make sure the PR passes all the e2e tests.

@innobead innobead changed the title [BUG] two active engine when volume migrating [BUG] Two active engine when volume migrating Sep 21, 2023
@chriscchien chriscchien self-assigned this Sep 22, 2023
chriscchien (Contributor) commented:

Verified pass on longhorn master (longhorn-manager 22e1e1)

The e2e pipeline did not encounter any outstanding issues after the PR was merged (AMD64, ARM64).

@roger-ryao roger-ryao added the severity/3 Function working but has a major issue w/ workaround label Oct 13, 2023
@innobead innobead added the area/stability System or volume stability label Oct 27, 2023