
[BUG] A replica may be incorrectly scheduled to a node with an existing failed replica #8043

Open
ejweber opened this issue Feb 26, 2024 · 6 comments
Labels
area/resilience, area/volume-replica-scheduling, backport/1.5.5, backport/1.6.1, kind/bug, priority/0, require/backport, require/qa-review-coverage

Comments

ejweber commented Feb 26, 2024

Describe the bug

Harvester QA hit a complete lockup of Longhorn after a hard node reboot in a single node cluster. Before the reboot:

  • They had four VMs, each with a single Longhorn volume.
  • Each Longhorn volume had numReplicas == 3, so all volumes were degraded.
  • Each Longhorn volume had one scheduled and two unscheduled replicas (as expected).

After the reboot:

  • The Longhorn volumes were in need of auto-salvage (as expected).
  • Longhorn unexpectedly scheduled an extra replica for each volume to the node. By itself, this wouldn't have been particularly problematic.
  • However, the disk was not large enough to accommodate the extra scheduled replicas.
  • The disk transitioned to unschedulable, halting auto-salvage indefinitely.

To Reproduce

Observe the root cause

  1. Install Longhorn in a single node cluster or shut down all but one node.
  2. Create a deployment that requests a block volume with a size well below 50% of the node's storageMaximum. (I'm not sure the block volume and deployment are important here, but they mimic the original context.) The volume is degraded, and two of its replicas aren't scheduled.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-block-vol
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: block-volume-test
  labels:
    app: block-volume-test
  namespace: default
spec:
  selector:
    matchLabels:
      app: block-volume-test
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: block-volume-test
    spec:
      restartPolicy: Always
      containers:
      - image: nginx:stable-alpine
        name: block-volume-test
        volumeDevices:
        - devicePath: /dev/longhorn/testblk
          name: block-vol
      volumes:
      - name: block-vol
        persistentVolumeClaim:
          claimName: longhorn-block-vol
eweber@laptop:~/longhorn> kl get volume
NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE                                AGE
pvc-976c97f1-8515-44e5-820e-0ea9fb9bbcbc   v1            attached   degraded                 64424509440   eweber-v126-worker-9c1451b4-kgxdq   77s
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-976c97f1-8515-44e5-820e-0ea9fb9bbcbc-r-4f8ed9ed   v1            stopped                                                                                                                                                                           78s
pvc-976c97f1-8515-44e5-820e-0ea9fb9bbcbc-r-62ad5f16   v1            stopped                                                                                                                                                                           78s
pvc-976c97f1-8515-44e5-820e-0ea9fb9bbcbc-r-ec37e84e   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   78s
  3. Forcefully reboot the node (e.g. reboot -f).
  4. Watch the replicas while the node reboots. Eventually a second replica is scheduled to the node. This is the root cause of the lockup in the original context, but since there is still enough space on the disk here, the volume is able to attach.
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-17b2c324-cf5e-47dd-9d3c-25e305e8b987-r-7abdc0bd   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                2m57s
pvc-17b2c324-cf5e-47dd-9d3c-25e305e8b987-r-dbb7f749   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   2m57s

Cause a lockup

This is more complicated than I originally supposed. We need to create a situation in which the node becomes overscheduled. This cannot be done with a single volume sized above 50% of a node's storageMaximum, or even close to (but below) 50% of it, because the replica scheduler can easily recognize that a second replica will not fit. Instead, we need multiple volumes that, in aggregate, cause the node to be overscheduled (as in the original context). The reproduction also appears to be a bit racy: I reproduced it one out of four attempts with four volumes and never with only two volumes. A sketch of suitable claims is shown below.
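For reference, a minimal sketch of the kind of claims step 2 below calls for, assuming four 20Gi block-mode PVCs on the default longhorn StorageClass. The size and name suffix are illustrative; the sizes only need to add up to more than 50% of the node's storageAvailable in aggregate. Pair each claim with a deployment like the one in the previous section.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-block-vol-1   # repeat with -2, -3, -4 (names are hypothetical)
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi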

  1. Install Longhorn in a single node cluster or shut down all but one node.
  2. Create multiple deployments that each request a block volume (see the sketch above) such that the aggregate size exceeds 50% of the node's storageAvailable. (I'm not sure the block volume and deployment are important here, but they mimic the original context.) The volumes are degraded, and two replicas of each aren't scheduled.
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-62256e66   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   5s
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-9c81d938   v1            stopped                                                                                                                                                                           5s
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-fc96c900   v1            stopped                                                                                                                                                                           5s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-2322e3dc   v1            stopped                                                                                                                                                                           5s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-95ef0584   v1            stopped                                                                                                                                                                           5s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-d3c22a86   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   5s
pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-255993bc   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   5s
pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-4cefc537   v1            stopped                                                                                                                                                                           5s
pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-f20830f6   v1            stopped                                                                                                                                                                           5s
pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-7930c837   v1            stopped                                                                                                                                                                           5s
pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-89972ea1   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   6s
pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-bdb94544   v1            stopped       
  3. Forcefully reboot the node (e.g. reboot -f).
  4. Watch the replicas while the node reboots. Eventually a second replica for many of the volumes is scheduled to the node. The associated disk becomes unschedulable, and the logs indicate auto-salvage is not progressing.
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER   IMAGE   AGE
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-62256e66   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-fc96c900   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-95ef0584   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-d3c22a86   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-255993bc   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-89972ea1   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m54s
eweber@laptop:~/longhorn> kl get lhn -oyaml
...
    diskStatus:
      default-disk-fd7aeb2fd64320f8:
        conditions:
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-23T19:33:20Z"
          message: Disk default-disk-fd7aeb2fd64320f8(/var/lib/longhorn/) on node
            eweber-v126-worker-9c1451b4-kgxdq is ready
          reason: ""
          status: "True"
          type: Ready
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-27T20:50:29Z"
          message: Disk default-disk-fd7aeb2fd64320f8 (/var/lib/longhorn/) on the
            node eweber-v126-worker-9c1451b4-kgxdq has 109261619200 available, but
            requires reserved 49891221504, minimal 25% to schedule more replicas
          reason: DiskPressure
          status: "False"
          type: Schedulable
        diskType: filesystem
        diskUUID: 5680b199-91bd-452e-bbb6-4eeee965bf2f
        filesystemType: ext2/ext3
        scheduledReplica:
          pvc-2d618ff3-5163-458b-a218-e865404dc335-r-62256e66: 21474836480
          pvc-2d618ff3-5163-458b-a218-e865404dc335-r-fc96c900: 21474836480
          pvc-a4644689-d986-4170-9560-0c748b699fc8-r-95ef0584: 21474836480
          pvc-a4644689-d986-4170-9560-0c748b699fc8-r-d3c22a86: 21474836480
          pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-255993bc: 21474836480
          pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-89972ea1: 21474836480
        storageAvailable: 109261619200
        storageMaximum: 166304071680
        storageScheduled: 128849018880
eweber@laptop:~/longhorn> kl logs --tail=-1 longhorn-manager-qxrpc
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, auto-salvaging volume" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1366" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-2d618ff3-5163-458b-a218-e865404dc335
time="2024-02-27T20:59:17Z" level=info msg="Bringing up replicas for auto-salvage" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1416" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-2d618ff3-5163-458b-a218-e865404dc335
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, set engine salvageRequested to true" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1361" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-db68a1e4-f124-4418-91db-a6c783fa99e8
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, auto-salvaging volume" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1366" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-db68a1e4-f124-4418-91db-a6c783fa99e8
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, set engine salvageRequested to true" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1361" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, auto-salvaging volume" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1366" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb
time="2024-02-27T20:59:17Z" level=info msg="Bringing up replicas for auto-salvage" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1416" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb
time="2024-02-27T20:59:17Z" level=info msg="Bringing up replicas for auto-salvage" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1416" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-db68a1e4-f124-4418-91db-a6c783fa99e8

Expected behavior

Because replicaSoftAntiAffinity == false in the cluster, Longhorn should not have scheduled an additional replica for each volume. If an extra replica was truly desired for some reason, the user should have had to:

  • Set replicaSoftAntiAffinity == true, or
  • Evict the failed replica.
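For reference, a minimal sketch of what the cluster-level opt-in would look like, assuming the stock setting name replica-soft-anti-affinity in the longhorn-system namespace (shown only to illustrate the expected opt-in path, not as a fix for this bug):

# Inspect the current value (expected to be "false" in this cluster)
kubectl -n longhorn-system get settings.longhorn.io replica-soft-anti-affinity -o jsonpath='{.value}'

# Explicit opt-in, only if scheduling multiple replicas per node were actually desired
kubectl -n longhorn-system patch settings.longhorn.io replica-soft-anti-affinity --type merge -p '{"value":"true"}'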

Targeting two potential fixes:

Support bundle for troubleshooting

https://github.com/harvester/harvester/files/14390529/supportbundle_e5003761-8a04-41c3-8cbf-88a8c0e19116_2024-02-23T21-55-41Z.zip

Environment

Additional context

See harvester/harvester#5109 (comment) for the original context.

ejweber commented Feb 26, 2024

  • Check if v1.5.x is vulnerable.

ejweber commented Feb 26, 2024

In the Harvester cluster, the workaround was to delete enough of the incorrectly scheduled replicas (HealthyAt = "") to get the disk schedulable again. The cluster completely self-healed after that.
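A minimal sketch of that cleanup, assuming jq is available and using the spec.healthyAt field from the Replica CRD (verify the candidates before deleting anything):

# List replicas that were never healthy (HealthyAt = "") -- candidates for deletion
kubectl -n longhorn-system get replicas.longhorn.io -o json \
  | jq -r '.items[] | select((.spec.healthyAt // "") == "") | .metadata.name'

# Delete just enough of them to bring the disk back under its scheduling limit, e.g.
# kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>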

bk201 commented Feb 27, 2024

@ejweber Does this mean the issue won't happen for a 3-node cluster? (Say, if a 3-node cluster has a power outage and all nodes power back on at the same time.)

ejweber commented Feb 27, 2024

@ejweber Does this mean the issue won't happen for a 3-node cluster? (Say, if a 3-node cluster has a power outage and all nodes power back on at the same time.)

@bk201, I think you are mostly correct. In a three node cluster, all replicas should already be scheduled to some node, so the power outage should not result in any unexpected scheduling. If the outage is long enough, we may clean up some of the existing replicas and create new ones. From a brief review of the code, I don't think that is a path to hitting this issue, but I'm not sure I can rule it out. If it is a path, I think it will be much rarer.

longhorn-io-github-bot commented Mar 4, 2024

Pre Ready-For-Testing Checklist

innobead added the priority/0, area/volume-replica-scheduling, area/resilience, and backport/1.5.5 labels on Mar 6, 2024
innobead commented Mar 6, 2024

  • Check if v1.5.x is vulnerable.

@ejweber Added backport/1.5.5 first. If the backport turns out not to be required after the check, just remove the label.
