
[BUG] A replica may be incorrectly scheduled to a node with an existing failed replica #8043

Open
ejweber opened this issue Feb 26, 2024 · 6 comments
Labels
area/resilience, area/volume-replica-scheduling, backport/1.5.5, backport/1.6.1, kind/bug, priority/0, require/backport, require/qa-review-coverage

Comments

ejweber commented Feb 26, 2024

Describe the bug

Harvester QA hit a complete lockup of Longhorn after a hard node reboot in a single node cluster. Before the reboot:

  • They had four VMs, each with a single Longhorn volume.
  • Each Longhorn volume had numReplicas == 3, so all volumes were degraded.
  • Each Longhorn volume had one scheduled and two unscheduled replicas (as expected).

After the reboot:

  • The Longhorn volumes were in need of auto-salvage (as expected).
  • Longhorn unexpectedly scheduled an extra replica for each volume to the node. By itself, this wouldn't have been particularly problematic.
  • However, the disk was not large enough to accommodate the extra scheduled replicas.
  • The disk transitioned to unschedulable, halting auto-salvage indefinitely.

To Reproduce

Observe the root cause

  1. Install Longhorn in a single node cluster or shut down all but one node.
  2. Create a deployment that requests a block volume with a size well below 50% of the node's storageMaximum. (I'm not sure the block volume and deployment are important here, but they mimic the original context.) The volume is degraded, and two of its replicas aren't scheduled.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-block-vol
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: block-volume-test
  labels:
    app: block-volume-test
  namespace: default
spec:
  selector:
    matchLabels:
      app: block-volume-test
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: block-volume-test
    spec:
      restartPolicy: Always
      containers:
      - image: nginx:stable-alpine
        name: block-volume-test
        volumeDevices:
        - devicePath: /dev/longhorn/testblk
          name: block-vol
      volumes:
      - name: block-vol
        persistentVolumeClaim:
          claimName: longhorn-block-vol
eweber@laptop:~/longhorn> kl get volume
NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE                                AGE
pvc-976c97f1-8515-44e5-820e-0ea9fb9bbcbc   v1            attached   degraded                 64424509440   eweber-v126-worker-9c1451b4-kgxdq   77s
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-976c97f1-8515-44e5-820e-0ea9fb9bbcbc-r-4f8ed9ed   v1            stopped                                                                                                                                                                           78s
pvc-976c97f1-8515-44e5-820e-0ea9fb9bbcbc-r-62ad5f16   v1            stopped                                                                                                                                                                           78s
pvc-976c97f1-8515-44e5-820e-0ea9fb9bbcbc-r-ec37e84e   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   78s
  3. Forcefully reboot the node (e.g. reboot -f).
  4. Watch the replicas while the node reboots. Eventually a second replica is scheduled to the node. This is the root cause of the lockup in the original context, but since there is still enough space on the disk here, the volume is able to attach.
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-17b2c324-cf5e-47dd-9d3c-25e305e8b987-r-7abdc0bd   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                2m57s
pvc-17b2c324-cf5e-47dd-9d3c-25e305e8b987-r-dbb7f749   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   2m57s

Cause a lockup

This is more complicated than I originally supposed. We need to create a situation in which the node becomes overscheduled. This cannot be done with a single volume sized above 50% of a node's storageMaximum, or even close to (but below) 50% of it, because the replica scheduler can easily recognize that a second replica will not fit. Instead, we need multiple volumes that, in aggregate, cause the node to be overscheduled (as in the original context). The reproduction also appears to be a bit racy: I reproduced it one out of four attempts with four volumes and never with only two volumes. A sketch of suitable claims is shown below.
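For reference, a minimal sketch of the kind of claims step 2 below calls for, assuming four 20Gi block-mode PVCs on the default longhorn StorageClass. The size and name suffix are illustrative; the sizes only need to add up to more than 50% of the node's storageAvailable in aggregate. Pair each claim with a deployment like the one in the previous section.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-block-vol-1   # repeat with -2, -3, -4 (names are hypothetical)
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi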

  1. Install Longhorn in a single node cluster or shut down all but one node.
  2. Create multiple deployments that each request a block volume (see the sketch above) such that the aggregate size exceeds 50% of the node's storageAvailable. (I'm not sure the block volume and deployment are important here, but they mimic the original context.) The volumes are degraded, and two replicas of each aren't scheduled.
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-62256e66   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   5s
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-9c81d938   v1            stopped                                                                                                                                                                           5s
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-fc96c900   v1            stopped                                                                                                                                                                           5s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-2322e3dc   v1            stopped                                                                                                                                                                           5s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-95ef0584   v1            stopped                                                                                                                                                                           5s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-d3c22a86   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   5s
pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-255993bc   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   5s
pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-4cefc537   v1            stopped                                                                                                                                                                           5s
pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-f20830f6   v1            stopped                                                                                                                                                                           5s
pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-7930c837   v1            stopped                                                                                                                                                                           5s
pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-89972ea1   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   6s
pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-bdb94544   v1            stopped       
  3. Forcefully reboot the node (e.g. reboot -f).
  4. Watch the replicas while the node reboots. Eventually a second replica for many of the volumes is scheduled to the node. The associated disk becomes unschedulable, and the logs indicate auto-salvage is not progressing.
eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER   IMAGE   AGE
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-62256e66   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-2d618ff3-5163-458b-a218-e865404dc335-r-fc96c900   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-95ef0584   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-a4644689-d986-4170-9560-0c748b699fc8-r-d3c22a86   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-255993bc   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m53s
pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-89972ea1   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                             5m54s
eweber@laptop:~/longhorn> kl get lhn -oyaml
...
    diskStatus:
      default-disk-fd7aeb2fd64320f8:
        conditions:
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-23T19:33:20Z"
          message: Disk default-disk-fd7aeb2fd64320f8(/var/lib/longhorn/) on node
            eweber-v126-worker-9c1451b4-kgxdq is ready
          reason: ""
          status: "True"
          type: Ready
        - lastProbeTime: ""
          lastTransitionTime: "2024-02-27T20:50:29Z"
          message: Disk default-disk-fd7aeb2fd64320f8 (/var/lib/longhorn/) on the
            node eweber-v126-worker-9c1451b4-kgxdq has 109261619200 available, but
            requires reserved 49891221504, minimal 25% to schedule more replicas
          reason: DiskPressure
          status: "False"
          type: Schedulable
        diskType: filesystem
        diskUUID: 5680b199-91bd-452e-bbb6-4eeee965bf2f
        filesystemType: ext2/ext3
        scheduledReplica:
          pvc-2d618ff3-5163-458b-a218-e865404dc335-r-62256e66: 21474836480
          pvc-2d618ff3-5163-458b-a218-e865404dc335-r-fc96c900: 21474836480
          pvc-a4644689-d986-4170-9560-0c748b699fc8-r-95ef0584: 21474836480
          pvc-a4644689-d986-4170-9560-0c748b699fc8-r-d3c22a86: 21474836480
          pvc-db68a1e4-f124-4418-91db-a6c783fa99e8-r-255993bc: 21474836480
          pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb-r-89972ea1: 21474836480
        storageAvailable: 109261619200
        storageMaximum: 166304071680
        storageScheduled: 128849018880
eweber@laptop:~/longhorn> kl logs --tail=-1 longhorn-manager-qxrpc
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, auto-salvaging volume" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1366" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-2d618ff3-5163-458b-a218-e865404dc335
time="2024-02-27T20:59:17Z" level=info msg="Bringing up replicas for auto-salvage" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1416" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-2d618ff3-5163-458b-a218-e865404dc335
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, set engine salvageRequested to true" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1361" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-db68a1e4-f124-4418-91db-a6c783fa99e8
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, auto-salvaging volume" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1366" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-db68a1e4-f124-4418-91db-a6c783fa99e8
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, set engine salvageRequested to true" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1361" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb
time="2024-02-27T20:59:17Z" level=info msg="All replicas are failed, auto-salvaging volume" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1366" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb
time="2024-02-27T20:59:17Z" level=info msg="Bringing up replicas for auto-salvage" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1416" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-fa5e794f-be5a-4c2f-984f-7faf63ac81eb
time="2024-02-27T20:59:17Z" level=info msg="Bringing up replicas for auto-salvage" func="controller.(*VolumeController).ReconcileVolumeState" file="volume_controller.go:1416" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=eweber-v126-worker-9c1451b4-kgxdq owner=eweber-v126-worker-9c1451b4-kgxdq state=detached volume=pvc-db68a1e4-f124-4418-91db-a6c783fa99e8

Expected behavior

Because replicaSoftAntiAffinity == false in the cluster, Longhorn should not have scheduled an additional replica for each volume. If an extra replica was truly desired for some reason, the user should have had to:

  • Set replicaSoftAntiAffinity == true, or
  • Evict the failed replica.
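For reference, a minimal sketch of what the cluster-level opt-in would look like, assuming the stock setting name replica-soft-anti-affinity in the longhorn-system namespace (shown only to illustrate the expected opt-in path, not as a fix for this bug):

# Inspect the current value (expected to be "false" in this cluster)
kubectl -n longhorn-system get settings.longhorn.io replica-soft-anti-affinity -o jsonpath='{.value}'

# Explicit opt-in, only if scheduling multiple replicas per node were actually desired
kubectl -n longhorn-system patch settings.longhorn.io replica-soft-anti-affinity --type merge -p '{"value":"true"}'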

Targeting two potential fixes:

Support bundle for troubleshooting

https://github.com/harvester/harvester/files/14390529/supportbundle_e5003761-8a04-41c3-8cbf-88a8c0e19116_2024-02-23T21-55-41Z.zip

Environment

Additional context

See harvester/harvester#5109 (comment) for the original context.

ejweber commented Feb 26, 2024

  • Check if v1.5.x is vulnerable.

ejweber commented Feb 26, 2024

In the Harvester cluster, the workaround was to delete enough of the incorrectly scheduled replicas (HealthyAt = "") to get the disk schedulable again. The cluster completely self-healed after that.
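A minimal sketch of that cleanup, assuming jq is available and using the spec.healthyAt field from the Replica CRD (verify the candidates before deleting anything):

# List replicas that were never healthy (HealthyAt = "") -- candidates for deletion
kubectl -n longhorn-system get replicas.longhorn.io -o json \
  | jq -r '.items[] | select((.spec.healthyAt // "") == "") | .metadata.name'

# Delete just enough of them to bring the disk back under its scheduling limit, e.g.
# kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>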

bk201 commented Feb 27, 2024

@ejweber Does this mean the issue won't happen for a 3-node cluster? (Say, if a 3-node cluster has a power outage and all nodes power back on at the same time.)

ejweber commented Feb 27, 2024

@ejweber Does this mean the issue won't happen for a 3-node cluster? (Say, if a 3-node cluster has a power outage and all nodes power back on at the same time.)

@bk201, I think you are mostly correct. In a three node cluster, all replicas should already be scheduled to some node, so the power outage should not result in any unexpected scheduling. If the outage is long enough, we may clean up some of the existing replicas and create new ones. From a brief review of the code, I don't think that is a path to hitting this issue, but I'm not sure I can rule it out. If it is a path, I think it will be much rarer.

longhorn-io-github-bot commented Mar 4, 2024

Pre Ready-For-Testing Checklist

innobead added the priority/0, area/volume-replica-scheduling, area/resilience, and backport/1.5.5 labels on Mar 6, 2024
innobead commented Mar 6, 2024

  • Check if v1.5.x is vulnerable.

@ejweber Added backport/1.5.5 first. If the backport turns out not to be required after the check, just remove the label.
