[BUG] Engine continues to attempt to rebuild replica while detaching #6217

Closed
ejweber opened this issue Jun 28, 2023 · 5 comments
Labels
backport/1.3.4 · backport/1.4.3 · backport/1.5.1 · component/longhorn-manager (Longhorn manager, control plane) · kind/bug · priority/0 (Must be fixed in this release, managed by PO) · reproduce/rare (< 50% reproducible) · severity/3 (Function working but has a major issue w/ workaround)

Comments

@ejweber
Contributor

ejweber commented Jun 28, 2023

Describe the bug (🐛 if you encounter this issue)

This is another cause of inappropriate replica expansion uncovered while implementing #5845. Maybe this is the mode of failure in #6078? I'll have to figure out a way to confirm.

When a volume is detached, the following things happen in the following order:

  • Rebuilding replicas are killed.
  • The engine is killed.
  • Healthy replicas are killed.

There is a minimum of a 10 second window between the time longhorn-manager starts a snapshot purge for a rebuild and the time the rebuild actually starts. During that time, the engine is not reconciled by the engine controller (it is already in the middle of a reconciliation). If the timing is right, it is possible for the rebuilding replica to be killed and a new replica (for a different volume) to take its place. The engine controller continues with the rebuild and attempts to use the engine to rebuild the wrong replica. This can lead to inappropriate expansion.
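
To make the window concrete, here is a rough, self-contained Go sketch of the rebuild flow described above (this is not the actual longhorn-manager code; startSnapshotPurge, purgeComplete, and addReplica are illustrative stand-ins), with the race window marked:

package main

import (
	"fmt"
	"time"
)

// Illustrative stubs; in longhorn-manager the real calls go through the
// engine/proxy clients.
func startSnapshotPurge(engine string) error {
	fmt.Println("snapshot purge started on", engine)
	return nil
}

func purgeComplete(engine string) bool {
	return true // stub: pretend the purge finished
}

func addReplica(engine, replicaAddr string) error {
	fmt.Println("adding replica", replicaAddr, "to", engine)
	return nil
}

// rebuildReplica sketches the rebuild flow, with the race window marked.
func rebuildReplica(engine, replicaAddr string) error {
	// 1. Start a snapshot purge so the rebuild begins from a clean snapshot chain.
	if err := startSnapshotPurge(engine); err != nil {
		return err
	}

	// 2. Poll until the purge finishes. This takes at least ~10 seconds, and the
	//    engine is not reconciled again by the engine controller while waiting here.
	for !purgeComplete(engine) {
		time.Sleep(2 * time.Second)
	}

	// RACE WINDOW: if the volume is detached at this point, the rebuilding
	// replica is killed, and a replica belonging to a different volume can be
	// started on the same instance-manager port, i.e. reachable at replicaAddr.

	// 3. Add the replica. Without a further check, this call can now target the
	//    wrong replica and lead to inappropriate expansion.
	return addReplica(engine, replicaAddr)
}

func main() {
	_ = rebuildReplica("pvc-example-e-0", "10.42.0.10:10000")
}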

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a Longhorn cluster off of master (v1.5.0).
  2. Create two stateful sets, each with 25 pods. Use different sizes for the two stateful sets.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 25 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx2
  labels:
    app: nginx2
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx2
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web2
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx2 # has to match .spec.template.metadata.labels
  serviceName: "nginx2"
  replicas: 25 # by default is 1
  template:
    metadata:
      labels:
        app: nginx2 # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www2
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www2
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1024Mi
  3. Use a script to periodically kill an instance-manager pod.
#!/bin/bash

current_time=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

while true; do
    # Execute kubectl -n longhorn-system delete command
    kubectl -n longhorn-system delete --grace-period=1 pod instance-manager-c090b6286e8431380a8cbe71c3fb43ec

    # Wait for 300 seconds
    sleep 300

    # Execute kubectl -n longhorn-system logs and grep commands
    kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail -1 --since-time $current_time | grep -i -e "invalid grpc metadata"
    log_result_1=$?

    kubectl -n longhorn-system logs -l app=longhorn-manager --tail -1 --since-time $current_time | grep -i -e "invalid grpc metadata" -e "incorrect volume name" -e "incorrect instance name"
    log_result_2=$?

    # Check if there was any output from the grep commands
    if [[ $log_result_1 -eq 0 || $log_result_2 -eq 0 ]]; then
        echo "Execution stopped. Trigger keywords found in logs."
        break
    else
        echo "Execution completed successfully."
    fi
done
  4. Wait for the script to fail. In my cluster, this usually takes between 3 and 10 iterations.
  5. Observe the cause of the failure. NOTE: these logs were generated with code for [IMPROVEMENT] Longhorn-engine processes should refuse to serve requests not intended for them #5845, which isn't yet merged. Without that code, we inappropriately expand the volume instead of safely failing out.
2023-06-27T19:48:22.606088859Z [pvc-059a429b-28f7-4b21-a7cc-45c8632be109-r-cd8811d9] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2 method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-059a429b-28f7-4b21-a7cc-45c8632be109
2023-06-27T19:48:22.607799147Z [pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938-r-03074ae7] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938

2023-06-27T19:48:22.607532292Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607555900Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607560110Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff\", UID:\"fb4612d5-6ca8-4417-a829-e65fc1eab29a\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"52778902\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.150.121:10147: proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address"
2023-06-27T19:48:22.608620253Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10062: failed to add replica tcp://10.42.150.121:10137 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10137: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608673822Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608734835Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", N

Expected behavior

The engine controller should not attempt to continue with the rebuild using the wrong replica.

Log or Support bundle

A summary of logs from various components when the issue occurs:

[screenshot of log summary omitted]

Environment

  • Longhorn version: master (v1.5.0), probably others
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version:
    • CPU per node: 4
    • Memory per node: 8
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): DigitalOcean
  • Number of Longhorn volumes in the cluster: 50

Additional context

The recreate works because killing the instance-manager pod causes the following chain of events:

  1. All engines and replicas owned by that instance-manager restart.
  2. Longhorn-manager restarts all workload pods that use each engine.
  3. When a workload pod is killed, its corresponding volume is detached.
  4. When a volume is detached, its rebuilding replica is killed immediately.
  5. If the timing is bad, a new replica starts using the previous replica's port while the engine is in the 10 second snapshot purge window. The window closes and the engine controller communicates with the new replica.

Ideas

  • Augment this check to also ensure that engine.spec.desireState != stopped before continuing to rebuild (see the sketch after this list).
  • Don't kill rebuilding replicas before killing the engine? (This may not be possible.)
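
A minimal, self-contained Go sketch of the first idea above (the Engine/EngineSpec stand-in types, the InstanceStateStopped constant, and shouldContinueRebuild are illustrative; the real check would live in the engine controller's rebuild path and use the actual CRD types):

package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the longhorn-manager types involved.
type EngineSpec struct {
	DesireState       string
	ReplicaAddressMap map[string]string
}

type Engine struct {
	Name string
	Spec EngineSpec
}

const InstanceStateStopped = "stopped"

// shouldContinueRebuild sketches the proposed guard: in addition to checking
// that the replica is still listed in the engine spec, refuse to proceed once
// the engine is being stopped (volume detaching), so a stale rebuild cannot be
// issued against a replica address that may have been reused by another volume.
func shouldContinueRebuild(e *Engine, replicaName, replicaAddr string) error {
	addr, ok := e.Spec.ReplicaAddressMap[replicaName]
	if !ok || addr != replicaAddr {
		return errors.New("replica is no longer listed in the engine spec")
	}
	if e.Spec.DesireState == InstanceStateStopped {
		return fmt.Errorf("engine %s is being stopped; aborting rebuild of %s", e.Name, replicaAddr)
	}
	return nil
}

func main() {
	e := &Engine{
		Name: "pvc-example-e-0",
		Spec: EngineSpec{
			DesireState:       InstanceStateStopped,
			ReplicaAddressMap: map[string]string{"pvc-example-r-1": "10.42.0.10:10000"},
		},
	}
	// With desireState == stopped, the guard refuses to continue the rebuild.
	fmt.Println(shouldContinueRebuild(e, "pvc-example-r-1", "10.42.0.10:10000"))
}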
@PhanLe1010
Contributor

🚀

@innobead added the priority/0 (Must be fixed in this release, managed by PO) and reproduce/rare (< 50% reproducible) labels on Jun 30, 2023
@innobead
Member

cc @longhorn/qa

@innobead
Member

I would prefer to include the fix in 1.5.0 if possible.

@innobead added the severity/3 (Function working but has a major issue w/ workaround) label on Jun 30, 2023
@longhorn-io-github-bot

longhorn-io-github-bot commented Jun 30, 2023

Pre Ready-For-Testing Checklist

Under the conditions described, the provided script fails in <10 iterations without the fix and should proceed indefinitely with the fix.

IMPORTANT: The script provided above only works with a Longhorn build that includes the work from #5845. The script below is similar, except that it monitors for signs of inappropriate expansion instead of for logs saying the expansion was prevented.

#!/bin/bash

current_time=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
counter=0

while true; do
    counter=$((counter+1))
    echo "Iteration: $counter"

    # Execute kubectl -n longhorn-system delete command
    kubectl -n longhorn-system delete --grace-period=1 pod instance-manager-4639bda14281d41f3af00d64bc364bb9

    # Wait for 300 seconds
    sleep 300

    # Execute kubectl -n longhorn-system logs and grep commands
    kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail -1 --since-time $current_time | grep -i -e "expand"
    log_result_1=$?

    # Check if there was any output from the grep commands
    if [[ $log_result_1 -eq 0 ]]; then
        echo "Execution stopped. Trigger keywords found in logs."
        break
    else
        echo "Execution completed successfully."
    fi
done

Failed output for this new script looks like:

eweber@laptop:~/longhorn> examples/test_no_5845.sh 
Iteration: 1
pod "instance-manager-4639bda14281d41f3af00d64bc364bb9" deleted
Execution completed successfully.
Iteration: 2
pod "instance-manager-4639bda14281d41f3af00d64bc364bb9" deleted
Execution completed successfully.
Iteration: 3
pod "instance-manager-4639bda14281d41f3af00d64bc364bb9" deleted
Execution completed successfully.
Iteration: 4
pod "instance-manager-4639bda14281d41f3af00d64bc364bb9" deleted
[longhorn-instance-manager] time="2023-06-30T19:32:34Z" level=info msg="Prepare to expand new replica to size 1073741824"
[pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-r-ec448065] time="2023-06-30T19:32:34Z" level=info msg="Replica server starts to expand to size 1073741824"
[pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-r-ec448065] time="2023-06-30T19:32:34Z" level=info msg="Starting to create disk" disk=expand-1073741824
time="2023-06-30T19:32:34Z" level=info msg="Cleaning up new disk metadata file path /host/var/lib/longhorn/replicas/pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-52539dd3/volume-snap-expand-1073741824.img.meta before linking"
time="2023-06-30T19:32:34Z" level=info msg="Cleaning up new disk checksum file /host/var/lib/longhorn/replicas/pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-52539dd3/volume-snap-expand-1073741824.img.checksum before linking"
time="2023-06-30T19:32:34Z" level=info msg="Cleaning up new disk file /host/var/lib/longhorn/replicas/pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-52539dd3/volume-snap-expand-1073741824.img before linking"
time="2023-06-30T19:32:34Z" level=info msg="Finished creating disk" disk=expand-1073741824
time="2023-06-30T19:32:34Z" level=info msg="Removing disk volume-snap-expand-1073741824.img, force true"
[pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-r-ec448065] time="2023-06-30T19:32:34Z" level=info msg="Removing disk volume-snap-expand-1073741824.img"
[pvc-6135b88d-bcb6-4d25-ab00-1c0c3b6dbd0b-e-1c893ce2] time="2023-06-30T19:32:34Z" level=warning msg="Removed extra disks map[volume-snap-b6cb56f9-26cb-41ba-9a9d-bee9d326a700.img:{volume-snap-b6cb56f9-26cb-41ba-9a9d-bee9d326a700.img  map[volume-snap-expand-1073741824.img:true] true false 2023-06-30T19:28:34Z 25559040 map[]} volume-snap-expand-1073741824.img:{volume-snap-expand-1073741824.img volume-snap-b6cb56f9-26cb-41ba-9a9d-bee9d326a700.img map[volume-snap-60105d14-11ad-478a-9cf4-29cd3b62075b.img:true] false false 2023-06-30T19:32:34Z 0 map[replica-expansion:1073741824]}] in replica tcp://10.42.174.165:10129"
Execution stopped. Trigger keywords found in logs.
  • Is there a workaround for the issue? If so, where is it documented?
    Not yet, but @PhanLe1010 suggested disabling the auto-cleanup-system-generated-snapshot setting, based on the fact that many of the issues we've uncovered center around the code used to implement it. I am running tests now, and it looks promising.

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?

  • Which areas/issues this PR might have potential impacts on?

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    Not labeled, but maybe we can follow up in [TEST][BUG] Rebuild rebuilding is possibly issued to a wrong replica #5850 with some kind of automated stress test like the one implemented in my recreate script? WDYT @longhorn/qa?

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    Not labeled, but maybe we can follow up in [TEST][BUG] Rebuild rebuilding is possibly issued to a wrong replica #5850 with some kind of manual stress test like the one implemented in my recreate script? WDYT @longhorn/qa?

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?

@yangchiu
Member

yangchiu commented Jul 3, 2023

Verified passed on master-head (longhorn-manager 57b4596) and v1.5.x-head (longhorn-manager a41e906) following the test steps.

After running the scripts for more than 20 iterations, the unexpected expansion did not occur.
