[BUG] Engine continues to attempt to rebuild replica while detaching #6217

Closed
ejweber opened this issue Jun 28, 2023 · 5 comments
Labels
backport/1.3.4 · backport/1.4.3 · backport/1.5.1 · component/longhorn-manager (Longhorn manager, control plane) · kind/bug · priority/0 (Must be fixed in this release, managed by PO) · reproduce/rare (< 50% reproducible) · severity/3 (Function working but has a major issue w/ workaround)

Comments

@ejweber
Contributor

ejweber commented Jun 28, 2023

Describe the bug (🐛 if you encounter this issue)

This is another cause of inappropriate replica expansion uncovered while implementing #5845. Maybe this is the mode of failure in #6078? I'll have to figure out a way to confirm.

When a volume is detached, the following things happen in the following order:

  • Rebuilding replicas are killed.
  • The engine is killed.
  • Healthy replicas are killed.

There is a minimum of a 10 second window between the time longhorn-manager starts a snapshot purge for a rebuild and the time the rebuild actually starts. During that time, the engine is not reconciled by the engine controller (it is already in the middle of a reconciliation). If the timing is right, it is possible for the rebuilding replica to be killed and a new replica (for a different volume) to take its place. The engine controller continues with the rebuild and attempts to use the engine to rebuild the wrong replica. This can lead to inappropriate expansion.
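
To make the window concrete, here is a rough, self-contained Go sketch of the rebuild flow described above (this is not the actual longhorn-manager code; startSnapshotPurge, purgeComplete, and addReplica are illustrative stand-ins), with the race window marked:

package main

import (
	"fmt"
	"time"
)

// Illustrative stubs; in longhorn-manager the real calls go through the
// engine/proxy clients.
func startSnapshotPurge(engine string) error {
	fmt.Println("snapshot purge started on", engine)
	return nil
}

func purgeComplete(engine string) bool {
	return true // stub: pretend the purge finished
}

func addReplica(engine, replicaAddr string) error {
	fmt.Println("adding replica", replicaAddr, "to", engine)
	return nil
}

// rebuildReplica sketches the rebuild flow, with the race window marked.
func rebuildReplica(engine, replicaAddr string) error {
	// 1. Start a snapshot purge so the rebuild begins from a clean snapshot chain.
	if err := startSnapshotPurge(engine); err != nil {
		return err
	}

	// 2. Poll until the purge finishes. This takes at least ~10 seconds, and the
	//    engine is not reconciled again by the engine controller while waiting here.
	for !purgeComplete(engine) {
		time.Sleep(2 * time.Second)
	}

	// RACE WINDOW: if the volume is detached at this point, the rebuilding
	// replica is killed, and a replica belonging to a different volume can be
	// started on the same instance-manager port, i.e. reachable at replicaAddr.

	// 3. Add the replica. Without a further check, this call can now target the
	//    wrong replica and lead to inappropriate expansion.
	return addReplica(engine, replicaAddr)
}

func main() {
	_ = rebuildReplica("pvc-example-e-0", "10.42.0.10:10000")
}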

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a Longhorn cluster off of master (v1.5.0).
  2. Create two stateful sets, each with 25 pods. Use different sizes for the two stateful sets.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 25 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx2
  labels:
    app: nginx2
spec:
  ports:
  - port: 80
    name: web
  selector:
    app: nginx2
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web2
  namespace: default
spec:
  selector:
    matchLabels:
      app: nginx2 # has to match .spec.template.metadata.labels
  serviceName: "nginx2"
  replicas: 25 # by default is 1
  template:
    metadata:
      labels:
        app: nginx2 # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www2
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www2
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1024Mi
  3. Use a script to periodically kill an instance-manager pod.
#!/bin/bash

current_time=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

while true; do
    # Execute kubectl -n longhorn-system delete command
    kubectl -n longhorn-system delete --grace-period=1 pod instance-manager-c090b6286e8431380a8cbe71c3fb43ec

    # Wait for 300 seconds
    sleep 300

    # Execute kubectl -n longhorn-system logs and grep commands
    kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail -1 --since-time $current_time | grep -i -e "invalid grpc metadata"
    log_result_1=$?

    kubectl -n longhorn-system logs -l app=longhorn-manager --tail -1 --since-time $current_time | grep -i -e "invalid grpc metadata" -e "incorrect volume name" -e "incorrect instance name"
    log_result_2=$?

    # Check if there was any output from the grep commands
    if [[ $log_result_1 -eq 0 || $log_result_2 -eq 0 ]]; then
        echo "Execution stopped. Trigger keywords found in logs."
        break
    else
        echo "Execution completed successfully."
    fi
done
  4. Wait for the script to fail. In my cluster, this usually takes between 3 and 10 iterations.
  5. Observe the cause of the failure. NOTE: these logs were generated with code for [IMPROVEMENT] Longhorn-engine processes should refuse to serve requests not intended for them #5845, which isn't yet merged. Without that code, we inappropriately expand the volume instead of safely failing out.
2023-06-27T19:48:22.606088859Z [pvc-059a429b-28f7-4b21-a7cc-45c8632be109-r-cd8811d9] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2 method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-059a429b-28f7-4b21-a7cc-45c8632be109
2023-06-27T19:48:22.607799147Z [pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938-r-03074ae7] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938

2023-06-27T19:48:22.607532292Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607555900Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607560110Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff\", UID:\"fb4612d5-6ca8-4417-a829-e65fc1eab29a\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"52778902\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.150.121:10147: proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address"
2023-06-27T19:48:22.608620253Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10062: failed to add replica tcp://10.42.150.121:10137 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10137: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608673822Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608734835Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", N

Expected behavior

The engine controller should not attempt to continue with the rebuild using the wrong replica.

Log or Support bundle

A summary of logs from various components when the issue occurs:

[screenshot of log summary omitted]

Environment

  • Longhorn version: master (v1.5.0), probably others
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version:
    • CPU per node: 4
    • Memory per node: 8
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): DigitalOcean
  • Number of Longhorn volumes in the cluster: 50

Additional context

The recreate works because killing the instance-manager pod causes the following chain of events:

  1. All engines and replicas owned by that instance-manager restart.
  2. Longhorn-manager restarts all workload pods that use each engine.
  3. When a workload pod is killed, its corresponding volume is detached.
  4. When a volume is detached, its rebuilding replica is killed immediately.
  5. If the timing is bad, a new replica starts using the previous replica's port while the engine is in the 10 second snapshot purge window. The window closes and the engine controller communicates with the new replica.

Ideas

  • Augment this check to also ensure that engine.spec.desireState != stopped before continuing to rebuild (see the sketch after this list).
  • Don't kill rebuilding replicas before killing the engine? (This may not be possible.)
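
A minimal, self-contained Go sketch of the first idea above (the Engine/EngineSpec stand-in types, the InstanceStateStopped constant, and shouldContinueRebuild are illustrative; the real check would live in the engine controller's rebuild path and use the actual CRD types):

package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the longhorn-manager types involved.
type EngineSpec struct {
	DesireState       string
	ReplicaAddressMap map[string]string
}

type Engine struct {
	Name string
	Spec EngineSpec
}

const InstanceStateStopped = "stopped"

// shouldContinueRebuild sketches the proposed guard: in addition to checking
// that the replica is still listed in the engine spec, refuse to proceed once
// the engine is being stopped (volume detaching), so a stale rebuild cannot be
// issued against a replica address that may have been reused by another volume.
func shouldContinueRebuild(e *Engine, replicaName, replicaAddr string) error {
	addr, ok := e.Spec.ReplicaAddressMap[replicaName]
	if !ok || addr != replicaAddr {
		return errors.New("replica is no longer listed in the engine spec")
	}
	if e.Spec.DesireState == InstanceStateStopped {
		return fmt.Errorf("engine %s is being stopped; aborting rebuild of %s", e.Name, replicaAddr)
	}
	return nil
}

func main() {
	e := &Engine{
		Name: "pvc-example-e-0",
		Spec: EngineSpec{
			DesireState:       InstanceStateStopped,
			ReplicaAddressMap: map[string]string{"pvc-example-r-1": "10.42.0.10:10000"},
		},
	}
	// With desireState == stopped, the guard refuses to continue the rebuild.
	fmt.Println(shouldContinueRebuild(e, "pvc-example-r-1", "10.42.0.10:10000"))
}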
@PhanLe1010
Contributor

🚀

@innobead added the priority/0 (Must be fixed in this release, managed by PO) and reproduce/rare (< 50% reproducible) labels on Jun 30, 2023
@innobead
Member

cc @longhorn/qa

@innobead
Member

I would prefer to include the fix in 1.5.0 if possible.

@innobead added the severity/3 (Function working but has a major issue w/ workaround) label on Jun 30, 2023
@longhorn-io-github-bot

longhorn-io-github-bot commented Jun 30, 2023

Pre Ready-For-Testing Checklist

Under the conditions described, the provided script fails in <10 iterations without the fix and should proceed indefinitely with the fix.

IMPORTANT: The script provided above only works with a Longhorn build that includes the work from #5845. The script below is similar, except that it monitors for signs of inappropriate expansion instead of for logs saying the expansion was prevented.

#!/bin/bash

current_time=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
counter=0

while true; do
    counter=$((counter+1))
    echo "Iteration: $counter"

    # Execute kubectl -n longhorn-system delete command
    kubectl -n longhorn-system delete --grace-period=1 pod instance-manager-4639bda14281d41f3af00d64bc364bb9

    # Wait for 300 seconds
    sleep 300

    # Execute kubectl -n longhorn-system logs and grep commands
    kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail -1 --since-time $current_time | grep -i -e "expand"
    log_result_1=$?

    # Check if there was any output from the grep commands
    if [[ $log_result_1 -eq 0 ]]; then
        echo "Execution stopped. Trigger keywords found in logs."
        break
    else
        echo "Execution completed successfully."
    fi
done

Failed output for this new script looks like:

eweber@laptop:~/longhorn> examples/test_no_5845.sh 
Iteration: 1
pod "instance-manager-4639bda14281d41f3af00d64bc364bb9" deleted
Execution completed successfully.
Iteration: 2
pod "instance-manager-4639bda14281d41f3af00d64bc364bb9" deleted
Execution completed successfully.
Iteration: 3
pod "instance-manager-4639bda14281d41f3af00d64bc364bb9" deleted
Execution completed successfully.
Iteration: 4
pod "instance-manager-4639bda14281d41f3af00d64bc364bb9" deleted
[longhorn-instance-manager] time="2023-06-30T19:32:34Z" level=info msg="Prepare to expand new replica to size 1073741824"
[pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-r-ec448065] time="2023-06-30T19:32:34Z" level=info msg="Replica server starts to expand to size 1073741824"
[pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-r-ec448065] time="2023-06-30T19:32:34Z" level=info msg="Starting to create disk" disk=expand-1073741824
time="2023-06-30T19:32:34Z" level=info msg="Cleaning up new disk metadata file path /host/var/lib/longhorn/replicas/pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-52539dd3/volume-snap-expand-1073741824.img.meta before linking"
time="2023-06-30T19:32:34Z" level=info msg="Cleaning up new disk checksum file /host/var/lib/longhorn/replicas/pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-52539dd3/volume-snap-expand-1073741824.img.checksum before linking"
time="2023-06-30T19:32:34Z" level=info msg="Cleaning up new disk file /host/var/lib/longhorn/replicas/pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-52539dd3/volume-snap-expand-1073741824.img before linking"
time="2023-06-30T19:32:34Z" level=info msg="Finished creating disk" disk=expand-1073741824
time="2023-06-30T19:32:34Z" level=info msg="Removing disk volume-snap-expand-1073741824.img, force true"
[pvc-3014d46d-fdf7-4898-b329-0e238842c1eb-r-ec448065] time="2023-06-30T19:32:34Z" level=info msg="Removing disk volume-snap-expand-1073741824.img"
[pvc-6135b88d-bcb6-4d25-ab00-1c0c3b6dbd0b-e-1c893ce2] time="2023-06-30T19:32:34Z" level=warning msg="Removed extra disks map[volume-snap-b6cb56f9-26cb-41ba-9a9d-bee9d326a700.img:{volume-snap-b6cb56f9-26cb-41ba-9a9d-bee9d326a700.img  map[volume-snap-expand-1073741824.img:true] true false 2023-06-30T19:28:34Z 25559040 map[]} volume-snap-expand-1073741824.img:{volume-snap-expand-1073741824.img volume-snap-b6cb56f9-26cb-41ba-9a9d-bee9d326a700.img map[volume-snap-60105d14-11ad-478a-9cf4-29cd3b62075b.img:true] false false 2023-06-30T19:32:34Z 0 map[replica-expansion:1073741824]}] in replica tcp://10.42.174.165:10129"
Execution stopped. Trigger keywords found in logs.
  • Is there a workaround for the issue? If so, where is it documented?
    Not yet, but @PhanLe1010 suggested disabling the auto-cleanup-system-generated-snapshot setting, based on the fact that many of the issues we've uncovered center around the code used to implement it. I am running tests now, and it looks promising.

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?

  • Which areas/issues this PR might have potential impacts on?

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    Not labeled, but maybe we can follow up in [TEST][BUG] Rebuild rebuilding is possibly issued to a wrong replica #5850 with some kind of automated stress test like the one implemented in my recreate script? WDYT @longhorn/qa?

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    Not labeled, but maybe we can follow up in [TEST][BUG] Rebuild rebuilding is possibly issued to a wrong replica #5850 with some kind of manual stress test like the one implemented in my recreate script? WDYT @longhorn/qa?

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?

@yangchiu
Member

yangchiu commented Jul 3, 2023

Verified passed on master-head (longhorn-manager 57b4596) and v1.5.x-head (longhorn-manager a41e906) following the test steps.

After running the scripts for more than 20 iterations, the unexpected expansion did not occur.
