
[BUG] Longhorn fails to recover from a node restart #8403

Open · CC007 opened this issue Apr 22, 2024 · 18 comments
Labels
kind/bug · require/backport · require/qa-review-coverage

Comments

CC007 commented Apr 22, 2024

Describe the bug

After #8091 was fixed, I updated to 1.6.1, but ran into another error.

To Reproduce

What I did:

  • I uninstalled Longhorn, along with ALL of its resources (making sure not to leave anything behind)
  • I installed Longhorn v1.6.1 (through Rancher's chart: 103.3.0+up1.6.1)
  • I set up the storage nodes (like before)
  • I reinstalled Nexus
  • I checked if the replicas were all healthy (they were)
  • I restarted one of my cluster nodes (making sure that it hosted one of the replicas)

Expected behavior

I noticed the node becoming unhealthy (showing red or gray). This was expected; it makes sense that Longhorn would first wait to see whether the node comes back.

I expected that the node's replica would either rejoin the volume (after updating to the latest state) or, if the replica was too far out of sync, that a new replica would be created from one of the 2 remaining ones.

Actual behavior

After a while, when the node comes back, the replica on that node disappears and the other two replicas become faulted, and the volume keeps switching between attaching and detaching because of the errors below.

When the node comes back I see these messages:

Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.195:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

I then get this error (probably because the node wasn't fully ready yet):

Failed rebuilding replica with Address 10.42.2.195:10000: proxyServer=10.42.0.156:8501 destination=10.42.0.156:10000: failed to add replica tcp://10.42.2.195:10000 for volume: rpc error: code = Unavailable desc = failed to get replica 10.42.2.195:10000: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.195:10000: i/o timeout"

Followed by a couple of:

Node pirack2-node4 is ready

And then

Engine image ei-ad735e4a (rancher/mirrored-longhornio-longhorn-engine:v1.6.1) is fully deployed on all ready nodes

Failed rebuilding replica with Address 10.42.2.213:10000: proxyServer=10.42.0.156:8501 destination=10.42.0.156:10000: failed to add replica tcp://10.42.2.213:10000 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ActualSize:512} {FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ActualSize:0}] from tcp://10.42.7.226:10000: rpc error: code = Unknown desc = replica tcp://10.42.7.226:10000 failed to send file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img to 10.42.2.213:10004: failed to send file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img to 10.42.2.213:10004: rpc error: code = Unknown desc = open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.213:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

snapshot becomes ready to use

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

snapshot becomes not ready to use

snapshot is marked as removed

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the restore: Failed to get restoring status on tcp://10.42.7.226:10000: failed to get restore status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.7.226:10002: connect: connection refused"

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 (10.42.1.194:10000) in error

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (10.42.7.226:10000) in error

lost track of the corresponding snapshot info inside volume engine

Volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 requested remount at 2024-04-21T01:01:15Z after automatically salvaging replicas

Replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 of volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 will be automatically salvaged

Replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 of volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 will be automatically salvaged

snapshot is marked as removed

Stops pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0

Start rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab with Address 10.42.2.213:10000 for normal engine pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 and volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

Detected rebuilding replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-51bf25ab (10.42.2.213:10000)

snapshot becomes ready to use

Failed rebuilding replica with Address 10.42.2.213:10000: proxyServer=10.42.3.188:8501 destination=10.42.3.188:10000: failed to add replica tcp://10.42.2.213:10000 for volume: rpc error: code = Unknown desc = failed to sync files [{FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img ActualSize:2343993344} {FromFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ToFileName:volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta ActualSize:0} {FromFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img ToFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img ActualSize:162422784} {FromFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta ToFileName:volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta ActualSize:0}] from tcp://10.42.1.194:10000: rpc error: code = Unknown desc = replica tcp://10.42.1.194:10000 failed to send file volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to 10.42.2.213:10006: failed to send file volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to 10.42.2.213:10006: rpc error: code = Unknown desc = open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 failed the snapshot purge: failed to open childFile: open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open childFile: open volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img: invalid argument

snapshot becomes not ready to use

snapshot is marked as removed

volume pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 became degraded

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (10.42.7.226:10000) in error

Detected replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-31aac936 (10.42.1.194:10000) in error

...

Etc.

The error that stands out is the "invalid argument" one.

A day later: without me changing anything, it switches between showing only 1, 2 or all 3 (faulted) replicas in the Longhorn UI (showing only 1 most of the time).

Support bundle for troubleshooting

https://we.tl/t-Y482MSSwQH (valid for 7 days)

Environment

  • Longhorn version: v1.6.1
  • Impacted volume (PV): pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher v2.8.3 (using chart longhorn:103.3.0+up1.6.1, helm version: v2.16.8-rancher2)
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.26.10+k3s1 (managed by Rancher v2.8.3)
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 5
  • Node config
    • OS type and version: Ubuntu 22.04.3 LTS
    • Kernel version: 5.15.0-1047-raspi aarch64
    • CPU per node: 4 cores
    • Memory per node: 8GB
    • Disk type (e.g. SSD/NVMe/HDD): NVMe SSD in a NAS, connected with a CIFS mount.
    • Network bandwidth between the nodes (Gbps): 1
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): 8x Raspberry PI 4B (for nodes) and QNAP TBS-464 (for NAS)
  • Number of Longhorn volumes in the cluster: 2
CC007 added the kind/bug, require/backport, and require/qa-review-coverage labels on Apr 22, 2024
james-munson (Contributor) commented:

Interesting that although there is log history and events for multiple replicas of pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490, there is currently only one in the replicas.yaml list and in the engines.yaml address maps: pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06.

The storage is on these disks:

              message: Disk disk-1-3(/mnt/PiRackData1/longhorn/) on node pirack1-node3 is ready
              message: Disk disk-2-4(/mnt/PiRackData2/longhorn/) on node pirack1-node4 is ready
              message: Disk disk-2-3(/mnt/PiRackData3/longhorn/) on node pirack2-node3 is ready
              message: Disk disk-2-4(/mnt/PiRackData4/longhorn/) on node pirack2-node4 is ready

There is a name overlap, but the node path and UUID are unique, so that should have no effect.

Looks like node pirack2-node4 was restarted just before 00:58 on 4/21, and then two others (with no storage) about 30 minutes later:

grep -rh -B 1 "Node.*is ready"
          lastTransitionTime: "2024-04-21T01:33:14Z"
          message: Node pirack1-node2 is ready
--
          lastTransitionTime: "2024-04-21T00:01:44Z"
          message: Node pirack1-node3 is ready
--
          lastTransitionTime: "2024-04-21T00:01:47Z"
          message: Node pirack1-node4 is ready
--
          lastTransitionTime: "2024-04-21T00:02:12Z"
          message: Node pirack1-node5 is ready
--
          lastTransitionTime: "2024-04-21T00:02:07Z"
          message: Node pirack2-node2 is ready
--
          lastTransitionTime: "2024-04-21T00:02:07Z"
          message: Node pirack2-node3 is ready
--
          lastTransitionTime: "2024-04-21T00:58:15Z"
          message: Node pirack2-node4 is ready
--
          lastTransitionTime: "2024-04-21T01:25:23Z"
          message: Node pirack2-node5 is ready

Given logging like this:

replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument
it would be worthwhile to see the contents of the replica storage directories, both filenames and sizes, and the contents of the meta files.
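
A sketch of how to gather that on each storage node, assuming the default replicas subdirectory under the Longhorn disk paths listed above (adjust the mount path per node):

# List every replica data directory with file names and sizes
ls -lR /mnt/PiRackData1/longhorn/replicas

# Dump the snapshot metadata files, which record the snapshot chain
find /mnt/PiRackData1/longhorn/replicas -name '*.meta' -exec sh -c 'echo "== $1"; cat "$1"; echo' _ {} \;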

CC007 (Author) commented Apr 25, 2024

> Interesting that although there is log history and events for multiple replicas of pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490, there is currently only one in the replicas.yaml list and in the engines.yaml address maps: pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06.

That's probably also why I only see 1 replica in the Longhorn UI when I look now.

> There is a name overlap, but the node path and UUID are unique, so that should have no effect.

Thanks for catching that. Fixed.

> Given logging like this:
>
> replica pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 failed the snapshot purge: failed to open parent file volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: open volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img: invalid argument
>
> it would be worthwhile to see the contents of the replica storage directories, both filenames and sizes, and the contents of the meta files.

I'll check those once I'm home again this evening.

CC007 (Author) commented Apr 25, 2024

Here's the data:

rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas$ ll
total 0
drwxr-xr-x 2 root root 0 Apr 21 02:43 ./
drwxr-xr-x 2 root root 0 Dec 16 02:18 ../
drwxr-xr-x 2 root root 0 Apr 25 19:16 pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa/
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas$ ll pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa/
total 3958976
drwxr-xr-x 2 root root            0 Apr 25 19:17 ./
drwxr-xr-x 2 root root            0 Apr 21 02:43 ../
-rwxr-xr-x 1 root root         4096 Apr 25 19:16 revision.counter*
-rwxr-xr-x 1 root root 171798691840 Apr 25 19:16 volume-head-003.img*
-rwxr-xr-x 1 root root          178 Apr 25 17:23 volume-head-003.img.meta*
-rwxr-xr-x 1 root root 171798691840 Apr 25 17:25 volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img*
-rwxr-xr-x 1 root root          210 Apr 21 03:04 volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta*
-rwxr-xr-x 1 root root 171798691840 Apr 21 03:05 volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img*
-rwxr-xr-x 1 root root          158 Apr 21 03:00 volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta*
-rwxr-xr-x 1 root root            0 Apr 25 17:26 volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img*
-rwxr-xr-x 1 root root          210 Apr 25 17:24 volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img.meta*
-rwxr-xr-x 1 root root          196 Apr 25 19:17 volume.meta*
rik@pirack1-node4:/mnt/PiRackData2/longhorn/replicas$ ll
total 0
drwxr-xr-x 2 root root 0 Apr 25 08:38 ./
drwxr-xr-x 2 root root 0 Dec 16 02:18 ../

CC007 (Author) commented Apr 25, 2024

Here's the metadata:

rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat revision.counter; echo
32153
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat volume-head-003.img.meta; echo
{"Name":"volume-head-003.img","Parent":"volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img","Removed":false,"UserCreated":false,"Created":"2024-04-25T15:23:53Z","Labels":null}

rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta; echo
{"Name":"volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img","Parent":"volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img","Removed":true,"UserCreated":false,"Created":"2024-04-21T01:02:09Z","Labels":null}

rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta; echo
{"Name":"volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img","Parent":"","Removed":true,"UserCreated":false,"Created":"2024-04-21T00:58:37Z","Labels":null}

rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img.meta; echo
{"Name":"volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img","Parent":"volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img","Removed":true,"UserCreated":false,"Created":"2024-04-25T15:23:53Z","Labels":null}

james-munson (Contributor) commented:

From the "invalid argument" error on file open, I expected the snap file to be missing, perhaps, but it is there. The other reason for that error would be a permissions issue, but it is world r-x.

I am also dubious of this logging:

2024-04-22T02:13:03.037620784+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06] time="2024-04-22T00:13:03Z" level=error msg="Failed to coalesce volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img" func="rpc.(*SyncAgentServer).processRemoveSnapshot" file="server.go:1330" error="file sizes are not equal and the parent file is larger than the child file"

This means that Longhorn called os.Stat() on each file successfully, but the reported file sizes don't agree (and the older, parent file is larger, which it should not be).

Still looking.
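
To see what sizes the CIFS mount is actually reporting for those files, a quick check in the replica directory listed above could help (a sketch, GNU stat assumed); the allocated-blocks column also shows how sparse each snapshot really is:

cd /mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa
# %s = apparent size in bytes, %b x %B = space actually allocated on disk
stat -c '%n  apparent=%s  allocated=%b blocks of %B bytes' volume-snap-*.img volume-head-*.img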

james-munson (Contributor) commented May 3, 2024

Looking at the event timestamps, there are ones and twos spaced hours apart from the time of node restart up until about 23:00 on 4/21, after which events come in floods:

grep creationTime events.yaml | sort | uniq -c
      1       creationTimestamp: "2024-04-21T01:00:44Z"
      1       creationTimestamp: "2024-04-21T01:42:16Z"
      1       creationTimestamp: "2024-04-21T01:42:17Z"
      1       creationTimestamp: "2024-04-21T03:50:57Z"
      1       creationTimestamp: "2024-04-21T12:17:15Z"
      1       creationTimestamp: "2024-04-21T13:17:14Z"
      1       creationTimestamp: "2024-04-21T19:57:14Z"

      1       creationTimestamp: "2024-04-21T22:57:09Z"
      1       creationTimestamp: "2024-04-21T23:19:48Z"
      2       creationTimestamp: "2024-04-21T23:20:03Z"
      2       creationTimestamp: "2024-04-21T23:20:27Z"
      1       creationTimestamp: "2024-04-21T23:20:58Z"
...

But all the ones before 23:00 have large "deprecatedCount" fields (does that suggest a repetition of the same event?) and all refer to the "principal" replica, pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06, whether about faulting, starting, stopping, or salvaging. Or about the engine itself.

What about other replicas? Let's pick pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc, with a lifetime from Start to Stop spanning 23:54:04 to 23:54:51. What do the logs say about it?

In the instance-manager log (logs/longhorn-system/instance-manager-9611beb6650a7e26236ac88c51ce97c2/instance-manager.log), its instance and managing process are created:

2024-04-22T01:54:04.037148183+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:04Z" level=info msg="Creating instance" func="instance.(*Server).InstanceCreate" file="instance.go:113" dataEngine=DATA_ENGINE_V1 name=pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc type=replica
2024-04-22T01:54:04.037815671+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:04Z" level=info msg="Creating process" func="client.(*ProcessManagerClient).ProcessCreate" file="process_manager.go:96" args="[--volume-name pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 replica /host/mnt/PiRackData2/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-134802af --size 171798691840 --replica-instance-name pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc --snapshot-max-count 250 --snapshot-max-size 0 --sync-agent-port-count 7]" binary=/host/var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.6.1/longhorn name=pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc portArgs="[--listen,0.0.0.0:]" portCount=10
2024-04-22T01:54:04.039301052+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:04Z" level=info msg="Process Manager: prepare to create process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc" func="process.(*Manager).ProcessCreate" file="process_manager.go:184"
2024-04-22T01:54:04.039912652+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:04Z" level=info msg="Process Manager: created process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc" func="process.(*Manager).ProcessCreate" file="process_manager.go:221"

Then the physical directory is created:

2024-04-22T01:54:04.069854285+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc] time="2024-04-21T23:54:04Z" level=info msg="Creating replica /host/mnt/PiRackData2/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-134802af, size 171798691840/512" func="replica.(*Server).Create" file="server.go:56"
2024-04-22T01:54:04.077168118+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc] time="2024-04-21T23:54:04Z" level=warning msg="Recovering volume metafile /host/mnt/PiRackData2/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-134802af/volume.meta, and replica info: {Size:171798691840 Head: Dirty:false Rebuilding:false Error: Parent: SectorSize:512 BackingFilePath: BackingFile:<nil>}" func="replica.(*Replica).tryRecoverVolumeMetaFile" file="replica.go:1114"
2024-04-22T01:54:04.094421440+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc] time="2024-04-21T23:54:04Z" level=info msg="Starting to create disk" func="replica.(*Replica).createDisk" file="replica.go:889" disk=000
2024-04-22T01:54:04.112707447+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc] time="2024-04-21T23:54:04Z" level=info msg="Finished creating disk" func="replica.(*Replica).createDisk" file="replica.go:987" disk=000
2024-04-22T01:54:04.129028230+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc] time="2024-04-21T23:54:04Z" level=info msg="Listening on data server 0.0.0.0:10001" func=cmd.startReplica.func2 file="replica.go:164"
2024-04-22T01:54:04.138184233+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc] time="2024-04-21T23:54:04Z" level=info msg="Listening on sync agent server 0.0.0.0:10002" func=cmd.startReplica.func3 file="replica.go:193"
2024-04-22T01:54:04.138274158+02:00 time="2024-04-21T23:54:04Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10000" func=cmd.startReplica.func1 file="replica.go:156"
2024-04-22T01:54:04.170537212+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:04Z" level=info msg="Process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc has started at localhost:10000" func="process.(*GRPCHealthChecker).WaitForRunning" file="healthchecker.go:33"
2024-04-22T01:54:04.183176598+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc] time="2024-04-21T23:54:04Z" level=info msg="Listening on sync 0.0.0.0:10002" func=cmd.startSyncAgent file="sync_agent.go:89"

And then it shuts down about 45 seconds later, having done exactly nothing:

2024-04-22T01:54:51.672773811+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=info msg="Getting instance log" func="instance.(*Server).InstanceLog" file="instance.go:397" dataEngine=DATA_ENGINE_V1 name=pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc type=replica
2024-04-22T01:54:51.674786238+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=info msg="Process Manager: start getting logs for process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc" func="process.(*Manager).ProcessLog" file="process_manager.go:338"
2024-04-22T01:54:51.675980754+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=info msg="Process Manager: got logs for process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc" func="process.(*Manager).ProcessLog" file="process_manager.go:355"
2024-04-22T01:54:51.682793818+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=info msg="Deleting instance" func="instance.(*Server).InstanceDelete" file="instance.go:174" cleanupRequired=true dataEngine=DATA_ENGINE_V1 diskUuid=d994c947-edd8-4e86-8405-db43f800e0ef name=pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc type=replica
2024-04-22T01:54:51.686111073+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=info msg="Process Manager: prepare to delete process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc" func="process.(*Manager).ProcessDelete" file="process_manager.go:230"
2024-04-22T01:54:51.686174369+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=info msg="Process Manager: deleted process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc" func="process.(*Manager).ProcessDelete" file="process_manager.go:244"
2024-04-22T01:54:51.686195035+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=debug msg="Process Manager: wait for process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc to shutdown before unregistering process" func="process.(*Manager).unregisterProcess.func1" file="process_manager.go:281"
2024-04-22T01:54:51.686212331+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=info msg="Process Manager: trying to stop process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc" func="process.(*Process).StopWithSignal.func1" file="process.go:186"
2024-04-22T01:54:51.686230868+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:51Z" level=info msg="Wait for process pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc to shutdown" func="process.(*Process).StopWithSignal.func1" file="process.go:192"
2024-04-22T01:54:51.686249275+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc] time="2024-04-21T23:54:51Z" level=warning msg="Received signal interrupt to shutdown" func=cmd.registerShutdown.func1 file="shutdown.go:31"
... etc,etc

Nothing is logged about why it should do so.

CC007 (Author) commented May 3, 2024

So no info about why it's failing or shutting down? What about the first time after I installed Longhorn 1.6.1, added nexus and then restarted the node? Did anything special pop up that first time?

CC007 (Author) commented May 10, 2024

Would any changes be needed to longhorn to properly diagnose the issue?

If needed, I'd be down to rerun the test once these changes are added.

PhanLe1010 (Contributor) commented:

Looks like on the engine side we have:

  1. The engine process is started:

     2024-04-22T01:53:55.398883232+02:00 [longhorn-instance-manager] time="2024-04-21T23:53:55Z" level=info msg="Creating process" func="client.(*ProcessManagerClient).ProcessCreate" file="process_manager.go:96" args="[--engine-instance-name pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 controller pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 --frontend tgt-blockdev --salvageRequested --size 171798691840 --current-size 171798691840 --engine-replica-timeout 8 --file-sync-http-client-timeout 30 --snapshot-max-count 250 --snapshot-max-size 0 --replica tcp://10.42.7.226:10000]" binary=/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.6.1/longhorn name=pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 portArgs="[--listen,0.0.0.0:]" portCount=1

     with this replica:

     2024-04-22T01:53:55.445576847+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0] time="2024-04-21T23:53:55Z" level=info msg="Starting with replicas [\"tcp://10.42.7.226:10000\"]" func=cmd.startController file="controller.go:196"

  2. The engine does a snapshot purge to prepare for replica rebuilding:

     2024-04-22T01:54:05.422647628+02:00 [longhorn-instance-manager] time="2024-04-21T23:54:05Z" level=info msg="Purging snapshots" func="proxy.(*Proxy).SnapshotPurge" file="snapshot.go:388" dataEngine=DATA_ENGINE_V1 engineName=pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-e-0 serviceURL="10.42.7.226:10010" volumeName=pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490

     and then starts coalescing snapshot files:

     2024-04-22T01:54:05.455843984+02:00 [pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06] time="2024-04-21T23:54:05Z" level=info msg="Coalescing volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img" func="rpc.(*SyncAgentServer).processRemoveSnapshot" file="server.go:1328"

  3. The coalesce failed because of a snapshot file size mismatch:

     2024-04-22T01:54:05.455898613+02:00 time="2024-04-21T23:54:05Z" level=error msg="Failed to coalesce volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img to volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img" func="rpc.(*SyncAgentServer).processRemoveSnapshot" file="server.go:1330" error="file sizes are not equal and the parent file is larger than the child file"

  4. Then the engine is killed and the replica is killed.

The root cause might be related to https://longhorn.io/kb/troubleshooting-unexpected-expansion-leads-to-degradation-or-attach-failure/. However, the user is on a fresh v1.6.1 install, so this is a bit confusing. Could we double-check whether this volume was created before v1.6.1?
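
For reference, one way to check when the Longhorn volume object was created, assuming kubectl access to the cluster (creationTimestamp is present on every Kubernetes object):

# Print the creation time of the Longhorn volume CR
kubectl -n longhorn-system get volumes.longhorn.io pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 -o jsonpath='{.metadata.creationTimestamp}{"\n"}'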

CC007 (Author) commented May 15, 2024

You can see that this PVC has a different ID from the one that was reported in #8091. I had cleaned up all Longhorn resources (including custom resources) in Kubernetes, and I also emptied the folder that Longhorn had previously been assigned to use. I even emptied the recycle bin that my storage server uses for deleted files.

CC007 (Author) commented May 15, 2024

Okay, I'll try again, just as a sanity check.

Uninstallation:

  1. In Rancher I uninstalled the Nexus helm chart, so that the volume wasn't needed anymore
  2. In Longhorn I deleted the volume that's associated with the nexus helm chart
  3. In Longhorn I disabled and then removed all storage locations that I added for my CIFS connection with my NAS.
  4. In Longhorn I turned on the setting that allows me to uninstall Longhorn
  5. In Rancher I uninstalled the Longhorn helm chart (leaving the Longhorn-crd installed for now)
  6. In Rancher's More Resources, I found that the settings were still there, so I deleted those.
  7. In Rancher I checked the namespace where longhorn previously was installed (longhorn-system) and checked if there were any longhorn-specific resources left. I found events and deleted those (there were a few other unrelated resources, but I'll come back to that)
  8. In Rancher I uninstalled the Longhorn-crd chart
  9. In Rancher I deleted the longhorn-system namespace (and all the resources that were still there)
  10. In an ssh connection, I deleted the folder in the CIFS mount that longhorn was using for all its data storage (the storage locations from step 2.)
  11. In my NAS app, I emptied out the recycle bin, so that the storage locations are also gone from there.

My system should now be fully rid of everything related to Longhorn

Installation:

  1. In Rancher I installed the Longhorn helm chart (Longhorn (103.3.0+up1.6.1) from the Rancher repo) with default settings
  2. In Longhorn I created 1 disk on each of the 4 storage nodes (the ones with the CIFS mount)
  3. In Longhorn I removed the default /var/lib/longhorn disks from all 8 nodes (I don't want storage to be used on the SD card) and disabled the 4 nodes that don't have a CIFS mount. At this moment, I have 4 disabled nodes and 4 unschedulable nodes.
  4. In an ssh connection, I created the folder that Longhorn is trying to use on the CIFS mounts. The unschedulable nodes now become schedulable.

Longhorn should now be up and running:
[screenshot]

Test setup:

  1. Install the Nexus3 chart (nexus3 (4.43.0) from stevehipwell's repo).
    • Set persistence.enabled to true
    • Set persistence.size to 160Gi
    • Set persistence.storageClass to longhorn
  2. ... uh....

I expected to say, "All 3 replicas are created and healthy." Which they were... for a bit. I created a snapshot, which succeeded, but then I saw that the Nexus chart wasn't fully installed yet, so I removed the snapshot again and went to do something else in the meantime while waiting for the Nexus chart to finish installing. But when I came back, I saw that the chart installation had failed:
[screenshots]

I also saw this in the Longhorn UI:
[screenshot]

PhanLe1010 (Contributor) commented:

> Installation:
>
> 1. In Rancher I installed the Longhorn helm chart (Longhorn (103.3.0+up1.6.1) from the Rancher repo) with default settings
> 2. In Longhorn I created 1 disk on each of the 4 storage nodes (the ones with the CIFS mount)
> 3. In Longhorn I removed the default /var/lib/longhorn disks from all 8 nodes (I don't want storage to be used on the SD card) and disabled the 4 nodes that don't have a CIFS mount. At this moment, I have 4 disabled nodes and 4 unschedulable nodes.
> 4. In an ssh connection, I created the folder that Longhorn is trying to use on the CIFS mounts. The unschedulable nodes now become schedulable.

Hi @CC007

1. For this step:

> In Longhorn I removed the default /var/lib/longhorn disks from all 8 nodes (I don't want storage to be used on the SD card) and disabled the 4 nodes that don't have a CIFS mount. At this moment, I have 4 disabled nodes and 4 unschedulable nodes.

Did you delete /var/lib/longhorn on the physical nodes, or only in the Longhorn UI? We shouldn't delete it on the physical nodes, since it contains some binaries that Longhorn needs.

2. Can we switch the order a bit?

Do this first: "In an ssh connection, I created the folder that Longhorn is trying to use on the CIFS mounts. The unschedulable nodes now become schedulable." Then do this: "In Longhorn I created 1 disk on each of the 4 storage nodes (the ones with the CIFS mount)."

3. We have not tested CIFS storage with Longhorn. It looks like CIFS has only supported direct I/O since 2018: https://lwn.net/Articles/770552/. Could you try removing CIFS from the test to see if you still hit the same issue? (A quick way to probe direct I/O support on the mount is sketched after this list.)

4. Lastly, can we have a support bundle for the failed test at #8403 (comment)?
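
A quick way to probe direct I/O on the mount (a sketch, using GNU dd and a hypothetical test file): opening with O_DIRECT fails with "Invalid argument" on filesystems or mount options that do not support direct I/O, which is the same errno seen in the replica open() failures above.

# Write a small test file on the CIFS mount, then try to read it back with O_DIRECT
dd if=/dev/zero of=/mnt/PiRackData1/longhorn/directio-test bs=4096 count=1
dd if=/mnt/PiRackData1/longhorn/directio-test of=/dev/null bs=4096 count=1 iflag=direct
rm /mnt/PiRackData1/longhorn/directio-test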

CC007 (Author) commented May 16, 2024

> Did you delete /var/lib/longhorn on the physical nodes, or only in the Longhorn UI? We shouldn't delete it on the physical nodes, since it contains some binaries that Longhorn needs.

Only through the UI indeed. Don't wanna touch Longhorn's files myself.

> 2. Can we switch the order a bit?
>
> Do this first: "In an ssh connection, I created the folder that Longhorn is trying to use on the CIFS mounts. The unschedulable nodes now become schedulable." Then do this: "In Longhorn I created 1 disk on each of the 4 storage nodes (the ones with the CIFS mount)."

I'll do that if my next attempt fails.

> 3. We have not tested CIFS storage with Longhorn. It looks like CIFS has only supported direct I/O since 2018: https://lwn.net/Articles/770552/. Could you try removing CIFS from the test to see if you still hit the same issue?

Right now I only have CIFS set up. I tried NFS before, but couldn't get it working with proper authentication. The Raspberry PIs don't have enough (or fast enough) storage themselves through the SD card, so I'm trying to use a NAS (QNAP TBS-464) for storage.

If you know a better way to connect the NAS to the Raspberry PIs (or connect the NAS to Longhorn directly), please tell me.

> 4. Lastly, can we have a support bundle for the failed test at #8403 (comment)?

Here is the original bundle: https://we.tl/t-qSKv15WFFr
And here is the new bundle of today's first test: https://we.tl/t-Mj8iSjYMuE

Here's another attempt to install nexus:

Test setup:

  1. Same as before (though I forgot to mention that image.repository was set to sonatypecommunity/nexus3)
  2. check the volume:
    [screenshot]
    So far so good
  3. Wait until the nexus chart is fully done installing (before doing anything, like making/deleting snapshots)
  4. Take a snapshot. still no issues:
    [screenshot]
  5. Remove the snapshot (to verify the issue didn't reappear because of the snapshot removal). A brief downtime of the replicas, but it recovered within a couple of seconds:
    [screenshot]
  6. Snapshot again (just to have one)

Do the test:

  1. Kill a node with one of the replicas (with sudo reboot, in this case node 1-3):
    [screenshot]
  2. Wait until the node comes back up:
    [screenshots]

CC007 (Author) commented May 16, 2024

As you can see, the snapshot that I created doesn't show up anymore either.

CC007 (Author) commented May 16, 2024

And then the replica on node 1-3 disappeared:
[screenshots]

CC007 (Author) commented May 16, 2024

Here is the bundle after the last test: https://we.tl/t-eul6aOGIlJ

CC007 (Author) commented May 18, 2024

Also, the replicas getting corrupted by a node reboot is one thing, but the fact that the snapshot is no longer available to revert to (see "Do the test" step 2) is even more worrying. That means that not just the running instances get corrupted; the snapshots are affected too. That makes it not just a partial data loss, but a FULL data loss situation.

CC007 (Author) commented Jun 8, 2024

Any progress on the analysis of this bug?
