[BUG] Longhorn fails to recover from a node restart #8403
Comments
Interesting that although there is log history and events for multiple replicas of pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490, there is currently only one in the replicas.yaml list and in the engines.yaml address maps: pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06. The storage is on these disks:
There is a name overlap, but the node path and UUID are unique, so that should have no effect. Looks like node pirack2-node4 was restarted just before 00:58 on 4/21, and then two others (with no storage) about 30 minutes later:
Given logging like this:
That's probably also why I only see 1 replica in the Longhorn UI when I'm looking now.
Thanks for catching that. Fixed.
I'll check those once I'm home again this evening.
Here's the data:
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas$ ll
total 0
drwxr-xr-x 2 root root 0 Apr 21 02:43 ./
drwxr-xr-x 2 root root 0 Dec 16 02:18 ../
drwxr-xr-x 2 root root 0 Apr 25 19:16 pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa/
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas$ ll pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa/
total 3958976
drwxr-xr-x 2 root root 0 Apr 25 19:17 ./
drwxr-xr-x 2 root root 0 Apr 21 02:43 ../
-rwxr-xr-x 1 root root 4096 Apr 25 19:16 revision.counter*
-rwxr-xr-x 1 root root 171798691840 Apr 25 19:16 volume-head-003.img*
-rwxr-xr-x 1 root root 178 Apr 25 17:23 volume-head-003.img.meta*
-rwxr-xr-x 1 root root 171798691840 Apr 25 17:25 volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img*
-rwxr-xr-x 1 root root 210 Apr 21 03:04 volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta*
-rwxr-xr-x 1 root root 171798691840 Apr 21 03:05 volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img*
-rwxr-xr-x 1 root root 158 Apr 21 03:00 volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta*
-rwxr-xr-x 1 root root 0 Apr 25 17:26 volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img*
-rwxr-xr-x 1 root root 210 Apr 25 17:24 volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img.meta*
-rwxr-xr-x 1 root root 196 Apr 25 19:17 volume.meta*
rik@pirack1-node4:/mnt/PiRackData2/longhorn/replicas$ ll
total 0
drwxr-xr-x 2 root root 0 Apr 25 08:38 ./
drwxr-xr-x 2 root root 0 Dec 16 02:18 ../
rik@pirack1-node4:/mnt/PiRackData2/longhorn/replicas$ ll
total 0
drwxr-xr-x 2 root root 0 Apr 25 08:38 ./
drwxr-xr-x 2 root root 0 Dec 16 02:18 ../
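For completeness, a quick way to confirm which of the data disks still hold any on-disk directory for this volume (a sketch using the mount paths shown above; adjust if the disks are mounted elsewhere):

# List any replica directory for this volume across the Longhorn data disks on a node.
find /mnt/PiRackData*/longhorn/replicas \
  -maxdepth 1 -type d -name 'pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-*'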
Here's the metadata:
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat revision.counter; echo
32153
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat volume-head-003.img.meta; echo
{"Name":"volume-head-003.img","Parent":"volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img","Removed":false,"UserCreated":false,"Created":"2024-04-25T15:23:53Z","Labels":null}
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img.meta; echo
{"Name":"volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img","Parent":"volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img","Removed":true,"UserCreated":false,"Created":"2024-04-21T01:02:09Z","Labels":null}
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img.meta; echo
{"Name":"volume-snap-66c47248-5ddc-4e38-990e-4d57cd9f5615.img","Parent":"","Removed":true,"UserCreated":false,"Created":"2024-04-21T00:58:37Z","Labels":null}
rik@pirack1-node3:/mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa$ cat volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img.meta; echo
{"Name":"volume-snap-e473b8fd-12e9-428d-9123-a79ec592eab3.img","Parent":"volume-snap-4478c436-a23c-47de-a4b2-f89bc82ae501.img","Removed":true,"UserCreated":false,"Created":"2024-04-25T15:23:53Z","Labels":null} |
From the "invalid argument" error on file open, I expected the snap file to be missing, perhaps, but it is there. The other reason for that error would be a permissions issue, but it is world r-x. I am also dubious of this logging:
Which means that Longhorn called os.Stat() on each file successfully, but the reported file sizes don't agree (and the older file reports a larger size, which it should not). Still looking.
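One thing to keep in mind when comparing those numbers: the .img files are sparse (the listing above shows 160 GiB apparent sizes but only a few GiB allocated in total), so a directory listing shows the apparent size while the allocated size can be much smaller. A quick way to see both (a sketch, assuming GNU stat/du on the node):

# Apparent size vs blocks actually allocated for each image in the replica directory.
stat -c '%n apparent=%s allocated_blocks=%b block_size=%B' volume-*.img
# Same comparison via du:
du -h --apparent-size volume-*.img
du -h volume-*.img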
Looking at the event timestamps, there are ones and twos spaced hours apart from the time of node restart up until about 23:00 on 4/21, after which events come in floods:
But all the ones before 23:00 have large "deprecatedCount" fields (does that suggest a repetition of the same event?), and all refer to the "principal" replica, pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-b2ff7e06 (whether about faulting, starting, stopping, or salvaging) or to the engine itself. What about other replicas? Let's pick pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc, with a lifetime from Start to Stop spanning 23:54:04 to 23:54:51. What do the logs say about it? In instance-manager logs/longhorn-system/instance-manager-9611beb6650a7e26236ac88c51ce97c2/instance-manager.log, its instance and managing process are created:
Then the physical directory is created:
And then it shuts down about 45 seconds later, having done exactly nothing:
Nothing is logged about why it should do so.
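For anyone who wants to retrace this in the bundle, the per-replica lifecycle can be pulled out with a plain grep; a sketch against the unpacked support bundle (the log path is taken from the reference above):

# Every line mentioning this replica, across all instance-manager pods in the bundle.
grep -rn 'pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-r-0c1350fc' \
  logs/longhorn-system/instance-manager-*/instance-manager.log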
So no info about why it's failing or shutting down? What about the first time after I installed Longhorn 1.6.1, added nexus, and then restarted the node? Did anything special pop up that first time?
Would any changes be needed in Longhorn to properly diagnose the issue? If needed, I'd be down to rerun the test once these changes are added.
Looks like on the engine side we have:
The root cause might be related to https://longhorn.io/kb/troubleshooting-unexpected-expansion-leads-to-degradation-or-attach-failure/. However, the user is on a fresh v1.6.1 install, so this is a bit confusing. Could we double-check whether this volume was created before v1.6.1?
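One way to check that (a sketch; assumes kubectl access to the cluster, and that the on-disk volume.meta still carries a Size field, which may vary by engine version):

# When was the volume object created, and what size does the control plane record?
kubectl -n longhorn-system get volumes.longhorn.io \
  pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}{.spec.size}{"\n"}'
# Compare with what the replica recorded on disk (run on the node holding the replica).
cat /mnt/PiRackData1/longhorn/replicas/pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490-daeeeafa/volume.meta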
You can see that this PVC has a different name from the one that was reported in #8091. I had cleaned up all Longhorn resources (including custom resources) in k8s, and I also emptied the folder that had been assigned to Longhorn before. I even emptied the garbage bin that my storage server uses for deleted files.
Hi @CC007

1. For this step:
Did you delete /var/lib/longhorn on the physical nodes, or only in the Longhorn UI? We shouldn't delete it on the physical nodes, since it contains some binaries that Longhorn needs.

2. Can we switch the order a bit: do this first.

3. We have not tested CIFS storage for Longhorn. It looks like CIFS has only supported directIO since 2018 (https://lwn.net/Articles/770552/). Could you try removing CIFS from the test to see if you still hit the same issue?

4. Lastly, can we have a support bundle for the failed test at #8403 (comment)?
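On item 3, direct I/O support on the CIFS mount can be probed without Longhorn in the picture (a sketch; /mnt/PiRackData1 is taken from the paths above, adjust to the actual CIFS mount point):

# A 4K-aligned write with O_DIRECT; if the mount does not support direct I/O,
# dd fails with "Invalid argument" (the same string seen in the error above).
dd if=/dev/zero of=/mnt/PiRackData1/directio-test bs=4096 count=16 oflag=direct
dd if=/mnt/PiRackData1/directio-test of=/dev/null bs=4096 iflag=direct
rm /mnt/PiRackData1/directio-test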
Only through the UI indeed. Don't wanna touch Longhorn's files myself.
I'll do that if my next attempt fails.
Right now I only have CIFS set up. I tried NFS before, but couldn't get it working with proper authentication. The Raspberry Pis don't have enough (or fast enough) storage themselves on the SD card, so I'm trying to use a NAS (QNAP TBS-464) for storage. If you know a better way to connect the NAS to the Raspberry Pis (or connect the NAS to Longhorn directly), please tell me.
Here is the original bundle: https://we.tl/t-qSKv15WFFr
Here's another attempt to install nexus:
Test setup:
Do the test:
As you can see, the snapshot that I created doesn't show up anymore either.
Here is the bundle after the last test: https://we.tl/t-eul6aOGIlJ
Also, the fact that the replicas get corrupted by a node reboot is one thing, but the fact that the snapshot is no longer available to revert back to in step 2 of "Do the test" is even more worrying. That means that not just the running instances get corrupted; it causes an issue with the snapshots too. That makes this not just partial data loss, but a FULL data loss situation.
Any progress on the analysis of this bug?
Describe the bug
After #8091 was fixed, I updated to 1.6.1, but got another error.
To Reproduce
What I did:
Expected behavior
I noticed the node becoming unhealthy (showing red or gray). This was expected; it makes sense that Longhorn would first wait to see whether the node comes back.
I expected that the node's replica would either rejoin the volume (after updating to the latest state) or, if the replica was too far out of sync, that a new replica would be created from one of the 2 remaining ones.
Actual behavior
After a while, when the node comes back, the replica on that node disappears and the other two replicas become faulted, switching between attaching and detaching because of issues.
When the node comes back I see these messages:
I then get this error (probably because the node wasn't fully ready yet):
Followed by a couple of:
And then
Etc.
The one that stands out is the "invalid argument" error.
A day later, without me changing anything, the Longhorn UI switches between showing only 1, 2, or all 3 (faulted) replicas (showing only 1 most of the time).
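For capturing that flapping outside the UI, the replica objects themselves can be watched (a sketch; assumes the longhornvolume label that Longhorn normally puts on its replica objects):

# Watch the replica CRs for this volume; the UI is just rendering these.
kubectl -n longhorn-system get replicas.longhorn.io \
  -l longhornvolume=pvc-31540f93-f81a-4864-a5bb-7e9d6bd5d490 -o wide -w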
Support bundle for troubleshooting
https://we.tl/t-Y482MSSwQH (valid for 7 days)
Environment