[BUG] Volume attach/detach/delete operations stuck in version 1.6.0 #7915
Comments
@Eilyre Could you point me to one of the volumes that is currently stuck in attach/detach?
Stuck in detach: pvc-411dd345-c354-4f3e-ac9d-738b39671776, pvc-58297811-5dc4-4ac3-944d-056512745d8d, pvc-0762851a-d674-4854-a210-b1d5787c86f7, pvc-de096a05-870a-4a79-9e29-9f2c67cec105, pvc-e34dd802-84f5-4886-87ec-9401aaa0075a, pvc-ecce39fb-7606-40ee-a58a-7cd6a15965c0. The ones in bold are the ones I noticed first. This information comes from the UI. Currently there are no volumes stuck attaching, but it happens when I create a new deployment.
Also, some extra information @PhanLe1010: replica rebuilds do not get scheduled. If I delete a replica of a healthy volume, the volume just stays degraded.
After speaking with @PhanLe1010 and doing some analysis, it appears the instance map is not being updated. We see the below log, so it seems longhorn-manager is trying to update the instance map, but instance-manager is not responding in time.
We may eventually need to delete its pod in the hope of getting it kickstarted, but that will cause any workloads that are still successfully running on it to crash. It's a bit of a shot in the dark, but if you are willing to capture some profiling data, it may give us clues.
@ejweber @PhanLe1010 Sent the *.out files via email.
Can we manually check the gRPC connection from the Longhorn manager pod?
To double check the IP of instance-manager-d7653d361fe2dcc420474ccf47737d4b, could you run …
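(The exact commands from this comment were not captured above; as a rough sketch, the pod IP and basic reachability could be cross-checked like this, assuming the default longhorn-system namespace and that `nc` is available in the longhorn-manager image:)

```bash
# Sketch only: confirm the instance-manager pod IP that longhorn-manager
# should be dialing (namespace assumed to be longhorn-system).
kubectl -n longhorn-system get pod instance-manager-d7653d361fe2dcc420474ccf47737d4b -o wide

# From a longhorn-manager pod on the same node, check whether the
# instance-manager gRPC port even accepts TCP connections.
# Port 8500 is an assumption here; adjust to your deployment.
kubectl -n longhorn-system exec -it <longhorn-manager-pod> -- \
  nc -zv <instance-manager-pod-ip> 8500
```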
Sorry for the delay @PhanLe1010, here's the information:
Looking at the process table of …
So instance manager does not seem to be answering, but what is weird is that the instance manager processes seem to be fine:
So maybe it's not … I also checked the logs and metrics, and I cannot really find any reason why these processes would be killed. There are no limits being hit for Longhorn, the node seems otherwise fine, and Longhorn has the only zombie processes.
@Eilyre, I'll be looking into this again this morning.
Something like this is what I think I am expecting. In general, we know that the instance-manager process is running, because it continues to log in the support bundle, but maybe one of its gRPC services is hung somehow. Similarly, longhorn-manager is also logging (quite a lot), so it appears to still be alive as well.
TBH, I don't have much confidence it will help, but restarting longhorn-manager itself is a safe operation. It should not affect running volumes negatively.
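(For reference, a minimal sketch of how longhorn-manager is commonly restarted, assuming the default DaemonSet name and namespace:)

```bash
# Sketch only: rolling-restart the longhorn-manager DaemonSet and wait for
# it to settle (default name/namespace assumed).
kubectl -n longhorn-system rollout restart daemonset/longhorn-manager
kubectl -n longhorn-system rollout status daemonset/longhorn-manager
```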
Apparently mutex profiling is disabled by default (I probably should have known that), so …
If you are up for it, please send the following:
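(The exact list of profiles requested is not captured above; as a hedged sketch, goroutine and mutex profiles are typically pulled from the pprof endpoint that the following comments show listening on port 6060:)

```bash
# Sketch only: forward the instance-manager's pprof port to the local
# machine, then dump goroutine and mutex profiles to *.out files.
kubectl -n longhorn-system port-forward pod/<instance-manager-pod> 6060:6060 &

curl -s "http://localhost:6060/debug/pprof/goroutine?debug=2" > goroutine.out
curl -s "http://localhost:6060/debug/pprof/mutex?debug=1" > mutex.out
```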
@ejweber Thank you for looking into this issue in such detail! I did not do anything yet (so no manager restarts). I have six instance-manager pods.
Four of them I was able to port-forward to and get the goroutine information from:
But two of them do not have anything listening on port 6060! I do not know if this is normal.
So I was unable to run curl against them. I sent the files via email, to the same thread. The special one is suffixed with …
No worries there, I think. Those two are still on …
Thanks to your help @Eilyre, we think we have identified a deadlock. Unfortunately, we think the only resolution to your current situation is to restart the instance manager pod. This will be disruptive to workloads. If it is possible to do so, scaling down workloads running on the affected node before restarting the pod is recommended. While the associated volumes won't detach or move, this should ensure applications aren't writing to the volumes when their engines stop.
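(A rough sketch of that recovery sequence, assuming a Deployment-backed workload and the default longhorn-system namespace; names in angle brackets are placeholders:)

```bash
# Sketch only: scale the affected workload down first so nothing is writing
# to the volume while its engine stops.
kubectl -n <app-namespace> scale deployment <workload> --replicas=0

# Delete the stuck instance-manager pod; Longhorn recreates it.
kubectl -n longhorn-system delete pod <stuck-instance-manager-pod>

# Scale the workload back up once the new instance-manager pod is Running.
kubectl -n <app-namespace> scale deployment <workload> --replicas=1
```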
Thank you so much for the help @ejweber @PhanLe1010! Draining the node and restarting the instance-manager (and longhorn-manager, just in case) pods brought the cluster out of the deadlock, and operations started happening again. I gave it a bit of time, and the dust has settled nicely, no issues. Just for future reference, I understand this is probably a rare occurrence, but is there a way to detect this kind of deadlock?
I think the biggest signal is that your Longhorn volume failed to finish attach/detach, so your workload is stuck in a Pending state. Then, we would have to look into the logs to figure it out. In this case the hint is from the log …
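(A minimal sketch of checks that could surface this kind of stuck state, assuming kubectl access; the CRD and column names are those of recent Longhorn releases and may differ in yours:)

```bash
# Sketch only: a Longhorn volume that sits in "attaching"/"detaching" for a
# long time is the main red flag.
kubectl -n longhorn-system get volumes.longhorn.io

# Workload pods waiting on such a volume usually sit in Pending or
# ContainerCreating with volume-attachment events.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl -n <app-namespace> describe pod <stuck-pod>
```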
Hi, @Eilyre, would you mind collecting and posting a follow-up support bundle when you get a chance, so we can check that all the anomalous things we saw were artifacts of the stuck update and have now resolved themselves? Thanks!
Hey @james-munson, sent via email like before.
Hey @ejweber! Either my cluster is special again, or it's easier to hit the deadlock on an active cluster than you figured - I managed to hit it again. On a different node this time around.
Restarting the stuck instance-manager pod helped again.
I'm sorry to hear you hit it again so soon, @Eilyre. I would not have expected that. Since I am not sure when we will have a followup release, I created the following Docker image with these changes:
If you would like to run this modified instance-manager, you can:
If you choose to do so, you'll see multiple instance-managers running on each node until all of the old one's engine and replica processes have stopped and started again. (You'll still be potentially vulnerable to this issue until all have switched over.)
If you pick up the modified version, please let us know how it goes. If not, that's completely understandable.
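(The exact steps from this comment are not captured above; as a hedged sketch, one common way to point Longhorn at a different instance-manager image is via the longhorn-manager DaemonSet arguments, assuming a flag-based longhorn.yaml style deployment:)

```bash
# Sketch only: edit the instance-manager image argument of the
# longhorn-manager DaemonSet (flag name assumed from a typical install).
kubectl -n longhorn-system edit daemonset longhorn-manager
#   args:
#     ...
#     - --instance-manager-image
#     - <modified-instance-manager-image>

# Watch the old and new instance-manager pods coexist per node until all
# engine/replica processes have moved over.
kubectl -n longhorn-system get pods -o wide | grep instance-manager
```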
Thank you for the modified version @ejweber! I talked with my team, and we figured we'd apply it when we hit the deadlock for the third time (as then we need to detach/attach the volumes anyway). Hopefully how often it happens will be useful information for you as well.
I attempted to use these images, but I believe they are not built for multi-arch. I have mixed architectures.
Hey @ejweber! We hit the same issue again only 26 hours later (we didn't notice at the time), and a few days later with another instance manager at the same time. I have now upgraded to your version of the instance manager and rebooted two of the stuck instance managers. There seem to have been some initial issues with volumes whose primary replica was attached to a workload by an instance manager running the old version. When these volumes tried to schedule new replicas on the nodes with the new instance manager, they got weird errors. The errors:
Changing the max replica count and downscaling the workload helped, but... can and should the snapshot max count block replica rebuilds? I'll let you know if there are any issues with the custom version, thank you!
@Eilyre, I checked the code and it is expected. We must take a snapshot at the start of the rebuild. This allows anything written before that moment to become immutable (so it can be transferred to the new replica safely). Any writes after that moment can be issued to all replicas (including the rebuilding one). At the same time, it is unusual to unexpectedly have 20 snapshots of a volume. Were these snapshots created intentionally (by a recurring job), or were they system snapshots? If they were system snapshots, is … If this happens again, it should be possible to delete a snapshot for the volume in question (e.g. via the UI) instead of increasing the limit. Feel free to open a docs issue if you feel there is somewhere we can better document this behavior.
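(A hedged sketch of how a volume's snapshots could be inspected and pruned from the CLI instead of the UI; the snapshots.longhorn.io CRD exists in recent releases, but the label selector below is an assumption and deleting the CR directly should be double-checked against your version's docs:)

```bash
# Sketch only: list the snapshot CRs belonging to the affected volume
# (the longhornvolume label is an assumption; adjust to your install).
kubectl -n longhorn-system get snapshots.longhorn.io -l longhornvolume=<volume-name>

# Removing one snapshot frees a slot under the volume's snapshot limit,
# as an alternative to raising the limit itself.
kubectl -n longhorn-system delete snapshots.longhorn.io <snapshot-name>
```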
@ejweber They were created by automatic jobs (daily snapshots) before this new version introduced the …
Describe the bug
Over the weekend, our production Longhorn cluster degraded in a way where new volume attach/detach/delete operations get stuck (meaning they never finish). Workloads continue working until a volume needs to be detached and attached on another node.
This happened after fixing the cluster with the proposed solution in #7887, but not instantly, as I was able to recreate and run operations on the cluster on Thursday and Friday.
To Reproduce
Unable to reproduce this in any of the test clusters.
Expected behavior
Volume attach/detach/delete operations should finish.
Support bundle for troubleshooting
Support ticket was sent to longhorn-support-bundle@suse.com.
Environment
Additional context
This only happens in the production cluster.
The volumes are stuck in a very weird state. My replication factor is set to 3 for all volumes, but Longhorn keeps 4 replicas alive.