Process xxx (ganesha.nfsd) of user 0 killed by SIGSEGV - dumping core #904
Comments
It looks like we failed to get the export, and so the error handling code tried to deref the state. However, this was the last state ref, so the state was freed, but freeing the state depends on the op_ctx being set up correctly, and in particular on the export being set in it. That was not the case, since the lookup of the export failed, so it de-ref'd a null pointer. @ffilz I'm not sure what to do about this. It seems like it will be a not-uncommon case when an export is removed with outstanding state?
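To make that path concrete, here is a rough illustration (a sketch only; get_state_obj_export_owner_refs() is named later in this thread, while op_ctx->ctx_export and dec_state_t_ref() are assumptions on my part, not code copied from the tree):

```c
/*
 * Hedged illustration of the failure path, not actual tree code:
 * the export lookup fails, op_ctx never gets an export attached,
 * yet the error path still drops the last state reference.
 */
if (!get_state_obj_export_owner_refs(state, &obj, &export, &owner)) {
	/* Export lookup failed, so op_ctx->ctx_export was never set. */
	dec_state_t_ref(state);	/* if this is the last ref, the free path
				 * dereferences the NULL export -> SIGSEGV */
	return;
}
```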
Hi,
Hmm, part of the problem is that the code path doesn't set op_ctx... The get_state_obj_export_owner_refs() might or might not be failing; if the state is being deleted for other reasons it might fail, but we obviously have the only reference. I don't think the export is being removed, though that would be an interesting race. I need to look at whether it's possible to get a ref on the state and then have another thread remove the export (which WOULD start removal of the state). I'll have to think about how to solve this code path...
@ffilz That's bad news! Do you need me to collect more logs?
@duduxiao I updated my nfs-ganesha service (/lib/systemd/system/nfs-ganesha.service) to include
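something along these lines (a sketch only; the exact directive names and values here are an assumption, not the original unit-file excerpt):

```ini
# Sketch: standard systemd restart directives assumed,
# not the original excerpt from nfs-ganesha.service.
[Service]
Restart=on-failure
RestartSec=2
```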
so nfs-ganesha automatically restarts after 2 seconds and it's almost transparent for my clients. @ffilz I had the same feeling (another thread has removed the export) and I wonder if, as a (dirty) workaround, we can add a test before calling free_state (in src/SAL/nfs4_state_id.c, line 514):
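Something like this (a sketch only; the op_ctx->ctx_export field name and the shape of the free_state() call are assumptions and may not match the actual code at that line):

```c
/*
 * Hedged sketch of the proposed guard, not the exact diff: only free
 * the state when an export is still attached to op_ctx, otherwise
 * accept a small state_t leak instead of dereferencing a NULL export.
 */
if (op_ctx != NULL && op_ctx->ctx_export != NULL)
	free_state(state);
/* else: intentionally leak the state_t rather than crash */
```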
I know it may produce a memory leak (or worse) but I can test this "workaround" on my ganesha server to see if ganesha is still crashing every 3-4 days...
@amartel Thanks, this is a good way. For now I use crontab to monitor and restart the ganesha service.
Thanks, but I have enough information to understand the problem. Now I just need to find the time to fix it. I'm glad that a workaround of restarting Ganesha is allowing you to continue, but obviously we need an actual fix...
Yea, that will cause a memory leak, but that might be better than constant crashing. And it's a pretty small object being leaked. Try it out and see how fast memory grows.
OK. I built and deployed a package with the "workaround" and, for now, no crash has occurred. Let's see what happens over the next 3-4 days...
Issue #909 is a duplicate of this, but provides a re-create:
Just for information, the ganesha server has never crashed since I deployed the workaround (2 weeks ago) and I haven't noticed any memory increase. I haven't added a log call to count how many crashes have been avoided but, right now, the ganesha server is perfectly stable...
For those who want an official version: 4.0.8 is the last good one. The first bad commit, apparently, is:
OK, a better fix is to make free_state not dependent on the export. In all in-tree FSALs, alloc_state allocates space for a state_t and a fsal_fd, with the state_t first, so simply gsh_free(state) will work. If a FSAL needs different behavior, there is a state_free field now in the state_t where the FSAL can put a function pointer. Please test this patch: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/551045
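Roughly this shape, as described (a sketch only; state_free and gsh_free are named above, while the helper name and surrounding structure are assumed rather than taken from the actual patch):

```c
/*
 * Hedged sketch of the export-independent free path described above;
 * release_state is a hypothetical helper name, not from the patch.
 */
static void release_state(struct state_t *state)
{
	if (state->state_free != NULL)
		state->state_free(state);	/* FSAL-specific teardown */
	else
		gsh_free(state);	/* state_t sits at the start of the
					 * allocation, so a plain free works */
}
```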
Tested that exact commit, it still crashes.
Hmm, what is the back trace? I need to try and re-create the issue, but the issue was clear from the code.
I need to reproduce it again in order to recreate the stack trace, and will do that in a few minutes - but honestly, I would prefer to collaborate with you so that you can recreate the issue without asking me. How about a video call on Jitsi Meet? Or would you prefer a pair of cloud instances accessible via SSH?
EDIT: the package was misbuilt, the backtrace that was here before the edit is invalid.
OK, so what do I need to re-create? That's clearly the next step since the obvious fix isn't working.
What FSAL are you using?
The FSAL is "VFS". To recreate, you need two virtual machines or cloud instances, set up as described in #909 You are welcome to join a video conference here: (EDIT: link edited out, as the video meeting is complete) so that I can guide you through the setup. |
Sorry, it might be an error in my build (I misapplied the patch); I have to retest.
I did end up recreating it without the fix, but I actually got a different back trace, which suggests that the fix I made, which eliminates the possibility across several races, is the right way to go. Also, a periodic helpful reminder: I am generally available on IRC on the #ganesha channel on Libera.Chat
Retested, and the fix indeed works.
V5.0 has been released. Closing. |
Hi! We are running nfs-ganesha + GlusterFS. Recently, nfs-ganesha has been killed by SIGSEGV (caught by abrt), and we can't find the reason.
Environment info:
glusterfs 9.6
NFS-Ganesha Release = V4.3
glusterfs Node info:
[root@k8s-node-1 ccpp-2023-02-15-13:10:38-10086]# gluster vol status k8s-data
Status of volume: k8s-data
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick k8s-node-1:/data/k8s-gluster-data 49152 0 Y 14065
Brick k8s-node-2:/data/k8s-gluster-data 49152 0 Y 14188
Brick k8s-node-3:/data/k8s-gluster-data 49152 0 Y 13726
Self-heal Daemon on localhost N/A N/A Y 14082
Self-heal Daemon on k8s-node-2 N/A N/A Y 14205
Self-heal Daemon on k8s-node-4 N/A N/A Y 14551
Self-heal Daemon on k8s-node-5 N/A N/A Y 13523
Self-heal Daemon on k8s-node-3 N/A N/A Y 13743
Task Status of Volume k8s-data
------------------------------------------------------------------------------
There are no active volume tasks
System log is:
Process 10086 (ganesha.nfsd) of user 0 killed by SIGSEGV - dumping core
abrt reason file:
ganesha.nfsd killed by SIGSEGV
abrt coredump:
Please give me some suggestions on how to troubleshoot this problem.