Process xxx (ganesha.nfsd) of user 0 killed by SIGSEGV - dumping core #904

Closed
duduxiao opened this issue Feb 16, 2023 · 24 comments

@duduxiao

Hi! We are running nfs-ganesha + GlusterFS. Recently nfs-ganesha was killed by SIGSEGV (the core was caught by abrt), and we can't find the reason.

Environment info:
glusterfs 9.6
NFS-Ganesha Release = V4.3

glusterfs Node info:

[root@k8s-node-1 ccpp-2023-02-15-13:10:38-10086]# gluster vol status k8s-data
Status of volume: k8s-data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick k8s-node-1:/data/k8s-gluster-data     49152     0          Y       14065
Brick k8s-node-2:/data/k8s-gluster-data     49152     0          Y       14188
Brick k8s-node-3:/data/k8s-gluster-data     49152     0          Y       13726
Self-heal Daemon on localhost               N/A       N/A        Y       14082
Self-heal Daemon on k8s-node-2              N/A       N/A        Y       14205
Self-heal Daemon on k8s-node-4              N/A       N/A        Y       14551
Self-heal Daemon on k8s-node-5              N/A       N/A        Y       13523
Self-heal Daemon on k8s-node-3              N/A       N/A        Y       13743

Task Status of Volume k8s-data
------------------------------------------------------------------------------
There are no active volume tasks

System log is:
Process 10086 (ganesha.nfsd) of user 0 killed by SIGSEGV - dumping core

abrt reason file:

ganesha.nfsd killed by SIGSEGV

abrt coredump:

(gdb) bt
#0  0x00007f69287264fb in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:36
#1  <signal handler called>
#2  0x00007f692a646318 in dec_nfs4_state_ref (state=0x7f68480252b0) at /usr/src/debug/nfs-ganesha-4.3/src/SAL/nfs4_state_id.c:514
#3  0x00007f692a68f888 in dec_state_t_ref (state=<optimized out>) at /usr/src/debug/nfs-ganesha-4.3/src/include/sal_functions.h:434
#4  nfs4_op_free_stateid (op=<optimized out>, data=<optimized out>, resp=0x7f68180efed0) at /usr/src/debug/nfs-ganesha-4.3/src/Protocols/NFS/nfs4_op_free_stateid.c:105
#5  0x00007f692a688598 in process_one_op (data=data@entry=0x7f681810b920, status=status@entry=0x7f68d1cde8cc) at /usr/src/debug/nfs-ganesha-4.3/src/Protocols/NFS/nfs4_Compound.c:912
#6  0x00007f692a689798 in nfs4_Compound (arg=0x7f681808d5a8, req=<optimized out>, res=0x7f68180f8810) at /usr/src/debug/nfs-ganesha-4.3/src/Protocols/NFS/nfs4_Compound.c:1376
#7  0x00007f692a5ffcd9 in nfs_rpc_process_request (reqdata=<optimized out>, retry=<optimized out>) at /usr/src/debug/nfs-ganesha-4.3/src/MainNFSD/nfs_worker_thread.c:1499
#8  0x00007f692a38365c in svc_request () from /lib64/libntirpc.so.4.3
#9  0x00007f692a380811 in svc_rqst_xprt_task_recv () from /lib64/libntirpc.so.4.3
#10 0x00007f692a381207 in svc_rqst_epoll_loop () from /lib64/libntirpc.so.4.3
#11 0x00007f692a38becd in work_pool_thread () from /lib64/libntirpc.so.4.3
#12 0x00007f692871eea5 in start_thread (arg=0x7f68d1ce0700) at pthread_create.c:307
#13 0x00007f692823f96d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Please give me some suggestions on how to troubleshoot this problem.

@dang
Contributor

dang commented Feb 16, 2023

It looks like we failed to get the export, and so the error handling code tried to deref the state. However, this was the last state ref, and so the state was freed, but that depends on the op_ctx being set up correctly, and in particular the export being set in it. This was not the case, since the lookup of the export failed, so it de-ref'd a null pointer.
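For illustration, here is a standalone toy model of that failure mode (this is not actual Ganesha code; all names below are simplified stand-ins): the per-request op_ctx carries the export, and the final state unref frees the state through it, so a NULL fsal_export gets dereferenced.

/* Toy model of the crash path described above. Compiling and running this
 * segfaults the same way as frame #2 (dec_nfs4_state_ref) in the backtrace. */
#include <stdlib.h>

struct fsal_export;

struct export_ops {
    void (*free_state)(struct fsal_export *exp, void *state);
};

struct fsal_export {
    struct export_ops exp_ops;
};

struct req_op_context {
    struct fsal_export *fsal_export;    /* NULL when the export lookup failed */
};

static struct req_op_context *op_ctx;

static void dec_state_ref(void *state)
{
    /* Last reference: free the state through the export in op_ctx.
     * If the export lookup failed, fsal_export is NULL -> SIGSEGV. */
    op_ctx->fsal_export->exp_ops.free_state(op_ctx->fsal_export, state);
}

int main(void)
{
    struct req_op_context ctx = { .fsal_export = NULL };    /* export lookup failed */

    op_ctx = &ctx;
    dec_state_ref(malloc(64));    /* crashes here, mirroring the reported core */
    return 0;
}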

@ffilz I'm not sure what to do about this. It seems like it will be a not-uncommon case when an export is removed with outstanding state?

@amartel

amartel commented Feb 20, 2023

Hi,
I have exactly the same problem (with an identical backtrace) using nfs-ganesha 4.3 + Ceph.
It occurs every 3-4 days (with no configuration or export update), and I would also be very interested in a workaround.

@ffilz ffilz added the bug label Feb 20, 2023
@ffilz
Member

ffilz commented Feb 20, 2023

Hmm, part of the problem is that the code path doesn't set op_ctx...

The get_state_obj_export_owner_refs() might or might not be failing, though if the state is being deleted for other reasons, it might fail, but we obviously have the only reference.

I don't think the export is being removed, though that would be an interesting race. I need to look at whether it's possible to get a ref on the state and then have another thread remove the export (which WOULD start removal of the state).

I'll have to think about how to solve this code path...

@duduxiao
Author

@ffilz That's bad news! Do you need me to collect more logs?
For now I have to monitor the ganesha process and restart the server when it is killed... Otherwise it will cause my k8s cluster to crash.

@amartel

amartel commented Feb 21, 2023

@duduxiao I updated my nfs-ganesha service (/lib/systemd/system/nfs-ganesha.service) to include

[Service]
...
Restart=on-failure
RestartSec=2

so nfs-ganesha automatically restarts after 2 seconds, and it's almost transparent for my clients.
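(After editing the unit file, a systemctl daemon-reload is needed for the new Restart= settings to take effect.)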

@ffilz I had the same feeling (another thread has removed the export) and I wonder if, as a (dirty) workaround, we can add a test before calling free_state (in src/SAL/nfs4_state_id.c, line 514):

	if ((op_ctx != NULL) && (op_ctx->fsal_export != NULL))
		op_ctx->fsal_export->exp_ops.free_state(op_ctx->fsal_export, state);

I know it may produce a memory leak (or worse), but I can test this "workaround" on my ganesha server to see if ganesha is still crashing every 3-4 days...
What do you think??

@duduxiao
Author

@amartel Thanks, this is a good approach. For now I use crontab to monitor and restart the ganesha service.

@ffilz
Member

ffilz commented Feb 22, 2023

@ffilz That's bad news! Do you need me to collect more logs? For now I have to monitor the ganesha process and restart the server when it is killed... Otherwise it will cause my k8s cluster to crash.

Thanks, but I have enough information to understand the problem. Now I just need to find the time to fix it.

I'm glad that a workaround of restarting Ganesha is allowing you to continue but obviously we need an actual fix...

@ffilz
Member

ffilz commented Feb 22, 2023

@duduxiao I updated my nfs-ganesha service (/lib/systemd/system/nfs-ganesha.service) to include

[Service]
...
Restart=on-failure
RestartSec=2

so nfs-ganesha automatically restarts after 2 seconds, and it's almost transparent for my clients.

@ffilz I had the same feeling (another thread has removed the export) and I wonder if, as a (dirty) workaround, we can add a test before calling free_state (in src/SAL/nfs4_state_id.c, line 514):

	if ((op_ctx != NULL) && (op_ctx->fsal_export != NULL))
		op_ctx->fsal_export->exp_ops.free_state(op_ctx->fsal_export, state);

I know it may produce a memory leak (or worse), but I can test this "workaround" on my ganesha server to see if ganesha is still crashing every 3-4 days... What do you think??

Yea, that will cause a memory leak, but that might be better than constant crashing. And it's a pretty small object being leaked. Try it out and see how fast memory grows.

@amartel

amartel commented Feb 23, 2023

OK. I built and deployed a package with the "workaround" and, for now, no crash has occurred. Let's see what happens over the next 3-4 days...

@ffilz
Member

ffilz commented Feb 27, 2023

Issue #909 is a duplicate of this, but it provides a way to re-create the crash:

On the other server, xfstests are compiled from source. Their local.config file is:

export TEST_DEV=ubuntu2304beta-1:/nfs-export-alex
export TEST_DIR=/mnt/nfs-export-alex
export NFS_MOUNT_OPTIONS="-o rw,relatime,vers=4.1,soft,nosharecache"

To crash nfs-ganesha reliably, run this command in the xfstests-dev directory as root:

./check -nfs generic/089

@amartel

amartel commented Mar 6, 2023

Just for information: the ganesha server has not crashed since I deployed the workaround (2 weeks ago), and I don't notice any memory increase. I haven't added a log call to count how many crashes have been avoided but, right now, the ganesha server is perfectly stable...

derekbit added a commit to derekbit/nfs-ganesha that referenced this issue Mar 9, 2023
nfs-ganesha#904 (comment)

Signed-off-by: Derek Su <derek.su@suse.com>
derekbit added a commit to rancher/nfs-ganesha that referenced this issue Mar 10, 2023
nfs-ganesha#904 (comment)

Signed-off-by: Derek Su <derek.su@suse.com>
@patrakov

For those who want some official version: 4.0.8 is the last good one.

The first bad commit, apparently, is:

3fcf165356f891e0feadf02e59c9ea7c6a866917 is the first bad commit
commit 3fcf165356f891e0feadf02e59c9ea7c6a866917
Author: Frank S. Filz <ffilzlnx@mindspring.com>
Date:   Mon Jun 27 16:26:25 2022 -0700

    Remove state_exp from state_t

@ffilz
Member

ffilz commented Mar 14, 2023

OK, a better fix is to make free_state not dependent on the export. In all in-tree FSALs, alloc_state allocates space for a state_t and a fsal_fd, with the state_t first, so simply gsh_free(state) will work. If a FSAL needs different behavior, there is a state_free field now in the state_t where the FSAL can put a function pointer.
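For illustration, a standalone sketch of that idea (not the actual Gerrit patch; the names below are simplified stand-ins rather than the real nfs-ganesha definitions):

/* Sketch: the state carries an optional state_free callback, so the final
 * unref no longer needs op_ctx->fsal_export at all. */
#include <stdio.h>
#include <stdlib.h>

struct state;
typedef void (*state_free_fn)(struct state *);

struct state {
    state_free_fn state_free;    /* set by a FSAL that needs custom teardown */
    char fsal_private[64];       /* in-tree FSALs append their fsal_fd here,
                                  * so freeing the whole block is enough */
};

static void free_state(struct state *s)
{
    if (s->state_free != NULL)
        s->state_free(s);    /* FSAL-specific destructor */
    else
        free(s);             /* default: one flat allocation */
}

int main(void)
{
    struct state *s = calloc(1, sizeof(*s));    /* alloc_state() analogue */

    free_state(s);    /* works even when no export is set in the request context */
    puts("state freed without touching op_ctx->fsal_export");
    return 0;
}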

Please test this patch: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/551045

@patrakov

Tested that exact commit, it still crashes.

@ffilz
Member

ffilz commented Mar 15, 2023

Tested that exact commit, it still crashes.

Hmm, what is the back trace?

I need to try and re-create the issue, but the issue was clear from the code.

@patrakov

I need to reproduce again in order to recreate the stack trace, and will do that in a few minutes - but honestly, I would prefer collaborating with you to make sure you can recreate the issue without asking me. How about a video call on Jitsi Meet? Or would you prefer a pair of cloud instances accessible via SSH?

@patrakov

patrakov commented Mar 15, 2023

EDIT: the package was misbuilt, the backtrace that was here before the edit is invalid.

@ffilz
Member

ffilz commented Mar 15, 2023

I need to reproduce again in order to recreate the stack trace, and will do that in a few minutes - but honestly, I would prefer collaborating with you to make sure you can recreate the issue without asking me. How about a video call on Jitsi Meet? Or would you prefer a pair of cloud instances accessible via SSH?

OK, so what do I need to re-create? That's clearly the next step since the obvious fix isn't working.

@ffilz
Member

ffilz commented Mar 15, 2023

What FSAL are you using?

@patrakov

patrakov commented Mar 15, 2023

The FSAL is "VFS". To recreate, you need two virtual machines or cloud instances, set up as described in #909

You are welcome to join a video conference here: (EDIT: link edited out, as the video meeting is complete) so that I can guide you through the setup.

@patrakov

Sorry, it might be an error in my build (I misapplied the patch); I have to retest.

@ffilz
Member

ffilz commented Mar 15, 2023

I did end up recreating it without the fix, but I actually got a different back trace, which suggests that the fix I made, which eliminates the possibility across several races, is the right way to go.

Also, a periodic helpful reminder - I am generally available on IRC on the #ganesha channel on Libera.Chat

@patrakov

Retested, and the fix indeed works.

@ffilz
Member

ffilz commented Apr 21, 2023

V5.0 has been released. Closing.

@ffilz ffilz closed this as completed Apr 21, 2023