Process xxx (ganesha.nfsd) of user 0 killed by SIGSEGV - dumping core #904

Closed
duduxiao opened this issue Feb 16, 2023 · 24 comments

@duduxiao

Hi! We are running nfs-ganesha + GlusterFS. Recently nfs-ganesha was killed by SIGSEGV (the core was caught by abrt), and we can't find the reason.

Environment info:
glusterfs 9.6
NFS-Ganesha Release = V4.3

glusterfs Node info:

[root@k8s-node-1 ccpp-2023-02-15-13:10:38-10086]# gluster vol status k8s-data
Status of volume: k8s-data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick k8s-node-1:/data/k8s-gluster-data     49152     0          Y       14065
Brick k8s-node-2:/data/k8s-gluster-data     49152     0          Y       14188
Brick k8s-node-3:/data/k8s-gluster-data     49152     0          Y       13726
Self-heal Daemon on localhost               N/A       N/A        Y       14082
Self-heal Daemon on k8s-node-2              N/A       N/A        Y       14205
Self-heal Daemon on k8s-node-4              N/A       N/A        Y       14551
Self-heal Daemon on k8s-node-5              N/A       N/A        Y       13523
Self-heal Daemon on k8s-node-3              N/A       N/A        Y       13743

Task Status of Volume k8s-data
------------------------------------------------------------------------------
There are no active volume tasks

System log is:
Process 10086 (ganesha.nfsd) of user 0 killed by SIGSEGV - dumping core

abrt reason file:

ganesha.nfsd killed by SIGSEGV

abrt coredump:

(gdb) bt
#0  0x00007f69287264fb in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:36
#1  <signal handler called>
#2  0x00007f692a646318 in dec_nfs4_state_ref (state=0x7f68480252b0) at /usr/src/debug/nfs-ganesha-4.3/src/SAL/nfs4_state_id.c:514
#3  0x00007f692a68f888 in dec_state_t_ref (state=<optimized out>) at /usr/src/debug/nfs-ganesha-4.3/src/include/sal_functions.h:434
#4  nfs4_op_free_stateid (op=<optimized out>, data=<optimized out>, resp=0x7f68180efed0) at /usr/src/debug/nfs-ganesha-4.3/src/Protocols/NFS/nfs4_op_free_stateid.c:105
#5  0x00007f692a688598 in process_one_op (data=data@entry=0x7f681810b920, status=status@entry=0x7f68d1cde8cc) at /usr/src/debug/nfs-ganesha-4.3/src/Protocols/NFS/nfs4_Compound.c:912
#6  0x00007f692a689798 in nfs4_Compound (arg=0x7f681808d5a8, req=<optimized out>, res=0x7f68180f8810) at /usr/src/debug/nfs-ganesha-4.3/src/Protocols/NFS/nfs4_Compound.c:1376
#7  0x00007f692a5ffcd9 in nfs_rpc_process_request (reqdata=<optimized out>, retry=<optimized out>) at /usr/src/debug/nfs-ganesha-4.3/src/MainNFSD/nfs_worker_thread.c:1499
#8  0x00007f692a38365c in svc_request () from /lib64/libntirpc.so.4.3
#9  0x00007f692a380811 in svc_rqst_xprt_task_recv () from /lib64/libntirpc.so.4.3
#10 0x00007f692a381207 in svc_rqst_epoll_loop () from /lib64/libntirpc.so.4.3
#11 0x00007f692a38becd in work_pool_thread () from /lib64/libntirpc.so.4.3
#12 0x00007f692871eea5 in start_thread (arg=0x7f68d1ce0700) at pthread_create.c:307
#13 0x00007f692823f96d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Please give me some suggestions on how to troubleshoot this problem.

@dang
Contributor

dang commented Feb 16, 2023

It looks like we failed to get the export, and so the error handling code tried to deref the state. However, this was the last state ref, and so the state was freed, but that depends on the op_ctx being set up correctly, and in particular the export being set in it. This was not the case, since the lookup of the export failed, so it de-ref'd a null pointer.
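For illustration, here is a standalone toy model of that failure mode (this is not actual Ganesha code; all names below are simplified stand-ins): the per-request op_ctx carries the export, and the final state unref frees the state through it, so a NULL fsal_export gets dereferenced.

/* Toy model of the crash path described above. Compiling and running this
 * segfaults the same way as frame #2 (dec_nfs4_state_ref) in the backtrace. */
#include <stdlib.h>

struct fsal_export;

struct export_ops {
    void (*free_state)(struct fsal_export *exp, void *state);
};

struct fsal_export {
    struct export_ops exp_ops;
};

struct req_op_context {
    struct fsal_export *fsal_export;    /* NULL when the export lookup failed */
};

static struct req_op_context *op_ctx;

static void dec_state_ref(void *state)
{
    /* Last reference: free the state through the export in op_ctx.
     * If the export lookup failed, fsal_export is NULL -> SIGSEGV. */
    op_ctx->fsal_export->exp_ops.free_state(op_ctx->fsal_export, state);
}

int main(void)
{
    struct req_op_context ctx = { .fsal_export = NULL };    /* export lookup failed */

    op_ctx = &ctx;
    dec_state_ref(malloc(64));    /* crashes here, mirroring the reported core */
    return 0;
}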

@ffilz I'm not sure what to do about this. It seems like it will be a not-uncommon case when an export is removed with outstanding state?

@amartel

amartel commented Feb 20, 2023

Hi,
I have exactly the same problem (with an identical backtrace) using nfs-ganesha 4.3 + Ceph.
It occurs every 3-4 days (with no configuration or export update), and I would also be very interested in a workaround.

@ffilz ffilz added the bug label Feb 20, 2023
@ffilz
Member

ffilz commented Feb 20, 2023

Hmm, part of the problem is that the code path doesn't set op_ctx...

The get_state_obj_export_owner_refs() might or might not be failing, though if the state is being deleted for other reasons, it might fail, but we obviously have the only reference.

I don't think the export is being removed, though that would be an interesting race. I need to look at whether it's possible to get a ref on the state and then have another thread remove the export (which WOULD start removal of the state).

I'll have to think about how to solve this code path...

@duduxiao
Author

@ffilz That's bad news! Do you need me to collect more logs?
For now I have to monitor the ganesha process and restart the server when it is killed... Otherwise it will cause my k8s cluster to crash.

@amartel

amartel commented Feb 21, 2023

@duduxiao I updated my nfs-ganesha service (/lib/systemd/system/nfs-ganesha.service) to include

[Service]
...
Restart=on-failure
RestartSec=2

so nfs-ganesha automatically restarts after 2 seconds, and it's almost transparent for my clients.
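(After editing the unit file, a systemctl daemon-reload is needed for the new Restart= settings to take effect.)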

@ffilz I had the same feeling (another thread has removed the export) and I wonder if, as a (dirty) workaround, we can add a test before calling free_state (in src/SAL/nfs4_state_id.c, line 514):

	if ((op_ctx != NULL) && (op_ctx->fsal_export != NULL))
		op_ctx->fsal_export->exp_ops.free_state(op_ctx->fsal_export, state);

I know it may produce a memory leak (or worse), but I can test this "workaround" on my ganesha server to see if ganesha is still crashing every 3-4 days...
What do you think??

@duduxiao
Author

@amartel Thanks, this is a good approach. For now I use crontab to monitor and restart the ganesha service.

@ffilz
Member

ffilz commented Feb 22, 2023

@ffilz That's bad news! Do you need me to collect more logs? For now I have to monitor the ganesha process and restart the server when it is killed... Otherwise it will cause my k8s cluster to crash.

Thanks, but I have enough information to understand the problem. Now I just need to find the time to fix it.

I'm glad that a workaround of restarting Ganesha is allowing you to continue but obviously we need an actual fix...

@ffilz
Member

ffilz commented Feb 22, 2023

@duduxiao I updated my nfs-ganesha service (/lib/systemd/system/nfs-ganesha.service) to include

[Service]
...
Restart=on-failure
RestartSec=2

so nfs-ganesha automatically restarts after 2 seconds, and it's almost transparent for my clients.

@ffilz I had the same feeling (another thread has removed the export) and I wonder if, as a (dirty) workaround, we can add a test before calling free_state (in src/SAL/nfs4_state_id.c, line 514):

	if ((op_ctx != NULL) && (op_ctx->fsal_export != NULL))
		op_ctx->fsal_export->exp_ops.free_state(op_ctx->fsal_export, state);

I know it may produce a memory leak (or worse), but I can test this "workaround" on my ganesha server to see if ganesha is still crashing every 3-4 days... What do you think??

Yea, that will cause a memory leak, but that might be better than constant crashing. And it's a pretty small object being leaked. Try it out and see how fast memory grows.

@amartel

amartel commented Feb 23, 2023

OK. I built and deployed a package with the "workaround" and, for now, no crash has occurred. Let's see what happens over the next 3-4 days...

@ffilz
Member

ffilz commented Feb 27, 2023

Issue #909 is a duplicate of this, but it provides a way to re-create the crash:

On the other server, xfstests are compiled from source. Their local.config file is:

export TEST_DEV=ubuntu2304beta-1:/nfs-export-alex
export TEST_DIR=/mnt/nfs-export-alex
export NFS_MOUNT_OPTIONS="-o rw,relatime,vers=4.1,soft,nosharecache"

To crash nfs-ganesha reliably, run this command in the xfstests-dev directory as root:

./check -nfs generic/089

@amartel

amartel commented Mar 6, 2023

Just for information: the ganesha server has not crashed since I deployed the workaround (2 weeks ago), and I don't notice any memory increase. I haven't added a log call to count how many crashes have been avoided but, right now, the ganesha server is perfectly stable...

derekbit added a commit to derekbit/nfs-ganesha that referenced this issue Mar 9, 2023
nfs-ganesha#904 (comment)

Signed-off-by: Derek Su <derek.su@suse.com>
derekbit added a commit to rancher/nfs-ganesha that referenced this issue Mar 10, 2023
nfs-ganesha#904 (comment)

Signed-off-by: Derek Su <derek.su@suse.com>
@patrakov

For those who want some official version: 4.0.8 is the last good one.

The first bad commit, apparently, is:

3fcf165356f891e0feadf02e59c9ea7c6a866917 is the first bad commit
commit 3fcf165356f891e0feadf02e59c9ea7c6a866917
Author: Frank S. Filz <ffilzlnx@mindspring.com>
Date:   Mon Jun 27 16:26:25 2022 -0700

    Remove state_exp from state_t

@ffilz
Member

ffilz commented Mar 14, 2023

OK, a better fix is to make free_state not dependent on the export. In all in-tree FSALs, alloc_state allocates space for a state_t and a fsal_fd, with the state_t first, so simply gsh_free(state) will work. If a FSAL needs different behavior, there is a state_free field now in the state_t where the FSAL can put a function pointer.
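For illustration, a standalone sketch of that idea (not the actual Gerrit patch; the names below are simplified stand-ins rather than the real nfs-ganesha definitions):

/* Sketch: the state carries an optional state_free callback, so the final
 * unref no longer needs op_ctx->fsal_export at all. */
#include <stdio.h>
#include <stdlib.h>

struct state;
typedef void (*state_free_fn)(struct state *);

struct state {
    state_free_fn state_free;    /* set by a FSAL that needs custom teardown */
    char fsal_private[64];       /* in-tree FSALs append their fsal_fd here,
                                  * so freeing the whole block is enough */
};

static void free_state(struct state *s)
{
    if (s->state_free != NULL)
        s->state_free(s);    /* FSAL-specific destructor */
    else
        free(s);             /* default: one flat allocation */
}

int main(void)
{
    struct state *s = calloc(1, sizeof(*s));    /* alloc_state() analogue */

    free_state(s);    /* works even when no export is set in the request context */
    puts("state freed without touching op_ctx->fsal_export");
    return 0;
}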

Please test this patch: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/551045

@patrakov

Tested that exact commit, it still crashes.

@ffilz
Member

ffilz commented Mar 15, 2023

Tested that exact commit, it still crashes.

Hmm, what is the back trace?

I need to try and re-create the issue, but the issue was clear from the code.

@patrakov

I need to reproduce again in order to recreate the stack trace, and will do that in a few minutes - but honestly, I would prefer collaborating with you to make sure you can recreate the issue without asking me. How about a video call on Jitsi Meet? Or would you prefer a pair of cloud instances accessible via SSH?

@patrakov

patrakov commented Mar 15, 2023

EDIT: the package was misbuilt, the backtrace that was here before the edit is invalid.

@ffilz
Member

ffilz commented Mar 15, 2023

I need to reproduce again in order to recreate the stack trace, and will do that in a few minutes - but honestly, I would prefer collaborating with you to make sure you can recreate the issue without asking me. How about a video call on Jitsi Meet? Or would you prefer a pair of cloud instances accessible via SSH?

OK, so what do I need to re-create? That's clearly the next step since the obvious fix isn't working.

@ffilz
Member

ffilz commented Mar 15, 2023

What FSAL are you using?

@patrakov

patrakov commented Mar 15, 2023

The FSAL is "VFS". To recreate, you need two virtual machines or cloud instances, set up as described in #909

You are welcome to join a video conference here: (EDIT: link edited out, as the video meeting is complete) so that I can guide you through the setup.

@patrakov

Sorry, it might be an error in my build (I misapplied the patch); I have to retest.

@ffilz
Member

ffilz commented Mar 15, 2023

I did end up recreating it without the fix, but I actually got a different back trace, which suggests that the fix I made, which eliminates the possibility across several races, is the right way to go.

Also, a periodic helpful reminder - I am generally available on IRC on the #ganesha channel on Libera.Chat

@patrakov

Retested, and the fix indeed works.

@ffilz
Member

ffilz commented Apr 21, 2023

V5.0 has been released. Closing.

@ffilz ffilz closed this as completed Apr 21, 2023