frequent failures with PROXY FSAL #580

Closed
toddpfaff opened this issue May 15, 2020 · 37 comments

@toddpfaff

I've been using nfs-ganesha with the PROXY FSAL for a while, first with 2.x versions and now with the "next" branch, currently at version 4-dev.17.

With older versions and with 4-dev.17, I'm frequently encountering a problem when copying data with rsync on an NFS client to a path that is mounted via the PROXY FSAL. Often the rsync process will terminate with an error like this:

rsync: close failed on "/tmp/mnt/junk/boot/efi/EFI/BOOT/.fallback.efi.fkPpCQ": Invalid argument (22)
rsync error: error in file IO (code 11) at receiver.c(859) [receiver=3.1.2]

Sometimes I can simply restart the rsync process and it will make a bit more progress and then abort again in the same way on another file. Other times the NFS mount will have gone stale or will hang when accessed. Sometimes I can simply restart the nfs-ganesha service to get the client working again, until the next repeat of this problem. Other times I have been able to unmount and remount the NFS client to get things sorted. Other times I have needed to reboot the client. I have also encountered cases where the NFS client was wedged and needed a power cycle.

In all cases, the nfs-ganesha server does not crash and usually continues to operate fine for NFS clients other than the one where the problem occurred, so there is no crash dump or backtrace to provide.

I can usually trigger the problem quickly on demand with rsync, for example by simply copying a typical Linux root filesystem from the NFS client to the server via the ganesha PROXY.

I've captured a couple of ganesha logs corresponding to rsync commands as described above with the following debug options in ganesha.conf:

NFSPROTO=FULL_DEBUG;
NFS_V4=FULL_DEBUG;
NFS_V4_LOCK=FULL_DEBUG;
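
(For context, component log levels like these normally sit inside the LOG block's COMPONENTS stanza in ganesha.conf; a minimal sketch of how that wrapping usually looks:)

LOG {
    COMPONENTS {
        NFSPROTO = FULL_DEBUG;
        NFS_V4 = FULL_DEBUG;
        NFS_V4_LOCK = FULL_DEBUG;
        # FSAL = FULL_DEBUG;   # FSAL-level debugging can be enabled the same way
    }
}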

I've uploaded those logs here:
ganesha.log.1.txt.gz
ganesha.log.2.txt.gz

My operating environment is CentOS 7 on all systems: the NFS client host, the nfs-ganesha PROXY FSAL server, and the NFS server host behind the ganesha NFS PROXY.

Please let me know if there's anything else I can provide to help debug this problem.

@boulos
Contributor

boulos commented May 18, 2020

Could you add FSAL=DEBUG (or FULL_DEBUG) to your config?

I assume your backend is the standard nfs-kernel-server / knfsd, right? Which CentOS 7 release and kernel version are you running (just for completeness; I assume this is a bug in what is now renamed PROXY_V4)?

Can you also paste your full rsync command (scrubbed if you prefer)? Does this only happen with your EFI bootloader directories or is that just an example?

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@boulos
Contributor

boulos commented May 18, 2020

No explicit errors, but did you have an rsync failure there for the file mentioned here:

fsal_remove :FSAL :F_DBG :.BOOTX64.EFI.y3rFjc

The most visible thing from $ grep proxy is the renewal of the client sessions.

$ grep -i proxy ganesha.log.4.txt | grep -v Recmark | grep -v XID

shows the "build them up" phase, a run of about a minute (not the 90s it claims), and then the client renewals being issued.

Do you have a timestamp for the first error you saw for this log?

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@ffilz
Member

ffilz commented May 18, 2020

That suggests a problem with keeping the clientid/session renewed. Since PROXY_V4 uses NFSv4.1, it seems odd that the clientid/session would not be kept renewed, since basically every request must start with a SEQUENCE op that renews it. Maybe something is stalling activity for long enough to put that renewal at risk, and PROXY_V4 isn't doing anything to renew the clientid/session in the background? I'm not really familiar with the PROXY_V4 code at all.
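
(A rough illustration of the kind of background renewal being described here; this is not the actual PROXY_V4 code, and the helper functions are hypothetical:)

#include <time.h>
#include <unistd.h>

/* Hypothetical helpers standing in for the FSAL's RPC plumbing. */
extern time_t last_compound_sent(void);          /* when any compound last went out */
extern void send_sequence_only_compound(void);   /* bare SEQUENCE, renews clientid/session */

/* Renew from a background thread whenever the connection has been idle
 * for roughly half the server's lease time. */
void clientid_renewer(unsigned int lease_time)
{
	for (;;) {
		sleep(lease_time / 2);
		if (time(NULL) - last_compound_sent() >= lease_time / 2)
			send_sequence_only_compound();
	}
}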

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@boulos
Contributor

boulos commented May 18, 2020

Hmm, the inode failure may suggest that my code (PROXY_V3) has another "failed to clean up" bug that leaks "fds" (these are synthetic, there are no real fds). I just wrote this code recently, so at the very least you're now in code that I can more easily debug.

Unfortunately, my logging in PROXY_V3 was extremely noisy (if you happen to have a log for the proxy_v3 run, please attach it). I calmed it down a bunch in https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/492048 (which is slated for V4-dev.18, which I thought Frank pushed... hopefully it comes back today?).

Given a log from PROXY_V3, I'll likely make more progress (and I appreciate the bug report!).

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@boulos
Contributor

boulos commented May 18, 2020

I would probably either pull that patch in (and my memory leak patch, fwiw) or wait for V4-dev.18 to come back to life.

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@boulos
Contributor

boulos commented May 18, 2020

We're trying to reach @ffilz to have him push the tag so you can just clone v4-dev.18. It seems like he intended to on Friday afternoon (https://lists.nfs-ganesha.org/archives/list/devel@lists.nfs-ganesha.org/thread/Q5SIDAAQOGYYF6ZUQLRGJTWMFGEBT4GB/) but just didn't hit the button :).

@toddpfaff
Author

toddpfaff commented May 18, 2020 via email

@boulos
Contributor

boulos commented May 18, 2020

Just went up. So you should be able to pull the latest "next" branch and/or clone v4-dev.18 directly.

@boulos
Contributor

boulos commented May 26, 2020

Just checking in to see if you'd caught this again with the v4-dev.18 tag (and FSAL=DEBUG logging).

@toddpfaff
Author

toddpfaff commented May 28, 2020 via email

@ffilz ffilz added this to To do in Ganesha 4 Jun 11, 2020
@ffilz
Member

ffilz commented Jul 15, 2020

Should we keep this issue open?

@toddpfaff
Author

Yes, please leave open as this issue still exists. I've tested again tonight using 4-dev.26.

I can consistently reproduce the failure by trying to boot a kvm guest whose root-filesystem qcow2 file is accessed via the nfs-ganesha PROXY FSAL. The guest runs CentOS 7, and the boot always fails just after the switch root, with output similar to the following:

         Starting Switch Root...
[   15.785736] EXT4-fs error (device vda1): __ext4_get_inode_loc:4247:
inode #531070: block 2097607: comm systemd: unable to read itable block 
[FAILED] Failed to start Switch Root.
See 'systemctl status initrd-switch-root.service' for details
Generating "/run/initramfs/rdsosreport.txt"
[   15.969067] Buffer I/O error on dev vda1, logical block 8, async page read
[   15.976178] Buffer I/O error on dev vda1, logical block 16, async page read
[   15.986507] Buffer I/O error on dev vda2, logical block 262128, async page read
[   16.000794] Buffer I/O error on dev vda1, logical block 8, async page read
[   16.007216] Buffer I/O error on dev vda1, logical block 16, async page read
[   16.016846] Buffer I/O error on dev vda2, logical block 262128, async page read

Entering emergency mode. Exit the shell to continue.

I've captured an nfs-ganesha log with ALL=FULL_DEBUG during one of these kvm guest boot attempts; it's uploaded here:
ganesha.log-20200717-23:22.txt.gz

@boulos
Contributor

boulos commented Jul 22, 2020

Can you help correlate the timestamps here? Assuming you stopped the log roughly when the badness happened, I suspect this is related to the same bug elsewhere with trying to renew the session / connection:

$ grep "proxyv4" ganesha.log-20200717-23.22.txt | grep -v ":F_DBG" | grep -i renew
17/07/2020 23:17:43 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :DEBUG :Need 1 new client id
17/07/2020 23:17:43 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_setclientid :FSAL :EVENT :Negotiating a new ClientId with the remote server
17/07/2020 23:17:43 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_compoundv4_call :FSAL :DEBUG :First attempt to send XID 1595072921 with 160 bytes
17/07/2020 23:17:43 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :DEBUG :Need 1 new session id
17/07/2020 23:17:43 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_setsessionid :FSAL :DEBUG :Getting new session id for client id 91e6fa5eb6fdd300 with sequence id 1
17/07/2020 23:17:43 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_compoundv4_call :FSAL :DEBUG :First attempt to send XID 1595072922 with 168 bytes
17/07/2020 23:17:43 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_compoundv4_call :FSAL :DEBUG :First attempt to send XID 1595072923 with 140 bytes
17/07/2020 23:17:43 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_setsessionid :FSAL :DEBUG :Getting new lease 90
17/07/2020 23:19:07 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :DEBUG :Try renew session id for client id 91e6fa5eb6fdd300
17/07/2020 23:19:07 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_compoundv4_call :FSAL :DEBUG :First attempt to send XID 1595072931 with 116 bytes
17/07/2020 23:19:08 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :DEBUG :New session id for client id 10513365664846959360
17/07/2020 23:20:32 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :DEBUG :Try renew session id for client id 91e6fa5eb6fdd300
17/07/2020 23:20:32 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_compoundv4_call :FSAL :DEBUG :First attempt to send XID 1595072957 with 116 bytes
17/07/2020 23:20:33 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :EVENT :sr_status_flags received on renewing session with seqop : 1024
17/07/2020 23:21:57 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :DEBUG :Try renew session id for client id 91e6fa5eb6fdd300
17/07/2020 23:21:57 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_compoundv4_call :FSAL :DEBUG :First attempt to send XID 1595077361 with 116 bytes
17/07/2020 23:21:58 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :EVENT :sr_status_flags received on renewing session with seqop : 1088

My question is whether the badness starts at:

17/07/2020 23:20:33 : epoch 5f1269d7 : vh8 : ganesha.nfsd-34894[proxyv4_clientid_renewer] proxyv4_clientid_renewer :FSAL :EVENT :sr_status_flags received on renewing session with seqop : 1024

or if it was already busted before then.
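
(For reference, decoding those sr_status_flags values against the RFC 5661 SEQUENCE status bits, using the standard constant names:)

/* sr_status_flags bits, per RFC 5661 */
#define SEQ4_STATUS_RECALLABLE_STATE_REVOKED 0x00000040
#define SEQ4_STATUS_BACKCHANNEL_FAULT        0x00000400

/* 1024 == 0x400 == SEQ4_STATUS_BACKCHANNEL_FAULT
 * 1088 == 0x440 == SEQ4_STATUS_BACKCHANNEL_FAULT | SEQ4_STATUS_RECALLABLE_STATE_REVOKED */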

@toddpfaff
Author

toddpfaff commented Jul 22, 2020 via email

@boulos
Contributor

boulos commented Jul 23, 2020 via email

@toddpfaff
Author

I've run the same test again after changing ganesha.conf from PROXY_V4 to PROXY_V3.
Here's the log:
ganesha.log-20200722-23:33.txt.gz

@boulos
Contributor

boulos commented Jul 24, 2020

A ha!

$ egrep -i failed ganesha.log-20200722-23.33.txt
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup export failed trust negative no
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup export failed trust negative no
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup export failed with No such file or directory
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] place_new_dirent :NFS READDIR :F_DBG :Could not add export to chunk for directory 0x7f497dab7500, compute_readdir_cookie failed
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup ss1 failed trust negative no
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup ss1 failed trust negative no
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup ss1 failed with No such file or directory
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] place_new_dirent :NFS READDIR :F_DBG :Could not add ss1 to chunk for directory 0x7f497dab7a00, compute_readdir_cookie failed
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup vm-images failed trust negative no
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup vm-images failed trust negative no
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup vm-images failed with No such file or directory
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] place_new_dirent :NFS READDIR :F_DBG :Could not add vm-images to chunk for directory 0x7f497dab7f00, compute_readdir_cookie failed
22/07/2020 23:31:28 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:0)
22/07/2020 23:31:40 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_6] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup kvm failed trust negative no
22/07/2020 23:31:40 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_6] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup kvm failed trust negative no
22/07/2020 23:31:40 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_6] place_new_dirent :NFS READDIR :F_DBG :Could not add kvm to chunk for directory 0x7f497dab7000, compute_readdir_cookie failed
22/07/2020 23:31:40 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_6] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup inviscid.qcow2 failed trust negative no
22/07/2020 23:31:40 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_6] mdc_try_get_cached :NFS READDIR :F_DBG :mdcache_avl_lookup inviscid.qcow2 failed trust negative no
22/07/2020 23:31:40 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_6] place_new_dirent :NFS READDIR :F_DBG :Could not add inviscid.qcow2 to chunk for directory 0x7f4973858100, compute_readdir_cookie failed
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_write2 :FSAL :DEBUG :WRITE failed: 1

I think those metadata/dirent ones are confusingly harmless. However, the write failure does matter: it means the RPC itself worked, but the backend server returned an error. Strangely, that error (1) is NFS3ERR_PERM...

Just to verify: the errors in your kernel boot are all complaints about reads, right (or does the visible error vary)?

Either way, looking at all the lines near there handled by the svc_73 thread, I see it's trying to do:

...
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] nfs_req_creds :DISP :M_DBG :AUTH_SYS creds mapped to uid=107, gid=107, glen=2
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] nfs3_write :NFS3 :DEBUG :REQUEST PROCESSING: Calling NFS3_WRITE File Handle V3: Len=40 0x4300277520010006017fdf829500cf0bd800000000000000000a005d6602000000dfb04709000000 start: 1eb818000 len: 8192 FILE_SYNC
...
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] mdcache_getattrs :INODE :F_DBG :attrs obj attributes Request Mask=0000c004 Valid Mask=0001dfce Supported Mask=0001dfce REGULAR_FILE numlinks=0x1 size=0x4faa90000 mode=0660 owner=0x6b group=0x6b atime=29/06/2020 17:59:59 mtime=22/07/2020 23:33:51
...

and so 107 == 0x6b and you'd had tons of successful reads and writes to that file handle.

Nothing looks suspicious in the actual writing code (though I have a printf bug on "Got a new socket", since I don't believe your host is 2.0.0.0).

22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] nfs3_write :NFS3 :F_DBG :Write offset=8246099968 size=8192 MaxOffSet=9223372036854775807
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] state_deleg_conflict :RW LOCK :F_DBG :Acquired mutex 0x7f49738588a8 (&(obj)->state_hdl->st_lock) at /usr/local/ganesha/4-dev-20200717/src/nfs-ganesha/src/SAL/state_deleg.c:596
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] state_deleg_conflict :RW LOCK :F_DBG :Released mutex 0x7f49738588a8 (&(obj)->state_hdl->st_lock) at /usr/local/ganesha/4-dev-20200717/src/nfs-ganesha/src/SAL/state_deleg.c:598
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_write2 :FSAL :DEBUG :Doing write2 at offset 8246099968 in handle 0x7f4973845540 of len 8192
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_call :FSAL :F_DBG :Sending an RPC: Program = 100003, Version = 3, Procedure = 7
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_getfdentry :FSAL :F_DBG :Looking for an open socket for port 2049
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_openfd :FSAL :DEBUG :Opening a new socket
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_openfd :FSAL :DEBUG :Got a new socket (47) open to host 2.0.0.0
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] rpc :TIRPC :F_DBG :xdr_call_encode:85 INLINE
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_call :FSAL :F_DBG :Sending XID 1617583841 with 8324 bytes
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] rpc :TIRPC :F_DBG :proxyv3_call: auth_put(), refs 0
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_call :FSAL :F_DBG :Let's go ask for a response.
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_call :FSAL :F_DBG :Got recmark 80000078 (120 bytes) xid 1617583841
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_call :FSAL :F_DBG :Going to read the remaining 116 bytes
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_call :FSAL :F_DBG :Got all the bytes, time to decode
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_release_fdentry :FSAL :F_DBG :Releasing fd 47 back into the pool (close = F)
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_call :FSAL :F_DBG :RPC Completed SUCCESSFULLY: Program = 100003, Version = 3, Procedure = 7
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] proxyv3_write2 :FSAL :DEBUG :WRITE failed: 1
22/07/2020 23:33:51 : epoch 5f190490 : vh8 : ganesha.nfsd-65631[svc_73] nfs3_write_cb :NFS3 :F_DBG :write fsal_status=Forbidden action

Moreover, there are tons of reads and writes that successfully happen afterwards too:

$ egrep "proxyv3_(write2|read2)" ganesha.log-20200722-23.33.txt

though I suppose they could have been in flight as svc_73 never shows up again.

NFS experts: can knfsd return PERM/ACCESS errors under load or something? (Or any other suspicious outcomes for getting back error code 1?)

@toddpfaff
Author

toddpfaff commented Jul 24, 2020 via email

@boulos
Contributor

boulos commented Sep 4, 2020

(Bringing this back here from the list)

Running tshark while doing an rsync shows the same failure, down to "the backend replies with NFS3ERR_PERM":

454 71.197509325 192.168.1.2 → 192.168.1.5 NFS 190 V3 WRITE Reply (Call In 403) Error: NFS3ERR_PERM
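
(A capture along these lines produces the summary above and the verbose decode further below; the interface name and filters here are assumptions, since the exact invocation isn't in the thread:)

tshark -i eth0 -f "tcp port 2049" -Y nfs        # one-line summaries
tshark -i eth0 -f "tcp port 2049" -Y nfs -V     # full protocol decode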

rsync is likely just so robust that it's willing to retry these things enough to make progress.

A more detailed and fancily decoded tshark run:

Frame 450: 16962 bytes on wire (135696 bits), 16962 bytes captured (135696 bits) on interface 0
Ethernet II, Src: 42:01:c0:a8:01:01 (42:01:c0:a8:01:01), Dst: 42:01:c0:a8:01:02 (42:01:c0:a8:01:02)
Internet Protocol Version 4, Src: 192.168.1.5, Dst: 192.168.1.2
Transmission Control Protocol, Src Port: 838, Dst Port: 2049, Seq: 547476, Ack: 29433, Len: 16896
[2 Reassembled TCP Segments (28844 bytes): #448(16979), #450(11865)]
Remote Procedure Call, Type:Call XID:0x00fcdd36
    Fragment header: Last fragment, 28840 bytes
        1... .... .... .... .... .... .... .... = Last Fragment: Yes
        .000 0000 0000 0000 0111 0000 1010 1000 = Fragment Length: 28840
    XID: 0x00fcdd36 (16571702)
    Message Type: Call (0)
    RPC Version: 2
    Program: NFS (100003)
    Program Version: 3
    Procedure: WRITE (7)
    Credentials
        Flavor: AUTH_UNIX (1)
        Length: 60
        Stamp: 0x019b0bf4
        Machine Name: ganesha-client
            length: 14
            contents: ganesha-client
            fill bytes: opaque data
        UID: 1000
        GID: 1001
        Auxiliary GIDs (6) [4, 30, 44, 46, 1000, 1001]
            GID: 4
            GID: 30
            GID: 44
            GID: 46
            GID: 1000
            GID: 1001
    Verifier
        Flavor: AUTH_NULL (0)
        Length: 0
Network File System, WRITE Call FH: 0xc80202d4 Offset: 487424 Len: 28671 UNSTABLE
    [Program Version: 3]
    [V3 Procedure: WRITE (7)]
    file
        length: 44
        [hash (CRC-32): 0xc80202d4]
        FileHandle: 4300004d24010007016641080000000000384f1f4acc7240...
    offset: 487424
    count: 28671
    Stable: UNSTABLE (0)
    Data: <DATA>
        length: 28671
        contents: <DATA>
        fill bytes: opaque data


....
....
....

Frame 545: 190 bytes on wire (1520 bits), 190 bytes captured (1520 bits) on interface 0
Ethernet II, Src: 42:01:c0:a8:01:02 (42:01:c0:a8:01:02), Dst: 42:01:c0:a8:01:01 (42:01:c0:a8:01:01)
Internet Protocol Version 4, Src: 192.168.1.2, Dst: 192.168.1.5
Transmission Control Protocol, Src Port: 2049, Dst Port: 838, Seq: 32425, Ack: 1411213, Len: 124
Remote Procedure Call, Type:Reply XID:0x00fcdd36
    Fragment header: Last fragment, 120 bytes
        1... .... .... .... .... .... .... .... = Last Fragment: Yes
        .000 0000 0000 0000 0000 0000 0111 1000 = Fragment Length: 120
    XID: 0x00fcdd36 (16571702)
    Message Type: Reply (1)
    [Program: NFS (100003)]
    [Program Version: 3]
    [Procedure: WRITE (7)]
    Reply State: accepted (0)
    [This is a reply to a request in frame 450]
    [Time from request: 0.165609639 seconds]
    Verifier
        Flavor: AUTH_NULL (0)
        Length: 0
    Accept State: RPC executed successfully (0)
Network File System, WRITE Reply  Error: NFS3ERR_PERM
    [Program Version: 3]
    [V3 Procedure: WRITE (7)]
    Status: NFS3ERR_PERM (1)
    file_wcc
        before
            attributes_follow: no value (0)
        after  Regular File mode: 0600 uid: 1000 gid: 1001
            attributes_follow: value follows (1)
            attributes  Regular File mode: 0600 uid: 1000 gid: 1001
                Type: Regular File (1)
                Mode: 0600, S_IRUSR, S_IWUSR
                    .... .... .... .... .... 0... .... .... = S_ISUID: No
                    .... .... .... .... .... .0.. .... .... = S_ISGID: No
                    .... .... .... .... .... ..0. .... .... = S_ISVTX: No
                    .... .... .... .... .... ...1 .... .... = S_IRUSR: Yes
                    .... .... .... .... .... .... 1... .... = S_IWUSR: Yes
                    .... .... .... .... .... .... .0.. .... = S_IXUSR: No
                    .... .... .... .... .... .... ..0. .... = S_IRGRP: No
                    .... .... .... .... .... .... ...0 .... = S_IWGRP: No
                    .... .... .... .... .... .... .... 0... = S_IXGRP: No
                    .... .... .... .... .... .... .... .0.. = S_IROTH: No
                    .... .... .... .... .... .... .... ..0. = S_IWOTH: No
                    .... .... .... .... .... .... .... ...0 = S_IXOTH: No
                nlink: 1
                uid: 1000
                gid: 1001
                size: 1359871
                used: 806912
                rdev: 0,0
                    specdata1: 0
                    specdata2: 0
                fsid: 0x8cb259029983aba9 (10138263578864757673)
                fileid: 532980
                atime: Sep  3, 2020 21:59:39.407546329 UTC
                    seconds: 1599170379
                    nano seconds: 407546329
                mtime: Sep  3, 2020 21:59:39.611563419 UTC
                    seconds: 1599170379
                    nano seconds: 611563419
                ctime: Sep  3, 2020 21:59:39.611563419 UTC
                    seconds: 1599170379
                    nano seconds: 611563419

shows what looks like a fairly reasonable write to me. These permission errors are strewn about in the write stream...

@boulos
Contributor

boulos commented Sep 12, 2020

I finally figured this out.

By playing around with the maximum number of connections in rpc.c, I determined "it's a race of some sort, that's weird" (that is, it worked fine at 1 connection, 2, 4, 8 depending on debug settings, etc.). I burned a lot of time trying to get msan and tsan working with clang, but eventually decided to just hop over to my knfsd box and turn up all the logging while running the following from my client:

for i in $(seq 10); do echo "Starting run-$i" && mkdir proxy-mount/run-$i && rsync -vaxS /boot/grub proxy-mount/run-$i && echo " run-$i $?"; done

There was a lot of noise in the log, but this jumped out at me:

Sep 12 23:10:45 ganesha-proxy-nfsd kernel: [16077266.727259] nfsd: request from insecure port 192.168.1.2, port=36953!

because I call bindresvport_sa to get a reserved port...

My comment for my local patch explains the problem:

+        * NOTE(boulos): libntirpc's bindresvport_sa is *also* not
+        * thread-safe. So we need to hold a lock around calling it. Our only
+        * caller (proxyv3_getfdentry) no longer holds the rpcLock, so we can
+        * use that one.

So the fix is easy enough, but I think that, like glibc, we should make bindresvport itself thread-safe:

Before glibc 2.17, the bindresvport() function uses a static variable that is not protected, so it is not thread-safe.
Since glibc 2.17, the bindresvport() function uses a lock to protect the static variable, so it is thread-safe.
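
(A minimal sketch of that workaround, assuming libntirpc's bindresvport_sa(int, struct sockaddr *) signature; this is illustrative, not the actual patch:)

#include <pthread.h>
#include <sys/socket.h>

extern int bindresvport_sa(int sd, struct sockaddr *sa);  /* libntirpc helper */

/* bindresvport_sa() is not thread-safe, so serialize all callers. */
static pthread_mutex_t resvport_lock = PTHREAD_MUTEX_INITIALIZER;

static int reserve_privileged_port(int fd, struct sockaddr *sa)
{
	int rc;

	pthread_mutex_lock(&resvport_lock);
	rc = bindresvport_sa(fd, sa);
	pthread_mutex_unlock(&resvport_lock);
	return rc;
}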

Thanks again, Todd, for being so patient!

@toddpfaff
Author

toddpfaff commented Sep 15, 2020 via email

@ffilz ffilz closed this as completed in 48c464b Sep 18, 2020
@toddpfaff
Author

I realize this issue is closed but I just want to comment here that I've updated to V4-dev.34 and quickly hit this problem again when doing some rsync tests. My log shows:

19/09/2020 11:02:46 : epoch 5f659a35 : hostname : ganesha.nfsd-24063[svc_5010] proxyv3_openfd :FSAL :CRIT :Failed to reserve a privileged port. 98 Address already in use
19/09/2020 11:02:46 : epoch 5f659a35 : hostname : ganesha.nfsd-24063[svc_5010] proxyv3_read2 :FSAL :CRIT :proxyv3_nfs_call failed (0)

Todd

@boulos
Contributor

boulos commented Sep 19, 2020 via email

@boulos
Contributor

boulos commented Sep 19, 2020 via email

@fznply

fznply commented Jul 20, 2022

Even though I set PROXY_V3 { num_sockets = 1; }, the log still shows the error: proxyv3_openfd :FSAL :CRIT :Failed to connect to host '2.0.0.0'. errno 110 (Connection timed out).

nfs-ganesha-4.0-6.fc37.x86_64
nfs-ganesha-proxy-v3-4.0-6.fc37.x86_64

I'm using the fedora:37 Docker image; simply running the commands below reproduces this error:

yum install -y net-tools wget procps-ng vim nfs-ganesha nfs-ganesha-proxy-v3

rpcbind
rpc.statd -L
rpc.idmapd
/usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT

cat /etc/ganesha/ganesha.conf

> PROXY_V3 {
>     num_sockets = 1;
> }
> 
> EXPORT
> {
>         Export_Id = 0;
>         Path = "/z0xak7rj";
>         Pseudo = "/";
> 
>         Access_Type = RW;
>         Squash = no_root_squash;
>         SecType = sys;
>         Transports = TCP;
> 
>         # Exporting FSAL
>         FSAL {
>             Name = PROXY_V3;
>             Srv_Addr = 192.168.253.115;
>         }
> }

Using the mount command directly succeeds:

mount -t nfs -o vers=3,nolock,proto=tcp,noresvport 192.168.253.115:/z0xak7rj /tmp/test/
mount -l | grep z0xak7rj

> 192.168.253.115:/z0xak7rj on /tmp/test type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.253.115,mountvers=3,mountport=892,mountproto=tcp,local_lock=all,addr=192.168.253.115)

@boulos, can you give some advice?

@fznply

fznply commented Jul 21, 2022

The log shows that the connection that times out is to port 43255:

21/07/2022 05:21:31 ganesha.nfsd-2157[main] proxyv3_getfdentry :FSAL :F_DBG :Looking for an open socket for port 43255
21/07/2022 05:21:31 ganesha.nfsd-2157[main] proxyv3_openfd :FSAL :DEBUG :Opening a new socket
21/07/2022 05:23:39 ganesha.nfsd-2157[main] proxyv3_openfd :FSAL :CRIT :Failed to connect to host '2.0.0.0'. errno 110 (Connection timed out)

So, grepping for port 43255, I found that no NLM is running on the backend NFS server.

grep 43255 /var/log/ganesha/ganesha.log 

> 21/07/2022 05:21:31 ganesha.nfsd-2157[main] proxyv3_find_ports :FSAL :DEBUG :Got back nlm port 43255
> 21/07/2022 05:21:31 proxyv3_getfdentry :FSAL :F_DBG :Looking for an open socket for port 43255
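
(One way to check whether the backend registers NLM with rpcbind; the host address is taken from the config above, and the output will vary by server:)

rpcinfo -p 192.168.253.115 | grep -E 'nlockmgr|mountd|nfs'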

Can PROXY_V3 work against a backend NFS server that isn't running NLM?

If not, why does the command below work?

mount -t nfs -o vers=3,nolock,proto=tcp,noresvport 192.168.253.115:/z0xak7rj /tmp/test/
