
Client id expiration - multiple clients using same hostname #934

Closed
deepakaru opened this issue Apr 27, 2023 · 15 comments


@deepakaru
Contributor

We've had multiple issues where a CREATE_SESSION tries to expire a clientid while that clientid is being used by other RPCs such as OPEN, EXCHANGE_ID, etc. You typically don't expect a client to send an OPEN/EXCHANGE_ID RPC while the server is processing a CREATE_SESSION RPC unless the client is misbehaving. We found that the case where multiple clients use the same hostname (which the RFC recommends against) is not handled very well in the clientid expiration code path.

Now, although the clients were misbehaving, we feel the server should handle this or fail gracefully. In these cases the server usually hits an unexpected assert or segfault and crashes, or it runs into a deadlock and hangs forever.

Notes from the RFC - hopefully helpful to anyone who picks up the issue

The RFC says this about co_ownerid:

The string should be unique so that multiple clients do not present the same string. The consequences of two clients presenting the same string range from one client getting an error to one client having its leased state abruptly and unexpectedly cancelled.

Some examples of the kinds of issues we've been hitting

Deadlock:

Two client IDs, say c1 and c2, from two different clients were associated with the same co_ownerid and the same client record cr1.

(1) Client 1 -> thread 1 was doing an OPEN. It was inside the get_state_owner function trying to get an owner for the open state. It had acquired ht_owner->partitions[15].lock, created a new open owner, and was trying to take the mutex on cr1 (nfs4_owner->so_clientrec->cid_mutex) so it could add the owner to client record cr1.
(2) Client 2 -> thread 2 was doing a CREATE_SESSION. It was inside the nfs_clientid_expire function, walking the open owner list inside c1 and trying to 'delete' each owner (the second while(true) loop). It was holding nfs4_owner->so_clientrec->cid_mutex and trying to get ht_owner->partitions[15].lock (see the sketch below).
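
A self-contained toy program illustrating the inverted lock ordering above (the lock names mirror the report; this is not the actual Ganesha code, and with lucky timing it may complete instead of deadlocking):

```c
#include <pthread.h>
#include <stdio.h>

/* Stand-ins for the two locks from the report. */
static pthread_rwlock_t partition_lock = PTHREAD_RWLOCK_INITIALIZER; /* ht_owner->partitions[15].lock */
static pthread_mutex_t cid_mutex = PTHREAD_MUTEX_INITIALIZER;        /* cr1->cid_mutex */

/* Thread 1: OPEN path (get_state_owner) - partition latch first, then cid_mutex. */
static void *open_path(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&partition_lock);
	/* ... create the new open owner ... */
	pthread_mutex_lock(&cid_mutex);         /* blocks if thread 2 already holds cid_mutex */
	/* ... add the owner to client record cr1 ... */
	pthread_mutex_unlock(&cid_mutex);
	pthread_rwlock_unlock(&partition_lock);
	return NULL;
}

/* Thread 2: CREATE_SESSION path (nfs_clientid_expire) - cid_mutex first, then partition latch. */
static void *expire_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cid_mutex);
	/* ... walk c1's open owner list, deleting each owner ... */
	pthread_rwlock_wrlock(&partition_lock);  /* blocks if thread 1 already holds the latch */
	pthread_rwlock_unlock(&partition_lock);
	pthread_mutex_unlock(&cid_mutex);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, open_path, NULL);
	pthread_create(&t2, NULL, expire_path, NULL);
	pthread_join(t1, NULL);   /* with unlucky timing, neither join ever returns */
	pthread_join(t2, NULL);
	puts("no deadlock this run");
	return 0;
}
```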

Assertion error:
We hit an assertion error inside dec_state_owner_ref, in the part of the code that checks that the refcount is greater than or equal to 0. The refcount was -1 and the assert caught it.
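
For context, a minimal sketch of the kind of check involved (the real dec_state_owner_ref uses Ganesha's atomic helpers; this only shows the shape of the failing assert):

```c
#include <assert.h>
#include <stdint.h>

struct state_owner { int32_t refcount; };

static void dec_ref(struct state_owner *owner)
{
	int32_t refcount = --owner->refcount;  /* atomic in the real code */

	/* An extra/duplicate decrement drives the count from 0 to -1 and trips this. */
	assert(refcount >= 0);

	if (refcount == 0) {
		/* last reference: tear the owner down */
	}
}
```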

I can add more details on other issues if needed.

cc: @ffilz

@ffilz
Member

ffilz commented Apr 27, 2023

I think we could make hold_state_owner effectively do a pthread_rwlock_trywrlock on the latch, and if it fails, return false. I think that would break the deadlock.
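
A rough sketch of that idea, with hypothetical simplified types (hold_state_owner's real signature and the latch plumbing differ):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

struct hash_latch { pthread_rwlock_t lock; };
struct state_owner { int32_t refcount; struct hash_latch *latch; };

/* Try to take a reference; never block on the latch, so this path
 * cannot participate in the circular wait. */
static bool hold_state_owner(struct state_owner *owner)
{
	bool held = false;

	if (pthread_rwlock_trywrlock(&owner->latch->lock) != 0)
		return false;            /* latch busy: caller retries or skips this owner */

	if (owner->refcount > 0) {       /* only take a ref on a live owner */
		owner->refcount++;
		held = true;
	}
	pthread_rwlock_unlock(&owner->latch->lock);
	return held;
}
```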

@ffilz
Member

ffilz commented Apr 27, 2023

I am working on a fix for the deadlock using atomic_add_unless to detect when the refcount is 0 before incrementing, which is what hold_state_owner is trying to do by taking the hash latch. Using atomic_add_unless means hold_state_owner doesn't need the hash latch, thus removing the lock inversion.
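
A sketch of the increment-unless-zero idea (the actual fix uses Ganesha's atomic_inc_unless_0_int32_t helper; this CAS loop with GCC builtins just shows the principle):

```c
#include <stdbool.h>
#include <stdint.h>

/* Take a reference only if the count is not already 0; no latch is needed,
 * so the lock-order inversion with cid_mutex disappears. */
static bool inc_unless_0(int32_t *refcount)
{
	int32_t cur = __atomic_load_n(refcount, __ATOMIC_ACQUIRE);

	while (cur != 0) {
		/* Try to swing cur -> cur + 1; on failure cur is reloaded. */
		if (__atomic_compare_exchange_n(refcount, &cur, cur + 1, false,
						__ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
			return true;
	}
	return false;   /* owner is already on its way to being freed: skip it */
}
```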

I'm looking to see if we have fixed anything regarding state_owner_ref; what version of Ganesha are you using?

What other issues have you hit?

@deepakaru
Contributor Author

I am worried that even if we break the deadlock, we'll run into a segmentation fault.

c1 seems to be initializing its owner and adding it to the owner table and client record structures.
If c2 were to pick that exact owner to delete on its side, c1 would be left with freed-up pointers.

@deepakaru
Contributor Author

We are using Ganesha version 4.0.2.

We have a few other cores and assertions, but we haven't made much progress on their root causes because we think they're due to multiple clients using the same hostname.

@ffilz
Member

ffilz commented Apr 27, 2023

I am worried that even if we break the deadlock, we'll run into a segmentation fault.

c1 seems to be initializing its owner and adding it to the owner table and client record structures. If c2 were to pick that exact owner to delete on its side, c1 would be left with freed-up pointers.

Once we break the deadlock, the refcounting should work properly.

I also did an audit of the uses of the owner refcount and it LOOKS like they should all be correct. Can you easily recreate the assert case? If so, it might be worth turning on STATE debug and maybe we can see where the refcount was decremented to 0 (whoever actually decremented to -1 probably isn't the extraneous dec ref).
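
If it helps, something along these lines in ganesha.conf should raise the STATE component's log level (syntax per my reading of the log configuration docs; treat it as a starting point):

```
LOG {
	COMPONENTS {
		STATE = FULL_DEBUG;
	}
}
```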

@ffilz
Member

ffilz commented Apr 27, 2023

We are using Ganesha version 4.0.2.

We have a few other cores and assertions, but we haven't made much progress on their root causes because we think they're due to multiple clients using the same hostname.

Sharing the backtraces you hit might help narrow down the cause.

@ffilz ffilz added the bug label Apr 27, 2023
ffilz added a commit to ffilz/nfs-ganesha that referenced this issue Apr 27, 2023
Also change the name to hold_state_owner_ref (which makes it easier to
audit code for proper use of refcounts).

By using atomic_inc_unless_0_int32_t, we can detect the case where the
refcount is already 0, so rather than trying to process this owner we
skip it (for now). This means that hold_state_owner_ref doesn't need to
take the hash latch (which caused a lock order inversion) to protect
while we check for 0 refcount.

This lock inversion was introduced in commit
c8e1828 as an attempt to avoid
incrementing the refcount from 0.

This addresses the deadlock caused by the lock inversion reported in
github issue nfs-ganesha#934.

Change-Id: Ia3babaad97a3e430091dc3b23e6e21d095418ac2
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
@ffilz
Member

ffilz commented Apr 27, 2023

The deadlock should be fixed by https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553234

Could you open a separate GitHub issue for each problem? That makes it easier to track what needs work and what has been resolved.

@deepakaru
Contributor Author

We identified one more issue after cherry-picking https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553234

I have created a patch for this change here https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553739

I think we can close this issue once https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553739 is merged or the bug it describes is addressed.

We've been running our test with https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553739 and so far there are no cores. Typically we see a core within 1-2 hours with this test.

@ffilz
Member

ffilz commented May 10, 2023

That patch seems to only address NLM, or are you assuming a fix to issue #940?

@deepakaru
Contributor Author

@ffilz I briefly audited #940; it looks similar to the core we've been hitting. The problem is that the get_state_owner function is now resetting the refcount to 1, and it should be fixed by my patch. I've added more details on the patch.

@deepakaru
Contributor Author

deepakaru commented May 11, 2023

Everything seems to work fine if we remove the 'owner->refcount = 1' from get_state_owner,

and it's not needed anyway if we set the refcount to 1 in the caller or callback:

For open state owners we already do this in create_nfs4_owner.
For NLM owners I've added code inside init_nlm_owner for this (see the sketch below).
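
A simplified sketch of that change (the struct and signatures are stand-ins, not the real Ganesha declarations):

```c
#include <stdint.h>

struct state_owner { int32_t refcount; };

/* Before: get_state_owner() reset the count on a found owner, which could
 * "resurrect" an owner that nfs_clientid_expire was already tearing down. */
static void get_state_owner_before(struct state_owner *owner)
{
	owner->refcount = 1;   /* the problematic reset */
}

/* After: the initial reference is taken exactly once, when the owner is
 * constructed (create_nfs4_owner for open owners, init_nlm_owner for NLM
 * owners); lookups only add references via hold_state_owner_ref(). */
static void init_nlm_owner(struct state_owner *owner)
{
	owner->refcount = 1;   /* the creator's reference */
	/* ... remaining NLM owner initialization ... */
}
```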

@ffilz
Member

ffilz commented May 11, 2023

So you think this second patch should fix issue #940? If so, could you add a comment there.

@deepakaru
Contributor Author

I added a comment there. Can we merge https://gerrithub.io/c/ffilz/nfs-ganesha/+/553739? I see your +1 there, but I am not sure how to merge it.

@ffilz
Member

ffilz commented May 11, 2023

Thanks for the comment. I do weekly merges on Fridays, though I won't do one this week as I have tomorrow off. Depending on what is ready to go, I might do a merge Monday or Tuesday.

@ffilz
Member

ffilz commented Jun 13, 2023

Closing as verified.
