Client id expiration - multiple clients using same hostname #934
Comments
I think we could make hold_state_owner effectively do a pthread_rwlock_trywrlock on the latch, and if it fails, return false. I think that would break the deadlock.
I am working on a fix for the deadlock that uses atomic_add_unless to detect when the refcount is 0 before incrementing, which is what hold_state_owner is trying to do by using the hash latch. Using atomic_add_unless means hold_state_owner doesn't need the hash latch at all, which removes the lock inversion. I'm also looking to see if we have fixed anything regarding state_owner_ref. What version of Ganesha are you using? What other issues have you hit?
I am worried that even if we break the deadlock, we'll run into a segmentation fault: c1 seems to be initializing its owner and adding it to the owner table and client record structures.
We are using Ganesha version 4.0.2. We have a few other cores and assertions, but we haven't made much progress on the root cause there because we think it's due to multiple clients using the same hostname.
Once we break the deadlock, the refcounting should work properly. I also did an audit of the uses of the owner refcount and it LOOKS like they should all be correct. Can you easily recreate the assert case? If so, it might be worth turning on STATE debug; maybe we can see where the refcount was decremented to 0 (whoever actually decremented it to -1 probably isn't the extraneous dec ref).
Sharing the back traces you hit might help narrow down the cause.
Also change the name to hold_state_owner_ref (which makes it easier to audit code for proper use of refcounts). By using atomic_inc_unless_0_int32_t, we can detect the case where the refcount is already 0, so rather than trying to process this owner we skip it (for now). This means that hold_state_owner_ref doesn't need to take the hash latch (which caused a lock order inversion) to protect the check for a 0 refcount. The lock inversion was introduced in commit c8e1828 as an attempt to avoid incrementing the refcount from 0. This addresses the deadlock caused by the lock inversion reported in github issue nfs-ganesha#934.
Change-Id: Ia3babaad97a3e430091dc3b23e6e21d095418ac2
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
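The atomic_inc_unless_0 pattern the commit message describes can be sketched with C11 atomics. This is a hypothetical illustration of the technique (a compare-and-swap loop that refuses to increment from 0), not the actual Ganesha implementation:

```c
/* Sketch of the inc-unless-zero pattern: increment the refcount only if
 * it is not already 0, with no latch needed. Hypothetical simplification,
 * not the real nfs-ganesha code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static bool atomic_inc_unless_0_int32_t(_Atomic int32_t *var)
{
    int32_t cur = atomic_load(var);

    while (cur != 0) {
        /* CAS succeeds only if nobody changed the value underneath us;
         * on failure it reloads cur and we retry. */
        if (atomic_compare_exchange_weak(var, &cur, cur + 1))
            return true;
    }
    return false;  /* refcount hit 0: owner is being torn down, skip it */
}

/* With this, taking a ref no longer touches the hash latch at all. */
static bool hold_state_owner_ref(_Atomic int32_t *refcount)
{
    return atomic_inc_unless_0_int32_t(refcount);
}
```

Since no lock is taken while checking for 0, there is no ordering to invert against cid_mutex, which is the whole point of the fix.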
The deadlock should be fixed by https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553234 Could you open a separate github issue for each problem? That makes it easier to track what needs work and what has been resolved.
We identified one more issue after cherry-picking https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553234 and I have created a patch for it here: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553739 I think we can close this issue once https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553739 is merged, or the bug described there is addressed. We've been running our test with https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/553739 and so far there are no cores; typically we see a core within 1-2 hours with this test.
That patch seems to only address NLM, or are you assuming a fix to issue #940?
Everything seems to work fine if we remove the 'owner->refcount = 1' from get_state_owner, and it's not needed anyway if we set the refcount to 1 in the caller or callback. For open state owners we already do that in create_nfs4_owner.
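The hazard being removed can be illustrated in a few lines. Everything here is hypothetical shorthand for the paths named above: if the constructor path (create_nfs4_owner) already sets the refcount to 1, a second `owner->refcount = 1` in get_state_owner can silently clobber a reference taken by another thread in between:

```c
/* Hypothetical illustration of the double-initialization hazard, not the
 * actual Ganesha code. */
#include <stdatomic.h>
#include <stdint.h>

/* Constructor path: the only place the refcount should be initialized. */
static void init_owner(_Atomic int32_t *refcount)
{
    atomic_store(refcount, 1);
}

/* Buggy lookup path: re-initializing instead of incrementing drops any
 * references taken since construction. Returns the resulting refcount. */
static int32_t buggy_reinit(_Atomic int32_t *refcount)
{
    atomic_store(refcount, 1);  /* the line the fix removes */
    return atomic_load(refcount);
}
```

If another thread took a ref between `init_owner` and `buggy_reinit`, the count ends up at 1 instead of 2, and a later release drives it to 0 (or -1) while the owner is still in use.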
So you think this second patch should fix issue #940? If so, could you add a comment there. |
Added a comment there. Can we merge https://gerrithub.io/c/ffilz/nfs-ganesha/+/553739? I see your +1 there but I am not sure how to merge it.
Thanks for the comment. I do weekly merges on Fridays, though I won't do one this week as I have tomorrow off. Depending on what is ready to go, I might do a merge Monday or Tuesday.
Closing as verified.
We've had multiple issues where a CREATE_SESSION tries to expire a clientid while the clientid is being used by other RPCs like OPEN, EXCHANGE_ID, etc. You typically don't expect a client to send an OPEN/EXCHANGE_ID RPC while the server is processing a CREATE_SESSION RPC unless the client is misbehaving. We found that multiple clients trying to use the same hostname (which isn't recommended by the RFC) is not handled very well in the client id expiration code path.
Although the clients were misbehaving, we feel the server should handle this and fail gracefully. In these issues the server usually runs into an unexpected assert or segfault and crashes, or it runs into a deadlock and hangs forever.
Notes from the RFC - hopefully helpful to anyone who picks up the issue.
The RFC says this about co_ownerid:
The string should be unique so that multiple clients do not present the same string. The consequences of two clients presenting the same string range from one client getting an error to one client having its leased state abruptly and unexpectedly cancelled.
Some examples of the kinds of issues we've been hitting:
Deadlock:
Two client ids, say c1 and c2, from two different clients were associated with the same co_ownerid and the same client record cr1:
(1) Client 1 -> thread 1 was doing an OPEN. It was inside the get_state_owner function trying to get an owner for the open state. It acquired ht_owner->partitions[15].lock, created a new open owner, and was trying to hold the mutex on cr1 (nfs4_owner->so_clientrec->cid_mutex) so it could add the owner to the client record cr1.
(2) Client 2 -> thread 2 was doing a CREATE_SESSION. It was inside the nfs_clientid_expire function, iterating over the open owner list inside c1 and trying to 'delete' each owner (the second while(true) loop). It was holding nfs4_owner->so_clientrec->cid_mutex and trying to get ht_owner->partitions[15].lock.
Assertion error:
We hit an assertion error inside dec_state_owner_ref, in the part of the code that checks that the refcount is greater than or equal to 0. The refcount was -1 and the assert caught it.
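The shape of that check can be sketched as follows. This is a hypothetical simplification of dec_state_owner_ref, not the real function: after the decrement the count must still be non-negative, and -1 means some path released a reference it never held (or released one twice):

```c
/* Sketch of the dec-ref assertion pattern; hypothetical, not the actual
 * nfs-ganesha implementation. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the owner should be freed (refcount reached 0). */
static bool dec_state_owner_ref_sketch(_Atomic int32_t *refcount)
{
    /* atomic_fetch_sub returns the value BEFORE the subtraction. */
    int32_t after = atomic_fetch_sub(refcount, 1) - 1;

    assert(after >= 0);  /* -1 here is exactly the crash in this report */
    return after == 0;
}
```

Which is why the audit above focuses on whoever decremented from 0 to -1: the thread that trips the assert is usually the victim, and the extraneous dec ref happened earlier.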
I can add more details on other issues if needed.
cc: @ffilz