You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I came across this while we were trying to debug #680 (not quite there yet).
The following commit was merged fairly recently, partly at my request:
commit 275539c02ad3984e17d6d78b494bd9328cf466c7
Author: Malahal Naineni <malahal@us.ibm.com>
AuthorDate: Wed Oct 24 14:04:34 2018 +0530
Commit: Frank S. Filz <ffilzlnx@mindspring.com>
CommitDate: Fri Nov 18 10:33:13 2022 -0800
Retry communication with NSM service
It retries SM_MON and SM_UNMON on failure, forcing a disconnect on error.
The above commit was written before the following one, but was merged after:
commit 5febadaa98fb53bc4c3f2dab7793d6afb3847073
Author: Gaurav B. Gangalwar <gaurav.gangalwar@gmail.com>
AuthorDate: Tue Feb 19 10:42:57 2019 -0500
Commit: Frank S. Filz <ffilzlnx@mindspring.com>
CommitDate: Fri Feb 22 10:24:07 2019 -0800
Reduce nsm_count before nsm_disconnect as we will not be able to disconnect nsm
client in case of RPC failures.
Change-Id: I0cd5f0614e89f447dc7c5ac278cf03224add05e2
Signed-off-by: Gaurav B. Gangalwar <gaurav.gangalwar@gmail.com>
It moves nsm_count-- before the last error check.
The combination of both patches means that, on error, an nsm_unmonitor() call can reduce nsm_count by 2, making it negative. Oops!
The 2nd commit claims to "Reduce nsm_count before nsm_disconnect". However, when you look at the patch and the code, the call to nsm_disconnect() is after the final LogDebug() in the patch context, so nsm_count-- was always before the call to nsm_disconnect(). So, unless there's a something subtle going on (e.g. it really wants to decrement the reference count before the call to clnt_req_release() then the patch doesn't do what the commit message says.
There's a chance the commit message is saying that the decrement should go before the call to nsm_disconnect() in the error case, but why decrement the reference count when SM_UNMON fails?
In my mind, the simplest solution to the double-decrement-on-retry issue is to move nsm_count-- back to where it was before, so it is only done on success.
Patch coming.
The text was updated successfully, but these errors were encountered:
Thanks for this. I think a failure to unmonitor needs some investigation, but clearly a double decrement because of retry is not the right solution. But I think the decrement on failure was put in originally for a good reason.
I'm going to keep this issue open as an anchor for investigation as to what we should do.
ffilz
changed the title
NSM reference count double-decrement on error
NSM error should not double decrement, but something needs to be done if UNMON fails
May 5, 2023
I was not planning any back ports to V4-stable because I really want folks to move to V5.x to get all the LRU fixes which have been a major headache for quite some time. Back porting those to V4-stable would drag almost all of V5,x along so really the major purpose of V5.x can be looked at as a major bug fix. Sure, there is some stuff for async/nonblocking IO and HAProxy V2 support, but those are actually pretty small compared to the LRU changes.
I came across this while we were trying to debug #680 (not quite there yet).
The following commit was merged fairly recently, partly at my request:
It retries
SM_MON
andSM_UNMON
on failure, forcing a disconnect on error.The above commit was written before the following one, but was merged after:
It moves
nsm_count--
before the last error check.The combination of both patches means that, on error, an
nsm_unmonitor()
call can reducensm_count
by2
, making it negative. Oops!The 2nd commit claims to "Reduce nsm_count before nsm_disconnect". However, when you look at the patch and the code, the call to
nsm_disconnect()
is after the finalLogDebug()
in the patch context, sonsm_count--
was always before the call tonsm_disconnect()
. So, unless there's a something subtle going on (e.g. it really wants to decrement the reference count before the call toclnt_req_release()
then the patch doesn't do what the commit message says.There's a chance the commit message is saying that the decrement should go before the call to
nsm_disconnect()
in the error case, but why decrement the reference count whenSM_UNMON
fails?In my mind, the simplest solution to the double-decrement-on-retry issue is to move
nsm_count--
back to where it was before, so it is only done on success.Patch coming.
The text was updated successfully, but these errors were encountered: