Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NSM error should not double decrement, but something needs to be done if UNMON fails #937

Closed
martin-schwenke opened this issue May 4, 2023 · 3 comments
Labels

Comments

@martin-schwenke
Copy link

I came across this while we were trying to debug #680 (not quite there yet).

The following commit was merged fairly recently, partly at my request:

commit 275539c02ad3984e17d6d78b494bd9328cf466c7
Author:     Malahal Naineni <malahal@us.ibm.com>
AuthorDate: Wed Oct 24 14:04:34 2018 +0530
Commit:     Frank S. Filz <ffilzlnx@mindspring.com>
CommitDate: Fri Nov 18 10:33:13 2022 -0800

    Retry communication with NSM service

It retries SM_MON and SM_UNMON on failure, forcing a disconnect on error.

The above commit was written before the following one, but was merged after:

commit 5febadaa98fb53bc4c3f2dab7793d6afb3847073
Author:     Gaurav B. Gangalwar <gaurav.gangalwar@gmail.com>
AuthorDate: Tue Feb 19 10:42:57 2019 -0500
Commit:     Frank S. Filz <ffilzlnx@mindspring.com>
CommitDate: Fri Feb 22 10:24:07 2019 -0800

    Reduce nsm_count before nsm_disconnect as we will not be able to disconnect nsm
    client in case of RPC failures.
    
    Change-Id: I0cd5f0614e89f447dc7c5ac278cf03224add05e2
    Signed-off-by: Gaurav B. Gangalwar <gaurav.gangalwar@gmail.com>

It moves nsm_count-- before the last error check.

The combination of both patches means that, on error, an nsm_unmonitor() call can reduce nsm_count by 2, making it negative. Oops!

The 2nd commit claims to "Reduce nsm_count before nsm_disconnect". However, when you look at the patch and the code, the call to nsm_disconnect() is after the final LogDebug() in the patch context, so nsm_count-- was always before the call to nsm_disconnect(). So, unless there's a something subtle going on (e.g. it really wants to decrement the reference count before the call to clnt_req_release() then the patch doesn't do what the commit message says.

There's a chance the commit message is saying that the decrement should go before the call to nsm_disconnect() in the error case, but why decrement the reference count when SM_UNMON fails?

In my mind, the simplest solution to the double-decrement-on-retry issue is to move nsm_count-- back to where it was before, so it is only done on success.

Patch coming.

@ffilz
Copy link
Member

ffilz commented May 5, 2023

Thanks for this. I think a failure to unmonitor needs some investigation, but clearly a double decrement because of retry is not the right solution. But I think the decrement on failure was put in originally for a good reason.

I'm going to keep this issue open as an anchor for investigation as to what we should do.

@ffilz ffilz changed the title NSM reference count double-decrement on error NSM error should not double decrement, but something needs to be done if UNMON fails May 5, 2023
@ffilz ffilz added the bug label May 5, 2023
@ffilz ffilz closed this as completed in 8f6ee41 May 19, 2023
@martin-schwenke
Copy link
Author

Hi @ffilz. Will there be attempts to cherry-pick/backport fixes like this to V4-stable? I know this one cherry-picks cleanly. 😄

@ffilz
Copy link
Member

ffilz commented May 21, 2023

I was not planning any back ports to V4-stable because I really want folks to move to V5.x to get all the LRU fixes which have been a major headache for quite some time. Back porting those to V4-stable would drag almost all of V5,x along so really the major purpose of V5.x can be looked at as a major bug fix. Sure, there is some stuff for async/nonblocking IO and HAProxy V2 support, but those are actually pretty small compared to the LRU changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants