Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reconnect() calls close() when simultaneously 'disconnecting' in another thread #1082

Merged
merged 8 commits into from
Feb 27, 2024

Conversation

MauriceVanVeen
Copy link
Collaborator

In the unlikely but possible scenario where the reader/writer threads would call into handleCommunicationIssue twice, disconnecting could be set by the other thread just before going into reconnect().

This would cause these lines to exit the reconnecting loop:

if (isDisconnectingOrClosed() || this.isClosing()) {
    keepGoing = false;
}

Which would then call into close() here:

if (!isConnected()) {
    this.close();
    return;
}

This PR fixes that by not checking the disconnected flag, instead checking isClosed() || isClosing() as it was before this commit.

Related to issue reported on Slack: https://natsio.slack.com/archives/C1KU01HD2/p1708455358962679

…another thread

In the unlikely but possible scenario where the reader/writer threads would call into `handleCommunicationIssue` twice, `disconnecting` could be set by the other thread just before going into `reconnect()`.

This would cause these lines to exit the reconnecting loop:
```java
if (isDisconnectingOrClosed() || this.isClosing()) {
    keepGoing = false;
}
```

Which would then call into `close()` here:
```java
if (!isConnected()) {
    this.close();
    return;
}
```

This PR fixes that by not checking the `disconnected` flag, instead checking `isClosed() || isClosing()` as it was before this commit:

c4d50b2#diff-bbcf614ee7ee4ccd15b22d1f95fa40e4e642098386b6d1d14127dab5968d89feR204

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
MauriceVanVeen referenced this pull request Feb 20, 2024
WARNING: This code is purely meant to illustrate the reproduction steps, and not meant to be directly merged.
@MauriceVanVeen
Copy link
Collaborator Author

Code containing a (hacky) test that reproduces the issue can be found here: 1958b8a

@MauriceVanVeen
Copy link
Collaborator Author

Would be happy to see how we could get a test for this case (if possible).

@MauriceVanVeen
Copy link
Collaborator Author

As discussed privately, we probably instead want a way to check if we're already in the closeSocket method. That way the current early return in closeSocket checks that, instead of the current disconnecting = true which allows for a short period of time where it could be set to true in a different thread:

            if (isDisconnectingOrClosed()) {
                waitForDisconnectOrClose(this.options.getConnectionTimeout());
                return;
            }

…tead

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen
Copy link
Collaborator Author

MauriceVanVeen commented Feb 21, 2024

Have implemented the above by checking if we're already in closeSocket using an AtomicBoolean, and reverted the previously changed condition.

Still not quite sure if it's possible to write a (proper) test for this given that it's a race condition and closeSocket is private, what do you think?

EDIT: Using AtomicBoolean to block subsequent access failed several tests, using a ReentrantLock instead to ensure it's not fully blocked.

…oseSocket() calls

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
return;
boolean wasConnected;

statusLock.lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both the closeSocketLock and the statusLock, or can we just rely on the closeSocketLock?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed yes.
Both removing the statusLock, solely relying on closeSocketLock fails the tests. As well as wrapping the statusLock around it fully, instead of using the closeSocketLock.

@scottf
Copy link
Contributor

scottf commented Feb 26, 2024

Still not quite sure if it's possible to write a (proper) test for this given that it's a race condition and closeSocket is private, what do you think?

@MauriceVanVeen What if we make change it from private to package scope? As long as it's not public...

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen
Copy link
Collaborator Author

What if we make change it from private to package scope? As long as it's not public...

Turned out it was already package scoped, not sure why I thought it was private..

Anyway, added a test just now which reproduces the issue with the fix commented out. The commit after uncomments the fix.
Not sure why, but it failed first with Cannot run program "nats-server": error=11, Resource temporarily unavailable, after re-running it worked though.

Not sure why coveralls has fell in coverage now?
Either way, the test is very spammy and seems to only reproduce in a busy-loop..

Not really happy with how spammy it is, but don't know if there's another way to get the race condition tested.

closeSocketThread.start();

// Ensure the closeSocket thread runs first
Thread.sleep(100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want these to run as close together as possible? Why didn't you just do two instances of the closeSocketThread code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why didn't you just do two instances of the closeSocketThread code?

Running closeSocket(..) twice would not reproduce (or at least not reliably in a test).

This part in closeSocket(..) makes sure that while disconnecting=true we don't run closeSocket(..) twice:

                if (isDisconnectingOrClosed()) {
                    waitForDisconnectOrClose(this.options.getConnectionTimeout());
                    return;
                }

But, if we've reset disconnecting=false and afterward call reconnect(), this would allow another thread to go in and set disconnecting=true. If at the same time the reconnect() logic checks isDisconnectingOrClosed() it would close the connection instead of staying in the reconnect loop.

It's hard to get these conditions to align, so that's why I've resorted to calling handleCommunicationIssue continuously, which eventually calls closeSocket(...) from different threads.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did simplify the test though, so have removed the closeSocketThread and this sleep. Only keeping the continuous handleCommunicationIssue calls.

src/test/java/io/nats/client/impl/AuthAndConnectTests.java Outdated Show resolved Hide resolved
scottf and others added 3 commits February 27, 2024 15:31
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
…slight throttling

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen
Copy link
Collaborator Author

Coveralls seems to be happy now, after applying 55aa19e.

Copy link
Contributor

@scottf scottf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@scottf scottf merged commit 4ccd7da into main Feb 27, 2024
2 checks passed
@scottf scottf deleted the fix/reconnect-closes-if-disconnecting-in-parallel branch February 27, 2024 22:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants