Client and ClientStub issues during membership operations #209

Closed
jabolina opened this issue May 13, 2023 · 0 comments · Fixed by #206
@jabolina
Member

During the Jepsen tests, we identified issues with Client and ClientStub when executing membership changes. The test setup included a nemesis that changes membership and kills nodes. Membership changes are submitted through the CLI, which relies on the Client and the ClientStub.

The operations applied during the tests invoke the Client with the arguments to add or remove a member. The command is issued through the CLI, but the exit code is always 0, even when failures occur. Taking a closer look, the command is submitted asynchronously, and the response (or exception) is handled in a .whenComplete block.

As a result, the CLI command always finishes with exit code 0, which makes it hard to tell whether the membership change succeeded. The approach we took while testing was to invoke .join on the CompletableFuture so that thrown exceptions are caught. With this change, the Client exits with code 1 on failure. It also means that the CLI command blocks until the remote peer returns a response.
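The difference between the two styles can be sketched with plain `CompletableFuture` code. This is only an illustration, not the actual Client code: `submitMembershipChange` and both `exitCode...` helpers are hypothetical names, and the request is simulated with an already-completed future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class ExitCodeSketch {

    // Hypothetical stand-in for the asynchronous membership request.
    static CompletableFuture<String> submitMembershipChange(boolean fail) {
        CompletableFuture<String> cf = new CompletableFuture<>();
        if (fail) {
            cf.completeExceptionally(new IllegalStateException("remote peer unavailable"));
        } else {
            cf.complete("ok");
        }
        return cf;
    }

    // Callback style: the failure is only visible inside whenComplete,
    // so the caller has nothing to derive an exit code from.
    static int exitCodeWithCallbackOnly(boolean fail) {
        submitMembershipChange(fail).whenComplete((rsp, t) -> {
            if (t != null) {
                System.err.println("callback saw: " + t.getMessage());
            }
        });
        return 0;
    }

    // Blocking style: join() rethrows the failure wrapped in a
    // CompletionException, which can be mapped to a non-zero exit code.
    static int exitCodeWithJoin(boolean fail) {
        try {
            submitMembershipChange(fail).join();
            return 0;
        } catch (CompletionException e) {
            System.err.println("join threw: " + e.getCause().getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println("callback only: " + exitCodeWithCallbackOnly(true)); // prints 0
        System.out.println("join: " + exitCodeWithJoin(true));                  // prints 1
    }
}
```

In a real CLI the returned value would presumably be passed to `System.exit`; the sketch only prints it so that both variants can be compared in one run.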

This leads us to an issue in the ClientStub that became visible once the Client started waiting for a response. The ClientStub establishes a connection with the remote peer and starts an org.jgroups.util.Runner, which executes a method that reads responses from the socket. We identified the issue when, after running the Jepsen test suite with membership changes and node kills, some threads still lingered and kept CPU usage high even after the test had completed.

The problem happens when we issue a membership change command. The Client creates the ClientStub and submits the command. The ClientStub establishes a connection and starts the org.jgroups.util.Runner, but Jepsen kills the remote peer just after the connection is established. Since the Client now waits for a response and the remote peer is down, we are left with a thread that keeps reading from the socket and failing indefinitely.

The fix we applied was to catch EOFException and to check whether the socket is closed; in either case, the reader thread terminates. With the fix applied, the behavior is as expected and no threads are leaked. Future releases might change the behavior by completing pending requests when the socket is closed.
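A minimal sketch of such a reader loop, using only `java.net` and `java.io` rather than the actual ClientStub or `org.jgroups.util.Runner` code. The names `readUntilClosed` and `demo` are hypothetical, the "protocol" is just a stream of ints, and stopping on any other `IOException` is an assumption of this sketch rather than something stated above.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ReaderLoopSketch {

    // Reads ints until the peer goes away, then returns the reason for stopping.
    static String readUntilClosed(Socket socket) {
        try {
            DataInputStream in = new DataInputStream(socket.getInputStream());
            while (!socket.isClosed()) {
                try {
                    int response = in.readInt(); // blocks until data or end of stream
                    System.out.println("response: " + response);
                } catch (EOFException e) {
                    return "eof"; // peer closed the connection: stop instead of retrying
                }
            }
            return "closed"; // the socket was closed locally
        } catch (IOException e) {
            return "io-error"; // assumption: other I/O failures also end the loop
        }
    }

    // Simulates a peer that sends one response and is then killed.
    static String demo() {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            new DataOutputStream(peer.getOutputStream()).writeInt(42);
            peer.close(); // the remote side disappears

            String[] result = new String[1];
            Thread reader = new Thread(() -> result[0] = readUntilClosed(client));
            reader.start();
            reader.join(5000); // a leaked reader would still be alive after this
            return reader.isAlive() ? "leaked" : result[0];
        } catch (IOException | InterruptedException e) {
            return "setup-failed: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the design is that both the `EOFException` branch and the `isClosed()` check lead out of the loop, so the reader thread cannot spin once the connection is gone.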

@jabolina jabolina added this to the 1.0.11 milestone May 13, 2023
@jabolina jabolina linked a pull request May 13, 2023 that will close this issue