`Client` and `ClientStub` issues during membership operations #209

jabolina · 2023-05-13T14:33:05Z

During the Jepsen tests, we identified issues with `Client` and `ClientStub` when executing membership changes. The test setup had a nemesis that changed membership and killed nodes. Membership changes are submitted through the CLI, which relies on `Client` and `ClientStub`.

The operations applied during the tests invoke `Client` with the arguments to add or remove a member. The command is issued through the CLI, but its exit code is always 0, even when a failure occurs. Taking a closer look, the command is submitted asynchronously, and the response (or exception) is handled in a `.whenComplete` block.

This causes the CLI command to finish with exit code 0, which makes it hard to tell whether the membership change succeeded. The approach we took while testing was to invoke `.join` on the `CompletableFuture` so that thrown exceptions are caught. With this, `Client` exits with code 1 on failure. This also means the CLI command blocks until a response is returned from the remote peer.
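
To illustrate the difference, here is a minimal sketch and not the actual `Client` code. The `submit` method is a hypothetical stand-in for the asynchronous membership request, and the two `exitCode...` helpers are names invented for this example. It shows why a failure handled only in `.whenComplete` still yields exit code 0, while `.join` surfaces it to the caller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class ExitCodeSketch {

    // Hypothetical stand-in for the asynchronous membership request.
    static CompletableFuture<String> submit(boolean fail) {
        CompletableFuture<String> cf = new CompletableFuture<>();
        if (fail) {
            cf.completeExceptionally(new IllegalStateException("peer unreachable"));
        } else {
            cf.complete("ok");
        }
        return cf;
    }

    // Callback style: the exception is only logged inside the callback,
    // so the caller has nothing to react to and always reports 0.
    static int exitCodeWithCallback(boolean fail) {
        submit(fail).whenComplete((rsp, ex) -> {
            if (ex != null) {
                System.err.println("membership change failed: " + ex.getMessage());
            }
        });
        return 0;
    }

    // Blocking style: join() waits for completion and rethrows a failure
    // wrapped in CompletionException, which maps to a non-zero exit code.
    static int exitCodeWithJoin(boolean fail) {
        try {
            submit(fail).join();
            return 0;
        } catch (CompletionException e) {
            System.err.println("membership change failed: " + e.getCause().getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println("callback exit code: " + exitCodeWithCallback(true));
        System.out.println("join exit code: " + exitCodeWithJoin(true));
    }
}
```

In a real CLI the value returned by the blocking variant would be passed to `System.exit`. The trade-off is the one noted above: the command does not return until the future completes.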

This leads us to an issue in `ClientStub` that the change to wait for a response made visible. `ClientStub` establishes a connection with the remote peer and starts an `org.jgroups.util.Runner`, which executes a method that reads responses from the socket. We identified the issue when, after running the Jepsen test suite with membership changes and node kills, some threads still lingered and kept CPU usage high even after the test had completed.

The problem happens as follows. We issue a membership change command; `Client` creates the `ClientStub` and submits the command. The `ClientStub` establishes a connection and starts the `org.jgroups.util.Runner`, but Jepsen kills the remote peer just after the connection is established. Since `Client` now waits for a response and the remote peer is down, we are left with a thread that keeps reading from the socket and failing indefinitely.

The fix we applied was to catch `EOFException` and check whether the socket is closed; in either case, the reader thread is stopped. With the fix applied, the behavior is as expected and no threads leak. Future releases might change the behavior by completing pending requests when the socket is closed.
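
The shape of such a reader loop can be sketched as follows. This is an illustration under assumptions, not the code from the linked pull request: `readUntilClosed` is a hypothetical method, `readInt` stands in for decoding one response, the `closed` supplier stands in for a check such as `Socket.isClosed()`, and rethrowing other I/O errors is a simplification chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.BooleanSupplier;

public class ReaderLoopSketch {

    // Hypothetical reader loop. Returns how many responses were read
    // before the loop stopped.
    static int readUntilClosed(DataInputStream in, BooleanSupplier closed) {
        int responses = 0;
        // Stop as soon as the socket is reported closed.
        while (!closed.getAsBoolean()) {
            try {
                in.readInt(); // stands in for decoding one response
                responses++;
            } catch (EOFException eof) {
                // The remote peer is gone: finish instead of retrying forever.
                break;
            } catch (IOException io) {
                // Simplification for this sketch: surface other I/O errors.
                throw new UncheckedIOException(io);
            }
        }
        return responses;
    }

    public static void main(String[] args) {
        // Eight bytes hold two ints; the third read hits end of stream.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(new byte[8]));
        System.out.println("responses read: " + readUntilClosed(in, () -> false));
    }
}
```

Without the `EOFException` branch, a loop that swallowed the exception and retried would spin on a dead connection, which matches the lingering threads and high CPU usage described above.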

jabolina added this to the 1.0.11 milestone May 13, 2023

jabolina linked a pull request May 13, 2023 that will close this issue

Client and ClientStub fixes #206

Merged

jabolina closed this as completed May 13, 2023
