`Client` and `ClientStub` issues during membership operations #209
During the Jepsen tests, we identified issues with `Client` and `ClientStub` when executing membership changes. The test setup had a nemesis for membership changes and for killing nodes. Membership changes are applied through the CLI, which submits the commands and relies on the `Client` and the `ClientStub`.

The operations applied during the tests invoke the `Client` with arguments to add or remove a member. This command is issued through the CLI, but even when failures occur, the exit code is always 0. Taking a closer look, the command is submitted asynchronously, and the response (and any exception) is handled in a `.whenComplete` block. This causes the CLI command to finish with exit code 0, making it harder to identify whether the membership change succeeded.

The approach we took while testing was to invoke `.join()` on the `CompletableFuture` so we catch thrown exceptions. With this, the `Client` exits with code 1 in case of failure. This also means that the CLI command blocks until a response is returned from the remote peer.

This led us to an issue in the `ClientStub` made visible by the change to wait for a response. The `ClientStub` establishes a connection with the remote peer and initializes an `org.jgroups.util.Runner`, which executes a method that reads responses from the socket. We identified the issue when, after running the Jepsen test suite with membership changes and node kills, some threads still lingered, keeping CPU usage high even after the test had completed.

The problem happens when we issue a membership change command: the `Client` creates the `ClientStub` and submits the command. The `ClientStub` establishes a connection and starts the `org.jgroups.util.Runner`, but the remote peer is killed by Jepsen just after the connection is established. Since the `Client` now waits for a response and the remote peer is down, we have a thread reading from the socket and failing indefinitely.

The fix was to catch `EOFException` and to verify whether the socket is closed, finishing the thread in either case. With the fix applied, the behavior is as expected and no threads are leaked. Future releases might change the behavior by completing pending requests when the socket is closed.
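The exit-code behavior described above can be illustrated with a minimal sketch. This is not the actual `Client` code; the class and method names are hypothetical stand-ins, and the failed future simulates an unreachable peer. It contrasts the original `.whenComplete` handling (the caller returns success regardless) with the `.join()` approach (the failure propagates and maps to exit code 1):

```java
import java.util.concurrent.CompletableFuture;

public class ExitCodeSketch {

    // Hypothetical stand-in for the asynchronous membership-change call;
    // always fails, as if the remote peer were down.
    static CompletableFuture<Void> submitMembershipChange() {
        return CompletableFuture.failedFuture(new RuntimeException("peer unreachable"));
    }

    // Original behavior: the failure is only observed inside the callback,
    // so the caller falls through and reports success regardless.
    static int runAsync() {
        submitMembershipChange().whenComplete((ok, err) -> {
            if (err != null)
                System.err.println("membership change failed: " + err);
        });
        return 0; // exit code 0 even though the operation failed
    }

    // Behavior after the change: join() blocks until completion and rethrows
    // the failure (wrapped in CompletionException), yielding exit code 1.
    static int runBlocking() {
        try {
            submitMembershipChange().join();
            return 0;
        } catch (Exception e) {
            System.err.println("membership change failed: " + e);
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println("async exit code:    " + runAsync());    // 0, even on failure
        System.out.println("blocking exit code: " + runBlocking()); // 1 on failure
    }
}
```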
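The thread-leak fix can be sketched as a socket read loop of the kind a `org.jgroups.util.Runner` would execute repeatedly. This is an assumed wire format (length-prefixed payload) and a hypothetical method, not the actual `ClientStub` code; the point is the termination conditions: stop on `EOFException`, and stop on any other I/O error if the socket is already closed, instead of retrying forever against a dead peer:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.net.Socket;

public class ReadLoopSketch {

    // One iteration of a hypothetical response-reading loop. Returns true to
    // keep running, false when the thread should finish: the peer closed the
    // stream (EOFException) or the socket itself is closed.
    static boolean readOnce(Socket socket, DataInputStream in) {
        try {
            int length = in.readInt();        // blocks until a response arrives
            byte[] payload = new byte[length];
            in.readFully(payload);
            // ... dispatch the response to the pending request ...
            return true;
        } catch (EOFException e) {
            // Remote peer went away: finish the thread instead of looping.
            return false;
        } catch (IOException e) {
            // Other I/O error: finish the thread if the socket is closed.
            return !socket.isClosed();
        }
    }

    public static void main(String[] args) {
        // An empty stream simulates the killed peer: readInt() hits EOF.
        DataInputStream empty = new DataInputStream(new ByteArrayInputStream(new byte[0]));
        System.out.println("keep running? " + readOnce(new Socket(), empty)); // false
    }
}
```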