Skip to content

JAVA-4034 Refactor read/write retries to accommodate for both PoolClearedError and CSOT #782

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 27, 2021

Conversation

stIncMale
Copy link
Member

@stIncMale stIncMale commented Sep 7, 2021

The baseline commit for this PR is the last commit in #701. As a consequence, I had to push the stIncMale:JAVA-3928 branch to mongodb/mongo-java-driver to be able to create a PR with diffs relevant to JAVA-4034. Once this PR is reviewed and approved, I will:

Changes in this commit do the following two major things:

  1. Selecting a server (binding.get*ConnectionSource)
    and checking out a connection (source.getConnection) is now part of a retryable operation.
    Such a change makes MongoConnectionPoolClearedExceptions part of the operation,
    thus allowing us to retry the operation when the exception happens.

  2. The retry limit (not more than two attempts) was hardcoded in the driver's code structure,
    i.e., the code was written in such a way that it was describing the first attempt, then depending
    on the outcome, the code for the second attempt was executed, with the code for the second attempt
    mostly duplicating the code for the first attempt, but having different error handling. Such an approach
    not only negatively affects the code readability, but also prevents changing the number of attempts,
    let alone making decisions on whether to retry based not on an attempt limit.
    The pseudocode in the specifications is also written this way, and at least some other drivers,
    e.g., C#, Rust, took a similar approach to structure the code.

    Changes in this commit represent the outcome of refactoring the code related to the retry logic.
    In the new code the retry logic is decoupled from the logic of a retryable operation as much as possible.
    Since our operations are quite complex and may decide to break retrying in the middle of an attempt,
    e.g., because the server selected in the attempt does not support retries, the operation logic
    is still aware of the fact that it may be retried. However, it is important to understand that
    this awareness is a direct consequence of the business logic. It cannot be gotten rid of
    regardless of the approach taken to structure the code. Operations that have simpler business logic
    can be written in a retry-agnostic way without making changes to the retry framework
    that was added as part of this commit.

    With the changes in this commit, applying the client-side timeout (CSOT) to a retryable operation
    is as simple as replacing the hard retry limit condition with a condition that checks
    whether there is time left for attempting the operation again.

Here are also a few interesting stats:

  • the line count in CommandOperationHelper changed from 865 to 620
  • the line count in OperationHelper changed from 768 to 734
  • the line count in MixedBulkWriteOperation stayed roughly the same (changed from 553 to 551, but a portion of it is comments, while previously it was all code)

Given that the business logic became more complicated due to the item #1 described above, I consider the above reduction in the amount of code expressing the business logic a very good thing, especially given that it is accompanied by the code (particularly callback-based) being made more linear and requiring almost no changes to support CSOT.

JAVA-4034

@stIncMale stIncMale requested review from jyemin and rozza September 7, 2021 23:55
@stIncMale
Copy link
Member Author

stIncMale commented Sep 8, 2021

Since the target brach is not master, the tests are not run automatically. Here are the results of all our tests corresponding to the commit 2c08fd1b1.

P.S. The atlas-test task often fails with CertificateExpiredException. This has been happening since the beginning of August.
P.P.S. The serverless-test task times out, which is also the case for the master branch.

@stIncMale stIncMale changed the title Refactor read/write retries to accommodate for both PoolClearedError and CSOT JAVA-4034 Refactor read/write retries to accommodate for both PoolClearedError and CSOT Sep 8, 2021
@stIncMale stIncMale changed the title JAVA-4034 Refactor read/write retries to accommodate for both PoolClearedError and CSOT JAVA-4034 Refactor read/write retries to accommodate for both PoolClearedError and CSOT Sep 8, 2021
Comment on lines +245 to +250
bulkWriteTracker.batch().ifPresent(bulkWriteBatch -> {
assertTrue(prospectiveFailedResult instanceof MongoWriteConcernWithResponseException);
bulkWriteBatch.addResult((BsonDocument) ((MongoWriteConcernWithResponseException) prospectiveFailedResult)
.getResponse());
BulkWriteTracker.attachNext(retryState, bulkWriteBatch);
});
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece of code seems to be dead from the standpoint of the tests we have, i.e., the breakAndThrowIfRetryAnd always breaks retrying in tests when it is executed. The same is true for executeSync.

The rest of the code seem to be covered.

Copy link
Member

@rozza rozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like it, mostly nits and a few change requests.

*/
public RetryState(final int retries) {
assertTrue(retries >= 0);
assertTrue(retries < INFINITE_ATTEMPTS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just reuse this constructor and default retries to INFINITE_ATTEMPTS in the no args constructor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

import java.util.function.Supplier;

@ThreadSafe
public final class SupplyingCallback<R> implements SingleResultCallback<R>, Supplier<R> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially this class was added in the successive PR #784, but I needed it here to write unit tests for LoopState/RetryState. So I had to "back port" the class here, and will pull the changes to #784 to resolve conflicts.

Copy link
Collaborator

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic. Just a few nits, and one substantial refactoring request.

@stIncMale stIncMale requested a review from jyemin September 24, 2021 01:25
Copy link
Collaborator

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

*
* @see #iteration()
*/
public boolean firstIteration() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our convention seems to be to prefix with is even in internal code, so might as well follow that.

Copy link
Member

@rozza rozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

*
* @see #iteration()
*/
public boolean firstIteration() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats good to know, I'll defer to @jyemin but I'm happy either way.

static BsonDocument executeCommand(final ReadBinding binding, final String database, final CommandCreator commandCreator,
final boolean retryReads) {
return executeCommand(binding, database, commandCreator, new BsonDocumentCodec(), retryReads);
private static Throwable chooseAndMutateRetryableReadException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems confusing as this method doesn't mutate anything. It seems more a kin to a concrete BiFunction implementation.

Is there some expectation in the future from within these methods that an existing exception will be mutated?

@stIncMale stIncMale changed the base branch from JAVA-3928 to master September 27, 2021 18:43
@stIncMale
Copy link
Member Author

stIncMale commented Sep 27, 2021

Apparently, the diff displayed by the GitHub PR UI is calculated not based on the state of the last commits in the "into"/"from" branches, but based on the state of the last commit that is common for the two aforementioned branches, and the last commits in the "from" branch. I thought this is not how the PR diff is calculated. Anyway, this prevents me from simply changing the "into" branch, because the PR diff now combines a diff for both its predecessor #701 and the changes that are actually proposed in the current PR.

In order to redeem the diff for this PR, I will rebase the "from" branch on top of the new "into" (master). This will re-create all the commits in this PR, but at least the diff will be correct again.

…and CSOT

Changes in this commit do the following two major things:

1) These changes make selecting a server (`binding.get*ConnectionSource`)
and checking out a connection (`source.getConnection`) part of a retryable operation.
Such a change makes `MongoConnectionPoolClearedException`s part of the operation,
thus allowing us to retry the operation when the exception happens.

2) The retry limit (not more than two attempts) was hardcoded in the driver's code structure,
i.e., the code was written in such a way that it was describing the first attempt, then depending
on the outcome, the code for the second attempt was executed, with the code for the second attempt
mostly duplicating the code for the first attempt, but having different error handling. Such an approach
not only negatively affects the code readability, but also prevents changing the number of attempts,
let alone making decisions on whether to retry based not on an attempt limit.
The pseudocode in the specifications is also written this way, and at least some other drivers,
e.g., C#, Rust, took a similar approach to structure the code.

Changes in this commit represent the outcome of refactoring the code related to the retry logic.
In the new code the retry logic is decoupled from the logic of a retryable operation as much as possible.
Since our operations are quite complex and may decide to break retrying in the middle of an attempt,
e.g., because the server selected in the attempt does not support retries, the operation logic
is still aware of the fact that it may be retried. However, it is important to understand that
this awareness is a direct consequence of the business logic. It cannot be gotten rid of
regardless of the approach taken to structure the code. Operations that have simpler business logic
can be written in a retry-agnostic way without making changes to the retry framework
that was added as part of this commit.

With the changes in this commit, applying the client-side timeout (CSOT) to a retryable operation
is as simple as replacing the hard retry limit condition with a condition that checks
whether there is time left for attempting the operation again.

JAVA-4034
@stIncMale stIncMale merged commit c11600d into mongodb:master Sep 27, 2021
@stIncMale stIncMale deleted the JAVA-4034 branch September 27, 2021 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants