
fix: truncate RPC timeouts to time remaining in totalTimeout #1191

Merged
merged 5 commits into googleapis:master from noahdietz:truncate-timeouts on Sep 28, 2020

Conversation

Contributor

@noahdietz noahdietz commented Sep 23, 2020

The RetrySettings documentation indicates that an RPC timeout must not allow an in-flight RPC to exceed the totalTimeout. That is, if a calculated RPC timeout is greater than the time remaining in the totalTimeout (measured from the start of the first attempt), the RPC timeout should be truncated to that remaining time. This prevents an RPC from being sent with a timeout that would let it execute beyond the budget allotted by the totalTimeout.

gax-java was not doing this, even though its documentation stated it and several other GAX language implementations already match the documented behavior.

This PR corrects that.
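To make the intended behavior concrete, here is a minimal standalone sketch of the truncation (the class and method names are hypothetical, not the actual gax-java implementation):

import java.time.Duration;

class TruncationSketch {
  // Clamp a calculated per-attempt RPC timeout so the attempt cannot run
  // past the time remaining in the totalTimeout budget.
  static Duration truncateToTotalTimeout(
      Duration calculatedRpcTimeout, Duration totalTimeout, Duration elapsedSinceFirstAttempt) {
    Duration timeLeft = totalTimeout.minus(elapsedSinceFirstAttempt);
    return calculatedRpcTimeout.compareTo(timeLeft) > 0 ? timeLeft : calculatedRpcTimeout;
  }

  public static void main(String[] args) {
    // A 1s calculated timeout with only 300ms left of a 5s total budget is truncated to 300ms.
    System.out.println(
        truncateToTotalTimeout(
            Duration.ofSeconds(1), Duration.ofSeconds(5), Duration.ofMillis(4700)));
  }
}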

@noahdietz noahdietz requested a review from as a code owner Sep 23, 2020
@google-cla google-cla bot added the cla: yes label Sep 23, 2020
@noahdietz noahdietz requested review from vam-google, igorbernstein2, chingor13 and miraleung and removed request for Sep 23, 2020

@codecov codecov bot commented Sep 23, 2020

Codecov Report

Merging #1191 into master will increase coverage by 20.14%.
The diff coverage is 100.00%.


@@              Coverage Diff              @@
##             master    #1191       +/-   ##
=============================================
+ Coverage     58.91%   79.06%   +20.14%     
- Complexity      115     1197     +1082     
=============================================
  Files            20      205      +185     
  Lines           589     5268     +4679     
  Branches         60      435      +375     
=============================================
+ Hits            347     4165     +3818     
- Misses          213      930      +717     
- Partials         29      173      +144     
Impacted Files Coverage Δ Complexity Δ
...le/api/gax/retrying/ExponentialRetryAlgorithm.java 96.36% <100.00%> (ø) 14.00 <0.00> (?)
...ava/com/google/api/gax/rpc/AsyncTaskException.java 100.00% <0.00%> (ø) 1.00% <0.00%> (?%)
...m/google/api/gax/retrying/NoopRetryingContext.java 0.00% <0.00%> (ø) 0.00% <0.00%> (?%)
...java/com/google/api/gax/rpc/PagedCallSettings.java 90.00% <0.00%> (ø) 4.00% <0.00%> (?%)
...ax/longrunning/OperationResponsePollAlgorithm.java 66.66% <0.00%> (ø) 5.00% <0.00%> (?%)
.../java/com/google/api/gax/rpc/RetryingCallable.java 91.66% <0.00%> (ø) 2.00% <0.00%> (?%)
.../java/com/google/api/gax/grpc/GrpcCallContext.java 85.60% <0.00%> (ø) 48.00% <0.00%> (?%)
...m/google/api/gax/tracing/NoopApiTracerFactory.java 66.66% <0.00%> (ø) 2.00% <0.00%> (?%)
...oogle/api/gax/retrying/DirectRetryingExecutor.java 78.94% <0.00%> (ø) 6.00% <0.00%> (?%)
...e/api/gax/rpc/RetryingServerStreamingCallable.java 0.00% <0.00%> (ø) 0.00% <0.00%> (?%)
... and 176 more


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 287cada...dcbf828.

Contributor

@vam-google vam-google left a comment

It looks good, but I'm concerned about the possibility of breaking the LRO portion.

Duration timeElapsed =
    Duration.ofNanos(clock.nanoTime())
        .minus(Duration.ofNanos(prevSettings.getFirstAttemptStartTimeNanos()));
Duration timeLeft = globalSettings.getTotalTimeout().minus(timeElapsed).minus(randomDelay);
newRpcTimeout = Math.min(newRpcTimeout, timeLeft.toMillis());
Contributor

@vam-google vam-google Sep 23, 2020

What would happen if newRpcTimeout <= 0?

Contributor Author

@noahdietz noahdietz Sep 23, 2020

If newRpcTimeout <= 0, the subsequent shouldRetry check would prevent another attempt from being made, because it does a similar check to see whether the current time plus the proposed random delay would exceed the totalTimeout:

long totalTimeSpentNanos =
        clock.nanoTime()
            - nextAttemptSettings.getFirstAttemptStartTimeNanos()
            + nextAttemptSettings.getRandomizedRetryDelay().toNanos();

// If totalTimeout limit is defined, check that it hasn't been crossed
if (totalTimeout > 0 && totalTimeSpentNanos > totalTimeout) {
   return false;
}

We could add some logic to handle if (timeLeft.isNegative() || timeLeft.isZero()), but I'm not sure what we'd set it to...maybe allow newRpcTimeout to remain unchanged by timeLeft?
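For reference, a minimal sketch of the guard being floated here (purely illustrative; per the follow-up below, the merged change documents the behavior instead of adding such a guard):

import java.time.Duration;

class TimeLeftGuardSketch {
  // Hypothetical helper: only truncate when there is budget remaining.
  static long guardedRpcTimeoutMillis(long newRpcTimeoutMillis, Duration timeLeft) {
    if (timeLeft.isNegative() || timeLeft.isZero()) {
      // No budget remains: leave newRpcTimeout unchanged; shouldRetry is
      // expected to reject the next attempt anyway.
      return newRpcTimeoutMillis;
    }
    return Math.min(newRpcTimeoutMillis, timeLeft.toMillis());
  }
}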

Contributor Author

@noahdietz noahdietz Sep 25, 2020

After our offline conversation, I've decided to just document the likely behavior. LMK if you feel strongly otherwise.

.setInitialRpcTimeout(Duration.ofMillis(100))
.setMaxRpcTimeout(Duration.ofSeconds(1))
.setRpcTimeoutMultiplier(2)
.setTotalTimeout(Duration.ofSeconds(5L))
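For context, here is a small sketch of the per-attempt timeout progression these settings produce under exponential growth, before any truncation against the 5s totalTimeout (illustrative arithmetic only, not gax-java code):

import java.time.Duration;

class TimeoutProgressionSketch {
  public static void main(String[] args) {
    Duration timeout = Duration.ofMillis(100);      // initialRpcTimeout
    Duration maxRpcTimeout = Duration.ofSeconds(1); // maxRpcTimeout
    double multiplier = 2.0;                        // rpcTimeoutMultiplier
    // Attempts 1..6: 100ms, 200ms, 400ms, 800ms, 1000ms, 1000ms (capped at maxRpcTimeout).
    for (int attempt = 1; attempt <= 6; attempt++) {
      System.out.println("attempt " + attempt + ": " + timeout.toMillis() + "ms");
      long next = (long) (timeout.toMillis() * multiplier);
      timeout = Duration.ofMillis(Math.min(next, maxRpcTimeout.toMillis()));
    }
  }
}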
Contributor

@vam-google vam-google Sep 23, 2020

What is the meaning of totalTimeout in the case of LRO? Note that the exponential retry algorithm is shared between retries and the LRO polling logic. Whatever we do for retries, we must ensure that it does not break LRO. Ideally, we want the new logic to be completely disabled for LRO; i.e., the need to modify LRO tests indicates that there is a breaking behavioral change for LRO, which we should avoid.

Contributor Author

@noahdietz noahdietz Sep 23, 2020

I had the same thought and I was confused by what this test was doing.

The test setup actually defaults totalTimeout to 5 ms, but this test then increases initialRpcTimeout and maxRpcTimeout to values much greater than that existing 5 ms totalTimeout. Per comments in a generated client, initialRpcTimeout and maxRpcTimeout should be ignored and set to 0 for LROs? Not really sure what the test is doing here.

Contributor Author

@noahdietz noahdietz Sep 23, 2020

FWIW, we had no GAPIC config for LRO polling initialRpcTimeout or maxRpcTimeout, so they would never be generated and would only ever be set by users... and they'd have the same behavior as non-LRO RPCs, where a poll could run past the totalTimeout.

Contributor Author

@noahdietz noahdietz Sep 23, 2020

I guess this test was testing for the existing incorrect behavior, where the RPC timeout didn't care whether it exceeded the totalTimeout.

Contributor Author

@noahdietz noahdietz Sep 28, 2020

The totalTimeout in the context of LRO is the "total polling timeout" (gapic config), i.e., the duration a "synchronous" LRO call should poll before giving up.

This test was set up weirdly, as I described. PTAL.

@stephaniewang526 stephaniewang526 added the kokoro:force-run label Sep 24, 2020
Contributor

@miraleung miraleung commented Sep 25, 2020

Would like others to approve this as well.

@noahdietz noahdietz requested review from miraleung and vam-google Sep 25, 2020
@noahdietz noahdietz added kokoro:force-run and removed kokoro:force-run labels Sep 28, 2020
Contributor

@vam-google vam-google left a comment

LGTM

Contributor Author

@noahdietz noahdietz commented Sep 28, 2020

For the record: this change will cause the Java Showcase test in gapic-generator that I added recently to fail. That test verifies some of the emergent behavior we are fixing with this PR, so I will fix it as part of the dependency update this rolls out in.

@noahdietz noahdietz merged commit 1d0c940 into googleapis:master Sep 28, 2020
8 checks passed
@noahdietz noahdietz deleted the truncate-timeouts branch Sep 28, 2020
Labels
cla: yes kokoro:force-run

6 participants