Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: respect PDML timeout when using streaming RPC #338

Merged
merged 3 commits into from Jul 15, 2020

Conversation

olavloite
Copy link
Contributor

@olavloite olavloite commented Jul 10, 2020

@thiagotnunes Would you mind trying the same test with this PR? I don't have a dataset that is big enough that it will run for more than one hour.

Fixes #336

@olavloite olavloite requested a review from thiagotnunes Jul 10, 2020
@thiagotnunes
Copy link
Contributor

@thiagotnunes thiagotnunes commented Jul 11, 2020

@olavloite will do so on Monday and will let you know.

Digging in a little on the fix, it seems we just needed to fix the keepalive for the rpc calls, so am I right to assume that the thinking is that the streamWaitTimeout is the one being respected for this particular call?

Thanks for looking into this!

@thiagotnunes
Copy link
Contributor

@thiagotnunes thiagotnunes commented Jul 13, 2020

I got another exception when running the code. I think there is a bug on using the remainingTimeout (line 108) on the PartitionedDMLTransaction. If the timeout is <= 0, we do not cancel the retry. A simple fix would be to break in this case. Below is the stack trace:

Exception in thread "main" com.google.cloud.spanner.SpannerException: UNKNOWN: Invalid timeout: < 0 s
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:211)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:57)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:151)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:94)
	at com.google.cloud.spanner.PartitionedDMLTransaction.executeStreamingPartitionedUpdate(PartitionedDMLTransaction.java:156)
	at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate(SessionImpl.java:118)
	at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate(SessionPool.java:1349)
	at com.google.cloud.spanner.SessionPool$PooledSessionFuture.executePartitionedUpdate(SessionPool.java:1236)
	at com.google.cloud.spanner.DatabaseClientImpl$3.apply(DatabaseClientImpl.java:227)
	at com.google.cloud.spanner.DatabaseClientImpl$3.apply(DatabaseClientImpl.java:224)
	at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry(DatabaseClientImpl.java:241)
	at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate(DatabaseClientImpl.java:222)
	at com.google.cloud.spanner.PdmlIssue.main(PdmlIssue.java:14)
Caused by: java.lang.IllegalArgumentException: Invalid timeout: < 0 s
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
	at com.google.api.gax.grpc.GrpcCallContext.withStreamWaitTimeout(GrpcCallContext.java:177)
	at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.executeStreamingPartitionedDml(GapicSpannerRpc.java:1108)
	at com.google.cloud.spanner.PartitionedDMLTransaction.executeStreamingPartitionedUpdate(PartitionedDMLTransaction.java:113)
	... 8 more

This is my executor:

final SpannerOptions options = SpannerOptions.newBuilder().build();
final Spanner spanner = options.getService();
final DatabaseId id = DatabaseId.of(PROJECT, INSTANCE, DATABASE);
final DatabaseClient client = spanner.getDatabaseClient(id);

client.executePartitionedUpdate(Statement.of(
    "UPDATE Table SET bool = true WHERE true"
));

@thiagotnunes
Copy link
Contributor

@thiagotnunes thiagotnunes commented Jul 13, 2020

In any case, the failure occurs after the specified timeout (2 hours by default).

@googlebot googlebot added the cla: yes label Jul 13, 2020
@olavloite
Copy link
Contributor Author

@olavloite olavloite commented Jul 13, 2020

@thiagotnunes Thanks for the test results. I've updated the retry-loop to take a potential zero or negative deadline into account, and I've added some additional unit tests for this class. PTAL.

Copy link
Contributor

@thiagotnunes thiagotnunes left a comment

The latest tests that I've done have terminated with a com.google.cloud.spanner.SpannerException: DEADLINE_EXCEEDED, but the weird thing is that the process did not take 2 hours to fail, but around 68 mins, so it does seem that the 2 hours window is still not being respected.
I might have misspoke when I said that the previous one failed in 2 hours, I thought it did, but I had no timer in runner code. I added it now. These are the logs from the execution:

Started running at Tue Jul 14 12:41:21 AEST 2020
Exception in thread "main" com.google.cloud.spanner.SpannerException: DEADLINE_EXCEEDED
Finished running at Tue Jul 14 13:49:33 AEST 2020 (took 4092380ms)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:211)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:57)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:52)
	at com.google.cloud.spanner.SpannerExceptionFactory.fromApiException(SpannerExceptionFactory.java:229)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:144)
	at com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:94)
	at com.google.cloud.spanner.PartitionedDMLTransaction.executeStreamingPartitionedUpdate(PartitionedDMLTransaction.java:162)
	at com.google.cloud.spanner.SessionImpl.executePartitionedUpdate(SessionImpl.java:118)
	at com.google.cloud.spanner.SessionPool$PooledSession.executePartitionedUpdate(SessionPool.java:1349)
	at com.google.cloud.spanner.SessionPool$PooledSessionFuture.executePartitionedUpdate(SessionPool.java:1236)
	at com.google.cloud.spanner.DatabaseClientImpl$3.apply(DatabaseClientImpl.java:227)
	at com.google.cloud.spanner.DatabaseClientImpl$3.apply(DatabaseClientImpl.java:224)
	at com.google.cloud.spanner.DatabaseClientImpl.runWithSessionRetry(DatabaseClientImpl.java:241)
	at com.google.cloud.spanner.DatabaseClientImpl.executePartitionedUpdate(DatabaseClientImpl.java:222)
	at com.google.cloud.spanner.PmlIssue.main(PmlIssue.java:19)

Process finished with exit code 1

@@ -157,6 +163,10 @@ long executeStreamingPartitionedUpdate(final Statement statement, Duration timeo
}
}

Stopwatch createStopwatchStarted() {
Copy link
Contributor

@thiagotnunes thiagotnunes Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would it be hard to inject the Stopwatch instead? Not a big deal, just wondering if we could use composition here. Disregard if it would be too hard to inject.


@SuppressWarnings("unchecked")
@RunWith(JUnit4.class)
public class PartitionedDmlTransactionTest {
Copy link
Contributor

@thiagotnunes thiagotnunes Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tests!

@thiagotnunes
Copy link
Contributor

@thiagotnunes thiagotnunes commented Jul 14, 2020

I know what the problem is: multiple failures are occurring so whenever we are subtracting the elapsed time, we are subtracting the accumulated time of the stopWatch every time. In order to fix this, we should reset the stopWatch in the case that we are about to retry. We could add stopWatch.reset() when catching the UnavailableException and inside the if statement (after line 135 of the PartitionedDMLTransaction).

An example of the problem I am seeing:

  1. We start the PDML statement.
stopWatch = "PT0S"
remainingTimeout = "PT2H"; 
  1. After 10 mins there is a retry-able failure. We subtract the stopWatch from the remainingTimeout.
stopWatch = "PT10M";
remainingTimeout = remainingTimeout - stopWatch = "PT1H50M";
  1. After 20 mins from the start (1) there is another failure. We again subtract the stopWatch from the remainingTimeout, but instead of subtracting 10 mins (time elapsed from the first retry to the second retry), we subtract 20 mins (time elapsed from the start to the second retry).
stopWatch = "PT20M"
remainingTimeout = remainingTimeout - stopWatch = "PT1H30M"; // This should had been PT1H40M

Btw, @olavloite let me know if you this feedback loop is too big (me testing a day after your change). I could implement these minor fixes if you'd like so, otherwise I am happy on testing / reporting back the results to you.

@olavloite
Copy link
Contributor Author

@olavloite olavloite commented Jul 14, 2020

@thiagotnunes Thanks for the elaborate testing and report. Hopefully the timeout problem should be fixed now.

And yes, please do feel free to add any additional fixes that might be needed to get this 100%.

Copy link
Contributor

@thiagotnunes thiagotnunes left a comment

@olavloite thanks for the fix. It worked as expected (timing out after 2 hours).

@olavloite olavloite merged commit d67f108 into master Jul 15, 2020
23 checks passed
@olavloite olavloite deleted the pdml-streaming-retry-settings branch Jul 15, 2020
gcf-merge-on-green bot pushed a commit that referenced this issue Jul 16, 2020
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.59.0](https://www.github.com/googleapis/java-spanner/compare/v1.58.0...v1.59.0) (2020-07-16)


### Features

* add support for NUMERIC data type ([#193](https://www.github.com/googleapis/java-spanner/issues/193)) ([b38a91d](https://www.github.com/googleapis/java-spanner/commit/b38a91d8daac264b9dea327d6b31430d9599bd78))
* spanner NUMERIC type ([#349](https://www.github.com/googleapis/java-spanner/issues/349)) ([78c3192](https://www.github.com/googleapis/java-spanner/commit/78c3192266c474fc43277a8bf3f15caa968a0100))


### Bug Fixes

* check if emulator is running if env var is set ([#340](https://www.github.com/googleapis/java-spanner/issues/340)) ([597f501](https://www.github.com/googleapis/java-spanner/commit/597f501803e6d58717a6e3770e6fd3f34454e9a5))
* fix potential unnecessary transaction retry ([#337](https://www.github.com/googleapis/java-spanner/issues/337)) ([1a4f4fd](https://www.github.com/googleapis/java-spanner/commit/1a4f4fd675a1580c87ad1d53c650a20bd2ff4811)), closes [#327](https://www.github.com/googleapis/java-spanner/issues/327)
* respect PDML timeout when using streaming RPC ([#338](https://www.github.com/googleapis/java-spanner/issues/338)) ([d67f108](https://www.github.com/googleapis/java-spanner/commit/d67f108e86925c1296e695db8e78fa82e11fa4fa))
* runs sample tests in java 8 and java 11 ([#345](https://www.github.com/googleapis/java-spanner/issues/345)) ([b547e31](https://www.github.com/googleapis/java-spanner/commit/b547e31d095be3cf1646e0e9c07bfc467ecc3c22))
* set gRPC keep-alive to 120 seconds ([#339](https://www.github.com/googleapis/java-spanner/issues/339)) ([26be103](https://www.github.com/googleapis/java-spanner/commit/26be103da1117c4940550fad1672c66e6edfbdb3))


### Dependencies

* update dependency com.google.cloud:google-cloud-shared-dependencies to v0.8.3 ([#334](https://www.github.com/googleapis/java-spanner/issues/334)) ([45acd89](https://www.github.com/googleapis/java-spanner/commit/45acd8960c961d48e91a7b1546efa64d9e9ae576))
* update shared config to 0.9.2 ([#328](https://www.github.com/googleapis/java-spanner/issues/328)) ([75df62c](https://www.github.com/googleapis/java-spanner/commit/75df62c0176137fda1d0a9076b83be06f11228ce))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants