Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: retry pdml transaction on EOS internal error #360

Merged
merged 8 commits into from Jul 26, 2020
Merged

fix: retry pdml transaction on EOS internal error #360

merged 8 commits into from Jul 26, 2020

Conversation

thiagotnunes
Copy link
Contributor

@thiagotnunes thiagotnunes commented Jul 20, 2020

The Spanner server can return an internal error closing the connection stream. When this happens an InternalException is thrown with a specific EOS message. We are seeing this specially in long running PDML tasks.

This case is now retried by the client, either by resuming the existing transaction (when there is a resume token) or by starting a new transaction.

@google-cla google-cla bot added the cla: yes label Jul 20, 2020
It is possible to have the stream closed with an EOS internal error.
This should be retried by the client. On this PR we add this retry logic.
@thiagotnunes thiagotnunes requested review from olavloite and skuruppu Jul 21, 2020
@@ -55,23 +56,6 @@
this.rpc = rpc;
}

private ByteString initTransaction() {
Copy link
Contributor Author

@thiagotnunes thiagotnunes Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was just moved down (after the public methods).

@@ -127,20 +111,24 @@ long executeStreamingPartitionedUpdate(final Statement statement, final Duration
}
}
break;
} catch (UnavailableException e) {
} catch (UnavailableException | InternalException e) {
Copy link
Contributor Author

@thiagotnunes thiagotnunes Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this catch block we retry on EOS exception.

private boolean shouldResumeOrRestartTransaction(Exception e) {
return e instanceof UnavailableException
|| (e instanceof InternalException
&& e.getMessage().contains("Received unexpected EOS on DATA frame from server"));
Copy link
Contributor Author

@thiagotnunes thiagotnunes Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hacky, but unfortunately we do not get a specific exception for this error, so we have to proxy through the exception message. This follows the approach taken in the bigquery client: https://github.com/googleapis/java-bigquerystorage/pull/263/files.

Copy link
Contributor

@olavloite olavloite Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same error is also retried for streaming queries, and this check could probably use the existing check in SpannerExceptionFactory:

private static boolean isRetryable(ErrorCode code, @Nullable Throwable cause) {
. It's still just as hacky, but it would keep it confined to one method.

@thiagotnunes thiagotnunes changed the title Retry PDML Transaction on EOS Internal Error fix: retry pdml transaction on EOS internal error Jul 21, 2020
@codecov
Copy link

@codecov codecov bot commented Jul 21, 2020

Codecov Report

No coverage uploaded for pull request base (master@87f4f13). Click here to learn what that means.
The diff coverage is n/a.

@skuruppu
Copy link
Contributor

@skuruppu skuruppu commented Jul 21, 2020

@thiagotnunes I'm going to hold off on reviewing this until I better understand why this error happens. Hope it's ok to wait a couple of days until the Spanner team has had a chance to look at the issue.

@thiagotnunes thiagotnunes added the do not merge label Jul 21, 2020
Copy link
Contributor

@skuruppu skuruppu left a comment

Thanks for making this change @thiagotnunes. I would also like @olavloite to take a look before merging.

@thiagotnunes thiagotnunes removed the do not merge label Jul 21, 2020
Copy link
Contributor

@olavloite olavloite left a comment

This looks generally good to me, but could you have a look and see if it's possible to reuse the existing isRetryable method for the internal exception?

private boolean shouldResumeOrRestartTransaction(Exception e) {
return e instanceof UnavailableException
|| (e instanceof InternalException
&& e.getMessage().contains("Received unexpected EOS on DATA frame from server"));
Copy link
Contributor

@olavloite olavloite Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same error is also retried for streaming queries, and this check could probably use the existing check in SpannerExceptionFactory:

private static boolean isRetryable(ErrorCode code, @Nullable Throwable cause) {
. It's still just as hacky, but it would keep it confined to one method.

@thiagotnunes
Copy link
Contributor Author

@thiagotnunes thiagotnunes commented Jul 22, 2020

Thanks for the feedback @olavloite, I did not know about that functionality! I will make a change in this PR to use that.

When the exception is an EOS, we should retry the exception.
Re-uses the retry logic applied by the spanner exception factory for
retrying EOS internal exceptions in pdml transactions.
@@ -257,35 +256,8 @@ private static boolean hasCauseMatching(
}

private static class Matchers {
static final Predicate<Throwable> isRetryableInternalError =
Copy link
Contributor Author

@thiagotnunes thiagotnunes Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extracted these predicates into their own classes that are now exposed to the PartitionedDmlTransaction. This was the cleanest / lowest impact way I could find to de-duplicate the EOS retryable logic.

@thiagotnunes
Copy link
Contributor Author

@thiagotnunes thiagotnunes commented Jul 24, 2020

@olavloite @skuruppu do you mind giving the PR another look? I did the following changes:

  • Extracted predicates from the SpannerExceptionFactory in order to use them in the PartitionedDmlTransaction, thus removing the EOS logic duplication.
  • Changed the PartitionedDmlTransaction class to use the extracted predicate.
  • Added tests for the extracted classes.

public boolean apply(Throwable cause) {
if (isInternalError(cause)) {
if (cause.getMessage().contains(HTTP2_ERROR_MESSAGE)) {
// See b/25451313.
Copy link
Contributor

@skuruppu skuruppu Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below, if we could remove the references to internal issues that would be great. I don't know why they were there before. Folks working on this repo are unlikely to have access to those so no point exposing them.

Copy link
Contributor Author

@thiagotnunes thiagotnunes Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, will tackle this in a following PR.

@olavloite
Copy link
Contributor

@olavloite olavloite commented Jul 24, 2020

@thiagotnunes Thanks for extracting the predicates for the retryable internal errors.
I have no further comments, this looks good to me.

gcf-merge-on-green bot pushed a commit that referenced this issue Aug 20, 2020
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.60.0](https://www.github.com/googleapis/java-spanner/compare/v1.59.0...v1.60.0) (2020-08-18)


### Features

* adds clirr check on pre-commit hook ([#388](https://www.github.com/googleapis/java-spanner/issues/388)) ([bd5c93f](https://www.github.com/googleapis/java-spanner/commit/bd5c93f045e06372b2235f3d350bade93bff2c24))
* include SQL statement in error message ([#355](https://www.github.com/googleapis/java-spanner/issues/355)) ([cc5ac48](https://www.github.com/googleapis/java-spanner/commit/cc5ac48232b6e4550b98d213c5877d6ec37b293f))


### Bug Fixes

* enables emulator tests ([#380](https://www.github.com/googleapis/java-spanner/issues/380)) ([f61c6d0](https://www.github.com/googleapis/java-spanner/commit/f61c6d0d332f15826499996a292acc7cbab267a7))
* remove custom timeout and retry settings ([#365](https://www.github.com/googleapis/java-spanner/issues/365)) ([f6afd21](https://www.github.com/googleapis/java-spanner/commit/f6afd213430d3f06d9a72c64a5c37172840fed0e))
* remove unused kokoro files ([#367](https://www.github.com/googleapis/java-spanner/issues/367)) ([6125c7d](https://www.github.com/googleapis/java-spanner/commit/6125c7d221c77f4c42497b72107627ee09312813))
* retry pdml transaction on EOS internal error ([#360](https://www.github.com/googleapis/java-spanner/issues/360)) ([a53d736](https://www.github.com/googleapis/java-spanner/commit/a53d7369bb2a8640ab42e409632b352decbdbf5e))
* sets the project for the integration tests ([#386](https://www.github.com/googleapis/java-spanner/issues/386)) ([c8fa458](https://www.github.com/googleapis/java-spanner/commit/c8fa458f5369a09c780ee38ecc09bd2562e8f987))


### Dependencies

* stop auto updates of commons-lang3 ([#362](https://www.github.com/googleapis/java-spanner/issues/362)) ([8f07ed6](https://www.github.com/googleapis/java-spanner/commit/8f07ed6b44f9c70f56b9ee2e4505c40385337ca7))
* update dependency com.google.cloud:google-cloud-shared-dependencies to v0.8.6 ([#374](https://www.github.com/googleapis/java-spanner/issues/374)) ([6f47b8a](https://www.github.com/googleapis/java-spanner/commit/6f47b8a759643f772230df0c2e153338d44f70ce))
* update dependency org.openjdk.jmh:jmh-core to v1.24 ([#375](https://www.github.com/googleapis/java-spanner/issues/375)) ([94f568c](https://www.github.com/googleapis/java-spanner/commit/94f568cf731ba22cac7f0d898d7776a3cc2c178f))
* update dependency org.openjdk.jmh:jmh-core to v1.25 ([#382](https://www.github.com/googleapis/java-spanner/issues/382)) ([ec7888e](https://www.github.com/googleapis/java-spanner/commit/ec7888e1d62cf800bf6ad166d242e89443ddc7aa))
* update dependency org.openjdk.jmh:jmh-generator-annprocess to v1.25 ([#376](https://www.github.com/googleapis/java-spanner/issues/376)) ([8ffdc48](https://www.github.com/googleapis/java-spanner/commit/8ffdc481e15901f78eac592bd8d4bef33ac3378a))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please).
thiagotnunes pushed a commit to thiagotnunes/java-spanner that referenced this issue Jun 5, 2021
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.60.0](https://www.github.com/googleapis/java-spanner/compare/v1.59.0...v1.60.0) (2020-08-18)


### Features

* adds clirr check on pre-commit hook ([googleapis#388](https://www.github.com/googleapis/java-spanner/issues/388)) ([bd5c93f](https://www.github.com/googleapis/java-spanner/commit/bd5c93f045e06372b2235f3d350bade93bff2c24))
* include SQL statement in error message ([googleapis#355](https://www.github.com/googleapis/java-spanner/issues/355)) ([cc5ac48](https://www.github.com/googleapis/java-spanner/commit/cc5ac48232b6e4550b98d213c5877d6ec37b293f))


### Bug Fixes

* enables emulator tests ([googleapis#380](https://www.github.com/googleapis/java-spanner/issues/380)) ([f61c6d0](https://www.github.com/googleapis/java-spanner/commit/f61c6d0d332f15826499996a292acc7cbab267a7))
* remove custom timeout and retry settings ([googleapis#365](https://www.github.com/googleapis/java-spanner/issues/365)) ([f6afd21](https://www.github.com/googleapis/java-spanner/commit/f6afd213430d3f06d9a72c64a5c37172840fed0e))
* remove unused kokoro files ([googleapis#367](https://www.github.com/googleapis/java-spanner/issues/367)) ([6125c7d](https://www.github.com/googleapis/java-spanner/commit/6125c7d221c77f4c42497b72107627ee09312813))
* retry pdml transaction on EOS internal error ([googleapis#360](https://www.github.com/googleapis/java-spanner/issues/360)) ([a53d736](https://www.github.com/googleapis/java-spanner/commit/a53d7369bb2a8640ab42e409632b352decbdbf5e))
* sets the project for the integration tests ([googleapis#386](https://www.github.com/googleapis/java-spanner/issues/386)) ([c8fa458](https://www.github.com/googleapis/java-spanner/commit/c8fa458f5369a09c780ee38ecc09bd2562e8f987))


### Dependencies

* stop auto updates of commons-lang3 ([googleapis#362](https://www.github.com/googleapis/java-spanner/issues/362)) ([8f07ed6](https://www.github.com/googleapis/java-spanner/commit/8f07ed6b44f9c70f56b9ee2e4505c40385337ca7))
* update dependency com.google.cloud:google-cloud-shared-dependencies to v0.8.6 ([googleapis#374](https://www.github.com/googleapis/java-spanner/issues/374)) ([6f47b8a](https://www.github.com/googleapis/java-spanner/commit/6f47b8a759643f772230df0c2e153338d44f70ce))
* update dependency org.openjdk.jmh:jmh-core to v1.24 ([googleapis#375](https://www.github.com/googleapis/java-spanner/issues/375)) ([94f568c](https://www.github.com/googleapis/java-spanner/commit/94f568cf731ba22cac7f0d898d7776a3cc2c178f))
* update dependency org.openjdk.jmh:jmh-core to v1.25 ([googleapis#382](https://www.github.com/googleapis/java-spanner/issues/382)) ([ec7888e](https://www.github.com/googleapis/java-spanner/commit/ec7888e1d62cf800bf6ad166d242e89443ddc7aa))
* update dependency org.openjdk.jmh:jmh-generator-annprocess to v1.25 ([googleapis#376](https://www.github.com/googleapis/java-spanner/issues/376)) ([8ffdc48](https://www.github.com/googleapis/java-spanner/commit/8ffdc481e15901f78eac592bd8d4bef33ac3378a))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants