Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: retry non-idempotent long-running RPCs #141

Merged
merged 5 commits into from Apr 13, 2020

Conversation

olavloite
Copy link
Contributor

@olavloite olavloite commented Apr 7, 2020

RPCs returning a long-running operation, such as CreateDatabase, CreateBackup and RestoreDatabase, are non-idempotent and cannot be retried automatically by gax. This means that these RPCs sometimes fail with transient errors, such as UNAVAILABLE or DEADLINE_EXCEEDED. This change introduces automatic retries of these RPCs using the following logic:

  1. Execute the RPC and wait for the operation to be returned.
  2. If a transient error occurs while waiting for the operation, the client library queries the backend for the corresponding operation. If the operation is found, the resumes the tracking of the existing operation and returns that to the user.
  3. If no corresponding operation is found in step 2, the client library retries the RPC from step 1.

Fixes GoogleCloudPlatform/java-docs-samples#2571

@googlebot googlebot added the cla: yes label Apr 7, 2020
@olavloite olavloite requested review from hengfengli and skuruppu Apr 7, 2020
Copy link
Contributor

@hengfengli hengfengli left a comment

LGTM.

databaseAdminStubSettings
.createBackupOperationSettings()
.getInitialCallSettings()
.getRetrySettings(),
Copy link
Contributor

@skuruppu skuruppu Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't specify retry settings for non-idempotent RPCs. Do you know what this will default to then?

Copy link
Contributor Author

@olavloite olavloite Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are defaults defined in the gapic configuration file as well, but those are already overridden by the client library. The defaults are set in the constructor of SpannerOptions.Builder:

private Builder() {
// Manually set retry and polling settings that work.
OperationTimedPollAlgorithm longRunningPollingAlgorithm =
OperationTimedPollAlgorithm.create(
RetrySettings.newBuilder()
.setInitialRpcTimeout(Duration.ofSeconds(60L))
.setMaxRpcTimeout(Duration.ofSeconds(600L))
.setInitialRetryDelay(Duration.ofSeconds(20L))
.setMaxRetryDelay(Duration.ofSeconds(45L))
.setRetryDelayMultiplier(1.5)
.setRpcTimeoutMultiplier(1.5)
.setTotalTimeout(Duration.ofHours(48L))
.build());
RetrySettings longRunningRetrySettings =
RetrySettings.newBuilder()
.setInitialRpcTimeout(Duration.ofSeconds(60L))
.setMaxRpcTimeout(Duration.ofSeconds(600L))
.setInitialRetryDelay(Duration.ofSeconds(20L))
.setMaxRetryDelay(Duration.ofSeconds(45L))
.setRetryDelayMultiplier(1.5)
.setRpcTimeoutMultiplier(1.5)
.setTotalTimeout(Duration.ofHours(48L))
.build();
databaseAdminStubSettingsBuilder
.createDatabaseOperationSettings()
.setPollingAlgorithm(longRunningPollingAlgorithm)
.setInitialCallSettings(
UnaryCallSettings
.<CreateDatabaseRequest, OperationSnapshot>newUnaryCallSettingsBuilder()
.setRetrySettings(longRunningRetrySettings)
.build());
databaseAdminStubSettingsBuilder
.createBackupOperationSettings()
.setPollingAlgorithm(longRunningPollingAlgorithm)
.setInitialCallSettings(
UnaryCallSettings
.<CreateBackupRequest, OperationSnapshot>newUnaryCallSettingsBuilder()
.setRetrySettings(longRunningRetrySettings)
.build());
databaseAdminStubSettingsBuilder
.restoreDatabaseOperationSettings()
.setPollingAlgorithm(longRunningPollingAlgorithm)
.setInitialCallSettings(
UnaryCallSettings
.<RestoreDatabaseRequest, OperationSnapshot>newUnaryCallSettingsBuilder()
.setRetrySettings(longRunningRetrySettings)
.build());
databaseAdminStubSettingsBuilder
.deleteBackupSettings()
.setRetrySettings(longRunningRetrySettings);
databaseAdminStubSettingsBuilder
.updateBackupSettings()
.setRetrySettings(longRunningRetrySettings);
}

Copy link
Contributor

@hengfengli hengfengli Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this mean that the Java client library has its own retrying settings which are different from the gapic config file?

@@ -291,7 +292,8 @@ private Builder() {
.setRetrySettings(longRunningRetrySettings);
databaseAdminStubSettingsBuilder
.updateBackupSettings()
.setRetrySettings(longRunningRetrySettings);
.setRetrySettings(longRunningRetrySettings)
.setRetryableCodes(StatusCode.Code.DEADLINE_EXCEEDED, StatusCode.Code.UNAVAILABLE);
Copy link
Contributor

@hengfengli hengfengli Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to manually specify the retryable codes here? In the gapic config file (https://github.com/googleapis/googleapis/blob/d741cd976975c745d0199987aff0e908b8352992/google/spanner/admin/database/v1/spanner_admin_database_grpc_service_config.json#L58-L67), updateBackup will retry DEADLINE_EXCEEDED and UNAVAILABLE. Shouldn't this code be auto-generated?

I guess Java gapic code generator is using a different config file: https://github.com/googleapis/googleapis/blob/master/google/spanner/admin/database/v1/spanner_admin_database_gapic.yaml#L83-L86.

Copy link
Contributor

@hengfengli hengfengli Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: why do we manually set retrying settings instead of using the default values from the gapic config file?

Copy link
Contributor Author

@olavloite olavloite Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a lot of problems / transient errors using the default values when working on the backups feature. So I set the values manually to try to find settings that did work, so that we can set them back to the default config.

So the Java client should definitely not continue to use custom retry settings. Once we have verified that these settings work well, we should set these as the default in the config file.

Copy link
Contributor Author

@olavloite olavloite Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, the Java client (and I think also the other clients, but I'm not 100% sure) use the retry settings in https://github.com/googleapis/googleapis/blob/master/google/spanner/admin/database/v1/spanner_admin_database_gapic.yaml. So that means for example that the default config for UpdateBackup is non-idempotent.

olavloite added 5 commits Apr 10, 2020
RPCs returning a long-running operation, such as CreateDatabase,
CreateBackup and RestoreDatabase, are non-idempotent and cannot be
retried automatically by gax. This means that these RPCs sometimes fail
with transient errors, such as UNAVAILABLE or DEADLINE_EXCEEDED. This
change introduces automatic retries of these RPCs using the following
logic:
1. Execute the RPC and wait for the operation to be returned.
2. If a transient error occurs while waiting for the operation, the
   client library queries the backend for the corresponding operation.
   If the operation is found, the resumes the tracking of the existing
   operation and returns that to the user.
3. If no corresponding operation is found in step 2, the client library
   retries the RPC from step 1.

Fixes GoogleCloudPlatform/java-docs-samples#2571
@olavloite olavloite force-pushed the retry-non-idempotent-long-running-rpcs branch from 771ae90 to 605b672 Compare Apr 10, 2020
@olavloite olavloite merged commit 4669c02 into master Apr 13, 2020
12 of 13 checks passed
@olavloite olavloite deleted the retry-non-idempotent-long-running-rpcs branch Apr 13, 2020
gcf-merge-on-green bot pushed a commit that referenced this issue Apr 22, 2020
🤖 I have created a release \*beep\* \*boop\* 
---
## [1.53.0](https://www.github.com/googleapis/java-spanner/compare/v1.52.0...v1.53.0) (2020-04-22)


### Features

* optimize maintainer to let sessions be GC'ed instead of deleted ([#135](https://www.github.com/googleapis/java-spanner/issues/135)) ([d65747c](https://www.github.com/googleapis/java-spanner/commit/d65747cbc704508f6f1bcef6eea53aa411d42ee2))


### Bug Fixes

* assign unique id's per test case ([#129](https://www.github.com/googleapis/java-spanner/issues/129)) ([a553b6d](https://www.github.com/googleapis/java-spanner/commit/a553b6d48c4f5ee2d0583e5b825d73a85f06216e))
* check for not null input for Id classes ([#159](https://www.github.com/googleapis/java-spanner/issues/159)) ([ecf5826](https://www.github.com/googleapis/java-spanner/commit/ecf582670818f32e85f534ec400d0b8d31cf9ca6)), closes [#145](https://www.github.com/googleapis/java-spanner/issues/145)
* clean up test instance if creation failed ([#162](https://www.github.com/googleapis/java-spanner/issues/162)) ([ff571e1](https://www.github.com/googleapis/java-spanner/commit/ff571e16a45fbce692d9bb172749ff15fafe7a9c))
* fix flaky test and remove warnings ([#153](https://www.github.com/googleapis/java-spanner/issues/153)) ([d534e35](https://www.github.com/googleapis/java-spanner/commit/d534e350346b0c9ab8057ede36bc3aac473c0b06)), closes [#146](https://www.github.com/googleapis/java-spanner/issues/146)
* increase test timeout and remove warnings ([#160](https://www.github.com/googleapis/java-spanner/issues/160)) ([63a6bd8](https://www.github.com/googleapis/java-spanner/commit/63a6bd8be08a56d002f58bc2cdb2856ad0dc5fa3)), closes [#158](https://www.github.com/googleapis/java-spanner/issues/158)
* retry non-idempotent long-running RPCs ([#141](https://www.github.com/googleapis/java-spanner/issues/141)) ([4669c02](https://www.github.com/googleapis/java-spanner/commit/4669c02a24e0f7b1d53c9edf5ab7b146b4116960))
* retry restore if blocked by pending restore ([#119](https://www.github.com/googleapis/java-spanner/issues/119)) ([220653d](https://www.github.com/googleapis/java-spanner/commit/220653d8e25c518d0df447bf777a7fcbf04a01ca)), closes [#118](https://www.github.com/googleapis/java-spanner/issues/118)
* StatementParser did not accept multiple query hints ([#170](https://www.github.com/googleapis/java-spanner/issues/170)) ([ef41a6e](https://www.github.com/googleapis/java-spanner/commit/ef41a6e503f218c00c16914aa9c1433d9b26db13)), closes [#163](https://www.github.com/googleapis/java-spanner/issues/163)
* wait for initialization to finish before test ([#161](https://www.github.com/googleapis/java-spanner/issues/161)) ([fe434ff](https://www.github.com/googleapis/java-spanner/commit/fe434ff7068b4b618e70379c224e1c5ab88f6ba1)), closes [#146](https://www.github.com/googleapis/java-spanner/issues/146)


### Performance Improvements

* increase sessions in the pool in batches ([#134](https://www.github.com/googleapis/java-spanner/issues/134)) ([9e5a1cd](https://www.github.com/googleapis/java-spanner/commit/9e5a1cdaacf71147b67681861f063c3276705f44))
* prepare sessions with r/w tx in-process ([#152](https://www.github.com/googleapis/java-spanner/issues/152)) ([2db27ce](https://www.github.com/googleapis/java-spanner/commit/2db27ce048efafaa3c28b097de33518747011465)), closes [#151](https://www.github.com/googleapis/java-spanner/issues/151)


### Dependencies

* update core dependencies ([#109](https://www.github.com/googleapis/java-spanner/issues/109)) ([5753f1f](https://www.github.com/googleapis/java-spanner/commit/5753f1f4fed83df87262404f7a7ba7eedcd366cb))
* update core dependencies ([#132](https://www.github.com/googleapis/java-spanner/issues/132)) ([77c1558](https://www.github.com/googleapis/java-spanner/commit/77c1558652ee00e529674ac3a2dcf3210ef049fa))
* update dependency com.google.api:api-common to v1.9.0 ([#127](https://www.github.com/googleapis/java-spanner/issues/127)) ([b2c744f](https://www.github.com/googleapis/java-spanner/commit/b2c744f01a4d5a8981df5ff900f3536c83265a61))
* update dependency com.google.guava:guava-bom to v29 ([#147](https://www.github.com/googleapis/java-spanner/issues/147)) ([3fe3ae0](https://www.github.com/googleapis/java-spanner/commit/3fe3ae02376af552564c93c766f562d6454b7ac1))
* update dependency io.grpc:grpc-bom to v1.29.0 ([#164](https://www.github.com/googleapis/java-spanner/issues/164)) ([2d2ce5c](https://www.github.com/googleapis/java-spanner/commit/2d2ce5ce4dc8f410ec671e542e144d47f39ab40b))
* update dependency org.threeten:threetenbp to v1.4.3 ([#120](https://www.github.com/googleapis/java-spanner/issues/120)) ([49d1abc](https://www.github.com/googleapis/java-spanner/commit/49d1abcb6c9c48762dcf0fe1466ab107bf67146b))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants