
fix: add client shutdown if request waiting in request queue for too long. #2017

Merged: 90 commits into googleapis:main on Mar 1, 2023

Conversation

GaoleMeng
Contributor

This is needed based on an offline investigation: it is possible for the gRPC streaming library to never trigger doneCallback / requestCallback in some scenarios, e.g. when there are too many GAX threads. We should add this timeout to help the connection spit out the requests sitting on dead connections.
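For illustration, here is a minimal, self-contained sketch of the kind of watchdog this PR adds to the append loop. The names (CallbackWatchdog, onCallback, MAX_CALLBACK_WAIT) are hypothetical stand-ins rather than the actual ConnectionWorker fields, and the 20-minute value reflects the timeout agreed on later in this thread.

import java.time.Duration;
import java.time.Instant;

class CallbackWatchdog {
  // How long we tolerate waiting for the next request callback before
  // assuming the connection is dead.
  private static final Duration MAX_CALLBACK_WAIT = Duration.ofMinutes(20);

  private Instant lastCallbackTime = Instant.now();

  // Called by the gRPC response/done callback whenever it fires.
  synchronized void onCallback() {
    lastCallbackTime = Instant.now();
  }

  // Called periodically from the append loop while requests are in flight.
  synchronized void throwIfCallbackOverdue() {
    Duration waited = Duration.between(lastCallbackTime, Instant.now());
    if (waited.compareTo(MAX_CALLBACK_WAIT) > 0) {
      // A RuntimeException here is meant to reach the append thread's
      // uncaught exception handler and trigger connection shutdown.
      throw new RuntimeException(
          "Waited " + waited.toMinutes() + " minutes without a request callback");
    }
  }
}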

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> ☕️

If you write sample code, please follow the samples format.

GaoleMeng and others added 30 commits September 13, 2022 01:58
also fixed a tiny bug inside fake bigquery write impl for getting the response from offset
possible the proto schema does not contain this field
@GaoleMeng GaoleMeng requested review from a team and shollyman February 24, 2023 09:12
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. labels Feb 24, 2023
Duration milliSinceLastCallback =
    Duration.between(lastRequestCallbackTriggerTime, Instant.now());
if (milliSinceLastCallback.compareTo(MAXIMUM_REQUEST_CALLBACK_WAIT_TIME) > 0) {
  throw new IllegalStateException(


I don't think this is really an IllegalStateException

Contributor Author


Let's use a generic runtime exception; the goal is only to trigger the uncaught exception handler.
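To make the intent concrete, here is a small, self-contained demonstration of the general mechanism being relied on (generic Java, not the StreamWriter code itself): an unchecked exception thrown on a worker thread reaches that thread's uncaught exception handler, which can then perform cleanup and shutdown.

public class UncaughtHandlerDemo {
  public static void main(String[] args) throws InterruptedException {
    Thread appendThread =
        new Thread(
            () -> {
              // Simulates throwIfWaitCallbackTooLong() firing inside the append loop.
              throw new RuntimeException("request callback overdue, shutting down connection");
            },
            "append-loop");
    appendThread.setUncaughtExceptionHandler(
        (thread, throwable) -> {
          // In the real writer this is where connection shutdown/cleanup would be triggered.
          System.err.println(thread.getName() + " failed: " + throwable.getMessage());
        });
    appendThread.start();
    appendThread.join();
  }
}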

@@ -659,6 +678,18 @@ private void appendLoop() {
log.info("Append thread is done. Stream: " + streamName + " id: " + writerId);
}

private void throwIfWaitCallbackTooLong() {
  Duration milliSinceLastCallback =
      Duration.between(lastRequestCallbackTriggerTime, Instant.now());


If a StreamWriter is created and then there is a pause before using it, won't this trigger? (Odd scenario, I know.) Actually, if a user ever pauses for 10 minutes, won't this cause the exception to be thrown the next time they use the StreamWriter?

Contributor Author


The caller of throwIfWaitCallbackTooLong checks the inflight queue size before triggering this check.


Alternatively, would it make sense to store a timestamp in the RequestWrapper itself when it is added to the inflightRequestQueue? Since the queue is in order, you can then just check the timestamp at the head of the queue.
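A sketch of this suggestion, with hypothetical names rather than the actual AppendRequestAndResponse and inflightRequestQueue types: stamp the wrapper only when it actually becomes inflight, and, since the queue is FIFO, peek at the head to find the oldest pending request.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

class InflightQueueWatchdog {
  // Timeout value is the one settled on later in this thread.
  private static final Duration MAXIMUM_REQUEST_CALLBACK_WAIT_TIME = Duration.ofMinutes(20);

  static class RequestWrapper {
    Instant inflightEnqueueTime; // set only when the request actually goes inflight
  }

  private final Deque<RequestWrapper> inflightRequestQueue = new ArrayDeque<>();

  void addToInflightQueue(RequestWrapper wrapper) {
    wrapper.inflightEnqueueTime = Instant.now();
    inflightRequestQueue.addLast(wrapper);
  }

  // The queue is FIFO, so the head is always the oldest inflight request;
  // a single peek is enough to detect a stuck RPC.
  void throwIfOldestRequestTooOld() {
    RequestWrapper oldest = inflightRequestQueue.peekFirst();
    if (oldest == null) {
      return; // nothing inflight, so nothing can be stuck
    }
    Duration waited = Duration.between(oldest.inflightEnqueueTime, Instant.now());
    if (waited.compareTo(MAXIMUM_REQUEST_CALLBACK_WAIT_TIME) > 0) {
      throw new RuntimeException("Oldest inflight request has been waiting for " + waited);
    }
  }
}

Stamping at inflight insertion rather than at construction also avoids counting time a request spends in the waitingRequestQueue or during user pauses, which addresses the false-positive concern raised above.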


right, but append will add to this queue, right? So if append is called after a long delay, the appendLoop might hit this condition (if the appendLoop runs before the callback comes back).

Contributor Author


Changed to storing the timestamp directly during creation of the request wrapper.

private void throwIfWaitCallbackTooLong(Instant timeToCheck) {
  Duration milliSinceLastCallback = Duration.between(timeToCheck, Instant.now());
  if (milliSinceLastCallback.compareTo(MAXIMUM_REQUEST_CALLBACK_WAIT_TIME) > 0) {
    throw new RuntimeException(


Are you sure that a RuntimeException will be caught and will cancel things?

Contributor Author


AppendRequestAndResponse(AppendRowsRequest message, StreamWriter streamWriter) {
  this.appendResult = SettableApiFuture.create();
  this.message = message;
  this.messageSize = message.getProtoRows().getSerializedSize();
  this.streamWriter = streamWriter;
  this.requestCreationTimeStamp = Instant.now();


Not sure if this is 100% right, since this gets called when the request is added to waitingRequestQueue. I think we want a timestamp that is set close to when the actual RPC goes out.

Contributor Author


I was trying to start counting time when the user calls append. But let's change it to set the timestamp when the request is added to the inflight queue.


The idea is to detect stuck RPCs. If things are stuck in the waitingRequestQueue (and nothing is stuck in the inflight queue), then presumably something else is very wrong.

* We will constantly check how much time we have been waiting for the next request callback;
* if we wait too long, we will start shutting down the connections and cleaning up the queues.
*/
private static Duration MAXIMUM_REQUEST_CALLBACK_WAIT_TIME = Duration.ofMinutes(10);
Contributor


To keep it safe, let's put in a longer time, say 30 minutes.

Contributor Author


Let's use 20 minutes; 30 minutes sounds crazy long for a pipeline to be stopped.

Contributor


Right now the user waits forever on the queue; I am just afraid some user does crazy things. I am fine with 20 min.

@GaoleMeng GaoleMeng added the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 28, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 28, 2023
@GaoleMeng GaoleMeng added the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 28, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 28, 2023
@GaoleMeng GaoleMeng added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 28, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 28, 2023
@product-auto-label product-auto-label bot added size: s Pull request size is small. and removed size: m Pull request size is medium. labels Feb 28, 2023
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Feb 28, 2023
@GaoleMeng GaoleMeng added the owlbot:run Add this label to trigger the Owlbot post processor. label Mar 1, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Mar 1, 2023
@GaoleMeng GaoleMeng merged commit 91da88b into googleapis:main Mar 1, 2023
gcf-merge-on-green bot pushed a commit that referenced this pull request Mar 1, 2023
🤖 I have created a release *beep* *boop*
---


## [2.33.0](https://togithub.com/googleapis/java-bigquerystorage/compare/v2.32.1...v2.33.0) (2023-03-01)


### Features

* Add header back to the client ([#2016](https://togithub.com/googleapis/java-bigquerystorage/issues/2016)) ([de00447](https://togithub.com/googleapis/java-bigquerystorage/commit/de00447958e5939d7be9d0f7da02323aabbfed8c))


### Bug Fixes

* Add client shutdown if request waiting in request queue for too long. ([#2017](https://togithub.com/googleapis/java-bigquerystorage/issues/2017)) ([91da88b](https://togithub.com/googleapis/java-bigquerystorage/commit/91da88b0ed914bf55111dd9cef2a3fc4b27c3443))
* Allow StreamWriter settings to override passed in BQ client setting ([#2001](https://togithub.com/googleapis/java-bigquerystorage/issues/2001)) ([66db8fe](https://togithub.com/googleapis/java-bigquerystorage/commit/66db8fed26474076fb5aaca5044d39e11f6ef28d))
* Catch uncaught exception from append loop and add exponential retry to reconnection ([#2015](https://togithub.com/googleapis/java-bigquerystorage/issues/2015)) ([35db0fb](https://togithub.com/googleapis/java-bigquerystorage/commit/35db0fb38a929a8f3e4db30ee173ce5a4af43d64))
* Remove write_location header pending discussion ([#2021](https://togithub.com/googleapis/java-bigquerystorage/issues/2021)) ([0941d43](https://togithub.com/googleapis/java-bigquerystorage/commit/0941d4363daf782e0be81c11fdf6a2fe0ff4d7ac))

---
This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).