fix: A stuck when the client fail to get DoneCallback #1637

Merged
merged 5 commits into from May 2, 2022

yirutang
Contributor

Add a one-minute timeout when waiting for the done callback to be called. This is the same timeout used for client close.
The done callback mainly returns the server-side error status, so it is not critical. In the Dataflow connector, we saw a hang because the DoneCallback was lost and we waited on it forever.

Stack trace in b/230501926
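
The bounded wait described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class `DoneCallbackWaiter`, its constructor parameters, the `onDone` method, and the string-valued status are all hypothetical, and `Thread.sleep` stands in for Guava's `Uninterruptibles.sleepUninterruptibly` so the sketch needs only the JDK. Only the names `waitForDoneCallback`, `lock`, and `connectionFinalStatus` come from the diff.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the connection worker; not the library's class.
public class DoneCallbackWaiter {
  private final Lock lock = new ReentrantLock();
  private final long timeoutMillis;
  private final long pollMillis;
  // Guarded by lock. Stays null until the done callback or the timeout sets it.
  private String connectionFinalStatus;

  public DoneCallbackWaiter(long timeoutMillis, long pollMillis) {
    this.timeoutMillis = timeoutMillis;
    this.pollMillis = pollMillis;
  }

  // Assumed to be invoked by the stream's done callback, if it ever arrives.
  public void onDone(String status) {
    lock.lock();
    try {
      if (connectionFinalStatus == null) {
        connectionFinalStatus = status;
      }
    } finally {
      lock.unlock();
    }
  }

  // Polls for the final status, but gives up once the deadline passes
  // instead of waiting forever on a callback that may have been lost.
  public String waitForDoneCallback() {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (System.nanoTime() - deadline < 0) {
      lock.lock();
      try {
        if (connectionFinalStatus != null) {
          return connectionFinalStatus;
        }
      } finally {
        lock.unlock();
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        break;
      }
    }
    lock.lock();
    try {
      if (connectionFinalStatus == null) {
        // Record a synthetic status so later readers never see null.
        connectionFinalStatus = "TIMEOUT: done callback not received";
      }
      return connectionFinalStatus;
    } finally {
      lock.unlock();
    }
  }
}
```

With a one-minute timeout, a caller would construct something like `new DoneCallbackWaiter(60_000, 100)`; the 100 ms poll interval mirrors the sleep visible in the diff hunk.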

@yirutang yirutang requested review from a team and steffnay April 29, 2022 21:56
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. labels Apr 29, 2022
@product-auto-label product-auto-label bot added size: s Pull request size is small. and removed size: m Pull request size is medium. labels Apr 29, 2022
@yirutang
Contributor Author

@reuvenlax

@reuvenlax

LGTM

@gnanda
Contributor

gnanda commented Apr 29, 2022

Do we know why the doneCallback isn't being called? Is there any chance of a memory leak from this because resources aren't being cleaned up?

E.g., is there any chance we're swallowing an exception here and need a catch block to set connectionFinalStatus?

Otherwise LGTM

@yirutang
Contributor Author

yirutang commented Apr 29, 2022

> Do we know why the doneCallback isn't being called? Is there any chance of a memory leak from this because resources aren't being cleaned up?
>
> Otherwise LGTM

I don't know exactly. What we saw is the connector stuck for days in this: https://b.corp.google.com/issues/230501926#comment3

According to Reuven:
Reuven Lax, Wed 11:34 AM
Slava says that they've seen cases where the done callback doesn't come after closing the stream.

Reuven Lax, Wed 11:37 AM
He also says that we had a lot of trouble with very long-lived streaming RPCs. What we ended up switching to was setting a 3-minute timeout on the streaming RPC and reconnecting every 3 minutes.

What do you mean by swallowing an exception on L622?

@@ -505,6 +506,14 @@ private void waitForDoneCallback() {
}
Uninterruptibles.sleepUninterruptibly(100, TimeUnit.MILLISECONDS);
}
this.lock.lock();
if (connectionFinalStatus == null) {
Contributor

I'm not super familiar with Java, but do we need the try {} finally {lock.unlock();} around this block like I see in other places, or is that unnecessary?

Contributor Author

Sure.
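
For readers less familiar with the idiom the reviewer is asking about, here is a small sketch of why `lock.unlock()` usually goes in a `finally` block: if anything between `lock()` and `unlock()` throws, the lock would otherwise stay held and every later caller would block. The class `FinalStatusHolder` and its methods are hypothetical and exist only for illustration; the null check is there purely to give the guarded section a way to throw.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical holder for a final status, used only to illustrate the idiom.
public class FinalStatusHolder {
  private final ReentrantLock lock = new ReentrantLock();
  // Guarded by lock.
  private Throwable connectionFinalStatus;

  // Sets the final status once. The finally block releases the lock
  // even when the guarded section throws.
  public void setIfUnset(Throwable status) {
    lock.lock();
    try {
      if (status == null) {
        throw new IllegalArgumentException("status must not be null");
      }
      if (connectionFinalStatus == null) {
        connectionFinalStatus = status;
      }
    } finally {
      lock.unlock();
    }
  }

  public Throwable get() {
    lock.lock();
    try {
      return connectionFinalStatus;
    } finally {
      lock.unlock();
    }
  }

  // True if any thread currently holds the lock.
  public boolean isLocked() {
    return lock.isLocked();
  }
}
```

Calling `setIfUnset(null)` throws, yet `isLocked()` afterwards reports `false`, so other threads can still acquire the lock. Without the `finally`, the same call would leave the lock held by the calling thread.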

@stephaniewang526 stephaniewang526 added the owlbot:run Add this label to trigger the Owlbot post processor. label May 2, 2022
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label May 2, 2022
@stephaniewang526 stephaniewang526 added automerge Merge the pull request once unit tests and other checks pass. owlbot:run Add this label to trigger the Owlbot post processor. labels May 2, 2022
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label May 2, 2022
@gcf-merge-on-green gcf-merge-on-green bot merged commit 3baa84e into googleapis:main May 2, 2022
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label May 2, 2022
gcf-merge-on-green bot pushed a commit that referenced this pull request May 5, 2022
🤖 I have created a release *beep* *boop*
---


## [2.13.0](v2.12.2...v2.13.0) (2022-05-05)


### Features

* add support to a few more specific StorageErrors for the Write API ([#1563](#1563)) ([c26091e](c26091e))
* next release from main branch is 2.12.2 ([#1624](#1624)) ([b2aa2a4](b2aa2a4))


### Bug Fixes

* A stuck when the client fail to get DoneCallback ([#1637](#1637)) ([3baa84e](3baa84e))
* Fix a possible NULL PTR after introduced timeout on waitForDone ([#1638](#1638)) ([e1c6ded](e1c6ded))


### Dependencies

* update dependency com.google.cloud:google-cloud-bigquery to v2.10.10 ([#1623](#1623)) ([54b74b8](54b74b8))
* update dependency org.apache.avro:avro to v1.11.0 ([#1632](#1632)) ([b47eea0](b47eea0))


### Documentation

* **samples:** update WriteCommittedStream sample code to match best practices ([#1628](#1628)) ([5d4c7e1](5d4c7e1))
* **sample:** update WriteToDefaultStream sample to match best practices ([#1631](#1631)) ([73ddd7b](73ddd7b))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Labels
api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. size: s Pull request size is small.

4 participants