
bigquery: Inserter.Put does not retry on "http2: stream closed" transient error #1793

Closed
nodirg opened this issue Feb 12, 2020 · 16 comments

@nodirg

nodirg commented Feb 12, 2020

We observe bigquery.Inserter.Put returning errors that end with "http2: stream closed". It appears to be a transient error, but Inserter.Put does not treat it as such:

func retryableError(err error) bool {

I think it is a bug.
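
In the meantime, a caller-side retry seems to be a workable stopgap. A minimal sketch, assuming the error string is stable (the helper name, attempt count, and backoff values below are illustrative, not part of the library):

// Caller-side workaround sketch: retry Inserter.Put when the failure ends
// with the transient-looking "http2: stream closed" message.
package bqworkaround

import (
	"context"
	"strings"
	"time"

	"cloud.google.com/go/bigquery"
)

// putWithRetry retries Inserter.Put on "http2: stream closed" errors,
// backing off exponentially between attempts.
func putWithRetry(ctx context.Context, ins *bigquery.Inserter, rows interface{}) error {
	backoff := 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		err = ins.Put(ctx, rows)
		if err == nil || !strings.HasSuffix(err.Error(), "http2: stream closed") {
			return err // success, or an error we don't treat as transient
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return err
}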

@tbpg tbpg assigned tritone and shollyman and unassigned tritone Feb 13, 2020
@shollyman
Contributor

Thanks for the report. Relocating this to the proper repository (google-cloud-go), as the one this issue was reported in (google-api-go-client) contains the underlying discovery-generated artifacts.

@shollyman shollyman transferred this issue from googleapis/google-api-go-client Feb 22, 2020
@yoshi-automation yoshi-automation added 🚨 This issue needs some love. triage me I really want to be triaged. labels Feb 22, 2020
@tbpg tbpg added api: bigquery Issues related to the BigQuery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Feb 22, 2020
@dinvlad

dinvlad commented Feb 27, 2020

Same issue here, only with "load" jobs. We run many of them (around 100) in parallel against various tables, and at least one now fails regularly.

@shollyman
Contributor

I'd love to get a better understanding of how we're getting into this state, as I suspect it's symptomatic of another issue. Both reports are cases where we can potentially have a large number of connections in flight. Are you using a default-constructed bigquery client / transport, or is there more to how the transport layer is set up? Do you only observe this under high connection concurrency?

@dinvlad

dinvlad commented Mar 4, 2020

Yes, we’ve only observed it with the default transport and high concurrency. It seemed to occur in “bursts”, where for several batches of concurrent requests (waiting several minutes before each batch) we’d get consistent failures. And then after waiting even longer, we didn’t experience it anymore.

@nodirg
Author

nodirg commented Mar 4, 2020

Same here; we've observed this only while processing a large backlog (after an outage).

Perhaps a workaround is a semaphore limiting the number of concurrent requests.
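
Something along those lines, as a rough sketch using a buffered channel as the semaphore (the limit of 50 and the []*bigquery.ValuesSaver batches are placeholders):

// Semaphore sketch: cap the number of in-flight Inserter.Put calls.
package bqsemaphore

import (
	"context"
	"sync"

	"cloud.google.com/go/bigquery"
)

// putAll runs one Put per batch, with a buffered channel acting as a
// counting semaphore to bound concurrency.
func putAll(ctx context.Context, ins *bigquery.Inserter, batches [][]*bigquery.ValuesSaver) []error {
	sem := make(chan struct{}, 50) // at most 50 Put calls in flight
	errs := make([]error, len(batches))
	var wg sync.WaitGroup
	for i, batch := range batches {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot (blocks when 50 are in flight)
		go func(i int, batch []*bigquery.ValuesSaver) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			errs[i] = ins.Put(ctx, batch)
		}(i, batch)
	}
	wg.Wait()
	return errs
}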

@shollyman
Contributor

Thanks for confirming. I'll keep looking into this.

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Aug 10, 2020
@shollyman
Contributor

I'm going to close this one out due to staleness. I've been unable to reproduce this, and I've not gotten further reports. If this resurfaces, please comment with additional details (or open a new issue).

@iamolegga

Hi @shollyman, this is still happening. Same case as @dinvlad: a large number (up to 600) of parallel insertions into different tables (up to 100) over several minutes. App-level retries succeed, by the way.

@shollyman shollyman reopened this Aug 11, 2020
@meredithslota
Contributor

@iamolegga Can you provide any additional repro steps? We are really struggling to reproduce.

@iamolegga

@meredithslota well, nothing special here. We recently tested with higher parallelism and ran into what look like network limits: increasing the number of app instances writing to BigQuery gives the same throughput of handled messages, so now I'm thinking the problem could be network limits. We've reached 900-1000 parallel insertions with a couple of hundred active bigquery clients (and the same number of tables).

@shollyman
Contributor

I'm able to repro this occasionally now, but it's inconsistent.

It looks like this with GODEBUG=http2debug=1 when we start seeing failures:

2020/10/23 20:12:18 http2: Transport creating client conn 0xc003c7e180 to 74.125.142.95:443
2020/10/23 20:12:18 http2: Transport encoding header ":authority" = "bigquery.googleapis.com"
2020/10/23 20:12:18 http2: Transport encoding header ":method" = "POST"
2020/10/23 20:12:18 http2: Transport encoding header ":path" = "/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false"
2020/10/23 20:12:18 http2: Transport encoding header ":scheme" = "https"
2020/10/23 20:12:18 http2: Transport encoding header "x-goog-api-client" = "gl-go/1.15.0 gccl/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "user-agent" = "gcloud-golang-bigquery/20201002"
2020/10/23 20:12:18 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=100, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=16384
2020/10/23 20:12:18 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=983041
2020/10/23 20:12:18 http2: Transport encoding header "content-type" = "application/json"
2020/10/23 20:12:18 http2: Transport encoding header "authorization" = "Bearer <REDACTED TOKEN>"
2020/10/23 20:12:18 http2: Transport encoding header "x-cloud-trace-context" = "55f8b3a5942353c760dcc1b8ac76e5ac/7698187974917749598;o=0"
2020/10/23 20:12:18 http2: Transport encoding header "content-length" = "20726"
2020/10/23 20:12:18 http2: Transport encoding header "accept-encoding" = "gzip"
2020/10/23 20:12:18 http2: Transport received SETTINGS flags=ACK len=0
2020/10/23 20:12:18 http2: Transport encoding header ":authority" = "bigquery.googleapis.com"
2020/10/23 20:12:18 http2: Transport encoding header ":method" = "POST"
2020/10/23 20:12:18 http2: Transport encoding header ":path" = "/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false"
2020/10/23 20:12:18 http2: Transport encoding header ":scheme" = "https"
2020/10/23 20:12:18 http2: Transport encoding header "x-goog-api-client" = "gl-go/1.15.0 gccl/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "user-agent" = "gcloud-golang-bigquery/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "content-type" = "application/json"
2020/10/23 20:12:18 http2: Transport encoding header "authorization" = "Bearer <REDACTED TOKEN>"
2020/10/23 20:12:18 http2: Transport encoding header "x-cloud-trace-context" = "cb3907aa98366b7beb1412ef34255870/438750139110050095;o=0"
2020/10/23 20:12:18 http2: Transport encoding header "content-length" = "20726"
2020/10/23 20:12:18 http2: Transport encoding header "accept-encoding" = "gzip"
2020/10/23 20:12:18 http2: Transport encoding header ":authority" = "bigquery.googleapis.com"
2020/10/23 20:12:18 http2: Transport encoding header ":method" = "POST"
2020/10/23 20:12:18 http2: Transport encoding header ":path" = "/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false"
2020/10/23 20:12:18 http2: Transport encoding header ":scheme" = "https"
2020/10/23 20:12:18 http2: Transport encoding header "x-goog-api-client" = "gl-go/1.15.0 gccl/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "user-agent" = "gcloud-golang-bigquery/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "content-type" = "application/json"
2020/10/23 20:12:18 http2: Transport encoding header "x-cloud-trace-context" = "eb54ba2fdc5e705d2cf0ff72cccdffb6/14610876467390266809;o=0"
2020/10/23 20:12:18 http2: Transport encoding header "authorization" = "Bearer <REDACTED TOKEN>"
2020/10/23 20:12:18 http2: Transport encoding header "content-length" = "20726"
2020/10/23 20:12:18 http2: Transport encoding header "accept-encoding" = "gzip"
2020/10/23 20:12:18 http2: Transport encoding header ":authority" = "bigquery.googleapis.com"
2020/10/23 20:12:18 http2: Transport encoding header ":method" = "POST"
2020/10/23 20:12:18 http2: Transport encoding header ":path" = "/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false"
2020/10/23 20:12:18 http2: Transport encoding header ":scheme" = "https"
2020/10/23 20:12:18 http2: Transport encoding header "content-type" = "application/json"
2020/10/23 20:12:18 http2: Transport encoding header "x-goog-api-client" = "gl-go/1.15.0 gccl/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "user-agent" = "gcloud-golang-bigquery/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "x-cloud-trace-context" = "79e16c1eafce9caad38f4df089476c0d/17712140079411926609;o=0"
2020/10/23 20:12:18 http2: Transport encoding header "authorization" = "Bearer <REDACTED TOKEN>"
2020/10/23 20:12:18 http2: Transport encoding header "content-length" = "20726"
2020/10/23 20:12:18 http2: Transport encoding header "accept-encoding" = "gzip"
2020/10/23 20:12:18 http2: Transport received SETTINGS flags=ACK len=0
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=203 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=205 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=207 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=209 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=211 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=213 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=215 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport encoding header ":authority" = "bigquery.googleapis.com"
2020/10/23 20:12:18 http2: Transport encoding header ":method" = "POST"
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=217 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport encoding header ":path" = "/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false"
2020/10/23 20:12:18 http2: Transport encoding header ":scheme" = "https"
2020/10/23 20:12:18 http2: Transport encoding header "x-cloud-trace-context" = "a4fdf1acac7c9e3f22d36c3ffcc796ce/3540013751131644616;o=0"
2020/10/23 20:12:18 http2: Transport encoding header "x-goog-api-client" = "gl-go/1.15.0 gccl/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "user-agent" = "gcloud-golang-bigquery/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "content-type" = "application/json"
2020/10/23 20:12:18 http2: Transport encoding header "authorization" = "Bearer <REDACTED TOKEN>"
2020/10/23 20:12:18 http2: Transport encoding header "content-length" = "20726"
2020/10/23 20:12:18 http2: Transport encoding header "accept-encoding" = "gzip"
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=219 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport encoding header ":authority" = "bigquery.googleapis.com"
2020/10/23 20:12:18 http2: Transport encoding header ":method" = "POST"
2020/10/23 20:12:18 http2: Transport encoding header ":path" = "/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false"
2020/10/23 20:12:18 http2: Transport encoding header ":scheme" = "https"
2020/10/23 20:12:18 http2: Transport encoding header "authorization" = "Bearer <REDACTED TOKEN>"
2020/10/23 20:12:18 http2: Transport encoding header "x-cloud-trace-context" = "a7a87e8627a515fcc50e916813053436/4172943081919415992;o=0"
2020/10/23 20:12:18 http2: Transport encoding header "user-agent" = "gcloud-golang-bigquery/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "content-type" = "application/json"
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=221 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport encoding header "x-goog-api-client" = "gl-go/1.15.0 gccl/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "content-length" = "20726"
2020/10/23 20:12:18 http2: Transport encoding header "accept-encoding" = "gzip"
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=223 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=225 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=227 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=229 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=231 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=233 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=235 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=237 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 http2: Transport received RST_STREAM stream=239 len=4 ErrCode=REFUSED_STREAM
2020/10/23 20:12:18 RoundTrip failure: http2: stream closed
2020/10/23 20:12:18 RoundTrip failure: http2: stream closed
2020/10/23 20:12:18 http2: Transport encoding header ":authority" = "bigquery.googleapis.com"
2020/10/23 20:12:18 http2: Transport encoding header ":method" = "POST"
2020/10/23 20:12:18 http2: Transport encoding header ":path" = "/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false"
2020/10/23 20:12:18 http2: Transport encoding header ":scheme" = "https"
2020/10/23 20:12:18 Error(279): Post "https://bigquery.googleapis.com/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false": http2: stream closed
2020/10/23 20:12:18 Error(286): Post "https://bigquery.googleapis.com/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false": http2: stream closed
2020/10/23 20:12:18 http2: Transport encoding header "x-goog-api-client" = "gl-go/1.15.0 gccl/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "user-agent" = "gcloud-golang-bigquery/20201002"
2020/10/23 20:12:18 http2: Transport encoding header "content-type" = "application/json"
2020/10/23 20:12:18 RoundTrip failure: http2: stream closed

interesting bit:

2020/10/23 20:12:18 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=100, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=16384

The logging seems to imply we're dealing with more than 100 streams. Is the correct fix to bound stream concurrency on the client side to a more reasonable default? Can we do that directly, given that we're silently upgrading to http2 here rather than using http2 explicitly, or is it governed by other settings?
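
One thing worth experimenting with is supplying an explicit HTTP/2 transport with StrictMaxConcurrentStreams set, rather than relying on the silent upgrade. A sketch only, under the assumption that the golang.org/x/net/http2 knob is relevant here (the helper name is illustrative, and it is not clear this avoids the startup race, since the server's limit is only learned once its SETTINGS frame arrives):

// Sketch: build the BigQuery client on top of an explicit HTTP/2 transport
// that honors the server-advertised MAX_CONCURRENT_STREAMS.
package bqtransport

import (
	"context"
	"net/http"

	"cloud.google.com/go/bigquery"
	"golang.org/x/net/http2"
	"google.golang.org/api/option"
	htransport "google.golang.org/api/transport/http"
)

func newBoundedClient(ctx context.Context, projectID string) (*bigquery.Client, error) {
	base := &http2.Transport{
		// Queue requests rather than exceed the server's concurrent-stream limit.
		StrictMaxConcurrentStreams: true,
	}
	// Wrap the base transport with the library's usual auth handling.
	rt, err := htransport.NewTransport(ctx, base, option.WithScopes(bigquery.Scope))
	if err != nil {
		return nil, err
	}
	return bigquery.NewClient(ctx, projectID, option.WithHTTPClient(&http.Client{Transport: rt}))
}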

@shollyman
Contributor

Spent more time looking at https://github.com/golang/go/blob/master/src/net/http/h2_bundle.go. If I'm reading it and the debug logging correctly, it will already retry on our behalf (up to 6 additional retries), which the logging seems to bear out.

Our options appear to be:

  • Add this specific error to the bigquery retry predicate (see the sketch below), which means we may induce further write amplification (retries on retries).

  • Keep treating this as a user-level retry concern, since it mostly seems to be triggered by high concurrency (though there may be another factor causing it, such as backend state).

Next steps: I'll try exploring this with even more verbose logging to get a better understanding of the triggering behavior (are we exceeding stream concurrency in a racy fashion, is this another form of backend pushback, etc.). With that, we can decide how to proceed here.
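
For concreteness, the first option would amount to a small special case in the default predicate, roughly along these lines (a sketch only, not the actual change; the existing checks are elided):

// Sketch of the first option: treat the transport-level
// "http2: stream closed" failure as retryable. Illustrative only; the real
// predicate also inspects googleapi error codes and reasons.
package bigquery

import "strings"

func retryableError(err error) bool {
	if err == nil {
		return false
	}
	// The wrapped error ends with the transport message, e.g.
	// `Post "https://bigquery.googleapis.com/.../insertAll?...": http2: stream closed`
	if strings.HasSuffix(err.Error(), "http2: stream closed") {
		return true
	}
	// ... existing checks (HTTP 500/502/503, backendError, rateLimitExceeded) elided ...
	return false
}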

@shollyman
Contributor

Managed to get a failure at debug level 2. I present: the life of a stream failure.

18:24:37.823260 http2: Transport creating client conn 0xc000493500 to 74.125.142.95:443
18:24:37.823317 http2: Framer 0xc001a161c0: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
18:24:37.823326 http2: Framer 0xc001a161c0: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824

(first header/stream gets added to the framer)
18:24:37.826492 http2: Framer 0xc001a161c0: wrote HEADERS flags=END_HEADERS stream=1 len=434

... many other streams start writing HEADERS, some also write DATA associated with this framer ...

(get up to stream 73)
18:24:37.833102 http2: Framer 0xc001a161c0: wrote HEADERS flags=END_HEADERS stream=73 len=52


(then we receive the size limit SETTINGS from the server)
18:24:37.833293 http2: Framer 0xc001a161c0: read SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=100, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=16384
18:24:37.833309 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=100, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=16384

18:24:37.841882 http2: Framer 0xc001a161c0: wrote HEADERS flags=END_HEADERS stream=203 len=52

... many other streams writing HEADERS, some also write DATA associated with this framer ...

(this is when we finally ACK that the settings were processed)
18:24:37.862429 http2: Framer 0xc001a161c0: wrote SETTINGS flags=ACK len=0
18:24:37.862463 http2: Framer 0xc001a161c0: read WINDOW_UPDATE len=4 (conn) incr=983041
18:24:37.862480 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=983041
18:24:37.862498 http2: Framer 0xc001a161c0: read SETTINGS flags=ACK len=0
18:24:37.862506 http2: Transport received SETTINGS flags=ACK len=0

(no additional writes, but here's where the stream resets start to arrive from the server)
18:24:37.862581 http2: Framer 0xc001a161c0: read RST_STREAM stream=203 len=4 ErrCode=REFUSED_STREAM


(bubbling up, this becomes a roundtripper transport failure)
18:24:37.863413 RoundTrip failure: http2: stream closed

(bubbling up further, this becomes a request error)
18:24:37.863557 Error(goroutine 133): Post "https://bigquery.googleapis.com/bigquery/v2/projects/shollyman-demo-test/datasets/foo/tables/dummy_insert_test_table/insertAll?alt=json&prettyPrint=false": http2: stream closed

Observations:

  • The Go default is 250 max streams per connection (http2defaultMaxStreams in h2_bundle.go), which seems to be how we're getting into this state. Funnily enough, there's a TODO there that seems very relevant:
http2defaultMaxStreams      = 250 // TODO: make this 100 as the GFE seems to?
  • From connection start until we received the server's SETTINGS (~10ms), we had already aggressively started up streams.

  • From receiving the SETTINGS until we ACK them (~19ms), we've oversaturated the limit. This is when (less than 1ms later) stream reset frames from the server also begin arriving, indicating REFUSED_STREAM.

  • The stream that was refused didn't get to the point of writing the DATA payload (or it wasn't logged); we'd only gotten to HEADERS.

  • Roughly half a millisecond after the stream refusals, I do observe the client starting up another connection, so it is adapting the flow.

Reviewing the http2 client logic again, it seems my earlier comments about stream retries are not accurate; I was reading the wrong part of the request flow. In this case, we received the error while awaiting flow control to start writing the DATA of the payload, which is before the retry can occur (the retry only happens once we have the data).

@shollyman
Contributor

I've gone ahead and created #3129 to add this to the BigQuery retry predicate for now. It may be possible to improve this in Go's http2 implementation, at which point we can remove the condition from the predicate.

shollyman added a commit that referenced this issue Nov 3, 2020
This adds a special case to the default retry predicate for
BigQuery to work around a specific high concurrency issue.

Related: #1793

@nodirg
Author

nodirg commented Nov 4, 2020

Thanks!
