
Controlling backoff and retry parameters spanner client #1215

Closed
dneralla opened this issue Nov 11, 2018 · 25 comments

@dneralla commented Nov 11, 2018

Client: Spanner go client

Is there any way to control the retry and backoff parameters when I create a client like the following?

     client, err := spanner.NewClientWithConfig(ctx, opts.DatabasePath(),
         spanner.ClientConfig{
             NumChannels: numSpannerChannels,
             SessionPoolConfig: spanner.SessionPoolConfig{
                 MaxOpened:     maxSpannerSessions,
                 MinOpened:     minSpannerSessions,
                 MaxBurst:      maxSpannerSessionBurst,
                 WriteSessions: spannerWriteSessions,
             },
         })

Our usage can produce aborted transactions because multiple parallel calls try to update similar records, and we want to fail fast and retry by first looking the record up in a read-write transaction.
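
For reference, a rough sketch of that read-then-update pattern (the table, key, and column names are placeholders, not our real schema); the client retries the whole function with its internal backoff whenever the transaction aborts:

_, err = client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
	// Read the current value inside the transaction first.
	row, err := txn.ReadRow(ctx, "Records", spanner.Key{"record-1"}, []string{"Counter"})
	if err != nil {
		return err
	}
	var counter int64
	if err := row.Columns(&counter); err != nil {
		return err
	}
	// Buffer the update; it is committed when the function returns nil.
	return txn.BufferWrite([]*spanner.Mutation{
		spanner.Update("Records", []string{"Id", "Counter"}, []interface{}{"record-1", counter + 1}),
	})
})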

@dneralla (Author) commented Nov 11, 2018

We are also getting a lot of Unavailable errors from Spanner (502 Bad Gateway), and the client retries them with a 1-second backoff, which is hurting our p99 write latencies.

@dneralla (Author) commented Nov 12, 2018

We are also seeing a lot of errors of the form lastErr is <rpc error: code = Unavailable desc = The server cannot handle this request right now. Please retry with backoff>. There's not much info on why Spanner is throwing that error. Any information regarding this would be helpful.

@jadekler (Collaborator) commented Nov 13, 2018

Is there any way to control retry and backoff parameters

Unfortunately not at the moment. I'll label this a feature request.

We are also getting a lot of Unavailable errors from Spanner (502 Bad Gateway), and the client retries them with a 1-second backoff, which is hurting our p99 write latencies.

Could you create a separate issue for this?

We are also seeing lot of errors of the type "lastErr is ". There's not much info on why spanner is throwing that error. Any information regarding this would be helpful

Just to check - are you seeing "lastErr is ", or "lastErr is "?

@dneralla (Author) commented Nov 13, 2018

@jadekler
Hey, I edited the comment above with the actual error. Sorry, my Markdown skills are bad; apparently text wrapped in < > without spaces doesn't render the content between them.

@jadekler (Collaborator) commented Nov 13, 2018

Thank you!

@kazegusuri commented Nov 17, 2018

Is this an acceptable feature request? If so, I'm interested in sending a patch.

@jadekler (Collaborator) commented Nov 19, 2018

Regrettably, the design here is not yet clear to me. If we choose to allow configuring retries in one client, we must consider all clients. If you have a detailed design, please discuss it here.

@kazegusuri commented Nov 20, 2018

The current defaultBackoff looks like this:

var defaultBackoff = exponentialBackoff{minBackoff, maxBackoff}

exponentialBackoff just has a delay method that determines the duration from the retry count.
So I'm thinking of introducing an interface:

type backoff interface {
   Delay(retries int) time.Duration
}

Then add an initialization function like SetDefaultBackoff() to set defaultBackoff, intended to be called in init(). This is package-level configuration, but I think it's enough.
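
For concreteness, a sketch of how this might be used from application code, assuming the interface and SetDefaultBackoff are exported (neither exists in the released package; this is only the shape of the proposal):

// fixedBackoff satisfies the proposed interface with a constant delay.
type fixedBackoff struct {
	d time.Duration
}

func (b fixedBackoff) Delay(retries int) time.Duration { return b.d }

func init() {
	// Package-level configuration, set once at init time as proposed.
	spanner.SetDefaultBackoff(fixedBackoff{d: 20 * time.Millisecond})
}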

What do you think?

@jadekler (Collaborator) commented Nov 20, 2018

There are problems with that: users expect to set configuration in NewClient calls, not at init time. Also, any design here should be applicable to other clients. We should probably make use of gax's CallOption, but I'll have to investigate a bit.
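
For reference, a sketch of the gax CallOption pattern the generated clients use; the import path assumes gax-go/v2, and the codes and durations are illustrative, not spanner's defaults:

import (
	"time"

	gax "github.com/googleapis/gax-go/v2"
	"google.golang.org/grpc/codes"
)

// Retry Unavailable with exponential backoff; generated clients attach
// options like this to individual RPC methods.
var retryOpt = gax.WithRetry(func() gax.Retryer {
	return gax.OnCodes([]codes.Code{
		codes.Unavailable,
	}, gax.Backoff{
		Initial:    100 * time.Millisecond, // delay before the first retry
		Max:        30 * time.Second,       // upper bound on the delay
		Multiplier: 1.3,                    // growth factor between retries
	})
})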

@110y commented Dec 19, 2018

@jadekler

How about adding new fields to ClientConfig and Client like the ones below and propagating the backoff configuration to runRetryable and resumableStreamDecoder?

type Client struct {
	// ...
	backoff exponentialBackoff
}

type ClientConfig struct {
	// ...
	MinBackoff int64
	MaxBackoff int64
}

If this is acceptable, I'll try to make a patch.

@jadekler (Collaborator) commented Dec 19, 2018

That's a possibility that is similar to Pub/Sub, which is good. I unfortunately haven't yet had the time to explore this enough to definitively say that's a good way forward, though.

@110y commented Dec 20, 2018

@jadekler

I've created a patch and sent it to Gerrit: https://code-review.googlesource.com/c/gocloud/+/36750.
I'd appreciate it if you could review this CL.

110y added a commit to 110y/google-cloud-go that referenced this issue Dec 22, 2018

spanner: Adds configurations to make the spanner client able to control the delay of retrying failed operations

by following this discussion: googleapis#1215 (comment)

Change-Id: Ic99451f35cfd17f6f232fa1970ae48ce3d4c7433
@broady (Member) commented Feb 2, 2019

I can't see any example of an existing package exposing retry/backoff settings.

I think it requires more thought and deliberation.

110y added a commit to 110y/google-cloud-go that referenced this issue Feb 5, 2019

spanner: Adds configurations to make the spanner client able to control the delay of retrying failed operations

by following this discussion: googleapis#1215 (comment)

Change-Id: Ic99451f35cfd17f6f232fa1970ae48ce3d4c7433
@110y commented Feb 5, 2019

@broady

I've observed that Cloud Spanner returns errors in a variety of cases. For example, I've sometimes seen BeginTransaction fail unexpectedly and StreamingRead abort due to stale transactions.

Since this library's default backoff is 1s, every one of those errors adds 1s of latency. I think that additional latency is too high for user-facing requests.

Furthermore, since my team builds its systems as microservices, my systems call Cloud Spanner many times per user request. If several of my microservices each pay that extra backoff latency, the total latency becomes quite high.

Therefore, I think this library should allow the backoff to be configured to suit each use case.

@broady (Member) commented Feb 5, 2019

Does the first retry take too long to start? I.e., is the min backoff too high?

@jadekler (Collaborator) commented Feb 5, 2019

Also, which error codes are being returned? Does the BeginTransaction failure return Aborted, for example?

@kazegusuri commented Feb 5, 2019

To keep the problem simple, I've edited my comment.

We have observed that all methods, including BeginTransaction, StreamingRead, and Commit, return Aborted. This library retries all methods when they are aborted, but it only reads the backoff duration from the error (RetryInfo) for Commit. That results in too long a backoff for Aborted errors on every method except Commit.

@broady (Member) commented Feb 6, 2019

Is it correct for us to just fix the backoff for those calls? It sounds like a broken configuration, and not just for you but for all users of the spanner package.

I'm trying to figure out if we just need to fix the defaults, rather than needing to add customization for backoffs (which can also lead to poor outcomes, as well as being difficult to design in a cohesive and complete way).

@jadekler (Collaborator) commented Feb 6, 2019

FWIW https://code-review.googlesource.com/c/gocloud/+/38130 reduces the minBackoff from 1s to 100ms, bringing it more in line with our other clients.
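
For intuition, a rough sketch of the usual exponential scheme (the real client also randomizes the delay and uses its own constants, both simplified away here): the first retry waits roughly minBackoff, so dropping it from 1s to 100ms removes most of that delay in the common case.

// delay returns min(minBackoff * 1.3^retries, maxBackoff); jitter omitted.
func delay(retries int, minBackoff, maxBackoff time.Duration) time.Duration {
	d := float64(minBackoff)
	for i := 0; i < retries; i++ {
		d *= 1.3
	}
	if d > float64(maxBackoff) {
		d = float64(maxBackoff)
	}
	return time.Duration(d)
}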

gopherbot pushed a commit that referenced this issue Feb 6, 2019

spanner: reduce minBackoff from 1s to 100ms
This reduces the amount of time that most retries take (by far) and
brings it in line with the standard that many of our other clients have.
For example, see:

- https://github.com/googleapis/google-cloud-go/blob/8dac242df8c1549adee7204c4d2f46f51bef07e8/firestore/apiv1/firestore_client.go#L67
- https://github.com/googleapis/google-cloud-go/blob/8dac242df8c1549adee7204c4d2f46f51bef07e8/pubsub/apiv1/publisher_client.go#L61

Related issue: #1215

Change-Id: I808f2aa074ad1e8a7ccc7487cb52b4dd717e77fb
Reviewed-on: https://code-review.googlesource.com/c/38130
Reviewed-by: Chris Broadfoot <cbro@google.com>
Reviewed-by: kokoro <noreply+kokoro@google.com>
@kazegusuri commented Feb 6, 2019

Is it correct for us to just fix the backoff for those calls? It sounds like a broken configuration, and not just for you but for all users of the spanner package.

No. Our request is simply about how to avoid the 1-second backoff for Aborted errors. That is a problem every user suffers from.

I'm trying to figure out if we just need to fix the defaults, rather than needing to add customization for backoffs.

You are right; that is just one possible solution.
I know the problem can also be solved by fixing the error handling, because we have observed that all methods, such as StreamingRead and BeginTransaction, also carry RetryInfo in the error when Aborted happens.

The reasons we suggest allowing the default backoff to be configured are:

  1. When Aborted happens, the RetryInfo in the error has always been 0s, at least in what we have observed, so too small a backoff is not a problem. But of course it would become a problem if the server side wants to control the duration in the future.
  2. Configuring the backoff also solves issue #1223. This is why this issue is complicated: we really want to fix both problems, and of course they are different. When we asked about configuring the backoff, you didn't seem opposed, so we suggested this solution.
@broady (Member) commented Feb 6, 2019

re: #1223, does your application perform well if you don't wrap the errors?

@kazegusuri commented Feb 6, 2019

Even if we don't wrap the errors, this problem still exists; RetryInfo is only honored for Commit errors.

@broady (Member) commented Feb 8, 2019

We should try going even lower than 100ms.

The Spanner team suggested 20-50ms.

My small app that I used to reproduce the issue returns Aborted only on Commit and Update (not for Query).

As noted above, RetryInfo for Commit is already respected. The RetryInfo returned by the Spanner server for Update calls appears to be very consistent, between 10-15ms, so it sounds like 20ms might be very reasonable.

Log of RetryInfo values for Update calls
2019/02/07 17:49:26 Update trailers: 11.997761ms, true
2019/02/07 17:49:28 Update trailers: 11.997453ms, true
2019/02/07 17:49:34 Update trailers: 10.998325ms, true
2019/02/07 17:49:39 Update trailers: 15.998403ms, true
2019/02/07 17:49:39 Update trailers: 10.994829ms, true
2019/02/07 17:49:41 Update trailers: 11.996834ms, true
2019/02/07 17:49:47 Update trailers: 10.998464ms, true
2019/02/07 17:49:48 Update trailers: 14.998592ms, true
2019/02/07 17:49:53 Update trailers: 11.998235ms, true
2019/02/07 17:49:56 Update trailers: 10.998543ms, true
2019/02/07 17:49:58 Update trailers: 10.997684ms, true
2019/02/07 17:50:06 Update trailers: 12.997582ms, true
2019/02/07 17:50:06 Update trailers: 12.998171ms, true
2019/02/07 17:50:08 Update trailers: 13.998741ms, true
2019/02/07 17:50:09 Update trailers: 15.998094ms, true
2019/02/07 17:50:10 Update trailers: 10.99823ms, true
2019/02/07 17:50:11 Update trailers: 14.998151ms, true
2019/02/07 17:50:12 Update trailers: 13.998703ms, true
2019/02/07 17:50:13 Update trailers: 13.998546ms, true
2019/02/07 17:50:19 Update trailers: 11.998267ms, true
2019/02/07 17:50:20 Update trailers: 9.998016ms, true
2019/02/07 17:50:21 Update trailers: 15.998502ms, true
2019/02/07 17:50:22 Update trailers: 12.997663ms, true
2019/02/07 17:50:24 Update trailers: 10.998756ms, true
2019/02/07 17:50:29 Update trailers: 11.998432ms, true
2019/02/07 17:50:31 Update trailers: 15.998712ms, true
2019/02/07 17:50:36 Update trailers: 11.998063ms, true
2019/02/07 17:50:37 Update trailers: 12.997537ms, true
2019/02/07 17:50:41 Update trailers: 15.998469ms, true
2019/02/07 17:50:42 Update trailers: 11.99858ms, true
2019/02/07 17:50:46 Update trailers: 12.998666ms, true
2019/02/07 17:50:47 Update trailers: 10.99805ms, true
2019/02/07 17:50:50 Update trailers: 11.997795ms, true
@jadekler (Collaborator) commented Feb 8, 2019

Quick note: I've sent out https://code-review.googlesource.com/c/gocloud/+/38210, which further lowers min backoff from 100ms to 20ms.

gopherbot pushed a commit that referenced this issue Feb 11, 2019

spanner: further reduce min retry backoff
This CL further reduces the min retry backoff from 100ms to 20ms, per
recommendation of spanner team.

Updates #1215.

Change-Id: I3311de21bdc100fb48d78e40caebc6edc7506e83
Reviewed-on: https://code-review.googlesource.com/c/38210
Reviewed-by: Chris Broadfoot <cbro@google.com>
@broady (Member) commented Feb 20, 2019

Marking this as closed/fixed. It looks like we could reasonably go down to 10ms, but it seems 20ms is quite good for now.

As I wrote in #1309, we could also respect the RetryInfo from the server, but I don't think it provides very much value right now.

@broady closed this Feb 20, 2019
