
Distinguish between ambiguous and unambiguous failures #1443

Closed
bdarnell opened this issue Aug 16, 2017 · 11 comments
Labels: P2, Type: Feature

Comments

bdarnell (Contributor) commented Aug 16, 2017

What version of gRPC are you using?

[[projects]]
  name = "google.golang.org/grpc"
  packages = [".","codes","credentials","credentials/oauth","grpclb/grpc_lb_v1","grpclog","internal","keepalive","metadata","naming","peer","stats","status","tap","transport"]
  revision = "172ccacff3cfbddedcf1ba5ea06cda943bc976c1"
  version = "v1.5.0"

What version of Go are you using (go version)?

1.8.3

What operating system (Linux, Windows, …) and version?

Linux (ubuntu 16.04)

What did you do?

I'm trying to send an RPC with a side effect that should happen at most once. I have connections to multiple backends and use the FailFast option; if one backend is down, I want to try another.

What did you expect to see?

This requires being able to distinguish the error returned when a backend is found to be unavailable before the request is sent from an error that occurs after the request may have gone out. (I had previously assumed that codes.Unavailable would be suitable for this, but I was wrong: it is also used when a server is draining, in which case the request may already have been processed.)

What did you see instead?

The fail-fast error for a down backend has the same error code as the one used for a draining server (and possibly other uses); it is impossible to distinguish them without looking at the text of the error message.

I would like some way to inspect the error to determine whether the request was possibly processed by the server. This could be a distinct error code, string matching on the error message (if the string to use is documented and guaranteed for future versions), or other attributes of the error. It's OK if this mechanism is not perfectly precise (my priority is at-most-once execution rather than maximizing the chances of success), but I would like to at least be able to distinguish the fail-fast error described above.

(for further background, our incorrect assumption that an error with codes.Unavailable meant that the RPC had not been attempted on the server was the cause of cockroachdb/cockroach#17491)
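To make the check we want concrete, here is a minimal sketch of the predicate we would like to be able to write (status.FromError and codes.Unavailable are real grpc-go APIs; as explained above, this version is not actually sufficient):

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isDefinitelyUnsent would report that an RPC never reached the wire, making
// it safe to retry on another backend. Today it cannot be implemented
// reliably, because codes.Unavailable covers both the fail-fast "never sent"
// case and the draining-server "maybe executed" case.
func isDefinitelyUnsent(err error) bool {
	s, ok := status.FromError(err)
	return ok && s.Code() == codes.Unavailable // NOT sufficient on its own
}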

dfawley (Member) commented Aug 17, 2017

Per the gRPC spec, Unavailable is expected to be returned when the server shuts down or when some data is transmitted before the connection breaks. However, the Wait For Ready spec does not specify what kind of failure should be returned if there is an error attempting to issue an RPC to a ClientConn for which there is no current connection available. Unavailable seems like the most reasonable choice for this condition, but it does prevent this kind of use case, which is important. It's also important to have a different error code (and not just string matching or a sentinel error), because gRPC's retry mechanism will be based upon the code alone. We'll follow up with other gRPC implementations and see how they handle this.

dfawley (Member) commented Aug 24, 2017

Actually, I think the root cause of this issue is that our non-Wait For Ready / FailFast behavior is wrong. We are supposed to block if there is no active connection (i.e. the ClientConn's state is Idle or Connecting) until the context expires:

RPCs SHOULD NOT fail as a result of the channel being in other states (CONNECTING, READY, or IDLE).

At that time, the error returned should be either Canceled or DeadlineExceeded.

If you turn off Fail Fast (i.e. if you enable Wait For Ready), then RPCs will always block until the context is canceled or expires, or the ClientConn is closed.
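For reference, a sketch of what that looks like at a call site with the v1.x API (the pb stub, Echo method, and req are placeholders; grpc.FailFast is the real call option of that era, later superseded by grpc.WaitForReady with the opposite sense):

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

func callWithWaitForReady(ctx context.Context, client pb.EchoClient, req *pb.EchoRequest) (*pb.EchoReply, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	// With FailFast(false) the call blocks until a connection is READY, the
	// context is canceled or expires, or the ClientConn is closed.
	return client.Echo(ctx, req, grpc.FailFast(false))
}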

If you are maintaining multiple connections to multiple backends, you should do that within a single ClientConn, and gRPC should always pick one that is up for your RPCs.
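A sketch of that single-ClientConn setup with the API current at the time (grpc.RoundRobin, grpc.WithBalancer, and the naming.Resolver interface are the real v1.x APIs, all since superseded; the resolver itself is assumed to exist and to report every backend address):

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/naming"
)

// dialService dials once; the round-robin balancer maintains connections to
// every backend the resolver reports and picks one that is up for each RPC.
func dialService(r naming.Resolver) (*grpc.ClientConn, error) {
	return grpc.Dial(
		"my-service", // logical target; real addresses come from the resolver
		grpc.WithInsecure(),
		grpc.WithBalancer(grpc.RoundRobin(r)),
	)
}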

dfawley (Member) commented Aug 24, 2017

Let's track the wait for ready bug via #1463. Closing this as a duplicate of that, since it is the root cause.

dfawley closed this as completed Aug 24, 2017
bdarnell (Contributor, Author) commented

I don't understand how #1463 is the root cause of this issue. For one thing, it looks like the code is already working as you say it should, returning fail-fast errors only for the transient-failure state:

grpc-go/clientconn.go

Lines 1186 to 1190 in fa59b77

case ac.state == connectivity.TransientFailure:
	if failfast || hasBalancer {
		ac.mu.Unlock()
		return nil, errConnUnavailable
	}

But even if fail-fast were changed to return errors less often, the errors still wouldn't let us know whether the request was possibly executed or not.

If you are maintaining multiple connections to multiple backends, you should do that within a single ClientConn, and gRPC should always pick one that is up for your RPCs.

I don't think that's an option for us - we need to choose which backend(s) to use on an RPC-by-RPC basis. Is there some way I'm not seeing to configure gRPC to do this?

dfawley (Member) commented Aug 25, 2017

I don't understand how #1463 is the root cause of this issue.

Are you using a load balancer? If so, depending on the implementation (our RoundRobin balancer does this), it will return an error when there are no backends here:

addr, put, err = cc.dopts.balancer.Get(ctx, opts)

If you are not using a balancer, then it does seem that we follow the right behavior (though I was fairly sure I observed the opposite).

Is your connection actually in the transient failure state, possibly?

Note that when retry support is added (ETA ~1 month), gRPC will transparently retry RPCs that didn't leave the client or were not seen by the server application. So this should not be something you need to distinguish externally after that time.

we need to choose which backend(s) to use on an RPC-by-RPC basis.
Is there some way I'm not seeing to configure gRPC to do this?

You should be able to implement a custom balancer to do this if round-robin is not suitable for your needs. (Note that the API is going to change around here pretty significantly in the next month or two - see #1388.)
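For orientation, this is the shape of the grpc.Balancer interface such a custom balancer would implement (the pre-#1388 v1.x API, since deprecated; signatures reconstructed from the grpc package of that era). Get is where per-RPC routing logic, e.g. driven by a hint carried in ctx, would live:

import (
	"context"

	"google.golang.org/grpc"
)

type Balancer interface {
	// Start begins address resolution for target.
	Start(target string, config grpc.BalancerConfig) error
	// Up is called when a connection to addr becomes ready; the returned
	// function is invoked when that connection goes down.
	Up(addr grpc.Address) (down func(error))
	// Get picks a backend address for a single RPC; put is called when the
	// RPC completes. Per-RPC backend selection belongs here.
	Get(ctx context.Context, opts grpc.BalancerGetOptions) (addr grpc.Address, put func(), err error)
	// Notify streams the full set of addresses gRPC should stay connected to.
	Notify() <-chan []grpc.Address
	Close() error
}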

bdarnell (Contributor, Author) commented

Are you using a load balancer?

No, in part due to the documented experimental/unstable nature of the interface (as mentioned below).

Is your connection actually in the transient failure state, possibly?

Yes. We want to distinguish between "the connection was already in transient failure before this RPC was sent, so we never tried" and "the connection failed while this RPC was in flight".

Note that when retry support is added (ETA ~1 month)

Interesting. That in conjunction with a custom balancer might do the trick, if the interfaces let us do what we want. But in general our usage is atypical enough that we're more comfortable taking manual control here; my first choice is to just get a distinct error code.

dfawley (Member) commented Sep 1, 2017

In theory, the errors we return are supposed to match the spec. (We are aware we don't get this right in all instances.) The spec doesn't specify what should be returned in this particular situation. I'm checking with C/Java.

This leaves us with either a gRFC to change the error (if it's UNAVAILABLE in the other languages as well), which could be tricky, or a Go-specific feature to help identify whether data was transmitted before any error occurred.

I'll reopen this, because it seems it's not related to our "wait for ready" implementation.

bdarnell (Contributor, Author) commented Sep 3, 2017

OK, so the spec clearly says we were wrong to assume that Unavailable means "didn't send anything" (I think it happened to be true of the go implementation at the time we made this assumption). And there's no standard error code we could use to be unambiguous (indeed, the spec explicitly notes the ambiguity of DEADLINE_EXCEEDED).

I'm not sure if an error code is the right way to handle this, though. If I make an RPC to a server that makes another RPC, does the error code from the inner RPC get propagated back to my client or does it get turned into Unknown? It would be incorrect to allow the proposed "unambiguous did-not-send" error code to propagate, since it's only true of the client that generated it. So maybe this flag should be a property of the error object that is separate from the network-transmittable error codes.
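Purely as a sketch of that idea (nothing like this exists in grpc-go; the interface, method, and helper below are hypothetical):

// ambiguityError is a hypothetical attribute of the error object itself.
// It would be checked locally and never transmitted: "not sent" is only
// true for the hop that generated the error, not end to end.
type ambiguityError interface {
	error
	// Sent reports whether any part of the request may have reached the server.
	Sent() bool
}

// canRetryElsewhere is hypothetical client logic built on that flag.
func canRetryElsewhere(err error) bool {
	ae, ok := err.(ambiguityError)
	return ok && !ae.Sent() // never transmitted, so a retry cannot double-apply
}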

dfawley (Member) commented Sep 5, 2017

does the error code from the inner RPC get propagated back to my client or does it get turned into Unknown?

This is entirely up to the implementation of the middle service. If it does:

func MyHandler(...) (..., error) {
	if _, err := innerSvc.Call(...); err != nil {
		return ..., err // the inner status, code included, flows through unchanged
	}
	...
}

Then the status code will be propagated. This is not typically recommended; the middle service should translate the error into something that makes sense to its own client, usually by switching on the status code returned from the inner service.
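A sketch of that translation, mirroring MyHandler above (the Request/Response types and the specific code mapping are illustrative, not something prescribed here):

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func MyHandler(ctx context.Context, req *Request) (*Response, error) {
	if _, err := innerSvc.Call(ctx, req); err != nil {
		// Switch on the inner status code and return a status that is
		// meaningful to this service's own clients.
		switch status.Code(err) {
		case codes.NotFound:
			return nil, status.Error(codes.FailedPrecondition, "required record missing in backend")
		default:
			return nil, status.Error(codes.Internal, "backend call failed")
		}
	}
	return &Response{}, nil
}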

bdarnell (Contributor, Author) commented Nov 9, 2017

Is there any update here? We just upgraded from 1.6 to 1.7.1 and it apparently changed some of the implementation details we were relying on to distinguish ambiguous from unambiguous failures (cockroachdb/cockroach#19708). We're probably going to have to revert to 1.6 for now.

I see that the transparent retry change just landed, but remember that this is not enough for us; we also need control over routing. Have the balancer APIs mentioned in #1443 (comment) been stabilized yet? (Note that #1388 is still open.)

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jan 10, 2018
This is similar to what grpc does internally for fail-fast rpcs, but
lets us control the error returned to work around grpc/grpc-go#1443

Fixes cockroachdb#19708

Release note (performance improvement): Reduced the occurrence of
ambiguous errors when a node is down.
dfawley (Member) commented May 3, 2021

Closing due to lack of activity and priority.

dfawley closed this as completed May 3, 2021
github-actions bot locked as resolved and limited conversation to collaborators Oct 31, 2021