
Distinguish between ambiguous and unambiguous failures #1443

Closed
bdarnell opened this issue Aug 16, 2017 · 11 comments
Labels: P2, Type: Feature

Comments

bdarnell (Contributor) commented Aug 16, 2017

What version of gRPC are you using?

[[projects]]
  name = "google.golang.org/grpc"
  packages = [".","codes","credentials","credentials/oauth","grpclb/grpc_lb_v1","grpclog","internal","keepalive","metadata","naming","peer","stats","status","tap","transport"]
  revision = "172ccacff3cfbddedcf1ba5ea06cda943bc976c1"
  version = "v1.5.0"

What version of Go are you using (go version)?

1.8.3

What operating system (Linux, Windows, …) and version?

Linux (ubuntu 16.04)

What did you do?

I'm trying to send an RPC with a side effect that should happen at most once. I have connections to multiple backends and use the FailFast option; if one backend is down, I want to try another.

What did you expect to see?

This requires being able to distinguish the error returned when a backend is found to be unavailable before the request is sent from an error that occurs after the request may have gone out. (I had previously assumed that codes.Unavailable would be suitable for this, but I was wrong: it is also used when a server is draining, in which case the request may already have been processed.)

What did you see instead?

The fail-fast error for a down backend has the same error code as the one used for a draining server (and possibly other uses); it is impossible to distinguish them without looking at the text of the error message.

I would like some way to inspect the error to determine whether the request was possibly processed by the server. This could be a distinct error code, string matching on the error message (if the string to use is documented and guaranteed for future versions), or other attributes of the error. It's OK if this mechanism is not perfectly precise (my priority is at-most-once execution rather than maximizing the chances of success), but I would like to at least be able to distinguish the fail-fast error described above.

(for further background, our incorrect assumption that an error with codes.Unavailable meant that the RPC had not been attempted on the server was the cause of cockroachdb/cockroach#17491)
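To make the check we want concrete, here is a minimal sketch of the predicate we would like to be able to write (status.FromError and codes.Unavailable are real grpc-go APIs; as explained above, this version is not actually sufficient):

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isDefinitelyUnsent would report that an RPC never reached the wire, making
// it safe to retry on another backend. Today it cannot be implemented
// reliably, because codes.Unavailable covers both the fail-fast "never sent"
// case and the draining-server "maybe executed" case.
func isDefinitelyUnsent(err error) bool {
	s, ok := status.FromError(err)
	return ok && s.Code() == codes.Unavailable // NOT sufficient on its own
}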

dfawley (Member) commented Aug 17, 2017

Per the gRPC spec, Unavailable is expected to be returned when the server shuts down or when some data is transmitted before the connection breaks. However, the Wait For Ready spec does not specify what kind of failure should be returned if there is an error attempting to issue an RPC to a ClientConn for which there is no current connection available. Unavailable seems like the most reasonable choice for this condition, but it does prevent this kind of use case, which is important. It's also important to have a different error code (and not just string matching or a sentinel error), because gRPC's retry mechanism will be based upon the code alone. We'll follow up with other gRPC implementations and see how they handle this.

dfawley (Member) commented Aug 24, 2017

Actually, I think the root cause of this issue is that our non-Wait For Ready / FailFast behavior is wrong. We are supposed to block if there is no active connection (i.e. the ClientConn's state is Idle or Connecting) until the context expires:

RPCs SHOULD NOT fail as a result of the channel being in other states (CONNECTING, READY, or IDLE).

At that time, the error returned should be either Canceled or DeadlineExceeded.

If you turn off Fail Fast (i.e. if you enable Wait For Ready), then RPCs will always block until the context is canceled or expires, or the ClientConn is closed.
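For reference, a sketch of what that looks like at a call site with the v1.x API (the pb stub, Echo method, and req are placeholders; grpc.FailFast is the real call option of that era, later superseded by grpc.WaitForReady with the opposite sense):

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

func callWithWaitForReady(ctx context.Context, client pb.EchoClient, req *pb.EchoRequest) (*pb.EchoReply, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	// With FailFast(false) the call blocks until a connection is READY, the
	// context is canceled or expires, or the ClientConn is closed.
	return client.Echo(ctx, req, grpc.FailFast(false))
}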

If you are maintaining multiple connections to multiple backends, you should do that within a single ClientConn, and gRPC should always pick one that is up for your RPCs.
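A sketch of that single-ClientConn setup with the API current at the time (grpc.RoundRobin, grpc.WithBalancer, and the naming.Resolver interface are the real v1.x APIs, all since superseded; the resolver itself is assumed to exist and to report every backend address):

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/naming"
)

// dialService dials once; the round-robin balancer maintains connections to
// every backend the resolver reports and picks one that is up for each RPC.
func dialService(r naming.Resolver) (*grpc.ClientConn, error) {
	return grpc.Dial(
		"my-service", // logical target; real addresses come from the resolver
		grpc.WithInsecure(),
		grpc.WithBalancer(grpc.RoundRobin(r)),
	)
}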

dfawley (Member) commented Aug 24, 2017

Let's track the wait for ready bug via #1463. Closing this as a duplicate of that, since it is the root cause.

dfawley closed this as completed Aug 24, 2017
bdarnell (Contributor, Author) commented

I don't understand how #1463 is the root cause of this issue. For one thing, it looks like the code is already working as you say it should, returning fail-fast errors only for the transient-failure state:

grpc-go/clientconn.go

Lines 1186 to 1190 in fa59b77

case ac.state == connectivity.TransientFailure:
	if failfast || hasBalancer {
		ac.mu.Unlock()
		return nil, errConnUnavailable
	}

But even if fail-fast were changed to return errors less often, the errors still wouldn't let us know whether the request was possibly executed or not.

If you are maintaining multiple connections to multiple backends, you should do that within a single ClientConn, and gRPC should always pick one that is up for your RPCs.

I don't think that's an option for us - we need to choose which backend(s) to use on an RPC-by-RPC basis. Is there some way I'm not seeing to configure gRPC to do this?

dfawley (Member) commented Aug 25, 2017

I don't understand how #1463 is the root cause of this issue.

Are you using a load balancer? If so, depending on the implementation (our RoundRobin balancer does this), it will return an error when there are no backends here:

addr, put, err = cc.dopts.balancer.Get(ctx, opts)

If you are not using a balancer, then it does seem that we follow the right behavior (though I was fairly sure I observed the opposite).

Is your connection actually in the transient failure state, possibly?

Note that when retry support is added (ETA ~1 month), gRPC will transparently retry RPCs that didn't leave the client or were not seen by the server application. So this should not be something you need to distinguish externally after that time.

we need to choose which backend(s) to use on an RPC-by-RPC basis.
Is there some way I'm not seeing to configure gRPC to do this?

You should be able to implement a custom balancer to do this if round-robin is not suitable for your needs. (Note that the API is going to change around here pretty significantly in the next month or two - see #1388.)
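For orientation, this is the shape of the grpc.Balancer interface such a custom balancer would implement (the pre-#1388 v1.x API, since deprecated; signatures reconstructed from the grpc package of that era). Get is where per-RPC routing logic, e.g. driven by a hint carried in ctx, would live:

import (
	"context"

	"google.golang.org/grpc"
)

type Balancer interface {
	// Start begins address resolution for target.
	Start(target string, config grpc.BalancerConfig) error
	// Up is called when a connection to addr becomes ready; the returned
	// function is invoked when that connection goes down.
	Up(addr grpc.Address) (down func(error))
	// Get picks a backend address for a single RPC; put is called when the
	// RPC completes. Per-RPC backend selection belongs here.
	Get(ctx context.Context, opts grpc.BalancerGetOptions) (addr grpc.Address, put func(), err error)
	// Notify streams the full set of addresses gRPC should stay connected to.
	Notify() <-chan []grpc.Address
	Close() error
}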

bdarnell (Contributor, Author) commented

Are you using a load balancer?

No, in part due to the documented experimental/unstable nature of the interface (as mentioned below).

Is your connection actually in the transient failure state, possibly?

Yes. We want to distinguish between "the connection was already in transient failure before this RPC was sent, so we never tried" and "the connection failed while this RPC was in flight".

Note that when retry support is added (ETA ~1 month)

Interesting. That in conjunction with a custom balancer might do the trick, if the interfaces let us do what we want. But in general our usage is atypical enough that we're more comfortable taking manual control here; my first choice is to just get a distinct error code.

dfawley (Member) commented Sep 1, 2017

In theory, the errors we return are supposed to match the spec. (We are aware we don't get this right in all instances.) The spec doesn't specify what should be returned in this particular situation. I'm checking with C/Java.

This leaves us with either a gRFC to change the error (if it's UNAVAILABLE in the other languages as well), which could be tricky, or a Go-specific feature to help identify whether data was transmitted before any error occurred.

I'll reopen this, because it seems it's not related to our "wait for ready" implementation.

bdarnell (Contributor, Author) commented Sep 3, 2017

OK, so the spec clearly says we were wrong to assume that Unavailable means "didn't send anything" (I think it happened to be true of the go implementation at the time we made this assumption). And there's no standard error code we could use to be unambiguous (indeed, the spec explicitly notes the ambiguity of DEADLINE_EXCEEDED).

I'm not sure if an error code is the right way to handle this, though. If I make an RPC to a server that makes another RPC, does the error code from the inner RPC get propagated back to my client or does it get turned into Unknown? It would be incorrect to allow the proposed "unambiguous did-not-send" error code to propagate, since it's only true of the client that generated it. So maybe this flag should be a property of the error object that is separate from the network-transmittable error codes.
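Purely as a sketch of that idea (nothing like this exists in grpc-go; the interface, method, and helper below are hypothetical):

// ambiguityError is a hypothetical attribute of the error object itself.
// It would be checked locally and never transmitted: "not sent" is only
// true for the hop that generated the error, not end to end.
type ambiguityError interface {
	error
	// Sent reports whether any part of the request may have reached the server.
	Sent() bool
}

// canRetryElsewhere is hypothetical client logic built on that flag.
func canRetryElsewhere(err error) bool {
	ae, ok := err.(ambiguityError)
	return ok && !ae.Sent() // never transmitted, so a retry cannot double-apply
}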

dfawley (Member) commented Sep 5, 2017

does the error code from the inner RPC get propagated back to my client or does it get turned into Unknown?

This is entirely up to the implementation of the middle service. If it does:

func MyHandler(...) (..., error) {
	if _, err := innerSvc.Call(...); err != nil {
		return ..., err // the inner status, code included, flows through unchanged
	}
	...
}

Then the status code will be propagated. This is not typically recommended; the middle service should translate the error into something that makes sense to its own client, usually by switching on the status code returned from the inner service.
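A sketch of that translation, mirroring MyHandler above (the Request/Response types and the specific code mapping are illustrative, not something prescribed here):

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func MyHandler(ctx context.Context, req *Request) (*Response, error) {
	if _, err := innerSvc.Call(ctx, req); err != nil {
		// Switch on the inner status code and return a status that is
		// meaningful to this service's own clients.
		switch status.Code(err) {
		case codes.NotFound:
			return nil, status.Error(codes.FailedPrecondition, "required record missing in backend")
		default:
			return nil, status.Error(codes.Internal, "backend call failed")
		}
	}
	return &Response{}, nil
}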

bdarnell (Contributor, Author) commented Nov 9, 2017

Is there any update here? We just upgraded from 1.6 to 1.7.1 and it apparently changed some of the implementation details we were relying on to distinguish ambiguous from unambiguous failures (cockroachdb/cockroach#19708). We're probably going to have to revert to 1.6 for now.

I see that the transparent retry change just landed, but remember that this is not enough for us; we also need control over routing. Have the balancer APIs mentioned in #1443 (comment) been stabilized yet? (Note that #1388 is still open.)

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jan 10, 2018
This is similar to what grpc does internally for fail-fast rpcs, but
lets us control the error returned to work around grpc/grpc-go#1443

Fixes cockroachdb#19708

Release note (performance improvement): Reduced the occurrence of
ambiguous errors when a node is down.
dfawley (Member) commented May 3, 2021

Closing due to lack of activity and priority.

dfawley closed this as completed May 3, 2021
github-actions bot locked as resolved and limited conversation to collaborators Oct 31, 2021