Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resolver: UpdateState returns ErrBadResolverState when submitting empty list of addresses #5048

Closed
fho opened this issue Dec 8, 2021 · 6 comments

Comments

@fho
Copy link
Contributor

fho commented Dec 8, 2021

Hello,

I'm implementing a custom resolver that queries a Consul server to resolve a target.
It can happen that a service is unregistered in Consul and therefore the set of known addresses for a target changes to none.
When this happens I'm calling UpdateState with an empty address slice.
UpdateState then returns in response a ErrBadResolverState error.

Because UpdateState() returned an error my resolver starts to poll consul and call UpdateState periodically again (also with empty addresses). This behavior is broken.
Despite that UpdateState() returns an error, grpc-go seems to handle it as expected.
It triggers reresolving periodically again via ResolveNow.

Is this a bug in grpc-go and it should not return an error when a resolver submits an empty set of address?
Otherwise what behavior is expected by the resolver when all addresses for a host vanish?

I'm using grpc-go version 1.42.

@dfawley
Copy link
Member

dfawley commented Dec 8, 2021

This behavior is broken.

Which part is broken? What were you expecting to happen instead? If I understand what you're saying, the resolver should poll for new addresses when this happens, though it should apply some backoff.

grpc-go seems to handle it as expected.

What happens / what was expected?

I suspect the error being returned regarding the empty address list is coming from the LB policy being fed zero addresses, but not being a policy that would be able to cope with that. I.e. pick first and round robin both will do this*, as zero addresses would lead to a dead ClientConn. If they don't error and let the resolver know that something is wrong, then the ClientConn will be stuck with no connections forever (eventually...if there are any existing connections, they would remain until they die).

* - Refs:

  • grpc-go/pickfirst.go

    Lines 66 to 69 in ccc060c

    if len(cs.ResolverState.Addresses) == 0 {
    b.ResolverError(errors.New("produced zero addresses"))
    return balancer.ErrBadResolverState
    }
  • // If resolver state contains no addresses, return an error so ClientConn
    // will trigger re-resolve. Also records this as an resolver error, so when
    // the overall state turns transient failure, the error message will have
    // the zero address information.
    if len(s.ResolverState.Addresses) == 0 {
    b.ResolverError(errors.New("produced zero addresses"))
    return balancer.ErrBadResolverState
    }

@fho
Copy link
Contributor Author

fho commented Dec 8, 2021

This behavior is broken.

Which part is broken?

How my resolver handles the ErrBadResolverState error that is returned by UpdateState() when I pass an empty list of addresses. Currently my resolver then polls Consul periodically and calls UpdateState() again, also with an empty list of addresses again if the state did not change.
This behavior does not make sense because:
1.) Repeatedly calling UpdateState() with an empty list of addresses that results in the same error is unnecessary.
2.) Something (clientconn, resolver?) already triggers reresolving by calling ResolveNow() after I submitted an empty list of addresses. So my resolver does not have to poll Consul with a backoff timeout on it's own.

What were you expecting to happen instead?

I expected that when I call UpdateState() with an empty list of addresses that it returns nil and not a BadResolverState error. I would consider resolving to nothing as a valid and not as a bad state.
That it is valid state also matches with the old comment here: #1795 (comment)

If I understand what you're saying, the resolver should poll for new addresses when this happens, though it should apply some backoff.

That is what my resolver is currently doing. I basically copied the behavior from:

// Poll on an error found in DNS Resolver or an error received from ClientConn.

I guess the DNS resolver will never call UpdateState with an empty set of addresses though.

I suspect the error being returned regarding the empty address list is coming from the LB policy being fed zero addresses, but not being a policy that would be able to cope with that. I.e. pick first and round robin both will do this*, as zero addresses would lead to a dead ClientConn. If they don't error and let the resolver know that something is wrong, then the ClientConn will be stuck with no connections forever (eventually...if there are any existing connections, they would remain until they die).

The way that you described makes also sense to me. The resolvers can then be a little bit more dumb and don't have to care about if they submitted empty or non-empty list of addresses.
So far my view was that the resolve knows already that the target resolved to nothing because it submitted an empty list of addresses. That it gets an error in response to tell the resolver that it resolved to nothing, seemed redundant to me in this case.

What behavior is expected by the resolver when the list of addresses for a target changed to 0?

  • Should it call UpdateState() with an empty list of addresses?
    • How should it react to the ErrBadResolverState error returned by UpdateState() in this case? Should the resolver ignore it and wait for a ResolveNow()` call? Should it retry resolving on it's own with backoff timeouts and only call UpdateState() when it resolved the target to >1 address?
  • Should it not call UpdateState() at all in this case?

@dfawley
Copy link
Member

dfawley commented Dec 8, 2021

2.) Something (clientconn, resolver?) already triggers reresolving by calling ResolveNow() after I submitted an empty list of addresses. So my resolver does not have to poll Consul with a backoff timeout on it's own.

Hmmm.. if there are existing addresses, then we'll keep the SubConn(s) for them, and when they fail to connect, a ResolveNow will be triggered. If there were no pre-existing SubConns (if addresses had never been produced), this won't ever happen. So while it's possible there are other sources of initiating a retry of the name resolver, it shouldn't be assumed.

I expected that when I call UpdateState() with an empty list of addresses that it returns nil and not a BadResolverState error. I would consider resolving to nothing as a valid and not as a bad state.

UpdateState sends the resolver state to the LB policy. Some LB policies (e.g. grpclb) are okay with not getting addresses from the name resolver, so they would not return an error in this case. Pick first and round robin both need addresses to work, however, so they return errors.

I guess the DNS resolver will never call UpdateState with an empty set of addresses though.

I don't believe that's the case. We want it to call UpdateState so RPCs can begin failing quickly if there are no addresses for the target at startup.

So far my view was that the resolve knows already that the target resolved to nothing because it submitted an empty list of addresses. That it gets an error in response to tell the resolver that it resolved to nothing, seemed redundant to me in this case.

As mentioned above, the resolver doesn't necessarily know what LB policy is in place. Some might be fine with an empty address list. The resolver shouldn't be making assumptions about the consumers of its data. The consumers will determine if the result is acceptable and report an error if not.

Should it call UpdateState() with an empty list of addresses?

Yes. As mentioned above, having no addresses either might not be a problem in the first place, or it could be a signal to the ClientConn that it should start failing RPCs with a relevant error message.

How should it react to the ErrBadResolverState error returned by UpdateState() in this case? Should the resolver ignore it and wait for a ResolveNow() call? Should it retry resolving on it's own with backoff timeouts and only call UpdateState() when it resolved the target to >1 address?

Any error from UpdateState should result in the resolver polling and reporting the state again (for polling name resolvers; watch-based resolvers would ignore the error since there's nothing else to do). A backoff timer should be used to avoid overloading the server being polled.

If it polls and finds no change, it would be fine to not call UpdateState with the data -- presumably the call would just fail again. However, it's also harmless to call it again.

@fho
Copy link
Contributor Author

fho commented Dec 9, 2021

Thanks a lot for the clarification.

Any error from UpdateState should result in the resolver polling and reporting the state again (for polling name resolvers; watch-based resolvers would ignore the error since there's nothing else to do)

That also means that there is no scenario where the error returned byUpdateState() is temporary and therefore calling UpdateState() again with the same addresses would fix it?

Should I create a PR to document the expected resolver behavior regarding Errors returned by UpdateState, in the ClientConn godoc?

As mentioned above, the resolver doesn't necessarily know what LB policy is in place. Some might be fine with an empty address list. The resolver shouldn't be making assumptions about the consumers of its data. The consumers will determine if the result is acceptable and report an error if not.

I'm wondering if it wouldn't make sense to distinguish between invalid results returned by a resolver and valid resolver results that do not fit for the LB/consumer.
If a resolver results are invalid, the resolver can do something about it, e.g. try to reresolve it.
If the resolver results are valid but are incompatible with the LB, the resolver can't do anything.
The resolvers works as expected.

For example an IPv4 DNS resolver could resolve the target localhost to 127.0.0.1, report it via UpdateState() and then get a ErrBadResolverState error back.
From the resolvers point of view it can't do anything to provide a better result.
Reresolving localhost won't provide a better result, so periodically trying to reresolve "localhost" is unnecessary.

@dfawley
Copy link
Member

dfawley commented Dec 9, 2021

That also means that there is no scenario where the error returned byUpdateState() is temporary and therefore calling UpdateState() again with the same addresses would fix it?

That should be safe to say. If the LB policy itself is changed then the new one might be okay with the previously reported state, but the LB policy in use is controlled by the name resolver (via the same call), so it would know when that is happening.

Should I create a PR to document the expected resolver behavior regarding Errors returned by UpdateState, in the ClientConn godoc?

Sure, that would be helpful.

If the resolver results are valid but are incompatible with the LB, the resolver can't do anything.

It's always possible a re-resolution will return the results the LB policy needs. Unless the LB policy and name resolver are intended to be closely paired, e.g. the xds name resolver. But in that case, the resolver produces the LB policy & its configuration, so we know everything will always be compatible there.

Reresolving localhost won't provide a better result, so periodically trying to reresolve "localhost" is unnecessary.

If the resolver is 100% sure it will never produce a different result, then it's fine to not re-resolve, since there is no point. This is similar to a watch-based name resolver that literally can't poll again. Otherwise, it should attempt to re-resolve when an error is returned from UpdateState, as the state of the system may change in a way that makes the LB policy happy again.

Also, generally, as long as any addresses are returned by the name resolver, it should be extremely unusual for an LB policy to fail the data.

@fho
Copy link
Contributor Author

fho commented Dec 13, 2021

Thanks again for the detailed responses
I think I got now how the resolver should behave, I'll close this issue.

@fho fho closed this as completed Dec 13, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants