-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clusterresolver: comply with A37 for handling errors from discovery mechanisms #6461
clusterresolver: comply with A37 for handling errors from discovery mechanisms #6461
Conversation
@@ -190,7 +190,19 @@ func buildClusterImplConfigForEDS(g *nameGenerator, edsResp xdsresource.Endpoint | |||
}) | |||
} | |||
|
|||
priorities := groupLocalitiesByPriority(edsResp.Localities) | |||
var priorities [][]xdsresource.Locality | |||
// Triggered by an EDS error before update, or empty localities list in a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or resource-not-found error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, EDS error meant resource not found || NACK. I'll change language to be explicit about these two error cases though.
// data, and it has not previously reported any results, it should | ||
// report a result that is a single priority with no endpoints." - A37 | ||
priorities = make([][]xdsresource.Locality, 1) | ||
priorities[0] = make([]xdsresource.Locality, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
priorities := [][]xdsresource.Locality{{}}
Since the priorities
var is defined as a slice of slice of xdsresource.Locality
, the compiler can figure out the type of the elements in the slice (when defined statically) without them having to be mentioned explicitly. So, in the above statement {}
indicates an zero-value xdsresource.Locality
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the {} be a zero value of []xdsresource.Locality? Thus setting to nil rather than an empty slice? I don't think it matters though, I forget why I didn't want this to be nil I'll try switching it over.
dr.topLevelResolver.onError(err) | ||
dr.mu.Lock() | ||
// Don't report any errors after first error or good address update. | ||
if dr.addrs != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be if !dr.updateReceived { ... }
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. It can be. I was debating whether using the addrs or the bool. I guess if you don't like the addrs := make(resolver.Address, 0) call I can use the bool.
@@ -142,7 +142,17 @@ func (dr *dnsDiscoveryMechanism) ReportError(err error) { | |||
dr.logger.Infof("DNS discovery mechanism for resource %q reported error: %v", dr.target, err) | |||
} | |||
|
|||
dr.topLevelResolver.onError(err) | |||
dr.mu.Lock() | |||
// Don't report any errors after first error or good address update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it important to suppress subsequent errors? It probably doesn't have any downstream effects today, but are there any other good reasons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm btw I talked to Doug yesterday about this in our 1:1. The channel eats errors from the resolver if it's already received a good address update, however he mentioned it's also fine to add an extra eat for that scenario at this level. My logic was to suppress since no downstream effects, and also to keep it similar to how you handled subsequent errors in eds resolver (write an empty update, next error returns early).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can be a little more verbose in this comment. Maybe this is too verbose, but something to indicate why suppressing the errors is beneficial here:
// If a previous good update was received, suppress the error and continue using the
// previous update. If RPCs were succeeding prior to this, they will continue to do so.
// Also suppress errors if we previously received an error, since there will be no
// downstream effects of propagating this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for suggestion. Used your comment.
dr.mu.Unlock() | ||
return | ||
} | ||
dr.addrs = make([]string, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dr.addrs = nil
is more readable, and avoids an unnecessary call to make
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per comment above, I did this to make the conditional at beginning gating subsequent calls on addrs based off this addrs struct. However, I don't actually think you need to set it to nil here. The only way it hits this codeblock is if it's an error before update, in which case the addrs will be nil already. In the case that the addrs get set to non nil, it won't hit this codeblock. I'll set it anyway though. Also, I was scared that the getAddrs() switch won't see the type as []string if it's nil, but thinking about it I think it will.
dr.updateReceived = true | ||
dr.mu.Unlock() | ||
|
||
dr.topLevelResolver.onUpdate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are calling onUpdate
for both good updates and errors, we no longer need the onError
method on the topLevelResolver
. EDS discovery mechanism does not use it either. So, you can delete the implementation of that method in resourceResolver
and remove that method from the topLevelResolver
interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm ok. Will delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually being called in DNS Discovery Mechanism Construction errors, but as per the back and forth in chat I need to switch this to do same thing DNS errors before update do - create the priority and send TF down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this in Discovery Mechanism Constructor same way as OnError - set updateReceived to true and then call onUpdate on the top level resolver. Let me know if you want to change the bool (I need to set it for lastUpdate() to return true).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you want to change the bool
What do you mean by this? I'm good with how it is looking now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the construction to also set this bool and trigger TF. Thus, I don't know if the naming is appropriate. Has it received an update, or is it just triggering an error? I'm fine leaving as is also though.
Putting this here so I don't forget; I also need to add a test/codepath for errors in construction of DNS Discovery Mechanism before update triggering TF. |
…r TF through empty priority
@@ -75,13 +75,15 @@ func newDNSResolver(target string, topLevelResolver topLevelResolver, logger *gr | |||
} | |||
u, err := url.Parse("dns:///" + target) | |||
if err != nil { | |||
topLevelResolver.onError(fmt.Errorf("failed to parse dns hostname %q in clusterresolver LB policy", target)) | |||
ret.updateReceived = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably throw a warning log here and in the next error case. I would even be OK with an error log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep it consistent with the error logs already in codebase (from ReportError), switched to Info log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Log lines should be capitalized, unlike error strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh interesting, noted. Switched.
@@ -142,7 +142,17 @@ func (dr *dnsDiscoveryMechanism) ReportError(err error) { | |||
dr.logger.Infof("DNS discovery mechanism for resource %q reported error: %v", dr.target, err) | |||
} | |||
|
|||
dr.topLevelResolver.onError(err) | |||
dr.mu.Lock() | |||
// Don't report any errors after first error or good address update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can be a little more verbose in this comment. Maybe this is too verbose, but something to indicate why suppressing the errors is beneficial here:
// If a previous good update was received, suppress the error and continue using the
// previous update. If RPCs were succeeding prior to this, they will continue to do so.
// Also suppress errors if we previously received an error, since there will be no
// downstream effects of propagating this error.
dr.updateReceived = true | ||
dr.mu.Unlock() | ||
|
||
dr.topLevelResolver.onUpdate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you want to change the bool
What do you mean by this? I'm good with how it is looking now.
Haven't reviewed the tests yet. Will wait for this. |
Oh I finished that, forgot to note (the error in construction). I didn't write a test for it since there was no tests for master and it seems like as per xDS chat we'll be adding validations in xDS Client unmarshaling that makes this impossible to hit (Mark has a pending PR). Right now, switched to what c-core does which is to trigger this TF on errors in DNS Discovery Mechanism construction. |
var priorities [][]xdsresource.Locality | ||
// Triggered by an NACK or resource-not-found error before update, or a | ||
// empty localities list in a update. In either case want to create a | ||
// priority, and send down empty address list, causing TF for that priority. | ||
if len(edsResp.Localities) == 0 { | ||
// "If any discovery mechanism instance experiences an error retrieving | ||
// data, and it has not previously reported any results, it should | ||
// report a result that is a single priority with no endpoints." - A37 | ||
priorities = [][]xdsresource.Locality{{}} | ||
} else { | ||
priorities = groupLocalitiesByPriority(edsResp.Localities) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit. You can shorten this even further:
priorities := [][]xdsresource.Locality{{}}
if len(edsResp.Localities) != 0 {
priorities = groupLocalitiesByPriority(edsResp.Localities)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for suggestion. Switched.
// cluster. Also configure an empty endpoints resource for the EDS cluster | ||
// that contains no endpoints. | ||
// cluster. Also configure an endpoints resource for the EDS cluster which | ||
// triggers an NACK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry: s/an NACK/a NACK/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hahaha, switched.
for ; ctx.Err() == nil; <-time.After(defaultTestShortTimeout) { | ||
peer := &peer.Peer{} | ||
if _, err := client.EmptyCall(ctx, &testpb.Empty{}, grpc.Peer(peer)); err != nil { | ||
t.Logf("EmptyCall() failed: %v", err) | ||
continue | ||
} | ||
if peer.Addr.String() == addrs[0].Addr { | ||
break | ||
} | ||
} | ||
if ctx.Err() != nil { | ||
t.Fatalf("Timeout when waiting for RPCs to be routed to backend %q in the DNS cluster", addrs[0].Addr) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use this helper function here: https://github.com/grpc/grpc-go/blob/master/internal/testutils/pickfirst/pickfirst.go#L41
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that worked :). Note that the other tests in this file don't use this helper (since you always rewrite these haha :D).
@@ -75,13 +75,15 @@ func newDNSResolver(target string, topLevelResolver topLevelResolver, logger *gr | |||
} | |||
u, err := url.Parse("dns:///" + target) | |||
if err != nil { | |||
topLevelResolver.onError(fmt.Errorf("failed to parse dns hostname %q in clusterresolver LB policy", target)) | |||
ret.updateReceived = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Log lines should be capitalized, unlike error strings.
Fixes #6462. This PR fixes cluster_resolver error handing. The desired behavior as per gRFC A37 is to create a priority and send down an empty address list if an EDS Discovery Mechanism or DNS Discovery Mechanism errors before a first good update, triggering Transient Failure for that priority and causing fallback if another lower level priority. In master, we simply don't create the priority child in the EDS case and send down a resolver error in the DNS case.
RELEASE NOTES: N/A