xds/outlierdetection: fix config handling #6361

Merged
merged 6 commits into grpc:master on Jun 9, 2023

Conversation

Contributor

@zasweq zasweq commented Jun 8, 2023

This PR fixes Outlier Detection configuration across the xDS System and the Outlier Detection ParseConfig.

This PR touches multiple components:

  • xDS Client: Emit Outlier Detection configuration as JSON and apply xDS Defaults there (proto -> layered gRPC JSON structure). xDS Defaults were previously applied in cds_balancer, and incorrectly: that code did not distinguish an unset field from one explicitly set to 0, which caused the first bug listed below.
  • cds_balancer: Take Outlier Detection JSON and Endpoint + Locality Picking JSON, prepare a Cluster Resolver configuration in JSON. Call ParseConfig() on cluster_resolver child to receive cluster resolver configuration to send downward.
  • cluster_resolver: Accept Outlier Detection JSON and Endpoint + Locality Picking JSON in its top level config. ParseConfig() parses both (using the gRPC LB Registry) into internal types, and persists these internal types as helpers for building the priority configuration, which will eventually get marshaled to JSON, parsed through ParseConfig(), and sent down to priority.
  • outlier_detection: Set defaults in ParseConfig(). This consists of two layers. Any top layer field gets a default value if unset in JSON. If a second layer field (SuccessRateEjection or FailurePercentageEjection) is present, each of its fields that is unset in JSON also gets a default value.

This fixes a few bugs:

  • Previously, in xDS, if the Outlier Detection message in the proto configuration received by the xDS Client had EnforcingSuccessRate unset, Outlier Detection would not be turned on. The xDS Defaults should turn Outlier Detection on in this scenario, so Outlier Detection is now enabled by default in the xDS flow whenever the Outlier Detection proto is present.
  • cds_balancer was bypassing the cluster_resolver balancer API: instead of calling ParseConfig on the cluster_resolver balancer to receive the cluster resolver configuration to pass down, it prepared the config using cluster_resolver's exported struct directly and sent that down.
  • outlier_detection was not setting its default fields in ParseConfig correctly. Thus, if a user set the Outlier Detection balancer as the top level balancer of the channel using a Service Config string, unset fields would not get populated with their defaults. This is an issue because Outlier Detection can be used in both the xDS case and the non-xDS case.
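
The first bug comes down to telling an unset field apart from one explicitly set to 0. A minimal sketch of that distinction follows. It assumes a hypothetical `odProto` struct whose pointer field stands in for a proto wrapper type (nil means unset) and hypothetical helper names; the default of 100 for an unset EnforcingSuccessRate is an assumption here and should be checked against gRFC A50.

```go
package main

import "fmt"

// odProto is a hypothetical stand-in for the Outlier Detection proto message.
// A nil pointer models an unset wrapper field.
type odProto struct {
	EnforcingSuccessRate *uint32
}

// defaultEnforcingSuccessRate is the assumed xDS default for an unset field.
const defaultEnforcingSuccessRate uint32 = 100

// effectiveEnforcingSuccessRate applies the default only when the field is
// unset; an explicit 0 is preserved.
func effectiveEnforcingSuccessRate(od *odProto) uint32 {
	if od.EnforcingSuccessRate == nil {
		return defaultEnforcingSuccessRate
	}
	return *od.EnforcingSuccessRate
}

// successRateEnabled reports whether success rate ejection should be
// configured: an unset field turns it on (via the default), an explicit 0
// turns it off.
func successRateEnabled(od *odProto) bool {
	return effectiveEnforcingSuccessRate(od) != 0
}

func main() {
	zero := uint32(0)
	fmt.Println(successRateEnabled(&odProto{}))                            // field unset
	fmt.Println(successRateEnabled(&odProto{EnforcingSuccessRate: &zero})) // field explicitly 0
}
```

A plain `uint32` field could not express this, since unset and 0 would both read as 0.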

RELEASE NOTES:

  • xds/internal/xdsclient/xdsresource: Fix Outlier Detection Config Handling and correctly set xDS Defaults
  • xds/internal/balancer/outlierdetection: Fix Outlier Detection Config Handling by setting defaults in ParseConfig()

@zasweq zasweq requested a review from dfawley June 8, 2023 03:35
@zasweq zasweq added this to the 1.56 Release milestone Jun 8, 2023
// Shouldn't happen, registered through imported Cluster Resolver,
// defensive programming.
logger.Errorf("%q LB policy is needed but not registered", clusterresolver.Name)
return nil
Contributor

It's not actually legal to return nil, though. We should return an LB policy instance that is a nop for everything and provides a TF picker or something in this situation. Maybe add a package to internal/balancer with this trivial implementation.

// This is illegal and should never happen; we clear the balancerWrapper

Contributor Author

Done. Here and in Cluster Resolver.

if !ok {
// Shouldn't happen, imported Cluster Resolver builder has this method.
logger.Errorf("%q LB policy does not implement a config parser", clusterresolver.Name)
return nil
Contributor

Similar.

Contributor Author

Done. Here and in Cluster Resolver.

Comment on lines +361 to +370
// "In the cds LB policy, if the outlier_detection field is not set in
// the Cluster resource, a "no-op" outlier_detection config will be
// generated in the corresponding DiscoveryMechanism config, with all
// fields unset." - A50
if odJSON == nil {
	// This will pick up top level defaults in Cluster Resolver
	// ParseConfig, but sre and fpe will be nil still so still a
	// "no-op" config.
	odJSON = json.RawMessage(`{}`)
}
Contributor

Can this logic be moved to where we produce the JSON OD config from the proto instead? This is part of converting from xds OD config to OD's JSON config.

Contributor Author

No, unfortunately not, because the language in the gRFC explicitly states "in the cds lb policy". The issue was that I had mapped that language to the following paragraph too. We triaged this, and this was the only behavior scoped to the cds lb policy. @murgatroid99

Comment on lines +114 to +115
var cfg *LBConfig
if err := json.Unmarshal(j, &cfg); err != nil {
Contributor

Pointer to a pointer?

Contributor Author

Yeah, this is how my wrr_locality balancer works too. https://github.com/grpc/grpc-go/blob/master/xds/internal/balancer/wrrlocality/balancer.go#L93. If I switch it it hangs. I thinkkkk it's because if you declare a value type on the stack, it gets all the zero values.

Contributor Author

Of stuff like child config etc.
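
On the pointer-to-pointer question: `encoding/json` documents that when `Unmarshal` is handed a pointer to a nil pointer it allocates a new value for it, and that the JSON literal `null` sets the pointer to nil. A small sketch of that behavior, using a hypothetical `lbConfig` type rather than the real `LBConfig` (it does not attempt to reproduce the hang mentioned above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lbConfig is a hypothetical stand-in for a balancer config type.
type lbConfig struct {
	MaxEjectionPercent uint32 `json:"maxEjectionPercent"`
}

// parse mirrors the pattern under review: declare a nil *lbConfig and pass
// its address, so Unmarshal allocates the struct itself.
func parse(j []byte) (*lbConfig, error) {
	var cfg *lbConfig
	if err := json.Unmarshal(j, &cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parse([]byte(`{"maxEjectionPercent": 10}`))
	fmt.Println(cfg != nil, err)

	// JSON null leaves the pointer nil, which a value-typed target
	// cannot express.
	cfg, err = parse([]byte(`null`))
	fmt.Println(cfg == nil, err)
}
```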

for i, dm := range cfg.DiscoveryMechanisms {
lbCfg, err := odParser.ParseConfig(dm.OutlierDetection)
if err != nil {
return nil, fmt.Errorf("error parsing Outlier Detection config: %v", dm.OutlierDetection)
Contributor

This needs to return err too.

Contributor Author

@zasweq zasweq Jun 9, 2023

Done. Good catch.

Comment on lines 133 to 137
if envconfig.XDSOutlierDetection {
	for i, odCfg := range odCfgs {
		cfg.DiscoveryMechanisms[i].outlierDetection = odCfg
	}
}
Contributor

Why is this checking the env var, but parsing isn't? That seems wrong. We don't usually validate things we won't use. I think they both should be in this condition and then we don't need two different for loops, right?

Contributor Author

Ah right fair point. Changed.

@@ -72,7 +73,10 @@ var (
}

noopODCfg = outlierdetection.LBConfig{
Contributor

Why does a noop config have fields set to specific numbers?

Contributor Author

Can't do anything about it. The gRFC specifically states "cds generates with all fields unset" for the no-op. Thus, the way our system is set up now, it is literally impossible to uphold the balancer API while keeping all fields unset (will always call ParseConfig() and get defaults). The important part is sre and fpe == nil, which keeps the picker from doing unnecessary counting on the critical rpc path. I talked to Michael about this and he says it's fine. I would like the fields to stay unset, but unfortunately can't uphold API and keep fields unset. Also can't set to zero values because that is different than the language of the gRFC, which states "all fields unset".

Contributor

Wait how is this config used in these tests, then? It looked like they were inputs, which means they could be completely empty, right? Or are you trying to make the input test data match real-world input data?

Contributor Author

The latter.

Comment on lines +84 to +90
lbCfg := &LBConfig{
	// Default top layer values as documented in A50.
	Interval:           iserviceconfig.Duration(10 * time.Second),
	BaseEjectionTime:   iserviceconfig.Duration(30 * time.Second),
	MaxEjectionTime:    iserviceconfig.Duration(300 * time.Second),
	MaxEjectionPercent: 10,
}
Contributor

Given the multi-level nature of this...there are now other defaults elsewhere. Can you move these defaults into config.go so they are all together, by adding a LBConfig.UnmarshalJSON method?

Contributor Author

Done. Needed to do same trick on that type to prevent infinite recursion and stack overflow.

type failurePercentageEjection FailurePercentageEjection

// UnmarshalJSON unmarshals JSON into FailurePercentageEjection. If a
// FailurePercentageEjection field is not set, that field will get it's default
Contributor

s/it's/its/ - 3x in this PR. Whenever I see "it's" my mind says "it is".

Contributor Author

Done.

SuccessRateEjection: sre,
FailurePercentageEjection: fpe,
}
odLBCfgJSON, err := json.Marshal(odLBCfg)
Contributor

return json.Marshal(odLBCfg)

Contributor Author

Good point. Switched.

@dfawley dfawley assigned zasweq and unassigned dfawley Jun 9, 2023
Contributor Author

@zasweq zasweq left a comment

Thanks for the pass! Got to all comments.

*
*/

// Package nop implements a balancer with all of it's balancer operations as
Contributor

its

Contributor Author

:/. Haha switched.

"google.golang.org/grpc/connectivity"
)

// Balancer is a balancer with all of it's balancer operations as no-ops, other
Contributor

its

Contributor Author

:/. Haha switched.

// and a Connectivity State of TRANSIENT_FAILURE.
func (b *Balancer) UpdateClientConnState(_ balancer.ClientConnState) error {
b.cc.UpdateState(balancer.State{
Picker: base.NewErrPicker(errors.New("no-op balancer invoked")),
Contributor

The error returned here should be an error passed to NewNOPBalancer.

Contributor Author

Done.


// Balancer is a balancer with all of it's balancer operations as no-ops, other
// than returning a Transient Failure Picker on a Client Conn update.
type Balancer struct {
Contributor

Unexport, then you don't need all these comments. Only NewNOPBalancer needs a comment.

Contributor Author

I had it unexported, but it was throwing errors wrt lint. I had to change it to pass lint :/

}

// NewNOPBalancer returns a no-op balancer.
func NewNOPBalancer(cc balancer.ClientConn) *Balancer {
Contributor

NewBalancer since it's in the nop package? I.e. nop.NewBalancer() vs nop.NewNOPBalancer() ... the latter slightly stutters.

Contributor Author

Ah ok.

Contributor Author

Done.

@@ -160,7 +160,7 @@ type exitIdle struct{}

// cdsBalancer implements a CDS based LB policy. It instantiates a
// cluster_resolver balancer to further resolve the serviceName received from
// CDS, into localities and endpoints. Implements the balancer.Balancer
// CDS, into localities and endpoints. Implements the balancer.bal
Contributor

Whoops. Change back.

@dfawley dfawley changed the title from "Fix Outlier Detection Config handling" to "xds/outlierdetection: fix config handling" Jun 9, 2023
@zasweq zasweq merged commit 3c6084b into grpc:master Jun 9, 2023
11 checks passed
zasweq added a commit to zasweq/grpc-go that referenced this pull request Jun 9, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 7, 2023
2 participants