[BUG/Feature] - Consistent behavior for outlier detection for local and remote endpoints of a target service #205

nirvanagit · 2022-04-16T01:25:46Z

Describe the bug
Outlier detection behaves differently depending on whether the target service is local to the client or in a remote cluster.

When target service is in the client cluster

The target service is accessed over its .svc service endpoint, which resolves to a single IP (the Kubernetes service IP).

When target service is in a remote cluster

The target service is accessed over the load balancer CNAME, which resolves to 3 IPs, one for each availability zone.

When the target service in the local cluster goes down, Envoy ejects its only endpoint. Admiral does not currently configure max_ejection_percent, so Envoy's default of 10% applies, but Envoy always allows at least one host to be ejected, and here that one host is the entire endpoint set.

When the target service in a remote cluster goes down, Envoy marks only one of the 3 endpoints as unhealthy. The same default max_ejection_percent (10%, with at least one host) applies, so the ejection limit is reached after the first endpoint and the other two stay in rotation.
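To make the arithmetic concrete, here is a rough sketch of the ejection cap for both cases. It is a simplified model of the rule as Envoy documents it (a percentage cap, with at least one host always ejectable), not Envoy's actual code, and the function name max_ejectable_hosts is made up for illustration. Exact accounting may differ between Envoy versions.

```python
import math


def max_ejectable_hosts(total_hosts: int, max_ejection_percent: float = 10.0) -> int:
    """Simplified model of the outlier ejection cap (hypothetical helper).

    Assumes the cap is floor(total * percent / 100), raised to 1 so that
    at least one host can always be ejected.
    """
    if total_hosts <= 0:
        return 0
    capped = math.floor(total_hosts * max_ejection_percent / 100)
    return min(total_hosts, max(1, capped))


# Local case: one Kubernetes service IP. Remote case: three load balancer IPs.
for hosts in (1, 3):
    print(f"{hosts} endpoint(s): up to {max_ejectable_hosts(hosts)} can be ejected")
```

Under this model the cap is 1 in both cases: that covers the whole local endpoint set, but only a third of the remote one. Passing max_ejection_percent=100 lifts the cap to all 3 endpoints.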

Steps To Reproduce

1. Create client and server applications.
2. Deploy the server app in two clusters.
3. Ensure the service entry for the target service has two endpoints, one for each region/cluster.
4. Start traffic from the client pod using fortio - fortio load -qps 100 -t 0 <.MESH endpoint>
5. Inject a fault on the client by blackholing the target service IP (which is local to the client) - ip route add blackhole X.X.X.X
6. Check the envoy configuration; it will show the (only) endpoint as unhealthy.
7. Traffic will get diverted to the healthier region.
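For the "check envoy configuration" step, one option is to look at the output of the Envoy admin /clusters endpoint, where ejected hosts carry the failed_outlier_check health flag. Below is a small sketch that pulls those hosts out of the text output. The cluster name and IPs in SAMPLE are made up, and outlier_ejected_hosts is a hypothetical helper, not something Admiral or Istio ships.

```python
def outlier_ejected_hosts(clusters_text: str, cluster_name: str) -> list:
    """Return addresses in cluster_name flagged as failed_outlier_check.

    Expects the plain-text format of Envoy's admin /clusters endpoint, where
    per-host lines look like: <cluster>::<ip:port>::health_flags::<flags>
    """
    marker = "::health_flags::"
    ejected = []
    for line in clusters_text.splitlines():
        line = line.strip()
        if marker not in line:
            continue  # not a per-host health line
        prefix, flags = line.split(marker, 1)
        name, _, address = prefix.partition("::")
        if name == cluster_name and "failed_outlier_check" in flags:
            ejected.append(address)
    return ejected


# Made-up sample resembling the remote (3 IP) case described in this issue.
SAMPLE = """\
outbound|80||server.mesh::10.0.0.1:80::health_flags::/failed_outlier_check
outbound|80||server.mesh::10.0.0.2:80::health_flags::healthy
outbound|80||server.mesh::10.0.0.3:80::health_flags::healthy
outbound|80||server.mesh::10.0.0.1:80::cx_active::0
"""

print(outlier_ejected_hosts(SAMPLE, "outbound|80||server.mesh"))
```

In an Istio sidecar the text can usually be fetched from the admin port (for example with curl against localhost:15000/clusters inside the istio-proxy container), assuming the default admin port is in use.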

Repeat the above steps with a slight modification:

1. Update the SE so that the local endpoint also points to the load balancer CNAME and not the service address.
2. Make it healthy, so that traffic goes to the local endpoint.
3. Inject failure, but this time for all 3 IPs corresponding to the load balancer of the local cluster.
4. Check the envoy configuration; it will show only one out of three endpoints as unhealthy.
5. Traffic gets diverted to the healthier region, but the client observes a spike of 5xx errors, which remains steady.

Expected behavior
Outlier detection should work the same way irrespective of where the target service is relative to the client.
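As an illustration only, one way to get there would be for the generated DestinationRule to set maxEjectionPercent explicitly, so that every endpoint behind the load balancer CNAME can be ejected. The sketch below builds such a trafficPolicy fragment as a plain dict. The field names follow Istio's DestinationRule outlierDetection settings, but the numeric values are placeholders and this is not necessarily how Admiral generates its config.

```python
import json


def outlier_detection_policy(max_ejection_percent: int = 100) -> dict:
    """Build a DestinationRule trafficPolicy fragment (illustrative values)."""
    return {
        "outlierDetection": {
            # Placeholder thresholds; tune for the actual workload.
            "consecutive5xxErrors": 5,
            "interval": "10s",
            "baseEjectionTime": "30s",
            # Allow all endpoints of a failed region to be ejected.
            "maxEjectionPercent": max_ejection_percent,
        }
    }


policy = outlier_detection_policy()
print(json.dumps(policy, indent=2))
```

With maxEjectionPercent at 100, a failed region behind 3 load balancer IPs can be ejected completely, which matches what already happens for the single service IP.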

aattuluri · 2022-05-05T22:00:09Z

Fixed with #207

nirvanagit added the bug Something isn't working label Apr 16, 2022

aattuluri mentioned this issue Apr 19, 2022

Tune default outlier detection parameters for remote endpoints #207

Merged

aattuluri self-assigned this Apr 20, 2022

aattuluri added this to the v1.3 milestone Apr 20, 2022

aattuluri closed this as completed May 5, 2022
