[BUG/Feature] - Consistent behavior for outlier detection for local and remote endpoints of a target service #205

Closed
nirvanagit opened this issue Apr 16, 2022 · 1 comment

@nirvanagit
Collaborator

Describe the bug
Outlier detection behaves differently depending on whether the target service is in the client's cluster or in a remote cluster.

When target service is in the client cluster

The target service is accessed over its .svc service endpoint, which resolves to a single IP (the Kubernetes service IP).

When target service is in a remote cluster

The target service is accessed over the load balancer CNAME, which resolves to 3 IPs, one for each availability zone.

When the target service in the local cluster goes down, Envoy marks its only endpoint as unhealthy. Admiral does not currently configure max_ejection_percent, so Envoy's default applies: 10%, with at least one host always eligible for ejection. With a single endpoint, that one ejection covers the whole service.

When the target service in a remote cluster goes down, Envoy marks only one of the 3 endpoints as unhealthy, because the same unset max_ejection_percent limits ejection to a single host and the other two failing endpoints stay in rotation.
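The difference between the two cases can be illustrated with a small sketch. This is a simplified model, not Envoy's actual ejection algorithm: it assumes Envoy's documented default of 10% for max_ejection_percent and the documented rule that at least one host can always be ejected. The function names are hypothetical and exist only for this illustration.

```python
def max_ejectable_hosts(host_count: int, max_ejection_percent: int = 10) -> int:
    """Simplified model of how many hosts outlier detection may eject at once.

    Assumption: the cap is floor(host_count * percent / 100), but never less
    than one host. This approximates the documented behavior and is not a
    copy of Envoy's implementation.
    """
    if host_count <= 0:
        return 0
    by_percent = host_count * max_ejection_percent // 100
    return min(host_count, max(1, by_percent))


def failing_hosts_left_in_rotation(host_count: int, max_ejection_percent: int = 10) -> int:
    """If every host is failing, how many failing hosts still receive traffic."""
    return host_count - max_ejectable_hosts(host_count, max_ejection_percent)


if __name__ == "__main__":
    scenarios = [
        ("local, .svc endpoint (1 IP)", 1),
        ("remote, load balancer CNAME (3 IPs)", 3),
    ]
    for label, hosts in scenarios:
        ejectable = max_ejectable_hosts(hosts)
        left = failing_hosts_left_in_rotation(hosts)
        print(f"{label}: ejectable={ejectable}, failing hosts still in rotation={left}")
```

Under this model, raising the cap to 100 lets all three load balancer IPs be ejected, which would make the remote case match the local one. Whether that is the right setting for Admiral to apply is a design decision outside the scope of this sketch.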

Steps To Reproduce

  1. Create the client and server applications.
  2. Deploy the server app in two clusters.
  3. Ensure the service entry for the target service has two endpoints, one for each region/cluster.
  4. Start traffic from the client pod using fortio: fortio load -qps 100 -t 0 <.MESH endpoint>
  5. Inject a fault on the client by blackholing the target service IP (which is local to the client): ip route add blackhole X.X.X.X
  6. Check the Envoy configuration; it will show the (only) endpoint as unhealthy.
  7. Traffic is diverted to the healthier region.

Repeat the above steps with a slight modification:

  1. Update the SE so that the local endpoint also points to the load balancer CNAME instead of the service address.
  2. Make the local endpoint healthy again, so that traffic goes to it.
  3. Inject the failure again, but this time for all 3 IPs behind the load balancer of the local cluster.
  4. Check the Envoy configuration; it will show only one of the three endpoints as unhealthy.
  5. Traffic is diverted to the healthier region, but the client observes a spike of 5xx errors that remains steady.
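Blackholing every IP behind the load balancer means repeating the ip route command once per address. A minimal sketch of a helper that generates those commands, along with the matching cleanup commands, might look like the following. The function name and the example addresses (taken from the RFC 5737 documentation range) are placeholders and not part of the original report.

```python
import ipaddress


def blackhole_commands(ips, undo=False):
    """Build one `ip route` command per IP to add (or remove) a blackhole route.

    The commands are only returned as strings; running them on the client pod
    (which needs the appropriate privileges) is left to the operator.
    """
    action = "del" if undo else "add"
    commands = []
    for ip in ips:
        # Validate the address; raises ValueError on malformed input.
        addr = ipaddress.ip_address(ip)
        commands.append(f"ip route {action} blackhole {addr}")
    return commands


if __name__ == "__main__":
    # Placeholder IPs standing in for the three load balancer addresses.
    lb_ips = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
    for cmd in blackhole_commands(lb_ips):
        print(cmd)
    # Commands to restore connectivity after the test.
    for cmd in blackhole_commands(lb_ips, undo=True):
        print(cmd)
```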

Expected behavior
Outlier detection should work the same way regardless of where the target service is located relative to the client.

@nirvanagit nirvanagit added the bug Something isn't working label Apr 16, 2022
@aattuluri aattuluri self-assigned this Apr 20, 2022
@aattuluri aattuluri added this to the v1.3 milestone Apr 20, 2022
@aattuluri
Contributor

Fixed with #207
