
serviceB gateway deployment has high performance impact on serviceA traffic over ingressgateway #14283

Open
zhqworld4303 opened this issue May 22, 2019 · 7 comments


@zhqworld4303

commented May 22, 2019

While service A traffic is flowing through the ingress gateway, deploying a Gateway for service B has a high performance impact on service A traffic.

Expected behavior
Deploying a gateway should have little impact on other services' traffic.

Steps to reproduce the bug

  1. First deploy service A, then create a Gateway (HTTP) and VirtualService for service A (a sample manifest pair is sketched after this list).
  2. Then deploy a client application and fire requests at service A through the ingressgateway.
    (In my case this is 90,000 QPS to service A, with an average payload size of around 10K per request.)

Below is the HPA status for each component at this traffic level:

NAME                   REFERENCE                         TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
istio-ingressgateway   Deployment/istio-ingressgateway   14%/60%, 83%/80%   16        512       122        8d
istio-pilot            Deployment/istio-pilot            26%/60%, 2%/80%    2         8         2          8d
istio-telemetry        Deployment/istio-telemetry        36%/50%, 87%/80%   8         512       59         8d
  3. Now deploy service B and wait until the service is ready.
  4. Create a Gateway and VirtualService for service B, and verify the gateway works by firing a curl command.
  5. Delete the Gateway and VirtualService for service B and verify that the curl now fails.
  6. Repeat steps 4 and 5 100 times.
  7. During the time window of step 6, I observed the impact below on ingressgateway request duration P99:
    (screenshot: ingressgateway request duration P99 during the test window)
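
For reference, here is a minimal sketch of the kind of Gateway/VirtualService pair used in steps 1 and 4. The resource names, hostname, and ports are illustrative assumptions rather than the reporter's actual configuration; service B would use an equivalent pair bound to the same ingressgateway.

    # Illustrative only: names, hostname, and ports are assumptions.
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: service-a-gateway
    spec:
      selector:
        istio: ingressgateway   # bind to the default istio-ingressgateway pods
      servers:
      - port:
          number: 80
          name: http
          protocol: HTTP
        hosts:
        - "service-a.example.com"
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: service-a
    spec:
      hosts:
      - "service-a.example.com"
      gateways:
      - service-a-gateway
      http:
      - route:
        - destination:
            host: service-a    # Kubernetes Service name for service A
            port:
              number: 80

The curl checks in steps 4 and 5 would then target the ingressgateway's external IP with the matching Host header.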

Version (include the output of istioctl version --remote and kubectl version)
istioctl: 1.1.7
kubectl: v1.13.5

How was Istio installed?
Installed with Helm and Tiller via helm install.

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[X] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience

@howardjohn

Member

commented May 22, 2019

A couple questions, to better understand the scope of this:

  • You mentioned this happens when applying a Gateway + VirtualService. Did you happen to test changing just one or the other?
  • You mentioned you did this 100 times; was that over a minute, an hour, etc.? From the graph it looks like maybe 10 minutes, so about 10 changes/minute? Does it happen if you make just one change, or does it need to be many?
  • Any idea if this affects p50/p95, or is it just p99? Does it happen at lower QPS?

We don't need answers to all of these if they are hard to determine, but they could help us reproduce the issue and find the cause.

@zhqworld4303

Author

commented May 22, 2019

I didn't test changing only one or the other.
The whole set of 100 changes took 10 minutes; you can see it matches the spike over that 10-minute window.
It also affects the other percentiles, but to a lesser degree (P99: 100%, P95: 50%, P90: 20%, P50: 5%).

@howardjohn

Member

commented May 22, 2019

Ok great, thanks for the info. My guess, without looking too much into this yet, is that the gateway changes are causing a lot of listener churn, slowing Envoy down a bit.

@duderino

Contributor

commented May 22, 2019

Capturing a side conversation here: there is only one listener on port 443, so any change will cause a full drain. The upcoming filter chain discovery service (FDS) should alleviate this. Until then, we should recommend throttling changes in Pilot to batch updates and reduce the frequency of listener drains.

@costinm @philrud @howardjohn @mandarjog recommendation please

@duderino duderino added this to the 1.3 milestone Jun 6, 2019

@duderino

Contributor

commented Jun 6, 2019

Feature for 1.3: Pilot must protect against excessive config changes and properly implement throttling without unwanted side effects

@rlenglet rlenglet modified the milestones: 1.4, 1.3 Jul 9, 2019

@andraxylia andraxylia modified the milestones: 1.3, Nebulous Future Jul 19, 2019

@howardjohn

Member

commented Aug 15, 2019

There are a few ways to improve this:

  1. Change config less often. This is the easiest solution, but may not be feasible in shared clusters.
  2. Have Pilot send config less often by setting the following env variables on the Pilot deployment (see the sketch after this list for where they go):
        - name: PILOT_DEBOUNCE_AFTER
          value: 10s
        - name: PILOT_DEBOUNCE_MAX
          value: 30s
        - name: PILOT_ENABLE_EDS_DEBOUNCE
          value: "false"
     This will delay config pushes by 10-30s; tune according to your comfort level. With PILOT_ENABLE_EDS_DEBOUNCE set to "false", endpoint changes are still sent immediately, so you should not have problems with stale endpoints.
  3. Use filter chain discovery (FDS) to reduce the need to drain the whole listener on updates. This is being worked on by @silentdai but is not ready yet (or possibly not started?).
  4. Optimize listener drains in Envoy. I don't think anyone is working on this.
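
For concreteness, a minimal sketch of where the variables from option 2 would land in the istio-pilot Deployment. The namespace istio-system and the container name discovery are assumptions based on the default charts; verify them against your own install.

    # Illustrative excerpt only, not a complete Deployment manifest.
    # Namespace and container name are assumptions from the default charts.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: istio-pilot
      namespace: istio-system
    spec:
      template:
        spec:
          containers:
          - name: discovery
            env:
            - name: PILOT_DEBOUNCE_AFTER
              value: 10s
            - name: PILOT_DEBOUNCE_MAX
              value: 30s
            - name: PILOT_ENABLE_EDS_DEBOUNCE
              value: "false"

Note that changing the Deployment's env triggers a rolling restart of the Pilot pods, so it is best applied outside a window of heavy config churn.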

@howardjohn

Member

commented Aug 15, 2019

Sorry, I forgot to mention: PILOT_ENABLE_EDS_DEBOUNCE was introduced in 1.3, along with many other fixes needed to make this actually work well without getting 503s.
