
Benchmark uses istio-demo.yaml which is not suitable for performance testing #5

Closed
mandarjog opened this issue May 19, 2019 · 28 comments

Comments

@mandarjog

mandarjog commented May 19, 2019

The latency numbers published for Istio are far too high. They do not agree with the numbers we collect, which are published on istio.io.

If you could also check in or share the generated istio.yaml file, it would let us inspect it more closely.

You are also applying istio-demo.yaml, which is not meant for performance testing, so I would like to see the contents of that file after your patching operations.

@mandarjog changed the title from "Publish share the generated istio.yaml" to "Publish the generated istio.yaml" on May 19, 2019
@t-lo
Member

t-lo commented May 19, 2019

Hello mandarjog,

The latency numbers published for Istio are far too high. They do not agree with the numbers we collect, which are published on istio.io.

A number of questions:

  • When you compare our data points with yours, do you look at our absolute values or do you calculate the relative latency from our data (i.e. relative to bare metal latency percentiles)? If you use our absolute values, then that includes the latency of the applications we benchmarked against. Against what application are you running your benchmarks? You will need to run your benchmarks against the same applications we did if you want to compare absolute values. If you don't want to do that, you need to calculate the relative latency compared to "bare".
  • In your benchmarks, among how many applications do you spread the RPS load? For our tests, we used 30 apps / 90 microservices.
  • Does your benchmark account for Coordinated Omission? If it does not, you will get unrealistically good latency reported that does not correspond to real-world user experience (please see the respective section in the blog post).
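
For reference, a minimal sketch of measuring latency with and without coordinated-omission correction using wrk2 (the binary is still called wrk). The URL, thread/connection counts, duration and rate below are placeholders, not the benchmark's actual invocation:

# wrk2 drives a constant request rate (-R) and records latency against the
# *intended* send time, which is what corrects for coordinated omission.
# --latency prints the corrected HdrHistogram percentiles; --u_latency
# additionally prints the uncorrected ("service time") view for comparison.
wrk -t8 -c64 -d10m -R500 --latency --u_latency http://app.example.local/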

You are also applying istio-demo.yaml, which is not meant for performance testing, so I would like to see the contents of that file after your patching operations.

We benchmarked against "stock" and "tuned"; "stock" uses the istio-demo.yaml (with PSPs added), "tuned" uses the tuning options called out in the blog post.

Please find the respective patches to istio-demo.yaml here: https://github.com/kinvolk/service-mesh-benchmark/tree/master/scripts/istio

@t-lo
Member

t-lo commented May 19, 2019

@mandarjog Please let me know whether the reference in my response above provides the information you are looking for so we can close this issue.

@howardjohn

Even without testing against the same application you used, how can the latency possibly be in the MINUTES range? That is unreasonably slow and indicates that something may have gone wrong in the test.

I have done a lot of benchmarks of Istio (and linkerd), and even in the very worst scenarios neither adds more than 30ms to the p99s. If you said Istio was a few ms worse than linkerd, then maybe it could come down to a difference in testing, but minute-range latencies are not plausible.

@t-lo
Member

t-lo commented May 19, 2019

@howardjohn What load generator / latency measurement tooling were you using? When you measured latency, did you take Coordinated Omission into account? If you did not, then your results are overly optimistic and do not reflect real-life user experience.

@howardjohn

We usually use fortio, but have also used wrk2 (which does take coordinated omission into account, as the blog mentions). I don't think coordinated omission would account for minute-long latencies; I don't think there is a web server in the world that is that slow.

Also, I am surprised Istio isn't able to hit 600 RPS; we send 5k RPS through it pretty regularly without issues.

It seems there are two possibilities:

  • Your setup has triggered a bug which causes Istio/Envoy to behave extremely slowly. If this is the case, we would love to find out why so that we can prevent it from happening to our users.
  • Something went wrong in your test setup. I haven't had a chance to look too closely at it, so I am not sure what did or could have happened.

I get that it may seem like we are just being defensive and blaming your tests because they make Istio look bad, but this isn't our intent. We put a lot of time and effort into making Istio as fast and scalable as possible so when we see results that are different than what our tests show and what others show (such as https://medium.com/@michael_87395/benchmarking-istio-linkerd-cpu-at-scale-5f2cfc97c7fa), we want to figure out why.

@t-lo
Member

t-lo commented May 19, 2019

@howardjohn thank you for clarifying. I fully understand why the Istio results may raise questions and concerns. We at Kinvolk do share your interest in better understanding the reasons behind those results; as mentioned in the blog article, we are currently doing client work on Istio (the objective of that work is not related to Istio performance).

I'm happy to assist if you would like to reproduce our benchmark results; we believe we released everything you need to do this.

We're at KubeCon Barcelona by the way - if you're around then we could discuss in person.

@howardjohn

Thanks @t-lo, I am going to try out your test set up.
@mandarjog I generated the istio.yaml that they used (assuming I followed the steps right): https://gist.github.com/howardjohn/804db2d4071dac8cc90e00a72d2d571d
This is the "tuned" one by the way

@howardjohn

Ok I found some time to try to reproduce your setup. First, I just wanted to say thanks for conducting a pretty thorough and reproducible test, it is always good to see new numbers.

Some notes:

  • The patch is a bit of a strange way to install Istio, although I was able to reproduce it. We typically just apply settings with Helm via helm --values values.yaml, which makes it a lot easier for others to install it themselves. It shouldn't impact this benchmark, but installing from the GitHub source is also not a great idea; we do some post-processing in the release.
  • You are using the demo installation. We tell users not to use this for prod/performance tests in the docs -- it is just intended to turn on all features (like 100% trace sampling, full access logging, etc) with tiny resource usages so it can fit on a cloud free tier. While the tuned version does turn off some of this, it misses a few spots. The 100m limit on mixer is way too low. I think this is the primary cause of the bad performance, as it is getting heavily throttled. You also mention you disabled mixer, but you didn't fully disable it -- only policy is disabled, not telemetry.
  • Istio has tons of extras enabled that linkerd does not -- you deploy jaeger, kiali, and the sidecar injector which aren't even used. It would be nice to get a breakdown of control plane CPU by component to account for this. The CPU usage of these should be negligible though, so probably not a huge deal.
  • On coordinated omission: I understand the idea of it, but I don't agree this makes the measurement realistic. In the 600 RPS benchmark, Istio's latencies are absurd. There is no real world case where people are actually seeing minutes of latency - if their servers are that loaded they would just scale up. Once the server cannot handle the load (in your case, we saw Istio maxing out at ~570 RPS), the latency results are no longer valid. Generally, I have seen latency tests done at 50-80% of the max throughput a server can handle to get accurate results. Otherwise, you are really just comparing the throughput of two systems and labelling it as latency.

I was able to reproduce your results -- I only looked at Istio latency though, not resource usage or linkerd. With your setup, I saw p99 latencies around 50s "corrected", and 50ms uncorrected, roughly similar to your numbers.

Next I did the exact same test but I just used the default Istio install (helm template install/kubernetes/helm/istio | kubectl apply -f -). This is how we recommend users install Istio.
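
For concreteness, a sketch of what that default installation amounts to on the command line, assuming an Istio 1.1.x source tree; the istio-init/CRD step and the namespace creation are standard for that release but are not spelled out in the comment above:

# Install the CRDs first, then render the default chart with no overrides and apply it.
kubectl create namespace istio-system
helm template install/kubernetes/helm/istio-init --name istio-init --namespace istio-system | kubectl apply -f -
helm template install/kubernetes/helm/istio --name istio --namespace istio-system | kubectl apply -f -

# The second run described below differed only in disabling Mixer telemetry:
#   helm template install/kubernetes/helm/istio --name istio --namespace istio-system \
#     --set mixer.telemetry.enabled=false | kubectl apply -f -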

I got these results (note these are the numbers corrected for coordinated omission) at 600 QPS:

Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    5.30ms
 75.000%    7.65ms
 90.000%   13.80ms
 99.000%   35.04ms
 99.900%   91.65ms
 99.990%  198.65ms
 99.999%  217.21ms
100.000%  217.21ms

Next I made one change, turning off mixer with --set mixer.telemetry.enabled=false:

Latency Distribution (HdrHistogram - Recorded Latency)
50.000%    5.25ms
75.000%    6.29ms
90.000%    7.57ms
99.000%   12.34ms
99.900%   24.45ms
99.990%   61.57ms
99.999%   75.90ms
100.000%   75.90ms

Finally, tested with Istio off:

Latency Distribution (HdrHistogram - Recorded Latency)
50.000%    2.33ms
75.000%    2.68ms
90.000%    3.05ms
99.000%    4.51ms
99.900%    9.31ms
99.990%   13.00ms
99.999%   18.50ms
100.000%   18.50ms

So in the end we see Istio adding ~3ms to the p50 and ~3ms to the p99, which is pretty much in line with https://istio.io/docs/concepts/performance-and-scalability/. These numbers were pretty quick and dirty (I ran for 2min instead of 30min, only took one sample, etc), but they represent numbers closer to what we expect to see.

@mandarjog
Author

Thanks @howardjohn for reproducing the results.
Thanks @t-lo for the experiment.

  1. We have certainly made an error in not clearly pointing out that the demo profile is for "trying out functions" and is not meant for performance testing. I will update our docs to be very clear about this.
  2. https://istio.io/docs/setup/kubernetes/install/helm/ is where the actual instructions to set up Istio are.

@mandarjog
Author

The telemetry limits and requests differ from the defaults.

##  extensions/v1beta1::Deployment::istio-telemetry
--- /Users/mjog/tmp/perf2/istio.yaml
+++ /Users/mjog/tmp/perf2/kinvolk_istio.yaml
@@ -96,11 +96,9 @@
         - containerPort: 42422
         resources:
           limits:
-            cpu: 4800m
-            memory: 4G
+            cpu: 100m
           requests:
-            cpu: 1000m
-            memory: 1G
+            cpu: 50m
         volumeMounts:
         - mountPath: /etc/certs
           name: istio-certs
@@ -136,7 +134,7 @@

For the proxy it is a bit hard to read, but:
request 100m / limit 2000m --> request 10m / limit 250m

##  v1::ConfigMap::istio-sidecar-injector
--- /Users/mjog/tmp/perf2/istio.yaml
+++ /Users/mjog/tmp/perf2/kinvolk_istio.yaml
@@ -73,21 +72,21 @@
       \      cpu: \"[[ index .ObjectMeta.Annotations `sidecar.istio.io/proxyCPU` ]]\"\
       \n      [[ end ]]\n      [[ if (isset .ObjectMeta.Annotations `sidecar.istio.io/proxyMemory`)\
       \ -]]\n      memory: \"[[ index .ObjectMeta.Annotations `sidecar.istio.io/proxyMemory`\
-      \ ]]\"\n      [[ end ]]\n  [[ else -]]\n    limits:\n      cpu: 2000m\n    \
-      \  memory: 1024Mi\n    requests:\n      cpu: 100m\n      memory: 128Mi\n   \
`sidecar.istio.io/bootstrapOverride`)\
+      \ ]]\"\n      [[ end ]]\n  [[ else -]]\n    limits:\n      cpu: 250m\n    requests:\n\
+      \      cpu: 10m\n    \n  [[ end -]]\n  volumeMounts:\n  [[- if (isset .ObjectMeta.Annotations\
+      \ `sidecar.istio.io/bootstrapOverride`) \

In a CPU-constrained environment, the latency numbers are not reliable.
We will check how much CPU throttling was happening in this case.
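
One quick way to check for CFS throttling on the telemetry pod is sketched below; the label selector, container name, and cgroup v1 path are assumptions based on a default istio-system install, not something taken from this thread:

# Pick the istio-telemetry pod and read its CFS throttling counters.
POD=$(kubectl -n istio-system get pod -l istio-mixer-type=telemetry \
      -o jsonpath='{.items[0].metadata.name}')
kubectl -n istio-system exec "$POD" -c mixer -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
# nr_throttled and throttled_time show how often and for how long the container was
# throttled; the exact cgroup path may differ depending on the base image.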

@t-lo
Member

t-lo commented May 20, 2019

@howardjohn Thank you very much for taking the time and investing the effort to reproduce our results. It is reassuring that you have been able to reproduce the massive latency Istio displayed in our benchmark results when overloaded - admittedly, we were rather suspicious of the data too, but we consistently reproduced the results across the many benchmark runs we did.

In order for others to reproduce the results of your own tests, could you please share the full configuration of your cluster set-up? Specifically, it would be very helpful if you could provide access to the deployment configuration (istio.yaml) you were using.

I'd like to repeat at this point that we fully understand that the benchmark report raised concerns in the Istio community (which, by the way, we consider ourselves part of, since, as also mentioned above, we are currently tasked with Istio development work).

Regarding your feedback:

You are using the demo installation. We tell users not to use this for prod/performance tests in the docs -- it is just intended to turn on all features (like 100% trace sampling, full access logging, etc) with tiny resource usages so it can fit on a cloud free tier.

We explicitly call out in the blog post that we benchmarked both the "stock" Istio experience users will have when following the Istio evaluation instructions, as well as a tuned version that also aims to retain feature parity with Linkerd.

While the tuned version does turn off some of this, it misses a few spots. The 100m limit on mixer is way too low. I think this is the primary cause of the bad performance, as it is getting heavily throttled.

Could you please provide the corresponding istio.yaml so we can focus on the actual technical settings instead of verbally describing what to change?

You also mention you disabled mixer, but you didn't fully disable it -- only policy is disabled, not telemetry.

This is correct - the motivation here is to use a tuned istio that retains feature parity with Linkerd. We wanted a fair comparison. I am sorry we did not call out this motivation explicitly in our blog post.

Istio has tons of extras enabled that linkerd does not -- you deploy jaeger, kiali, and the sidecar injector which aren't even used. It would be nice to get a breakdown of control plane CPU by component to account for this. The CPU usage of these should be negligible though, so probably not a huge deal.

Please have a look at bench-run-istio-stock-<date>.top and bench-run-istio-tuned-<date>.top which are generated during each benchmark run; these files should contain the data you are looking for.
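
Independent of those files, a per-component breakdown can also be pulled directly from the cluster; this is just a sketch and assumes a metrics pipeline (metrics-server or Heapster) is installed, which is not part of the benchmark tooling itself:

# Per-container CPU/memory for the control plane, separating pilot, mixer,
# jaeger, kiali, the sidecar injector, etc.
kubectl top pods -n istio-system --containers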

On coordinated omission: I understand the idea of it, but I don't agree this makes the measurement realistic. In the 600 RPS benchmark, Istio's latencies are absurd.

I fully agree, and we do not consider the 600 RPS case something Istio users would sustain. However, Linkerd clusters are still operational at that rate. Regarding your remarks on Coordinated Omission, I would like to better understand the technical reasons why you believe this way of measuring latency is incorrect. I believe we have stated our reasons for taking it into account in the blog post. I'm sorry to be blunt, but your statement reads a bit like "I don't like this because I don't like the benchmark results".

There is no real world case where people are actually seeing minutes of latency - if their servers are that loaded they would just scale up.

That would generate significantly higher cost. I do agree, however, that we should call out more prominently in the blog post that the Istio cluster became inoperable at that rate.

@mandarjog What files did you generate your diff from? The patch neither matches our stock Istio nor our tuned Istio set-up.

@howardjohn

In order for others to reproduce the results of your own tests, could you please share the full configuration of your cluster set-up? Specifically, it would be very helpful if you could provide access to the deployment configuration (istio.yaml) you were using.

I strongly recommend you follow https://istio.io/docs/setup/kubernetes/install/helm/#option-1-install-with-helm-via-helm-template, which is our (only?) supported way to install for prod/performance-testing use.

Using this, in my first test I just used the default settings, and in my second only mixer.telemetry.enabled=false (so I didn't really apply any of the tuning you did).

For the values, something like this is a very minimally tuned set of settings that roughly matches what you used:

mixer:
    telemetry:
        enabled: false
    policy:
        enabled: false

grafana:
    enabled: true

gateways:
    enabled: false

global:
    mtls:
        enabled: true
    enableTracing: false
    proxy:
        tracer: ""

We explicitly call out in the blog post that we benchmarked both the "stock" Istio experience users will have when following the Istio evaluation instructions, as well as a tuned version that also aims to retain feature parity with Linkerd.

This is a great goal, but I don't think the setup quite met it. By the way, I think from this (and some other conversations) it is clear that we need to improve our docs - it should be easy to get a good setup running.

For the "stock" Istio experience, this is the default installation, NOT the demo.

For the "tuned" version, it is actually really good and thorough - you have most of the optimizations we would make if we wanted to maximize performance. Unfortunately, since it was based on the demo, you missed a few parts:

  • Tiny CPU limits, which were the root cause of the bad performance
  • Parity with linkerd: even in your tuned version, Istio still has a lot more features enabled than linkerd, if the goal was parity (to the best of my knowledge at least; I am not a linkerd expert). Jaeger, kiali, and the sidecar injector are all deployed. But this should be pretty negligible, just a slight extra CPU/memory usage.
  • As far as mixer on/off, @mandarjog is the expert here so I will defer to him. I think in 1.1.6 there isn't really a way to get parity with linkerd - you can either get way more telemetry with mixer (at the cost of CPU), or less without mixer. Very soon you will be able to get the same level of telemetry without Mixer. Don't quote me on that though.

I fully agree, and we do not consider the 600 RPS case something Istio users would sustain. However, Linkerd clusters are still operational at that rate. Regarding your remarks on Coordinated Omission, I would like to better understand the technical reasons why you believe this way of measuring latency is incorrect. I believe we have stated our reasons for taking it into account in the blog post. I'm sorry to be blunt, but your statement reads a bit like "I don't like this because I don't like the benchmark results".

Just a note: 600 RPS should be no problem for Istio or linkerd; we easily see 5-10k throughput on Istio (linkerd is pretty similar as well). The reason we couldn't hit 600 RPS here is the CPU limits.

On Coordinated Omission, I am not an expert on this so maybe I am missing something, but here are my thoughts:

Consider a strawman service A, which has 1ms latency but can handle 10 RPS, and service B which has 100ms latency but can handle 1000 RPS.

If we were to benchmark these at 10 RPS, service A would have a clearly better latency. Yet if we ran the same test at 20 RPS, service B would have a better latency.

Clearly, the expected result here is that service A has better latency, not service B. If we want to compare throughput, it would be more accurate to perform a separate test for that.

If you want to measure "the latency of a system when its CPU is heavily throttled", then the results are valid I guess, but I would argue that is not a useful stat.

My understanding is Coordinated Omission is meant to correct for latency variance -- for example, if a single request happened to take longer than others, that is accounted for. It is not meant to correct for a service that is sent more load than it can handle for an extended period of time.
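
As a back-of-envelope illustration of that last point (the ~570 RPS ceiling, 600 RPS offered load and 30-minute runs are taken from this thread; the calculation itself is not):

# If the client intends 600 req/s but the mesh completes only ~570 req/s, a backlog of
# ~30 req/s accumulates. Coordinated-omission correction charges that queueing delay to
# the intended send times, so the corrected tail grows with test duration:
awk 'BEGIN {
  offered = 600; served = 570; duration = 1800          # req/s, req/s, seconds
  backlog = (offered - served) * duration               # requests still queued at the end
  printf "approx. queueing delay at end of run: %.0f s\n", backlog / served   # ~95 s
}'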

Does that make sense?

Anyhow, like I said we clearly need to make some improvements on our UX side. I think Istio performance is pretty good when set up right, but we need to make it harder to set it up wrong, so thanks for bringing all of this up.

@t-lo
Member

t-lo commented May 20, 2019

@howardjohn Thank you for the thorough response! I'm kind of in a hurry right now (KubeCon is on!) and will get back to the major items later - however, could you please provide a link to the Istio yaml you were using when running your benchmarks? This would be really helpful for us to reproduce your results, and it's a lot clearer than just describing steps we should do to arrive at that file ourselves (though I fully understand why you point out those steps, too).

@howardjohn

Here is the istio.yaml, with one caveat: please don't use this. It is hard/impossible to modify the settings if you just use a rendered template like this. We strongly recommend you use Helm as per the instructions at https://istio.io/docs/setup/kubernetes/install/helm/#option-1-install-with-helm-via-helm-template. This makes it easier to share the settings you used and to modify them -- it's hard to understand everything from a 16,000-line yaml file.

@frankbu

frankbu commented May 21, 2019

@mandarjog, @howardjohn The docs do have a warning about the demo profile not being suitable for performance testing.

https://istio.io/docs/setup/kubernetes/additional-setup/config-profiles/

Do you think we should mention this in more places? I don't know why people often seem to miss things that I think are clearly documented :(

@mandarjog changed the title from "Publish the generated istio.yaml" to "Benchmark uses istio-demo.yaml which is not suitable for performance testing" on May 21, 2019
@t-lo changed the title from "Benchmark uses istio-demo.yaml which is not suitable for performance testing" to "Provide optimized values for istio-demo.yaml or an alternative istio.yaml for benchmarking" on May 22, 2019
@t-lo
Member

t-lo commented May 22, 2019

@howardjohn I am looking for a configuration with fixed values, for use in this benchmark repo. We need a reproducible way of setting up Istio. Re-generating configurations will create varying results, which is not acceptable for a reproducible benchmark.

I have modified our configuration to remove the CPU limits you called out in #5 (comment)

For the "tuned" version, it is actually really good and thorough - you have most of the optimizations we would make if we wanted to maximize performance.

I have used the values provided by @mandarjog in #5 (comment) (telemetry: request 50m --> 1000m, limit 100m --> 4800m; proxy: request 10m --> 100m, limit 250m --> 2000m). So far I have been unable to detect a major improvement, but I have merely run a few ad-hoc tests.

@mandarjog I took the liberty of updating the issue title so we can iterate on this and close it. Please help drive this forward.

@frankbu That's an interesting perspective. The Istio website discusses Evaluation: https://istio.io/docs/setup/kubernetes/install/kubernetes/ Could we not, as an alternative to your proposal, provide an evaluation configuration that does not have these limitations? That would be less confusing and more helpful to Istio users at the same time.

@t-lo
Member

t-lo commented May 22, 2019

I won't be available to work on this issue for the rest of the week. To sum up:

@frankbu

frankbu commented May 22, 2019

@t-lo I think the confusion may be because we did not make it clear what kind of "evaluation" we mean. The demo profile is intended for "feature evaluation", not "performance evaluation". I agree that we should add another pregenerated profile, tuned for performance evaluation, and let the user pick which evaluation profile they want.

@t-lo
Member

t-lo commented May 22, 2019

@frankbu I am no Istio performance tuning wizard (though I'm learning a lot these days, not least thanks to the feedback here!), but would it be possible to strive for both, or at least to strike a good balance? I'm arguing mostly from the user experience point of view.

@mandarjog changed the title from "Provide optimized values for istio-demo.yaml or an alternative istio.yaml for benchmarking" to "Benchmark uses istio-demo.yaml which is not suitable for performance testing" on May 22, 2019
@frankbu

frankbu commented May 24, 2019

I think we still need a better solution to this problem, but I added a warning in the quick-start instructions for now. istio/istio.io#4220

@t-lo
Member

t-lo commented May 27, 2019

@mandarjog I do not think the edit-war tendencies you're displaying in this issue are helpful for resolving the points raised here. Please have a look at the remarks @howardjohn made about our "tuned" set-up.

The title you have set does not reflect the fact that we used an optimized version instead of the default settings to run the "istio tuned" tests. The title in its current phrasing is misleading. Please consider fixing your wording.

@t-lo
Member

t-lo commented May 28, 2019

Update

Please pardon the long silence on this issue. Your feedback is very important to us; however, we were rather wrapped up in KubeCon-related action items last week. We did manage to execute a number of benchmark runs during KubeCon with configurations updated based on the discussion above; please see our findings below.

We updated Istio's configuration in accordance with the concrete issues raised above, incorporating feedback from @mandarjog as well as points raised by @howardjohn. More specifically, we significantly increased the remaining CPU limits in the configuration. For the time being, please find our updated configuration in a gist - and please feel invited to provide feedback on that configuration, as we plan to eventually merge the updated settings into istio-psp-tuned.patch.

Please note that we asked for a full Istio configuration more than once in the discussion above but did not arrive at a usable result. We require a static configuration (instead of something generated by scripts) so people can reproduce our set-up. Using a community-provided configuration for the tests below would have been our preference.

500rps

With the above configuration, we observed a speed-up from 7s to 3s in the 100th percentile of the 500 RPS benchmark (2 samples, one cluster only). Bare is at 7ms here; Linkerd is between 980ms and 2s. "Istio-tuned" uses our previous configuration as used for the blog post, while "Istio-tuned w/ increased CPU limits" uses the above configuration from the gist, which incorporates your feedback.

Latency Distribution (two samples per configuration):

Percentile | bare             | Linkerd             | Istio-tuned         | Istio-tuned w/ increased CPU limits
50.000%    | 1.83ms , 1.79ms  | 3.97ms , 4.00ms     | 3.77ms , 3.70ms     | 3.62ms , 3.62ms
75.000%    | 2.21ms , 2.14ms  | 4.59ms , 4.61ms     | 4.44ms , 4.39ms     | 4.22ms , 4.20ms
90.000%    | 2.52ms , 2.48ms  | 5.20ms , 5.22ms     | 5.55ms , 5.72ms     | 4.86ms , 4.82ms
99.000%    | 3.20ms , 3.17ms  | 6.60ms , 6.60ms     | 673.28ms , 635.90ms | 176.90ms , 385.79ms
99.900%    | 3.87ms , 3.85ms  | 454.40ms , 473.86ms | 1.71s , 1.51s       | 935.93ms , 953.34ms
99.990%    | 4.54ms , 4.58ms  | 932.35ms , 943.61ms | 3.07s , 2.98s       | 1.81s , 2.12s
99.999%    | 5.40ms , 5.30ms  | 963.07ms , 1.05s    | 6.66s , 6.18s       | 2.90s , 2.92s
100.000%   | 6.43ms , 6.51ms  | 975.87ms , 1.88s    | 7.05s , 7.00s       | 3.02s , 3.02s

600rps

In the "overload" scenario of 600RPS we don't see much of a change in Istio latency. It has been rightfully called out that in this scenario, Istio users would have long scaled out and would not experience latency in the minutes range. An alternative way of looking at this though is how much further Linkerd can go before users need to spend money on more / faster hardware.

Latency Distribution (two samples per configuration):

Percentile | bare               | Linkerd             | Istio-tuned     | Istio-tuned w/ increased CPU limits
50.000%    | 1.86ms , 1.80ms    | 4.06ms , 3.99ms     | 22.07s , 16.24s | 15.17s , 25.20s
75.000%    | 2.20ms , 2.15ms    | 4.69ms , 4.61ms     | 1.53m , 1.31m   | 1.16m , 1.17m
90.000%    | 2.48ms , 2.44ms    | 5.32ms , 5.26ms     | 2.78m , 2.66m   | 2.25m , 2.29m
99.000%    | 3.19ms , 3.14ms    | 6.98ms , 7.07ms     | 5.52m , 4.76m   | 4.17m , 4.55m
99.900%    | 3.86ms , 3.80ms    | 830.46ms , 832.51ms | 7.05m , 6.02m   | 5.37m , 5.91m
99.990%    | 4.60ms , 4.48ms    | 988.67ms , 985.09ms | 7.48m , 6.34m   | 5.62m , 6.14m
99.999%    | 78.33ms , 5.56ms   | 2.62s , 2.59s       | 7.54m , 6.41m   | 5.67m , 6.18m
100.000%   | 946.69ms , 8.38ms  | 3.01s , 2.99s       | 7.55m , 6.42m   | 5.68m , 6.19m

Conclusion

The above values, observed with the "Istio tuned" configuration with CPU caps increased, do not, from our point of view, justify re-running the full set of benchmark tests (multiple clusters, etc.). We stand by our findings from the blog post, and we remain very curious ourselves about what causes this level of latency in overload situations.

Also, we observed a massive increase in the CPU usage of Istio's proxy sidecar, which is somewhat expected when removing CPU limits. We observed proxy CPU utilization up to 3x higher than that of the actual application behind the proxy. Finding the right balance between CPU limits and latency looks like an interesting future benchmark target on its own.

@mandarjog Please have a look at our updated Istio configuration with CPU limits increased, and let us know whether it addresses the concerns that made you file this GitHub issue.

@ejc3

ejc3 commented May 29, 2019

Were you able to turn off full tracing, policy, and telemetry in your updated test? I didn't see those explicitly turned off in a quick glance at your gist.

@t-lo
Member

t-lo commented May 29, 2019

@ejc3 We are aiming for feature parity with Linkerd in the "tuned" configuration, so while Mixer's Policy feature is disabled, we leave Telemetry switched on. We understand that this is currently described incorrectly in our blog post (sorry!); we are reviewing an update to the blog post that calls out that, among other things, Telemetry remains active.

@ahrkrak
Member

ahrkrak commented May 30, 2019

The blog post update that @t-lo refers to has now been posted. Please continue to share any feedback; we expect to do more on this, always publishing repeatable, verifiable test scenarios.

@indrayam

@howardjohn and @t-lo This has been such a fascinating back and forth and super helpful. Thank you.

@t-lo You said in your detailed summary comment above:

An alternative way of looking at this though is how much further Linkerd can go before users need to spend money on more / faster hardware.

I am still struggling to understand why Linkerd goes further. Is it just designed better, does it have fewer features turned on (an apples-to-oranges comparison), or a bit of both? After reading through every single comment here, it does not seem like this is a bad-test-setup issue.

Thoughts?

@ejc3

ejc3 commented Jun 3, 2019

I still think the blog post is misleading in showing latency going up to hundreds of seconds at high load. Once the system falls over and exceeds the SLA you have set, you should stop the test. Then you can report that LinkerD supports X RPS and Istio supports Y RPS at the specified SLA, rather than saying that latency jumps to hundreds of seconds at 600 RPS.

@t-lo
Member

t-lo commented Nov 24, 2020

Fixed in the Lokomotive component, which uses the Istio Helm chart.

@t-lo t-lo closed this as completed Nov 24, 2020