Benchmark uses istio-demo.yaml which is not suitable for performance testing #5
Hello @mandarjog,

To address your questions:

We benchmarked against "stock" and "tuned"; "stock" uses the `istio-demo.yaml` you refer to. Please find the respective patches to `istio-demo.yaml` in our repository.
@mandarjog Please let me know whether the reference in my response above provides the information you are looking for, so we can close this issue.
Even without testing against the same application you used, how can the latency possibly be in the MINUTES range? That is unreasonably slow and suggests something may have gone wrong in the test. I have done a lot of benchmarks on Istio (and linkerd), and even in the very worst scenarios neither adds more than 30ms to the p99s. If you said Istio was a few ms worse than linkerd, then maybe it could be from a difference in testing, but to have minute-long latencies is crazy.
@howardjohn What load generator / latency measurement were you using? When you measured latency, did you take Coordinated Omission into account? If you did not, then your results are overly optimistic and do not reflect real-life user experience.
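For readers unfamiliar with the term: coordinated omission is when a closed-loop load generator only sends the next request after the previous response arrives, so a server stall suppresses exactly the samples that would have recorded the delay. A minimal sketch of the difference (illustrative numbers only, not from either benchmark):

```python
# Coordinated omission: the generator stalls behind a slow response, so the
# queueing delay an on-schedule user would have seen never gets sampled.

interval = 0.1               # intended schedule: one request every 100 ms
send    = [0.0, 1.00, 1.05]  # actual send times: generator stuck behind req 0
respond = [1.0, 1.05, 1.10]  # response arrival times

for i, (s, r) in enumerate(zip(send, respond)):
    uncorrected = r - s              # what a naive closed-loop generator records
    corrected   = r - i * interval   # measured against the intended schedule
    print(f"req {i}: uncorrected {uncorrected:.2f}s, corrected {corrected:.2f}s")
# req 0: uncorrected 1.00s, corrected 1.00s
# req 1: uncorrected 0.05s, corrected 0.95s
# req 2: uncorrected 0.05s, corrected 0.90s
```

wrk2 applies essentially this correction, measuring each response against the time the request should have been sent rather than the time the stalled generator managed to send it.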
We usually use fortio but have also used wrk2 (which does take coordinated omission into account, as the blog mentions). I don't think coordinated omission would account for minute latencies; I don't think there is a web server in the world that is that slow. Also, I am surprised Istio isn't able to hit 600 RPS; we send 5k RPS through it pretty regularly without issues. It seems there are two possibilities:
I get that it may seem like we are just being defensive and blaming your tests because they make Istio look bad, but this isn't our intent. We put a lot of time and effort into making Istio as fast and scalable as possible, so when we see results that differ from what our tests show and what others show (such as https://medium.com/@michael_87395/benchmarking-istio-linkerd-cpu-at-scale-5f2cfc97c7fa), we want to figure out why.
@howardjohn Thank you for clarifying. I fully understand why the Istio results may raise questions and concerns. We at Kinvolk share your interest in better understanding the reasons behind those results; as mentioned in the blog article, we are currently doing client work on Istio (the objective of that work is not related to Istio performance). I'm happy to assist if you would like to reproduce our benchmark results; we believe we released everything you need to do this. We're at KubeCon Barcelona, by the way - if you're around, we could discuss in person.
Thanks @t-lo, I am going to try out your test setup.
Ok, I found some time to try to reproduce your setup. First, I just wanted to say thanks for conducting a pretty thorough and reproducible test; it is always good to see new numbers. Some notes:

I was able to reproduce your results -- I only looked at Istio latency though, not resource usage or linkerd. With your setup, I saw p99 latencies around 50s "corrected" and 50ms uncorrected, roughly similar to your numbers.

Next I did the exact same test but just used the default Istio install. I got these results (note these are the numbers corrected for coordinated omission) at 600 QPS:
Next I made one change, turning off mixer with
Finally, I tested with Istio off:
So in the end we see Istio adding ~3ms to the p50 and ~3ms to the p99, which is pretty much in line with https://istio.io/docs/concepts/performance-and-scalability/. These numbers were pretty quick and dirty (I ran for 2min instead of 30min, only took one sample, etc.), but they represent numbers closer to what we expect to see.
Thanks @howardjohn for reproducing the results.
Telemetry limits and requests are different:

## extensions/v1beta1::Deployment::istio-telemetry

```diff
--- /Users/mjog/tmp/perf2/istio.yaml
+++ /Users/mjog/tmp/perf2/kinvolk_istio.yaml
@@ -96,11 +96,9 @@
         - containerPort: 42422
           resources:
             limits:
-              cpu: 4800m
-              memory: 4G
+              cpu: 100m
             requests:
-              cpu: 1000m
-              memory: 1G
+              cpu: 50m
           volumeMounts:
           - mountPath: /etc/certs
             name: istio-certs
@@ -136,7 +134,7 @@
```

For the proxy it is a bit hard to read, but:

## v1::ConfigMap::istio-sidecar-injector

```diff
--- /Users/mjog/tmp/perf2/istio.yaml
+++ /Users/mjog/tmp/perf2/kinvolk_istio.yaml
@@ -73,21 +72,21 @@
     \ cpu: \"[[ index .ObjectMeta.Annotations `sidecar.istio.io/proxyCPU` ]]\"\
     \n [[ end ]]\n [[ if (isset .ObjectMeta.Annotations `sidecar.istio.io/proxyMemory`)\
     \ -]]\n memory: \"[[ index .ObjectMeta.Annotations `sidecar.istio.io/proxyMemory`\
-    \ ]]\"\n [[ end ]]\n [[ else -]]\n limits:\n cpu: 2000m\n \
-    \ memory: 1024Mi\n requests:\n cpu: 100m\n memory: 128Mi\n \
-    \n [[ end -]]\n volumeMounts:\n [[- if (isset .ObjectMeta.Annotations `sidecar.istio.io/bootstrapOverride`)\
+    \ ]]\"\n [[ end ]]\n [[ else -]]\n limits:\n cpu: 250m\n requests:\n\
+    \ cpu: 10m\n \n [[ end -]]\n volumeMounts:\n [[- if (isset .ObjectMeta.Annotations\
+    \ `sidecar.istio.io/bootstrapOverride`)
```

In plain terms: the stock sidecar injector sets limits of cpu: 2000m / memory: 1024Mi with requests of cpu: 100m / memory: 128Mi, while the kinvolk configuration caps the sidecar at cpu: 250m with a cpu: 10m request. In a CPU-constrained environment, the latency numbers are not reliable.
@howardjohn Thank you very much for taking the time and investing the effort to reproduce our results. It is reassuring that you have been able to reproduce the massive latency Istio displayed in our benchmark results when being overloaded - admittedly, we were rather suspicious of the data too, but we consistently reproduced the results across many benchmark runs.

In order for others to reproduce your own tests, could you please share the full configuration of your cluster set-up? Specifically, it would be very helpful if you could provide access to the deployment configuration (the full Istio yaml you used).

I'd like to repeat at this point that we fully understand that the benchmark report raised concerns in the Istio community (which, by the way, we consider ourselves part of, since, as also mentioned above, we are currently tasked with Istio development work). Regarding your feedback:
We explicitly call out in the blog post that we benchmarked both the "stock" Istio experience users get when following the Istio evaluation instructions and a tuned version that aims to retain feature parity with Linkerd.
Could you please provide a respective configuration?
This is correct - the motivation here is to use a tuned Istio that retains feature parity with Linkerd. We wanted a fair comparison. I am sorry we did not call out this motivation explicitly in our blog post.
Please have a look at
I fully agree, and we do not consider the 600 RPS case something Istio users would sustain. However, Linkerd clusters are still operational at that rate.

Regarding your remarks on Coordinated Omission, I would like to better understand the technical reasons why you believe this way of measuring latency is incorrect. I believe we have stated our reasons for taking it into account in the blog post. I'm sorry to be blunt, but your statement reads a bit like "I don't like this because I don't like the benchmark results".
Istio users would need to scale out earlier and would thus generate significantly higher cost. I do agree, however, that we should call out more prominently in the blog post that the Istio cluster became inoperable at that rate.

@mandarjog What files did you generate your diff from? The patch matches neither our stock Istio nor our tuned Istio set-up.
I strongly recommend you follow https://istio.io/docs/setup/kubernetes/install/helm/#option-1-install-with-helm-via-helm-template, which is our (only?) supported way to install for prod/performance-testing usages. Using this, in my first test I just used the default settings, and in my second I just turned off mixer.

For the values, something like this is a very minimally tuned set of settings that roughly matches what you used:

```yaml
mixer:
  telemetry:
    enabled: false
  policy:
    enabled: false
grafana:
  enabled: true
gateways:
  enabled: false
global:
  mtls:
    enabled: true
  enableTracing: false
  proxy:
    tracer: ""
```
This is a great goal, but I don't think it quite met that goal. By the way, I think from this (and some other conversations) it is clear that we need to improve our docs - it should be easy to get a good setup running.

For the "stock" Istio experience, this is the default installation, NOT the demo. For the "tuned" version, it is actually really good and thorough - you have most of the optimizations we would make if we wanted to maximize performance. Unfortunately, since it was based on the demo, you missed a few parts:
Just a note: 600 RPS should be no problem for Istio or linkerd; we easily see 5-10k throughput on Istio (linkerd is pretty similar as well). The reason we couldn't hit 600 RPS here is due to the CPU limits.

On Coordinated Omission, I am not an expert on this so maybe I am missing something, but here are my thoughts: consider a strawman service A, which has 1ms latency but can handle 10 RPS, and service B, which has 100ms latency but can handle 1000 RPS. If we were to benchmark these at 10 RPS, service A would have a clearly better latency. Yet if we ran the same test at 20 RPS, service B would have a better latency. Clearly, the expected result here is that service A has better latency, not service B. If we want to compare throughput, it would be more accurate to perform a separate test for that. If you want to measure "the latency of a system when its CPU is heavily throttled", then the results are valid, I guess, but I would argue that is not a useful stat.

My understanding is that Coordinated Omission is meant to correct for latency variance -- for example, if a single request happened to take longer than others, that is accounted for. It is not meant to correct for a service that is sent more load than it can handle for an extended period of time. Does that make sense?

Anyhow, like I said, we clearly need to make some improvements on our UX side. I think Istio performance is pretty good when set up right, but we need to make it harder to set it up wrong, so thanks for bringing all of this up.
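To make the strawman above concrete, here is a toy queueing sketch (the FIFO model, function name, and numbers are illustrative assumptions, not either benchmark's actual methodology). It shows why schedule-corrected latency climbs to minutes for a service under sustained overload, even though every request, once served, still completes quickly:

```python
def simulate(offered_rps, capacity_rps, service_latency, duration):
    """Toy FIFO server: it can start at most capacity_rps requests per second,
    and each request completes service_latency seconds after it starts."""
    corrected, uncorrected = [], []
    next_free = 0.0                       # earliest time the server can start new work
    for i in range(int(offered_rps * duration)):
        intended = i / offered_rps        # when the request should be sent
        start = max(intended, next_free)  # queued while the server is saturated
        done = start + service_latency
        next_free = start + 1.0 / capacity_rps
        uncorrected.append(done - start)  # queueing delay is invisible here
        corrected.append(done - intended) # what an on-schedule user actually waits
    return max(corrected), max(uncorrected)

# Service A (1 ms latency, 10 RPS capacity) offered 20 RPS for a 2-minute run:
print(simulate(20, 10, 0.001, 120))   # ~(120.0, 0.001): corrected worst case is minutes
# Service B (100 ms latency, 1000 RPS capacity) at the same 20 RPS:
print(simulate(20, 1000, 0.1, 120))   # ~(0.1, 0.1): no queue, both views agree
```

Under this model the "corrected" number is dominated by queueing once the offered rate exceeds capacity, and it grows with test duration - consistent with the argument above that it then measures sustained saturation rather than per-request latency.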
@howardjohn Thank you for the thorough response! I'm kind of in a hurry right now (KubeCon is on!) and will get back to the major items later - however, could you please provide a link to the Istio yaml you were using when running your benchmarks? This would be really helpful for us in reproducing your results, and it's a lot clearer than just describing the steps we should take to arrive at that file ourselves (though I fully understand why you point out those steps, too).
Here is the istio.yaml, with the caveat: please don't use this. It is hard/impossible to modify the settings if you just use a rendered template like this. We strongly recommend you use Helm as per the instructions here: https://istio.io/docs/setup/kubernetes/install/helm/#option-1-install-with-helm-via-helm-template. This makes it easier to share the settings you used and to modify them -- it's hard to understand everything from a 16,000-line yaml file.
@mandarjog, @howardjohn The docs do have a warning about the demo configuration not being suitable for performance evaluation: https://istio.io/docs/setup/kubernetes/additional-setup/config-profiles/

Do you think we should mention this in more places? I don't know why people often seem to miss things that I think are clearly documented :(
@howardjohn I am looking for a configuration that has fixed values, for use in this benchmark repo. We need a reproducible way of setting up Istio; re-generating configurations will create varying results, which is not acceptable for a reproducible benchmark. I have modified our configuration to remove the CPU limits as you raised in #5 (comment)

I have used the values provided by @mandarjog in #5 (comment) (telemetry request/limit raised from 50m/100m to 1000m/4800m; proxy request/limit raised from 10m/250m to 100m/2000m). So far I was unable to detect a major improvement, but I merely ran a few ad-hoc tests.

@mandarjog I took the liberty of updating the issue title so we can iterate and close on this. Please help drive this forward.

@frankbu That's an interesting perspective. The Istio website discusses Evaluation: https://istio.io/docs/setup/kubernetes/install/kubernetes/ Could we not, alternatively to your proposal, provide an evaluation configuration that does not have these limitations? This would be less confusing and more helpful to Istio users at the same time.
I won't be available to work on this issue for the rest of the week. To sum up:
@t-lo I think the confusion may be because we did not make it clear what kind of "evaluation" we mean. The demo configuration is intended for evaluating Istio's features, not its performance.
@frankbu I am no Istio performance tuning wizard (though I'm learning a lot these days, not least thanks to the feedback here!), but would it be possible to strive for both, or at least to strike a good balance? I'm arguing mostly from the user experience point of view.
I think we still need a better solution to this problem, but I added a warning in the quick-start instructions for now: istio/istio.io#4220
@mandarjog I do not think the edit-war tendencies you're displaying in this issue are helpful in resolving the points raised here. Please have a look at the remarks @howardjohn made about our "tuned" set-up. The title you have set does not reflect the fact that we used an optimized version instead of the default settings to run the "Istio tuned" tests. The title in its current phrasing is misleading. Please consider fixing the wording.
## Update

Please pardon the long silence on this issue. Your feedback is very important to us; however, we've been rather wound up in KubeCon-related action items last week.

We did manage to execute a number of benchmark runs during KubeCon with updated configurations from the discussion above; please see our findings below. We updated Istio's configuration in accordance with the concrete issues raised above, incorporating feedback from @mandarjog as well as issues raised by @howardjohn. More specifically, we significantly increased any remaining CPU limits in the configuration. For the time being, please find our updated configuration in a gist - and please feel invited to provide feedback on that configuration, as we plan to eventually merge the updated settings into our benchmark repository.

Please note that we asked for a full Istio configuration more than once in the discussion above but did not arrive at a usable result. We require a static config (instead of something generated by scripts) so people can reproduce our set-up. Using a community-provided configuration for the tests below would have been our preference.

## 500 RPS

With the above configuration, we observed a speed-up from 7s to 3s in the 100th percentile in the 500 RPS benchmark (2 samples, one cluster only). Bare is at 7ms here; Linkerd is between 980ms and 2s. "Istio-tuned" uses our previous configuration as used for the blog post, while "Istio-tuned w/ increased CPU limits" uses the above configuration from the gist, which incorporates your feedback.
## 600 RPS

In the "overload" scenario of 600 RPS we don't see much of a change in Istio latency. It has rightfully been called out that in this scenario, Istio users would long since have scaled out and would not experience latency in the minutes range. An alternative way of looking at this, though, is how much further Linkerd can go before users need to spend money on more / faster hardware.
## Conclusion

The above values, observed with the "Istio tuned" configuration w/ CPU caps increased, do not, from our point of view, justify re-running the full set of benchmark tests (multiple clusters, etc.). We stand by our findings from the blog post, and we remain very curious ourselves about what causes this level of latency in overload situations.

Also, we observed a massive increase in the CPU usage of Istio's proxy sidecar, which is somewhat expected when removing CPU limits: we observed proxy CPU utilization up to 3x higher than what the actual application behind the proxy was using. Finding the right balance between CPU limits and latency looks like an interesting future benchmark target on its own.

@mandarjog Please have a look at our updated Istio configuration w/ CPU limits increased and let us know whether it addresses the concerns that made you file this GitHub issue.
Were you able to turn off full tracing, policy and telemetry in your updated test? I didn't see those explicitly off in a quick glance of your gist.
@ejc3 We are aiming for feature parity with Linkerd in the "tuned" configuration, so while Mixer's Policy feature is disabled, we leave Telemetry switched on. We understand that this is currently described incorrectly in our blog post (sorry!); we are reviewing an update to the blog post that calls out that - among other things - Telemetry remains active.
The blog post update that @t-lo refers to has now been posted. Please continue to share any feedback; we expect to be doing more on this - always publishing repeatable, verifiable test scenarios.
@howardjohn and @t-lo This has been such a fascinating back and forth, and super helpful. Thank you. @t-lo You said in your detailed summary comment above:

> An alternative way of looking at this, though, is how much further Linkerd can go before users need to spend money on more / faster hardware.

I am still struggling to understand why Linkerd goes farther. Is it designed better, does it have fewer features turned on (an apples-to-oranges comparison), or a bit of both? After reading through every single comment here, it does not seem like a bad-test-setup issue. Thoughts?
I still think the blog post is misleading in showing latency going up to hundreds of seconds at high load. Once the system falls over and exceeds the SLA you have set, you should stop the test. Then you can report that LinkerD supports X RPS and Istio supports Y RPS at the specified SLA, versus saying LinkerD's latency jumps to hundreds of seconds at 600 RPS.
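One way to operationalize this suggestion is a capacity search bounded by the SLA. A hypothetical sketch (`measure_p99` stands in for a run of whichever load generator is used; the bounds and tolerance are made up):

```python
def max_rps_within_sla(measure_p99, sla_seconds, lo=0, hi=5000, tolerance=25):
    """Binary-search the highest request rate whose measured p99 latency stays
    within the SLA, instead of reporting latencies from overloaded runs."""
    while hi - lo > tolerance:
        mid = (lo + hi) // 2
        if measure_p99(mid) <= sla_seconds:
            lo = mid   # SLA met at this rate: try higher
        else:
            hi = mid   # SLA exceeded: back off
    return lo          # report "supports ~lo RPS at the given SLA"
```

The headline result then becomes "X RPS at the chosen p99 SLA" for each mesh, which is comparable across runs and does not depend on how far past saturation the test was pushed.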
Fixed in the Lokomotive component, which uses the Istio Helm chart.
The latency numbers published for Istio are way too high. They do not agree with the numbers that we collect, which are published on istio.io.

If you also check in or share the generated istio.yaml file, it would let us inspect it closely.

You are also applying istio-demo.yaml, which is not meant for performance testing, so I would like to see the contents of that file after your patching operations.