Tempo - Review performance #6844
Any comparison with Jaeger?
I made this table so we can compare different calls. This is the bookinfo demo in minikube.
The Tempo resources were increased:
Wow, Tempo is significantly slower. Thanks for the table comparison!
It is also true that we are processing the traces and transforming them into the Jaeger format for internal use. So, in the future, maybe we can change the internals to use the OTel format. I think this will require some effort.
I was reviewing the performance, and I see (from the graph view) that there are many concurrent calls to the Tracing API with just one click on the graph:
https://github.com/kiali/kiali/blob/master/business/tracing.go#L141 The limit is 1; I'm not sure how optimal this code is in this example (the time range is just 1m). Probably, with a large time range, it could help to get a wider range of traces, based on the number of results. It also looks like the queries are duplicated with the cluster tag.
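For illustration, a minimal Go sketch of the fan-out pattern described above: one click on the graph triggers one traces query per graph node, each in its own goroutine. All names here (TracingQuery, appQuery, the service list) are hypothetical stand-ins, not the actual Kiali code linked above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TracingQuery is a hypothetical stand-in for the per-node query parameters.
type TracingQuery struct {
	Start, End time.Time
	Service    string
	Limit      int // the observed queries use a limit of 1
}

// appQuery is a placeholder for the real HTTP call to the tracing backend.
func appQuery(q TracingQuery) string {
	return fmt.Sprintf("traces for %s (limit=%d)", q.Service, q.Limit)
}

func main() {
	services := []string{"productpage", "reviews", "ratings", "details"}
	end := time.Now()
	start := end.Add(-1 * time.Minute) // the 1m range mentioned above

	var wg sync.WaitGroup
	for _, svc := range services {
		wg.Add(1)
		go func(svc string) { // one concurrent backend call per graph node
			defer wg.Done()
			fmt.Println(appQuery(TracingQuery{Start: start, End: end, Service: svc, Limit: 1}))
		}(svc)
	}
	wg.Wait()
}
```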
For Tempo, there are many requests, as it has an API to check the tags, and it is possible to get this information in advance.
11 queries are done concurrently before the pod died (this is one click on a node in the Graph plus the Traces tab):
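For context, the extra tag requests mentioned above go against Tempo's search-tags endpoint (GET /api/search/tags). A small sketch; the base URL is an assumption for a local dev setup.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Assumed local Tempo instance on its default HTTP port.
	resp, err := http.Get("http://localhost:3200/api/search/tags")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Tempo returns the discovered tag names as {"tagNames": ["...", ...]}.
	var tags struct {
		TagNames []string `json:"tagNames"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&tags); err != nil {
		panic(err)
	}
	fmt.Println(tags.TagNames)
}
```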
The same happens with a single query when Tempo has collected many traces. Increasing the resources to the following:
It still has some issues (increasing the time range to the last 3h, the pod is killed by the OOM).
But still, it looks like a large amount of resources for a dev environment. cURL query:
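The original cURL query is not preserved in this thread. As a hedged sketch only, a single search request of the kind described might look like this in Go against Tempo's /api/search endpoint; the base URL, TraceQL expression, and limit are illustrative assumptions, not the query from the comment.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	end := time.Now()
	start := end.Add(-3 * time.Hour) // the 3h range that triggered the OOM kill

	params := url.Values{}
	params.Set("q", `{ resource.service.name = "productpage" }`) // illustrative TraceQL
	params.Set("start", fmt.Sprint(start.Unix()))
	params.Set("end", fmt.Sprint(end.Unix()))
	params.Set("limit", "100") // a large result set is what stresses the pod

	resp, err := http.Get("http://localhost:3200/api/search?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes")
}
```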
Sorry, I was not keeping up with this discussion. So, one request from the graph results in many concurrent backend calls to the Tempo API? What determines the number of concurrent calls, the number of spans? And are you saying that if the number of gofuncs is too high, the pod may crash?
Yes, the Tempo pod might crash, killed by OOM. And that is right: one request from the graph might result in 10 concurrent calls to the backend, to spread the query through the time interval and get a sample from different times: https://github.com/kiali/kiali/blob/master/business/tracing.go#L237 (for example, if we want 15 traces from the last hour, it would return a trace for each minute, instead of returning all the traces from the last 10 minutes). This is done for both Tempo and Jaeger, but I haven't seen any issue with Jaeger. The number of concurrent calls is a fixed value in the function: https://github.com/kiali/kiali/blob/master/business/tracing.go#L237 ; it is set to 10. The number of spans is independent of the concurrent calls, but it is another point of improvement in the query for Tempo. Jaeger returns all the spans, while by default Tempo limits the number of spans to 3, because it has a higher cost. A sketch of this interval splitting follows.
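A minimal sketch of the interval-splitting technique described above: the full time range is divided into a fixed number of buckets (10, matching the linked code) and one query runs per bucket concurrently, so the sampled traces are spread across the whole range. queryBucket is a hypothetical stand-in for the real per-bucket call to Jaeger/Tempo.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const concurrentQueries = 10 // fixed value, per the comment above

// queryBucket is a placeholder for the real per-bucket backend call.
func queryBucket(start, end time.Time, limit int) []string {
	return []string{fmt.Sprintf("traces in [%s, %s], limit=%d",
		start.Format("15:04"), end.Format("15:04"), limit)}
}

func main() {
	end := time.Now()
	start := end.Add(-1 * time.Hour) // e.g. "15 traces from the last hour"
	totalLimit := 15

	bucket := end.Sub(start) / concurrentQueries
	perBucket := totalLimit/concurrentQueries + 1

	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results []string
	)
	for i := 0; i < concurrentQueries; i++ {
		wg.Add(1)
		go func(i int) { // one goroutine per time bucket
			defer wg.Done()
			bStart := start.Add(time.Duration(i) * bucket)
			traces := queryBucket(bStart, bStart.Add(bucket), perBucket)
			mu.Lock()
			results = append(results, traces...)
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	fmt.Println(len(results), "sampled traces spread across the hour")
}
```

Spreading the limit over buckets this way trades one large query for several small ones, which is also why a single click can fan out into enough concurrent Tempo queries to pressure the pod's memory.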
Based on these results:
Based on this analysis, I think this issue can be closed, highlighting some points of improvement in case they are required:
It looks like, when doing some queries that return a large number of results, the frontend-query pod dies and Kiali returns a 503 error.
There are not many of these calls, maybe 2 or 3.