Tempo - Review performance #6844

Closed · Tracked by #6538
josunect opened this issue Nov 14, 2023 · 15 comments

Assignees: josunect
Labels: backlog (Triaged Issue added to backlog), bug (Something isn't working)

@josunect
Contributor

It looks like, when running queries that return a large number of results, the frontend-query pod dies and Kiali returns a 503 error.

There are not many of these calls - maybe 2 or 3.

josunect added the bug label Nov 14, 2023
josunect self-assigned this Nov 29, 2023
@josunect
Contributor Author

josunect commented Nov 29, 2023

Tested the deployment of Tempo with Tanka on kind with 2 nodes, and the following resources:

[screenshot: resource configuration]

It takes a bit (1.94 seconds), but it works for the maximum range of 7 days:
[screenshot: query result]

@jshaughn
Collaborator

Any comparison with Jaeger?

@josunect
Contributor Author

josunect commented Nov 30, 2023

I made this table so we can compare different calls. This is with the bookinfo demo in minikube.

|  | Jaeger | Tempo | cURL (Jaeger) | cURL (Tempo) |
|---|---|---|---|---|
| Traces 1m | 104 ms | 370 ms | 66.522 ms | 54.215 ms |
| Traces 10m | 195 ms | 1.26 s | - | - |
| Traces 7d | 155 ms | 712 ms | 167.162 ms | 314.332 ms |
| Traces 7d gRPC | 219 ms | 365 ms | - | - |
| Get Trace | 10 ms | 16 ms | 4.772 ms | 10.129 ms |
| Traces (Graph) | 47 ms | 458 ms | - | - |

The Tempo resources were increased:

  resources:
    total:
      limits:
        memory: 8Gi
        cpu: 4000m
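
For reference, the cURL columns above are single-request timings (the same kind of measurement as the `curl -w '%{time_total}'` commands shown later in this thread). Below is a minimal Go sketch, not part of Kiali, that reproduces that kind of measurement against the Jaeger and Tempo endpoints used in this issue; the localhost ports assume the same port-forwards, and the query parameters are placeholder values taken from the thread:

```go
// Minimal sketch (not Kiali code): time a single HTTP GET against the Jaeger and
// Tempo query endpoints, similar to curl -w '%{time_total}'.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func timeGet(url string) (time.Duration, error) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	// Drain the body so the measurement includes the full response transfer.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	urls := []string{
		// Jaeger traces search (service/limit/start/end values are placeholders).
		"http://localhost:16686/jaeger/api/traces?service=productpage.bookinfo&limit=100&start=1703177199000000&end=1703177208000000",
		// Tempo TraceQL search (same shape as the queries logged in this issue).
		"http://localhost:3200/api/search?q=%7B+.service.name+%3D+%22productpage.bookinfo%22+%7D&limit=100&spss=10&start=1703179679&end=1703179696",
	}
	for _, u := range urls {
		elapsed, err := timeGet(u)
		if err != nil {
			fmt.Println(u, "error:", err)
			continue
		}
		fmt.Printf("%s -> %v\n", u, elapsed)
	}
}
```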

@jshaughn
Collaborator

Wow, Tempo is significantly slower. Thanks for the table comparison!

@josunect
Contributor Author

josunect commented Dec 5, 2023

> Wow, Tempo is significantly slower. Thanks for the table comparison!

It is also true that we are processing the traces and transforming them into the Jaeger format to be used internally. So, in the future, maybe we can switch internally to the OTel format. I think this will require some effort.

jshaughn added the backlog label Dec 6, 2023
@josunect
Contributor Author

I was reviewing the performance, and I see (from the graph view) that there are many concurrent calls to the Tracing API with just one click on the graph:

2023-12-13T18:13:50Z INF [2023-12-13 18:12:46 - 2023-12-13 18:13:50] http://localhost:16686/jaeger/api/traces?end=1702491230000000&limit=15&service=reviews.bookinfo&start=1702491166000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:12:46 - 2023-12-13 18:13:50] http://localhost:16686/jaeger/api/traces?end=1702491230000000&limit=15&service=reviews.bookinfo&start=1702491166000000
2023-12-13T18:13:50Z INF [2023-12-13 18:12:58 - 2023-12-13 18:13:05] http://localhost:16686/jaeger/api/traces?end=1702491185000000&limit=1&service=reviews.bookinfo&start=1702491178000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:13:18 - 2023-12-13 18:13:24] http://localhost:16686/jaeger/api/traces?end=1702491204000000&limit=1&service=reviews.bookinfo&start=1702491198000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:13:37 - 2023-12-13 18:13:43] http://localhost:16686/jaeger/api/traces?end=1702491223000000&limit=1&service=reviews.bookinfo&start=1702491217000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:12:52 - 2023-12-13 18:12:58] http://localhost:16686/jaeger/api/traces?end=1702491178000000&limit=1&service=reviews.bookinfo&start=1702491172000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:13:24 - 2023-12-13 18:13:30] http://localhost:16686/jaeger/api/traces?end=1702491210000000&limit=1&service=reviews.bookinfo&start=1702491204000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:13:05 - 2023-12-13 18:13:11] http://localhost:16686/jaeger/api/traces?end=1702491191000000&limit=1&service=reviews.bookinfo&start=1702491185000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:13:30 - 2023-12-13 18:13:37] http://localhost:16686/jaeger/api/traces?end=1702491217000000&limit=1&service=reviews.bookinfo&start=1702491210000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:12:46 - 2023-12-13 18:12:52] http://localhost:16686/jaeger/api/traces?end=1702491172000000&limit=1&service=reviews.bookinfo&start=1702491166000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:13:11 - 2023-12-13 18:13:18] http://localhost:16686/jaeger/api/traces?end=1702491198000000&limit=1&service=reviews.bookinfo&start=1702491191000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:13:43 - 2023-12-13 18:13:50] http://localhost:16686/jaeger/api/traces?end=1702491230000000&limit=1&service=reviews.bookinfo&start=1702491223000000&tags=%7B%22cluster%22%3A%22Kubernetes%22%7D
2023-12-13T18:13:50Z INF [2023-12-13 18:12:58 - 2023-12-13 18:13:05] http://localhost:16686/jaeger/api/traces?end=1702491185000000&limit=1&service=reviews.bookinfo&start=1702491178000000
2023-12-13T18:13:50Z INF [2023-12-13 18:13:18 - 2023-12-13 18:13:24] http://localhost:16686/jaeger/api/traces?end=1702491204000000&limit=1&service=reviews.bookinfo&start=1702491198000000
2023-12-13T18:13:50Z INF [2023-12-13 18:13:11 - 2023-12-13 18:13:18] http://localhost:16686/jaeger/api/traces?end=1702491198000000&limit=1&service=reviews.bookinfo&start=1702491191000000
2023-12-13T18:13:50Z INF [2023-12-13 18:13:37 - 2023-12-13 18:13:43] http://localhost:16686/jaeger/api/traces?end=1702491223000000&limit=1&service=reviews.bookinfo&start=1702491217000000
2023-12-13T18:13:50Z INF [2023-12-13 18:12:46 - 2023-12-13 18:12:52] http://localhost:16686/jaeger/api/traces?end=1702491172000000&limit=1&service=reviews.bookinfo&start=1702491166000000
2023-12-13T18:13:50Z INF [2023-12-13 18:13:30 - 2023-12-13 18:13:37] http://localhost:16686/jaeger/api/traces?end=1702491217000000&limit=1&service=reviews.bookinfo&start=1702491210000000
2023-12-13T18:13:50Z INF [2023-12-13 18:12:52 - 2023-12-13 18:12:58] http://localhost:16686/jaeger/api/traces?end=1702491178000000&limit=1&service=reviews.bookinfo&start=1702491172000000
2023-12-13T18:13:50Z INF [2023-12-13 18:13:43 - 2023-12-13 18:13:50] http://localhost:16686/jaeger/api/traces?end=1702491230000000&limit=1&service=reviews.bookinfo&start=1702491223000000
2023-12-13T18:13:50Z INF [2023-12-13 18:13:24 - 2023-12-13 18:13:30] http://localhost:16686/jaeger/api/traces?end=1702491210000000&limit=1&service=reviews.bookinfo&start=1702491204000000
2023-12-13T18:13:50Z INF [2023-12-13 18:13:05 - 2023-12-13 18:13:11] http://localhost:16686/jaeger/api/traces?end=1702491191000000&limit=1&service=reviews.bookinfo&start=1702491185000000

https://github.com/kiali/kiali/blob/master/business/tracing.go#L141

The limit is 1; I'm not sure how optimal this code is in this example (the time range is just 1m). For a large time range it probably helps to get a wider spread of traces, based on the number of results.

It also looks like the queries are duplicated with the cluster tag.
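
For illustration, this is roughly the pattern visible in the log above (hypothetical code, not the actual implementation behind business/tracing.go#L141): one limit=15 query over the whole interval plus one limit=1 query per sub-interval, all issued concurrently:

```go
// Sketch of the fan-out pattern shown in the log (hypothetical, not Kiali's code):
// the selected interval is split into a fixed number of sub-intervals, and one
// limit=1 search is fired concurrently per sub-interval, in addition to a
// limit=15 search over the whole range.
package main

import (
	"fmt"
	"sync"
	"time"
)

const buckets = 10 // fixed fan-out, matching the value discussed in this issue

func queryTraces(start, end time.Time, limit int) {
	// Placeholder for the real HTTP call to /jaeger/api/traces or /api/search.
	fmt.Printf("query start=%d end=%d limit=%d\n", start.Unix(), end.Unix(), limit)
}

func fetchSpread(start, end time.Time) {
	var wg sync.WaitGroup

	// One query over the full interval.
	wg.Add(1)
	go func() {
		defer wg.Done()
		queryTraces(start, end, 15)
	}()

	// Plus one limit=1 query per sub-interval, all concurrent.
	step := end.Sub(start) / buckets
	for i := 0; i < buckets; i++ {
		subStart := start.Add(time.Duration(i) * step)
		subEnd := subStart.Add(step)
		wg.Add(1)
		go func(s, e time.Time) {
			defer wg.Done()
			queryTraces(s, e, 1)
		}(subStart, subEnd)
	}
	wg.Wait()
}

func main() {
	end := time.Now()
	fetchSpread(end.Add(-time.Minute), end)
}
```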


@josunect
Contributor Author

josunect commented Dec 14, 2023

For Tempo, there are many requests, as it has an API to check the tags, and it is possible to get this information in advance. Anyway, this is an example of the resources:

  resources:
    total:
      limits:
        memory: 1Gi
        cpu: 2000m

11 queries are run concurrently before the pod dies (this is one click on a node in the graph plus the Traces tab):

2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550070&limit=15&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550004
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550070&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550063
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550017&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550010
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550043&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550037
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550057&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550050
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550037&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550030
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550030&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550023
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550010&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550004
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550023&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550017
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550050&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550043
2023-12-14T10:34:30Z INF http://localhost:3200/api/search?end=1702550063&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550057
2023-12-14T10:34:31Z ERR Tempo API query error: Get "http://localhost:3200/api/search?end=1702550023&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550017": EOF [code: 0, URL: http://localhost:3200/api/search?end=1702550023&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550017]
2023-12-14T10:34:31Z ERR Tempo API query error: Get "http://localhost:3200/api/search?end=1702550037&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550030": EOF [code: 0, URL: http://localhost:3200/api/search?end=1702550037&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550030]
2023-12-14T10:34:31Z ERR Tempo API query error: Get "http://localhost:3200/api/search?end=1702550017&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550010": EOF [code: 0, URL: http://localhost:3200/api/search?end=1702550017&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550010]
2023-12-14T10:34:31Z ERR Tempo API query error: Get "http://localhost:3200/api/search?end=1702550030&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550023": EOF [code: 0, URL: http://localhost:3200/api/search?end=1702550030&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550023]
2023-12-14T10:34:31Z ERR Tempo API query error: Get "http://localhost:3200/api/search?end=1702550010&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550004": EOF [code: 0, URL: http://localhost:3200/api/search?end=1702550010&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550004]
2023-12-14T10:34:31Z ERR Tempo API query error: Get "http://localhost:3200/api/search?end=1702550057&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550050": EOF [code: 0, URL: http://localhost:3200/api/search?end=1702550057&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550050]
2023-12-14T10:34:31Z ERR Traces split & join failed: Get "http://localhost:3200/api/search?end=1702550057&limit=1&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550050": EOF
2023-12-14T10:34:40Z INF http://localhost:3200/api/search?end=1702550080&limit=100&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550020
2023-12-14T10:34:41Z ERR Tempo API query error: Get "http://localhost:3200/api/search?end=1702550080&limit=100&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550020": EOF [code: 0, URL: http://localhost:3200/api/search?end=1702550080&limit=100&q=%7B+.service.name+%3D+%22reviews.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702550020]

[screenshot]

The same happens when Tempo has collected many traces, with a single query:

"http://localhost:3200/api/search?end=1702551729&limit=100&q=%7B+.service.name+%3D+%22productpage.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702551669"

[screenshot]

Increasing the resources to the following:

  resources:
    total:
      limits:
        memory: 2Gi
        cpu: 4000m

There are still some issues (increasing the range to the last 3h, the pod is killed by the OOM killer).
The system is more stable with this config:

  resources:
    total:
      limits:
        memory: 4Gi
        cpu: 8000m

But that still looks like a large amount of resources for a dev environment.

cURL query: curl -G -s http://localhost:3200/api/search --data-urlencode 'q={ .service.name = "productpage.bookinfo"} && { } | select("status", ".service_name", ".node_id", ".component", ".upstream_cluster", ".http.method", ".response_flags")' --data-urlencode 'spss=10' --data-urlencode 'limit=100' --data-urlencode 'start=1701948096' --data-urlencode 'end=1702552896' | jq

@josunect
Contributor Author

  • It looks like spss=10 (what makes it possible to build the heatmap and the bubble size) has a high cost. Maybe we could add an option to hide the heatmap in the tooltip to improve performance (and this could be configurable, to be enabled when the system is provided with more resources) - a rough sketch of the idea is below.
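
A hypothetical sketch of that option (the flag and helper names are invented, not existing Kiali configuration): build the same kind of Tempo search URL that appears in the logs, but drop the spans-per-spanset (spss) parameter when heatmap data is not needed:

```go
// Hypothetical sketch (invented names, not existing Kiali code or configuration):
// construct the Tempo /api/search URL with or without the spss parameter.
package main

import (
	"fmt"
	"net/url"
)

func buildTempoSearchURL(service string, start, end int64, includeHeatmapData bool) string {
	// TraceQL query shaped like the ones logged in this issue.
	q := fmt.Sprintf(`{ .service.name = "%s" } && { } | select(status, .service_name, .node_id, .component, .upstream_cluster, .http.method, .response_flags)`, service)

	params := url.Values{}
	params.Set("q", q)
	params.Set("limit", "100")
	params.Set("start", fmt.Sprint(start))
	params.Set("end", fmt.Sprint(end))
	if includeHeatmapData {
		// spss=10 asks Tempo for up to 10 spans per span set, which this comment
		// identifies as costly; leaving it out falls back to Tempo's lower default.
		params.Set("spss", "10")
	}
	return "http://localhost:3200/api/search?" + params.Encode()
}

func main() {
	fmt.Println(buildTempoSearchURL("reviews.bookinfo", 1702550004, 1702550070, true))
	fmt.Println(buildTempoSearchURL("reviews.bookinfo", 1702550004, 1702550070, false))
}
```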

@jshaughn
Collaborator

> I was reviewing the performance, and I see (from the graph view) that there are many concurrent calls to the Tracing API with just one click on the graph:

Sorry, I was not keeping up with this discussion. So, one request from the graph results in many concurrent backend calls to the Tempo API? What determines the number of concurrent calls, the number of spans? And are you saying that if the number of goroutines is too high, the pod may crash?

@josunect
Contributor Author

> > I was reviewing the performance, and I see (from the graph view) that there are many concurrent calls to the Tracing API with just one click on the graph:
>
> Sorry, I was not keeping up with this discussion. So, one request from the graph results in many concurrent backend calls to the Tempo API? What determines the number of concurrent calls, the number of spans? And are you saying that if the number of goroutines is too high, the pod may crash?

Yes, the Tempo pod might crash, killed by the OOM killer. And that is right: one request from the graph might result in 10 concurrent calls to the backend, to spread the query through the time interval and get a sample from different times: https://github.com/kiali/kiali/blob/master/business/tracing.go#L237 (for example, if we want 15 traces from the last hour, it would return a trace for each minute instead of returning all the traces from the last 10 minutes).

This is done for Tempo and Jaeger, but I haven't seen any issue with Jaeger.

The number of concurrent calls is a fixed value in the function (https://github.com/kiali/kiali/blob/master/business/tracing.go#L237); it is set to 10.

The number of spans is independent of the concurrent calls, but it is another point of improvement in the Tempo query. Jaeger returns all the spans, while Tempo limits the number of spans to 3 by default, because returning them has a higher cost.

@josunect
Contributor Author

  • Jaeger query last 1m:
    curl 'http://localhost:16686/jaeger/api/traces?end=1703177208000000&limit=100&service=productpage.bookinfo&start=1703177199000000' -w 'Total: %{time_total}s\n'
    Total: 0.066522s

  • Jaeger query last 7d:
    curl 'http://localhost:16686/jaeger/api/traces?end=1703177676000000&limit=100&service=productpage.bookinfo&start=1702572876000000' -w 'Total: %{time_total}s\n'
    Total: 0.167162s

  • Jaeger query one trace:
    curl 'http://localhost:16686/jaeger/api/traces/e3b70ccde315af7a772aaa41b786b368' -w 'Total: %{time_total}s\n'
    Total: 0.004772s

  • Tempo query last 1m:
    curl -G -s http://localhost:3200/api/search --data-urlencode 'q={ .service.name = "productpage.bookinfo"} && { } | select("status", ".service_name", ".node_id", ".component", ".upstream_cluster", ".http.method", ".response_flags")' --data-urlencode 'spss=10' --data-urlencode 'limit=100' --data-urlencode 'start=1703179679' --data-urlencode 'end=1703179696' -w 'Total: %{time_total}s\n'
    Total: 0.054215s

  • Tempo query last 7d:
    curl 'http://localhost:3200/api/search?end=1703179433&limit=100&q=%7B+.service.name+%3D+%22productpage.bookinfo%22++%7D+%26%26+%7B++%7D+%7C+select%28status%2C+.service_name%2C+.node_id%2C+.component%2C+.upstream_cluster%2C+.http.method%2C+.response_flags%29&spss=10&start=1702574633' -w 'Total: %{time_total}s\n'
    Total: 0.314332s

  • Tempo query one trace:
    curl 'http://localhost:3200/api/traces/b483155fb47d1d17627c8ee102daea67' -w 'Total: %{time_total}s\n'
    Total: 0.010129s

@josunect
Contributor Author

josunect commented Dec 22, 2023

Based on these results:

  • I think the performance of the Tempo API can be similar to Jaeger's, and there is some overhead in processing every trace for the transformation. I don't think it is a lot, but as a future improvement, Kiali could be changed to work with the OTEL format (it might be a big change, I think).
  • The main issue with Tempo is that the pod crashes (killed by the OOM killer). The resources have been increased to keep the system stable, but it still seems like a lot (that was reported).
  • Some improvements could be, based on a configuration option, hiding the heat map in the tooltip and performing a different query:
    count() > 0 | { resource.service.name="<some svc>" } | avg(duration)

@josunect
Contributor Author

josunect commented Jan 4, 2024

Based on this analysis, I think this issue can be closed, highlighting some points of improvement in case they are needed:

  • Improve the search query used by Kiali (this will require some research)
  • Use traces in OTEL format in Kiali (backend and frontend changes)
  • Performance of the Tempo operator (decrease the resources used)


josunect closed this as completed Jan 4, 2024
