Describe the bug
We use jaeger-query version 1.34.1 with the elasticsearch storage (an OpenSearch cluster in our case). Both the EC2 instances running jaeger-query and the OpenSearch cluster live in the same VPC, and we use a VPC endpoint for the --es.server-urls parameter.
This setup works as expected when we search for small traces, but Jaeger returns a 500 Internal Server Error after a 20-second timeout for big traces (please find the screenshot below). At the same time, we set es.timeout to 120 seconds, but it looks like Jaeger does not apply this setting.
The logs contain the following error:
Here are the jaeger-query parameters which we use:
To Reproduce
Steps to reproduce the behavior:
1. Create a query that matches a large number of documents over a long time range (depends on the number of documents available for search)
2. The Jaeger UI returns a 500 Internal Server Error after 20 seconds
Expected behavior
Jaeger should not fail queries that take longer than 20 seconds. Ideally, it should apply the es.timeout setting
Screenshots
Version (please complete the following information):
OS: Linux (RHEL 7)
Jaeger version: v1.34.1
Deployment: AWS EC2 instances with elasticsearch storage (OpenSearch cluster). Both the EC2 instances running jaeger-query and the OpenSearch cluster live in the same VPC, and we use a VPC endpoint for the --es.server-urls parameter.
What troubleshooting steps did you try?
Tried the debug logging level
Changing the timeout using --es.timeout (even tried 0s, which means no timeout)
Checked for any additional timeouts (which might be set on the ALB, the OpenSearch cluster, etc.)
Additional context
The problem is more frequent in the production environment, where we have more documents stored in Elasticsearch, so we hit the 20s limit even for short-period queries. At the same time, fetching individual traces works as expected, as does every other query that takes less than 20 seconds.
Based on the error message, the Context passed from the HTTP server down to the query/storage layer is getting cancelled while the request is being executed. To my knowledge, http.Server does not have a timeout setting for how long the handler runs; instead, there are other reasons why the context may be cancelled, one of them being that the client connection is closed.
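For illustration, here is a minimal Go sketch of this behavior (not Jaeger's actual handler code; the route, port, and durations are made up): a handler blocked on a slow backend sees its request context cancelled as soon as the caller disconnects, even though http.Server itself never times it out.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// slowHandler simulates a long-running storage query. http.Server puts no
// limit on how long a handler runs, but the request's Context is cancelled
// if the client (or an intermediate proxy) closes the connection first.
func slowHandler(w http.ResponseWriter, r *http.Request) {
	select {
	case <-time.After(30 * time.Second): // stand-in for a slow ES/OpenSearch query
		fmt.Fprintln(w, "done")
	case <-r.Context().Done():
		// Fires when the caller gives up, e.g. a proxy with a 20s request timeout.
		fmt.Println("context cancelled:", r.Context().Err())
	}
}

func main() {
	http.HandleFunc("/slow", slowHandler)
	http.ListenAndServe(":8080", nil)
}
```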
Is it possible that the UI is accessing the query service via some kind of proxy that enforces the 20-second timeout? What happens if you curl the query service directly with the same API query?
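Equivalent to the curl check, a minimal Go client with no client-side timeout works too (the host, service name, and limit below are placeholders, not this deployment's actual values):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Call the jaeger-query HTTP API directly (default port 16686), bypassing
	// anything that might sit in front of it. Timeout: 0 means this client
	// never gives up, so only a server- or network-side limit can end the call.
	client := &http.Client{Timeout: 0}
	resp, err := client.Get("http://jaeger-query-host:16686/api/traces?service=my-service&limit=2000")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes")
}
```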
Hi @yurishkuro
Thank you for the quick response.
Yes, you are right, there is a proxy server, which I didn't know about, that sits between Jaeger and the elasticsearch cluster. It uses the Tornado web server, which has a request_timeout of 20 seconds by default.
So there are no issues on the Jaeger side, and this issue may be closed.
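For completeness, a minimal Go sketch of how such a proxy-side limit produces the observed cancellation (this is a generic reverse proxy for illustration, not Tornado, and the upstream endpoint is a placeholder):

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Placeholder upstream: whatever the proxy fronts (here, a storage endpoint).
	target, _ := url.Parse("http://opensearch-endpoint:9200")
	proxy := httputil.NewSingleHostReverseProxy(target)
	// http.TimeoutHandler cancels the inner request's context after 20s, so the
	// upstream call is aborted no matter what es.timeout Jaeger was given.
	http.ListenAndServe(":8081", http.TimeoutHandler(proxy, 20*time.Second, "proxy timeout"))
}
```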